Information Integration Research Group

Data Extraction: Extraction from Unstructured, Ungrammatical Data Sources

The wrapper methods provide extraction techniques for semi-structured sources, such as similar-looking Web pages, but much of the data on the World Wide Web exists in an unstructured and ungrammatical form. For example, a user selling an item on eBay will not necessarily use the same textual layout as other posts about the same item. Nor will this user necessarily conform to the grammatical rules of the language in the post (listings often contain no verbs at all). This lack of structure and grammar makes it difficult to develop wrapper methods for extracting this type of data. We call this type of unstructured, ungrammatical data "posts"; it covers a large segment of Web data, such as auction listings, classified listings, and forum postings.

To overcome this lack of structural and grammatical characteristics, we infuse the extraction process with outside knowledge, which we call reference sets. A reference set is a collection of entities and their associated attributes. For example, a reference set of cars would include all known car makes, models, trim options, and years. A reference set could come from a set of pages on the Web, a database, a knowledge base and ontology from the Semantic Web, or it could even be built directly from the posts themselves (see our IJCAI 2009 paper).
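
As a rough illustration of what a reference set looks like as data, the sketch below stores car entities as records with named attributes. The class name CarRecord, the helper known_values, and the three sample records are assumptions made up for this example; they are not taken from the papers or the released software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarRecord:
    """One entity in a hypothetical car reference set."""
    make: str
    model: str
    trim: str
    year: int


# Made-up sample records; a real reference set would hold all known cars.
REFERENCE_SET = [
    CarRecord("Honda", "Civic", "EX", 2005),
    CarRecord("Honda", "Accord", "LX", 2006),
    CarRecord("Toyota", "Camry", "LE", 2005),
]


def known_values(reference_set, attribute):
    """Return the distinct values one attribute takes across the reference set."""
    return sorted({getattr(record, attribute) for record in reference_set})


print(known_values(REFERENCE_SET, "make"))
print(known_values(REFERENCE_SET, "year"))
```

Keeping each entity as one record, rather than as separate lists of makes and models, preserves which attribute values belong together, which is what lets a matched record supply clues for every attribute at once.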

Exploiting these reference sets is a two-step process. First, we take each listing for extraction, called the post, and match it to a member of the reference set. The matching member supplies a set of clues about what the system might extract from the post. Second, we extract the items from the post that are most similar to the attributes of the matching reference set member. This two-step process can be done using machine learning methods, which yield highly accurate extractions but at the cost of training the system (JAIR 2008 paper), or using automatic methods, which are slightly less accurate but do not require a human to train the system (IJDAR 2007 paper).
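
The two steps can be sketched in a few lines. This is a deliberately simplified stand-in, not the method from either paper: it scores records by exact token overlap instead of a learned or automatic similarity measure. The sample post, the reference records, and the function names (tokens, match, extract) are all hypothetical.

```python
# Hypothetical reference set: each record maps attribute name -> value.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic", "trim": "EX"},
    {"make": "Honda", "model": "Accord", "trim": "LX"},
    {"make": "Toyota", "model": "Camry", "trim": "LE"},
]


def tokens(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


def match(post, reference_set):
    """Step 1: pick the record that shares the most tokens with the post."""
    post_tokens = set(tokens(post))

    def overlap(record):
        record_tokens = {t for value in record.values() for t in tokens(value)}
        return len(post_tokens & record_tokens)

    return max(reference_set, key=overlap)


def extract(post, record):
    """Step 2: per attribute, keep the post tokens that occur in its value."""
    extracted = {}
    for attribute, value in record.items():
        wanted = set(tokens(value))
        found = [t for t in post.split() if t.lower() in wanted]
        if found:
            extracted[attribute] = " ".join(found)
    return extracted


# Made-up post with no verbs and no fixed layout.
post = "civic ex 05 low miles great cond $5000"
best = match(post, REFERENCE_SET)
print(best)
print(extract(post, best))
```

In this sketch the post never mentions the make, so nothing is extracted for it, yet the matched record still identifies the car as a Honda. Exact token overlap fails as soon as posts use abbreviations or misspellings, which is why a real system needs a softer notion of similarity.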

To make this type of extraction clear, consider the example posts from the website www.BiddingForTravel.com, shown in the table Example Posts. These posts contain information that users typed in about hotels, including useful attributes such as the hotel name and the area of the hotel. Once we extract these attributes, we can query the data set in a structured way and derive useful conclusions. Now, suppose we have a reference set such as the one shown in the table Reference Set of Hotels, which contains the hotel name and the hotel area. As stated above, the first step is to find the record in this reference set that best matches each post, which in turn yields the clues for extracting the attributes. To see the best matching record from the reference set, click on a post in the Example Posts table. This will highlight the best matching member of the reference set in the Reference Set of Hotels table and show the extracted results in the table Extracted Information.

Example Posts
Post from www.BiddingForTravel.com
$25 winning bid at holiday inn sel. univ. ctr.
4* Hyatt DT 8/18 $40 1 nite
Hol. Inn Greentree, $40 2/1

Reference Set of Hotels
Hotel Name	Hotel Area
Holiday Inn	Greentree
Holiday Inn Select	University Center
Hyatt Regency	Downtown

Extracted Information
Extracted Hotel Name	Extracted Hotel Area
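
To make the walk-through concrete, here is a minimal sketch that runs both steps on the three example posts and the hotel reference set from the tables above. The similarity rule is an assumption chosen for illustration, not the measure used in the JAIR 2008 or IJDAR 2007 papers: a post token counts as matching a reference token when it starts with the same letter and its letters appear in order in the reference token (so "ctr" matches "Center" and "DT" matches "Downtown"). The function names are hypothetical.

```python
# Data copied from the Example Posts and Reference Set of Hotels tables.
POSTS = [
    "$25 winning bid at holiday inn sel. univ. ctr.",
    "4* Hyatt DT 8/18 $40 1 nite",
    "Hol. Inn Greentree, $40 2/1",
]
REFERENCE_SET = [
    {"Hotel Name": "Holiday Inn", "Hotel Area": "Greentree"},
    {"Hotel Name": "Holiday Inn Select", "Hotel Area": "University Center"},
    {"Hotel Name": "Hyatt Regency", "Hotel Area": "Downtown"},
]


def clean(token):
    """Lowercase a token and drop everything except letters and digits."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def abbreviates(short, full):
    """Assumed heuristic: same first letter, and short's letters occur in order in full."""
    if len(short) < 2 or short[0] != full[0]:
        return False
    remaining = iter(full)
    # 'ch in remaining' consumes the iterator, so letter order is enforced.
    return all(ch in remaining for ch in short)


def matching_tokens(post, value):
    """Post tokens (in post order) that abbreviate some token of an attribute value."""
    targets = [clean(t) for t in value.split()]
    return [raw.strip(",") for raw in post.split()
            if any(abbreviates(clean(raw), target) for target in targets)]


def best_match(post, reference_set):
    """Step 1: the record whose attribute values are matched by the most post tokens."""
    def score(record):
        return sum(len(matching_tokens(post, value)) for value in record.values())
    return max(reference_set, key=score)


def extract(post, record):
    """Step 2: per attribute, the post tokens most similar to the matched value."""
    return {"Extracted " + attribute: " ".join(matching_tokens(post, value))
            for attribute, value in record.items()}


for post in POSTS:
    record = best_match(post, REFERENCE_SET)
    print(post)
    print("  best match:", record["Hotel Name"], "/", record["Hotel Area"])
    print("  extracted: ", extract(post, record))
```

Under this heuristic the third post matches Holiday Inn / Greentree and yields "Hol. Inn" as the hotel name and "Greentree" as the area. The rule is brittle (a short token can match unrelated words that share its letters), so it should be read as a demonstration of the two-step idea rather than a usable matcher.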

We have released software implementations of the methods described in the JAIR 2008 and IJDAR 2007 papers. This software is open source and freely available, and you can obtain a copy here. We have also released experimental data related to this task, which you can obtain here.