Information Integration Research Group


ISI's Information Integration (II) research group combines artificial intelligence, the semantic web, and database integration techniques to solve complex information integration problems. We leverage general research techniques across information-intensive disciplines, including medical informatics, geospatial data integration and the social Web.

The II group is currently pursuing automatic discovery and semantic modeling of online information sources, interactive data integration, information mediators for bioinformatics, and learning from online social networks, among other topics. Our work focuses on solving real-world problems to bridge research and relevant applications, such as integrating and linking cultural heritage data, analyzing social media data, and integrating genetic, pathology and other data with clinical assessment to support clinical trials and medical research.

Our work has included development of:




Technologies for building domain-specific knowledge graphs


A data integration tool


Learning about art by building multimedia stories

Semantic Modeling

Automatically building semantic descriptions of sources

Entity Extractors

Easily extract features from free text using Amazon Mechanical Turk


Extract information to find relationships among 500,000 private and public firms having a web-based presence



Integrate data from multiple sources, provide a common operating picture, and issue alerts for large air operations


Predicting Cyber Attacks

American Art Collaborative

Creating linked data for cultural heritage data from American art museums




Software: Karma

Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs.

MapFinder: Harvesting maps on the Web

Maps are one of the most valuable documents for gathering geospatial information about a region. We use a Content Based Image Retrieval (CBIR) technique to build an accurate and scalable system, MapFinder, that can discover standalone images as well as images embedded within documents on the Web that are maps. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or nonmap. We also provide the data we collected for our experiments.

ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on Web

The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach to exploiting reference sets for this extraction. The Phoebus system presents a machine learning approach exploiting reference sets.

BSL: A system for learning blocking schemes

Record linkage is the problem of determining the matches between two data sources. As data sources grow larger, this task becomes difficult and expensive. Blocking aids this process by efficiently generating candidate matches, which can then be examined in detail to determine whether or not they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable.
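As a rough illustration of the blocking idea, the sketch below groups records from two sources by a shared key and only pairs up records that land in the same block. The key used here (first three letters of the last name plus zip code) and the field names `last` and `zip` are hypothetical and hand-written for the example; BSL learns blocking schemes from data, and this sketch does not reproduce that learning step.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    # Hypothetical, hand-written blocking scheme (not a learned one):
    # first three letters of the last name, lowercased, plus the zip code.
    return (record["last"][:3].lower(), record["zip"])


def candidate_pairs(source_a, source_b, key=blocking_key):
    """Return index pairs (i, j) whose records share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for i, rec in enumerate(source_a):
        blocks[key(rec)][0].append(i)
    for j, rec in enumerate(source_b):
        blocks[key(rec)][1].append(j)

    pairs = []
    for left, right in blocks.values():
        # Only records inside the same block are compared later.
        pairs.extend(product(left, right))
    return sorted(pairs)


source_a = [{"last": "Smith", "zip": "90292"}, {"last": "Jones", "zip": "10001"}]
source_b = [
    {"last": "Smithe", "zip": "90292"},
    {"last": "Brown", "zip": "90292"},
    {"last": "jones", "zip": "10001"},
]
pairs = candidate_pairs(source_a, source_b)
```

In this toy input, blocking leaves two candidate pairs to examine in detail instead of all six possible pairs.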

EIDOS: Efficiently Inducing Definitions for Online Sources

The Internet is full of information sources providing various types of data, from weather forecasts to travel deals. These sources can be accessed via web forms, Web Services or RSS feeds. To make automated use of these sources, one first needs to model them semantically. Writing semantic descriptions for web sources by hand is both tedious and error-prone.

Digg 2009

This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the ID of each voter and the time stamp of the vote. In addition, data about the friendship links of voters was collected from Digg.

Twitter 2010

This data set contains information about URLs that were tweeted over a 3-week period in the fall of 2010. In addition to tweets, we also collected the followee links of tweeting users, allowing us to reconstruct the follower graph of (tweeting) users.
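Reconstructing a follower graph from followee links amounts to inverting the "who follows whom" mapping. The sketch below shows that inversion on a tiny in-memory dictionary; the input layout (a dict from user to the set of users they follow) and the user names are assumptions made for illustration and say nothing about the actual file format of the data set.

```python
from collections import defaultdict


def follower_graph(followees):
    """Invert a user -> followed-users mapping into user -> followers."""
    followers = defaultdict(set)
    for user, followed in followees.items():
        for target in followed:
            # If `user` follows `target`, then `user` is a follower of `target`.
            followers[target].add(user)
    return dict(followers)


# Hypothetical followee links for three users.
followees = {"a": {"b", "c"}, "b": {"c"}}
followers = follower_graph(followees)
```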

Flickr personal taxonomies

This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) in which collections contain sets (aka photo albums) and other collections.

Wrapper maintenance

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.


Social network analysis methods examine the topology of a network in order to identify its structure, for example, who the important nodes are. Centrality, however, depends on both network topology (or social links) and the dynamical processes (or flow) taking place on the network, which determine how ideas, pathogens, or influence flow along social links. Matlab code is available for calculating random walk-based centrality (PageRank) and epidemic diffusion-based centrality (given by Bonacich's Alpha-Centrality).
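The group's Matlab code is not reproduced here. As a minimal sketch of the two measures, the Python below computes PageRank by power iteration and Alpha-Centrality by iterating the textbook form x = alpha * A^T x + e with e set to all ones. The damping factor, the value of alpha, the iteration counts and the star-shaped example graph are assumptions chosen for illustration; the iteration for Alpha-Centrality only converges when alpha is below the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Random walk-based centrality via power iteration on an adjacency matrix."""
    n = len(adj)
    rank = [1.0 / n] * n
    out_deg = [sum(row) for row in adj]
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_deg[i] == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if adj[i][j]:
                        new[j] += damping * rank[i] * adj[i][j] / out_deg[i]
        rank = new
    return rank


def alpha_centrality(adj, alpha=0.1, iters=200):
    """Iterate x = alpha * A^T x + e, with e a vector of ones."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        x = [1.0 + alpha * sum(adj[i][j] * x[i] for i in range(n)) for j in range(n)]
    return x


# Undirected star: node 0 is linked to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
pr = pagerank(star)
ac = alpha_centrality(star)
```

On the star graph both measures rank the hub above the leaves; on less symmetric networks the two can disagree, since they model different flows along the same links.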