Information Integration Research Group


ISI's Information Integration (II) research group combines artificial intelligence, the semantic web, and database integration techniques to solve complex information integration problems. We leverage general research techniques across information-intensive disciplines, including medical informatics, geospatial data integration and the social Web.

The II group is currently pursuing automatic discovery and semantic modeling of online information sources, interactive data integration, information mediators for bioinformatics, and learning from online social networks, among other topics. Our work focuses on solving real-world problems to bridge research and relevant applications, such as integrating and linking cultural heritage data, analyzing social media data, and integrating genetic, pathology and other data with clinical assessments to support clinical trials and medical research.

Our work has included development of:




Technologies for building domain-specific knowledge graphs


A data integration tool


Learning about art by building multimedia stories

Semantic Modeling

Automatically building semantic descriptions of sources

Entity Extractors

Easily extract features from free text using Amazon Mechanical Turk


Extract information to find relationships among 500,000 private and public firms


Automatically adapting to changes and failures in sensors


Integrate data from multiple sources, provide a common operating picture, and issue alerts for large air operations


Predicting cyberattacks by mining online sources

American Art Collaborative

Creating linked data for cultural heritage data from American art museums


Text-enabled Humanitarian Operations in Real-time



Software: Karma

Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs.

MapFinder: Harvesting maps on the Web

Maps are among the most valuable documents for gathering geospatial information about a region. We use a Content-Based Image Retrieval (CBIR) technique to build an accurate and scalable system, MapFinder, that can discover maps on the Web, both as standalone images and as images embedded within documents. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or a nonmap. We also provide the data we collected for our experiments.

ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web

The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach to exploiting reference sets for this extraction. The Phoebus system presents a machine learning approach exploiting reference sets.

BSL: A system for learning blocking schemes

Record linkage is the problem of determining the matches between two data sources. However, as data sources become larger and larger, this task becomes difficult and expensive. To aid in this process, blocking is the efficient generation of candidate matches which can then be examined in detail later to determine whether or not they are true matches. So, blocking is a preprocessing step to make record linkage a more scalable process.
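As a rough illustration of blocking as a preprocessing step, the sketch below groups records by a blocking key and only pairs records that share a key. The blocking scheme here (first three letters of the last name plus the zip code) is hand-picked and hypothetical, as are the field names and sample records; BSL learns such schemes, which this sketch does not attempt.

```python
from collections import defaultdict


def block_key(record):
    """Hypothetical blocking scheme: last-name prefix plus zip code."""
    return (record["last"][:3].lower(), record["zip"])


def candidate_pairs(source_a, source_b, key=block_key):
    """Return ID pairs of records from the two sources that share a blocking key."""
    # Index the second source by blocking key.
    blocks = defaultdict(list)
    for rec in source_b:
        blocks[key(rec)].append(rec)

    # Compare each record in the first source only against its own block.
    pairs = []
    for rec_a in source_a:
        for rec_b in blocks.get(key(rec_a), []):
            pairs.append((rec_a["id"], rec_b["id"]))
    return pairs


# Made-up sample records, for illustration only.
source_a = [
    {"id": "a1", "last": "Smith", "zip": "90292"},
    {"id": "a2", "last": "Jones", "zip": "90001"},
]
source_b = [
    {"id": "b1", "last": "Smithe", "zip": "90292"},
    {"id": "b2", "last": "Jonas", "zip": "90001"},
    {"id": "b3", "last": "Smith", "zip": "10001"},
]

pairs = candidate_pairs(source_a, source_b)
print(pairs)
```

With these sample records, only two of the six possible cross-source pairs survive blocking; a detailed matcher would then decide whether each surviving pair is a true match.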

EIDOS: Efficiently Inducing Definitions for Online Sources

The Internet is full of information sources providing various types of data from weather forecasts to travel deals. These sources can be accessed via web-forms, Web Services or RSS feeds. In order to make automated use of these sources, one needs to first model them semantically. Writing semantic descriptions for web sources is both tedious and error prone.

Digg 2009

This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the ID of each voter and the time stamp of the vote. In addition, data about the friendship links of voters was collected from Digg.

Twitter 2010

This data set contains information about URLs that were tweeted over a 3-week period in the fall of 2010. In addition to tweets, we also collected the followee links of tweeting users, allowing us to reconstruct the follower graph of (tweeting) users.

Flickr personal taxonomies

This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) containing collections and their constituent sets (aka photo-albums) and collections.

Wrapper maintenance

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.


Social network analysis methods examine the topology of a network in order to identify its structure, for example, which nodes are important. Centrality, however, depends on both the network topology (the social links) and the dynamical process (the flow) taking place on the network, which determines how ideas, pathogens, or influence spread along social links. MATLAB code is available for calculating random walk-based centrality (PageRank) and epidemic diffusion-based centrality (given by Bonacich's Alpha-Centrality).
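To make the contrast between the two flow models concrete, here is a minimal Python sketch (not the group's MATLAB code) of both measures on a small made-up directed graph. PageRank is computed by power iteration of a random walk with teleportation; alpha-centrality is computed by iterating x = alpha * A^T x + e with e set to all ones. The function names, the graph, the damping factor of 0.85, and alpha = 0.1 are all assumptions chosen for illustration; the alpha-centrality iteration converges only when alpha is below the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Random walk-based centrality via power iteration.

    adj maps every node to a list of the nodes it links to.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation term shared by all nodes.
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = adj[u] if adj[u] else nodes  # dangling nodes spread uniformly
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank


def alpha_centrality(adj, alpha=0.1, iters=100):
    """Bonacich's alpha-centrality: iterate x = alpha * A^T x + e, with e = 1."""
    nodes = list(adj)
    score = {v: 1.0 for v in nodes}
    for _ in range(iters):
        new = {v: 1.0 for v in nodes}  # exogenous term e
        for u in nodes:
            for v in adj[u]:
                new[v] += alpha * score[u]  # u passes influence to each successor
        score = new
    return score


# Made-up directed graph: an edge u -> v means u links to v.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

pr = pagerank(graph)
ac = alpha_centrality(graph)
print(sorted(pr, key=pr.get, reverse=True))
print(sorted(ac, key=ac.get, reverse=True))
```

In a random walk the flow is conserved, so a node splits its score among its successors, while in the epidemic-style model each node passes a full copy of its (attenuated) score along every outgoing link. On larger graphs the two measures can therefore rank nodes differently.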


KDD 2017 Tutorial: Data mining in unusual domains with information-rich knowledge graph construction, inference and search

The growth of the Web is a success story that has spurred much research in knowledge discovery and data mining. Data mining over Web domains that are unusual is an even harder problem. There are several factors that make a domain unusual. In particular, such domains have significant long tails and exhibit concept drift, and are characterized by high levels of heterogeneity. Notable examples of unusual Web domains include both illicit domains, such as human trafficking advertising, illegal weapons sales, counterfeit goods transactions, patent trolling and cyberattacks, and also non-illicit domains such as humanitarian and disaster relief. Data mining in such domains has the potential for widespread social impact, and is also very challenging technically. In this tutorial, we provide an overview, using demos, examples and case studies, of the research landscape for data mining in unusual domains, including recent work that has achieved state-of-the-art results in constructing knowledge graphs in a variety of unusual domains, followed by inference and search using both command line and graphical interfaces.

ISWC 2017 Tutorial: Constructing Domain-specific Knowledge Graphs (KGC)

The vast amounts of ontologically unstructured information on the Web, including semi-structured HTML, XML and JSON documents, natural language documents, tweets, blogs, markups, and even structured documents like CSV tables, all contain useful knowledge that can present a tremendous advantage to Semantic Web researchers if extracted robustly, efficiently and semi-automatically as an RDF knowledge graph. Domain-specific Knowledge Graph Construction (KGC) is an active research area that has recently witnessed impressive advances due to machine learning techniques like deep neural networks and word embeddings. This tutorial will synthesize and present KGC techniques, especially information extraction (IE) in a manner that is accessible to Semantic Web researchers. The presenters of the tutorial will use their experience as instructors and Semantic Web researchers, as well as lessons from actual IE implementations, to accomplish this purpose through visually intuitive and example-driven slides, accessible, high-level overviews of related work, instructor demos, and at least five IE participatory activities that attendees will be able to set up on their laptops.


ISWC 2017

Hybrid Statistical Semantic Understanding and Emerging Semantics