Information Integration Research Group


ISI's Information Integration (II) research group combines artificial intelligence, the semantic web, and database integration techniques to solve complex information integration problems. We leverage general research techniques across information-intensive disciplines, including medical informatics, geospatial data integration and the social Web.

The II group is currently pursuing automatic discovery and semantic modeling of online information sources, interactive data integration, information mediators for bioinformatics, and learning from online social networks, among other research directions. Our work focuses on solving real-world problems that bridge research and relevant applications, such as integrating and linking cultural heritage data, analyzing social media data, and integrating genetic, pathology, and other data with clinical assessments to support clinical trials and medical research.

Our work has included development of:

Technologies for building domain-specific knowledge graphs


A data integration tool

Semantic Modeling

Automatically building semantic descriptions of sources


Extracting information to find relationships among 500,000 private and public firms


Automatically adapting to changes and failures in sensors


Integrating data from multiple sources, providing a common operating picture, and issuing alerts for large air operations


Predicting cyberattacks by mining online sources

American Art Collaborative

Creating linked data for cultural heritage data from American art museums


Text-enabled Humanitarian Operations in Real-time


A novel knowledge organization system that integrates concepts of causality, factual knowledge and meta-reasoning

Unsupervised Data Integration

Automatic integration of data for the City of Los Angeles

Linked Maps

Exploiting Context in Cartographic Evolutionary Documents to Extract and Build Linked Spatial-Temporal Datasets



Software: Karma

Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of sources, including databases, spreadsheets, delimited text files, XML, JSON, KML, and Web APIs.
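
To give a feel for the task, the sketch below (plain Python with rdflib, not Karma's own API or data model) builds the kind of source-to-ontology mapping that Karma lets users construct interactively; the column names and ontology terms are hypothetical.

```python
# A minimal sketch of mapping a delimited-text source to RDF under an
# ontology -- the kind of mapping Karma helps users build interactively.
# Column names and ontology terms are hypothetical.
import csv
import io

from rdflib import Graph, Literal, Namespace, RDF

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.org/museum/")

source = io.StringIO(        # stand-in for a delimited-text source
    "artist_id,name,born\n"
    "a1,Mary Cassatt,1844\n"
    "a2,Winslow Homer,1836\n"
)

g = Graph()
for row in csv.DictReader(source):
    artist = EX[row["artist_id"]]
    g.add((artist, RDF.type, SCHEMA.Person))            # map each row to a Person
    g.add((artist, SCHEMA.name, Literal(row["name"])))
    g.add((artist, SCHEMA.birthDate, Literal(row["born"])))

print(g.serialize(format="turtle"))
```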

MapFinder: Harvesting maps on the Web

Maps are one of the most valuable documents for gathering geospatial information about a region. We use a Content-Based Image Retrieval (CBIR) technique to build an accurate and scalable system, MapFinder, that can discover maps on the Web, whether they appear as standalone images or as images embedded within documents. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or non-map. We also provide the data we collected for our experiments.
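
The sketch below is illustrative only: WaterFilling features are not available in standard libraries, so it substitutes a plain grayscale-histogram feature and synthetic images to show the overall shape of the map vs. non-map classification step.

```python
# Illustrative map/non-map classification with a stand-in feature
# (not MapFinder's WaterFilling features). The "images" are synthetic.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def histogram_features(img, bins=32):
    """Grayscale histogram as a stand-in for WaterFilling features."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255), density=True)
    return hist

# Synthetic stand-ins: "maps" as mostly flat images with sparse dark line
# work, "non-maps" as natural-image-like noise.
maps = [np.where(rng.random((64, 64)) < 0.05, 0, 230) for _ in range(20)]
photos = [rng.integers(0, 256, (64, 64)) for _ in range(20)]

X = np.array([histogram_features(im) for im in maps + photos])
y = np.array([1] * 20 + [0] * 20)            # 1 = map, 0 = non-map

clf = LinearSVC().fit(X, y)
test = np.where(rng.random((64, 64)) < 0.05, 0, 230)   # unseen "map"
print(clf.predict([histogram_features(test)]))          # expected: [1]
```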

ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web

This project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum post titles. The ARX system is an automatic approach that exploits reference sets for this extraction. The Phoebus system is a machine learning approach that exploits reference sets.
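
The toy sketch below conveys the general idea of reference-set-based extraction rather than the ARX or Phoebus algorithms themselves: tokens of a noisy post are fuzzily matched against a small, hypothetical reference set to recover attribute values.

```python
# Toy reference-set extraction: recover attribute values from an
# ungrammatical post by fuzzy-matching its tokens against known values.
from difflib import SequenceMatcher

reference_set = [  # hypothetical reference set of (make, model) entries
    ("Honda", "Civic"), ("Toyota", "Corolla"), ("Ford", "Focus"),
]

def extract(post, threshold=0.8):
    tokens = post.lower().split()
    best = {}
    for make, model in reference_set:
        for attr, value in (("make", make), ("model", model)):
            for tok in tokens:
                score = SequenceMatcher(None, tok, value.lower()).ratio()
                if score >= threshold and score > best.get(attr, (0, ""))[0]:
                    best[attr] = (score, value)
    return {attr: value for attr, (score, value) in best.items()}

print(extract("93 hnda civic runs great obo"))
# {'make': 'Honda', 'model': 'Civic'}
```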

BSL: A system for learning blocking schemes

Record linkage is the problem of determining the matches between two data sources. As data sources grow larger, however, this task becomes difficult and expensive. To aid in this process, blocking efficiently generates candidate matches that can then be examined in detail to determine whether or not they are true matches; blocking is thus a preprocessing step that makes record linkage more scalable. The BSL system presented here performs blocking in the supervised setting of record linkage: given some training matches, it discovers rules (a blocking scheme) that efficiently generate candidate matches between the sets.
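
The sketch below illustrates what a blocking scheme does. The scheme here is hand-written rather than learned, and the records are made up, but candidate pairs are generated exactly as described above: records sharing a blocking key are passed on for detailed comparison.

```python
# Hand-written blocking scheme (BSL would learn one from training matches):
# records that share a blocking key become candidate pairs.
from collections import defaultdict
from itertools import combinations

records = [  # hypothetical records
    {"id": 1, "last": "Smith", "zip": "90292"},
    {"id": 2, "last": "Smyth", "zip": "90292"},
    {"id": 3, "last": "Jones", "zip": "10001"},
]

def blocking_keys(r):
    # Disjunction of two predicates: (first 3 letters of last name) OR (zip)
    return {("last3", r["last"][:3].lower()), ("zip", r["zip"])}

buckets = defaultdict(list)
for r in records:
    for key in blocking_keys(r):
        buckets[key].append(r["id"])

candidates = {tuple(sorted(p)) for ids in buckets.values()
              for p in combinations(ids, 2)}
print(candidates)  # {(1, 2)} -- only the pair sharing a key is compared in detail
```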

EIDOS: Efficiently Inducing Definitions for Online Sources

The Internet is full of information sources providing various types of data, from weather forecasts to travel deals. These sources can be accessed via web forms, Web Services, or RSS feeds. In order to make automated use of these sources, one needs to first model them semantically, but writing semantic descriptions for web sources by hand is both tedious and error prone. EIDOS addresses this by automatically inducing definitions for online sources.

Digg 2009

This anonymized data set consists of the voting records for 3,553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the id of the voter and the time stamp of the vote. In addition, data about the friendship links of voters was collected from Digg.

Twitter 2010

This data set contains information about URLs that were tweeted over a 3-week period in the Fall of 2010. In addition to tweets, we also collected the followee links of tweeting users, allowing us to reconstruct the follower graph of (tweeting) users.
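
A minimal sketch of that reconstruction, with a hypothetical in-memory edge list standing in for the actual files: reversing each followee edge yields the follower graph.

```python
# Rebuild the follower graph from followee links (hypothetical edge list).
from collections import defaultdict

followee_links = [("alice", "bob"), ("alice", "carol"), ("dave", "bob")]  # (user, followee)

followers = defaultdict(set)
for user, followee in followee_links:
    followers[followee].add(user)   # reverse each followee edge

print(dict(followers))  # {'bob': {'alice', 'dave'}, 'carol': {'alice'}}
```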

Flickr personal taxonomies

This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) containing collections and their constituent sets (a.k.a. photo-albums).

Wrapper maintenance

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically re-induce the wrapper. The data sets used for the experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.
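
The toy example below, with hypothetical HTML and a deliberately simple regex wrapper, shows the kind of extraction failure that signals a layout change and would trigger automatic re-induction; it is not the re-induction system itself.

```python
# A regex-based wrapper extracts a price; when the page layout changes the
# extraction fails, which is the cue to re-induce the wrapper.
import re

wrapper = re.compile(r'<span class="price">\$([\d.]+)</span>')

old_page = '<div>Widget <span class="price">$19.99</span></div>'
new_page = '<div>Widget <b data-price="19.99">$19.99</b></div>'  # layout changed

for page in (old_page, new_page):
    m = wrapper.search(page)
    if m:
        print("extracted price:", m.group(1))
    else:
        print("wrapper failed -- trigger automatic re-induction")
```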


Social network analysis methods examine the topology of a network in order to identify its structure, for example, who the important nodes are. Centrality, however, depends on both the network topology (or social links) and the dynamical processes (or flow) taking place on the network, which determine how ideas, pathogens, or influence spread along social links. Matlab code is available for calculating random walk-based centrality (PageRank) and epidemic diffusion-based centrality (given by Bonacich's Alpha-Centrality).
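
As a rough Python counterpart to that Matlab code (the adjacency matrix is made up for illustration), PageRank can be computed by power iteration and Alpha-Centrality by solving (I - alpha * A^T) x = e.

```python
# PageRank via power iteration and Bonacich Alpha-Centrality on a toy graph.
import numpy as np

A = np.array([[0, 1, 1, 0],      # A[i, j] = 1 if node i links to node j
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def pagerank(A, d=0.85, tol=1e-9):
    """Power iteration; assumes every node has at least one outgoing link."""
    n = len(A)
    M = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - d) / n + d * (M.T @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

def alpha_centrality(A, alpha=0.1, e=None):
    """Bonacich Alpha-Centrality: x = (I - alpha * A^T)^(-1) e."""
    n = len(A)
    e = np.ones(n) if e is None else e
    return np.linalg.solve(np.eye(n) - alpha * A.T, e)

print("PageRank:        ", np.round(pagerank(A), 3))
print("Alpha-Centrality:", np.round(alpha_centrality(A), 3))
```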


WWW 2018 Tutorial: Scalable Construction and Querying of Massive Knowledge Bases

In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to textual information from various vertical domains (e.g., corporate reports, advertisements, legal acts, medical reports). How to turn such massive and unstructured text data into structured, actionable knowledge, and how to enable effective and user-friendly access to that knowledge, is a grand challenge for the research community.

AAAI 2018 Tutorial: Knowledge Graph Construction from Web Corpora

Knowledge Graphs (KGs) like Wikidata, NELL and DBpedia have recently played instrumental roles in several machine learning applications, including search and information retrieval, information extraction, and data mining. Constructing knowledge graphs is a difficult problem typically studied for natural language documents. With the rapid rise in Web data, there are interesting opportunities to construct domain-specific knowledge graphs over corpora that have been crawled or acquired through techniques like focused crawling. In this tutorial, we survey techniques for knowledge graph construction from domain-specific Web corpora.

KDD 2017 Tutorial: Data mining in unusual domains with information-rich knowledge graph construction, inference and search

The growth of the Web is a success story that has spurred much research in knowledge discovery and data mining. Data mining over Web domains that are unusual is an even harder problem. There are several factors that make a domain unusual. In particular, such domains have significant long tails and exhibit concept drift, and are characterized by high levels of heterogeneity. Notable examples of unusual Web domains include both illicit domains, such as human trafficking advertising, illegal weapons sales, counterfeit goods transactions, patent trolling and cyberattacks, and also non-illicit domains such as humanitarian and disaster relief. Data mining in such domains has the potential for widespread social impact, and is also very challenging technically. In this tutorial, we provide an overview, using demos, examples and case studies, of the research landscape for data mining in unusual domains, including recent work that has achieved state-of-the-art results in constructing knowledge graphs in a variety of unusual domains, followed by inference and search using both command line and graphical interfaces.

ISWC 2017 Tutorial: Constructing Domain-specific Knowledge Graphs (KGC)

The vast amounts of ontologically unstructured information on the Web, including semi-structured HTML, XML and JSON documents, natural language documents, tweets, blogs, markups, and even structured documents like CSV tables, all contain useful knowledge that can present a tremendous advantage to Semantic Web researchers if extracted robustly, efficiently and semi-automatically as an RDF knowledge graph. Domain-specific Knowledge Graph Construction (KGC) is an active research area that has recently witnessed impressive advances due to machine learning techniques like deep neural networks and word embeddings. This tutorial will synthesize and present KGC techniques, especially information extraction (IE), in a manner that is accessible to Semantic Web researchers. The presenters will draw on their experience as instructors and Semantic Web researchers, as well as lessons from actual IE implementations, to accomplish this purpose through visually intuitive and example-driven slides; accessible, high-level overviews of related work; instructor demos; and at least five participatory IE activities that attendees will be able to set up on their laptops.


ISWC 2017: Hybrid Statistical Semantic Understanding and Emerging Semantics

ACM WWW 2018: Latent Semantics for the Web (LSW)

ESWC 2018: Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies (DL4KGS)