Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs. Users integrate information by modeling it according to an ontology of their choice using a graphical user interface that automates much of the process. Karma learns to recognize the mapping of data to ontology classes and then uses the ontology to propose a model that ties together these classes. Users then interact with the system to adjust the automatically generated model. During this process, users can transform the data as needed to normalize data expressed in different formats and to restructure it. Once the model is complete, users can publish the integrated data as RDF or store it in a database.
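To make the end product concrete, here is a minimal sketch (in Python with rdflib; the ontology namespace, column names, and URIs are invented for illustration, not taken from Karma) of the kind of RDF a completed model yields for a single spreadsheet row:

```python
# Minimal sketch of the kind of output a completed model produces: one
# spreadsheet row mapped to ontology classes/properties, serialized as RDF.
# The ontology namespace and column names below are invented for illustration.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/ontology/")  # hypothetical ontology
row = {"name": "Jane Doe", "title": "Professor", "dept": "Computer Science"}

g = Graph()
person = URIRef("http://example.org/person/jane-doe")
g.add((person, RDF.type, EX.Person))             # row -> class mapping
g.add((person, EX.name, Literal(row["name"])))   # column -> property mappings
g.add((person, EX.title, Literal(row["title"])))
g.add((person, EX.department, Literal(row["dept"])))

print(g.serialize(format="turtle"))
```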
All the project publications are here. The best paper on the technical aspects of Karma is our ESWC'2012 paper, and the best application paper is our ESWC'2013 paper, which received the best in-use paper award at the conference.
Karma uses programming-by-example, learning techniques and a Steiner tree optimization algorithm to automate as much of the process as possible to enable end-users to map their data to a chosen ontology. Users adjust the automatically generated model using a graphical user interface and never see the complex mapping rules used in other systems.
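Karma's actual model-construction algorithm is described in the ESWC'2012 paper. As a rough illustration of the Steiner tree idea only, the sketch below uses networkx's approximation on an invented ontology graph: nodes are classes, edge weights reflect the cost of the properties linking them, and the terminals are the classes matched to the source's columns.

```python
# Illustration of the Steiner tree idea behind model proposal (not Karma's
# code): find a minimum-cost tree connecting the classes matched to the
# source's columns. The graph and weights below are invented.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
G.add_weighted_edges_from([
    ("Person", "Organization", 1),   # e.g. a worksFor property
    ("Person", "Place", 2),          # e.g. bornIn
    ("Organization", "Place", 1),    # e.g. locatedIn
    ("Person", "Event", 3),
    ("Event", "Place", 1),
])

terminals = ["Person", "Place"]      # classes matched to the source's columns
tree = steiner_tree(G, terminals)    # minimum-cost tree tying them together
print(sorted(tree.edges(data="weight")))
```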
Many systems have been developed to map tabular sources to ontologies. Karma is unique in that it also supports hierarchical data sources such as XML, JSON and KML.
In addition to static sources (databases and files), Karma supports data integration from Web APIs, enabling users to leverage the thousands of data sources available today through such APIs.
Karma uses ontologies as a basis for integrating information, leveraging the class and property hierarchies, domain and range information and other ontology constructs to help users integrate their data. Karma also allows users to combine multiple ontologies, enabling them to map their data to standard vocabularies.
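As a rough illustration of how domain and range constraints can drive suggestions (this is not Karma's internal code; the ontology file and class IRI below are placeholders), the sketch lists the properties an ontology declares for a given class:

```python
# Sketch of using rdfs:domain / rdfs:range to suggest properties for a class
# the user mapped a column to. File name and class IRI are placeholders.
from rdflib import Graph, RDFS, URIRef

g = Graph()
g.parse("ontology.owl", format="xml")  # hypothetical ontology file

cls = URIRef("http://example.org/ontology/Person")  # invented class IRI
for prop in g.subjects(RDFS.domain, cls):
    ranges = list(g.objects(prop, RDFS.range))
    print(prop, "->", ranges)  # candidate properties and their target classes
```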
Users work with a subset of their data to define the models that integrate their data sources. This keeps the user interface responsive during modeling. Karma can then apply these models in batch mode to integrate large data sources.
Karma offers a programming-by-example interface to enable users to define data transformation scripts that transform data expressed in multiple data formats into a common format.
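The standalone sketch below shows the kind of normalization such scripts perform; the function name and the list of date formats are our own for illustration, not Karma-generated code.

```python
# Sketch of a typical transformation: normalize dates written in several
# formats to ISO 8601. Function name and format list are illustrative.
from datetime import datetime

FORMATS = ["%m/%d/%Y", "%d-%b-%Y", "%Y.%m.%d"]

def normalize_date(value: str) -> str:
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

assert normalize_date("03/15/2012") == "2012-03-15"
assert normalize_date("15-Mar-2012") == "2012-03-15"
```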
We used Karma to replicate the mappings done in a scenario from the Semantic MediaWiki Linked Data Extension (SMW-LDE) work where researchers integrated ABA, Uniprot, KEGG Pathway, PharmGKB and Linking Open Drug Data datasets by mapping them to a common ontology. Papers: ESWC'2012, ISWC'2011 Linked Science.
VIVO is a system to build researcher networks across institutions. In this case study we used Karma to map data about USC faculty to the VIVO ontology and to publish the data in the RDF format that the VIVO system can ingest. Karma enables users to ingest data in VIVO by interacting with an easy-to-use graphical user interface, and does not require knowledge of SPARQL or other Web languages such as XSLT or XQuery. The video shows how to ingest the sample files provided in the VIVO Data Ingest Guide.
Karma at the VIVO'2012 Conference in Miami, Florida: Abstract (PDF), Slides (Powerpoint), Ontology and datasets (zip)
In this case study we used Karma to convert records for 44,000 of the Smithsonian American Art Museum's holdings to Linked Open Data according to the Europeana Data Model (EDM). The records are stored in several tables in a SQL Server database. Using Karma we modeled these tables in terms of the EDM ontology and converted the data into RDF. We are creating 5-star Linked Data, linked to DBpedia, the Getty Union List of Artist Names (ULAN)® and the NY Times Linked Data. The USC press release. Our paper on this work received the best in-use paper award at the ESWC'2013 conference: paper, slides. The Linked Data is now deployed: each time you visit an artist page on the Smithsonian American Art Museum web site, a SPARQL query is issued to retrieve links to Wikipedia and the NY Times.
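As a sketch of the kind of query involved (the endpoint URL and artist URI are placeholders, and owl:sameAs is a plausible linking predicate rather than the deployment's confirmed vocabulary):

```python
# Sketch of the kind of SPARQL query an artist page might issue to fetch
# its external links. Endpoint and artist URI are placeholders.
import requests

ENDPOINT = "http://example.org/saam/sparql"  # placeholder endpoint
query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?link WHERE {
  <http://example.org/saam/artist/1234> owl:sameAs ?link .
}
"""
resp = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
for binding in resp.json()["results"]["bindings"]:
    print(binding["link"]["value"])  # e.g. DBpedia or NY Times URIs
```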
In this case study we show how Karma could be used to help first responders plan evacuations of affected personnel in the event of a fire in an oil field. We used Karma to integrate publicly available data about oil well locations in MS Excel format, data about personnel locations from a text file, and information about the extent of the fire and the locations of care centers from a KML file. In this example, no detailed road network information is available for the region in question, so our software extracted the road network data from a USGS map. In this case study we also show how Karma can invoke services that perform complex geospatial reasoning to 1) subtract from the road network the roads that intersect the region affected by the fire, 2) compute the shortest evacuation path for each person avoiding roads that go through the fire, and 3) simulate the likely spread of the fire based on wind conditions extracted from a public weather service. Users can perform the information integration tasks, invoke the services interactively, and visualize the results on a map using Karma.
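The deployed system calls dedicated geospatial services for these steps; as a toy illustration of steps 1 and 2 only, the sketch below uses networkx and shapely on invented coordinates:

```python
# Toy illustration of steps 1 and 2: drop road segments that intersect the
# fire region, then find a shortest evacuation path. All coordinates invented.
import networkx as nx
from shapely.geometry import LineString, Polygon

fire = Polygon([(2, 0), (4, 0), (4, 4), (2, 4)])  # invented fire extent

roads = nx.Graph()
segments = [((0, 0), (1, 1)), ((1, 1), (3, 2)), ((1, 1), (1, 5)),
            ((1, 5), (5, 5)), ((3, 2), (5, 5))]
for a, b in segments:
    roads.add_edge(a, b, weight=LineString([a, b]).length)

# 1) subtract roads that intersect the region affected by the fire
burning = [(a, b) for a, b in roads.edges
           if LineString([a, b]).intersects(fire)]
roads.remove_edges_from(burning)

# 2) shortest evacuation path from a person's location to a care center
path = nx.shortest_path(roads, (0, 0), (5, 5), weight="weight")
print(path)
```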
In this case study we used Karma to help environmental scientists construct a model of the metabolism of the Merced river in California. An important bottleneck the scientists face is preparing the data used to fit and run the models. In this case, data came from the California Data Exchange Center (CDEC), the scientists' own sensors, and weather information from NOAA. The CDEC and NOAA data was accessible via web services, and the scientists' data was available in CSV files. In addition, the data used different formats to represent dates, times and units, came at different time resolutions, and contained errors. We used Karma to retrieve, clean, normalize, integrate and publish the data. Karma published the data in the format needed by the executable models, and produced semantic metadata that allowed the WINGS workflow system to help users compose the different parts of the workflow. In addition, Karma exported the data preparation procedures as a script that could be executed every day to produce fresh data. This allowed WINGS to automatically execute the workflow every day. Paper: ISWC'2011.
The amount of data available in the Linked Data cloud continues to grow. Yet, few services consume and produce linked data. There is recent work that allows a user to define a linked service from an online service, which includes the specifications for consuming and producing linked data, but building such models is time-consuming and requires specialized knowledge of RDF and SPARQL. We present a new approach that allows domain experts to rapidly create semantic models of services by demonstration in an interactive web-based interface. First, the user provides examples of the service request URLs. Then, the system automatically proposes a service model the user can refine interactively. Finally, the system saves a service specification using a new expressive vocabulary that includes lowering and lifting rules. This approach empowers end users to rapidly model existing services and immediately use them to consume and produce linked data. Paper: ISWC'2012.
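As a toy illustration of the first step (the actual approach is described in the ISWC'2012 paper; the URLs and the inference shown here are invented), one can derive a parameterized template from example request URLs:

```python
# Toy sketch: infer a parameterized URL template from example requests by
# treating query parameters whose values vary as template inputs.
from urllib.parse import parse_qsl, urlparse

examples = [
    "http://api.example.org/geocode?city=Rome&country=IT",
    "http://api.example.org/geocode?city=Paris&country=FR",
]

parsed = [urlparse(u) for u in examples]
params = [dict(parse_qsl(p.query)) for p in parsed]

# parameters whose values differ across examples become template inputs
inputs = [k for k in params[0] if len({p[k] for p in params}) > 1]
template = parsed[0]._replace(
    query="&".join(f"{k}={{{k}}}" for k in inputs)).geturl()
print(template)  # http://api.example.org/geocode?city={city}&country={country}
```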
This research is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under contract number FA8750-14-C-0240, the Smithsonian American Art Museum, the National Science Foundation under awards IIS-1117913 and CMMI-0753124, the NIH through the Biomedical Informatics Research Network NCRR grant (1 U24 RR025736-01), and the National Institutes of Health under grant number 1 UL1 RR031986-01 at the University of Southern California.