KGTK: Tools for Creating and Exploiting Large Knowledge Graphs, The Web Conference 2022 Tutorial

Presenters

Filip Ilievski

USC/ISI

Kian Ahrabian

USC/ISI

Gleb Satyukov

USC/ISI

Ke-Thia Yao

USC/ISI

Jay Pujara

USC/ISI

Filip Ilievski (ilievski@isi.edu) - Research Lead at the Information Sciences Institute, University of Southern California (USC). Research Assistant Professor of Computer Science at USC Viterbi School of Engineering.

Kian Ahrabian (ahrabian@isi.edu) - Ph.D. Student at the Information Sciences Institute, University of Southern California (USC).

Gleb Satyukov (gleb@isi.edu) - Research Lead at the Information Sciences Institute, University of Southern California (USC).

Ke-Thia Yao (kyao@isi.edu) - Research Lead at the Information Sciences Institute, University of Southern California (USC).

Jay Pujara (jpujara@isi.edu) - Director at the Center on Knowledge Graphs and Research Lead at the USC Information Sciences Institute. Research Assistant Professor of Computer Science at the USC Viterbi School of Engineering.

The Knowledge Graph ToolKit

The Knowledge Graph Toolkit (KGTK) (Ilievski et al., 2020) is a comprehensive framework for the creation and exploitation of large KGs, designed for simplicity, scalability, and interoperability. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. The simplicity of its data model also allows KGTK operations to be easily integrated with existing tools, like Pandas or graph-tool. KGTK provides a suite of commands to import Wikidata, RDF (e.g., DBpedia), and popular graph representations into the KGTK format. A rich collection of transformation commands makes it easy to clean, union, filter, and sort KGs, while the KGTK graph combination commands support efficient intersection, subtraction, and joining of large KGs. Its advanced functionality includes a variant of the Cypher query language (called “Kypher”), optimized for querying KGs stored on disk with minimal indexing overhead; graph analytics commands for scalable computation of graph metrics such as PageRank, degrees, connected components, and shortest paths; lexicalization of graph nodes; and computation of multiple variants of text and graph embeddings over the whole graph. In addition, a suite of export commands supports the transformation of KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool. Finally, KGTK allows browsing locally stored KGs using a variant of SQID, and includes a development environment using Jupyter notebooks that provides seamless integration with Pandas. KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop computer.
We have used KGTK in multiple settings, focusing primarily on the construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a consolidated commonsense KG combining multiple existing sources, creation of an extension of Wikidata for food security, and creation of an extension of Wikidata for the pharmaceutical industry.
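
To make the four-column data model concrete, here is a minimal sketch in plain Python that does not use KGTK itself. It assumes the column names `id`, `node1`, `label`, and `node2` for edge-identifier, head, edge-label, and tail. The toy edges and the helpers `read_edges`, `filter_by_label`, `degrees`, and `write_edges` are hypothetical and exist only for illustration. Each step reads and writes the same tabular layout, which is the property that lets such steps be chained into a pipeline.

```python
import csv
import io
from collections import Counter

# Assumed column names for edge-identifier, head, edge-label, and tail.
COLUMNS = ["id", "node1", "label", "node2"]

# Hypothetical toy graph, invented for this sketch.
EDGES_TSV = (
    "id\tnode1\tlabel\tnode2\n"
    "e1\tQ1\tP31\tQ5\n"
    "e2\tQ2\tP31\tQ5\n"
    "e3\tQ1\tP27\tQ30\n"
)


def read_edges(text):
    """Parse tab-separated edge rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_by_label(edges, label):
    """Keep only the edges whose edge-label equals `label`."""
    return [edge for edge in edges if edge["label"] == label]


def degrees(edges):
    """Count how many edges touch each node, as either head or tail."""
    counts = Counter()
    for edge in edges:
        counts[edge["node1"]] += 1
        counts[edge["node2"]] += 1
    return counts


def write_edges(edges):
    """Serialize edges back to the same four-column layout."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(edges)
    return out.getvalue()


edges = read_edges(EDGES_TSV)
instance_of = filter_by_label(edges, "P31")   # a filter step
node_degrees = degrees(instance_of)           # an analytics step
print(write_edges(instance_of), end="")
print(node_degrees.most_common())
```

Because the filter output has the same layout as its input, it can be passed straight to another step, or loaded into a tool such as Pandas as an ordinary tab-separated table.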

Tutorial Program

The lab will showcase four diverse applications of KGTK, each presented in a 20-minute slot by a central contributor to the corresponding project. Kian Ahrabian and Jay Pujara will lead the session on analyzing publication graphs and leveraging them to drive scientific innovation. The second session will focus on supply chain and financial transaction analytics, and will be presented by Ke-Thia Yao. Gleb Satyukov will present the third session on event analytics with moral dimensions. The last session, on modeling and analyzing Internet Memes, will be led by Filip Ilievski. Each session will be based on a Python notebook. The sessions will showcase how KGTK supports a wide range of pipelines in a user-friendly and scalable way, helping AI researchers and developers understand how to work with realistic knowledge graphs and inspiring them to come up with their own use cases of interest. This lab will be held as a quarter-day (1h 45min) session at the AAAI 2023 conference.

Use cases:

1. Internet Memes - we will show how KGTK can help connect the dots between internet meme sources and external knowledge graphs, like Wikidata. We will use KGTK to perform scalable analytics of the resulting graph and execute novel entity-centric and hybrid queries.

2. Financial transactions - we will describe how KGTK can be used to analyze financial transaction data. We will illustrate how to construct KGTK pipelines with graph transformation, analytics, and visualization steps for the financial sector. The KGTK pipelines enable us to highlight trading behaviors, to find potential colluders, and to find inconsistencies by differentiating knowledge graph structures.

3. Publication graphs (PubGraphs) - the recent advent of large-scale public repositories of research publication metadata, such as OpenAlex (Priem, Piwowar, and Orr 2022), enables us to study innovation at scales that have not been possible before. However, dealing with these large-scale repositories is extremely difficult and requires special toolkits. In this session, we will describe how KGTK can be used for data filtering, data transformation, knowledge graph extraction, and training knowledge graph embeddings on graphs of scientific publications.

4. Morality in events - we will demonstrate how our knowledge graph tools are applied to make sense of complex events. Focusing on a specific domain (or location), we track the changes in moral foundations (Johnson and Goldwasser 2018) and emotions to understand public perception of these events. The use of KGTK in this project makes it easy to scale up, to generalize to other domains and locations, and to browse and visualize the data.

The tutorial will be held on February 7th, 2023, 10:45AM - 12:30PM EST. All times below are in EST. Speakers: FI=Filip Ilievski, JP=Jay Pujara, KY=Ke-Thia Yao, GS=Gleb Satyukov, KA=Kian Ahrabian.

Time (EST)	Content	Speaker	Material
10:45 - 11:00	Welcome and Introduction	FI	Slides
11:00 - 11:20	Internet Memes: Knowledge connects culture and creativity	FI	Slides
11:20 - 11:40	Financial transactions: Detecting anomalies in trading	KY	Slides
11:40 - 12:00	PubGraphs: What should I read next?	KA, JP	Slides
12:00 - 12:20	Morality in events: From news to timelines and graph maps	GS	Slides
12:20 - 12:30	Discussion and Closing remarks	JP	Slides

The notebooks for this tutorial can be found on this dedicated GitHub repository.

KGTK: User-friendly Toolkit for Manipulation of Large Knowledge Graphs
