KGTK: Tools for Creating and Exploiting Large Knowledge Graphs

KGC'22 Tutorial
May 2, 2022, 1 - 4:30PM ET


The Knowledge Graph Toolkit (KGTK) (Ilievski et al., 2020) is a comprehensive framework for the creation and exploitation of large KGs, designed for simplicity, scalability, and interoperability. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. The simplicity of its data model also allows KGTK operations to be easily integrated with existing tools, like Pandas or graph-tool.

KGTK provides a suite of commands to import Wikidata, RDF (e.g., DBpedia), and popular graph representations into the KGTK format. A rich collection of transformation commands makes it easy to clean, union, filter, and sort KGs, while the KGTK graph combination commands support efficient intersection, subtraction, and joining of large KGs. Its advanced functionality includes a query language, a variant of Cypher called “Kypher”, which has been optimized for querying KGs stored on disk with minimal indexing overhead; graph analytics commands that support scalable computation of centrality metrics such as PageRank and degrees, as well as connected components and shortest paths; lexicalization of graph nodes; and computation of multiple variants of text and graph embeddings over the whole graph. In addition, a suite of export commands supports the transformation of KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool. Finally, KGTK allows browsing locally stored KGs using a variant of SQID, and includes a development environment using Jupyter notebooks that provides seamless integration with Pandas. KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop computer.
We have used KGTK in multiple settings, focusing primarily on the construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a consolidated commonsense KG combining multiple existing sources, creation of an extension of Wikidata for food security, and creation of an extension of Wikidata for the pharmaceutical industry.
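As a rough illustration of the four-column TSV edge format described in the overview, the sketch below loads a tiny edge file into Pandas, filters it by edge label, and counts outgoing edges per head node. It does not use KGTK itself. The column names `id`, `node1`, `label`, and `node2` are assumed to correspond to edge-identifier, head, edge-label, and tail; the sample edges and the helper names `load_edges` and `out_degrees` are hypothetical and exist only for this example.

```python
import io

import pandas as pd

# Hypothetical three-edge graph in the four-column layout.
# Assumed header: id (edge-identifier), node1 (head), label (edge-label), node2 (tail).
# The Wikidata-style identifiers are for illustration only.
EDGES_TSV = (
    "id\tnode1\tlabel\tnode2\n"
    "e1\tQ42\tP31\tQ5\n"
    "e2\tQ42\tP106\tQ36180\n"
    "e3\tQ5\tP279\tQ215627\n"
)


def load_edges(tsv_text: str) -> pd.DataFrame:
    """Read a tab-separated edge file, keeping every value as a string."""
    return pd.read_csv(io.StringIO(tsv_text), sep="\t", dtype=str)


def out_degrees(edges: pd.DataFrame) -> dict:
    """Count the outgoing edges of each head node."""
    return edges.groupby("node1").size().to_dict()


edges = load_edges(EDGES_TSV)

# Keep only the edges with a given label, similar in spirit to a filter step.
instance_of = edges[edges["label"] == "P31"]

print(instance_of)
print(out_degrees(edges))
```

Because every edge is a plain row, ordinary DataFrame operations (filtering, grouping, concatenation) apply directly. That is the kind of integration with existing tools the overview refers to.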

Slack channel: TBD

Note: this tutorial will be an updated iteration of our previous tutorial given at ISWC'21 in October 2021.


Hands-on materials:

Colab Notebooks:


KGTK documentation:

Similarity GUI:

KGTK Search:

KGTK Browser:

Resource paper (ESWC'20):

KGTK on GitHub:

Tutorial Program

The KGTK tutorial will introduce the full suite of commands available in KGTK, including parsers for popular existing KG formats, commands that curate or transform existing graphs, its query language Kypher (an adaptation of Cypher), commands for graph analytics, and computation of embeddings. We will introduce the commands in hands-on sessions, allowing participants to follow along and run them in Colab notebooks. In addition, we will introduce a range of use cases that rely on KGTK to compute similarity of nodes, enrich Wikidata with additional knowledge from CSV files or LOD, perform knowledge graph profiling, analyze graph networks, and analyze the quality of Wikidata. Each use case will be presented together with its underlying KGTK code, and a subset of the use cases will be run live in hands-on sessions.

The tutorial will be held on May 2, 2022, 1 - 4:30PM ET. All times below are in Eastern Time (ET), in the afternoon (PM). Speakers: FI=Filip Ilievski, DG=Daniel Garijo, HC=Hans Chalupsky, PS=Pedro Szekely, GS=Gleb Satyukov, AS=Amandeep Singh.

Time (ET)





Welcome and Introduction

Introduction to KGTK and Kypher (Slides, Notebook)

Session I: Profiling and browsing knowledge graphs

Session II: Extending KGs with tabular data (IMDB)

Session III: Network analysis of KGs (Slides, Notebook)

Session IV: Working with Embeddings

Application and Closing Remarks


Speaker Bios

Filip Ilievski is a Research Scientist in the Center on Knowledge Graphs within the Information Sciences Institute at the University of Southern California. He obtained a Ph.D. in Natural Language Processing and Knowledge Representation at the Vrije Universiteit (VU) in Amsterdam. His primary research focus is the role of background knowledge, especially commonsense knowledge, in filling gaps in human communication. Filip’s research has been published at top-tier venues like AAAI, EMNLP, and ISWC. Filip gave tutorials on commonsense knowledge at AAAI’21 and ISWC’20.

Daniel Garijo is a Researcher at the Ontology Engineering Group at the Universidad Politécnica de Madrid (UPM), where he obtained his PhD. Before joining UPM, Daniel was a researcher at the Information Sciences Institute at the University of Southern California. His research is focused on using Semantic Web and Linked Data to facilitate the reuse and understanding of scientific workflows and software. Daniel has experience in presenting tutorials at international conferences such as Dublin Core and AAAI, and at universities such as Stanford, UCLA, and USC.

Hans Chalupsky is a Research Lead at USC’s Information Sciences Institute where he heads the Loom Knowledge Representation and Reasoning Group. His research focuses on the design, development and application of practical knowledge representation and reasoning systems. He is a principal architect and developer of the PowerLoom KR&R system, which combines over ten years of DARPA-funded development and has been distributed to many sites worldwide. Dr. Chalupsky is also the principal architect of the KOJAK Link Discovery System whose Group Finder has been ranked first in several formal DARPA evaluations, and whose UNICORN system for anomaly detection in large knowledge graphs was awarded second place in the Open Task of the 2003 KDD Cup. His research interests include KR&R systems, KGs, semantic interoperability and neuro-symbolic reasoning systems.

Pedro Szekely is Principal Scientist and Director of the AI division at the University of Southern California's Information Sciences Institute, and Research Associate Professor of Computer Science. Dr. Szekely's current research focuses on table understanding and toolkits for creating and exploiting KGs in AI applications. Dr. Szekely teaches a graduate course on Building Knowledge Graphs, and has given tutorials on knowledge graph construction at KDD, ISWC, AAAI and WWW. Dr. Szekely has published over 100 papers in prestigious conferences, served as program chair for the International Knowledge Capture conference, and as conference chair for the Intelligent User Interfaces Conference.

Gleb Satyukov is a full-time Research Programmer at USC's Information Sciences Institute. His academic background is primarily in computational linguistics and media technology, focusing most recently on human-computer interaction. Covering both back-end and front-end development, his professional expertise ranges from user experience and interaction design to the architecture of distributed systems at scale. At ISI, Gleb works on the design, development, and maintenance of online infrastructure that supports ongoing research projects, such as the Knowledge Graph Toolkit. As part of the KGTK team, Gleb builds intuitive user interfaces for visualizing knowledge graphs, exploring the data, and finding the insights hidden within.

Amandeep Singh is a full-time Research Programmer at USC's Information Sciences Institute.