KGTK: Tools for Creating and Exploiting Large Knowledge Graphs

ISWC'21 Tutorial
October, 2021


The Knowledge Graph Toolkit (KGTK) (Ilievski et al., 2020) is a comprehensive framework for the creation and exploitation of large KGs, designed for simplicity, scalability, and interoperability. KGTK represents KGs in tab-separated (TSV) files with four columns: edge identifier, head, edge label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines that perform complex transformations on KGs. The simplicity of the data model also allows KGTK operations to be easily integrated with existing tools, like Pandas or graph-tool. KGTK provides a suite of commands to import Wikidata, RDF (e.g., DBpedia), and popular graph representations into the KGTK format. A rich collection of transformation commands makes it easy to clean, union, filter, and sort KGs, while the KGTK graph combination commands support efficient intersection, subtraction, and joining of large KGs. Its advanced functionality includes a query language variant of Cypher (called “Kypher”) that has been optimized for querying KGs stored on disk with minimal indexing overhead; graph analytics commands that support scalable computation of centrality metrics such as PageRank, degrees, connected components, and shortest paths; lexicalization of graph nodes; and computation of multiple variants of text and graph embeddings over the whole graph. In addition, a suite of export commands supports the transformation of KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool. Finally, KGTK allows browsing locally stored KGs using a variant of SQID, and includes a development environment based on Jupyter notebooks that provides seamless integration with Pandas. KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop computer.
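The four-column format described above can be illustrated with a short sketch. The example edges below are hypothetical, and the filtering helper only mimics what KGTK's `kgtk filter` command does at scale; in actual KGTK files the columns are named id, node1, label, and node2:

```python
import csv
import io

# A tiny KG in KGTK's tab-separated edge format. The four columns are the
# edge identifier, head node, edge label, and tail node (KGTK names them
# id, node1, label, node2). The example edges are hypothetical.
KGTK_TSV = """\
id\tnode1\tlabel\tnode2
e1\tQ42\tP31\tQ5
e2\tQ42\tP106\tQ36180
e3\tQ5\tP279\tQ215627
"""

def read_edges(text):
    """Parse a KGTK edge file into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def filter_by_label(edges, label):
    """Keep only edges with the given label (a toy stand-in for `kgtk filter`)."""
    return [e for e in edges if e["label"] == label]

edges = read_edges(KGTK_TSV)
instances = filter_by_label(edges, "P31")
print([(e["node1"], e["node2"]) for e in instances])  # [('Q42', 'Q5')]
```

Because every command reads and writes this same TSV shape, the output of one operation can be fed directly into the next, which is what makes KGTK pipelines composable.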
We have used KGTK in multiple settings, focusing primarily on the construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a consolidated commonsense KG combining multiple existing sources, creation of an extension of Wikidata for food security, and creation of an extension of Wikidata for the pharmaceutical industry.

Tutorial Program

The KGTK tutorial will consist of two main parts:

  1. KGTK Commands - introduction to the full suite of commands available in KGTK, including parsers for popular existing KG formats, commands that curate or transform existing graphs, its query language Kypher (an adaptation of Cypher), commands for graph analytics, and computation of embeddings. We will present the commands first, and provide Jupyter Notebooks for tutorial participants to try the commands out on their laptops.
  2. KGTK Use Cases - three use cases will be covered in depth: consolidation of commonsense knowledge graphs into a single graph (CSKG), analysis of the temporal validity and general correctness of Wikidata, and building a new KG that extends Wikidata with knowledge about Ethiopia. Each use case will first be covered as a presentation, followed by a hands-on session during which participants will be able to inspect the Notebooks that correspond to each use case.
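To give a rough idea of what the Kypher part of the tutorial covers, a graph pattern such as `(x)-[:P31]->(y)` amounts to a selection over the edge table. The sketch below uses hypothetical data and a simplified table layout; the real `kgtk query` command compiles Cypher-style patterns to SQL over a disk-backed SQLite store and manages its indexes automatically:

```python
import sqlite3

# Load a toy edge table and run the SQL equivalent of the Kypher pattern
# (x)-[:P31]->(y). Edge data is hypothetical; KGTK's actual Kypher engine
# builds and indexes such tables on disk from KGTK TSV files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (id TEXT, node1 TEXT, label TEXT, node2 TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?)",
    [("e1", "Q42", "P31", "Q5"),
     ("e2", "Q42", "P106", "Q36180"),
     ("e3", "Q5", "P279", "Q215627")],
)
conn.execute("CREATE INDEX idx_label ON edges (label)")  # minimal indexing

# Kypher-style query: --match '(x)-[:P31]->(y)' --return 'x, y'
rows = conn.execute("SELECT node1, node2 FROM edges WHERE label = 'P31'").fetchall()
print(rows)  # [('Q42', 'Q5')]
```

Keeping the edge table on disk and indexing only the columns a query needs is what lets this approach scale to Wikidata-sized graphs on a laptop.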

All times are tentative and given in Pacific Standard Time (PST). Speakers: FI=Filip Ilievski, DG=Daniel Garijo, HC=Hans Chalupsky, PS=Pedro Szekely.






Introduction to KGs and available KG toolkits

Basic KGTK
  Introduction to the KGTK file format and basic commands
  Hands-on: importing (Wikidata, DBpedia), filtering, combining graphs, deployment, exporting
  Materials: Slides, Jupyter Notebooks

Advanced KGTK
  Introduction to KGTK advanced functionality
  Hands-on: Kypher, embeddings, centrality, paths
  Materials: Slides, Jupyter Notebooks

Use cases part I
  Use case 1: Building a Commonsense Knowledge Graph
  Use case 2: Analysis of all 300+ dumps of Wikidata
  Materials: Slides, Jupyter Notebooks

Use cases part II & Discussion
  Use case 3: Enriching Wikidata with Excel Spreadsheets & Web Tables
  Wrap-up and discussion
  Materials: Slides, Jupyter Notebooks


Speaker Bios

Filip Ilievski is a Computer Scientist in the Center on Knowledge Graphs within the Information Sciences Institute at the University of Southern California. He obtained a Ph.D. in Natural Language Processing and Knowledge Representation at the Vrije Universiteit (VU) in Amsterdam. His primary research focus is the role of background knowledge, especially commonsense knowledge, in filling gaps in human communication. Filip’s research has been published at top-tier venues like AAAI, EMNLP, and ISWC. Filip gave tutorials on commonsense knowledge at AAAI’21 and ISWC’20.

Daniel Garijo is a Researcher at the Ontology Engineering Group at the Universidad Politécnica de Madrid (UPM), where he obtained his PhD. Before joining UPM, Daniel was a researcher at the Information Sciences Institute at the University of Southern California. His research focuses on using Semantic Web and Linked Data techniques to facilitate the reuse and understanding of scientific workflows and software. Daniel has experience presenting tutorials at international conferences such as Dublin Core and AAAI, and at universities such as Stanford, UCLA, and USC.

Hans Chalupsky is a Research Lead at USC’s Information Sciences Institute, where he heads the Loom Knowledge Representation and Reasoning Group. His research focuses on the design, development, and application of practical knowledge representation and reasoning systems. He is a principal architect and developer of the PowerLoom KR&R system, which combines over ten years of DARPA-funded development and has been distributed to many sites worldwide. Dr. Chalupsky is also the principal architect of the KOJAK Link Discovery System, whose Group Finder was ranked first in several formal DARPA evaluations and whose UNICORN system for anomaly detection in large knowledge graphs was awarded second place in the Open Task of the 2003 KDD Cup. His research interests include KR&R systems, KGs, semantic interoperability, and neuro-symbolic reasoning systems.

Pedro Szekely is Principal Scientist and Director of the AI division at the University of Southern California's Information Sciences Institute, and Research Associate Professor of Computer Science. Dr. Szekely's current research focuses on table understanding and toolkits for creating and exploiting KGs in AI applications. Dr. Szekely teaches a graduate course on Building Knowledge Graphs, and has given tutorials on knowledge graph construction at KDD, ISWC, AAAI and WWW. Dr. Szekely has published over 100 papers in prestigious conferences, served as program chair for the International Knowledge Capture conference, and as conference chair for the Intelligent User Interfaces Conference.