KGTK: Tools for Creating and Exploiting Large Knowledge Graphs

ISWC'21 Tutorial
October 24, 2021

Presenters

Full tutorial recording

The Knowledge Graph Toolkit (KGTK) (Ilievski et al., 2020) is a comprehensive framework for the creation and exploitation of large KGs, designed for simplicity, scalability, and interoperability. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. The simplicity of its data model also allows KGTK operations to be easily integrated with existing tools, like Pandas or graph-tool. KGTK provides a suite of commands to import Wikidata, RDF (e.g., DBpedia), and popular graph representations into the KGTK format. A rich collection of transformation commands makes it easy to clean, union, filter, and sort KGs, while the KGTK graph combination commands support efficient intersection, subtraction, and joining of large KGs. Its advanced functionality includes Kypher, a variant of the Cypher query language optimized for querying KGs stored on disk with minimal indexing overhead; graph analytics commands for scalable computation of centrality metrics such as PageRank and degrees, as well as connected components and shortest paths; lexicalization of graph nodes; and computation of multiple variants of text and graph embeddings over the whole graph. In addition, a suite of export commands supports the transformation of KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool. Finally, KGTK allows browsing locally stored KGs using a variant of SQID, and includes a development environment using Jupyter notebooks that provides seamless integration with Pandas. KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop computer.
We have used KGTK in multiple settings, focusing primarily on the construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a consolidated commonsense KG combining multiple existing sources, creation of an extension of Wikidata for food security, and creation of an extension of Wikidata for the pharmaceutical industry.
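
To make the four-column edge format described above concrete, here is a minimal sketch in plain Python. It does not use KGTK itself: the column names (id, node1, label, node2) are an assumption standing in for edge-identifier, head, edge-label, and tail, the sample edges are hypothetical, and the helper functions are illustrative only. It shows why steps that read and write the same TSV layout can be chained into a pipeline.

```python
import csv
import io
from collections import Counter

# Assumed column names for edge-identifier, head, edge-label, and tail.
COLUMNS = ["id", "node1", "label", "node2"]

# Hypothetical sample data: three edges in the four-column TSV layout.
SAMPLE_TSV = (
    "id\tnode1\tlabel\tnode2\n"
    "e1\tQ1\tP31\tQ5\n"
    "e2\tQ2\tP31\tQ5\n"
    "e3\tQ1\tP27\tQ30\n"
)


def read_edges(text):
    """Parse tab-separated edge rows into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def write_edges(edges):
    """Serialize edges back into the same four-column TSV layout."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        quoting=csv.QUOTE_NONE,
    )
    writer.writeheader()
    writer.writerows(edges)
    return buffer.getvalue()


def filter_by_label(edges, label):
    """Keep only the edges whose edge-label equals `label`."""
    return [edge for edge in edges if edge["label"] == label]


def out_degrees(edges):
    """Count outgoing edges per head node."""
    return Counter(edge["node1"] for edge in edges)


if __name__ == "__main__":
    edges = read_edges(SAMPLE_TSV)
    # Each step consumes and produces the same layout, so steps compose.
    filtered_tsv = write_edges(filter_by_label(edges, "P31"))
    print(filtered_tsv)
    print(out_degrees(read_edges(filtered_tsv)))
```

Because every step takes and returns the same edge layout, the output of one step can be fed directly into the next, which is the property that lets the toolkit's commands be composed into pipelines. Real KGTK files may carry additional columns and quoting conventions that this sketch does not handle.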

Slack channel: #kgtk-tutorial (ISWC workspace)

Resources

Hands-on materials: https://github.com/usc-isi-i2/kgtk-notebooks/

Colab Notebooks: https://github.com/usc-isi-i2/kgtk-notebooks#running-the-notebooks-in-google-colab

Slides: https://github.com/usc-isi-i2/kgtk-notebooks/tree/main/slides

KGTK documentation: https://kgtk.readthedocs.io/

Similarity GUI: https://kgtk.isi.edu/similarity/

KGTK Search: https://kgtk.isi.edu/search/

KGTK Browser: https://kgtk.isi.edu/iswc/browser/Q2685

Resource paper (ESWC'20): https://arxiv.org/pdf/2006.00088.pdf

KGTK on GitHub: https://github.com/usc-isi-i2/kgtk/

Tutorial Program

The KGTK tutorial will introduce the full suite of commands available in KGTK, including parsers for popular existing KG formats, commands that curate or transform existing graphs, its query language Kypher (an adaptation of Cypher), commands for graph analytics, and computation of embeddings. We will introduce the commands in hands-on sessions, allowing participants to follow along in Colab notebooks. In addition, we will introduce a range of use cases that rely on KGTK to compute similarity of nodes, enrich Wikidata with additional knowledge from CSV files or LOD, perform knowledge graph profiling, analyze graph networks, and analyze the quality of Wikidata. Each use case will be linked to its underlying KGTK code, and a subset of the use cases will be run live in hands-on sessions.


All times are given in Pacific Daylight Time (PDT). The tutorial starts at 9am PDT. Speakers: FI=Filip Ilievski, DG=Daniel Garijo, HC=Hans Chalupsky, PS=Pedro Szekely.


Time (PDT)  | Content                                           | Speaker | Material
09:00-09:15 | Welcome and Introduction                          | PS      | slides
09:15-10:00 | Introduction to KGTK and Kypher                   | HC      | slides
10:00-10:15 | Break                                             | /       | /
10:15-10:45 | Profiling knowledge graphs                        | PS      |
10:45-11:15 | Working with embeddings and estimating similarity | PS, FI  |
11:15-11:30 | Break                                             | /       | /
11:30-12:00 | Extending KG with tabular data (IMDB)             | PS      |
12:00-12:30 | Extending KG with LOD                             | FI      |
12:30-13:00 | Network analysis of KGs                           | PS      |
13:00-13:15 | Break                                             | /       | /
13:15-13:45 | Analyzing Wikidata with KGTK                      | DG      | slides
13:45-14:15 | Building a Commonsense Knowledge Graph            | FI      | slides
14:15-15:00 | Discussion, next steps & closing                  |         |

Speaker Bios

Filip Ilievski (ilievski@isi.edu) is a Computer Scientist in the Center on Knowledge Graphs within the Information Sciences Institute at the University of Southern California. He obtained a Ph.D. in Natural Language Processing and Knowledge Representation at the Vrije Universiteit (VU) in Amsterdam. His primary research focus lies on the role of background, especially commonsense, knowledge for filling gaps in human communication. Filip’s research has been published at top-tier venues like AAAI, EMNLP, and ISWC. Filip gave tutorials on commonsense knowledge at AAAI’21 and ISWC’20.

Daniel Garijo (dgarijo@isi.edu) is a Researcher at the Ontology Engineering Group at the Universidad Politécnica de Madrid (UPM), where he obtained his PhD. Before joining UPM, Daniel was a researcher at the Information Sciences Institute at the University of Southern California. His research is focused on using Semantic Web and Linked Data to facilitate the reuse and understanding of scientific workflows and software. Daniel has experience presenting tutorials at international conferences such as Dublin Core and AAAI, and at universities such as Stanford, UCLA and USC.

Hans Chalupsky (hans@isi.edu) is a Research Lead at USC’s Information Sciences Institute where he heads the Loom Knowledge Representation and Reasoning Group. His research focuses on the design, development and application of practical knowledge representation and reasoning systems. He is a principal architect and developer of the PowerLoom KR&R system, which reflects over ten years of DARPA-funded development and has been distributed to many sites worldwide. Dr. Chalupsky is also the principal architect of the KOJAK Link Discovery System, whose Group Finder has been ranked first in several formal DARPA evaluations, and whose UNICORN system for anomaly detection in large knowledge graphs was awarded second place in the Open Task of the 2003 KDD Cup. His research interests include KR&R systems, KGs, semantic interoperability and neuro-symbolic reasoning systems.

Pedro Szekely (pszekely@isi.edu) is Principal Scientist and Director of the AI division at the University of Southern California's Information Sciences Institute, and Research Associate Professor of Computer Science. Dr. Szekely's current research focuses on table understanding and toolkits for creating and exploiting KGs in AI applications. Dr. Szekely teaches a graduate course on Building Knowledge Graphs, and has given tutorials on knowledge graph construction at KDD, ISWC, AAAI and WWW. Dr. Szekely has published over 100 papers in prestigious conferences, served as program chair for the International Knowledge Capture conference, and as conference chair for the Intelligent User Interfaces Conference.