KGTK: Tools for Creating and Exploiting Large Knowledge Graphs

WWW'22 Tutorial
April 25, 2022, 14:00 - 17:15 CEST

Presenters

The Knowledge Graph Toolkit (KGTK) (Ilievski et al., 2020) is a comprehensive framework for the creation and exploitation of large KGs, designed for simplicity, scalability, and interoperability. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. The simplicity of its data model also allows KGTK operations to be easily integrated with existing tools, like Pandas or graph-tool. KGTK provides a suite of commands to import Wikidata, RDF (e.g., DBpedia), and popular graph representations into the KGTK format. A rich collection of transformation commands makes it easy to clean, union, filter, and sort KGs, while the KGTK graph combination commands support efficient intersection, subtraction, and joining of large KGs. Its advanced functionality includes a query language, a variant of Cypher called “Kypher”, which has been optimized for querying KGs stored on disk with minimal indexing overhead; graph analytics commands that support scalable computation of centrality metrics such as PageRank and degrees, as well as connected components and shortest paths; lexicalization of graph nodes; and computation of multiple variants of text and graph embeddings over the whole graph. In addition, a suite of export commands supports the transformation of KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool. Finally, KGTK allows browsing locally stored KGs using a variant of SQID, and includes a development environment using Jupyter notebooks that provides seamless integration with Pandas. KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop computer.
We have used KGTK in multiple settings, focusing primarily on the construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a consolidated commonsense KG combining multiple existing sources, creation of an extension of Wikidata for food security, and creation of an extension of Wikidata for the pharmaceutical industry.
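Because a KGTK graph is a four-column TSV edge list, it can be inspected with general-purpose tools. The following is a minimal sketch in Pandas, not KGTK's own implementation. It assumes the column names `id`, `node1` (head), `label` (edge label), and `node2` (tail) used in the KGTK documentation; the sample edges and the helper function names are hypothetical and serve only as illustration.

```python
import csv
import io

import pandas as pd

# Illustrative sample only: three edges in a four-column, tab-separated layout.
# Column names are an assumption based on the KGTK documentation:
# id (edge identifier), node1 (head), label (edge label), node2 (tail).
SAMPLE_TSV = (
    "id\tnode1\tlabel\tnode2\n"
    "e1\tQ42\tP31\tQ5\n"
    "e2\tQ42\tP106\tQ36180\n"
    "e3\tQ937\tP31\tQ5\n"
)


def load_edges(source):
    """Read a tab-separated edge file (path or file-like object) into a DataFrame.

    All values are kept as strings, and quote characters are treated as
    ordinary text so that quoted literals in the file are not altered.
    """
    return pd.read_csv(
        source,
        sep="\t",
        dtype=str,
        keep_default_na=False,
        quoting=csv.QUOTE_NONE,
    )


def edges_with_label(edges, label):
    """Return only the edges whose edge label equals `label`."""
    return edges[edges["label"] == label].reset_index(drop=True)


def out_degrees(edges):
    """Count outgoing edges per head node and return them as a dict."""
    return edges.groupby("node1").size().to_dict()


if __name__ == "__main__":
    edges = load_edges(io.StringIO(SAMPLE_TSV))
    print(edges_with_label(edges, "P31"))
    print(out_degrees(edges))
```

For graphs with billions of edges, loading the whole file into memory this way is unlikely to be practical; the sketch is only meant to show how little is needed to interoperate with the format.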

Slack channel: TBD

Note: this tutorial will be an updated iteration of our previous tutorial given at ISWC'21 in October 2021.

Resources

Hands-on materials: https://github.com/usc-isi-i2/kgtk-notebooks/

Colab Notebooks: https://github.com/usc-isi-i2/kgtk-notebooks#running-the-notebooks-in-google-colab

Slides: https://github.com/usc-isi-i2/kgtk-notebooks/tree/main/slides

KGTK documentation: https://kgtk.readthedocs.io/

Similarity GUI: https://kgtk.isi.edu/similarity/

KGTK Search: https://kgtk.isi.edu/search/

KGTK Browser: https://kgtk.isi.edu/iswc/browser/Q2685

Resource paper (ISWC'20): https://arxiv.org/pdf/2006.00088.pdf

KGTK on GitHub: https://github.com/usc-isi-i2/kgtk/

Tutorial Program

The KGTK tutorial will introduce the full suite of commands available in KGTK, including parsers for popular existing KG formats, commands that curate or transform existing graphs, its query language Kypher (an adaptation of Cypher), commands for graph analytics, and computation of embeddings. We will introduce the commands in hands-on sessions, allowing participants to execute along in Colab notebooks. In addition, we will introduce a range of use cases that rely on KGTK in order to compute similarity of nodes, enrich Wikidata with additional knowledge from CSV files or LOD, perform knowledge graph profiling, analyze graph networks, and analyze the quality of Wikidata. The use cases will be associated with their underlying code in KGTK, and a subset of the use cases will be run live in hands-on sessions.


The tutorial will be held on April 25, 2022, 14:00 - 17:15 CEST. All times below are in CEST, in 24-hour format. Speakers: FI=Filip Ilievski, DG=Daniel Garijo, HC=Hans Chalupsky, PS=Pedro Szekely, GS=Gleb Satyukov, AS=Amandeep Singh.


Time (CEST)  | Content                                            | Speaker | Material
14:00-14:15  | Welcome and Introduction                           | PS      | Slides
14:15-14:55  | Introduction to KGTK and Kypher                    | HC      | Slides, Notebook
14:55-15:30  | Session I: Exploring Knowledge Graphs              |         |
             | Option A: Profiling and browsing knowledge graphs  | GS, AS  | Notebook
             | Option B: Working with Embeddings                  | PS      | Notebook
15:30-15:45  | Break                                              | /       | /
15:45-16:20  | Session II: KG Augmentation                        |         |
             | Option A: Extending KGs with tabular data (IMDB)   | PS      | Notebook
             | Option B: Extending KGs with LOD                   | FI      | Notebook
16:20-16:55  | Session III: KG Analysis                           |         |
             | Option A: Network analysis of KGs                  | PS      | Notebook
             | Option B: Wikidata constraint validation           | DG      | Slides, Notebook
16:55-17:15  | Discussion and Closing remarks                     |         |

Speaker Bios

Filip Ilievski (ilievski@isi.edu) is a Research Scientist in the Center on Knowledge Graphs within the Information Sciences Institute at the University of Southern California. He obtained a Ph.D. in Natural Language Processing and Knowledge Representation at the Vrije Universiteit (VU) in Amsterdam. His primary research focus is the role of background knowledge, especially commonsense knowledge, in filling gaps in human communication. Filip’s research has been published at top-tier venues like AAAI, EMNLP, and ISWC. Filip gave tutorials on commonsense knowledge at AAAI’21 and ISWC’20.

Daniel Garijo (dgarijo@isi.edu) is a Researcher at the Ontology Engineering Group at the Universidad Politécnica de Madrid (UPM), where he obtained his PhD. Before joining UPM, Daniel was a researcher at the Information Sciences Institute at the University of Southern California. His research is focused on using Semantic Web and Linked Data to facilitate the reuse and understanding of scientific workflows and software. Daniel has experience in presenting tutorials at international conferences such as Dublin Core and AAAI and universities such as Stanford, UCLA and USC.

Hans Chalupsky (hans@isi.edu) is a Research Lead at USC’s Information Sciences Institute where he heads the Loom Knowledge Representation and Reasoning Group. His research focuses on the design, development and application of practical knowledge representation and reasoning systems. He is a principal architect and developer of the PowerLoom KR&R system, which combines over ten years of DARPA-funded development and has been distributed to many sites worldwide. Dr. Chalupsky is also the principal architect of the KOJAK Link Discovery System whose Group Finder has been ranked first in several formal DARPA evaluations, and whose UNICORN system for anomaly detection in large knowledge graphs was awarded second place in the Open Task of the 2003 KDD Cup. His research interests include KR&R systems, KGs, semantic interoperability and neuro-symbolic reasoning systems.

Pedro Szekely (pszekely@isi.edu) is Principal Scientist and Director of the AI division at the University of Southern California's Information Sciences Institute, and Research Associate Professor of Computer Science. Dr. Szekely's current research focuses on table understanding and toolkits for creating and exploiting KGs in AI applications. Dr. Szekely teaches a graduate course on Building Knowledge Graphs, and has given tutorials on knowledge graph construction at KDD, ISWC, AAAI and WWW. Dr. Szekely has published over 100 papers in prestigious conferences, served as program chair for the International Knowledge Capture conference, and as conference chair for the Intelligent User Interfaces Conference.

Gleb Satyukov (gleb@isi.edu) is a full-time Research Programmer at USC's Information Sciences Institute. His academic background is primarily in computational linguistics and media technology, focusing most recently on human-computer interaction. Covering both back-end and front-end development, his professional expertise ranges from user experience and interaction design to the architecture of distributed systems at scale. At ISI, Gleb works on the design, development, and maintenance of online infrastructure to facilitate ongoing research projects, such as the Knowledge Graph Toolkit. As part of the KGTK team, Gleb builds intuitive user interfaces for visualizing knowledge graphs, diving into the data, and finding the insights hidden within.

Amandeep Singh (amandeep@isi.edu) is a full-time Research Programmer at USC's Information Sciences Institute.