The vast amounts of ontologically unstructured information on the Web, including semi-structured HTML, XML and JSON documents, natural language documents, tweets, blogs, markups, and even structured documents like CSV tables, all contain useful knowledge that can present a tremendous advantage to Semantic Web researchers if extracted robustly, efficiently and semi-automatically as an RDF knowledge graph. Domain-specific Knowledge Graph Construction (KGC) is an active research area that has recently witnessed impressive advances due to machine learning techniques like deep neural networks and word embeddings. This tutorial will synthesize and present KGC techniques, especially information extraction (IE), in a manner that is accessible to Semantic Web researchers. The presenters will draw on their experience as instructors and Semantic Web researchers, as well as lessons from actual IE implementations, to accomplish this purpose through visually intuitive and example-driven slides, accessible, high-level overviews of related work, instructor demos, and at least five participatory IE activities that attendees will be able to set up on their laptops.
Most Semantic Web research assumes that knowledge has already been extracted from raw data and serialized to an appropriate format like RDF. However, Knowledge Graph Construction (KGC), particularly Information Extraction (IE), comprises a set of complex research subjects that have seen many impressive advances in the last decade, especially with the advent of deep neural networks. This is a timely opportunity for Semantic Web researchers to explore domain-specific KGC methods that can be usefully and semi-automatically applied to raw Web data.
The primary agenda of this tutorial is to cover the landscape of domain-specific KGC, especially IE, in a manner that will be practical and useful to Semantic Web researchers. Specific questions we will seek to answer through our presentation and activities are: What is domain-specific knowledge graph construction, and why is information extraction so important? What are the strengths and caveats of major existing IE methods? Which methods are most promising for achieving the goals of our community? What issues (e.g., scalability and complexity) must we keep in mind as we implement such methods? Our tutorial will be practical, and will answer these questions in a focused, foundational manner that all Semantic Web researchers will be able to follow easily.
(1) An overview of KGC and IE principles, and a basic comparison to related sub-areas like Named Entity Recognition and Relation Recognition.
(2) An overview of KGC and IE techniques, including techniques applicable to structured, semi-structured and unstructured data sources.
(3) Hands-on activities involving actual implementations, possible issues with ad-hoc implementations, and promising areas for future work.
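To give a flavor of the simplest kind of technique in the agenda above, the following is a minimal sketch of rule-based IE over a semi-structured HTML snippet, with the extracted fields written out as N-Triples. It is not taken from the tutorial material: the input markup, the `example.org` namespace and the function name `extract_triples` are all assumptions made for illustration, and the FOAF properties are just one reasonable vocabulary choice.

```python
import re

# Hypothetical semi-structured input; the markup and class names are
# invented for illustration and do not come from the tutorial.
HTML = """
<div class="person"><span class="name">Ada Lovelace</span>
<span class="email">ada@example.org</span></div>
"""

EX = "http://example.org/"            # assumed namespace for subjects
FOAF = "http://xmlns.com/foaf/0.1/"   # FOAF vocabulary namespace

# A hand-written extraction rule: a name span followed by an email span.
PATTERN = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="email">(?P<email>[^<]+)</span>'
)


def extract_triples(html):
    """Apply the rule to the HTML and return a list of N-Triples lines."""
    triples = []
    for match in PATTERN.finditer(html):
        name = match.group("name").strip()
        email = match.group("email").strip()
        # Mint a subject IRI from the name (a naive, illustrative choice).
        subject = "<" + EX + name.replace(" ", "_") + ">"
        triples.append(f'{subject} <{FOAF}name> "{name}" .')
        triples.append(f'{subject} <{FOAF}mbox> <mailto:{email}> .')
    return triples


if __name__ == "__main__":
    for line in extract_triples(HTML):
        print(line)
```

A hand-written rule like this is brittle: it breaks as soon as the page layout changes, which is one of the issues with ad-hoc implementations that item (3) refers to.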
A set of slides providing details on the topics we are covering may be found linked here.
The tutorial is intended to be informal and very hands-on. We will avoid subarea-specific symbols and terminology to the extent possible (i.e., without losing rigor or over-simplifying). Our slides will focus on visual intuition, use actual examples, present lessons learned from implementing a subset of KGC and IE methods in DIG, and will be accompanied by demos and hands-on activities that participants will be able to do without requiring extensive platform-dependent setup. We will permit questions and interactions throughout the tutorial. All three tutors will be present for most of the tutorial, but each will present their own section. At present (see People), CK will present on structured IE, MK will present on unstructured IE and PS will present on semi-structured IE.
Prior Knowledge and Prerequisites
Prior knowledge expected from participants (beyond Semantic Web concepts like RDF and ontologies) will be minimal. Some knowledge of machine learning, including basic concepts like training, testing and validation, and feature engineering, will be helpful but is not an absolute prerequisite, as we will not go into advanced machine learning math or optimization. Additionally, where possible, we will introduce basic machine learning concepts so that everyone has an opportunity to follow along. Participants are not expected to have any knowledge of Information Extraction or related fields like Named Entity Recognition.
Concerning equipment, we will be using our own computers for presenting demos and PowerPoint slides and only require equipment to facilitate such projection for an extended period of time (e.g., projector, table, power outlet). We (and also the participants) will require an internet/wifi connection to access material. There are no audio elements to our presentation. All demos and hands-on activities will be doable on a reasonable laptop by interested participants. We will also bring extra USB storage devices with copies of code, programs and slides in case some participants are not able to download the material prior to the tutorial.
Tutorial material will comprise a set of slides for each agenda item in the slides linked in Agenda. Additionally, there will be five hands-on IE activities in three time slots. The hands-on activities will require simple setup, as code will be available on GitHub and executable on a Unix-like command line; all code will be installable in less than 5 minutes using command line utilities like pip or via a platform-specific installer. All material will be open-source and made available under open BSD-like licenses; we will release slides, activities and instructions shortly before the tutorial.
References and List of Tools
The list below is meant to serve as a guiding body of work (along the lines of ‘further reading’) only, and will likely not be covered comprehensively. We may revise the list at our discretion.
Experimental comparisons and details of publicly available text extractors
Readability text extractor
Li, Yunyao, et al. "Regular expression learning for information extraction." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
Krishnamurthy, Rajasekar, et al. "SystemT: a system for declarative information extraction." ACM SIGMOD Record 37.4 (2009): 7-13.
Pantel, Patrick, et al. "Web-scale distributional similarity and entity set expansion." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
Dalvi, Bhavana Bharat, William W. Cohen, and Jamie Callan. "Websets: Extracting sets of entities from the web using unsupervised information extraction." Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 2012.
Szekely, Pedro, et al. "Building and using a knowledge graph to combat human trafficking." International Semantic Web Conference. Springer International Publishing, 2015.
Kejriwal, Mayank, and Pedro Szekely. "Information Extraction in Illicit Web Domains." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
Kapoor, Rahul, Mayank Kejriwal, and Pedro Szekely. "Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages." arXiv preprint arXiv:1704.05569 (2017).
Doan, AnHai, Raghu Ramakrishnan, and Shivakumar Vaithyanathan. "Managing information extraction: state of the art and research directions." Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006.
Bird, Steven. "NLTK: the natural language toolkit." Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics, 2006.
Gupta, Shubham, et al. "Karma: A system for mapping structured sources into the Semantic Web." Extended Semantic Web Conference. Springer Berlin Heidelberg, 2012.
Sarawagi, Sunita. "Information extraction." Foundations and Trends® in Databases 1.3 (2008): 261-377.