Knowledge Graphs (KGs) like Wikidata, NELL and DBpedia have recently played instrumental roles in several machine learning applications, including search and information retrieval, information extraction, and data mining. Constructing knowledge graphs is a difficult problem that is typically studied for natural language documents. With the rapid rise in Web data, there are interesting opportunities to construct domain-specific knowledge graphs over corpora that have been crawled or acquired through techniques like focused crawling. In this tutorial, we survey the techniques for knowledge graph construction from domain-specific Web corpora. Topics that we cover include Web-focused technology like state-of-the-art wrappers; standard information extractors like conditional random fields, NLP rules and glossaries; and more advanced techniques for cleaning the constructed KG, such as probabilistic soft logic and knowledge graph embeddings. This tutorial will have hands-on components. It is designed for a general AI practitioner, whether from research or industry. Participants will only be expected to have basic knowledge of machine learning. Some experience with Python and UNIX-style commands will also be useful.
Topics and slides
The primary agenda of this tutorial is to cover the landscape of domain-specific knowledge graph construction (KGC), especially information extraction (IE), in a manner that will be practical and applied, while still presenting interesting opportunities for future research. Specific questions we will seek to answer through our presentation and activities are: What is domain-specific knowledge graph construction, and why is information extraction so important (and difficult)? What are the strengths and caveats of major existing IE methods? Which methods are most promising for achieving good results on text and Web data that is noisy, i.e., contains artifacts like HTML tags? What issues (e.g., scalability and complexity) must we keep in mind as we implement such methods? Our tutorial will seek to answer these questions in a focused, foundational manner that all researchers and practitioners with some background in AI will be able to follow. The tutorial will be divided into several sections, as described below:
Introduction on the domain-specific search (DSS) problem, and why knowledge graphs are useful for DSS [PDF Slides]
Knowledge graph construction (KGC) for the short tail, and semantic typing [PDF Slides] [PPTX Slides]
Demo of Inferlink wrapper + KGC for the long tail, and robustly searching noisy KGs + Demo of DIG DSS [PDF Slides]
Introduction to Knowledge Graph Completion, and Entity Resolution [PDF Slides]
Review [PDF Slides]
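As a small illustration of the kind of noisy Web text the sections above deal with, the following is a minimal sketch of glossary-based extraction from an HTML fragment, using only the Python standard library. It is not taken from the tutorial materials: the helper names `strip_html` and `glossary_extract`, the sample page, and the city glossary are all hypothetical, and real pipelines would typically use more robust text extractors and matching.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes; drops tags and <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def glossary_extract(text, glossary):
    """Return (term, start, end) for each case-insensitive whole-word match."""
    hits = []
    for term in glossary:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])


# Hypothetical page fragment and glossary, for illustration only.
page = "<div><b>Visiting</b> Los Angeles<br>and <i>san diego</i> next week</div>"
cities = ["Los Angeles", "San Diego", "Boston"]

text = strip_html(page)
matches = glossary_extract(text, cities)
for term, start, end in matches:
    print(term, start, end)
```

Exact glossary matching is deliberately simple here; it misses misspellings and obfuscated text, which is one reason the more advanced extraction and cleaning techniques listed above are of interest.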
Key Learning Outcomes
(1) An overview of KGC and IE principles, and why domain-specific KGC is both important and difficult;
(2) Instructor demos of state-of-the-art tools, and promising areas for future work;
(3) Knowledge Graph Completion, including a detailed look at the Entity Resolution problem.
References, Links and List of Tools
The list below is meant to serve as a guiding body of work (along the lines of ‘further reading’) only, and will likely not be covered comprehensively. We may revise the list at our discretion.
Experimental comparisons and details of publicly available text extractors
Readability text extractor
Li, Yunyao, et al. "Regular expression learning for information extraction." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
Krishnamurthy, Rajasekar, et al. "SystemT: a system for declarative information extraction." ACM SIGMOD Record 37.4 (2009): 7-13.
Pantel, Patrick, et al. "Web-scale distributional similarity and entity set expansion." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 2009.
Dalvi, Bhavana Bharat, William W. Cohen, and Jamie Callan. "Websets: Extracting sets of entities from the web using unsupervised information extraction." Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 2012.
Szekely, Pedro, et al. "Building and using a knowledge graph to combat human trafficking." International Semantic Web Conference. Springer International Publishing, 2015.
Kejriwal, Mayank, and Pedro Szekely. "Information Extraction in Illicit Web Domains." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
Kapoor, Rahul, Mayank Kejriwal, and Pedro Szekely. "Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages." arXiv preprint arXiv:1704.05569 (2017).
Doan, AnHai, Raghu Ramakrishnan, and Shivakumar Vaithyanathan. "Managing information extraction: state of the art and research directions." Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006.
Bird, Steven. "NLTK: the natural language toolkit." Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics, 2006.
Gupta, Shubham, et al. "Karma: A system for mapping structured sources into the Semantic Web." Extended Semantic Web Conference. Springer Berlin Heidelberg, 2012.
Sarawagi, Sunita. "Information extraction." Foundations and Trends® in Databases 1.3 (2008): 261-377.