Knowledge Graphs (KGs) like Wikidata, NELL and DBpedia have recently played instrumental roles in several machine learning applications, including search and information retrieval, information extraction, and data mining. Constructing knowledge graphs is a difficult problem typically studied for natural language documents. With the rapid rise in Web data, there are interesting opportunities to construct domain-specific knowledge graphs over corpora that have been crawled or acquired through techniques like focused crawling. In this tutorial, we survey the techniques for knowledge graph construction from domain-specific Web corpora. Topics that we cover include Web-focused technology like state-of-the-art wrappers; standard information extractors like conditional random fields, NLP rules and glossaries; and more advanced techniques for cleaning the constructed KG, such as probabilistic soft logic and knowledge graph embeddings. The tutorial will have hands-on components. It is designed for a general AI practitioner, whether from research or industry. Participants are only expected to have basic knowledge of machine learning; some experience with Python and UNIX-style commands will also be useful.
The primary agenda of this tutorial is to cover the landscape of domain-specific knowledge graph construction (KGC), especially information extraction (IE), in a manner that is practical and applied while still presenting interesting opportunities for future research. Specific questions we will seek to answer through our presentation and activities are: What is domain-specific knowledge graph construction, and why is information extraction so important (and difficult)? What are the strengths and caveats of major existing IE methods? Which methods are most promising for achieving good results on text and Web data that is noisy, i.e., contains artifacts like HTML tags? What issues (e.g., scalability and complexity) must we keep in mind as we implement such methods? Our tutorial will seek to answer these questions in a focused, foundational manner that all researchers and practitioners with some background in AI will be able to follow. Some specific topics that we plan on covering (in an agenda-like format):
Introduction and basic formalism: knowledge graphs, knowledge graph construction (KGC) and the challenges (many irrelevant pages, varying structure, significant long tail) in working with Web corpora
Landscape of KGC and IE techniques most relevant to the problem
Wrappers and structured-field extractors, overview of the Inferlink extractor and Karma semantic typing system
Extracting and scraping text from Web data
Working with extracted text (1): Glossaries and entity sets
Preliminaries (brief): Conditional Random Fields (CRFs)
Preliminaries (brief): Neural embeddings, including impact and utility of algorithms like word2vec, DeepWalk…
Working with extracted text (2): Named Entity Recognition (NER)
Working with extracted text (3): Rules in spaCy
Wrap-up; hands-on activity with the myDIG KGC and search system
(1) An overview of KGC and IE principles, and a basic comparison to related sub-areas like Named Entity Recognition and Relation Recognition;
(2) An overview of KGC and IE techniques, including techniques applicable to structured, semi-structured and unstructured Web data sources;
(3) Instructor demos, with the option to replicate using open-source code and data, involving actual implementations, possible issues with ad-hoc implementations, and promising areas for future work.
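To make two of the agenda items above more concrete (extracting text from Web data, and working with glossaries), here is a minimal Python sketch that uses only the standard library. It is not the tutorial's own code and does not reflect how myDIG or any of the listed extractors work; the sample page, the city glossary, and the helper names `extract_text` and `glossary_matches` are hypothetical and chosen purely for illustration.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Strip tags from an HTML string and return the remaining text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def glossary_matches(text, glossary):
    """Return (term, start, end) for each case-insensitive whole-word match."""
    matches = []
    for term in glossary:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((term, m.start(), m.end()))
    return sorted(matches, key=lambda t: t[1])


# Hypothetical page and glossary, for illustration only.
SAMPLE_HTML = """
<html><head><style>p {color: red;}</style></head>
<body><h1>Listing</h1><p>Located in <b>Los Angeles</b>, near San Diego.</p>
<script>var x = "Boston";</script></body></html>
"""
CITY_GLOSSARY = ["Los Angeles", "San Diego", "Boston"]

text = extract_text(SAMPLE_HTML)
found = [term for term, _, _ in glossary_matches(text, CITY_GLOSSARY)]
print(text)
print(found)
```

In this sketch "Boston" appears only inside a script block, so it is dropped during text extraction and never reaches the glossary step. Real Web corpora typically need more robust extractors and some disambiguation of glossary hits, which is where the other techniques on the agenda come in.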
(Non-exhaustive) References and List of Tools
The list below is meant to serve as a guiding body of work (along the lines of ‘further reading’) only, and will likely not be covered comprehensively. We may revise the list at our discretion.
Experimental comparisons and details of publicly available text extractors
Readability text extractor
Li, Yunyao, et al. "Regular expression learning for information extraction." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
Krishnamurthy, Rajasekar, et al. "SystemT: a system for declarative information extraction." ACM SIGMOD Record 37.4 (2009): 7-13.
Pantel, Patrick, et al. "Web-scale distributional similarity and entity set expansion." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 2009.
Dalvi, Bhavana Bharat, William W. Cohen, and Jamie Callan. "Websets: Extracting sets of entities from the web using unsupervised information extraction." Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 2012.
Szekely, Pedro, et al. "Building and using a knowledge graph to combat human trafficking." International Semantic Web Conference. Springer International Publishing, 2015.
Kejriwal, Mayank, and Pedro Szekely. "Information Extraction in Illicit Web Domains." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
Kapoor, Rahul, Mayank Kejriwal, and Pedro Szekely. "Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages." arXiv preprint arXiv:1704.05569 (2017).
Doan, AnHai, Raghu Ramakrishnan, and Shivakumar Vaithyanathan. "Managing information extraction: state of the art and research directions." Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006.
Bird, Steven. "NLTK: the natural language toolkit." Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics, 2006.
Gupta, Shubham, et al. "Karma: A system for mapping structured sources into the Semantic Web." Extended Semantic Web Conference. Springer Berlin Heidelberg, 2012.
Sarawagi, Sunita. "Information extraction." Foundations and Trends® in Databases 1.3 (2008): 261-377.