KDD 2017

Data Mining in Unusual Domains with Knowledge Graph Construction, Inference and Search

Abstract

The growth of the Web is a success story that has spurred much research in knowledge discovery and data mining. Data mining over Web domains that are unusual is an even harder problem. There are several factors that make a domain unusual. In particular, such domains have significant long tails and exhibit concept drift, and are characterized by high levels of heterogeneity. Notable examples of unusual Web domains include both illicit domains, such as human trafficking advertising, illegal weapons sales, counterfeit goods transactions, patent trolling and cyberattacks, and also non-illicit domains such as humanitarian and disaster relief. Data mining in such domains has the potential for widespread social impact, and is also very challenging technically. In this tutorial, we provide an overview, using demos, examples and case studies, of the research landscape for data mining in unusual domains, including recent work that has achieved state-of-the-art results in constructing knowledge graphs in a variety of unusual domains, followed by inference and search using both command line and graphical interfaces.

Agenda and Outline

A preliminary agenda for the tutorial may be accessed here. We will release additional slides, datasets, papers and software as the tutorial approaches. Some links and resources will be limited to attending participants.

Each item in the agenda will have a similar structure and will include video and live demos, where appropriate. Specifically, we will always start with examples and high-level ideas before engaging in formalism; we will consciously introduce sub-area-specific terminology if deemed necessary for a rigorous treatment. Rather than dwell on individual references, we will synthesize trends and general techniques in each sub-area (e.g. information extraction), typically based on published surveys and overviews. We will keep returning to the question of whether techniques in these areas that perform robustly in traditional domains also perform well in unusual domains. If not, why not? If so, what was (likely) the winning element?

Goals

Tutorial attendees will:

Gain a comprehensive understanding of the difficulties in mining unusual domains, which include both illicit domains (e.g., human trafficking and securities fraud) and non-illicit domains that are rarely studied but have immense social utility (e.g., humanitarian and disaster relief). We articulate case studies (from a data science perspective) in aspects of noise, scale, content, models and user desiderata that make data mining in such domains challenging and novel.

Gain an overview of an end-to-end knowledge graph-centric architecture for mining data in unusual domains. Using demos and examples, we will provide attendees with insight into how familiar (imperfectly solved) tasks like information extraction, entity resolution, clustering, graph embeddings, graphical user interfaces and entity-centric search can be integrated into a viable data mining framework.

Gain detailed insight into two extremely important, and well-studied, tasks in knowledge graph construction, namely information extraction and entity resolution. We cover why traditional research in these areas can be problematic in illicit domains, and some solutions presented by recent research.

Gain insight into two systems-level implementations of knowledge graph construction and mining that have been (very recently) evaluated on real-world data collected in unusual domains.

Audience and Prerequisites

We anticipate a diverse target audience, including practitioners and researchers with an interest in knowledge graphs, Web domains, information extraction, entity resolution, link prediction and text mining. We believe that almost anyone with an interest in a sub-area of data mining will find a corresponding research challenge in illicit domains, with real-world characteristics (e.g. scale and noise) and complexity. Interestingly, illicit domains share many characteristics with social media; we will explicitly make this connection (with real-world examples), including applicable similarities and differences, in the first 15 minutes of the tutorial. We believe this will further highlight the need for collaboration and interdisciplinary work in these areas.

Our technical focus will largely center on knowledge graphs (including construction, querying and inference); thus, some basic background in graph theory, information retrieval and machine learning will be useful. We will introduce concepts from the ground up, and will not cover intricate mathematical treatments. The goal is to allow everyone to follow from the beginning, even if they have not directly worked with knowledge graphs, Web domains and illicit data. The tutorial will rely heavily on actual examples and case studies, demos by the tutors, and visual intuition.

Related Work

Data mining in unusual domains draws on several different research areas and systems, and we will do our best to cover, and synthesize core elements of, the areas most relevant to the task at hand. Among prior work describing illicit domains (generally), especially in the Dark Web, we will cover elements of [6, 4]. Because of our extensive experience with a specific illicit domain, human trafficking, we will also cover the main conclusions in [1, 5] to highlight problems, challenges and characteristics of data. However, we will not limit ourselves only to illicit domains. For example, we have extensive experience in mining data in domains such as humanitarian and disaster relief, and we will cover those as well.

Research areas especially important to this tutorial are (high-level descriptions of) knowledge graphs, and knowledge graph (also, knowledge base) construction [14, 15], information extraction [3, 11], entity resolution [7, 2, 23] and documented implementations (esp. DIG and DeepDive [20, 19]). Our group has recent work in several of those areas [11, 12, 23]. We will briefly cover the principles of those methods (implemented in DIG) and their relevance to problems in illicit domains.

If we cover the optional material, we will draw on [17, 13, 22, 8, 21] for conceptual coverage of knowledge graph embeddings, in addition to general graph embeddings (e.g. DeepWalk [16]) that were not specifically proposed for knowledge graphs but are also applicable. For entity-centric search (ECS), we will draw on concepts best expressed in [18, 9], along with our own recent implementation that won the DARPA MEMEX challenge in the summer of 2016 and that is currently under submission at an information retrieval conference [10].

Equipment

Attendees are not required to bring a laptop to follow the tutorial, although it might be useful, especially if they want to check out the DIG codebase or otherwise access elements such as previous slides or references while we are giving the tutorial.

References

[1] H. Alvari, P. Shakarian, and J. K. Snyder. A non-parametric learning approach to identify online human trafficking. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pages 133-138. IEEE, 2016.

[2] D. G. Brizan and A. U. Tansel. A survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5, 2015.

[3] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. IEEE transactions on knowledge and data engineering, 18(10):1411-1428, 2006.

[4] H. Chen. Dark web: Exploring and data mining the dark side of the web, volume 30. Springer Science & Business Media, 2011.

[5] A. Dubrawski, K. Miller, M. Barnes, B. Boecking, and E. Kennedy. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking, 1(1):65-85, 2015.

[6] T. Fu, A. Abbasi, and H. Chen. A focused crawler for dark web forums. Journal of the American Society for Information Science and Technology, 61(6):1213-1231, 2010.

[7] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018-2019, 2012.

[8] S. Guo, Q. Wang, B. Wang, L. Wang, and L. Guo. Semantically smooth knowledge graph embedding. In ACL (1), pages 84-94, 2015.

[9] A. Harth, A. Hogan, R. Delbru, J. Umbrich, S. O'Riain, and S. Decker. Swse: Answers before links! In Proceedings of the 2007 International Conference on Semantic Web Challenge-Volume 295, pages 137-144. CEUR-WS.org, 2007.

[10] M. Kejriwal and P. Szekely. Entity-centric search in the human trafficking domain (under review). In SIGIR. ACM, 2017.

[11] M. Kejriwal and P. Szekely. Information extraction in illicit domains. In World Wide Web. ACM, 2017.

[12] M. Kejriwal, P. Szekely, and C. Knoblock. Investigative knowledge discovery for combating illicit activities (to appear). In Intelligent Systems Magazine. IEEE, 2017.

[13] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181-2187, 2015.

[14] F. Niu, C. Zhang, C. Re, and J. Shavlik. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems (IJSWIS), 8(3):42-73, 2012.

[15] F. Niu, C. Zhang, C. Re, and J. W. Shavlik. Deepdive: Web-scale knowledge-base construction using statistical learning and inference. VLDS, 12:25-28, 2012.

[16] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701-710. ACM, 2014.

[17] P. Ristoski and H. Paulheim. Rdf2vec: Rdf graph embeddings for data mining. In International Semantic Web Conference, pages 498-514. Springer, 2016.

[18] P. Saleiro, J. Teixeira, C. Soares, and E. Oliveira. Timemachine: Entity-centric search and visualization of news archives. In European Conference on Information Retrieval, pages 845-848. Springer, 2016.

[19] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Re. Incremental knowledge base construction using deepdive. Proceedings of the VLDB Endowment, 8(11):1310-1321, 2015.

[20] P. Szekely, C. A. Knoblock, J. Slepicka, A. Philpot, A. Singh, C. Yin, D. Kapoor, P. Natarajan, D. Marcu, K. Knight, et al. Building and using a knowledge graph to combat human trafficking. In International Semantic Web Conference, pages 205-221. Springer, 2015.

[21] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly embedding. In EMNLP, volume 14, pages 1591-1601. Citeseer, 2014.

[22] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, pages 1112-1119. Citeseer, 2014.

[23] L. Zhu, M. Ghasemi-Gol, P. Szekely, A. Galstyan, and C. A. Knoblock. Unsupervised entity resolution on multi-type graphs. In International Semantic Web Conference, pages 649-667. Springer, 2016.

PEOPLE