Slides
Tutorial slides: [PDF].
Goals
Tutorial attendees will:
Gain a comprehensive understanding of the difficulties in mining unusual domains. These include both illicit domains (e.g., human trafficking and securities fraud) and non-illicit domains that are studied less often but have immense social utility (e.g., humanitarian and disaster relief). We present case studies (from a data science perspective) on the aspects
of noise, scale, content, models and user desiderata that make data mining in
such domains challenging and novel.
Gain an overview of an end-to-end knowledge graph-centric architecture
for mining data in unusual domains. Using demos and examples, we will provide
attendees with insight into how familiar (imperfectly solved) tasks like information extraction, entity resolution, clustering, graph embeddings, graphical user
interfaces and entity-centric search can be integrated into a viable data mining
framework.
Gain detailed insight into two extremely important, and well-studied, tasks
in knowledge graph construction, namely information extraction and entity resolution. We cover why traditional research in these areas can be problematic in
illicit domains, and some solutions presented by recent research.
Gain insight into two systems-level implementations of knowledge graph
construction and mining that have been (very recently) evaluated on real-world
data collected in unusual domains.
Audience and Prerequisites
We anticipate a diverse target audience, including practitioners and researchers
with an interest in knowledge graphs, Web domains, information extraction, entity resolution, link prediction and text mining. We believe that almost anyone
with an interest in a sub-area of data mining will find a corresponding research
challenge in illicit domains, with real-world characteristics (e.g. scale and noise)
and complexity. Interestingly, illicit domains share many characteristics with
social media; we will explicitly make this connection (with real-world examples),
including applicable similarities and differences, in the first 15 minutes of the
tutorial. We believe this will further highlight the need for collaboration and
interdisciplinary work in these areas.
Our technical focus will largely center on knowledge graphs (including construction, querying and inference); thus, some basic background on graph theory, information retrieval and machine learning will be useful. We will introduce
concepts from the ground up, and will not cover intricate mathematical treatments. The goal is to allow everyone to follow from the beginning, even if they
have not directly worked with knowledge graphs, Web domains and illicit data.
The tutorial will rely heavily on actual examples and case studies, demos by the
tutors, and visual intuition.
Related Work
Data mining in unusual domains draws on
several different research areas and systems, and we will do our best to provide
coverage on, and synthesize core elements of, the areas most relevant to the task
at hand. Among prior work describing illicit domains (generally), especially in
the Dark Web, we will cover elements of [6, 4]. Because of our extensive
experience with a specific illicit domain, human trafficking, we will also cover the
main conclusions in [1, 5] to highlight problems, challenges and characteristics
of data. However, we will not limit ourselves to illicit domains. For example, we have extensive experience in mining data in domains such as humanitarian and disaster relief, and we will cover those as well.
Research areas especially important to this tutorial are (high-level descriptions of) knowledge graphs, and knowledge graph (also, knowledge base)
construction [14, 15], information extraction [3, 11], entity resolution [7, 2, 23]
and documented implementations (esp. DIG and DeepDive [20, 19]). Our group
has recent work in several of those areas [11, 12, 23]. We will briefly cover the
principles of those methods (implemented in DIG) and their relevance to problems
in illicit domains.
If we cover the optional material, references that we will draw upon for
conceptual coverage of knowledge graph embeddings are [17, 13, 22, 8, 21], in
addition to (not necessarily knowledge-) graph embeddings (e.g. DeepWalk [16])
that were not specifically proposed for knowledge graphs but are also applicable.
For entity-centric search (ECS), we will draw on concepts best expressed in [18,
9], along with our own recent implementation that won the DARPA MEMEX
challenge in the summer of 2016 and that is currently under submission at an
information retrieval conference [10].
Equipment
Attendees are not required to bring a laptop to follow the tutorial, although
it might be useful, especially if they want to check out the DIG codebase or
otherwise access elements such as previous slides or references while we are
giving the tutorial.
References
[1] H. Alvari, P. Shakarian, and J. K. Snyder. A non-parametric learning
approach to identify online human trafficking. In Intelligence and Security
Informatics (ISI), 2016 IEEE Conference on, pages 133-138. IEEE, 2016.
[2] D. G. Brizan and A. U. Tansel. A survey of entity resolution and record
linkage methodologies. Communications of the IIMA, 6(3):5, 2015.
[3] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web
information extraction systems. IEEE Transactions on Knowledge and Data
engineering, 18(10):1411-1428, 2006.
[4] H. Chen. Dark web: Exploring and data mining the dark side of the web,
volume 30. Springer Science & Business Media, 2011.
[5] A. Dubrawski, K. Miller, M. Barnes, B. Boecking, and E. Kennedy. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking, 1(1):65-85, 2015.
[6] T. Fu, A. Abbasi, and H. Chen. A focused crawler for dark web forums.
Journal of the American Society for Information Science and Technology,
61(6):1213-1231, 2010.
[7] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice &
open challenges. Proceedings of the VLDB Endowment, 5(12):2018-2019,
2012.
[8] S. Guo, Q. Wang, B. Wang, L. Wang, and L. Guo. Semantically smooth
knowledge graph embedding. In ACL (1), pages 84-94, 2015.
[9] A. Harth, A. Hogan, R. Delbru, J. Umbrich, S. O'Riain, and S. Decker.
SWSE: Answers before links! In Proceedings of the 2007 International Conference on Semantic Web Challenge-Volume 295, pages 137-144. CEUR-WS.org, 2007.
[10] M. Kejriwal and P. Szekely. Entity-centric search in the human trafficking
domain (under review). In SIGIR. ACM, 2017.
[11] M. Kejriwal and P. Szekely. Information extraction in illicit domains. In World Wide Web. ACM, 2017.
[12] M. Kejriwal, P. Szekely, and C. Knoblock. Investigative knowledge discovery for combating illicit activities (to appear). In Intelligent Systems
Magazine. IEEE, 2017.
[13] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation
embeddings for knowledge graph completion. In AAAI, pages 2181-2187,
2015.
[14] F. Niu, C. Zhang, C. Re, and J. Shavlik. Elementary: Large-scale
knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems
(IJSWIS), 8(3):42-73, 2012.
[15] F. Niu, C. Zhang, C. Re, and J. W. Shavlik. DeepDive: Web-scale
knowledge-base construction using statistical learning and inference.
VLDS, 12:25-28, 2012.
[16] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social
representations. In Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 701-710. ACM,
2014.
[17] P. Ristoski and H. Paulheim. RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference, pages 498-514. Springer,
2016.
[18] P. Saleiro, J. Teixeira, C. Soares, and E. Oliveira. TimeMachine: Entity-centric
search and visualization of news archives. In European Conference
on Information Retrieval, pages 845-848. Springer, 2016.
[19] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Re. Incremental
knowledge base construction using DeepDive. Proceedings of the VLDB
Endowment, 8(11):1310-1321, 2015.
[20] P. Szekely, C. A. Knoblock, J. Slepicka, A. Philpot, A. Singh, C. Yin,
D. Kapoor, P. Natarajan, D. Marcu, K. Knight, et al. Building and using a
knowledge graph to combat human trafficking. In International Semantic
Web Conference, pages 205-221. Springer, 2015.
[21] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly
embedding. In EMNLP, volume 14, pages 1591-1601. Citeseer, 2014.
[22] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by
translating on hyperplanes. In AAAI, pages 1112-1119. Citeseer, 2014.
[23] L. Zhu, M. Ghasemi-Gol, P. Szekely, A. Galstyan, and C. A. Knoblock.
Unsupervised entity resolution on multi-type graphs. In International Semantic Web Conference, pages 649-667. Springer, 2016.