ETK: Information Extraction Toolkit
ETK is a Python library for high-precision information extraction from many document formats. It provides a flexible framework of composable extractors, so you can combine the predefined extractors that ship with ETK with custom extractors you develop for your application. It supports extraction from HTML pages, text documents, CSV and Excel files, and JSON documents. ETK is open-source software, released under the MIT license.
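To make the idea of composable extractors concrete, here is a minimal plain-Python sketch. It does not use ETK's API: the names `Extraction`, `regex_extractor`, `hashtag_extractor`, and `combine` are hypothetical and exist only for illustration. It assumes an extractor can be modeled as a function from text to a list of matches, which lets a predefined-style extractor and a custom one be merged into a single extractor.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Extraction:
    """One extracted value plus where it came from (hypothetical type, not ETK's)."""
    value: str
    start: int
    end: int
    extractor: str  # name of the extractor that produced the value


# Assumption: an extractor is any callable mapping text to a list of extractions.
Extractor = Callable[[str], List[Extraction]]


def regex_extractor(name: str, pattern: str) -> Extractor:
    """Build a reusable extractor from a regular expression."""
    compiled = re.compile(pattern)

    def extract(text: str) -> List[Extraction]:
        return [
            Extraction(m.group(0), m.start(), m.end(), name)
            for m in compiled.finditer(text)
        ]

    return extract


def hashtag_extractor(text: str) -> List[Extraction]:
    """A hand-written 'custom' extractor with the same call signature."""
    return [
        Extraction(m.group(0), m.start(), m.end(), "hashtag")
        for m in re.finditer(r"#\w+", text)
    ]


def combine(*extractors: Extractor) -> Extractor:
    """Compose several extractors into one; results are ordered by position."""

    def extract(text: str) -> List[Extraction]:
        results: List[Extraction] = []
        for extractor in extractors:
            results.extend(extractor(text))
        return sorted(results, key=lambda e: e.start)

    return extract


# A deliberately simple email pattern, good enough for the illustration only.
email_extractor = regex_extractor("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
pipeline = combine(email_extractor, hashtag_extractor)

sample = "Contact alice@example.org or bob@example.com #urgent"
extractions = pipeline(sample)
```

Because every extractor shares one signature, the combined `pipeline` is itself an extractor and can be passed to `combine` again. Each result also records which extractor produced it, a very reduced form of provenance.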
Features
- Extraction from HTML, text, CSV, Excel, JSON
- High-precision predefined extractors for common entities (dates, phones, email, cities, …)
- Extraction of microdata, schema.org and RDFa markup
- Integration with [spaCy](https://github.com/explosion/spaCy) for text processing
- Automatic identification and extraction of HTML tables containing data
- Automatic identification and extraction of time series
- Semi-automatic generation of Web wrappers
- Scalable execution and management of extraction pipelines
- Automatic provenance recording
Releases
- [Source code](https://github.com/usc-isi-i2/etk/releases)
- [Docker images](https://hub.docker.com/r/uscisii2/etk/tags/)
Getting Started
Installation (need to upload to PyPI later):
pip install -U etk
Example:
>>> import etk
Run ETK CLI:
pip install -U etk
python -m etk <command> [options]
For example:
python -m etk dummy --test "this is a test"
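If you want to drive the CLI from a Python script rather than a shell, the invocation above can be wrapped with the standard `subprocess` module. This is a sketch under stated assumptions: `build_etk_command` and `run_etk` are hypothetical helper names, not part of ETK, and the only command used is the `dummy --test` example shown above. Running it requires ETK to be installed in the current environment.

```python
import subprocess
import sys
from typing import List


def build_etk_command(command: str, *options: str) -> List[str]:
    """Return the argv for `python -m etk <command> [options]`.

    sys.executable is used so the call runs under the same interpreter
    (and virtual environment) as the calling script.
    """
    return [sys.executable, "-m", "etk", command, *options]


def run_etk(command: str, *options: str) -> subprocess.CompletedProcess:
    """Run the ETK CLI and capture its output (hypothetical helper).

    check=False leaves it to the caller to inspect the return code.
    """
    return subprocess.run(
        build_etk_command(command, *options),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Equivalent to: python -m etk dummy --test "this is a test"
    result = run_etk("dummy", "--test", "this is a test")
    print(result.returncode)
    print(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems: the string "this is a test" reaches the CLI as a single argument without extra quotes.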
API Reference
- Extractors
- Bitcoin Address Extractor
- Cryptographic Hash Extractor
- CVE Extractor
- Date Extractor
- DBpedia Spotlight Extractor
- Decoding Value Extractor
- Email Extractor
- Excel Extractor
- Glossary Extractor
- Hostname Extractor
- HTML Content Extractor
- HTML Metadata Extractor
- Inferlink Extractor
- IP Address Extractor
- Language Identification Extractor
- Regular Expression Extractor
- Sentence Extractor
- spaCy NER Extractor
- spaCy Rule Extractor
- Table Extractor
- URL Extractor