Entity Extractors

For Domain-Specific Insight Graphs

Easily extract features from free text using Amazon Mechanical Turk





MTurk hits can be run in the sandbox (developer) or live environment. Hits can be created with or without qualifications. All the possibilities are explained below.


We will be defining an extraction, which is named; the name forms the identifier for this effort. This identifier must be chosen as a string that can safely be used as a directory component, with only lower case letters, digits, and the underscore character. For example, we might define a catalog extraction for getting information from online stores, or a makemodel extraction for extracting attributes of automobiles for sale.

Within our extraction we will be defining one or more categories. A category is a class of entities to be recognized and extracted. For example, in the catalog extractor, we might define categories price, description, and modelnumber. The category names should be simple identifiers with only lower case letters, digits, and the underscore character. Each category also has an English label, suitable for presentation to users, which can include spaces and upper case letters. For example, in the catalog extractor, we might define the category labels as "Price", "Item Description", and "Model Number".

We will define a corpus of example texts to train and test our extractors. These are conventionally called sentences, but they may be fragments of a sentence, multiple sentences, etc., as long as they can be annotated by users in the interface without scrolling. These sentences should contain a fair number of instances of the categories of this extraction.
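As an illustration (not part of the toolchain), a one-sentence-per-line corpus could be prepared from raw text with a short script. The naive punctuation-based split and the length cutoff here are assumptions; a real pipeline might use a proper tokenizer such as nltk, which is installed below:

```python
# Illustrative sketch: split raw text into a one-sentence-per-line corpus.
# The punctuation-based split is a simplifying assumption, not the
# project's actual preprocessing.
import re

def to_sentences(raw_text):
    # Split on sentence-ending punctuation followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    # Keep only fragments short enough to annotate without scrolling.
    return [p for p in parts if p and len(p) < 300]

if __name__ == '__main__':
    raw = "2004 Honda Civic for sale. Great condition! Call 555-0100."
    for sentence in to_sentences(raw):
        print(sentence)
```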

Mechanical Turk microwork tasks are conventionally termed hits and we will use that terminology as well.

Some Mechanical Turk extractions will benefit from having the workers undergo training and vetting. This project includes capabilities to define and populate an optional special task, called a qualification task, to carry this out.


Installation

  1. Install git, python 2.7, java 1.7, and maven 3
  2. Establish a github account
  3. Install Web-Karma and dig-mturk in your github repository root folder.
  4. Install the following Python libraries:
    • pip install requests
    • pip install boto
    • pip install nltk
    • pip install awscli


Environment Setup

  1. Your home directory is $HOME.
  2. You will want to edit your shell init file, typically ~/.bashrc, ~/.profile, or ~/.bash_profile
  3. Edit your shell init file (adding at the bottom will work well) to contain the following:

     export GITHUB=<your github repository root folder>
     export PATH="$PATH:$GITHUB/dig-mturk/generic/scripts"

  4. source <shell init file>
  5. cd $GITHUB/dig-mturk; git checkout master; git pull; cd mturk; mvn clean install
  6. cd $GITHUB/Web-Karma; git checkout development; git pull; mvn clean install

AWS Configuration

  1. To configure for Amazon AWS, you will need an account with AWS username, AWS Access Key Id, and AWS Secret Access Key; and an Amazon S3 bucket that is writable using those credentials.
  2. (At ISI you can obtain this information from Andrew or Suba)
  3. aws configure
    • AWS Access Key ID = key id obtained
    • AWS Secret Access Key = secret key obtained
    • Default region name = us-west-2
    • Default output format = json
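The aws configure command stores these values in files under ~/.aws; the result should look roughly like this (values shown are placeholders):

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = <key id obtained>
aws_secret_access_key = <secret key obtained>

# ~/.aws/config
[default]
region = us-west-2
output = json
```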


Creating a New Extraction

  1. cd $GITHUB/dig-mturk
  2. For convenience, export EXTRACTION=<extraction>
  3. newExtraction.sh $EXTRACTION. This will create $GITHUB/dig-mturk/extractions/$EXTRACTION and copy/adapt a few files from $GITHUB/dig-mturk/extractions/boilerplate/ to $GITHUB/dig-mturk/extractions/$EXTRACTION/
  4. Edit these copied configuration files using your preferred text editor to reflect your extraction name, categories, and category labels:
    • sentence data:
      • If you are providing sentence data from a plain text file, edit $GITHUB/dig-mturk/extractions/$EXTRACTION/sentences.txt to contain the sentence data, one sentence per line.
      • If you are obtaining sentence data from an ElasticSearch index, edit $GITHUB/dig-mturk/extractions/$EXTRACTION/fetchSentences.sh and/or $GITHUB/dig-mturk/extractions/$EXTRACTION/query.json to specify the index name, desired hit count, credentials, query fields, and query path. Then execute the script:
        fetchSentences.sh $EXTRACTION $TRIAL $SIZE where $SIZE is the number of sentences to be fetched from ES. This will generate $GITHUB/dig-mturk/extractions/$EXTRACTION/sentences.json.
    • $GITHUB/dig-mturk/extractions/$EXTRACTION/create.cfg is a configuration file with parameters defining how Mechanical Turk tasks are created.
      In the [create] section, edit parameters to specify the desired number of hits to be created, the number of sentences in each hit, etc. If your data comes from ElasticSearch, be sure the [field] parameter specifies how data is obtained from each ElasticSearch result 'hit'. In the [boto] section, edit the parameters to specify the AWS/S3 configurations specified above.
    • $GITHUB/dig-mturk/extractions/$EXTRACTION/hitdata.pyfmt is a template file specifying details of Mechanical Turk hits. For this file, you should edit the task description, title, and categories. You should leave the scratch_category as it is. You may wish to edit the keywords. You should leave the bracketed expressions {instructions}, {qualifications} and {sentences} as they are: they will be filled in by the page generator.
    • The files $GITHUB/dig-mturk/extractions/$EXTRACTION/head.html, $GITHUB/dig-mturk/extractions/$EXTRACTION/instructions.html, and $GITHUB/dig-mturk/extractions/$EXTRACTION/tail.html will be combined to become an instructions page for the users.
      • $GITHUB/dig-mturk/extractions/$EXTRACTION/head.html: no change.
      • $GITHUB/dig-mturk/extractions/$EXTRACTION/instructions.html: Change the sample content and the categories to conform to your categories and likely data. There are five necessary modifications in $GITHUB/dig-mturk/extractions/$EXTRACTION/instructions.html. Each is introduced by a comment suggesting what needs to be changed.
      • $GITHUB/dig-mturk/extractions/$EXTRACTION/tail.html: Change the $GITHUB/dig-mturk/extractions/$EXTRACTION/tail.html if you consider the standard boilerplate text confusing in context when presented to a user as part of the instructions.
      • You can inspect the look of the instructions by performing: cd $GITHUB/dig-mturk/extractions/$EXTRACTION; cat head.html instructions.html tail.html > page.html; open page.html
    • The $GITHUB/dig-mturk/extractions/$EXTRACTION/karma/ folder contains information so that the Karma information integration system can transform the Mechanical Turk workers' outputs into a format useful for training a CRF model.
      • Edit $GITHUB/dig-mturk/extractions/$EXTRACTION/karma/preloaded-ontologies/mturk-ontology.ttl: Define the categories as relations (see embedded example) and use the labels as the rdfs:label.
      • Edit $GITHUB/dig-mturk/extractions/$EXTRACTION/karma/python/mturk.py: In the definition of data structure categoryToAnnotationClass at the top of the file, insert a mapping between your categories and their labels.
    • The $GITHUB/dig-mturk/extractions/$EXTRACTION/qualification/ folder contains specifications and files used to specify a qualification task. (See below.)
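For orientation, a create.cfg might look roughly like the sketch below. The section names [create] and [boto] and the field parameter come from the description above; the remaining key names and values are assumptions for illustration, so defer to the boilerplate file you copied:

```ini
[create]
; number of hits to create and sentences per hit (key names are assumptions)
num_hits = 10
sentences_per_hit = 5
; for ElasticSearch data: where in each ES result 'hit' the text lives
field = _source.text

[boto]
; AWS/S3 settings matching the AWS configuration above (key names are assumptions)
bucket = my-mturk-bucket
access_key_id = <AWS Access Key Id>
secret_access_key = <AWS Secret Access Key>
```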


The Qualification Task

    The qualification task is a small test in which the user practices annotating examples before performing the real, paid tasks. The objective is to teach by practice: for each wrong answer, the user receives feedback explaining how to answer correctly. Once the user has successfully completed the qualification task, he/she is allowed to continue to the actual tasks. A user who does not pass the qualification task can try again after a specified delay.

    If an extraction requires qualification, the qualification task has to be generated before creating hits. A user wishing to perform a hit belonging to your extraction will first be directed to satisfy the qualification if you have specified one.

    To generate a qualification task, you need the following:

    1. Name of your qualification. Typically, we name the qualification the same as the extraction it supports. So herein we will use $EXTRACTION as the name of the qualification.
    2. Categories for this extraction.
    3. Example sentences.
    4. The answers for those sentences.
    5. Explanations of the answers (used to provide feedback during qualification testing).

    Steps to create qualification:

    1. newQualification.sh $EXTRACTION. This will copy config JSON from $GITHUB/dig-mturk/extractions/boilerplate/qualification/ to $GITHUB/dig-mturk/extractions/$EXTRACTION/qualification/qual_${EXTRACTION}.json. It will also generate qualification.cfg.
    2. Edit $GITHUB/dig-mturk/extractions/$EXTRACTION/qualification.cfg to reflect your organization name, Amazon S3 bucket name, Amazon access_key_id and Amazon secret_access_key.
    3. Edit $GITHUB/dig-mturk/extractions/$EXTRACTION/qual_${EXTRACTION}.json to reflect categories, sentences, answers, etc. to define your qualification domain. Note: the JSON key explanation.wrong needn't be edited. The script will populate this field as required before generating HTML.
      1. categories [label]: list the attributes to be selected, specifying both the category name (simple identifier) and label (English string)
      2. total-questions: set the total number of questions in the test
      3. correct-questions: set the number of questions the user has to answer correctly to be approved
      4. title: change the company's name
      5. subtitle: change the category names
      6. average-time: the average time a typical user spends finishing the test
      7. help [title]: change the category names
      8. help [html]: paste HTML code with the explanation/image that helps the user figure out the correct way to select the words for this domain
      9. aprove-message [html]: change the number to the number of questions the user has to answer correctly

      For each sentence, change:

      1. id: should be in increasing order
      2. text: the complete sentence to be analyzed
      3. annotations: set "yes" if there are selections to be made in the sentence and "no" if there is no word to select in the text
      4. select: specify which categories should be selected in this question (all the categories or just one; for more than one, separate the categories with commas)
      5. answer [category]: for each category, list all the words that should be selected. If there is more than one, separate the answers with \t. If there is no answer for this category, set it to "".
      6. explanation [wrong] [text]: write the explanation of why the sentence has to be annotated as the answer specifies. The explanation text is shown when the user selects a wrong answer for the question.

    4. generateQualification.sh $EXTRACTION --publish --upload. This will generate and publish the qualification page HTML and assets to Amazon S3, based on the qual_${EXTRACTION}.json data.
    5. generateQualification.sh $EXTRACTION --render (optional) can be used to preview the qualification test sentences and correction messages (the message presented to the user after an incorrect response).
    6. Deploy the qualification to MTurk:
      • deployQualification.sh -sandbox $GITHUB/dig-mturk/extractions/$EXTRACTION/qualification.cfg to create the qualification in the sandbox. This will update qualification.cfg with the qualification ID generated by Amazon for sandbox trials of the qualification test.
      • deployQualification.sh -live $GITHUB/dig-mturk/extractions/$EXTRACTION/qualification.cfg to create the qualification in the live environment. This will update qualification.cfg with the qualification ID generated by Amazon for live use by any interested Turkers.
    7. (Optional) Create a cron job that executes $GITHUB/dig-mturk/mturk/src/main/python/respondQualification.py every 5 minutes. This job approves/rejects all qualification requests.
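To make the per-sentence schema in step 3 concrete, one entry in qual_${EXTRACTION}.json for a hypothetical makemodel extraction might look like the sketch below. The keys come from the list above, but the exact nesting is an assumption, so defer to the copied boilerplate:

```json
{
  "id": 1,
  "text": "2004 Honda Civic, one owner, $4500 OBO",
  "annotations": "yes",
  "select": "make,model",
  "answer": {
    "make": "Honda",
    "model": "Civic"
  },
  "explanation": {
    "wrong": {
      "text": "Honda is the make and Civic is the model; the price is not requested in this question."
    }
  }
}
```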


Creating Sandbox Hits

    You may want to edit $GITHUB/dig-mturk/extractions/$EXTRACTION/create.cfg to specify a small number of hits and a small number of sentences per hit.

    1. For sandbox, choose a trial name (e.g., trial01, sandbox01). For convenience export TRIAL=<trial>
    2. mkdir $GITHUB/dig-mturk/extractions/$EXTRACTION/$TRIAL
    3. createHits.sh sandbox $EXTRACTION $TRIAL. This will create config json for hits and upload it to S3 at extractions/$EXTRACTION/$TRIAL/config/.
    4. deployHits.sh -sandbox $EXTRACTION $TRIAL. This will create all the files required to create a HIT, publish those to $BUCKET/extractions/$EXTRACTION/$TRIAL/hits/ in S3, and yield a few sandbox URLs, which you can annotate yourself to test things out. If the hit was created requiring a qualification, you need to take the qualification test and be qualified before you are allowed to do the hits, even for sandbox deployment.
    5. When you are satisfied with the look/feel of the tasks, execute hitResults.sh -sandbox $EXTRACTION $TRIAL. This will approve the unapproved assignments of the hits and create tab separated result files in S3 in $BUCKET/extractions/$EXTRACTION/$TRIAL/hits/.


Creating Live Hits

    When your task is well formed, you are ready to create live hits for the Mechanical Turk community to perform for you.

    You may want to edit $GITHUB/dig-mturk/extractions/$EXTRACTION/create.cfg to significantly increase the number of hits and/or to standardize the number of sentences/hit.

    1. For live, choose a trial name (e.g., trial01, live01). If you want previous sandbox hits to be retained, be sure to use a different trial name. For convenience export TRIAL=<trial>
    2. mkdir $GITHUB/dig-mturk/extractions/$EXTRACTION/$TRIAL
    3. createHits.sh $EXTRACTION $TRIAL. This will create config json for hits and upload it to S3 at extractions/$EXTRACTION/$TRIAL/config/.
    4. deployHits.sh -live $EXTRACTION $TRIAL. This will create all the files required to create a HIT, publish those to extractions/$EXTRACTION/$TRIAL/hits/ in S3, and yield a few URLs, which you can annotate yourself to test things out. If the hit was created using a qualification, you need to take the qualification test and get qualified before you are allowed to do the hits.
    5. The progress of hits can be monitored in the Mechanical Turk requester interface (##NEED URL##). When a significant number of assignments have been completed, execute hitResults.sh -live $EXTRACTION $TRIAL. This will approve the unapproved assignments of the hits and create tab separated result files in $BUCKET/extractions/$EXTRACTION/$TRIAL/hits/ in S3.

    AFTER PUBLISHING HITS (common steps for both live and sandbox)

    These steps process results already fetched from hits and stored in S3, so they do not depend on the staging area (live/sandbox) of the hits.

    1. consolidateResults.sh $EXTRACTION $TRIAL (Note: no -sandbox flag, since this step is independent of MTurk.) This will consolidate the result files for all hits in a particular trial.
    2. fetchConsolidated.sh $EXTRACTION $TRIAL. This will download consolidated result file from S3.
    3. modelConsolidated.sh $EXTRACTION $TRIAL. This will apply the offline Karma model to the results fetched from S3.
    4. adjudicate.sh $EXTRACTION $TRIAL. This will compute the agreement amongst the Turkers for each annotation. Adjudicated results are in $GITHUB/dig-mturk/extractions/$EXTRACTION/$TRIAL/adjudicated_${EXTRACTION}_${TRIAL}.json
    5. adjudicated_${EXTRACTION}_${TRIAL}.json can be used to TRAIN the CRF.
    6. TBD: apply trained CRF++ extractor to data.
    7. TBD: karma model for CRF++ extracted data.
    8. TBD: integrate karma mode for CRF++.
    9. wget --no-check-certificate https://drive.google.com/uc?id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ -O crfpp-0.58.tgz
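The agreement computation inside adjudicate.sh is internal to the project, but the general idea can be illustrated with a simple majority vote over the Turkers' annotations. This sketch is not the project's actual algorithm, and the function and parameter names are hypothetical:

```python
# Illustrative majority-vote adjudication over Turker annotations.
from collections import Counter

def majority_vote(annotations, min_agreement=0.5):
    """Adjudicate one annotation slot given each Turker's answer.

    Returns (winning answer, agreement ratio), or (None, ratio) when
    no single answer reaches the min_agreement threshold.
    """
    counts = Counter(annotations)
    answer, votes = counts.most_common(1)[0]
    ratio = votes / float(len(annotations))
    return (answer if ratio >= min_agreement else None, ratio)

# Three Turkers annotated the price in the same sentence; the majority
# answer wins with 2/3 agreement.
print(majority_vote(["$4500", "$4500", "4500"]))
```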



This research is supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under contract number FA8750-14-C-0240.

1 The student Lidia Ferreira thanks the Science Without Borders Program, CAPES Scholarship - Proc. Nº 88888.030514/2013-00.