Date Extractor

class etk.extractors.date_extractor.DateExtractor(etk: etk.etk.ETK = None, extractor_name: str = 'date extractor')

Bases: etk.extractor.Extractor

Description
This extractor predefines a rich set of datetime regular expressions that detect a wide range of timestamp formats in the input text. In addition, it employs spaCy rules to infer a specific datetime from a relative date expression and a relative datetime base, for instance, "two days ago" with relative datetime base 02/02/2018, which resolves to 2018-01-31.
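
To make the relative-date idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the extractor's implementation (which relies on spaCy rules); the pattern, the word table, and the helper name resolve_relative are assumptions made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Hypothetical lookup tables; the real extractor uses spaCy rules instead.
_NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
_UNIT_DAYS = {"day": 1, "week": 7}

# Matches phrases such as "two days ago" or "3 weeks ago".
_RELATIVE = re.compile(
    r"\b(\d+|one|two|three|four|five)\s+(day|week)s?\s+ago\b",
    re.IGNORECASE,
)


def resolve_relative(text, relative_base):
    """Return the datetime implied by the first '<n> days/weeks ago' phrase, or None."""
    match = _RELATIVE.search(text)
    if match is None:
        return None
    count_text = match.group(1).lower()
    unit = match.group(2).lower()
    count = int(count_text) if count_text.isdigit() else _NUMBER_WORDS[count_text]
    # Subtract the offset from the base to get an absolute datetime.
    return relative_base - timedelta(days=count * _UNIT_DAYS[unit])


base = datetime(2018, 2, 2)
resolved = resolve_relative("The package shipped two days ago.", base)
print(resolved.date().isoformat())  # 2018-01-31
```

The base date plays the same role as the relative_base argument of extract: without it, a phrase like "two days ago" has no fixed meaning.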

Examples

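from etk.extractors.date_extractor import DateExtractor

date_extractor = DateExtractor(etk=self.etk)
date_extractor.extract(text=input_doc,
                    extract_first_date_only=False,  # False: return every date found, not only the first valid one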
                    additional_formats=['%Y@%m@%d', '%a %Y, %b %d'],
                    use_default_formats=True,
                    ignore_dates_before=ignore_before,
                    ignore_dates_after=ignore_after,
                    relative_base=relative_base,
                    preferred_date_order="DMY",
                    prefer_language_date_order=True,
                    timezone='GMT',
                    to_timezone='UTC',
                    return_as_timezone_aware=False,
                    prefer_day_of_month='first',
                    prefer_dates_from='future',
                    )
extract(text: str = None, extract_first_date_only: bool = False, additional_formats: List[str] = [], use_default_formats: bool = False, ignore_dates_before: datetime.datetime = None, ignore_dates_after: datetime.datetime = None, detect_relative_dates: bool = False, relative_base: datetime.datetime = None, preferred_date_order: str = 'MDY', prefer_language_date_order: bool = True, timezone: str = None, to_timezone: str = None, return_as_timezone_aware: bool = True, prefer_day_of_month: str = 'first', prefer_dates_from: str = 'current', date_value_resolution: etk.extractors.date_extractor.DateResolution = <DateResolution.DAY: 4>) → List[etk.extraction.Extraction]
Parameters:
  • text (str) – the text to extract dates from, defaults to None
  • extract_first_date_only (bool) – whether to extract only the first valid date instead of all dates, defaults to False
  • additional_formats (List[str]) – user-defined formats for extraction, defaults to an empty list
  • use_default_formats (bool) – whether to use the default formats together with additional_formats, defaults to False
  • ignore_dates_before (datetime.datetime) – ignore dates before ‘ignore_dates_before’, defaults to None
  • ignore_dates_after (datetime.datetime) – ignore dates after ‘ignore_dates_after’, defaults to None
  • detect_relative_dates (bool) – whether to detect relative dates like ‘9 days before’, defaults to False
  • relative_base (datetime.datetime) – the base datetime from which detected relative dates are offset, defaults to None
  • preferred_date_order (enum['MDY', 'DMY', 'YMD']) – preferred date order when the date is ambiguous, defaults to ‘MDY’
  • prefer_language_date_order (bool) – whether to use the preferred date order of the text’s language, defaults to True
  • timezone (str) – the timezone to assign when the extracted date carries no timezone information, defaults to None
  • to_timezone (str) – convert all extracted dates to this timezone, defaults to None
  • return_as_timezone_aware (bool) – whether the returned datetimes are timezone-aware, defaults to True
  • prefer_day_of_month (enum['first', 'current', 'last']) – which day of the month to use when the date has no ‘day’, defaults to ‘first’
  • prefer_dates_from (enum['past', 'current', 'future']) – which date to prefer when the text gives little information (e.g. only a month), defaults to ‘current’
  • date_value_resolution (enum[DateResolution.SECOND, DateResolution.MINUTE, DateResolution.HOUR, DateResolution.DAY, DateResolution.MONTH, DateResolution.YEAR]) – the resolution used when converting to an ISO format string, defaults to DateResolution.DAY
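
The format strings accepted by additional_formats look like standard strptime directives, and ignore_dates_before / ignore_dates_after act as a date window. The sketch below illustrates both ideas with the standard library only. The helpers parse_with_formats and within_window are hypothetical, and whether the extractor keeps dates that fall exactly on a boundary is an assumption of this sketch, not something the documentation states.

```python
from datetime import datetime


def parse_with_formats(candidate, formats):
    """Try each strptime format in order; return the first successful parse, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
    return None


def within_window(value, ignore_before=None, ignore_after=None):
    """Keep a date unless it falls outside the window (boundaries kept: an assumption)."""
    if ignore_before is not None and value < ignore_before:
        return False
    if ignore_after is not None and value > ignore_after:
        return False
    return True


# The same custom formats used in the Examples section.
additional_formats = ['%Y@%m@%d', '%a %Y, %b %d']

parsed = parse_with_formats('2018@02@02', additional_formats)
print(parsed)  # 2018-02-02 00:00:00

# A date before the lower bound is dropped.
print(within_window(parsed, ignore_before=datetime(2018, 3, 1)))  # False
```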
Returns:

A list of extractions; each extraction carries the following information:

Extraction._value: iso format string,
Extraction._provenance: provenance information including:
{
    'start_char': int - start_char,
    'end_char': int - end_char
},
Extraction._addition_inf: additional information including:
{
    'date_object': datetime.datetime - the datetime object,
    'original_text': str - the original str extracted from text,
    'language': enum['en', 'es'] - language of the date
}

Return type:

List[Extraction]
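
As a rough illustration of how date_value_resolution could shape the ISO string stored in the extraction value, the sketch below truncates a datetime at a chosen resolution. The Resolution enum is a hypothetical stand-in for DateResolution, and the exact strings the library emits may differ.

```python
from datetime import datetime
from enum import Enum


class Resolution(Enum):
    """Hypothetical stand-in for DateResolution; each value is a strftime pattern."""
    YEAR = "%Y"
    MONTH = "%Y-%m"
    DAY = "%Y-%m-%d"
    HOUR = "%Y-%m-%dT%H"
    MINUTE = "%Y-%m-%dT%H:%M"
    SECOND = "%Y-%m-%dT%H:%M:%S"


def to_iso(value, resolution=Resolution.DAY):
    """Format a datetime as an ISO-style string truncated at the given resolution."""
    return value.strftime(resolution.value)


moment = datetime(2018, 2, 2, 13, 45, 30)
print(to_iso(moment))                     # 2018-02-02
print(to_iso(moment, Resolution.MINUTE))  # 2018-02-02T13:45
```

Truncating at format time leaves the full datetime untouched, which matches the return structure above: the ISO string is the value, while the complete datetime object is kept under 'date_object'.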