Date Extractor
class etk.extractors.date_extractor.DateExtractor(etk: etk.etk.ETK = None, extractor_name: str = 'date extractor')
Bases: etk.extractor.Extractor
- Description
- This extractor predefines a rich set of datetime regular expressions to detect timestamps in a wide range of formats in the input text. In addition, it employs spaCy rules to resolve a relative datetime against a relative datetime base, for instance "two days ago" with a relative datetime base of 02/02/2018.
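To make the relative-date example concrete, here is a minimal sketch of the arithmetic involved, using only the standard library. The helper `resolve_days_ago` is hypothetical and is not part of ETK; the extractor's own resolution logic may handle many more phrasings.

```python
from datetime import datetime, timedelta


def resolve_days_ago(days: int, relative_base: datetime) -> datetime:
    """Hypothetical helper: shift relative_base back by `days` days."""
    return relative_base - timedelta(days=days)


# 02/02/2018 is February 2, 2018 in both month-first and day-first order.
base = datetime(2018, 2, 2)

# "two days ago" relative to that base.
resolved = resolve_days_ago(2, base)
print(resolved.date().isoformat())  # 2018-01-31
```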
Examples
date_extractor = DateExtractor(etk=self.etk)
date_extractor.extract(text=input_doc,
                       extract_first_date_only=False,  # True: return only the first valid date
                       additional_formats=['%Y@%m@%d', '%a %Y, %b %d'],
                       use_default_formats=True,
                       ignore_dates_before=ignore_before,
                       ignore_dates_after=ignore_after,
                       relative_base=relative_base,
                       preferred_date_order="DMY",
                       prefer_language_date_order=True,
                       timezone='GMT',
                       to_timezone='UTC',
                       return_as_timezone_aware=False,
                       prefer_day_of_month='first',
                       prefer_dates_from='future',
                       )
extract(text: str = None, extract_first_date_only: bool = False, additional_formats: List[str] = [], use_default_formats: bool = False, ignore_dates_before: datetime.datetime = None, ignore_dates_after: datetime.datetime = None, detect_relative_dates: bool = False, relative_base: datetime.datetime = None, preferred_date_order: str = 'MDY', prefer_language_date_order: bool = True, timezone: str = None, to_timezone: str = None, return_as_timezone_aware: bool = True, prefer_day_of_month: str = 'first', prefer_dates_from: str = 'current', date_value_resolution: etk.extractors.date_extractor.DateResolution = <DateResolution.DAY: 4>) → List[etk.extraction.Extraction]
Parameters:
- text (str) – extract dates from this ‘text’, default to None
- extract_first_date_only (bool) – extract the first valid date only or extract all, default to False
- additional_formats (List[str]) – user defined formats for extraction, default to empty list
- use_default_formats (bool) – whether to use the default formats together with additional_formats, default to False
- ignore_dates_before (datetime.datetime) – ignore dates before ‘ignore_dates_before’, default to None
- ignore_dates_after (datetime.datetime) – ignore dates after ‘ignore_dates_after’, default to None
- detect_relative_dates (bool) – whether to detect relative dates like ‘9 days before’, default to False
- relative_base (datetime.datetime) – offset relative dates detected based on ‘relative_base’, default to None
- preferred_date_order (enum['MDY', 'DMY', 'YMD']) – preferred date order when ambiguous, default to ‘MDY’
- prefer_language_date_order (bool) – whether to use the preferred date order of the text’s language, default to True
- timezone (str) – add ‘timezone’ if there is no timezone information in the extracted date, default to None
- to_timezone (str) – convert all dates extracted to this timezone, default to None
- return_as_timezone_aware (bool) – whether the returned datetimes are timezone-aware, default to True
- prefer_day_of_month (enum['first', 'current', 'last']) – which day of the month to use when the date has no ‘day’, default to ‘first’
- prefer_dates_from (enum['past', 'current', 'future']) – which date to prefer when there is little information (e.g. only a month), default to ‘current’
- date_value_resolution (enum[DateResolution.SECOND, DateResolution.MINUTE, DateResolution.HOUR, DateResolution.DAY, DateResolution.MONTH, DateResolution.YEAR]) – resolution to use when converting to an ISO format string, default to DateResolution.DAY
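The effect of preferred_date_order is easiest to see on a numeric date that parses validly in more than one order. The sketch below uses only `datetime.strptime`; the `ORDER_TO_FORMAT` mapping and `parse_with_order` helper are hypothetical illustrations and do not reflect how ETK resolves ambiguity internally.

```python
from datetime import datetime

# Hypothetical mapping from an order code to a strptime pattern for
# slash-separated numeric dates; not taken from ETK.
ORDER_TO_FORMAT = {
    "MDY": "%m/%d/%Y",
    "DMY": "%d/%m/%Y",
    "YMD": "%Y/%m/%d",
}


def parse_with_order(text: str, order: str) -> datetime:
    """Parse a numeric date string under the given field order."""
    return datetime.strptime(text, ORDER_TO_FORMAT[order])


ambiguous = "03/04/2018"
as_mdy = parse_with_order(ambiguous, "MDY")  # March 4, 2018
as_dmy = parse_with_order(ambiguous, "DMY")  # April 3, 2018
print(as_mdy.date(), as_dmy.date())
```

The same string yields two different dates, which is why the order only matters "when ambiguous": a string such as 25/04/2018 has a single valid reading.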
Returns: List of extractions, each including:
Extraction._value: iso format string
Extraction._provenance: provenance information including:
{ 'start_char': int - start_char, 'end_char': int - end_char }
Extraction._addition_inf: additional information including:
{ 'date_object': datetime.datetime - the datetime object, 'original_text': str - the original str extracted from text, 'language': enum['en', 'es'] - language of the date }
Return type: List[Extraction]
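As a rough illustration of how date_value_resolution could shape the ISO string held in Extraction._value, the sketch below truncates a datetime at a chosen resolution. It assumes the value is a plain ISO 8601 prefix; the `RESOLUTION_FORMATS` table and `to_iso` helper are hypothetical, and ETK's actual output format may differ.

```python
from datetime import datetime

# Hypothetical strftime patterns keyed by resolution name; ETK's own
# DateResolution enum and output format may differ.
RESOLUTION_FORMATS = {
    "YEAR": "%Y",
    "MONTH": "%Y-%m",
    "DAY": "%Y-%m-%d",
    "HOUR": "%Y-%m-%dT%H",
    "MINUTE": "%Y-%m-%dT%H:%M",
    "SECOND": "%Y-%m-%dT%H:%M:%S",
}


def to_iso(dt: datetime, resolution: str = "DAY") -> str:
    """Format dt as an ISO-style string cut off at the given resolution."""
    return dt.strftime(RESOLUTION_FORMATS[resolution])


moment = datetime(2018, 2, 2, 14, 30, 5)
day_value = to_iso(moment)               # default resolution is DAY
minute_value = to_iso(moment, "MINUTE")
print(day_value, minute_value)
```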