Language Identification Extractor

class etk.extractors.language_identification_extractor.IdentificationTool[source]

Bases: enum.Enum

An enumeration.

class etk.extractors.language_identification_extractor.LanguageIdentificationExtractor[source]

Bases: etk.extractor.Extractor

Description

Identify the language used in text, returning the identified language as an ISO 639-1 code.

Uses two libraries:

- https://github.com/davidjurgens/equilid
- https://github.com/saffsd/langid.py

TODO: define Enum to select which method to use.

TODO: define dictionary to translate ISO 639-3 to ISO 639-1 codes (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes); perhaps there is an online source that has this.
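The second TODO could be approached with a plain lookup table. The following is a minimal sketch only, not part of ETK: the names `ISO_639_3_TO_1` and `to_iso_639_1` are hypothetical, and the table is deliberately partial (a complete mapping would have to be generated from an authoritative code list).

```python
from typing import Optional

# Hypothetical, deliberately partial table mapping ISO 639-3 codes
# to their ISO 639-1 equivalents. Not part of ETK.
ISO_639_3_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "spa": "es",
    "rus": "ru",
    "ara": "ar",
    "zho": "zh",
}


def to_iso_639_1(code: str) -> Optional[str]:
    """Return the ISO 639-1 code for an ISO 639-3 code, or None if unmapped."""
    # Normalize whitespace and case before the lookup.
    return ISO_639_3_TO_1.get(code.strip().lower())
```

Returning `None` for unmapped codes keeps the caller in control, since many ISO 639-3 languages have no two-letter equivalent at all.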

Examples

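from etk.extractors.language_identification_extractor import (
    IdentificationTool, LanguageIdentificationExtractor)
input_str = "Some text whose language should be identified."  # placeholder input
language_identification_extractor = LanguageIdentificationExtractor()
extractions = language_identification_extractor.extract(
    text=input_str, method=IdentificationTool.LANGDETECT.name)
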
extract(text: str, method: str) → List[etk.extraction.Extraction][source]
Parameters:
  • text (str) – any text, can contain HTML
  • method (Enum[IdentificationTool.LANGID, IdentificationTool.LANGDETECT]) – specifies which of the two algorithms to use
Returns:

a list containing an extraction with the code of the language used in the text. Returns an empty list if the extractor fails to identify the language in the text.

Return type:

List(Extraction)
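The signature types `method` as `str`, while the parameter list describes it as an `IdentificationTool` member, and the example passes the member's `.name`. The sketch below illustrates how a member name can be mapped back to an enum member with the standard `enum` module. It is a stand-in, not ETK's code: the member values and the helper `resolve_tool` are assumptions, and how ETK itself treats an unknown name is not stated in this page.

```python
from enum import Enum, auto


class IdentificationTool(Enum):
    # Stand-in for the ETK enum; the member values here are assumed.
    LANGID = auto()
    LANGDETECT = auto()


def resolve_tool(method: str) -> IdentificationTool:
    """Map a member name such as "LANGDETECT" to the enum member."""
    try:
        # Enum classes support lookup by member name via indexing.
        return IdentificationTool[method]
    except KeyError:
        valid = ", ".join(tool.name for tool in IdentificationTool)
        raise ValueError(
            f"unknown method {method!r}; expected one of: {valid}"
        ) from None
```

With this stand-in, `resolve_tool(IdentificationTool.LANGDETECT.name)` returns `IdentificationTool.LANGDETECT`, and an unrecognized name raises `ValueError` listing the valid names.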