Language Identification Extractor
class etk.extractors.language_identification_extractor.IdentificationTool
Bases: enum.Enum
An enumeration.
class etk.extractors.language_identification_extractor.LanguageIdentificationExtractor
Bases: etk.extractor.Extractor
- Description
Identifies the language used in a text and returns it as an ISO 639-1 code.
Uses two libraries:
- https://github.com/davidjurgens/equilid
- https://github.com/saffsd/langid.py
TODO: define an Enum to select which method to use.
TODO: define a dictionary to translate ISO 639-3 codes to ISO 639-1 codes (see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes); an online source may already provide this mapping.
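The translation dictionary mentioned in the second TODO could look roughly like the sketch below. It is not part of etk: the names ISO_639_3_TO_1 and to_iso_639_1 are hypothetical, and the table covers only a handful of well-known languages. A complete table would have to come from an ISO 639 registry.

```python
from typing import Optional

# Illustrative subset only; not a complete ISO 639-3 to ISO 639-1 table.
ISO_639_3_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "spa": "es",
    "por": "pt",
    "ita": "it",
    "rus": "ru",
    "ara": "ar",
    "zho": "zh",
    "jpn": "ja",
}


def to_iso_639_1(code: str) -> Optional[str]:
    """Return the two-letter code for a three-letter code, or None if unmapped."""
    normalized = code.strip().lower()
    if len(normalized) == 2:
        # Already two letters: passed through unchecked (an assumption of this sketch).
        return normalized
    return ISO_639_3_TO_1.get(normalized)
```

Returning None for unmapped codes lets a caller decide whether to drop the result or keep the three-letter code, since many ISO 639-3 languages have no two-letter equivalent.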
Examples
language_identification_extractor = LanguageIdentificationExtractor()
language_identification_extractor.extract(text=input_stri, method=IdentificationTool.LANGDETECT.name)
extract(text: str, method: str) → List[etk.extraction.Extraction]
Parameters:
- text (str) – any text, can contain HTML
- method (Enum[IdentificationTool.LANGID, IdentificationTool.LANGDETECT]) – specifies which of the two algorithms to use
Returns: an extraction containing the language code identified in the text. Returns an empty list if the extractor fails to identify the language in the text.
Return type: List(Extraction)
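The signature declares method as a str, while the parameter description names enum members, and the example passes IdentificationTool.LANGDETECT.name. The sketch below shows one way a string name can be mapped back to an enum member. It uses a stand-in enum rather than the etk class: the member names come from this page, but the values and the resolve_tool helper are assumptions made for illustration.

```python
from enum import Enum, auto


class IdentificationTool(Enum):
    # Stand-in for illustration: member names are taken from the docs,
    # the values are assumptions.
    LANGID = auto()
    LANGDETECT = auto()


def resolve_tool(method: str) -> IdentificationTool:
    """Look up an enum member by the name passed in the `method` string."""
    try:
        return IdentificationTool[method]
    except KeyError:
        valid = ", ".join(tool.name for tool in IdentificationTool)
        raise ValueError(
            f"unknown method {method!r}; expected one of: {valid}"
        ) from None


# Passing `.name` yields a plain string such as "LANGDETECT".
tool = resolve_tool(IdentificationTool.LANGDETECT.name)
```

Because extract is documented to return an empty list on failure, callers would typically check whether the result is empty before reading the first extraction.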