Language Identification Extractor

class etk.extractors.language_identification_extractor.IdentificationTool

Bases: enum.Enum

An enumeration.

class etk.extractors.language_identification_extractor.LanguageIdentificationExtractor

Bases: etk.extractor.Extractor

Description

Identifies the language used in a text and returns the identified language as an ISO 639-1 code.

Uses two libraries:
- https://github.com/davidjurgens/equilid
- https://github.com/saffsd/langid.py

TODO: define Enum to select which method to use.
TODO: define a dictionary to translate ISO 639-3 codes to ISO 639-1 codes (see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes); perhaps an online source already provides this mapping.
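The second TODO could be approached with a plain lookup table. The following is only a minimal sketch: the names `ISO_639_3_TO_1` and `to_iso_639_1` are hypothetical and not part of ETK, and the table covers just a handful of languages for illustration.

```python
from typing import Optional

# Hypothetical partial mapping from ISO 639-3 (three-letter) codes to
# ISO 639-1 (two-letter) codes. A complete table would come from an
# authoritative source; this one is for illustration only.
ISO_639_3_TO_1 = {
    "eng": "en",
    "spa": "es",
    "fra": "fr",
    "deu": "de",
    "rus": "ru",
    "ara": "ar",
    "por": "pt",
    "jpn": "ja",
    "zho": "zh",
}


def to_iso_639_1(code: str) -> Optional[str]:
    """Return the ISO 639-1 code for an ISO 639-3 code, or None if unmapped."""
    return ISO_639_3_TO_1.get(code.strip().lower())


print(to_iso_639_1("eng"))
print(to_iso_639_1("yue"))
```

Returning `None` for unmapped codes matters because many ISO 639-3 languages (Cantonese, `yue`, for example) have no two-letter ISO 639-1 code, so a caller has to decide how to handle that case.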

Examples

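from etk.extractors.language_identification_extractor import (
    IdentificationTool, LanguageIdentificationExtractor)

# input_str holds the text whose language should be identified
language_identification_extractor = LanguageIdentificationExtractor()
language_identification_extractor.extract(text=input_str,
                                          method=IdentificationTool.LANGDETECT.name)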

extract(text: str, method: str) → List[etk.extraction.Extraction]

Parameters:	text (str) – any text; can contain HTML
	method (Enum[IdentificationTool.LANGID, IdentificationTool.LANGDETECT]) – specifies which of the two algorithms to use
Returns:	an extraction containing the language code used in the text. Returns an empty list if the extractor fails to identify the language in the text.
Return type:	List(Extraction)
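The signature annotates method as str, while the parameter description lists enum members, and the example passes IdentificationTool.LANGDETECT.name. The sketch below shows what `.name` yields for an enum member. The `IdentificationTool` class here is a stand-in defined only for illustration: the member names come from the parameter description, the member values are assumed, and `method_name` is a hypothetical helper.

```python
from enum import Enum, auto


class IdentificationTool(Enum):
    """Stand-in for ETK's IdentificationTool; member values are assumed."""
    LANGID = auto()
    LANGDETECT = auto()


def method_name(tool: IdentificationTool) -> str:
    """Return the member's name, the string form a str-typed argument takes."""
    return tool.name


print(method_name(IdentificationTool.LANGDETECT))  # LANGDETECT
```

Passing the member's name rather than the member itself keeps the call consistent with the str annotation on method. A name string can be turned back into a member with `IdentificationTool["LANGID"]`.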
