Glossary Extractor

class etk.extractors.glossary_extractor.GlossaryExtractor(glossary: List[str], extractor_name: str, tokenizer: etk.tokenizer.Tokenizer, ngrams: int = 2, case_sensitive=False)

Bases: etk.extractor.Extractor

Description
This class takes a list of glossary terms as a reference and extracts the matching n-gram strings from the tokenized input string.

Examples

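from etk.extractors.glossary_extractor import GlossaryExtractor
from etk.tokenizer import Tokenizer
glossary = ['Beijing', 'Los Angeles', 'New York', 'Shanghai']
tokenizer = Tokenizer()
glossary_extractor = GlossaryExtractor(glossary=glossary,
                                      extractor_name='glossary_extractor',  # any label
                                      tokenizer=tokenizer,
                                      ngrams=3,
                                      case_sensitive=True)
glossary_extractor.extract(tokens=tokenizer.tokenize(input_text))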

extract(tokens: List[spacy.tokens.token.Token]) → List[etk.extraction.Extraction]

Extracts glossary matches from a list of tokens using this GlossaryExtractor instance.

Parameters: tokens (List[Token]) – list of spaCy tokens to be processed.
Returns: the list of extractions, or an empty list if there are no matches.
Return type: List[Extraction]
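
Illustration

The matching described above can be pictured with a small standalone sketch. This is not ETK's implementation: the function name match_glossary is hypothetical, plain strings split on whitespace stand in for spaCy tokens, and the sketch assumes that ngrams is the maximum window length in tokens. It only shows the general idea of sliding windows of 1 to ngrams tokens over the input and looking each window up in the glossary.

```python
from typing import List, Set, Tuple


def match_glossary(tokens: List[str], glossary: List[str],
                   ngrams: int = 2,
                   case_sensitive: bool = False) -> List[Tuple[str, int, int]]:
    """Return (matched_text, start_token, end_token) for each glossary hit.

    Hypothetical helper for illustration only; not part of ETK.
    """
    def norm(s: str) -> str:
        # Lower-case both sides when matching is case-insensitive.
        return s if case_sensitive else s.lower()

    # Store glossary entries as space-joined token strings for set lookup.
    entries: Set[str] = {norm(" ".join(term.split())) for term in glossary}

    matches = []
    for start in range(len(tokens)):
        # Try every window of 1..ngrams tokens beginning at `start`.
        for size in range(1, ngrams + 1):
            end = start + size
            if end > len(tokens):
                break
            candidate = " ".join(tokens[start:end])
            if norm(candidate) in entries:
                matches.append((candidate, start, end))
    return matches


glossary = ['Beijing', 'Los Angeles', 'New York', 'Shanghai']
tokens = "She flew from New York to Los Angeles via beijing".split()

print(match_glossary(tokens, glossary, ngrams=3, case_sensitive=True))
# [('New York', 3, 5), ('Los Angeles', 6, 8)]
print(match_glossary(tokens, glossary, ngrams=3, case_sensitive=False))
# [('New York', 3, 5), ('Los Angeles', 6, 8), ('beijing', 9, 10)]
```

In this sketch, a window size smaller than the longest glossary entry would miss that entry: with ngrams=1, "New York" and "Los Angeles" are never formed as candidates.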