HTML Content Extractor

class etk.extractors.html_content_extractor.HTMLContentExtractor

Bases: etk.extractor.Extractor

Description
This class extracts text from HTML pages using readability and BeautifulSoup.

Examples

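from etk.extractors.html_content_extractor import HTMLContentExtractor, Strategy

# input_doc holds the raw HTML of the page as a str
html_content_extractor = HTMLContentExtractor()
extractions = html_content_extractor.extract(html_text=input_doc,
                                             strategy=Strategy.ALL_TEXT)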
extract(html_text: str, strategy: etk.extractors.html_content_extractor.Strategy = <Strategy.ALL_TEXT: 1>) → List[etk.extraction.Extraction]

Extracts text from an HTML page using one of several strategies.

Parameters:
  • html_text (str) – the HTML page as a string
  • strategy (Strategy) – one of Strategy.ALL_TEXT, Strategy.MAIN_CONTENT_RELAXED, or Strategy.MAIN_CONTENT_STRICT; defaults to Strategy.ALL_TEXT
Returns:
  typically a singleton list with the extracted text
Return type:
  List[Extraction]
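
Because extract() returns a list of Extraction objects rather than a plain string, it can help to see how the results might be unpacked and how the three strategies could be compared on the same page. The following is a minimal sketch, not part of ETK: texts_by_strategy and demo are hypothetical helper names, and the sketch assumes each Extraction exposes its text through a value attribute.

```python
from typing import Any, Dict, Iterable, List


def texts_by_strategy(extractor: Any, html_text: str,
                      strategies: Iterable[Any]) -> Dict[str, List[str]]:
    """Call extractor.extract() once per strategy and collect the extracted strings.

    Hypothetical helper: `extractor` is anything with an
    extract(html_text=..., strategy=...) method, such as HTMLContentExtractor.
    """
    results: Dict[str, List[str]] = {}
    for strategy in strategies:
        extractions = extractor.extract(html_text=html_text, strategy=strategy)
        # Assumption: each Extraction carries its text in a `value` attribute.
        results[strategy.name] = [str(e.value) for e in extractions]
    return results


def demo() -> None:
    """Compare all strategies on a tiny page (requires the etk package)."""
    # Import path taken from the class and method signatures documented above.
    from etk.extractors.html_content_extractor import HTMLContentExtractor, Strategy

    page = "<html><body><h1>Title</h1><p>Some body text.</p></body></html>"
    # Iterating over the Strategy enum yields every member in definition order.
    for name, texts in texts_by_strategy(HTMLContentExtractor(), page, Strategy).items():
        print(name, texts)
```

Calling demo() with etk installed prints one line per strategy. Since the returned list is described as typically a singleton, code that needs a single string may want to guard against an empty list before indexing it.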