HTML Content Extractor

class etk.extractors.html_content_extractor.HTMLContentExtractor

Bases: etk.extractor.Extractor

Description: This class extracts text from HTML pages using readability and BeautifulSoup.

Examples

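from etk.extractors.html_content_extractor import HTMLContentExtractor, Strategy

# input_doc holds the HTML page as a string
html_content_extractor = HTMLContentExtractor()
extractions = html_content_extractor.extract(html_text=input_doc,
                                             strategy=Strategy.ALL_TEXT)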

extract(html_text: str, strategy: etk.extractors.html_content_extractor.Strategy = <Strategy.ALL_TEXT: 1>) → List[etk.extraction.Extraction]

Extracts text from an HTML page using the selected strategy.

Parameters:	html_text (str) – the HTML page as a string
	strategy (enum[Strategy.ALL_TEXT, Strategy.MAIN_CONTENT_RELAXED, Strategy.MAIN_CONTENT_STRICT]) – one of Strategy.ALL_TEXT, Strategy.MAIN_CONTENT_RELAXED, or Strategy.MAIN_CONTENT_STRICT; defaults to Strategy.ALL_TEXT
Returns:	typically a singleton list with the extracted text
Return type:	List[Extraction]
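
To make the idea behind Strategy.ALL_TEXT more concrete, the following is a minimal stand-alone sketch that collects all visible text from a page using only Python's standard library. It is not ETK's implementation (which, per the description above, relies on readability and BeautifulSoup); the names AllTextParser and all_text are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class AllTextParser(HTMLParser):
    """Hypothetical helper: collects every visible text node in a page."""

    # Text inside these tags is code or styling, not visible content.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def all_text(html_text: str) -> str:
    """Return all visible text of html_text, joined by single spaces."""
    parser = AllTextParser()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)
```

The MAIN_CONTENT_* strategies presumably go further and try to isolate the main body of the page before extracting text, which this sketch does not attempt.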
