HTML Content Extractor
class etk.extractors.html_content_extractor.HTMLContentExtractor
Bases: etk.extractor.Extractor
Description
This class extracts text from HTML pages using readability and BeautifulSoup.
Examples
    html_content_extractor = HTMLContentExtractor()
    html_content_extractor.extract(html_text=input_doc, strategy=Strategy.ALL_TEXT)
extract(html_text: str, strategy: etk.extractors.html_content_extractor.Strategy = <Strategy.ALL_TEXT: 1>) → List[etk.extraction.Extraction]
Extracts text from an HTML page using one of several strategies.
Parameters:
- html_text (str) – the HTML page as a string
- strategy (Strategy) – one of Strategy.ALL_TEXT, Strategy.MAIN_CONTENT_STRICT, or Strategy.MAIN_CONTENT_RELAXED; defaults to Strategy.ALL_TEXT
Returns: typically a singleton list with the extracted text
Return type: List[Extraction]
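
To make the idea of an "all text" strategy concrete, here is a minimal standard-library sketch of collecting every visible text node from a page and returning it as a singleton list. It is not ETK's implementation (which, per the description above, relies on readability and BeautifulSoup); the names `_TextCollector` and `all_text` are hypothetical, and skipping `<script>` and `<style>` bodies is an assumption of this sketch rather than documented ETK behavior.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes, skipping <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def all_text(html_text: str) -> List[str]:
    """Illustrative stand-in for an ALL_TEXT-style extraction.

    Returns a singleton list, mirroring the documented return shape,
    but with plain strings instead of Extraction objects.
    """
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    return [" ".join(collector.chunks)]
```

Calling `all_text("<h1>Hello</h1><p>World</p>")` returns `['Hello World']`. The MAIN_CONTENT strategies are not sketched here, since isolating the main article body requires the kind of heuristics that readability provides.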