HTML Metadata Extractor
class etk.extractors.html_metadata_extractor.HTMLMetadataExtractor
Bases: etk.extractor.Extractor
Description
Extracts META, microdata, JSON-LD and RDFa from HTML pages.
Uses https://stackoverflow.com/questions/36768068/get-meta-tag-content-property-with-beautifulsoup-and-python to extract the META tags.
Uses https://github.com/scrapinghub/extruct to extract metadata from HTML pages.
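To make the shape of the title and META results concrete, here is a minimal sketch that collects the 'title' text and the 'meta' name/property pairs from an HTML string. It deliberately uses the standard library's html.parser rather than BeautifulSoup, so it is not the extractor's own implementation; TitleMetaParser and extract_title_and_meta are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Hypothetical helper: gathers <title> text and <meta> key/content pairs."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.title = ""
        elif tag == "meta":
            # Accept both name="..." and property="..." (e.g. Open Graph) keys.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title_and_meta(html_text):
    """Return a dict with 'title' and/or 'meta' keys, omitting whatever is absent."""
    parser = TitleMetaParser()
    parser.feed(html_text)
    parser.close()
    result = {}
    if parser.title is not None:
        result["title"] = parser.title.strip()
    if parser.meta:
        result["meta"] = parser.meta
    return result
```

The returned dictionary mirrors the { "title": "…" } and { "meta": { "author": "…", … } } shapes documented for extract, but wrapping values in Extraction objects is left out of this sketch.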
Examples
html_metadata_extractor = HTMLMetadataExtractor()
html_metadata_extractor.extract(html_text=input_doc, extract_title=True, extract_meta=True, extract_microdata=False, extract_json_ld=False, extract_rdfa=False)
extract(html_text: str, extract_title: bool = False, extract_meta: bool = False, extract_microdata: bool = False, extract_json_ld: bool = False, extract_rdfa: bool = False) → List[etk.extraction.Extraction]
Parameters:
- html_text (str) – input html string to be extracted
- extract_title (bool) – True if the text of the 'title' tag should be extracted; returned as { "title": "…" }
- extract_meta (bool) – True if the contents of the 'meta' tags should be extracted; returned as { "meta": { "author": "…", … } }
- extract_microdata (bool) – True if microdata should be extracted; returned as { "microdata": […] }
- extract_json_ld (bool) – True if JSON-LD should be extracted; returned as { "json-ld": […] }
- extract_rdfa (bool) – True if RDFa should be extracted; returned as { "rdfa": […] }
Returns: a list of extractions, or an empty list if there are no matches.
Return type: List[Extraction]
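As an illustration of the { "json-ld": […] } shape listed in the parameters above, the following sketch pulls JSON-LD blocks out of script tags of type application/ld+json. It uses only the standard library and is not how extruct works internally; JsonLdParser and extract_json_ld are hypothetical names, and silently skipping malformed blocks is an assumption made for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Hypothetical helper: collects parsed JSON-LD blocks from <script> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            raw = "".join(self._buffer)
            self._buffer = None
            try:
                self.blocks.append(json.loads(raw))
            except json.JSONDecodeError:
                pass  # assumption: malformed blocks are ignored in this sketch


def extract_json_ld(html_text):
    """Return {'json-ld': [...]} with one entry per well-formed JSON-LD block."""
    parser = JsonLdParser()
    parser.feed(html_text)
    parser.close()
    return {"json-ld": parser.blocks}
```

Unlike the documented method, which returns a list of Extraction objects, this sketch returns a plain dictionary so the payload shape is easy to inspect.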