HTML Metadata Extractor

class etk.extractors.html_metadata_extractor.HTMLMetadataExtractor

Bases: etk.extractor.Extractor

Description

Extracts META, microdata, JSON-LD and RDFa from HTML pages.

Uses the BeautifulSoup approach from https://stackoverflow.com/questions/36768068/get-meta-tag-content-property-with-beautifulsoup-and-python to extract the META tags.

Uses extruct (https://github.com/scrapinghub/extruct) to extract microdata, JSON-LD and RDFa from HTML pages.
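
The title and META-tag step can be pictured with a small standard-library sketch. The source points to a BeautifulSoup recipe; the code below swaps in Python's html.parser so that it runs without third-party packages, and it is not ETK's implementation. The names TitleMetaParser and extract_title_and_meta are hypothetical, and the output shape is assumed from the { “title”: “…” } and { “meta”: { … } } forms given for the extract parameters.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta> key/content pairs."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.title = ""
        elif tag == "meta":
            # Accept both name="..." and property="..." (Open Graph style) keys.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so concatenate them.
        if self._in_title:
            self.title += data


def extract_title_and_meta(html_text):
    """Return a dict with optional "title" and "meta" keys (assumed shape)."""
    parser = TitleMetaParser()
    parser.feed(html_text)
    parser.close()
    result = {}
    if parser.title is not None:
        result["title"] = parser.title.strip()
    if parser.meta:
        result["meta"] = parser.meta
    return result


sample = """<html><head><title>Example page</title>
<meta name="author" content="Jane Doe">
<meta property="og:type" content="article">
</head><body></body></html>"""
print(extract_title_and_meta(sample))
# {'title': 'Example page', 'meta': {'author': 'Jane Doe', 'og:type': 'article'}}
```

Keying each META tag by its name or property attribute keeps the result close to the { “meta”: { “author”: “…” } } form; whether ETK handles duplicate keys or other attributes the same way is not stated in the source.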

Examples

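from etk.extractors.html_metadata_extractor import HTMLMetadataExtractor

# input_doc is the HTML of the page, as a string
html_metadata_extractor = HTMLMetadataExtractor()
html_metadata_extractor.extract(html_text=input_doc,
                                extract_title=True,
                                extract_meta=True,
                                extract_microdata=False,
                                extract_json_ld=False,
                                extract_rdfa=False)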

extract(html_text: str, extract_title: bool = False, extract_meta: bool = False, extract_microdata: bool = False, extract_json_ld: bool = False, extract_rdfa: bool = False) → List[etk.extraction.Extraction]

Parameters:
    html_text (str) – input HTML string to extract from
    extract_title (bool) – True if the text of the ‘title’ tag should be extracted; returned as { “title”: “…” }
    extract_meta (bool) – True if the contents of the ‘meta’ tags should be extracted; returned as { “meta”: { “author”: “…”, …}}
    extract_microdata (bool) – True if microdata should be extracted; returned as { “microdata”: […] }
    extract_json_ld (bool) – True if JSON-LD should be extracted; returned as { “json-ld”: […] }
    extract_rdfa (bool) – True if RDFa should be extracted; returned as { “rdfa”: […] }
Returns:	a list of extractions, or an empty list if there are no matches.
Return type:	List[Extraction]
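
To make the { “json-ld”: […] } form more concrete, here is a minimal standard-library sketch that collects JSON-LD script blocks from a page. According to the description above, ETK relies on extruct for this kind of metadata, so this code is only an illustration and not ETK's or extruct's implementation. The names JsonLdParser and extract_json_ld are hypothetical, and skipping malformed blocks is an assumption about sensible behavior rather than something the source documents.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Hypothetical helper: gathers the bodies of JSON-LD <script> blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_json_ld(html_text):
    """Return {"json-ld": [...]} for parseable blocks, or {} if none are found."""
    parser = JsonLdParser()
    parser.feed(html_text)
    parser.close()
    items = []
    for block in parser.blocks:
        try:
            items.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # assumption: skip malformed blocks instead of failing
    return {"json-ld": items} if items else {}


page = '<script type="application/ld+json">{"@type": "Person", "name": "Jane"}</script>'
print(extract_json_ld(page))
```

Returning an empty dict when nothing is found mirrors the documented behavior of returning an empty list when there are no matches.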
