HTML Metadata Extractor

class etk.extractors.html_metadata_extractor.HTMLMetadataExtractor

Bases: etk.extractor.Extractor

Description

Extracts the title, META tags, microdata, JSON-LD, and RDFa from HTML pages.

META tags are extracted with BeautifulSoup, following the approach described at https://stackoverflow.com/questions/36768068/get-meta-tag-content-property-with-beautifulsoup-and-python

Embedded metadata is extracted with extruct: https://github.com/scrapinghub/extruct
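To make the { “title”: “…” } and { “meta”: { … } } result shapes concrete, here is a minimal sketch that collects the title and META tags using only the Python standard library. It is not the ETK implementation, which relies on BeautifulSoup; the names TitleMetaParser and extract_title_and_meta are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta> name/property pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # META tags are keyed by either a "name" or a "property" attribute.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is accumulated.
        if self._in_title:
            self.title += data


def extract_title_and_meta(html_text):
    """Return the title and META tags in the dictionary shapes described above."""
    parser = TitleMetaParser()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


sample = (
    "<html><head><title>Demo page</title>"
    '<meta name="author" content="Jane Doe">'
    '<meta property="og:type" content="article">'
    "</head><body></body></html>"
)
result = extract_title_and_meta(sample)
print(result)
```

META tags that carry neither a name nor a property attribute (for example a charset declaration) are skipped in this sketch.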

Examples

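from etk.extractors.html_metadata_extractor import HTMLMetadataExtractor

# input_doc is assumed to hold the HTML page as a string
html_metadata_extractor = HTMLMetadataExtractor()
html_metadata_extractor.extract(html_text=input_doc,
                                extract_title=True,
                                extract_meta=True,
                                extract_microdata=False,
                                extract_json_ld=False,
                                extract_rdfa=False)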
extract(html_text: str, extract_title: bool = False, extract_meta: bool = False, extract_microdata: bool = False, extract_json_ld: bool = False, extract_rdfa: bool = False) → List[etk.extraction.Extraction]
Parameters:
  • html_text (str) – the input HTML string to extract from
  • extract_title (bool) – True to extract the text of the ‘title’ tag, returned as { “title”: “…” }
  • extract_meta (bool) – True to extract the ‘meta’ tags, returned as { “meta”: { “author”: “…”, …}}
  • extract_microdata (bool) – True to extract microdata, returned as { “microdata”: […] }
  • extract_json_ld (bool) – True to extract JSON-LD, returned as { “json-ld”: […] }
  • extract_rdfa (bool) – True to extract RDFa, returned as { “rdfa”: […] }
Returns:

A list of extractions, or an empty list if there are no matches.

Return type:

List[Extraction]
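
As an illustration of the { “json-ld”: […] } shape listed under Parameters, the following standard-library sketch collects JSON-LD blocks from script tags of type application/ld+json. It is a stand-in written for this page under that assumption, not the extruct-based code ETK uses; JsonLdParser and extract_json_ld are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Hypothetical helper: gathers the parsed body of each JSON-LD <script> block."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # a list while inside a JSON-LD script, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            raw = "".join(self._buffer)
            self._buffer = None
            try:
                self.items.append(json.loads(raw))
            except json.JSONDecodeError:
                pass  # malformed blocks are skipped in this sketch


def extract_json_ld(html_text):
    """Return every JSON-LD block found, wrapped in the shape described above."""
    parser = JsonLdParser()
    parser.feed(html_text)
    parser.close()
    return {"json-ld": parser.items}


page = (
    "<html><head>"
    '<script type="application/ld+json">{"@type": "Person", "name": "Jane"}</script>'
    "</head><body></body></html>"
)
found = extract_json_ld(page)
print(found)
```

A page with no JSON-LD yields an empty list under the “json-ld” key, which mirrors the empty-list behavior documented for the return value.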