URL Extractor

class etk.extractors.url_extractor.URLExtractor(allow_missing_http: bool = False)

Bases: etk.extractors.regex_extractor.RegexExtractor

Description
This class inherits from RegexExtractor and predefines a URL pattern as its regex pattern.

Example

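from etk.extractors.url_extractor import URLExtractor
text = 'Visit example.com or https://example.org for details.'  # sample input
url_extractor = URLExtractor(allow_missing_http=True)
extractions = url_extractor.extract(text=text)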
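The effect of allow_missing_http is not spelled out on this page. Going by the name, it presumably lets the extractor accept URLs written without an http:// or https:// prefix. The sketch below illustrates that idea with the standard re module only. The two patterns and the find_urls helper are hypothetical simplifications, not ETK's actual regex or API.

```python
import re

# Hypothetical, simplified patterns; ETK's real URL regex is more thorough.
STRICT_URL = re.compile(r"https?://\S+")
LENIENT_URL = re.compile(
    r"(?:https?://)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:/\S*)?"
)

def find_urls(text, allow_missing_http=False):
    """Return (value, start, end) for each URL-like match in text."""
    pattern = LENIENT_URL if allow_missing_http else STRICT_URL
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]

sample = "See https://example.org/docs or example.com/mirror for details"
strict = find_urls(sample)                            # scheme required
lenient = find_urls(sample, allow_missing_http=True)  # scheme optional
```

With the scheme required, only https://example.org/docs is found. With it optional, example.com/mirror is picked up as well.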
extract(text: str, flags=0, mode: etk.extractors.regex_extractor.MatchMode = MatchMode.FINDALL) → List[etk.extraction.Extraction]
Extracts information from a text using the given regex. If the pattern has no groups, it returns a list with a single Extraction. If the pattern has groups, it returns a list of Extractions, one for each group. Each Extraction records the start and end character positions of its match.
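The group handling described above can be sketched with the standard re module. Span is a hypothetical stand-in for etk.extraction.Extraction, and spans_for_match is an illustrative helper, not ETK code. It only shows how one match becomes either a single record or one record per group, each with its character offsets.

```python
import re
from typing import List, NamedTuple

class Span(NamedTuple):
    """Hypothetical stand-in for etk.extraction.Extraction."""
    value: str
    start: int
    end: int

def spans_for_match(match: re.Match) -> List[Span]:
    # No capturing groups: report the whole match as a single span.
    if match.re.groups == 0:
        return [Span(match.group(0), match.start(), match.end())]
    # Otherwise report one span per capturing group that participated.
    return [
        Span(match.group(i), match.start(i), match.end(i))
        for i in range(1, match.re.groups + 1)
        if match.group(i) is not None
    ]

text = "visit example.org today"
whole = spans_for_match(re.search(r"\w+\.\w+", text))      # pattern without groups
parts = spans_for_match(re.search(r"(\w+)\.(\w+)", text))  # pattern with two groups
```

Here whole holds one span covering example.org, while parts holds two spans, one for example and one for org.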
Parameters:
  • text (str) – the text to extract from.
  • flags (enum['a', 'i', 'L', 'm', 's', 'u', 'x']) – flags given to search or match. The value should be one or more letters from the set ‘a’, ‘i’, ‘L’, ‘m’, ‘s’, ‘u’, ‘x’. The letters set the corresponding flags for the entire regular expression: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose).
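  • mode (enum[MatchMode.MATCH, MatchMode.SEARCH, MatchMode.FINDALL, MatchMode.SPLIT]) – which matching operation to apply, named after the corresponding re functions (match, search, findall, split). Defaults to MatchMode.FINDALL.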
Returns:
  a list of extractions, or an empty list if there are no matches.
Return type:
  List[Extraction]
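To make the flags and mode parameters more concrete, the sketch below runs the four re operations that the MatchMode names appear to correspond to, using re.I (the flag for the letter ‘i’). It uses only the standard library. How ETK maps each mode to these calls internally is an assumption here, and the pattern is a simplified placeholder.

```python
import re

# Simplified placeholder pattern; re.I makes the scheme case-insensitive.
pattern = re.compile(r"https?://\S+", re.I)
text = "Mirror: HTTP://EXAMPLE.ORG and https://example.com"

anchored = pattern.match(text)                        # MATCH: only at the start of text
first = pattern.search(text)                          # SEARCH: first match anywhere
every = [m.group(0) for m in pattern.finditer(text)]  # FINDALL: all matches
pieces = pattern.split(text)                          # SPLIT: text between matches
```

Because the text does not begin with a URL, the anchored match yields None, which is the kind of case where an empty list would be the natural return value.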