PyWebScrapr Functions
Scrape text
Parameters:
- links_file (str): Path to a file containing links, with each link on a new line.
- links_array (list): List of links to scrape content from.
- output_file (str): File to save the scraped content.
- csv_output_file (str): File to save the URL and scraped information in CSV format.
- remove_extra_whitespace (bool): If True, remove extra whitespace and empty lines from the output. Defaults to True.
- remove_duplicates (bool): If True, remove duplicate or highly similar paragraphs. Defaults to False.
- similarity_threshold (float): Similarity ratio (from 0 to 1) above which paragraphs are considered duplicates. Defaults to 0.8.
- elements_to_scrape (str): Type of content to scrape. Defaults to 'text'. Options are:
  'text' (default) - Scrape visible textual content.
  'content' - Scrape the `content` attribute of meta tags.
  'unseen' - Scrape hidden or non-visible elements (e.g., meta tags, script data).
  'links' - Scrape `href` or `src` attributes of anchor and media elements.
- follow_child_links (bool): If True, follow and scrape content from child links found on the page.
- max_links_to_follow (int): Maximum number of links to follow when scraping child links.
- print_progress (bool): If True, print progress updates to the console.
- rate_limit (int): Seconds to wait between requests.
- max_workers (int): Maximum number of threads to use.
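To make the interaction between remove_duplicates and similarity_threshold concrete, here is a minimal sketch of threshold-based paragraph deduplication. It is not PyWebScrapr's actual implementation: the helper name `drop_similar_paragraphs` is hypothetical, and the use of `difflib.SequenceMatcher` as the similarity measure is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def drop_similar_paragraphs(paragraphs, similarity_threshold=0.8):
    """Keep a paragraph only if it is not too similar to one already kept.

    Hypothetical helper: the similarity measure (SequenceMatcher.ratio) is an
    assumption, not necessarily what PyWebScrapr uses internally.
    """
    kept = []
    for paragraph in paragraphs:
        # ratio() returns a float in [0, 1]; 1.0 means identical strings.
        is_duplicate = any(
            SequenceMatcher(None, paragraph, earlier).ratio() > similarity_threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(paragraph)
    return kept


paragraphs = [
    "Welcome to our site.",
    "Welcome to our site!",   # differs from the first by one character
    "Contact us by email.",
]
print(drop_similar_paragraphs(paragraphs, similarity_threshold=0.9))
```

Under this sketch, a higher threshold removes fewer paragraphs, because two paragraphs must be closer to identical before one of them is dropped.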
Example:
scrape_text(links_array=['https://example.com'], similarity_threshold=0.9)
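When the links come from a file rather than a list, the links_file parameter expects one link per line. The sketch below prepares such a file and collects keyword arguments drawn from the parameter list above. The helper `write_links_file`, the wrapper `run_scrape`, the file names, and the chosen values are all illustrative; the top-level import `from PyWebScrapr import scrape_text` is an assumption about how the package exposes the function.

```python
import tempfile
from pathlib import Path


def write_links_file(links, path):
    """Write one link per line, skipping blank entries; return the path as a string."""
    cleaned = [link.strip() for link in links if link.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return str(path)


links_path = write_links_file(
    ["https://example.com", "  https://example.org  ", ""],
    Path(tempfile.mkdtemp()) / "links.txt",
)

# Keyword arguments named after the documented parameters; values are illustrative.
options = {
    "links_file": links_path,
    "output_file": "scraped.txt",
    "remove_duplicates": True,
    "similarity_threshold": 0.9,
    "rate_limit": 1,
    "max_workers": 4,
}


def run_scrape():
    # Assumption: the package exposes scrape_text at the top level.
    from PyWebScrapr import scrape_text

    scrape_text(**options)
```

Calling `run_scrape()` performs the actual requests, so it is left uncalled here; the rest of the block only writes the links file and builds the options.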