PyWebScrapr Functions
Available functions:
- scrape_text(links_file=None, links_array=None, output_file='output.txt', csv_output_file=None, remove_extra_whitespace=True, remove_duplicates=False, similarity_threshold=0.8, elements_to_scrape='text', follow_child_links=False, max_links_to_follow=None, print_progress=False, rate_limit=0, max_workers=5): Scrape textual content from the given links and save to specified output file(s).
- scrape_images(links_file=None, links_array=None, save_folder='images', min_width=None, min_height=None, max_width=None, max_height=None, follow_child_links=False, max_links_to_follow=None, print_progress=False, rate_limit=0, max_workers=5): Scrape image content from the given links and save it to the specified output folder.
Scrape text
Scrape textual content from the given links and save to specified output file(s).
Parameters:
- links_file (str): Path to a file containing links, with each link on a new line.
- links_array (list): List of links to scrape content from.
- output_file (str): File to save the scraped content.
- csv_output_file (str): File to save the URL and scraped information in CSV format.
- remove_extra_whitespace (bool): If True, remove extra whitespace and empty lines from the output. Defaults to True.
- remove_duplicates (bool): If True, remove duplicate or highly similar paragraphs. Defaults to False.
- similarity_threshold (float): Similarity percentage (0-1) above which paragraphs are considered duplicates. Defaults to 0.8.
- elements_to_scrape (str): Type of content to scrape. Defaults to 'text'. Options are:
'text' (default) - Scrape visible textual content.
'content' - Scrape the `content` attribute of meta tags.
'unseen' - Scrape hidden or non-visible elements (e.g., meta tags, script data).
'links' - Scrape `href` or `src` attributes of anchor and media elements.
- follow_child_links (bool): If True, follow and scrape content from child links found on the page.
- max_links_to_follow (int): Maximum number of links to follow when scraping child links.
- print_progress (bool): If True, print progress updates to the console.
- rate_limit (int): Seconds to wait between requests.
- max_workers (int): Maximum number of threads to use.
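To make the remove_duplicates and similarity_threshold parameters more concrete, here is a minimal sketch of threshold-based paragraph deduplication. It is an illustration only: the helper name dedupe_paragraphs is hypothetical, and the use of difflib.SequenceMatcher is an assumption, not necessarily how PyWebScrapr measures similarity internally.

```python
from difflib import SequenceMatcher


def dedupe_paragraphs(paragraphs, similarity_threshold=0.8):
    """Hypothetical helper: keep a paragraph only if it is not too
    similar to a paragraph that has already been kept."""
    kept = []
    for paragraph in paragraphs:
        # ratio() returns a similarity score between 0.0 and 1.0.
        is_duplicate = any(
            SequenceMatcher(None, paragraph, seen).ratio() > similarity_threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(paragraph)
    return kept


paragraphs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",  # near-duplicate of the first
    "A completely unrelated sentence about web scraping.",
]
result = dedupe_paragraphs(paragraphs, similarity_threshold=0.9)
```

Under this sketch, raising the threshold toward 1 keeps more near-duplicates, while lowering it removes paragraphs more aggressively.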
Example:
scrape_text(links_array=['https://example.com'], similarity_threshold=0.9)
Scrape images
Scrape image content from the given links and save it to the specified output folder.
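The scrape_images signature accepts min_width, min_height, max_width and max_height. A rough sketch of how such size limits could be applied to an image is shown below. The function within_size_limits is hypothetical, and treating each bound as inclusive and measured in pixels is an assumption rather than documented behavior.

```python
def within_size_limits(width, height, min_width=None, min_height=None,
                       max_width=None, max_height=None):
    """Hypothetical check: return True if an image of the given size
    satisfies every limit that is set (None means no limit)."""
    if min_width is not None and width < min_width:
        return False
    if min_height is not None and height < min_height:
        return False
    if max_width is not None and width > max_width:
        return False
    if max_height is not None and height > max_height:
        return False
    return True


# An 800x600 image passes a 640 minimum width; a 100x100 icon does not.
large_ok = within_size_limits(800, 600, min_width=640)
icon_ok = within_size_limits(100, 100, min_width=640)
```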