scrape_text(links_file=None, links_array=None, output_file='output.txt', csv_output_file=None, remove_extra_whitespace=True, remove_duplicates=True, similarity_threshold=0.8, elements_to_scrape='text'): Scrape textual content from the given links and save it to the specified output file(s).
scrape_images(links_file=None, links_array=None, save_folder='images', min_width=None, min_height=None, max_width=None, max_height=None): Scrape image content from the given links and save it to the specified output folder.
Scrape text
Scrape textual content from the given links and save it to the specified output file(s).
Parameters:
- links_file (str): Path to a file containing links, with each link on a new line.
- links_array (list): List of links to scrape content from.
- output_file (str): File to save the scraped content. Defaults to 'output.txt'.
- csv_output_file (str): File to save the URL and scraped information in CSV format.
- remove_extra_whitespace (bool): If True, remove extra whitespace and empty lines from the output. Defaults to True.
- remove_duplicates (bool): If True, remove duplicate or highly similar paragraphs. Defaults to True.
- similarity_threshold (float): Similarity ratio (0 to 1) above which paragraphs are considered duplicates. Defaults to 0.8.
- elements_to_scrape (str): Type of content to scrape. Defaults to 'text'. Options are:
  - 'text': Scrape visible textual content.
  - 'content': Scrape the `content` attribute of meta tags.
  - 'unseen': Scrape hidden or non-visible elements (e.g., meta tags, script data).
  - 'links': Scrape `href` or `src` attributes of anchor and media elements.

Example:

    from pywebscrapr import scrape_text

    # Using links from a list and a stricter duplicate threshold.
    scrape_text(links_array=['https://example.com'], similarity_threshold=0.9)
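To give a feel for how remove_duplicates and similarity_threshold interact, here is a minimal sketch of near-duplicate paragraph filtering. It is not the library's implementation: the similarity measure pywebscrapr uses is not documented here, so this stand-in uses `difflib.SequenceMatcher` from the standard library, and the helper name `drop_similar_paragraphs` is hypothetical.

```python
from difflib import SequenceMatcher


def drop_similar_paragraphs(paragraphs, similarity_threshold=0.8):
    """Keep a paragraph only if it is not too similar to one already kept.

    Hypothetical helper: SequenceMatcher is an assumed stand-in for whatever
    similarity measure the library actually applies.
    """
    kept = []
    for paragraph in paragraphs:
        # A paragraph counts as a duplicate when its similarity ratio to any
        # previously kept paragraph is strictly above the threshold.
        is_duplicate = any(
            SequenceMatcher(None, paragraph, earlier).ratio() > similarity_threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(paragraph)
    return kept


paragraphs = [
    "Welcome to our site.",
    "Welcome to our site!",  # differs from the first by one character
    "Contact us by email.",
]
deduplicated = drop_similar_paragraphs(paragraphs, similarity_threshold=0.8)
print(deduplicated)
```

Raising the threshold toward 1 keeps more near-identical paragraphs; lowering it removes paragraphs more aggressively.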
Scrape images
Scrape image content from the given links and save it to the specified output folder.
Parameters:
- links_file (str): Path to a file containing links, with each link on a new line.
- links_array (list): List of links to scrape images from.
- save_folder (str): Folder to save the scraped images. Defaults to 'images'.
- min_width (int): Minimum width of images to include (optional).
- min_height (int): Minimum height of images to include (optional).
- max_width (int): Maximum width of images to include (optional).
- max_height (int): Maximum height of images to include (optional).

Example:

    from pywebscrapr import scrape_images

    # Using links from a file and saving images to output_images folder.
    scrape_images(links_file='links.txt', save_folder='output_images', min_width=100, min_height=100)
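The four size parameters can be read as optional bounds that an image must satisfy to be saved. The sketch below illustrates that reading only. The helper `passes_size_filter` and the sample file names are hypothetical, and it assumes that the bounds are inclusive and that a bound left as None is not checked. The library's exact behaviour at the boundaries may differ.

```python
def passes_size_filter(width, height, min_width=None, min_height=None,
                       max_width=None, max_height=None):
    """Return True if the dimensions satisfy every bound that is set.

    Hypothetical helper; inclusive bounds are an assumption.
    """
    if min_width is not None and width < min_width:
        return False
    if min_height is not None and height < min_height:
        return False
    if max_width is not None and width > max_width:
        return False
    if max_height is not None and height > max_height:
        return False
    return True


# Made-up image names mapped to (width, height).
sizes = {
    "icon.png": (32, 32),
    "photo.jpg": (800, 600),
    "banner.png": (1920, 90),
}
selected = [
    name for name, (width, height) in sizes.items()
    if passes_size_filter(width, height, min_width=100, min_height=100)
]
print(selected)
```

With min_width=100 and min_height=100, as in the example above, small icons and thin banners are skipped while ordinary photos pass.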