PyWebScrapr Functions

Available functions:

scrape_text(links_file=None, links_array=None, output_file='output.txt', csv_output_file=None, remove_extra_whitespace=True, remove_duplicates=False, similarity_threshold=0.8, elements_to_scrape='text', follow_child_links=False, max_links_to_follow=None): Scrape textual content from the given links and save to specified output file(s).
scrape_images(links_file=None, links_array=None, save_folder='images', min_width=None, min_height=None, max_width=None, max_height=None, follow_child_links=False, max_links_to_follow=None): Scrape image content from the given links and save it to the specified output folder.

Scrape text

Scrape textual content from the given links and save it to the specified output file(s).

Parameters:
    - links_file (str): Path to a file containing links, with each link on a new line.
    - links_array (list): List of links to scrape content from.
    - output_file (str): File to save the scraped content. Defaults to 'output.txt'.
    - csv_output_file (str): Optional file to save the URL and scraped information in CSV format. Defaults to None.
    - remove_extra_whitespace (bool): If True, remove extra whitespace and empty lines from the output. Defaults to True.
    - remove_duplicates (bool): If True, remove duplicate or highly similar paragraphs. Defaults to False.
    - similarity_threshold (float): Similarity ratio between 0 and 1 above which paragraphs are considered duplicates. Only relevant when remove_duplicates is True. Defaults to 0.8.
    - elements_to_scrape (str): Type of content to scrape. Defaults to 'text'. Options are:
        'text' (default) - Scrape visible textual content.
        'content' - Scrape the `content` attribute of meta tags.
        'unseen' - Scrape hidden or non-visible elements (e.g., meta tags, script data).
        'links' - Scrape `href` or `src` attributes of anchor and media elements.
    - follow_child_links (bool): If True, follow and scrape content from child links found on the page. Defaults to False.
    - max_links_to_follow (int): Maximum number of links to follow when scraping child links. Defaults to None.
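
The interaction between remove_duplicates and similarity_threshold can be pictured with a small sketch. This is not PyWebScrapr's own implementation: it assumes similarity is measured with the standard library's `difflib.SequenceMatcher`, and `drop_similar` is a hypothetical helper named here only for illustration.

```python
from difflib import SequenceMatcher


def drop_similar(paragraphs, similarity_threshold=0.8):
    """Keep a paragraph only if it is not too similar to one already kept.

    Hypothetical helper: the similarity measure (SequenceMatcher ratio)
    is an assumption, not necessarily what PyWebScrapr uses.
    """
    kept = []
    for paragraph in paragraphs:
        # A paragraph counts as a duplicate when its similarity to any
        # earlier kept paragraph is above the threshold.
        is_duplicate = any(
            SequenceMatcher(None, paragraph, earlier).ratio() > similarity_threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(paragraph)
    return kept


paragraphs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about web scraping.",
]
unique = drop_similar(paragraphs, similarity_threshold=0.8)
```

Under this assumption, a higher threshold keeps more near-duplicates, and a lower threshold removes paragraphs more aggressively.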

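Example:
    from pywebscrapr import scrape_text

    # Scrape one page and treat paragraphs more than 90% similar as duplicates.
    scrape_text(links_array=['https://example.com'], remove_duplicates=True, similarity_threshold=0.9)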
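
The follow_child_links and max_links_to_follow parameters can be illustrated with a standalone sketch of collecting child links from a page and capping how many are followed. It uses only the standard library. `LinkCollector` and `child_links` are hypothetical names, and the de-duplication and page-order behaviour are assumptions rather than documented PyWebScrapr behaviour.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def child_links(html, base_url, max_links_to_follow=None):
    """Return child links in page order, capped at max_links_to_follow."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # Drop repeated links while preserving their first-seen order (assumption).
    unique = list(dict.fromkeys(collector.links))
    if max_links_to_follow is None:
        return unique
    return unique[:max_links_to_follow]


page = (
    '<a href="/about">About</a>'
    '<a href="https://example.com/blog">Blog</a>'
    '<a href="/about">About again</a>'
    '<a href="contact.html">Contact</a>'
)
links = child_links(page, "https://example.com/", max_links_to_follow=2)
all_links = child_links(page, "https://example.com/")
```

In this sketch, leaving max_links_to_follow as None returns every child link found on the page.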

Scrape images

Scrape image content from the given links and save it to the specified output folder.

Parameters:
    - links_file (str): Path to a file containing links, with each link on a new line.
    - links_array (list): List of links to scrape images from.
    - save_folder (str): Folder to save the scraped images.
    - min_width (int): Minimum width of images to include. Defaults to None.
    - min_height (int): Minimum height of images to include. Defaults to None.
    - max_width (int): Maximum width of images to include. Defaults to None.
    - max_height (int): Maximum height of images to include. Defaults to None.
    - follow_child_links (bool): If True, follow and scrape images from child links found on the page. Defaults to False.
    - max_links_to_follow (int): Maximum number of links to follow when scraping child links. Defaults to None.
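
The size parameters above act as optional filters. A minimal sketch of that idea follows. `within_bounds` is a hypothetical helper, and treating the bounds as inclusive and measured in pixels is an assumption, not confirmed PyWebScrapr behaviour.

```python
def within_bounds(width, height, min_width=None, min_height=None,
                  max_width=None, max_height=None):
    """Return True if an image size passes every bound that is set.

    Hypothetical helper: a bound left as None is ignored, and bounds
    are assumed to be inclusive.
    """
    if min_width is not None and width < min_width:
        return False
    if min_height is not None and height < min_height:
        return False
    if max_width is not None and width > max_width:
        return False
    if max_height is not None and height > max_height:
        return False
    return True


# Example image sizes as (width, height); the file names are made up.
sizes = {
    "icon.png": (32, 32),
    "photo.jpg": (800, 600),
    "banner.png": (1200, 90),
}
kept = [
    name
    for name, (width, height) in sizes.items()
    if within_bounds(width, height, min_width=100, min_height=100)
]
```

Here only the image that meets both the minimum width and the minimum height is kept.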

Example:
    from pywebscrapr import scrape_images

    # Use links from a file and save images to the output_images folder.
    scrape_images(links_file='links.txt', save_folder='output_images', min_width=100, min_height=100)
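
The links_file parameter expects a plain text file with one link per line. The sketch below shows one way such a file could be written and read back with the standard library. `read_links` is a hypothetical helper, the URLs are placeholders, and skipping blank lines is an assumption about how the file is parsed.

```python
import tempfile
from pathlib import Path


def read_links(path):
    """Read a links file and return the non-empty lines, stripped."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Write a small links file into a temporary folder, one link per line.
links_path = Path(tempfile.mkdtemp()) / "links.txt"
links_path.write_text(
    "https://example.com\n\nhttps://example.org/gallery\n",
    encoding="utf-8",
)

links = read_links(links_path)
```

A file prepared this way could then be passed as links_file, or the resulting list could be passed as links_array.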

