PyWebScrapr Functions

Available functions:

  • scrape_text(links_file=None, links_array=None, output_file='output.txt', csv_output_file=None, remove_extra_whitespace=True): Scrape textual content from the given links and save it to the specified output file(s).

  • scrape_images(links_file=None, links_array=None, save_folder='images', min_width=None, min_height=None, max_width=None, max_height=None): Scrape image content from the given links and save it to the specified output folder.


Scrape text

Scrape textual content from the given links and save it to the specified output file(s).

Parameters:
    - links_file (str): Path to a file containing links, one link per line. Defaults to None.
    - links_array (list): List of links to scrape text from. Defaults to None.
    - output_file (str): File to save the scraped text to. Defaults to 'output.txt'.
    - csv_output_file (str): File to save the URL and text information to, in CSV format (optional). Defaults to None.
    - remove_extra_whitespace (bool): If True, remove extra whitespace and empty lines from the output. Defaults to True.
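
The links_file parameter expects a plain-text file with one link per line. A minimal sketch of preparing such a file with the standard library is shown here. Note that write_links_file is a hypothetical helper written for illustration, not part of PyWebScrapr, and the URLs are placeholders.

```python
from pathlib import Path


def write_links_file(path, links):
    """Write one link per line, skipping blanks and duplicates.

    Hypothetical helper for illustration; not part of PyWebScrapr.
    """
    seen = set()
    cleaned = []
    for link in links:
        link = link.strip()
        # Keep only non-empty links that have not been seen before.
        if link and link not in seen:
            seen.add(link)
            cleaned.append(link)
    # Each link goes on its own line, matching the links_file format.
    Path(path).write_text("".join(f"{link}\n" for link in cleaned), encoding="utf-8")
    return cleaned
```

A file written this way can then be passed to scrape_text through the links_file parameter.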

Example:
from pywebscrapr import scrape_text

# Using links from a file and saving text to output.txt
scrape_text(links_file='links.txt', output_file='output.txt')

# Using links directly and saving text to output.txt and csv_output.csv with extra whitespace removal
links = ['https://example.com/page1', 'https://example.com/page2']
scrape_text(links_array=links, output_file='output.txt', csv_output_file='csv_output.csv', remove_extra_whitespace=True)
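
To make the effect of remove_extra_whitespace more concrete, here is a rough sketch of one way such a cleanup could work. This is an assumption for illustration only: collapse_whitespace is a hypothetical function, and the cleanup PyWebScrapr actually performs may differ in its details.

```python
import re


def collapse_whitespace(text):
    """Collapse runs of whitespace within lines and drop empty lines.

    Illustrative only; not PyWebScrapr's own implementation.
    """
    cleaned_lines = []
    for line in text.splitlines():
        # Replace any run of spaces or tabs with a single space, then trim.
        line = re.sub(r"\s+", " ", line).strip()
        if line:  # Skip lines that are empty after trimming.
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


raw = "  Hello   world  \n\n\n\tSecond   line \n"
print(collapse_whitespace(raw))
# Hello world
# Second line
```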

Scrape images

Scrape image content from the given links and save it to the specified output folder.

Parameters:
    - links_file (str): Path to a file containing links, one link per line. Defaults to None.
    - links_array (list): List of links to scrape images from. Defaults to None.
    - save_folder (str): Folder to save the scraped images to. Defaults to 'images'.
    - min_width (int): Minimum width of images to include (optional).
    - min_height (int): Minimum height of images to include (optional).
    - max_width (int): Maximum width of images to include (optional).
    - max_height (int): Maximum height of images to include (optional).
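
The four size parameters above act as optional bounds on image dimensions. The sketch below shows how such a filter could be expressed. It assumes that a value of None means "no limit" and that the bounds are inclusive; both are assumptions rather than documented behavior, and within_size_limits is a hypothetical helper, not a PyWebScrapr function.

```python
def within_size_limits(width, height, min_width=None, min_height=None,
                       max_width=None, max_height=None):
    """Return True if an image of the given size passes every limit that is set.

    Hypothetical helper; assumes inclusive bounds and None meaning "no limit".
    """
    if min_width is not None and width < min_width:
        return False
    if min_height is not None and height < min_height:
        return False
    if max_width is not None and width > max_width:
        return False
    if max_height is not None and height > max_height:
        return False
    return True


# A 640x480 image passes a 100x100 minimum.
print(within_size_limits(640, 480, min_width=100, min_height=100))  # True
# A 50-pixel-wide image fails a 100-pixel minimum width.
print(within_size_limits(50, 480, min_width=100))  # False
# A 4000-pixel-wide image fails a 1920-pixel maximum width.
print(within_size_limits(4000, 3000, max_width=1920))  # False
```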

Example:
from pywebscrapr import scrape_images

# Using links from a file, saving images to the output_images folder, and including only images at least 100 pixels wide and 100 pixels tall
scrape_images(links_file='links.txt', save_folder='output_images', min_width=100, min_height=100)
