PyWebScrapr Functions

Available functions:

  • scrape_text(links_file=None, links_array=None, output_file='output.txt', csv_output_file=None, remove_extra_whitespace=True, remove_duplicates=False, similarity_threshold=0.8, elements_to_scrape='text', follow_child_links=False, max_links_to_follow=None): Scrape textual content from the given links and save it to the specified output file(s).

  • scrape_images(links_file=None, links_array=None, save_folder='images', min_width=None, min_height=None, max_width=None, max_height=None, follow_child_links=False, max_links_to_follow=None): Scrape image content from the given links and save it to the specified output folder.


Scrape text

Scrape textual content from the given links and save it to the specified output file(s).

Parameters:
    - links_file (str): Path to a file containing links, with each link on a new line.
    - links_array (list): List of links to scrape content from.
    - output_file (str): File to save the scraped content.
    - csv_output_file (str): File to save the URL and scraped information in CSV format.
    - remove_extra_whitespace (bool): If True, remove extra whitespace and empty lines from the output. Defaults to True.
    - remove_duplicates (bool): If True, remove duplicate or highly similar paragraphs. Defaults to False.
    - similarity_threshold (float): Similarity percentage (0-1) above which paragraphs are considered duplicates. Defaults to 0.8.
    - elements_to_scrape (str): Type of content to scrape. Defaults to 'text'. Options are:
        'text' (default) - Scrape visible textual content.
        'content' - Scrape the `content` attribute of meta tags.
        'unseen' - Scrape hidden or non-visible elements (e.g., meta tags, script data).
        'links' - Scrape `href` or `src` attributes of anchor and media elements.
    - follow_child_links (bool): If True, follow and scrape content from child links found on the page.
    - max_links_to_follow (int): Maximum number of links to follow when scraping child links.

Example:
    from pywebscrapr import scrape_text

    scrape_text(links_array=['https://example.com'], similarity_threshold=0.9)
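The library's internal comparison is not documented here, but the behavior that remove_duplicates and similarity_threshold describe can be illustrated with the standard library's difflib. This is only a sketch of the idea, not PyWebScrapr's actual code:

```python
# Illustrative sketch only -- not PyWebScrapr's actual implementation.
# Shows how a similarity_threshold of 0.8 might flag near-duplicate paragraphs.
from difflib import SequenceMatcher

def dedupe_paragraphs(paragraphs, similarity_threshold=0.8):
    """Keep each paragraph only if it is not too similar to any kept one."""
    kept = []
    for para in paragraphs:
        if all(SequenceMatcher(None, para, seen).ratio() < similarity_threshold
               for seen in kept):
            kept.append(para)
    return kept

paragraphs = [
    "Example Domain is for use in documents.",
    "Example Domain is for use in documents!",  # near-duplicate
    "This domain may be used without permission.",
]
print(dedupe_paragraphs(paragraphs))  # the near-duplicate is dropped
```

A higher similarity_threshold is stricter about what counts as a duplicate: at 0.9, only nearly identical paragraphs are removed, while at 0.5 loosely related paragraphs would also be dropped.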

Scrape images

Scrape image content from the given links and save it to the specified output folder.

Parameters:
    - links_file (str): Path to a file containing links, with each link on a new line.
    - links_array (list): List of links to scrape images from.
    - save_folder (str): Folder to save the scraped images.
    - min_width (int): Minimum width of images to include (optional).
    - min_height (int): Minimum height of images to include (optional).
    - max_width (int): Maximum width of images to include (optional).
    - max_height (int): Maximum height of images to include (optional).
    - follow_child_links (bool): If True, follow and scrape images from child links found on the page (optional).
    - max_links_to_follow (int): Maximum number of links to follow when scraping child links (optional).
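The four size parameters act as bounds on which images are saved. As an illustration of the check they describe (a sketch under the assumption that the bounds are inclusive, not PyWebScrapr's actual code):

```python
# Illustrative sketch -- not PyWebScrapr's actual implementation.
# Shows the size filtering that min/max width/height parameters describe.
def passes_size_filter(width, height, min_width=None, min_height=None,
                       max_width=None, max_height=None):
    """Return True if an image's dimensions fall inside the given bounds."""
    if min_width is not None and width < min_width:
        return False
    if min_height is not None and height < min_height:
        return False
    if max_width is not None and width > max_width:
        return False
    if max_height is not None and height > max_height:
        return False
    return True

print(passes_size_filter(200, 150, min_width=100, min_height=100))  # True
print(passes_size_filter(80, 150, min_width=100))                   # False
```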

Example:
    from pywebscrapr import scrape_images

    # Using links from a file and saving images to the output_images folder.
    scrape_images(links_file='links.txt', save_folder='output_images', min_width=100, min_height=100)
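The links_file format is one URL per line. A minimal sketch of preparing such a file with the standard library (the URLs are placeholders, and the scraping call itself is commented out since it performs network requests):

```python
# Write a links file in the one-URL-per-line format that links_file expects.
from pathlib import Path

urls = [
    "https://example.com/gallery",
    "https://example.com/archive",
]
Path("links.txt").write_text("\n".join(urls) + "\n", encoding="utf-8")

# The scraping call itself needs network access, so it is not run here:
# from pywebscrapr import scrape_images
# scrape_images(links_file='links.txt', save_folder='output_images')
```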