Web scraping
Katana, Scraping Fish, shot-scraper (example), Colly & Spider are neat.
Currently exploring Playwright together with AutoScraper for my scraping needs. Crawlee looks great too.
Links
- Scrapy - Fast high-level web crawling & scraping framework for Python. (Web) (Docs) (Awesome Scrapy) (Random proxy middleware)
- Scrapyd - Service for running Scrapy spiders. (Docs)
- ScrapydWeb - Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
- Simple Scraper - Extract data from any website in seconds.
- ScrapingBee - Web Scraping API.
- Easy web scraping with Scrapy (2019)
- A guide to Web Scraping without getting blocked in 2020
- Crawlab - Distributed web crawler admin platform for spiders management regardless of languages and frameworks.
- hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
- JobFunnel - Tool for scraping job websites, and filtering and reviewing the job listings.
- You-Get - Tiny command-line utility to download media contents (videos, audios, images) from the Web.
- Universal Reddit Scraper - Scrape Subreddits, Redditors, and comments on posts. A command-line tool written in Python.
- Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js.
- Ask HN: Best practices for ethical web scraping? (2020)
- Newscatcher - Programmatically collect normalized news from (almost) any website. (Code)
- scrapio - Simple and easy-to-use scraper and crawler in Go.
- Colly - Elegant Scraper and Crawler Framework for Go. (Tutorial)
- Python Web Scraping with Virtual Private Networks (2020)
- extract-news-api - Flask code to deploy an API that pulls structured data from online news articles.
- Web Scraper - Scrape websites for text by CSS selector.
- List all the broken links on your website
- Creating a Robust, Reusable Link-Checker (2020)
- micawber - Small library for extracting rich content from urls.
- Spider Pro - Easy and cheap way to scrape the internet. (HN)
- Website Sitemap Parser
- rget - Download URLs and verify the contents against a publicly recorded cryptographic log.
- yarl - Yet another URL library.
- Apify - Web Scraping, Data Extraction and Automation. (GitHub)
- Gumbo - Pure-C HTML5 parser.
- What is a present-day web scraping in 2020?
- Dataflow Kit - Web scraping and data extraction tools.
- Awesome Web Scraping
- Common Crawl - Open repository of web crawl data that can be accessed and analyzed by anyone. (HN) (Lobsters)
- Analysing Petabytes of Websites using Common Crawl (2017)
- Cognito Common Crawl - Search the common crawl using lambda functions.
- Awesome Open Source Javascript Projects for Web Scraping (2020)
- ScrapingAnt - All in One Scraping API. Rotating Proxies. Headless Chrome.
- Django Dynamic Scraper - Creating Scrapy scrapers via the Django admin interface.
- AutoScraper - Smart, Automatic, Fast and Lightweight Web Scraper for Python.
- Spidey - Dead-simple crawler which focuses on ease of use and speed. Returns a list of all URLs of a web page.
- Scraping News and Articles From Public APIs with Python (2020)
- LinkedIn Scraper
- ScrapeOwl - Simple and affordable web scraping API.
- Pholcidae - Tiny python web crawler.
- Booking site web scraper - Downloads all of the accommodations for the chosen country and saves them in a file.
- Reddit Media Downloader - Scrapes Reddit to download media of your choice.
- Web scraping with JS (2020) (HN)
- Web scraping that just works with OpenFaaS with Puppeteer (2020)
- What Happened to XPath? (2020) (HN)
- ScrapingHub - Turn web content into useful data. (GitHub)
- extruct - Library for extracting embedded metadata from HTML markup.
- Introduction to Scraping in Python (2020)
- Test driving a HackerNews scraper with Node.js (2020)
- SecretAgent - Web browser that's built for scraping. (Web)
- Ulixee - Turns every website into an open API. Access any dataset on the world wide web. (GitHub)
- Floki - Simple HTML parser that enables search for nodes using CSS selectors.
- NYT Vote Scraper - Scrapes the NYT Votes Remaining Page JSON and commits it back to this repo. Nice use of GitHub actions for git scraping.
- Instagram Scraper - Scrapes an Instagram user's photos and videos.
- Inventory Hunter - Get notified as soon as your next CPU, GPU, or game console is in stock.
- Guide on preventing Website Scraping
- Bibliographies of the Bibliometric-enhanced Information Retrieval workshops and related other workshops
- news-please - Open source, easy-to-use news crawler that extracts structured information from almost any news website.
- Web crawling with Python (2020)
- Metascraper - Scrape data from websites using Open Graph, HTML metadata & fallbacks. (Docs)
- Instaloader - Download pictures (or videos) along with their captions and other metadata from Instagram. (Docs)
- Go-Trafilatura - Go package and command-line tool which seamlessly downloads, parses, and scrapes web page data.
- htmldate - Find the publication date of web pages.
- Filtering links to gather texts on the web (2020)
- Evaluating scraping and text extraction tools for Python (2020)
- Using sitemaps to crawl websites (2019)
- Evaluation of date extraction tools for Python (2020)
- jusText - Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
- sumy - Module for automatic summarization of text documents and HTML pages.
- Voyager - Write your own web crawler/scraper as a state machine in Rust.
- Trandoshan - Fast, highly configurable, cloud native dark web crawler.
- ralger - Makes it easy to scrape a website with R.
- Scraping HN content with declarative programming
- snscrape - Social networking service scraper in Python. (Fork)
- qwarc - Framework for rapidly archiving a large number of URLs with little overhead.
- select.rs - Rust library to extract useful data from HTML documents, suitable for web scraping.
- Scrapera - Provides access to a variety of scraper scripts for most commonly used machine learning and data science domains.
- Visual scraping with Elixir and Crawly (2021)
- Headless Chrome Crawler - Distributed crawler powered by Headless Chrome.
- Tips for reliable web automation and scraping selectors (2021) (HN)
- Web Crawler for scraping Financial data (Article)
- Web Scraping 101 with Python (2021) (HN) (HN)
- Automatio - No-code Web Automation Tool. Automation Tool to Extract Data From Any Website.
- Scaling up a Serverless Web Crawler and Search Engine (2021)
- crawler-user-agents - List of HTTP user-agents used by robots, crawlers, and spiders, as a single JSON file.
- ant - Web crawler for Go.
- SearchScraperAPI - Implementation of an API, which allows you to scrape Google, Bing, Yandex, and Qwant.
- Scala Scraper - Scala library for scraping content from HTML pages.
- Next.js Web Scraper Playground - Build and test your own web scraper APIs with Next.js API Routes and cheerio. (Web)
- Scrapers List
- Trafilatura - Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments). (HN)
- Rarchy - Visual Sitemaps & Website Planning Tool. (HN)
- CloudProxy - Hide your scraper's IP behind the cloud. (HN)
- FlareSolverr - Proxy server to bypass Cloudflare protection.
- Schema API for the Semantic Web - Extract structured content from the semantic web.
- DataHen Till - Standalone tool that runs alongside your web scraper, and instantly makes your existing web scraper scalable, maintainable and unblockable. (Web) (HN)
- Mastering Web Scraping in Python: Crawling from Scratch (2021) (HN)
- Data-Mining Wikipedia for Fun and Profit (2021) (HN)
- Wikidata or Scraping Wikipedia (HN)
- pyspider - Powerful Spider (Web Crawler) System in Python. (Docs)
- Python-Goose - HTML Content / Article Extractor, web scraping library in Python.
- Dyer - Designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
- How to Crawl the Web with Scrapy (2021) (HN)
- PageMetaScraper - Page metadata scraper with several fallback strategies.
- cariddi - Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more.
- Super-Simple Scraper - Crawler/scraper based on Go + colly, configurable via JSON.
- Gospider - Fast web spider written in Go.
- The State Of Web Scraping in 2021 (HN)
- scrapy.js - Web Scraping library for JavaScript built using BeautifulSoup4.
- PHP Goose - Readability / HTML Content / Article Extractor & Web Scraping library written in PHP.
- Web scraping by watching requests (2021)
- Effortless Crawling with Scrapy with one method (2021)
- Avoiding bot detection: How to scrape the web without getting blocked?
- crawley - Crawls web pages and prints any link it can find.
- grab-site - Archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns.
- cloudscraper - Python module to bypass Cloudflare's anti-bot page.
- Papercut - Scraping/crawling library for Node.js, written in Typescript.
- Marple - Collect links to profiles by username through search engines.
- Web Scraping with Go (2021) (Reddit)
- Maigret - Collect a dossier on a person by username from thousands of sites.
- Notes on Writing Web Scrapers (2021)
- Scraping Websites With Logins (2021) (Reddit)
- Skan.jl - Scan web pages for changes using Julia & GitHub Actions.
- cloudflare-scraper - Package to bypass Cloudflare's protection.
- scrapy-poet - Page Object pattern for Scrapy.
- Go Download Web - Download an entire website with Go.
- linkcheck - Fast link checker.
- scrapli - Fast, flexible, sync/async, Python 3.6+ screen scraping client specifically for network devices.
- scrapligo - scrapli, but in Go.
- waybacked - Get URLs from the Wayback Machine. Able to handle large outputs.
- changedetection.io - Self-Hosted, Open Source, Change Monitoring of Web Pages.
- Jiu - Detect new images and video on social media feeds and dispatch webhooks on updates.
- Building a scalable scraper in Rust (2021)
- Instagram Scraper - Allows you to scrape posts from a user's profile page, hashtag page, or place.
- Scraping without JavaScript using Chromium on AWS Lambda: The Novel (2022)
- The State of Web Scraping 2022 (HN)
- Chrome File Downloader - Go library for scraping or downloading files bypassing Cloudflare protection and browser checks.
- Mechaml - OCaml functional web scraping library.
- WikiDump Indexer and Search - Wikipedia dump parser and indexer with search functionality. Made for Information Retrieval and Extraction course.
- Xidel - Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching.
- web-poet - Web scraping Page Objects core library.
- Are Product Hunt's featured products still online today? (2022) (HN)
- html2data - Library and CLI for extracting data from HTML via CSS selectors.
- Hyperlink - Detect invalid and inefficient links on your webpages. Works with local files or websites, on the command line and as a Node library.
- requests-ip-rotator - Python library to utilize AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping and brute forcing.
- Pinterest Web Scraper - Scraping Visually Similar Images from Pinterest.
- gazpacho - Simple, fast, and modern web scraping library. (Docs)
- Hitomi Downloader - Desktop utility to download images/videos/music/text from various websites, and more.
- Pinterest Downloader - Download all images/videos from Pinterest user/board/section.
- More notes on writing web scrapers (2022) (HN)
- scraperlite - Scrape text and HTML based on CSS selectors and save contents to a SQLite database.
- Browsertrix Crawler - Run a high-fidelity browser-based crawler in a single Docker container.
- pafy - Python library to download YouTube content and retrieve metadata.
- So you want to Scrape like the Big Boys? (2021)
- Dude - Simple framework for writing a web scraper using Python decorators.
- myfaveTT - Download all your TikTok Likes. (HN)
- Scraping web pages from the command line with shot-scraper (2022) (HN)
- Apify SDK - Scalable web crawling and scraping library for JavaScript.
- Extracting web page content using Readability.js and shot-scraper (2022)
- Texting Robots: Taming robots.txt with Rust and 34 million tests (2022) (Reddit)
- Scraping Instagram (2022) (HN)
- LinkedIn Scraper - Scrapes LinkedIn user data.
- Aeon - Scan the internet for your personal information and modify or remove it.
- article-parser - Extract the main article, main image and metadata from a URL.
- WebParsy - Node.js library and CLI for scraping websites using Puppeteer (or not) and YAML definitions.
- Hext - Domain-specific language for extracting structured data from HTML documents.
- AutoScrape - Automated, programming-free web scraper for interactive sites.
- Portia - Tool that allows you to visually scrape websites without any programming knowledge required.
- Surgeon - Declarative DOM extraction expression evaluator.
- Ayakashi - Next generation web scraping framework.
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data (2019) (Code)
- How To Use HTMLRewriter for Web Scraping (2022)
- Brozzler - Distributed browser-based web crawler.
- oEmbed Parser - Extract oEmbed data from a given webpage.
- Proxy scraper and checker - Scrape more than 1K HTTP proxies in less than 2 seconds.
- Toutatis - Tool that allows you to extract information from Instagram accounts, such as emails, phone numbers and more.
- Crawl Original Google Images & YouTube Videos
- OnlyFans DataScraper - Scrape all the media from an OnlyFans account.
- Shot Scraper Template - Quickly create a new GitHub repository that takes automated screenshots of a web page using shot-scraper.
- Web Scraping via JavaScript Runtime Heap Snapshots (2022) (HN) (HN)
- All the Places - Set of spiders and scrapers to extract location information from places that post their location on the internet.
- Spider - Multithreaded Web spider crawler written in Rust.
- Scrapism 2022 course
- Libextract - Extract data from websites using basic statistical magic.
- TikTok Scraper & Downloader - Download video posts, collect user/trend/hashtag/music feed metadata, sign URLs, etc.
- Scraping Airbnb (2022)
- Shears - Functional web scraping in TS.
- Web scraping with Python open knowledge (HN)
- Web scraping Proxy Library for Scrapy (HN)
- SLRP - Rotating open proxy multiplexer.
- Node.js web scraper
- WarcDB - Web crawl data as SQLite databases. (HN)
- How to scrape Zillow with Python and Scrapy (2022)
- Scraply - Simple DOM scraper to fetch information from any HTML based website.
- coURLan - Clean, filter, normalize, and sample URLs.
- Wpull - Wget-compatible web downloader and crawler.
- City Scrapers - Scrape, standardize and share public meetings from local government websites.
- Lambda Soup - Functional HTML scraping and rewriting with CSS in OCaml.
- linkchecker - Simple CLI tool to find all broken links in your website.
- OSINT - Collections of tools and methods created to aid in OSINT collection.
- Ask HN: What are the best tools for web scraping in 2022?
- Crawlee - Web scraping and browser automation library for Node.js. (HN)
- Facebook Scraper - Scrape Facebook public pages without an API key.
- Scraping a website protected by Cloudflare (2022)
- SABLE - Scraping Assisted by Learning.
- HTML to Text
- Crawly - High-level web crawling & scraping framework for Elixir.
- Scrapoxy - Hides your scraper behind a cloud. (Web)
- Instahunter - CLI OSINT app that can fetch data from Instagram's Web API without authentication.
- sico - Sitemap comparison tool.
- Scalpel - High level web scraping library for Haskell.
- Unfurl - Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js.
- Katana - Crawling and spidering framework. (HN)
- PyWebCopy - Locally saves webpages to your hard disk with images, CSS, JS & links as is.
- Unfurl - Extract and Visualize Data from URLs.
- Indieweb site crawler and MF2 data collection tool
- Evaluating Mechanical Keyboard Delivery Estimates with Python Web Scraping (2022)
- LinkedIn Scraper
- Google Image Scraper - Library to scrape Google Images.
- CLI Google crawler written in Go
- HTML Semantic Seg - Tool to create a dataset of semantic segmentation on website screenshots from their DOM.
- fingerprint-suite - Browser fingerprinting tools for anonymizing your scrapers.
- Linvo Linkedin Scraper
- Cached Chrome Top Million Websites (HN)
- site_icons - Efficient website icon scraper for Rust, with sizes, ordering, and WASM support.
- googlesearch.py - Google search scraper in Python.
- Amazon Product API Scraping - Scrape products from Amazon search results or reviews for a specific product.
- A Year of Writing about Web Scraping in Review (2023)
- goq - Declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library.
- Facepager - Fetching publicly available data from YouTube, Twitter and other websites on the basis of APIs and web scraping.
- MrScraper - Visual web-scraping tool. (HN)
- Fitter - Universal scraper for Websites and APIs.
- Cloudflare Scrape - Python module to bypass Cloudflare's anti-bot page.
- page-fetch - Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files.
- Crul - Query Any Webpage or API. (HN)
- WikiScrape - Download Wikipedia pages as beautiful text files.
- crawler - gRPC web crawler turbo charged for performance.
- unfluff - Automatically extract body content (and other cool stuff) from an HTML document.
- cdp-scrapers - Scratchpad for scraper development and general utilities.
- scrapeghost - Library for scraping websites using OpenAI's GPT.
- Black Maria - Python package for scraping in Natural language.
- Browser AI agent, using GPT-4
- UltimaScraper - Scrape all the media from an OnlyFans account.
- From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps (2023)
- Wikipedia Article Reference Inventory WARI - Import workflows for the Wikipedia Citations Database.
- ScrapeGPT - Web scraper builder that uses GPT-4 to automatically generate Python scripts for scraping websites based on user input.
- Website Scraper and Vectorizer (Reddit)
- metadata-scraper - JavaScript library for scraping/parsing metadata from a web page.
- Gumroad Scraper and Website Generator
- HackerNews Alert - When a Hacker News post contains a keyword you are interested in, you receive a Slack message.
- MrScraper AI - Dead simple web scraper (powered by AI). (HN)
- HTML Table to JSON (Tweet)
- Forum-dl - Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC.
- Basic Statistics of Common Crawl Monthly Archives
- Webpecker - Scrape useful data from social networks to search engines effortlessly.
- estela - Elastic web scraping cluster.
- Scrape the Web with entities extraction using OpenAI Function (Reddit)
- top-crawler-agents - List of common crawler user agents useful for retrieving metadata from links.
- Web Scraping With Rust
- GPTBot - OpenAI’s Web Crawler. (HN)
- Robots Parser - Node.js robots.txt parser with support for wildcard matching.
- Tubeup - Use yt-dlp to download video and upload to the Internet Archive with metadata.
- scrape - Simple, higher level interface for Go web scraping.
- Crystal Web Archiver - Downloads high fidelity copies of websites for long-term archival.
- Writing a simple web scraper using bash (2023) (Lobsters)
- CloudProxy - Proxy server to bypass Cloudflare protection.