Web scraping

Katana, Scraping Fish, shot-scraper (example), Colly & Spider are neat.

Currently exploring Playwright together with AutoScraper for my scraping needs. Crawlee looks great too.

Links

Scrapy - Fast high-level web crawling & scraping framework for Python. (Web) (Docs) (Awesome Scrapy) (Random proxy middleware)
Scrapyd - Service for running Scrapy spiders. (Docs)
ScrapydWeb - Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
Simple Scraper - Extract data from any website in seconds.
ScrapingBee - Web Scraping API.
Easy web scraping with Scrapy (2019)
A guide to Web Scraping without getting blocked in 2020
Crawlab - Distributed web crawler admin platform for spiders management regardless of languages and frameworks.
hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
JobFunnel - Tool for scraping job websites, and filtering and reviewing the job listings.
You-Get - Tiny command-line utility to download media contents (videos, audios, images) from the Web.
Universal Reddit Scraper - Scrape Subreddits, Redditors, and comments on posts. A command-line tool written in Python.
Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js.
Ask HN: Best practices for ethical web scraping? (2020)
Newscatcher - Programmatically collect normalized news from (almost) any website. (Code)
scrapio - Simple and easy-to-use scraper and crawler in Go.
Colly - Elegant Scraper and Crawler Framework for Go. (Tutorial)
Python Web Scraping with Virtual Private Networks (2020)
extract-news-api - Flask code to deploy an API that pulls structured data from online news articles.
Web Scraper - Scrape websites for text by CSS selector.
List all the broken links on your website
Creating a Robust, Reusable Link-Checker (2020)
micawber - Small library for extracting rich content from urls.
Spider Pro - Easy and cheap way to scrape the internet. (HN)
Website Sitemap Parser
rget - Download URLs and verify the contents against a publicly recorded cryptographic log.
yarl - Yet another URL library.
Apify - Web Scraping, Data Extraction and Automation. (GitHub)
Gumbo - Pure-C HTML5 parser.
What is a present-day web scraping in 2020?
Dataflow Kit - Web scraping and data extraction tools.
Awesome Web Scraping
Common Crawl - Open repository of web crawl data that can be accessed and analyzed by anyone. (HN) (Lobsters)
Analysing Petabytes of Websites using Common Crawl (2017)
Cognito Common Crawl - Search the common crawl using lambda functions.
Awesome Open Source Javascript Projects for Web Scraping (2020)
ScrapingAnt - All in One Scraping API. Rotating Proxies. Headless Chrome.
Django Dynamic Scraper - Creating Scrapy scrapers via the Django admin interface.
AutoScraper - Smart, Automatic, Fast and Lightweight Web Scraper for Python.
Spidey - Dead-simple crawler which focuses on ease of use and speed. Returns a list of all URLs of a web page.
Scraping News and Articles From Public APIs with Python (2020)
LinkedIn Scraper
ScrapeOwl - Simple and affordable web scraping API.
Pholcidae - Tiny python web crawler.
Booking site web scraper - Downloads all of the accommodations for the chosen country and saves them in a file.
Reddit Media Downloader - Scrapes Reddit to download media of your choice.
Web scraping with JS (2020) (HN)
Web scraping that just works with OpenFaaS with Puppeteer (2020)
What Happened to XPath? (2020) (HN)
ScrapingHub - Turn web content into useful data. (GitHub)
extruct - Library for extracting embedded metadata from HTML markup.
Introduction to Scraping in Python (2020)
Test driving a HackerNews scraper with Node.js (2020)
SecretAgent - Web browser that's built for scraping. (Web)
Ulixee - Turns every website into an open API. Access any dataset on the world wide web. (GitHub)
Floki - Simple HTML parser that enables search for nodes using CSS selectors.
NYT Vote Scraper - Scrapes the NYT Votes Remaining Page JSON and commits it back to this repo. Nice use of GitHub actions for git scraping.
Instagram Scraper - Scrapes an Instagram user's photos and videos.
Inventory Hunter - Get notified as soon as your next CPU, GPU, or game console is in stock.
Guide on preventing Website Scraping
Bibliographies of the Bibliometric-enhanced Information Retrieval workshops and related other workshops
news-please - Open source, easy-to-use news crawler that extracts structured information from almost any news website.
Web crawling with Python (2020)
Metascraper - Scrape data from websites using Open Graph, HTML metadata & fallbacks. (Docs)
Instaloader - Download pictures (or videos) along with their captions and other metadata from Instagram. (Docs)
Go-Trafilatura - Go package and command-line tool which seamlessly downloads, parses, and scrapes web page data.
htmldate - Find the publication date of web pages.
Filtering links to gather texts on the web (2020)
Evaluating scraping and text extraction tools for Python (2020)
Using sitemaps to crawl websites (2019)
Evaluation of date extraction tools for Python (2020)
jusText - Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
sumy - Module for automatic summarization of text documents and HTML pages.
Voyager - Write your own web crawler/scraper as a state machine in rust.
Trandoshan - Fast, highly configurable, cloud native dark web crawler.
ralger - Makes it easy to scrape a website with R.
Scraping HN content with declarative programming
snscrape - Social networking service scraper in Python. (Fork)
qwarc - Framework for rapidly archiving a large number of URLs with little overhead.
select.rs - Rust library to extract useful data from HTML documents, suitable for web scraping.
Scrapera - Provides access to a variety of scraper scripts for most commonly used machine learning and data science domains.
Visual scraping with Elixir and Crawly (2021)
Headless Chrome Crawler - Distributed crawler powered by Headless Chrome.
Tips for reliable web automation and scraping selectors (2021) (HN)
Web Crawler for scraping Financial data (Article)
Web Scraping 101 with Python (2021) (HN) (HN)
Automatio - No-code Web Automation Tool. Automation Tool to Extract Data From Any Website.
Scaling up a Serverless Web Crawler and Search Engine (2021)
crawler-user-agents - List of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.
ant - Web crawler for Go.
SearchScraperAPI - Implementation of an API, which allows you to scrape Google, Bing, Yandex, and Qwant.
Scala Scraper - Scala library for scraping content from HTML pages.
Next.js Web Scraper Playground - Build and test your own web scraper APIs with Next.js API Routes and cheerio. (Web)
Scrapers List
Trafilatura - Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments). (HN)
Rarchy - Visual Sitemaps & Website Planning Tool. (HN)
CloudProxy - Hide your scraper's IP behind the cloud. (HN)
FlareSolverr - Proxy server to bypass Cloudflare protection.
Schema API for the Semantic Web - Extract structured content from the semantic web.
DataHen Till - Standalone tool that runs alongside your web scraper, and instantly makes your existing web scraper scalable, maintainable and unblockable. (Web) (HN)
Mastering Web Scraping in Python: Crawling from Scratch (2021) (HN)
Data-Mining Wikipedia for Fun and Profit (2021) (HN)
Wikidata or Scraping Wikipedia (HN)
pyspider - Powerful Spider (Web Crawler) System in Python. (Docs)
Python-Goose - HTML Content / Article Extractor, web scraping lib in Python.
Dyer - Designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
How to Crawl the Web with Scrapy (2021) (HN)
PageMetaScraper - Page metadata scraper with several fallback strategies.
cariddi - Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more.
Super-Simple Scraper - Crawler/scraper based on Go + colly, configurable via JSON.
Gospider - Fast web spider written in Go.
The State Of Web Scraping in 2021 (HN)
scrapy.js - Web Scraping library for JavaScript built using BeautifulSoup4.
PHP Goose - Readability / HTML Content / Article Extractor & Web Scraping library written in PHP.
Web scraping by watching requests (2021)
Effortless Crawling with Scrapy with one method (2021)
Avoiding bot detection: How to scrape the web without getting blocked?
crawley - Crawls web pages and prints any link it can find.
grab-site - Archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns.
cloudscraper - Python module to bypass Cloudflare's anti-bot page.
Papercut - Scraping/crawling library for Node.js, written in Typescript.
Marple - Collect links to profiles by username through search engines.
Web Scraping with Go (2021) (Reddit)
Maigret - Collect a dossier on a person by username from thousands of sites.
Notes on Writing Web Scrapers (2021)
Scraping Websites With Logins (2021) (Reddit)
Skan.jl - Scan web pages for changes using Julia & GitHub Actions.
cloudflare-scraper - Package to bypass Cloudflare's protection.
scrapy-poet - Page Object pattern for Scrapy.
Go Download Web - Download an entire website with Go.
linkcheck - Fast link checker.
scrapli - Fast, flexible, sync/async, Python 3.6+ screen scraping client specifically for network devices.
scrapligo - scrapli, but in Go.
waybacked - Get URLs from the Wayback Machine. Able to handle large outputs.
changedetection.io - Self-Hosted, Open Source, Change Monitoring of Web Pages.
Jiu - Detect new images and video on social media feeds and dispatch webhooks on updates.
Building a scalable scraper in Rust (2021)
Instagram Scraper - Allows you to scrape posts from a user's profile page, hashtag page, or place.
Scraping without JavaScript using Chromium on AWS Lambda: The Novel (2022)
The State of Web Scraping 2022 (HN)
Chrome File Downloader - Go library for scraping or downloading files bypassing Cloudflare protection and browser checks.
Mechaml - OCaml functional web scraping library.
WikiDump Indexer and Search - Wikipedia dump parser and indexer with search functionality. Made for Information Retrieval and Extraction course.
Xidel - Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching.
web-poet - Web scraping Page Objects core library.
Are Product Hunt's featured products still online today? (2022) (HN)
html2data - Library and cli for extracting data from HTML via CSS selectors.
Hyperlink - Detect invalid and inefficient links on your webpages. Works with local files or websites, on the command line and as a node library.
requests-ip-rotator - Python library to utilize AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping and brute forcing.
Pinterest Web Scraper - Scraping Visually Similar Images from Pinterest.
gazpacho - Simple, fast, and modern web scraping library. (Docs)
Hitomi Downloader - Desktop utility to download images/videos/music/text from various websites, and more.
Pinterest Downloader - Download all images/videos from Pinterest user/board/section.
More notes on writing web scrapers (2022) (HN)
scraperlite - Scrape text and HTML based on CSS selectors and save contents to a SQLite database.
Browsertrix Crawler - Run a high-fidelity browser-based crawler in a single Docker container.
pafy - Python library to download YouTube content and retrieve metadata.
So you want to Scrape like the Big Boys? (2021)
Dude - Simple framework for writing a web scraper using Python decorators.
myfaveTT - Download all your TikTok Likes. (HN)
Scraping web pages from the command line with shot-scraper (2022) (HN)
Apify SDK - Scalable web crawling and scraping library for JavaScript.
Extracting web page content using Readability.js and shot-scraper (2022)
Texting Robots: Taming robots.txt with Rust and 34 million tests (2022) (Reddit)
Scraping Instagram (2022) (HN)
Linkedin Scraper - Scrapes LinkedIn user data.
Aeon - Scan the internet for your personal information and modify or remove it.
article-parser - Extract main article, main image and meta data from URL.
WebParsy - Node.js library and CLI for scraping websites using Puppeteer (or not) and YAML definitions.
Hext - Domain-specific language for extracting structured data from HTML documents.
AutoScrape - Automated, programming-free web scraper for interactive sites.
Portia - Tool that allows you to visually scrape websites without any programming knowledge required.
Surgeon - Declarative DOM extraction expression evaluator.
Ayakashi - Next generation web scraping framework.
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data (2019) (Code)
How To Use HTMLRewriter for Web Scraping (2022)
Brozzler - Distributed browser-based web crawler.
oEmbed Parser - Extract oEmbed data from given webpage.
Proxy scraper and checker - Scrape more than 1K HTTP proxies in less than 2 seconds.
Toutatis - Tool that allows you to extract information from Instagram accounts such as e-mails, phone numbers and more.
Crawl Original Google Images & Youtube Videos
OnlyFans DataScraper - Scrape all the media from an OnlyFans account.
Shot Scraper Template - Quickly create a new GitHub repository that takes automated screenshots of a web page using shot-scraper.
Web Scraping via JavaScript Runtime Heap Snapshots (2022) (HN) (HN)
All the Places - Set of spiders and scrapers to extract location information from places that post their location on the internet.
Spider - Multithreaded Web spider crawler written in Rust.
Scrapism 2022 course
Libextract - Extract data from websites using basic statistical magic.
TikTok Scraper & Downloader - Download video posts, collect user/trend/hashtag/music feed metadata, sign URLs, etc.
Scraping Airbnb (2022)
Shears - Functional web scraping in TS.
Web scraping with Python open knowledge (HN)
Web scraping Proxy Library for Scrapy (HN)
SLRP - Rotating open proxy multiplexer.
Node.js web scraper
WarcDB - Web crawl data as SQLite databases. (HN)
How to scrape Zillow with Python and Scrapy (2022)
Scraply - Simple DOM scraper to fetch information from any HTML based website.
coURLan - Clean, filter, normalize, and sample URLs.
Wpull - Wget-compatible web downloader and crawler.
City Scrapers - Scrape, standardize and share public meetings from local government websites.
Lambda Soup - Functional HTML scraping and rewriting with CSS in OCaml.
linkchecker - Simple CLI tool to find all broken links in your website.
OSINT - Collections of tools and methods created to aid in OSINT collection.
Ask HN: What are the best tools for web scraping in 2022?
Crawlee - Web scraping and browser automation library for Node.js. (HN)
Facebook Scraper - Scrape Facebook public pages without an API key.
Scraping a website protected by Cloudflare (2022)
SABLE - Scraping Assisted by Learning.
HTML to Text
Crawly - High-level web crawling & scraping framework for Elixir.
Scrapoxy - Hides your scraper behind a cloud. (Web)
Instahunter - CLI OSINT app that can fetch data from Instagram's Web API without authentication.
sico - Sitemap comparison tool.
Scalpel - High level web scraping library for Haskell.
Unfurl - Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js.
Katana - Crawling and spidering framework. (HN)
PyWebCopy - Locally saves webpages to your hard disk with images, CSS, JS & links as is.
Unfurl - Extract and Visualize Data from URLs.
Indieweb site crawler and MF2 data collection tool
Evaluating Mechanical Keyboard Delivery Estimates with Python Web Scraping (2022)
LinkedIn Scraper
Google Image Scraper - Library to scrape Google Images.
CLI Google crawler written in Go
HTML Semantic Seg - Tool to create a dataset of semantic segmentation on website screenshots from their DOM.
fingerprint-suite - Browser fingerprinting tools for anonymizing your scrapers.
Linvo Linkedin Scraper
Cached Chrome Top Million Websites (HN)
site_icons - Efficient website icon scraper for rust, with sizes, ordering, and WASM support.
googlesearch.py - Google search scraper in Python.
Amazon Product API Scraping - Scrape products from Amazon search results or reviews for a specific product.
A Year of Writing about Web Scraping in Review (2023)
goq - Declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library.
Facepager - Fetches publicly available data from YouTube, Twitter and other websites via APIs and web scraping.
MrScraper - Visual web-scraping tool. (HN)
Fitter - Universal scraper for Websites and APIs.
Cloudflare Scrape - Python module to bypass Cloudflare's anti-bot page.
page-fetch - Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files.
Crul - Query Any Webpage or API. (HN)
WikiScrape - Download Wikipedia pages as beautiful text files.
crawler - gRPC web crawler turbo charged for performance.
unfluff - Automatically extract body content (and other cool stuff) from HTML document.
cdp-scrapers - Scratchpad for scraper development and general utilities.
scrapeghost - Library for scraping websites using OpenAI's GPT.
Black Maria - Python package for scraping in Natural language.
Browser AI agent, using GPT-4
UltimaScraper - Scrape all the media from an OnlyFans account.
From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps (2023)
Wikipedia Article Reference Inventory WARI - Import workflows for the Wikipedia Citations Database.
ScrapeGPT - Web scraper builder that uses GPT-4 to automatically generate Python scripts for scraping websites based on user input.
Website Scraper and Vectorizer (Reddit)
metadata-scraper - JavaScript library for scraping/parsing metadata from a web page.
Gumroad Scraper and Website Generator
HackerNews Alert - When a Hacker News post contains a keyword you are interested in, you receive a Slack message.
MrScraper AI - Dead simple web scraper (powered by AI). (HN)
HTML Table to JSON (Tweet)
Forum-dl - Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC.
Basic Statistics of Common Crawl Monthly Archives
Webpecker - Scrape useful data from social networks to search engines effortlessly.
estela - Elastic web scraping cluster.
Scrape the Web with entities extraction using OpenAI Function (Reddit)
top-crawler-agents - List of common crawler user agents useful for retrieving metadata from links.
Web Scraping With Rust
GPTBot - OpenAI’s Web Crawler. (HN)
Robots Parser - Node.js robots.txt parser with support for wildcard matching.
Tubeup - Use yt-dlp to download video and upload to the Internet Archive with metadata.
scrape - Simple, higher level interface for Go web scraping.
Crystal Web Archiver - Downloads high fidelity copies of websites for long-term archival.
Writing a simple web scraper using bash (2023) (Lobsters)
CloudProxy - Proxy server to bypass Cloudflare protection.
