Datasets

Hugging Face Datasets & TensorFlow Datasets have nice collections.

AutoViz is nice for visualizing datasets. Label Studio is nice for data annotation.

Links

Google Dataset Search (HN) (HN)
Tencent ML-Images - Largest multi-label image database; ResNet-101 model; 80.73% top-1 accuracy on ImageNet.
Mathematics Dataset - Dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
Moving autonomous vehicles forward, together. Dataset by Lyft
CodeSearchNet - Datasets, tools, and benchmarks for representation learning of code.
Introducing the CodeSearchNet challenge (2019) (HN)
Facets - Visualizations for machine learning datasets.
skdata - Data sets for machine learning in Python.
TensorFlow Datasets - Collection of datasets ready to use with TensorFlow. (HN)
Awesome Public Datasets
Awesome Public Datasets Core - Next iteration of APD project.
LORIS - Web-accessible database solution for longitudinal multi-site studies.
ProteinNet - Standardized data set for machine learning of protein structure.
Registry of Open Data on AWS (Code)
List of datasets for machine-learning research
Syndetic - Replaces static data dictionaries with a live data profiling system. Annotate, measure, and monitor your datasets. Share the results. (HN)
FaceForensics++ - Learning to Detect Manipulated Facial Images.
Scale AI - High quality training and validation data for AI applications.
Audio Datasets for Machine Learning (HN)
Collection of large datasets for conversational response selection
NSFW data source URLs - Collection of NSFW images URLs for the purposes of training an NSFW Image Classifier.
Lambdagram - Tiny Cloud Service to Build Image Datasets with Instagram.
HN Stories and comments since 2006
My Giant Data Quality Checklist (2020)
LabelImg - Graphical image annotation tool.
Common Voice - Mozilla's initiative to help teach machines how real people speak.
Replica Dataset - Dataset of high quality reconstructions of a variety of indoor spaces.
Using Decision Trees for charting ill-behaved datasets (2020)
Human parsing datasets
Data Programming: Creating Large Training Sets, Quickly (2016)
Announcing Artifacts (2020)
DataHub - Provides various solutions to publish and deploy your data with power and simplicity.
Core Data - Important, commonly-used data as high quality, easy-to-use & open data packages. (Code)
Awesome collections on DataHub
Label Studio - Multi-type data labeling and annotation tool with standardized output format. (Code) (Time Series Data Labeling)
Heartex - Data Management Platform for Machine Learning.
Clothing Dataset: Call for Action (2020)
Unsplash Dataset - 2,000,000+ Unsplash images made available for research and machine learning. (Web)
100k+ Rows Topic Labeled News Dataset (2020)
Fashion-MNIST - MNIST-like fashion product database.
FiveThirtyEight Datasets
Books in .txt format for AI training purposes (HN)
Sweetviz - Visualize and compare datasets, target values and associations, with one line of code.
SuperAnnotate - Fastest annotation platform for training AI.
Activeloop Hub - Fastest way to access and manage datasets for PyTorch and TensorFlow. (Web) (Docs) (Reddit) (HN)
Objectron Dataset - Dataset of short object-centric video clips with pose annotations.
Google Research Datasets
matorage - Efficient way to store/load and manage dataset, model and optimizer for deep learning.
HN Posts datasets (HN)
Hypersim Toolkit - Set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
mirdata - Interoperable Dataset Loaders for Music Information Retrieval (MIR).
MetFaces Dataset - Image dataset of human faces extracted from works of art.
Lionbridge AI - Provides human-labeled data for hundreds of use cases.
Traditional Chinese Landscape Painting Dataset
Awesome Satellite Imagery Datasets
Wikimedia Downloads - Download the Entire Wikimedia Database. (HN)
Wikipedia: Database download
How to shuffle a big dataset (2018) (Reddit)
ESC-50: Dataset for Environmental Sound Classification
Booking.com WSDM challenge - Training dataset consists of over a million anonymized hotel reservations, based on real data.
Computer Vision Datasets
Voicebook Datasets - Comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
The Pile - 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
doccano - Open source text annotation tool for machine learning practitioners. (Web)
Weather and Climate Datasets for AI Research (Code)
NLP Datasets
Total Text Dataset - Consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
Datasets collected for network science, deep learning and general machine learning research
MER and SER Data sets - Data sets for Music Emotion Recognition and Speech Emotion Recognition.
Common Voice Datasets - Multi-language dataset of voices that anyone can use to train speech-enabled applications. (Code)
Label a Dataset with a Few Lines of Code (2021) (HN)
Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (2020) (Code)
Datasets should behave like git repositories (2021)
The Stanford Question Answering Dataset (Visual Explorer)
Data.gov - Home of the U.S. Government’s open data.
Visualizing Data Timeliness at Airbnb (2021)
The Next Evolution of Data Catalogs: Data Discovery Platforms (2021)
DeepLabel - Cross-platform image annotation tool for machine learning.
WIT: Wikipedia-based Image Text Dataset
Harry Potter Dataset
DocRED: A Large-Scale Document-Level Relation Extraction Dataset (2019) (Code)
Synthetic Data: Even Better than the Real Thing? (2021)
Google C4 dataset - Colossal, cleaned version of Common Crawl's web crawl corpus.
Finding a standard dataset format for machine learning (2020) (HN)
Hashing techniques to compare large datasets? (2021)
Machine Learning Datasets | Papers With Code (Twitter)
Ocean Market - Marketplace to find, publish and trade data sets. (Code)
Ocean Protocol - Tools for the Web3 Data Economy. (Contracts) (GitHub)
Generating Datasets with Pretrained Language Models (2021)
nbodykit - Analysis kit for large-scale structure datasets, the massively parallel way.
Dataset Inference: Ownership Resolution in Machine Learning (2021) (Tweet)
Diffgram - Data Labeling Software for Machine Learning. (Code)
Data Profiler - Python library designed to make data analysis, monitoring and sensitive data detection easy.
Tonic - Fake Data Company. (GitHub)
Datasets for Google Cloud (Article)
SQLite Data Starter Packs
GitHub Collection: Open data - Examples of using GitHub to store, publish, and collaborate on open, machine-readable datasets.
Scientific Data Repositories (HN)
CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) (HN)
ir_datasets - Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
SEDE (Stack Exchange Data Explorer) - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. (Article)
List of Medical (Imaging) Datasets
musescore.com dataset - Dataset of all music sheets and users on musescore.com.
generatedata.com - Random data generator. (Code)
MTData - Tool automates collection and preparation of machine translation datasets.
The MIT Supercloud Dataset (2021)
Datasheets for Datasets (2018) (Markdown Datasheet for Datasets)
Lightly - Label only the data which improves your ML model. (HN)
Small Open Datasets - Collection of automatically-updated, ready-to-use and open-licensed datasets.
DataQA - Labelling platform for text using distant supervision.
COCO - Common Objects in Context - Large-scale object detection, segmentation, and captioning dataset. (API)
img2dataset - Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
How to fit any dataset with a single parameter (2019) (HN)
Single-dataset Experts for Multi-dataset Question Answering (2021) (Code)
LabelFlow - Open standard platform for image labeling. (Code)
Face Synthetics dataset
Toloka - Fast and efficient way to collect and label large data sources for machine learning and other business purposes. (Code) (GitHub)
PlainTextWikipedia - Convert Wikipedia database dumps into plaintext files.
Discovering Anomalous Data with Self-Supervised Learning (2021)
Resources to get you the best quality of ML datasets (2021)
Hugging Face Datasets
SDMetrics - Metrics to evaluate quality and efficacy of synthetic datasets.
doubtlab - General tricks that may help you find bad, or noisy, labels in your dataset.
Gretel Synthetics - Synthetic data generators for structured and unstructured text, featuring differentially private learning.
Great datasets to teach with (2021)
A Cartel of Influential Datasets Are Dominating Machine Learning Research (HN)
The Toxicity Dataset
Data Linter - Identifies potential issues (lints) in your ML training data.
Cloud Annotations - Fast, easy and collaborative open source image annotation tool for teams and individuals. (Web)
pyjanitor - Clean APIs for data cleaning. Python implementation of R package Janitor.
face2comics datasets
arXiv public datasets
AIST++ Dance Motion Dataset (API Code)
TheAudioDB.com - Community Database of audio artwork and metadata with a JSON API.
Awesome Video Datasets
Conceptual 12M - Dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Colliding Circles Toy Datasets
Sieve - Transform raw video into high quality datasets in minutes. (HN) (HN)
IKEA 3D Assembly Dataset
Imbalanced Dataset Sampler - PyTorch imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
ADE20K Dataset - Composed of more than 27K images from the SUN and Places databases. (Code)
Datasets of Automatic Keyphrase Extraction
Awesome Forests - Curated list of ground-truth forest datasets for the machine learning and forestry community.
PushShift Data Dumps
DeepEcho - Synthetic Data Generation for mixed-type, multivariate time series.
deduplify - Python tool to search for and remove duplicated files in messy datasets.
CSVtoTable - Simple command-line utility to convert CSV files to searchable and sortable HTML table.
Kubric - Data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
ASPset-510 - Large-scale video dataset for the training and evaluation of 3D human pose estimation models.
Self-Distilled Internet Photos (SDIP) Dataset
Fake News Corpus
Sniffer - Lightweight Python application for sorting images in your dataset.
Dataset Distillation by Matching Training Trajectories (2022) (Code)
BeeRef - Simple Reference Image Viewer.
BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021) (Code)
HierText Dataset - Dataset featuring hierarchical annotations of text in natural scenes and documents.
MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts (2022)
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
Squirrel Datasets Core
GTA-3D Dataset - Dataset of 2D imagery, 3D point cloud data, and 3D vehicle bounding box labels all generated using the Grand Theft Auto 5 game engine.
Relative Human (RH) - Multi-person in-the-wild RGB images with rich human annotations.
CSV Base - Turn CSV files into read+write APIs. (Code)
A Dataset and Explorer for 3D Signed Distance Functions (2022) (Code)
Vega Datasets - Collection of datasets used in Vega and Vega-Lite examples.
Azimuth - Open-source dataset and error analysis tool for text classification.
audio2dataset - Easily turn large sets of audio URLs into an audio dataset.
Datasets for Entity Recognition - Collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
AudioLoader - PyTorch Dataset for Speech and Music audio.
Awesome Training Data
MIDI Dataset - Code for creating a dataset of MIDI ground truth.
Labelbox - Fastest way to annotate data to build and ship computer vision applications. (Code)
Bamboo - Mega-scale and information-dense dataset for classification and detection pre-training.
The How2 Dataset - Multimodal collection of instructional videos with English subtitles. (Code)
Unity Dataset Insights - Python package for downloading, parsing and analyzing synthetic datasets generated using the Unity Perception package.
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2022) (Code)
Perceptual Image Processing ALgorithms (PIPAL) (Code)
Hover - Label data at scale. Fun and precision included.
How do you share big datasets with your team and others? (2022)
Simulacra Aesthetic Captions - Dataset of over 238,000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user-submitted prompts.
Audio Dataset Project - Audio Dataset for training CLAP and other models.
Bulk - Quick developer tool to apply some bulk labels.
stopes - Library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
MisInfoText - Datasets for fake news and misinformation detection.
Awesome Dataset Distillation
Cleaning data with sqlite-utils and Datasette
Starter code for working with the YouTube-8M dataset
BigLAM (Libraries, Archives and Museums) - Open source, community resource of LAM datasets.
Data Measurements Tool - Developing tools to automatically analyze datasets.
Cleanlab Vizzy - Learn how to automatically find label errors and out-of-distribution data. (Lobsters)
Ask HN: Will AI-generated images flooding the web pollute future training data? (2022)
Exploring 12M of the 2.3B images used to train Stable Diffusion (2022) (HN)
COYO-700M: Large-scale Image-Text Pair Dataset
WebVid Dataset - Large-scale text-video dataset. 10 million captioned short videos.
Generate Synthetic Data in 3 Lines of Code (2022) (HN)
ShowData - Large-scale image dataset visualization tool.
Synthetic Faces High Quality (SFHQ) dataset
Hugging Face Datasets Converter - Scripts to convert datasets from various sources to Hugging Face Datasets.
Multimodal datasets: misogyny, pornography, and malignant stereotypes (2022) (Tweet)
Click Points - Image viewer that also serves as a data display and annotation tool.
FastDup - Tool for gaining insights from a large image collection.
Downstream Datasets Make Surprisingly Good Pretraining Corpora (2022) (Tweet)
Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
HuggingFace Datasets server - Integrate into your apps over 10,000 datasets via simple HTTP requests, with pre-processed responses and scalability built-in. (HN)
Online Language Modelling Dataset Pipeline
What should I do if a dataset is too large to store in my local computer? (2022)
Recommendations thread: Your favorite sources of raw data (of any type) | Lobsters (2022)
Open Source Data Annotation & Labeling Tools
Waste datasets review - List of image datasets with any kind of litter, garbage, waste and trash.
TACO - Trash Annotations in Context Dataset Toolkit.
Kangas - Explore multimedia datasets at scale.
FIB Benchmark
cc2dataset - Easily convert Common Crawl to a dataset of caption and document pairs: image/text, audio/text, video/text.
VideoCC - Dataset containing (video-URL, caption) pairs for training video-text machine learning models.
MIR dataset papers presented at ISMIR 2022
Laion-5B: A New Era of Open Large-Scale Multi-Modal Datasets (2022) (HN)
GWA - Geometric-Wave Acoustic dataset.
Generalized RDF datasets for Rust
Video2Dataset - Easily create large video datasets from video URLs.
AutoViz - Automatically Visualize any dataset, any size with a single line of code.
KiloGram Tangrams dataset
MNIST-1D dataset - 1D analogue of the MNIST dataset for measuring spatial biases and answering "science of deep learning" questions.
any2dataset - Easily turn large sets of file URLs into a file dataset.
Database of 200k cell images yields new mathematical framework (2023) (HN)
Fashion IQ dataset
City2BA - Tools for generating synthetic bundle adjustment datasets.
Toolbox for HuMMan Dataset
Datasets for deep learning with satellite & aerial imagery
Retriever - Quickly download, clean up, and install public datasets into a database management system.
A Critical Field Guide for Working with Machine Learning Datasets (2023)
Oxen - Version your machine learning datasets like you version your code. (HN)
This Not That - Visual labeling system implemented in Jupyter widgets.
OpenWebText - Open clone of OpenAI's unreleased WebText dataset scraper.
Multiface Dataset - Multi-view dataset of multiple identities performing a sequence of facial expressions.
Wikipedia 2 Corpus - Wikipedia text corpus for self-supervised NLP model training.
Exsclaim - Toolkit for the automatic construction of self-labeled materials imaging datasets from scientific literature.
Awesome Human Label Variation
Internet Explorer - Explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset.
Occupancy Dataset for nuScenes
GINC (Generative In-Context learning Dataset) - Small-scale synthetic dataset for studying in-context learning.
Open Instruction Generalist (OIG) Dataset
Cleaned Alpaca Dataset
DeepFashion-MultiModal - Large-scale high-quality human dataset with rich multi-modal annotations.
GPTeacher - Collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
Know Your Data - Understand datasets with the goal of improving data quality.
What’s in the RedPajama-Data-1T LLM training set (2023)
RedPajama-Data - Open Source Recipe to Reproduce LLaMA training dataset.
Instruction Tuning Datasets - All available datasets for Instruction Tuning of Large Language Models.
DataComp - Competition about designing datasets for pre-training CLIP models.
tokenmonster - Determine the tokens that best represent any given dataset.
Datalab: A Linter for ML Datasets
Renumics - Curation tool for unstructured data that connects your stack to the data-centric AI ecosystem.
SlimPajama-627B - Largest extensively deduplicated, multi-corpora, open-source dataset for training large language models.
Kart - Distributed version-control for geospatial and tabular data.
Autolabel - Label, clean and enrich text datasets with Large Language Models. (HN)
Awesome 3D LiDAR Datasets
Automated Data Quality at Scale (2023)
LLMDataHub - Awesome Datasets for LLM Training.
