Datasets
Hugging Face Datasets & TensorFlow Datasets have nice collections.
AutoViz is nice for visualizing datasets. Label Studio is nice for data annotation.
Links
- Google Dataset Search (HN) (HN)
- Tencent ML-Images - Largest multi-label image database; ResNet-101 model; 80.73% top-1 acc on ImageNet.
- Mathematics Dataset - Dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
- Moving autonomous vehicles forward, together. Dataset by Lyft
- CodeSearchNet - Datasets, tools, and benchmarks for representation learning of code.
- Introducing the CodeSearchNet challenge (2019) (HN)
- Facets - Visualizations for machine learning datasets.
- skdata - Data sets for machine learning in Python.
- TensorFlow Datasets - Collection of datasets ready to use with TensorFlow. (HN)
- Awesome Public Datasets
- Awesome Public Datasets Core - Next iteration of APD project.
- LORIS - Web-accessible database solution for longitudinal multi-site studies.
- ProteinNet - Standardized data set for machine learning of protein structure.
- Registry of Open Data on AWS (Code)
- List of datasets for machine-learning research
- Syndetic - Replaces static data dictionaries with a live data profiling system. Annotate, measure, and monitor your datasets. Share the results. (HN)
- FaceForensics++ - Learning to Detect Manipulated Facial Images.
- Scale AI - High quality training and validation data for AI applications.
- Audio Datasets for Machine Learning (HN)
- Collection of large datasets for conversational response selection
- NSFW data source URLs - Collection of NSFW image URLs for the purpose of training an NSFW image classifier.
- Lambdagram - Tiny Cloud Service to Build Image Datasets with Instagram.
- HN Stories and comments since 2006
- My Giant Data Quality Checklist (2020)
- LabelImg - Graphical image annotation tool.
- Common Voice - Mozilla's initiative to help teach machines how real people speak.
- Replica Dataset - Dataset of high quality reconstructions of a variety of indoor spaces.
- Using Decision Trees for charting ill-behaved datasets (2020)
- Human parsing datasets
- Data Programming: Creating Large Training Sets, Quickly (2016)
- Announcing Artifacts (2020)
- DataHub - Provides various solutions to publish and deploy your data with power and simplicity.
- Core Data - Important, commonly-used data as high quality, easy-to-use & open data packages. (Code)
- Awesome collections on DataHub
- Label Studio - Multi-type data labeling and annotation tool with standardized output format. (Code) (Time Series Data Labeling)
- Heartex - Data Management Platform for Machine Learning.
- Clothing Dataset: Call for Action (2020)
- Unsplash Dataset - 2,000,000+ Unsplash images made available for research and machine learning. (Web)
- 100k+ Rows Topic Labeled News Dataset (2020)
- Fashion-MNIST - MNIST-like fashion product database.
- FiveThirtyEight Datasets
- Books in .txt format for AI training purposes (HN)
- Sweetviz - Visualize and compare datasets, target values and associations, with one line of code.
- SuperAnnotate - Fastest annotation platform for training AI.
- Activeloop Hub - Fastest way to access and manage datasets for PyTorch and TensorFlow. (Web) (Docs) (Reddit) (HN)
- Objectron Dataset - Dataset of short object-centric video clips with pose annotations.
- Google Research Datasets
- matorage - Efficient way to store/load and manage dataset, model and optimizer for deep learning.
- HN Posts datasets (HN)
- Hypersim Toolkit - Set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
- mirdata - Interoperable Dataset Loaders for Music Information Retrieval (MIR).
- MetFaces Dataset - Image dataset of human faces extracted from works of art.
- Lionbridge AI - Provides human-labeled data for hundreds of use cases.
- Traditional Chinese Landscape Painting Dataset
- Awesome Satellite Imagery Datasets
- Wikimedia Downloads - Download the Entire Wikimedia Database. (HN)
- Wikipedia: Database download
- How to shuffle a big dataset (2018) (Reddit)
- ESC-50: Dataset for Environmental Sound Classification
- Booking.com WSDM challenge - Training dataset consists of over a million anonymized hotel reservations, based on real data.
- Computer Vision Datasets
- Voicebook Datasets - Comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
- The Pile - 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
- doccano - Open source text annotation tool for machine learning practitioners. (Web)
- Weather and Climate Datasets for AI Research (Code)
- NLP Datasets
- Total Text Dataset - Consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
- Datasets collected for network science, deep learning and general machine learning research
- MER and SER Data sets - Data sets for Music Emotion Recognition and Speech Emotion Recognition.
- Common Voice Datasets - Multi-language dataset of voices that anyone can use to train speech-enabled applications. (Code)
- Label a Dataset with a Few Lines of Code (2021) (HN)
- Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (2020) (Code)
- Datasets should behave like git repositories (2021)
- The Stanford Question Answering Dataset (Visual Explorer)
- Data.gov - Home of the U.S. Government’s open data.
- Visualizing Data Timeliness at Airbnb (2021)
- The Next Evolution of Data Catalogs: Data Discovery Platforms (2021)
- DeepLabel - Cross-platform image annotation tool for machine learning.
- WIT: Wikipedia-based Image Text Dataset
- Harry Potter Dataset
- DocRED: A Large-Scale Document-Level Relation Extraction Dataset (2019) (Code)
- Synthetic Data: Even Better than the Real Thing? (2021)
- Google C4 dataset - Colossal, cleaned version of Common Crawl's web crawl corpus.
- Finding a standard dataset format for machine learning (2020) (HN)
- Hashing techniques to compare large datasets? (2021)
- Machine Learning Datasets | Papers With Code (Twitter)
- Ocean Market - Marketplace to find, publish and trade data sets. (Code)
- Ocean Protocol - Tools for the Web3 Data Economy. (Contracts) (GitHub)
- Generating Datasets with Pretrained Language Models (2021)
- nbodykit - Analysis kit for large-scale structure datasets, the massively parallel way.
- Dataset Inference: Ownership Resolution in Machine Learning (2021) (Tweet)
- Diffgram - Data Labeling Software for Machine Learning. (Code)
- Data Profiler - Python library designed to make data analysis, monitoring and sensitive data detection easy.
- Tonic - Fake Data Company. (GitHub)
- Datasets for Google Cloud (Article)
- SQLite Data Starter Packs
- GitHub Collection: Open data - Examples of using GitHub to store, publish, and collaborate on open, machine-readable datasets.
- Scientific Data Repositories (HN)
- CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) (HN)
- ir_datasets - Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
- SEDE (Stack Exchange Data Explorer) - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. (Article)
- List of Medical (Imaging) Datasets
- musescore.com dataset - Dataset of all music sheets and users on musescore.com.
- generatedata.com - Random data generator. (Code)
- MTData - Tool that automates the collection and preparation of machine translation datasets.
- The MIT Supercloud Dataset (2021)
- Datasheets for Datasets (2018) (Markdown Datasheet for Datasets)
- Lightly - Label only the data which improves your ML model. (HN)
- Small Open Datasets - Collection of automatically-updated, ready-to-use and open-licensed datasets.
- DataQA - Labelling platform for text using distant supervision.
- COCO - Common Objects in Context - Large-scale object detection, segmentation, and captioning dataset. (API)
- img2dataset - Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
- How to fit any dataset with a single parameter (2019) (HN)
- Single-dataset Experts for Multi-dataset Question Answering (2021) (Code)
- LabelFlow - Open standard platform for image labeling. (Code)
- Face Synthetics dataset
- Toloka - Fast and efficient way to collect and label large data sources for machine learning and other business purposes. (Code) (GitHub)
- PlainTextWikipedia - Convert Wikipedia database dumps into plaintext files.
- Discovering Anomalous Data with Self-Supervised Learning (2021)
- Resources to get you the best quality of ML datasets (2021)
- Hugging Face Datasets
- SDMetrics - Metrics to evaluate quality and efficacy of synthetic datasets.
- doubtlab - General tricks that may help you find bad, or noisy, labels in your dataset.
- Gretel Synthetics - Synthetic data generators for structured and unstructured text, featuring differentially private learning.
- Great datasets to teach with (2021)
- A Cartel of Influential Datasets Are Dominating Machine Learning Research (HN)
- The Toxicity Dataset
- Data Linter - Identifies potential issues (lints) in your ML training data.
- Cloud Annotations - Fast, easy and collaborative open source image annotation tool for teams and individuals. (Web)
- pyjanitor - Clean APIs for data cleaning. Python implementation of R package Janitor.
- face2comics datasets
- arXiv public datasets
- AIST++ Dance Motion Dataset (API Code)
- TheAudioDB.com - Community Database of audio artwork and metadata with a JSON API.
- Awesome Video Datasets
- Conceptual 12M - Dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
- Colliding Circles Toy Datasets
- Sieve - Transform raw video into high quality datasets in minutes. (HN) (HN)
- IKEA 3D Assembly Dataset
- Imbalanced Dataset Sampler - PyTorch imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
- ADE20K Dataset - Composed of more than 27K images from the SUN and Places databases. (Code)
- Datasets of Automatic Keyphrase Extraction
- Awesome Forests - Curated list of ground-truth forest datasets for the machine learning and forestry community.
- PushShift Data Dumps
- DeepEcho - Synthetic Data Generation for mixed-type, multivariate time series.
- deduplify - Python tool to search for and remove duplicated files in messy datasets.
- CSVtoTable - Simple command-line utility to convert CSV files to searchable and sortable HTML table.
- Kubric - Data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
- ASPset-510 - Large-scale video dataset for the training and evaluation of 3D human pose estimation models.
- Self-Distilled Internet Photos (SDIP) Dataset
- Fake News Corpus
- Sniffer - Lightweight Python application for sorting images in your dataset.
- Dataset Distillation by Matching Training Trajectories (2022) (Code)
- BeeRef - Simple Reference Image Viewer.
- BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021) (Code)
- HierText Dataset - Dataset featuring hierarchical annotations of text in natural scenes and documents.
- Google Research Datasets
- MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts (2022)
- CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
- Squirrel Datasets Core
- GTA-3D Dataset - Dataset of 2D imagery, 3D point cloud data, and 3D vehicle bounding box labels all generated using the Grand Theft Auto 5 game engine.
- Relative Human (RH) - Multi-person in-the-wild RGB images with rich human annotations.
- CSV Base - Turn CSV files into read+write APIs. (Code)
- A Dataset and Explorer for 3D Signed Distance Functions (2022) (Code)
- Vega Datasets - Collection of datasets used in Vega and Vega-Lite examples.
- Azimuth - Open-source dataset and error analysis tool for text classification.
- audio2dataset - Easily turn large sets of audio URLs into an audio dataset.
- Datasets for Entity Recognition - Collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
- AudioLoader - PyTorch Dataset for Speech and Music audio.
- Awesome Training Data
- MIDI Dataset - Code for creating a dataset of MIDI ground truth.
- Labelbox - Fastest way to annotate data to build and ship computer vision applications. (Code)
- Bamboo - Mega-scale and information-dense dataset for classification and detection pre-training.
- The How2 Dataset - Multimodal collection of instructional videos with English subtitles. (Code)
- Unity Dataset Insights - Python package for downloading, parsing and analyzing synthetic datasets generated using the Unity Perception package.
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2022) (Code)
- Perceptual Image Processing ALgorithms (PIPAL) (Code)
- Hover - Label data at scale. Fun and precision included.
- How do you share big datasets with your team and others? (2022)
- Simulacra Aesthetic Captions - Dataset of over 238,000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user-submitted prompts.
- Audio Dataset Project - Audio Dataset for training CLAP and other models.
- Bulk - Quick developer tool to apply some bulk labels.
- stopes - Library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
- MisInfoText - Datasets for fake news and misinformation detection.
- Awesome Dataset Distillation
- Cleaning data with sqlite-utils and Datasette
- Starter code for working with the YouTube-8M dataset
- BigLAM (Libraries, Archives and Museums) - Open source, community resource of LAM datasets.
- Data Measurements Tool - Developing tools to automatically analyze datasets.
- Cleanlab Vizzy - Learn how to automatically find label errors and out-of-distribution data. (Lobsters)
- Ask HN: Will AI-generated images flooding the web pollute future training data? (2022)
- Exploring 12M of the 2.3B images used to train Stable Diffusion (2022) (HN)
- COYO-700M: Large-scale Image-Text Pair Dataset
- WebVid Dataset - Large-scale text-video dataset. 10 million captioned short videos.
- Generate Synthetic Data in 3 Lines of Code (2022) (HN)
- ShowData - Large-scale image dataset visualization tool.
- Synthetic Faces High Quality (SFHQ) dataset
- Hugging Face Datasets Converter - Scripts to convert datasets from various sources to Hugging Face Datasets.
- Multimodal datasets: misogyny, pornography, and malignant stereotypes (2022) (Tweet)
- Click Points - Image viewer that also serves as a data display and annotation tool.
- FastDup - Tool for gaining insights from a large image collection.
- Downstream Datasets Make Surprisingly Good Pretraining Corpora (2022) (Tweet)
- Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
- HuggingFace Datasets server - Integrate into your apps over 10,000 datasets via simple HTTP requests, with pre-processed responses and scalability built-in. (HN)
- Online Language Modelling Dataset Pipeline
- What should I do if a dataset is too large to store in my local computer? (2022)
- Recommendations thread: Your favorite sources of raw data (of any type) | Lobsters (2022)
- Open Source Data Annotation & Labeling Tools
- Waste datasets review - List of image datasets with any kind of litter, garbage, waste and trash.
- TACO - Trash Annotations in Context Dataset Toolkit.
- Kangas - Explore multimedia datasets at scale.
- FIB Benchmark
- cc2dataset - Easily convert Common Crawl to a dataset of caption and document pairs (image/text, audio/text, video/text).
- VideoCC - Dataset containing (video-URL, caption) pairs for training video-text machine learning models.
- MIR dataset papers presented at ISMIR 2022
- Laion-5B: A New Era of Open Large-Scale Multi-Modal Datasets (2022) (HN)
- GWA - Geometric-Wave Acoustic dataset.
- Generalized RDF datasets for Rust
- Video2Dataset - Easily create large video datasets from video URLs.
- AutoViz - Automatically Visualize any dataset, any size with a single line of code.
- KiloGram Tangrams dataset
- MNIST-1D dataset - 1D analogue of the MNIST dataset for measuring spatial biases and answering "science of deep learning" questions.
- any2dataset - Easily turn large sets of file URLs into a file dataset.
- Database of 200k cell images yields new mathematical framework (2023) (HN)
- Fashion IQ dataset
- City2BA - Tools for generating synthetic bundle adjustment datasets.
- Toolbox for HuMMan Dataset
- Datasets for deep learning with satellite & aerial imagery
- Retriever - Quickly download, clean up, and install public datasets into a database management system.
- A Critical Field Guide for Working with Machine Learning Datasets (2023)
- Oxen - Version your machine learning datasets like you version your code. (HN)
- This Not That - Visual labeling system implemented in Jupyter widgets.
- OpenWebText - Open clone of OpenAI's unreleased WebText dataset scraper.
- Multiface Dataset - Multi-view dataset of multiple identities performing a sequence of facial expressions.
- Wikipedia 2 Corpus - Wikipedia text corpus for self-supervised NLP model training.
- Exsclaim - Toolkit for the automatic construction of self-labeled materials imaging datasets from scientific literature.
- Awesome Human Label Variation
- Internet Explorer - Explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset.
- Occupancy Dataset for nuScenes
- GINC (Generative In-Context learning Dataset) - Small-scale synthetic dataset for studying in-context learning.
- Open Instruction Generalist (OIG) Dataset
- Cleaned Alpaca Dataset
- DeepFashion-MultiModal - Large-scale high-quality human dataset with rich multi-modal annotations.
- GPTeacher - Collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
- Know Your Data - Understand datasets with the goal of improving data quality.
- What’s in the RedPajama-Data-1T LLM training set (2023)
- RedPajama-Data - Open Source Recipe to Reproduce LLaMA training dataset.
- Instruction Tuning Datasets - All available datasets for Instruction Tuning of Large Language Models.
- DataComp - Competition about designing datasets for pre-training CLIP models.
- tokenmonster - Determine the tokens that best represent any given dataset.
- Datalab: A Linter for ML Datasets
- Renumics - Curation tool for unstructured data that connects your stack to the data-centric AI ecosystem.
- SlimPajama-627B - Largest extensively deduplicated, multi-corpora, open-source dataset for training large language models.
- Kart - Distributed version-control for geospatial and tabular data.
- Autolabel - Label, clean and enrich text datasets with Large Language Models. (HN)
- Awesome 3D LiDAR Datasets
- Automated Data Quality at Scale (2023)
- LLMDataHub - Awesome Datasets for LLM Training.