Datasets

Links

Google Dataset Search (HN) (HN)
Tencent ML-Images - Largest multi-label image database; ResNet-101 model; 80.73% top-1 acc on ImageNet.
Mathematics Dataset - Dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
Moving autonomous vehicles forward, together. Dataset by Lyft
CodeSearchNet - Datasets, tools, and benchmarks for representation learning of code.
Introducing the CodeSearchNet challenge (2019) (HN)
Facets - Visualizations for machine learning datasets.
skdata - Data sets for machine learning in Python.
TensorFlow Datasets - Collection of datasets ready to use with TensorFlow.
Awesome Public Datasets
Awesome Public Datasets Core - Next iteration of APD project.
LORIS - Web-accessible database solution for longitudinal multi-site studies.
ProteinNet - Standardized data set for machine learning of protein structure.
Registry of Open Data on AWS (Code)
List of datasets for machine-learning research
Syndetic - Replaces static data dictionaries with a live data profiling system. Annotate, measure, and monitor your datasets. Share the results. (HN)
FaceForensics++ - Learning to Detect Manipulated Facial Images.
Scale AI - High quality training and validation data for AI applications.
Audio Datasets for Machine Learning (HN)
Collection of large datasets for conversational response selection
NSFW data source URLs - Collection of NSFW image URLs for the purposes of training an NSFW Image Classifier.
Lambdagram - Tiny Cloud Service to Build Image Datasets with Instagram.
HN Stories and comments since 2006
My Giant Data Quality Checklist (2020)
LabelImg - Graphical image annotation tool.
Common Voice - Mozilla's initiative to help teach machines how real people speak.
Replica Dataset - Dataset of high quality reconstructions of a variety of indoor spaces.
Using Decision Trees for charting ill-behaved datasets (2020)
Human parsing datasets
Data Programming: Creating Large Training Sets, Quickly (2016)
Announcing Artifacts (2020)
DataHub - Provides various solutions to Publish and Deploy your Data with power and simplicity.
Core Data - Important, commonly-used data as high quality, easy-to-use & open data packages. (Code)
Awesome collections on DataHub
Label Studio - Multi-type data labeling and annotation tool with standardized output format. (Code) (Time Series Data Labeling)
Heartex - Data Management Platform for Machine Learning.
Clothing Dataset: Call for Action (2020)
Unsplash Dataset - 2,000,000+ Unsplash images made available for research and machine learning. (Web)
100k+ Rows Topic Labeled News Dataset (2020)
Fashion-MNIST - MNIST-like fashion product database.
FiveThirtyEight Datasets
Books in .txt format for AI training purposes (HN)
Sweetviz - Visualize and compare datasets, target values and associations, with one line of code.
SuperAnnotate - Fastest annotation platform for training AI.
Activeloop Hub - Fastest way to access and manage datasets for PyTorch and TensorFlow. (Web) (Docs) (Reddit)
Objectron Dataset - Dataset of short object-centric video clips with pose annotations.
Google Research Datasets
matorage - Efficient way to store/load and manage dataset, model and optimizer for deep learning.
HN Posts datasets (HN)
Hypersim Toolkit - Set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
mirdata - Interoperable Dataset Loaders for Music Information Retrieval (MIR).
MetFaces Dataset - Image dataset of human faces extracted from works of art.
Lionbridge AI - Provides human-labeled data for hundreds of use cases.
Traditional Chinese Landscape Painting Dataset
Awesome Satellite Imagery Datasets
Wikimedia Downloads - Download the Entire Wikimedia Database. (HN)
Wikipedia: Database download
How to shuffle a big dataset (2018) (Reddit)
ESC-50: Dataset for Environmental Sound Classification
Booking.com WSDM challenge - Training dataset consists of over a million anonymized hotel reservations, based on real data.
Computer Vision Datasets
Voicebook Datasets - Comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
The Pile - 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
doccano - Open source text annotation tool for machine learning practitioners. (Web)
Weather and Climate Datasets for AI Research (Code)
NLP Datasets
Total Text Dataset - Consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
Datasets collected for network science, deep learning and general machine learning research
MER and SER Data sets - Data sets for Music Emotion Recognition and Speech Emotion Recognition.
Common Voice Datasets - Multi-language dataset of voices that anyone can use to train speech-enabled applications. (Code)
Label a Dataset with a Few Lines of Code (2021) (HN)
Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (2020) (Code)
Datasets should behave like git repositories (2021)
The Stanford Question Answering Dataset (Visual Explorer)
Data.gov - Home of the U.S. Government’s open data.
Visualizing Data Timeliness at Airbnb (2021)
The Next Evolution of Data Catalogs: Data Discovery Platforms (2021)
DeepLabel - Cross-platform image annotation tool for machine learning.
WIT: Wikipedia-based Image Text Dataset
Harry Potter Dataset
DocRED: A Large-Scale Document-Level Relation Extraction Dataset (2019) (Code)
Synthetic Data: Even Better than the Real Thing? (2021)
Google C4 dataset - Colossal, cleaned version of Common Crawl's web crawl corpus.
Finding a standard dataset format for machine learning (2020) (HN)
Hashing techniques to compare large datasets? (2021)
Machine Learning Datasets | Papers With Code (Twitter)
Ocean Market - Marketplace to find, publish and trade data sets. (Code)
Ocean Protocol - Tools for the Web3 Data Economy. (Contracts) (GitHub)
Generating Datasets with Pretrained Language Models (2021)
nbodykit - Analysis kit for large-scale structure datasets, the massively parallel way.
Dataset Inference: Ownership Resolution in Machine Learning (2021) (Tweet)
Diffgram - Data Labeling Software for Machine Learning. (Code)
Data Profiler - Python library designed to make data analysis, monitoring and sensitive data detection easy.
Tonic - Fake Data Company. (GitHub)
Datasets for Google Cloud (Article)
SQLite Data Starter Packs
GitHub Collection: Open data - Examples of using GitHub to store, publish, and collaborate on open, machine-readable datasets.
Scientific Data Repositories (HN)
CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) (HN)
ir_datasets - Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
SEDE (Stack Exchange Data Explorer) - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. (Article)
List of Medical (Imaging) Datasets
musescore.com dataset - Dataset of all music sheets and users on musescore.com.
generatedata.com - Random data generator. (Code)
MTData - Tool that automates collection and preparation of machine translation datasets.
The MIT Supercloud Dataset (2021)
Datasheets for Datasets (2018) (Markdown Datasheet for Datasets)
Lightly - Label only the data which improves your ML model. (HN)
Small Open Datasets - Collection of automatically-updated, ready-to-use and open-licensed datasets.
DataQA - Labelling platform for text using distant supervision.
COCO - Common Objects in Context - Large-scale object detection, segmentation, and captioning dataset. (API)
img2dataset - Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
How to fit any dataset with a single parameter (2019) (HN)
Single-dataset Experts for Multi-dataset Question Answering (2021) (Code)
LabelFlow - Open standard platform for image labeling. (Code)
Face Synthetics dataset
Toloka - Fast and efficient way to collect and label large data sources for machine learning and other business purposes. (Code) (GitHub)
PlainTextWikipedia - Convert Wikipedia database dumps into plaintext files.
Discovering Anomalous Data with Self-Supervised Learning (2021)
Resources to get you the best quality of ML datasets (2021)
Hugging Face Datasets
SDMetrics - Metrics to evaluate quality and efficacy of synthetic datasets.
doubtlab - General tricks that may help you find bad, or noisy, labels in your dataset.
Gretel Synthetics - Synthetic data generators for structured and unstructured text, featuring differentially private learning.
Great datasets to teach with (2021)
A Cartel of Influential Datasets Are Dominating Machine Learning Research (HN)
The Toxicity Dataset
Data Linter - Identifies potential issues (lints) in your ML training data.
Cloud Annotations - Fast, easy and collaborative open source image annotation tool for teams and individuals. (Web)
pyjanitor - Clean APIs for data cleaning. Python implementation of R package Janitor.
face2comics datasets
arXiv public datasets
AIST++ Dance Motion Dataset (API Code)
TheAudioDB.com - Community Database of audio artwork and metadata with a JSON API.
Awesome Video Datasets
Conceptual 12M - Dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Colliding Circles Toy Datasets
Sieve - Transform raw video into high quality datasets in minutes. (HN) (HN)
IKEA 3D Assembly Dataset
Imbalanced Dataset Sampler - PyTorch imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
ADE20K Dataset - Composed of more than 27K images from the SUN and Places databases. (Code)
Datasets of Automatic Keyphrase Extraction
Awesome Forests - Curated list of ground-truth forest datasets for the machine learning and forestry community.
PushShift Data Dumps
DeepEcho - Synthetic Data Generation for mixed-type, multivariate time series.
deduplify - Python tool to search for and remove duplicated files in messy datasets.
CSVtoTable - Simple command-line utility to convert CSV files to a searchable and sortable HTML table.
Kubric - Data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
ASPset-510 - Large-scale video dataset for the training and evaluation of 3D human pose estimation models.
Self-Distilled Internet Photos (SDIP) Dataset
Fake News Corpus
Sniffer - Lightweight Python application for sorting images in your dataset.
Dataset Distillation by Matching Training Trajectories (2022) (Code)
BeeRef - Simple Reference Image Viewer.
BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021) (Code)
HierText Dataset - Dataset featuring hierarchical annotations of text in natural scenes and documents.
Google Research Datasets
MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts (2022)
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
Squirrel Datasets Core
GTA-3D Dataset - Dataset of 2D imagery, 3D point cloud data, and 3D vehicle bounding box labels all generated using the Grand Theft Auto 5 game engine.
Relative Human (RH) - Multi-person in-the-wild RGB images with rich human annotations.
CSV Base - Turn CSV files into read+write APIs. (Code)
A Dataset and Explorer for 3D Signed Distance Functions (2022) (Code)
Vega Datasets - Collection of datasets used in Vega and Vega-Lite examples.
Azimuth - Open-source dataset and error analysis tool for text classification.
audio2dataset - Easily turn large sets of audio URLs into an audio dataset.
Datasets for Entity Recognition - Collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
AudioLoader - PyTorch Dataset for Speech and Music audio.
Awesome Training Data
MIDI Dataset - Code for creating a dataset of MIDI ground truth.
Labelbox - Fastest way to annotate data to build and ship computer vision applications. (Code)
Bamboo - Mega-scale and information-dense dataset for classification and detection pre-training.
The How2 Dataset - Multimodal collection of instructional videos with English subtitles. (Code)
Unity Dataset Insights - Python package for downloading, parsing and analyzing synthetic datasets generated using the Unity Perception package.
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2022) (Code)
Perceptual Image Processing ALgorithms (PIPAL) (Code)
Hover - Label data at scale. Fun and precision included.
How do you share big datasets with your team and others? (2022)
Simulacra Aesthetic Captions - Dataset of over 238000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user-submitted prompts.
Audio Dataset Project - Audio Dataset for training CLAP and other models.
Bulk - Quick developer tool to apply some bulk labels.
stopes - Library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
MisInfoText - Datasets for fake news and misinformation detection.
Awesome Dataset Distillation
Cleaning data with sqlite-utils and Datasette
Starter code for working with the YouTube-8M dataset
BigLAM (Libraries, Archives and Museums) - Open source, community resource of LAM datasets.
Data Measurements Tool - Developing tools to automatically analyze datasets.