Machine learning algorithms are only as good as the data they are trained on. This reflects the fact that the data provided to the algorithm will determine what patterns the algorithm learns, and thus what content it may correctly recognize in the future. To quote a well known concept in computer science: “Garbage in, garbage out!”. Consequently, it is important to use quality image datasets for the creation of image classification and computer vision systems. In this post, you’ll find various datasets and links to portals you’re able to visit to find the perfect image dataset that’s relevant to your projects.
Many companies have come to publish their datasets in the public domain. This allows to develop machine learning and AI.
We are happy to present a ready-to-use list of datasets so that you can start your machine learning today.
Focus: Face recognition
Use Cases: School Safety; health care; assisting the blind; looking for missing persons
Labelled Faces in the Wild : 13,000 labeled human faces images that can be used for developing applications that involve facial recognition for different use-cases.
UMD Faces : This dataset consists of images of more than 350 thousand faces of 8,5K people; all the images are annotated.
CASIA WebFace : Facial dataset of 453,453 images over 10,575 identities after face detection; requires some filtering for quality.
FERET : The Facial Recognition Technology Database contains more than 14 thousand images of people faces; most of them are annotated.
MS-Celeb-1M : 1 million images of celebrities from around the world; requires some quality filtering for best results on deep networks.
100,000 Faces : 100,000 Faces Generated by AI; built original machine learning dataset to construct a realistic set of 100,000 faces; it was built by taking 29K photos of 69 models over the last 2 years.
Use Cases: Standard, breed classification
Stanford Dogs Dataset : The dataset made by Stanford University contains more than 20 thousand annotated images and 120 different dog breed categories.
Fishnet.AI : AI training dataset for fisheries; 35K images with an average of 5 bounding boxes per image were collected from on-board monitoring cameras for long line tuna fishing activity.
The Oxford-IIIT Pet Dataset : The Oxford-IIIT Pet Dataset contains 37 categories of pets images with roughly 200 images of each.
Focus: Satellite Imagery
Use Cases: Getting setellite imagery for vatious use-cases
OpenStreetMap : Vector data for the entire planet under a free license; contains (an older version of) the US Census Bureau’s data.
NEXRAD : The Next Generation Weather Radar dataset presents a big collection of Doppler radar scans of atmospheric conditions in the USA.
xBD : A dataset for assessing building damage from satellite imagery; 850K building polygons from 6 types of natural disaster around the world, covering a total area of 45K square kilometers.
Spacenet : An online repository of freely available satellite imagery; open innovation project for the geospatial industry is a collaboration between CosmiQ Works, DigitalGlobe and NVIDIA.
Radiant Earth Foundation : An open library for geospatial training data to advance machine learning applications on Earth Observations.
Use Cases: Dress recommendation; trend prediction; virtual trying on clothes
iMaterialist-Fashion : Samasource and Cornell Tech announced the iMaterialist-Fashion dataset in May 2019, with over 50K clothing images labeled for fine-grained segmentation.
Fashion-MNIST : 60K training images and 10K test images; a MNIST-like fashion product database – a direct replacement for overused MNIST dataset; each image is in greyscale and associated with a label from 10 classes.
Fashion IQ : A new dataset for natural language based fashion image retrieval; it provides natural language annotations to facilitate the training of interactive image retrieval systems.
DeepFashion2 : A versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval; 801K clothing items where each item has rich annotations.
Focus: Scene & Action Recognition
Use Cases: Dangerous situations detection
TV Human Interaction Dataset : The dataset consists of 300+ videos from 20 different TV shows for prediction social actions: handshake, high five, hug, kiss and none.
Berkeley Multimodal Human Action Database (MHAD) : The dataset contains video clips in which a single person performing 12 different actions.
THUMOS Dataset : THUMOS Dataset is a large collection of video clips of different kinds; the dataset can be used for action classification.
13,000 video clips.
The 20BN-something-something Dataset V2 : Densely-labeled video clips that show humans performing predefined basic actions with everyday objects.
220,000 video clips.
50 Salads Dataset : Fully annotated 4.5 hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each.
Focus: Handwritting Recognition
Use Cases: Handwritten texts digitization; robotic process automation; recognizing documents and writing
Artificial Characters Dataset : Artificial Characters Dataset is a collection of artificially generated data describing the structure of 10 capital English letters.
Character Trajectories Dataset : Character Trajectories Dataset contains over 3,000 labeled samples of pen tip trajectories for people writing simple characters.
Gisette Dataset : Handwriting samples from the often-confused 4 and 9 characters; the total number of images in the dataset is more than 13 thousand.
MNIST database : MNIST database presents different kinds of databases; this one presents samples of handwritten digits.
Semeion Handwritten Digit Dataset : Database of handwritten digits from 80 people; the total number of images is about 1500.
Focus: Biological & Medical
Use Cases: Mutual analysing work automazation
Recursion Cellular Image Classification : This data comes from the Recursion 2019 challenge; this goal of the competition was to use biological microscopy data to develop a model that identifies replicates.
TensorFlow patch_camelyon Medical Images : This medical image classification dataset comes from the TensorFlow website; it contains just over 327K color images; the images are histopathological lymph node scans which contain metastatic tissue.
NIH Database of 100,000 Chest X-Rays : The dataset of scans is from more than 30,000 patients, including many with advanced lung disease.
ADNI : Alzheimer’s Disease Neuroimaging Initiative (ADNI) unites researchers with study data as they work to define the progression of Alzheimer’s disease.
BrainWeb : BrainWeb is a simulated brain MR database under a free license that contains large number of annotated images.
Aberystwyth Leaf Evaluation Dataset : Timelapse plant images with hand-marked up leaf-level segmentations for some time steps, and biological data from plant sacrifice.
Atlas of Digital Pathology : 17,668 histological patch images extracted from 100 slides annotated with up to 57 HTTs from different organs – the aim is to provide training data for supervised multi-label learning.
Breast Ultrasound Dataset B : Breast Ultrasound Dataset B contains 2D Breast Ultrasound Images with 53 malignant lesions and 110 benign lesions.
CheXpert : A large dataset of chest X-rays and competition for automated chest x-ray interpretation, which features uncertainty labels and radiologist-labeled reference standard evaluation sets.
ASU DR-AutoCC Data : A Multiple-Instance Learning feature space for a diabetic retinopathy classification open dataset.
Focus: Autonomous Driving
Use Cases: Autonomous/self-driving sysems building
Waymo : Waymo is one of the largest and most diverse autonomous driving open datasets that have been ever released.
Boxy Vehicle Detection by Bosch : A large vehicle detection dataset with almost two million annotated vehicles
for training and evaluating object detection methods for self-driving cars on freeways.
Cityscapes : A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities.
Daimler Pedestrian : The dataset is an iterative framework for pedestrian segmentation combining shape models and multiple data cues.
Nuscenes : Large-scale open-source dataset for autonomous driving that consists of about 1500 images.
Mapillary : Global dataset of over 750 million street-level images enables you to train robust perception models for autonomous driving.
Use Cases: Any
xView : One of the largest publicly available datasets of overhead imagery. It contains a large number of images from complex scenes around the world.
1 million objects.
Labelme : A large dataset of annotated images; an online annotation tool to build image databases for computer vision research is being created.
ImageNet : Image dataset for new algorithms, organized like the WordNet hierarchy, in which hundreds and thousands of images depict each node of the hierarchy.
14 million images.
Visual Genome: Visual Genome is not just a dataset, it is a very detailed visual knowledge base with captioning more than 100 thousand images.
Google’s Open Images : A big images URLs collection consisting of 9 million items “that have been annotated with labels spanning over 6,000 categories”.
9 million images.