Evaluating Datasets for Computer Vision

The quality of your dataset directly influences the effectiveness of your computer vision model. Garbage in, garbage out, as they say! This section will provide the tools and strategies to critically evaluate a dataset’s suitability for your specific computer vision task. We’ll cover which criteria to consider, how to analyze a dataset, and common challenges you might encounter when working with real-world image data. This module includes an exercise where we’ll look at class imbalance. More advanced techniques, such as data augmentation, will be covered in later modules of this course.

Where Does Vision Data Come From?

As mentioned, you can generate training data or use online dataset resources (such as Kaggle or [Hugging Face](https://huggingface.co/)). However, when doing scientific research, you will typically need to collect and label your own data. This process can be the most time-consuming aspect of computer vision tasks, but we will only touch on it briefly.

For image classification, you will need a dataset of images labeled by class. The end goal of labeling is typically a set of folders, each named after a label and containing the images classified as belonging to that class. There are many tools to assist with labeling, and some even use AI to try to automate the process. However, care should be taken during labeling. For example, having multiple people independently label images can result in a higher-quality training dataset. Incorrectly labeled training data will hamper all downstream processes.
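As an illustration, here is a minimal sketch of loading such a folder-per-class dataset, assuming a PyTorch/torchvision setup and a hypothetical `data/` directory:

```python
# Minimal sketch: loading a folder-per-class dataset with torchvision.
# Assumes a layout like data/bees/*.jpg, data/wasps/*.jpg (hypothetical paths).
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # bring every image to a common shape
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder(root="data", transform=transform)

# ImageFolder derives the class labels from the folder names.
print(dataset.classes)          # e.g. ['bees', 'wasps']
print(len(dataset))             # total number of labeled images
image, label = dataset[0]       # a (tensor, class-index) pair
```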

Going beyond classification, object detection needs a set of images with bounding boxes drawn around the individual objects in each image. These are usually stored either as the coordinates of the upper-left and lower-right corners of the bounding box, or as the center coordinate plus the box’s width and height. Again, there are tools to help with the task, and having multiple people labeling is a best practice.
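To make the two storage conventions concrete, here is a small, self-contained sketch (plain Python, no particular labeling tool assumed) converting between them:

```python
# Two common bounding-box formats and the conversion between them.
# Corner format:  (x_min, y_min, x_max, y_max)
# Center format:  (x_center, y_center, width, height)

def corners_to_center(x_min, y_min, x_max, y_max):
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)

def center_to_corners(x_c, y_c, width, height):
    return (x_c - width / 2, y_c - height / 2,
            x_c + width / 2, y_c + height / 2)

# Example: a 100x50-pixel box whose upper-left corner is at (10, 20).
print(corners_to_center(10, 20, 110, 70))   # (60.0, 45.0, 100, 50)
print(center_to_corners(60, 45, 100, 50))   # (10.0, 20.0, 110.0, 70.0)
```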

Lastly, for image segmentation, each image is accompanied by an object mask that assigns each pixel to a class. This level of annotation is the most labor-intensive to prepare. For example, the COCO dataset (arXiv:1405.0312), with 200,000 images, required 70,000 hours of effort to label. Recent advances (e.g., arXiv:2301.03992) have shown good results in auto-masking, using bounding box datasets to produce segmentation masks.
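As a concrete illustration, a segmentation mask is often stored as an integer array with the same height and width as the image, where each value is a class index. A small NumPy sketch with made-up class IDs:

```python
# A segmentation mask stores one class index per pixel.
# Hypothetical class IDs: 0 = background, 1 = bee, 2 = flower.
import numpy as np

mask = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
])

# Count how many pixels belong to each class.
classes, counts = np.unique(mask, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))   # {0: 8, 1: 4, 2: 4}
```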

Key Considerations for Dataset Evaluation

Data Relevance

  • Real-World Alignment: Does the dataset accurately represent the real-world scenario your model will encounter? If there’s a mismatch, your model might struggle to generalize. For example, a self-driving car model trained on sunny weather data might perform poorly in rain or snow.
  • Diversity and Variation: Are the images/videos diverse enough to capture potential variations (lighting, angles, occlusions)?

Data Sufficiency

  • Volume: Is the dataset large enough to allow the model to learn complex patterns? Larger datasets generally lead to better performance.
  • Model Complexity: The more parameters a model has, the more data it will tend to need. Deep models with many layers often require larger datasets than simpler models.
  • Class Imbalance: If some classes are heavily over-represented compared to others, the model might become biased towards the majority classes. Real-world classification problems frequently have imbalanced class distributions, such as fraud detection, extreme weather events, or identifying species of animals in a region. This can usually be addressed through data augmentation or sampling techniques (a short sketch of measuring imbalance follows this list).
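
One way to get a quick read on imbalance is simply to count images per class. A minimal sketch, assuming the same hypothetical folder-per-class layout (`data/<class_name>/*.jpg`) used above:

```python
# Minimal sketch: measuring class imbalance in a folder-per-class dataset.
from pathlib import Path

root = Path("data")
counts = {
    class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
    for class_dir in root.iterdir() if class_dir.is_dir()
}

print(counts)  # e.g. {'bees': 1200, 'wasps': 150}

# A simple imbalance measure: largest class divided by smallest class.
if counts:
    ratio = max(counts.values()) / max(1, min(counts.values()))
    print(f"Imbalance ratio: {ratio:.1f}x")
```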

Data Quality

  • Accurate Labeling: Mislabeled images (e.g., a bee labeled as a wasp), incorrect bounding boxes in object detection, or sloppy segmentation masks will confuse the model during training. Thoroughly check annotation quality.
  • Noise and Errors: Consider the level of noise in the data. Real-world images can have blurriness, artifacts, or distortions. A certain level of noise can be tolerated, but heavily corrupted images will hinder model performance. This can also be an issue if you are aggregating different datasets, or using datasets that differ markedly from the inputs your model will see at inference time. When aggregating datasets, for example, images of different resolutions or sizes can introduce noise when you preprocess them to a common shape for training (see the sketch below).
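
One hedged sketch of the resizing issue, using Pillow (the file name is hypothetical): squashing images of different aspect ratios to a fixed square distorts them, while resizing with padding preserves geometry at the cost of empty borders.

```python
# Two ways to bring images of different sizes to a common 224x224 shape.
# File name is hypothetical; assumes Pillow is installed.
from PIL import Image

TARGET = 224
img = Image.open("example.jpg")

# Option 1: direct resize. Simple, but distorts the aspect ratio,
# which can act as a source of "noise" when mixing datasets.
squashed = img.resize((TARGET, TARGET))

# Option 2: resize the longer side to 224, then pad to a square.
scale = TARGET / max(img.size)
new_size = (round(img.width * scale), round(img.height * scale))
resized = img.resize(new_size)

padded = Image.new("RGB", (TARGET, TARGET))            # black canvas
offset = ((TARGET - new_size[0]) // 2, (TARGET - new_size[1]) // 2)
padded.paste(resized, offset)
```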
