Here are some techniques for evaluating your datasets:
- Data Visualization
  - Distribution Analysis: Plot histograms of image/video counts for each class to visually identify potential imbalances or underrepresented classes. This will guide you on whether to implement strategies like undersampling, oversampling, or data augmentation to mitigate potential bias (see the sketch after this list).
  - Qualitative Examination: Display a random selection of image or video samples. This provides a quick way to assess overall image quality, spot obvious labeling errors, and get a sense of dataset diversity.
  - Visualization Tools: Consider using libraries like Matplotlib, Plotnine, or specialized data visualization platforms to create other plots to explore your dataset.
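As a concrete illustration of the first two techniques, here is a minimal sketch using Matplotlib. The `images` and `labels` arrays are hypothetical placeholders (randomly generated so the snippet runs end to end); substitute your own dataset.

```python
import random
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

# Placeholder data so the sketch runs as-is; swap in your real dataset.
rng = np.random.default_rng(0)
labels = rng.choice(["bee", "wasp"], size=200, p=[0.3, 0.7]).tolist()
images = [rng.random((32, 32, 3)) for _ in labels]

# Distribution analysis: bar chart of per-class sample counts.
counts = Counter(labels)
plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel("Class")
plt.ylabel("Number of samples")
plt.title("Class distribution")
plt.show()

# Qualitative examination: a random 3x3 grid of samples with their labels.
fig, axes = plt.subplots(3, 3, figsize=(6, 6))
for ax, idx in zip(axes.flat, random.sample(range(len(images)), 9)):
    ax.imshow(images[idx])
    ax.set_title(labels[idx])
    ax.axis("off")
plt.tight_layout()
plt.show()
```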
- Error Analysis
  - Perform an analysis of incorrectly predicted samples after initial training and validation. This helps identify systematic errors related to specific types of images or scenarios, indicating potential dataset deficiencies to address. Confusion matrices are useful for this strategy.
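For example, scikit-learn can tabulate and plot a confusion matrix from your validation predictions. A minimal sketch; `y_true` and `y_pred` below are hypothetical labels and predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical ground-truth labels and model predictions on a validation set.
y_true = ["bee", "wasp", "wasp", "bee", "wasp", "bee", "wasp", "wasp"]
y_pred = ["bee", "wasp", "bee", "bee", "wasp", "wasp", "wasp", "wasp"]

# Rows are true classes, columns are predictions; off-diagonal cells show
# which classes the model systematically confuses.
cm = confusion_matrix(y_true, y_pred, labels=["bee", "wasp"])
ConfusionMatrixDisplay(cm, display_labels=["bee", "wasp"]).plot()
plt.show()
```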
Strategies for Mitigating Dataset Challenges
If you’ve determined that there is an issue with your dataset, here are some strategies for mitigating it:
- Splitting Strategies
  - Importance of Proper Splitting: Randomly splitting your data into training, validation, and testing sets is essential. The training set is used to train the model, the validation set for tuning hyperparameters, and the testing set for an unbiased evaluation of final performance and for comparing different models and model configurations. A (very) rough standard for train-validation-test splits is 60-80% training data, 10-20% validation data, and 10-20% test data.
  - Stratified Approach for Imbalance: If your dataset suffers from class imbalance, a stratified split ensures the proportion of classes remains consistent across the training, validation, and testing sets. Say, for example, that you had a dataset with 70% images of wasps and 30% images of bees. Instead of randomly splitting the data in aggregate, you would split it such that the training, validation, and testing sets each contained 70% wasps and 30% bees. This helps mitigate the model skewing toward overrepresented classes in the training set.
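One way to produce such splits is scikit-learn's `train_test_split`, applied twice with the `stratify` argument so every subset keeps the original class proportions. A sketch with placeholder data, using a rough 70/15/15 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples, 70% wasps / 30% bees.
rng = np.random.default_rng(0)
X = rng.random((1000, 32 * 32 * 3))  # flattened images
y = np.array(["wasp"] * 700 + ["bee"] * 300)

# First carve off the 15% test set, stratifying on the labels...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
# ...then split the remainder into train and validation sets
# (0.15 / 0.85 of the remainder is roughly 15% of the full dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, stratify=y_train, random_state=42
)

# Each subset preserves the original 70/30 wasp-to-bee ratio.
for name, split in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, np.mean(split == "wasp"))
```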
- Data Augmentation: Data augmentation is the process of artificially creating variations of existing images (cropping, flipping, color jittering, etc.) to increase the effective size and diversity of your dataset. While we include it here as a way to mitigate limited or biased data, it is important to note that data augmentation is an important tool for training any computer vision model, and we will cover how image data augmentation works in the next module!
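As a preview, here is a minimal sketch using torchvision's `transforms` module (assuming PyTorch and torchvision are installed); the specific pipeline below is just one common combination, not a prescription:

```python
from PIL import Image
import torchvision.transforms as T

# A typical augmentation pipeline: each epoch sees a slightly different
# variant of every training image.
augment = T.Compose([
    T.RandomResizedCrop(224),       # random crop, then resize to 224x224
    T.RandomHorizontalFlip(p=0.5),  # mirror the image half the time
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
])

# Placeholder image; in practice this runs inside your Dataset's __getitem__.
image = Image.new("RGB", (256, 256), color="gray")
augmented_tensor = augment(image)  # a 3x224x224 tensor
```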
- Oversampling: Oversampling is the strategy of creating more samples of the minority class(es) from existing samples. Two popular methods are Random Oversampling and the Synthetic Minority Oversampling Technique (SMOTE). Random Oversampling simply makes copies of random minority class images until the class has as many samples as the majority class. SMOTE takes a minority class sample and one of its nearest minority class neighbors and generates a synthetic sample that lies between the two (for images, a pixel-wise blend), repeating this process until the minority class(es) have as many samples as the majority class. Unfortunately, both techniques tend to lead to overfitting of the minority classes (with Random Oversampling being worse than SMOTE) and shouldn’t be the only strategy you use. It should be noted that SMOTE was originally designed for tabular data, and it works spectacularly in that domain. For more information on SMOTE, please see its documentation.
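Both methods are implemented in the imbalanced-learn library, which expects flat feature vectors, so images must be flattened first (one reason SMOTE translates awkwardly to raw pixels). A minimal sketch with placeholder data:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Placeholder imbalanced data: 700 "wasp" vs 300 "bee" flattened images.
rng = np.random.default_rng(0)
X = rng.random((1000, 32 * 32 * 3))
y = np.array(["wasp"] * 700 + ["bee"] * 300)

# Random Oversampling: duplicate random minority samples until balanced.
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: interpolate between a minority sample and one of its nearest
# minority neighbors to synthesize new samples until balanced.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)

print(np.unique(y_ros, return_counts=True))  # 700 / 700
print(np.unique(y_sm, return_counts=True))   # 700 / 700
```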
- Undersampling: Undersampling is the technique of removing samples from the majority class to match the number of samples in the minority class. Two popular examples are Random Undersampling and NearMiss Undersampling. Random Undersampling removes random majority class samples. NearMiss Undersampling instead selects which majority class samples to keep based on their distance to minority class samples, discarding the rest to sharpen the decision boundary between classes. Undersampling works best when you have enough samples of the majority class that disposing of some won’t hurt the model’s ability to generalize.
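These are also available in imbalanced-learn; a sketch reusing the same kind of placeholder arrays as above:

```python
import numpy as np
from imblearn.under_sampling import NearMiss, RandomUnderSampler

# Placeholder imbalance: 700 "wasp" vs 300 "bee" flattened images.
rng = np.random.default_rng(0)
X = rng.random((1000, 32 * 32 * 3))
y = np.array(["wasp"] * 700 + ["bee"] * 300)

# Random Undersampling: discard random majority samples until balanced.
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)

# NearMiss: keep the majority samples closest to the minority class,
# discarding the rest to tighten the decision boundary.
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)

print(np.unique(y_rus, return_counts=True))  # 300 / 300
print(np.unique(y_nm, return_counts=True))   # 300 / 300
```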
- Transfer Learning: Transfer learning involves starting with a model pre-trained on a massive dataset (like ImageNet) and fine-tuning it on your smaller, specialized dataset. This allows you to leverage the rich patterns learned on the large dataset, even if your own dataset is relatively modest. It is particularly effective when your dataset is small or your task is similar to the one the pre-trained model was trained on. Ensure there’s enough similarity between your task and the pre-trained model’s domain, or transfer learning might have limited benefits.
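A common recipe with torchvision (one of several ways to do this): load a ResNet pre-trained on ImageNet, freeze the backbone, and replace only the final classification layer. A sketch, assuming a hypothetical two-class (bee/wasp) task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 with ImageNet weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone so only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a fresh two-class head.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new layer's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Sanity check with a dummy batch (your fine-tuning loop goes here).
dummy = torch.randn(4, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([4, 2])
```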
Evaluating your dataset is crucial for building robust and reliable computer vision models. A well-curated dataset is the foundation of success in any computer vision endeavor!