When training a classification model, it’s possible to get seemingly adequate results without a mountain of training data. Unfortunately, these models don’t live up to expectations once put into production. The occasional failures (false positives and negatives) can seem insignificant in a demo but will render the model unusable in a production scenario. Larger training datasets will improve accuracy, but with diminishing returns, because most of the additional training data does little to improve the model. A better strategy is active learning.
The Active Learning Mechanism
Active learning is a mechanism where the system requests feedback on specifically targeted data and uses that selected data for the training, testing, and continuous improvement of the model. The secret is minimizing the amount of training data that needs labeling by selecting training samples that address the weaknesses in the model. Human labeling of data is a slow and costly process; therefore, human input needs to be minimized and valued.
If your model is used to detect the species of a bird, it might perform worse in specific scenarios, e.g. late in the afternoon or when the image has trees in the background. The model will improve faster if the training samples address these weak points.
The images used for active learning should mostly come from the production system’s inputs, i.e. images captured by your system’s camera, rather than general images of the targeted class. Large external datasets can often cause the model to generalize on irrelevant features and camera angles, which can hurt the performance of your system’s model.
Strategies for Classification
There are several strategies for determining whether a previously classified image is a candidate for human review. The best training samples come from a combination of strategies.
Firstly, the easiest strategy is for the system to flag any image belonging to a class that is underrepresented in the training set, as this keeps the training set balanced. Underrepresented classes are also typically weak points in model accuracy.
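This first strategy can be sketched in a few lines. The function names, and the 5% threshold, are illustrative choices, not a fixed API:

```python
from collections import Counter

def underrepresented_classes(labels, min_fraction=0.05):
    """Return the classes whose share of the training set falls below
    min_fraction, so new images of those classes can be prioritized
    for human review."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls for cls, n in counts.items() if n / total < min_fraction}

def flag_for_review(predicted_class, labels, min_fraction=0.05):
    """Flag a prediction whose class is underrepresented in training."""
    return predicted_class in underrepresented_classes(labels, min_fraction)
```

As the review queue feeds labeled images back into the training set, the flagged classes naturally shrink out of the underrepresented set.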
Secondly, the confidence in a prediction can be used to identify training images: low-confidence predictions are very often a sign of model weakness.
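Two common ways to measure “low confidence” are the top-class probability and the margin between the top two classes. A minimal sketch, assuming the model outputs a probability per class (thresholds are illustrative):

```python
import numpy as np

def select_low_confidence(probs, threshold=0.6):
    """Given an (n_samples, n_classes) array of predicted class
    probabilities, return the indices of predictions whose top-class
    probability falls below the threshold -- candidates for review."""
    probs = np.asarray(probs)
    return np.where(probs.max(axis=1) < threshold)[0]

def select_small_margin(probs, margin=0.2):
    """Margin sampling: flag images where the gap between the top two
    class probabilities is small, i.e. the model is torn between them."""
    probs = np.asarray(probs)
    sorted_p = np.sort(probs, axis=1)
    gaps = sorted_p[:, -1] - sorted_p[:, -2]
    return np.where(gaps < margin)[0]
```

Margin sampling often surfaces different images than a plain confidence cutoff, so combining both tends to give a richer review set.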
Another strategy is to use a parallel model to find the production model’s weaknesses. This strategy carries considerably more complexity and more pitfalls, but is very powerful if done correctly. A parallel classifier is trained to predict which images are the production model’s weak points. Training data for this parallel model comes from images failing in the test set, and images that have been flagged as incorrect. If implemented correctly, this model will identify conditions such as “low exposure” or “overcast” where the production model performs poorly. The risks of this approach include bias toward previously reviewed images, as well as potential test-data contamination if done incorrectly.
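Before investing in a full parallel classifier, a simple failure-rate analysis over coarse image descriptors can surface the same kind of weak-point conditions. This is a minimal stand-in for the parallel model, with illustrative tag names:

```python
from collections import defaultdict

def failure_rate_by_feature(records):
    """records: list of (feature_tags, correct) pairs, where feature_tags
    is a set of coarse image descriptors such as {'low_exposure'} and
    correct indicates whether the production model got the image right.
    Returns each tag's failure rate, surfacing conditions where the
    production model performs poorly."""
    fail, total = defaultdict(int), defaultdict(int)
    for tags, correct in records:
        for tag in tags:
            total[tag] += 1
            if not correct:
                fail[tag] += 1
    return {tag: fail[tag] / total[tag] for tag in total}
```

A trained parallel classifier generalizes this idea: instead of hand-chosen tags, it learns the failure-predicting features directly from the images.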
Lastly, the image metadata can be used to predict model weaknesses. For example, poor predictions will often come from a particular device or camera angle, so the camera that captured an image is a useful signal.
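A metadata filter can then route matching images into the review queue. The field names and the “late afternoon” hour range below are illustrative assumptions:

```python
def flag_by_metadata(images, weak_cameras, weak_hours=range(17, 20)):
    """Flag images whose metadata matches known weak spots, e.g. a
    camera that historically produces poor predictions, or the late
    afternoon hours mentioned earlier. Each image is a dict with
    'camera_id' and 'hour' keys (hypothetical field names)."""
    return [
        img for img in images
        if img["camera_id"] in weak_cameras or img["hour"] in weak_hours
    ]
```

The set of weak cameras and hours would itself come from a failure-rate analysis over past predictions, so the filter stays grounded in observed errors rather than guesswork.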
After identifying the candidate review images, the next step in an active learning pipeline is to make this minimal review set easy for a human labeler to review and re-classify quickly. This should include a mechanism for the end-users of the model to flag predictions as “wrong”. The labeling process should also allow the labeler to mark an image as ambiguous. Many models have been compromised by ambiguous data being inconsistently labeled and added to the training set; a better strategy is to simply flag this data and exclude it from the training set.
The Role of Data Augmentation
This newly labeled set can then be enlarged further through data augmentation and semi-supervised learning. Data augmentation is the approach of taking a labeled image and creating more examples through various transforms, e.g. a single image can be multiplied through rotations and color adjustments. Semi-supervised learning is the approach of taking a labeled item and deriving further examples using system-specific rules or metadata. For example, an image can be multiplied by fetching images from other cameras pointing at the same object at the same point in time. Another example would be to use a barcode or other piece of data to identify the same object in multiple images.
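The augmentation step can be as simple as a handful of array transforms. A minimal sketch using NumPy (real pipelines would typically use a library such as torchvision or albumentations):

```python
import numpy as np

def augment(image, brightness_deltas=(-30, 30)):
    """Multiply one labeled image into several training samples via
    simple transforms: 90-degree rotations, a horizontal flip, and
    brightness shifts. `image` is an HxWxC uint8 array."""
    variants = [np.rot90(image, k) for k in range(4)]   # 0/90/180/270 deg
    variants.append(np.fliplr(image))                   # mirror
    for delta in brightness_deltas:                     # exposure shifts
        shifted = np.clip(image.astype(int) + delta, 0, 255)
        variants.append(shifted.astype(np.uint8))
    return variants
```

Every variant inherits the original image’s label, so one human review yields seven training samples here.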
Finally, the model training, testing, and deployment should all be automated. Active learning feels complete when the model simply gets better in the background with minimal human input. This is only possible if the training has scheduled runs, followed by automated tests to determine whether the model’s performance has improved. If the candidate model is an improvement over the current production model, the deployment into production should also be automated.
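The promotion decision at the end of each scheduled run can be a single automated gate. The metric name and threshold below are illustrative assumptions:

```python
def should_promote(candidate_metrics, production_metrics, min_gain=0.0):
    """Decide whether a newly trained candidate model should replace
    the production model, based on held-out test accuracy. A positive
    min_gain guards against promoting on noise-level improvements."""
    gain = candidate_metrics["accuracy"] - production_metrics["accuracy"]
    return gain > min_gain
```

In practice the gate would compare several metrics (per-class recall, accuracy on known weak-point conditions) rather than a single accuracy number, but the principle is the same: deployment happens only when the automated tests show a genuine improvement.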
To stay competitive, organizations need to embrace innovation and technology as key ingredients of business success. Machine learning and deep learning form part of the foundation for smarter, more efficient business optimization.
Active learning enables production systems to keep getting better and more useful over time, without excessive effort by data scientists.