Machine learning systems are born from the marriage of code and data. The code specifies how the machine should learn, and the training data encapsulates what should be learned. Academia mostly focuses on ways to improve the learning algorithms, the how of machine learning. When you come to build practical AI systems, though, the dataset you train on has at least as much impact on performance as the choice of algorithm.
Although there are plenty of tools for improving machine learning models, there are very few for improving your dataset. At Humanloop, we've given a lot of thought to how you can systematically improve datasets for machine learning.
In a recent talk, Andrew Ng shared a story from a project he worked on at Landing AI, building a computer vision system to help find defects in steel. Their first attempt at the system had a baseline performance of 76%. Humans could find defects with 90% accuracy, so this wasn't good enough to put into production. The team working on the project then split in two. One team tried different model types, hyperparameters and architecture changes. The other team worked on improving the quality of the dataset. After a few weeks of iteration, the results came in. The modeling team, despite huge effort, had not been able to improve performance at all. The data team, on the other hand, achieved a 16% performance improvement. Improving the dataset actually led to super-human performance on this task.
By fixing errors in their dataset, the data team was able to take their algorithm from worse-than-human to super-human
This story isn't at all unique. I've had a similar experience at Humanloop. We worked with a team of lawyers from one of the Big Four accountancy firms to train a document classifier on legal contracts. Similar to finding defects in steel, the task was subtle and required domain expertise. After the first round of labeling and training was complete, the model still wasn't good enough to match human-level performance. Within Humanloop, there's a tool to investigate data points where the AI model and the human annotators disagree. Using this view, the team was able to find around 30 misclassifications in a dataset of 1,000 documents. Fixing just these 30 mistakes was enough to get the AI system to match human-level performance.
There's a lot of discussion of "data prep" and "data cleaning", but what actually differentiates high-quality data from low-quality data?
Most machine learning systems today still use supervised learning. That means the training data consists of (input, output) pairs, and we want the system to learn to map from the input to the output. For example, the input may be an audio clip and the output could be the transcribed speech. Or the input might be an image of a damaged car and the output could be the locations of all the scratches. At Humanloop we focus on NLP, so an example input for us might be a customer service message and the output could be a templated response. Building these training datasets usually requires having humans manually label the inputs for the computer to learn from.
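At its simplest, a labeled dataset for the customer-service example above is just a list of (input, output) pairs. The texts and label names below are purely illustrative:

```python
# A minimal sketch of supervised (input, output) pairs for text classification.
# The messages and labels here are made-up examples, not real training data.
training_data = [
    ("Hi, my order arrived damaged, can I get a refund?", "refund_request"),
    ("Please find my CV attached for the open role.", "job_application"),
    ("What time do you open on Saturdays?", "opening_hours"),
]

for text, label in training_data:
    print(f"{label:>15}: {text}")
```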
If there's ambiguity in the way data is labeled, the ML model will need much more data to reach high performance. There are a few different ways that data collection and annotation can go wrong: annotation guidelines can be ambiguous, different annotators can label the same example inconsistently, and some classes can end up with far too few examples, leaving the dataset imbalanced.
Most companies and research groups don't have access to the internet-scale datasets that Google, Facebook and other tech giants have. When the dataset is that large, you can get away with some noise in your data. However, most teams are operating in domains where they have hundreds to thousands of labeled examples. In this small-data regime, data quality becomes even more important.
To get some intuition for why data quality matters so much, consider a very simple one-dimensional supervised learning problem: fitting a curve to some measured data points. Compare a large, noisy dataset with a small, clean one. A small number of very low-noise data points reveals the same underlying curve as a large but noisy dataset. The corollary is that noise in small datasets is particularly harmful. Though most machine learning problems are far higher-dimensional, they operate on the same principles as curve fitting and are affected in analogous ways.
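As a rough illustration of this point (not from the original example), here is a toy numpy experiment; the "true" curve, the dataset sizes and the noise levels are arbitrary choices for the demo:

```python
# Toy illustration: a handful of clean points can recover roughly the same
# curve as a much larger noisy dataset. All numbers below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: 0.5 * x**2 - x + 2  # assumed "ground truth" curve

# Large but noisy dataset
x_big = rng.uniform(-3, 3, size=500)
y_big = true_fn(x_big) + rng.normal(scale=2.0, size=x_big.shape)

# Small but clean dataset
x_small = rng.uniform(-3, 3, size=10)
y_small = true_fn(x_small) + rng.normal(scale=0.05, size=x_small.shape)

# Fit the same quadratic model to both
coef_big = np.polyfit(x_big, y_big, deg=2)
coef_small = np.polyfit(x_small, y_small, deg=2)

print("true coefficients:   [0.5, -1.0, 2.0]")
print("500 noisy points ->", np.round(coef_big, 2))
print("10 clean points  ->", np.round(coef_small, 2))
```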
There are lots of tools for improving machine learning models, but how can we systematically improve machine learning datasets?
A workflow that some teams are beginning to adopt is to iterate between training models and correcting "data bugs". Tools are emerging to facilitate this workflow, such as Label Noise in Context, Aquarium Learning and the Humanloop data debugger.
The way these tools work is to use the model being trained to help find "data bugs". This can be done by looking at examples where the model and humans disagree strongly, or at classes where there is high disagreement between different annotators. Different forms of visualization can help find clusters of mistakes so they can be fixed all at once.
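As a rough sketch of the principle (not the actual Humanloop tooling), you could rank examples by how confidently a trained model disagrees with the human label; `model`, `X` and `y_annotated` below are placeholders:

```python
# Surface examples where a trained model confidently disagrees with the human
# label, so annotators can review likely "data bugs" first.
import numpy as np

def find_suspect_labels(probs, human_labels, top_k=30):
    """probs: (n_examples, n_classes) predicted probabilities.
    human_labels: (n_examples,) integer labels assigned by annotators."""
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    disagrees = predicted != human_labels
    # Rank disagreements by how confident the model is in its own prediction.
    ranked = np.argsort(-confidence * disagrees)[:top_k]
    return [i for i in ranked if disagrees[i]]

# Usage (with placeholder model and data):
# suspects = find_suspect_labels(model.predict_proba(X), y_annotated)
# ...then send `suspects` back to annotators for review.
```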
Another approach to improving datasets is to embrace noise but use heuristic rules to scale annotation.
As we saw in the curve-fitting example above, you can get good results either through very small datasets that are clean or very large datasets that are noisy. The idea behind weak labeling is to automatically generate a very large number of noisy labels. These labels are generated by having subject matter experts write down heuristic rules.
For example, you may have a rule for an email classifier that says "mark an email as a job application if it contains the word 'cv'". This rule will not be very reliable, but it can be automatically applied to thousands or millions of examples.
If there are lots of different rules, their labels can be combined and de-noised to produce high-quality training data.
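Here is a minimal sketch of what weak labeling can look like in practice, assuming a simple majority vote to combine the rules (dedicated label models do a more careful de-noising job); the rules and label names are illustrative:

```python
# Subject matter experts write heuristic labelling functions; each function
# votes on every example and the votes are combined. Majority vote is the
# simplest possible combination strategy.
from collections import Counter

ABSTAIN = None

def lf_mentions_cv(email):
    return "job_application" if "cv" in email.lower() else ABSTAIN

def lf_mentions_refund(email):
    return "refund_request" if "refund" in email.lower() else ABSTAIN

def lf_mentions_invoice(email):
    return "billing" if "invoice" in email.lower() else ABSTAIN

LABELLING_FUNCTIONS = [lf_mentions_cv, lf_mentions_refund, lf_mentions_invoice]

def weak_label(email):
    votes = [lf(email) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no rule fired; leave the example unlabeled
    return Counter(votes).most_common(1)[0][0]

print(weak_label("Please find my CV attached."))  # -> job_application
print(weak_label("I'd like a refund please."))    # -> refund_request
```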
Data cleaning tools still rely on humans to manually find the mistakes in datasets, and they don't address the problem of class imbalance described above. Active learning is an approach that trains a model as a team annotates and uses that model to search for high-value data. Active learning can automatically improve the balance of datasets and help teams get to high-performing models with significantly less data.
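As a rough sketch of one round of an uncertainty-sampling active learning loop (one common flavour of active learning), assuming placeholder `train_model` and `annotate` helpers and an sklearn-style `predict_proba`:

```python
# One round of uncertainty sampling: train on the labelled pool, score the
# unlabelled pool, and send the least certain examples to annotators.
import numpy as np

def active_learning_round(labelled_X, labelled_y, unlabelled_X, batch_size=50):
    model = train_model(labelled_X, labelled_y)        # placeholder trainer
    probs = model.predict_proba(unlabelled_X)          # (n, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)              # least-confident score
    query_idx = np.argsort(-uncertainty)[:batch_size]  # most uncertain first
    new_labels = annotate(unlabelled_X[query_idx])     # humans label the batch
    return query_idx, new_labels
```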
As we've written about recently, one of the big advantages of adopting a data-first approach to machine learning is that it allows for much better collaboration between all the different teams involved. Improving datasets forces collaboration between the subject matter experts who are annotating data and the data scientists who are thinking about how to train models.
The increased involvement of non-technical subject matter experts in training and improving machine learning models is one of the most exciting aspects of machine learning software compared to traditional software. At Humanloop we've been working on incorporating machine teaching into the normal workflows of non-technical experts so that they can automate tasks with much less dependence on machine learning engineers. In our next blog, we'll share some of our lessons on how machine learning will be incorporated into many people's daily jobs.