Identifying Label Errors
Identifying and Fixing Mislabels
In machine learning, mislabeled data is common: the labels originally assigned to data instances are simply incorrect.
As Cleanlab's labelerrors.com shows, widely used ML benchmark test sets such as MNIST and ImageNet can be riddled with label errors.
In the text domain, benchmark datasets such as Amazon Reviews and IMDb are likewise filled with label errors.
In practice, the error rate in many datasets likely exceeds 10 percent. The problem is only growing as hallucination-prone LLMs generate more of our data, making data quality an even more paramount issue. Identifying and rectifying these mislabels is essential to ensure the accuracy and reliability of machine learning models.
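A common way to surface candidate mislabels is to compare each instance's given label against a trained model's prediction and flag confident disagreements. The sketch below is a minimal, library-free illustration of that idea (the function name `flag_label_errors` and the `threshold` parameter are illustrative choices, not from any particular library):

```python
import numpy as np

def flag_label_errors(labels, pred_probs, threshold=0.8):
    """Return indices of rows whose given label disagrees with a
    confident model prediction.

    labels:     (n,) int array of given (possibly noisy) class labels
    pred_probs: (n, k) array of model-predicted class probabilities
    threshold:  minimum predicted probability for the disagreeing class
    """
    pred_labels = pred_probs.argmax(axis=1)   # model's most likely class
    pred_conf = pred_probs.max(axis=1)        # confidence in that class
    return np.where((pred_labels != labels) & (pred_conf >= threshold))[0]

# Toy example: the third row's given label (0) disagrees with a
# confident prediction for class 1, so it gets flagged.
labels = np.array([0, 1, 0])
pred_probs = np.array([[0.90, 0.10],
                       [0.20, 0.80],
                       [0.05, 0.95]])
suspect_rows = flag_label_errors(labels, pred_probs)  # → array([2])
```

Dedicated tools such as Cleanlab implement more principled versions of this comparison, but the core signal is the same: high-confidence disagreement between the given label and the model.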
More Example Datasets with Label Errors
Consider the following example dataset, where each row represents a text instance along with its initial label and the label predicted by a model:
Table 1: GoEmotions
| Text | Initial_label | Predicted_label | Error_threshold |
|---|---|---|---|
| YAY, cold Mc'Donalds. My favorite | LOVE | SARCASM | 0.95 |
| hell yeah my brother | ANNOYANCE | EXCITEMENT | 0.94 |
Table 2: Sentiment Analysis
| Text | Initial_label | Predicted_label | Error_threshold |
|---|---|---|---|
| Like everyone else my preference is for real mashed potatoes, but for fake ones... | NEUTRAL | POSITIVE | 0.87 |
| Helps me realize I am ok Not a big slob now I feel better!!!!!!! Yay Yay Ya! No more... | NEGATIVE | POSITIVE | 0.84 |
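Tables like these are straightforward to produce with pandas: keep the rows where the predicted label disagrees with the initial label and the error score clears a cutoff. The sketch below mirrors Table 1's columns (the 0.9 cutoff is an illustrative choice, not a value from the text):

```python
import pandas as pd

# Hypothetical dataframe with the same columns as the tables above;
# Error_threshold holds a per-row error score from the model.
df = pd.DataFrame({
    "Text": ["YAY, cold Mc'Donalds. My favorite", "hell yeah my brother"],
    "Initial_label": ["LOVE", "ANNOYANCE"],
    "Predicted_label": ["SARCASM", "EXCITEMENT"],
    "Error_threshold": [0.95, 0.94],
})

# Keep confident disagreements between the initial and predicted labels.
suspects = df[
    (df["Initial_label"] != df["Predicted_label"])
    & (df["Error_threshold"] >= 0.9)
]
```

Rows that survive this filter are candidates for manual review or automated relabeling, not guaranteed errors.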
Approach for Identifying and Fixing Mislabels
In the Mislabels Classification and Mislabels Prompting sections, we will cover standard zero-shot approaches for identifying label errors in datasets. By following these steps and continually refining the labeling process, we can improve the accuracy and reliability of machine learning models and minimize the impact of mislabeled data on model performance.
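As a preview of the zero-shot idea, one can ask an LLM to classify an instance from scratch and compare its answer to the stored label. The helper below only builds such a prompt; the function name and template wording are illustrative assumptions, and the actual prompting strategies are the subject of the sections mentioned above:

```python
def build_relabel_prompt(text: str, labels: list[str]) -> str:
    """Construct a simple zero-shot classification prompt for an LLM.

    The model's answer can then be compared against the dataset's
    initial label to flag potential mislabels.
    """
    label_list = ", ".join(labels)
    return (
        f"Classify the following text into exactly one of these labels: {label_list}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )

prompt = build_relabel_prompt(
    "hell yeah my brother", ["ANNOYANCE", "EXCITEMENT"]
)
```

Sending `prompt` to any chat-completion API and checking whether the reply matches the initial label gives a cheap, model-agnostic error signal.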