
Identifying Label Errors

Identifying and Fixing Mislabels

In machine learning, it is common to encounter mislabeled data, where the labels originally assigned to data instances are incorrect.

As Cleanlab's labelerrors.com shows, even standard ML benchmark test sets, such as MNIST and ImageNet, can be riddled with label errors.


In the text domain, benchmark datasets such as Amazon Reviews and IMDb are likewise filled with label errors.


In practice, real-world datasets can plausibly contain label errors in over 10 percent of their instances. The problem is only growing as LLM-generated (and potentially hallucinated) data enters training sets, making data quality an even more pressing concern. Identifying and rectifying these mislabels is essential for building accurate and reliable machine learning models.

More Example Datasets with Label Errors

Consider the following example dataset, where each row represents a text instance along with its initial label and the label predicted by a model:

Table 1: GoEmotions

| Text | Initial_label | Predicted_label | Error_threshold |
|---|---|---|---|
| YAY, cold Mc'Donalds. My favorite | LOVE | SARCASM | 0.95 |
| hell yeah my brother | ANNOYANCE | EXCITEMENT | 0.94 |

Table 2: Sentiment Analysis

| Text | Initial_label | Predicted_label | Error_threshold |
|---|---|---|---|
| Like everyone else my preference is for real mashed potatoes, but for fake ones... | NEUTRAL | POSITIVE | 0.87 |
| Helps me realize I am ok Not a big slob now I feel better!!!!!!! Yay Yay Ya! No more... | NEGATIVE | POSITIVE | 0.84 |
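One minimal way to operationalize tables like these is to flag rows where the model's prediction disagrees with the initial label and the error score exceeds a cutoff. A sketch in Python (the field names and the 0.9 cutoff are illustrative assumptions, not fixed conventions):

```python
# Flag likely mislabels: the prediction disagrees with the initial label
# and the error score exceeds a chosen cutoff (0.9 here, an assumption).
def flag_mislabels(rows, cutoff=0.9):
    return [
        row for row in rows
        if row["initial_label"] != row["predicted_label"]
        and row["error_threshold"] >= cutoff
    ]

rows = [
    {"text": "YAY, cold Mc'Donalds. My favorite",
     "initial_label": "LOVE", "predicted_label": "SARCASM",
     "error_threshold": 0.95},
    {"text": "hell yeah my brother",
     "initial_label": "ANNOYANCE", "predicted_label": "EXCITEMENT",
     "error_threshold": 0.94},
]

flagged = flag_mislabels(rows)
print([r["text"] for r in flagged])
```

Rows that pass the filter are candidates for human review rather than automatic relabeling; the cutoff trades off precision against recall of the review queue.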

Approach for Identifying and Fixing Mislabels

In the Mislabels Classification and Mislabels Prompting sections, we will go over some standard zero-shot approaches to identifying label errors within datasets. By applying these approaches and continually refining the labeling process, we can improve the accuracy and reliability of machine learning models and minimize the impact of mislabeled data on model performance.
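As a preview of the zero-shot idea, one can re-classify each text with a model given only the label set (no training examples) and treat disagreements with the stored label as candidate errors. A minimal sketch, where the prompt template is an assumption and `classify` is a hypothetical stub standing in for a real LLM call:

```python
# Zero-shot label check: re-classify each text and compare the result
# against the stored label. Disagreements become review candidates.
LABELS = ["LOVE", "SARCASM", "ANNOYANCE", "EXCITEMENT"]

def build_prompt(text, labels=LABELS):
    # A simple zero-shot classification prompt template (an assumption).
    return (
        "Classify the following text into one of these labels: "
        + ", ".join(labels)
        + f"\nText: {text}\nLabel:"
    )

def classify(text):
    # Hypothetical stub: a real implementation would send
    # build_prompt(text) to an LLM and parse the returned label.
    raise NotImplementedError

def find_candidate_errors(dataset, classify_fn):
    # dataset: iterable of (text, stored_label) pairs.
    # Returns (text, stored_label, predicted_label) for each disagreement.
    return [
        (text, label, pred)
        for text, label in dataset
        if (pred := classify_fn(text)) != label
    ]
```

Because the check is zero-shot, it requires no labeled training data of its own, but its quality is bounded by the model's grasp of the label definitions, so the output is best treated as a prioritized review queue.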