Named Entity Recognition Evaluation Metrics

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying entities in text into predefined categories such as people, organizations, locations, and dates. Evaluating the performance of NER models is essential for understanding their accuracy and effectiveness. This document outlines the key evaluation metrics used to assess NER models, with examples and explanations of how to handle discrepancies between model predictions and ground truth data.

Evaluation Metrics

1. Precision

Precision measures how many of the entities the model predicted for a given category are actually correct. It is defined as the number of true positives (correctly predicted entities) divided by the total number of entities predicted in that category.

Formula: Precision = True Positives / (True Positives + False Positives)
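As an illustration of this formula, here is a minimal Python sketch; the function name and example counts are hypothetical, not taken from any particular NER library:

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP); returns 0.0 if nothing was predicted."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0

# e.g. 8 correct predictions and 2 spurious ones -> precision of 0.8
print(precision(true_positives=8, false_positives=2))  # 0.8
```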

2. Recall

Recall measures the ability of the model to identify all relevant entities in the text. It is defined as the number of true positives divided by the total number of actual entities in that category.

Formula: Recall = True Positives / (True Positives + False Negatives)
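A matching sketch for recall, again with hypothetical counts:

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 if there are no gold entities."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0

# e.g. 8 of 12 gold entities found -> recall of about 0.67
print(recall(true_positives=8, false_negatives=4))  # 0.666...
```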

3. F1 Score

The F1 Score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.

Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
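Combining the two, a small sketch of the F1 computation (the zero-division guard is our own choice, not part of the formula):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.8 and recall 0.67 -> F1 of about 0.73
print(f1_score(0.8, 0.67))  # ~0.73
```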

4. Intersection over Union (IoU)

Intersection over Union (IoU) measures the overlap between the ground truth and predicted spans. It is particularly useful for evaluating partial matches between the predicted entities and the actual entities in the text.

Formula: IoU = Number of tokens in the intersection / Number of tokens in the union

Example: Consider a case where the ground truth span is "123 Main Street, Anytown, USA" and the model predicted span is "123 Main St, Anytown, USA". The two spans share most of their tokens, so the IoU falls between 0 and 1, quantifying how closely the model's prediction matches the actual entity.
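This example can be reproduced with a short token-level sketch. Whitespace tokenization and the use of token sets are simplifying assumptions made here; a real evaluation pipeline may tokenize differently, which changes the exact value:

```python
def token_iou(gold_span: str, predicted_span: str) -> float:
    """Token-level IoU between two entity spans (whitespace tokenization)."""
    gold = set(gold_span.split())
    pred = set(predicted_span.split())
    if not gold and not pred:
        return 0.0
    return len(gold & pred) / len(gold | pred)

# "Street," vs "St," differ, the other four tokens match: 4 / 6 ≈ 0.67
print(token_iou("123 Main Street, Anytown, USA", "123 Main St, Anytown, USA"))
```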

Evaluating Ground Truth vs. Model Predictions

In real-world scenarios, the model's predictions might not perfectly match the ground truth data. The following examples illustrate how to evaluate such cases:

Example 1 (IoU = 1):

  • Sentence: "My name is John Doe and my email is johndoe@example.com."

  • Ground Truth: [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")]

  • Model Prediction: [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")]

In this example, the model's prediction perfectly matches the ground truth for both entities. The IoU for the "EMAIL" entity is therefore: IoU = 1.00

Example 2 (IoU = 0):

  • Sentence: "My name is John Doe and my email is johndoe@example.com."

  • Ground Truth: [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")]

  • Model Prediction: [("John", "PERSON"), ("john@example.com", "SSN")]

In this example, the predicted span "john@example.com" shares no tokens with the ground-truth email, and it is assigned the wrong type ("SSN" instead of "EMAIL"). The IoU for the "EMAIL" entity is therefore: IoU = 0.00
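Both examples can be scored with a small helper that, for each ground-truth entity, returns the best token IoU against predictions of the same type, assigning 0.0 when no prediction carries that type. The helper name and the type-matching policy are our own assumptions; the token_iou function from the IoU sketch above is repeated so the snippet is self-contained:

```python
def token_iou(gold_span: str, predicted_span: str) -> float:
    gold, pred = set(gold_span.split()), set(predicted_span.split())
    return len(gold & pred) / len(gold | pred) if gold or pred else 0.0

def iou_per_gold_entity(ground_truth, predictions):
    """Best token IoU for each gold (text, label) pair against same-label predictions."""
    scores = {}
    for gold_text, gold_label in ground_truth:
        candidates = [text for text, label in predictions if label == gold_label]
        scores[(gold_text, gold_label)] = max(
            (token_iou(gold_text, c) for c in candidates), default=0.0
        )
    return scores

gt = [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")]

# Example 1: predictions identical to the ground truth -> IoU = 1.0 for both entities.
print(iou_per_gold_entity(gt, gt))

# Example 2: wrong types and non-matching text -> IoU = 0.0 for both entities.
print(iou_per_gold_entity(gt, [("John", "PERSON"), ("john@example.com", "SSN")]))
```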

Summary Table

| Sentence | Ground Truth Entities | Model Predicted Entities | Precision | Recall | F1 Score | IoU |
|---|---|---|---|---|---|---|
| "My name is John Doe and my email is johndoe@example.com." | [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")] | [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")] | 1.00 | 1.00 | 1.00 | 1.00 |
| "My name is John Doe and my email is johndoe@example.com." | [("John Doe", "NAME"), ("johndoe@example.com", "EMAIL")] | [("John Doe", "NAME"), ("john@example.com", "EMAIL")] | 0.50 | 0.50 | 0.50 | 0.00 |
| "I live at 123 Main Street, Anytown, USA. My phone number is 555-123-4567." | [("123 Main Street, Anytown, USA", "ADDRESS"), ("555-123-4567", "PHONE")] | [("123 Main St, Anytown, USA", "ADDRESS"), ("555-987-6543", "PHONE")] | 0.50 | 0.50 | 0.50 | 0.50 |
| "I was born on January 1, 1980." | [("January 1, 1980", "DATE_OF_BIRTH")] | [("Jan 1, 1980", "DATE_OF_BIRTH")] | 1.00 | 1.00 | 1.00 | 1.00 |
| "My driver's license number is D1234567 and my passport number is 123456789." | [("D1234567", "DRIVER_LICENSE"), ("123456789", "PASSPORT")] | [("D1234567", "DRIVER_LICENSE"), ("987654321", "PASSPORT")] | 0.50 | 0.50 | 0.50 | 0.50 |
| "My social security number is 123-45-6789 and my credit card number is 1234 5678 9012 3456." | [("123-45-6789", "SSN"), ("1234 5678 9012 3456", "CREDIT_CARD")] | [("123-45-6789", "SSN"), ("1234 5678 9012 3456", "CREDIT_CARD")] | 1.00 | 1.00 | 1.00 | 1.00 |
| "My IP address is 192.0.2.0." | [("192.0.2.0", "IP_ADDRESS")] | [("192.168.1.1", "IP_ADDRESS")] | 0.00 | 0.00 | 0.00 | 0.00 |
| "My social security number is 123-45-6789." | [("123-45-6789", "SSN")] | [("123-45-6789", "SSN"), ("myemail@example.com", "EMAIL")] | 0.50 | 1.00 | 0.67 | 0.00 |

Conclusion

Evaluating NER models involves more than just measuring accuracy. Precision, recall, F1 score, IoU, and other metrics provide a comprehensive view of a model's performance. By considering partial matches and discrepancies, you can gain deeper insights into how your NER model is performing and where it may need improvement.