
Programmatic Labeling Functions

The idea behind programmatic labeling functions is simple: for each row of data, we check whether the row satisfies a heuristic and, if it does, assign it a label. Programmatic labeling functions can be simple heuristics such as:

  • keyword matches: If keyword "meow" AND NOT keyword "woof" then Category Cat.
  • multiple keyword (phrase) matches: If phrase "the cat meows" then Category Cat.
  • named entity recognition: If ENTITY PERSON then Category Human.
  • regular expressions: If $WEIGHT > 150 LBS then Category Overweight.
  • part-of-speech tagging: If keyword "run" is a VERB then Category Exercise.
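
The keyword, phrase, and regular-expression heuristics above can be sketched in plain Python. This is only a minimal illustration, not the product's implementation: the function names, the ABSTAIN sentinel, and the weight pattern are assumptions made for the example.

```python
import re

ABSTAIN = None  # hypothetical sentinel: the heuristic does not apply to this row

def lf_keyword_cat(text):
    """Keyword match: "meow" present AND "woof" absent -> Cat."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return "Cat" if "meow" in words and "woof" not in words else ABSTAIN

def lf_phrase_cat(text):
    """Multiple keyword (phrase) match: "the cat meows" -> Cat."""
    return "Cat" if "the cat meows" in text.lower() else ABSTAIN

# Assumed pattern for weights written like "172 lbs" or "172.5 LBS".
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*lbs\b", re.IGNORECASE)

def lf_regex_overweight(text):
    """Regex match: a stated weight above 150 lbs -> Overweight."""
    match = WEIGHT_RE.search(text)
    if match and float(match.group(1)) > 150:
        return "Overweight"
    return ABSTAIN
```

Each function returns a category when its heuristic fires and ABSTAIN otherwise, so a row that matches no heuristic simply stays unlabeled.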

These programmatic labeling functions can also encode more complex rules or ontologies that a subject matter expert recommends, such as ones based on coreference resolution or 2D embeddings. We don’t believe that programmatic labeling functions are a substitute for manual data labeling, but we think they provide a solid foundation of labeled data for fine-tuning a pre-trained transformer model such as BERT.

Example Use Case - Safe and Violent Tweets


For an example of how to create labeling functions, let's take a non-profit organization that analyzes Twitter data for public safety in NYC. Its goal is to build a model from annotated data that identifies violent and safe tweets in the area.

Categories

The programmatic labeling functions in our system assign data rows to two categories: safe and violent. The existing labeling functions that determine the category for each row are as follows:

  • IF ENTITY(MONEY) THEN violent
  • IF ENTITY(LOC) THEN safe
  • IF peace THEN safe
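
As a rough sketch of how rules like these might be evaluated against a row, the example below represents each rule as a (condition, category) pair. It assumes that each row already carries a set of entity labels produced by some named entity recognizer; the row layout, helper names, and ABSTAIN sentinel are hypothetical and do not describe the system's internals.

```python
import re

ABSTAIN = None  # hypothetical sentinel for "rule does not fire"

def has_entity(label):
    """Build a condition that checks a row's pre-extracted entity labels."""
    return lambda row: label in row["entities"]

def has_keyword(word):
    """Build a condition for a whole-word, case-insensitive keyword match."""
    return lambda row: word in re.findall(r"\w+", row["text"].lower())

# (condition, category) pairs mirroring the three rules listed above.
LABELING_FUNCTIONS = {
    "IF ENTITY(MONEY) THEN violent": (has_entity("MONEY"), "violent"),
    "IF ENTITY(LOC) THEN safe": (has_entity("LOC"), "safe"),
    "IF peace THEN safe": (has_keyword("peace"), "safe"),
}

def apply_labeling_functions(row, functions=LABELING_FUNCTIONS):
    """Return {rule name: category or ABSTAIN} for one row."""
    return {
        name: (category if condition(row) else ABSTAIN)
        for name, (condition, category) in functions.items()
    }

# Made-up row; the entity set stands in for real NER output.
row = {"text": "Peace rally at the Hudson River today", "entities": {"LOC"}}
votes = apply_labeling_functions(row)
```

The sketch keeps one vote per rule and does not combine them, because how conflicting votes are resolved is not specified here.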


These labeling functions analyze the text data and apply specific conditions to assign the appropriate category. To remove an incorrect labeling function, click the trash icon next to it.

Tagged Data and Coverage

The tagged data shows the label assigned to each row, along with the coverage percentage, which indicates the proportion of rows to which the labeling function assigned a label.
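
Coverage as described above can be computed as the share of rows for which a rule fired. The sketch below assumes each row's results are stored as a dictionary from rule name to an assigned category, with None meaning the rule did not fire; that layout and the sample votes are invented for illustration.

```python
def coverage(votes_per_row, rule_name):
    """Percentage of rows for which the named rule assigned a label."""
    if not votes_per_row:
        return 0.0
    fired = sum(1 for votes in votes_per_row if votes.get(rule_name) is not None)
    return 100.0 * fired / len(votes_per_row)

# Hypothetical votes for four rows; None means the rule did not fire.
votes_per_row = [
    {"IF peace THEN safe": "safe", "IF ENTITY(MONEY) THEN violent": None},
    {"IF peace THEN safe": None, "IF ENTITY(MONEY) THEN violent": "violent"},
    {"IF peace THEN safe": None, "IF ENTITY(MONEY) THEN violent": None},
    {"IF peace THEN safe": "safe", "IF ENTITY(MONEY) THEN violent": None},
]

print(coverage(votes_per_row, "IF peace THEN safe"))             # 50.0
print(coverage(votes_per_row, "IF ENTITY(MONEY) THEN violent"))  # 25.0
```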


Adding a New Labeling Function

To enhance the labeling process, we add a new labeling function: IF ENTITY(GPE) THEN safe. This function examines the entities in the text and assigns the "safe" category to rows that mention a geopolitical entity (GPE), such as a country, city, or state.


We will also add two more labeling functions:

  • IF danger THEN violent
  • IF bank THEN violent

Updated Tagged Data as CSV

When you are finished, you can download the updated tagged data, which reflects the newly added labeling functions, as a CSV file.
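
For readers who want to produce a similar file themselves, the following sketch serializes tagged rows with Python's standard csv module. The column names (text, label, labeling_function) and the sample rows are assumptions; the actual layout of the downloaded file may differ.

```python
import csv
import io

def tagged_rows_to_csv(rows, fieldnames=("text", "label", "labeling_function")):
    """Serialize tagged rows to CSV text; the column names are assumed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up rows: one tagged by a keyword rule, one left unlabeled.
rows = [
    {"text": "Danger near the station", "label": "violent",
     "labeling_function": "IF danger THEN violent"},
    {"text": "Quiet evening, all calm", "label": "", "labeling_function": ""},
]
csv_text = tagged_rows_to_csv(rows)
```

Using csv.DictWriter rather than manual string joining means that text containing commas or quotes is escaped correctly.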
