When we discuss how a data sample “should” be interpreted, there are three possible meanings of the word “should”.
When we’re starting a project and developing our taxonomy/ontology, we are asking what a typical user might expect us to do with a given sample, especially in the context of the use case that is motivating our project.
In this context, “should” answers the question: “What outcome is best for the user in this specific case?”
After we’ve collected many of those observations and turned their patterns into a taxonomy, we write our annotation guidelines. When doing that, we accept that in the interest of simplicity and consistency, we may choose to specify rules that get a few infrequent cases “wrong” vis-à-vis our ideal answer.
In this context, “should” answers the question: “What set of outcomes gives the best tradeoffs over the corpus we expect?”
As we move ahead with collecting annotations at scale, we are likely to discover that we missed a class of examples when writing the guidelines, or that what we thought was an edge case turns out to occur more frequently than we expected. We then have to weigh the cost vs. benefit of adjusting the guidelines. The later it gets, the higher the cost, especially when a change requires us to review much or all of our existing data labels.
If we decide to stick with our existing guidelines, then the “correct” label in the context of our current labeling effort is defined not by “what is best for the user in an ideal world”, and not by “what gives us the best tradeoff”. The correct label is the one specified by the existing taxonomy. Period.
In this context, “should” answers the question: “What annotation is the result of following our agreed-upon guidelines?”
It’s really important that everyone working on a project is clear about which stage we’re in. When we’re trying to understand our model’s behavior and we look at an example, we sometimes ask whether the ground truth is correct: “It should be class A, not class B.” But when this analysis is happening after the annotation is complete (or has made significant progress), we need to be clear which “should” we mean. It may well appear to us, looking at this one example, that it “should” have been assigned label A, but we’re probably thinking of the first “should”, or sometimes the second.
Be careful!
If the example has in fact been mislabeled according to the guidelines, then correcting it is appropriate.
But if the example’s labels are a correct application of the guidelines, we have to ask which “should” is being challenged. Did we misjudge the tradeoffs? If so, is it cost-effective to alter the guidelines and re-annotate the dataset, possibly introducing cases where we disagree with the outcome in the other direction? If we go ahead, we are changing our answer to the “should” question from the second stage. That’s a legitimate thing to do, but it requires careful preparation and justification.
When the case is rare, it is a mistake to overreact to the thing in front of us and lose our larger perspective. And if we simply start declaring certain data points “exceptions to the rules” and annotating them by our first definition of “should”, without working out the impact on the guidelines and reviewing existing annotations, then we end up with inconsistent data.
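To make the decision path above concrete, here is a minimal sketch in Python. It is only an illustration: the function name, the `Action` values, and the two boolean inputs (`case_is_frequent`, `revision_is_cost_effective`) are hypothetical stand-ins for judgments that a real team would make by discussion, not by code.

```python
from enum import Enum


class Action(Enum):
    FIX_LABEL = "correct the label; it violates the guidelines"
    REVISE_GUIDELINES = "revise the guidelines and review existing annotations"
    KEEP_LABEL = "keep the label; it follows the guidelines"


def triage_disputed_label(
    current_label: str,
    guideline_label: str,
    case_is_frequent: bool,
    revision_is_cost_effective: bool,
) -> Action:
    """Decide what to do with an example whose ground truth is questioned.

    All inputs are hypothetical: guideline_label is the label we get by
    strictly following the agreed-upon guidelines, and the two booleans
    summarize the team's own judgment about frequency and cost.
    """
    # Third "should": the label must match what the guidelines produce.
    if current_label != guideline_label:
        return Action.FIX_LABEL

    # The label is a correct application of the guidelines, so the
    # challenge is really to the guidelines (the second "should").
    if case_is_frequent and revision_is_cost_effective:
        return Action.REVISE_GUIDELINES

    # Rare or too costly to change: no one-off "exceptions to the rules".
    return Action.KEEP_LABEL
```

Note that the sketch has no branch that relabels a single example according to the first “should”. That omission is deliberate: an individual label changes only when it violates the guidelines, and everything else goes through a guideline revision or stays as it is.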
There is one situation where there is only one “should”: the approach where we’re trying to model answers from real-world people without defining a taxonomy in the first place. If we want to identify photos that people think are beautiful, then we don’t need guidelines; we simply show a bunch of pictures to the crowdsourced annotators and ask their opinions. If we have a panel of SMEs (subject matter experts) and we’re asking them whether each X‑ray shows a tumor, our “guidelines” are the years of medical school that each radiologist went through and their subsequent real-life experience. In those cases, there is only the first context, and the second and third “shoulds” don’t exist.
But for most of our work, we have three meanings of “should”, and we need to keep in mind which one applies at any point in the project.