When we hire annotators, whether paid by the data point or salaried, we expect them to produce answers: labels, bounding boxes, and so on. After all, that’s why we hired them.
In this essay, I will argue that we must always offer them the opportunity to respond with “I can’t answer.”
Why?
There are different reasons that an annotator may be unable to provide an answer:
- The sample may be corrupted (e.g., an image that can’t load)
- The sample may be impractical to work with (e.g., an image that is too out-of-focus to see clearly, or too large to fit onscreen in the annotation tool)
- The sample may be from a completely different domain (e.g., a screen-capture of an email when the task is to find bounding boxes in natural images)
- The sample may be one for which the current task does not apply (e.g., in a multistep annotation process, the annotators for the previous step may have incorrectly agreed that this is a dog when it is actually a cat, and the current question asks whether the supposed dog is a Dalmatian or a poodle)
- The sample may satisfy multiple contradictory branches in the annotation instructions (e.g., a photo that has both a cat and a dog, in a single-class task)
- The sample may satisfy none of the branches in the annotation instructions (e.g., a photo that has a hippo in a task to differentiate cats from dogs)
- The annotation instructions may have left something underspecified (e.g., do you want to treat hyenas as cats or as dogs?)
- There may be some other reason that the tooling does not allow the annotator to annotate correctly (e.g., the tool may enforce a minimum bounding-box size but this image has an object that is smaller than that)
- The annotation instructions may have been inadequately translated into the annotator’s preferred language
- The annotator may not have learned the annotation rules well enough, and may mistakenly believe that one of the above categories applies here
- And there are certainly other reasons not listed here.
In all of these cases, if we force the annotator to choose the “best” answer, we are likely to end up with noise in our dataset.
If we allow the annotators to choose “I can’t answer” — and require them to choose from a list like the one above, plus give them a free-text field to provide additional details — then we gain several benefits.
First, our data are not corrupted with “least bad” choices.
Second, the annotators have a way to resolve the cognitive dissonance from the fact that the job we’ve asked them to do is not possible within the given constraints.
Third, we have some signal that can help us improve some or all of: the unlabeled dataset, our understanding of the project domain, the tooling for our annotations, the taxonomy/ontology/guidelines, and the way we train our annotators.
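As an illustration of the response format described above, here is a minimal Python sketch of a record that holds either a label or a “can’t answer” with a reason and free-text details. The names `CantAnswerReason` and `Response`, and the specific reason codes, are hypothetical; they mirror most of the list in this essay and are not taken from any particular annotation tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CantAnswerReason(Enum):
    # Hypothetical reason codes that mirror the list in this essay.
    CORRUPTED_SAMPLE = "corrupted_sample"
    IMPRACTICAL_SAMPLE = "impractical_sample"
    WRONG_DOMAIN = "wrong_domain"
    TASK_DOES_NOT_APPLY = "task_does_not_apply"
    CONTRADICTORY_INSTRUCTIONS = "contradictory_instructions"
    NO_MATCHING_BRANCH = "no_matching_branch"
    UNDERSPECIFIED_INSTRUCTIONS = "underspecified_instructions"
    TOOLING_LIMITATION = "tooling_limitation"
    TRANSLATION_PROBLEM = "translation_problem"
    OTHER = "other"


@dataclass(frozen=True)
class Response:
    item_id: str
    annotator_id: str
    label: Optional[str] = None                  # set when the annotator answers
    reason: Optional[CantAnswerReason] = None    # set when they cannot
    details: str = ""                            # free-text explanation

    def __post_init__(self):
        # Exactly one of label / reason must be present.
        if (self.label is None) == (self.reason is None):
            raise ValueError("provide either a label or a can't-answer reason")
        # An assumption of this sketch: OTHER is only useful with an explanation.
        if self.reason is CantAnswerReason.OTHER and not self.details.strip():
            raise ValueError("reason OTHER requires free-text details")

    @property
    def cant_answer(self) -> bool:
        return self.reason is not None
```

Making the reason a required, enumerated field keeps the responses easy to count and compare, while the free-text field captures the cases the list did not anticipate.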
What about Inter-Annotator Agreement and data metrics?
One objection that I have heard to this approach is that this gives lazy annotators a way to chalk up a high number of annotations that can’t be incorrect — by definition, the argument goes, if an annotator claims to be unsure how to apply the instructions, they can’t be wrong.
But many of the reasons for a “can’t answer” response are objective. If the annotators agree that a particular sample is drawn from the wrong population, or that the instructions are contradictory, then that is the correct answer and someone who actually provides a label is the one who has made a mistake.
Even in cases where the reason is somewhat subjective, we may find that the annotators share uncertainty because they’ve received the same not-clear-enough instructions. Such items may need to go back into the pool for more annotations once the instructions have been improved.
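One way to find the items on which annotators share a “can’t answer” response is to group responses by item and look at the share of “can’t answer” votes. The sketch below assumes responses arrive as `(item_id, annotator_id, label)` tuples with `None` standing in for “can’t answer”; the function name `triage_items` and the 0.5 threshold are illustrative choices, not recommendations.

```python
from collections import defaultdict


def triage_items(responses, min_share=0.5):
    """Return ids of items where at least `min_share` of responses are "can't answer".

    `responses` is an iterable of (item_id, annotator_id, label) tuples,
    where label is None for a "can't answer" response (an assumption of this sketch).
    """
    by_item = defaultdict(list)
    for item_id, _annotator_id, label in responses:
        by_item[item_id].append(label is None)
    return sorted(
        item_id
        for item_id, flags in by_item.items()
        if sum(flags) / len(flags) >= min_share
    )
```

Items returned by such a check are candidates for adjudication, or for re-annotation once the instructions have been revised.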
If a particular annotator responds “can’t answer” more often than their peers to a statistically significant degree, this would certainly be a reason to review their work, and to understand why they are confused. The result should not be to reject this person from our team, assuming they are working in good faith. Helping them understand the task should result in better training materials for all future annotators.
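As a rough sketch of what “statistically significant” could mean here, a two-proportion z-test can compare one annotator’s “can’t answer” rate with the pooled rate of their peers. This is one possible test among several, and it assumes that responses are independent and that all annotators draw items from the same pool; the function name and arguments are hypothetical.

```python
import math


def cant_answer_z_test(k_annotator, n_annotator, k_peers, n_peers):
    """Two-sided two-proportion z-test on "can't answer" rates.

    k_* are counts of "can't answer" responses, n_* are total responses.
    Returns (z, p_value); z > 0 means the annotator's rate is above the peers'.
    """
    p_annotator = k_annotator / n_annotator
    p_peers = k_peers / n_peers
    pooled = (k_annotator + k_peers) / (n_annotator + n_peers)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_annotator + 1 / n_peers))
    if se == 0:
        # Nobody (or everybody) said "can't answer": no difference to test.
        return 0.0, 1.0
    z = (p_annotator - p_peers) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Example: 30 of 200 for one annotator versus 200 of 4000 for the rest of the team.
z, p = cant_answer_z_test(30, 200, 200, 4000)
```

When the same test is run for every annotator on a team, some flags will appear by chance, so a flag is a prompt for a conversation rather than a verdict.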
In fact, the real problem may be in the other direction. We should check for annotators who avoid responding “can’t answer” by comparing their rates to the team average and by looking at their annotations on items that other annotators marked as “can’t answer.” False overconfidence, or a desire to get past the slow, cognitively demanding samples and back to the easier high-velocity ones, can lead annotators to pick the “least bad” choice even when “can’t answer” is available. After all, “can’t answer” requires extra time and effort to specify why.
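The second check described above, looking at what an annotator did on items their colleagues marked as “can’t answer,” could be sketched as follows. It again assumes `(item_id, annotator_id, label)` tuples with `None` for “can’t answer” and at most one response per annotator per item; `forced_label_rates` and its threshold are hypothetical.

```python
from collections import defaultdict


def forced_label_rates(responses, min_share=0.5):
    """For each annotator, the fraction of contested items they labeled anyway.

    An item counts as contested for an annotator when at least `min_share`
    of the *other* annotators on that item responded "can't answer" (None).
    Annotators who never saw a contested item are omitted from the result.
    """
    by_item = defaultdict(dict)
    for item_id, annotator_id, label in responses:
        by_item[item_id][annotator_id] = label

    labeled = defaultdict(int)
    seen = defaultdict(int)
    for answers in by_item.values():
        for annotator_id, label in answers.items():
            others = [l for a, l in answers.items() if a != annotator_id]
            if not others:
                continue
            share = sum(l is None for l in others) / len(others)
            if share >= min_share:
                seen[annotator_id] += 1
                labeled[annotator_id] += label is not None
    return {a: labeled[a] / seen[a] for a in seen}
```

A rate near 1.0 suggests an annotator who rarely uses the option even where colleagues found the item unanswerable, which is worth a review of those specific items.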
Conclusion
Offering the annotators a “can’t answer” option makes extra work for everyone. The annotation tool must be configured to support it. The training needs to explain how to use it. The annotators have to consider it and spend extra time specifying which subcase applies. The adjudicators have to respond to these cases in an appropriate way. The project leads have to consider whether the uncertainty from the annotators calls into question assumptions about the data domain that underlie the model itself.
On the other hand, addressing this need reduces noise in our dataset. Accounting for “can’t answer” improves the annotations on many other samples, not just on the immediate items to which it applies. It can make us aware of larger-scale issues in the use case that we intend our model to meet.
We’re going through the time and expense to annotate a dataset: we should get full value for our investment.
Do you have an “answer” to these thoughts? Please share it below!