“I Can’t Answer”

When we hire annotators, whether we pay them by the data point or by salary, we expect them to produce answers: labels, bounding boxes, and so on. After all, that’s why we hired them.

In this essay, I will argue that we must always offer them the opportunity to respond with “I can’t answer.”

Why?

There are different reasons that an annotator may be unable to provide an answer:

  • The sample may be corrupted (e.g., an image that can’t load).
  • The sample may be impractical to work with (e.g., an image that is too out-of-focus to see clearly, or too large to fit onscreen in the annotation tool).
  • The sample may be from a completely different domain (e.g., a screen-capture of an email when the task is to find bounding boxes in natural images).
  • The sample may be one for which the current task does not apply (e.g., in a multistep annotation process, the annotators for the previous step may have incorrectly agreed that this is a dog when it is actually a cat, and the current question asks whether the dog is a Dalmatian or a poodle).
  • The sample may satisfy multiple contradictory branches in the annotation instructions (e.g., a photo that has both a cat and a dog, in a single-class task).
  • The sample may satisfy none of the branches in the annotation instructions (e.g., a photo that has a hippo in a task to differentiate cats from dogs).
  • The annotation instructions may have left something underspecified (e.g., do you want to treat hyenas as cats or as dogs?).
  • There may be some other reason that the tooling does not allow the annotator to annotate correctly (e.g., the tool may enforce a minimum bounding-box size, but this image has an object that is smaller than that).
  • The annotation instructions may have been inadequately translated into the annotator’s preferred language.
  • The annotator may not have learned the annotation rules well enough, and may mistakenly think that one of the above categories applies here.
  • And there are certainly other reasons not listed here.

In all of these cases, if we force the annotator to choose the “best” answer, we are likely to end up with noise in our dataset.

If we allow the annotators to choose “I can’t answer” (and require them to pick a reason from a list like the one above, with a free-text field for additional details), then we gain several benefits.
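To make this concrete, here is a minimal sketch of what such a response record might look like. The reason codes, class names, and field names are all my own hypothetical choices, loosely mirroring the list above; a real annotation tool will have its own schema. I have left out the "annotator misunderstood the rules" case, since an annotator would not pick that for themselves.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CantAnswerReason(Enum):
    """Hypothetical reason codes, loosely mirroring the list of reasons above."""
    CORRUPTED_SAMPLE = "corrupted_sample"
    IMPRACTICAL_SAMPLE = "impractical_sample"
    WRONG_DOMAIN = "wrong_domain"
    TASK_DOES_NOT_APPLY = "task_does_not_apply"
    CONTRADICTORY_INSTRUCTIONS = "contradictory_instructions"
    NO_MATCHING_BRANCH = "no_matching_branch"
    UNDERSPECIFIED_INSTRUCTIONS = "underspecified_instructions"
    TOOLING_LIMITATION = "tooling_limitation"
    TRANSLATION_PROBLEM = "translation_problem"
    OTHER = "other"


@dataclass(frozen=True)
class Annotation:
    """One annotator's response to one item (illustrative schema only)."""
    item_id: str
    annotator_id: str
    label: Optional[str] = None
    cant_answer_reason: Optional[CantAnswerReason] = None
    details: str = ""  # free-text field for additional details

    def __post_init__(self) -> None:
        has_label = self.label is not None
        has_reason = self.cant_answer_reason is not None
        # Exactly one of the two must be present: a label, or a reason.
        if has_label == has_reason:
            raise ValueError("provide exactly one of label or cant_answer_reason")
        # "Other" is only useful if the annotator tells us what happened.
        if self.cant_answer_reason is CantAnswerReason.OTHER and not self.details.strip():
            raise ValueError("reason OTHER requires free-text details")

    @property
    def is_cant_answer(self) -> bool:
        return self.cant_answer_reason is not None
```

The point of the validation is that "can’t answer" never arrives without a reason, and a real label never arrives with one.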

First, our data are not corrupted with “least bad” choices.

Second, the annotators have a way to resolve the cognitive dissonance from the fact that the job we’ve asked them to do is not possible within the given constraints.

Third, we have some signal that can help us improve some or all of: the unlabeled dataset, our understanding of the project domain, the tooling for our annotations, the taxonomy/ontology/guidelines, and the way we train our annotators.

What about Inter-Annotator Agreement and data metrics?

One objection that I have heard to this approach is that it gives lazy annotators a way to chalk up a high number of annotations that can’t be incorrect. By definition, the argument goes, if an annotator claims to be unsure how to apply the instructions, they can’t be wrong.

But many of the reasons for a “can’t answer” response are objective. If the annotators agree that a particular sample is drawn from the wrong population, or that the instructions are contradictory, then that is the correct answer and someone who actually provides a label is the one who has made a mistake.

Even in cases where the reason is somewhat subjective, we may find that the annotators share uncertainty because they’ve received the same not-clear-enough instructions. Such items may need to go back into the pool for more annotations once the instructions have been improved.
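One way to picture this is an adjudication step that treats “can’t answer” as a first-class outcome rather than as a missing value. The sketch below is only an illustration: the `CANT_ANSWER` sentinel, the function name, and the two-thirds agreement threshold are assumptions of mine, not a recommendation.

```python
from collections import Counter

CANT_ANSWER = "cant_answer"  # hypothetical sentinel for a "can't answer" response


def adjudicate(responses, min_agreement=2 / 3):
    """Aggregate one item's responses.

    Returns ("label", value) when enough annotators agree on a label,
    ("cant_answer", None) when enough agree that the item can't be answered,
    and ("needs_review", None) otherwise.
    """
    if not responses:
        raise ValueError("no responses for this item")
    if not 0.5 < min_agreement <= 1:
        # Above one half, at most one response can reach the threshold.
        raise ValueError("min_agreement must be in (0.5, 1]")
    top, count = Counter(responses).most_common(1)[0]
    if count / len(responses) < min_agreement:
        return ("needs_review", None)
    if top == CANT_ANSWER:
        # Agreement on "can't answer" is itself the correct answer.
        return ("cant_answer", None)
    return ("label", top)
```

Under these assumptions, `adjudicate(["cat", "cat", "dog"])` yields a label, while an item where two of three annotators chose “can’t answer” is recorded as such. Items that come back as "needs_review" are the ones that may need to return to the pool after the instructions are improved.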

If a particular annotator responds “can’t answer” more often than their peers to a statistically significant degree, this would certainly be a reason to review their work, and to understand why they are confused. The result should not be to reject this person from our team, assuming they are working in good faith. Helping them understand the task should result in better training materials for all future annotators.
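As a rough illustration of what “statistically significant” could mean here, the sketch below runs a two-sided two-proportion z-test comparing one annotator’s “can’t answer” rate against the pooled rate of their peers. The function name and arguments are hypothetical, and the test itself is my choice of example: it relies on a normal approximation that needs reasonably large counts, and testing every annotator this way would call for a multiple-comparisons correction.

```python
import math


def cant_answer_rate_test(k_annotator, n_annotator, k_peers, n_peers):
    """Two-sided two-proportion z-test on "can't answer" rates.

    k_* are counts of "can't answer" responses, n_* are total responses.
    Returns (z, p_value); z > 0 means the annotator's rate is higher.
    """
    if n_annotator <= 0 or n_peers <= 0:
        raise ValueError("sample sizes must be positive")
    p_annotator = k_annotator / n_annotator
    p_peers = k_peers / n_peers
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (k_annotator + k_peers) / (n_annotator + n_peers)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_annotator + 1 / n_peers))
    if se == 0:
        # Everyone always (or never) chose "can't answer": no difference to test.
        return 0.0, 1.0
    z = (p_annotator - p_peers) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Example with made-up numbers: 30 of 200 for one annotator vs. 400 of 8000 for peers.
z, p = cant_answer_rate_test(30, 200, 400, 8000)
```

A small p-value is a prompt to go and talk to the annotator, not a verdict on their work.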

In fact, the real problem may be in the other direction. We should check for annotators who are avoiding “can’t answer” responses, by comparing their rates to the average for the team, and by looking at their annotations on items that other annotators marked as “can’t answer.” False overconfidence (or a desire to get past the challenging samples, which are cognitively demanding and time-consuming, and back to the easier high-velocity ones) can lead annotators to pick the “least bad” choice even when they have “can’t answer” as an option. After all, “can’t answer” requires them to take extra time and effort to specify why.
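The second check can be sketched in a few lines. For each annotator, we count how often they supplied a label on an item where most of their peers said “can’t answer.” Everything here (the tuple layout, the `CANT_ANSWER` sentinel, the function name, and the one-half threshold) is a hypothetical choice for illustration.

```python
from collections import defaultdict

CANT_ANSWER = "cant_answer"  # hypothetical sentinel for a "can't answer" response


def forced_label_counts(responses, min_peer_share=0.5):
    """Count possibly forced labels per annotator.

    responses: iterable of (item_id, annotator_id, response) tuples.
    Returns {annotator_id: (forced, eligible)}, where `eligible` counts items
    on which more than `min_peer_share` of the *other* annotators chose
    "can't answer", and `forced` counts how many of those this annotator
    labeled anyway. Annotators with no eligible items are omitted.
    """
    by_item = defaultdict(dict)
    for item_id, annotator_id, response in responses:
        by_item[item_id][annotator_id] = response

    stats = defaultdict(lambda: [0, 0])  # annotator_id -> [forced, eligible]
    for answers in by_item.values():
        for annotator_id, response in answers.items():
            peers = [r for a, r in answers.items() if a != annotator_id]
            if not peers:
                continue  # nobody to compare against on this item
            share = sum(r == CANT_ANSWER for r in peers) / len(peers)
            if share > min_peer_share:
                stats[annotator_id][1] += 1
                if response != CANT_ANSWER:
                    stats[annotator_id][0] += 1
    return {a: (forced, eligible) for a, (forced, eligible) in stats.items()}
```

An annotator whose `forced` count is close to their `eligible` count is worth a conversation, in the same good-faith spirit as above.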

Conclusion

Offering the annotators a “can’t answer” option makes extra work for everyone. The annotation tool must be configured to support it. The training needs to explain how to use it. The annotators have to consider it and spend extra time specifying which subcase applies. The adjudicators have to respond to these cases in an appropriate way. The project leads have to consider whether the uncertainty from the annotators calls into question assumptions about the data domain that underlie the model itself.

On the other hand, addressing this need reduces noise in our dataset. Accounting for “can’t answer” improves the annotations on many other samples, not just on the immediate items to which it applies. It can make us aware of larger-scale issues in the use case that we intend our model to meet.

We’re going through the time and expense to annotate a dataset: we should get full value for our investment.

Do you have an “answer” to these thoughts? Please share it below!