“Should”

When we discuss how a data sample “should” be interpreted, there are three possible meanings of the word “should”.

When we’re starting a project and developing our taxonomy/ontology, we are asking what a typical user might expect us to do with this particular sample, especially in the context of the use case that is motivating our project.

In this context, “should” answers the question: “What outcome is best for the user in this specific case?”

After we’ve collected many of those observations and turned their patterns into a taxonomy, we write our annotation guidelines. When doing that, we accept that, in the interest of simplicity and consistency, we may choose to specify rules that get a few infrequent cases “wrong” vis-à-vis our ideal answer.

In this context, “should” answers the question: “What set of outcomes gives the best tradeoffs over the corpus we expect?”

As we move ahead with collecting annotations at scale, we are likely to discover that we missed a class of examples when writing the guidelines, or that what we thought was an edge case turns out to occur more frequently than we expected. We then have to weigh the cost vs. benefit of adjusting the guidelines. The later it gets, the higher the cost, especially when a change requires us to review much or all of our existing data labels.
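As a rough illustration of why the cost rises over time, here is a minimal sketch of the review cost of a guideline change. The function name, its parameters, and the numbers in the usage note are all hypothetical; a real estimate would also need to account for the benefit side and for retraining.

```python
def relabel_cost(n_labeled: int, affected_fraction: float, cost_per_review: float) -> float:
    """Estimate the cost of reviewing existing labels after a guideline change.

    All three inputs are assumptions made for illustration:
    n_labeled        -- number of samples already annotated
    affected_fraction -- share of those samples the change could touch (0.0 to 1.0)
    cost_per_review  -- cost of re-checking one sample, in any consistent unit
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0.0 and 1.0")
    return n_labeled * affected_fraction * cost_per_review


# The same change costs more the later it is made, because more labels exist.
early = relabel_cost(n_labeled=1_000, affected_fraction=0.25, cost_per_review=2.0)
late = relabel_cost(n_labeled=50_000, affected_fraction=0.25, cost_per_review=2.0)
```

With these made-up inputs, the late change is fifty times as expensive as the early one, purely because fifty times as many labels have to be reviewed.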

If we decide to stick with our existing guidelines, then the “correct” label in the context of our current labeling effort is defined not by “what is best for the user in an ideal world”, and not by “what gives us the best tradeoff”. The correct label is the one specified by the existing taxonomy. Period.

In this context, “should” answers the question: “What annotation is the result of following our agreed-upon guidelines?”

It’s really important that everyone working on a project is clear about which stage we’re in. When we’re trying to understand our model’s behavior and we look at an example, we sometimes ask whether the ground truth is correct. “It should be class A, not class B”. But when this analysis is happening after the annotation is complete (or has made significant progress), we need to be clear which “should” we mean. Certainly it may appear to us, looking at this one example, that it “should” have been assigned class A, but we’re probably thinking of the first “should”, or sometimes the second.

Be careful!

If the example has in fact been mislabeled according to the guidelines, then correcting it is appropriate.

But if the example’s labels are a correct application of the guidelines, we have to ask which “should” is being challenged. Did we misjudge the tradeoffs? If so, is it cost-effective to alter the guidelines and re-annotate the dataset, possibly introducing cases where we disagree with the outcome in the other direction? If we go ahead, we are changing our answer to the “should” question from the second stage. That’s a legitimate thing to do, but it requires careful preparation and justification.

When the case is rare, it is a mistake to overreact to the thing in front of us and lose our larger perspective. And if we simply start declaring certain data points to be “exceptions to the rules” and annotating them by our first definition of “should”, without working out the impact on the guidelines and reviewing existing annotations, then we end up with inconsistent data.
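The triage described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `Dispute`, `Action`, and `triage`, and the `is_frequent` flag, are hypothetical and do not come from any existing tool, and deciding whether a guideline change is actually worth it still requires the cost/benefit judgment discussed earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FIX_LABEL = "label violates the guidelines: correct it"
    KEEP_AND_LOG = "label follows the guidelines: keep it and record the case"
    REVIEW_GUIDELINES = "label follows the guidelines: weigh a guideline change"


@dataclass
class Dispute:
    current_label: str    # the label currently in the ground truth
    guideline_label: str  # the label a careful reading of the guidelines yields
    is_frequent: bool     # hypothetical flag: does this kind of case occur often?


def triage(dispute: Dispute) -> Action:
    """Decide what to do when someone says an example 'should' be labeled differently."""
    # Third "should": the guidelines define correctness, so a mismatch is a plain error.
    if dispute.current_label != dispute.guideline_label:
        return Action.FIX_LABEL
    # The label is a correct application of the guidelines. A frequent pattern may
    # mean the tradeoffs were misjudged (second "should"), so it goes to review.
    if dispute.is_frequent:
        return Action.REVIEW_GUIDELINES
    # Rare case: never relabel it as a one-off exception, or the data becomes inconsistent.
    return Action.KEEP_AND_LOG


# Example: the reviewer prefers class A, but the guidelines really do say B.
decision = triage(Dispute(current_label="B", guideline_label="B", is_frequent=False))
```

Note that none of the branches relabels a single example against the guidelines; the only path to a different label for a correctly annotated sample runs through a reviewed guideline change.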

There is a situation where there is only one “should”. That is the approach where we’re trying to model answers from real-world people without defining a taxonomy in the first place. If we want to identify photos that people think are beautiful, then we don’t need guidelines; we simply show a bunch of pictures to the crowdsourced annotators and ask their opinions. If we have a panel of SMEs (subject matter experts) and we’re asking them whether each X-ray shows a tumor, our “guidelines” are the years of medical school that each radiologist went through and their subsequent real-life experience. In those cases, there is only the first context, and the second and third “shoulds” don’t exist.

But for most of our work, we have three meanings of “should”, and we need to keep in mind which one applies at any point in the project.