
Testing annotator performance

Test tasks are an easy way to measure annotator performance. To add a test task, you first need a ground truth task. Annotators are given no indication that they are working on a test task; to them, test tasks look just like regular tasks. You, on the other hand, can always tell which tasks are test tasks, because they are clearly marked on the dashboard.


Adding a ground truth task

Adding a ground truth task is just like adding any other task. Add a task in whatever way you prefer, then add annotations to it.

Adding a test task

Adding a test task is a little different. The test task must match the ground truth task on the following:

  • task-type
  • attachment URL
  • labels/categories list

Other parameters can be different.
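
As a rough illustration of the matching rule above, the following Python sketch compares a test task against its ground truth task before submission. The `Task` class and its field names (`task_type`, `attachment_url`, `labels`, `instructions`) are hypothetical and chosen only for this example; they are not the platform's actual API. The sketch also assumes that the order of labels does not matter, which the text above does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical local representation of a task; not the platform's API.
    task_type: str
    attachment_url: str
    labels: tuple[str, ...]
    instructions: str = ""  # example of a parameter that may differ


def mismatched_fields(ground_truth: Task, test_task: Task) -> list[str]:
    """Return the names of the required fields that do not match."""
    problems = []
    if ground_truth.task_type != test_task.task_type:
        problems.append("task-type")
    if ground_truth.attachment_url != test_task.attachment_url:
        problems.append("attachment URL")
    # Assumption: label order is irrelevant, so compare sorted copies.
    if sorted(ground_truth.labels) != sorted(test_task.labels):
        problems.append("labels/categories list")
    return problems


gt_task = Task("bounding-box", "https://example.com/cat.jpg", ("cat", "dog"))
ok_task = Task("bounding-box", "https://example.com/cat.jpg", ("dog", "cat"),
               instructions="Draw tight boxes")
bad_task = Task("polygon", "https://example.com/cat.jpg", ("cat",))

print(mismatched_fields(gt_task, ok_task))
print(mismatched_fields(gt_task, bad_task))
```

An empty result means the test task agrees with the ground truth task on all three required fields; differences in other parameters, such as the hypothetical `instructions` field, are ignored.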

Once you've made sure these match, enter your ground truth task ID into the search box, then use the "GET ANNOTATIONS" button on the add task form.

Note: you might have to select a task type for this button to appear on the add task form.

Measuring annotator performance

We track six statistics, all based on surface area in pixels. Each object is scored individually, and at the end you are shown the arithmetic mean of these statistics over all annotations.

  • Accuracy - calculated as overlap - (overflow + shortage)
  • Overflow - how much of the annotation's surface falls outside the GT annotation's surface
  • Shortage - how much of the GT annotation's surface is missing from the annotation
  • Bad annotations - the number of bad annotations. An annotation is classified as bad if its label or category differs from that of its GT annotation, or if its accuracy is below 50%
  • Good annotations - the total number of annotations that were not classified as bad
  • Confusion matrix score:
          | Predicted positives        | Predicted negatives
Positives | True Positives (overlap)   | False Negatives (shortage)
Negatives | False Positives (overflow) | True Negatives (background overlap)

Score = (TP + TN) / (TP + TN + FP + FN), which is equivalent to (annotation_overlap + background_overlap) / (h * w), where h and w are the image height and width in pixels.

Keep in mind that the confusion matrix score is usually a poor indicator when the annotation surface is small compared to the image surface, because its formula takes the background overlap into account.
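
As a rough illustration of the pixel-based statistics described in this section, the following Python sketch computes overlap, overflow, shortage, accuracy, and the confusion matrix score for a single annotation. It is a minimal sketch under stated assumptions, not the platform's implementation: annotations are modelled as boolean pixel masks, accuracy is left in raw pixels (how it is converted to a percentage is not specified here), and the names `box_mask`, `pixel_stats`, and `PixelStats` are hypothetical.

```python
from dataclasses import dataclass

Mask = list[list[bool]]  # assumption: one boolean per pixel, row by row


def box_mask(h: int, w: int, top: int, left: int, bottom: int, right: int) -> Mask:
    """Build an h x w mask that is True inside rows [top, bottom), cols [left, right)."""
    return [[top <= r < bottom and left <= c < right for c in range(w)]
            for r in range(h)]


@dataclass
class PixelStats:
    overlap: int             # true positives
    overflow: int            # false positives
    shortage: int            # false negatives
    background_overlap: int  # true negatives

    @property
    def accuracy(self) -> int:
        # Formula from the list above, in raw pixels.
        return self.overlap - (self.overflow + self.shortage)

    @property
    def confusion_matrix_score(self) -> float:
        total = (self.overlap + self.overflow
                 + self.shortage + self.background_overlap)  # equals h * w
        return (self.overlap + self.background_overlap) / total


def pixel_stats(gt: Mask, annotation: Mask) -> PixelStats:
    """Count the four confusion matrix cells pixel by pixel."""
    tp = fp = fn = tn = 0
    for gt_row, ann_row in zip(gt, annotation):
        for g, a in zip(gt_row, ann_row):
            if g and a:
                tp += 1
            elif a:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return PixelStats(tp, fp, fn, tn)


# A 10x10 GT box on a 100x100 image, and an annotation shifted by 5 pixels.
gt = box_mask(100, 100, 10, 10, 20, 20)
ann = box_mask(100, 100, 15, 15, 25, 25)
stats = pixel_stats(gt, ann)
print(stats, stats.accuracy, stats.confusion_matrix_score)
```

In this example only 25 of the 100 GT pixels are covered and the accuracy is negative, yet the confusion matrix score is 0.985, because 9825 background pixels count as correct. This is the small-annotation effect described above.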

These statistics are calculated and shown shortly after an annotator finishes a task. Expand the test task on the dashboard to see the statistics at the bottom.