Annotation Best Practices

This document lists the ways you can train the best-performing models on our platform.

You have two levers for getting the best performance on our platform:

  • Data (Images/PDFs)

  • Annotations

The methods below are listed in order of importance.

Quantity of Data

  • More data: More is always better. The most fundamental and effective way to improve the accuracy of your model is to add more training data.

  • Dataset Diversity & Consistency: Train the model on the kind of images you expect it to work on. If you have a large volume of data that you expect the model to handle, your training data should be representative of it.

Quality of Data

  1. Readable: The document should be readable by our OCR. When you draw a bounding box around a piece of text, the auto-populated text should ideally match it. A few incorrect characters are acceptable.

  2. Blurry images: If an image is blurry and the text cannot be read, manually populating the text is unlikely to help. It is best to avoid blurry images; a quick way to screen for them is sketched below.
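As an illustration of how you might screen out blurry pages before uploading them, the sketch below scores each image with the variance of the Laplacian, a common sharpness proxy. This is not part of the platform: the folder name, file pattern, and threshold are assumptions for this example and should be tuned on a handful of your own documents.

```python
# Sketch: flag likely-blurry images before uploading them for annotation.
# Requires OpenCV (pip install opencv-python). The threshold is illustrative.
import glob

import cv2

BLUR_THRESHOLD = 100.0  # assumed cutoff; lower variance => blurrier image


def sharpness_score(path: str) -> float:
    """Variance of the Laplacian: a common proxy for image sharpness."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise ValueError(f"Could not read image: {path}")
    return cv2.Laplacian(image, cv2.CV_64F).var()


for path in glob.glob("scans/*.jpg"):  # assumed folder of scanned documents
    score = sharpness_score(path)
    status = "OK" if score >= BLUR_THRESHOLD else "LIKELY BLURRY - consider rescanning"
    print(f"{path}: sharpness={score:.1f} ({status})")
```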

Annotations

  1. Consistency: It is important to follow the same convention when annotating data. For example, if you have annotated the date and time on a receipt under the label date, follow the same practice on all receipts.

  2. Completeness: Annotate all of your images. If the number of images is very large, make sure that every image you do annotate is fully annotated. Do not partially annotate an image, for example by annotating only 5 out of 10 labels. A sketch of an automated check for this appears after this list.

  3. Only annotate the text you want: If you want to extract, say, the invoice number, annotate only the invoice number itself. The model learns to use the surrounding text as context for finding it. An example is given below.

  4. Multiline fields: The model can also learn multi-line fields such as addresses. As shown in the image below, you can annotate the entire address field.
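If you export your annotations, a small script can help you verify completeness and label consistency at scale. The JSON layout, folder name, and label names below are hypothetical, purely to illustrate the idea; adapt the loading code to whatever export format you actually use.

```python
# Sketch: check that every annotated image carries the full, consistent label set.
# The folder, file layout, and JSON schema are hypothetical assumptions.
import json
from pathlib import Path

EXPECTED_LABELS = {"invoice_number", "date", "total_amount", "address"}  # assumed label set


def labels_in_file(path: Path) -> set[str]:
    """Collect the label names used in one annotation file (assumed schema)."""
    data = json.loads(path.read_text())
    return {box["label"] for box in data.get("annotations", [])}


for path in sorted(Path("annotations").glob("*.json")):  # assumed export folder
    labels = labels_in_file(path)
    missing = EXPECTED_LABELS - labels
    unexpected = labels - EXPECTED_LABELS
    if missing:
        print(f"{path.name}: missing labels {sorted(missing)} (partially annotated?)")
    if unexpected:
        print(f"{path.name}: unexpected labels {sorted(unexpected)} (naming inconsistency?)")
```

A report like this makes it easy to spot partially annotated images and label-naming drift before you start training.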
