This document lists common data capture issues and how to mitigate or resolve them.
Here's a process to identify accuracy issues in your Nanonets OCR model:
- 1. Data Evaluation:
- Collect a representative sample of documents that your OCR model will encounter in real-world scenarios.
- Manually review the extracted data from these documents for accuracy.
- Identify the common types of errors or inaccuracies present in the OCR output.
- Correct the errors found, and mark the corrected file as Approved.
- 2. Error Analysis:
- Categorize the identified errors based on their nature, such as character recognition, formatting, or layout errors.
- Determine the frequency and impact of each error category on the overall accuracy of your OCR model.
- Identify patterns or specific document characteristics that contribute to the occurrence of certain errors.
- 3. Performance Metrics:
- Establish quantitative performance metrics, such as character-level accuracy, word-level accuracy, or precision and recall, to measure the performance of your OCR model. These metrics are available as an add-on on Enterprise or Pro plans.
- Evaluate the OCR model's performance using the annotated dataset from step 1.
- Calculate and analyze the performance metrics to identify areas of improvement.
- 4. Error Prioritization:
- Prioritize the identified error categories based on their impact and frequency.
- Determine which error types are most critical to address for enhancing overall accuracy and user experience.
- 5. Augment Training Data:
- Add more training data to your model. Introduce variations in document layouts, aiming to cover the scenarios where errors were found.
- 6. Model Retraining:
- Use the dataset from step 1, including the identified and corrected errors, to retrain your OCR model.
- Monitor the training progress and evaluate the retrained model's performance using the evaluation metrics established in step 3.
- 7. Iterative Evaluation and Improvement:
- Repeat steps 2-6 to keep improving the accuracy of your OCR model.
- Regularly evaluate the model's performance on new and representative datasets to assess progress.
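If you want to check the metrics named in step 3 yourself, the sketch below shows one common way to compute them locally from the corrected (Approved) values and the raw model output. It is an illustration only, not how Nanonets calculates its built-in metrics; the function names and the sample values are assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete an item from ref
                            curr[j - 1] + 1,      # insert an item from hyp
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """1 - (edit distance / reference length), floored at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(reference: str, extracted: str) -> float:
    """Character-level accuracy of the extracted text against the corrected text."""
    return accuracy(reference, extracted)


def word_accuracy(reference: str, extracted: str) -> float:
    """Word-level accuracy, comparing whitespace-separated tokens."""
    return accuracy(reference.split(), extracted.split())


def field_precision_recall(expected: Dict[str, str],
                           extracted: Dict[str, str]) -> Tuple[float, float]:
    """Field-level precision and recall, counting exact value matches as correct."""
    correct = sum(1 for k, v in extracted.items()
                  if k in expected and expected[k] == v)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical sample: the model read "1001" as "l00l".
    print(word_accuracy("Invoice 1001 Total 50", "Invoice l00l Total 50"))  # 0.75
```

Character-level accuracy is forgiving of a single wrong character, while field-level precision and recall treat the whole field as wrong, so the two can diverge sharply on the same document. Pick the one that matches how the extracted data is used downstream.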
By following this process, you can systematically identify accuracy issues in your Nanonets OCR model, prioritize improvements, and iteratively enhance its performance to better meet your specific requirements.
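Steps 2 and 4 can be approximated with a simple tally. The sketch below assumes you keep a log of the corrections made during manual review, one entry per correction, and that you assign each error category an impact weight of your own choosing. The category names, the weights, and the `prioritize` function are all hypothetical and exist only for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def prioritize(corrections: List[Tuple[str, str]],
               impact: Dict[str, int]) -> List[Tuple[str, int]]:
    """Rank error categories by frequency multiplied by an impact weight.

    corrections: (category, field) pairs, one per manual correction.
    impact: weight per category; categories without a weight default to 1.
    """
    counts = Counter(category for category, _field in corrections)
    scored = {c: n * impact.get(c, 1) for c, n in counts.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical review log and weights.
    log = [
        ("character", "invoice_number"),
        ("character", "total"),
        ("layout", "line_items"),
        ("formatting", "date"),
    ]
    weights = {"character": 1, "formatting": 1, "layout": 3}
    for category, score in prioritize(log, weights):
        print(category, score)
```

Logging the field name alongside the category also makes it easy to spot the document characteristics mentioned in step 2, for example a layout error that only ever affects one field.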
- Character Recognition Errors
- OCR can confuse similar-looking characters and substitute one for another. For example: lowercase "l" (L) and the number "1", the letter "o" (O) and the number "0" (zero), or the letter "s" and the number "5".
- How to fix: Add find-and-replace data actions to your Workflow.
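To make the logic of such a replacement rule concrete, here is a minimal Python sketch of the same idea applied outside the Workflow. It is not the Nanonets data action itself; the mapping and the function name are assumptions for illustration, and the mapping should be built from the errors you actually found during review.

```python
# Hypothetical mapping of look-alike letters to the digits they are mistaken for.
LOOKALIKE_TO_DIGIT = str.maketrans({
    "l": "1",
    "o": "0",
    "O": "0",
    "s": "5",
    "S": "5",
})


def normalize_numeric_field(value: str) -> str:
    """Replace look-alike letters with digits in a field expected to hold only digits."""
    return value.translate(LOOKALIKE_TO_DIGIT)


if __name__ == "__main__":
    print(normalize_numeric_field("2O2l"))  # 2021
```

Apply a rule like this only to fields that should be numeric, such as amounts or invoice numbers. Run against free text, the same mapping would turn a word like "sold" into "501d".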
- Field not captured
- Data might not be captured for certain fields on some types of images.