Optical character recognition (OCR) makes text in images machine-readable, allowing programs and scripts to process that text. OCR appears across a wide range of applications, but primarily in document-related scenarios such as document digitization and receipt processing.

While solutions for document OCR have been heavily investigated, the state of the art for non-document applications, occasionally referred to as "scene OCR", such as reading license plates or logos, is less clear.

In this blog post, we compare nine different OCR solutions, evaluating their efficacy across ten different areas of industrial OCR applications.

OCR Solutions

We will test nine different OCR models:

  • Tesseract (locally via PyTesseract)
  • EasyOCR (local)
  • Surya (local)
  • DocTR (via Roboflow Hosted API)
  • OpenAI GPT-4 with Vision
  • Google Gemini Pro 1.0
  • Google Gemini Pro 1.5
  • Anthropic Claude 3 Opus
  • Hugging Face Idefics2

In addition to four open-source OCR-specific packages, we also test five Large Multimodal Models (LMMs): GPT-4 with Vision, Gemini Pro 1.0, Gemini Pro 1.5, Claude 3 Opus, and Idefics2. Several of these models have previously shown effectiveness in OCR tasks.

👍
We will occasionally update this article with new OCR models as we discover them. Gemini Pro 1.5 was added March 28, 2024. Idefics2 by Hugging Face was added April 18, 2024.

OCR Testing Methodology

Most OCR solutions, as well as benchmarks, are primarily designed for reading entire pages of text. Informed by our experience deploying computer vision models in physical world environments, we have seen the benefit of omitting the OCR model's built-in "text detection" (localization) step in favor of a custom-trained object detection model, cropping each detection and passing the crop on to the OCR model.
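To illustrate, here is a minimal sketch of that detect-then-crop pipeline, assuming bounding boxes are already available from a custom-trained detector. The box format and the use of PyTesseract as the downstream OCR engine are illustrative choices, not the exact setup used in this experiment:

```python
from PIL import Image
import pytesseract

def ocr_detections(image_path: str, boxes: list[tuple[int, int, int, int]]) -> list[str]:
    """Crop each detected text region and run OCR on the crop only,
    skipping the OCR model's own text-localization step."""
    image = Image.open(image_path)
    texts = []
    for (x0, y0, x1, y1) in boxes:
        crop = image.crop((x0, y0, x1, y1))  # region found by the detector
        texts.append(pytesseract.image_to_string(crop).strip())
    return texts

# Example: boxes produced by a hypothetical detector for one image
# print(ocr_detections("license_plate.jpg", [(120, 80, 340, 160)]))
```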

Our goal, and the scope of this experiment, is to test as many non-document use case domains as possible with localized text examples. Drawing on our own experience and our customers' use cases, we outlined ten different domains to test.

Sample images from each of the tested domains

For each domain, we selected an open-source dataset from Roboflow Universe and randomly sampled ten images from it. An image was included only if a human could reasonably read its text.

💡
Access the data used in this experiment on Universe.

In most cases, the domain images were cropped corresponding to object detection annotations made within the original Universe dataset. In cases where there were either no detections to crop from or the detections contained extra text that could introduce variability in our testing, we manually cropped images.

For example, with license plates, the dataset we used contained the entire license plate, which in the cases of U.S. license plates included the state, taglines, and registration stickers. In this case, we cropped the image only to include the primary identifying numbers and letters.

To create a ground truth to compare OCR predictions against, each image was manually read and annotated with the text in the image as it appeared. 

Once we prepared the dataset, we evaluated each OCR solution. A Levenshtein distance ratio, a metric that measures the similarity between two strings, was calculated between each prediction and the ground truth and used for scoring.
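As a sketch of the scoring step: the normalization below (one minus the edit distance divided by the longer string's length) is one common convention for a Levenshtein ratio; off-the-shelf implementations such as those in rapidfuzz or python-Levenshtein use similar, though not always identical, normalizations.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_ratio(prediction: str, ground_truth: str) -> float:
    """Similarity score in [0, 1]: 1.0 means an exact match."""
    if not prediction and not ground_truth:
        return 1.0
    distance = levenshtein_distance(prediction, ground_truth)
    return 1 - distance / max(len(prediction), len(ground_truth))

# e.g. levenshtein_ratio("ABC123", "A8C123") -> ~0.83 (one wrong character out of six)
```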

Results

Our testing gave us several insights into the various OCR solutions and when to use them. We examine the accuracy, speed, and cost aspects of the results.

Accuracy

Across the board, considering all domains, two LMMs, Gemini and Claude, performed the best, followed by EasyOCR and GPT-4.

The median accuracy of each model across all domains

This general ranking in median accuracy held across most individual domains.

Bar charts with the accuracy of each model on a respective domain

Across all domains, Claude achieved the highest accuracy score the most times, followed by GPT-4 and Gemini. EasyOCR performed generally well across most domains and far surpassed its specialized-package counterparts, but underperformed compared to LMMs. 

A table of percentage accuracy scores by model and by domain. Within each domain, the highest scores are bolded and scores above average are underlined.

A surprisingly notable aspect was GPT-4's refusal rate: for some images it refused to answer, producing unusable results, and repeated attempts did not resolve the issue. We treated each refusal as a zero score, which significantly lowered, and in some domains entirely zeroed out, its accuracy; the other LMMs did not experience refusals at all.

Speed

While accurate OCR is important, speed is also a consideration when using OCR. 

Speed and accuracy are plotted on a scatterplot (left: all data points visible; right: all data points are included, but the graph view is limited to 30 seconds)
The median time elapsed for a prediction for each model. Note: Surya was cut off and its median speed was 44.00 seconds

Speed, however, does not show the entire picture since a fast model with terrible accuracy is not useful. So, we calculate a metric, “speed efficiency”, which we define as accuracy over the elapsed time.
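A minimal sketch of how this metric can be computed from a timed prediction follows; the timing approach and function names here are illustrative rather than our exact harness:

```python
import time

def timed_prediction(ocr_fn, image):
    """Run a single OCR prediction and measure its wall-clock time."""
    start = time.perf_counter()
    text = ocr_fn(image)
    elapsed = time.perf_counter() - start
    return text, elapsed

def speed_efficiency(accuracy: float, elapsed_seconds: float) -> float:
    """Accuracy divided by elapsed time: more accuracy per second is better."""
    return accuracy / elapsed_seconds
```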

Our goal with this metric is to show how accurate the model is, considering the time it took. In this category, Gemini wins by a notable margin, with EasyOCR and GPT-4 as runners-up. Despite Claude's high accuracy, its slow response time negatively impacted its score in this category.

Cost

A third factor to consider is the price of each request. In high-volume use cases, costs can add up quickly, so having a sense of the financial impact is important.

The costs were calculated in one of two ways, depending on the model. For LMMs, which do not offer a local inference option, we calculated the costs directly from the number of tokens (or characters, in the case of Gemini) used and the respective model's pricing.

For locally run OCR models, we calculated the cost of the OCR request as the time it took to predict multiplied by the cost of the virtual machine on Google Cloud. (The tests were run in a Google Colab CPU environment, which we equated to a Compute Engine E2 instance with 2 vCPUs and 13 GB of memory.)
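A sketch of both costing methods is below. All prices are hypothetical placeholders, not the rates used in our experiment; substitute the provider's current pricing:

```python
def lmm_request_cost(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Token-based cost for a hosted LMM request."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

def local_request_cost(elapsed_seconds: float, vm_price_per_hour: float) -> float:
    """Time-based cost for a locally run model on a cloud VM."""
    return (elapsed_seconds / 3600) * vm_price_per_hour

# e.g. a 2-second local prediction on a VM billed at a hypothetical $0.07/hour:
# local_request_cost(2.0, 0.07) -> ~$0.000039
```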

The cost of a median request for each model

DocTR, Tesseract, Surya, and EasyOCR were significantly cheaper to run than the LMMs. As with speed, however, price in isolation is not a useful indicator of how a model will perform in the field.

Similar to how we calculated speed efficiency, we calculated a cost efficiency metric, defined as percentage accuracy divided by the cost of the request. This relates how performant a model is to the price of running it.
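In code, the metric is a single division, mirroring the speed-efficiency helper above; the function name is illustrative:

```python
def cost_efficiency(accuracy_percent: float, request_cost_dollars: float) -> float:
    """Percentage accuracy divided by request cost: higher is better."""
    return accuracy_percent / request_cost_dollars

# e.g. 85% accuracy at a hypothetical $0.000039 per request -> ~2.2 million
```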

EasyOCR had the best cost efficiency, with DocTR and Gemini as distant runners-up. The drastic difference stems from EasyOCR's relatively impressive accuracy combined with it being a significantly cheaper alternative to its main competition, the LMMs.

Conclusion

In this blog post, we explored how different OCR solutions perform across domains that are commonly found in industrial vision use cases, comparing LMMs and open-source solutions on speed, accuracy, and cost. 

Throughout testing, we found that running EasyOCR locally produces the most cost-efficient OCR results while maintaining competitive accuracy, that Anthropic's Claude 3 Opus performed the best across the widest array of domains, and that Google's Gemini Pro 1.0 performed the best in terms of speed efficiency.

Among local, open-source OCR solutions, EasyOCR far outperformed its counterparts on all metrics, performing at levels near or above those of the LMMs.