How to efficiently perform OCR for PDF documents in C# and VB.NET?

Tesseract is an open-source optical character recognition (OCR) engine and one of the most accurate OCR engines currently available. Syncfusion Essential PDF for .NET supports OCR through the Tesseract engine.

How to efficiently perform OCR

You can improve the accuracy of the OCR process by choosing the correct compression method when converting scanned paper to a TIFF image and then to a PDF document.

  • Use lossless (ZIP) compression for color or grayscale images.
  • Use CCITT Group 4 or JBIG2 (lossless) compression for monochrome images. Lossless compression ensures that OCR runs on the highest-quality image, which improves accuracy. This is especially useful for low-resolution scans.
  • Rotated and skewed images can also reduce the accuracy and readability of the OCR output.

Tesseract works best on text captured at 300 dots per inch (DPI) or more, so it is beneficial to resize lower-resolution images.

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
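
As a rough illustration of the 300 DPI guideline, the following sketch computes how much a scan would need to be enlarged to reach 300 DPI. The DpiHelper class and its method names are hypothetical and are not part of the Syncfusion or Tesseract APIs. It only performs the arithmetic; the actual resampling would be done with an imaging library of your choice.

```csharp
using System;

// Hypothetical helper (not a Syncfusion or Tesseract API):
// works out the enlargement needed to reach the 300 DPI guideline.
public static class DpiHelper
{
    // Target resolution taken from the guideline above.
    public const double TargetDpi = 300.0;

    // Returns the factor by which an image should be enlarged.
    // Images already at or above the target are left unchanged (factor 1.0).
    public static double ScaleFactor(double sourceDpi)
    {
        if (sourceDpi <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(sourceDpi), "DPI must be positive.");
        }
        return sourceDpi >= TargetDpi ? 1.0 : TargetDpi / sourceDpi;
    }

    // Applies the scale factor to a width or height given in pixels.
    public static int ScaledLength(int pixels, double sourceDpi)
    {
        return (int)Math.Round(pixels * ScaleFactor(sourceDpi));
    }
}
```

Under these assumptions, a 1000-pixel-wide scan at 150 DPI would be enlarged to 2000 pixels, while a scan that is already at 300 DPI or more keeps its size.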

You can set the performance level of the OCRProcessor using the Performance enumeration:

  • Rapid: fastest OCR processing with normal accuracy.
  • Fast: moderate OCR processing speed and accuracy.
  • Slow: slowest OCR processing with the best accuracy.

C#

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
 
//Set the OCR performance
processor.Settings.Performance = Performance.Fast;

 

VB.NET

Dim processor As New OCRProcessor("TesseractBinaries\")
 
'Set the OCR performance
processor.Settings.Performance = Performance.Fast

 

Steps to efficiently perform OCR for PDF documents:

  1. Create a new ASP.NET MVC application in Visual Studio.

[Screenshot: Create MVC application in Visual Studio]

  2. Install the Syncfusion.Pdf.OCR.AspNet.Mvc5 NuGet package as a reference to your .NET Framework application from NuGet.org.

[Screenshot: Add the NuGet package reference to the project]

You can also improve OCR accuracy when extracting text from an existing image file.

  3. For better results, convert the image to grayscale with Magick.NET before running OCR. Use the following code snippet to load an existing image file and run OCR to get the text result.

C#

//imagePath (input image file) and ocrText (string result) are assumed to be declared by the caller
using (OCRProcessor processor = new OCRProcessor("Tesseract Binaries"))
{
    processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
    processor.Settings.AutoDetectRotation = true;
    //Set OCR language to process
    processor.Settings.Language = Languages.English;
    using (MagickImage img = new MagickImage(imagePath))
    {
        //Convert the image to grayscale before OCR
        img.Grayscale();
        //Process OCR by providing the image and Tesseract data
        ocrText = processor.PerformOCR(img.ToBitmap(), "Tessdata");
    }
}

 

VB.NET

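'imagePath (input image file) and ocrText (string result) are assumed to be declared by the caller
Using processor As OCRProcessor = New OCRProcessor("Tesseract Binaries")
    processor.Settings.TesseractVersion = TesseractVersion.Version3_05
    processor.Settings.AutoDetectRotation = True
    'Set OCR language to process
    processor.Settings.Language = Languages.English
    Using img As MagickImage = New MagickImage(imagePath)
        'Convert the image to grayscale before OCR
        img.Grayscale()
        'Process OCR by providing the image and Tesseract data
        ocrText = processor.PerformOCR(img.ToBitmap(), "Tessdata")
    End Using
End Using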

 

A complete working sample that performs OCR efficiently on PDF documents can be downloaded from OCRImageSample.zip.

Running the program displays the following window.

[Screenshot: Perform OCR for PDF documents in WinForms]

Take a moment to peruse the documentation, where you will find other options such as OCR for an entire document, OCR for a region of a document, OCR on an image, layout results for OCR, customizing the temp folder, and more.

Refer here to explore the rich set of Syncfusion Essential PDF features.

Note:

Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, you must include a license key in your projects. Refer to this link to learn about generating and registering a Syncfusion license key in your application so that the components can be used without the trial message.

 
