How to efficiently perform OCR for PDF documents in C# and VB.NET?
Tesseract is an open-source optical character recognition (OCR) engine and one of the most accurate OCR engines currently available. Syncfusion Essential PDF for .NET supports OCR by using the Tesseract engine.
How to efficiently perform OCR
You can improve the accuracy of the OCR process by choosing the correct compression method when converting scanned paper to a TIFF image and then to a PDF document.
- Use lossless (ZIP) compression for color or grayscale images.
- Use CCITT Group 4 or JBIG2 (lossless) compression for monochrome images. Lossless compression ensures that OCR runs on the highest-quality image, which improves accuracy. This is especially useful for low-resolution scans.
- Rotated or skewed images can also reduce the accuracy and readability of the OCR result.
Tesseract works best on text scanned at a resolution of at least 300 dots per inch (DPI), so it is beneficial to resize lower-resolution images before processing them.
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
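The 300 DPI guidance above can be turned into a simple scale calculation. The following is a minimal sketch, not part of the Syncfusion or Tesseract API: the class name DpiHelper and its methods ScaleFactor and ScaledLength are hypothetical names introduced here for illustration. It only computes the target pixel dimensions; the actual resampling would be done by whichever imaging library you use.

```csharp
using System;

// Hypothetical helper (not a Syncfusion or Tesseract type): computes how much
// a scanned image should be enlarged to reach the 300 DPI recommended above.
public static class DpiHelper
{
    // Minimum resolution recommended for Tesseract in this article.
    public const double TargetDpi = 300.0;

    // Returns the enlargement factor needed to reach TargetDpi.
    // Images already at or above the target are left unscaled (factor 1.0).
    public static double ScaleFactor(double currentDpi)
    {
        if (currentDpi <= 0)
            throw new ArgumentOutOfRangeException(nameof(currentDpi));

        return currentDpi >= TargetDpi ? 1.0 : TargetDpi / currentDpi;
    }

    // Returns the pixel length (width or height) after applying the factor.
    public static int ScaledLength(int pixels, double currentDpi)
    {
        return (int)Math.Round(pixels * ScaleFactor(currentDpi));
    }
}
```

For example, a letter-size page scanned at 150 DPI (1275 x 1650 pixels) gets a factor of 2, so the resized image would be 2550 x 3300 pixels.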
You can set the performance level of the OCRProcessor by using the Performance enumeration:
- Rapid: provides the highest OCR processing speed with normal accuracy.
- Fast: provides moderate OCR processing speed and accuracy.
- Slow: provides the slowest OCR processing speed with the best accuracy.
C#
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
//Set the OCR performance
processor.Settings.Performance = Performance.Fast;
VB.NET
Dim processor As New OCRProcessor("TesseractBinaries\")
'Set the OCR performance
processor.Settings.Performance = Performance.Fast
Steps to efficiently perform OCR for PDF documents:
- Create a new ASP.NET MVC application in Visual Studio.
- Install the Syncfusion.Pdf.OCR.AspNet.Mvc5 NuGet package from NuGet.org as a reference in your .NET Framework application.
You can also improve OCR accuracy when extracting text from an existing image file.
- For better results, convert the image to grayscale with the help of Magick.NET before performing OCR. Use the following code snippet to load the existing file and perform OCR to get the text result.
C#
//ocrText is assumed to be a string variable declared earlier in the calling code
using (OCRProcessor processor = new OCRProcessor("Tesseract Binaries"))
{
    processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
    processor.Settings.AutoDetectRotation = true;
    //Set OCR language to process
    processor.Settings.Language = Languages.English;
    using (MagickImage img = new MagickImage(imagePath))
    {
        //Convert the image to grayscale before OCR
        img.Grayscale();
        //Process OCR by providing the image and Tesseract data
        ocrText = processor.PerformOCR(img.ToBitmap(), "Tessdata");
    }
}
VB.NET
'ocrText is assumed to be a String variable declared earlier in the calling code
Using processor As OCRProcessor = New OCRProcessor("Tesseract Binaries")
    processor.Settings.TesseractVersion = TesseractVersion.Version3_05
    processor.Settings.AutoDetectRotation = True
    'Set OCR language to process
    processor.Settings.Language = Languages.English
    Using img As MagickImage = New MagickImage(imagePath)
        'Convert the image to grayscale before OCR
        img.Grayscale()
        'Process OCR by providing the image and Tesseract data
        ocrText = processor.PerformOCR(img.ToBitmap(), "Tessdata")
    End Using
End Using
A complete working sample that efficiently performs OCR on PDF documents can be downloaded from OCRImageSample.zip.
By executing the program, you will get the window as follows.
Take a moment to peruse the documentation, where you will find other options like OCR for an entire document, OCR for a region in the document, OCR on image, layout result for OCR, customizing temp folder and more.
Refer here to explore the rich set of Syncfusion Essential PDF features.
Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, include a license key in your projects. Refer to this link to learn about generating and registering a Syncfusion license key in your application to use the components without the trial message.