How to efficiently perform OCR for WinForms PDF in C# and VB.NET?

Tesseract is an open-source optical character recognition (OCR) engine and one of the most accurate OCR engines currently available. The Syncfusion Essential .NET PDF library supports OCR by using the Tesseract engine.

How to efficiently perform OCR

You can improve the accuracy of the OCR process by choosing the correct compression method when converting scanned paper to a TIFF image and then to a PDF document.

Use lossless (ZIP) compression for color or grayscale images.
Use CCITT Group 4 or JBIG2 (lossless) compression for monochrome images. Lossless compression lets the OCR engine work on the highest-quality image, which improves accuracy. This is especially useful for low-resolution scans.
In addition, rotated or skewed images can reduce the accuracy and readability of the OCR results.

Tesseract works best on images with a resolution of at least 300 dots per inch (DPI), so it is beneficial to resize lower-resolution images.

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
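The 300 DPI guideline above can be turned into a resize calculation. The following is a minimal sketch, not part of the Syncfusion or Tesseract APIs: the OcrDpi class and its ScaleFactor and TargetSize members are hypothetical names introduced here for illustration. It only computes the target pixel dimensions; the actual resampling would be done with the imaging library of your choice.

```csharp
using System;

// Hypothetical helper (not a Syncfusion or Tesseract API): computes how much
// an image should be enlarged to reach the 300 DPI recommended for Tesseract.
public static class OcrDpi
{
    // Minimum resolution recommended above.
    public const double TargetDpi = 300.0;

    // Returns the factor by which to scale an image so it reaches TargetDpi.
    // Images already at or above the target are left unchanged (factor 1.0).
    public static double ScaleFactor(double sourceDpi)
    {
        if (sourceDpi <= 0)
            throw new ArgumentOutOfRangeException("sourceDpi", "DPI must be positive.");
        return sourceDpi >= TargetDpi ? 1.0 : TargetDpi / sourceDpi;
    }

    // Computes the pixel dimensions of the image after scaling.
    public static void TargetSize(int width, int height, double sourceDpi,
                                  out int targetWidth, out int targetHeight)
    {
        double scale = ScaleFactor(sourceDpi);
        targetWidth = (int)Math.Round(width * scale);
        targetHeight = (int)Math.Round(height * scale);
    }
}
```

For example, an A4 page scanned at 150 DPI (1240 x 1754 pixels) has a scale factor of 2 and would be resized to 2480 x 3508 pixels, while a 600 DPI scan is left at its original size.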

You can set the performance level of the OCRProcessor by using the “Performance” enumeration. The following levels are available:

Rapid: High-speed OCR performance with normal OCR accuracy.
Fast: Moderate OCR processing speed with good accuracy.
Slow: Slow OCR performance with the best OCR accuracy.

C#

OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");

// Set the OCR performance
processor.Settings.Performance = Performance.Fast;

VB.NET

Dim processor As New OCRProcessor("TesseractBinaries\")
 
'Set the OCR performance
processor.Settings.Performance = Performance.Fast

Steps to efficiently perform OCR for PDF documents:

Create a new Windows Forms application in Visual Studio.

Install the Syncfusion.Pdf.OCR.WinForms NuGet package as a reference to your .NET Framework application from NuGet.org.

Use the following code snippet to load an existing image file and perform OCR on it to extract the text. For better results, convert the image to grayscale with the help of Magick.NET before performing OCR.

C#

string ocrText;
using (OCRProcessor processor = new OCRProcessor("Tesseract Binaries"))
{
    processor.Settings.TesseractVersion = TesseractVersion.Version3_05;
    processor.Settings.AutoDetectRotation = true;
    // Set OCR language to process
    processor.Settings.Language = Languages.English;
    using (MagickImage img = new MagickImage(imagePath))
    {
        // Convert the image to grayscale for better OCR results
        img.Grayscale();
        // Process OCR by providing the bitmap image and the Tesseract data path
        ocrText = processor.PerformOCR(img.ToBitmap(), "Tessdata");
    }
}

VB.NET

Dim ocrText As String
Using processor As New OCRProcessor("Tesseract Binaries")
    processor.Settings.TesseractVersion = TesseractVersion.Version3_05
    processor.Settings.AutoDetectRotation = True
    ' Set OCR language to process
    processor.Settings.Language = Languages.English
    Using img As New MagickImage(imagePath)
        ' Convert the image to grayscale for better OCR results
        img.Grayscale()
        ' Process OCR by providing the bitmap image and the Tesseract data path
        ocrText = processor.PerformOCR(img.ToBitmap(), "Tessdata")
    End Using
End Using

A complete working sample that shows how to efficiently perform OCR can be downloaded from OCRImageSample.zip.

By executing the program, you will see the following window.

Perform OCR for PDF documents in WinForms

Take a moment to peruse the documentation, where you will find other options like OCR for an entire document, OCR for a region in the document, OCR on images, layout results for OCR, customizing the temp folder, and more.

Refer here to explore the rich set of Syncfusion Essential PDF features.

Note:

Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or from the NuGet feed, you must include a license key in your projects. Refer to this link to learn about generating and registering a Syncfusion license key in your application to use the components without a trial message.

Conclusion

I hope you enjoyed learning about how to efficiently perform OCR for WinForms PDF in C# and VB.NET.

You can refer to our WinForms PDF feature tour page to learn about its other features. You can also explore our documentation to understand how to create and manipulate data.

Current customers can check out our components from the License and Downloads page. If you are new to Syncfusion, you can try our 30-day free trial to check out our other controls.

If you have any queries or require clarifications, please let us know in the comments section below. You can also contact us through our support forums or feedback portal. We are always happy to assist you!
