How to Extract the Text from Image Free PDF Documents Using WinForms OCR Processor?


The Syncfusion® .NET Optical Character Recognition (OCR) Library is used to extract text from scanned PDFs and images. With this library, you can also extract the text from image-free PDF documents using the OCR Processor.

The code uses the PdfToImageConverter class to convert each page of the PDF document into an image stream. Each image stream is then drawn onto a page of a new PDF document. The new document, which now contains the converted images, is saved to a memory stream and loaded as a PdfLoadedDocument for OCR processing.

Steps to extract the text from image-free PDF documents using OCR Processor

Create a new Windows Forms application project.
Install the Syncfusion.PdfToImageConverter.WinForms and Syncfusion.Pdf.OCR.WinForms NuGet packages from NuGet.org as references in your .NET Framework application.

Download the language packages from the following link.
https://github.com/tesseract-ocr/tessdata_fast
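
A missing or misplaced language file can cause OCR to fail, so it may help to check the folder before running the processor. The sketch below assumes the downloaded files were copied to the ../../tessdata-fast/ folder used later in this article and that the English data is stored as eng.traineddata, as in the tessdata_fast repository. The TessDataCheck class and its HasLanguageData method are hypothetical helpers, not part of the Syncfusion API.

```csharp
using System;
using System.IO;

public static class TessDataCheck
{
    // Hypothetical helper: returns true when the folder contains
    // "<languageCode>.traineddata" (for example, eng.traineddata).
    public static bool HasLanguageData(string tessDataPath, string languageCode)
    {
        if (string.IsNullOrEmpty(tessDataPath) || string.IsNullOrEmpty(languageCode))
        {
            return false;
        }

        // Resolve relative paths such as "../../tessdata-fast/" against the working directory.
        string folder = Path.GetFullPath(tessDataPath);
        string file = Path.Combine(folder, languageCode + ".traineddata");
        return File.Exists(file);
    }
}
```

A call such as TessDataCheck.HasLanguageData(@"../../tessdata-fast/", "eng") could be made before performing OCR to show a clear message when the file is not found.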

Add a new button in Form1.Designer.cs to trigger the text extraction.
Include the following namespaces in the Form1.cs file.

C#

using Syncfusion.OCRProcessor;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
using Syncfusion.PdfToImageConverter;
using System;
using System.Drawing;
using System.IO;
using System.Windows.Forms;

VB.NET

Imports Syncfusion.OCRProcessor
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Graphics
Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.PdfToImageConverter
Imports System
Imports System.Drawing
Imports System.IO
Imports System.Windows.Forms
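
The button step above does not show the designer code. As a rough sketch, assuming a button named button1 (to match the button1_Click handler used in this article), the wiring might look like the following. The OcrSample namespace, the caption, and the layout values are hypothetical, and the InitializeComponent method that Visual Studio normally generates in Form1.Designer.cs is condensed into a single file here for brevity.

```csharp
using System;
using System.Drawing;
using System.Windows.Forms;

namespace OcrSample // hypothetical namespace
{
    public class Form1 : Form
    {
        private Button button1;

        public Form1()
        {
            InitializeComponent();
        }

        // Normally generated in Form1.Designer.cs; shown inline for brevity.
        private void InitializeComponent()
        {
            button1 = new Button();
            button1.Name = "button1";
            button1.Text = "Perform OCR";          // hypothetical caption
            button1.Location = new Point(90, 60);  // hypothetical layout values
            button1.Size = new Size(160, 40);
            // Wire the Click event to the handler that will hold the OCR code.
            button1.Click += new EventHandler(button1_Click);

            ClientSize = new Size(340, 160);
            Controls.Add(button1);
            Text = "OCR sample";
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Place the OCR code from the next step here.
        }
    }
}
```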

Use the following code in button1_Click to extract the text from image-free PDF documents.

C#

// Initialize the OCR processor.
using (OCRProcessor processor = new OCRProcessor())
{
  // Create the PDF-to-image converter.
  PdfToImageConverter imageConverter = new PdfToImageConverter();
  // Open the input PDF document as a file stream.
  FileStream inputStream = new FileStream(@"../../Input.pdf", FileMode.Open, FileAccess.ReadWrite);
  imageConverter.Load(inputStream);
  // Create a new PDF document to store the converted images.
  PdfDocument doc = new PdfDocument();
  // Iterate through each page of the input PDF, convert to image, and add to the new document
  for (int i = 0; i < imageConverter.PageCount; i++)
  {
     // Convert PDF to Image.
     Stream outputStream = imageConverter.Convert(i, false, false);
     // Create a PdfBitmap from the converted image stream.
     PdfBitmap pdfImage = new PdfBitmap(outputStream);
     // Create a new PdfSection and add the page size.
     PdfSection section = doc.Sections.Add();
     // Set Margins
     section.PageSettings.Margins.All = 0;
     // Set the page size.
     section.PageSettings.Size = new SizeF(pdfImage.PhysicalDimension.Width, pdfImage.PhysicalDimension.Height);
     // Add a new page to section.
     PdfPage page = section.Pages.Add();
     // Obtain the graphics context for the current PDF page.
     PdfGraphics graphics = page.Graphics;
     // Draw the converted image onto the PDF page.
     graphics.DrawImage(pdfImage, 0, 0, page.Size.Width, page.Size.Height);
  }
  // Save the new document with converted images to a memory stream
  MemoryStream file = new MemoryStream();
  doc.Save(file);
  // Close the document.
  doc.Close(true);
  // Load the new document with converted images for OCR processing
  PdfLoadedDocument document = new PdfLoadedDocument(file);
  // Set OCR language.
  processor.Settings.Language = Languages.English;
  // Set OCR TesseractVersion.
  processor.Settings.TesseractVersion = TesseractVersion.Version4_0;
  // Set TessDataPath.
  processor.TessDataPath = @"../../tessdata-fast/";
  // Perform OCR with input document and tessdata (Language packs).
  string text = processor.PerformOCR(document);
  // Create file stream for the output PDF document after OCR processing.
  using (FileStream outputFileStream = new FileStream(Path.GetFullPath(@"../../Output.pdf"), FileMode.Create, FileAccess.ReadWrite))
  {
     // Save the PDF document with OCR-recognized text to the file stream.
     document.Save(outputFileStream);
  }
  // Close the document.
  document.Close(true);
}

VB.NET

' Initialize the OCR processor.
Using processor As New OCRProcessor()
   ' Create the PDF-to-image converter.
   Dim imageConverter As New PdfToImageConverter()
   ' Open the input PDF document as a file stream.
   Dim inputStream As New FileStream("../../Input.pdf", FileMode.Open, FileAccess.ReadWrite)
   imageConverter.Load(inputStream)
   ' Create a new PDF document to store the converted images.
   Dim doc As New PdfDocument()
   ' Iterate through each page of the input PDF, convert to image, and add to the new document
   For i As Integer = 0 To imageConverter.PageCount - 1
       ' Convert PDF to Image.
       Dim outputStream As Stream = imageConverter.Convert(i, False, False)
       ' Create a PdfBitmap from the converted image stream.
       Dim pdfImage As New PdfBitmap(outputStream)
       ' Create a new PdfSection and add the page size.
       Dim section As PdfSection = doc.Sections.Add()
       ' Set Margins
       section.PageSettings.Margins.All = 0
       ' Set the page size.
       section.PageSettings.Size = New SizeF(pdfImage.PhysicalDimension.Width, pdfImage.PhysicalDimension.Height)
       ' Add a new page to section.
       Dim page As PdfPage = section.Pages.Add()
       ' Obtain the graphics context for the current PDF page.
       Dim graphics As PdfGraphics = page.Graphics
       ' Draw the converted image onto the PDF page.
       graphics.DrawImage(pdfImage, 0, 0, page.Size.Width, page.Size.Height)
   Next
   ' Save the new document with converted images to a memory stream
   Dim file As New MemoryStream()
   doc.Save(file)
   ' Close the document.
   doc.Close(True)
   ' Load the new document with converted images for OCR processing
   Dim document As New PdfLoadedDocument(file)
   ' Set OCR language.
   processor.Settings.Language = Languages.English
   ' Set OCR TesseractVersion.
   processor.Settings.TesseractVersion = TesseractVersion.Version4_0
   ' Set TessDataPath.
   processor.TessDataPath = "../../tessdata-fast/"
   ' Perform OCR with input document and tessdata (Language packs).
   Dim text As String = processor.PerformOCR(document)
   ' Create file stream for the output PDF document after OCR processing.
   Using outputFileStream As New FileStream(Path.GetFullPath("../../Output.pdf"), FileMode.Create, FileAccess.ReadWrite)
       ' Save the PDF document with OCR-recognized text to the file stream.
       document.Save(outputFileStream)
   End Using
   ' Close the document.
   document.Close(True)
End Using
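
In the code above, PerformOCR returns the recognized text as a string, but the sample does not use it further. A minimal sketch of writing that text to a file follows. The OcrTextHelper class, the SaveExtractedText method, and the output path are hypothetical names introduced for illustration only.

```csharp
using System;
using System.IO;
using System.Text;

public static class OcrTextHelper
{
    // Hypothetical helper: writes the recognized text to a UTF-8 file
    // and returns the full path of the file that was written.
    public static string SaveExtractedText(string text, string path)
    {
        if (text == null)
        {
            throw new ArgumentNullException("text");
        }

        string fullPath = Path.GetFullPath(path);
        File.WriteAllText(fullPath, text, Encoding.UTF8);
        return fullPath;
    }
}
```

Inside button1_Click, a call such as OcrTextHelper.SaveExtractedText(text, @"../../Output.txt") could be placed right after the PerformOCR line.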

A complete working sample can be downloaded from Extract_text_from_Image_free_PDF.zip.

By executing the program, you will get a PDF document that contains the OCR-recognized text.

Take a moment to peruse the documentation, where you will find other options such as performing OCR on an image, a region of the document, a rotated page, and large PDF documents, with code examples.

Refer here to explore the rich set of Syncfusion Essential® PDF features.

Note: Starting with v16.2.0.x, if you reference Syncfusion® assemblies from the trial setup or the NuGet feed, include a license key in your projects. Refer to this link to learn about generating and registering the Syncfusion® license key in your application to use the components without a trial message.
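
As a sketch of the registration mentioned in the note, the license key is typically registered once at startup, before any Syncfusion® component is created. The example assumes the Syncfusion.Licensing assembly is referenced, that Form1 is the main form of the project created earlier, and that the OcrSample namespace is a hypothetical name. The key string is a placeholder to be replaced with your own key.

```csharp
using System;
using System.Windows.Forms;

namespace OcrSample // hypothetical namespace
{
    static class Program
    {
        [STAThread]
        static void Main()
        {
            // Register the license key before any Syncfusion component is created.
            // "YOUR LICENSE KEY" is a placeholder, not a real key.
            Syncfusion.Licensing.SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault(false);
            // Form1 is the main form of the Windows Forms project.
            Application.Run(new Form1());
        }
    }
}
```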

Conclusion
I hope you enjoyed learning about how to extract the text from image-free PDF documents using the OCR Processor.

You can refer to our WinForms PDF feature tour page to learn about its other groundbreaking features. You can also explore our WinForms PDF documentation to understand how to present and manipulate data.

Current customers can check out our WinForms components from the License and Downloads page. If you are new to Syncfusion®, you can try our 30-day free trial to check out our WinForms PDF and other WinForms components.

If you have any queries or require clarifications, please let us know in the comments below. You can also contact us through our support forums or feedback portal. We are always happy to assist you!
