Extract Text from Scanned PDFs Using Tesseract OCR 5.x in Docker on Linux

The Syncfusion® .NET Optical Character Recognition (OCR) Library enables developers to extract text from scanned PDF documents and image files with minimal C# code. It converts raster-based PDFs into searchable and selectable documents, making it ideal for digitization and automation workflows.

By default, the library supports Tesseract OCR version 4. However, developers can also use Tesseract 5.x and later by configuring the library to work with an external OCR engine. This flexibility allows integration with the latest Tesseract features while maintaining compatibility with existing setups.

Extracted text can be saved in various formats, such as plain text, structured data, or searchable PDFs, and recognition is supported across multiple languages and layouts.

Steps to perform text recognition using the latest Tesseract in a Docker-based Linux environment

Create a new project: Create a new ASP.NET Core application project designed to perform OCR using the latest version of Tesseract, deployed within a Docker-based Linux environment.
Enable Docker Support: Configure your project to use Docker with Linux as the target operating system for containerization.
Install Required Packages: Add the Syncfusion.PDF.OCR.Net.Core NuGet package from NuGet.org to your project.
Install Docker Environment: In the Dockerfile, include the following commands to install the necessary dependency packages inside the Docker container.

RUN apt-get update && apt-get install -y tesseract-ocr
USER $APP_UID
WORKDIR /app
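
Because the engine shown later in this article starts the tesseract executable as an external process, it can help to confirm that the binary is actually reachable inside the container. The following is a minimal sketch, not part of the Syncfusion library: the TesseractProbe class and the GetVersionLine helper are hypothetical names, and the code assumes only that the tesseract command-line tool accepts the --version option.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class TesseractProbe
{
    // Hypothetical helper: returns the first line printed by "<fileName> --version",
    // or null when the executable cannot be started or exits with an error.
    public static string? GetVersionLine(string fileName = "tesseract")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = "--version",
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        try
        {
            using var process = Process.Start(startInfo);
            if (process == null) return null;

            // Read stderr asynchronously so that two redirected pipes cannot deadlock.
            var stderrTask = process.StandardError.ReadToEndAsync();
            string stdout = process.StandardOutput.ReadToEnd();
            string stderr = stderrTask.Result;
            process.WaitForExit();
            if (process.ExitCode != 0) return null;

            // Some builds print the version banner to stderr instead of stdout.
            string text = string.IsNullOrWhiteSpace(stdout) ? stderr : stdout;
            string[] lines = text.Split('\n', StringSplitOptions.RemoveEmptyEntries);
            return lines.Length > 0 ? lines[0].Trim() : null;
        }
        catch (Win32Exception)
        {
            // Thrown when the executable is not found on PATH.
            return null;
        }
    }

    public static void Main()
    {
        string? version = GetVersionLine();
        Console.WriteLine(version ?? "tesseract was not found on PATH");
    }
}
```

Calling GetVersionLine() once at application startup and logging the result is one way to catch a missing or outdated Tesseract installation before the first OCR request arrives.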

In the Index.cshtml file, add the following button element.

@{Html.BeginForm("PerformOCR", "Home", FormMethod.Get);
   {
       <div>
           <input type="submit" value="Perform OCR on PDF" style="width:200px;height:27px" />
       </div>
   }
   Html.EndForm();
}

In the HomeController.cs file, which is added by default when creating an ASP.NET Core project, include the following namespaces.

using System.Diagnostics;
using System.Xml.Linq;
using Syncfusion.Drawing;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;

In the HomeController.cs file, create a new action method named PerformOCR, and include the following code to run OCR on a PDF with the Tesseract OCR engine and return the resulting searchable PDF as a download.

public IActionResult PerformOCR()
{
    string docPath = Path.GetFullPath(@"Data/Input.pdf");
    //Create memory stream. It is not wrapped in 'using' because
    //FileStreamResult disposes it after the response has been written.
    MemoryStream stream = new MemoryStream();
    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    using (FileStream fileStream = new FileStream(docPath, FileMode.Open, FileAccess.Read))
    {
        //Load a PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);
        //Set OCR language to process
        processor.Settings.Language = Languages.English;
        //Route recognition through the external Tesseract 5 engine
        IOcrEngine tesseractEngine = new Tesseract5OCREngine();
        processor.ExternalEngine = tesseractEngine;
        //Process OCR by providing the PDF document.
        processor.PerformOCR(lDoc);
        //Save the document to memory stream
        lDoc.Save(stream);
        lDoc.Close();
    }
    //Set the position as '0'
    stream.Position = 0;
    //Download the PDF document in the browser
    FileStreamResult fileStreamResult = new FileStreamResult(stream, "application/pdf");
    fileStreamResult.FileDownloadName = "Sample.pdf";
    return fileStreamResult;
}

Perform OCR Using Tesseract5OCREngine: The Tesseract5OCREngine class implements IOcrEngine. It writes the incoming image stream to a temporary file, runs the tesseract command-line tool on it to produce HOCR output, parses that output, and returns the recognized text as a structured layout of pages, lines, and words.

// Implements the IOcrEngine interface to perform OCR using Tesseract
class Tesseract5OCREngine : IOcrEngine
{
   private float imageHeight;
   private float imageWidth;

   // Main method to perform OCR on an input image stream
   public OCRLayoutResult PerformOCR(Stream stream)
   {
       // Validate input stream
       if (stream == null || !stream.CanRead)
           throw new ArgumentException("Input stream is null or not readable for OCR.", nameof(stream));

       stream.Position = 0;

       // Extract image dimensions from the stream
       using (var tempMemStream = new MemoryStream())
       {
           stream.CopyTo(tempMemStream);
           tempMemStream.Position = 0;
           var pdfTiffImage = new PdfTiffImage(tempMemStream);
           imageHeight = pdfTiffImage.Height;
           imageWidth = pdfTiffImage.Width;
       }

       // Prepare temporary file paths
       string tempImageFile = Path.GetTempFileName();
       string tempHocrFile = tempImageFile + ".hocr";

       // Save the stream to a temporary image file
       using (var tempFileStream = new FileStream(tempImageFile, FileMode.Create, FileAccess.Write))
       {
           stream.Position = 0;
           stream.CopyTo(tempFileStream);
       }

       // Configure Tesseract process to generate HOCR output
       var startInfo = new ProcessStartInfo
       {
           FileName = "tesseract",
           Arguments = $"\"{tempImageFile}\" \"{tempImageFile}\" -l eng hocr",
           RedirectStandardError = true,
           UseShellExecute = false,
           CreateNoWindow = true
       };

       string hocrText = null;
       using (var process = new Process { StartInfo = startInfo })
       {
           process.Start();
           string errorOutput = process.StandardError.ReadToEnd();
           process.WaitForExit();

           // Check for errors in Tesseract execution
           if (process.ExitCode != 0)
               throw new Exception($"Tesseract process failed with exit code {process.ExitCode}. Error: {errorOutput}");

           // Ensure HOCR output file exists
           if (!File.Exists(tempHocrFile))
               throw new Exception("HOCR output file not found. Tesseract might have failed or not produced output.");

           hocrText = File.ReadAllText(tempHocrFile);
       }

       // Clean up temporary files
       if (File.Exists(tempImageFile)) File.Delete(tempImageFile);
       if (File.Exists(tempHocrFile)) File.Delete(tempHocrFile);

       // Validate HOCR output
       if (string.IsNullOrEmpty(hocrText))
           throw new Exception("HOCR text could not be generated or was empty.");

       // Parse HOCR and build structured OCR result
       var ocrLayoutResult = new OCRLayoutResult();
       BuildOCRLayoutResult(ocrLayoutResult, hocrText, imageWidth, imageHeight);
       ocrLayoutResult.ImageWidth = imageWidth;
       ocrLayoutResult.ImageHeight = imageHeight;

       return ocrLayoutResult;
   }

   // Parses HOCR XML and builds structured OCR layout
   void BuildOCRLayoutResult(OCRLayoutResult ocr, string hOcrText, float imageWidth, float imageHeight)
   {
       var doc = XDocument.Parse(hOcrText, LoadOptions.None);
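       // Must be an XNamespace (not a string) so that ns + "div" yields a namespace-qualified XName
       XNamespace ns = "http://www.w3.org/1999/xhtml";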

       // Iterate through each page in the HOCR document
       foreach (var pageElement in doc.Descendants(ns + "div").Where(d => d.Attribute("class")?.Value == "ocr_page"))
       {
           Page ocrPage = new Page();

           // Iterate through each line or header in the page
           foreach (var lineElement in pageElement.Descendants(ns + "span")
               .Where(s => s.Attribute("class")?.Value == "ocr_line" || s.Attribute("class")?.Value == "ocr_header"))
           {
               Line ocrLine = new Line();

               // Iterate through each word in the line
               foreach (var wordElement in lineElement.Descendants(ns + "span")
                   .Where(s => s.Attribute("class")?.Value == "ocrx_word"))
               {
                   Word ocrWord = new Word { Text = wordElement.Value };
                   String title = wordElement.Attribute("title")?.Value;

                   // Extract bounding box coordinates from the title attribute
                   if (title != null)
                   {
                       String bboxString = title.Split(';')[0].Replace("bbox", "").Trim();
                       int[] coords = bboxString.Split(' ', StringSplitOptions.RemoveEmptyEntries).Select(int.Parse).ToArray();

                       if (coords.Length == 4)
                       {
                           float x = coords[0];
                           float y = coords[1];
                           float width = coords[2] - coords[0];
                           float height = coords[3] - coords[1];
                           ocrWord.Rectangle = new RectangleF(x, y, width, height);
                       }
                   }

                   ocrLine.Add(ocrWord);
               }

               ocrPage.Add(ocrLine);
           }

           ocr.Add(ocrPage);
       }
   }
}
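
The bounding-box conversion inside BuildOCRLayoutResult can be tried in isolation, without Tesseract or the Syncfusion packages installed. The following is a minimal sketch under stated assumptions: the HocrBBoxDemo class and its ParseBBox helper are hypothetical names, and the HOCR fragment is hand-written for illustration rather than real Tesseract output. It shows how a title such as "bbox 36 92 96 116; x_wconf 95", which stores left, top, right, and bottom, becomes an x, y, width, and height rectangle.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HocrBBoxDemo
{
    // Hypothetical helper: converts an HOCR title such as
    // "bbox 36 92 96 116; x_wconf 95" into (x, y, width, height).
    // Returns null when the title has no well-formed bbox entry.
    public static (int X, int Y, int Width, int Height)? ParseBBox(string title)
    {
        if (string.IsNullOrWhiteSpace(title)) return null;

        // Assumes the bbox property is the first ';'-separated entry of the title.
        string first = title.Split(';')[0].Trim();
        if (!first.StartsWith("bbox ", StringComparison.Ordinal)) return null;

        string[] parts = first.Substring(5).Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 4) return null;

        int[] c = new int[4];
        for (int i = 0; i < 4; i++)
        {
            if (!int.TryParse(parts[i], out c[i])) return null;
        }

        // HOCR stores left, top, right, bottom; convert to x, y, width, height.
        return (c[0], c[1], c[2] - c[0], c[3] - c[1]);
    }

    public static void Main()
    {
        // Hand-written HOCR fragment for illustration only.
        string hocr =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<div class=\"ocr_page\"><span class=\"ocr_line\">" +
            "<span class=\"ocrx_word\" title=\"bbox 36 92 96 116; x_wconf 95\">Hello</span>" +
            "</span></div></body></html>";

        // XNamespace + string produces a namespace-qualified XName.
        XNamespace ns = "http://www.w3.org/1999/xhtml";
        var words = XDocument.Parse(hocr).Descendants(ns + "span")
            .Where(s => (string)s.Attribute("class") == "ocrx_word");

        foreach (XElement word in words)
        {
            var box = ParseBBox((string)word.Attribute("title"));
            Console.WriteLine($"{word.Value}: {box}");
        }
    }
}
```

For the sample word, the corners (36, 92) and (96, 116) give x = 36, y = 92, width = 60, and height = 24. Unlike the int.Parse call in the class above, this sketch uses int.TryParse and returns null for malformed titles, which is one possible way to keep a single bad word from aborting the whole page.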

A complete working sample can be downloaded from GitHub.

By executing the program, you will generate the following PDF document.

Take a moment to explore our comprehensive documentation, where you’ll find additional OCR options and code examples for processing images, specific document regions, rotated pages, and large PDF documents.

Conclusion

I hope you found it helpful to learn how to perform text recognition using the latest Tesseract OCR in a Docker-based Linux environment.

You can refer to our .NET PDF feature tour page to learn about its other features. You can also explore our documentation to understand how to create and manipulate data.

Current customers can check out our components from the License and Downloads page. If you are new to Syncfusion®, you can try our 30-day free trial to check out our other controls.

If you have any queries or require clarifications, please let us know in the comments section below. You can also contact us through our support forums or feedback portal. We are always happy to assist you!
