Extract Text from Scanned PDFs Using Tesseract OCR 5.x in Docker on Linux
The Syncfusion® .NET Optical Character Recognition (OCR) Library enables developers to extract text from scanned PDF documents and image files with minimal C# code. It converts raster-based PDFs into searchable and selectable documents, making it ideal for digitization and automation workflows.
By default, the library supports Tesseract OCR version 4. However, developers can also use Tesseract 5.x and later by configuring the library to work with an external OCR engine. This flexibility allows integration with the latest Tesseract features while maintaining compatibility with existing setups.
Extracted text can be saved as plain text, structured data, or a searchable PDF, and recognition works across multiple languages and page layouts.
Steps to perform text recognition using the latest Tesseract in a Docker-based Linux environment
- Create a new project: Create a new ASP.NET Core application project designed to perform OCR using the latest version of Tesseract, deployed within a Docker-based Linux environment.
- Enable Docker Support: Configure your project to use Docker with Linux as the target operating system for containerization.
- Install Required Packages: Add the Syncfusion.PDF.OCR.Net.Core NuGet package from NuGet.org to your project.
- Install Docker Environment: In the Dockerfile, include the following commands to install the necessary dependency packages inside the Docker container.
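# Install the Tesseract command-line tool. apt-get needs root, so this runs
# before the image switches to the non-root application user.
RUN apt-get update && apt-get install -y tesseract-ocr
USER $APP_UID
WORKDIR /app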
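The Tesseract version that apt-get installs depends on the distribution release of the base image, so it can be worth confirming at startup that a 5.x build is on the PATH. The following is a minimal sketch of such a check, written as a standalone program with top-level statements. The helper names ParseTesseractMajorVersion and ReadTesseractVersionOutput are hypothetical and not part of the Syncfusion library, and the sketch assumes the first line of the `tesseract --version` output looks like "tesseract 5.3.0".

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Example with a captured sample of the version output (no process is started here).
int? major = ParseTesseractMajorVersion("tesseract 5.3.0\n leptonica-1.82.0");
Console.WriteLine(major >= 5 ? "Tesseract 5.x or later found." : "Tesseract 5.x not found.");

// In the container, the live check would be:
//   int? installed = ParseTesseractMajorVersion(ReadTesseractVersionOutput());

// Extracts the major version from text such as "tesseract 5.3.0".
// Returns null when the text does not contain a recognizable version.
static int? ParseTesseractMajorVersion(string versionOutput)
{
    Match match = Regex.Match(versionOutput ?? string.Empty, @"tesseract\s+v?(\d+)\.");
    return match.Success ? int.Parse(match.Groups[1].Value) : (int?)null;
}

// Runs "tesseract --version" and returns everything it printed.
// Assumes the tesseract executable is on the PATH.
static string ReadTesseractVersionOutput()
{
    var startInfo = new ProcessStartInfo
    {
        FileName = "tesseract",
        Arguments = "--version",
        RedirectStandardOutput = true,
        RedirectStandardError = true,
        UseShellExecute = false
    };
    using var process = Process.Start(startInfo)
        ?? throw new InvalidOperationException("Could not start tesseract.");
    // Read both streams, because some builds write the version to stderr.
    Task<string> errorTask = process.StandardError.ReadToEndAsync();
    string output = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
    return output + errorTask.Result;
}
```

A null or lower-than-5 result is a hint to review the base image or the package source before debugging the OCR code itself.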
- In the Index.cshtml file, add the following button element.
@{
    Html.BeginForm("PerformOCR", "Home", FormMethod.Get);
    {
        <div>
            <input type="submit" value="Perform OCR on PDF" style="width:200px;height:27px" />
        </div>
    }
    Html.EndForm();
}
- In the HomeController.cs file, which is added by default when creating an ASP.NET Core project, include the following namespaces.
using System.Diagnostics;
using System.Xml.Linq;
using Syncfusion.Drawing;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
- In the HomeController.cs file, create a new action method named PerformOCR, and include the following code. It runs OCR on the PDF through the Tesseract OCR engine and returns the resulting searchable PDF as a file download.
public IActionResult PerformOCR()
{
    string docPath = Path.GetFullPath(@"Data/Input.pdf");
    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    //Open the input PDF; the file stream is released when the block exits.
    using (FileStream fileStream = new FileStream(docPath, FileMode.Open, FileAccess.Read))
    {
        //Load a PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);
        //Set OCR language to process
        processor.Settings.Language = Languages.English;
        //Route recognition through the external Tesseract 5 engine.
        IOcrEngine tesseractEngine = new Tesseract5OCREngine();
        processor.ExternalEngine = tesseractEngine;
        //Process OCR by providing the PDF document.
        processor.PerformOCR(lDoc);
        //Create the memory stream without a using block: FileStreamResult reads it
        //after this method returns and disposes it once the response is written.
        MemoryStream stream = new MemoryStream();
        //Save the document to memory stream
        lDoc.Save(stream);
        lDoc.Close();
        //Set the position as '0'
        stream.Position = 0;
        //Download the PDF document in the browser
        FileStreamResult fileStreamResult = new FileStreamResult(stream, "application/pdf");
        fileStreamResult.FileDownloadName = "Sample.pdf";
        return fileStreamResult;
    }
}
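Path.GetFullPath resolves "Data/Input.pdf" against the current working directory, so the input file has to exist at that location inside the container. A missing file is easier to diagnose when the path is checked up front. The sketch below shows one way to do that as a standalone program with top-level statements. ResolveDataFile is a hypothetical helper, not a Syncfusion API, and resolving against AppContext.BaseDirectory instead of the working directory is an assumption about how the application is laid out.

```csharp
using System;
using System.IO;

// Example: resolve the sample input and report a clear message if it is absent.
try
{
    Console.WriteLine(ResolveDataFile("Data/Input.pdf"));
}
catch (FileNotFoundException ex)
{
    Console.WriteLine(ex.Message);
}

// Resolves a data file relative to a base directory (by default the directory
// that contains the application binaries) and verifies that the file exists.
static string ResolveDataFile(string relativePath, string baseDirectory = null)
{
    string root = baseDirectory ?? AppContext.BaseDirectory;
    string fullPath = Path.GetFullPath(Path.Combine(root, relativePath));
    if (!File.Exists(fullPath))
        throw new FileNotFoundException(
            $"Input file not found at '{fullPath}'. Check that it is copied into the container.",
            fullPath);
    return fullPath;
}
```

The returned path can be passed to the FileStream constructor in place of the value from Path.GetFullPath.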
- Perform OCR Using Tesseract5OCREngine: The Tesseract5OCREngine class implements the IOcrEngine interface. It writes the incoming image stream to a temporary file, runs the tesseract command-line tool to produce hOCR output, parses that output, and returns the recognized lines and words with their bounding boxes as an OCRLayoutResult.
// Implements the IOcrEngine interface to perform OCR using Tesseract.
// Requires the System.Diagnostics and System.Xml.Linq namespaces in addition to the Syncfusion namespaces.
class Tesseract5OCREngine : IOcrEngine
{
    private float imageHeight;
    private float imageWidth;
    // Main method to perform OCR on an input image stream
    public OCRLayoutResult PerformOCR(Stream stream)
    {
        // Validate input stream
        if (stream == null || !stream.CanRead)
            throw new ArgumentException("Input stream is null or not readable for OCR.", nameof(stream));
        stream.Position = 0;
        // Extract image dimensions from the stream
        using (var tempMemStream = new MemoryStream())
        {
            stream.CopyTo(tempMemStream);
            tempMemStream.Position = 0;
            var pdfTiffImage = new PdfTiffImage(tempMemStream);
            imageHeight = pdfTiffImage.Height;
            imageWidth = pdfTiffImage.Width;
        }
        // Prepare temporary file paths; Tesseract appends ".hocr" to the output base name
        string tempImageFile = Path.GetTempFileName();
        string tempHocrFile = tempImageFile + ".hocr";
        string hocrText = null;
        try
        {
            // Save the stream to a temporary image file
            using (var tempFileStream = new FileStream(tempImageFile, FileMode.Create, FileAccess.Write))
            {
                stream.Position = 0;
                stream.CopyTo(tempFileStream);
            }
            // Configure Tesseract process to generate HOCR output
            var startInfo = new ProcessStartInfo
            {
                FileName = "tesseract",
                Arguments = $"\"{tempImageFile}\" \"{tempImageFile}\" -l eng hocr",
                RedirectStandardError = true,
                UseShellExecute = false,
                CreateNoWindow = true
            };
            using (var process = new Process { StartInfo = startInfo })
            {
                process.Start();
                string errorOutput = process.StandardError.ReadToEnd();
                process.WaitForExit();
                // Check for errors in Tesseract execution
                if (process.ExitCode != 0)
                    throw new Exception($"Tesseract process failed with exit code {process.ExitCode}. Error: {errorOutput}");
                // Ensure HOCR output file exists
                if (!File.Exists(tempHocrFile))
                    throw new Exception("HOCR output file not found. Tesseract might have failed or not produced output.");
                hocrText = File.ReadAllText(tempHocrFile);
            }
        }
        finally
        {
            // Clean up temporary files, even when Tesseract fails
            if (File.Exists(tempImageFile)) File.Delete(tempImageFile);
            if (File.Exists(tempHocrFile)) File.Delete(tempHocrFile);
        }
        // Validate HOCR output
        if (string.IsNullOrEmpty(hocrText))
            throw new Exception("HOCR text could not be generated or was empty.");
        // Parse HOCR and build structured OCR result
        var ocrLayoutResult = new OCRLayoutResult();
        BuildOCRLayoutResult(ocrLayoutResult, hocrText, imageWidth, imageHeight);
        ocrLayoutResult.ImageWidth = imageWidth;
        ocrLayoutResult.ImageHeight = imageHeight;
        return ocrLayoutResult;
    }
    // Parses HOCR XML and builds structured OCR layout
    void BuildOCRLayoutResult(OCRLayoutResult ocr, string hOcrText, float imageWidth, float imageHeight)
    {
        var doc = XDocument.Parse(hOcrText, LoadOptions.None);
        // Must be an XNamespace (not a string) so that ns + "div" yields a namespace-qualified XName
        XNamespace ns = "http://www.w3.org/1999/xhtml";
        // Iterate through each page in the HOCR document
        foreach (var pageElement in doc.Descendants(ns + "div").Where(d => d.Attribute("class")?.Value == "ocr_page"))
        {
            Page ocrPage = new Page();
            // Iterate through each line or header in the page
            foreach (var lineElement in pageElement.Descendants(ns + "span")
                .Where(s => s.Attribute("class")?.Value == "ocr_line" || s.Attribute("class")?.Value == "ocr_header"))
            {
                Line ocrLine = new Line();
                // Iterate through each word in the line
                foreach (var wordElement in lineElement.Descendants(ns + "span")
                    .Where(s => s.Attribute("class")?.Value == "ocrx_word"))
                {
                    Word ocrWord = new Word { Text = wordElement.Value };
                    string title = wordElement.Attribute("title")?.Value;
                    // Extract bounding box coordinates (x0 y0 x1 y1) from the title attribute
                    if (title != null)
                    {
                        string bboxString = title.Split(';')[0].Replace("bbox", "").Trim();
                        int[] coords = bboxString.Split(' ', StringSplitOptions.RemoveEmptyEntries).Select(int.Parse).ToArray();
                        if (coords.Length == 4)
                        {
                            float x = coords[0];
                            float y = coords[1];
                            float width = coords[2] - coords[0];
                            float height = coords[3] - coords[1];
                            ocrWord.Rectangle = new RectangleF(x, y, width, height);
                        }
                    }
                    ocrLine.Add(ocrWord);
                }
                ocrPage.Add(ocrLine);
            }
            ocr.Add(ocrPage);
        }
    }
}
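The hOCR parsing step can be tried in isolation, without Tesseract or the Syncfusion types, which helps when checking coordinates. The sketch below is a standalone program with top-level statements that reads each ocrx_word span and converts its "bbox x0 y0 x1 y1" property to x, y, width, and height, as the engine class does. ParseBbox and ReadWords are hypothetical helper names, and the sample markup is a hand-written fragment in the shape of Tesseract's hOCR output rather than real output. Unlike the class above, this version searches for the bbox property instead of assuming it comes first in the title attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hand-written hOCR fragment used only for illustration.
string sample =
    "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
    "<div class=\"ocr_page\" title=\"bbox 0 0 200 100\">" +
    "<span class=\"ocr_line\" title=\"bbox 10 20 110 40\">" +
    "<span class=\"ocrx_word\" title=\"bbox 10 20 50 40; x_wconf 96\">Hello</span> " +
    "<span class=\"ocrx_word\" title=\"bbox 60 20 110 40; x_wconf 93\">world</span>" +
    "</span></div></body></html>";

foreach (var word in ReadWords(sample))
    Console.WriteLine($"{word.Text}: {word.Box}");

// Converts the "bbox x0 y0 x1 y1" property of an hOCR title attribute to
// (x, y, width, height). Returns null when no well-formed bbox is present.
static (int X, int Y, int Width, int Height)? ParseBbox(string title)
{
    if (string.IsNullOrWhiteSpace(title)) return null;
    // Properties are separated by ';', for example "bbox 36 92 96 116; x_wconf 95".
    string bbox = title.Split(';')
        .Select(p => p.Trim())
        .FirstOrDefault(p => p.StartsWith("bbox ", StringComparison.Ordinal));
    if (bbox == null) return null;
    string[] parts = bbox.Substring(5).Split(' ', StringSplitOptions.RemoveEmptyEntries);
    if (parts.Length != 4) return null;
    int[] c = new int[4];
    for (int i = 0; i < 4; i++)
    {
        if (!int.TryParse(parts[i], NumberStyles.Integer, CultureInfo.InvariantCulture, out c[i]))
            return null;
    }
    return (c[0], c[1], c[2] - c[0], c[3] - c[1]);
}

// Returns the text and bounding box of every ocrx_word span in an hOCR document.
static List<(string Text, (int X, int Y, int Width, int Height)? Box)> ReadWords(string hocr)
{
    XNamespace ns = "http://www.w3.org/1999/xhtml";
    return XDocument.Parse(hocr)
        .Descendants(ns + "span")
        .Where(s => (string)s.Attribute("class") == "ocrx_word")
        .Select(s => (Text: s.Value, Box: ParseBbox((string)s.Attribute("title"))))
        .ToList();
}
```

Returning null for a malformed bbox, instead of letting int.Parse throw, lets a caller skip a single bad word without failing the whole page.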
A complete working sample can be downloaded from GitHub.
Running the program produces a searchable PDF document in which the recognized text can be selected.
Take a moment to explore our comprehensive documentation, where you’ll find additional OCR options and code examples for processing images, specific document regions, rotated pages, and large PDF documents.
Conclusion
I hope you found it helpful to learn how to perform text recognition using the latest Tesseract OCR in a Docker-based Linux environment.
You can refer to our .NET PDF feature tour page to learn about the library's other features. You can also explore our documentation to understand how to create and manipulate PDF documents.
Current customers can check out our components from the License and Downloads page. If you are new to Syncfusion®, you can try our 30-day free trial to check out our other controls.
If you have any queries or require clarifications, please let us know in the comments section below. You can also contact us through our support forums or feedback portal. We are always happy to assist you!