How to Extract Hindi Text from a Scanned PDF Document
The Syncfusion .NET Optical Character Recognition (OCR) Library extracts text from scanned PDFs and images. Using this library, you can extract Hindi text from an existing PDF document and save the OCR result as text, structured data, or a searchable PDF document. The .NET OCR Library uses the Tesseract OCR engine.
Steps to extract Hindi text from a scanned PDF document
- Create a new C# console application project.
- Install the Syncfusion.PDF.OCR.Net.Core NuGet package as a reference to your .NET console application from NuGet.org.
- Download the language packages (for Hindi, hin.traineddata) from the following link and place them in the tessdata folder of the project.
https://github.com/tesseract-ocr/tessdata
- Include the following namespaces in the Program.cs file.
C#
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;
VB.NET
Imports System.IO
Imports Syncfusion.OCRProcessor
Imports Syncfusion.Pdf.Parsing
- Use the following code example to extract the Hindi text from a scanned PDF document.
C#
//Initialize the OCR processor.
using (OCRProcessor processor = new OCRProcessor())
{
    //Load an existing PDF document.
    FileStream stream = new FileStream("../../../Input.pdf", FileMode.Open);
    PdfLoadedDocument document = new PdfLoadedDocument(stream);
    //Set the OCR language.
    processor.Settings.Language = "hin";
    //Set the Unicode font to preserve the Hindi characters in a PDF document.
    FileStream fileStream = new FileStream("../../../Modak-Regular.ttf", FileMode.Open, FileAccess.Read);
    processor.UnicodeFont = new Syncfusion.Pdf.Graphics.PdfTrueTypeFont(fileStream, 6);
    //Set the page segmentation mode to AutoOsd to detect page rotation automatically.
    processor.Settings.PageSegment = PageSegMode.AutoOsd;
    //Set the tessdata path.
    processor.TessDataPath = "../../../tessdata/";
    //Perform OCR with the input document and tessdata (language packs).
    string ocredText = processor.PerformOCR(document);
    //Create a file stream.
    using (FileStream outputFileStream = new FileStream("../../../Output.pdf", FileMode.Create, FileAccess.ReadWrite))
    {
        //Save the PDF document to the file stream.
        document.Save(outputFileStream);
    }
    //Write the extracted text to a text file.
    File.WriteAllText("../../../OutputResult.txt", ocredText);
    //Close the document.
    document.Close(true);
}
VB.NET
Using processor As New OCRProcessor()
    'Load an existing PDF document.
    Dim stream As New FileStream("../../../Input.pdf", FileMode.Open)
    Dim document As New PdfLoadedDocument(stream)
    'Set the OCR language.
    processor.Settings.Language = "hin"
    'Set the Unicode font to preserve the Hindi characters in a PDF document.
    Dim fileStream As New FileStream("../../../Modak-Regular.ttf", FileMode.Open, FileAccess.Read)
    processor.UnicodeFont = New Syncfusion.Pdf.Graphics.PdfTrueTypeFont(fileStream, 6)
    'Set the AutoOsd page segment to detect auto-page rotation.
    processor.Settings.PageSegment = PageSegMode.AutoOsd
    'Set the Tessdata path.
    processor.TessDataPath = "../../../tessdata/"
    'Perform OCR with input document and tessdata (Language packs).
    Dim ocredText As String = processor.PerformOCR(document)
    'Create a file stream.
    Using outputFileStream As New FileStream("../../../Output.pdf", FileMode.Create, FileAccess.ReadWrite)
        'Save a PDF document to a file stream.
        document.Save(outputFileStream)
    End Using
    File.WriteAllText("../../../OutputResult.txt", ocredText.ToString())
    'Close the document.
    document.Close(True)
End Using
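OCR depends on the Hindi language data being present in the folder assigned to processor.TessDataPath, so it can help to check for it before running the sample. The following is a minimal sketch using only System.IO; the helper name FindMissingLanguageFiles is hypothetical and not part of the Syncfusion API, and it assumes the language pack keeps the file name used in the tessdata repository (hin.traineddata).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not a Syncfusion API): returns the language codes
// whose <code>.traineddata file is not found in the tessdata folder.
static List<string> FindMissingLanguageFiles(string tessDataPath, params string[] languages)
{
    var missing = new List<string>();
    foreach (string language in languages)
    {
        string file = Path.Combine(tessDataPath, language + ".traineddata");
        if (!File.Exists(file))
        {
            missing.Add(language);
        }
    }
    return missing;
}

// Same relative path the sample assigns to processor.TessDataPath.
List<string> missingLanguages = FindMissingLanguageFiles("../../../tessdata/", "hin");
if (missingLanguages.Count > 0)
{
    Console.WriteLine("Missing language data: " + string.Join(", ", missingLanguages));
}
```

Running this check first turns a missing language file into a clear message instead of a failure in the middle of the OCR call.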
A complete working sample can be downloaded from PerformOCR-HindiText-Extract.zip.
Executing the program saves the searchable PDF as Output.pdf and writes the extracted Hindi text to OutputResult.txt.
Take a moment to peruse the documentation, where you will find other options, such as performing OCR on an image, a region of the document, a rotated page, and large PDF documents, along with code examples.
Refer here to explore the rich set of Syncfusion Essential PDF features.
Note: Starting with v16.2.0.x, if you reference Syncfusion assemblies from the trial setup or NuGet feed, include a license key in your projects. Refer to this link to learn about generating and registering the Syncfusion license key in your application to use the components without a trial message.
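As a rough illustration of the registration step mentioned in the note, the key is typically registered once at application startup, before any Syncfusion component is used. This sketch assumes the Syncfusion.Licensing NuGet package is referenced; the key string is a placeholder to be replaced with the key generated for your account.

```csharp
// Register the license key before creating the OCRProcessor.
// "YOUR LICENSE KEY" is a placeholder, not a working key.
Syncfusion.Licensing.SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");
```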