How to perform Optical character recognition (OCR) in ASP.NET Core PDF?
The Syncfusion® .NET Optical Character Recognition (OCR) Library is used to extract text from scanned PDFs and images. With a few lines of C# code, a scanned PDF document containing a raster image is converted into a searchable and selectable PDF document. Save the OCR result as text, structured data, or searchable PDF documents. The .NET OCR Library uses a powerful Tesseract OCR engine.
Using this library, we can perform OCR on scanned PDF documents using C# and VB.NET.
Steps to perform OCR on scanned PDF programmatically
- Create a new C# ASP.NET Core Web application project.
- Install the Syncfusion.PDF.OCR.Net.Core NuGet package from NuGet.org as a reference in your ASP.NET Core application.
- A default controller named HomeController.cs is added when you create the ASP.NET Core MVC project. Include the following namespaces in that HomeController.cs file.
C#
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
VB.NET
Imports System.IO
Imports Syncfusion.OCRProcessor
Imports Syncfusion.Pdf.Graphics
Imports Syncfusion.Pdf.Parsing
- Add a new button in index.cshtml as follows.
@{
Html.BeginForm("PerformOCR", "Home", FormMethod.Get);
{
<div>
<input type="submit" value="Perform OCR" style="width:150px;height:27px">
</div>
}
Html.EndForm();
}
- Add a new action method named PerformOCR in the HomeController.cs and use the following code sample to perform OCR in the ASP.NET Core application.
C#
//Load an existing PDF document.
FileStream docStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read);
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
//Initialize the OCR processor.
using (OCRProcessor processor = new OCRProcessor())
{
//Language to process the OCR.
processor.Settings.Language = Languages.English;
FileStream fontStream = new FileStream("ARIALUNI.ttf", FileMode.Open, FileAccess.Read);
processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);
//Perform OCR on the loaded PDF document.
processor.PerformOCR(loadedDocument);
}
//Save a PDF to the MemoryStream.
MemoryStream stream = new MemoryStream();
loadedDocument.Save(stream);
//Close a PDF document.
loadedDocument.Close(true);
//Reset the stream position to 0 so the response is read from the start.
stream.Position = 0;
//Download a PDF document in the browser.
FileStreamResult fileStreamResult = new FileStreamResult(stream, "application/pdf");
fileStreamResult.FileDownloadName = "OCR.pdf";
return fileStreamResult;
VB.NET
'Load an existing PDF document.
Dim docStream As FileStream = New FileStream("Input.pdf", FileMode.Open, FileAccess.Read)
Dim loadedDocument As PdfLoadedDocument = New PdfLoadedDocument(docStream)
'Initialize the OCR processor.
Using processor As OCRProcessor = New OCRProcessor()
'Language to process the OCR.
processor.Settings.Language = Languages.English
Dim fontStream As FileStream = New FileStream("ARIALUNI.ttf", FileMode.Open, FileAccess.Read)
processor.UnicodeFont = New PdfTrueTypeFont(fontStream, 8)
'Perform OCR on the loaded PDF document.
processor.PerformOCR(loadedDocument)
End Using
'Save a PDF to the MemoryStream.
Dim stream As MemoryStream = New MemoryStream()
loadedDocument.Save(stream)
'Close a PDF document.
loadedDocument.Close(True)
'Reset the stream position to 0 so the response is read from the start.
stream.Position = 0
'Download a PDF document in the browser.
Dim fileStreamResult As FileStreamResult = New FileStreamResult(stream, "application/pdf")
fileStreamResult.FileDownloadName = "OCR.pdf"
Return fileStreamResult
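The fragments above show only the body of the action method. As a minimal sketch, the same calls can be placed in a complete controller as follows. This assumes the default MVC project layout and that Input.pdf and ARIALUNI.ttf sit in the application's working directory; the `using` blocks around the two input streams are an addition for cleanup and are not part of the original sample.

```csharp
using System.IO;
using Microsoft.AspNetCore.Mvc;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;

public class HomeController : Controller
{
    public IActionResult Index() => View();

    // Handles the GET request sent by the "Perform OCR" button in index.cshtml.
    public IActionResult PerformOCR()
    {
        // Not disposed here: FileStreamResult disposes the stream
        // after the response has been written.
        MemoryStream output = new MemoryStream();

        // Assumption: both files are resolved relative to the working directory.
        using (FileStream docStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        using (FileStream fontStream = new FileStream("ARIALUNI.ttf", FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);

            using (OCRProcessor processor = new OCRProcessor())
            {
                processor.Settings.Language = Languages.English;
                processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);
                processor.PerformOCR(loadedDocument);
            }

            // Save while the input streams are still open, then close the document.
            loadedDocument.Save(output);
            loadedDocument.Close(true);
        }

        // Rewind so the download starts at the first byte.
        output.Position = 0;
        return new FileStreamResult(output, "application/pdf")
        {
            FileDownloadName = "OCR.pdf"
        };
    }
}
```

Saving inside the `using` block keeps the source PDF and font streams open until the document has been written, which avoids depending on when the library actually reads them.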
A complete working sample is available for download as OCRSample.zip.
By executing the program, you will get a PDF document as follows.
Take a moment to peruse the documentation, where you will find other options like performing OCR on an image, region of the document, rotated page, and large PDF documents with code examples.
Refer here to explore the rich set of Syncfusion Essential® PDF features.
Note: Starting with v16.2.0.x, if you reference Syncfusion® assemblies from the trial setup or NuGet feed, include a license key in your projects. Refer to this link to learn about generating and registering the Syncfusion® license key in your application to use the components without a trial message.
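As a rough sketch of the registration step mentioned in the note, the license key is typically registered once at application startup, before any Syncfusion® component is used. The example below assumes a .NET 6 or later project with a minimal-hosting Program.cs and a reference to the Syncfusion.Licensing package; "YOUR LICENSE KEY" is a placeholder, and older projects would make the same call from their Startup class instead.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Syncfusion.Licensing;

// Placeholder value: replace with the key generated for your account.
SyncfusionLicenseProvider.RegisterLicense("YOUR LICENSE KEY");

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllersWithViews();

var app = builder.Build();
app.UseStaticFiles();
app.UseRouting();

// Default MVC route, so /Home/PerformOCR reaches the action method.
app.MapControllerRoute(
    name: "default",
    pattern: "{controller=Home}/{action=Index}/{id?}");

app.Run();
```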
Conclusion
I hope you enjoyed learning about how to perform Optical character recognition (OCR) in ASP.NET Core PDF.
You can refer to our ASP.NET Core PDF feature tour page to learn about its other features, and to the documentation to get started quickly with configuration. You can also explore our ASP.NET Core PDF example to understand how to create and manipulate data. If you are a current customer, you can check out our components from the License and Downloads page. If you are new to Syncfusion®, you can try our 30-day free trial to check out our other controls.
If you have any queries or require clarifications, please let us know in the comments section below. You can also contact us through our support forums, Direct-Trac, or feedback portal. We are always happy to assist you!