How to perform OCR on a PDF document using C# and deploy to Azure App Service on Linux?
The Syncfusion® .NET OCR library is used to extract text from scanned PDFs and images in Azure with the help of Google’s Tesseract Optical Character Recognition engine. You can perform OCR on PDF documents in Azure App Service on Linux.
Steps to perform OCR on PDF documents in Azure App Service on Linux
1. Create a new ASP.NET Core MVC application.
2. In the configuration window, name your project and click Next.
3. Install the Syncfusion.PDF.OCR.NET NuGet package as a reference to your .NET Core application from NuGet.org.
4. There are two ways to install the dependency packages on the Azure server:
(i) Using SSH from the Azure portal.
(ii) By running the commands from C#.
4.1 Using the SSH command line
1. After publishing the web application, log in to the Azure portal in a browser and open the published app service.
2. Under the Development Tools menu, open SSH and click the Go link.
3. In the terminal window, install the dependency packages by running the following commands.
sudo apt-get update
sudo apt-get install -y libgdiplus
sudo apt-get install -y libc6-dev
4.2 Running the commands from C#
1. Create a shell file in the project that contains the above commands and name it dependenciesInstall.sh. The sample in this article uses this approach to install the dependency packages.
2. Set Copy to Output Directory to Copy if newer for the dependenciesInstall.sh file.
3. Add the following code to the Configure method in the Startup.cs file to install the dependencies.
//Install the dependency packages for PDF OCR conversion in Linux
string shellFilePath = System.IO.Path.Combine(env.ContentRootPath, "dependenciesInstall.sh");
InstallDependencies(shellFilePath);
C#
//Requires: using System.Diagnostics;
private void InstallDependencies(string shellFilePath)
{
    //Run the script with bash and wait until the installation completes
    using (Process process = new Process
    {
        StartInfo = new ProcessStartInfo
        {
            FileName = "/bin/bash",
            //Pass the quoted script path so bash reads the file directly
            Arguments = "\"" + shellFilePath + "\"",
            CreateNoWindow = true,
            UseShellExecute = false,
        }
    })
    {
        process.Start();
        process.WaitForExit();
    }
}
4. Add a Perform OCR button in index.cshtml.
@{ Html.BeginForm("ExportToPDF", "Home", FormMethod.Post);
    {
        <input type="submit" value="Perform OCR" class="btn" />
    }
    Html.EndForm();
}
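The dependency installation method in section 4.2 discards the script's output, so a failed installation (for example, when apt-get cannot reach a package mirror) goes unnoticed at startup. The following is a minimal sketch, not part of the original sample, that captures the exit code and output so they can be logged. The class name DependencyInstaller and the method name RunScript are hypothetical; only standard System.Diagnostics APIs are used.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper (not part of the original sample): runs a shell script
// with bash and returns its exit code together with the captured output.
public static class DependencyInstaller
{
    public static (int ExitCode, string Output, string Error) RunScript(string shellFilePath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "/bin/bash",
            CreateNoWindow = true,
            UseShellExecute = false,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        };
        // ArgumentList passes the path as a single argument, even if it contains spaces.
        startInfo.ArgumentList.Add(shellFilePath);

        using (Process process = Process.Start(startInfo))
        {
            // Read stderr asynchronously while stdout is read synchronously,
            // so neither pipe can fill up and block the child process.
            Task<string> errorTask = process.StandardError.ReadToEndAsync();
            string output = process.StandardOutput.ReadToEnd();
            string error = errorTask.Result;
            process.WaitForExit();
            return (process.ExitCode, output, error);
        }
    }
}
```

In the Configure method, the result could then be checked, for instance by writing result.Error to the console when result.ExitCode is not 0.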
5. Include the following namespaces and code samples in the controller to convert a scanned PDF to a searchable PDF.
C#
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;
using System.IO;
C#
//To perform the OCR operation
public IActionResult ExportToPDF()
{
    Environment.SetEnvironmentVariable("ASPNETCORE_ENVIRONMENT", "Development");
    MemoryStream stream = new MemoryStream();
    string OCRText = string.Empty;
    string path = System.IO.Path.GetFullPath(Path.Combine("Data", "Input.pdf"));
    //Initialize the OCR processor
    using (OCRProcessor processor = new OCRProcessor())
    //Open the input PDF for reading
    using (FileStream fileStream = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        //Load the PDF document
        PdfLoadedDocument document = new PdfLoadedDocument(fileStream);
        //Set the OCR language
        processor.Settings.Language = Languages.English;
        //Perform OCR on the input document
        OCRText = processor.PerformOCR(document);
        //Save the document into the stream
        document.Save(stream);
        //Close the document and release its resources
        document.Close(true);
    }
    //Reset the stream position before returning the file
    stream.Position = 0;
    return File(stream.ToArray(), System.Net.Mime.MediaTypeNames.Application.Pdf, "Sample.pdf");
}
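The controller code resolves Data/Input.pdf with Path.GetFullPath, which is relative to the process's current working directory. On a hosted server that directory may not be the application folder. The following is a minimal sketch, not part of the original sample, that resolves the file against a known root folder and fails early if it is missing. The class name InputFileLocator and the method name Resolve are hypothetical.

```csharp
using System.IO;

// Hypothetical helper (not part of the original sample): builds an absolute path
// below a known root folder and throws a clear error if the file does not exist.
public static class InputFileLocator
{
    public static string Resolve(string rootPath, params string[] relativeSegments)
    {
        // Join the segments first, then anchor them at the root folder.
        string relativePath = Path.Combine(relativeSegments);
        string fullPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
        if (!File.Exists(fullPath))
        {
            throw new FileNotFoundException("Input PDF not found.", fullPath);
        }
        return fullPath;
    }
}
```

Assuming the hosting environment is injected into the controller, the call could look like InputFileLocator.Resolve(env.ContentRootPath, "Data", "Input.pdf").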
Follow these steps to publish the application to Azure App Service on Linux:
6. Right-click the project and select Publish.
7. Create a new profile in the Publish target window.
8. Create an App Service using an Azure subscription in your portal and select the app service.
9. After creating the profile, click Publish.
10. The published web page opens in the browser. Click Perform OCR to convert the scanned PDF to a searchable PDF.
A complete working sample for performing OCR on a PDF document in Azure App Service on Linux can be downloaded from GitHub.
Take a moment to peruse the documentation, which covers other options such as performing OCR on Windows and Mac, on images, on a region of the document, and on Unicode characters.
Click here to explore the rich set of Syncfusion Essential® PDF features.
Note: Starting with v16.2.0.x, if you reference Syncfusion® assemblies from the trial setup or the NuGet feed, include a license key in your projects. Refer to this link to learn about generating and registering the Syncfusion® license key in your application to use the components without a trial message.
See Also
Convert scanned image to searchable PDF by OCR Processor in WF platform
Conclusion
I hope you enjoyed learning how to perform OCR on a PDF document using C# and deploy it to Azure App Service on Linux.
You can refer to our ASP.NET Core PDF feature tour page to learn about its other features. You can also explore our ASP.NET Core PDF example to understand how to present and manipulate data.
Current customers can check out our ASP.NET Core components from the License and Downloads page. If you are new to Syncfusion®, you can try our 30-day free trial to check out our ASP.NET Core PDF and other ASP.NET Core components.
If you have any queries or require clarification, please let us know in the comments below. You can also contact us through our Support forums, Direct-Trac, or Feedback portal. We are always happy to assist you!