How to Perform OCR for a PDF Document in Azure Functions
The Syncfusion PDF library is a .NET Core PDF library that supports OCR through the open-source Tesseract engine. Using this library, you can perform OCR on a PDF document in Azure Functions with .NET Core.
Steps to perform OCR on the entire PDF document in Azure Functions
Step 1: Create an Azure Functions project.
Step 2: Select the Azure Functions framework version and choose HTTP trigger as the function template.
Step 3: Install the Syncfusion.PDF.OCR.NET NuGet package from NuGet.org as a reference to your .NET Core application.
Step 4: Copy the tessdata folder from bin/Debug/net6.0/runtimes and paste it into the folder that contains the project file.
Step 5: Then, set the Copy to Output Directory property of the files in the tessdata folder to Copy always.
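The copy setting from Step 5 can also be expressed directly in the project file. The fragment below is a minimal sketch, assuming an SDK-style .csproj in which the tessdata folder sits next to the project file. The Data\Input.pdf entry is an assumption based on the path that the code in Step 7 opens; adjust or remove it to match your project.

```xml
<!-- Sketch only: add inside the <Project> element of the function app's .csproj. -->
<ItemGroup>
  <!-- Copy every file under tessdata to the build output on each build. -->
  <None Update="tessdata\**">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </None>
  <!-- Assumed location of the input file read by the function in Step 7. -->
  <None Update="Data\Input.pdf">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </None>
</ItemGroup>
```

Setting the property in the Visual Studio Properties window typically writes an equivalent entry into the project file, so either approach should give the same result.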
Step 6: Include the following namespaces in the Function1.cs file to perform OCR for a PDF document using C#.
using System;
using System.IO;
using System.Threading.Tasks;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf;
using System.Net.Http;
using Syncfusion.Pdf.Parsing;
using System.Net.Http.Headers;
using System.Net;
using Microsoft.Azure.WebJobs.Host;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
Step 7: Add the following code sample in the Function1 class to perform OCR for a PDF document using the PerformOCR method of the OCRProcessor class in Azure Functions.
[FunctionName("Function1")]
public static async Task<HttpResponseMessage> Run([HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequestMessage req, TraceWriter log, ExecutionContext executionContext)
{
    MemoryStream ms = new MemoryStream();
    try
    {
        //Initialize the OCR processor.
        using (OCRProcessor processor = new OCRProcessor())
        //Open the input PDF as read-only, because the deployed app folder may not be writable.
        using (FileStream stream = new FileStream(Path.Combine(executionContext.FunctionAppDirectory, "Data", "Input.pdf"), FileMode.Open, FileAccess.Read))
        {
            //Load a PDF document.
            PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
            //Set OCR language to process.
            processor.Settings.Language = Languages.English;
            //Perform OCR with input document.
            string ocr = processor.PerformOCR(lDoc, Path.Combine(executionContext.FunctionAppDirectory, "tessdata"));
            //Save a PDF document.
            lDoc.Save(ms);
            //Close the document and release its resources.
            lDoc.Close(true);
            ms.Position = 0;
        }
    }
    catch (Exception ex)
    {
        //Create a new PDF document to report the error.
        PdfDocument document = new PdfDocument();
        //Add a page to the document.
        PdfPage page = document.Pages.Add();
        //Create PDF graphics for the page.
        PdfGraphics graphics = page.Graphics;
        //Set the standard font.
        PdfFont font = new PdfStandardFont(PdfFontFamily.Helvetica, 6);
        //Draw the exception details as text.
        graphics.DrawString(ex.ToString(), font, PdfBrushes.Black, new Syncfusion.Drawing.PointF(0, 0));
        //Discard any partial output before saving the error document.
        ms = new MemoryStream();
        //Save a PDF document.
        document.Save(ms);
        //Close the document and release its resources.
        document.Close(true);
    }
    //Return the PDF as a downloadable attachment.
    HttpResponseMessage response = new HttpResponseMessage(HttpStatusCode.OK);
    response.Content = new ByteArrayContent(ms.ToArray());
    response.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment")
    {
        FileName = "Output.pdf"
    };
    response.Content.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
    return response;
}
Step 8: Now, run the function on the local machine and check the OCR output.
Steps to publish as Azure Functions
Step 1: Right-click the project and click Publish. Then, create a new profile in the Publish Window and create the Azure Function App with a consumption plan.
Step 2: After creating the profile, click Publish.
Step 3: The publish operation has now succeeded.
Step 4: Now, go to the Azure portal and select Function Apps. After running the service, click Get function URL > Copy. Because the function uses function-level authorization, the copied URL carries the function key as a query string. Paste the URL into a new browser tab. You will get the OCR-processed PDF document as the response.
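As an alternative to pasting the URL into a browser, the published function can be called from code. The following is a minimal sketch of a console client, not part of the original sample: the class name OcrFunctionClient and the helper LooksLikePdf are hypothetical, and the function URL is expected as the first command-line argument (the value copied from Get function URL).

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical console client for illustration; not part of the Syncfusion sample.
public static class OcrFunctionClient
{
    // Returns true when the payload starts with the "%PDF-" file signature.
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] signature = { 0x25, 0x50, 0x44, 0x46, 0x2D }; // "%PDF-"
        if (data == null || data.Length < signature.Length)
        {
            return false;
        }
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i])
            {
                return false;
            }
        }
        return true;
    }

    public static async Task Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.Error.WriteLine("Usage: OcrFunctionClient <function-url-copied-from-the-portal>");
            return;
        }

        // The URL copied from the portal, including its query string.
        string functionUrl = args[0];

        using HttpClient client = new HttpClient();
        using HttpResponseMessage response = await client.GetAsync(functionUrl);
        // Throws if the service answered with a non-success status code.
        response.EnsureSuccessStatusCode();

        byte[] pdfBytes = await response.Content.ReadAsByteArrayAsync();
        if (!LooksLikePdf(pdfBytes))
        {
            Console.Error.WriteLine("The response does not look like a PDF document.");
            return;
        }

        // Save the downloaded document next to the executable.
        await File.WriteAllBytesAsync("Output.pdf", pdfBytes);
        Console.WriteLine("Saved Output.pdf (" + pdfBytes.Length + " bytes).");
    }
}
```

Note that the function in Step 7 returns a PDF with status 200 even when OCR fails, with the exception text drawn on the page, so receiving a valid PDF does not by itself confirm that OCR succeeded. Open the saved file to verify the result.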
A complete working sample can be downloaded from GitHub.
To learn more, explore the rich feature set of the Syncfusion PDF library.