How to perform OCR on a PDF document in an Azure environment

Step 1:

Create an Azure website project and reference the following assemblies in it:

Syncfusion.Compression.Base.dll
Syncfusion.Pdf.Base.dll
Syncfusion.OCRProcessor.Base.dll

Step 2:

Add the Tesseract binaries and Tesseract data to the created project in separate folders as embedded resources.

The screenshot below shows the Tesseract binaries and Tesseract data added in separate folders inside the App_Data folder of the project.

Tesseract binaries and data

Once the files are added, rebuild the project. The added files can then be found inside the “bin” folder of the project.
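
The article does not show the project file settings behind this step. As a rough sketch, one common way to make files appear under the “bin” folder after a build is to mark them to be copied to the output directory in the project file. The fragment below assumes the folder names App_Data\Tesseract_Binaries and App_Data\Tesseract_Data used in the code snippet of this article; it is an illustration, not the exact configuration of the original project, and it may differ from the embedded-resource setting mentioned above.

```xml
<!-- Sketch only: folder names are assumed from the paths used in this article. -->
<ItemGroup>
  <!-- Copy the Tesseract binaries to bin\App_Data\Tesseract_Binaries on build -->
  <Content Include="App_Data\Tesseract_Binaries\**\*">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
  <!-- Copy the Tesseract language data to bin\App_Data\Tesseract_Data on build -->
  <Content Include="App_Data\Tesseract_Data\**\*">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
```

The same effect can usually be achieved from the Visual Studio Properties window by setting “Copy to Output Directory” on each file.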

Step 3:

Now reference the Tesseract binaries and Tesseract data from the “bin” folder, as shown in the code snippet below.

C# :

using (OCRProcessor.OCRProcessor processor = new OCRProcessor.OCRProcessor(Server.MapPath("~/bin/App_Data/Tesseract_Binaries/")))
{
    // Load a PDF document
    Stream fileStream = File.OpenRead(Server.MapPath("~/bin/App_Data/input.pdf"));
    PdfLoadedDocument lDoc = new PdfLoadedDocument(fileStream);

    // Set OCR language and perform OCR
    processor.Settings.Language = "eng";
    processor.PerformOCR(lDoc, Server.MapPath("~/bin/App_Data/Tesseract_Data/"));

    // Save and close the document
    lDoc.Save("Output.pdf", this.Response, HttpReadType.Save);
    lDoc.Close(true);
}

VB:

Using processor As New OCRProcessor.OCRProcessor(Server.MapPath("~/bin/App_Data/Tesseract_Binaries/"))
    'Load a PDF document
    Dim fileStream As Stream = File.OpenRead(Server.MapPath("~/bin/App_Data/input.pdf"))
    Dim lDoc As New PdfLoadedDocument(fileStream)

    'Set OCR language and perform OCR
    processor.Settings.Language = "eng"
    processor.PerformOCR(lDoc, Server.MapPath("~/bin/App_Data/Tesseract_Data/"))

    'Save and close the document
    lDoc.Save("Output.pdf", Me.Response, HttpReadType.Save)
    lDoc.Close(True)
End Using

When this project is published to Azure, it references the Tesseract binaries and Tesseract data directly from the “bin” folder, and OCR can be performed with the code snippet above.

Note:

Starting with v16.2.0.x, if you reference Syncfusion® assemblies from a trial setup or from the NuGet feed, you must include a license key in your projects. Refer to the Syncfusion® licensing documentation to learn how to generate and register the license key in your application so that the components can be used without a trial message.

Conclusion

I hope you enjoyed learning how to perform OCR on a PDF document in an Azure environment.
