Extract Text from a Specific Region of a PDF Page
The Syncfusion® .NET Optical Character Recognition (OCR) Library lets you extract text from a specific region of a PDF page by using the OCR Processor.
The code uses the PdfToImageConverter class to convert each page of the PDF document into an image stream. Each image is then drawn onto a page of a new PDF document, and that document is loaded as a PdfLoadedDocument for OCR processing. This article extracts the text that is highlighted with Highlight annotations in the PDF document.
Steps to extract the text from a specific region of the PDF page using OCR Processor
- Create a new Console application project.
- Install the Syncfusion.PdfToImageConverter.Net and Syncfusion.PDF.OCR.Net.Core NuGet packages from NuGet.org.
Download the language packages from the following link.
https://github.com/tesseract-ocr/tessdata_fast
- Include the following namespaces in the Program.cs file.
C#
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
using Syncfusion.Pdf;
using Syncfusion.PdfToImageConverter;
using Syncfusion.Drawing;
using Syncfusion.Pdf.Interactive;
VB.NET
Imports Syncfusion.OCRProcessor
Imports Syncfusion.Pdf.Graphics
Imports Syncfusion.Pdf.Parsing
Imports Syncfusion.Pdf
Imports Syncfusion.PdfToImageConverter
Imports Syncfusion.Drawing
Imports Syncfusion.Pdf.Interactive
- Use the following code in Program.cs to get the bounds of the highlighted regions of the PDF page.
C#
// Create a list to store the bounds values of annotations.
List<RectangleF> annotsBoundsList = new List<RectangleF>();
// Load the PDF document as a stream.
FileStream inputStream = new FileStream(@"../../../Input.pdf", FileMode.Open, FileAccess.ReadWrite);
PdfLoadedDocument pdfLoadedDocument = new PdfLoadedDocument(inputStream);
// Iterate through each page in the PDF document.
foreach (PdfLoadedPage page in pdfLoadedDocument.Pages)
{
    // Iterate through annotations on the page.
    foreach (PdfAnnotation annotation in page.Annotations)
    {
        // Check if the annotation is a PdfLoadedTextMarkupAnnotation.
        // This matches every text markup annotation (highlight, underline, strikeout, squiggly).
        if (annotation is PdfLoadedTextMarkupAnnotation)
        {
            // Get the bounding rectangle of the annotation.
            RectangleF annotBounds = annotation.Bounds;
            // Convert the bounding rectangle from points to pixels.
            PdfUnitConvertor converter = new PdfUnitConvertor();
            RectangleF rect = converter.ConvertToPixels(new RectangleF(annotBounds.X, annotBounds.Y, annotBounds.Width, annotBounds.Height), PdfGraphicsUnit.Point);
            // Add the converted bounding rectangle to the list.
            annotsBoundsList.Add(rect);
        }
    }
    // Clear annotations from the page.
    page.Annotations.Clear();
}
// Save the modified PDF document to a memory stream.
MemoryStream inputDocumentStream = new MemoryStream();
pdfLoadedDocument.Save(inputDocumentStream);
// Rewind the stream so that the next reader starts at the beginning.
inputDocumentStream.Position = 0;
// Close the document.
pdfLoadedDocument.Close(true);
VB.NET
' Create a list to store the bounds values of annotations.
Dim annotsBoundsList As New List(Of RectangleF)()
' Load the PDF document as a stream.
Dim inputStream As New FileStream("../../../Input.pdf", FileMode.Open, FileAccess.ReadWrite)
Dim pdfLoadedDocument As New PdfLoadedDocument(inputStream)
' Iterate through each page in the PDF document.
For Each page As PdfLoadedPage In pdfLoadedDocument.Pages
    ' Iterate through annotations on the page.
    For Each annotation As PdfAnnotation In page.Annotations
        ' Check if the annotation is a PdfLoadedTextMarkupAnnotation.
        ' This matches every text markup annotation (highlight, underline, strikeout, squiggly).
        If TypeOf annotation Is PdfLoadedTextMarkupAnnotation Then
            ' Get the bounding rectangle of the annotation.
            Dim annotBounds As RectangleF = annotation.Bounds
            ' Convert the bounding rectangle from points to pixels.
            Dim converter As New PdfUnitConvertor()
            Dim rect As RectangleF = converter.ConvertToPixels(New RectangleF(annotBounds.X, annotBounds.Y, annotBounds.Width, annotBounds.Height), PdfGraphicsUnit.Point)
            ' Add the converted bounding rectangle to the list.
            annotsBoundsList.Add(rect)
        End If
    Next
    ' Clear annotations from the page.
    page.Annotations.Clear()
Next
' Save the modified PDF document to a memory stream.
Dim inputDocumentStream As New MemoryStream()
pdfLoadedDocument.Save(inputDocumentStream)
' Rewind the stream so that the next reader starts at the beginning.
inputDocumentStream.Position = 0
' Close the document.
pdfLoadedDocument.Close(True)
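The ConvertToPixels call above turns annotation bounds measured in PDF points (1/72 inch) into pixel values. The arithmetic behind such a conversion can be sketched as follows. This is a minimal illustration, not the library's implementation: PointsToPixels is a hypothetical helper, rectangles are plain tuples instead of RectangleF, and the 96 DPI default is an assumption that should be checked against the PdfUnitConvertor documentation.

```csharp
using System;

// Hypothetical helper (not part of the Syncfusion API): scales a rectangle
// given in PDF points (1/72 inch) to pixels at the given resolution.
// The 96 DPI default is an assumption made for this sketch.
static (float X, float Y, float Width, float Height) PointsToPixels(
    (float X, float Y, float Width, float Height) rect, float dpi = 96f)
{
    // pixels = points * dpi / 72
    return (rect.X * dpi / 72f,
            rect.Y * dpi / 72f,
            rect.Width * dpi / 72f,
            rect.Height * dpi / 72f);
}

// Example highlight: 3 inches wide, a quarter inch tall,
// placed 1 inch from the left edge and 2 inches from the top.
var highlight = (X: 72f, Y: 144f, Width: 216f, Height: 18f);
var pixels = PointsToPixels(highlight);
Console.WriteLine(pixels);
```

At 96 DPI every value is multiplied by 96/72, so the 216 x 18 point highlight at (72, 144) becomes a 288 x 24 pixel region at (96, 192).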
- Use the following code to convert each PDF page to an image and save the images in a new PDF document.
C#
// Create an instance of PdfToImageConverter for converting PDF files to images.
PdfToImageConverter imageConverter = new PdfToImageConverter();
// Load the modified PDF document into the image converter.
imageConverter.Load(inputDocumentStream);
// Create a new PDF document to store the converted images.
PdfDocument doc = new PdfDocument();
// Iterate through each page of the input PDF, convert it to an image, and add it to the new document.
for (int i = 0; i < imageConverter.PageCount; i++)
{
    // Convert the PDF page to an image.
    Stream outputStream = imageConverter.Convert(i, false, false);
    // Create a PdfBitmap from the converted image stream.
    PdfBitmap pdfImage = new PdfBitmap(outputStream);
    // Create a new PdfSection.
    PdfSection section = doc.Sections.Add();
    // Remove the margins so that the image fills the whole page.
    section.PageSettings.Margins.All = 0;
    // Set the page size to the size of the image.
    section.PageSettings.Size = new SizeF(pdfImage.PhysicalDimension.Width, pdfImage.PhysicalDimension.Height);
    // Add a new page to the section.
    PdfPage page = section.Pages.Add();
    // Obtain the graphics context for the current PDF page.
    PdfGraphics graphics = page.Graphics;
    // Draw the converted image onto the PDF page.
    graphics.DrawImage(pdfImage, 0, 0, page.Size.Width, page.Size.Height);
}
// Save the new document with converted images to a memory stream.
MemoryStream file = new MemoryStream();
doc.Save(file);
// Rewind the stream so that the next reader starts at the beginning.
file.Position = 0;
// Close the document.
doc.Close(true);
VB.NET
' Create an instance of PdfToImageConverter for converting PDF files to images.
Dim imageConverter As New PdfToImageConverter()
' Load the modified PDF document into the image converter.
imageConverter.Load(inputDocumentStream)
' Create a new PDF document to store the converted images.
Dim doc As New PdfDocument()
' Iterate through each page of the input PDF, convert it to an image, and add it to the new document.
For i As Integer = 0 To imageConverter.PageCount - 1
    ' Convert the PDF page to an image.
    Dim outputStream As Stream = imageConverter.Convert(i, False, False)
    ' Create a PdfBitmap from the converted image stream.
    Dim pdfImage As New PdfBitmap(outputStream)
    ' Create a new PdfSection.
    Dim section As PdfSection = doc.Sections.Add()
    ' Remove the margins so that the image fills the whole page.
    section.PageSettings.Margins.All = 0
    ' Set the page size to the size of the image.
    section.PageSettings.Size = New SizeF(pdfImage.PhysicalDimension.Width, pdfImage.PhysicalDimension.Height)
    ' Add a new page to the section.
    Dim page As PdfPage = section.Pages.Add()
    ' Obtain the graphics context for the current PDF page.
    Dim graphics As PdfGraphics = page.Graphics
    ' Draw the converted image onto the PDF page.
    graphics.DrawImage(pdfImage, 0, 0, page.Size.Width, page.Size.Height)
Next
' Save the new document with converted images to a memory stream.
Dim file As New MemoryStream()
doc.Save(file)
' Rewind the stream so that the next reader starts at the beginning.
file.Position = 0
' Close the document.
doc.Close(True)
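The code above sizes each new page to its converted image and draws the image at the origin with zero margins, so the image fills the page exactly. A region therefore occupies the same relative position on the new page as it did on the source page. If the regions passed to OCR ever look offset or mis-scaled, one possible cause (an assumption, not something this article states) is that the image-based page has a different size than the unit conversion assumed. The sketch below shows how a rectangle could be rescaled by the ratio of the two page sizes. RescaleRegion is a hypothetical helper, the tuples stand in for RectangleF and SizeF, and the page sizes are example values.

```csharp
using System;

// Hypothetical helper (not part of the Syncfusion API): maps a rectangle from
// the source page's coordinate space to the image-based page's coordinate space.
static (float X, float Y, float Width, float Height) RescaleRegion(
    (float X, float Y, float Width, float Height) rect,
    (float Width, float Height) sourcePage,
    (float Width, float Height) targetPage)
{
    // Scale each axis by the ratio of the target page size to the source page size.
    float scaleX = targetPage.Width / sourcePage.Width;
    float scaleY = targetPage.Height / sourcePage.Height;
    return (rect.X * scaleX, rect.Y * scaleY, rect.Width * scaleX, rect.Height * scaleY);
}

// Assumed example values: a US Letter source page (612 x 792 points)
// whose image-based replacement page measures 816 x 1056.
var region = RescaleRegion((72f, 144f, 216f, 18f), (612f, 792f), (816f, 1056f));
Console.WriteLine(region);
```

When both pages have the same size, the scale factors are 1 and the rectangle is returned unchanged.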
- Use the following code to perform OCR on the highlighted regions and extract the text.
C#
// Initialize OCRProcessor.
using (OCRProcessor processor = new OCRProcessor())
{
    // Load the new document with converted images for OCR processing.
    PdfLoadedDocument document = new PdfLoadedDocument(file);
    // Set OCR language.
    processor.Settings.Language = Languages.English;
    // Set TessDataPath (the folder that contains the downloaded language data).
    processor.TessDataPath = @"../../../tessdata-fast/";
    for (int i = 0; i < annotsBoundsList.Count; i++)
    {
        List<PageRegion> pageRegions = new List<PageRegion>();
        // Create page region.
        PageRegion region = new PageRegion();
        // Set page index. This sample assumes that every highlight is on the first page.
        region.PageIndex = 0;
        // Set the rectangle to recognize.
        region.PageRegions = new RectangleF[] { annotsBoundsList[i] };
        // Add the region to the list of page regions.
        pageRegions.Add(region);
        // Set page regions.
        processor.Settings.Regions = pageRegions;
        // Perform OCR on the input document by using the tessdata language packs.
        string text = processor.PerformOCR(document);
        // Print the text recognized in this region.
        Console.WriteLine(text);
    }
    // Create file stream for the output PDF document after OCR processing.
    using (FileStream outputFileStream = new FileStream(Path.GetFullPath(@"../../../Output.pdf"), FileMode.Create, FileAccess.ReadWrite))
    {
        // Save the PDF document with OCR-recognized text to the file stream.
        document.Save(outputFileStream);
    }
    // Close the document.
    document.Close(true);
}
VB.NET
' Initialize OCRProcessor.
Using processor As New OCRProcessor()
    ' Load the new document with converted images for OCR processing.
    Dim document As New PdfLoadedDocument(file)
    ' Set OCR language.
    processor.Settings.Language = Languages.English
    ' Set TessDataPath (the folder that contains the downloaded language data).
    processor.TessDataPath = "../../../tessdata-fast/"
    For i As Integer = 0 To annotsBoundsList.Count - 1
        Dim pageRegions As New List(Of PageRegion)()
        ' Create page region.
        Dim region As New PageRegion()
        ' Set page index. This sample assumes that every highlight is on the first page.
        region.PageIndex = 0
        ' Set the rectangle to recognize.
        region.PageRegions = New RectangleF() {annotsBoundsList(i)}
        ' Add the region to the list of page regions.
        pageRegions.Add(region)
        ' Set page regions.
        processor.Settings.Regions = pageRegions
        ' Perform OCR on the input document by using the tessdata language packs.
        Dim text As String = processor.PerformOCR(document)
        ' Print the text recognized in this region.
        Console.WriteLine(text)
    Next
    ' Create file stream for the output PDF document after OCR processing.
    Using outputFileStream As New FileStream(Path.GetFullPath("../../../Output.pdf"), FileMode.Create, FileAccess.ReadWrite)
        ' Save the PDF document with OCR-recognized text to the file stream.
        document.Save(outputFileStream)
    End Using
    ' Close the document.
    document.Close(True)
End Using
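The OCR loop above sets the page index of every region to 0, which fits documents whose highlights all sit on the first page. For highlights spread over several pages, each rectangle would need to be stored together with the index of the page it came from. The sketch below shows one way to group rectangles by page so that each group could feed one PageRegion (its PageRegions property takes an array of rectangles). GroupByPage is a hypothetical helper, the tuples stand in for RectangleF, and the sample values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the Syncfusion API): groups rectangles
// by the zero-based index of the page they were found on.
static SortedDictionary<int, List<(float X, float Y, float Width, float Height)>> GroupByPage(
    IEnumerable<(int PageIndex, (float X, float Y, float Width, float Height) Bounds)> items)
{
    var byPage = new SortedDictionary<int, List<(float X, float Y, float Width, float Height)>>();
    foreach (var (pageIndex, bounds) in items)
    {
        // Create the list for this page the first time the page is seen.
        if (!byPage.TryGetValue(pageIndex, out var list))
        {
            list = new List<(float X, float Y, float Width, float Height)>();
            byPage[pageIndex] = list;
        }
        list.Add(bounds);
    }
    return byPage;
}

// Made-up sample data: two highlights on page 0 and one on page 2.
var highlights = new List<(int PageIndex, (float X, float Y, float Width, float Height) Bounds)>
{
    (0, (96f, 192f, 288f, 24f)),
    (2, (48f, 60f, 120f, 20f)),
    (0, (96f, 400f, 200f, 24f)),
};

var regionsByPage = GroupByPage(highlights);
foreach (var entry in regionsByPage)
{
    Console.WriteLine($"Page {entry.Key}: {entry.Value.Count} region(s)");
}
```

In the annotation loop, the page index could be tracked with a counter that is incremented once per page, and each dictionary entry would then supply the PageIndex and the rectangle array for one PageRegion.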
A complete working sample can be downloaded from Extract_text_from_Image_free_PDF.zip.
By executing the program, you will get the output PDF document (Output.pdf) with the OCR-recognized text.
Take a moment to peruse the documentation, where you will find other options like performing OCR on an image, region of the document, rotated page, and large PDF documents with code examples.
Refer here to explore the rich set of Syncfusion Essential® PDF features.