How to get the bounds of words by extracting text using PDF Viewer server library

Extract text using PDF Viewer server library

The PDF Viewer server library lets you extract text from a page along with its bounds by using the ExtractText() method. The method returns the extracted text and, through an out parameter, the bounds of each character. Refer to the following user guide topic for more details:

https://ej2.syncfusion.com/aspnetcore/documentation/pdfviewer/how-to/extract-text

Getting bounds of words using ExtractText()

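The ExtractText() method of the PDF Viewer server library returns bounds for each character, not for each word. To get the bounds of a word, merge the bounds of its consecutive characters, treating spaces and line breaks as word separators. The following steps show how: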

Step 1: Extracting the text from the PDF document.

    PdfRenderer renderer = new PdfRenderer();
    renderer.Load(@"currentDirectory\..\..\..\..\Data\HTTP Succinctly.pdf");
    List<TextData> textDataCollection = new List<TextData>();
    // "text" contains the text extracted from the page at the given index;
    // "textDataCollection" receives each character together with its bounds
    string text = renderer.ExtractText(1, out textDataCollection);
    System.IO.File.WriteAllText(@"currentDirectory\..\..\..\..\Data\ExtractedText.txt", text);

Step 2: Getting the bounds of the words with the extracted text

    // "textBounds" contains the bounds of each word
    List<TextBounds> textBounds = new List<TextBounds>();
    for (int j = 0; j < textDataCollection.Count; j++)
    {
        string current = textDataCollection[j].Text;
        // A space or a line break separates words
        if (!current.Contains("\r") && !current.Contains(" "))
        {
            // Start a new word with the bounds of its first character
            string finalText = current;
            var minx = textDataCollection[j].Bounds.Left;
            var miny = textDataCollection[j].Bounds.Top;
            var maxx = textDataCollection[j].Bounds.Right;
            var maxy = textDataCollection[j].Bounds.Bottom;
            // Extend the word until the next space, line break, or the end of the collection
            int k = j + 1;
            for (; k < textDataCollection.Count; k++)
            {
                if (textDataCollection[k].Text.Contains(" ") || textDataCollection[k].Text.Contains("\r"))
                    break;
                // Grow the word bounds to include this character
                if (minx > textDataCollection[k].Bounds.Left)
                    minx = textDataCollection[k].Bounds.Left;
                if (miny > textDataCollection[k].Bounds.Top)
                    miny = textDataCollection[k].Bounds.Top;
                if (maxx < textDataCollection[k].Bounds.Right)
                    maxx = textDataCollection[k].Bounds.Right;
                if (maxy < textDataCollection[k].Bounds.Bottom)
                    maxy = textDataCollection[k].Bounds.Bottom;
                finalText += textDataCollection[k].Text;
            }
            // Resume the outer loop from the last character of this word
            j = k - 1;
            var glyphBounds = new RectangleF((float)minx, (float)miny, (float)(maxx - minx), (float)(maxy - miny));
            textBounds.Add(new TextBounds(finalText, glyphBounds));
        }
        else if (current.Contains("\r"))
        {
            // Skip the entry that follows the carriage return as well
            j++;
        }
    }
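The grouping idea in Step 2 can also be tried without the PDF Viewer library. The following is a minimal sketch under stated assumptions, not the library's API: Glyph, WordBoundsSketch, and GroupIntoWords are hypothetical names that stand in for the extracted character data, and any whitespace-only entry (space, "\r", "\n", tab) is treated as a word separator, which is slightly broader than the checks in Step 2. Character rectangles are merged with RectangleF.Union from System.Drawing.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Text;

// Hypothetical stand-in for one extracted character and its bounds.
public record Glyph(string Text, RectangleF Bounds);

public static class WordBoundsSketch
{
    // Assumption: whitespace-only entries separate words.
    private static bool IsSeparator(Glyph glyph) => string.IsNullOrWhiteSpace(glyph.Text);

    public static List<(string Word, RectangleF Bounds)> GroupIntoWords(IReadOnlyList<Glyph> glyphs)
    {
        var words = new List<(string Word, RectangleF Bounds)>();
        var current = new StringBuilder();
        RectangleF bounds = RectangleF.Empty;

        foreach (Glyph glyph in glyphs)
        {
            if (IsSeparator(glyph))
            {
                Flush();
                continue;
            }

            // The first character sets the bounds; later ones enlarge them.
            bounds = current.Length == 0 ? glyph.Bounds : RectangleF.Union(bounds, glyph.Bounds);
            current.Append(glyph.Text);
        }

        // A word that runs to the end of the input has no trailing separator.
        Flush();
        return words;

        // Stores the word collected so far, if any, and starts a new one.
        void Flush()
        {
            if (current.Length == 0) return;
            words.Add((current.ToString(), bounds));
            current.Clear();
        }
    }
}
```

For example, with rectangles given as x, y, width, height, the characters "H" at (10, 20, 6, 10) and "i" at (16, 18, 3, 12) merge into the word "Hi" with bounds (10, 18, 9, 12).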

Sample link:

https://www.syncfusion.com/downloads/support/directtrac/general/ze/WordBounds-1782596420

Conclusion

I hope you enjoyed learning about how to get the bounds of words by extracting text using PDF Viewer server library.

You can refer to our ASP.NET Core PDF Viewer feature tour page to learn about its other features, and to the documentation to get started quickly with configuration. You can also explore our ASP.NET Core PDF Viewer example to understand how to create and manipulate data.

Current customers can check out our components from the License and Downloads page. If you are new to Syncfusion®, you can try our 30-day free trial to check out our other controls.

If you have any queries or require clarifications, please let us know in the comments section below. You can also contact us through our support forums or feedback portal. We are always happy to assist you!
