How to get the bounds of words by extracting text using PDF Viewer server library
Extract text using PDF Viewer server library
The PDF Viewer server library allows you to extract text from a page along with its bounds. Use the ExtractText() method, which returns the text of the given page and outputs the bounds of each character. For more details, refer to the user guide:
https://ej2.syncfusion.com/aspnetcore/documentation/pdfviewer/how-to/extract-text
Getting bounds of words using ExtractText()
The ExtractText() method of the PDF Viewer server library returns bounds per character, not per word. To get the bounds of a word, combine the bounds of the consecutive characters that make up that word, as in the following steps:
Step 1: Extracting the text from the PDF document.
PdfRenderer renderer = new PdfRenderer();
renderer.Load(@"currentDirectory\..\..\..\..\Data\HTTP Succinctly.pdf");
List<TextData> textDataCollection = new List<TextData>();
// "text" contains the whole text extracted from the PDF document
string text = renderer.ExtractText(1, out textDataCollection);
System.IO.File.WriteAllText(@"currentDirectory\..\..\..\..\Data\ExtractedText.txt", text);
Step 2: Getting the bounds of the words with the extracted text
// "textBounds" contains the bounds of each word
List<TextBounds> textBounds = new List<TextBounds>();
string finalText = "";
var glyphBounds = new RectangleF(0, 0, 0, 0);
for (int j = 0; j < textDataCollection.Count; j++)
{
    // Check whether the character is a space or a new line
    if (!textDataCollection[j].Text.Contains("\r") && !textDataCollection[j].Text.Contains(" "))
    {
        // First character of a word: start the word bounds from its bounds
        finalText += textDataCollection[j].Text;
        var minx = textDataCollection[j].Bounds.Left;
        var miny = textDataCollection[j].Bounds.Top;
        var maxx = textDataCollection[j].Bounds.Right;
        var maxy = textDataCollection[j].Bounds.Bottom;
        for (int k = j + 1; k < textDataCollection.Count; k++)
        {
            if (!textDataCollection[k].Text.Contains(" ") && !textDataCollection[k].Text.Contains("\r"))
            {
                // Calculating the word bounds: grow them to include this character
                if (minx > textDataCollection[k].Bounds.Left)
                    minx = textDataCollection[k].Bounds.Left;
                if (miny > textDataCollection[k].Bounds.Top)
                    miny = textDataCollection[k].Bounds.Top;
                if (maxx < textDataCollection[k].Bounds.Right)
                    maxx = textDataCollection[k].Bounds.Right;
                if (maxy < textDataCollection[k].Bounds.Bottom)
                    maxy = textDataCollection[k].Bounds.Bottom;
                finalText += textDataCollection[k].Text;
                j = k;
                // Last character of the page: close the current word
                if (j == textDataCollection.Count - 1)
                {
                    glyphBounds = new RectangleF((float)minx, (float)miny, (float)(maxx - minx), (float)(maxy - miny));
                    textBounds.Add(new TextBounds(finalText, glyphBounds));
                    finalText = "";
                    break;
                }
            }
            else
            {
                // A space or new line ends the current word
                glyphBounds = new RectangleF((float)minx, (float)miny, (float)(maxx - minx), (float)(maxy - miny));
                textBounds.Add(new TextBounds(finalText, glyphBounds));
                finalText = "";
                break;
            }
        }
        // A single-character word at the very end never enters the inner loop, so add it here
        if (finalText.Length > 0)
        {
            glyphBounds = new RectangleF((float)minx, (float)miny, (float)(maxx - minx), (float)(maxy - miny));
            textBounds.Add(new TextBounds(finalText, glyphBounds));
            finalText = "";
        }
    }
    else if (textDataCollection[j].Text.Contains("\r"))
    {
        // Skip the character that follows the carriage return (presumably the line feed)
        j++;
    }
}
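The grouping logic in Step 2 can also be tried without the PDF Viewer library. The following is a minimal sketch, not the library's API: Glyph, WordBox, and WordGrouper are hypothetical names introduced only for illustration, with Glyph standing in for the per-character text and bounds that ExtractText() produces. It assumes C# 9 or later (for records) and uses RectangleF.Union from System.Drawing to merge character rectangles. Unlike Step 2, it treats any whitespace-only character as a word separator.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Text;

// Hypothetical stand-in for one extracted character: its text and its bounds.
public record Glyph(string Text, RectangleF Bounds);

// Hypothetical result type: one word and the rectangle that encloses it.
public record WordBox(string Text, RectangleF Bounds);

public static class WordGrouper
{
    // Merges consecutive non-whitespace glyphs into words and unions their bounds.
    public static List<WordBox> Group(IEnumerable<Glyph> glyphs)
    {
        var words = new List<WordBox>();
        var text = new StringBuilder();
        RectangleF box = RectangleF.Empty;

        foreach (var glyph in glyphs)
        {
            if (string.IsNullOrWhiteSpace(glyph.Text))
            {
                // A space, carriage return, or line feed ends the current word.
                Flush();
                continue;
            }

            // The first glyph of a word starts the box; later glyphs enlarge it.
            box = text.Length == 0 ? glyph.Bounds : RectangleF.Union(box, glyph.Bounds);
            text.Append(glyph.Text);
        }

        // Emit the last word if the input does not end with whitespace.
        Flush();
        return words;

        void Flush()
        {
            if (text.Length == 0) return;
            words.Add(new WordBox(text.ToString(), box));
            text.Clear();
            box = RectangleF.Empty;
        }
    }
}
```

To use the sketch with real data, each entry of textDataCollection would first be mapped to a Glyph built from its Text and Bounds values. The single pass with a trailing flush avoids the nested loop and handles a word that ends exactly at the last character.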
Sample link:
https://www.syncfusion.com/downloads/support/directtrac/general/ze/WordBounds-1782596420
Conclusion