How to Extract Individual Data in WinForms PDF Viewer?

You can extract individual questions and answers from a PDF document in the WinForms PDF Viewer by calling the `ExtractText` method of `PdfViewerControl` and then parsing the extracted text with string operations that look for a predefined marker for questions and another for answers.

For example, if the PDF document lays out its questions and answers as shown in the following screenshot, you can identify a question by checking whether the line starts with a number followed by a period, and an answer by checking whether the line starts with the term “Ans.”.

Sample question and answer

The following steps show how to do this:

Steps to extract individual questions and answers from a PDF document

Step 1: Extract text from the PDF document using `PdfViewerControl`.

string fileText = string.Empty;
 
//Initialize PdfViewerControl
PdfViewerControl pdfViewerControl = new PdfViewerControl();
// Load PDF document.
pdfViewerControl.Load(@"../../Data/sample.pdf");
 
//Extract text from the document
List<TextData> textData = new List<TextData>();
for (int i = 0; i < pdfViewerControl.PageCount; i++)
{   
    //Get text from a particular page at the index `i` 
    string text = pdfViewerControl.ExtractText(i, out textData);
    //Add new line for next page.
    fileText += "\n" + text;
}
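
The checks in the next two steps work on one line of text at a time, so the combined `fileText` has to be split into lines first. The following is a minimal sketch of that step, assuming each question and answer sits on its own line; the `ExtractedTextLines` class and its `Split` method are hypothetical names used for illustration and are not part of the PDF Viewer API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: not part of PdfViewerControl.
public static class ExtractedTextLines
{
    // Splits the extracted text into trimmed, non-empty lines.
    // Assumes that each question and each answer occupies a single line.
    public static List<string> Split(string fileText)
    {
        var lines = new List<string>();
        string[] rawLines = fileText.Split(
            new[] { "\r\n", "\n", "\r" },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string raw in rawLines)
        {
            string line = raw.Trim();
            if (line.Length > 0)
                lines.Add(line);
        }
        return lines;
    }
}
```

Each entry of the returned list can then be used as the `text` value in Steps 2 and 3.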

Step 2: Collect questions from the extracted text.

private void Form1_Load(object sender, System.EventArgs e)
{
    // `text` holds a single line of the extracted text
    int questionNumber;
    // Check whether the line of text starts with a numeric value
    if (text.Length > 0 && int.TryParse(text[0].ToString(), out questionNumber))
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '.')
            {
                // If everything before the period is a number,
                // add the rest of the line to the question collection list
                if (int.TryParse(text.Substring(0, i), out questionNumber))
                    QuestionCollection.Add(text.Substring(i + questionStartIndex));
                // Only the first period can end the question number
                break;
            }
        }
    }
}

Step 3: Collect answers from the extracted text.

private void Form1_Load(object sender, System.EventArgs e)
{
    // Check whether the line of text starts with "Ans."
    if (answer == "Ans.")
    {
        // Add the rest of the line to the answer collection list
        AnswerCollection.Add(text.Substring(answerStartIndex));
    }
}
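
Steps 2 and 3 can also be combined into one pass over the lines. The sketch below is one possible way to do that with plain string operations and no dependency on the viewer; `QuestionAnswerParser`, `TryGetQuestion`, `TryGetAnswer`, and `Parse` are hypothetical names, and the code assumes the same document structure as above (a number and a period before each question, “Ans.” before each answer).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: illustrates the checks from Steps 2 and 3 in one place.
public static class QuestionAnswerParser
{
    private const string AnswerPrefix = "Ans.";

    // Returns true when the line looks like "<number>. <question text>".
    public static bool TryGetQuestion(string line, out string question)
    {
        question = null;
        int dot = line.IndexOf('.');
        if (dot <= 0)
            return false;

        int number;
        if (!int.TryParse(line.Substring(0, dot), out number))
            return false;

        question = line.Substring(dot + 1).Trim();
        return true;
    }

    // Returns true when the line starts with the "Ans." marker.
    public static bool TryGetAnswer(string line, out string answer)
    {
        answer = null;
        if (!line.StartsWith(AnswerPrefix, StringComparison.Ordinal))
            return false;

        answer = line.Substring(AnswerPrefix.Length).Trim();
        return true;
    }

    // Sorts each line into the question list or the answer list;
    // lines that match neither pattern are ignored.
    public static void Parse(IEnumerable<string> lines,
                             List<string> questions,
                             List<string> answers)
    {
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            string text;
            if (TryGetAnswer(line, out text))
                answers.Add(text);
            else if (TryGetQuestion(line, out text))
                questions.Add(text);
        }
    }
}
```

Keeping the two checks in separate methods makes it easier to change a marker, for example to a different answer prefix, without touching the loop.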

Note:

This sample uses a PDF document with the simple structure described above. If your PDF document is structured differently, you need to adapt the sample to that structure.

Refer to the following sample link for the complete code snippet.

ExtractQuestionsAndAnswers
