How to convert tables in a PDF document to a DataTable in C#
Syncfusion Essential® PDF is a feature-rich, high-performance .NET PDF library used to create, read, and edit PDF documents programmatically without Adobe dependencies. At present, the library does not support converting tables in a PDF document to a DataTable directly. However, you can achieve this by using Tabula together with the Syncfusion PDF library: Tabula extracts the tables to a CSV file, and the CSV data is then loaded into a DataTable. Refer to the following steps and code.
Steps to convert the tables in a PDF document to a DataTable programmatically using C#
1. Create a new C# console application project.
2. Include the following namespaces in the Program.cs file.
C#
using System;
using System.Data;
using System.Diagnostics;
using System.IO;   // Directory, StreamReader
using System.Linq; // Count() on arrays
3. The following code example shows how to convert the tables in the PDF document to a CSV file using Tabula in C#.
//fileName and outputpath come from the calling code; the jar and the PDF are resolved relative to outputpath
string csvName = Path.GetFileNameWithoutExtension(fileName);
ProcessStartInfo startInfo = new ProcessStartInfo(@"C:\Program Files (x86)\Java\jre1.8.0_261\bin\java.exe");
startInfo.WindowStyle = ProcessWindowStyle.Hidden;
//Sets the working directory
startInfo.WorkingDirectory = outputpath;
//Using the java dependencies to create a csv file (file names are quoted so that spaces are handled)
startInfo.Arguments = "-jar tabula-1.0.2-jar-with-dependencies.jar -p all -o \"" + csvName + ".csv\" \"" + fileName + "\"";
Process currentProcess = Process.Start(startInfo);
currentProcess.WaitForExit();
string[] files = Directory.GetFiles(outputpath, csvName + ".csv");
if (files.Length > 0)
{
    //ConvertCSVtoDataTable is shown in step 4; DrawDataTabletoPDF is not shown here
    DataTable res = ConvertCSVtoDataTable(files[0]);
    Console.WriteLine("Extracted table from PDF to DataTable");
    DrawDataTabletoPDF(res);
}
4. The following code example shows how to convert the CSV file to a DataTable using C#.
public static DataTable ConvertCSVtoDataTable(string strFilePath)
{
    DataTable dtCsv = new DataTable();
    string fullText;
    using (StreamReader sr = new StreamReader(strFilePath))
    {
        //read the full file text
        fullText = sr.ReadToEnd();
    }
    //split the full file text into rows, accepting \r\n and \n line endings and skipping empty lines
    string[] rows = fullText.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < rows.Length; i++)
    {
        //split each row with comma to get the individual values
        //note: a plain Split does not handle quoted values that contain commas
        string[] rowValues = rows[i].Split(',');
        if (i == 0)
        {
            for (int j = 0; j < rowValues.Length; j++)
            {
                //add headers
                dtCsv.Columns.Add(rowValues[j]);
            }
        }
        else
        {
            DataRow dr = dtCsv.NewRow();
            for (int k = 0; k < rowValues.Length; k++)
            {
                dr[k] = rowValues[k];
            }
            //add other rows
            dtCsv.Rows.Add(dr);
        }
    }
    return dtCsv;
}
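Splitting each row on commas breaks when a cell itself contains a comma, because CSV writers normally wrap such cells in double quotes. The following is a minimal sketch of a quote-aware alternative, not part of the downloadable sample. It uses the TextFieldParser class from the Microsoft.VisualBasic assembly, which is assumed to be referenced by the project; the QuotedCsvReader class name and its Read method are hypothetical. It also assumes that the header names are unique, as the method above does.

```csharp
using System.Data;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedCsvReader
{
    // Reads comma-delimited text into a DataTable, honoring double-quoted fields.
    // The first record is treated as the header row.
    public static DataTable Read(TextReader reader)
    {
        DataTable table = new DataTable();
        using (TextFieldParser parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();

                if (table.Columns.Count == 0)
                {
                    // First record: create one column per header value.
                    foreach (string header in fields)
                    {
                        table.Columns.Add(header);
                    }
                    continue;
                }

                // Add unnamed columns if a data row is wider than the header row.
                while (table.Columns.Count < fields.Length)
                {
                    table.Columns.Add();
                }

                DataRow row = table.NewRow();
                for (int i = 0; i < fields.Length; i++)
                {
                    row[i] = fields[i];
                }
                table.Rows.Add(row);
            }
        }
        return table;
    }
}
```

It could be called with the extracted file, for example `DataTable table = QuotedCsvReader.Read(new StreamReader(files[0]));`. Disposing the parser also closes the reader that was passed in.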
A complete working sample can be downloaded from PdfSample.zip.
The sample converts the PDF tables into a .csv file and stores it in the Data folder of the sample. It then converts the CSV data to a DataTable using the system assemblies.
1. If the .csv file is not created in the Data folder after you upload the PDF file, the problem is related to Tabula.
2. Ensure that the “tabula-1.0.2-jar-with-dependencies.jar” dependency is present in the Data folder.
3. Provide the correct Java installation path in the PdfToDataTable() method.
ProcessStartInfo startInfo = new ProcessStartInfo(@"C:\Program Files (x86)\Java\jre1.8.0_261\bin\java.exe");
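The notes above can also be checked in code. The following is a minimal sketch, not taken from the downloadable sample: it looks up Java through the JAVA_HOME environment variable instead of a hard-coded path, and it captures the error output of the Tabula process so that a failed conversion is reported rather than showing up only as a missing .csv file. The TabulaRunner class and its method names are hypothetical, and the sketch assumes that JAVA_HOME is set or that java is available on the PATH.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class TabulaRunner
{
    // Returns <javaHome>\bin\java.exe when javaHome is set;
    // otherwise returns "java" so that the PATH lookup applies.
    public static string ResolveJavaPath(string javaHome)
    {
        if (string.IsNullOrWhiteSpace(javaHome))
        {
            return "java";
        }
        return Path.Combine(javaHome, "bin", "java.exe");
    }

    // Wraps a value in double quotes so that names with spaces stay one argument.
    // Assumes the value itself contains no double quotes.
    public static string Quote(string value)
    {
        return "\"" + value + "\"";
    }

    // Builds the -jar/-p/-o argument string for Tabula, quoting the file names.
    public static string BuildArguments(string jarName, string pdfName, string csvName)
    {
        return "-jar " + Quote(jarName) + " -p all -o " + Quote(csvName) + " " + Quote(pdfName);
    }

    // Runs Java in the given working directory and returns the exit code.
    // Anything the process writes to standard error is returned in errorOutput.
    public static int Run(string javaPath, string workingDirectory, string arguments, out string errorOutput)
    {
        ProcessStartInfo startInfo = new ProcessStartInfo(javaPath, arguments)
        {
            WorkingDirectory = workingDirectory,
            UseShellExecute = false, // required for stream redirection
            RedirectStandardError = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read the stream before waiting so a full buffer cannot block the process.
            errorOutput = process.StandardError.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A caller could pass `TabulaRunner.ResolveJavaPath(Environment.GetEnvironmentVariable("JAVA_HOME"))` as javaPath and, when Run returns a nonzero exit code, write errorOutput to the console before looking for the .csv file.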