This tutorial explains how to read a PDF table in Java and access the text in each cell of the desired table. You can select a particular table on the target page of the PDF and iterate through all of its rows and cells to retrieve the data. Writing this PDF table reader in Java requires no third-party tool or software beyond Aspose.PDF.
Steps to Read PDF Table in Java
- Configure your PDF table reader application to add Aspose.PDF from the Maven repository
- Load the sample PDF file that contains a table using the Document class object
- Instantiate and initialize the TableAbsorber object to fetch all the PDF tables from the selected PDF page
- Iterate through all the rows in the desired table
- Iterate through all the cells in the desired row and fetch all the text fragments from each cell
- Display the text fetched from the cell
These steps explain how to extract a table from a PDF using Java, including the library that must be added to the project. They also set out the order of operations: first load the PDF, then access a particular page, and then fetch the desired table. Finally, iterate through all the rows and cells to get the information.
Code to Read PDF Table in Java
import com.aspose.pdf.License;
import com.aspose.pdf.AbsorbedCell;
import com.aspose.pdf.AbsorbedRow;
import com.aspose.pdf.AbsorbedTable;
import com.aspose.pdf.Document;
import com.aspose.pdf.TableAbsorber;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentCollection;

public class ReadPDFTableInJava {
    public static void main(String[] args) throws Exception { // main function for reading PDF table data in ReadPDFTableInJava
        // To avoid the trial version limitations, load the Aspose.PDF license before reading table data
        License license = new License();
        license.setLicense("Aspose.Pdf.lic");

        // Load a source PDF document that contains a table
        Document pdfDocument = new Document("PdfWithTable.pdf");

        // Instantiate the TableAbsorber object for PDF table extraction
        TableAbsorber tableAbsorber = new TableAbsorber();

        // Scan the first page of the input PDF for tables (page numbering starts at 1)
        tableAbsorber.visit(pdfDocument.getPages().get_Item(1));

        // Access the desired table from the tables collection (here, the first table found)
        AbsorbedTable absorbedTable = tableAbsorber.getTableList().get(0);

        // Iterate through all the rows, reading each one as an AbsorbedRow
        for (AbsorbedRow pdfTableRow : absorbedTable.getRowList())
        {
            // Access each cell in the cells collection using AbsorbedCell
            for (AbsorbedCell pdfTableCell : pdfTableRow.getCellList())
            {
                // Get the collection of text fragments in the cell
                TextFragmentCollection textFragmentCollection = pdfTableCell.getTextFragments();

                // Access each text fragment from the fragments collection
                for (TextFragment textFragment : textFragmentCollection)
                {
                    // Display the table cell text
                    System.out.println(textFragment.getText());
                }
            }
        }
        System.out.println("Done");
    }
}
The Java code above extracts a table from a PDF using the TableAbsorber and AbsorbedTable classes. It uses the AbsorbedRow and AbsorbedCell classes to walk through the rows and cells, and the TextFragment class to fetch the cell data. Other absorber classes are also available for different elements in the document, such as fonts, paragraphs, text, and text fragments.
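The code above prints every text fragment on its own line, so the row and cell boundaries are lost in the output. As a rough sketch of one way to keep that structure, the helper below joins the fragments of each cell and writes one CSV line per row. It is not part of Aspose.PDF: the class name TableTextFormatter and its methods are hypothetical, and it works on plain Java lists of strings that stand in for the values returned by textFragment.getText().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an Aspose.PDF class. Plain strings stand in
// for the text of each TextFragment read from a table cell.
final class TableTextFormatter {

    // Joins the fragments of one cell into a single trimmed string.
    static String joinFragments(List<String> fragments) {
        return String.join(" ", fragments).trim();
    }

    // Quotes a value for CSV when it contains a comma, a quote, or a line break.
    static String escapeCsv(String value) {
        if (value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Converts rows -> cells -> fragments into one CSV line per row.
    static List<String> toCsvLines(List<List<List<String>>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<List<String>> row : rows) {
            List<String> cells = new ArrayList<>();
            for (List<String> cellFragments : row) {
                cells.add(escapeCsv(joinFragments(cellFragments)));
            }
            lines.add(String.join(",", cells));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data invented for illustration: two rows of two cells each.
        List<List<List<String>>> rows = Arrays.asList(
            Arrays.asList(Arrays.asList("Name"), Arrays.asList("Unit", "price")),
            Arrays.asList(Arrays.asList("Widget, large"), Arrays.asList("9.50")));
        for (String line : toCsvLines(rows)) {
            System.out.println(line);
        }
    }
}
```

To use this with the reader above, one option would be to build a list of strings per cell inside the innermost loop, collect those lists per row, and pass the result to toCsvLines once the table has been walked.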
This article has shown that PDF table extraction in Java can be performed in a few steps. If you want to learn how to read text and images from a PDF file, refer to the article on how to read a PDF file in Java.