How to Read PDF Table in Python

This brief tutorial describes how to read a PDF table in Python. It covers setting up the development environment, the sequence of steps for writing the application, and runnable sample code to extract a table from a PDF in Python. You will learn how to access each cell of the table and fetch all the data in it.

Steps to Extract Table Data from PDF using Python

Set the environment to use Aspose.PDF for Python via .NET to read tables
Load the source PDF file that contains a table using the Document class
Create an instance of the TableAbsorber class to read tables from the loaded PDF file
Select a page and parse all the tables in it
Access the first table and iterate through its rows and cells to fetch the TextFragment instances in each cell
Iterate through the text fragments and display the text in each fragment

These steps explain how to read a PDF table in Python. The process starts by loading the PDF file and creating a TableAbsorber object, which provides methods for reading tables from a PDF file. Once all the tables on a particular page are parsed, the first table is accessed from the collection, and each of its rows and cells is iterated to get the collection of text fragments that hold the data.

Code to Extract Table from PDF using Python

	import aspose.pdf as pdf

	# Load the license
	license = pdf.License()
	license.set_license("Aspose.Total.lic")

	# Load source PDF
	pdfDocument = pdf.Document("PdfWithTable.pdf")

	# Declare and initialize TableAbsorber object
	tableAbsorber = pdf.text.TableAbsorber()

	# Parse all the tables
	tableAbsorber.visit(pdfDocument.pages[1])

	# Get a reference to the first table
	absorbedTable = tableAbsorber.table_list[0]

	# Iterate through all the rows
	for pdfTableRow in absorbedTable.row_list:
	    # Iterate through all the cells in the row
	    for pdfTableCell in pdfTableRow.cell_list:
	        # Fetch the text fragments
	        textFragmentCollection = pdfTableCell.text_fragments
	        # Iterate through the text fragments
	        for textFragment in textFragmentCollection:
	            # Display the text
	            print(textFragment.text)

	print("Data read successfully from the table")


The above code shows how to read a PDF table in Python and fetch its data for processing. Calling the visit() method of the TableAbsorber class fills the table_list collection, which is used to access individual tables. Each table has a row_list property, each row has a cell_list property that provides access to its cells, and each cell has a text_fragments property that returns the collection of text fragments in that cell.
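If you need the table as data rather than printed output, the same property chain can be walked to build a list of rows. The following is a minimal sketch, not part of the library: table_to_rows is a hypothetical helper name, and it assumes only the attribute names used in the sample above (row_list, cell_list, text_fragments, text). The stand-in objects built with SimpleNamespace are there purely so the sketch can be run without a PDF or a license.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Return the table's text as a list of rows, each a list of cell strings.

    Hypothetical helper: it relies only on the attribute names used in the
    sample code (row_list, cell_list, text_fragments, text).
    """
    rows = []
    for row in table.row_list:
        cells = []
        for cell in row.cell_list:
            # A cell may hold several text fragments; join them into one string.
            cells.append(" ".join(fragment.text for fragment in cell.text_fragments))
        rows.append(cells)
    return rows


def _cell(*texts):
    # Stand-in for a table cell, mimicking the attribute names from the sample.
    return SimpleNamespace(text_fragments=[SimpleNamespace(text=t) for t in texts])


# Stand-in table with two rows and two cells per row (illustrative data only).
sample_table = SimpleNamespace(row_list=[
    SimpleNamespace(cell_list=[_cell("Name"), _cell("Unit", "Price")]),
    SimpleNamespace(cell_list=[_cell("Pen"), _cell("1.50")]),
])

print(table_to_rows(sample_table))
```

With the sample code above, the call would presumably be table_to_rows(absorbedTable). Joining fragments with a space is a design choice made for this sketch; a cell whose text is split across lines may need a different separator.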

This article has shown that extracting a table from a PDF in Python is straightforward. If you want to learn how to read bookmarks in a PDF, refer to the article on how to read bookmarks in PDF using Python.
