Extract Data from PDF Form using Python

This article explains how to extract data from a PDF form using Python. It covers setting up the IDE, the steps involved, and sample code for accessing form field data. The sample code creates a test PDF with fields and values, then reads the data from every field.

Steps to Extract Data from PDF Form Fields using Python

  1. Set up the environment to use Aspose.PDF for Python via .NET for extracting form data
  2. Create a PDF file with input fields containing data, or load an existing one into a Document object
  3. Fetch all the fields from the form property of the loaded PDF document
  4. Iterate through the fields and access each one
  5. Display each field's full name and value

These steps describe how to extract data from a fillable PDF using Python. Create or load a PDF file with fields and values, then access the collection of fields through the form property of the document. Iterate through the fields and read the full name and value of each one for processing.
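The last three steps can be sketched as a small helper that gathers each field's full name and value into a dictionary. This is only an illustration: the function name collect_field_values is hypothetical, and the SimpleNamespace objects are stand-ins that let the sketch run without a PDF. With Aspose.PDF, the argument would presumably be the form.fields collection of a loaded Document.

```python
from types import SimpleNamespace


def collect_field_values(fields):
    """Map each field's full name to its value (hypothetical helper)."""
    return {field.full_name: field.value for field in fields}


# Stand-in objects, not real Aspose.PDF fields. They only mimic the
# full_name and value attributes read in the steps above.
sample_fields = [
    SimpleNamespace(full_name="inputField_1_1", value="Data Entry 1-1"),
    SimpleNamespace(full_name="inputField_1_2", value="Data Entry 1-2"),
]

extracted = collect_field_values(sample_fields)
for name, value in extracted.items():
    print(f"{name}: {value}")
```

Collecting the values into a dictionary keeps the extraction separate from whatever processing follows, such as printing or saving the data.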

Code to Extract Form Fields from PDF using Python

import aspose.pdf as pdf
from aspose.pdf import Document, License, Rectangle
from aspose.pdf.forms import TextBoxField


def main():
    # Load Aspose PDF license
    license = License()
    license.set_license("license.lic")

    # Generate PDF with input fields
    create_pdf_with_fields()

    # Open and process the generated PDF file
    pdf_document = Document("UserForm.pdf")

    # Retrieve and display form fields
    form_fields = pdf_document.form.fields
    for form_field in form_fields:
        print("Field Name:", form_field.full_name)
        print("Field Content:", form_field.value)


def create_pdf_with_fields():
    # Instantiate new PDF document
    pdf_file = Document()
    for page_index in range(1, 4):  # 3 pages
        new_page = pdf_file.pages.add()
        for field_index in range(1, 5):  # 4 fields per page
            # Define a text input field
            input_field = TextBoxField(new_page, Rectangle(120, field_index * 90, 320, (field_index + 1) * 90, True))
            input_field.partial_name = f"inputField_{page_index}_{field_index}"
            input_field.value = f"Data Entry {page_index}-{field_index}"
            # Attach field to the document form
            pdf_file.form.add(input_field, page_index)
    # Save document to disk
    pdf_file.save("UserForm.pdf")


main()

This code demonstrates how to extract data from a PDF form. It uses the Document.form.fields collection, which contains all the fields in the PDF. To work with the fields on a particular page only, filter the collection by the page_index property of each Field object.
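The page filter mentioned above might look like the sketch below. The helper name fields_on_page is hypothetical, and the SimpleNamespace objects are stand-ins so the example runs without a PDF. It assumes that each field exposes page_index as described, and that pages are numbered from 1 as in the form.add calls of the sample.

```python
from types import SimpleNamespace


def fields_on_page(fields, page_number):
    """Return the fields whose page_index equals page_number (hypothetical helper)."""
    return [field for field in fields if field.page_index == page_number]


# Stand-in objects, not real Aspose.PDF fields. They only mimic the
# full_name, value and page_index attributes used here.
sample_fields = [
    SimpleNamespace(full_name="inputField_1_1", value="Data Entry 1-1", page_index=1),
    SimpleNamespace(full_name="inputField_1_2", value="Data Entry 1-2", page_index=1),
    SimpleNamespace(full_name="inputField_2_1", value="Data Entry 2-1", page_index=2),
]

page_one_fields = fields_on_page(sample_fields, 1)
for field in page_one_fields:
    print(field.full_name, "->", field.value)
```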

This article has explained how to read PDF form data. If you want to flatten a PDF file, refer to the article on How to flatten PDF in Python.
