This article explains how to extract data from a PDF form using Python. It covers setting up the IDE, lists the steps, and provides sample code for accessing form field data. The sample code creates a test PDF with fields and values, then reads the data from every field.
Steps to Extract Data from PDF Form Fields using Python
- Set up the environment to use Aspose.PDF for Python via .NET for extracting form data
- Create a PDF file, or load an existing one into a Document object, with input fields that contain data
- Fetch all the fields from the form property of the loaded PDF document
- Iterate through the fields and access each one
- Display each field's full name and value
These steps describe how to extract data from a fillable PDF using Python. Create or load a PDF file with fields and values, then access the collection of fields through the form property of the document. Iterate through the fields and read the full name and value of each one for processing.
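The final two steps can be captured in a small helper that gathers each field's full name and value into a dictionary for later processing. This is a minimal sketch and not part of the Aspose.PDF API: the function name collect_field_values is hypothetical, and it assumes only that each field object exposes full_name and value attributes, as the fields in this article do. Stand-in objects are used so the sketch runs without a PDF file.

```python
from types import SimpleNamespace


def collect_field_values(fields):
    """Map each field's full name to its value.

    Hypothetical helper: `fields` is any iterable of objects that expose
    `full_name` and `value` attributes (an assumption of this sketch).
    """
    return {field.full_name: field.value for field in fields}


# Stand-in objects that mimic form fields, so the sketch runs without a PDF.
sample_fields = [
    SimpleNamespace(full_name="inputField_1_1", value="Data Entry 1-1"),
    SimpleNamespace(full_name="inputField_1_2", value="Data Entry 1-2"),
]

collected = collect_field_values(sample_fields)
for name, value in collected.items():
    print(f"{name}: {value}")
```

With a loaded document, the same helper could presumably be given the collection from the form property in place of sample_fields.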
Code to Extract Form Fields from PDF using Python
import aspose.pdf as pdf
from aspose.pdf import Document, License, Rectangle
from aspose.pdf.forms import TextBoxField

def main():
    # Load Aspose PDF license
    license = License()
    license.set_license("license.lic")

    # Generate PDF with input fields
    create_pdf_with_fields()

    # Open and process the generated PDF file
    pdf_document = Document("UserForm.pdf")

    # Retrieve and display form fields
    form_fields = pdf_document.form.fields
    for form_field in form_fields:
        print("Field Name:", form_field.full_name)
        print("Field Content:", form_field.value)

def create_pdf_with_fields():
    # Instantiate new PDF document
    pdf_file = Document()
    for page_index in range(1, 4):  # 3 pages
        new_page = pdf_file.pages.add()
        for field_index in range(1, 5):  # 4 fields per page
            # Define a text input field
            input_field = TextBoxField(new_page, Rectangle(120, field_index * 90, 320, (field_index + 1) * 90, True))
            input_field.partial_name = f"inputField_{page_index}_{field_index}"
            input_field.value = f"Data Entry {page_index}-{field_index}"
            # Attach field to the document form
            pdf_file.form.add(input_field, page_index)
    # Save document to disk
    pdf_file.save("UserForm.pdf")

main()
This code demonstrates how to extract data from a PDF form. It uses the Document.form.fields collection, which contains all the fields in the PDF. To work with the fields on a particular page only, filter the collection by the page_index of each Field object.
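The page filter mentioned above can be sketched as a simple list comprehension. This is an illustration under stated assumptions rather than an Aspose.PDF feature: fields_on_page is a hypothetical helper, it assumes each field exposes a page_index attribute as described in this article, and it assumes page numbers start at 1, matching the page numbers passed to form.add in the sample code. Stand-in objects replace real fields so the sketch runs on its own.

```python
from types import SimpleNamespace


def fields_on_page(fields, page_number):
    """Return the fields whose page_index equals page_number.

    Hypothetical helper: assumes every field exposes a `page_index`
    attribute and that page numbers start at 1.
    """
    return [field for field in fields if field.page_index == page_number]


# Stand-in objects that mimic form fields spread over two pages.
sample_fields = [
    SimpleNamespace(full_name="inputField_1_1", value="Data Entry 1-1", page_index=1),
    SimpleNamespace(full_name="inputField_2_1", value="Data Entry 2-1", page_index=2),
    SimpleNamespace(full_name="inputField_2_2", value="Data Entry 2-2", page_index=2),
]

page_two = fields_on_page(sample_fields, 2)
for field in page_two:
    print("Field Name:", field.full_name)
    print("Field Content:", field.value)
```

A list comprehension keeps the filter independent of the collection type, so it should work with any iterable of field objects.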
This article has shown how to read PDF form data. If you want to flatten a PDF file, refer to the article on How to flatten PDF in Python.