Data Extraction from Invoices using Python

This tutorial explains how to extract data from invoices using Python. It covers setting up the IDE for development, lists the steps that define the program flow, and provides sample code demonstrating invoice OCR in Python. You will also learn how to customize the detection process for images such as PNG, JPEG, BMP, TIFF, and GIF to suit your requirements.

Steps for Invoice OCR using Python

  1. Set the environment to use Aspose.OCR for Python via .NET for extracting invoice data
  2. Create an instance of the AsposeOcr class for OCR processing
  3. Create an instance of the OcrInput class for holding receipts
  4. Add receipts to the OcrInput collection
  5. Set up receipt recognition settings and set recognition language
  6. Perform OCR using the recognize_receipt method to recognize text from the input receipts
  7. Display recognized text from the receipts

These steps describe how to apply OCR to receipts using Python. Create an instance of the AsposeOcr class, initialize an OcrInput object to hold the receipts, and create a ReceiptRecognitionSettings object that defines the parameters for invoice OCR. Finally, call the recognize_receipt() method with the receipt collection and the settings to extract the text.

Code for Invoice Data Extraction using Python

import aspose.ocr as api
from aspose.ocr import License
# Instantiate and apply the license for Aspose.OCR to enable full functionality.
license = License()
license.set_license("License.lic")
# Create an instance of the AsposeOcr class for OCR processing.
extractTextFromReceipt = api.AsposeOcr()
# Initialize an OcrInput object to hold input image(s) for OCR processing.
receiptDatas = api.OcrInput(api.InputType.SINGLE_IMAGE)
# Add images (receipts) to the OcrInput object for recognition.
receiptDatas.add("Receipt1.png")
receiptDatas.add("Receipt2.png")
# Set up receipt recognition settings.
recognitionSettings = api.ReceiptRecognitionSettings()
recognitionSettings.language = api.Language.ENG # Specify the language as English.
# Perform OCR to recognize text from the input receipts using the specified settings.
results = extractTextFromReceipt.recognize_receipt(receiptDatas, recognitionSettings)
# Get the number of recognized results (one result per input image).
length = results.length
# Loop through each result and print the recognized text for each input image.
for i in range(length):
    print(results[i].recognition_text)

This sample code demonstrates the use of the invoice OCR API in Python. You may set the input type to PDF, TIFF, URL, Directory, Zip, etc., and choose the detection language from the large list of languages in the Language enumeration. The ReceiptRecognitionSettings class contains a number of properties, such as the set of allowed characters, a flag for automatic color inversion, and a blacklist of characters to ignore.
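Once the text has been recognized, a common next step is to pick individual values out of it. The following is a minimal sketch, not part of the Aspose.OCR API: it uses only Python's standard library, and the extract_fields helper and both regular expressions are hypothetical names and patterns chosen for illustration. It assumes the receipt contains a line starting with "Total" and a date in either YYYY-MM-DD or D/M/Y form; real receipts vary widely and the patterns will likely need tuning.

```python
import re
from decimal import Decimal

# Hypothetical patterns for illustration; adjust them to your receipt layout.
# A line that starts with "total" (any case) and ends with an amount.
TOTAL_RE = re.compile(r"(?im)^\s*total\b[^\d\n]*(\d+(?:[.,]\d{2})?)\s*$")
# A date written as YYYY-MM-DD or as D/M/Y with slashes.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b")


def extract_fields(text):
    """Return the first date and the total found in recognized text."""
    fields = {"date": None, "total": None}

    date_match = DATE_RE.search(text)
    if date_match:
        fields["date"] = date_match.group(1)

    total_match = TOTAL_RE.search(text)
    if total_match:
        # Normalize a decimal comma to a dot before converting.
        fields["total"] = Decimal(total_match.group(1).replace(",", "."))

    return fields


# Stand-in for text returned by OCR; not real recognition output.
sample = "ACME Store\nDate: 2024-03-15\nCoffee 3.50\nSubtotal 3.50\nTOTAL: $3.85\n"
print(extract_fields(sample))
```

In the sample code above, the same helper could be applied to each result by passing results[i].recognition_text instead of the stand-in string. Anchoring the total pattern to the start of a line keeps a "Subtotal" line from being mistaken for the total.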

This article has explained how to extract text from invoices. To convert handwritten text to editable and searchable text, refer to the article on Convert handwriting to text using Python.
