Extract Text From Word Document in Python

This example shows how to extract text from a Word document in Python. It explains how to configure the development environment step by step and provides example code for building a Word to TXT converter in Python. The application can be integrated into any environment that supports Python and the .NET framework on Windows, Linux, or macOS.

Steps to Extract Text from Word Document in Python

1. Set up the environment by installing Aspose.Words for Python via .NET to convert a DOCX file to a TXT file
2. Load the source Word DOCX file using an instance of the Document class
3. Create a TxtSaveOptions object and set the required properties
4. Convert the loaded Word document to a TXT file using the save method

These steps extract text from a DOCX file in Python through a simple API. The process starts by loading the source DOCX file from disk with an instance of the Document class. The desired properties of the output TXT file are then set on a TxtSaveOptions object. Finally, the loaded Word document is saved to disk as a TXT file using the save method.

Code to Convert DOCX to TXT in Python

	import aspose.words as aw

	# Path to the source files
	filePath = "Y:////KB//TestData//"

	# Load the Aspose.Words license in your application to convert DOCX to TXT
	wordtoTxtLicense = aw.License()
	wordtoTxtLicense.set_license(filePath + "Conholdate.Total.Product.Family.lic")

	# Use the Document class object to access the source DOCX file
	srcDocument = aw.Document(filePath + "Test1.docx")

	# Optional text saving options
	txtOpts = aw.saving.TxtSaveOptions()
	txtOpts.max_characters_per_line = 100
	txtOpts.save_format = aw.SaveFormat.TEXT
	txtOpts.pretty_format = True

	# Save the loaded document as a TXT file using the options above
	srcDocument.save(filePath + "ExtractedText.txt", txtOpts)

	print("Document converted to TXT successfully")


The example demonstrates how the API converts DOCX to TXT in Python. Using a TxtSaveOptions instance is optional; you can save the TXT file with the default options. To customize the output TXT file, use the properties exposed by the TxtSaveOptions class, such as encoding, force_page_breaks, max_characters_per_line, paragraph_break, and pretty_format.
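Since the save options are optional, the conversion can be reduced to a load followed by a save. The following is a minimal sketch of that shorter path, not a definitive implementation. The function names docx_to_txt and default_txt_path are hypothetical helpers introduced here for illustration, and the sketch assumes Aspose.Words for Python via .NET is installed and that a license, if required, has already been applied as in the example above.

```python
from pathlib import Path


def default_txt_path(docx_path):
    """Hypothetical helper: return the source path with its extension swapped for .txt."""
    return Path(docx_path).with_suffix(".txt")


def docx_to_txt(docx_path, txt_path=None):
    """Hypothetical helper: save a Word document as plain text with default TXT settings."""
    # Imported lazily so this module still loads where Aspose.Words is not installed.
    import aspose.words as aw

    target = Path(txt_path) if txt_path else default_txt_path(docx_path)
    # No TxtSaveOptions object is created; only the output format is specified,
    # so the library's default text export settings apply.
    aw.Document(str(docx_path)).save(str(target), aw.SaveFormat.TEXT)
    return target
```

Calling docx_to_txt("Test1.docx") would write Test1.txt next to the source file and return its path. Passing a second argument overrides the output location.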

In this article, we have learned that a Python-based API can be a good choice for extracting text from a DOCX file. If you want to learn how to compare PDF documents, refer to the article on Compare PDF Documents using Python.

