How to Extract Text from Scanned PDF in Java

This quick tutorial explains how to extract text from a scanned PDF in Java. You can configure the extraction by setting the detection parameters, and you can favor either speed or accuracy depending on the PDF quality and other application requirements.

Steps to Extract Text from Scanned PDF in Java

  1. From the Maven repository, configure Aspose.OCR in your project to read scanned PDF text
  2. Initialize an AsposeOcrPdf object to read text from the PDF
  3. Instantiate a DocumentRecognitionSettings class object to set the recognition parameters
  4. Set the start page and the number of pages in the PDF to read
  5. To increase the detection speed, set the detect areas flag to false
  6. Call the RecognizePdf function to read the text according to the above configuration
  7. Iterate through all the extracted results from the PDF pages and display them on the console

To scan text from a PDF in Java, an AsposeOCRPdf object is instantiated, which provides the features for recognizing text in the PDF. The detection process can be configured with the start page number, the number of PDF pages to read, and the detect areas option, which controls the balance between speed and accuracy. Finally, we iterate through the collection of results recognized from each page and display them on the console.

Code to Convert Scanned PDF to Text in Java

This code uses AsposeOCRPdf to get text from a scanned PDF in Java. The DocumentRecognitionSettings class object lets you configure the page range either through the constructor, as demonstrated in the sample code, or by setting StartPage and PagesNumber separately. You can also set the language, image skew correction, and the thread count for parallel detection of text from the scanned PDF.
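
The control flow described in the steps (configure a page range, switch off area detection, recognize, then iterate over the per-page results) can be sketched without the library on the classpath. The following is only a minimal illustration and not the Aspose.OCR API: Settings, PageRecognizer, and recognizePdf are hypothetical stand-ins, the file name sample.pdf is made up, and the engine is a fake that returns placeholder text. Whether page numbering starts at 0 or 1 in the real library is not stated here, so the sketch does not assume either.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the recognition flow. All types here are hypothetical stand-ins, not Aspose.OCR classes. */
final class ScannedPdfTextSketch {

    /** Stand-in for the recognition settings: a page range plus the detect areas flag. */
    static final class Settings {
        final int startPage;        // first page to read (numbering convention depends on the library)
        final int pagesNumber;      // number of pages to read, starting at startPage
        boolean detectAreas = true; // per the article, false favors speed over accuracy

        Settings(int startPage, int pagesNumber) {
            if (startPage < 0 || pagesNumber < 1) {
                throw new IllegalArgumentException("invalid page range");
            }
            this.startPage = startPage;
            this.pagesNumber = pagesNumber;
        }
    }

    /** Stand-in for the OCR engine: returns the recognized text of one page. */
    interface PageRecognizer {
        String recognizePage(String pdfPath, int pageNumber, boolean detectAreas);
    }

    /** Reads the configured page range and returns one text result per page, in page order. */
    static List<String> recognizePdf(String pdfPath, Settings settings, PageRecognizer engine) {
        List<String> results = new ArrayList<>();
        int lastPage = settings.startPage + settings.pagesNumber - 1;
        for (int page = settings.startPage; page <= lastPage; page++) {
            results.add(engine.recognizePage(pdfPath, page, settings.detectAreas));
        }
        return results;
    }

    public static void main(String[] args) {
        // Page range set through the constructor: start at page 2, read 3 pages.
        Settings settings = new Settings(2, 3);
        settings.detectAreas = false; // trade accuracy for speed

        // Fake engine used only so the sketch runs; a real OCR call would go here.
        PageRecognizer fakeEngine =
                (path, page, detectAreas) -> "text of page " + page + " in " + path;

        // Iterate through the per-page results and print them to the console.
        for (String pageText : recognizePdf("sample.pdf", settings, fakeEngine)) {
            System.out.println(pageText);
        }
    }
}
```

Keeping the engine behind a small interface is a design choice of this sketch only: it lets the page-range and result-iteration logic be exercised without a real scanned PDF. In a real project the body of the fake engine would be replaced by the library's own recognition call.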

In this article, we have learned how to extract text from a scanned PDF in Java, along with how to configure the detection process. If you want to extract text from an image instead, refer to the article on how to extract text from image using Java.