How to Extract Text from Scanned PDF in C#

This step-by-step tutorial shows you how to extract text from a scanned PDF in C#. When you scan documents into a PDF, the pages are stored as images inside the PDF file rather than as selectable text. To extract text from a scanned PDF, you therefore have to apply Optical Character Recognition (OCR) to the images in the PDF.

Steps to Extract Text from Scanned PDF in C#

  1. Install the Aspose.OCR for .NET package from NuGet.org
  2. Add a reference to the Aspose.OCR namespace
  3. Apply the license using the SetLicense method
  4. Create an instance of the AsposeOcr class
  5. Specify recognition settings using the DocumentRecognitionSettings class
  6. Recognize all PDF pages using the RecognizePdf method
  7. Get the text of each PDF page from the RecognitionText property

With the above steps, you can read text from a scanned PDF in C# in a few lines of code. Earlier, we showed you how to Extract Text From Image in C#. This example applies the same OCR approach to the pages of a PDF file.

Code to Extract Text from Scanned PDF in C#
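
The steps listed earlier can be sketched as the following console program. It is a minimal sketch, not a definitive implementation: the license path and the PDF file name are placeholders, and the member names (SetLicense, RecognizePdf, DetectAreas, RecognitionText) are written as we understand them from the Aspose.OCR for .NET API. Other releases of the library may name or shape these calls differently, so check them against the version you install.

```csharp
using System;
using Aspose.OCR; // namespace provided by the Aspose.OCR for .NET NuGet package

namespace ExtractTextFromScannedPdf
{
    class Program
    {
        static void Main(string[] args)
        {
            // Apply the license before running OCR.
            // Placeholder path (assumption): point it at your own license file.
            Aspose.OCR.License license = new Aspose.OCR.License();
            license.SetLicense("Aspose.OCR.lic");

            // Create the OCR engine.
            AsposeOcr ocr = new AsposeOcr();

            // Recognition settings for the whole document.
            DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
            // true favors accuracy, false favors speed (see the note that follows).
            settings.DetectAreas = true;

            // Run OCR on the scanned PDF; one result is expected per page.
            // Placeholder file name (assumption).
            var pages = ocr.RecognizePdf("ScannedDocument.pdf", settings);

            // Print the recognized text page by page.
            int pageNumber = 1;
            foreach (RecognitionResult page in pages)
            {
                Console.WriteLine("Text from page " + pageNumber + ":");
                Console.WriteLine(page.RecognitionText);
                pageNumber++;
            }
        }
    }
}
```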

The C# code for getting text from a PDF is simple: it reads a scanned PDF file and then extracts the text from each page. One important point to understand is the DetectAreas property. If you set it to true, recognition is more accurate but processing the PDF is slower. If you set it to false, processing is faster but accuracy may drop slightly. Choose between the two options based on whether accuracy or speed matters more in your situation.
