How to Extract Text from Scanned PDF in C#

This step-by-step tutorial shows you how to extract text from a scanned PDF in C#. When you scan documents into a PDF, the pages are stored as images inside the PDF file. To extract text from a scanned PDF, you therefore have to apply Optical Character Recognition (OCR) to the images in the PDF.

Steps to Extract Text from Scanned PDF in C#

  1. Install Aspose.OCR for .NET from the NuGet package manager
  2. Add a reference to the Aspose.OCR namespace
  3. Apply the license using the SetLicense method
  4. Create an instance of the AsposeOcr class
  5. Specify recognition settings using the DocumentRecognitionSettings class
  6. Recognize the selected PDF pages using the RecognizePdf method
  7. Get the text of each PDF page from the RecognitionText property

With the above steps, you can read text from a scanned PDF in C# quickly and easily. Earlier, we showed you how to Extract Text From Image in C#. This example covers PDF files instead.

Code to Extract Text from Scanned PDF in C#

using System;
using System.Collections.Generic;
// Use the following namespace to extract text from a scanned PDF
using Aspose.OCR;

namespace ExtractTextFromScannedPDFFile
{
    class Program
    {
        static void Main(string[] args)
        {
            // Set the license before extracting text from the scanned PDF file
            Aspose.OCR.License AsposeOCRLicense = new Aspose.OCR.License();
            AsposeOCRLicense.SetLicense(@"c:\asposelicense\license.lic");

            // Create the AsposeOcr object
            AsposeOcr ScannedPDFFile = new AsposeOcr();

            // Set the recognition settings
            DocumentRecognitionSettings RecognitionSettings = new DocumentRecognitionSettings();
            RecognitionSettings.StartPage = 1;
            RecognitionSettings.PagesNumber = 3;
            // When set to true, improves accuracy but reduces speed
            RecognitionSettings.DetectAreas = false;

            // Extract text from the specified pages
            List<RecognitionResult> ExtractedResults = ScannedPDFFile.RecognizePdf("InputScannedPDFFile.pdf", RecognitionSettings);

            // Print the extracted text of each page
            int PageCounter = 1;
            foreach (RecognitionResult SinglePage in ExtractedResults)
            {
                Console.WriteLine("Page: {0}, Extracted Text:{1}", PageCounter, SinglePage.RecognitionText);
                PageCounter++;
            }
        }
    }
}

The above C# example is short: it reads a scanned PDF file and extracts the text from each recognized page. The one setting that needs attention is the DetectAreas property. Setting it to true gives better accuracy but slows down processing of the PDF. Setting it to false speeds up processing, but accuracy might be a little lower. Choose between the two based on whether speed or accuracy matters more in your situation.
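To see how the DetectAreas trade-off plays out on your own documents, you can time the same recognition call with both values. The following is only a rough sketch: it reuses the classes, method, license path, and input file name from the example above, and assumes the same Aspose.OCR version. The namespace name, the variable names, and the use of System.Diagnostics.Stopwatch are illustrative choices that do not come from the original example. The character count it prints is not a measure of accuracy, so you still need to read the recognized text to judge quality.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Aspose.OCR;

namespace CompareDetectAreasSettings // hypothetical name for this sketch
{
    class Program
    {
        static void Main(string[] args)
        {
            // Same license call and path as in the example above
            Aspose.OCR.License license = new Aspose.OCR.License();
            license.SetLicense(@"c:\asposelicense\license.lic");

            AsposeOcr ocr = new AsposeOcr();

            // Run the same recognition twice: once without and once with area detection
            foreach (bool detectAreas in new[] { false, true })
            {
                DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
                settings.StartPage = 1;
                settings.PagesNumber = 3;
                settings.DetectAreas = detectAreas;

                // Measure only the recognition call itself
                Stopwatch timer = Stopwatch.StartNew();
                List<RecognitionResult> results = ocr.RecognizePdf("InputScannedPDFFile.pdf", settings);
                timer.Stop();

                // Total length of the recognized text across all returned pages
                int characterCount = 0;
                foreach (RecognitionResult page in results)
                {
                    characterCount += page.RecognitionText.Length;
                }

                Console.WriteLine("DetectAreas={0}: {1} ms, {2} pages, {3} characters",
                    detectAreas, timer.ElapsedMilliseconds, results.Count, characterCount);
            }
        }
    }
}
```

The first pass may include one-time initialization cost, so it is worth running the comparison more than once before drawing conclusions from the timings.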
