How to Read PDF File in C#

Reading documents programmatically is a common requirement. This how-to guide shows how to read a PDF file in C# by following the simple steps below.

Steps to Read PDF File in C#

  1. Create an empty C# Console Application in Visual Studio
  2. Add a reference to Aspose.PDF for .NET by installing it from NuGet.org
  3. Load an existing PDF file into a Document object
  4. Initialize the TextAbsorber class to read the PDF file
  5. Extract the PDF text and write it to the Console output
  6. Iterate through the PDF page resources to find images
  7. Create a FileStream object for each image found
  8. Save the image to local disk

The code snippet below shows how to open and read a PDF file in C#, extracting both its text and its images. The API provides the TextAbsorber class for reading text from a PDF file; once the absorber has visited the pages, the extracted text is available through its Text property. Images are found by looping through each page's resources, and each one can then be saved to local disk, as shown below.

Code to Read PDF File in C#

using System;
using System.IO;
// Add reference to Aspose.PDF for .NET API
// Use following namespace to read PDF file
using Aspose.Pdf;

namespace ReadPDFFiles
{
    class Program
    {
        static void Main(string[] args)
        {
            // Set license before reading PDF file
            Aspose.Pdf.License AsposePDFLicense = new Aspose.Pdf.License();
            AsposePDFLicense.SetLicense(@"c:\asposelicense\license.lic");

            string inFile = @"c:\ReadPDFFileInCSharp.pdf";

            // Load an existing PDF file in Document object to read
            Document pdf = new Document(inFile);

            // 1. Read text from PDF file
            // Initialize TextAbsorber Class to read text from PDF file
            Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();

            // Call Pages.Accept() to let TextAbsorber find text in all PDF pages
            pdf.Pages.Accept(textAbsorber);

            // Write the extracted text to Console output
            Console.WriteLine(textAbsorber.Text);

            // 2. Extract images from PDF file
            // Iterate through PDF pages
            foreach (var pdfPage in pdf.Pages)
            {
                // Image index restarts at 1 on every page
                int imageIndex = 1;

                // Check available images while reading the PDF
                foreach (XImage image in pdfPage.Resources.Images)
                {
                    // Build the output file name from the page number and image index
                    string imageFile = String.Format("Page{0}_Image{1}.jpg", pdfPage.Number, imageIndex);

                    // Create file stream for found image; the using block closes it even if Save fails
                    using (FileStream extractedImage = new FileStream(imageFile, FileMode.Create))
                    {
                        // Save output image to the disk
                        image.Save(extractedImage, System.Drawing.Imaging.ImageFormat.Jpeg);
                    }
                    imageIndex++;
                }
            }
        }
    }
}
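
The listing above collects the text of the whole document into a single string. If you would rather read the text one page at a time, the same TextAbsorber can be applied per page. The following is only a minimal sketch under a few assumptions: it is a separate console program using the same Aspose.PDF for .NET package and the same sample input path, the namespace and class names are hypothetical, and the license setup from the listing above is left out for brevity.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical namespace and class names, used for illustration only
namespace ReadPDFPages
{
    class Program
    {
        static void Main(string[] args)
        {
            // Assumption: same sample input file as in the listing above
            string inFile = @"c:\ReadPDFFileInCSharp.pdf";

            // The using block disposes the Document when reading is finished
            using (Document pdf = new Document(inFile))
            {
                foreach (Page pdfPage in pdf.Pages)
                {
                    // A fresh TextAbsorber per page keeps each page's text separate
                    TextAbsorber pageAbsorber = new TextAbsorber();
                    pdfPage.Accept(pageAbsorber);

                    // Print a simple page header followed by that page's text
                    Console.WriteLine("--- Page {0} ---", pdfPage.Number);
                    Console.WriteLine(pageAbsorber.Text);
                }
            }
        }
    }
}
```

Creating a new absorber for each page is a deliberate choice here, so that text from earlier pages does not carry over into later output.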

In the previous topic, you learnt how to process large PDF files in C#. The information and code example above will enable you to open and read PDF files in C# to extract their text and images.
