How to Extract Text from a Scanned PDF in C#

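This step-by-step tutorial shows how to extract text from a scanned PDF in C#. When a document is scanned to PDF, each page is stored in the PDF file as a scanned image. To extract text from such a file, you therefore have to apply optical character recognition (OCR) to the page images inside the PDF from C#.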

Steps to Extract Text from a Scanned PDF in C#

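Get Aspose.OCR for .NET from the NuGet.org package manager
Add a reference to the Aspose.OCR namespace
Apply the license using the SetLicense method
Create an instance of the AsposeOcr class
Specify the recognition settings using the DocumentRecognitionSettings class
Recognize the selected PDF pages using the RecognizePdf method
Get the text of each PDF page from the RecognitionText property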

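With the steps above, you can quickly and easily read text from a scanned PDF in C#. An earlier topic showed how to extract text from an image in C#; this example instead covers getting text out of a PDF.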

Code to Extract Text from a Scanned PDF in C#

	using System;
	using System.Collections.Generic;
	// Use the following namespace to extract text from a scanned PDF
	using Aspose.OCR;

	namespace ExtractTextFromScannedPDFFile
	{
		class Program
		{
			static void Main(string[] args)
			{
				// Set the license before extracting text from the scanned PDF file
				Aspose.OCR.License AsposeOCRLicense = new Aspose.OCR.License();
				AsposeOCRLicense.SetLicense(@"c:\asposelicense\license.lic");

				// Create the AsposeOcr object
				AsposeOcr ScannedPDFFile = new AsposeOcr();

				// Set the recognition settings
				DocumentRecognitionSettings RecognitionSettings = new DocumentRecognitionSettings();
				RecognitionSettings.StartPage = 1;
				RecognitionSettings.PagesNumber = 3;
				// When set to true, improves accuracy but reduces speed
				RecognitionSettings.DetectAreas = false;

				// Extract text from the specified pages
				List<RecognitionResult> ExtractedResults = ScannedPDFFile.RecognizePdf("InputScannedPDFFile.pdf", RecognitionSettings);

				// Print the extracted text of each page
				int PageCounter = 1;
				foreach (RecognitionResult SinglePage in ExtractedResults)
				{
					Console.WriteLine("Page: {0}, Extracted Text:{1}", PageCounter, SinglePage.RecognitionText);
					PageCounter++;
				}
			}
		}
	}

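The C# example above is straightforward: it reads the scanned PDF file and then extracts the text from each page. The important point to understand is the DetectAreas property. Setting it to true gives higher accuracy but slows down PDF processing; setting it to false speeds processing up but may reduce accuracy. Choose between the two options according to your own requirements.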
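
One way to see this trade-off on your own documents is to run the same recognition twice and compare the results. The following is a minimal sketch, not part of the original example. It assumes the same Aspose.OCR version and the same calls as the listing above (RecognizePdf, DocumentRecognitionSettings, DetectAreas, RecognitionText) and reuses the same input file name and page settings. The namespace and variable names are illustrative, and the character count is only a rough indication of how much text was recognized, not a measure of accuracy. License setup is omitted for brevity; apply it as in the listing above if you have a license file.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Aspose.OCR;

namespace CompareDetectAreasSettings
{
    class Program
    {
        static void Main(string[] args)
        {
            // Assumption: same input file as in the listing above
            string pdfPath = "InputScannedPDFFile.pdf";
            AsposeOcr ocr = new AsposeOcr();

            // Run the same recognition once per DetectAreas value
            foreach (bool detectAreas in new[] { false, true })
            {
                // Same page settings as the listing above; only DetectAreas changes
                DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
                settings.StartPage = 1;
                settings.PagesNumber = 3;
                settings.DetectAreas = detectAreas;

                // Time the recognition call only
                Stopwatch timer = Stopwatch.StartNew();
                List<RecognitionResult> pages = ocr.RecognizePdf(pdfPath, settings);
                timer.Stop();

                // Total recognized characters across the returned pages
                int characters = 0;
                foreach (RecognitionResult page in pages)
                {
                    characters += page.RecognitionText.Length;
                }

                Console.WriteLine("DetectAreas={0}: {1} pages, {2} characters, {3} ms",
                    detectAreas, pages.Count, characters, timer.ElapsedMilliseconds);
            }
        }
    }
}
```

The first call may include one-time initialization cost, so for a fairer comparison run the program a few times, or reverse the order of the two values, before deciding which setting suits your documents.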
