Extract Text from PDF Documents in .NET Applications

With the HiQPdf library you can extract the text of a PDF document into a .NET System.String object using the PdfTextExtract class. The PdfTextExtract.TextExtractMode property sets the extraction mode: you can keep the original positioning of the text in the PDF document, or you can extract the text in a layout more suitable for reading.

The C# sample code below, taken from an ASP.NET page, shows how easily you can extract the text from an existing PDF document. With just a few lines of code you obtain the text representation of a range of pages and send it to the browser as a UTF-8 text file:

// get the PDF file
string pdfFile = Server.MapPath("~") + @"\DemoFiles\Pdf\InputPdf.pdf";

// create the PDF text extractor
PdfTextExtract pdfTextExtract = new PdfTextExtract();

// set the text extraction mode
pdfTextExtract.TextExtractMode = GetTextExtractMode();

// read the page range from the form; an empty "to page" box is passed as 0
int fromPdfPageNumber = int.Parse(textBoxFromPage.Text);
int toPdfPageNumber = textBoxToPage.Text.Length > 0 ? int.Parse(textBoxToPage.Text) : 0;

// extract the text from a range of pages of the PDF document
string text = pdfTextExtract.ExtractText(pdfFile, fromPdfPageNumber, toPdfPageNumber);

// get UTF-8 bytes
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

// the UTF-8 marker
byte[] utf8Marker = new byte[] { 0xEF, 0xBB, 0xBF };

// the text document bytes with UTF-8 marker followed by UTF-8 bytes
byte[] bytes = new byte[utf8Bytes.Length + utf8Marker.Length];
Array.Copy(utf8Marker, 0, bytes, 0, utf8Marker.Length);
Array.Copy(utf8Bytes, 0, bytes, utf8Marker.Length, utf8Bytes.Length);

// inform the browser about the data format
HttpContext.Current.Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");

// let the browser know how to open the text document and the text document name
HttpContext.Current.Response.AddHeader("Content-Disposition",
    String.Format("{0}; filename=ExtractedText.txt; size={1}", "attachment", bytes.Length.ToString()));

// write the text buffer to HTTP response
HttpContext.Current.Response.BinaryWrite(bytes);

// call End() method of HTTP response to stop ASP.NET page processing
HttpContext.Current.Response.End();
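
The sample above reads the page range with int.Parse, which throws an exception when a text box contains anything other than a number. A minimal sketch of a more defensive version is shown below. The PageRange class and its TryParse method are hypothetical names introduced for illustration and are not part of HiQPdf. The sketch assumes that page numbers start at 1, and it maps an empty "to page" value to 0 only because the sample above does the same.

```csharp
using System;

// Hypothetical helper, not part of the HiQPdf API.
public static class PageRange
{
    // Parses the "from page" and "to page" inputs.
    // Returns false when the inputs do not describe a valid range.
    // Assumption: page numbers are 1-based.
    public static bool TryParse(string fromText, string toText,
                                out int fromPage, out int toPage)
    {
        fromPage = 0;
        toPage = 0;

        // the start page is required and must be a positive number
        if (!int.TryParse(fromText, out fromPage) || fromPage < 1)
            return false;

        // an empty end page is passed on as 0, mirroring the sample above
        if (string.IsNullOrWhiteSpace(toText))
            return true;

        // an explicit end page must not precede the start page
        if (!int.TryParse(toText, out toPage) || toPage < fromPage)
            return false;

        return true;
    }
}
```

In the page code, PageRange.TryParse(textBoxFromPage.Text, textBoxToPage.Text, out fromPdfPageNumber, out toPdfPageNumber) could replace the two int.Parse calls, with an error message shown to the user when it returns false.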

See also the live demo for Text Extraction from PDF documents for a fully functional example.
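
The sample above builds the UTF-8 marker by hand from the bytes 0xEF, 0xBB, 0xBF. As a possible alternative, the same bytes can be obtained from the .NET Framework with Encoding.UTF8.GetPreamble(). The sketch below wraps this in a small helper. The Utf8TextFile class and its ToBytesWithMarker method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the HiQPdf API.
public static class Utf8TextFile
{
    // Returns the UTF-8 bytes of the text, prefixed with the UTF-8 marker.
    public static byte[] ToBytesWithMarker(string text)
    {
        // Encoding.UTF8.GetPreamble() returns the marker bytes EF BB BF
        byte[] utf8Marker = Encoding.UTF8.GetPreamble();
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

        // the marker first, followed by the encoded text
        byte[] bytes = new byte[utf8Marker.Length + utf8Bytes.Length];
        Buffer.BlockCopy(utf8Marker, 0, bytes, 0, utf8Marker.Length);
        Buffer.BlockCopy(utf8Bytes, 0, bytes, utf8Marker.Length, utf8Bytes.Length);
        return bytes;
    }
}
```

With this helper, the buffer written to the HTTP response could be produced with a single call such as byte[] bytes = Utf8TextFile.ToBytesWithMarker(text);.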

 
