Extract Text from PDF Documents in .NET Applications

With HiQPdf Library you can extract the text from PDF documents to a .NET System. String object using the PdfTextExtract class. You can set the text extraction mode with PdfTextExtract.TextExtractMode property and choose to keep the original positioning of the text in the PDF document or you can choose to extract the text in a layout more suitable for reading.

The C# sample code below shows how easy you can extract the text from existing PDF documents. With just a few lines of code you can obtain the text representation of a PDF document:

// get the PDF file
string pdfFile = Server.MapPath("~") + @"\DemoFiles\Pdf\InputPdf.pdf";

// create the PDF text extractor
PdfTextExtract pdfTextExtract = new PdfTextExtract();

// set the text extraction mode
pdfTextExtract.TextExtractMode = GetTextExtractMode();

int fromPdfPageNumber = int.Parse(textBoxFromPage.Text);
int toPdfPageNumber = textBoxToPage.Text.Length > 0 ? int.Parse(textBoxToPage.Text) : 0;

// extract the text from a range of pages of the PDF document
string text = pdfTextExtract.ExtractText(pdfFile, fromPdfPageNumber, toPdfPageNumber);

// get UTF-8 bytes
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

// the UTF-8 marker
byte[] utf8Marker = new byte[] { 0xEF, 0xBB, 0xBF };

// the text document bytes with UTF-8 marker followed by UTF-8 bytes
byte[] bytes = new byte[utf8Bytes.Length + utf8Marker.Length];
Array.Copy(utf8Marker, 0, bytes, 0, utf8Marker.Length);
Array.Copy(utf8Bytes, 0, bytes, utf8Marker.Length, utf8Bytes.Length);

// inform the browser about the data format
HttpContext.Current.Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");

// let the browser know how to open the text document and the text document name
HttpContext.Current.Response.AddHeader("Content-Disposition",
    String.Format("{0}; filename=ExtractedText.txt; size={1}", "attachment", bytes.Length.ToString()));

// write the text buffer to HTTP response
HttpContext.Current.Response.BinaryWrite(bytes);

// call End() method of HTTP response to stop ASP.NET page processing
HttpContext.Current.Response.End();

See also the live demo for Text Extraction from PDF documents for a fully functional example.

 

Convert HTML to PDF in ASP.NET and MVC with C# and VB.NET

The majority of the websites are already able to produce reports or to present various results in HTML pages. While the HTML content is simple generate and edit it is not suitable for printing or for transmission by email. The de facto standard for printing is the PDF format. The HiQPdf HTML to PDF Converter for .NET can be used in your .NET applications to transform any HTML page into a PDF document preserving the original aspect of the HTML document.

The HiQPdf Library for .NET offers you a modern, simple, fast, flexible and powerful tool to create complex and stylish PDF documents in your applications with just a few lines of code.

Using the high quality HTML to PDF conversion engine you can easily design a document in HTML with CSS3, JavaScript, SVG or Canvas and then convert it to PDF preserving the exact content and style.

The main features of the converter are listed below:

  • Convert HTML and HTML5 Documents and Web Pages to PDF
  • Convert URLs and HTML Strings to PDF Files or Memory Buffers
  • Set the PDF Page Size and Orientation
  • Fit HTML Content in PDF Page Size
  • Advanced Support for Web Fonts in .WOFF and .TTF Formats
  • Advanced Support for Scalar Vector Graphics (SVG)
  • Advanced Support for HTML5 and CSS3
  • Delayed Conversion Triggering Mode
  • Control PDF page breaks with page-break CSS attributes in HTML
  • Repeat HTML Table Header and Footer on Each PDF Page
  • Packaged and Delivered as a Zip Archive
  • No External Dependencies
  • Direct Copy Deployment Supported
  • ASP.NET and Windows Forms Samples, Complete Documentation
  • Supported on All Windows Versions

You can find all the HiQPdf Library for .NET Features with a brief description of each feature on product page.

html_to_pdf

The C# sample code below shows how easy you can create the PDF documents from existing HTML pages or HTML strings. With just a few lines of code you can get richly formatted PDF document:

protected void buttonConvertToPdf_Click(object sender, EventArgs e)
{
    // create the HTML to PDF converter
    HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

    // set browser width
    htmlToPdfConverter.BrowserWidth = int.Parse(textBoxBrowserWidth.Text);

    // set browser height if specified, otherwise use the default
    if (textBoxBrowserHeight.Text.Length > 0)
        htmlToPdfConverter.BrowserHeight = int.Parse(textBoxBrowserHeight.Text);

    // set HTML Load timeout
    htmlToPdfConverter.HtmlLoadedTimeout = int.Parse(textBoxLoadHtmlTimeout.Text);

    // set PDF page size and orientation
    htmlToPdfConverter.Document.PageSize = GetSelectedPageSize();
    htmlToPdfConverter.Document.PageOrientation = GetSelectedPageOrientation();

    // set the PDF standard used by the document
    htmlToPdfConverter.Document.PdfStandard = checkBoxPdfA.Checked ? PdfStandard.PdfA : PdfStandard.Pdf;

    // set PDF page margins
    htmlToPdfConverter.Document.Margins = new PdfMargins(5);

    // set whether to embed the true type font in PDF
    htmlToPdfConverter.Document.FontEmbedding = checkBoxFontEmbedding.Checked;

    // set triggering mode; for WaitTime mode set the wait time before convert
    switch (dropDownListTriggeringMode.SelectedValue)
    {
        case "Auto":
            htmlToPdfConverter.TriggerMode = ConversionTriggerMode.Auto;
            break;
        case "WaitTime":
            htmlToPdfConverter.TriggerMode = ConversionTriggerMode.WaitTime;
            htmlToPdfConverter.WaitBeforeConvert = int.Parse(textBoxWaitTime.Text);
            break;
        case "Manual":
            htmlToPdfConverter.TriggerMode = ConversionTriggerMode.Manual;
            break;
        default:
            htmlToPdfConverter.TriggerMode = ConversionTriggerMode.Auto;
            break;
    }

    // set header and footer
    SetHeader(htmlToPdfConverter.Document);
    SetFooter(htmlToPdfConverter.Document);

    // set the document security
    htmlToPdfConverter.Document.Security.OpenPassword = textBoxOpenPassword.Text;
    htmlToPdfConverter.Document.Security.AllowPrinting = checkBoxAllowPrinting.Checked;

    // set the permissions password too if an open password was set
    if (htmlToPdfConverter.Document.Security.OpenPassword != null && htmlToPdfConverter.Document.Security.OpenPassword != String.Empty)
        htmlToPdfConverter.Document.Security.PermissionsPassword = htmlToPdfConverter.Document.Security.OpenPassword + "_admin";

    // convert HTML to PDF
    byte[] pdfBuffer = null;

    if (radioButtonConvertUrl.Checked)
    {
        // convert URL to a PDF memory buffer
        string url = textBoxUrl.Text;

        pdfBuffer = htmlToPdfConverter.ConvertUrlToMemory(url);
    }
    else
    {
        // convert HTML code
        string htmlCode = textBoxHtmlCode.Text;
        string baseUrl = textBoxBaseUrl.Text;

        // convert HTML code to a PDF memory buffer
        pdfBuffer = htmlToPdfConverter.ConvertHtmlToMemory(htmlCode, baseUrl);
    }

    // inform the browser about the binary data format
    HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

    // let the browser know how to open the PDF document, attachment or inline, and the file name
    HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("{0}; filename=HtmlToPdf.pdf; size={1}",
        checkBoxOpenInline.Checked ? "inline" : "attachment", pdfBuffer.Length.ToString()));

    // write the PDF buffer to HTTP response
    HttpContext.Current.Response.BinaryWrite(pdfBuffer);

    // call End() method of HTTP response to stop ASP.NET page processing
    HttpContext.Current.Response.End();
}

You can find more HTML to PDF C# and VB.NET samples in the online demo.