Extract Text from PDF Documents in .NET Applications

With the HiQPdf Library you can extract the text from PDF documents to a .NET System.String object using the PdfTextExtract class. The PdfTextExtract.TextExtractMode property sets the text extraction mode: you can keep the original positioning of the text in the PDF document, or you can extract the text in a layout more suitable for reading.

The C# sample code below shows how easily you can extract the text from existing PDF documents. With just a few lines of code you can obtain the text representation of a PDF document:

// get the PDF file
string pdfFile = Server.MapPath("~") + @"\DemoFiles\Pdf\InputPdf.pdf";

// create the PDF text extractor
PdfTextExtract pdfTextExtract = new PdfTextExtract();

// set the text extraction mode
pdfTextExtract.TextExtractMode = GetTextExtractMode();

int fromPdfPageNumber = int.Parse(textBoxFromPage.Text);
int toPdfPageNumber = textBoxToPage.Text.Length > 0 ? int.Parse(textBoxToPage.Text) : 0;

// extract the text from a range of pages of the PDF document
string text = pdfTextExtract.ExtractText(pdfFile, fromPdfPageNumber, toPdfPageNumber);

// get UTF-8 bytes
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

// the UTF-8 marker
byte[] utf8Marker = new byte[] { 0xEF, 0xBB, 0xBF };

// the text document bytes with UTF-8 marker followed by UTF-8 bytes
byte[] bytes = new byte[utf8Bytes.Length + utf8Marker.Length];
Array.Copy(utf8Marker, 0, bytes, 0, utf8Marker.Length);
Array.Copy(utf8Bytes, 0, bytes, utf8Marker.Length, utf8Bytes.Length);

// inform the browser about the data format
HttpContext.Current.Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");

// let the browser know how to open the text document and the text document name
HttpContext.Current.Response.AddHeader("Content-Disposition",
    String.Format("{0}; filename=ExtractedText.txt; size={1}", "attachment", bytes.Length.ToString()));

// write the text buffer to HTTP response
HttpContext.Current.Response.BinaryWrite(bytes);

// call End() method of HTTP response to stop ASP.NET page processing
HttpContext.Current.Response.End();

See also the live demo for Text Extraction from PDF documents for a fully functional example.

Search Text In PDF Using HiQPdf Library

With the HiQPdf Library for .NET you can search for text in a PDF document using the SearchText() method of the PdfTextExtract class. The method's parameters let you choose whether to match the case and whether to match whole words only.

The C# code sample below shows how to search for text in an existing PDF document. The found text is then highlighted in the original PDF.

C# Code Sample to Search and Highlight Text in PDF

// get the PDF file
string pdfFile = Server.MapPath("~") + @"\DemoFiles\Pdf\InputPdf.pdf";

// get the text to search
string textToSearch = textBoxTextToSearch.Text;

// create the PDF text extractor
PdfTextExtract pdfTextExtract = new PdfTextExtract();

int fromPdfPageNumber = int.Parse(textBoxFromPage.Text);
int toPdfPageNumber = textBoxToPage.Text.Length > 0 ? int.Parse(textBoxToPage.Text) : 0;

// search the text in PDF document
PdfTextSearchItem[] searchTextInstances = pdfTextExtract.SearchText(pdfFile, textToSearch,
            fromPdfPageNumber, toPdfPageNumber, checkBoxMatchCase.Checked, checkBoxMatchWholeWord.Checked);

// load the PDF file to highlight the searched text
PdfDocument pdfDocument = PdfDocument.FromFile(pdfFile);

// highlight the searched text in PDF document
foreach (PdfTextSearchItem searchTextInstance in searchTextInstances)
{
    PdfRectangle pdfRectangle = new PdfRectangle(searchTextInstance.BoundingRectangle);

    // set rectangle color and opacity
    pdfRectangle.BackColor = Color.Yellow;
    pdfRectangle.Opacity = 30;

    // highlight the text
    pdfDocument.Pages[searchTextInstance.PdfPageNumber - 1].Layout(pdfRectangle);
}

// write the modified PDF document
try
{
    // write the PDF document to a memory buffer
    byte[] pdfBuffer = pdfDocument.WriteToMemory();

    // inform the browser about the binary data format
    HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

    // let the browser know how to open the PDF document and the file name
    HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("attachment; filename=SearchText.pdf; size={0}",
                pdfBuffer.Length.ToString()));

    // write the PDF buffer to HTTP response
    HttpContext.Current.Response.BinaryWrite(pdfBuffer);

    // call End() method of HTTP response to stop ASP.NET page processing
    HttpContext.Current.Response.End();
}
finally
{
    pdfDocument.Close();
}

You can find a live demo for searching and highlighting text in PDF on the product website.

Partially Convert an HTML Page to PDF

The HiQPdf HTML to PDF converter allows you to convert only a selected HTML element from the HTML document. The selected element can be, for example, a TABLE element or a DIV element containing other HTML elements.

This feature is useful when you want to convert only a part of the HTML document. For example, a web page usually has a header with menu and logo and a footer with contact information and copyright notice besides the main HTML content you want to convert to PDF. In order to convert only the main content of the document you can place the main content in a block element like a DIV or a TABLE and configure the converter to convert only that block element.

The HTML element to be converted is selected by the ConvertedHtmlElementSelector property. This property can be set to the CSS selector of the HTML element to be converted. For example, the #MyHtmlElement CSS selector will select the HTML element having the 'MyHtmlElement' ID in the document, and the *[class="ConvertibleElementStyle"] CSS selector will select only the HTML element having the 'ConvertibleElementStyle' CSS class. If a CSS selector matches several elements in the HTML document, only the first one will be converted. The values of the attributes in the CSS selectors are case sensitive. If this property is not set, the whole HTML document is converted.

C# Code Sample for Partially Converting an HTML Page to PDF

// create the HTML to PDF converter
HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

// convert only the HTML element having the MyHtmlElement ID 
htmlToPdfConverter.ConvertedHtmlElementSelector = "#MyHtmlElement";

You can test this feature live in the Convert Only a Selected Region of HTML Page demo.

Convert HTML with Web Fonts to PDF

Web Fonts give web designers great flexibility to create special text effects in an HTML document, because designers are no longer limited to the small set of fonts installed on the client computers displaying the document. Modern web browsers can download Web Fonts on the fly and use them to render the HTML document without installing those fonts on the local machine. The location from which the fonts can be downloaded is given in a CSS3 @font-face rule.

The HiQPdf HTML to PDF Converter has the capacity to convert HTML documents with Web Fonts. It offers support for TrueType fonts in .ttf files, OpenType fonts with TrueType Outlines in .otf files and Web Open Font Format (WOFF) fonts with TrueType Outlines in .woff files.

The Web Open Font Format (WOFF), as its name suggests, was designed to be used with web pages. It is based on a compression algorithm that makes the font files smaller and more suitable for distribution over a network. The WOFF format is currently supported by all major browsers (Firefox 3.6 and later versions, Google Chrome 6.0 and later versions, Internet Explorer 9 and later versions, Opera 11.10 and later versions, Safari 5.1 and later versions).

In the live demo for Converting HTML with Web Fonts to PDF you can learn how to define the web fonts in HTML using @font-face rules and see the C# code to convert such an HTML document to PDF.

Set Different Layouts for Screen and Print in HTML to PDF Converter for .NET

Using CSS media types, an HTML document can have one layout for screen, one for print and one for handheld devices. The @media rule allows different style rules for different media in the same style sheet of an HTML document.

By default the HTML to PDF converter renders the HTML document for screen, but this can be changed by assigning another media type to the MediaType property. For example, when this property is set to print, the CSS properties defined by the @media print rule are used when the HTML is rendered instead of the CSS properties defined by the @media screen rule.

Below is an HTML document that demonstrates how to define different styles for the screen and print media types.

<html> 
<head> 
    <title>
        HTML to PDF Rendering Changes Based on Selected Media Type
    </title> 
    <style type="text/css"> 
        body { 
            font-family: 'Arial'; 
            font-size: 16px; 
        } 

        @media screen 
        { 
            p 
            { 
                font-family: Verdana; 
                font-size: 14px; 
                font-style: italic; 
                color: Green; 
            } 
        } 
        @media print 
        { 
            p 
            { 
                font-family: 'Courier New';
                font-size: 12px; 
                color: Black; 
            } 
        } 
        @media screen,print 
        { 
            p 
            { 
                font-weight: bold; 
            } 
        } 
    </style> 
</head> 
<body> 
    <br /><br /> 
    The style of the paragraph below is changed based on 
        the selected media type:
    <br /><br /> 
    <p> 
        This is a media type selection test. When viewing on screen
        the text is bigger, italic and green. When printing the 
        text is smaller, normal and black.
    </p> 
</body> 
</html>

For more details and C# and VB.NET code samples please visit the HTML to PDF Conversion Media Types online demo.

HTML to PDF Conversion Triggering Modes

By default the HTML to PDF conversion starts immediately after the HTML document has been loaded in the converter. This default behavior is suitable for converting most HTML documents. However, there are situations when some JavaScript scripts continue to execute even after the document has been loaded. In this case it is necessary to configure the HiQPdf HTML to PDF Converter for .NET either to wait a predefined interval before starting the rendering to PDF, or to wait for the hiqPdfConverter.startConversion() method to be called manually from JavaScript.

Below is an example of an HTML document that requires manual triggering of the conversion. In the sample script we provide, a ticks counter is incremented every 30 ms after the document has been loaded. When the ticks count reaches 100, after about 3 seconds, the startConversion() method is called.

In the HTML document below you can also notice a call to the hiqPdfInfo.getVersion() JavaScript method. The converter exposes this method to the JavaScript code in the HTML document being converted, to offer information about the HiQPdf library version. The existence of the hiqPdfInfo object in JavaScript can be used to determine whether the document is currently loaded in the HiQPdf converter. For example, when the HTML document is loaded in the converter you can run a script to change the styles in the document.

<html>
<head>
    <title>Conversion Triggering Mode</title>
</head>
<body>
    <span style="font-family: Times New Roman; font-size: 10pt">When the triggering mode
        is 'Manual' the conversion is triggered by the call to <b>hiqPdfConverter.startConversion()</b>
        from JavaScript.<br />
        In this example document the startConversion() method is called when the ticks count
        reached 100 which happens in about 3 seconds.</span>
    <br />
    <br />
    <b>Ticks Count:</b> <span style="color: Red" id="ticks">0</span>
    <br />
    <br />
    <!-- display HiQPdf HTML converter version if the document is loaded in converter-->
    <span style="font-family: Times New Roman; font-size: 10pt">HiQPdf Info:
        <script type="text/javascript">
            // check if the document is loaded in HiQPdf HTML to PDF Converter
            if (typeof hiqPdfInfo == "undefined") {
                // hiqPdfInfo object is not defined and the document is loaded in a browser
                document.write("Not in HiQPdf");
            }
            else {
                // hiqPdfInfo object is defined and the document is loaded in converter
                document.write(hiqPdfInfo.getVersion());
            }
        </script>
    </span>
    <br />
    <script type="text/javascript">
        var ticks = 0;
        function tick() {
            // increment ticks count
            ticks++;

            var ticksElement = document.getElementById("ticks");
            // set ticks count
            ticksElement.innerHTML = ticks;
            if (ticks == 100) {
                // trigger conversion
                ticksElement.style.color = "green";
                hiqPdfConverter.startConversion();
            }
            else {
                // wait one more tick
                setTimeout(tick, 30);
            }
        }

        tick();
    </script>
</body>
</html>
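
The sample document above calls hiqPdfConverter.startConversion() unconditionally, so the same page opened in a regular browser, where that object is presumably not defined, would raise a script error once the counter reaches 100. The JavaScript below is a minimal sketch of a guarded trigger built on the hiqPdfInfo check described earlier. The function name startConversionIfInConverter and its scope parameter are hypothetical and not part of the HiQPdf API; scope stands in for the global window object so that the logic can also be tried with plain objects.

```javascript
// Minimal sketch (not part of the HiQPdf API): call startConversion() only
// when the objects exposed by the converter are present.
// "scope" is a hypothetical parameter standing in for the global object
// (window in a page), so the function can be exercised with plain objects.
function startConversionIfInConverter(scope) {
    // hiqPdfInfo is used here to detect that the page runs in the converter
    var inConverter = typeof scope.hiqPdfInfo !== "undefined";

    // make sure the object that triggers the conversion is available as well
    var canTrigger = typeof scope.hiqPdfConverter !== "undefined";

    if (inConverter && canTrigger) {
        scope.hiqPdfConverter.startConversion();
        return true; // the conversion was triggered
    }

    return false; // not in the converter: nothing to trigger
}
```

In the sample document, the call made when the ticks count reaches 100 could then be written as startConversionIfInConverter(window), which leaves the page usable in a browser as well.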

There are three conversion triggering modes: Auto, WaitTime and Manual. For the sample document above they behave as follows.

When the triggering mode is Manual, the call to startConversion() will trigger the conversion.

When the triggering mode is WaitTime, a wait time of 5 seconds is sufficient to allow the ticks count to reach 100.

When the triggering mode is Auto, the conversion will start before the counter reaches 100.
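
The 5-second figure can be checked with a little arithmetic. In the sample script the first tick() call runs immediately, so reaching 100 ticks takes 99 further timeouts of 30 ms each, about 2970 ms. The JavaScript sketch below does this calculation. The function name estimateWaitSeconds and the safety margin are assumptions made for illustration, it assumes the wait time is given in whole seconds as the figure above suggests, and the result is only an estimate because timers may fire later than requested.

```javascript
// Sketch: estimate a wait time, in whole seconds, for the WaitTime mode.
// estimateWaitSeconds and marginSeconds are illustrative names, not HiQPdf API.
function estimateWaitSeconds(targetTicks, intervalMs, marginSeconds) {
    // the first tick happens immediately; each later tick costs one timeout
    var scriptMs = (targetTicks - 1) * intervalMs;

    // round up to whole seconds and add a margin, since timers can run late
    return Math.ceil(scriptMs / 1000) + marginSeconds;
}

// values from the sample script: 100 ticks, 30 ms apart, 2 seconds of margin
var waitSeconds = estimateWaitSeconds(100, 30, 2); // 3 + 2 = 5
```

With no margin the estimate is 3 seconds, which matches the "about 3 seconds" stated for the sample script; a 2-second margin gives the 5 seconds mentioned for the WaitTime mode.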

private void buttonCreatePdf_Click(object sender, EventArgs e)
{
    // create the HTML to PDF converter
    HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

    Cursor = Cursors.WaitCursor;

    string pdfFile = Application.StartupPath + @"\DemoOutput\TriggeringMode.pdf";
    try
    {
        // set triggering mode; for WaitTime mode set the wait time before convert
        switch (comboBoxTriggeringMode.SelectedItem.ToString())
        {
            case "Auto":
                htmlToPdfConverter.TriggerMode = ConversionTriggerMode.Auto;
                break;
            case "WaitTime":
                htmlToPdfConverter.TriggerMode = ConversionTriggerMode.WaitTime;
                htmlToPdfConverter.WaitBeforeConvert = int.Parse(textBoxWaitTime.Text);
                break;
            case "Manual":
                htmlToPdfConverter.TriggerMode = ConversionTriggerMode.Manual;
                break;
            default:
                htmlToPdfConverter.TriggerMode = ConversionTriggerMode.Auto;
                break;
        }

        // convert the HTML code to a PDF file
        htmlToPdfConverter.ConvertHtmlToFile(textBoxHtmlCode.Text, null, pdfFile);
    }
    catch (Exception ex)
    {
        MessageBox.Show(String.Format("Conversion failed. {0}", ex.Message));
        return;
    }
    finally
    {
        Cursor = Cursors.Arrow;
    }

    // open the PDF document
    try
    {
        System.Diagnostics.Process.Start(pdfFile);
    }
    catch (Exception ex)
    {
        MessageBox.Show(String.Format("Conversion succeeded but cannot open '{0}'. {1}", pdfFile, ex.Message));
    }
}

For more details and C# and VB.NET code samples please visit the HTML to PDF Conversion Triggering Modes online demo.

Adding Many HTML Documents to the Same PDF Document

The HiQPdf HTML to PDF Converter for .NET allows you to convert multiple HTML pages into the same PDF document. The HTML content from one document can immediately follow the content from the previous HTML document, or it can start on a new page.

This feature can also be used when converting very large HTML documents. It is more efficient to split the large HTML document into sections and add each section separately to the PDF document.

Basically, the approach is to create a PdfDocument object and then, for each HTML document, create a PdfHtml object to be added to the PdfDocument. The position where the previous PdfHtml object ended can be determined from the PdfLayoutInfo object returned after adding the previous PdfHtml object to the document.

When starting each PdfHtml object on a new page, you set the PDF page size, orientation and margins. These page settings will be inherited by all the pages generated automatically during the HTML to PDF conversion.

The C# sample code for this feature is:

using System;
using System.Data;
using System.Configuration;
using System.Collections;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;

using HiQPdf;

namespace HiQPdf_Demo
{
    public partial class MultipleHtmlLayers : System.Web.UI.Page
    {
        protected void buttonCreatePdf_Click(object sender, EventArgs e)
        {
            // create an empty PDF document
            PdfDocument document = new PdfDocument();

            // set a demo serial number
            document.SerialNumber = "YCgJMTAE-BiwJAhIB-EhlWTlBA-UEBRQFBA-U1FOUVJO-WVlZWQ==";

            // add a page to document
            PdfPage page1 = document.AddPage(PdfPageSize.A4, new PdfDocumentMargins(5), PdfPageOrientation.Portrait);

            try
            {
                // set the document header and footer before adding any objects to document
                SetHeader(document);
                SetFooter(document);

                // layout the HTML from URL 1
                PdfHtml html1 = new PdfHtml(textBoxUrl1.Text);
                html1.WaitBeforeConvert = 2;
                PdfLayoutInfo html1LayoutInfo = page1.Layout(html1);

                // determine the PDF page where to add URL 2
                PdfPage page2 = null;
                System.Drawing.PointF location2 = System.Drawing.PointF.Empty;
                if (checkBoxNewPage.Checked)
                {
                    // URL 2 is laid out on a new page with the selected orientation
                    page2 = document.AddPage(PdfPageSize.A4, new PdfDocumentMargins(5), GetSelectedPageOrientation());
                    location2 = System.Drawing.PointF.Empty;
                }
                else
                {
                    // URL 2 is laid out immediately after URL 1 and html1LayoutInfo
                    // gives the location where the URL 1 layout finished
                    page2 = document.Pages[html1LayoutInfo.LastPageIndex];
                    location2 = new System.Drawing.PointF(html1LayoutInfo.LastPageRectangle.X, html1LayoutInfo.LastPageRectangle.Bottom);
                }

                // layout the HTML from URL 2
                PdfHtml html2 = new PdfHtml(location2.X, location2.Y, textBoxUrl2.Text);
                html2.WaitBeforeConvert = 2;
                page2.Layout(html2);

                // write the PDF document to a memory buffer
                byte[] pdfBuffer = document.WriteToMemory();

                // inform the browser about the binary data format
                HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

                // let the browser know how to open the PDF document and the file name
                HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("attachment; filename=LayoutMultipleHtml.pdf; size={0}",
                            pdfBuffer.Length.ToString()));

                // write the PDF buffer to HTTP response
                HttpContext.Current.Response.BinaryWrite(pdfBuffer);

                // call End() method of HTTP response to stop ASP.NET page processing
                HttpContext.Current.Response.End();
            }
            finally
            {
                document.Close();
            }
        }

        private void SetHeader(PdfDocument document)
        {
            if (!checkBoxAddHeader.Checked)
                return;

            // create the document header
            document.CreateHeaderCanvas(50);

            // add PDF objects to the header canvas
            string headerImageFile = Server.MapPath("~") + @"\DemoFiles\Images\HiQPdfLogo.png";
            PdfImage logoHeaderImage = new PdfImage(5, 5, 40, System.Drawing.Image.FromFile(headerImageFile));
            document.Header.Layout(logoHeaderImage);

            // layout HTML in header
            PdfHtml headerHtml = new PdfHtml(50, 5, @"<span style=""color:Navy; font-family:Times New Roman; font-style:italic"">
                            Quickly Create High Quality PDFs with </span><a href=""http://www.hiqpdf.com"">HiQPdf</a>", null);
            headerHtml.FitDestHeight = true;
            headerHtml.FontEmbedding = true;
            document.Header.Layout(headerHtml);

            // create a border for header
            float headerWidth = document.Header.Width;
            float headerHeight = document.Header.Height;
            PdfRectangle borderRectangle = new PdfRectangle(1, 1, headerWidth - 2, headerHeight - 2);
            borderRectangle.LineStyle.LineWidth = 0.5f;
            borderRectangle.ForeColor = System.Drawing.Color.Navy;
            document.Header.Layout(borderRectangle);
        }

        private void SetFooter(PdfDocument document)
        {
            if (!checkBoxAddFooter.Checked)
                return;

            //create the document footer
            document.CreateFooterCanvas(50);

            // layout HTML in footer
            PdfHtml footerHtml = new PdfHtml(5, 5, @"<span style=""color:Navy; font-family:Times New Roman; font-style:italic"">
                            Quickly Create High Quality PDFs with </span><a href=""http://www.hiqpdf.com"">HiQPdf</a>", null);
            footerHtml.FitDestHeight = true;
            footerHtml.FontEmbedding = true;
            document.Footer.Layout(footerHtml);


            float footerHeight = document.Footer.Height;
            float footerWidth = document.Footer.Width;

            // add page numbering
            System.Drawing.Font pageNumberingFont = new System.Drawing.Font(new System.Drawing.FontFamily("Times New Roman"), 8, System.Drawing.GraphicsUnit.Point);
            PdfText pageNumberingText = new PdfText(5, footerHeight - 12, "Page {CrtPage} of {PageCount}", pageNumberingFont);
            pageNumberingText.HorizontalAlign = PdfTextHAlign.Center;
            pageNumberingText.EmbedSystemFont = true;
            pageNumberingText.ForeColor = System.Drawing.Color.DarkGreen;
            document.Footer.Layout(pageNumberingText);

            string footerImageFile = Server.MapPath("~") + @"\DemoFiles\Images\HiQPdfLogo.png";
            PdfImage logoFooterImage = new PdfImage(footerWidth - 40 - 5, 5, 40, System.Drawing.Image.FromFile(footerImageFile));
            document.Footer.Layout(logoFooterImage);

            // create a border for footer
            PdfRectangle borderRectangle = new PdfRectangle(1, 1, footerWidth - 2, footerHeight - 2);
            borderRectangle.LineStyle.LineWidth = 0.5f;
            borderRectangle.ForeColor = System.Drawing.Color.DarkGreen;
            document.Footer.Layout(borderRectangle);
        }

        private PdfPageOrientation GetSelectedPageOrientation()
        {
            return (dropDownListPageOrientations.SelectedValue == "Portrait") ?
                PdfPageOrientation.Portrait : PdfPageOrientation.Landscape;
        }

        protected void Page_Load(object sender, EventArgs e)
        {
            if (!IsPostBack)
            {
                panelNewPageOrientation.Visible = checkBoxNewPage.Checked;

                Master.SelectNode("multipleHtmlLayers");
            }
        }

        protected void checkBoxNewPage_CheckedChanged(object sender, EventArgs e)
        {
            panelNewPageOrientation.Visible = checkBoxNewPage.Checked;
        }
    }
}

The live demo for converting many HTML pages to the same PDF can be accessed on the product's online demonstration website.