Atalasoft DotImage
Released: 04/18/2010

Automatically Creating Bookmarks in a Searchable PDF

PDF documents may include a tree of bookmarks. Bookmarks are flexible in what they can do, but they typically represent a table of contents that lets a reader navigate quickly through the document. Using the Atalasoft OCR module with the PDF Translator, it is possible to author these bookmarks automatically.

The Atalasoft OcrEngine object has an event model that allows client code to execute while OCR is being performed on a document. These events can be leveraged to generate bookmarks within a document as it is being translated into a searchable PDF.

Doing this requires keeping track of the current page number and hooking into two of the engine’s events: document progress and page constructed. When the engine fires the document progress event indicating that a document has started, we reset our page count. When the engine fires the page constructed event, we consider adding a bookmark for that page.

Hooking into the events takes only a few lines of code:

// _engine, _translator, and _pageCount are fields of the enclosing class.
_engine = new GlyphReaderEngine();
_translator = new PdfTranslator();
_engine.Translators.Add(_translator);
_engine.DocumentProgress += (s, e) => {
    if (e.Stage == OcrDocumentStage.BeginDocument)
    {
        _pageCount = 0;
        _translator.BookmarkTree.Bookmarks.Clear();
    }
};

_engine.PageConstructed += new OcrPageConstructionEventHandler(MyPageConstructed);

This code creates a new OCR engine (here a GlyphReaderEngine), creates a PDF Translator, and adds the translator to the engine. It hooks into the DocumentProgress event with a lambda expression that looks for the BeginDocument stage and, when it sees it, sets the page count to 0 and clears any existing bookmarks. Finally, it hooks MyPageConstructed into the PageConstructed event.
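The reset matters when one engine translates several documents in a row. The following is a minimal sketch of that behavior using a stand-in engine: FakeEngine, its two events, and PageCounterDemo are hypothetical names made up for illustration and are not part of DotImage. It only shows how a counter that is cleared at the start of each document yields page indices that restart at 0.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for an OCR engine. Hypothetical: this is NOT the DotImage API.
public class FakeEngine
{
    public event EventHandler DocumentStarted;
    public event EventHandler PageConstructed;

    // Fires one "document started" event, then one "page constructed"
    // event per page.
    public void Translate(int pages)
    {
        if (DocumentStarted != null)
            DocumentStarted(this, EventArgs.Empty);
        for (int i = 0; i < pages; i++)
        {
            if (PageConstructed != null)
                PageConstructed(this, EventArgs.Empty);
        }
    }
}

public static class PageCounterDemo
{
    // Translates each document in turn (one entry per document, giving its
    // page count) and records the zero-based page index seen by the
    // page handler.
    public static List<int> Run(params int[] documents)
    {
        int pageCount = 0;
        List<int> indices = new List<int>();
        FakeEngine engine = new FakeEngine();

        // Reset at the start of every document, as in the article's handler.
        engine.DocumentStarted += (s, e) => { pageCount = 0; };

        // Increment first, then use pageCount - 1 as the page index.
        engine.PageConstructed += (s, e) =>
        {
            pageCount++;
            indices.Add(pageCount - 1);
        };

        foreach (int pages in documents)
            engine.Translate(pages);
        return indices;
    }
}
```

Calling PageCounterDemo.Run(2, 3) records the indices 0, 1 for the first document and 0, 1, 2 for the second. Without the reset, the second document's pages would be numbered 2, 3, 4, and its bookmarks would point at the wrong pages.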

MyPageConstructed does all the work for us. In this case, it tries to find a chapter heading on the page and, if it finds one, adds a bookmark to the current tree.

void MyPageConstructed(object sender, OcrPageConstructionEventArgs e)
{
    _pageCount++;
    string bookmarkText = GetBookmarkText(e.Page);
    if (bookmarkText == null)
        return;
    PdfDestination destination = PdfDestination.FitPage();
    destination.Page = new PdfIndexedPageReference(_pageCount - 1);
    PdfGoToViewAction action = new PdfGoToViewAction(destination);
    PdfBookmark bookmark = new PdfBookmark(bookmarkText, action);
    _translator.BookmarkTree.Bookmarks.Add(bookmark);
}

This code first increments the page count and calls a method to get the bookmark text. If that method returns null, no bookmark is made for the page. Otherwise we build one. A bookmark consists of text describing it and one or more actions to take when the bookmark is clicked. For our action, we use a “Go to View” action, the most common type. It includes a destination: a target page plus a description of the view on that page. Here we use the “fit page” destination and point it at the current page. Because the count has already been incremented, the page reference is built from _pageCount - 1, so the first page is referenced as 0. We then build the action from the destination, make the bookmark from the action and the text, and add the bookmark to the translator’s bookmark tree.

The last piece of the puzzle is deciding whether to create a bookmark at all. For the purposes of this example, I created a simple method that looks for a single line of text on a page in the form “Chapter xyz”, where “xyz” is a number. The code walks through a recognized page and searches it for this pattern. Clearly, this is a brute-force approach that would need to be more thorough in a production environment, but it is good enough for the sake of demonstration.

string GetBookmarkText(OcrPage page)
{
    foreach (OcrRegion region in page.Regions)
    {
        OcrTextRegion textRegion = region as OcrTextRegion;
        if (textRegion == null)
            continue;
        foreach (OcrLine line in textRegion.Lines)
        {
            int chapterNumber;
            if (line.Words.Count == 2 &&
                line.Words[0].Text == "Chapter" &&
                Int32.TryParse(line.Words[1].Text, out chapterNumber))
                    return line.Text;
        }
    }
    return null;
}

The code loops over all regions on the page and, when it finds a text region, looks for lines that contain exactly two words: “Chapter” and a number. If this test succeeds, the entire text of the line is returned. The first matching line wins, and a page with no match returns null.
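As one possible step toward a more thorough test, the matching could be done on the line's text with a regular expression. This is a sketch under assumptions, not the article's method: ChapterHeading and TryMatch are hypothetical names, and the choices to ignore case, tolerate extra whitespace, and accept only one to four unsigned digits are illustrative. It depends only on the .NET base class library, not on DotImage.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of DotImage: decides whether a line of
// recognized text looks like a chapter heading such as "Chapter 12".
public static class ChapterHeading
{
    // The word "chapter", whitespace, then 1 to 4 ASCII digits, with
    // nothing else on the line apart from surrounding whitespace.
    private static readonly Regex Pattern = new Regex(
        @"^\s*chapter\s+([0-9]{1,4})\s*$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns true and sets chapterNumber when lineText is a heading;
    // otherwise returns false and sets chapterNumber to 0.
    public static bool TryMatch(string lineText, out int chapterNumber)
    {
        chapterNumber = 0;
        if (string.IsNullOrEmpty(lineText))
            return false;

        Match match = Pattern.Match(lineText);
        if (!match.Success)
            return false;

        // At most four ASCII digits, so this cannot overflow or fail.
        chapterNumber = int.Parse(match.Groups[1].Value,
                                  CultureInfo.InvariantCulture);
        return true;
    }
}
```

Inside GetBookmarkText, the two-word test could then be replaced by a call such as ChapterHeading.TryMatch(line.Text, out chapterNumber). Unlike Int32.TryParse with its default settings, this pattern rejects a signed value such as “+5”, and unlike the original string comparison it accepts “CHAPTER 7”. Whether that is the right trade-off depends on the documents being scanned.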

From this example, we can see that relatively simple code can be used to make PDF documents that are richer in content and easier to navigate.
