Fixing the table of contents bookmarks in the book “Speech and Language Processing (3rd ed. draft)” by Dan Jurafsky and James H. Martin

A draft of the new edition of the book "Speech and Language Processing (3rd ed. draft)" by Dan Jurafsky and James H. Martin is available on the authors' website. Although it is produced using pdflatex with hyperref, the table of contents (TOC) bookmarks have an incorrect hierarchy. The TOC printed at the beginning of the book is correct, however, so it can be used to regenerate the bookmarks.

The first step is to use pdftotext to extract the text of the TOC while preserving its layout (position, spacing, line breaks, etc.) as much as possible. I played with the cropping parameters and was able to extract the TOC:
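As a rough sketch of this extraction step, the helper below builds a pdftotext command line that keeps the physical layout (`-layout`), restricts the page range (`-f`, `-l`) and optionally crops each page (`-x`, `-y`, `-W`, `-H`). The function names, file names, page range and crop box are all assumptions made for illustration; the values that work for the book have to be found by trial and error.

```python
import subprocess


def pdftotext_command(pdf_path, out_path, first_page, last_page, crop=None):
    """Build a pdftotext command that preserves the page layout.

    crop is an optional (x, y, width, height) tuple in points.
    """
    cmd = ["pdftotext", "-layout", "-f", str(first_page), "-l", str(last_page)]
    if crop is not None:
        x, y, width, height = crop
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    cmd += [pdf_path, out_path]
    return cmd


def extract_toc(pdf_path, out_path, first_page, last_page, crop=None):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(
        pdftotext_command(pdf_path, out_path, first_page, last_page, crop),
        check=True,
    )


# Hypothetical usage (file names, pages and crop box are placeholders):
# extract_toc("book.pdf", "toc.txt", 3, 8, crop=(50, 60, 500, 680))
```

Keeping the command construction separate from the call makes it easy to print the command and tweak the crop box before running it.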

The TOC looks like this (only a small part is shown):

Now we need to parse it and create a bookmark list file for use with cpdf (the file format is described in the cpdf manual). AWK may be a good solution, but I used Python, which I am more familiar with:
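The parsing idea can be sketched as follows. This is a minimal illustration, not the script used for the book: it assumes every TOC entry sits on one line of the form "section number, title, optional dot leaders, page number", derives the bookmark level from the number of dots in the section number, and writes cpdf bookmark lines of the form `level "title" page [open]`. The names `TOC_LINE`, `toc_to_cpdf`, `page_offset` and `open_all` are hypothetical, as is the backslash escaping of quotes in titles.

```python
import re

# Assumed shape of a TOC line: "2.1   Some Title . . . . 17"
TOC_LINE = re.compile(r"^\s*(\d+(?:\.\d+)*)\s+(.*?)[\s.]+(\d+)\s*$")


def toc_to_cpdf(lines, page_offset=0, open_all=False):
    """Convert extracted TOC lines into cpdf bookmark lines.

    page_offset is added to each printed page number to obtain the PDF page
    (front matter usually shifts the numbering).
    """
    bookmarks = []
    for line in lines:
        match = TOC_LINE.match(line)
        if match is None:
            continue  # skip headings, blank lines and anything unrecognised
        number, title, page = match.groups()
        level = number.count(".")  # "3" -> 0, "3.2" -> 1, "3.2.1" -> 2
        text = f"{number} {title.strip()}"
        # Assumption: backslashes and quotes in titles are backslash-escaped
        text = text.replace("\\", "\\\\").replace('"', '\\"')
        entry = f'{level} "{text}" {int(page) + page_offset}'
        if open_all:
            entry += " open"
        bookmarks.append(entry)
    return bookmarks


# Made-up sample lines, not the book's actual TOC
sample = [
    "Contents",
    "1   First Chapter . . . . . . . . 1",
    "    1.1   A Section . . . . . . 3",
]
print("\n".join(toc_to_cpdf(sample, page_offset=10)))
```

Entries that wrap over two lines, or unnumbered items such as a bibliography, would not match this pattern and need separate handling.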

Python 3 is required to run the program. Bookmarks will open at level 0 when the PDF file is viewed (see LaTeX hyperref's "bookmarksopenlevel"). Let's run the script:

If you need all bookmarks to be opened, then

should be replaced with

The output cpdf-bookmark file looks like this:

Now we can update the bookmarks in the book:
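For completeness, here is a small sketch of this last step driven from Python. It relies on cpdf's `-add-bookmarks` operation; the helper names and the file names in the usage comment are placeholders chosen for illustration.

```python
import subprocess


def cpdf_add_bookmarks_command(bookmark_file, pdf_in, pdf_out):
    """Build the cpdf command that applies a bookmark file to a PDF."""
    return ["cpdf", "-add-bookmarks", bookmark_file, pdf_in, "-o", pdf_out]


def add_bookmarks(bookmark_file, pdf_in, pdf_out):
    """Run cpdf; raises CalledProcessError if cpdf reports a failure."""
    subprocess.run(
        cpdf_add_bookmarks_command(bookmark_file, pdf_in, pdf_out),
        check=True,
    )


# Hypothetical usage (file names are placeholders):
# add_bookmarks("bookmarks.txt", "book.pdf", "book-fixed.pdf")
```

Writing to a new output file keeps the original PDF untouched in case the bookmark file needs another round of fixes.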

The book looks nicer now: