Fixing the table of contents bookmarks in the book “Speech and Language Processing (3rd ed. draft)” by Dan Jurafsky and James H. Martin

A draft of the third edition of "Speech and Language Processing" by Dan Jurafsky and James H. Martin is available on the authors' website. Although the PDF is produced using pdflatex with hyperref, its table of contents (TOC) bookmarks have an incorrect hierarchy. The TOC printed at the beginning of the book is correct, however, so it can be used to regenerate the bookmarks.

The first step is to use pdftotext to extract the text of the TOC while preserving the layout (location, spacing, line breaks, etc.) as much as possible. After some experimenting with the cropping parameters, I was able to extract the TOC:

# Download the book
[johndoe@ArchLinux]% wget -vc "https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf"

# Install pdftotext
[johndoe@ArchLinux]% sudo pacman -S poppler

# Convert pages 3-8 (table of contents) to text; with no output name given, the result is written to ed3book.txt
# -fixed number : Assume fixed-pitch (or tabular) text, with the specified character width (in points).  This forces physical layout mode.
# -x number : Specifies the x-coordinate of the crop area top left corner
# -y number : Specifies the y-coordinate of the crop area top left corner
# -W number : Specifies the width of crop area in pixels (default is 0)
# -H number : Specifies the height of crop area in pixels (default is 0)
# -r number : Specifies the resolution, in DPI.  The default is 72 DPI.
[johndoe@ArchLinux]% pdftotext -f 3 -l 8  -fixed 4 -x 20 -H 620 -W 800 -y 100 -r 72 ed3book.pdf
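
When experimenting with the cropping parameters, it can help to build the command programmatically. The following is a minimal sketch, not part of the original workflow: the helper names `crop_box_in_points` and `pdftotext_command` are hypothetical, only the flags already listed above are used, and it assumes (as the option descriptions suggest) that the crop values are pixels at the resolution given with -r, so that at 72 DPI one pixel equals one PDF point.

```python
import shlex


def crop_box_in_points(x, y, width, height, dpi=72):
    """Convert a crop box given in pixels at `dpi` to PDF points (1/72 inch)."""
    scale = 72 / dpi
    return tuple(v * scale for v in (x, y, width, height))


def pdftotext_command(pdf, first, last, char_width, x, y, width, height, dpi=72):
    """Build the argument list for the pdftotext call shown above."""
    return [
        "pdftotext",
        "-f", str(first), "-l", str(last),
        "-fixed", str(char_width),
        "-x", str(x), "-y", str(y),
        "-W", str(width), "-H", str(height),
        "-r", str(dpi),
        pdf,
    ]


cmd = pdftotext_command("ed3book.pdf", 3, 8, 4, 20, 100, 800, 620)
print(shlex.join(cmd))
# Pass `cmd` to subprocess.run(cmd, check=True) to actually run the extraction.
```

Keeping the values in one place makes it easier to try a different crop box or page range and compare the extracted text.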

The extracted TOC looks like this (only a small part is shown):

Contents
                                   1   Introduction                                                                  9

                                   2   Regular Expressions, Text Normalization, Edit Distance                       10
                                       2.1    Regular Expressions   . .. . . . . . .. . . . . . . .. . . . . .. .   11
                                       2.2    Words    .. . . . . . . .. . . . . . .. . . . . . . .. . . . . .. .   19
                                       2.3    Corpora   . . . . . . . .. . . . . . .. . . . . . . .. . . . . .. .   21
                                       2.4    Text Normalization    . .. . . . . . .. . . . . . . .. . . . . .. .   22
                                       2.5    Minimum Edit Distance    . . . . . . .. . . . . . . .. . . . . .. .   30
                                       2.6    Summary .   . . . . . . .. . . . . . .. . . . . . . .. . . . . .. .   34
                                       Bibliographical and Historical Notes  . . . .. . . . . . . .. . . . . .. .   34
                                       Exercises   . . .. . . . . . . .. . . . . . .. . . . . . . .. . . . . .. .   35

Now we need to parse it and create a bookmark list file (details are here) for use with cpdf. AWK may be a good fit for this, but I used Python, which I am more familiar with:

# This bookmark creator is written for the table of contents of the
# "Speech and Language Processing" book (3rd ed. draft) by Dan Jurafsky and James H. Martin
# Link: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

import os
import re
import sys

import pandas as pd

with open(sys.argv[1]) as f:
    lines = f.readlines()

# Rows are collected in a list because DataFrame.append was removed in pandas 2.0
rows = []
Str = re.sub(r'\s{2,}', ' ', lines[0].strip())
Str = Str.split(' ')
# Add "Contents" bookmark with level 1 on page 2
rows.append({'Bookmark': '{:>2s} '.format(Str[0]), 'Page': str(2), 'Level': 1})
for line in lines[1:-1]:
    # Replace each run of dots that is followed by white space with a single space
    Str = re.sub(r'\.+\s', ' ', line.strip())
    # Collapse the white space in front of a number (e.g. the page number) into a single space
    Str = re.sub(r'\s{2,}([0-9]+)', r' \1', Str)
    # Split by white space
    Str = Str.split(' ')
    # Merge all elements except the last one (the page number) using white space
    Bookmark_Str = ' '.join(Str[:-1])
    # Replace two or more consecutive white spaces with a single white space
    Bookmark_Str = re.sub(r'\s{2,}', ' ', Bookmark_Str)
    # Add leading spaces to single-digit chapters ("1 Introduction" -> "  1 Introduction")
    Bookmark_Str = re.sub(r'^([0-9])\s', r'  \1 ', Bookmark_Str)
    # Replace the typographic apostrophe with the ASCII one
    Bookmark_Str = Bookmark_Str.replace('’', "'")
    # Print the bookmark
    print(Bookmark_Str)
    # Find level 1 chapters: these are
    #                       one or two digit integer followed by space and preceded by zero to two spaces
    #                       "A", "B", "C" followed by space and preceded by zero to two spaces
    #                       Appendix, Author Index, Subject Index, Bibliography
    if re.match(r'(\s{0,2}([0-9]{1,2}|[A-C]{1})\s|Appendix|Author Index|Subject Index|Bibliography)', Bookmark_Str):
        cols = {'Bookmark': Bookmark_Str, 'Page': Str[-1], 'Level': 1}
    else:
        # The rest are sections, not chapters
        cols = {'Bookmark': Bookmark_Str, 'Page': Str[-1], 'Level': 2}
    # Skip empty lines
    if cols["Bookmark"] != "":
        rows.append(cols)

df = pd.DataFrame(rows, columns=['Bookmark', 'Page', 'Level'])
df['Level'] = df['Level'].astype('str')
Str = ""
for index, row in df.iterrows():
    Str = Str + df.loc[index, "Level"] + " \"" + df.loc[index, "Bookmark"] + "\" " + df.loc[index, "Page"]
    # This mimics LaTeX hyperref's "bookmarksopenlevel" property. Use df.loc[index, "Level"] == "1" to unfold all sections
    if df.loc[index, "Level"] == "0":
        Str = Str + " open\n"
    else:
        Str = Str + "\n"

# Write the bookmark list next to the input file, with the ".info" extension
Output_File = os.path.splitext(sys.argv[1])[0] + ".info"
with open(Output_File, "w") as text_file:
    text_file.write(Str)
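
The core of the parsing step can also be expressed as one regular expression per line, without pandas. The following is a minimal sketch of that idea, not the script used in this post: the names `TOC_LINE`, `CHAPTER` and `parse_toc_line` are hypothetical, and the sketch neither pads single-digit chapter numbers nor replaces typographic apostrophes.

```python
import re

# Chapter titles start with a chapter number or an appendix letter,
# or with one of the unnumbered top-level headings.
CHAPTER = re.compile(r"^(\d{1,2}|[A-C])\s|^(Appendix|Author Index|Subject Index|Bibliography)")
# Title, optional dot leaders, then the page number at the end of the line.
TOC_LINE = re.compile(r"^(?P<title>.*?)[\s.]*\s(?P<page>\d+)$")


def parse_toc_line(line):
    """Return (level, title, page) for one TOC line, or None if it is not an entry."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    # Collapse the runs of spaces left over from the fixed-pitch layout
    title = re.sub(r"\s{2,}", " ", match.group("title")).strip()
    if not title:
        return None
    level = 1 if CHAPTER.match(title) else 2
    return level, title, int(match.group("page"))


print(parse_toc_line("   2.1    Regular Expressions   . .. . . . . . .. .   11"))
print(parse_toc_line("   1   Introduction                                    9"))
```

Lines without a trailing page number, such as the "Contents" heading or empty lines, yield None and can simply be skipped.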

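The script requires Python 3. It marks only level 0 entries as open (compare LaTeX hyperref's "bookmarksopenlevel" option); since the generated levels are 1 and 2, no entry is marked open and the bookmark tree starts collapsed when the PDF file is viewed. Let's run the script: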

[johndoe@ArchLinux]% python TOC2CPDF_Bookmarks.py ed3book.txt

If you want the chapter bookmarks to be unfolded so that their sections are visible, then

df.loc[index, "Level"] == "0"

should be replaced with

df.loc[index, "Level"] == "1"
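
The effect of this comparison can be isolated in a small helper. This is only an illustrative sketch: the function `bookmark_line` and its `open_level` parameter are hypothetical names and do not appear in the script above.

```python
def bookmark_line(level, title, page, open_level=None):
    """Format one bookmark entry as: level "title" page [open]."""
    line = f'{level} "{title}" {page}'
    # Only entries at the chosen level get the "open" marker
    if open_level is not None and level == open_level:
        line += " open"
    return line


# With open_level=1 the chapter is marked open, the section is not
print(bookmark_line(1, "  2 Regular Expressions, Text Normalization, Edit Distance", 10, open_level=1))
print(bookmark_line(2, "2.1 Regular Expressions", 11, open_level=1))
```

With `open_level=None`, or a level that never occurs, no entry gets the marker, which corresponds to the default behaviour of the script.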

The output cpdf-bookmark file ed3book.info looks like this:

1 "Contents " 2
1 "  1 Introduction" 9
1 "  2 Regular Expressions, Text Normalization, Edit Distance" 10
2 "2.1 Regular Expressions" 11
2 "2.2 Words" 19
2 "2.3 Corpora" 21
2 "2.4 Text Normalization" 22
2 "2.5 Minimum Edit Distance" 30
2 "2.6 Summary" 34
2 "Bibliographical and Historical Notes" 34
2 "Exercises" 35
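
Before feeding the file to cpdf, it may be worth checking it for parsing mistakes. The following is a minimal sketch under stated assumptions: it accepts only the `level "title" page [open]` shape seen in the file above (the cpdf format may allow more), it assumes that page numbers never decrease from one entry to the next, and the names `ENTRY` and `check_bookmarks` are hypothetical.

```python
import re

# One entry: level, quoted title, page number, optional "open" marker
ENTRY = re.compile(r'^(?P<level>\d+) "(?P<title>.*)" (?P<page>\d+)(?P<open> open)?$')


def check_bookmarks(text):
    """Return a list of problems found in the bookmark list (empty if none)."""
    problems = []
    previous_page = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = ENTRY.match(line)
        if match is None:
            problems.append(f"line {number}: unexpected format")
            continue
        page = int(match.group("page"))
        if page < previous_page:
            problems.append(f"line {number}: page {page} comes before page {previous_page}")
        previous_page = page
    return problems


sample = '''1 "Contents " 2
1 "  1 Introduction" 9
2 "2.1 Regular Expressions" 11
'''
print(check_bookmarks(sample))
```

A page number that goes backwards usually means that a TOC line was split incorrectly, for example when a title itself ends in a number.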

Now we can update the bookmarks in the book:

# Install cpdf
[johndoe@ArchLinux]% yaourt -S cpdf-bin

# Update the old PDF file with bookmarks provided in "ed3book.info" file
[johndoe@ArchLinux]% cpdf -add-bookmarks ed3book.info ed3book.pdf -o ed3book_toc.pdf

The book looks nicer now.

Altynbek Isabekov