A draft of the new edition of the book "Speech and Language Processing (3rd ed. draft)" by Dan Jurafsky and James H. Martin is available on the authors' website. Although the PDF is produced using pdflatex with hyperref, its table of contents (TOC) bookmarks have an incorrect hierarchy. The TOC printed at the beginning of the book is correct, however, so it can be used to regenerate the bookmarks.
The first step is to use pdftotext to extract the text of the TOC, preserving the layout (position, spacing, line breaks, etc.) as much as possible. After experimenting with the cropping parameters, I was able to extract the TOC:
    # Download the book
    [johndoe@ArchLinux]% wget -vc "https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf"
    # Install pdftotext
    [johndoe@ArchLinux]% sudo pacman -S poppler
    # Convert to text pages 3-8 (table of contents)
    # -fixed number : Assume fixed-pitch (or tabular) text, with the specified character width (in points). This forces physical layout mode.
    # -x number : Specifies the x-coordinate of the crop area top left corner
    # -y number : Specifies the y-coordinate of the crop area top left corner
    # -W number : Specifies the width of crop area in pixels (default is 0)
    # -H number : Specifies the height of crop area in pixels (default is 0)
    # -r number : Specifies the resolution, in DPI. The default is 72 DPI.
    [johndoe@ArchLinux]% pdftotext -f 3 -l 8 -fixed 4 -x 20 -H 620 -W 800 -y 100 -r 72 ed3book.pdf
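If the extraction needs to be repeated while tuning the crop area, the same command can be assembled from Python. This is only a sketch: the helper name `pdftotext_args` and its parameter names are made up for illustration, and the flags are exactly the ones used in the command above.

```python
import os
import shutil
import subprocess


def pdftotext_args(pdf, first, last, x, y, width, height, char_width=4, dpi=72):
    """Build the pdftotext argument list (hypothetical helper, not part of poppler)."""
    return [
        "pdftotext",
        "-f", str(first), "-l", str(last),   # first and last page to convert
        "-fixed", str(char_width),           # fixed-pitch physical layout mode
        "-x", str(x), "-y", str(y),          # top left corner of the crop area
        "-W", str(width), "-H", str(height), # size of the crop area
        "-r", str(dpi),                      # resolution in DPI
        pdf,
    ]


args = pdftotext_args("ed3book.pdf", first=3, last=8, x=20, y=100, width=800, height=620)
print(" ".join(args))

# Run the command only if pdftotext and the PDF are actually available
if shutil.which("pdftotext") and os.path.exists("ed3book.pdf"):
    subprocess.run(args, check=True)
```

Changing one crop value and rerunning is then a one-line edit, which is convenient while searching for values that capture the TOC cleanly.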
The TOC looks like this (only a small part is shown):

    Contents
    1 Introduction 9
    2 Regular Expressions, Text Normalization, Edit Distance 10
    2.1 Regular Expressions . .. . . . . . .. . . . . . . .. . . . . .. . 11
    2.2 Words .. . . . . . . .. . . . . . .. . . . . . . .. . . . . .. . 19
    2.3 Corpora . . . . . . . .. . . . . . .. . . . . . . .. . . . . .. . 21
    2.4 Text Normalization . .. . . . . . .. . . . . . . .. . . . . .. . 22
    2.5 Minimum Edit Distance . . . . . . .. . . . . . . .. . . . . .. . 30
    2.6 Summary . . . . . . . .. . . . . . .. . . . . . . .. . . . . .. . 34
    Bibliographical and Historical Notes . . . .. . . . . . . .. . . . . .. . 34
    Exercises . . .. . . . . . . .. . . . . . .. . . . . . . .. . . . . .. . 35
Now we need to parse it and create a bookmark list file (the format is described in the cpdf documentation) for use with cpdf. AWK may be a good solution, but I used Python, which I am more familiar with:
    # This bookmark creator is written for the table of contents of the
    # "Speech and Language Processing" book (3rd ed. draft) by Dan Jurafsky and James H. Martin
    # Link: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
    import os
    import re
    import sys

    import pandas as pd

    with open(sys.argv[1]) as f:
        lines = f.readlines()

    # Rows are collected in a list and converted to a DataFrame at the end
    rows = []

    Str = re.sub(r'\s{2,}', ' ', lines[0].strip())
    Str = Str.split(' ')
    # Add "Contents" bookmark with level 1 on page 2
    rows.append({'Bookmark': '{:>2s} '.format(Str[0]), 'Page': str(2), 'Level': 1})

    for line in lines[1:-1]:
        # Replace each run of dots followed by white space with a space
        Str = re.sub(r'\.+\s', ' ', line.strip())
        # Collapse the white space in front of a number into a single space
        Str = re.sub(r'\s{2,}([0-9]+)', r' \1', Str)
        # Split by white space
        Str = Str.split(' ')
        # Merge all elements except the last one (the page number) using white space
        Bookmark_Str = ' '.join(Str[:-1])
        # Replace two or more consecutive white spaces with a single white space
        Bookmark_Str = re.sub(r'\s{2,}', ' ', Bookmark_Str)
        # Add a leading space to single-digit chapters ("1 Introduction" -> " 1 Introduction")
        Bookmark_Str = re.sub(r'^([0-9])\s', r' \1 ', Bookmark_Str)
        # Replace the typographic apostrophe with the ASCII one
        Bookmark_Str = Bookmark_Str.replace('’', "'")
        # Print the bookmark
        print(Bookmark_Str)
        # Find level 1 chapters: these are
        #   a one- or two-digit integer followed by a space and preceded by zero to two spaces
        #   "A", "B", "C" followed by a space and preceded by zero to two spaces
        #   Appendix, Author Index, Subject Index, Bibliography
        if re.match(r'(\s{0,2}([0-9]{1,2}|[A-C]{1})\s|Appendix|Author Index|Subject Index|Bibliography)', Bookmark_Str):
            cols = {'Bookmark': Bookmark_Str, 'Page': Str[-1], 'Level': 1}
        else:
            # The rest are sections, not chapters
            cols = {'Bookmark': Bookmark_Str, 'Page': Str[-1], 'Level': 2}
        # Skip empty lines
        if cols["Bookmark"] != "":
            rows.append(cols)

    df = pd.DataFrame(rows, columns=['Bookmark', 'Page', 'Level'])
    df['Level'] = df['Level'].astype('str')

    Str = ""
    for index, row in df.iterrows():
        Str = Str + df.loc[index, "Level"] + " \"" + df.loc[index, "Bookmark"] + "\" " + df.loc[index, "Page"]
        # This specifies LaTeX hyperref's "bookmarksopenlevel" property.
        # Use df.loc[index, "Level"] == "1" to unfold all sections
        if df.loc[index, "Level"] == "0":
            Str = Str + " open\n"
        else:
            Str = Str + "\n"

    Output_File = os.path.splitext(sys.argv[1])[0] + ".info"
    with open(Output_File, "w") as text_file:
        text_file.write(Str)
The script requires Python 3. Bookmarks will open at level 0 when the PDF file is viewed (see LaTeX hyperref's "bookmarksopenlevel"). Let's run the script:

    [johndoe@ArchLinux]% python TOC2CPDF_Bookmarks.py ed3book.txt
If you need all bookmarks to be opened, then

    df.loc[index, "Level"] == "0"

should be replaced with

    df.loc[index, "Level"] == "1"
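To see what this comparison changes in the output, here is a minimal sketch of the line formatting in isolation. The function `bookmark_line` and its `open_level` parameter are hypothetical and stand in for the hard-coded "0" or "1" in the script.

```python
def bookmark_line(level, title, page, open_level=0):
    """Format one bookmark entry; add ' open' when level equals open_level.

    open_level is a hypothetical parameter replacing the hard-coded
    comparison against "0" (or "1") in the script.
    """
    line = '{} "{}" {}'.format(level, title, page)
    if level == open_level:
        line += " open"
    return line


entries = [
    (1, " 2 Regular Expressions, Text Normalization, Edit Distance", 10),
    (2, "2.1 Regular Expressions", 11),
]

for open_level in (0, 1):
    print("open_level =", open_level)
    for level, title, page in entries:
        print("  " + bookmark_line(level, title, page, open_level))
```

With `open_level=0` no entry is marked, because the script only produces levels 1 and 2. With `open_level=1` every chapter entry gets the `open` suffix, so its sections are unfolded.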
The output cpdf-bookmark file ed3book.info looks like this:
    1 "Contents " 2
    1 " 1 Introduction" 9
    1 " 2 Regular Expressions, Text Normalization, Edit Distance" 10
    2 "2.1 Regular Expressions" 11
    2 "2.2 Words" 19
    2 "2.3 Corpora" 21
    2 "2.4 Text Normalization" 22
    2 "2.5 Minimum Edit Distance" 30
    2 "2.6 Summary" 34
    2 "Bibliographical and Historical Notes" 34
    2 "Exercises" 35
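Before handing the file to cpdf, it can be worth checking that every line has the expected shape. The sketch below assumes only the subset of the bookmark syntax that the script emits (level, quoted title, page number, optional `open`). cpdf itself may accept more than this, and the names `BOOKMARK_RE` and `parse_bookmark` are made up for illustration.

```python
import re

# level, quoted title, page number, optional "open" flag
BOOKMARK_RE = re.compile(r'^(\d+) "(.*)" (\d+)( open)?$')


def parse_bookmark(line):
    """Split one bookmark line into (level, title, page, is_open)."""
    m = BOOKMARK_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("not a bookmark line: {!r}".format(line))
    level, title, page, open_flag = m.groups()
    return int(level), title, int(page), open_flag is not None


sample = '''1 "Contents " 2
1 " 1 Introduction" 9
2 "2.1 Regular Expressions" 11'''

for line in sample.splitlines():
    print(parse_bookmark(line))
```

A line whose page number is missing, for example because the TOC extraction cut it off, raises a `ValueError` instead of silently producing a broken bookmark.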
Now we can update the bookmarks in the book:
    [johndoe@ArchLinux]% yaourt -S cpdf-bin
    # Update the old PDF file with bookmarks provided in "ed3book.info" file
    [johndoe@ArchLinux]% cpdf -add-bookmarks ed3book.info ed3book.pdf -o ed3book_toc.pdf
The book looks nicer now: