Introduction to pdftools

PDF is one of the most common file formats for documents because it is easy to print and can be viewed on nearly any device. Furthermore, the file format is light and simple to work with. However, scraping or analyzing data from a PDF is a nightmare because the content has no logical order: text is simply placed on the page to make it look nice.

This is where pdftools is extremely useful. Pdftools allows you to:

Pull all the text from a PDF into a character vector
Pull the information and metadata about a PDF
Pull every word on a page, with its position and size, into a data frame

Data

We will use Jeroen Ooms’s research paper on jsonlite and tabulizer’s example data PDF, which contains standard PDF tables. The original documents can be found here:

https://arxiv.org/pdf/1403.2805.pdf

https://github.com/ropensci/tabulizer/blob/master/inst/examples/data.pdf

library(pdftools)

# Save the paper locally; mode = "wb" keeps the binary file intact on Windows
download.file("https://arxiv.org/pdf/1403.2805.pdf", "jsonlite_paper.pdf", mode = "wb")

example_pdf_tables <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"

Basic functions

pdftools has a function to extract each part of a PDF:

pdf_info() gets the metadata of the PDF file.
pdf_text() gets the text on every page.
pdf_data() extracts each word on a page along with its position and size.
pdf_fonts() returns all the fonts within a PDF.
pdf_attachments() returns the attachments connected to the PDF.
pdf_toc() returns the table of contents.
pdf_pagesize() returns the size of each page.

Extract metadata using `pdf_info()`

pdf_info() returns the PDF’s metadata, such as the author, title, and creation date, as a list.

info <- pdf_info("jsonlite_paper.pdf")

# Display the number of pages in the PDF (selecting by name rather than position)
info["pages"]

## $pages
## [1] 29
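The output above is printed with a `$pages` label because single brackets return a sub-list. A minimal sketch of the different ways to reach the same value, using a stand-in list rather than a real pdf_info() result (only the page count of 29 comes from the example above; a real result has more fields):

```r
# Stand-in for pdf_info() output; only the page count is taken from the
# example above, and a real result contains many more fields
info <- list(pages = 29L)

info["pages"]     # single brackets: a list of length one, printed as $pages
info[["pages"]]   # double brackets: the bare integer
info$pages        # same value as info[["pages"]]

# The bare value can be used directly, e.g. to build a vector of page numbers
page_numbers <- seq_len(info$pages)
```

Using `$` or double brackets is usually more convenient when the value feeds into further code.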

Extract the text within the PDF using `pdf_text()`

pdf_text() takes the path to a PDF file and returns a character vector with one element per page. Each element keeps the page’s line breaks and spacing. Tables come back as plain text too, with their columns aligned by spaces.

example_pdf <- pdf_text("jsonlite_paper.pdf")

pdf_text_tables <- pdf_text(example_pdf_tables)

# Uncomment the line below to print the first page with all of its formatting
# cat(example_pdf[1])

# Uncomment the line below to print the first page of tables with its spacing
# cat(pdf_text_tables[1])
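Since each page is a single string, a common next step is to break it into lines and then into fields. The sketch below uses a short hypothetical page string in place of real pdf_text() output, and assumes columns are separated by at least two spaces, which will not hold for every PDF:

```r
# Hypothetical stand-in for one element of pdf_text() output: one page as a
# single string, lines separated by "\n", columns aligned with spaces
page <- "Model        mpg   cyl\nMazda RX4    21.0  6\nDatsun 710   22.8  4\n"

# Split the page into lines and drop any that are empty
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]

# Split each line into fields wherever two or more spaces occur
# (an assumption about the layout, not a rule of the PDF format)
fields <- strsplit(trimws(lines), " {2,}")

fields[[2]]
```

With a real page, replace `page` with an element such as `pdf_text_tables[1]`.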

Extract data and tables using `pdf_data()`

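pdf_data() returns a list with one data frame per page. Each row of a data frame is a single word (text box) with its width, height, x and y position, and whether it is followed by a space. It does not detect tables on its own; a table has to be rebuilt from the word positions. To access a specific page, index the list.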

pdf_data_table <- pdf_data(example_pdf_tables)

# Show the first four words on the first page
head(as.data.frame(pdf_data_table[[1]]), 4)

##   width height   x   y space  text
## 1    29      8 154 139  TRUE Mazda
## 2    19      8 187 139 FALSE   RX4
## 3    29      8 154 151  TRUE Mazda
## 4    19      8 187 151  TRUE   RX4
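The output above has one word per row, so lines of text have to be reassembled from the coordinates. A minimal sketch, using a data frame copied from the four rows shown above as a stand-in for one page of pdf_data() output, and assuming that words on the same line share exactly the same y value:

```r
# Stand-in for one page of pdf_data() output, copied from the rows shown above
words <- data.frame(
  width  = c(29, 19, 29, 19),
  height = c(8, 8, 8, 8),
  x      = c(154, 187, 154, 187),
  y      = c(139, 139, 151, 151),
  space  = c(TRUE, FALSE, TRUE, TRUE),
  text   = c("Mazda", "RX4", "Mazda", "RX4"),
  stringsAsFactors = FALSE
)

# Order words top to bottom, then left to right
words <- words[order(words$y, words$x), ]

# Treat words that share a y coordinate as one line and join them
# (assumes baselines match exactly; real pages may need a tolerance)
rebuilt <- tapply(words$text, words$y, paste, collapse = " ")

rebuilt
```

The result is named by y coordinate, so each line can be traced back to its vertical position on the page.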

Extract fonts used within the PDF using `pdf_fonts()`

`pdf_fonts()` returns a table of the fonts used in the PDF. (Warning: if a font is not embedded in the PDF and is not installed on your system, you will see a “PDF error” message about the missing font or its substitute, as in the output below.)

fonts <- pdf_fonts("jsonlite_paper.pdf")

## PDF error: No display font for 'ArialUnicode'

## PDF error: Couldn't find a font for 'Times-Roman', subst is 'Helvetica'

head(fonts, 1)

##          name  type embedded                          file
## 1 Times-Roman type1    FALSE C:\\WINDOWS\\Fonts\\arial.ttf

Extract attachments on the PDF using `pdf_attachments()`

pdf_attachments() returns a list of attachments connected to the PDF.

attachments <- pdf_attachments("jsonlite_paper.pdf")

Extract the table of contents using `pdf_toc()`

pdf_toc() returns the table of contents as a nested list, where each level of nesting corresponds to a heading level. The list can easily be converted to JSON.

# Table of contents
table_of_contents <- pdf_toc("jsonlite_paper.pdf")

# Show the first top-level section as JSON
jsonlite::toJSON(table_of_contents$children[[1]], auto_unbox = TRUE, pretty = TRUE)

## {
##   "title": "1 Introduction",
##   "children": [
##     {
##       "title": "1.1 Parsing and type safety",
##       "children": []
##     },
##     {
##       "title": "1.2 Reference implementation: the jsonlite package",
##       "children": []
##     },
##     {
##       "title": "1.3 Class-based versus type-based encoding",
##       "children": []
##     },
##     {
##       "title": "1.4 Scope and limitations",
##       "children": []
##     }
##   ]
## }
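The nested structure can also be walked directly in R. The sketch below rebuilds the branch shown in the JSON above as a plain list and flattens it into an indented outline; `flatten_toc` is a hypothetical helper written for this example, not a pdftools function:

```r
# Stand-in for one branch of pdf_toc() output, matching the JSON shown above
toc <- list(
  title = "1 Introduction",
  children = list(
    list(title = "1.1 Parsing and type safety", children = list()),
    list(title = "1.2 Reference implementation: the jsonlite package", children = list()),
    list(title = "1.3 Class-based versus type-based encoding", children = list()),
    list(title = "1.4 Scope and limitations", children = list())
  )
)

# Recursively collect titles, indenting two spaces per nesting level
# (flatten_toc is a hypothetical helper, not part of pdftools)
flatten_toc <- function(node, depth = 0) {
  own  <- paste0(strrep("  ", depth), node$title)
  kids <- unlist(lapply(node$children, flatten_toc, depth = depth + 1))
  c(own, kids)
}

outline <- flatten_toc(toc)
cat(outline, sep = "\n")
```

The same function should work on a whole table of contents, since every node has the same `title` and `children` fields.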

Extract the size of the pages using `pdf_pagesize()`

pdf_pagesize() returns the size of each page in a data frame, with one row per page. The “top”, “right”, “bottom”, and “left” columns give the distance of each edge from the PDF’s origin point (top left corner). The “width” and “height” columns give the overall dimensions of the page, as you would for an image or a sheet of paper.

page_size <- pdf_pagesize("jsonlite_paper.pdf")
head(page_size, 4)

##   top right bottom left width height
## 1   0   612    792    0   612    792
## 2   0   612    792    0   612    792
## 3   0   612    792    0   612    792
## 4   0   612    792    0   612    792
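PDF dimensions are measured in points, with 72 points to the inch, so the values above can be converted to physical sizes. A short sketch, using a stand-in data frame copied from the first rows of the output above (the `width_in`, `height_in`, and `is_letter` columns are names made up for this example):

```r
# Stand-in for pdf_pagesize() output, copied from the rows shown above
page_size <- data.frame(
  top = c(0, 0), right = c(612, 612), bottom = c(792, 792),
  left = c(0, 0), width = c(612, 612), height = c(792, 792)
)

# Convert points to inches (72 points per inch)
page_size$width_in  <- page_size$width / 72
page_size$height_in <- page_size$height / 72

# Flag pages that match US Letter (8.5 x 11 inches)
page_size$is_letter <- page_size$width_in == 8.5 & page_size$height_in == 11

page_size[, c("width_in", "height_in", "is_letter")]
```

Here 612 by 792 points works out to 8.5 by 11 inches, which is US Letter.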
