PDF is one of the most common file formats for documents because it prints reliably and can be viewed on nearly every device. The format is also light and simple to work with. However, scraping or analyzing data from a PDF is a nightmare: the content is simply placed on the page to make it look nice, with no guarantee that it is stored in reading order or in any logical structure.

This is where the R package pdftools is extremely useful. It allows you to:

  • Pull all the text from a PDF into a character vector
  • Pull the information and metadata about a PDF
  • Pull every word on a page, together with its position, into a data frame (a starting point for rebuilding tables)

Basic functions

pdftools has a function to extract each part of a PDF:

  • pdf_info() gets the metadata of the PDF file.
  • pdf_text() gets the text on every page.
  • pdf_data() gets every word on each page together with its position and size.
  • pdf_fonts() returns all the fonts within a PDF.
  • pdf_attachments() returns the files attached to the PDF.
  • pdf_toc() returns the table of contents.
  • pdf_pagesize() returns the size of each page.

Extract metadata using pdf_info()

pdf_info() returns the PDF’s metadata, such as the author, title, creation date, and number of pages, as a list.

## $pages
## [1] 29
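As a minimal sketch of how this looks in practice, the snippet below builds a small two-page PDF with base R's pdf() graphics device and reads its metadata. The sample file, its path, and its title are made up for illustration and are not the document used for the output above.

```r
library(pdftools)

# Build a throwaway two-page PDF so the example runs anywhere
# (hypothetical sample file; substitute the path to your own PDF).
path <- tempfile(fileext = ".pdf")
pdf(path, width = 8.5, height = 11, title = "Sample report")
plot.new(); text(0.5, 0.5, "Hello PDF")
plot.new(); text(0.5, 0.5, "Second page")
invisible(dev.off())

info <- pdf_info(path)   # a named list
info$pages               # number of pages
info$keys$Title          # document properties are stored under $keys
info$created             # creation timestamp
```

Because the result is an ordinary list, str(info) is a quick way to see every field that is available for a given file.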

Extract the text within the PDF using pdf_text()

pdf_text() extracts the text of each page into a character vector, one element per page, keeping the line breaks and approximating the original horizontal spacing with whitespace. Its main argument is the path to a PDF file. Text inside tables is extracted as well, but only as plain lines of text, so the columns still have to be split apart afterwards.
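The following sketch shows one way to go from the per-page strings to individual lines. It assumes a small sample PDF generated with base R's pdf() device; the file and its contents are hypothetical stand-ins for a real document.

```r
library(pdftools)

# Hypothetical two-page sample file; replace with your own PDF path.
path <- tempfile(fileext = ".pdf")
pdf(path)
plot.new(); text(0.5, 0.8, "First line"); text(0.5, 0.6, "Second line")
plot.new(); text(0.5, 0.5, "Page two")
invisible(dev.off())

pages <- pdf_text(path)   # one string per page
length(pages)             # number of pages in the file

# Split the first page into lines, trim the padding, drop empty lines
page1_lines <- trimws(strsplit(pages[1], "\n", fixed = TRUE)[[1]])
page1_lines <- page1_lines[nzchar(page1_lines)]
page1_lines
```

To inspect a page with its layout intact, cat(pages[1]) prints it with the newlines rendered rather than escaped.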

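Extract words and their positions using pdf_data()

pdf_data() returns a list with one data frame per page. Each row of a data frame is a single word on that page, described by its width, height, x and y position, whether it is followed by a space, and the text itself. The function does not detect tables on its own, but the coordinates make it possible to rebuild rows and columns. To work with a specific page, index the list.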

##   width height   x   y space  text
## 1    29      8 154 139  TRUE Mazda
## 2    19      8 187 139 FALSE   RX4
## 3    29      8 154 151  TRUE Mazda
## 4    19      8 187 151  TRUE   RX4
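Since words on the same line share a y value in the output above, one simple way to rebuild lines is to group the words by y and paste them together in x order. This is a rough sketch under that assumption, using a hypothetical sample PDF generated with base R; real documents may need a tolerance on y rather than exact matching.

```r
library(pdftools)

# Hypothetical one-page sample with two short lines of text.
path <- tempfile(fileext = ".pdf")
pdf(path)
plot.new()
text(0.5, 0.8, "Mazda RX4")
text(0.5, 0.6, "Datsun 710")
invisible(dev.off())

words <- pdf_data(path)[[1]]                # data frame for page 1
words <- words[order(words$y, words$x), ]   # top to bottom, left to right

# Words that share a y value are treated as one line
rows <- tapply(words$text, words$y, paste, collapse = " ")
unname(rows)
```

The same idea extends to columns: binning the x values of each row gives a rough cell assignment, which is usually enough for simple, regularly spaced tables.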

Extract fonts used within the PDF using pdf_fonts()

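pdf_fonts() returns a data frame of the fonts used in the PDF, with each font’s name, type, whether it is embedded, and the file it resolves to. (Warning: if a font is neither embedded in the PDF nor installed on your system, you will see messages like the ones below saying that the font is missing or has been substituted.)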

## PDF error: No display font for 'ArialUnicode'
## PDF error: Couldn't find a font for 'Times-Roman', subst is 'Helvetica'
##          name  type embedded                          file
## 1 Times-Roman type1    FALSE C:\\WINDOWS\\Fonts\\arial.ttf
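A quick way to spot fonts that may trigger such messages is to filter the result on the embedded column. The sketch below assumes a hypothetical sample PDF created with base R's pdf() device, which typically references its fonts without embedding them.

```r
library(pdftools)

# Hypothetical sample file; replace with your own PDF path.
path <- tempfile(fileext = ".pdf")
pdf(path)
plot.new(); text(0.5, 0.5, "Font check")
invisible(dev.off())

fonts <- pdf_fonts(path)   # columns: name, type, embedded, file

# Fonts that are not embedded must be found on (or substituted by) the system
not_embedded <- fonts[!fonts$embedded, c("name", "type", "file")]
not_embedded
```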

Extract attachments from the PDF using pdf_attachments()

pdf_attachments() returns a list of the files embedded in the PDF, one element per attachment. If the PDF has no attachments, the list is empty.
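The sketch below shows how the attachments might be written to disk. It assumes that each list element carries a name field and a data field holding the raw bytes; check str() on your own result before relying on those names. The sample PDF is a hypothetical file generated with base R and has no attachments, so the loop body does not run here.

```r
library(pdftools)

# Hypothetical sample file without attachments; replace with your own PDF.
path <- tempfile(fileext = ".pdf")
pdf(path)
plot.new(); text(0.5, 0.5, "No attachments here")
invisible(dev.off())

att <- pdf_attachments(path)
length(att)   # number of attachments found

# Save every attachment into a temporary folder.
# ASSUMPTION: each element has `name` (file name) and `data` (raw vector).
out_dir <- tempfile("attachments")
dir.create(out_dir)
for (a in att) {
  writeBin(a$data, file.path(out_dir, basename(a$name)))
}
list.files(out_dir)
```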

Extract the table of contents using pdf_toc()

pdf_toc() returns the table of contents as a nested list in which each level of nesting corresponds to a heading level. Each entry has a title and a list of children, so the result converts easily to JSON, for example with jsonlite::toJSON().

## {
##   "title": "1 Introduction",
##   "children": [
##     {
##       "title": "1.1 Parsing and type safety",
##       "children": []
##     },
##     {
##       "title": "1.2 Reference implementation: the jsonlite package",
##       "children": []
##     },
##     {
##       "title": "1.3 Class-based versus type-based encoding",
##       "children": []
##     },
##     {
##       "title": "1.4 Scope and limitations",
##       "children": []
##     }
##   ]
## }
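Because every entry has the same title/children shape, the nested list can also be flattened with a short recursive function. The helper flatten_toc() below is a hypothetical name, not part of pdftools, and the input is a hand-built list that mirrors the structure shown above rather than the output of a real file.

```r
# Hand-built list with the same shape as an entry returned by pdf_toc()
toc <- list(
  title = "1 Introduction",
  children = list(
    list(title = "1.1 Parsing and type safety", children = list()),
    list(title = "1.4 Scope and limitations", children = list())
  )
)

# Hypothetical helper: walk the tree and indent each title by its depth
flatten_toc <- function(node, depth = 0) {
  title <- node$title
  has_title <- length(title) == 1 && nzchar(title)
  own <- if (has_title) paste0(strrep("  ", depth), title) else character(0)
  kids <- unlist(lapply(node$children, flatten_toc, depth = depth + 1))
  c(own, kids)
}

outline <- flatten_toc(toc)
writeLines(outline)
```

With a real document, the same helper could be applied to the result of pdf_toc() on the file's path.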

Extract the size of the pages using pdf_pagesize()

pdf_pagesize() returns the size of each page as a data frame with one row per page. The columns “top”, “right”, “bottom”, and “left” give the position of each page edge relative to the PDF’s origin point (top left corner), and “width” and “height” give the page dimensions. The values in the output below are in points (1/72 inch), so 612 by 792 is a US Letter sheet (8.5 by 11 inches).

##   top right bottom left width height
## 1   0   612    792    0   612    792
## 2   0   612    792    0   612    792
## 3   0   612    792    0   612    792
## 4   0   612    792    0   612    792
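Since the sizes are reported in points, dividing by 72 gives inches. The sketch below does this for a hypothetical sample PDF generated with base R at 8.5 by 11 inches; the column names width_in and height_in are made up for the example.

```r
library(pdftools)

# Hypothetical one-page sample at US Letter size (8.5 x 11 inches).
path <- tempfile(fileext = ".pdf")
pdf(path, width = 8.5, height = 11)
plot.new(); text(0.5, 0.5, "Page size check")
invisible(dev.off())

sizes <- pdf_pagesize(path)   # one row per page, values in points

# 72 points per inch
sizes$width_in  <- sizes$width / 72
sizes$height_in <- sizes$height / 72
sizes[, c("width", "height", "width_in", "height_in")]
```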