PDFs are one of the most common file formats for documents because they are easy to print and can be viewed on nearly every device. Furthermore, the file format is light and simple to work with. However, scraping or analyzing data from a PDF is a nightmare because the formatting is unordered and the content is just placed there to make it look nice.
This is where pdftools is extremely useful. The package allows you to:
- Pull all the text from a PDF into a character vector
- Pull the information and metadata about a PDF
- Pull the position and size of every word in a PDF into a data frame, which can be used to rebuild tables
Data
We will use Jeroen Ooms’s research paper on jsonlite and tabulizer’s example data PDF, which contains standard PDF tables. The original documents can be found here:
https://arxiv.org/pdf/1403.2805.pdf
https://github.com/ropensci/tabulizer/blob/master/inst/examples/data.pdf
Basic functions
pdftools has a function to extract each part of a PDF:
- pdf_info() gets the metadata of the PDF file.
- pdf_text() gets the text on every page.
- pdf_data() extracts every word on each page along with its position and size, which is the raw material for rebuilding tables.
- pdf_fonts() returns all the fonts within a PDF.
- pdf_attachments() returns the attachments connected to the PDF.
- pdf_toc() creates a table of contents.
- pdf_pagesize() returns the size of the pages.
Extract metadata using pdf_info()
pdf_info() puts the metadata, such as the author, title, and creation date of the file, into a list.
## $pages
## [1] 29
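As a minimal sketch of working with that list, assuming the paper has been saved locally as jsonlite_paper.pdf (the file name used later in this post): the helper describe_pdf() is a hypothetical name introduced here for illustration, while pages and keys are fields of the list that pdf_info() returns. A Title key is not guaranteed to be present, so the helper falls back to a placeholder.

```r
# Hypothetical helper: build a one-line summary from a pdf_info() result.
describe_pdf <- function(info) {
  title <- info$keys$Title
  # Not every PDF stores a Title key, so fall back to a placeholder.
  if (is.null(title) || !nzchar(title)) title <- "(untitled)"
  sprintf("%s: %d pages", title, as.integer(info$pages))
}

# Assumption: pdftools is installed and the paper is saved under this name.
pdf_file <- "jsonlite_paper.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  info <- pdftools::pdf_info(pdf_file)
  print(describe_pdf(info))
}
```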
Extract the text within the PDF using pdf_text()
pdf_text() takes the path to a PDF file and returns a character vector with one element per page. Each element keeps the page’s newline formatting and spacing. This also works on pages that contain tables and datasets.
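Because each page comes back as a single string, a common next step is to split it into lines. The sketch below is one way to do that; page_lines() is a hypothetical helper name, sample_page is made-up text standing in for one page, and the final block assumes a local copy of the paper saved as jsonlite_paper.pdf.

```r
# Hypothetical helper: split one page of text into trimmed, non-empty lines.
page_lines <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines[nzchar(lines)]
}

# Made-up sample standing in for one element of the pdf_text() result.
sample_page <- "  Title of the paper\n\n  First line of text\n  Second line\n"
print(page_lines(sample_page))

# Assumption: pdftools is installed and the paper is saved under this name.
pdf_file <- "jsonlite_paper.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  pages <- pdftools::pdf_text(pdf_file)
  print(head(page_lines(pages[1])))
}
```

Trimming the lines throws away the column spacing, so skip trimws() if you plan to parse a table by character position.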
Extract data and tables using pdf_data()
pdf_data() returns a list with one data frame per page. Each row is a word (text box) from that page, with its width, height, x and y position, and whether it is followed by a space. To access the data for a specific page, index the list.
## width height x y space text
## 1 29 8 154 139 TRUE Mazda
## 2 19 8 187 139 FALSE RX4
## 3 29 8 154 151 TRUE Mazda
## 4 19 8 187 151 TRUE RX4
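Rows like the ones above can be stitched back into lines of text by grouping on the y coordinate. The following is a rough sketch, not a general table parser: words_to_lines() is a hypothetical helper, it assumes that words on the same line share exactly the same y value (real PDFs may need some tolerance), and sample_words simply re-types the four rows shown above. The last block assumes the tabulizer example has been saved locally as data.pdf.

```r
# Hypothetical helper: join words that share a y value, ordered left to right.
words_to_lines <- function(words) {
  words <- words[order(words$y, words$x), ]
  vapply(split(words$text, words$y), paste, character(1), collapse = " ")
}

# The four rows from the output above, typed in by hand.
sample_words <- data.frame(
  width = c(29, 19, 29, 19),
  height = 8,
  x = c(154, 187, 154, 187),
  y = c(139, 139, 151, 151),
  space = c(TRUE, FALSE, TRUE, TRUE),
  text = c("Mazda", "RX4", "Mazda", "RX4"),
  stringsAsFactors = FALSE
)
print(words_to_lines(sample_words))

# Assumption: pdftools is installed and the example PDF is saved as data.pdf.
pdf_file <- "data.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  first_page <- pdftools::pdf_data(pdf_file)[[1]]
  print(head(words_to_lines(first_page)))
}
```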
Extract fonts used within the PDF using pdf_fonts()
pdf_fonts() returns a table of the fonts used in the PDF. (Warning: you’ll need the TTF of every font used within the PDF, or else you will get an error message that the font is missing.)
## PDF error: No display font for 'ArialUnicode'
## PDF error: Couldn't find a font for 'Times-Roman', subst is 'Helvetica'
## name type embedded file
## 1 Times-Roman type1 FALSE C:\\WINDOWS\\Fonts\\arial.ttf
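Since the messages above concern fonts that are not embedded in the file, it can help to list those fonts directly. This is a small sketch using the embedded column shown in the output; non_embedded_fonts() is a hypothetical helper name, sample_fonts re-types the single row above, and the last block assumes a local copy of the paper saved as jsonlite_paper.pdf.

```r
# Hypothetical helper: names of fonts that are referenced but not embedded.
non_embedded_fonts <- function(fonts) {
  unique(fonts$name[!fonts$embedded])
}

# The row from the output above, typed in by hand.
sample_fonts <- data.frame(
  name = "Times-Roman",
  type = "type1",
  embedded = FALSE,
  file = "C:\\WINDOWS\\Fonts\\arial.ttf",
  stringsAsFactors = FALSE
)
print(non_embedded_fonts(sample_fonts))

# Assumption: pdftools is installed and the paper is saved under this name.
pdf_file <- "jsonlite_paper.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  print(non_embedded_fonts(pdftools::pdf_fonts(pdf_file)))
}
```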
Extract attachments on the PDF using pdf_attachments()
pdf_attachments() returns a list of attachments connected to the PDF.
Extract the table of contents using pdf_toc()
pdf_toc() creates a table of contents as a nested list where each level of nesting is a header level, which can be easily formatted as JSON.
# Table of contents
table_of_contents <- pdf_toc("jsonlite_paper.pdf")
# Show as JSON
jsonlite::toJSON(table_of_contents[[2]][[1]], auto_unbox = TRUE, pretty = TRUE)
## {
## "title": "1 Introduction",
## "children": [
## {
## "title": "1.1 Parsing and type safety",
## "children": []
## },
## {
## "title": "1.2 Reference implementation: the jsonlite package",
## "children": []
## },
## {
## "title": "1.3 Class-based versus type-based encoding",
## "children": []
## },
## {
## "title": "1.4 Scope and limitations",
## "children": []
## }
## ]
## }
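The nested title/children structure shown in the JSON above can also be flattened into an indented outline with a short recursive function. The sketch below assumes every node has a title and a children element, as in that output; toc_titles() is a hypothetical helper name and sample_toc re-types part of the output by hand.

```r
# Hypothetical helper: walk a title/children node and return indented titles.
toc_titles <- function(node, depth = 0) {
  own <- character(0)
  # Nodes without a usable title are skipped.
  if (!is.null(node$title) && nzchar(node$title)) {
    own <- paste0(strrep("  ", depth), node$title)
  }
  kids <- unlist(lapply(node$children, toc_titles, depth = depth + 1))
  c(own, kids)
}

# Part of the output above, typed in by hand.
sample_toc <- list(
  title = "1 Introduction",
  children = list(
    list(title = "1.1 Parsing and type safety", children = list()),
    list(title = "1.2 Reference implementation: the jsonlite package", children = list())
  )
)
cat(toc_titles(sample_toc), sep = "\n")

# Assumption: pdftools is installed and the paper is saved under this name.
pdf_file <- "jsonlite_paper.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  cat(toc_titles(pdftools::pdf_toc(pdf_file)), sep = "\n")
}
```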
Extract the size of the pages using pdf_pagesize()
pdf_pagesize() returns the size of each page in a data frame, one row per page. The values “top”, “right”, “bottom”, and “left” are the distances of the page edges from the PDF’s origin point (top left corner). It also gives the width and height of each page, as if it were a standard image or sheet of paper.
## top right bottom left width height
## 1 0 612 792 0 612 792
## 2 0 612 792 0 612 792
## 3 0 612 792 0 612 792
## 4 0 612 792 0 612 792
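To relate these numbers to paper sizes, the sketch below converts them to inches. It assumes the values are PDF points (1/72 inch), which is consistent with 612 x 792 being US Letter (8.5 x 11 in). pagesize_inches() is a hypothetical helper name, sample_sizes re-types one row of the output above, and the last block assumes a local copy of the paper saved as jsonlite_paper.pdf.

```r
# Hypothetical helper: convert page sizes to inches.
# Assumption: width and height are in PDF points (72 points per inch).
pagesize_inches <- function(sizes) {
  data.frame(
    page = seq_len(nrow(sizes)),
    width_in = sizes$width / 72,
    height_in = sizes$height / 72
  )
}

# One row from the output above, typed in by hand.
sample_sizes <- data.frame(
  top = 0, right = 612, bottom = 792, left = 0, width = 612, height = 792
)
print(pagesize_inches(sample_sizes))

# Assumption: pdftools is installed and the paper is saved under this name.
pdf_file <- "jsonlite_paper.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  print(head(pagesize_inches(pdftools::pdf_pagesize(pdf_file))))
}
```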