Extracting data from a PDF into R

Sometimes you need to get a data from a table in a paper or from some report that have some data that you need to your project or research. Here I will show how to get that data from a pdf file and create a tidy dataset from it. For this example I will be using the Oregon grass seed crop data from 2017 available at the OSU - Crop and soil science website

1. Creating a new project

First thing you need to do is to create a R project on R studio to make easier for you to get your pdf that you want to extract the data. After creating the project, put the pdf inside the folder of the project you just created.

2. Packages

library(pdftools)
library(tidyverse)

3. Load PDF

The next step is to load your PDF. I am going to call my new object ‘PDF’ and I am going to use the pdf_text command to read the text of my file. The read_lines() function reads the lines of our new file.

PDF <- pdf_text("oregon_grass_and_legume_seed_crops_preliminary_estimates_2017.pdf") %>%
  readr::read_lines() #open the PDF inside your project folder

3. Clean PDF data

Check your new PDF item that you just create and remove the lines that we don’t want and select only the lines that contain the that you want to use:

PDF.grass <-PDF[-c(1:3,6:8,20:35)] # remove lines
PDF.grass

##  [1] "Species                                     Area                    per                                                               of pro-" 
##  [2] "                                         Harvested                  acre              Production                  Price               duction" 
##  [3] "        Annual ryegrass                   120,250                  1,721                206,909                  42.00                 86,902" 
##  [4] "        Perennial ryegrass                  83,450                 1,476                123,208                  79.00                 97,334" 
##  [5] "        Tall fescue                       134,370                  1,542                207,148                  82.00                169,861" 
##  [6] "        Kentucky bluegrass                  20,650                   915                 18,887                 140.00                 26,442" 
##  [7] "        Rough bluegrass                      1,080                 1,000                   1,080                135.00                   1,458"
##  [8] "        Orchardgrass                        15,190                 1,046                 15,889                 225.00                 35,750" 
##  [9] "        Chewings fescue                      8,790                 1,141                 10,029                 105.00                  10,531"
## [10] "        Red fescue                          11,370                 1,009                 11,469                 103.00                  11,813"
## [11] "        Hard fescue                          2,280                 1,009                   2,300                140.00                   3,220"
## [12] "        Colonial bentgrass                   3,030                   379                   1,147                250.00                   2,868"
## [13] "        Creeping bentgrass                   3,590                   719                   2,581                350.00                   9,034"

In the next series of steps, I will use functions in the stringr package to manipulate the lines of text into a desirable form. The first problem to tackle is the whitespace between the different elements in each line of text. The str_squish() function reduces the repeated whitespace between each string. After the whitespace, focus on separating each element. I will use strsplt() to split the elements of each string into substrings.

all_stat_lines <- PDF.grass[3:13] %>%
  str_squish() %>%
  strsplit(split = " ")# remove empty spaces

var_lines <- c("Species", "Acreage", "Yield", "Production", "Price", "Value") # create your variable names

var_lines

## [1] "Species"    "Acreage"    "Yield"      "Production" "Price"     
## [6] "Value"

In this case, since line 6 was the only species that does not had a compost name, it is easier to separate this line now and join it with the other collumn later.

all_stat_lines[[6]] <- c("Orchard", "grass","15,190","1,046","15,889","225.00","35,750") #change the line 6

The next step is to transform the data into a data frame. I will use the ldply() function in the plyr package, which applies a function to each element in a list and combines the results into a data frame.

df <- plyr::ldply(all_stat_lines) #create a data frame
head(df)

##          V1        V2      V3    V4      V5     V6      V7
## 1    Annual  ryegrass 120,250 1,721 206,909  42.00  86,902
## 2 Perennial  ryegrass  83,450 1,476 123,208  79.00  97,334
## 3      Tall    fescue 134,370 1,542 207,148  82.00 169,861
## 4  Kentucky bluegrass  20,650   915  18,887 140.00  26,442
## 5     Rough bluegrass   1,080 1,000   1,080 135.00   1,458
## 6   Orchard     grass  15,190 1,046  15,889 225.00  35,750

Now we need to join Column V1 and V2 together:

df <- df %>% unite(V1.2, V1, V2, sep = ".")
df

##                  V1.2      V3    V4      V5     V6      V7
## 1     Annual.ryegrass 120,250 1,721 206,909  42.00  86,902
## 2  Perennial.ryegrass  83,450 1,476 123,208  79.00  97,334
## 3         Tall.fescue 134,370 1,542 207,148  82.00 169,861
## 4  Kentucky.bluegrass  20,650   915  18,887 140.00  26,442
## 5     Rough.bluegrass   1,080 1,000   1,080 135.00   1,458
## 6       Orchard.grass  15,190 1,046  15,889 225.00  35,750
## 7     Chewings.fescue   8,790 1,141  10,029 105.00  10,531
## 8          Red.fescue  11,370 1,009  11,469 103.00  11,813
## 9         Hard.fescue   2,280 1,009   2,300 140.00   3,220
## 10 Colonial.bentgrass   3,030   379   1,147 250.00   2,868
## 11 Creeping.bentgrass   3,590   719   2,581 350.00   9,034

Now the last step is to add the variables name into a final data frame:

colnames(df) <- var_lines

final_df <- as_tibble(df) 

# transform variables to numeric and factor
final_df$Species <- as.factor(final_df$Species)
final_df$Acreage <- as.numeric(gsub(",","",final_df$Acreage))
final_df$Yield <- as.numeric(gsub(",","",final_df$Yield))
final_df$Production <- as.numeric(gsub(",","",final_df$Production))
final_df$Price <- as.numeric(gsub(",","",final_df$Price))
final_df$Value <- as.numeric(gsub(",","",final_df$Value))

#final dataset
final_df

## # A tibble: 11 x 6
##    Species            Acreage Yield Production Price  Value
##    <fct>                <dbl> <dbl>      <dbl> <dbl>  <dbl>
##  1 Annual.ryegrass     120250  1721     206909    42  86902
##  2 Perennial.ryegrass   83450  1476     123208    79  97334
##  3 Tall.fescue         134370  1542     207148    82 169861
##  4 Kentucky.bluegrass   20650   915      18887   140  26442
##  5 Rough.bluegrass       1080  1000       1080   135   1458
##  6 Orchard.grass        15190  1046      15889   225  35750
##  7 Chewings.fescue       8790  1141      10029   105  10531
##  8 Red.fescue           11370  1009      11469   103  11813
##  9 Hard.fescue           2280  1009       2300   140   3220
## 10 Colonial.bentgrass    3030   379       1147   250   2868
## 11 Creeping.bentgrass    3590   719       2581   350   9034

Extracting data from a PDF into R

Lucas Kopecky Bobadilla; Oregon State University; Crop and Soil Science Departament

August 28, 2018