Sometimes you need to get a data from a table in a paper or from some report that have some data that you need to your project or research. Here I will show how to get that data from a pdf file and create a tidy dataset from it. For this example I will be using the Oregon grass seed crop data from 2017 available at the OSU - Crop and soil science website
1. Creating a new project
First thing you need to do is to create a R project on R studio to make easier for you to get your pdf that you want to extract the data. After creating the project, put the pdf inside the folder of the project you just created.
2. Packages
library(pdftools)
library(tidyverse)
3. Load PDF
The next step is to load your PDF. I am going to call my new object ‘PDF’ and I am going to use the pdf_text command to read the text of my file. The read_lines() function reads the lines of our new file.
PDF <- pdf_text("oregon_grass_and_legume_seed_crops_preliminary_estimates_2017.pdf") %>%
readr::read_lines() #open the PDF inside your project folder
3. Clean PDF data
Check your new PDF item that you just create and remove the lines that we don’t want and select only the lines that contain the that you want to use:
PDF.grass <-PDF[-c(1:3,6:8,20:35)] # remove lines
PDF.grass
## [1] "Species Area per of pro-"
## [2] " Harvested acre Production Price duction"
## [3] " Annual ryegrass 120,250 1,721 206,909 42.00 86,902"
## [4] " Perennial ryegrass 83,450 1,476 123,208 79.00 97,334"
## [5] " Tall fescue 134,370 1,542 207,148 82.00 169,861"
## [6] " Kentucky bluegrass 20,650 915 18,887 140.00 26,442"
## [7] " Rough bluegrass 1,080 1,000 1,080 135.00 1,458"
## [8] " Orchardgrass 15,190 1,046 15,889 225.00 35,750"
## [9] " Chewings fescue 8,790 1,141 10,029 105.00 10,531"
## [10] " Red fescue 11,370 1,009 11,469 103.00 11,813"
## [11] " Hard fescue 2,280 1,009 2,300 140.00 3,220"
## [12] " Colonial bentgrass 3,030 379 1,147 250.00 2,868"
## [13] " Creeping bentgrass 3,590 719 2,581 350.00 9,034"
In the next series of steps, I will use functions in the stringr package to manipulate the lines of text into a desirable form. The first problem to tackle is the whitespace between the different elements in each line of text. The str_squish() function reduces the repeated whitespace between each string. After the whitespace, focus on separating each element. I will use strsplt() to split the elements of each string into substrings.
all_stat_lines <- PDF.grass[3:13] %>%
str_squish() %>%
strsplit(split = " ")# remove empty spaces
var_lines <- c("Species", "Acreage", "Yield", "Production", "Price", "Value") # create your variable names
var_lines
## [1] "Species" "Acreage" "Yield" "Production" "Price"
## [6] "Value"
In this case, since line 6 was the only species that does not had a compost name, it is easier to separate this line now and join it with the other collumn later.
all_stat_lines[[6]] <- c("Orchard", "grass","15,190","1,046","15,889","225.00","35,750") #change the line 6
The next step is to transform the data into a data frame. I will use the ldply() function in the plyr package, which applies a function to each element in a list and combines the results into a data frame.
df <- plyr::ldply(all_stat_lines) #create a data frame
head(df)
## V1 V2 V3 V4 V5 V6 V7
## 1 Annual ryegrass 120,250 1,721 206,909 42.00 86,902
## 2 Perennial ryegrass 83,450 1,476 123,208 79.00 97,334
## 3 Tall fescue 134,370 1,542 207,148 82.00 169,861
## 4 Kentucky bluegrass 20,650 915 18,887 140.00 26,442
## 5 Rough bluegrass 1,080 1,000 1,080 135.00 1,458
## 6 Orchard grass 15,190 1,046 15,889 225.00 35,750
Now we need to join Column V1 and V2 together:
df <- df %>% unite(V1.2, V1, V2, sep = ".")
df
## V1.2 V3 V4 V5 V6 V7
## 1 Annual.ryegrass 120,250 1,721 206,909 42.00 86,902
## 2 Perennial.ryegrass 83,450 1,476 123,208 79.00 97,334
## 3 Tall.fescue 134,370 1,542 207,148 82.00 169,861
## 4 Kentucky.bluegrass 20,650 915 18,887 140.00 26,442
## 5 Rough.bluegrass 1,080 1,000 1,080 135.00 1,458
## 6 Orchard.grass 15,190 1,046 15,889 225.00 35,750
## 7 Chewings.fescue 8,790 1,141 10,029 105.00 10,531
## 8 Red.fescue 11,370 1,009 11,469 103.00 11,813
## 9 Hard.fescue 2,280 1,009 2,300 140.00 3,220
## 10 Colonial.bentgrass 3,030 379 1,147 250.00 2,868
## 11 Creeping.bentgrass 3,590 719 2,581 350.00 9,034
Now the last step is to add the variables name into a final data frame:
colnames(df) <- var_lines
final_df <- as_tibble(df)
# transform variables to numeric and factor
final_df$Species <- as.factor(final_df$Species)
final_df$Acreage <- as.numeric(gsub(",","",final_df$Acreage))
final_df$Yield <- as.numeric(gsub(",","",final_df$Yield))
final_df$Production <- as.numeric(gsub(",","",final_df$Production))
final_df$Price <- as.numeric(gsub(",","",final_df$Price))
final_df$Value <- as.numeric(gsub(",","",final_df$Value))
#final dataset
final_df
## # A tibble: 11 x 6
## Species Acreage Yield Production Price Value
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Annual.ryegrass 120250 1721 206909 42 86902
## 2 Perennial.ryegrass 83450 1476 123208 79 97334
## 3 Tall.fescue 134370 1542 207148 82 169861
## 4 Kentucky.bluegrass 20650 915 18887 140 26442
## 5 Rough.bluegrass 1080 1000 1080 135 1458
## 6 Orchard.grass 15190 1046 15889 225 35750
## 7 Chewings.fescue 8790 1141 10029 105 10531
## 8 Red.fescue 11370 1009 11469 103 11813
## 9 Hard.fescue 2280 1009 2300 140 3220
## 10 Colonial.bentgrass 3030 379 1147 250 2868
## 11 Creeping.bentgrass 3590 719 2581 350 9034