Recently I had to scrape a 15,000-page PDF file and extract about 13,000 data tables. Luckily, R has an excellent package, tabulizer, that automates this process, extracting those locked-up tables and making them machine-readable. There are several well-written articles on extracting tables from PDF files (for example, see Introduction to tabulizer and PDF Scraping in R with tabulizer). Even though tabulizer does an excellent job of automatically detecting pages with tabular data, it sometimes miscalculates the column widths, and as a result some columns of the table may merge. One way to clean up such data and fix the combined columns is the separate() command.
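As a minimal sketch of that cleanup (the column names and merged values here are hypothetical):

library(tidyverse)

# Two numeric columns that the extraction merged into one string column
merged <- tibble(region = c("North", "South"),
                 values = c("12.5 34.1", "8.2 21.7"))

# separate() splits 'values' on the space into two properly typed columns
fixed <- separate(merged, values,
                  into = c("value_a", "value_b"),
                  sep = " ", convert = TRUE)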

In this tutorial I want to show how to extract tables from PDF files by manually setting the location and the column widths of the data table. We extract and clean the data using the following packages: tabulizer, tidyverse, shiny, and beepr.

Note that tabulizer uses the Tabula Java library to process PDF data, so you need to install the correct version of Java on your machine before using this package. Since I am using 64-bit Windows, I installed 64-bit Java.
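To verify the setup before loading the package, you can check Java from R. The GitHub install route below is the one rOpenSci documented at the time of writing and may change:

# Confirm Java is installed and matches your R architecture (64-bit here)
system("java -version")

# Install tabulizer and its jar dependency from GitHub
install.packages("remotes")
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))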

Here is the full R code I used to extract and tabulate the data.

options(java.parameters = "-Xmx12000m")  # raise the Java heap limit before rJava loads

library(beepr)     # for fun sounds when a task is completed :)
library(tidyverse) # data cleaning and manipulation
library(tabulizer) # PDF table extraction
library(shiny)     # supports the interactive locator widget

# Interactively select the table area on page 1 (see the steps below)
area <- locate_areas("sample_table.pdf", pages = 1)

# Extract the table using the manually determined location and column boundaries
table <- extract_tables("sample_table.pdf",
                        pages   = 1,
                        area    = list(c(101, 32, 582, 734)),  # top, left, bottom, right
                        columns = list(c(126, 184, 227, 270, 313, 358, 400,
                                         435, 467, 499, 531, 566, 600, 639)),
                        guess   = FALSE,  # use the manual locations, not auto-detection
                        output  = "data.frame")
beep(sound = 8)  # signal that extraction is finished

In the first line I changed the Java settings before loading the tabulizer library. Depending on the number of pages and tables you need to process, you may or may not need to do this. In my case Java always crashed once the table data exceeded about 500 pages, so, following advice on Stack Overflow, I raised the maximum Java heap size (-Xmx12000m allows roughly 12 GB) to keep Java from running out of memory. After this change everything ran smoothly.
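If a larger heap still is not enough, another option is to process the PDF in chunks of pages and combine the results at the end. The following is a sketch rather than the exact code I ran; the file name and the chunk size of 500 are illustrative:

loc  <- c(101, 32, 582, 734)   # table area: top, left, bottom, right
cols <- c(126, 184, 227, 270, 313, 358, 400,
          435, 467, 499, 531, 566, 600, 639)

pages  <- 1:15000
chunks <- split(pages, ceiling(seq_along(pages) / 500))  # 500 pages per batch

results <- list()
for (i in seq_along(chunks)) {
  p <- chunks[[i]]
  results[[i]] <- extract_tables("sample_table.pdf",
                                 pages   = p,
                                 area    = rep(list(loc), length(p)),
                                 columns = rep(list(cols), length(p)),
                                 guess   = FALSE,
                                 output  = "data.frame")
  gc()  # encourage R (and the Java bridge) to release memory between batches
}

# Flatten the list of per-chunk lists into one list of data frames, then stack
all_tables <- bind_rows(unlist(results, recursive = FALSE))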

Here are the steps for manually setting the location and the column widths of a table. For this example I will be using the following test table.

  1. Run the locate_areas command.

  2. Click the “Show in new window” icon in the Viewer pane, and the corresponding PDF page will open in a web browser.

  3. Select the area where the data table is located; R will be “listening” to the selections you make in your browser. Once you have selected the table area, click the “Done” button.

  4. The location of the table is stored in the area object. Save its parameters for top, left, bottom, and right. You can round these numbers.

  5. Once you have saved the numbers for the table's location, the next task is to find the location of each column. Simply repeat the previous task: run the locate_areas command and open the PDF in a browser, but this time select a single column of the table. In this example I select column A and press the “Done” button.

  6. Once again the location is stored in the area object. This time you only need to save the left and right parameters.

  7. Repeat tasks 5–6 with the next column (in our case “next” is column C; you can skip column B, since its boundaries are already defined by the right edge of column A and the left edge of column C).

  8. Again, you only need to save the left and right parameters.

  9. Repeat steps 5–6 for all remaining columns, saving the left and right parameters of each.

  10. Once you have the location parameters for the table and all of its columns, it is time to extract the table manually. For this task we use the extract_tables function, passing the table location to area = and the column boundaries to columns =. Note that guess = FALSE tells tabulizer to use these “manual” locations instead of its own detection. The extracted table can then be viewed at table[[1]]. The sketch below shows how the saved numbers fit together.
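To make the mapping from the clicks to the function arguments concrete, here is how the saved numbers become the arguments of extract_tables. The coordinates are the ones from the code above; only the step-by-step assembly is a sketch:

# Step 4: the whole table's location (top, left, bottom, right)
table_area <- c(101, 32, 582, 734)

# Steps 5-9: one x-coordinate per column boundary. For example, column A's
# right edge (126) and column C's left edge (184) together bound column B,
# which is why column B never needs to be selected.
col_bounds <- c(126, 184, 227, 270, 313, 358, 400,
                435, 467, 499, 531, 566, 600, 639)

# Step 10: pass both to extract_tables with guess = FALSE
table <- extract_tables("sample_table.pdf", pages = 1,
                        area = list(table_area), columns = list(col_bounds),
                        guess = FALSE, output = "data.frame")
table[[1]]  # view the extracted data frame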