Data Science In Context - PDF Extraction

11/6/2019

Agenda

The Challenge
Sample PDF
The Tools
PDFTools
Working Example

The Challenge

Get Tidy Data From A PDF File
PDF Files Appear To Have Clean And Structured Paragraphs and Tables
THEY DON’T!
PDF = Printing Format

What looks tidy and clean in a PDF document is often a hot mess of unrelated lines, bitmaps and text boxes, each with a given size, position and content. PDF documents lack the structure of HTML, JSON and XML data.
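
To make this concrete, here is a minimal sketch using a small hand-made data frame of text boxes. The values are hypothetical and not taken from any real PDF, and the column names x, y and text are chosen only for illustration. Rebuilding readable lines means sorting the boxes by position and pasting them together, which is the kind of work any PDF extraction has to do.

```r
# Hypothetical text boxes, stored in arbitrary order as they might be in a PDF.
boxes <- data.frame(
  x    = c(120, 10, 60, 10, 70),
  y    = c(20, 20, 20, 40, 40),
  text = c("Data", "Get", "Tidy", "From", "PDF"),
  stringsAsFactors = FALSE
)

# Sort top-to-bottom, then left-to-right.
boxes <- boxes[order(boxes$y, boxes$x), ]

# Paste the boxes that share a y position into one line of text.
text_lines <- as.vector(tapply(boxes$text, boxes$y, paste, collapse = " "))
text_lines
```

Real documents are messier: boxes on the same visual line rarely share exactly the same y value, so some tolerance is usually needed.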

Extracting Text From PDF Files Can Be Difficult

Sample PDF

Skye Diamonds - 23 Speed Figures Over 4 Years

Skye_Diamonds_Sheet

The Tools

Several Good Packages For Extracting Data From PDF Files

pdftools
tm
tabulizer

We Will Work With pdftools

There are also many tools that convert PDF to HTML or JSON, so you can extract from a more structured document.

PDFTools

install.packages("pdftools")
library(pdftools)

The Functions

pdf_info - returns document info: pages, version, encryption, etc.

pdf_fonts - returns a data frame of the fonts used in the document

pdf_text - renders all text boxes on a text canvas and returns 
a character vector with one element per page of the PDF file. 

pdf_data - returns one data frame per page, with one row per text box 
and the columns width, height, x, y, space and text.
This is low-level data.

Our Sample PDF - pdf_info

pdf_file <- "C:/Users/mutue/OneDrive/Documents/cd110318.pdf"
info <- pdf_info(pdf_file)
info

## $version
## [1] "1.5"
## 
## $pages
## [1] 148
## 
## $encrypted
## [1] FALSE
## 
## $linearized
## [1] FALSE
## 
## $keys
## $keys$Producer
## [1] "GPL Ghostscript 9.21"
## 
## 
## $created
## [1] "2018-10-31 22:47:31 EDT"
## 
## $modified
## [1] "2018-10-31 22:47:31 EDT"
## 
## $metadata
## [1] "<?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>\n<?adobe-xap-filters esc=\"CRLF\"?>\n<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>\n<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='GPL Ghostscript 9.21'/>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2018-10-31T22:47:31-04:00</xmp:ModifyDate>\n<xmp:CreateDate>2018-10-31T22:47:31-04:00</xmp:CreateDate>\n<xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710'/>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Untitled</rdf:li></rdf:Alt></dc:title></rdf:Description>\n</rdf:RDF>\n</x:xmpmeta>\n                                                                        \n                                                                        \n<?xpacket end='w'?>"
## 
## $locked
## [1] FALSE
## 
## $attachments
## [1] FALSE
## 
## $layout
## [1] "no_layout"

Our Sample PDF - pdf_fonts

pdf_file <- "C:/Users/mutue/OneDrive/Documents/cd110318.pdf"
fonts <- pdf_fonts(pdf_file)
fonts

## # A tibble: 15 x 4
##    name                           type   embedded file 
##    <chr>                          <chr>  <lgl>    <chr>
##  1 BBEBZF+Helvetica-Bold          type1c TRUE     ""   
##  2 XJFMAW+Helvetica-BoldOblique   type1c TRUE     ""   
##  3 ZCDDSL+H7                      type1c TRUE     ""   
##  4 DJJQIH+NewCenturySchlbk-Italic type1c TRUE     ""   
##  5 ZTDARH+A2Gross                 type1c TRUE     ""   
##  6 NOAYYY+HV3-Normal              type1c TRUE     ""   
##  7 DKVGSP+AftSym-Bold             type1c TRUE     ""   
##  8 ZOCOZE+W1HotDog                type1c TRUE     ""   
##  9 FWIDOR+Courier-Bold            type1c TRUE     ""   
## 10 FCTZRW+Rs30Bold                type1c TRUE     ""   
## 11 EGFVLB+Helvetica-Narrow        type1c TRUE     ""   
## 12 VKGRRG+R2Sq1                   type1c TRUE     ""   
## 13 ZXXPJZ+NewCenturySchlbk-Roman  type1c TRUE     ""   
## 14 VCBVYO+ZapfDingbats            type1c TRUE     ""   
## 15 BSMSUZ+CaxExBold1              type1c TRUE     ""
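
The six-letter prefixes such as BBEBZF+ are the tags PDF producers add to embedded font subsets. A minimal sketch, assuming a few of the names above are retyped into a plain character vector (font_names and base_fonts are illustrative names), strips the tag to show the underlying font names:

```r
# A few font names copied from the pdf_fonts output above.
font_names <- c(
  "BBEBZF+Helvetica-Bold",
  "FWIDOR+Courier-Bold",
  "VCBVYO+ZapfDingbats"
)

# Remove the subset tag: six capital letters followed by a plus sign.
base_fonts <- sub("^[A-Z]{6}\\+", "", font_names)
base_fonts
```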

Our Sample PDF - pdf_text

pdf_file <- "C:/Users/mutue/OneDrive/Documents/cd110318.pdf"
txt <- pdf_text(pdf_file)
cat(txt[28])

##   Ragozin -- The Sheets                     TM
##                      FIRST DUDE
##          CD p27      EXONERATED-JOHANNESBURG CA           SKYE DIAMONDS        F 13          7 Race 3
##       3 RACES 15                   7 RACES 16                     7 RACES 17             6 RACES 18
##  = 17"           v AWSAD27                            D                          D
##                        E                              E                          E
##                        C                              C                          C
## F 15+           vw MSLA 5
##                        N                              N                          N
##                        O                              O                          O
##                        V                              V                          V
##                                 10              v AWSA 6        8+         v AWDM 4  F/M 5YO 3NOV BSr
##                        O                       Yw 50SAO                          O
##                        C        12-                   CBSr                       C
##                        T                              T                          T
##                                                                                       7-         vw AWSA 7
##                        S                              S                          S
##                        E                              E                          E
##                        P                              P                          P
##                        A        12-        Vw&BS 40DMA27                         A
##                        U                              U                          U
##                        G                              G      5-           Yw AWDMG13     13-      v AWDM 12
##                        J                              J                          J
##                        L                              L                          L
##                        Y                              Y                          Y
##                                                              5-           Vw AWLA 8   7-          v AWLA 7
##                        J                              J                          J
##                        U                              U                          U
##                        N                              N                          N
##                        M                              M                          M
##                        A                              A                          A
##  r' 19 -         v MSSAYMMy                           Y                          Y
##                                                               . 8"         m AWSA 7     ' 14+     V AWCD 5
##                        A                              A                          A
##                        P                              P                          P
##                        R                              R                          R
##                                     r. 24+        AWSA 9
##                        M                              M         9          w AWSAM26  7-            AWSA 24
##                        A                              A                          A
##                        R                              R                          R
##                                  12-            w AWSA 6
##                        F                              F                          F
##                        E           18          st AWSAE20
##                        B                              B      5-           vw AWSAE
##                                                                                  B16
##                                                                                      4           bv AWSA 18
##                                   14"           v AWSA 29
##                        J                              J                          J
##                        A                              A                          A
##                        N                              N                          N
##                                                                . 10      mYw AWSA 7
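
Because pdf_text returns each page as one long string, a common next step is to split it into lines and then into columns. The sketch below does not read the real file: page_text is a shortened stand-in for txt[28], built from three of the lines above. Treating runs of two or more spaces as column gaps is a heuristic assumed here, not something pdftools does for you.

```r
# Stand-in for one page of pdf_text output (three lines from the page above).
page_text <- paste(
  "  Ragozin -- The Sheets                     TM",
  "                     FIRST DUDE",
  "      3 RACES 15                   7 RACES 16",
  sep = "\n"
)

# One element per line, with the leading and trailing padding removed.
page_lines <- trimws(strsplit(page_text, "\n", fixed = TRUE)[[1]])

# Split each line into columns wherever two or more spaces occur.
columns <- strsplit(page_lines, " {2,}")
columns
```

On a page like this one the heuristic breaks down quickly, because the month letters and the speed figures are interleaved, which is why the following slides switch to pdf_data.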

Our Sample PDF - pdf_data

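library(dplyr)  # provides %>%, as_tibble(), rename(), filter(), ...

# Text boxes for page 28, with x and y renamed to xval and yval
x <- pdf_data(pdf_file)[[28]] %>% 
   as_tibble() %>%
   rename(xval = x, yval = y) 

x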

## # A tibble: 254 x 6
##    width height  xval  yval space text                   
##    <int>  <int> <int> <int> <lgl> <chr>                  
##  1    50     12    70    46 TRUE  Ragozin                
##  2     7      8   125    47 TRUE  --                     
##  3    20      8   136    47 TRUE  The                    
##  4    38      8   160    47 FALSE Sheets                 
##  5    14      5   130    63 TRUE  FIRST                  
##  6    13      5   145    63 FALSE DUDE                   
##  7    76      5   130    70 TRUE  EXONERATED-JOHANNESBURG
##  8     6      5   210    70 FALSE CA                     
##  9    11      7    85    66 TRUE  CD                     
## 10    13      7   101    66 FALSE p27                    
## # ... with 244 more rows

Our Sample PDF - Header Data

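# Keep the text boxes in the header band (yval between 47 and 80),
# ordered top to bottom and then left to right
header <- x %>% 
    filter(yval > 47 & yval < 80) %>%
    arrange(yval, xval) %>% 
    print(n=10)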

## # A tibble: 25 x 6
##    width height  xval  yval space text    
##    <int>  <int> <int> <int> <lgl> <chr>   
##  1    14      5   130    63 TRUE  FIRST   
##  2    13      5   145    63 FALSE DUDE    
##  3    11      7    85    66 TRUE  CD      
##  4    13      7   101    66 FALSE p27     
##  5    21      7   238    66 TRUE  SKYE    
##  6    43      7   262    66 FALSE DIAMONDS
##  7     4      7   319    66 TRUE  F       
##  8     8      7   328    66 FALSE 13      
##  9     4      7   381    66 TRUE  7       
## 10    19      7   390    66 TRUE  Race    
## # ... with 15 more rows
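
The space column records whether a word is followed by a space, so it can be used to glue words back into phrases. A minimal sketch, assuming the first eight header rows above are retyped into a base R data frame (hdr, starts and phrase_id are illustrative names, not part of pdftools):

```r
# First eight rows of the header output above.
hdr <- data.frame(
  xval  = c(130, 145, 85, 101, 238, 262, 319, 328),
  yval  = c(63, 63, 66, 66, 66, 66, 66, 66),
  space = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  text  = c("FIRST", "DUDE", "CD", "p27", "SKYE", "DIAMONDS", "F", "13"),
  stringsAsFactors = FALSE
)
hdr <- hdr[order(hdr$yval, hdr$xval), ]

# A new phrase starts at the first word and after any word with space == FALSE.
starts <- c(TRUE, !head(hdr$space, -1))
phrase_id <- cumsum(starts)

# Paste the words of each phrase together.
phrases <- as.vector(tapply(hdr$text, phrase_id, paste, collapse = " "))
phrases
```

This recovers phrases such as the horse name without hard-coding positions, although the position filters on the following slides are still needed to tell which phrase is which.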

Our Sample PDF - HORSE

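# Horse name: height-7 text boxes above yval 69, between xval 110 and 320,
# and wider than 6
horse <- header %>% 
  filter(height==7 & yval < 69 & xval >110 & xval <320 & width > 6) %>%
  select(text) %>% 
  summarise(val=paste(text, collapse=" ")) %>%
  rename(horse = val) %>%
  # An empty match collapses to "", which is stored as the string "NA"
  mutate(horse = if_else(horse == "","NA",horse)) %>% 
  print(n = Inf)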

## # A tibble: 1 x 1
##   horse        
##   <chr>        
## 1 SKYE DIAMONDS

Our Sample PDF - SIRE

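# Sire: height-5 text boxes above yval 69
sire <- header %>% 
  filter(height==5 & yval < 69) %>%
  select(text) %>% 
  summarise(val=paste(text, collapse=" ")) %>%
  rename(sire = val) %>%
  # An empty match collapses to "", which is stored as the string "NA"
  mutate(sire = if_else(sire == "","NA",sire)) %>% 
  print(n = Inf)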

## # A tibble: 1 x 1
##   sire      
##   <chr>     
## 1 FIRST DUDE
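
The horse and sire steps follow the same pattern: filter the header rows, then collapse the surviving words into one string. Below is a minimal sketch of that pattern as a reusable helper, written in base R so it runs without extra packages. collapse_words and record are hypothetical names, and the data frame holds only a few header rows retyped from the output above. Returning a real NA for an empty match, rather than the string "NA" used in the slides, is a choice made in this sketch.

```r
# A few header rows retyped from the output above.
header <- data.frame(
  height = c(5, 5, 7, 7, 7, 7),
  xval   = c(130, 145, 85, 101, 238, 262),
  yval   = c(63, 63, 66, 66, 66, 66),
  width  = c(14, 13, 11, 13, 21, 43),
  text   = c("FIRST", "DUDE", "CD", "p27", "SKYE", "DIAMONDS"),
  stringsAsFactors = FALSE
)

# Collapse the words selected by a logical vector into one string.
collapse_words <- function(df, keep) {
  words <- df$text[keep]
  if (length(words) == 0) return(NA_character_)
  paste(words, collapse = " ")
}

# One tidy row per page: each field uses the same filter as the slides.
record <- data.frame(
  horse = collapse_words(header, with(header,
    height == 7 & yval < 69 & xval > 110 & xval < 320 & width > 6)),
  sire  = collapse_words(header, with(header, height == 5 & yval < 69)),
  stringsAsFactors = FALSE
)
record
```

Building one row per page this way makes it straightforward to loop over all pages of the file and bind the rows into a single tidy table.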

Reference Material

The PDF Tools Package:

https://cran.r-project.org/web/packages/pdftools/pdftools.pdf

Introducing pdftools - rOpenSci:

https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen/

Extracting PDF Text with R and creating Tidy Data - Datazar Blog:

https://blog.datazar.com/extracting-pdf-text-with-r-and-creating-tidy-data-f399011549cc

pdftools 2.0: powerful PDF text extraction tools:

https://www.r-bloggers.com/pdftools-2-0-powerful-pdf-text-extraction-tools/