- The Challenge
- Sample PDF
- The Tools
- PDFTools
- Working Example
11/6/2019
What appears tidy and clean in a PDF document is often a hot mess of unrelated lines, bitmaps and text boxes, each with its own size, position and content. These documents lack the structure of HTML, JSON and XML data.
[Figure: Skye_Diamonds_Sheet (sample PDF page)]
Several Good Packages For Extracting Data From PDF Files
We Will Work With pdftools
There are also many tools that convert PDF to HTML or JSON, so you can extract from a more structured document.
install.packages("pdftools")
library(pdftools)
- pdf_info - returns document info: pages, version, encryption, etc.
- pdf_fonts - returns a data frame of the fonts used in the document.
- pdf_text - renders all text boxes on a text canvas and returns a character vector with one element per page of the PDF file.
- pdf_data - returns one data frame per page, with one row per text box holding its text, width, height, x, y and space. This is low-level data.
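As a rough sketch of what each of these four functions returns, the example below builds a small two-page PDF with base R graphics and reads it back. The temporary file and the page labels are assumptions made only for illustration, and `pdf_data` needs a reasonably recent poppler build.

```r
library(pdftools)

# Build a throwaway two-page PDF so the example has something to read.
tmp <- tempfile(fileext = ".pdf")
pdf(tmp)
for (label in c("first page", "second page")) {
  plot.new()               # each new plot starts a new page
  text(0.5, 0.5, label)    # draw the label in the middle of the page
}
invisible(dev.off())

info  <- pdf_info(tmp)   # named list of document properties
fonts <- pdf_fonts(tmp)  # data frame with one row per font
txt   <- pdf_text(tmp)   # character vector, one element per page
boxes <- pdf_data(tmp)   # list of data frames, one per page

info$pages         # number of pages
length(txt)        # same as the number of pages
names(boxes[[1]])  # columns describing each text box
```

The point to notice is the shape of each result: `pdf_text` and `pdf_data` are both indexed by page, which is why the working example picks out page 28 with `txt[28]` and `[[28]]`.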
pdf_file <- "C:/Users/mutue/OneDrive/Documents/cd110318.pdf"
info <- pdf_info(pdf_file)
info
## $version
## [1] "1.5"
##
## $pages
## [1] 148
##
## $encrypted
## [1] FALSE
##
## $linearized
## [1] FALSE
##
## $keys
## $keys$Producer
## [1] "GPL Ghostscript 9.21"
##
##
## $created
## [1] "2018-10-31 22:47:31 EDT"
##
## $modified
## [1] "2018-10-31 22:47:31 EDT"
##
## $metadata
## [1] "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n<?adobe-xap-filters esc=\"CRLF\"?>\n<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>\n<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='GPL Ghostscript 9.21'/>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2018-10-31T22:47:31-04:00</xmp:ModifyDate>\n<xmp:CreateDate>2018-10-31T22:47:31-04:00</xmp:CreateDate>\n<xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710'/>\n<rdf:Description rdf:about='uuid:fa2362e4-dfdb-11e8-0000-9bf6d4d2f710' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Untitled</rdf:li></rdf:Alt></dc:title></rdf:Description>\n</rdf:RDF>\n</x:xmpmeta>\n \n \n<?xpacket end='w'?>"
##
## $locked
## [1] FALSE
##
## $attachments
## [1] FALSE
##
## $layout
## [1] "no_layout"
pdf_file <- "C:/Users/mutue/OneDrive/Documents/cd110318.pdf"
fonts <- pdf_fonts(pdf_file)
fonts
## # A tibble: 15 x 4
##    name                           type   embedded file
##    <chr>                          <chr>  <lgl>    <chr>
##  1 BBEBZF+Helvetica-Bold          type1c TRUE     ""
##  2 XJFMAW+Helvetica-BoldOblique   type1c TRUE     ""
##  3 ZCDDSL+H7                      type1c TRUE     ""
##  4 DJJQIH+NewCenturySchlbk-Italic type1c TRUE     ""
##  5 ZTDARH+A2Gross                 type1c TRUE     ""
##  6 NOAYYY+HV3-Normal              type1c TRUE     ""
##  7 DKVGSP+AftSym-Bold             type1c TRUE     ""
##  8 ZOCOZE+W1HotDog                type1c TRUE     ""
##  9 FWIDOR+Courier-Bold            type1c TRUE     ""
## 10 FCTZRW+Rs30Bold                type1c TRUE     ""
## 11 EGFVLB+Helvetica-Narrow        type1c TRUE     ""
## 12 VKGRRG+R2Sq1                   type1c TRUE     ""
## 13 ZXXPJZ+NewCenturySchlbk-Roman  type1c TRUE     ""
## 14 VCBVYO+ZapfDingbats            type1c TRUE     ""
## 15 BSMSUZ+CaxExBold1              type1c TRUE     ""
pdf_file <- "C:/Users/mutue/OneDrive/Documents/cd110318.pdf"
txt <- pdf_text(pdf_file)
cat(txt[28])
## Ragozin -- The Sheets TM
## FIRST DUDE
## CD p27 EXONERATED-JOHANNESBURG CA SKYE DIAMONDS F 13 7 Race 3
## 3 RACES 15 7 RACES 16 7 RACES 17 6 RACES 18
## = 17" v AWSAD27 D D
## E E E
## C C C
## F 15+ vw MSLA 5
## N N N
## O O O
## V V V
## 10 v AWSA 6 8+ v AWDM 4 F/M 5YO 3NOV BSr
## O Yw 50SAO O
## C 12- CBSr C
## T T T
## 7- vw AWSA 7
## S S S
## E E E
## P P P
## A 12- Vw&BS 40DMA27 A
## U U U
## G G 5- Yw AWDMG13 13- v AWDM 12
## J J J
## L L L
## Y Y Y
## 5- Vw AWLA 8 7- v AWLA 7
## J J J
## U U U
## N N N
## M M M
## A A A
## r' 19 - v MSSAYMMy Y Y
## . 8" m AWSA 7 ' 14+ V AWCD 5
## A A A
## P P P
## R R R
## r. 24+ AWSA 9
## M M 9 w AWSAM26 7- AWSA 24
## A A A
## R R R
## 12- w AWSA 6
## F F F
## E 18 st AWSAE20
## B B 5- vw AWSAE
## B16
## 4 bv AWSA 18
## 14" v AWSA 29
## J J J
## A A A
## N N N
## . 10 mYw AWSA 7
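Each element of the `pdf_text` result is one long string with embedded newlines, so a common first step is to split it into lines and then into fields. The sketch below uses a short hand-typed stand-in for a page. The words come from the output above, but the spacing is invented for illustration, because the real spacing depends on the page layout. Splitting on runs of two or more spaces is likewise only one plausible heuristic.

```r
# Stand-in for one element of pdf_text() output; the spacing is illustrative only.
page <- paste0(
  "Ragozin -- The Sheets TM\n",
  "      FIRST DUDE\n",
  "CD p27   EXONERATED-JOHANNESBURG CA   SKYE DIAMONDS   F 13\n"
)

# Split the page into lines, trim padding, and drop empty lines.
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

# Treat runs of two or more spaces as column separators.
fields <- strsplit(lines, " {2,}")

fields[[3]]
```

On a page as irregular as this one the heuristic breaks down quickly, which is the motivation for dropping down to the positional data from `pdf_data`.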
library(dplyr)  # supplies %>%, as_tibble(), rename(), filter() and the other verbs used below
x <- pdf_data(pdf_file)[[28]] %>%
  as_tibble() %>%
  rename(xval = x, yval = y)
x
## # A tibble: 254 x 6
##    width height  xval  yval space text
##    <int>  <int> <int> <int> <lgl> <chr>
##  1    50     12    70    46 TRUE  Ragozin
##  2     7      8   125    47 TRUE  --
##  3    20      8   136    47 TRUE  The
##  4    38      8   160    47 FALSE Sheets
##  5    14      5   130    63 TRUE  FIRST
##  6    13      5   145    63 FALSE DUDE
##  7    76      5   130    70 TRUE  EXONERATED-JOHANNESBURG
##  8     6      5   210    70 FALSE CA
##  9    11      7    85    66 TRUE  CD
## 10    13      7   101    66 FALSE p27
## # ... with 244 more rows
header <- x %>%
  filter(yval > 47 & yval < 80) %>%
  arrange(yval, xval) %>%
  print(n = 10)
## # A tibble: 25 x 6
##    width height  xval  yval space text
##    <int>  <int> <int> <int> <lgl> <chr>
##  1    14      5   130    63 TRUE  FIRST
##  2    13      5   145    63 FALSE DUDE
##  3    11      7    85    66 TRUE  CD
##  4    13      7   101    66 FALSE p27
##  5    21      7   238    66 TRUE  SKYE
##  6    43      7   262    66 FALSE DIAMONDS
##  7     4      7   319    66 TRUE  F
##  8     8      7   328    66 FALSE 13
##  9     4      7   381    66 TRUE  7
## 10    19      7   390    66 TRUE  Race
## # ... with 15 more rows
horse <- header %>%
  filter(height == 7 & yval < 69 & xval > 110 & xval < 320 & width > 6) %>%
  select(text) %>%
  summarise(val = paste(text, collapse = " ")) %>%
  rename(horse = val) %>%
  mutate(horse = if_else(horse == "", "NA", horse)) %>%
  print(n = Inf)
## # A tibble: 1 x 1
##   horse
##   <chr>
## 1 SKYE DIAMONDS
sire <- header %>%
  filter(height == 5 & yval < 69) %>%
  select(text) %>%
  summarise(val = paste(text, collapse = " ")) %>%
  rename(sire = val) %>%
  mutate(sire = if_else(sire == "", "NA", sire)) %>%
  print(n = Inf)
## # A tibble: 1 x 1
##   sire
##   <chr>
## 1 FIRST DUDE
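The horse and sire extractions follow the same pattern: filter text boxes by size and position, then paste the surviving words together. One way to avoid repeating that pattern is a small helper, sketched below. The name `collapse_text` is hypothetical and is not part of pdftools or dplyr. So that the example runs without the PDF, `header` is rebuilt here from the ten rows printed above.

```r
library(dplyr)

# The first ten header rows, copied from the printed output above.
header <- tibble::tribble(
  ~width, ~height, ~xval, ~yval, ~space, ~text,
     14L,      5L,  130L,   63L,  TRUE,  "FIRST",
     13L,      5L,  145L,   63L,  FALSE, "DUDE",
     11L,      7L,   85L,   66L,  TRUE,  "CD",
     13L,      7L,  101L,   66L,  FALSE, "p27",
     21L,      7L,  238L,   66L,  TRUE,  "SKYE",
     43L,      7L,  262L,   66L,  FALSE, "DIAMONDS",
      4L,      7L,  319L,   66L,  TRUE,  "F",
      8L,      7L,  328L,   66L,  FALSE, "13",
      4L,      7L,  381L,   66L,  TRUE,  "7",
     19L,      7L,  390L,   66L,  TRUE,  "Race"
)

# Hypothetical helper: order boxes top-to-bottom, left-to-right,
# then join their text into a single string.
collapse_text <- function(boxes) {
  boxes %>%
    arrange(yval, xval) %>%
    summarise(val = paste(text, collapse = " ")) %>%
    pull(val)
}

horse <- header %>%
  filter(height == 7, yval < 69, xval > 110, xval < 320, width > 6) %>%
  collapse_text()

sire <- header %>%
  filter(height == 5, yval < 69) %>%
  collapse_text()

tibble::tibble(horse = horse, sire = sire)
```

The thresholds are the same ones used above and are specific to this sheet layout, so a different PDF would need its own values, found by inspecting the `pdf_data` output.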
https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen/
https://blog.datazar.com/extracting-pdf-text-with-r-and-creating-tidy-data-f399011549cc
https://www.r-bloggers.com/pdftools-2-0-powerful-pdf-text-extraction-tools/