Reading PDF Documents

How to read PDF into R using pdftools

When you have a PDF paper with tables that no one bothered to upload as a supplementary file in an R-friendly .csv format… this is what you can do.

First, let’s get the offending data in:

link<-"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5785562/bin/NIHMS867379-supplement-1.pdf"
dir.create("scratch", showWarnings = FALSE) # it is tidier to keep your data in a separate folder, see Reproducible Research
myfile <- "scratch/NIHMS867379-supplement-1.pdf"
download.file(link, myfile, mode = "wb") # "wb" (binary mode) keeps the PDF intact on Windows

library(pdftools)

## Warning: package 'pdftools' was built under R version 3.5.2

txt<-pdf_text(myfile)

This supplementary file is a bunch of tables bundled together. They all come in different forms (number of columns, column names, and some have table legends), so I need to process them one by one. Let’s say the one I am specifically interested in is on page 23.
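If you would rather not hunt for the page number by eye, the caption can be searched for. This is only a sketch: the `pages` vector below is a made-up stand-in for what pdf_text() returns (one string per page), and the captions for tables 4 and 6 are invented for illustration.

```r
# Mock stand-in for the output of pdf_text(): one character string per page.
# Only the "Supplementary Table 5" caption is taken from the real file.
pages <- c(
  "Supplementary Table 4. A made-up caption.\n  A    B\n",
  "Supplementary Table 5. Predicted upstream regulators in TCGA subtypes.\n  Subtype    Upstream Regulator\n",
  "Supplementary Table 6. Another made-up caption.\n"
)

caption <- "Supplementary Table 5."

# fixed = TRUE treats the caption as plain text, so the dots are not regex wildcards
page_index <- which(grepl(caption, pages, fixed = TRUE))
page_index
```

On the real document the same call would be applied to txt instead of the mock vector.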

I will load the dplyr package because it makes data handling easier and closer to natural language. In particular, it provides the pipe operator ‘%>%’, which passes data on to the next function, just like ‘|’ on the command line. Note the use of suppressMessages(), which does exactly what it says: it suppresses messages, in particular the printouts from loading libraries, which can sometimes be quite extensive.

suppressMessages(library(dplyr))

## Warning: package 'dplyr' was built under R version 3.5.2

Now, here is my page 23:

txt[23]

## [1] "Supplementary Table 5. Predicted upstream regulators in TCGA subtypes.\n  Subtype    Upstream Regulator        Gene Functions                        Predicted State     z-score*\n    EBV      STAT1                     transcription regulator               Activated           2.092\n             CDKN2A                    transcription regulator               Activated           1.969\n             IL3                       cytokine                              Activated           1.951\n             IL27                      cytokine                              Activated           1.934\n             IFNB1                     cytokine                              Activated           1.927\n             MYD88                     other                                 Activated           1.802\n             IFNG                      cytokine                              Activated           1.79\n             IL2                       cytokine                              Activated           1.438\n             IL21                      cytokine                              Activated           1.397\n             IL1B                      cytokine                              Activated           1.293\n             ERBB4                     kinase                                Inhibited           -1.98\n             INSR                      kinase                                Inhibited           -1.982\n             FOXC2                     transcription regulator               Inhibited           -1.982\n             ADRB                      group                                 Inhibited           -2\n             ERBB3                     kinase                                Inhibited           -2\n             HNF1A                     transcription regulator               Inhibited           -2.138\n             PPARA                     ligand-dependent nuclear receptor     Inhibited           -2.176\n             EPAS1                     
transcription regulator               Inhibited           -2.177\n             PPARG                     ligand-dependent nuclear receptor     Inhibited           -2.292\n             HNF4A                     transcription regulator               Inhibited           -2.535\n    MSI      PPARGC1A                  transcription regulator               Activated           2.163\n             IL1B                      cytokine                              Activated           2.093\n             EZH2                      transcription regulator               Activated           2\n             NFKBIA                    transcription regulator               Activated           1.955\n             IL17A                     cytokine                              Activated           1.944\n             TP53                      transcription regulator               Activated           1.845\n             F2                        peptidase                             Activated           1.746\n             APP                       other                                 Activated           1.744\n             PGR                       ligand-dependent nuclear receptor     Activated           1.741\n             MGEA5                     enzyme                                Activated           1.667\n             TGFB1                     growth factor                         Inhibited           -1.056\n             NR1I2                     ligand-dependent nuclear receptor     Inhibited           -1.067\n             KLF3                      transcription regulator               Inhibited           -1.134\n             CST5                      other                                 Inhibited           -1.633\n             PRDM1                     transcription regulator               Inhibited           -1.964\n             CBX5                      transcription regulator               Inhibited           -2.236\n     GS      NUPR1                     transcription 
regulator               Activated           4.69\n             CDKN2A                    transcription regulator               Activated           3.706\n             SRF                       transcription regulator               Activated           3.592\n             TP53                      transcription regulator               Activated           3.59\n             mir-21                    microrna                              Activated           3.548\n             TCF3                      transcription regulator               Activated           3.286\n             RBL2                      other                                 Activated           2.91\n             MKL1                      transcription regulator               Activated           2.777\n             MYOCD                     transcription regulator               Activated           2.763\n             BNIP3L                    other                                 Activated           2.646\n             MED1                      transcription regulator               Inhibited           -3.108\n             FOXM1                     transcription regulator               Inhibited           -3.23\n             CCND1                     transcription regulator               Inhibited           -3.342\n             HGF                       growth factor                         Inhibited           -3.613\n             TBX2                      transcription regulator               Inhibited           -3.729\n             RABL6                     other                                 Inhibited           -3.873\n             MYC                       transcription regulator               Inhibited           -3.993\n             ERBB2                     kinase                                Inhibited           -4.23\n             PTGER2                    g-protein coupled receptor            Inhibited           -4.243\n             CSF2                      cytokine                          
    Inhibited           -4.574\n    CIN      PRDM1                     transcription regulator               Activated           2.236\n             TGFB1                     growth factor                         Activated           1.98\n             IL6                       cytokine                              Inhibited           -0.965\n             IL2                       cytokine                              Inhibited           -1.079\n             IL21                      cytokine                              Inhibited           -1.109\n             CHUK                      kinase                                Inhibited           -1.225\n             IL17A                     cytokine                              Inhibited           -1.344\n             IFNG                      cytokine                              Inhibited           -1.839\n             SMARCA4                   transcription regulator               Inhibited           -1.914\n             IFNB1                     cytokine                              Inhibited           -1.944\n             TLR3                      transmembrane receptor                Inhibited           -1.951\n             INF alpha                 group                                 Inhibited           -2.194\n* z-score was calculated to predict activation or inhibition of upstream regulators based on published findings\naccessible through the Ingenuity knowledge base. Regulators with z-score >1 or <-1 were reported in table. When\nmore than 10 regulators are >1 or <-1, top 10 regulators are only included.\n"

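As we can (hopefully) see, the rows are separated by \n and the columns are aligned with runs of spaces rather than tabs, so the strsplit(split = "\t") calls below leave the lines unchanged. Now we have to process that into a table.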

suppressMessages(library(stringi)) # stri_remove_empty() drops empty strings from a character vector, and we will have plenty of those

## Warning: package 'stringi' was built under R version 3.5.2

txt[23] %>% strsplit(., split ="\n") %>% unlist() %>% as.character() %>% strsplit(., split = "\t") %>% head(., n = 2) %>% tail(., n=1) %>% unlist() %>% strsplit(., split = "  ") %>% unlist() %>% stri_remove_empty() -> table.names5 # parsing the second row which has column names
table_content<-txt[23] %>% strsplit(., split ="\n") %>% unlist() %>% as.character() %>% strsplit(., split = "\t") %>% unlist() %>% head(., n = 70) %>% tail(., n = -2) # read all starting from the 3rd row

Let’s make it a little more compact, if not more readable. I like using functions to create shortcuts for things I keep reusing, like the lengthy combination of strsplit() and unlist().

By the way, here is what they are for: strsplit() splits a character string and returns a list; unlist() makes a vector out of a list, in this case a character vector.
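A tiny illustration of that pair, using a made-up three-line string instead of a real page:

```r
# A made-up "page": three lines joined by newline characters
page <- "Title line\nrow one\nrow two"

# strsplit() always returns a list, with one element per input string
pieces <- strsplit(page, split = "\n")
is.list(pieces)

# unlist() flattens that list into a plain character vector of lines
rows <- unlist(pieces)
rows
```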

SplittingPDFStrings<-function(txt, separator = "\n") {
  txt %>% strsplit(., split = separator) %>% unlist() %>% stri_remove_empty()
}
table_content<-txt[23] %>% SplittingPDFStrings() %>% SplittingPDFStrings(., separator = "\t") %>% head(., n = 70) %>% tail(., n = -2) 
table_content

##  [1] "    EBV      STAT1                     transcription regulator               Activated           2.092" 
##  [2] "             CDKN2A                    transcription regulator               Activated           1.969" 
##  [3] "             IL3                       cytokine                              Activated           1.951" 
##  [4] "             IL27                      cytokine                              Activated           1.934" 
##  [5] "             IFNB1                     cytokine                              Activated           1.927" 
##  [6] "             MYD88                     other                                 Activated           1.802" 
##  [7] "             IFNG                      cytokine                              Activated           1.79"  
##  [8] "             IL2                       cytokine                              Activated           1.438" 
##  [9] "             IL21                      cytokine                              Activated           1.397" 
## [10] "             IL1B                      cytokine                              Activated           1.293" 
## [11] "             ERBB4                     kinase                                Inhibited           -1.98" 
## [12] "             INSR                      kinase                                Inhibited           -1.982"
## [13] "             FOXC2                     transcription regulator               Inhibited           -1.982"
## [14] "             ADRB                      group                                 Inhibited           -2"    
## [15] "             ERBB3                     kinase                                Inhibited           -2"    
## [16] "             HNF1A                     transcription regulator               Inhibited           -2.138"
## [17] "             PPARA                     ligand-dependent nuclear receptor     Inhibited           -2.176"
## [18] "             EPAS1                     transcription regulator               Inhibited           -2.177"
## [19] "             PPARG                     ligand-dependent nuclear receptor     Inhibited           -2.292"
## [20] "             HNF4A                     transcription regulator               Inhibited           -2.535"
## [21] "    MSI      PPARGC1A                  transcription regulator               Activated           2.163" 
## [22] "             IL1B                      cytokine                              Activated           2.093" 
## [23] "             EZH2                      transcription regulator               Activated           2"     
## [24] "             NFKBIA                    transcription regulator               Activated           1.955" 
## [25] "             IL17A                     cytokine                              Activated           1.944" 
## [26] "             TP53                      transcription regulator               Activated           1.845" 
## [27] "             F2                        peptidase                             Activated           1.746" 
## [28] "             APP                       other                                 Activated           1.744" 
## [29] "             PGR                       ligand-dependent nuclear receptor     Activated           1.741" 
## [30] "             MGEA5                     enzyme                                Activated           1.667" 
## [31] "             TGFB1                     growth factor                         Inhibited           -1.056"
## [32] "             NR1I2                     ligand-dependent nuclear receptor     Inhibited           -1.067"
## [33] "             KLF3                      transcription regulator               Inhibited           -1.134"
## [34] "             CST5                      other                                 Inhibited           -1.633"
## [35] "             PRDM1                     transcription regulator               Inhibited           -1.964"
## [36] "             CBX5                      transcription regulator               Inhibited           -2.236"
## [37] "     GS      NUPR1                     transcription regulator               Activated           4.69"  
## [38] "             CDKN2A                    transcription regulator               Activated           3.706" 
## [39] "             SRF                       transcription regulator               Activated           3.592" 
## [40] "             TP53                      transcription regulator               Activated           3.59"  
## [41] "             mir-21                    microrna                              Activated           3.548" 
## [42] "             TCF3                      transcription regulator               Activated           3.286" 
## [43] "             RBL2                      other                                 Activated           2.91"  
## [44] "             MKL1                      transcription regulator               Activated           2.777" 
## [45] "             MYOCD                     transcription regulator               Activated           2.763" 
## [46] "             BNIP3L                    other                                 Activated           2.646" 
## [47] "             MED1                      transcription regulator               Inhibited           -3.108"
## [48] "             FOXM1                     transcription regulator               Inhibited           -3.23" 
## [49] "             CCND1                     transcription regulator               Inhibited           -3.342"
## [50] "             HGF                       growth factor                         Inhibited           -3.613"
## [51] "             TBX2                      transcription regulator               Inhibited           -3.729"
## [52] "             RABL6                     other                                 Inhibited           -3.873"
## [53] "             MYC                       transcription regulator               Inhibited           -3.993"
## [54] "             ERBB2                     kinase                                Inhibited           -4.23" 
## [55] "             PTGER2                    g-protein coupled receptor            Inhibited           -4.243"
## [56] "             CSF2                      cytokine                              Inhibited           -4.574"
## [57] "    CIN      PRDM1                     transcription regulator               Activated           2.236" 
## [58] "             TGFB1                     growth factor                         Activated           1.98"  
## [59] "             IL6                       cytokine                              Inhibited           -0.965"
## [60] "             IL2                       cytokine                              Inhibited           -1.079"
## [61] "             IL21                      cytokine                              Inhibited           -1.109"
## [62] "             CHUK                      kinase                                Inhibited           -1.225"
## [63] "             IL17A                     cytokine                              Inhibited           -1.344"
## [64] "             IFNG                      cytokine                              Inhibited           -1.839"
## [65] "             SMARCA4                   transcription regulator               Inhibited           -1.914"
## [66] "             IFNB1                     cytokine                              Inhibited           -1.944"
## [67] "             TLR3                      transmembrane receptor                Inhibited           -1.951"
## [68] "             INF alpha                 group                                 Inhibited           -2.194"

table_content[1] %>% strsplit(., split = " ") %>% unlist() %>% stri_remove_empty()

## [1] "EBV"           "STAT1"         "transcription" "regulator"    
## [5] "Activated"     "2.092"

There are several variables here. First, the cancer subtype, which is only mentioned once per block of rows: the other rows have empty space in that position until the next subtype takes its place. We need to read this position only four times, in four lines, and skip the first position altogether everywhere else.

cancerTypes<-c("EBV", "MSI", "GS", "CIN") #how do I know that?
table_content %>% SplittingPDFStrings() %>% grep(cancerTypes[4], .)

## [1] 57

Well, first of all, I happened to read the original paper. The next clue is that four times in this table a gene name (a word in ALL CAPS) is preceded by another word in ALL CAPS, which happens to be the abbreviation of a cancer subtype.

I was going to read in, as separate sequences, the rows that have a cancer type (1, 21, 37, 57) and the rows that don’t (2:20, 22:36, 38:56, 58:68), but I am sure there is a more programmatic way of solving this, with one or two loops, for example: split the strings into the individual entries of each variable, count the number of members in each vector and act accordingly. However, we should be careful, since some variables have entries that contain a space (i.e., “transcription regulator” should be counted as one thing, not two).
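One possible programmatic route, sketched here on four rows copied from the output above, is to use the indentation: rows that open a subtype block start much closer to the left margin. The variable names are mine, and the trick assumes the very first row carries a subtype.

```r
# Four rows copied from the printed table (two subtype rows, two plain rows)
rows <- c(
  "    EBV      STAT1                     transcription regulator               Activated           2.092",
  "             CDKN2A                    transcription regulator               Activated           1.969",
  "    MSI      PPARGC1A                  transcription regulator               Activated           2.163",
  "             IL1B                      cytokine                              Activated           2.093"
)

# Count the leading spaces of every row
indent <- nchar(rows) - nchar(trimws(rows, which = "left"))

# Rows that start a subtype block are indented less than the others
has_subtype <- indent < max(indent)

# First word of each row (only meaningful as a subtype where has_subtype is TRUE)
first_word <- sub("^\\s*(\\S+).*$", "\\1", rows)

# Fill the subtype down: cumsum() numbers the blocks 1, 1, 2, 2, ...
subtype <- first_word[has_subtype][cumsum(has_subtype)]
subtype
```

Run on the full table_content, which(has_subtype) should point at the same rows as listed above (1, 21, 37, 57), but I have only checked the idea on this small sample.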

Let’s try using two spaces as a separator

mylineL<-integer(length(table_content)) # number of fields found in each row
for (i in seq_along(table_content)) {
  mylineL[i]<-table_content[i] %>% SplittingPDFStrings(., separator = "  ") %>% length()
}
hist(mylineL)
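The same count can probably be had without a loop. A sketch on three rows copied from the output above: it splits on runs of two or more spaces with a regular expression (my substitution for the literal two-space separator), so single spaces inside an entry survive and no empty strings are produced.

```r
# Three rows copied from the printed table
rows <- c(
  "    EBV      STAT1                     transcription regulator               Activated           2.092",
  "             PPARA                     ligand-dependent nuclear receptor     Inhibited           -2.176",
  "             INF alpha                 group                                 Inhibited           -2.194"
)

# Trim the margins first, then split wherever two or more spaces occur in a row
fields <- strsplit(trimws(rows), " {2,}")

# lengths() returns the number of fields per row in one go
lengths(fields)
# 5 4 4

# Entries with an internal single space stay in one piece
fields[[3]][1]
# "INF alpha"
```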

Looks like it worked, because we don’t have vectors with more than 5 entries in them, so let’s go! Remember, we have the names of the variables:

table.names5[5]

## [1] " z-score*"

They look a bit iffy towards the end

suppressMessages(library(stringr))

## Warning: package 'stringr' was built under R version 3.5.2

for (i in 1:4) {
  table.names5[i]<-table.names5[i] %>% gsub(" ", "_")
}
table.names5

## [1] "_"         "_"         "_"         "_"         " z-score*"

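Not at all what I expected! The pipe hands table.names5[i] to gsub() as its first argument, which is the pattern, rather than as the text to modify.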
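A small sketch of the argument order, written without the pipe so it runs in plain R. The two column names are taken from the table header; `wrong` and `right` are just illustrative variable names.

```r
col_names <- c("Upstream Regulator", "Gene Functions")

# gsub() expects gsub(pattern, replacement, x).
# `col_names[1] %>% gsub(" ", "_")` is evaluated as the call below:
# the name becomes the pattern, " " the replacement and "_" the text.
wrong <- gsub(col_names[1], " ", "_")
wrong
# "_"

# What was intended: the names go in the third position.
# With the pipe this would be written as col_names %>% gsub(" ", "_", .)
right <- gsub(" ", "_", col_names)
right
# "Upstream_Regulator" "Gene_Functions"
```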

txt[23] %>% SplittingPDFStrings() %>% SplittingPDFStrings(., separator = "\t") %>% head(., n = 2) %>% tail(., n=1) %>% SplittingPDFStrings(., separator = "  ") -> table.names5 
table.names5

## [1] "Subtype"            "Upstream Regulator" "Gene Functions"    
## [4] "Predicted State"    " z-score*"

for (i in 1:5) {
  table.names5[i]<-gsub('([[:punct:]])|\\s+','_',table.names5[i]) # replace punctuation and white space with "_"
  # a stringr alternative: str_replace_all(table.names5[i], "[[:punct:]\\s]+", "_")
}
table.names5

## [1] "Subtype"            "Upstream_Regulator" "Gene_Functions"    
## [4] "Predicted_State"    "_z_score_"

table.names5[5]<-"z_score"
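The manual fix of the fifth name could likely be avoided by trimming stray underscores after the substitution. A sketch, with the raw names typed in by hand from the output above and `clean` as an illustrative variable name:

```r
# Column names as they come out of the header row (note the odd last one)
raw_names <- c("Subtype", "Upstream Regulator", "Gene Functions",
               "Predicted State", " z-score*")

# Step 1: turn every run of punctuation or white space into a single underscore
clean <- gsub("[[:punct:][:space:]]+", "_", raw_names)

# Step 2: strip underscores left over at the start or end of a name
clean <- gsub("^_+|_+$", "", clean)
clean
```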

Regulator<-vector()
Function<-vector()
State<-vector()
Zscore<-vector()
for (i in seq_along(table_content)) {
  myline<-table_content[i] %>% SplittingPDFStrings(., separator = "  ")
  if (length(myline) != 4) {
    myline<-myline[2:5] # rows that open a subtype block carry an extra first field; drop it
  }
  myline<-trimws(myline) # removes leading and trailing white space
  Regulator[i]<-myline[1]
  Function[i]<-myline[2]
  State[i]<-myline[3]
  Zscore[i]<-as.numeric(myline[4]) # store the z-score as a number, not as text
}
MyDataFrame<-data.frame(Regulator,
                        Function,
                        State,
                        Zscore)
MyDataFrame$Type<-c(rep("EBV", 20), rep("MSI", 36-20), rep("GS", 56-36), rep("CIN", 68-56))
names(MyDataFrame)<-table.names5

dim(MyDataFrame)

## [1] 68  5

head(MyDataFrame)

##   Subtype      Upstream_Regulator Gene_Functions Predicted_State z_score
## 1   STAT1 transcription regulator      Activated           2.092     EBV
## 2  CDKN2A transcription regulator      Activated           1.969     EBV
## 3     IL3                cytokine      Activated           1.951     EBV
## 4    IL27                cytokine      Activated           1.934     EBV
## 5   IFNB1                cytokine      Activated           1.927     EBV
## 6   MYD88                   other      Activated           1.802     EBV

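Now, apparently I have messed up the order of my column names: the data frame columns run regulator, function, state, z-score, subtype, while table.names5 starts with the subtype. Let’s fix it: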

names(MyDataFrame)<-table.names5[c(2, 3, 4, 5, 1)]
head(MyDataFrame)

##   Upstream_Regulator          Gene_Functions Predicted_State z_score Subtype
## 1              STAT1 transcription regulator       Activated   2.092     EBV
## 2             CDKN2A transcription regulator       Activated   1.969     EBV
## 3                IL3                cytokine       Activated   1.951     EBV
## 4               IL27                cytokine       Activated   1.934     EBV
## 5              IFNB1                cytokine       Activated   1.927     EBV
## 6              MYD88                   other       Activated   1.802     EBV
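For comparison, the four value columns can probably be assembled without an explicit loop. This is a sketch on four rows copied from the table, not the method used above: it splits on runs of two or more spaces, keeps the last four fields of each row, and binds them into a matrix. The names `fields`, `last4`, `mat` and `df` are mine.

```r
# Four rows copied from the printed table
rows <- c(
  "    EBV      STAT1                     transcription regulator               Activated           2.092",
  "             CDKN2A                    transcription regulator               Activated           1.969",
  "             ERBB4                     kinase                                Inhibited           -1.98",
  "             INF alpha                 group                                 Inhibited           -2.194"
)

# Split every row on runs of two or more spaces
fields <- strsplit(trimws(rows), " {2,}")

# Keep the last four fields, which drops the subtype where it is present
last4 <- lapply(fields, function(f) tail(f, 4))

# Stack the rows into a character matrix, one column per variable
mat <- do.call(rbind, last4)

df <- data.frame(
  Upstream_Regulator = mat[, 1],
  Gene_Functions     = mat[, 2],
  Predicted_State    = mat[, 3],
  z_score            = as.numeric(mat[, 4]), # numbers, not text
  stringsAsFactors   = FALSE
)
df
```

The subtype column would still have to be added separately, as it was with rep() above.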

Phew! Totally worth it.

Elizabeth Permina

12/19/2019
