When you have a PDF paper with tables that no one bothered to upload as a supplementary file in an R-friendly .csv format… this is what you can do.
First, let’s get the offending data in:
link<-"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5785562/bin/NIHMS867379-supplement-1.pdf"
dir.create("scratch", showWarnings = FALSE) # it is tidier to keep your data in a separate folder, see Reproducible Research
myfile <- "scratch/NIHMS867379-supplement-1.pdf"
download.file(link, myfile)
library(pdftools)
## Warning: package 'pdftools' was built under R version 3.5.2
txt<-pdf_text(myfile)
This supplement is a bunch of tables bundled together. They all come in different forms (number of columns, column names; some also have table legends), so I need to process them one by one. Let’s say the one I am specifically interested in is on page 23.
I will load the dplyr package, because it makes data handling easier and closer to natural language; in particular, it provides the pipe operator ‘%>%’, which passes data on to the next function, just like ‘|’ on the command line. Note the use of the suppressMessages() function, which does exactly that: it suppresses messages, in particular the printouts from loading libraries, which can sometimes be quite extensive.
suppressMessages(library(dplyr))
## Warning: package 'dplyr' was built under R version 3.5.2
Now, here is my page 23:
txt[23]
## [1] "Supplementary Table 5. Predicted upstream regulators in TCGA subtypes.\n Subtype Upstream Regulator Gene Functions Predicted State z-score*\n EBV STAT1 transcription regulator Activated 2.092\n CDKN2A transcription regulator Activated 1.969\n IL3 cytokine Activated 1.951\n IL27 cytokine Activated 1.934\n IFNB1 cytokine Activated 1.927\n MYD88 other Activated 1.802\n IFNG cytokine Activated 1.79\n IL2 cytokine Activated 1.438\n IL21 cytokine Activated 1.397\n IL1B cytokine Activated 1.293\n ERBB4 kinase Inhibited -1.98\n INSR kinase Inhibited -1.982\n FOXC2 transcription regulator Inhibited -1.982\n ADRB group Inhibited -2\n ERBB3 kinase Inhibited -2\n HNF1A transcription regulator Inhibited -2.138\n PPARA ligand-dependent nuclear receptor Inhibited -2.176\n EPAS1 transcription regulator Inhibited -2.177\n PPARG ligand-dependent nuclear receptor Inhibited -2.292\n HNF4A transcription regulator Inhibited -2.535\n MSI PPARGC1A transcription regulator Activated 2.163\n IL1B cytokine Activated 2.093\n EZH2 transcription regulator Activated 2\n NFKBIA transcription regulator Activated 1.955\n IL17A cytokine Activated 1.944\n TP53 transcription regulator Activated 1.845\n F2 peptidase Activated 1.746\n APP other Activated 1.744\n PGR ligand-dependent nuclear receptor Activated 1.741\n MGEA5 enzyme Activated 1.667\n TGFB1 growth factor Inhibited -1.056\n NR1I2 ligand-dependent nuclear receptor Inhibited -1.067\n KLF3 transcription regulator Inhibited -1.134\n CST5 other Inhibited -1.633\n PRDM1 transcription regulator Inhibited -1.964\n CBX5 transcription regulator Inhibited -2.236\n GS NUPR1 transcription regulator Activated 4.69\n CDKN2A transcription regulator Activated 3.706\n SRF transcription regulator Activated 3.592\n TP53 transcription regulator Activated 3.59\n mir-21 microrna Activated 3.548\n TCF3 transcription regulator Activated 3.286\n RBL2 other Activated 2.91\n MKL1 transcription regulator Activated 2.777\n MYOCD transcription regulator Activated 
2.763\n BNIP3L other Activated 2.646\n MED1 transcription regulator Inhibited -3.108\n FOXM1 transcription regulator Inhibited -3.23\n CCND1 transcription regulator Inhibited -3.342\n HGF growth factor Inhibited -3.613\n TBX2 transcription regulator Inhibited -3.729\n RABL6 other Inhibited -3.873\n MYC transcription regulator Inhibited -3.993\n ERBB2 kinase Inhibited -4.23\n PTGER2 g-protein coupled receptor Inhibited -4.243\n CSF2 cytokine Inhibited -4.574\n CIN PRDM1 transcription regulator Activated 2.236\n TGFB1 growth factor Activated 1.98\n IL6 cytokine Inhibited -0.965\n IL2 cytokine Inhibited -1.079\n IL21 cytokine Inhibited -1.109\n CHUK kinase Inhibited -1.225\n IL17A cytokine Inhibited -1.344\n IFNG cytokine Inhibited -1.839\n SMARCA4 transcription regulator Inhibited -1.914\n IFNB1 cytokine Inhibited -1.944\n TLR3 transmembrane receptor Inhibited -1.951\n INF alpha group Inhibited -2.194\n* z-score was calculated to predict activation or inhibition of upstream regulators based on published findings\naccessible through the Ingenuity knowledge base. Regulators with z-score >1 or <-1 were reported in table. When\nmore than 10 regulators are >1 or <-1, top 10 regulators are only included.\n"
As we can (hopefully) see, the rows are separated by \n and the columns by whitespace. Now we have to process that into a table.
suppressMessages(library(stringi)) # stri_remove_empty() drops empty strings from a character vector, and we will have plenty of those
## Warning: package 'stringi' was built under R version 3.5.2
txt[23] %>% strsplit(., split ="\n") %>% unlist() %>% as.character() %>% strsplit(., split = "\t") %>% head(., n = 2) %>% tail(., n=1) %>% unlist() %>% strsplit(., split = "  ") %>% unlist() %>% stri_remove_empty() -> table.names5 # parsing the second row, which has the column names; the last split is on two spaces
table_content<-txt[23] %>% strsplit(., split ="\n") %>% unlist() %>% as.character() %>% strsplit(., split = "\t") %>% unlist() %>% head(., n = 70) %>% tail(., n = -2) # read all starting from the 3rd row
Let’s make it a little bit more compact, if not more readable. I like using functions to create shortcuts for things I keep reusing, like the lengthy combination of strsplit() and unlist().
By the way, here is what they are for: strsplit() splits a character string and returns a list, and unlist() makes a vector out of a list; in this case, a character vector.
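To see the two steps separately, here is a tiny sketch on a made-up string (page and page_lines are illustrative names, not part of the analysis):

```r
# A made-up two-line "page", standing in for one element of txt
page <- "first line\nsecond line"

pieces <- strsplit(page, split = "\n")
class(pieces)      # "list": one list element per input string

page_lines <- unlist(pieces)
class(page_lines)  # "character": now a plain vector
length(page_lines) # 2
```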
# Split on `separator`, flatten to a character vector, drop empty strings
SplittingPDFStrings<-function(txt, separator = "\n") {
  txt %>% strsplit(., split = separator) %>% unlist() %>% stri_remove_empty()
}
table_content<-txt[23] %>% SplittingPDFStrings() %>% SplittingPDFStrings(., separator = "\t") %>% head(., n = 70) %>% tail(., n = -2)
table_content
## [1] " EBV STAT1 transcription regulator Activated 2.092"
## [2] " CDKN2A transcription regulator Activated 1.969"
## [3] " IL3 cytokine Activated 1.951"
## [4] " IL27 cytokine Activated 1.934"
## [5] " IFNB1 cytokine Activated 1.927"
## [6] " MYD88 other Activated 1.802"
## [7] " IFNG cytokine Activated 1.79"
## [8] " IL2 cytokine Activated 1.438"
## [9] " IL21 cytokine Activated 1.397"
## [10] " IL1B cytokine Activated 1.293"
## [11] " ERBB4 kinase Inhibited -1.98"
## [12] " INSR kinase Inhibited -1.982"
## [13] " FOXC2 transcription regulator Inhibited -1.982"
## [14] " ADRB group Inhibited -2"
## [15] " ERBB3 kinase Inhibited -2"
## [16] " HNF1A transcription regulator Inhibited -2.138"
## [17] " PPARA ligand-dependent nuclear receptor Inhibited -2.176"
## [18] " EPAS1 transcription regulator Inhibited -2.177"
## [19] " PPARG ligand-dependent nuclear receptor Inhibited -2.292"
## [20] " HNF4A transcription regulator Inhibited -2.535"
## [21] " MSI PPARGC1A transcription regulator Activated 2.163"
## [22] " IL1B cytokine Activated 2.093"
## [23] " EZH2 transcription regulator Activated 2"
## [24] " NFKBIA transcription regulator Activated 1.955"
## [25] " IL17A cytokine Activated 1.944"
## [26] " TP53 transcription regulator Activated 1.845"
## [27] " F2 peptidase Activated 1.746"
## [28] " APP other Activated 1.744"
## [29] " PGR ligand-dependent nuclear receptor Activated 1.741"
## [30] " MGEA5 enzyme Activated 1.667"
## [31] " TGFB1 growth factor Inhibited -1.056"
## [32] " NR1I2 ligand-dependent nuclear receptor Inhibited -1.067"
## [33] " KLF3 transcription regulator Inhibited -1.134"
## [34] " CST5 other Inhibited -1.633"
## [35] " PRDM1 transcription regulator Inhibited -1.964"
## [36] " CBX5 transcription regulator Inhibited -2.236"
## [37] " GS NUPR1 transcription regulator Activated 4.69"
## [38] " CDKN2A transcription regulator Activated 3.706"
## [39] " SRF transcription regulator Activated 3.592"
## [40] " TP53 transcription regulator Activated 3.59"
## [41] " mir-21 microrna Activated 3.548"
## [42] " TCF3 transcription regulator Activated 3.286"
## [43] " RBL2 other Activated 2.91"
## [44] " MKL1 transcription regulator Activated 2.777"
## [45] " MYOCD transcription regulator Activated 2.763"
## [46] " BNIP3L other Activated 2.646"
## [47] " MED1 transcription regulator Inhibited -3.108"
## [48] " FOXM1 transcription regulator Inhibited -3.23"
## [49] " CCND1 transcription regulator Inhibited -3.342"
## [50] " HGF growth factor Inhibited -3.613"
## [51] " TBX2 transcription regulator Inhibited -3.729"
## [52] " RABL6 other Inhibited -3.873"
## [53] " MYC transcription regulator Inhibited -3.993"
## [54] " ERBB2 kinase Inhibited -4.23"
## [55] " PTGER2 g-protein coupled receptor Inhibited -4.243"
## [56] " CSF2 cytokine Inhibited -4.574"
## [57] " CIN PRDM1 transcription regulator Activated 2.236"
## [58] " TGFB1 growth factor Activated 1.98"
## [59] " IL6 cytokine Inhibited -0.965"
## [60] " IL2 cytokine Inhibited -1.079"
## [61] " IL21 cytokine Inhibited -1.109"
## [62] " CHUK kinase Inhibited -1.225"
## [63] " IL17A cytokine Inhibited -1.344"
## [64] " IFNG cytokine Inhibited -1.839"
## [65] " SMARCA4 transcription regulator Inhibited -1.914"
## [66] " IFNB1 cytokine Inhibited -1.944"
## [67] " TLR3 transmembrane receptor Inhibited -1.951"
## [68] " INF alpha group Inhibited -2.194"
table_content[1] %>% strsplit(., split = " ") %>% unlist() %>% stri_remove_empty()
## [1] "EBV" "STAT1" "transcription" "regulator"
## [5] "Activated" "2.092"
There are several variables here. First comes the cancer subtype, which is mentioned only once per block of rows; the other rows have empty space in this position until the next subtype takes its place. So we need to read this position only four times, in four lines, and skip the first position everywhere else.
cancerTypes<-c("EBV", "MSI", "GS", "CIN") #how do I know that?
table_content %>% SplittingPDFStrings() %>% grep(cancerTypes[4], .)
## [1] 57
Well, first of all, I happened to read the original paper. The next clue is that four times in this table a gene name (a word in ALL CAPS) is preceded by some other word in ALL CAPS: as it happens, an abbreviation of a cancer subtype.
I was going to read in separately the rows that have a cancer type (1, 21, 37, 57) and the rows that don’t (2:20, 22:36, 38:56, 58:68), but I am sure there is a more programmatic way of solving this, with one or two loops, for example (split the strings into individual entries of each variable, count the number of members in each vector and act accordingly). However, we should be careful, since some variables have entries that contain a space (i.e., “transcription regulator” should be counted as one thing, not two).
Let’s try using two spaces as a separator:
mylineL<-vector()
for (i in seq_along(table_content)) {
  # the separator is two spaces, so single spaces inside an entry survive
  mylineL[i]<-table_content[i] %>% SplittingPDFStrings(., separator = "  ") %>% length()
}
hist(mylineL)
Looks like it worked, because we don’t have vectors with more than 5 entries in them, so let’s go! Remember, we have the names of the variables:
table.names5[5]
## [1] " z-score*"
They look a bit iffy towards the end:
suppressMessages(library(stringr))
## Warning: package 'stringr' was built under R version 3.5.2
for (i in 1:4) {
table.names5[i]<-table.names5[i] %>% gsub(" ", "_")
}
table.names5
## [1] "_" "_" "_" "_" " z-score*"
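Not at all what I expected! The pipe hands its left-hand side to gsub() as the first argument, which is the pattern rather than the string to change. Let’s parse the names again and do the replacement properly: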
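A small base-R sketch of what happened in the loop above, using a stand-in string (x, wrong and right are illustrative names):

```r
x <- "Upstream Regulator"

# What `x %>% gsub(" ", "_")` expands to: x lands in the `pattern` slot
wrong <- gsub(x, " ", "_")   # pattern = x, replacement = " ", string = "_"

# Intended call: the string to modify goes last
right <- gsub(" ", "_", x)
# With the pipe, a dot placeholder does the same: x %>% gsub(" ", "_", .)

print(wrong)  # "_"
print(right)  # "Upstream_Regulator"
```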
txt[23] %>% SplittingPDFStrings() %>% SplittingPDFStrings(., separator = "\t") %>% head(., n = 2) %>% tail(., n=1) %>% SplittingPDFStrings(., separator = "  ") -> table.names5 # the last separator is two spaces
table.names5
## [1] "Subtype" "Upstream Regulator" "Gene Functions"
## [4] "Predicted State" " z-score*"
for (i in 1:5) {
  table.names5[i]<-gsub('([[:punct:]])|\\s+','_',table.names5[i]) # punctuation and whitespace become underscores
  #str_replace_all(x,"[[:punct:]\\s]+","_")
}
table.names5
## [1] "Subtype" "Upstream_Regulator" "Gene_Functions"
## [4] "Predicted_State" "_z_score_"
table.names5[5]<-"z_score"
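Since gsub() is vectorised, the same clean-up could also be done without a loop. A minimal sketch, with the header names typed in by hand and a second, assumed regular expression that trims the stray underscores instead of overwriting the fifth name (raw_names and clean_names are illustrative names):

```r
# Copy of the header names as parsed above, typed in by hand for the sketch
raw_names <- c("Subtype", "Upstream Regulator", "Gene Functions",
               "Predicted State", " z-score*")

# gsub() works on the whole vector at once
clean_names <- gsub("[[:punct:]]|\\s+", "_", raw_names)
# Trim underscores left over at either end of a name
clean_names <- gsub("^_+|_+$", "", clean_names)
clean_names
```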
Regulator<-vector()
Function<-vector()
State<-vector()
Zscore<-vector()
for (i in seq_along(table_content)) {
  myline<-table_content[i] %>% SplittingPDFStrings(., separator = "  ") # the separator is two spaces
  if (length(myline) != 4) {
    myline<-myline[2:5] # drop the subtype, present only in the first row of each block
  }
  myline<-trimws(myline) # removes leading and trailing white space
  Regulator[i]<-myline[1]
  Function[i]<-myline[2]
  State[i]<-myline[3]
  Zscore[i]<-myline[4]
}
MyDataFrame<-data.frame(Regulator,
Function,
State,
Zscore)
MyDataFrame$Type<-c(rep("EBV", 20), rep("MSI", 36-20), rep("GS", 56-36), rep("CIN", 68-56))
names(MyDataFrame)<-table.names5
dim(MyDataFrame)
## [1] 68 5
head(MyDataFrame)
## Subtype Upstream_Regulator Gene_Functions Predicted_State z_score
## 1 STAT1 transcription regulator Activated 2.092 EBV
## 2 CDKN2A transcription regulator Activated 1.969 EBV
## 3 IL3 cytokine Activated 1.951 EBV
## 4 IL27 cytokine Activated 1.934 EBV
## 5 IFNB1 cytokine Activated 1.927 EBV
## 6 MYD88 other Activated 1.802 EBV
Now, apparently I have messed up the order of my column names; let’s fix it:
names(MyDataFrame)<-table.names5[c(2, 3, 4, 5, 1)]
head(MyDataFrame)
## Upstream_Regulator Gene_Functions Predicted_State z_score Subtype
## 1 STAT1 transcription regulator Activated 2.092 EBV
## 2 CDKN2A transcription regulator Activated 1.969 EBV
## 3 IL3 cytokine Activated 1.951 EBV
## 4 IL27 cytokine Activated 1.934 EBV
## 5 IFNB1 cytokine Activated 1.927 EBV
## 6 MYD88 other Activated 1.802 EBV
Phew! Totally worth it.
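For the record, the hard-coded row ranges in rep() could probably be avoided by counting fields per row and carrying the last seen subtype forward. What follows is only a sketch on three made-up lines that mimic table_content; it assumes that columns are always separated by at least two spaces and that the very first row carries a subtype. All names in it (gap, demo, fields, result and so on) are illustrative.

```r
gap <- strrep(" ", 3)  # three spaces, standing in for the PDF column gaps
demo <- c(
  paste("EBV", "STAT1", "transcription regulator", "Activated", "2.092", sep = gap),
  paste(gap, "IL3", "cytokine", "Activated", "1.951", sep = gap),
  paste("MSI", "EZH2", "transcription regulator", "Activated", "2", sep = gap)
)

# Split each trimmed line on runs of two or more spaces
fields <- strsplit(trimws(demo), " {2,}")

# Rows with five fields start with a subtype; rows with four do not
has_subtype <- lengths(fields) == 5

# Subtypes in order of appearance, then repeated down to the rows below them
seen <- vapply(fields[has_subtype], function(f) f[1], character(1))
subtype <- seen[cumsum(has_subtype)]

# The last four fields of every row are the actual data
rest <- t(vapply(fields, function(f) tail(f, 4), character(4)))

result <- data.frame(Subtype = subtype, rest, stringsAsFactors = FALSE)
names(result)[-1] <- c("Upstream_Regulator", "Gene_Functions",
                       "Predicted_State", "z_score")
result$z_score <- as.numeric(result$z_score)
result
```

The same idea should carry over to the real table_content, but it has not been run against the actual PDF here, so the field counts are worth checking first (the histogram of line lengths above does exactly that).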