A lot of information is often buried in the supplementary section of a paper. The best possible scenario is to have the raw information available; for example, any information in tabular format should be made available as a csv/tsv file. However, for various reasons, this information often ends up buried deep in doc/pdf files.

My aim here was to extract information from one such supplementary PDF, from the 2011 paper by Kishore et al.

rOpenSci’s pdftools package comes in very handy. I was able to extract info from Supplementary table 1. The input here is a trimmed version of the original pdf, with only Supplementary table 1.

I trimmed the PDF with pdfshuffler, but there are a lot of tools that can do this.

library(pdftools)
library(stringr)
## Extract the text of the PDF (pdf_text returns one string per page)
txt <- pdf_text('../datasets/kishore_quantitative_2011/kishore_targets.pdf')
## Split each page into lines, handling both \r\n and \n line endings
splitted_lines <- unlist(strsplit(txt, split = "\r?\n"))
## Do away with the footer
splitted_lines <- splitted_lines[!grepl('Nature Methods|Supplementary', splitted_lines)]
## Trim each line and replace multiple spaces by a single one (gsub is vectorized)
splitted_lines <- gsub("\\s+", " ", str_trim(splitted_lines))
splitted_lines[1:7]
[1] "TranscriptID BeginSite EndSite CoverageCLIP CoverageMRNASeq Sequence Sample"                        "NM_018638 3267 3306 256934.0444 0.390896169 TAATACTCTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTT HuR_CLIP_A"   
[3] "NM_020177 5604 5643 12482.26704 0.25452461 TCATCATCATTTTCTTTTTTTCTTTTTCTTTTTTTTTTTT HuR_CLIP_A"     "NM_032682 3030 3069 8616.659701 0.25384843 CCTTTACTCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT HuR_CLIP_A"    
[5] "NM_001025105 2650 2689 42182.42275 1.788372875 CATTCTTTCATTTTTTTCTTTTTTTTTTTTTTTTTTTATG HuR_CLIP_A" "NM_001164579 691 730 2775.808962 0.1277939 ACTCTTATTATTTTTTATTTTATTTTATTTTTTTATTTTT HuR_CLIP_A"    
[7] "NM_032226 2192 2231 3040.574399 0.183886964 TCATCAATATTTTTCAACTTTTTTTTTTTTTTTTTTTACT HuR_CLIP_A"   
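
Since the table is recovered purely by splitting on whitespace, it can be worth checking that every cleaned line yields the same number of fields before stacking the lines into a table, because rbind recycles shorter rows with only a warning. The following is a minimal sketch of such a check in base R, not part of the original analysis: the first two sample lines are copied from the output above, the third is a made-up truncated line, and the names sample_lines, fields, n_fields and bad_lines are purely illustrative.

```r
## Two lines copied from the output above, plus one hypothetical truncated line
sample_lines <- c(
  "TranscriptID BeginSite EndSite CoverageCLIP CoverageMRNASeq Sequence Sample",
  "NM_018638 3267 3306 256934.0444 0.390896169 TAATACTCTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTT HuR_CLIP_A",
  "NM_020177 5604 5643"
)
## Split every line on single spaces
fields <- strsplit(sample_lines, " ", fixed = TRUE)
## Number of fields found on each line
n_fields <- lengths(fields)
## Indices of lines whose field count differs from the header's
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

Any index reported in bad_lines points at a line that should be inspected by hand before the conversion.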

The next step is simple: convert the list of strings to a data frame.

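## Split every line on single spaces and stack the pieces row by row
df <- data.frame(do.call(rbind, strsplit(as.character(splitted_lines), ' ')), stringsAsFactors = FALSE)
## Set column names from the first row (the table header), then drop that row
colnames(df) <- as.character(unlist(df[1,]))
df <- df[-1,]
df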
## Note: the output is tab-separated even though the file name ends in .csv
write.table(df, file='../datasets/kishore_quantitative_2011/kishore_clip_targets.csv', row.names = FALSE, sep="\t", quote = FALSE)
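
One thing to keep in mind is that every value produced by strsplit is text, so all columns of df hold strings (or factors, depending on the R version and options), including the coordinates and coverage values. Below is a minimal sketch of converting such columns with base R's type.convert. It uses a small stand-in data frame with a few columns and two rows taken from the output above; the name toy is illustrative and this step is not part of the original workflow.

```r
## Stand-in for df: a few columns from the output above, all stored as text
toy <- data.frame(
  TranscriptID = c("NM_018638", "NM_020177"),
  BeginSite = c("3267", "5604"),
  EndSite = c("3306", "5643"),
  CoverageCLIP = c("256934.0444", "12482.26704"),
  Sample = c("HuR_CLIP_A", "HuR_CLIP_A"),
  stringsAsFactors = FALSE
)
## Let R guess a type for each column; as.is = TRUE keeps text as character
toy[] <- lapply(toy, type.convert, as.is = TRUE)
## Inspect the resulting column types
sapply(toy, class)
```

The same lapply call could be applied to df itself if numeric columns are needed for downstream analysis.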