Purpose of this notebook is to develop some graphs of CpGo/e for the Geoduck transcriptome.

lets load up some data!

data <- read_delim("~/Documents/Roberts Lab/geoduck transcriptome/Geoduck-transcriptome_v3_bigtable_v08.tab.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
Parsed with column specification:
cols(
  .default = col_character(),
  length = col_integer(),
  GC_content = col_double(),
  `CpG_o/e` = col_double(),
  male_unique_count = col_integer(),
  female_unique_count = col_integer(),
  Evalue = col_double(),
  `evalue-Gigaton` = col_double(),
  `evalue-Ruphibase` = col_double(),
  `evalue-Sigenae` = col_double(),
  Dheilly_Cluster = col_integer()
)
See spec(...) for full column specifications.
dim(data)
[1] 153982     23
head(data)
colnames(data)
 [1] "ContigID"                           "length"                             "GC_content"                        
 [4] "CpG_o/e"                            "male_unique_count"                  "female_unique_count"               
 [7] "sex-based_expresssion"              "Blastx_UniProt_Acc"                 "EntryName"                         
[10] "ProteinName"                        "Organism"                           "Evalue"                            
[13] "Gene_Ontology_ID"                   "Gene_Ontology"                      "PFAM"                              
[16] "Blastn_Gigaton-ID"                  "evalue-Gigaton"                     "Blastn_Ruphibase-ID"               
[19] "evalue-Ruphibase"                   "Blastn_Sigenae-ID"                  "evalue-Sigenae"                    
[22] "Dheilly_Cluster"                    "Dheilly_Tissue-enriched-expression"
dim(GOslim_bin)
[1] 19652     2
head(GOslim_bin)

Lets weld these two together, at least where they match.


data_merged <- merge(data, GOslim_bin, by.x = "ContigID", by.y = "Column1", all = FALSE)
GO_means[6] >- GO_means[3] + GO_means[4]
Error in `[.data.frame`(GO_means, 6) : undefined columns selected

Lets do a couple quick spot checks, to make sure my logic for calculating the SEM works (I have the right S and sqrt(N) lined up.)

protein_met_n <- length(data_merged$GOSlim_bin[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])])
protein_met_SEM <- sd(data_merged$`CpG_o/e`[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])]) / sqrt(protein_met_n)
protein_met_SEM == GO_means[10,4]
protein metabolism 
              TRUE 

Good deal, looks like it worked! Or I got really lucky twice.

Now to play with the forestplot command. I had some issues with the column gapping, originally it wanted to right justify the graph, leading to a large gap between the labels and the graph. I cheated around this by using the colgap argument with a negative number. This led to some weird spacing issues with the graph label. I’ll likely look for a better tool for making the forest plot, but this is a functioning first pass.

forestplot(labeltext = GO_means$`Gene Types`, mean = GO_means$`CpG O/E Mean`, lower = GO_means$Lower, upper = GO_means$Upper, xticks = c(0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75), boxsize = 0.08, grid = TRUE, lwd.ci = 5, ci.verticies = TRUE, lty.ci = 1, colgap = unit(-140, 'mm'), xlab = "                                                                                                                                                                                                     CpG O/E")

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

LS0tCnRpdGxlOiAiR2VvZHVjayB0cmFuc2NyaXB0b21lIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpQdXJwb3NlIG9mIHRoaXMgbm90ZWJvb2sgaXMgdG8gZGV2ZWxvcCBzb21lIGdyYXBocyBvZiBDcEdvL2UgZm9yIHRoZSBHZW9kdWNrIHRyYW5zY3JpcHRvbWUuCgpsZXRzIGxvYWQgdXAgc29tZSBkYXRhIQoKYGBge3J9CmxpYnJhcnkoZm9yZXN0cGxvdCkKCmRhdGEgPC0gcmVhZF9kZWxpbSgifi9Eb2N1bWVudHMvUm9iZXJ0cyBMYWIvZ2VvZHVjayB0cmFuc2NyaXB0b21lL0dlb2R1Y2stdHJhbnNjcmlwdG9tZV92M19iaWd0YWJsZV92MDgudGFiLnR4dCIsICJcdCIsIGVzY2FwZV9kb3VibGUgPSBGQUxTRSwgdHJpbV93cyA9IFRSVUUpCgpHT3NsaW1fYmluIDwtIHJlYWRfZGVsaW0oIn4vRG9jdW1lbnRzL1JvYmVydHMgTGFiL2dlb2R1Y2sgdHJhbnNjcmlwdG9tZS9HT3NsaW0tYmluLnR4dCIsIAogICAgIlx0IiwgZXNjYXBlX2RvdWJsZSA9IEZBTFNFLCB0cmltX3dzID0gVFJVRSkKYGBgCgoKCmBgYHtyfQpkaW0oZGF0YSkKCmhlYWQoZGF0YSkKCmNvbG5hbWVzKGRhdGEpCgpkaW0oR09zbGltX2JpbikKCmhlYWQoR09zbGltX2JpbikKCmBgYAoKTGV0cyB3ZWxkIHRoZXNlIHR3byB0b2dldGhlciwgYXQgbGVhc3Qgd2hlcmUgdGhleSBtYXRjaC4KCmBgYHtyfQoKZGF0YV9tZXJnZWQgPC0gbWVyZ2UoZGF0YSwgR09zbGltX2JpbiwgYnkueCA9ICJDb250aWdJRCIsIGJ5LnkgPSAiQ29sdW1uMSIsIGFsbCA9IEZBTFNFKQpgYGAKCmBgYHtyfQpkYXRhX21lcmdlZCRHT1NsaW1fYmluIDwtIGFzLmZhY3RvcihkYXRhX21lcmdlZCRHT1NsaW1fYmluKQpHT19tZWFuc1sxXSA8LSBsZXZlbHMoZGF0YV9tZXJnZWQkR09TbGltX2JpbikKR09fbWVhbnNbMl0gPC0gYXMuZGF0YS5mcmFtZSh0YXBwbHkoZGF0YV9tZXJnZWQkYENwR19vL2VgLCBkYXRhX21lcmdlZCRHT1NsaW1fYmluLCBtZWFuKSkKR09fbWVhbnNbM10gPC0gdGFwcGx5KGRhdGFfbWVyZ2VkJGBDcEdfby9lYCwgZGF0YV9tZXJnZWQkR09TbGltX2Jpbiwgc2QpCkdPX21lYW5zWzRdIDwtIEdPX21lYW5zWzNdL3NxcnQoc3VtbWFyeShkYXRhX21lcmdlZCRHT1NsaW1fYmluKSkKR09fbWVhbnNbNV0gPC0gR09fbWVhbnNbMl0gLSBHT19tZWFuc1s0XQpHT19tZWFuc1s2XSA8LSBHT19tZWFuc1syXSArIEdPX21lYW5zWzRdCmNvbG5hbWVzKEdPX21lYW5zKSA8LSBjKCJHZW5lIFR5cGVzIiwgIkNwRyBPL0UgTWVhbiIsICJDcEcgTy9FIFNEIiwgIkNwRyBPL0UgU0VNIiwgIkxvd2VyIiwgIlVwcGVyIikKCmhlYWQoR09fbWVhbnMpCmBgYAoKCkxldHMgZG8gYSBjb3VwbGUgcXVpY2sgc3BvdCBjaGVja3MsIHRvIG1ha2Ugc3VyZSBteSBsb2dpYyBmb3IgY2FsY3VsYXRpbmcgdGhlIFNFTSB3b3JrcyAoSSBoYXZlIHRoZSByaWdodCBTIGFuZCBzcXJ0KE4pIGxpbmVkIHVwLikKCmBgYHtyfQpzdW1tYXJ5KGRhdGFfbWVyZ2VkJEdPU2xpbV9iaW4pCmZhY3RvcihkYXRhX21lcmdlZCRHT1NsaW1fYmluKVsxXQoKdHJhbnNwb3J0X24gPC0gbGVuZ3RoKGRhdGFfbWVyZ2VkJEdPU2xpbV9iaW5bd2hpY2goZGF0YV9tZXJnZWQkR09TbGltX2JpbiA9PSBmYWN0b3IoZGF0YV9tZXJnZWQkR09TbGltX2JpbilbMV0pXSkKCnRyYW5zcG9ydF9TRU0gPC0gc2QoZGF0YV9tZXJnZWQkYENwR19vL2VgW3doaWNoKGRhdGFfbWVyZ2VkJEdPU2xpbV9iaW4gPT0gZmFjdG9yKGRhdGFfbWVyZ2VkJEdPU2xpbV9iaW4pWzFdKV0pIC8gc3FydCh0cmFuc3BvcnRfbikKCnRyYW5zcG9ydF9TRU0gPT0gR09fbWVhbnNbMTQsNF0KCnByb3RlaW5fbWV0X24gPC0gbGVuZ3RoKGRhdGFfbWVyZ2VkJEdPU2xpbV9iaW5bd2hpY2goZGF0YV9tZXJnZWQkR09TbGltX2JpbiA9PSBmYWN0b3IoZGF0YV9tZXJnZWQkR09TbGltX2JpbilbMl0pXSkKCnByb3RlaW5fbWV0X1NFTSA8LSBzZChkYXRhX21lcmdlZCRgQ3BHX28vZWBbd2hpY2goZGF0YV9tZXJnZWQkR09TbGltX2JpbiA9PSBmYWN0b3IoZGF0YV9tZXJnZWQkR09TbGltX2JpbilbMl0pXSkgLyBzcXJ0KHByb3RlaW5fbWV0X24pCgpwcm90ZWluX21ldF9TRU0gPT0gR09fbWVhbnNbMTAsNF0KYGBgCgoKR29vZCBkZWFsLCBsb29rcyBsaWtlIGl0IHdvcmtlZCEgT3IgSSBnb3QgcmVhbGx5IGx1Y2t5IHR3aWNlLgoKTm93IHRvIHBsYXkgd2l0aCB0aGUgZm9yZXN0cGxvdCBjb21tYW5kLiBJIGhhZCBzb21lIGlzc3VlcyB3aXRoIHRoZSBjb2x1bW4gZ2FwcGluZywgb3JpZ2luYWxseSBpdCB3YW50ZWQgdG8gcmlnaHQganVzdGlmeSB0aGUgZ3JhcGgsIGxlYWRpbmcgdG8gYSBsYXJnZSBnYXAgYmV0d2VlbiB0aGUgbGFiZWxzIGFuZCB0aGUgZ3JhcGguIEkgY2hlYXRlZCBhcm91bmQgdGhpcyBieSB1c2luZyB0aGUgY29sZ2FwIGFyZ3VtZW50IHdpdGggYSBuZWdhdGl2ZSBudW1iZXIuIFRoaXMgbGVkIHRvIHNvbWUgd2VpcmQgc3BhY2luZyBpc3N1ZXMgd2l0aCB0aGUgZ3JhcGggbGFiZWwuIEknbGwgbGlrZWx5IGxvb2sgZm9yIGEgYmV0dGVyIHRvb2wgZm9yIG1ha2luZyB0aGUgZm9yZXN0IHBsb3QsIGJ1dCB0aGlzIGlzIGEgZnVuY3Rpb25pbmcgZmlyc3QgcGFzcy4gCgpgYGB7cn0KZm9yZXN0cGxvdChsYWJlbHRleHQgPSBHT19tZWFucyRgR2VuZSBUeXBlc2AsIG1lYW4gPSBHT19tZWFucyRgQ3BHIE8vRSBNZWFuYCwgbG93ZXIgPSBHT19tZWFucyRMb3dlciwgdXBwZXIgPSBHT19tZWFucyRVcHBlciwgeHRpY2tzID0gYygwLjQ1LCAwLjUwLCAwLjU1LCAwLjYwLCAwLjY1LCAwLjcwLCAwLjc1KSwgYm94c2l6ZSA9IDAuMDgsIGdyaWQgPSBUUlVFLCBsd2QuY2kgPSA1LCBjaS52ZXJ0aWNpZXMgPSBUUlVFLCBsdHkuY2kgPSAxLCBjb2xnYXAgPSB1bml0KC0xNDAsICdtbScpLCB4bGFiID0gIiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgQ3BHIE8vRSIpCgpgYGAKCkFkZCBhIG5ldyBjaHVuayBieSBjbGlja2luZyB0aGUgKkluc2VydCBDaHVuayogYnV0dG9uIG9uIHRoZSB0b29sYmFyIG9yIGJ5IHByZXNzaW5nICpDbWQrT3B0aW9uK0kqLgo=