Purpose of this notebook is to develop some graphs of CpGo/e for the Geoduck transcriptome.
lets load up some data!
data <- read_delim("~/Documents/Roberts Lab/geoduck transcriptome/Geoduck-transcriptome_v3_bigtable_v08.tab.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
Parsed with column specification:
cols(
.default = col_character(),
length = col_integer(),
GC_content = col_double(),
`CpG_o/e` = col_double(),
male_unique_count = col_integer(),
female_unique_count = col_integer(),
Evalue = col_double(),
`evalue-Gigaton` = col_double(),
`evalue-Ruphibase` = col_double(),
`evalue-Sigenae` = col_double(),
Dheilly_Cluster = col_integer()
)
See spec(...) for full column specifications.
dim(data)
[1] 153982 23
head(data)
colnames(data)
[1] "ContigID" "length" "GC_content"
[4] "CpG_o/e" "male_unique_count" "female_unique_count"
[7] "sex-based_expresssion" "Blastx_UniProt_Acc" "EntryName"
[10] "ProteinName" "Organism" "Evalue"
[13] "Gene_Ontology_ID" "Gene_Ontology" "PFAM"
[16] "Blastn_Gigaton-ID" "evalue-Gigaton" "Blastn_Ruphibase-ID"
[19] "evalue-Ruphibase" "Blastn_Sigenae-ID" "evalue-Sigenae"
[22] "Dheilly_Cluster" "Dheilly_Tissue-enriched-expression"
dim(GOslim_bin)
[1] 19652 2
head(GOslim_bin)
Lets weld these two together, at least where they match.
data_merged <- merge(data, GOslim_bin, by.x = "ContigID", by.y = "Column1", all = FALSE)
GO_means[6] >- GO_means[3] + GO_means[4]
Error in `[.data.frame`(GO_means, 6) : undefined columns selected
Lets do a couple quick spot checks, to make sure my logic for calculating the SEM works (I have the right S and sqrt(N) lined up.)
protein_met_n <- length(data_merged$GOSlim_bin[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])])
protein_met_SEM <- sd(data_merged$`CpG_o/e`[which(data_merged$GOSlim_bin == factor(data_merged$GOSlim_bin)[2])]) / sqrt(protein_met_n)
protein_met_SEM == GO_means[10,4]
protein metabolism
TRUE
Good deal, looks like it worked! Or I got really lucky twice.
Now to play with the forestplot command. I had some issues with the column gapping, originally it wanted to right justify the graph, leading to a large gap between the labels and the graph. I cheated around this by using the colgap argument with a negative number. This led to some weird spacing issues with the graph label. I’ll likely look for a better tool for making the forest plot, but this is a functioning first pass.
forestplot(labeltext = GO_means$`Gene Types`, mean = GO_means$`CpG O/E Mean`, lower = GO_means$Lower, upper = GO_means$Upper, xticks = c(0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75), boxsize = 0.08, grid = TRUE, lwd.ci = 5, ci.verticies = TRUE, lty.ci = 1, colgap = unit(-140, 'mm'), xlab = " CpG O/E")
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.