Correlations of RP expression with expression of various nucleolar & selected cancer genes

Author: Charleen D. Adams

Abbreviations and definitions

Gene-Tissue Expression (GTex) Project
Ribosomal protein (RP)
Transcripts per million (TPM)

Methods: Part 1

I downloaded the transcripts per million (TPM) expression data from Gene-Tissue Expression (GTEx). After importing it into R and converting ensembl IDs to entrez ID symbols, I imported GTEx’s file that has the sample annotations. I used this to subset for samples denotating breast tissue. I log2-transformed the data and ran correlations between all RPs for which there was expression data with TP53, MYC, MKI67, ESR1, ESR2, ESRRA, ESRRB, ESRRG, and genes expressed in the nucleolus.

The list for the nucleolar genes came from the Human Protein Atlas: https://www.proteinatlas.org/humanproteome/tissue

I verified that the list of RPs from Human Protein Atlas were not pseudogenes by looking them up in Ensembl. (I’ve read there are over 2000 known RP pseudogenes…)

Long format*: correlations of RPs with nucleolar and select cancer genes

*Long format was used to select the correlations by p-value

setwd("/n/home04/cdadams/TPM/")
#Long format

cors=read.csv('cor.gather_RP_not_6_28.csv')
cors$gene=cors$var1
cors$RP=cors$var2
cors$X <- NULL
cors$var1 <- NULL
cors$var2 <- NULL
myvars=c('gene', 'RP','cor','p')
cors=cors[myvars]

DT::datatable(cors, class = 'cell-border stripe', rownames = F, filter = 'top',
  editable = TRUE, extensions = 'Buttons', options = list (
  dom = 'Bfrtip', 
  buttons = c('copy', 'csv')
  ))

Methods: Part 2

I took the absolute value of the correlations of the RPs with the nucleolar and select cancer genes I’d correlated (using the normalized GTEx TPM data in breast tissue). Then I created ~1325 (excluding RPs) new binary variables, dichotomizing them into >=.50 and <0.50 correlated. With the Bonferroni-signficant set, I wrote a script to perform a loop of fisher’s exact tests (with Monte Carlo-simulated p-values, using 10,000 replicates) on the dichotomized correlations to determine whether the strength of the correlations were different depending on whether the RPs are mitochondrial or cytosolic.

I then adjusted for multiple testing with the FDR method and obtained the list of genes that were different than expected between mitochondrial and cytosolic RPs by strength of their correlation with the RPs.

Wide format*: correlations (with the non-Bonferroni correlations removed)

For the set of Bonferroni-signficant correlations, I took the absolute value of the correlations between the RPs and the other genes and dichotomized them at >=0.50. After removing nucleolar genes didn’t vary (only one level, i.e., all cells for a nucleolar gene having correlations >=0.50 or all below), I ran fisher’s exact tests on the dichotomized strength of correlation variables and type of ribosome (mitochondrial vs cytosolic).

The pre-aboslute value and pre-dichotomized data are below. Empty cells indicate that the correlations were not Bonferroni-signficiant, thus they were treated as NA in the analysis.

*Wide format was used to run the 2x2 tests

setwd("/n/home04/cdadams/TPM/")
#Bonferroni-corrected correlation list
#Threshold set to P<2.515723e-07 (0.05/150*1325)

data_wide=read.csv('wide_final_6-28.csv')
data_wide$X=NULL

DT::datatable(data_wide, class = 'cell-border stripe', rownames = F, filter = 'top',
  editable = TRUE, extensions = 'Buttons', options = list (
  dom = 'Bfrtip', 
  buttons = c('copy', 'csv')
  ))

2x2 results of Bonferroni-significant correlations: Meaning? Interpretation?

These results show that absolute strength of the correlations between RPs and nucleolar genes and ribosomal location are different than expected for some nucleolar genes. Maybe this lends credence to the hypothesis about “specialized” ribosomes. But why would the strength of the correlation between RPs and nucleolar genes depend on type of ribosome?

fishers=read.csv('fishers_6-28.csv')
fishers$X=NULL
myvars_fishers=c('gene', 'Pval', 'fdr')
fishers=fishers[myvars_fishers]

#### RMD table
DT::datatable(fishers, class = 'cell-border stripe', 
  rownames = F, filter = 'top',
  editable = TRUE, extensions = 'Buttons', options = list (
  dom = 'Bfrtip', 
  buttons = c('copy', 'csv')
  ))

Methods: Part 3

I had previously performed this analysis without restricting to the Bonferroni-signficant correlations. That is, I dichotomized all the correlations, not just the ones that had very small p-values. I then looked up the list of 2x2 results in the Geneset Enrichment Analysis (GSEA) browser and observed that the set of genes that were more different than expected between mitoribosomes and cytosolic ribosomes by strength of correlation with RPs were enriched for those that overlap with BRCA1-PUJANA pathway. Using the more-stringent selection criteria, a set of the nucleolar genes are still enriched for overlap with the BRCA1-PUJANA pathway.

#Copy the list of FDR-genes into GSEA and see which pathways 
#are overrepresented: weblink, https://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp
BRCA_bon=read.csv('fishers3_fdr_6-28_PUJANA.csv')
BRCA_bon_pathway_set=Reduce(intersect, list(BRCA_bon$gene, BRCA_bon$PUJANA))
BRCA_bon_pathway_set

##  [1] "POP5"     "HDDC2"    "LSM3"     "RPP40"    "IMP4"     "C1QBP"   
##  [7] "EWSR1"    "BMS1"     "LARP4B"   "ZNF330"   "NUP188"   "EBNA1BP2"
## [13] "ANP32B"   "ATP13A3"  "POP4"     "MDM2"     "PIK3CB"   "PTBP1"   
## [19] "PDHA1"    "PAFAH1B2" "TAOK2"    "SNX15"    "NAA10"    "KIF2A"   
## [25] "HNRNPM"   "ACAT2"    "DDX19A"

I’ve done preliminary analyses stratified by whether the correlations with the RPs are positive or inverse. I plan to do this analysis for prostate, lung, and ovary soon.

TL;DR: Code

#### Transcripts per million (TPM) GTEx data
# Word searches on 'nucleolar' and 'nucleolus' were done in the Protein Atlas: # https://www.proteinatlas.org/humanproteome/tissue

# Analytical set includes proteins associated with the nucleolus, as well as the ESR1, ESR2, ESRRA, ESRRB, ESRRG, MYC, and TP53

setwd("/n/home04/cdadams/TPM/")
(.packages())

#### Packages used in this script but also for the entire GTEx RP project
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(broom)
library(rstatix)
library(pastecs)
library(statmod)
library(stringr)
library(PerformanceAnalytics)
library(taRifx)
library(Hmisc)
library(EnsDb.Hsapiens.v86)
library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
library(hyprcoloc)
library(e1071)
library(rcompanion)
library(car)
library(sqldf)
library(kableExtra)
library(gplots)
library(graphics)
library(gridExtra)
library(GGally)
library(CCA)
library(lme4)
library(CCP)
library(qqman)
library(dplyr)
library(tidyr)
library(EnsDb.Hsapiens.v79) 
library(biomaRt)
library(SNPlocs.Hsapiens.dbSNP.20120608)
library(data.table)
library(DataCombine)
library(Cairo)
library(devtools)
library(RColorBrewer)
library(MRInstruments) 
library(TwoSampleMR)
library(RadialMR)
library(tidyverse)
library(packcircles)
library(ggplot2)
library(viridis)
library(igraph)
library(CMplot)
library(oncofunco)
library(metaviz)
library(vcd)
library(corrplot)
library(CePa)
library(Biobase)
library(caret)
library(GO.db)
library(edgeR)
library(limma)
library(Glimma)
library(biomaRt)
library(annotate)
library(genefu)
library(xtable)
library(rmeta)
library(Biobase)
library(org.Hs.eg.db)

#### Read-in the TPM data
tpm=read.gct('GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm3gct')
rna=as.data.frame(tpm)
rna$GENEID=rownames(rna)

#### Read-in the TPM data (previously formatted and saved as a .csv)
#rna=read.csv('rna.csv') #deleted must re-transfer to cluster

#### Remove the version numbers at the end of the ensembl names
rna$GENEID <- gsub('\\..+$', '', rna$GENEID)

#### Fetch the gene names: convert from ensembl.gene to gene.symbol
ensembl.genes=as.character(rna$GENEID)
geneIDs1 <- ensembldb::select(EnsDb.Hsapiens.v79, keys= ensembl.genes, keytype = "GENEID", columns = c("SYMBOL","GENEID"));

#### Merge the gene names 
rna <- merge(x = geneIDs1, 
             y = rna, 
             by.x="GENEID",
             by.y="GENEID")

which(rna$GENEID=='ENSG00000273716')
which(rna$GENEID=='ENSG00000279483')

#### Remove the two RPL41 pseudogenes: "ENSG00000273716" "ENSG00000279483"
myData <- rna[-c(51625, 54471), ]

#### Read-in the set of nucleolar genes
nuc=read.csv('nucelolar.csv')
nuc2=nuc %>% distinct(nuc, .keep_all = TRUE)

#### Read-in the set of RPs from the Protein Atlas
expand=read.csv('expanded_verified.csv')#verified
dim(expand)
colnames(nuc2)[1]="all_RP"
vars=rbind(nuc2, expand)
vars2=vars %>% distinct(all_RP, .keep_all = TRUE)

#### Add the select cancer genes
lemos_vars=0
lemos_vars$all_RP=c("MYC", "TP53", "MKI67", "ESR1", "ESR2","ESRRA", "ESRRB", "ESRRG" )
lemos_vars=as.data.frame(lemos_vars)
lemos_vars$X0=NULL

#### Bind gene list together
complete_vars=rbind(lemos_vars, vars2)

complete_vars <- complete_vars[order(complete_vars$all_RP),]
complete_vars=as.data.frame(complete_vars)

complete_vars %>% n_distinct(complete_vars, .keep_all=TRUE)
complete_vars$complete_vars=as.character(complete_vars$complete_vars)

#### Select just the genes in the gene list: complete_vars
rps=myData[which(myData$SYMBOL %in% complete_vars$complete_vars),]

#rm(list=setdiff(ls(), c("rna", "rps", "complete_vars", "myData")))
rps2=rps
length(rps$SYMBOL)

rps3=rps2 %>% distinct(SYMBOL, .keep_all = TRUE)
rownames(rps3)=rps3$SYMBOL
dim(rps3)
head(colnames(rps3))
#write.csv(rps3, 'rps3_6-24.csv')

#### Remve GENEID, SYMBOL, and X from the colnames
rps4=rps3[,-1] 
rps4=rps4[,-1] 
rps4=rps4[,-1]

#dat <- japply( rps5, which(sapply(rps5, class)=="factor"), as.numeric )

#### Get just the breast cancer data 
read.url <- function(url, ...){
  tmpFile <- tempfile()
  download.file(url, destfile = tmpFile, method = "curl")
  url.data <- fread(tmpFile, ...)
  return(url.data)
}

samples_v8=read.url("https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt")
#write.csv(samples_v8, 'samples_v8.csv')

breast=samples_v8[which(samples_v8$SMTSD=="Breast - Mammary Tissue"),]
breast$SAMPID <- gsub("\\-", ".", breast$SAMPID)

rps5=t(rps4)
rps6=as.data.frame(rps5)

breast_rp=rps6[which(rownames(rps6) %in% breast$SAMPID),]
breast_rp6=t(breast_rp)

#write.csv(breast_rp6, "breast_rp_6-24.csv")

# Which values in TPM are greater than 0.5?
thresh <- breast_rp6 > 0.5

# This produces a logical matrix with TRUEs and FALSEs
head(thresh)

# Summary of how many TRUEs there are in each row
table(rowSums(thresh))

# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2

# Subset the rows of rna2 to keep the more highly expressed genes
counts.keep <- breast_rp6[keep,]

dgeObj <- DGEList(counts.keep)

# have a look at dgeObj
dgeObj

# See what slots are stored in dgeObj
names(dgeObj)

# Library size information is stored in the samples slot
dgeObj$samples

# Get log2 counts per million
logcounts <- cpm(dgeObj,log=TRUE)

#####################################################################
#### Correlation matrix for RMD table
#####################################################################
logs2=t(logcounts)
logs2=as.data.frame(logs2)
logs2=logs2[,order(colnames(logs2))]
dim(logs2)
write.csv(logs2, 'logs2_6_28.csv')

expand=read.csv('expanded_verified.csv')#verified
dim(expand)
logs2=read.csv('logs2_6_28.csv')
str(logs2)
dim(logs2)
head(colnames(logs2))
head(logs2$X)
head(rownames(logs2))
rownames(logs2) <- logs2$X
logs2$X <- NULL
dim(logs2)

res3 <- rcorr(as.matrix(logs2))
str(res3)

#Get the pvals
cor.gather <- res3 %>% cor_gather()
dim(cor.gather)
colnames(cor.gather)
rownames(cor.gather)=cor.gather$var2
#write.csv(cor.gather, "cor.gather_6-28.csv")

#Get just the RPs in var2
#cor.gather=as.data.frame(cor.gather)
cor.gather_RP=cor.gather[which(cor.gather$var2 %in% expand$all_RP),]
dim(cor.gather_RP)
str(cor.gather_RP)
#cor.gather1<-cor.gather[grepl("RPL", cor.gather$var2),]

#Keep only the non-RP nucleolar & cancer genes in var1
'%ni%' <- Negate('%in%')
cor.gather_RP_not=cor.gather_RP[which(cor.gather_RP$var1 %ni% expand$all_RP),]
dim(cor.gather_RP_not)

cor.gather_RP_not_uniq_var1 <- unique(cor.gather_RP_not$var1)
length(cor.gather_RP_not_uniq_var1)

cor.gather_RP_not_uniq_var2 <- unique(cor.gather_RP_not$var2)
length(cor.gather_RP_not_uniq_var2)

bf_thresh=0.05/(1325*150)
#write.csv(cor.gather_RP_not, 'cor.gather_RP_not_6_28.csv')

head(cor.gather_RP_not[1:4], n=50)
cor.gather_RP_not=cor.gather_RP_not[order(-cor.gather_RP_not$cor),]
dim(cor.gather_RP_not)

bf <- cor.gather_RP_not[ which(cor.gather_RP_not$p<2.515723e-07),]
str(bf)
dim(bf)

#write.csv(bf, 'bf_6-28.csv')

#Remove the pvalues
bf$p <- NULL

#Rename the variables
colnames(bf)[1] <- "gene"
colnames(bf)[2] <- "RP"

wide=xtabs(cor ~ RP + gene, data = bf)
colnames(wide2)
rownames(wide2)
str(wide2)
wide2=as.matrix(wide)
wide_final=as.data.frame(wide2)

write.csv(wide_final, 'wide_final.csv')

#wide3=read.csv('wide2_matrix_form.csv')
#head(colnames(wide3))
#dim(wide3)
#str(wide3)

#Change the 0's to NA
is.na(wide_final) <- !wide_final
head(wide_final[3:3])

#Get the absolute values
dt_abs <- lapply(wide_final, abs)

dt_abs2=as.data.frame(dt_abs)
str(dt_abs2)

rownames(dt_abs2)=rownames(wide_final)

#Transform to a tibble to create the dichotomized variables
mydata=as_tibble(dt_abs2)
mydata2 = mutate_all(mydata, funs("dich"= ifelse(.>=.50, '>=0.50 (high)', '<0.50 (low)')))
mydata2=as.data.frame(mydata2)
rownames(mydata2)=rownames(dt_abs2)
colnames(mydata2)
#write.csv(mydata2, 'mydata2_strength_6_28.csv')

#Create the cell location variable
mydata2$location=0
mydata2$location=ifelse(startsWith(rownames(mydata2), 'RP'), 'cyto', 'mito')
dim(mydata2)
colnames(mydata2)[2597]
colnames(mydata2)[1298]
colnames(mydata2)[1299]

#Select just the dichotomized variables and the location variable
mydata3= mydata2[c(1299:2597)]

# Convert all the variables in factors for the fisher's exact tests
for(i in 1:ncol(mydata3)){
  mydata3[,i] <- as.factor(mydata3[,i])
}

# Remove data with fewer than two levels
mydata4=mydata3[, sapply(mydata3, nlevels) > 1]
dim(mydata4)
head(mydata4[,2])
tail(colnames(mydata4))
rownames(mydata4)
dim(mydata4)
mydata4$location=0
mydata4$location=ifelse(startsWith(rownames(mydata2), 'RP'), 'cyto', 'mito')

colnames(mydata4)[1082]
str(mydata4)
write.csv(mydata4, 'mydata4.csv')

fishers=data.frame(lapply(mydata4[,-1082], 
  function(x) fisher.test(table(x,mydata4$location), 
  simulate.p.value = TRUE)$p.value), B = 10000)

#Transpose the data to get the simulated Monte Carlo p-values in a column
fishers2=t(fishers)
fishers2=as.data.frame(fishers2)
colnames(fishers2)[1]='Pval'

fishers2$gene=rownames(fishers2)
myvars=rownames(dt_abs2)
fishers2$gene=str_remove_all(fishers2$gene, "[_dich]")
dim(fishers2)

#To get the proper FDR adjusted list, remove the RPs
fishers3=fishers2[which(!fishers2$gene %in% myvars),] 
dim(fishers3)

fishers3 <- fishers3[order(fishers3$Pval),]
fishers3$fdr=p.adjust(fishers3$Pval, 'fdr')
fishers3_fdr=fishers3[which(fishers3$fdr<0.05),]
dim(fishers3_fdr)
fishers3_fdr

#write.csv(fishers3, 'fishers_6-28.csv')
#write.csv(fishers3_fdr, 'fishers3_fdr_6-28.csv')

Venn diagram of the overlap between the BRCA1-PUJANA sets for all correlations, positive, and inversely associated sets*

BRCA_bon=read.csv('fishers3_fdr_6-28_PUJANA.csv')
BRCA_bon_pathway_set=Reduce(intersect, list(BRCA_bon$gene, BRCA_bon$PUJANA))

BRCA_bon_pos=read.csv('fishers3_bf_pos_fdr_6-28_PUJANA.csv')
BRCA_bon_pos_pathway_set=Reduce(intersect, list(BRCA_bon_pos$gene, BRCA_bon_pos$PUJANA))

BRCA_bon_neg=read.csv('fishers3_bf_neg_fdr_6-28_PUJANA.csv')
BRCA_bon_neg_pathway_set=Reduce(intersect, list(BRCA_bon_neg$gene, BRCA_bon_neg$PUJANA))

geneList <- list(ALL=BRCA_bon_pathway_set, Positive=BRCA_bon_pos_pathway_set, Inverse=BRCA_bon_neg_pathway_set)

Venn_PUJANA <- Venn(geneList)
plot(Venn_PUJANA,doWeights = FALSE )

Correlations of RP expression with expression of various nucleolar & selected cancer genes – in breast tissue in GTEx

2020-June-29