Author: Charleen D. Adams
I had made a correlation matrix of the RP eQTLs across 49 GTex tissues. However, it is likely that the correlations are confounded by a factor related to which donor donated which tissues.
To investigate this possibility and correct for the influence of latent factors related to donor contribution, I used CCA. CCA is a tool that is similar to multivariate multiple regression except it also considers dimensionality, operating on a principle similar to prinicpal components analysis.
SAMPID is the sample ID variable. It contains the donor ID, the tissue site, and aliquot ID. I separated these into variables to obtain donor IDs.
#### Separate SAMPID into subject, tissue, and aliquot
df <- data.frame(x = samples_v8$SAMPID)
samples_v83=df %>% separate(x, c("GTex","Donor_ID", "tissue_site_ID", "SM","aliquot_ID")) #name choice 'samples_v83' is arbitrary
#### cbind the separated id data with the samples_v8 data
samples_v8_final <- cbind(samples_v8, samples_v83)
donor_tissue=table(samples_v8_final$Donor_ID, samples_v8_final$SMTSD)
#write.csv(donor_tissue, 'donor_tissue.csv')
#### Determine how many tissue site each donor contributed to
donor_tissue2=read.csv("donor_tissue2.csv")
rownames(donor_tissue2)=donor_tissue2$Donor
#donor_tissue2$Donor <- NULL
cols=colnames(donor_tissue2[2:56])
donor_tissue2$new_col <- apply(donor_tissue2[,cols],1,sum)
donor_tissue2 <- donor_tissue2[order(donor_tissue2$new_col),]
summary(donor_tissue2$new_col)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 18.00 24.00 23.42 28.00 217.00
my_drop=c('Bladder', "Cells - Leukemia cell line (CML)", "Cervix - Ectocervix",
"Cervix - Endocervix", "Fallopian Tube", "Kidney - Medulla")
samples_v8_final_clean <- samples_v8_final[!which(samples_v8_final$SMTSD %in% my_drop),]
samples_v8_final_clean_tab=table(samples_v8_final_clean$SMTSD)
#write.csv(samples_v8_final_clean_tab, 'samples_v8_final_clean_tab.csv')
dim(newdata)
## [1] 979 51
# 979 individuals (the 51 variables includes two brain duplicates and the columns for "Donor" and "new_col2" (counts of samples across tissues per donor): there are 47 tissue sites suitable for the eQTL analysis)
979 individuals (excluding the leukemia cell line) contributed on average 23 samples across the tissues (some with more than one sample per tissue site)
## Dropping the leukemia cell line, which contributed 217 samples, & the tissues for which there are no eQTLs
summary(newdata$new_col2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 18.00 23.00 23.17 28.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 3.000 3.359 4.000 9.000
## Obtain tissue-by-donor ID counts
donor_tissue_clean=table(samples_v8_final_clean$SMTSD, samples_v8_final_clean$Donor_ID)
## Get the correlation matrix
donor_tissue_clean_mat<- cor(t(donor_tissue_clean))
Y=donor_tissue_clean_mat
#### Read in the RP eQTL data to tabulate tissue by RP gene
load('./V8_master_EAF.Rdata')
tissue_gene=table(merged_EAF$tissue,merged_EAF$gene_name )
#### Get the correlation matrix
tissue_gene_mat<- cor(t(tissue_gene))
X=tissue_gene_mat
#### Save the RP lists, latest versions of the dataset (merged_EAF) and the matrices
save(ao, read.url,myvars, myvars_40S,myvars_40S_60S,myvars_60S, myvars_all_mito,myvars_mito_39S, master_v8_n, RP_sig, stringent_RP_sig_beta, merged_EAF, tissue_gene, X, tissue_gene_mat, donor_tissue_clean_mat, Y, file="./V8_masterRP_lists_mergedEAF_matrices.Rdata")
\(cor(X,Y)\) is a \(2\times 2\) matrix, where \(X\) is the tissue-RP-gene matrix, \(Y\) is the tissue-by-donor matrix, and:
\[
cor(X,Y) = \begin{bmatrix}
\sum{X X} & \sum{X Y} \\
\sum{Y X} & \sum{Y Y}
\end{bmatrix}
\]
correl <- matcor(X, Y )
#str(correl)
img.matcor(correl, type = 2) #dev.off()
can_cor1=cc(X,Y)
The interpretation of the coefficients is similar to that of multiple regression. If I’m interpretting this correctly, they are the tissue-pair coefficients when holding all the other variables constant; i.e., for the \(X\) set, they are the relationship between tissue pairs when accounting for latent structure from donor sample imbalances and all other pairs.
#standardizing the first set of canonical coefficients(X)
std_coef1<-diag(sqrt(diag(cov(X))))
## Obtain the correlations for the corrected coefficients
cor_correct_mat<- cor(cor_correct)
cor_correct_round=round(cor_correct_mat, 1)
datatable(cor_correct_round, options = list(pageLength = 5))
## Thoughts - I do not know if this approach is correct (nor if I did it right), as I have never done it before. I aimed to correct the tissue-by-RP-gene correlation matrix for likely confounding due to imbalance in how certain donors contributed disproportionately.
Assuming I did it correctly, the new data display corrected correlations between the tissues across the RP genes. There doesn’t seem to be an obvious pattern between tissues, but maybe that makes sense because each new tissue-pair estimate takes into account all the pairs and the latent influence of the donors.
So, not sure we can use this analysis, but I wanted to share my thought process. Not sure if it would be worth consulting with a statistician to go over it. If so, who?
Also, the GTex documentation explains that some of the brain tissues (specifically, “Brain - Frontal Cortex” and “Brain- cortex”, and “Brain - Cerebellum” and “Brain - Cerebellar Hemisphere”) are duplicates. I’ve kept them in the data for now, but eventually, it would make sense to remove duplicates.
*Body made with
gganatogram
Link to the main tutorial I used to teach myself CCA.
Link for the code for intra- and inter-correlations of the matrices.
Another helpful link to read up on CCA.
Copy this link into your browser (Github refuses to link): https://github.com/charleendadams/ribosomal_protein_eQTLs