CCA to correct tissue-by-RP-gene matrix

Author: Charleen D. Adams

Abbreviations and definitions

Gene-Tissue Expression (GTex) Project – database of tissue-specific gene expression and SNPs associated with gene expression
Ribosomal protein (RP)
expression quantitative trait loci (eQTL) – SNPs associated with gene expression
significance threshold (Q-value for ‘eGene’ status <0.05 & nominal P<5x10-6) – analyses here use the set of GTex eQTLs meet this criteria
Canonical Correlation Analysis (CCA)

Overview

I had made a correlation matrix of the RP eQTLs across 49 GTex tissues. However, it is likely that the correlations are confounded by a factor related to which donor donated which tissues.

Correlation matrix of tissues by RP eQTLs across the RP genes

To investigate this possibility and correct for the influence of latent factors related to donor contribution, I used CCA. CCA is a tool that is similar to multivariate multiple regression except it also considers dimensionality, operating on a principle similar to prinicpal components analysis.

Read-in the GTex sample annotations

Get the donor IDs

SAMPID is the sample ID variable. It contains the donor ID, the tissue site, and aliquot ID. I separated these into variables to obtain donor IDs.

#### Separate SAMPID into subject, tissue, and aliquot
df <- data.frame(x = samples_v8$SAMPID)
samples_v83=df %>% separate(x, c("GTex","Donor_ID", "tissue_site_ID", "SM","aliquot_ID")) #name choice 'samples_v83' is arbitrary

#### cbind the separated id data with the samples_v8 data
samples_v8_final <- cbind(samples_v8, samples_v83)
donor_tissue=table(samples_v8_final$Donor_ID, samples_v8_final$SMTSD)
#write.csv(donor_tissue, 'donor_tissue.csv')

Range of samples across tissues

Some individuals contributed more than one sample per tissue site.
One obvious outlier “individual” contributed 217 samples. This is the leukemia cell line, which I will drop.

#### Determine how many tissue site each donor contributed to
donor_tissue2=read.csv("donor_tissue2.csv")
rownames(donor_tissue2)=donor_tissue2$Donor
#donor_tissue2$Donor <- NULL
cols=colnames(donor_tissue2[2:56])
donor_tissue2$new_col <- apply(donor_tissue2[,cols],1,sum)
donor_tissue2 <- donor_tissue2[order(donor_tissue2$new_col),]
summary(donor_tissue2$new_col)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   24.00   23.42   28.00  217.00

Remove tissues for which there are no eGene eQTLs

my_drop=c('Bladder', "Cells - Leukemia cell line (CML)", "Cervix - Ectocervix",
          "Cervix - Endocervix", "Fallopian Tube", "Kidney - Medulla")
samples_v8_final_clean <- samples_v8_final[!which(samples_v8_final$SMTSD %in% my_drop),]
samples_v8_final_clean_tab=table(samples_v8_final_clean$SMTSD)
#write.csv(samples_v8_final_clean_tab, 'samples_v8_final_clean_tab.csv')

Total number of individuals in the analysis

dim(newdata)

## [1] 979  51

# 979 individuals (the 51 variables includes two brain duplicates and the columns for "Donor" and "new_col2" (counts of samples across tissues per donor): there are 47 tissue sites suitable for the eQTL analysis)

Range of samples per donor

979 individuals (excluding the leukemia cell line) contributed on average 23 samples across the tissues (some with more than one sample per tissue site)

## Dropping the leukemia cell line, which contributed 217 samples, & the tissues for which there are no eQTLs
summary(newdata$new_col2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   23.00   23.17   28.00   72.00

Example: whole blood

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   3.000   3.359   4.000   9.000

Create a correlation matrix for tissue-by-Donor_ID

## Obtain tissue-by-donor ID counts
donor_tissue_clean=table(samples_v8_final_clean$SMTSD, samples_v8_final_clean$Donor_ID)

## Get the correlation matrix
donor_tissue_clean_mat<- cor(t(donor_tissue_clean))
Y=donor_tissue_clean_mat

Read in the correlation matrix for tissue by presence of RP eQTLs

#### Read in the RP eQTL data to tabulate tissue by RP gene 
load('./V8_master_EAF.Rdata')
tissue_gene=table(merged_EAF$tissue,merged_EAF$gene_name )

#### Get the correlation matrix
tissue_gene_mat<- cor(t(tissue_gene))
X=tissue_gene_mat

#### Save the RP lists, latest versions of the dataset (merged_EAF) and the matrices
save(ao, read.url,myvars, myvars_40S,myvars_40S_60S,myvars_60S, myvars_all_mito,myvars_mito_39S, master_v8_n, RP_sig, stringent_RP_sig_beta, merged_EAF, tissue_gene, X, tissue_gene_mat, donor_tissue_clean_mat, Y,  file="./V8_masterRP_lists_mergedEAF_matrices.Rdata")

Construct a matrix of the matrices

\(cor(X,Y)\) is a \(2\times 2\) matrix, where \(X\) is the tissue-RP-gene matrix, \(Y\) is the tissue-by-donor matrix, and:
\[ cor(X,Y) = \begin{bmatrix} \sum{X X} & \sum{X Y} \\ \sum{Y X} & \sum{Y Y} \end{bmatrix} \]

correl <- matcor(X, Y )
#str(correl)
img.matcor(correl, type = 2) #dev.off()

Canonical correlations

can_cor1=cc(X,Y)

Raw canonical coefficients

The interpretation of the coefficients is similar to that of multiple regression. If I’m interpretting this correctly, they are the tissue-pair coefficients when holding all the other variables constant; i.e., for the \(X\) set, they are the relationship between tissue pairs when accounting for latent structure from donor sample imbalances and all other pairs.

Standardized coefficients

#standardizing the first set of canonical coefficients(X)
 std_coef1<-diag(sqrt(diag(cov(X))))

Corrected correlations

## Obtain the correlations for the corrected coefficients
cor_correct_mat<- cor(cor_correct)
cor_correct_round=round(cor_correct_mat, 1)
datatable(cor_correct_round, options = list(pageLength = 5))

CCA-corrected heatmap

## Thoughts - I do not know if this approach is correct (nor if I did it right), as I have never done it before. I aimed to correct the tissue-by-RP-gene correlation matrix for likely confounding due to imbalance in how certain donors contributed disproportionately.

Assuming I did it correctly, the new data display corrected correlations between the tissues across the RP genes. There doesn’t seem to be an obvious pattern between tissues, but maybe that makes sense because each new tissue-pair estimate takes into account all the pairs and the latent influence of the donors.
So, not sure we can use this analysis, but I wanted to share my thought process. Not sure if it would be worth consulting with a statistician to go over it. If so, who?
Also, the GTex documentation explains that some of the brain tissues (specifically, “Brain - Frontal Cortex” and “Brain- cortex”, and “Brain - Cerebellum” and “Brain - Cerebellar Hemisphere”) are duplicates. I’ve kept them in the data for now, but eventually, it would make sense to remove duplicates.

47 tissues (excluding duplicates)

*Body made with gganatogram

Sources

Link to the main tutorial I used to teach myself CCA.

Link for the code for intra- and inter-correlations of the matrices.

Another helpful link to read up on CCA.

Github

Copy this link into your browser (Github refuses to link): https://github.com/charleendadams/ribosomal_protein_eQTLs

This markdown file is stored there.