The ovarian cancer genome is characterized profound defects in DNA damage repair mechanisms, including almost universal loss of TP53 function. This results in extensive somatic mutation mutation and copy number variation. Copy number variation is associated with changes in gene expression, and lack of this association can be used as a multi-assay quality control measure that is not possible from a single assay. This case study demonstrates the identification of TCGA ovarian cancer Affymetrix microarray sample mix-ups through correlation to copy number, and confirms these mix-ups by comparison to Agilent microarray and level III RNA-seq data.
First, load TCGA ovarian cancer ExpressionSets from the biocMultiAssay package:
load(system.file("extdata/tcga_ov","ovc_eset_copynumber_gistic_2013_01_16.rda",package="biocMultiAssay"))
load(system.file("extdata/tcga_ov","TCGA_eset.rda",package="biocMultiAssay"))
load(system.file("extdata/tcga_ov","TCGA.agilent_eset.rda",package="biocMultiAssay"))
These ExpressionSets have HGNC symbols as featureNames, and some have featureData.
head(pData(featureData(eset.cn)))
## Gene.Symbol Locus.ID Cytoband
## ACAP3 ACAP3 116983 1p36.33
## ACTRT2 ACTRT2 140625 1p36.32
## AGRN AGRN 375790 1p36.33
## ANKRD65 ANKRD65 441869 1p36.33
## ATAD3A ATAD3A 55210 1p36.33
## ATAD3B ATAD3B 83858 1p36.33
head(pData(featureData(TCGA_eset)))
## probeset gene
## A1CF 220951_s_at A1CF
## A2M 217757_at A2M
## A4GALT 219488_at A4GALT
## A4GNT 221131_at A4GNT
## AAAS 218075_at AAAS
## AACS 218434_s_at AACS
I invisibly define a function “corFinder()” to calculate pair-wise correlations between samples using the expr() slots of a list of two ExpressionSets. It subsets each ExpressionSet to the intersection of common features, performs ComBat batch correction using each experiment as a batch, then calculates column-wise Pearson correlation between all pairs of samples.
cn.cor <- corFinder(list(microarray=TCGA_eset, cn=eset.cn), use.ComBat=FALSE)
agilent.cor <- corFinder(list(microarray=TCGA_eset, agilent=TCGA.agilent_eset), use.ComBat=TRUE)
Just to give an idea of what these results look like, they are a matrix of sample pairwise Pearson correlations:
dim(cn.cor)
## [1] 545 545
cn.cor[1:4, 1:4]
## cn:TCGA.20.0987 cn:TCGA.23.1031 cn:TCGA.24.0979
## microarray:TCGA.20.0987 0.12756 0.01611 0.06557
## microarray:TCGA.23.1031 0.04538 0.10947 0.07109
## microarray:TCGA.24.0979 0.06831 0.02549 0.13948
## microarray:TCGA.23.1117 0.04730 0.02018 0.05235
## cn:TCGA.23.1117
## microarray:TCGA.20.0987 0.04851
## microarray:TCGA.23.1031 0.06111
## microarray:TCGA.24.0979 0.04765
## microarray:TCGA.23.1117 0.09011
This histogram shows the correlation between genome-wide copy number and expression for each sample. The lower mode is caused by mixed-up samples.
These samples are also suspect from low correlation between Affy - Agilent technical replicates:
There are more systematic ways to determine a cutoff, but for now I just set it manually and count the number of likely sample mix-ups by Agilent and by Copy Number. First the number of suspect samples identified by CN that are also assayed by Agilent:
samples.intersect <- intersect(colnames(cn.cor), colnames(agilent.cor))
sum(diag(cn.cor[samples.intersect, samples.intersect]) < 0.066)
## [1] 45
And the number of samples identified as suspect by both Affy/CN and Affy/Agilent:
sum(diag(cn.cor[samples.intersect, samples.intersect]) < 0.066 & diag(agilent.cor[samples.intersect, samples.intersect]) < 0.87)
## [1] 38
Methylation is not nearly so informative for QC in this case, but I left this analysis in place anyways. It still identifies three outliers, two of which overlap with the CN outliers. There is overall a strong negative correlation between expression and methylation (mean cor ~ -0.35)
load(system.file("extdata/tcga_ov","ovc_eset_methylation.rda",package="biocMultiAssay"))
Drop features with too many missing values, then impute:
eset.meth <- eset.meth[esApply(eset.meth, 1, function(x) sum(is.na(x))) < 300, ]
exprs(eset.meth) <- impute.knn(exprs(eset.meth))$data
Repeat the analysis with Affy expression and methylation:
meth.cor <- corFinder(list(microarray=TCGA_eset, meth=eset.meth), use.ComBat=FALSE)
colnames(meth.cor) <- sub(".+:", "", colnames(meth.cor))
rownames(meth.cor) <- sub(".+:", "", rownames(meth.cor))
hist(diag(meth.cor), breaks="FD")
abline(v=-0.2, col="red")
arrows(x0=-0.2, y0=20, x1=-0.1, y1=20, col="red", lw=2, length=0.1)
text(x=-0.15, y=c(40, 30), labels=c("flagged", "samples"), col="red")
There are only three samples flagged:
sum(diag(meth.cor) > -0.2)
## [1] 3
And two of these were also identified by lack of correlation between CN and expression:
samples.intersect <- intersect(colnames(meth.cor), colnames(cn.cor))
summary(diag(meth.cor[samples.intersect, samples.intersect]) > -0.2 & diag(cn.cor[samples.intersect, samples.intersect]) < 0.066)
## Mode FALSE TRUE NA's
## logical 542 2 0
I would like to establish for which assay combinations in which cancer types this is possible for, using all of TCGA. It is likely to be cancer specific; for example, many prostate cancers have ``quiet’’ genomes with little copy number variation, and the procedure demonstrated in this vignette would not work. However correlation between other regulatory elements and downstream elements may be effective, for example: * promotor region methylation and gene expression * somatic mutation of particular genes and associated gene expression * microRNA expression and target gene expression * transcription factor and target gene expression
It is easy to simulate sample mix-ups simply by swapping sample labels, making testing feasible in datasets that do not contain actual mix-ups.