Motivation

The ovarian cancer genome is characterized profound defects in DNA damage repair mechanisms, including almost universal loss of TP53 function. This results in extensive somatic mutation mutation and copy number variation. Copy number variation is associated with changes in gene expression, and lack of this association can be used as a multi-assay quality control measure that is not possible from a single assay. This case study demonstrates the identification of TCGA ovarian cancer Affymetrix microarray sample mix-ups through correlation to copy number, and confirms these mix-ups by comparison to Agilent microarray and level III RNA-seq data.

Methods

First, load TCGA ovarian cancer ExpressionSets from the biocMultiAssay package:

load(system.file("extdata/tcga_ov","ovc_eset_copynumber_gistic_2013_01_16.rda",package="biocMultiAssay"))
load(system.file("extdata/tcga_ov","TCGA_eset.rda",package="biocMultiAssay"))
load(system.file("extdata/tcga_ov","TCGA.agilent_eset.rda",package="biocMultiAssay"))

These ExpressionSets have HGNC symbols as featureNames, and some have featureData.

head(pData(featureData(eset.cn)))

##         Gene.Symbol Locus.ID Cytoband
## ACAP3         ACAP3   116983  1p36.33
## ACTRT2       ACTRT2   140625  1p36.32
## AGRN           AGRN   375790  1p36.33
## ANKRD65     ANKRD65   441869  1p36.33
## ATAD3A       ATAD3A    55210  1p36.33
## ATAD3B       ATAD3B    83858  1p36.33

head(pData(featureData(TCGA_eset)))

##           probeset   gene
## A1CF   220951_s_at   A1CF
## A2M      217757_at    A2M
## A4GALT   219488_at A4GALT
## A4GNT    221131_at  A4GNT
## AAAS     218075_at   AAAS
## AACS   218434_s_at   AACS

I invisibly define a function “corFinder()” to calculate pair-wise correlations between samples using the expr() slots of a list of two ExpressionSets. It subsets each ExpressionSet to the intersection of common features, performs ComBat batch correction using each experiment as a batch, then calculates column-wise Pearson correlation between all pairs of samples.

cn.cor <- corFinder(list(microarray=TCGA_eset, cn=eset.cn), use.ComBat=FALSE)
agilent.cor <- corFinder(list(microarray=TCGA_eset, agilent=TCGA.agilent_eset), use.ComBat=TRUE)

Just to give an idea of what these results look like, they are a matrix of sample pairwise Pearson correlations:

dim(cn.cor)

## [1] 545 545

cn.cor[1:4, 1:4]

##                         cn:TCGA.20.0987 cn:TCGA.23.1031 cn:TCGA.24.0979
## microarray:TCGA.20.0987         0.12756         0.01611         0.06557
## microarray:TCGA.23.1031         0.04538         0.10947         0.07109
## microarray:TCGA.24.0979         0.06831         0.02549         0.13948
## microarray:TCGA.23.1117         0.04730         0.02018         0.05235
##                         cn:TCGA.23.1117
## microarray:TCGA.20.0987         0.04851
## microarray:TCGA.23.1031         0.06111
## microarray:TCGA.24.0979         0.04765
## microarray:TCGA.23.1117         0.09011

Initial identification of sample mix-ups

This histogram shows the correlation between genome-wide copy number and expression for each sample. The lower mode is caused by mixed-up samples. plot of chunk unnamed-chunk-8

These samples are also suspect from low correlation between Affy - Agilent technical replicates:

plot of chunk unnamed-chunk-9

There are more systematic ways to determine a cutoff, but for now I just set it manually and count the number of likely sample mix-ups by Agilent and by Copy Number. First the number of suspect samples identified by CN that are also assayed by Agilent:

samples.intersect <- intersect(colnames(cn.cor), colnames(agilent.cor))
sum(diag(cn.cor[samples.intersect, samples.intersect]) < 0.066)

## [1] 45

And the number of samples identified as suspect by both Affy/CN and Affy/Agilent:

sum(diag(cn.cor[samples.intersect, samples.intersect]) < 0.066 & diag(agilent.cor[samples.intersect, samples.intersect]) < 0.87)

## [1] 38

Using methylation for QC

Methylation is not nearly so informative for QC in this case, but I left this analysis in place anyways. It still identifies three outliers, two of which overlap with the CN outliers. There is overall a strong negative correlation between expression and methylation (mean cor ~ -0.35)

load(system.file("extdata/tcga_ov","ovc_eset_methylation.rda",package="biocMultiAssay"))

Drop features with too many missing values, then impute:

eset.meth <- eset.meth[esApply(eset.meth, 1, function(x) sum(is.na(x))) < 300, ]
exprs(eset.meth) <- impute.knn(exprs(eset.meth))$data

Repeat the analysis with Affy expression and methylation:

meth.cor <- corFinder(list(microarray=TCGA_eset, meth=eset.meth), use.ComBat=FALSE)

colnames(meth.cor) <- sub(".+:", "", colnames(meth.cor))
rownames(meth.cor) <- sub(".+:", "", rownames(meth.cor))
hist(diag(meth.cor), breaks="FD")
abline(v=-0.2, col="red")
arrows(x0=-0.2, y0=20, x1=-0.1, y1=20, col="red", lw=2, length=0.1)
text(x=-0.15, y=c(40, 30), labels=c("flagged", "samples"), col="red")

plot of chunk unnamed-chunk-15

There are only three samples flagged:

sum(diag(meth.cor) > -0.2)

## [1] 3

And two of these were also identified by lack of correlation between CN and expression:

samples.intersect <- intersect(colnames(meth.cor), colnames(cn.cor))
summary(diag(meth.cor[samples.intersect, samples.intersect]) > -0.2 & diag(cn.cor[samples.intersect, samples.intersect]) < 0.066)

##    Mode   FALSE    TRUE    NA's 
## logical     542       2       0

Summary

Short-term objectives

create SummarizedExperiment objects for these TCGA ovarian cancer datasets
use biocMultiAssay objects to repeat this correlation analysis

Longer-term future directions

I would like to establish for which assay combinations in which cancer types this is possible for, using all of TCGA. It is likely to be cancer specific; for example, many prostate cancers have ``quiet’’ genomes with little copy number variation, and the procedure demonstrated in this vignette would not work. However correlation between other regulatory elements and downstream elements may be effective, for example: * promotor region methylation and gene expression * somatic mutation of particular genes and associated gene expression * microRNA expression and target gene expression * transcription factor and target gene expression

It is easy to simulate sample mix-ups simply by swapping sample labels, making testing feasible in datasets that do not contain actual mix-ups.