Re-analysis of the microarray data published by Widmer et al. (2012) “Systematic classification of melanoma cells by phenotype-specific gene expression mapping”

Rosa Martinez, Gerard Serra, Pau Puigdevall and Lidia Mateo

Introduction

Metastasis has been deeply studied both at cellular and molecular level. Recent evidence (Widmer et al. 2012) suggests that the metastasic spread of melanoma is not driven not a regular increase in tumorigenic agressiveness but rather by the switch between two different cellular phenotypes, one with strong proliferation/weak invasiveness (proliferative phenotype from now on) and another with weak proliferation/strong invasiveness (invasive phenotype).

In this study we re-analyse a subset of the expression data published by Widmer et al. (2012) for both phenotypic conditions in order to determine the key genes involved in the switch between proliferative and invasive states.

In the first steps of the analysis (presented below) we determine which genes are differentially expressed in both phenotypes and later on, a functional analysis will be performed so as to see their influence on metastasis.

We start by loading the required packages and the raw data.

## Load packages
library(affy)
library(affyPLM)
library(limma)
library(genefilter)
library(sva)
library(hgu133plus2.db)
library(corpcor)

## Load data
rawdir = "./data/GSE33728_RAW"
rdata <- ReadAffy(celfile.path = rawdir)
sampleNames(rdata) <- letters[1:15]
rdata

## AffyBatch object
## size of arrays=1164x1164 features (21 kb)
## cdf=HG-U133_Plus_2 (54675 affyids)
## number of samples=15
## number of genes=54675
## annotation=hgu133plus2
## notes=

The chip that was used to generate this data was an Affymetrix hgu133plus2 microarray chip with 1164x1164=1354896 probes and a total of 54675 probesets (there are more than one probeset per gene). We can see that there are 15 samples in the dataset.

Quality assessment

In order to detect possible technical artifacts we performed a quality assessment of the raw data to see if any of the samples should be excluded from further analysis. We first examined by visual inspection the raw intensities of the scanned images. Next, we fitted the intensity values in a linear model (probe level model (PLM)) and obtained the weights and residuals. For each sample, we show the raw intensities (left), residuals (middle) and weights (right). We expect them to be homogeneously distributed.

Pset <- fitPLM(rdata)

par(mfrow = c(15, 3), mar = c(1, 1, 3, 1))
for (i in 1:15) {
    image(rdata[, i])
    image(Pset, type = "resids", which = i)
    image(Pset, type = "weights", which = i)
}
par(oma = c(0, 0, 1, 0))
title(main = "RAW INTENSITIES", outer = T, adj = 0.12, cex.main = 1.5)
title(main = "RESIDUALS", outer = T, adj = 0.5, cex.main = 1.5)
title(main = "WEIGHTS", outer = T, adj = 0.86, cex.main = 1.5)

We can see that in general the images are quite similar and the weights and residuals are homogeneously distributed. However, we can observe that samples a, f, m, n and o show overall higer intensity. Furthermore, in sample o we can detect a small high density spot. Next, we checked the median and interquartile range over all probesets in the chip. We plot the distribution of the raw intensity values in logarithmic scale.

boxplot(rdata, xlab = "Samples", ylab = "Raw log-intensity")

plot of chunk boxPlot_densityPlot

The boxplots show some heterogeneity between samples. Moreover, a deeper inspection looking at the raw log-intensity distributions also reveals some heterogeneity in the raw intesities between different samples (see comparison before and after normalization of density distibution plots below).

Below we plot the Normalized Unscaled Standard Errors (NUSE) of expression estimates, and the Relative Log Expression (RLE).

par(mfrow = c(1, 2), mar = c(3, 3, 3, 3))
NUSE(Pset, xlab = "Samples", ylab = "NUSE values", main = "NUSE")
RLE(Pset, xlab = "Samples", ylab = "RLE values", main = "RLE")

plot of chunk unnamed-chunk-1

As expected, NUSE values are centered around 1. In more detail, we can see that the normalized standard errors are higher in samples a, f, m, n and o, which is consitent with our observations in the previous section. RLE values are centered around 0. This is expected if the number of upregulated and downregulated genes is similar. In our case, we cannot infer, a priori, if upregulated or downregulated genes predominate in the comparison of our two melanoma phenotypes.

nuseDiag <- NUSE(Pset, type = "stats")
rleDiag <- RLE(Pset, type = "stats")

From a quantiative point of view, we can see that none of the samples deviates significantly from the expected values (1 for the NUSE values and 0 for the RLE values), as seen in the Boolean output (FALSE) shown below:

nuseDiag["median", ] > 1.1 & abs(rleDiag["median", ]) > 0.1

##     a     b     c     d     e     f     g     h     i     j     k     l 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##     m     n     o 
## FALSE FALSE FALSE

In order to detect possible intensity-dependent biases in the samples, we look at the MA plots:

par(mfrow = c(5, 3), mar = c(2, 2, 2, 2))
MAplot(rdata, plot.method = "smoothScatter", cex = 0.75)

plot of chunk MA_plot

In the MA plots we can see that the M values are approximately centered around 0 in most of the samples. Samples a and f show a non-linear shape and samples m, n and o are slightly below 0. Some of the other samples also show some deviations. According to the quality assessment results, we conclude that the quality of all samples is acceptable. We could detect that samples a, f, m, n and o were consistently slightly different from the rest. We will try to diminish the heterogeneity between samples in the normalization process.

Normalization and filtering

In the normalization process we tried to correct the heterogeneity observed in the previous analysis. We normalized the data using the robust multi-array average algorithm (rma), which performs background correction, between-array normalization and summarization. Furthermore, we also applied a non-specific filering (nsFilter) to remove the control probes, redundant probes, non-annotated probes and probes that show extremely low variablilty across samples. We can see the impact that these steps have on the probesets intensity distributions, in the intensity boxplots, and density distribution plots.

rmadata <- rma(rdata)

## Background correcting
## Normalizing
## Calculating Expression

rmadata_filtered_ns <- nsFilter(rmadata, require.entrez = TRUE, remove.dupEntrez = TRUE, 
    var.cutoff = 0.1)
rmadata_filtered_ns$filter.log

## $numDupsRemoved
## [1] 21348
## 
## $numLowVar
## [1] 2011
## 
## $numRemoved.ENTREZID
## [1] 13207
## 
## $feature.exclude
## [1] 11

par(mfrow = c(1, 3), mar = c(4, 4, 4, 4))
boxplot(rdata, xlab = "Samples", ylab = "log-intensity", main = "RAW DATA")
boxplot(rmadata, xlab = "Samples", ylab = "log-intensity", main = "NORMALIZED DATA")
boxplot(exprs(rmadata_filtered_ns$eset), xlab = "Samples", ylab = "log-intensity", 
    main = "FILTERED DATA")

plot of chunk boxPlot_densityPlot_normalized

We can see that after normalization the raw intensities are much more similar between the samples. As for filtered data, it can be seen that many outliers appear. This is not a problem since the effect is a consequence of the non-specific filtering step.

par(mfrow = c(1, 3), oma = c(5, 1, 1, 1), mar = c(4, 4, 4, 4))

plotDensity.AffyBatch(rdata, col = 1:15, lty = 1:15, xlab = "log-intensity", 
    ylab = "density", main = "RAW DATA")
plotDensity(exprs(rmadata), col = 1:15, lty = 1:15, xlab = "log-intensity", 
    ylab = "density", main = "NORMALIZED DATA")
plotDensity(exprs(rmadata_filtered_ns$eset), col = 1:15, lty = 1:15, xlab = "log-intensity", 
    ylab = "density", main = "FILTERED DATA")
par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)
plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")
legend("bottom", letters[1:15], col = 1:15, lty = 1:15, horiz = TRUE, , xpd = TRUE, 
    inset = c(0, 0))

plot of chunk filtering

Before normalizing the samples that show deviations in the other analysis, it can be seen that the intensity density distribution also remarks these differences. After normalizing all samples, the same distribution is shown but it is skewed and slightly bimodal. After filtering the distribution is more bell-shaped and the differences between samples have been corrected, although in one case there is still some bimodality.

We also plotted again the MA plots:

par(mfrow = c(15, 3), mar = c(2, 2, 3, 2))
for (i in 1:15) {
    MAplot(rdata, plot.method = "smoothScatter", cex = 0.75, which = i)
    MAplot(rmadata, plot.method = "smoothScatter", cex = 0.75, which = i)
    MAplot(rmadata_filtered_ns$eset, plot.method = "smoothScatter", cex = 0.75, 
        which = i)
}
par(oma = c(0, 0, 1, 0))
title(main = "RAW DATA", outer = T, adj = 0.14, cex.main = 1.5)
title(main = "NORMALIZED", outer = T, adj = 0.5, cex.main = 1.5)
title(main = "FILTERED", outer = T, adj = 0.85, cex.main = 1.5)

plot of chunk MAPlot_normalized

For each sample we show that the normalization step has improved the MA plots, since now all the M values are centered around 0. However, the filtering step has increased the dispersion, as can be seen in the increased inter-quartile range (IQR). This could be explained by the fact that in the filtering step we are removing half of the probesets targeting the same gene, control probesets and probesets with small variablity.

Identification of Batch effect

We used the scan date as a proxy to detect possible differences in sample processing (batch effect). If batch effect is not strong, we expect that samples with the same phenotype cluster together independently of the scan date. If there is a significant batch effect, we expect that samples with the same scan date to cluster together independently of their phenotype. We used the metadata associated to the expression set to create a batch variable from the scan date. We found that samples were processed in three days and, therefore, were grouped using variables batch1, batch2 and batch3.The raw data did not have associated phenotypic information. We used the phenotypic information deposited in GEO to manually assign the phenotype to each sample. There were two phenotypic classes: proliferative and invasive.

scanDate <- protocolData(rmadata_filtered_ns$eset)$ScanDate
scanDate <- gsub("T.*", "", scanDate)
scanDate <- gsub(" .*", "", scanDate)
dates1 <- as.Date(scanDate[1:12])
dates2 <- as.Date(scanDate[13:15], format = "%m/%d/%y")
scanDate <- c(dates1, dates2)
print(scanDate)

##  [1] "2009-08-13" "2009-03-31" "2009-03-31" "2009-03-31" "2009-03-31"
##  [6] "2009-08-13" "2009-03-31" "2009-03-31" "2009-03-31" "2009-03-31"
## [11] "2009-03-31" "2009-03-31" "2011-06-15" "2011-06-15" "2011-06-15"

Here we can see that samples a and f were processed on 2009-08-13 (batch 2), samples m, n, o were processed on 2011-06-15 (batch 3) and the rest were processed on 2009-03-31 (batch 1). This could explain why these samples were behaving differently in the previous anlaysis and justifies the need for adjusting for batch effect.

minscan <- min(scanDate)
days <- scanDate - minscan
batch <- cut(as.numeric(days), c(-1, 10, 800, 1000))
batch <- as.numeric(batch)

phenotypes <- c(rep("proliferative", 6), rep("invasive", 9))
samples <- pData(rmadata_filtered_ns$eset[1])
phenoData <- data.frame(samples, phenotypes, batch)
pData(rmadata_filtered_ns$eset) <- phenoData

table(data.frame(Phenotype = rmadata_filtered_ns$eset$phenotypes, Batch = batch))

##                Batch
## Phenotype       1 2 3
##   invasive      6 0 3
##   proliferative 4 2 0

At first glance, we can see that the blocking is not very well designed. Batch 2 and batch 3 contain samples belonging to only one phenotypic class, which could be introducing severe confounding effects. However, the largest batch (Batch1), has a balanced number of samples belonging to the two phenotypes. In order to assess the extent of the batch effect we proceed to perform a clustering analysis of the data.

d <- as.dist(1 - cor(exprs(rmadata_filtered_ns$eset), method = "spearman"))
sampleClustering <- hclust(d)
sampleDendrogram <- as.dendrogram(sampleClustering, hang = 0.1)
names(batch) <- sampleNames(rmadata_filtered_ns$eset)
outcome <- as.character(rmadata_filtered_ns$eset$phenotypes)
names(outcome) <- sampleNames(rmadata_filtered_ns$eset)
sampleDendrogram <- dendrapply(sampleDendrogram, function(x, batch, labels) {
    if (is.leaf(x)) {
        attr(x, "nodePar") <- list(lab.col = as.vector(batch[attr(x, "label")]))
        attr(x, "label") <- as.vector(labels[attr(x, "label")])
    }
    x
}, batch, outcome)

plot(sampleDendrogram)
legend("topright", paste("Batch", sort(unique(batch))), fill = sort(unique(batch)))

plot of chunk dendrogram

We can see that samples belonging to the same phenotypic class cluster together in the dendrogram. We specially trust the clustering by phenotype of samples from batch 1 because it is the only batch that contains samples from two different phenotypes. As a complementary analysis, we represented the multidimensional scaling plot, where it can be seen that samples from the same phenotype lay close together in the space and are separated from samples from the other phenotype. However, we can still detect some heterogeneity inside each pehnotypic class since some samples fall a little bit far from the rest.

cmd <- cmdscale(d)
plot(cmd, type = "n")
text(cmd, outcome, col = batch, cex = 0.9)
legend("top", paste("Batch", unique(batch)), fill = unique(batch), inset = 0.01)

plot of chunk MDS

Batch effect can also be assessed by Principal Component Analysis and Surrogate Variable Analysis, which allows the identification of hidden sources of non-biological variability.

s <- fast.svd(t(scale(t(exprs(rmadata_filtered_ns$eset)), center = TRUE, scale = TRUE)))
plot(s$d^2/sum(s$d^2), xlab = "Principal Component", ylab = "Proportion of variance")

plot of chunk PCA

par(mfrow = c(1, 3))
boxplot(split(s$v[, 1], batch), main = sprintf("PC1 %.0f%%", 100 * s$d[1]^2/sum(s$d^2)))
boxplot(split(s$v[, 2], batch), main = sprintf("PC2 %.0f%%", 100 * s$d[2]^2/sum(s$d^2)))
boxplot(split(s$v[, 3], batch), main = sprintf("PC3 %.0f%%", 100 * s$d[3]^2/sum(s$d^2)))

plot of chunk PCA2

mod <- model.matrix(~phenotypes, data = pData(rmadata_filtered_ns$eset))
mod0 <- model.matrix(~1, data = pData(rmadata_filtered_ns$eset))
sv <- sva(exprs(rmadata_filtered_ns$eset), mod, mod0)

## Number of significant surrogate variables is:  4 
## Iteration (out of 5 ):1  2  3  4  5

par(mfrow = c(1, 4))
for (i in 1:4) boxplot(sv$sv[, i] ~ batch, main = sprintf("SV %d", i), xlab = "Batch")

plot of chunk SVA

As it can be seen in the plots above, with just 4 variables we can explain most part of the variance in the system. In fact, the 5th variable detected in the graph contributes below 5%. The fact that the principal components and the surrogate variable values are not uniform across batches indicates that there are sources of heterogeneity that will have to be adjusted for when analysing the differential expression of genes between samples. We cannot completely discard that batch effect is playing a role but the clustering analysis suggests that it is not strong enough to confound the biological variability. However, in order to test whether having samples from different batches is confounding the results, we removed samples belonging to batches 2 and 3 and performed the differential expression analysis also with this subset of the data (batch 1), which already contains a balanced number of samples regarding melanoma phenotypes. Here we show how this subsetting was performed and the clustering and multidimensional scaling with only batch 1, which shows good separation of samples according to their phenotypes:

rdataSubset <- rdata[, c(2:5, 7:12)]
subsampleNames <- c(letters[2:5], letters[7:12])
sampleNames(rdataSubset) <- subsampleNames
rmadataSubset <- rma(rdataSubset)

## Background correcting
## Normalizing
## Calculating Expression

rmadataSubset_filtered_ns <- nsFilter(rmadataSubset, require.entrez = TRUE, 
    remove.dupEntrez = TRUE, var.cutoff = 0.1)

phenotypesSubset <- c(rep("proliferative", 4), rep("invasive", 6))
samplesSubset <- pData(rmadataSubset_filtered_ns$eset[1])
phenoDataSubset <- data.frame(samplesSubset, phenotypesSubset)
pData(rmadataSubset_filtered_ns$eset) <- phenoDataSubset


dSubset <- as.dist(1 - cor(exprs(rmadataSubset_filtered_ns$eset), method = "spearman"))
sampleClusteringSubset <- hclust(dSubset)
sampleDendrogramSubset <- as.dendrogram(sampleClusteringSubset, hang = 0.1)
outcomeSubset <- as.character(rmadataSubset_filtered_ns$eset$phenotypesSubset)
names(outcomeSubset) <- sampleNames(rmadataSubset_filtered_ns$eset)
sampleDendrogramSubset <- dendrapply(sampleDendrogramSubset, function(x, labels) {
    if (is.leaf(x)) {
        attr(x, "label") <- as.vector(labels[attr(x, "label")])
    }
    x
}, outcomeSubset)


par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
plot(sampleDendrogramSubset)


cmdSubset <- cmdscale(dSubset)
plot(cmdSubset, type = "n")
text(cmdSubset, outcomeSubset, cex = 0.9)

plot of chunk subsetting

Differential Gene Expression Analysis

We carried out the differential expression analysis of the data after normalization and filtering using the moderated t-tests from the package limma. It was applied to:

a) The whole dataset

b) A subset with only batch 1

A) Whole dataset

First we adjusted for the surrogate variables detected before:

design_filtered_sv <- cbind(mod, sv$sv)
fit_filtered_sv <- lmFit(rmadata_filtered_ns$eset, design_filtered_sv)
fit_filtered_sv <- eBayes(fit_filtered_sv)
ttfsv <- topTable(fit_filtered_sv, coef = 2, n = Inf)
DEgenes_sv <- rownames(ttfsv[ttfsv$adj.P.Val < 0.1, ])

[1] “With this we obtained 9202 differentially expressed genes with an adjusted P-value under the cutoff (0.1).”

We plotted the distribution of raw P-values:

par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(ttfsv$P.Value, xlab = "Raw P-values", main = "")
hist(ttfsv$P.Value, xlab = "Raw P-values", main = "", breaks = 1000)

Because the P-values are not vey unifromly distributed, we adjusted for batch specifically:


design_filtered_b <- model.matrix(~phenotypes + batch, data = rmadata_filtered_ns$eset)
fit_filtered_b <- lmFit(rmadata_filtered_ns$eset, design_filtered_b)
fit_filtered_b <- eBayes(fit_filtered_b)
ttfb <- topTable(fit_filtered_b, coef = 2, n = Inf)
par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(ttfb$P.Value, xlab = "Raw P-values", main = "")
hist(ttfb$P.Value, xlab = "Raw P-values", main = "", breaks = 1000)

plot of chunk DE4

DEgenes_b <- rownames(ttfb[ttfb$adj.P.Val < 0.1, ])

## [1] "With this we obtained 5746 differentially expressed genes with an adjusted P-value under the cutoff (0.1). 5739 genes intersected with the list obtained adjusting only for surrogate variables."

Finally, we adjusted for batch and extra heterogeneity:

mod0 <- model.matrix(~batch, data = rmadata_filtered_ns$eset)
sv_fb <- sva(exprs(rmadata_filtered_ns$eset), design_filtered_b, mod0)

## Number of significant surrogate variables is:  4 
## Iteration (out of 5 ):1  2  3  4  5

design_filtered_bsv <- cbind(design_filtered_b, sv_fb$sv)
fit_filter_bsv <- lmFit(rmadata_filtered_ns$eset, design_filtered_bsv)
fit_filter_bsv <- eBayes(fit_filter_bsv)
ttfbsv <- topTable(fit_filter_bsv, coef = 2, n = Inf)
par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(ttfbsv$P.Value, xlab = "Raw P-values", main = "")
hist(ttfbsv$P.Value, xlab = "Raw P-values", main = "", breaks = 1000)

plot of chunk DEbsv

DEgenes_fbsv <- rownames(ttfbsv[ttfbsv$adj.P.Val < 0.1, ])

## [1] "With this we obtained 5300 differentially expressed genes with an adjusted P-value under the cutoff (0.1). 4784 genes intersected with the list obtained adjusting only for surrogate variables."

We can represent the distribution of p-values with respect to the changes in gene expression with the volcano plots:

mylist = list(ttfsv, ttfb, ttfbsv)
listnames = c(sprintf("adjusting for surrogate variables, %i DE genes", length(DEgenes_sv)), 
    sprintf("adjusting for batch, %i DE genes", length(DEgenes_b)), sprintf("adjusting for batch and extra heterogeneity, %i DE genes", 
        length(DEgenes_fbsv)))

par(mfrow = c(1, 3), mar = c(5, 5, 5, 5))
j = 0
for (i in mylist) {
    j = j + 1
    name = listnames[j]
    tt <- i
    plot(tt$logFC, -log10(tt$P.Value), main = name, pch = ".", cex = 4, col = grey(0.75), 
        cex.axis = 1.2, las = 1, cex.lab = 1.5, xlab = expression(paste(log[2], 
            " Fold change")), ylab = expression(paste(-log[10], " P-value")))
    points(tt[tt$adj.P.Val < 0.1, "logFC"], -log10(tt[tt$adj.P.Val < 0.1, "P.Value"]), 
        pch = ".", cex = 4, col = "red")
    abline(h = -log10(max(tt[tt$adj.P.Val < 0.1, "P.Value"])), col = grey(0.5), 
        lty = 2)

}

plot of chunk volcano

The volcano plots show that the number of upregulated and downregulated genes is similar when comparing the samples of one melanoma phenotype against the other. Furthermore, we can see that adjusting for batch and extra heterogeneity reduces the number of differentially expressed genes compared to adjustment for surrogate variables only, which yields an extremely high number of differentially expressed genes. In addition, it seems that the best distribution of P-values is obtained when adjusting for batch and additional heterogeneity. Therefore, we will consider these genes for further analysis. Finally, we can obtain the gene names corresponding to the probesets as follows:

genenames <- select(hgu133plus2.db, keys = DEgenes_fbsv, columns = "GENENAME", 
    keytype = "PROBEID")
head(genenames)

##        PROBEID                                                 GENENAME
## 1  201117_s_at                                       carboxypeptidase E
## 2  204466_s_at synuclein, alpha (non A4 component of amyloid precursor)
## 3 1555505_a_at                                               tyrosinase
## 4  206427_s_at                                                  melan-A
## 5    209072_at                                     myelin basic protein
## 6    204086_at             preferentially expressed antigen in melanoma

B) Subset with only batch 1

In this case we only adjusted for surrogate variables, since there cannot be any batch effect.

modSubset <- model.matrix(~phenotypesSubset, data = pData(rmadataSubset_filtered_ns$eset))
modSubset0 <- model.matrix(~1, data = pData(rmadataSubset_filtered_ns$eset))
svSubset <- sva(exprs(rmadataSubset_filtered_ns$eset), modSubset, modSubset0)

## Number of significant surrogate variables is:  3 
## Iteration (out of 5 ):1  2  3  4  5


design_filtered_svSubset <- cbind(modSubset, svSubset$sv)
fit_filtered_svSubset <- lmFit(rmadataSubset_filtered_ns$eset, design_filtered_svSubset)
fit_filtered_svSubset <- eBayes(fit_filtered_svSubset)
ttfsvSubset <- topTable(fit_filtered_svSubset, coef = 2, n = Inf)
DEgenes_svSubset <- rownames(ttfsvSubset[ttfsvSubset$adj.P.Val < 0.1, ])

## [1] "With this we obtained 7188 differentially expressed genes with an adjusted P-value under the cutoff (0.1)."

In contrast to the analysis performed using all the samples and adjusting for surrogate variables only, the number of DE genes is reduced. However, the number of DE genes is higher than that obtained for all the samples when adjusting for batch and surrogate variables. This may be explained by the fact that by using samples from only one batch we have reduced the heterogeneity among samples and the noise, so that now we are able to detect more differentially expressed genes.

par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(ttfsvSubset$P.Value, xlab = "Raw P-values", main = "")
hist(ttfsvSubset$P.Value, xlab = "Raw P-values", main = "", breaks = 1000)

plot of chunk p-values

name = sprintf("adjusting for surrogate variables, %i DE genes", length(DEgenes_svSubset))

plot(ttfsvSubset$logFC, -log10(ttfsvSubset$P.Value), main = name, pch = ".", 
    cex = 4, col = grey(0.75), cex.axis = 1, las = 1, cex.lab = 1.2, xlab = expression(paste(log[2], 
        " Fold change")), ylab = expression(paste(-log[10], " P-value")))
points(ttfsvSubset[ttfsvSubset$adj.P.Val < 0.1, "logFC"], -log10(ttfsvSubset[ttfsvSubset$adj.P.Val < 
    0.1, "P.Value"]), pch = ".", cex = 4, col = "red")
abline(h = -log10(max(ttfsvSubset[ttfsvSubset$adj.P.Val < 0.1, "P.Value"])), 
    col = grey(0.5), lty = 2)

plot of chunk volcano2

The distribution of p-values and the volcano plot are similar to the corresponding plots with the whole data.

And we can obtain the names of these genes:

genenamesSubset <- select(hgu133plus2.db, keys = DEgenes_svSubset, columns = "GENENAME", 
    keytype = "PROBEID")
head(genenamesSubset)

##       PROBEID                           GENENAME
## 1   205694_at       tyrosinase-related protein 1
## 2 225242_s_at   coiled-coil domain containing 80
## 3   201438_at         collagen, type VI, alpha 3
## 4   213638_at  phosphatase and actin regulator 1
## 5 221644_s_at solute carrier family 45, member 2
## 6 206427_s_at                            melan-A

Discussion and conclusions

We can see that we obtained a large number of differentially expressed genes independently on the adjustment method. The results obtained with the samples from only one batch are similar to the numbers obtained when not excluding samples from different batches. This reveals that the large amount of differentially expressed genes is not due to a technical artifact that we can adjust for and most probably is due to to biological variability in gene expression. Accordingly, we decided to keep all samples for further analysis due to the lack of evidence of batch effect confounding the observations. However we were not very strict in the inclusion criteria for these genes since we used a very large p-value (0.1). How different p-value thresholds affect the results can be seen below:


pvalThreshold <- c()
numGenes <- c()
for (i in 1:5) {
    pvalThreshold <- append(pvalThreshold, 10^(-i))
    numGenes <- append(numGenes, dim(ttfbsv[ttfbsv$adj.P.Val < 10^(-i), ])[1])
}
numGenesVSpvalThreshold <- data.frame(pvalThreshold, numGenes)
numGenesVSpvalThreshold

##   pvalThreshold numGenes
## 1         1e-01     5300
## 2         1e-02     1672
## 3         1e-03      251
## 4         1e-04       33
## 5         1e-05        6

barplot(t(as.matrix(numGenesVSpvalThreshold)), xlab = "p-value threshold", ylab = "number of genes", 
    names = pvalThreshold, main = "Number of DE genes depending on threshold p-value", 
    col = "white")

plot of chunk genesbypvalue

As expected, reducing the p-value reduces the number of statistically significant DE genes. Looking at the volcano plots, we can see that small p-values correspond to higher fold changes. If we were interested in finding marker genes, we would choose the genes that showed major fold-changes. However, because we are interested in understanding the biological changes underlying the differenece between a proliferative and invasive phenotype, we are interested in keeping also the genes with subtle changes. As a result, an intermediate threshold is best suited in this case, and we decided to keep those genes with a p-value < 0.01, which correspond to approximately the 10% of the genes that have passed the filtering step.

The selected genes, sorted by adjusted P-value, are stored in the following variable. We show the first 100.

DEgenes_fbsv_0.01 <- ttfbsv[ttfbsv$adj.P.Val < 0.01, ]
DEgenes_fbsv_0.01_byPVal <- rownames(DEgenes_fbsv_0.01[with(DEgenes_fbsv_0.01, 
    order(adj.P.Val)), ])
genenamesfinal <- select(hgu133plus2.db, keys = DEgenes_fbsv_0.01_byPVal, columns = "GENENAME", 
    keytype = "PROBEID")
head(genenamesfinal[2], n = 100)

##                                                                              GENENAME
## 1                                                                  carboxypeptidase E
## 2                            synuclein, alpha (non A4 component of amyloid precursor)
## 3                                                                          tyrosinase
## 4                                                                             melan-A
## 5                                                                myelin basic protein
## 6                                        preferentially expressed antigen in melanoma
## 7                                                     S100 calcium binding protein A1
## 8                                                          myelin expression factor 2
## 9                                                                      ALX homeobox 1
## 10                                                  carboxypeptidase N, polypeptide 1
## 11                                                                        mucolipin 3
## 12                                         long intergenic non-protein coding RNA 518
## 13               cytidine monophospho-N-acetylneuraminic acid hydroxylase, pseudogene
## 14                                                     tripartite motif-containing 51
## 15                  solute carrier family 25 (aspartate/glutamate carrier), member 13
## 16                                                       tyrosinase-related protein 1
## 17                                                              proteolipid protein 1
## 18                                                                      ets variant 5
## 19                                                    matrix-remodelling associated 7
## 20                   IMP2 inner mitochondrial membrane peptidase-like (S. cerevisiae)
## 21                                                            carbonic anhydrase VIII
## 22                                           potassium channel, subfamily K, member 2
## 23                                                 solute carrier family 45, member 2
## 24                                                                   calpain 3, (p94)
## 25                                                leucine rich adaptor protein 1-like
## 26                  UDP-Gal:betaGlcNAc beta 1,4- galactosyltransferase, polypeptide 4
## 27                       serpin peptidase inhibitor, clade G (C1 inhibitor), member 1
## 28                                 ST3 beta-galactoside alpha-2,3-sialyltransferase 4
## 29  Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain, 1
## 30                                                                   sorting nexin 10
## 31                                                                           KIAA1211
## 32                                cell migration inducing protein, hyaluronan binding
## 33                                                    inositol-3-phosphate synthase 1
## 34                                   growth regulation by estrogen in breast cancer 1
## 35                                                   glycoprotein integral membrane 1
## 36                                                      TBC1 domain family, member 16
## 37                                                                      ets variant 4
## 38                                                  phosphatase and actin regulator 1
## 39                                                          calcium modulating ligand
## 40                          ADAM metallopeptidase with thrombospondin type 1 motif, 6
## 41           phosphatidylinositol-3,4,5-trisphosphate-dependent Rac exchange factor 1
## 42                               solute carrier family 26 (anion exchanger), member 6
## 43                     Dab, mitogen-responsive phosphoprotein, homolog 2 (Drosophila)
## 44                                           ankyrin repeat and SOCS box containing 9
## 45                                                              GID complex subunit 4
## 46                                                              cystatin A (stefin A)
## 47                  N-acetylneuraminate pyruvate lyase (dihydrodipicolinate synthase)
## 48          glucosaminyl (N-acetyl) transferase 2, I-branching enzyme (I blood group)
## 49                                                              bridging integrator 3
## 50                                                             shroom family member 2
## 51                                                  BMP binding endothelial regulator
## 52                                                           anthrax toxin receptor 2
## 53                                                           prostaglandin E synthase
## 54                                   praja ring finger 2, E3 ubiquitin protein ligase
## 55                                                                  tumor protein D52
## 56                                                                      fibronectin 1
## 57                                         CAP-GLY domain containing linker protein 3
## 58                                                 lymphoid enhancer-binding factor 1
## 59                                                         fibroblast growth factor 7
## 60                                                       uncharacterized LOC100127888
## 61                                                  RAB38, member RAS oncogene family
## 62                                                              renin binding protein
## 63                                                          ribonucleic acid export 1
## 64                                      family with sequence similarity 213, member A
## 65                                major histocompatibility complex, class II, DM beta
## 66                                  ribosomal protein S6 kinase, 90kDa, polypeptide 5
## 67                                                                         hephaestin
## 68                                 URB2 ribosome biogenesis 2 homolog (S. cerevisiae)
## 69                                               immunoglobulin superfamily, member 3
## 70      low density lipoprotein receptor-related protein 8, apolipoprotein e receptor
## 71                                                          UDP-galactose-4-epimerase
## 72                                                            high mobility group 20B
## 73                                                                   glycoprotein M6B
## 74                                         amyloid beta (A4) precursor-like protein 2
## 75                                                     poly (ADP-ribose) polymerase 1
## 76                                                  inositol-trisphosphate 3-kinase B
## 77                                          zinc finger and SCAN domain containing 18
## 78                                                          cyclin-dependent kinase 2
## 79                                      family with sequence similarity 217, member B
## 80                               major histocompatibility complex, class II, DR alpha
## 81                                                    lysosomal trafficking regulator
## 82                                                                  docking protein 5
## 83                              tumor necrosis factor receptor superfamily, member 14
## 84          nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 2
## 85                                       family with sequence similarity 26, member E
## 86                                                   uncharacterized protein FLJ25694
## 87                               ring finger protein 125, E3 ubiquitin protein ligase
## 88                                               gremlin 2, DAN family BMP antagonist
## 89                                            SAP domain containing ribonucleoprotein
## 90                             potassium channel tetramerization domain containing 11
## 91                                   fibronectin leucine rich transmembrane protein 3
## 92                                                           cell adhesion molecule 4
## 93                                                                        EPS8-like 2
## 94                                                                WD repeat domain 61
## 95                                                rhophilin associated tail protein 1
## 96                                               exostosin-like glycosyltransferase 1
## 97                                            coagulation factor XII (Hageman factor)
## 98                             tumor necrosis factor receptor superfamily, member 11b
## 99                                     microphthalmia-associated transcription factor
## 100                                                          transmembrane protein 51

sessionInfo()

## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] hgu133plus2cdf_2.14.0 hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0  
##  [4] RSQLite_0.11.4        DBI_0.2-7             AnnotationDbi_1.26.0 
##  [7] GenomeInfoDb_1.0.2    sva_3.10.0            mgcv_1.7-29          
## [10] nlme_3.1-117          corpcor_1.6.6         genefilter_1.46.0    
## [13] limma_3.20.1          affyPLM_1.40.0        preprocessCore_1.26.0
## [16] gcrma_2.36.0          affy_1.42.2           Biobase_2.24.0       
## [19] BiocGenerics_0.10.0   knitr_1.5            
## 
## loaded via a namespace (and not attached):
##  [1] affyio_1.32.0        annotate_1.42.0      BiocInstaller_1.14.2
##  [4] Biostrings_2.32.0    digest_0.6.4         evaluate_0.5.5      
##  [7] formatR_0.10         grid_3.1.0           IRanges_1.22.6      
## [10] lattice_0.20-29      Matrix_1.1-3         splines_3.1.0       
## [13] stats4_3.1.0         stringr_0.6.2        survival_2.37-7     
## [16] tools_3.1.0          XML_3.98-1.1         xtable_1.7-3        
## [19] XVector_0.4.0        zlibbioc_1.10.0