Discovering Batch Effect with EDA

For the dataset we have been working with, models do not help due to the almost perfect confounding. This is one reason we created the subset dataset:

library(GSE5859Subset) data(GSE5859Subset) Here we purposely confounded month and group (sex) but not completely:

sex = sampleInfo$group month = factor( format(sampleInfo$date,“%m”)) table( sampleInfo$group, month)

Modeling Batch Effects Exercises #1

1/1 point (graded) Using the functions rowttests and qvalue compare the two groups, in this case males and females so coded in sex. Because this is a smaller dataset, which decreases our power, we will use a more lenient FDR cut-off of 10%.

How many gene have q-values less than 0.1?

library(GSE5859Subset)
data(GSE5859Subset)
library(qvalue)
library(genefilter)

sex = as.factor(sampleInfo$group)
month = factor( format(sampleInfo$date,"%m"))
table( sampleInfo$group, month)

##    month
##     06 10
##   0  9  3
##   1  3  9

ttxt <- rowttests(geneExpression, sex)
pval <- ttxt$p.val
qval <- qvalue(pval)

sum(qval$qvalues < 0.1)

## [1] 59

Exercises #2

Note that sampleInfo$group here represents males and females. Thus we expect differences to be on chrY and, for genes that escape inactivation, chrX. Note that we do not expect many autosomal genes to be different between males and females. This gives us an opportunity to evaluate false and true positives with experimental data. For example, we evaluate results using the proportion genes of the list that are on chrX or chrY.

For the list of genes with q<0.1 calculated in Modeling Batch Effects Exercises #1, what proportion of genes are on chrX or chrY?

sexchr <- geneAnnotation$CHR[qval$qvalues < 0.1]

mean(sexchr == "chrX" | sexchr == "chrY")

## [1] 0.3389831

Exercise #3

Now for the autosomal genes (not on chrX and chrY) for which q-value < 0.1 perform a t-test comparing samples processed in June to those processed in October.

What proportion of these have p-values < 0.05?

filter <- geneAnnotation[geneAnnotation$CHR != "chrX" & geneAnnotation$CHR != "chrY" & qval$qvalues < 0.1, ]

autosomal <- geneExpression[filter$PROBEID,]

month = factor( format(sampleInfo$date,"%m"))

ttxt2 <- rowttests(autosomal, month)

mean(ttxt2$p.val < 0.05)

## [1] 0.8717949

Exercise #5

Now use the X defined above to fit a regression model using lm for each gene. Note that you can obtain p-values for estimated parameters using summary. Here is an example:

X = model.matrix(~sex+month) i = 234 y = geneExpression[i,] fit = lm(y~X-1) summary(fit)$coef

How many of the q-values for the group comparison are <0.1 now? > comparing sex group, thus extract p value for sex parameter

pvals <- vector()
for(i in 1:nrow(geneExpression)){
        X = model.matrix( ~ sex + month)
        y = geneExpression[i,]
        fit = lm( y ~ X - 1)
        sa <- summary(fit)
        pvals[i] <- sa$coefficients[2,4]
}
        
qvals <- qvalue(pvals)
sum(qvals$qvalues < 0.1)

## [1] 17

With this new list, what proportion of these are chrX and chrY?

chr <- geneAnnotation$CHR[qvals$qvalues < 0.1]

mean(chr == "chrX" | chr == "chrY")

## [1] 0.8823529

Now, from the linear model in Modeling Batch Effects Exercises #6, extract the p-values related to the coefficient representing the October versus June differences using the same linear model.

How many of the q-values for the month comparison are < 0.1 now?

Mpvals <- vector()
for(i in 1:nrow(geneExpression)){
        X = model.matrix(~ sex + month)
        y = geneExpression[i,]
        fit = lm(y ~ X - 1 )
        sa <- summary(fit)
        Mpvals[i] <- sa$coefficients[3, 4]
        
}

Mqvals <- qvalue(Mpvals)
sum(Mqvals$qvalues < 0.1)

## [1] 3170

Discovering Batch Effect with EDA

Xiaoping Li

February 10, 2017

Modeling Batch Effects Exercises #1

Exercises #2

Exercise #3

Exercise #5