Homework 01

In this assignment you will be analyzing a publicly-available expression study of mouse brain tissue, run on the Affymetrix MGU74Av2 platform. You can read about the study at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7191.

The motivation is to study a gene S1P2, sphingosine-1-phosphate receptor 2, which, when mutated, results in seizures. In contrast, mutation of the related gene S1P3 does not do this. The study includes two brain regions, hippocampus and neocortex, and three mouse strains (wild type, S1P2 mutant, and S1P3 mutant). In addition the authors provide information on the gender of the mice. Paul was also able to extract “processing date” information from the raw data files (the authors did not explicitly provide this).

You can find two data files for this study here, specifically the expression data and the design of the experiment. The data have been re-preprocessed with RMA from the Bioconductor affy package, so it is on a log2 scale. Some probes that are considered “controls” were also removed, so overall it's a little different than the processed data provided via GEO. Our focus here is on quality control, exploration, and differential expression analysis.

Evaluation

The assignment has 25 points and represents 25% of your overall course mark. Overall presentation and mechanics is worth 5 points, with full points awarded if it's exceptional. The remaining 20 points are spread among the specific questions, as detailed below. We will try to give partial credit when/if we can figure out what you're trying to do.

Each student must submit their own work: no group efforts. It's fine to talk to fellow students and have discussion on the Google Group. If someone is really helpful, give them some credit. You have definitely crossed the line if there's any copying and pasting of code or wording.

How to submit the work

Follow the instructions from week 1 re: creating your private GitHub repository and making JB, SJ, and GL collaborators.
Write your homework in R Markdown (.rmd)
Compile your homework to markdown (.md) and HTML (.html)
Read and follow the Polish and submit instructions
Add the R Markdown, Markdown and HTML to your private GitHub repository named stat540-2014-lastname-firstname-hw
Do not add the data files GSE7191-data.txt and GSE7191-design.txt
Open your private repositry on GitHub in a web browser
On the web page just above the files, look for the text “latest commit” followed by ten numbers and letters (called the revision SHA) and a clipboard icon
Click on the clipboard icon to copy the revision SHA to your clipboard
Create an issue in your private repository (click “Issues” and “New Issue”) on GitHub named “Mark homework 1 of firstname lastname”
In the description of the issue, tag both TAs (@gloriali and @sjackman) and paste the revision SHA
Click “Submit new issue”
You're done! Congratulations!

If you're concerned that something hasn't gone right with the submission, send Gloria glorialuolanli@gmail.com and Shaun sjackman@gmail.com an e-mail with your assignment attached. Note: this is plan B. Please submit your homework using GitHub.

Your mission

# Load required datasets:
data <- read.table("GSE7191-data.txt", header = TRUE, row.names = 1)
design <- read.table("GSE7191-design.txt", header = TRUE, row.names = 1)

# Load required R packages:
library(lattice)
library(reshape2)
library(xtable)
library(RColorBrewer)
library(gplots)

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess

library(plyr)
library(hexbin)

## Loading required package: grid

library(preprocessCore)
library(limma)

Q0 (0 pts) Intake

Load the data. Smell test it. Fiddle with factor levels and do any other light cleaning you deem necessary. You can set include = FALSE for some/all of this code, but DO IT.

Q1 (2 points) What are the basic characteristics of the data and meta-data?

Q1a: How many probes? How many samples?

There are 12422 probes and there are 50 samples.

# Alternative way to to this
print(paste("There are", nrow(data), "probes"))

## [1] "There are 12422 probes"

print(paste("There are", ncol(data), "samples"))

## [1] "There are 50 samples"

Q1b: What is the breakdown of samples for Genotype, BrainRegion, Sex, and DateRun?

For starters, cross-tabulate Genotype and BrainRegion.

addmargins(with(design, table(Genotype, BrainRegion)))

##            BrainRegion
## Genotype    hippocampus neocortex Sum
##   S1P2_KO            10        10  20
##   S1P3_KO             5         5  10
##   Wild_type          10        10  20
##   Sum                25        25  50

Then do some cross tabulation involving DateRun

addmargins(with(design, table(Genotype, DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        4        4        4        0        0        7
##   S1P3_KO          7        0        0        0        0        0        0
##   Wild_type        0        0        0        4        8        7        0
##   Sum              7        4        4        8        8        7        7
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          1  20
##   S1P3_KO          3  10
##   Wild_type        1  20
##   Sum              5  50

addmargins(with(design, table(BrainRegion, DateRun)))

##              DateRun
## BrainRegion   01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03
##   hippocampus        4        2        2        4        4        3
##   neocortex          3        2        2        4        4        4
##   Sum                7        4        4        8        8        7
##              DateRun
## BrainRegion   10/23/03 12/18/03 Sum
##   hippocampus        4        2  25
##   neocortex          3        3  25
##   Sum                7        5  50

and, optionally, Sex to investigate if these “nuisance” factors are confounded with the experimental factors.

# Within the subset *hippocampus*
addmargins(with(subset(design, subset = BrainRegion == "hippocampus"), table(Genotype, 
    DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        2        2        2        0        0        4
##   S1P3_KO          4        0        0        0        0        0        0
##   Wild_type        0        0        0        2        4        3        0
##   Sum              4        2        2        4        4        3        4
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          0  10
##   S1P3_KO          1   5
##   Wild_type        1  10
##   Sum              2  25

# Within the subset *neocortex*
addmargins(with(subset(design, subset = BrainRegion == "neocortex"), table(Genotype, 
    DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        2        2        2        0        0        3
##   S1P3_KO          3        0        0        0        0        0        0
##   Wild_type        0        0        0        2        4        4        0
##   Sum              3        2        2        4        4        4        3
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          1  10
##   S1P3_KO          2   5
##   Wild_type        0  10
##   Sum              3  25

# Within the subset *neocortex* for *male*
addmargins(with(subset(subset(design, subset = BrainRegion == "neocortex"), 
    subset = Sex == "male"), table(Genotype, DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        2        0        0        0        0        2
##   S1P3_KO          2        0        0        0        0        0        0
##   Wild_type        0        0        0        0        4        1        0
##   Sum              2        2        0        0        4        1        2
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          1   5
##   S1P3_KO          0   2
##   Wild_type        0   5
##   Sum              1  12

# Within the subset *neocortex* for *female*
addmargins(with(subset(subset(design, subset = BrainRegion == "neocortex"), 
    subset = Sex == "female"), table(Genotype, DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        0        2        2        0        0        1
##   S1P3_KO          1        0        0        0        0        0        0
##   Wild_type        0        0        0        2        0        3        0
##   Sum              1        0        2        4        0        3        1
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          0   5
##   S1P3_KO          2   3
##   Wild_type        0   5
##   Sum              2  13

# Within the subset *hippocampus* for *male*
addmargins(with(subset(subset(design, subset = BrainRegion == "hippocampus"), 
    subset = Sex == "male"), table(Genotype, DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        2        0        0        0        0        3
##   S1P3_KO          2        0        0        0        0        0        0
##   Wild_type        0        0        0        0        4        0        0
##   Sum              2        2        0        0        4        0        3
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          0   5
##   S1P3_KO          0   2
##   Wild_type        1   5
##   Sum              1  12

# Within the subset *hippocampus* for *female*
addmargins(with(subset(subset(design, subset = BrainRegion == "hippocampus"), 
    subset = Sex == "female"), table(Genotype, DateRun)))

##            DateRun
## Genotype    01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
##   S1P2_KO          0        0        2        2        0        0        1
##   S1P3_KO          2        0        0        0        0        0        0
##   Wild_type        0        0        0        2        0        3        0
##   Sum              2        0        2        4        0        3        1
##            DateRun
## Genotype    12/18/03 Sum
##   S1P2_KO          0   5
##   S1P3_KO          1   3
##   Wild_type        0   5
##   Sum              1  13

Make sure you've included counts for all 4 individual factors somewhere, even if it's just in the margins of a cross-tabulation.

How do you feel about the experimental design?

Replicates within the same group are done on different dates; this can cause batch effect.
There is only 2 replicates for male S1P3_KO and only 3 replicates for female S1P3_KO for each brain region. This can cause problem if one or a few of those samples turn out to be outliers. To aviod this problem, the sample number needs to be increased.

Q1c: Create a plot showing the gene expression data for one probe.

Use position, panels or facets, and color to convey the Genotype, BrainRegion, and Sex of the samples.

tall.data <- melt(mdata, id.vars = c("DateRun", "Genotype", "BrainRegion", "Sex"), 
    variable.name = "Probe", value.name = "Expression")
# making sure that the melt function worked as expected
nrow(mdata) * ncol(tdata) == nrow(tall.data)

## [1] TRUE

stripplot(Expression ~ Genotype | BrainRegion, tall.data, subset = Probe == 
    "92482_at", groups = Sex, auto.key = TRUE, jitter.data = TRUE, type = c("p", 
    "a"))

plot of chunk unnamed-chunk-9

Q1d: Report the average expression of the selected probe for all possible combinations of Genotype, BrainRegion and Sex.

aggregate(Expression ~ Genotype * BrainRegion * Sex, tall.data, subset = Probe == 
    "92482_at", FUN = mean)

##     Genotype BrainRegion    Sex Expression
## 1    S1P2_KO hippocampus female      6.160
## 2    S1P3_KO hippocampus female      6.193
## 3  Wild_type hippocampus female      6.017
## 4    S1P2_KO   neocortex female      6.498
## 5    S1P3_KO   neocortex female      6.959
## 6  Wild_type   neocortex female      6.272
## 7    S1P2_KO hippocampus   male      6.352
## 8    S1P3_KO hippocampus   male      6.628
## 9  Wild_type hippocampus   male      6.058
## 10   S1P2_KO   neocortex   male      6.876
## 11   S1P3_KO   neocortex   male      6.967
## 12 Wild_type   neocortex   male      6.118

Q2 (4 points) Examine the sample correlation matrix.

Q2a: Depict the sample-to-sample correlations in a heatmap.

At the very least, order the samples by run date; within run date, you could also sort on other factors.

Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))
data.matrix <- as.matrix(data)
# Heatmap before sorted
heatmap.2(cor(data), Rowv = NA, Colv = NA, symm = T, trace = "none", dendrogram = "none", 
    col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations", 
    margins = c(20, 20))

plot of chunk unnamed-chunk-11


htdata <- mdata
mdata.date <- htdata[order(as.Date(htdata$DateRun, format = "%m/%d/%Y")), ]
design.date <- mdata.date[, c("DateRun", "Genotype", "BrainRegion", "Sex")]
htdata.date <- mdata.date
htdata.date$DateRun <- NULL
htdata.date$Genotype <- NULL
htdata.date$BrainRegion <- NULL
htdata.date$Sex <- NULL
htdata.date <- as.matrix(htdata.date)
rownames(htdata.date) <- with(design.date, paste(Genotype, BrainRegion, Sex, 
    DateRun, sep = "_"))
htdata.date <- t(htdata.date)

# Heatmap after sorted by date
heatmap.2(cor(htdata.date), Rowv = NA, Colv = NA, symm = T, trace = "none", 
    dendrogram = "none", col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations sorted by DateRun", 
    margins = c(20, 20))

plot of chunk unnamed-chunk-11

What do you see about batch effects, outliers, etc.? The sample GSM172976 seems to be a very obvious outlier. The batch effects are not obvious as the outlier but is still quite visiable (i.e. small blocks).

Q2b: Identify the outlier sample.

cdata <- data.frame(cor(data))
cdata.mean <- data.frame(AverageCorrelation = "Average Correlation", Sample = row.names(cdata), 
    mean = rowMeans(cdata))
bwplot(~mean, cdata.mean, xlab = "Average Correlation")

plot of chunk unnamed-chunk-12

Try to go beyond merely “eyeballing” it, i.e. make a quantitative statement about how it “sticks out” from the other samples.

cdata.mean.sort <- cdata.mean[order(cdata.mean$mean), ]
mean(cdata.mean.sort$mean)

## [1] 0.9904

sd(cdata.mean.sort$mean)

## [1] 0.003993

(mean(cdata.mean.sort$mean) - min(cdata.mean.sort$mean))/sd(cdata.mean.sort$mean)

## [1] 6.335

data.outlier <- cdata.mean[which.min(cdata.mean$mean), c("Sample", "mean")]

The average correlation coefficient of sample GSM172976 is 6.3352 standard deviation away from mean.

Q2c: Examine the outlier in the context of its experimental group.

Which group is the outlier in, in terms of Genotype, BrainRegion, and Sex?

# List sample group of the outlier
(design.outlier <- design[data.outlier$Sample, ])

##            DateRun Genotype BrainRegion    Sex
## GSM172976 01/16/04  S1P3_KO hippocampus female

# All the samples that are in the same group as the ourlier
(design.outlier.group <- subset((subset(subset(design, subset = BrainRegion == 
    design.outlier$BrainRegion), subset = Sex == design.outlier$Sex)), subset = Genotype == 
    design.outlier$Genotype))

##            DateRun Genotype BrainRegion    Sex
## GSM172974 12/18/03  S1P3_KO hippocampus female
## GSM172975 01/16/04  S1P3_KO hippocampus female
## GSM172976 01/16/04  S1P3_KO hippocampus female

Scatter plot these samples against each other and comment on the plot.

# Things are overplotted!
row.names(design.outlier.group)

## [1] "GSM172974" "GSM172975" "GSM172976"

data.outlier.group <- data[, row.names(design.outlier.group)]
splom(data.outlier.group)

plot of chunk unnamed-chunk-15

Try to address overplotting!

# Now it's much better!
hexplom(data.outlier.group)

plot of chunk unnamed-chunk-16

Q3 (4 points) Normalization, what it can and can't do.

Q3a: Make boxplots.

Make a pair of plots: each plot will show gene expression boxplots, one per sample. Identify the outlier sample with a different color, if you can. Comment on these plots, in general and with respect to the outlier.

# Normalize the data
ndata <- data.frame(normalize.quantiles(as.matrix(data), copy = TRUE))
colnames(ndata) <- colnames(data)
rownames(ndata) <- rownames(data)
tndata <- as.data.frame(t(ndata))  #Transpose the data frame
mndata <- cbind(design, tndata)
smndata <- mndata
smndata$Sample <- row.names(mndata)
tall.smndata <- melt(smndata, id.vars = c("Sample", "DateRun", "Genotype", "BrainRegion", 
    "Sex"), variable.name = "Probe", value.name = "Expression")

Here are the 2 plots:

The data as provided
The data after you apply quantile normalization.

# Data before normalization
smdata <- mdata
smdata$Sample <- row.names(mdata)
tall.smdata <- melt(smdata, id.vars = c("Sample", "DateRun", "Genotype", "BrainRegion", 
    "Sex"), variable.name = "Probe", value.name = "Expression")
bwplot(Expression ~ Sample, tall.smdata, scales = list(x = list(rot = 45)), 
    )

plot of chunk unnamed-chunk-18

# Data after normalization
bwplot(Expression ~ Sample, tall.smndata, scales = list(x = list(rot = 45)), 
    )

plot of chunk unnamed-chunk-18

Q3b: Did normalization fix the outlier?

With the quantile-normalized data, re-make the heatmap of the sample correlation matrix and re-compare the outlier to its experimental group.

Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))
ndata.matrix <- as.matrix(ndata)
htndata <- mndata
mndata.date <- htndata[order(as.Date(htndata$DateRun, format = "%m/%d/%Y")), 
    ]
ndesign.date <- mndata.date[, c("DateRun", "Genotype", "BrainRegion", "Sex")]
htndata.date <- mndata.date
htndata.date$DateRun <- NULL
htndata.date$Genotype <- NULL
htndata.date$BrainRegion <- NULL
htndata.date$Sex <- NULL
htndata.date <- as.matrix(htndata.date)
rownames(htndata.date) <- with(ndesign.date, paste(Genotype, BrainRegion, Sex, 
    DateRun, sep = "_"))
htndata.date <- t(htndata.date)
heatmap.2(cor(htndata.date), Rowv = NA, Colv = NA, symm = T, trace = "none", 
    dendrogram = "none", col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations after normalization sorted by DateRun", 
    margins = c(20, 20))

plot of chunk unnamed-chunk-19

Is everything OK now? No, normlization did not help fix the outlier.

Q3c: Form a dataset that omits the outlier and quantile normalize it.

row.names(design.outlier)

## [1] "GSM172976"

ordata <- data
colnames(ordata)

##  [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
##  [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975" "GSM172976"

ordata[, row.names(design.outlier)] <- NULL
colnames(ordata)

##  [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
##  [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975"

ordesign <- design
rownames(ordesign)

##  [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
##  [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975" "GSM172976"

ordesign <- ordesign[!(row.names(ordesign) == row.names(design.outlier)), ]
rownames(ordesign)

##  [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
##  [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975"

library(preprocessCore)
nordata <- data.frame(normalize.quantiles(as.matrix(ordata), copy = TRUE))
colnames(nordata) <- colnames(ordata)
rownames(nordata) <- rownames(ordata)
tnordata <- as.data.frame(t(nordata))

The data set after quantile normalization and outlier removal is now tnordata

Q3d Re-make the heatmap of the sample correlation matrix, now that the worst outlier is gone. Interpret what you see.

Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))

nordata.matrix <- as.matrix(nordata)

mnordata <- cbind(ordesign, tnordata)
smnordata <- mnordata
smnordata$Sample <- row.names(mnordata)
htnordata <- mnordata
mnordata.date <- htnordata[order(as.Date(htnordata$DateRun, format = "%m/%d/%Y")), 
    ]
nordesign.date <- mnordata.date[, c("DateRun", "Genotype", "BrainRegion", "Sex")]
htnordata.date <- mnordata.date
htnordata.date$DateRun <- NULL
htnordata.date$Genotype <- NULL
htnordata.date$BrainRegion <- NULL
htnordata.date$Sex <- NULL
htnordata.date <- as.matrix(htnordata.date)
rownames(htnordata.date) <- with(nordesign.date, paste(Genotype, BrainRegion, 
    Sex, DateRun, sep = "_"))
htnordata.date <- t(htnordata.date)
heatmap.2(cor(htnordata.date), Rowv = NA, Colv = NA, symm = T, trace = "none", 
    dendrogram = "none", col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations after normalization sorted by DateRun", 
    margins = c(20, 20))

plot of chunk unnamed-chunk-21

The data looks much normal now!

Q3e: Remake the expression boxplots for all samples before moving on to differential expression analysis.

mnordata <- cbind(ordesign, tnordata)
smnordata <- mnordata
smnordata$Sample <- row.names(mnordata)
tall.smnordata <- melt(smnordata, id.vars = c("Sample", "DateRun", "Genotype", 
    "BrainRegion", "Sex"), variable.name = "Probe", value.name = "Expression")
bwplot(Expression ~ Sample, tall.smnordata, scales = list(x = list(rot = 45)), 
    )

plot of chunk unnamed-chunk-22

How many samples remain?

There are 49 samples remain.

Q4 (5 points) For each probe, test if it has differential expression across the three genotypes within the neocortex brain region only.

You should be using data with the worst outlier removed, after quantile normalization and, for this question, restricted to neocortex.

identical(rownames(ordesign), colnames(nordata))

## [1] TRUE

neoordesign <- subset(ordesign, BrainRegion == "neocortex")
neoordesign$Genotype <- factor(neoordesign$Genotype, levels = c("Wild_type", 
    "S1P2_KO", "S1P3_KO"))
neonordata <- subset(nordata, select = ordesign$BrainRegion == "neocortex")

# Design Matrix
neoordesign.m <- model.matrix(~Genotype, neoordesign)
neoorFit <- lmFit(neonordata, neoordesign.m)
neoorEbFit <- eBayes(neoorFit)

Q4a: Write out, in an equation or English or, ideally, both, the model you are fitting.

I used limma to fit a linear model in a one-way ANOVA style

Q4b: Explore your hits.

Display the expression data for the top 50 probes in a heat map. Order the probes by p-value; order the samples by genotype.

colnames(coef(neoorEbFit))

## [1] "(Intercept)"     "GenotypeS1P2_KO" "GenotypeS1P3_KO"

neoorHits <- topTable(neoorEbFit, coef = grep("Genotype", colnames(coef(neoorEbFit))), 
    number = Inf)
neoorHits50 <- topTable(neoorEbFit, coef = grep("Genotype", colnames(coef(neoorEbFit))), 
    number = 50)


tneonordata <- as.data.frame(t(neonordata))
HneoorHits <- tneonordata[, rownames(neoorHits50), drop = FALSE]
HneoorHits <- as.matrix(HneoorHits)
rownames(HneoorHits) <- with(neoordesign, paste(rownames(neoordesign), Genotype, 
    sep = "_"))
GraysR <- colorRampPalette(brewer.pal(n = 9, "Greys"))
heatmap.2(HneoorHits, Rowv = NA, Colv = NA, symm = T, trace = "none", dendrogram = "none", 
    col = GraysR, cexCol = 1, cexRow = 1, main = "Top 50 probes", margins = c(6, 
        10))

plot of chunk unnamed-chunk-25

Q4c: Count your hits.

neoorHits1E3 <- data.frame(PValueCutOff = neoorHits$P.Value < 0.001)
addmargins(with(neoorHits1E3, table(PValueCutOff)))

## PValueCutOff
## FALSE  TRUE   Sum 
## 12326    96 12422

How many probes have unadjusted p-value < 1e-3? There are 96 probes have unadjusted p-value < 1e-3

If we took the top 50 probes as our “hits”, what is the (estimated) false discovery rate?

neoorHits50[nrow(neoorHits50), "P.Value"]

The p-value of the 50th probe from the top 50 list is 3.0565 × 10^-4.

How many of these hits do we expect to be false discoveries?

FD <- nrow(neoorHits50) * neoorHits50[nrow(neoorHits50), "P.Value"]

We will expect 0.0153 of these hits to be false

Q4d: Plot the gene expression data for a few top hits and a few boring probes.

neoorHits3 <- topTable(neoorEbFit, coef = grep("Genotype", colnames(coef(neoorEbFit))), 
    number = 3)
neoorBoring <- neoorHits[nrow(neoorHits):(nrow(neoorHits) - 2), ]
neoorExp <- rbind(neoorBoring, neoorHits3)

xyneoorHits <- tneonordata[, rownames(neoorExp), drop = FALSE]
xyneoorHits <- cbind(neoordesign, xyneoorHits)
xyneoorHits <- melt(xyneoorHits, id.vars = c("DateRun", "Genotype", "BrainRegion", 
    "Sex"), variable.name = "Probe", value.name = "Expression")
stripplot(Expression ~ Genotype | Probe, xyneoorHits, jitter.data = TRUE, auto.key = TRUE, 
    type = c("p", "a"), grid = TRUE)

plot of chunk unnamed-chunk-29

What are the p-values associated with these probes? Comment on the plots.

neoorExp[, "P.Value", drop = FALSE]

##               P.Value
## 162455_f_at 9.999e-01
## 94211_at    9.996e-01
## 160994_at   9.996e-01
## 98590_at    3.091e-07
## 160832_at   1.840e-06
## 92482_at    2.120e-06

The Hits p value are very small and the non-Hits p value are very close to 1

Q4e: Find probes where the expression in the S1P3 knockout is different from that of wild type.

neoorHitsS1P3 <- topTable(neoorEbFit, coef = "GenotypeS1P3_KO", number = Inf)
neoorHitsS1P3.0.1 <- data.frame(PValueCutOff = neoorHits$P.Value < 0.1)
addmargins(with(neoorHitsS1P3.0.1, table(PValueCutOff)))

## PValueCutOff
## FALSE  TRUE   Sum 
## 10263  2159 12422

How many hits do you find if you control FDR at 0.10?

There are 2159 probes if I control FDR at 0.10

Q5 (5 points) Differential expression analysis for Genotype * BrainRegion.

You should be using data with the worst outlier removed, after quantile normalization and for both brain regions.

ordesign$Genotype <- factor(ordesign$Genotype, levels = c("Wild_type", "S1P2_KO", 
    "S1P3_KO"))
ordesign.m <- model.matrix(~Genotype * BrainRegion, ordesign)
orFit <- lmFit(nordata, ordesign.m)
orEbFit <- eBayes(orFit)

Q5a: Fit a 3x2 full factorial model.

Test for any effect of Genotype and/or BrainRegion, i.e. test your model against a model with just an intercept.

colnames(coef(orEbFit))

## [1] "(Intercept)"                         
## [2] "GenotypeS1P2_KO"                     
## [3] "GenotypeS1P3_KO"                     
## [4] "BrainRegionneocortex"                
## [5] "GenotypeS1P2_KO:BrainRegionneocortex"
## [6] "GenotypeS1P3_KO:BrainRegionneocortex"

orHits <- topTable(orEbFit, p.value = 0.001, number = Inf)
nrow(orHits) == nrow(nordata)

## [1] TRUE

How many probes have a BH-adjusted p-value, a.k.a. q-value, less than 1e-3? All of the probes have BH-adjusted p-value less than 1e-3

Q5b: Test the null hypothesis that BrainRegion doesn't matter, i.e. that all terms involving BrainRegion are zero.

orHitsBrainRegion <- topTable(orEbFit, coef = grep("BrainRegion", colnames(coef(orEbFit))), 
    p.value = 0.1, number = Inf)

How many probes have a BH-adjusted p-value less than 0.1?

nrow(orHitsBrainRegion)

There are 3227 probes have a BH-adjusted p-value less than 0.1.

Q5c: Highlight some probes where BrainRegion does and does not matter.

Using the results from Q5b, plot and describe some results for a couple of hits and non-hits.

orHitsBrainRegion3 <- topTable(orEbFit, coef = grep("BrainRegion", colnames(coef(orEbFit))), 
    number = 3)
orHitsBrainRegionBoring <- orHitsBrainRegion[nrow(orHitsBrainRegion):(nrow(orHitsBrainRegion) - 
    2), ]
orHitsBrainRegionExp <- rbind(orHitsBrainRegionBoring, orHitsBrainRegion3)

xyneoorHitsBrainRegion <- tnordata[, rownames(orHitsBrainRegionExp), drop = FALSE]
xyneoorHitsBrainRegion <- cbind(ordesign, xyneoorHitsBrainRegion)
xyneoorHitsBrainRegion <- melt(xyneoorHitsBrainRegion, id.vars = c("DateRun", 
    "Genotype", "BrainRegion", "Sex"), variable.name = "Probe", value.name = "Expression")
stripplot(Expression ~ Genotype | Probe, xyneoorHitsBrainRegion, jitter.data = TRUE, 
    groups = BrainRegion, auto.key = TRUE, type = c("p", "a"), grid = TRUE)

plot of chunk unnamed-chunk-36

Q5d: Test the null hypothesis that Genotype doesn't matter, i.e. that all terms involving Genotype are zero.

orHitsGenotype <- topTable(orEbFit, coef = grep("Genotype", colnames(coef(orEbFit))), 
    p.value = 0.1, number = Inf)

How many probes have a BH-adjusted p-value less than 0.1?

nrow(orHitsGenotype)

There are 141 probes have a BH-adjusted p-value less than 0.1.

Compare these results to those obtained for BrainRegion in Q5b, either based on this count or based on all the p-values.

nrow(orHitsBrainRegion) > nrow(orHitsGenotype)

## [1] TRUE

There are more hits in the results from BrainRegion than from Genotype when p-value is set to 0.1

What do you conclude about the relative magnitude of the influence of brain region vs. genotype on gene expression?

The BrainRegion has a bigger effect on gene expression than Genotype