In this assignment you will be analyzing a publicly-available expression study of mouse brain tissue, run on the Affymetrix MGU74Av2 platform. You can read about the study at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7191.
The motivation is to study a gene S1P2, sphingosine-1-phosphate receptor 2, which, when mutated, results in seizures. In contrast, mutation of the related gene S1P3 does not do this. The study includes two brain regions, hippocampus and neocortex, and three mouse strains (wild type, S1P2 mutant, and S1P3 mutant). In addition the authors provide information on the gender of the mice. Paul was also able to extract “processing date” information from the raw data files (the authors did not explicitly provide this).
You can find two data files for this study here, specifically the expression data and the design of the experiment. The data have been re-preprocessed with RMA from the Bioconductor affy package, so it is on a log2 scale. Some probes that are considered “controls” were also removed, so overall it's a little different than the processed data provided via GEO. Our focus here is on quality control, exploration, and differential expression analysis.
The assignment has 25 points and represents 25% of your overall course mark. Overall presentation and mechanics is worth 5 points, with full points awarded if it's exceptional. The remaining 20 points are spread among the specific questions, as detailed below. We will try to give partial credit when/if we can figure out what you're trying to do.
Each student must submit their own work: no group efforts. It's fine to talk to fellow students and have discussion on the Google Group. If someone is really helpful, give them some credit. You have definitely crossed the line if there's any copying and pasting of code or wording.
If you're concerned that something hasn't gone right with the submission, send Gloria glorialuolanli@gmail.com and Shaun sjackman@gmail.com an e-mail with your assignment attached. Note: this is plan B. Please submit your homework using GitHub.
# Load required datasets:
data <- read.table("GSE7191-data.txt", header = TRUE, row.names = 1)
design <- read.table("GSE7191-design.txt", header = TRUE, row.names = 1)
# Load required R packages:
library(lattice)
library(reshape2)
library(xtable)
library(RColorBrewer)
library(gplots)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
library(plyr)
library(hexbin)
## Loading required package: grid
library(preprocessCore)
library(limma)
Load the data. Smell test it. Fiddle with factor levels and do any other light cleaning you deem necessary. You can set include = FALSE for some/all of this code, but DO IT.
There are 12422 probes and there are 50 samples.
# Alternative way to to this
print(paste("There are", nrow(data), "probes"))
## [1] "There are 12422 probes"
print(paste("There are", ncol(data), "samples"))
## [1] "There are 50 samples"
For starters, cross-tabulate Genotype and BrainRegion.
addmargins(with(design, table(Genotype, BrainRegion)))
## BrainRegion
## Genotype hippocampus neocortex Sum
## S1P2_KO 10 10 20
## S1P3_KO 5 5 10
## Wild_type 10 10 20
## Sum 25 25 50
Then do some cross tabulation involving DateRun
addmargins(with(design, table(Genotype, DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 4 4 4 0 0 7
## S1P3_KO 7 0 0 0 0 0 0
## Wild_type 0 0 0 4 8 7 0
## Sum 7 4 4 8 8 7 7
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 1 20
## S1P3_KO 3 10
## Wild_type 1 20
## Sum 5 50
addmargins(with(design, table(BrainRegion, DateRun)))
## DateRun
## BrainRegion 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03
## hippocampus 4 2 2 4 4 3
## neocortex 3 2 2 4 4 4
## Sum 7 4 4 8 8 7
## DateRun
## BrainRegion 10/23/03 12/18/03 Sum
## hippocampus 4 2 25
## neocortex 3 3 25
## Sum 7 5 50
and, optionally, Sex to investigate if these “nuisance” factors are confounded with the experimental factors.
# Within the subset *hippocampus*
addmargins(with(subset(design, subset = BrainRegion == "hippocampus"), table(Genotype,
DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 2 2 2 0 0 4
## S1P3_KO 4 0 0 0 0 0 0
## Wild_type 0 0 0 2 4 3 0
## Sum 4 2 2 4 4 3 4
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 0 10
## S1P3_KO 1 5
## Wild_type 1 10
## Sum 2 25
# Within the subset *neocortex*
addmargins(with(subset(design, subset = BrainRegion == "neocortex"), table(Genotype,
DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 2 2 2 0 0 3
## S1P3_KO 3 0 0 0 0 0 0
## Wild_type 0 0 0 2 4 4 0
## Sum 3 2 2 4 4 4 3
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 1 10
## S1P3_KO 2 5
## Wild_type 0 10
## Sum 3 25
# Within the subset *neocortex* for *male*
addmargins(with(subset(subset(design, subset = BrainRegion == "neocortex"),
subset = Sex == "male"), table(Genotype, DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 2 0 0 0 0 2
## S1P3_KO 2 0 0 0 0 0 0
## Wild_type 0 0 0 0 4 1 0
## Sum 2 2 0 0 4 1 2
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 1 5
## S1P3_KO 0 2
## Wild_type 0 5
## Sum 1 12
# Within the subset *neocortex* for *female*
addmargins(with(subset(subset(design, subset = BrainRegion == "neocortex"),
subset = Sex == "female"), table(Genotype, DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 0 2 2 0 0 1
## S1P3_KO 1 0 0 0 0 0 0
## Wild_type 0 0 0 2 0 3 0
## Sum 1 0 2 4 0 3 1
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 0 5
## S1P3_KO 2 3
## Wild_type 0 5
## Sum 2 13
# Within the subset *hippocampus* for *male*
addmargins(with(subset(subset(design, subset = BrainRegion == "hippocampus"),
subset = Sex == "male"), table(Genotype, DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 2 0 0 0 0 3
## S1P3_KO 2 0 0 0 0 0 0
## Wild_type 0 0 0 0 4 0 0
## Sum 2 2 0 0 4 0 3
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 0 5
## S1P3_KO 0 2
## Wild_type 1 5
## Sum 1 12
# Within the subset *hippocampus* for *female*
addmargins(with(subset(subset(design, subset = BrainRegion == "hippocampus"),
subset = Sex == "female"), table(Genotype, DateRun)))
## DateRun
## Genotype 01/16/04 03/11/04 07/23/04 08/14/03 08/21/03 09/11/03 10/23/03
## S1P2_KO 0 0 2 2 0 0 1
## S1P3_KO 2 0 0 0 0 0 0
## Wild_type 0 0 0 2 0 3 0
## Sum 2 0 2 4 0 3 1
## DateRun
## Genotype 12/18/03 Sum
## S1P2_KO 0 5
## S1P3_KO 1 3
## Wild_type 0 5
## Sum 1 13
Make sure you've included counts for all 4 individual factors somewhere, even if it's just in the margins of a cross-tabulation.
How do you feel about the experimental design?
Replicates within the same group are done on different dates; this can cause batch effect.
There is only 2 replicates for male S1P3_KO and only 3 replicates for female S1P3_KO for each brain region. This can cause problem if one or a few of those samples turn out to be outliers. To aviod this problem, the sample number needs to be increased.
Use position, panels or facets, and color to convey the Genotype, BrainRegion, and Sex of the samples.
tall.data <- melt(mdata, id.vars = c("DateRun", "Genotype", "BrainRegion", "Sex"),
variable.name = "Probe", value.name = "Expression")
# making sure that the melt function worked as expected
nrow(mdata) * ncol(tdata) == nrow(tall.data)
## [1] TRUE
stripplot(Expression ~ Genotype | BrainRegion, tall.data, subset = Probe ==
"92482_at", groups = Sex, auto.key = TRUE, jitter.data = TRUE, type = c("p",
"a"))
aggregate(Expression ~ Genotype * BrainRegion * Sex, tall.data, subset = Probe ==
"92482_at", FUN = mean)
## Genotype BrainRegion Sex Expression
## 1 S1P2_KO hippocampus female 6.160
## 2 S1P3_KO hippocampus female 6.193
## 3 Wild_type hippocampus female 6.017
## 4 S1P2_KO neocortex female 6.498
## 5 S1P3_KO neocortex female 6.959
## 6 Wild_type neocortex female 6.272
## 7 S1P2_KO hippocampus male 6.352
## 8 S1P3_KO hippocampus male 6.628
## 9 Wild_type hippocampus male 6.058
## 10 S1P2_KO neocortex male 6.876
## 11 S1P3_KO neocortex male 6.967
## 12 Wild_type neocortex male 6.118
At the very least, order the samples by run date; within run date, you could also sort on other factors.
Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))
data.matrix <- as.matrix(data)
# Heatmap before sorted
heatmap.2(cor(data), Rowv = NA, Colv = NA, symm = T, trace = "none", dendrogram = "none",
col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations",
margins = c(20, 20))
htdata <- mdata
mdata.date <- htdata[order(as.Date(htdata$DateRun, format = "%m/%d/%Y")), ]
design.date <- mdata.date[, c("DateRun", "Genotype", "BrainRegion", "Sex")]
htdata.date <- mdata.date
htdata.date$DateRun <- NULL
htdata.date$Genotype <- NULL
htdata.date$BrainRegion <- NULL
htdata.date$Sex <- NULL
htdata.date <- as.matrix(htdata.date)
rownames(htdata.date) <- with(design.date, paste(Genotype, BrainRegion, Sex,
DateRun, sep = "_"))
htdata.date <- t(htdata.date)
# Heatmap after sorted by date
heatmap.2(cor(htdata.date), Rowv = NA, Colv = NA, symm = T, trace = "none",
dendrogram = "none", col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations sorted by DateRun",
margins = c(20, 20))
What do you see about batch effects, outliers, etc.? The sample GSM172976 seems to be a very obvious outlier. The batch effects are not obvious as the outlier but is still quite visiable (i.e. small blocks).
cdata <- data.frame(cor(data))
cdata.mean <- data.frame(AverageCorrelation = "Average Correlation", Sample = row.names(cdata),
mean = rowMeans(cdata))
bwplot(~mean, cdata.mean, xlab = "Average Correlation")
Try to go beyond merely “eyeballing” it, i.e. make a quantitative statement about how it “sticks out” from the other samples.
cdata.mean.sort <- cdata.mean[order(cdata.mean$mean), ]
mean(cdata.mean.sort$mean)
## [1] 0.9904
sd(cdata.mean.sort$mean)
## [1] 0.003993
(mean(cdata.mean.sort$mean) - min(cdata.mean.sort$mean))/sd(cdata.mean.sort$mean)
## [1] 6.335
data.outlier <- cdata.mean[which.min(cdata.mean$mean), c("Sample", "mean")]
The average correlation coefficient of sample GSM172976 is 6.3352 standard deviation away from mean.
Which group is the outlier in, in terms of Genotype, BrainRegion, and Sex?
# List sample group of the outlier
(design.outlier <- design[data.outlier$Sample, ])
## DateRun Genotype BrainRegion Sex
## GSM172976 01/16/04 S1P3_KO hippocampus female
# All the samples that are in the same group as the ourlier
(design.outlier.group <- subset((subset(subset(design, subset = BrainRegion ==
design.outlier$BrainRegion), subset = Sex == design.outlier$Sex)), subset = Genotype ==
design.outlier$Genotype))
## DateRun Genotype BrainRegion Sex
## GSM172974 12/18/03 S1P3_KO hippocampus female
## GSM172975 01/16/04 S1P3_KO hippocampus female
## GSM172976 01/16/04 S1P3_KO hippocampus female
Scatter plot these samples against each other and comment on the plot.
# Things are overplotted!
row.names(design.outlier.group)
## [1] "GSM172974" "GSM172975" "GSM172976"
data.outlier.group <- data[, row.names(design.outlier.group)]
splom(data.outlier.group)
Try to address overplotting!
# Now it's much better!
hexplom(data.outlier.group)
Make a pair of plots: each plot will show gene expression boxplots, one per sample. Identify the outlier sample with a different color, if you can. Comment on these plots, in general and with respect to the outlier.
# Normalize the data
ndata <- data.frame(normalize.quantiles(as.matrix(data), copy = TRUE))
colnames(ndata) <- colnames(data)
rownames(ndata) <- rownames(data)
tndata <- as.data.frame(t(ndata)) #Transpose the data frame
mndata <- cbind(design, tndata)
smndata <- mndata
smndata$Sample <- row.names(mndata)
tall.smndata <- melt(smndata, id.vars = c("Sample", "DateRun", "Genotype", "BrainRegion",
"Sex"), variable.name = "Probe", value.name = "Expression")
Here are the 2 plots:
# Data before normalization
smdata <- mdata
smdata$Sample <- row.names(mdata)
tall.smdata <- melt(smdata, id.vars = c("Sample", "DateRun", "Genotype", "BrainRegion",
"Sex"), variable.name = "Probe", value.name = "Expression")
bwplot(Expression ~ Sample, tall.smdata, scales = list(x = list(rot = 45)),
)
# Data after normalization
bwplot(Expression ~ Sample, tall.smndata, scales = list(x = list(rot = 45)),
)
With the quantile-normalized data, re-make the heatmap of the sample correlation matrix and re-compare the outlier to its experimental group.
Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))
ndata.matrix <- as.matrix(ndata)
htndata <- mndata
mndata.date <- htndata[order(as.Date(htndata$DateRun, format = "%m/%d/%Y")),
]
ndesign.date <- mndata.date[, c("DateRun", "Genotype", "BrainRegion", "Sex")]
htndata.date <- mndata.date
htndata.date$DateRun <- NULL
htndata.date$Genotype <- NULL
htndata.date$BrainRegion <- NULL
htndata.date$Sex <- NULL
htndata.date <- as.matrix(htndata.date)
rownames(htndata.date) <- with(ndesign.date, paste(Genotype, BrainRegion, Sex,
DateRun, sep = "_"))
htndata.date <- t(htndata.date)
heatmap.2(cor(htndata.date), Rowv = NA, Colv = NA, symm = T, trace = "none",
dendrogram = "none", col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations after normalization sorted by DateRun",
margins = c(20, 20))
Is everything OK now? No, normlization did not help fix the outlier.
row.names(design.outlier)
## [1] "GSM172976"
ordata <- data
colnames(ordata)
## [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
## [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975" "GSM172976"
ordata[, row.names(design.outlier)] <- NULL
colnames(ordata)
## [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
## [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975"
ordesign <- design
rownames(ordesign)
## [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
## [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975" "GSM172976"
ordesign <- ordesign[!(row.names(ordesign) == row.names(design.outlier)), ]
rownames(ordesign)
## [1] "GSM172927" "GSM172928" "GSM172929" "GSM172930" "GSM172931"
## [6] "GSM172932" "GSM172933" "GSM172934" "GSM172935" "GSM172936"
## [11] "GSM172937" "GSM172938" "GSM172939" "GSM172940" "GSM172941"
## [16] "GSM172942" "GSM172943" "GSM172944" "GSM172945" "GSM172946"
## [21] "GSM172947" "GSM172948" "GSM172949" "GSM172950" "GSM172951"
## [26] "GSM172952" "GSM172953" "GSM172954" "GSM172955" "GSM172956"
## [31] "GSM172957" "GSM172958" "GSM172959" "GSM172960" "GSM172961"
## [36] "GSM172962" "GSM172963" "GSM172964" "GSM172965" "GSM172966"
## [41] "GSM172967" "GSM172968" "GSM172969" "GSM172970" "GSM172971"
## [46] "GSM172972" "GSM172973" "GSM172974" "GSM172975"
library(preprocessCore)
nordata <- data.frame(normalize.quantiles(as.matrix(ordata), copy = TRUE))
colnames(nordata) <- colnames(ordata)
rownames(nordata) <- rownames(ordata)
tnordata <- as.data.frame(t(nordata))
The data set after quantile normalization and outlier removal is now tnordata
Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))
nordata.matrix <- as.matrix(nordata)
mnordata <- cbind(ordesign, tnordata)
smnordata <- mnordata
smnordata$Sample <- row.names(mnordata)
htnordata <- mnordata
mnordata.date <- htnordata[order(as.Date(htnordata$DateRun, format = "%m/%d/%Y")),
]
nordesign.date <- mnordata.date[, c("DateRun", "Genotype", "BrainRegion", "Sex")]
htnordata.date <- mnordata.date
htnordata.date$DateRun <- NULL
htnordata.date$Genotype <- NULL
htnordata.date$BrainRegion <- NULL
htnordata.date$Sex <- NULL
htnordata.date <- as.matrix(htnordata.date)
rownames(htnordata.date) <- with(nordesign.date, paste(Genotype, BrainRegion,
Sex, DateRun, sep = "_"))
htnordata.date <- t(htnordata.date)
heatmap.2(cor(htnordata.date), Rowv = NA, Colv = NA, symm = T, trace = "none",
dendrogram = "none", col = Grays, cexCol = 1, cexRow = 1, main = "sample-to-sample correlations after normalization sorted by DateRun",
margins = c(20, 20))
The data looks much normal now!
mnordata <- cbind(ordesign, tnordata)
smnordata <- mnordata
smnordata$Sample <- row.names(mnordata)
tall.smnordata <- melt(smnordata, id.vars = c("Sample", "DateRun", "Genotype",
"BrainRegion", "Sex"), variable.name = "Probe", value.name = "Expression")
bwplot(Expression ~ Sample, tall.smnordata, scales = list(x = list(rot = 45)),
)
How many samples remain?
There are 49 samples remain.
You should be using data with the worst outlier removed, after quantile normalization and, for this question, restricted to neocortex.
identical(rownames(ordesign), colnames(nordata))
## [1] TRUE
neoordesign <- subset(ordesign, BrainRegion == "neocortex")
neoordesign$Genotype <- factor(neoordesign$Genotype, levels = c("Wild_type",
"S1P2_KO", "S1P3_KO"))
neonordata <- subset(nordata, select = ordesign$BrainRegion == "neocortex")
# Design Matrix
neoordesign.m <- model.matrix(~Genotype, neoordesign)
neoorFit <- lmFit(neonordata, neoordesign.m)
neoorEbFit <- eBayes(neoorFit)
I used limma to fit a linear model in a one-way ANOVA style
Display the expression data for the top 50 probes in a heat map. Order the probes by p-value; order the samples by genotype.
colnames(coef(neoorEbFit))
## [1] "(Intercept)" "GenotypeS1P2_KO" "GenotypeS1P3_KO"
neoorHits <- topTable(neoorEbFit, coef = grep("Genotype", colnames(coef(neoorEbFit))),
number = Inf)
neoorHits50 <- topTable(neoorEbFit, coef = grep("Genotype", colnames(coef(neoorEbFit))),
number = 50)
tneonordata <- as.data.frame(t(neonordata))
HneoorHits <- tneonordata[, rownames(neoorHits50), drop = FALSE]
HneoorHits <- as.matrix(HneoorHits)
rownames(HneoorHits) <- with(neoordesign, paste(rownames(neoordesign), Genotype,
sep = "_"))
GraysR <- colorRampPalette(brewer.pal(n = 9, "Greys"))
heatmap.2(HneoorHits, Rowv = NA, Colv = NA, symm = T, trace = "none", dendrogram = "none",
col = GraysR, cexCol = 1, cexRow = 1, main = "Top 50 probes", margins = c(6,
10))
neoorHits1E3 <- data.frame(PValueCutOff = neoorHits$P.Value < 0.001)
addmargins(with(neoorHits1E3, table(PValueCutOff)))
## PValueCutOff
## FALSE TRUE Sum
## 12326 96 12422
How many probes have unadjusted p-value < 1e-3? There are 96 probes have unadjusted p-value < 1e-3
If we took the top 50 probes as our “hits”, what is the (estimated) false discovery rate?
neoorHits50[nrow(neoorHits50), "P.Value"]
The p-value of the 50th probe from the top 50 list is 3.0565 × 10-4.
How many of these hits do we expect to be false discoveries?
FD <- nrow(neoorHits50) * neoorHits50[nrow(neoorHits50), "P.Value"]
We will expect 0.0153 of these hits to be false
neoorHits3 <- topTable(neoorEbFit, coef = grep("Genotype", colnames(coef(neoorEbFit))),
number = 3)
neoorBoring <- neoorHits[nrow(neoorHits):(nrow(neoorHits) - 2), ]
neoorExp <- rbind(neoorBoring, neoorHits3)
xyneoorHits <- tneonordata[, rownames(neoorExp), drop = FALSE]
xyneoorHits <- cbind(neoordesign, xyneoorHits)
xyneoorHits <- melt(xyneoorHits, id.vars = c("DateRun", "Genotype", "BrainRegion",
"Sex"), variable.name = "Probe", value.name = "Expression")
stripplot(Expression ~ Genotype | Probe, xyneoorHits, jitter.data = TRUE, auto.key = TRUE,
type = c("p", "a"), grid = TRUE)
What are the p-values associated with these probes? Comment on the plots.
neoorExp[, "P.Value", drop = FALSE]
## P.Value
## 162455_f_at 9.999e-01
## 94211_at 9.996e-01
## 160994_at 9.996e-01
## 98590_at 3.091e-07
## 160832_at 1.840e-06
## 92482_at 2.120e-06
The Hits p value are very small and the non-Hits p value are very close to 1
neoorHitsS1P3 <- topTable(neoorEbFit, coef = "GenotypeS1P3_KO", number = Inf)
neoorHitsS1P3.0.1 <- data.frame(PValueCutOff = neoorHits$P.Value < 0.1)
addmargins(with(neoorHitsS1P3.0.1, table(PValueCutOff)))
## PValueCutOff
## FALSE TRUE Sum
## 10263 2159 12422
How many hits do you find if you control FDR at 0.10?
There are 2159 probes if I control FDR at 0.10
You should be using data with the worst outlier removed, after quantile normalization and for both brain regions.
ordesign$Genotype <- factor(ordesign$Genotype, levels = c("Wild_type", "S1P2_KO",
"S1P3_KO"))
ordesign.m <- model.matrix(~Genotype * BrainRegion, ordesign)
orFit <- lmFit(nordata, ordesign.m)
orEbFit <- eBayes(orFit)
Test for any effect of Genotype and/or BrainRegion, i.e. test your model against a model with just an intercept.
colnames(coef(orEbFit))
## [1] "(Intercept)"
## [2] "GenotypeS1P2_KO"
## [3] "GenotypeS1P3_KO"
## [4] "BrainRegionneocortex"
## [5] "GenotypeS1P2_KO:BrainRegionneocortex"
## [6] "GenotypeS1P3_KO:BrainRegionneocortex"
orHits <- topTable(orEbFit, p.value = 0.001, number = Inf)
nrow(orHits) == nrow(nordata)
## [1] TRUE
How many probes have a BH-adjusted p-value, a.k.a. q-value, less than 1e-3? All of the probes have BH-adjusted p-value less than 1e-3
orHitsBrainRegion <- topTable(orEbFit, coef = grep("BrainRegion", colnames(coef(orEbFit))),
p.value = 0.1, number = Inf)
How many probes have a BH-adjusted p-value less than 0.1?
nrow(orHitsBrainRegion)
There are 3227 probes have a BH-adjusted p-value less than 0.1.
Using the results from Q5b, plot and describe some results for a couple of hits and non-hits.
orHitsBrainRegion3 <- topTable(orEbFit, coef = grep("BrainRegion", colnames(coef(orEbFit))),
number = 3)
orHitsBrainRegionBoring <- orHitsBrainRegion[nrow(orHitsBrainRegion):(nrow(orHitsBrainRegion) -
2), ]
orHitsBrainRegionExp <- rbind(orHitsBrainRegionBoring, orHitsBrainRegion3)
xyneoorHitsBrainRegion <- tnordata[, rownames(orHitsBrainRegionExp), drop = FALSE]
xyneoorHitsBrainRegion <- cbind(ordesign, xyneoorHitsBrainRegion)
xyneoorHitsBrainRegion <- melt(xyneoorHitsBrainRegion, id.vars = c("DateRun",
"Genotype", "BrainRegion", "Sex"), variable.name = "Probe", value.name = "Expression")
stripplot(Expression ~ Genotype | Probe, xyneoorHitsBrainRegion, jitter.data = TRUE,
groups = BrainRegion, auto.key = TRUE, type = c("p", "a"), grid = TRUE)
orHitsGenotype <- topTable(orEbFit, coef = grep("Genotype", colnames(coef(orEbFit))),
p.value = 0.1, number = Inf)
How many probes have a BH-adjusted p-value less than 0.1?
nrow(orHitsGenotype)
There are 141 probes have a BH-adjusted p-value less than 0.1.
Compare these results to those obtained for BrainRegion in Q5b, either based on this count or based on all the p-values.
nrow(orHitsBrainRegion) > nrow(orHitsGenotype)
## [1] TRUE
There are more hits in the results from BrainRegion than from Genotype when p-value is set to 0.1
What do you conclude about the relative magnitude of the influence of brain region vs. genotype on gene expression?
The BrainRegion has a bigger effect on gene expression than Genotype