library(devtools)
## Warning: package 'devtools' was built under R version 3.2.3
install_github("genomicsclass/GSE5859Subset")
## Skipping install for github remote, the SHA1 (8ada5f4f) has not changed since last install.
## Use `force = TRUE` to force installation
library(GSE5859Subset)
data(GSE5859Subset)
High-throughput technologies measure numerous features including genes, single base locations of the genome, transcripts, binding sites, CpG sites, SNPs, genomic regions and image pixel intensities. Each measurement product is defined by a set of features. i.e. A gene expression microarray product is defined by the set of genes that it measures.
So a high-throughput experiment is usually defined by three tables: one with the high-throughput measurements and two tables with information about the columns and rows of this first table respectively.
dim(geneExpression)
## [1] 8793 24
RNA expression measurements for 8793 genes from blood taken from 24 individuals (the experimental units)
head(sampleInfo)
## ethnicity date filename group
## 107 ASN 2005-06-23 GSM136508.CEL.gz 1
## 122 ASN 2005-06-27 GSM136530.CEL.gz 1
## 113 ASN 2005-06-27 GSM136517.CEL.gz 1
## 163 ASN 2005-10-28 GSM136576.CEL.gz 1
## 153 ASN 2005-10-07 GSM136566.CEL.gz 1
## 161 ASN 2005-10-07 GSM136574.CEL.gz 1
sampleInfo$group
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
match(sampleInfo$filename, colnames(geneExpression))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24
One of the columns, filenames, permits us to connect the rows of this table to the columns of the measurement table.
dim(geneAnnotation)
## [1] 8793 4
head(geneAnnotation)
## PROBEID CHR CHRLOC SYMBOL
## 1 1007_s_at chr6 30852327 DDR1
## 30 1053_at chr7 -73645832 RFC2
## 31 117_at chr1 161494036 HSPA6
## 32 121_at chr2 -113973574 PAX8
## 33 1255_g_at chr6 42123144 GUCA1A
## 34 1294_at chr3 -49842638 UBA7
A table describing the features: chromosome location, gene name.
tail(match(geneAnnotation$PROBEID, rownames(geneExpression)))
## [1] 8788 8789 8790 8791 8792 8793
ProbeIDs connect rows of feature table and rows of measurement table.
How many samples where processed on 2005-06-27?
unique(sampleInfo$date)
## [1] "2005-06-23" "2005-06-27" "2005-10-28" "2005-10-07" "2005-06-10"
sum(sampleInfo$date=="2005-06-27")
## [1] 5
Question: How many of the genes represented in this particular technology are on chromosome Y?
sum(geneAnnotation$CHR=="chrY", na.rm = T)
## [1] 21
What is the log expression value of the for gene ARPC1A on the one subject that we measured on 2005-06-10 ?
symbol <- which(geneAnnotation$SYMBOL=="ARPC1A")
date <- which(sampleInfo$date=="2005-06-10")
geneExpression[symbol, date]
## [1] 8.233599
P-values are random variables.Consider a case where p-value is defined from a t-test with a large enough sample size to use the CLT approximation (the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution). Then our p-value is defined as the probability that a normally distributed random variable is larger, in absolute value, than the observed t-test, call it ZZ. So for a two sided test the p-value is: \[ p=2(1-\Phi(Z)) \]
Z is a random variable, \(\Phi\) is a deterministic function (A function has a unique value for any input in its domain). So p is a random variable.
2*(1-pnorm(X))
Create a Monte Carlo simulation to show how p changes.
set.seed(1)
population = unlist( read.csv("femaleControlsPopulation.csv") )
N <- 12
B <- 10000
pvals <- replicate(B,{
control = sample(population,N)
treatment = sample(population,N)
t.test(treatment,control)$p.val
})
hist(pvals)
P-value is uniformly distributed. For the case using CLT, the null hypothesis \(H_0\) implies that test statistic Z follows a normal distribution with mean 0 and SD 1: \[ p_a = Pr(Z<a|H_0) = \Phi(a) \]
This implies that: \[ Pr(p<p_a) = Pr[\Phi(p)<\Phi(p_a)] = Pr(Z<a) =p_a \]
which is the definition of a uniform distribution.
To find genes statistically significant
g <- sampleInfo$group
myttest <- function(x) t.test(x[g==1],x[g==0],var.equal=TRUE)$p.value
pvals <- apply(geneExpression,1,myttest)
library(genefilter)
## Warning: package 'genefilter' was built under R version 3.2.3
## Warning in .recacheSubclasses(def@className, def, doSubclasses, env):
## undefined subclass "externalRefMethod" of class "expressionORfunction";
## definition not updated
## Warning in .recacheSubclasses(def@className, def, doSubclasses, env):
## undefined subclass "externalRefMethod" of class "functionORNULL";
## definition not updated
##
## Attaching package: 'genefilter'
## The following object is masked from 'package:base':
##
## anyNA
results <- rowttests(geneExpression, factor(sampleInfo$group))
max(abs(pvals - results$p.value))
## [1] 6.528111e-14
set.seed(1)
population <- read.csv("femaleControlsPopulation.csv")
pvals <- replicate(1000,{
control = sample(population[,1],12)
treatment = sample(population[,1],12)
t.test(treatment,control)$p.val
})
What proportion of the p-values is below 0.05?
mean(pvals < 0.05)
## [1] 0.041
What proportion of the p-values is below 0.01?
mean(pvals < 0.01)
## [1] 0.008
Assume you are testing the effectiveness of 20 diets on mice weight. For each of the 20 diets, you run an experiment with 10 control mice and 10 treated mice. Assume the null hypothesis, that the diet has no effect, is true for all 20 diets and that mice weights follow a normal distribution, with mean 30 grams and a standard deviation of 2 grams. Run a Monte Carlo simulation for one of these studies:
r cases = rnorm(10,30,2)
controls = rnorm(10,30,2)
t.test(cases,controls)
Question: Now run a Monte Carlo simulation imitating the results for the experiment for all 20 diets. If you set the seed at 100, set.seed(100), how many of p-values are below 0.05?
set.seed(100)
pvals<- replicate(20,{
cases = rnorm(10,30,2)
controls = rnorm(10,30,2)
t.test(cases,controls)$p.value
})
sum(pvals<=0.05)
## [1] 1
Now create a simulation to learn about the distribution of the number of p-values that are less than 0.05. In question 3, we ran the 20 diet experiment once. Now we will run the experiment 1,000 times and each time save the number of p-values that are less than 0.05.
set.seed(100)
B = 1000
plessthan = replicate(B,{
pvals = replicate(20,{
cases = rnorm(10,30,2)
controls = rnorm(10,30,2)
t.test(cases,controls)$p.value
})
sum(pvals<=0.05)
})
table(plessthan)
## plessthan
## 0 1 2 3 4 6
## 354 374 193 71 7 1
mean(plessthan)
## [1] 1.007
Set the seed at 100, set.seed(100), run the code from Question 3 1,000 times, and save the number of times the p-value is less than 0.05 for each of the 1,000 instances. What is the average of these numbers? This is the expected number of tests (out of the 20 we run) that we will reject when the null is true.
What this says is that on average, we expect some p-value to be 0.05 even when the null is true for all diets. Use the same simulation data and report for what percent of the 1,000 replications did we make more than 0 false positives?
mean(plessthan>0)
## [1] 0.646