Basic inference

Load data

library(devtools)

## Warning: package 'devtools' was built under R version 3.2.3

install_github("genomicsclass/GSE5859Subset")

## Skipping install for github remote, the SHA1 (8ada5f4f) has not changed since last install.
##   Use `force = TRUE` to force installation

library(GSE5859Subset)
data(GSE5859Subset)

High-throughput technologies measure numerous features including genes, single base locations of the genome, transcripts, binding sites, CpG sites, SNPs, genomic regions and image pixel intensities. Each measurement product is defined by a set of features. i.e. A gene expression microarray product is defined by the set of genes that it measures.

So a high-throughput experiment is usually defined by three tables: one with the high-throughput measurements and two tables with information about the columns and rows of this first table respectively.

dim(geneExpression)

## [1] 8793   24

RNA expression measurements for 8793 genes from blood taken from 24 individuals (the experimental units)

head(sampleInfo)

##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

sampleInfo$group

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

match(sampleInfo$filename, colnames(geneExpression))

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24

One of the columns, filenames, permits us to connect the rows of this table to the columns of the measurement table.

dim(geneAnnotation)

## [1] 8793    4

head(geneAnnotation)

##      PROBEID  CHR     CHRLOC SYMBOL
## 1  1007_s_at chr6   30852327   DDR1
## 30   1053_at chr7  -73645832   RFC2
## 31    117_at chr1  161494036  HSPA6
## 32    121_at chr2 -113973574   PAX8
## 33 1255_g_at chr6   42123144 GUCA1A
## 34   1294_at chr3  -49842638   UBA7

A table describing the features: chromosome location, gene name.

tail(match(geneAnnotation$PROBEID, rownames(geneExpression)))

## [1] 8788 8789 8790 8791 8792 8793

ProbeIDs connect rows of feature table and rows of measurement table.

Exercises

How many samples where processed on 2005-06-27?

unique(sampleInfo$date)

## [1] "2005-06-23" "2005-06-27" "2005-10-28" "2005-10-07" "2005-06-10"

sum(sampleInfo$date=="2005-06-27")

## [1] 5

Question: How many of the genes represented in this particular technology are on chromosome Y?

sum(geneAnnotation$CHR=="chrY", na.rm = T)

## [1] 21

What is the log expression value of the for gene ARPC1A on the one subject that we measured on 2005-06-10 ?

symbol <- which(geneAnnotation$SYMBOL=="ARPC1A")
date <- which(sampleInfo$date=="2005-06-10")
geneExpression[symbol, date]

## [1] 8.233599

P-values

P-values are random variables.Consider a case where p-value is defined from a t-test with a large enough sample size to use the CLT approximation (the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution). Then our p-value is defined as the probability that a normally distributed random variable is larger, in absolute value, than the observed t-test, call it ZZ. So for a two sided test the p-value is: \[ p=2(1-\Phi(Z)) \]

Z is a random variable, \(\Phi\) is a deterministic function (A function has a unique value for any input in its domain). So p is a random variable.

2*(1-pnorm(X))

Create a Monte Carlo simulation to show how p changes.

set.seed(1)
population = unlist( read.csv("femaleControlsPopulation.csv") )
N <- 12
B <- 10000
pvals <- replicate(B,{
  control = sample(population,N)
  treatment = sample(population,N)
  t.test(treatment,control)$p.val 
  })
hist(pvals)

P-value is uniformly distributed. For the case using CLT, the null hypothesis \(H_0\) implies that test statistic Z follows a normal distribution with mean 0 and SD 1: \[ p_a = Pr(Z<a|H_0) = \Phi(a) \]

This implies that: \[ Pr(p<p_a) = Pr[\Phi(p)<\Phi(p_a)] = Pr(Z<a) =p_a \]

which is the definition of a uniform distribution.

t-test

To find genes statistically significant

g <- sampleInfo$group
myttest <- function(x) t.test(x[g==1],x[g==0],var.equal=TRUE)$p.value
pvals <- apply(geneExpression,1,myttest)

library(genefilter)

## Warning: package 'genefilter' was built under R version 3.2.3

## Warning in .recacheSubclasses(def@className, def, doSubclasses, env):
## undefined subclass "externalRefMethod" of class "expressionORfunction";
## definition not updated

## Warning in .recacheSubclasses(def@className, def, doSubclasses, env):
## undefined subclass "externalRefMethod" of class "functionORNULL";
## definition not updated

## 
## Attaching package: 'genefilter'

## The following object is masked from 'package:base':
## 
##     anyNA

results <- rowttests(geneExpression, factor(sampleInfo$group))
max(abs(pvals - results$p.value))

## [1] 6.528111e-14

Exercise

set.seed(1)
population <- read.csv("femaleControlsPopulation.csv")
pvals <- replicate(1000,{
  control = sample(population[,1],12)
  treatment = sample(population[,1],12)
  t.test(treatment,control)$p.val
})

What proportion of the p-values is below 0.05?

mean(pvals < 0.05)

## [1] 0.041

What proportion of the p-values is below 0.01?

mean(pvals < 0.01)

## [1] 0.008

Assume you are testing the effectiveness of 20 diets on mice weight. For each of the 20 diets, you run an experiment with 10 control mice and 10 treated mice. Assume the null hypothesis, that the diet has no effect, is true for all 20 diets and that mice weights follow a normal distribution, with mean 30 grams and a standard deviation of 2 grams. Run a Monte Carlo simulation for one of these studies:

r cases = rnorm(10,30,2) 
controls = rnorm(10,30,2) 
t.test(cases,controls)

Question: Now run a Monte Carlo simulation imitating the results for the experiment for all 20 diets. If you set the seed at 100, set.seed(100), how many of p-values are below 0.05?

set.seed(100)
pvals<- replicate(20,{
  cases = rnorm(10,30,2)
  controls = rnorm(10,30,2)
  t.test(cases,controls)$p.value
})
sum(pvals<=0.05)

## [1] 1

Now create a simulation to learn about the distribution of the number of p-values that are less than 0.05. In question 3, we ran the 20 diet experiment once. Now we will run the experiment 1,000 times and each time save the number of p-values that are less than 0.05.

set.seed(100)
B = 1000
plessthan = replicate(B,{
  pvals = replicate(20,{
    cases = rnorm(10,30,2)
    controls = rnorm(10,30,2)
    t.test(cases,controls)$p.value
    })
  sum(pvals<=0.05)
})
table(plessthan)

## plessthan
##   0   1   2   3   4   6 
## 354 374 193  71   7   1

mean(plessthan)

## [1] 1.007

Set the seed at 100, set.seed(100), run the code from Question 3 1,000 times, and save the number of times the p-value is less than 0.05 for each of the 1,000 instances. What is the average of these numbers? This is the expected number of tests (out of the 20 we run) that we will reject when the null is true.

What this says is that on average, we expect some p-value to be 0.05 even when the null is true for all diets. Use the same simulation data and report for what percent of the 1,000 replications did we make more than 0 false positives?

mean(plessthan>0)

## [1] 0.646