In the last module, the Type I error rate \(\alpha\) was discussed, i.e. the probability of rejecting the null hypothesis when it’s true. The hypothesis test was structured so that the probability of Type I error is small.
The other possible error is to fail to reject when the alternative is true, i.e. the Type II error. Power is defined as the probability of correctly rejecting \(H_0\), i.e. as one minus the probability of a Type II error. Ideally, a hypothesis test should have large power (near 1). \[ \begin{align} \text{Power} & = \text{probability we correctly reject } H_0 \\& = P\left( \text{reject } H_0 \mid H_{a} \right) \\& = 1 - P\left( \text{don't reject } H_0 \mid H_{a} \right) \\& = 1 - P\left( \text{Type II error} \right) \\& = 1 - \beta \end{align} \]
We don’t have as much control over this probability, since our flexibility was already spent guaranteeing that the Type I error rate, i.e. \(\alpha\), is small.
One way to control power is at the design phase, by picking a sample size large enough that we would be likely to reject if the alternative were true. Thus the most frequent use of power is to help design studies.
Consider testing \(H_0: \mu = 30\) against \(H_a: \mu > 30\). Assume \(\bar{X}\) is normally distributed and \(\sigma\) is known. Then, we reject if \(\frac{\bar{X}-30}{\sigma/\sqrt{n}} > z_{1-\alpha}\)
Equivalently, we reject if
\[ \bar{X} > 30 + z_{1-\alpha} \frac{\sigma}{\sqrt{n}} \] Under \(H_{0}\): \(\bar{X} \sim \text{Normal}\left(\mu_0, \frac{\sigma^2}{n} \right)\)
Under \(H_{a}\): \(\bar{X} \sim \text{Normal}\left(\mu_{a}, \frac{\sigma^2}{n} \right)\)
mu0 <- 30      # mean under the null hypothesis
mua <- 32      # mean under the alternative
sigma <- 4
n <- 16
alpha <- 0.05
z <- qnorm(1 - alpha)
# P(reject | H0) -- by construction this equals alpha
pnorm(mu0 + z * sigma/sqrt(n), mean = mu0, sd = sigma/sqrt(n), lower.tail = FALSE)
## [1] 0.05
# P(reject | Ha) -- the power
pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALSE)
## [1] 0.63876
library(ggplot2)
nvals <- 2^(3:7)                 # sample sizes 8, 16, 32, 64, 128
mus <- seq(30, 35, by = 0.01)    # grid of alternative means
sigma <- 4
alpha <- 0.05
z <- qnorm(1 - alpha)
r <- NULL
for(n in nvals){
  # power at each alternative mean, for this sample size
  pws <- pnorm(mu0 + z * sigma/sqrt(n), mean = mus, sd = sigma/sqrt(n), lower.tail = FALSE)
  r <- cbind(r, pws)
}
colnames(r) <- paste('n=', nvals, sep = '')
df <- stack(as.data.frame(r))
df$x <- rep(mus, times = length(nvals))
ggplot(df, aes(x = x, y = values, group = ind, colour = ind)) + geom_line()
When testing \(H_{a}: \mu > \mu_0\): \[ \begin{align} 1 - \beta = P \left( \bar{X} > \mu_0 + z_{1-\alpha}\frac{\sigma}{\sqrt{n}} \mid \mu = \mu_{a} \right) \end{align} \] where the unknowns are \(\mu_a\), \(\sigma\), \(n\), and \(\beta\), while \(\mu_0\) and \(\alpha\) are fixed by the test. Specify any 3 of the unknowns and you can solve for the remainder.
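For example, solving the display above for \(n\) at target power \(1-\beta\) gives \(n = \left(\frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\mu_a - \mu_0}\right)^2\). A minimal sketch reusing the values defined earlier; since this ignores the \(t\) correction, it slightly underestimates the power.t.test answer shown further below:

beta <- 0.2                               # target power 1 - beta = 0.8
n_needed <- ((qnorm(1 - alpha) + qnorm(1 - beta)) * sigma / (mua - mu0))^2
ceiling(n_needed)                         # about 25; compare 26.14 from power.t.test below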
When \(\sigma\) is unknown and the test statistic uses the sample standard deviation \(S\), the power is \[ \begin{align} P \left( \frac{\bar{X}-\mu_0}{S/\sqrt{n}} > t_{1-\alpha,\, n-1 } \mid \mu = \mu_{a} \right) \end{align} \] Calculating this requires the non-central \(t\) distribution.
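This calculation can be done by hand with pt and its ncp (noncentrality) argument; a minimal sketch reusing the values defined earlier, which agrees with the power.t.test result shown below (about 0.604):

n <- 16
ncp <- sqrt(n) * (mua - mu0) / sigma     # noncentrality parameter, here 2
# P(T > t_{1-alpha, n-1}) when T is noncentral t with this ncp
pt(qt(1 - alpha, df = n - 1), df = n - 1, ncp = ncp, lower.tail = FALSE)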
power.t.test does this very well:

* delta specifies the difference in means, \(\mu_a - \mu_0\)
* type specifies that it's a one-sample \(t\) test
* alt specifies the alternative, in this case a one-sided test, i.e. \(H_{a} : \mu > \mu_0\)

All the following examples compute the power and produce the same result, since \(\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}\) is the same in all cases:
power.t.test(n = 16, delta = 2, sd = 4, type = 'one.sample', alt = 'one.sided')$power
## [1] 0.6040329
power.t.test(n = 16, delta = 2/4, sd = 1, type = 'one.sample', alt = 'one.sided')$power
## [1] 0.6040329
power.t.test(n = 16, delta = 100, sd = 200, type = 'one.sample', alt = 'one.sided')$power
## [1] 0.6040329
All the following examples compute the required sample size and produce the same result, since the standardized effect size \(\frac{\mu_a - \mu_0}{\sigma}\) is the same in all cases:
power.t.test(power = 0.8, delta = 2/4, sd = 1, type = 'one.sample', alt = 'one.sided')$n
## [1] 26.13751
power.t.test(power = 0.8, delta = 2, sd = 4, type = 'one.sample', alt = 'one.sided')$n
## [1] 26.13751
power.t.test(power = 0.8, delta = 100, sd = 200, type = 'one.sample', alt = 'one.sided')$n
## [1] 26.13751
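Consistent with "specify any 3 of the unknowns," power.t.test can also solve for the detectable difference: fix n, power, and sd, leave delta unspecified, and it returns the smallest mean difference detectable at that power. A minimal example:

power.t.test(n = 16, power = 0.8, sd = 4, type = 'one.sample', alt = 'one.sided')$delta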
Key ideas:

* Hypothesis testing/significance analysis is commonly overused
* Correcting for multiple testing avoids false positives, or false discoveries

Two key components: an error measure, and a correction or procedure that controls it.

Three eras of statistics:
1. The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
2. The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple — Is treatment A better than treatment B? — but the new methods were suited to the kinds of small data sets individual scientists might collect.
3. The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind.
Suppose you are testing a hypothesis that a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes:
|  | \(\beta = 0\) | \(\beta \neq 0\) | Hypotheses |
|---|---|---|---|
| Claim \(\beta = 0\) | \(U\) | \(T\) | \(m-R\) |
| Claim \(\beta \neq 0\) | \(V\) | \(S\) | \(R\) |
| Claims | \(m_0\) | \(m - m_0\) | \(m\) |
Type I error or false positive (\(V\)): deciding that the parameter does not equal zero when it actually does.
Type II error or false negative (\(T\)): deciding that the parameter equals zero when it actually does not.
False positive rate: the rate at which false results are called significant, \(\mathbb{E}\left[ \frac{V}{m_0} \right]\).
Family-wise error rate (FWER): the probability of at least one false positive, \(P\left( V \geq 1 \right)\).
False discovery rate (FDR): the rate at which claims of significance are false, \(\mathbb{E}\left[\frac{V}{R} \right]\).
If \(p\)-values are correctly calculated, calling all \(p < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.
Problem: suppose you perform 10000 tests and \(\beta = 0\) for all of them, and you call \(p < 0.05\) significant. The expected number of false positives is \(10000 \times 0.05 = 500\). How do we avoid so many false positives?
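A quick simulation of that expectation, using the fact that correctly calculated \(p\)-values are uniform on \([0,1]\) under the null:

set.seed(1)
m <- 10000
p <- runif(m)        # under the null, p-values are uniform on [0, 1]
sum(p < 0.05)        # close to the expected 10000 * 0.05 = 500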
The Bonferroni correction is the oldest multiple testing correction.

Basic idea: to control the FWER at level \(\alpha\) across \(m\) tests, compute the \(p\)-values as usual but call a test significant only if its \(p\)-value is below \(\alpha / m\).

Pros: easy to calculate, conservative.

Cons: may be very conservative.
To control the false discovery rate instead, the Benjamini–Hochberg (BH) procedure can be used.

Basic idea: to control the FDR at level \(\alpha\) across \(m\) tests, order the \(p\)-values \(p_{(1)} \leq \dots \leq p_{(m)}\) and call significant all tests up to the largest \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\).

Pros: easy to calculate, less conservative.

Cons: allows for more false positives, may behave strangely under dependence.
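To make the step-up rule concrete, here is a small hand-rolled sketch; bh_reject is a hypothetical helper written for illustration, whereas p.adjust(..., method = 'BH'), used in the case studies below, is the standard tool:

# Hypothetical helper implementing the BH step-up rule directly
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  o <- order(p)                                  # indices of p-values, smallest first
  below <- which(p[o] <= alpha * seq_len(m) / m)
  k <- if(length(below) > 0) max(below) else 0   # largest k with p_(k) <= alpha * k / m
  reject <- rep(FALSE, m)
  if(k > 0) reject[o[seq_len(k)]] <- TRUE        # reject the k smallest p-values
  reject
}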
One approach is therefore to adjust the threshold \(\alpha\), as Bonferroni does. A different approach is to calculate adjusted \(p\)-values, which can be compared directly against the original \(\alpha\); this is what p.adjust does in the case studies below.
Case study I: No true positives
set.seed(42)
n = 1000
alpha = 0.05
pValues <- rep(NA, n)
for(i in 1:n){
  # x and y are independent, so the true slope is zero in every test
  y <- rnorm(20)
  x <- rnorm(20)
  pValues[i] <- summary(lm(y ~ x))$coef[2, 4]   # p-value for the slope
}
# number of naive rejections at level alpha
sum(pValues < alpha)
## [1] 57
# controls FWER
sum(p.adjust(pValues, method = 'bonferroni') < alpha )
## [1] 0
# controls FDR
sum(p.adjust(pValues, method = 'BH') < alpha )
## [1] 0
Case study II: 50% true positives

set.seed(42)
n = 1000
alpha = 0.05
pValues <- rep(NA, n)
for(i in 1:n){
  x <- rnorm(20)
  if(i <= 500){
    y <- rnorm(20)               # first half: no relationship
  } else {
    y <- rnorm(20, mean = 2*x)   # second half: true slope of 2
  }
  pValues[i] <- summary(lm(y ~ x))$coef[2, 4]
}
trueStatus <- rep(c('zero', 'not zero'), each = 500)
table(pValues < alpha, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 467
## TRUE 500 33
# controls FWER
table(p.adjust(pValues, method = 'bonferroni') < alpha , trueStatus)
## trueStatus
## not zero zero
## FALSE 29 500
## TRUE 471 0
# controls FDR
table(p.adjust(pValues, method = 'BH') < alpha , trueStatus)
## trueStatus
## not zero zero
## FALSE 0 483
## TRUE 500 17
par(mfrow = c(1,2))
# adjusted versus raw p-values for each correction
plot(pValues, p.adjust(pValues, method = 'bonferroni'), pch = 19)
plot(pValues, p.adjust(pValues, method = 'BH'), pch = 19)
The bootstrap is a very useful tool for constructing confidence intervals and calculating standard errors for difficult statistics.
Question: How would one derive a confidence interval for the mean, median or some other statistic?
For a die roll, having information about its distribution (say a fair die, i.e. equal probabilities for each face) allows us to use simulation to infer the behavior of the average. We can simply generate samples (precisely because we know the distribution), compute each sample mean, and thereby obtain samples of the average.
nsample <- 50          # rolls per experiment
nexperiments <- 1000   # number of simulated experiments
mu <- c()
for(i in 1:nexperiments){
  dieroll <- sample(1:6, nsample, replace = TRUE)
  mu <- c(mu, mean(dieroll))
}
par(mfrow = c(1,2))
hist(dieroll,
breaks = seq(.5, 6.5, length = 7),
freq = FALSE,
col = yarrr::transparent('sienna3', .6))
hist(mu,
breaks = seq(min(mu), max(mu), length = 9),
freq = FALSE,
col = yarrr::transparent('goldenrod3', .6))
The case where the distribution is unknown can’t be handled like this. If one sample is available but the underlying distribution is unknown, generating more samples is not possible, so the single available sample yields only one observed average. The bootstrap deals with this situation by sampling from the empirical distribution (not the population distribution).
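To connect with the die example: a minimal sketch that resamples one observed sample of rolls (the object names are illustrative):

set.seed(1)
observed <- sample(1:6, 50, replace = TRUE)    # the single sample we actually have
# resample the sample itself to approximate the sampling distribution of the mean
bootMeans <- replicate(1000, mean(sample(observed, replace = TRUE)))
quantile(bootMeans, c(0.025, 0.975))           # percentile interval for the mean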
library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x, n*B, replace = TRUE), B, n)
resampledMedians <- apply(resamples, 1, median)
den <- density(resampledMedians)
plot(den$x, den$y,
     type = 'l', lwd = 2, col = 'black',
     xlab = 'x', ylab = 'density')
polygon(den$x, den$y, col = "slateblue1")
abline(v = median(x), lwd = 3, col = 'black')
title(main = 'Bootstrap density')
Suppose that a statistic that estimates some population parameter is available, but the statistic’s sampling distribution is unknown.
The bootstrap principle suggests using the distribution defined by the data (i.e. the empirical distribution) to approximate its sampling distribution.
The general procedure is:

1. Simulate complete data sets by sampling from the observed data with replacement.
2. Calculate the statistic for each simulated data set.
3. Use the simulated statistics to either define a confidence interval or take their standard deviation to estimate a standard error.
Bootstrap procedure for calculating a confidence interval for the median from a data set of \(n\) observations:

1. Sample \(n\) observations with replacement from the observed data, resulting in one simulated complete data set.
2. Take the median of the simulated data set.
3. Repeat these two steps \(B\) times, resulting in \(B\) simulated medians.

These medians are approximately drawn from the sampling distribution of the median of \(n\) observations. Therefore we can take their standard deviation to estimate the standard error of the median, and take their quantiles to form a percentile confidence interval.
library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x, n*B, replace = TRUE), B, n)
medians <- apply(resamples, 1, median)
sd(medians)
## [1] 0.08560373
quantile(medians, c(0.025, 0.975))
## 2.5% 97.5%
## 68.43733 68.81579
g = ggplot(data.frame(medians = medians), aes(x = medians))
g = g + geom_histogram(color = 'black', fill = 'lightblue', binwidth = 0.05)
g
Permutation tests are used for group comparisons. As a running example, consider comparing insect counts for sprays B and C in R’s InsectSprays data.
| Data Type | Statistic | Test Name |
|---|---|---|
| Ranks | rank sum | rank sum test |
| Binary | hypergeometric prob | Fisher’s exact test |
| Raw data |  | ordinary permutation test |
* The so-called randomization tests are exactly permutation tests, with a different motivation.
* For matched data, one can randomize the signs (see the sketch after this list).
* Permutation strategies work for regression as well.
* Permutation tests work very well in multivariate settings.
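A minimal sketch of the sign-randomization idea for matched data, with made-up paired differences d:

set.seed(1)
d <- rnorm(20, mean = 0.5)     # hypothetical paired differences
obs <- mean(d)
# under H0 the differences are symmetric about zero, so their signs are exchangeable
flips <- replicate(10000, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
mean(abs(flips) >= abs(obs))   # two-sided randomization p-value

Back to the spray comparison: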
subdata <- InsectSprays[InsectSprays$spray %in% c('B', 'C'),]
head(subdata)
## count spray
## 13 11 B
## 14 17 B
## 15 21 B
## 16 11 B
## 17 16 B
## 18 14 B
y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[ g == 'B']) - mean(w[g == 'C'])
observedStat <- testStat(y, group)
observedStat
## [1] 13.25
# permute the group labels and recompute the statistic 10000 times
permutations <- sapply(1:10000, function(i) testStat(y, sample(group)))
# proportion of permutations exceeding the observed statistic (here 0 of 10000)
mean(permutations > observedStat)
## [1] 0
g = ggplot(data.frame(permutations = permutations), aes(x = permutations))
g = g + geom_histogram(color = 'black', fill = 'lightblue', binwidth = 0.5) + geom_vline(xintercept = observedStat, linewidth = 1.5)
g