mean(x)
0.042
But how do we know if this mean is significantly different than zero?
n <- length(x)
sd(x) / n^0.5
1
The standard error of the mean is a handy number. If the data is normally distributed, the true mean will be within two standard errors of the measured mean 95% of the time.
mean(x) - 2 * sd(x) / n^0.5
-2
mean(x) + 2 * sd(x) / n^0.5
2.1
So, the mean is not significantly different than zero assuming the data is normally distributed. But what if the data isn’t normally distributed*?
*Hint: it isn’t.
bootstrapped_means <- replicate(
  10000,
  mean(sample(x, replace=TRUE)))
quantile(bootstrapped_means,
  probs=c(0.025, 0.5, 0.975))
2.5% 50% 97.5%
-1.231 -0.021 2.242
Bootstrapping is great because we can estimate a confidence interval regardless of the distribution of the data. The mean still isn’t significant, but the actual confidence range is a little different from what the standard error told us it should be.
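Since the same recipe works for any statistic, here it is as a small helper, just a sketch (the names bootstrap_ci, statistic, and n_boot are mine, not from any package):
# Bootstrap a 95% confidence interval for any statistic of x
bootstrap_ci <- function(x, statistic, n_boot=10000) {
  boots <- replicate(n_boot, statistic(sample(x, replace=TRUE)))
  quantile(boots, probs=c(0.025, 0.975))
}
# e.g. bootstrap_ci(x, mean) gives roughly the interval above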
But what if we are under the spell of an outlier? What if most of the data is nonzero, but an outlier makes the mean zero anyway?*
*Hint: we are, and it does.
median(x)
-1
Medians are great because they are robust to outliers. But how do I know if this median is significantly different than zero?
bootstrapped_medians <- replicate(
  10000,
  median(sample(x, replace=TRUE)))
quantile(bootstrapped_medians,
  probs=c(0.025, 0.5, 0.975))
2.5% 50% 97.5%
-1.33 -1.01 -0.56
The bootstrap says that the median is significantly less than zero, despite anything the mean says. Curse you, outlier!
Unfortunately, bootstrapping is computationally intensive. At Khan Academy we deal with millions of students and billions of problems. It sure would be nice to find a simpler way to calculate a confidence interval for the median.
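For a rough sense of that cost, here is a quick timing sketch on made-up data (x_big is hypothetical, and actual timings will vary by machine):
# Time 1,000 bootstrap medians of a million made-up values
x_big <- rnorm(1e6)
system.time(replicate(1000, median(sample(x_big, replace=TRUE))))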
cat(sort(x))
-2.5 -2.2 -2.1 -1.9 -1.9 -1.8 -1.4 -1.3 -1.3 -1.3 -1.3 -1.2 -1 -0.87 -0.75 -0.62 -0.59 -0.56 -0.24 -0.2 0.26 0.27 0.33 1.4 24
I noticed a funny thing about the sorted values of x. The lower bound of the median, -1.33, is the 8th value in the list; the upper bound, -0.56, is the 18th. These are exactly 5 positions below and above the median value, -1.01, which is in the 13th position (lucky it!). Moreover, 5 is the square root of the number of items, 25. Whoa!
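Here is that observation spelled out, using the 25 values above:
# median position: (25+1)/2 = 13, and sqrt(25) = 5, so look 5 positions either side
sort(x)[c(13 - 5, 13, 13 + 5)]   # positions 8, 13, 18: lower bound, median, upper bound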
This leads to a hypothesis: the 95% confidence interval of the median runs from the value √n positions below the sample median to the value √n positions above it in the sorted data.
I sure hope this is true, because it’s an awful lot like the definition of the standard error of the mean.
# 95% confidence interval for the median, by bootstrapping
by_bootstrap <- function(x) {
  bootstrapped_samples <- replicate(
    10000,
    median(sample(x, replace=TRUE)))
  quantile(bootstrapped_samples,
    probs=c(0.025, 0.975))
}

# 95% confidence interval for the median, from the values
# sqrt(n) positions below and above the middle of the sorted data
by_theory <- function(x) {
  x <- sort(x)
  n <- length(x)
  median <- (n+1)/2  # position of the median
  c(lower=x[round(median - (n^0.5))],
    upper=x[round(median + (n^0.5))])
}
by_bootstrap(x)
2.5% 97.5%
-1.33 -0.56
by_theory(x)
lower upper
-1.33 -0.56
# Something big
x <- rnorm(101^2)
by_bootstrap(x)
2.5% 97.5%
-0.029 0.017
by_theory(x)
lower upper
-0.029 0.018
# Something repetitious
x <- sample(1:15, 169, replace=TRUE)
by_bootstrap(x)
2.5% 97.5%
8 11
by_theory(x)
lower upper
8 11
Good enough for me!
So what we have developed here is a useful rule of thumb for finding the confidence interval of the median. If you have a million students, just look at the values a thousand positions (the square root of one million) below and above the median, and you have a good estimate.
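As a sketch of the rule of thumb at that scale (the data here is made up, just something skewed):
# A million made-up values standing in for a million students
scores <- sort(rexp(1e6))
middle <- round((length(scores) + 1) / 2)
offset <- round(sqrt(length(scores)))   # = 1000
scores[c(middle - offset, middle, middle + offset)]   # lower bound, median, upper bound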
Of course, I want more. How do I calculate the confidence interval for percentiles other than the median (the median is the 50th percentile)? What about the 25th percentile or the 75th? I tried the above technique, but it doesn’t work quite right.
Thanks to Penn State, I have found a more general formula for the 95% confidence interval of any percentile. It is:
\[ n \cdot p \pm 2 \sqrt{n \cdot p (1 - p)} \]
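Here is what that might look like in R, just a sketch (percentile_ci is my own name for it, and I clamp the positions so they stay inside the data):
# 95% confidence interval for the p-th percentile, using the position formula above
percentile_ci <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  pos <- n * p                          # expected position of the percentile
  spread <- 2 * sqrt(n * p * (1 - p))   # ~95% margin, in positions
  lo <- min(n, max(1, round(pos - spread)))
  hi <- min(n, max(1, round(pos + spread)))
  c(lower=x[lo], upper=x[hi])
}
# e.g. percentile_ci(x, 0.5) should line up with by_theory(x);
# percentile_ci(x, 0.25) gives the interval for the 25th percentile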
This gives the same square root result I saw for the median (phew!) and looks vaguely reasonable. My only worry is that at the high and low ends, the 0th or 100th percentiles, their equation produces a single value with total confidence. But I suppose that is a problem for another time…