Calculating a mean is easy.

mean(x)
 0.042

But how do we know if this mean is significantly different than zero?



Standard error gives us a range of means.

n <- length(x)
sd(x) / n^0.5
 1

The standard error of the mean is a handy number. If the data is normally distributed, the true mean will be within two standard deviations of the measured mean 95% of the time.

mean(x) - 2 * sd(x) / n^0.5
 -2
mean(x) + 2 * sd(x) / n^0.5
 2.1

So, the mean is not significantly different than zero assuming the data is normally distributed. But what if the data isn’t normally distributed*?

*Hint: it isn’t.



Bootstrap gives us a better range.

bootstrapped_means <- replicate(
  10000, 
  mean(sample(x, replace=TRUE)))

quantile(bootstrapped_means, 
         probs=c(0.025, 0.5, 0.975))
  2.5%    50%    98% 
-1.231 -0.021  2.242 

Bootstrapping is great because we can estimate a confidence interval regardless of the distribution of the data. The mean still isn’t significant, but it shows that the actual confidence range is a little different than standard error told us it should be.

But, what if we are under the spell of an outlier? What if most of the data is non zero, but an outlier makes the mean zero anyways?

*Hint: We are and it does.



Medians are more stable than the means.

median(x)
 -1

Medians are great because they are robust to outliers. But how do I know if this median is significantly different than zero?



Bootstrap is good for median ranges too.

bootstrapped_medians <- replicate(
  10000, 
  median(sample(x, replace=TRUE)))

quantile(bootstrapped_medians, 
         probs=c(0.025, 0.5, 0.975))
 2.5%   50%   98% 
-1.33 -1.01 -0.56 

Bootstrap says that the median is significantly less than zero, despite anything the mean says. Curse you outlier!



But bootstrapping is slow.

Unfortunaltey, bootstrapping is computationally intensive. At Khan Academy we deal with millions of students, billions of problems. It sure would be nice to find a simpler way to calculate a confidence interval for the median.



There may be a faster way.

cat(sort(x))
-2.5 -2.2 -2.1 -1.9 -1.9 -1.8 -1.4 -1.3 -1.3 -1.3 -1.3 -1.2 -1 -0.87 -0.75 -0.62 -0.59 -0.56 -0.24 -0.2 0.26 0.27 0.33 1.4 24

I noticed a funny thing about the sorted values of x. The lower bound of the median, -1.33 is the 8th value in the list, the upper bound, -0.56 is the 18th. These are exactly 5 positions below and above the median value, -1.01, which is in the 13th position (lucky it!). Moreover, 5 is the square root of the number of items, 25. Whoa!

This leads to a hypothesis:



The 95% confidence interval of the median is the range +/- the square root of n positions from the true median.

I sure hope this is true, because it’s an awful lot like the definition of the standard error of the mean.



The tests confirm it is true.

by_bootstrap <- function(x) {
  bootstrapped_samples <- replicate(
    10000, 
    median(sample(x, replace=TRUE)))
  
  quantile(bootstrapped_samples, 
           probs=c(0.025, 0.975))
}

by_theory <- function(x){
  x <- sort(x)
  n <- length(x)
  median <- (n+1)/2
  c(lower=x[round(median - (n^0.5))],
    upper=x[round(median + (n^0.5))])
}

by_bootstrap(x)
 2.5%   98% 
-1.33 -0.56 
by_theory(x)
lower upper 
-1.33 -0.56 
# Something big
x <- rnorm(101^2)
by_bootstrap(x)
  2.5%    98% 
-0.029  0.017 
by_theory(x)
 lower  upper 
-0.029  0.018 
# Something repetitious
x <- sample(1:15, 169, replace=TRUE)
by_bootstrap(x)
2.5%  98% 
   8   11 
by_theory(x)
lower upper 
    8    11 

Good enough for me!

So what we have developed here is a useful rule of thumb for finding the confidence interval of the median. If you have a million students just look at the values plus or minus a thousand students (square root of 1 million) and you have a good estimate.

Of course, I want more. How do I calculate the confidence interval for percentiles other than the median (the median is the 50th percentile). What about the 25th percentile or 75th? I tried the above technique, but it doesn’t work quite right.



There is a more general solution too.

Thanks to Penn State I have found a more general formula to find the 95% confidence interval for any percentile. It is:

\[ n \cdot p ± 2 \sqrt{n \cdot p (1 - p)} \]

This gives a the same square root result I saw for the median (phew!) and looks to be vaguely reasonable. My only worry is that at the high and low ends, the percentiles of 0% or 100%, their equation produces a single value with total confidnence. But, I suppose that is a problem for another time…