I attended the Strata+Hadoop World Conference in New York City between Oct 15 and Oct 17. John Rauser gave a very engaging keynote called Statistics without the Agonizing Pain, which I found well worth the 11 minutes and 47 seconds.

I decided to replicate the talk in R. As John points out, a lot of statistics can be approached if you have three things:

R doesn’t help with the first bullet, and it tends to “hide” the third (“for loops” are mostly eschewed). AND, using R, it is pretty trivial to just simply call “t.test”. But where’s the fun in that?

## Loading required package: ggplot2
## Loading required package: ggthemes
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Define a function that will do the randomized shuffling

permuteAndDiff <- function (both, nbeer) { 
  # Shuffle up the both data frame
  randomized = sample(both$obs, nrow(both))
  
  beer.random = randomized[1 : nbeer]
  water.random = randomized[(nbeer + 1) : nrow(both)]
  
  # Null (shuffled) difference
  mean(beer.random) - mean(water.random)
}

Seed the data from the talk

# Beer
beer = data.frame(type="beer", obs=c(27,19,20,
                                  20,23,17,
                                  21,24,31,
                                  26,28,20,
                                  27,19,25,
                                  31,24,28,
                                  24,29,21,
                                  21,18,27,
                                  20))

# Water
water = data.frame(type="water", obs=c(21,19,13,
                                     22,15,22,
                                     15,22,20,
                                     12,24,24,
                                     21,19,18,
                                     16,23,20))

# Combine the two datasets into a single dataset i.e., under the null
# hypothesis, there is no difference between the two groups
both = rbind(beer,water)

Print and plot some summary metrics

both %>% group_by(type) %>% summarize(n=n(), mean=mean(obs))
## Source: local data frame [2 x 3]
## 
##    type  n  mean
## 1  beer 25 23.60
## 2 water 18 19.22
ggplot(both, aes(x=obs, colour=type)) + geom_density() + theme_few()

plot of chunk unnamed-chunk-4

# Observed difference
diff.observed = mean(beer$obs) - mean(water$obs)
print(diff.observed)
## [1] 4.378

The key question asks if the observed difference in the means is a random occurence. We can use the t.test to find the p-value, or we can simulate to approach the p-value.

permutations = 100000
# how many beers do we have?
nbeer = nrow(both %>% filter(type=="beer"))
# instead of for-looping, just call replicate
diff.random = replicate(permutations, permuteAndDiff(both, nbeer))

P-value is the fraction of how many times the permuted difference is equal or more extreme than the observed difference

pvalue = sum(abs(diff.random) >= abs(diff.observed)) / permutations
print (pvalue)
## [1] 0.00109
# is this "close" to the Welch T-Test?
print(t.test(beer$obs, water$obs))
## 
##  Welch Two Sample t-test
## 
## data:  beer$obs and water$obs
## t = 3.658, df = 39.11, p-value = 0.0007474
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.957 6.798
## sample estimates:
## mean of x mean of y 
##     23.60     19.22

Pretty close.

Finally, show a simple plot…

df = data.frame(difference = diff.random)
ggplot(df, aes(x=difference)) + geom_histogram() + 
  geom_vline(x=diff.observed, colour="red") + annotate("text", x=diff.observed+.5, y=0, hjust=-.5,label="Observed Difference", angle=90) +
  theme_few()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-7