Introduction

A box model is a very simple way to simulate a chance process. In a box model, one creates a ‘box’ representing a population of interest. Then one uses a computer, or some other unbiased mechanism, to randomly sample from the box. Outcomes from this sampling process can be used to estimate the range of expected outcomes from a random sample of a similar population in the real world.
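
To make this concrete, here is a minimal sketch of a box model (the box contents and sample size are arbitrary choices for illustration): a box holding the tickets 0 and 1, representing a coin flip.

# A 'box' with two tickets: 0 = tails, 1 = heads
box <- c(0, 1)

# Draw 100 tickets at random (with replacement) and count the heads
draws <- sample(box, size = 100, replace = TRUE)
sum(draws)

# Repeating the whole draw gives a sense of the range of expected outcomes
replicate(10, sum(sample(box, size = 100, replace = TRUE)))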

Practically, one way to program a box model in R is with a ‘for loop’, which repeats a process some fixed number of times. A for loop accepts a list of values and repeats until it has exhausted the list. For example:

for (num in c(1, 2, 3, 4, "five", 6, 7)){
  # Statements within the curly braces are repeated
  print(num)
}
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
## [1] "five"
## [1] "6"
## [1] "7"

In the above example the variable num is assigned the first value in the list and then printed. Once R reaches the closing curly brace } the variable num is assigned the next value in the list. This repeats until the list of values is exhausted. In R, it is usually better to use ‘vectorized’ operations that avoid explicit loops. In that case, the above code becomes:

vals <- c(1, 2, 3, 4, "five", 6, 7)
res <- sapply(vals, print)  # Apply function print to each item
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
## [1] "five"
## [1] "6"
## [1] "7"

Understanding Type I and II Errors

A Type I error (a false alarm) occurs when a statistical test signals a significant difference where none exists. Statistical tests have an error rate, and we can use a box model to illustrate it. First, we create a population of 10,000 elements with a mean of 100 and a standard deviation of 20.

# Create a population with mean=100 and sd=20
pop <- rnorm(10000, mean = 100, sd = 20)
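
As a quick sanity check, the simulated population should have roughly (but not exactly) the parameters we asked for:

# The population's mean and sd are close to, but not exactly,
# the values passed to rnorm()
mean(pop)
sd(pop)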

If we take two random samples from this population, the difference between their means should be very small.

# Take two random samples from the population
samp1 <- sample(pop, size = 100, replace = FALSE)
samp2 <- sample(pop, size = 100, replace = FALSE)

# Create a data.frame with two variables, s1 and s2
samples <- data.frame(s1 = samp1, s2 = samp2)

# The samples should be similar but not the same
summary(samples)
##        s1               s2        
##  Min.   : 43.96   Min.   : 47.45  
##  1st Qu.: 80.14   1st Qu.: 83.60  
##  Median : 96.44   Median : 98.42  
##  Mean   : 95.93   Mean   : 98.58  
##  3rd Qu.:112.54   3rd Qu.:113.70  
##  Max.   :150.56   Max.   :148.19

However, once in a while we’ll get a freakishly large difference between the sample means, even though both are estimates of the same population parameters. In a statistical setting we’d interpret these unusually large differences as evidence that the two samples are statistically different. How often will we get ‘unusually large’ (statistically significant) differences between samples from the same population by random chance? It depends on how you define statistical significance. In statistical tests this is done by setting a significance threshold \(\alpha\) (alpha). Alpha controls how often we’ll get a Type I error, which occurs when our statistical test erroneously indicates a significant result.

# Test the sample means
library(mosaic)
test.result <- t.test(x = s1, y = s2, data = samples)
test.result
## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -0.83801, df = 197.27, p-value = 0.403
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.882615  3.584741
## sample estimates:
## mean of x mean of y 
##  95.93208  98.58101

The code above illustrates the use of the data.frame(), sample(), t.test(), rnorm(), and summary() functions. You should be able to use and interpret the results of these functions. What does the object test.result contain and what does it tell you about the samples?
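
If you want to inspect test.result directly, note that t.test() returns a list (of class ‘htest’ in base R), so its components can be extracted by name, for example:

# test.result is a list; names() shows its components
names(test.result)

# Extract pieces of the test directly
test.result$p.value   # the p-value
test.result$conf.int  # the 95% confidence interval
test.result$estimate  # the two sample means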

The next block of code illustrates how \(\alpha\) affects the rate of Type I errors. The logic of the code below is fairly simple: it repeats a hypothesis test on two samples drawn from the same population. Because these samples come from the same population, the t-test should fail to reject the null hypothesis. Sometimes, however, random chance gives us samples so different from each other that the test rejects anyway, and we commit a Type I error. How often this happens depends on how we set \(\alpha\).

First, we specify the number of hypothesis tests in n. Then we choose a set of alpha values to explore; these are stored in a vector called alphas. Using our handy-dandy do() function (from mosaic), we perform a t.test() n times between two independent random samples (without replacement) from our population variable:

n <- 1000
alphas <- c(0.0, 0.001, 0.01, 0.05, 0.1, 0.2)
result <- do(n) * t.test(sample(pop, 100, replace = FALSE),
                         sample(pop, 100, replace = FALSE))
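
(If you prefer, the same simulation can be sketched with the for loop idea from the introduction; p_values below is just an illustrative name for collecting the results.)

# An equivalent sketch using a for loop instead of do()
p_values <- numeric(n)
for (i in 1:n) {
  test <- t.test(sample(pop, 100, replace = FALSE),
                 sample(pop, 100, replace = FALSE))
  p_values[i] <- test$p.value
}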

Using this result, we can then group_by the various alpha levels by cutting the dataset into groups based on the alphas vector, and compute the cumulative count of Type I errors (i.e., times that our p.value falls below alpha even though the samples come from the same population).

library(dplyr)
result %>%
  group_by(alpha=cut(p.value, alphas)) %>%
  summarize(err=n()) %>%
  mutate(err=cumsum(err))
## # A tibble: 6 × 2
##          alpha   err
##         <fctr> <int>
## 1    (0,0.001]     3
## 2 (0.001,0.01]     8
## 3  (0.01,0.05]    50
## 4   (0.05,0.1]   102
## 5    (0.1,0.2]   207
## 6           NA  1000

The results show, for each level of alpha, the cumulative count of Type I errors (tests whose p-value falls below that threshold). For example, with an alpha of 0.2 roughly 20% of the tests (207 of the 1000 here) raise a false alarm (Type I error).
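
You can also compute the error rate at a single alpha directly, without cut(); for example, at alpha = 0.05 roughly 5% of the tests should reject:

# Proportion of the n tests that (wrongly) reject at alpha = 0.05
mean(result$p.value < 0.05)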

How alpha affects the prevalence of Type II errors

This block of code is almost identical to the preceding block, except here we take samples from two different populations. The variable diff is the magnitude of the difference between the two population means. In this case the hypothesis test should reject the null hypothesis because we are sampling from different (random normal) populations. When the test indicates that there is not a significant difference between the samples, we have committed a Type II error (a false negative). The code shows how the rate of false negatives depends on alpha.

# n and alpha are the same as before
diff <- 5  # Difference in group means
result <- do(n) * t.test(rnorm(100, mean = 100, sd = 10),
                         rnorm(100, mean = 100 + diff, sd = 10))

# Now perform the grouping and summary, with a twist
result %>%
  group_by(alpha=cut(p.value, alphas)) %>%
  summarize(err=n()) %>%
  mutate(err=n-cumsum(err))  # Note the n-...
## # A tibble: 6 × 2
##          alpha   err
##         <fctr> <dbl>
## 1    (0,0.001]   446
## 2 (0.001,0.01]   176
## 3  (0.01,0.05]    65
## 4   (0.05,0.1]    30
## 5    (0.1,0.2]    16
## 6           NA     0
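
As a rough cross-check, base R’s power.t.test() computes the power of a two-sample t-test analytically (power is 1 minus the Type II error rate); with the settings used above it should be broadly consistent with the simulated counts.

# Analytic power for a two-sample t-test: n = 100 per group,
# true difference of 5, sd = 10, alpha = 0.05
power.t.test(n = 100, delta = diff, sd = 10, sig.level = 0.05)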

Understanding significance

The key statistical point here is that there is a trade-off between false positives and false negatives. As we increase alpha the number of false positives increases but the number of false negatives decreases. Most people treat alpha = 0.05 as a reasonable compromise between these two types of errors. The concept of ‘significance’ has this trade-off embedded in it: think of ‘significance’ as a compromise between false positives and false negatives, not as an absolute determination.
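
One way to see the trade-off numerically is to tabulate both error rates across the alpha values. The sketch below assumes you saved the two simulations under separate names, result_null and result_diff (hypothetical names; the code above reuses result for both):

# Hypothetical: result_null holds the same-population simulation,
# result_diff holds the different-populations simulation
trade_off <- data.frame(
  alpha = alphas[-1],  # drop the 0.0 threshold
  type1 = sapply(alphas[-1], function(a) mean(result_null$p.value < a)),
  type2 = sapply(alphas[-1], function(a) mean(result_diff$p.value >= a))
)
trade_off  # the Type I rate rises with alpha while the Type II rate falls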

From a programming perspective the key point of the above code is that box models and grouping are useful! Now, you’ve got to create your own box model and use it to solve a real-world problem.


This demo is based on a previous version by Seth Spielman.