Some comments on simulate.p.value in chisq.test

First, here is the basic algorithm from the function; I'm going to dissect it in what follows. What it does is create B simulations of the data under H0, which means picking from the categories of x using the probabilities p.

## some data
x <- c(3, 8, 5)
p <- c(1/4, 1/2, 1/4)


## data summaries
B <- 4  ## 2000 in the code
n <- sum(x)
E <- n * p
names(E) <- names(x)

## compute simulations using matrix
nx <- length(x)
sm <- matrix(sample.int(nx, B * n, TRUE, prob = p), nrow = n)
ss <- apply(sm, 2L, function(x, E, k) {
    sum((table(factor(x, levels = 1L:k)) - E)^2/E)
}, E = E, k = nx)
## find p value
almost.1 <- 1 - 64 * .Machine$double.eps
STATISTIC <- sum((x - E)^2/E)
PVAL <- (1 + sum(ss >= almost.1 * STATISTIC))/(B + 1)

Okay, now step by step.

The variables x and p are defined so we have a concrete problem to work with. Here x is the “sample data” and p holds the probabilities under the null hypothesis.

The next lines set B, the number of simulations (I use 4, but the actual code uses 2000), then compute the number of picks (n, the sum of the elements of x) and the expected counts E = \( n\cdot p \).
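
For the data above these work out to n = 16 picks with expected counts of 4, 8, and 4:

n
## [1] 16
E
## [1] 4 8 4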

The key idea

Now for the simulation part. In week 4 we talk more about this, but that week has historically been rough going, so let's get a head start. The goal of the simulation is to understand the sampling distribution of the test statistic not by theoretical means, but rather by simulating lots of values and looking at how common large values are. For example, if you wanted to know the probability of heads you could a) appeal to theoretical arguments and say that since each side is equally likely and there are only two sides, the answer must be ½; or b) toss the coin a lot and see that, in the long run, it appears to have an equal chance of being heads or tails. Now “the long run” is tricky to understand. It doesn't mean that after 10,000 coin tosses there will be exactly 5,000 heads, but rather that the number of heads should be close to 5,000, with “close” given some meaning (in terms of a standard deviation). So “proof” that the coin is fair is basically data showing that the number of heads in 10,000 coin tosses is near 5,000.
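
In R, the coin-tossing version of this idea is a one-liner (a toy aside, not part of chisq.test; the name tosses is just for illustration):

tosses <- sample(c("H", "T"), 10000, replace = TRUE)
mean(tosses == "H")  ## close to 0.5 in the long run, rarely exactly 0.5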

Applying this to the chi-squared test: the idea is that if you generate random data (n picks from the nx categories of x, made with probabilities p), then you can see whether our observed value of x is unusual.

This is needed because the chi-squared distribution of the test statistic is only asymptotic, meaning n must be large enough for it to be a good approximation. (Basically, each standardized count \( (obs - E)/\sqrt{E} \) must be, as a random quantity, getting close to normally distributed, so that the sum of their squares is close to a chi-squared.)
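
For comparison, the asymptotic p-value (the one chisq.test reports when simulate.p.value = FALSE) comes from the chi-squared distribution with length(x) - 1 = 2 degrees of freedom:

STATISTIC <- sum((x - E)^2/E)
pchisq(STATISTIC, df = length(x) - 1, lower.tail = FALSE)
## [1] 0.7788008

With expected counts this small, though, we'd rather not lean on that approximation, hence the simulation.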

So we need to

a) generate random samples like x – lots of them (B of them), and
b) check whether our one observed value is unusual among them.

A single sample of size n from the categories of x using probabilities p is generated in R with:

sample.int(nx, n, TRUE, prob = p)
##  [1] 2 1 2 3 2 2 1 2 1 1 1 2 2 3 3 1

This line makes lots of such samples (B of them) and stores them in a rectangular array:

sm <- matrix(sample.int(nx, B * n, TRUE, prob = p), nrow = n)
sm
##       [,1] [,2] [,3] [,4]
##  [1,]    3    1    3    2
##  [2,]    1    1    1    2
##  [3,]    2    3    2    2
##  [4,]    2    1    1    1
##  [5,]    1    2    3    1
##  [6,]    3    1    3    2
##  [7,]    1    2    3    2
##  [8,]    1    2    2    1
##  [9,]    1    2    2    1
## [10,]    1    3    2    1
## [11,]    3    1    2    3
## [12,]    2    2    1    2
## [13,]    2    1    2    2
## [14,]    2    1    2    1
## [15,]    3    1    2    1
## [16,]    3    2    3    2

Each column is a sample. We need to count the number of 1's, 2's, and 3's, since that is the form our data x takes. For this, the function gets fancy: it uses table. Here we focus on the second column:

y <- sm[, 2]  ## second column
table(y)
## y
## 1 2 3 
## 8 6 2 

The code uses something slightly fancier to force the table to have all 3 categories, even when a category has a zero count:

obs <- table(factor(y, levels = 1L:nx))
obs
## 
## 1 2 3 
## 8 6 2 
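
To see why the factor wrapper matters, here is a made-up toy vector with no 3's at all (not one of our simulated columns). Plain table silently drops the empty category, while the factor version keeps a zero count that lines up with E:

table(c(1, 1, 2))
## 
## 1 2 
## 2 1 
table(factor(c(1, 1, 2), levels = 1:3))
## 
## 1 2 3 
## 2 1 0 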

But we need to do a bit more with this: we need to compute the test statistic. This is where E comes in:

(obs - E)^2/E
## 
##   1   2   3 
## 4.0 0.5 1.0 

Doing this slowly we can see R's vectorization at work:

E
## [1] 4 8 4

and the differences:

(obs - E)
## 
##  1  2  3 
##  4 -2 -2 

The squared differences:

(obs - E)^2
## 
##  1  2  3 
## 16  4  4 
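
And the last step divides by E and adds the pieces up, giving the chi-squared statistic for this one simulated column (it matches the second entry of ss below):

sum((obs - E)^2/E)
## [1] 5.5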

In the code, this line does this work for each column in the matrix. The apply function iterates over columns because of the 2L specified in its second argument (the L says to use an integer value of 2, not a floating-point value):

ss <- apply(sm, 2L, function(x, E, k) {
    sum((table(factor(x, levels = 1L:k)) - E)^2/E)
}, E = E, k = nx)
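
(As an aside, and not what chisq.test actually does: each simulated column is just a multinomial draw, so the same kind of ss vector could be generated more compactly with rmultinom, whose columns are already tables of counts. The names counts and ss_alt below are mine.)

counts <- rmultinom(B, size = n, prob = p)  ## an nx-by-B matrix of counts
ss_alt <- colSums((counts - E)^2/E)         ## one statistic per column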

P-value

The observed value of our data comes from x:

obs <- sum((x - E)^2/E)
obs
## [1] 0.5

How extreme is this? Well, of the B simulated values, how many were bigger? We can count by eye when B is 4:

ss
## [1] 2.375 5.500 0.500 4.500
obs
## [1] 0.5

and counting:

ss > obs
## [1]  TRUE  TRUE FALSE  TRUE

Not too extreme!

When there are many random values, you need to count with the computer. This is more or less what is done (note the >=, which also counts ties, so the third value is now included):

sum(ss >= obs)/B
## [1] 1

But not quite. The code shaves a tiny bit off the observed statistic (which it calls STATISTIC) by multiplying by almost.1, a guard against floating-point round-off in the >= comparison, and it adds 1 to both the top and the bottom of the fraction, so the observed value counts as one of the simulations and the reported p-value is never exactly 0. This is not an uncommon thing to see (some books suggest adding 2 for some tests); it is a tradeoff between bias and variability.
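
For our tiny example the adjustment changes nothing, since all four simulated values clear the (slightly shaved) observed statistic:

STATISTIC <- sum((x - E)^2/E)
sum(ss >= almost.1 * STATISTIC)
## [1] 4
(1 + sum(ss >= almost.1 * STATISTIC))/(B + 1)
## [1] 1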

Anyway, the main point is that we look at how often in our simulation we see a value as or more extreme than our observed one, and we call that frequency the simulated p-value.
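
To wrap up, here is the same computation done by chisq.test itself. The simulated p-value changes from run to run since B = 2000 fresh samples are drawn each time, but for this data it should land in the same large, clearly-not-significant range as the asymptotic value above (roughly 0.75 to 0.8):

chisq.test(x, p = p, simulate.p.value = TRUE)  ## p-value varies from run to run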