If we are interested in performing a \(\chi^2\) test in R, it’s easy to find a function that does it for us. Luckily, it’s already supplied with base R, which means no extra packages are needed.
If we type ?chisq.test, we open the help file with more or less clear instructions on what the function does and how to use it.
If x is a matrix with one row or column, or if x is a vector and y is not given, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table). The entries of x must be non-negative integers. In this case, the hypothesis tested is whether the population probabilities equal those in p, or are all equal if p is not given.
In other words, we are interested in whether a vector of numbers (\(n \geq 2\)) was generated by some process that produces these numbers with the probabilities specified in the argument p. This is also called a goodness-of-fit test (as mentioned in the help file).
Let’s say there is a process that generates two values, and in this case we measure two 10s. Since we think the process is fair, it should generate the first and second result with equal probability, so 0.5 each.
chisq.test(c(10, 10), p = c(0.5, 0.5))
##
## Chi-squared test for given probabilities
##
## data: c(10, 10)
## X-squared = 0, df = 1, p-value = 1
Because we have two numbers, we are comparing the X-squared value to a \(\chi^2\) distribution with one degree of freedom (df = number of categories minus one). The result of this lookup is the \(p\)-value.
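We can reproduce this lookup by hand. The statistic is \(X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}\), where \(O_i\) are the observed and \(E_i\) the expected counts, and the \(p\)-value is the upper tail probability of the \(\chi^2\) distribution at that value. A minimal sketch for the 10 and 10 case:
obs <- c(10, 10)                            # observed counts
expected <- sum(obs) * c(0.5, 0.5)          # expected counts under a fair process
stat <- sum((obs - expected)^2 / expected)  # X-squared statistic, here 0
pchisq(stat, df = length(obs) - 1, lower.tail = FALSE)  # p-value, here 1
Plugging in 5 and 15 instead gives \(X^2 = 25/10 + 25/10 = 5\), which is exactly the statistic reported in the next example.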
There is no evidence that these two numbers are in anything but a 0.5:0.5 ratio; getting 10 and 10 is thus very likely. If we still assume the process generates the two numbers with equal probability, but we get the result 5 and 15 - is this still “very” expected?
chisq.test(c(5, 15), p = c(0.5, 0.5))
##
## Chi-squared test for given probabilities
##
## data: c(5, 15)
## X-squared = 5, df = 1, p-value = 0.02535
There is some evidence that 5 and 15 were not generated by a process that should be “fair”. Let’s try a simulation approach and explore how the statistic behaves in a number of cases.
Let’s see how the \(\chi^2\) test behaves (we’re observing the \(p\)-value) if we compare 1 and 99, 2 and 98, 3 and 97 … 99 and 1. If we plot the ratio on the \(x\) axis and the \(p\)-value on the \(y\) axis, we would expect a symmetrical curve with a peak at a ratio of 1:1.
library(ggplot2)
prc <- data.frame(p1 = 1:99, p2 = 99:1) # prc as in process
# perform chi^2 test and extract p-value
prc$pval <- apply(prc, MARGIN = 1, FUN = function(x) chisq.test(x)$p.value)
prc$ratio <- with(prc, round(p1/p2, 3))
ggplot(prc, aes(x = as.factor(ratio), y = pval)) +
  theme_bw() +
  scale_x_discrete(
    breaks = prc$ratio[seq(1, 99, length.out = 20)],
    labels = prc$ratio[seq(1, 99, length.out = 20)]) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_point()
It would appear that at a ratio of 1 there is no evidence that the two numbers come from anything other than a fair process (50:50 or 0.5:0.5), and as soon as the ratio starts rising or dropping, the \(p\)-value approaches zero. In other words, if the ratio is far from what we would expect, we get a statistically significant discrepancy between what we expect and what we get.
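To get a feel for the extremes, a single test far from the expected ratio is enough:
chisq.test(c(20, 80), p = c(0.5, 0.5))$p.value  # about 2e-9, overwhelming evidence against a fair process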
prc.sub <- prc[prc$ratio > 0.75 & prc$ratio < 1.25, ]
prc.sub
## p1 p2 pval ratio
## 43 43 57 0.1615133 0.754
## 44 44 56 0.2301393 0.786
## 45 45 55 0.3173105 0.818
## 46 46 54 0.4237108 0.852
## 47 47 53 0.5485062 0.887
## 48 48 52 0.6891565 0.923
## 49 49 51 0.8414806 0.961
## 50 50 50 1.0000000 1.000
## 51 51 49 0.8414806 1.041
## 52 52 48 0.6891565 1.083
## 53 53 47 0.5485062 1.128
## 54 54 46 0.4237108 1.174
## 55 55 45 0.3173105 1.222
One could say that if the numbers 228 and 273 were the result, there is evidence to support the claim that these two numbers did not originate from a fair process, even though their ratio (about 0.84) falls well inside the non-significant band above.
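We can verify this directly; with roughly five times the sample size, the same ratio now produces a significant result:
chisq.test(c(228, 273), p = c(0.5, 0.5))$p.value  # about 0.044, just below the usual 0.05 cutoff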
Let’s see the raw data in the region around a ratio of 1.
prc[prc$ratio > 0.59 & prc$ratio < 1.6, ]
## p1 p2 pval ratio
## 38 38 62 0.01639507 0.613
## 39 39 61 0.02780690 0.639
## 40 40 60 0.04550026 0.667
## 41 41 59 0.07186064 0.695
## 42 42 58 0.10959858 0.724
## 43 43 57 0.16151332 0.754
## 44 44 56 0.23013934 0.786
## 45 45 55 0.31731051 0.818
## 46 46 54 0.42371080 0.852
## 47 47 53 0.54850624 0.887
## 48 48 52 0.68915652 0.923
## 49 49 51 0.84148058 0.961
## 50 50 50 1.00000000 1.000
## 51 51 49 0.84148058 1.041
## 52 52 48 0.68915652 1.083
## 53 53 47 0.54850624 1.128
## 54 54 46 0.42371080 1.174
## 55 55 45 0.31731051 1.222
## 56 56 44 0.23013934 1.273
## 57 57 43 0.16151332 1.326
## 58 58 42 0.10959858 1.381
## 59 59 41 0.07186064 1.439
## 60 60 40 0.04550026 1.500
## 61 61 39 0.02780690 1.564
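From this table we can read off the band of results that a test at the usual 0.05 level would not flag: for a total of 100, anything between 41:59 and 59:41 passes as fair. A quick way to extract this band programmatically, using the prc data frame from above:
range(prc$p1[prc$pval > 0.05])  # 41 and 59 for the fair 0.5:0.5 process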
If we set the probabilities to something else (say 0.3:0.7), we would expect the curve to shift towards a lower ratio (to the left in our case).
prc <- data.frame(p1 = 1:99, p2 = 99:1) # prc as in process
# perform chi^2 test and extract p-value
prc$pval <- apply(prc, MARGIN = 1, FUN = function(x) chisq.test(x, p = c(0.3, 0.7))$p.value)
prc$ratio <- with(prc, round(p1/p2, 3))
ggplot(prc, aes(x = as.factor(ratio), y = pval)) +
  theme_bw() +
  scale_x_discrete(
    breaks = prc$ratio[seq(1, 99, length.out = 20)],
    labels = prc$ratio[seq(1, 99, length.out = 20)]) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_point()
Let’s see the raw data in the region around a ratio of \(\frac{0.3}{0.7} \approx 0.43\), which is where we expect the test to estimate the highest \(p\)-values (because we assume, or rather test, that the process generates values in a 0.3:0.7 ratio).
prc[prc$ratio > 0.2 & prc$ratio < 0.7, ]
## p1 p2 pval ratio
## 17 17 83 0.004556350 0.205
## 18 18 82 0.008828761 0.220
## 19 19 81 0.016377308 0.235
## 20 20 80 0.029096332 0.250
## 21 21 79 0.049534613 0.266
## 22 22 78 0.080855598 0.282
## 23 23 77 0.126630458 0.299
## 24 24 76 0.190430264 0.316
## 25 25 75 0.275233524 0.333
## 26 26 74 0.382733089 0.351
## 27 27 73 0.512690760 0.370
## 28 28 72 0.662520584 0.389
## 29 29 71 0.827259347 0.408
## 30 30 70 1.000000000 0.429
## 31 31 69 0.827259347 0.449
## 32 32 68 0.662520584 0.471
## 33 33 67 0.512690760 0.493
## 34 34 66 0.382733089 0.515
## 35 35 65 0.275233524 0.538
## 36 36 64 0.190430264 0.562
## 37 37 63 0.126630458 0.587
## 38 38 62 0.080855598 0.613
## 39 39 61 0.049534613 0.639
## 40 40 60 0.029096332 0.667
## 41 41 59 0.016377308 0.695
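Indeed, the maximum \(p\)-value sits at 30:70, exactly matching the 0.3:0.7 probabilities we tested against. We can confirm this by pulling out the row with the largest \(p\)-value:
prc[which.max(prc$pval), ]  # row 30: p1 = 30, p2 = 70, pval = 1, ratio = 0.429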