Fitting probability models to frequency data

Alban Guillaumet, Troy University

“Quotesman on strike”
- Quotesman

Objectives

  • Goodness-of-fit tests
  • \( \chi^2 \) GOF test

Goodness-of-fit test

Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.

Binomial test

Note: The binomial test is an example of a goodness-of-fit test.

Definition: The binomial test uses data to test whether a population proportion (\( p \)) matches a null expectation (\( p_{0} \)) for the proportion.

Definition: The null hypothesis \( H_{0} \) and alternative hypothesis \( H_{A} \) for a binomial test are given by:

    \( H_{0} \): Relative frequency of successes in population is \( p_{0} \).
    \( H_{A} \): Relative frequency of successes in population is not \( p_{0} \).

Binomial test

binom.test(x = 14, n = 18, p = 0.5, alternative = "two.sided")$p.value
[1] 0.03088379
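
As a rough check on where this number comes from, the two-sided P-value can be rebuilt directly from binomial probabilities. The sketch below relies on the symmetric case \( p_{0} = 0.5 \), where doubling the upper tail matches what `binom.test` reports; the variable names are illustrative only.

```r
# Probability of 14 or more successes out of 18 trials when p0 = 0.5
p_upper <- sum(dbinom(14:18, size = 18, prob = 0.5))

# The null distribution is symmetric at p0 = 0.5, so double the upper tail
p_two_sided <- 2 * p_upper
print(p_two_sided)
```

For a null proportion other than 0.5 the distribution is not symmetric, and simple doubling would no longer be guaranteed to agree with `binom.test`.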

Binomial test

The binomial test is conceptually equivalent to comparing an observed frequency distribution (green) with the frequency distribution expected under a simple probability model (yellow).

         Right Left
Observed    14    4
Expected     9    9

[Figure: bar plot of observed (green) and expected (yellow) frequencies for Right and Left]

A more general goodness-of-fit test

  • The binomial test is limited to categorical variables with only two possible outcomes.

  • We now introduce a more general goodness-of-fit test that can:

    • handle categorical and discrete numerical variables having more than two outcomes
    • assess the fit of more complex probability models

Working through an example

Assignment Problem #21

A recent study of Feline High-Rise Syndrome (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cats falling varies among months of the year?

Month      Number fallen   Month      Number fallen
January                4   July                  19
February               6   August                13
March                  8   September             12
April                 10   October               12
May                    9   November               7
June                  14   December               5

Question: What are the null and alternative hypotheses?

Answer:
     \( H_{0} \): The frequency of cats falling is the same in each month.
     \( H_{A} \): The frequency of cats falling is not the same in each month.

Assignment Problem #21

Observed and Expected Frequencies

d = read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") 
head(d)
     month
1  January
2  January
3  January
4  January
5 February
6 February

Assignment Problem #21

is.data.frame(d)
[1] TRUE
d$month[1:10]
 [1] January  January  January  January  February February February February
 [9] February February
12 Levels: April August December February January July June March ... September

Assignment Problem #21

Let's use a loop to calculate the observed frequencies

# Set up a 12-row table with one row per month, in calendar order
dc = data.frame(matrix(rep(NA, 36), nrow = 12))
colnames(dc) = c("month", "obs", "exp")
dc$month = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
for(i in 1 : 12){
  month.i = dc$month[i]
  d_month.i = subset( d, month == month.i )
  # `d_month.i` is a data frame, so count its rows; `length` would return its number of columns
  dc[i, "obs"] = nrow( d_month.i )
  # Equivalent to: dc[i, "obs"] = length( d_month.i$month )
}
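
The same counts can also be obtained without a loop. The sketch below is self-contained: rather than downloading the file, it rebuilds the `month` column from the counts in the problem table (an assumption made only so the example runs offline) and tabulates it with `table()`. The names `counts`, `d_sketch`, and `obs` are illustrative. Setting the factor levels to R's built-in `month.name` keeps the months in calendar order instead of alphabetical order.

```r
# Counts per month, January to December, copied from the problem table
counts <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

# Rebuild one row per cat (a stand-in for the downloaded data frame `d`)
d_sketch <- data.frame(month = rep(month.name, times = counts))

# Tabulate, with levels in calendar order rather than alphabetical order
obs <- table(factor(d_sketch$month, levels = month.name))
print(obs)
```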

Assignment Problem #21

Always check that what you found makes sense!!!

dc[1:6,]
     month obs exp
1  January   4  NA
2 February   6  NA
3    March   8  NA
4    April  10  NA
5      May   9  NA
6     June  14  NA
sum(dc$obs) == nrow(d)
[1] TRUE
subset( d, month == "January" )
    month
1 January
2 January
3 January
4 January

Assignment Problem #21

Now let's compute the expected frequencies

( dc[, "exp"] = sum( dc[, "obs"] ) / 12 )
[1] 9.916667

Assignment Problem #21

dc
       month obs      exp
1    January   4 9.916667
2   February   6 9.916667
3      March   8 9.916667
4      April  10 9.916667
5        May   9 9.916667
6       June  14 9.916667
7       July  19 9.916667
8     August  13 9.916667
9  September  12 9.916667
10   October  12 9.916667
11  November   7 9.916667
12  December   5 9.916667

Assignment Problem #21

Let's make a plot!

dc.mat = as.matrix(dc[,2:3]); rownames(dc.mat) = substr(dc[,1],1, 3)
barplot(t(dc.mat), beside = TRUE, col = c("forestgreen","goldenrod1"), legend.text = c("Observed","Expected"))

[Figure: bar plot of observed (green) and expected (yellow) numbers of fallen cats by month]

Assignment Problem #21

Definition: The \( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by

\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]

\[ \chi^2 = \frac{(4 -9.917)^2}{9.917}+\frac{(6 -9.917)^2}{9.917}+...+\frac{(5 -9.917)^2}{9.917} \]
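
The sum above can be evaluated directly in R. This is a minimal sketch that re-enters the observed counts from the problem table so it runs on its own; `obs`, `expected`, and `chi2` are illustrative names.

```r
# Observed counts, January to December
obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

# Under H0 every month has the same expected frequency: 119 / 12
expected <- rep(sum(obs) / length(obs), length(obs))

# Chi-squared statistic: sum over months of (Observed - Expected)^2 / Expected
chi2 <- sum((obs - expected)^2 / expected)
print(round(chi2, 3))
```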

Assignment Problem #21

Discuss: What would support the null hypothesis more: a small value or large value for \( \chi^{2} \)?

\[ \chi^2 = \frac{(4 -9.917)^2}{9.917}+\frac{(6 -9.917)^2}{9.917}+...+\frac{(5 -9.917)^2}{9.917} \]

Answer: Small value for \( \chi^{2} \)

Assignment Problem #21

\( \chi^2 \) test statistic - computed in R

chisq.test(x = dc[,"obs"], p = dc[,"exp"], rescale.p = TRUE)

    Chi-squared test for given probabilities

data:  dc[, "obs"]
X-squared = 20.664, df = 11, p-value = 0.03703

How is the \( P \)-value calculated?

Assignment Problem #21

The sampling distribution for the \( \chi^2 \) test statistic under the null hypothesis is well approximated by the theoretical \( \chi^2 \) distribution.

The \( \chi^2 \) distribution is actually a family of distributions

[Figure: \( \chi^2 \) distributions for several degrees of freedom]
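
To see how the shape changes with the degrees of freedom, a few members of the family can be drawn in base R. The df values below are arbitrary illustrative choices, not necessarily those used in the original figure.

```r
# Grid of chi-squared values and a few illustrative degrees of freedom
x <- seq(0, 40, length.out = 400)
dfs <- c(2, 5, 11, 20)  # assumed values, chosen only for illustration

# One column of densities per value of df
dens <- sapply(dfs, function(k) dchisq(x, df = k))

# Draw all curves on the same axes
matplot(x, dens, type = "l", lty = 1, col = 1:4,
        xlab = expression(chi^2), ylab = "Probability density")
legend("topright", legend = paste("df =", dfs), lty = 1, col = 1:4)
```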

Assignment Problem #21

Definition: The number of degrees of freedom of a \( \chi^2 \) statistic specifies which \( \chi^2 \) distribution to use as the null distribution and is given by

df = (Number of categories) - 1 - (Number of parameters estimated from data)
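
For the falling-cats example there are 12 categories (months), and the null model of equal frequencies requires no parameter to be estimated from the data, so:

\[ df = 12 - 1 - 0 = 11 \]

This agrees with the `df = 11` reported by `chisq.test`.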

Assignment Problem #21

Discuss: Define the \( P \)-value.

Definition: In this case, the \( P \)-value is the probability of getting a \( \chi^2 \) value equal to or greater than the observed \( \chi^2 \) value when the null hypothesis is true.

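[Figure: \( \chi^2 \) distribution with 11 degrees of freedom; the area to the right of the observed \( \chi^2 \) value is shaded red]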

\( P \)-value = 0.037 (shaded red area)
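
This tail area can be computed with `pchisq`. The following is a sketch that recomputes the statistic from the counts in the problem table so it is self-contained; the variable names are illustrative.

```r
# Observed counts and equal expected frequencies under H0
obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
expected <- rep(sum(obs) / length(obs), length(obs))

# Observed chi-squared statistic and its degrees of freedom
chi2_obs <- sum((obs - expected)^2 / expected)
df <- length(obs) - 1  # no parameters estimated from the data

# Upper-tail probability: Pr[chi-squared >= observed value] under H0
p_value <- pchisq(chi2_obs, df = df, lower.tail = FALSE)
print(p_value)
```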

Assignment Problem #21

\( P \)-value = 0.037024

Discuss: Conclusion?

Conclusion: Because \( P \) = 0.037 is less than \( \alpha \) = 0.05, reject \( H_{0} \), i.e. there is evidence that the frequency of cats falling is not the same in each month.

Assignment Problem #21

Another way…critical values

Definition: A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under \( H_{0} \).

Critical value for \( \alpha \) = 0.05: \( \chi_{0.05,11}^2 = 19.675 \)

[Figure: \( \chi^2 \) distribution with 11 degrees of freedom, showing the critical value 19.675]

Statistical tables

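\[ Pr[\chi^{2} \geq \chi_{0.05,11}^2] = Pr[\chi^{2} \geq 19.675] = 0.05 \]

\[ \mathrm{Observed} \ \chi^2 = 20.664 \]

Because the observed \( \chi^2 \) (20.664) exceeds the critical value (19.675), \( H_{0} \) is rejected at \( \alpha \) = 0.05.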
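
Instead of a printed table, the critical value can be looked up with `qchisq`. This is a small sketch; `alpha`, `crit`, `chi2_obs`, and `reject` are illustrative names, and the observed statistic is typed in from the test output reported earlier.

```r
alpha <- 0.05

# Value that cuts off an upper-tail area of alpha for df = 11
crit <- qchisq(1 - alpha, df = 11)
print(round(crit, 3))

# Observed statistic, copied from the chisq.test output
chi2_obs <- 20.664

# Reject H0 when the observed statistic reaches or exceeds the critical value
reject <- chi2_obs >= crit
print(reject)
```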

Important - Assumptions to use test

The sampling distribution of the \( \chi^2 \) statistic follows a \( \chi^2 \) distribution only approximately. The approximation is excellent as long as the following rules are obeyed:

  • None of the categories should have an expected frequency less than one.
  • No more than 20% of the categories should have expected frequencies less than five.
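
Both rules can be checked on the expected frequencies before running the test. The sketch below applies them to the falling-cats example, where every month has the same expected frequency of 119 / 12; `rule1` and `rule2` are illustrative names.

```r
# Expected frequencies under H0 for the 12 months
expected <- rep(119 / 12, 12)

# Rule 1: no category has an expected frequency below one
rule1 <- all(expected >= 1)

# Rule 2: at most 20% of categories have an expected frequency below five
rule2 <- mean(expected < 5) <= 0.20

print(c(rule1 = rule1, rule2 = rule2))
```

Here both checks pass, since every expected frequency is about 9.92.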