Fitting probability models to frequency data (Part 2)

M. Drew LaMar
September 25, 2020

Class Announcements

  • Reading Assignment for Monday: Whitlock & Schluter, Chapter 9: Contingency analysis: associations between categorical variables QUIZ!
    • Note: There will only be two more quizzes after this, for a total of 9 reading quizzes

Chapter 8: Fitting probability models to frequency data

From proportions and binomial distributions…

Chapter 8: Fitting probability models to frequency data

…to working with direct frequency distributions.

         Right Left
Observed    14    4
Expected     9    9

plot of chunk unnamed-chunk-3

Chi-squared goodness-of-fit test

Note: The binomial test is an example of a goodness-of-fit test.

Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.

Definition: A model in this case is a simplified, mathematical representation that mimics how we think a natural process works.

Working through an example

Assignment Problem #21

A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

Month Number fallen Month Number fallen
January 4 July 19
February 6 August 13
March 8 September 12
April 10 October 12
May 9 November 7
June 14 December 5

Our Workflow

Example - Assignment Problem #21

A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

Question: What are the null and alternative hypotheses?

Answer:
     \( H_{0} \): The frequency of cats falling is the same in each month.
     \( H_{A} \): The frequency of cats falling is not the same in each month.

Example - Assignment Problem #21

Observed and Expected Frequencies

We want to use dplyr for practice, so…

if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

Now load data and store as a tibble.

mydata <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") %>%
  tbl_df()

Example - Assignment Problem #21

Observed and Expected Frequencies

Let's peek at the data using glimpse.

glimpse(mydata)
Rows: 119
Columns: 1
$ month <chr> "January", "January", "January", "January", "February", "Februa…

We need frequencies for months of the year.

The data in this case is in tidy form, i.e. each row is an observation (a falling cat), and each column is a measurement (month).

Question: How do we get frequencies?

Example - Assignment Problem #21

Observed and Expected Frequencies

You can use the table command…

table(mydata)
mydata
    April    August  December  February   January      July      June     March 
       10        13         5         6         4        19        14         8 
      May  November   October September 
        9         7        12        12 

but the output is not a data frame!

Question: How can we get frequencies in data frame format?

Example - Assignment Problem #21

Observed and Expected Frequencies

(mytable <- mydata %>% 
  group_by(month) %>%
  summarize(obs = n()))
# A tibble: 12 x 2
   month       obs
   <chr>     <int>
 1 April        10
 2 August       13
 3 December      5
 4 February      6
 5 January       4
 6 July         19
 7 June         14
 8 March         8
 9 May           9
10 November      7
11 October      12
12 September    12

Example - Assignment Problem #21

Observed and Expected Frequencies

How do we get the months in the correct order?!?!

mytable$month <- factor(mytable$month, 
                        levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

mytable %>% arrange(month)
  • First, we reorder the factor levels.
  • Then, we used arrange on month.

Example - Assignment Problem #21

Observed and Expected Frequencies

How do we get the months in the correct order?!?!

# A tibble: 12 x 2
   month       obs
   <fct>     <int>
 1 January       4
 2 February      6
 3 March         8
 4 April        10
 5 May           9
 6 June         14
 7 July         19
 8 August       13
 9 September    12
10 October      12
11 November      7
12 December      5

Example - Assignment Problem #21

Observed and Expected Frequencies

Now let's compute the expected frequencies using mutate.

(mytable <- mytable %>%
   mutate(exp = sum(obs)/12))

Example - Assignment Problem #21

Observed and Expected Frequencies

Now let's compute the expected frequencies using mutate.

# A tibble: 12 x 3
   month       obs   exp
   <fct>     <int> <dbl>
 1 April        10  9.92
 2 August       13  9.92
 3 December      5  9.92
 4 February      6  9.92
 5 January       4  9.92
 6 July         19  9.92
 7 June         14  9.92
 8 March         8  9.92
 9 May           9  9.92
10 November      7  9.92
11 October      12  9.92
12 September    12  9.92

Example - Assignment Problem #21

Observed and Expected Frequencies

Data validation: Do the observed and expected frequencies have the same sum? Let's use summarize.

mytable %>% summarize(sum(obs),
                      sum(exp))
# A tibble: 1 x 2
  `sum(obs)` `sum(exp)`
       <int>      <dbl>
1        119        119

YES!

Example - Assignment Problem #21

Let's make a barplot. With ggplot2, you can plot directly from this tibble. For now with base R graphics, we need a matrix.

mymatrix <- mytable %>% 
  arrange(month) %>%
  select(-month) %>%
  as.matrix()

head(mymatrix)

Example - Assignment Problem #21

Let's make a barplot. With ggplot2, you can plot directly from this tibble. For now with base R graphics, we need a matrix.

     obs      exp
[1,]   4 9.916667
[2,]   6 9.916667
[3,]   8 9.916667
[4,]  10 9.916667
[5,]   9 9.916667
[6,]  14 9.916667

Rownames!!

rownames(mymatrix) <- levels(mytable$month)
head(mymatrix)

Example - Assignment Problem #21

rownames(mymatrix) <- levels(mytable$month)
head(mymatrix)
         obs      exp
January    4 9.916667
February   6 9.916667
March      8 9.916667
April     10 9.916667
May        9 9.916667
June      14 9.916667

That's better! Let's plot…

Example - Assignment Problem #21

barplot(mymatrix, beside = TRUE)

plot of chunk unnamed-chunk-14

Hmmm, let's try another way…

Example - Assignment Problem #21

barplot(t(mymatrix), 
        beside = TRUE, 
        col = c("forestgreen", 
                "goldenrod1"),
        legend.text = c("Observed",
                        "Expected"))

plot of chunk unnamed-chunk-15

Example - Assignment Problem #21

tl;dr

Example - Assignment Problem #21

tl;dr

  • You will have to manipulate the data to create the frequency distributions!
  • Consult the supplemental information for next week's homework

Example - Assignment Problem #21

Definition: The \( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by

\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]

Example - Assignment Problem #21

Definition: The \( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by

\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]

Discuss: What would support the null hypothesis more: a small value or large value for \( \chi^{2} \)?

Answer: Small value for \( \chi^{2} \)

Example - Assignment Problem #21

\( \chi^2 \) test statistic - computed in R

(mytable %>%
  mutate(tmp = ((obs-exp)^2)/exp) %>%
  summarize(chi2 = sum(tmp)))
# A tibble: 1 x 1
   chi2
  <dbl>
1  20.7

1x1 tibble?! Here's where “everything is a data frame” doesn't work.

Solution?

Example - Assignment Problem #21

\( \chi^2 \) test statistic - computed in R

Use the tidyverse package and the pull() function.

mytable %>%
  mutate(tmp = ((obs-exp)^2)/exp) %>%
  summarize(chi2 = sum(tmp)) %>%
  pull(chi2)
[1] 20.66387

The observed \( \chi^2 \) test statistic is 20.6638655.

Is that good?

Example - Assignment Problem #21

Question: What is the sampling distribution for the \( \chi^2 \) test statistic under the null hypothesis?

Answer: \( \chi^2 \) distribution (actually a family of distributions)

plot of chunk unnamed-chunk-20

Example - Assignment Problem #21

Definition: The number of degrees of freedom of a \( \chi^2 \) statistic specifies which \( \chi^2 \) distribution to use as the null distribution and is given by

df = Number of categories - 1 - Number of estimated parameters from data

This is analogous to the sample size in the null distribution for a mean and proportion. Remember, for the mean and proportion, the null distribution depends on the sample size.

Example - Assignment Problem #21

Discuss: Define the \( P \)-value.

Definition: The \( P \)-value is the probability of getting the data/test statistic (or worse) assuming the null hypothesis is true.

plot of chunk unnamed-chunk-21

\( P \)-value = 0.0370255 (shaded red area)

Example - Assignment Problem #21

A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

\( P \)-value = 0.0370255

Discuss: Conclusion?

Conclusion: Reject \( H_{0} \), i.e. there is evidence that the frequency of cats falling is not the same in each month.

Example - Assignment Problem #21

Another way…critical values

Definition: A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under \( H_{0} \).

plot of chunk unnamed-chunk-22

Critical value = \( \chi_{0.05,11}^2 = 19.6751376 \)
Red area = \( \alpha = 0.05 \)

Example - Assignment Problem #21

Statistical tables

\[ Pr[\chi^{2} \geq \chi_{0.05,11}^2] = Pr[\chi^{2} \geq 19.675] = 0.05 \] \[ \mathrm{Observed} \ \chi^2 = 20.6638655 \]

Important - Assumptions to use test

The sampling distribution of the \( \chi^2 \) statistic follows a \( \chi^2 \) distribution only approximately. Excellent approximation if the following is true:

  • None of the categories have an expected frequency less than one.
  • No more than 20% of the categories have expected frequencies less than five.

Note: Still good approximation if average expected value at least five.

Note: ALL STATISTICAL TESTS HAVE ASSUMPTIONS

Assumption common to all: random sample

Which probability models for this chapter?

  • Proportional model (e.g. Example 8.1)
  • Binomial distribution (e.g. Practice Problem #6)
  • Poisson distribution (e.g. Practice Problem #2)

Example - Practice Problem #6 (Binomial)

One thousand coins were each flipped eight times, and the number of heads was recorded for each coin. The results are below.

Number.of.heads Number.of.coins
0 6
1 32
2 105
3 186
4 236
5 201
6 98
7 33
8 103

Example - Practice Problem #6 (Binomial)

\( H_{0} \): The number of heads has a binomial distribution with \( p = 0.5 \)

\( H_{A} \): The number of heads does not have a binomial distribution with \( p = 0.5 \)

Example - Practice Problem #6 (Binomial)

Number.of.heads Observed Binomial.expectation Expected
0 6 0.0039063 3.90625
1 32 0.0312500 31.25000
2 105 0.1093750 109.37500
3 186 0.2187500 218.75000
4 236 0.2734375 273.43750
5 201 0.2187500 218.75000
6 98 0.1093750 109.37500
7 33 0.0312500 31.25000
8 103 0.0039063 3.90625
Sum 36 1000 1.0000000 1000.00000

Binomial expectation: dbinom(0:8, 8, 0.5)

Example - Practice Problem #6 (Binomial)

Number.of.heads Observed Binomial.expectation Expected
0 6 0.0039063 3.90625
1 32 0.0312500 31.25000
2 105 0.1093750 109.37500
3 186 0.2187500 218.75000
4 236 0.2734375 273.43750
5 201 0.2187500 218.75000
6 98 0.1093750 109.37500
7 33 0.0312500 31.25000
8 103 0.0039063 3.90625
Sum 36 1000 1.0000000 1000.00000

Meet assumptions?
(1) No categories with expected frequencies less than 1
(2) No more than 20% of categories with expected frequencies less than 5.

Example - Practice Problem #6 (Binomial)

Assumptions not met as 2/9, or about 22% of categories are less than 5. We'll combine 0 and 1 together, as well as 7 and 8:

Number.of.heads Observed Binomial.expectation Expected (O-E)^2/E
X 0 or 1 38 0.0351563 35.15625 0.2300278
X.1 2 105 0.1093750 109.37500 0.1750000
X.2 3 186 0.2187500 218.75000 4.9031429
X.3 4 236 0.2734375 273.43750 5.1257286
X.4 5 201 0.2187500 218.75000 1.4402857
X.5 6 98 0.1093750 109.37500 1.1830000
X.6 7 or 8 136 0.0351563 35.15625 289.2646944
Sum 1000 1.0000000 1000.00000 302.3218794

\( \chi^2 \) = 302.3218794

Example - Practice Problem #6 (Binomial)

Statistical tables

\[ df = 7 - 1 = 6 \] \[ Pr[\chi^{2} \geq \chi_{0.05,6}^2] = Pr[\chi^{2} \geq 12.59] = 0.05 \] \[ \mathrm{Observed} \ \chi^2 = 302.3218794 \]

Quick way using R

chisq.test(df$Observed, p = df$`Binomial.expectation`)

    Chi-squared test for given probabilities

data:  df$Observed
X-squared = 302.32, df = 6, p-value < 2.2e-16
    Number.of.heads Observed Binomial.expectation  Expected (O-E)^2/E
X            0 or 1       38           0.03515625  35.15625 0.2300278
X.1               2      105           0.10937500 109.37500 0.1750000
X.2               3      186           0.21875000 218.75000 4.9031429
X.3               4      236           0.27343750 273.43750 5.1257286
X.4               5      201           0.21875000 218.75000 1.4402857
X.5               6       98           0.10937500 109.37500 1.1830000

Last words

We will spend some time in lab next week discussing how to compute P-values using R (i.e. without statistical tables).

Watch this video on the Poisson distribution for a walkthrough of Example 8.6.