Fitting probability models to frequency data (Part 2)

M. Drew LaMar
October 10, 2018

“Is it fall break yet?”
- Every student here

Class Announcements

Exam #1 solutions are posted on Blackboard under “Assignments” (after HW #3)
- PLEASE DO NOT DISTRIBUTE!!!
Reading Assignment for Friday: Whitlock & Schluter, Chapter 9: Contingency analysis: associations between categorical variables (QUIZ!)
- Note: There will only be one more quiz after this, for a total of 10 reading quizzes
Homework #5: Lots of supplemental (hopefully helpful) info on QUBES

Class Announcements

Homeworks will start to be graded soon! Finally got the grader set up.
- Grade for each HW will be:
  - 70% completion: Did you make an honest attempt at all the problems?
  - 30% correctness and clarity: Two “random” problems will be graded worth 15% each
We will start collecting data for the sleep study this Friday
NOTE: Textbook is on reserve at the library (both editions)

Chi-squared goodness-of-fit test

Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.

Working through an example

Assignment Problem #21

A more recent study of Feline High-Rise Syndrome (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

Month	Number fallen	Month	Number fallen
January	4	July	19
February	6	August	13
March	8	September	12
April	10	October	12
May	9	November	7
June	14	December	5

Our Workflow

Example - Assignment Problem #21

Question: What are the null and alternative hypotheses?

Answer:
$ H_{0} $: The frequency of cats falling is the same in each month.
$ H_{A} $: The frequency of cats falling is not the same in each month.

And then an R miracle occurred

library(dplyr)  # provides %>%, mutate(), group_by(), summarize(), and n()
(mytable <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") %>%
  mutate(month = factor(month, 
                        levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))) %>% 
  group_by(month) %>% 
  summarize(obs = n()) %>% 
  mutate(exp = sum(obs)/12))

# A tibble: 12 x 3
   month       obs   exp
   <fct>     <int> <dbl>
 1 January       4  9.92
 2 February      6  9.92
 3 March         8  9.92
 4 April        10  9.92
 5 May           9  9.92
 6 June         14  9.92
 7 July         19  9.92
 8 August       13  9.92
 9 September    12  9.92
10 October      12  9.92
11 November      7  9.92
12 December      5  9.92

Example - Assignment Problem #21

mytable %>% 
  select(-month) %>% 
  as.matrix() %>% 
  t() %>%
  barplot(beside = TRUE,
          col = c("forestgreen", "goldenrod1"),
          legend.text = c("Observed", "Expected"),
          names.arg = mytable$month)

[Figure: grouped bar plot of observed and expected numbers of cats fallen, by month]

Example - Assignment Problem #21

tl;dr

You will have to manipulate the data to create the frequency distributions!
Consult the supplemental information on QUBES for Homework #5
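
The manipulation step can be sketched in base R without downloading anything. The sketch assumes the raw data arrive as one row per fallen cat with a `month` column, which is how the pipeline above treats them when it calls `n()`. The `raw` data frame here is hypothetical and is rebuilt from the table of counts purely for illustration.

```r
# Monthly counts copied from the table in Assignment Problem #21
counts <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

# Hypothetical raw data: one row per fallen cat (assumed layout of the CSV)
raw <- data.frame(month = rep(month.name, times = counts))

# Fix the category order; otherwise table() sorts the months alphabetically
raw$month <- factor(raw$month, levels = month.name)

# Observed frequencies, and expected frequencies under equal monthly rates
obs <- as.vector(table(raw$month))
exp_freq <- rep(sum(obs) / length(obs), length(obs))

freq <- data.frame(month = month.name, obs = obs, exp = exp_freq)
print(freq)
```

Setting the factor levels explicitly matters for the same reason it does in the dplyr version: the categories should appear in calendar order rather than alphabetical order.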

Example - Assignment Problem #21

Definition: The $ \chi^2 $ statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by

\[ \chi^2 = \sum_{i}\frac{(\mathrm{Observed}_{i} - \mathrm{Expected}_{i})^2}{\mathrm{Expected}_{i}} \]

Discuss: What would support the null hypothesis more: a small value or large value for $ \chi^{2} $?

Answer: Small value for $ \chi^{2} $
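
To see why a small value favors the null hypothesis, here is a minimal sketch with made-up numbers. The helper `chi2_stat` and the three hypothetical data sets are illustrative assumptions, not part of the cat data.

```r
# Chi-squared discrepancy between observed and expected frequencies
chi2_stat <- function(obs, exp) sum((obs - exp)^2 / exp)

expected <- rep(10, 4)          # hypothetical expected frequencies

perfect <- c(10, 10, 10, 10)    # matches the null expectation exactly
close   <- c(9, 11, 10, 10)     # small departures from expectation
far     <- c(2, 18, 5, 15)      # large departures from expectation

print(chi2_stat(perfect, expected))  # 0
print(chi2_stat(close, expected))    # (1 + 1 + 0 + 0) / 10 = 0.2
print(chi2_stat(far, expected))      # (64 + 64 + 25 + 25) / 10 = 17.8
```

The statistic is zero only when every observed frequency equals its expected frequency, and it grows as the data move away from what the null hypothesis predicts.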

Example - Assignment Problem #21

$ \chi^2 $ test statistic - computed in R

(mytable %>%
  mutate(tmp = ((obs-exp)^2)/exp) %>%
  summarize(chi2 = sum(tmp)))

# A tibble: 1 x 1
   chi2
  <dbl>
1  20.7

1x1 tibble?! Here's where “everything is a data frame” doesn't work.

Solution?

Example - Assignment Problem #21

magrittr package

“The Treachery of Images” by René Magritte

Example - Assignment Problem #21

$ \chi^2 $ test statistic - computed in R

Use the magrittr package and the exposition pipe operator %$%.

library(magrittr)  # provides the exposition pipe %$%
mytable %>%
  mutate(tmp = ((obs-exp)^2)/exp) %>%
  summarize(chi2 = sum(tmp)) %$%
  chi2

[1] 20.66387

The observed $ \chi^2 $ test statistic is 20.6638655.
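
For comparison, base R's `with()` does a similar job to the exposition pipe: it evaluates an expression using the columns of a data frame and returns a plain number. The sketch below is an alternative, not the approach used in these slides, and it rebuilds the frequencies from the table of counts rather than from the CSV file.

```r
# Observed counts from the problem's table; expected under equal monthly rates
obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
freq <- data.frame(obs = obs, exp = sum(obs) / length(obs))

# with() looks up obs and exp inside freq and returns a numeric value,
# not a 1x1 table
chi2 <- with(freq, sum((obs - exp)^2 / exp))
print(chi2)
```

The result should agree with the 20.66387 reported above.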

Is that good?

Example - Assignment Problem #21

Question: What is the sampling distribution for the $ \chi^2 $ test statistic under the null hypothesis?

Answer: $ \chi^2 $ distribution (actually a family of distributions)

[Figure: $ \chi^2 $ distributions]
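
One way to picture the family is to draw several of its members on the same axes. Each member is indexed by a degrees-of-freedom parameter, passed to `dchisq()` as `df`. The particular values of `df` and the plotting range below are arbitrary choices for illustration and are not taken from the original figure.

```r
x   <- seq(0, 40, by = 0.1)     # plotting range (arbitrary)
dfs <- c(2, 5, 11, 20)          # illustrative degrees of freedom

# One column of density values per member of the family
dens <- sapply(dfs, function(d) dchisq(x, df = d))

matplot(x, dens, type = "l", lty = 1, col = seq_along(dfs),
        xlab = "Chi-squared value", ylab = "Probability density")
legend("topright", legend = paste("df =", dfs),
       col = seq_along(dfs), lty = 1)
```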

Example - Assignment Problem #21

Definition: The number of degrees of freedom of a $ \chi^2 $ statistic specifies which $ \chi^2 $ distribution to use as the null distribution and is given by

df = Number of categories - 1 - Number of estimated parameters from data

The degrees of freedom play a role analogous to that of the sample size for a mean or a proportion: there the null distribution depends on the sample size, and here it depends on df.
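
As a rough check on both the degrees of freedom and the claimed null distribution, the sketch below simulates data sets under $ H_{0} $ and compares the simulated statistics with a $ \chi^2 $ distribution. It assumes the null hypothesis fully specifies the monthly probabilities (1/12 each), so no parameters are estimated from the data. The seed and the number of simulations are arbitrary choices.

```r
set.seed(1)                   # arbitrary seed, for reproducibility only

n_cats      <- 119            # sample size from the problem
k           <- 12             # number of categories (months)
n_estimated <- 0              # H0 fixes every probability at 1/12
df <- k - 1 - n_estimated     # 12 - 1 - 0 = 11

expected <- rep(n_cats / k, k)

# 10,000 data sets drawn under H0; each column is one simulated year of falls
sims <- rmultinom(10000, size = n_cats, prob = rep(1 / k, k))

# Chi-squared statistic for every simulated data set
chi2_sim <- colSums((sims - expected)^2 / expected)

# The simulated mean should be near df, the mean of a chi-squared distribution
print(mean(chi2_sim))

# Simulated quantiles next to the theoretical chi-squared quantiles
print(quantile(chi2_sim, c(0.50, 0.90, 0.95)))
print(qchisq(c(0.50, 0.90, 0.95), df = df))
```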

Example - Assignment Problem #21

Discuss: Define the $ P $-value.

Definition: The $ P $-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. For the $ \chi^2 $ test, "at least as extreme" means at least as large.

[Figure: null $ \chi^2 $ distribution with the $ P $-value shaded in red]

$ P $-value = 0.0370255 (shaded red area)
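
The shaded area can be computed with `pchisq()`, asking for the upper tail. This is a minimal sketch that recomputes the test statistic from the table of counts. It assumes 11 degrees of freedom, that is, 12 categories minus 1 with no parameters estimated from the data.

```r
obs      <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
expected <- rep(sum(obs) / length(obs), length(obs))

chi2 <- sum((obs - expected)^2 / expected)
df   <- length(obs) - 1       # no parameters estimated from the data

# Upper-tail probability: Pr[chi-squared >= observed statistic] under H0
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
print(p_value)
```

The printed value should agree with the $ P $-value reported above.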

Example - Assignment Problem #21

$ P $-value = 0.0370255

Discuss: Conclusion?

Conclusion: Since $ P = 0.037 < 0.05 $, reject $ H_{0} $; i.e., there is evidence that the frequency of cats falling is not the same in each month.
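
The whole calculation can be cross-checked with base R's `chisq.test()`, which tests against equal category probabilities by default when given a vector of counts. This is offered as a sanity check under that default, not as a replacement for working through the steps by hand.

```r
obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

# With no p argument, chisq.test() assumes equal probabilities for all categories
res <- chisq.test(obs)

print(res$statistic)   # chi-squared test statistic
print(res$parameter)   # degrees of freedom
print(res$p.value)     # P-value
print(res$expected)    # expected frequencies under H0
```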

Example - Assignment Problem #21

Another way…critical values

Definition: A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under $ H_{0} $.

[Figure: null $ \chi^2 $ distribution with the area to the right of the critical value shaded in red]

Critical value = $ \chi_{0.05,11}^2 = 19.6751376 $
Red area = $ \alpha = 0.05 $
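
The critical value is a quantile of the null distribution, so `qchisq()` can recover it. The sketch below takes $ \alpha = 0.05 $ and 11 degrees of freedom from this example; the observed statistic is the rounded value from the earlier output.

```r
alpha <- 0.05
df    <- 11

# Value that cuts off an upper-tail area of alpha under the null distribution
crit <- qchisq(alpha, df = df, lower.tail = FALSE)
print(crit)

chi2_obs <- 20.66387   # observed test statistic, rounded, from the earlier output

# Reject H0 when the observed statistic falls beyond the critical value
print(chi2_obs > crit)
```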

Example - Assignment Problem #21

Statistical tables

\[ Pr[\chi^{2} \geq \chi_{0.05,11}^2] = Pr[\chi^{2} \geq 19.675] = 0.05 \]
\[ \mathrm{Observed} \ \chi^2 = 20.6638655 \]

Important - Assumptions to use test

The sampling distribution of the $ \chi^2 $ statistic follows a $ \chi^2 $ distribution only approximately. The approximation is excellent if both of the following hold:

None of the categories have an expected frequency less than one.
No more than 20% of the categories have expected frequencies less than five.

Note: The approximation is still good if the average expected frequency is at least five.
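
These rules of thumb are easy to check in code before trusting the test. A minimal sketch for the falling-cats example follows, using the expected frequencies under equal monthly rates; the variable names are illustrative.

```r
obs      <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
expected <- rep(sum(obs) / length(obs), length(obs))

# Rule 1: no category has an expected frequency below one
none_below_one <- all(expected >= 1)

# Rule 2: at most 20% of categories have an expected frequency below five
share_below_five <- mean(expected < 5)
few_below_five   <- share_below_five <= 0.20

print(c(none_below_one = none_below_one, few_below_five = few_below_five))
```

Note that the rules apply to the expected frequencies, not the observed ones: January has only 4 observed falls, but its expected frequency is 119/12, about 9.92.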

Note: ALL STATISTICAL TESTS HAVE ASSUMPTIONS

Assumption common to all: random sample