Fitting probability models to frequency data (Part I)

M. Drew LaMar
February 24, 2017


https://xkcd.com/882/

Class Announcements

  • No reading assignment for Monday (Yay!)
  • Haven't finished grading exams (Boo!)

One last thing on binomial tests...

Use binom.test to do a binomial test! It's more accurate. If our observed test statistic is \( X = 16 \) successes out of \( n = 17 \) trials, and our null hypothesized proportion is \( p_{0} = 0.5 \), then we have:

binom.test(16,
           n = 17,
           p = 0.5)

    Exact binomial test

data:  16 and 17
number of successes = 16, number of trials = 17, p-value =
0.0002747
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.7131106 0.9985118
sample estimates:
probability of success 
             0.9411765 
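
To see where that p-value comes from, here is a minimal sketch that sums the binomial probabilities by hand. It assumes the two-sided p-value can be found by doubling the upper tail, which works here only because the null distribution is symmetric when \( p_{0} = 0.5 \). The variable names are illustrative and not from the original slides.

```r
# Hand computation of the two-sided p-value for X = 16, n = 17, p0 = 0.5
n  <- 17
x  <- 16
p0 <- 0.5

# P(X >= 16) under the null hypothesis
upper_tail <- sum(dbinom(x:n, size = n, prob = p0))

# Assumption: doubling the tail is valid because the null
# distribution is symmetric (p0 = 0.5)
p_by_hand <- 2 * upper_tail

# Compare against the p-value reported by binom.test
p_exact <- binom.test(x, n = n, p = p0)$p.value
print(c(by_hand = p_by_hand, binom_test = p_exact))
```

For a null proportion other than 0.5 the two tails are no longer mirror images, so simple doubling and binom.test need not agree.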

Chapter 8: Fitting probability models to frequency data

From proportions and binomial distributions…

…to working with direct frequency distributions.

         Right Left
Observed    14    4
Expected     9    9
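
The expected row can be rebuilt from the observed counts and a null model. This is a small sketch that assumes a null probability of 0.5 for each outcome, which is consistent with the expected counts of 9 and 9 shown in the table. The object names are made up for illustration.

```r
# Observed counts from the table above
observed <- c(Right = 14, Left = 4)

# Assumed null model: each outcome has probability 0.5
null_prop <- c(Right = 0.5, Left = 0.5)

# Expected frequency = total count * null probability
expected <- sum(observed) * null_prop

# Stack the two rows into a table like the one above
freq_table <- rbind(Observed = observed, Expected = expected)
print(freq_table)
```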

Chi-squared goodness-of-fit test

Note: The binomial test is an example of a goodness-of-fit test.

Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.

Definition: A model in this case is a simplified, mathematical representation that mimics how we think a natural process works.

Working through an example

Assignment Problem #21

A more recent study of Feline High-Rise Syndrome (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cats falling varies between months of the year?

Month      Number fallen    Month       Number fallen
January                4    July                   19
February               6    August                 13
March                  8    September              12
April                 10    October                12
May                    9    November                7
June                  14    December                5

Our Workflow

Example - Assignment Problem #21

Question: What are the null and alternative hypotheses?

Answer:
     \( H_{0} \): The frequency of cats falling is the same in each month.
     \( H_{A} \): The frequency of cats falling is not the same in each month.
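
Under \( H_{0} \), the expected frequencies follow directly from the total count. The sketch below assumes every month has the same probability, 1/12, which ignores the fact that months differ in length. The counts are copied from the problem's table and the variable names are illustrative.

```r
# Monthly counts copied from the table in the problem statement
obs <- c(January = 4, February = 6, March = 8, April = 10, May = 9,
         June = 14, July = 19, August = 13, September = 12,
         October = 12, November = 7, December = 5)

total <- sum(obs)  # 119 cats in all

# Assumption: H0 gives each month probability 1/12 (month lengths ignored)
null_prop <- rep(1 / 12, times = 12)

# Expected frequency for each month under H0
expected <- total * null_prop
names(expected) <- names(obs)
print(round(expected, 2))
```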

Example - Assignment Problem #21

Observed and Expected Frequencies

We want to use dplyr for practice, so…

if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

Now load data and store as a tibble.

mydata <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") %>%
  tbl_df()

Example - Assignment Problem #21

Observed and Expected Frequencies

Let's peek at the data using glimpse.

glimpse(mydata)
Observations: 119
Variables: 1
$ month <fctr> January, January, January, January, February, February,...

We need frequencies for months of the year.

The data here are in tidy form: each row is an observation (a falling cat) and each column is a variable (here, month).

Question: How do we get frequencies?

Example - Assignment Problem #21

Observed and Expected Frequencies

You can use the table command…

table(mydata)
mydata
    April    August  December  February   January      July      June 
       10        13         5         6         4        19        14 
    March       May  November   October September 
        8         9         7        12        12 

…but the output is not a data frame!

Question: How can we get frequencies in data frame format?

Example - Assignment Problem #21

Observed and Expected Frequencies

(mytable <- mydata %>% 
  group_by(month) %>%
  summarize(obs = n()))
# A tibble: 12 × 2
       month   obs
      <fctr> <int>
1      April    10
2     August    13
3   December     5
4   February     6
5    January     4
6       July    19
7       June    14
8      March     8
9        May     9
10  November     7
11   October    12
12 September    12
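
As an aside, dplyr also has count(), which does the grouping and tallying in one step. The sketch below uses a hypothetical stand-in for mydata, rebuilt from the counts shown above so that it runs without downloading the CSV. The names cats and freq are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for mydata: one row per fallen cat,
# rebuilt from the monthly counts shown above
counts <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
cats <- data.frame(month = rep(month.name, times = counts))

# count() groups by month and tallies the rows in one step;
# name = "obs" matches the column name used above
freq <- cats %>% count(month, name = "obs")
print(freq)
```

The result should match the group_by plus summarize version, so which one to use is mostly a matter of taste.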

Example - Assignment Problem #21

Observed and Expected Frequencies

How do we get the months in the correct order?!?!

mytable$month <- factor(mytable$month, 
                        levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

mytable %>% arrange(month)

  • First, we reorder the factor levels.
  • Then, we use arrange on month.

# A tibble: 12 × 2
       month   obs
      <fctr> <int>
1    January     4
2   February     6
3      March     8
4      April    10
5        May     9
6       June    14
7       July    19
8     August    13
9  September    12
10   October    12
11  November     7
12  December     5
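
Typing out all twelve month names is error-prone. A possible shortcut, sketched here in base R, is the built-in constant month.name, which already holds the English month names in calendar order. The data frame freq is a stand-in built from the alphabetical table above, not an object from the original slides.

```r
# Counts by month in alphabetical order, as in the unsorted table above
freq <- data.frame(
  month = c("April", "August", "December", "February", "January", "July",
            "June", "March", "May", "November", "October", "September"),
  obs   = c(10, 13, 5, 6, 4, 19, 14, 8, 9, 7, 12, 12)
)

# month.name is a built-in character vector: "January", ..., "December"
freq$month <- factor(freq$month, levels = month.name)

# order() on a factor sorts by level order, not alphabetically
freq_sorted <- freq[order(freq$month), ]
print(freq_sorted)
```

With dplyr loaded, arrange(month) after the same factor() call should give the same ordering.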