Fitting probability models to frequency data (Part I)

M. Drew LaMar
October 11, 2019


https://xkcd.com/882/

Class Announcements

  • Homework #5 and Lab #6
    • Due Monday, October 21, 11:59pm pm

Back to sampling distribution

What is the mean, variance and standard deviation of a binomial random variable \( X \)?

alt text

Definition: Distribution of sample estimates is the sampling distribution.

\[ \hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{\hat{X}}{n} \]

Back to sampling distribution

\[ \hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{1}{n}\hat{X} \]

  • \( \hat{X} \) is a binomial random variable
  • \( \hat{p} \) is thus a scaled binomial random variable
  • Standard deviation of \( \hat{X} \) is \( \sigma_{\hat{X}} = \sqrt{np(1-p)} \)
  • From rules for scaled random variables, the standard deviation of \( \hat{p} \) is thus

    \[ \sigma_{\hat{p}} = \frac{1}{n}\sqrt{np(1-p)} = \sqrt{\frac{p(1-p)}{n}} \]

This is the standard error for \( \hat{p} \)!!!

Sampling distribution for a proportion

plot of chunk unnamed-chunk-2

Binomial test - Definitions

Definition: The binomial test uses data to test whether a population proportion (\( p \)) matches a null expectation (\( p_{0} \)) for the proportion.

Definition: The null hypothesis \( H_{0} \) and alternative hypothesis \( H_{A} \) for a binomial test are given by:

    \( H_{0} \): Relative frequency of successes in population is \( p_{0} \).
    \( H_{A} \): Relative frequency of successes in population is not \( p_{0} \).

Example - Practice Problem #2

Do people typically use a particular ear preferentially when listening to strangers? Marzoli and Tomassi (2009) had a researcher approach and speak to strangers in a noisy nightclub. An observer scored whether the person approached turned either the left or right ear toward the questioner. Of 25 participants, 19 turned the right ear toward the questioner and 6 offered the left ear. Is this evidence of population difference from 50% for each ear?

Discuss: What is the null and alternative hypotheses?

Answer:
\[ \begin{array}{ll} H_{0}\,: & p = 0.5 \\ H_{A}\,: & p \neq 0.5 \end{array} \]

Example - Practice Problem #2

Do people typically use a particular ear preferentially when listening to strangers? Marzoli and Tomassi (2009) had a researcher approach and speak to strangers in a noisy nightclub. An observer scored whether the person approached turned either the left or right ear toward the questioner. Of 25 participants, 19 turned the right ear toward the questioner and 6 offered the left ear. Is this evidence of population difference from 50% for each ear?

Discuss: What is the observed value of the test statistic?

Answer: Number of right ears is 19 (\( \hat{X}=19 \)).

Example - Practice Problem #2

Discuss: Under the null hypothesis, calculate the probability of getting exactly 19 right ears and six left ears.

(prob <- dbinom(x = 19, size = 25, prob = 0.5))
[1] 0.005277991

\[ \mathrm{Pr[19]} = \left(\begin{array}{c}{25 \\ 19}\end{array}\right)0.5^{19}0.5^{6} = 0.005278 \]

Example - Practice Problem #2

Discuss: List all possible outcomes in which the number of right ears is greater than the 19 observed.

Answer: 20, 21, 22, 23, 24, 25

Discuss: Calculate the probability under the null hypothesis of each of the extreme outcomes listed above

(probs <- dbinom(x = 20:25, size = 25, prob = 0.5))
[1] 1.583397e-03 3.769994e-04 6.854534e-05 8.940697e-06 7.450581e-07
[6] 2.980232e-08

Example - Practice Problem #2

Discuss: Use the addition rule to calculate the probability of 19 or more right-eared turns under the null hypothesis.

(extreme_probs <- dbinom(x = 19:25, size = 25, prob = 0.5))
[1] 5.277991e-03 1.583397e-03 3.769994e-04 6.854534e-05 8.940697e-06
[6] 7.450581e-07 2.980232e-08
sum(extreme_probs)
[1] 0.007316649

Example - Practice Problem #2

Discuss: Give the two-tailed \( P \)-value based on your previous answer.

(pval <- 2*sum(extreme_probs))
[1] 0.0146333

Discuss: State your conclusion.

Answer: Using a significance level of \( \alpha = 0.05 \), we reject \( H_{0} \) since \( P < 0.05 \). There is evidence that more people use the right ear than the left ear when listening to a stranger in the noisy nightclub.

One last thing on binomial tests...

Use binom.test to do a binomial test! It's more accurate. If our observed test statistic is \( X = 19 \) successes out of \( n = 25 \) trials, and our null hypothesized proportion is \( p_{0} = 0.5 \), then we have:

binom.test(19,
           n = 25,
           p = 0.5)

One last thing on binomial tests...

Use binom.test to do a binomial test! It's more accurate. If our observed test statistic is \( X = 19 \) successes out of \( n = 25 \) trials, and our null hypothesized proportion is \( p_{0} = 0.5 \), then we have:


    Exact binomial test

data:  19 and 25
number of successes = 19, number of trials = 25, p-value = 0.01463
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5487120 0.9064356
sample estimates:
probability of success 
                  0.76 

Chapter 8: Fitting probability models to frequency data

From proportions and binomial distributions…

Chapter 8: Fitting probability models to frequency data

…to working with direct frequency distributions.

         Right Left
Observed    14    4
Expected     9    9

plot of chunk unnamed-chunk-10

Chi-squared goodness-of-fit test

Note: The binomial test is an example of a goodness-of-fit test.

Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.

Definition: A model in this case is a simplified, mathematical representation that mimics how we think a natural process works.

Working through an example

Assignment Problem #21

A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

Month Number fallen Month Number fallen
January 4 July 19
February 6 August 13
March 8 September 12
April 10 October 12
May 9 November 7
June 14 December 5

Our Workflow

Example - Assignment Problem #21

A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

Question: What are the null and alternative hypotheses?

Answer:
     \( H_{0} \): The frequency of cats falling is the same in each month.
     \( H_{A} \): The frequency of cats falling is not the same in each month.

Example - Assignment Problem #21

Observed and Expected Frequencies

We want to use dplyr for practice, so…

if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

Now load data and store as a tibble.

mydata <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") %>%
  tbl_df()

Example - Assignment Problem #21

Observed and Expected Frequencies

Let's peek at the data using glimpse.

glimpse(mydata)
Observations: 119
Variables: 1
$ month <fct> January, January, January, January, February, February, Fe…

We need frequencies for months of the year.

The data in this case is in tidy form, i.e. each row is an observation (a falling cat), and each column is a measurement (month).

Question: How do we get frequencies?

Example - Assignment Problem #21

Observed and Expected Frequencies

You can use the table command…

table(mydata)
mydata
    April    August  December  February   January      July      June 
       10        13         5         6         4        19        14 
    March       May  November   October September 
        8         9         7        12        12 

but the output is not a data frame!

Question: How can we get frequencies in data frame format?

Example - Assignment Problem #21

Observed and Expected Frequencies

(mytable <- mydata %>% 
  group_by(month) %>%
  summarize(obs = n()))
# A tibble: 12 x 2
   month       obs
   <fct>     <int>
 1 April        10
 2 August       13
 3 December      5
 4 February      6
 5 January       4
 6 July         19
 7 June         14
 8 March         8
 9 May           9
10 November      7
11 October      12
12 September    12

Example - Assignment Problem #21

Observed and Expected Frequencies

How do we get the months in the correct order?!?!

mytable$month <- factor(mytable$month, 
                        levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

mytable %>% arrange(month)
  • First, we reorder the factor levels.
  • Then, we used arrange on month.

Example - Assignment Problem #21

Observed and Expected Frequencies

How do we get the months in the correct order?!?!

# A tibble: 12 x 2
   month       obs
   <fct>     <int>
 1 January       4
 2 February      6
 3 March         8
 4 April        10
 5 May           9
 6 June         14
 7 July         19
 8 August       13
 9 September    12
10 October      12
11 November      7
12 December      5

Example - Assignment Problem #21

Observed and Expected Frequencies

Now let's compute the expected frequencies using mutate.

(mytable <- mytable %>%
   mutate(exp = sum(obs)/12))

Example - Assignment Problem #21

Observed and Expected Frequencies

Now let's compute the expected frequencies using mutate.

# A tibble: 12 x 3
   month       obs   exp
   <fct>     <int> <dbl>
 1 April        10  9.92
 2 August       13  9.92
 3 December      5  9.92
 4 February      6  9.92
 5 January       4  9.92
 6 July         19  9.92
 7 June         14  9.92
 8 March         8  9.92
 9 May           9  9.92
10 November      7  9.92
11 October      12  9.92
12 September    12  9.92

Example - Assignment Problem #21

Observed and Expected Frequencies

Data validation: Do the observed and expected frequencies have the same sum? Let's use summarize.

mytable %>% summarize(sum(obs),
                      sum(exp))
# A tibble: 1 x 2
  `sum(obs)` `sum(exp)`
       <int>      <dbl>
1        119        119

YES!

Example - Assignment Problem #21

Let's make a barplot. With ggplot2, you can plot directly from this tibble. For now with base R graphics, we need a matrix.

mymatrix <- mytable %>% 
  arrange(month) %>%
  select(-month) %>%
  as.matrix()

head(mymatrix)

Example - Assignment Problem #21

Let's make a barplot. With ggplot2, you can plot directly from this tibble. For now with base R graphics, we need a matrix.

     obs      exp
[1,]   4 9.916667
[2,]   6 9.916667
[3,]   8 9.916667
[4,]  10 9.916667
[5,]   9 9.916667
[6,]  14 9.916667

Rownames!!

rownames(mymatrix) <- levels(mytable$month)
head(mymatrix)

Example - Assignment Problem #21

rownames(mymatrix) <- levels(mytable$month)
head(mymatrix)
         obs      exp
January    4 9.916667
February   6 9.916667
March      8 9.916667
April     10 9.916667
May        9 9.916667
June      14 9.916667

That's better! Let's plot…

Example - Assignment Problem #21

barplot(mymatrix, beside = TRUE)

plot of chunk unnamed-chunk-21

Hmmm, let's try another way…

Example - Assignment Problem #21

barplot(t(mymatrix), 
        beside = TRUE, 
        col = c("forestgreen", 
                "goldenrod1"),
        legend.text = c("Observed",
                        "Expected"))

plot of chunk unnamed-chunk-22