M. Drew LaMar
October 11, 2021
From proportions and binomial distributions…
…to working with direct frequency distributions.
Right Left
Observed 14 4
Expected 9 9
Note: The binomial test is an example of a
goodness-of-fit test .
Definition: A
goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.
Definition: A
model in this case is a simplified, mathematical representation that mimics how we think a natural process works.
Assignment Problem #21
A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?
Month | Number fallen | Month | Number fallen |
---|---|---|---|
January | 4 | July | 19 |
February | 6 | August | 13 |
March | 8 | September | 12 |
April | 10 | October | 12 |
May | 9 | November | 7 |
June | 14 | December | 5 |
A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?
Question: What are the null and alternative hypotheses?
Answer:
\( H_{0} \): The frequency of cats falling is the same in each month.
\( H_{A} \): The frequency of cats falling isnot the same in each month.
Observed and Expected Frequencies
We want to use dplyr
for practice, so…
if (!require(dplyr)) {
install.packages("dplyr")
library(dplyr)
}
Now load data and store as a tibble.
mydata <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") %>%
tbl_df()
Observed and Expected Frequencies
Let's peek at the data using glimpse
.
glimpse(mydata)
Rows: 119
Columns: 1
$ month <chr> "January", "January", "January", "January", "February", "Februar…
We need frequencies for months of the year.
The data in this case is in tidy form, i.e. each row is an observation (a falling cat), and each column is a measurement (month).
Question: How do we get frequencies?
Observed and Expected Frequencies
You can use the table
command…
table(mydata)
mydata
April August December February January July June March
10 13 5 6 4 19 14 8
May November October September
9 7 12 12
…but the output is not a data frame!
Question: How can we get frequencies in data frame format?
Observed and Expected Frequencies
(mytable <- mydata %>%
group_by(month) %>%
summarize(obs = n()))
# A tibble: 12 x 2
month obs
<chr> <int>
1 April 10
2 August 13
3 December 5
4 February 6
5 January 4
6 July 19
7 June 14
8 March 8
9 May 9
10 November 7
11 October 12
12 September 12
Observed and Expected Frequencies
How do we get the months in the correct order?!?!
mytable$month <- factor(mytable$month,
levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
mytable %>% arrange(month)
arrange
on month
.Observed and Expected Frequencies
How do we get the months in the correct order?!?!
# A tibble: 12 x 2
month obs
<fct> <int>
1 January 4
2 February 6
3 March 8
4 April 10
5 May 9
6 June 14
7 July 19
8 August 13
9 September 12
10 October 12
11 November 7
12 December 5
Observed and Expected Frequencies
Now let's compute the expected frequencies using mutate
.
(mytable <- mytable %>%
mutate(exp = sum(obs)/12))
Observed and Expected Frequencies
Now let's compute the expected frequencies using mutate
.
# A tibble: 12 x 3
month obs exp
<fct> <int> <dbl>
1 April 10 9.92
2 August 13 9.92
3 December 5 9.92
4 February 6 9.92
5 January 4 9.92
6 July 19 9.92
7 June 14 9.92
8 March 8 9.92
9 May 9 9.92
10 November 7 9.92
11 October 12 9.92
12 September 12 9.92
Observed and Expected Frequencies
Data validation: Do the observed and expected frequencies have the same sum? Let's use summarize
.
mytable %>% summarize(sum(obs),
sum(exp))
# A tibble: 1 x 2
`sum(obs)` `sum(exp)`
<int> <dbl>
1 119 119
YES!
Let's make a barplot. With ggplot2
, you can plot directly from this tibble. For now with base R graphics, we need a matrix.
mymatrix <- mytable %>%
arrange(month) %>%
select(-month) %>%
as.matrix()
head(mymatrix)
Let's make a barplot. With ggplot2
, you can plot directly from this tibble. For now with base R graphics, we need a matrix.
obs exp
[1,] 4 9.916667
[2,] 6 9.916667
[3,] 8 9.916667
[4,] 10 9.916667
[5,] 9 9.916667
[6,] 14 9.916667
Rownames!!
rownames(mymatrix) <- levels(mytable$month)
head(mymatrix)
rownames(mymatrix) <- levels(mytable$month)
head(mymatrix)
obs exp
January 4 9.916667
February 6 9.916667
March 8 9.916667
April 10 9.916667
May 9 9.916667
June 14 9.916667
That's better! Let's plot…
barplot(mymatrix, beside = TRUE)
Hmmm, let's try another way…
barplot(t(mymatrix),
beside = TRUE,
col = c("forestgreen",
"goldenrod1"),
legend.text = c("Observed",
"Expected"))
Definition: The
\( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by
\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]
Definition: The
\( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by
\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]
Discuss: What would support the null hypothesis more: a small value or large value for \( \chi^{2} \)?
Answer: Small value for \( \chi^{2} \)
\( \chi^2 \) test statistic - computed in R
(mytable %>%
mutate(tmp = ((obs-exp)^2)/exp) %>%
summarize(chi2 = sum(tmp)))
# A tibble: 1 x 1
chi2
<dbl>
1 20.7
1x1 tibble?! Here's where “everything is a data frame” doesn't work.
Solution?
magrittr
package
“The Treachery of Images” by René Magritte
\( \chi^2 \) test statistic - computed in R
Use the tidyverse
package and the pull()
function.
mytable %>%
mutate(tmp = ((obs-exp)^2)/exp) %>%
summarize(chi2 = sum(tmp)) %>%
pull(chi2)
[1] 20.66387
The observed \( \chi^2 \) test statistic is 20.6638655.
Is that good?
Question: What is the sampling distribution for the \( \chi^2 \) test statistic under the null hypothesis?
Answer: \( \chi^2 \) distribution (actually a
family of distributions)
Definition: The number of
degrees of freedom of a \( \chi^2 \) statistic specifies which \( \chi^2 \) distribution to use as the null distribution and is given by
df = Number of categories - 1 - Number of estimated parameters from data
This is analogous to the sample size in the null distribution for a mean and proportion. Remember, for the mean and proportion, the null distribution depends on the sample size.
Discuss: Define the \( P \)-value.
Definition: The \( P \)-value is the probability of getting the data/test statistic (or worse) assuming the null hypothesis is true.
\( P \)-value = 0.0370255 (shaded red area)
A more recent study of Feline High-Rise Syndrom (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?
\( P \)-value = 0.0370255
Discuss: Conclusion?
Conclusion: Reject \( H_{0} \), i.e. there is evidence that the frequency of cats falling is
not the same in each month.
Another way…critical values
Definition: A
critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under \( H_{0} \).
Critical value = \( \chi_{0.05,11}^2 = 19.6751376 \)
Red area = \( \alpha = 0.05 \)
Statistical tables
\[ Pr[\chi^{2} \geq \chi_{0.05,11}^2] = Pr[\chi^{2} \geq 19.675] = 0.05 \] \[ \mathrm{Observed} \ \chi^2 = 20.6638655 \]
The sampling distribution of the \( \chi^2 \) statistic follows a \( \chi^2 \) distribution only approximately. Excellent approximation if the following is true:
Note: Still good approximation if average expected value at least five.
Note: ALL STATISTICAL TESTS HAVE ASSUMPTIONS
Assumption common to all: random sample
One thousand coins were each flipped eight times, and the number of heads was recorded for each coin. The results are below.
Number.of.heads | Number.of.coins |
---|---|
0 | 6 |
1 | 32 |
2 | 105 |
3 | 186 |
4 | 236 |
5 | 201 |
6 | 98 |
7 | 33 |
8 | 103 |
\( H_{0} \): The number of heads has a binomial distribution with \( p = 0.5 \)
\( H_{A} \): The number of heads does not have a binomial distribution with \( p = 0.5 \)
Number.of.heads | Observed | Binomial.expectation | Expected | |
---|---|---|---|---|
0 | 6 | 0.0039063 | 3.90625 | |
1 | 32 | 0.0312500 | 31.25000 | |
2 | 105 | 0.1093750 | 109.37500 | |
3 | 186 | 0.2187500 | 218.75000 | |
4 | 236 | 0.2734375 | 273.43750 | |
5 | 201 | 0.2187500 | 218.75000 | |
6 | 98 | 0.1093750 | 109.37500 | |
7 | 33 | 0.0312500 | 31.25000 | |
8 | 103 | 0.0039063 | 3.90625 | |
Sum | 36 | 1000 | 1.0000000 | 1000.00000 |
Binomial expectation: dbinom(0:8, 8, 0.5)
Number.of.heads | Observed | Binomial.expectation | Expected | |
---|---|---|---|---|
0 | 6 | 0.0039063 | 3.90625 | |
1 | 32 | 0.0312500 | 31.25000 | |
2 | 105 | 0.1093750 | 109.37500 | |
3 | 186 | 0.2187500 | 218.75000 | |
4 | 236 | 0.2734375 | 273.43750 | |
5 | 201 | 0.2187500 | 218.75000 | |
6 | 98 | 0.1093750 | 109.37500 | |
7 | 33 | 0.0312500 | 31.25000 | |
8 | 103 | 0.0039063 | 3.90625 | |
Sum | 36 | 1000 | 1.0000000 | 1000.00000 |
Meet assumptions?
(1) No categories with expected frequencies less than 1
(2) No more than 20% of categories with expected frequencies less than 5.
Assumptions not met as 2/9, or about 22% of categories are less than 5. We'll combine 0 and 1 together, as well as 7 and 8:
Number.of.heads | Observed | Binomial.expectation | Expected | (O-E)^2/E | |
---|---|---|---|---|---|
X | 0 or 1 | 38 | 0.0351563 | 35.15625 | 0.2300278 |
X.1 | 2 | 105 | 0.1093750 | 109.37500 | 0.1750000 |
X.2 | 3 | 186 | 0.2187500 | 218.75000 | 4.9031429 |
X.3 | 4 | 236 | 0.2734375 | 273.43750 | 5.1257286 |
X.4 | 5 | 201 | 0.2187500 | 218.75000 | 1.4402857 |
X.5 | 6 | 98 | 0.1093750 | 109.37500 | 1.1830000 |
X.6 | 7 or 8 | 136 | 0.0351563 | 35.15625 | 289.2646944 |
Sum | 1000 | 1.0000000 | 1000.00000 | 302.3218794 |
\( \chi^2 \) = 302.3218794
Statistical tables
\[ df = 7 - 1 = 6 \] \[ Pr[\chi^{2} \geq \chi_{0.05,6}^2] = Pr[\chi^{2} \geq 12.59] = 0.05 \] \[ \mathrm{Observed} \ \chi^2 = 302.3218794 \]
chisq.test(df$Observed, p = df$`Binomial.expectation`)
Chi-squared test for given probabilities
data: df$Observed
X-squared = 302.32, df = 6, p-value < 2.2e-16
Number.of.heads Observed Binomial.expectation Expected (O-E)^2/E
X 0 or 1 38 0.03515625 35.15625 0.2300278
X.1 2 105 0.10937500 109.37500 0.1750000
X.2 3 186 0.21875000 218.75000 4.9031429
X.3 4 236 0.27343750 273.43750 5.1257286
X.4 5 201 0.21875000 218.75000 1.4402857
X.5 6 98 0.10937500 109.37500 1.1830000