M. Drew LaMar
February 29, 2016
“Life is good for only two things, doing mathematics and teaching mathematics.”
- Poisson
From proportions and binomial distributions…
…to working with direct frequency distributions.
Frequency | Right | Left |
---|---|---|
Observed | 14 | 4 |
Expected | 9 | 9 |
Note: The binomial test is an example of a goodness-of-fit test.
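As a minimal sketch, the counts in the table above (14 right, 4 left) can be run through R's built-in `binom.test`. The variable names are illustrative, and the null proportion of 0.5 is an assumption inferred from the equal expected counts of 9 and 9.

```r
# Observed counts from the table above (illustrative variable names)
obs <- c(Right = 14, Left = 4)
n <- sum(obs)    # 18 individuals in total
p0 <- 0.5        # assumed null proportion for "Right"

# Expected frequencies under the null hypothesis: n * p0 and n * (1 - p0)
expected <- n * c(Right = p0, Left = 1 - p0)
expected

# Exact two-sided binomial test of the observed count against p0
bt <- binom.test(obs[["Right"]], n, p = p0)
bt$p.value
```

The expected frequencies computed here reproduce the "Expected" row of the table, which is the same observed-versus-expected comparison that a goodness-of-fit test generalizes to more than two categories.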
Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.
Definition: A model in this case is a simplified, mathematical representation that mimics how we think a natural process works.
Assignment Problem #21
A more recent study of Feline High-Rise Syndrome (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cats falling varies between months of the year?
Month | Number fallen | Month | Number fallen |
---|---|---|---|
January | 4 | July | 19 |
February | 6 | August | 13 |
March | 8 | September | 12 |
April | 10 | October | 12 |
May | 9 | November | 7 |
June | 14 | December | 5 |
Null and alternative hypotheses
Question: What are the null and alternative hypotheses?
Answer:
\( H_{0} \): The frequency of cats falling is the same in each month.
\( H_{A} \): The frequency of cats falling is not the same in each month.
Observed and Expected Frequencies
rows <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
Exp <- rep(sum(Obs)/12, 12)
FHRSTable <- matrix(c(Obs, Exp), ncol = 2, dimnames = list(rows, c("Obs","Exp")))
addmargins(FHRSTable, margin = 1)
Obs Exp
Jan 4 9.916667
Feb 6 9.916667
Mar 8 9.916667
Apr 10 9.916667
May 9 9.916667
Jun 14 9.916667
Jul 19 9.916667
Aug 13 9.916667
Sep 12 9.916667
Oct 12 9.916667
Nov 7 9.916667
Dec 5 9.916667
Sum 119 119.000000
barplot(FHRSTable, beside=TRUE)
barplot(t(FHRSTable), beside=TRUE)
\( \chi^2 \) test statistic
Definition: The \( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by
\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]
Discuss: What would support the null hypothesis more: a small value or large value for \( \chi^{2} \)?
Answer: Small value for \( \chi^{2} \)
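To see why a small value supports the null hypothesis, here is a minimal sketch using a hypothetical helper, `chisq_stat` (not part of base R), applied to the two-category counts from the start of the lecture (observed 14 and 4, expected 9 and 9).

```r
# Hypothetical helper: chi-squared discrepancy between observed and expected counts
chisq_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Perfect agreement with the null expectation gives a statistic of 0
chisq_stat(c(9, 9), c(9, 9))

# The observed 14 / 4 split gives (25 + 25) / 9
chisq_stat(c(14, 4), c(9, 9))
```

The statistic is zero only when every observed count equals its expected count, and it grows as the counts move away from what the null hypothesis predicts.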
\( \chi^2 \) test statistic - computed in R
FHRS_df <- as.data.frame(FHRSTable)
FHRS_df$chi2_i <- (FHRS_df$Obs - FHRS_df$Exp)^2 / FHRS_df$Exp
Obs Exp chi2_i
Jan 4 9.916667 3.5301120448
Feb 6 9.916667 1.5469187675
Mar 8 9.916667 0.3704481793
Apr 10 9.916667 0.0007002801
May 9 9.916667 0.0847338936
Jun 14 9.916667 1.6813725490
Jul 19 9.916667 8.3200280112
Aug 13 9.916667 0.9586834734
Sep 12 9.916667 0.4376750700
Oct 12 9.916667 0.4376750700
Nov 7 9.916667 0.8578431373
Dec 5 9.916667 2.4376750700
(chi2 <- sum(FHRS_df$chi2_i))
[1] 20.66387
This could be done in one line:
(chi2 <- sum((FHRS_df$Obs - FHRS_df$Exp)^2 / FHRS_df$Exp))
[1] 20.66387
So \( \chi^{2} = 20.6638655 \).
Is that good?
Question: What is the sampling distribution for the \( \chi^2 \) test statistic under the null hypothesis?
Answer: \( \chi^2 \) distribution (actually a family of distributions)
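One way to get a feel for this null distribution is to simulate it. The sketch below is an illustration only, not part of the original analysis: it assumes the equal-frequency null hypothesis for the FHRS data, uses an arbitrary seed and an arbitrary number of replicates `B`, and draws monthly counts with `rmultinom`.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
n <- sum(Obs)        # 119 cats
k <- length(Obs)     # 12 months
Exp <- rep(n / k, k)
chi2_obs <- sum((Obs - Exp)^2 / Exp)

# Draw B data sets of n cats under H0 (every month equally likely)
B <- 10000
sims <- rmultinom(B, size = n, prob = rep(1 / k, k))  # k x B matrix of counts

# Chi-squared statistic for each simulated data set (one per column)
chi2_null <- colSums((sims - Exp)^2 / Exp)

# Proportion of simulated statistics at least as large as the observed one
p_sim <- mean(chi2_null >= chi2_obs)
p_sim
```

A histogram of `chi2_null` (for example `hist(chi2_null, freq = FALSE)`) should resemble the \( \chi^2 \) density with 11 degrees of freedom, and `p_sim` approximates the upper-tail area beyond the observed statistic.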
Definition: The number of degrees of freedom of a \( \chi^2 \) statistic specifies which \( \chi^2 \) distribution to use as the null distribution and is given by
\[ df = (\mathrm{Number \ of \ categories}) - 1 - (\mathrm{Number \ of \ parameters \ estimated \ from \ the \ data}) \]
The degrees of freedom play a role analogous to the sample size in the null distributions for a mean or a proportion. Remember, for the mean and the proportion, the null distribution depends on the sample size.
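Applied to the FHRS data, the formula can be sketched as follows. The variable names are illustrative, and setting the number of estimated parameters to zero is an assumption that holds because the equal-frequency null model is fully specified without fitting anything to the data.

```r
Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

n_categories <- length(Obs)  # 12 months
n_estimated <- 0             # assumption: no parameters estimated from the data

# df = (number of categories) - 1 - (number of estimated parameters)
df <- n_categories - 1 - n_estimated
df
```

The result, 11, is the `df` value passed to `pchisq` and `qchisq` for this example.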
Discuss: Define the \( P \)-value.
Definition: The \( P \)-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
\( P \)-value = 0.0370255 (shaded red area)
\( P \)-value = 0.0370255
Discuss: Conclusion?
Conclusion: Reject \( H_{0} \) at \( \alpha = 0.05 \), since \( P = 0.037 < 0.05 \); i.e., there is evidence that the frequency of cats falling is not the same in each month.
Another way…critical values
Definition: A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under \( H_{0} \).
Critical value = \( \chi_{0.05,11}^2 = 19.6751376 \)
Red area = \( \alpha = 0.05 \)
Statistical tables
\[ Pr[\chi^{2} \geq \chi_{0.05,11}^2] = Pr[\chi^{2} \geq 19.675] = 0.05 \]
\[ \mathrm{Observed} \ \chi^2 = 20.6638655 \]
Alternatively, use cumulative distribution functions and inverse cumulative distribution functions (quantile functions).
For our example (FHRS), we got the test statistic
\[ \chi^2_{11} = 20.6638655 \]
Here's the command in R for the corresponding \( P \)-value:
(pval <- pchisq(chi2, df=11, lower.tail=FALSE))
[1] 0.0370255
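The `lower.tail=FALSE` argument asks for the upper-tail area directly. As a small self-contained sketch (recomputing the statistic from the FHRS counts so the block runs on its own), the same \( P \)-value can be obtained as one minus the lower-tail CDF:

```r
Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
Exp <- rep(sum(Obs) / length(Obs), length(Obs))
chi2 <- sum((Obs - Exp)^2 / Exp)

# Upper-tail area beyond the observed statistic, computed two ways
p_upper <- pchisq(chi2, df = 11, lower.tail = FALSE)
p_complement <- 1 - pchisq(chi2, df = 11)

all.equal(p_upper, p_complement)
```

The two forms agree here; `lower.tail = FALSE` is generally preferred because it avoids loss of precision when the tail area is very small.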
\[ \chi^2_{0.05,11} = 19.6751376 \]
Inverse CCDF (quantile function)!!!
With a significance level of \( \alpha = 0.05 \) and \( df = 11 \), the critical value \( \chi^2_{0.05,11} \) is found in R with the following command:
(cval <- qchisq(0.05, df=11, lower.tail=FALSE))
[1] 19.67514
Our test statistic for FHRS was \( \chi^{2} = 20.6638655 \).
Discuss: Conclusion?
Answer: Reject the null hypothesis, since the test statistic (20.66) is greater than the critical value (19.68).
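As a cross-check on the hand computation, the sketch below runs R's built-in `chisq.test` on the FHRS counts (its default null for a vector of counts is equal probabilities) and confirms that the \( P \)-value rule and the critical-value rule lead to the same decision. The names `fit`, `reject_by_pvalue`, and `reject_by_cval` are illustrative.

```r
Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

# Goodness-of-fit test; with a count vector the default null is equal probabilities
fit <- chisq.test(Obs)
chi2 <- unname(fit$statistic)   # test statistic
df <- unname(fit$parameter)     # degrees of freedom
pval <- fit$p.value

alpha <- 0.05
cval <- qchisq(alpha, df = df, lower.tail = FALSE)

# Two equivalent decision rules
reject_by_pvalue <- pval < alpha
reject_by_cval <- chi2 > cval
c(reject_by_pvalue, reject_by_cval)
```

Both rules always agree, because the statistic exceeds the critical value exactly when the upper-tail area beyond it is smaller than \( \alpha \).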
Name | R command | Uses |
---|---|---|
PDF | dchisq(x, df) | - |
CDF | pchisq(q, df, lower.tail=TRUE) | - |
CCDF | pchisq(q, df, lower.tail=FALSE) | Compute \( P \)-values |
QF | qchisq(p, df, lower.tail=TRUE) | - |
CQF | qchisq(p, df, lower.tail=FALSE) | Compute critical values |
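The functions in this table come in inverse pairs. As a brief sketch using the values from the FHRS example (\( \alpha = 0.05 \), \( df = 11 \)), feeding a critical value back into the matching tail function recovers the tail area:

```r
df <- 11
alpha <- 0.05

# CQF: upper-tail area -> critical value
cval <- qchisq(alpha, df = df, lower.tail = FALSE)

# CCDF: critical value -> upper-tail area (recovers alpha)
tail_area <- pchisq(cval, df = df, lower.tail = FALSE)

# QF and CDF are inverses in the same way (recovers cval)
round_trip <- qchisq(pchisq(cval, df = df), df = df)

# The CQF at alpha is the same as the QF at 1 - alpha
same_quantile <- all.equal(cval, qchisq(1 - alpha, df = df))

c(cval = cval, tail_area = tail_area, round_trip = round_trip)
```

The same pattern applies to other distributions in R, where the `p` prefix gives a CDF and the `q` prefix gives its quantile function.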