Fitting probability models to frequency data II

M. Drew LaMar
February 29, 2016

“Life is good for only two things, doing mathematics and teaching mathematics.”

- Poisson

Class Announcements

Exam #1 is graded! Solutions will be posted on Blackboard soon.
- Will go over exam in lab this week
Reading assignment (Whitlock & Schluter, Chapter 9)
- New policy: Submit two ½ to 1 page summaries of chapters over the remainder of the semester (graded on completion) if you would like to replace an online quiz grade (i.e., you can do this at most two times)

Chapter 8: Fitting probability models to frequency data

From proportions and binomial distributions…

…to working with direct frequency distributions.

         Right Left
Observed    14    4
Expected     9    9

Chi-squared goodness-of-fit test

Note: The binomial test is an example of a goodness-of-fit test.

Definition: A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes.

Definition: A model in this case is a simplified, mathematical representation that mimics how we think a natural process works.
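
As a small illustration of the note above, the Right/Left counts shown earlier (14 and 4, with 9 expected in each category) can be tested both ways in R. This is only a sketch: the variable names are made up here, and it assumes the null model of equal proportions implied by the expected counts.

```r
# Observed counts from the Right/Left table (names are illustrative)
obs <- c(Right = 14, Left = 4)

# Binomial test of H0: p = 0.5 (a goodness-of-fit test with two categories)
bt <- binom.test(obs[["Right"]], sum(obs), p = 0.5)
bt$p.value

# Chi-squared goodness-of-fit test of the same null model
ct <- chisq.test(obs, p = c(0.5, 0.5))
ct$expected    # expected counts under H0: 9 and 9
ct$statistic   # chi-squared test statistic
```

The binomial test gives an exact \( P \)-value, while the \( \chi^2 \) test relies on an approximation, so the two \( P \)-values need not agree exactly.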

Working through an example

Assignment Problem #21

A more recent study of Feline High-Rise Syndrome (FHRS) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cats falling varies between months of the year?

Month	Number fallen	Month	Number fallen
January	4	July	19
February	6	August	13
March	8	September	12
April	10	October	12
May	9	November	7
June	14	December	5

Example - Assignment Problem #21

Null and alternative hypotheses

Question: What are the null and alternative hypotheses?

Answer:
\( H_{0} \): The frequency of cats falling is the same in each month.
\( H_{A} \): The frequency of cats falling is not the same in each month.

Example - Assignment Problem #21

Observed and Expected Frequencies

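# Month labels and the observed number of falls in each month
rows <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
# Under H0, the 119 falls are spread evenly over the 12 months
Exp <- rep(sum(Obs)/12, 12)
# Combine into a 12 x 2 table and append a row of column totals
FHRSTable <- matrix(c(Obs, Exp), ncol = 2, dimnames = list(rows, c("Obs","Exp")))
addmargins(FHRSTable, margin = 1)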

    Obs        Exp
Jan   4   9.916667
Feb   6   9.916667
Mar   8   9.916667
Apr  10   9.916667
May   9   9.916667
Jun  14   9.916667
Jul  19   9.916667
Aug  13   9.916667
Sep  12   9.916667
Oct  12   9.916667
Nov   7   9.916667
Dec   5   9.916667
Sum 119 119.000000

Example - Assignment Problem #21

# Bars grouped by column: all 12 observed counts, then all 12 expected counts
barplot(FHRSTable, beside=TRUE)

# Transposed: observed and expected bars side by side within each month
barplot(t(FHRSTable), beside=TRUE)

Example - Assignment Problem #21

\( \chi^2 \) test statistic

Definition: The \( \chi^2 \) statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis and is given by

\[ \chi^2 = \sum_{i}\frac{(Observed_{i} - Expected_{i})^2}{Expected_{i}} \]

Discuss: What would support the null hypothesis more: a small value or large value for \( \chi^{2} \)?

Answer: Small value for \( \chi^{2} \)
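
For a sense of scale, here is one term of the sum worked by hand, using an observed count of 4 against an expected count of 9.9167 (the January row of the cat-fall table in this example):

\[ \frac{(4 - 9.9167)^2}{9.9167} = \frac{35.007}{9.9167} \approx 3.53 \]

A category whose observed count is close to its expected count contributes almost nothing to the sum, so a small total means the data look like what \( H_{0} \) predicts.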

Example - Assignment Problem #21

\( \chi^2 \) test statistic - computed in R

# Convert the table to a data frame and add each month's contribution to chi-squared
FHRS_df <- as.data.frame(FHRSTable)
FHRS_df$chi2_i <- (FHRS_df$Obs - FHRS_df$Exp)^2 / FHRS_df$Exp

    Obs      Exp       chi2_i
Jan   4 9.916667 3.5301120448
Feb   6 9.916667 1.5469187675
Mar   8 9.916667 0.3704481793
Apr  10 9.916667 0.0007002801
May   9 9.916667 0.0847338936
Jun  14 9.916667 1.6813725490
Jul  19 9.916667 8.3200280112
Aug  13 9.916667 0.9586834734
Sep  12 9.916667 0.4376750700
Oct  12 9.916667 0.4376750700
Nov   7 9.916667 0.8578431373
Dec   5 9.916667 2.4376750700

(chi2 <- sum(FHRS_df$chi2_i))

[1] 20.66387

Could be done in one line:

(chi2 <- sum((FHRS_df$Obs - FHRS_df$Exp)^2 / FHRS_df$Exp))

[1] 20.66387

So \( \chi^{2} = 20.6638655 \).

Is that large enough to count as evidence against \( H_{0} \)?

Example - Assignment Problem #21

Question: What is the sampling distribution for the \( \chi^2 \) test statistic under the null hypothesis?

Answer: \( \chi^2 \) distribution (actually a family of distributions)
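
To see what "a family of distributions" means, the sketch below overlays \( \chi^2 \) densities for a few degrees of freedom using R's built-in `dchisq`. The particular values of `dfs`, the colors, and the axis limits are illustrative choices, not taken from the original figure.

```r
# Degrees of freedom to compare (illustrative; 11 is the value used for FHRS)
dfs  <- c(1, 3, 5, 11)
cols <- c("black", "blue", "darkgreen", "red")

# Draw the first density, then overlay the others on the same axes
curve(dchisq(x, df = dfs[1]), from = 0.01, to = 30, ylim = c(0, 0.5),
      col = cols[1], xlab = expression(chi^2), ylab = "Probability density")
for (i in 2:length(dfs)) {
  curve(dchisq(x, df = dfs[i]), from = 0.01, to = 30, add = TRUE, col = cols[i])
}
legend("topright", legend = paste("df =", dfs), col = cols, lty = 1)
```

As the degrees of freedom increase, the bulk of the distribution shifts to the right, so the same value of \( \chi^2 \) can be surprising for one member of the family and ordinary for another.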

Example - Assignment Problem #21

Definition: The number of degrees of freedom of a \( \chi^2 \) statistic specifies which \( \chi^2 \) distribution to use as the null distribution and is given by

\[ df = (\mathrm{Number \ of \ categories}) - 1 - (\mathrm{Number \ of \ parameters \ estimated \ from \ the \ data}) \]

The degrees of freedom play a role analogous to that of the sample size in the null distribution for a mean or a proportion: there the null distribution depends on the sample size, and here it depends on \( df \).
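
Applied to the cat-fall example: there are 12 categories (months), and the null model of equal frequencies in every month requires no parameters to be estimated from the data, so

\[ df = 12 - 1 - 0 = 11 \]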

Example - Assignment Problem #21

Discuss: Define the \( P \)-value.

Definition: The \( P \)-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.

[Plot: \( \chi^2 \) null distribution with the area to the right of the observed test statistic shaded red]

\( P \)-value = 0.0370255 (shaded red area)

Example - Assignment Problem #21

\( P \)-value = 0.0370255

Discuss: Conclusion?

Conclusion: Since \( P = 0.037 < 0.05 \), reject \( H_{0} \) at the \( \alpha = 0.05 \) significance level; i.e., there is evidence that the frequency of cats falling is not the same in each month.

Example - Assignment Problem #21

Another way…critical values

Definition: A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under \( H_{0} \).

[Plot: \( \chi^2 \) null distribution with the area to the right of the critical value shaded red]

Critical value = \( \chi_{0.05,11}^2 = 19.6751376 \)
Red area = \( \alpha = 0.05 \)

Example - Assignment Problem #21

Statistical tables

\[ Pr[\chi^{2} \geq \chi_{0.05,11}^2] = Pr[\chi^{2} \geq 19.675] = 0.05 \]
\[ \mathrm{Observed} \ \chi^2 = 20.6638655 \]

How do I get P-values and critical values from R?

Answer: Use the cumulative distribution function (CDF) for \( P \)-values and its inverse, the quantile function, for critical values.
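
A minimal sketch of how the two kinds of functions relate, using R's built-in `pchisq` and `qchisq` with 11 degrees of freedom. The input value 15 and the variable names are arbitrary and only for illustration.

```r
dof <- 11   # degrees of freedom

# CDF: probability that a chi-squared variable is at most q
q <- 15
p <- pchisq(q, df = dof)

# Quantile function: the value below which a fraction p of the distribution lies
q_back <- qchisq(p, df = dof)

# The quantile function undoes the CDF, so q_back recovers q
all.equal(q, q_back)
```

Going from a value to a probability is what a \( P \)-value needs; going from a probability to a value is what a critical value needs.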

Cumulative distribution functions

[Plots: cumulative distribution function of the \( \chi^2 \) distribution]

Complementary cumulative distribution function (CCDF)

[Plots: complementary cumulative distribution function of the \( \chi^2 \) distribution]
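
The CCDF is simply one minus the CDF, which in R corresponds to the `lower.tail = FALSE` argument of `pchisq`. The sketch below checks this at a few arbitrary points and draws both curves; the evaluation points, variable names, and plot styling are illustrative assumptions.

```r
dof <- 11                      # degrees of freedom
q   <- c(5, 10, 19.675, 30)    # arbitrary values at which to evaluate the functions

cdf  <- pchisq(q, df = dof)                      # Pr[chi^2 <= q]
ccdf <- pchisq(q, df = dof, lower.tail = FALSE)  # Pr[chi^2 >  q]

# The two always sum to 1
cdf + ccdf

# Plot both curves on the same axes: CDF solid, CCDF dashed
curve(pchisq(x, df = dof), from = 0, to = 40,
      xlab = expression(chi^2), ylab = "Probability")
curve(pchisq(x, df = dof, lower.tail = FALSE), from = 0, to = 40,
      add = TRUE, lty = 2)
legend("right", legend = c("CDF", "CCDF"), lty = c(1, 2))
```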

How to calculate P-value

For our example (FHRS), we got the test statistic

\[ \chi^2_{11} = 20.6638655 \]

Here's the command in R for the corresponding \( P \)-value:

(pval <- pchisq(chi2, df=11, lower.tail=FALSE))

[1] 0.0370255
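
To connect the command to the picture of a shaded tail area, the sketch below recomputes the test statistic from the monthly counts, confirms numerically that the \( P \)-value equals the area under the density to the right of the statistic, and draws that area. The plotting range (0 to 40) and the variable names `area` and `xs` are illustrative choices.

```r
# Recompute the test statistic from the monthly counts given earlier
Obs  <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)
Exp  <- rep(sum(Obs) / 12, 12)
chi2 <- sum((Obs - Exp)^2 / Exp)

# The P-value is the area under the df = 11 density to the right of chi2
area <- integrate(dchisq, lower = chi2, upper = Inf, df = 11)$value
pval <- pchisq(chi2, df = 11, lower.tail = FALSE)
c(area = area, pval = pval)

# Draw the density and shade that right-tail area in red
curve(dchisq(x, df = 11), from = 0, to = 40,
      xlab = expression(chi^2), ylab = "Probability density")
xs <- seq(chi2, 40, length.out = 200)
polygon(c(chi2, xs, 40), c(0, dchisq(xs, df = 11), 0), col = "red", border = NA)
```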

How to calculate critical value?

\[ \chi^{2}_{0.05,11} = 19.6751376 \]

Answer: Use the inverse CCDF (the upper-tail quantile function).

How to calculate critical value

With a significance level of \( \alpha = 0.05 \) and \( df = 11 \), the critical value \( \chi^{2}_{0.05,11} \) is found in R with the following command:

(cval <- qchisq(0.05, df=11, lower.tail=FALSE))

[1] 19.67514

Our test statistic for FHRS was \( \chi^{2} = 20.6638655 \).

Discuss: Conclusion?

Answer: Reject the null hypothesis, since the test statistic (20.66) is greater than the critical value (19.68).
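
As a cross-check on the step-by-step computation, R's built-in `chisq.test` can run the whole goodness-of-fit test in one call. This is a sketch rather than part of the original walkthrough; the names `fit`, `alpha`, and `reject` are illustrative.

```r
Obs <- c(4, 6, 8, 10, 9, 14, 19, 13, 12, 12, 7, 5)

# With no p argument, chisq.test() assumes equal probabilities for all categories
fit <- chisq.test(Obs)
fit$statistic   # chi-squared test statistic
fit$parameter   # degrees of freedom
fit$p.value     # P-value

# Decision by critical value at alpha = 0.05
alpha  <- 0.05
cval   <- qchisq(alpha, df = unname(fit$parameter), lower.tail = FALSE)
reject <- unname(fit$statistic) > cval   # TRUE means reject H0
reject
```

The \( P \)-value approach and the critical-value approach always lead to the same decision, because the statistic exceeds the critical value exactly when the \( P \)-value is below \( \alpha \).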

Summary of functions for chi-squared

Name	R command	Uses
PDF	`dchisq(x, df)`	-
CDF	`pchisq(q, df, lower.tail=TRUE)`	-
CCDF	`pchisq(q, df, lower.tail=FALSE)`	Compute \( P \)-values
QF	`qchisq(p, df, lower.tail=TRUE)`	-
CQF	`qchisq(p, df, lower.tail=FALSE)`	Compute critical values