We can test the proportions in a probability model using a variant of \(\chi^2\) called Goodness of Fit (GOF). The \(\chi^2\) GOF test compares observed counts to the counts predicted by theoretical probabilities. Rejecting the null indicates evidence of a lack of fit.

I. Example: Doberman Breeding

Dobermans can be bred in 4 colors: black, red, blue and fawn, all of which have rust colored highlights. Fawn-colored Dobermans are the rarest and most prized color. If a black male Doberman (hetero dominant allele) and a fawn female Doberman (homo recessive allele) were bred, on average half their pups would be black, a quarter blue and the remaining quarter fawn. Over the course of several years, a certain Doberman breeder had a pair of dogs from which 28 pups were born: 11 black, 11 blue and 6 fawn. Test the hypothesis that these dogs have the predicted genetics at the \(\alpha= 0.1\) level of significance.

1. Hypotheses

Our hypothesis just tests theoretical probabilities, so the list of probabilities is the null hypothesis.

\[H_0 : p_\text{black}=\frac{1}{2}, p_\text{blue}=\frac{1}{4}, p_\text{fawn}=\frac{1}{4}\]

The alternative hypothesis is that at least one of the probabilities is not correct. The Doberman breeder is hoping to fail to reject the null, indicating the parent dogs have the hypothesized genetics which, if true, will be valuable.

2. Data: Expected vs. Observed

We must set up the Observed vs. Expected table, which we need for the verification step. How do we determine the expected cells? We multiply the sample size by the probabilities in the hypothesis. For example, the probability of a black Doberman (given the hypothesized genetics) would be \(\frac{1}{2}\), so the expected number of black Doberman pups out of 28 total would be \(28 \times \frac{1}{2} = 14\).

\[\begin{array}{cccc}&\text{Black} & \text{Blue} & \text{Fawn}\\ \text{Observed} & 11 & 11 & 6\\ \text{Expected} & 14 & 7 & 7 \end{array}\]
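The Expected row can also be computed directly in R. The following is a minimal sketch in base R; the names `probs`, `n`, and `expected_counts` are illustrative and do not appear elsewhere in these notes.

```r
# Hypothesized probabilities from the null hypothesis
probs <- c(black = 1/2, blue = 1/4, fawn = 1/4)

# Total number of pups observed
n <- 28

# Expected count for each color = sample size * hypothesized probability
expected_counts <- n * probs
expected_counts
```

This should reproduce the Expected row of the table: 14 black, 7 blue and 7 fawn.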

3. Verification

As with other proportion tests, as long as we ensure the sample size is adequate, our category data will be appropriate for the test. For all \(\chi^2\) procedures, we require that no more than 20% of Expected cell counts can be less than 5.

Since the smallest Expected cell count is 7 which is \(\geq\) 5, we have no low Expected cell counts at all (0%) which is clearly less than 20%.
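The same check can be sketched in base R. The variable names here (`expected_counts`, `share_low`) are our own, chosen for illustration.

```r
# Expected counts under the hypothesized genetics (28 pups)
expected_counts <- 28 * c(0.5, 0.25, 0.25)

# Proportion of expected cells that fall below 5
share_low <- mean(expected_counts < 5)
share_low

# TRUE means no more than 20% of expected cells are below 5
share_low <= 0.20
```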

4. The GOF Procedure

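Let’s put the data into vectors so R can analyze them. Naming the vectors “observed” and “expected” will help us understand the code block where we actually calculate the GOF statistics. Note that the “expected” vector holds the hypothesized probabilities rather than the expected counts; R computes the expected counts from them.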

observed = c(11, 11, 6)
expected = c(0.5, 0.25, 0.25)

We run the \(\text{xchisq.test}\) procedure after ensuring the mosaic package is loaded. (If the library line is commented out in your copy, remove the hashtag comment symbol to load mosaic.)

library(mosaic)
xchisq.test(x = observed,
           p = expected)

    Chi-squared test for given probabilities

data:  x
X-squared = 3.0714, df = 2, p-value = 0.2153

 11.00    11.00     6.00  
(14.00)  ( 7.00)  ( 7.00) 
 [0.64]   [2.29]   [0.14] 
<-0.80>  < 1.51>  <-0.38> 
     
key:
    observed
    (expected)
    [contribution to X-squared]
    <Pearson residual>

The top two rows of the table in the output provide an Observed vs. Expected cell count table which matches our work.

\[\begin{array}{cccc} & \text{Black} & \text{Blue} & \text{Fawn}\\ \text{Observed} & 11 & 11 & 6 \\ \text{Expected} & 14 & 7 & 7\end{array}\]

The third row of the table in the output shows the terms of the \(\chi^2\) calculation. The \(\chi^2\) formula is \[\chi^2=\sum \frac{(O-E)^2}{E}\] where, for each corresponding pair of cells, we calculate \[\frac{(\text{Observed Cell Count} - \text{Expected Cell Count})^2}{\text{Expected Cell Count}}\] For the first pair of values (Black pups), we have \[\frac{(11-14)^2}{14}=\frac{9}{14}\approx 0.642857\] which is why the first entry of the third row in the R output table is \([0.64]\).
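To see where every entry of the third row comes from, here is a minimal by-hand sketch in base R. It assumes only the observed counts and hypothesized probabilities given above; the names `contributions`, `chi_sq`, and `p_value` are illustrative.

```r
observed <- c(11, 11, 6)
expected_counts <- 28 * c(0.5, 0.25, 0.25)

# One term of the chi-squared sum per color: (O - E)^2 / E
contributions <- (observed - expected_counts)^2 / expected_counts
round(contributions, 2)   # compare with the [ ] row of the output

# The test statistic is the sum of the contributions
chi_sq <- sum(contributions)

# Degrees of freedom = number of categories - 1
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)
c(chi_sq = chi_sq, p_value = p_value)
```

The statistic and p-value should agree with the X-squared = 3.0714 and p-value = 0.2153 reported in the output above.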

5. Conclusion and Discussion

Because \(p=0.2153 > 0.1\), we fail to reject the null. Be careful. We never “accept the null hypothesis.” The Doberman breeder found no evidence her hypothesized genetics model was incorrect. Remember that Fisher’s whole point when he developed the null hypothesis was that it could only be falsified, never proven.

What else can the breeder do? Her dogs won’t live forever, so she’s stuck with a small sample size. She can’t repeat the test on a larger sample. She needs to make money breeding them while they’re still young enough.

II. A Randomization Approach

Why not test the hypothesis by repeatedly drawing samples of 28 pups from the hypothesized distribution? I will show a couple of the code blocks that accomplish this, but you don’t have to learn how to code this up.

First, we need to draw a sample named “pups” that has 28 randomly generated dogs.

pups = sample(c("black","blue","fawn"), 28, replace=TRUE, prob = c(0.5, 0.25, 0.25))
tally(pups)
X
black  blue  fawn 
   13     9     6 

Repeatedly execute the above code block to see several samples. How do we tell if the observed data is a “low probability” pattern? Consider the formula for calculating the test statistic. \[\chi^2 = \sum \frac{(O - E)^2}{E}\] For our observed data: \[\chi^2 = \frac{(11 - 14)^2}{14}+\frac{(11 - 7)^2}{7}+\frac{(6 - 7)^2}{7}=\frac{43}{14}\]

How do we tell if one of our randomly generated results is more or less likely than our observed data? Easy, we just calculate its \(\chi^2\) statistic. Consider the random draw above: 13 black, 9 blue, 6 fawn.

\[\chi^2 = \frac{(13 - 14)^2}{14}+\frac{(9 - 7)^2}{7}+\frac{(6 - 7)^2}{7}=\frac{11}{14} < \frac{43}{14}\]

The statistic measures the sum of the squared differences between the observed and expected counts, with each difference scaled by its expected count. A larger statistic means the sample sits farther from what the model predicts.

R Coding. The type of randomization needed for
\(\chi^2\) GOF is beyond the scope of this course. The
comments help show what’s happening, but do not
worry about learning to do this type of coding.

If we generate, say, 1,000 samples, how many of them would have more error than our observed data? The code below uses an inner FOR loop that counts the colors of the pups in each sample, and an outer FOR loop that repeats the process 1,000 times. The output \(N\) equals the number of randomly drawn samples which have \(\chi^2\) statistics greater than \(\frac{43}{14}\).

N = 0
# Do the randomization 1,000 times
for (j in 1:1000) {
  # Create random sample of 28 pups
  pups = sample(c("black","blue","fawn"), 28, replace=TRUE, prob = c(0.5, 0.25, 0.25))
  black = 0
  blue = 0
  fawn = 0
  # Count number of pups for each color
  for (i in 1:28) {
    if (pups[i] == "black") { black = black + 1}
    if (pups[i] == "blue") { blue = blue + 1}
    if (pups[i] == "fawn") { fawn = fawn + 1}
  }
  # Calculate Chi-Squared statistic for this simulated sample
  chiSimulated = (black - 14)^2/14 + (blue - 7)^2/7 + (fawn - 7)^2/7
  # Count the sample if it has more error than the observed data (43/14)
  if (chiSimulated > 43/14) {N = N + 1}
}
N
[1] 224

Run several iterations of the code block above and compare to our theoretical \(p\)-value of 0.215. You’ll find that the randomized approach is working properly, and these results are not too unlikely if the dogs have the proposed genetics.
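For readers who are curious, the same randomization can be sketched more compactly with base R's `rmultinom`, which draws the color counts for all 1,000 samples at once. This is an alternative sketch, not the code used above; the names `draws`, `sim_stats`, and `p_sim` are illustrative, and the seed is arbitrary.

```r
set.seed(1)   # arbitrary seed so the run can be repeated

probs <- c(0.5, 0.25, 0.25)
expected_counts <- 28 * probs

# Each column is one simulated litter history: counts of black, blue, fawn
draws <- rmultinom(1000, size = 28, prob = probs)

# Chi-squared statistic for every simulated sample (one per column)
sim_stats <- colSums((draws - expected_counts)^2 / expected_counts)

# Proportion of simulated samples with more error than the observed data
p_sim <- mean(sim_stats > 43/14)
p_sim

# Base R can also simulate the p-value directly
chisq.test(c(11, 11, 6), p = probs, simulate.p.value = TRUE, B = 1000)
```

The proportion `p_sim` plays the same role as \(N/1000\) from the loop version, so it should land in the same neighborhood as the theoretical \(p\)-value. It will vary from run to run if the seed is changed.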

III. \(\chi^2\) Caution

The \(\chi^2\) GOF procedure is an example of a backwards statistical test since we’re usually trying to verify a model (null hypothesis) rather than show it’s false (alternative hypothesis). Fisher’s null hypothesis was definitely not designed for this. However, the \(\chi^2\) GOF test has been incredibly helpful to geneticists when used where large samples are readily available. One example is testing crop genetics where huge samples of hybrids are easy and inexpensive. Despite using this backwards statistical approach, we still generate a great deal of evidence the null is likely to be true when we examine multiple, large-scale experiments all of which produce the same results.

IV. Example 2: Mouse Genetics

Suppose researchers cross a pure breeding white mouse with a pure breeding brown mouse. All F1 (first filial generation) progeny are brown. The researchers then construct an F2 (second filial generation) cross by breeding pairs from the F1 group. If the researchers’ genetics model is correct, the brown-to-white ratio in the F2 group should be \(3:1\).

In total, researchers raise 200 of the F2 offspring and observe 164 brown and the rest (36) white. Test the hypothesis that the genetics model is correct at the \(\alpha = 0.1\) level of significance.

Our hypotheses: \[\begin{align*}H_0 &: p_B = \frac{3}{4} , p_W = \frac{1}{4}\\ H_a &: \text{At least one probability significantly different}\end{align*}\]

Creating the cell count vectors:

observed = c(164,36)
expected = c(.75,.25)

Run the procedure:

xchisq.test(x = observed,
           p = expected)

    Chi-squared test for given probabilities

data:  x
X-squared = 5.2267, df = 1, p-value = 0.02224

 164.00    36.00 
(150.00) ( 50.00)
 [1.31]   [3.92] 
< 1.14>  <-1.98> 
   
key:
    observed
    (expected)
    [contribution to X-squared]
    <Pearson residual>

The expected cell counts (150 and 50) are both far larger than 5, so the data passes verification. Given that \(p = 0.02224 < 0.1 = \alpha\), we reject the null. We have evidence that the genetic model for the mice is incorrect.
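As a cross-check that does not require mosaic, the same test can be run with base R's `chisq.test`. This sketch assumes `xchisq.test` reports the same statistic and p-value as `chisq.test`, which the matching output header suggests; the name `mouse_test` is illustrative.

```r
# Observed F2 counts: brown, white
observed <- c(164, 36)

# Hypothesized 3:1 brown-to-white ratio
mouse_test <- chisq.test(x = observed, p = c(0.75, 0.25))

mouse_test$expected    # expected counts used for verification
mouse_test$statistic   # chi-squared statistic
mouse_test$p.value     # compare against alpha = 0.1
```

The statistic and p-value should match the X-squared = 5.2267 and p-value = 0.02224 shown in the output above.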

V. Example 3: Births on Week Days vs. Weekends

Are fewer human babies born on weekend days (proportionally) than on weekdays?

Note that, naturally, 2 out of 7 babies would be born on weekend days while 5 out of 7 would be born on weekdays. Modern medicine produced a large increase in scheduled births (planned inductions or planned C-sections) in recent years. If parents and doctors work together to schedule a birth, it sure ain’t gonna be at midnight on a Saturday! Consider the following observed data for births in one county in Georgia for 2019. \[\begin{array}{ccccccc} \text{Su}&\text{M}&\text{T}&\text{W}&\text{R}&\text{F}&\text{Sa}\\ 11 & 29 & 16 & 14 & 17 & 23 & 9 \end{array}\]


Is there evidence at the \(\alpha = 0.05\) level that fewer babies (proportionally) are born on weekends?

Our hypotheses: \[\begin{align*}H_0 &: p_{Su} = p_M = p_T = p_W = p_R = p_F= p_{Sa} = \frac{1}{7}\\ H_a &: \text{At least one probability significantly different}\end{align*}\]

Creating the cell count vectors:

observed = c(11, 29, 16, 14, 17, 23, 9)
expected = c(1/7,1/7,1/7,1/7,1/7,1/7,1/7)

Run the procedure:

xchisq.test(x = observed,
           p = expected)

    Chi-squared test for given probabilities

data:  x
X-squared = 17.059, df = 6, p-value = 0.009069

 11.00    29.00    16.00    14.00    17.00    23.00     9.00  
(17.00)  (17.00)  (17.00)  (17.00)  (17.00)  (17.00)  (17.00) 
[2.118]  [8.471]  [0.059]  [0.529]  [0.000]  [2.118]  [3.765] 
<-1.46>  < 2.91>  <-0.24>  <-0.73>  < 0.00>  < 1.46>  <-1.94> 
             
key:
    observed
    (expected)
    [contribution to X-squared]
    <Pearson residual>

The expected cell counts are all 17 and thus larger than 5, so the data are appropriate for the \(\chi^2\) GOF procedure. Given that \(p = 0.009069 < 0.05 = \alpha\), we reject the null. We have evidence that the model stating babies are equally likely to be born each day of the week is false.
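The seven-day test tells us the days differ, but it does not single out weekends. One way to look at the weekend question more directly is to collapse the data into two categories and test against the natural proportions of \(\frac{2}{7}\) and \(\frac{5}{7}\). The sketch below is an illustration in base R and is not part of the analysis above; the names `births`, `grouped`, and `grouped_test` are our own.

```r
# Daily births in the order Su, M, T, W, R, F, Sa
births <- c(11, 29, 16, 14, 17, 23, 9)

# Collapse into weekend (Su + Sa) and weekday (M through F) totals
grouped <- c(weekend = births[1] + births[7],
             weekday = sum(births[2:6]))
grouped

# Under the null, 2 of every 7 births fall on a weekend day
grouped_test <- chisq.test(x = grouped, p = c(2/7, 5/7))

grouped_test$expected    # expected weekend and weekday counts
grouped_test$statistic   # chi-squared statistic with df = 1
grouped_test$p.value     # compare against alpha = 0.05
```

Because the \(\chi^2\) statistic ignores direction, we read "fewer" from the table itself: the observed weekend count (20) is below its expected count (34).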

VI. Exercises

  1. A baseball card company claims that 25% of its cards are rookies, 65% are veterans but not All-Stars, and 10% are veteran All-Stars. Suppose a random sample of 200 cards has 70 rookies, 120 veterans, and 10 All-Stars. Is the company’s claimed distribution credible? Test using \(\chi^2\) GOF with a 0.05 level of significance.

  2. A supposedly fair die is rolled 50 times with the following results: 9 ones, 15 twos, 9 threes, 8 fours, 6 fives and 13 sixes. Test at the 0.05 level whether the die is actually fair using \(\chi^2\) GOF.

  3. Helena buys custom M&M’s for a bridal shower she’s throwing for her sister. She orders 10% yellow, 20% pale blue, 30% red and 40% pink. When she receives her 10 lb. package, she sees almost no yellow and far too many pale blue. She randomly selects 200 of the M&M’s and finds the following observed counts: \[\begin{array}{cccc}\text{Yellow}&\text{Blue}&\text{Red}&\text{Pink}\\ \hline 8&54&64&74\end{array}\]

