KITADA

Lesson #9

Goodness of Fit Test for a Single Categorical Variable

Motivation:

Suppose a variable of interest is categorical, but there are more than two categories! What do we do? If there is only one population, a Goodness of Fit test can be performed. One way to perform a Goodness of Fit test is to use the Chi-square (pronounced “Keye”) methods. The Chi-square methods give a good approximation of the true p-value as long as certain conditions are met. In this lesson, we’ll focus on using the Chi-square methods to estimate the p-value in a Goodness of Fit test to determine whether the proportions in the categories differ from the hypothesized proportions.

*What you need to know from this lesson:*

After completing this lesson, you should be able to

- explain when a Goodness of Fit test is the appropriate test to perform
- write the null and alternative hypotheses for a Goodness of Fit test
- explain when the Chi-square methods can be used to perform the Goodness of Fit test
- calculate the expected number of cases that would be in each category if the null hypothesis is true
- calculate the Chi-square statistic based on the observed and expected counts
- determine the degrees of freedom for a Chi-square statistic in a Goodness of Fit test
- determine the p-value using technology and/or a Chi-square table
- state a conclusion in the context of the problem

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 7.1 in the text
3. Do the Lesson 9 questions at the end of the lesson notes

The Lesson

Example: Commuting to school

Suppose that at some other major university, 60% of the students drive to campus each day, 30% either walk or ride a bike, and the other 10% use some other form of transportation (get a ride or take the bus, for example). Let’s conduct a survey to determine if students at OSU follow the same percentages as at this other major university.

Step 1: Identify the variable of interest and population:

1. What is the variable of interest? Is it categorical or quantitative?

Mode of transportation to campus today. Categorical with 3 categories

2. What is the population of interest?

There is one population: all OSU students

Step 2: Assess if the sample is representative of the population

3. Give an argument why the mode of transportation to campus today for those in the sample may be different than the mode of transportation other students at OSU used to get to campus today.

Time of year. In the winter, students may drive more because of the cold weather.

Step 3: Determine if this is an estimation only problem or a hypothesis test problem

4. Is this an estimation only problem or hypothesis test problem? Why?

Hypothesis test since we want to compare OSU students to these known percentages at this other university.

Note: for inference problems involving categorical variables with more than two categories, confidence intervals cannot be constructed. Therefore, all of these problems will involve doing a hypothesis test.

5. Explain why this is an example of using a Goodness of Fit test.

1. Categorical variable of interest with more than two categories
2. One population

Step 4: Restate the question of interest in terms of the null and alternative hypothesis

6. State the null and alternative hypotheses. (Note: the null hypothesis would be what we would expect to happen if OSU students “behaved” like the students at this other university in terms of getting to campus.)

\( H_0: p_{drive} = 0.6, p_{walk/bike} = 0.3, p_{other} = 0.1 \)
\( H_A \): at least one of the proportions is different from its hypothesized value

Step 5: Explore the sample data

7. Based on the data collected in class, complete the frequency table (table of counts) below.

  mode of commuting     drive   bike/walk   other

  frequency               15        42         3

  proportion            0.25      0.70      0.05
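The proportions in the table can be reproduced with a short R sketch. The vector name `observed` and the category label `bike_walk` are illustrative choices, not names used elsewhere in the lesson.

```r
# Observed counts from the class survey (copied from the frequency table)
observed <- c(drive = 15, bike_walk = 42, other = 3)

n <- sum(observed)            # total number of students surveyed
proportions <- observed / n   # sample proportion in each category

n
proportions
```

Each proportion is simply a category's count divided by the total sample size.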

8. Are these proportions sample proportions or population proportions?

Sample proportions

Step 6: Determining the p-value

The Chi-square methods

9. When can the Chi-square methods be used to perform the Goodness of Fit test?

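The Chi-square methods require a hypothesized set of proportions to test against, and they approximate the true p-value well only when the sample is large enough. A common rule of thumb is that the expected count in every category should be at least 5.

10. Can the Chi-square methods be used in this example? Explain.

Yes. The other university supplies the hypothesized proportions (0.6, 0.3, and 0.1), and with 60 students in the sample the expected counts are 60(0.6) = 36, 60(0.3) = 18, and 60(0.1) = 6, all of which are at least 5.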

11. To determine the p-value using the Chi-square methods, we will compare the number of cases observed in each of the categories to the number of cases we’d expect to observe in each of the categories if the null hypothesis is true. We’ll call these expected numbers “expected counts”.

a. If the null hypothesis is true, how many students in this sample would we expect to drive to campus?

### DRIVE TO CAMPUS (UNDER H0)
60*.6

## [1] 36

b. If the null hypothesis is true, how many students in this sample would we expect to bike or walk to campus?

### WALK OR BIKE TO CAMPUS (UNDER H0)
60*.3

## [1] 18

c. If the null hypothesis is true, how many students in this sample would we expect to use some other form of transportation to get to campus?

### OTHER TRANSPORTATION TO CAMPUS (UNDER H0)
60*.1

## [1] 6

d. In general, what is a formula to calculate the expected count in each category?

Expected count = sample size times the hypothesized proportion for the given category
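The three separate calculations above can also be done in one step. This is a minimal sketch; the names `n`, `p_null`, and `expected` are illustrative, and the sample size of 60 comes from the class data.

```r
n <- 60                                                  # sample size
p_null <- c(drive = 0.6, bike_walk = 0.3, other = 0.1)   # proportions under H0

# Hypothesized proportions must account for every category
stopifnot(isTRUE(all.equal(sum(p_null), 1)))

expected <- n * p_null   # expected count in each category if H0 is true
expected
```

Because R multiplies a vector element by element, one line gives the expected count for every category.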

12. Calculation of the Chi-square statistic

Step 1: Compare the observed counts with the expected counts for each mode of commuting by subtracting the expected count from the observed count. Place each difference in the correct spot in the table below.

a. What is the sum of the (observed – expected) counts?

(15-36)+(42-18)+(3-6)

## [1] 0
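This sum is zero for any Goodness of Fit test, not just for this sample. As a short derivation (the symbols \( O_i \) for an observed count, \( E_i = n p_i \) for an expected count, \( p_i \) for a hypothesized proportion, and \( n \) for the sample size are notation introduced here for illustration):

\[
\sum_{i=1}^{k} (O_i - E_i) = \sum_{i=1}^{k} O_i - n \sum_{i=1}^{k} p_i = n - n(1) = 0
\]

The observed counts add up to \( n \), and the hypothesized proportions add up to 1, so the two totals cancel.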

Step 2: The (observed – expected) differences always sum to zero, so their sum provides no useful information. To fix this, we square each difference. Calculate \( (observed - expected)^2 \) for each mode of commuting and place it in the correct spot in the table below.

Step 3: There may be situations where one expected count is small (say 20) and another expected count is large (say 150). Suppose that the (observed – expected) count for these two cells is 10. That difference of 10 is quite large compared to an expected cell count of 20 (50% of 20), but relatively small compared to an expected cell count of 150 (only about 7% of 150). We want to account for this relative difference when calculating the Chi-square statistic. In particular, we want to give more weight to those groups with small expected cell counts (since the relative difference is quite large for these groups). To do this, we divide by the expected cell count.
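Using the hypothetical numbers from this step (a difference of 10 with expected counts of 20 and 150), dividing the squared difference by the expected count gives:

\[
\frac{10^2}{20} = 5 \qquad \text{versus} \qquad \frac{10^2}{150} \approx 0.67
\]

The same difference of 10 counts for much more in the cell with the small expected count.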

Return to our example. For each mode of commuting, calculate (observed – expected)^2/expected and place it in the appropriate spot in the table below. (These values are each category’s contribution to the chi-square statistic.)

  mode of commuting                     drive   bike/walk   other

  observed count                          15        42         3

  expected count                          36        18         6

  observed – expected                    -21        24        -3

  (observed – expected)^2                441       576         9

  (observed – expected)^2/expected     12.25        32       1.5

Step 4: To calculate the chi-square statistic, add the values in the last row of the table. In notation:

\( \chi_{(k-1)}^2=\sum_{\text{all groups}} \frac{(\text{observed}-\text{expected})^2}{\text{expected}} \)

\( k \) is the number of categories

\( \chi^2 \) statistic = 12.25+32+1.5 = 45.75
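The table and the final sum can be checked with a few lines of R. This is a sketch only; the names `observed`, `p_null`, `expected`, `contributions`, and `chi_sq` are illustrative.

```r
observed <- c(drive = 15, bike_walk = 42, other = 3)     # class survey counts
p_null   <- c(drive = 0.6, bike_walk = 0.3, other = 0.1) # proportions under H0

expected <- sum(observed) * p_null                   # 36, 18, 6

# Last row of the table: each category's contribution to the statistic
contributions <- (observed - expected)^2 / expected

chi_sq <- sum(contributions)                         # chi-square statistic

contributions
chi_sq
```

Printing `contributions` also shows which category is furthest from its hypothesized proportion; here it is bike/walk.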

b. What are the possible values for a Chi-square statistic?

The test statistic can take values from 0 to infinity

c. Which values of a Chi-square statistic would lead to stronger evidence to reject the claim made in the null hypothesis?

Larger values of the test statistic

13. Determine the p-value based on the Chi-square methods.

a. Chi-square statistics have degrees of freedom. In general, how many degrees of freedom does the Chi-square statistic have in a Goodness of Fit test?

\( k-1 \)

b. How many degrees of freedom does the Chi-square statistic have in this example?

df=3-1=2

c. Determine the p-value using the Chi-square cdf function on TI-83:

Choose 2nd DISTR
Choose option 7: \( \chi^2 \) cdf
There are three arguments needed in the parentheses: \( \chi^2 \) cdf(lower bound, upper bound, degrees of freedom)
- The lower bound will ALWAYS be the chi-square statistic. (Think about what was discussed in #12 to understand why.)
- The upper bound will be some value of a Chi-square statistic that would rarely occur, such as 100000000 or 1E99
- Enter the correct value of the degrees of freedom

Fill in the correct arguments for this example:

\( \chi^2 \) cdf (45.75,1E99,2)

What is the p-value?

p-value = 1.1628E-10
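For readers working in R rather than on a calculator, the same upper-tail area can be found with `pchisq()`. This is a minimal sketch; the variable names are illustrative.

```r
chi_sq <- 45.75   # chi-square statistic from #12
df <- 3 - 1       # number of categories minus 1

# lower.tail = FALSE returns the area to the right of chi_sq,
# so no artificial upper bound is needed
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value
```

The result should agree with the calculator value above, roughly 1.16E-10.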

d. The chi-square table is another option. If the table were used, how would the p-value be written?

Step 7: Answer the question of interest in the context of the problem

14. State a conclusion in the context of the problem.

Because the p-value is less than 0.00001, we reject the null hypothesis. There is convincing evidence that the proportions of OSU students using each mode of transportation to get to campus differ from the proportions at the other university.
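As a cross-check, R's built-in `chisq.test()` carries out the whole Goodness of Fit calculation in one call when it is given a vector of counts and a vector of hypothesized proportions. The names `observed`, `p_null`, and `result` are illustrative.

```r
observed <- c(drive = 15, bike_walk = 42, other = 3)   # class survey counts
p_null   <- c(0.6, 0.3, 0.1)                           # proportions under H0

# Goodness of Fit test of the counts against the hypothesized proportions
result <- chisq.test(x = observed, p = p_null)

result$statistic   # chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$expected    # expected counts under H0
```

The statistic, degrees of freedom, and expected counts should match the hand calculations in the lesson.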

There is no confidence interval for a Chi-square test!!!!!