KITADA

Lab Activity #5

Inference for categorical variables using the Chi-square methods

Objectives:

I. Example

The EAR data set is based on 214 children with acute otitis media (OME) who participated in a randomized clinical trial. Each child had OME at the beginning of the study in either one (unilateral cases) or both (bilateral cases) ears. Each child was randomly assigned to receive a 14-day course of one of two antibiotics, either cefaclor (CEF) or amoxicillin (AMO). The focus here is on the 277 ears of the children whose middle-ear status was determined at a 14-day follow-up visit.

Column   Variable     Format or Code
-------------------------------------------------------------------------
1        ID
2        Clearance    1=yes (ear cleared) / 0=no (ear had not cleared)
3        Antibiotic   1=CEF / 2=AMO
4        Age          1=<2 yrs / 2=2-5 yrs / 3=6+ yrs
5        Ear          1=first ear infected / 2=second ear infected

A. The variable of interest is whether or not a child’s ear was cleared. There are several different variables that may be explaining clearance of the ear: which antibiotic was used, age of the child, and how many ears were infected. Let’s investigate one in this part and leave another for you to investigate in Part C.

Is there an association between clearance of an infected ear (after 14 days) and type of antibiotic used?

Step 1: Identify the variable of interest and the populations

1. Which variable is the response variable and which is the explanatory variable?

Response = Clearance of an infected ear

Explanatory = Type of antibiotic

2. What “groups” are being compared? Therefore, how many “populations” are there?

We are comparing the CEF and AMO antibiotic treatment groups. There are two populations.

Step 2: Assess whether the samples are representative of the populations

3. One “condition” that must exist for conclusions to be valid to a population of interest is that the responses from the cases in the study must be independent of each other. That is, the response from one case must not have any effect on the response of any other case in the study. If too many responses are dependent, the results from the study may not be reflective of what is truly happening in a particular population (i.e. the sample may not be representative of the population).

Is there a reason to believe that responses from some cases are not independent? (Hint: recall that each case represents a separate row in the dataset.)

No, these should be independent.

Step 3: Determine if this is an estimation only problem or hypothesis test problem

4. Can the two-proportions method be used in this example?

Yes, a two-proportion method could be used, because the response has two categories and two groups are being compared.

With two categorical variables, the Chi-square methods can be used to perform a hypothesis test. We’ll show how to perform the chi-square test for this problem. However, a confidence interval cannot be constructed using the Chi-square methods.
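As a minimal sketch of how the two approaches relate, the two-proportion test and the Chi-square test can be run on the same counts. The counts are the ones that appear in the table of counts in Step 5; the object names cleared, n.ears, and counts are illustrative and are not part of the lab's data sets.

```r
# Ears cleared and total ears in each antibiotic group (counts from Step 5)
cleared <- c(CEF = 88, AMO = 56)
n.ears  <- c(CEF = 149, AMO = 128)

# Two-proportion test (continuity correction is on by default)
two.prop <- prop.test(x = cleared, n = n.ears)

# Chi-square test on the same counts arranged as a 2x2 table
counts <- rbind(not.cleared = n.ears - cleared, cleared = cleared)
chi.sq <- chisq.test(counts)

two.prop$p.value   # p-value from the two-proportion test
chi.sq$p.value     # p-value from the Chi-square test
two.prop$conf.int  # interval for the difference in proportions
```

For a 2x2 table the two p-values should agree, while only the two-proportion output includes a confidence interval.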

5. Which type of Chi-square test will be performed in this example: a Goodness of Fit test or a Test for Association? Why?

We would perform a Chi-square Test for Association, since there isn't a hypothesized set of values.

Step 4: State the null and alternative hypothesis

6. State the null and alternative hypotheses in words. In addition, write the null hypothesis in notation. Define the notation used.

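\( H_0 \): There is NO association between antibiotic type and whether or not the child's ear is cleared of infection after 14 days.

In notation, \( H_0: p_{CEF}=p_{AMO} \), where \( p_{CEF} \) and \( p_{AMO} \) are the population proportions of ears cleared after 14 days with CEF and with AMO.

\( H_A \): There IS an association between antibiotic type and whether or not the child's ear is cleared of infection after 14 days.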

Note: if a response variable had more than two categories, it may be challenging to write the hypotheses in notation. Also, there are certain situations where neither variable could be considered the response or explanatory variable. For either of these two situations, the hypotheses can only be written in words in terms of “no association” or “an association”. Understand, though, what “no association” or “an association” means in the context of the problem.

Step 5: Explore the data

When a Chi-square test is to be performed, there are a couple of reasons for exploring the data collected:

7. Use the commands from the Week 4 Lab Activity (which are reproduced in Part III of this lab activity) to construct a table of counts. We can obtain row and column totals by applying the addmargins() function to the constructed table.

ear.table <- with(EAR, table(Clearance, Antibiotic))
ear.table
##          Antibiotic
## Clearance  1  2
##         0 61 72
##         1 88 56
addmargins(ear.table) 
##          Antibiotic
## Clearance   1   2 Sum
##       0    61  72 133
##       1    88  56 144
##       Sum 149 128 277
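
If the EAR data set is not at hand, the same table can be rebuilt from the counts printed above. This is only a sketch: ear.counts and with.totals are illustrative names, and the counts are typed in by hand rather than read from the data.

```r
# Rebuild the 2x2 table of counts from the printed output.
# matrix() fills column by column: Antibiotic 1 first, then Antibiotic 2.
ear.counts <- matrix(c(61, 88, 72, 56), nrow = 2,
                     dimnames = list(Clearance = c("0", "1"),
                                     Antibiotic = c("1", "2")))
ear.counts <- as.table(ear.counts)

# Add the row and column totals
with.totals <- addmargins(ear.counts)
with.totals
```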

To obtain the appropriate row or column percents, we can use the prop.table() function on the table constructed, with a second argument margin= to specify showing row percents (1) or column percents (2) in the table.

# To show the row percents
prop.table(ear.table, margin=1)
##          Antibiotic
## Clearance         1         2
##         0 0.4586466 0.5413534
##         1 0.6111111 0.3888889
# To show the column percents
prop.table(ear.table, margin=2)
##          Antibiotic
## Clearance        1        2
##         0 0.409396 0.562500
##         1 0.590604 0.437500
# To show individual cell percentages (out of the total in the table)
prop.table(ear.table)
##          Antibiotic
## Clearance         1         2
##         0 0.2202166 0.2599278
##         1 0.3176895 0.2021661
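
To see what margin=2 is doing, the column proportions can also be computed by dividing each column by its own total. This is a sketch using counts typed in from the table above; ear.counts and col.props are illustrative names.

```r
# Counts from the table above (filled column by column)
ear.counts <- matrix(c(61, 88, 72, 56), nrow = 2,
                     dimnames = list(Clearance = c("0", "1"),
                                     Antibiotic = c("1", "2")))

# Divide every column by its column total
col.props <- sweep(ear.counts, 2, colSums(ear.counts), "/")
col.props

# Each column of proportions adds to 1
colSums(col.props)
```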

To obtain the expected counts, the easiest way is to run the Chi-Squared Test chisq.test() and extract the expected counts from it. This is done through the use of a new syntax which includes a dollar sign ($). Basically, if you want to extract a value from an object created, the syntax is object$value.

chisq.test(ear.table)$expected
##          Antibiotic
## Clearance        1        2
##         0 71.54152 61.45848
##         1 77.45848 66.54152
# OR YOU CAN CODE:

ear.test <- chisq.test(ear.table)
ear.test$expected
##          Antibiotic
## Clearance        1        2
##         0 71.54152 61.45848
##         1 77.45848 66.54152

Both approaches above will give you a table of expected cell counts. You can compare this to a by-hand calculation using your table above with the margin sums added.
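
The by-hand calculation uses (row total x column total) / grand total for each cell. A minimal sketch, with the observed counts typed in from the table above and observed and expected as illustrative names:

```r
# Observed counts (filled column by column)
observed <- matrix(c(61, 88, 72, 56), nrow = 2,
                   dimnames = list(Clearance = c("0", "1"),
                                   Antibiotic = c("1", "2")))

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Condition check: is every expected count at least 5?
all(expected >= 5)
```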

a. Based on the expected counts, can the Chi-square methods be used to obtain a good approximation of the p-value for the Test for an Association? Explain.

Yes, because all expected counts are greater than 5.

b. Give the proportion of ears cleared for each group:

\( \hat{p}_{CEF}=0.590604 \)

\( \hat{p}_{AMO}=0.437500 \)

8. Use the commands from the Week 4 Lab Activity (again, commands reproduced in Part III) to obtain a side-by-side bar chart. Based on the side-by-side bar chart and the sample proportions from #7, do you feel the distribution of ears cleared is not the same between those taking the CEF and those taking the AMO? Therefore, do you feel there will be evidence to reject the claim made in the null hypothesis?

barplot(ear.table, 
        beside=TRUE,
        legend.text=TRUE)

[Figure: side-by-side bar chart of Clearance counts by Antibiotic]

Step 6: Determine the p-value

To perform a Test for an Association using the Pearson Chi-square method, we use the chisq.test() function.

chisq.test(ear.table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ear.table
## X-squared = 5.8672, df = 1, p-value = 0.01543

9. There are two “tests of association” that R does. We’ll be using the Pearson Chi-square test. Give the \( \chi^2 \)-statistic, degrees of freedom, and p-value from the Pearson Chi-square test.

a. Why is there 1 degree of freedom?

(R-1)(C-1) = (2-1)(2-1) = 1
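
The output above is labeled "with Yates' continuity correction", which chisq.test() applies by default to 2x2 tables. The sketch below recomputes that statistic by hand and also shows the uncorrected Pearson statistic for comparison. The counts are typed in from the table in Step 5, and all object names are illustrative.

```r
# Observed counts (filled column by column) and expected counts
observed <- matrix(c(61, 88, 72, 56), nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Statistic with Yates' correction: shrink each |O - E| by 0.5 before squaring
yates.stat <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Uncorrected Pearson statistic, the same as chisq.test(observed, correct = FALSE)
pearson.stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail area of the Chi-square distribution gives the p-value
p.yates <- pchisq(yates.stat, df = df, lower.tail = FALSE)
c(statistic = yates.stat, df = df, p.value = p.yates)
```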

10. Since both variables have exactly two categories, the Fisher’s Exact Test can be used to obtain the p-value:

fisher.test(ear.table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ear.table
## p-value = 0.01168
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.3248839 0.8940932
## sample estimates:
## odds ratio 
##  0.5403591

Note: Fisher’s Exact Test gives an exact p-value for a Test for Association. In this class we use it only for 2x2 tables of counts (2 rows and 2 columns). R's fisher.test() also accepts tables larger than 2x2, but those cases will not be shown or discussed in this class.

Note also that the Fisher’s Exact Test output gives a confidence interval and sample estimate for the odds ratio. We do not discuss the odds ratio in this class.
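
For the curious, here is a sketch of where an "exact" p-value can come from. With the row and column totals held fixed, the count in one cell follows a hypergeometric distribution under the null hypothesis, and the two-sided p-value adds up the probabilities of all tables that are no more likely than the observed one. This goes beyond what the class requires; the counts are typed in from the table in Step 5 and the object names are illustrative.

```r
# Observed counts (filled column by column)
observed <- matrix(c(61, 88, 72, 56), nrow = 2)

x <- observed[1, 1]        # observed count in the top-left cell
m <- sum(observed[, 1])    # total for column 1
n <- sum(observed[, 2])    # total for column 2
k <- sum(observed[1, ])    # total for row 1

# All values the top-left cell could take with these totals
support <- max(0, k - n):min(k, m)
probs   <- dhyper(support, m, n, k)

# Sum the probabilities of outcomes no more likely than the observed one
# (the small factor guards against rounding error in the comparison)
p.obs   <- dhyper(x, m, n, k)
p.exact <- sum(probs[probs <= p.obs * (1 + 1e-7)])
p.exact
```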

Step 7: State a conclusion in the context of the problem

11. Does it appear that one drug does a better job at clearing infected ears than the other drug? Based on the p-value, answer the question in a complete sentence in the context of the problem.

Yes, there is evidence of an association between drug type and clearing of the ear infection (p-value = 0.01543), so we reject the null hypothesis at the \( \alpha=0.05 \) level. In the sample, a larger proportion of ears cleared with CEF (about 59%) than with AMO (about 44%).

12. What does it mean to say that there is “evidence of an association” between antibiotic and whether or not an ear was cleared in this example?

This means that there is evidence that the proportion of ears cleared after 14 days differs between the CEF group and the AMO group.

B. Suppose the children in this study were selected from a larger group of children. The researchers wondered if half the children were between 2 and 5 years old with an equal proportion of the other children under 2 and over 5 years old (25% under 2 and 25% over 5 years old). Based on the ages of the children selected to be part of the study, is there evidence to indicate that the proportion of children in the different age groups is different than what the researchers were wondering? (Use the EARNEW data set to answer this question. The EARNEW data set has only one row of information for each child.)

1. What is the variable of interest in this question?

We are interested in the proportion of children in each age group.

2. Is there an explanatory variable?

No. There is only one variable (age group), so there is no explanatory variable.

3. State the null and alternative hypotheses. Define any notation used in the hypotheses.

\( H_0: p_{<2}=0.25, p_{2-5}=0.5, p_{6+}=0.25 \)

\( H_A \): At least one proportion is different from its hypothesized value.

4. Use R to obtain a table of counts with percents for the variable Age. You can use the commands for a table from before but only on a single variable. Based on the table of counts, fill in the information below:

earnew.table <- with(EARNEW, table(Age))
earnew.table
## Age
##  1  2  3 
## 57 97 48
prop.table(earnew.table)
## Age
##         1         2         3 
## 0.2821782 0.4801980 0.2376238

5. What is an appropriate graph to use to display the data? Obtain that graph.

barplot(earnew.table, 
        legend.text=TRUE)

[Figure: bar chart of counts by Age group]

6. Based on the percentages (or proportions) in #4 and the graph obtained in #5, do you feel the claim in the null hypothesis will be rejected? Explain.

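It doesn't look like the sample proportions (about 28%, 48%, and 24%) are very different from the hypothesized proportions (25%, 50%, and 25%), so we do not expect the null hypothesis to be rejected.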

7. Why can a Goodness of Fit test be performed here?

8. Can the Chi-square methods be used? Explain.

Yes, each group has an expected count of at least 5.
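
For a Goodness of Fit test, each expected count is the sample size times the hypothesized proportion. A minimal sketch, with the age counts typed in from the table in #4 and age.counts, null.props, and expected.age as illustrative names:

```r
# Observed counts for age groups 1 (<2 yrs), 2 (2-5 yrs), 3 (6+ yrs)
age.counts <- c("1" = 57, "2" = 97, "3" = 48)

# Hypothesized proportions, in the same order as the counts
null.props <- c(0.25, 0.5, 0.25)

# Expected count = total sample size * hypothesized proportion
expected.age <- sum(age.counts) * null.props
expected.age

# Condition check: is every expected count at least 5?
all(expected.age >= 5)
```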

9. Follow the commands below to obtain the p-value in R:

chisq.test(earnew.table, p=c(0.25, 0.5, 0.25))
## 
##  Chi-squared test for given probabilities
## 
## data:  earnew.table
## X-squared = 1.1188, df = 2, p-value = 0.5715

Note: The test for association and the goodness-of-fit test both use the same base function chisq.test(). However, the goodness-of-fit test uses a vector of counts instead of a matrix of counts. Also, we must specify the null hypothesized proportions using the p= argument. These hypothesized proportions must follow the ordering of the categories presented in your vector of counts. Age group 1 was <2 years old, age group 2 was 2-5 year olds and age group 3 was 6+ years old.

a. Give the values of the following:

b. Why are there 2 degrees of freedom?

k-1 = 3-1 = 2
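
The statistic and p-value reported by chisq.test() can be checked by hand. This is only a sketch under the same hypothesized proportions, with the age counts typed in from the table in #4 and illustrative object names:

```r
# Observed counts and hypothesized proportions (same category order)
age.counts <- c("1" = 57, "2" = 97, "3" = 48)
null.props <- c(0.25, 0.5, 0.25)
expected.age <- sum(age.counts) * null.props

# Chi-square statistic: sum of (observed - expected)^2 / expected
gof.stat <- sum((age.counts - expected.age)^2 / expected.age)

# Degrees of freedom: number of categories minus 1
gof.df <- length(age.counts) - 1

# Upper-tail area of the Chi-square distribution gives the p-value
gof.p <- pchisq(gof.stat, df = gof.df, lower.tail = FALSE)
c(statistic = gof.stat, df = gof.df, p.value = gof.p)
```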

10. Based on the p-value, state a conclusion in the context of the problem.

There is no evidence to suggest that the age distribution of children is different from the one proposed (50% 2-5 years, 25% less than 2, 25% more than 5).