KITADA

Lab Activity #4

Inference for two categorical variables, each with two categories: The Two-Proportion Methods

Objectives:

Use the twoprop macro in R to obtain a p-value from a hypothesis test for the difference in population proportions and construct a confidence interval for the difference in population proportions
Use R to perform a hypothesis test for the difference in population proportions using Fisher's Exact Test
Interpret R output from all analyses listed above

Part I: Examples

Example 1: Sleep and Work example

Do people working long hours have difficulty falling asleep? This and other questions were investigated in a recent study. Two independent samples of British civil service workers were used in the study, those who work 35 to 40 hours per week and those who work more than 40 hours per week. Of the 952 workers working between 35 and 40 hours per week, 64 said they had difficulty falling asleep (which was defined as having difficulty falling asleep at least 3 times per week). Of the 1497 workers who worked more than 40 hours per week, 101 said they had difficulty falling asleep. The data can be found in the WORKANDSLEEP data set in Canvas. Import this data set into R.

Coding:

1 : Sleep difficulty
0 : No sleep difficulty
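If the WORKANDSLEEP file is not at hand, a stand-in with the same shape can be rebuilt from the counts in the problem statement. This is only a sketch: the name work_sleep_demo and the group labels "35-40 Hours" and "Over 40 Hours" are assumptions for illustration, and the labels in the real data set may differ.

```r
# Hypothetical stand-in for WORKANDSLEEP, rebuilt from the published counts.
# The object name and the group labels are assumptions, not the real file.
n_short <- 952;  diff_short <- 64    # 35 to 40 hours per week
n_long  <- 1497; diff_long  <- 101   # more than 40 hours per week

work_sleep_demo <- data.frame(
  hours.work.per.week = rep(c("35-40 Hours", "Over 40 Hours"),
                            times = c(n_short, n_long)),
  sleep.difficulty    = c(rep(c(1, 0), times = c(diff_short, n_short - diff_short)),
                          rep(c(1, 0), times = c(diff_long,  n_long  - diff_long)))
)

str(work_sleep_demo)                      # one row per worker, two columns
table(work_sleep_demo$sleep.difficulty)   # how many 0s and 1s overall
```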

Step 1: Identify the variable of interest and the population

1. Another name for the variable of interest is the response variable. What is the response variable? Is it categorical or quantitative? If categorical, what are the categories?

Response: Sleep difficulty or no sleep difficulty

Categorical

2. What is the explanatory variable? Is it categorical or quantitative? If categorical, what are the categories?

Explanatory: 35-40 Hours Worked or Over 40 Hours Worked

3. What are the populations?

British civil service workers

Step 2: Assess if the samples are representative of the populations

4. Those initially recruited for the study were from 20 London-based civil service departments. Do you feel that any relationship between sleep patterns and working hours found in the study would be different for

a. other workers in London?

Yes

b. other workers in Britain?

Maybe

c. other workers worldwide?

Step 3: Determining if the problem is an estimation only problem or hypothesis test problem

In this particular problem, we’re investigating the relationship between number of hours worked and sleep difficulties. Here are two specific questions about the relationship between these two variables. For each, state whether the question involves a hypothesis test or is an estimation only question.

5. Is there a difference in the proportion of workers with sleep difficulties between those working 35 to 40 hours a week and those working more than 40 hours a week?

Hypothesis testing

6. What is the difference in the proportion of workers with sleep difficulties between those working 35 to 40 hours a week and those working more than 40 hours a week?

Estimation

Step 4: State the null and alternative hypotheses

Let’s investigate the question in #5 above.

7. State the null and alternative hypotheses in words and in statistical notation. Define any parameters used in the notation.

\( H_0: p_{\text{35-40 Hours}} = p_{\text{Over 40 Hours}} \)

\( H_a: p_{\text{35-40 Hours}} \neq p_{\text{Over 40 Hours}} \)

\( p_{\text{35-40 Hours}} \) : Proportion of all British civil service workers working 35 to 40 hours per week who have difficulty falling asleep
\( p_{\text{Over 40 Hours}} \) : Proportion of all British civil service workers working more than 40 hours per week who have difficulty falling asleep

Step 5: Explore the sample data

8. By hand, present the summary information in a table of counts. In the table, let the rows be the categories of the response variable and the columns the categories of the explanatory variable. Put row and column totals in the margin of the table.

SEE R CODE
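The hand-built table can be checked with a short sketch that types in the counts from the problem statement and adds the margins. The object name counts_by_hand and the column labels are assumptions for illustration.

```r
# Counts from the problem statement; rows are the response categories,
# columns are the explanatory categories (labels are assumed).
counts_by_hand <- as.table(matrix(
  c(952 - 64, 64,      # 35 to 40 hours: no difficulty, difficulty
    1497 - 101, 101),  # over 40 hours:  no difficulty, difficulty
  nrow = 2,
  dimnames = list(sleep.difficulty    = c("0", "1"),
                  hours.work.per.week = c("35-40 Hours", "Over 40 Hours"))
))

# addmargins() appends a "Sum" row and a "Sum" column with the totals
with_totals <- addmargins(counts_by_hand)
with_totals
```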

9. Use R to construct the table of counts with row percents and column percents

The data set WORKANDSLEEP contains rows for each individual surveyed. For many types of analysis, we will want to table these data as done by hand above.

sleep_table<-with(WORKANDSLEEP, 
                  table(sleep.difficulty,hours.work.per.week))
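The table() call gives counts only. Row and column percents can be obtained with prop.table(). The sketch below builds the table from the counts in the problem statement so that it runs on its own; with the imported data, sleep_table can be used in place of demo_table. The name demo_table and the column labels are assumptions.

```r
# Stand-in for sleep_table, typed in from the problem statement
demo_table <- as.table(matrix(
  c(888, 64, 1396, 101), nrow = 2,
  dimnames = list(sleep.difficulty    = c("0", "1"),
                  hours.work.per.week = c("35-40 Hours", "Over 40 Hours"))
))

# margin = 1: each row sums to 100 (row percents)
row_pct <- round(100 * prop.table(demo_table, margin = 1), 2)

# margin = 2: each column sums to 100 (column percents);
# the "1" row holds the sample percent with sleep difficulty in each group
col_pct <- round(100 * prop.table(demo_table, margin = 2), 2)

row_pct
col_pct
```

Because hours worked is the explanatory variable and sits in the columns, the column percents are the ones that compare the two groups.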

Important note: never copy and paste a table of counts into an Assignment. Rather, use the table of counts from the output and reconstruct the table on your own in your assignment!

10. A particular graph used to compare the relationship between two categorical variables is the side-by-side bar chart. Follow these commands to construct the side-by-side bar chart in R.

barplot(sleep_table, 
        beside=TRUE,
        legend.text=TRUE,
        ylim=c(0, 1600))


NOTES ON ARGUMENTS:

1. beside=TRUE gives us the side-by-side bars. If this is omitted, stacked bars will be displayed.
2. legend.text=TRUE gives a nice simple legend for the bars in each category. We recommend always using this argument (or some other legend).
3. ylim=c(,) gives the lower (0) and upper (1600) bounds of the response counts to be plotted.

This is an optional argument, but since the second category has more entries than the first category, the bars were going off the y-axis that was plotted by default. You can omit ylim=c(,) to see the default plot.

From the side-by-side bar chart, do you feel that an association exists between sleep difficulties and number of hours worked among British civil service workers? That is, do you feel the proportion of British civil service workers with sleep difficulties is different between those who work 35 to 40 hours per week and those who work more than 40 hours per week?

It looks like there really isn't a difference in proportions between the two groups.
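Because the two groups have different sample sizes, the bars of counts are hard to compare directly. One option, sketched here with the counts typed in by hand, is to plot the column proportions instead. The name demo_table and the column labels are assumptions.

```r
# Stand-in for sleep_table, typed in from the problem statement
demo_table <- as.table(matrix(
  c(888, 64, 1396, 101), nrow = 2,
  dimnames = list(sleep.difficulty    = c("0", "1"),
                  hours.work.per.week = c("35-40 Hours", "Over 40 Hours"))
))

# Proportions within each hours-worked group (each column sums to 1)
col_props <- prop.table(demo_table, margin = 2)

barplot(col_props,
        beside = TRUE,
        legend.text = TRUE,
        ylim = c(0, 1),
        ylab = "Proportion within group")
```

With proportions on the y-axis, bars of the same colour that are nearly the same height in both groups suggest little or no association.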

Step 6: Determine the p-value

Two methods of obtaining the p-value will be shown.

Method 1: Bootstrap/Randomization method using the twoprop macro

Step 1: The WORKANDSLEEP data set should be imported into R.
Step 2: Open the twoprop macro in R, highlight and run the full script. This will make the macro available for use.
Step 3: In order to run the twoprop macro, your data set MUST have:
- A column of response values consisting of 0’s and 1’s (for both groups)
- A column for the explanatory variable (i.e. the “name” of the category each case belongs to) with exactly 2 levels.

Here are the arguments for the twoprop macro:

response the response variable (0 / 1) for both groups
groups the categorical variable containing the grouping names (only 2 groups allowed)
iterations the number of randomizations you want the macro to generate (at least 2000 is suggested)
ci_level a decimal value from 0 to 1 for the level of confidence for the confidence interval
Alt_Hyp a value of 1, 2 or 3 determining the alternative hypothesis as:
- 1: less than alternative
- 2: greater than alternative
- 3: two-sided

NOTE: When using this macro, Group 1 and Group 2 are assigned alphabetically from the grouping variable, not necessarily as they appear in the data set. For example, if the two groups of a grouping variable sex are “Male” and “Female”, then this macro will assign “Female” as group 1 and “Male” as group 2. This is important in determining the alternative hypothesis.
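The alphabetical ordering is the same one base R uses for factor levels, so it can be previewed before running the macro. A small sketch with the sex example from the note (the vector sex here is made up for illustration):

```r
# "Male" appears first in the data, but levels are sorted alphabetically
sex <- c("Male", "Female", "Female", "Male")
group_order <- levels(factor(sex))
group_order   # first entry corresponds to group 1, second to group 2

# With the imported data, the same check would be:
# levels(factor(WORKANDSLEEP$hours.work.per.week))
```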

Fill in the blanks with the correct arguments for this example:

response: sleep.difficulty
groups: hours.work.per.week
iterations: 2000
ci_level: 0.95
Alt_Hyp: 3
Step 4: Run the macro by typing in the following with the correct arguments:

sleepBoot<-twoprop(response=WORKANDSLEEP$sleep.difficulty,
                   groups=WORKANDSLEEP$hours.work.per.week,
                   iterations=2000,
                   ci_level=0.95, 
                   Alt_Hyp=3)


The output from the macro will be displayed in the console of R.

Step 5: Interpret the output:

Here is what will be displayed in the output

The standard deviation of the difference in sample proportions from the specified number of randomizations.
The distribution of bootstrapped/randomized sample differences (number determined by iterations).
The sample proportions for each group.
A group 1 indicator to confirm which group the macro calls group 1.
A confidence interval (at the level of confidence given to the macro) with bounds determined using the percentile method AND the formula method.
The p-value from the hypothesis test.
The alternative hypothesis (Alt_Hyp) chosen.
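To give a feel for where a randomization p-value comes from, here is a minimal sketch of a two-sided randomization test for a difference in proportions. It is not the twoprop macro's code, and the macro may differ in its details. The data are rebuilt from the counts in the problem statement, and the group labels "g1" and "g2" and the seed are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Responses (1 = sleep difficulty) and group labels rebuilt from the counts
response <- c(rep(c(1, 0), times = c(64, 888)),     # 35 to 40 hours
              rep(c(1, 0), times = c(101, 1396)))   # over 40 hours
groups   <- rep(c("g1", "g2"), times = c(952, 1497))

# Observed difference in sample proportions (group 1 minus group 2)
obs_diff <- mean(response[groups == "g1"]) - mean(response[groups == "g2"])

# Shuffle the group labels many times, as if hours worked were unrelated
# to sleep difficulty, and recompute the difference each time
iterations <- 2000
rand_diffs <- replicate(iterations, {
  shuffled <- sample(groups)
  mean(response[shuffled == "g1"]) - mean(response[shuffled == "g2"])
})

# Two-sided p-value: share of shuffled differences at least as extreme as
# the observed one (small tolerance guards against floating-point ties)
p_value <- mean(abs(rand_diffs) >= abs(obs_diff) - 1e-12)

sd(rand_diffs)   # spread of the randomization distribution
p_value
```

A histogram of rand_diffs, for example hist(rand_diffs), shows the randomization distribution that the p-value is computed from.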

11. Run the macro.

a) What type of study was performed? Therefore, what code will you use for the first prompted question when running the macro?

Two proportion study

b) Report the p-value and confidence interval for the difference in proportions:

p-value:

sleepBoot$pval

## [1] 1

confidence interval:

sleepBoot$Confidence_Intervals

##    CI_Percent  CI_Formula
## 1 -0.01938541 -0.02054035
## 2  0.02013832  0.02005760
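For comparison, the usual large-sample (Wald) interval for a difference in proportions can be computed directly from the counts. This is a sketch of the textbook formula, not necessarily the exact calculation behind CI_Formula; the macro may, for example, use the standard deviation from the randomizations, so small differences from its output are expected.

```r
# Sample proportions from the counts in the problem statement
n1 <- 952;  p1 <- 64 / n1     # 35 to 40 hours
n2 <- 1497; p2 <- 101 / n2    # over 40 hours

# Standard error of the difference in sample proportions
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Critical value for a 95% confidence level
ci_level <- 0.95
z_star   <- qnorm(1 - (1 - ci_level) / 2)

# Estimate plus or minus the margin of error
wald_ci <- (p1 - p2) + c(-1, 1) * z_star * se_diff
wald_ci
```

The interval should land near the macro's bounds and, like them, contain 0, which agrees with the large p-value.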

Method 2: Fisher’s Exact Test

An exact p-value can be calculated for a two-proportion hypothesis test using Fisher’s Exact Test. Here are the commands in R:

NOTE: You must have tabled your data into a 2x2 table as we have done above.

fisher.test(sleep_table,
             alternative="two.sided",
             conf.level=0.95)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  sleep_table
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7180622 1.4116841
## sample estimates:
## odds ratio 
##   1.003849

NOTE: The default alternative is “two.sided” and the default confidence level is 95%, so these arguments could be omitted and you could just run fisher.test(sleep_table) for the same output.

The p-value from Fisher’s Exact Test appears near the top of the output. Note that the output from this test is displayed in terms of the odds ratio. The odds ratio is not taught in this class, but it is enough to know that an odds ratio of 1 means the proportions in the two groups are equal. The null hypothesis of an odds ratio of 1 is therefore the same as the null hypothesis of equal proportions, so this function in R is performing a test of equal proportions.
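The link between the odds ratio and the two proportions can be seen with a short sketch that uses the counts from the problem statement. The names demo_table and sample_or are assumptions for illustration. The simple ratio computed here will be close to, but not exactly, the estimate printed by fisher.test(), which uses a slightly different (conditional) estimate.

```r
# Odds of sleep difficulty in each group: (number with) / (number without)
odds_short <- 64 / (952 - 64)      # 35 to 40 hours
odds_long  <- 101 / (1497 - 101)   # over 40 hours

# Sample odds ratio; a value near 1 means the two proportions are nearly equal
sample_or <- odds_long / odds_short
sample_or

# fisher.test() also accepts a 2x2 matrix of counts typed in by hand
demo_table <- matrix(c(888, 64, 1396, 101), nrow = 2,
                     dimnames = list(sleep.difficulty    = c("0", "1"),
                                     hours.work.per.week = c("35-40 Hours",
                                                             "Over 40 Hours")))
fisher_out <- fisher.test(demo_table)
fisher_out$p.value
```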

12. Report the p-value from Fisher’s Exact Test.

p-value

Approximately 1

Note: there is no confidence interval for the difference in population proportions using the Fisher’s Exact Test (only odds ratio). If a confidence interval is desired, use the twoprop macro.

Step 7: Answer the question of interest

13. In the context of the problem, answer the question of interest based on the p-value from the most appropriate method.

There is no evidence to suggest a difference in the proportion with sleep difficulty between British civil service workers who work 35 to 40 hours per week and those who work more than 40 hours per week.

Example 2: Green Tea and Cancer

A preliminary study suggests a benefit from green tea for those at risk of prostate cancer. The study involved 60 men with high grade PIN (prostate intraepithelial neoplasia) lesions, some of which turn into prostate cancer. Half the men were randomly assigned to receive 600 mg a day of a green tea extract while the other half were given a placebo. The study was double-blinded and the results after one year are given in the GREEN TEA data set in Canvas.

In the data set, the first column is whether or not a man developed prostate cancer after one year on the study (coded as 0 = “developed cancer” and 1 = “no cancer”). The second column contains which treatment group each man was assigned to (coded as “green tea” or “placebo”).