KITADA

Lab Activity #4

Inference for two categorical variables, each with two categories: The Two-Proportion Methods

Objectives:

Part I: Examples

Example 1: Sleep and Work example

Do people working long hours have difficulty falling asleep? This and other questions were investigated in a recent study. Two independent samples of British civil service workers were used in the study: those who work 35 to 40 hours per week and those who work more than 40 hours per week. Of the 952 workers working between 35 and 40 hours per week, 64 said they had difficulty falling asleep (defined as having difficulty falling asleep at least 3 times per week). Of the 1497 workers who worked more than 40 hours per week, 101 said they had difficulty falling asleep. The data can be found in the WORKANDSLEEP data set in Canvas. Import this data set into R.

Coding:

Step 1: Identify the variable of interest and the population

1. Another name for the variable of interest is the response variable. What is the response variable? Is it categorical or quantitative? If categorical, what are the categories?

Response: Sleep difficulty or no sleep difficulty

Categorical

2. What is the explanatory variable? Is it categorical or quantitative? If categorical, what are the categories?

Explanatory: 35-40 Hours Worked or Over 40 Hours Worked

3. What are the populations?

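British civil service workers who work 35 to 40 hours per week, and British civil service workers who work more than 40 hours per week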

Step 2: Assess if the samples are representative of the populations

4. Those initially recruited for the study were from 20 London-based civil service departments. Do you feel that any relationship between sleep patterns and working hours found in the study would be different for

a. other workers in London?

Yes

b. other workers in Britain?

Maybe

c. other workers worldwide?

No

Step 3: Determining if the problem is an estimation only problem or hypothesis test problem

In this particular problem, we’re investigating the relationship between number of hours worked and sleep difficulties. Here are two specific questions about the relationship between these two variables. For each, state whether the question involves a hypothesis test or is an estimation-only question.

5. Is there a difference in the proportion of workers with sleep difficulties between those working 35 to 40 hours a week and those working more than 40 hours a week?

Hypothesis testing

6. What is the difference in the proportion of workers with sleep difficulties between those working 35 to 40 hours a week and those working more than 40 hours a week?

Estimation

Step 4: State the null and alternative hypotheses

Let’s investigate the question in #5 above.

7. State the null and alternative hypotheses in words and in statistical notation. Define any parameters used in the notation.

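\( H_0: p_{\text{35-40 Hours}} = p_{\text{Over 40 Hours}} \)

\( H_a: p_{\text{35-40 Hours}} \neq p_{\text{Over 40 Hours}} \)

Here \( p_{\text{35-40 Hours}} \) and \( p_{\text{Over 40 Hours}} \) are the population proportions of British civil service workers with sleep difficulties among those who work 35 to 40 hours per week and those who work more than 40 hours per week, respectively. In words, the null hypothesis says the two proportions are equal; the alternative says they differ.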

Step 5: Explore the sample data

8. By hand, present the summary information in a table of counts. In the table, let the rows be the categories of the response variable and the columns the categories of the explanatory variable. Put row and column totals in the margin of the table.

SEE R CODE
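
One way to check the by-hand table is to rebuild it in R from the counts given in the study description. This is only a sketch: the category labels "yes"/"no" and "35-40"/"over 40" are assumed for illustration and may not match the labels used in the WORKANDSLEEP data set.

```r
# Counts taken from the study description in Example 1.
sleep_counts <- as.table(matrix(
  c(64, 101,      # workers reporting difficulty falling asleep
    888, 1396),   # workers reporting no difficulty (952 - 64 and 1497 - 101)
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sleep.difficulty    = c("yes", "no"),          # assumed labels
    hours.work.per.week = c("35-40", "over 40")    # assumed labels
  )
))

# addmargins() appends a "Sum" row and a "Sum" column holding the totals.
counts_with_totals <- addmargins(sleep_counts)
print(counts_with_totals)
```

The "Sum" column gives the row totals and the "Sum" row gives the column totals, which should match the sample sizes of 952 and 1497.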

9. Use R to construct the table of counts with row percents and column percents

sleep_table<-with(WORKANDSLEEP, 
                  table(sleep.difficulty,hours.work.per.week))
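
The command above builds only the table of counts. A possible way to get the row and column percents is prop.table(), sketched here on counts rebuilt from the study description (the category labels are assumed, not taken from the data set).

```r
# Counts from the study description; labels are assumptions for illustration.
sleep_counts <- as.table(matrix(
  c(64, 101,
    888, 1396),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sleep.difficulty    = c("yes", "no"),
    hours.work.per.week = c("35-40", "over 40")
  )
))

# margin = 1 divides each cell by its row total (row percents).
row_percents <- 100 * prop.table(sleep_counts, margin = 1)

# margin = 2 divides each cell by its column total (column percents).
col_percents <- 100 * prop.table(sleep_counts, margin = 2)

print(round(row_percents, 1))
print(round(col_percents, 1))
```

With the lab's own table, the same calls would be prop.table(sleep_table, margin = 1) and prop.table(sleep_table, margin = 2). Because the explanatory variable is in the columns, the column percents are the ones that compare sleep difficulty between the two groups.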

Important note: never copy and paste a table of counts into an Assignment. Rather, use the table of counts from the output and reconstruct the table on your own in your assignment!

10. A particular graph used to compare the relationship between two categorical variables is the side-by-side bar chart. Follow these commands to construct the side-by-side bar chart in R.

barplot(sleep_table, 
        beside=TRUE,
        legend.text=TRUE,
        ylim=c(0, 1600))

NOTES ON ARGUMENTS:

ylim=c(0, 1600): This is an optional argument. Because the second category has more entries than the first, the bars ran off the top of the y-axis that R plotted by default. You can omit ylim=c(,) to see the default plot.

From the side-by-side bar chart, do you feel that an association exists between sleep difficulties and number of hours worked among British civil service workers? That is, do you feel the proportion of British civil service workers with sleep difficulties is different between those who work 35 to 40 hours per week and those who work more than 40 hours per week?

It looks like there really isn't a difference in proportions between the two groups.
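
Because the two groups have different sample sizes, bar heights based on counts can be hard to compare by eye. As a quick check, the sample proportions can be computed directly from the counts in the study description (the group labels below are assumed for illustration).

```r
# Number of workers reporting sleep difficulty, and sample size, per group.
difficulty <- c("35-40" = 64, "over 40" = 101)
workers    <- c("35-40" = 952, "over 40" = 1497)

# Sample proportion with sleep difficulty in each group.
p_hat <- difficulty / workers

# Observed difference in sample proportions (35-40 minus over 40).
diff_hat <- p_hat[["35-40"]] - p_hat[["over 40"]]

print(round(p_hat, 4))
print(round(diff_hat, 4))
```

Both sample proportions are about 6.7%, which agrees with the impression from the bar chart.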

Step 6: Determine the p-value

Two methods of obtaining the p-value will be shown.

Method 1: Bootstrap/Randomization method using the twoprop macro

Here are the arguments for the twoprop macro:

NOTE: When using this macro, Group 1 and Group 2 are assigned alphabetically from the grouping variable, not necessarily in the order they appear in the data set. For example, if the two groups of a grouping variable sex are “Male” and “Female”, then this macro will assign “Female” as group 1 and “Male” as group 2. This is important when choosing the direction of the alternative hypothesis.
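
The alphabetical ordering described in the note matches how R orders factor levels by default. The sketch below shows only that default behavior in base R; it is not the macro's code, and the sex vector is a toy example.

```r
# Toy grouping variable in which "Male" appears first in the data.
sex <- c("Male", "Female", "Male", "Female")

# factor() sorts the distinct values, so the level order is alphabetical
# regardless of the order of appearance.
sex_factor <- factor(sex)
print(levels(sex_factor))

# The first level would play the role of "group 1", the second "group 2".
group_1 <- levels(sex_factor)[1]
group_2 <- levels(sex_factor)[2]
```

Running levels(factor(WORKANDSLEEP$hours.work.per.week)) before calling the macro is one way to see which group would come first alphabetically.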

Fill in the blanks with the correct arguments for this example:

sleepBoot<-twoprop(response=WORKANDSLEEP$sleep.difficulty,
                   groups=WORKANDSLEEP$hours.work.per.week,
                   iterations=2000,
                   ci_level=0.95, 
                   Alt_Hyp=3)

The output from the macro will be displayed in the console of R.

Here is what will be displayed in the output

11. Run the macro.

a) What type of study was performed? Therefore, what code will you use for the first prompted question when running the macro?

Two proportion study

b) Report the p-value and confidence interval for the difference in proportions:

p-value:

sleepBoot$pval
## [1] 1

confidence interval:

sleepBoot$Confidence_Intervals
##    CI_Percent  CI_Formula
## 1 -0.01938541 -0.02054035
## 2  0.02013832  0.02005760
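
To see the idea behind a randomization p-value, here is a minimal sketch in base R. It is not the twoprop macro's implementation, whose internals are not shown in this lab. It rebuilds one row per worker from the counts in the study description, shuffles the group labels to mimic "no association", and counts how often the shuffled difference is at least as extreme as the observed one. The group labels, the 0/1 coding, and the seed are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# One value per worker: 1 = difficulty falling asleep, 0 = no difficulty.
difficulty <- c(rep(1, 64),  rep(0, 952 - 64),     # 35 to 40 hours
                rep(1, 101), rep(0, 1497 - 101))   # more than 40 hours
group <- rep(c("35-40", "over 40"), times = c(952, 1497))

# Difference in sample proportions (35-40 minus over 40).
diff_in_props <- function(y, g) {
  sum(y[g == "35-40"]) / sum(g == "35-40") -
    sum(y[g == "over 40"]) / sum(g == "over 40")
}
observed <- diff_in_props(difficulty, group)

# Shuffle the group labels and recompute the difference each time.
iterations <- 2000
shuffled <- replicate(iterations, diff_in_props(difficulty, sample(group)))

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value <- mean(abs(shuffled) >= abs(observed))
print(observed)
print(p_value)
```

Because the observed difference is so close to zero, nearly every shuffle is at least as extreme, so the p-value from this sketch should come out at or very near 1, in line with the macro's output above.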

Method 2: Fisher’s Exact Test

An exact p-value for a two-proportion hypothesis test can be calculated using Fisher’s Exact Test. Here are the commands in R:

NOTE: You must have tabled your data into a 2x2 table as we have done above.

fisher.test(sleep_table,
             alternative="two.sided",
             conf.level=0.95)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  sleep_table
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7180622 1.4116841
## sample estimates:
## odds ratio 
##   1.003849

NOTE: The default alternative is “two.sided” and the default confidence level is 95%, so these arguments could be omitted and you could just run fisher.test(sleep_table) for the same output.

The p-value from Fisher’s Exact Test appears near the top of the output. Note that the output from this test is reported in terms of the odds ratio. Odds ratios are not taught in this class, but the key point is that a null hypothesis of an odds ratio equal to 1 is equivalent to equal proportions in the two groups. So this function in R is performing a test of equal proportions.
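
For readers curious about why an odds ratio of 1 corresponds to equal proportions, here is a small sketch using the counts from the study description. The odds in a group are p / (1 - p), so two groups have the same odds exactly when they have the same proportion.

```r
# Sample proportions with sleep difficulty in each group.
p_35_40   <- 64 / 952
p_over_40 <- 101 / 1497

# Odds of sleep difficulty in each group: p / (1 - p).
odds_35_40   <- p_35_40 / (1 - p_35_40)
odds_over_40 <- p_over_40 / (1 - p_over_40)

# Sample odds ratio (35-40 relative to over 40).
odds_ratio <- odds_35_40 / odds_over_40
print(odds_ratio)

# If the two proportions were exactly equal, the odds ratio would be 1.
equal_case <- (0.10 / (1 - 0.10)) / (0.10 / (1 - 0.10))
print(equal_case)
```

The value reported by fisher.test() need not match this simple ratio exactly: it is a conditional maximum likelihood estimate, and the direction of the ratio depends on how the rows and columns of the table are ordered. Either way, a value close to 1 goes with nearly equal sample proportions.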

12. Report the p-value from Fisher’s Exact Test.

p-value

Approximately 1

Note: Fisher’s Exact Test does not give a confidence interval for the difference in population proportions (only for the odds ratio). If a confidence interval for the difference is desired, use the twoprop macro.

Step 7: Answer the question of interest

13. In the context of the problem, answer the question of interest based on the p-value from the most appropriate method.

With a p-value of approximately 1, there is no evidence to suggest that there is a difference in the proportion with sleep difficulty between British civil service workers who work 35 to 40 hours per week and those who work more than 40 hours per week.

Example 2: Green Tea and Cancer

A preliminary study suggests a benefit from green tea for those at risk of prostate cancer. The study involved 60 men with high grade PIN (prostate intraepithelial neoplasia) lesions, some of which turn into prostate cancer. Half the men were randomly assigned to receive 600 mg a day of a green tea extract while the other half were given a placebo. The study was double-blinded and the results after one year are given in the GREEN TEA data set in Canvas.

In the data set, the first column is whether or not a man developed prostate cancer after one year on the study (coded as 0 = “developed cancer” and 1 = “no cancer”). The second column contains which treatment group each man was assigned to (coded as “green tea” or “placebo”).
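
Since the response is stored as 0/1 codes, it may help to attach the labels before tabling so the output is easier to read. The sketch below uses a short made-up vector purely for illustration (these are not the study's data), and the variable names are hypothetical.

```r
# Toy 0/1 codes for illustration only; NOT the values from the data set.
cancer_code <- c(0, 1, 1, 0, 1)

# Attach the coding described above: 0 = "developed cancer", 1 = "no cancer".
cancer_status <- factor(cancer_code,
                        levels = c(0, 1),
                        labels = c("developed cancer", "no cancer"))

print(table(cancer_status))
```

Note that with this coding, 1 stands for "no cancer", so a proportion computed from the raw 0/1 values would be the proportion of men who did not develop cancer.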