Overview: In this lab exercise, you will explore associations between categorical variables.
Objectives: At the end of this lab you will be able to:
Create a subdirectory named Lab 6
in the
PUBHBIO 2210 Labs
directory you created in your OneDrive
folder in Lab 1.
Download the five lab files from Carmen while in the RStudio server:
lab-06-categorical-blank.html
lab-06-categorical-blank.Rmd
lab-06-categorical-worksheet-blank.docx
nhanes.RData
nhanes-data-dictionary.xls
If you have not downloaded all of these files, do so now.
Save the five downloaded files in the
PUBHBIO 2210 Labs/Lab 6
directory (i.e., save the
downloaded files in the Lab 6
directory or folder created).
When working on labs, it is important to keep all related files in the
same directory.
Change the author and date information in the lab header.
Load the nhanes.RData
file and print it. Recall from lab
1 (latter part) and lab 2 on how to load .RData
file and
print an object (i.e., dataset in R) as well.
This dataset is ready for analysis—all the variable types have been
set and value labels have been assigned in Lab #1. Ensure that you have
opened the right version of the dataset. Look at the variable
race
. Do you see the words white
and
black
or the numbers 1
and 2
?
You should see the coded values: white
and
black
.
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct>
## 1 1 black not hisp… fema… 56 1 metr… midwe…
## 2 2 white not hisp… fema… 73 1 other west
## 3 3 white not hisp… fema… 25 2 metr… south
## 4 4 white mexican-… fema… 53 2 other south
## 5 5 white mexican-… fema… 68 2 other south
## 6 6 white not hisp… fema… 44 3 other west
## 7 7 black not hisp… fema… 28 2 metr… south
## 8 8 white not hisp… male 74 2 other midwe…
## 9 9 white not hisp… fema… 65 1 other north…
## 10 10 white other hi… fema… 61 3 metr… west
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## # maritalStatus <fct>, healthStatus <ord>,
## # heightInSelf <int>, weightLbSelf <int>, beer <int>,
## # wine <int>, liquor <int>, everSmoke <fct>,
## # smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## # weightKg <dbl>, heightCm <dbl>, waist <dbl>, …
First, we will make a contingency (frequency) table.
For example,
# Not evaluated
race.by.region.margins <- tally( ~ race | region, data = mydata,
useNA = "no", margins = TRUE)
race.by.region.margins
creates and then prints a contingency (frequency) table of
race
by region
from the dataset
mydata
, ignoring missing values and displaying
margins (totals).
In other words, the above code chunk adds the margins to the contingency table (i.e., the final row with column sums). This is useful for presenting the table itself and computing sample proportions, but you MUST NOT pass this table directly to the Chi-squared or Fisher’s exact test. Otherwise, you will get a wrong p-value.
To obtain the correct p-value for a Chi-squared (or Fisher’s exact) test, you should ALWAYS create the contingency table ( without displaying the margins) using something like the following code:
# Not evaluated
race.by.region <- tally( ~ race | region, data = mydata, useNA = "no")
In the code chuck below, make a contingency (frequency) table of
current smoking status by sex (i.e., smokeNow
by
sex
), ignoring missing values and displaying
margins (totals).
# Enter code here
smokeNow.by.sex.margins <- tally(~ smokeNow | sex, data = nhanes, useNA = "no", margins = TRUE)
smokeNow.by.sex.margins
## sex
## smokeNow male female
## no 33 45
## yes 9 12
## Total 42 57
Also, in the code chunk below, make another contingency (frequency)
table of current smoking status by sex (i.e., smokeNow
by
sex
), ignoring missing values and without
displaying margins (totals).
# Enter code here
smokeNow.by.sex <- tally(~ smokeNow | sex, data = nhanes, useNA = "no")
smokeNow.by.sex
## sex
## smokeNow male female
## no 33 45
## yes 9 12
We can use the Chi-squared test on the contingency (frequency) table ( without the margins) to test for an association between two categorical variables. The Chi-squared test can be run directly on the contingency table ( without the margins), since it only needs the cell counts.
For example,
# Not evaluated
xchisq.test(race.by.region, correct=FALSE)
uses the contingency table race.by.region
to test for an
association between race
and region
in the
population from which mydata
is a sample.
Now, using the contingency table ( without the
margins) of current smoking status by sex created, test
for an association between current smoking status
(smokeNow
) and sex
in the population from
which nhanes
is a sample.
# Enter code here
xchisq.test(smokeNow.by.sex, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: x
## X-squared = 0.0020449, df = 1, p-value = 0.9639
##
## 33 45
## (33.09) (44.91)
## [0.00025] [0.00018]
## <-0.016> < 0.014>
##
## 9 12
## ( 8.91) (12.09)
## [0.00093] [0.00068]
## < 0.030> <-0.026>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
For the remainder of this lab, please DO NOT include the margins when creating your contingency tables.
Next you will investigate whether there is an association between
activity level (active
) and self-reported health status
(healthStatus
).
Now, create a contingency table active.by.health
[i.e.,
a contingency table of activity level (active
) by
self-reported health status (healthStatus
)] and then,
perform a Chi-squared test to test the association between activity
level and self-reported health status in the population from which
nhanes
is a sample.
# Enter code here
active.by.health <- tally(~ active | healthStatus, data = nhanes, useNA = "no")
active.by.health
## healthStatus
## active poor fair good very good excellent
## Less active 4 8 12 4 0
## About the same 1 9 18 12 5
## More active 0 4 5 8 8
xchisq.test(active.by.health, correct = FALSE)
## Warning in chisq.test(x = x, y = y, correct = correct, p =
## p, rescale.p = rescale.p, : Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: x
## X-squared = 22.389, df = 8, p-value = 0.004244
##
## 4 8 12 4 0
## ( 1.43) ( 6.00) (10.00) ( 6.86) ( 3.71)
## [4.629] [0.667] [0.400] [1.190] [3.714]
## < 2.15> < 0.82> < 0.63> <-1.09> <-1.93>
##
## 1 9 18 12 5
## ( 2.30) ( 9.64) (16.07) (11.02) ( 5.97)
## [0.731] [0.043] [0.231] [0.087] [0.157]
## <-0.86> <-0.21> < 0.48> < 0.30> <-0.40>
##
## 0 4 5 8 8
## ( 1.28) ( 5.36) ( 8.93) ( 6.12) ( 3.32)
## [1.276] [0.344] [1.729] [0.576] [6.615]
## <-1.13> <-0.59> <-1.31> < 0.76> < 2.57>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
Notice the warning at the top of the output: Chi-squared approximation may be incorrect. If you see this warning, you should not use the Chi-squared results because the assumption of “large enough” sample in each cell is violated. In this case, you might want to perform Fisher’s exact test as shown in the code chunk below.
# Not evaluated
fisher.test(active.by.health)
However, for large tables, R may not be able to run the test. Such is the case for this contingency table.
Recall from the lecture that another alternative solution is to
“collapse” the table by combining categories. This is sometimes a better
choice in the context of the data. We will do this for the
healthStatus
variable.
As an example, the following code:
# Not evaluated
mydata <- mydata %>% mutate(
smokingCausesCancerCollapsed = recode_factor(
smokingCausesCancer,
`Strongly disagree` = "Disagree/Strongly disagree",
`Disagree` = "Disagree/Strongly disagre",
`Unsure` = "Unsure",
`Agree` = "Agree/Strongly agree",
`Strongly agree` = "Agree/Strongly agree",
.ordered = TRUE
)
)
uses the mutate()
function to create a new variable
smokingCausesCancerCollapsed
by combining categories
“Disagree–Strongly disagree”“, and”Agree–Strongly agree” of the variable
smokingCausesCancer
in the data mydata
, and
also adds appropriate value labels as “Disagree/Strongly disagree” and
“Agree/Strongly agree”, respectively. Here, you will notice that the
“Unsure” category was not changed.
Now, in the code chunk below, use the mutate()
function
to create a new variable healthStatusCollapsed
by combining
categories “poor–fair”“, and”very good–excellent”” of the variable
healthStatus
in the data nhanes
, and also add
appropriate value labels. See lab 1 for more help. Hint: as an example,
you can recode “poor” and “fair” as “poor/fair”, “good” as still “good”,
and “very good” and “excellent” as “very good/excellent”.
Then, using the select()
function, print the new
variable (healthStatusCollapsed
) to make sure it worked
correctly. Recall lab 1 on how to print selected variable(s).
# Enter code here
nhanes <- nhanes %>% mutate(
healthStatusCollapsed = recode_factor(
healthStatus,
"poor" = "poor/fair",
"fair" = "poor/fair",
"good" = "good",
"very good" = "very good/excellent",
"excellent" = "very good/excellent",
.ordered = TRUE
)
)
nhanes %>% select(healthStatusCollapsed)
## # A tibble: 100 × 1
## healthStatusCollapsed
## <ord>
## 1 poor/fair
## 2 poor/fair
## 3 good
## 4 poor/fair
## 5 poor/fair
## 6 very good/excellent
## 7 good
## 8 good
## 9 good
## 10 good
## # ℹ 90 more rows
Now, perform a Chi-squared test for an association between
healthStatusCollapsed
and active
.
Note that you would have to first create a contingency table say
active.by.health.collapsed
[i.e., a contingency table of
activity level (active
) by the collapsed self-reported
health status (healthStatusCollapsed
)] before you perform
the test to investigate the association between activity level
(active
) and the collapsed self-reported health status
(healthStatusCollapsed
) in the population from which
nhanes
is a sample.
# Enter code here
active.by.health.collapsed <- tally(~ active | healthStatusCollapsed, data = nhanes, useNA = "no")
active.by.health.collapsed
## healthStatusCollapsed
## active poor/fair good very good/excellent
## Less active 12 12 4
## About the same 10 18 17
## More active 4 5 16
xchisq.test(active.by.health.collapsed, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: x
## X-squared = 15.179, df = 4, p-value = 0.004344
##
## 12 12 4
## ( 7.43) (10.00) (10.57)
## [2.8e+00] [4.0e-01] [4.1e+00]
## < 1.6773> < 0.6325> <-2.0211>
##
## 10 18 17
## (11.94) (16.07) (16.99)
## [3.1e-01] [2.3e-01] [6.1e-06]
## <-0.5611> < 0.4811> < 0.0025>
##
## 4 5 16
## ( 6.63) ( 8.93) ( 9.44)
## [1.0e+00] [1.7e+00] [4.6e+00]
## <-1.0222> <-1.3148> < 2.1356>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
The p-value is less than 0.05 so we would reject the null hypothesis (of no association).
We conclude that there is an association between health status and activity level in the population—but how can we describe this association?
There can be many conditional proportions when both variables have more than two categories. For such an analysis we can use a panel of bar charts.
For example,
# Not evaluated
gf_bar( ~ race | region, data = mydata, ylab="Frequency") %>%
gf_theme(axis.text.x = element_text(angle=90, vjust = 0.4))
creates a panel of bar charts of race
for each level of
region
.
Now create a panel of bar charts of active
for each
level of healthStatusCollapsed
.
# Enter code here
gf_bar(~ active | healthStatusCollapsed, data = nhanes, ylab = "Frequency", ) %>%
gf_theme(axis.text.x = element_text(angle = 90, hjust = 0.4))
It’s hard to summarize the results concisely—there are so many conditional proportions! But from looking at the bar charts we can see that the proportion of “more active” subjects in the “very good/excellent” health group is larger than in the “poor/fair” and “good” groups. A potential summary of these data would be that:
There was a significant association between activity level and self-reported health status (p = 0.004344), with people reporting better health status also being more active.
Now perform an analysis to investigate the association between
activity level and current smoking status (active
by
smokeNow
).
First create and print a contingency table say
active.by.smoke
[i.e., a contingency table of activity
level (active
) by the current smoking status
(smokeNow
)], then use it to perform a Chi-squared test. In
addition, produce a panel of bar charts of active
for each
level of smokeNow
.
# Enter code here
active.by.smokeNow <- tally(~ active | smokeNow, data = nhanes, useNA = "no")
active.by.smokeNow
## smokeNow
## active no yes
## Less active 18 10
## About the same 40 5
## More active 19 6
xchisq.test(active.by.smokeNow, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: x
## X-squared = 6.3372, df = 2, p-value = 0.04206
##
## 18 10
## (22.00) ( 6.00)
## [0.727] [2.667]
## <-0.85> < 1.63>
##
## 40 5
## (35.36) ( 9.64)
## [0.610] [2.235]
## < 0.78> <-1.50>
##
## 19 6
## (19.64) ( 5.36)
## [0.021] [0.077]
## <-0.15> < 0.28>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
gf_bar(~ active | smokeNow, data = nhanes, ylab = "Frequency") %>%
gf_theme(axis.text.x = element_text(angle = 90, hjust = 0.4))
Now you will investigate the relationship between sex and whether a
person has ever smoked (as opposed to current smoking status).
Thus, use sex
to predict everSmoke
.
First create and print a contingency table say
eversmoke.by.sex
[i.e., a contingency table of past smoking
status (everSmoke
) by sex
], then use it to
perform an appropriate test to investigate the relationship between sex
and whether a person has ever smoked
# Enter code here
everSmoke.by.sex <- tally(~ everSmoke | sex, data = nhanes, useNA = "no")
everSmoke.by.sex
## sex
## everSmoke male female
## no 18 37
## yes 24 20
xchisq.test(everSmoke.by.sex, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: x
## X-squared = 4.7639, df = 1, p-value = 0.02906
##
## 18 37
## (23.33) (31.67)
## [1.22] [0.90]
## <-1.10> < 0.95>
##
## 24 20
## (18.67) (25.33)
## [1.52] [1.12]
## < 1.23> <-1.06>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
Let’s see what happens when you flip the roles of the two variables.
Perform an analysis where you use everSmoke
to predict
sex
.
First create and print a contingency table say
sex.by.eversmoke
[i.e., a contingency table of
sex
by past smoking status (everSmoke
)] by
switching the positions of sex
and everSmoke
in your previous code for the contingency table
eversmoke.by.sex
. Then, using the contingency table
sex.by.eversmoke
(not eversmoke.by.sex
)
perform an appropriate test to investigate the relationship between sex
and whether a person has ever smoked
# Enter code here
sex.by.everSmoke <- tally(~ sex | everSmoke, data = nhanes, useNA = "no")
sex.by.everSmoke
## everSmoke
## sex no yes
## male 18 24
## female 37 20
xchisq.test(sex.by.everSmoke, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: x
## X-squared = 4.7639, df = 1, p-value = 0.02906
##
## 18 24
## (23.33) (18.67)
## [1.22] [1.52]
## <-1.10> < 1.23>
##
## 37 20
## (31.67) (25.33)
## [0.90] [1.12]
## < 0.95> <-1.06>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.