Introduction

Overview: In this lab exercise, you will explore associations between categorical variables.

Objectives: At the end of this lab you will be able to:

Part 0: Download and organize files

Part 1: Creating a contingency table

Load the nhanes.RData file and print it. Recall from lab 1 (latter part) and lab 2 on how to load .RData file and print an object (i.e., dataset in R) as well.

This dataset is ready for analysis—all the variable types have been set and value labels have been assigned in Lab #1. Ensure that you have opened the right version of the dataset. Look at the variable race. Do you see the words white and black or the numbers 1 and 2?

You should see the coded values: white and black.

# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
##       id race  ethnicity sex     age familySize urban region
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct> 
##  1     1 black not hisp… fema…    56          1 metr… midwe…
##  2     2 white not hisp… fema…    73          1 other west  
##  3     3 white not hisp… fema…    25          2 metr… south 
##  4     4 white mexican-… fema…    53          2 other south 
##  5     5 white mexican-… fema…    68          2 other south 
##  6     6 white not hisp… fema…    44          3 other west  
##  7     7 black not hisp… fema…    28          2 metr… south 
##  8     8 white not hisp… male     74          2 other midwe…
##  9     9 white not hisp… fema…    65          1 other north…
## 10    10 white other hi… fema…    61          3 metr… west  
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <fct>, healthStatus <ord>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <fct>,
## #   smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …

First, we will make a contingency (frequency) table.

For example,

# Not evaluated
race.by.region.margins <- tally( ~ race | region, data = mydata,
                                   useNA = "no", margins = TRUE)
race.by.region.margins

creates and then prints a contingency (frequency) table of race by region from the dataset mydata, ignoring missing values and displaying margins (totals).

In other words, the above code chunk adds the margins to the contingency table (i.e., the final row with column sums). This is useful for presenting the table itself and computing sample proportions, but you MUST NOT pass this table directly to the Chi-squared or Fisher’s exact test. Otherwise, you will get a wrong p-value.

To obtain the correct p-value for a Chi-squared (or Fisher’s exact) test, you should ALWAYS create the contingency table ( without displaying the margins) using something like the following code:

# Not evaluated
race.by.region <- tally( ~ race | region, data = mydata, useNA = "no")

In the code chuck below, make a contingency (frequency) table of current smoking status by sex (i.e., smokeNow by sex), ignoring missing values and displaying margins (totals).

# Enter code here
smokeNow.by.sex.margins <- tally(~ smokeNow | sex, data = nhanes, useNA = "no", margins = TRUE)
smokeNow.by.sex.margins
##         sex
## smokeNow male female
##    no      33     45
##    yes      9     12
##    Total   42     57

Also, in the code chunk below, make another contingency (frequency) table of current smoking status by sex (i.e., smokeNow by sex), ignoring missing values and without displaying margins (totals).

# Enter code here
smokeNow.by.sex <- tally(~ smokeNow | sex, data = nhanes, useNA = "no")
smokeNow.by.sex
##         sex
## smokeNow male female
##      no    33     45
##      yes    9     12

STOP! Answer Questions 1–2 now.

Part 2: Tests of association

We can use the Chi-squared test on the contingency (frequency) table ( without the margins) to test for an association between two categorical variables. The Chi-squared test can be run directly on the contingency table ( without the margins), since it only needs the cell counts.

For example,

# Not evaluated
xchisq.test(race.by.region, correct=FALSE)

uses the contingency table race.by.region to test for an association between race and region in the population from which mydata is a sample.

Now, using the contingency table ( without the margins) of current smoking status by sex created, test for an association between current smoking status (smokeNow) and sex in the population from which nhanes is a sample.

# Enter code here
xchisq.test(smokeNow.by.sex, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 0.0020449, df = 1, p-value = 0.9639
## 
##    33       45   
## (33.09)  (44.91) 
## [0.00025] [0.00018]
## <-0.016> < 0.014>
##    
##     9       12   
## ( 8.91)  (12.09) 
## [0.00093] [0.00068]
## < 0.030> <-0.026>
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

STOP! Answer Questions 3–5 now.

For the remainder of this lab, please DO NOT include the margins when creating your contingency tables.

Next you will investigate whether there is an association between activity level (active) and self-reported health status (healthStatus).

Now, create a contingency table active.by.health [i.e., a contingency table of activity level (active) by self-reported health status (healthStatus)] and then, perform a Chi-squared test to test the association between activity level and self-reported health status in the population from which nhanes is a sample.

# Enter code here
active.by.health <- tally(~ active | healthStatus, data = nhanes,  useNA = "no")
active.by.health
##                 healthStatus
## active           poor fair good very good excellent
##   Less active       4    8   12         4         0
##   About the same    1    9   18        12         5
##   More active       0    4    5         8         8
xchisq.test(active.by.health, correct = FALSE)
## Warning in chisq.test(x = x, y = y, correct = correct, p =
## p, rescale.p = rescale.p, : Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 22.389, df = 8, p-value = 0.004244
## 
##     4        8       12        4        0   
## ( 1.43)  ( 6.00)  (10.00)  ( 6.86)  ( 3.71) 
## [4.629]  [0.667]  [0.400]  [1.190]  [3.714] 
## < 2.15>  < 0.82>  < 0.63>  <-1.09>  <-1.93> 
##          
##     1        9       18       12        5   
## ( 2.30)  ( 9.64)  (16.07)  (11.02)  ( 5.97) 
## [0.731]  [0.043]  [0.231]  [0.087]  [0.157] 
## <-0.86>  <-0.21>  < 0.48>  < 0.30>  <-0.40> 
##          
##     0        4        5        8        8   
## ( 1.28)  ( 5.36)  ( 8.93)  ( 6.12)  ( 3.32) 
## [1.276]  [0.344]  [1.729]  [0.576]  [6.615] 
## <-1.13>  <-0.59>  <-1.31>  < 0.76>  < 2.57> 
##          
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

Notice the warning at the top of the output: Chi-squared approximation may be incorrect. If you see this warning, you should not use the Chi-squared results because the assumption of “large enough” sample in each cell is violated. In this case, you might want to perform Fisher’s exact test as shown in the code chunk below.

# Not evaluated
fisher.test(active.by.health)

However, for large tables, R may not be able to run the test. Such is the case for this contingency table.

Recall from the lecture that another alternative solution is to “collapse” the table by combining categories. This is sometimes a better choice in the context of the data. We will do this for the healthStatus variable.

As an example, the following code:

# Not evaluated
mydata <- mydata %>% mutate(
  smokingCausesCancerCollapsed = recode_factor(
    smokingCausesCancer,
    `Strongly disagree` = "Disagree/Strongly disagree",
    `Disagree` = "Disagree/Strongly disagre",
    `Unsure` = "Unsure",
    `Agree` = "Agree/Strongly agree",
    `Strongly agree` = "Agree/Strongly agree",
    .ordered = TRUE
  )
)

uses the mutate() function to create a new variable smokingCausesCancerCollapsed by combining categories “Disagree–Strongly disagree”“, and”Agree–Strongly agree” of the variable smokingCausesCancer in the data mydata, and also adds appropriate value labels as “Disagree/Strongly disagree” and “Agree/Strongly agree”, respectively. Here, you will notice that the “Unsure” category was not changed.

Now, in the code chunk below, use the mutate() function to create a new variable healthStatusCollapsed by combining categories “poor–fair”“, and”very good–excellent”” of the variable healthStatus in the data nhanes, and also add appropriate value labels. See lab 1 for more help. Hint: as an example, you can recode “poor” and “fair” as “poor/fair”, “good” as still “good”, and “very good” and “excellent” as “very good/excellent”.

Then, using the select() function, print the new variable (healthStatusCollapsed) to make sure it worked correctly. Recall lab 1 on how to print selected variable(s).

# Enter code here
nhanes <- nhanes %>% mutate(
  healthStatusCollapsed = recode_factor(
    healthStatus,
    "poor" = "poor/fair",
    "fair" = "poor/fair",
    "good" = "good",
    "very good" = "very good/excellent",
    "excellent" = "very good/excellent",
    .ordered = TRUE
  )
)

nhanes %>% select(healthStatusCollapsed)
## # A tibble: 100 × 1
##    healthStatusCollapsed
##    <ord>                
##  1 poor/fair            
##  2 poor/fair            
##  3 good                 
##  4 poor/fair            
##  5 poor/fair            
##  6 very good/excellent  
##  7 good                 
##  8 good                 
##  9 good                 
## 10 good                 
## # ℹ 90 more rows

Now, perform a Chi-squared test for an association between healthStatusCollapsed and active.

Note that you would have to first create a contingency table say active.by.health.collapsed [i.e., a contingency table of activity level (active) by the collapsed self-reported health status (healthStatusCollapsed)] before you perform the test to investigate the association between activity level (active) and the collapsed self-reported health status (healthStatusCollapsed) in the population from which nhanes is a sample.

# Enter code here
active.by.health.collapsed <- tally(~ active | healthStatusCollapsed, data = nhanes, useNA = "no")
active.by.health.collapsed
##                 healthStatusCollapsed
## active           poor/fair good very good/excellent
##   Less active           12   12                   4
##   About the same        10   18                  17
##   More active            4    5                  16
xchisq.test(active.by.health.collapsed, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 15.179, df = 4, p-value = 0.004344
## 
##    12       12        4   
## ( 7.43)  (10.00)  (10.57) 
## [2.8e+00] [4.0e-01] [4.1e+00]
## < 1.6773> < 0.6325> <-2.0211>
##      
##    10       18       17   
## (11.94)  (16.07)  (16.99) 
## [3.1e-01] [2.3e-01] [6.1e-06]
## <-0.5611> < 0.4811> < 0.0025>
##      
##     4        5       16   
## ( 6.63)  ( 8.93)  ( 9.44) 
## [1.0e+00] [1.7e+00] [4.6e+00]
## <-1.0222> <-1.3148> < 2.1356>
##      
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

STOP! Answer Question 6 now.

The p-value is less than 0.05 so we would reject the null hypothesis (of no association).

We conclude that there is an association between health status and activity level in the population—but how can we describe this association?

There can be many conditional proportions when both variables have more than two categories. For such an analysis we can use a panel of bar charts.

For example,

# Not evaluated
gf_bar( ~ race | region, data = mydata, ylab="Frequency") %>%
      gf_theme(axis.text.x = element_text(angle=90, vjust = 0.4))

creates a panel of bar charts of race for each level of region.

Now create a panel of bar charts of active for each level of healthStatusCollapsed.

# Enter code here
gf_bar(~ active | healthStatusCollapsed, data = nhanes, ylab = "Frequency", ) %>%
  gf_theme(axis.text.x = element_text(angle = 90, hjust = 0.4))

It’s hard to summarize the results concisely—there are so many conditional proportions! But from looking at the bar charts we can see that the proportion of “more active” subjects in the “very good/excellent” health group is larger than in the “poor/fair” and “good” groups. A potential summary of these data would be that:

There was a significant association between activity level and self-reported health status (p = 0.004344), with people reporting better health status also being more active.

Now perform an analysis to investigate the association between activity level and current smoking status (active by smokeNow).

First create and print a contingency table say active.by.smoke [i.e., a contingency table of activity level (active) by the current smoking status (smokeNow)], then use it to perform a Chi-squared test. In addition, produce a panel of bar charts of active for each level of smokeNow.

# Enter code here
active.by.smokeNow <- tally(~ active | smokeNow, data = nhanes,  useNA = "no")
active.by.smokeNow
##                 smokeNow
## active           no yes
##   Less active    18  10
##   About the same 40   5
##   More active    19   6
xchisq.test(active.by.smokeNow, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 6.3372, df = 2, p-value = 0.04206
## 
##    18       10   
## (22.00)  ( 6.00) 
## [0.727]  [2.667] 
## <-0.85>  < 1.63> 
##    
##    40        5   
## (35.36)  ( 9.64) 
## [0.610]  [2.235] 
## < 0.78>  <-1.50> 
##    
##    19        6   
## (19.64)  ( 5.36) 
## [0.021]  [0.077] 
## <-0.15>  < 0.28> 
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>
gf_bar(~ active | smokeNow, data = nhanes, ylab = "Frequency") %>%
  gf_theme(axis.text.x = element_text(angle = 90, hjust = 0.4))

STOP! Answer Questions 7–8 now.

Part 3: What happens if you flip the predictor and the outcome?

Now you will investigate the relationship between sex and whether a person has ever smoked (as opposed to current smoking status). Thus, use sex to predict everSmoke.

First create and print a contingency table say eversmoke.by.sex [i.e., a contingency table of past smoking status (everSmoke) by sex], then use it to perform an appropriate test to investigate the relationship between sex and whether a person has ever smoked

# Enter code here
everSmoke.by.sex <- tally(~ everSmoke | sex, data = nhanes,  useNA = "no")
everSmoke.by.sex
##          sex
## everSmoke male female
##       no    18     37
##       yes   24     20
xchisq.test(everSmoke.by.sex, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 4.7639, df = 1, p-value = 0.02906
## 
##    18       37   
## (23.33)  (31.67) 
##  [1.22]   [0.90] 
## <-1.10>  < 0.95> 
##    
##    24       20   
## (18.67)  (25.33) 
##  [1.52]   [1.12] 
## < 1.23>  <-1.06> 
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

STOP! Answer Questions 9–10 now.

Let’s see what happens when you flip the roles of the two variables.

Perform an analysis where you use everSmoke to predict sex.

First create and print a contingency table say sex.by.eversmoke [i.e., a contingency table of sex by past smoking status (everSmoke)] by switching the positions of sex and everSmoke in your previous code for the contingency table eversmoke.by.sex. Then, using the contingency table sex.by.eversmoke (not eversmoke.by.sex) perform an appropriate test to investigate the relationship between sex and whether a person has ever smoked

# Enter code here
sex.by.everSmoke <- tally(~ sex | everSmoke, data = nhanes,  useNA = "no")
sex.by.everSmoke
##         everSmoke
## sex      no yes
##   male   18  24
##   female 37  20
xchisq.test(sex.by.everSmoke, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 4.7639, df = 1, p-value = 0.02906
## 
##    18       24   
## (23.33)  (18.67) 
##  [1.22]   [1.52] 
## <-1.10>  < 1.23> 
##    
##    37       20   
## (31.67)  (25.33) 
##  [0.90]   [1.12] 
## < 0.95>  <-1.06> 
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

STOP! Answer Questions 11-12 now.

Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.