Introduction

Overview: In this lab exercise, you will explore associations between a binary variable and a continuous variable.

Objectives: At the end of this lab you will be able to:

Part 0: Download and organize files

Part 1: ANOVA

Load and print the prepared dataset from nhanes.RData. Recall labs 1 and 2 on how to load .RData file and print.

# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
##       id race  ethnicity sex     age familySize urban region
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct> 
##  1     1 black not hisp… fema…    56          1 metr… midwe…
##  2     2 white not hisp… fema…    73          1 other west  
##  3     3 white not hisp… fema…    25          2 metr… south 
##  4     4 white mexican-… fema…    53          2 other south 
##  5     5 white mexican-… fema…    68          2 other south 
##  6     6 white not hisp… fema…    44          3 other west  
##  7     7 black not hisp… fema…    28          2 metr… south 
##  8     8 white not hisp… male     74          2 other midwe…
##  9     9 white not hisp… fema…    65          1 other north…
## 10    10 white other hi… fema…    61          3 metr… west  
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <fct>, healthStatus <ord>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <fct>,
## #   smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …

We will investigate whether or not there are differences in weight (weightKg) or height (heightCm) by region of the country.

Note that there are two different sets of variables that measure height and weight. The variables weightKg and heightCm are measurements taken by researchers. The variables heightInSelf and weightLbSelf are self-reported height and weight, where participants reported their own height and weight.

Make sure you use the measured weight and height variables, weightKg and heightCm.

In the code chunk below, obtain summary statistics for each variable, heightCm, weightKg, and region by appropriately using the favstats() or tally(..., format = "proportion") functions. See labs 2 and 4 for help.

# Enter code here
favstats(~ heightCm, data = nhanes)
##    min    Q1 median    Q3   max     mean       sd  n
##  139.4 157.2  164.5 171.9 184.1 164.4441 9.953298 93
##  missing
##        7
favstats(~ weightKg, data = nhanes)
##   min    Q1 median    Q3    max     mean       sd  n
##  35.8 63.85  72.65 82.15 137.75 73.43237 17.40407 93
##  missing
##        7
tally(~ region, data = nhanes, format = "proportion")
## region
## northeast   midwest     south      west 
##      0.14      0.20      0.44      0.22

STOP! Answer Question 1 now.

We will start by looking at graphs to explore the relationships between weight and region, and between height and region. Here, we will create two sets of side-by-side boxplots.

In the code chunk below, create side-by-side boxplots of weightKg separated by region with no missing values of weightKg and/or region. See lab 7 for help.

# Enter code here
bwplot(weightKg ~ region, data = na.omit(nhanes))

In the code chunk below, create side-by-side boxplot of heightCm separated by region with no missing values of heightCm and/or region. See lab 7 for help.

# Enter code here
bwplot(heightCm ~ region, data = na.omit(nhanes))

STOP! Answer Question 2 now.

The next logical step is to obtain summary statistics for weight and height in each region.

In the code chunk below, use favstats() with the formulas weightKg ~ region and heightCm ~ region to obtain separate summary statistics for weight and height in each region. See lab 7 for help.

# Enter code here
favstats(weightKg ~ region, data = nhanes)
##      region   min      Q1 median      Q3    max     mean
## 1 northeast 54.35 62.4000  67.25 80.4500 118.25 74.06538
## 2   midwest 35.80 64.6500  72.70 79.8550 108.55 72.79737
## 3     south 44.40 70.2375  74.79 83.7125 137.75 77.31075
## 4      west 47.73 57.8500  65.35 72.8500  88.30 66.22762
##         sd  n missing
## 1 19.14924 13       1
## 2 18.40526 19       1
## 3 18.29426 40       4
## 4 11.42787 21       1
favstats(heightCm ~ region, data = nhanes)
##      region   min     Q1 median      Q3   max     mean
## 1 northeast 150.2 156.70 157.70 164.300 178.0 162.0923
## 2   midwest 142.7 165.95 171.10 174.100 181.3 168.3684
## 3     south 147.4 159.25 166.35 169.975 184.1 165.6525
## 4      west 139.4 152.20 161.00 169.400 179.6 160.0476
##          sd  n missing
## 1  8.588506 13       1
## 2  9.540618 19       1
## 3  9.160786 40       4
## 4 11.155699 21       1

We’ll start by testing the relationship between weightKg and region. Since the categorical variable, region, has more than two levels, we must use ANOVA.

STOP! Answer Question 3 now.

Much like how we run Chi-squared tests on contingency tables created by the tally() function, we run ANOVA tests on (linear) models created by the lm() funtion.

For example,

# Not evaluated
model <- lm(income ~ race, data = mydata)
summary(model)
anova(model)

creates a linear model of the continuous outcome income (i.e., the y-variable) depending on race (i.e., the x-variable or predictor); summarizes the linear model created; and then, since race is categorical, performs ANOVA on the linear model.

In the code chunk below, create and summarize a linear model of weightKg depending on region, and then perform ANOVA on the linear model.

# Enter code here
modelWeight <- lm(weightKg ~ region, data = nhanes)
summary(modelWeight)
## 
## Call:
## lm(formula = weightKg ~ region, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.997  -8.947  -1.728   6.622  60.439 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     74.065      4.755  15.576   <2e-16 ***
## regionmidwest   -1.268      6.171  -0.205    0.838    
## regionsouth      3.245      5.474   0.593    0.555    
## regionwest      -7.838      6.051  -1.295    0.199    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.15 on 89 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.06117,    Adjusted R-squared:  0.02952 
## F-statistic: 1.933 on 3 and 89 DF,  p-value: 0.13
anova(modelWeight)
## Analysis of Variance Table
## 
## Response: weightKg
##           Df  Sum Sq Mean Sq F value Pr(>F)
## region     3  1704.6  568.21  1.9329   0.13
## Residuals 89 26162.3  293.96

Look at the summary of the linear model, particularly under the Coefficients heading, where the first column (other than the row names) gives estimates. The first estimate, labeled (Intercept), matches the sample mean of weightKg in the first region (northeast) we saw from using favstats().

STOP! Answer Questions 4–6 now.

Now carry out the same test for a difference in height (heightCm) among the regions. In other words, in the code chunk below, create and summarize a linear model of the continuous outcome heightCm depending on region, and then perform ANOVA on the linear model.

# Enter code here
modelHeight <- lm(heightCm ~ region, data = nhanes)
summary(modelHeight)
## 
## Call:
## lm(formula = heightCm ~ region, data = nhanes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.6684  -6.5525   0.9524   6.5316  19.5524 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    162.092      2.676  60.572   <2e-16 ***
## regionmidwest    6.276      3.473   1.807   0.0741 .  
## regionsouth      3.560      3.080   1.156   0.2509    
## regionwest      -2.045      3.405  -0.600   0.5497    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.649 on 89 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.09094,    Adjusted R-squared:  0.06029 
## F-statistic: 2.968 on 3 and 89 DF,  p-value: 0.0362
anova(modelHeight)
## Analysis of Variance Table
## 
## Response: heightCm
##           Df Sum Sq Mean Sq F value Pr(>F)  
## region     3  828.8 276.276  2.9677 0.0362 *
## Residuals 89 8285.4  93.095                 
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

STOP! Answer Questions 7–8 now.

Part 2: Multiple Comparisons After ANOVA

Since the ANOVA test was significant, it is appropriate to move forward and see where the differences in height between the regions lie. The incorrect approach is to just do two-sample t-tests comparing each pair of means. The correct approach is to use Tukey’s method, also referred to as the Tukey-Kramer method or Tukey’s HSD (honest significant difference) test.

Like ANOVA, Tukey’s method is applied to a (linear) model.

For example,

# Not evaluated
model <- lm(weightKg ~ region, data = nhanes)
TukeyHSD(model)

would carry out Tukey’s method on the mean weights (weightKg) by region. Thus, the Tukey’s method would compare the mean weights (weightKg) among all pairs of regions.

In the code chunk below, use Tukey’s method to compare the mean heights (heightCm) among all pairs of regions.

# Enter code here
modelHeight <- lm(heightCm ~ region, data = nhanes)
TukeyHSD(modelHeight)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = x)
## 
## $region
##                        diff        lwr        upr     p adj
## midwest-northeast  6.276113  -2.816716 15.3689428 0.2768117
## south-northeast    3.560192  -4.504890 11.6252744 0.6561083
## west-northeast    -2.044689 -10.959880  6.8705026 0.9316452
## south-midwest     -2.715921  -9.754609  4.3227674 0.7438540
## west-midwest      -8.320802 -16.319441 -0.3221633 0.0382209
## west-south        -5.604881 -12.412540  1.2027782 0.1438459

The adjusted p-values (p adj) in the R output generated, should be compared to whatever alpha level you are using (e.g., 0.05 in this case). They have already been adjusted for multiple comparisons.

STOP! Answer Questions 9–10 now.

Part 3: Plotting comparisons of means

Often, researchers will plot sample means and error bars showing the standard errors. However, these types of plots are not entirely reliable when deciding if there are differences between means.

Instead, you will create a plot to show the comparisons between different means.

We can use the mplot() command to automatically construct a plot of estimated differences between means and confidence intervals around those estimates.

For example,

# Not evaluated
model <- lm(weightKg ~ region, data = nhanes)
mplot(TukeyHSD(model, conf.level = 0.80))

visualizes the results of Tukey’s test for comparing mean weights (weightKg) by region at the 80% confidence level (i.e., conf.level = 0.80) by plotting estimated differences in means for weight and adjusted 80% confidence intervals for those means.

In the code chunk below, plot the results of Tukey’s test for comparing mean heights (heightCm) by region with 95% confidence intervals.

# Enter code here
modelWeight <- lm(weightKg ~ region, data = nhanes)
mplot(TukeyHSD(modelWeight, conf.level = 0.95))

STOP! Answer Question 11 now.

Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.