Overview: In this lab exercise, you will explore associations between a binary variable and a continuous variable.
Objectives: At the end of this lab you will be able to:
Create a subdirectory named Lab 8 in the
PUBHBIO 2210 Labs directory you created in your OneDrive
folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-08-anova-blank.htmllab-08-anova-blank.Rmdlab-08-anova-worksheet-blank.docxnhanes.RDataIf you have not downloaded all of these files, do so now.
Save the four downloaded files in the
PUBHBIO 2210 Labs/Lab 8 directory (i.e., save the
downloaded files in the Lab 8 directory or folder created).
When working on labs, it is important to keep all related files in the
same directory.
Change the author and date information in the lab header.
Load and print the prepared dataset from nhanes.RData.
Recall labs 1 and 2 on how to load .RData file and
print.
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct>
## 1 1 black not hisp… fema… 56 1 metr… midwe…
## 2 2 white not hisp… fema… 73 1 other west
## 3 3 white not hisp… fema… 25 2 metr… south
## 4 4 white mexican-… fema… 53 2 other south
## 5 5 white mexican-… fema… 68 2 other south
## 6 6 white not hisp… fema… 44 3 other west
## 7 7 black not hisp… fema… 28 2 metr… south
## 8 8 white not hisp… male 74 2 other midwe…
## 9 9 white not hisp… fema… 65 1 other north…
## 10 10 white other hi… fema… 61 3 metr… west
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## # maritalStatus <fct>, healthStatus <ord>,
## # heightInSelf <int>, weightLbSelf <int>, beer <int>,
## # wine <int>, liquor <int>, everSmoke <fct>,
## # smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## # weightKg <dbl>, heightCm <dbl>, waist <dbl>, …
We will investigate whether or not there are differences in weight
(weightKg) or height (heightCm) by region of
the country.
Note that there are two different sets of variables that measure
height and weight. The variables weightKg and
heightCm are measurements taken by researchers. The
variables heightInSelf and weightLbSelf are
self-reported height and weight, where participants reported their own
height and weight.
Make sure you use the measured weight and height variables,
weightKg and heightCm.
In the code chunk below, obtain summary statistics for each variable,
heightCm, weightKg, and region by
appropriately using the favstats() or
tally(..., format = "proportion") functions. See labs 2 and
4 for help.
# Enter code here
favstats(~ heightCm, data = nhanes)
## min Q1 median Q3 max mean sd n
## 139.4 157.2 164.5 171.9 184.1 164.4441 9.953298 93
## missing
## 7
favstats(~ weightKg, data = nhanes)
## min Q1 median Q3 max mean sd n
## 35.8 63.85 72.65 82.15 137.75 73.43237 17.40407 93
## missing
## 7
tally(~ region, data = nhanes, format = "proportion")
## region
## northeast midwest south west
## 0.14 0.20 0.44 0.22
We will start by looking at graphs to explore the relationships between weight and region, and between height and region. Here, we will create two sets of side-by-side boxplots.
In the code chunk below, create side-by-side boxplots of
weightKg separated by region with no missing
values of weightKg and/or region. See lab 7
for help.
# Enter code here
bwplot(weightKg ~ region, data = na.omit(nhanes))
In the code chunk below, create side-by-side boxplot of
heightCm separated by region with no missing
values of heightCm and/or region. See lab 7
for help.
# Enter code here
bwplot(heightCm ~ region, data = na.omit(nhanes))
The next logical step is to obtain summary statistics for weight and height in each region.
In the code chunk below, use favstats() with the
formulas weightKg ~ region and
heightCm ~ region to obtain separate summary statistics for
weight and height in each region. See lab 7 for help.
# Enter code here
favstats(weightKg ~ region, data = nhanes)
## region min Q1 median Q3 max mean
## 1 northeast 54.35 62.4000 67.25 80.4500 118.25 74.06538
## 2 midwest 35.80 64.6500 72.70 79.8550 108.55 72.79737
## 3 south 44.40 70.2375 74.79 83.7125 137.75 77.31075
## 4 west 47.73 57.8500 65.35 72.8500 88.30 66.22762
## sd n missing
## 1 19.14924 13 1
## 2 18.40526 19 1
## 3 18.29426 40 4
## 4 11.42787 21 1
favstats(heightCm ~ region, data = nhanes)
## region min Q1 median Q3 max mean
## 1 northeast 150.2 156.70 157.70 164.300 178.0 162.0923
## 2 midwest 142.7 165.95 171.10 174.100 181.3 168.3684
## 3 south 147.4 159.25 166.35 169.975 184.1 165.6525
## 4 west 139.4 152.20 161.00 169.400 179.6 160.0476
## sd n missing
## 1 8.588506 13 1
## 2 9.540618 19 1
## 3 9.160786 40 4
## 4 11.155699 21 1
We’ll start by testing the relationship between weightKg
and region. Since the categorical variable,
region, has more than two levels, we must use ANOVA.
Much like how we run Chi-squared tests on contingency
tables created by the tally() function, we
run ANOVA tests on (linear) models created by the
lm() funtion.
For example,
# Not evaluated
model <- lm(income ~ race, data = mydata)
summary(model)
anova(model)
creates a linear model of the continuous outcome income
(i.e., the y-variable) depending on race (i.e., the
x-variable or predictor); summarizes the linear model created; and then,
since race is categorical, performs ANOVA on the linear
model.
In the code chunk below, create and summarize a linear model of
weightKg depending on region, and then perform
ANOVA on the linear model.
# Enter code here
modelWeight <- lm(weightKg ~ region, data = nhanes)
summary(modelWeight)
##
## Call:
## lm(formula = weightKg ~ region, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.997 -8.947 -1.728 6.622 60.439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.065 4.755 15.576 <2e-16 ***
## regionmidwest -1.268 6.171 -0.205 0.838
## regionsouth 3.245 5.474 0.593 0.555
## regionwest -7.838 6.051 -1.295 0.199
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.15 on 89 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.06117, Adjusted R-squared: 0.02952
## F-statistic: 1.933 on 3 and 89 DF, p-value: 0.13
anova(modelWeight)
## Analysis of Variance Table
##
## Response: weightKg
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 1704.6 568.21 1.9329 0.13
## Residuals 89 26162.3 293.96
Look at the summary of the linear model, particularly under the
Coefficients heading, where the first column (other
than the row names) gives estimates. The first estimate, labeled
(Intercept), matches the sample mean of
weightKg in the first region (northeast) we saw from using
favstats().
Now carry out the same test for a difference in height
(heightCm) among the regions. In other words, in the code
chunk below, create and summarize a linear model of the continuous
outcome heightCm depending on region, and then
perform ANOVA on the linear model.
# Enter code here
modelHeight <- lm(heightCm ~ region, data = nhanes)
summary(modelHeight)
##
## Call:
## lm(formula = heightCm ~ region, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.6684 -6.5525 0.9524 6.5316 19.5524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.092 2.676 60.572 <2e-16 ***
## regionmidwest 6.276 3.473 1.807 0.0741 .
## regionsouth 3.560 3.080 1.156 0.2509
## regionwest -2.045 3.405 -0.600 0.5497
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.649 on 89 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.09094, Adjusted R-squared: 0.06029
## F-statistic: 2.968 on 3 and 89 DF, p-value: 0.0362
anova(modelHeight)
## Analysis of Variance Table
##
## Response: heightCm
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 828.8 276.276 2.9677 0.0362 *
## Residuals 89 8285.4 93.095
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the ANOVA test was significant, it is appropriate to move forward and see where the differences in height between the regions lie. The incorrect approach is to just do two-sample t-tests comparing each pair of means. The correct approach is to use Tukey’s method, also referred to as the Tukey-Kramer method or Tukey’s HSD (honest significant difference) test.
Like ANOVA, Tukey’s method is applied to a (linear) model.
For example,
# Not evaluated
model <- lm(weightKg ~ region, data = nhanes)
TukeyHSD(model)
would carry out Tukey’s method on the mean weights
(weightKg) by region. Thus, the Tukey’s method would
compare the mean weights (weightKg) among all pairs of
regions.
In the code chunk below, use Tukey’s method to compare the mean
heights (heightCm) among all pairs of regions.
# Enter code here
modelHeight <- lm(heightCm ~ region, data = nhanes)
TukeyHSD(modelHeight)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = x)
##
## $region
## diff lwr upr p adj
## midwest-northeast 6.276113 -2.816716 15.3689428 0.2768117
## south-northeast 3.560192 -4.504890 11.6252744 0.6561083
## west-northeast -2.044689 -10.959880 6.8705026 0.9316452
## south-midwest -2.715921 -9.754609 4.3227674 0.7438540
## west-midwest -8.320802 -16.319441 -0.3221633 0.0382209
## west-south -5.604881 -12.412540 1.2027782 0.1438459
The adjusted p-values (p adj) in the R output generated,
should be compared to whatever alpha level you are using (e.g., 0.05 in
this case). They have already been adjusted for multiple
comparisons.
Often, researchers will plot sample means and error bars showing the standard errors. However, these types of plots are not entirely reliable when deciding if there are differences between means.
Instead, you will create a plot to show the comparisons between different means.
We can use the mplot() command to automatically
construct a plot of estimated differences between means and confidence
intervals around those estimates.
For example,
# Not evaluated
model <- lm(weightKg ~ region, data = nhanes)
mplot(TukeyHSD(model, conf.level = 0.80))
visualizes the results of Tukey’s test for comparing mean weights
(weightKg) by region at the 80% confidence level (i.e.,
conf.level = 0.80) by plotting estimated differences in means for weight
and adjusted 80% confidence intervals for those means.
In the code chunk below, plot the results of Tukey’s test for
comparing mean heights (heightCm) by region with 95%
confidence intervals.
# Enter code here
modelWeight <- lm(weightKg ~ region, data = nhanes)
mplot(TukeyHSD(modelWeight, conf.level = 0.95))
Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.