Overview: In this lab exercise, you will explore associations between two continuous variables.
Objectives: At the end of this lab you will be able to:
Create a subdirectory named Lab 9 in the
PUBHBIO 2210 Labs directory you created in your OneDrive
folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-09-regression-blank.htmllab-09-regression-blank.Rmdlab-09-regression-worksheet-blank.docxnhanes.RDataIf you have not downloaded all of these files, do so now.
Save the four downloaded files in the
PUBHBIO 2210 Labs/Lab 9 directory (i.e., save the
downloaded files in the Lab 9 directory or folder created).
When working on labs, it is important to keep all related files in the
same directory.
Change the author and date information in the lab header.
Load and print the prepared dataset from nhanes.RData.
Recall labs 1 and 2 on how to load .RData file and
print.
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct>
## 1 1 black not hispanic fema… 56 1 metr… midwe…
## 2 2 white not hispanic fema… 73 1 other west
## 3 3 white not hispanic fema… 25 2 metr… south
## 4 4 white mexican-ameri… fema… 53 2 other south
## 5 5 white mexican-ameri… fema… 68 2 other south
## 6 6 white not hispanic fema… 44 3 other west
## 7 7 black not hispanic fema… 28 2 metr… south
## 8 8 white not hispanic male 74 2 other midwe…
## 9 9 white not hispanic fema… 65 1 other north…
## 10 10 white other hispanic fema… 61 3 metr… west
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## # maritalStatus <fct>, healthStatus <ord>, heightInSelf <int>,
## # weightLbSelf <int>, beer <int>, wine <int>, liquor <int>,
## # everSmoke <fct>, smokeNow <fct>, active <ord>, SBP <int>,
## # DBP <int>, weightKg <dbl>, heightCm <dbl>, waist <dbl>,
## # tricep <dbl>, thigh <dbl>, BMD <dbl>, RBC <dbl>, …
In this lab we will investigate which physiological measurements are
associated with age. The two measurements we will consider are bone
mineral density (BMD; units=gm/cm\(^2\)) and red blood cell count
(RBC; unitless)
The first step in investigating associations should always be to make a plot. To look at the relationship between two continuous variables we need a scatterplot.
We will first produce a scatterplot of BMD and
RBC by age using the gf_point()
command.
For example,
# Not evaluated
gf_point(weight ~ height, data = nhanes)
will produce a scatterplot of weight (y-variable or
dependent variable) versus height (x-variable or
independent variable).
In the code chunk below, produce the scatterplot for BMD
versus age.
# Enter code here
gf_point(BMD ~ age, data = nhanes)
## Warning: Removed 32 rows containing missing values or values outside the
## scale range (`geom_point()`).
We get a warning that 32 rows containing missing values have been
removed. In this case, that means rows with missing values of
BMD. This warning won’t hurt the plot, but we could get rid
of it by filtering out subjects with missing values of BMD
or age.
For example,
# Not evaluated
gf_point(weight ~ height,
data = mydata %>% filter(!is.na(weight), !is.na(height)))
will remove subjects with missing values of weight or
height. Thus, the above command would produce a scatterplot
for weight versus height with no warnings
(i.e., with no missing values of weight and/or
height).
In the code chucnk below, create the scatterplot for BMD
versus age with no warnings (i.e., with no missing values
of BMD and/or age).
# Enter code here
gf_point(BMD ~ age, data = nhanes %>% filter(!is.na(BMD), !is.na(age)))
In the code chunk below, produce the scatterplot for RBC
versus age.
# Enter code here
gf_point(RBC ~ age, data = nhanes)
## Warning: Removed 8 rows containing missing values or values outside the
## scale range (`geom_point()`).
We get a warning that 8 rows containing missing values have been
removed. In this case, that means rows with missing values of
RBC.
In the code chucnk below, create the scatterplot for RBC
versus age with no warnings (i.e., with no missing values
of RBC and/or age).
# Enter code here
gf_point(RBC ~ age, data = nhanes %>% filter(!is.na(RBC), !is.na(age)))
Note that if a subject is missing a value for one or both variables he/she will not be included in that scatterplot. So, even though there are 100 people in our data set, there might not be 100 points on the scatterplot.
Before we proceed, let us look at summary statistics such as means
for BMD and age. The favstats()
command provides several statistical summaries, and removes missing data
automatically.
For example,
# Not evaluated
favstats( ~ weight, data = mydata)
favstats( ~ height, data = mydata)
will remove subjects with missing values of weight or
height, then provide statistical summaries for
weight and height.
In the code chunk below, produce the statistical summaries for
BMD and age.
# Enter code here
favstats(~ BMD, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 0.358 0.8075 0.901 1.051 1.26 0.9028824 0.188168 68 32
favstats(~ age, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 17 34.75 48.5 70.25 90 51.43 21.52858 100 0
Under the heading “n” is the number of people who have age and BMD measurements. Since no patients had missing ages, and 32 patients had missing BMD measurements, we know that 68 out of the 100 people had both age and BMD. Hence, there are 68 points on the scatterplot.
Of course, just looking at the scatterplot obtained earlier is not
enough. The next step will be to calculate the correlation between say
age and BMD (bone mineral density) and to test
if the linear relationship is significant.
To estimate the correlation and to perform the test, we will use the
cor.test() function with the
method = "pearson" argument specified. This will perform
the version of the correlation test using Pearson’s method.
For example,
# Not evaluated
cor.test(weight ~ height, method = "pearson", data = mydata)
will estimate the correlation and perform a correlation test (using
Pearson’s method) to enable us determine any signficant linear
relationship between weight and height.
In the code chunk below, perform a correlation test (using Pearson’s
method) to determine if there is a significant linear relationship
between BMD and age.
# Enter code here
cor.test(BMD ~ age, method = "pearson", data = nhanes)
##
## Pearson's product-moment correlation
##
## data: BMD and age
## t = -5.4383, df = 66, p-value = 8.425e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7016418 -0.3664531
## sample estimates:
## cor
## -0.5562755
There is a lot of output displayed; the test statistic is
t = in third line of the output and the degrees of freedom,
df as well as the p-value also appears in that
line. The 95% confidence interval for the population correlation is in
the fifth line (i.e., under the
95 percent confidence interval: heading) and the estimate
of the population correlation in the last line (i.e., under the
sample estimates: heading) of the output.
In the code chunk below, using the cor.test function
with the method = "pearson" argument specified, perform a
correlation test to determine if there is a significant linear
relationship between RBC and age.
# Enter code here
cor.test(RBC ~ age, method = "pearson", data = nhanes)
##
## Pearson's product-moment correlation
##
## data: RBC and age
## t = -1.017, df = 90, p-value = 0.3119
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3047538 0.1004201
## sample estimates:
## cor
## -0.1065896
These correlation analyses tell us that BMD is significantly linearly associated with age, and that RBC is not. It would be nice to estimate how much BMD changes when age changes, and for this we use simple linear regression.
Remember that it does not matter which variable is X (i.e., independent) and which is Y (i.e., dependent) for correlation, but it does matter for regression. We will use age as the X variable and BMD as the Y variable, which is what makes more sense (instead of the reverse). You can imagine getting older (age) causing changes in BMD, but you can’t imagine changing BMD causing you to get older!
The lm() function (“linear model”) is used to perform
simple linear regression analysis.
For example,
# Not evaluated
model <- lm(weight ~ height, data = mydata)
summary(model)
will produce summary of results for regressing weight
(the dependent variable, Y) on height (the independent
variable, X).
In the code chunk below, produce the output for regressing
BMD on age.
# Enter code here
modelBmdAge <- lm(BMD ~ age, data = nhanes)
summary(modelBmdAge)
##
## Call:
## lm(formula = BMD ~ age, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40004 -0.10865 -0.00618 0.12505 0.32375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.199413 0.057777 20.759 < 2e-16 ***
## age -0.005632 0.001036 -5.438 8.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1575 on 66 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.3094, Adjusted R-squared: 0.299
## F-statistic: 29.57 on 1 and 66 DF, p-value: 8.425e-07
There is a lot of output displayed; under the
Coefficients: heading of the output, you will find the
estimates for the intercept and slope for the regression line (best fit
line) in the column labeled Estimate.
One can also add the regression line to the jittered scatterplot (without warnings) we obtained earlier.
To draw the regression line, we use the gf_lm()
command.
For example,
# Not evaluated
gf_point(weight ~ height, data = mydata) %>% gf_lm() %>%
gf_labs(x = "horizontal axis label", y = "vertical axis label")
will produce a scatterplot of weight versus
height, and then overlay the regression line for
weight regressed on height.
Note that gf_lm() automatically uses the formula and
data you used for gf_point(). The gf_labs()
function allows you to provide well defined labels for the horizontal
and vertical axes.
If you wanted to be more explicit,
gf_lm(weight ~ height, data = mydata) would give you the
same results.
In the code chunk below, produce a scatterplot of BMD
versus age, and then overlay the regression line for
BMD regressed on age. Also, use the
gf_labs() function to label the horizontal and vertical
axes with variables and units. Here, in your code, you MUST replace
horizontal axis label and vertical axis label
in the sample code above with appropriate labels for the X-axis and
Y-axis, respectively.
# Enter code here
gf_point(BMD ~ age, data = nhanes %>% filter(!is.na(BMD), !is.na(age))) %>%
gf_lm() %>%
gf_labs(x = "Age", y = "Bone Mineral Density")
Looking back at the regression output (for question 9), you will
notice that the p-value for the estimates for both the intercept and
slope for the regression line is displayed in the column labeled
Pr(>|t|) under the Coefficients: heading of
the output. Also, the value for the \(R^2\) appears in the last but one line as
the value for Multiple R-squared:
To complete the analysis, let’s see what happens when you switch the roles of the X (independent) and Y (dependent) variables.
Back in Part 2 you calculated the correlation between
age and BMD using age as the X
(independent) variable and BMD as the Y (dependent)
variable. Now, repeat this analysis switching the roles of the two
variables.
In the code chunk below, perform a correlation test to determine if
there is a significant linear relationship between BMD and
age. Note here that you are switching the roles/positions
of BMD and age in the cor.test()
function you used in part 2.
# Enter code here
cor.test(age ~ BMD, method = "pearson", data = nhanes)
##
## Pearson's product-moment correlation
##
## data: age and BMD
## t = -5.4383, df = 66, p-value = 8.425e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7016418 -0.3664531
## sample estimates:
## cor
## -0.5562755
In Part 3, you ran a linear regression using age as the
X (independent) variable and BMD as the Y (dependent)
variable, which is what made the most sense. Now let’s see what happens
to the linear regression when you switch the roles of the two
variables.
Run a linear regression analysis using BMD as the X
(independent) variable and age as the Y (dependent)
variable.
In the code chunk below, produce the output for regressing
age on BMD. Note here that you are switching
the roles/positions of BMD and age in the
lm() function you used in part 3.
# Enter code here
model_age_bmd <- lm(age ~ BMD, data = nhanes)
summary(model_age_bmd)
##
## Call:
## lm(formula = age ~ BMD, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.476 -12.311 -0.036 13.357 31.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.251 9.314 10.978 < 2e-16 ***
## BMD -54.939 10.102 -5.438 8.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.56 on 66 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.3094, Adjusted R-squared: 0.299
## F-statistic: 29.57 on 1 and 66 DF, p-value: 8.425e-07
Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.