Lab 9

Introduction

Overview: In this lab exercise, you will explore associations between two continuous variables.

Objectives: At the end of this lab you will be able to:

Create a scatterplot;
Calculate a correlation coefficient and test for a significant (non-zero) correlation;
Perform a simple linear regression analysis;
Interpret the results of a simple linear regression analysis.

Part 0: Download and organize files

Create a subdirectory named Lab 9 in the PUBHBIO 2210 Labs directory you created in your OneDrive folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
1. lab-09-regression-blank.html
2. lab-09-regression-blank.Rmd
3. lab-09-regression-worksheet-blank.docx
4. nhanes.RData
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the PUBHBIO 2210 Labs/Lab 9 directory (i.e., save the downloaded files in the Lab 9 directory or folder created). When working on labs, it is important to keep all related files in the same directory.
Change the author and date information in the lab header.

Part 1: Scatterplots

Load and print the prepared dataset from nhanes.RData. Recall labs 1 and 2 on how to load .RData file and print.

# Enter code here
load("nhanes.RData")
print(nhanes)

## # A tibble: 100 × 33
##       id race  ethnicity      sex     age familySize urban region
##    <int> <fct> <fct>          <fct> <int>      <int> <fct> <fct> 
##  1     1 black not hispanic   fema…    56          1 metr… midwe…
##  2     2 white not hispanic   fema…    73          1 other west  
##  3     3 white not hispanic   fema…    25          2 metr… south 
##  4     4 white mexican-ameri… fema…    53          2 other south 
##  5     5 white mexican-ameri… fema…    68          2 other south 
##  6     6 white not hispanic   fema…    44          3 other west  
##  7     7 black not hispanic   fema…    28          2 metr… south 
##  8     8 white not hispanic   male     74          2 other midwe…
##  9     9 white not hispanic   fema…    65          1 other north…
## 10    10 white other hispanic fema…    61          3 metr… west  
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <fct>, healthStatus <ord>, heightInSelf <int>,
## #   weightLbSelf <int>, beer <int>, wine <int>, liquor <int>,
## #   everSmoke <fct>, smokeNow <fct>, active <ord>, SBP <int>,
## #   DBP <int>, weightKg <dbl>, heightCm <dbl>, waist <dbl>,
## #   tricep <dbl>, thigh <dbl>, BMD <dbl>, RBC <dbl>, …

In this lab we will investigate which physiological measurements are associated with age. The two measurements we will consider are bone mineral density (BMD; units=gm/cm\(^2\)) and red blood cell count (RBC; unitless)

The first step in investigating associations should always be to make a plot. To look at the relationship between two continuous variables we need a scatterplot.

We will first produce a scatterplot of BMD and RBC by age using the gf_point() command.

For example,

# Not evaluated
 gf_point(weight ~ height, data = nhanes)

will produce a scatterplot of weight (y-variable or dependent variable) versus height (x-variable or independent variable).

In the code chunk below, produce the scatterplot for BMD versus age.

# Enter code here
gf_point(BMD ~ age, data = nhanes)

## Warning: Removed 32 rows containing missing values or values outside the
## scale range (`geom_point()`).

We get a warning that 32 rows containing missing values have been removed. In this case, that means rows with missing values of BMD. This warning won’t hurt the plot, but we could get rid of it by filtering out subjects with missing values of BMD or age.

For example,

# Not evaluated
gf_point(weight ~ height,
         data = mydata %>% filter(!is.na(weight), !is.na(height)))

will remove subjects with missing values of weight or height. Thus, the above command would produce a scatterplot for weight versus height with no warnings (i.e., with no missing values of weight and/or height).

In the code chucnk below, create the scatterplot for BMD versus age with no warnings (i.e., with no missing values of BMD and/or age).

# Enter code here
gf_point(BMD ~ age, data = nhanes %>% filter(!is.na(BMD), !is.na(age)))

STOP! Answer Question 1 now.

In the code chunk below, produce the scatterplot for RBC versus age.

# Enter code here
gf_point(RBC ~ age, data = nhanes)

## Warning: Removed 8 rows containing missing values or values outside the
## scale range (`geom_point()`).

We get a warning that 8 rows containing missing values have been removed. In this case, that means rows with missing values of RBC.

In the code chucnk below, create the scatterplot for RBC versus age with no warnings (i.e., with no missing values of RBC and/or age).

# Enter code here
gf_point(RBC ~ age, data = nhanes %>% filter(!is.na(RBC), !is.na(age)))

STOP! Answer Question 2 now.

Part 2: Calculating the Correlation Coefficient

Note that if a subject is missing a value for one or both variables he/she will not be included in that scatterplot. So, even though there are 100 people in our data set, there might not be 100 points on the scatterplot.

Before we proceed, let us look at summary statistics such as means for BMD and age. The favstats() command provides several statistical summaries, and removes missing data automatically.

For example,

# Not evaluated
favstats( ~ weight, data = mydata)
favstats( ~ height, data = mydata)

will remove subjects with missing values of weight or height, then provide statistical summaries for weight and height.

In the code chunk below, produce the statistical summaries for BMD and age.

# Enter code here
favstats(~ BMD, data = nhanes)

##    min     Q1 median    Q3  max      mean       sd  n missing
##  0.358 0.8075  0.901 1.051 1.26 0.9028824 0.188168 68      32

favstats(~ age, data = nhanes)

##  min    Q1 median    Q3 max  mean       sd   n missing
##   17 34.75   48.5 70.25  90 51.43 21.52858 100       0

Under the heading “n” is the number of people who have age and BMD measurements. Since no patients had missing ages, and 32 patients had missing BMD measurements, we know that 68 out of the 100 people had both age and BMD. Hence, there are 68 points on the scatterplot.

Of course, just looking at the scatterplot obtained earlier is not enough. The next step will be to calculate the correlation between say age and BMD (bone mineral density) and to test if the linear relationship is significant.

To estimate the correlation and to perform the test, we will use the cor.test() function with the method = "pearson" argument specified. This will perform the version of the correlation test using Pearson’s method.

For example,

# Not evaluated
cor.test(weight ~ height, method = "pearson", data = mydata)

will estimate the correlation and perform a correlation test (using Pearson’s method) to enable us determine any signficant linear relationship between weight and height.

In the code chunk below, perform a correlation test (using Pearson’s method) to determine if there is a significant linear relationship between BMD and age.

# Enter code here
cor.test(BMD ~ age, method = "pearson", data = nhanes)

## 
##  Pearson's product-moment correlation
## 
## data:  BMD and age
## t = -5.4383, df = 66, p-value = 8.425e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7016418 -0.3664531
## sample estimates:
##        cor 
## -0.5562755

There is a lot of output displayed; the test statistic is t = in third line of the output and the degrees of freedom, df as well as the p-value also appears in that line. The 95% confidence interval for the population correlation is in the fifth line (i.e., under the 95 percent confidence interval: heading) and the estimate of the population correlation in the last line (i.e., under the sample estimates: heading) of the output.

STOP! Answer Questions 3–7 now.

In the code chunk below, using the cor.test function with the method = "pearson" argument specified, perform a correlation test to determine if there is a significant linear relationship between RBC and age.

# Enter code here
cor.test(RBC ~ age, method = "pearson", data = nhanes)

## 
##  Pearson's product-moment correlation
## 
## data:  RBC and age
## t = -1.017, df = 90, p-value = 0.3119
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3047538  0.1004201
## sample estimates:
##        cor 
## -0.1065896

STOP! Answer Question 8 now.

Part 3: Simple Linear Regression

These correlation analyses tell us that BMD is significantly linearly associated with age, and that RBC is not. It would be nice to estimate how much BMD changes when age changes, and for this we use simple linear regression.

Remember that it does not matter which variable is X (i.e., independent) and which is Y (i.e., dependent) for correlation, but it does matter for regression. We will use age as the X variable and BMD as the Y variable, which is what makes more sense (instead of the reverse). You can imagine getting older (age) causing changes in BMD, but you can’t imagine changing BMD causing you to get older!

The lm() function (“linear model”) is used to perform simple linear regression analysis.

For example,

# Not evaluated
model <- lm(weight ~ height, data = mydata)
summary(model)

will produce summary of results for regressing weight (the dependent variable, Y) on height (the independent variable, X).

In the code chunk below, produce the output for regressing BMD on age.

# Enter code here
modelBmdAge <- lm(BMD ~ age, data = nhanes)
summary(modelBmdAge)

## 
## Call:
## lm(formula = BMD ~ age, data = nhanes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40004 -0.10865 -0.00618  0.12505  0.32375 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.199413   0.057777  20.759  < 2e-16 ***
## age         -0.005632   0.001036  -5.438 8.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1575 on 66 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.3094, Adjusted R-squared:  0.299 
## F-statistic: 29.57 on 1 and 66 DF,  p-value: 8.425e-07

There is a lot of output displayed; under the Coefficients: heading of the output, you will find the estimates for the intercept and slope for the regression line (best fit line) in the column labeled Estimate.

STOP! Answer Question 9 now.

One can also add the regression line to the jittered scatterplot (without warnings) we obtained earlier.

To draw the regression line, we use the gf_lm() command.

For example,

# Not evaluated
gf_point(weight ~ height, data = mydata) %>% gf_lm() %>% 
         gf_labs(x = "horizontal axis label", y = "vertical axis label")

will produce a scatterplot of weight versus height, and then overlay the regression line for weight regressed on height.

Note that gf_lm() automatically uses the formula and data you used for gf_point(). The gf_labs() function allows you to provide well defined labels for the horizontal and vertical axes.

If you wanted to be more explicit, gf_lm(weight ~ height, data = mydata) would give you the same results.

In the code chunk below, produce a scatterplot of BMD versus age, and then overlay the regression line for BMD regressed on age. Also, use the gf_labs() function to label the horizontal and vertical axes with variables and units. Here, in your code, you MUST replace horizontal axis label and vertical axis label in the sample code above with appropriate labels for the X-axis and Y-axis, respectively.

# Enter code here
gf_point(BMD ~ age, data = nhanes %>% filter(!is.na(BMD), !is.na(age))) %>%
    gf_lm() %>%
        gf_labs(x = "Age", y = "Bone Mineral Density")

STOP! Answer Questions 10–12 now.

Looking back at the regression output (for question 9), you will notice that the p-value for the estimates for both the intercept and slope for the regression line is displayed in the column labeled Pr(>|t|) under the Coefficients: heading of the output. Also, the value for the \(R^2\) appears in the last but one line as the value for Multiple R-squared:

STOP! Answer Questions 13–14 now.

Part 4: Switching X and Y Variables

To complete the analysis, let’s see what happens when you switch the roles of the X (independent) and Y (dependent) variables.

Back in Part 2 you calculated the correlation between age and BMD using age as the X (independent) variable and BMD as the Y (dependent) variable. Now, repeat this analysis switching the roles of the two variables.

In the code chunk below, perform a correlation test to determine if there is a significant linear relationship between BMD and age. Note here that you are switching the roles/positions of BMD and age in the cor.test() function you used in part 2.

# Enter code here
cor.test(age ~ BMD, method = "pearson", data = nhanes)

## 
##  Pearson's product-moment correlation
## 
## data:  age and BMD
## t = -5.4383, df = 66, p-value = 8.425e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7016418 -0.3664531
## sample estimates:
##        cor 
## -0.5562755

STOP! Answer Question 15 now.

In Part 3, you ran a linear regression using age as the X (independent) variable and BMD as the Y (dependent) variable, which is what made the most sense. Now let’s see what happens to the linear regression when you switch the roles of the two variables.

Run a linear regression analysis using BMD as the X (independent) variable and age as the Y (dependent) variable.

In the code chunk below, produce the output for regressing age on BMD. Note here that you are switching the roles/positions of BMD and age in the lm() function you used in part 3.

# Enter code here
model_age_bmd <- lm(age ~ BMD, data = nhanes)
summary(model_age_bmd)

## 
## Call:
## lm(formula = age ~ BMD, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.476 -12.311  -0.036  13.357  31.842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  102.251      9.314  10.978  < 2e-16 ***
## BMD          -54.939     10.102  -5.438 8.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.56 on 66 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.3094, Adjusted R-squared:  0.299 
## F-statistic: 29.57 on 1 and 66 DF,  p-value: 8.425e-07

STOP! Answer Questions 16–18 now.

Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.

Lab 9

Correlation and Simple Linear Regression

Samuel Gilliland

10/24/2025

Introduction

Part 0: Download and organize files

Part 1: Scatterplots

STOP! Answer Question 1 now.

STOP! Answer Question 2 now.

Part 2: Calculating the Correlation Coefficient

STOP! Answer Questions 3–7 now.

STOP! Answer Question 8 now.

Part 3: Simple Linear Regression

STOP! Answer Question 9 now.

STOP! Answer Questions 10–12 now.

STOP! Answer Questions 13–14 now.

Part 4: Switching X and Y Variables

STOP! Answer Question 15 now.

STOP! Answer Questions 16–18 now.