Time: ~30 minutes
Goal: Practice correlation analysis from start to finish using real public health data
Learning Objectives:
Structure:
Submission: Publish to RPubs and submit your .Rmd file + RPubs link to Brightspace by end of class
Correlation measures the strength and direction of the LINEAR relationship between two continuous variables.
✅ Use correlation when:
❌ Don’t use when:
⚠️ CORRELATION ≠ CAUSATION
Just because two variables are correlated does NOT mean one causes the other!
Classic Example: Ice cream sales and drowning deaths are highly correlated. Does ice cream cause drowning? NO! Both increase in summer (confounding by temperature/season).
# Load NHANES data
data(NHANES)
# Select adult participants with complete data
nhanes_adult <- NHANES %>%
filter(Age >= 18, Age <= 80) %>%
select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve,
Pulse, PhysActive, SleepHrsNight) %>%
na.omit()
# Display sample
# Display sample size
data.frame(
Metric = "Sample Size",
Value = paste(nrow(nhanes_adult), "adults")
) %>%
kable()| Metric | Value |
|---|---|
| Sample Size | 7133 adults |
| Age | Weight | Height | BMI | BPSysAve | BPDiaAve | Pulse | PhysActive | SleepHrsNight |
|---|---|---|---|---|---|---|---|---|
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 49 | 86.7 | 168.4 | 30.6 | 112 | 75 | 86 | No | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 66 | 68.0 | 169.5 | 23.7 | 111 | 63 | 60 | Yes | 7 |
Dataset Description:
Age: Age in yearsWeight: Weight in kgBMI: Body Mass Index (kg/m²)BPSysAve: Average systolic blood pressure (mmHg)BPDiaAve: Average diastolic blood pressure (mmHg)Pulse: 60 second pulse rateSleepHrsNight: Hours of sleep per nightIs there a correlation between age and systolic blood pressure among US adults?
Public Health Context: Understanding age-related changes in blood pressure helps identify at-risk populations and inform screening guidelines.
Always start with a scatterplot!
# Create scatterplot
ggplot(nhanes_adult, aes(x = Age, y = BPSysAve)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Age vs Systolic Blood Pressure",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Age (years)",
y = "Systolic Blood Pressure (mmHg)"
) +
theme_minimal()What we observe:
# Calculate Pearson correlation
cor_age_bp <- cor.test(nhanes_adult$Age, nhanes_adult$BPSysAve)
# Display results in clean table
tidy(cor_age_bp) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Age and Systolic BP"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.415 | 38.54 | 0 | 0.396 | 0.434 |
Hypothesis Test:
Results:
# Calculate r-squared
r_squared <- cor_age_bp$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_age_bp$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | 0.415 |
| Coefficient of Determination (r²) | 0.172 |
| Variance Explained | 17.2% |
Interpretation:
There is a statistically significant moderate positive correlation between age and systolic blood pressure. As age increases, systolic BP tends to increase. However, age explains only about 17.2% of the variation in BP, suggesting other factors also play important roles.
Public Health Implication: Age-appropriate BP screening is important, but individual risk assessment should consider multiple factors beyond age alone.
Assumption 1: Linearity (already checked with scatterplot ✓)
Assumption 2: Bivariate Normality
# Q-Q plots for normality
par(mfrow = c(1, 2))
qqnorm(nhanes_adult$Age, main = "Q-Q Plot: Age")
qqline(nhanes_adult$Age, col = "red")
qqnorm(nhanes_adult$BPSysAve, main = "Q-Q Plot: Systolic BP")
qqline(nhanes_adult$BPSysAve, col = "red")Assessment: Both variables are approximately normally distributed (points follow the red line reasonably well). Some deviation in the tails, but with large sample size (n = 7133), the correlation test is robust to minor violations.
Assumption 3: No Extreme Outliers (scatterplot shows no extreme outliers ✓)
Is BMI correlated with diastolic blood pressure?
Why this matters: Understanding the relationship between obesity and blood pressure helps inform weight management interventions.
ggplot(nhanes_adult, aes(x = BMI, y = BPDiaAve)) +
geom_point(alpha = 0.3, color = "darkgreen") +
geom_smooth(method = "lm", se = TRUE, color = "red", fill = "pink") +
labs(
title = "BMI vs Diastolic Blood Pressure",
x = "Body Mass Index (kg/m²)",
y = "Diastolic Blood Pressure (mmHg)"
) +
theme_minimal()Observation: Positive relationship visible, moderate scatter around the line.
# Pearson correlation
cor_bmi_bp <- cor.test(nhanes_adult$BMI, nhanes_adult$BPDiaAve)
# Display results
tidy(cor_bmi_bp) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: BMI and Diastolic BP"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.117 | 9.966 | 0 | 0.094 | 0.14 |
# Calculate r-squared
r_squared_bmi <- cor_bmi_bp$estimate^2
data.frame(
Measure = c("r²", "Variance Explained"),
Value = c(
round(r_squared_bmi, 4),
paste0(round(r_squared_bmi * 100, 2), "%")
)
) %>%
kable(caption = "Effect Size")| Measure | Value | |
|---|---|---|
| cor | r² | 0.0137 |
| Variance Explained | 1.37% |
Interpretation:
Key Insight: While BMI and blood pressure are related, BMI alone explains less than 10% of BP variation. Other factors (genetics, diet, physical activity, stress, age) play substantial roles.
How are cardiovascular health indicators related to each other?
# Select cardiovascular variables
cardio_vars <- nhanes_adult %>%
select(Age, BMI, BPSysAve, BPDiaAve, Pulse)
# Calculate correlation matrix
cor_matrix <- cor(cardio_vars, use = "complete.obs")
# Display as table
cor_matrix %>%
kable(digits = 3, caption = "Cardiovascular Health Correlation Matrix")| Age | BMI | BPSysAve | BPDiaAve | Pulse | |
|---|---|---|---|---|---|
| Age | 1.000 | 0.065 | 0.415 | -0.019 | -0.153 |
| BMI | 0.065 | 1.000 | 0.135 | 0.117 | 0.112 |
| BPSysAve | 0.415 | 0.135 | 1.000 | 0.340 | -0.022 |
| BPDiaAve | -0.019 | 0.117 | 0.340 | 1.000 | 0.106 |
| Pulse | -0.153 | 0.112 | -0.022 | 0.106 | 1.000 |
# Create correlation plot
corrplot(cor_matrix,
method = "circle",
type = "lower",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
title = "Cardiovascular Health Correlations",
mar = c(0,0,2,0))Key Findings:
# Create summary table of notable correlations
data.frame(
Relationship = c(
"Systolic BP & Diastolic BP",
"Age & Systolic BP",
"Age & Diastolic BP",
"BMI & Systolic BP",
"BMI & Pulse"
),
Correlation = c(
round(cor_matrix["BPSysAve", "BPDiaAve"], 3),
round(cor_matrix["Age", "BPSysAve"], 3),
round(cor_matrix["Age", "BPDiaAve"], 3),
round(cor_matrix["BMI", "BPSysAve"], 3),
round(cor_matrix["BMI", "Pulse"], 3)
),
Strength = c("Strong", "Moderate", "Weak-Moderate", "Moderate", "Very Weak")
) %>%
kable(caption = "Notable Correlations Summary")| Relationship | Correlation | Strength |
|---|---|---|
| Systolic BP & Diastolic BP | 0.340 | Strong |
| Age & Systolic BP | 0.415 | Moderate |
| Age & Diastolic BP | -0.019 | Weak-Moderate |
| BMI & Systolic BP | 0.135 | Moderate |
| BMI & Pulse | 0.112 | Very Weak |
Interpretation: Systolic and diastolic BP show the strongest correlation (r = 0.34), which makes sense as they measure the same physiological process. Pulse rate shows relatively weak correlations, suggesting it’s influenced by different factors.
Use Spearman’s rank correlation when:
# Visualize relationship
ggplot(nhanes_adult, aes(x = Age, y = Pulse)) +
geom_point(alpha = 0.3, color = "purple") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Age vs Pulse Rate",
x = "Age (years)",
y = "Pulse Rate (bpm)"
) +
theme_minimal()# Calculate both correlations
pearson_r <- cor.test(nhanes_adult$Age, nhanes_adult$Pulse, method = "pearson")
spearman_r <- cor.test(nhanes_adult$Age, nhanes_adult$Pulse, method = "spearman")
# Compare in table
data.frame(
Method = c("Pearson", "Spearman"),
Correlation = c(
round(pearson_r$estimate, 3),
round(spearman_r$estimate, 3)
),
p_value = c(
format.pval(pearson_r$p.value),
format.pval(spearman_r$p.value)
),
Difference = c(
"—",
round(abs(pearson_r$estimate - spearman_r$estimate), 3)
)
) %>%
kable(caption = "Pearson vs Spearman Comparison")| Method | Correlation | p_value | Difference | |
|---|---|---|---|---|
| cor | Pearson | -0.153 | < 2.22e-16 | — |
| rho | Spearman | -0.162 | < 2.22e-16 | 0.008 |
Interpretation:
Now it’s your turn to practice! Use the same NHANES dataset and follow the examples above.
Total Points: 25 points
Research Question: Is there a correlation between weight and height among US adults?
Your tasks:
cor.test() and
display with tidy() (3 points)# a. Scatterplot
# Create scatterplot
ggplot(nhanes_adult, aes(x = Weight, y = Height)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Weight vs Height",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Weight (kg)",
y = "Height"
) +
theme_minimal()# b. Correlation test with tidy() display
# Calculate Pearson correlation
cor_weight_height <- cor.test(nhanes_adult$Weight, nhanes_adult$Height)
# Display results in clean table
tidy(cor_weight_height) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Weight and Height"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.451 | 42.618 | 0 | 0.432 | 0.469 |
# c. Statistical significance
r_squared <- cor_weight_height$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_weight_height$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | 0.451 |
| Coefficient of Determination (r²) | 0.203 |
| Variance Explained | 20.3% |
# d. r² and interpretation (write as comment)
# r² = 0.203
# There is a statistically significant weak/moderate positive correlation between weight and height. As weight increases, height tends to increase. However, weight explains only about `r round(r_squared * 100, 1)`% of the variation in height, suggesting other factors also play important roles.Research Question: What are the relationships among BMI, weight, and height?
Your tasks:
# a. Correlation matrix
BMI_vars <- nhanes_adult %>%
select(Weight, Height, BMI)
# Calculate correlation matrix
cor_matrix <- cor(BMI_vars, use = "complete.obs")
# Display as table
cor_matrix %>%
kable(digits = 3, caption = "BMI Health Correlation Matrix")| Weight | Height | BMI | |
|---|---|---|---|
| Weight | 1.000 | 0.451 | 0.880 |
| Height | 0.451 | 1.000 | -0.012 |
| BMI | 0.880 | -0.012 | 1.000 |
# b. Visualize with corrplot
corrplot(cor_matrix,
method = "circle",
type = "lower",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
title = "BMI Health Correlations",
mar = c(0,0,2,0))# c. Strongest correlation:
data.frame(
Relationship = c(
"Weight & BMI",
"Weight & Height",
"Height & BMI"
),
Correlation = c(
round(cor_matrix["Weight", "BMI"], 3),
round(cor_matrix["Weight", "Height"], 3),
round(cor_matrix["Height", "BMI"], 3)
),
Strength = c("Strong", "Moderate", "Weak")
) %>%
kable(caption = "Notable Correlations Summary")| Relationship | Correlation | Strength |
|---|---|---|
| Weight & BMI | 0.880 | Strong |
| Weight & Height | 0.451 | Moderate |
| Height & BMI | -0.012 | Weak |
# d. Explanation (write as comment)
# Weight and BMI show the strongest correlation (r = 0.880), which is expected since BMI is calculated directly from weight and height. Weight and height demonstrate a moderate positive correlation (r = 0.451), mathematically and biologically reflecting that as height increases, typically so does weight. Height and BMI show a very weak inverse correlation (r = -0.012), suggesting that BMI variation in this sample is driven primarily by differences in weight rather than height.Research Question: Is there a relationship between hours of sleep and age?
Your tasks:
tidy()
(2 points)# a. Scatterplot
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = Age)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Hours of Sleep vs Age",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Hours of Sleep",
y = "Age (Yrs)"
) +
theme_minimal()# b. Correlation with tidy()
# Calculate Pearson correlation
cor_SleepHrsNightt_Age <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$Age)
# Display results in clean table
tidy(cor_SleepHrsNightt_Age) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Hours of Sleep and Age"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.023 | 1.904 | 0.057 | -0.001 | 0.046 |
# Statistical significance
r_squared <- cor_SleepHrsNightt_Age$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_SleepHrsNightt_Age$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | 0.023 |
| Coefficient of Determination (r²) | 0.001 |
| Variance Explained | 0.1% |
# c. Interpretation (write as comment)
# There was a very weak positive correlation between age and hours of sleep (r = 0.023), which was not statistically significant (p = 0.057). The association explains only 0.1% of the variance in sleep duration, indicating no meaningful linear relationship between age and sleep. Challenge: Investigate the relationship between two variables of your choice from the NHANES dataset. Include:
#Step 1: Scatterplot
# a. Create scatterplot
ggplot(NHANES, aes(x = HHIncomeMid, y = DaysMentHlthBad)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Income vs Mental Health",
subtitle = "NHANES Data, Adults 18 Years or Older",
x = "Income",
y = "Number of Days of Bad Mental Health"
) +
theme_minimal()#Step 2:
#b. Correlation test with tidy() display
# Calculate Pearson correlation
cor_HHIncomeMid_DaysMentHlthBad <- cor.test(NHANES$HHIncomeMid, NHANES$DaysMentHlthBad)
# Display results in clean table
tidy(cor_HHIncomeMid_DaysMentHlthBad) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Income & Number of Days of Bad Mental Health"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| -0.109 | -9.128 | 0 | -0.132 | -0.086 |
#Step 3: Statistical Significance
r_squared <- cor_HHIncomeMid_DaysMentHlthBad$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_HHIncomeMid_DaysMentHlthBad$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | -0.109 |
| Coefficient of Determination (r²) | 0.012 |
| Variance Explained | 1.2% |
#Step 4:
#Check Assumptions
#Assumption 1: Linearity (already checked with scatterplot ✓)
#Assumption 2: Bivariate Normality
#Q-Q plots for normality
par(mfrow = c(1, 2))
qqnorm(NHANES$HHIncomeMid, main = "Q-Q Plot: Income")
qqline(NHANES$HHIncomeMid, col = "red")
qqnorm(NHANES$DaysMentHlthBad, main = "Q-Q Plot: Number of Days Bad Mental Health")
qqline(NHANES$DaysMentHlthBad, col = "red")par(mfrow = c(1, 1))
#*Assessment:* The Q-Q plots indicate that both household income and number of bad mental health days deviate from normality with income appearing right-skewed and possibly grouped by categories, and heavy clustering at zero for number of bad mental health days. These violations suggest that the normality assumption underlying Pearson correlation is not fully met.
#Assumption 3: Extreme Outliers?
#There are some outliers; however, this is normal as there will always be high income earners (which explains the upper tail) and the number of bad mental health days being driven more by skewness than outliers.
#Step 5:
# d. r² and interpretation (write as comment)
# r² = 0.012
#Based on the Pearson correlation analysis there is a statistically significant but weak inverse association between household income and bad mental health days (r = -0.109, p < 0.05). However, based on the Q-Q plot, normality is not met due to the right skewness and tail deviation from the red line for income and clustering around zero for number of bad mental health days which doesn't show normal distribution. Therefore, while higher income is associated with fewer poor mental health days, income explains only 1.2% of the variance in mental health outcomes and the relationship doesn't follow a normal distribution, suggesting that while income plays a role, many other social and structural factors contribute to bad mental health. Save your work with your name:
Correlation_Lab_YourName.Rmd
Knit to HTML to create your report
Publish to RPubs:
Submit to Brightspace:
Due: End of class today
Grading: This lab is worth 15% of your in-class lab grade. The lowest 2 lab grades are dropped.
cor.test() - Calculate correlation and test
significancetidy() - Clean display of statistical test resultscor() - Calculate correlation matrixcorrplot() - Visualize correlation matrixggplot() + geom_point() - Scatterplotsgeom_smooth(method="lm") - Add fitted regression
lineqqnorm() / qqline() - Check normality?cor.test in
consoleRemember:
✓ Correlation measures LINEAR relationships only
✓ Always visualize your data first
✓ Correlation ≠ Causation
✓ Check your assumptions
✓ Consider confounding and alternative explanations
This lab activity was created for EPI 553: Principles of
Statistical Inference II
University at Albany, College of Integrated Health
Sciences
Spring 2026