Time: ~30 minutes
Goal: Practice correlation analysis from start to finish using real public health data
Learning Objectives:
Structure:
Submission: Publish to RPubs and submit your .Rmd file + RPubs link to Brightspace by end of class
Correlation measures the strength and direction of the LINEAR relationship between two continuous variables.
✅ Use correlation when:
❌ Don’t use when:
⚠️ CORRELATION ≠ CAUSATION
Just because two variables are correlated does NOT mean one causes the other!
Classic Example: Ice cream sales and drowning deaths are highly correlated. Does ice cream cause drowning? NO! Both increase in summer (confounding by temperature/season).
# Load NHANES data
data(NHANES)
# Select adult participants with complete data
nhanes_adult <- NHANES %>%
filter(Age >= 18, Age <= 80) %>%
select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve,
Pulse, PhysActive, SleepHrsNight) %>%
na.omit()
# Display sample
# Display sample size
data.frame(
Metric = "Sample Size",
Value = paste(nrow(nhanes_adult), "adults")
) %>%
kable()| Metric | Value |
|---|---|
| Sample Size | 7133 adults |
| Age | Weight | Height | BMI | BPSysAve | BPDiaAve | Pulse | PhysActive | SleepHrsNight |
|---|---|---|---|---|---|---|---|---|
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 49 | 86.7 | 168.4 | 30.6 | 112 | 75 | 86 | No | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 66 | 68.0 | 169.5 | 23.7 | 111 | 63 | 60 | Yes | 7 |
Dataset Description:
Age: Age in yearsWeight: Weight in kgBMI: Body Mass Index (kg/m²)BPSysAve: Average systolic blood pressure (mmHg)BPDiaAve: Average diastolic blood pressure (mmHg)Pulse: 60 second pulse rateSleepHrsNight: Hours of sleep per nightIs there a correlation between age and systolic blood pressure among US adults?
Public Health Context: Understanding age-related changes in blood pressure helps identify at-risk populations and inform screening guidelines.
Always start with a scatterplot!
# Create scatterplot
ggplot(nhanes_adult, aes(x = Age, y = BPSysAve)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Age vs Systolic Blood Pressure",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Age (years)",
y = "Systolic Blood Pressure (mmHg)"
) +
theme_minimal()What we observe:
# Calculate Pearson correlation
cor_age_bp <- cor.test(nhanes_adult$Age, nhanes_adult$BPSysAve)
# Display results in clean table
tidy(cor_age_bp) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Age and Systolic BP"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.415 | 38.54 | 0 | 0.396 | 0.434 |
Hypothesis Test:
Results:
# Calculate r-squared
r_squared <- cor_age_bp$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_age_bp$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | 0.415 |
| Coefficient of Determination (r²) | 0.172 |
| Variance Explained | 17.2% |
Interpretation:
There is a statistically significant moderate positive correlation between age and systolic blood pressure. As age increases, systolic BP tends to increase. However, age explains only about 17.2% of the variation in BP, suggesting other factors also play important roles.
Public Health Implication: Age-appropriate BP screening is important, but individual risk assessment should consider multiple factors beyond age alone.
Assumption 1: Linearity (already checked with scatterplot ✓)
Assumption 2: Bivariate Normality
# Q-Q plots for normality
par(mfrow = c(1, 2))
qqnorm(nhanes_adult$Age, main = "Q-Q Plot: Age")
qqline(nhanes_adult$Age, col = "red")
qqnorm(nhanes_adult$BPSysAve, main = "Q-Q Plot: Systolic BP")
qqline(nhanes_adult$BPSysAve, col = "red")Assessment: Both variables are approximately normally distributed (points follow the red line reasonably well). Some deviation in the tails, but with large sample size (n = 7133), the correlation test is robust to minor violations.
Assumption 3: No Extreme Outliers (scatterplot shows no extreme outliers ✓)
Is BMI correlated with diastolic blood pressure?
Why this matters: Understanding the relationship between obesity and blood pressure helps inform weight management interventions.
ggplot(nhanes_adult, aes(x = BMI, y = BPDiaAve)) +
geom_point(alpha = 0.3, color = "darkgreen") +
geom_smooth(method = "lm", se = TRUE, color = "red", fill = "pink") +
labs(
title = "BMI vs Diastolic Blood Pressure",
x = "Body Mass Index (kg/m²)",
y = "Diastolic Blood Pressure (mmHg)"
) +
theme_minimal()Observation: Positive relationship visible, moderate scatter around the line.
# Pearson correlation
cor_bmi_bp <- cor.test(nhanes_adult$BMI, nhanes_adult$BPDiaAve)
# Display results
tidy(cor_bmi_bp) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: BMI and Diastolic BP"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.117 | 9.966 | 0 | 0.094 | 0.14 |
# Calculate r-squared
r_squared_bmi <- cor_bmi_bp$estimate^2
data.frame(
Measure = c("r²", "Variance Explained"),
Value = c(
round(r_squared_bmi, 4),
paste0(round(r_squared_bmi * 100, 2), "%")
)
) %>%
kable(caption = "Effect Size")| Measure | Value | |
|---|---|---|
| cor | r² | 0.0137 |
| Variance Explained | 1.37% |
Interpretation:
Key Insight: While BMI and blood pressure are related, BMI alone explains less than 10% of BP variation. Other factors (genetics, diet, physical activity, stress, age) play substantial roles.
How are cardiovascular health indicators related to each other?
# Select cardiovascular variables
cardio_vars <- nhanes_adult %>%
select(Age, BMI, BPSysAve, BPDiaAve, Pulse)
# Calculate correlation matrix
cor_matrix <- cor(cardio_vars, use = "complete.obs")
# Display as table
cor_matrix %>%
kable(digits = 3, caption = "Cardiovascular Health Correlation Matrix")| Age | BMI | BPSysAve | BPDiaAve | Pulse | |
|---|---|---|---|---|---|
| Age | 1.000 | 0.065 | 0.415 | -0.019 | -0.153 |
| BMI | 0.065 | 1.000 | 0.135 | 0.117 | 0.112 |
| BPSysAve | 0.415 | 0.135 | 1.000 | 0.340 | -0.022 |
| BPDiaAve | -0.019 | 0.117 | 0.340 | 1.000 | 0.106 |
| Pulse | -0.153 | 0.112 | -0.022 | 0.106 | 1.000 |
# Create correlation plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
title = "Cardiovascular Health Correlations",
mar = c(0,0,2,0))Key Findings:
# Create summary table of notable correlations
data.frame(
Relationship = c(
"Systolic BP & Diastolic BP",
"Age & Systolic BP",
"Age & Diastolic BP",
"BMI & Systolic BP",
"BMI & Pulse"
),
Correlation = c(
round(cor_matrix["BPSysAve", "BPDiaAve"], 3),
round(cor_matrix["Age", "BPSysAve"], 3),
round(cor_matrix["Age", "BPDiaAve"], 3),
round(cor_matrix["BMI", "BPSysAve"], 3),
round(cor_matrix["BMI", "Pulse"], 3)
),
Strength = c("Strong", "Moderate", "Weak-Moderate", "Moderate", "Very Weak")
) %>%
kable(caption = "Notable Correlations Summary")| Relationship | Correlation | Strength |
|---|---|---|
| Systolic BP & Diastolic BP | 0.340 | Strong |
| Age & Systolic BP | 0.415 | Moderate |
| Age & Diastolic BP | -0.019 | Weak-Moderate |
| BMI & Systolic BP | 0.135 | Moderate |
| BMI & Pulse | 0.112 | Very Weak |
Interpretation: Systolic and diastolic BP show the strongest correlation (r = 0.34), which makes sense as they measure the same physiological process. Pulse rate shows relatively weak correlations, suggesting it’s influenced by different factors.
Use Spearman’s rank correlation when:
# Visualize relationship
ggplot(nhanes_adult, aes(x = Age, y = Pulse)) +
geom_point(alpha = 0.3, color = "purple") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Age vs Pulse Rate",
x = "Age (years)",
y = "Pulse Rate (bpm)"
) +
theme_minimal()# Calculate both correlations
pearson_r <- cor.test(nhanes_adult$Age, nhanes_adult$Pulse, method = "pearson")
spearman_r <- cor.test(nhanes_adult$Age, nhanes_adult$Pulse, method = "spearman")
# Compare in table
data.frame(
Method = c("Pearson", "Spearman"),
Correlation = c(
round(pearson_r$estimate, 3),
round(spearman_r$estimate, 3)
),
p_value = c(
format.pval(pearson_r$p.value),
format.pval(spearman_r$p.value)
),
Difference = c(
"—",
round(abs(pearson_r$estimate - spearman_r$estimate), 3)
)
) %>%
kable(caption = "Pearson vs Spearman Comparison")| Method | Correlation | p_value | Difference | |
|---|---|---|---|---|
| cor | Pearson | -0.153 | < 2.22e-16 | — |
| rho | Spearman | -0.162 | < 2.22e-16 | 0.008 |
Interpretation:
Now it’s your turn to practice! Use the same NHANES dataset and follow the examples above.
Total Points: 25 points
Research Question: Is there a correlation between weight and height among US adults?
Your tasks:
cor.test() and
display with tidy() (3 points)# YOUR CODE HERE
# a. Scatterplot
# b. Correlation test with tidy() display
# c. Statistical significance
# d. r² and interpretation (write as comment)
# Research Question: What are the relationships among BMI, weight, and height?
Your tasks:
# YOUR CODE HERE
# a. Correlation matrix
# b. Visualize with corrplot
# c. Strongest correlation:
# d. Explanation (write as comment)
# Research Question: Is there a relationship between hours of sleep and age?
Your tasks:
tidy()
(2 points)# YOUR CODE HERE
# a. Scatterplot
# b. Correlation with tidy()
# c. Interpretation (write as comment)
# Save your work with your name:
Correlation_Lab_YourName.Rmd
Knit to HTML to create your report
Publish to RPubs:
Submit to Brightspace:
Due: End of class today
Grading: This lab is worth 15% of your in-class lab grade. The lowest 2 lab grades are dropped.
cor.test() - Calculate correlation and test
significancetidy() - Clean display of statistical test resultscor() - Calculate correlation matrixcorrplot() - Visualize correlation matrixggplot() + geom_point() - Scatterplotsgeom_smooth(method="lm") - Add fitted regression
lineqqnorm() / qqline() - Check normality?cor.test in
consoleRemember:
✓ Correlation measures LINEAR relationships only
✓ Always visualize your data first
✓ Correlation ≠ Causation
✓ Check your assumptions
✓ Consider confounding and alternative explanations
This lab activity was created for EPI 553: Principles of
Statistical Inference II
University at Albany, College of Integrated Health
Sciences
Spring 2026