Time: ~30 minutes
Goal: Practice correlation analysis from start to finish using real public health data
Learning Objectives:
Now it’s your turn to practice! Use the NHANES dataset and follow the examples above.
Total Points: 25 points
Correlation measures the strength and direction of the LINEAR relationship between two continuous variables.
✅ Use correlation when:
❌ Don’t use when:
⚠️ CORRELATION ≠ CAUSATION
Just because two variables are correlated does NOT mean one causes the other!
Classic Example: Ice cream sales and drowning deaths are highly correlated. Does ice cream cause drowning? NO! Both increase in summer (confounding by temperature/season).
# Load NHANES data
data(NHANES)
# Select adult participants with complete data
nhanes_adult <- NHANES %>%
filter(Age >= 18, Age <= 80) %>%
select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve,
Pulse, PhysActive, SleepHrsNight) %>%
na.omit()
# Display sample
# Display sample size
data.frame(
Metric = "Sample Size",
Value = paste(nrow(nhanes_adult), "adults")
) %>%
kable()| Metric | Value |
|---|---|
| Sample Size | 7133 adults |
| Age | Weight | Height | BMI | BPSysAve | BPDiaAve | Pulse | PhysActive | SleepHrsNight |
|---|---|---|---|---|---|---|---|---|
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 49 | 86.7 | 168.4 | 30.6 | 112 | 75 | 86 | No | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 66 | 68.0 | 169.5 | 23.7 | 111 | 63 | 60 | Yes | 7 |
Dataset Description:
Age: Age in yearsWeight: Weight in kgBMI: Body Mass Index (kg/m²)BPSysAve: Average systolic blood pressure (mmHg)BPDiaAve: Average diastolic blood pressure (mmHg)Pulse: 60 second pulse rateSleepHrsNight: Hours of sleep per nightResearch Question: Is there a correlation between weight and height among US adults?
Your tasks:
cor.test() and
display with tidy() (3 points)# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = Weight, y = Height)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Weight vs Height Among US Adults",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Weight (kg)",
y = "Height (cm)"
) +
theme_minimal()# b. Correlation test with tidy() display
cor_wt_ht <- cor.test(nhanes_adult$Weight, nhanes_adult$Height)
tidy(cor_wt_ht) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Weight and Height"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.451 | 42.618 | 0 | 0.432 | 0.469 |
# c. Statistical significance
# r = 0.451: Moderate positive correlation
#p < 0.001: Statistically significant (reject H₀)
#95% CI [0.432, 0.469]: Doesn’t contain zero (confirms significance)
# d. r² and interpretation (write as comment)
# There is a statistically significant moderate positive correlation between Weight and Height. As Weight increases, Height tends to increase. However, Weight explains only about 20.3% of the variation in Height, suggesting other factors also play important roles.Public Health Implication: Weight-appropriate Height screening is important, but individual risk assessment should consider multiple factors beyond Weight alone.
r_squared <- cor_wt_ht$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_wt_ht$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | 0.451 |
| Coefficient of Determination (r²) | 0.203 |
| Variance Explained | 20.3% |
Research Question: What are the relationships among BMI, weight, and height?
Your tasks:
# YOUR CODE HERE
# a. Correlation matrix
# Select variables
relationships_vars <- nhanes_adult %>%
select(BMI, Weight, Height)
# Calculate correlation matrix
cor_matrix <- cor(relationships_vars, use = "complete.obs")
# Display as table
cor_matrix %>%
kable(digits = 3, caption = "Relationships among BMI, weight, and height Correlation Matrix")| BMI | Weight | Height | |
|---|---|---|---|
| BMI | 1.000 | 0.880 | -0.012 |
| Weight | 0.880 | 1.000 | 0.451 |
| Height | -0.012 | 0.451 | 1.000 |
# b. Visualize with corrplot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
title = "BMI, Weight, and Height Correlations",
mar = c(0,0,2,0))# c. Strongest correlation: BMI & Weight
data.frame(
Relationship = c(
"BMI & Weight",
"BMI & Height",
"Weight & Height"
),
Correlation = c(
round(cor_matrix["BMI", "Weight"], 3),
round(cor_matrix["BMI", "Height"], 3),
round(cor_matrix["Weight", "Height"], 3)
),
Strength = c("Strong", "Moderate", "Weak")
) %>%
kable(caption = "Notable Correlations Summary")| Relationship | Correlation | Strength |
|---|---|---|
| BMI & Weight | 0.880 | Strong |
| BMI & Height | -0.012 | Moderate |
| Weight & Height | 0.451 | Weak |
Research Question: Is there a relationship between hours of sleep and age?
Your tasks:
tidy()
(2 points)# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = Age)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Hours of Sleep vs Age",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Hours of sleep per night",
y = "Age (years)"
) +
theme_minimal()# b. Correlation with tidy()
# Calculate Pearson correlation
cor_sleep_age <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$Age)
# Display results in clean table
tidy(cor_sleep_age) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Hours of sleep and Age"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.023 | 1.904 | 0.057 | -0.001 | 0.046 |
# c. Interpretation (write as comment)
# Hypothesis Test:
#H₀: ρ = 0 (no correlation between Hours of sleep and Age in population)
#H₁: ρ ≠ 0 (correlation exists)
#α = 0.05
#Results:
#r = 0.023: Weak positive correlation
#p > 0.05: Not Statistically significant (fail to reject H₀)
#95% CI [-0.001, 0.046]: Contains zero (confirms insignificance)Challenge: Investigate the relationship between two variables of your choice from the NHANES dataset. Include:
# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = Weight, y = BPSysAve)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Weight vs Average systolic blood pressure",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Weight in kg",
y = "Average systolic blood pressure (mmHg)"
) +
theme_minimal()# b. Correlation test
cor_weight_bp <- cor.test(nhanes_adult$Weight, nhanes_adult$BPSysAve)
# Display results in clean table
tidy(cor_weight_bp) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Weight and Systolic BP"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.118 | 10.04 | 0 | 0.095 | 0.141 |
#c. Assumption Checks
#Assumption 1: Linearity (already checked with scatterplot ✓)
#Assumption 2: Bivariate Normality
# Q-Q plots for normality
par(mfrow = c(1, 2))
qqnorm(nhanes_adult$Weight, main = "Q-Q Plot: Weight")
qqline(nhanes_adult$Weight, col = "red")
qqnorm(nhanes_adult$BPSysAve, main = "Q-Q Plot: Systolic BP")
qqline(nhanes_adult$BPSysAve, col = "red")#Assessment: Both variables are approximately normally distributed (points follow the red line reasonably well). Some deviation in the tails, but with large sample size (n = 7133), the correlation test is robust to minor violations.
#Assumption 3: No Extreme Outliers (scatterplot shows no extreme outliers ✓)
#d. Thoughtful interpretation
r_squared_weight <- cor_weight_bp$estimate^2
data.frame(
Measure = c("r²", "Variance Explained"),
Value = c(
round(r_squared_weight, 4),
paste0(round(r_squared_weight * 100, 2), "%")
)
) %>%
kable(caption = "Effect Size")| Measure | Value | |
|---|---|---|
| cor | r² | 0.0139 |
| Variance Explained | 1.39% |
# There is a positive trend, heavier weight(in kg) adults tend to have higher blood pressure. While Weight and Systolic BP are related, Weight alone explains less than 10% of BP variation. Other factors (genetics, diet, physical activity, stress, age) play substantial roles. If we look at the scatterplot, there is a positive relationship visible, moderate scatter around the line. When testing the hypothesis, H₀: ρ = 0 (no correlation between age and BP in population), H₁: ρ ≠ 0 (correlation exists), and α = 0.05, results indicate that r = 0.118: weak positive correlation, p < 0.001: statistically significant (reject H₀), and 95% CI [0.095, 0.141]: doesn’t contain zero (confirms significance), affirming conclusions.cor.test() - Calculate correlation and test
significancetidy() - Clean display of statistical test resultscor() - Calculate correlation matrixcorrplot() - Visualize correlation matrixggplot() + geom_point() - Scatterplotsgeom_smooth(method="lm") - Add fitted regression
lineqqnorm() / qqline() - Check normalityThis lab activity was created for EPI 553: Principles of
Statistical Inference II
University at Albany, College of Integrated Health
Sciences
Spring 2026