Goal: Practice correlation analysis from start to finish using real public health data
Learning Objectives:
# Load NHANES data
data(NHANES)
# Select adult participants with complete data
nhanes_adult <- NHANES %>%
filter(Age >= 18, Age <= 80) %>%
select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve,
Pulse, PhysActive, SleepHrsNight) %>%
na.omit()
# Display sample
# Display sample size
data.frame(
Metric = "Sample Size",
Value = paste(nrow(nhanes_adult), "adults")
) %>%
kable()| Metric | Value |
|---|---|
| Sample Size | 7133 adults |
| Age | Weight | Height | BMI | BPSysAve | BPDiaAve | Pulse | PhysActive | SleepHrsNight |
|---|---|---|---|---|---|---|---|---|
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 49 | 86.7 | 168.4 | 30.6 | 112 | 75 | 86 | No | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 66 | 68.0 | 169.5 | 23.7 | 111 | 63 | 60 | Yes | 7 |
Dataset Description:
Age: Age in yearsWeight: Weight in kgBMI: Body Mass Index (kg/m²)BPSysAve: Average systolic blood pressure (mmHg)BPDiaAve: Average diastolic blood pressure (mmHg)Pulse: 60 second pulse rateSleepHrsNight: Hours of sleep per nightResearch Question: Is there a correlation between weight and height among US adults?
Your tasks:
cor.test() and
display with tidy() (3 points)# YOUR CODE HERE
# a. Scatterplot
# Create scatterplot
ggplot(nhanes_adult, aes(x = Height, y = Weight)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Weight vs. Height among US Adults",
subtitle = "NHANES Data, Adults 18-80 years",
x = "Height (cm)",
y = "Weight (kg)"
) +
theme_minimal()# b. Correlation test with tidy() display
# Pearson correlation
cor_Weight_Height <- cor.test(nhanes_adult$Weight, nhanes_adult$Height)
# Display results
tidy(cor_Weight_Height) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Weight and Height"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.451 | 42.618 | 0 | 0.432 | 0.469 |
# c. Statistical significance
#The pearson correlation between weight and height is r=0.451 which indicates a postive linear relationship. As height increases, weight tends to also increase. The p-value of 0 makes the correlation between weight and height statistically signifcant. We reject the null hypothesis.
# d. r² and interpretation (write as comment)
#
r_squared <- cor_Weight_Height$estimate^2
data.frame(
Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
"Variance Explained"),
Value = c(
round(cor_Weight_Height$estimate, 3),
round(r_squared, 3),
paste0(round(r_squared * 100, 1), "%")
)
) %>%
kable(caption = "Summary of Correlation Strength")| Measure | Value |
|---|---|
| Correlation (r) | 0.451 |
| Coefficient of Determination (r²) | 0.203 |
| Variance Explained | 20.3% |
#Interpretation: There is a statistically significant moderate positive correlation between height and weight. As height increases, weight tends to increase. However, height explains only about 20.3% of the variation in weight, suggesting other factors also play important roles like diet and physical activity. Research Question: What are the relationships among BMI, weight, and height?
Your tasks:
# YOUR CODE HERE
# a. Correlation matrix
# Select anthropometric measurements
metric_vars <- nhanes_adult %>%
select(Weight, Height, BMI)
# Calculate correlation matrix
cor_matrix <- cor(metric_vars, use = "complete.obs")
# Display as table
cor_matrix %>%
kable(digits = 3, caption = "Antropometirc Measurement Correlation")| Weight | Height | BMI | |
|---|---|---|---|
| Weight | 1.000 | 0.451 | 0.880 |
| Height | 0.451 | 1.000 | -0.012 |
| BMI | 0.880 | -0.012 | 1.000 |
# b. Visualize with corrplot
# Create correlation plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
title = "Antropometirc Measurement Correlations",
mar = c(0,0,2,0))# c. Strongest correlation:
data.frame(
Relationship = c(
"Weight & BMI",
"Height & Weight",
"Height & BMI"
),
Correlation = c(
round (cor_matrix["Weight", "BMI"], 3),
round (cor_matrix["Height", "Weight"], 3),
round (cor_matrix["Height", "BMI"], 3)
),
Strength = c("Strong", "Moderate", "Weak")
) %>%
kable(caption = "Notable Correlations Summary")| Relationship | Correlation | Strength |
|---|---|---|
| Weight & BMI | 0.880 | Strong |
| Height & Weight | 0.451 | Moderate |
| Height & BMI | -0.012 | Weak |
# d. Explanation (write as comment)
# Explanation: Weight and BMI show the strongest correlation (0.880), which makes sense as they measure the same thing. Height and Weight have a moderate correlation, which makes sense because typically when you are a certain height you are around a certain "normal" weight. Height and BMI have a weak correlation, suggesitng that height is influenced by different factors not related to BMI. Research Question: Is there a relationship between hours of sleep and age?
Your tasks:
tidy()
(2 points)# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x =SleepHrsNight , y =Age )) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Age vs Hours of Sleep",
subtitle = "NHANES Data, Adults 18-80 years",
x = "SleepHrsNight (Hours)",
y = "Age (years)"
) +
theme_minimal()# b. Correlation with tidy()
# Calculate Pearson correlation
cor_age_SleepHrsNight <- cor.test(nhanes_adult$Age, nhanes_adult$SleepHrsNight)
# Display results in clean table
tidy(cor_age_SleepHrsNight) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Age and SleepHrsNight"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.023 | 1.904 | 0.057 | -0.001 | 0.046 |
# c. Interpretation (write as comment)
# Interpretation:The pearson correlation between age and hours of sleep per night is r=0.023, this indicates that there is a weak positive relationship. As age increases, sleep hours increase slightly. The p-value is 0.057 which is slightly above the significance level of 0.05. The correlation is not statistically significant. In addition, the 95% confidence interval ranges from -0.001 to 0.046, which includes zero and further supports the conclusion that the true correlation has no relationship. Challenge: Investigate the relationship between two variables of your choice from the NHANES data set. Include:
Is there a correlation between weight and pulse rate?
# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x =Pulse , y = Weight)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Age vs BMI",
subtitle = "NHANES Data, Adults 18-80 years",
x = "SleepHrsNight (Hours)",
y = "Age (years)"
) +
theme_minimal()# b. Correlation with tidy()
# Calculate Pearson correlation
cor_Weight_Pulse <- cor.test(nhanes_adult$Pulse, nhanes_adult$Weight)
# Display results in clean table
tidy(cor_Weight_Pulse) %>%
select(estimate, statistic, p.value, conf.low, conf.high) %>%
kable(
digits = 3,
col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
caption = "Pearson Correlation: Age and Pulse rate"
)| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.063 | 5.295 | 0 | 0.039 | 0.086 |
#c. Assumptions Check
par(mfrow = c(1, 2))
qqnorm(nhanes_adult$Weight, main = "Q-Q Plot: Weight")
qqline(nhanes_adult$Weight, col = "red")
qqnorm(nhanes_adult$Pulse, main = "Q-Q Plot: Pulse")
qqline(nhanes_adult$Pulse, col = "red")# d. Interpretation (write as comment)
# Interpretation: The pearson correlation between weight and pulse rate is r=0.063, this indicates that there is positive moderate relationship between the two variables. As weight increases, pulse rate starts to increase. The p-value is 0 which is statistically significant which means we can reject the null hypothesis. In addition, the 95% confidence interval ranges from 0.039 to 0.086, which does not include zero and further supports the conclusion that the positive relationship is statistically significant. The Q-Q plot follows normal distribution along the red line reasonably well for both weight and pulse. Some deviation in the tails but with a large sample size, the correlation test is robust to minor violations. This lab activity was created for EPI 553: Principles of
Statistical Inference II
University at Albany, College of Integrated Health
Sciences
Spring 2026