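# Packages this lab's code relies on (load them if not already attached)
library(NHANES)    # NHANES dataset
library(dplyr)     # filter(), select(), %>%
library(ggplot2)   # scatterplots
library(knitr)     # kable()
library(broom)     # tidy()
library(corrplot)  # corrplot()

# Load NHANES data
data(NHANES)

# Select adult participants with complete data
nhanes_adult <- NHANES %>%
  filter(Age >= 18, Age <= 80) %>%
  select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve,
         Pulse, PhysActive, SleepHrsNight) %>%
  na.omit()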
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(nhanes_adult), "adults")
) %>%
  kable()

| Metric | Value |
|---|---|
| Sample Size | 7133 adults |

| Age | Weight | Height | BMI | BPSysAve | BPDiaAve | Pulse | PhysActive | SleepHrsNight |
|---|---|---|---|---|---|---|---|---|
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 34 | 87.4 | 164.7 | 32.2 | 113 | 85 | 70 | No | 4 |
| 49 | 86.7 | 168.4 | 30.6 | 112 | 75 | 86 | No | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 45 | 75.7 | 166.7 | 27.2 | 118 | 64 | 62 | Yes | 8 |
| 66 | 68.0 | 169.5 | 23.7 | 111 | 63 | 60 | Yes | 7 |
Dataset Description:

- Age: Age in years
- Weight: Weight in kg
- BMI: Body Mass Index (kg/m²)
- BPSysAve: Average systolic blood pressure (mmHg)
- BPDiaAve: Average diastolic blood pressure (mmHg)
- Pulse: 60 second pulse rate
- SleepHrsNight: Hours of sleep per night

Now it’s your turn to practice! Use the same NHANES dataset and follow the examples above.
Total Points: 25 points
Research Question: Is there a correlation between weight and height among US adults?
Your tasks:
cor.test() and display with tidy() (3 points)

# YOUR CODE HERE
# a. Scatterplot
### Step 1: Visualize the Relationship
# Create scatterplot
ggplot(nhanes_adult, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Height vs Weight",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Height (cm)",
    y = "Weight (kg)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display
# Calculate Pearson correlation
cor_height_weight <- cor.test(nhanes_adult$Height, nhanes_adult$Weight)

# Display results in clean table
tidy(cor_height_weight) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Height and Weight"
  )

| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.451 | 42.618 | 0 | 0.432 | 0.469 |
The Pearson correlation analysis showed a moderate positive association between weight and height among U.S. adults (r = 0.451). The relationship was statistically significant (t = 42.618, p < 0.001): a correlation this large would be very unlikely to arise by chance if the true correlation were zero. The 95% confidence interval (0.432, 0.469) excludes zero, which is consistent with that result.
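As a rough check on the table above, the t-statistic and confidence interval can be recomputed from r and the sample size alone. The sketch below uses base R and assumes the rounded values reported above (r = 0.451, n = 7133), so the t-statistic comes out close to, but not exactly equal to, the 42.618 that cor.test() reports from the unrounded data. The interval uses Fisher's z-transformation, which is the approach the cor.test() documentation describes for Pearson correlations.

```r
# Values copied from the output above (r is rounded to 3 decimals)
r  <- 0.451
n  <- 7133
df <- n - 2

# t-statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval via Fisher's z-transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(t_stat, 1)   # about 42.7 (reported: 42.618, from unrounded r)
p_value < 0.001    # TRUE
round(ci, 3)       # 0.432 0.469
```

The interval matches the reported one to three decimals, which suggests the table and the interpretation are consistent with each other.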
# Calculate r-squared
r_squared <- cor_height_weight$estimate^2

data.frame(
  Measure = c("Correlation (r)", "Coefficient of Determination (r²)",
              "Variance Explained"),
  Value = c(
    round(cor_height_weight$estimate, 3),
    round(r_squared, 3),
    paste0(round(r_squared * 100, 1), "%")
  )
) %>%
  kable(caption = "Summary of Correlation Strength")

| Measure | Value |
|---|---|
| Correlation (r) | 0.451 |
| Coefficient of Determination (r²) | 0.203 |
| Variance Explained | 20.3% |
The correlation coefficient (r = 0.451) indicates a moderate positive relationship between height and weight, meaning that taller individuals tend to weigh more.
The coefficient of determination (r² = 0.203) indicates that approximately 20.3% of the variability in one variable is explained by its linear relationship with the other in this sample. In practical terms, height and weight are meaningfully related, but the remaining 79.7% of the variation is attributable to factors not captured by this simple linear relationship.
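To illustrate why r² is read as "variance explained", the sketch below shows that the squared Pearson correlation equals the R² of a simple linear regression. The data are simulated and purely hypothetical (the variable names, coefficients, and seed are illustrative assumptions, not NHANES values); the last two lines apply the same arithmetic to the rounded correlation reported above.

```r
set.seed(553)  # arbitrary seed, only for reproducibility

# Hypothetical data: heights in cm, weights loosely related to height
height <- rnorm(200, mean = 168, sd = 10)
weight <- 0.9 * height - 70 + rnorm(200, sd = 15)

r   <- cor(height, weight)
fit <- lm(weight ~ height)

r^2                                      # squared correlation
summary(fit)$r.squared                   # R-squared from the regression
all.equal(r^2, summary(fit)$r.squared)   # TRUE

# Same arithmetic with the reported (rounded) correlation
round(0.451^2, 3)       # 0.203 -> 20.3% explained
round(1 - 0.451^2, 3)   # 0.797 -> 79.7% unexplained
```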
Research Question: What are the relationships among BMI, weight, and height?
Your tasks:
# Select variables
adult_vars <- nhanes_adult %>%
  select(BMI, Height, Weight)

# Calculate correlation matrix
cor_matrix <- cor(adult_vars, use = "complete.obs")

# Display as table
cor_matrix %>%
  kable(digits = 3, caption = "BMI, Weight and Height Correlation Matrix")

| | BMI | Height | Weight |
|---|---|---|---|
| BMI | 1.000 | -0.012 | 0.880 |
| Height | -0.012 | 1.000 | 0.451 |
| Weight | 0.880 | 0.451 | 1.000 |
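A correlation matrix is symmetric, so each pair appears twice. One way to list every pair once and rank them by strength is sketched below in base R. The matrix is rebuilt by hand from the rounded values in the table above, and the names `cor_tab`, `idx`, and `pair_table` are hypothetical helpers introduced only for this illustration.

```r
# Correlation matrix rebuilt from the rounded values in the table above
vars <- c("BMI", "Height", "Weight")
cor_tab <- matrix(
  c( 1.000, -0.012, 0.880,
    -0.012,  1.000, 0.451,
     0.880,  0.451, 1.000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Upper triangle only: each pair once, diagonal excluded
idx <- which(upper.tri(cor_tab), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_tab)[idx[, "row"]],
  var2 = colnames(cor_tab)[idx[, "col"]],
  r    = cor_tab[idx]
)

# Sort by absolute correlation, strongest first
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Sorting on the absolute value matters because a strong negative correlation would otherwise rank last.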
# b. Visualize with corrplot
# Create correlation plot
corrplot(cor_matrix,
         method = "circle",
         type = "lower",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.8,
         col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
         title = "BMI, Weight and Height Correlations",
         mar = c(0, 0, 2, 0))

# c. Strongest correlation:
# Create summary table of notable correlations
data.frame(
  Relationship = c(
    "BMI & Weight",
    "BMI & Height",
    "Height & BMI",
    "Height & Weight"
  ),
  Correlation = c(
    round(cor_matrix["BMI", "Weight"], 3),
    round(cor_matrix["BMI", "Height"], 3),
    round(cor_matrix["Height", "BMI"], 3),
    round(cor_matrix["Height", "Weight"], 3)
  ),
  Strength = c("Very Strong", "Very Weak", "Very Weak", "Moderate")
) %>%
  kable(caption = "Notable Correlations Summary")

| Relationship | Correlation | Strength |
|---|---|---|
| BMI & Weight | 0.880 | Very Strong |
| BMI & Height | -0.012 | Very Weak |
| Height & BMI | -0.012 | Very Weak |
| Height & Weight | 0.451 | Moderate |
The strongest correlation is between weight and BMI (r = 0.880).
A correlation of 0.880 reflects a very strong positive relationship: adults who weigh more tend to have a higher BMI. This is expected because BMI is computed directly from weight (weight in kilograms divided by the square of height in metres).
In contrast, the correlation between height and BMI (r = −0.012) is extremely weak, suggesting virtually no linear relationship. The correlation between weight and height (r = 0.451) is moderate in strength, but notably weaker than the strong association observed between BMI and weight.
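The link between BMI and weight can be made concrete by recomputing BMI from the Weight and Height columns. The sketch below uses the four distinct rows of the sample table shown earlier, copied by hand; `sample_rows` and `BMI_check` are names introduced only for this illustration.

```r
# Distinct rows from the sample table shown earlier
sample_rows <- data.frame(
  Weight = c(87.4, 86.7, 75.7, 68.0),      # kg
  Height = c(164.7, 168.4, 166.7, 169.5),  # cm
  BMI    = c(32.2, 30.6, 27.2, 23.7)       # as displayed in the table
)

# BMI = weight (kg) / height (m)^2
sample_rows$BMI_check <- with(sample_rows, Weight / (Height / 100)^2)

round(sample_rows$BMI_check, 1)   # 32.2 30.6 27.2 23.7
```

Because weight sits in the numerator of the formula, a strong positive correlation between BMI and weight is built in rather than discovered.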
Research Question: Is there a relationship between hours of sleep and age?
Your tasks:
tidy() (2 points)

# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = Age)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "purple") +
  labs(
    title = "Sleeping time vs Age among US Adults",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Total Sleeping time in hrs",
    y = "Age in years"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display
cor_sleep_age <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$Age)

# Display results in clean table
tidy(cor_sleep_age) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Sleep and Age"
  )

| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| 0.023 | 1.904 | 0.057 | -0.001 | 0.046 |
The Pearson correlation between sleep duration and age is positive but very weak (r = 0.023). The p-value (p = 0.057) exceeds the conventional significance level of 0.05, so the association is not statistically significant. Consistent with this, the 95% confidence interval (−0.001, 0.046) includes zero: this sample does not provide evidence of a linear relationship between sleep duration and age.
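The reported p-value can be recovered from the t-statistic and the degrees of freedom (n − 2). The short base R sketch below assumes the rounded values shown in the output above.

```r
t_stat <- 1.904   # reported t-statistic (rounded)
n      <- 7133    # reported sample size

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(p_value, 3)   # 0.057
p_value > 0.05      # TRUE: not significant at the 0.05 level
```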
Challenge: Investigate the relationship between two variables of your choice from the NHANES dataset. Include:
I chose to analyze the relationship between hours of sleep per night and average systolic blood pressure.
# YOUR CODE HERE
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = BPSysAve)) +
  geom_point(alpha = 0.3, color = "gold") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Sleeping time vs Systolic BP among US Adults",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Total Sleeping time in hrs",
    y = "Systolic BP (mmHg)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display
cor_sleep_sysBP <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$BPSysAve)

# Display results in clean table
tidy(cor_sleep_sysBP) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Sleep and Systolic BP"
  )

| r | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| -0.031 | -2.59 | 0.01 | -0.054 | -0.007 |
Assumption Check: The scatterplot indicates an approximately linear association, with no clear evidence of curvature. The variability in systolic blood pressure appears fairly consistent across levels of sleep duration, suggesting that the assumption of homoscedasticity is reasonably met. Although a few outliers are visible, the large sample size enhances the stability and robustness of the Pearson correlation estimates.

Interpretation: There is an extremely weak inverse relationship between sleep duration and systolic blood pressure (r = −0.031). While the association reaches statistical significance (p = 0.01; 95% CI: −0.054 to −0.007), the magnitude of the effect is extremely small. The coefficient of determination (r² ≈ 0.001) indicates that sleep duration accounts for only about 0.1% of the variability in systolic blood pressure. Consequently, despite statistical significance, the finding lacks practical or clinical relevance.
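The gap between statistical and practical significance follows from the sample size. As a rough sketch in base R, assuming n = 7133 and a two-sided test at the 0.05 level, the smallest correlation that reaches significance can be found by inverting the t-statistic formula; `r_crit` is a name introduced only for this illustration.

```r
n  <- 7133
df <- n - 2

# Critical t value for a two-sided test at alpha = 0.05
t_crit <- qt(0.975, df = df)

# Invert t = r * sqrt(df) / sqrt(1 - r^2) to get the critical |r|
r_crit <- t_crit / sqrt(t_crit^2 + df)
round(r_crit, 3)        # 0.023

# Share of variance explained by the reported correlation
round((-0.031)^2, 4)    # 0.001, i.e. about 0.1%
```

With this many observations, any correlation larger in magnitude than roughly 0.023 is statistically significant, so effect size rather than the p-value should drive the interpretation.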
Save your work with your name:
Correlation_Lab_YourName.Rmd
Knit to HTML to create your report
Publish to RPubs:
Submit to Brightspace:
Due: End of class today
Grading: This lab is worth 15% of your in-class lab grade. The lowest 2 lab grades are dropped.
- cor.test() - Calculate correlation and test significance
- tidy() - Clean display of statistical test results
- cor() - Calculate correlation matrix
- corrplot() - Visualize correlation matrix
- ggplot() + geom_point() - Scatterplots
- geom_smooth(method="lm") - Add fitted regression line
- qqnorm() / qqline() - Check normality

?cor.test in console

Remember:
✓ Correlation measures LINEAR relationships only
✓ Always visualize your data first
✓ Correlation ≠ Causation
✓ Check your assumptions
✓ Consider confounding and alternative explanations
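The first reminder is easy to demonstrate. In the sketch below, built on small hypothetical vectors rather than NHANES data, y is completely determined by x, yet the Pearson correlation is essentially zero because the relationship is not linear.

```r
# Hypothetical data: y depends perfectly on x, but not linearly
x <- seq(-3, 3, by = 0.5)
y <- x^2

cor(x, y)           # essentially 0 despite the exact relationship
cor(x, abs(x))      # also essentially 0 (symmetric V shape)

# A linear relationship, for contrast
cor(x, 2 * x + 1)   # 1
```

A scatterplot would reveal the curve immediately, which is why visualizing the data comes before computing r.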
This lab activity was created for EPI 553: Principles of Statistical Inference II
University at Albany, College of Integrated Health Sciences
Spring 2026