Lab Overview

Time: ~30 minutes

Goal: Practice correlation analysis from start to finish using real public health data

Learning Objectives:

Understand when and why to use correlation analysis
Calculate and interpret Pearson correlation coefficients
Test hypotheses about correlation
Check correlation assumptions
Distinguish between correlation and causation
Use Spearman correlation for non-normal data

PART B: YOUR TURN - Practice Problems

Now it’s your turn to practice! Use the NHANES dataset and follow the examples above.

Total Points: 25 points

Background: Why Correlation Matters

What is Correlation?

Correlation measures the strength and direction of the LINEAR relationship between two continuous variables.

Range: -1 ≤ r ≤ 1
r = 1: Perfect positive relationship (as X ↑, Y ↑)
r = -1: Perfect negative relationship (as X ↑, Y ↓)
r = 0: No linear relationship
|r| > 0.7: Strong correlation
0.3 < |r| < 0.7: Moderate correlation
|r| < 0.3: Weak correlation

When to Use Correlation

✅ Use correlation when:

Both variables are continuous (or at least ordinal)
You want to measure strength/direction of linear relationship
You’re exploring data before regression
You want to describe associations (not causation)

❌ Don’t use when:

One variable is categorical → use t-test or ANOVA
Relationship is clearly non-linear → consider transformation
You want to establish causation → use experimental design
You want to predict values → use regression

Important Warning

⚠️ CORRELATION ≠ CAUSATION

Just because two variables are correlated does NOT mean one causes the other!

Classic Example: Ice cream sales and drowning deaths are highly correlated. Does ice cream cause drowning? NO! Both increase in summer (confounding by temperature/season).

Setup: Load Packages and Data

# Load NHANES data
data(NHANES)

# Select adult participants with complete data
nhanes_adult <- NHANES %>%
  filter(Age >= 18, Age <= 80) %>%
  select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve, 
         Pulse, PhysActive, SleepHrsNight) %>%
  na.omit()

# Display sample
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(nhanes_adult), "adults")
) %>%
  kable()

Metric	Value
Sample Size	7133 adults

head(nhanes_adult, 8) %>%
  kable(digits = 1, caption = "NHANES Adult Data Sample")

NHANES Adult Data Sample
Age	Weight	Height	BMI	BPSysAve	BPDiaAve	Pulse	PhysActive	SleepHrsNight
34	87.4	164.7	32.2	113	85	70	No	4
34	87.4	164.7	32.2	113	85	70	No	4
34	87.4	164.7	32.2	113	85	70	No	4
49	86.7	168.4	30.6	112	75	86	No	8
45	75.7	166.7	27.2	118	64	62	Yes	8
45	75.7	166.7	27.2	118	64	62	Yes	8
45	75.7	166.7	27.2	118	64	62	Yes	8
66	68.0	169.5	23.7	111	63	60	Yes	7

Dataset Description:

Age: Age in years
Weight: Weight in kg
BMI: Body Mass Index (kg/m²)
BPSysAve: Average systolic blood pressure (mmHg)
BPDiaAve: Average diastolic blood pressure (mmHg)
Pulse: 60 second pulse rate
SleepHrsNight: Hours of sleep per night

Problem 1: Weight and Height (10 points)

Research Question: Is there a correlation between weight and height among US adults?

Your tasks:

Create a scatterplot with a fitted line (2 points)
Calculate Pearson correlation using cor.test() and display with tidy() (3 points)
Test for statistical significance and state your conclusion (2 points)
Calculate r² and interpret in 2-3 sentences (3 points)

# YOUR CODE HERE

# a. Scatterplot
ggplot(nhanes_adult, aes(x = Weight, y = Height)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Weight vs Height Among US Adults",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Weight (kg)",
    y = "Height (cm)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display
cor_wt_ht <- cor.test(nhanes_adult$Weight, nhanes_adult$Height)

tidy(cor_wt_ht) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Weight and Height"
  )

Pearson Correlation: Weight and Height
r	t-statistic	p-value	95% CI Lower	95% CI Upper
0.451	42.618	0	0.432	0.469

# c. Statistical significance
# r = 0.451: Moderate positive correlation
#p < 0.001: Statistically significant (reject H₀)
#95% CI [0.432, 0.469]: Doesn’t contain zero (confirms significance)

# d. r² and interpretation (write as comment)
# There is a statistically significant moderate positive correlation between Weight and Height. As Weight increases, Height tends to increase. However, Weight explains only about 20.3% of the variation in Height, suggesting other factors also play important roles.Public Health Implication: Weight-appropriate Height screening is important, but individual risk assessment should consider multiple factors beyond Weight alone.

r_squared <- cor_wt_ht$estimate^2

data.frame(
  Measure = c("Correlation (r)", "Coefficient of Determination (r²)", 
              "Variance Explained"),
  Value = c(
    round(cor_wt_ht$estimate, 3),
    round(r_squared, 3),
    paste0(round(r_squared * 100, 1), "%")
  )
) %>%
  kable(caption = "Summary of Correlation Strength")

Summary of Correlation Strength
Measure	Value
Correlation (r)	0.451
Coefficient of Determination (r²)	0.203
Variance Explained	20.3%

Problem 2: Correlation Matrix Analysis (10 points)

Research Question: What are the relationships among BMI, weight, and height?

Your tasks:

Create a correlation matrix for: Weight, Height, BMI (3 points)
Visualize the matrix using corrplot (3 points)
Identify which pair has the strongest correlation (2 points)
Explain why that correlation makes sense biologically/mathematically (2 points)

# YOUR CODE HERE

# a. Correlation matrix
# Select variables
relationships_vars <- nhanes_adult %>%
  select(BMI, Weight, Height)
# Calculate correlation matrix
cor_matrix <- cor(relationships_vars, use = "complete.obs")
# Display as table
cor_matrix %>%
  kable(digits = 3, caption = "Relationships among BMI, weight, and height Correlation Matrix")

Relationships among BMI, weight, and height Correlation Matrix
	BMI	Weight	Height
BMI	1.000	0.880	-0.012
Weight	0.880	1.000	0.451
Height	-0.012	0.451	1.000

# b. Visualize with corrplot
corrplot(cor_matrix, 
         method = "circle",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
         title = "BMI, Weight, and Height Correlations",
         mar = c(0,0,2,0))

# c. Strongest correlation: BMI & Weight
data.frame(
  Relationship = c(
    "BMI & Weight",
    "BMI & Height",
    "Weight & Height"
  ),
  Correlation = c(
    round(cor_matrix["BMI", "Weight"], 3),
    round(cor_matrix["BMI", "Height"], 3),
    round(cor_matrix["Weight", "Height"], 3)
  ),
  Strength = c("Strong", "Moderate", "Weak")
) %>%
  kable(caption = "Notable Correlations Summary")

Notable Correlations Summary
Relationship	Correlation	Strength
BMI & Weight	0.880	Strong
BMI & Height	-0.012	Moderate
Weight & Height	0.451	Weak

# d. Explanation (write as comment)
# BMI & Weight show the strongest correlation (r = 0.880), which makes sense as they measure the same physiological process. Height shows relatively weak correlations, suggesting it’s influenced by different factors.

Problem 3: Sleep and Age (5 points)

Research Question: Is there a relationship between hours of sleep and age?

Your tasks:

Create a scatterplot (1 point)
Calculate Pearson correlation and display with tidy() (2 points)
Interpret whether the relationship is statistically significant (2 points)

# YOUR CODE HERE

# a. Scatterplot
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = Age)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Hours of Sleep vs Age",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Hours of sleep per night",
    y = "Age (years)"
  ) +
  theme_minimal()

# b. Correlation with tidy()
# Calculate Pearson correlation
cor_sleep_age <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$Age)

# Display results in clean table
tidy(cor_sleep_age) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Hours of sleep and Age"
  )

Pearson Correlation: Hours of sleep and Age
r	t-statistic	p-value	95% CI Lower	95% CI Upper
0.023	1.904	0.057	-0.001	0.046

# c. Interpretation (write as comment)
# Hypothesis Test:
#H₀: ρ = 0 (no correlation between Hours of sleep and Age in population)
#H₁: ρ ≠ 0 (correlation exists)
#α = 0.05

#Results:
#r = 0.023: Weak positive correlation
#p > 0.05: Not Statistically significant (fail to reject H₀)
#95% CI [-0.001, 0.046]: Contains zero (confirms insignificance)

Bonus (Optional, 5 extra points)

Challenge: Investigate the relationship between two variables of your choice from the NHANES dataset. Include:

Scatterplot
Correlation test with clean display
Assumption checks
Thoughtful interpretation

# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = Weight, y = BPSysAve)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Weight vs Average systolic blood pressure",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Weight in kg",
    y = "Average systolic blood pressure (mmHg)"
  ) +
  theme_minimal()

# b. Correlation test
cor_weight_bp <- cor.test(nhanes_adult$Weight, nhanes_adult$BPSysAve)

# Display results in clean table
tidy(cor_weight_bp) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Weight and Systolic BP"
  )

Pearson Correlation: Weight and Systolic BP
r	t-statistic	p-value	95% CI Lower	95% CI Upper
0.118	10.04	0	0.095	0.141

#c. Assumption Checks
#Assumption 1: Linearity (already checked with scatterplot ✓)
#Assumption 2: Bivariate Normality
# Q-Q plots for normality
par(mfrow = c(1, 2))

qqnorm(nhanes_adult$Weight, main = "Q-Q Plot: Weight")
qqline(nhanes_adult$Weight, col = "red")

qqnorm(nhanes_adult$BPSysAve, main = "Q-Q Plot: Systolic BP")
qqline(nhanes_adult$BPSysAve, col = "red")

#Assessment: Both variables are approximately normally distributed (points follow the red line reasonably well). Some deviation in the tails, but with large sample size (n = 7133), the correlation test is robust to minor violations.
#Assumption 3: No Extreme Outliers (scatterplot shows no extreme outliers ✓)

#d. Thoughtful interpretation
r_squared_weight <- cor_weight_bp$estimate^2

data.frame(
  Measure = c("r²", "Variance Explained"),
  Value = c(
    round(r_squared_weight, 4),
    paste0(round(r_squared_weight * 100, 2), "%")
  )
) %>%
  kable(caption = "Effect Size")

Effect Size
	Measure	Value
cor	r²	0.0139
	Variance Explained	1.39%

# There is a positive trend, heavier weight(in kg) adults tend to have higher blood pressure. While Weight and Systolic BP are related, Weight alone explains less than 10% of BP variation. Other factors (genetics, diet, physical activity, stress, age) play substantial roles. If we look at the scatterplot, there is a positive relationship visible, moderate scatter around the line. When testing the hypothesis, H₀: ρ = 0 (no correlation between age and BP in population), H₁: ρ ≠ 0 (correlation exists), and α = 0.05, results indicate that r = 0.118: weak positive correlation, p < 0.001: statistically significant (reject H₀), and 95% CI [0.095, 0.141]: doesn’t contain zero (confirms significance), affirming conclusions.

Additional Resources

R Functions Used Today

cor.test() - Calculate correlation and test significance
tidy() - Clean display of statistical test results
cor() - Calculate correlation matrix
corrplot() - Visualize correlation matrix
ggplot() + geom_point() - Scatterplots
geom_smooth(method="lm") - Add fitted regression line
qqnorm() / qqline() - Check normality

This lab activity was created for EPI 553: Principles of Statistical Inference II
University at Albany, College of Integrated Health Sciences
Spring 2026

Lab 03: In-Class Lab Activity: Correlation Analysis

EPI 553: Principles of Statistical Inference II

Natalia Small

February 12, 2026

Lab Overview

PART B: YOUR TURN - Practice Problems

Background: Why Correlation Matters

What is Correlation?

When to Use Correlation

Important Warning

Setup: Load Packages and Data

Problem 1: Weight and Height (10 points)

Problem 2: Correlation Matrix Analysis (10 points)

Problem 3: Sleep and Age (5 points)

Bonus (Optional, 5 extra points)

Additional Resources

R Functions Used Today