Lab Overview

Goal: Practice correlation analysis from start to finish using real public health data

Learning Objectives:

  • Understand when and why to use correlation analysis
  • Calculate and interpret Pearson correlation coefficients
  • Test hypotheses about correlation
  • Check correlation assumptions
  • Distinguish between correlation and causation
  • Use Spearman correlation for non-normal data

Setup: Load Packages and Data

# Load NHANES data
data(NHANES)

# Select adult participants with complete data
nhanes_adult <- NHANES %>%
  filter(Age >= 18, Age <= 80) %>%
  select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve, 
         Pulse, PhysActive, SleepHrsNight) %>%
  na.omit()

# Display sample
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(nhanes_adult), "adults")
) %>%
  kable()
Metric Value
Sample Size 7133 adults
head(nhanes_adult, 8) %>%
  kable(digits = 1, caption = "NHANES Adult Data Sample")
NHANES Adult Data Sample
Age Weight Height BMI BPSysAve BPDiaAve Pulse PhysActive SleepHrsNight
34 87.4 164.7 32.2 113 85 70 No 4
34 87.4 164.7 32.2 113 85 70 No 4
34 87.4 164.7 32.2 113 85 70 No 4
49 86.7 168.4 30.6 112 75 86 No 8
45 75.7 166.7 27.2 118 64 62 Yes 8
45 75.7 166.7 27.2 118 64 62 Yes 8
45 75.7 166.7 27.2 118 64 62 Yes 8
66 68.0 169.5 23.7 111 63 60 Yes 7

Dataset Description:

  • Age: Age in years
  • Weight: Weight in kg
  • BMI: Body Mass Index (kg/m²)
  • BPSysAve: Average systolic blood pressure (mmHg)
  • BPDiaAve: Average diastolic blood pressure (mmHg)
  • Pulse: 60 second pulse rate
  • SleepHrsNight: Hours of sleep per night

Problem 1: Weight and Height (10 points)

Research Question: Is there a correlation between weight and height among US adults?

Your tasks:

  1. Create a scatter plot with a fitted line (2 points)
  2. Calculate Pearson correlation using cor.test() and display with tidy() (3 points)
  3. Test for statistical significance and state your conclusion (2 points)
  4. Calculate r² and interpret in 2-3 sentences (3 points)
# YOUR CODE HERE

# a. Scatterplot

# Create scatterplot
ggplot(nhanes_adult, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Weight vs. Height among US Adults",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Height (cm)",
    y = "Weight (kg)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display

# Pearson correlation
cor_Weight_Height <- cor.test(nhanes_adult$Weight, nhanes_adult$Height)

# Display results
tidy(cor_Weight_Height) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Weight and Height"
  )
Pearson Correlation: Weight and Height
r t-statistic p-value 95% CI Lower 95% CI Upper
0.451 42.618 0 0.432 0.469
# c. Statistical significance

#The pearson correlation between weight and height is r=0.451 which indicates a postive linear relationship. As height increases, weight tends to also increase. The p-value of 0 makes the correlation between weight and height statistically signifcant. We reject the null hypothesis. 


# d. r² and interpretation (write as comment)
# 
  r_squared <- cor_Weight_Height$estimate^2

data.frame(
  Measure = c("Correlation (r)", "Coefficient of Determination (r²)", 
              "Variance Explained"),
  Value = c(
    round(cor_Weight_Height$estimate, 3),
    round(r_squared, 3),
    paste0(round(r_squared * 100, 1), "%")
  )
) %>%
  kable(caption = "Summary of Correlation Strength")
Summary of Correlation Strength
Measure Value
Correlation (r) 0.451
Coefficient of Determination (r²) 0.203
Variance Explained 20.3%
#Interpretation: There is a statistically significant moderate positive correlation between height and weight. As height increases, weight tends to increase. However, height explains only about 20.3% of the variation in weight, suggesting other factors also play important roles like diet and physical activity. 

Problem 2: Correlation Matrix Analysis (10 points)

Research Question: What are the relationships among BMI, weight, and height?

Your tasks:

  1. Create a correlation matrix for: Weight, Height, BMI (3 points)
  2. Visualize the matrix using corrplot (3 points)
  3. Identify which pair has the strongest correlation (2 points)
  4. Explain why that correlation makes sense biologically/mathematically (2 points)
# YOUR CODE HERE

# a. Correlation matrix

# Select anthropometric measurements
metric_vars <- nhanes_adult %>%
  select(Weight, Height, BMI)

# Calculate correlation matrix
cor_matrix <- cor(metric_vars, use = "complete.obs")

# Display as table
cor_matrix %>%
  kable(digits = 3, caption = "Antropometirc Measurement Correlation")
Antropometirc Measurement Correlation
Weight Height BMI
Weight 1.000 0.451 0.880
Height 0.451 1.000 -0.012
BMI 0.880 -0.012 1.000
# b. Visualize with corrplot

# Create correlation plot
corrplot(cor_matrix, 
         method = "circle",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
         title = "Antropometirc Measurement Correlations",
         mar = c(0,0,2,0))

# c. Strongest correlation:
data.frame(
    Relationship = c(
      "Weight & BMI",
      "Height & Weight",
      "Height & BMI"
    ),
    Correlation = c(
      round (cor_matrix["Weight", "BMI"], 3),
      round (cor_matrix["Height", "Weight"], 3),
      round (cor_matrix["Height", "BMI"], 3)
),
Strength = c("Strong", "Moderate", "Weak")
) %>%
  kable(caption = "Notable Correlations Summary")
Notable Correlations Summary
Relationship Correlation Strength
Weight & BMI 0.880 Strong
Height & Weight 0.451 Moderate
Height & BMI -0.012 Weak
# d. Explanation (write as comment)

# Explanation: Weight and BMI show the strongest correlation (0.880), which makes sense as they measure the same thing. Height and Weight have a moderate correlation, which makes sense because typically when you are a certain height you are around a certain "normal" weight. Height and BMI have a weak correlation, suggesitng that height is influenced by different factors not related to BMI. 

Problem 3: Sleep and Age (5 points)

Research Question: Is there a relationship between hours of sleep and age?

Your tasks:

  1. Create a scatterplot (1 point)
  2. Calculate Pearson correlation and display with tidy() (2 points)
  3. Interpret whether the relationship is statistically significant (2 points)
# YOUR CODE HERE

# a. Scatterplot

ggplot(nhanes_adult, aes(x =SleepHrsNight , y =Age )) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Age vs Hours of Sleep",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "SleepHrsNight (Hours)",
    y = "Age (years)"
  ) +
  theme_minimal()

# b. Correlation with tidy()

# Calculate Pearson correlation
cor_age_SleepHrsNight <- cor.test(nhanes_adult$Age, nhanes_adult$SleepHrsNight)

# Display results in clean table
tidy(cor_age_SleepHrsNight) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Age and SleepHrsNight"
  )
Pearson Correlation: Age and SleepHrsNight
r t-statistic p-value 95% CI Lower 95% CI Upper
0.023 1.904 0.057 -0.001 0.046
# c. Interpretation (write as comment)
# Interpretation:The pearson correlation between age and hours of sleep per night is r=0.023, this indicates that there is a weak positive relationship. As age increases, sleep hours increase slightly. The p-value is 0.057 which is slightly above the significance level of 0.05. The correlation is not statistically significant. In addition, the 95% confidence interval ranges from -0.001 to 0.046, which includes zero and further supports the conclusion that the true correlation has no relationship. 

Bonus (Optional, 5 extra points)

Challenge: Investigate the relationship between two variables of your choice from the NHANES data set. Include:

Is there a correlation between weight and pulse rate?

# YOUR CODE HERE

# a. Scatterplot

ggplot(nhanes_adult, aes(x =Pulse , y = Weight)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Age vs BMI",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "SleepHrsNight (Hours)",
    y = "Age (years)"
  ) +
  theme_minimal()

# b. Correlation with tidy()

# Calculate Pearson correlation
cor_Weight_Pulse <- cor.test(nhanes_adult$Pulse, nhanes_adult$Weight)

# Display results in clean table
tidy(cor_Weight_Pulse) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Age and Pulse rate"
  )
Pearson Correlation: Age and Pulse rate
r t-statistic p-value 95% CI Lower 95% CI Upper
0.063 5.295 0 0.039 0.086
#c. Assumptions Check

par(mfrow = c(1, 2))

qqnorm(nhanes_adult$Weight, main = "Q-Q Plot: Weight")
qqline(nhanes_adult$Weight, col = "red")

qqnorm(nhanes_adult$Pulse, main = "Q-Q Plot: Pulse")
qqline(nhanes_adult$Pulse, col = "red")

# d. Interpretation (write as comment)
# Interpretation: The pearson correlation between weight and pulse rate is r=0.063, this indicates that there is positive moderate relationship between the two variables. As weight increases, pulse rate starts to increase. The p-value is 0 which is statistically significant which means we can reject the null hypothesis. In addition, the 95% confidence interval ranges from 0.039 to 0.086, which does not include zero and further supports the conclusion that the positive relationship is statistically significant. The Q-Q plot follows normal distribution along the red line reasonably well for both weight and pulse. Some deviation in the tails but with a large sample size, the correlation test is robust to minor violations. 

This lab activity was created for EPI 553: Principles of Statistical Inference II
University at Albany, College of Integrated Health Sciences
Spring 2026