PART A: GUIDED EXAMPLES

Setup: Load Packages and Data

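# Load packages
library(tidyverse)  # dplyr verbs, the %>% pipe, and ggplot2
library(NHANES)     # NHANES dataset
library(knitr)      # kable()
library(broom)      # tidy()
library(corrplot)   # corrplot()

# Load NHANES data
data(NHANES)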

# Select adult participants with complete data
nhanes_adult <- NHANES %>%
  filter(Age >= 18, Age <= 80) %>%
  select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve, 
         Pulse, PhysActive, SleepHrsNight) %>%
  na.omit()

# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(nhanes_adult), "adults")
) %>%
  kable()
Metric       Value
Sample Size  7133 adults
head(nhanes_adult, 8) %>%
  kable(digits = 1, caption = "NHANES Adult Data Sample")
NHANES Adult Data Sample
Age  Weight  Height  BMI   BPSysAve  BPDiaAve  Pulse  PhysActive  SleepHrsNight
34   87.4    164.7   32.2  113       85        70     No          4
34   87.4    164.7   32.2  113       85        70     No          4
34   87.4    164.7   32.2  113       85        70     No          4
49   86.7    168.4   30.6  112       75        86     No          8
45   75.7    166.7   27.2  118       64        62     Yes         8
45   75.7    166.7   27.2  118       64        62     Yes         8
45   75.7    166.7   27.2  118       64        62     Yes         8
66   68.0    169.5   23.7  111       63        60     Yes         7

Dataset Description:

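  • Age: Age in years
  • Weight: Weight in kg
  • Height: Standing height in cm
  • BMI: Body Mass Index (kg/m²)
  • BPSysAve: Average systolic blood pressure (mmHg)
  • BPDiaAve: Average diastolic blood pressure (mmHg)
  • Pulse: 60-second pulse rate (beats per minute)
  • PhysActive: Whether the participant reports moderate or vigorous physical activity (Yes/No)
  • SleepHrsNight: Self-reported hours of sleep per night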

PART B: YOUR TURN - Practice Problems

Now it’s your turn to practice! Use the same NHANES dataset and follow the examples above.

Total Points: 25 points


Problem 1: Weight and Height (10 points)

Research Question: Is there a correlation between weight and height among US adults?

Your tasks:

  1. Create a scatterplot with a fitted line (2 points)
  2. Calculate Pearson correlation using cor.test() and display with tidy() (3 points)
  3. Test for statistical significance and state your conclusion (2 points)
  4. Calculate r² and interpret in 2-3 sentences (3 points)
# YOUR CODE HERE

# a. Scatterplot with fitted regression line
ggplot(nhanes_adult, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Height vs Weight",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Height (cm)",
    y = "Weight (kg)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display

# Calculate Pearson correlation
cor_height_weight <- cor.test(nhanes_adult$Height, nhanes_adult$Weight)

# Display results in clean table
tidy(cor_height_weight) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Height and Weight"
  )
Pearson Correlation: Height and Weight
r      t-statistic  p-value  95% CI Lower  95% CI Upper
0.451  42.618       0        0.432         0.469
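
The p-value prints as 0 only because kable(digits = 3) rounds it; the underlying value is positive but extremely small. A minimal sketch of one way to show it as a threshold instead, using base R's format.pval(). The p-values below are made-up illustrations, not results from this lab.

```r
# Hypothetical p-values for illustration (not taken from the NHANES results)
p_values <- c(3.2e-12, 0.057, 0.0096)

# Values below eps are shown as "less than eps" rather than a rounded 0
p_display <- format.pval(p_values, digits = 2, eps = 0.001)

print(data.frame(p_raw = p_values, p_shown = p_display))
```

In the lab's pipeline, one option would be to convert the p.value column this way (for example inside mutate()) before passing the table to kable().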

c. Statistical significance

The Pearson correlation analysis shows a moderate positive association between height and weight among U.S. adults (r = 0.451). The relationship is statistically significant (t = 42.618, p < 0.001): if the true correlation were zero, a sample correlation this large would be extremely unlikely. The 95% confidence interval (0.432, 0.469) excludes zero, which is consistent with rejecting the null hypothesis of no linear association.
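
As a rough check on where the test statistic comes from, the sketch below recomputes it from the reported r and the sample size, using the standard formula t = r·sqrt(n − 2) / sqrt(1 − r²) with n − 2 degrees of freedom. Because it starts from the rounded r = 0.451, the result should land close to, but not exactly on, the tabled 42.618.

```r
# Values reported in the output above
r <- 0.451   # Pearson correlation (rounded to 3 decimals)
n <- 7133    # sample size

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

cat("t =", round(t_stat, 2), "\n")
cat("p < 0.001:", p_value < 0.001, "\n")
```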

d. r² and interpretation (write as comment)

# Calculate r-squared
r_squared <- cor_height_weight$estimate^2

data.frame(
  Measure = c("Correlation (r)", "Coefficient of Determination (r²)", 
              "Variance Explained"),
  Value = c(
    round(cor_height_weight$estimate, 3),
    round(r_squared, 3),
    paste0(round(r_squared * 100, 1), "%")
  )
) %>%
  kable(caption = "Summary of Correlation Strength")
Summary of Correlation Strength
Measure                            Value
Correlation (r)                    0.451
Coefficient of Determination (r²)  0.203
Variance Explained                 20.3%

r² interpretation:

The correlation coefficient (r = 0.451) indicates a moderate positive relationship between height and weight, meaning that taller individuals tend to weigh more.

The coefficient of determination (r² = 0.203) indicates that approximately 20.3% of the variability in weight is accounted for by its linear relationship with height in this sample. In practical terms, height and weight are meaningfully related, but the remaining 79.7% of the variation reflects other factors not captured by this simple linear relationship.
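
The "variance explained" reading of r² can be illustrated with a small simulation: for a single predictor, the squared Pearson correlation equals the R² of the corresponding simple linear regression. This is a sketch on simulated data; the variable names, coefficients, and seed are assumptions chosen for illustration and are not NHANES values.

```r
set.seed(553)  # arbitrary seed so the simulated example is reproducible

# Simulated (hypothetical) data, not NHANES
height <- rnorm(200, mean = 168, sd = 10)
weight <- 0.9 * height - 70 + rnorm(200, sd = 18)

r <- cor(height, weight)
fit <- lm(weight ~ height)

# With one predictor, the squared correlation equals the regression R-squared
r_squared_from_cor <- r^2
r_squared_from_lm <- summary(fit)$r.squared

print(c(r = r, r2_cor = r_squared_from_cor, r2_lm = r_squared_from_lm))
```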

Problem 2: Correlation Matrix Analysis (10 points)

Research Question: What are the relationships among BMI, weight, and height?

Your tasks:

  1. Create a correlation matrix for: Weight, Height, BMI (3 points)
  2. Visualize the matrix using corrplot (3 points)
  3. Identify which pair has the strongest correlation (2 points)
  4. Explain why that correlation makes sense biologically/mathematically (2 points)

a. Correlation matrix

# Select variables
adult_vars <- nhanes_adult %>%
  select(BMI, Height, Weight)

# Calculate correlation matrix
cor_matrix <- cor(adult_vars, use = "complete.obs")

# Display as table
cor_matrix %>%
  kable(digits = 3, caption = "BMI, Weight and Height Correlation Matrix")
BMI, Weight and Height Correlation Matrix
        BMI     Height  Weight
BMI     1.000   -0.012  0.880
Height  -0.012  1.000   0.451
Weight  0.880   0.451   1.000
# b. Visualize with corrplot
# Create correlation plot
corrplot(cor_matrix, 
         method = "circle",
         type = "lower",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.8,
         col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
         title = "BMI, Weight and Height Correlations",
         mar = c(0,0,2,0))

# c. Strongest correlation:
# Create summary table of the three pairwise correlations
# (each pair is listed once; "BMI & Height" and "Height & BMI" are the same pair)
data.frame(
  Relationship = c(
    "BMI & Weight",
    "BMI & Height",
    "Height & Weight"
  ),
  Correlation = c(
    round(cor_matrix["BMI", "Weight"], 3),
    round(cor_matrix["BMI", "Height"], 3),
    round(cor_matrix["Height", "Weight"], 3)
  ),
  Strength = c("Very Strong", "Very Weak", "Moderate")
) %>%
  kable(caption = "Notable Correlations Summary")
Notable Correlations Summary
Relationship     Correlation  Strength
BMI & Weight     0.880        Very Strong
BMI & Height     -0.012       Very Weak
Height & Weight  0.451        Moderate

c. The strongest correlation:

The strongest correlation is between Weight and BMI (r = 0.880).
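
The strongest pair can also be picked out programmatically rather than by eye. A minimal sketch, assuming the matrix is re-entered by hand from the rounded values printed above (the object name cor_values is hypothetical, used here so the example runs on its own):

```r
vars <- c("BMI", "Height", "Weight")
cor_values <- matrix(
  c( 1.000, -0.012, 0.880,
    -0.012,  1.000, 0.451,
     0.880,  0.451, 1.000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep only entries above the diagonal so each pair is counted once
abs_upper <- abs(cor_values)
abs_upper[!upper.tri(abs_upper)] <- NA

# Locate the largest absolute correlation
idx <- which(abs_upper == max(abs_upper, na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(cor_values)[idx[1, "row"]],
                    colnames(cor_values)[idx[1, "col"]])
strongest_r <- cor_values[idx[1, "row"], idx[1, "col"]]

cat(strongest_pair[1], "&", strongest_pair[2], ": r =", strongest_r, "\n")
```

Taking absolute values first matters in general, since a strong negative correlation would otherwise be overlooked.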

d. Explanation (write as comment)

A correlation of 0.880 reflects a very strong positive relationship: as weight increases, BMI increases substantially. This is expected because BMI is computed directly from weight, as BMI = weight (kg) / height (m)², so weight appears in the numerator of the quantity it is being correlated with.

In contrast, the correlation between height and BMI (r = −0.012) is extremely weak, suggesting virtually no linear relationship. This also makes sense mathematically: dividing by height squared is intended to remove most of the dependence of BMI on height. The correlation between weight and height (r = 0.451) is moderate in strength, but notably weaker than the strong association observed between BMI and weight.
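
The mathematical link can be checked against the data sample shown in the setup section. The sketch below recomputes BMI as weight (kg) divided by height (m) squared for the four distinct participants in that table and compares the result with the reported BMI column; small differences are expected because the printed inputs are rounded to one decimal.

```r
# Four distinct participants from the data sample table
weight_kg    <- c(87.4, 86.7, 75.7, 68.0)
height_cm    <- c(164.7, 168.4, 166.7, 169.5)
bmi_reported <- c(32.2, 30.6, 27.2, 23.7)

# BMI = weight (kg) / height (m)^2
bmi_computed <- weight_kg / (height_cm / 100)^2

print(data.frame(weight_kg, height_cm, bmi_reported,
                 bmi_computed = round(bmi_computed, 1)))
```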


Problem 3: Sleep and Age (5 points)

Research Question: Is there a relationship between hours of sleep and age?

Your tasks:

  1. Create a scatterplot (1 point)
  2. Calculate Pearson correlation and display with tidy() (2 points)
  3. Interpret whether the relationship is statistically significant (2 points)
# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = Age)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "purple") +
  labs(
    title = "Sleep Duration vs Age among US Adults",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Sleep duration (hours per night)",
    y = "Age (years)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display
cor_sleep_age <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$Age)

# Display results in clean table
tidy(cor_sleep_age) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Sleep and Age"
  )
Pearson Correlation: Sleep and Age
r      t-statistic  p-value  95% CI Lower  95% CI Upper
0.023  1.904        0.057    -0.001        0.046

c. Interpretation (write as comment)

The Pearson correlation between sleep duration and age is positive but very weak (r = 0.023). The p-value (p = 0.057) exceeds the conventional significance level of 0.05, so the association is not statistically significant. The 95% confidence interval (−0.001, 0.046) includes zero, which is consistent with that conclusion: this sample provides no clear evidence of a linear relationship between sleep duration and age, and the interval rules out anything but a very weak one.
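
To see why the interval just barely includes zero, the sketch below rebuilds it with the Fisher z-transformation, which is the asymptotic method cor.test() uses for its Pearson confidence interval. It starts from the reported t-statistic and sample size and recovers r from t, since the rounded r = 0.023 alone is not precise enough to reproduce the lower limit.

```r
# Values reported in the output above
t_stat <- 1.904
n <- 7133

# Recover r from the t-statistic (more precise than the rounded r = 0.023)
r <- t_stat / sqrt(t_stat^2 + (n - 2))

# Fisher z: atanh(r) is approximately normal with standard error 1/sqrt(n - 3)
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)
z_crit <- qnorm(0.975)

# Back-transform the interval limits to the correlation scale
ci <- tanh(c(lower = z - z_crit * se_z, upper = z + z_crit * se_z))

print(round(c(r = r, ci), 3))
```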


Bonus (Optional, 5 extra points)

Challenge: Investigate the relationship between two variables of your choice from the NHANES dataset. Include:

  • Scatterplot
  • Correlation test with clean display
  • Assumption checks
  • Thoughtful interpretation

I chose to analyze the relationship between hours of sleep and systolic blood pressure.

# YOUR CODE HERE
# a. Scatterplot
ggplot(nhanes_adult, aes(x = SleepHrsNight, y = BPSysAve)) +
  geom_point(alpha = 0.3, color = "gold") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Sleep Duration vs Systolic BP among US Adults",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Sleep duration (hours per night)",
    y = "Systolic BP (mmHg)"
  ) +
  theme_minimal()

# b. Correlation test with tidy() display
cor_sleep_sysBP <- cor.test(nhanes_adult$SleepHrsNight, nhanes_adult$BPSysAve)

# Display results in clean table
tidy(cor_sleep_sysBP) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Sleep and Systolic BP"
  )
Pearson Correlation: Sleep and Systolic BP
r       t-statistic  p-value  95% CI Lower  95% CI Upper
-0.031  -2.59        0.01     -0.054        -0.007

Assumption check: The scatterplot shows no clear evidence of curvature, so a linear description of the association appears reasonable. The variability in systolic blood pressure looks fairly consistent across levels of sleep duration, suggesting that the assumption of homoscedasticity is reasonably met. A few outliers are visible, but with a sample this large they are unlikely to have much influence on the Pearson correlation estimate.
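
Normality of each variable can be inspected with qqnorm() and qqline(), which are listed among the functions used in this lab. The sketch below runs on simulated right-skewed values so that it is self-contained; the vector sim_bp, its distribution, and the seed are assumptions for illustration, not NHANES data.

```r
set.seed(553)  # arbitrary seed for a reproducible illustration

# Simulated stand-in for a right-skewed measurement (hypothetical, not NHANES)
sim_bp <- 100 + rgamma(500, shape = 4, scale = 5)

# Points that bend away from the reference line suggest non-normality
qq <- qqnorm(sim_bp, main = "Normal Q-Q Plot: simulated values")
qqline(sim_bp, col = "red")
```

With the lab data, the same two calls could be applied to nhanes_adult$BPSysAve and nhanes_adult$SleepHrsNight in place of the simulated vector.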

Interpretation: There is an extremely weak inverse relationship between sleep duration and systolic blood pressure (r = −0.031). While the association reaches statistical significance (p = 0.01; 95% CI: −0.054 to −0.007), the magnitude of the effect is extremely small. The coefficient of determination (r² ≈ 0.001) indicates that sleep duration accounts for only about 0.1% of the variability in systolic blood pressure. Consequently, despite statistical significance, the finding has little practical or clinical relevance.
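
The gap between statistical and practical significance here is largely a sample-size effect. The sketch below computes the two-sided p-value for the same r at the lab's sample size and at a smaller, hypothetical n = 500; the helper p_from_r() is written for this illustration and is not part of any package. Because it uses the rounded r, the first p-value will be close to, but not identical to, the tabled one.

```r
# Two-sided p-value for a Pearson correlation r at sample size n
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

r <- -0.031                      # correlation reported above (rounded)
p_lab   <- p_from_r(r, n = 7133) # sample size used in this lab
p_small <- p_from_r(r, n = 500)  # hypothetical smaller study

print(round(c(n_7133 = p_lab, n_500 = p_small), 3))
cat("r-squared:", round(r^2, 4), "\n")
```

The effect size is identical in both cases; only the amount of data changes whether it crosses the 0.05 threshold.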

Grading Rubric

Problem 1: Weight and Height (10 points)

    1. Scatterplot properly formatted with labels: 2 points
    2. Correct correlation with clean display: 3 points
    3. Significance test correctly interpreted: 2 points
    4. r² calculated and interpreted: 3 points

Problem 2: Correlation Matrix (10 points)

    1. Correct matrix calculated: 3 points
    2. Well-formatted correlation plot: 3 points
    3. Strongest correlation identified: 2 points
    4. Biological/mathematical explanation: 2 points

Problem 3: Sleep and Age (5 points)

    1. Scatterplot: 1 point
    2. Correlation calculated and displayed: 2 points
    3. Interpretation of significance: 2 points

Submission Instructions

  1. Save your work with your name: Correlation_Lab_YourName.Rmd

  2. Knit to HTML to create your report

  3. Publish to RPubs:

    • Click the Publish button (blue icon) in the HTML preview window
    • Choose RPubs from the options
    • Follow the prompts to publish (create account if needed)
    • Copy your RPubs URL

  4. Submit to Brightspace:

    • Upload your .Rmd file
    • Paste your RPubs link in the assignment comments or submission text box

  5. Due: End of class today

Grading: This lab is worth 15% of your in-class lab grade. The lowest 2 lab grades are dropped.


Additional Resources

R Functions Used Today

  • cor.test() - Calculate correlation and test significance
  • tidy() - Clean display of statistical test results
  • cor() - Calculate correlation matrix
  • corrplot() - Visualize correlation matrix
  • ggplot() + geom_point() - Scatterplots
  • geom_smooth(method="lm") - Add fitted regression line
  • qqnorm() / qqline() - Check normality

For More Help

  • Textbook: Chapter 6 - Correlation Analysis
  • Office Hours: See syllabus
  • TA Help: See syllabus
  • R Documentation: Type ?cor.test in console

Remember:

✓ Correlation measures LINEAR relationships only
✓ Always visualize your data first
✓ Correlation ≠ Causation
✓ Check your assumptions
✓ Consider confounding and alternative explanations


This lab activity was created for EPI 553: Principles of Statistical Inference II
University at Albany, College of Integrated Health Sciences
Spring 2026