Lab Overview

Time: ~30 minutes

Goal: Practice correlation analysis from start to finish using real public health data

Learning Objectives:

  • Understand when and why to use correlation analysis
  • Calculate and interpret Pearson correlation coefficients
  • Test hypotheses about correlation
  • Check correlation assumptions
  • Distinguish between correlation and causation
  • Use Spearman correlation for non-normal data

Structure:

  • Part A: Guided Examples (follow along with instructor)
  • Part B: Your Turn (independent practice)

Submission: Publish to RPubs and submit your .Rmd file + RPubs link to Brightspace by end of class


Background: Why Correlation Matters

What is Correlation?

Correlation measures the strength and direction of the LINEAR relationship between two continuous variables.

  • Range: -1 ≤ r ≤ 1
  • r = 1: Perfect positive relationship (as X ↑, Y ↑)
  • r = -1: Perfect negative relationship (as X ↑, Y ↓)
  • r = 0: No linear relationship
  • |r| ≥ 0.7: Strong correlation
  • 0.3 ≤ |r| < 0.7: Moderate correlation
  • |r| < 0.3: Weak correlation
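As a rough illustration of what r measures, the sketch below computes Pearson's r by hand and checks it against cor(). It uses R's built-in mtcars data rather than NHANES so that it runs without extra packages. strength_label() is a hypothetical helper, and sending values of exactly 0.3 or 0.7 to the higher category is an assumption.

```r
# Illustrative sketch only: built-in mtcars data, not the lab's NHANES data
x <- mtcars$wt   # car weight (1000 lbs)
y <- mtcars$mpg  # fuel economy (miles per gallon)

# Pearson r: co-variation of x and y, scaled by the variation in each
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Same quantity from the built-in function
r_builtin <- cor(x, y)

# Hypothetical helper applying the rule-of-thumb cutoffs
# (assumption: values exactly at 0.3 or 0.7 go to the higher category)
strength_label <- function(r) {
  a <- abs(r)
  if (a >= 0.7) "strong" else if (a >= 0.3) "moderate" else "weak"
}

print(round(c(manual = r_manual, builtin = r_builtin), 3))
print(strength_label(r_builtin))
```

The sign of r gives the direction and the absolute value gives the strength, which is why the helper works on abs(r).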

When to Use Correlation

Use correlation when:

  • Both variables are continuous (or at least ordinal)
  • You want to measure strength/direction of linear relationship
  • You’re exploring data before regression
  • You want to describe associations (not causation)

Don’t use when:

  • One variable is categorical → use t-test or ANOVA
  • Relationship is clearly non-linear → consider transformation
  • You want to establish causation → use experimental design
  • You want to predict values → use regression

Important Warning

⚠️ CORRELATION ≠ CAUSATION

Just because two variables are correlated does NOT mean one causes the other!

Classic Example: Ice cream sales and drowning deaths are highly correlated. Does ice cream cause drowning? NO! Both increase in summer (confounding by temperature/season).
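A small simulation can make the confounding idea concrete. The sketch below uses made-up data in which temperature drives both ice cream sales and drownings, and neither affects the other. All effect sizes and variable names are assumptions chosen for illustration.

```r
# Simulated data only (assumed effect sizes, chosen for illustration)
set.seed(553)
n <- 5000
temperature <- rnorm(n, mean = 20, sd = 8)           # daily temperature (°C)
ice_cream   <- 2.0 * temperature + rnorm(n, sd = 5)  # sales driven by temperature
drownings   <- 0.5 * temperature + rnorm(n, sd = 2)  # drownings driven by temperature

# Marginal correlation: strong, even though neither variable affects the other
r_marginal <- cor(ice_cream, drownings)

# Partial correlation: correlate what is left after removing the temperature trend
res_ice   <- resid(lm(ice_cream ~ temperature))
res_drown <- resid(lm(drownings ~ temperature))
r_partial <- cor(res_ice, res_drown)

print(round(c(marginal = r_marginal, adjusted_for_temperature = r_partial), 3))
```

Once temperature is accounted for, the association between the two outcomes should shrink toward zero, which is the signature of a confounded correlation.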


PART A: GUIDED EXAMPLES

Setup: Load Packages and Data

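# Load packages
library(tidyverse)  # dplyr verbs, ggplot2, and the %>% pipe
library(NHANES)     # NHANES survey data
library(broom)      # tidy() for test results
library(knitr)      # kable() for tables
library(corrplot)   # correlation matrix plots

# Load NHANES data
data(NHANES)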

# Select adult participants with complete data
nhanes_adult <- NHANES %>%
  filter(Age >= 18, Age <= 80) %>%
  select(Age, Weight, Height, BMI, BPSysAve, BPDiaAve, 
         Pulse, PhysActive, SleepHrsNight) %>%
  na.omit()

# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(nhanes_adult), "adults")
) %>%
  kable()
Metric        Value
Sample Size   7133 adults
head(nhanes_adult, 8) %>%
  kable(digits = 1, caption = "NHANES Adult Data Sample")
NHANES Adult Data Sample

Age  Weight  Height   BMI  BPSysAve  BPDiaAve  Pulse  PhysActive  SleepHrsNight
 34    87.4   164.7  32.2       113        85     70  No                      4
 34    87.4   164.7  32.2       113        85     70  No                      4
 34    87.4   164.7  32.2       113        85     70  No                      4
 49    86.7   168.4  30.6       112        75     86  No                      8
 45    75.7   166.7  27.2       118        64     62  Yes                     8
 45    75.7   166.7  27.2       118        64     62  Yes                     8
 45    75.7   166.7  27.2       118        64     62  Yes                     8
 66    68.0   169.5  23.7       111        63     60  Yes                     7

Dataset Description:

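  • Age: Age in years
  • Weight: Weight in kg
  • Height: Standing height in cm
  • BMI: Body Mass Index (kg/m²)
  • BPSysAve: Average systolic blood pressure (mmHg)
  • BPDiaAve: Average diastolic blood pressure (mmHg)
  • Pulse: 60-second pulse rate (beats per minute)
  • PhysActive: Whether the participant does moderate or vigorous physical activity (Yes/No)
  • SleepHrsNight: Hours of sleep per night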

Example 1: Age and Blood Pressure

Research Question

Is there a correlation between age and systolic blood pressure among US adults?

Public Health Context: Understanding age-related changes in blood pressure helps identify at-risk populations and inform screening guidelines.


Step 1: Visualize the Relationship

Always start with a scatterplot!

# Create scatterplot
ggplot(nhanes_adult, aes(x = Age, y = BPSysAve)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Age vs Systolic Blood Pressure",
    subtitle = "NHANES Data, Adults 18-80 years",
    x = "Age (years)",
    y = "Systolic Blood Pressure (mmHg)"
  ) +
  theme_minimal()

What we observe:

  • Positive trend: older adults tend to have higher blood pressure
  • Points scattered around the line (not perfect relationship)
  • Relationship appears roughly linear
  • Some variability at all age levels
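To see why the plot comes before the number, the sketch below uses R's built-in anscombe data (Anscombe's quartet, not NHANES). It holds four x-y pairs that share nearly the same r but look completely different when plotted.

```r
# Built-in anscombe data: columns x1..x4 and y1..y4 form four separate data sets
r_values <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r_values) <- paste("Set", 1:4)
print(round(r_values, 3))

# Plot the four sets side by side to compare their shapes
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)
```

Only one of the four panels shows the roughly linear cloud that Pearson's r is meant to summarize. The others include a curve and sets dominated by a single unusual point.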

Step 2: Calculate Correlation

# Calculate Pearson correlation
cor_age_bp <- cor.test(nhanes_adult$Age, nhanes_adult$BPSysAve)

# Display results in clean table
tidy(cor_age_bp) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Age and Systolic BP"
  )
Pearson Correlation: Age and Systolic BP

    r   t-statistic   p-value   95% CI Lower   95% CI Upper
0.415         38.54         0          0.396          0.434

Step 3: Interpret Results

Hypothesis Test:

  • H₀: ρ = 0 (no correlation between age and BP in population)
  • H₁: ρ ≠ 0 (correlation exists)
  • α = 0.05

Results:

  • r = 0.415: Moderate positive correlation
  • p < 0.001: Statistically significant (reject H₀)
  • 95% CI [0.396, 0.434]: Doesn’t contain zero (confirms significance)
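The reported test statistic and confidence interval can be approximately reproduced from r and n alone. The sketch below plugs in the rounded values from the output table (r = 0.415, n = 7133), so small differences from the cor.test() output are expected. It uses the standard t formula and Fisher's z-transformation.

```r
# Values taken from the Example 1 output (r is rounded to 3 decimals)
r <- 0.415
n <- 7133

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% CI via Fisher's z: atanh(r) is roughly normal with SE 1 / sqrt(n - 3)
z_hat <- atanh(r)
se_z  <- 1 / sqrt(n - 3)
ci    <- tanh(z_hat + c(-1, 1) * qnorm(0.975) * se_z)

print(round(c(t = t_stat, ci_lower = ci[1], ci_upper = ci[2]), 3))
print(p_value < 0.001)
```

Because the standard error shrinks with the square root of n, a sample of several thousand gives a narrow interval around r.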
# Calculate r-squared
r_squared <- cor_age_bp$estimate^2

data.frame(
  Measure = c("Correlation (r)", "Coefficient of Determination (r²)", 
              "Variance Explained"),
  Value = c(
    round(cor_age_bp$estimate, 3),
    round(r_squared, 3),
    paste0(round(r_squared * 100, 1), "%")
  )
) %>%
  kable(caption = "Summary of Correlation Strength")
Summary of Correlation Strength

Measure                             Value
Correlation (r)                     0.415
Coefficient of Determination (r²)   0.172
Variance Explained                  17.2%

Interpretation:

There is a statistically significant moderate positive correlation between age and systolic blood pressure. As age increases, systolic BP tends to increase. However, age explains only about 17.2% of the variation in BP, suggesting other factors also play important roles.
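The "variance explained" reading of r² comes from simple linear regression: with one predictor, the squared correlation equals the regression R². The sketch below checks this on the built-in mtcars data (not NHANES) and then repeats the lab's arithmetic for r = 0.415.

```r
# Illustrative data: built-in mtcars, not the lab's NHANES data
x <- mtcars$wt
y <- mtcars$mpg

r_squared_from_r  <- cor(x, y)^2                    # squared Pearson correlation
r_squared_from_lm <- summary(lm(y ~ x))$r.squared   # R-squared from simple regression

print(round(c(from_r = r_squared_from_r, from_lm = r_squared_from_lm), 4))

# The lab's value: r = 0.415 for age and systolic BP
pct_explained   <- round(0.415^2 * 100, 1)
pct_unexplained <- 100 - pct_explained
print(c(explained = pct_explained, unexplained = pct_unexplained))
```

The unexplained share is the part of the variation in the outcome that a straight-line relationship with the predictor does not account for.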

Public Health Implication: Age-appropriate BP screening is important, but individual risk assessment should consider multiple factors beyond age alone.


Step 4: Check Assumptions

Assumption 1: Linearity (already checked with scatterplot ✓)

Assumption 2: Bivariate Normality

# Q-Q plots for normality
par(mfrow = c(1, 2))

qqnorm(nhanes_adult$Age, main = "Q-Q Plot: Age")
qqline(nhanes_adult$Age, col = "red")

qqnorm(nhanes_adult$BPSysAve, main = "Q-Q Plot: Systolic BP")
qqline(nhanes_adult$BPSysAve, col = "red")

par(mfrow = c(1, 1))

Assessment: Both variables are approximately normally distributed (points follow the red line reasonably well). Some deviation in the tails, but with large sample size (n = 7133), the correlation test is robust to minor violations.

Assumption 3: No Extreme Outliers (scatterplot shows no extreme outliers ✓)
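A numeric screen can complement the visual check for outliers. The sketch below standardizes a toy vector (not NHANES data) and flags values more than 3 standard deviations from the mean. The 3 SD cutoff is an assumed convention, and flagged values should be inspected rather than removed automatically.

```r
# Toy vector (not NHANES): 41 plausible systolic readings plus one planted extreme value
bp <- c(seq(100, 140, by = 1), 260)

# Standardize: how many SDs is each value from the mean?
z <- (bp - mean(bp)) / sd(bp)

# Flag values more than 3 SDs from the mean (assumed cutoff; conventions vary)
flagged <- which(abs(z) > 3)
print(data.frame(index = flagged, value = bp[flagged], z = round(z[flagged], 2)))
```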


Example 2: BMI and Diastolic Blood Pressure

Research Question

Is BMI correlated with diastolic blood pressure?

Why this matters: Understanding the relationship between obesity and blood pressure helps inform weight management interventions.


Step 1: Visualize

ggplot(nhanes_adult, aes(x = BMI, y = BPDiaAve)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "red", fill = "pink") +
  labs(
    title = "BMI vs Diastolic Blood Pressure",
    x = "Body Mass Index (kg/m²)",
    y = "Diastolic Blood Pressure (mmHg)"
  ) +
  theme_minimal()

Observation: Positive relationship visible, moderate scatter around the line.


Step 2: Calculate Correlation

# Pearson correlation
cor_bmi_bp <- cor.test(nhanes_adult$BMI, nhanes_adult$BPDiaAve)

# Display results
tidy(cor_bmi_bp) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: BMI and Diastolic BP"
  )
Pearson Correlation: BMI and Diastolic BP

    r   t-statistic   p-value   95% CI Lower   95% CI Upper
0.117         9.966         0          0.094           0.14
# Calculate r-squared
# unname() drops the "cor" label so it does not leak into the table as a row name
r_squared_bmi <- unname(cor_bmi_bp$estimate)^2

data.frame(
  Measure = c("r²", "Variance Explained"),
  Value = c(
    round(r_squared_bmi, 4),
    paste0(round(r_squared_bmi * 100, 2), "%")
  )
) %>%
  kable(caption = "Effect Size")
Effect Size

Measure              Value
r²                   0.0137
Variance Explained   1.37%

Interpretation:

  • Weak positive correlation (r = 0.117) between BMI and diastolic BP
  • Statistically significant (p < 0.001)
  • BMI explains only 1.4% of variation in diastolic BP

Key Insight: BMI and diastolic BP are statistically related, but BMI alone explains less than 2% of the variation in diastolic BP. Other factors (genetics, diet, physical activity, stress, age) play substantial roles.
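The gap between statistical and practical significance is largely a sample-size effect. The sketch below holds r fixed at the reported 0.117 and compares the p-value at the lab's n = 7133 with a hypothetical study of n = 50. p_from_r() is a hypothetical helper built on the standard t formula for a correlation.

```r
# Hypothetical helper: two-sided p-value for H0: rho = 0, given r and n
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

r <- 0.117                        # BMI vs diastolic BP, from the Example 2 output
p_large <- p_from_r(r, n = 7133)  # the lab's sample size
p_small <- p_from_r(r, n = 50)    # a hypothetical small study

print(data.frame(
  n = c(7133, 50),
  r = r,
  p_value = format.pval(c(p_large, p_small), digits = 3),
  variance_explained = paste0(round(r^2 * 100, 1), "%")
))
```

The effect size is identical in both rows. Only the amount of evidence against ρ = 0 changes, so the p-value alone says little about how much the relationship matters.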


Example 3: Correlation Matrix

Research Question

How are cardiovascular health indicators related to each other?

Step 1: Calculate Correlation Matrix

# Select cardiovascular variables
cardio_vars <- nhanes_adult %>%
  select(Age, BMI, BPSysAve, BPDiaAve, Pulse)

# Calculate correlation matrix
cor_matrix <- cor(cardio_vars, use = "complete.obs")

# Display as table
cor_matrix %>%
  kable(digits = 3, caption = "Cardiovascular Health Correlation Matrix")
Cardiovascular Health Correlation Matrix

              Age     BMI  BPSysAve  BPDiaAve   Pulse
Age         1.000   0.065     0.415    -0.019  -0.153
BMI         0.065   1.000     0.135     0.117   0.112
BPSysAve    0.415   0.135     1.000     0.340  -0.022
BPDiaAve   -0.019   0.117     0.340     1.000   0.106
Pulse      -0.153   0.112    -0.022     0.106   1.000
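A matrix is easier to read once its unique pairs are ranked by absolute correlation. The sketch below re-enters the values from the table above by hand (so it runs without the NHANES package) and sorts the ten off-diagonal pairs. The object names are illustrative.

```r
# Correlations typed in from the matrix above
vars <- c("Age", "BMI", "BPSysAve", "BPDiaAve", "Pulse")
cor_tab <- matrix(c(
   1.000,  0.065,  0.415, -0.019, -0.153,
   0.065,  1.000,  0.135,  0.117,  0.112,
   0.415,  0.135,  1.000,  0.340, -0.022,
  -0.019,  0.117,  0.340,  1.000,  0.106,
  -0.153,  0.112, -0.022,  0.106,  1.000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once: positions above the diagonal
idx <- which(upper.tri(cor_tab), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cor_tab)[idx[, "row"]],
  var2 = colnames(cor_tab)[idx[, "col"]],
  r    = cor_tab[idx]
)

# Sort by strength, ignoring sign
ranked <- ranked[order(-abs(ranked$r)), ]
rownames(ranked) <- NULL
print(ranked)
```

Sorting on abs(r) keeps strong negative correlations from being buried below weak positive ones.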

Step 2: Visualize Correlation Matrix

# Create correlation plot
corrplot(cor_matrix, 
         method = "circle",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
         title = "Cardiovascular Health Correlations",
         mar = c(0,0,2,0))

Key Findings:

# Create summary table of notable correlations
data.frame(
  Relationship = c(
    "Systolic BP & Diastolic BP",
    "Age & Systolic BP",
    "Age & Diastolic BP",
    "BMI & Systolic BP",
    "BMI & Pulse"
  ),
  Correlation = c(
    round(cor_matrix["BPSysAve", "BPDiaAve"], 3),
    round(cor_matrix["Age", "BPSysAve"], 3),
    round(cor_matrix["Age", "BPDiaAve"], 3),
    round(cor_matrix["BMI", "BPSysAve"], 3),
    round(cor_matrix["BMI", "Pulse"], 3)
  ),
  Strength = c("Moderate", "Moderate", "Weak", "Weak", "Weak")
) %>%
  kable(caption = "Notable Correlations Summary")
Notable Correlations Summary

Relationship                 Correlation   Strength
Systolic BP & Diastolic BP         0.340   Moderate
Age & Systolic BP                  0.415   Moderate
Age & Diastolic BP                -0.019   Weak
BMI & Systolic BP                  0.135   Weak
BMI & Pulse                        0.112   Weak

Interpretation: Age and systolic BP show the strongest correlation (r = 0.415), followed by systolic and diastolic BP (r = 0.34), which makes sense as the two pressures measure the same physiological process. Both fall in the moderate range. Pulse rate shows only weak correlations with the other variables (|r| < 0.2), suggesting it’s influenced by different factors.


Example 4: Spearman vs Pearson

When to Use Spearman Correlation

Use Spearman’s rank correlation when:

  • Data are ordinal (ranked)
  • Relationship is monotonic but not linear
  • Data contain outliers
  • Normality assumption is violated
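The monotonic-but-not-linear case is easy to demonstrate with toy data. In the sketch below, y increases with x at every step but along a curve, so the ranks agree perfectly while the raw values do not fall on a line. It also shows that Spearman's rho is Pearson's r computed on the ranks.

```r
# Toy data (not NHANES): y rises with x at every step, but along a curve
x <- 1:20
y <- exp(x / 3)

r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")

# Spearman's rho is Pearson's r applied to the ranks
r_on_ranks <- cor(rank(x), rank(y))

print(round(c(pearson = r_pearson,
              spearman = r_spearman,
              pearson_on_ranks = r_on_ranks), 3))
```

Because ranks ignore how far apart values are, Spearman's rho is also less affected by a few extreme observations than Pearson's r.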

Example: Age vs Pulse Rate

# Visualize relationship
ggplot(nhanes_adult, aes(x = Age, y = Pulse)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Age vs Pulse Rate",
    x = "Age (years)",
    y = "Pulse Rate (bpm)"
  ) +
  theme_minimal()

# Calculate both correlations
pearson_r <- cor.test(nhanes_adult$Age, nhanes_adult$Pulse, method = "pearson")
spearman_r <- cor.test(nhanes_adult$Age, nhanes_adult$Pulse, method = "spearman")

# Compare in table
# unname() keeps the "cor" and "rho" labels from becoming row names
data.frame(
  Method = c("Pearson", "Spearman"),
  Correlation = c(
    round(unname(pearson_r$estimate), 3),
    round(unname(spearman_r$estimate), 3)
  ),
  p_value = c(
    format.pval(pearson_r$p.value),
    format.pval(spearman_r$p.value)
  ),
  Difference = c(
    "-",
    round(abs(unname(pearson_r$estimate) - unname(spearman_r$estimate)), 3)
  )
) %>%
  kable(caption = "Pearson vs Spearman Comparison")
Pearson vs Spearman Comparison

Method     Correlation   p_value      Difference
Pearson         -0.153   < 2.22e-16   -
Spearman        -0.162   < 2.22e-16   0.008

Interpretation:

  • Results are very similar (difference < 0.01)
  • Both show a weak negative correlation
  • With large sample and only mild assumption violations, either method is appropriate
  • For this data, Pearson is fine (and has better statistical power)

Key Takeaways from Examples

  1. Always visualize first - scatterplots reveal patterns correlation coefficients miss
  2. Context matters - statistical significance ≠ practical importance
  3. Correlation ≠ Causation - always consider confounding
  4. r² tells you explained variance - even significant correlations may explain little
  5. Check assumptions - especially for small samples

PART B: YOUR TURN - Practice Problems

Now it’s your turn to practice! Use the same NHANES dataset and follow the examples above.

Total Points: 25 points


Problem 1: Weight and Height (10 points)

Research Question: Is there a correlation between weight and height among US adults?

Your tasks:

  1. Create a scatterplot with a fitted line (2 points)
  2. Calculate Pearson correlation using cor.test() and display with tidy() (3 points)
  3. Test for statistical significance and state your conclusion (2 points)
  4. Calculate r² and interpret in 2-3 sentences (3 points)
# YOUR CODE HERE

# a. Scatterplot


# b. Correlation test with tidy() display


# c. Statistical significance


# d. r² and interpretation (write as comment)
# 

Problem 2: Correlation Matrix Analysis (10 points)

Research Question: What are the relationships among BMI, weight, and height?

Your tasks:

  1. Create a correlation matrix for: Weight, Height, BMI (3 points)
  2. Visualize the matrix using corrplot (3 points)
  3. Identify which pair has the strongest correlation (2 points)
  4. Explain why that correlation makes sense biologically/mathematically (2 points)
# YOUR CODE HERE

# a. Correlation matrix


# b. Visualize with corrplot


# c. Strongest correlation:


# d. Explanation (write as comment)
# 

Problem 3: Sleep and Age (5 points)

Research Question: Is there a relationship between hours of sleep and age?

Your tasks:

  1. Create a scatterplot (1 point)
  2. Calculate Pearson correlation and display with tidy() (2 points)
  3. Interpret whether the relationship is statistically significant (2 points)
# YOUR CODE HERE

# a. Scatterplot


# b. Correlation with tidy()


# c. Interpretation (write as comment)
# 

Bonus (Optional, 5 extra points)

Challenge: Investigate the relationship between two variables of your choice from the NHANES dataset. Include:

  • Scatterplot
  • Correlation test with clean display
  • Assumption checks
  • Thoughtful interpretation
# YOUR CODE HERE

Grading Rubric

Problem 1: Weight and Height (10 points)

    1. Scatterplot properly formatted with labels: 2 points
    2. Correct correlation with clean display: 3 points
    3. Significance test correctly interpreted: 2 points
    4. r² calculated and interpreted: 3 points

Problem 2: Correlation Matrix (10 points)

    1. Correct matrix calculated: 3 points
    2. Well-formatted correlation plot: 3 points
    3. Strongest correlation identified: 2 points
    4. Biological/mathematical explanation: 2 points

Problem 3: Sleep and Age (5 points)

    1. Scatterplot: 1 point
    2. Correlation calculated and displayed: 2 points
    3. Interpretation of significance: 2 points

Submission Instructions

  1. Save your work with your name: Correlation_Lab_YourName.Rmd

  2. Knit to HTML to create your report

  3. Publish to RPubs:

    • Click the Publish button (blue icon) in the HTML preview window
    • Choose RPubs from the options
    • Follow the prompts to publish (create account if needed)
    • Copy your RPubs URL
  4. Submit to Brightspace:

    • Upload your .Rmd file
    • Paste your RPubs link in the assignment comments or submission text box
  5. Due: End of class today

Grading: This lab is worth 15% of your in-class lab grade. The lowest 2 lab grades are dropped.


Additional Resources

R Functions Used Today

  • cor.test() - Calculate correlation and test significance
  • tidy() - Clean display of statistical test results
  • cor() - Calculate correlation matrix
  • corrplot() - Visualize correlation matrix
  • ggplot() + geom_point() - Scatterplots
  • geom_smooth(method="lm") - Add fitted regression line
  • qqnorm() / qqline() - Check normality

For More Help

  • Textbook: Chapter 6 - Correlation Analysis
  • Office Hours: See syllabus
  • TA Help: See syllabus
  • R Documentation: Type ?cor.test in console

Remember:

✓ Correlation measures LINEAR relationships only
✓ Always visualize your data first
✓ Correlation ≠ Causation
✓ Check your assumptions
✓ Consider confounding and alternative explanations


This lab activity was created for EPI 553: Principles of Statistical Inference II
University at Albany, College of Integrated Health Sciences
Spring 2026