By the end of this lecture, you will be able to:
What factors predict systolic blood pressure in US adults, and how do they interact?
We need to control for several risk factors at once: age, body mass index (BMI), physical activity, and smoking status.
This is a multivariable problem - we need regression, not just correlation!
We’ll use the National Health and Nutrition Examination Survey (NHANES) - real data from US health surveys.
# Load and prepare NHANES data
# Focus on adults (21+) with complete blood pressure and predictor data
bp_data <- NHANES %>%
filter(Age >= 21, # Adults only
!is.na(BPSysAve), # Must have BP measured
!is.na(BMI), # Must have BMI
!is.na(PhysActive), # Physical activity status
!is.na(SmokeNow)) %>% # Current smoking status
select(ID, BPSysAve, Age, BMI, PhysActive, SmokeNow, Weight, Height) %>%
mutate(
# Convert logical to numeric for regression
Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
) %>%
select(ID, BPSysAve, Age, BMI, Physically_Active, Current_Smoker, Weight, Height) %>%
slice_sample(n = 200) %>% # Random sample of 200 for clarity
arrange(ID)
# Display first 10 rows
bp_data %>%
head(10) %>%
kable(caption = "NHANES Blood Pressure Study Data (First 10 Observations)",
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| ID | BPSysAve | Age | BMI | Physically_Active | Current_Smoker | Weight | Height |
|---|---|---|---|---|---|---|---|
| 51677 | 128 | 33 | 28.5 | 0 | 0 | 93.8 | 181.3 |
| 51741 | 120 | 21 | 37.3 | 0 | 1 | 103.5 | 166.5 |
| 51959 | 123 | 31 | 29.0 | 0 | 0 | 86.2 | 172.5 |
| 52080 | 108 | 30 | 21.8 | 1 | 1 | 73.7 | 184.0 |
| 52122 | 140 | 67 | 32.4 | 1 | 0 | 104.0 | 179.3 |
| 52314 | 98 | 25 | 20.9 | 1 | 0 | 56.4 | 164.2 |
| 52366 | 114 | 49 | 34.4 | 1 | 0 | 115.2 | 182.9 |
| 52444 | 129 | 71 | 23.7 | 0 | 0 | 66.9 | 168.1 |
| 52501 | 136 | 38 | 24.8 | 1 | 1 | 71.8 | 170.2 |
| 52524 | 110 | 54 | 25.7 | 0 | 0 | 67.3 | 161.7 |
What we’re looking at: BPSysAve is systolic blood pressure - values ≥140 mmHg indicate hypertension.
# Create a cleaner summary table
bp_data %>%
select(-ID) %>%
summarise(
across(everything(),
list(Mean = ~round(mean(., na.rm = TRUE), 2),
SD = ~round(sd(., na.rm = TRUE), 2),
Min = ~round(min(., na.rm = TRUE), 1),
Max = ~round(max(., na.rm = TRUE), 1)),
.names = "{.col}__{.fn}")  # double underscore so variable names containing "_" aren't split
) %>%
pivot_longer(everything(),
names_to = c("Variable", "Statistic"),
names_sep = "__",
values_to = "Value") %>%
pivot_wider(names_from = Statistic, values_from = Value) %>%
kable(caption = "Summary Statistics of NHANES Blood Pressure Data") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Variable | Mean | SD | Min | Max |
|---|---|---|---|---|
| BPSysAve | 121.86 | 17.92 | 91 | 212 |
| Age | 47.03 | 14.76 | 21 | 80 |
| BMI | 28.64 | 6.61 | 17.7 | 59.1 |
| Physically_Active | 0.45 | 0.50 | 0.00 | 1.00 |
| Current_Smoker | 0.46 | 0.50 | 0.00 | 1.00 |
| Weight | 83.59 | 21.09 | 47 | 172.5 |
| Height | 170.66 | 9.39 | 147 | 200.4 |
Key observations: Average systolic BP is around 120 mmHg (normal range), with considerable variation across participants.
# Color points by hypertension status (BP >= 140)
bp_data <- bp_data %>%
mutate(BP_Status = ifelse(BPSysAve >= 140, "Hypertensive", "Normal"))
plot_ly(bp_data,
x = ~BMI,
y = ~BPSysAve,
color = ~BP_Status,
colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
text = ~paste("ID:", ID,
"<br>BP:", round(BPSysAve, 1), "mmHg",
"<br>BMI:", round(BMI, 1),
"<br>Age:", Age,
"<br>Active:", ifelse(Physically_Active==1, "Yes", "No")),
type = 'scatter',
mode = 'markers',
marker = list(size = 8, opacity = 0.6)) %>%
layout(title = "BMI vs Systolic BP in NHANES (Interactive - hover for details)",
xaxis = list(title = "Body Mass Index (kg/m²)"),
yaxis = list(title = "Systolic Blood Pressure (mmHg)"))
Question: Is higher BMI associated with higher blood pressure? Or are other factors confounding this?
Answer: We need multiple regression to control for age, physical activity, and smoking!
We want to fit this model:
\[\text{BP} = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{BMI}) + \beta_3(\text{Active}) + \beta_4(\text{Smoker}) + \epsilon\]
With 200 observations and 4 predictors, we have 200 equations with 5 unknowns (β₀, β₁, β₂, β₃, β₄):
\[128 = \beta_0 + 33\beta_1 + 28.5\beta_2 + 0\beta_3 + 0\beta_4 + \epsilon_1\] \[120 = \beta_0 + 21\beta_1 + 37.3\beta_2 + 0\beta_3 + 1\beta_4 + \epsilon_2\] \[\vdots\] \[y_{200} = \beta_0 + \text{Age}_{200}\,\beta_1 + \text{BMI}_{200}\,\beta_2 + \text{Active}_{200}\,\beta_3 + \text{Smoker}_{200}\,\beta_4 + \epsilon_{200}\]
Writing this by hand is impossible!
Solution: Matrix algebra lets us write this compactly as Y = Xβ
Instead of 200 separate equations, we write:
\[\underbrace{\mathbf{Y}}_{n \times 1} = \underbrace{\mathbf{X}}_{n \times (p+1)} \underbrace{\boldsymbol{\beta}}_{(p+1) \times 1} + \underbrace{\boldsymbol{\epsilon}}_{n \times 1}\]
Where: Y is the n × 1 vector of observed outcomes (blood pressures), X is the n × (p+1) design matrix (a column of 1's for the intercept plus the p predictors), β is the (p+1) × 1 vector of coefficients to estimate, and ε is the n × 1 vector of errors.
# Create the Y vector (outcomes)
Y <- bp_data$BPSysAve
# Create the X matrix (predictors + intercept)
X <- bp_data %>%
select(Age, BMI, Physically_Active, Current_Smoker) %>%
mutate(intercept = 1) %>%
select(intercept, everything()) %>%
as.matrix()
# Show dimensions
tibble(
Matrix = c("Y (outcomes)", "X (predictors)", "β (coefficients)"),
Dimensions = c(
paste(length(Y), "× 1"),
paste(nrow(X), "×", ncol(X)),
paste(ncol(X), "× 1")
),
Interpretation = c(
"200 blood pressure measurements",
"200 observations × 5 columns (1 intercept + 4 predictors)",
"5 coefficients to estimate"
)
) %>%
kable(caption = "Matrix Dimensions for Regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Matrix | Dimensions | Interpretation |
|---|---|---|
| Y (outcomes) | 200 × 1 | 200 blood pressure measurements |
| X (predictors) | 200 × 5 | 200 observations × 5 columns (1 intercept + 4 predictors) |
| β (coefficients) | 5 × 1 | 5 coefficients to estimate |
# Show first 8 rows of X
X[1:8, ] %>%
as.data.frame() %>%
kable(caption = "Design Matrix X (First 8 Rows)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| intercept | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|
| 1 | 33 | 28.5 | 0 | 0 |
| 1 | 21 | 37.3 | 0 | 1 |
| 1 | 31 | 29.0 | 0 | 0 |
| 1 | 30 | 21.8 | 1 | 1 |
| 1 | 67 | 32.4 | 1 | 0 |
| 1 | 25 | 20.9 | 1 | 0 |
| 1 | 49 | 34.4 | 1 | 0 |
| 1 | 71 | 23.7 | 0 | 0 |
Key Point:
The first column of X is all 1’s (for the intercept β₀), and our binary variables (Physically_Active, Current_Smoker) are coded as 0/1.
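A quick way to confirm this structure in code (a minimal sanity check, using the X matrix built above):
# The intercept column is all 1's, and the dummy variables contain only 0's and 1's
all(X[, "intercept"] == 1)
all(X[, "Physically_Active"] %in% c(0, 1))
all(X[, "Current_Smoker"] %in% c(0, 1))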
When you need it: Calculating X’X and X’Y for regression
What it does: Flips rows and columns
# Small example with health data (first 3 NHANES participants)
small_example <- X[1:3, ]
rownames(small_example) <- paste("Person", 1:3)
small_example %>%
kable(caption = "Original Matrix (People × Variables)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | intercept | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|---|
| Person 1 | 1 | 33 | 28.5 | 0 | 0 |
| Person 2 | 1 | 21 | 37.3 | 0 | 1 |
| Person 3 | 1 | 31 | 29.0 | 0 | 0 |
t(small_example) %>%
kable(caption = "Transposed Matrix (Variables × People)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | Person 1 | Person 2 | Person 3 |
|---|---|---|---|
| intercept | 1.0 | 1.0 | 1 |
| Age | 33.0 | 21.0 | 31 |
| BMI | 28.5 | 37.3 | 29 |
| Physically_Active | 0.0 | 0.0 | 0 |
| Current_Smoker | 0.0 | 1.0 | 0 |
Why this matters: X'X collects the sums of squares and cross-products we need for regression - transposing makes the dimensions line up ((p+1) × n times n × (p+1)), so the product X'X is a compact (p+1) × (p+1) matrix.
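A quick dimension check (a small sketch using the X built earlier) makes the point concrete:
# X is 200 x 5, so t(X) is 5 x 200, and t(X) %*% X collapses to a 5 x 5 matrix
dim(X)
dim(t(X))
dim(t(X) %*% X)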
When you need it: This is the key step in calculating regression coefficients
# Calculate X'X for our blood pressure data
XtX <- t(X) %*% X
XtX %>%
kable(caption = "X'X Matrix (Sum of Squares and Cross Products)",
digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | intercept | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|---|
| intercept | 200 | 9407 | 5728 | 91 | 92 |
| Age | 9407 | 485821 | 270252 | 4203 | 3957 |
| BMI | 5728 | 270252 | 172770 | 2570 | 2583 |
| Physically_Active | 91 | 4203 | 2570 | 91 | 34 |
| Current_Smoker | 92 | 3957 | 2583 | 34 | 92 |
What do these numbers mean?
XtX[2,2] = Sum of (Age)²
XtX[3,3] = Sum of (BMI)²
XtX[2,3] = Sum of (Age × BMI)
# Show what's in X'X
tibble(
Element = c("XtX[1,1]", "XtX[2,2]", "XtX[2,3]"),
Calculation = c(
"Sum of 1² (n observations)",
"Sum of (Age)²",
"Sum of (Age × BMI)"
),
Value = c(
XtX[1,1],
XtX[2,2],
XtX[2,3]
)
) %>%
kable(caption = "Interpreting X'X Elements", digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Element | Calculation | Value |
|---|---|---|
| XtX[1,1] | Sum of 1² (n observations) | 200 |
| XtX[2,2] | Sum of (Age)² | 485821 |
| XtX[2,3] | Sum of (Age × BMI) | 270252 |
# Verify one calculation
tibble(
Method = c("From X'X", "Manual calculation"),
Result = c(
XtX[2,2],
sum(bp_data$Age^2)
)
) %>%
kable(caption = "Verification: XtX[2,2] = Sum of (Age)²") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Method | Result |
|---|---|
| From X’X | 485821 |
| Manual calculation | 485821 |
Interpretation:
The diagonal entries are raw sums of squares for each predictor (reflecting its scale and spread), and the off-diagonal entries are cross-products showing how predictors move together; after centering, these become variances and covariances up to a factor of n − 1.
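To connect this to familiar quantities, here is a small sketch (not in the original lecture code) showing that a centered cross-product is just (n − 1) times a sample covariance:
# Centered cross-product of Age and BMI equals (n - 1) * their sample covariance
age_c <- bp_data$Age - mean(bp_data$Age)
bmi_c <- bp_data$BMI - mean(bp_data$BMI)
sum(age_c * bmi_c)
(nrow(bp_data) - 1) * cov(bp_data$Age, bp_data$BMI)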
When you need it: This is THE critical step for calculating β
What it does: The matrix equivalent of division
Important: Only works if X’X is not singular (determinant ≠ 0)
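A toy illustration of what "singular" means (not lecture data - just a 2 × 2 matrix with two identical columns):
# Identical columns make the determinant 0, and solve() refuses to invert the matrix
singular_example <- matrix(c(1, 2, 1, 2), nrow = 2)
det(singular_example)         # 0
try(solve(singular_example))  # error: system is exactly singular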
# Calculate the inverse
XtX_inv <- solve(XtX)
XtX_inv %>%
kable(caption = "(X'X)^(-1): The Inverse Matrix",
digits = 6) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | intercept | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|---|
| intercept | 0.181171 | -0.001265 | -0.003393 | -0.017561 | -0.025021 |
| Age | -0.001265 | 0.000025 | -0.000001 | 0.000069 | 0.000195 |
| BMI | -0.003393 | -0.000001 | 0.000116 | 0.000103 | 0.000131 |
| Physically_Active | -0.017561 | 0.000069 | 0.000103 | 0.020970 | 0.003940 |
| Current_Smoker | -0.025021 | 0.000195 | 0.000131 | 0.003940 | 0.022345 |
# Verify: X'X × (X'X)^(-1) should equal the identity matrix
verification <- XtX %*% XtX_inv
verification %>%
round(10) %>% # Round to avoid floating point display issues
kable(caption = "Verification: X'X × (X'X)^(-1) = I (Identity Matrix)") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | intercept | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|---|
| intercept | 1 | 0 | 0 | 0 | 0 |
| Age | 0 | 1 | 0 | 0 | 0 |
| BMI | 0 | 0 | 1 | 0 | 0 |
| Physically_Active | 0 | 0 | 0 | 1 | 0 |
| Current_Smoker | 0 | 0 | 0 | 0 | 1 |
The identity matrix has 1’s on diagonal, 0’s elsewhere - it’s like multiplying by 1!
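In R, diag(n) builds an n × n identity matrix; a one-line check of the "multiplying by 1" idea:
# Multiplying any matrix by the identity leaves it unchanged
M <- matrix(1:9, nrow = 3)
all(M %*% diag(3) == M)   # TRUE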
Starting from Y = Xβ + ε and choosing β to minimize the sum of squared errors, we arrive at:
\[\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
This is the formula behind the coefficients that lm() reports!
# Step 1: Calculate X'Y
XtY <- t(X) %*% Y
data.frame(
Component = c("X'Y"),
Dimensions = paste(nrow(XtY), "× 1"),
Interpretation = "Cross products of predictors with outcome"
) %>%
kable(caption = "Step 1: Calculate X'Y") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Component | Dimensions | Interpretation |
|---|---|---|
| X’Y | 5 × 1 | Cross products of predictors with outcome |
XtY %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
rename(Value = V1) %>%
kable(caption = "X'Y Values", digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Variable | Value |
|---|---|
| intercept | 24372 |
| Age | 1160222 |
| BMI | 700551 |
| Physically_Active | 11026 |
| Current_Smoker | 11231 |
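The solving step itself is not shown above, but it is the one-line calculation below (a minimal sketch using the XtX_inv and XtY objects created earlier; beta_hat is the object referenced later for the comparison and the predictions):
# Step 2: Solve for the coefficients: beta_hat = (X'X)^(-1) X'Y
beta_hat <- XtX_inv %*% XtY
round(beta_hat, 2)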
| Coefficient | Estimate | Interpretation |
|---|---|---|
| β₀ (Intercept) | 96.41 | Predicted BP (mmHg) when every predictor is 0 - a mathematical baseline, not a real patient profile |
| β₁ (Age) | 0.34 | mmHg change per 1-year increase in age |
| β₂ (BMI) | 0.27 | mmHg change per 1-unit increase in BMI |
| β₃ (Physically Active) | -0.03 | mmHg difference for physically active individuals |
| β₄ (Current Smoker) | 3.24 | mmHg difference for current smokers |
Does our manual matrix calculation match R’s lm()?
# Fit the same model using lm()
model <- lm(BPSysAve ~ Age + BMI + Physically_Active + Current_Smoker,
data = bp_data)
# Compare results
comparison <- data.frame(
Coefficient = c("Intercept", "Age", "BMI", "Physically_Active", "Current_Smoker"),
Matrix_Calculation = as.vector(beta_hat),
lm_Function = coef(model)
)
comparison %>%
kable(caption = "Comparison: Manual vs lm()",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Coefficient | Matrix_Calculation | lm_Function | |
|---|---|---|---|
| (Intercept) | Intercept | 96.4076 | 96.4076 |
| Age | Age | 0.3426 | 0.3426 |
| BMI | BMI | 0.2746 | 0.2746 |
| Physically_Active | Physically_Active | -0.0310 | -0.0310 |
| Current_Smoker | Current_Smoker | 3.2358 | 3.2358 |
They match perfectly! This confirms that lm() solves exactly the same least-squares problem. (Internally, lm() uses a QR decomposition of X rather than explicitly inverting X'X - numerically more stable, but mathematically equivalent to β = (X'X)^(-1)X'Y.)
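If you want to see the QR route directly, base R's qr.solve() finds the same least-squares solution (a quick optional check):
# Least-squares solution via a QR decomposition - same coefficients as beta_hat
qr.solve(X, Y)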
| Predictor | Effect | Clinical_Meaning |
|---|---|---|
| Age | +0.34 mmHg per year | Each additional year of age increases BP (aging effect) |
| BMI | +0.27 mmHg per unit | Each BMI unit increase raises BP (obesity-hypertension link) |
| Physically Active | -0.03 mmHg | Being physically active decreases BP (protective effect) |
| Current Smoker | +3.24 mmHg | Current smoking increases BP (negative health behavior) |
Key finding: Age has a strong effect - each additional year is associated with a 0.34 mmHg increase in systolic BP, controlling for other factors.
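To make that concrete, here is a small illustration with two hypothetical profiles (invented for this example, not participants in the sample):
# Predicted BP for a 60-year-old vs a 40-year-old, both with BMI 28, physically active,
# and non-smoking; columns follow the design matrix (intercept, Age, BMI, Active, Smoker)
profile_40 <- c(1, 40, 28, 1, 0)
profile_60 <- c(1, 60, 28, 1, 0)
sum(profile_60 * beta_hat) - sum(profile_40 * beta_hat)   # 20 years x 0.34 ≈ 6.9 mmHg higher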
# Add predictions to data
bp_data$predicted <- as.vector(X %*% beta_hat)
bp_data$residual <- bp_data$BPSysAve - bp_data$predicted
# Interactive plot
plot_ly(bp_data,
x = ~predicted,
y = ~BPSysAve,
color = ~BP_Status,
colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
text = ~paste("ID:", ID,
"<br>Observed BP:", round(BPSysAve, 1), "mmHg",
"<br>Predicted:", round(predicted, 1), "mmHg",
"<br>Residual:", round(residual, 1),
"<br>Status:", BP_Status),
type = 'scatter',
mode = 'markers',
marker = list(size = 8, opacity = 0.6)) %>%
add_trace(x = c(min(bp_data$predicted), max(bp_data$predicted)),
y = c(min(bp_data$predicted), max(bp_data$predicted)),
type = 'scatter', mode = 'lines',
line = list(color = 'black', dash = 'dash'),
name = 'Perfect Prediction',
showlegend = TRUE,
inherit = FALSE) %>%
layout(title = "Predicted vs Observed Systolic BP",
xaxis = list(title = "Predicted Systolic BP (mmHg)"),
yaxis = list(title = "Observed Systolic BP (mmHg)"))
Points close to the diagonal line are well-predicted by our model; points far from the line have large residuals (prediction errors).
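For an overall sense of fit, two quick summaries can be computed from the residual column just created (a sketch; the exact values depend on the particular random sample of 200):
# Root mean squared error: the typical prediction error, in mmHg
sqrt(mean(bp_data$residual^2))
# R-squared: the share of variation in BP explained by the model
1 - sum(bp_data$residual^2) / sum((bp_data$BPSysAve - mean(bp_data$BPSysAve))^2)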
Multicollinearity = When predictor variables are highly correlated with each other
Why it matters: perfect collinearity makes X'X exactly singular (non-invertible), so β cannot be calculated at all; strong but imperfect collinearity makes the inverse unstable, so the estimates become unreliable.
Let’s try to include both Weight and BMI as predictors. Problem: BMI is calculated FROM weight and height!
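You can see the redundancy directly by recomputing BMI from the other two variables (a quick check; NHANES heights are in cm and weights in kg):
# BMI recomputed as weight (kg) / height (m)^2 - almost perfectly correlated with the BMI column
bmi_recomputed <- bp_data$Weight / (bp_data$Height / 100)^2
cor(bmi_recomputed, bp_data$BMI, use = "complete.obs")   # should be essentially 1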
# Check if we have the variables needed
bp_data_check <- bp_data %>%
filter(!is.na(Weight), !is.na(Height))
# BMI is weight(kg) / height(m)²
# Since we already have BMI, let's show the relationship
cor_check <- cor(bp_data_check[, c("Weight", "BMI", "Height")],
use = "complete.obs")
cor_check %>%
kable(caption = "Correlation Matrix: Weight, BMI, and Height",
digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | Weight | BMI | Height |
|---|---|---|---|
| Weight | 1.000 | 0.891 | 0.409 |
| BMI | 0.891 | 1.000 | -0.039 |
| Height | 0.409 | -0.039 | 1.000 |
# Try to fit regression with both Weight and BMI
X_bad <- bp_data_check %>%
select(Age, BMI, Weight, Physically_Active, Current_Smoker) %>%
mutate(intercept = 1) %>%
select(intercept, everything()) %>%
as.matrix()
# What happens to X'X?
XtX_bad <- t(X_bad) %*% X_bad
# Check if invertible
determinant_value <- det(XtX_bad)
data.frame(
Matrix = "X'X with BMI and Weight",
Determinant = format(determinant_value, scientific = TRUE, digits = 3),
Invertible = ifelse(abs(determinant_value) < 1e-10, "NO ❌", "Yes, but ill-conditioned ⚠️")
) %>%
kable(caption = "Problem: Nearly Singular Matrix!") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Matrix | Determinant | Invertible |
|---|---|---|
| X’X with BMI and Weight | 3e+18 | Yes, but ill-conditioned ⚠️ |
The issue: the determinant is not actually zero, so X'X can still be inverted, but Weight and BMI are highly correlated (r = 0.89), which makes X'X ill-conditioned - the inverse exists, yet the estimates built from it are unstable. (The raw determinant is also scale-dependent, so its size alone is a poor diagnostic.)
# Try with lm() - R will detect the high multicollinearity
model_bad <- lm(BPSysAve ~ Age + BMI + Weight + Physically_Active + Current_Smoker,
data = bp_data_check)
coef(model_bad) %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
rename(Coefficient = ".") %>%
kable(caption = "lm() Output with Both BMI and Weight",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Variable | Coefficient |
|---|---|
| (Intercept) | 95.1604 |
| Age | 0.3541 |
| BMI | -0.3409 |
| Weight | 0.2171 |
| Physically_Active | -0.0502 |
| Current_Smoker | 3.6560 |
# Check standard errors - they'll be inflated!
summary(model_bad)$coefficients %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
select(Variable, Estimate, `Std. Error`) %>%
kable(caption = "Notice the Large Standard Errors!",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Variable | Estimate | Std. Error |
|---|---|---|
| (Intercept) | 95.1604 | 7.3689 |
| Age | 0.3541 | 0.0862 |
| BMI | -0.3409 | 0.4085 |
| Weight | 0.2171 | 0.1284 |
| Physically_Active | -0.0502 | 2.4945 |
| Current_Smoker | 3.6560 | 2.5869 |
The problem: the coefficients can still be estimated, but their standard errors are inflated, and the BMI coefficient has even flipped sign relative to the original model (now -0.34 instead of +0.27) - this instability is what multicollinearity does!
# Check VIF (Variance Inflation Factor) for the problematic model
vif_values <- vif(model_bad)
data.frame(
Variable = names(vif_values),
VIF = vif_values,
Problem = ifelse(vif_values > 5, "⚠️ HIGH - Multicollinearity!", "✓ OK")
) %>%
kable(caption = "Variance Inflation Factors (VIF > 5 indicates multicollinearity)",
digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | VIF | Problem | |
|---|---|---|---|
| Age | Age | 1.09 | ✓ OK |
| BMI | BMI | 4.90 | ✓ OK |
| Weight | Weight | 4.91 | ✓ OK |
| Physically_Active | Physically_Active | 1.04 | ✓ OK |
| Current_Smoker | Current_Smoker | 1.12 | ✓ OK |
# Compare with the good model (without Weight)
vif_values_good <- vif(model)
data.frame(
Variable = names(vif_values_good),
VIF = vif_values_good,
Problem = ifelse(vif_values_good > 5, "⚠️ Concerning", "✓ OK")
) %>%
kable(caption = "VIF for Original Model (Without Weight) - All OK!",
digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | VIF | Problem | |
|---|---|---|---|
| Age | Age | 1.08 | ✓ OK |
| BMI | BMI | 1.01 | ✓ OK |
| Physically_Active | Physically_Active | 1.04 | ✓ OK |
| Current_Smoker | Current_Smoker | 1.11 | ✓ OK |
Rule of thumb: VIF > 5-10 indicates serious multicollinearity. Here the VIFs for BMI and Weight in the problematic model (about 4.9 each) sit right at that threshold and are nearly five times larger than in the model without Weight (about 1.0) - exactly the warning sign to look for.
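Where do these VIF numbers come from? Each one is 1 / (1 − R²) from regressing that predictor on all the others; a sketch for Weight in the problematic model:
# VIF for Weight "by hand": regress Weight on the other predictors, then take 1 / (1 - R^2)
r2_weight <- summary(lm(Weight ~ Age + BMI + Physically_Active + Current_Smoker,
                        data = bp_data_check))$r.squared
1 / (1 - r2_weight)   # should reproduce vif(model_bad)["Weight"]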
| Scenario | Matrix_Concept_Involved | What_To_Do |
|---|---|---|
| Running multiple regression in R | β = (X’X)⁻¹X’Y is being calculated | Nothing - lm() handles it automatically |
| Error: ‘system is computationally singular’ | X’X is singular (can’t be inverted) | Remove redundant variables |
| Unstable coefficient estimates | X’X is nearly singular (multicollinearity) | Check VIF, remove highly correlated predictors |
| Understanding regression output | Understanding where coefficients come from | Know that lm() uses matrix algebra |
| Checking model assumptions | Residuals = Y - Xβ | Examine the residuals for patterns (e.g., plot them against fitted values) |
| Operation | R_Code | When_You_Need_It |
|---|---|---|
| Transpose | t(X) | Calculating X’X or X’Y |
| Matrix multiplication | X %*% Y | Calculating regression components |
| Inverse | solve(XtX) | Solving for β coefficients |
| Correlation matrix | cor(data) | Checking for multicollinearity |
| Check multicollinearity | vif(model) | Diagnosing unstable models |
\[\boxed{\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}}\]
Translation: To get the regression coefficients, we (1) form the cross-product matrix X'X, (2) invert it, and (3) multiply the inverse by X'Y.
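In R, the whole boxed formula is one line (the same calculation assembled step by step earlier):
# beta = (X'X)^(-1) X'Y in a single expression
solve(t(X) %*% X) %*% t(X) %*% Y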
# Always check this before running regression!
bp_data %>%
select(Age, BMI, Physically_Active, Current_Smoker) %>%
cor() %>%
kable(caption = "Predictor Correlation Matrix - Check Before Regression!",
digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|
| Age | 1.000 | 0.042 | -0.053 | -0.252 |
| BMI | 0.042 | 1.000 | -0.055 | -0.079 |
| Physically_Active | -0.053 | -0.055 | 1.000 | -0.158 |
| Current_Smoker | -0.252 | -0.079 | -0.158 | 1.000 |
# Visualize correlations
cor_matrix <- cor(bp_data[, c("Age", "BMI", "Physically_Active", "Current_Smoker")])
plot_ly(
x = colnames(cor_matrix),
y = rownames(cor_matrix),
z = cor_matrix,
type = "heatmap",
colors = colorRamp(c("blue", "white", "red")),
zmin = -1, zmax = 1
) %>%
layout(title = "Correlation Heatmap of Predictors (Interactive)")If any correlation is > 0.8, consider removing one of the variables! Our predictors are safely below this threshold.
Using what you’ve learned, analyze total cholesterol in NHANES:
# Prepare cholesterol data from NHANES
chol_data <- NHANES %>%
filter(Age >= 21,
!is.na(TotChol), # Total cholesterol
!is.na(BMI),
!is.na(PhysActive),
!is.na(SmokeNow)) %>%
select(ID, TotChol, Age, BMI, PhysActive, SmokeNow) %>%
mutate(
Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
) %>%
select(ID, TotChol, Age, BMI, Physically_Active, Current_Smoker) %>%
slice_sample(n = 150) %>%
arrange(ID)
chol_data %>%
head(10) %>%
kable(caption = "NHANES Cholesterol Data (First 10 Observations)",
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| ID | TotChol | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|---|
| 51691 | 5.0 | 57 | 20.7 | 1 | 0 |
| 51856 | 7.3 | 61 | 28.4 | 1 | 1 |
| 51905 | 3.7 | 52 | 29.0 | 0 | 0 |
| 52135 | 5.6 | 74 | 32.9 | 1 | 0 |
| 52259 | 4.4 | 25 | 23.5 | 1 | 1 |
| 52408 | 8.8 | 54 | 25.9 | 0 | 0 |
| 52484 | 5.8 | 31 | 24.1 | 1 | 1 |
| 52587 | 6.0 | 72 | 27.2 | 0 | 0 |
| 52647 | 5.1 | 43 | 29.7 | 1 | 1 |
| 52730 | 5.5 | 53 | 23.6 | 1 | 0 |
Research Question: What factors predict total cholesterol in US adults?
Tasks: (1) check the predictor correlation matrix, (2) build the outcome vector Y and design matrix X, (3) calculate X'X, (4) fit the model with lm(), (5) calculate β by hand with the matrix formula and compare, (6) check the VIFs, and (7) interpret the coefficients.
# Your code here:
# 1. Correlation matrix
cor(chol_data %>% select(Age, BMI, Physically_Active, Current_Smoker))
# 2. Create X matrix
Y_practice <- chol_data$TotChol
X_practice <- chol_data %>%
select(Age, BMI, Physically_Active, Current_Smoker) %>%
mutate(intercept = 1) %>%
select(intercept, everything()) %>%
as.matrix()
# 3. Calculate X'X
XtX_practice <- t(X_practice) %*% X_practice
# 4. Fit model
model_practice <- lm(TotChol ~ Age + BMI + Physically_Active + Current_Smoker,
data = chol_data)
# 5. Compare - calculate beta manually
beta_practice <- solve(XtX_practice) %*% (t(X_practice) %*% Y_practice)
# 6. Check VIF
vif(model_practice)
# 7. Interpret coefficients
summary(model_practice)
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] NHANES_2.1.0 car_3.1-5 carData_3.0-6 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 lubridate_1.9.4 forcats_1.0.1
## [9] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.1.6
## [13] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56.2 bslib_0.10.0
## [4] htmlwidgets_1.6.4 tzdb_0.5.0 crosstalk_1.2.2
## [7] vctrs_0.7.1 tools_4.5.2 generics_0.1.4
## [10] curl_7.0.0 pkgconfig_2.0.3 telegram.bot_3.0.2
## [13] data.table_1.18.2.1 RColorBrewer_1.1-3 S7_0.2.1
## [16] lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
## [19] textshaping_1.0.4 httpuv_1.6.16 htmltools_0.5.9
## [22] sass_0.4.10 yaml_2.3.12 lazyeval_0.2.2
## [25] Formula_1.2-5 later_1.4.5 pillar_1.11.1
## [28] jquerylib_0.1.4 openssl_2.3.4 cachem_1.1.0
## [31] abind_1.4-8 tidyselect_1.2.1 digest_0.6.39
## [34] stringi_1.8.7 fastmap_1.2.0 grid_4.5.2
## [37] cli_3.6.5 magrittr_2.0.4 withr_3.0.2
## [40] scales_1.4.0 promises_1.5.0 timechange_0.4.0
## [43] rmarkdown_2.30 httr_1.4.7 otel_0.2.0
## [46] askpass_1.2.1 hms_1.1.4 evaluate_1.0.5
## [49] viridisLite_0.4.2 rlang_1.1.7 Rcpp_1.1.1
## [52] glue_1.8.0 xml2_1.5.2 svglite_2.2.2
## [55] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1
## [58] systemfonts_1.3.1.9000