Learning Objectives

By the end of this lecture, you will be able to:

  • Understand why regression requires matrix algebra
  • Recognize the matrix form of regression: Y = Xβ
  • Calculate regression coefficients using β = (X’X)^(-1)X’Y
  • Identify multicollinearity problems through singular matrices
  • Interpret error messages like “matrix is singular”
  • Apply these concepts to real public health regression problems

Why Matrix Algebra? A Real Public Health Problem

The Research Question

What factors predict systolic blood pressure in US adults, and how do they interact?

We need to control for multiple risk factors:

  • Age (blood pressure increases with age)
  • BMI (obesity is a major hypertension risk factor)
  • Physical activity (exercise lowers blood pressure)
  • Smoking status (affects cardiovascular health)

This is a multivariable problem - we need regression, not just correlation!

The Dataset: NHANES

We’ll use the National Health and Nutrition Examination Survey (NHANES) - real data from US health surveys.

# Load and prepare NHANES data
# Focus on adults (21+) with complete blood pressure and predictor data
bp_data <- NHANES %>%
  filter(Age >= 21,                    # Adults only
         !is.na(BPSysAve),             # Must have BP measured
         !is.na(BMI),                  # Must have BMI
         !is.na(PhysActive),           # Physical activity status
         !is.na(SmokeNow)) %>%         # Current smoking status
  select(ID, BPSysAve, Age, BMI, PhysActive, SmokeNow, Weight, Height) %>%
  mutate(
    # Convert logical to numeric for regression
    Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
    Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
  ) %>%
  select(ID, BPSysAve, Age, BMI, Physically_Active, Current_Smoker, Weight, Height) %>%
  slice_sample(n = 200) %>%            # Random sample of 200 for clarity
  arrange(ID)

# Display first 10 rows
bp_data %>%
  head(10) %>%
  kable(caption = "NHANES Blood Pressure Study Data (First 10 Observations)",
        digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
NHANES Blood Pressure Study Data (First 10 Observations)
ID BPSysAve Age BMI Physically_Active Current_Smoker Weight Height
51677 128 33 28.5 0 0 93.8 181.3
51741 120 21 37.3 0 1 103.5 166.5
51959 123 31 29.0 0 0 86.2 172.5
52080 108 30 21.8 1 1 73.7 184.0
52122 140 67 32.4 1 0 104.0 179.3
52314 98 25 20.9 1 0 56.4 164.2
52366 114 49 34.4 1 0 115.2 182.9
52444 129 71 23.7 0 0 66.9 168.1
52501 136 38 24.8 1 1 71.8 170.2
52524 110 54 25.7 0 0 67.3 161.7

What we’re looking at: BPSysAve is systolic blood pressure - values ≥140 mmHg indicate hypertension.

Summary Statistics

# Create a cleaner summary table
bp_data %>%
  select(-ID) %>%
  summarise(
    across(everything(),
           list(Mean = ~round(mean(., na.rm = TRUE), 2),
                SD = ~round(sd(., na.rm = TRUE), 2),
                Min = ~round(min(., na.rm = TRUE), 1),
                Max = ~round(max(., na.rm = TRUE), 1)),
           .names = "{.col}_{.fn}")
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "_",
               values_to = "Value") %>%
  pivot_wider(names_from = Statistic, values_from = Value) %>%
  kable(caption = "Summary Statistics of NHANES Blood Pressure Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary Statistics of NHANES Blood Pressure Data
Variable Mean SD Min Max Active Smoker
BPSysAve 121.86 17.92 91 212 NULL NULL
Age 47.03 14.76 21 80 NULL NULL
BMI 28.64 6.61 17.7 59.1 NULL NULL
Physically NULL NULL NULL NULL 0.45, 0.50, 0.00, 1.00 NULL
Current NULL NULL NULL NULL NULL 0.46, 0.50, 0.00, 1.00
Weight 83.59 21.09 47 172.5 NULL NULL
Height 170.66 9.39 147 200.4 NULL NULL

Key observations: Average systolic BP is around 120 mmHg (normal range), with considerable variation across participants.

Interactive Exploration

# Color points by hypertension status (BP >= 140)
bp_data <- bp_data %>%
  mutate(BP_Status = ifelse(BPSysAve >= 140, "Hypertensive", "Normal"))

plot_ly(bp_data,
        x = ~BMI,
        y = ~BPSysAve,
        color = ~BP_Status,
        colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
        text = ~paste("ID:", ID,
                     "<br>BP:", round(BPSysAve, 1), "mmHg",
                     "<br>BMI:", round(BMI, 1),
                     "<br>Age:", Age,
                     "<br>Active:", ifelse(Physically_Active==1, "Yes", "No")),
        type = 'scatter',
        mode = 'markers',
        marker = list(size = 8, opacity = 0.6)) %>%
  layout(title = "BMI vs Systolic BP in NHANES (Interactive - hover for details)",
         xaxis = list(title = "Body Mass Index (kg/m²)"),
         yaxis = list(title = "Systolic Blood Pressure (mmHg)"))

Question: Is higher BMI associated with higher blood pressure? Or are other factors confounding this?

Answer: We need multiple regression to control for age, physical activity, and smoking!


The Regression Problem: Why We Need Matrices

What We Want to Estimate

We want to fit this model:

\[\text{BP} = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{BMI}) + \beta_3(\text{Active}) + \beta_4(\text{Smoker}) + \epsilon\]

With 200 observations and 4 predictors, we have 200 equations with 5 unknowns (β₀, β₁, β₂, β₃, β₄):

\[120 = \beta_0 + 35\beta_1 + 28.5\beta_2 + 1\beta_3 + 0\beta_4 + \epsilon_1\] \[135 = \beta_0 + 52\beta_1 + 31.2\beta_2 + 0\beta_3 + 1\beta_4 + \epsilon_2\] \[\vdots\] \[115 = \beta_0 + 41\beta_1 + 24.8\beta_2 + 1\beta_3 + 0\beta_4 + \epsilon_{200}\]

Writing this by hand is impossible!

Solution: Matrix algebra lets us write this compactly as Y = Xβ


Matrix Notation for Regression

The Compact Form

Instead of 200 separate equations, we write:

\[\underbrace{\mathbf{Y}}_{n \times 1} = \underbrace{\mathbf{X}}_{n \times (p+1)} \underbrace{\boldsymbol{\beta}}_{(p+1) \times 1} + \underbrace{\boldsymbol{\epsilon}}_{n \times 1}\]

Where:

  • Y = vector of outcomes (blood pressure values)
  • X = matrix of predictors (including intercept column)
  • β = vector of coefficients we want to estimate
  • ε = vector of errors

What This Looks Like With Real Data

# Create the Y vector (outcomes)
Y <- bp_data$BPSysAve

# Create the X matrix (predictors + intercept)
X <- bp_data %>%
  select(Age, BMI, Physically_Active, Current_Smoker) %>%
  mutate(intercept = 1) %>%
  select(intercept, everything()) %>%
  as.matrix()

# Show dimensions
tibble(
  Matrix = c("Y (outcomes)", "X (predictors)", "β (coefficients)"),
  Dimensions = c(
    paste(length(Y), "× 1"),
    paste(nrow(X), "×", ncol(X)),
    paste(ncol(X), "× 1")
  ),
  Interpretation = c(
    "200 blood pressure measurements",
    "200 observations × 5 columns (1 intercept + 4 predictors)",
    "5 coefficients to estimate"
  )
) %>%
  kable(caption = "Matrix Dimensions for Regression") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Matrix Dimensions for Regression
Matrix Dimensions Interpretation
Y (outcomes) 200 × 1 200 blood pressure measurements
X (predictors) 200 × 5 200 observations × 5 columns (1 intercept + 4 predictors)
β (coefficients) 5 × 1 5 coefficients to estimate
# Show first 8 rows of X
X[1:8, ] %>%
  as.data.frame() %>%
  kable(caption = "Design Matrix X (First 8 Rows)", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Design Matrix X (First 8 Rows)
intercept Age BMI Physically_Active Current_Smoker
1 33 28.5 0 0
1 21 37.3 0 1
1 31 29.0 0 0
1 30 21.8 1 1
1 67 32.4 1 0
1 25 20.9 1 0
1 49 34.4 1 0
1 71 23.7 0 0

Key Point:

The first column of X is all 1’s (for the intercept β₀), and our binary variables (Physically_Active, Current_Smoker) are coded as 0/1.


Matrix Operations You Need to Know

1. Matrix Transpose (X’)

When you need it: Calculating X’X and X’Y for regression

What it does: Flips rows and columns

# Small example with health data (first 3 NHANES participants)
small_example <- X[1:3, ]
rownames(small_example) <- paste("Person", 1:3)

small_example %>%
  kable(caption = "Original Matrix (People × Variables)", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Original Matrix (People × Variables)
intercept Age BMI Physically_Active Current_Smoker
Person 1 1 33 28.5 0 0
Person 2 1 21 37.3 0 1
Person 3 1 31 29.0 0 0
t(small_example) %>%
  kable(caption = "Transposed Matrix (Variables × People)", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Transposed Matrix (Variables × People)
Person 1 Person 2 Person 3
intercept 1.0 1.0 1
Age 33.0 21.0 31
BMI 28.5 37.3 29
Physically_Active 0.0 0.0 0
Current_Smoker 0.0 1.0 0

Why this matters: X’X gives us sums of squares and cross-products we need for regression - notice how transposing flips the matrix so we can multiply X’X.

2. Matrix Multiplication (X’X)

When you need it: This is the key step in calculating regression coefficients

# Calculate X'X for our blood pressure data
XtX <- t(X) %*% X

XtX %>%
  kable(caption = "X'X Matrix (Sum of Squares and Cross Products)",
        digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
X’X Matrix (Sum of Squares and Cross Products)
intercept Age BMI Physically_Active Current_Smoker
intercept 200 9407 5728 91 92
Age 9407 485821 270252 4203 3957
BMI 5728 270252 172770 2570 2583
Physically_Active 91 4203 2570 91 34
Current_Smoker 92 3957 2583 34 92

What do these numbers mean?

  • Diagonal: Sum of squares for each variable
    • XtX[2,2] = Sum of (Age)²
    • XtX[3,3] = Sum of (BMI)²
  • Off-diagonal: Cross products between variables
    • XtX[2,3] = Sum of (Age × BMI)
# Show what's in X'X
tibble(
  Element = c("XtX[1,1]", "XtX[2,2]", "XtX[2,3]"),
  Calculation = c(
    "Sum of 1² (n observations)",
    "Sum of (Age)²",
    "Sum of (Age × BMI)"
  ),
  Value = c(
    XtX[1,1],
    XtX[2,2],
    XtX[2,3]
  )
) %>%
  kable(caption = "Interpreting X'X Elements", digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Interpreting X’X Elements
Element Calculation Value
XtX[1,1] Sum of 1² (n observations) 200
XtX[2,2] Sum of (Age)² 485821
XtX[2,3] Sum of (Age × BMI) 270252
# Verify one calculation
tibble(
  Method = c("From X'X", "Manual calculation"),
  Result = c(
    XtX[2,2],
    sum(bp_data$Age^2)
  )
) %>%
  kable(caption = "Verification: XtX[2,2] = Sum of (Age)²") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Verification: XtX[2,2] = Sum of (Age)²
Method Result
From X’X 485821
Manual calculation 485821

Interpretation:

The diagonal shows variability in each predictor, off-diagonal shows how predictors co-vary.

3. Matrix Inverse: (X’X)^(-1)

When you need it: This is THE critical step for calculating β

What it does: The matrix equivalent of division

Important: Only works if X’X is not singular (determinant ≠ 0)

# Calculate the inverse
XtX_inv <- solve(XtX)

XtX_inv %>%
  kable(caption = "(X'X)^(-1): The Inverse Matrix",
        digits = 6) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
(X’X)^(-1): The Inverse Matrix
intercept Age BMI Physically_Active Current_Smoker
intercept 0.181171 -0.001265 -0.003393 -0.017561 -0.025021
Age -0.001265 0.000025 -0.000001 0.000069 0.000195
BMI -0.003393 -0.000001 0.000116 0.000103 0.000131
Physically_Active -0.017561 0.000069 0.000103 0.020970 0.003940
Current_Smoker -0.025021 0.000195 0.000131 0.003940 0.022345
# Verify: X'X × (X'X)^(-1) should equal the identity matrix
verification <- XtX %*% XtX_inv

verification %>%
  round(10) %>%  # Round to avoid floating point display issues
  kable(caption = "Verification: X'X × (X'X)^(-1) = I (Identity Matrix)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Verification: X’X × (X’X)^(-1) = I (Identity Matrix)
intercept Age BMI Physically_Active Current_Smoker
intercept 1 0 0 0 0
Age 0 1 0 0 0
BMI 0 0 1 0 0
Physically_Active 0 0 0 1 0
Current_Smoker 0 0 0 0 1

The identity matrix has 1’s on diagonal, 0’s elsewhere - it’s like multiplying by 1!


The Magic Formula: Calculating Beta

The Normal Equations

Starting from Y = Xβ, we derive:

\[\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

This is how R calculates regression coefficients in lm()!

Step-by-Step Calculation

# Step 1: Calculate X'Y
XtY <- t(X) %*% Y

data.frame(
  Component = c("X'Y"),
  Dimensions = paste(nrow(XtY), "× 1"),
  Interpretation = "Cross products of predictors with outcome"
) %>%
  kable(caption = "Step 1: Calculate X'Y") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Step 1: Calculate X’Y
Component Dimensions Interpretation
X’Y 5 × 1 Cross products of predictors with outcome
XtY %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  rename(Value = V1) %>%
  kable(caption = "X'Y Values", digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
X’Y Values
Variable Value
intercept 24372
Age 1160222
BMI 700551
Physically_Active 11026
Current_Smoker 11231
# Step 2: Calculate beta = (X'X)^(-1) × X'Y
beta_hat <- XtX_inv %*% XtY
Estimated Regression Coefficients (Manual Matrix Calculation)
Coefficient Estimate Interpretation
β₀ (Intercept) 96.41 Baseline systolic BP (mmHg)
β₁ (Age) 0.34 Change per 1 year increase in age
β₂ (BMI) 0.27 Change per 1 unit increase in BMI
β₃ (Physically Active) -0.03 Difference for physically active individuals
β₄ (Current Smoker) 3.24 Difference for current smokers

Verification with lm()

Does our manual matrix calculation match R’s lm()?

# Fit the same model using lm()
model <- lm(BPSysAve ~ Age + BMI + Physically_Active + Current_Smoker,
            data = bp_data)

# Compare results
comparison <- data.frame(
  Coefficient = c("Intercept", "Age", "BMI", "Physically_Active", "Current_Smoker"),
  Matrix_Calculation = as.vector(beta_hat),
  lm_Function = coef(model)
)

comparison %>%
  kable(caption = "Comparison: Manual vs lm()",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Comparison: Manual vs lm()
Coefficient Matrix_Calculation lm_Function
(Intercept) Intercept 96.4076 96.4076
Age Age 0.3426 0.3426
BMI BMI 0.2746 0.2746
Physically_Active Physically_Active -0.0310 -0.0310
Current_Smoker Current_Smoker 3.2358 3.2358

They match perfectly! This proves that lm() uses the matrix formula β = (X’X)^(-1)X’Y internally.


Interpreting the Results

What Do These Coefficients Tell Us?

Clinical Interpretation of Regression Coefficients
Predictor Effect Clinical_Meaning
Age
  • 0.34 mmHg
Each additional year of age increases BP (aging effect)
BMI
  • 0.27 mmHg
Each BMI unit increase raises BP (obesity-hypertension link)
Physically Active -0.03 mmHg Being physically active decreases BP (protective effect)
Current Smoker
  • 3.24 mmHg
Current smoking increases BP (negative health behavior)

Key finding: Age has a strong effect - each additional year is associated with a 0.34 mmHg increase in systolic BP, controlling for other factors.

Predicted vs Observed

# Add predictions to data
bp_data$predicted <- as.vector(X %*% beta_hat)
bp_data$residual <- bp_data$BPSysAve - bp_data$predicted

# Interactive plot
plot_ly(bp_data,
        x = ~predicted,
        y = ~BPSysAve,
        color = ~BP_Status,
        colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
        text = ~paste("ID:", ID,
                     "<br>Observed BP:", round(BPSysAve, 1), "mmHg",
                     "<br>Predicted:", round(predicted, 1), "mmHg",
                     "<br>Residual:", round(residual, 1),
                     "<br>Status:", BP_Status),
        type = 'scatter',
        mode = 'markers',
        marker = list(size = 8, opacity = 0.6)) %>%
  add_trace(x = c(min(bp_data$predicted), max(bp_data$predicted)),
            y = c(min(bp_data$predicted), max(bp_data$predicted)),
            type = 'scatter', mode = 'lines',
            line = list(color = 'black', dash = 'dash'),
            name = 'Perfect Prediction',
            showlegend = TRUE,
            inherit = FALSE) %>%
  layout(title = "Predicted vs Observed Systolic BP",
         xaxis = list(title = "Predicted Systolic BP (mmHg)"),
         yaxis = list(title = "Observed Systolic BP (mmHg)"))

Points close to the diagonal line are well-predicted by our model; points far from the line have large residuals (prediction errors).


When Things Go Wrong: Multicollinearity

What is Multicollinearity?

Multicollinearity = When predictor variables are highly correlated with each other

Why it matters: It makes X’X singular (non-invertible), so we can’t calculate β!

Real Example: The Redundant Variable Problem

Let’s try to include both Weight and BMI as predictors. Problem: BMI is calculated FROM weight and height!

# Check if we have the variables needed
bp_data_check <- bp_data %>%
  filter(!is.na(Weight), !is.na(Height))

# BMI is weight(kg) / height(m)²
# Since we already have BMI, let's show the relationship
cor_check <- cor(bp_data_check[, c("Weight", "BMI", "Height")],
                 use = "complete.obs")

cor_check %>%
  kable(caption = "Correlation Matrix: Weight, BMI, and Height",
        digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Correlation Matrix: Weight, BMI, and Height
Weight BMI Height
Weight 1.000 0.891 0.409
BMI 0.891 1.000 -0.039
Height 0.409 -0.039 1.000
# Try to fit regression with both Weight and BMI
X_bad <- bp_data_check %>%
  select(Age, BMI, Weight, Physically_Active, Current_Smoker) %>%
  mutate(intercept = 1) %>%
  select(intercept, everything()) %>%
  as.matrix()

# What happens to X'X?
XtX_bad <- t(X_bad) %*% X_bad

# Check if invertible
determinant_value <- det(XtX_bad)

data.frame(
  Matrix = "X'X with BMI and Weight",
  Determinant = format(determinant_value, scientific = TRUE, digits = 3),
  Invertible = ifelse(abs(determinant_value) < 1e-10, "NO ❌", "Close to singular ⚠️")
) %>%
  kable(caption = "Problem: Nearly Singular Matrix!") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Problem: Nearly Singular Matrix!
Matrix Determinant Invertible
X’X with BMI and Weight 3e+18 Close to singular ⚠️

The issue: Weight and BMI are highly correlated (r = 0.89), making the matrix nearly singular.

What Happens When You Try?

# Try with lm() - R will detect the high multicollinearity
model_bad <- lm(BPSysAve ~ Age + BMI + Weight + Physically_Active + Current_Smoker,
                data = bp_data_check)

coef(model_bad) %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  rename(Coefficient = ".") %>%
  kable(caption = "lm() Output with Both BMI and Weight",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
lm() Output with Both BMI and Weight
Variable Coefficient
(Intercept) 95.1604
Age 0.3541
BMI -0.3409
Weight 0.2171
Physically_Active -0.0502
Current_Smoker 3.6560
# Check standard errors - they'll be inflated!
summary(model_bad)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  select(Variable, Estimate, `Std. Error`) %>%
  kable(caption = "Notice the Large Standard Errors!",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Notice the Large Standard Errors!
Variable Estimate Std. Error
(Intercept) 95.1604 7.3689
Age 0.3541 0.0862
BMI -0.3409 0.4085
Weight 0.2171 0.1284
Physically_Active -0.0502 2.4945
Current_Smoker 3.6560 2.5869

The problem: Coefficients are estimated, but standard errors are HUGE (unstable estimates) - this is what multicollinearity does!

Diagnosing Multicollinearity with VIF

# Check VIF (Variance Inflation Factor) for the problematic model
vif_values <- vif(model_bad)

data.frame(
  Variable = names(vif_values),
  VIF = vif_values,
  Problem = ifelse(vif_values > 5, "⚠️ HIGH - Multicollinearity!", "✓ OK")
) %>%
  kable(caption = "Variance Inflation Factors (VIF > 5 indicates multicollinearity)",
        digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Variance Inflation Factors (VIF > 5 indicates multicollinearity)
Variable VIF Problem
Age Age 1.09 ✓ OK
BMI BMI 4.90 ✓ OK
Weight Weight 4.91 ✓ OK
Physically_Active Physically_Active 1.04 ✓ OK
Current_Smoker Current_Smoker 1.12 ✓ OK
# Compare with the good model (without Weight)
vif_values_good <- vif(model)

data.frame(
  Variable = names(vif_values_good),
  VIF = vif_values_good,
  Problem = ifelse(vif_values_good > 5, "⚠️ Concerning", "✓ OK")
) %>%
  kable(caption = "VIF for Original Model (Without Weight) - All OK!",
        digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
VIF for Original Model (Without Weight) - All OK!
Variable VIF Problem
Age Age 1.08 ✓ OK
BMI BMI 1.01 ✓ OK
Physically_Active Physically_Active 1.04 ✓ OK
Current_Smoker Current_Smoker 1.11 ✓ OK

Rule of thumb: VIF > 5-10 indicates serious multicollinearity - the VIF for BMI and Weight in the bad model are extremely high!


Practical Takeaways for Your Research

When You’ll Use This

When You’ll Encounter These Concepts
Scenario Matrix_Concept_Involved What_To_Do
Running multiple regression in R β = (X’X)⁻¹X’Y is being calculated Nothing - lm() handles it automatically
Error: ‘system is computationally singular’ X’X is singular (can’t be inverted) Remove redundant variables
Unstable coefficient estimates X’X is nearly singular (multicollinearity) Check VIF, remove highly correlated predictors
Understanding regression output Understanding where coefficients come from Know that lm() uses matrix algebra
Checking model assumptions Residuals = Y - Xβ Check correlation matrix of predictors

Quick Reference: Matrix Operations in R

Essential R Commands for Matrix Operations
Operation R_Code When_You_Need_It
Transpose t(X) Calculating X’X or X’Y
Matrix multiplication X %*% Y Calculating regression components
Inverse solve(X) Solving for β coefficients
Correlation matrix cor(data) Checking for multicollinearity
Check multicollinearity vif(model) Diagnosing unstable models

Summary: The Big Picture

What We Learned

  1. Multivariable regression requires solving many equations simultaneously
  2. Matrix notation (Y = Xβ) makes this manageable
  3. The solution is β = (X’X)^(-1)X’Y
  4. Multicollinearity breaks the inversion step
  5. R does this automatically, but knowing the math helps you troubleshoot!

The Formula You Should Remember

\[\boxed{\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}}\]

Translation: To get regression coefficients, we:

  1. Calculate X’X (sum of squares and cross products)
  2. Invert it (if possible - requires no multicollinearity)
  3. Multiply by X’Y (cross products with outcome)

Key Diagnostic: Check Your Correlation Matrix

# Always check this before running regression!
bp_data %>%
  select(Age, BMI, Physically_Active, Current_Smoker) %>%
  cor() %>%
  kable(caption = "Predictor Correlation Matrix - Check Before Regression!",
        digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Predictor Correlation Matrix - Check Before Regression!
Age BMI Physically_Active Current_Smoker
Age 1.000 0.042 -0.053 -0.252
BMI 0.042 1.000 -0.055 -0.079
Physically_Active -0.053 -0.055 1.000 -0.158
Current_Smoker -0.252 -0.079 -0.158 1.000
# Visualize correlations
cor_matrix <- cor(bp_data[, c("Age", "BMI", "Physically_Active", "Current_Smoker")])

plot_ly(
  x = colnames(cor_matrix),
  y = rownames(cor_matrix),
  z = cor_matrix,
  type = "heatmap",
  colors = colorRamp(c("blue", "white", "red")),
  zmin = -1, zmax = 1
) %>%
  layout(title = "Correlation Heatmap of Predictors (Interactive)")

If any correlation is > 0.8, consider removing one of the variables! Our predictors are safely below this threshold.


Practice Activity

Your Turn: Analyze Cholesterol Data

Using what you’ve learned, analyze total cholesterol in NHANES:

# Prepare cholesterol data from NHANES
chol_data <- NHANES %>%
  filter(Age >= 21,
         !is.na(TotChol),            # Total cholesterol
         !is.na(BMI),
         !is.na(PhysActive),
         !is.na(SmokeNow)) %>%
  select(ID, TotChol, Age, BMI, PhysActive, SmokeNow) %>%
  mutate(
    Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
    Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
  ) %>%
  select(ID, TotChol, Age, BMI, Physically_Active, Current_Smoker) %>%
  slice_sample(n = 150) %>%
  arrange(ID)

chol_data %>%
  head(10) %>%
  kable(caption = "NHANES Cholesterol Data (First 10 Observations)",
        digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
NHANES Cholesterol Data (First 10 Observations)
ID TotChol Age BMI Physically_Active Current_Smoker
51691 5.0 57 20.7 1 0
51856 7.3 61 28.4 1 1
51905 3.7 52 29.0 0 0
52135 5.6 74 32.9 1 0
52259 4.4 25 23.5 1 1
52408 8.8 54 25.9 0 0
52484 5.8 31 24.1 1 1
52587 6.0 72 27.2 0 0
52647 5.1 43 29.7 1 1
52730 5.5 53 23.6 1 0

Research Question: What factors predict total cholesterol in US adults?

Tasks:

  1. Check the correlation matrix - any multicollinearity concerns?
  2. Create the design matrix X (with intercept)
  3. Calculate X’X manually
  4. Fit the model using lm()
  5. Compare your X’X calculation with what lm() uses
  6. Calculate VIF for each predictor
  7. Interpret the coefficients - which factor has the strongest effect on cholesterol?
# Your code here:

# 1. Correlation matrix
cor(chol_data %>% select(Age, BMI, Physically_Active, Current_Smoker))

# 2. Create X matrix
Y_practice <- chol_data$TotChol
X_practice <- chol_data %>%
  select(Age, BMI, Physically_Active, Current_Smoker) %>%
  mutate(intercept = 1) %>%
  select(intercept, everything()) %>%
  as.matrix()

# 3. Calculate X'X
XtX_practice <- t(X_practice) %*% X_practice

# 4. Fit model
model_practice <- lm(TotChol ~ Age + BMI + Physically_Active + Current_Smoker,
                     data = chol_data)

# 5. Compare - calculate beta manually
beta_practice <- solve(XtX_practice) %*% (t(X_practice) %*% Y_practice)

# 6. Check VIF
vif(model_practice)

# 7. Interpret coefficients
summary(model_practice)

Session Information

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NHANES_2.1.0     car_3.1-5        carData_3.0-6    plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       lubridate_1.9.4  forcats_1.0.1   
##  [9] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.1.6     
## [13] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6           xfun_0.56.2            bslib_0.10.0          
##  [4] htmlwidgets_1.6.4      tzdb_0.5.0             crosstalk_1.2.2       
##  [7] vctrs_0.7.1            tools_4.5.2            generics_0.1.4        
## [10] curl_7.0.0             pkgconfig_2.0.3        telegram.bot_3.0.2    
## [13] data.table_1.18.2.1    RColorBrewer_1.1-3     S7_0.2.1              
## [16] lifecycle_1.0.5        compiler_4.5.2         farver_2.1.2          
## [19] textshaping_1.0.4      httpuv_1.6.16          htmltools_0.5.9       
## [22] sass_0.4.10            yaml_2.3.12            lazyeval_0.2.2        
## [25] Formula_1.2-5          later_1.4.5            pillar_1.11.1         
## [28] jquerylib_0.1.4        openssl_2.3.4          cachem_1.1.0          
## [31] abind_1.4-8            tidyselect_1.2.1       digest_0.6.39         
## [34] stringi_1.8.7          fastmap_1.2.0          grid_4.5.2            
## [37] cli_3.6.5              magrittr_2.0.4         withr_3.0.2           
## [40] scales_1.4.0           promises_1.5.0         timechange_0.4.0      
## [43] rmarkdown_2.30         httr_1.4.7             otel_0.2.0            
## [46] askpass_1.2.1          hms_1.1.4              evaluate_1.0.5        
## [49] viridisLite_0.4.2      rlang_1.1.7            Rcpp_1.1.1            
## [52] glue_1.8.0             xml2_1.5.2             svglite_2.2.2         
## [55] rstudioapi_0.18.0      jsonlite_2.0.0         R6_2.6.1              
## [58] systemfonts_1.3.1.9000