Matrix Algebra for Regression Analysis

Learning Objectives

By the end of this lecture, you will be able to:

Understand why regression requires matrix algebra
Recognize the matrix form of regression: Y = Xβ
Calculate regression coefficients using β = (X’X)^(-1)X’Y
Identify multicollinearity problems through singular matrices
Interpret error messages like “matrix is singular”
Apply these concepts to real public health regression problems

Why Matrix Algebra?

A regression model is a system of linear equations in the unknown coefficients. Matrix algebra is the mathematical engine that allows regression analysis to scale from a simple two-variable problem to complex models with thousands of predictors. While you can calculate a simple linear regression using basic arithmetic, it quickly becomes inefficient and visually cluttered as you add more variables.

Writing the whole system as matrices lets us:

Express the model in one compact line.
Derive the least‑squares solution in closed form (the normal equations).
Compute statistics (variance, covariance, predictions, influence) with matrix formulas.
Solve large problems efficiently with numerical linear‑algebra routines.

Without matrices, a multiple linear regression equation for \(n\) observations and \(k\) predictors would look like a giant list of individual equations:\[y_1 = \beta_0 + \beta_1x_{11} + \beta_2x_{12} + \dots + \beta_kx_{1k} + \epsilon_1\]\[y_2 = \beta_0 + \beta_1x_{21} + \beta_2x_{22} + \dots + \beta_kx_{2k} + \epsilon_2\]\[\vdots\]\[y_n = \beta_0 + \beta_1x_{n1} + \beta_2x_{n2} + \dots + \beta_kx_{nk} + \epsilon_n\]With matrix algebra, this entire system collapses into one elegant expression:\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

Modern statistical software (like the survey or gtsummary packages you use in R) doesn’t “think” in terms of individual data points. It uses highly optimized linear algebra libraries (like BLAS or LAPACK) to perform these matrix inversions and multiplications. This is what allows you to process large datasets with complex survey designs in seconds rather than hours.

A Real Public Health Problem

The Research Question

What factors predict systolic blood pressure in US adults, and how do they interact?

We need to control for multiple risk factors:

Age (blood pressure increases with age)
BMI (obesity is a major hypertension risk factor)
Physical activity (exercise lowers blood pressure)
Smoking status (affects cardiovascular health)

This is a multivariable problem - we need regression, not just correlation!

The Dataset: NHANES

We’ll use the National Health and Nutrition Examination Survey (NHANES) - real data from US health surveys.

# Load and prepare NHANES data
# Focus on adults (21+) with complete blood pressure and predictor data
bp_data <- NHANES %>%
  filter(Age >= 21,                    # Adults only
         !is.na(BPSysAve),             # Must have BP measured
         !is.na(BMI),                  # Must have BMI
         !is.na(PhysActive),           # Physical activity status
         !is.na(SmokeNow)) %>%         # Current smoking status
  select(ID, BPSysAve, Age, BMI, PhysActive, SmokeNow, Weight, Height) %>%
  mutate(
    # Convert logical to numeric for regression
    Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
    Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
  ) %>%
  select(ID, BPSysAve, Age, BMI, Physically_Active, Current_Smoker, Weight, Height) %>%
  slice_sample(n = 200) %>%            # Random sample of 200 for clarity
  arrange(ID)

# Display first 10 rows
bp_data %>%
  head(10) %>%
  kable(caption = "NHANES Blood Pressure Study Data (First 10 Observations)",
        digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

NHANES Blood Pressure Study Data (First 10 Observations)
ID	BPSysAve	Age	BMI	Physically_Active	Current_Smoker	Weight	Height
51677	128	33	28.5	0	0	93.8	181.3
51741	120	21	37.3	0	1	103.5	166.5
51959	123	31	29.0	0	0	86.2	172.5
52080	108	30	21.8	1	1	73.7	184.0
52122	140	67	32.4	1	0	104.0	179.3
52314	98	25	20.9	1	0	56.4	164.2
52366	114	49	34.4	1	0	115.2	182.9
52444	129	71	23.7	0	0	66.9	168.1
52501	136	38	24.8	1	1	71.8	170.2
52524	110	54	25.7	0	0	67.3	161.7

What we’re looking at: BPSysAve is systolic blood pressure - values ≥140 mmHg indicate hypertension.

Summary Statistics

# Create a cleaner summary table
bp_data %>%
  select(-ID) %>%
  summarise(
    across(everything(),
           list(Mean = ~round(mean(., na.rm = TRUE), 2),
                SD = ~round(sd(., na.rm = TRUE), 2),
                Min = ~round(min(., na.rm = TRUE), 1),
                Max = ~round(max(., na.rm = TRUE), 1)),
           .names = "{.col}_{.fn}")
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "_",
               values_to = "Value") %>%
  pivot_wider(names_from = Statistic, values_from = Value) %>%
  kable(caption = "Summary Statistics of NHANES Blood Pressure Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Summary Statistics of NHANES Blood Pressure Data
Variable	Mean	SD	Min	Max	Active	Smoker
BPSysAve	121.86	17.92	91	212	NULL	NULL
Age	47.03	14.76	21	80	NULL	NULL
BMI	28.64	6.61	17.7	59.1	NULL	NULL
Physically	NULL	NULL	NULL	NULL	0.45, 0.50, 0.00, 1.00	NULL
Current	NULL	NULL	NULL	NULL	NULL	0.46, 0.50, 0.00, 1.00
Weight	83.59	21.09	47	172.5	NULL	NULL
Height	170.66	9.39	147	200.4	NULL	NULL

Key observations: Average systolic BP is around 120 mmHg (normal range), with considerable variation across participants.

Interactive Exploration

# Color points by hypertension status (BP >= 140)
bp_data <- bp_data %>%
  mutate(BP_Status = ifelse(BPSysAve >= 140, "Hypertensive", "Normal"))

plot_ly(bp_data,
        x = ~BMI,
        y = ~BPSysAve,
        color = ~BP_Status,
        colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
        text = ~paste("ID:", ID,
                     "<br>BP:", round(BPSysAve, 1), "mmHg",
                     "<br>BMI:", round(BMI, 1),
                     "<br>Age:", Age,
                     "<br>Active:", ifelse(Physically_Active==1, "Yes", "No")),
        type = 'scatter',
        mode = 'markers',
        marker = list(size = 8, opacity = 0.6)) %>%
  layout(title = "BMI vs Systolic BP in NHANES (Interactive - hover for details)",
         xaxis = list(title = "Body Mass Index (kg/m²)"),
         yaxis = list(title = "Systolic Blood Pressure (mmHg)"))

Question: Is higher BMI associated with higher blood pressure? Or are other factors confounding this?

Answer: We need multiple regression to control for age, physical activity, and smoking!

The Regression Problem: Why We Need Matrices

What We Want to Estimate

We want to fit this model:

\[\text{BP} = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{BMI}) + \beta_3(\text{Active}) + \beta_4(\text{Smoker}) + \epsilon\]

With 200 observations and 4 predictors, we have 200 equations with 5 unknowns (β₀, β₁, β₂, β₃, β₄):

\[120 = \beta_0 + 35\beta_1 + 28.5\beta_2 + 1\beta_3 + 0\beta_4 + \epsilon_1\] \[135 = \beta_0 + 52\beta_1 + 31.2\beta_2 + 0\beta_3 + 1\beta_4 + \epsilon_2\] \[\vdots\] \[115 = \beta_0 + 41\beta_1 + 24.8\beta_2 + 1\beta_3 + 0\beta_4 + \epsilon_{200}\]

Writing this by hand is impossible!

Solution: Matrix algebra lets us write this compactly as Y = Xβ

Matrix Notation for Regression

The Compact Form

Instead of 200 separate equations, we write:

\[\underbrace{\mathbf{Y}}_{n \times 1} = \underbrace{\mathbf{X}}_{n \times (p+1)} \underbrace{\boldsymbol{\beta}}_{(p+1) \times 1} + \underbrace{\boldsymbol{\epsilon}}_{n \times 1}\]

Where:

Y = vector of outcomes (blood pressure values)
X = matrix of predictors (including intercept column)
β = vector of coefficients we want to estimate
ε = vector of errors

What This Looks Like With Real Data

# Create the Y vector (outcomes)
Y <- bp_data$BPSysAve

# Create the X matrix (predictors + intercept)
X <- bp_data %>%
  select(Age, BMI, Physically_Active, Current_Smoker) %>%
  mutate(intercept = 1) %>%
  select(intercept, everything()) %>%
  as.matrix()

# Show dimensions
tibble(
  Matrix = c("Y (outcomes)", "X (predictors)", "β (coefficients)"),
  Dimensions = c(
    paste(length(Y), "× 1"),
    paste(nrow(X), "×", ncol(X)),
    paste(ncol(X), "× 1")
  ),
  Interpretation = c(
    "200 blood pressure measurements",
    "200 observations × 5 columns (1 intercept + 4 predictors)",
    "5 coefficients to estimate"
  )
) %>%
  kable(caption = "Matrix Dimensions for Regression") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Matrix Dimensions for Regression
Matrix	Dimensions	Interpretation
Y (outcomes)	200 × 1	200 blood pressure measurements
X (predictors)	200 × 5	200 observations × 5 columns (1 intercept + 4 predictors)
β (coefficients)	5 × 1	5 coefficients to estimate

# Show first 8 rows of X
X[1:8, ] %>%
  as.data.frame() %>%
  kable(caption = "Design Matrix X (First 8 Rows)", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Design Matrix X (First 8 Rows)
intercept	Age	BMI	Physically_Active	Current_Smoker
1	33	28.5	0	0
1	21	37.3	0	1
1	31	29.0	0	0
1	30	21.8	1	1
1	67	32.4	1	0
1	25	20.9	1	0
1	49	34.4	1	0
1	71	23.7	0	0

Key Point:

The first column of X is all 1’s (for the intercept β₀), and our binary variables (Physically_Active, Current_Smoker) are coded as 0/1.

In regression analysis, the intercept (\(\beta_0\)) represents the expected value of your outcome variable (Blood Pressure) when all your predictors (Age, BMI, etc.) are equal to zero. While “zero BMI” or “zero Age” isn’t physically possible, the intercept provides the essential starting point or “baseline” for the mathematical model.

Since the regression equation is solved by multiplying the X matrix by the \(\beta\) vector, we include a column of 1s because When you perform the matrix multiplication \(\mathbf{X}\boldsymbol{\beta}\), the first element of each row in \(\mathbf{X}\) is multiplied by \(\beta_0\). By making that element 1, you ensure that \(\beta_0\) is added to the predicted value exactly as it is (\(\beta_0 \times 1 = \beta_0\)).

Matrix Operations You Need to Know

1. Matrix Transpose (X’)

When you need it: Calculating X’X and X’Y for regression

What it does: Flips rows and columns

# Small example with health data (first 3 NHANES participants)
small_example <- X[1:3, ]
rownames(small_example) <- paste("Person", 1:3)

small_example %>%
  kable(caption = "Original Matrix (People × Variables)", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Original Matrix (People × Variables)
	intercept	Age	BMI	Current_Smoker
Person 1	1	33	28.5	0
Person 2	1	21	37.3	1
Person 3	1	31	29.0	0

t(small_example) %>%
  kable(caption = "Transposed Matrix (Variables × People)", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Transposed Matrix (Variables × People)
	Person 1	Person 2	Person 3
intercept	1.0	1.0	1
Age	33.0	21.0	31
BMI	28.5	37.3	29
Physically_Active	0.0	0.0	0
Current_Smoker	0.0	1.0	0

Why this matters: X’X gives us sums of squares and cross-products we need for regression - notice how transposing flips the matrix so we can multiply X’X.

2. Matrix Multiplication (X’X)

When you need it: This is the key step in calculating regression coefficients

Let’s look at the data with 200 observations in it (these values could be different depending on the random sample).

# Calculate X'X for our blood pressure data
XtX <- t(X) %*% X

XtX %>%
  kable(caption = "X'X Matrix (Sum of Squares and Cross Products)",
        digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

X’X Matrix (Sum of Squares and Cross Products)
	intercept	Age	BMI	Physically_Active	Current_Smoker
intercept	200	9407	5728	91	92
Age	9407	485821	270252	4203	3957
BMI	5728	270252	172770	2570	2583
Physically_Active	91	4203	2570	91	34
Current_Smoker	92	3957	2583	34	92

What do these numbers mean?

Diagonal: Sum of squares for each variable
- XtX[2,2] = Sum of (Age)²
- XtX[3,3] = Sum of (BMI)²
Off-diagonal: Cross products between variables
- XtX[2,3] = Sum of (Age × BMI)

# Show what's in X'X
tibble(
  Element = c("XtX[1,1]", "XtX[2,2]", "XtX[2,3]"),
  Calculation = c(
    "Sum of 1² (n observations)",
    "Sum of (Age)²",
    "Sum of (Age × BMI)"
  ),
  Value = c(
    XtX[1,1],
    XtX[2,2],
    XtX[2,3]
  )
) %>%
  kable(caption = "Interpreting X'X Elements", digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Interpreting X’X Elements
Element	Calculation	Value
XtX[1,1]	Sum of 1² (n observations)	200
XtX[2,2]	Sum of (Age)²	485821
XtX[2,3]	Sum of (Age × BMI)	270252

# Verify one calculation
tibble(
  Method = c("From X'X", "Manual calculation"),
  Result = c(
    XtX[2,2],
    sum(bp_data$Age^2)
  )
) %>%
  kable(caption = "Verification: XtX[2,2] = Sum of (Age)²") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Verification: XtX[2,2] = Sum of (Age)²
Method	Result
From X’X	485821
Manual calculation	485821

Interpretation:

The diagonal shows variability in each predictor, off-diagonal shows how predictors co-vary.

3. Matrix Inverse: (X’X)^(-1)

When you need it: This is THE critical step for calculating β

What it does: The matrix equivalent of division

Important: Only works if X’X is not singular (determinant ≠ 0)

# Calculate the inverse
XtX_inv <- solve(XtX)

XtX_inv %>%
  kable(caption = "(X'X)^(-1): The Inverse Matrix",
        digits = 6) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

(X’X)^(-1): The Inverse Matrix
	intercept	Age	BMI	Physically_Active	Current_Smoker
intercept	0.181171	-0.001265	-0.003393	-0.017561	-0.025021
Age	-0.001265	0.000025	-0.000001	0.000069	0.000195
BMI	-0.003393	-0.000001	0.000116	0.000103	0.000131
Physically_Active	-0.017561	0.000069	0.000103	0.020970	0.003940
Current_Smoker	-0.025021	0.000195	0.000131	0.003940	0.022345

# Verify: X'X × (X'X)^(-1) should equal the identity matrix
verification <- XtX %*% XtX_inv

verification %>%
  round(10) %>%  # Round to avoid floating point display issues
  kable(caption = "Verification: X'X × (X'X)^(-1) = I (Identity Matrix)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Verification: X’X × (X’X)^(-1) = I (Identity Matrix)
	intercept	Age	BMI	Physically_Active	Current_Smoker
intercept	1	0	0	0	0
Age	0	1	0	0	0
BMI	0	0	1	0	0
Physically_Active	0	0	0	1	0
Current_Smoker	0	0	0	0	1

The identity matrix has 1’s on diagonal, 0’s elsewhere - it’s like multiplying by 1!

The Magic Formula: Calculating Beta

The Normal Equations

Starting from Y = Xβ, we derive:

\[\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

This is how R calculates regression coefficients in lm()!

Step-by-Step Calculation

# Step 1: Calculate X'Y
XtY <- t(X) %*% Y

data.frame(
  Component = c("X'Y"),
  Dimensions = paste(nrow(XtY), "× 1"),
  Interpretation = "Cross products of predictors with outcome"
) %>%
  kable(caption = "Step 1: Calculate X'Y") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Step 1: Calculate X’Y
Component	Dimensions	Interpretation
X’Y	5 × 1	Cross products of predictors with outcome

XtY %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  rename(Value = V1) %>%
  kable(caption = "X'Y Values", digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

X’Y Values
Variable	Value
intercept	24372
Age	1160222
BMI	700551
Physically_Active	11026
Current_Smoker	11231

# Step 2: Calculate beta = (X'X)^(-1) × X'Y
beta_hat <- XtX_inv %*% XtY

Estimated Regression Coefficients (Manual Matrix Calculation)
Coefficient	Estimate	Interpretation
β₀ (Intercept)	96.41	Baseline systolic BP (mmHg)
β₁ (Age)	0.34	Change per 1 year increase in age
β₂ (BMI)	0.27	Change per 1 unit increase in BMI
β₃ (Physically Active)	-0.03	Difference for physically active individuals
β₄ (Current Smoker)	3.24	Difference for current smokers

Verification with lm()

Does our manual matrix calculation match R’s lm()?

# Fit the same model using lm()
model <- lm(BPSysAve ~ Age + BMI + Physically_Active + Current_Smoker,
            data = bp_data)

# Compare results
comparison <- data.frame(
  Coefficient = c("Intercept", "Age", "BMI", "Physically_Active", "Current_Smoker"),
  Matrix_Calculation = as.vector(beta_hat),
  lm_Function = coef(model)
)

comparison %>%
  kable(caption = "Comparison: Manual vs lm()",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Comparison: Manual vs lm()
	Coefficient	Matrix_Calculation	lm_Function
(Intercept)	Intercept	96.4076	96.4076
Age	Age	0.3426	0.3426
BMI	BMI	0.2746	0.2746
Physically_Active	Physically_Active	-0.0310	-0.0310
Current_Smoker	Current_Smoker	3.2358	3.2358

They match perfectly! This proves that lm() uses the matrix formula β = (X’X)^(-1)X’Y internally.

Interpreting the Results

What Do These Coefficients Tell Us?

Clinical Interpretation of Regression Coefficients
Predictor	Effect	Clinical_Meaning
Age	0.34 mmHg	Each additional year of age increases BP (aging effect)
BMI	0.27 mmHg	Each BMI unit increase raises BP (obesity-hypertension link)
Physically Active	-0.03 mmHg	Being physically active decreases BP (protective effect)
Current Smoker	3.24 mmHg	Current smoking increases BP (negative health behavior)

Key finding: Age has a strong effect - each additional year is associated with a 0.34 mmHg increase in systolic BP, controlling for other factors.

Predicted vs Observed

# Add predictions to data
bp_data$predicted <- as.vector(X %*% beta_hat)
bp_data$residual <- bp_data$BPSysAve - bp_data$predicted

# Interactive plot
plot_ly(bp_data,
        x = ~predicted,
        y = ~BPSysAve,
        color = ~BP_Status,
        colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
        text = ~paste("ID:", ID,
                     "<br>Observed BP:", round(BPSysAve, 1), "mmHg",
                     "<br>Predicted:", round(predicted, 1), "mmHg",
                     "<br>Residual:", round(residual, 1),
                     "<br>Status:", BP_Status),
        type = 'scatter',
        mode = 'markers',
        marker = list(size = 8, opacity = 0.6)) %>%
  add_trace(x = c(min(bp_data$predicted), max(bp_data$predicted)),
            y = c(min(bp_data$predicted), max(bp_data$predicted)),
            type = 'scatter', mode = 'lines',
            line = list(color = 'black', dash = 'dash'),
            name = 'Perfect Prediction',
            showlegend = TRUE,
            inherit = FALSE) %>%
  layout(title = "Predicted vs Observed Systolic BP",
         xaxis = list(title = "Predicted Systolic BP (mmHg)"),
         yaxis = list(title = "Observed Systolic BP (mmHg)"))

Points close to the diagonal line are well-predicted by our model; points far from the line have large residuals (prediction errors).

When Things Go Wrong: Multicollinearity

What is Multicollinearity?

Multicollinearity = When predictor variables are highly correlated with each other

Why it matters: It makes X’X singular (non-invertible), so we can’t calculate β!

Real Example: The Redundant Variable Problem

Let’s try to include both Weight and BMI as predictors. Problem: BMI is calculated FROM weight and height!

# Check if we have the variables needed
bp_data_check <- bp_data %>%
  filter(!is.na(Weight), !is.na(Height))

# BMI is weight(kg) / height(m)²
# Since we already have BMI, let's show the relationship
cor_check <- cor(bp_data_check[, c("Weight", "BMI", "Height")],
                 use = "complete.obs")

cor_check %>%
  kable(caption = "Correlation Matrix: Weight, BMI, and Height",
        digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Correlation Matrix: Weight, BMI, and Height
	Weight	BMI	Height
Weight	1.000	0.891	0.409
BMI	0.891	1.000	-0.039
Height	0.409	-0.039	1.000

# Try to fit regression with both Weight and BMI
X_bad <- bp_data_check %>%
  select(Age, BMI, Weight, Physically_Active, Current_Smoker) %>%
  mutate(intercept = 1) %>%
  select(intercept, everything()) %>%
  as.matrix()

# What happens to X'X?
XtX_bad <- t(X_bad) %*% X_bad

# Check if invertible
determinant_value <- det(XtX_bad)

data.frame(
  Matrix = "X'X with BMI and Weight",
  Determinant = format(determinant_value, scientific = TRUE, digits = 3),
  Invertible = ifelse(abs(determinant_value) < 1e-10, "NO ❌", "Close to singular ⚠️")
) %>%
  kable(caption = "Problem: Nearly Singular Matrix!") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Problem: Nearly Singular Matrix!
Matrix	Determinant	Invertible
X’X with BMI and Weight	3e+18	Close to singular ⚠️

The issue: Weight and BMI are highly correlated (r = 0.89), making the matrix nearly singular.

What Happens When You Try?

# Try with lm() - R will detect the high multicollinearity
model_bad <- lm(BPSysAve ~ Age + BMI + Weight + Physically_Active + Current_Smoker,
                data = bp_data_check)

coef(model_bad) %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  rename(Coefficient = ".") %>%
  kable(caption = "lm() Output with Both BMI and Weight",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

lm() Output with Both BMI and Weight
Variable	Coefficient
(Intercept)	95.1604
Age	0.3541
BMI	-0.3409
Weight	0.2171
Physically_Active	-0.0502
Current_Smoker	3.6560

# Check standard errors - they'll be inflated!
summary(model_bad)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  select(Variable, Estimate, `Std. Error`) %>%
  kable(caption = "Notice the Large Standard Errors!",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Notice the Large Standard Errors!
Variable	Estimate	Std. Error
(Intercept)	95.1604	7.3689
Age	0.3541	0.0862
BMI	-0.3409	0.4085
Weight	0.2171	0.1284
Physically_Active	-0.0502	2.4945
Current_Smoker	3.6560	2.5869

The problem: Coefficients are estimated, but standard errors are HUGE (unstable estimates) - this is what multicollinearity does!

Diagnosing Multicollinearity with VIF

# Check VIF (Variance Inflation Factor) for the problematic model
vif_values <- vif(model_bad)

data.frame(
  Variable = names(vif_values),
  VIF = vif_values,
  Problem = ifelse(vif_values > 5, "⚠️ HIGH - Multicollinearity!", "✓ OK")
) %>%
  kable(caption = "Variance Inflation Factors (VIF > 5 indicates multicollinearity)",
        digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Variance Inflation Factors (VIF > 5 indicates multicollinearity)
	Variable	VIF	Problem
Age	Age	1.09	✓ OK
BMI	BMI	4.90	✓ OK
Weight	Weight	4.91	✓ OK
Physically_Active	Physically_Active	1.04	✓ OK
Current_Smoker	Current_Smoker	1.12	✓ OK

# Compare with the good model (without Weight)
vif_values_good <- vif(model)

data.frame(
  Variable = names(vif_values_good),
  VIF = vif_values_good,
  Problem = ifelse(vif_values_good > 5, "⚠️ Concerning", "✓ OK")
) %>%
  kable(caption = "VIF for Original Model (Without Weight) - All OK!",
        digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

VIF for Original Model (Without Weight) - All OK!
	Variable	VIF	Problem
Age	Age	1.08	✓ OK
BMI	BMI	1.01	✓ OK
Physically_Active	Physically_Active	1.04	✓ OK
Current_Smoker	Current_Smoker	1.11	✓ OK

Rule of thumb: VIF > 5-10 indicates serious multicollinearity - the VIF for BMI and Weight in the bad model are extremely high!

Practical Takeaways for Your Research

When You’ll Use This

When You’ll Encounter These Concepts
Scenario	Matrix_Concept_Involved	What_To_Do
Running multiple regression in R	β = (X’X)⁻¹X’Y is being calculated	Nothing - lm() handles it automatically
Error: ‘system is computationally singular’	X’X is singular (can’t be inverted)	Remove redundant variables
Unstable coefficient estimates	X’X is nearly singular (multicollinearity)	Check VIF, remove highly correlated predictors
Understanding regression output	Understanding where coefficients come from	Know that lm() uses matrix algebra
Checking model assumptions	Residuals = Y - Xβ	Check correlation matrix of predictors

Quick Reference: Matrix Operations in R

Essential R Commands for Matrix Operations
Operation	R_Code	When_You_Need_It
Transpose	t(X)	Calculating X’X or X’Y
Matrix multiplication	X %*% Y	Calculating regression components
Inverse	solve(X)	Solving for β coefficients
Correlation matrix	cor(data)	Checking for multicollinearity
Check multicollinearity	vif(model)	Diagnosing unstable models

Summary: The Big Picture

What We Learned

Multivariable regression requires solving many equations simultaneously
Matrix notation (Y = Xβ) makes this manageable
The solution is β = (X’X)^(-1)X’Y
Multicollinearity breaks the inversion step
R does this automatically, but knowing the math helps you troubleshoot!

The Formula You Should Remember

\[\boxed{\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}}\]

Translation: To get regression coefficients, we:

Calculate X’X (sum of squares and cross products)
Invert it (if possible - requires no multicollinearity)
Multiply by X’Y (cross products with outcome)

Key Diagnostic: Check Your Correlation Matrix

# Always check this before running regression!
bp_data %>%
  select(Age, BMI, Physically_Active, Current_Smoker) %>%
  cor() %>%
  kable(caption = "Predictor Correlation Matrix - Check Before Regression!",
        digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Predictor Correlation Matrix - Check Before Regression!
	Age	BMI	Physically_Active	Current_Smoker
Age	1.000	0.042	-0.053	-0.252
BMI	0.042	1.000	-0.055	-0.079
Physically_Active	-0.053	-0.055	1.000	-0.158
Current_Smoker	-0.252	-0.079	-0.158	1.000

# Visualize correlations
cor_matrix <- cor(bp_data[, c("Age", "BMI", "Physically_Active", "Current_Smoker")])

plot_ly(
  x = colnames(cor_matrix),
  y = rownames(cor_matrix),
  z = cor_matrix,
  type = "heatmap",
  colors = colorRamp(c("blue", "white", "red")),
  zmin = -1, zmax = 1
) %>%
  layout(title = "Correlation Heatmap of Predictors (Interactive)")

If any correlation is > 0.8, consider removing one of the variables! Our predictors are safely below this threshold.

Practice Activity

Your Turn: Analyze Cholesterol Data

Using what you’ve learned, analyze total cholesterol in NHANES:

# Prepare cholesterol data from NHANES
chol_data <- NHANES %>%
  filter(Age >= 21,
         !is.na(TotChol),            # Total cholesterol
         !is.na(BMI),
         !is.na(PhysActive),
         !is.na(SmokeNow)) %>%
  select(ID, TotChol, Age, BMI, PhysActive, SmokeNow) %>%
  mutate(
    Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
    Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
  ) %>%
  select(ID, TotChol, Age, BMI, Physically_Active, Current_Smoker) %>%
  slice_sample(n = 150) %>%
  arrange(ID)

chol_data %>%
  head(10) %>%
  kable(caption = "NHANES Cholesterol Data (First 10 Observations)",
        digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

NHANES Cholesterol Data (First 10 Observations)
ID	TotChol	Age	BMI	Physically_Active	Current_Smoker
51691	5.0	57	20.7	1	0
51856	7.3	61	28.4	1	1
51905	3.7	52	29.0	0	0
52135	5.6	74	32.9	1	0
52259	4.4	25	23.5	1	1
52408	8.8	54	25.9	0	0
52484	5.8	31	24.1	1	1
52587	6.0	72	27.2	0	0
52647	5.1	43	29.7	1	1
52730	5.5	53	23.6	1	0

Research Question: What factors predict total cholesterol in US adults?

Tasks:

Check the correlation matrix - any multicollinearity concerns?
Create the design matrix X (with intercept)
Calculate X’X manually
Fit the model using lm()
Compare your X’X calculation with what lm() uses
Calculate VIF for each predictor
Interpret the coefficients - which factor has the strongest effect on cholesterol?

# Your code here:

# 1. Correlation matrix
cor(chol_data %>% select(Age, BMI, Physically_Active, Current_Smoker))

# 2. Create X matrix
Y_practice <- chol_data$TotChol
X_practice <- chol_data %>%
  select(Age, BMI, Physically_Active, Current_Smoker) %>%
  mutate(intercept = 1) %>%
  select(intercept, everything()) %>%
  as.matrix()

# 3. Calculate X'X
XtX_practice <- t(X_practice) %*% X_practice

# 4. Fit model
model_practice <- lm(TotChol ~ Age + BMI + Physically_Active + Current_Smoker,
                     data = chol_data)

# 5. Compare - calculate beta manually
beta_practice <- solve(XtX_practice) %*% (t(X_practice) %*% Y_practice)

# 6. Check VIF
vif(model_practice)

# 7. Interpret coefficients
summary(model_practice)

Session Information

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NHANES_2.1.0     car_3.1-5        carData_3.0-6    plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       lubridate_1.9.5  forcats_1.0.1   
##  [9] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.1.6     
## [13] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6           xfun_0.56.2            bslib_0.10.0          
##  [4] htmlwidgets_1.6.4      tzdb_0.5.0             crosstalk_1.2.2       
##  [7] vctrs_0.7.1            tools_4.5.2            generics_0.1.4        
## [10] curl_7.0.0             pkgconfig_2.0.3        telegram.bot_3.0.2    
## [13] data.table_1.18.2.1    RColorBrewer_1.1-3     S7_0.2.1              
## [16] lifecycle_1.0.5        compiler_4.5.2         farver_2.1.2          
## [19] textshaping_1.0.4      httpuv_1.6.16          htmltools_0.5.9       
## [22] sass_0.4.10            yaml_2.3.12            lazyeval_0.2.2        
## [25] Formula_1.2-5          later_1.4.5            pillar_1.11.1         
## [28] jquerylib_0.1.4        openssl_2.3.4          cachem_1.1.0          
## [31] abind_1.4-8            tidyselect_1.2.1       digest_0.6.39         
## [34] stringi_1.8.7          fastmap_1.2.0          grid_4.5.2            
## [37] cli_3.6.5              magrittr_2.0.4         withr_3.0.2           
## [40] scales_1.4.0           promises_1.5.0         timechange_0.4.0      
## [43] rmarkdown_2.30         httr_1.4.7             otel_0.2.0            
## [46] askpass_1.2.1          hms_1.1.4              evaluate_1.0.5        
## [49] viridisLite_0.4.2      rlang_1.1.7            Rcpp_1.1.1            
## [52] glue_1.8.0             xml2_1.5.2             svglite_2.2.2         
## [55] rstudioapi_0.18.0      jsonlite_2.0.0         R6_2.6.1              
## [58] systemfonts_1.3.1.9000