By the end of this lecture, you will be able to:
A regression model is a system of linear equations in the unknown coefficients. Matrix algebra is the mathematical engine that allows regression analysis to scale from a simple two-variable problem to complex models with thousands of predictors. While you can calculate a simple linear regression using basic arithmetic, it quickly becomes inefficient and visually cluttered as you add more variables.
Writing the whole system as matrices lets us:
Without matrices, a multiple linear regression equation for \(n\) observations and \(k\) predictors would look like a giant list of individual equations:\[y_1 = \beta_0 + \beta_1x_{11} + \beta_2x_{12} + \dots + \beta_kx_{1k} + \epsilon_1\]\[y_2 = \beta_0 + \beta_1x_{21} + \beta_2x_{22} + \dots + \beta_kx_{2k} + \epsilon_2\]\[\vdots\]\[y_n = \beta_0 + \beta_1x_{n1} + \beta_2x_{n2} + \dots + \beta_kx_{nk} + \epsilon_n\]With matrix algebra, this entire system collapses into one elegant expression:\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]
Modern statistical software (like the survey or gtsummary packages you use in R) doesn’t “think” in terms of individual data points. It uses highly optimized linear algebra libraries (like BLAS or LAPACK) to perform these matrix inversions and multiplications. This is what allows you to process large datasets with complex survey designs in seconds rather than hours.
What factors predict systolic blood pressure in US adults, and how do they interact?
We need to control for multiple risk factors:
This is a multivariable problem - we need regression, not just correlation!
We’ll use the National Health and Nutrition Examination Survey (NHANES) - real data from US health surveys.
# Load and prepare NHANES data
# Focus on adults (21+) with complete blood pressure and predictor data
bp_data <- NHANES %>%
filter(Age >= 21, # Adults only
!is.na(BPSysAve), # Must have BP measured
!is.na(BMI), # Must have BMI
!is.na(PhysActive), # Physical activity status
!is.na(SmokeNow)) %>% # Current smoking status
select(ID, BPSysAve, Age, BMI, PhysActive, SmokeNow, Weight, Height) %>%
mutate(
# Convert logical to numeric for regression
Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
) %>%
select(ID, BPSysAve, Age, BMI, Physically_Active, Current_Smoker, Weight, Height) %>%
slice_sample(n = 200) %>% # Random sample of 200 for clarity
arrange(ID)
# Display first 10 rows
bp_data %>%
head(10) %>%
kable(caption = "NHANES Blood Pressure Study Data (First 10 Observations)",
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| ID | BPSysAve | Age | BMI | Physically_Active | Current_Smoker | Weight | Height |
|---|---|---|---|---|---|---|---|
| 51677 | 128 | 33 | 28.5 | 0 | 0 | 93.8 | 181.3 |
| 51741 | 120 | 21 | 37.3 | 0 | 1 | 103.5 | 166.5 |
| 51959 | 123 | 31 | 29.0 | 0 | 0 | 86.2 | 172.5 |
| 52080 | 108 | 30 | 21.8 | 1 | 1 | 73.7 | 184.0 |
| 52122 | 140 | 67 | 32.4 | 1 | 0 | 104.0 | 179.3 |
| 52314 | 98 | 25 | 20.9 | 1 | 0 | 56.4 | 164.2 |
| 52366 | 114 | 49 | 34.4 | 1 | 0 | 115.2 | 182.9 |
| 52444 | 129 | 71 | 23.7 | 0 | 0 | 66.9 | 168.1 |
| 52501 | 136 | 38 | 24.8 | 1 | 1 | 71.8 | 170.2 |
| 52524 | 110 | 54 | 25.7 | 0 | 0 | 67.3 | 161.7 |
What we’re looking at: BPSysAve is systolic blood pressure - values ≥140 mmHg indicate hypertension.
# Create a cleaner summary table
bp_data %>%
select(-ID) %>%
summarise(
across(everything(),
list(Mean = ~round(mean(., na.rm = TRUE), 2),
SD = ~round(sd(., na.rm = TRUE), 2),
Min = ~round(min(., na.rm = TRUE), 1),
Max = ~round(max(., na.rm = TRUE), 1)),
.names = "{.col}_{.fn}")
) %>%
pivot_longer(everything(),
names_to = c("Variable", "Statistic"),
names_sep = "_",
values_to = "Value") %>%
pivot_wider(names_from = Statistic, values_from = Value) %>%
kable(caption = "Summary Statistics of NHANES Blood Pressure Data") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | Mean | SD | Min | Max | Active | Smoker |
|---|---|---|---|---|---|---|
| BPSysAve | 121.86 | 17.92 | 91 | 212 | NULL | NULL |
| Age | 47.03 | 14.76 | 21 | 80 | NULL | NULL |
| BMI | 28.64 | 6.61 | 17.7 | 59.1 | NULL | NULL |
| Physically | NULL | NULL | NULL | NULL | 0.45, 0.50, 0.00, 1.00 | NULL |
| Current | NULL | NULL | NULL | NULL | NULL | 0.46, 0.50, 0.00, 1.00 |
| Weight | 83.59 | 21.09 | 47 | 172.5 | NULL | NULL |
| Height | 170.66 | 9.39 | 147 | 200.4 | NULL | NULL |
Key observations: Average systolic BP is around 120 mmHg (normal range), with considerable variation across participants.
# Color points by hypertension status (BP >= 140)
bp_data <- bp_data %>%
mutate(BP_Status = ifelse(BPSysAve >= 140, "Hypertensive", "Normal"))
plot_ly(bp_data,
x = ~BMI,
y = ~BPSysAve,
color = ~BP_Status,
colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
text = ~paste("ID:", ID,
"<br>BP:", round(BPSysAve, 1), "mmHg",
"<br>BMI:", round(BMI, 1),
"<br>Age:", Age,
"<br>Active:", ifelse(Physically_Active==1, "Yes", "No")),
type = 'scatter',
mode = 'markers',
marker = list(size = 8, opacity = 0.6)) %>%
layout(title = "BMI vs Systolic BP in NHANES (Interactive - hover for details)",
xaxis = list(title = "Body Mass Index (kg/m²)"),
yaxis = list(title = "Systolic Blood Pressure (mmHg)"))Question: Is higher BMI associated with higher blood pressure? Or are other factors confounding this?
Answer: We need multiple regression to control for age, physical activity, and smoking!
We want to fit this model:
\[\text{BP} = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{BMI}) + \beta_3(\text{Active}) + \beta_4(\text{Smoker}) + \epsilon\]
With 200 observations and 4 predictors, we have 200 equations with 5 unknowns (β₀, β₁, β₂, β₃, β₄):
\[120 = \beta_0 + 35\beta_1 + 28.5\beta_2 + 1\beta_3 + 0\beta_4 + \epsilon_1\] \[135 = \beta_0 + 52\beta_1 + 31.2\beta_2 + 0\beta_3 + 1\beta_4 + \epsilon_2\] \[\vdots\] \[115 = \beta_0 + 41\beta_1 + 24.8\beta_2 + 1\beta_3 + 0\beta_4 + \epsilon_{200}\]
Writing this by hand is impossible!
Solution: Matrix algebra lets us write this compactly as Y = Xβ
Instead of 200 separate equations, we write:
\[\underbrace{\mathbf{Y}}_{n \times 1} = \underbrace{\mathbf{X}}_{n \times (p+1)} \underbrace{\boldsymbol{\beta}}_{(p+1) \times 1} + \underbrace{\boldsymbol{\epsilon}}_{n \times 1}\]
Where:
# Create the Y vector (outcomes)
Y <- bp_data$BPSysAve
# Create the X matrix (predictors + intercept)
X <- bp_data %>%
select(Age, BMI, Physically_Active, Current_Smoker) %>%
mutate(intercept = 1) %>%
select(intercept, everything()) %>%
as.matrix()
# Show dimensions
tibble(
Matrix = c("Y (outcomes)", "X (predictors)", "β (coefficients)"),
Dimensions = c(
paste(length(Y), "× 1"),
paste(nrow(X), "×", ncol(X)),
paste(ncol(X), "× 1")
),
Interpretation = c(
"200 blood pressure measurements",
"200 observations × 5 columns (1 intercept + 4 predictors)",
"5 coefficients to estimate"
)
) %>%
kable(caption = "Matrix Dimensions for Regression") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Matrix | Dimensions | Interpretation |
|---|---|---|
| Y (outcomes) | 200 × 1 | 200 blood pressure measurements |
| X (predictors) | 200 × 5 | 200 observations × 5 columns (1 intercept + 4 predictors) |
| β (coefficients) | 5 × 1 | 5 coefficients to estimate |
# Show first 8 rows of X
X[1:8, ] %>%
as.data.frame() %>%
kable(caption = "Design Matrix X (First 8 Rows)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| intercept | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|
| 1 | 33 | 28.5 | 0 | 0 |
| 1 | 21 | 37.3 | 0 | 1 |
| 1 | 31 | 29.0 | 0 | 0 |
| 1 | 30 | 21.8 | 1 | 1 |
| 1 | 67 | 32.4 | 1 | 0 |
| 1 | 25 | 20.9 | 1 | 0 |
| 1 | 49 | 34.4 | 1 | 0 |
| 1 | 71 | 23.7 | 0 | 0 |
Key Point:
The first column of X is all 1’s (for the intercept β₀), and our binary variables (Physically_Active, Current_Smoker) are coded as 0/1.
In regression analysis, the intercept (\(\beta_0\)) represents the expected value of your outcome variable (Blood Pressure) when all your predictors (Age, BMI, etc.) are equal to zero. While “zero BMI” or “zero Age” isn’t physically possible, the intercept provides the essential starting point or “baseline” for the mathematical model.
Since the regression equation is solved by multiplying the X matrix by the \(\beta\) vector, we include a column of 1s because When you perform the matrix multiplication \(\mathbf{X}\boldsymbol{\beta}\), the first element of each row in \(\mathbf{X}\) is multiplied by \(\beta_0\). By making that element 1, you ensure that \(\beta_0\) is added to the predicted value exactly as it is (\(\beta_0 \times 1 = \beta_0\)).
When you need it: Calculating X’X and X’Y for regression
What it does: Flips rows and columns
# Small example with health data (first 3 NHANES participants)
small_example <- X[1:3, ]
rownames(small_example) <- paste("Person", 1:3)
small_example %>%
kable(caption = "Original Matrix (People × Variables)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| intercept | Age | BMI | Physically_Active | Current_Smoker | |
|---|---|---|---|---|---|
| Person 1 | 1 | 33 | 28.5 | 0 | 0 |
| Person 2 | 1 | 21 | 37.3 | 0 | 1 |
| Person 3 | 1 | 31 | 29.0 | 0 | 0 |
t(small_example) %>%
kable(caption = "Transposed Matrix (Variables × People)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Person 1 | Person 2 | Person 3 | |
|---|---|---|---|
| intercept | 1.0 | 1.0 | 1 |
| Age | 33.0 | 21.0 | 31 |
| BMI | 28.5 | 37.3 | 29 |
| Physically_Active | 0.0 | 0.0 | 0 |
| Current_Smoker | 0.0 | 1.0 | 0 |
Why this matters: X’X gives us sums of squares and cross-products we need for regression - notice how transposing flips the matrix so we can multiply X’X.
When you need it: This is the key step in calculating regression coefficients
Let’s look at the data with 200 observations in it (these values could be different depending on the random sample).
# Calculate X'X for our blood pressure data
XtX <- t(X) %*% X
XtX %>%
kable(caption = "X'X Matrix (Sum of Squares and Cross Products)",
digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| intercept | Age | BMI | Physically_Active | Current_Smoker | |
|---|---|---|---|---|---|
| intercept | 200 | 9407 | 5728 | 91 | 92 |
| Age | 9407 | 485821 | 270252 | 4203 | 3957 |
| BMI | 5728 | 270252 | 172770 | 2570 | 2583 |
| Physically_Active | 91 | 4203 | 2570 | 91 | 34 |
| Current_Smoker | 92 | 3957 | 2583 | 34 | 92 |
What do these numbers mean?
XtX[2,2] = Sum of (Age)²XtX[3,3] = Sum of (BMI)²XtX[2,3] = Sum of (Age × BMI)# Show what's in X'X
tibble(
Element = c("XtX[1,1]", "XtX[2,2]", "XtX[2,3]"),
Calculation = c(
"Sum of 1² (n observations)",
"Sum of (Age)²",
"Sum of (Age × BMI)"
),
Value = c(
XtX[1,1],
XtX[2,2],
XtX[2,3]
)
) %>%
kable(caption = "Interpreting X'X Elements", digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Element | Calculation | Value |
|---|---|---|
| XtX[1,1] | Sum of 1² (n observations) | 200 |
| XtX[2,2] | Sum of (Age)² | 485821 |
| XtX[2,3] | Sum of (Age × BMI) | 270252 |
# Verify one calculation
tibble(
Method = c("From X'X", "Manual calculation"),
Result = c(
XtX[2,2],
sum(bp_data$Age^2)
)
) %>%
kable(caption = "Verification: XtX[2,2] = Sum of (Age)²") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Method | Result |
|---|---|
| From X’X | 485821 |
| Manual calculation | 485821 |
Interpretation:
The diagonal shows variability in each predictor, off-diagonal shows how predictors co-vary.
When you need it: This is THE critical step for calculating β
What it does: The matrix equivalent of division
Important: Only works if X’X is not singular (determinant ≠ 0)
# Calculate the inverse
XtX_inv <- solve(XtX)
XtX_inv %>%
kable(caption = "(X'X)^(-1): The Inverse Matrix",
digits = 6) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| intercept | Age | BMI | Physically_Active | Current_Smoker | |
|---|---|---|---|---|---|
| intercept | 0.181171 | -0.001265 | -0.003393 | -0.017561 | -0.025021 |
| Age | -0.001265 | 0.000025 | -0.000001 | 0.000069 | 0.000195 |
| BMI | -0.003393 | -0.000001 | 0.000116 | 0.000103 | 0.000131 |
| Physically_Active | -0.017561 | 0.000069 | 0.000103 | 0.020970 | 0.003940 |
| Current_Smoker | -0.025021 | 0.000195 | 0.000131 | 0.003940 | 0.022345 |
# Verify: X'X × (X'X)^(-1) should equal the identity matrix
verification <- XtX %*% XtX_inv
verification %>%
round(10) %>% # Round to avoid floating point display issues
kable(caption = "Verification: X'X × (X'X)^(-1) = I (Identity Matrix)") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| intercept | Age | BMI | Physically_Active | Current_Smoker | |
|---|---|---|---|---|---|
| intercept | 1 | 0 | 0 | 0 | 0 |
| Age | 0 | 1 | 0 | 0 | 0 |
| BMI | 0 | 0 | 1 | 0 | 0 |
| Physically_Active | 0 | 0 | 0 | 1 | 0 |
| Current_Smoker | 0 | 0 | 0 | 0 | 1 |
The identity matrix has 1’s on diagonal, 0’s elsewhere - it’s like multiplying by 1!
Starting from Y = Xβ, we derive:
\[\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
This is how R calculates regression coefficients in
lm()!
# Step 1: Calculate X'Y
XtY <- t(X) %*% Y
data.frame(
Component = c("X'Y"),
Dimensions = paste(nrow(XtY), "× 1"),
Interpretation = "Cross products of predictors with outcome"
) %>%
kable(caption = "Step 1: Calculate X'Y") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Component | Dimensions | Interpretation |
|---|---|---|
| X’Y | 5 × 1 | Cross products of predictors with outcome |
XtY %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
rename(Value = V1) %>%
kable(caption = "X'Y Values", digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | Value |
|---|---|
| intercept | 24372 |
| Age | 1160222 |
| BMI | 700551 |
| Physically_Active | 11026 |
| Current_Smoker | 11231 |
| Coefficient | Estimate | Interpretation |
|---|---|---|
| β₀ (Intercept) | 96.41 | Baseline systolic BP (mmHg) |
| β₁ (Age) | 0.34 | Change per 1 year increase in age |
| β₂ (BMI) | 0.27 | Change per 1 unit increase in BMI |
| β₃ (Physically Active) | -0.03 | Difference for physically active individuals |
| β₄ (Current Smoker) | 3.24 | Difference for current smokers |
Does our manual matrix calculation match R’s lm()?
# Fit the same model using lm()
model <- lm(BPSysAve ~ Age + BMI + Physically_Active + Current_Smoker,
data = bp_data)
# Compare results
comparison <- data.frame(
Coefficient = c("Intercept", "Age", "BMI", "Physically_Active", "Current_Smoker"),
Matrix_Calculation = as.vector(beta_hat),
lm_Function = coef(model)
)
comparison %>%
kable(caption = "Comparison: Manual vs lm()",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Coefficient | Matrix_Calculation | lm_Function | |
|---|---|---|---|
| (Intercept) | Intercept | 96.4076 | 96.4076 |
| Age | Age | 0.3426 | 0.3426 |
| BMI | BMI | 0.2746 | 0.2746 |
| Physically_Active | Physically_Active | -0.0310 | -0.0310 |
| Current_Smoker | Current_Smoker | 3.2358 | 3.2358 |
They match perfectly! This proves that
lm() uses the matrix formula β = (X’X)^(-1)X’Y
internally.
| Predictor | Effect | Clinical_Meaning |
|---|---|---|
| Age |
|
Each additional year of age increases BP (aging effect) |
| BMI |
|
Each BMI unit increase raises BP (obesity-hypertension link) |
| Physically Active | -0.03 mmHg | Being physically active decreases BP (protective effect) |
| Current Smoker |
|
Current smoking increases BP (negative health behavior) |
Key finding: Age has a strong effect - each additional year is associated with a 0.34 mmHg increase in systolic BP, controlling for other factors.
# Add predictions to data
bp_data$predicted <- as.vector(X %*% beta_hat)
bp_data$residual <- bp_data$BPSysAve - bp_data$predicted
# Interactive plot
plot_ly(bp_data,
x = ~predicted,
y = ~BPSysAve,
color = ~BP_Status,
colors = c("Hypertensive" = "red", "Normal" = "steelblue"),
text = ~paste("ID:", ID,
"<br>Observed BP:", round(BPSysAve, 1), "mmHg",
"<br>Predicted:", round(predicted, 1), "mmHg",
"<br>Residual:", round(residual, 1),
"<br>Status:", BP_Status),
type = 'scatter',
mode = 'markers',
marker = list(size = 8, opacity = 0.6)) %>%
add_trace(x = c(min(bp_data$predicted), max(bp_data$predicted)),
y = c(min(bp_data$predicted), max(bp_data$predicted)),
type = 'scatter', mode = 'lines',
line = list(color = 'black', dash = 'dash'),
name = 'Perfect Prediction',
showlegend = TRUE,
inherit = FALSE) %>%
layout(title = "Predicted vs Observed Systolic BP",
xaxis = list(title = "Predicted Systolic BP (mmHg)"),
yaxis = list(title = "Observed Systolic BP (mmHg)"))Points close to the diagonal line are well-predicted by our model; points far from the line have large residuals (prediction errors).
Multicollinearity = When predictor variables are highly correlated with each other
Why it matters: It makes X’X singular (non-invertible), so we can’t calculate β!
Let’s try to include both Weight and BMI as predictors. Problem: BMI is calculated FROM weight and height!
# Check if we have the variables needed
bp_data_check <- bp_data %>%
filter(!is.na(Weight), !is.na(Height))
# BMI is weight(kg) / height(m)²
# Since we already have BMI, let's show the relationship
cor_check <- cor(bp_data_check[, c("Weight", "BMI", "Height")],
use = "complete.obs")
cor_check %>%
kable(caption = "Correlation Matrix: Weight, BMI, and Height",
digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Weight | BMI | Height | |
|---|---|---|---|
| Weight | 1.000 | 0.891 | 0.409 |
| BMI | 0.891 | 1.000 | -0.039 |
| Height | 0.409 | -0.039 | 1.000 |
# Try to fit regression with both Weight and BMI
X_bad <- bp_data_check %>%
select(Age, BMI, Weight, Physically_Active, Current_Smoker) %>%
mutate(intercept = 1) %>%
select(intercept, everything()) %>%
as.matrix()
# What happens to X'X?
XtX_bad <- t(X_bad) %*% X_bad
# Check if invertible
determinant_value <- det(XtX_bad)
data.frame(
Matrix = "X'X with BMI and Weight",
Determinant = format(determinant_value, scientific = TRUE, digits = 3),
Invertible = ifelse(abs(determinant_value) < 1e-10, "NO ❌", "Close to singular ⚠️")
) %>%
kable(caption = "Problem: Nearly Singular Matrix!") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Matrix | Determinant | Invertible |
|---|---|---|
| X’X with BMI and Weight | 3e+18 | Close to singular ⚠️ |
The issue: Weight and BMI are highly correlated (r = 0.89), making the matrix nearly singular.
# Try with lm() - R will detect the high multicollinearity
model_bad <- lm(BPSysAve ~ Age + BMI + Weight + Physically_Active + Current_Smoker,
data = bp_data_check)
coef(model_bad) %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
rename(Coefficient = ".") %>%
kable(caption = "lm() Output with Both BMI and Weight",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | Coefficient |
|---|---|
| (Intercept) | 95.1604 |
| Age | 0.3541 |
| BMI | -0.3409 |
| Weight | 0.2171 |
| Physically_Active | -0.0502 |
| Current_Smoker | 3.6560 |
# Check standard errors - they'll be inflated!
summary(model_bad)$coefficients %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
select(Variable, Estimate, `Std. Error`) %>%
kable(caption = "Notice the Large Standard Errors!",
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | Estimate | Std. Error |
|---|---|---|
| (Intercept) | 95.1604 | 7.3689 |
| Age | 0.3541 | 0.0862 |
| BMI | -0.3409 | 0.4085 |
| Weight | 0.2171 | 0.1284 |
| Physically_Active | -0.0502 | 2.4945 |
| Current_Smoker | 3.6560 | 2.5869 |
The problem: Coefficients are estimated, but standard errors are HUGE (unstable estimates) - this is what multicollinearity does!
# Check VIF (Variance Inflation Factor) for the problematic model
vif_values <- vif(model_bad)
data.frame(
Variable = names(vif_values),
VIF = vif_values,
Problem = ifelse(vif_values > 5, "⚠️ HIGH - Multicollinearity!", "✓ OK")
) %>%
kable(caption = "Variance Inflation Factors (VIF > 5 indicates multicollinearity)",
digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | VIF | Problem | |
|---|---|---|---|
| Age | Age | 1.09 | ✓ OK |
| BMI | BMI | 4.90 | ✓ OK |
| Weight | Weight | 4.91 | ✓ OK |
| Physically_Active | Physically_Active | 1.04 | ✓ OK |
| Current_Smoker | Current_Smoker | 1.12 | ✓ OK |
# Compare with the good model (without Weight)
vif_values_good <- vif(model)
data.frame(
Variable = names(vif_values_good),
VIF = vif_values_good,
Problem = ifelse(vif_values_good > 5, "⚠️ Concerning", "✓ OK")
) %>%
kable(caption = "VIF for Original Model (Without Weight) - All OK!",
digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | VIF | Problem | |
|---|---|---|---|
| Age | Age | 1.08 | ✓ OK |
| BMI | BMI | 1.01 | ✓ OK |
| Physically_Active | Physically_Active | 1.04 | ✓ OK |
| Current_Smoker | Current_Smoker | 1.11 | ✓ OK |
Rule of thumb: VIF > 5-10 indicates serious multicollinearity - the VIF for BMI and Weight in the bad model are extremely high!
| Scenario | Matrix_Concept_Involved | What_To_Do |
|---|---|---|
| Running multiple regression in R | β = (X’X)⁻¹X’Y is being calculated | Nothing - lm() handles it automatically |
| Error: ‘system is computationally singular’ | X’X is singular (can’t be inverted) | Remove redundant variables |
| Unstable coefficient estimates | X’X is nearly singular (multicollinearity) | Check VIF, remove highly correlated predictors |
| Understanding regression output | Understanding where coefficients come from | Know that lm() uses matrix algebra |
| Checking model assumptions | Residuals = Y - Xβ | Check correlation matrix of predictors |
| Operation | R_Code | When_You_Need_It |
|---|---|---|
| Transpose | t(X) | Calculating X’X or X’Y |
| Matrix multiplication | X %*% Y | Calculating regression components |
| Inverse | solve(X) | Solving for β coefficients |
| Correlation matrix | cor(data) | Checking for multicollinearity |
| Check multicollinearity | vif(model) | Diagnosing unstable models |
\[\boxed{\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}}\]
Translation: To get regression coefficients, we:
# Always check this before running regression!
bp_data %>%
select(Age, BMI, Physically_Active, Current_Smoker) %>%
cor() %>%
kable(caption = "Predictor Correlation Matrix - Check Before Regression!",
digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Age | BMI | Physically_Active | Current_Smoker | |
|---|---|---|---|---|
| Age | 1.000 | 0.042 | -0.053 | -0.252 |
| BMI | 0.042 | 1.000 | -0.055 | -0.079 |
| Physically_Active | -0.053 | -0.055 | 1.000 | -0.158 |
| Current_Smoker | -0.252 | -0.079 | -0.158 | 1.000 |
# Visualize correlations
cor_matrix <- cor(bp_data[, c("Age", "BMI", "Physically_Active", "Current_Smoker")])
plot_ly(
x = colnames(cor_matrix),
y = rownames(cor_matrix),
z = cor_matrix,
type = "heatmap",
colors = colorRamp(c("blue", "white", "red")),
zmin = -1, zmax = 1
) %>%
layout(title = "Correlation Heatmap of Predictors (Interactive)")If any correlation is > 0.8, consider removing one of the variables! Our predictors are safely below this threshold.
Using what you’ve learned, analyze total cholesterol in NHANES:
# Prepare cholesterol data from NHANES
chol_data <- NHANES %>%
filter(Age >= 21,
!is.na(TotChol), # Total cholesterol
!is.na(BMI),
!is.na(PhysActive),
!is.na(SmokeNow)) %>%
select(ID, TotChol, Age, BMI, PhysActive, SmokeNow) %>%
mutate(
Physically_Active = ifelse(PhysActive == "Yes", 1, 0),
Current_Smoker = ifelse(SmokeNow == "Yes", 1, 0)
) %>%
select(ID, TotChol, Age, BMI, Physically_Active, Current_Smoker) %>%
slice_sample(n = 150) %>%
arrange(ID)
chol_data %>%
head(10) %>%
kable(caption = "NHANES Cholesterol Data (First 10 Observations)",
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| ID | TotChol | Age | BMI | Physically_Active | Current_Smoker |
|---|---|---|---|---|---|
| 51691 | 5.0 | 57 | 20.7 | 1 | 0 |
| 51856 | 7.3 | 61 | 28.4 | 1 | 1 |
| 51905 | 3.7 | 52 | 29.0 | 0 | 0 |
| 52135 | 5.6 | 74 | 32.9 | 1 | 0 |
| 52259 | 4.4 | 25 | 23.5 | 1 | 1 |
| 52408 | 8.8 | 54 | 25.9 | 0 | 0 |
| 52484 | 5.8 | 31 | 24.1 | 1 | 1 |
| 52587 | 6.0 | 72 | 27.2 | 0 | 0 |
| 52647 | 5.1 | 43 | 29.7 | 1 | 1 |
| 52730 | 5.5 | 53 | 23.6 | 1 | 0 |
Research Question: What factors predict total cholesterol in US adults?
Tasks:
# Your code here:
# 1. Correlation matrix
cor(chol_data %>% select(Age, BMI, Physically_Active, Current_Smoker))
# 2. Create X matrix
Y_practice <- chol_data$TotChol
X_practice <- chol_data %>%
select(Age, BMI, Physically_Active, Current_Smoker) %>%
mutate(intercept = 1) %>%
select(intercept, everything()) %>%
as.matrix()
# 3. Calculate X'X
XtX_practice <- t(X_practice) %*% X_practice
# 4. Fit model
model_practice <- lm(TotChol ~ Age + BMI + Physically_Active + Current_Smoker,
data = chol_data)
# 5. Compare - calculate beta manually
beta_practice <- solve(XtX_practice) %*% (t(X_practice) %*% Y_practice)
# 6. Check VIF
vif(model_practice)
# 7. Interpret coefficients
summary(model_practice)## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] NHANES_2.1.0 car_3.1-5 carData_3.0-6 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 lubridate_1.9.5 forcats_1.0.1
## [9] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.1.6
## [13] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56.2 bslib_0.10.0
## [4] htmlwidgets_1.6.4 tzdb_0.5.0 crosstalk_1.2.2
## [7] vctrs_0.7.1 tools_4.5.2 generics_0.1.4
## [10] curl_7.0.0 pkgconfig_2.0.3 telegram.bot_3.0.2
## [13] data.table_1.18.2.1 RColorBrewer_1.1-3 S7_0.2.1
## [16] lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
## [19] textshaping_1.0.4 httpuv_1.6.16 htmltools_0.5.9
## [22] sass_0.4.10 yaml_2.3.12 lazyeval_0.2.2
## [25] Formula_1.2-5 later_1.4.5 pillar_1.11.1
## [28] jquerylib_0.1.4 openssl_2.3.4 cachem_1.1.0
## [31] abind_1.4-8 tidyselect_1.2.1 digest_0.6.39
## [34] stringi_1.8.7 fastmap_1.2.0 grid_4.5.2
## [37] cli_3.6.5 magrittr_2.0.4 withr_3.0.2
## [40] scales_1.4.0 promises_1.5.0 timechange_0.4.0
## [43] rmarkdown_2.30 httr_1.4.7 otel_0.2.0
## [46] askpass_1.2.1 hms_1.1.4 evaluate_1.0.5
## [49] viridisLite_0.4.2 rlang_1.1.7 Rcpp_1.1.1
## [52] glue_1.8.0 xml2_1.5.2 svglite_2.2.2
## [55] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1
## [58] systemfonts_1.3.1.9000