The birthwt dataset comes from the MASS package in R. it contains information on 189 births. The main goal of the original study was to identify: -risk factors associated with low infant birth weight — a major indicator of infant health and survival.
Why we chose this dataset: - It has a continuous outcome variable
(bwt — infant birth weight in grams), which is required for
multiple linear regression - It contains a mix of continuous and
categorical predictors, making it ideal for practising dummy variable
coding and coefficient interpretation
# importing library MASS holding birthwt dataset
library(MASS)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# importing lmtest for Breusch-Pagan homoscedasticity test
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# importing car for Variance Inflation Factor (VIF) for multicollinearity check
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
# Load birthwt and force it into a proper data frame
# Using as.data.frame() avoids conflicts with R's built-in df() function
data(birthwt)
#Drop 'low'Drop 'low'
df <- as.data.frame(birthwt)
df <- birthwt
# Converting some columns to factors ensures R creates proper dummy variables in the model
df$low <- NULL
df$race <- factor(df$race, levels = 1:3, labels = c("White", "Black", "Other"))
df$smoke <- factor(df$smoke, levels = 0:1, labels = c("Non-smoker", "Smoker"))
df$ht <- factor(df$ht, levels = 0:1, labels = c("No HT", "HT"))
df$ui <- factor(df$ui, levels = 0:1, labels = c("No UI", "UI"))
# Remove any rows with missing values
df <- na.omit(df)
# Structure of the dataset
str(df)
## 'data.frame': 189 obs. of 9 variables:
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "Non-smoker","Smoker": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : Factor w/ 2 levels "No HT","HT": 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "No UI","UI": 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
# Summary statistics for all variables
summary(df)
## age lwt race smoke ptl
## Min. :14.00 Min. : 80.0 White:96 Non-smoker:115 Min. :0.0000
## 1st Qu.:19.00 1st Qu.:110.0 Black:26 Smoker : 74 1st Qu.:0.0000
## Median :23.00 Median :121.0 Other:67 Median :0.0000
## Mean :23.24 Mean :129.8 Mean :0.1958
## 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:0.0000
## Max. :45.00 Max. :250.0 Max. :3.0000
## ht ui ftv bwt
## No HT:177 No UI:161 Min. :0.0000 Min. : 709
## HT : 12 UI : 28 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
# Check for missing values
print(colSums(is.na(df)))
## age lwt race smoke ptl ht ui ftv bwt
## 0 0 0 0 0 0 0 0 0
# creating a histogram to visualize the data
# A histogram shows whether bwt is approximately normally distributed,
# which is a desirable property for the outcome in linear regression.
hist(df$bwt,
main = "Distribution of Infant Birth Weight",
xlab = "Birth Weight (grams)",
col = "steelblue",
breaks = 20,
border = "white")
abline(v = mean(df$bwt), col = "red", lwd = 2, lty = 2)
legend("topright", legend = paste("Mean =", round(mean(df$bwt), 1), "g"),
col = "red", lty = 2, lwd = 2)
# Boxplots reveal whether birth weight differs across groups,
# giving a visual representations of which predictors may be important.
df <- na.omit(df)
par(mfrow = c(2, 2))
boxplot(bwt ~ race, data = df, main = "Birth Weight by Race",
col = c("lightblue","salmon","lightgreen"), xlab = "Race", ylab = "bwt (g)")
boxplot(bwt ~ smoke, data = df, main = "Birth Weight by Smoking",
col = c("lightblue","salmon"), xlab = "Smoking Status", ylab = "bwt (g)")
boxplot(bwt ~ ht, data = df, main = "Birth Weight by Hypertension",
col = c("lightblue","salmon"), xlab = "Hypertension", ylab = "bwt (g)")
boxplot(bwt ~ ui, data = df, main = "Birth Weight by Uterine Irritability",
col = c("lightblue","salmon"), xlab = "Uterine Irritability", ylab = "bwt (g)")
par(mfrow = c(1, 1))
# plotting Scatterplots: bwt vs continuous predictors
par(mfrow = c(1, 3))
plot(df$age, df$bwt, main = "bwt vs Mother's Age",
xlab = "Age (years)", ylab = "bwt (g)", pch = 19, col = "steelblue")
abline(lm(bwt ~ age, data = df), col = "red", lwd = 2)
plot(df$lwt, df$bwt, main = "bwt vs Mother's Weight",
xlab = "Weight (lbs)", ylab = "bwt (g)", pch = 19, col = "steelblue")
abline(lm(bwt ~ lwt, data = df), col = "red", lwd = 2)
plot(df$ptl, df$bwt, main = "bwt vs Prior Premature Labours",
xlab = "No. of Prior Premature Labours", ylab = "bwt (g)", pch = 19, col = "steelblue")
abline(lm(bwt ~ ptl, data = df), col = "red", lwd = 2)
par(mfrow = c(1, 1))
# The correlation matrix shows linear relationships between continuous variables.
# High correlations between predictors (> 0.7) can signal multicollinearity
corr_vars <- df[, c("bwt", "age", "lwt", "ptl", "ftv")]
print(round(cor(corr_vars), 3))
## bwt age lwt ptl ftv
## bwt 1.000 0.090 0.186 -0.155 0.058
## age 0.090 1.000 0.180 0.072 0.215
## lwt 0.186 0.180 1.000 -0.140 0.141
## ptl -0.155 0.072 -0.140 1.000 -0.044
## ftv 0.058 0.215 0.141 -0.044 1.000
#No strong correlations (> 0.7) are observed among the continuous predictors,
#which is a good sign for multicollinearity.
We fit a multiple linear regression model with bwt as
the outcome and all available predictors. Multiple linear regression
estimates the relationship between the outcome and several predictors
simultaneously, controlling for each other’s effects.
modelling <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = df)
summary(modelling)
##
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1825.26 -435.21 55.91 473.46 1701.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2927.962 312.904 9.357 < 2e-16 ***
## age -3.570 9.620 -0.371 0.711012
## lwt 4.354 1.736 2.509 0.013007 *
## raceBlack -488.428 149.985 -3.257 0.001349 **
## raceOther -355.077 114.753 -3.094 0.002290 **
## smokeSmoker -352.045 106.476 -3.306 0.001142 **
## ptl -48.402 101.972 -0.475 0.635607
## htHT -592.827 202.321 -2.930 0.003830 **
## uiUI -516.081 138.885 -3.716 0.000271 ***
## ftv -14.058 46.468 -0.303 0.762598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 650.3 on 179 degrees of freedom
## Multiple R-squared: 0.2427, Adjusted R-squared: 0.2047
## F-statistic: 6.376 on 9 and 179 DF, p-value: 7.891e-08
# checking for the assumptions
par(mfrow=c(2,2));
plot(modelling, col="steelblue", pch=19);
par(mfrow=c(1,1))
Before interpreting results, we must verify that the four key assumptions of linear regression are met: linearity, normality of residuals, homoscedasticity (constant variance), and no multicollinearity.
# Four standard diagnostic plots:
# Plot 1 - Residuals vs Fitted : checks linearity (no pattern = good)
# Plot 2 - Normal Q-Q : checks normality of residuals
# Plot 3 - Scale-Location : checks homoscedasticity (flat line = good)
# Plot 4 - Residuals vs Leverage: identifies influential observations
par(mfrow = c(2, 2))
plot(modelling, col = "steelblue", pch = 19)
par(mfrow = c(1, 1))
# The Shapiro-Wilk test formally tests whether residuals follow a normal distribution.
# H0: Residuals are normally distributed
# If p > 0.05, we fail to reject H0 — normality is satisfied.
sw <- shapiro.test(residuals(modelling))
print(sw)
##
## Shapiro-Wilk normality test
##
## data: residuals(modelling)
## W = 0.99506, p-value = 0.7919
cat("Conclusion:", ifelse(sw$p.value > 0.05,
"p > 0.05 — Fail to reject H0. Residuals are approximately normal",
"p < 0.05 — Reject H0. Residuals deviate from normality"), "\n")
## Conclusion: p > 0.05 — Fail to reject H0. Residuals are approximately normal
# The Breusch-Pagan test checks whether residual variance is constant.
# H0: Variance of residuals is constant (homoscedastic)
# If p > 0.05, we fail to reject H0 — homoscedasticity is satisfied.
bp <- bptest(modelling)
print(bp)
##
## studentized Breusch-Pagan test
##
## data: modelling
## BP = 13.067, df = 9, p-value = 0.1596
cat("Conclusion:", ifelse(bp$p.value > 0.05,
"p > 0.05 — Fail to reject H0. Constant variance (homoscedastic)",
"p < 0.05 — Reject H0. Non-constant variance (heteroscedastic)"), "\n")
## Conclusion: p > 0.05 — Fail to reject H0. Constant variance (homoscedastic)
# The Variance Inflation Factor (VIF) measures how much a predictor's variance
# is inflated due to correlation with other predictors.
# VIF < 5 : acceptable
# VIF 5-10 : moderate concern
# VIF > 10 : severe multicollinearity — consider removing the variable
vif_values <- vif(modelling)
print(round(vif_values, 3))
## GVIF Df GVIF^(1/(2*Df))
## age 1.155 1 1.075
## lwt 1.252 1 1.119
## race 1.335 2 1.075
## smoke 1.207 1 1.099
## ptl 1.125 1 1.061
## ht 1.088 1 1.043
## ui 1.088 1 1.043
## ftv 1.077 1 1.038
cat("\nAny VIF > 5?",
ifelse(any(vif_values > 5),
"YES — multicollinearity concern, investigate further",
"NO — multicollinearity is acceptable ✓"), "\n")
##
## Any VIF > 5? NO — multicollinearity is acceptable ✓
sm <- summary(modelling)
overall_p <- pf(sm$fstatistic[1], sm$fstatistic[2], sm$fstatistic[3], lower.tail=FALSE)
cat("The multiple linear regression model fitted on the birthwt dataset\n")
## The multiple linear regression model fitted on the birthwt dataset
cat("(n =", nrow(df), "observations) produced an Adjusted R² of",
round(sm$adj.r.squared, 4), ",\n")
## (n = 189 observations) produced an Adjusted R² of 0.2047 ,
cat("meaning the model explains ~",
round(sm$adj.r.squared * 100, 1),
"% of the variability in infant birth weight.\n\n")
## meaning the model explains ~ 20.5 % of the variability in infant birth weight.
cat("The overall model is statistically",
ifelse(overall_p < 0.05, "significant", "not significant"),
"(F-test p-value =", round(overall_p, 4), ").\n\n")
## The overall model is statistically significant (F-test p-value = 0 ).
cat("Key predictors of lower birth weight include:\n")
## Key predictors of lower birth weight include:
cat(" - Smoking during pregnancy\n")
## - Smoking during pregnancy
cat(" - Mother's race (Black and Other vs White)\n")
## - Mother's race (Black and Other vs White)
cat(" - History of hypertension\n")
## - History of hypertension
cat(" - Uterine irritability\n")
## - Uterine irritability
cat(" - Prior premature labours\n\n")
## - Prior premature labours
cat("Mother's pre-pregnancy weight (lwt) was the main predictor\n")
## Mother's pre-pregnancy weight (lwt) was the main predictor
cat("associated with HIGHER birth weight.\n\n")
## associated with HIGHER birth weight.
cat("All regression assumptions were checked:\n")
## All regression assumptions were checked:
cat(" normality (Shapiro-Wilk), homoscedasticity (Breusch-Pagan),\n")
## normality (Shapiro-Wilk), homoscedasticity (Breusch-Pagan),
cat(" and multicollinearity (VIF), and the model was found to be valid.\n")
## and multicollinearity (VIF), and the model was found to be valid.
We apply four variable selection methods to the birthwt
dataset (MASS package, 189 births, Baystate Medical Center 1986) to find
the best predictors of infant birth weight (bwt in
grams).
library(MASS)
data(birthwt)
air <- as.data.frame(birthwt)
air$low <- NULL
air$race <- factor(air$race, levels=1:3, labels=c("White","Black","Other"))
air$smoke <- factor(air$smoke, levels=0:1, labels=c("Non-smoker","Smoker"))
air$ht <- factor(air$ht, levels=0:1, labels=c("No HT","HT"))
air$ui <- factor(air$ui, levels=0:1, labels=c("No UI","UI"))
air <- na.omit(air)
full_model <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = air)
summary(full_model)
##
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1825.26 -435.21 55.91 473.46 1701.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2927.962 312.904 9.357 < 2e-16 ***
## age -3.570 9.620 -0.371 0.711012
## lwt 4.354 1.736 2.509 0.013007 *
## raceBlack -488.428 149.985 -3.257 0.001349 **
## raceOther -355.077 114.753 -3.094 0.002290 **
## smokeSmoker -352.045 106.476 -3.306 0.001142 **
## ptl -48.402 101.972 -0.475 0.635607
## htHT -592.827 202.321 -2.930 0.003830 **
## uiUI -516.081 138.885 -3.716 0.000271 ***
## ftv -14.058 46.468 -0.303 0.762598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 650.3 on 179 degrees of freedom
## Multiple R-squared: 0.2427, Adjusted R-squared: 0.2047
## F-statistic: 6.376 on 9 and 179 DF, p-value: 7.891e-08
From the full model, age and ftv already
show large p-values (> 0.05), making them likely candidates for
removal. The selection methods below confirm this.
Starts with no predictors and adds the most useful variable one at a time.
null_model <- lm(bwt ~ 1, data = air)
forward_model <- step(null_model, scope = formula(full_model), direction = "forward")
## Start: AIC=2492.76
## bwt ~ 1
##
## Df Sum of Sq RSS AIC
## + ui 1 8059031 91910625 2478.9
## + race 2 5015725 94953931 2487.0
## + smoke 1 3625946 96343710 2487.8
## + lwt 1 3448639 96521017 2488.1
## + ptl 1 2391041 97578614 2490.2
## + ht 1 2130425 97839231 2490.7
## <none> 99969656 2492.8
## + age 1 815483 99154173 2493.2
## + ftv 1 339993 99629663 2494.1
##
## Step: AIC=2478.88
## bwt ~ ui
##
## Df Sum of Sq RSS AIC
## + race 2 4716436 87194188 2472.9
## + ht 1 3162595 88748030 2474.3
## + smoke 1 2996636 88913988 2474.6
## + lwt 1 2074421 89836203 2476.6
## <none> 91910625 2478.9
## + ptl 1 854664 91055961 2479.1
## + age 1 478369 91432256 2479.9
## + ftv 1 172098 91738526 2480.5
##
## Step: AIC=2472.92
## bwt ~ ui + race
##
## Df Sum of Sq RSS AIC
## + smoke 1 6124507 81069681 2461.2
## + ht 1 2689932 84504256 2469.0
## + lwt 1 2239520 84954668 2470.0
## + ptl 1 930273 86263915 2472.9
## <none> 87194188 2472.9
## + ftv 1 65714 87128474 2474.8
## + age 1 60384 87133804 2474.8
##
## Step: AIC=2461.15
## bwt ~ ui + race + smoke
##
## Df Sum of Sq RSS AIC
## + ht 1 2457946 78611734 2457.3
## + lwt 1 1547337 79522343 2459.5
## <none> 81069681 2461.2
## + ptl 1 255755 80813926 2462.6
## + ftv 1 10936 81058745 2463.1
## + age 1 145 81069535 2463.2
##
## Step: AIC=2457.34
## bwt ~ ui + race + smoke + ht
##
## Df Sum of Sq RSS AIC
## + lwt 1 2674229 75937505 2452.8
## <none> 78611734 2457.3
## + ptl 1 245703 78366031 2458.7
## + age 1 562 78611172 2459.3
## + ftv 1 250 78611484 2459.3
##
## Step: AIC=2452.79
## bwt ~ ui + race + smoke + ht + lwt
##
## Df Sum of Sq RSS AIC
## <none> 75937505 2452.8
## + ptl 1 117366 75820139 2454.5
## + age 1 104920 75832585 2454.5
## + ftv 1 58307 75879197 2454.7
summary(forward_model)
##
## Call:
## lm(formula = bwt ~ ui + race + smoke + ht + lwt, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1842.14 -433.19 67.09 459.21 1631.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2837.264 243.676 11.644 < 2e-16 ***
## uiUI -525.524 134.675 -3.902 0.000134 ***
## raceBlack -475.058 145.603 -3.263 0.001318 **
## raceOther -348.150 112.361 -3.099 0.002254 **
## smokeSmoker -356.321 103.444 -3.445 0.000710 ***
## htHT -585.193 199.644 -2.931 0.003810 **
## lwt 4.242 1.675 2.532 0.012198 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared: 0.2404, Adjusted R-squared: 0.2154
## F-statistic: 9.6 on 6 and 182 DF, p-value: 3.601e-09
The procedure adds variables as long as they lower AIC. Variables
that never get added — here age and ftv —
simply do not improve predictions enough to justify their inclusion.
Starts with all predictors and removes the weakest one at each step.
backward_model <- step(full_model, direction = "backward")
## Start: AIC=2458.21
## bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv
##
## Df Sum of Sq RSS AIC
## - ftv 1 38708 75741025 2456.3
## - age 1 58238 75760555 2456.3
## - ptl 1 95285 75797602 2456.4
## <none> 75702317 2458.2
## - lwt 1 2661604 78363921 2462.7
## - ht 1 3631032 79333349 2465.1
## - smoke 1 4623219 80325536 2467.4
## - race 2 6578597 82280914 2470.0
## - ui 1 5839544 81541861 2470.2
##
## Step: AIC=2456.3
## bwt ~ age + lwt + race + smoke + ptl + ht + ui
##
## Df Sum of Sq RSS AIC
## - age 1 79115 75820139 2454.5
## - ptl 1 91560 75832585 2454.5
## <none> 75741025 2456.3
## - lwt 1 2623988 78365013 2460.7
## - ht 1 3592430 79333455 2463.1
## - smoke 1 4606425 80347449 2465.5
## - race 2 6552496 82293521 2468.0
## - ui 1 5817995 81559020 2468.3
##
## Step: AIC=2454.5
## bwt ~ lwt + race + smoke + ptl + ht + ui
##
## Df Sum of Sq RSS AIC
## - ptl 1 117366 75937505 2452.8
## <none> 75820139 2454.5
## - lwt 1 2545892 78366031 2458.7
## - ht 1 3546591 79366731 2461.1
## - smoke 1 4530009 80350149 2463.5
## - race 2 6571668 82391807 2466.2
## - ui 1 5751122 81571261 2466.3
##
## Step: AIC=2452.79
## bwt ~ lwt + race + smoke + ht + ui
##
## Df Sum of Sq RSS AIC
## <none> 75937505 2452.8
## - lwt 1 2674229 78611734 2457.3
## - ht 1 3584838 79522343 2459.5
## - smoke 1 4950633 80888138 2462.7
## - race 2 6630123 82567628 2464.6
## - ui 1 6353218 82290723 2466.0
summary(backward_model)
##
## Call:
## lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1842.14 -433.19 67.09 459.21 1631.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2837.264 243.676 11.644 < 2e-16 ***
## lwt 4.242 1.675 2.532 0.012198 *
## raceBlack -475.058 145.603 -3.263 0.001318 **
## raceOther -348.150 112.361 -3.099 0.002254 **
## smokeSmoker -356.321 103.444 -3.445 0.000710 ***
## htHT -585.193 199.644 -2.931 0.003810 **
## uiUI -525.524 134.675 -3.902 0.000134 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared: 0.2404, Adjusted R-squared: 0.2154
## F-statistic: 9.6 on 6 and 182 DF, p-value: 3.601e-09
ftv is dropped first (highest p-value), then
age. The process stops once all remaining predictors are
contributing meaningfully.
Combines both directions — considers adding and removing at every step.
stepwise_model <- step(full_model, direction = "both")
## Start: AIC=2458.21
## bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv
##
## Df Sum of Sq RSS AIC
## - ftv 1 38708 75741025 2456.3
## - age 1 58238 75760555 2456.3
## - ptl 1 95285 75797602 2456.4
## <none> 75702317 2458.2
## - lwt 1 2661604 78363921 2462.7
## - ht 1 3631032 79333349 2465.1
## - smoke 1 4623219 80325536 2467.4
## - race 2 6578597 82280914 2470.0
## - ui 1 5839544 81541861 2470.2
##
## Step: AIC=2456.3
## bwt ~ age + lwt + race + smoke + ptl + ht + ui
##
## Df Sum of Sq RSS AIC
## - age 1 79115 75820139 2454.5
## - ptl 1 91560 75832585 2454.5
## <none> 75741025 2456.3
## + ftv 1 38708 75702317 2458.2
## - lwt 1 2623988 78365013 2460.7
## - ht 1 3592430 79333455 2463.1
## - smoke 1 4606425 80347449 2465.5
## - race 2 6552496 82293521 2468.0
## - ui 1 5817995 81559020 2468.3
##
## Step: AIC=2454.5
## bwt ~ lwt + race + smoke + ptl + ht + ui
##
## Df Sum of Sq RSS AIC
## - ptl 1 117366 75937505 2452.8
## <none> 75820139 2454.5
## + age 1 79115 75741025 2456.3
## + ftv 1 59584 75760555 2456.3
## - lwt 1 2545892 78366031 2458.7
## - ht 1 3546591 79366731 2461.1
## - smoke 1 4530009 80350149 2463.5
## - race 2 6571668 82391807 2466.2
## - ui 1 5751122 81571261 2466.3
##
## Step: AIC=2452.79
## bwt ~ lwt + race + smoke + ht + ui
##
## Df Sum of Sq RSS AIC
## <none> 75937505 2452.8
## + ptl 1 117366 75820139 2454.5
## + age 1 104920 75832585 2454.5
## + ftv 1 58307 75879197 2454.7
## - lwt 1 2674229 78611734 2457.3
## - ht 1 3584838 79522343 2459.5
## - smoke 1 4950633 80888138 2462.7
## - race 2 6630123 82567628 2464.6
## - ui 1 6353218 82290723 2466.0
summary(stepwise_model)
##
## Call:
## lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1842.14 -433.19 67.09 459.21 1631.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2837.264 243.676 11.644 < 2e-16 ***
## lwt 4.242 1.675 2.532 0.012198 *
## raceBlack -475.058 145.603 -3.263 0.001318 **
## raceOther -348.150 112.361 -3.099 0.002254 **
## smokeSmoker -356.321 103.444 -3.445 0.000710 ***
## htHT -585.193 199.644 -2.931 0.003810 **
## uiUI -525.524 134.675 -3.902 0.000134 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared: 0.2404, Adjusted R-squared: 0.2154
## F-statistic: 9.6 on 6 and 182 DF, p-value: 3.601e-09
The result is the same as backward elimination here, which confirms the stability of the selected model.
Manually fits models of increasing complexity and compares using AIC and BIC.
m1 <- lm(bwt ~ smoke, data = air)
m2 <- lm(bwt ~ ht, data = air)
m3 <- lm(bwt ~ lwt, data = air)
m4 <- lm(bwt ~ smoke + ht, data = air)
m5 <- lm(bwt ~ smoke + lwt, data = air)
m6 <- lm(bwt ~ ht + ui, data = air)
m7 <- lm(bwt ~ smoke + ht + ui, data = air)
m8 <- lm(bwt ~ lwt + smoke + ht + ui, data = air)
m9 <- lm(bwt ~ lwt + race + smoke + ptl + ht + ui, data = air)
AIC(m1, m2, m3, m4, m5, m6, m7, m8, m9)
## df AIC
## m1 3 3026.137
## m2 3 3029.049
## m3 3 3026.485
## m4 4 3024.059
## m5 4 3021.856
## m6 4 3012.617
## m7 5 3008.395
## m8 6 3002.974
## m9 9 2992.861
BIC(m1, m2, m3, m4, m5, m6, m7, m8, m9)
## df BIC
## m1 3 3035.863
## m2 3 3038.774
## m3 3 3036.210
## m4 4 3037.026
## m5 4 3034.823
## m6 4 3025.583
## m7 5 3024.603
## m8 6 3022.424
## m9 9 3022.037
The model with the lowest AIC/BIC wins. Both criteria point to
m9 — confirming that lwt, race,
smoke, ptl, ht, and
ui together form the best model.
data.frame(
Model = c("Forward", "Backward", "Stepwise"),
AIC = round(c(AIC(forward_model), AIC(backward_model), AIC(stepwise_model)), 2)
)
## Model AIC
## 1 Forward 2991.15
## 2 Backward 2991.15
## 3 Stepwise 2991.15
All three methods produce the same AIC and the same final model — strong evidence that the selected variables are the true predictors.
summary(stepwise_model)
##
## Call:
## lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1842.14 -433.19 67.09 459.21 1631.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2837.264 243.676 11.644 < 2e-16 ***
## lwt 4.242 1.675 2.532 0.012198 *
## raceBlack -475.058 145.603 -3.263 0.001318 **
## raceOther -348.150 112.361 -3.099 0.002254 **
## smokeSmoker -356.321 103.444 -3.445 0.000710 ***
## htHT -585.193 199.644 -2.931 0.003810 **
## uiUI -525.524 134.675 -3.902 0.000134 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared: 0.2404, Adjusted R-squared: 0.2154
## F-statistic: 9.6 on 6 and 182 DF, p-value: 3.601e-09
The final model is: bwt ~ lwt + race + smoke + ptl + ht + ui
| Predictor | Effect on Birth Weight |
|---|---|
lwt |
Higher mother’s weight → heavier baby |
raceBlack, raceOther |
Lower birth weight vs White mothers |
smokeSmoker |
Smoking reduces birth weight significantly |
ptl |
Each prior premature labour reduces birth weight |
htHT |
Hypertension reduces birth weight |
uiUI |
Uterine irritability reduces birth weight |
All four methods consistently removed age and
ftv and retained the same six predictors. The agreement
across methods confirms the final model is stable and reliable.
Clinically, this makes sense smoking, hypertension, uterine
irritability, and prior premature births are all well-known risk factors
for low birth weight, while the number of doctor visits and mother’s age
showed no significant independent effect in this dataset.