Assignment on Multiple Linear regression

HW1: Find a data set of which you can fit multiple linear regression and interpret your results

The birthwt dataset comes from the MASS package in R. it contains information on 189 births. The main goal of the original study was to identify: -risk factors associated with low infant birth weight — a major indicator of infant health and survival.

Why we chose this dataset: - It has a continuous outcome variable (bwt — infant birth weight in grams), which is required for multiple linear regression - It contains a mix of continuous and categorical predictors, making it ideal for practising dummy variable coding and coefficient interpretation

importing all the needed libraries

# importing library MASS holding birthwt dataset
library(MASS)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# importing lmtest for Breusch-Pagan homoscedasticity test
library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

# importing car for Variance Inflation Factor (VIF) for multicollinearity check
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Loading the dataset

# Load birthwt and force it into a proper data frame
# Using as.data.frame() avoids conflicts with R's built-in df() function
data(birthwt)
#Drop 'low'Drop 'low'
df <- as.data.frame(birthwt)
df <- birthwt

Convert categorical variables to factors

# Converting some columns to factors ensures R creates proper dummy variables in the model

df$low   <- NULL

df$race  <- factor(df$race,  levels = 1:3, labels = c("White", "Black", "Other"))
df$smoke <- factor(df$smoke, levels = 0:1, labels = c("Non-smoker", "Smoker"))
df$ht    <- factor(df$ht,    levels = 0:1, labels = c("No HT", "HT"))
df$ui    <- factor(df$ui,    levels = 0:1, labels = c("No UI", "UI"))

# Remove any rows with missing values
df <- na.omit(df)

Performing Exploratory Data Analysis

# Structure of the dataset
str(df)

## 'data.frame':    189 obs. of  9 variables:
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "Non-smoker","Smoker": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : Factor w/ 2 levels "No HT","HT": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "No UI","UI": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

# Summary statistics for all variables
summary(df)

##       age             lwt           race           smoke          ptl        
##  Min.   :14.00   Min.   : 80.0   White:96   Non-smoker:115   Min.   :0.0000  
##  1st Qu.:19.00   1st Qu.:110.0   Black:26   Smoker    : 74   1st Qu.:0.0000  
##  Median :23.00   Median :121.0   Other:67                    Median :0.0000  
##  Mean   :23.24   Mean   :129.8                               Mean   :0.1958  
##  3rd Qu.:26.00   3rd Qu.:140.0                               3rd Qu.:0.0000  
##  Max.   :45.00   Max.   :250.0                               Max.   :3.0000  
##      ht          ui           ftv              bwt      
##  No HT:177   No UI:161   Min.   :0.0000   Min.   : 709  
##  HT   : 12   UI   : 28   1st Qu.:0.0000   1st Qu.:2414  
##                          Median :0.0000   Median :2977  
##                          Mean   :0.7937   Mean   :2945  
##                          3rd Qu.:1.0000   3rd Qu.:3487  
##                          Max.   :6.0000   Max.   :4990

# Check for missing values
print(colSums(is.na(df)))

##   age   lwt  race smoke   ptl    ht    ui   ftv   bwt 
##     0     0     0     0     0     0     0     0     0

Distribution of Birth Weight

# creating a histogram to visualize the data 
# A histogram shows whether bwt is approximately normally distributed,
# which is a desirable property for the outcome in linear regression.
hist(df$bwt,
     main  = "Distribution of Infant Birth Weight",
     xlab  = "Birth Weight (grams)",
     col   = "steelblue",
     breaks = 20,
     border = "white")
abline(v = mean(df$bwt), col = "red", lwd = 2, lty = 2)
legend("topright", legend = paste("Mean =", round(mean(df$bwt), 1), "g"),
       col = "red", lty = 2, lwd = 2)

Boxplots: Birth Weight by Categorical Predictors

# Boxplots reveal whether birth weight differs across groups,
# giving a visual representations of which predictors may be important.
df <- na.omit(df)
par(mfrow = c(2, 2))
boxplot(bwt ~ race,  data = df, main = "Birth Weight by Race",
        col = c("lightblue","salmon","lightgreen"), xlab = "Race", ylab = "bwt (g)")
boxplot(bwt ~ smoke, data = df, main = "Birth Weight by Smoking",
        col = c("lightblue","salmon"), xlab = "Smoking Status", ylab = "bwt (g)")
boxplot(bwt ~ ht,    data = df, main = "Birth Weight by Hypertension",
        col = c("lightblue","salmon"), xlab = "Hypertension", ylab = "bwt (g)")
boxplot(bwt ~ ui,    data = df, main = "Birth Weight by Uterine Irritability",
        col = c("lightblue","salmon"), xlab = "Uterine Irritability", ylab = "bwt (g)")

par(mfrow = c(1, 1))

Scatterplots: Birth Weight vs Continuous Predictors

# plotting Scatterplots: bwt vs continuous predictors 
par(mfrow = c(1, 3))
plot(df$age, df$bwt, main = "bwt vs Mother's Age",
     xlab = "Age (years)", ylab = "bwt (g)", pch = 19, col = "steelblue")
abline(lm(bwt ~ age, data = df), col = "red", lwd = 2)

plot(df$lwt, df$bwt, main = "bwt vs Mother's Weight",
     xlab = "Weight (lbs)", ylab = "bwt (g)", pch = 19, col = "steelblue")
abline(lm(bwt ~ lwt, data = df), col = "red", lwd = 2)

plot(df$ptl, df$bwt, main = "bwt vs Prior Premature Labours",
     xlab = "No. of Prior Premature Labours", ylab = "bwt (g)", pch = 19, col = "steelblue")
abline(lm(bwt ~ ptl, data = df), col = "red", lwd = 2)

par(mfrow = c(1, 1))

Correlation Matrix

# The correlation matrix shows linear relationships between continuous variables.
# High correlations between predictors (> 0.7) can signal multicollinearity
corr_vars <- df[, c("bwt", "age", "lwt", "ptl", "ftv")]
print(round(cor(corr_vars), 3))

##        bwt   age    lwt    ptl    ftv
## bwt  1.000 0.090  0.186 -0.155  0.058
## age  0.090 1.000  0.180  0.072  0.215
## lwt  0.186 0.180  1.000 -0.140  0.141
## ptl -0.155 0.072 -0.140  1.000 -0.044
## ftv  0.058 0.215  0.141 -0.044  1.000

#No strong correlations (> 0.7) are observed among the continuous predictors,
#which is a good sign for multicollinearity.

Model Fitting

We fit a multiple linear regression model with bwt as the outcome and all available predictors. Multiple linear regression estimates the relationship between the outcome and several predictors simultaneously, controlling for each other’s effects.

modelling <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = df)
summary(modelling)

## 
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1825.26  -435.21    55.91   473.46  1701.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2927.962    312.904   9.357  < 2e-16 ***
## age           -3.570      9.620  -0.371 0.711012    
## lwt            4.354      1.736   2.509 0.013007 *  
## raceBlack   -488.428    149.985  -3.257 0.001349 ** 
## raceOther   -355.077    114.753  -3.094 0.002290 ** 
## smokeSmoker -352.045    106.476  -3.306 0.001142 ** 
## ptl          -48.402    101.972  -0.475 0.635607    
## htHT        -592.827    202.321  -2.930 0.003830 ** 
## uiUI        -516.081    138.885  -3.716 0.000271 ***
## ftv          -14.058     46.468  -0.303 0.762598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 650.3 on 179 degrees of freedom
## Multiple R-squared:  0.2427, Adjusted R-squared:  0.2047 
## F-statistic: 6.376 on 9 and 179 DF,  p-value: 7.891e-08

# checking for the assumptions
par(mfrow=c(2,2)); 
plot(modelling, col="steelblue", pch=19);

par(mfrow=c(1,1))

Assumption Checking

Before interpreting results, we must verify that the four key assumptions of linear regression are met: linearity, normality of residuals, homoscedasticity (constant variance), and no multicollinearity.

Diagnostic Plots (Linearity & Homoscedasticity)

# Four standard diagnostic plots:
# Plot 1 - Residuals vs Fitted : checks linearity (no pattern = good)
# Plot 2 - Normal Q-Q          : checks normality of residuals
# Plot 3 - Scale-Location      : checks homoscedasticity (flat line = good)
# Plot 4 - Residuals vs Leverage: identifies influential observations

par(mfrow = c(2, 2))
plot(modelling, col = "steelblue", pch = 19)

par(mfrow = c(1, 1))

Normality Test (Shapiro-Wilk)

# The Shapiro-Wilk test formally tests whether residuals follow a normal distribution.
# H0: Residuals are normally distributed
# If p > 0.05, we fail to reject H0 — normality is satisfied.

sw <- shapiro.test(residuals(modelling))
print(sw)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(modelling)
## W = 0.99506, p-value = 0.7919

cat("Conclusion:", ifelse(sw$p.value > 0.05,
    "p > 0.05 — Fail to reject H0. Residuals are approximately normal",
    "p < 0.05 — Reject H0. Residuals deviate from normality"), "\n")

## Conclusion: p > 0.05 — Fail to reject H0. Residuals are approximately normal

Homoscedasticity Test (Breusch-Pagan)

# The Breusch-Pagan test checks whether residual variance is constant.
# H0: Variance of residuals is constant (homoscedastic)
# If p > 0.05, we fail to reject H0 — homoscedasticity is satisfied.

bp <- bptest(modelling)
print(bp)

## 
##  studentized Breusch-Pagan test
## 
## data:  modelling
## BP = 13.067, df = 9, p-value = 0.1596

cat("Conclusion:", ifelse(bp$p.value > 0.05,
    "p > 0.05 — Fail to reject H0. Constant variance (homoscedastic)",
    "p < 0.05 — Reject H0. Non-constant variance (heteroscedastic)"), "\n")

## Conclusion: p > 0.05 — Fail to reject H0. Constant variance (homoscedastic)

Multicollinearity Check (VIF)

# The Variance Inflation Factor (VIF) measures how much a predictor's variance
# is inflated due to correlation with other predictors.
# VIF < 5  : acceptable
# VIF 5-10 : moderate concern
# VIF > 10 : severe multicollinearity — consider removing the variable

vif_values <- vif(modelling)
print(round(vif_values, 3))

##        GVIF Df GVIF^(1/(2*Df))
## age   1.155  1           1.075
## lwt   1.252  1           1.119
## race  1.335  2           1.075
## smoke 1.207  1           1.099
## ptl   1.125  1           1.061
## ht    1.088  1           1.043
## ui    1.088  1           1.043
## ftv   1.077  1           1.038

cat("\nAny VIF > 5?",
    ifelse(any(vif_values > 5),
           "YES — multicollinearity concern, investigate further",
           "NO — multicollinearity is acceptable ✓"), "\n")

## 
## Any VIF > 5? NO — multicollinearity is acceptable ✓

Conclusion

sm <- summary(modelling)
overall_p <- pf(sm$fstatistic[1], sm$fstatistic[2], sm$fstatistic[3], lower.tail=FALSE)

cat("The multiple linear regression model fitted on the birthwt dataset\n")

## The multiple linear regression model fitted on the birthwt dataset

cat("(n =", nrow(df), "observations) produced an Adjusted R² of",
    round(sm$adj.r.squared, 4), ",\n")

## (n = 189 observations) produced an Adjusted R² of 0.2047 ,

cat("meaning the model explains ~",
    round(sm$adj.r.squared * 100, 1),
    "% of the variability in infant birth weight.\n\n")

## meaning the model explains ~ 20.5 % of the variability in infant birth weight.

cat("The overall model is statistically",
    ifelse(overall_p < 0.05, "significant", "not significant"),
    "(F-test p-value =", round(overall_p, 4), ").\n\n")

## The overall model is statistically significant (F-test p-value = 0 ).

cat("Key predictors of lower birth weight include:\n")

## Key predictors of lower birth weight include:

cat("  - Smoking during pregnancy\n")

##   - Smoking during pregnancy

cat("  - Mother's race (Black and Other vs White)\n")

##   - Mother's race (Black and Other vs White)

cat("  - History of hypertension\n")

##   - History of hypertension

cat("  - Uterine irritability\n")

##   - Uterine irritability

cat("  - Prior premature labours\n\n")

##   - Prior premature labours

cat("Mother's pre-pregnancy weight (lwt) was the main predictor\n")

## Mother's pre-pregnancy weight (lwt) was the main predictor

cat("associated with HIGHER birth weight.\n\n")

## associated with HIGHER birth weight.

cat("All regression assumptions were checked:\n")

## All regression assumptions were checked:

cat("  normality (Shapiro-Wilk), homoscedasticity (Breusch-Pagan),\n")

##   normality (Shapiro-Wilk), homoscedasticity (Breusch-Pagan),

cat("  and multicollinearity (VIF), and the model was found to be valid.\n")

##   and multicollinearity (VIF), and the model was found to be valid.

HW2: Read about variable selection methods

We apply four variable selection methods to the birthwt dataset (MASS package, 189 births, Baystate Medical Center 1986) to find the best predictors of infant birth weight (bwt in grams).

library(MASS)

data(birthwt)
air <- as.data.frame(birthwt)
air$low   <- NULL
air$race  <- factor(air$race,  levels=1:3, labels=c("White","Black","Other"))
air$smoke <- factor(air$smoke, levels=0:1, labels=c("Non-smoker","Smoker"))
air$ht    <- factor(air$ht,    levels=0:1, labels=c("No HT","HT"))
air$ui    <- factor(air$ui,    levels=0:1, labels=c("No UI","UI"))
air       <- na.omit(air)

full_model <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = air)
summary(full_model)

## 
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1825.26  -435.21    55.91   473.46  1701.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2927.962    312.904   9.357  < 2e-16 ***
## age           -3.570      9.620  -0.371 0.711012    
## lwt            4.354      1.736   2.509 0.013007 *  
## raceBlack   -488.428    149.985  -3.257 0.001349 ** 
## raceOther   -355.077    114.753  -3.094 0.002290 ** 
## smokeSmoker -352.045    106.476  -3.306 0.001142 ** 
## ptl          -48.402    101.972  -0.475 0.635607    
## htHT        -592.827    202.321  -2.930 0.003830 ** 
## uiUI        -516.081    138.885  -3.716 0.000271 ***
## ftv          -14.058     46.468  -0.303 0.762598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 650.3 on 179 degrees of freedom
## Multiple R-squared:  0.2427, Adjusted R-squared:  0.2047 
## F-statistic: 6.376 on 9 and 179 DF,  p-value: 7.891e-08

From the full model, age and ftv already show large p-values (> 0.05), making them likely candidates for removal. The selection methods below confirm this.

Method 1: Forward Selection

Starts with no predictors and adds the most useful variable one at a time.

null_model    <- lm(bwt ~ 1, data = air)
forward_model <- step(null_model, scope = formula(full_model), direction = "forward")

## Start:  AIC=2492.76
## bwt ~ 1
## 
##         Df Sum of Sq      RSS    AIC
## + ui     1   8059031 91910625 2478.9
## + race   2   5015725 94953931 2487.0
## + smoke  1   3625946 96343710 2487.8
## + lwt    1   3448639 96521017 2488.1
## + ptl    1   2391041 97578614 2490.2
## + ht     1   2130425 97839231 2490.7
## <none>               99969656 2492.8
## + age    1    815483 99154173 2493.2
## + ftv    1    339993 99629663 2494.1
## 
## Step:  AIC=2478.88
## bwt ~ ui
## 
##         Df Sum of Sq      RSS    AIC
## + race   2   4716436 87194188 2472.9
## + ht     1   3162595 88748030 2474.3
## + smoke  1   2996636 88913988 2474.6
## + lwt    1   2074421 89836203 2476.6
## <none>               91910625 2478.9
## + ptl    1    854664 91055961 2479.1
## + age    1    478369 91432256 2479.9
## + ftv    1    172098 91738526 2480.5
## 
## Step:  AIC=2472.92
## bwt ~ ui + race
## 
##         Df Sum of Sq      RSS    AIC
## + smoke  1   6124507 81069681 2461.2
## + ht     1   2689932 84504256 2469.0
## + lwt    1   2239520 84954668 2470.0
## + ptl    1    930273 86263915 2472.9
## <none>               87194188 2472.9
## + ftv    1     65714 87128474 2474.8
## + age    1     60384 87133804 2474.8
## 
## Step:  AIC=2461.15
## bwt ~ ui + race + smoke
## 
##        Df Sum of Sq      RSS    AIC
## + ht    1   2457946 78611734 2457.3
## + lwt   1   1547337 79522343 2459.5
## <none>              81069681 2461.2
## + ptl   1    255755 80813926 2462.6
## + ftv   1     10936 81058745 2463.1
## + age   1       145 81069535 2463.2
## 
## Step:  AIC=2457.34
## bwt ~ ui + race + smoke + ht
## 
##        Df Sum of Sq      RSS    AIC
## + lwt   1   2674229 75937505 2452.8
## <none>              78611734 2457.3
## + ptl   1    245703 78366031 2458.7
## + age   1       562 78611172 2459.3
## + ftv   1       250 78611484 2459.3
## 
## Step:  AIC=2452.79
## bwt ~ ui + race + smoke + ht + lwt
## 
##        Df Sum of Sq      RSS    AIC
## <none>              75937505 2452.8
## + ptl   1    117366 75820139 2454.5
## + age   1    104920 75832585 2454.5
## + ftv   1     58307 75879197 2454.7

summary(forward_model)

## 
## Call:
## lm(formula = bwt ~ ui + race + smoke + ht + lwt, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1842.14  -433.19    67.09   459.21  1631.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2837.264    243.676  11.644  < 2e-16 ***
## uiUI        -525.524    134.675  -3.902 0.000134 ***
## raceBlack   -475.058    145.603  -3.263 0.001318 ** 
## raceOther   -348.150    112.361  -3.099 0.002254 ** 
## smokeSmoker -356.321    103.444  -3.445 0.000710 ***
## htHT        -585.193    199.644  -2.931 0.003810 ** 
## lwt            4.242      1.675   2.532 0.012198 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared:  0.2404, Adjusted R-squared:  0.2154 
## F-statistic:   9.6 on 6 and 182 DF,  p-value: 3.601e-09

The procedure adds variables as long as they lower AIC. Variables that never get added — here age and ftv — simply do not improve predictions enough to justify their inclusion.

Method 2: Backward Elimination

Starts with all predictors and removes the weakest one at each step.

backward_model <- step(full_model, direction = "backward")

## Start:  AIC=2458.21
## bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv
## 
##         Df Sum of Sq      RSS    AIC
## - ftv    1     38708 75741025 2456.3
## - age    1     58238 75760555 2456.3
## - ptl    1     95285 75797602 2456.4
## <none>               75702317 2458.2
## - lwt    1   2661604 78363921 2462.7
## - ht     1   3631032 79333349 2465.1
## - smoke  1   4623219 80325536 2467.4
## - race   2   6578597 82280914 2470.0
## - ui     1   5839544 81541861 2470.2
## 
## Step:  AIC=2456.3
## bwt ~ age + lwt + race + smoke + ptl + ht + ui
## 
##         Df Sum of Sq      RSS    AIC
## - age    1     79115 75820139 2454.5
## - ptl    1     91560 75832585 2454.5
## <none>               75741025 2456.3
## - lwt    1   2623988 78365013 2460.7
## - ht     1   3592430 79333455 2463.1
## - smoke  1   4606425 80347449 2465.5
## - race   2   6552496 82293521 2468.0
## - ui     1   5817995 81559020 2468.3
## 
## Step:  AIC=2454.5
## bwt ~ lwt + race + smoke + ptl + ht + ui
## 
##         Df Sum of Sq      RSS    AIC
## - ptl    1    117366 75937505 2452.8
## <none>               75820139 2454.5
## - lwt    1   2545892 78366031 2458.7
## - ht     1   3546591 79366731 2461.1
## - smoke  1   4530009 80350149 2463.5
## - race   2   6571668 82391807 2466.2
## - ui     1   5751122 81571261 2466.3
## 
## Step:  AIC=2452.79
## bwt ~ lwt + race + smoke + ht + ui
## 
##         Df Sum of Sq      RSS    AIC
## <none>               75937505 2452.8
## - lwt    1   2674229 78611734 2457.3
## - ht     1   3584838 79522343 2459.5
## - smoke  1   4950633 80888138 2462.7
## - race   2   6630123 82567628 2464.6
## - ui     1   6353218 82290723 2466.0

summary(backward_model)

## 
## Call:
## lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1842.14  -433.19    67.09   459.21  1631.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2837.264    243.676  11.644  < 2e-16 ***
## lwt            4.242      1.675   2.532 0.012198 *  
## raceBlack   -475.058    145.603  -3.263 0.001318 ** 
## raceOther   -348.150    112.361  -3.099 0.002254 ** 
## smokeSmoker -356.321    103.444  -3.445 0.000710 ***
## htHT        -585.193    199.644  -2.931 0.003810 ** 
## uiUI        -525.524    134.675  -3.902 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared:  0.2404, Adjusted R-squared:  0.2154 
## F-statistic:   9.6 on 6 and 182 DF,  p-value: 3.601e-09

ftv is dropped first (highest p-value), then age. The process stops once all remaining predictors are contributing meaningfully.

Method 3: Stepwise Regression

Combines both directions — considers adding and removing at every step.

stepwise_model <- step(full_model, direction = "both")

## Start:  AIC=2458.21
## bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv
## 
##         Df Sum of Sq      RSS    AIC
## - ftv    1     38708 75741025 2456.3
## - age    1     58238 75760555 2456.3
## - ptl    1     95285 75797602 2456.4
## <none>               75702317 2458.2
## - lwt    1   2661604 78363921 2462.7
## - ht     1   3631032 79333349 2465.1
## - smoke  1   4623219 80325536 2467.4
## - race   2   6578597 82280914 2470.0
## - ui     1   5839544 81541861 2470.2
## 
## Step:  AIC=2456.3
## bwt ~ age + lwt + race + smoke + ptl + ht + ui
## 
##         Df Sum of Sq      RSS    AIC
## - age    1     79115 75820139 2454.5
## - ptl    1     91560 75832585 2454.5
## <none>               75741025 2456.3
## + ftv    1     38708 75702317 2458.2
## - lwt    1   2623988 78365013 2460.7
## - ht     1   3592430 79333455 2463.1
## - smoke  1   4606425 80347449 2465.5
## - race   2   6552496 82293521 2468.0
## - ui     1   5817995 81559020 2468.3
## 
## Step:  AIC=2454.5
## bwt ~ lwt + race + smoke + ptl + ht + ui
## 
##         Df Sum of Sq      RSS    AIC
## - ptl    1    117366 75937505 2452.8
## <none>               75820139 2454.5
## + age    1     79115 75741025 2456.3
## + ftv    1     59584 75760555 2456.3
## - lwt    1   2545892 78366031 2458.7
## - ht     1   3546591 79366731 2461.1
## - smoke  1   4530009 80350149 2463.5
## - race   2   6571668 82391807 2466.2
## - ui     1   5751122 81571261 2466.3
## 
## Step:  AIC=2452.79
## bwt ~ lwt + race + smoke + ht + ui
## 
##         Df Sum of Sq      RSS    AIC
## <none>               75937505 2452.8
## + ptl    1    117366 75820139 2454.5
## + age    1    104920 75832585 2454.5
## + ftv    1     58307 75879197 2454.7
## - lwt    1   2674229 78611734 2457.3
## - ht     1   3584838 79522343 2459.5
## - smoke  1   4950633 80888138 2462.7
## - race   2   6630123 82567628 2464.6
## - ui     1   6353218 82290723 2466.0

summary(stepwise_model)

## 
## Call:
## lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1842.14  -433.19    67.09   459.21  1631.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2837.264    243.676  11.644  < 2e-16 ***
## lwt            4.242      1.675   2.532 0.012198 *  
## raceBlack   -475.058    145.603  -3.263 0.001318 ** 
## raceOther   -348.150    112.361  -3.099 0.002254 ** 
## smokeSmoker -356.321    103.444  -3.445 0.000710 ***
## htHT        -585.193    199.644  -2.931 0.003810 ** 
## uiUI        -525.524    134.675  -3.902 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared:  0.2404, Adjusted R-squared:  0.2154 
## F-statistic:   9.6 on 6 and 182 DF,  p-value: 3.601e-09

The result is the same as backward elimination here, which confirms the stability of the selected model.

Method 4: Best Subset Selection

Manually fits models of increasing complexity and compares using AIC and BIC.

m1 <- lm(bwt ~ smoke,                              data = air)
m2 <- lm(bwt ~ ht,                                 data = air)
m3 <- lm(bwt ~ lwt,                                data = air)
m4 <- lm(bwt ~ smoke + ht,                         data = air)
m5 <- lm(bwt ~ smoke + lwt,                        data = air)
m6 <- lm(bwt ~ ht + ui,                            data = air)
m7 <- lm(bwt ~ smoke + ht + ui,                    data = air)
m8 <- lm(bwt ~ lwt + smoke + ht + ui,              data = air)
m9 <- lm(bwt ~ lwt + race + smoke + ptl + ht + ui, data = air)

AIC(m1, m2, m3, m4, m5, m6, m7, m8, m9)

##    df      AIC
## m1  3 3026.137
## m2  3 3029.049
## m3  3 3026.485
## m4  4 3024.059
## m5  4 3021.856
## m6  4 3012.617
## m7  5 3008.395
## m8  6 3002.974
## m9  9 2992.861

BIC(m1, m2, m3, m4, m5, m6, m7, m8, m9)

##    df      BIC
## m1  3 3035.863
## m2  3 3038.774
## m3  3 3036.210
## m4  4 3037.026
## m5  4 3034.823
## m6  4 3025.583
## m7  5 3024.603
## m8  6 3022.424
## m9  9 3022.037

The model with the lowest AIC/BIC wins. Both criteria point to m9 — confirming that lwt, race, smoke, ptl, ht, and ui together form the best model.

Model Comparison

data.frame(
  Model = c("Forward", "Backward", "Stepwise"),
  AIC   = round(c(AIC(forward_model), AIC(backward_model), AIC(stepwise_model)), 2)
)

##      Model     AIC
## 1  Forward 2991.15
## 2 Backward 2991.15
## 3 Stepwise 2991.15

All three methods produce the same AIC and the same final model — strong evidence that the selected variables are the true predictors.

Final Selected Model

summary(stepwise_model)

## 
## Call:
## lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1842.14  -433.19    67.09   459.21  1631.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2837.264    243.676  11.644  < 2e-16 ***
## lwt            4.242      1.675   2.532 0.012198 *  
## raceBlack   -475.058    145.603  -3.263 0.001318 ** 
## raceOther   -348.150    112.361  -3.099 0.002254 ** 
## smokeSmoker -356.321    103.444  -3.445 0.000710 ***
## htHT        -585.193    199.644  -2.931 0.003810 ** 
## uiUI        -525.524    134.675  -3.902 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared:  0.2404, Adjusted R-squared:  0.2154 
## F-statistic:   9.6 on 6 and 182 DF,  p-value: 3.601e-09

The final model is: bwt ~ lwt + race + smoke + ptl + ht + ui

Predictor	Effect on Birth Weight
`lwt`	Higher mother’s weight → heavier baby
`raceBlack`, `raceOther`	Lower birth weight vs White mothers
`smokeSmoker`	Smoking reduces birth weight significantly
`ptl`	Each prior premature labour reduces birth weight
`htHT`	Hypertension reduces birth weight
`uiUI`	Uterine irritability reduces birth weight

Conclusion

All four methods consistently removed age and ftv and retained the same six predictors. The agreement across methods confirms the final model is stable and reliable. Clinically, this makes sense smoking, hypertension, uterine irritability, and prior premature births are all well-known risk factors for low birth weight, while the number of doctor visits and mother’s age showed no significant independent effect in this dataset.

Assignment on Multiple Linear regression

SHIMWA UWASE Sylvie

4th June 2026

HW1: Find a data set of which you can fit multiple linear regression and interpret your results

importing all the needed libraries

Loading the dataset

Convert categorical variables to factors

Performing Exploratory Data Analysis

Distribution of Birth Weight

Boxplots: Birth Weight by Categorical Predictors

Scatterplots: Birth Weight vs Continuous Predictors

Correlation Matrix

Model Fitting

Assumption Checking

Diagnostic Plots (Linearity & Homoscedasticity)

Normality Test (Shapiro-Wilk)

Homoscedasticity Test (Breusch-Pagan)

Multicollinearity Check (VIF)

Conclusion

HW2: Read about variable selection methods

Method 1: Forward Selection

Method 2: Backward Elimination

Method 3: Stepwise Regression

Method 4: Best Subset Selection

Model Comparison

Final Selected Model

Conclusion