HW4: Multiple Linear Regression & Variable Selection

Part 1: Multiple Linear Regression

Why this dataset?

I chose the Swiss dataset (available in base R) because it reflects a real demographic and socioeconomic study. It contains 1888 standardised fertility and socioeconomic indicators for 47 French-speaking provinces in Switzerland. I want to understand what social and economic factors predict fertility rates across these provinces.

The response variable is Fertility (a standardised index). The predictors I will use are:

Variable	Description
`Agriculture`	% of males involved in agriculture
`Education`	% of draftees with education beyond primary school
`Catholic`	% Catholic (as opposed to Protestant)
`Infant.Mortality`	Infant deaths per 1,000 live births

data("swiss")
head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

dim(swiss)

## [1] 47  6

1. Understanding the Variables

Before fitting any model, I examine the distribution of each variable and its raw correlation with Fertility.

summary(swiss[, c("Fertility", "Agriculture", "Education", "Catholic", "Infant.Mortality")])

##    Fertility      Agriculture      Education        Catholic      
##  Min.   :35.00   Min.   : 1.20   Min.   : 1.00   Min.   :  2.150  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.: 6.00   1st Qu.:  5.195  
##  Median :70.40   Median :54.10   Median : 8.00   Median : 15.140  
##  Mean   :70.14   Mean   :50.66   Mean   :10.98   Mean   : 41.144  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:12.00   3rd Qu.: 93.125  
##  Max.   :92.50   Max.   :89.70   Max.   :53.00   Max.   :100.000  
##  Infant.Mortality
##  Min.   :10.80   
##  1st Qu.:18.15   
##  Median :20.00   
##  Mean   :19.94   
##  3rd Qu.:21.70   
##  Max.   :26.60

cor(swiss[, c("Fertility", "Agriculture", "Education", "Catholic", "Infant.Mortality")])

##                   Fertility Agriculture   Education   Catholic Infant.Mortality
## Fertility         1.0000000  0.35307918 -0.66378886  0.4636847       0.41655603
## Agriculture       0.3530792  1.00000000 -0.63952252  0.4010951      -0.06085861
## Education        -0.6637889 -0.63952252  1.00000000 -0.1538589      -0.09932185
## Catholic          0.4636847  0.40109505 -0.15385892  1.0000000       0.17549591
## Infant.Mortality  0.4165560 -0.06085861 -0.09932185  0.1754959       1.00000000

Key observations from correlations:

Education has the strongest negative correlation with Fertility (r = –0.66): provinces where more men received advanced education tend to have lower fertility.
Agriculture shows a positive correlation (r = 0.35): more agrarian provinces tend to have higher fertility.
Catholic is positively correlated (r = 0.46): predominantly Catholic provinces show higher fertility rates.
Infant.Mortality is positively correlated (r = 0.42): higher infant mortality tends to coincide with higher fertility — consistent with the demographic theory that higher child mortality motivates larger family sizes.

2. Fitting the Multiple Linear Regression Model

model <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
summary(model)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

3. Interpreting the Results

Overall model fit:

The model explains 67.1% of the variance in fertility (Adjusted R² = 0.6707), which is a strong fit for demographic data. The F-statistic is highly significant (p < 2.2e-16), confirming that the predictors collectively explain a meaningful portion of fertility variation across Swiss provinces.

Individual predictors (holding all others constant):

Agriculture (p = 0.021): Statistically significant. A 1-percentage-point increase in the share of males in agriculture is associated with a 0.136-unit increase in the fertility index. This reflects the tendency for rural, agrarian communities to have larger families.
Education (p < 0.001): The most significant predictor. Each additional percentage point of draftees with beyond-primary education is associated with a 0.862-unit decrease in fertility. Education — especially for men, which was the measure available in this 1888 dataset — is closely linked to family planning and lower birth rates.
Catholic (p < 0.001): Provinces with higher proportions of Catholics have significantly higher fertility. A 1-percentage-point increase in the Catholic share is associated with a 0.104-unit increase in fertility, consistent with religious-cultural attitudes toward family size during this era.
Infant.Mortality (p = 0.002): A 1-unit increase in infant deaths per 1,000 live births is associated with a 1.077-unit increase in the fertility index. This positive relationship supports the demographic compensation hypothesis — families in high-mortality environments tend to have more children to ensure survivors.

4. Checking Model Assumptions

par(mfrow = c(2, 2))
plot(model, pch = 19, col = "darkblue")

Assessment of each diagnostic plot:

Residuals vs Fitted: The residuals scatter randomly around zero with no visible curve or pattern. This supports the linearity assumption.
Normal Q-Q: The residuals follow the theoretical diagonal closely, with only slight deviation at the tails. Normality is reasonably well satisfied.
Scale-Location: The spread of residuals is roughly constant across fitted values — no dramatic fanning or narrowing. Homoscedasticity holds.
Residuals vs Leverage: The province of Porrentruy appears as a high-leverage point but does not exceed Cook’s distance. No observation critically distorts the model.

Overall, the regression assumptions are adequately met. The model is reliable for interpretation.

Part 2:Variable Selection Methods

Variable selection is the process of identifying which predictors to include in a regression model. The goal is a model that is both accurate and parsimonious — no unnecessary variables, no missing important ones.

Overview of Methods

There are three main classes of variable selection methods:

1. Criterion-Based Selection (All Subsets)

This approach fits every possible combination of predictors and evaluates each using a model quality criterion. Common criteria include:

AIC (Akaike Information Criterion): Penalises complexity; lower is better. Tends to keep more variables.
BIC (Bayesian Information Criterion): Applies a heavier penalty for complexity than AIC; favours simpler models.
Adjusted R²: Rewards models that explain more variance only when the added predictor genuinely contributes.

With 4 predictors, there are 2⁴ = 16 possible models. All-subsets selection is feasible here and guarantees the globally best model under the chosen criterion.

# Using regsubsets from leaps package
if (!require(leaps)) install.packages("leaps", repos = "https://cloud.r-project.org")
library(leaps)

all_subsets <- regsubsets(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
                          data = swiss, nbest = 1)
summary_all <- summary(all_subsets)

# Table of results
results_table <- data.frame(
  Predictors = 1:4,
  AdjR2      = round(summary_all$adjr2, 4),
  CP         = round(summary_all$cp, 3),
  BIC        = round(summary_all$bic, 3)
)
print(results_table)

##   Predictors  AdjR2     CP     BIC
## 1          1 0.4282 35.144 -19.603
## 2          2 0.5552 18.440 -28.611
## 3          3 0.6390  8.141 -35.656
## 4          4 0.6707  5.000 -37.234

par(mfrow = c(1, 3))
plot(summary_all$adjr2, type = "b", xlab = "No. of Predictors", ylab = "Adjusted R²",
     main = "Adjusted R²", col = "steelblue", pch = 19)
plot(summary_all$cp,    type = "b", xlab = "No. of Predictors", ylab = "Cp",
     main = "Mallow's Cp", col = "darkorange", pch = 19)
plot(summary_all$bic,   type = "b", xlab = "No. of Predictors", ylab = "BIC",
     main = "BIC", col = "darkgreen", pch = 19)

Reading the plots:

Adjusted R² peaks at 4 predictors — including all variables maximises explained variance.
Mallow’s Cp is minimised at 3 predictors, suggesting Agriculture may be marginally dispensable.
BIC, which penalises complexity most aggressively, also favours 3 predictors.

The tension between these criteria is instructive: if predictive completeness is the priority, keep all four. If model simplicity is valued, a 3-predictor model (dropping Agriculture) is defensible.

2. Stepwise Selection

Rather than evaluating all subsets, stepwise methods build or prune the model one variable at a time. This is more computationally efficient and is appropriate when the number of predictors is large.

I apply backward elimination here: start with all predictors and remove the one that most improves (lowers) AIC at each step.

full_model  <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
null_model  <- lm(Fertility ~ 1, data = swiss)

# Backward stepwise
step_back <- step(full_model, direction = "backward", trace = TRUE)

## Start:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43

summary(step_back)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

The backward procedure drops no variables — every predictor reduces AIC when retained. This is consistent with the all-subsets finding that all four predictors contribute to model quality under AIC.

3. LASSO Regularisation

LASSO (Least Absolute Shrinkage and Selection Operator) takes a fundamentally different approach. Instead of testing which variables to include or exclude, it fits all predictors simultaneously but adds an L1 penalty that shrinks small coefficients toward zero — and can set them to exactly zero, effectively removing them from the model.

The strength of shrinkage is controlled by the tuning parameter λ: higher λ = more shrinkage = fewer active predictors. The optimal λ is chosen by cross-validation.

if (!require(glmnet)) install.packages("glmnet", repos = "https://cloud.r-project.org")
library(glmnet)

X <- as.matrix(swiss[, c("Agriculture", "Education", "Catholic", "Infant.Mortality")])
y <- swiss$Fertility

# Cross-validated LASSO
set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)
plot(lasso_cv)
title("LASSO Cross-Validation: MSE vs log(λ)", line = 2.5)

# Coefficients at optimal lambda
cat("\nCoefficients at lambda.min:\n")

## 
## Coefficients at lambda.min:

coef(lasso_cv, s = "lambda.min")

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                  lambda.min
## (Intercept)      61.4779037
## Agriculture      -0.1434864
## Education        -0.9588465
## Catholic          0.1214429
## Infant.Mortality  1.0762862

cat("\nCoefficients at lambda.1se (more regularised):\n")

## 
## Coefficients at lambda.1se (more regularised):

coef(lasso_cv, s = "lambda.1se")

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                   lambda.1se
## (Intercept)      57.57165456
## Agriculture       .         
## Education        -0.61593220
## Catholic          0.06593351
## Infant.Mortality  0.83340838

Interpretation:

At lambda.min (lowest cross-validated MSE), LASSO retains all four predictors with non-zero coefficients — consistent with the stepwise result.
At lambda.1se (the most regularised model within one standard error of the minimum), some coefficients are further shrunk. If any reach zero, that predictor is removed.
LASSO is especially powerful when dealing with many correlated predictors, where stepwise methods can behave inconsistently.

Summary Comparison of Methods

Method	Variables Selected	Basis
All-subsets (Adj R²)	All 4	Maximises adjusted variance explained
All-subsets (BIC)	3 (drop `Agriculture`)	Penalises complexity heavily
Backward Stepwise (AIC)	All 4	AIC does not improve by removing any
LASSO (lambda.min)	All 4	Cross-validated prediction error

Conclusion

All three variable selection approaches converge on a broadly consistent message: Education, Catholic, and Infant.Mortality are the core predictors of fertility in 19th-century Swiss provinces. Agriculture contributes meaningfully under AIC and LASSO but is marginal under BIC.

The final recommended model retains all four predictors. It achieves an Adjusted R² of 0.671, satisfies regression assumptions, and is grounded in well-understood demographic theory. Each coefficient has a clear and interpretable direction — education and fertility decline together, while religiosity, agrarian lifestyle, and infant mortality are each associated with higher fertility rates. ```