Part 1: Multiple Linear Regression

Why this dataset?

I chose the Swiss dataset (available in base R) because it reflects a real demographic and socioeconomic study. It contains 1888 standardised fertility and socioeconomic indicators for 47 French-speaking provinces in Switzerland. I want to understand what social and economic factors predict fertility rates across these provinces.

The response variable is Fertility (a standardised index). The predictors I will use are:

Variable Description
Agriculture % of males involved in agriculture
Education % of draftees with education beyond primary school
Catholic % Catholic (as opposed to Protestant)
Infant.Mortality Infant deaths per 1,000 live births
data("swiss")
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
dim(swiss)
## [1] 47  6

1. Understanding the Variables

Before fitting any model, I examine the distribution of each variable and its raw correlation with Fertility.

summary(swiss[, c("Fertility", "Agriculture", "Education", "Catholic", "Infant.Mortality")])
##    Fertility      Agriculture      Education        Catholic      
##  Min.   :35.00   Min.   : 1.20   Min.   : 1.00   Min.   :  2.150  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.: 6.00   1st Qu.:  5.195  
##  Median :70.40   Median :54.10   Median : 8.00   Median : 15.140  
##  Mean   :70.14   Mean   :50.66   Mean   :10.98   Mean   : 41.144  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:12.00   3rd Qu.: 93.125  
##  Max.   :92.50   Max.   :89.70   Max.   :53.00   Max.   :100.000  
##  Infant.Mortality
##  Min.   :10.80   
##  1st Qu.:18.15   
##  Median :20.00   
##  Mean   :19.94   
##  3rd Qu.:21.70   
##  Max.   :26.60
cor(swiss[, c("Fertility", "Agriculture", "Education", "Catholic", "Infant.Mortality")])
##                   Fertility Agriculture   Education   Catholic Infant.Mortality
## Fertility         1.0000000  0.35307918 -0.66378886  0.4636847       0.41655603
## Agriculture       0.3530792  1.00000000 -0.63952252  0.4010951      -0.06085861
## Education        -0.6637889 -0.63952252  1.00000000 -0.1538589      -0.09932185
## Catholic          0.4636847  0.40109505 -0.15385892  1.0000000       0.17549591
## Infant.Mortality  0.4165560 -0.06085861 -0.09932185  0.1754959       1.00000000

Key observations from correlations:

  • Education has the strongest negative correlation with Fertility (r = –0.66): provinces where more men received advanced education tend to have lower fertility.
  • Agriculture shows a positive correlation (r = 0.35): more agrarian provinces tend to have higher fertility.
  • Catholic is positively correlated (r = 0.46): predominantly Catholic provinces show higher fertility rates.
  • Infant.Mortality is positively correlated (r = 0.42): higher infant mortality tends to coincide with higher fertility — consistent with the demographic theory that higher child mortality motivates larger family sizes.

2. Fitting the Multiple Linear Regression Model

model <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
summary(model)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

3. Interpreting the Results

Overall model fit:

The model explains 67.1% of the variance in fertility (Adjusted R² = 0.6707), which is a strong fit for demographic data. The F-statistic is highly significant (p < 2.2e-16), confirming that the predictors collectively explain a meaningful portion of fertility variation across Swiss provinces.

Individual predictors (holding all others constant):

  • Agriculture (p = 0.021): Statistically significant. A 1-percentage-point increase in the share of males in agriculture is associated with a 0.136-unit increase in the fertility index. This reflects the tendency for rural, agrarian communities to have larger families.

  • Education (p < 0.001): The most significant predictor. Each additional percentage point of draftees with beyond-primary education is associated with a 0.862-unit decrease in fertility. Education — especially for men, which was the measure available in this 1888 dataset — is closely linked to family planning and lower birth rates.

  • Catholic (p < 0.001): Provinces with higher proportions of Catholics have significantly higher fertility. A 1-percentage-point increase in the Catholic share is associated with a 0.104-unit increase in fertility, consistent with religious-cultural attitudes toward family size during this era.

  • Infant.Mortality (p = 0.002): A 1-unit increase in infant deaths per 1,000 live births is associated with a 1.077-unit increase in the fertility index. This positive relationship supports the demographic compensation hypothesis — families in high-mortality environments tend to have more children to ensure survivors.


4. Checking Model Assumptions

par(mfrow = c(2, 2))
plot(model, pch = 19, col = "darkblue")

Assessment of each diagnostic plot:

  • Residuals vs Fitted: The residuals scatter randomly around zero with no visible curve or pattern. This supports the linearity assumption.

  • Normal Q-Q: The residuals follow the theoretical diagonal closely, with only slight deviation at the tails. Normality is reasonably well satisfied.

  • Scale-Location: The spread of residuals is roughly constant across fitted values — no dramatic fanning or narrowing. Homoscedasticity holds.

  • Residuals vs Leverage: The province of Porrentruy appears as a high-leverage point but does not exceed Cook’s distance. No observation critically distorts the model.

Overall, the regression assumptions are adequately met. The model is reliable for interpretation.


Part 2:Variable Selection Methods

Variable selection is the process of identifying which predictors to include in a regression model. The goal is a model that is both accurate and parsimonious — no unnecessary variables, no missing important ones.

Overview of Methods

There are three main classes of variable selection methods:


1. Criterion-Based Selection (All Subsets)

This approach fits every possible combination of predictors and evaluates each using a model quality criterion. Common criteria include:

  • AIC (Akaike Information Criterion): Penalises complexity; lower is better. Tends to keep more variables.
  • BIC (Bayesian Information Criterion): Applies a heavier penalty for complexity than AIC; favours simpler models.
  • Adjusted R²: Rewards models that explain more variance only when the added predictor genuinely contributes.

With 4 predictors, there are 2⁴ = 16 possible models. All-subsets selection is feasible here and guarantees the globally best model under the chosen criterion.

# Using regsubsets from leaps package
if (!require(leaps)) install.packages("leaps", repos = "https://cloud.r-project.org")
library(leaps)

all_subsets <- regsubsets(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
                          data = swiss, nbest = 1)
summary_all <- summary(all_subsets)

# Table of results
results_table <- data.frame(
  Predictors = 1:4,
  AdjR2      = round(summary_all$adjr2, 4),
  CP         = round(summary_all$cp, 3),
  BIC        = round(summary_all$bic, 3)
)
print(results_table)
##   Predictors  AdjR2     CP     BIC
## 1          1 0.4282 35.144 -19.603
## 2          2 0.5552 18.440 -28.611
## 3          3 0.6390  8.141 -35.656
## 4          4 0.6707  5.000 -37.234
par(mfrow = c(1, 3))
plot(summary_all$adjr2, type = "b", xlab = "No. of Predictors", ylab = "Adjusted R²",
     main = "Adjusted R²", col = "steelblue", pch = 19)
plot(summary_all$cp,    type = "b", xlab = "No. of Predictors", ylab = "Cp",
     main = "Mallow's Cp", col = "darkorange", pch = 19)
plot(summary_all$bic,   type = "b", xlab = "No. of Predictors", ylab = "BIC",
     main = "BIC", col = "darkgreen", pch = 19)

Reading the plots:

  • Adjusted R² peaks at 4 predictors — including all variables maximises explained variance.
  • Mallow’s Cp is minimised at 3 predictors, suggesting Agriculture may be marginally dispensable.
  • BIC, which penalises complexity most aggressively, also favours 3 predictors.

The tension between these criteria is instructive: if predictive completeness is the priority, keep all four. If model simplicity is valued, a 3-predictor model (dropping Agriculture) is defensible.


2. Stepwise Selection

Rather than evaluating all subsets, stepwise methods build or prune the model one variable at a time. This is more computationally efficient and is appropriate when the number of predictors is large.

I apply backward elimination here: start with all predictors and remove the one that most improves (lowers) AIC at each step.

full_model  <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
null_model  <- lm(Fertility ~ 1, data = swiss)

# Backward stepwise
step_back <- step(full_model, direction = "backward", trace = TRUE)
## Start:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
summary(step_back)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

The backward procedure drops no variables — every predictor reduces AIC when retained. This is consistent with the all-subsets finding that all four predictors contribute to model quality under AIC.


3. LASSO Regularisation

LASSO (Least Absolute Shrinkage and Selection Operator) takes a fundamentally different approach. Instead of testing which variables to include or exclude, it fits all predictors simultaneously but adds an L1 penalty that shrinks small coefficients toward zero — and can set them to exactly zero, effectively removing them from the model.

The strength of shrinkage is controlled by the tuning parameter λ: higher λ = more shrinkage = fewer active predictors. The optimal λ is chosen by cross-validation.

if (!require(glmnet)) install.packages("glmnet", repos = "https://cloud.r-project.org")
library(glmnet)

X <- as.matrix(swiss[, c("Agriculture", "Education", "Catholic", "Infant.Mortality")])
y <- swiss$Fertility

# Cross-validated LASSO
set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)
plot(lasso_cv)
title("LASSO Cross-Validation: MSE vs log(λ)", line = 2.5)

# Coefficients at optimal lambda
cat("\nCoefficients at lambda.min:\n")
## 
## Coefficients at lambda.min:
coef(lasso_cv, s = "lambda.min")
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                  lambda.min
## (Intercept)      61.4779037
## Agriculture      -0.1434864
## Education        -0.9588465
## Catholic          0.1214429
## Infant.Mortality  1.0762862
cat("\nCoefficients at lambda.1se (more regularised):\n")
## 
## Coefficients at lambda.1se (more regularised):
coef(lasso_cv, s = "lambda.1se")
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                   lambda.1se
## (Intercept)      57.57165456
## Agriculture       .         
## Education        -0.61593220
## Catholic          0.06593351
## Infant.Mortality  0.83340838

Interpretation:

  • At lambda.min (lowest cross-validated MSE), LASSO retains all four predictors with non-zero coefficients — consistent with the stepwise result.
  • At lambda.1se (the most regularised model within one standard error of the minimum), some coefficients are further shrunk. If any reach zero, that predictor is removed.
  • LASSO is especially powerful when dealing with many correlated predictors, where stepwise methods can behave inconsistently.

Summary Comparison of Methods

Method Variables Selected Basis
All-subsets (Adj R²) All 4 Maximises adjusted variance explained
All-subsets (BIC) 3 (drop Agriculture) Penalises complexity heavily
Backward Stepwise (AIC) All 4 AIC does not improve by removing any
LASSO (lambda.min) All 4 Cross-validated prediction error

Conclusion

All three variable selection approaches converge on a broadly consistent message: Education, Catholic, and Infant.Mortality are the core predictors of fertility in 19th-century Swiss provinces. Agriculture contributes meaningfully under AIC and LASSO but is marginal under BIC.

The final recommended model retains all four predictors. It achieves an Adjusted R² of 0.671, satisfies regression assumptions, and is grounded in well-understood demographic theory. Each coefficient has a clear and interpretable direction — education and fertility decline together, while religiosity, agrarian lifestyle, and infant mortality are each associated with higher fertility rates. ```