I chose the Swiss dataset (available in base R) because
it reflects a real demographic and socioeconomic study. It contains 1888
standardised fertility and socioeconomic indicators for 47
French-speaking provinces in Switzerland. I want to understand what
social and economic factors predict fertility rates
across these provinces.
The response variable is Fertility (a standardised
index). The predictors I will use are:
| Variable | Description |
|---|---|
Agriculture |
% of males involved in agriculture |
Education |
% of draftees with education beyond primary school |
Catholic |
% Catholic (as opposed to Protestant) |
Infant.Mortality |
Infant deaths per 1,000 live births |
data("swiss")
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
dim(swiss)
## [1] 47 6
Before fitting any model, I examine the distribution of each variable
and its raw correlation with Fertility.
summary(swiss[, c("Fertility", "Agriculture", "Education", "Catholic", "Infant.Mortality")])
## Fertility Agriculture Education Catholic
## Min. :35.00 Min. : 1.20 Min. : 1.00 Min. : 2.150
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.: 6.00 1st Qu.: 5.195
## Median :70.40 Median :54.10 Median : 8.00 Median : 15.140
## Mean :70.14 Mean :50.66 Mean :10.98 Mean : 41.144
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:12.00 3rd Qu.: 93.125
## Max. :92.50 Max. :89.70 Max. :53.00 Max. :100.000
## Infant.Mortality
## Min. :10.80
## 1st Qu.:18.15
## Median :20.00
## Mean :19.94
## 3rd Qu.:21.70
## Max. :26.60
cor(swiss[, c("Fertility", "Agriculture", "Education", "Catholic", "Infant.Mortality")])
## Fertility Agriculture Education Catholic Infant.Mortality
## Fertility 1.0000000 0.35307918 -0.66378886 0.4636847 0.41655603
## Agriculture 0.3530792 1.00000000 -0.63952252 0.4010951 -0.06085861
## Education -0.6637889 -0.63952252 1.00000000 -0.1538589 -0.09932185
## Catholic 0.4636847 0.40109505 -0.15385892 1.0000000 0.17549591
## Infant.Mortality 0.4165560 -0.06085861 -0.09932185 0.1754959 1.00000000
Key observations from correlations:
Education has the strongest negative correlation with
Fertility (r = –0.66): provinces where more men received
advanced education tend to have lower fertility.Agriculture shows a positive correlation (r = 0.35):
more agrarian provinces tend to have higher fertility.Catholic is positively correlated (r = 0.46):
predominantly Catholic provinces show higher fertility rates.Infant.Mortality is positively correlated (r = 0.42):
higher infant mortality tends to coincide with higher fertility —
consistent with the demographic theory that higher child mortality
motivates larger family sizes.model <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
summary(model)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
Overall model fit:
The model explains 67.1% of the variance in fertility (Adjusted R² = 0.6707), which is a strong fit for demographic data. The F-statistic is highly significant (p < 2.2e-16), confirming that the predictors collectively explain a meaningful portion of fertility variation across Swiss provinces.
Individual predictors (holding all others constant):
Agriculture (p = 0.021):
Statistically significant. A 1-percentage-point increase in the share of
males in agriculture is associated with a 0.136-unit
increase in the fertility index. This reflects the tendency for
rural, agrarian communities to have larger families.
Education (p < 0.001): The most
significant predictor. Each additional percentage point of draftees with
beyond-primary education is associated with a 0.862-unit
decrease in fertility. Education — especially for men, which
was the measure available in this 1888 dataset — is closely linked to
family planning and lower birth rates.
Catholic (p < 0.001): Provinces
with higher proportions of Catholics have significantly higher
fertility. A 1-percentage-point increase in the Catholic share is
associated with a 0.104-unit increase in fertility,
consistent with religious-cultural attitudes toward family size during
this era.
Infant.Mortality (p = 0.002): A
1-unit increase in infant deaths per 1,000 live births is associated
with a 1.077-unit increase in the fertility index. This
positive relationship supports the demographic compensation hypothesis —
families in high-mortality environments tend to have more children to
ensure survivors.
par(mfrow = c(2, 2))
plot(model, pch = 19, col = "darkblue")
Assessment of each diagnostic plot:
Residuals vs Fitted: The residuals scatter randomly around zero with no visible curve or pattern. This supports the linearity assumption.
Normal Q-Q: The residuals follow the theoretical diagonal closely, with only slight deviation at the tails. Normality is reasonably well satisfied.
Scale-Location: The spread of residuals is roughly constant across fitted values — no dramatic fanning or narrowing. Homoscedasticity holds.
Residuals vs Leverage: The province of Porrentruy appears as a high-leverage point but does not exceed Cook’s distance. No observation critically distorts the model.
Overall, the regression assumptions are adequately met. The model is reliable for interpretation.
Variable selection is the process of identifying which predictors to include in a regression model. The goal is a model that is both accurate and parsimonious — no unnecessary variables, no missing important ones.
There are three main classes of variable selection methods:
1. Criterion-Based Selection (All Subsets)
This approach fits every possible combination of predictors and evaluates each using a model quality criterion. Common criteria include:
With 4 predictors, there are 2⁴ = 16 possible models. All-subsets selection is feasible here and guarantees the globally best model under the chosen criterion.
# Using regsubsets from leaps package
if (!require(leaps)) install.packages("leaps", repos = "https://cloud.r-project.org")
library(leaps)
all_subsets <- regsubsets(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
data = swiss, nbest = 1)
summary_all <- summary(all_subsets)
# Table of results
results_table <- data.frame(
Predictors = 1:4,
AdjR2 = round(summary_all$adjr2, 4),
CP = round(summary_all$cp, 3),
BIC = round(summary_all$bic, 3)
)
print(results_table)
## Predictors AdjR2 CP BIC
## 1 1 0.4282 35.144 -19.603
## 2 2 0.5552 18.440 -28.611
## 3 3 0.6390 8.141 -35.656
## 4 4 0.6707 5.000 -37.234
par(mfrow = c(1, 3))
plot(summary_all$adjr2, type = "b", xlab = "No. of Predictors", ylab = "Adjusted R²",
main = "Adjusted R²", col = "steelblue", pch = 19)
plot(summary_all$cp, type = "b", xlab = "No. of Predictors", ylab = "Cp",
main = "Mallow's Cp", col = "darkorange", pch = 19)
plot(summary_all$bic, type = "b", xlab = "No. of Predictors", ylab = "BIC",
main = "BIC", col = "darkgreen", pch = 19)
Reading the plots:
Agriculture may be marginally dispensable.The tension between these criteria is instructive: if predictive
completeness is the priority, keep all four. If model simplicity is
valued, a 3-predictor model (dropping Agriculture) is
defensible.
2. Stepwise Selection
Rather than evaluating all subsets, stepwise methods build or prune the model one variable at a time. This is more computationally efficient and is appropriate when the number of predictors is large.
I apply backward elimination here: start with all predictors and remove the one that most improves (lowers) AIC at each step.
full_model <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
null_model <- lm(Fertility ~ 1, data = swiss)
# Backward stepwise
step_back <- step(full_model, direction = "backward", trace = TRUE)
## Start: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
summary(step_back)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
The backward procedure drops no variables — every predictor reduces AIC when retained. This is consistent with the all-subsets finding that all four predictors contribute to model quality under AIC.
3. LASSO Regularisation
LASSO (Least Absolute Shrinkage and Selection Operator) takes a fundamentally different approach. Instead of testing which variables to include or exclude, it fits all predictors simultaneously but adds an L1 penalty that shrinks small coefficients toward zero — and can set them to exactly zero, effectively removing them from the model.
The strength of shrinkage is controlled by the tuning parameter λ: higher λ = more shrinkage = fewer active predictors. The optimal λ is chosen by cross-validation.
if (!require(glmnet)) install.packages("glmnet", repos = "https://cloud.r-project.org")
library(glmnet)
X <- as.matrix(swiss[, c("Agriculture", "Education", "Catholic", "Infant.Mortality")])
y <- swiss$Fertility
# Cross-validated LASSO
set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)
plot(lasso_cv)
title("LASSO Cross-Validation: MSE vs log(λ)", line = 2.5)
# Coefficients at optimal lambda
cat("\nCoefficients at lambda.min:\n")
##
## Coefficients at lambda.min:
coef(lasso_cv, s = "lambda.min")
## 5 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 61.4779037
## Agriculture -0.1434864
## Education -0.9588465
## Catholic 0.1214429
## Infant.Mortality 1.0762862
cat("\nCoefficients at lambda.1se (more regularised):\n")
##
## Coefficients at lambda.1se (more regularised):
coef(lasso_cv, s = "lambda.1se")
## 5 x 1 sparse Matrix of class "dgCMatrix"
## lambda.1se
## (Intercept) 57.57165456
## Agriculture .
## Education -0.61593220
## Catholic 0.06593351
## Infant.Mortality 0.83340838
Interpretation:
lambda.min (lowest cross-validated MSE), LASSO
retains all four predictors with non-zero coefficients — consistent with
the stepwise result.lambda.1se (the most regularised model within one
standard error of the minimum), some coefficients are further shrunk. If
any reach zero, that predictor is removed.| Method | Variables Selected | Basis |
|---|---|---|
| All-subsets (Adj R²) | All 4 | Maximises adjusted variance explained |
| All-subsets (BIC) | 3 (drop Agriculture) |
Penalises complexity heavily |
| Backward Stepwise (AIC) | All 4 | AIC does not improve by removing any |
| LASSO (lambda.min) | All 4 | Cross-validated prediction error |
All three variable selection approaches converge on a broadly
consistent message: Education, Catholic, and Infant.Mortality
are the core predictors of fertility in 19th-century Swiss
provinces. Agriculture contributes meaningfully under AIC
and LASSO but is marginal under BIC.
The final recommended model retains all four predictors. It achieves an Adjusted R² of 0.671, satisfies regression assumptions, and is grounded in well-understood demographic theory. Each coefficient has a clear and interpretable direction — education and fertility decline together, while religiosity, agrarian lifestyle, and infant mortality are each associated with higher fertility rates. ```