Part 1: Multiple Linear Regression for Fertility

Model Strategy

Given the small sample size (n = 47), I start with the full model and use AIC-based stepwise selection to balance model fit and parsimony.

data(swiss)

m_full <- lm(Fertility ~ ., data = swiss)
m_best <- stepAIC(m_full, direction = "both", trace = FALSE)

formula(m_best)
#> Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
summary(m_best)
#> 
#> Call:
#> lm(formula = Fertility ~ Agriculture + Education + Catholic + 
#>     Infant.Mortality, data = swiss)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.6765  -6.0522   0.7514   3.1664  16.1422 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
#> Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
#> Education        -0.98026    0.14814  -6.617 5.14e-08 ***
#> Catholic          0.12467    0.02889   4.315 9.50e-05 ***
#> Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.168 on 42 degrees of freedom
#> Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
#> F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

Multicollinearity Check

Because education-related variables are often correlated, I examine VIF values.

vif(m_best)
#>      Agriculture        Education         Catholic Infant.Mortality 
#>         2.147153         1.816361         1.299916         1.107528

Values below 5 suggest multicollinearity is not severe in the selected model.

Interpretation

Key findings from the final model:

  • Education (-0.98): A one-unit increase in education is associated with nearly a one-point decrease in fertility, holding other variables constant. This is substantial and consistent with demographic transition theory, where increased education leads to delayed family formation and smaller family sizes.

  • Catholic (0.12): Provinces with higher proportions of Catholics tend to have higher fertility. This aligns with historical patterns where more religious regions maintained higher birth rates.

  • Infant Mortality (1.08): Higher infant mortality is associated with higher fertility.

  • Agriculture (-0.15): Greater agricultural employment slightly lowers fertility in the presence of other controls, suggesting modernization effects are nuanced once education and religion are considered.

summary(m_best)$adj.r.squared
#> [1] 0.670714

Adjusted R² ≈ 0.67 indicates the model explains a meaningful portion of fertility variation given the historical context.

Diagnostic Plots

par(mfrow = c(2, 2))
plot(m_best)

par(mfrow = c(1, 1))

Residual diagnostics suggest:

  • Reasonable linearity.
  • Mild but acceptable deviation from normality.
  • No extreme leverage concerns.

Given the small sample size, influential observations should be interpreted cautiously.

Part 2: Logistic Regression for Fertility > 70

Define high fertility as Fertility > 70.

swiss$fertility_high <- as.integer(swiss$Fertility > 70)
table(swiss$fertility_high)
#> 
#>  0  1 
#> 23 24

The classes are nearly balanced (23 vs 24), which improves model stability.

Model Selection

g_full <- glm(fertility_high ~ Agriculture + Examination + Education + Catholic + Infant.Mortality,
              data = swiss, family = binomial)

g_best <- stepAIC(g_full, direction = "both", trace = FALSE)

summary(g_best)
#> 
#> Call:
#> glm(formula = fertility_high ~ Agriculture + Examination + Education + 
#>     Catholic + Infant.Mortality, family = binomial, data = swiss)
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)       4.82826    5.25607   0.919   0.3583  
#> Agriculture      -0.09615    0.04011  -2.397   0.0165 *
#> Examination      -0.32116    0.13844  -2.320   0.0203 *
#> Education        -0.12078    0.08610  -1.403   0.1607  
#> Catholic          0.02078    0.01376   1.509   0.1312  
#> Infant.Mortality  0.29078    0.21051   1.381   0.1672  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 65.135  on 46  degrees of freedom
#> Residual deviance: 32.887  on 41  degrees of freedom
#> AIC: 44.887
#> 
#> Number of Fisher Scoring iterations: 6

Odds Ratios

b <- coef(g_best)
se <- sqrt(diag(vcov(g_best)))

data.frame(
  term = names(b),
  odds_ratio = exp(b),
  ci_low = exp(b - 1.96 * se),
  ci_high = exp(b + 1.96 * se)
)

Interpretation

  • Agriculture (OR ≈ 0.91): Higher agricultural employment decreases the odds of high fertility.
  • Examination (OR ≈ 0.73): Provinces with stronger education indicators are significantly less likely to exhibit high fertility.
  • Other predictors show weaker statistical evidence but remain directionally informative.

These results reinforce the modernization narrative: increased educational attainment corresponds to lower fertility likelihood.

Predictive Performance* (In-Sample)

p_hat <- predict(g_best, type = "response")
pred <- as.integer(p_hat >= 0.5)

cm <- table(Actual = swiss$fertility_high, Predicted = pred)
cm
#>       Predicted
#> Actual  0  1
#>      0 18  5
#>      1  2 22
sum(diag(cm)) / sum(cm)
#> [1] 0.8510638

Accuracy ≈ 85%.

Because this is in-sample performance with only 47 observations, cross-validation would provide a more honest assessment in practice.

Final Discussion

Both models consistently show education as the strongest determinant of fertility differences. Religious composition and infant mortality also contribute meaningfully.

Key takeaways: