First, I fit a multiple regression model for Fertility, using all other variables in the swiss data set as predictors:
summary(lm(Fertility ~ . , data = swiss))
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.2743  -5.2617   0.5032   4.1198  15.3213
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *
## Examination      -0.25801    0.25388  -1.016  0.31546
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 **
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
Based on that model, I decided to remove the Examination variable because it had the highest p-value (0.315) and was the only predictor not significant at the 0.05 level. I ran the multiple regression again without Examination:
summary(lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss))
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
##     Infant.Mortality, data = swiss)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.6765  -6.0522   0.7514   3.1664  16.1422
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
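The drop-one step above can also be scripted instead of being read off the coefficient table by eye. This is a minimal sketch, assuming the built-in `swiss` data set; the object names `full`, `weakest`, and `reduced` are illustrative and not part of the original analysis.

```r
# Fit the full model (all other columns of swiss as predictors).
full <- lm(Fertility ~ ., data = swiss)

# Coefficient table as a matrix; drop the intercept row before ranking.
coefs <- summary(full)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]

# Name of the predictor with the largest p-value
# (Examination, per the first coefficient table above).
weakest <- names(which.max(p_values))
print(weakest)

# Refit without that predictor; update() reuses the original call.
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))
print(formula(reduced))
```

Using `update()` keeps the data and other arguments of the original call, so only the formula changes between the two fits.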
I was surprised to see that the adjusted R-squared got slightly worse after removing the Examination variable, dropping from 0.671 to 0.6707. Based on this comparison, I conclude that the model with all variables is the better of the two models.
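The comparison between the two fits can be made side by side in code. The sketch below assumes the built-in `swiss` data set and uses illustrative object names. The partial F-test via `anova()` is an addition that the original analysis did not use; because only one term is dropped, its p-value equals the t-test p-value for Examination in the full model (about 0.315).

```r
# Refit both models so they can be compared as objects.
full <- lm(Fertility ~ ., data = swiss)
reduced <- update(full, . ~ . - Examination)

# Adjusted R-squared for each fit, as reported by summary().
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 4))

# Partial F-test for the nested pair: does Examination add explanatory
# power beyond the other four predictors?
f_test <- anova(reduced, full)
print(f_test)
```

The two adjusted R-squared values differ only in the fourth decimal place, so this criterion separates the models very weakly.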
Now that the Fertility variable has been recoded, I can run my logistic model:
summary(glm(Fertility ~ . , data = swiss, family = binomial))
##
## Call:
## glm(formula = Fertility ~ ., family = binomial, data = swiss)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -1.85403 -0.45960  0.03648  0.55548  2.32911
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)       4.82826    5.25607   0.919   0.3583
## Agriculture      -0.09615    0.04011  -2.397   0.0165 *
## Examination      -0.32116    0.13844  -2.320   0.0203 *
## Education        -0.12078    0.08610  -1.403   0.1607
## Catholic          0.02078    0.01376   1.509   0.1312
## Infant.Mortality  0.29078    0.21051   1.381   0.1672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 32.887 on 41 degrees of freedom
## AIC: 44.887
##
## Number of Fisher Scoring iterations: 6
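The summary numbers above can be worked through by hand to check the overall fit. The following sketch uses only values copied from the printed output; the variable names are illustrative. The AIC identity assumes the recoded Fertility is a 0/1 variable, which the match with the reported AIC supports.

```r
# Values copied from the glm() summary above.
null_dev  <- 65.135; null_df  <- 46
resid_dev <- 32.887; resid_df <- 41

# Likelihood-ratio (deviance) test of the fitted model against the
# intercept-only model.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p.value = p_value))

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients).
n_coef <- 6  # intercept plus five predictors
aic_check <- resid_dev + 2 * n_coef
print(aic_check)  # 44.887, matching the reported AIC

# Odds ratios: exponentiate the printed slope estimates.
estimates <- c(Agriculture = -0.09615, Examination = -0.32116,
               Education = -0.12078, Catholic = 0.02078,
               Infant.Mortality = 0.29078)
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))
```

An odds ratio below 1 means that a one-unit increase in the predictor lowers the odds of the outcome coded as 1, holding the other predictors fixed; a ratio above 1 raises them.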