In the swiss dataset, there are five potential predictors. My first goal is to determine which of them to include in the “best multiple regression model” for the Fertility variable.
data(swiss)
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
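The full matrix is wide and wraps across lines, so the Fertility column is easier to read on its own. A minimal sketch, assuming only base R; the names fert_cor and ranked are mine, not part of the dataset:

```r
data(swiss)

# Keep only the correlations of each predictor with Fertility
fert_cor <- cor(swiss)[, "Fertility"]
fert_cor <- fert_cor[names(fert_cor) != "Fertility"]

# Rank predictors by the absolute size of their correlation
ranked <- fert_cor[order(abs(fert_cor), decreasing = TRUE)]
round(ranked, 2)
```

Ranking by absolute value puts strong negative and strong positive relationships on the same footing.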
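So, what I see is that the predictors most strongly correlated with Fertility are Education (-0.66) and Examination (-0.65), and perhaps Catholic (0.46). Examination and Education are also strongly correlated with each other (0.70). So, starting from Education and Catholic and using lm() in R…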
# Education and Catholic
lm(Fertility ~ Education + Catholic, data = swiss)
##
## Call:
## lm(formula = Fertility ~ Education + Catholic, data = swiss)
##
## Coefficients:
## (Intercept) Education Catholic
## 74.2337 -0.7883 0.1109
summary(lm(Fertility ~ Education + Catholic, data = swiss))
##
## Call:
## lm(formula = Fertility ~ Education + Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.042 -6.578 -1.431 6.122 14.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.23369 2.35197 31.562 < 2e-16 ***
## Education -0.78833 0.12929 -6.097 2.43e-07 ***
## Catholic 0.11092 0.02981 3.721 0.00056 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.331 on 44 degrees of freedom
## Multiple R-squared: 0.5745, Adjusted R-squared: 0.5552
## F-statistic: 29.7 on 2 and 44 DF, p-value: 6.849e-09
# Education and Catholic and Infant.Mortality
lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss)
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality,
## data = swiss)
##
## Coefficients:
## (Intercept) Education Catholic Infant.Mortality
## 48.67707 -0.75925 0.09607 1.29615
summary(lm(Fertility ~ Education + Catholic + Infant.Mortality, data= swiss))
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality,
## data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4781 -5.4403 -0.5143 4.1568 15.1187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.67707 7.91908 6.147 2.24e-07 ***
## Education -0.75925 0.11680 -6.501 6.83e-08 ***
## Catholic 0.09607 0.02722 3.530 0.00101 **
## Infant.Mortality 1.29615 0.38699 3.349 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.505 on 43 degrees of freedom
## Multiple R-squared: 0.6625, Adjusted R-squared: 0.639
## F-statistic: 28.14 on 3 and 43 DF, p-value: 3.15e-10
# Education and Catholic and Infant.Mortality and Agriculture
lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss)
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality +
## Agriculture, data = swiss)
##
## Coefficients:
## (Intercept) Education Catholic Infant.Mortality
## 62.1013 -0.9803 0.1247 1.0784
## Agriculture
## -0.1546
summary(lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss))
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality +
## Agriculture, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
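The three fits above are nested, so they can be compared side by side instead of reading each summary separately. A minimal sketch, assuming base R only; the object names m2, m3, m4, and adj_r2 are mine:

```r
data(swiss)

# Refit the three nested models, adding one predictor at a time
m2 <- lm(Fertility ~ Education + Catholic, data = swiss)
m3 <- update(m2, . ~ . + Infant.Mortality)
m4 <- update(m3, . ~ . + Agriculture)

# Adjusted R-squared penalizes extra predictors, so it is comparable across models
adj_r2 <- sapply(list(m2 = m2, m3 = m3, m4 = m4),
                 function(m) summary(m)$adj.r.squared)
round(adj_r2, 4)

# Sequential F-tests: does each added predictor reduce the residual sum of squares?
anova(m2, m3, m4)
```

The adjusted R-squared values should match the summaries above (0.5552, 0.639, 0.6707), which is why I carry the four-predictor model forward.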
fit <- lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss)
par(mfrow = c(2, 2))
plot(fit)
Observations:
Plot 1 (Residuals vs Fitted): I see random scatter around zero, supporting the linearity assumption.
Plot 2 (Q-Q): Data points follow the dashed line, supporting the normality assumption.
Plot 3 (Scale-Location): A horizontal line with points spread randomly and equally. A “non-funneling” shape indicates homoscedasticity (constant residual variance).
The swiss dataset includes one observation for each of 47 different provinces, so I assume independence. Provinces are distinct regions, treated here as not impacting each other.
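The plots are judged by eye, so a numeric check can back them up. The sketch below is one possible approach, not the only one: shapiro.test() from base R for normality of the residuals, and a hand-computed studentized Breusch-Pagan statistic for constant variance (so that no extra package is assumed). The names res, sq_res, aux, bp_stat, and bp_p are mine:

```r
data(swiss)
fit <- lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture,
          data = swiss)
res <- residuals(fit)

# Normality: a large p-value gives no evidence against normal residuals
shapiro.test(res)

# Constant variance: regress squared residuals on the predictors;
# n * R^2 is approximately chi-squared with df = number of predictors
sq_res <- res^2
aux <- lm(sq_res ~ Education + Catholic + Infant.Mortality + Agriculture,
          data = swiss)
bp_stat <- nobs(aux) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 4, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

In both tests a small p-value would argue against the assumption, so they should be read alongside the plots rather than in place of them.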
For a logistic regression, I need to add a binary variable to the swiss dataset:
swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)
table(swiss$greater_than_70)
##
## 0 1
## 23 24
# My starting point is the model from linear regression
summary(glm(greater_than_70 ~ Education + Catholic + Infant.Mortality + Agriculture, family = binomial(), data = swiss ))
##
## Call:
## glm(formula = greater_than_70 ~ Education + Catholic + Infant.Mortality +
## Agriculture, family = binomial(), data = swiss)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.94524 4.13457 -0.229 0.8192
## Education -0.21212 0.08825 -2.404 0.0162 *
## Catholic 0.03316 0.01187 2.795 0.0052 **
## Infant.Mortality 0.22203 0.17261 1.286 0.1983
## Agriculture -0.04973 0.02675 -1.859 0.0631 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 40.634 on 42 degrees of freedom
## AIC: 50.634
##
## Number of Fisher Scoring iterations: 5
Observations:
Catholic and Education are the only predictors significant at the 0.05 level in the Pr(>|z|) column, so I will eliminate Infant.Mortality and Agriculture to simplify the model.
glm(greater_than_70 ~ Education + Catholic, family = binomial(), data = swiss)
##
## Call: glm(formula = greater_than_70 ~ Education + Catholic, family = binomial(),
## data = swiss)
##
## Coefficients:
## (Intercept) Education Catholic
## 0.37798 -0.11882 0.02343
##
## Degrees of Freedom: 46 Total (i.e. Null); 44 Residual
## Null Deviance: 65.13
## Residual Deviance: 48.26 AIC: 54.26
summary(glm(greater_than_70 ~ Education + Catholic, family = binomial(), data = swiss))
##
## Call:
## glm(formula = greater_than_70 ~ Education + Catholic, family = binomial(),
## data = swiss)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.377979 0.718516 0.526 0.5988
## Education -0.118823 0.059196 -2.007 0.0447 *
## Catholic 0.023428 0.009328 2.511 0.0120 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 48.262 on 44 degrees of freedom
## AIC: 54.262
##
## Number of Fisher Scoring iterations: 5
Observations:
Education and Catholic remain significant.
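Because two predictors were dropped at once, the reduced model can also be checked against the four-predictor model jointly. The outputs above already hint at this: the AIC is lower for the larger model (50.634 vs. 54.262), and the residual deviances differ by 48.262 - 40.634 = 7.628 on 2 degrees of freedom. A minimal sketch of a likelihood-ratio test, assuming base R; the names full, reduced, and lrt are mine:

```r
data(swiss)
swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)

full <- glm(greater_than_70 ~ Education + Catholic + Infant.Mortality + Agriculture,
            family = binomial(), data = swiss)
reduced <- glm(greater_than_70 ~ Education + Catholic,
               family = binomial(), data = swiss)

# Likelihood-ratio test: is the drop in deviance larger than chance would explain?
lrt <- anova(reduced, full, test = "Chisq")
lrt

# AIC trades goodness of fit against the number of parameters (lower is better)
AIC(reduced, full)
```

A small p-value here would suggest the two dropped predictors carry some joint information even though neither is individually significant, which is worth weighing against the simpler model.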
exp(coef(glm(greater_than_70 ~ Education + Catholic, family = binomial(), data = swiss)))
## (Intercept) Education Catholic
## 1.4593328 0.8879647 1.0237044
Interpretation: To predict whether a province’s Fertility exceeded 70, I built a logistic regression model using the swiss dataset. I initially included all four predictors from my best linear model (Education, Catholic, Agriculture, and Infant.Mortality), but Agriculture and Infant.Mortality were not statistically significant (p > 0.05) and were removed. The final model retained only Education and Catholic as predictors.
For every one-percentage-point increase in Education, the odds of Fertility exceeding 70 decrease by 11.2% (odds ratio 0.888), holding Catholic constant. For every one-percentage-point increase in Catholic, the odds of Fertility exceeding 70 increase by 2.4% (odds ratio 1.024), holding Education constant.
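Per percentage point, Education has the larger effect on the odds, suggesting that provinces with higher education levels are less likely to have high fertility rates. Catholic, however, has the smaller p-value in this model (0.0120 vs. 0.0447), so “stronger” here refers to the per-point effect size rather than to the strength of the statistical evidence.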
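The percentages quoted in the interpretation come from the odds ratios as (odds ratio - 1) × 100. A minimal sketch of that arithmetic, assuming base R; logit_fit, odds_ratio, and pct_change are my names, and confint.default() gives Wald intervals, which is one choice among several:

```r
data(swiss)
swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)

logit_fit <- glm(greater_than_70 ~ Education + Catholic,
                 family = binomial(), data = swiss)

# Exponentiate log-odds coefficients to get odds ratios
odds_ratio <- exp(coef(logit_fit))

# Percent change in the odds per one-point increase in each predictor
pct_change <- (odds_ratio - 1) * 100
round(pct_change[c("Education", "Catholic")], 1)

# Wald 95% confidence intervals on the odds-ratio scale
exp(confint.default(logit_fit))
```

An interval that excludes 1 corresponds to a predictor that is significant at the 0.05 level in the summary table.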