Using the “swiss” dataset, build the best multiple regression model you can for the variable Fertility.
#Using the Swiss dataset in R
library(psych)
describe(swiss)
## vars n mean sd median trimmed mad min max range
## Fertility 1 47 70.14 12.49 70.40 70.66 10.23 35.00 92.5 57.50
## Agriculture 2 47 50.66 22.71 54.10 51.16 23.87 1.20 89.7 88.50
## Examination 3 47 16.49 7.98 16.00 16.08 7.41 3.00 37.0 34.00
## Education 4 47 10.98 9.62 8.00 9.38 5.93 1.00 53.0 52.00
## Catholic 5 47 41.14 41.70 15.14 39.12 18.65 2.15 100.0 97.85
## Infant.Mortality 6 47 19.94 2.91 20.00 19.98 2.82 10.80 26.6 15.80
## skew kurtosis se
## Fertility -0.46 0.26 1.82
## Agriculture -0.32 -0.89 3.31
## Examination 0.45 -0.14 1.16
## Education 2.27 6.14 1.40
## Catholic 0.48 -1.67 6.08
## Infant.Mortality -0.33 0.78 0.42
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
#Test the full model (includes all vars in the dataset)
swiss.lm <- lm(Fertility ~ Agriculture + Education + Examination + Catholic + Infant.Mortality, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination +
## Catholic + Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Examination
## 66.9152 -0.1721 -0.8709 -0.2580
## Catholic Infant.Mortality
## 0.1041 1.0770
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination +
## Catholic + Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Examination -0.25801 0.25388 -1.016 0.31546
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
Adjusted \(R^2\) = 0.671 for this model. Adjusted \(R^2\) measures how much variation in y (fertility) can be explained by the model. The adjusted version of \(R^2\) is used when there are multiple variables in the regression model, since the more variables added to the model, the greater \(R^2\) increases. Adjusted \(R^2\) is high so our model fits the observations well. Since \(R^2\) < 0.8, this would imply that we don’t have multicollinearity.
Next, we want to look at the p-values. We would want to keep any predictors/variables that have a p-value lower than \(\alpha\) < 0.05. In the above model, Examination has a p-value greater than 0.05. As a result, we’d want to remove Examination from the model. However, we’ll do this one at a time through the backwards elimination process so we can test \(R^2\) as well.
Source: Open Intro stats textbook, old econometrics notes and https://statisticsbyjim.com/regression/interpret-r-squared-regression/#google_vignette
#Test full without Catholic
swiss.lm <- lm(Fertility ~ Agriculture + Education + Examination + Infant.Mortality, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Examination
## 68.7731 -0.1293 -0.6196 -0.6880
## Infant.Mortality
## 1.3071
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.513 -5.011 -1.188 5.522 16.982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.77314 11.62836 5.914 5.28e-07 ***
## Agriculture -0.12929 0.07485 -1.727 0.09144 .
## Education -0.61965 0.17631 -3.515 0.00107 **
## Examination -0.68799 0.22628 -3.040 0.00406 **
## Infant.Mortality 1.30710 0.40658 3.215 0.00251 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.796 on 42 degrees of freedom
## Multiple R-squared: 0.6444, Adjusted R-squared: 0.6105
## F-statistic: 19.02 on 4 and 42 DF, p-value: 5.413e-09
Adjusted \(R^2\) has decreased from 0.671 to 0.61, so removing Catholic seemed to make our model worse. Based on these results, we should add it back in and remove a different predictor/variable.
#Test full without Agriculture
swiss.lm <- lm(Fertility ~ Education + Examination + Catholic + Infant.Mortality, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Education + Examination + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Education Examination Catholic
## 50.02821 -0.70416 -0.10580 0.08631
## Infant.Mortality
## 1.30568
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Education + Examination + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7141 -5.1741 -0.6893 4.2776 14.7346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.02821 8.66076 5.776 8.33e-07 ***
## Education -0.70416 0.17969 -3.919 0.000322 ***
## Examination -0.10580 0.26037 -0.406 0.686539
## Catholic 0.08631 0.03649 2.365 0.022717 *
## Infant.Mortality 1.30568 0.39150 3.335 0.001791 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.579 on 42 degrees of freedom
## Multiple R-squared: 0.6639, Adjusted R-squared: 0.6319
## F-statistic: 20.74 on 4 and 42 DF, p-value: 1.703e-09
Adjusted \(R^2\) has decreased from 0.671 to 0.631, so removing Agriculture seemed to make our model worse. Based on these results, we should add it back in and remove a different predictor/variable.
#Test full without Education
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Catholic + Infant.Mortality, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Examination Catholic
## 59.60267 -0.04759 -0.96805 0.02611
## Infant.Mortality
## 1.39597
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.9194 -3.5530 -0.6489 6.5956 14.1767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.60267 13.04246 4.570 4.25e-05 ***
## Agriculture -0.04759 0.08032 -0.593 0.556688
## Examination -0.96805 0.25284 -3.829 0.000423 ***
## Catholic 0.02611 0.03843 0.679 0.500551
## Infant.Mortality 1.39597 0.46259 3.018 0.004315 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.82 on 42 degrees of freedom
## Multiple R-squared: 0.5448, Adjusted R-squared: 0.5014
## F-statistic: 12.57 on 4 and 42 DF, p-value: 8.272e-07
Adjusted \(R^2\) has decreased significantly from 0.671 to 0.501, so removing Education seemed to make our model worse. Based on these results, we should add it back in and remove a different predictor/variable.
#Test full without Examination
swiss.lm <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
Adjusted \(R^2\) has decreased very slightly from 0.671 to 0.67, so removing Examination seemed to make our model worse (although only by a small amount). HOWEVER, it had a high p-value in our original full model, so we will want to remove it. We will add it back in just to test the final variable, infant.mortality, in the next model, but eventually we will want to remove it.
#Test full without Infant.Mortality
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Education + Catholic, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Examination Education Catholic
## 91.0554 -0.2206 -0.2606 -0.9616 0.1244
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7813 -6.3308 0.8113 5.7205 15.5569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.05542 6.94881 13.104 < 2e-16 ***
## Agriculture -0.22065 0.07360 -2.998 0.00455 **
## Examination -0.26058 0.27411 -0.951 0.34722
## Education -0.96161 0.19455 -4.943 1.28e-05 ***
## Catholic 0.12442 0.03727 3.339 0.00177 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.736 on 42 degrees of freedom
## Multiple R-squared: 0.6498, Adjusted R-squared: 0.6164
## F-statistic: 19.48 on 4 and 42 DF, p-value: 3.95e-09
Adjusted \(R^2\) has decreased slightly from 0.671 to 0.616, so removing Infant.Mortality seemed to make our model worse. Based on these results, we should still add it back in.
Based on our backwards elimination results, removing any of the variables/predictors reduced the adjusted \(R^2\) value. However, for the Examination variable, the adjusted \(R^2\) was only slightly lower as a result of removing it. More concerning is its high p-value, which was around 0.31, which is much larger than the p-value 0.05. As a result, the best regression model is the original full model, but removing the one variable/predictor, Examination.
In addition, looking at the diagnostic plots, the model seems to meet the Gauss-Markov assumptions- the Residuals vs Fitted graph looks linear (doesn’t look like a parabola or curve), the Q-Q graph shows a mostly straight line aligning with the points, the Scale-Location graph shows a mainly horizontal red line, and the Residuals vs Leverage graph shows that the values are inside the Cook’s distance lines. These graphs show that the model conditions are reasonable.
\[ \hat{Fertility}_i = \hat{\beta_0} + Agriculture_i \hat{\beta_1} + Education_i \hat{\beta_2} + Catholic_i \hat{\beta_3} + Infant.Mortality_i \hat{\beta_4} \]
#Test model without Examination
swiss.lm <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
swiss.lm
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
plot(swiss.lm)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
which then becomes:
\[ \hat{Fertility}_i = 62.1 - Agriculture_i 0.155 - Education_i 0.98 + Catholic_i 0.125 + Infant.Mortality_i 1.08 \]
Then build a logistic regression model for predicting Fertility>70.0. Post your solutions, interpretation, and code for your peers to see.
We need to have a binary independent variable (Fertility). Since we want to predict Fertility > 70, we need to create a dummy or flag variable that = 0 if Fertility < 70 and = 1 if Fertility > 70.
swiss_df <- swiss
swiss_df$Fertility_70 = as.numeric(swiss_df$Fertility>70)
#Fit the logit or GLM model for all vars
swiss.glm <- glm(Fertility_70 ~ Agriculture + Education + Examination + Catholic + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
##
## Call: glm(formula = Fertility_70 ~ Agriculture + Education + Examination +
## Catholic + Infant.Mortality, family = binomial(link = "logit"),
## data = swiss_df)
##
## Coefficients:
## (Intercept) Agriculture Education Examination
## 4.82826 -0.09615 -0.12078 -0.32116
## Catholic Infant.Mortality
## 0.02078 0.29078
##
## Degrees of Freedom: 46 Total (i.e. Null); 41 Residual
## Null Deviance: 65.13
## Residual Deviance: 32.89 AIC: 44.89
summary(swiss.glm)
##
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Education + Examination +
## Catholic + Infant.Mortality, family = binomial(link = "logit"),
## data = swiss_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.82826 5.25607 0.919 0.3583
## Agriculture -0.09615 0.04011 -2.397 0.0165 *
## Education -0.12078 0.08610 -1.403 0.1607
## Examination -0.32116 0.13844 -2.320 0.0203 *
## Catholic 0.02078 0.01376 1.509 0.1312
## Infant.Mortality 0.29078 0.21051 1.381 0.1672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 32.887 on 41 degrees of freedom
## AIC: 44.887
##
## Number of Fisher Scoring iterations: 6
Use backwards elimination to trim the model. We’ll use the statistic Akaike information criterion (AIC), an analog to how we used adjusted R-squared in multiple regression. We want to look for models with a lower AIC through a backward elimination strategy. However, based on p-values, it looks like we’d want to remove the Education, Catholic, and Infant.Mortality predictors because their p-values are above 0.05. Source: Open Intro Stats Textbook
#Fit the logit or GLM model, removing Agriculture
swiss.glm <- glm(Fertility_70 ~ Education + Catholic + Examination + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
##
## Call: glm(formula = Fertility_70 ~ Education + Catholic + Examination +
## Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## (Intercept) Education Catholic Examination
## -3.770018 -0.036020 0.009022 -0.162016
## Infant.Mortality
## 0.321133
##
## Degrees of Freedom: 46 Total (i.e. Null); 42 Residual
## Null Deviance: 65.13
## Residual Deviance: 41.5 AIC: 51.5
summary(swiss.glm)
##
## Call:
## glm(formula = Fertility_70 ~ Education + Catholic + Examination +
## Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.770018 3.660064 -1.030 0.3030
## Education -0.036020 0.072932 -0.494 0.6214
## Catholic 0.009022 0.012197 0.740 0.4595
## Examination -0.162016 0.096949 -1.671 0.0947 .
## Infant.Mortality 0.321133 0.176705 1.817 0.0692 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 41.504 on 42 degrees of freedom
## AIC: 51.504
##
## Number of Fisher Scoring iterations: 5
The AIC is higher here than in the original model, so removing Agriculture actually makes the model worse.
#Fit the logit or GLM model, removing Education
swiss.glm <- glm(Fertility_70 ~ Agriculture + Catholic + Examination + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
##
## Call: glm(formula = Fertility_70 ~ Agriculture + Catholic + Examination +
## Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## (Intercept) Agriculture Catholic Examination
## 2.74528 -0.08006 0.01311 -0.37906
## Infant.Mortality
## 0.35266
##
## Degrees of Freedom: 46 Total (i.e. Null); 42 Residual
## Null Deviance: 65.13
## Residual Deviance: 35.28 AIC: 45.28
summary(swiss.glm)
##
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Catholic + Examination +
## Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.74528 4.79625 0.572 0.56707
## Agriculture -0.08006 0.03812 -2.100 0.03571 *
## Catholic 0.01311 0.01239 1.059 0.28973
## Examination -0.37906 0.13199 -2.872 0.00408 **
## Infant.Mortality 0.35266 0.20510 1.719 0.08553 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 35.275 on 42 degrees of freedom
## AIC: 45.275
##
## Number of Fisher Scoring iterations: 6
AIC is lightly higher here than in the original model (45.28 vs 44.89), so removing Education makes the model worse.
#Fit the logit or GLM model, removing Examination
swiss.glm <- glm(Fertility_70 ~ Agriculture + Catholic + Education + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
##
## Call: glm(formula = Fertility_70 ~ Agriculture + Catholic + Education +
## Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## (Intercept) Agriculture Catholic Education
## -0.94524 -0.04973 0.03316 -0.21212
## Infant.Mortality
## 0.22203
##
## Degrees of Freedom: 46 Total (i.e. Null); 42 Residual
## Null Deviance: 65.13
## Residual Deviance: 40.63 AIC: 50.63
summary(swiss.glm)
##
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Catholic + Education +
## Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.94524 4.13457 -0.229 0.8192
## Agriculture -0.04973 0.02675 -1.859 0.0631 .
## Catholic 0.03316 0.01187 2.795 0.0052 **
## Education -0.21212 0.08825 -2.404 0.0162 *
## Infant.Mortality 0.22203 0.17261 1.286 0.1983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 40.634 on 42 degrees of freedom
## AIC: 50.634
##
## Number of Fisher Scoring iterations: 5
AIC is higher here than in the original model (50.63 vs 44.89), so removing Examination makes the model worse.
#Fit the logit or GLM model, removing Infant.Mortality
swiss.glm <- glm(Fertility_70 ~ Agriculture + Catholic + Education + Examination,family=binomial(link='logit'),data=swiss_df)
swiss.glm
##
## Call: glm(formula = Fertility_70 ~ Agriculture + Catholic + Education +
## Examination, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## (Intercept) Agriculture Catholic Education Examination
## 10.44664 -0.09573 0.02251 -0.14430 -0.29567
##
## Degrees of Freedom: 46 Total (i.e. Null); 42 Residual
## Null Deviance: 65.13
## Residual Deviance: 35.15 AIC: 45.15
summary(swiss.glm)
##
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Catholic + Education +
## Examination, family = binomial(link = "logit"), data = swiss_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.44664 3.79705 2.751 0.00594 **
## Agriculture -0.09573 0.03687 -2.596 0.00942 **
## Catholic 0.02251 0.01278 1.762 0.07802 .
## Education -0.14430 0.08420 -1.714 0.08657 .
## Examination -0.29567 0.12821 -2.306 0.02110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 35.149 on 42 degrees of freedom
## AIC: 45.149
##
## Number of Fisher Scoring iterations: 6
AIC is slightly higher here than in the original model (45.15 vs 44.89), so removing Infant.Mortality makes the model worse.
Based on our results with AIC, we shouldn’t remove any predictors. However, when looking at p-values of the predictors from the original full model, we should remove the Education, Catholic, and Infant.Mortality predictors.
swiss.glm <- glm(Fertility_70 ~ Agriculture + Examination,family=binomial(link='logit'),data=swiss_df)
swiss.glm
##
## Call: glm(formula = Fertility_70 ~ Agriculture + Examination, family = binomial(link = "logit"),
## data = swiss_df)
##
## Coefficients:
## (Intercept) Agriculture Examination
## 9.69552 -0.06706 -0.37609
##
## Degrees of Freedom: 46 Total (i.e. Null); 44 Residual
## Null Deviance: 65.13
## Residual Deviance: 40.48 AIC: 46.48
summary(swiss.glm)
##
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Examination, family = binomial(link = "logit"),
## data = swiss_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.69552 3.30710 2.932 0.003371 **
## Agriculture -0.06706 0.03098 -2.165 0.030420 *
## Examination -0.37609 0.11286 -3.332 0.000861 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 40.484 on 44 degrees of freedom
## AIC: 46.484
##
## Number of Fisher Scoring iterations: 5
These results actually have a higher AIC than the original full model (46.484 vs 44.887). Based on this, I believe we should pick the full model as our GLM model, even though the p-values are not statistically significant. This is because the lowest AIC, the full model, shows that this is the best fit, even though each predictor independently is not significant.
\[ {Log}_e(\frac{p}{1-p}) = 4.828 - Agriculture_i 0.096 - Education_i 0.121 - Examination_i 0.321 + Catholic_i 0.021 + Infant.Mortality_i 0.291 \]
To find the interpretation of each of the predictors, we’ll take the log odds of each.
agriculture <- exp(-0.09615)
Education <- exp(-0.12078)
Examination <- exp(-0.32116)
Catholic <- exp(0.02078)
Infant.Mortality <- exp(0.29078)
Increasing Agriculture by 1 unit decreases the odds of fertility > 70 by ~9% (1-0.91). Increasing Education by 1 unit decreases the odds of fertility > 70 by 11% (1-0.89). Increasing Examination by 1 unit decreases the odds of fertility > 70 by 27% (1-0.73). Increasing Catholic by 1 unit increases the odds of fertility > 70 by 2%. Increasing infant mortality by one unit increases the odds of fertility > 70 by 34%. The intercept is useless because zero values are not present in the data.