Part I: MULTIVARIATE REGRESSION

Using the “swiss” dataset, build the best multiple regression model you can for the variable Fertility.

Using the full model

#Using the Swiss dataset in R
library(psych)
describe(swiss)
##                  vars  n  mean    sd median trimmed   mad   min   max range
## Fertility           1 47 70.14 12.49  70.40   70.66 10.23 35.00  92.5 57.50
## Agriculture         2 47 50.66 22.71  54.10   51.16 23.87  1.20  89.7 88.50
## Examination         3 47 16.49  7.98  16.00   16.08  7.41  3.00  37.0 34.00
## Education           4 47 10.98  9.62   8.00    9.38  5.93  1.00  53.0 52.00
## Catholic            5 47 41.14 41.70  15.14   39.12 18.65  2.15 100.0 97.85
## Infant.Mortality    6 47 19.94  2.91  20.00   19.98  2.82 10.80  26.6 15.80
##                   skew kurtosis   se
## Fertility        -0.46     0.26 1.82
## Agriculture      -0.32    -0.89 3.31
## Examination       0.45    -0.14 1.16
## Education         2.27     6.14 1.40
## Catholic          0.48    -1.67 6.08
## Infant.Mortality -0.33     0.78 0.42
cor(swiss)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000
#Test the full model (includes all vars in the dataset)
swiss.lm <- lm(Fertility ~ Agriculture + Education + Examination + Catholic + Infant.Mortality, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination + 
##     Catholic + Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education       Examination  
##          66.9152           -0.1721           -0.8709           -0.2580  
##         Catholic  Infant.Mortality  
##           0.1041            1.0770
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination + 
##     Catholic + Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10

Interpretation

Adjusted \(R^2\) = 0.671 for this model. Adjusted \(R^2\) measures how much variation in y (fertility) can be explained by the model. The adjusted version of \(R^2\) is used when there are multiple variables in the regression model, since the more variables added to the model, the greater \(R^2\) increases. Adjusted \(R^2\) is high so our model fits the observations well. Since \(R^2\) < 0.8, this would imply that we don’t have multicollinearity.

Next, we want to look at the p-values. We would want to keep any predictors/variables that have a p-value lower than \(\alpha\) < 0.05. In the above model, Examination has a p-value greater than 0.05. As a result, we’d want to remove Examination from the model. However, we’ll do this one at a time through the backwards elimination process so we can test \(R^2\) as well.

Source: Open Intro stats textbook, old econometrics notes and https://statisticsbyjim.com/regression/interpret-r-squared-regression/#google_vignette

Use backwards elimination to determine the best model. We’ll do this for each variable.

Removing the Catholic Predictor

#Test full without Catholic
swiss.lm <- lm(Fertility ~ Agriculture + Education + Examination + Infant.Mortality, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education       Examination  
##          68.7731           -0.1293           -0.6196           -0.6880  
## Infant.Mortality  
##           1.3071
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Examination + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.513  -5.011  -1.188   5.522  16.982 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      68.77314   11.62836   5.914 5.28e-07 ***
## Agriculture      -0.12929    0.07485  -1.727  0.09144 .  
## Education        -0.61965    0.17631  -3.515  0.00107 ** 
## Examination      -0.68799    0.22628  -3.040  0.00406 ** 
## Infant.Mortality  1.30710    0.40658   3.215  0.00251 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.796 on 42 degrees of freedom
## Multiple R-squared:  0.6444, Adjusted R-squared:  0.6105 
## F-statistic: 19.02 on 4 and 42 DF,  p-value: 5.413e-09

Interpretation

Adjusted \(R^2\) has decreased from 0.671 to 0.61, so removing Catholic seemed to make our model worse. Based on these results, we should add it back in and remove a different predictor/variable.

Removing the Agriculture Predictor

#Test full without Agriculture
swiss.lm <- lm(Fertility ~  Education + Examination + Catholic + Infant.Mortality, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Education + Examination + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)         Education       Examination          Catholic  
##         50.02821          -0.70416          -0.10580           0.08631  
## Infant.Mortality  
##          1.30568
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Education + Examination + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7141  -5.1741  -0.6893   4.2776  14.7346 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      50.02821    8.66076   5.776 8.33e-07 ***
## Education        -0.70416    0.17969  -3.919 0.000322 ***
## Examination      -0.10580    0.26037  -0.406 0.686539    
## Catholic          0.08631    0.03649   2.365 0.022717 *  
## Infant.Mortality  1.30568    0.39150   3.335 0.001791 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.579 on 42 degrees of freedom
## Multiple R-squared:  0.6639, Adjusted R-squared:  0.6319 
## F-statistic: 20.74 on 4 and 42 DF,  p-value: 1.703e-09

Interpretation

Adjusted \(R^2\) has decreased from 0.671 to 0.631, so removing Agriculture seemed to make our model worse. Based on these results, we should add it back in and remove a different predictor/variable.

Removing the Education Predictor

#Test full without Education 
swiss.lm <- lm(Fertility ~  Agriculture + Examination + Catholic + Infant.Mortality, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture       Examination          Catholic  
##         59.60267          -0.04759          -0.96805           0.02611  
## Infant.Mortality  
##          1.39597
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.9194  -3.5530  -0.6489   6.5956  14.1767 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      59.60267   13.04246   4.570 4.25e-05 ***
## Agriculture      -0.04759    0.08032  -0.593 0.556688    
## Examination      -0.96805    0.25284  -3.829 0.000423 ***
## Catholic          0.02611    0.03843   0.679 0.500551    
## Infant.Mortality  1.39597    0.46259   3.018 0.004315 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.82 on 42 degrees of freedom
## Multiple R-squared:  0.5448, Adjusted R-squared:  0.5014 
## F-statistic: 12.57 on 4 and 42 DF,  p-value: 8.272e-07

Interpretation

Adjusted \(R^2\) has decreased significantly from 0.671 to 0.501, so removing Education seemed to make our model worse. Based on these results, we should add it back in and remove a different predictor/variable.

Removing the Examination Predictor

#Test full without Examination 
swiss.lm <- lm(Fertility ~  Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic  
##          62.1013           -0.1546           -0.9803            0.1247  
## Infant.Mortality  
##           1.0784
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

Interpretation

Adjusted \(R^2\) has decreased very slightly from 0.671 to 0.67, so removing Examination seemed to make our model worse (although only by a small amount). HOWEVER, it had a high p-value in our original full model, so we will want to remove it. We will add it back in just to test the final variable, infant.mortality, in the next model, but eventually we will want to remove it.

Removing the Infant.Mortality Predictor

#Test full without Infant.Mortality 
swiss.lm <- lm(Fertility ~  Agriculture + Examination + Education + Catholic, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education + 
##     Catholic, data = swiss)
## 
## Coefficients:
## (Intercept)  Agriculture  Examination    Education     Catholic  
##     91.0554      -0.2206      -0.2606      -0.9616       0.1244
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education + 
##     Catholic, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7813  -6.3308   0.8113   5.7205  15.5569 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.05542    6.94881  13.104  < 2e-16 ***
## Agriculture -0.22065    0.07360  -2.998  0.00455 ** 
## Examination -0.26058    0.27411  -0.951  0.34722    
## Education   -0.96161    0.19455  -4.943 1.28e-05 ***
## Catholic     0.12442    0.03727   3.339  0.00177 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.736 on 42 degrees of freedom
## Multiple R-squared:  0.6498, Adjusted R-squared:  0.6164 
## F-statistic: 19.48 on 4 and 42 DF,  p-value: 3.95e-09

Interpretation

Adjusted \(R^2\) has decreased slightly from 0.671 to 0.616, so removing Infant.Mortality seemed to make our model worse. Based on these results, we should still add it back in.

Final Answer to Part I:

Based on our backwards elimination results, removing any of the variables/predictors reduced the adjusted \(R^2\) value. However, for the Examination variable, the adjusted \(R^2\) was only slightly lower as a result of removing it. More concerning is its high p-value, which was around 0.31, which is much larger than the p-value 0.05. As a result, the best regression model is the original full model, but removing the one variable/predictor, Examination.

In addition, looking at the diagnostic plots, the model seems to meet the Gauss-Markov assumptions- the Residuals vs Fitted graph looks linear (doesn’t look like a parabola or curve), the Q-Q graph shows a mostly straight line aligning with the points, the Scale-Location graph shows a mainly horizontal red line, and the Residuals vs Leverage graph shows that the values are inside the Cook’s distance lines. These graphs show that the model conditions are reasonable.

Going back to the full model:

\[ \hat{Fertility}_i = \hat{\beta_0} + Agriculture_i \hat{\beta_1} + Education_i \hat{\beta_2} + Catholic_i \hat{\beta_3} + Infant.Mortality_i \hat{\beta_4} \]

#Test model without Examination
swiss.lm <- lm(Fertility ~  Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
swiss.lm
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic  
##          62.1013           -0.1546           -0.9803            0.1247  
## Infant.Mortality  
##           1.0784
plot(swiss.lm)

summary(swiss.lm)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

which then becomes:

\[ \hat{Fertility}_i = 62.1 - Agriculture_i 0.155 - Education_i 0.98 + Catholic_i 0.125 + Infant.Mortality_i 1.08 \]

Part II: LOGISTIC REGRESSION

Then build a logistic regression model for predicting Fertility>70.0. Post your solutions, interpretation, and code for your peers to see.

Part II Answer:

We need to have a binary independent variable (Fertility). Since we want to predict Fertility > 70, we need to create a dummy or flag variable that = 0 if Fertility < 70 and = 1 if Fertility > 70.

swiss_df <- swiss
swiss_df$Fertility_70 = as.numeric(swiss_df$Fertility>70)

Full Model

#Fit the logit or GLM model for all vars
swiss.glm <- glm(Fertility_70 ~ Agriculture + Education + Examination + Catholic + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
## 
## Call:  glm(formula = Fertility_70 ~ Agriculture + Education + Examination + 
##     Catholic + Infant.Mortality, family = binomial(link = "logit"), 
##     data = swiss_df)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education       Examination  
##          4.82826          -0.09615          -0.12078          -0.32116  
##         Catholic  Infant.Mortality  
##          0.02078           0.29078  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  41 Residual
## Null Deviance:       65.13 
## Residual Deviance: 32.89     AIC: 44.89
summary(swiss.glm)
## 
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Education + Examination + 
##     Catholic + Infant.Mortality, family = binomial(link = "logit"), 
##     data = swiss_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       4.82826    5.25607   0.919   0.3583  
## Agriculture      -0.09615    0.04011  -2.397   0.0165 *
## Education        -0.12078    0.08610  -1.403   0.1607  
## Examination      -0.32116    0.13844  -2.320   0.0203 *
## Catholic          0.02078    0.01376   1.509   0.1312  
## Infant.Mortality  0.29078    0.21051   1.381   0.1672  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 32.887  on 41  degrees of freedom
## AIC: 44.887
## 
## Number of Fisher Scoring iterations: 6

Use backwards elimination to trim the model. We’ll use the statistic Akaike information criterion (AIC), an analog to how we used adjusted R-squared in multiple regression. We want to look for models with a lower AIC through a backward elimination strategy. However, based on p-values, it looks like we’d want to remove the Education, Catholic, and Infant.Mortality predictors because their p-values are above 0.05. Source: Open Intro Stats Textbook

Removing the Agriculture Predictor

#Fit the logit or GLM model, removing Agriculture
swiss.glm <- glm(Fertility_70 ~ Education + Catholic + Examination + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
## 
## Call:  glm(formula = Fertility_70 ~ Education + Catholic + Examination + 
##     Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##      (Intercept)         Education          Catholic       Examination  
##        -3.770018         -0.036020          0.009022         -0.162016  
## Infant.Mortality  
##         0.321133  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  42 Residual
## Null Deviance:       65.13 
## Residual Deviance: 41.5  AIC: 51.5
summary(swiss.glm)
## 
## Call:
## glm(formula = Fertility_70 ~ Education + Catholic + Examination + 
##     Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)  
## (Intercept)      -3.770018   3.660064  -1.030   0.3030  
## Education        -0.036020   0.072932  -0.494   0.6214  
## Catholic          0.009022   0.012197   0.740   0.4595  
## Examination      -0.162016   0.096949  -1.671   0.0947 .
## Infant.Mortality  0.321133   0.176705   1.817   0.0692 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 41.504  on 42  degrees of freedom
## AIC: 51.504
## 
## Number of Fisher Scoring iterations: 5

Interpretation

The AIC is higher here than in the original model, so removing Agriculture actually makes the model worse.

Removing the Education Predictor

#Fit the logit or GLM model, removing Education
swiss.glm <- glm(Fertility_70 ~ Agriculture + Catholic + Examination + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
## 
## Call:  glm(formula = Fertility_70 ~ Agriculture + Catholic + Examination + 
##     Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##      (Intercept)       Agriculture          Catholic       Examination  
##          2.74528          -0.08006           0.01311          -0.37906  
## Infant.Mortality  
##          0.35266  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  42 Residual
## Null Deviance:       65.13 
## Residual Deviance: 35.28     AIC: 45.28
summary(swiss.glm)
## 
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Catholic + Examination + 
##     Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       2.74528    4.79625   0.572  0.56707   
## Agriculture      -0.08006    0.03812  -2.100  0.03571 * 
## Catholic          0.01311    0.01239   1.059  0.28973   
## Examination      -0.37906    0.13199  -2.872  0.00408 **
## Infant.Mortality  0.35266    0.20510   1.719  0.08553 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 35.275  on 42  degrees of freedom
## AIC: 45.275
## 
## Number of Fisher Scoring iterations: 6

Interpretation

AIC is lightly higher here than in the original model (45.28 vs 44.89), so removing Education makes the model worse.

Removing the Examination Predictor

#Fit the logit or GLM model, removing Examination
swiss.glm <- glm(Fertility_70 ~ Agriculture + Catholic + Education + Infant.Mortality,family=binomial(link='logit'),data=swiss_df)
swiss.glm
## 
## Call:  glm(formula = Fertility_70 ~ Agriculture + Catholic + Education + 
##     Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##      (Intercept)       Agriculture          Catholic         Education  
##         -0.94524          -0.04973           0.03316          -0.21212  
## Infant.Mortality  
##          0.22203  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  42 Residual
## Null Deviance:       65.13 
## Residual Deviance: 40.63     AIC: 50.63
summary(swiss.glm)
## 
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Catholic + Education + 
##     Infant.Mortality, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.94524    4.13457  -0.229   0.8192   
## Agriculture      -0.04973    0.02675  -1.859   0.0631 . 
## Catholic          0.03316    0.01187   2.795   0.0052 **
## Education        -0.21212    0.08825  -2.404   0.0162 * 
## Infant.Mortality  0.22203    0.17261   1.286   0.1983   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 40.634  on 42  degrees of freedom
## AIC: 50.634
## 
## Number of Fisher Scoring iterations: 5

Interpretation

AIC is higher here than in the original model (50.63 vs 44.89), so removing Examination makes the model worse.

Removing the Infant.Mortality Predictor

#Fit the logit or GLM model, removing Infant.Mortality
swiss.glm <- glm(Fertility_70 ~ Agriculture + Catholic + Education + Examination,family=binomial(link='logit'),data=swiss_df)
swiss.glm
## 
## Call:  glm(formula = Fertility_70 ~ Agriculture + Catholic + Education + 
##     Examination, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
## (Intercept)  Agriculture     Catholic    Education  Examination  
##    10.44664     -0.09573      0.02251     -0.14430     -0.29567  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  42 Residual
## Null Deviance:       65.13 
## Residual Deviance: 35.15     AIC: 45.15
summary(swiss.glm)
## 
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Catholic + Education + 
##     Examination, family = binomial(link = "logit"), data = swiss_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 10.44664    3.79705   2.751  0.00594 **
## Agriculture -0.09573    0.03687  -2.596  0.00942 **
## Catholic     0.02251    0.01278   1.762  0.07802 . 
## Education   -0.14430    0.08420  -1.714  0.08657 . 
## Examination -0.29567    0.12821  -2.306  0.02110 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 35.149  on 42  degrees of freedom
## AIC: 45.149
## 
## Number of Fisher Scoring iterations: 6

Interpretation

AIC is slightly higher here than in the original model (45.15 vs 44.89), so removing Infant.Mortality makes the model worse.

Based on our results with AIC, we shouldn’t remove any predictors. However, when looking at p-values of the predictors from the original full model, we should remove the Education, Catholic, and Infant.Mortality predictors.

Removing the Catholic, Education, and Infant.Mortality Predictors

swiss.glm <- glm(Fertility_70 ~ Agriculture + Examination,family=binomial(link='logit'),data=swiss_df)
swiss.glm
## 
## Call:  glm(formula = Fertility_70 ~ Agriculture + Examination, family = binomial(link = "logit"), 
##     data = swiss_df)
## 
## Coefficients:
## (Intercept)  Agriculture  Examination  
##     9.69552     -0.06706     -0.37609  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  44 Residual
## Null Deviance:       65.13 
## Residual Deviance: 40.48     AIC: 46.48
summary(swiss.glm)
## 
## Call:
## glm(formula = Fertility_70 ~ Agriculture + Examination, family = binomial(link = "logit"), 
##     data = swiss_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  9.69552    3.30710   2.932 0.003371 ** 
## Agriculture -0.06706    0.03098  -2.165 0.030420 *  
## Examination -0.37609    0.11286  -3.332 0.000861 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 40.484  on 44  degrees of freedom
## AIC: 46.484
## 
## Number of Fisher Scoring iterations: 5

Interpretation

These results actually have a higher AIC than the original full model (46.484 vs 44.887). Based on this, I believe we should pick the full model as our GLM model, even though the p-values are not statistically significant. This is because the lowest AIC, the full model, shows that this is the best fit, even though each predictor independently is not significant.

Going back to the full model, which is the model of best fit:

\[ {Log}_e(\frac{p}{1-p}) = 4.828 - Agriculture_i 0.096 - Education_i 0.121 - Examination_i 0.321 + Catholic_i 0.021 + Infant.Mortality_i 0.291 \]

To find the interpretation of each of the predictors, we’ll take the log odds of each.

Taking the log odds of each coefficient result for each predictor

agriculture <- exp(-0.09615)
Education <- exp(-0.12078)
Examination <- exp(-0.32116)
Catholic <- exp(0.02078)
Infant.Mortality <- exp(0.29078)

Interpretation:

Increasing Agriculture by 1 unit decreases the odds of fertility > 70 by ~9% (1-0.91). Increasing Education by 1 unit decreases the odds of fertility > 70 by 11% (1-0.89). Increasing Examination by 1 unit decreases the odds of fertility > 70 by 27% (1-0.73). Increasing Catholic by 1 unit increases the odds of fertility > 70 by 2%. Increasing infant mortality by one unit increases the odds of fertility > 70 by 34%. The intercept is useless because zero values are not present in the data.

Source: https://medium.com/data-science/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf