Multiple Regression

I’m going to begin with the model that contains every variable in the dataset and work backwards via backward elimination. For now, the mylm1 output is in the code segment directly below and the mylm2 output is in the segment after that; I will revisit anything not discussed immediately.

mydata <- swiss
mylm1 <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = mydata)
summary(mylm1)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education + 
##     Catholic + Infant.Mortality, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
AIC(mylm1)
## [1] 326.0716
par(mfrow=c(2,2))
plot(mylm1)

We can see that every predictor except swiss$Examination is statistically significant at the 0.05 level, so let’s check whether the adjusted R-squared improves after its removal.
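
As a quick cross-check (a sketch, not part of the original output), drop1() looks at deleting each term from mylm1 one at a time; its F-test p-values match the t-tests above, and the deletion with the lowest resulting AIC is the backward-elimination candidate.

# Single-term deletions from the full model; Examination should show the
# largest p-value and the lowest AIC among the candidate deletions
drop1(mylm1, test = "F")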

mylm2 <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = mydata)
summary(mylm2)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
AIC(mylm2)
## [1] 325.2408
par(mfrow=c(2,2))
plot(mylm2)

We can see that the difference between the models is minimal: mylm1 (the model with every variable in the dataset) has a slightly higher adjusted R-squared (0.671 vs. 0.6707), while mylm2 has a slightly lower AIC (325.24 vs. 326.07). In case you glossed over it in the readings, the Akaike information criterion (AIC) is introduced on page 374 and described as “an analog to how we used adjusted R-squared in multiple regression, and we look for models with a lower AIC through a backward elimination strategy.” The diagnostic plots for the two models also look much the same. Since keeping the extra variable does not convincingly improve the fit, I am going with mylm2 (the linear model with every variable except swiss$Examination) as the best multiple regression model for Fertility.
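
For completeness, R can automate this search. The following is a minimal sketch (output omitted, and mystep is just an illustrative name): step() performs backward elimination by AIC and should land on the same four-predictor model.

# Backward elimination by AIC starting from the full model; trace = 0
# suppresses the step-by-step log, and formula() shows which terms survive
mystep <- step(mylm1, direction = "backward", trace = 0)
formula(mystep)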

Logistic Regression

I created three GLMs to compare, using the three predictors most strongly correlated with the binary fertility indicator (their correlations are fairly similar in magnitude):

Fertility ~ Examination,

Fertility ~ Examination + Catholic,

Fertility ~ Examination + Catholic + Education,

But ultimately the best fit was Fertility ~ Examination, so I removed the others.
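
For reference, here is a sketch of that comparison (the glm_a/glm_b/glm_c names are illustrative; these fits were not kept in the final write-up). AIC() lines the three candidates up side by side, with lower being better.

mydata$fertility_converted <- ifelse(mydata$Fertility > 70.0, 1, 0) # same binary cutoff as the chunk below
glm_a <- glm(fertility_converted ~ Examination, family = binomial, data = mydata)
glm_b <- glm(fertility_converted ~ Examination + Catholic, family = binomial, data = mydata)
glm_c <- glm(fertility_converted ~ Examination + Catholic + Education, family = binomial, data = mydata)
AIC(glm_a, glm_b, glm_c) # the single-predictor model held up best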

fertility_converted <- ifelse(mydata$Fertility > 70.0, 1, 0) # To convert the existing swiss$Fertility to binary for glm
cor(fertility_converted, mydata) # Starting point to determine what best to model fertility against
##      Fertility Agriculture Examination  Education  Catholic Infant.Mortality
## [1,] 0.7751995   0.1670252  -0.5648651 -0.3869862 0.4501909        0.3143046
mydata$fertility_converted <- fertility_converted # Attach the binary outcome to the data frame so the formula can use plain column names
myglm <- glm(fertility_converted ~ Examination, family = binomial(link = 'logit'), data = mydata)
summary(myglm)
## 
## Call:
## glm(formula = fertility_converted ~ Examination, family = binomial(link = "logit"), 
##     data = mydata)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  3.62761    1.16430   3.116  0.00184 **
## Examination -0.22067    0.06907  -3.195  0.00140 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 46.851  on 45  degrees of freedom
## AIC: 50.851
## 
## Number of Fisher Scoring iterations: 5

We can see that Examination is a significant predictor (p ≈ 0.0014), with a negative coefficient: higher examination scores are associated with lower odds of high fertility.
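
To put that coefficient on a more interpretable scale (a small sketch, not in the original output), exponentiating converts it from log-odds to an odds ratio.

# exp(-0.22067) is roughly 0.80: each one-point increase in Examination
# multiplies the odds of "high fertility" (Fertility > 70) by about 0.8
exp(coef(myglm))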

mypred <- predict(myglm, mydata, type = 'response')
mypred[mypred < 0.5] <- 0 # predict(type = 'response') returns probabilities of class "1"; the 70-point fertility cutoff was already applied when creating fertility_converted, so we threshold these probabilities at 0.5
mypred[mypred >= 0.5] <- 1
table(mypred, fertility_converted)
##       fertility_converted
## mypred  0  1
##      0 16  4
##      1  7 20

To interpret this table: the rows are the predicted classes (mypred) and the columns are the actual classes. The overall accuracy is (16 + 20) / 47 ≈ 76.6%, where 16 is the count of correctly predicted 0s (true negatives), 20 is the count of correctly predicted 1s (true positives), and 47 is the total number of observations.
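
The same arithmetic can be pulled straight from the table; a small sketch (mytable is an illustrative name):

mytable <- table(mypred, fertility_converted)
sum(diag(mytable)) / sum(mytable) # correct predictions sit on the diagonal: (16 + 20) / 47, about 0.766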