The first thing we should do when determing the best multiple regression model is try all the variables together in a linear combination.
lm_all <- lm(Fertility~., data = swiss)
summary(lm_all)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
par(mfrow=c(2,2))
plot(lm_all)
par(mfrow=c(1,1))
We can see that the p-value of this model is very low, making it statistically significant. The multiple R-squared value is fairly high as well, though the adjusted r-squared value being lower suggests that one or more of our predictors are irrelevant. Looking at the Pr(>|t|) column, we can see that Examination is the least significant variable. Agriculture isn’t particularly significant either compared to the others, so we should consider removing at least one of them from the model.
Looking at the residuals plots, it looks like several of the Gauss-Markov assumptions are being met. We should try a few more models, though, to determine if we can make a better one. Before doing that, though, we should check for multicolinearity.
library(car)
## Loading required package: carData
vif(lm_all)
## Agriculture Examination Education Catholic
## 2.284129 3.675420 2.774943 1.937160
## Infant.Mortality
## 1.107542
Both Examination and Education have high VIF, suggesting that these two are strongly correlated. This means we should remove one of them from the model. Since we saw before that Examination was the least significant in the model that used all variables, let’s get rid of it for the next model while keeping all the others in a linear combination.
lm_reduced <- lm(Fertility~Agriculture+Education+Catholic+Infant.Mortality, data = swiss)
summary(lm_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
par(mfrow=c(2,2))
plot(lm_reduced)
par(mfrow=c(1,1))
vif(lm_reduced)
## Agriculture Education Catholic Infant.Mortality
## 2.147153 1.816361 1.299916 1.107528
Once again the p-value is very low, indicating statistical significance. We can see that the multiple R-squared value has been reduced slightly but the adjusted R-squared value has barely been affected. This means that Examination was a redundant variable. Looking at the rightmost column, we can see that Education and Catholic are the most significant variables. Agriculture is once again among the least significant.
Looking at the plots, we see that the Residuals vs Fitted chart looks about the same, the Normal Q-Q plot looks a little worse, the Scale-Location chart looks about the same, as does the Residuals vs Leverage chart.
The VIF values have improved across the board. There is evidence suggesting that this model is a better representation of the data, though the slight changes in the plots leave me wondering if we can do better. Let’s try a new model that removes Agriculture and keeps the other variables in a linear combination.
lm_more_reduced <- lm(Fertility~Education+Catholic+Infant.Mortality, data = swiss)
summary(lm_more_reduced)
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality,
## data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4781 -5.4403 -0.5143 4.1568 15.1187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.67707 7.91908 6.147 2.24e-07 ***
## Education -0.75925 0.11680 -6.501 6.83e-08 ***
## Catholic 0.09607 0.02722 3.530 0.00101 **
## Infant.Mortality 1.29615 0.38699 3.349 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.505 on 43 degrees of freedom
## Multiple R-squared: 0.6625, Adjusted R-squared: 0.639
## F-statistic: 28.14 on 3 and 43 DF, p-value: 3.15e-10
par(mfrow=c(2,2))
plot(lm_more_reduced)
par(mfrow=c(1,1))
vif(lm_more_reduced)
## Education Catholic Infant.Mortality
## 1.029939 1.052185 1.037512
Once again, a very low p-value indicates statistical significance. Both the multiple R-squared and adjusted R-squared values have been lowered since the last model, but all the variables look significant.
Looking at the plots, the Residuals vs Fitted chart looks more linear, the points on the Normal Q-Q chart adhere more closely to the line, the Scale-Location chart still looks messy, and the Residuals vs Leverage chart looks good as well since no points are beyond the lines for Cook’s distance. Overall, I would say that this model more closely adheres to the Gauss-Markov assumptions.
With low VIF values across the board as well, it looks like this model has low multicolinearity. While the R-squared values have suffered, this looks like the best model so far. I’m curious to know if a transformation of some kind would be helpful. Instead of trying some out, though, let’s do a Box-Cox test to see if one might help.
library(MASS)
boxcox(lm_more_reduced)
Looking at this chart, the \(\lambda\) value of the peak lies between 0 and 1, though it looks closer to 1 than it does to 0.5. This suggests that no transformation should be necessary, though it won’t hurt to see how a square root transformation on the dependent variable affects the model.
lm_root <- lm(sqrt(Fertility)~Education+Catholic+Infant.Mortality, data = swiss)
summary(lm_root)
##
## Call:
## lm(formula = sqrt(Fertility) ~ Education + Catholic + Infant.Mortality,
## data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9272 -0.3133 -0.0159 0.2631 0.8454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.081726 0.477152 14.842 < 2e-16 ***
## Education -0.050452 0.007037 -7.169 7.33e-09 ***
## Catholic 0.005245 0.001640 3.198 0.00260 **
## Infant.Mortality 0.080042 0.023317 3.433 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4522 on 43 degrees of freedom
## Multiple R-squared: 0.6823, Adjusted R-squared: 0.6601
## F-statistic: 30.78 on 3 and 43 DF, p-value: 8.73e-11
par(mfrow=c(2,2))
plot(lm_root)
par(mfrow=c(1,1))
vif(lm_root)
## Education Catholic Infant.Mortality
## 1.029939 1.052185 1.037512
I’m surprised to see how similar this looks to the more reduced model. The p-value is incredibly small, and both R-squared values are slightly higher than in the more reduced model. The variables are all still highly significant.
Looking at the plots, the Residuals vs Fitted chart looks more linear than the one for the more reduced model, the points on the Normal Q-Q chart stick more closely to the line, the Scale-Location chart looks slightly more linear, and the Residuals vs Leverage chart looks about the same.
We can also see that there is very little multicolinearity in this model. Looking at all of this evidence, I could make the argument that my last model is the best option. However, it is difficult to interpret the results of a square root model since the model is non-linear. Unlike a log transformation, where unit increases in any of the predictor variables can be understood as a percentage increase in the dependent variable, in a square root transformation a unit increase in any of the predictor variables will have a greater impact on the dependent variable when the dependent variable is larger (and, conversely, a smaller impact when the dependent variable is smaller).
When looking at the more reduced model, for comparison, we see that the charts are about the same, the VIF values are all low, and the relevant statistics and coefficients are all good as well. More importantly, we immediately understand the impact of the predictor variables on the dependent variable. A one unit increase in Education is associated with a 0.759 unit decrease in Fertility, holding other variables constant. A one unit increase in Catholic population is associated with a 0.096 unit increase in Fertility, holding other variables constant. A one unit increase in Infant.Mortality is associated with a 1.296 unit increase in Fertility, holding other values constant. Because of all of this, I would say that the best multivaraite model for the swiss dataset is: \(lm(Fertility\sim Education+Catholic+Infant.Mortality, data = swiss)\)
To do a logistic regression, we need to create a binary variable based on Fertility.
swiss$HighFertility <- ifelse(swiss$Fertility > 70, 1, 0)
table(swiss$HighFertility)
##
## 0 1
## 23 24
We can see that there is a roughly even split in the data. Now we can fit this new variable to the same combination of variables we selected earlier.
lm_logit <- glm(
HighFertility ~ Education + Catholic + Infant.Mortality,
data = swiss,
family = binomial
)
summary(lm_logit)
##
## Call:
## glm(formula = HighFertility ~ Education + Catholic + Infant.Mortality,
## family = binomial, data = swiss)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.363008 3.390464 -1.582 0.1137
## Education -0.119747 0.060788 -1.970 0.0488 *
## Catholic 0.022667 0.009722 2.332 0.0197 *
## Infant.Mortality 0.289250 0.166745 1.735 0.0828 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.135 on 46 degrees of freedom
## Residual deviance: 44.688 on 43 degrees of freedom
## AIC: 52.688
##
## Number of Fisher Scoring iterations: 5
exp(coef(lm_logit))
## (Intercept) Education Catholic Infant.Mortality
## 0.004686785 0.887144918 1.022925795 1.335425006
Holding other variables constant, a one percentage-point increase in Education multiplies the odds of Fertility > 70 by 0.89 (an 11% decrease in odds). This matches the negative value the Education coefficients had in the linear model. On the other hand, a one percentage-point increase in Catholic population is increases the odds of Fertility > 70 by 2.3% holding other variables constant. Similarly, a one percentage-point increase in Infant Mortality increases the odds of Fertility > 70 by 33.5% holding other variables equal.
We can now use this model to predict the probability that each province will have a popultion with Fertility > 70. From there, we can check the accuracy of our predictive model by classifying each province based on if it is more likely (probability greater than 50%) to have high fertility.
pred_prob <- predict(lm_logit, type="response")
pred_prob
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 0.461857337 0.870306802 0.880607092 0.607303778 0.252804737 0.971961779
## Broye Glane Gruyere Sarine Veveyse Aigle
## 0.938755424 0.956213338 0.889621962 0.901079535 0.962323455 0.137735040
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 0.348618890 0.466651814 0.380163401 0.746139824 0.386703345 0.069229620
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 0.010097512 0.356321804 0.225231282 0.702534540 0.164334332 0.173536198
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 0.656082998 0.664119078 0.387524707 0.158328157 0.236051282 0.580639837
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 0.736001363 0.870466913 0.876257820 0.854839458 0.912690718 0.721778333
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 0.776775585 0.624933002 0.309850902 0.392210331 0.231728297 0.103526174
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 0.424680070 0.409797972 0.003898007 0.081045007 0.126641146
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
mean(pred_class == swiss$HighFertility)
## [1] 0.7446809
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
confusionMatrix(as.factor(pred_class),
as.factor(swiss$HighFertility))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18 7
## 1 5 17
##
## Accuracy : 0.7447
## 95% CI : (0.5965, 0.8606)
## No Information Rate : 0.5106
## P-Value [Acc > NIR] : 0.0008955
##
## Kappa : 0.4901
##
## Mcnemar's Test P-Value : 0.7728300
##
## Sensitivity : 0.7826
## Specificity : 0.7083
## Pos Pred Value : 0.7200
## Neg Pred Value : 0.7727
## Prevalence : 0.4894
## Detection Rate : 0.3830
## Detection Prevalence : 0.5319
## Balanced Accuracy : 0.7455
##
## 'Positive' Class : 0
##
The model correctly predicted fertility status about three-quarters of the time. It also correctly identified 78% of the high-fertility provinces and 71% of the low-fertility provinces.