Multivariate Regression

The first thing we should do when determing the best multiple regression model is try all the variables together in a linear combination.

lm_all <- lm(Fertility~., data = swiss)
summary(lm_all)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
par(mfrow=c(2,2))
plot(lm_all)

par(mfrow=c(1,1))

We can see that the p-value of this model is very low, making it statistically significant. The multiple R-squared value is fairly high as well, though the adjusted r-squared value being lower suggests that one or more of our predictors are irrelevant. Looking at the Pr(>|t|) column, we can see that Examination is the least significant variable. Agriculture isn’t particularly significant either compared to the others, so we should consider removing at least one of them from the model.

Looking at the residuals plots, it looks like several of the Gauss-Markov assumptions are being met. We should try a few more models, though, to determine if we can make a better one. Before doing that, though, we should check for multicolinearity.

library(car)
## Loading required package: carData
vif(lm_all)
##      Agriculture      Examination        Education         Catholic 
##         2.284129         3.675420         2.774943         1.937160 
## Infant.Mortality 
##         1.107542

Both Examination and Education have high VIF, suggesting that these two are strongly correlated. This means we should remove one of them from the model. Since we saw before that Examination was the least significant in the model that used all variables, let’s get rid of it for the next model while keeping all the others in a linear combination.

lm_reduced <- lm(Fertility~Agriculture+Education+Catholic+Infant.Mortality, data = swiss)
summary(lm_reduced)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
par(mfrow=c(2,2))
plot(lm_reduced)

par(mfrow=c(1,1))
vif(lm_reduced)
##      Agriculture        Education         Catholic Infant.Mortality 
##         2.147153         1.816361         1.299916         1.107528

Once again the p-value is very low, indicating statistical significance. We can see that the multiple R-squared value has been reduced slightly but the adjusted R-squared value has barely been affected. This means that Examination was a redundant variable. Looking at the rightmost column, we can see that Education and Catholic are the most significant variables. Agriculture is once again among the least significant.

Looking at the plots, we see that the Residuals vs Fitted chart looks about the same, the Normal Q-Q plot looks a little worse, the Scale-Location chart looks about the same, as does the Residuals vs Leverage chart.

The VIF values have improved across the board. There is evidence suggesting that this model is a better representation of the data, though the slight changes in the plots leave me wondering if we can do better. Let’s try a new model that removes Agriculture and keeps the other variables in a linear combination.

lm_more_reduced <- lm(Fertility~Education+Catholic+Infant.Mortality, data = swiss)
summary(lm_more_reduced)
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality, 
##     data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4781  -5.4403  -0.5143   4.1568  15.1187 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      48.67707    7.91908   6.147 2.24e-07 ***
## Education        -0.75925    0.11680  -6.501 6.83e-08 ***
## Catholic          0.09607    0.02722   3.530  0.00101 ** 
## Infant.Mortality  1.29615    0.38699   3.349  0.00169 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.505 on 43 degrees of freedom
## Multiple R-squared:  0.6625, Adjusted R-squared:  0.639 
## F-statistic: 28.14 on 3 and 43 DF,  p-value: 3.15e-10
par(mfrow=c(2,2))
plot(lm_more_reduced)

par(mfrow=c(1,1))
vif(lm_more_reduced)
##        Education         Catholic Infant.Mortality 
##         1.029939         1.052185         1.037512

Once again, a very low p-value indicates statistical significance. Both the multiple R-squared and adjusted R-squared values have been lowered since the last model, but all the variables look significant.

Looking at the plots, the Residuals vs Fitted chart looks more linear, the points on the Normal Q-Q chart adhere more closely to the line, the Scale-Location chart still looks messy, and the Residuals vs Leverage chart looks good as well since no points are beyond the lines for Cook’s distance. Overall, I would say that this model more closely adheres to the Gauss-Markov assumptions.

With low VIF values across the board as well, it looks like this model has low multicolinearity. While the R-squared values have suffered, this looks like the best model so far. I’m curious to know if a transformation of some kind would be helpful. Instead of trying some out, though, let’s do a Box-Cox test to see if one might help.

library(MASS)
boxcox(lm_more_reduced)

Looking at this chart, the \(\lambda\) value of the peak lies between 0 and 1, though it looks closer to 1 than it does to 0.5. This suggests that no transformation should be necessary, though it won’t hurt to see how a square root transformation on the dependent variable affects the model.

lm_root <- lm(sqrt(Fertility)~Education+Catholic+Infant.Mortality, data = swiss)
summary(lm_root)
## 
## Call:
## lm(formula = sqrt(Fertility) ~ Education + Catholic + Infant.Mortality, 
##     data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9272 -0.3133 -0.0159  0.2631  0.8454 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.081726   0.477152  14.842  < 2e-16 ***
## Education        -0.050452   0.007037  -7.169 7.33e-09 ***
## Catholic          0.005245   0.001640   3.198  0.00260 ** 
## Infant.Mortality  0.080042   0.023317   3.433  0.00133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4522 on 43 degrees of freedom
## Multiple R-squared:  0.6823, Adjusted R-squared:  0.6601 
## F-statistic: 30.78 on 3 and 43 DF,  p-value: 8.73e-11
par(mfrow=c(2,2))
plot(lm_root)

par(mfrow=c(1,1))
vif(lm_root)
##        Education         Catholic Infant.Mortality 
##         1.029939         1.052185         1.037512

I’m surprised to see how similar this looks to the more reduced model. The p-value is incredibly small, and both R-squared values are slightly higher than in the more reduced model. The variables are all still highly significant.

Looking at the plots, the Residuals vs Fitted chart looks more linear than the one for the more reduced model, the points on the Normal Q-Q chart stick more closely to the line, the Scale-Location chart looks slightly more linear, and the Residuals vs Leverage chart looks about the same.

We can also see that there is very little multicolinearity in this model. Looking at all of this evidence, I could make the argument that my last model is the best option. However, it is difficult to interpret the results of a square root model since the model is non-linear. Unlike a log transformation, where unit increases in any of the predictor variables can be understood as a percentage increase in the dependent variable, in a square root transformation a unit increase in any of the predictor variables will have a greater impact on the dependent variable when the dependent variable is larger (and, conversely, a smaller impact when the dependent variable is smaller).

When looking at the more reduced model, for comparison, we see that the charts are about the same, the VIF values are all low, and the relevant statistics and coefficients are all good as well. More importantly, we immediately understand the impact of the predictor variables on the dependent variable. A one unit increase in Education is associated with a 0.759 unit decrease in Fertility, holding other variables constant. A one unit increase in Catholic population is associated with a 0.096 unit increase in Fertility, holding other variables constant. A one unit increase in Infant.Mortality is associated with a 1.296 unit increase in Fertility, holding other values constant. Because of all of this, I would say that the best multivaraite model for the swiss dataset is: \(lm(Fertility\sim Education+Catholic+Infant.Mortality, data = swiss)\)

Logistic Regression

To do a logistic regression, we need to create a binary variable based on Fertility.

swiss$HighFertility <- ifelse(swiss$Fertility > 70, 1, 0)
table(swiss$HighFertility)
## 
##  0  1 
## 23 24

We can see that there is a roughly even split in the data. Now we can fit this new variable to the same combination of variables we selected earlier.

lm_logit <- glm(
  HighFertility ~ Education + Catholic + Infant.Mortality,
  data = swiss,
  family = binomial
)
summary(lm_logit)
## 
## Call:
## glm(formula = HighFertility ~ Education + Catholic + Infant.Mortality, 
##     family = binomial, data = swiss)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)  
## (Intercept)      -5.363008   3.390464  -1.582   0.1137  
## Education        -0.119747   0.060788  -1.970   0.0488 *
## Catholic          0.022667   0.009722   2.332   0.0197 *
## Infant.Mortality  0.289250   0.166745   1.735   0.0828 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 44.688  on 43  degrees of freedom
## AIC: 52.688
## 
## Number of Fisher Scoring iterations: 5
exp(coef(lm_logit))
##      (Intercept)        Education         Catholic Infant.Mortality 
##      0.004686785      0.887144918      1.022925795      1.335425006

Holding other variables constant, a one percentage-point increase in Education multiplies the odds of Fertility > 70 by 0.89 (an 11% decrease in odds). This matches the negative value the Education coefficients had in the linear model. On the other hand, a one percentage-point increase in Catholic population is increases the odds of Fertility > 70 by 2.3% holding other variables constant. Similarly, a one percentage-point increase in Infant Mortality increases the odds of Fertility > 70 by 33.5% holding other variables equal.

We can now use this model to predict the probability that each province will have a popultion with Fertility > 70. From there, we can check the accuracy of our predictive model by classifying each province based on if it is more likely (probability greater than 50%) to have high fertility.

pred_prob <- predict(lm_logit, type="response")
pred_prob
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville   Porrentruy 
##  0.461857337  0.870306802  0.880607092  0.607303778  0.252804737  0.971961779 
##        Broye        Glane      Gruyere       Sarine      Veveyse        Aigle 
##  0.938755424  0.956213338  0.889621962  0.901079535  0.962323455  0.137735040 
##      Aubonne     Avenches     Cossonay    Echallens     Grandson     Lausanne 
##  0.348618890  0.466651814  0.380163401  0.746139824  0.386703345  0.069229620 
##    La Vallee       Lavaux       Morges       Moudon        Nyone         Orbe 
##  0.010097512  0.356321804  0.225231282  0.702534540  0.164334332  0.173536198 
##         Oron      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
##  0.656082998  0.664119078  0.387524707  0.158328157  0.236051282  0.580639837 
##      Conthey    Entremont       Herens     Martigwy      Monthey   St Maurice 
##  0.736001363  0.870466913  0.876257820  0.854839458  0.912690718  0.721778333 
##       Sierre         Sion       Boudry La Chauxdfnd     Le Locle    Neuchatel 
##  0.776775585  0.624933002  0.309850902  0.392210331  0.231728297  0.103526174 
##   Val de Ruz ValdeTravers V. De Geneve  Rive Droite  Rive Gauche 
##  0.424680070  0.409797972  0.003898007  0.081045007  0.126641146
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
mean(pred_class == swiss$HighFertility)
## [1] 0.7446809
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
confusionMatrix(as.factor(pred_class),
                as.factor(swiss$HighFertility))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  7
##          1  5 17
##                                           
##                Accuracy : 0.7447          
##                  95% CI : (0.5965, 0.8606)
##     No Information Rate : 0.5106          
##     P-Value [Acc > NIR] : 0.0008955       
##                                           
##                   Kappa : 0.4901          
##                                           
##  Mcnemar's Test P-Value : 0.7728300       
##                                           
##             Sensitivity : 0.7826          
##             Specificity : 0.7083          
##          Pos Pred Value : 0.7200          
##          Neg Pred Value : 0.7727          
##              Prevalence : 0.4894          
##          Detection Rate : 0.3830          
##    Detection Prevalence : 0.5319          
##       Balanced Accuracy : 0.7455          
##                                           
##        'Positive' Class : 0               
## 

The model correctly predicted fertility status about three-quarters of the time. It also correctly identified 78% of the high-fertility provinces and 71% of the low-fertility provinces.