anova()In the following sections, we will be using the diamonds data from ggplot2 to compare two models derivable from this data set. The diamonds data set has 10 attributes and over 50,000 records.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
data("diamonds")
diamonds <- diamonds
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
We will use the carat, clarity and color attributes only to predict price in our examination of comparing two models. Naturally, more attributes can be used from this data set to derive better quality models. However, since the focus is on “comparing” models, and not finding the best model, these 3 attributes will suffice.
Though this is not always the case, a regression model with a higher R2 value may be a better model. R2 is defined as the amount of variation in the response variable explained by the model. In the case where all the points fall on the line and the sum of the squared of the residuals is zero, all of the variation in the response variable is explained by the model. Therefore R2 is 1. Where none is explained, so that the mean value of the response variable is all there is, R2 equates to zero.
Because R2 will naturally increase in value with the addition of more predictors, it is often a better approach to use adjusted R2 as a comparison indicator, since this accounts for the number of predictors present in the model.
Using the lm function, we examine our two models.
diamonds.mod1 <- lm(price ~ carat, data = diamonds)
summary(diamonds.mod1)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
diamonds.mod2 <- lm(price ~ carat + clarity + color, data = diamonds)
summary(diamonds.mod2)
##
## Call:
## lm(formula = price ~ carat + clarity + color, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17310.9 -678.0 -192.2 473.0 10313.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3506.64 12.89 -272.009 < 2e-16 ***
## carat 8856.23 12.10 731.857 < 2e-16 ***
## clarity.L 4417.66 30.63 144.218 < 2e-16 ***
## clarity.Q -1939.31 28.88 -67.150 < 2e-16 ***
## clarity.C 1009.46 24.79 40.714 < 2e-16 ***
## clarity^4 -410.68 19.87 -20.663 < 2e-16 ***
## clarity^5 242.64 16.28 14.907 < 2e-16 ***
## clarity^6 -12.13 14.19 -0.855 0.393
## clarity^7 121.32 12.52 9.688 < 2e-16 ***
## color.L -1916.89 17.92 -106.985 < 2e-16 ***
## color.Q -629.34 16.31 -38.597 < 2e-16 ***
## color.C -181.84 15.24 -11.931 < 2e-16 ***
## color^4 18.58 14.00 1.327 0.184
## color^5 -86.19 13.23 -6.517 7.25e-11 ***
## color^6 -56.21 12.03 -4.674 2.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1170 on 53925 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.9139
## F-statistic: 4.092e+04 on 14 and 53925 DF, p-value: < 2.2e-16
diamonds.mod1 has a R2 value of ~85%, while diamonds.mod2 has a higher value of ~91%. The adjusted R2 value of the latter model is also better, suggesting it is probably better suited for predicting the value of price than does the former. In intepreting the slope coefficient from diamonds.mod1, we say that a unit increase in carat value increases price by 7756. The output from the latter model is a little bit more convoluted. There is one numerical and two categorical independent variables. A unit increase in carat value increases price by 8856 after controlling for clarity and color. Similarly, each factor value of clarity or color increases price by their respective coefficients after controlling for all other variables.
Clearly, with more information, the second model is able to deliver a better prediction for price than the first.
It is widely accepted that R2 is never enough as an indicator for assessing the fit of a model. An additional check which should follow is an examination of the residuals plot of the two models in order to compare how well they fit. A model’s residual plot can be easily assessed using the plot function in R. In this case, the plots still confirm the second model to be a slightly better fit than the first.
par(mfrow=c(2,2))
plot(diamonds.mod1)
par(mfrow=c(2,2))
plot(diamonds.mod2)
rmseNext, we examine using rmse to compare the results of two models. rmse is root-mean squared error and gives a spread of the actual dependent variable around the predicted value. Though closely related to R2, they are not exactly the same. While R2 is the fraction of the total sum of squares explained by the model, rmse is the root mean squared of the errors, an absolute measure of fit. Lower values of rmse indicate a better fit.
Examining our previous two models, we can obtain the rmse by using the fitted values produced by the model and the actual values from the data set.
library(broom)
## Warning: package 'broom' was built under R version 3.6.3
diamonds.mod1.aug <- augment(diamonds.mod1)
head(diamonds.mod1.aug)
## # A tibble: 6 x 9
## price carat .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 326 0.23 -472. 10.4 798. 0.0000452 1549. 0.00000600 0.516
## 2 326 0.21 -628. 10.6 954. 0.0000471 1549. 0.00000892 0.616
## 3 327 0.23 -472. 10.4 799. 0.0000452 1549. 0.00000602 0.516
## 4 334 0.290 -7.00 9.77 341. 0.0000398 1549. 0.000000966 0.220
## 5 335 0.31 148. 9.57 187. 0.0000382 1549. 0.000000278 0.121
## 6 336 0.24 -395. 10.3 731. 0.0000442 1549. 0.00000493 0.472
Using the rmse function from the Metrics library, we can compute the rmse of the first model.
Metrics::rmse(diamonds$price, diamonds.mod1.aug$.fitted)
## [1] 1548.533
Alternatively, we could use the original formula to obtain the rmse.
sqrt(sum((diamonds$price - diamonds.mod1.aug$.fitted)^2)/nrow(diamonds))
## [1] 1548.533
Similary, the second model produces a rmse of 1170, which is significantly lower than the result from the first model. This could suggest that the second model is indeed better than the first as we originally ascertained.
diamonds.mod2.aug <- augment(diamonds.mod2)
Metrics::rmse(diamonds$price, diamonds.mod2.aug$.fitted)
## [1] 1170.186
anova()To compare the fits of two models, we could also use the anova() function with the regression objects as two separate arguments. If the resulting p-value is sufficiently low (usually less than 0.05), we could conclude that the more complex model is significantly better than the simpler model. Otherwise, it could mean that there is no significant difference between the two models.
anova(diamonds.mod1, diamonds.mod2)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + clarity + color
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 53938 1.2935e+11
## 2 53925 7.3862e+10 13 5.5484e+10 3116 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows that there is significant difference between the two models after adding those additional two terms, clarity and color. A F-test is carried out in the comparison of the two models. The resulting F-statistic lies beyond the rejection region and showing that the difference between the two models cannot be left to chance alone.
Another approach to selecting a good model in R, or at least confirming that one model offers better prediction than the other, is using a variable selection method which descends (or ascends) step-wise in its comparison of models. stepAIC from the MASS package offers a step-wise model selection process using the Akaike Information Criterion (AIC). The AIC is a technique for estimating the likelihood of a model to predict or estimate future values. A good model is one that has minimum AIC among all the other models. Therefore, using such step-wise comparison, a set of best predictor variables can be obtained from a combination of available predictor values.
library(MASS)
null <- lm(price ~ 1, data = diamonds)
full <- lm(price ~ carat + clarity + color, data = diamonds)
stepF <- stepAIC(null, scope = list(lower = null, upper = full), direction = "forward", trace = TRUE)
## Start: AIC=894477.9
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + carat 1 7.2913e+11 1.2935e+11 792389
## + color 6 2.6849e+10 8.3162e+11 892776
## + clarity 7 2.3308e+10 8.3517e+11 893007
## <none> 8.5847e+11 894478
##
## Step: AIC=792389.4
## price ~ carat
##
## Df Sum of Sq RSS AIC
## + clarity 7 3.9082e+10 9.0264e+10 772998
## + color 6 1.2561e+10 1.1678e+11 786891
## <none> 1.2935e+11 792389
##
## Step: AIC=772998.5
## price ~ carat + clarity
##
## Df Sum of Sq RSS AIC
## + color 6 1.6402e+10 7.3862e+10 762193
## <none> 9.0264e+10 772998
##
## Step: AIC=762193.4
## price ~ carat + clarity + color
Here, the result shows that the best AIC is achieved from the combination of color, clarity and carat as predictor variables.
In this extract, we considered three different ways of choosing comparing models in order to decide which is better. However, there may be preferences given to certain models even if they are worse. Usually, specific situations, in addition to domain knowledge, will often be the deciding factor when choosing a model for implementation.
In addition to these described methods, there are more indicators which are often used to compare models in order to decide which is better or best. Some of these are basic, while others are exhaustive. In later extracts, we will be examining some of these methods.
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48