01_Comparing_regression

Introduction
Comparing models using R²
Comparing models using rmse
Comparing regression models with anova()
Using variable selection methods
Conclusion

Introduction

In the following sections, we will be using the diamonds data from ggplot2 to compare two models derivable from this data set. The diamonds data set has 10 attributes and over 50,000 records.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

data("diamonds")

diamonds <- diamonds
str(diamonds)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

We will use the carat, clarity and color attributes only to predict price in our examination of comparing two models. Naturally, more attributes can be used from this data set to derive better quality models. However, since the focus is on “comparing” models, and not finding the best model, these 3 attributes will suffice.

Comparing models using R²

Though this is not always the case, a regression model with a higher R² value may be a better model. R² is defined as the amount of variation in the response variable explained by the model. In the case where all the points fall on the line and the sum of the squared of the residuals is zero, all of the variation in the response variable is explained by the model. Therefore R² is 1. Where none is explained, so that the mean value of the response variable is all there is, R² equates to zero.

Because R² will naturally increase in value with the addition of more predictors, it is often a better approach to use adjusted R² as a comparison indicator, since this accounts for the number of predictors present in the model.

Using the lm function, we examine our two models.

diamonds.mod1 <- lm(price ~ carat, data = diamonds)
summary(diamonds.mod1)

## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

diamonds.mod2 <- lm(price ~ carat + clarity + color, data = diamonds)
summary(diamonds.mod2)

## 
## Call:
## lm(formula = price ~ carat + clarity + color, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17310.9   -678.0   -192.2    473.0  10313.2 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3506.64      12.89 -272.009  < 2e-16 ***
## carat        8856.23      12.10  731.857  < 2e-16 ***
## clarity.L    4417.66      30.63  144.218  < 2e-16 ***
## clarity.Q   -1939.31      28.88  -67.150  < 2e-16 ***
## clarity.C    1009.46      24.79   40.714  < 2e-16 ***
## clarity^4    -410.68      19.87  -20.663  < 2e-16 ***
## clarity^5     242.64      16.28   14.907  < 2e-16 ***
## clarity^6     -12.13      14.19   -0.855    0.393    
## clarity^7     121.32      12.52    9.688  < 2e-16 ***
## color.L     -1916.89      17.92 -106.985  < 2e-16 ***
## color.Q      -629.34      16.31  -38.597  < 2e-16 ***
## color.C      -181.84      15.24  -11.931  < 2e-16 ***
## color^4        18.58      14.00    1.327    0.184    
## color^5       -86.19      13.23   -6.517 7.25e-11 ***
## color^6       -56.21      12.03   -4.674 2.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1170 on 53925 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9139 
## F-statistic: 4.092e+04 on 14 and 53925 DF,  p-value: < 2.2e-16

diamonds.mod1 has a R² value of ~85%, while diamonds.mod2 has a higher value of ~91%. The adjusted R² value of the latter model is also better, suggesting it is probably better suited for predicting the value of price than does the former. In intepreting the slope coefficient from diamonds.mod1, we say that a unit increase in carat value increases price by 7756. The output from the latter model is a little bit more convoluted. There is one numerical and two categorical independent variables. A unit increase in carat value increases price by 8856 after controlling for clarity and color. Similarly, each factor value of clarity or color increases price by their respective coefficients after controlling for all other variables.

Clearly, with more information, the second model is able to deliver a better prediction for price than the first.

It is widely accepted that R² is never enough as an indicator for assessing the fit of a model. An additional check which should follow is an examination of the residuals plot of the two models in order to compare how well they fit. A model’s residual plot can be easily assessed using the plot function in R. In this case, the plots still confirm the second model to be a slightly better fit than the first.

par(mfrow=c(2,2))
plot(diamonds.mod1)

par(mfrow=c(2,2))
plot(diamonds.mod2)

Comparing models using `rmse`

Next, we examine using rmse to compare the results of two models. rmse is root-mean squared error and gives a spread of the actual dependent variable around the predicted value. Though closely related to R², they are not exactly the same. While R² is the fraction of the total sum of squares explained by the model, rmse is the root mean squared of the errors, an absolute measure of fit. Lower values of rmse indicate a better fit.

Examining our previous two models, we can obtain the rmse by using the fitted values produced by the model and the actual values from the data set.

library(broom)

## Warning: package 'broom' was built under R version 3.6.3

diamonds.mod1.aug <- augment(diamonds.mod1)
head(diamonds.mod1.aug)

## # A tibble: 6 x 9
##   price carat .fitted .se.fit .resid      .hat .sigma     .cooksd .std.resid
##   <int> <dbl>   <dbl>   <dbl>  <dbl>     <dbl>  <dbl>       <dbl>      <dbl>
## 1   326 0.23  -472.     10.4    798. 0.0000452  1549. 0.00000600       0.516
## 2   326 0.21  -628.     10.6    954. 0.0000471  1549. 0.00000892       0.616
## 3   327 0.23  -472.     10.4    799. 0.0000452  1549. 0.00000602       0.516
## 4   334 0.290   -7.00    9.77   341. 0.0000398  1549. 0.000000966      0.220
## 5   335 0.31   148.      9.57   187. 0.0000382  1549. 0.000000278      0.121
## 6   336 0.24  -395.     10.3    731. 0.0000442  1549. 0.00000493       0.472

Using the rmse function from the Metrics library, we can compute the rmse of the first model.

Metrics::rmse(diamonds$price, diamonds.mod1.aug$.fitted)

## [1] 1548.533

Alternatively, we could use the original formula to obtain the rmse.

sqrt(sum((diamonds$price - diamonds.mod1.aug$.fitted)^2)/nrow(diamonds))

## [1] 1548.533

Similary, the second model produces a rmse of 1170, which is significantly lower than the result from the first model. This could suggest that the second model is indeed better than the first as we originally ascertained.

diamonds.mod2.aug <- augment(diamonds.mod2)
Metrics::rmse(diamonds$price, diamonds.mod2.aug$.fitted)

## [1] 1170.186

Comparing regression models with `anova()`

To compare the fits of two models, we could also use the anova() function with the regression objects as two separate arguments. If the resulting p-value is sufficiently low (usually less than 0.05), we could conclude that the more complex model is significantly better than the simpler model. Otherwise, it could mean that there is no significant difference between the two models.

anova(diamonds.mod1, diamonds.mod2)

## Analysis of Variance Table
## 
## Model 1: price ~ carat
## Model 2: price ~ carat + clarity + color
##   Res.Df        RSS Df  Sum of Sq    F    Pr(>F)    
## 1  53938 1.2935e+11                                 
## 2  53925 7.3862e+10 13 5.5484e+10 3116 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result shows that there is significant difference between the two models after adding those additional two terms, clarity and color. A F-test is carried out in the comparison of the two models. The resulting F-statistic lies beyond the rejection region and showing that the difference between the two models cannot be left to chance alone.

Using variable selection methods

Another approach to selecting a good model in R, or at least confirming that one model offers better prediction than the other, is using a variable selection method which descends (or ascends) step-wise in its comparison of models. stepAIC from the MASS package offers a step-wise model selection process using the Akaike Information Criterion (AIC). The AIC is a technique for estimating the likelihood of a model to predict or estimate future values. A good model is one that has minimum AIC among all the other models. Therefore, using such step-wise comparison, a set of best predictor variables can be obtained from a combination of available predictor values.

library(MASS)
null <- lm(price ~ 1, data = diamonds)
full <- lm(price ~ carat + clarity + color, data = diamonds)
stepF <- stepAIC(null, scope = list(lower = null, upper = full), direction = "forward", trace = TRUE)

## Start:  AIC=894477.9
## price ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + carat    1 7.2913e+11 1.2935e+11 792389
## + color    6 2.6849e+10 8.3162e+11 892776
## + clarity  7 2.3308e+10 8.3517e+11 893007
## <none>                  8.5847e+11 894478
## 
## Step:  AIC=792389.4
## price ~ carat
## 
##           Df  Sum of Sq        RSS    AIC
## + clarity  7 3.9082e+10 9.0264e+10 772998
## + color    6 1.2561e+10 1.1678e+11 786891
## <none>                  1.2935e+11 792389
## 
## Step:  AIC=772998.5
## price ~ carat + clarity
## 
##         Df  Sum of Sq        RSS    AIC
## + color  6 1.6402e+10 7.3862e+10 762193
## <none>                9.0264e+10 772998
## 
## Step:  AIC=762193.4
## price ~ carat + clarity + color

Here, the result shows that the best AIC is achieved from the combination of color, clarity and carat as predictor variables.

Conclusion

In this extract, we considered three different ways of choosing comparing models in order to decide which is better. However, there may be preferences given to certain models even if they are worse. Usually, specific situations, in addition to domain knowledge, will often be the deciding factor when choosing a model for implementation.

In addition to these described methods, there are more indicators which are often used to compare models in order to decide which is better or best. Some of these are basic, while others are exhaustive. In later extracts, we will be examining some of these methods.

head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

01_Comparing_regression_models

16/04/2020

Introduction

Comparing models using R²

Comparing models using `rmse`

Comparing regression models with `anova()`

Using variable selection methods

Conclusion

01_Comparing_regression_models

16/04/2020

Introduction

Comparing models using R2

Comparing models using rmse

Comparing regression models with anova()

Using variable selection methods

Conclusion

Comparing models using R²

Comparing models using `rmse`

Comparing regression models with `anova()`