6.3 Example: Predicting the Price of Used Toyota Corolla Cars

A large Toyota car dealership offers purchasers of new Toyota cars the option to trade in their used car. In particular, a new promotion promises to pay high prices for used Toyota Corollas traded in by purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on each car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1, and a sample of the dataset is shown in Table 6.2. We use the first 1000 cars from the dataset ToyotaCorolla.csv.

After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model with price as the outcome variable and the other variables as predictors, using only the training set. Table 6.3 shows the estimated coefficients. Notice that the Fuel_Type predictor has three categories (Petrol, Diesel, and CNG), so the model contains two dummy variables: Fuel_TypePetrol (0/1) and Fuel_TypeDiesel (0/1); a third dummy for CNG (0/1) would be redundant given the first two. Including the redundant dummy would make the regression unestimable, since it is a perfect linear combination of the intercept and the other two dummies; R’s lm() routine handles this issue automatically by dropping one level as the reference category.

# load and look at your data!
library(dplyr)  # provides glimpse()
toyota.corolla.df <- read.csv("~/Downloads/ToyotaCorolla.csv")

car.df <- toyota.corolla.df
glimpse(car.df)
## Observations: 1,436
## Variables: 39
## $ Id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Model             <fct> TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,…
## $ Price             <int> 13500, 13750, 13950, 14950, 13750, 12950, 1690…
## $ Age_08_04         <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22…
## $ Mfg_Month         <int> 10, 10, 9, 7, 3, 1, 6, 3, 6, 10, 8, 11, 8, 2, …
## $ Mfg_Year          <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002…
## $ KM                <int> 46986, 72937, 41711, 48000, 38500, 61000, 9461…
## $ Fuel_Type         <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel…
## $ HP                <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, …
## $ Met_Color         <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ Color             <fct> Blue, Silver, Blue, Black, Black, White, Grey,…
## $ Automatic         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CC                <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000…
## $ Doors             <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ Cylinders         <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ Gears             <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6…
## $ Quarterly_Tax     <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 1…
## $ Weight            <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245…
## $ Mfr_Guarantee     <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0…
## $ BOVAG_Guarantee   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Guarantee_Period  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 12, 3, 3, 3, 3, …
## $ ABS               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Airbag_1          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Airbag_2          <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
## $ Airco             <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Automatic_airco   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1…
## $ Boardcomputer     <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ CD_Player         <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0…
## $ Central_Lock      <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Powered_Windows   <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Power_Steering    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Radio             <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Mistlamps         <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Sport_Model       <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Backseat_Divider  <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ Metallic_Rim      <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Radio_cassette    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Parking_Assistant <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Tow_Bar           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
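
As discussed above, lm() encodes the three-level Fuel_Type factor with two dummy variables and drops one level as the reference. A quick way to preview that coding, assuming Fuel_Type was read in as a factor (as the glimpse() output above shows), is to build the model matrix directly:

# preview the dummy coding lm() will use for Fuel_Type
levels(car.df$Fuel_Type)                        # CNG, Diesel, Petrol
head(model.matrix(~ Fuel_Type, data = car.df))  # CNG is the omitted reference level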

Select the data to use
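
One way to build the reduced dataset is to subset rows and columns directly; a minimal sketch, keeping the first 1,000 records and the 11 variables used in the model (the column list is inferred from the glimpse() output below):

# keep the first 1,000 records and the 11 modeling variables
selected.vars <- c("Price", "Age_08_04", "KM", "Fuel_Type", "HP", "Met_Color",
                   "Automatic", "CC", "Doors", "Quarterly_Tax", "Weight")
car.df <- toyota.corolla.df[1:1000, selected.vars]
glimpse(car.df)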

## Observations: 1,000
## Variables: 11
## $ Price         <int> 13500, 13750, 13950, 14950, 13750, 12950, 16900, 1…
## $ Age_08_04     <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22, 25…
## $ KM            <int> 46986, 72937, 41711, 48000, 38500, 61000, 94612, 7…
## $ Fuel_Type     <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel, Di…
## $ HP            <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, 192,…
## $ Met_Color     <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ Automatic     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CC            <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 18…
## $ Doors         <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Quarterly_Tax <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 185, …
## $ Weight        <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245, 11…

Create training and validation datasets

set.seed(1)  # set seed for reproducing the partition
train.index <- sample(c(1:1000), 600)  # randomly sample 600 row indices for training
train.df <- car.df[train.index, ]
valid.df <- car.df[-train.index, ]  # the remaining 400 records form the validation set
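
A quick sanity check on the partition sizes:

# optional check: the split should be 600 training and 400 validation records
dim(train.df)  # 600 rows, 11 columns
dim(valid.df)  # 400 rows, 11 columns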

Run regression

# use lm() to run a linear regression of Price on the 10 remaining predictors
# in the training set.
# the . after ~ includes all other columns of train.df as predictors.
car.lm <- lm(Price ~ ., data = train.df)

# optional: uncomment the next line to keep the output out of scientific notation.
# options(scipen = 999)
summary(car.lm)
## 
## Call:
## lm(formula = Price ~ ., data = train.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9781.2  -729.9     0.9   739.3  6912.9 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.754e+03  1.662e+03  -2.861 0.004372 ** 
## Age_08_04       -1.333e+02  4.902e+00 -27.187  < 2e-16 ***
## KM              -2.099e-02  2.304e-03  -9.111  < 2e-16 ***
## Fuel_TypeDiesel  8.962e+02  6.032e+02   1.486 0.137857    
## Fuel_TypePetrol  2.191e+03  5.756e+02   3.807 0.000155 ***
## HP               3.726e+01  5.233e+00   7.119 3.17e-12 ***
## Met_Color        5.132e+01  1.234e+02   0.416 0.677664    
## Automatic        6.357e+01  2.623e+02   0.242 0.808583    
## CC               1.075e-02  9.771e-02   0.110 0.912456    
## Doors           -5.570e+01  6.397e+01  -0.871 0.384230    
## Quarterly_Tax    1.308e+01  2.608e+00   5.015 7.05e-07 ***
## Weight           1.622e+01  1.527e+00  10.622  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1392 on 588 degrees of freedom
## Multiple R-squared:  0.8703, Adjusted R-squared:  0.8679 
## F-statistic: 358.7 on 11 and 588 DF,  p-value: < 2.2e-16
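
Beyond the point estimates, confint() gives confidence intervals for each coefficient; a quick sketch using the fitted model:

# 95% confidence intervals for the regression coefficients
round(confint(car.lm, level = 0.95), 2)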

Run predictions

library(forecast)  # accuracy() comes from the forecast package
# use predict() to make predictions on a new set. 
car.lm.pred <- predict(car.lm, valid.df)

options(scipen = 999, digits = 3)
# use accuracy() to compute common accuracy measures.
accuracy(car.lm.pred, valid.df$Price)
##            ME RMSE  MAE   MPE MAPE
## Test set 19.6 1325 1049 -0.75 9.35
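
The measures reported by accuracy() can also be computed by hand, which makes their definitions explicit. A sketch, using errors defined as actual minus predicted (the results should match the table above up to rounding):

# recompute the accuracy measures from the validation-set errors
err  <- valid.df$Price - car.lm.pred           # actual minus predicted
ME   <- mean(err)                              # mean error
RMSE <- sqrt(mean(err^2))                      # root mean squared error
MAE  <- mean(abs(err))                         # mean absolute error
MPE  <- mean(err / valid.df$Price) * 100       # mean percentage error
MAPE <- mean(abs(err / valid.df$Price)) * 100  # mean absolute percentage error
c(ME = ME, RMSE = RMSE, MAE = MAE, MPE = MPE, MAPE = MAPE)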

How to see the residuals: Table 6.4 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model, together with the prediction errors (residuals) relative to the actual prices.

some.residuals <- valid.df$Price[1:20] - car.lm.pred[1:20]
data.frame("Predicted" = car.lm.pred[1:20],
           "Actual" = valid.df$Price[1:20],
           "Residual" = some.residuals)
##    Predicted Actual Residual
## 2      16447  13750    -2697
## 7      16757  16900      143
## 8      16750  18600     1850
## 9      20959  21500      541
## 10     14350  12950    -1400
## 12     21124  19950    -1174
## 13     20964  19600    -1364
## 14     20408  21500     1092
## 18     16817  17950     1133
## 21     15053  15950      897
## 23     15800  15950      150
## 24     16307  16950      643
## 26     16786  15950     -836
## 30     16484  17950     1466
## 32     16233  15750     -483
## 34     15752  14950     -802
## 36     15485  15750      265
## 38     16629  14950    -1679
## 46     18069  19000      931
## 47     17441  17950      509

Now let’s plot the residuals using ggplot2!

library(ggplot2)
# ggplot() fortifies a fitted lm object, exposing .fitted and .resid columns;
# reuse the model fit above instead of refitting it.
ggplot(car.lm) +
  geom_point(aes(x = .fitted, y = .resid))

Versus using base R

# plot() on a fitted lm object cycles through the standard diagnostic plots
plot(car.lm)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
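
By default, plot() on an lm object draws the diagnostic plots one at a time; a common idiom is to arrange all four in a single window:

# show all four diagnostic plots in one 2 x 2 layout
par(mfrow = c(2, 2))
plot(car.lm)
par(mfrow = c(1, 1))  # reset the layout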

What about a histogram of the errors?

# car.lm.pred was already computed above; compute all validation-set residuals
all.residuals <- valid.df$Price - car.lm.pred
# proportion of validation residuals falling within +/- $1,406 of zero
length(all.residuals[which(all.residuals > -1406 & all.residuals < 1406)]) / 400
## [1] 0.723
hist(all.residuals, breaks = 25, xlab = "Residuals", main = "")

## QUESTION FOR YOU:  
# 1. how would you do this with ggplot2?
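
One possible answer, reusing the all.residuals vector computed above:

# a ggplot2 version of the residual histogram
ggplot(data.frame(residual = all.residuals), aes(x = residual)) +
  geom_histogram(bins = 25) +
  labs(x = "Residuals", y = "Count")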