A large Toyota car dealership offers purchasers of new Toyota cars the option to sell their used car to the dealership as a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars traded in by purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on the car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1, and a sample of the dataset is shown in Table 6.2. The full file ToyotaCorolla.csv contains 1436 records; we use the first 1000 cars. After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model between price (the outcome variable) and the other variables (as predictors) using only the training set. Table 6.3 shows the estimated coefficients. Notice that the Fuel_Type predictor has three categories (Petrol, Diesel, and CNG). We therefore have two dummy variables in the model: Fuel_TypePetrol (0/1) and Fuel_TypeDiesel (0/1); the third, for CNG (0/1), is redundant given the information on the first two dummies. Including the redundant dummy alongside the other two (and an intercept) would create perfect multicollinearity, since it is an exact linear combination of them; R’s lm() routine handles this automatically by dropping one level of the factor.
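To see this dummy coding concretely, here is a minimal, self-contained sketch using a toy factor that mirrors the three levels of Fuel_Type (CNG comes first alphabetically, so R makes it the reference level):
# toy factor with the same three levels as Fuel_Type
fuel <- factor(c("CNG", "Diesel", "Petrol"))
model.matrix(~ fuel)  # an intercept plus two dummies; CNG is the reference level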
# load and look at your data!
library(dplyr)  # provides glimpse()
toyota.corolla.df <- read.csv("~/Downloads/ToyotaCorolla.csv")
car.df <- toyota.corolla.df
glimpse(car.df)
## Observations: 1,436
## Variables: 39
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Model <fct> TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,…
## $ Price <int> 13500, 13750, 13950, 14950, 13750, 12950, 1690…
## $ Age_08_04 <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22…
## $ Mfg_Month <int> 10, 10, 9, 7, 3, 1, 6, 3, 6, 10, 8, 11, 8, 2, …
## $ Mfg_Year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002…
## $ KM <int> 46986, 72937, 41711, 48000, 38500, 61000, 9461…
## $ Fuel_Type <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel…
## $ HP <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, …
## $ Met_Color <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ Color <fct> Blue, Silver, Blue, Black, Black, White, Grey,…
## $ Automatic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CC <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000…
## $ Doors <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ Cylinders <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ Gears <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6…
## $ Quarterly_Tax <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 1…
## $ Weight <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245…
## $ Mfr_Guarantee <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0…
## $ BOVAG_Guarantee <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Guarantee_Period <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 12, 3, 3, 3, 3, …
## $ ABS <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Airbag_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Airbag_2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
## $ Airco <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Automatic_airco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1…
## $ Boardcomputer <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ CD_Player <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0…
## $ Central_Lock <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Powered_Windows <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Power_Steering <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Radio <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Mistlamps <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Sport_Model <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Backseat_Divider <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ Metallic_Rim <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Radio_cassette <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Parking_Assistant <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Tow_Bar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
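The 11-variable data shown next imply an intermediate step that is missing above: keeping only the first 1000 records and the 11 analysis variables. A minimal sketch of that step, with column positions read off the glimpse above:
# keep the first 1000 records and the 11 variables used in the analysis
selected.var <- c(3, 4, 7, 8, 9, 10, 12, 13, 14, 17, 18)
car.df <- car.df[1:1000, selected.var]
glimpse(car.df)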
## Observations: 1,000
## Variables: 11
## $ Price <int> 13500, 13750, 13950, 14950, 13750, 12950, 16900, 1…
## $ Age_08_04 <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22, 25…
## $ KM <int> 46986, 72937, 41711, 48000, 38500, 61000, 94612, 7…
## $ Fuel_Type <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel, Di…
## $ HP <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, 192,…
## $ Met_Color <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ Automatic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CC <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 18…
## $ Doors <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Quarterly_Tax <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 185, …
## $ Weight <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245, 11…
set.seed(1)  # set seed to make the partition reproducible
train.index <- sample(c(1:1000), 600)
train.df <- car.df[train.index, ]
valid.df <- car.df[-train.index, ]  # the remaining 400 records
# use lm() to run a linear regression of Price on all remaining variables in
# the training set (11 predictors once Fuel_Type is expanded into two dummies).
# use . after ~ to include all the remaining columns in train.df as predictors.
car.lm <- lm(Price ~ ., data = train.df)
# optionally, uncomment the next line to keep output out of scientific notation.
# options(scipen = 999)
summary(car.lm)
##
## Call:
## lm(formula = Price ~ ., data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9781.2 -729.9 0.9 739.3 6912.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.754e+03 1.662e+03 -2.861 0.004372 **
## Age_08_04 -1.333e+02 4.902e+00 -27.187 < 2e-16 ***
## KM -2.099e-02 2.304e-03 -9.111 < 2e-16 ***
## Fuel_TypeDiesel 8.962e+02 6.032e+02 1.486 0.137857
## Fuel_TypePetrol 2.191e+03 5.756e+02 3.807 0.000155 ***
## HP 3.726e+01 5.233e+00 7.119 3.17e-12 ***
## Met_Color 5.132e+01 1.234e+02 0.416 0.677664
## Automatic 6.357e+01 2.623e+02 0.242 0.808583
## CC 1.075e-02 9.771e-02 0.110 0.912456
## Doors -5.570e+01 6.397e+01 -0.871 0.384230
## Quarterly_Tax 1.308e+01 2.608e+00 5.015 7.05e-07 ***
## Weight 1.622e+01 1.527e+00 10.622 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1392 on 588 degrees of freedom
## Multiple R-squared: 0.8703, Adjusted R-squared: 0.8679
## F-statistic: 358.7 on 11 and 588 DF, p-value: < 2.2e-16
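To pull out just the coefficient table behind Table 6.3, you can index into the summary object; a small sketch:
# extract only the coefficient matrix from the fitted model (cf. Table 6.3)
round(summary(car.lm)$coefficients, 4)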
library(forecast)  # provides accuracy()
# use predict() to make predictions on a new set.
car.lm.pred <- predict(car.lm, valid.df)
options(scipen=999, digits = 3)
# use accuracy() to compute common accuracy measures.
accuracy(car.lm.pred, valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 19.6 1325 1049 -0.75 9.35
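As a quick sanity check, the RMSE reported by accuracy() can be reproduced by hand:
# RMSE by hand: root mean squared prediction error on the validation set
sqrt(mean((valid.df$Price - car.lm.pred)^2))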
How to see the residuals: Table 6.4 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model, together with the prediction errors (relative to the actual prices) for those 20 cars.
some.residuals <- valid.df$Price[1:20] - car.lm.pred[1:20]
data.frame("Predicted" = car.lm.pred[1:20],
           "Actual" = valid.df$Price[1:20],
           "Residual" = some.residuals)
## Predicted Actual Residual
## 2 16447 13750 -2697
## 7 16757 16900 143
## 8 16750 18600 1850
## 9 20959 21500 541
## 10 14350 12950 -1400
## 12 21124 19950 -1174
## 13 20964 19600 -1364
## 14 20408 21500 1092
## 18 16817 17950 1133
## 21 15053 15950 897
## 23 15800 15950 150
## 24 16307 16950 643
## 26 16786 15950 -836
## 30 16484 17950 1466
## 32 16233 15750 -483
## 34 15752 14950 -802
## 36 15485 15750 265
## 38 16629 14950 -1679
## 46 18069 19000 931
## 47 17441 17950 509
Now let’s plot the residuals using ggplot2!
library(ggplot2)
ggplot(car.lm) +
  geom_point(aes(x = .fitted, y = .resid))
Versus using base R:
plot(car.lm)  # cycles through four diagnostic plots
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
What about a histogram of the errors?
car.lm.pred <- predict(car.lm, valid.df)
all.residuals <- valid.df$Price - car.lm.pred
# proportion of the 400 validation-set residuals that fall within +/- 1406
length(all.residuals[which(all.residuals > -1406 & all.residuals < 1406)])/400
## [1] 0.723
hist(all.residuals, breaks = 25, xlab = "Residuals", main = "")
## QUESTION FOR YOU:
# 1. How would you do this with ggplot2?
Note: It can be shown that for linear regression, in large samples Mallows’ Cp is equivalent to AIC.
Finally, a useful point to note is that for a fixed size of subset, R2, R2adj, Cp, AIC, and BIC all select the same subset. In fact, there is no difference between them in the order of merit they ascribe to subsets of a fixed size. This is good to know if comparing models with the same number of predictors, but often we want to compare models with different numbers of predictors.
Table 6.5 gives the results of applying an exhaustive search on the Toyota Corolla price data (with the 11 predictors). It reports the best model with a single predictor, two predictors, and so on. It can be seen that the R2adj increases until eight predictors are used (number of coefficients = 9) and then stabilizes. The Cp indicates that a model with 7 to 8 predictors is good. The dominant predictor in all models is the age of the car, with horsepower and mileage playing important roles as well.
# use regsubsets() in package leaps to run an exhaustive search.
# regsubsets() accepts the same formula interface as lm(); factors such as
# Fuel_Type are expanded into dummies, but each dummy is then selected
# individually rather than as a group.
library(leaps)
search <- regsubsets(Price ~ ., data = train.df, nbest = 1,
                     nvmax = dim(train.df)[2], method = "exhaustive")
help("regsubsets")
search.sum <- summary(search)  # named to avoid masking base::sum()
# show models
search.sum$which
## (Intercept) Age_08_04 KM Fuel_TypeDiesel Fuel_TypePetrol HP
## 1 TRUE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE FALSE TRUE
## 5 TRUE TRUE TRUE FALSE FALSE TRUE
## 6 TRUE TRUE TRUE FALSE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE
## 9 TRUE TRUE TRUE TRUE TRUE TRUE
## 10 TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE TRUE
## Met_Color Automatic CC Doors Quarterly_Tax Weight
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE TRUE
## 3 FALSE FALSE FALSE FALSE FALSE TRUE
## 4 FALSE FALSE FALSE FALSE FALSE TRUE
## 5 FALSE FALSE FALSE FALSE TRUE TRUE
## 6 FALSE FALSE FALSE FALSE TRUE TRUE
## 7 FALSE FALSE FALSE FALSE TRUE TRUE
## 8 FALSE FALSE FALSE TRUE TRUE TRUE
## 9 TRUE FALSE FALSE TRUE TRUE TRUE
## 10 TRUE TRUE FALSE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE TRUE
# show metrics
data.frame(rsq = search.sum$rsq, adjr2 = search.sum$adjr2, cp = search.sum$cp)
## rsq adjr2 cp
## 1 0.753 0.753 522.34
## 2 0.795 0.794 334.81
## 3 0.844 0.843 115.90
## 4 0.863 0.862 29.80
## 5 0.866 0.865 19.32
## 6 0.870 0.868 5.17
## 7 0.870 0.869 4.96
## 8 0.870 0.868 6.26
## 9 0.870 0.868 8.08
## 10 0.870 0.868 10.01
## 11 0.870 0.868 12.00
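The regsubsets() summary also stores BIC, so each criterion can be asked directly which subset size it prefers; a short sketch using the summary object computed above:
# subset size favored by each criterion (indices match the rows above)
which.max(search.sum$adjr2)  # largest adjusted R-squared
which.min(search.sum$cp)     # smallest Cp
which.min(search.sum$bic)    # smallest BIC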
## QUESTION FOR YOU:
# 2. What is cp?
The second method of finding the best subset of predictors relies on a partial, iterative search through the space of all possible regression models. The end product is one best subset of predictors (although there do exist variations of these methods that identify several close-to-best choices for different sizes of predictor subsets). This approach is computationally cheaper, but it has the potential of missing “good” combinations of predictors. None of the methods guarantee that they yield the best subset for any criterion, such as R2adj. They are reasonable methods for situations with a large number of predictors, but for a moderate number of predictors, the exhaustive search is preferable.
Three popular iterative search algorithms are forward selection, backward elimination, and stepwise regression. In forward selection, we start with no predictors and add predictors one by one. The predictor added at each step is the one (among the predictors not yet in the model) that contributes the most to R2 on top of the predictors already in the model. The algorithm stops when the contribution of additional predictors is not statistically significant. The main disadvantage of this method is that it will miss pairs or groups of predictors that perform very well together but poorly as single predictors. This is similar to interviewing job candidates for a team project one by one, thereby missing groups of candidates who excel together (“colleagues”) but perform poorly on their own or with non-colleagues.
In backward elimination, we start with all predictors and then at each step, eliminate the least useful predictor (according to statistical significance). The algorithm stops when all the remaining predictors have significant contributions. The weakness of this algorithm is that computing the initial model with all predictors can be time-consuming and unstable. Stepwise regression is like forward selection except that at each step, we consider dropping predictors that are not statistically significant, as in backward elimination.
R has several libraries with stepwise functions: function regsubsets() in the leaps package implements (in addition to exhaustive search) forward selection, backward elimination, and stepwise regression. Predictors are added/dropped based on either R2, R2adj, or Cp. In contrast, function step() in the stats package, as well as function stepAIC() in the MASS package perform model selection using the AIC criterion (stepAIC offers a wider range of object classes).
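For example, forward selection driven by these metrics (rather than AIC) might look like the following sketch, re-using train.df from above:
# forward selection with leaps; compare adjusted R-squared across subset sizes
search.fwd <- regsubsets(Price ~ ., data = train.df, nbest = 1,
                         nvmax = dim(train.df)[2], method = "forward")
summary(search.fwd)$adjr2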
Start with all variables and remove them one by one!
#### Table 6.6
# use step() to run stepwise regression.
car.lm.step <- step(car.lm, direction = "backward")
## Start: AIC=8698
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic +
## CC + Doors + Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## - CC 1 23445 1139558987 8696
## - Automatic 1 113837 1139649380 8696
## - Met_Color 1 335154 1139870697 8696
## - Doors 1 1469490 1141005033 8697
## <none> 1139535543 8698
## - Fuel_Type 2 36864358 1176399900 8713
## - Quarterly_Tax 1 48732676 1188268219 8721
## - HP 1 98229083 1237764626 8746
## - KM 1 160862596 1300398139 8775
## - Weight 1 218676925 1358212468 8802
## - Age_08_04 1 1432472333 2572007876 9185
##
## Step: AIC=8696
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic +
## Doors + Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## - Automatic 1 136842 1139695829 8694
## - Met_Color 1 340681 1139899668 8694
## - Doors 1 1457424 1141016411 8695
## <none> 1139558987 8696
## - Fuel_Type 2 36879383 1176438370 8711
## - Quarterly_Tax 1 48759179 1188318167 8719
## - HP 1 100144734 1239703722 8745
## - KM 1 160839218 1300398206 8773
## - Weight 1 218873160 1358432148 8800
## - Age_08_04 1 1433096756 2572655743 9183
##
## Step: AIC=8694
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Doors +
## Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## - Met_Color 1 338704 1140034533 8692
## - Doors 1 1522740 1141218569 8693
## <none> 1139695829 8694
## - Fuel_Type 2 37033833 1176729662 8709
## - Quarterly_Tax 1 48735659 1188431487 8717
## - HP 1 100045224 1239741053 8743
## - KM 1 161464457 1301160286 8772
## - Weight 1 226617762 1366313591 8801
## - Age_08_04 1 1440955839 2580651668 9183
##
## Step: AIC=8692
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Doors + Quarterly_Tax +
## Weight
##
## Df Sum of Sq RSS AIC
## - Doors 1 1362886 1141397420 8691
## <none> 1140034533 8692
## - Fuel_Type 2 36776012 1176810545 8707
## - Quarterly_Tax 1 48499275 1188533808 8715
## - HP 1 101053268 1241087802 8741
## - KM 1 161965108 1301999641 8770
## - Weight 1 226421966 1366456500 8799
## - Age_08_04 1 1448501122 2588535655 9182
##
## Step: AIC=8691
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## <none> 1141397420 8691
## - Fuel_Type 2 35587030 1176984450 8706
## - Quarterly_Tax 1 48089820 1189487240 8714
## - HP 1 102605929 1244003348 8741
## - KM 1 165583130 1306980550 8770
## - Weight 1 232428680 1373826100 8800
## - Age_08_04 1 1447234462 2588631881 9180
summary(car.lm.step)
##
## Call:
## lm(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Quarterly_Tax +
## Weight, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9667 -748 21 746 6987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4622.46993 1634.07988 -2.83 0.0048 **
## Age_08_04 -133.13196 4.85926 -27.40 < 0.0000000000000002 ***
## KM -0.02120 0.00229 -9.27 < 0.0000000000000002 ***
## Fuel_TypeDiesel 888.54989 596.23572 1.49 0.1367
## Fuel_TypePetrol 2138.33406 571.47519 3.74 0.0002 ***
## HP 37.60879 5.15538 7.30 0.00000000000096 ***
## Quarterly_Tax 12.97858 2.59871 4.99 0.00000077835339 ***
## Weight 15.96199 1.45378 10.98 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1390 on 592 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.869
## F-statistic: 566 on 7 and 592 DF, p-value: <0.0000000000000002
## QUESTION FOR YOU:
# 3. Which variables were dropped?
car.lm.step.pred <- predict(car.lm.step, valid.df)
accuracy(car.lm.step.pred, valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 20.4 1328 1055 -0.736 9.41
## QUESTION FOR YOU:
# 4. RECORD THE RMSE: ______
## QUESTION FOR YOU:
# 5. How does this compare to your first RMSE?
Start without any variables and add them one by one!
#### Table 6.7
# create model with no predictors
car.lm.null <- lm(Price ~ 1, data = train.df)
# use step() to run forward selection, from the null model up to the full model.
car.lm.step <- step(car.lm.null, scope = list(lower = car.lm.null, upper = car.lm),
                    direction = "forward")
## Start: AIC=9902
## Price ~ 1
##
## Df Sum of Sq RSS AIC
## + Age_08_04 1 6619336124 2167322714 9064
## + KM 1 3357409684 5429249154 9615
## + Weight 1 3185574106 5601084732 9634
## + HP 1 1030024095 7756634743 9829
## + Quarterly_Tax 1 454459218 8332199620 9872
## + Doors 1 283210488 8503448350 9884
## + Met_Color 1 136926847 8649731991 9894
## + CC 1 132887209 8653771629 9895
## + Fuel_Type 2 82914175 8703744663 9900
## <none> 8786658838 9902
## + Automatic 1 14099233 8772559605 9903
##
## Step: AIC=9064
## Price ~ Age_08_04
##
## Df Sum of Sq RSS AIC
## + Weight 1 367311518 1800011196 8954
## + HP 1 348771443 1818551271 8961
## + KM 1 229319983 1938002731 8999
## + Quarterly_Tax 1 29968151 2137354563 9058
## + Automatic 1 19010241 2148312473 9061
## + Doors 1 17838602 2149484111 9061
## + Fuel_Type 2 24222614 2143100100 9061
## + CC 1 13455747 2153866967 9062
## <none> 2167322714 9064
## + Met_Color 1 3355998 2163966716 9065
##
## Step: AIC=8954
## Price ~ Age_08_04 + Weight
##
## Df Sum of Sq RSS AIC
## + KM 1 428119347 1371891849 8794
## + HP 1 373615357 1426395839 8817
## + Fuel_Type 2 317441967 1482569229 8842
## + Quarterly_Tax 1 66286337 1733724859 8934
## + Automatic 1 8279853 1791731343 8954
## <none> 1800011196 8954
## + Met_Color 1 2076895 1797934301 8956
## + CC 1 276268 1799734929 8956
## + Doors 1 230044 1799781152 8956
##
## Step: AIC=8794
## Price ~ Age_08_04 + Weight + KM
##
## Df Sum of Sq RSS AIC
## + HP 1 170728393 1201163456 8716
## + Fuel_Type 2 65233378 1306658471 8768
## <none> 1371891849 8794
## + Met_Color 1 718694 1371173155 8795
## + CC 1 551713 1371340136 8795
## + Automatic 1 380438 1371511411 8795
## + Doors 1 119381 1371772468 8795
## + Quarterly_Tax 1 20420 1371871429 8796
##
## Step: AIC=8716
## Price ~ Age_08_04 + Weight + KM + HP
##
## Df Sum of Sq RSS AIC
## + Quarterly_Tax 1 24179006 1176984450 8706
## + Fuel_Type 2 11676216 1189487240 8714
## <none> 1201163456 8716
## + Doors 1 635682 1200527775 8717
## + CC 1 141287 1201022170 8718
## + Automatic 1 34178 1201129279 8718
## + Met_Color 1 21772 1201141684 8718
##
## Step: AIC=8706
## Price ~ Age_08_04 + Weight + KM + HP + Quarterly_Tax
##
## Df Sum of Sq RSS AIC
## + Fuel_Type 2 35587030 1141397420 8691
## <none> 1176984450 8706
## + Automatic 1 314349 1176670101 8707
## + Doors 1 173905 1176810545 8707
## + Met_Color 1 52675 1176931775 8708
## + CC 1 11182 1176973268 8708
##
## Step: AIC=8691
## Price ~ Age_08_04 + Weight + KM + HP + Quarterly_Tax + Fuel_Type
##
## Df Sum of Sq RSS AIC
## <none> 1141397420 8691
## + Doors 1 1362886 1140034533 8692
## + Automatic 1 197092 1141200327 8693
## + Met_Color 1 178851 1141218569 8693
## + CC 1 38498 1141358922 8693
summary(car.lm.step) # Which variables were added?
##
## Call:
## lm(formula = Price ~ Age_08_04 + Weight + KM + HP + Quarterly_Tax +
## Fuel_Type, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9667 -748 21 746 6987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4622.46993 1634.07988 -2.83 0.0048 **
## Age_08_04 -133.13196 4.85926 -27.40 < 0.0000000000000002 ***
## Weight 15.96199 1.45378 10.98 < 0.0000000000000002 ***
## KM -0.02120 0.00229 -9.27 < 0.0000000000000002 ***
## HP 37.60879 5.15538 7.30 0.00000000000096 ***
## Quarterly_Tax 12.97858 2.59871 4.99 0.00000077835339 ***
## Fuel_TypeDiesel 888.54989 596.23572 1.49 0.1367
## Fuel_TypePetrol 2138.33406 571.47519 3.74 0.0002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1390 on 592 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.869
## F-statistic: 566 on 7 and 592 DF, p-value: <0.0000000000000002
car.lm.step.pred <- predict(car.lm.step, valid.df)
accuracy(car.lm.step.pred, valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 20.4 1328 1055 -0.736 9.41
#### Table 6.8
# use step() to run stepwise regression.
car.lm.step <- step(car.lm, direction = "both")
## Start: AIC=8698
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic +
## CC + Doors + Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## - CC 1 23445 1139558987 8696
## - Automatic 1 113837 1139649380 8696
## - Met_Color 1 335154 1139870697 8696
## - Doors 1 1469490 1141005033 8697
## <none> 1139535543 8698
## - Fuel_Type 2 36864358 1176399900 8713
## - Quarterly_Tax 1 48732676 1188268219 8721
## - HP 1 98229083 1237764626 8746
## - KM 1 160862596 1300398139 8775
## - Weight 1 218676925 1358212468 8802
## - Age_08_04 1 1432472333 2572007876 9185
##
## Step: AIC=8696
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic +
## Doors + Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## - Automatic 1 136842 1139695829 8694
## - Met_Color 1 340681 1139899668 8694
## - Doors 1 1457424 1141016411 8695
## <none> 1139558987 8696
## + CC 1 23445 1139535543 8698
## - Fuel_Type 2 36879383 1176438370 8711
## - Quarterly_Tax 1 48759179 1188318167 8719
## - HP 1 100144734 1239703722 8745
## - KM 1 160839218 1300398206 8773
## - Weight 1 218873160 1358432148 8800
## - Age_08_04 1 1433096756 2572655743 9183
##
## Step: AIC=8694
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Doors +
## Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## - Met_Color 1 338704 1140034533 8692
## - Doors 1 1522740 1141218569 8693
## <none> 1139695829 8694
## + Automatic 1 136842 1139558987 8696
## + CC 1 46449 1139649380 8696
## - Fuel_Type 2 37033833 1176729662 8709
## - Quarterly_Tax 1 48735659 1188431487 8717
## - HP 1 100045224 1239741053 8743
## - KM 1 161464457 1301160286 8772
## - Weight 1 226617762 1366313591 8801
## - Age_08_04 1 1440955839 2580651668 9183
##
## Step: AIC=8692
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Doors + Quarterly_Tax +
## Weight
##
## Df Sum of Sq RSS AIC
## - Doors 1 1362886 1141397420 8691
## <none> 1140034533 8692
## + Met_Color 1 338704 1139695829 8694
## + Automatic 1 134865 1139899668 8694
## + CC 1 53737 1139980796 8694
## - Fuel_Type 2 36776012 1176810545 8707
## - Quarterly_Tax 1 48499275 1188533808 8715
## - HP 1 101053268 1241087802 8741
## - KM 1 161965108 1301999641 8770
## - Weight 1 226421966 1366456500 8799
## - Age_08_04 1 1448501122 2588535655 9182
##
## Step: AIC=8691
## Price ~ Age_08_04 + KM + Fuel_Type + HP + Quarterly_Tax + Weight
##
## Df Sum of Sq RSS AIC
## <none> 1141397420 8691
## + Doors 1 1362886 1140034533 8692
## + Automatic 1 197092 1141200327 8693
## + Met_Color 1 178851 1141218569 8693
## + CC 1 38498 1141358922 8693
## - Fuel_Type 2 35587030 1176984450 8706
## - Quarterly_Tax 1 48089820 1189487240 8714
## - HP 1 102605929 1244003348 8741
## - KM 1 165583130 1306980550 8770
## - Weight 1 232428680 1373826100 8800
## - Age_08_04 1 1447234462 2588631881 9180
summary(car.lm.step) # Which variables were dropped/added?
##
## Call:
## lm(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Quarterly_Tax +
## Weight, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9667 -748 21 746 6987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4622.46993 1634.07988 -2.83 0.0048 **
## Age_08_04 -133.13196 4.85926 -27.40 < 0.0000000000000002 ***
## KM -0.02120 0.00229 -9.27 < 0.0000000000000002 ***
## Fuel_TypeDiesel 888.54989 596.23572 1.49 0.1367
## Fuel_TypePetrol 2138.33406 571.47519 3.74 0.0002 ***
## HP 37.60879 5.15538 7.30 0.00000000000096 ***
## Quarterly_Tax 12.97858 2.59871 4.99 0.00000077835339 ***
## Weight 15.96199 1.45378 10.98 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1390 on 592 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.869
## F-statistic: 566 on 7 and 592 DF, p-value: <0.0000000000000002
car.lm.step.pred <- predict(car.lm.step, valid.df)
accuracy(car.lm.step.pred, valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 20.4 1328 1055 -0.736 9.41
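Finally, it can help to line up the validation RMSEs of the full model and the stepwise model side by side; a small sketch using the prediction objects created above:
# compare validation RMSE of the full model vs. the final stepwise model
data.frame(model = c("full", "stepwise"),
           RMSE = c(accuracy(car.lm.pred, valid.df$Price)[, "RMSE"],
                    accuracy(car.lm.step.pred, valid.df$Price)[, "RMSE"]))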