6.3 Example: Predicting the Price of Used Toyota Corolla Cars

A large Toyota car dealership offers purchasers of new Toyota cars the option to trade in their used car. In particular, a new promotion promises to pay high prices for used Toyota Corollas traded in by purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on each car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1, and a sample of the dataset is shown in Table 6.2. We use the first 1000 cars from the dataset ToyotaCorolla.csv.

After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model with price as the outcome variable and the other variables as predictors, using only the training set. Table 6.3 shows the estimated coefficients. Notice that the Fuel_Type predictor has three categories (Petrol, Diesel, and CNG), so the model contains two dummy variables: Fuel_TypePetrol (0/1) and Fuel_TypeDiesel (0/1); a third dummy for CNG (0/1) would be redundant given the first two. Including the redundant dummy would make the regression unestimable, since it is a perfect linear combination of the intercept and the other two dummies; R’s lm() routine handles this issue automatically by dropping one level as the reference category.

# load and look at your data!
library(dplyr)  # provides glimpse()
toyota.corolla.df <- read.csv("~/Downloads/ToyotaCorolla.csv")

car.df <- toyota.corolla.df
glimpse(car.df)
## Observations: 1,436
## Variables: 39
## $ Id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Model             <fct> TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,…
## $ Price             <int> 13500, 13750, 13950, 14950, 13750, 12950, 1690…
## $ Age_08_04         <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22…
## $ Mfg_Month         <int> 10, 10, 9, 7, 3, 1, 6, 3, 6, 10, 8, 11, 8, 2, …
## $ Mfg_Year          <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002…
## $ KM                <int> 46986, 72937, 41711, 48000, 38500, 61000, 9461…
## $ Fuel_Type         <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel…
## $ HP                <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, …
## $ Met_Color         <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ Color             <fct> Blue, Silver, Blue, Black, Black, White, Grey,…
## $ Automatic         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CC                <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000…
## $ Doors             <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ Cylinders         <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ Gears             <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6…
## $ Quarterly_Tax     <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 1…
## $ Weight            <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245…
## $ Mfr_Guarantee     <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0…
## $ BOVAG_Guarantee   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Guarantee_Period  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 12, 3, 3, 3, 3, …
## $ ABS               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Airbag_1          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Airbag_2          <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
## $ Airco             <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Automatic_airco   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1…
## $ Boardcomputer     <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ CD_Player         <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0…
## $ Central_Lock      <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Powered_Windows   <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Power_Steering    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Radio             <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Mistlamps         <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Sport_Model       <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Backseat_Divider  <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ Metallic_Rim      <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1…
## $ Radio_cassette    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Parking_Assistant <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Tow_Bar           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
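
As discussed above, lm() encodes the three-level Fuel_Type factor with two dummy variables and drops one level as the reference. A quick way to preview that coding, assuming Fuel_Type was read in as a factor (as the glimpse() output above shows), is to build the model matrix directly:

# preview the dummy coding lm() will use for Fuel_Type
levels(car.df$Fuel_Type)                        # CNG, Diesel, Petrol
head(model.matrix(~ Fuel_Type, data = car.df))  # CNG is the omitted reference level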

Select the data to use
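
One way to build the reduced dataset is to subset rows and columns directly; a minimal sketch, keeping the first 1,000 records and the 11 variables used in the model (the column list is inferred from the glimpse() output below):

# keep the first 1,000 records and the 11 modeling variables
selected.vars <- c("Price", "Age_08_04", "KM", "Fuel_Type", "HP", "Met_Color",
                   "Automatic", "CC", "Doors", "Quarterly_Tax", "Weight")
car.df <- toyota.corolla.df[1:1000, selected.vars]
glimpse(car.df)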

## Observations: 1,000
## Variables: 11
## $ Price         <int> 13500, 13750, 13950, 14950, 13750, 12950, 16900, 1…
## $ Age_08_04     <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22, 25…
## $ KM            <int> 46986, 72937, 41711, 48000, 38500, 61000, 94612, 7…
## $ Fuel_Type     <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel, Di…
## $ HP            <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, 192,…
## $ Met_Color     <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ Automatic     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CC            <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 18…
## $ Doors         <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Quarterly_Tax <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 185, …
## $ Weight        <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245, 11…

Create training and validation datasets

set.seed(1)  # set seed for reproducing the partition
train.index <- sample(c(1:1000), 600)  # randomly sample 600 row indices for training
train.df <- car.df[train.index, ]
valid.df <- car.df[-train.index, ]  # the remaining 400 records form the validation set
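
A quick sanity check on the partition sizes:

# optional check: the split should be 600 training and 400 validation records
dim(train.df)  # 600 rows, 11 columns
dim(valid.df)  # 400 rows, 11 columns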

Run regression

# use lm() to run a linear regression of Price on the 10 remaining predictors
# in the training set.
# the . after ~ includes all other columns of train.df as predictors.
car.lm <- lm(Price ~ ., data = train.df)

# optional: uncomment the next line to keep the output out of scientific notation.
# options(scipen = 999)
summary(car.lm)
## 
## Call:
## lm(formula = Price ~ ., data = train.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9781.2  -729.9     0.9   739.3  6912.9 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.754e+03  1.662e+03  -2.861 0.004372 ** 
## Age_08_04       -1.333e+02  4.902e+00 -27.187  < 2e-16 ***
## KM              -2.099e-02  2.304e-03  -9.111  < 2e-16 ***
## Fuel_TypeDiesel  8.962e+02  6.032e+02   1.486 0.137857    
## Fuel_TypePetrol  2.191e+03  5.756e+02   3.807 0.000155 ***
## HP               3.726e+01  5.233e+00   7.119 3.17e-12 ***
## Met_Color        5.132e+01  1.234e+02   0.416 0.677664    
## Automatic        6.357e+01  2.623e+02   0.242 0.808583    
## CC               1.075e-02  9.771e-02   0.110 0.912456    
## Doors           -5.570e+01  6.397e+01  -0.871 0.384230    
## Quarterly_Tax    1.308e+01  2.608e+00   5.015 7.05e-07 ***
## Weight           1.622e+01  1.527e+00  10.622  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1392 on 588 degrees of freedom
## Multiple R-squared:  0.8703, Adjusted R-squared:  0.8679 
## F-statistic: 358.7 on 11 and 588 DF,  p-value: < 2.2e-16
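
Beyond the point estimates, confint() gives confidence intervals for each coefficient; a quick sketch using the fitted model:

# 95% confidence intervals for the regression coefficients
round(confint(car.lm, level = 0.95), 2)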

Run predictions

library(forecast)  # accuracy() comes from the forecast package
# use predict() to make predictions on a new set. 
car.lm.pred <- predict(car.lm, valid.df)

options(scipen = 999, digits = 3)
# use accuracy() to compute common accuracy measures.
accuracy(car.lm.pred, valid.df$Price)
##            ME RMSE  MAE   MPE MAPE
## Test set 19.6 1325 1049 -0.75 9.35
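
The measures reported by accuracy() can also be computed by hand, which makes their definitions explicit. A sketch, using errors defined as actual minus predicted (the results should match the table above up to rounding):

# recompute the accuracy measures from the validation-set errors
err  <- valid.df$Price - car.lm.pred           # actual minus predicted
ME   <- mean(err)                              # mean error
RMSE <- sqrt(mean(err^2))                      # root mean squared error
MAE  <- mean(abs(err))                         # mean absolute error
MPE  <- mean(err / valid.df$Price) * 100       # mean percentage error
MAPE <- mean(abs(err / valid.df$Price)) * 100  # mean absolute percentage error
c(ME = ME, RMSE = RMSE, MAE = MAE, MPE = MPE, MAPE = MAPE)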

How to see the residuals: Table 6.4 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model, together with the prediction errors (residuals) relative to the actual prices.

some.residuals <- valid.df$Price[1:20] - car.lm.pred[1:20]
data.frame("Predicted" = car.lm.pred[1:20],
           "Actual" = valid.df$Price[1:20],
           "Residual" = some.residuals)
##    Predicted Actual Residual
## 2      16447  13750    -2697
## 7      16757  16900      143
## 8      16750  18600     1850
## 9      20959  21500      541
## 10     14350  12950    -1400
## 12     21124  19950    -1174
## 13     20964  19600    -1364
## 14     20408  21500     1092
## 18     16817  17950     1133
## 21     15053  15950      897
## 23     15800  15950      150
## 24     16307  16950      643
## 26     16786  15950     -836
## 30     16484  17950     1466
## 32     16233  15750     -483
## 34     15752  14950     -802
## 36     15485  15750      265
## 38     16629  14950    -1679
## 46     18069  19000      931
## 47     17441  17950      509

Now let’s plot the residuals using ggplot2!

library(ggplot2)
# ggplot() fortifies a fitted lm object, exposing .fitted and .resid columns;
# reuse the model fit above instead of refitting it.
ggplot(car.lm) +
  geom_point(aes(x = .fitted, y = .resid))

Versus using base R

# plot() on a fitted lm object cycles through the standard diagnostic plots
plot(car.lm)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
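
By default, plot() on an lm object draws the diagnostic plots one at a time; a common idiom is to arrange all four in a single window:

# show all four diagnostic plots in one 2 x 2 layout
par(mfrow = c(2, 2))
plot(car.lm)
par(mfrow = c(1, 1))  # reset the layout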

What about a histogram of the errors?

# car.lm.pred was already computed above; compute all validation-set residuals
all.residuals <- valid.df$Price - car.lm.pred
# proportion of validation residuals falling within +/- $1,406 of zero
length(all.residuals[which(all.residuals > -1406 & all.residuals < 1406)]) / 400
## [1] 0.723
hist(all.residuals, breaks = 25, xlab = "Residuals", main = "")

## QUESTION FOR YOU:  
# 1. how would you do this with ggplot2?
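
One possible answer, reusing the all.residuals vector computed above:

# a ggplot2 version of the residual histogram
ggplot(data.frame(residual = all.residuals), aes(x = residual)) +
  geom_histogram(bins = 25) +
  labs(x = "Residuals", y = "Count")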