1. Creating training and validation Sets

For assessing prediction performance, several measures are used. In all cases, the measures are based on the validation set, which serves as a more objective ground than the training set to assess predictive accuracy. This is because records in the validation set are more similar to the future records to be predicted, in the sense that they are not used to select predictors or to estimate the model parameters. Models are trained on the training data, applied to the validation data, and measures of accuracy then use the prediction errors on that validation set.

# load file
toyota.corolla.df <- read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/ToyotaCorolla.csv")
help(toyota.corolla.df)
## No documentation for 'toyota.corolla.df' in specified packages and libraries:
## you could try '??toyota.corolla.df'
# randomly generate training and validation sets
training <- sample(toyota.corolla.df$Id, 600)
validation <- sample(setdiff(toyota.corolla.df$Id, training), 400)
glimpse(toyota.corolla.df)
## Observations: 1,436
## Variables: 39
## $ Id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Model             <fct> TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors, TOYO…
## $ Price             <int> 13500, 13750, 13950, 14950, 13750, 12950, 16900, 18…
## $ Age_08_04         <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22, 25,…
## $ Mfg_Month         <int> 10, 10, 9, 7, 3, 1, 6, 3, 6, 10, 8, 11, 8, 2, 1, 5,…
## $ Mfg_Year          <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 200…
## $ KM                <int> 46986, 72937, 41711, 48000, 38500, 61000, 94612, 75…
## $ Fuel_Type         <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel, Die…
## $ HP                <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, 192, …
## $ Met_Color         <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, …
## $ Color             <fct> Blue, Silver, Blue, Black, Black, White, Grey, Grey…
## $ Automatic         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CC                <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 180…
## $ Doors             <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ Cylinders         <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ Gears             <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 5, …
## $ Quarterly_Tax     <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 185, 1…
## $ Weight            <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245, 118…
## $ Mfr_Guarantee     <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ BOVAG_Guarantee   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Guarantee_Period  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 12, 3, 3, 3, 3, 3, 3,…
## $ ABS               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Airbag_1          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Airbag_2          <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Airco             <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Automatic_airco   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Boardcomputer     <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ CD_Player         <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, …
## $ Central_Lock      <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Powered_Windows   <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Power_Steering    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Radio             <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Mistlamps         <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, …
## $ Sport_Model       <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ Backseat_Divider  <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ Metallic_Rim      <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Radio_cassette    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Parking_Assistant <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Tow_Bar           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

2. Set up a model: predict price of car!

Using regression on the training and validation sets

# run linear regression model
# run linear regression model
reg <- lm(Price~., data=toyota.corolla.df[,-c(1,2,8,11)], subset=training,
          na.action=na.exclude)

summary(reg)
## 
## Call:
## lm(formula = Price ~ ., data = toyota.corolla.df[, -c(1, 2, 8, 
##     11)], subset = training, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6231.8  -635.2     6.3   623.2  4757.2 
## 
## Coefficients: (2 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.800e+03  2.034e+03   2.361 0.018583 *  
## Age_08_04         -1.192e+02  5.658e+00 -21.064  < 2e-16 ***
## Mfg_Month         -9.350e+01  1.473e+01  -6.349 4.46e-10 ***
## Mfg_Year                  NA         NA      NA       NA    
## KM                -1.976e-02  1.677e-03 -11.781  < 2e-16 ***
## HP                 2.725e+01  3.893e+00   7.001 7.20e-12 ***
## Met_Color          7.412e+01  1.077e+02   0.688 0.491619    
## Automatic          5.285e+02  2.382e+02   2.219 0.026892 *  
## CC                -5.257e-02  8.465e-02  -0.621 0.534866    
## Doors              9.179e+01  5.602e+01   1.638 0.101882    
## Cylinders                 NA         NA      NA       NA    
## Gears              2.122e+02  2.777e+02   0.764 0.445238    
## Quarterly_Tax      9.027e+00  1.779e+00   5.074 5.28e-07 ***
## Weight             8.029e+00  1.328e+00   6.046 2.69e-09 ***
## Mfr_Guarantee      1.640e+02  1.083e+02   1.514 0.130571    
## BOVAG_Guarantee    3.202e+02  1.701e+02   1.882 0.060379 .  
## Guarantee_Period   7.356e+01  2.221e+01   3.312 0.000986 ***
## ABS               -6.070e+02  1.822e+02  -3.332 0.000919 ***
## Airbag_1           9.361e+01  3.491e+02   0.268 0.788697    
## Airbag_2           1.213e+02  1.895e+02   0.640 0.522227    
## Airco              3.011e+02  1.314e+02   2.292 0.022281 *  
## Automatic_airco    2.101e+03  2.742e+02   7.663 7.93e-14 ***
## Boardcomputer     -1.758e+02  1.749e+02  -1.005 0.315272    
## CD_Player          3.697e+02  1.491e+02   2.479 0.013478 *  
## Central_Lock      -8.353e+01  2.165e+02  -0.386 0.699793    
## Powered_Windows    4.175e+02  2.174e+02   1.920 0.055319 .  
## Power_Steering     8.150e+00  4.710e+02   0.017 0.986200    
## Radio             -1.299e+03  1.232e+03  -1.054 0.292455    
## Mistlamps         -4.071e+01  1.605e+02  -0.254 0.799871    
## Sport_Model        4.658e+02  1.262e+02   3.690 0.000245 ***
## Backseat_Divider  -7.325e+01  1.827e+02  -0.401 0.688603    
## Metallic_Rim       1.115e+02  1.444e+02   0.772 0.440255    
## Radio_cassette     1.258e+03  1.238e+03   1.016 0.310032    
## Parking_Assistant -1.417e+03  8.666e+02  -1.635 0.102605    
## Tow_Bar           -1.595e+02  1.155e+02  -1.381 0.167757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1166 on 566 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9063, Adjusted R-squared:  0.9011 
## F-statistic: 171.2 on 32 and 566 DF,  p-value: < 2.2e-16
help(lm)  ## This one is a must!

## how would you the "data=toyota.corolla.df[,-c(1,2,8,11)]" 
## statement with dplyr?  Why is it better?

pred_t <- predict(reg, na.action=na.pass)
pred_v <- predict(reg, newdata=toyota.corolla.df[validation,-c(1,2,8,11)],
                  na.action=na.pass)
## Warning in predict.lm(reg, newdata = toyota.corolla.df[validation, -c(1, :
## prediction from a rank-deficient fit may be misleading
help(predict)  ## This one is also a must!

3. Evaluate performance

Find the typical measures

# training
forecast::accuracy(pred_t, toyota.corolla.df[training,]$Price)     # Watch out! Yardstick also has an accuracy!
##                     ME    RMSE      MAE        MPE     MAPE
## Test set -3.999342e-11 1133.29 825.7272 -0.9562265 8.346658
# validation
forecast::accuracy(pred_v, toyota.corolla.df[validation,]$Price)
##                ME     RMSE      MAE        MPE     MAPE
## Test set 60.31086 1073.912 830.3579 0.06752121 8.327669

4. Create lift chart

# Using gains package
library(gains)
gain <- gains(toyota.corolla.df[validation,]$Price[!is.na(pred_v)], pred_v[!is.na(pred_v)])

# cumulative lift chart
options(scipen=999) # avoid scientific notation
# we will compute the gain relative to price
price <- toyota.corolla.df[validation,]$Price[!is.na(toyota.corolla.df[validation,]$Price)]
{ # if plotting two things together, you have to put them in brakets
  plot(c(0,gain$cume.pct.of.total*sum(price))~c(0,gain$cume.obs), 
     xlab="# cases", ylab="Cumulative Price", main="Lift Chart", type="l")

  # baseline
  lines(c(0,sum(price))~c(0,dim(toyota.corolla.df[validation,])[1]), col="gray", lty=2)
}

Vs. Yardstick … and WVPlots

# install.packages("yardstick") #new one
#library(yardstick)

truth <- toyota.corolla.df %>%
  select(Price) %>%
  filter(row_number() %in% validation, !is.na(Price))

data.for.yardstick <- data.frame(Truth=truth, estimate=pred_v[!is.na(pred_v)])
# Yardstick doesn't work because it wants a factor!

library(WVPlots)
WVPlots::GainCurvePlot(data.for.yardstick, "Price", "estimate", title="Example Lift/Gain Plot")

Confusion matrix

Using caret package

library(caret)
owner.df <- read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/ownerExample.csv")
confusionMatrix(as.factor(ifelse(owner.df$Probability>0.5, 'owner', 'nonowner')), 
                owner.df$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonowner owner
##   nonowner       10     1
##   owner           2    11
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.6764, 0.9734)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.0001386       
##                                           
##                   Kappa : 0.75            
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.8333          
##             Specificity : 0.9167          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.8462          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4167          
##    Detection Prevalence : 0.4583          
##       Balanced Accuracy : 0.8750          
##                                           
##        'Positive' Class : nonowner        
## 
confusionMatrix(as.factor(ifelse(owner.df$Probability>0.25, 'owner', 'nonowner')), 
                owner.df$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonowner owner
##   nonowner        8     1
##   owner           4    11
##                                           
##                Accuracy : 0.7917          
##                  95% CI : (0.5785, 0.9287)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.003305        
##                                           
##                   Kappa : 0.5833          
##                                           
##  Mcnemar's Test P-Value : 0.371093        
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9167          
##          Pos Pred Value : 0.8889          
##          Neg Pred Value : 0.7333          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3750          
##       Balanced Accuracy : 0.7917          
##                                           
##        'Positive' Class : nonowner        
## 
confusionMatrix(as.factor(ifelse(owner.df$Probability>0.75, 'owner', 'nonowner')), 
                owner.df$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonowner owner
##   nonowner       11     5
##   owner           1     7
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.5329, 0.9023)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.01133         
##                                           
##                   Kappa : 0.5             
##                                           
##  Mcnemar's Test P-Value : 0.22067         
##                                           
##             Sensitivity : 0.9167          
##             Specificity : 0.5833          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4583          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : nonowner        
## 

Vs. Yardstick from the tidyverse!

library(yardstick)
owner.df<- owner.df %>%
  mutate(Estimate = Probability>0.75)

owner.df$Estimate[owner.df$Estimate==T] <- "owner"
owner.df$Estimate[owner.df$Estimate==F] <- "nonowner"
owner.df$Estimate <- factor(owner.df$Estimate)

conf_mat(data = owner.df, Class, Estimate)
##           Truth
## Prediction nonowner owner
##   nonowner       11     5
##   owner           1     7

ROC Curves

A more popular method for plotting the two measures is through ROC (Receiver Operating Characteristic) curves. Starting from the lower left, the ROC curve plots the pairs {sensitivity, specificity} as the cutoff value descends from 1 to 0. (A typical alternative presentation is to plot 1-specificity on the x-axis, which allows 0 to be placed on the left end of the axis, and 1 on the right.) Better performance is reflected by curves that are closer to the top-left corner. The comparison curve is the diagonal, which reflects the performance of the naive rule, using varying cutoff values (i.e., setting different thresholds on the level of majority used by the majority rule). A common metric to summarize an ROC curve is “area under the curve (AUC),” which ranges from 1 (perfect discrimination between classes) to 0.5 (no better than the naive rule).

Let’s use pROC:

library(pROC)
r <- roc(owner.df$Class, owner.df$Probability)
## Setting levels: control = nonowner, case = owner
## Setting direction: controls < cases
r
## 
## Call:
## roc.default(response = owner.df$Class, predictor = owner.df$Probability)
## 
## Data: owner.df$Probability in 12 controls (owner.df$Class nonowner) < 12 cases (owner.df$Class owner).
## Area under the curve: 0.9375
plot.roc(r)

Vs. Yardstick

yardstick::roc_curve(data = owner.df, Class, Probability)  # the table
## # A tibble: 26 x 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1  -Inf            1          0     
##  2     0.0031       1          0.0833
##  3     0.0161       1          0.167 
##  4     0.0218       1          0.25  
##  5     0.0248       1          0.333 
##  6     0.0383       1          0.417 
##  7     0.0479       1          0.5   
##  8     0.149        1          0.583 
##  9     0.199        1          0.667 
## 10     0.218        0.917      0.667 
## # … with 16 more rows
autoplot(roc_curve(owner.df, Class, Probability)) # the plot