For assessing prediction performance, several measures are used. In all cases, the measures are based on the validation set, which serves as a more objective ground than the training set to assess predictive accuracy. This is because records in the validation set are more similar to the future records to be predicted, in the sense that they are not used to select predictors or to estimate the model parameters. Models are trained on the training data, applied to the validation data, and measures of accuracy then use the prediction errors on that validation set.
# load file
toyota.corolla.df <- read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/ToyotaCorolla.csv")
help(toyota.corolla.df)
## No documentation for 'toyota.corolla.df' in specified packages and libraries:
## you could try '??toyota.corolla.df'
# randomly generate training and validation sets
training <- sample(toyota.corolla.df$Id, 600)
validation <- sample(setdiff(toyota.corolla.df$Id, training), 400)
glimpse(toyota.corolla.df)
## Observations: 1,436
## Variables: 39
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Model <fct> TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors, TOYO…
## $ Price <int> 13500, 13750, 13950, 14950, 13750, 12950, 16900, 18…
## $ Age_08_04 <int> 23, 23, 24, 26, 30, 32, 27, 30, 27, 23, 25, 22, 25,…
## $ Mfg_Month <int> 10, 10, 9, 7, 3, 1, 6, 3, 6, 10, 8, 11, 8, 2, 1, 5,…
## $ Mfg_Year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 200…
## $ KM <int> 46986, 72937, 41711, 48000, 38500, 61000, 94612, 75…
## $ Fuel_Type <fct> Diesel, Diesel, Diesel, Diesel, Diesel, Diesel, Die…
## $ HP <int> 90, 90, 90, 90, 90, 90, 90, 90, 192, 69, 192, 192, …
## $ Met_Color <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, …
## $ Color <fct> Blue, Silver, Blue, Black, Black, White, Grey, Grey…
## $ Automatic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CC <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 180…
## $ Doors <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ Cylinders <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ Gears <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 5, …
## $ Quarterly_Tax <int> 210, 210, 210, 210, 210, 210, 210, 210, 100, 185, 1…
## $ Weight <int> 1165, 1165, 1165, 1165, 1170, 1170, 1245, 1245, 118…
## $ Mfr_Guarantee <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ BOVAG_Guarantee <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Guarantee_Period <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 12, 3, 3, 3, 3, 3, 3,…
## $ ABS <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Airbag_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Airbag_2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Airco <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Automatic_airco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Boardcomputer <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ CD_Player <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, …
## $ Central_Lock <int> 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Powered_Windows <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Power_Steering <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Radio <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Mistlamps <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, …
## $ Sport_Model <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ Backseat_Divider <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ Metallic_Rim <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ Radio_cassette <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Parking_Assistant <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Tow_Bar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Using regression on the training and validation sets
# run linear regression model
# run linear regression model
reg <- lm(Price~., data=toyota.corolla.df[,-c(1,2,8,11)], subset=training,
na.action=na.exclude)
summary(reg)
##
## Call:
## lm(formula = Price ~ ., data = toyota.corolla.df[, -c(1, 2, 8,
## 11)], subset = training, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6231.8 -635.2 6.3 623.2 4757.2
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.800e+03 2.034e+03 2.361 0.018583 *
## Age_08_04 -1.192e+02 5.658e+00 -21.064 < 2e-16 ***
## Mfg_Month -9.350e+01 1.473e+01 -6.349 4.46e-10 ***
## Mfg_Year NA NA NA NA
## KM -1.976e-02 1.677e-03 -11.781 < 2e-16 ***
## HP 2.725e+01 3.893e+00 7.001 7.20e-12 ***
## Met_Color 7.412e+01 1.077e+02 0.688 0.491619
## Automatic 5.285e+02 2.382e+02 2.219 0.026892 *
## CC -5.257e-02 8.465e-02 -0.621 0.534866
## Doors 9.179e+01 5.602e+01 1.638 0.101882
## Cylinders NA NA NA NA
## Gears 2.122e+02 2.777e+02 0.764 0.445238
## Quarterly_Tax 9.027e+00 1.779e+00 5.074 5.28e-07 ***
## Weight 8.029e+00 1.328e+00 6.046 2.69e-09 ***
## Mfr_Guarantee 1.640e+02 1.083e+02 1.514 0.130571
## BOVAG_Guarantee 3.202e+02 1.701e+02 1.882 0.060379 .
## Guarantee_Period 7.356e+01 2.221e+01 3.312 0.000986 ***
## ABS -6.070e+02 1.822e+02 -3.332 0.000919 ***
## Airbag_1 9.361e+01 3.491e+02 0.268 0.788697
## Airbag_2 1.213e+02 1.895e+02 0.640 0.522227
## Airco 3.011e+02 1.314e+02 2.292 0.022281 *
## Automatic_airco 2.101e+03 2.742e+02 7.663 7.93e-14 ***
## Boardcomputer -1.758e+02 1.749e+02 -1.005 0.315272
## CD_Player 3.697e+02 1.491e+02 2.479 0.013478 *
## Central_Lock -8.353e+01 2.165e+02 -0.386 0.699793
## Powered_Windows 4.175e+02 2.174e+02 1.920 0.055319 .
## Power_Steering 8.150e+00 4.710e+02 0.017 0.986200
## Radio -1.299e+03 1.232e+03 -1.054 0.292455
## Mistlamps -4.071e+01 1.605e+02 -0.254 0.799871
## Sport_Model 4.658e+02 1.262e+02 3.690 0.000245 ***
## Backseat_Divider -7.325e+01 1.827e+02 -0.401 0.688603
## Metallic_Rim 1.115e+02 1.444e+02 0.772 0.440255
## Radio_cassette 1.258e+03 1.238e+03 1.016 0.310032
## Parking_Assistant -1.417e+03 8.666e+02 -1.635 0.102605
## Tow_Bar -1.595e+02 1.155e+02 -1.381 0.167757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1166 on 566 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9063, Adjusted R-squared: 0.9011
## F-statistic: 171.2 on 32 and 566 DF, p-value: < 2.2e-16
help(lm) ## This one is a must!
## how would you the "data=toyota.corolla.df[,-c(1,2,8,11)]"
## statement with dplyr? Why is it better?
pred_t <- predict(reg, na.action=na.pass)
pred_v <- predict(reg, newdata=toyota.corolla.df[validation,-c(1,2,8,11)],
na.action=na.pass)
## Warning in predict.lm(reg, newdata = toyota.corolla.df[validation, -c(1, :
## prediction from a rank-deficient fit may be misleading
help(predict) ## This one is also a must!
Find the typical measures
# training
forecast::accuracy(pred_t, toyota.corolla.df[training,]$Price) # Watch out! Yardstick also has an accuracy!
## ME RMSE MAE MPE MAPE
## Test set -3.999342e-11 1133.29 825.7272 -0.9562265 8.346658
# validation
forecast::accuracy(pred_v, toyota.corolla.df[validation,]$Price)
## ME RMSE MAE MPE MAPE
## Test set 60.31086 1073.912 830.3579 0.06752121 8.327669
# Using gains package
library(gains)
gain <- gains(toyota.corolla.df[validation,]$Price[!is.na(pred_v)], pred_v[!is.na(pred_v)])
# cumulative lift chart
options(scipen=999) # avoid scientific notation
# we will compute the gain relative to price
price <- toyota.corolla.df[validation,]$Price[!is.na(toyota.corolla.df[validation,]$Price)]
{ # if plotting two things together, you have to put them in brakets
plot(c(0,gain$cume.pct.of.total*sum(price))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative Price", main="Lift Chart", type="l")
# baseline
lines(c(0,sum(price))~c(0,dim(toyota.corolla.df[validation,])[1]), col="gray", lty=2)
}
Vs. Yardstick … and WVPlots
# install.packages("yardstick") #new one
#library(yardstick)
truth <- toyota.corolla.df %>%
select(Price) %>%
filter(row_number() %in% validation, !is.na(Price))
data.for.yardstick <- data.frame(Truth=truth, estimate=pred_v[!is.na(pred_v)])
# Yardstick doesn't work because it wants a factor!
library(WVPlots)
WVPlots::GainCurvePlot(data.for.yardstick, "Price", "estimate", title="Example Lift/Gain Plot")
Using caret package
library(caret)
owner.df <- read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/ownerExample.csv")
confusionMatrix(as.factor(ifelse(owner.df$Probability>0.5, 'owner', 'nonowner')),
owner.df$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonowner owner
## nonowner 10 1
## owner 2 11
##
## Accuracy : 0.875
## 95% CI : (0.6764, 0.9734)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.0001386
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8333
## Specificity : 0.9167
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.8462
## Prevalence : 0.5000
## Detection Rate : 0.4167
## Detection Prevalence : 0.4583
## Balanced Accuracy : 0.8750
##
## 'Positive' Class : nonowner
##
confusionMatrix(as.factor(ifelse(owner.df$Probability>0.25, 'owner', 'nonowner')),
owner.df$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonowner owner
## nonowner 8 1
## owner 4 11
##
## Accuracy : 0.7917
## 95% CI : (0.5785, 0.9287)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.003305
##
## Kappa : 0.5833
##
## Mcnemar's Test P-Value : 0.371093
##
## Sensitivity : 0.6667
## Specificity : 0.9167
## Pos Pred Value : 0.8889
## Neg Pred Value : 0.7333
## Prevalence : 0.5000
## Detection Rate : 0.3333
## Detection Prevalence : 0.3750
## Balanced Accuracy : 0.7917
##
## 'Positive' Class : nonowner
##
confusionMatrix(as.factor(ifelse(owner.df$Probability>0.75, 'owner', 'nonowner')),
owner.df$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonowner owner
## nonowner 11 5
## owner 1 7
##
## Accuracy : 0.75
## 95% CI : (0.5329, 0.9023)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.01133
##
## Kappa : 0.5
##
## Mcnemar's Test P-Value : 0.22067
##
## Sensitivity : 0.9167
## Specificity : 0.5833
## Pos Pred Value : 0.6875
## Neg Pred Value : 0.8750
## Prevalence : 0.5000
## Detection Rate : 0.4583
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : nonowner
##
Vs. Yardstick from the tidyverse!
library(yardstick)
owner.df<- owner.df %>%
mutate(Estimate = Probability>0.75)
owner.df$Estimate[owner.df$Estimate==T] <- "owner"
owner.df$Estimate[owner.df$Estimate==F] <- "nonowner"
owner.df$Estimate <- factor(owner.df$Estimate)
conf_mat(data = owner.df, Class, Estimate)
## Truth
## Prediction nonowner owner
## nonowner 11 5
## owner 1 7
A more popular method for plotting the two measures is through ROC (Receiver Operating Characteristic) curves. Starting from the lower left, the ROC curve plots the pairs {sensitivity, specificity} as the cutoff value descends from 1 to 0. (A typical alternative presentation is to plot 1-specificity on the x-axis, which allows 0 to be placed on the left end of the axis, and 1 on the right.) Better performance is reflected by curves that are closer to the top-left corner. The comparison curve is the diagonal, which reflects the performance of the naive rule, using varying cutoff values (i.e., setting different thresholds on the level of majority used by the majority rule). A common metric to summarize an ROC curve is “area under the curve (AUC),” which ranges from 1 (perfect discrimination between classes) to 0.5 (no better than the naive rule).
Let’s use pROC:
library(pROC)
r <- roc(owner.df$Class, owner.df$Probability)
## Setting levels: control = nonowner, case = owner
## Setting direction: controls < cases
r
##
## Call:
## roc.default(response = owner.df$Class, predictor = owner.df$Probability)
##
## Data: owner.df$Probability in 12 controls (owner.df$Class nonowner) < 12 cases (owner.df$Class owner).
## Area under the curve: 0.9375
plot.roc(r)
Vs. Yardstick
yardstick::roc_curve(data = owner.df, Class, Probability) # the table
## # A tibble: 26 x 3
## .threshold specificity sensitivity
## <dbl> <dbl> <dbl>
## 1 -Inf 1 0
## 2 0.0031 1 0.0833
## 3 0.0161 1 0.167
## 4 0.0218 1 0.25
## 5 0.0248 1 0.333
## 6 0.0383 1 0.417
## 7 0.0479 1 0.5
## 8 0.149 1 0.583
## 9 0.199 1 0.667
## 10 0.218 0.917 0.667
## # … with 16 more rows
autoplot(roc_curve(owner.df, Class, Probability)) # the plot