Assignment: Do Problems 11.3 and 11.4 in the textbook.
library(tidyverse)
library(neuralnet)
library(NeuralNetTools)
library(caret)
library(forecast)
library(gains)
ToyotaCorolla <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/ToyotaCorolla.csv")
ToyotaCorolla <- ToyotaCorolla %>%
  select(Price, Age_08_04, KM, Fuel_Type, HP, Automatic, Doors,
         Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco,
         Automatic_airco, CD_Player, Powered_Windows, Sport_Model, Tow_Bar)
summary(ToyotaCorolla)
## Price Age_08_04 KM Fuel_Type
## Min. : 4350 Min. : 1.00 Min. : 1 Length:1436
## 1st Qu.: 8450 1st Qu.:44.00 1st Qu.: 43000 Class :character
## Median : 9900 Median :61.00 Median : 63390 Mode :character
## Mean :10731 Mean :55.95 Mean : 68533
## 3rd Qu.:11950 3rd Qu.:70.00 3rd Qu.: 87021
## Max. :32500 Max. :80.00 Max. :243000
## HP Automatic Doors Quarterly_Tax
## Min. : 69.0 Min. :0.00000 Min. :2.000 Min. : 19.00
## 1st Qu.: 90.0 1st Qu.:0.00000 1st Qu.:3.000 1st Qu.: 69.00
## Median :110.0 Median :0.00000 Median :4.000 Median : 85.00
## Mean :101.5 Mean :0.05571 Mean :4.033 Mean : 87.12
## 3rd Qu.:110.0 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.: 85.00
## Max. :192.0 Max. :1.00000 Max. :5.000 Max. :283.00
## Mfr_Guarantee Guarantee_Period Airco Automatic_airco
## Min. :0.0000 Min. : 3.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median : 3.000 Median :1.0000 Median :0.00000
## Mean :0.4095 Mean : 3.815 Mean :0.5084 Mean :0.05641
## 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :36.000 Max. :1.0000 Max. :1.00000
## CD_Player Powered_Windows Sport_Model Tow_Bar
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.000 Median :0.0000 Median :0.0000
## Mean :0.2187 Mean :0.562 Mean :0.3001 Mean :0.2779
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
Convert categorical variables to dummies:
# Fuel_Type variable
ToyotaCorolla$Petrol = 1*(ToyotaCorolla$Fuel_Type == "Petrol")
ToyotaCorolla$Diesel = 1*(ToyotaCorolla$Fuel_Type == "Diesel")
ToyotaCorolla$CNG = 1*(ToyotaCorolla$Fuel_Type == "CNG")
# Doors variable
ToyotaCorolla$Door2 = 1*(ToyotaCorolla$Doors == 2)
ToyotaCorolla$Door3 = 1*(ToyotaCorolla$Doors == 3)
ToyotaCorolla$Door4 = 1*(ToyotaCorolla$Doors == 4)
ToyotaCorolla$Door5 = 1*(ToyotaCorolla$Doors == 5)
Partition the data into a 60/40 split:
set.seed(100)
n = nrow(ToyotaCorolla)
train.index = sample(n, n*0.6)
train.df = ToyotaCorolla[train.index, ] # 861 observations
valid.df = ToyotaCorolla[-train.index, ] # 575 observations
Preprocess (normalize) quantitative variables to a 0-1 scale:
norm.values = preProcess(train.df, method = "range")
train.norm = predict(norm.values, train.df)
valid.norm = predict(norm.values, valid.df)
summary(train.norm)
## Price Age_08_04 KM Fuel_Type
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:861
## 1st Qu.:0.1523 1st Qu.:0.5443 1st Qu.:0.1733 Class :character
## Median :0.2061 Median :0.7595 Median :0.2611 Mode :character
## Mean :0.2339 Mean :0.6948 Mean :0.2793
## 3rd Qu.:0.2748 3rd Qu.:0.8608 3rd Qu.:0.3534
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## HP Automatic Doors Quarterly_Tax
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1707 1st Qu.:0.00000 1st Qu.:0.3333 1st Qu.:0.1894
## Median :0.3333 Median :0.00000 Median :0.6667 Median :0.2500
## Mean :0.2663 Mean :0.04646 Mean :0.6752 Mean :0.2529
## 3rd Qu.:0.3333 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.2500
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## Mfr_Guarantee Guarantee_Period Airco Automatic_airco
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :1.0000 Median :0.00000
## Mean :0.4065 Mean :0.02432 Mean :0.5145 Mean :0.05343
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## CD_Player Powered_Windows Sport_Model Tow_Bar
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.2149 Mean :0.5575 Mean :0.2927 Mean :0.2846
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Petrol Diesel CNG Door2
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.000000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :1.0000 Median :0.0000 Median :0.000000 Median :0.000000
## Mean :0.8839 Mean :0.1069 Mean :0.009291 Mean :0.001161
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.000000
## Door3 Door4 Door5
## Min. :0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000 Median :0.00000 Median :0.0000
## Mean :0.439 Mean :0.09292 Mean :0.4669
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.00000 Max. :1.0000
# These values will help unscale our predictions back to dollars later
min = min(train.df$Price)
max = max(train.df$Price)
unscale.price <- function(scaled.price){
  unscaled = scaled.price*(max - min) + min
  return(unscaled)
}
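The code that fit the three networks and produced the val1/val2/val3 objects is not reproduced in this section. As a rough sketch of the workflow, the first network (1 hidden layer, 2 nodes) might be fit on the normalized training data and scored on the validation set roughly as follows (the object names nn1 and nn1.valid.pred, and the exact predictor list, are assumptions, not the original code):
# Sketch only: fit a 2-node network on the normalized data, then predict the
# validation set and unscale the predictions back to dollars.
predictors <- c("Age_08_04", "KM", "HP", "Automatic", "Quarterly_Tax",
                "Mfr_Guarantee", "Guarantee_Period", "Airco", "Automatic_airco",
                "CD_Player", "Powered_Windows", "Sport_Model", "Tow_Bar",
                "Petrol", "Diesel", "CNG", "Door2", "Door3", "Door4", "Door5")
nn1 <- neuralnet(reformulate(predictors, response = "Price"),
                 data = train.norm, hidden = 2, linear.output = TRUE)
nn1.valid.pred <- unscale.price(compute(nn1, valid.norm[, predictors])$net.result)
accuracy(as.numeric(nn1.valid.pred), valid.df$Price)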
As we saw, the first neural network (1 hidden layer, 2 nodes) performed best on the validation set, even though its RMSE on the training set was the highest (worst) of the three models.
accuracy(val1.lm)
## ME RMSE MAE MPE MAPE MASE
## Training set 9.977314e-14 1107.249 829.6625 -0.7938356 7.928735 0.3150899
accuracy(val2.lm)
## ME RMSE MAE MPE MAPE MASE
## Training set 2.613297e-13 1172.428 853.6804 -0.9622825 8.306225 0.3231541
accuracy(val3.lm)
## ME RMSE MAE MPE MAPE MASE
## Training set 7.635651e-14 1190.08 878.1159 -0.9896157 8.423273 0.3312438
This illustrates that a model that performs extremely well on the training set may be overfitting and can perform worse on unseen data.
Of these models, I would suggest using the first one because it has the lowest RMSE on the validation data.
However, in a real-world setting I would be very cautious about relying on this model because, on average, its predictions are still off by more than $900.
If I were a used-car dealership, that kind of error might be enough to put us out of business.
Furthermore, there are a few rows in this data set I would want to explore further.
For example, Row 524 is the worst prediction for all three models (off by more than $6,000 from the actual price). Below are residual plots and numerical summaries suggesting that there may be outliers in the dataset that warrant further investigation. Although Model 1 had the worst RMSE on the training data, it predicted Row 524 the best. Perhaps this row is an outlier; removing it could change the ordering of the models' RMSEs entirely.
ToyotaCorolla[524, ]
## Price Age_08_04 KM Fuel_Type HP Automatic Doors Quarterly_Tax
## 524 18950 49 49568 Petrol 110 0 3 19
## Mfr_Guarantee Guarantee_Period Airco Automatic_airco CD_Player
## 524 0 3 1 1 0
## Powered_Windows Sport_Model Tow_Bar Petrol Diesel CNG Door2 Door3 Door4
## 524 0 0 0 1 0 0 0 1 0
## Door5
## 524 0
val1.res = residuals(val1.lm)
sort(abs(val1.res)) %>% tail()
## 117 558 669 1059 139 524
## 3251.805 3345.832 3380.911 4243.596 5959.788 6005.280
plot(val1.res ~ val1.pred$Actual,
col = ifelse(abs(val1.res) > 6000, "red", "black"),
pch = ifelse(abs(val1.res) > 6000, 19, 1),
ylab = "Residuals", xlab = "Car Price",
main = "Neural Network #1 (1 Hidden Layer, 2 Nodes)",
sub = "* Row 524 (potential outlier) is colored Red")
abline(0,0, col = "red")
val2.res = residuals(val2.lm)
sort(abs(val2.res)) %>% tail()
## 105 115 117 139 1059 524
## 3586.150 3591.908 4411.091 4968.184 5740.484 8113.415
plot(val2.res ~ val2.pred$Actual,
col = ifelse(abs(val2.res) > 8000, "red", "black"),
pch = ifelse(abs(val2.res) > 8000, 19, 1),
ylab = "Residuals", xlab = "Car Price",
main = "Neural Network #2 (1 Hidden Layer, 5 Nodes)",
sub = "* Row 524 (potential outlier) is colored Red")
abline(0,0, col = "red")
val3.res = residuals(val3.lm)
sort(abs(val3.res)) %>% tail()
## 1110 134 125 115 1059 524
## 3643.014 3661.807 4089.827 4722.373 5067.182 8252.993
plot(val3.res ~ val3.pred$Actual,
col = ifelse(abs(val3.res) > 8000, "red", "black"),
pch = ifelse(abs(val3.res) > 8000, 19, 1),
ylab = "Residuals", xlab = "Car Price",
main = "Neural Network #3 (2 Hidden Layers, 5 Nodes)",
sub = "* Row 524 (potential outlier) is colored Red")
abline(0,0, col = "red")
In summary, of the first three models I would choose the first (1 hidden layer, 2 nodes) because it had the lowest RMSE when predicting unseen data. However, the RMSE is still large (over $900), and we would want to investigate potential outliers (Row 524, for example) before trusting the model.
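As a follow-up on the outlier concern, one way to check Row 524's influence would be to drop it, repartition, and re-normalize before refitting and re-comparing the three networks. A minimal sketch (the .clean object names are hypothetical):
# Sketch only: drop Row 524 by its original row name, then repeat the
# 60/40 partition and 0-1 normalization on the cleaned data.
ToyotaCorolla.clean <- ToyotaCorolla[rownames(ToyotaCorolla) != "524", ]
set.seed(100)
n.clean <- nrow(ToyotaCorolla.clean)
train.index.clean <- sample(n.clean, n.clean*0.6)
train.clean <- ToyotaCorolla.clean[train.index.clean, ]
valid.clean <- ToyotaCorolla.clean[-train.index.clean, ]
norm.clean <- preProcess(train.clean, method = "range")
train.clean.norm <- predict(norm.clean, train.clean)
valid.clean.norm <- predict(norm.clean, valid.clean)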
Airlines <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/EastWestAirlinesNN.csv") %>% na.omit()
Partition the data into a 60/40 split:
set.seed(100)
n = nrow(Airlines)
train.index = sample(n, n*0.6)
train.df = Airlines[train.index, ] # 2991 observations
valid.df = Airlines[-train.index, ] # 1994 observations
Preprocess (normalize) quantitative variables to a 0-1 scale:
norm.values = preProcess(train.df, method = "range")
train.norm = predict(norm.values, train.df)
valid.norm = predict(norm.values, valid.df)
summary(train.norm)
## ID. Topflight Balance Qual_miles
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.2509 1st Qu.:0.0000 1st Qu.:0.008171 1st Qu.:0.00000
## Median :0.4987 Median :0.0000 Median :0.020575 Median :0.00000
## Mean :0.5003 Mean :0.1618 Mean :0.036826 Mean :0.01255
## 3rd Qu.:0.7493 3rd Qu.:0.0000 3rd Qu.:0.045097 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.00000
## cc1_miles. cc2_miles. cc3_miles. Bonus_miles
## Min. :0.0000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.004157
## Median :1.0000 Median :0.00000 Median :0.000000 Median :0.023559
## Mean :0.5169 Mean :0.04146 Mean :0.004681 Mean :0.064052
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.090832
## Max. :1.0000 Max. :1.00000 Max. :1.000000 Max. :1.000000
## Bonus_trans Flight_miles_12mo Flight_trans_12 Online_12
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.03409 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.11364 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.12049 Mean :0.01429 Mean :0.02304 Mean :0.005167
## 3rd Qu.:0.18182 3rd Qu.:0.00649 3rd Qu.:0.01887 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
## Email Club_member Any_cc_miles_12mo Phone_sale
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00000 Median :1.0000 Median :0.0000
## Mean :0.6536 Mean :0.08893 Mean :0.5336 Mean :0.1264
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
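The network-fitting and lift-chart code is not reproduced in this excerpt. As a rough sketch, the first model (1 hidden layer, 5 nodes) might be fit to predict Phone_sale and its validation decile-wise lift chart drawn with the gains package roughly as follows (nn5, nn5.valid.pred, and gain5 are assumed names):
# Sketch only: fit a 5-node classification network for Phone_sale and plot a
# decile-wise lift chart of its validation-set predictions.
predictors <- setdiff(names(train.norm), c("ID.", "Phone_sale"))
nn5 <- neuralnet(reformulate(predictors, response = "Phone_sale"),
                 data = train.norm, hidden = 5, linear.output = FALSE)
nn5.valid.pred <- as.numeric(compute(nn5, valid.norm[, predictors])$net.result)
gain5 <- gains(valid.norm$Phone_sale, nn5.valid.pred, groups = 10)
barplot(gain5$mean.resp / mean(valid.norm$Phone_sale),
        names.arg = gain5$depth, xlab = "Percentile",
        ylab = "Mean Response Ratio",
        main = "Decile-wise Lift Chart (Validation Set)")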
From the decile-wise lift charts above, we can see that the first model (with 5 nodes) does a good job on the first three deciles of its training set. On the validation set, however, it does not predict Phone_sale well: targeting the first 20% could increase the percentage of sales, but beyond that the model struggles to predict phone sales accurately.
In contrast, the second model (with just one node) does not fit its training set as well, but the response ratio for its validation-set predictions stays above 1 through the first 60%. In the first model, this ratio drops below 1 by the third decile, indicating overfitting.
If this model were used in a real-world application, I would want to examine the correlations between the predictors and phone sales and perhaps reduce the number of predictors. Alternatively, fitting a neural network with 3 nodes, or with a couple of hidden layers, might improve the model's performance on unseen data.
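As a first step in that direction, the predictors could be ranked by their absolute correlation with Phone_sale in the training data, for example (a quick sketch; all columns here are numeric after preprocessing):
# Sketch only: rank predictors by absolute correlation with Phone_sale
phone.cor <- cor(train.norm[, setdiff(names(train.norm), "ID.")])[, "Phone_sale"]
sort(abs(phone.cor[names(phone.cor) != "Phone_sale"]), decreasing = TRUE)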