Consider the data on used cars (ToyotaCorolla.csv) with
1436 records and details on 38 attributes, including Price, Age, KM, HP,
and other specifications. The goal is to predict the price of a used
Toyota Corolla based on its specifications.
# Type your code here
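set.seed(54321)
# Load the data; assumes ToyotaCorolla.csv is in the working directory
cars <- read.csv("ToyotaCorolla.csv")
# Keep the outcome (Price) and the predictors named in the assignment
selected_columns <- c('Price', 'Age_08_04', 'KM', 'Fuel_Type', 'HP',
                      'Automatic', 'Doors', 'Quarterly_Tax', 'Mfr_Guarantee',
                      'Guarantee_Period', 'Airco', 'Automatic_airco',
                      'CD_Player', 'Powered_Windows', 'Sport_Model', 'Tow_Bar')
cars_selected <- cars[,selected_columns]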
names(cars)
## [1] "Id" "Model" "Price"
## [4] "Age_08_04" "Mfg_Month" "Mfg_Year"
## [7] "KM" "Fuel_Type" "HP"
## [10] "Met_Color" "Color" "Automatic"
## [13] "CC" "Doors" "Cylinders"
## [16] "Gears" "Quarterly_Tax" "Weight"
## [19] "Mfr_Guarantee" "BOVAG_Guarantee" "Guarantee_Period"
## [22] "ABS" "Airbag_1" "Airbag_2"
## [25] "Airco" "Automatic_airco" "Boardcomputer"
## [28] "CD_Player" "Central_Lock" "Powered_Windows"
## [31] "Power_Steering" "Radio" "Mistlamps"
## [34] "Sport_Model" "Backseat_Divider" "Metallic_Rim"
## [37] "Radio_cassette" "Parking_Assistant" "Tow_Bar"
target <- cars_selected$Price
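# Note: this assigns to cars_Selected (capital S), which is not used again.
# cars_selected therefore still contains Price, so cars_processed below ends up
# with two Price columns (consistent with the 19 columns reported by dim()).
cars_Selected <- cars_selected[,-which(names(cars_selected) == "Price")]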
cars_selected_no_cat <- cars_selected[,!names(cars_selected) %in% c('Fuel_Type', 'Automatic', 'Mfr_Guarantee','Guarantee_Period', 'Airco', 'Automatic_airco','CD_Player', 'Powered_Windows', 'Sport_Model', 'Tow_Bar')]
cars_dummies <- model.matrix(~ Fuel_Type + Automatic + Mfr_Guarantee + Guarantee_Period + Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar - 1, data = cars_selected)
cars_processed <- cbind(Price = target, cars_dummies, cars_selected_no_cat)
set.seed(54321)
train_size <-floor(0.7 * nrow(cars_processed))
train_data <- cars_processed[1:train_size,]
test_data <- cars_processed[(train_size + 1):nrow(cars_processed),]
dim(train_data)
## [1] 1005 19
dim(test_data)
## [1] 431 19
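The split above takes the first 70% of rows as training data, so the `set.seed()` calls have no effect on it. If a random partition is wanted instead, it could be sketched as follows. The helper name `split_train_test` and the toy data frame are illustrative assumptions, not part of the original analysis, and rerunning the models on such a split would give different RMSE values from those reported here.

```r
# Hypothetical helper: randomly split a data frame into training and test sets.
split_train_test <- function(df, train_frac = 0.7, seed = 54321) {
  set.seed(seed)                                   # makes the random draw reproducible
  n_train <- floor(train_frac * nrow(df))          # number of training rows
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Toy data standing in for cars_processed
toy <- data.frame(Price = seq(1000, 10000, by = 1000), KM = 10:1)
parts <- split_train_test(toy)
dim(parts$train)
dim(parts$test)
```

With the real data the call would be `split_train_test(cars_processed)`, assuming `cars_processed` has been built as above.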
Fit a neural network model to the data. Use a single hidden layer
with 2 nodes.
- Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors,
Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco,
CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.
- Record the RMSE for the training data and the test data.
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
nn_model <- neuralnet(Price ~ ., data = train_data, hidden = 2, linear.output = TRUE)
summary(nn_model)
## Length Class Mode
## call 5 -none- call
## response 1005 -none- numeric
## covariate 17085 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 19 data.frame list
## exclude 0 -none- NULL
## net.result 1 -none- list
## weights 1 -none- list
## generalized.weights 1 -none- list
## startweights 1 -none- list
## result.matrix 42 -none- numeric
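The task asks for the training and test RMSE of this 2-node model, but the output above shows only `summary(nn_model)`. A minimal sketch of how the RMSE could be computed is given below. The helper name `rmse` is an assumption introduced for illustration, and no RMSE values for the 2-node model are reported here because none appear in the original output.

```r
# Hypothetical helper: root mean squared error between actual and predicted values.
# as.numeric() flattens a one-column prediction matrix into a plain vector.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - as.numeric(predicted))^2))
}

# Small worked example with made-up numbers
example_rmse <- rmse(c(10, 20, 30), c(12, 18, 30))
print(example_rmse)

# With the fitted model from above (assumes nn_model, train_data, test_data exist):
# rmse(train_data$Price, predict(nn_model, train_data))
# rmse(test_data$Price,  predict(nn_model, test_data))
```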
Repeat the process, changing the number of hidden layers and nodes to
{single layer with 5 nodes}, {two layers, 5 nodes in each layer}
i. What happens to the RMS error for the training data as the number of
layers and nodes increases?
ii. What happens to the RMS error for the validation data?
iii. Comment on the appropriate number of layers and nodes for this
application.
# Type your code here
nn_model_single_layer <- neuralnet(Price ~ ., data = train_data, hidden = 5, linear.output = TRUE)
train_predictions_single_layer <- predict(nn_model_single_layer, train_data)
test_predictions_single_layer <- predict(nn_model_single_layer, test_data)
train_rmse_single_layer <- sqrt(mean((train_data$Price - train_predictions_single_layer)^2))
test_rmse_single_layer <- sqrt(mean((test_data$Price - test_predictions_single_layer)^2))
cat("Single Hidden Layer (5 nodes) - Training RMSE:", train_rmse_single_layer, "\n")
## Single Hidden Layer (5 nodes) - Training RMSE: 3743.216
cat("Single Hidden Layer (5 nodes) - Testing RMSE:", test_rmse_single_layer, "\n")
## Single Hidden Layer (5 nodes) - Testing RMSE: 3904.122
nn_model_two_layers <- neuralnet(Price ~ ., data = train_data, hidden = c(5, 5), linear.output = TRUE)
train_predictions_two_layers <- predict(nn_model_two_layers, train_data)
test_predictions_two_layers <- predict(nn_model_two_layers, test_data)
train_rmse_two_layers <- sqrt(mean((train_data$Price - train_predictions_two_layers)^2))
test_rmse_two_layers <- sqrt(mean((test_data$Price - test_predictions_two_layers)^2))
cat("Two Hidden Layer (5 nodes each) - Training RMSE:", train_rmse_two_layers, "\n")
## Two Hidden Layer (5 nodes each) - Training RMSE: 3743.216
cat("Two Hidden Layer (5 nodes each) - Testing RMSE:", test_rmse_two_layers, "\n")
## Two Hidden Layer (5 nodes each) - Testing RMSE: 3908.356
As the number of layers and nodes increases, the neural network becomes more complex, which can lower the training RMSE because the model has more capacity to fit the training data. In these runs that did not happen: the training RMSE was 3743.216 both for the single layer with 5 nodes and for the two layers with 5 nodes each. A model that becomes too complex may also start overfitting, fitting the training data “too well” and failing to generalize to the test set.
For the validation data, if the model is overfitted, with too many layers and nodes, the test RMSE may increase, which means the model generalizes poorly. If the network is not complex enough, the test RMSE stays high because the model underfits and does not capture the patterns in the data. Adding layers and nodes therefore reduces the test RMSE only up to a point; beyond it, further layers and nodes offset the benefit and the test RMSE rises. Comparing the two models here, the test RMSE went from 3904.122 (single layer, 5 nodes) to 3908.356 (two layers, 5 nodes each), a small increase.
Of the configurations with a recorded RMSE, the appropriate choice for this application is a single layer with 5 nodes. It has the lower test RMSE (3904.122 versus 3908.356) and the same training RMSE as the two-layer network, so the extra layer adds complexity without improving the fit. The common theme throughout these responses is that we are looking for a sweet spot: enough layers and nodes to capture the patterns in the data, but not so many that the model overfits. We don’t want a model that overgeneralizes, but we also don’t want one that is overcomplicated. For every data set we need to find this balance by weighing the trade-off between training RMSE and test RMSE.
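One thing worth checking, given that the training RMSE is identical (3743.216) for both architectures, is the scale of the inputs. The code above passes raw values to `neuralnet`, so variables such as KM sit alongside 0/1 dummies, and neural networks generally train more reliably when predictors are on a common scale. Whether scaling explains the results here is an assumption that has not been tested. A minimal min-max scaling sketch, using a hypothetical helper `min_max_scale` and toy data, is:

```r
# Hypothetical helper: rescale each column of df to [0, 1] using the
# minimum and maximum taken from a reference data frame (the training set),
# so that test data are scaled with the training ranges.
min_max_scale <- function(df, ref = df) {
  out <- df
  for (col in names(df)) {
    lo <- min(ref[[col]])
    hi <- max(ref[[col]])
    rng <- hi - lo
    # Constant columns are set to 0 to avoid division by zero
    out[[col]] <- if (rng == 0) 0 else (df[[col]] - lo) / rng
  }
  out
}

# Toy data standing in for numeric predictors
toy_train <- data.frame(KM = c(0, 50000, 100000), HP = c(70, 90, 110))
toy_test  <- data.frame(KM = 25000, HP = 110)

scaled_train <- min_max_scale(toy_train)
scaled_test  <- min_max_scale(toy_test, ref = toy_train)
print(scaled_train)
print(scaled_test)
```

If Price is scaled as well, predictions would need to be transformed back to the original units before computing RMSE so that the values remain comparable with those reported above.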