Homework: Predicting Price of Used Car Sales

Consider the data on used cars (ToyotaCorolla.csv) with 1436 records and details on 38 attributes, including Price, Age, KM, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.

# Import ToyotaCorolla.csv
cars <- read.csv("ToyotaCorolla.csv")

Exercise 1

  • Preprocess data appropriately for neural network modelling.
    • Select the following variables: Price, Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.
    • Remember to first scale the numerical predictor and outcome variables to a 0–1 scale.
    • Convert categorical predictors to dummies.
  • Use 70% of the data for training and 30% for test.
# Type your code here

set.seed(54321)

selected_columns <- c('Price', 'Age_08_04', 'KM', 'Fuel_Type', 'HP','Automatic','Doors','Quarterly_Tax','Mfr_Guarantee', 'Guarantee_Period','Airco','Automatic_airco','CD_Player','Powered_Windows','Sport_Model','Tow_Bar')

cars_selected <- cars[,selected_columns]

names(cars)
##  [1] "Id"                "Model"             "Price"            
##  [4] "Age_08_04"         "Mfg_Month"         "Mfg_Year"         
##  [7] "KM"                "Fuel_Type"         "HP"               
## [10] "Met_Color"         "Color"             "Automatic"        
## [13] "CC"                "Doors"             "Cylinders"        
## [16] "Gears"             "Quarterly_Tax"     "Weight"           
## [19] "Mfr_Guarantee"     "BOVAG_Guarantee"   "Guarantee_Period" 
## [22] "ABS"               "Airbag_1"          "Airbag_2"         
## [25] "Airco"             "Automatic_airco"   "Boardcomputer"    
## [28] "CD_Player"         "Central_Lock"      "Powered_Windows"  
## [31] "Power_Steering"    "Radio"             "Mistlamps"        
## [34] "Sport_Model"       "Backseat_Divider"  "Metallic_Rim"     
## [37] "Radio_cassette"    "Parking_Assistant" "Tow_Bar"
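target <- cars_selected$Price
# NOTE: this assigns to `cars_Selected` (capital S), which is never used again, so
# `cars_selected` still contains Price and the column appears twice in `cars_processed`.
cars_Selected <- cars_selected[,-which(names(cars_selected) == "Price")]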

cars_selected_no_cat <- cars_selected[,!names(cars_selected) %in% c('Fuel_Type', 'Automatic', 'Mfr_Guarantee','Guarantee_Period', 'Airco', 'Automatic_airco','CD_Player', 'Powered_Windows', 'Sport_Model', 'Tow_Bar')]

cars_dummies <- model.matrix(~ Fuel_Type + Automatic + Mfr_Guarantee + Guarantee_Period + Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar - 1, data = cars_selected)

cars_processed <- cbind(Price = target, cars_dummies, cars_selected_no_cat)

set.seed(54321)

train_size <- floor(0.7 * nrow(cars_processed))

# Sequential split: the first 70% of rows form the training set, the rest the test set
train_data <- cars_processed[1:train_size,]

test_data <- cars_processed[(train_size + 1):nrow(cars_processed),]

dim(train_data)
## [1] 1005   19
dim(test_data)
## [1] 431  19

Exercise 2

Fit a neural network model to the data. Use a single hidden layer with 2 nodes.
- Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.
- Record the RMSE for the training data and the test data.

# Type your code here
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
nn_model <- neuralnet(Price ~ ., data = train_data, hidden = 2, linear.output = TRUE)

summary(nn_model)
##                     Length Class      Mode    
## call                    5  -none-     call    
## response             1005  -none-     numeric 
## covariate           17085  -none-     numeric 
## model.list              2  -none-     list    
## err.fct                 1  -none-     function
## act.fct                 1  -none-     function
## linear.output           1  -none-     logical 
## data                   19  data.frame list    
## exclude                 0  -none-     NULL    
## net.result              1  -none-     list    
## weights                 1  -none-     list    
## generalized.weights     1  -none-     list    
## startweights            1  -none-     list    
## result.matrix          42  -none-     numeric
train_predictions <- predict(nn_model, train_data)

test_predictions <- predict(nn_model, test_data)

train_rmse <- sqrt(mean((train_data$Price - train_predictions)^2))

test_rmse <- sqrt(mean((test_data$Price - test_predictions)^2))

Exercise 3

Repeat the process, changing the number of hidden layers and nodes to {single layer with 5 nodes}, {two layers, 5 nodes in each layer}
i. What happens to the RMS error for the training data as the number of layers and nodes increases?
ii. What happens to the RMS error for the validation data?
iii. Comment on the appropriate number of layers and nodes for this application.

# Type your code here

nn_model_single_layer <- neuralnet(Price ~ ., data = train_data, hidden = 5, linear.output = TRUE)

train_predictions_single_layer <- predict(nn_model_single_layer, train_data)

test_predictions_single_layer <- predict(nn_model_single_layer, test_data)

train_rmse_single_layer <- sqrt(mean((train_data$Price - train_predictions_single_layer)^2)) 

test_rmse_single_layer <- sqrt(mean((test_data$Price - test_predictions_single_layer)^2))

cat("Single Hidden Layer (5 nodes) - Training RMSE:", train_rmse_single_layer, "\n")
## Single Hidden Layer (5 nodes) - Training RMSE: 3743.216
cat("Single Hidden Layer (5 nodes) - Testing RMSE:", test_rmse_single_layer, "\n")
## Single Hidden Layer (5 nodes) - Testing RMSE: 3904.122
nn_model_two_layers <- neuralnet(Price ~ ., data = train_data, hidden = c(5, 5), linear.output = TRUE)

train_predictions_two_layers <- predict(nn_model_two_layers, train_data)

test_predictions_two_layers <- predict(nn_model_two_layers, test_data)

train_rmse_two_layers <- sqrt(mean((train_data$Price - train_predictions_two_layers)^2))

test_rmse_two_layers <- sqrt(mean((test_data$Price - test_predictions_two_layers)^2))

cat("Two Hidden Layers (5 nodes each) - Training RMSE:", train_rmse_two_layers, "\n")
## Two Hidden Layers (5 nodes each) - Training RMSE: 3743.216
cat("Two Hidden Layers (5 nodes each) - Testing RMSE:", test_rmse_two_layers, "\n")
## Two Hidden Layers (5 nodes each) - Testing RMSE: 3908.356

Summary

  1. As the number of layers and nodes increases, the neural network becomes more complex, which can lower the training RMSE. In this run the training RMSE was 3743.216 for both the single layer with 5 nodes and the two layers with 5 nodes each, so the extra layer did not improve the training fit. When a model becomes too complex it may start overfitting the training data, meaning it fits the training set “too well” and does not generalize well to the test set.

  2. If the model is overfitted, with too many layers and nodes, the validation (test) RMSE may increase, which indicates poor generalization. If the network is not complex enough, the test RMSE stays high because the model underfits and does not capture the patterns in the data. Comparing the two models, the test RMSE rose slightly from 3904.122 (single layer, 5 nodes) to 3908.356 (two layers, 5 nodes each). Adding layers and nodes reduces the test RMSE only up to a point; beyond it, further layers and nodes outweigh that benefit and the test RMSE increases.

  3. The appropriate configuration for this application is a single layer with 5 nodes: it matches the training RMSE of the two-layer network and has the slightly lower test RMSE. The common theme in these responses is that we are looking for a sweet spot: enough layers and nodes to capture the patterns in the data, but not so many that the model overfits. We want neither an oversimplified model nor an overcomplicated one. For every data set we therefore need to compare several configurations and weigh the trade-off between training RMSE and test RMSE.

---
title: "ECON 3200: Homework 5"
author: "Jaden Sampson"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE, include=FALSE}
knitr::opts_chunk$set(eval = TRUE)
library(tidyverse)
library(openintro)
```

# Homework: Predicting Price of Used Car Sales

Consider the data on used cars (`ToyotaCorolla.csv`) with 1436 records and details on 38 attributes, including Price, Age, KM, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.

```{r}
# Import ToyotaCorolla.csv
cars <- read.csv("ToyotaCorolla.csv")
```

## Exercise 1
- Preprocess data appropriately for neural network modelling.  
  - Select the following variables: Price, Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.  
  - Remember to first scale the numerical predictor and outcome variables to a 0–1 scale.  
  - Convert categorical predictors to dummies.    
- Use 70% of the data for training and 30% for test.  
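
Converting a categorical predictor to dummies can be sketched with base R's `model.matrix()`. The small data frame `fuel_toy` below is made up for illustration and is not part of the homework data; only the column name `Fuel_Type` and its three levels are taken from the data set.

```r
# Hypothetical toy column with the three fuel types that occur in the data.
fuel_toy <- data.frame(Fuel_Type = c("Diesel", "Petrol", "CNG", "Petrol"))

# "- 1" drops the intercept, so every level gets its own 0/1 column.
dummies <- model.matrix(~ Fuel_Type - 1, data = fuel_toy)

colnames(dummies)   # "Fuel_TypeCNG" "Fuel_TypeDiesel" "Fuel_TypePetrol"
rowSums(dummies)    # exactly one dummy is switched on per row
```

Predictors that are already coded 0/1 (for example `Automatic` or `Airco`) pass through `model.matrix()` unchanged as single columns.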


```{r}
# Type your code here

set.seed(54321)

selected_columns <- c('Price', 'Age_08_04', 'KM', 'Fuel_Type', 'HP','Automatic','Doors','Quarterly_Tax','Mfr_Guarantee', 'Guarantee_Period','Airco','Automatic_airco','CD_Player','Powered_Windows','Sport_Model','Tow_Bar')

cars_selected <- cars[,selected_columns]

names(cars)

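target <- cars_selected$Price
# NOTE: this assigns to `cars_Selected` (capital S), which is never used again, so
# `cars_selected` still contains Price and the column appears twice in `cars_processed`.
cars_Selected <- cars_selected[,-which(names(cars_selected) == "Price")]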

cars_selected_no_cat <- cars_selected[,!names(cars_selected) %in% c('Fuel_Type', 'Automatic', 'Mfr_Guarantee','Guarantee_Period', 'Airco', 'Automatic_airco','CD_Player', 'Powered_Windows', 'Sport_Model', 'Tow_Bar')]

cars_dummies <- model.matrix(~ Fuel_Type + Automatic + Mfr_Guarantee + Guarantee_Period + Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar - 1, data = cars_selected)

cars_processed <- cbind(Price = target, cars_dummies, cars_selected_no_cat)

set.seed(54321)

train_size <- floor(0.7 * nrow(cars_processed))

# Sequential split: the first 70% of rows form the training set, the rest the test set
train_data <- cars_processed[1:train_size,]

test_data <- cars_processed[(train_size + 1):nrow(cars_processed),]

dim(train_data)
dim(test_data)

```
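
Exercise 1 asks for the numeric variables to be scaled to 0–1 and for a 70/30 split, while the chunk above keeps the raw values and takes the first 70% of rows in file order. One possible way to do both is sketched below on a small made-up data frame. The names `toy`, `rescale01`, and `train_idx` are illustrative assumptions, not part of the homework code, and the numbers are invented.

```r
# Hypothetical toy data standing in for a few numeric columns (values invented).
toy <- data.frame(
  Price = c(13500, 13750, 13950, 14950, 13750, 12950, 16900, 18600, 21500, 12950),
  KM    = c(46986, 72937, 41711, 48000, 38500, 61000, 94612, 75889, 19700, 71138),
  HP    = c(90, 90, 90, 90, 90, 90, 90, 90, 192, 69)
)

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

toy_scaled <- as.data.frame(lapply(toy, rescale01))

# Random 70/30 split: sample row indices instead of taking the first 70% of rows.
set.seed(54321)
n         <- nrow(toy_scaled)
train_idx <- sample(n, size = floor(0.7 * n))
toy_train <- toy_scaled[train_idx, ]
toy_test  <- toy_scaled[-train_idx, ]

sapply(toy_scaled, range)   # each column now spans 0 to 1
dim(toy_train)              # 7 rows
dim(toy_test)               # 3 rows
```

With the real data, the same `rescale01` could be applied to Price, Age_08_04, KM, HP, Doors, and Quarterly_Tax before the dummy columns are bound on. Because the split uses `sample()`, the earlier `set.seed()` call then actually affects which rows land in each set.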

## Exercise 2
Fit a neural network model to the data. Use a single hidden layer with 2 nodes.  
- Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.  
- Record the RMSE for the training data and the test data.  

```{r}
# Type your code here
library(neuralnet)

nn_model <- neuralnet(Price ~ ., data = train_data, hidden = 2, linear.output = TRUE)

summary(nn_model)

train_predictions <- predict(nn_model, train_data)

test_predictions <- predict(nn_model, test_data)

train_rmse <- sqrt(mean((train_data$Price - train_predictions)^2))

test_rmse <- sqrt(mean((test_data$Price - test_predictions)^2))
```
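
The RMSE computed above is the square root of the mean squared difference between observed and predicted prices. A small helper makes the calculation reusable, and if Price has been scaled to 0–1, predictions can be mapped back to the original units before the error is computed. The function names `rmse` and `unscale` and all numbers below are illustrative assumptions.

```r
# Root mean squared error between observed and predicted values.
# as.numeric() flattens the one-column matrix that predict() can return.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - as.numeric(predicted))^2))
}

# Undo min-max scaling given the original minimum (lo) and maximum (hi).
unscale <- function(x01, lo, hi) x01 * (hi - lo) + lo

# Invented example: prices in original units and predictions on the 0-1 scale.
actual_price <- c(10000, 15000, 20000)
pred_scaled  <- c(0.1, 0.5, 0.8)   # hypothetical model output
pred_price   <- unscale(pred_scaled, lo = 10000, hi = 20000)

pred_price                      # 11000 15000 18000
rmse(actual_price, pred_price)  # sqrt((1000^2 + 0^2 + 2000^2) / 3), about 1290.99
```

Reporting the error in original units keeps it comparable across models whether or not the outcome was rescaled for training.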

## Exercise 3
Repeat the process, changing the number of hidden layers and nodes to {single layer with 5 nodes}, {two layers, 5 nodes in each layer}  
	i. What happens to the RMS error for the training data as the number of layers and nodes increases?  
	ii. What happens to the RMS error for the validation data?  
	iii. Comment on the appropriate number of layers and nodes for this application.  

```{r}
# Type your code here

nn_model_single_layer <- neuralnet(Price ~ ., data = train_data, hidden = 5, linear.output = TRUE)

train_predictions_single_layer <- predict(nn_model_single_layer, train_data)

test_predictions_single_layer <- predict(nn_model_single_layer, test_data)

train_rmse_single_layer <- sqrt(mean((train_data$Price - train_predictions_single_layer)^2)) 

test_rmse_single_layer <- sqrt(mean((test_data$Price - test_predictions_single_layer)^2))

cat("Single Hidden Layer (5 nodes) - Training RMSE:", train_rmse_single_layer, "\n")

cat("Single Hidden Layer (5 nodes) - Testing RMSE:", test_rmse_single_layer, "\n")

nn_model_two_layers <- neuralnet(Price ~ ., data = train_data, hidden = c(5, 5), linear.output = TRUE)

train_predictions_two_layers <- predict(nn_model_two_layers, train_data)

test_predictions_two_layers <- predict(nn_model_two_layers, test_data)

train_rmse_two_layers <- sqrt(mean((train_data$Price - train_predictions_two_layers)^2))

test_rmse_two_layers <- sqrt(mean((test_data$Price - test_predictions_two_layers)^2))

cat("Two Hidden Layers (5 nodes each) - Training RMSE:", train_rmse_two_layers, "\n")

cat("Two Hidden Layers (5 nodes each) - Testing RMSE:", test_rmse_two_layers, "\n")



```

### Summary 

i. As the number of layers and nodes increases, the neural network becomes more complex, which can lower the training
RMSE. In this run the training RMSE was 3743.216 for both the single layer with 5 nodes and the two layers with 5 nodes
each, so the extra layer did not improve the training fit. When a model becomes too complex it may start overfitting the
training data, meaning it fits the training set "too well" and does not generalize well to the test set.

ii. If the model is overfitted, with too many layers and nodes, the validation (test) RMSE may increase, which indicates
poor generalization. If the network is not complex enough, the test RMSE stays high because the model underfits and does
not capture the patterns in the data. Comparing the two models, the test RMSE rose slightly from 3904.122 (single layer,
5 nodes) to 3908.356 (two layers, 5 nodes each). Adding layers and nodes reduces the test RMSE only up to a point; beyond
it, further layers and nodes outweigh that benefit and the test RMSE increases.

iii. The appropriate configuration for this application is a single layer with 5 nodes: it matches the training RMSE of
the two-layer network and has the slightly lower test RMSE. The common theme in these responses is that we are looking
for a sweet spot: enough layers and nodes to capture the patterns in the data, but not so many that the model overfits.
We want neither an oversimplified model nor an overcomplicated one. For every data set we therefore need to compare
several configurations and weigh the trade-off between training RMSE and test RMSE.
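
To make the comparison concrete, the RMSE values printed in Exercise 3 can be collected in a small table together with the gap between test and training error. The numbers are copied from the knitted output of this run; the data frame name `rmse_table` and the `gap` column are illustrative.

```r
# RMSE values as reported in the knitted output for this run.
rmse_table <- data.frame(
  model      = c("1 layer, 5 nodes", "2 layers, 5 nodes each"),
  train_rmse = c(3743.216, 3743.216),
  test_rmse  = c(3904.122, 3908.356),
  stringsAsFactors = FALSE
)

# Gap between test and training error; a widening gap is one sign of overfitting.
rmse_table$gap <- rmse_table$test_rmse - rmse_table$train_rmse

rmse_table
rmse_table$model[which.min(rmse_table$test_rmse)]   # "1 layer, 5 nodes"
```

The gap grows only slightly with the second layer, which is consistent with choosing the simpler network.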
