The aim of this project is to predict a car's market price from its characteristics, including body style, engine type and horsepower, using the multivariate k-nearest neighbours algorithm.
#install caret package
install.packages('caret')
#load libraries (tidyverse loads readr, stringr, ggplot2, dplyr, purrr and tidyr)
library(tidyverse)
library(broom)
library(caret)
#load raw data
setwd("C:/Users/Ana/Desktop/Data Analytics/CSV Files")
raw_data <- read_csv("cars.csv")
Parsed with column specification:
cols(
.default = col_character(),
  symboling = col_double(),
  wheel_base = col_double(),
  length = col_double(),
  width = col_double(),
  height = col_double(),
  curb_weight = col_double(),
  engine_size = col_double(),
  compression_ratio = col_double(),
  city_mpg = col_double(),
  highway_mpg = col_double()
)
See spec(...) for full column specifications.
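On future runs, this parsing message can be silenced by passing the column specification explicitly (a sketch mirroring the spec printed above, with the remaining columns left as character):
raw_data <- read_csv("cars.csv",
                     col_types = cols(.default = col_character(),
                                      symboling = col_double(),
                                      wheel_base = col_double(),
                                      length = col_double(),
                                      width = col_double(),
                                      height = col_double(),
                                      curb_weight = col_double(),
                                      engine_size = col_double(),
                                      compression_ratio = col_double(),
                                      city_mpg = col_double(),
                                      highway_mpg = col_double()))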
#view top rows of dataframe
head(raw_data)
#remove the categorical (non-numeric) columns and convert the num_doors data to numeric
data_1 <- raw_data %>%
  select(-engine_location, -engine_type, -fuel_system, -make, -fuel_type, -aspiration, -body_style, -drive_wheels) %>%
  mutate(num_doors = if_else(num_doors == "two", 2, 4))
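Note that if_else() maps every value that is not "two" (including any missing-value markers) to 4, so it is worth confirming the raw values first (a quick sketch):
table(raw_data$num_doors, useNA = "ifany")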
#return number of rows in dataframe
nrow(data_1)
[1] 205
#write a function to convert written numbers to numeric (exact match avoids accidental partial-pattern matches)
v1 <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve")
v2 <- 1:12
lookup <- data.frame(v1, v2)
lookup
convert_to_number <- function(x) {
  lookup$v2[lookup$v1 == x]
}
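A quick sanity check of the lookup:
convert_to_number("four")   #should return 4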
#change the column data types to numeric ("?" entries are coerced to NA)
data_2 <- data_1 %>%
  mutate(num_cylinders = map_dbl(num_cylinders, convert_to_number),
         bore = as.numeric(bore),
         stroke = as.numeric(stroke),
         horsepower = as.numeric(horsepower),
         price = as.numeric(price),
         peak_rpm = as.numeric(peak_rpm))
Warning messages: NAs introduced by coercion (one warning per coerced column)
#count NA values in each column
no_nas <- as.data.frame(colSums(is.na(data_2)))
no_nas
#drop any rows containing NA values
data_2 <- data_2 %>% drop_na()
head(data_2, 20)
#count "?" entries in the normalized_losses column
sum(str_detect(data_2$normalized_losses, "\\?"))
[1] 35
There are 35 '?' entries in the normalized_losses column. As the dataset now contains only around 200 rows in total, removing these rows would significantly reduce its size. Therefore, the whole column is removed instead.
#remove the normalized_losses column
data_3 <- data_2 %>%
select(-normalized_losses)
head(data_3)
Now that the data is cleaned and only the numeric columns have been kept, we can move on to see which of these variables are correlated with price. This cleaned dataset will be renamed cars.
cars <- data_3
#plot each variable against price to identify variables correlated with price
featurePlot(x = select(cars, -price), y = cars$price)
The plots suggest the following groupings:
- strong positive correlation with price: horsepower, curb_weight, num_cylinders, engine_size
- weak positive correlation with price: bore, wheel_base, length, width
- strong negative correlation with price: city_mpg, highway_mpg
- little to no correlation with price: peak_rpm, stroke, compression_ratio, symboling, num_doors, height
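These visual judgements can be cross-checked numerically, since every remaining column is numeric (a quick sketch using base R's cor()):
#correlation of each variable with price, strongest first
sort(round(cor(cars)[, "price"], 2), decreasing = TRUE)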
Next, visualise the distribution of price.
#plot boxplot of price
ggplot(data = cars,
aes(x = price))+
geom_boxplot()
The boxplot of price shows that most of the vehicles are priced between $7,500 and $16,000. There are some outliers on the boxplot, but none of these suggests an erroneous entry: the most expensive vehicle is under $50,000, which is not an unusual price for a vehicle.
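The quartiles behind the boxplot can be confirmed numerically:
summary(cars$price)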
#split the dataset into training and test sets
set.seed(1)
train_indices <- createDataPartition(y = cars[["price"]],
p = 0.8,
list = FALSE)
train_listings <- cars[train_indices,]
test_listings <- cars[-train_indices,]
#check the number of rows in the training and test sets
nrow(train_listings)
[1] 159
nrow(test_listings)
[1] 36
#create grid of k values 1 to 20
knn_grid <- expand.grid(k = 1:20)
knn_grid
#set 5-fold cross validation (i.e. k-fold with 5 folds)
train_control <- trainControl(method = "cv", number = 5)
#run a knn model on centred and scaled data, including all the variables which appear correlated with price
model <- train(price ~ horsepower + curb_weight + engine_size + num_cylinders + city_mpg + highway_mpg + bore + wheel_base + length + width,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
model
k-Nearest Neighbors
159 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 128, 127, 127, 127, 127
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 2974.391 0.8874459 1888.892
2 2635.996 0.9100430 1838.104
3 2986.687 0.8926193 1909.449
4 3286.810 0.8670844 2081.739
5 3455.385 0.8630038 2166.483
6 3577.612 0.8522202 2202.450
7 3761.488 0.8357604 2272.914
8 3859.429 0.8166946 2320.885
9 3939.228 0.8140066 2359.872
10 3876.878 0.8261265 2350.108
11 3921.719 0.8207215 2354.333
12 3981.244 0.8206515 2408.287
13 4035.622 0.8211317 2468.332
14 4113.682 0.8179072 2496.333
15 4189.996 0.8150818 2536.305
16 4263.751 0.8061355 2587.911
17 4303.379 0.8044902 2604.410
18 4355.183 0.7999812 2601.270
19 4405.379 0.7933899 2608.540
20 4453.594 0.7925094 2633.513
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 2.
plot(model)
model$results$RMSE
[1] 2974.391 2635.996 2986.687 3286.810 3455.385 3577.612 3761.488 3859.429 3939.228 3876.878 3921.719 3981.244
[13] 4035.622 4113.682 4189.996 4263.751 4303.379 4355.183 4405.379 4453.594
This graph shows that, with these variables included, the optimal number of nearest neighbours is k = 2, giving a cross-validated RMSE of $2,636.
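Rather than reading the optimum off the plot, the chosen k and its cross-validated RMSE can be extracted from the train object directly (caret stores these in the bestTune and results components):
model$bestTune$k
min(model$results$RMSE)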
It is difficult to say how well this model works without comparing it to alternatives, i.e. models that use different variables and different numbers of variables. Therefore, models based on individual variables are examined first, followed by models based on combinations of variables.
#create a knn model for each variable shown to be correlated with price, to identify which are good individual predictors of price
#knn model price ~ horsepower
model_1 <- train(price ~ horsepower,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
horsepower_rmse <- model_1$results$RMSE
#knn model price ~ curb_weight
model_2 <- train(price ~ curb_weight,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
curb_weight_rmse <- model_2$results$RMSE
#knn model price ~ engine_size
model_3 <- train(price ~ engine_size,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
engine_size_rmse <- model_3$results$RMSE
#knn model price ~ num_cylinders
model_4 <- train(price ~ num_cylinders,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
num_cylinders_rmse <- model_4$results$RMSE
#knn model price ~ city_mpg
model_5 <- train(price ~ city_mpg,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
city_mpg_rmse <- model_5$results$RMSE
#knn model price ~ highway_mpg
model_6 <- train(price ~ highway_mpg,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
highway_mpg_rmse <- model_6$results$RMSE
#knn model price ~ bore
model_7 <- train(price ~ bore,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
bore_rmse <- model_7$results$RMSE
#knn model price ~ wheel_base
model_8 <- train(price ~ wheel_base,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
wheel_base_rmse <- model_8$results$RMSE
#knn model price ~ length
model_9 <- train(price ~ length,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
length_rmse <- model_9$results$RMSE
#knn model price ~ width
model_10 <- train(price ~ width,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
width_rmse <- model_10$results$RMSE
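The ten single-variable fits above are identical apart from the formula, so they could equally be generated in a loop (a sketch using purrr; fit_knn_rmse and rmse_by_var are illustrative names not used elsewhere):
#fit a single-variable knn model and return its RMSE across k = 1 to 20
fit_knn_rmse <- function(var) {
  m <- train(as.formula(paste("price ~", var)),
             data = train_listings,
             method = "knn",
             trControl = train_control,
             preProcess = c("center", "scale"),
             tuneGrid = knn_grid)
  m$results$RMSE
}
vars <- c("horsepower", "curb_weight", "engine_size", "num_cylinders", "city_mpg", "highway_mpg", "bore", "wheel_base", "length", "width")
rmse_by_var <- map(set_names(vars), fit_knn_rmse)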
#create a dataframe with all the RMSE from each variable from k = 1 to 20.
rmse_all <- data.frame(k = 1:20, horsepower_rmse, curb_weight_rmse, engine_size_rmse, num_cylinders_rmse, city_mpg_rmse, highway_mpg_rmse, bore_rmse, wheel_base_rmse, length_rmse, width_rmse)
rmse_all
#in order to plot this, use pivot_longer
rmse_all_longer <- rmse_all %>%
pivot_longer(cols = horsepower_rmse:width_rmse, names_to = "variable", values_to = "RMSE")
#plot the RMSE of each single-variable knn model against k
ggplot(data = rmse_all_longer,
aes(x = k, y = RMSE, color = variable)) +
geom_line(lwd = 1) +
labs(title = "RMSE vs k (no. nearest neighbours) for each variable", x = "k", y = "RMSE") +
theme(panel.background = element_rect(fill = "white"))
A k-nearest neighbours model based on engine_size alone provides the lowest RMSE. The next best performing variables are city_mpg, then horsepower, highway_mpg and width.
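This ordering can be confirmed numerically from the rmse_all_longer frame built above:
rmse_all_longer %>%
  group_by(variable) %>%
  summarise(min_rmse = min(RMSE)) %>%
  arrange(min_rmse)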
It is possible that the best model is one just based on engine_size. However, it is also possible that a model with combinations of variables performs better.
Next, the following knn models will be run and their RMSE values compared, to identify whether a single variable or a combination of variables best predicts price.
#knn model price ~ engine_size + city_mpg
model_C1 <- train(price ~ engine_size + city_mpg,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
comb1 <- model_C1$results$RMSE
#knn model price ~ engine_size + city_mpg + horsepower
model_C2 <- train(price ~ engine_size + city_mpg + horsepower,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
comb2 <- model_C2$results$RMSE
#knn model price ~ engine_size + city_mpg + horsepower + highway_mpg
model_C3 <- train(price ~ engine_size + city_mpg + horsepower + highway_mpg,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
comb3 <- model_C3$results$RMSE
#knn model price ~ engine_size + city_mpg + horsepower + highway_mpg + width
model_C4 <- train(price ~ engine_size + city_mpg + horsepower + highway_mpg + width,
data = train_listings,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = knn_grid)
comb4 <- model_C4$results$RMSE
#create a dataframe with the RMSE from each combination for k = 1 to 20
comb_rmse_all <- data.frame(k = 1:20, comb1, comb2, comb3, comb4)
comb_rmse_all
#in order to plot this, use pivot_longer
comb_rmse_all_longer <- comb_rmse_all %>%
pivot_longer(cols = comb1:comb4, names_to = "variable", values_to = "RMSE")
#plot the RMSE of each combination knn model against k
ggplot(data = comb_rmse_all_longer,
aes(x = k, y = RMSE, color = variable)) +
geom_line(lwd = 1)+
labs(title = "RMSE vs k (no. nearest neighbours) for each variable combination", x = "k", y = "RMSE") +
theme(panel.background = element_rect(fill = "white"))
From this plot, it can be seen that the combination of five variables (engine_size + city_mpg + horsepower + highway_mpg + width) with k (number of nearest neighbours) = 2 provides the lowest RMSE.
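The winning combination and value of k can also be pulled out of the results directly:
comb_rmse_all_longer %>%
  filter(RMSE == min(RMSE))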
#now use the test data set in the model and obtain the RMSE of the test set.
predictions <- predict(model_C4, newdata = test_listings)
postResample(pred = predictions, obs = test_listings$price)
RMSE Rsquared MAE
2751.3648044 0.8132928 1783.8018519
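These metrics can be complemented by plotting predicted against observed prices for the test set (a sketch; points near the dashed identity line indicate accurate predictions):
ggplot(data = data.frame(observed = test_listings$price, predicted = predictions),
       aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Predicted vs observed price (test set)", x = "observed price", y = "predicted price") +
  theme(panel.background = element_rect(fill = "white"))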
The RMSE for the test set is $2,751, which is in line with the cross-validated RMSE from the training set, so the model does not appear to be overfitting. However, this still seems like quite a high error, especially for a vehicle which may cost, say, $7,000. There may be factors with greater influence over the price which either were not included in this dataset, e.g. the model year, or were not of numeric type, e.g. the car make and model.