This project was inspired by DataQuest and explores the Automobile dataset from UC Irvine, which consists of car attributes from 1985.
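The analysis uses the tidyverse and caret packages and assumes the raw UCI file has been read into a data frame called cars. A minimal setup sketch, assuming the standard UCI file imports-85.data (adjust the path or URL to wherever your copy lives):
# Load the packages used throughout (dplyr, purrr and ggplot2 come with tidyverse)
library(tidyverse)
library(caret)
# Read the raw file: it has no header row, and "?" marks missing values
cars <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
  header = FALSE,
  stringsAsFactors = FALSE
)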
# The column names are messy. Let's rename them
colnames(cars) <- c(
"symboling",
"normalized_losses",
"make",
"fuel_type",
"aspiration",
"num_doors",
"body_style",
"drive_wheels",
"engine_location",
"wheel_base",
"length",
"width",
"height",
"curb_weight",
"engine_type",
"num_cylinders",
"engine_size",
"fuel_system",
"bore",
"stroke",
"compression_ratio",
"horsepower",
"peak_rpm",
"city_mpg",
"highway_mpg",
"price"
)
# Remove non-numerical columns and missing data
cars <- cars %>%
select(
symboling, wheel_base, length, width, height, curb_weight,
engine_size, bore, stroke, compression_ratio, horsepower,
peak_rpm, city_mpg, highway_mpg, price
) %>%
filter(
stroke != "?",
bore != "?",
horsepower != "?",
peak_rpm != "?",
price != "?"
) %>%
mutate(
stroke = as.numeric(stroke),
bore = as.numeric(bore),
horsepower = as.numeric(horsepower),
peak_rpm = as.numeric(peak_rpm),
price = as.numeric(price)
)
# Confirming that each of the columns is numeric
map(cars, typeof)
# Examining relationships between predictors
featurePlot(x = cars %>% select(-price), y = cars$price)
There is a positive relationship between price and the following variables:
- horsepower
- curb_weight
- engine_size
- length
- width
There is a negative relationship between price and the fuel-efficiency variables (more efficient cars tend to be cheaper):
- city_mpg
- highway_mpg
These variables appear scattered and have no clear relationship to price:
- peak_rpm
- stroke
- height
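To put numbers on these visual impressions, a quick correlation check against price is a useful complement (a small sketch, not part of the original analysis; all columns are numeric at this point):
# Correlation of every column with price, sorted from strongest positive to strongest negative
cor(cars)[, "price"] %>%
  sort(decreasing = TRUE) %>%
  round(2)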
# Plot distribution of prices
ggplot(cars, aes(x = price)) +
geom_histogram(color = "red") +
labs(
title = "Distribution of prices in cars dataset",
x = "Price",
y = "Frequency"
)
Most cars are priced below $20,000, where prices are fairly evenly spread; the full range runs from $5,118 to $45,400, with a long right tail of more expensive cars.
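The stated range is easy to confirm directly (a quick check, using the cleaned cars data frame from above):
# Minimum and maximum price, plus the quartiles
range(cars$price)
summary(cars$price)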
Next, the data is split into training and test sets:
- 80% of the data will be shown to the model for training.
- 20% will be withheld and used to test the model's accuracy on unseen data.
split_indices <- createDataPartition(cars$price, p = 0.8, list = FALSE)
train_cars <- cars[split_indices, ]
test_cars <- cars[-split_indices, ]
With 5-fold cross-validation, the model is trained and evaluated 5 times, each time holding out a different fifth of the training data, to make sure the results are not down to luck or over-fitting.
# 5-fold cross validation
five_fold_control <- trainControl(method = "cv", number = 5)
# Trying out different values of k (number of neighbors) from 1 to 20 to see which gives the best predictions
tuning_grid <- expand.grid(k = 1:20)
Training the K-Nearest Neighbors (KNN) model to:
- Compare a car to its most similar cars (based on all features) to guess its price.
- Test out many values of k and automatically scale the data so that features are comparable.
# creating a model based on all the features
full_model <- train(price ~ .,
data = train_cars,
method = "knn",
trControl = five_fold_control,
tuneGrid = tuning_grid,
preProcess = c("center", "scale"))
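Once training finishes, caret keeps the cross-validated results for every value of k that was tried; the accessors below are standard caret, shown here as a quick way to inspect the tuning (a short sketch):
# The value of k chosen by cross-validation
full_model$bestTune
# RMSE, R-squared and MAE for every candidate k
full_model$results
# RMSE as a function of k
plot(full_model)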
Using the model to predict prices for the cars previously held back, we check how close its predictions come to the actual prices using standard evaluation metrics:
- RMSE (root mean squared error): the average magnitude of the errors between predicted and actual values, with larger errors penalized more heavily.
- R-squared: how well the regression model explains the variability in the dependent variable, or simply, how much of the variation in the data is explained by the model.
- MAE (mean absolute error): the average magnitude of the absolute differences between predicted and actual values.
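For intuition, these metrics are simple to compute by hand; the helpers below are a rough sketch of what postResample() reports (caret's R-squared is the squared correlation between predictions and observations), with pred and obs standing for the predicted and actual prices:
# Hand-rolled versions of the three evaluation metrics
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))
rsq  <- function(pred, obs) cor(pred, obs)^2
mae  <- function(pred, obs) mean(abs(obs - pred))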
predictions <- predict(full_model, newdata = test_cars)
postResample(pred = predictions, obs = test_cars$price)
## RMSE Rsquared MAE
## 3759.9221475 0.7711353 2280.8888889
What does RMSE mean?
On average, the model's predictions were off by about $3,760. Bigger mistakes carry more weight because the errors are squared before averaging. Lower RMSE = better predictions.
This is a reasonable result: our car prices range from $5,118 to $45,400, so the error is roughly 8% of the price of the most expensive cars, though proportionally larger for the cheapest ones.
What does R-squared mean?
This is a measure of how well the model explains the variation in car prices.
- A value of 1 means perfect prediction.
- A value of 0 means we might as well guess at the price.
This model explains about 77% of the variation in car prices, which is a solid result.
What does MAE mean?
On average the model's price predictions are off by about +/- $2,281. MAE differs from RMSE in that it treats all errors equally rather than penalizing large mistakes more than small ones.
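As a final sanity check, plotting predicted against actual prices on the test set makes the size and spread of the errors visible (a sketch, not part of the original write-up):
# Predicted vs. actual prices; points on the red line are perfect predictions
ggplot(data.frame(actual = test_cars$price, predicted = predictions),
       aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(
    title = "Predicted vs. actual prices on the test set",
    x = "Actual price",
    y = "Predicted price"
  )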