Car Price Predictions

This project was inspired by DataQuest and explores the Automobile dataset from UC Irvine, which consists of car attributes from 1985.

Importing and Cleaning the Data

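# Load the packages used throughout this analysis
# (assumes the raw data has already been read into `cars`)
library(tidyverse)
library(caret)

# The column names are messy. Let's rename them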
colnames(cars) <- c(
  "symboling",
  "normalized_losses",
  "make",
  "fuel_type",
  "aspiration",
  "num_doors",
  "body_style",
  "drive_wheels",
  "engine_location",
  "wheel_base",
  "length",
  "width",
  "height",
  "curb_weight",
  "engine_type",
  "num_cylinders",
  "engine_size",
  "fuel_system",
  "bore",
  "stroke",
  "compression_ratio",
  "horsepower",
  "peak_rpm",
  "city_mpg",
  "highway_mpg",
  "price"
)

# Remove non-numerical columns and missing data
cars <- cars %>% 
  select(
    symboling, wheel_base, length, width, height, curb_weight,
    engine_size, bore, stroke, compression_ratio, horsepower,
    peak_rpm, city_mpg, highway_mpg, price
  ) %>% 
  filter(
    stroke != "?",
    bore != "?",
    horsepower != "?",
    peak_rpm != "?",
    price != "?"
  ) %>% 
  mutate(
    stroke = as.numeric(stroke),
    bore = as.numeric(bore),
    horsepower = as.numeric(horsepower),
    peak_rpm = as.numeric(peak_rpm),
    price = as.numeric(price)
  )

# Confirm that every column is numeric (typeof returns "integer" or "double")
map(cars, typeof)
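
As a small illustration of why both the filter and the conversion are needed, the sketch below uses a made-up data frame (the name `toy` and its values are hypothetical, not rows from the dataset) in which "?" marks a missing value. Because of the "?" entries, R stores the columns as text, so the rows are dropped first and the remaining values are then converted to numbers. Base R is used here only to keep the example self-contained.

```r
# Hypothetical toy data: "?" stands in for a missing value
toy <- data.frame(
  horsepower = c("111", "?", "154"),
  price      = c("13495", "16500", "?"),
  stringsAsFactors = FALSE
)

# Both columns are stored as text because of the "?" entries
sapply(toy, typeof)

# Keep only rows where neither column is "?"
complete <- toy[toy$horsepower != "?" & toy$price != "?", ]

# Convert the remaining text values to numbers
complete$horsepower <- as.numeric(complete$horsepower)
complete$price      <- as.numeric(complete$price)

# Check that both columns are now numeric
sapply(complete, is.numeric)
```

Calling `as.numeric()` without filtering first would instead turn each "?" into `NA` with a warning, which is easy to miss.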

Examine the Data

# Examining the relationship between each predictor and price
featurePlot(x = select(cars, -price), y = cars$price)

Price has a positive relationship with the following variables:

horsepower
curb_weight
engine_size
length
width

Price has a negative relationship with the fuel-efficiency variables; more efficient cars tend to be cheaper:

city_mpg
highway_mpg

These variables appear scattered and have no clear relationship to price:

peak_rpm
stroke
height
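
Reading direction off a scatterplot can be double-checked numerically with a correlation coefficient. The sketch below is one way to do that; the helper `price_correlations` and the toy values are made up for illustration and are not part of the original analysis. A positive value means the variable rises with price, a negative value means it falls, and a value near zero means no clear linear relationship.

```r
# Hypothetical helper: correlation of every other column with the target column
price_correlations <- function(df, target = "price") {
  predictors <- setdiff(names(df), target)
  cors <- sapply(predictors, function(col) cor(df[[col]], df[[target]]))
  sort(cors, decreasing = TRUE)
}

# Made-up values, chosen only to show one rising and one falling variable
toy <- data.frame(
  horsepower = c(70, 90, 110, 150, 200),
  city_mpg   = c(38, 31, 26, 21, 16),
  price      = c(6000, 8500, 12000, 18000, 33000)
)

result <- price_correlations(toy)
print(result)
```

On the cleaned data the same helper could be called as `price_correlations(cars)`.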

# Plot distribution of prices
ggplot(cars, aes(x = price)) +
  geom_histogram(color = "red") +
  labs(
    title = "Distribution of prices in cars dataset",
    x = "Price",
    y = "Frequency"
  )

Most cars are priced below $20,000, and the distribution is right-skewed, with a thin tail of expensive cars. Prices range from $5,118 to $45,400.

Split Testing and Modeling Data

80% of the data will be shown to the model for training
20% of the data will be withheld and used to test the model’s accuracy on unseen data

split_indices <- createDataPartition(cars$price, p = 0.8, list = FALSE)
train_cars <- cars[split_indices,]
test_cars <- cars[-split_indices,]
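
The split is random, so the rows that land in each set (and every metric computed later) change from run to run unless the random seed is fixed first. The sketch below shows the idea with a plain random 80/20 split in base R. It is a simplification: `createDataPartition` additionally balances the split across the range of the outcome, and the seed value and row count here are arbitrary assumptions.

```r
set.seed(42)   # arbitrary seed; any fixed value makes the split repeatable

n_rows <- 100  # hypothetical number of rows
train_idx <- sample(seq_len(n_rows), size = floor(0.8 * n_rows))
test_idx  <- setdiff(seq_len(n_rows), train_idx)

# Every row ends up in exactly one of the two sets
length(train_idx)
length(test_idx)
length(intersect(train_idx, test_idx))
```

Calling `set.seed()` immediately before `createDataPartition()` has the same effect on the split used in this project.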

Cross-Validation and Hyperparameter Optimization

The training data is split into 5 folds. The model is trained 5 times, each time holding out a different fold for validation, so that a good score cannot come from one lucky split or from over-fitting.

# 5-fold cross validation
five_fold_control <- trainControl(method = "cv", number = 5)

# Trying out different values of k (number of neighbors) from 1 to 20 to see which gives the best predictions
tuning_grid <- expand.grid(k = 1:20)
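
To make the fold mechanics concrete, here is a minimal base R sketch of how rows could be assigned to 5 folds. It only illustrates the idea; caret handles this internally, and the row count, seed, and variable names are assumptions made for the example.

```r
set.seed(1)      # arbitrary seed so the fold assignment is repeatable
n_rows  <- 23    # hypothetical number of training rows
n_folds <- 5

# Give each row a fold label 1..5, then shuffle the labels
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

for (fold in seq_len(n_folds)) {
  validation_rows <- which(fold_id == fold)  # held out this round
  training_rows   <- which(fold_id != fold)  # used to fit the model this round
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Each candidate value of k is scored on all 5 held-out folds, and the k with the best average score is kept.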

Choosing a Model

Training the K-Nearest Neighbors (KNN) model to:

Compare a car to its most similar cars (based on all features) to guess its price.
Test out many values of k and automatically scale the data so that features are comparable.

# creating a model based on all the features
full_model <- train(price ~ .,
                    data = train_cars,
                    method = "knn",
                    trControl = five_fold_control,  # was misspelled, so the 5-fold control was not applied
                    tuneGrid = tuning_grid,
                    preProcess = c("center", "scale"))
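
For intuition, the prediction step of KNN regression can be written by hand in a few lines. The sketch below is not caret's implementation; the function `knn_predict` and the toy numbers are hypothetical. It centers and scales the features using the training data, measures Euclidean distance, and averages the prices of the k closest cars.

```r
# Hypothetical helper: predict by averaging the targets of the k nearest training rows
knn_predict <- function(train_x, train_y, new_x, k) {
  # Scale both sets with the training means and standard deviations
  centers <- colMeans(train_x)
  scales  <- apply(train_x, 2, sd)
  train_scaled <- scale(train_x, center = centers, scale = scales)
  new_scaled   <- scale(new_x, center = centers, scale = scales)

  apply(new_scaled, 1, function(row) {
    # Euclidean distance from this new row to every training row
    distances <- sqrt(rowSums(sweep(train_scaled, 2, row)^2))
    nearest   <- order(distances)[seq_len(k)]
    mean(train_y[nearest])
  })
}

# Made-up training cars: horsepower and curb weight, with their prices
train_x <- matrix(
  c(70, 1900,  90, 2100,  110, 2400,  160, 3000,  200, 3400),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("horsepower", "curb_weight"))
)
train_y <- c(6000, 8000, 12000, 20000, 33000)

# A new car to price, using its 2 nearest neighbors
new_x <- matrix(c(100, 2300), ncol = 2,
                dimnames = list(NULL, c("horsepower", "curb_weight")))
pred <- knn_predict(train_x, train_y, new_x, k = 2)
print(pred)
```

Without the scaling step, curb weight would dominate the distance simply because its values are numerically much larger than horsepower.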

Final Model Evaluation

The model now predicts prices for the test cars that were held back during training.
Standard evaluation metrics (RMSE, R-squared, and MAE) measure how close those predictions are to the actual prices.

RMSE (Root Mean Squared Error): The square root of the average squared difference between predicted and actual values. Squaring gives large errors extra weight.
R-Squared: Indicates how much of the variation in the actual prices is explained by the model’s predictions.
MAE (Mean Absolute Error): The average absolute difference between predicted and actual values, with every error weighted equally.
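
Written out, the three metrics look like this. The notation is introduced here for illustration: y_i is the actual price of test car i, ŷ_i is its predicted price, and n is the number of test cars. R-squared is shown as the squared correlation between predictions and actual values, which is how caret's `postResample` computes it by default.

```latex
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left\lvert \hat{y}_i - y_i \right\rvert \\
R^2           &= \operatorname{cor}(\hat{y},\, y)^2
\end{aligned}
```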

predictions <- predict(full_model, newdata = test_cars)
postResample(pred = predictions, obs = test_cars$price)

##         RMSE     Rsquared          MAE 
## 3759.9221475    0.7711353 2280.8888889

RMSE = 3759.92

What does it mean?

The model’s predictions typically miss the actual price by about $3,760. Bigger mistakes carry more weight because the errors are squared before averaging. Lower RMSE = better predictions.

This is a reasonable result, but not a uniform one. Our car prices range from $5,118 to $45,400, so an error of this size is about 8% of the most expensive car’s price and a far larger share of the cheapest car’s price.

R-Squared = 0.771

What does it mean?

This is a measure of how well the model explains the variation in car prices.

A value of 1 means perfect prediction
A value of 0 means the model does no better than always guessing the average price

This model explains about 77.1% of the variation in car prices on the test set, which is a reasonably strong result for a first model.

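MAE = 2280.89

What does it mean?

On average, the model’s price predictions are off by about $2,281 in either direction. MAE differs from RMSE because it weights all errors equally rather than penalizing large mistakes more than small ones. The RMSE in the output above is noticeably larger than the MAE, which points to a handful of large misses.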
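
To see how the two error metrics respond to one large miss, the sketch below computes them by hand on four made-up predictions (the vectors `obs` and `pred` are hypothetical and are not taken from the test set). Three predictions are off by $1,000 to $2,000 and one is off by $4,000.

```r
# Made-up actual and predicted prices
obs  <- c(10000, 15000, 20000, 30000)
pred <- c(11000, 14000, 22000, 26000)

errors <- pred - obs

rmse <- sqrt(mean(errors^2))   # squaring emphasizes the $4,000 miss
mae  <- mean(abs(errors))      # every miss counts in proportion to its size
rsq  <- cor(pred, obs)^2       # squared correlation, as in caret's postResample

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

The MAE here is exactly $2,000, while the RMSE is higher because the single $4,000 error dominates the squared total.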