This project was inspired by DataQuest and explores the Automobile dataset from UC Irvine, which consists of car attributes from 1985.


Importing and Cleaning the Data

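# Packages used throughout: dplyr (%>%, select, filter, mutate), purrr (map),
# ggplot2 (plots), caret (featurePlot, data splitting, training, evaluation)
library(dplyr)
library(purrr)
library(ggplot2)
library(caret)

# `cars` is assumed to be already loaded from the UCI Automobile data file

# The column names are messy. Let's rename them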
colnames(cars) <- c(
  "symboling",
  "normalized_losses",
  "make",
  "fuel_type",
  "aspiration",
  "num_doors",
  "body_style",
  "drive_wheels",
  "engine_location",
  "wheel_base",
  "length",
  "width",
  "height",
  "curb_weight",
  "engine_type",
  "num_cylinders",
  "engine_size",
  "fuel_system",
  "bore",
  "stroke",
  "compression_ratio",
  "horsepower",
  "peak_rpm",
  "city_mpg",
  "highway_mpg",
  "price"
)

# Remove non-numerical columns and missing data
cars <- cars %>% 
  select(
    symboling, wheel_base, length, width, height, curb_weight,
    engine_size, bore, stroke, compression_ratio, horsepower,
    peak_rpm, city_mpg, highway_mpg, price
  ) %>% 
  filter(
    stroke != "?",
    bore != "?",
    horsepower != "?",
    peak_rpm != "?",
    price != "?"
  ) %>% 
  mutate(
    stroke = as.numeric(stroke),
    bore = as.numeric(bore),
    horsepower = as.numeric(horsepower),
    peak_rpm = as.numeric(peak_rpm),
    price = as.numeric(price)
  )
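The filter step matters because `as.numeric()` cannot parse the "?" placeholder. A minimal base-R sketch with a made-up vector (`raw_hp` is a hypothetical stand-in, not a column taken from the dataset) shows what would happen without it:

```r
# Hypothetical values standing in for a column that uses "?" for missing data
raw_hp <- c("111", "?", "154")

# Without filtering, "?" is coerced to NA (R also emits a warning)
converted <- suppressWarnings(as.numeric(raw_hp))
print(converted)

# Dropping the "?" entries first gives a clean numeric vector
clean_hp <- as.numeric(raw_hp[raw_hp != "?"])
print(clean_hp)
```

Filtering first, as the pipeline above does, keeps silent `NA` values out of the modeling data.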

# Confirm that every column is now numeric
map(cars, typeof)


Examine the Data

# Examine the relationship between each predictor and price
featurePlot(x = select(cars, -price), y = cars$price)

There is a positive relationship between price and the following variables:

These variables appear scattered and have no clear relationship to price:

# Plot distribution of prices
ggplot(cars, aes(x = price)) +
  geom_histogram(color = "red") +
  labs(
    title = "Distribution of prices in cars dataset",
    x = "Price",
    y = "Frequency"
  )

Car prices below $20,000 are reasonably distributed. The range of prices is $5,118 to $45,400.


Split Testing and Modeling Data

# Hold out 20% of the rows for testing; the rest are used for training
split_indices <- createDataPartition(cars$price, p = 0.8, list = FALSE)
train_cars <- cars[split_indices, ]
test_cars <- cars[-split_indices, ]


Cross-Validation and Hyperparameter Optimization

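The training data is split into five folds. The model is fit five times, each time trained on four folds and validated on the remaining one, so every observation is used for validation exactly once. Averaging the five validation scores gives a more reliable performance estimate than a single split and helps reveal over-fitting.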

# 5-fold cross validation
five_fold_control <- trainControl(method = "cv", number = 5)

# Trying out different values of k (number of neighbors) from 1 to 20 to see which gives the best predictions
tuning_grid <- expand.grid(k = 1:20)
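To make the idea of 5-fold cross-validation concrete, here is a minimal base-R sketch of how rows could be assigned to folds. It only illustrates the concept: caret builds its folds internally, and `n_rows`, `fold_id` and the seed are hypothetical choices made for this example.

```r
set.seed(1)  # arbitrary seed so the illustration is repeatable

n_rows <- 20   # hypothetical number of training rows
n_folds <- 5

# Give every row a fold label from 1 to 5, shuffled so the folds are random
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

for (fold in seq_len(n_folds)) {
  validation_rows <- which(fold_id == fold)
  training_rows <- which(fold_id != fold)
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

With `tuning_grid`, this loop would be repeated for each of the 20 candidate values of k, and the k with the best average validation error would be kept.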


Choosing a Model

Training the K-Nearest Neighbors (KNN) model to predict price from all of the remaining features:

# Create a model based on all the features
full_model <- train(price ~ .,
                    data = train_cars,
                    method = "knn",
                    trControl = five_fold_control,
                    tuneGrid = tuning_grid,
                    preProcess = c("center", "scale"))
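For intuition about what a KNN regression does, here is a minimal base-R sketch, assuming Euclidean distance and a plain average of the neighbors' prices. The helper `knn_predict` and the toy data are hypothetical and written only for this illustration; this is not caret's implementation.

```r
# Hypothetical helper: predict one new observation with k-nearest neighbors
knn_predict <- function(train_x, train_y, new_x, k) {
  # Center and scale using statistics from the training data only
  centers <- colMeans(train_x)
  scales <- apply(train_x, 2, sd)
  scaled_train <- scale(train_x, center = centers, scale = scales)
  scaled_new <- (new_x - centers) / scales

  # Euclidean distance from the new point to every training row
  distances <- sqrt(rowSums(sweep(scaled_train, 2, scaled_new)^2))

  # Average the outcome of the k closest rows
  nearest <- order(distances)[seq_len(k)]
  mean(train_y[nearest])
}

# Made-up toy data: two features and a price
toy_x <- cbind(engine_size = c(90, 100, 110, 200, 210),
               curb_weight = c(2000, 2100, 2200, 3500, 3600))
toy_price <- c(6000, 7000, 8000, 30000, 32000)

# The three nearest cars are the three small ones, so the result is 7000
knn_predict(toy_x, toy_price, new_x = c(105, 2150), k = 3)
```

Centering and scaling matter here because `curb_weight` is measured in much larger numbers than `engine_size` and would otherwise dominate the distance.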


Final Model Evaluation

RMSE (Root Mean Squared Error)

The square root of the average squared difference between predicted and actual values. It is expressed in the same units as the target, which is dollars here.
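Written as a formula, using notation chosen here for illustration ($y_i$ is the actual price of test car $i$, $\hat{y}_i$ is the predicted price, and $n$ is the number of test cars):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```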

R-Squared

Indicates how much of the variation in the dependent variable (price) is explained by the model. Values closer to 1 mean a better fit.
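To my understanding, the `Rsquared` value that caret's `postResample()` reports is the squared correlation between actual and predicted values, which can differ slightly from the textbook "one minus residual over total sum of squares" definition. In notation chosen here for illustration ($y$ is the vector of actual prices, $\hat{y}$ the vector of predictions):

```latex
R^2 = \left( \operatorname{cor}\left( y, \hat{y} \right) \right)^2
```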

MAE (Mean Absolute Error)

The average absolute difference between predicted and actual values, also expressed in dollars.
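As a formula, in notation chosen here for illustration ($y_i$ is the actual price of test car $i$, $\hat{y}_i$ is the predicted price, and $n$ is the number of test cars):

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```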

predictions <- predict(full_model, newdata = test_cars)
postResample(pred = predictions, obs = test_cars$price)
##         RMSE     Rsquared          MAE 
## 3759.9221475    0.7711353 2280.8888889

RMSE = 3759.92

What does it mean?

The model’s predictions were typically off by about $3,760. Bigger mistakes carry more weight because the errors are squared before averaging. Lower RMSE = better predictions.

Our car prices range from $5,118 to $45,400, so an error of this size is about 8% of the most expensive car's price but more than 70% of the cheapest car's. The same dollar error matters far more for a cheap car than for an expensive one.


R-Squared = 0.771

What does it mean?

This is a measure of how well the model explains the variation in car prices.

This model explains about 77.1% of the variation in car prices in the test set, which is a reasonably strong fit.


MAE = 2280.89

What does it mean?

On average, the model’s price predictions are off by about $2,281 in either direction. MAE differs from RMSE in that it weights all errors equally rather than penalizing large mistakes more than small ones.
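The difference between MAE and RMSE is easiest to see on a small example. The sketch below uses two made-up sets of prediction errors (not results from this model) that have the same MAE but different RMSE; the helper names `rmse` and `mae` are defined here for illustration.

```r
rmse <- function(errors) sqrt(mean(errors^2))
mae <- function(errors) mean(abs(errors))

# Two made-up sets of prediction errors (in dollars) with the same total error
even_errors <- c(2000, 2000, 2000, 2000)
one_big_miss <- c(0, 0, 0, 8000)

mae(even_errors)     # 2000
mae(one_big_miss)    # 2000
rmse(even_errors)    # 2000
rmse(one_big_miss)   # 4000
```

A gap between RMSE and MAE, like the one in the results above, suggests that a few large misses account for a sizable share of the error.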