2025-03-25

Loading the data

We load a data set of used cars. Each row records a car's price, trim, whether it has had a single owner (isOneOwner), mileage, year, color, and engine displacement.

## [1] "price"        "trim"         "isOneOwner"   "mileage"      "year"        
## [6] "color"        "displacement"

Our interest is in price and mileage

We want to see whether there is a relationship between the price of a car and its mileage. Our initial guess is that cars with higher mileage sell for lower prices.

## [1] "price"   "mileage"
##    price mileage
## 1 43.995  36.858
## 2 44.995  46.883
## 3 25.999 108.759
## 4 33.880  35.187
## 5 34.895  48.153
## 6  5.995 121.748

Creating our testing environment

Because price is a known quantitative response (Y), we can hold out part of the data and compare our predicted Y values against the actual ones to judge accuracy. In this case, ‘ii’ is a vector storing randomly selected row indices. It is used to split our dataset (cd) into training data (cdtr) and testing data (cdte).

n = nrow(cd) # 1000
set.seed(80) # fix the random seed so the same rows are sampled on every run (reproducibility)
pin = .80 # proportion of data we will use for training, 80% of data in this case

ii = sample(1:n, floor(pin*n)) # essentially floor(0.80 * 1000) = 800 unique rows are randomly selected

cdtr = cd[ii,] # Training data set selecting rows from indices in ii. (e.g. [3,7,12,...,999])

cdte = cd[-ii,] # Test data set, select all rows of cd *not* in ii
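A quick way to convince yourself that the split behaves as described is to check the subset sizes and confirm that no row lands in both subsets. The sketch below uses a made-up data frame (`cd_demo`, with random prices and mileages) as a stand-in for cd; every name ending in `_demo` is hypothetical.

```r
# Hypothetical stand-in for cd: 1000 rows with the two columns of interest
set.seed(80)
cd_demo <- data.frame(price = runif(1000, 5, 60), mileage = runif(1000, 1, 150))

n_demo  <- nrow(cd_demo)
ii_demo <- sample(1:n_demo, floor(0.80 * n_demo))  # training row indices, no repeats

train_demo <- cd_demo[ii_demo, ]   # rows listed in ii_demo
test_demo  <- cd_demo[-ii_demo, ]  # all remaining rows

# The two subsets should partition the rows: 800 + 200, with no overlap
split_sizes <- c(train = nrow(train_demo), test = nrow(test_demo))
no_overlap  <- length(intersect(rownames(train_demo), rownames(test_demo))) == 0
print(split_sizes)
print(no_overlap)
```

Because sample() draws without replacement by default, the training indices are unique, and negative indexing returns exactly the complement.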

Relationship between Mileage and Price (code)

library(ggplot2) # provides ggplot(), geom_point(), labs(), theme_bw()

# Use the training data; map mileage -> x-axis, price -> y-axis
relationship_plot <- ggplot(cdtr, aes(x = mileage, y = price)) +
  geom_point(alpha = 0.5, color = "blue") +  # Semi-transparent points
  labs(
    title = "Relationship Between Mileage and Price",
    x = "Mileage",
    y = "Price"
  ) +
  theme_bw() # Arbitrary theme by choice

Graph of relationship between mileage and price

Notice the downward trend: higher mileage is associated with a lower price.

Finding a nice value of k for our kNN prediction.

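We choose k using cross-validation.

As a rough check, the square-root rule of thumb applied to the training-set size gives

\[ \lfloor \sqrt{\lfloor pin \cdot n \rfloor} \rfloor = \lfloor \sqrt{800} \rfloor = 28 \]

which is close to the cross-validated choice. The result of the analysis is that the best k is 27. At the very least, this is a good starting value for k.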
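The source does not show the cross-validation code, so the following is only a minimal sketch of how a k could be chosen this way. It assumes kNN regression on a single predictor (prediction = mean response of the k nearest training points), 5-fold cross-validation, and synthetic data; `knn_reg`, `cv_mse`, `k_grid`, and the `_demo` vectors are all hypothetical names, and the k it selects says nothing about the real data.

```r
# kNN regression on one predictor: average the responses of the k nearest training points
knn_reg <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[1:k]
    mean(y_train[nearest])
  })
}

# Synthetic stand-in data (price falling with mileage, plus noise)
set.seed(80)
x_demo <- runif(300, 1, 150)
y_demo <- 60 * exp(-x_demo / 60) + rnorm(300, sd = 3)

# Assign each observation to one of nfold folds, once, so every k sees the same folds
nfold <- 5
fold  <- sample(rep(1:nfold, length.out = length(x_demo)))

# Cross-validated mean squared error for a given k
cv_mse <- function(k) {
  errs <- sapply(1:nfold, function(f) {
    held <- fold == f
    pred <- knn_reg(x_demo[!held], y_demo[!held], x_demo[held], k)
    mean((y_demo[held] - pred)^2)
  })
  mean(errs)
}

k_grid     <- c(1, 5, 15, 27, 50, 100)
cv_results <- sapply(k_grid, cv_mse)
k_best     <- k_grid[which.min(cv_results)]
print(data.frame(k = k_grid, cv_mse = cv_results))
```

Very small k tends to chase noise and very large k tends to oversmooth, so the cross-validated error is usually lowest somewhere in between.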

Fitting a kNN prediction

  1. We first have the model learn patterns from our training data (cdtr); learning these patterns is what it means to “fit” a model.
  2. The lm() function fits a linear regression model, finding the best linear function (best in the sense that it minimizes the sum of squared residuals on the training data).
  3. We then use our linear model (lmtr) to generate predicted values \(\hat{Y}\) for our test data (cdte).
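The three steps above can be sketched on toy data. The formula `price ~ mileage` is an assumption (the source does not show the lm() call), and `toy`, `toy_tr`, `toy_te`, `lm_toy`, and `yhat_toy` are hypothetical names standing in for cd, cdtr, cdte, lmtr, and \(\hat{Y}\).

```r
# Toy data with a known negative slope (not the used-car data)
set.seed(80)
toy <- data.frame(mileage = runif(200, 1, 150))
toy$price <- 50 - 0.3 * toy$mileage + rnorm(200, sd = 4)

# 80/20 train/test split, mirroring the split used in this document
idx    <- sample(1:nrow(toy), floor(0.80 * nrow(toy)))
toy_tr <- toy[idx, ]
toy_te <- toy[-idx, ]

# Fit on the training rows only, then predict the held-out rows
lm_toy   <- lm(price ~ mileage, data = toy_tr)
yhat_toy <- predict(lm_toy, newdata = toy_te)

# Out-of-sample error and the fitted slope
rmse_toy  <- sqrt(mean((toy_te$price - yhat_toy)^2))
slope_toy <- coef(lm_toy)[["mileage"]]
print(c(slope = slope_toy, rmse = rmse_toy))
```

The model never sees the test rows during fitting, so the error computed on them is an honest estimate of predictive accuracy.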

How the kNN model works

The model learns patterns from our training data (cdtr). We use a k-nearest neighbors approach with k = 27, the value chosen by cross-validation: the predicted price for each test car is based on the 27 training cars whose mileage is closest to its own. The model then generates predicted values for our test data (cdte).
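To make the neighbor idea concrete, here is a small worked example using the six rows printed earlier as a miniature training set. It assumes the regression flavor of kNN, where the prediction is the mean price of the nearest neighbors; the source does not show how knn_predictions_k27 was produced, so this may differ from the actual call. The query mileage `x0` and the small `k = 3` are chosen purely for illustration.

```r
# The six rows printed earlier in this document, used here as a tiny training set
price   <- c(43.995, 44.995, 25.999, 33.880, 34.895, 5.995)
mileage <- c(36.858, 46.883, 108.759, 35.187, 48.153, 121.748)

x0 <- 40  # hypothetical query mileage, in the data's units
k  <- 3   # small k for illustration only (the analysis uses k = 27)

# Rank training cars by distance in mileage and keep the k closest
nearest <- order(abs(mileage - x0))[1:k]  # rows 1, 4, 2

# Prediction = mean price of those neighbors: (43.995 + 33.880 + 44.995) / 3
pred <- mean(price[nearest])
print(pred)
```

The three closest cars have mileages 36.858, 35.187, and 46.883, so the predicted price is the average of their prices, roughly 40.96.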

Plotting kNN prediction

  1. Create a base plot from the knn_data data frame, with our predicted prices \(\hat{Y}\) on the x-axis and the actual prices from the test set (cdte) on the y-axis.
  2. Draw each observation as a green, semi-transparent point.
  3. Add a reference line with slope 1 and intercept 0 (y = x); points on this line are exact predictions.
  4. Set the axis limits (xlim, ylim) so both axes share the same scale, which makes visual comparison easier because the axes are one-to-one.

ggplot code

# Create a data frame with kNN predictions (k=27) and actual prices
knn_data <- data.frame(
  Predicted_Price = as.numeric(as.character(knn_predictions_k27)),  # Ensure numeric
  Actual_Price = as.numeric(cdte$price)  # Ensure numeric (if not already)
)

# Compute shared axis limits from the predicted and actual prices
lms <- range(c(knn_data$Predicted_Price, knn_data$Actual_Price), na.rm = TRUE)

# Plot only kNN results
p_k27 <- ggplot(knn_data, aes(x = Predicted_Price, y = Actual_Price)) +
  geom_point(color = "green", alpha = 0.5) +  # Scatter plot of kNN predictions
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +  # Y=X reference line
  xlim(lms) + ylim(lms) +                     # Same limits on both axes
  labs(
    x = "Predicted Price (kNN, k=27)", 
    y = "Actual Price",
    title = "Actual vs. Predicted Prices (kNN Only)"
  ) +
  theme_minimal()

Actual vs. Predicted Prices (kNN Only)

Interpretation

- Points above the line: Actual Price (y-axis) > Predicted Price (x-axis), i.e. kNN underpredicted the price.
- Points below the line: Actual Price (y-axis) < Predicted Price (x-axis), i.e. kNN overpredicted the price.
- Points on the line: Actual Price (y-axis) = Predicted Price (x-axis), i.e. kNN predicted the price exactly.
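The same reading of the plot can be done numerically. The sketch below uses five made-up predicted/actual pairs (not results from the used-car data) to count under- and overpredictions and to summarize the typical error with the root mean squared error; `predicted`, `actual`, `resid`, and `summary_counts` are hypothetical names.

```r
# Hypothetical predicted and actual prices (not from the source data)
predicted <- c(40.0, 30.5, 22.0, 10.0, 35.0)
actual    <- c(43.0, 28.5, 22.0, 12.5, 31.0)

# Positive residual: the point lies above y = x, so the model underpredicted
resid <- actual - predicted

summary_counts <- c(
  under = sum(resid > 0),   # points above the line
  over  = sum(resid < 0),   # points below the line
  exact = sum(resid == 0)   # points on the line
)

# Root mean squared error: typical size of a prediction error
rmse <- sqrt(mean(resid^2))
print(summary_counts)
print(rmse)
```

With real predictions, exact equality is rare, so the under/over counts and the RMSE are the more informative summaries.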