Data Description:

The used cars.csv file contains information about 1,000 randomly sampled used sedans (4-door cars) from 2021. The variables are:

  1. manufactor: The company that makes the car
  2. model: The model of the car
  3. price: The sale price of the used car (our response variable)
  4. year: The year the car was manufactured
  5. age: The age of the car when it was posted
  6. condition: The condition of the car (like new/excellent/good)
  7. cylinders: The number of cylinders in the engine (4/6/8)
  8. fuel: Type of fuel the car takes (gas/hybrid)
  9. mileage: The miles driven according to the odometer (in thousands of miles)
  10. transmission: The type of transmission (automatic/manual)
  11. paint_color: The color of the car

Question 1) kNN Regression

Part 1a) condition, fuel, transmission

Briefly explain why the condition, fuel, and transmission variables can’t be used as predictors of price using kNN regression.

Since kNN uses distance to determine which neighbors are the closest (nearest), we can only use numeric predictors, and condition, fuel, and transmission are categorical (we can’t subtract “hybrid” from “gas”).
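
As a minimal sketch (using hypothetical values for two cars, not rows from the data), the Euclidean distance kNN relies on is only defined for numbers:

# Hypothetical numeric values for two cars: distance is well-defined
car_a <- c(age = 3, cylinders = 4, mileage = 40)
car_b <- c(age = 7, cylinders = 6, mileage = 90)
sqrt(sum((car_a - car_b)^2))
# But there is no numeric difference between "gas" and "hybrid",
# so no distance can be computed from a categorical predictor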

Part 1b) Best choice of k

Using age, cylinders, and mileage as predictors, find the best choice of \(k\) to predict the price of the used cars. Report the value of \(k\), rescaling method, and resulting \(R^2\). Search k = 5 to k = 100. (Best to start with a smaller range of k until you get the loop to work).

Display your results using a line graph with two lines: the \(R^2\) value when normalizing the data and the \(R^2\) value when standardizing the data.

# Loading the packages used throughout (assuming they are installed)
library(tidyverse)   # dplyr, tidyr, ggplot2
library(FNN)         # knn.reg()
library(rpart)       # rpart(), prune()
library(rpart.plot)  # rpart.plot()

set.seed(2870)
### Normalizing the data
cars_norm <- 
  cars |> 
  # Removing the categorical columns
  dplyr::select(price, age, cylinders, mileage) |> 
  mutate(
    across(
      .cols = age:mileage,
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )

### Standardizing the data
cars_stan <- 
  cars |> 
  # Removing the categorical columns
  dplyr::select(price, age, cylinders, mileage) |> 
  mutate(
    across(
      .cols = age:mileage,
      .fns = ~ (. - mean(.)) / (sd(.))
    )
  )
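
As a quick sanity check (a sketch, not required by the assignment), we can verify the rescaling worked: normalized columns should span [0, 1], and standardized columns should have mean 0 and sd 1.

# Normalized columns should have min = 0 and max = 1
cars_norm |> 
  summarize(across(age:mileage, list(min = min, max = max)))

# Standardized columns should have mean = 0 and sd = 1
cars_stan |> 
  summarize(across(age:mileage, list(mean = mean, sd = sd)))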

### for loop set up
# Preallocating a tibble with the k values to search through (k = 5 to 100)
# and placeholder columns to save the results in
fit_stats <- 
  tibble(
    k = 5:100, 
    R2_norm = -1,
    R2_stan = -1
  )


# Conducting the for loop
for (i in 1:nrow(fit_stats)){
  # Finding R2 for the normalized data
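  # Note: when no test set is supplied, knn.reg() performs leave-one-out
  # cross-validation, so R2Pred is a cross-validated R-squared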
  loop_norm <- 
    knn.reg(
      train = cars_norm |> dplyr::select(-price),
      y = cars_norm$price,
      k = fit_stats$k[i]
    )
  
  # Saving the R2 value
  fit_stats[i, "R2_norm"] <- loop_norm$R2Pred
  
  
  # Finding R2 for the standardized data
  loop_stan <- 
    knn.reg(
      train = cars_stan |> dplyr::select(-price),
      y = cars_stan$price,
      k = fit_stats$k[i]
    )
  
  # Saving the R2 value
  fit_stats[i, "R2_stan"] <- loop_stan$R2Pred
  
}

fit_stats |> 
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |> 
  # Making the graph
  ggplot(
    mapping = aes(
      x = k,
      y = R2,
      color = rescale
    )
  ) + 
  
  geom_line() + 
  
  theme(legend.position = c(0.9, 0.85))

# Finding the best choice of k
fit_stats |> 
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |>
  slice_max(
    n = 1,
    R2
  )
## # A tibble: 1 × 3
##       k rescale    R2
##   <int> <chr>   <dbl>
## 1    27 R2_stan 0.634

The best choice of k is 27 when rescaling the data by standardizing, with an \(R^2\) value of 0.634.
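
As a side check (a sketch reusing the fit_stats tibble from above), grouping by rescaling method shows the best k under each rescaling separately:

fit_stats |> 
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |> 
  group_by(rescale) |> 
  slice_max(R2, n = 1)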

Part 1c) Predicting the price for the test_cars data set

Regardless of your answer in the previous question, predict the price for the 200 cars in the test_cars data set when standardizing the data with k = 30. Display the results using an R-squared plot. Make sure to standardize the test_cars data set before predicting the price!

# standardizing the test_cars data
test_cars_stan <- 
  test_cars |> 
  # Removing the categorical columns
  dplyr::select(price, age, cylinders, mileage) |> 
  mutate(
    across(
      .cols = age:mileage,
      .fns = ~ (. - mean(.)) / (sd(.))
    )
  )
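
Note: the code above rescales test_cars with its own means and standard deviations, as the assignment instructs. An alternative (sketched below; train_stats and test_cars_stan_alt are hypothetical names introduced here) rescales the test set with the training data's statistics so both data sets share the same scale:

# A sketch: standardizing the test set with the TRAINING means and sds
train_stats <- 
  cars |> 
  summarize(across(c(age, cylinders, mileage), list(mean = mean, sd = sd)))

test_cars_stan_alt <- 
  test_cars |> 
  mutate(
    age       = (age       - train_stats$age_mean)       / train_stats$age_sd,
    cylinders = (cylinders - train_stats$cylinders_mean) / train_stats$cylinders_sd,
    mileage   = (mileage   - train_stats$mileage_mean)   / train_stats$mileage_sd
  )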

# Predicting the price for the test cars data
test_cars_knn <- 
  knn.reg(
    train = cars_stan[, -1],
    test = test_cars_stan[, -1],
    y = cars_stan$price,
    k = 30
  )

test_cars_stan |> 
  mutate(price_hat = test_cars_knn$pred) |> 
  ggplot(
    mapping = aes(
      x = price_hat,
      y = price
    )
  ) + 
  # Adding the points and a trend line
  geom_point() + 
  geom_smooth(
    color = "red",
    method = "lm",
    se = FALSE,
    formula = y ~ x
  ) +
  # Adding the R2 value to the graph
  annotate(
    geom = "text",
    x = 8000,
    y = 22500,
    label = paste(
      "R-squared:",
      round(
        1 - sum((test_cars$price - test_cars_knn$pred)^2) /
          sum((test_cars$price - mean(test_cars$price))^2),
        3
      )
    ),
    color = "red",
    size = 5
  ) +
  
  labs(
    x = "Predicted price",
    y = "Actual price",
    title = "Predicted vs Actual Price for 200 Used Cars"
  ) + 
  
  # Adding $ to the axes
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_dollar())

Is kNN accurate for the 200 used cars?

It’s only an okay predictor: the graph shows a lot of spread between the predicted values (red line) and the actual values (black dots), and the \(R^2\) is a moderate value of around 0.57.
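
Another way to quantify the accuracy (a quick sketch) is the RMSE, which is in the same units as price:

# Root mean squared error of the kNN predictions, in dollars
sqrt(mean((test_cars$price - test_cars_knn$pred)^2))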

Part 1d) The effect of mileage on price

Using the results of kNN, can you interpret the effect of mileage on the price of a used car? If yes, interpret the results. If not, briefly explain why.

No. kNN is a lazy learner, meaning it doesn’t build a model. Without a model, we can’t learn how the method used each variable to predict the price.
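
For contrast (a sketch): a parametric method like lm() returns coefficients that describe each predictor's effect, while the kNN fit object stores predictions but no model to interpret:

# A linear model has interpretable coefficients...
coef(lm(price ~ age + cylinders + mileage, data = cars))

# ...but the kNN object holds predictions and metadata, not a model
names(test_cars_knn)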

Question 2) Regression trees

Part 2a) Fitting the full tree

Create the full regression tree predicting price using age, cylinders, fuel, mileage, and transmission. Display the last 10 rows of the CP table.

# Leave this at the top
set.seed(2870)
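# minsplit = 2, minbucket = 1, and cp = 0 let the tree grow as deep as
# possible, which is what makes this the "full" (unpruned) tree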
price_tree_full <- 
  rpart(
    formula = price ~ age + cylinders + fuel + mileage + transmission,
    data = cars,
    method = "anova",
    minsplit = 2,
    minbucket = 1,
    cp = 0
  )

price_tree_full$cptable |> 
  data.frame() |> 
  tail(n = 10)
##               CP nsplit   rel.error    xerror       xstd
## 618 3.775409e-10    915 0.006356401 0.7046940 0.04776653
## 619 3.775409e-10    916 0.006356401 0.7046940 0.04776653
## 620 2.831556e-10    918 0.006356400 0.7046940 0.04776653
## 621 2.516939e-10    919 0.006356400 0.7046936 0.04776653
## 622 9.438522e-11    921 0.006356399 0.7046936 0.04776653
## 623 9.438522e-11    922 0.006356399 0.7046935 0.04776653
## 624 3.146174e-11    923 0.006356399 0.7046935 0.04776653
## 625 2.359630e-11    924 0.006356399 0.7046935 0.04776653
## 626 2.359630e-11    925 0.006356399 0.7046931 0.04776654
## 627 0.000000e+00    926 0.006356399 0.7046931 0.04776654

Part 2b) Finding the pruning point

Find the cp value to prune the tree. Don’t round the actual results, but you can round to 4 decimal places when typing your answer.

# Finding the xerror cutoff: min(xerror) + xstd
xerror_cutoff <- 
  price_tree_full$cptable |> 
  data.frame() |> 
  slice_min(xerror, n = 1, with_ties = FALSE) |> 
  mutate(
    xerror_1sd = xerror + xstd
  ) |> 
  pull(xerror_1sd)

# Finding the first (simplest) tree with xerror < xerror_cutoff
cp_prune <- 
  price_tree_full$cptable |> 
  data.frame() |> 
  filter(xerror < xerror_cutoff) |> 
  slice(1) |> 
  pull(CP)

cp_prune
## [1] 0.008486252

The cp value is: 0.0085
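
As a visual check (a sketch, not required by the assignment), rpart's plotcp() graphs the cross-validation error against cp, with a horizontal line drawn one standard error above the minimum xerror — the same cutoff computed above:

plotcp(price_tree_full)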

Part 2c) Pruning and plotting the tree

Using your answer from the previous question, prune the tree, then use rpart.plot() to display the tree.

price_tree_pruned <- 
  prune(tree = price_tree_full,
        cp = cp_prune)

rpart.plot(
  price_tree_pruned,
  type = 5,
  extra = 101
)

Part 2d) Variable Importance

Using the pruned tree, which three variables are the most important in predicting the price of a used car?

caret::varImp(price_tree_pruned) |> 
  arrange(-Overall)
##                 Overall
## mileage      1.72718776
## cylinders    1.10766910
## age          0.97584687
## transmission 0.05169987
## fuel         0.01499465

The three most important variables are mileage, cylinders, and age.
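
The same importances can be displayed as a bar chart (a sketch using the varImp() output above):

caret::varImp(price_tree_pruned) |> 
  tibble::rownames_to_column(var = "variable") |> 
  ggplot(
    mapping = aes(
      x = Overall,
      y = reorder(variable, Overall)
    )
  ) + 
  geom_col() + 
  labs(
    x = "Importance",
    y = NULL,
    title = "Variable Importance for the Pruned Tree"
  )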

Part 2e) Predicting the Price for the test cars data set

Using both the full tree and the pruned tree separately, predict the price for the 200 used cars in the test_cars data set. Calculate the \(R^2\) value for both the full tree and pruned tree.

# Predicting the price for the 200 cars in the test_cars data
test_cars |> 
  mutate(
    price_hat_full   = predict(price_tree_full, 
                             newdata = test_cars),
    price_hat_pruned = predict(price_tree_pruned, 
                             newdata = test_cars)
  ) |> 
  summarize(
    R2_full   = cor(price, price_hat_full)^2,
    R2_pruned = cor(price, price_hat_pruned)^2
  )
##     R2_full R2_pruned
## 1 0.3484765 0.4915536

Which model is more accurate for the test cars? Briefly explain why the outcome (full vs pruned) is not surprising.

The pruned model predicts the test cars better. This isn’t surprising: the full tree keeps splitting until it nearly memorizes the training data, so it is overfit. The more complex the model, the better it fits the sample data, but the worse it tends to fit new data.
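
To see the overfitting directly (a sketch), compare the training-data \(R^2\) of the two trees with the test values above; the full tree's training fit will be nearly perfect:

# R-squared on the TRAINING data for both trees
cars |> 
  mutate(
    price_hat_full   = predict(price_tree_full),
    price_hat_pruned = predict(price_tree_pruned)
  ) |> 
  summarize(
    R2_full   = cor(price, price_hat_full)^2,
    R2_pruned = cor(price, price_hat_pruned)^2
  )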