Lab 9 Report

Author

Deena Darby

Lab Report: Tree-Based Models and Hyperparameter Tuning

Introduction

This week’s lab explores how incorporating spatial context and applying hyperparameter tuning can significantly improve predictive performance in tree-based machine learning models. Building on the Ames Housing dataset used in previous weeks, I extended my modeling pipeline to include the Neighborhood variable and applied two types of tree-based models:

Decision Tree (CART / rpart)
Random Forest

For each model, I created:

A default model (no trControl or tuneLength, so caret falls back to its default bootstrap resampling)
A tuned model using 5-fold cross-validation and tuneLength

This yielded four total models. I then compared their hyperparameters and performance using test RMSE.

Data Preparation

I reused the cleaned dataset from previous weeks and ensured the Neighborhood variable was included.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(sf)

Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE

library(caret)

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

library(rattle)

Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.

ames <- read_csv("Data/AmesHousing.csv")

Rows: 2930 Columns: 82
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (45): PID, MS SubClass, MS Zoning, Street, Alley, Lot Shape, Land Contou...
dbl (37): Order, Lot Frontage, Lot Area, Overall Qual, Overall Cond, Year Bu...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ames <- ames |>
  select(`Overall Qual`, `Overall Cond`,
         `Year Remod/Add`, `Lot Area`, `1st Flr SF`,
         SalePrice, Neighborhood) |>
  janitor::clean_names() |>
  mutate(neighborhood = as.factor(neighborhood)) |>
  drop_na()

set.seed(24)

train_index <- createDataPartition(
  y = ames$sale_price,
  p = .7,
  list = FALSE
)

train_data <- ames[train_index,]
test_data  <- ames[-train_index,]
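
To illustrate what a stratified 70/30 split does, here is a minimal base-R sketch on simulated prices. It assumes, as I understand createDataPartition's behavior for a numeric outcome, that rows are grouped into quantile bins and sampled within each bin. The simulated vector y and the five-bin choice are illustrative assumptions, not the lab's data or caret's exact internals.

```r
set.seed(24)

# Simulated sale prices; a stand-in for ames$sale_price (assumption).
y <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Bin the outcome into five quantile groups.
bins <- cut(y,
            breaks = quantile(y, probs = seq(0, 1, by = 0.2)),
            include.lowest = TRUE)

# Sample 70% of the row indices within each bin.
train_idx <- unlist(lapply(
  split(seq_along(y), bins),
  function(idx) sample(idx, size = ceiling(0.7 * length(idx)))
), use.names = FALSE)

# The price distribution should look similar in both parts.
summary(y[train_idx])
summary(y[-train_idx])
```

Sampling within bins keeps expensive and inexpensive homes represented in both the training and test sets, which a purely random split does not guarantee.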

Model 1: Decision Tree (rpart)

1A. Default Decision Tree Model

No trControl or tuneLength specified, so caret uses 25 bootstrap resamples and tries three cp values:

set.seed(24)

tree_default <- train(
  sale_price ~ ., 
  data = train_data,
  method = "rpart"
)

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.

tree_default

CART 

2053 samples
   6 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE     
  0.06912796  49386.09  0.6051235  35317.76
  0.10527320  53682.41  0.5299818  38903.47
  0.47192708  68603.41  0.4607420  50533.52

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.06912796.

Default Model Hyperparameters Used

cp (final model): 0.06912796
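
To make the role of cp concrete, here is a minimal sketch on simulated data (not the Ames data), assuming the rpart package is installed. The toy data, the two cp values (0.07 and 0.007), and the helper n_leaves are illustrative assumptions. A split is kept only if it improves the overall fit by at least a factor of cp, so a larger cp gives a smaller, more heavily pruned tree.

```r
library(rpart)

set.seed(1)
n <- 500

# Simulated nonlinear relationship with noise (illustrative only).
toy <- data.frame(x = runif(n))
toy$y <- sin(6 * toy$x) + rnorm(n, sd = 0.2)

# Same data, two complexity parameters.
fit_large_cp <- rpart(y ~ x, data = toy, control = rpart.control(cp = 0.07))
fit_small_cp <- rpart(y ~ x, data = toy, control = rpart.control(cp = 0.007))

# Hypothetical helper: count terminal nodes in a fitted rpart tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

c(large_cp = n_leaves(fit_large_cp), small_cp = n_leaves(fit_small_cp))
```

The tree fit with the smaller cp can never have fewer leaves than the one fit with the larger cp, so a high cp risks underfitting rather than overfitting.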

Test Performance

pred <- predict(tree_default, test_data)
tree_default_rmse <- RMSE(pred, test_data$sale_price)
tree_default_rmse

[1] 53287.81
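
For reference, RMSE is the square root of the mean squared prediction error. The sketch below computes it by hand on three made-up prices. The function name rmse and the numbers are hypothetical, chosen so the result is easy to check.

```r
# Root mean squared error: sqrt(mean((prediction - observed)^2)).
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Made-up observed and predicted sale prices (illustrative only).
obs  <- c(200000, 150000, 310000)
pred <- c(210000, 140000, 300000)

# Every prediction is off by exactly 10000, so the RMSE is 10000.
rmse(pred, obs)
```

Because RMSE is in the same units as sale_price, a test RMSE of 53287.81 means a typical prediction error on the order of $53k.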

1B. Tuned Decision Tree (5-fold CV)

set.seed(24)
tree_tuned <- train(
  sale_price ~ .,
  data = train_data,
  method = "rpart",
  trControl = trainControl("cv", number = 5),
  tuneLength = 10
)

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.

tree_tuned

CART 

2053 samples
   6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1642, 1643, 1642, 1643, 1642 
Resampling results across tuning parameters:

  cp           RMSE      Rsquared   MAE     
  0.007003399  41694.86  0.7266673  28909.46
  0.009614771  41899.61  0.7224951  29266.20
  0.011806978  43667.73  0.6982982  30602.74
  0.014329478  44883.13  0.6811792  32057.63
  0.014732789  45288.27  0.6757269  32512.27
  0.020770070  46480.37  0.6594053  32966.81
  0.030348674  46743.30  0.6538592  33474.37
  0.069127962  49580.50  0.6108553  35595.03
  0.105273196  55408.82  0.5122888  40398.54
  0.471927081  67483.37  0.4426771  49404.14

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.007003399.
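
The "Summary of sample sizes" line above follows from splitting 2053 training rows into five folds. The sketch below is a simplified stand-in that assigns folds at random, whereas caret builds its folds with its own helper. It reproduces the fold arithmetic only; the variable names are illustrative.

```r
n <- 2053   # training rows reported in the caret output
k <- 5      # number of folds

set.seed(24)

# Assign each row to one of k folds, as evenly as possible.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out in each fold, and rows left for training.
held_out    <- as.integer(table(fold_id))
train_sizes <- n - held_out

sort(train_sizes)
```

Each model is trained on roughly 80% of the training data and scored on the remaining fold. The five held-out RMSE values are then averaged for each candidate cp.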

Best Hyperparameters Selected

cp: 0.007003399

Test RMSE

pred <- predict(tree_tuned, test_data)
tree_tuned_rmse <- RMSE(pred, test_data$sale_price)
tree_tuned_rmse

[1] 43430.69

Model 2: Random Forest

2A. Default Random Forest

set.seed(24)
rf_default <- train(
  sale_price ~ ., 
  data = train_data,
  method = "rf"
)
rf_default

Random Forest 

2053 samples
   6 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    41467.21  0.7914147  27773.56
  17    31531.63  0.8386332  20920.25
  32    32814.16  0.8256751  21748.27

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 17.

Default Hyperparameters Used

mtry: 17
ntree: 500
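
The mtry values 2, 17, and 32 exceed the six predictors because the formula interface dummy-codes neighborhood. The sketch below shows one way those numbers can arise. It rests on two assumptions. First, neighborhood has 28 levels, which I infer from the grid's upper bound of 32 (5 numeric columns plus 27 dummies). Second, the default grid is evenly spaced from 2 to the number of columns, which is my reading of caret's behavior and not something the output states.

```r
set.seed(24)
n <- 100

# Five numeric predictors plus a 28-level factor (level count is an assumption).
toy <- data.frame(matrix(rnorm(n * 5), ncol = 5))
toy$neighborhood <- factor(rep(paste0("N", 1:28), length.out = n))

# Dummy-code the factor the way a formula does, then drop the intercept.
X <- model.matrix(~ ., data = toy)[, -1]
p <- ncol(X)

# Three evenly spaced candidates between 2 and p.
mtry_grid <- floor(seq(2, p, length.out = 3))

p
mtry_grid
```

Under these assumptions, mtry = 17 means each split considers 17 of the 32 dummy-coded columns, roughly half of them.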

Test RMSE

pred <- predict(rf_default, test_data)
rf_default_rmse <- RMSE(pred, test_data$sale_price)
rf_default_rmse

[1] 31824.78

2B. Tuned Random Forest (5-fold CV)

set.seed(24)
rf_tuned <- train(
  sale_price ~ .,
  data = train_data,
  method = "rf",
  trControl = trainControl("cv", number = 5),
  tuneLength = 3
)
rf_tuned

Random Forest 

2053 samples
   6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1642, 1643, 1642, 1643, 1642 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    41706.46  0.7946920  27655.73
  17    31528.99  0.8413667  20570.97
  32    32747.21  0.8288300  21216.82

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 17.

Best Hyperparameters Found

mtry: 17

Test RMSE

pred <- predict(rf_tuned, test_data)
rf_tuned_rmse <- RMSE(pred, test_data$sale_price)
rf_tuned_rmse

[1] 31853.99

summary_table <- tibble(
  Model = c(
    "Decision Tree", 
    "Decision Tree",
    "Random Forest",
    "Random Forest"
  ),
  CV = c(
    "No", 
    "5-fold CV",
    "No",
    "5-fold CV"
  ),
  Hyperparameters = c(
    paste("cp =", tree_default$bestTune$cp),
    paste("cp =", tree_tuned$bestTune$cp),
    paste("mtry =", rf_default$bestTune$mtry),
    paste("mtry =", rf_tuned$bestTune$mtry)
  ),
  Test_RMSE = c(
    tree_default_rmse,
    tree_tuned_rmse,
    rf_default_rmse,
    rf_tuned_rmse
  )
)

summary_table

# A tibble: 4 × 4
  Model         CV        Hyperparameters          Test_RMSE
  <chr>         <chr>     <chr>                        <dbl>
1 Decision Tree No        cp = 0.0691279615853155     53288.
2 Decision Tree 5-fold CV cp = 0.00700339876332952    43431.
3 Random Forest No        mtry = 17                   31825.
4 Random Forest 5-fold CV mtry = 17                   31854.

Findings and Reflections

1. Including Neighborhood improved predictive accuracy.

Adding a spatially meaningful variable had a major impact on model performance relative to the earlier-week models that lacked it. As in the Boston housing example, location-based predictors help the model capture neighborhood-level price patterns that structural variables alone miss. All four models in this report include Neighborhood, so the size of that gain is not measured directly here.

2. Random Forest outperformed the Decision Tree in all scenarios.

This was expected, but the size of the gap was still striking:

Single decision trees had test RMSE of about $53.3k (default) and $43.4k (tuned)
Random Forest models brought test RMSE down to about $31.8k

The ensemble approach averages many trees, which stabilizes predictions and captures nonlinear relationships far better than a single CART model.

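3. Hyperparameter tuning helped the Decision Tree substantially but made little difference for Random Forest.

Decision Tree: the default grid tried only three large cp values and settled on cp = 0.069, which prunes the tree heavily and underfits. Searching ten values with tuneLength = 10 found cp = 0.007, a larger tree that lowered test RMSE by almost $10k (53,288 to 43,431)
Random Forest: with tuneLength = 3, the tuned run searched the same mtry grid (2, 17, 32) as the default and selected the same value, mtry = 17. Test RMSE was essentially unchanged (31,854 vs. 31,825), a difference small enough to be attributable to randomness in resampling and tree construction
Resampling: both the bootstrap and the 5-fold CV estimates tracked test RMSE closely for Random Forest (within about $350) and less closely for the Decision Trees (within about $4k)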
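
The size of each tuning effect can be checked directly from the four test RMSE values reported above. This is plain arithmetic on the numbers in the summary table. The vector name test_rmse and the gain variables are illustrative.

```r
# Test RMSE values copied from the summary table.
test_rmse <- c(
  tree_default = 53287.81,
  tree_tuned   = 43430.69,
  rf_default   = 31824.78,
  rf_tuned     = 31853.99
)

# Positive gain means tuning lowered test RMSE.
tree_gain <- test_rmse[["tree_default"]] - test_rmse[["tree_tuned"]]
rf_gain   <- test_rmse[["rf_default"]]   - test_rmse[["rf_tuned"]]

round(c(tree_gain = tree_gain, rf_gain = rf_gain), 2)

# Decision Tree gain as a percentage of the default model's RMSE.
round(100 * tree_gain / test_rmse[["tree_default"]], 1)
```

The Decision Tree gain is 9857.12 (about 18.5% of the default RMSE), while the Random Forest "gain" is -29.21, meaning the tuned forest was marginally worse on the test set.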

4. Performance ranking:

Based on Test RMSE, lowest to highest:

Random Forest (default): lowest test RMSE (31,825)
Random Forest (tuned): effectively tied with the default RF (31,854)
Decision Tree (tuned): better than the default tree; a smaller cp helps (43,431)
Decision Tree (default): highest error (53,288); underfits

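Key Takeaway

Models that combine spatial information with an ensemble method produced the strongest predictions on the Ames dataset. Hyperparameter tuning mattered most for the single Decision Tree, and resampling (bootstrap or cross-validation) gave a useful preview of test error.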