In this assignment, I will be using the Tidymodels framework instead of base R.

1 Exercise 1 (10 points)

Suppose we fit a curve with basis functions \(b_1(X) = X\), \(b_2(X) = (X - 1)^2 I(X \geq 1)\). Note that \(I(X \geq 1)\) equals 1 for \(X \geq 1\) and 0 otherwise. We fit the linear regression model

\[Y = \beta_0 + \beta_1b_1(X) + \beta_2b_2(X) + \varepsilon\]

and obtain the coefficient estimates \(\hat \beta_0 = 1\), \(\hat \beta_1 = 1\), \(\hat \beta_2 = -2\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes and other relevant information.

X <- seq(-2, 2, 0.01)
Y <- 1 + X - 2 * (X - 1)^2 * (X >= 1)  # beta0 + beta1 * b1(X) + beta2 * b2(X)
df1 <- tibble(X, Y)

ggplot(df1, aes(X, Y)) + 
  geom_vline(xintercept = 0) + 
  geom_hline(yintercept = 0) + 
  geom_vline(xintercept = 1, col = "red") +  # knot at X = 1
  geom_line(size = 3, color = "green") +
  theme_bw()
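
Substituting the estimates into the basis functions gives the fitted curve in piecewise form (it is continuous at the knot \(X = 1\)):

\[\hat f(X) = \begin{cases} 1 + X, & -2 \leq X < 1 \\ 1 + X - 2(X - 1)^2, & 1 \leq X \leq 2 \end{cases}\]

So the curve is a line with intercept 1 and slope 1 up to the knot at \(X = 1\), after which the \(-2(X - 1)^2\) term bends it downward quadratically.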

2 Exercise 2 (10 points)

Suppose we fit a curve with basis functions \(b_1(X) = I(0 \leq X \leq 2) - (X-1)I(1 \leq X \leq 2)\), \(b_2(X) = (X - 3) I(3 \leq X \leq 4) + I(4 < X \leq 5)\). We fit the linear regression model

\[Y = \beta_0 + \beta_1b_1(X) + \beta_2b_2(X) + \varepsilon\] and obtain the coefficient estimates \(\hat \beta_0 = 1\), \(\hat \beta_1 = 1\), \(\hat \beta_2 = 3\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes and other relevant information.

X <- seq(-2, 2, 0.01)
# beta0 + beta1 * b1(X) + beta2 * b2(X); the b2 terms are zero on [-2, 2]
Y <- 1 + (X >= 0 & X <= 2) - (X - 1)*(X >= 1 & X <= 2) + 
     3*(X - 3)*(X >= 3 & X <= 4) + 3*(X > 4 & X <= 5)

df2 <- tibble(X, Y)

ggplot(df2, aes(X, Y)) + 
  geom_vline(xintercept = 0) + 
  geom_hline(yintercept = 0) + 
  geom_line(size = 3, color = "blue") +
  theme_bw()
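
On the sketching range \(-2 \leq X \leq 2\), \(b_2(X)\) is identically zero because its indicator functions only switch on for \(3 \leq X \leq 5\), so only \(\hat\beta_0\) and \(\hat\beta_1 b_1(X)\) contribute:

\[\hat f(X) = \begin{cases} 1, & -2 \leq X < 0 \\ 2, & 0 \leq X < 1 \\ 2 - (X - 1), & 1 \leq X \leq 2 \end{cases}\]

That is, the curve is constant at 1 for negative \(X\), jumps to the constant 2 on \([0, 1)\), and then decreases linearly with slope \(-1\) from 2 down to 1 on \([1, 2]\).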

3 Exercise 3 (10 points)

Explain what happens to the bias/variance trade-off of our model estimates when we use regression splines.

Regression splines are a more flexible method because, instead of fitting one polynomial over the whole range of the data, they fit a combination of linear/polynomial pieces between knots. In general, as we use more flexible methods, the variance increases, because a change in the training data is more likely to produce a noticeably different set of model estimates, while the bias decreases, because a more flexible model can follow the true shape of the data more closely. Therefore, compared to linear regression, using regression splines increases variance and decreases bias. Bear in mind, however, that beyond some point additional flexibility has little impact on the bias but starts to significantly increase the variance.
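
As a quick illustration of the variance side (my own simulated example, not part of the assignment data), refitting a low-df and a high-df cubic regression spline on repeated draws from the same data-generating process shows how much more the flexible fit's prediction at a fixed point varies across training sets:

library(splines)

set.seed(1)
pred_at_half <- function(spline_df) {
  # refit a cubic regression spline 200 times on fresh simulated data
  # and record the fitted value at x = 0.5 each time
  replicate(200, {
    x <- runif(100)
    y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
    fit <- lm(y ~ bs(x, df = spline_df))
    predict(fit, newdata = data.frame(x = 0.5))
  })
}

sd(pred_at_half(spline_df = 3))   # less flexible spline: smaller spread
sd(pred_at_half(spline_df = 15))  # more flexible spline: larger spread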

4 Exercise 4 (10 points)

Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including regions \(R_1, R_2, ...\), the cut points \(t_1, t_2, ...\), and so forth.

5 Exercise 5 (10 points)

Provide a detailed explanation of the algorithm that is used to fit a regression tree.

Decision trees are constructed by an algorithm that repeatedly splits the data set into two parts based on simple conditions on the predictors. When the target takes discrete values (such as dummy variables) we call the result a classification tree; when the target takes continuous values, as here, it is a regression tree. The fitted tree consists of a series of splitting rules, starting at the top of the tree. The algorithm begins with the top split, where it assigns observations to one of two branches (recursive binary splitting): it considers every predictor and every possible cut point, and selects the split that minimizes the sum of the squared deviations of the response from its mean within each of the two resulting partitions. The same splitting rule is then applied to each of the new branches, and the process continues until each node reaches a user-specified minimum node size and becomes a terminal node; each terminal node predicts the mean response of the training observations that fall into it.
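
A minimal sketch of that core step for a single numeric predictor (my own simplification, not the full algorithm): scan candidate cut points and keep the one minimizing the combined residual sum of squares of the two resulting groups. The full algorithm repeats this over every predictor at every node and then recurses into each branch.

best_split <- function(x, y) {
  # candidate cut points: midpoints between adjacent sorted predictor values
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  # total RSS of the two groups produced by each candidate split
  rss <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(rss)]
}

# usage: best_split(x = a_numeric_predictor, y = the_response)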

6 Exercise 6 (10 points)

Explain the difference between bagging, boosting, and random forests.

Bagging entails using the bootstrap to make numerous copies of the original training data set, fitting a separate decision tree to each copy, and then combining all of the trees (by averaging their predictions) into a single predictive model. Each tree is grown independently of the others on its own bootstrap data set.

Boosting operates in a similar way, but the trees are grown sequentially, with each tree using information from the trees grown before it. Instead of bootstrap sampling, boosting fits each tree to a modified version of the original data set (for regression, essentially to the residuals left by the current ensemble).

Random forests improve on bagged trees through a small tweak that decorrelates the trees: each time a split in a tree is considered, only a random sample of m predictors from the full set of p predictors is allowed as split candidates.

In summary, bagging focuses on reducing the variance of a single decision tree; a random forest is an extension of bagging that considers a random subset of the predictors, rather than all of them, at each split; and boosting aims to produce a strong model that is less biased than its components (even if variance can also be reduced).
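
As a sketch of how the three methods could be specified with the parsnip functions used elsewhere in this document (the mtry and learn_rate values are illustrative assumptions, and the xgboost engine is assumed to be installed; with the 13 Boston predictors used later, mtry = 13 reduces the forest to bagging, while mtry < 13 gives a genuine random forest):

bagging_spec <- rand_forest(mtry = 13, trees = 500) %>%      # mtry = p: bagging
  set_engine("randomForest") %>%
  set_mode("regression")

rf_spec <- rand_forest(mtry = 4, trees = 500) %>%            # mtry < p: random forest
  set_engine("randomForest") %>%
  set_mode("regression")

boost_spec <- boost_tree(trees = 500, learn_rate = 0.1) %>%  # trees grown sequentially
  set_engine("xgboost") %>%
  set_mode("regression")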

7 Exercise 7 (20 points)

You will be using the Boston data found here. The response is medv and the remaining variables are predictors.

7.1 Read the Data

The range of medv is from 5 to 50.

Boston <- read_csv("./data/Boston.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   crim = col_double(),
##   zn = col_double(),
##   indus = col_double(),
##   chas = col_double(),
##   nox = col_double(),
##   rm = col_double(),
##   age = col_double(),
##   dis = col_double(),
##   rad = col_double(),
##   tax = col_double(),
##   ptratio = col_double(),
##   black = col_double(),
##   lstat = col_double(),
##   medv = col_double()
## )
str(Boston)
## spec_tbl_df [506 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ crim   : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : num [1:506] 0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num [1:506] 6.58 6.42 7.18 7 7.15 ...
##  $ age    : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num [1:506] 1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num [1:506] 397 397 393 395 397 ...
##  $ lstat  : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   crim = col_double(),
##   ..   zn = col_double(),
##   ..   indus = col_double(),
##   ..   chas = col_double(),
##   ..   nox = col_double(),
##   ..   rm = col_double(),
##   ..   age = col_double(),
##   ..   dis = col_double(),
##   ..   rad = col_double(),
##   ..   tax = col_double(),
##   ..   ptratio = col_double(),
##   ..   black = col_double(),
##   ..   lstat = col_double(),
##   ..   medv = col_double()
##   .. )
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Do test-training split as usual, and fit a random forest model or boosted tree (your choice) and a linear regression model.

7.2 Split the Data

set.seed(1234)  # set the seed before splitting so the split is reproducible
Boston_split <- initial_split(Boston)
Boston_train <- training(Boston_split)
Boston_test <- testing(Boston_split)

The random forest or boosted tree model has a selection of hyper-parameters that you can tune to improve performance. Perform hyperparameter tuning using k-fold cross-validation to find a model with good predictive power. How does this model compare to the linear regression model?

7.3 Random Forest Model

We want to find the optimal minimum number of data points per node (min_n), so we mark that argument with tune().

rand_forest_randomForest_spec <-
  rand_forest(min_n = tune()) %>% # mtry = tune(), min_n = tune()
  set_engine('randomForest') %>%
  set_mode('regression')

7.4 Recipe and Workflow

rec_Boston <- recipe(medv ~ ., data = Boston_train)
rf_wf <- workflow() %>%
  add_recipe(rec_Boston) %>%
  add_model(rand_forest_randomForest_spec)

The tuning grid lets the minimum node size take values between 2 and 40 (the default range of min_n()), at 30 levels, so the tuning step can find the best value.

param_grid <- grid_regular(min_n(), levels = 30)
param_grid
## # A tibble: 30 x 1
##    min_n
##    <int>
##  1     2
##  2     3
##  3     4
##  4     5
##  5     7
##  6     8
##  7     9
##  8    11
##  9    12
## 10    13
## # … with 20 more rows

7.5 k-Fold Cross-Validation

Create the cross-validation folds to use in the tune_grid() call below; the default number of folds is 10.

set.seed(4321)
cv <- vfold_cv(Boston_train, strata = medv, v = 10)  # resample the training data only

7.6 Tune

tune_res <- tune_grid(
  object = rf_wf,
  resamples = cv,
  grid = param_grid, 
  control = control_grid(verbose = TRUE)
)
autoplot(tune_res)
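
Besides the autoplot, the resampling results can be inspected numerically; for example, the standard tune helpers below list the cross-validated RMSE for each candidate min_n and the top few values:

collect_metrics(tune_res) %>%
  filter(.metric == "rmse") %>%
  arrange(mean)

show_best(tune_res, metric = "rmse", n = 5)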

7.7 Fit the Model

The best minimum node size found by cross-validation is 2, so we finalize the workflow with min_n = 2 and fit it on the training data.

best <- select_best(tune_res, metric = "rmse")
best
## # A tibble: 1 x 2
##   min_n .config              
##   <int> <chr>                
## 1     2 Preprocessor1_Model01
rf_final <- finalize_workflow(rf_wf, best)

rf_fit <- fit(rf_final, data = Boston_train)
rf_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, nodesize = min_rows(~2L,      x)) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 10.27795
##                     % Var explained: 88.39

7.8 Random Forest Evaluation

The test RMSE of about 3.68 indicates that the fitted model performs well. Furthermore, the plot shows that the vast majority of predicted medv values lie close to the observed values.

augment(rf_fit, new_data = Boston_test) %>%
  rmse(truth = medv, estimate = .pred) %>%
  mutate(description = "random forest model") -> rf_model
rf_model
## # A tibble: 1 x 4
##   .metric .estimator .estimate description        
##   <chr>   <chr>          <dbl> <chr>              
## 1 rmse    standard        3.68 random forest model
augment(rf_fit, new_data = Boston_test) %>%
  ggplot(aes(medv, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)

7.9 Linear Regression Model

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

7.10 Fit the Model

lm_fit <- lm_spec %>%
  fit(medv ~., data = Boston_train)

lm_fit$fit %>%
  summary()
## 
## Call:
## stats::lm(formula = medv ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7277  -2.7062  -0.6725   1.8547  26.3257 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.581794   5.937320   4.309 2.11e-05 ***
## crim         -0.049804   0.047805  -1.042 0.298190    
## zn            0.035917   0.014987   2.397 0.017052 *  
## indus         0.050988   0.069906   0.729 0.466229    
## chas          3.153927   0.981519   3.213 0.001429 ** 
## nox         -14.228948   4.268185  -3.334 0.000944 ***
## rm            5.213392   0.486633  10.713  < 2e-16 ***
## age          -0.019362   0.014764  -1.311 0.190521    
## dis          -1.335964   0.218630  -6.111 2.54e-09 ***
## rad           0.284195   0.082618   3.440 0.000649 ***
## tax          -0.013883   0.004673  -2.971 0.003165 ** 
## ptratio      -0.977408   0.144382  -6.770 5.15e-11 ***
## black         0.011467   0.002981   3.846 0.000142 ***
## lstat        -0.454316   0.059834  -7.593 2.63e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.569 on 366 degrees of freedom
## Multiple R-squared:  0.7729, Adjusted R-squared:  0.7648 
## F-statistic: 95.82 on 13 and 366 DF,  p-value: < 2.2e-16

8 Linear Regression Model Evaluation

The linear regression model also performs reasonably, although its test RMSE of about 5.53 is noticeably higher than the random forest's.

augment(lm_fit, new_data = Boston_test) %>%
  rmse(truth = medv, estimate = .pred) %>%
  mutate(description = "linear regression model") -> lm_model
lm_model
## # A tibble: 1 x 4
##   .metric .estimator .estimate description            
##   <chr>   <chr>          <dbl> <chr>                  
## 1 rmse    standard        5.53 linear regression model
augment(lm_fit, new_data = Boston_test) %>%
  ggplot(aes(medv, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)

8.1 Comparison

Finally, because the random forest model has the smaller test root mean squared error (3.68 vs. 5.53), it outperforms the linear regression model.

bind_rows(rf_model, lm_model)
## # A tibble: 2 x 4
##   .metric .estimator .estimate description            
##   <chr>   <chr>          <dbl> <chr>                  
## 1 rmse    standard        3.68 random forest model    
## 2 rmse    standard        5.53 linear regression model
