library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5      ✔ recipes      1.0.10
## ✔ dials        1.2.1      ✔ rsample      1.2.0 
## ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.6      ✔ tune         1.1.2 
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
## ✔ parsnip      1.2.0      ✔ workflowsets 1.0.1 
## ✔ purrr        1.0.2      ✔ yardstick    1.3.0
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'rsample' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Part 1: Feature Engineering

1. Using the boston.csv data, fill in the blanks to:

a. import the boston.csv data,
b. create a 70-30 train/test dataset split,
c. fit a multiple linear regression model using all predictors in their current state (without any feature engineering),
d. and compute the RMSE on the test data.
What is the generalization RMSE?
# a. import the boston.csv data
boston <- read_csv("D:/UC Course/Spring 2024/Data mining/HW/Lab 10/boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# b. create a 70-30 train/test dataset split
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(split)
boston_test <- testing(split)

# c. fit a multiple linear regression model using all predictors
# in their current state (without any feature engineering)
boston_lm1 <- linear_reg() %>%
  fit(cmedv ~ ., data = boston_train)

# d. and compute the RMSE on the test data
boston_lm1 %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83
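
The generalization RMSE on the test data is about 4.83. As a cross-check, the same number can be obtained in one step with last_fit(), which refits on the training portion of the original rsample split object and evaluates on the test portion (a minimal sketch, assuming the split object from part b is still in scope):

# Fit on the training data and evaluate on the test data in one step
boston_last_fit <- linear_reg() %>%
  last_fit(cmedv ~ ., split = split)

# Test-set metrics (RMSE and R-squared by default)
collect_metrics(boston_last_fit)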

2. Now, fill in the blanks to apply feature engineering steps to standardize and normalize the numeric features prior to modeling. Your steps should include:
a. Create a recipe that includes two steps:
i. a Yeo-Johnson transformation to all numeric predictors to ensure the predictors follow a near-normal distribution and
ii. a standardization transformation to all numeric predictors to ensure our predictors have a mean of zero and standard deviation of 1.

b. Create a workflow object that contains the model and recipe,

c. Train the model,

d. and compute the RMSE on the test data

Does the model performance improve?

# a. Create a recipe
mlr_recipe <- recipe(cmedv ~ ., data = boston_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# b. Create a workflow object that contains the model and recipe
mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)

# c. Train the model
mlr_fit <- mlr_wflow %>%
  fit(data = boston_train)

# d. and compute the RMSE on the test data
mlr_fit %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.43
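
Yes, performance improves: the test RMSE drops from about 4.83 to 4.43. To verify what the recipe actually did, you can prep() it and bake() the training data; after the Yeo-Johnson and normalization steps, every numeric predictor should have a mean near zero and a standard deviation of 1 (a minimal sketch using the objects defined above):

# Estimate the recipe on the training data and apply it
baked_train <- mlr_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# Each transformed predictor should be centered and scaled
baked_train %>%
  summarize(across(-cmedv, list(mean = mean, sd = sd)))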

3. Recall the model we built in this section. We had to remove 5 troublesome predictor variables. These features caused a problem because of novel levels. For example, there is only one observation where MS_SubClass == One_and_Half_Story_PUD_All_Ages.
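
You can see how sparse these levels are by counting them; a level with only one or two observations can land entirely in the test split and show up as a novel level at prediction time (a quick check, not part of the fill-in exercise):

# The rarest levels of MS_SubClass in the full Ames data
AmesHousing::make_ames() %>%
  count(MS_SubClass, sort = TRUE) %>%
  tail()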

# Import Ames data and split into train/test
ames <- AmesHousing::make_ames()

set.seed(123) # for reproducibility
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)

# Remove trouble variables
trbl_vars <- c("MS_SubClass", "Condition_2", "Exterior_1st",
               "Exterior_2nd", "Misc_Feature")
ames_train_subset <- ames_train %>%
  select(-all_of(trbl_vars))
# Train the model without the above trouble variables
ames_lm1 <- linear_reg() %>%
  fit(Sale_Price ~ ., data = ames_train_subset)

# Compute test data generalization RMSE
ames_lm1 %>%
  predict(ames_test) %>%
  bind_cols(ames_test) %>%
  rmse(truth = Sale_Price, estimate = .pred)
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      22959.

Rather than remove the trouble variables, fill in the blanks to apply a feature engineering step to all nominal predictor variables to lump novel levels together so that you can use those variables as predictors. Use a threshold of 1% so that all categorical levels with an observation rate of less than 1% will be pooled together into an “other” level. Does the model performance improve?

mlr_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(all_nominal_predictors(), threshold = 0.01, other = "other")

mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)

mlr_fit <- mlr_wflow %>%
  fit(data = ames_train)

mlr_fit %>%
  predict(ames_test) %>%
  bind_cols(ames_test %>% select(Sale_Price)) %>%
  rmse(truth = Sale_Price, estimate = .pred)
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      25917.
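
No, performance does not improve here: the test RMSE rises from about 22,959 to 25,917, although lumping does let us keep all five trouble variables as predictors. To see which levels each step_other kept, you can pull the estimated recipe out of the fitted workflow (a minimal sketch; number = 1 refers to the only step in this recipe):

# Extract the trained recipe and list the levels retained by step_other;
# all other levels were pooled into "other"
mlr_fit %>%
  extract_recipe(estimated = TRUE) %>%
  tidy(number = 1)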

Part 2: Resampling

Use the Advertising.csv data to complete the following tasks in this section. The Advertising.csv data contains three predictor variables - TV, radio, and newspaper - which represent a company's advertising budget for these respective media across 200 metropolitan markets. It also contains the response variable - sales - which represents the total sales in thousands of units for a given product.

1. Fill in the blanks to split the data into 70-30 training-test sets.

# a. import the Advertising.csv data
advertising <- read_csv("D:/UC Course/Spring 2024/Data mining/HW/Lab 10/advertising.csv")
## Rows: 200 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): TV, radio, newspaper, sales
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# b. create a 70-30 train/test dataset split
set.seed(123)
split <- initial_split(advertising, prop = .7, strata = sales)
advertising_train <- training(split)
advertising_test <- testing(split)

2. Fill in the blanks to apply a 10-fold cross-validation procedure that models sales as a function of all three predictors (without any feature engineering).

# create 10-fold cross-validation object
set.seed(123)
kfolds <- vfold_cv(advertising_train, v = 10, strata = sales)
# create recipe with no feature engineering steps
mlr_recipe <- recipe(sales ~ ., data = advertising_train)
# create workflow object
mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)
# fit our model across the 10-fold CV
mlr_fit_cv <- mlr_wflow %>%
  fit_resamples(kfolds)

3. What is the average cross-validation RMSE?

collect_metrics(mlr_fit_cv)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.65     10  0.154  Preprocessor1_Model1
## 2 rsq     standard   0.913    10  0.0166 Preprocessor1_Model1
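
The average cross-validation RMSE is about 1.65 (the mean column of the rmse row). If you want just that number, you can filter the summarized metrics (optional):

# Pull out only the RMSE row averaged across the ten folds
collect_metrics(mlr_fit_cv) %>%
  filter(.metric == "rmse")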

4. What is the range of cross-validation RMSE values across all ten folds?

collect_metrics(mlr_fit_cv, summarize = FALSE) %>%
  filter(.metric == "rmse")
## # A tibble: 10 × 5
##    id     .metric .estimator .estimate .config             
##    <chr>  <chr>   <chr>          <dbl> <chr>               
##  1 Fold01 rmse    standard       2.59  Preprocessor1_Model1
##  2 Fold02 rmse    standard       1.20  Preprocessor1_Model1
##  3 Fold03 rmse    standard       1.73  Preprocessor1_Model1
##  4 Fold04 rmse    standard       1.44  Preprocessor1_Model1
##  5 Fold05 rmse    standard       2.23  Preprocessor1_Model1
##  6 Fold06 rmse    standard       0.975 Preprocessor1_Model1
##  7 Fold07 rmse    standard       1.73  Preprocessor1_Model1
##  8 Fold08 rmse    standard       1.58  Preprocessor1_Model1
##  9 Fold09 rmse    standard       1.75  Preprocessor1_Model1
## 10 Fold10 rmse    standard       1.23  Preprocessor1_Model1
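
The fold-level RMSE values range from about 0.975 (Fold06) to 2.59 (Fold01). Rather than reading the range off the table, you can compute it directly (a small extension of the code above):

# Summarize the spread of the per-fold RMSE values
collect_metrics(mlr_fit_cv, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  summarize(min_rmse = min(.estimate), max_rmse = max(.estimate))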

5. Fill in the blanks to apply a bootstrap resampling procedure that models sales as a function of all three predictors (without any feature engineering). Use 10 bootstrap samples.

# create bootstrap sample object
set.seed(123)
bs_samples <- bootstraps(advertising_train, times = 10, strata = sales)
# fit our model across the bootstrapped samples
mlr_fit_bs <- mlr_wflow %>%
  fit_resamples(bs_samples)
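
Each bootstrap resample trains on a sample of the training data drawn with replacement (so it has as many rows as the training set, with some rows repeated) and is assessed on the out-of-bag rows that were never drawn. You can inspect one split to confirm this (not part of the exercise; analysis() and assessment() come from rsample):

# The analysis set matches the training-set size; the assessment set
# holds the out-of-bag rows
dim(analysis(bs_samples$splits[[1]]))
dim(assessment(bs_samples$splits[[1]]))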

6. What is the average bootstrap RMSE?

collect_metrics(mlr_fit_bs)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.83     10  0.0973 Preprocessor1_Model1
## 2 rsq     standard   0.885    10  0.0102 Preprocessor1_Model1

7. What is the range of bootstrap RMSE values across all ten bootstrap samples?

collect_metrics(mlr_fit_bs, summarize = FALSE) %>%
  filter(.metric == "rmse")
## # A tibble: 10 × 5
##    id          .metric .estimator .estimate .config             
##    <chr>       <chr>   <chr>          <dbl> <chr>               
##  1 Bootstrap01 rmse    standard        2.02 Preprocessor1_Model1
##  2 Bootstrap02 rmse    standard        2.02 Preprocessor1_Model1
##  3 Bootstrap03 rmse    standard        1.42 Preprocessor1_Model1
##  4 Bootstrap04 rmse    standard        1.61 Preprocessor1_Model1
##  5 Bootstrap05 rmse    standard        1.57 Preprocessor1_Model1
##  6 Bootstrap06 rmse    standard        1.93 Preprocessor1_Model1
##  7 Bootstrap07 rmse    standard        1.75 Preprocessor1_Model1
##  8 Bootstrap08 rmse    standard        2.51 Preprocessor1_Model1
##  9 Bootstrap09 rmse    standard        1.76 Preprocessor1_Model1
## 10 Bootstrap10 rmse    standard        1.75 Preprocessor1_Model1
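
The bootstrap RMSE values range from about 1.42 (Bootstrap03) to 2.51 (Bootstrap08), a somewhat tighter spread than the cross-validation folds (std_err 0.0973 vs. 0.154). As before, the range can be computed directly:

# Summarize the spread of the per-sample bootstrap RMSE values
collect_metrics(mlr_fit_bs, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  summarize(min_rmse = min(.estimate), max_rmse = max(.estimate))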