library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5      ✔ recipes      1.0.10
## ✔ dials        1.2.1      ✔ rsample      1.2.0 
## ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.6      ✔ tune         1.1.2 
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
## ✔ parsnip      1.2.0      ✔ workflowsets 1.0.1 
## ✔ purrr        1.0.2      ✔ yardstick    1.3.0
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'rsample' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Part 1: Feature Engineering

1. Using the boston.csv data, fill in the blanks to:

a. import the boston.csv data,
b. create a 70-30 train/test dataset split,
c. fit a multiple linear regression model using all predictors in their current state (without any feature engineering),
d. and compute the RMSE on the test data.
What is the generalization RMSE?
# a. import the boston.csv data
boston <- read_csv("D:/UC Course/Spring 2024/Data mining/HW/Lab 10/boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# b. create a 70-30 train/test dataset split
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(split)
boston_test <- testing(split)

# c. fit a multiple linear regression model using all predictors
# in their current state (without any feature engineering)
boston_lm1 <- linear_reg() %>%
  fit(cmedv ~ ., data = boston_train)

# d. and compute the RMSE on the test data
boston_lm1 %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83
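
The generalization RMSE on the test data is about 4.83. As a cross-check, the same number can be obtained in one step with last_fit(), which refits on the training portion of the original rsample split object and evaluates on the test portion (a minimal sketch, assuming the split object from part b is still in scope):

# Fit on the training data and evaluate on the test data in one step
boston_last_fit <- linear_reg() %>%
  last_fit(cmedv ~ ., split = split)

# Test-set metrics (RMSE and R-squared by default)
collect_metrics(boston_last_fit)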

2. Now, fill in the blanks to apply feature engineering steps to standardize and normalize the numeric features prior to modeling. Your steps should include:
a. Create a recipe that includes two steps:
i. a Yeo-Johnson transformation to all numeric predictors to ensure the predictors follow a near-normal distribution and
ii. a standardization transformation to all numeric predictors to ensure our predictors have a mean of zero and standard deviation of 1.

b. Create a workflow object that contains the model and recipe,

c. Train the model,

d. and compute the RMSE on the test data

Does the model performance improve?

# a. Create a recipe
mlr_recipe <- recipe(cmedv ~ ., data = boston_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# b. Create a workflow object that contains the model and recipe
mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)

# c. Train the model
mlr_fit <- mlr_wflow %>%
  fit(data = boston_train)

# d. and compute the RMSE on the test data
mlr_fit %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.43
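
Yes, performance improves: the test RMSE drops from about 4.83 to 4.43. To verify what the recipe actually did, you can prep() it and bake() the training data; after the Yeo-Johnson and normalization steps, every numeric predictor should have a mean near zero and a standard deviation of 1 (a minimal sketch using the objects defined above):

# Estimate the recipe on the training data and apply it
baked_train <- mlr_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# Each transformed predictor should be centered and scaled
baked_train %>%
  summarize(across(-cmedv, list(mean = mean, sd = sd)))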

3. Recall the model we built in this section. We had to remove 5 troublesome predictor variables. These features caused a problem because of novel levels. For example, there is only one observation where MS_SubClass == One_and_Half_Story_PUD_All_Ages.
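
You can see how sparse these levels are by counting them; a level with only one or two observations can land entirely in the test split and show up as a novel level at prediction time (a quick check, not part of the fill-in exercise):

# The rarest levels of MS_SubClass in the full Ames data
AmesHousing::make_ames() %>%
  count(MS_SubClass, sort = TRUE) %>%
  tail()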

# Import Ames data and split into train/test
ames <- AmesHousing::make_ames()

set.seed(123) # for reproducibility
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)

# Remove trouble variables
trbl_vars <- c("MS_SubClass", "Condition_2", "Exterior_1st",
               "Exterior_2nd", "Misc_Feature")
ames_train_subset <- ames_train %>%
  select(-all_of(trbl_vars))
# Train the model without the above trouble variables
ames_lm1 <- linear_reg() %>%
  fit(Sale_Price ~ ., data = ames_train_subset)

# Compute test data generalization RMSE
ames_lm1 %>%
  predict(ames_test) %>%
  bind_cols(ames_test) %>%
  rmse(truth = Sale_Price, estimate = .pred)
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      22959.

Rather than remove the trouble variables, fill in the blanks to apply a feature engineering step to all nominal predictor variables to lump novel levels together so that you can use those variables as predictors. Use a threshold of 1% so that all categorical levels with an observation rate of less than 1% will be pooled together into an “other” level. Does the model performance improve?

mlr_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(all_nominal_predictors(), threshold = 0.01, other = "other")

mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)

mlr_fit <- mlr_wflow %>%
  fit(data = ames_train)

mlr_fit %>%
  predict(ames_test) %>%
  bind_cols(ames_test %>% select(Sale_Price)) %>%
  rmse(truth = Sale_Price, estimate = .pred)
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      25917.
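
No, performance does not improve here: the test RMSE rises from about 22,959 to 25,917, although lumping does let us keep all five trouble variables as predictors. To see which levels each step_other kept, you can pull the estimated recipe out of the fitted workflow (a minimal sketch; number = 1 refers to the only step in this recipe):

# Extract the trained recipe and list the levels retained by step_other;
# all other levels were pooled into "other"
mlr_fit %>%
  extract_recipe(estimated = TRUE) %>%
  tidy(number = 1)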

Part 2: Resampling

Use the Advertising.csv data to complete the following tasks in this section. The Advertising.csv data contains three predictor variables - TV, radio, and newspaper - which represent a company's advertising budget for these respective media across 200 metropolitan markets. It also contains the response variable - sales - which represents the total sales in thousands of units for a given product.

1. Fill in the blanks to split the data into 70-30 training-test sets.

# a. import the Advertising.csv data
advertising <- read_csv("D:/UC Course/Spring 2024/Data mining/HW/Lab 10/advertising.csv")
## Rows: 200 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): TV, radio, newspaper, sales
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# b. create a 70-30 train/test dataset split
set.seed(123)
split <- initial_split(advertising, prop = .7, strata = sales)
advertising_train <- training(split)
advertising_test <- testing(split)

2. Fill in the blanks to apply a 10-fold cross-validation procedure that models sales as a function of all three predictors (without any feature engineering).

# create 10-fold cross-validation object
set.seed(123)
kfolds <- vfold_cv(advertising_train, v = 10, strata = sales)
# create recipe with no feature engineering steps
mlr_recipe <- recipe(sales ~ ., data = advertising_train)
# create workflow object
mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)
# fit our model across the 10-fold CV
mlr_fit_cv <- mlr_wflow %>%
  fit_resamples(kfolds)

3. What is the average cross-validation RMSE?

collect_metrics(mlr_fit_cv)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.65     10  0.154  Preprocessor1_Model1
## 2 rsq     standard   0.913    10  0.0166 Preprocessor1_Model1
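
The average cross-validation RMSE is about 1.65 (the mean column of the rmse row). If you want just that number, you can filter the summarized metrics (optional):

# Pull out only the RMSE row averaged across the ten folds
collect_metrics(mlr_fit_cv) %>%
  filter(.metric == "rmse")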

4. What is the range of cross-validation RMSE values across all ten folds?

collect_metrics(mlr_fit_cv, summarize = FALSE) %>%
  filter(.metric == "rmse")
## # A tibble: 10 × 5
##    id     .metric .estimator .estimate .config             
##    <chr>  <chr>   <chr>          <dbl> <chr>               
##  1 Fold01 rmse    standard       2.59  Preprocessor1_Model1
##  2 Fold02 rmse    standard       1.20  Preprocessor1_Model1
##  3 Fold03 rmse    standard       1.73  Preprocessor1_Model1
##  4 Fold04 rmse    standard       1.44  Preprocessor1_Model1
##  5 Fold05 rmse    standard       2.23  Preprocessor1_Model1
##  6 Fold06 rmse    standard       0.975 Preprocessor1_Model1
##  7 Fold07 rmse    standard       1.73  Preprocessor1_Model1
##  8 Fold08 rmse    standard       1.58  Preprocessor1_Model1
##  9 Fold09 rmse    standard       1.75  Preprocessor1_Model1
## 10 Fold10 rmse    standard       1.23  Preprocessor1_Model1
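
The fold-level RMSE values range from about 0.975 (Fold06) to 2.59 (Fold01). Rather than reading the range off the table, you can compute it directly (a small extension of the code above):

# Summarize the spread of the per-fold RMSE values
collect_metrics(mlr_fit_cv, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  summarize(min_rmse = min(.estimate), max_rmse = max(.estimate))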

5. Fill in the blanks to apply a bootstrap resampling procedure that models sales as a function of all three predictors (without any feature engineering). Use 10 bootstrap samples.

# create bootstrap sample object
set.seed(123)
bs_samples <- bootstraps(advertising_train, times = 10, strata = sales)
# fit our model across the bootstrapped samples
mlr_fit_bs <- mlr_wflow %>%
  fit_resamples(bs_samples)
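
Each bootstrap resample trains on a sample of the training data drawn with replacement (so it has as many rows as the training set, with some rows repeated) and is assessed on the out-of-bag rows that were never drawn. You can inspect one split to confirm this (not part of the exercise; analysis() and assessment() come from rsample):

# The analysis set matches the training-set size; the assessment set
# holds the out-of-bag rows
dim(analysis(bs_samples$splits[[1]]))
dim(assessment(bs_samples$splits[[1]]))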

6. What is the average bootstrap RMSE?

collect_metrics(mlr_fit_bs)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.83     10  0.0973 Preprocessor1_Model1
## 2 rsq     standard   0.885    10  0.0102 Preprocessor1_Model1

7. What is the range of bootstrap RMSE values across all ten bootstrap samples?

collect_metrics(mlr_fit_bs, summarize = FALSE) %>%
  filter(.metric == "rmse")
## # A tibble: 10 × 5
##    id          .metric .estimator .estimate .config             
##    <chr>       <chr>   <chr>          <dbl> <chr>               
##  1 Bootstrap01 rmse    standard        2.02 Preprocessor1_Model1
##  2 Bootstrap02 rmse    standard        2.02 Preprocessor1_Model1
##  3 Bootstrap03 rmse    standard        1.42 Preprocessor1_Model1
##  4 Bootstrap04 rmse    standard        1.61 Preprocessor1_Model1
##  5 Bootstrap05 rmse    standard        1.57 Preprocessor1_Model1
##  6 Bootstrap06 rmse    standard        1.93 Preprocessor1_Model1
##  7 Bootstrap07 rmse    standard        1.75 Preprocessor1_Model1
##  8 Bootstrap08 rmse    standard        2.51 Preprocessor1_Model1
##  9 Bootstrap09 rmse    standard        1.76 Preprocessor1_Model1
## 10 Bootstrap10 rmse    standard        1.75 Preprocessor1_Model1
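
The bootstrap RMSE values range from about 1.42 (Bootstrap03) to 2.51 (Bootstrap08), a somewhat tighter spread than the cross-validation folds (std_err 0.0973 vs. 0.154). As before, the range can be computed directly:

# Summarize the spread of the per-sample bootstrap RMSE values
collect_metrics(mlr_fit_bs, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  summarize(min_rmse = min(.estimate), max_rmse = max(.estimate))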