library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5      ✔ recipes      1.0.10
## ✔ dials        1.2.1      ✔ rsample      1.2.0
## ✔ dplyr        1.1.4      ✔ tibble       3.2.1
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1
## ✔ infer        1.0.6      ✔ tune         1.1.2
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4
## ✔ parsnip      1.2.0      ✔ workflowsets 1.0.1
## ✔ purrr        1.0.2      ✔ yardstick    1.3.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
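Note: several of the masked functions above (e.g., filter(), lag()) come up in this lab. If you want name collisions resolved in favor of tidymodels automatically, one option (not used in the original lab) is tidymodels_prefer():

# optional: resolve masking conflicts in favor of tidymodels
# (wraps the conflicted package mentioned in the message above)
tidymodels_prefer()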
Part 1: Feature Engineering
1. Using the boston.csv data, fill in the blanks to:
   a. import the boston.csv data,
   b. create a 70-30 train/test dataset split,
   c. fit a multiple linear regression model using all predictors in their current state (without any feature engineering),
   d. and compute the RMSE on the test data.
What is the generalization RMSE?
# a. import the boston.csv data
boston <- read_csv("D:/UC Course/Spring 2024/Data mining/HW/Lab 10/boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# b. create a 70-30 train/test dataset split
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(split)
boston_test <- testing(split)
# c. fit a multiple linear regression model using all predictors
# in their current state (without any feature engineering)
boston_lm1 <- linear_reg() %>%
  fit(cmedv ~ ., data = boston_train)
# d. and compute the RMSE on the test data
boston_lm1 %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83
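Side note: the same train-then-test pattern can be written more compactly with last_fit(), which fits the model on the training portion of a split and scores the held-out test portion in one call. A sketch, reusing the split object from step b:

# alternative shortcut: fit on training data, score on test data
boston_last_fit <- last_fit(linear_reg(), cmedv ~ ., split = split)
collect_metrics(boston_last_fit)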
2. Now, fill in the blanks to apply feature engineering steps to standardize and normalize the numeric features prior to modeling. Your steps should include:
   a. Create a recipe that includes two steps:
      i. a Yeo-Johnson transformation to all numeric predictors to ensure the predictors follow a near-normal distribution, and
      ii. a standardization transformation to all numeric predictors to ensure our predictors have a mean of zero and standard deviation of 1.
   b. Create a workflow object that contains the model and recipe,
   c. Train the model,
   d. and compute the RMSE on the test data.
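The filled-in answer code for this question is not shown in the original; the following is a minimal sketch of what the four blanks would look like, reusing boston_train, boston_test, and the linear_reg() specification from Question 1 (step_YeoJohnson() and step_normalize() are the standard recipes steps for these two transformations):

# a. recipe: Yeo-Johnson, then center/scale all numeric predictors
boston_recipe <- recipe(cmedv ~ ., data = boston_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
# b. workflow object containing the model and recipe
boston_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(boston_recipe)
# c. train the model
boston_lm2 <- boston_wflow %>%
  fit(data = boston_train)
# d. compute the RMSE on the test data
boston_lm2 %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)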
3. Recall the model we built in this section. We had to remove 5 troublesome predictor variables. These features caused a problem because of novel levels. For example, there is only one observation where MS_SubClass == One_and_Half_Story_PUD_All_Ages.
# Import Ames data and split into train/test
ames <- AmesHousing::make_ames()
set.seed(123) # for reproducibility
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
# Remove trouble variables
trbl_vars <- c("MS_SubClass", "Condition_2", "Exterior_1st",
               "Exterior_2nd", "Misc_Feature")
ames_train_subset <- ames_train %>%
  select(-all_of(trbl_vars))
# Train the model without the above trouble variables
ames_lm1 <- linear_reg() %>%
  fit(Sale_Price ~ ., data = ames_train_subset)
# Compute test data generalization RMSE
ames_lm1 %>%
  predict(ames_test) %>%
  bind_cols(ames_test) %>%
  rmse(truth = Sale_Price, estimate = .pred)
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      22959.
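Dropping the five variables works, but it discards information, and the rank-deficient-fit warning above suggests the remaining factors still contain levels the model cannot estimate cleanly. An alternative (a sketch, not part of the original answer) is to keep all predictors and let a recipe handle rare and novel factor levels with step_novel() and step_other():

# keep the trouble variables; give factor levels unseen in training
# a placeholder level, and pool rare levels (< 1% of training rows)
# into an "other" level
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.01)
ames_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(ames_recipe)
ames_lm2 <- fit(ames_wflow, data = ames_train)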
Part 2: Resampling
Use the Advertising.csv data to complete the following tasks in this section. The Advertising.csv data contains three predictor variables - TV, radio, and newspaper - which represent a company's advertising budget for these respective media across 200 metropolitan markets. It also contains the response variable - sales, which represents the total sales in thousands of units for a given product.
1. Fill in the blanks to split the data into 70-30 training-test sets.
# a. import the Advertising.csv data
advertising <- read_csv("D:/UC Course/Spring 2024/Data mining/HW/Lab 10/advertising.csv")
## Rows: 200 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): TV, radio, newspaper, sales
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# b. create a 70-30 train/test dataset split
set.seed(123)
split <- initial_split(advertising, prop = .7, strata = sales)
advertising_train <- training(split)
advertising_test <- testing(split)
2. Fill in the blanks to apply a 10-fold cross-validation procedure that models sales as a function of all three predictors (without any feature engineering).
# create 10-fold cross-validation object
set.seed(123)
kfolds <- vfold_cv(advertising_train, v = 10, strata = sales)
# create recipe with no feature engineering steps
mlr_recipe <- recipe(sales ~ ., data = advertising_train)
# create workflow object
mlr_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(mlr_recipe)
# fit our model across the 10-fold CV
mlr_fit_cv <- mlr_wflow %>%
  fit_resamples(kfolds)
3. What is the average cross-validation RMSE?
collect_metrics(mlr_fit_cv)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1 rmse    standard   1.65     10  0.154  Preprocessor1_Model1
## 2 rsq     standard   0.913    10  0.0166 Preprocessor1_Model1
4. What is the range of cross-validation RMSE values across all ten folds?
collect_metrics(mlr_fit_cv, summarize = FALSE) %>%
  filter(.metric == 'rmse')
## # A tibble: 10 × 5
##    id     .metric .estimator .estimate .config
##    <chr>  <chr>   <chr>          <dbl> <chr>
##  1 Fold01 rmse    standard       2.59  Preprocessor1_Model1
##  2 Fold02 rmse    standard       1.20  Preprocessor1_Model1
##  3 Fold03 rmse    standard       1.73  Preprocessor1_Model1
##  4 Fold04 rmse    standard       1.44  Preprocessor1_Model1
##  5 Fold05 rmse    standard       2.23  Preprocessor1_Model1
##  6 Fold06 rmse    standard       0.975 Preprocessor1_Model1
##  7 Fold07 rmse    standard       1.73  Preprocessor1_Model1
##  8 Fold08 rmse    standard       1.58  Preprocessor1_Model1
##  9 Fold09 rmse    standard       1.75  Preprocessor1_Model1
## 10 Fold10 rmse    standard       1.23  Preprocessor1_Model1
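Reading off the table, the cross-validation RMSE ranges from 0.975 (Fold06) to 2.59 (Fold01). To compute the range directly instead of scanning the table, a short sketch:

# min and max RMSE across the ten folds
collect_metrics(mlr_fit_cv, summarize = FALSE) %>%
  filter(.metric == 'rmse') %>%
  summarise(min_rmse = min(.estimate), max_rmse = max(.estimate))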
5. Fill in the blanks to apply a bootstrap resampling procedure that models sales as a function of all three predictors (without any feature engineering). Use 10 bootstrap samples.
# create bootstrap sample object
set.seed(123)
bs_samples <- bootstraps(advertising_train, times = 10, strata = sales)
# fit our model across the bootstrapped samples
mlr_fit_bs <- mlr_wflow %>%
  fit_resamples(bs_samples)
6. What is the average bootstrap RMSE?
collect_metrics(mlr_fit_bs)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1 rmse    standard   1.83     10  0.0973 Preprocessor1_Model1
## 2 rsq     standard   0.885    10  0.0102 Preprocessor1_Model1
7. What is the range of bootstrap RMSE values across all ten bootstrap samples?
collect_metrics(mlr_fit_bs, summarize = FALSE) %>%
  filter(.metric == 'rmse')
## # A tibble: 10 × 5
##    id          .metric .estimator .estimate .config
##    <chr>       <chr>   <chr>          <dbl> <chr>
##  1 Bootstrap01 rmse    standard        2.02 Preprocessor1_Model1
##  2 Bootstrap02 rmse    standard        2.02 Preprocessor1_Model1
##  3 Bootstrap03 rmse    standard        1.42 Preprocessor1_Model1
##  4 Bootstrap04 rmse    standard        1.61 Preprocessor1_Model1
##  5 Bootstrap05 rmse    standard        1.57 Preprocessor1_Model1
##  6 Bootstrap06 rmse    standard        1.93 Preprocessor1_Model1
##  7 Bootstrap07 rmse    standard        1.75 Preprocessor1_Model1
##  8 Bootstrap08 rmse    standard        2.51 Preprocessor1_Model1
##  9 Bootstrap09 rmse    standard        1.76 Preprocessor1_Model1
## 10 Bootstrap10 rmse    standard        1.75 Preprocessor1_Model1
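Here the bootstrap RMSE ranges from 1.42 (Bootstrap03) to 2.51 (Bootstrap08); the same summarise() pattern as above computes it directly:

# min and max RMSE across the ten bootstrap samples
collect_metrics(mlr_fit_bs, summarize = FALSE) %>%
  filter(.metric == 'rmse') %>%
  summarise(min_rmse = min(.estimate), max_rmse = max(.estimate))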