Tidymodels_01

Kate C

2022-01-28

Load Packages and Dataset

Packages used to import and analyse the data include

  • tidymodels - a collection of R packages designed to support machine learning model development

    • rsample - data sampling functions that create random subsets of a dataset for different stages of the modelling process, e.g. splitting data into training and test datasets (75/25 by default), which guards against overfitting

    • recipes - functions for transforming data before modelling, also known as feature engineering

    • parsnip - specifying and fitting models as well as obtaining model predictions

    • tune and dials - functionality for fine-tuning models in order to achieve optimal prediction accuracy

    • yardstick - functions for evaluating the quality of model predictions

  • summarized in a table:

    Stage                                      Packages
    -----------------------------------------  --------------------
    Data resampling and feature engineering    rsample, recipes
    Model fitting and tuning                   parsnip, tune, dials
    Model evaluation                           yardstick

  • data set - home_sales (loaded from an .rds file)
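
A minimal setup sketch (the actual file path is not shown in these notes, so "home_sales.rds" below is an assumed name):

library(tidymodels) # loads rsample, recipes, parsnip, tune, dials, yardstick, and more

home_sales <- readRDS("home_sales.rds") # file name assumed for illustration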

A. Key points

  • tidymodels is primarily used for supervised machine learning, where algorithms learn patterns from labeled data (mainly two types: regression and classification)

  • stratification when sampling

    • used to reduce sampling bias in the test set

    • creates a test set whose distribution best represents the entire population being studied

    • stratified random sampling differs from simple random sampling, which draws observations at random from the entire population so that each possible sample is equally likely to occur

  • data resampling with tidymodels

    • initial_split(): specifies instructions for creating the training and test datasets. Use the outcome as the strata argument; prop specifies the proportion of data placed into training.

    • pass the split object to training() to create the training dataset and to testing() to create the test dataset

B. Practice

  • Data preparation for machine learning in tidymodels
# create an rsample object, home_split, that contains the instructions for randomly splitting home_sales
# allocate 70% of the data to training and stratify by selling_price (the outcome variable)
home_split <- initial_split(home_sales,
                            prop = 0.7,
                            strata = selling_price)
# create training and testing datasets
home_training <- home_split %>% 
  training()
home_test <- home_split %>% 
  testing()
# check number of rows in each dataset
nrow(home_training)
## [1] 1042
nrow(home_test)
## [1] 450
  • check key statistics in both the training and test datasets - they should be close, since we stratified by the outcome
home_training %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 × 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         478909.        80735.
home_test %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 × 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         479491.        81629.

Fitting a linear regression model

using the parsnip package within tidymodels

  • specify a linear model
linear_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
  • train the model to predict selling_price from home_age and sqft_living, using the training dataset
lm_fit <- linear_model %>% 
  fit(selling_price ~ home_age + sqft_living, data = home_training)
  • print lm_fit to view model information
lm_fit
## parsnip model object
## 
## Fit time:  3ms 
## 
## Call:
## stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)
## 
## Coefficients:
## (Intercept)     home_age  sqft_living  
##    291325.7      -1545.3        103.5
tidy(lm_fit)
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  291326.   7689.       37.9  5.24e-198
## 2 home_age      -1545.    177.       -8.74 9.47e- 18
## 3 sqft_living     103.      2.79     37.1  2.60e-192
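tidy() returns coefficient-level detail; for model-level fit statistics (R squared, AIC, and so on), broom's glance() should also work on the parsnip fit, a quick sketch:

glance(lm_fit) # one-row tibble of model-level summary statistics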
  • generate predictions on the test data and combine them with the test dataset
home_predictions <- predict(lm_fit, new_data = home_test)

home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  cbind(home_predictions)

head(home_test_results)
##   selling_price home_age sqft_living    .pred
## 1        635000        4        3350 631742.8
## 2        380000       24        2130 474613.9
## 3        464950       19        2190 488547.9
## 4        535000        3        2360 530860.6
## 5        356000       24        1430 402190.4
## 6        495000        3        2140 508098.9
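
A more idiomatic tidyverse alternative to cbind() here is dplyr's bind_cols(), which returns a tibble and errors on mismatched row counts instead of silently recycling:

home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  bind_cols(home_predictions) # same columns as above, returned as a tibble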

Evaluating model results with yardstick package

  • RMSE and R squared (both provided by yardstick)

  • streamlining model fitting with last_fit()

    • takes a model specification, model formula, and data split object

    • creates training and test datasets

    • fits the model to the training data

    • calculates metrics and predictions on the test data

    • returns an object with all results

    • use collect_metrics() to extract the test-set metrics; use collect_predictions() to extract the test-set predictions

  • practice: calculate RMSE and R squared on the test set results

home_test_results %>% 
  rmse(truth = selling_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      46971.
home_test_results %>% 
  rsq(truth = selling_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.669
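Instead of calling rmse() and rsq() separately, yardstick's metric_set() can bundle several metrics into a single function, a small sketch:

home_metrics <- metric_set(rmse, rsq) # returns a metric function
home_test_results %>% 
  home_metrics(truth = selling_price, estimate = .pred)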
  • visualization: plot predicted vs. actual selling prices
ggplot(home_test_results, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) + # dashed line represents predicted = actual
  coord_obs_pred() + # standardize the range of both axes
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

  • overall pipeline using all available variables as predictors; visually this fit looks better than the previous two-predictor fit.
linear_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

linear_fit <- linear_model %>% # fit the model with all available independent variables
  last_fit(selling_price ~ ., split = home_split)

predictions_df <- linear_fit %>% 
  collect_predictions()

ggplot(predictions_df, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "blue", linetype = 2) +
  coord_obs_pred() +
  labs(x = "actual selling price", y = "predicted selling price")