Tidymodels_01

Kate C

2022-01-28

Load Packages and Dataset

Packages used to import and analyse the data include

  • tidymodels - a collection of R packages designed to support machine learning model development

    • rsample - data sampling functions that create random subsets of a dataset for different stages of the modelling process, e.g. splitting data into training and test datasets (75/25 by default), which guards against overfitting

    • recipes - functions for transforming data before modelling, also known as feature engineering

    • parsnip - specifying and fitting models as well as obtaining model predictions

    • tune and dials - functionality for fine-tuning models in order to achieve optimal prediction accuracy

    • yardstick - functions for evaluating the quality of model predictions

  • summarized in a table:

    Stage                                      Packages
    -----------------------------------------  --------------------
    Data resampling and feature engineering    rsample, recipes
    Model fitting and tuning                   parsnip, tune, dials
    Model evaluation                           yardstick

  • data set - home_sales (loaded from an .rds file)
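
A minimal setup sketch (the actual file path is not shown in these notes, so "home_sales.rds" below is an assumed name):

library(tidymodels) # loads rsample, recipes, parsnip, tune, dials, yardstick, and more

home_sales <- readRDS("home_sales.rds") # file name assumed for illustration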

A. Key points

  • tidymodels is primarily used for supervised machine learning, where algorithms learn patterns from labeled data (mainly two types: regression and classification)

  • stratification when sampling

    • used to reduce sampling bias in the test set

    • creates a test set whose distribution best represents the entire population being studied

    • stratified random sampling differs from simple random sampling, which draws observations at random from the entire population so that each possible sample is equally likely to occur

  • data resampling with tidymodels

    • initial_split(): specifies instructions for creating the training and test datasets. Use the outcome as the strata argument; prop specifies the proportion of data placed into training.

    • pass the split object to training() to create the training dataset and to testing() to create the test dataset

B. Practice

  • Data preparation for machine learning in tidymodels
# create an rsample object, home_split, that contains the instructions for randomly splitting home_sales
# allocate 70% of the data to training and stratify by selling_price (the outcome variable)
home_split <- initial_split(home_sales,
                            prop = 0.7,
                            strata = selling_price)
# create training and testing datasets
home_training <- home_split %>% 
  training()
home_test <- home_split %>% 
  testing()
# check number of rows in each dataset
nrow(home_training)
## [1] 1042
nrow(home_test)
## [1] 450
  • check key statistics in both the training and test datasets - they should be close, since we stratified by the outcome
home_training %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 × 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         478909.        80735.
home_test %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 × 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         479491.        81629.

Fitting a linear regression model

using the parsnip package within tidymodels

  • specify a linear model
linear_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
  • train the model to predict selling_price from home_age and sqft_living, using the training dataset
lm_fit <- linear_model %>% 
  fit(selling_price ~ home_age + sqft_living, data = home_training)
  • print lm_fit to view model information
lm_fit
## parsnip model object
## 
## Fit time:  3ms 
## 
## Call:
## stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)
## 
## Coefficients:
## (Intercept)     home_age  sqft_living  
##    291325.7      -1545.3        103.5
tidy(lm_fit)
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  291326.   7689.       37.9  5.24e-198
## 2 home_age      -1545.    177.       -8.74 9.47e- 18
## 3 sqft_living     103.      2.79     37.1  2.60e-192
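tidy() returns coefficient-level detail; for model-level fit statistics (R squared, AIC, and so on), broom's glance() should also work on the parsnip fit, a quick sketch:

glance(lm_fit) # one-row tibble of model-level summary statistics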
  • generate predictions on the test data and combine them with the test dataset
home_predictions <- predict(lm_fit, new_data = home_test)

home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  cbind(home_predictions)

head(home_test_results)
##   selling_price home_age sqft_living    .pred
## 1        635000        4        3350 631742.8
## 2        380000       24        2130 474613.9
## 3        464950       19        2190 488547.9
## 4        535000        3        2360 530860.6
## 5        356000       24        1430 402190.4
## 6        495000        3        2140 508098.9
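
A more idiomatic tidyverse alternative to cbind() here is dplyr's bind_cols(), which returns a tibble and errors on mismatched row counts instead of silently recycling:

home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  bind_cols(home_predictions) # same columns as above, returned as a tibble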

Evaluating model results with yardstick package

  • RMSE and R squared (both provided by yardstick)

  • streamlining model fitting with last_fit()

    • takes a model specification, model formula, and data split object

    • creates training and test datasets

    • fits the model to the training data

    • calculates metrics and predictions on the test data

    • returns an object with all results

    • use collect_metrics() to extract the test-set metrics; use collect_predictions() to extract the test-set predictions

  • practice: calculate RMSE and R squared on the test set results

home_test_results %>% 
  rmse(truth = selling_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      46971.
home_test_results %>% 
  rsq(truth = selling_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.669
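Instead of calling rmse() and rsq() separately, yardstick's metric_set() can bundle several metrics into a single function, a small sketch:

home_metrics <- metric_set(rmse, rsq) # returns a metric function
home_test_results %>% 
  home_metrics(truth = selling_price, estimate = .pred)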
  • visualization: plot predicted vs. actual selling prices
ggplot(home_test_results, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) + # dashed line represents predicted = actual
  coord_obs_pred() + # standardize the range of both axes
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

  • overall pipeline using all available variables as predictors; visually this fit looks better than the previous two-predictor fit.
linear_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

linear_fit <- linear_model %>% # fit the model with all available independent variables
  last_fit(selling_price ~ ., split = home_split)

predictions_df <- linear_fit %>% 
  collect_predictions()

ggplot(predictions_df, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "blue", linetype = 2) +
  coord_obs_pred() +
  labs(x = "actual selling price", y = "predicted selling price")