Load Packages and Dataset
Packages used to import and analyse the data include:
- tidymodels - a collection of R packages designed to support machine learning model development
- rsample - data sampling; creates random subsets of a dataset for the different activities in the modelling process, e.g. splitting the data into training and test datasets (75/25 by default), which guards against overfitting
- recipes - functions for transforming data for modeling, also called feature engineering
- parsnip - specifying and fitting models, as well as obtaining model predictions
- tune and dials - functionality for fine-tuning models in order to achieve optimal prediction accuracy
- yardstick - evaluating the quality of model predictions
Summarized in a table:
| Data resampling and feature engineering | Model fitting and tuning | Model evaluation |
|---|---|---|
| rsample | parsnip | yardstick |
| recipes | tune | |
| | dials | |
- data set - home_sales (loaded from an RDS file)
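A minimal setup sketch (assuming tidymodels is installed and that the RDS file is named home_sales.rds):
library(tidymodels) # attaches rsample, recipes, parsnip, tune, dials, yardstick, and more
home_sales <- readRDS("home_sales.rds")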
A. Key points
tidymodels is primarily used for supervised machine learning, where algorithms learn patterns from labeled data (mainly two types: regression and classification)
stratification when sampling
used to reduce sampling bias in the test set
creates a test set whose distribution closely represents the entire population being studied
stratified random sampling differs from simple random sampling, which selects data at random from the entire population so that each possible sample is equally likely to occur
data resampling with tidymodels
initial_split(): specifies the instructions for creating the training and test datasets; pass the outcome as the strata argument, and use prop to specify the proportion of data to place into training
pass the resulting split object to training() to create the training dataset and to testing() to create the test dataset
B. Practice
- Data preparation for machine learning in tidymodels
# create an rsample object, home_split, that contains the instructions for randomly splitting home_sales
# allocate 70% of the data to training and stratify the results by selling_price (the outcome variable)
home_split <- initial_split(home_sales,
prop = 0.7,
strata = selling_price)
# create training and testing datasets
home_training <- home_split %>%
training()
home_test <- home_split %>%
testing()
# check the number of rows in each dataset
nrow(home_training)
## [1] 1042
nrow(home_test)
## [1] 450
As a sanity check, 1042 / (1042 + 450) ≈ 0.70, matching prop = 0.7.
- check key statistics in both the training and test datasets - they should be close, since we stratified by the outcome
home_training %>%
summarize(min_sell_price = min(selling_price),
max_sell_price = max(selling_price),
mean_sell_price = mean(selling_price),
sd_sell_price = sd(selling_price))
## # A tibble: 1 × 4
## min_sell_price max_sell_price mean_sell_price sd_sell_price
## <dbl> <dbl> <dbl> <dbl>
## 1 350000 650000 478909. 80735.
home_test %>%
summarize(min_sell_price = min(selling_price),
max_sell_price = max(selling_price),
mean_sell_price = mean(selling_price),
sd_sell_price = sd(selling_price))
## # A tibble: 1 × 4
## min_sell_price max_sell_price mean_sell_price sd_sell_price
## <dbl> <dbl> <dbl> <dbl>
## 1 350000 650000 479491. 81629.
Fitting a linear regression model
using the parsnip package within tidymodels
- specify a linear model
linear_model <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")- train the model to predict selling_price using home_age and sqft_living using training dataset
lm_fit <- linear_model %>%
fit(selling_price ~ home_age + sqft_living, data = home_training)
- print lm_fit to view model information
lm_fit
## parsnip model object
##
## Fit time: 3ms
##
## Call:
## stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)
##
## Coefficients:
## (Intercept) home_age sqft_living
## 291325.7 -1545.3 103.5
tidy(lm_fit)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 291326. 7689. 37.9 5.24e-198
## 2 home_age -1545. 177. -8.74 9.47e- 18
## 3 sqft_living 103. 2.79 37.1 2.60e-192
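Interpretation: holding the other predictor fixed, each additional year of home age lowers the predicted selling price by about $1,545, and each additional square foot of living space adds about $103.50. For model-level statistics, broom's glance() (re-exported by tidymodels) can complement the coefficient-level tidy() output; a minimal sketch, using $fit to pull out the underlying stats::lm object:
glance(lm_fit$fit) # model-level statistics: r.squared, sigma, AIC, ...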
- generate predictions and combine them with the test dataset
home_predictions <- predict(lm_fit, new_data = home_test)
home_test_results <- home_test %>%
select(selling_price, home_age, sqft_living) %>%
cbind(home_predictions)
head(home_test_results)
## selling_price home_age sqft_living .pred
## 1 635000 4 3350 631742.8
## 2 380000 24 2130 474613.9
## 3 464950 19 2190 488547.9
## 4 535000 3 2360 530860.6
## 5 356000 24 1430 402190.4
## 6 495000 3 2140 508098.9
Evaluating model results with the yardstick package
RMSE and R-squared both come with yardstick, as rmse() and rsq()
streamlining model fitting with last_fit()
takes a model specification, model formula, and data split object
creates training and test datasets
fits the model to the training data
calculates metrics and predictions on the test data
returns an object with all results
use collect_metrics() to collect the test-set metrics and collect_predictions() to collect the test-set predictions, as in the sketch below
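A minimal sketch of this streamlined workflow, reusing linear_model and home_split from above (lm_last_fit is an illustrative name):
lm_last_fit <- linear_model %>%
  last_fit(selling_price ~ home_age + sqft_living, split = home_split)
lm_last_fit %>% collect_metrics()     # test-set RMSE and R-squared in one tibble
lm_last_fit %>% collect_predictions() # test-set predictions in a .pred column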
some practice
- calculate RMSE and R-squared
home_test_results %>%
rmse(truth = selling_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 46971.
home_test_results %>%
rsq(truth = selling_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.669
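Several metrics can also be computed in a single call with yardstick's metric_set(); a short sketch (home_metrics is an illustrative name):
home_metrics <- metric_set(rmse, rsq) # bundles yardstick metrics into one callable function
home_test_results %>%
  home_metrics(truth = selling_price, estimate = .pred)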
- visualization
ggplot(home_test_results, aes(x = selling_price, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(color = 'blue', linetype = 2) + #representing predicted = actual
coord_obs_pred() + #standardize the range of both axes
labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')
- overall pipeline, using all variables in the fit; visually it looks better than the previous fit
linear_model <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
linear_fit <- linear_model %>% # fit the model with all available independent variables
last_fit(selling_price ~ ., split = home_split)
predictions_df <- linear_fit %>%
collect_predictions()
ggplot(predictions_df, aes(x = selling_price, y = .pred )) +
geom_point(alpha = 0.5) +
geom_abline(color = "blue", linetype = 2) +
coord_obs_pred() +
labs(x = "actual selling price", y = "predicted selling price")