This template offers an opinionated guide on how to structure a modeling analysis. Your individual modeling analysis may require you to add to, subtract from, or otherwise change this structure, but consider this a general framework to start from. If you want to learn more about using tidymodels, check out our Getting Started guide.
In this example analysis, let’s fit a model to predict the sex of penguins from species and measurement information.
library(tidyverse)
horror_movies <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-01/horror_movies.csv')
## Rows: 32540 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): original_title, title, original_language, overview, tagline, post...
## dbl (8): id, popularity, vote_count, vote_average, budget, revenue, runtim...
## lgl (1): adult
## date (1): release_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
horror_movies <- na.omit(horror_movies)
horror_movies %>%
ggplot(aes(vote_average)) +
geom_histogram(bins = 25)
library(tidytext)
tidy_horror <-
horror_movies %>%
unnest_tokens(word, genre_names)
tidy_horror %>%
count(word, sort = TRUE)
## # A tibble: 20 × 2
## word n
## <chr> <int>
## 1 horror 1117
## 2 thriller 373
## 3 fiction 218
## 4 science 218
## 5 comedy 195
## 6 action 149
## 7 mystery 133
## 8 fantasy 108
## 9 drama 86
## 10 adventure 52
## 11 crime 36
## 12 movie 29
## 13 tv 29
## 14 animation 15
## 15 family 11
## 16 romance 7
## 17 documentary 5
## 18 war 3
## 19 western 3
## 20 music 2
Exploratory data analysis (EDA) is an important part of the modeling process.
tidy_horror %>%
group_by(word) %>%
summarise(n = n(),
vote_average = mean(vote_average)) %>%
ggplot(aes(n, vote_average)) +
geom_hline(yintercept = mean(horror_movies$vote_average),
lty = 2, color = "gray50", size = 1.5) +
geom_point(color = "midnightblue", alpha = 0.7) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = "top", hjust = "left") +
scale_x_log10()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Let’s consider how to spend our data budget:
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
set.seed(123)
horror_split <- initial_split(horror_movies, strata = vote_average)
horror_train <- training(horror_split)
horror_test <- testing(horror_split)
set.seed(234)
horror_folds <- vfold_cv(horror_train, strata = vote_average)
horror_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [752/85]> Fold01
## 2 <split [752/85]> Fold02
## 3 <split [752/85]> Fold03
## 4 <split [752/85]> Fold04
## 5 <split [753/84]> Fold05
## 6 <split [754/83]> Fold06
## 7 <split [754/83]> Fold07
## 8 <split [754/83]> Fold08
## 9 <split [755/82]> Fold09
## 10 <split [755/82]> Fold10
Let;s set up our preprocessing:
library(textrecipes)
horror_rec <-
recipe(vote_average ~ tagline, data = horror_train) %>%
step_tokenize(tagline) %>%
step_tokenfilter(tagline, max_tokens = 100) %>%
step_tf(tagline)
Let’s create a model specification for each model we want to try:
ranger_spec <-
rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("regression")
ranger_spec
## Random Forest Model Specification (regression)
##
## Main Arguments:
## trees = 500
##
## Computational engine: ranger
svm_spec <-
svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("regression")
svm_spec
## Linear Support Vector Machine Model Specification (regression)
##
## Computational engine: LiblineaR
To set up your modeling code, consider using the parsnip addin or the usemodels package.
Now let’s build a model workflow combining each model specification with a data preprocessor:
ranger_wf <- workflow(horror_rec, ranger_spec)
svm_wf <- workflow(horror_rec, svm_spec)
If your feature engineering needs are more complex than provided by a
formula like sex ~ ., use a recipe. Read more about feature
engineering with recipes to learn how they work.
These models have no tuning parameters so we can evaluate them as they are. Learn about tuning hyperparameters here.
doParallel::registerDoParallel()
contrl_preds <- control_resamples(save_pred = TRUE)
svm_rs <- fit_resamples(
svm_wf,
resamples = horror_folds,
control = contrl_preds
)
ranger_rs <- fit_resamples(
ranger_wf,
resamples = horror_folds,
control = contrl_preds
)
How did these two models compare?
collect_metrics(svm_rs)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.42 10 0.0587 Preprocessor1_Model1
## 2 rsq standard 0.00659 10 0.00154 Preprocessor1_Model1
collect_metrics(ranger_rs)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.34 10 0.0641 Preprocessor1_Model1
## 2 rsq standard 0.0123 10 0.00642 Preprocessor1_Model1
We can visualize these results:
bind_rows(
collect_predictions(svm_rs) %>%
mutate(mod = "SVM"),
collect_predictions(ranger_rs) %>%
mutate(mod = "ranger")
) %>%
ggplot(aes(vote_average, .pred, color = id)) +
geom_abline(lty = 2, color = "gray50", size = 1.2) +
geom_jitter(width = 0.5, alpha = 0.5) +
facet_wrap(vars(mod)) +
coord_fixed()
These models perform very similarly, so perhaps we would choose the
simpler, linear model. The function last_fit()
fits one final time on the training data and evaluates
on the testing data. This is the first time we have used the testing
data.
final_fitted <- last_fit(svm_wf, horror_split)
collect_metrics(final_fitted) ## metrics evaluated on the *testing* data
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.48 Preprocessor1_Model1
## 2 rsq standard 0.0000201 Preprocessor1_Model1
This object contains a fitted workflow that we can use for prediction.
final_wf <- extract_workflow(final_fitted)
predict(final_wf, horror_test[35,])
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 5.65
You can save this fitted final_wf object to use later
with new data, for example with readr::write_rds().
extract_workflow(final_fitted) %>%
tidy() %>%
filter(term != "Bias") %>%
group_by(estimate > 0) %>%
slice_max(abs(estimate), n = 10) %>%
ungroup() %>%
mutate(term = str_remove(term, "tf_tagline_")) %>%
ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate > 0)) +
geom_col(alpha = 0.8) +
scale_fill_discrete(labels = c("low ratings", "high ratings")) +
labs(y = NULL, fill = "More from...")
For this assignment, you are tasked with replicating the Code Along 2 (CA) project but with a new dataset obtained from the TidyTuesday Project. The new dataset should be conceptually similar to the original dataset used in the CA assignment, particularly with a numeric outcome variable. Your task is to follow the same analytical questions and steps as in the CA assignment, but with this new dataset.