This template offers an opinionated guide on how to structure a modeling analysis. Your individual modeling analysis may require you to add to, subtract from, or otherwise change this structure, but consider this a general framework to start from. If you want to learn more about using tidymodels, check out our Getting Started guide.

In this example analysis, let’s fit a model to predict chocolate ratings from the text describing each chocolate’s most memorable characteristics.

library(tidyverse)

url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv"
chocolate <- read_csv(url)
## Rows: 2530 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore data

Exploratory data analysis (EDA) is an important part of the modeling process.

chocolate %>%
    ggplot(aes(rating)) +
    geom_histogram(bins = 15)

library(tidytext)

tidy_chocolate <-
    chocolate %>% 
    unnest_tokens(word, most_memorable_characteristics)

tidy_chocolate %>%
    count(word, sort = TRUE)
## # A tibble: 547 × 2
##    word        n
##    <chr>   <int>
##  1 cocoa     419
##  2 sweet     318
##  3 nutty     278
##  4 fruit     273
##  5 roasty    228
##  6 mild      226
##  7 sour      208
##  8 earthy    199
##  9 creamy    189
## 10 intense   178
## # ℹ 537 more rows
tidy_chocolate %>%
  group_by(word) %>%
  summarise(n = n(),
            rating = mean(rating)) %>%
  ggplot(aes(n, rating)) +
  geom_hline(yintercept = mean(chocolate$rating),
             lty = 2, color = "gray50", linewidth = 1.5) +
  geom_point(color = "midnightblue", alpha = 0.7) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = "top", hjust = "left") +
  scale_x_log10()
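
To make the summary behind this plot concrete, here is a minimal base R sketch on a three-row toy data frame. The toy values are invented for illustration, and the regular-expression split only approximates what unnest_tokens() does.

```r
# Toy stand-in for the chocolate data (values are invented for illustration)
toy <- data.frame(
  rating = c(3.5, 2.5, 3.0),
  most_memorable_characteristics = c("sweet, cocoa", "sour, earthy", "sweet, nutty"),
  stringsAsFactors = FALSE
)

# Split each description into lowercase words
# (a rough approximation of unnest_tokens(), not its actual tokenizer)
words <- strsplit(tolower(toy$most_memorable_characteristics), "[^a-z]+")

# One row per word, carrying the rating of the chocolate it came from
tidy_toy <- data.frame(
  rating = rep(toy$rating, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)

# Count of each word and mean rating of the chocolates that mention it
word_n <- table(tidy_toy$word)
word_rating <- tapply(tidy_toy$rating, tidy_toy$word, mean)

word_n[["sweet"]]       # 2
word_rating[["sweet"]]  # 3.25
```

Each point in the plot corresponds to one word: its count on the x-axis and the mean rating of the chocolates described with it on the y-axis.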

Build models

Let’s consider how to spend our data budget:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
set.seed(123)
choco_split <- initial_split(chocolate, strata = rating)
choco_train <- training(choco_split)
choco_test <- testing(choco_split)

set.seed(234)
choco_folds <- vfold_cv(choco_train, strata = rating)
choco_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [1705/191]> Fold01
##  2 <split [1705/191]> Fold02
##  3 <split [1705/191]> Fold03
##  4 <split [1706/190]> Fold04
##  5 <split [1706/190]> Fold05
##  6 <split [1706/190]> Fold06
##  7 <split [1707/189]> Fold07
##  8 <split [1707/189]> Fold08
##  9 <split [1708/188]> Fold09
## 10 <split [1709/187]> Fold10
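
The idea behind a stratified split can be sketched in base R. This is only an illustration under stated assumptions: the toy ratings are simulated, the 75% training proportion and quartile bins reflect my understanding of the initial_split() defaults, and the internals of rsample may differ.

```r
set.seed(123)

# Simulated ratings on a quarter-point scale (hypothetical stand-in for chocolate$rating)
toy <- data.frame(rating = round(runif(200, min = 1, max = 4) * 4) / 4)

# Bin the numeric outcome into quartiles so each bin can be sampled separately
breaks <- unique(quantile(toy$rating, probs = seq(0, 1, by = 0.25)))
toy$bin <- cut(toy$rating, breaks = breaks, include.lowest = TRUE)

# Within each bin, draw 75% of the rows for training
rows_by_bin <- split(seq_len(nrow(toy)), toy$bin)
train_idx <- unlist(
  lapply(rows_by_bin, function(idx) {
    idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
  }),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# The two sets should have similar rating distributions
summary(toy_train$rating)
summary(toy_test$rating)
```

Sampling within bins keeps the distribution of the outcome similar in the training and testing sets, which is the purpose of the strata argument.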

Let’s set up our preprocessing:

library(textrecipes)

choco_rec <-
  recipe(rating ~ most_memorable_characteristics, data = choco_train) %>%
  step_tokenize(most_memorable_characteristics) %>%
  step_tokenfilter(most_memorable_characteristics, max_tokens = 100) %>%
  step_tfidf(most_memorable_characteristics)
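
To show what a tokenize-then-weight recipe produces, here is a minimal base R sketch that builds a small TF-IDF matrix by hand. It uses a textbook definition (term frequency as a share of the document's tokens, inverse document frequency as log(N / df)); the textrecipes defaults may apply smoothing or a different normalization, so treat the numbers as illustrative. The three toy descriptions are invented.

```r
# Toy descriptions (invented for illustration)
docs <- c("sweet, cocoa", "sour, earthy", "sweet, nutty, cocoa")

# Tokenize: lowercase and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per document, one column per word
counts <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Term frequency: each count divided by the number of tokens in its document
tf <- counts / rowSums(counts)

# Inverse document frequency: log(number of documents / documents containing the word)
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF: scale each column of tf by that word's idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Words that appear in few documents (here "sour" or "earthy") get larger weights than words that appear in most of them (here "sweet" or "cocoa"). In the recipe, step_tokenfilter() additionally keeps only the most frequent tokens before weighting.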

Let’s create a model specification for each model we want to try:

ranger_spec <-
  rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

ranger_spec
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   trees = 500
## 
## Computational engine: ranger
svm_spec <-
    svm_linear() %>%
    set_engine("LiblineaR") %>%
    set_mode("regression")

svm_spec
## Linear Support Vector Machine Model Specification (regression)
## 
## Computational engine: LiblineaR

To set up your modeling code, consider using the parsnip addin or the usemodels package.

Now let’s build a model workflow combining each model specification with a data preprocessor:

ranger_wf <- workflow(choco_rec, ranger_spec)
svm_wf <- workflow(choco_rec, svm_spec)

If your feature engineering needs are more complex than provided by a formula like rating ~ ., use a recipe, as we did here for the text predictor. Read more about feature engineering with recipes to learn how they work.

Evaluate models

These models have no tuning parameters so we can evaluate them as they are. Learn about tuning hyperparameters here.

doParallel::registerDoParallel()
contrl_preds <- control_resamples(save_pred = TRUE)

svm_rs <- fit_resamples(
  svm_wf,
  resamples = choco_folds,
  control = contrl_preds
)

ranger_rs <- fit_resamples(
  ranger_wf,
  resamples = choco_folds,
  control = contrl_preds
)

How did these two models compare?

collect_metrics(svm_rs)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   0.348    10 0.00704 Preprocessor1_Model1
## 2 rsq     standard   0.365    10 0.0146  Preprocessor1_Model1
collect_metrics(ranger_rs)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   0.345    10 0.00726 Preprocessor1_Model1
## 2 rsq     standard   0.378    10 0.0151  Preprocessor1_Model1
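
As a reminder of what these two metrics measure, here is a small base R sketch on invented numbers. It assumes the rsq metric is the squared correlation between observed and predicted values, which is my understanding of how yardstick defines it; the truth and pred vectors are hypothetical.

```r
# Hypothetical observed ratings and model predictions
truth <- c(3.0, 3.5, 2.5, 4.0)
pred <- c(3.2, 3.3, 2.9, 3.6)

# RMSE: square root of the mean squared prediction error,
# expressed in the same units as the rating
rmse_value <- sqrt(mean((truth - pred)^2))

# R-squared (assumed definition): squared correlation between truth and prediction
rsq_value <- cor(truth, pred)^2

rmse_value
rsq_value
```

A lower RMSE means predictions are closer to the observed ratings on average, and a higher R-squared means the predictions track more of the variation in ratings.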

We can visualize these results:

bind_rows(
  collect_predictions(svm_rs) %>%
    mutate(mod = "SVM"),
  collect_predictions(ranger_rs) %>%
    mutate(mod = "ranger")
) %>%
  ggplot(aes(rating, .pred, color = id)) +
    geom_abline(lty = 2, color = "gray50", linewidth = 1.2) +
    geom_jitter(width = 0.5, alpha = 0.5) +
    facet_wrap(vars(mod)) +
    coord_fixed()
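
Alongside the plot, the resampled RMSE values printed above can be compared with a little arithmetic. This is a rough check that uses only the numbers already reported; it ignores that both models were evaluated on the same folds, so it is a sanity check rather than a formal test.

```r
# Resampled RMSE and standard errors copied from the collect_metrics() output
svm_rmse <- 0.348
svm_se <- 0.00704
ranger_rmse <- 0.345
ranger_se <- 0.00726

# Gap between the two models
rmse_gap <- svm_rmse - ranger_rmse
rmse_gap

# Is the gap smaller than the uncertainty in either estimate?
rmse_gap < min(svm_se, ranger_se)
```

The gap of about 0.003 is less than half of either standard error, so the resampling results do not clearly separate the two models.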

These models perform very similarly, so perhaps we would choose the simpler, linear model. The function last_fit() fits one final time on the training data and evaluates on the testing data. This is the first time we have used the testing data.

final_fitted <- last_fit(svm_wf, choco_split)
collect_metrics(final_fitted)  ## metrics evaluated on the *testing* data
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       0.385 Preprocessor1_Model1
## 2 rsq     standard       0.340 Preprocessor1_Model1

This object contains a fitted workflow that we can use for prediction.

final_wf <- extract_workflow(final_fitted)
predict(final_wf, choco_test[55,])
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  3.70

You can save this fitted final_wf object to use later with new data, for example with readr::write_rds().
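
Saving and reloading a fitted object can be sketched with base R's saveRDS() and readRDS(), which readr::write_rds() and readr::read_rds() wrap. In this sketch a small lm() fit on invented data stands in for final_wf, and the file path is a temporary file chosen for illustration.

```r
# Toy model standing in for the fitted workflow (data are invented)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)

# Write the fitted object to disk, then read it back
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)

# The restored object predicts exactly as the original does
new_data <- data.frame(x = 6)
all.equal(predict(fit, new_data), predict(restored, new_data))
```

When reloading a fitted workflow in a new session, the packages used to fit it need to be installed and available so that predict() can dispatch correctly.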

extract_workflow(final_fitted) %>%
  tidy() %>%
  filter(term != "Bias") %>%
  group_by(estimate > 0) %>%
  slice_max(abs(estimate), n = 10) %>%
  ungroup() %>%
  mutate(term = str_remove(term, "tfidf_most_memorable_characteristics_")) %>%
  ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate > 0)) +
  geom_col(alpha = 0.8) +
  scale_fill_discrete(labels = c("low ratings", "high ratings")) +
  labs(y = NULL, fill = "More from...")

Questions

  1. Question and Data:
    • What is the research question? Clearly state the research question you aim to address using the new dataset. The research question is “Can we predict chocolate ratings based on the most memorable characteristics of chocolates?”
    • Describe the data briefly: Provide an overview of the new dataset, highlighting its key characteristics and dimensions. The data set has 2,530 rows and 10 columns. Seven columns are character data and three are numeric (ref, review_date, and rating). The character columns include company_manufacturer, company_location, and country_of_bean_origin.
    • What are the characteristics of the key variables used in the analysis? Describe the primary variables of interest in the dataset and their characteristics. The key variable is rating, the numeric score each chocolate was given. It is the outcome the model tries to predict. The other key variable is most_memorable_characteristics, a character column of short text descriptions of what was good or bad about each chocolate. It supplies the predictors the model uses to predict the rating.
  2. Data Exploration and Transformation:
    • Describe the differences between the original data and the data transformed for modeling. Why? Explain any preprocessing or transformations performed on the new dataset compared to the original data. Discuss why these changes were necessary or beneficial. The original data has the column most_memorable_characteristics, which holds descriptive words about what was memorable about each chocolate. In the transformed data, the unnest_tokens() function breaks that text into individual words, or “tokens”, so that each row holds one word. The count() function then counts the number of times each word appears. This helps show which words are used to describe chocolates that are rated higher or lower.
  3. Data Preparation and Modeling:
    • What are the names of data preparation steps mentioned in the video? List and describe any data preparation steps or techniques mentioned in the CA video that you applied to the new dataset. The data preparation steps mentioned in the video are tokenization (step_tokenize()), token filtering (step_tokenfilter()), TF-IDF transformation (step_tfidf()), and data splitting (initial_split()).
    • What is the name of the machine learning model(s) used in the analysis? Specify the machine learning model(s) you employed for your analysis and briefly explain their relevance to the research question. The machine learning models used in the analysis were a random forest and a linear support vector machine. Both were used in regression mode, because the research question asks for a prediction of a numeric rating. The random forest was fit with the ranger engine to predict the rating from the most_memorable_characteristics column. Random forests can handle both linear and nonlinear relationships and can find complex patterns in a large data set, which is helpful here for finding nonlinear relationships between the tokens and the rating. The support vector machine was fit with the LiblineaR engine to predict the rating from the same column. It is better suited to finding linear relationships between the tokens and the ratings.
  4. Model Evaluation:
    • What metrics are used in the model evaluation? Detail the evaluation metrics you used to assess the performance of your machine learning model(s) on the new dataset. Discuss the significance of these metrics in the context of your research question. The metrics were obtained with collect_metrics(svm_rs) and collect_metrics(ranger_rs). This code reports the RMSE and R-squared, which evaluate how accurate each model’s predictions are. Both metrics are computed for both the support vector machine and the random forest. RMSE measures the typical size of the prediction error in rating units, and R-squared measures how much of the variation in ratings the predictions account for. Together they assess how well the models connect the most memorable characteristics of a chocolate to its rating.
  5. Conclusion:
    • What are the major findings? Summarize the key findings and insights obtained from your analysis of the new dataset. In resampling, the SVM had an RMSE of 0.348 and an R-squared of 0.365, and the random forest had an RMSE of 0.345 and an R-squared of 0.378. On the testing data, the final SVM had an RMSE of 0.385 and an R-squared of 0.340. Along with the rest of the analysis, this indicates that it is somewhat possible to predict chocolate ratings from the most memorable characteristics of the chocolates. The two models perform very similarly, and each accounts for roughly a third of the variation in ratings.