This template offers an opinionated guide on how to structure a modeling analysis. Your individual modeling analysis may require you to add to, subtract from, or otherwise change this structure, but consider this a general framework to start from. If you want to learn more about using tidymodels, check out our Getting Started guide.

In this example analysis, let’s fit a model to predict the sex of penguins from species and measurement information.

library(tidyverse)

horror_movies <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-01/horror_movies.csv')
## Rows: 32540 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): original_title, title, original_language, overview, tagline, post...
## dbl   (8): id, popularity, vote_count, vote_average, budget, revenue, runtim...
## lgl   (1): adult
## date  (1): release_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
horror_movies <- na.omit(horror_movies)

Explore data

horror_movies %>%
  ggplot(aes(vote_average)) +
  geom_histogram(bins = 25)

library(tidytext)

tidy_horror <-
    horror_movies %>%
    unnest_tokens(word, genre_names)

tidy_horror %>%
    count(word, sort = TRUE)
## # A tibble: 20 × 2
##    word            n
##    <chr>       <int>
##  1 horror       1117
##  2 thriller      373
##  3 fiction       218
##  4 science       218
##  5 comedy        195
##  6 action        149
##  7 mystery       133
##  8 fantasy       108
##  9 drama          86
## 10 adventure      52
## 11 crime          36
## 12 movie          29
## 13 tv             29
## 14 animation      15
## 15 family         11
## 16 romance         7
## 17 documentary     5
## 18 war             3
## 19 western         3
## 20 music           2

Exploratory data analysis (EDA) is an important part of the modeling process.

tidy_horror %>%
  group_by(word) %>%
  summarise(n = n(),
            vote_average = mean(vote_average)) %>%
  ggplot(aes(n, vote_average)) +
  geom_hline(yintercept = mean(horror_movies$vote_average),
             lty = 2, color = "gray50", size = 1.5) +
  geom_point(color = "midnightblue", alpha = 0.7) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = "top", hjust = "left") +
  scale_x_log10()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Build models

Let’s consider how to spend our data budget:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
set.seed(123)
horror_split <- initial_split(horror_movies, strata = vote_average)
horror_train <- training(horror_split)
horror_test <- testing(horror_split)

set.seed(234)
horror_folds <- vfold_cv(horror_train, strata = vote_average)
horror_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits           id    
##    <list>           <chr> 
##  1 <split [752/85]> Fold01
##  2 <split [752/85]> Fold02
##  3 <split [752/85]> Fold03
##  4 <split [752/85]> Fold04
##  5 <split [753/84]> Fold05
##  6 <split [754/83]> Fold06
##  7 <split [754/83]> Fold07
##  8 <split [754/83]> Fold08
##  9 <split [755/82]> Fold09
## 10 <split [755/82]> Fold10

Let;s set up our preprocessing:

library(textrecipes)

horror_rec <-
  recipe(vote_average ~ tagline, data = horror_train) %>%
  step_tokenize(tagline) %>%
  step_tokenfilter(tagline, max_tokens = 100) %>%
  step_tf(tagline)

Let’s create a model specification for each model we want to try:

ranger_spec <-
  rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

ranger_spec
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   trees = 500
## 
## Computational engine: ranger
svm_spec <-
    svm_linear() %>%
    set_engine("LiblineaR") %>%
    set_mode("regression")

svm_spec
## Linear Support Vector Machine Model Specification (regression)
## 
## Computational engine: LiblineaR

To set up your modeling code, consider using the parsnip addin or the usemodels package.

Now let’s build a model workflow combining each model specification with a data preprocessor:

ranger_wf <- workflow(horror_rec, ranger_spec)
svm_wf <- workflow(horror_rec, svm_spec)

If your feature engineering needs are more complex than provided by a formula like sex ~ ., use a recipe. Read more about feature engineering with recipes to learn how they work.

Evaluate models

These models have no tuning parameters so we can evaluate them as they are. Learn about tuning hyperparameters here.

doParallel::registerDoParallel()
contrl_preds <- control_resamples(save_pred = TRUE)

svm_rs <- fit_resamples(
  svm_wf,
  resamples = horror_folds,
  control = contrl_preds
)

ranger_rs <- fit_resamples(
  ranger_wf,
  resamples = horror_folds,
  control = contrl_preds
)

How did these two models compare?

collect_metrics(svm_rs)
## # A tibble: 2 × 6
##   .metric .estimator    mean     n std_err .config             
##   <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.42       10 0.0587  Preprocessor1_Model1
## 2 rsq     standard   0.00659    10 0.00154 Preprocessor1_Model1
collect_metrics(ranger_rs)
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.34      10 0.0641  Preprocessor1_Model1
## 2 rsq     standard   0.0123    10 0.00642 Preprocessor1_Model1

We can visualize these results:

bind_rows(
  collect_predictions(svm_rs) %>%
    mutate(mod = "SVM"),
  collect_predictions(ranger_rs) %>%
    mutate(mod = "ranger")
) %>%
  ggplot(aes(vote_average, .pred, color = id)) +
    geom_abline(lty = 2, color = "gray50", size = 1.2) +
    geom_jitter(width = 0.5, alpha = 0.5) +
    facet_wrap(vars(mod)) +
    coord_fixed()

These models perform very similarly, so perhaps we would choose the simpler, linear model. The function last_fit() fits one final time on the training data and evaluates on the testing data. This is the first time we have used the testing data.

final_fitted <- last_fit(svm_wf, horror_split)
collect_metrics(final_fitted)  ## metrics evaluated on the *testing* data
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard   1.48      Preprocessor1_Model1
## 2 rsq     standard   0.0000201 Preprocessor1_Model1

This object contains a fitted workflow that we can use for prediction.

final_wf <- extract_workflow(final_fitted)
predict(final_wf, horror_test[35,])
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  5.65

You can save this fitted final_wf object to use later with new data, for example with readr::write_rds().

extract_workflow(final_fitted) %>%
  tidy() %>%
  filter(term != "Bias") %>%
  group_by(estimate > 0) %>%
  slice_max(abs(estimate), n = 10) %>%
  ungroup() %>%
  mutate(term = str_remove(term, "tf_tagline_")) %>%
  ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate > 0)) +
  geom_col(alpha = 0.8) +
  scale_fill_discrete(labels = c("low ratings", "high ratings")) +
  labs(y = NULL, fill = "More from...")

Questions

For this assignment, you are tasked with replicating the Code Along 2 (CA) project but with a new dataset obtained from the TidyTuesday Project. The new dataset should be conceptually similar to the original dataset used in the CA assignment, particularly with a numeric outcome variable. Your task is to follow the same analytical questions and steps as in the CA assignment, but with this new dataset.

  1. Question and Data:
    • What is the research question? Clearly state the research question you aim to address using the new dataset. The research question is prediciting the vote_average using a regression model for horror movies.
    • Describe the data briefly: Provide an overview of the new dataset, highlighting its key characteristics and dimensions. The data is information about horror movies. There is an id, budget, vote count, vote average, title, tagline, overview, and original language columns all containing data on the movies.
    • What are the characteristics of the key variables used in the analysis? Describe the primary variables of interest in the dataset and their characteristics. The primary variables in the data set are tagline and vote_average. Tagline is character data and the column contains one sentence that is the catchphrase or tagline from each movie. Vote_average is numerical data and displays the average voter rating for each movie. It is the data I am trying to predict by using tagline in a regression model.
  2. Data Exploration and Transformation:
    • Describe the differences between the original data and the data transformed for modeling. Why? Explain any preprocessing or transformations performed on the new dataset compared to the original data. Discuss why these changes were necessary or beneficial. One difference between the original and the transformed data is that I tokenized and token filtered the original data. I did this to split the text from the tagline column into individual words, and also token filtered only 100 words. Breaking the tagline’s into individual words helped me to analyze how many times a word came up. Token filtering prevented an overload of words, so limiting it to only 100 words kept the dimensions smaller. The text features were also convereted to numeric data instead of character data in order for the machine learning model to work with the data. I also split the data into test and training sets and folded the data as well. This way I could better assess the performance of the model on multiple sets of the data to get a better idea of how well it does.
  3. Data Preparation and Modeling:
    • What are the names of data preparation steps mentioned in the video? List and describe any data preparation steps or techniques mentioned in the CA video that you applied to the new dataset. The data preparation steps mentioned in the video are tokenization (step_tokenize()), token filtering (step_tokenfilter()) TF-IDF transformation (step_tfidf()), and data splitting (inital split()). I used tokenization as mentioned in the previous question. Using step_tokenize() I split the tagline text into individual words. I used step_tokenfilter() to limit the number of tokens to only 100. I used step_tfidf to transform text data into numerical. This function calculates the importance of each word in relation to the dataset. I used initial_split to seperate my data into training and test data in order to better train my machine learning model.
    • What is the name of the machine learning model(s) used in the analysis? Specify the machine learning model(s) you employed for your analysis and briefly explain their relevance to the research question. The machine learning models used in the analysis were random forest and support vector machine. Random forest is used with regression tasks, which is what the research quetion was requesting for this dataset. I used the ranger function to predict vote_average of horror movies based on the tagline column from the dataset. Support vector machines are also used for regression tasks. I used LiblineaR to predict the vote_average using data from the tagline column.
  4. Model Evaluation:
    • What metrics are used in the model evaluation? Detail the evaluation metrics you used to assess the performance of your machine learning model(s) on the new dataset. Discuss the significance of these metrics in the context of your research question. I used the code chunk “collect_metrics(svm_rs) collect_metrics(ranger_rs)” to see how the two models compared. This code prints the metrics using root mean squared error and r-squared. These metrics are helpful as root mean squared error finds the average of the errors and I can see that if the root mean square error value is low that the model has close predictions. R-squared finds the variance in the dependent variable that is predictable from the independent variables. Using this, I know I can look at the value and see that the closer the r-squared value is to 1 the better the accuracy of the prediction is. This is important in seeing the precision performance of the model.
  5. Conclusion:
    • What are the major findings? Summarize the key findings and insights obtained from your analysis of the new dataset. Relate these findings back to the research question and any similarities or differences compared to the CA assignment. My key findings are that it is not very accurate to predict the vote_average based on the tagline of a movie. Using data like this does not exactly seem like it relates to each other, therefore the prediction does not seem to good. The code along assignment seemed to have better luck, but I think using descriptive words for how chocolate tastes is much more related to how it could be rated than a tagline in a movie affecting voter ratings. If someone were to describe chocolate as “amazing” or “disgusting” it seems more likely that the amazing would be linked to a higher rating and disgusting to a lower. But it does not appear to work quite as smoothing for a movie having the word “the” in the tagline always being rated highly.