## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 32540 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): original_title, title, original_language, overview, tagline, post...
## dbl (8): id, popularity, vote_count, vote_average, budget, revenue, runtim...
## lgl (1): adult
## date (1): release_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 21 × 2
## word n
## <chr> <int>
## 1 horror 32543
## 2 thriller 7680
## 3 comedy 4963
## 4 drama 4271
## 5 mystery 3138
## 6 fiction 2714
## 7 science 2714
## 8 fantasy 2195
## 9 action 1966
## 10 crime 1153
## # ℹ 11 more rows
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Build Models
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [21962/2442]> Fold01
## 2 <split [21962/2442]> Fold02
## 3 <split [21963/2441]> Fold03
## 4 <split [21963/2441]> Fold04
## 5 <split [21964/2440]> Fold05
## 6 <split [21964/2440]> Fold06
## 7 <split [21964/2440]> Fold07
## 8 <split [21964/2440]> Fold08
## 9 <split [21965/2439]> Fold09
## 10 <split [21965/2439]> Fold10
## Random Forest Model Specification (regression)
##
## Main Arguments:
## trees = 500
##
## Computational engine: ranger
## Linear Support Vector Machine Model Specification (regression)
##
## Computational engine: LiblineaR
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 2.81 10 0.00992 Preprocessor1_Model1
## 2 rsq standard 0.0453 10 0.00300 Preprocessor1_Model1
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 2.81 10 0.00939 Preprocessor1_Model1
## 2 rsq standard 0.0500 10 0.00278 Preprocessor1_Model1
## → A | warning: max_tokens was set to '100', but only 21 was available and selected.
##
There were issues with some computations A: x1
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 2.80 Preprocessor1_Model1
## 2 rsq standard 0.0468 Preprocessor1_Model1
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 2.77
The primary variables of interest are the vote average, which is a numeric variable, and the genre name, which is a character variable.
The recipe relies on three text-processing steps. `step_tokenize()` splits each character string into smaller parts (tokens). `step_tokenfilter()` filters the tokens based on how frequently they are used. `step_tf()` converts the token variable into multiple variables containing the token counts.
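
The three steps above can be sketched as a small recipe. This is a minimal illustration rather than the exact code behind the output: it assumes the `textrecipes` package is installed, the toy data frame `movies` and the column name `genre_names` are hypothetical, and `max_tokens = 100` is taken from the warning shown in the output.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data standing in for the movie data set
movies <- data.frame(
  vote_average = c(6.1, 4.8, 7.2, 5.5),
  genre_names  = c("Horror, Thriller", "Horror, Comedy",
                   "Horror, Drama, Mystery", "Horror")
)

rec <- recipe(vote_average ~ genre_names, data = movies) |>
  step_tokenize(genre_names) |>                      # split text into word tokens
  step_tokenfilter(genre_names, max_tokens = 100) |> # keep the most frequent tokens
  step_tf(genre_names)                               # one count column per token

# Estimate the steps on the data and return the processed training set
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

With so few distinct words, fewer than 100 tokens are available, so `step_tokenfilter()` may emit the same kind of `max_tokens` warning that appears in the output.
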
The major metrics used in this model evaluation are RMSE (Root Mean Square Error) and R-squared.
RMSE measures the typical difference between the actual and predicted ratings. A lower RMSE means that the model's predictions fit the data more closely.
R-squared measures the proportion of variance in the dependent variable (the vote average) that can be explained by the independent variables (the genre words). A higher R-squared means that more of the variance is explained, which is helpful when looking at which genres affect the vote average the most.
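
To make the two metrics concrete, here is a minimal base R sketch that computes them by hand. The `observed` and `predicted` vectors are hypothetical values chosen for illustration, not results from this analysis. Note that yardstick's `rsq()` reports the squared correlation between observed and predicted values, while the "proportion of variance explained" form corresponds to `rsq_trad()`; both are shown.

```r
# Hypothetical observed and predicted ratings, for illustration only
observed  <- c(6.0, 4.5, 7.0, 5.5, 3.0)
predicted <- c(5.5, 5.0, 6.0, 5.0, 4.5)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((observed - predicted)^2))

# R-squared as the squared correlation (the definition used by yardstick::rsq)
rsq <- cor(observed, predicted)^2

# Traditional R-squared: 1 - residual sum of squares / total sum of squares
rsq_trad <- 1 - sum((observed - predicted)^2) /
  sum((observed - mean(observed))^2)

c(rmse = rmse, rsq = rsq, rsq_trad = rsq_trad)
```
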
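The major findings from this analysis were relatively inconclusive because there were so many words, and both models explained only about 5% of the variance in vote averages (R-squared of roughly 0.05). Even so, we were able to find which words were more strongly associated with higher or lower vote averages.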