## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 32540 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): original_title, title, original_language, overview, tagline, post...
## dbl   (8): id, popularity, vote_count, vote_average, budget, revenue, runtim...
## lgl   (1): adult
## date  (1): release_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore Data

## # A tibble: 21 × 2
##    word         n
##    <chr>    <int>
##  1 horror   32543
##  2 thriller  7680
##  3 comedy    4963
##  4 drama     4271
##  5 mystery   3138
##  6 fiction   2714
##  7 science   2714
##  8 fantasy   2195
##  9 action    1966
## 10 crime     1153
## # ℹ 11 more rows
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Build Models

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits               id    
##    <list>               <chr> 
##  1 <split [21962/2442]> Fold01
##  2 <split [21962/2442]> Fold02
##  3 <split [21963/2441]> Fold03
##  4 <split [21963/2441]> Fold04
##  5 <split [21964/2440]> Fold05
##  6 <split [21964/2440]> Fold06
##  7 <split [21964/2440]> Fold07
##  8 <split [21964/2440]> Fold08
##  9 <split [21965/2439]> Fold09
## 10 <split [21965/2439]> Fold10

Preprocessing

Model Spec

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   trees = 500
## 
## Computational engine: ranger
## Linear Support Vector Machine Model Specification (regression)
## 
## Computational engine: LiblineaR

Workflow

Evaluate Models

## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   2.81      10 0.00992 Preprocessor1_Model1
## 2 rsq     standard   0.0453    10 0.00300 Preprocessor1_Model1
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   2.81      10 0.00939 Preprocessor1_Model1
## 2 rsq     standard   0.0500    10 0.00278 Preprocessor1_Model1

## → A | warning: max_tokens was set to '100', but only 21 was available and selected.
## 
There were issues with some computations   A: x1
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      2.80   Preprocessor1_Model1
## 2 rsq     standard      0.0468 Preprocessor1_Model1
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  2.77

1

We seek to determine which genre words are associated with higher vote averages.

The data set contains 32,540 horror movies described by 20 variables, including release date, popularity, vote average, revenue, and genres.

The primary variables of interest are the vote average, which is numeric, and the genre names, which are character strings.

2

The prepared data differs from the original in two ways. First, it was reduced to a much smaller subset. Second, the genre string was split so that each word became its own variable, which lets us see the effect of each specific word on vote average rather than the effect of the whole string of characteristics we started with.

3

The three data prep steps we employed were:

step_tokenize, which splits a character string into smaller parts (tokens), here individual words.

step_tokenfilter, which filters the tokens by how frequently they are used. We asked for at most 100 tokens, but only 21 were available, so all 21 were kept.

step_tf, which converts a token variable into multiple variables containing the token counts, one variable per token.
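The three steps above can be sketched as a recipe on a small made-up data set. This is a minimal illustration, not the exact recipe from the analysis: the `toy` data frame and its values are hypothetical, and the column name `genre_names` is an assumption, since the column specification printed earlier is truncated. The `max_tokens = 100` setting matches the warning shown in the output.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data standing in for the horror movie data set.
# The column name `genre_names` is an assumption.
toy <- data.frame(
  vote_average = c(6.1, 4.8, 5.5),
  genre_names  = c("Horror, Thriller", "Horror, Comedy", "Horror, Science Fiction"),
  stringsAsFactors = FALSE
)

rec <- recipe(vote_average ~ genre_names, data = toy) |>
  step_tokenize(genre_names) |>                      # split each string into word tokens
  step_tokenfilter(genre_names, max_tokens = 100) |> # keep the most frequent tokens
  step_tf(genre_names)                               # one count column per remaining token

# prep() estimates the recipe; bake() returns the processed training data.
# A warning about max_tokens is expected here because the toy data has
# fewer than 100 distinct tokens.
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

Each remaining token becomes a column named like `tf_genre_names_horror` holding the number of times that word appears in the row.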

The two machine learning models used in the analysis were a linear support vector machine (SVM), fit with the LiblineaR engine, and a random forest, fit with the ranger engine. SVMs can be used for both classification and regression; in this analysis both models were run in regression mode to predict vote average. ranger is a fast implementation of random forests, and the forest here used 500 trees.
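The printed model specifications in the Model Spec output correspond to parsnip code along these lines. This is a sketch reconstructed from that output; the object names `rf_spec` and `svm_spec` are hypothetical. Building a specification does not fit anything, so the engine packages are only needed once the model is fit.

```r
library(parsnip)

# Random forest with 500 trees, fit by the ranger engine, in regression mode.
rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("regression")

# Linear support vector machine, fit by the LiblineaR engine, in regression mode.
svm_spec <- svm_linear() |>
  set_engine("LiblineaR") |>
  set_mode("regression")

rf_spec
svm_spec
```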

4

The major metrics used in this model evaluation are RMSE (Root Mean Square Error) and R-Squared.

RMSE measures the typical difference between actual and predicted vote averages, in the same units as the vote average itself. A lower RMSE indicates that the model fits the data better.

R-Squared measures the proportion of variance in the dependent variable (vote average) that is explained by the independent variables (the genre word counts). A higher R-Squared means that more of the variance is explained, which helps when judging which genre words affect the vote average the most.
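Both metrics can be computed by hand in base R, which makes the definitions concrete. The vectors below are made-up numbers, not results from the analysis, and the helper names `rmse_manual` and `rsq_manual` are hypothetical. The R-Squared here is the squared correlation between actual and predicted values, which is how yardstick's `rsq` metric is defined.

```r
# Hypothetical actual and predicted vote averages for five movies.
actual    <- c(6.0, 4.5, 7.2, 5.1, 3.8)
predicted <- c(5.5, 5.0, 6.0, 5.3, 4.9)

# RMSE: square the errors, average them, then take the square root.
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# R-Squared as the squared correlation between truth and estimate.
rsq_manual <- function(truth, estimate) {
  cor(truth, estimate)^2
}

rmse <- rmse_manual(actual, predicted)
rsq  <- rsq_manual(actual, predicted)
c(rmse = rmse, rsq = rsq)
```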

5

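The major findings from this analysis were relatively inconclusive, as there were so many words. Both models performed about the same under 10-fold cross-validation, with an RMSE of about 2.81 and an R-Squared of about 0.05, so the genre words explain only about 5% of the variance in vote average. We were still able to find which words were more strongly associated with higher or lower vote averages, although these are associations and not evidence of causation.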