Final Project: Analyzing Rotten Tomatoes Top 200 Horror Movies

“The 200 Best Horror Movies of All Time” ranking by Rotten Tomatoes

https://editorial.rottentomatoes.com/guide/best-horror-movies-of-all-time/

Research Questions:

If a movie has a higher rank, can it also be expected to have a higher user rating on Rotten Tomatoes?
How does the year a film was released impact how it is ranked?
Is the year a film was released or the user rating more effective in predicting how a movie will rank?

Selected Technique:

Decision Tree: determines which variables are most important in predicting the rank of a movie.

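# load the packages used in this analysis
library(tidyverse)
library(tidymodels)
library(vip)
library(rpart.plot)

# make rank and score numeric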
rt200$rank<-as.numeric(rt200$rank)
rt200$score<-as.numeric(rt200$score)

# create factor "rank_over_101": "bottom 50%" if rank is 101 or greater, otherwise "top 50%"
rt200 <-  rt200  %>%
  mutate(rank_over_101 = factor(if_else(rank >= 101, "bottom 50%", "top 50%")))

# create variable "years" that groups release year into four periods: 1919-1950, 1951-1980, 1981-2010, and 2011+
rt200 <- rt200 %>%
  mutate(years = case_when(
                               year <= 1950 ~ '1919-1950',
                               year >= 1951 & year <= 1980 ~ '1951-1980',
                               year >= 1981 & year <= 2010 ~ '1981-2010',
                               year >= 2011 ~ '2011+'))

head(rt200)

# scatterplot of rank and score with regression line, color coded by release period
ggplot(data = rt200,
       mapping = aes(x = rank, y = score)) + 
  labs(x = "Rank by Rotten Tomatoes", 
         y = "User Rating", 
         color = "Year Released",
         title = "Top 200 Horror Movies: Rank by Rotten Tomatoes vs. User Rating",
         subtitle = "By Year Released") +
  geom_point(aes(col=years)) +
  geom_smooth(method = 'lm', color = "#E48957", se = FALSE)

- There is a roughly linear relationship between rank and score: films with higher user ratings tend to be ranked higher by the website. Films released 1919-1950 tend to plot higher than newer films, suggesting a slight association between older films and both higher rank and higher rating.
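The relationship between rank and score can also be checked numerically. The following is a minimal sketch, assuming a small made-up data frame `toy` (hypothetical values, not the real rt200 data) with the same `rank` and `score` columns; with the real data, `toy` would be replaced by `rt200`.

```r
# Hypothetical stand-in for rt200: made-up values for illustration only
toy <- data.frame(
  rank  = c(1, 50, 100, 150, 200),
  score = c(95, 92, 90, 88, 85)
)

# Pearson correlation between rank and score
# (negative when better-ranked films have higher scores, since rank 1 is the best rank)
rank_score_cor <- cor(toy$rank, toy$score)

# Simple linear regression of score on rank, the same kind of line
# that geom_smooth(method = 'lm') draws on the scatterplot
toy_fit <- lm(score ~ rank, data = toy)
rank_slope <- coef(toy_fit)[["rank"]]

print(rank_score_cor)
print(rank_slope)
```

A correlation close to -1 would support reading the scatterplot as a strong linear relationship, while a value near 0 would suggest the trend line is mostly noise.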

# scatterplot of rank and director
ggplot(data = rt200,
       mapping = aes(x = rank, y = director)) + 
  labs(x = "Rank by Rotten Tomatoes", 
         y = "Director", 
         title = "Top 200 Horror Movies: Rank by Rotten Tomatoes vs. Director") +
  geom_point()

- As you can see, Director is a difficult variable to use: there are many different directors, and too few of them appear more than once. For this reason, I have chosen not to use this variable in my analysis. In a future project, it would be interesting to see whether directors who made the list several times had films that scored higher than directors who made the list only once.
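How often directors repeat can be counted directly. This is a rough sketch using a hypothetical vector `toy_directors` with placeholder names (not the real director column); with the real data, `rt200$director` would take its place.

```r
# Hypothetical director column: placeholder names, not the real rt200 data
toy_directors <- c("Director A", "Director B", "Director A",
                   "Director C", "Director A", "Director B")

# Count how many films each director has on the list
director_counts <- table(toy_directors)

# Keep only the directors who appear more than once
repeat_directors <- names(director_counts)[as.vector(director_counts) > 1]

# Share of directors who appear exactly once
share_single <- mean(as.vector(director_counts) == 1)

print(director_counts)
print(repeat_directors)
print(share_single)
```

If most directors appear only once, a tree has almost no repeated values to split on, which is consistent with leaving the variable out of the model.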

# split the data into training "rt200_train" (80%) and testing "rt200_test" (20%) data sets
set.seed(314)
rt200_split <- initial_split(rt200, prop = 0.80)
rt200_train <- training(rt200_split)
rt200_test <- testing(rt200_split)

# creates a feature engineering recipe "rt200_recipe" that uses "rt200_train" to predict "rank_over_101" from "score" and "years". "step_dummy" creates dummy variables for the nominal predictors.
rt200_recipe <- recipe(rank_over_101 ~ score + years, data = rt200_train) %>% 
                       step_dummy(all_nominal(), -all_outcomes())

# splits "rt200_train" into 5 folds to tune model hyperparameters
set.seed(314)
rt200_folds <- vfold_cv(rt200_train,v=5)

# creates a classification decision tree model specification "rt200_model" and sets the hyperparameters within the decision tree model
rt200_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune() ) %>% 
              set_engine('rpart') %>% 
              set_mode('classification')

# combines the model and the recipe created above into "rt200_tree_workflow"
rt200_tree_workflow <- workflow() %>% 
                 add_model(rt200_model) %>% 
                 add_recipe(rt200_recipe)

# creates a regular grid of decision tree hyperparameters for the grid search. With 3 levels for each of the 3 hyperparameters, this grid has 27 rows.
rt200_tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          min_n(), 
                          levels = 3)
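A regular grid crosses every level of every hyperparameter, so its size grows quickly. The base R sketch below illustrates the row count; the specific level values are assumptions chosen for illustration and may not match the values the tuning grid actually uses.

```r
# Assumed illustrative values: three candidate levels per hyperparameter
cost_complexity_levels <- c(1e-10, 1e-5, 1e-1)
tree_depth_levels      <- c(1, 8, 15)
min_n_levels           <- c(2, 21, 40)

# A regular grid is the full cross of all levels, one row per combination
toy_grid <- expand.grid(
  cost_complexity = cost_complexity_levels,
  tree_depth      = tree_depth_levels,
  min_n           = min_n_levels
)

# 3 * 3 * 3 combinations
n_combinations <- nrow(toy_grid)
print(n_combinations)

# With 5-fold cross-validation, each combination is fit once per fold
n_model_fits <- n_combinations * 5
print(n_model_fits)
```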

# tunes decision tree and finds the optimal hyperparameters from the training grid
set.seed(314)

rt200_tree_tuning <-   rt200_tree_workflow %>% 
               tune_grid(resamples = rt200_folds ,
                         grid = rt200_tree_grid)

# shows the top five hyperparameter combinations ranked by area under the ROC curve.
rt200_tree_tuning %>% show_best(metric = 'roc_auc')

# selects the best model from tuning results based on area under ROC curve.
rt200_best_tree <- rt200_tree_tuning %>% 
             select_best(metric = 'roc_auc')

# adds the optimal model to the workflow object
rt200_final_tree_workflow <- rt200_tree_workflow %>% 
                       finalize_workflow(rt200_best_tree)

# fit workflow to the training data 
rt200_tree_wf_fit <- rt200_final_tree_workflow %>% 
               fit(rt200_train)

# extracts the trained model from the workflow fit
rt200_tree_fit <- rt200_tree_wf_fit %>% 
            extract_fit_parsnip()

# see which variables are most important in the model
vip(rt200_tree_fit)

- Score is the most important predictor for this model.

# Visualize the decision tree 
rpart.plot(rt200_tree_fit$fit, roundint = FALSE, main = "Top 200 Horror Movies:\nHow User Rating ('score') and Year Released Determine if a Movie is Ranked Top 50%")

According to the decision tree, the most important predictor of whether a movie will be in the top 50% is whether it has a score over 92%, which makes sense since a higher score most likely means it is a better movie. Among movies with a score between 88 and 92, older movies are more likely to be ranked in the top 50%.

# Fit final model workflow to the training data and output performance metrics on test data 
rt200_tree_last_fit <- rt200_final_tree_workflow %>% 
                 last_fit(rt200_split)

rt200_tree_last_fit %>% collect_metrics()

# Create a confusion matrix to show correct and incorrect predictions on the test data
rt200_tree_predictions <- rt200_tree_last_fit %>% collect_predictions()
conf_mat(rt200_tree_predictions, truth = rank_over_101 , estimate = .pred_class)

##             Truth
## Prediction   bottom 50% top 50%
##   bottom 50%         18       4
##   top 50%             2      16

# Plot the ROC curve to visualize test set performance of tuned decision tree 
rt200_tree_last_fit %>% collect_predictions() %>% 
                  roc_curve(truth = rank_over_101, `.pred_bottom 50%`) %>%
  autoplot()

- This is a very accurate model, as the curve is quite close to forming a 90-degree angle at the top-left corner of the plot.
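One way to put a number on the shape of the ROC curve is the area under it (AUC), which equals the probability that a randomly chosen true "bottom 50%" movie receives a higher predicted "bottom 50%" probability than a randomly chosen true "top 50%" movie. The sketch below shows that calculation in base R on made-up probabilities; the values are hypothetical and are not output from the fitted tree.

```r
# Hypothetical predicted probabilities of "bottom 50%" (made-up values, not model output)
prob_for_true_bottom <- c(0.9, 0.8, 0.4)   # movies that really are in the bottom 50%
prob_for_true_top    <- c(0.7, 0.3, 0.2)   # movies that really are in the top 50%

# Compare every (true bottom, true top) pair of movies
pair_diff <- outer(prob_for_true_bottom, prob_for_true_top, "-")

# AUC = share of pairs where the true "bottom 50%" movie gets the higher probability
# (ties count as half a win)
auc <- (sum(pair_diff > 0) + 0.5 * sum(pair_diff == 0)) / length(pair_diff)
print(auc)
```

An AUC of 1 corresponds to a curve that forms a perfect right angle in the top-left corner, while 0.5 corresponds to the diagonal line of a random guess.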

Explanation of Results:

The results of this decision tree show that user rating (‘score’) is more influential on the rank of a movie than the period in which it was released (‘years’). This is not extremely surprising, as users will most likely react to a film much as the editors at Rotten Tomatoes do. I am a bit surprised that older movies are more likely to rank higher, as I assumed that newer movies have better technology and therefore should be better.

I also wanted to analyze how the director would affect the ranking of a film, but was unable to make this work with the decision tree. I have also found that the title of a film is not very useful for this kind of analysis, for a similar reason to director: there are too many individual titles, so they do not work well with the decision tree method. Perhaps analyzing the length of titles or the presence of key words in titles would be more compatible with the decision tree model. In the future, I may use linear regression to see how factors such as score predict rank, because that may be more effective than the decision tree model.
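The title-based features mentioned above could be derived with a few base R calls. This is only a sketch: the titles in `toy_titles` and the words in `key_words` are hypothetical examples chosen for illustration, not values taken from rt200.

```r
# Hypothetical titles for illustration only (not taken from rt200)
toy_titles <- c("The Haunted House", "Night of the Dead", "Whispers")

# Feature 1: number of characters in each title
title_length <- nchar(toy_titles)

# Feature 2: does the title contain any of a few assumed key words?
key_words <- c("night", "dead", "house")
has_key_word <- grepl(paste(key_words, collapse = "|"), toy_titles,
                      ignore.case = TRUE)

# Combine into a small data frame of candidate predictors
toy_features <- data.frame(title = toy_titles, title_length, has_key_word)
print(toy_features)
```

Unlike raw titles, both features take only a handful of distinct values, so a decision tree has something meaningful to split on.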

MATH/CS 215: Intro to Data Science

Emma Griffith