“The 200 Best Horror Movies of All Time” ranking by Rotten Tomatoes
Research Questions:
Selected Technique:
# make rank and score numeric
rt200$rank<-as.numeric(rt200$rank)
rt200$score<-as.numeric(rt200$score)
# create factor "rank_over_101" that labels each film "bottom 50%" (rank 101-200) or "top 50%" (rank 1-100)
rt200 <- rt200 %>%
mutate(rank_over_101 = factor(if_else(rank >= 101, "bottom 50%", "top 50%")))
# create variable "years" that groups release year into four periods: 1919-1950, 1951-1980, 1981-2010, and 2011+
rt200 <- rt200 %>%
mutate(years = case_when(
year <= 1950 ~ '1919-1950',
year >= 1951 & year <= 1980 ~ '1951-1980',
year >= 1981 & year <= 2010 ~ '1981-2010',
year >= 2011 ~ '2011+'))
head(rt200)
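As a quick check on the cutoff used for `rank_over_101`, the sketch below rebuilds the label in base R on a stand-in vector of ranks. It assumes the ranks run from 1 to 200 with no ties or gaps; `toy_rank` and `toy_label` are hypothetical names and the real `rt200` data is not used.

```r
# Stand-in for the rank column: assumes ranks 1..200 with no ties or gaps
toy_rank <- 1:200

# Same rule as the mutate() call above, written with base R ifelse()
toy_label <- factor(ifelse(toy_rank >= 101, "bottom 50%", "top 50%"))

# Count the films in each class; a cutoff at 101 should split them evenly
table(toy_label)

# factor() sorts levels alphabetically, so "bottom 50%" is the first level
levels(toy_label)
```

The level order matters later: with "bottom 50%" as the first level, it is the class whose predicted probability is stored in the `.pred_bottom 50%` column used for the ROC curve.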
# scatterplot of rank and score with regression line, color coded by release period
ggplot(data = rt200,
mapping = aes(x = rank, y = score)) +
labs(x = "Rank by Rotten Tomatoes",
y = "User Rating",
color = "Year Released",
title = "Top 200 Horror Movies: Rank by Rotten Tomatoes vs. User Rating",
subtitle = "By Year Released") +
geom_point(aes(col=years)) +
geom_smooth(method = 'lm', color = "#E48957", se = FALSE)
- Rank and score have a linear relationship: films with higher user ratings tend to be ranked higher by the website. Films released 1919-1950 plot higher than newer films, suggesting a slight relationship between older films and higher rank and rating.
# scatterplot of rank and director (no regression line, since director is categorical)
ggplot(data = rt200,
mapping = aes(x = rank, y = director)) +
labs(x = "Rank by Rotten Tomatoes",
y = "Director",
title = "Top 200 Horror Movies: Rank by Rotten Tomatoes vs. Director") +
geom_point()
- As you can see, Director is a hard variable to use: there are many different directors, and too few of them appear more than once. For this reason, I chose not to use this variable in my analysis. In a future project, it would be interesting to see whether directors who made the list several times had films that scored higher than directors who made the list only once.
# split the data into training "rt200_train" (80%) and testing "rt200_test" (20%) data sets
set.seed(314)
rt200_split <- initial_split(rt200, prop = 0.80)
rt200_train <- training(rt200_split)
rt200_test <- testing(rt200_split)
# creates a feature engineering recipe "rt200_recipe" that uses "rt200_train" to predict "rank_over_101" from "score" and "years"; step_dummy() converts the nominal predictors into dummy variables
rt200_recipe <- recipe(rank_over_101 ~ score + years, data = rt200_train) %>%
step_dummy(all_nominal(), -all_outcomes())
# splits "rt200_train" into 5 folds to tune model hyperparameters
set.seed(314)
rt200_folds <- vfold_cv(rt200_train,v=5)
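To make the sample sizes behind the split and the folds concrete, here is a minimal base R sketch. It assumes 200 films in total and only illustrates the resulting sizes; it is not how `initial_split()` or `vfold_cv()` assign rows internally, and `n_total`, `n_train`, `n_test`, and `fold_id` are hypothetical names.

```r
set.seed(314)

n_total <- 200                      # assumed number of films in the ranking
n_train <- round(0.80 * n_total)    # rows kept for training
n_test  <- n_total - n_train        # rows held out for testing

# Assign each training row to one of 5 folds, in random order
fold_id <- sample(rep(1:5, length.out = n_train))

# Each fold is held out once while the other four are used to fit the tree
table(fold_id)
```

Under these assumptions, every candidate tree is fit on 128 films and assessed on the remaining 32, five times over.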
# creates a classification decision tree model specification "rt200_model" and marks its three hyperparameters for tuning with tune()
rt200_model <- decision_tree(cost_complexity = tune(),
tree_depth = tune(),
min_n = tune() ) %>%
set_engine('rpart') %>%
set_mode('classification')
# combines the model and the recipe created above into "rt200_tree_workflow"
rt200_tree_workflow <- workflow() %>%
add_model(rt200_model) %>%
add_recipe(rt200_recipe)
# builds a regular grid of candidate hyperparameter values for the grid search. With 3 levels for each of the 3 hyperparameters, this grid has 3^3 = 27 rows.
rt200_tree_grid <- grid_regular(cost_complexity(),
tree_depth(),
min_n(),
levels = 3)
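A regular grid crosses every candidate value of each hyperparameter with every value of the others. The sketch below mimics that with base R `expand.grid()`. The candidate values are illustrative assumptions, not read from the `dials` parameter objects, and `toy_grid` is a hypothetical name.

```r
# Illustrative candidate values only (assumed, not taken from dials)
toy_grid <- expand.grid(
  cost_complexity = c(1e-10, 1e-5, 1e-1),
  tree_depth      = c(1, 8, 15),
  min_n           = c(2, 21, 40)
)

# Three hyperparameters at three levels each: 3 * 3 * 3 combinations
nrow(toy_grid)

# Peek at the first few combinations
head(toy_grid)
```

Each row is one candidate tree configuration, and each is evaluated on every cross-validation fold during tuning.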
# tunes decision tree and finds the optimal hyperparameters from the training grid
set.seed(314)
rt200_tree_tuning <- rt200_tree_workflow %>%
tune_grid(resamples = rt200_folds ,
grid = rt200_tree_grid)
# shows the five best-performing hyperparameter combinations, ranked by area under the ROC curve.
rt200_tree_tuning %>% show_best(metric = 'roc_auc')
# selects the best model from tuning results based on area under ROC curve.
rt200_best_tree <- rt200_tree_tuning %>%
select_best(metric = 'roc_auc')
# adds the optimal hyperparameters to the workflow object
rt200_final_tree_workflow <- rt200_tree_workflow %>%
finalize_workflow(rt200_best_tree)
# fit workflow to the training data
rt200_tree_wf_fit <- rt200_final_tree_workflow %>%
fit(rt200_train)
# extracts the trained model from the workflow fit
rt200_tree_fit <- rt200_tree_wf_fit %>%
extract_fit_parsnip()
# see which variables are most important in the model
vip(rt200_tree_fit)
- Score is the most important predictor for this model.
# Visualize the decision tree
rpart.plot(rt200_tree_fit$fit, roundint = FALSE, main = "Top 200 Horror Movies:\nHow User Rating ('score') and Year Released Determine if a Movie is Ranked Top 50%")
# Fit final model workflow to the training data and output performance metrics on test data
rt200_tree_last_fit <- rt200_final_tree_workflow %>%
last_fit(rt200_split)
rt200_tree_last_fit %>% collect_metrics()
# Create a confusion matrix to display correct and incorrect predictions on the test set
rt200_tree_predictions <- rt200_tree_last_fit %>% collect_predictions()
conf_mat(rt200_tree_predictions, truth = rank_over_101 , estimate = .pred_class)
##             Truth
## Prediction   bottom 50% top 50%
##   bottom 50%         18       4
##   top 50%             2      16
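The counts in the confusion matrix can be turned into summary metrics by hand. This is a minimal base R sketch that re-enters the four counts shown above. It assumes "bottom 50%" is treated as the event class (the first factor level); `cm`, `accuracy`, `sensitivity`, and `specificity` are names introduced here for illustration.

```r
# Re-enter the confusion matrix: rows are predictions, columns are the truth
cm <- matrix(c(18, 2, 4, 16), nrow = 2,
             dimnames = list(Prediction = c("bottom 50%", "top 50%"),
                             Truth      = c("bottom 50%", "top 50%")))

# Share of all test films that were classified correctly: (18 + 16) / 40
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true "bottom 50%" films that were predicted "bottom 50%": 18 / 20
sensitivity <- cm["bottom 50%", "bottom 50%"] / sum(cm[, "bottom 50%"])

# Share of true "top 50%" films that were predicted "top 50%": 16 / 20
specificity <- cm["top 50%", "top 50%"] / sum(cm[, "top 50%"])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

On these 40 test films, that works out to an accuracy of 0.85, a sensitivity of 0.90, and a specificity of 0.80.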
# Plot the ROC curve to visualize test set performance of tuned decision tree
rt200_tree_last_fit %>% collect_predictions() %>%
roc_curve(truth = rank_over_101 , estimate = ".pred_bottom 50%" ) %>%
autoplot()
- This is a very accurate model, as the ROC curve is quite close to a 90-degree angle in the top-left corner.
Explanation of Results: