Question

What are the different ways in which a model for predicting IMDb scores trains for each genre?

Background Information

Several articles we found discussed how movie scores vary by genre. One Reddit thread, for example, included opinions that horror movies are usually rated lower due to unpleasant content and poor acting. A data analysis article on IMDb, genres, and ratings concluded that horror was the lowest-scoring genre, whereas documentaries scored the highest. Among movies tagged with multiple genres, drama appeared frequently in the highest-scoring combinations, while horror appeared most frequently in the lowest-scoring ones. We therefore wanted to explore how different factors contribute to a movie’s rating depending on genre, focusing particularly on horror and drama.

Initially, we found a group of datasets that measured the same movie features across multiple streaming platforms and considered comparing ratings across platforms. However, we quickly realized that those datasets offered few variables to work with, which would greatly limit the analysis we could perform. We instead settled on our current dataset, titled “Latest Netflix data with 26+ joined attributes,” which we obtained from Kaggle and which contains plenty of columns such as IMDb Score, Awards Received, Genre, View Rating, and Country Availability. The dataset covers a wide range of Netflix titles and aggregates features from Netflix, Rotten Tomatoes, IMDb, posters, box office information, trailers on YouTube, and more. It is updated every month, with the latest update in April 2021.

Exploratory Data Analysis

Data Cleaning

  • Important Variables: Genre, Series or Movie, Country Availability, Runtime, View Rating, Director, Actors, IMDb Score, Awards Received, Awards Nominated, IMDb Votes

  • Created a new factor target, “Score”, with two levels (“Above” and “Below”) based on whether the IMDb score is above or below the average

  • Replaced NAs with 0 for Awards Nominated and Awards Received

  • Replaced NAs for View Rating with “Not Rated”

  • Collapsed Directors and Actors into the top 6 plus an “Other” level

  • Converted data types to factors (a sketch of these cleaning steps appears after this list)
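
A minimal sketch of these cleaning steps, assuming the Kaggle export has been read into a data frame named netflix with the column names used later in the variable-importance output (the helper names and level labels are illustrative, not our exact code):

library(dplyr)
library(forcats)
library(tidyr)

netflix_clean <- netflix %>%
  # Binary target: above or below the overall average IMDb score
  mutate(Score = factor(ifelse(IMDb.Score >= mean(IMDb.Score, na.rm = TRUE),
                               "Above", "Below"),
                        levels = c("Above", "Below"))) %>%
  # Missing award counts become 0
  mutate(Awards.Received      = replace_na(Awards.Received, 0),
         Awards.Nominated.For = replace_na(Awards.Nominated.For, 0)) %>%
  # Missing view ratings become an explicit "Not Rated" level
  mutate(View.Rating = replace_na(as.character(View.Rating), "Not Rated")) %>%
  # Keep the six most frequent directors/actors, lump the rest into "Other"
  mutate(directors_2 = fct_lump_n(as.factor(Director), n = 6),
         actors_2    = fct_lump_n(as.factor(Actors),   n = 6)) %>%
  # Convert remaining character columns to factors
  mutate(across(where(is.character), as.factor))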



Prevalence

##        prevalance
## Horror  0.2917847
## Drama   0.6036036
## Other   0.4688330
## All     0.5142022
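
The prevalence shown above is the proportion of titles whose Score is “Above” within each subset. A sketch of how it could be computed, assuming a Genre_2 column with “Horror”, “Drama”, and “Other” levels:

prev <- function(d) mean(d$Score == "Above")

data.frame(prevalence = c(
  Horror = prev(subset(netflix_clean, Genre_2 == "Horror")),
  Drama  = prev(subset(netflix_clean, Genre_2 == "Drama")),
  Other  = prev(subset(netflix_clean, Genre_2 == "Other")),
  All    = prev(netflix_clean)))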



Average IMDb score for each genre



Score Distribution for each Genre
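
The two figures referenced above are not reproduced in this text version. A sketch of the summaries behind them, assuming the first plots the mean IMDb score per collapsed genre and the second plots the distribution of raw IMDb scores by genre:

library(ggplot2)

# Average IMDb score per collapsed genre
aggregate(IMDb.Score ~ Genre_2, data = netflix_clean, FUN = mean)

# Distribution of IMDb scores within each genre
ggplot(netflix_clean, aes(x = Genre_2, y = IMDb.Score)) +
  geom_boxplot() +
  labs(x = "Genre", y = "IMDb Score")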

Methods

Initially, we wanted to create a k-means clustering model that clustered by genre, hoping to see whether particular features contributed differently to each genre. For example, based on our earlier research, we expected Horror movies to have lower IMDb scores than movies of other genres. We also expected them to be nominated for and receive fewer awards. However, after creating and adjusting our k-means clustering model to cluster over a variety of features, we saw too much visual overlap between the data points to draw any valuable conclusions.
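
For reference, a minimal sketch of the kind of k-means clustering we attempted; the feature set, number of clusters, and seed here are illustrative rather than our exact configuration:

num_feats <- scale(netflix_clean[, c("IMDb.Score", "IMDb.Votes",
                                     "Awards.Received", "Awards.Nominated.For")])
num_feats <- na.omit(num_feats)

set.seed(2021)
km <- kmeans(num_feats, centers = 3, nstart = 25)

# Cluster sizes; scatterplots of feature pairs colored by cluster showed
# heavy overlap between genres rather than clean separation.
table(km$cluster)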

Because of the issues we were experiencing with our k-means clustering model, we decided to build a classification tree instead. With this type of model, we aimed to predict whether a movie or series would have an above-average or below-average rating, which is captured in the Score column. To provide more in-depth analysis of our question, we decided it would be more interesting to create four classification models and compare their outputs. These models are split by genre: Horror, Drama, Other (everything except Horror and Drama), and All (all genres together, with no filtering). The primary techniques we used were purposeful feature selection, growing or shrinking the tree by tuning its max depth, and adjusting the model’s classification threshold to balance sensitivity and specificity.
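
A sketch of how the four modeling datasets could be carved out, assuming the collapsed genre labels live in Genre_2:

horror_data <- subset(netflix_clean, Genre_2 == "Horror")
drama_data  <- subset(netflix_clean, Genre_2 == "Drama")
other_data  <- subset(netflix_clean, !(Genre_2 %in% c("Horror", "Drama")))
all_data    <- netflix_clean  # no genre filtering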

We partitioned our data into train, tune, and test sets, accounting for 80, 10, and 10 percent of the data respectively. Next, with careful consideration of how each column would affect the classification tree, we chose the features our models would train on. We removed Genre, since the specific genres relevant to our question now live in the Genre_2 column. Although we initially wanted to use the raw Director and Actors columns, we removed them because the wide range of names was too difficult to collapse or restructure beyond the top-6 groupings (the collapsed actors_2 and directors_2 columns were kept). We also removed IMDb Score, since the Score column was calculated directly from it and keeping it would let the model simply memorize the answer. Lastly, Score itself was excluded from the predictors, as it is our target variable.
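
A minimal sketch of the 80/10/10 partition and the feature drops described above, using caret::createDataPartition (the column names mirror the variable-importance output; the split helper is illustrative):

library(caret)
set.seed(2021)

# Columns excluded from the predictors; Score stays in the data as the target
drop_cols <- c("Genre", "Director", "Actors", "IMDb.Score")

split_80_10_10 <- function(d) {
  d <- d[, setdiff(names(d), drop_cols)]
  train_idx <- createDataPartition(d$Score, p = 0.8, list = FALSE)
  train <- d[train_idx, ]
  rest  <- d[-train_idx, ]
  tune_idx <- createDataPartition(rest$Score, p = 0.5, list = FALSE)
  list(train = train, tune = rest[tune_idx, ], test = rest[-tune_idx, ])
}

horror_sets <- split_80_10_10(horror_data)
drama_sets  <- split_80_10_10(drama_data)
other_sets  <- split_80_10_10(other_data)
all_sets    <- split_80_10_10(all_data)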



Results and Evaluation

Horror Model

threshold = 0.4

## rpart2 variable importance
## 
##                      Overall
## IMDb.Votes            100.00
## Awards.Nominated.For   88.64
## Awards.Received        83.62
## Runtime                58.83
## Series.or.Movie        53.78
## Country.Availability   17.82
## View.Rating            12.06
## actors_2                0.00
## directors_2             0.00
## Genre_2                 0.00

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Above Below
##      Above    29    14
##      Below    17    99
##                                           
##                Accuracy : 0.805           
##                  95% CI : (0.7348, 0.8635)
##     No Information Rate : 0.7107          
##     P-Value [Acc > NIR] : 0.004418        
##                                           
##                   Kappa : 0.5165          
##                                           
##  Mcnemar's Test P-Value : 0.719438        
##                                           
##             Sensitivity : 0.6304          
##             Specificity : 0.8761          
##          Pos Pred Value : 0.6744          
##          Neg Pred Value : 0.8534          
##               Precision : 0.6744          
##                  Recall : 0.6304          
##                      F1 : 0.6517          
##              Prevalence : 0.2893          
##          Detection Rate : 0.1824          
##    Detection Prevalence : 0.2704          
##       Balanced Accuracy : 0.7533          
##                                           
##        'Positive' Class : Above           
## 
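
As a quick check, the reported metrics follow directly from the confusion matrix above, with “Above” as the positive class: sensitivity = 29 / (29 + 17) ≈ 0.630, specificity = 99 / (99 + 14) ≈ 0.876, precision (positive predictive value) = 29 / (29 + 14) ≈ 0.674, accuracy = (29 + 99) / 159 ≈ 0.805, and prevalence of “Above” titles in the test set = 46 / 159 ≈ 0.289.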



Drama Model

threshold = 0.6

## rpart2 variable importance
## 
##                       Overall
## Runtime              100.0000
## Awards.Nominated.For  80.3397
## Awards.Received       75.4967
## Series.or.Movie       72.1998
## IMDb.Votes            54.4659
## View.Rating           19.2955
## Country.Availability   0.2103
## directors_2            0.0000
## actors_2               0.0000
## Genre_2                0.0000

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Above Below
##      Above   413   100
##      Below   120   250
##                                           
##                Accuracy : 0.7508          
##                  95% CI : (0.7209, 0.7791)
##     No Information Rate : 0.6036          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4844          
##                                           
##  Mcnemar's Test P-Value : 0.2002          
##                                           
##             Sensitivity : 0.7749          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.8051          
##          Neg Pred Value : 0.6757          
##               Precision : 0.8051          
##                  Recall : 0.7749          
##                      F1 : 0.7897          
##              Prevalence : 0.6036          
##          Detection Rate : 0.4677          
##    Detection Prevalence : 0.5810          
##       Balanced Accuracy : 0.7446          
##                                           
##        'Positive' Class : Above           
## 



Other Model

threshold = 0.4

## rpart2 variable importance
## 
##                      Overall
## Runtime              100.000
## Series.or.Movie       92.586
## Awards.Received       74.222
## Awards.Nominated.For  71.824
## View.Rating           42.935
## IMDb.Votes            24.281
## Country.Availability   6.096
## actors_2               0.000
## directors_2            0.000
## Genre_2                0.000

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Above Below
##      Above   266   164
##      Below   184   346
##                                          
##                Accuracy : 0.6375         
##                  95% CI : (0.6062, 0.668)
##     No Information Rate : 0.5312         
##     P-Value [Acc > NIR] : 1.816e-11      
##                                          
##                   Kappa : 0.2702         
##                                          
##  Mcnemar's Test P-Value : 0.3084         
##                                          
##             Sensitivity : 0.5911         
##             Specificity : 0.6784         
##          Pos Pred Value : 0.6186         
##          Neg Pred Value : 0.6528         
##               Precision : 0.6186         
##                  Recall : 0.5911         
##                      F1 : 0.6045         
##              Prevalence : 0.4688         
##          Detection Rate : 0.2771         
##    Detection Prevalence : 0.4479         
##       Balanced Accuracy : 0.6348         
##                                          
##        'Positive' Class : Above          
## 



Model with All Genres

threshold = 0.4

## rpart2 variable importance
## 
##                      Overall
## Awards.Received      100.000
## Awards.Nominated.For  79.213
## Runtime               64.666
## Series.or.Movie       46.952
## IMDb.Votes            36.611
## Genre_2                9.035
## View.Rating            1.162
## Country.Availability   0.000
## actors_2               0.000
## directors_2            0.000

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Above Below
##      Above   686   277
##      Below   343   695
##                                           
##                Accuracy : 0.6902          
##                  95% CI : (0.6694, 0.7104)
##     No Information Rate : 0.5142          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.381           
##                                           
##  Mcnemar's Test P-Value : 0.009042        
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.7150          
##          Pos Pred Value : 0.7124          
##          Neg Pred Value : 0.6696          
##               Precision : 0.7124          
##                  Recall : 0.6667          
##                      F1 : 0.6888          
##              Prevalence : 0.5142          
##          Detection Rate : 0.3428          
##    Detection Prevalence : 0.4813          
##       Balanced Accuracy : 0.6908          
##                                           
##        'Positive' Class : Above           
## 

Each model was built with repeated cross-validation over 50 trees, and we evaluated each tree at maximum depths ranging from 1 to 11. To choose the best depth, we examined ROC, sensitivity, and specificity, picking the depth with the highest ROC while keeping sensitivity and specificity, which trade off against each other, both as high as possible; a high ROC indicates that the model separates the two classes well. We also used a function discussed in class to adjust the probability threshold at which the model classifies shows and movies as “Above” or “Below” the average rating. For each model we tested several thresholds and examined the resulting confusion matrix, focusing on accuracy, sensitivity, and specificity, and, as with max depth, chose the threshold with the best combination of those values. Our final choices are displayed in the Metrics by Model table below.
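
A sketch of this workflow for the Horror model, assuming caret with method = "rpart2"; the fold/repeat counts and the adjust_threshold helper stand in for the in-class function rather than reproducing our exact code:

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

horror_tree <- train(Score ~ ., data = horror_sets$train,
                     method = "rpart2", metric = "ROC",
                     trControl = ctrl,
                     tuneGrid = expand.grid(maxdepth = 1:11))

horror_tree$bestTune   # chosen max depth
varImp(horror_tree)    # variable importance tables shown above

# Reclassify at a custom probability cutoff for the positive class ("Above")
adjust_threshold <- function(probs, threshold) {
  factor(ifelse(probs$Above >= threshold, "Above", "Below"),
         levels = c("Above", "Below"))
}

test_probs <- predict(horror_tree, newdata = horror_sets$test, type = "prob")
test_pred  <- adjust_threshold(test_probs, threshold = 0.4)
confusionMatrix(test_pred, horror_sets$test$Score, positive = "Above")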

Comparing all four models, the Horror classification tree had the highest ROC value (0.7985) as well as the highest accuracy (0.805). On the other hand, the Other tree had the lowest ROC (0.6573) and the lowest overall accuracy (0.6375). This suggests that our models generally perform better when the genre is narrowed down, perhaps because restricting to a single genre makes the relationships between our features and the target more apparent. Interestingly, the All tree outperformed the Other tree in accuracy even though it was not narrowed down by genre at all; this may be because the Genre_2 column, covering more genres, provided more useful splits for the All tree than it did for the Other tree. The best max depth was 4 for both the Other and All trees, 5 for the Drama tree, and 9 for the Horror tree. The chosen threshold was 0.6 for Drama and 0.4 for the other three models.

The most important feature for the Drama and Other models was Runtime, whereas the most important feature was IMDb Votes for Horror and Awards Received for All. Across the three genre-specific models (Horror, Drama, and Other), the variable importance for both actors_2 and directors_2 is zero. This is likely because we collapsed both variables to only the top 6 actors/directors plus “Other”; given the size of our dataset and the huge number of distinct actors and directors it contains, the effect of any single actor or director on the model is minuscule. In addition, although it does not appear in the top-five table below, the sixth most important variable for the All model was Genre_2 (importance 9.04), whereas Genre_2 had an importance of 0 in the other three models. Country Availability had only minimal importance in the three genre-specific models and zero importance in the All model.


Variable Importance by Model

(Each cell shows a variable and its importance score within that model.)

Rank  Drama                          Horror                         Other                          All
1     Runtime (100)                  IMDb Votes (100)               Runtime (100)                  Awards Received (100)
2     Awards Nominated For (80.34)   Awards Nominated For (88.64)   Series or Movie (92.59)        Awards Nominated For (79.21)
3     Awards Received (75.5)         Awards Received (83.62)        Awards Received (74.22)        Runtime (64.67)
4     Series or Movie (72.2)         Runtime (58.83)                Awards Nominated For (71.82)   Series or Movie (46.95)
5     IMDb Votes (54.47)             Series or Movie (53.78)        View Rating (42.93)            IMDb Votes (36.61)



Metrics by Model

Genre    Threshold  maxDepth  ROC    Sensitivity  Specificity  Accuracy
Horror   0.4        9         0.799  0.630        0.876        0.805
Drama    0.6        5         0.770  0.775        0.714        0.751
Other    0.4        4         0.657  0.591        0.678        0.638
All      0.4        4         0.701  0.667        0.715        0.690



Conclusion

After examining all four models, we found that predictive accuracy increases when only one genre is modeled. This may be because mixing many genres obscures the relationships between particular features and particular genres, whereas filtering to a single genre makes those relationships clearer. We also learned that to predict movie ratings more accurately, we would need to include more features, including genre-specific ones, and potentially change the feature set depending on which genre we are predicting for. Building a single model to predict ratings across multiple genres is likely to be a harder task, and harder to draw conclusions from. On the other hand, including a column that captures genre could improve IMDb score prediction accuracy, as suggested by the All model achieving better accuracy than the Other model, which had no genre-specific information. However, this is only our intuition, since, strangely, the Genre_2 column was not among the top five most important variables for the All model. The difficulty would lie in deciding how to construct such a genre column, especially since movies can belong to multiple genres; one possibility would be to group titles into a small number of main genres, as we did with the Drama and Horror values.



Future Work

Because not all Netflix titles had Rotten Tomatoes or Metacritic ratings, we were unable to create a combined metric that balances several rating sources rather than relying solely on the IMDb score; in the future, it would be interesting to see how our results would change with such a combined score. Similarly, with more data-scraping experience or access to more resources and time, we could build a dataset of all movies and shows from the top streaming services and collect their rating information. This would give a more holistic view of how general audiences rate movies and would reduce the sampling bias toward titles and audiences found only on Netflix. Furthermore, because IMDb breaks down the voting demographics for its films, it would be interesting to see who makes up the majority of voters behind the ratings of different genres; this may (or may not) bring more insight into our question, and we could include IMDb member data as predictors, keeping data ethics in mind. Finally, trying different models, such as logistic regression, could show whether predictions improve and reveal correlations between the target and predictor variables, and we could test other model types more generally to see whether they produce different outcomes.