Bayesian Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(ggplotify)
library(dplyr)
library(statsr)
library(BAS)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")

Part 1: Data

This data set comprises of 651 randomly sampled movies produced and released before 2016, with 32 variables recorded for each of these movies.

As the movies in this set are randomly sampled, and there is significant representation from each genre type, studio and MPAA rating, one could generalize the results to the entire population of movies before 2016.

However, as random assignment was not used in this study, no causation can be established.

Part 2: Data manipulation

Some of the data wrangling efforts will be shown below to obtain the following variables: 1) feature_film: “yes” if title_type is Feature Film, “no” otherwise 2) drama: “yes” if genre is Drama, “no” otherwise 3) mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise 4) oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise 5) summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise

movies_prep <- movies %>% 
          mutate(feature_film=as.factor(ifelse(title_type == 'Feature Film', 'yes', 'no'))) %>%
          mutate(drama=as.factor(ifelse(genre == 'Drama', 'yes', 'no'))) %>%
          mutate(mpaa_rating_R=as.factor(ifelse(mpaa_rating == 'R', 'yes', 'no'))) %>%
          mutate(oscar_season=as.factor(ifelse(thtr_rel_month %in% c(10,11,12), 'yes', 'no'))) %>%
          mutate(summer_season=as.factor(ifelse(thtr_rel_month %in% c(5:8), 'yes', 'no')))

Bay_Set <-movies_prep %>%
  select(audience_score,runtime,feature_film,drama, mpaa_rating_R,thtr_rel_year,
         oscar_season,summer_season,imdb_rating,imdb_num_votes,critics_score,
         best_pic_nom,best_pic_win,best_actor_win,best_actress_win,best_dir_win,
         top200_box)

EDA_Set <- movies_prep %>%
  select(audience_score,feature_film,drama,mpaa_rating_R,oscar_season,summer_season)

We now have 2 sets of data, the Bay_Set, which will be used in the Bayesian modelling, and the EDA_set, which will be used in the exploratory data analysis. * * *

Part 3: Exploratory data analysis

As the dataset here is largely similar to the previous dataset in the course “Linear Modelling and Regression”, the following EDA analysis will be on the 5 created variables. The data below shows the summary statistics for the data we are exploring.

summary(EDA_Set)

##  audience_score  feature_film drama     mpaa_rating_R oscar_season
##  Min.   :11.00   no : 60      no :346   no :322       no :460     
##  1st Qu.:46.00   yes:591      yes:305   yes:329       yes:191     
##  Median :65.00                                                    
##  Mean   :62.36                                                    
##  3rd Qu.:80.00                                                    
##  Max.   :97.00                                                    
##  summer_season
##  no :443      
##  yes:208      
##               
##               
##               
##

We will then visualize this for more clarity:

par(mfrow=c(2,3))
plot(EDA_Set$feature_film, main="Feature Film",ylim=c(0,650))
plot(EDA_Set$drama, main="Drama",ylim=c(0,650))
plot(EDA_Set$mpaa_rating_R, main="R rating",ylim=c(0,650))
plot(EDA_Set$oscar_season, main="Oscar Season",ylim=c(0,650))
plot(EDA_Set$summer_season, main="Summer Season",ylim=c(0,650))

From this, it is evident that feature films make up the majority of the movies in the dataset, with approximately half of them being of the Drama genre and have an MPAA rating of R. Approximately 32% of movies were released during the Summer season whereas 29% of movies were released during the Oscar season. With this, let us try to determine the relationship between the new variables and audience scores:

par(mfrow=c(2,3))
boxplot(audience_score~feature_film, 
                           data=EDA_Set, main='Feature Film vs Audience Scores', 
                           xlab='Feature Film?', ylab='Score')
boxplot(audience_score~drama, 
                           data=EDA_Set, main='Drama vs Audience Scores', 
                           xlab='Drama?', ylab='Score')
boxplot(audience_score~mpaa_rating_R,
                           data=EDA_Set, main='MPaa Rating R vs Audience Scores', 
                           xlab='R Rating?', ylab='Score')
boxplot(audience_score~oscar_season,
                           data=EDA_Set, main='Oscar Season vs Audience Scores', 
                           xlab='Oscar Season?', ylab='Score')
boxplot(audience_score~oscar_season,
                           data=EDA_Set, main='Summer Season vs Audience Scores', 
                           xlab='Summer Season?', ylab='Score')

We can see that based on this initial analysis, Summer Seaon, Oscar Season and MPaa rating R has little influence on the resulting audience score. The strongest predictior is whether or not the film was a feature film, followed by the drama genre. Hence, we should not be surprised in the former 3 variables will be eliminated from the analysis.

2 things that I left out in my previous analysis were looking at critics_score and imdb_rating in relation to audience scores. With that in mind, here are the following plots for these 2 factors

IMDB Rating to Audience Score
The following is the plot of IMDB rating against Audience scores.

Imdbrating<-ggplot(data = Bay_Set, aes(x  = imdb_rating, y = audience_score))  
Imdbrating<-Imdbrating+ geom_point() + geom_smooth(method = "lm" , se = FALSE)  
Imdbrating+ggtitle(label = "IMDB Rating vs Audience Score")

As seen from this graph, there is a strong linear relationship between IMDB rating and Audience Score. The correlation coefficient is:

cor(Bay_Set$imdb_rating,Bay_Set$audience_score)

## [1] 0.8648652

Critics Score to Audience Score
The following is the plot of critics scores against Audience scores.

Critscore<-ggplot(data = Bay_Set, aes(x  = critics_score, y = audience_score))  
Critscore<-Critscore+ geom_point() + geom_smooth(method = "lm" , se = FALSE)  
Critscore+ggtitle(label = "Critics Score vs Audience Score")

Although seemingly scattered, there seems to be a general strong linear relationship. The correlation coefficient of this pair is:

cor(Bay_Set$critics_score,Bay_Set$audience_score)

## [1] 0.7042762

These 2 results mean that critics score and Imdb ratings are likely to play an important role in the final variable selection. * * *

Part 4: Modeling

The goal of this exercise is to be able to predict the popularity of a movie, which is represented by the audience_score variable. The predictors chosen are:
1) runtime
2) thtr_rel_year
3) imdb_rating
4) imdb_num_votes
5) critics_score
6) best_pic_nom
7) best_pic_win
8) best_actor_win
9) best_actress_win
10) best_dir_win
11) top200_box
12) feature_film
13) drama
14) mpaa_rating_R
15) oscar_season
16) summer_season
These are already selected under the Bay_set group above as the predictor variables along with audience_score (the response variable).

Due to the large number of predictors present (w a potential for a large number of model combinations), the Markov Chain Monte Carlo method (MCMC) was used to sampling models. A uniform distribution was also used for allocating prior probabilities for all models.

bas_Eric_model<- bas.lm(audience_score ~.,data=Bay_Set, method='MCMC',  
                        modelprior=uniform())

## Warning in bas.lm(audience_score ~ ., data = Bay_Set, method = "MCMC",
## modelprior = uniform()): dropping 1 rows due to missing data

bas_Eric_model

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = Bay_Set, modelprior = uniform(), 
##     method = "MCMC")
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept              runtime      feature_filmyes  
##             1.00000              0.46364              0.06872  
##            dramayes     mpaa_rating_Ryes        thtr_rel_year  
##             0.04480              0.20233              0.09397  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07768              0.08276              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06093              0.88364              0.13563  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04251              0.14640              0.14609  
##     best_dir_winyes        top200_boxyes  
##             0.07056              0.05100

par(mfrow=c(1,2))
plot(bas_Eric_model,which=c(1:2),ask=FALSE)

The above show charts indicating the performance of the regression fitting. In the residual vs fitted graph, random scattering was not observed. Instead, the residues seem to indicate overfitting and underfitting at different sections in time, which indicates that perhaps more predictors should be included in the model to predict audience scores.

As for model probabilities, it was clear that using the MCMC method reduced the amount of models required to be sampled to achieve optimality, as only 3500 model combinations were sampled before the model posterior probability dentisty reaches 1.

plot(bas_Eric_model,which=4,ask=FALSE, cex.lab=0.5)

As for inclusion probabilities as seen from the graph above, the high likelihood of inclusion for imdb_rating and critics_score is expected as explored in the EDA.

image(bas_Eric_model,rotate=FALSE)

As we can clearly see from this plot above of model ranking, the model with the highest posterior odds only has 2 factors, imdb_rating and critics_score (the only 2 throughout all the models).

With this new set, we will now plot the posterior distribution of the regression coefficients:

par(mfrow=c(1,3))
plot(coefficients(bas_Eric_model), subset=c(1, 9, 11), ask=FALSE)

We notice that imdb_rating is not entirely normal as there is an additional bump that seems to be happening on the RHS. However,this was small, so it was not investigated further.

Checking for normal distribution of posterior probabilities

diagnostics(bas_Eric_model, type = "model", pch = 20)

As we can see clearly from the chart, the posterior probabilities follow a normal distribution.

Part 5: Prediction

For a meaningful comparison, I have chosen Thor-Ragnarok as the movie of choice to test the model. The information obtained was from the following websites:

Rotten Tomatoes: https://www.rottentomatoes.com/m/thor_ragnarok_2017
IMDB: https://www.imdb.com/title/tt3501632/?ref_=tt_rt
https://www.imdb.com/title/tt3501632/awards

Thormovie<- data.frame(runtime=as.numeric(130),
                       feature_film= factor("yes",levels=c("no","yes")),
                       drama= factor("no",levels=c("no","yes")),
                       mpaa_rating_R= factor("no",levels=c("no","yes")),
                       thtr_rel_year= as.numeric(2017),
                       oscar_season= factor("yes",levels=c("no","yes")), 
                       summer_season= factor("no",levels=c("no","yes")),
                       imdb_rating=as.numeric(7.9),
                       imdb_num_votes=as.numeric(528758),
                       critics_score=as.numeric(93),
                       best_pic_nom= factor("no",levels=c("no","yes")),
                       best_pic_win= factor("no",levels=c("no","yes")),
                       best_actor_win= factor("no",levels=c("no","yes")),
                       best_actress_win= factor("no",levels=c("no","yes")),
                       best_dir_win= factor("no",levels=c("no","yes")),
                       top200_box= factor("no",levels=c("no","yes")))
Predictor<-predict(bas_Eric_model,newdata=Thormovie,estimator="BMA",se.fit=TRUE, prediction=TRUE)

confint(Predictor)

##          2.5%    97.5%     pred
## [1,] 65.62781 105.3428 85.14679
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The predicted value using this model for Thor Ragnarok is 85.1, which only deviates from the observed value by 1.9 points. The 95% confidence interval for this prediction is 64.4 and 106.2.

Part 6: Conclusion

In this analysis, we created a linear regression model using bayesian model averaging, which did better than a pure basic linear regression model done in course 3 in predicting audience scores. However, there are certain shortfalls that warrant another look into the data:
1. Why is there a wave pattern in the residual plot? 2. Why imdb_rating residual coefficient probability curve is not normally distributed?

If these are addressed, this will be a very coherent piece! Thanks for reading!

Eric, Jan 19 2020