Bayesian modeling and prediction for movies

Emiliano La Rocca

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

load("movies.Rdata")

```

Part 1: Data

Generalizability

The data is gathered online, from whoever wants to leave ratings or reviews for movies. Although the respondents are self-selecting (as opposed to randomly selected), there is no reason to think they are skewed toward any one demographic, except for the fact they all have access to a computer and the internet. Thus, the opinions (ratings and scores) are generalizable to internet users who watch movies.

Causality

The data were derived from an observational, not experimental, study which precludes a generalization to the population from which it was obtained. The best we can look for in the data is correlations, not causations. This is not to say we can’t develop models for prediction based on the data, just that we can’t assume these models address causality.

Part 2: Data manipulation

Create new variable based on title_type: new variable should be called feature_film with levels yes (movies that are feature films) and no
Create new variable based on genre: new variable should be called drama with levels yes (movies that are dramas) and no
Create new variable based on mpaa_rating: new variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no
Create two new variables based on thtr_rel_month: new variable called oscar_season with levels yes (if movie is released in November, October, or December) and no
New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no

Construct some new variables:

movies<-mutate(movies, feature_film=ifelse(title_type=='Feature Film', 'yes', 'no'))
movies$feature_film<-as.factor(movies$feature_film)
summary(movies$feature_film)

##  no yes 
##  60 591

movies<-mutate(movies, drama=ifelse(genre=='Drama', 'yes', 'no'))
movies$drama<-as.factor(movies$drama)
summary(movies$drama)

##  no yes 
## 346 305

movies<-mutate(movies, mpaa_rating_R=ifelse(mpaa_rating=='R', 'yes', 'no'))
movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
summary(movies$mpaa_rating_R)

##  no yes 
## 322 329

movies<-mutate(movies, oscar_season=ifelse(thtr_rel_month %in% c(10, 11, 12), 'yes', 'no'))
movies$oscar_season<-as.factor(movies$oscar_season)
summary(movies$oscar_season)

##  no yes 
## 460 191

movies<-mutate(movies, summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no'))
movies$summer_season<-as.factor(movies$summer_season)
summary(movies$summer_season)

##  no yes 
## 443 208

Creation of the data set with variable for building the predictive model.

df<-movies %>% select(feature_film,drama,runtime,mpaa_rating_R,
                     thtr_rel_year,oscar_season,summer_season,imdb_rating,
                     imdb_num_votes,critics_score,best_pic_nom,
                     best_pic_win,best_actor_win,best_actress_win,
                     best_dir_win,top200_box,audience_score)
str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  17 variables:
##  $ feature_film    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
##  $ drama           : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating_R   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ oscar_season    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ summer_season   : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...

Part 3: Exploratory data analysis

In this part analyzes the relationship between audience and score new variables created through a graphical analysis and also using Bayesian inference. Relationship between audience score and gender.

ggplot(movies, aes(x=factor(genre), y=audience_score)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Several genres such as Animation, Documentary, and Musical & Performing Arts seem to have a relatively high audience score with little variation while the audience scores for other genres such as Action and Adventure, Drama, and Science Fiction & Fantasy are lower on average with a much greater range of scores. The lower variation for the genres Animation and Science Fiction and Fantasy could be due to the relatively low number of movies within these genres.

ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
  geom_point(aes(colour=factor(genre))) +
  xlab("Year of Release (theater)") +
  ylab("Audience Score")

A plot of either the theater release date or the DVD release date as the predictor variable and audience score as the response variable did not show any obvious pattern for any particular genre. The Figure shown below displays the audience score as a function of the theater release year.

I will study the relationship between audience score and whether it’s a feature movie.

ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = feature_film)) + geom_boxplot()

bayes_inference(y = audience_score, x = feature_film, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 4.221452e+13
## P(H1|data) = 0 
## P(H2|data) = 1

We can see that there are evidence that difference existed between those two groups. The influence is strong that it provides very strong evidence against H1, which means that there is a significant difference of mean audience score for non-feature and feature films. For the other variables, we can do the exaclty the same testing:
drama films, mpaa_rating_R films, oscar season films and summer season films.

The relationship between audience score and the genre of the movie:drama.

ggplot(movies, aes(x=drama, y=audience_score, fill=drama)) + geom_boxplot()

bayes_inference(y = audience_score, x = drama, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 346, y_bar_no = 59.7312, s_no = 21.2775
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 22.6567
## P(H1|data) = 0.0423 
## P(H2|data) = 0.9577

It shows that drama have a slightly lower audience score than other movies. The relationship between auddience score and rating.

ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()

bayes_inference(y = audience_score, x =mpaa_rating_R, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 322, y_bar_no = 62.6894, s_no = 20.3167
## n_yes = 329, y_bar_yes = 62.0426, s_yes = 20.1559
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 23.9679
## P(H1|data) = 0.9599 
## P(H2|data) = 0.0401

It shows no significant difference of audience score between the rating R films and others. The relationship between the audience score and the time of the movie released.

ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + geom_boxplot()

bayes_inference(y = audience_score, x = oscar_season, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 191, y_bar_yes = 63.6859, s_yes = 20.4612
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 13.3993
## P(H1|data) = 0.9306 
## P(H2|data) = 0.0694

ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = summer_season)) + geom_boxplot()

bayes_inference(y = audience_score, x = summer_season, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 443, y_bar_no = 62.623, s_no = 20.3857
## n_yes = 208, y_bar_yes = 61.8077, s_yes = 19.9083
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 21.7118
## P(H1|data) = 0.956 
## P(H2|data) = 0.044

As we can see that the movie in the oscar season have slightly higher audience score than those who do not, while the movies in the summer season have slightly lower audience score.

Part 4: Modeling

Model building: the modeling goal was to predict the popularity of a movie with bayesian model - as quantified by the audience_score variable - using a linear regression model and bayesian model averaging. A bayesian linear regression model was created using all of the initial predictor variables. Because of the large number of predictors (leading to a very large number of possible model combinations), the Markov Chain Monte Carlo (MCMC) method was used for sampling models during the fitting process and the prior probabilities for the regression coefficients were assigned using the Zellner-Siow Cauchy distribution. A uniform distribution was used for the prior probabilities for all models.

# Create a basian linear regression model to derive audience_score from the
# initial set of predictors.

bayesian_model <- bas.lm(data=na.omit(movies), audience_score ~ feature_film + drama + runtime + mpaa_rating_R
               + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes
               + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win
               + best_dir_win + top200_box, prior = 'BIC', modelprior = uniform())
bayesian_model

## 
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime +     mpaa_rating_R + thtr_rel_year + oscar_season + summer_season +     imdb_rating + imdb_num_votes + critics_score + best_pic_nom +     best_pic_win + best_actor_win + best_actress_win + best_dir_win +     top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.05876              0.04509  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.51400              0.16498              0.08089  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.06526              0.07935              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06242              0.92016              0.13201  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04077              0.11565              0.14770  
##     best_dir_winyes        top200_boxyes  
##             0.06701              0.04876

AS imdb rating posterior inclusion of possibility of 1, critics score and runtime also has high probability.

summary(bayesian_model)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.05876309     0.0000     0.0000000     0.0000000
## dramayes               0.04508592     0.0000     0.0000000     0.0000000
## runtime                0.51399873     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.16498090     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.08088668     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.06525993     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.07935408     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06242230     0.0000     0.0000000     0.0000000
## critics_score          0.92015573     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13200908     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04076727     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11565473     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14770126     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06701259     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04875994     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1558     0.1358000     0.0319000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2039916     0.1851908
## PostProbs               0.0318000     0.0289000
## R2                      0.7496000     0.7495000
## dim                     5.0000000     5.0000000
## logmarg             -3436.3416317 -3436.4383237

We see that the best model includes intercept, imdb_rating, critics_score and runtime, and we need to visualize the results. Summary command gives us both the posterior model inclusion probability for each variable and the most probable models. For example, the posterior probability that " mpaa_rating_Ryes" is included in the model is 0.1649. Further, the most likely model, which has posterior probability of 0.1558 . By looking at the image, we can see that the most probable model is the one with only intercept.

image(bayesian_model, rotate = F)

imdb_rating, critics_score are inlcuded in the at least over half of the model. I will apply these 3 variables to build the final model.

final_model<- bas.lm(data=na.omit(movies),audience_score ~ imdb_rating+critics_score+runtime,
                      modelprior = uniform(),
                      prior="ZS-null",
                      method = "MCMC")
summary(final_model)

##               P(B != 0 | Y)  model 1     model 2      model 3      model 4
## Intercept           1.00000   1.0000   1.0000000   1.00000000   1.00000000
## imdb_rating         1.00000   1.0000   1.0000000   1.00000000   1.00000000
## critics_score       0.70625   1.0000   1.0000000   0.00000000   0.00000000
## runtime             0.50625   1.0000   0.0000000   0.00000000   1.00000000
## BF                       NA   1.0000   0.9928814   0.08035048   0.09635143
## PostProbs                NA   0.4000   0.3062000   0.18750000   0.10620000
## R2                       NA   0.7483   0.7455000   0.74050000   0.74360000
## dim                      NA   4.0000   3.0000000   2.00000000   3.00000000
## logmarg                  NA 413.8820 413.8748206 411.36060745 411.54221157
##                     model 5
## Intercept      1.000000e+00
## imdb_rating    0.000000e+00
## critics_score  0.000000e+00
## runtime        0.000000e+00
## BF            1.792035e-180
## PostProbs      0.000000e+00
## R2             0.000000e+00
## dim            1.000000e+00
## logmarg        0.000000e+00

The command summary for the final model is that the most likely model, which has posterior probability of 0712, includes only an intercept.

par(mfrow=c(1,2))
plot(final_model)

confidence intervals from the model

coef.final <- coef(final_model)
coef.final

## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  5 models 
##                post mean  post SD   post p(B != 0)
## Intercept      62.21002    0.40988   1.00000      
## imdb_rating    15.20990    0.88339   1.00000      
## critics_score   0.05387    0.03948   0.70625      
## runtime        -0.02961    0.03332   0.50625

confint(coef.final)

##                      2.5%      97.5%        beta
## Intercept     61.39631100 62.9962049 62.21001616
## imdb_rating   13.67515717 16.8419525 15.20989730
## critics_score  0.00000000  0.1089549  0.05386582
## runtime       -0.08821393  0.0000000 -0.02961401
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 5: Prediction

The predictive capability of bayesian model averaging was tested using data for a movie that was not included in the analysis data set. The movie chosen was Moana which was released early in 2016. An early release date was chosen to give time for the movie to have been in the theaters for a while and for the ratings to mature. The information for this movie was obtained from the IMDb and Rotten Tomatoes web sites in order to be consistent with the analysis data (Deadpool on IMDb and Deadpool on Rotten Tomatoes).

The results are shown below.

#Create a data frame with the predictor variable data for the movie rating to
#be predicted.

Moana <- data.frame(audience_score= 91, critics_score=85, runtime=103,imdb_rating = 7.9)
predict_score <- predict(final_model,Moana,estimator = "BMA")$fit
predict_score

##          [,1]
## [1,] 85.31262

The true audience score for the movie (taken from the IMDb web site) is 91. The model prediction is 85 with a 95% prediction interval

Part 6: Conclusion

In this analysis, linear regression model using bayesian model averaging was created that proved to have some capability for predicting movie popularity based on certain movie characteristics. We can clearly see that drama and Feature Film are more popular than other films. Moreover, imdb_rating has the highest posterior probability, and critic_score and runtime are also important. In my bayesian model, these three varibles are important predictors on audience_score.

References

1 https://en.wikipedia.org/wiki/Rotten_Tomatoes

2 https://en.wikipedia.org/wiki/IMDb

3 https://www.rottentomatoes.com/

4 http://www.imdb.com/