Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(GGally)

Load data

load("movies.Rdata")

Introduction

This report is presented to Paramount Studios to illustrate the potential of Bayesian regression to predict a movie’s audience score on Rotten Tomatoes.


Part 1: Data

Data Overview

The movies data consist of 651 rows of movies randomly sampled from IMDb and/or Rotten Tomatoes, including one movie that is listed twice. The exact sampling method is unknown, but it is known to be random. The sample appears biased toward more recent movies: the data do not include a single film released before 1970.
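
These observations can be verified directly in the data (a quick check; output omitted):

# Verify the observations above
nrow(movies)                       # 651 rows
sum(duplicated(movies$imdb_url))   # one duplicated listing
min(movies$thtr_rel_year)          # earliest theatrical release year (1970)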

With that bias in mind, this report focuses on “contemporary movies,” or those with theatrical release in year 1999 or later. With the caveats already mentioned, the conclusions herein are generalizable to all contemporary movies listed on IMDb and Rotten Tomatoes.

This study is purely observational and one cannot draw conclusions about causality.

Data Cleaning

Two minor glitches in the data are repaired below.

A duplicate movie is removed:

# Remove duplicate movies ("Man on Wire")
movies <- 
  distinct(movies, imdb_url, .keep_all=TRUE)

Missing runtime is filled in:

movies[movies$title=="The End of America",]$runtime <- 74
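
For reference, the affected row can be located by filtering for missing runtimes (a quick check, run before the fix above; output omitted):

# Locate rows with a missing runtime
movies %>%
  filter(is.na(runtime)) %>%
  select(title, thtr_rel_year, runtime)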

Part 2: Research question

Given 16 explanatory variables, how well can Bayesian regression predict the audience_score of a movie on Rotten Tomatoes? We are interested in this question because the instructions are super-clear that we must answer it with this project.

Part 3: Data manipulation

The candidate explanatory variables are listed below. Most are already included in movies; those flagged with “**” are created in the code that follows.

Variable Description
feature_film** “yes” if title_type is Feature Film, “no” otherwise
drama** “yes” if genre is Drama, “no” otherwise
runtime Duration in minutes
mpaa_rating_R** “yes” if mpaa_rating is R, “no” otherwise
thtr_rel_year Year of theatrical release
oscar_season** “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
summer_season** “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
imdb_rating Average rating on IMDb
imdb_num_votes Number of votes from IMDb
critics_score Average critics’ score on Rotten Tomatoes
best_pic_nom Did movie receive best picture Oscar nomination
best_pic_win Did movie win best picture Oscar
best_actor_win Did any of the main actors ever win an Oscar
best_actress_win Did any of the main actresses ever win an Oscar
best_dir_win Did the director ever win an Oscar
top200_box Did movie achieve top 200 box office all time

In the code below, new variables are created as needed, and unnecessary variables are removed from movies.

movies <- movies %>% 
  mutate(
    feature_film = ifelse(title_type=="Feature Film",
                        "Yes",
                        "No"),
    drama = ifelse(genre=="Drama",
                 "Yes",
                 "No"),
    mpaa_rating_R = ifelse(mpaa_rating=="R",
                         "Yes",
                         "No"),
    oscar_season = ifelse(thtr_rel_month>=10 & thtr_rel_month<=12,
                        "Yes",
                        "No"),
    summer_season = ifelse(thtr_rel_month>=5 & thtr_rel_month<=8,
                         "Yes",
                         "No")) %>%
  select(title,
         runtime,
         thtr_rel_year,
         imdb_rating,
         imdb_num_votes,
         critics_score,
         audience_score,
         best_pic_nom,
         best_pic_win,
         best_actor_win,
         best_actress_win,
         best_dir_win,
         top200_box,
         feature_film,
         drama,
         mpaa_rating_R,
         oscar_season,
         summer_season)
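
As a quick sanity check (not part of the original output), we can confirm that each derived variable contains only “Yes”/“No” values:

# Tabulate each derived indicator variable
sapply(movies[c("feature_film", "drama", "mpaa_rating_R",
                "oscar_season", "summer_season")], table)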

Part 4: Exploratory data analysis

Before we conduct Bayesian Model Averaging with all 16 variables, we explore a few notable explanatory variables:

(1) Imdb_rating is an extremely powerful predictor of audience_score. This seems pretty obvious, since these two variables measure almost exactly the same thing, just from different websites. It feels like cheating to use imdb_rating as an explanatory variable. But the instructions are clear!

movies %>%
  ggplot(aes(x = imdb_rating, y = audience_score)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle("Rotten Tomatoes Audience Score vs IMDb Rating")

It appears the correlation degrades for very low values of imdb_rating. Perhaps if a movie is bad enough to score less than 5 out of 10, then the exact score no longer matters that much. We will revisit this issue when we test the validity of our final model.
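
As a supplementary check (not part of the original output), the strength of this linear relationship can be quantified directly:

# Correlation between IMDb rating and Rotten Tomatoes audience score
cor(movies$imdb_rating, movies$audience_score)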

(2) Critics_score is a good predictor of audience_score. Critics’ ratings are not quite as predictive as IMDb users’ ratings, but they are still useful.

movies %>%
  ggplot(aes(x = critics_score, y = audience_score)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle("Rotten Tomatoes Audience Score vs Critics Score")
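
Although GGally is loaded in the Setup section, it is not otherwise used in this report; a pairwise overview of the three rating variables could summarize these relationships in a single figure (a supplementary sketch, output not shown):

# Pairwise scatterplots and correlations for the rating variables
movies %>%
  select(audience_score, imdb_rating, critics_score) %>%
  ggpairs()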

(3) Thtr_rel_year is a fair predictor of audience_score. Audience scores appear to decline slightly with each passing year of release.

movies %>%
  ggplot(aes(x = thtr_rel_year, y = audience_score)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle("Rotten Tomatoes Audience Score vs Year of Theatrical Release")

The scatterplot above shows that the earliest movie in the data was released in 1970. It also appears that the sample is biased toward including more recent movies. Looking at the distribution of thtr_rel_year below confirms that recent years dominate the data:

movies %>%
  ggplot(aes(x = thtr_rel_year)) +
  geom_histogram(binwidth = 1) +
  ggtitle("Distribution of Year of Theatrical Release")

In contrast to the above distribution, some quick research shows that the number of studio-released films has actually been in decline for at least 20 years. We will handle the sample bias by filtering out older movies entirely and focusing solely on movies released after 1998. Our reasons are: (1) this is essentially the data we have; (2) we are more interested in understanding contemporary movie popularity than historical popularity; (3) we expect recent movies to be better than old movies at explaining contemporary movie popularity; and (4) our data source, Rotten Tomatoes, didn’t exist until 1998, and we expect its data to be more reliable for movies released after that date.

movies <- movies %>% filter(thtr_rel_year>1998)
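
After filtering, 353 movies remain, which is consistent with the residual degrees of freedom reported in the models below (351 = 353 - 2 for the one-variable model and 336 = 353 - 17 for the full model):

# Confirm the size of the working dataset
nrow(movies)   # 353 movies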

Part 5: Modeling

Before proceeding with Bayesian regression, let’s check the full model using all possible variables:

model_full <- lm(audience_score ~ . -title -audience_score, data=movies)
summary(model_full)
## 
## Call:
## lm(formula = audience_score ~ . - title - audience_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.929  -6.664  -0.226   5.803  49.238 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.642e+02  2.696e+02   2.834  0.00487 ** 
## runtime             -6.450e-02  3.864e-02  -1.669  0.09599 .  
## thtr_rel_year       -3.889e-01  1.341e-01  -2.901  0.00396 ** 
## imdb_rating          1.277e+01  8.077e-01  15.811  < 2e-16 ***
## imdb_num_votes       1.368e-05  5.731e-06   2.387  0.01752 *  
## critics_score        8.997e-02  3.182e-02   2.828  0.00497 ** 
## best_pic_nomyes      3.963e+00  4.084e+00   0.970  0.33259    
## best_pic_winyes      2.614e+00  8.225e+00   0.318  0.75085    
## best_actor_winyes   -8.451e-01  1.734e+00  -0.487  0.62637    
## best_actress_winyes -2.948e+00  1.835e+00  -1.606  0.10920    
## best_dir_winyes     -6.068e+00  3.117e+00  -1.947  0.05239 .  
## top200_boxyes       -1.209e+00  5.018e+00  -0.241  0.80978    
## feature_filmYes     -4.384e+00  2.235e+00  -1.962  0.05062 .  
## dramaYes             2.400e+00  1.258e+00   1.908  0.05721 .  
## mpaa_rating_RYes    -1.499e+00  1.162e+00  -1.290  0.19794    
## oscar_seasonYes     -7.179e-01  1.389e+00  -0.517  0.60563    
## summer_seasonYes     5.446e-01  1.312e+00   0.415  0.67844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.2 on 336 degrees of freedom
## Multiple R-squared:  0.7562, Adjusted R-squared:  0.7446 
## F-statistic: 65.13 on 16 and 336 DF,  p-value: < 2.2e-16

The R2 is 0.76. Below we see its BIC is 2729.

BIC(model_full)
## [1] 2729.262

For comparison, below we check the simplest reasonable linear model, using only imdb_rating. We see R2 is 0.73 (slightly worse than the full model) and BIC is 2683 (slightly better). It’s a very good model for having just one variable.

model_imdb <- lm(audience_score ~ imdb_rating, data=movies)
summary(model_imdb)
## 
## Call:
## lm(formula = audience_score ~ imdb_rating, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.015  -6.658   0.819   5.858  50.367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.4238     3.2782  -11.11   <2e-16 ***
## imdb_rating  15.1591     0.4975   30.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.58 on 351 degrees of freedom
## Multiple R-squared:  0.7257, Adjusted R-squared:  0.7249 
## F-statistic: 928.4 on 1 and 351 DF,  p-value: < 2.2e-16
BIC(model_imdb)
## [1] 2682.923

Bayesian Model Averaging

The code below uses the BAS library to enumerate and evaluate all possible models. With 16 candidate variables, there are 2^16, or roughly 65,000, models to test. We use BIC as the prior on the regression coefficients and a uniform distribution as the prior over all possible models. The model rank image below shows that, according to the resulting posterior probability distribution, the most probable model uses just 2 variables: imdb_rating and critics_score.

movie_bas <- bas.lm(audience_score ~ . -audience_score,
                    prior="BIC", 
                    modelprior = uniform(),
                    data = movies %>% select(-title))
image(movie_bas, rotate = FALSE)

Below we see that the BIC for the top model is 2677, just a bit better than the one-variable model with imdb_rating.

model_top <- lm(audience_score ~ 
                  imdb_rating +
                  critics_score,
                data = movies)
BIC(model_top)
## [1] 2677.346

With the summary table of the top 5 models below, we can see from the Bayes factors (BF) that the differences in posterior probability among the top 4 models (ranging from about 4% to 9%) are “not worth more than a bare mention”. For models 5 and beyond, the evidence that they are less probable than the top model grows stronger.

round(summary(movie_bas),3)
##                     P(B != 0 | Y)   model 1   model 2   model 3   model 4
## Intercept                   1.000     1.000     1.000     1.000     1.000
## runtime                     0.173     0.000     0.000     0.000     0.000
## thtr_rel_year               0.409     0.000     0.000     1.000     1.000
## imdb_rating                 1.000     1.000     1.000     1.000     1.000
## imdb_num_votes              0.132     0.000     0.000     0.000     0.000
## critics_score               0.933     1.000     1.000     1.000     1.000
## best_pic_nomyes             0.085     0.000     0.000     0.000     0.000
## best_pic_winyes             0.067     0.000     0.000     0.000     0.000
## best_actor_winyes           0.065     0.000     0.000     0.000     0.000
## best_actress_winyes         0.193     0.000     0.000     0.000     0.000
## best_dir_winyes             0.437     0.000     1.000     1.000     0.000
## top200_boxyes               0.052     0.000     0.000     0.000     0.000
## feature_filmYes             0.197     0.000     0.000     0.000     0.000
## dramaYes                    0.095     0.000     0.000     0.000     0.000
## mpaa_rating_RYes            0.108     0.000     0.000     0.000     0.000
## oscar_seasonYes             0.072     0.000     0.000     0.000     0.000
## summer_seasonYes            0.055     0.000     0.000     0.000     0.000
## BF                             NA     1.000     0.917     0.551     0.504
## PostProbs                      NA     0.086     0.078     0.047     0.043
## R2                             NA     0.734     0.739     0.742     0.738
## dim                            NA     3.000     4.000     5.000     4.000
## logmarg                        NA -1870.286 -1870.373 -1870.883 -1870.972
##                       model 5
## Intercept               1.000
## runtime                 0.000
## thtr_rel_year           0.000
## imdb_rating             1.000
## imdb_num_votes          0.000
## critics_score           1.000
## best_pic_nomyes         0.000
## best_pic_winyes         0.000
## best_actor_winyes       0.000
## best_actress_winyes     1.000
## best_dir_winyes         0.000
## top200_boxyes           0.000
## feature_filmYes         0.000
## dramaYes                0.000
## mpaa_rating_RYes        0.000
## oscar_seasonYes         0.000
## summer_seasonYes        0.000
## BF                      0.259
## PostProbs               0.022
## R2                      0.737
## dim                     4.000
## logmarg             -1871.636

The first column of the above table shows marginal posterior inclusion probabilities for each variable. These are easier to compare in the chart below.

plot(movie_bas, which=4, sub.caption = "")

Two variables have posterior inclusion probabilities much higher than the others: imdb_rating and critics_score. These two variables are included in all of the top models. Two other variables with moderately high inclusion probabilities are thtr_rel_year and best_dir_win. There are four possible ways to include or exclude these two variables, and those four combinations are exactly the top 4 models.

Below we plot the distributions for the coefficients of all the variables just mentioned. The first row of charts shows the top two variables, which are very probably in the model. The second row shows the next two variables, which have lower probabilities of being in the model.

coef_movies <- coef(movie_bas)
par(mfrow = c(2,2))
plot(coef_movies, subset=c(4,6,3,11), ask=FALSE)
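
Numeric credible intervals for the coefficients can also be extracted from the same object (a supplementary sketch; output omitted):

# 95% credible intervals for the coefficients under BMA
par(mfrow = c(1, 1))   # reset the plotting layout
confint(coef_movies)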

Check Model Assumptions

The Bayesian model specification assumes that the errors are normally distributed with constant variance. As with the frequentist approach, we check this assumption by examining the distribution of the model’s residuals.

Which specific model should we test? We will test the same model that we will use for prediction in the next section: Bayesian Model Averaging (BMA). This is essentially a weighted average of all 2^16 candidate linear models, each defined by one possible subset of the 16 variables, with weights given by the posterior model probabilities derived above.
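
Concretely, the BMA prediction for an observation is the average of the predictions from every candidate model M_k, weighted by that model’s posterior probability: y_hat_BMA = sum over k of P(M_k | data) * y_hat_k.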

Alternatively, we might have chosen the highest probability model (HPM), the median probability model (MPM), or the best predictive model (BPM). We choose BMA because our goal is not to produce a single “best” model but rather to demonstrate the potential of predicting a movie’s audience score with the given data.
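
For reference, BAS can also produce fitted values under these alternative estimators; a minimal sketch (not part of the original analysis, output omitted):

# Fitted values under the alternative estimators
fit_hpm <- fitted(movie_bas, estimator = "HPM")   # highest probability model
fit_mpm <- fitted(movie_bas, estimator = "MPM")   # median probability model
fit_bpm <- fitted(movie_bas, estimator = "BPM")   # best predictive model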

Testing BMA begins with computing the residuals of this model:

BMA <- data.frame(
  obs=movies$audience_score,
  fit=fitted(movie_bas,estimator = "BMA")
  ) %>%
  mutate(res=obs-fit)

Residuals are normally distributed: The histogram and Q-Q plot show an essentially normal distribution of residuals, with some right skew that we will discuss soon.

hist(BMA$res)

qqnorm(BMA$res)
qqline(BMA$res)

Constant variability of residuals: Below we plot the absolute residuals against the fitted values and observe a distinct fan shape. From left to right, the residuals tend to shrink toward zero. The large residuals on the left side of the fan come from the same points that produce the right skew in the histogram and Q-Q plot above.

plot(abs(BMA$res) ~ BMA$fit)

Larger residuals for lower-rated movies: The large residuals that cause the right skew and the fan shape above come from the lower-rated movies in the data. We illustrate this with the chart below, which plots residuals against imdb_rating. Movies with an imdb_rating below 5 generally have positive residuals that are larger in magnitude than the residuals for higher-rated movies. Intuitively, this suggests that a very bad rating is less precise than a good rating: if IMDb rates a movie 7.5, Rotten Tomatoes probably scores it in the 70s, but if IMDb rates a movie 3.5, Rotten Tomatoes might score it anywhere between 10 and 60.

plot(BMA$res ~ movies$imdb_rating)

Strictly speaking, this phenomenon does somewhat compromise the validity of our model: it is the root cause of the right skew and the non-constant variability shown above. In future research, we recommend that low-scoring movies be modeled separately from other movies in order to address this issue.


Part 6: Prediction

We use BMA to predict the audience score of a movie released in 2016 (and therefore not in the dataset). The movie is Sully, with data taken from IMDb and Rotten Tomatoes:

sully <- data.frame(
  runtime = 96,
  thtr_rel_year = 2016,
  imdb_rating = 8.3,
  imdb_num_votes = 250796,
  critics_score = 85,
  audience_score = 84,
  best_pic_nom = "no",
  best_pic_win = "no",
  best_actor_win = "yes",
  best_actress_win = "no",
  best_dir_win = "yes",
  top200_box = "no",
  feature_film = "Yes",
  drama = "Yes",
  mpaa_rating_R = "No",
  oscar_season = "Yes",
  summer_season = "No")

Based on BMA, we would predict the audience score for Sully to be 85, as shown below. The actual audience score is 84. Extremely close! By comparison, the 95% credible prediction interval is huge, spanning from 63 to 107 (a range that even extends beyond the maximum possible score of 100).

pred_sully <- predict(movie_bas, newdata = sully, estimator = "BMA", se.fit = TRUE)
confint(pred_sully, parm="pred")
##          2.5%    97.5%     pred
## [1,] 63.04103 107.1151 85.05959
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 7: Conclusion

BMA is quite useful for predicting the audience score of a movie on Rotten Tomatoes, especially when using the IMDb rating as one of the explanatory variables. Relying on IMDb ratings was probably the single biggest limitation of this study. It would be much more useful to predict audience score based on information that is available before theatrical release, or better yet before marketing and distribution. We recommend additional research.

Our research also suggests that new models may be necessary to accurately predict very low ratings. We hypothesize that, independent of the movie business, rating scales generally elicit thoughtful responses in the top half of the scale, and anything relegated to the bottom half of the scale is roughly categorized as “failing.” This could be a topic for social psychologists to consider.