Setup

Load packages

library(tidyverse)
library(statsr)
library(BAS)
library(lubridate)
library(GGally)
library(reshape2)

Load data

load("movies.Rdata")

Part 1: Data

The dataset comprises 651 randomly sampled movies produced and released before 2016. Since random sampling was used, we can expect our results to generalize to that population. However, random assignment was not used, so the results cannot support causal conclusions.


Part 2: Data manipulation

  1. Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no (2 pt)
  2. Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no (2 pt)
  3. Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no (2 pt)
  4. Create two new variables based on thtr_rel_month: i) New variable called oscar_season with levels yes (if movie is released in November, October, or December) and no (2 pt) ii) New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no (2 pt)
# Month names that define the Oscar and summer release seasons
oscar <- c("November", "October", "December")
summer <- c("May", "June", "July", "August")
# Convert the numeric release month (1-12) to its month name
movies$thtr_rel_month <- month.name[movies$thtr_rel_month]
movies <- movies %>%
        mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"),
               drama = ifelse(genre == "Drama", "yes", "no"),
               mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"),
               oscar_season = ifelse(thtr_rel_month %in% oscar, "yes", "no"),
               summer_season = ifelse(thtr_rel_month %in% summer, "yes", "no"))
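As a quick sanity check (a minimal sketch, not part of the graded output), we can tabulate the yes/no counts of each new variable; since every variable has exactly two levels, sapply collapses the per-column tables into one matrix:

# Count "yes"/"no" in each newly created variable
sapply(select(movies, feature_film, drama, mpaa_rating_R,
              oscar_season, summer_season), table)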

Part 3: Exploratory data analysis

Perform exploratory data analysis (EDA) of the relationship between audience_score and the new variables constructed in the previous part. Your EDA should contain numerical summaries and visualizations. This might mean you initially create a lot more visualizations and summary statistics than what you finally choose to include in your paper. Each R output and plot should be accompanied by a brief interpretation.

# Reshape to long format so all five yes/no variables share one faceted plot
df <- select(movies, audience_score, feature_film, drama, mpaa_rating_R,
             oscar_season, summer_season)
dfmelt <- melt(df, measure.vars = 2:6)
ggplot(dfmelt, aes(x = value, y = audience_score, fill = variable)) +
  geom_boxplot() +
  facet_grid(. ~ variable) +
  labs(x = "Type", y = "Audience Score")

dfmelt %>%
        group_by(variable,value) %>%
        summarize(avg_rating = mean(audience_score))
## # A tibble: 10 x 3
## # Groups:   variable [?]
##         variable value avg_rating
##           <fctr> <chr>      <dbl>
##  1  feature_film    no   81.05000
##  2  feature_film   yes   60.46531
##  3         drama    no   59.73121
##  4         drama   yes   65.34754
##  5 mpaa_rating_R    no   62.68944
##  6 mpaa_rating_R   yes   62.04255
##  7  oscar_season    no   61.81304
##  8  oscar_season   yes   63.68586
##  9 summer_season    no   62.62302
## 10 summer_season   yes   61.80769

Looking at both the summary table and the boxplots, we can see that, with the exception of feature_film, these categorical variables have little to no effect on audience_score. This is not surprising: feature films released in theaters are judged much more harshly than documentaries and TV movies, whose audiences are more self-selected. One other interesting statistic is the score for dramas, which is somewhat higher than that of the other genres.
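Group means can mask small groups; as a complementary sketch, the group sizes, medians, and standard deviations below help judge whether a difference in means is meaningful:

dfmelt %>%
        group_by(variable, value) %>%
        summarize(n = n(),
                  median_score = median(audience_score),
                  sd_score = sd(audience_score))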


Part 4: Modeling

# Full model: Bayesian model averaging over all 16 candidate predictors
df2 <- select(movies, audience_score, feature_film, drama, runtime,
              mpaa_rating_R, thtr_rel_year, oscar_season, summer_season,
              imdb_rating, imdb_num_votes, critics_score, best_pic_nom,
              best_pic_win, best_actor_win, best_actress_win, best_dir_win,
              top200_box)
fit1 <- bas.lm(audience_score ~ ., data = df2,
               prior = "BIC",
               modelprior = uniform())
## Warning in bas.lm(audience_score ~ ., data = df2, prior = "BIC", modelprior
## = uniform()): dropping 1 rows due to missing data
fit1
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = df2, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06537              0.04320  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.46971              0.19984              0.09069  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07506              0.08042              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.05774              0.88855              0.13119  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.03985              0.14435              0.14128  
##     best_dir_winyes        top200_boxyes  
##             0.06694              0.04762
summary(fit1)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127

Printing the model object and calling summary gives us both the marginal posterior inclusion probability for each variable and the most probable models. The posterior probability that mpaa_rating_R is included in the model is about 0.2, which is much higher than for the other categorical variables we created in the EDA. However, any model that includes these variables has a very low posterior probability.

The most likely model, which has a posterior probability of 0.1297, includes only the intercept, runtime, imdb_rating, and critics_score. This model is not much different from the model we would obtain with frequentist linear regression, which is expected given our choice of the BIC prior and a uniform model prior. While a posterior probability of 0.1297 sounds small, it is much larger than the uniform prior probability assigned to each model, since with 16 candidate predictors there are \(2^{16}\) possible models.
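One optional way to inspect the model space (a sketch using BAS's image method; not part of the original output) is to shade which variables enter each of the top models, ordered by posterior probability:

# Columns are the top models ranked by posterior probability;
# shaded cells mark the variables included in each model
image(fit1, rotate = FALSE)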

Thus, let us fit the final model using the three variables above.

# Final model: refit using only the three variables with high inclusion probability
df3 <- select(df2, runtime, imdb_rating, critics_score, audience_score)
fit2 <- bas.lm(audience_score ~ ., df3,
               prior = "BIC",
               modelprior = uniform())
## Warning in bas.lm(audience_score ~ ., df3, prior = "BIC", modelprior =
## uniform()): dropping 1 rows due to missing data
fit2
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = df3, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##     Intercept        runtime    imdb_rating  critics_score  
##        1.0000         0.5102         1.0000         0.9051
summary(fit2)
##               P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.0000000     1.0000     1.0000000     1.0000000
## runtime           0.5101554     1.0000     0.0000000     1.0000000
## imdb_rating       1.0000000     1.0000     1.0000000     1.0000000
## critics_score     0.9050549     1.0000     1.0000000     0.0000000
## BF                       NA     1.0000     0.9968489     0.1255707
## PostProbs                NA     0.4532     0.4518000     0.0569000
## R2                       NA     0.7549     0.7525000     0.7509000
## dim                      NA     4.0000     3.0000000     3.0000000
## logmarg                  NA -3615.2791 -3615.2822108 -3617.3539409
##                     model 4       model 5
## Intercept      1.000000e+00  1.000000e+00
## runtime        0.000000e+00  0.000000e+00
## imdb_rating    1.000000e+00  0.000000e+00
## critics_score  0.000000e+00  1.000000e+00
## BF             8.390939e-02  1.012980e-99
## PostProbs      3.800000e-02  0.000000e+00
## R2             7.481000e-01  4.958000e-01
## dim            2.000000e+00  2.000000e+00
## logmarg       -3.617757e+03 -3.843222e+03

We want to verify our final model using diagnostic plots. The residuals vs. fitted values plot shows no obvious pattern. Moreover, the inclusion probabilities of all variables are above 0.5 (near 1).

# which = 1: residuals vs. fitted; which = 2: cumulative model probabilities
par(mfrow = c(1, 2))
plot(fit2, which = c(1, 2))

# which = 3: model dimension vs. log marginal; which = 4: inclusion probabilities
plot(fit2, which = c(3, 4))

Final Model:

audience_score = 62.35 - 0.0276*runtime + 14.96*imdb_rating + 0.065*critics_score

Thus, ceteris paribus, a 10-minute increase in runtime results in a 0.276-point decrease in audience score on average. Likewise, ceteris paribus, a one-unit increase in imdb_rating is expected to increase the audience score by 14.96 points on average, and a one-point increase in critics_score corresponds to a 0.065-point increase on average.

We also have credible intervals for each coefficient. Since these are Bayesian credible intervals, there is a 95% probability that the true coefficient lies within the interval (from the lower to the upper bound).

confint(coefficients(fit2))
##                      2.5%      97.5%        beta
## Intercept     61.57343735 63.1011325 62.34769231
## runtime       -0.08212438  0.0000000 -0.02755783
## imdb_rating   13.71056784 16.5571262 14.96315740
## critics_score  0.00000000  0.1056181  0.06498060
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 5: Prediction

Let's pick Deadpool (2016) to test our model.

# Deadpool: runtime 103 min, IMDb rating 8.0, critics score 84
deathpool <- data.frame(runtime = 103, imdb_rating = 8, critics_score = 84)
pre1 <- predict(fit2, deathpool, estimator = "BPM", se.fit = TRUE)  # Best Predictive Model
pre2 <- predict(fit2, deathpool, estimator = "MPM", se.fit = TRUE)  # Median Probability Model
## Warning in bas.lm(eval(object$call$formula), data = eval(object$call
## $data), : dropping 1 rows due to missing data
pre1$fit
## [1] 86.68242
## attr(,"model")
## [1] 0 2
## attr(,"best")
## [1] 4
pre2$fit
## [1] 86.9505
## attr(,"model")
## [1] 0 1 2 3
## attr(,"best")
## [1] 1

Our model predicts an audience score of roughly 86-87 points (there is not much difference between the two estimators). This is a very good prediction: the real audience score is 90.
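Since we requested se.fit = TRUE, we can also obtain a 95% credible interval around the prediction itself (a sketch; output not shown here):

# Credible interval for a new observation under the Best Predictive Model
confint(pre1, parm = "pred")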

Part 6: Conclusion

We conclude that, based on the given data, critics_score, runtime, and imdb_rating have, on average, the largest effects on the audience score of a movie.

We created several new variables and ran a regression on all of them. However, most of them had little effect on audience score. We chose our final model based on the highest posterior probability.

One shortcoming of this analysis is the lack of an informative prior. Our paper would be greatly enhanced by further research into a better prior model for the regression. Moreover, there may be reverse causality between audience score and critics score, which we should take into account.