Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

load("movies.Rdata")

Part 1: Data

We will investigate one question using the “movies” dataset. This dataset includes information from Rotten Tomatoes and IMDB and is comprised of 651 randomly sampled movies produced and released before 2016. The results are generalizable to movies of all genres in US, as this is an observational study that uses random sampling. Since there is no random assignment, we cannot make causal conclusions. As there are no volunteers employed we can exclude the possibility of voluntary response bias. There is not non-response bias as well. However, there might coveniece bias since the movies included are American productions. * * *

Part 2: Data manipulation

We will begin by excluding all NAs from our data.

nonamovies = na.omit(movies)

Now, we will create the new variables needed.

nonamovies <- nonamovies %>% 
  mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"), drama = ifelse(genre == "Drama", "yes", "no"), mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"), oscar_season = ifelse(thtr_rel_month %in% c("10", "11","12"), "yes", "no"), summer_season = ifelse(thtr_rel_month %in% c("5", "6","7", "8"), "yes", "no"))

Part 3: Exploratory data analysis

We will procceed with some plots and summaries of the above mentioned variables with the variable “audience score”.

ggplot(nonamovies, aes(x = factor(feature_film), y = audience_score)) +
  geom_boxplot() + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10))

nonamovies %>%
  group_by(feature_film) %>%
  summarise(mean_as = mean(audience_score), med_as = median(audience_score))

## # A tibble: 2 x 3
##   feature_film mean_as med_as
##   <chr>          <dbl>  <dbl>
## 1 no              82.5     86
## 2 yes             60.6     63

ggplot(nonamovies, aes(x = factor(drama), y = audience_score)) +
  geom_boxplot() + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10))

nonamovies %>%
  group_by(drama) %>%
  summarise(mean_as = mean(audience_score), med_as = median(audience_score))

## # A tibble: 2 x 3
##   drama mean_as med_as
##   <chr>   <dbl>  <dbl>
## 1 no       59.4     59
## 2 yes      65.3     70

ggplot(nonamovies, aes(x = factor(mpaa_rating_R), y = audience_score)) +
  geom_boxplot() + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10))

nonamovies %>%
  group_by(mpaa_rating_R) %>%
  summarise(mean_as = mean(audience_score), med_as = median(audience_score))

## # A tibble: 2 x 3
##   mpaa_rating_R mean_as med_as
##   <chr>           <dbl>  <dbl>
## 1 no               62.0     65
## 2 yes              62.4     65

ggplot(nonamovies, aes(x = factor(oscar_season), y = audience_score)) +
  geom_boxplot() + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10))

nonamovies %>%
  group_by(oscar_season) %>%
  summarise(mean_as = mean(audience_score), med_as = median(audience_score))

## # A tibble: 2 x 3
##   oscar_season mean_as med_as
##   <chr>          <dbl>  <dbl>
## 1 no              61.5   63.5
## 2 yes             63.9   69

ggplot(nonamovies, aes(x = factor(summer_season), y = audience_score)) +
  geom_boxplot() + coord_flip() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=10))

nonamovies %>%
  group_by(summer_season) %>%
  summarise(mean_as = mean(audience_score), med_as = median(audience_score))

## # A tibble: 2 x 3
##   summer_season mean_as med_as
##   <chr>           <dbl>  <dbl>
## 1 no               62.4     65
## 2 yes              61.9     64

As we can see in summary statistics of the variables used for our research question, for all variables (and their corresponding yes/no categories) the values of the median and the mean are fairly close to one another.

when it comes to the plots, we can notice major differences for the variables “feature film” and “drama” and minor ones for “mpaa_rating_R”, “oscar_season” and “summer_season”.

Part 4: Modeling

We will develop a Bayesian regression model using Bayesian model averaging (BMA) to predict “audience_score”" from the explanatory variables: “feature_film”, “drama”, “runtime”, “mpaa_rating_R”, “thtr_rel_year”, “oscar_season”, “summer_season”, “imdb_rating”, “imdb_num_votes”, “critics_score”, “best_pic_nom”, “best_pic_win”, “best_actor_win”, “best_actress_win”, “best_dir_win”, “top200_box”.

nonamovies_red = nonamovies %>%
  select(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box) 
bma_audsco <- bas.lm(audience_score ~ . -audience_score, data = nonamovies_red,
                   prior = "BIC", 
                   modelprior = uniform())
bma_audsco

## 
## Call:
## bas.lm(formula = audience_score ~ . - audience_score, data = nonamovies_red, 
##     prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.05876              0.04509  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.51400              0.16498              0.08089  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.06526              0.07935              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06242              0.92016              0.13201  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04077              0.11565              0.14770  
##     best_dir_winyes        top200_boxyes  
##             0.06701              0.04876

summary(bma_audsco)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.05876309     0.0000     0.0000000     0.0000000
## dramayes               0.04508592     0.0000     0.0000000     0.0000000
## runtime                0.51399873     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.16498090     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.08088668     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.06525993     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.07935408     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06242230     0.0000     0.0000000     0.0000000
## critics_score          0.92015573     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13200908     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04076727     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11565473     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14770126     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06701259     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04875994     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1558     0.1358000     0.0319000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2039916     0.1851908
## PostProbs               0.0318000     0.0289000
## R2                      0.7496000     0.7495000
## dim                     5.0000000     5.0000000
## logmarg             -3436.3416317 -3436.4383237

The most likely model, which has posterior probability of 0.1558, includes an intercept, runtime, imdb_rating and critics_score is the one we select. While a posterior probability of 0.1558 is small, it is much larger than the uniform prior probability assigned to it, since there are \(2^{17}\) possible models.

plot(bma_audsco, which=1)

round(summary(bma_audsco), 3)

##                     P(B != 0 | Y)   model 1   model 2   model 3   model 4
## Intercept                   1.000     1.000     1.000     1.000     1.000
## feature_filmyes             0.059     0.000     0.000     0.000     0.000
## dramayes                    0.045     0.000     0.000     0.000     0.000
## runtime                     0.514     1.000     0.000     0.000     1.000
## mpaa_rating_Ryes            0.165     0.000     0.000     0.000     1.000
## thtr_rel_year               0.081     0.000     0.000     0.000     0.000
## oscar_seasonyes             0.065     0.000     0.000     0.000     0.000
## summer_seasonyes            0.079     0.000     0.000     0.000     0.000
## imdb_rating                 1.000     1.000     1.000     1.000     1.000
## imdb_num_votes              0.062     0.000     0.000     0.000     0.000
## critics_score               0.920     1.000     1.000     1.000     1.000
## best_pic_nomyes             0.132     0.000     0.000     0.000     0.000
## best_pic_winyes             0.041     0.000     0.000     0.000     0.000
## best_actor_winyes           0.116     0.000     0.000     0.000     0.000
## best_actress_winyes         0.148     0.000     0.000     1.000     0.000
## best_dir_winyes             0.067     0.000     0.000     0.000     0.000
## top200_boxyes               0.049     0.000     0.000     0.000     0.000
## BF                             NA     1.000     0.872     0.205     0.204
## PostProbs                      NA     0.156     0.136     0.032     0.032
## R2                             NA     0.748     0.746     0.747     0.750
## dim                            NA     4.000     3.000     4.000     5.000
## logmarg                        NA -3434.752 -3434.889 -3436.338 -3436.342
##                       model 5
## Intercept               1.000
## feature_filmyes         0.000
## dramayes                0.000
## runtime                 1.000
## mpaa_rating_Ryes        0.000
## thtr_rel_year           0.000
## oscar_seasonyes         0.000
## summer_seasonyes        0.000
## imdb_rating             1.000
## imdb_num_votes          0.000
## critics_score           1.000
## best_pic_nomyes         1.000
## best_pic_winyes         0.000
## best_actor_winyes       0.000
## best_actress_winyes     0.000
## best_dir_winyes         0.000
## top200_boxyes           0.000
## BF                      0.185
## PostProbs               0.029
## R2                      0.750
## dim                     5.000
## logmarg             -3436.438

About the coeffficients: 1) We are 5.9% sure that “feature_film” should be included. 2) We are 4,5% sure that “drama” should be included. 3) We are 51.4% sure that “runtime” should be included. 4) We are 16,5% sure that “mpaa_rating_R” should be included. 5) We are 8,1% sure that “thtr_rel_year” should be included. 6) We are 6,5% sure that “oscar_season” should be included. 7) We are 7,9% sure that “summer_season” should be included. 8) We are 100% sure that “imdb_rating” should be included. 9) We are 6,2% sure that “imdb_num_votes” should be included. 10) We are 92% sure that “critics_score” should be included. 11) We are 13,2% sure that “best_pic_nom” should be included. 12) We are 4,1% sure that “best_pic_win” should be included. 13) We are 11,6% sure that “best_actor_win” should be included. 14) We are 14,8% sure that “best_actress_win” should be included. 15) We are 6,7% sure that “best_dir_winyes” should be included. 16) We are 4,9% sure that “top200_boxyes” should be included.

Part 5: Prediction

At this point, we want to use the model we created earlier, bma_audsco to predict the popularity of a movie based on its ‘audience_score’ , The land(2016). The data used to make this prediction come from Rotten Tomatoes (https://www.rottentomatoes.com/m/the_land_2016#audience_reviews) and IMDB (https://www.imdb.com/title/tt5164412/?ref_=ttawd_awd_tt).

theland <- data.frame(audience_score=61,feature_film="yes", drama="yes", runtime = 104,mpaa_rating_R = "yes", thtr_rel_year = 2016, oscar_season = "no", summer_season = "yes", imdb_rating = 6.4, imdb_num_votes = 1393, critics_score = 67, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "no")
audience_score_pred = predict(bma_audsco, newdata=theland, estimator="BMA",se.fit=TRUE)
audience_score_pred$fit

## [1] 61.61292

We will also construct a credible interval around this prediction, which will provide a measure of uncertainty around the prediction.

ci_bma_audsco = confint(audience_score_pred, estimator="BMA")
ci_bma_audsco

##          2.5%    97.5%     pred
## [1,] 41.66233 81.57263 61.61292
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

This model predicts that the audience score of the movie would roughly be 61,6% (which is almost exactly the true audience score). Also, there is a 95% chance that the score is between 41,7% and 81,8%.

Part 6: Conclusion

As final remarks, I believe that we have ended up in a parsimonious model that makes very close to reality predictions about the future popularity of a movie. The prediction for the movie “The land” is one of an audience score of 61,6% when the actual score of the movie is 61% (which actually belongs to credible interval). As seen above, our model has a quite low posterior probability, maybe that could have been countered with more/different variables offered.