Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")

Part 1: Data

According to the cookbook provided, the data set is comprised of 651 randomly sampled movies produced and released before 2016. As the sample is a random sample, the results are able to be generalized to the population (all movies produced and released before 2016), however, since it is not an experiment with random assignment, no causality can be inferred based on the correlations observed.

Part 2: Data manipulation

movies <- movies %>%
  mutate(feature_film = if_else(title_type == 'Feature Film', 'yes', 'no'),
         drama = if_else(genre == 'Drama', 'yes', 'no'),
         mpaa_rating_R = if_else(mpaa_rating == 'R', 'yes', 'no'),
         oscar_season = if_else((thtr_rel_month == 10 |
                                  thtr_rel_month == 11 |
                                  thtr_rel_month == 12), 'yes', 'no'),
         summer_season = if_else((thtr_rel_month == 5 |
                                    thtr_rel_month == 6 |
                                    thtr_rel_month == 7 |
                                    thtr_rel_month == 8), 'yes', 'no'))

#Converting the new variables into factors

movies <- movies %>%
  mutate(feature_film = as.factor(feature_film),
         drama = as.factor(drama),
         mpaa_rating_R = as.factor(mpaa_rating_R),
         oscar_season = as.factor(oscar_season),
         summer_season = as.factor(summer_season))

Part 3: Exploratory data analysis

movies %>%
  select(audience_score, feature_film, drama, mpaa_rating_R,
         oscar_season, summer_season) %>%
  summary()

##  audience_score  feature_film drama     mpaa_rating_R oscar_season
##  Min.   :11.00   no : 60      no :346   no :322       no :460     
##  1st Qu.:46.00   yes:591      yes:305   yes:329       yes:191     
##  Median :65.00                                                    
##  Mean   :62.36                                                    
##  3rd Qu.:80.00                                                    
##  Max.   :97.00                                                    
##  summer_season
##  no :443      
##  yes:208      
##               
##               
##               
##

According to this summary, the mean of the audience scores is 62.36 and the median is 65, so the sample is probably a little left-skewed. The range is 11-97. The drama and mpaa_rating categories are evenly represented in the dataset while the feature_film, oscar_season and summer_season seem to have a bias towards one of the “no” or “yes” bins.

movies %>%
  ggplot(aes(x=feature_film, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()

This boxplot shows that apart from a few outliers, movies that are not feature films tend to garner a bigger audience score. However we should be careful with this interpretation as there are many more feature films in the dataset than no feature films.

movies %>%
  ggplot(aes(x=drama, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()

According to this boxplot, drama movies seem to have a bigger median audience score with similar variability.

movies %>%
  ggplot(aes(x=mpaa_rating_R, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()

Movies with a mpaa rating of R seem to have a lower median audience score but the results are mostly similar.

movies %>%
  ggplot(aes(x=oscar_season, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()

Movies in the oscar season have a higher audience score with similar variability to those who are not oscar season. The number of movies in the oscar season is small (191) compared to those not in oscar season (460) in the dataset, so interpretation must be careful.

movies %>%
  ggplot(aes(x=summer_season, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()

Movies in summer season appear to have a smaller median audience score than those not in summer season.The smaller number of movies in summer season (208) than not in summer season (443) also shows a need for a careful interpretation.

Part 4: Modeling

Selecting best prior and estimate using the material from the optional supplement laboratory in week 4

BIC

set.seed(42)

movies_no_na = na.omit(movies)

n = nrow(movies_no_na)
n_cv = 50
ape = matrix(NA, ncol=4, nrow=n_cv)
colnames(ape) = c("BMA", "BPM", "HPM", "MPM")

for (i in 1:n_cv) {
  train = sample(1:n, size=round(.90*n), replace=FALSE)
  lmovies_train = movies_no_na[train,]
  lmovies_test = movies_no_na[-train,]

  bma_train_movie = bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box , data=lmovies_train, 
                          prior="BIC", modelprior=uniform(), initprobs="eplogp")
  yhat_bma = predict(bma_train_movie, lmovies_test, estimator="BMA")$fit
  yhat_hpm = predict(bma_train_movie, lmovies_test, estimator="HPM")$fit
  yhat_mpm = predict(bma_train_movie, lmovies_test, estimator="MPM")$fit
  yhat_bpm = predict(bma_train_movie, lmovies_test, estimator="BPM")$fit
  ape[i, "BMA"] = cv.summary.bas(yhat_bma, lmovies_test$audience_score)
  ape[i, "BPM"] = cv.summary.bas(yhat_bpm, lmovies_test$audience_score)
  ape[i, "HPM"] = cv.summary.bas(yhat_hpm, lmovies_test$audience_score)
  ape[i, "MPM"] = cv.summary.bas(yhat_mpm, lmovies_test$audience_score)
}

boxplot(ape)

AIC

set.seed(42)

movies_no_na = na.omit(movies)

n = nrow(movies_no_na)
n_cv = 50
ape = matrix(NA, ncol=4, nrow=n_cv)
colnames(ape) = c("BMA", "BPM", "HPM", "MPM")

for (i in 1:n_cv) {
  train = sample(1:n, size=round(.90*n), replace=FALSE)
  lmovies_train = movies_no_na[train,]
  lmovies_test = movies_no_na[-train,]

  bma_train_movie = bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box , data=lmovies_train, 
                          prior="AIC", modelprior=uniform(), initprobs="eplogp")
  yhat_bma = predict(bma_train_movie, lmovies_test, estimator="BMA")$fit
  yhat_hpm = predict(bma_train_movie, lmovies_test, estimator="HPM")$fit
  yhat_mpm = predict(bma_train_movie, lmovies_test, estimator="MPM")$fit
  yhat_bpm = predict(bma_train_movie, lmovies_test, estimator="BPM")$fit
  ape[i, "BMA"] = cv.summary.bas(yhat_bma, lmovies_test$audience_score)
  ape[i, "BPM"] = cv.summary.bas(yhat_bpm, lmovies_test$audience_score)
  ape[i, "HPM"] = cv.summary.bas(yhat_hpm, lmovies_test$audience_score)
  ape[i, "MPM"] = cv.summary.bas(yhat_mpm, lmovies_test$audience_score)
}

boxplot(ape)

According to those results, choosing the AIC prior with the BMA estimator seems to be the better choice for reducing the average prediction errors.

Building the full model

movies.lm <- bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box , data=movies, 
                          prior="AIC", modelprior=uniform(), initprobs="eplogp")

## Warning in bas.lm(audience_score ~ feature_film + drama + runtime +
## mpaa_rating_R + : dropping 1 rows due to missing data

summary(movies.lm)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept               1.0000000     1.0000     1.0000000     1.0000000
## feature_filmyes         0.3811635     0.0000     0.0000000     0.0000000
## dramayes                0.3794742     0.0000     0.0000000     0.0000000
## runtime                 0.8560184     1.0000     1.0000000     1.0000000
## mpaa_rating_Ryes        0.6831204     1.0000     1.0000000     1.0000000
## thtr_rel_year           0.5764312     1.0000     0.0000000     0.0000000
## oscar_seasonyes         0.3647827     0.0000     0.0000000     0.0000000
## summer_seasonyes        0.4160076     0.0000     0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000     1.0000000     1.0000000
## imdb_num_votes          0.4505720     0.0000     0.0000000     0.0000000
## critics_score           0.9666890     1.0000     1.0000000     1.0000000
## best_pic_nomyes         0.6933310     1.0000     1.0000000     1.0000000
## best_pic_winyes         0.3032774     0.0000     0.0000000     0.0000000
## best_actor_winyes       0.5155485     1.0000     1.0000000     0.0000000
## best_actress_winyes     0.5941480     1.0000     1.0000000     1.0000000
## best_dir_winyes         0.3621784     0.0000     0.0000000     0.0000000
## top200_boxyes           0.3051228     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9783606     0.9298905
## PostProbs                      NA     0.0017     0.0017000     0.0016000
## R2                             NA     0.7601     0.7593000     0.7585000
## dim                            NA     9.0000     8.0000000     7.0000000
## logmarg                        NA -3604.4206 -3604.4424632 -3604.4932746
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           1.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     1.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         1.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     1.0000000     1.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.8903369     0.7914465
## PostProbs               0.0015000     0.0013000
## R2                      0.7592000     0.7592000
## dim                     8.0000000     8.0000000
## logmarg             -3604.5367415 -3604.6544792

According to the results obtained from this summary, the best model for this data will be a model containing runtime, mpaa_rating_R, thtr_rel_year, imdb_rating, critics_score, best_pic_nom, best_actor_win and best_actress_win.

Building the final model

movies_final.lm <- bas.lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data = movies, prior='AIC', modelprior=uniform(), initprobs='eplogp')

## Warning in bas.lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year
## + : dropping 1 rows due to missing data

summary(movies_final.lm)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept               1.0000000     1.0000     1.0000000     1.0000000
## runtime                 0.8628615     1.0000     1.0000000     1.0000000
## mpaa_rating_Ryes        0.7010683     1.0000     1.0000000     1.0000000
## thtr_rel_year           0.4914510     1.0000     0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000     1.0000000     1.0000000
## critics_score           0.9782012     1.0000     1.0000000     1.0000000
## best_pic_nomyes         0.6956091     1.0000     1.0000000     1.0000000
## best_actor_winyes       0.5429452     1.0000     1.0000000     0.0000000
## best_actress_winyes     0.5834202     1.0000     1.0000000     1.0000000
## BF                             NA     1.0000     0.9783606     0.9298905
## PostProbs                      NA     0.0666     0.0652000     0.0619000
## R2                             NA     0.7601     0.7593000     0.7585000
## dim                            NA     9.0000     8.0000000     7.0000000
## logmarg                        NA -3604.4206 -3604.4424632 -3604.4932746
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           1.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         1.0000000     1.0000000
## best_actor_winyes       0.0000000     1.0000000
## best_actress_winyes     1.0000000     0.0000000
## BF                      0.8903369     0.7062613
## PostProbs               0.0593000     0.0471000
## R2                      0.7592000     0.7583000
## dim                     8.0000000     7.0000000
## logmarg             -3604.5367415 -3604.7683561

confint(coef(movies_final.lm))

##                            2.5%        97.5%        beta
## Intercept           61.61295213 63.136609084 62.34769231
## runtime             -0.09087433  0.000000000 -0.04627277
## mpaa_rating_Ryes    -2.82778849  0.002599150 -1.05608442
## thtr_rel_year       -0.11114646  0.007007131 -0.02459562
## imdb_rating         13.77385201 16.180712792 14.98141407
## critics_score        0.01920235  0.114034122  0.06559367
## best_pic_nomyes      0.00000000  8.398831718  3.08695161
## best_actor_winyes   -3.72767988  0.226782251 -0.97505211
## best_actress_winyes -4.46217267  0.030184526 -1.23735197
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Apart from critics_score and imdb_rating, most of the 95% credible intervals cross 0. However this is to be expected using the AIC prior, as the model may use non-significant variables in order to improve prediction compared to the BIC prior. As the main objective of this project is prediction of the audience_score for a new movie, this was a choice made by me. For true explanatory power, perhaps a simple model with critics_score and imdb_rating would suffice.

Diagnostics

plot(movies_final.lm, which = 1, add.smooth = F)

image(movies_final.lm, subset = -1, rotate = F)

According to the diagnostics, the residual vs fitted plot of the model seems to follow a pattern signifying the presence of a non-linear relationship between the predictors. The assumption of equal variance also seems to not being fulfilled in this model due to the reisduals not being spread equally along the range of the predictors.

As such, any predictions or credible intervals estimated from this model may be wrong and as such, one must be careful using it as the conclusions may be plain wrong.

Part 5: Prediction

The movie used for prediction will be Thor: Ragnarok (2017) using information from Rotten Tomatoes and IMDB.

thor.df <- data.frame(runtime = 130, mpaa_rating_R = 'no',
                      thtr_rel_year = 2017, imdb_rating = 8.2, critics_score = 92, best_pic_nom = 'no',
                      best_actor_win = 'no', 
                      best_actress_win = 'no')

thor.predict <- predict(movies_final.lm, newdata = thor.df, estimator = 'BMA')
thor.predict$Ybma

##          [,1]
## [1,] 89.31444

The prediction made by the model was an audience score of 89.3%. As the true audience score according to rotten tomatoes is 88%, I’m very happy with the prediction capabilities of the model even with all the limitations described in the diagnostics.

Part 6: Conclusion

From the results, the predictive model built fulfilled the expectation even though the diagnostics were not very good and it used non-significant predictors. Some limitations of this work consisted on the very small sample of movies (651) in the dataset as well as the fact that basically all of the new features created were useless for the model and also the small sample for prediction (n = 1).

As such, better feature engineering and the a bigger sample of movies to train the model with may prove to be benefical in building a better predictive model for the prediction of audience scores.