Bayesian Linear Regression to Predict the Audience Scores for IMDB Movies

In this article, we are going to use the R package BAS and leverage the Bayesian Linear Regression implementation to predict the audience score for an IMDB movie given other predictor variables. This appeared as a Data Analysis Project in the Coursera course Bayesian Statistics by Duke University.

The dataset we shall use (namely, movies.Rdata) contains information about how much audiences and critics like movies, as well as numerous other attributes of the movies. The dataset, summarized in the next section, includes information from Rotten Tomatoes and IMDB for a random sample of movies.

We are interested in learning what attributes make a movie popular and also in learning something new about movies.

As part of this project we shall complete exploratory data analysis (EDA), modeling, and prediction.

We shall develop a Bayesian regression model to predict audience_score from a set of explanatory variables (listed in the Modeling section). Note that some of these variables are in the original dataset provided, and others are new variables we shall construct in the data manipulation section using the mutate() function in dplyr.

All analyses are done using the R programming language via RStudio, and this writeup is prepared as an R Markdown document.

The project consists of 6 parts:

  1. Data
  2. Data manipulation
  3. EDA
  4. Modeling
  5. Prediction
  6. Conclusion

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

Let’s make sure that the data and the R Markdown files are in the same directory. When loaded, the data file will be available as an object called movies. Let’s load the data first.

load("movies.Rdata")

Part 1: Data

To start with, let’s describe how the observations in the sample were collected, and the implications of this data collection method on the scope of inference (generalizability / causality).

The data set comprises \(651\) randomly sampled movies (produced and released before \(2016\)). Also, from the summary of the data below, it can be seen that the dataset is quite representative of the population (diverse, in the sense that it contains many possible combinations of values for each of the explanatory variables), hence the results are likely to be generalizable (assuming that the relation between the predictors and the response does not change much after \(2016\)).

However, there is no controlled experiment / random assignment: the covariates of the linear regression are observed rather than set by the experimenter. Hence we can only infer association (correlation) between the response and the predictors, not causation.

Also, since the underlying movies dataset from which the samples were chosen is likely not itself a random sample from the entire population of movies, there may be selection bias (e.g., voluntary response bias).

library(GGally)
#head(movies)
names(movies)
##  [1] "title"            "title_type"       "genre"            "runtime"         
##  [5] "mpaa_rating"      "studio"           "thtr_rel_year"    "thtr_rel_month"  
##  [9] "thtr_rel_day"     "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"   "critics_score"   
## [17] "audience_rating"  "audience_score"   "best_pic_nom"     "best_pic_win"    
## [21] "best_actor_win"   "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"           "actor3"          
## [29] "actor4"           "actor5"           "imdb_url"         "rt_url"
summary(movies[,c(2:5, 7:8, 10:23)])
##         title_type                 genre        runtime       mpaa_rating 
##  Documentary : 55   Drama             :305   Min.   : 39.0   G      : 19  
##  Feature Film:591   Comedy            : 87   1st Qu.: 92.0   NC-17  :  2  
##  TV Movie    :  5   Action & Adventure: 65   Median :103.0   PG     :118  
##                     Mystery & Suspense: 59   Mean   :105.8   PG-13  :133  
##                     Documentary       : 52   3rd Qu.:115.8   R      :329  
##                     Horror            : 23   Max.   :267.0   Unrated: 50  
##                     (Other)           : 60   NA's   :1                    
##  thtr_rel_year  thtr_rel_month   dvd_rel_year  dvd_rel_month     dvd_rel_day   
##  Min.   :1970   Min.   : 1.00   Min.   :1991   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.:2001   1st Qu.: 3.000   1st Qu.: 7.00  
##  Median :2000   Median : 7.00   Median :2004   Median : 6.000   Median :15.00  
##  Mean   :1998   Mean   : 6.74   Mean   :2004   Mean   : 6.333   Mean   :15.01  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:2008   3rd Qu.: 9.000   3rd Qu.:23.00  
##  Max.   :2014   Max.   :12.00   Max.   :2015   Max.   :12.000   Max.   :31.00  
##                                 NA's   :8      NA's   :8        NA's   :8      
##   imdb_rating    imdb_num_votes           critics_rating critics_score   
##  Min.   :1.900   Min.   :   180   Certified Fresh:135    Min.   :  1.00  
##  1st Qu.:5.900   1st Qu.:  4546   Fresh          :209    1st Qu.: 33.00  
##  Median :6.600   Median : 15116   Rotten         :307    Median : 61.00  
##  Mean   :6.493   Mean   : 57533                          Mean   : 57.69  
##  3rd Qu.:7.300   3rd Qu.: 58301                          3rd Qu.: 83.00  
##  Max.   :9.000   Max.   :893008                          Max.   :100.00  
##                                                                          
##  audience_rating audience_score  best_pic_nom best_pic_win best_actor_win
##  Spilled:275     Min.   :11.00   no :629      no :644      no :558       
##  Upright:376     1st Qu.:46.00   yes: 22      yes:  7      yes: 93       
##                  Median :65.00                                           
##                  Mean   :62.36                                           
##                  3rd Qu.:80.00                                           
##                  Max.   :97.00                                           
##                                                                          
##  best_actress_win best_dir_win
##  no :579          no :608     
##  yes: 72          yes: 43     
##                               
##                               
##                               
##                               
## 

From the above summary, we can see that out of \(651\) movies only \(22\) received an Oscar best picture nomination, the mean IMDB rating in the sample is around \(6.5\), the average critics score is around \(57.7\), and the average audience score is around \(62.4\).
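
These summary figures can be recomputed directly; for instance:

# Fraction of movies with an Oscar best picture nomination: 22 / 651
mean(movies$best_pic_nom == "yes")
# Mean audience score in the sample
mean(movies$audience_score)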

p <- ggpairs(na.omit(movies), columns=c("title_type", "genre", "runtime", "mpaa_rating", "thtr_rel_year", 
                                        "thtr_rel_month", "imdb_rating", "imdb_num_votes", "critics_score", 
                                        "audience_score", "best_pic_win", "top200_box"), 
             title="correlogram", 
             progress = FALSE) 
suppressMessages(print(p))

The above figure visualizes the distribution of each variable and the pairwise associations between the chosen variables. For example, observe that runtime is right skewed, whereas the imdb_rating and audience_score variables are left skewed. The variables imdb_rating and critics_score both have positive correlations with the response variable audience_score.
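
These associations can also be checked numerically; a quick sketch (use = "complete.obs" drops the single movie with a missing runtime):

# Pairwise correlations of some numeric predictors with audience_score
cor(movies[c("runtime", "imdb_rating", "critics_score", "audience_score")],
    use = "complete.obs")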

Part 2: Data manipulation

Let’s create the following new variables using the mutate() function in the dplyr package:

  • Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no

  • Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no

  • Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no

  • Create two new variables based on thtr_rel_month:

    • New variable called oscar_season with levels yes (if movie is released in November, October, or December) and no

    • New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no

Let’s use the following R code to create the new variables.

movies <- movies %>% mutate(feature_film = ifelse(title_type == 'Feature Film', 'yes', 'no'),
                            drama = ifelse(genre == 'Drama', 'yes', 'no'),
                            mpaa_rating_R = ifelse(mpaa_rating == 'R', 'yes', 'no'),
                            oscar_season = ifelse(thtr_rel_month >= 10, 'yes', 'no'),
                            summer_season = ifelse(thtr_rel_month %in% 5:8, 'yes', 'no'))
names(movies)
##  [1] "title"            "title_type"       "genre"            "runtime"         
##  [5] "mpaa_rating"      "studio"           "thtr_rel_year"    "thtr_rel_month"  
##  [9] "thtr_rel_day"     "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"   "critics_score"   
## [17] "audience_rating"  "audience_score"   "best_pic_nom"     "best_pic_win"    
## [21] "best_actor_win"   "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"           "actor3"          
## [29] "actor4"           "actor5"           "imdb_url"         "rt_url"          
## [33] "feature_film"     "drama"            "mpaa_rating_R"    "oscar_season"    
## [37] "summer_season"
head(movies[c('feature_film', 'drama', 'mpaa_rating_R', 'oscar_season', 'summer_season')])
## # A tibble: 6 x 5
##   feature_film drama mpaa_rating_R oscar_season summer_season
##   <chr>        <chr> <chr>         <chr>        <chr>        
## 1 yes          yes   yes           no           no           
## 2 yes          yes   no            no           no           
## 3 yes          no    yes           no           yes          
## 4 yes          yes   no            yes          no           
## 5 yes          no    yes           no           no           
## 6 no           no    no            no           no

As can be seen from above, the new variables are added to the tibble, as needed.
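
As an additional sanity check on the derived flags, we can cross-tabulate oscar_season against the raw release month; 'yes' counts should be non-zero only for months 10 through 12:

# Cross-tabulation of the derived oscar_season flag vs. the release month
table(movies$oscar_season, movies$thtr_rel_month)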

Part 3: Exploratory data analysis

Let’s perform exploratory data analysis (EDA) of the relationship between audience_score and the new variables constructed in the previous part. The EDA contains numerical summaries and visualizations. Each R output and plot is accompanied by a brief interpretation.

The following code snippet analyzes the relationship between the audience_score and feature_film, with exploratory visualization.

ggplot(movies, aes(audience_score, group=feature_film, fill=feature_film)) + 
  geom_boxplot() + ggtitle('boxplot of audience score grouped by feature_film')

ggplot(movies, aes(audience_score, group=feature_film, fill=feature_film)) + 
  geom_density(alpha=0.5) + ggtitle('distribution of audience score grouped by feature_film')

movies %>% group_by(feature_film) %>% 
  summarize(score_Min = min(audience_score),
             score_Q1 = quantile(audience_score, .25),
             score_mean = mean(audience_score), 
             score_median = median(audience_score), 
             score_Q3 = quantile(audience_score, .75),
             score_Max = max(audience_score))
## # A tibble: 2 x 7
##   feature_film score_Min score_Q1 score_mean score_median score_Q3 score_Max
## * <chr>            <dbl>    <dbl>      <dbl>        <dbl>    <dbl>     <dbl>
## 1 no                  19     76.5       81.0         85.5       89        96
## 2 yes                 11     44.5       60.5         62         78        97

As can be seen from the above plots and the summary statistics table, the distribution of audience scores for feature films is closer to uniform, while the distribution for non-feature films is left skewed (a long tail towards low scores: the mean, \(81\), is below the median, \(85.5\)). On average, non-feature films obtain much higher audience scores (~\(81\)) than feature films (~\(60.5\)). Hence, this is likely to be a discriminating predictor variable for the response variable audience_score.
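
Beyond these descriptive summaries, we could also formally compare the two group means with a Bayesian hypothesis test. Below is a minimal sketch using the bayes_inference() function from the already-loaded statsr package (assuming its usual interface from the course; the grouping variable is converted to a factor first):

# Bayesian hypothesis test (Bayes factor) for a difference in mean
# audience_score between feature and non-feature films
movies_bi <- movies %>% mutate(feature_film = as.factor(feature_film))
bayes_inference(y = audience_score, x = feature_film, data = movies_bi,
                statistic = "mean", type = "ht",
                null = 0, alternative = "twosided")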

Similarly, the following code snippet analyzes the relationship between the audience_score and drama, with exploratory visualization.

ggplot(movies, aes(audience_score, group=drama, fill=drama)) + 
  geom_boxplot() + ggtitle('boxplot of audience score grouped by drama')

ggplot(movies, aes(audience_score, group=drama, fill=drama)) + 
  geom_density(alpha=0.5) + ggtitle('distribution of audience score grouped by drama')

movies %>% group_by(drama) %>% 
  summarize(score_Min = min(audience_score),
             score_Q1 = quantile(audience_score, .25),
             score_mean = mean(audience_score), 
             score_median = median(audience_score), 
             score_Q3 = quantile(audience_score, .75),
             score_Max = max(audience_score))
## # A tibble: 2 x 7
##   drama score_Min score_Q1 score_mean score_median score_Q3 score_Max
## * <chr>     <dbl>    <dbl>      <dbl>        <dbl>    <dbl>     <dbl>
## 1 no           11       41       59.7           61       79        97
## 2 yes          13       52       65.3           70       80        95

Again, as can be seen from the above plots and the summary statistics table, the distribution of audience scores for movies of genre Drama is more left skewed (mean \(65.3\) below median \(70\)) than that for non-dramas, which is closer to uniform. On average, dramas obtain higher audience scores (~\(65\)) than non-dramas (~\(60\)).

Next, the following code snippet analyzes the relationship between the audience_score and mpaa_rating_R, with exploratory visualization.

ggplot(movies, aes(audience_score, group=mpaa_rating_R, fill=mpaa_rating_R)) + 
  geom_boxplot() + ggtitle('boxplot of audience score grouped by mpaa_rating_R')

ggplot(movies, aes(audience_score, group=mpaa_rating_R, fill=mpaa_rating_R)) + 
  geom_density(alpha=0.5) + ggtitle('distribution of audience score grouped by mpaa_rating_R')

movies %>% group_by(mpaa_rating_R) %>% 
  summarize(score_Min = min(audience_score),
             score_Q1 = quantile(audience_score, .25),
             score_mean = mean(audience_score), 
             score_median = median(audience_score), 
             score_Q3 = quantile(audience_score, .75),
             score_Max = max(audience_score))
## # A tibble: 2 x 7
##   mpaa_rating_R score_Min score_Q1 score_mean score_median score_Q3 score_Max
## * <chr>             <dbl>    <dbl>      <dbl>        <dbl>    <dbl>     <dbl>
## 1 no                   11     48.2       62.7         65.5       80        96
## 2 yes                  14     44         62.0         64         79        97

Finally, as can be seen from the above plots and the summary statistics table, the distribution of audience scores for R-rated movies is very similar to that for movies with non-R ratings. The median audience score of R-rated movies (~\(64\)) is only slightly lower than that of non-R-rated movies (~\(65.5\)). Hence, this predictor variable alone is less likely to be a good discriminator for the response variable audience_score.

Part 4: Modeling

Let’s develop a Bayesian regression model (with Bayesian Model Averaging) to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables we constructed earlier: feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box. Let’s perform Bayesian model selection and report the final model. We shall also perform model diagnostics and interpret the coefficients of the final model in the context of the data.

Let’s first create a reduced dataset with the relevant feature variables mentioned above, and then clean the data by removing NA values. Then let’s create a Bayesian linear regression model, with a BIC prior on the coefficients and a uniform model prior (equal probabilities for all models), using the following R code snippet.

movies_reduced <- movies[c('feature_film', 'drama', 'runtime', 'mpaa_rating_R', 'thtr_rel_year', 
                           'oscar_season', 'summer_season', 'imdb_rating', 'imdb_num_votes', 
                           'critics_score', 'best_pic_nom', 'best_pic_win', 'best_actor_win', 
                           'best_actress_win', 'best_dir_win', 'top200_box', 'audience_score')]
#head(movies_reduced)

dim(movies_reduced)
## [1] 651  17
movies_no_na <- na.omit(movies_reduced)
dim(movies_no_na)
## [1] 650  17
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lscore <- bas.lm(audience_score ~ .-audience_score, data = movies_no_na,
                   prior = "BIC", 
                   modelprior = uniform())

# Print out the marginal posterior inclusion probabilities for each variable                
bma_lscore
## 
## Call:
## bas.lm(formula = audience_score ~ . - audience_score, data = movies_no_na, 
##     prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06537              0.04320  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.46971              0.19984              0.09069  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07506              0.08042              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.05774              0.88855              0.13119  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.03985              0.14435              0.14128  
##     best_dir_winyes        top200_boxyes  
##             0.06694              0.04762
summary(bma_lscore)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127
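
As a quick check on the BF row in this table (which gives the Bayes factor of each model relative to the best one): assuming the fitted bas object exposes the log marginal likelihoods via its logmarg component (consistent with the postprobs component used later), the Bayes factors can be recomputed directly.

# Recompute the Bayes factors of the top 5 models relative to the best model:
# BF = exp(difference of log marginal likelihoods)
logmarg_top <- sort(bma_lscore$logmarg, decreasing = TRUE)[1:5]
exp(logmarg_top - logmarg_top[1])

The first value is \(1\) by construction, and the second should be close to the \(0.9968\) reported above.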

Printing the model object and calling the summary function gives us both the marginal posterior inclusion probability for each variable and the \(5\) most probable models. For example, the posterior probability that runtime is included in the model is \(0.46971\), as can be seen from the above output. Further, the most likely model, which has the highest posterior probability of \(0.1297\), includes an intercept, runtime, imdb_rating and critics_score (shown in the figure titled Model Ranking below, where the most likely model, model 1, corresponds to the leftmost column), with an \(R^2\) of \(0.7549\). The above table also shows the Bayes factor of each model relative to the highest probability model (whose Bayes factor is therefore \(1\)). To see beyond the first five models, we can represent the collection of models via an image plot. By default this shows the top \(20\) models (see ?image.bas).

image(bma_lscore, rotate = F)

The above image has one row for each variable (and the intercept), with the variable labels on the y-axis. The x-axis corresponds to the possible models, sorted by posterior probability from best (left) to worst (right), with the model rank on the top x-axis. Models with the same color have similar log Bayes factors, which lets us view clusters of models whose Bayes factor differences are not worth more than a bare mention. This plot indicates that imdb_rating and critics_score enter almost all the models together, an indication of the high correlation between the two variables. By contrast, best_actor_winyes and best_actress_winyes have low correlation, as can be verified using the code snippet below.

cor(movies_no_na$imdb_rating, movies_no_na$critics_score)
## [1] 0.7647832
cor(movies_no_na$best_actor_win == 'yes', movies_no_na$best_actress_win == 'yes')
## [1] 0.1357741

While a posterior probability of \(0.1297\) for model 1 (the one with the highest posterior probability) sounds small, it is much larger than the uniform prior probability assigned to it, since there are \(2^{16}\) possible models. The following figure shows how the posterior probabilities for the top \(25\) models increased from their uniform prior probabilities.

k <- 25
# indices of the k models with the highest posterior probability
indices <- sort(bma_lscore$postprobs, decreasing = TRUE, index.return = TRUE)$ix[1:k]
ggplot() + 
  # under a uniform model prior, each of the 2^16 models has prior probability 1/2^16
  geom_line(aes(1:k, bma_lscore$priorprobs[indices]/2^16, color='prior probs of models'), lwd=2) +
  geom_line(aes(1:k, bma_lscore$postprobs[indices], color='posterior probs of models'), lwd=2) + 
  xlab(sprintf('top %d models with the highest posterior probs', k)) + ylab('probabilities')

Interpreting the coefficients

It is also possible to visualize the posterior distribution of the coefficients under the model averaging approach. Plots of the posterior distributions averaging over all of the models can be obtained using the plot method for the bas coefficient object. Let’s graph the posterior distribution of the coefficients of runtime and imdb_rating below. Note that the subset parameter dictates which variable is plotted.

# Obtain the coefficients from the model `bma_lscore`
coef_lscore <- coefficients(bma_lscore)
coef_lscore
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  65536 models 
##                      post mean   post SD     post p(B != 0)
## Intercept             6.235e+01   3.946e-01   1.000e+00    
## feature_filmyes      -1.047e-01   5.643e-01   6.537e-02    
## dramayes              1.604e-02   1.939e-01   4.320e-02    
## runtime              -2.568e-02   3.118e-02   4.697e-01    
## mpaa_rating_Ryes     -3.036e-01   7.032e-01   1.998e-01    
## thtr_rel_year        -4.533e-03   1.819e-02   9.069e-02    
## oscar_seasonyes      -8.035e-02   3.773e-01   7.506e-02    
## summer_seasonyes      8.705e-02   3.831e-01   8.042e-02    
## imdb_rating           1.498e+01   7.310e-01   1.000e+00    
## imdb_num_votes        2.081e-07   1.313e-06   5.774e-02    
## critics_score         6.297e-02   3.028e-02   8.886e-01    
## best_pic_nomyes       5.068e-01   1.568e+00   1.312e-01    
## best_pic_winyes      -8.503e-03   8.479e-01   3.985e-02    
## best_actor_winyes    -2.877e-01   8.318e-01   1.443e-01    
## best_actress_winyes  -3.088e-01   9.057e-01   1.413e-01    
## best_dir_winyes      -1.195e-01   6.227e-01   6.694e-02    
## top200_boxyes         8.648e-02   7.050e-01   4.762e-02
# `runtime` is the 4th and `imdb_rating` the 9th coefficient in the model
plot(coef_lscore, subset = c(4,9), ask = FALSE)

As can be seen from the above figure, the vertical bar in each posterior distribution represents the posterior probability that the corresponding coefficient is \(0\), while the bell-shaped curve represents the density of plausible values from all the models where the coefficient is non-zero. Since imdb_rating is always included (posterior inclusion probability \(1\)), there is no vertical bar in its distribution, whereas runtime is excluded (coefficient exactly \(0\)) with posterior probability of around \(0.5\).

We can also provide \(95\%\) credible intervals for these coefficients, as done with the below code snippet.

confint(coef_lscore)
##                              2.5%        97.5%          beta
## Intercept            6.155728e+01 6.308652e+01  6.234769e+01
## feature_filmyes     -1.161146e+00 4.269155e-03 -1.046908e-01
## dramayes             0.000000e+00 0.000000e+00  1.604413e-02
## runtime             -8.224243e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes    -2.074940e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year       -5.324124e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes     -9.846316e-01 1.191461e-02 -8.034940e-02
## summer_seasonyes    -1.463418e-02 9.556674e-01  8.704545e-02
## imdb_rating          1.363469e+01 1.651525e+01  1.498203e+01
## imdb_num_votes      -1.224578e-07 1.187257e-07  2.080713e-07
## critics_score        0.000000e+00 1.045046e-01  6.296648e-02
## best_pic_nomyes      0.000000e+00 4.720048e+00  5.068035e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes   -2.592067e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.921008e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes     -1.434778e+00 0.000000e+00 -1.195011e-01
## top200_boxyes        0.000000e+00 0.000000e+00  8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The first two columns give the \(95\%\) credible interval for each parameter. For example, after seeing the dataset we believe that, with \(95\%\) probability, the coefficient of the predictor variable imdb_rating lies inside the interval \([13.63469, 16.51525]\), with posterior mean \(14.98203\). The other coefficients can be interpreted in the same way. The third column (beta) in the above table is the posterior mean. The intervals are obtained via Monte Carlo sampling from the mixture distribution over coefficients, where models are drawn according to their posterior probabilities.

plot(confint(coef_lscore, parm = 2:16))
## Warning in arrows(x[not.deg], ci[not.deg, 1], x[not.deg], ci[not.deg, 2], :
## zero-length arrow is of indeterminate angle and so skipped

## NULL

Here we use the parm argument to select which coefficients to plot (the intercept is parm = 1).

For estimation under model selection, BAS supports additional options via the estimator argument. The default is estimator="BMA", which averages over all models (or over the n.models retained). Other options include estimation under the highest probability model (HPM) or the median probability model (MPM).
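
For instance, here is a minimal sketch (assuming the coef() method for bas objects accepts the same estimator argument) of extracting coefficient summaries conditional on a single selected model:

# Posterior summaries of the coefficients under the Highest Probability Model,
# instead of averaging over all models as above
coef(bma_lscore, estimator = "HPM")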

Diagnostics

Now, let’s try the Zellner-Siow Cauchy prior (ZS-null) on the coefficients and use Markov Chain Monte Carlo (MCMC) to compute the posterior inclusion probabilities.

bas_zs = bas.lm(audience_score ~ ., 
                data = movies_no_na,
                initprobs = "eplogp",
                prior = "ZS-null",
                modelprior = uniform(),
                method = "MCMC")

summary(bas_zs)
##                     P(B != 0 | Y)  model 1     model 2     model 3     model 4
## Intercept              1.00000000   1.0000   1.0000000   1.0000000   1.0000000
## feature_filmyes        0.06880341   0.0000   0.0000000   0.0000000   0.0000000
## dramayes               0.04591141   0.0000   0.0000000   0.0000000   0.0000000
## runtime                0.45822144   0.0000   1.0000000   0.0000000   0.0000000
## mpaa_rating_Ryes       0.20053101   0.0000   0.0000000   0.0000000   1.0000000
## thtr_rel_year          0.09721222   0.0000   0.0000000   0.0000000   0.0000000
## oscar_seasonyes        0.07740860   0.0000   0.0000000   0.0000000   0.0000000
## summer_seasonyes       0.08372116   0.0000   0.0000000   0.0000000   0.0000000
## imdb_rating            1.00000000   1.0000   1.0000000   1.0000000   1.0000000
## imdb_num_votes         0.06149063   0.0000   0.0000000   0.0000000   0.0000000
## critics_score          0.87814941   1.0000   1.0000000   1.0000000   1.0000000
## best_pic_nomyes        0.13410263   0.0000   0.0000000   0.0000000   0.0000000
## best_pic_winyes        0.04242325   0.0000   0.0000000   0.0000000   0.0000000
## best_actor_winyes      0.14772644   0.0000   0.0000000   1.0000000   0.0000000
## best_actress_winyes    0.14568939   0.0000   0.0000000   0.0000000   0.0000000
## best_dir_winyes        0.06955414   0.0000   0.0000000   0.0000000   0.0000000
## top200_boxyes          0.04998779   0.0000   0.0000000   0.0000000   0.0000000
## BF                             NA   1.0000   0.8702806   0.2236679   0.2217602
## PostProbs                      NA   0.1390   0.1205000   0.0326000   0.0310000
## R2                             NA   0.7525   0.7549000   0.7539000   0.7539000
## dim                            NA   3.0000   4.0000000   4.0000000   4.0000000
## logmarg                        NA 443.9495 443.8105657 442.4519125 442.4433468
##                         model 5
## Intercept             1.0000000
## feature_filmyes       0.0000000
## dramayes              0.0000000
## runtime               1.0000000
## mpaa_rating_Ryes      1.0000000
## thtr_rel_year         0.0000000
## oscar_seasonyes       0.0000000
## summer_seasonyes      0.0000000
## imdb_rating           1.0000000
## imdb_num_votes        0.0000000
## critics_score         1.0000000
## best_pic_nomyes       0.0000000
## best_pic_winyes       0.0000000
## best_actor_winyes     0.0000000
## best_actress_winyes   0.0000000
## best_dir_winyes       0.0000000
## top200_boxyes         0.0000000
## BF                    0.2055844
## PostProbs             0.0280000
## R2                    0.7563000
## dim                   5.0000000
## logmarg             442.3676066
diagnostics(bas_zs, type="pip", col = "blue", pch = 19, cex = 2)

  • Convergence: Let’s check whether the MCMC (performed for this model selection) ran long enough to converge. The diagnostics() function plots the MCMC estimates of the posterior inclusion probabilities against their theoretical values; we want the points to stay on the \(45^\circ\) diagonal line. The above convergence plot shows that all points fall on the \(45^\circ\) diagonal, suggesting that the posterior inclusion probability (pip) of each variable from MCMC has converged to the theoretical posterior inclusion probability.

diagnostics(bas_zs, type = "model", col = "blue", pch = 19, cex = 2)

From the above figure, we can see that all of the points lie on the \(45^\circ\) line, validating the convergence of the posterior probabilities of the models given all the observed data.

  • Residuals vs. Fitted Values: The residuals plot does not display a fully random scatter around \(0\); the variance is not constant across the range of fitted values.

plot(bas_zs, which = 1)

As seen in the figure above, for most fitted values the residuals are centered around \(0\), and the residual variance decreases at the higher predicted values. This suggests that the model is likely to predict better at higher audience scores. However, for fitted values below \(40\), the model tends to underestimate the audience_score of the movies, with high variance. The figure also flags \(3\) possible outliers in this region.
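
To identify those flagged cases, here is a small sketch (assuming the fitted() method for bas objects with an estimator argument) that lists the largest absolute residuals under model averaging:

# Residuals under Bayesian Model Averaging; the three largest absolute
# residuals should correspond to the outliers flagged in the plot above
resid_bma <- movies_no_na$audience_score - fitted(bas_zs, estimator = "BMA")
head(sort(abs(resid_bma), decreasing = TRUE), 3)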

  • Model Probabilities: As can be seen from the next figure, the cumulative model probability starts to level off after about \(2000\) models sampled by MCMC, as each additional model contributes little additional probability. The search stops after about \(6000\) unique models, instead of enumerating all \(2^{16}\) possible models as was done earlier with the BIC prior on the coefficients.

plot(bas_zs, which = 2)

  • Model Complexity: The next figure shows that the models with the highest Bayes factors / marginal likelihoods have around \(3\) or \(4\) predictors.

plot(bas_zs, which = 3)

  • Marginal Inclusion Probabilities: The lines in red correspond to the variables with posterior inclusion probability (pip) greater than \(0.5\), suggesting that these variables (e.g., imdb_rating, critics_score) are important predictors of the response audience_score.

plot(bas_zs, which = 4, ask=F)

Part 5: Prediction

Now let’s pick a movie from 2016 (a new movie that is not in the sample) and predict its audience score using the model we developed and the predict() function in R.

The new movie (not yet observed / not present in the training sample) that we shall use for prediction is Moonlight (2016); note that the ground-truth audience score for this movie is \(79\). Let’s first create the data we shall use for prediction with the following code snippet. The links below are the references for where the data for this movie comes from:

  1. https://www.imdb.com/search/title/?year=2016,2016&sort=num_votes,desc
  2. https://www.rottentomatoes.com/m/moonlight_2016
  3. https://editorial.rottentomatoes.com/article/chicago-film-critics-2016-winners-announced/
  4. https://en.wikipedia.org/wiki/Moonlight_(2016_film)
  5. https://www.boxofficemojo.com/release/rl2558428673/rankings/?ref_=bo_rl_tab#tabs
moonlight <- data.frame(feature_film='yes', drama='yes', runtime=111, mpaa_rating_R='yes', 
                        thtr_rel_year=2016, oscar_season='yes', summer_season='no', 
                        imdb_rating=7.4, imdb_num_votes=305195, critics_score=98, 
                        best_pic_nom='yes', best_pic_win='yes', best_actor_win='yes', 
                        best_actress_win='no', best_dir_win='no', top200_box='yes', 
                        audience_score=79)
newdata <- moonlight[-length(moonlight)]
newdata
##   feature_film drama runtime mpaa_rating_R thtr_rel_year oscar_season
## 1          yes   yes     111           yes          2016          yes
##   summer_season imdb_rating imdb_num_votes critics_score best_pic_nom
## 1            no         7.4         305195            98          yes
##   best_pic_win best_actor_win best_actress_win best_dir_win top200_box
## 1          yes            yes               no           no        yes
#movies_no_na[movies_no_na$best_pic_win=='yes', c('best_pic_win','best_pic_nom')]
#names(movies_no_na)

Now, we shall predict the audience score for this new movie using the model fitted above. BAS uses simulation to construct predictive intervals under Bayesian Model Averaging, while exact inference is often possible for predictive intervals under model selection. As with lm, BAS has methods to return fitted values (fitted, using the observed design matrix) and predictions at either the observed data or new values (predict).

Let’s use the predict() function to obtain the prediction of the audience score for the new movie, first under model averaging, and then compute a \(95\%\) prediction interval for the audience score, with covariates at the levels of the row indexed by opt, using the following code snippet.

BMA_pred_bma_lscore <- predict(bma_lscore, newdata=newdata, estimator = "BMA", se.fit = TRUE)
ci_BMA_lscore <- confint(BMA_pred_bma_lscore, estimator = "BMA")
opt_BMA <- which.max(BMA_pred_bma_lscore$fit)
ci_BMA_lscore[opt_BMA, ]
##     2.5%    97.5%     pred 
## 58.64516 98.73827 78.46105

Let’s find the predicted values under the best predictive model (BPM), the model whose predictions are closest to those from BMA, along with the corresponding posterior standard deviations.

BPM_pred_bma_lscore =  predict(bma_lscore, newdata=newdata, estimator="BPM", se.fit=TRUE)
variable.names(BPM_pred_bma_lscore)
##  [1] "Intercept"           "feature_filmyes"     "dramayes"           
##  [4] "runtime"             "thtr_rel_year"       "oscar_seasonyes"    
##  [7] "summer_seasonyes"    "imdb_rating"         "imdb_num_votes"     
## [10] "best_actress_winyes" "best_dir_winyes"     "top200_boxyes"
BPM_pred_bma_lscore$namesx[BPM_pred_bma_lscore$bestmodel+1]
## NULL
opt_BPM = which.max(BPM_pred_bma_lscore$fit)
ci_BPM_lscore = confint(BPM_pred_bma_lscore, parm="pred")
ci_BPM_lscore[opt_BPM,]
##     2.5%    97.5%     pred 
## 57.91846 99.00368 78.46107

In the code above, the function variable.names() extracts the names of all the predictors in the Best Predictive Model. The same function can be used to identify the variables in the Highest Probability Model (HPM).

HPM_pred_bma_lscore =  predict(bma_lscore, newdata=newdata, estimator="HPM", se.fit=TRUE)
variable.names(HPM_pred_bma_lscore)
## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"
HPM_pred_bma_lscore$namesx[HPM_pred_bma_lscore$bestmodel+1]
## NULL
opt_HPM = which.max(HPM_pred_bma_lscore$fit)
ci_HPM_lscore = confint(HPM_pred_bma_lscore, parm="pred")
ci_HPM_lscore[opt_HPM,]
##     2.5%    97.5%     pred 
## 58.75375 98.28249 78.51812

We can compare this to the Highest Probability Model that we found earlier and to the Median Probability Model (MPM).

MPM_pred_bma_lscore =  predict(bma_lscore, newdata=newdata, estimator="MPM", se.fit=TRUE)
variable.names(MPM_pred_bma_lscore)
## [1] "Intercept"     "imdb_rating"   "critics_score"
MPM_pred_bma_lscore$namesx[MPM_pred_bma_lscore$bestmodel+1]
## NULL
opt_MPM = which.max(MPM_pred_bma_lscore$fit)
ci_MPM_lscore = confint(MPM_pred_bma_lscore, parm="pred")
ci_MPM_lscore[opt_MPM,]
##     2.5%    97.5%     pred 
## 58.77803 98.47299 78.62551

Based on these results, the covariate imdb_rating is included in all of the following models: the best predictive model, the median probability model, and the highest posterior probability model. Finally, let’s compare the predictions from the different models with the actual audience score for the movie.

pred_df <- NULL
pred_df <- rbind(pred_df, ci_BPM_lscore[opt_BPM,])
pred_df <- rbind(pred_df, ci_MPM_lscore[opt_MPM,])
pred_df <- rbind(pred_df, ci_HPM_lscore[opt_HPM,])
pred_df <- rbind(pred_df, ci_BMA_lscore[opt_BMA,])
pred_df <- as.data.frame(pred_df)
pred_df$model <- c('BPM', 'MPM', 'HPM', 'BMA')
names(pred_df)[1:2] <- c('LCL', 'UCL')
pred_df <- rbind(pred_df, data.frame(LCL=NA, UCL=NA, pred=moonlight$audience_score, model='Actual'))
ggplot(data = pred_df) + 
  geom_bar(stat = "identity", aes(x = model, y = pred, fill=model)) +
  geom_errorbar(
    aes(x=model, 
        ymin = LCL, 
        ymax = UCL), color = 1
  ) + ggtitle('Prediction with Bayesian Model Averaging (with 95% credible intervals)')

As can be seen from the above plot, the actual (ground-truth) audience score is very close to the ones predicted by different models and always inside the \(95\%\) credible intervals.
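
As a numeric check (using the pred_df table built above and the actual score of \(79\)), we can tabulate the absolute error of each estimator:

# Absolute prediction error of each estimator against the actual score (79);
# MPM should come out lowest, followed by HPM, then BPM and BMA
with(pred_df[pred_df$model != 'Actual', ],
     data.frame(model = model, abs_error = abs(pred - moonlight$audience_score)))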

Part 6: Conclusion

This section contains a brief summary of our findings from the previous sections, as well as a discussion of what we have learned about the data and the research question. We also discuss some shortcomings of the current study (due either to data collection or to methodology) and include ideas for possible future research.

Findings

  • Using Bayesian linear regression with BAS to predict the response audience_score, the predictor variables included in the top models with the highest posterior probability (under BMA) are runtime, imdb_rating, critics_score, thtr_rel_year, and some of the newly constructed variables, e.g., whether mpaa_rating is R and whether best_actor_win is yes.
  • The predictive models BPM, HPM and MPM all perform quite well in predicting the audience score of the unseen movie (with error \(\approx 0.5\)); MPM performed best (lowest error) on this unseen movie.

Shortcomings

  • Model evaluation needs to be done properly on a larger unseen test dataset to quantify the performance of the models (e.g., with metrics such as RMSE or MAE); we have tested the models only on a single unseen example.
  • The model’s residuals plot displays a wave pattern, which suggests varying predictive ability, particularly at lower scores, where the model is likely to underestimate the actual score.
  • The movies dataset is dynamic and ever-changing, since ratings and vote counts keep changing over time. To cope with this, we may need to retrain the model periodically with newer data, choosing random samples from movies released in later years, so that the models generalize better to unseen data (perhaps using time series models).

Future Research

  • To address the shortcomings, we may want a separate held-out validation dataset for model selection / hyper-parameter tuning, to ensure proper model evaluation (e.g., with metrics such as RMSE or MAE).
  • We should add more samples (e.g., movies released in later years, and particularly movies with low audience scores) and / or consider adding more features / predictor variables and rebuild the model.