Congratulations on getting a job as a data scientist at Paramount Pictures!
Our boss has just acquired data about how much audiences and critics like movies, as well as numerous other variables about the movies. This dataset is provided below, and it includes information from Rotten Tomatoes and IMDb for a random sample of movies.
She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants our team to figure it all out.
The specific modeling task we need to complete is as follows: Develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables we will need to construct in the data manipulation section using the mutate function in dplyr:
feature_film: “yes” if title_type is Feature Film, “no” otherwise
drama: “yes” if genre is Drama, “no” otherwise
runtime
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise
thtr_rel_year
oscar_season: “yes” if movie is released in October, November, or December (based on thtr_rel_month), “no” otherwise
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box

For this project, we will complete exploratory data analysis (EDA), modeling, and prediction.
library(ggplot2)
library(MASS)
library(dplyr)
library(statsr)
library(BAS)
library(plotly)
library(broom)

load("movies.RData")

The data set comprises 651 randomly sampled movies produced and released before 2016.
Some of these variables are provided only for informational purposes and do not make sense to include in a statistical analysis. It is up to us to decide which variables are meaningful and which should be omitted. For example, information in the actor1 through actor5 variables was used to determine whether the movie features an actor or actress who has won a best actor or actress Oscar.
We might also choose to omit certain observations or restructure some of the variables to make them suitable for answering our research questions.
When we are fitting a model, we should also be careful about collinearity, as some of these variables may be dependent on each other.
Our exploratory analysis (see below) is consistent with these data being a random sample, so results obtained through linear regression can be generalized to a larger population of movies.
title_type: new variable feature_film with levels yes (movies that are feature films) and no
genre: new variable drama with levels yes (movies that are dramas) and no
mpaa_rating: new variable mpaa_rating_R with levels yes (movies that are R rated) and no
thtr_rel_month: two new variables:
oscar_season with levels yes (if the movie is released in October, November, or December) and no
summer_season with levels yes (if the movie is released in May, June, July, or August) and no

movies <- movies %>%
mutate(feature_film=factor(ifelse(title_type=="Feature Film","yes","no")),
drama=factor(ifelse(genre=="Drama","yes","no")),
mpaa_rating_R=factor(ifelse(mpaa_rating=="R","yes","no")),
oscar_season=factor(ifelse(thtr_rel_month%in%c(10,11,12),"yes","no")),
summer_season=factor(ifelse(thtr_rel_month%in%c(5,6,7,8),"yes","no")))

We now conduct exploratory data analysis of the relationship between audience_score and the new variables constructed above.
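As a quick sanity check that the recoding behaved as intended, we can tabulate the newly created factors (a minimal sketch; output not shown):

# Sanity check: counts of "yes"/"no" for each newly created factor
movies %>% count(feature_film)
movies %>% count(drama)
movies %>% count(mpaa_rating_R)
movies %>% count(oscar_season, summer_season)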
First, we take a look at the distribution of audience_score.
m <- list(l=50,r=50,b=100,t=50,pad=4)
audience_score_density <- density(movies$audience_score)
fig <- plot_ly(x=~audience_score, data=movies, type="histogram",name="Histogram") %>%
add_trace(x=audience_score_density$x,y=audience_score_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density") %>%
layout(autosize = F, width=600, height=400,margin = m,
title="Histogram of 'audience_score'",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 0.8))
fig

summary(movies$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
The distribution of audience_score is unimodal and slightly left-skewed (the mean of 62.4 falls below the median of 65), but close enough to symmetric that we treat it as approximately normal.
Next, we look at how audience_score is distributed across each of the new variables we created.
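Because the same histogram-and-density comparison is repeated for each new yes/no factor below, the plotting code could also be wrapped in a small helper. The sketch below is a hypothetical simplification (it omits the secondary density axis used in the explicit plots that follow); the per-variable code is kept as written for clarity:

# Hypothetical helper: overlaid histograms of audience_score for a yes/no factor.
# `var` is the column name as a string, e.g. "feature_film".
plot_score_by <- function(df, var) {
  yes <- df %>% filter(.data[[var]] == "yes")
  no  <- df %>% filter(.data[[var]] == "no")
  plot_ly(alpha = 0.4) %>%
    add_histogram(x = yes$audience_score, name = "Histogram-yes", bingroup = 1) %>%
    add_histogram(x = no$audience_score,  name = "Histogram-no",  bingroup = 1) %>%
    layout(barmode = "overlay",
           title = paste0("'audience_score' by '", var, "'"),
           xaxis = list(title = "audience_score"),
           yaxis = list(title = "Number of Observations"))
}
# Usage example: plot_score_by(movies, "feature_film")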
feature_film

feature_film_yes <- movies %>% filter(feature_film=="yes")
feature_film_no <- movies %>% filter(feature_film=="no")
feature_film_yes_density <- density(feature_film_yes$audience_score)
feature_film_no_density <- density(feature_film_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=feature_film_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=feature_film_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=feature_film_yes_density$x,y=feature_film_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=feature_film_no_density$x,y=feature_film_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Feature Film? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'feature_film': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(feature_film_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.50 62.00 60.47 78.00 97.00
summary(feature_film_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 76.50 85.50 81.05 89.00 96.00
drama

drama_yes <- movies %>% filter(drama=="yes")
drama_no <- movies %>% filter(drama=="no")
drama_yes_density <- density(drama_yes$audience_score)
drama_no_density <- density(drama_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=drama_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=drama_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=drama_yes_density$x,y=drama_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=drama_no_density$x,y=drama_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Drama? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'drama': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(drama_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 52.00 70.00 65.35 80.00 95.00
summary(drama_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 41.00 61.00 59.73 79.00 97.00
mpaa_rating_R

mpaa_rating_R_yes <- movies %>% filter(mpaa_rating_R=="yes")
mpaa_rating_R_no <- movies %>% filter(mpaa_rating_R=="no")
mpaa_rating_R_yes_density <- density(mpaa_rating_R_yes$audience_score)
mpaa_rating_R_no_density <- density(mpaa_rating_R_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=mpaa_rating_R_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=mpaa_rating_R_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=mpaa_rating_R_yes_density$x,y=mpaa_rating_R_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=mpaa_rating_R_no_density$x,y=mpaa_rating_R_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> R Rating? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'mpaa_rating_R': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(mpaa_rating_R_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 44.00 64.00 62.04 79.00 97.00
summary(mpaa_rating_R_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 48.25 65.50 62.69 80.00 96.00
oscar_season

oscar_season_yes <- movies %>% filter(oscar_season=="yes")
oscar_season_no <- movies %>% filter(oscar_season=="no")
oscar_season_yes_density <- density(oscar_season_yes$audience_score)
oscar_season_no_density <- density(oscar_season_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=oscar_season_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=oscar_season_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=oscar_season_yes_density$x,y=oscar_season_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=oscar_season_no_density$x,y=oscar_season_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Oscar Season? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'oscar_season': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(oscar_season_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 47.50 69.00 63.69 81.00 97.00
summary(oscar_season_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 64.00 61.81 79.00 96.00
summer_season

summer_season_yes <- movies %>% filter(summer_season=="yes")
summer_season_no <- movies %>% filter(summer_season=="no")
summer_season_yes_density <- density(summer_season_yes$audience_score)
summer_season_no_density <- density(summer_season_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=summer_season_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=summer_season_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=summer_season_yes_density$x,y=summer_season_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=summer_season_no_density$x,y=summer_season_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Summer Season? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'summer_season': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(summer_season_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.75 65.00 61.81 78.00 94.00
summary(summer_season_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 46.00 66.00 62.62 80.00 97.00
Our exploratory analysis indicates that, within each of these groups, the distribution of audience_score remains unimodal and roughly bell-shaped, although non-feature films score noticeably higher on average (mean 81.0 versus 60.5 for feature films).
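The group-wise summaries above can also be computed compactly with dplyr; a sketch for one of the factors (the same pattern applies to the others):

# Compact numerical comparison: audience_score by feature_film
movies %>%
  group_by(feature_film) %>%
  summarise(n = n(),
            mean_score = mean(audience_score),
            median_score = median(audience_score))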
We will now develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables we constructed earlier:
feature_film
drama
runtime
mpaa_rating_R
thtr_rel_year
oscar_season
summer_season
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box

Let’s prepare the dataset movies by selecting the columns specified for the regression.
movies_modeling <- movies %>% select(audience_score,feature_film,drama,runtime,mpaa_rating_R, thtr_rel_year,oscar_season,summer_season,imdb_rating,imdb_num_votes,critics_score, best_pic_nom,best_pic_win,best_actor_win,best_actress_win,best_dir_win,top200_box)
str(movies_modeling)

## tibble [651 × 17] (S3: tbl_df/tbl/data.frame)
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ feature_film : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
## $ drama : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating_R : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ oscar_season : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ summer_season : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
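Before fitting any models, it is worth checking for missing values, since incomplete rows will be dropped (via na.omit) in the steps that follow. A quick check:

# Count missing values in each modeling column; rows containing NAs are
# removed later with na.omit() before model fitting
colSums(is.na(movies_modeling))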
We first fit the full linear regression model with all of the candidate predictors:
model_full <- lm(audience_score ~ ., data = movies_modeling)
tidy(model_full)

## # A tibble: 17 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 124. 77.5 1.61 1.09e- 1
## 2 feature_filmyes -2.25 1.69 -1.33 1.83e- 1
## 3 dramayes 1.29 0.877 1.47 1.41e- 1
## 4 runtime -0.0561 0.0242 -2.32 2.04e- 2
## 5 mpaa_rating_Ryes -1.44 0.813 -1.78 7.60e- 2
## 6 thtr_rel_year -0.0766 0.0383 -2.00 4.63e- 2
## 7 oscar_seasonyes -0.533 0.997 -0.535 5.93e- 1
## 8 summer_seasonyes 0.911 0.949 0.959 3.38e- 1
## 9 imdb_rating 14.7 0.607 24.3 2.03e-92
## 10 imdb_num_votes 0.00000723 0.00000452 1.60 1.10e- 1
## 11 critics_score 0.0575 0.0222 2.59 9.73e- 3
## 12 best_pic_nomyes 5.32 2.63 2.02 4.33e- 2
## 13 best_pic_winyes -3.21 4.61 -0.697 4.86e- 1
## 14 best_actor_winyes -1.54 1.18 -1.31 1.91e- 1
## 15 best_actress_winyes -2.20 1.30 -1.69 9.23e- 2
## 16 best_dir_winyes -1.23 1.73 -0.713 4.76e- 1
## 17 top200_boxyes 0.848 2.78 0.305 7.61e- 1
confint(model_full)

## 2.5 % 97.5 %
## (Intercept) -2.775101e+01 2.765742e+02
## feature_filmyes -5.561752e+00 1.065404e+00
## dramayes -4.289065e-01 3.013785e+00
## runtime -1.035699e-01 -8.710668e-03
## mpaa_rating_Ryes -3.040338e+00 1.514030e-01
## thtr_rel_year -1.518695e-01 -1.266551e-03
## oscar_seasonyes -2.490505e+00 1.423948e+00
## summer_seasonyes -9.534879e-01 2.774785e+00
## imdb_rating 1.352584e+01 1.590862e+01
## imdb_num_votes -1.646766e-06 1.611569e-05
## critics_score 1.395487e-02 1.010066e-01
## best_pic_nomyes 1.606707e-01 1.048041e+01
## best_pic_winyes -1.226379e+01 5.840535e+00
## best_actor_winyes -3.859279e+00 7.706685e-01
## best_actress_winyes -4.758389e+00 3.620654e-01
## best_dir_winyes -4.624714e+00 2.161716e+00
## top200_boxyes -4.615603e+00 6.311263e+00
As we can see from a quick summary of the full linear model, many of the coefficients are not statistically significant. We will use the Bayesian Information Criterion (BIC) for model selection. BIC rewards model fit while penalizing the number of parameters in proportion to the logarithm of the sample size. We can calculate the BIC of the full linear model with the command below:
BIC(model_full)

## [1] 4934.145
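For reference, for a model with \(p\) estimated parameters fit to \(n\) observations with maximized likelihood \(\hat{L}\), the criterion is \(\mathrm{BIC} = -2\ln\hat{L} + p\ln(n)\); lower values indicate a better balance of fit and complexity.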
We can compare the BIC value of the full model with that of a reduced model. To do this, we use the stepAIC function from the MASS package, which works backwards through the model space, removing variables until the information criterion can no longer be lowered. It takes the full model as input along with a penalty parameter \(k\): the default \(k = 2\) corresponds to AIC (used in the run below), while \(k = \log(n)\), where \(n\) is the number of observations, corresponds to BIC (see the BIC-penalty variant after the output).
model_full <- lm(audience_score ~ ., data = na.omit(movies_modeling))
model_full_step <- stepAIC(model_full, trace = FALSE, direction = "backward")
model_full_step$anova

## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 633 62989.66 3006.940
## 2 - top200_box 1 9.240822 634 62998.90 3005.035
## 3 - oscar_season 1 26.461061 635 63025.36 3003.308
## 4 - best_pic_win 1 45.744109 636 63071.11 3001.780
## 5 - best_dir_win 1 93.998979 637 63165.11 3000.748
## 6 - summer_season 1 166.872376 638 63331.98 3000.463
## 7 - feature_film 1 155.730073 639 63487.71 3000.059
## 8 - drama 1 121.355602 640 63609.06 2999.300
## 9 - imdb_num_votes 1 147.829088 641 63756.89 2998.809
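As noted above, the run shown uses stepAIC's default penalty \(k = 2\) (AIC). A sketch of the BIC-penalty variant, which would typically retain fewer variables (its output is not shown here):

# Backward selection with the BIC penalty k = log(n)
n <- nrow(na.omit(movies_modeling))
model_full_step_bic <- stepAIC(model_full, trace = FALSE,
                               direction = "backward", k = log(n))
model_full_step_bic$anova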
Often, several models are equally plausible, and choosing only one ignores the inherent uncertainty involved in choosing the variables to include in the model. A way around this problem is Bayesian model averaging (BMA), in which multiple models are averaged to obtain posterior distributions of coefficients and predictions for new data.
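Formally, for a quantity of interest \(\Delta\) (a coefficient or a future observation), BMA averages over the candidate models \(M_1, \ldots, M_K\) weighted by their posterior probabilities:

\[ p(\Delta \mid \text{data}) = \sum_{k=1}^{K} p(\Delta \mid M_k, \text{data}) \, p(M_k \mid \text{data}) \]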
# Exclude observations with missing values in the data set
movies_modeling_no_na <- na.omit(movies_modeling)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_audience_score <- bas.lm(audience_score ~ ., data = movies_modeling_no_na,
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_audience_score

##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_modeling_no_na,
## prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06537 0.04320
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.46971 0.19984 0.09069
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.07506 0.08042 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.05774 0.88855 0.13119
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.03985 0.14435 0.14128
## best_dir_winyes top200_boxyes
## 0.06694 0.04762
# Top 5 most probable models
summary(bma_audience_score)

## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06536947 0.0000 0.0000000 0.0000000
## dramayes 0.04319833 0.0000 0.0000000 0.0000000
## runtime 0.46971477 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.19984016 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.09068970 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07505684 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08042023 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05773502 0.0000 0.0000000 0.0000000
## critics_score 0.88855056 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13119140 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03984766 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14434896 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.14128087 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06693898 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04762234 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9968489 0.2543185
## PostProbs NA 0.1297 0.1293000 0.0330000
## R2 NA 0.7549 0.7525000 0.7539000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3615.2791 -3615.2822108 -3616.6482224
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2521327 0.2391994
## PostProbs 0.0327000 0.0310000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg -3616.6568544 -3616.7095127
Printing the model object and calling summary() give us both the marginal posterior inclusion probability for each variable and the most probable models. For example, the posterior probability that critics_score is included in the model is 0.89. Furthermore, the most likely model (Model 1), which has a posterior probability of 0.1297, includes an intercept and the variables runtime, imdb_rating, and critics_score.
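We can also visualize which predictors appear in the top-ranked models; a sketch using the image method that the BAS package provides for bas.lm objects (default shading settings assumed):

# Model-space plot: rows are predictors, columns are the highest-probability
# models, with shading related to posterior model probability
image(bma_audience_score, rotate = FALSE)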
It is also possible to visualize the posterior distributions of the coefficients under the model averaging approach. We graph the posterior distributions of the coefficients of runtime and critics_score below. Note that the subset argument dictates which variables are plotted.
# Obtain the coefficients from the model `bma_audience_score`
coef_audience_score <- coefficients(bma_audience_score)
# `runtime` is the 4th variable, while `critics_score` is the 11th variable in the data set
plot(coef_audience_score, subset = c(4,11), ask = FALSE)

We can also provide 95% credible intervals for these coefficients:
confint(coef_audience_score)

## 2.5% 97.5% beta
## Intercept 6.158880e+01 6.314897e+01 6.234769e+01
## feature_filmyes -1.206636e+00 0.000000e+00 -1.046908e-01
## dramayes 0.000000e+00 0.000000e+00 1.604413e-02
## runtime -8.341541e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes -2.093717e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year -5.082473e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes -9.203152e-01 1.288938e-02 -8.034940e-02
## summer_seasonyes -1.924108e-02 1.039107e+00 8.704545e-02
## imdb_rating 1.368965e+01 1.658135e+01 1.498203e+01
## imdb_num_votes -4.059600e-09 1.505757e-06 2.080713e-07
## critics_score 0.000000e+00 1.061901e-01 6.296648e-02
## best_pic_nomyes 0.000000e+00 4.997640e+00 5.068035e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes -2.603830e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.934742e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes -1.438545e+00 0.000000e+00 -1.195011e-01
## top200_boxyes 0.000000e+00 0.000000e+00 8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
We will now select a movie that is not contained in the movies dataset and generate a prediction for this movie using the predict function in R.
Before making predictions for new data, however, we examine Bayesian predictive distributions on the existing data. BAS uses simulation to construct predictive intervals under Bayesian model averaging, whereas exact inference is often possible for predictive intervals under model selection.
Returning to the movies data set, let us find the predicted values under the Best Predictive Model (BPM), the model whose predictions are closest to those from BMA, along with the corresponding posterior standard deviations.
BPM_pred_audience_score <- predict(bma_audience_score, estimator = "BPM", se.fit = TRUE)
variable.names(BPM_pred_audience_score)

## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
In the code above, the function variable.names extracts the names of all of the predictors in the Best Predictive Model. The same function can be used to identify the variables in the Highest Probability Model (HPM)
HPM_pred_audience_score <- predict(bma_audience_score, estimator = "HPM")
variable.names(HPM_pred_audience_score)

## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
and the Median Probability Model (MPM)
MPM_pred_audience_score <- predict(bma_audience_score, estimator = "MPM")
variable.names(MPM_pred_audience_score)

## [1] "Intercept" "imdb_rating" "critics_score"
Let us examine which movie in the data set has the highest predicted audience_score under the BPM, and what its characteristics are.
# Find the index of observation with the largest fitted value
opt <- which.max(BPM_pred_audience_score$fit)
# Extract the row with this observation and glimpse at the row
movies_modeling_no_na %>%
slice(opt) %>%
glimpse()

## Rows: 1
## Columns: 17
## $ audience_score <dbl> 97
## $ feature_film <fct> yes
## $ drama <fct> no
## $ runtime <dbl> 202
## $ mpaa_rating_R <fct> yes
## $ thtr_rel_year <dbl> 1974
## $ oscar_season <fct> yes
## $ summer_season <fct> no
## $ imdb_rating <dbl> 9
## $ imdb_num_votes <int> 749783
## $ critics_score <dbl> 97
## $ best_pic_nom <fct> yes
## $ best_pic_win <fct> yes
## $ best_actor_win <fct> yes
## $ best_actress_win <fct> yes
## $ best_dir_win <fct> yes
## $ top200_box <fct> no
A 95% credible interval for predicting audience_score can be obtained by
ci_audience_score <- confint(BPM_pred_audience_score, parm = "pred")
ci_audience_score[opt,]

## 2.5% 97.5% pred
## 77.42096 117.65506 97.53801
If we were to use BMA, the interval would be
BMA_pred_audience_score <- predict(bma_audience_score, estimator = "BMA", se.fit = TRUE)
ci_bma_audience_score <- confint(BMA_pred_audience_score, estimator = "BMA")
opt_bma <- which.max(BMA_pred_audience_score$fit)
ci_bma_audience_score[opt_bma, ]

## 2.5% 97.5% pred
## 80.02126 121.00041 99.78468
Now, the movie we will consider for our prediction is John Wick (2014), which has an observed audience_score of 81 and is not included in the original movies dataset. Ratings data for John Wick were gathered from IMDb and Rotten Tomatoes.
john_wick <- read.csv("John Wick ratings.csv")
john_wick

## title title_type genre runtime mpaa_rating
## 1 John Wick Feature Film Action & Adventure 101 Not Family-Friendly
## studio thtr_rel_year thtr_rel_month thtr_rel_day
## 1 LionsGate Entertainment 2014 10 24
## thtr_rel_season dvd_rel_year dvd_rel_month dvd_rel_day imdb_rating
## 1 Other 2016 6 7 7.4
## imdb_num_votes critics_rating critics_score audience_rating audience_score
## 1 553527 Certified Fresh 87 Upright 81
## audience_count
## 1 83615
Let’s first fit a linear regression model based on both the Best Predictive Model (BPM) and the Highest Probability Model (HPM), that is, using the variables runtime, imdb_rating, and critics_score:
mod1 <- lm(audience_score ~ runtime+imdb_rating+critics_score, data=movies_modeling)
summary(mod1)

##
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score,
## data = movies_modeling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.998 -6.565 0.557 5.475 52.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.28321 3.21939 -10.338 < 2e-16 ***
## runtime -0.05362 0.02107 -2.545 0.01117 *
## imdb_rating 14.98076 0.57735 25.947 < 2e-16 ***
## critics_score 0.07036 0.02156 3.263 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.04 on 646 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7549, Adjusted R-squared: 0.7538
## F-statistic: 663.3 on 3 and 646 DF, p-value: < 2.2e-16
Let’s get a prediction of the audience_score based on the John Wick data:
pred <- predict(mod1,newdata=john_wick,interval="prediction",type="response")
pred

## fit lwr upr
## 1 78.28034 58.52952 98.03116
The 95% prediction interval associated with the predicted audience_score of 78.28 is (58.53, 98.03). This means that, according to mod1, 95% of movies with John Wick's values of runtime, imdb_rating, and critics_score are expected to have an actual audience_score between 58.53 and 98.03.
pred <- predict(mod1,newdata=john_wick,interval="confidence",type="response")
pred

## fit lwr upr
## 1 78.28034 77.11133 79.44936
The confidence interval reflects uncertainty about the mean prediction. The 95% confidence interval associated with the predicted audience_score of 78.28 is (77.11, 79.45). This means that, according to mod1, the average audience_score for movies with these characteristics is estimated to lie within (77.11, 79.45).
We calculate the percent error knowing \(y_{obs}=81\) from the data:
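A minimal sketch of that calculation, using the fitted value of 78.28 from the prediction above:

# Percent error of the point prediction relative to the observed score of 81
# (roughly 3.4% given the fitted value of 78.28)
percent_error <- abs(pred[, "fit"] - 81) / 81 * 100
percent_error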
Not bad!! Now, let’s look at a linear model based on the Median Probability Model (MPM), namely using only the variables imdb_rating and critics_score.
mod2 <- lm(audience_score ~ imdb_rating+critics_score, data=movies_modeling)
summary(mod2)

##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score, data = movies_modeling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.668 -6.758 0.723 5.513 52.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.03195 2.86401 -12.930 < 2e-16 ***
## imdb_rating 14.65760 0.56590 25.901 < 2e-16 ***
## critics_score 0.07318 0.02161 3.386 0.000753 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.08 on 648 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.7516
## F-statistic: 984.4 on 2 and 648 DF, p-value: < 2.2e-16
Let’s get a prediction of the audience_score based on the John Wick data:
pred <- predict(mod2,newdata=john_wick,interval="prediction",type="response")
pred

## fit lwr upr
## 1 77.8006 57.9782 97.623
And the confidence interval of the prediction…
pred <- predict(mod2,newdata=john_wick,interval="confidence",type="response")
pred

## fit lwr upr
## 1 77.8006 76.6841 78.9171
As we can see, the predictions from mod1 and mod2 are very close, and both prediction intervals contain the observed audience_score of 81.
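For completeness, the Bayesian model itself could be used to predict John Wick's audience_score. The sketch below is hypothetical: it reconstructs the derived factors for the new observation and assumes "no" for the award and box-office indicators, which are not in the CSV (note also that the CSV's mpaa_rating field is not an MPAA code, so mpaa_rating_R comes out "no" here):

# Hypothetical: predict John Wick's audience_score from the BMA fit.
# Derived factors mirror the earlier mutate(); award/box-office indicators
# are assumed to be "no" for this movie.
yn <- function(x) factor(x, levels = c("no", "yes"))
john_wick_bas <- john_wick %>%
  mutate(feature_film  = yn(ifelse(title_type == "Feature Film", "yes", "no")),
         drama         = yn(ifelse(genre == "Drama", "yes", "no")),
         mpaa_rating_R = yn(ifelse(mpaa_rating == "R", "yes", "no")),
         oscar_season  = yn(ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no")),
         summer_season = yn(ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no")),
         best_pic_nom = yn("no"), best_pic_win = yn("no"),
         best_actor_win = yn("no"), best_actress_win = yn("no"),
         best_dir_win = yn("no"), top200_box = yn("no"))
bma_jw <- predict(bma_audience_score, newdata = john_wick_bas,
                  estimator = "BMA", se.fit = TRUE)
confint(bma_jw, parm = "pred")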
In this analysis, we fit a linear model relating selected movie features to audience_score. As part of our exploratory work, we checked that the distribution of audience_score, overall and within the groups defined by the new indicator variables, was roughly normal. We then used the Bayesian Information Criterion (BIC) and Bayesian Model Averaging (BMA) to guide model selection. With a larger dataset and additional feature engineering, more accurate predictions might be achievable. Future work might include adding non-linear terms to the model, though we would need to exercise caution not to overfit the data.