Bayesian modeling and prediction for movies

The task of this project is to predict audience scores audience_score based on explanatory variables available in the supplied data set using a Bayesian model selection technique.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

## Warning: package 'BAS' was built under R version 3.4.3

library(ggplot2)
library(MASS)

Load data

load("movies.Rdata")

Part 1: Data

The movies data set is comprised of 651 randomly sampled movies produced and released before 2016. Because it is a random sample of movies, the results are generalizable to the target population of all movies produced and released before 2016. Variables relate to how much audiences and critics like the movies, and about the movies themselves. The data set are observations, so movie attributes are not randomly assigned and no causal inferences can be made.

The response variable of interest is audience_score (Audience score on Rotten Tomatoes). The potential explanatory variables are:

feature_film: “yes” if title_type (type of movie (Documentary, Feature Film, TV Movie)) is Feature Film, “no” otherwise
drama: “yes” if genre (Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)) is Drama, “no” otherwise
runtime: Runtime of movie (in minutes)
mpaa_rating_R: “yes” if mpaa_rating (G, PG, PG-13, R, Unrated) is R, “no” otherwise
thtr_rel_year: Year the movie is released in theaters
oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_score: Critics score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the actresses won an Oscar for their role in the given movie
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)

Part 2: Data manipulation

Several of the potential explanatory variables must be generated from other variables. Create a new data set of just the response variable and 16 potential explanatory variables.

movies <- movies %>% mutate(feature_film = factor(ifelse(title_type=="Feature Film", "yes", "no")))
movies <- movies %>% mutate(drama = factor(ifelse(genre=="Drama", "yes", "no")))
movies <- movies %>% mutate(mpaa_rating_R = factor(ifelse(mpaa_rating=="R", "yes", "no")))
movies <- movies %>% mutate(oscar_season = factor(ifelse(thtr_rel_month %in% c(10,11,12), "yes", "no")))
movies <- movies %>% mutate(summer_season = factor(ifelse(thtr_rel_month %in% c(5,6,7,8), "yes", "no")))

movies2 <- subset(movies, select=c("audience_score",
                                   "feature_film", "drama", "runtime", 
                                   "mpaa_rating_R", "thtr_rel_year", 
                                   "oscar_season", "summer_season", 
                                   "imdb_rating", "imdb_num_votes", 
                                   "critics_score", "best_pic_nom", 
                                   "best_pic_win", "best_actor_win", 
                                   "best_actress_win", "best_dir_win", 
                                   "top200_box"))

Part 3: Exploratory data analysis

Below is a high-level exploratory analysis of the response variable and the 16 potential explanatory variables. Most explantory variables appear to be predictors of audience_score. A few that do not include mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, best_actor_win, and best_actress_win.

Start with a univariate data analysis of the response variable.

Audience scores range from 11 to 97 with mean 62.36. The interquartile range is [46..80]. The distribution is skewed left.

summary(movies2$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

ggplot(movies2, aes(x=audience_score)) +
  geom_histogram() +
  ggtitle("Histogram of Audience Scores")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now perform a bivariate data analysis of the response variable and sixteen potential explanatory variables.

Feature Films

Feature films tend to have lower audience scores than non-feature films. An ANOVA indicates the relationship is significant (F=61.71, p<.0001).

ggplot(movies2, aes(x=feature_film, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Feature Status")

summary(aov(audience_score ~ feature_film, data = movies2))

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## feature_film   1  23081   23081   61.71 1.66e-14 ***
## Residuals    649 242740     374                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dramas

Dramas appear to recieve higher audience scores than non-dramas. An ANOVA indicates the relationship is significant (F=12.73, p=.0004).

ggplot(movies2, aes(x=drama, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Genre")

summary(aov(audience_score ~ drama, data = movies2))

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## drama         1   5113    5113   12.73 0.000387 ***
## Residuals   649 260707     402                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

runtime

There appears to be a positive relationship between audience scores and move runtime. The Pearson correlation coefficient indicates a weak, but significant positive linear relationship (r=.1810, p<.0001).

ggplot(movies2, aes(x=runtime, y=audience_score)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle ("Scatterplot of Audience Scores vs Runtime")

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

cor.test(movies2$audience_score, movies2$runtime, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  movies2$audience_score and movies2$runtime
## t = 4.6839, df = 648, p-value = 3.431e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1055295 0.2543256
## sample estimates:
##       cor 
## 0.1809629

mpaa_rating_R

R-Rated films do not appear to fair any better or worse than other films. An ANOVA indicates the relationship is not significant (F=0.166, p=.684).

ggplot(movies2, aes(x=mpaa_rating_R, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Score vs. Movie Rating")

summary(aov(audience_score ~ mpaa_rating_R, data = movies2))

##                Df Sum Sq Mean Sq F value Pr(>F)
## mpaa_rating_R   1     68    68.1   0.166  0.684
## Residuals     649 265752   409.5

thtr_rel_year

Audience scores appear to be unrelated to movie runtime. The Pearson correlation coefficient indicates a very weak and insignificant negative linear relationship (r=-.054, p=.1682).

ggplot(movies2, aes(x=thtr_rel_year, y=audience_score)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle ("Scatterplot of Audience Scores vs Release Year")

cor.test(movies2$audience_score, movies2$thtr_rel_year, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  movies2$audience_score and movies2$thtr_rel_year
## t = -1.3797, df = 649, p-value = 0.1682
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.13037988  0.02285905
## sample estimates:
##         cor 
## -0.05407881

oscar_season

Audience scores do not appear to change if the film is released during Oscar season (October - December). An ANOVA indicates the relationship is insignificant (F=1.158, p=.282).

ggplot(movies2, aes(x=oscar_season, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Oscar Season")

summary(aov(audience_score ~ oscar_season, data = movies2))

##               Df Sum Sq Mean Sq F value Pr(>F)
## oscar_season   1    473   473.4   1.158  0.282
## Residuals    649 265347   408.9

summer_season

Audience scores do not appear to change if the film is released during the summer season (May - July). An ANOVA indicates the relationship is insignificant (F=.23, p=.632).

ggplot(movies2, aes(x=summer_season, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Summer Season")

summary(aov(audience_score ~ summer_season, data = movies2))

##                Df Sum Sq Mean Sq F value Pr(>F)
## summer_season   1     94    94.1    0.23  0.632
## Residuals     649 265726   409.4

imdb_rating

Audience scores are highly correlated with the IMDB rating. The Pearson correlation coefficient indicates a very strong and significant positive linear relationship (r=.865, p<.0001).

ggplot(movies2, aes(x=imdb_rating, y=audience_score)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle ("Scatterplot of Audience Scores vs IMDB Rating")

cor.test(movies2$audience_score, movies2$imdb_rating, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  movies2$audience_score and movies2$imdb_rating
## t = 43.89, df = 649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8441215 0.8830234
## sample estimates:
##       cor 
## 0.8648652

imdb_num_votes

Audience scores are also correlated with the number of IMDB votes. The Pearson correlation coefficient indicates a weak, but significant positive linear relationship (r=.290, p<.0001).

ggplot(movies2, aes(x=imdb_num_votes, y=audience_score)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle ("Scatterplot of Audience Scores vs IMDB Votes")

cor.test(movies2$audience_score, movies2$imdb_num_votes, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  movies2$audience_score and movies2$imdb_num_votes
## t = 7.7142, df = 649, p-value = 4.602e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2178208 0.3586681
## sample estimates:
##       cor 
## 0.2898128

critics_score

Audience scores are correlated with the critics scores. The Pearson correlation coefficient indicates a strong and significant positive linear relationship (r=.704, p<.0001).

ggplot(movies2, aes(x=critics_score, y=audience_score)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle ("Scatterplot of Audience Scores vs Critics Scores")

cor.test(movies2$audience_score, movies2$critics_score, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  movies2$audience_score and movies2$critics_score
## t = 25.273, df = 649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6633319 0.7410163
## sample estimates:
##       cor 
## 0.7042762

best_pic_nom

Best picture nominees receive higher audience scores. An ANOVA indicates the relationship is significant (F=30.68, p<.0001).

ggplot(movies2, aes(x=best_pic_nom, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Best Picture Nomination")

summary(aov(audience_score ~ best_pic_nom, data = movies2))

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## best_pic_nom   1  11999   11999   30.68 4.43e-08 ***
## Residuals    649 253822     391                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

best_pic_win

Best picture winners receive higher audience scores too, but interestingly, the effect is not as great as with best picture nominees. An ANOVA indicates the relationship is significant (F=8.74, p<.0032).

ggplot(movies2, aes(x=best_pic_win, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Best Picture Winner")

summary(aov(audience_score ~ best_pic_win, data = movies2))

##               Df Sum Sq Mean Sq F value  Pr(>F)   
## best_pic_win   1   3535    3535   8.748 0.00321 **
## Residuals    649 262285     404                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

best_actor_win

Surprisingly, films featuring a best actor winner tend to have lower audience scores. However, an ANOVA indicates the relationship is not significant (F=.213, p=.645).

ggplot(movies2, aes(x=feature_film, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Best Actor Winner")

summary(aov(audience_score ~ best_actor_win, data = movies2))

##                 Df Sum Sq Mean Sq F value Pr(>F)
## best_actor_win   1     87    87.0   0.213  0.645
## Residuals      649 265733   409.5

best_actress_win

Films featuring a best actress winner appear to have slightly higher audience scores. However, an ANOVA indicates the relationship is not significant (F=.452, p=.501).

ggplot(movies2, aes(x=best_actress_win, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Best Actress Winner")

summary(aov(audience_score ~ best_actress_win, data = movies2))

##                   Df Sum Sq Mean Sq F value Pr(>F)
## best_actress_win   1    185   185.2   0.452  0.501
## Residuals        649 265635   409.3

best_dir_win

Films by a Best Director winner tend to recieve higher audience scores. An ANOVA indicates the relationship is significant (F=5.797, p=.0163).

ggplot(movies2, aes(x=best_dir_win, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Best Director Winner")

summary(aov(audience_score ~ best_dir_win, data = movies2))

##               Df Sum Sq Mean Sq F value Pr(>F)  
## best_dir_win   1   2353    2353   5.797 0.0163 *
## Residuals    649 263467     406                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

top200_box:

Films in the top 200 box office list have higher audience scores. An ANOVA indicates the relationship is significant (F=5.601, p=.0182).

ggplot(movies2, aes(x=top200_box, y=audience_score)) +
  geom_boxplot() +
  ggtitle("Boxplot of Audience Scores by Top 200 Box Office")

summary(aov(audience_score ~ top200_box, data = movies2))

##              Df Sum Sq Mean Sq F value Pr(>F)  
## top200_box    1   2274  2274.3   5.601 0.0182 *
## Residuals   649 263546   406.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part 4: Modeling

We can include all relevant covariates in a regression model in an attempt to explain as much audience score variation as possible. However, running this full model has a cost because it removes observations where the value from any explanatory variable is missing. Our inferences will only be valid if there is no systematic reason behind missing values. Without information regarding missing values, assume the missing values occur randomly.

m_audience_score_full = lm(audience_score ~ ., data = movies2)
summary(m_audience_score_full)

## 
## Call:
## lm(formula = audience_score ~ ., data = movies2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.594  -6.156   0.157   5.909  53.125 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.244e+02  7.749e+01   1.606  0.10886    
## feature_filmyes     -2.248e+00  1.687e+00  -1.332  0.18323    
## dramayes             1.292e+00  8.766e-01   1.474  0.14087    
## runtime             -5.614e-02  2.415e-02  -2.324  0.02042 *  
## mpaa_rating_Ryes    -1.444e+00  8.127e-01  -1.777  0.07598 .  
## thtr_rel_year       -7.657e-02  3.835e-02  -1.997  0.04628 *  
## oscar_seasonyes     -5.333e-01  9.967e-01  -0.535  0.59280    
## summer_seasonyes     9.106e-01  9.493e-01   0.959  0.33778    
## imdb_rating          1.472e+01  6.067e-01  24.258  < 2e-16 ***
## imdb_num_votes       7.234e-06  4.523e-06   1.600  0.11019    
## critics_score        5.748e-02  2.217e-02   2.593  0.00973 ** 
## best_pic_nomyes      5.321e+00  2.628e+00   2.025  0.04330 *  
## best_pic_winyes     -3.212e+00  4.610e+00  -0.697  0.48624    
## best_actor_winyes   -1.544e+00  1.179e+00  -1.310  0.19068    
## best_actress_winyes -2.198e+00  1.304e+00  -1.686  0.09229 .  
## best_dir_winyes     -1.231e+00  1.728e+00  -0.713  0.47630    
## top200_boxyes        8.478e-01  2.782e+00   0.305  0.76067    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.975 on 633 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.763,  Adjusted R-squared:  0.757 
## F-statistic: 127.3 on 16 and 633 DF,  p-value: < 2.2e-16

From the full linear model summary, many coefficients of independent variables are not statistically significant, including feature_film, drama, mpaa_rating_R, oscar_season, summer_season, imdb_num_votes, best_pic_win, best_actor_win, best_actess_win, best_dir_win, and top200_box. Bayesian Information Criterion (BIC) is the Bayesian version of model selection based on Adjusted \(R^2.\) BIC is based on model fit, while simultaneously penalizing the number of parameters in proportion to the sample size. Other model selection criteria exist, but we are seeking predictive power, not necesarily model parsimony, so BIC is a good choice.

R function stepAIC works backwards through the model space, removing variables until BIC can be no longer be lowered. Its inputs are the full model and a penalty parameter \(k = \log(n)\).

m_audience_score_full = lm(audience_score ~ ., data = na.omit(movies2))
m_audience_score_bic = stepAIC(m_audience_score_full, k = log(nrow(na.omit(movies2))))

## Start:  AIC=3083.05
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         9  62999 3076.7
## - oscar_season      1        28  63018 3076.9
## - best_pic_win      1        48  63038 3077.1
## - best_dir_win      1        51  63040 3077.1
## - summer_season     1        92  63081 3077.5
## - best_actor_win    1       171  63160 3078.3
## - feature_film      1       177  63166 3078.4
## - drama             1       216  63206 3078.8
## - imdb_num_votes    1       255  63244 3079.2
## - best_actress_win  1       283  63273 3079.5
## - mpaa_rating_R     1       314  63304 3079.8
## - thtr_rel_year     1       397  63386 3080.7
## - best_pic_nom      1       408  63398 3080.8
## - runtime           1       538  63527 3082.1
## <none>                           62990 3083.0
## - critics_score     1       669  63659 3083.4
## - imdb_rating       1     58556 121545 3503.8
## 
## Step:  AIC=3076.67
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        26  63025 3070.5
## - best_pic_win      1        49  63047 3070.7
## - best_dir_win      1        52  63051 3070.7
## - summer_season     1        94  63093 3071.2
## - best_actor_win    1       169  63168 3071.9
## - feature_film      1       176  63175 3072.0
## - drama             1       214  63213 3072.4
## - best_actress_win  1       279  63278 3073.1
## - imdb_num_votes    1       302  63301 3073.3
## - mpaa_rating_R     1       330  63329 3073.6
## - best_pic_nom      1       404  63403 3074.3
## - thtr_rel_year     1       415  63414 3074.5
## - runtime           1       535  63534 3075.7
## <none>                           62999 3076.7
## - critics_score     1       681  63680 3077.2
## - imdb_rating       1     58606 121604 3497.7
## 
## Step:  AIC=3070.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        46  63071 3064.5
## - best_dir_win      1        56  63081 3064.6
## - best_actor_win    1       174  63200 3065.8
## - summer_season     1       177  63202 3065.8
## - feature_film      1       182  63207 3065.9
## - drama             1       222  63247 3066.3
## - best_actress_win  1       281  63307 3066.9
## - imdb_num_votes    1       302  63328 3067.1
## - mpaa_rating_R     1       329  63354 3067.4
## - best_pic_nom      1       387  63412 3068.0
## - thtr_rel_year     1       410  63436 3068.2
## - runtime           1       587  63613 3070.0
## <none>                           63025 3070.5
## - critics_score     1       679  63704 3071.0
## - imdb_rating       1     58603 121628 3491.3
## 
## Step:  AIC=3064.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        94  63165 3058.9
## - best_actor_win    1       163  63234 3059.7
## - feature_film      1       171  63242 3059.7
## - summer_season     1       174  63245 3059.8
## - drama             1       220  63291 3060.2
## - imdb_num_votes    1       271  63342 3060.8
## - best_actress_win  1       294  63365 3061.0
## - mpaa_rating_R     1       330  63401 3061.4
## - best_pic_nom      1       342  63414 3061.5
## - thtr_rel_year     1       397  63468 3062.1
## - runtime           1       586  63657 3064.0
## <none>                           63071 3064.5
## - critics_score     1       680  63751 3065.0
## - imdb_rating       1     58858 121929 3486.4
## 
## Step:  AIC=3058.95
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       167  63332 3054.2
## - best_actor_win    1       171  63336 3054.2
## - feature_film      1       183  63348 3054.4
## - drama             1       228  63394 3054.8
## - imdb_num_votes    1       247  63412 3055.0
## - best_actress_win  1       299  63464 3055.5
## - best_pic_nom      1       326  63491 3055.8
## - mpaa_rating_R     1       345  63510 3056.0
## - thtr_rel_year     1       368  63533 3056.2
## <none>                           63165 3058.9
## - critics_score     1       651  63816 3059.1
## - runtime           1       673  63839 3059.4
## - imdb_rating       1     58895 122061 3480.7
## 
## Step:  AIC=3054.19
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       156  63488 3049.3
## - best_actor_win    1       195  63527 3049.7
## - drama             1       204  63536 3049.8
## - imdb_num_votes    1       260  63592 3050.4
## - best_pic_nom      1       297  63629 3050.8
## - best_actress_win  1       297  63629 3050.8
## - mpaa_rating_R     1       356  63688 3051.4
## - thtr_rel_year     1       361  63693 3051.4
## <none>                           63332 3054.2
## - runtime           1       690  64022 3054.7
## - critics_score     1       732  64064 3055.2
## - imdb_rating       1     58763 122095 3474.4
## 
## Step:  AIC=3049.31
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       121  63609 3044.1
## - imdb_num_votes    1       173  63661 3044.6
## - best_actor_win    1       219  63706 3045.1
## - thtr_rel_year     1       277  63765 3045.7
## - best_pic_nom      1       291  63778 3045.8
## - best_actress_win  1       306  63794 3046.0
## - mpaa_rating_R     1       453  63941 3047.5
## <none>                           63488 3049.3
## - runtime           1       715  64203 3050.1
## - critics_score     1       875  64363 3051.7
## - imdb_rating       1     63189 126677 3491.8
## 
## Step:  AIC=3044.07
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       148  63757 3039.1
## - best_actor_win    1       209  63818 3039.7
## - thtr_rel_year     1       272  63881 3040.4
## - best_actress_win  1       274  63883 3040.4
## - best_pic_nom      1       307  63916 3040.7
## - mpaa_rating_R     1       391  64000 3041.6
## - runtime           1       631  64240 3044.0
## <none>                           63609 3044.1
## - critics_score     1       916  64525 3046.9
## - imdb_rating       1     63434 127043 3487.2
## 
## Step:  AIC=3039.1
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - thtr_rel_year     1       201  63958 3034.7
## - best_actor_win    1       219  63976 3034.9
## - best_actress_win  1       266  64023 3035.3
## - mpaa_rating_R     1       367  64124 3036.4
## - best_pic_nom      1       442  64199 3037.1
## - runtime           1       519  64276 3037.9
## <none>                           63757 3039.1
## - critics_score     1       879  64635 3041.5
## - imdb_rating       1     67356 131113 3501.3
## 
## Step:  AIC=3034.67
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actor_win    1       207  64165 3030.3
## - best_actress_win  1       261  64219 3030.8
## - mpaa_rating_R     1       373  64331 3032.0
## - best_pic_nom      1       447  64405 3032.7
## - runtime           1       468  64425 3032.9
## <none>                           63958 3034.7
## - critics_score     1       968  64926 3038.0
## - imdb_rating       1     67172 131129 3494.9
## 
## Step:  AIC=3030.29
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actress_win  1       296  64461 3026.8
## - mpaa_rating_R     1       366  64531 3027.5
## - best_pic_nom      1       396  64561 3027.8
## <none>                           64165 3030.3
## - runtime           1       643  64808 3030.3
## - critics_score     1       968  65133 3033.5
## - imdb_rating       1     67296 131461 3490.0
## 
## Step:  AIC=3026.81
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom
## 
##                 Df Sum of Sq    RSS    AIC
## - best_pic_nom   1       303  64765 3023.4
## - mpaa_rating_R  1       354  64815 3023.9
## <none>                        64461 3026.8
## - runtime        1       814  65275 3028.5
## - critics_score  1       957  65418 3029.9
## - imdb_rating    1     67424 131885 3485.6
## 
## Step:  AIC=3023.39
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## - mpaa_rating_R  1       361  65126 3020.5
## - runtime        1       638  65403 3023.3
## <none>                        64765 3023.4
## - critics_score  1      1027  65792 3027.1
## - imdb_rating    1     68173 132937 3484.3
## 
## Step:  AIC=3020.53
## audience_score ~ runtime + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## <none>                        65126 3020.5
## - runtime        1       653  65779 3020.5
## - critics_score  1      1073  66199 3024.7
## - imdb_rating    1     67874 133000 3478.2

The optimal model using BIC is:

audience_score = \(\beta_0 + \beta_1runtime + \beta_2imdb_rating + \beta_3critics_score\)

If several models are plausible,choosing one ignores the uncertainty involved in choosing the variables. Bayesian model averaging (BMA) averages models to obtain posteriors of coefficients and predictions from new data. Apply BMA to the data set.

movies2_no_na = na.omit(movies2)
bma_audience_score = bas.lm(audience_score ~ ., data = movies2_no_na,
                   prior = "BIC", 
                   modelprior = uniform())
bma_audience_score

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = movies2_no_na, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06537              0.04320  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.46971              0.19984              0.09069  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07506              0.08042              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.05774              0.88855              0.13119  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.03985              0.14435              0.14128  
##     best_dir_winyes        top200_boxyes  
##             0.06694              0.04762

summary(bma_audience_score)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127

The most likely model, which has posterior probability of 0.1297, includes variables runtime, imdb_rating, and critics_score. The posterior distribution of the parameter estimators is below.

par(mfrow = c(1,2))
coef_audience_score = coefficients(bma_audience_score)
plot(coef_audience_score, subset = c(4,9,11), ask=FALSE)

We can also provide 95% credible intervals for these coefficients:

confint(coef_audience_score)

##                              2.5%        97.5%          beta
## Intercept            6.158063e+01 6.311649e+01  6.234769e+01
## feature_filmyes     -1.179102e+00 0.000000e+00 -1.046908e-01
## dramayes             0.000000e+00 0.000000e+00  1.604413e-02
## runtime             -8.195619e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes    -2.114906e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year       -5.317024e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes     -1.074762e+00 0.000000e+00 -8.034940e-02
## summer_seasonyes    -2.261352e-03 9.272773e-01  8.704545e-02
## imdb_rating          1.362503e+01 1.650201e+01  1.498203e+01
## imdb_num_votes      -7.756278e-09 1.287327e-06  2.080713e-07
## critics_score        0.000000e+00 1.056672e-01  6.296648e-02
## best_pic_nomyes      0.000000e+00 4.893661e+00  5.068035e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes   -2.644979e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.804581e+00 1.524930e-03 -3.088382e-01
## best_dir_winyes     -1.309628e+00 5.959084e-02 -1.195011e-01
## top200_boxyes       -1.313368e-01 1.245677e-01  8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 5: Prediction

Let’s find predictive values under the best predictive model, the one that has predictions closest to BMA and corresponding posterior standard deviations.

BPM_pred_audience_score =  predict(bma_audience_score, estimator="BPM", se.fit=TRUE)
bma_audience_score$namesx[BPM_pred_audience_score$bestmodel+1]

## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

Let’s turn to see what characteristics lead to the highest audience scores with the BPM model.

opt = which.max(BPM_pred_audience_score$fit)
t(movies2_no_na[opt, ])

##                  [,1]    
## audience_score   "97"    
## feature_film     "yes"   
## drama            "no"    
## runtime          "202"   
## mpaa_rating_R    "yes"   
## thtr_rel_year    "1974"  
## oscar_season     "yes"   
## summer_season    "no"    
## imdb_rating      "9"     
## imdb_num_votes   "749783"
## critics_score    "97"    
## best_pic_nom     "yes"   
## best_pic_win     "yes"   
## best_actor_win   "yes"   
## best_actress_win "yes"   
## best_dir_win     "yes"   
## top200_box       "no"

Here is a 95% prediction interval for the audience score for a movie with covariates at the levels of the movie specified by opt.:

ci_audience_score = confint(BPM_pred_audience_score, parm="pred")
ci_audience_score[opt,]

##      2.5%     97.5%      pred 
##  77.42096 117.65506  97.53801

Part 6: Conclusion

The optimal model was: audience_score = \(\beta_0 + \beta_1runtime + \beta_2imdb_rating + \beta_3critics_score\).

The optimal model emerged both from the BIC method, and from Bayesian model averaging.