Setup

Load packages

library(ggplot2)
library(grid)
library(gridExtra)
library(dplyr)
library(statsr)
library(BAS)
library(psych)
## Warning: package 'psych' was built under R version 3.6.2
library(MASS)
library(tidyverse)
library(broom)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")

Part 1: Data

This data is from imdb.com, rottentomatoes.com, and flixster.com. There are 651 randomly sampled movies produced and released before 2016.

Because the data is randomly sampled from existing datasets, and not derived from experiments, therefore we can infer correlations but not causation. Due to the random sampling, we can assume our inferences about the data are generalizable to other movies outside of the dataset that were also produced and released before 2016.

There are several potential sources of bias to keep in mind when analyzing this data set, namely that the movies are limited to English language and American-produced movies. We cannot assume that we can extrapolate our findings to other languages and countries.


Part 2: Data manipulation

Create the following categories to use as predictors:

feature_film: “yes” if title_type is Feature Film, “no” otherwise

drama: “yes” if genre is Drama, “no” otherwise

mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise

oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise

summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise

movies$feature_film <- ifelse(movies$title_type == 'Feature Film', 'Yes', 'No')
movies$drama <- ifelse(movies$genre == 'Drama', 'Yes', 'No')
movies$mpaa_rating_R <- ifelse(movies$mpaa_rating == 'R', 'Yes', 'No')
movies$oscar_season <- ifelse(movies$thtr_rel_month %in% seq(11, 13), 'Yes', 'No')
movies$summer_season <- ifelse(movies$thtr_rel_month %in% seq(5,8), 'Yes', 'No')

We are also going to drop the columns ‘title’ and ‘studio’ because there are too many distinct


Part 3: Exploratory data analysis

We first to take a look at which of our new variables might be most predictive of a movie’s audience score. The below boxplots visually show the distribution, quartiles and median values for the new variables:

plot1 <- ggplot(movies, aes(x = feature_film,
                            y = audience_score,
                            fill = feature_film)) + geom_boxplot(color = 'blue', alpha = 0.5) +
  theme_bw() + geom_smooth() +
  labs(x = "Feature Film", y = "Audience Score",
       fill = "feature_film") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')

plot2 <- ggplot(movies, aes(x = drama,
                            y = audience_score,
                            fill = drama)) + geom_boxplot(color = 'blue', alpha = 0.5) +
  theme_bw() + geom_smooth() +
  labs(x = "Drama", y = "Audience Score",
       fill = "drama") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')

plot3 <- ggplot(movies, aes(x = mpaa_rating_R,
                            y = audience_score,
                            fill = mpaa_rating_R)) + geom_boxplot(color = 'blue', alpha = 0.5) +
  theme_bw() + geom_smooth() +
  labs(x = "Rated R", y = "Audience Score",
       fill = "mpaa_rating_R") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')

plot4 <- ggplot(movies, aes(x = oscar_season,
                            y = audience_score,
                            fill = oscar_season)) + geom_boxplot(color = 'blue', alpha = 0.5) +
  theme_bw() + geom_smooth() +
  labs(x = "Oscar Season", y = "Audience Score",
       fill = "oscar_season") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')

plot5 <- ggplot(movies, aes(x = summer_season,
                            y = audience_score,
                            fill = summer_season)) + geom_boxplot(color = 'blue', alpha = 0.5) +
  theme_bw() + geom_smooth() +
  labs(x = "Summer Season", y = "Audience Score",
       fill = "summer_season") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')

grid.arrange(plot1, plot2,
             plot3, plot4,
             plot5,
             nrow=3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Whether or not the movie is a feature film seems to be a very promising predictor of audience score. The median audience score for feature films is 62 while non-feature films is 86. Next we want to look at some of the summary stats for audience score, and what number of movies fall into each of the categories.

hist(movies$audience_score, main="Distribution of Audience Score", xlab="audience score")

describe(movies$audience_score)
##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 651 62.36 20.22     65   63.39 23.72  11  97    86 -0.36    -0.91
##      se
## X1 0.79

The sample size of movies is 651 and median audience score is 65. The distribution has a left skew, and the mean is slightly lower than the median. We identified feature film as a potentially interesting predictor, so let’s see how many movies are feature films vs. not:

ggplot(data=movies) + geom_bar(mapping=aes(x=feature_film, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("Proportion of Feature Films")

The vast majority of all movies in the datasets are feature films, with only ~11% not being feature films. While feature_film may be helpful for prediction, it may not help us predict the variation in scores between non feature films, so we will need to use other predictors to explain these differences in movie scores - perhaps ‘drama’.


Part 4: Modeling

First, we’ll create the full model with all variables in the dataset. Then we’ll use AIC to begin eliminating variables from the model.

drop_cols <- c("title","genre", "studio", "thtr_rel_month",
           "thtr_rel_day", "dvd_rel_year", "dvd_rel_month",
           "dvd_rel_day", "critics_rating", "audience_rating",
           "director", "actor1", "actor2", "actor3", "actor4",
           "actor5", "imdb_url", "rt_url", "title_type", "mpaa_rating")
movies_reduced = movies[ , !(names(movies) %in% drop_cols)]

names(movies_reduced)
##  [1] "runtime"          "thtr_rel_year"    "imdb_rating"     
##  [4] "imdb_num_votes"   "critics_score"    "audience_score"  
##  [7] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [10] "best_actress_win" "best_dir_win"     "top200_box"      
## [13] "feature_film"     "drama"            "mpaa_rating_R"   
## [16] "oscar_season"     "summer_season"

We will also exclude rows with nulls to avoid non response bias.

movies_reduced = na.omit(movies_reduced)

Next, we’ll create the full model with all variables in the dataset. Then we’ll use AIC to begin eliminating variables from the model.

m_full <- lm(audience_score ~ ., data = movies_reduced)
tidy(m_full)
## # A tibble: 17 x 5
##    term                    estimate   std.error statistic  p.value
##    <chr>                      <dbl>       <dbl>     <dbl>    <dbl>
##  1 (Intercept)         123.         77.6            1.58  1.15e- 1
##  2 runtime              -0.0596      0.0242        -2.46  1.43e- 2
##  3 thtr_rel_year        -0.0756      0.0384        -1.97  4.94e- 2
##  4 imdb_rating          14.7         0.607         24.3   2.19e-92
##  5 imdb_num_votes        0.00000729  0.00000452     1.61  1.08e- 1
##  6 critics_score         0.0576      0.0222         2.60  9.61e- 3
##  7 best_pic_nomyes       5.07        2.63           1.93  5.41e- 2
##  8 best_pic_winyes      -3.12        4.61          -0.676 4.99e- 1
##  9 best_actor_winyes    -1.56        1.18          -1.32  1.86e- 1
## 10 best_actress_winyes  -2.20        1.30          -1.69  9.19e- 2
## 11 best_dir_winyes      -1.27        1.73          -0.738 4.61e- 1
## 12 top200_boxyes         0.639       2.79           0.229 8.19e- 1
## 13 feature_filmYes      -2.30        1.69          -1.36  1.73e- 1
## 14 dramaYes              1.32        0.876          1.51  1.32e- 1
## 15 mpaa_rating_RYes     -1.44        0.813         -1.78  7.62e- 2
## 16 oscar_seasonYes       0.380       1.12           0.340 7.34e- 1
## 17 summer_seasonYes      1.23        0.902          1.37  1.72e- 1

Now we will use the step AIC function to backwards fit the model by process of elimination.

stepAIC.model <- stepAIC(m_full, direction = "backward", trace = TRUE)
## Start:  AIC=3007.11
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box + feature_film + 
##     drama + mpaa_rating_R + oscar_season + summer_season
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         5  63012 3005.2
## - oscar_season      1        12  63018 3005.2
## - best_pic_win      1        46  63052 3005.6
## - best_dir_win      1        54  63061 3005.7
## - best_actor_win    1       174  63181 3006.9
## - feature_film      1       185  63192 3007.0
## - summer_season     1       186  63192 3007.0
## <none>                           63007 3007.1
## - drama             1       226  63233 3007.4
## - imdb_num_votes    1       259  63265 3007.8
## - best_actress_win  1       284  63290 3008.0
## - mpaa_rating_R     1       314  63321 3008.3
## - best_pic_nom      1       371  63377 3008.9
## - thtr_rel_year     1       386  63392 3009.1
## - runtime           1       601  63608 3011.3
## - critics_score     1       672  63678 3012.0
## - imdb_rating       1     58542 121549 3432.2
## 
## Step:  AIC=3005.17
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + feature_film + drama + 
##     mpaa_rating_R + oscar_season + summer_season
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        14  63025 3003.3
## - best_pic_win      1        46  63058 3003.6
## - best_dir_win      1        55  63067 3003.7
## - best_actor_win    1       173  63185 3004.9
## - feature_film      1       184  63196 3005.1
## - summer_season     1       190  63201 3005.1
## <none>                           63012 3005.2
## - drama             1       224  63236 3005.5
## - best_actress_win  1       281  63293 3006.1
## - imdb_num_votes    1       299  63311 3006.2
## - mpaa_rating_R     1       327  63338 3006.5
## - best_pic_nom      1       367  63379 3006.9
## - thtr_rel_year     1       400  63411 3007.3
## - runtime           1       600  63612 3009.3
## - critics_score     1       681  63693 3010.2
## - imdb_rating       1     58608 121619 3430.6
## 
## Step:  AIC=3003.31
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + feature_film + drama + 
##     mpaa_rating_R + summer_season
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        46  63071 3001.8
## - best_dir_win      1        56  63081 3001.9
## - best_actor_win    1       174  63200 3003.1
## - summer_season     1       177  63202 3003.1
## - feature_film      1       182  63207 3003.2
## <none>                           63025 3003.3
## - drama             1       222  63247 3003.6
## - best_actress_win  1       281  63307 3004.2
## - imdb_num_votes    1       302  63328 3004.4
## - mpaa_rating_R     1       329  63354 3004.7
## - best_pic_nom      1       387  63412 3005.3
## - thtr_rel_year     1       410  63436 3005.5
## - runtime           1       587  63613 3007.3
## - critics_score     1       679  63704 3008.3
## - imdb_rating       1     58603 121628 3428.6
## 
## Step:  AIC=3001.78
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win + feature_film + drama + mpaa_rating_R + summer_season
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        94  63165 3000.7
## - best_actor_win    1       163  63234 3001.5
## - feature_film      1       171  63242 3001.5
## - summer_season     1       174  63245 3001.6
## <none>                           63071 3001.8
## - drama             1       220  63291 3002.0
## - imdb_num_votes    1       271  63342 3002.6
## - best_actress_win  1       294  63365 3002.8
## - mpaa_rating_R     1       330  63401 3003.2
## - best_pic_nom      1       342  63414 3003.3
## - thtr_rel_year     1       397  63468 3003.9
## - runtime           1       586  63657 3005.8
## - critics_score     1       680  63751 3006.8
## - imdb_rating       1     58858 121929 3428.2
## 
## Step:  AIC=3000.75
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     feature_film + drama + mpaa_rating_R + summer_season
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       167  63332 3000.5
## - best_actor_win    1       171  63336 3000.5
## - feature_film      1       183  63348 3000.6
## <none>                           63165 3000.7
## - drama             1       228  63394 3001.1
## - imdb_num_votes    1       247  63412 3001.3
## - best_actress_win  1       299  63464 3001.8
## - best_pic_nom      1       326  63491 3002.1
## - mpaa_rating_R     1       345  63510 3002.3
## - thtr_rel_year     1       368  63533 3002.5
## - critics_score     1       651  63816 3005.4
## - runtime           1       673  63839 3005.6
## - imdb_rating       1     58895 122061 3426.9
## 
## Step:  AIC=3000.46
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     feature_film + drama + mpaa_rating_R
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       156  63488 3000.1
## <none>                           63332 3000.5
## - best_actor_win    1       195  63527 3000.5
## - drama             1       204  63536 3000.6
## - imdb_num_votes    1       260  63592 3001.1
## - best_pic_nom      1       297  63629 3001.5
## - best_actress_win  1       297  63629 3001.5
## - mpaa_rating_R     1       356  63688 3002.1
## - thtr_rel_year     1       361  63693 3002.2
## - runtime           1       690  64022 3005.5
## - critics_score     1       732  64064 3005.9
## - imdb_rating       1     58763 122095 3425.1
## 
## Step:  AIC=3000.06
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     drama + mpaa_rating_R
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       121  63609 2999.3
## - imdb_num_votes    1       173  63661 2999.8
## <none>                           63488 3000.1
## - best_actor_win    1       219  63706 3000.3
## - thtr_rel_year     1       277  63765 3000.9
## - best_pic_nom      1       291  63778 3001.0
## - best_actress_win  1       306  63794 3001.2
## - mpaa_rating_R     1       453  63941 3002.7
## - runtime           1       715  64203 3005.3
## - critics_score     1       875  64363 3007.0
## - imdb_rating       1     63189 126677 3447.1
## 
## Step:  AIC=2999.3
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     mpaa_rating_R
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       148  63757 2998.8
## <none>                           63609 2999.3
## - best_actor_win    1       209  63818 2999.4
## - thtr_rel_year     1       272  63881 3000.1
## - best_actress_win  1       274  63883 3000.1
## - best_pic_nom      1       307  63916 3000.4
## - mpaa_rating_R     1       391  64000 3001.3
## - runtime           1       631  64240 3003.7
## - critics_score     1       916  64525 3006.6
## - imdb_rating       1     63434 127043 3447.0
## 
## Step:  AIC=2998.81
## audience_score ~ runtime + thtr_rel_year + imdb_rating + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win + mpaa_rating_R
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                           63757 2998.8
## - thtr_rel_year     1       201  63958 2998.9
## - best_actor_win    1       219  63976 2999.0
## - best_actress_win  1       266  64023 2999.5
## - mpaa_rating_R     1       367  64124 3000.5
## - best_pic_nom      1       442  64199 3001.3
## - runtime           1       519  64276 3002.1
## - critics_score     1       879  64635 3005.7
## - imdb_rating       1     67356 131113 3465.4
AIC.lm <- lm(audience_score ~ runtime + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win + mpaa_rating_R, data=movies)

AIC.lm$coefficients
##         (Intercept)             runtime       thtr_rel_year 
##         70.10675281         -0.05115515         -0.05122557 
##         imdb_rating       critics_score     best_pic_nomyes 
##         15.00149242          0.06409989          4.88277038 
##   best_actor_winyes best_actress_winyes    mpaa_rating_RYes 
##         -1.73481942         -2.11568281         -1.50528039

Using AIC to eliminate variables from the model, the final model will be:

audience_score = intercept + -0.05115515*runtime + -0.05122557*thtr_rel_year + 15.00149242*imdb_rating + 0.06409989*critics_score + 4.88277038*best_pic_nmyes + -1.73481942*best_actor_winyes + -2.11568281*best_actress_winyes + -1.50528039*mpaa_rating_RYes

To evaluate the model, we can look at the residuals:

plot(movies_reduced$audience_score ~ AIC.lm$residuals)

ggplot(data=AIC.lm, aes(x=AIC.lm$residuals)) + geom_histogram(bins=20)

The residuals appear normally distributed.

Next, we will try a different approach for model selection - Bayesian Model Averaging.

We learned that often, there are several models that equally likely and choosing only one ignores the uncertainy we have in choosing variables to include. BMA allows us to average multiple plausible models to get posteriors of coefficients.

# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_movies <- bas.lm(audience_score ~ ., data = movies_reduced,
                   prior = "AIC", 
                   modelprior = uniform())

# Print out the marginal posterior inclusion probabilities for each variable            
bma_movies
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_reduced, prior = "AIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept              runtime        thtr_rel_year  
##              1.0000               0.8725               0.5729  
##         imdb_rating       imdb_num_votes        critics_score  
##              1.0000               0.4538               0.9670  
##     best_pic_nomyes      best_pic_winyes    best_actor_winyes  
##              0.6778               0.3013               0.5181  
## best_actress_winyes      best_dir_winyes        top200_boxyes  
##              0.5925               0.3634               0.3022  
##     feature_filmYes             dramaYes     mpaa_rating_RYes  
##              0.3824               0.3833               0.6831  
##     oscar_seasonYes     summer_seasonYes  
##              0.2807               0.4585
# Top 5 most probably models
summary(bma_movies)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept               1.0000000     1.0000     1.0000000     1.0000000
## runtime                 0.8725072     1.0000     1.0000000     1.0000000
## thtr_rel_year           0.5729427     1.0000     0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000     1.0000000     1.0000000
## imdb_num_votes          0.4537507     0.0000     0.0000000     0.0000000
## critics_score           0.9670063     1.0000     1.0000000     1.0000000
## best_pic_nomyes         0.6777549     1.0000     1.0000000     1.0000000
## best_pic_winyes         0.3013074     0.0000     0.0000000     0.0000000
## best_actor_winyes       0.5180921     1.0000     1.0000000     0.0000000
## best_actress_winyes     0.5925182     1.0000     1.0000000     1.0000000
## best_dir_winyes         0.3633570     0.0000     0.0000000     0.0000000
## top200_boxyes           0.3022418     0.0000     0.0000000     0.0000000
## feature_filmYes         0.3824117     0.0000     0.0000000     0.0000000
## dramaYes                0.3832989     0.0000     0.0000000     0.0000000
## mpaa_rating_RYes        0.6831172     1.0000     1.0000000     1.0000000
## oscar_seasonYes         0.2806655     0.0000     0.0000000     0.0000000
## summer_seasonYes        0.4584914     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9783606     0.9298905
## PostProbs                      NA     0.0019     0.0019000     0.0018000
## R2                             NA     0.7601     0.7593000     0.7585000
## dim                            NA     9.0000     8.0000000     7.0000000
## logmarg                        NA -3604.4206 -3604.4424632 -3604.4932746
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## runtime                 1.0000000     1.0000000
## thtr_rel_year           1.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         1.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     1.0000000     1.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## feature_filmYes         0.0000000     0.0000000
## dramaYes                0.0000000     0.0000000
## mpaa_rating_RYes        1.0000000     1.0000000
## oscar_seasonYes         0.0000000     0.0000000
## summer_seasonYes        0.0000000     1.0000000
## BF                      0.8903369     0.7914465
## PostProbs               0.0017000     0.0015000
## R2                      0.7592000     0.7592000
## dim                     8.0000000     8.0000000
## logmarg             -3604.5367415 -3604.6544792

We can see that the posterior probability that imdb_rating is included in the model is 1.000, and critics_score is almost 1.

The most likely models (model1 and model2) have the same posterior probability of 0.0019.

We can also visualize the posterior distribution of the coefficients under the model averaging approach. We graph the posterior distribution of the coefficients for the variables we created earlier.

# Obtain the coefficients from the model `bma_movies`
coef_movies <- coefficients(bma_movies)

# Look at the variables we created in the data set
plot(coef_movies, subset = c(13, 14,15,16,17), ask = FALSE)

We can also quantify our uncertainty in the coefficients in our model by generating 95% credible intervals:

confint(coef_movies)
##                              2.5%        97.5%          beta
## Intercept            6.161619e+01 6.315901e+01  6.234769e+01
## runtime             -9.637600e-02 0.000000e+00 -4.954262e-02
## thtr_rel_year       -1.236302e-01 5.869476e-03 -3.445803e-02
## imdb_rating          1.370364e+01 1.621687e+01  1.491842e+01
## imdb_num_votes      -1.386757e-06 1.269664e-05  2.448273e-06
## critics_score        0.000000e+00 1.008312e-01  6.288512e-02
## best_pic_nomyes     -4.725224e-03 8.907068e+00  3.093246e+00
## best_pic_winyes     -8.273182e+00 4.761188e+00 -5.357266e-01
## best_actor_winyes   -3.613718e+00 3.312635e-01 -8.906023e-01
## best_actress_winyes -4.546457e+00 0.000000e+00 -1.274460e+00
## best_dir_winyes     -4.199131e+00 8.795338e-01 -5.628675e-01
## top200_boxyes       -2.372924e+00 5.415304e+00  4.370515e-01
## feature_filmYes     -4.114358e+00 1.006133e+00 -6.183369e-01
## dramaYes            -5.171189e-01 2.141334e+00  3.298391e-01
## mpaa_rating_RYes    -2.876618e+00 1.112439e-02 -1.023176e+00
## oscar_seasonYes     -1.302998e+00 1.679704e+00  6.114252e-02
## summer_seasonYes    -3.304616e-01 2.409419e+00  5.098497e-01
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

In context, a 1 point increase in imdb_rating predicts an increase in audience score of +14.91842.


Part 5: Prediction

Now, we will pick a movie from 2016 (not in sample) and predict audience_score for this movie using the BMA derived model. I am choosing the movie Deadpool.

deadpool <- data.frame(imdb_rating = 8.0, runtime = 108, critics_score = 85, mpaa_rating_R="Yes", thtr_rel_year=2016, best_pic_nom="no",best_actor_win="no", best_actress_win="no", summer_season = "No", oscar_season = "Yes", best_pic_win = 'no',
                       best_dir_win = 'no', feature_film = "Yes", drama = "No", imdb_num_votes = 899526, top200_box='yes')

Then, we will predict the audience_score and its 95% credible interval, using the AIC step model and the BMA model to contrast results:

predict(AIC.lm, newdata = deadpool, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 85.26639 65.57666 104.9561
deadpool_bma_pred <- predict(bma_movies, newdata=deadpool, estimator="BMA", se.fit=TRUE, interval="prediction", level = 0.95)

deadpool_bma_pred$Ybma
##          [,1]
## [1,] 87.70602

The actual audience score is 90 - the BMA model does a better job of predicting.


Part 6: Conclusion

The BMA model performing better than the model chosen based solely on AIC is perhaps not surprising since that is the model that allows us to incorporate our uncertainty in the coefficients. This illustrates why it is important for us to express uncertainty in our model in order to make it better at prediction.

I also noted that the residuals may have violated some of the assumptions of normality, and some further analysis could be done to see if transforming audience_score or other variables would improve model fit.

The models also did not benefit from any strong priors about the relationships between variables and audience score. Priors could have been helpful to make us more certain in our estimates of coefficients.