Congratulations on getting a job as a data scientist at Paramount Pictures!
Our boss has just acquired data about how much audiences and critics like movies, as well as numerous other variables about the movies. This dataset is provided below, and it includes information from Rotten Tomatoes and IMDb for a random sample of movies.
She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants our team to figure it all out.
The specific modeling task we need to complete is as follows: Develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables we will need to construct in the data manipulation section using the mutate function in dplyr:
feature_film: “yes” if title_type is Feature Film, “no” otherwise
drama: “yes” if genre is Drama, “no” otherwise
runtime
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise
thtr_rel_year
oscar_season: “yes” if movie is released in October, November, or December (based on thtr_rel_month), “no” otherwise
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box

For this project, we will complete exploratory data analysis (EDA), modeling, and prediction.
library(ggplot2)
library(MASS)
library(dplyr)
library(statsr)
library(BAS)
library(plotly)
library(broom)

load("movies.RData")

The data set comprises 651 randomly sampled movies produced and released before 2016.
Some of these variables are provided only for informational purposes and do not make sense to include in a statistical analysis. It is up to us to decide which variables are meaningful and which should be omitted. For example, information in the actor1 through actor5 variables was used to determine whether the movie features an actor or actress who has won a best actor or actress Oscar.
We might also choose to omit certain observations or restructure some of the variables to make them suitable for answering our research questions.
When we are fitting a model, we should also be careful about collinearity, as some of these variables may be dependent on each other.
Our exploratory analysis (see below) is consistent with these data being a random sample, so results obtained through linear regression can be generalized to a larger population of movies.
title_type: new variable feature_film with levels yes (movies that are feature films) and no
genre: new variable drama with levels yes (movies that are dramas) and no
mpaa_rating: new variable mpaa_rating_R with levels yes (movies that are R rated) and no
thtr_rel_month: two new variables:
oscar_season with levels yes (if the movie is released in October, November, or December) and no
summer_season with levels yes (if the movie is released in May, June, July, or August) and no

movies <- movies %>%
mutate(feature_film=factor(ifelse(title_type=="Feature Film","yes","no")),
drama=factor(ifelse(genre=="Drama","yes","no")),
mpaa_rating_R=factor(ifelse(mpaa_rating=="R","yes","no")),
oscar_season=factor(ifelse(thtr_rel_month%in%c(10,11,12),"yes","no")),
summer_season=factor(ifelse(thtr_rel_month%in%c(5,6,7,8),"yes","no")))

We now conduct exploratory data analysis of the relationship between audience_score and the new variables constructed above.
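As a quick sanity check that the recoding behaved as intended, we can tabulate the newly created factors (a minimal sketch; output not shown):

# Sanity check: counts of "yes"/"no" for each newly created factor
movies %>% count(feature_film)
movies %>% count(drama)
movies %>% count(mpaa_rating_R)
movies %>% count(oscar_season, summer_season)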
First, we take a look at the distribution of audience_score.
m <- list(l=50,r=50,b=100,t=50,pad=4)
audience_score_density <- density(movies$audience_score)
fig <- plot_ly(x=~audience_score, data=movies, type="histogram",name="Histogram") %>%
add_trace(x=audience_score_density$x,y=audience_score_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density") %>%
layout(autosize = F, width=600, height=400,margin = m,
title="Histogram of 'audience_score'",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 0.8))
fig

summary(movies$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
The distribution of audience_score is unimodal and slightly left-skewed (the mean of 62.4 falls below the median of 65), but close enough to symmetric that we treat it as approximately normal.
Next, we look at how audience_score is distributed across each of the new variables we created.
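Because the same histogram-and-density comparison is repeated for each new yes/no factor below, the plotting code could also be wrapped in a small helper. The sketch below is a hypothetical simplification (it omits the secondary density axis used in the explicit plots that follow); the per-variable code is kept as written for clarity:

# Hypothetical helper: overlaid histograms of audience_score for a yes/no factor.
# `var` is the column name as a string, e.g. "feature_film".
plot_score_by <- function(df, var) {
  yes <- df %>% filter(.data[[var]] == "yes")
  no  <- df %>% filter(.data[[var]] == "no")
  plot_ly(alpha = 0.4) %>%
    add_histogram(x = yes$audience_score, name = "Histogram-yes", bingroup = 1) %>%
    add_histogram(x = no$audience_score,  name = "Histogram-no",  bingroup = 1) %>%
    layout(barmode = "overlay",
           title = paste0("'audience_score' by '", var, "'"),
           xaxis = list(title = "audience_score"),
           yaxis = list(title = "Number of Observations"))
}
# Usage example: plot_score_by(movies, "feature_film")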
feature_film

feature_film_yes <- movies %>% filter(feature_film=="yes")
feature_film_no <- movies %>% filter(feature_film=="no")
feature_film_yes_density <- density(feature_film_yes$audience_score)
feature_film_no_density <- density(feature_film_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=feature_film_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=feature_film_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=feature_film_yes_density$x,y=feature_film_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=feature_film_no_density$x,y=feature_film_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Feature Film? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'feature_film': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(feature_film_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.50 62.00 60.47 78.00 97.00
summary(feature_film_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 76.50 85.50 81.05 89.00 96.00
drama

drama_yes <- movies %>% filter(drama=="yes")
drama_no <- movies %>% filter(drama=="no")
drama_yes_density <- density(drama_yes$audience_score)
drama_no_density <- density(drama_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=drama_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=drama_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=drama_yes_density$x,y=drama_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=drama_no_density$x,y=drama_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Drama? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'drama': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(drama_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 52.00 70.00 65.35 80.00 95.00
summary(drama_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 41.00 61.00 59.73 79.00 97.00
mpaa_rating_R

mpaa_rating_R_yes <- movies %>% filter(mpaa_rating_R=="yes")
mpaa_rating_R_no <- movies %>% filter(mpaa_rating_R=="no")
mpaa_rating_R_yes_density <- density(mpaa_rating_R_yes$audience_score)
mpaa_rating_R_no_density <- density(mpaa_rating_R_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=mpaa_rating_R_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=mpaa_rating_R_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=mpaa_rating_R_yes_density$x,y=mpaa_rating_R_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=mpaa_rating_R_no_density$x,y=mpaa_rating_R_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> R Rating? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'mpaa_rating_R': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(mpaa_rating_R_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 44.00 64.00 62.04 79.00 97.00
summary(mpaa_rating_R_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 48.25 65.50 62.69 80.00 96.00
oscar_season

oscar_season_yes <- movies %>% filter(oscar_season=="yes")
oscar_season_no <- movies %>% filter(oscar_season=="no")
oscar_season_yes_density <- density(oscar_season_yes$audience_score)
oscar_season_no_density <- density(oscar_season_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=oscar_season_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=oscar_season_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=oscar_season_yes_density$x,y=oscar_season_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=oscar_season_no_density$x,y=oscar_season_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Oscar Season? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'oscar_season': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(oscar_season_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 47.50 69.00 63.69 81.00 97.00
summary(oscar_season_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 64.00 61.81 79.00 96.00
summer_season

summer_season_yes <- movies %>% filter(summer_season=="yes")
summer_season_no <- movies %>% filter(summer_season=="no")
summer_season_yes_density <- density(summer_season_yes$audience_score)
summer_season_no_density <- density(summer_season_no$audience_score)
fig <- plot_ly(alpha = 0.4)
fig <- fig %>% add_histogram(x = ~audience_score, data=summer_season_yes, name="Histogram-yes",bingroup=1) %>%
add_histogram(x = ~audience_score, data=summer_season_no, name="Histogram-no",bingroup=1) %>%
add_trace(x=summer_season_yes_density$x,y=summer_season_yes_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-yes") %>%
add_trace(x=summer_season_no_density$x,y=summer_season_no_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density-no") %>%
layout(barmode = "overlay",yaxis=list(title="Number of Observations"),
legend=list(title=list(text='<b> Summer Season? </b>'))) %>%
layout(autosize = F,width = 600, height = 450, margin = m,
title="'audience_score vs. 'summer_season': Histogram & Density Plots",
xaxis=list(title="audience_score"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.1, y = 1))
fig

summary(summer_season_yes$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.75 65.00 61.81 78.00 94.00
summary(summer_season_no$audience_score)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 46.00 66.00 62.62 80.00 97.00
Our exploratory analysis indicates that, within each of these groups, the distribution of audience_score remains unimodal and roughly bell-shaped, although non-feature films score noticeably higher on average (mean 81.0 versus 60.5 for feature films).
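The group-wise summaries above can also be computed compactly with dplyr; a sketch for one of the factors (the same pattern applies to the others):

# Compact numerical comparison: audience_score by feature_film
movies %>%
  group_by(feature_film) %>%
  summarise(n = n(),
            mean_score = mean(audience_score),
            median_score = median(audience_score))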
We will now develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables we constructed earlier:
feature_film
drama
runtime
mpaa_rating_R
thtr_rel_year
oscar_season
summer_season
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box

Let’s prepare the dataset movies by selecting the columns specified for the regression.
movies_modeling <- movies %>% select(audience_score,feature_film,drama,runtime,mpaa_rating_R, thtr_rel_year,oscar_season,summer_season,imdb_rating,imdb_num_votes,critics_score, best_pic_nom,best_pic_win,best_actor_win,best_actress_win,best_dir_win,top200_box)
str(movies_modeling)

## tibble [651 × 17] (S3: tbl_df/tbl/data.frame)
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ feature_film : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
## $ drama : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating_R : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ oscar_season : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ summer_season : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
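Before fitting any models, it is worth checking for missing values, since incomplete rows will be dropped (via na.omit) in the steps that follow. A quick check:

# Count missing values in each modeling column; rows containing NAs are
# removed later with na.omit() before model fitting
colSums(is.na(movies_modeling))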
We first fit the full linear regression model with all of the candidate predictors:
model_full <- lm(audience_score ~ ., data = movies_modeling)
tidy(model_full)

## # A tibble: 17 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 124. 77.5 1.61 1.09e- 1
## 2 feature_filmyes -2.25 1.69 -1.33 1.83e- 1
## 3 dramayes 1.29 0.877 1.47 1.41e- 1
## 4 runtime -0.0561 0.0242 -2.32 2.04e- 2
## 5 mpaa_rating_Ryes -1.44 0.813 -1.78 7.60e- 2
## 6 thtr_rel_year -0.0766 0.0383 -2.00 4.63e- 2
## 7 oscar_seasonyes -0.533 0.997 -0.535 5.93e- 1
## 8 summer_seasonyes 0.911 0.949 0.959 3.38e- 1
## 9 imdb_rating 14.7 0.607 24.3 2.03e-92
## 10 imdb_num_votes 0.00000723 0.00000452 1.60 1.10e- 1
## 11 critics_score 0.0575 0.0222 2.59 9.73e- 3
## 12 best_pic_nomyes 5.32 2.63 2.02 4.33e- 2
## 13 best_pic_winyes -3.21 4.61 -0.697 4.86e- 1
## 14 best_actor_winyes -1.54 1.18 -1.31 1.91e- 1
## 15 best_actress_winyes -2.20 1.30 -1.69 9.23e- 2
## 16 best_dir_winyes -1.23 1.73 -0.713 4.76e- 1
## 17 top200_boxyes 0.848 2.78 0.305 7.61e- 1
confint(model_full)

## 2.5 % 97.5 %
## (Intercept) -2.775101e+01 2.765742e+02
## feature_filmyes -5.561752e+00 1.065404e+00
## dramayes -4.289065e-01 3.013785e+00
## runtime -1.035699e-01 -8.710668e-03
## mpaa_rating_Ryes -3.040338e+00 1.514030e-01
## thtr_rel_year -1.518695e-01 -1.266551e-03
## oscar_seasonyes -2.490505e+00 1.423948e+00
## summer_seasonyes -9.534879e-01 2.774785e+00
## imdb_rating 1.352584e+01 1.590862e+01
## imdb_num_votes -1.646766e-06 1.611569e-05
## critics_score 1.395487e-02 1.010066e-01
## best_pic_nomyes 1.606707e-01 1.048041e+01
## best_pic_winyes -1.226379e+01 5.840535e+00
## best_actor_winyes -3.859279e+00 7.706685e-01
## best_actress_winyes -4.758389e+00 3.620654e-01
## best_dir_winyes -4.624714e+00 2.161716e+00
## top200_boxyes -4.615603e+00 6.311263e+00
As we can see from a quick summary of the full linear model, many of the coefficients are not statistically significant. We will use the Bayesian Information Criterion (BIC) for model selection. BIC rewards model fit while penalizing the number of parameters in proportion to the logarithm of the sample size. We can calculate the BIC of the full linear model with the command below:
BIC(model_full)

## [1] 4934.145
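For reference, for a model with \(p\) estimated parameters fit to \(n\) observations with maximized likelihood \(\hat{L}\), the criterion is \(\mathrm{BIC} = -2\ln\hat{L} + p\ln(n)\); lower values indicate a better balance of fit and complexity.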
We can compare the BIC value of the full model with that of a reduced model. To do this, we use the stepAIC function from the MASS package, which works backwards through the model space, removing variables until the information criterion can no longer be lowered. It takes the full model as input along with a penalty parameter \(k\): the default \(k = 2\) corresponds to AIC (used in the run below), while \(k = \log(n)\), where \(n\) is the number of observations, corresponds to BIC (see the BIC-penalty variant after the output).
model_full <- lm(audience_score ~ ., data = na.omit(movies_modeling))
model_full_step <- stepAIC(model_full, trace = FALSE, direction = "backward")
model_full_step$anova

## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 633 62989.66 3006.940
## 2 - top200_box 1 9.240822 634 62998.90 3005.035
## 3 - oscar_season 1 26.461061 635 63025.36 3003.308
## 4 - best_pic_win 1 45.744109 636 63071.11 3001.780
## 5 - best_dir_win 1 93.998979 637 63165.11 3000.748
## 6 - summer_season 1 166.872376 638 63331.98 3000.463
## 7 - feature_film 1 155.730073 639 63487.71 3000.059
## 8 - drama 1 121.355602 640 63609.06 2999.300
## 9 - imdb_num_votes 1 147.829088 641 63756.89 2998.809
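As noted above, the run shown uses stepAIC's default penalty \(k = 2\) (AIC). A sketch of the BIC-penalty variant, which would typically retain fewer variables (its output is not shown here):

# Backward selection with the BIC penalty k = log(n)
n <- nrow(na.omit(movies_modeling))
model_full_step_bic <- stepAIC(model_full, trace = FALSE,
                               direction = "backward", k = log(n))
model_full_step_bic$anova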
Often, several models are equally plausible, and choosing only one ignores the inherent uncertainty involved in choosing the variables to include in the model. A way around this problem is Bayesian model averaging (BMA), in which multiple models are averaged to obtain posterior distributions of coefficients and predictions for new data.
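Formally, for a quantity of interest \(\Delta\) (a coefficient or a future observation), BMA averages over the candidate models \(M_1, \ldots, M_K\) weighted by their posterior probabilities:

\[ p(\Delta \mid \text{data}) = \sum_{k=1}^{K} p(\Delta \mid M_k, \text{data}) \, p(M_k \mid \text{data}) \]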
# Exclude observations with missing values in the data set
movies_modeling_no_na <- na.omit(movies_modeling)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_audience_score <- bas.lm(audience_score ~ ., data = movies_modeling_no_na,
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_audience_score

##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_modeling_no_na,
## prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06537 0.04320
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.46971 0.19984 0.09069
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.07506 0.08042 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.05774 0.88855 0.13119
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.03985 0.14435 0.14128
## best_dir_winyes top200_boxyes
## 0.06694 0.04762
# Top 5 most probable models
summary(bma_audience_score)

## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06536947 0.0000 0.0000000 0.0000000
## dramayes 0.04319833 0.0000 0.0000000 0.0000000
## runtime 0.46971477 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.19984016 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.09068970 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07505684 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08042023 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05773502 0.0000 0.0000000 0.0000000
## critics_score 0.88855056 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13119140 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03984766 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14434896 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.14128087 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06693898 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04762234 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9968489 0.2543185
## PostProbs NA 0.1297 0.1293000 0.0330000
## R2 NA 0.7549 0.7525000 0.7539000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3615.2791 -3615.2822108 -3616.6482224
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2521327 0.2391994
## PostProbs 0.0327000 0.0310000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg -3616.6568544 -3616.7095127
Printing the model object and calling summary() give us both the marginal posterior inclusion probability for each variable and the most probable models. For example, the posterior probability that critics_score is included in the model is 0.89. Furthermore, the most likely model (Model 1), which has a posterior probability of 0.1297, includes an intercept and the variables runtime, imdb_rating, and critics_score.
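We can also visualize which predictors appear in the top-ranked models; a sketch using the image method that the BAS package provides for bas.lm objects (default shading settings assumed):

# Model-space plot: rows are predictors, columns are the highest-probability
# models, with shading related to posterior model probability
image(bma_audience_score, rotate = FALSE)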
It is also possible to visualize the posterior distributions of the coefficients under the model averaging approach. We graph the posterior distributions of the coefficients of runtime and critics_score below. Note that the subset argument dictates which variables are plotted.
# Obtain the coefficients from the model `bma_audience_score`
coef_audience_score <- coefficients(bma_audience_score)
# `runtime` is the 4th variable, while `critics_score` is the 11th variable in the data set
plot(coef_audience_score, subset = c(4,11), ask = FALSE)

We can also provide 95% credible intervals for these coefficients:
confint(coef_audience_score)

## 2.5% 97.5% beta
## Intercept 6.158880e+01 6.314897e+01 6.234769e+01
## feature_filmyes -1.206636e+00 0.000000e+00 -1.046908e-01
## dramayes 0.000000e+00 0.000000e+00 1.604413e-02
## runtime -8.341541e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes -2.093717e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year -5.082473e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes -9.203152e-01 1.288938e-02 -8.034940e-02
## summer_seasonyes -1.924108e-02 1.039107e+00 8.704545e-02
## imdb_rating 1.368965e+01 1.658135e+01 1.498203e+01
## imdb_num_votes -4.059600e-09 1.505757e-06 2.080713e-07
## critics_score 0.000000e+00 1.061901e-01 6.296648e-02
## best_pic_nomyes 0.000000e+00 4.997640e+00 5.068035e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes -2.603830e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.934742e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes -1.438545e+00 0.000000e+00 -1.195011e-01
## top200_boxyes 0.000000e+00 0.000000e+00 8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
We will now select a movie that is not contained in the movies dataset and generate a prediction for this movie using the predict function in R.
Before making predictions for new data, however, we examine Bayesian predictive distributions on the existing data. BAS uses simulation to construct predictive intervals under Bayesian model averaging, whereas exact inference is often possible for predictive intervals under model selection.
Returning to the movies data set, let us find the predicted values under the Best Predictive Model (BPM), the model whose predictions are closest to those from BMA, along with the corresponding posterior standard deviations.
BPM_pred_audience_score <- predict(bma_audience_score, estimator = "BPM", se.fit = TRUE)
variable.names(BPM_pred_audience_score)

## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
In the code above, the function variable.names extracts the names of all of the predictors in the Best Predictive Model. The same function can be used to identify the variables in the Highest Probability Model (HPM)
HPM_pred_audience_score <- predict(bma_audience_score, estimator = "HPM")
variable.names(HPM_pred_audience_score)

## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
and the Median Probability Model (MPM)
MPM_pred_audience_score <- predict(bma_audience_score, estimator = "MPM")
variable.names(MPM_pred_audience_score)

## [1] "Intercept" "imdb_rating" "critics_score"
Let us examine which movie in the data set has the highest predicted audience_score under the BPM, and what its characteristics are.
# Find the index of observation with the largest fitted value
opt <- which.max(BPM_pred_audience_score$fit)
# Extract the row with this observation and glimpse at the row
movies_modeling_no_na %>%
slice(opt) %>%
glimpse()

## Rows: 1
## Columns: 17
## $ audience_score <dbl> 97
## $ feature_film <fct> yes
## $ drama <fct> no
## $ runtime <dbl> 202
## $ mpaa_rating_R <fct> yes
## $ thtr_rel_year <dbl> 1974
## $ oscar_season <fct> yes
## $ summer_season <fct> no
## $ imdb_rating <dbl> 9
## $ imdb_num_votes <int> 749783
## $ critics_score <dbl> 97
## $ best_pic_nom <fct> yes
## $ best_pic_win <fct> yes
## $ best_actor_win <fct> yes
## $ best_actress_win <fct> yes
## $ best_dir_win <fct> yes
## $ top200_box <fct> no
A 95% credible interval for predicting audience_score can be obtained by
ci_audience_score <- confint(BPM_pred_audience_score, parm = "pred")
ci_audience_score[opt,]

## 2.5% 97.5% pred
## 77.42096 117.65506 97.53801
If we were to use BMA, the interval would be
BMA_pred_audience_score <- predict(bma_audience_score, estimator = "BMA", se.fit = TRUE)
ci_bma_audience_score <- confint(BMA_pred_audience_score, estimator = "BMA")
opt_bma <- which.max(BMA_pred_audience_score$fit)
ci_bma_audience_score[opt_bma, ]

## 2.5% 97.5% pred
## 80.02126 121.00041 99.78468
Now, the movie we will consider for our prediction is John Wick (2014), which has an observed audience_score of 81 and is not included in the original movies dataset. Ratings data for John Wick were gathered from IMDb and Rotten Tomatoes.
john_wick <- read.csv("John Wick ratings.csv")
john_wick

## title title_type genre runtime mpaa_rating
## 1 John Wick Feature Film Action & Adventure 101 Not Family-Friendly
## studio thtr_rel_year thtr_rel_month thtr_rel_day
## 1 LionsGate Entertainment 2014 10 24
## thtr_rel_season dvd_rel_year dvd_rel_month dvd_rel_day imdb_rating
## 1 Other 2016 6 7 7.4
## imdb_num_votes critics_rating critics_score audience_rating audience_score
## 1 553527 Certified Fresh 87 Upright 81
## audience_count
## 1 83615
Let’s first fit a linear regression model based on both the Best Predictive Model (BPM) and the Highest Probability Model (HPM), that is, using the variables runtime, imdb_rating, and critics_score:
mod1 <- lm(audience_score ~ runtime+imdb_rating+critics_score, data=movies_modeling)
summary(mod1)

##
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score,
## data = movies_modeling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.998 -6.565 0.557 5.475 52.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.28321 3.21939 -10.338 < 2e-16 ***
## runtime -0.05362 0.02107 -2.545 0.01117 *
## imdb_rating 14.98076 0.57735 25.947 < 2e-16 ***
## critics_score 0.07036 0.02156 3.263 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.04 on 646 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7549, Adjusted R-squared: 0.7538
## F-statistic: 663.3 on 3 and 646 DF, p-value: < 2.2e-16
Let’s get a prediction of the audience_score based on the John Wick data:
pred <- predict(mod1,newdata=john_wick,interval="prediction",type="response")
pred

## fit lwr upr
## 1 78.28034 58.52952 98.03116
The 95% prediction interval associated with the predicted audience_score of 78.28 is (58.53, 98.03). This means that, according to mod1, 95% of movies with John Wick's values of runtime, imdb_rating, and critics_score are expected to have an actual audience_score between 58.53 and 98.03.
pred <- predict(mod1,newdata=john_wick,interval="confidence",type="response")
pred

## fit lwr upr
## 1 78.28034 77.11133 79.44936
The confidence interval reflects uncertainty about the mean prediction. The 95% confidence interval associated with the predicted audience_score of 78.28 is (77.11, 79.45). This means that, according to mod1, the average audience_score for movies with these characteristics is estimated to lie within (77.11, 79.45).
We calculate the percent error knowing \(y_{obs}=81\) from the data:
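A minimal sketch of that calculation, using the fitted value of 78.28 from the prediction above:

# Percent error of the point prediction relative to the observed score of 81
# (roughly 3.4% given the fitted value of 78.28)
percent_error <- abs(pred[, "fit"] - 81) / 81 * 100
percent_error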
Not bad!! Now, let’s look at a linear model based on the Median Probability Model (MPM), namely using only the variables imdb_rating and critics_score.
mod2 <- lm(audience_score ~ imdb_rating+critics_score, data=movies_modeling)
summary(mod2)

##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score, data = movies_modeling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.668 -6.758 0.723 5.513 52.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.03195 2.86401 -12.930 < 2e-16 ***
## imdb_rating 14.65760 0.56590 25.901 < 2e-16 ***
## critics_score 0.07318 0.02161 3.386 0.000753 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.08 on 648 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.7516
## F-statistic: 984.4 on 2 and 648 DF, p-value: < 2.2e-16
Let’s get a prediction of the audience_score based on the John Wick data:
pred <- predict(mod2,newdata=john_wick,interval="prediction",type="response")
pred

## fit lwr upr
## 1 77.8006 57.9782 97.623
And the confidence interval of the prediction…
pred <- predict(mod2,newdata=john_wick,interval="confidence",type="response")
pred

## fit lwr upr
## 1 77.8006 76.6841 78.9171
As we can see, the predictions from mod1 and mod2 are very close, and both prediction intervals contain the observed audience_score of 81.
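For completeness, the Bayesian model itself could be used to predict John Wick's audience_score. The sketch below is hypothetical: it reconstructs the derived factors for the new observation and assumes "no" for the award and box-office indicators, which are not in the CSV (note also that the CSV's mpaa_rating field is not an MPAA code, so mpaa_rating_R comes out "no" here):

# Hypothetical: predict John Wick's audience_score from the BMA fit.
# Derived factors mirror the earlier mutate(); award/box-office indicators
# are assumed to be "no" for this movie.
yn <- function(x) factor(x, levels = c("no", "yes"))
john_wick_bas <- john_wick %>%
  mutate(feature_film  = yn(ifelse(title_type == "Feature Film", "yes", "no")),
         drama         = yn(ifelse(genre == "Drama", "yes", "no")),
         mpaa_rating_R = yn(ifelse(mpaa_rating == "R", "yes", "no")),
         oscar_season  = yn(ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no")),
         summer_season = yn(ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no")),
         best_pic_nom = yn("no"), best_pic_win = yn("no"),
         best_actor_win = yn("no"), best_actress_win = yn("no"),
         best_dir_win = yn("no"), top200_box = yn("no"))
bma_jw <- predict(bma_audience_score, newdata = john_wick_bas,
                  estimator = "BMA", se.fit = TRUE)
confint(bma_jw, parm = "pred")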
In this analysis, we fit a linear model relating selected movie features to audience_score. As part of our exploratory work, we checked that the distribution of audience_score, overall and within the groups defined by the new indicator variables, was roughly normal. We then used the Bayesian Information Criterion (BIC) and Bayesian Model Averaging (BMA) to guide model selection. With a larger dataset and additional feature engineering, more accurate predictions might be achievable. Future work might include adding non-linear terms to the model, though we would need to exercise caution not to overfit the data.