The task of this project is to predict audience scores audience_score
based on explanatory variables available in the supplied data set using a Bayesian model selection technique.
library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
## Warning: package 'BAS' was built under R version 3.4.3
library(ggplot2)
library(MASS)
load("movies.Rdata")
The movies data set is comprised of 651 randomly sampled movies produced and released before 2016. Because it is a random sample of movies, the results are generalizable to the target population of all movies produced and released before 2016. Variables relate to how much audiences and critics like the movies, and about the movies themselves. The data set are observations, so movie attributes are not randomly assigned and no causal inferences can be made.
The response variable of interest is audience_score
(Audience score on Rotten Tomatoes). The potential explanatory variables are:
feature_film
: “yes” if title_type (type of movie (Documentary, Feature Film, TV Movie)) is Feature Film, “no” otherwisedrama
: “yes” if genre (Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)) is Drama, “no” otherwiseruntime
: Runtime of movie (in minutes)mpaa_rating_R
: “yes” if mpaa_rating (G, PG, PG-13, R, Unrated) is R, “no” otherwisethtr_rel_year
: Year the movie is released in theatersoscar_season
: “yes” if movie is released in November, October, or December (based on thtr_rel_month
), “no” otherwisesummer_season
: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month
), “no” otherwiseimdb_rating
: Rating on IMDBimdb_num_votes
: Number of votes on IMDBcritics_score
: Critics score on Rotten Tomatoesbest_pic_nom
: Whether or not the movie was nominated for a best picture Oscar (no, yes)best_pic_win
: Whether or not the movie won a best picture Oscar (no, yes)best_actor_win
: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actor won an Oscar for their role in the given moviebest_actress_win
: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the actresses won an Oscar for their role in the given moviebest_dir_win
: Whether or not the director of the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the director won an Oscar for the given movietop200_box
: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)Several of the potential explanatory variables must be generated from other variables. Create a new data set of just the response variable and 16 potential explanatory variables.
movies <- movies %>% mutate(feature_film = factor(ifelse(title_type=="Feature Film", "yes", "no")))
movies <- movies %>% mutate(drama = factor(ifelse(genre=="Drama", "yes", "no")))
movies <- movies %>% mutate(mpaa_rating_R = factor(ifelse(mpaa_rating=="R", "yes", "no")))
movies <- movies %>% mutate(oscar_season = factor(ifelse(thtr_rel_month %in% c(10,11,12), "yes", "no")))
movies <- movies %>% mutate(summer_season = factor(ifelse(thtr_rel_month %in% c(5,6,7,8), "yes", "no")))
movies2 <- subset(movies, select=c("audience_score",
"feature_film", "drama", "runtime",
"mpaa_rating_R", "thtr_rel_year",
"oscar_season", "summer_season",
"imdb_rating", "imdb_num_votes",
"critics_score", "best_pic_nom",
"best_pic_win", "best_actor_win",
"best_actress_win", "best_dir_win",
"top200_box"))
Below is a high-level exploratory analysis of the response variable and the 16 potential explanatory variables. Most explantory variables appear to be predictors of audience_score
. A few that do not include mpaa_rating_R
, thtr_rel_year
, oscar_season
, summer_season
, best_actor_win
, and best_actress_win
.
Start with a univariate data analysis of the response variable.
Audience scores range from 11 to 97 with mean 62.36. The interquartile range is [46..80]. The distribution is skewed left.
summary(movies2$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
ggplot(movies2, aes(x=audience_score)) +
geom_histogram() +
ggtitle("Histogram of Audience Scores")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now perform a bivariate data analysis of the response variable and sixteen potential explanatory variables.
Feature films tend to have lower audience scores than non-feature films. An ANOVA indicates the relationship is significant (F=61.71, p<.0001).
ggplot(movies2, aes(x=feature_film, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Feature Status")
summary(aov(audience_score ~ feature_film, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## feature_film 1 23081 23081 61.71 1.66e-14 ***
## Residuals 649 242740 374
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dramas appear to recieve higher audience scores than non-dramas. An ANOVA indicates the relationship is significant (F=12.73, p=.0004).
ggplot(movies2, aes(x=drama, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Genre")
summary(aov(audience_score ~ drama, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## drama 1 5113 5113 12.73 0.000387 ***
## Residuals 649 260707 402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There appears to be a positive relationship between audience scores and move runtime. The Pearson correlation coefficient indicates a weak, but significant positive linear relationship (r=.1810, p<.0001).
ggplot(movies2, aes(x=runtime, y=audience_score)) +
geom_point() +
geom_smooth(method = lm) +
ggtitle ("Scatterplot of Audience Scores vs Runtime")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
cor.test(movies2$audience_score, movies2$runtime, method="pearson")
##
## Pearson's product-moment correlation
##
## data: movies2$audience_score and movies2$runtime
## t = 4.6839, df = 648, p-value = 3.431e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1055295 0.2543256
## sample estimates:
## cor
## 0.1809629
R-Rated films do not appear to fair any better or worse than other films. An ANOVA indicates the relationship is not significant (F=0.166, p=.684).
ggplot(movies2, aes(x=mpaa_rating_R, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Score vs. Movie Rating")
summary(aov(audience_score ~ mpaa_rating_R, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## mpaa_rating_R 1 68 68.1 0.166 0.684
## Residuals 649 265752 409.5
Audience scores appear to be unrelated to movie runtime. The Pearson correlation coefficient indicates a very weak and insignificant negative linear relationship (r=-.054, p=.1682).
ggplot(movies2, aes(x=thtr_rel_year, y=audience_score)) +
geom_point() +
geom_smooth(method = lm) +
ggtitle ("Scatterplot of Audience Scores vs Release Year")
cor.test(movies2$audience_score, movies2$thtr_rel_year, method="pearson")
##
## Pearson's product-moment correlation
##
## data: movies2$audience_score and movies2$thtr_rel_year
## t = -1.3797, df = 649, p-value = 0.1682
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13037988 0.02285905
## sample estimates:
## cor
## -0.05407881
Audience scores do not appear to change if the film is released during Oscar season (October - December). An ANOVA indicates the relationship is insignificant (F=1.158, p=.282).
ggplot(movies2, aes(x=oscar_season, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Oscar Season")
summary(aov(audience_score ~ oscar_season, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## oscar_season 1 473 473.4 1.158 0.282
## Residuals 649 265347 408.9
Audience scores do not appear to change if the film is released during the summer season (May - July). An ANOVA indicates the relationship is insignificant (F=.23, p=.632).
ggplot(movies2, aes(x=summer_season, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Summer Season")
summary(aov(audience_score ~ summer_season, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## summer_season 1 94 94.1 0.23 0.632
## Residuals 649 265726 409.4
Audience scores are highly correlated with the IMDB rating. The Pearson correlation coefficient indicates a very strong and significant positive linear relationship (r=.865, p<.0001).
ggplot(movies2, aes(x=imdb_rating, y=audience_score)) +
geom_point() +
geom_smooth(method = lm) +
ggtitle ("Scatterplot of Audience Scores vs IMDB Rating")
cor.test(movies2$audience_score, movies2$imdb_rating, method="pearson")
##
## Pearson's product-moment correlation
##
## data: movies2$audience_score and movies2$imdb_rating
## t = 43.89, df = 649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8441215 0.8830234
## sample estimates:
## cor
## 0.8648652
Audience scores are also correlated with the number of IMDB votes. The Pearson correlation coefficient indicates a weak, but significant positive linear relationship (r=.290, p<.0001).
ggplot(movies2, aes(x=imdb_num_votes, y=audience_score)) +
geom_point() +
geom_smooth(method = lm) +
ggtitle ("Scatterplot of Audience Scores vs IMDB Votes")
cor.test(movies2$audience_score, movies2$imdb_num_votes, method="pearson")
##
## Pearson's product-moment correlation
##
## data: movies2$audience_score and movies2$imdb_num_votes
## t = 7.7142, df = 649, p-value = 4.602e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2178208 0.3586681
## sample estimates:
## cor
## 0.2898128
Audience scores are correlated with the critics scores. The Pearson correlation coefficient indicates a strong and significant positive linear relationship (r=.704, p<.0001).
ggplot(movies2, aes(x=critics_score, y=audience_score)) +
geom_point() +
geom_smooth(method = lm) +
ggtitle ("Scatterplot of Audience Scores vs Critics Scores")
cor.test(movies2$audience_score, movies2$critics_score, method="pearson")
##
## Pearson's product-moment correlation
##
## data: movies2$audience_score and movies2$critics_score
## t = 25.273, df = 649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6633319 0.7410163
## sample estimates:
## cor
## 0.7042762
Best picture nominees receive higher audience scores. An ANOVA indicates the relationship is significant (F=30.68, p<.0001).
ggplot(movies2, aes(x=best_pic_nom, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Best Picture Nomination")
summary(aov(audience_score ~ best_pic_nom, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## best_pic_nom 1 11999 11999 30.68 4.43e-08 ***
## Residuals 649 253822 391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Best picture winners receive higher audience scores too, but interestingly, the effect is not as great as with best picture nominees. An ANOVA indicates the relationship is significant (F=8.74, p<.0032).
ggplot(movies2, aes(x=best_pic_win, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Best Picture Winner")
summary(aov(audience_score ~ best_pic_win, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## best_pic_win 1 3535 3535 8.748 0.00321 **
## Residuals 649 262285 404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Surprisingly, films featuring a best actor winner tend to have lower audience scores. However, an ANOVA indicates the relationship is not significant (F=.213, p=.645).
ggplot(movies2, aes(x=feature_film, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Best Actor Winner")
summary(aov(audience_score ~ best_actor_win, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## best_actor_win 1 87 87.0 0.213 0.645
## Residuals 649 265733 409.5
Films featuring a best actress winner appear to have slightly higher audience scores. However, an ANOVA indicates the relationship is not significant (F=.452, p=.501).
ggplot(movies2, aes(x=best_actress_win, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Best Actress Winner")
summary(aov(audience_score ~ best_actress_win, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## best_actress_win 1 185 185.2 0.452 0.501
## Residuals 649 265635 409.3
Films by a Best Director winner tend to recieve higher audience scores. An ANOVA indicates the relationship is significant (F=5.797, p=.0163).
ggplot(movies2, aes(x=best_dir_win, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Best Director Winner")
summary(aov(audience_score ~ best_dir_win, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## best_dir_win 1 2353 2353 5.797 0.0163 *
## Residuals 649 263467 406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Films in the top 200 box office list have higher audience scores. An ANOVA indicates the relationship is significant (F=5.601, p=.0182).
ggplot(movies2, aes(x=top200_box, y=audience_score)) +
geom_boxplot() +
ggtitle("Boxplot of Audience Scores by Top 200 Box Office")
summary(aov(audience_score ~ top200_box, data = movies2))
## Df Sum Sq Mean Sq F value Pr(>F)
## top200_box 1 2274 2274.3 5.601 0.0182 *
## Residuals 649 263546 406.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can include all relevant covariates in a regression model in an attempt to explain as much audience score variation as possible. However, running this full model has a cost because it removes observations where the value from any explanatory variable is missing. Our inferences will only be valid if there is no systematic reason behind missing values. Without information regarding missing values, assume the missing values occur randomly.
m_audience_score_full = lm(audience_score ~ ., data = movies2)
summary(m_audience_score_full)
##
## Call:
## lm(formula = audience_score ~ ., data = movies2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.594 -6.156 0.157 5.909 53.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.244e+02 7.749e+01 1.606 0.10886
## feature_filmyes -2.248e+00 1.687e+00 -1.332 0.18323
## dramayes 1.292e+00 8.766e-01 1.474 0.14087
## runtime -5.614e-02 2.415e-02 -2.324 0.02042 *
## mpaa_rating_Ryes -1.444e+00 8.127e-01 -1.777 0.07598 .
## thtr_rel_year -7.657e-02 3.835e-02 -1.997 0.04628 *
## oscar_seasonyes -5.333e-01 9.967e-01 -0.535 0.59280
## summer_seasonyes 9.106e-01 9.493e-01 0.959 0.33778
## imdb_rating 1.472e+01 6.067e-01 24.258 < 2e-16 ***
## imdb_num_votes 7.234e-06 4.523e-06 1.600 0.11019
## critics_score 5.748e-02 2.217e-02 2.593 0.00973 **
## best_pic_nomyes 5.321e+00 2.628e+00 2.025 0.04330 *
## best_pic_winyes -3.212e+00 4.610e+00 -0.697 0.48624
## best_actor_winyes -1.544e+00 1.179e+00 -1.310 0.19068
## best_actress_winyes -2.198e+00 1.304e+00 -1.686 0.09229 .
## best_dir_winyes -1.231e+00 1.728e+00 -0.713 0.47630
## top200_boxyes 8.478e-01 2.782e+00 0.305 0.76067
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.975 on 633 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.763, Adjusted R-squared: 0.757
## F-statistic: 127.3 on 16 and 633 DF, p-value: < 2.2e-16
From the full linear model summary, many coefficients of independent variables are not statistically significant, including feature_film
, drama
, mpaa_rating_R
, oscar_season
, summer_season
, imdb_num_votes
, best_pic_win
, best_actor_win
, best_actess_win
, best_dir_win
, and top200_box
. Bayesian Information Criterion (BIC) is the Bayesian version of model selection based on Adjusted \(R^2.\) BIC is based on model fit, while simultaneously penalizing the number of parameters in proportion to the sample size. Other model selection criteria exist, but we are seeking predictive power, not necesarily model parsimony, so BIC is a good choice.
R function stepAIC
works backwards through the model space, removing variables until BIC can be no longer be lowered. Its inputs are the full model and a penalty parameter \(k = \log(n)\).
m_audience_score_full = lm(audience_score ~ ., data = na.omit(movies2))
m_audience_score_bic = stepAIC(m_audience_score_full, k = log(nrow(na.omit(movies2))))
## Start: AIC=3083.05
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 9 62999 3076.7
## - oscar_season 1 28 63018 3076.9
## - best_pic_win 1 48 63038 3077.1
## - best_dir_win 1 51 63040 3077.1
## - summer_season 1 92 63081 3077.5
## - best_actor_win 1 171 63160 3078.3
## - feature_film 1 177 63166 3078.4
## - drama 1 216 63206 3078.8
## - imdb_num_votes 1 255 63244 3079.2
## - best_actress_win 1 283 63273 3079.5
## - mpaa_rating_R 1 314 63304 3079.8
## - thtr_rel_year 1 397 63386 3080.7
## - best_pic_nom 1 408 63398 3080.8
## - runtime 1 538 63527 3082.1
## <none> 62990 3083.0
## - critics_score 1 669 63659 3083.4
## - imdb_rating 1 58556 121545 3503.8
##
## Step: AIC=3076.67
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 26 63025 3070.5
## - best_pic_win 1 49 63047 3070.7
## - best_dir_win 1 52 63051 3070.7
## - summer_season 1 94 63093 3071.2
## - best_actor_win 1 169 63168 3071.9
## - feature_film 1 176 63175 3072.0
## - drama 1 214 63213 3072.4
## - best_actress_win 1 279 63278 3073.1
## - imdb_num_votes 1 302 63301 3073.3
## - mpaa_rating_R 1 330 63329 3073.6
## - best_pic_nom 1 404 63403 3074.3
## - thtr_rel_year 1 415 63414 3074.5
## - runtime 1 535 63534 3075.7
## <none> 62999 3076.7
## - critics_score 1 681 63680 3077.2
## - imdb_rating 1 58606 121604 3497.7
##
## Step: AIC=3070.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 46 63071 3064.5
## - best_dir_win 1 56 63081 3064.6
## - best_actor_win 1 174 63200 3065.8
## - summer_season 1 177 63202 3065.8
## - feature_film 1 182 63207 3065.9
## - drama 1 222 63247 3066.3
## - best_actress_win 1 281 63307 3066.9
## - imdb_num_votes 1 302 63328 3067.1
## - mpaa_rating_R 1 329 63354 3067.4
## - best_pic_nom 1 387 63412 3068.0
## - thtr_rel_year 1 410 63436 3068.2
## - runtime 1 587 63613 3070.0
## <none> 63025 3070.5
## - critics_score 1 679 63704 3071.0
## - imdb_rating 1 58603 121628 3491.3
##
## Step: AIC=3064.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 94 63165 3058.9
## - best_actor_win 1 163 63234 3059.7
## - feature_film 1 171 63242 3059.7
## - summer_season 1 174 63245 3059.8
## - drama 1 220 63291 3060.2
## - imdb_num_votes 1 271 63342 3060.8
## - best_actress_win 1 294 63365 3061.0
## - mpaa_rating_R 1 330 63401 3061.4
## - best_pic_nom 1 342 63414 3061.5
## - thtr_rel_year 1 397 63468 3062.1
## - runtime 1 586 63657 3064.0
## <none> 63071 3064.5
## - critics_score 1 680 63751 3065.0
## - imdb_rating 1 58858 121929 3486.4
##
## Step: AIC=3058.95
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 167 63332 3054.2
## - best_actor_win 1 171 63336 3054.2
## - feature_film 1 183 63348 3054.4
## - drama 1 228 63394 3054.8
## - imdb_num_votes 1 247 63412 3055.0
## - best_actress_win 1 299 63464 3055.5
## - best_pic_nom 1 326 63491 3055.8
## - mpaa_rating_R 1 345 63510 3056.0
## - thtr_rel_year 1 368 63533 3056.2
## <none> 63165 3058.9
## - critics_score 1 651 63816 3059.1
## - runtime 1 673 63839 3059.4
## - imdb_rating 1 58895 122061 3480.7
##
## Step: AIC=3054.19
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 156 63488 3049.3
## - best_actor_win 1 195 63527 3049.7
## - drama 1 204 63536 3049.8
## - imdb_num_votes 1 260 63592 3050.4
## - best_pic_nom 1 297 63629 3050.8
## - best_actress_win 1 297 63629 3050.8
## - mpaa_rating_R 1 356 63688 3051.4
## - thtr_rel_year 1 361 63693 3051.4
## <none> 63332 3054.2
## - runtime 1 690 64022 3054.7
## - critics_score 1 732 64064 3055.2
## - imdb_rating 1 58763 122095 3474.4
##
## Step: AIC=3049.31
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 121 63609 3044.1
## - imdb_num_votes 1 173 63661 3044.6
## - best_actor_win 1 219 63706 3045.1
## - thtr_rel_year 1 277 63765 3045.7
## - best_pic_nom 1 291 63778 3045.8
## - best_actress_win 1 306 63794 3046.0
## - mpaa_rating_R 1 453 63941 3047.5
## <none> 63488 3049.3
## - runtime 1 715 64203 3050.1
## - critics_score 1 875 64363 3051.7
## - imdb_rating 1 63189 126677 3491.8
##
## Step: AIC=3044.07
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actor_win +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 148 63757 3039.1
## - best_actor_win 1 209 63818 3039.7
## - thtr_rel_year 1 272 63881 3040.4
## - best_actress_win 1 274 63883 3040.4
## - best_pic_nom 1 307 63916 3040.7
## - mpaa_rating_R 1 391 64000 3041.6
## - runtime 1 631 64240 3044.0
## <none> 63609 3044.1
## - critics_score 1 916 64525 3046.9
## - imdb_rating 1 63434 127043 3487.2
##
## Step: AIC=3039.1
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - thtr_rel_year 1 201 63958 3034.7
## - best_actor_win 1 219 63976 3034.9
## - best_actress_win 1 266 64023 3035.3
## - mpaa_rating_R 1 367 64124 3036.4
## - best_pic_nom 1 442 64199 3037.1
## - runtime 1 519 64276 3037.9
## <none> 63757 3039.1
## - critics_score 1 879 64635 3041.5
## - imdb_rating 1 67356 131113 3501.3
##
## Step: AIC=3034.67
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 207 64165 3030.3
## - best_actress_win 1 261 64219 3030.8
## - mpaa_rating_R 1 373 64331 3032.0
## - best_pic_nom 1 447 64405 3032.7
## - runtime 1 468 64425 3032.9
## <none> 63958 3034.7
## - critics_score 1 968 64926 3038.0
## - imdb_rating 1 67172 131129 3494.9
##
## Step: AIC=3030.29
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actress_win 1 296 64461 3026.8
## - mpaa_rating_R 1 366 64531 3027.5
## - best_pic_nom 1 396 64561 3027.8
## <none> 64165 3030.3
## - runtime 1 643 64808 3030.3
## - critics_score 1 968 65133 3033.5
## - imdb_rating 1 67296 131461 3490.0
##
## Step: AIC=3026.81
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom
##
## Df Sum of Sq RSS AIC
## - best_pic_nom 1 303 64765 3023.4
## - mpaa_rating_R 1 354 64815 3023.9
## <none> 64461 3026.8
## - runtime 1 814 65275 3028.5
## - critics_score 1 957 65418 3029.9
## - imdb_rating 1 67424 131885 3485.6
##
## Step: AIC=3023.39
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## - mpaa_rating_R 1 361 65126 3020.5
## - runtime 1 638 65403 3023.3
## <none> 64765 3023.4
## - critics_score 1 1027 65792 3027.1
## - imdb_rating 1 68173 132937 3484.3
##
## Step: AIC=3020.53
## audience_score ~ runtime + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## <none> 65126 3020.5
## - runtime 1 653 65779 3020.5
## - critics_score 1 1073 66199 3024.7
## - imdb_rating 1 67874 133000 3478.2
The optimal model using BIC is:
audience_score
= \(\beta_0 + \beta_1runtime + \beta_2imdb_rating + \beta_3critics_score\)
If several models are plausible,choosing one ignores the uncertainty involved in choosing the variables. Bayesian model averaging (BMA) averages models to obtain posteriors of coefficients and predictions from new data. Apply BMA to the data set.
movies2_no_na = na.omit(movies2)
bma_audience_score = bas.lm(audience_score ~ ., data = movies2_no_na,
prior = "BIC",
modelprior = uniform())
bma_audience_score
##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies2_no_na, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06537 0.04320
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.46971 0.19984 0.09069
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.07506 0.08042 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.05774 0.88855 0.13119
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.03985 0.14435 0.14128
## best_dir_winyes top200_boxyes
## 0.06694 0.04762
summary(bma_audience_score)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06536947 0.0000 0.0000000 0.0000000
## dramayes 0.04319833 0.0000 0.0000000 0.0000000
## runtime 0.46971477 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.19984016 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.09068970 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07505684 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08042023 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05773502 0.0000 0.0000000 0.0000000
## critics_score 0.88855056 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13119140 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03984766 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14434896 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.14128087 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06693898 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04762234 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9968489 0.2543185
## PostProbs NA 0.1297 0.1293000 0.0330000
## R2 NA 0.7549 0.7525000 0.7539000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3615.2791 -3615.2822108 -3616.6482224
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2521327 0.2391994
## PostProbs 0.0327000 0.0310000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg -3616.6568544 -3616.7095127
The most likely model, which has posterior probability of 0.1297, includes variables runtime
, imdb_rating
, and critics_score
. The posterior distribution of the parameter estimators is below.
par(mfrow = c(1,2))
coef_audience_score = coefficients(bma_audience_score)
plot(coef_audience_score, subset = c(4,9,11), ask=FALSE)
We can also provide 95% credible intervals for these coefficients:
confint(coef_audience_score)
## 2.5% 97.5% beta
## Intercept 6.158063e+01 6.311649e+01 6.234769e+01
## feature_filmyes -1.179102e+00 0.000000e+00 -1.046908e-01
## dramayes 0.000000e+00 0.000000e+00 1.604413e-02
## runtime -8.195619e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes -2.114906e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year -5.317024e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes -1.074762e+00 0.000000e+00 -8.034940e-02
## summer_seasonyes -2.261352e-03 9.272773e-01 8.704545e-02
## imdb_rating 1.362503e+01 1.650201e+01 1.498203e+01
## imdb_num_votes -7.756278e-09 1.287327e-06 2.080713e-07
## critics_score 0.000000e+00 1.056672e-01 6.296648e-02
## best_pic_nomyes 0.000000e+00 4.893661e+00 5.068035e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes -2.644979e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.804581e+00 1.524930e-03 -3.088382e-01
## best_dir_winyes -1.309628e+00 5.959084e-02 -1.195011e-01
## top200_boxyes -1.313368e-01 1.245677e-01 8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Let’s find predictive values under the best predictive model, the one that has predictions closest to BMA and corresponding posterior standard deviations.
BPM_pred_audience_score = predict(bma_audience_score, estimator="BPM", se.fit=TRUE)
bma_audience_score$namesx[BPM_pred_audience_score$bestmodel+1]
## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
Let’s turn to see what characteristics lead to the highest audience scores with the BPM
model.
opt = which.max(BPM_pred_audience_score$fit)
t(movies2_no_na[opt, ])
## [,1]
## audience_score "97"
## feature_film "yes"
## drama "no"
## runtime "202"
## mpaa_rating_R "yes"
## thtr_rel_year "1974"
## oscar_season "yes"
## summer_season "no"
## imdb_rating "9"
## imdb_num_votes "749783"
## critics_score "97"
## best_pic_nom "yes"
## best_pic_win "yes"
## best_actor_win "yes"
## best_actress_win "yes"
## best_dir_win "yes"
## top200_box "no"
Here is a 95% prediction interval for the audience score for a movie with covariates at the levels of the movie specified by opt
.:
ci_audience_score = confint(BPM_pred_audience_score, parm="pred")
ci_audience_score[opt,]
## 2.5% 97.5% pred
## 77.42096 117.65506 97.53801
The optimal model was: audience_score
= \(\beta_0 + \beta_1runtime + \beta_2imdb_rating + \beta_3critics_score\).
The optimal model emerged both from the BIC method, and from Bayesian model averaging.