library(ggplot2)
library(grid)
library(gridExtra)
library(dplyr)
library(statsr)
library(BAS)
library(psych)
## Warning: package 'psych' was built under R version 3.6.2
library(MASS)
library(tidyverse)
library(broom)
Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.
load("movies.Rdata")
This data is from imdb.com, rottentomatoes.com, and flixster.com. There are 651 randomly sampled movies produced and released before 2016.
Because the data is randomly sampled from existing datasets, and not derived from experiments, therefore we can infer correlations but not causation. Due to the random sampling, we can assume our inferences about the data are generalizable to other movies outside of the dataset that were also produced and released before 2016.
There are several potential sources of bias to keep in mind when analyzing this data set, namely that the movies are limited to English language and American-produced movies. We cannot assume that we can extrapolate our findings to other languages and countries.
Create the following categories to use as predictors:
feature_film: “yes” if title_type is Feature Film, “no” otherwise
drama: “yes” if genre is Drama, “no” otherwise
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise
oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
movies$feature_film <- ifelse(movies$title_type == 'Feature Film', 'Yes', 'No')
movies$drama <- ifelse(movies$genre == 'Drama', 'Yes', 'No')
movies$mpaa_rating_R <- ifelse(movies$mpaa_rating == 'R', 'Yes', 'No')
movies$oscar_season <- ifelse(movies$thtr_rel_month %in% seq(11, 13), 'Yes', 'No')
movies$summer_season <- ifelse(movies$thtr_rel_month %in% seq(5,8), 'Yes', 'No')
We are also going to drop the columns ‘title’ and ‘studio’ because there are too many distinct
We first to take a look at which of our new variables might be most predictive of a movie’s audience score. The below boxplots visually show the distribution, quartiles and median values for the new variables:
plot1 <- ggplot(movies, aes(x = feature_film,
y = audience_score,
fill = feature_film)) + geom_boxplot(color = 'blue', alpha = 0.5) +
theme_bw() + geom_smooth() +
labs(x = "Feature Film", y = "Audience Score",
fill = "feature_film") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')
plot2 <- ggplot(movies, aes(x = drama,
y = audience_score,
fill = drama)) + geom_boxplot(color = 'blue', alpha = 0.5) +
theme_bw() + geom_smooth() +
labs(x = "Drama", y = "Audience Score",
fill = "drama") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')
plot3 <- ggplot(movies, aes(x = mpaa_rating_R,
y = audience_score,
fill = mpaa_rating_R)) + geom_boxplot(color = 'blue', alpha = 0.5) +
theme_bw() + geom_smooth() +
labs(x = "Rated R", y = "Audience Score",
fill = "mpaa_rating_R") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')
plot4 <- ggplot(movies, aes(x = oscar_season,
y = audience_score,
fill = oscar_season)) + geom_boxplot(color = 'blue', alpha = 0.5) +
theme_bw() + geom_smooth() +
labs(x = "Oscar Season", y = "Audience Score",
fill = "oscar_season") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')
plot5 <- ggplot(movies, aes(x = summer_season,
y = audience_score,
fill = summer_season)) + geom_boxplot(color = 'blue', alpha = 0.5) +
theme_bw() + geom_smooth() +
labs(x = "Summer Season", y = "Audience Score",
fill = "summer_season") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust =0)) + theme(legend.position= 'none')
grid.arrange(plot1, plot2,
plot3, plot4,
plot5,
nrow=3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Whether or not the movie is a feature film seems to be a very promising predictor of audience score. The median audience score for feature films is 62 while non-feature films is 86. Next we want to look at some of the summary stats for audience score, and what number of movies fall into each of the categories.
hist(movies$audience_score, main="Distribution of Audience Score", xlab="audience score")
describe(movies$audience_score)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 651 62.36 20.22 65 63.39 23.72 11 97 86 -0.36 -0.91
## se
## X1 0.79
The sample size of movies is 651 and median audience score is 65. The distribution has a left skew, and the mean is slightly lower than the median. We identified feature film as a potentially interesting predictor, so let’s see how many movies are feature films vs. not:
ggplot(data=movies) + geom_bar(mapping=aes(x=feature_film, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("Proportion of Feature Films")
The vast majority of all movies in the datasets are feature films, with only ~11% not being feature films. While feature_film may be helpful for prediction, it may not help us predict the variation in scores between non feature films, so we will need to use other predictors to explain these differences in movie scores - perhaps ‘drama’.
First, we’ll create the full model with all variables in the dataset. Then we’ll use AIC to begin eliminating variables from the model.
drop_cols <- c("title","genre", "studio", "thtr_rel_month",
"thtr_rel_day", "dvd_rel_year", "dvd_rel_month",
"dvd_rel_day", "critics_rating", "audience_rating",
"director", "actor1", "actor2", "actor3", "actor4",
"actor5", "imdb_url", "rt_url", "title_type", "mpaa_rating")
movies_reduced = movies[ , !(names(movies) %in% drop_cols)]
names(movies_reduced)
## [1] "runtime" "thtr_rel_year" "imdb_rating"
## [4] "imdb_num_votes" "critics_score" "audience_score"
## [7] "best_pic_nom" "best_pic_win" "best_actor_win"
## [10] "best_actress_win" "best_dir_win" "top200_box"
## [13] "feature_film" "drama" "mpaa_rating_R"
## [16] "oscar_season" "summer_season"
We will also exclude rows with nulls to avoid non response bias.
movies_reduced = na.omit(movies_reduced)
Next, we’ll create the full model with all variables in the dataset. Then we’ll use AIC to begin eliminating variables from the model.
m_full <- lm(audience_score ~ ., data = movies_reduced)
tidy(m_full)
## # A tibble: 17 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 123. 77.6 1.58 1.15e- 1
## 2 runtime -0.0596 0.0242 -2.46 1.43e- 2
## 3 thtr_rel_year -0.0756 0.0384 -1.97 4.94e- 2
## 4 imdb_rating 14.7 0.607 24.3 2.19e-92
## 5 imdb_num_votes 0.00000729 0.00000452 1.61 1.08e- 1
## 6 critics_score 0.0576 0.0222 2.60 9.61e- 3
## 7 best_pic_nomyes 5.07 2.63 1.93 5.41e- 2
## 8 best_pic_winyes -3.12 4.61 -0.676 4.99e- 1
## 9 best_actor_winyes -1.56 1.18 -1.32 1.86e- 1
## 10 best_actress_winyes -2.20 1.30 -1.69 9.19e- 2
## 11 best_dir_winyes -1.27 1.73 -0.738 4.61e- 1
## 12 top200_boxyes 0.639 2.79 0.229 8.19e- 1
## 13 feature_filmYes -2.30 1.69 -1.36 1.73e- 1
## 14 dramaYes 1.32 0.876 1.51 1.32e- 1
## 15 mpaa_rating_RYes -1.44 0.813 -1.78 7.62e- 2
## 16 oscar_seasonYes 0.380 1.12 0.340 7.34e- 1
## 17 summer_seasonYes 1.23 0.902 1.37 1.72e- 1
Now we will use the step AIC function to backwards fit the model by process of elimination.
stepAIC.model <- stepAIC(m_full, direction = "backward", trace = TRUE)
## Start: AIC=3007.11
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + feature_film +
## drama + mpaa_rating_R + oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - top200_box 1 5 63012 3005.2
## - oscar_season 1 12 63018 3005.2
## - best_pic_win 1 46 63052 3005.6
## - best_dir_win 1 54 63061 3005.7
## - best_actor_win 1 174 63181 3006.9
## - feature_film 1 185 63192 3007.0
## - summer_season 1 186 63192 3007.0
## <none> 63007 3007.1
## - drama 1 226 63233 3007.4
## - imdb_num_votes 1 259 63265 3007.8
## - best_actress_win 1 284 63290 3008.0
## - mpaa_rating_R 1 314 63321 3008.3
## - best_pic_nom 1 371 63377 3008.9
## - thtr_rel_year 1 386 63392 3009.1
## - runtime 1 601 63608 3011.3
## - critics_score 1 672 63678 3012.0
## - imdb_rating 1 58542 121549 3432.2
##
## Step: AIC=3005.17
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + feature_film + drama +
## mpaa_rating_R + oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 14 63025 3003.3
## - best_pic_win 1 46 63058 3003.6
## - best_dir_win 1 55 63067 3003.7
## - best_actor_win 1 173 63185 3004.9
## - feature_film 1 184 63196 3005.1
## - summer_season 1 190 63201 3005.1
## <none> 63012 3005.2
## - drama 1 224 63236 3005.5
## - best_actress_win 1 281 63293 3006.1
## - imdb_num_votes 1 299 63311 3006.2
## - mpaa_rating_R 1 327 63338 3006.5
## - best_pic_nom 1 367 63379 3006.9
## - thtr_rel_year 1 400 63411 3007.3
## - runtime 1 600 63612 3009.3
## - critics_score 1 681 63693 3010.2
## - imdb_rating 1 58608 121619 3430.6
##
## Step: AIC=3003.31
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + feature_film + drama +
## mpaa_rating_R + summer_season
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 46 63071 3001.8
## - best_dir_win 1 56 63081 3001.9
## - best_actor_win 1 174 63200 3003.1
## - summer_season 1 177 63202 3003.1
## - feature_film 1 182 63207 3003.2
## <none> 63025 3003.3
## - drama 1 222 63247 3003.6
## - best_actress_win 1 281 63307 3004.2
## - imdb_num_votes 1 302 63328 3004.4
## - mpaa_rating_R 1 329 63354 3004.7
## - best_pic_nom 1 387 63412 3005.3
## - thtr_rel_year 1 410 63436 3005.5
## - runtime 1 587 63613 3007.3
## - critics_score 1 679 63704 3008.3
## - imdb_rating 1 58603 121628 3428.6
##
## Step: AIC=3001.78
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win + feature_film + drama + mpaa_rating_R + summer_season
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 94 63165 3000.7
## - best_actor_win 1 163 63234 3001.5
## - feature_film 1 171 63242 3001.5
## - summer_season 1 174 63245 3001.6
## <none> 63071 3001.8
## - drama 1 220 63291 3002.0
## - imdb_num_votes 1 271 63342 3002.6
## - best_actress_win 1 294 63365 3002.8
## - mpaa_rating_R 1 330 63401 3003.2
## - best_pic_nom 1 342 63414 3003.3
## - thtr_rel_year 1 397 63468 3003.9
## - runtime 1 586 63657 3005.8
## - critics_score 1 680 63751 3006.8
## - imdb_rating 1 58858 121929 3428.2
##
## Step: AIC=3000.75
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## feature_film + drama + mpaa_rating_R + summer_season
##
## Df Sum of Sq RSS AIC
## - summer_season 1 167 63332 3000.5
## - best_actor_win 1 171 63336 3000.5
## - feature_film 1 183 63348 3000.6
## <none> 63165 3000.7
## - drama 1 228 63394 3001.1
## - imdb_num_votes 1 247 63412 3001.3
## - best_actress_win 1 299 63464 3001.8
## - best_pic_nom 1 326 63491 3002.1
## - mpaa_rating_R 1 345 63510 3002.3
## - thtr_rel_year 1 368 63533 3002.5
## - critics_score 1 651 63816 3005.4
## - runtime 1 673 63839 3005.6
## - imdb_rating 1 58895 122061 3426.9
##
## Step: AIC=3000.46
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## feature_film + drama + mpaa_rating_R
##
## Df Sum of Sq RSS AIC
## - feature_film 1 156 63488 3000.1
## <none> 63332 3000.5
## - best_actor_win 1 195 63527 3000.5
## - drama 1 204 63536 3000.6
## - imdb_num_votes 1 260 63592 3001.1
## - best_pic_nom 1 297 63629 3001.5
## - best_actress_win 1 297 63629 3001.5
## - mpaa_rating_R 1 356 63688 3002.1
## - thtr_rel_year 1 361 63693 3002.2
## - runtime 1 690 64022 3005.5
## - critics_score 1 732 64064 3005.9
## - imdb_rating 1 58763 122095 3425.1
##
## Step: AIC=3000.06
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## drama + mpaa_rating_R
##
## Df Sum of Sq RSS AIC
## - drama 1 121 63609 2999.3
## - imdb_num_votes 1 173 63661 2999.8
## <none> 63488 3000.1
## - best_actor_win 1 219 63706 3000.3
## - thtr_rel_year 1 277 63765 3000.9
## - best_pic_nom 1 291 63778 3001.0
## - best_actress_win 1 306 63794 3001.2
## - mpaa_rating_R 1 453 63941 3002.7
## - runtime 1 715 64203 3005.3
## - critics_score 1 875 64363 3007.0
## - imdb_rating 1 63189 126677 3447.1
##
## Step: AIC=2999.3
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## mpaa_rating_R
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 148 63757 2998.8
## <none> 63609 2999.3
## - best_actor_win 1 209 63818 2999.4
## - thtr_rel_year 1 272 63881 3000.1
## - best_actress_win 1 274 63883 3000.1
## - best_pic_nom 1 307 63916 3000.4
## - mpaa_rating_R 1 391 64000 3001.3
## - runtime 1 631 64240 3003.7
## - critics_score 1 916 64525 3006.6
## - imdb_rating 1 63434 127043 3447.0
##
## Step: AIC=2998.81
## audience_score ~ runtime + thtr_rel_year + imdb_rating + critics_score +
## best_pic_nom + best_actor_win + best_actress_win + mpaa_rating_R
##
## Df Sum of Sq RSS AIC
## <none> 63757 2998.8
## - thtr_rel_year 1 201 63958 2998.9
## - best_actor_win 1 219 63976 2999.0
## - best_actress_win 1 266 64023 2999.5
## - mpaa_rating_R 1 367 64124 3000.5
## - best_pic_nom 1 442 64199 3001.3
## - runtime 1 519 64276 3002.1
## - critics_score 1 879 64635 3005.7
## - imdb_rating 1 67356 131113 3465.4
AIC.lm <- lm(audience_score ~ runtime + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win + mpaa_rating_R, data=movies)
AIC.lm$coefficients
## (Intercept) runtime thtr_rel_year
## 70.10675281 -0.05115515 -0.05122557
## imdb_rating critics_score best_pic_nomyes
## 15.00149242 0.06409989 4.88277038
## best_actor_winyes best_actress_winyes mpaa_rating_RYes
## -1.73481942 -2.11568281 -1.50528039
Using AIC to eliminate variables from the model, the final model will be:
audience_score = intercept + -0.05115515*runtime + -0.05122557*thtr_rel_year + 15.00149242*imdb_rating + 0.06409989*critics_score + 4.88277038*best_pic_nmyes + -1.73481942*best_actor_winyes + -2.11568281*best_actress_winyes + -1.50528039*mpaa_rating_RYes
To evaluate the model, we can look at the residuals:
plot(movies_reduced$audience_score ~ AIC.lm$residuals)
ggplot(data=AIC.lm, aes(x=AIC.lm$residuals)) + geom_histogram(bins=20)
The residuals appear normally distributed.
Next, we will try a different approach for model selection - Bayesian Model Averaging.
We learned that often, there are several models that equally likely and choosing only one ignores the uncertainy we have in choosing variables to include. BMA allows us to average multiple plausible models to get posteriors of coefficients.
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_movies <- bas.lm(audience_score ~ ., data = movies_reduced,
prior = "AIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_movies
##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_reduced, prior = "AIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept runtime thtr_rel_year
## 1.0000 0.8725 0.5729
## imdb_rating imdb_num_votes critics_score
## 1.0000 0.4538 0.9670
## best_pic_nomyes best_pic_winyes best_actor_winyes
## 0.6778 0.3013 0.5181
## best_actress_winyes best_dir_winyes top200_boxyes
## 0.5925 0.3634 0.3022
## feature_filmYes dramaYes mpaa_rating_RYes
## 0.3824 0.3833 0.6831
## oscar_seasonYes summer_seasonYes
## 0.2807 0.4585
# Top 5 most probably models
summary(bma_movies)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## runtime 0.8725072 1.0000 1.0000000 1.0000000
## thtr_rel_year 0.5729427 1.0000 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.4537507 0.0000 0.0000000 0.0000000
## critics_score 0.9670063 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.6777549 1.0000 1.0000000 1.0000000
## best_pic_winyes 0.3013074 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.5180921 1.0000 1.0000000 0.0000000
## best_actress_winyes 0.5925182 1.0000 1.0000000 1.0000000
## best_dir_winyes 0.3633570 0.0000 0.0000000 0.0000000
## top200_boxyes 0.3022418 0.0000 0.0000000 0.0000000
## feature_filmYes 0.3824117 0.0000 0.0000000 0.0000000
## dramaYes 0.3832989 0.0000 0.0000000 0.0000000
## mpaa_rating_RYes 0.6831172 1.0000 1.0000000 1.0000000
## oscar_seasonYes 0.2806655 0.0000 0.0000000 0.0000000
## summer_seasonYes 0.4584914 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9783606 0.9298905
## PostProbs NA 0.0019 0.0019000 0.0018000
## R2 NA 0.7601 0.7593000 0.7585000
## dim NA 9.0000 8.0000000 7.0000000
## logmarg NA -3604.4206 -3604.4424632 -3604.4932746
## model 4 model 5
## Intercept 1.0000000 1.0000000
## runtime 1.0000000 1.0000000
## thtr_rel_year 1.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 1.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 1.0000000 1.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## feature_filmYes 0.0000000 0.0000000
## dramaYes 0.0000000 0.0000000
## mpaa_rating_RYes 1.0000000 1.0000000
## oscar_seasonYes 0.0000000 0.0000000
## summer_seasonYes 0.0000000 1.0000000
## BF 0.8903369 0.7914465
## PostProbs 0.0017000 0.0015000
## R2 0.7592000 0.7592000
## dim 8.0000000 8.0000000
## logmarg -3604.5367415 -3604.6544792
We can see that the posterior probability that imdb_rating is included in the model is 1.000, and critics_score is almost 1.
The most likely models (model1 and model2) have the same posterior probability of 0.0019.
We can also visualize the posterior distribution of the coefficients under the model averaging approach. We graph the posterior distribution of the coefficients for the variables we created earlier.
# Obtain the coefficients from the model `bma_movies`
coef_movies <- coefficients(bma_movies)
# Look at the variables we created in the data set
plot(coef_movies, subset = c(13, 14,15,16,17), ask = FALSE)
We can also quantify our uncertainty in the coefficients in our model by generating 95% credible intervals:
confint(coef_movies)
## 2.5% 97.5% beta
## Intercept 6.161619e+01 6.315901e+01 6.234769e+01
## runtime -9.637600e-02 0.000000e+00 -4.954262e-02
## thtr_rel_year -1.236302e-01 5.869476e-03 -3.445803e-02
## imdb_rating 1.370364e+01 1.621687e+01 1.491842e+01
## imdb_num_votes -1.386757e-06 1.269664e-05 2.448273e-06
## critics_score 0.000000e+00 1.008312e-01 6.288512e-02
## best_pic_nomyes -4.725224e-03 8.907068e+00 3.093246e+00
## best_pic_winyes -8.273182e+00 4.761188e+00 -5.357266e-01
## best_actor_winyes -3.613718e+00 3.312635e-01 -8.906023e-01
## best_actress_winyes -4.546457e+00 0.000000e+00 -1.274460e+00
## best_dir_winyes -4.199131e+00 8.795338e-01 -5.628675e-01
## top200_boxyes -2.372924e+00 5.415304e+00 4.370515e-01
## feature_filmYes -4.114358e+00 1.006133e+00 -6.183369e-01
## dramaYes -5.171189e-01 2.141334e+00 3.298391e-01
## mpaa_rating_RYes -2.876618e+00 1.112439e-02 -1.023176e+00
## oscar_seasonYes -1.302998e+00 1.679704e+00 6.114252e-02
## summer_seasonYes -3.304616e-01 2.409419e+00 5.098497e-01
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
In context, a 1 point increase in imdb_rating predicts an increase in audience score of +14.91842.
Now, we will pick a movie from 2016 (not in sample) and predict audience_score for this movie using the BMA derived model. I am choosing the movie Deadpool.
deadpool <- data.frame(imdb_rating = 8.0, runtime = 108, critics_score = 85, mpaa_rating_R="Yes", thtr_rel_year=2016, best_pic_nom="no",best_actor_win="no", best_actress_win="no", summer_season = "No", oscar_season = "Yes", best_pic_win = 'no',
best_dir_win = 'no', feature_film = "Yes", drama = "No", imdb_num_votes = 899526, top200_box='yes')
Then, we will predict the audience_score and its 95% credible interval, using the AIC step model and the BMA model to contrast results:
predict(AIC.lm, newdata = deadpool, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 85.26639 65.57666 104.9561
deadpool_bma_pred <- predict(bma_movies, newdata=deadpool, estimator="BMA", se.fit=TRUE, interval="prediction", level = 0.95)
deadpool_bma_pred$Ybma
## [,1]
## [1,] 87.70602
The actual audience score is 90 - the BMA model does a better job of predicting.
The BMA model performing better than the model chosen based solely on AIC is perhaps not surprising since that is the model that allows us to incorporate our uncertainty in the coefficients. This illustrates why it is important for us to express uncertainty in our model in order to make it better at prediction.
I also noted that the residuals may have violated some of the assumptions of normality, and some further analysis could be done to see if transforming audience_score or other variables would improve model fit.
The models also did not benefit from any strong priors about the relationships between variables and audience score. Priors could have been helpful to make us more certain in our estimates of coefficients.