library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)load("movies.Rdata")We will investigate one question using the “movies” dataset. This dataset includes information from Rotten Tomatoes and IMDB and is comprised of 651 randomly sampled movies produced and released before 2016. The results are generalizable to movies of all genres in US, as this is an observational study that uses random sampling. Since there is no random assignment, we cannot make causal conclusions. As there are no volunteers employed we can exclude the possibility of voluntary response bias. There is not non-response bias as well. However, there might coveniece bias since the movies included are American productions. * * *
We will begin by excluding all NAs from our data.
nonamovies = na.omit(movies)Now, we will create the new variables needed.
nonamovies <- nonamovies %>%
mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"), drama = ifelse(genre == "Drama", "yes", "no"), mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"), oscar_season = ifelse(thtr_rel_month %in% c("10", "11","12"), "yes", "no"), summer_season = ifelse(thtr_rel_month %in% c("5", "6","7", "8"), "yes", "no"))We will procceed with some plots and summaries of the above mentioned variables with the variable “audience score”.
ggplot(nonamovies, aes(x = factor(feature_film), y = audience_score)) +
geom_boxplot() + coord_flip() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10)) nonamovies %>%
group_by(feature_film) %>%
summarise(mean_as = mean(audience_score), med_as = median(audience_score))## # A tibble: 2 x 3
## feature_film mean_as med_as
## <chr> <dbl> <dbl>
## 1 no 82.5 86
## 2 yes 60.6 63
ggplot(nonamovies, aes(x = factor(drama), y = audience_score)) +
geom_boxplot() + coord_flip() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10)) nonamovies %>%
group_by(drama) %>%
summarise(mean_as = mean(audience_score), med_as = median(audience_score))## # A tibble: 2 x 3
## drama mean_as med_as
## <chr> <dbl> <dbl>
## 1 no 59.4 59
## 2 yes 65.3 70
ggplot(nonamovies, aes(x = factor(mpaa_rating_R), y = audience_score)) +
geom_boxplot() + coord_flip() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10)) nonamovies %>%
group_by(mpaa_rating_R) %>%
summarise(mean_as = mean(audience_score), med_as = median(audience_score))## # A tibble: 2 x 3
## mpaa_rating_R mean_as med_as
## <chr> <dbl> <dbl>
## 1 no 62.0 65
## 2 yes 62.4 65
ggplot(nonamovies, aes(x = factor(oscar_season), y = audience_score)) +
geom_boxplot() + coord_flip() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10)) nonamovies %>%
group_by(oscar_season) %>%
summarise(mean_as = mean(audience_score), med_as = median(audience_score))## # A tibble: 2 x 3
## oscar_season mean_as med_as
## <chr> <dbl> <dbl>
## 1 no 61.5 63.5
## 2 yes 63.9 69
ggplot(nonamovies, aes(x = factor(summer_season), y = audience_score)) +
geom_boxplot() + coord_flip() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10)) nonamovies %>%
group_by(summer_season) %>%
summarise(mean_as = mean(audience_score), med_as = median(audience_score))## # A tibble: 2 x 3
## summer_season mean_as med_as
## <chr> <dbl> <dbl>
## 1 no 62.4 65
## 2 yes 61.9 64
As we can see in summary statistics of the variables used for our research question, for all variables (and their corresponding yes/no categories) the values of the median and the mean are fairly close to one another.
when it comes to the plots, we can notice major differences for the variables “feature film” and “drama” and minor ones for “mpaa_rating_R”, “oscar_season” and “summer_season”.
We will develop a Bayesian regression model using Bayesian model averaging (BMA) to predict “audience_score”" from the explanatory variables: “feature_film”, “drama”, “runtime”, “mpaa_rating_R”, “thtr_rel_year”, “oscar_season”, “summer_season”, “imdb_rating”, “imdb_num_votes”, “critics_score”, “best_pic_nom”, “best_pic_win”, “best_actor_win”, “best_actress_win”, “best_dir_win”, “top200_box”.
nonamovies_red = nonamovies %>%
select(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
bma_audsco <- bas.lm(audience_score ~ . -audience_score, data = nonamovies_red,
prior = "BIC",
modelprior = uniform())
bma_audsco##
## Call:
## bas.lm(formula = audience_score ~ . - audience_score, data = nonamovies_red,
## prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.05876 0.04509
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.51400 0.16498 0.08089
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.06526 0.07935 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04077 0.11565 0.14770
## best_dir_winyes top200_boxyes
## 0.06701 0.04876
summary(bma_audsco)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.05876309 0.0000 0.0000000 0.0000000
## dramayes 0.04508592 0.0000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.16498090 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.06525993 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07935408 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1558 0.1358000 0.0319000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0318000 0.0289000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
The most likely model, which has posterior probability of 0.1558, includes an intercept, runtime, imdb_rating and critics_score is the one we select. While a posterior probability of 0.1558 is small, it is much larger than the uniform prior probability assigned to it, since there are \(2^{17}\) possible models.
plot(bma_audsco, which=1)round(summary(bma_audsco), 3)## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.000 1.000 1.000 1.000 1.000
## feature_filmyes 0.059 0.000 0.000 0.000 0.000
## dramayes 0.045 0.000 0.000 0.000 0.000
## runtime 0.514 1.000 0.000 0.000 1.000
## mpaa_rating_Ryes 0.165 0.000 0.000 0.000 1.000
## thtr_rel_year 0.081 0.000 0.000 0.000 0.000
## oscar_seasonyes 0.065 0.000 0.000 0.000 0.000
## summer_seasonyes 0.079 0.000 0.000 0.000 0.000
## imdb_rating 1.000 1.000 1.000 1.000 1.000
## imdb_num_votes 0.062 0.000 0.000 0.000 0.000
## critics_score 0.920 1.000 1.000 1.000 1.000
## best_pic_nomyes 0.132 0.000 0.000 0.000 0.000
## best_pic_winyes 0.041 0.000 0.000 0.000 0.000
## best_actor_winyes 0.116 0.000 0.000 0.000 0.000
## best_actress_winyes 0.148 0.000 0.000 1.000 0.000
## best_dir_winyes 0.067 0.000 0.000 0.000 0.000
## top200_boxyes 0.049 0.000 0.000 0.000 0.000
## BF NA 1.000 0.872 0.205 0.204
## PostProbs NA 0.156 0.136 0.032 0.032
## R2 NA 0.748 0.746 0.747 0.750
## dim NA 4.000 3.000 4.000 5.000
## logmarg NA -3434.752 -3434.889 -3436.338 -3436.342
## model 5
## Intercept 1.000
## feature_filmyes 0.000
## dramayes 0.000
## runtime 1.000
## mpaa_rating_Ryes 0.000
## thtr_rel_year 0.000
## oscar_seasonyes 0.000
## summer_seasonyes 0.000
## imdb_rating 1.000
## imdb_num_votes 0.000
## critics_score 1.000
## best_pic_nomyes 1.000
## best_pic_winyes 0.000
## best_actor_winyes 0.000
## best_actress_winyes 0.000
## best_dir_winyes 0.000
## top200_boxyes 0.000
## BF 0.185
## PostProbs 0.029
## R2 0.750
## dim 5.000
## logmarg -3436.438
About the coeffficients: 1) We are 5.9% sure that “feature_film” should be included. 2) We are 4,5% sure that “drama” should be included. 3) We are 51.4% sure that “runtime” should be included. 4) We are 16,5% sure that “mpaa_rating_R” should be included. 5) We are 8,1% sure that “thtr_rel_year” should be included. 6) We are 6,5% sure that “oscar_season” should be included. 7) We are 7,9% sure that “summer_season” should be included. 8) We are 100% sure that “imdb_rating” should be included. 9) We are 6,2% sure that “imdb_num_votes” should be included. 10) We are 92% sure that “critics_score” should be included. 11) We are 13,2% sure that “best_pic_nom” should be included. 12) We are 4,1% sure that “best_pic_win” should be included. 13) We are 11,6% sure that “best_actor_win” should be included. 14) We are 14,8% sure that “best_actress_win” should be included. 15) We are 6,7% sure that “best_dir_winyes” should be included. 16) We are 4,9% sure that “top200_boxyes” should be included.
At this point, we want to use the model we created earlier, bma_audsco to predict the popularity of a movie based on its ‘audience_score’ , The land(2016). The data used to make this prediction come from Rotten Tomatoes (https://www.rottentomatoes.com/m/the_land_2016#audience_reviews) and IMDB (https://www.imdb.com/title/tt5164412/?ref_=ttawd_awd_tt).
theland <- data.frame(audience_score=61,feature_film="yes", drama="yes", runtime = 104,mpaa_rating_R = "yes", thtr_rel_year = 2016, oscar_season = "no", summer_season = "yes", imdb_rating = 6.4, imdb_num_votes = 1393, critics_score = 67, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "no")
audience_score_pred = predict(bma_audsco, newdata=theland, estimator="BMA",se.fit=TRUE)
audience_score_pred$fit## [1] 61.61292
We will also construct a credible interval around this prediction, which will provide a measure of uncertainty around the prediction.
ci_bma_audsco = confint(audience_score_pred, estimator="BMA")
ci_bma_audsco## 2.5% 97.5% pred
## [1,] 41.66233 81.57263 61.61292
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
This model predicts that the audience score of the movie would roughly be 61,6% (which is almost exactly the true audience score). Also, there is a 95% chance that the score is between 41,7% and 81,8%.
As final remarks, I believe that we have ended up in a parsimonious model that makes very close to reality predictions about the future popularity of a movie. The prediction for the movie “The land” is one of an audience score of 61,6% when the actual score of the movie is 61% (which actually belongs to credible interval). As seen above, our model has a quite low posterior probability, maybe that could have been countered with more/different variables offered.