Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.2

library(statsr)
library(BAS)

## Warning: package 'BAS' was built under R version 3.4.2

library(psych)

Load data

movies<-get(load("movies.Rdata"))

Part 1: Data

The study is an observational one. The main contribution to the generazability of inference results is that the used data set represents a simple random sample of movies released before 2016. So we can tell about causality which is unbiased in the sense of free of collection error( see https://en.wikipedia.org/wiki/Simple_random_sample.). But there are biases in the data set that is an obstacle to the generazability as data structure uses several ratings results which can be biased, films released only before 2016 and produced only in the US.

Part 2: Data manipulation

We create new variables according to the assignment:

data_movies<-movies %>% mutate(feature_film = factor(ifelse(title_type=="Feature Film",c("yes"), c("no")),levels=c("no","yes"))) %>%
mutate(drama = factor(ifelse(genre=="Drama", c("yes"), c("no")),levels=c("no","yes")))%>%
mutate(mpaa_rating_R =factor(ifelse(mpaa_rating=="R", c("yes"), c("no")),levels=c("no","yes")))%>%
mutate(oscar_season = factor(ifelse(thtr_rel_month%in% 10:12, c("yes"), c("no")),levels=c("no","yes")))%>%
mutate(summer_season = factor(ifelse(thtr_rel_month %in% 5:8, c("yes"), c("no")),levels=c("no","yes")))

Part 3: Exploratory data analysis

Now we perform an explorartory data analysis of the relationships between variable “audience_score” and newly constructed variables: First, we look at summary statistics of the variables and their relationships:

describe(data_movies$audience_score)

##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 651 62.36 20.22     65   63.39 23.72  11  97    86 -0.36    -0.91
##      se
## X1 0.79

summary(data_movies$feature_film)

##  no yes 
##  60 591

summary(data_movies$drama)

##  no yes 
## 346 305

summary(data_movies$oscar_season)

##  no yes 
## 460 191

summary(data_movies$summer_season)

##  no yes 
## 443 208

aggr1<-aggregate(audience_score ~ feature_film, data_movies,mean)
print(aggr1)

##   feature_film audience_score
## 1           no       81.05000
## 2          yes       60.46531

aggr2<-aggregate(audience_score ~ drama, data_movies,mean)
print(aggr2)

##   drama audience_score
## 1    no       59.73121
## 2   yes       65.34754

aggr3<-aggregate(audience_score ~ oscar_season, data_movies,mean)
print(aggr3)

##   oscar_season audience_score
## 1           no       61.81304
## 2          yes       63.68586

aggr4<-aggregate(audience_score ~ summer_season, data_movies,mean)
print(aggr4)

##   summer_season audience_score
## 1            no       62.62302
## 2           yes       61.80769

The summary statistics of variables show that the ditribution of “audience score” is left-skewed, most films in the data set are fearture films, about half of films are drama by genre, about one of three films are released in a summer season and about 0.3 films are released in an oscar season. The summary statistics of ‘audience score’ by the factor variables show that feature films at average are scored lower the other types films; drama films are scored higher than other genres; films released in an oscar season in general are scored a little bit higher while in released in summer season a little bit less. We visualize the relationships by creating boxplots and credible interval by “bayes inference” function:

boxplot(audience_score~feature_film,data=data_movies)

bayes_inference(y = audience_score, x = feature_film, data = data_movies, statistic = "mean", type = "ci")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming independent Jeffrey's priors for mu and sigma^2)
## 
## 95% Cred. Int.: (16.7349 , 24.4434)
## 
## Post. mean   = 20.5863 
## Post. median = 20.5856 
## Post. mode   = 20.7087

We see that the average audience score for feature film is lower than other type films and the credible interval proves the difference of means.

boxplot(audience_score~drama,data=data_movies)

bayes_inference(y = audience_score, x = drama, data = data_movies, statistic = "mean", type = "ci")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 346, y_bar_no = 59.7312, s_no = 21.2775
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming independent Jeffrey's priors for mu and sigma^2)
## 
## 95% Cred. Int.: (-8.6847 , -2.5438)
## 
## Post. mean   = -5.6159 
## Post. median = -5.6152 
## Post. mode   = -5.603

We see that average score for drama film is higher than other genre films and credible interval proves the difference of means.

boxplot(audience_score~oscar_season,data=data_movies)

bayes_inference(y = audience_score, x = oscar_season, data = data_movies, statistic = "mean", type = "ci")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 191, y_bar_yes = 63.6859, s_yes = 20.4612
## (Assuming independent Jeffrey's priors for mu and sigma^2)
## 
## 95% Cred. Int.: (-5.3181 , 1.5847)
## 
## Post. mean   = -1.8697 
## Post. median = -1.8695 
## Post. mode   = -1.8299

We see that there is very small difference in the average audience score for films released during the oscar season and credible interval of the bayes inference contains 0.

boxplot(audience_score~summer_season,data=data_movies)

bayes_inference(y = audience_score, x = summer_season, data = data_movies, statistic = "mean", type = "ci")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 443, y_bar_no = 62.623, s_no = 20.3857
## n_yes = 208, y_bar_yes = 61.8077, s_yes = 19.9083
## (Assuming independent Jeffrey's priors for mu and sigma^2)
## 
## 95% Cred. Int.: (-2.5144 , 4.1297)
## 
## Post. mean   = 0.8122 
## Post. median = 0.8149 
## Post. mode   = 0.928

We see that there is almost no difference in the average audience score for films released during the summer season and the bayes credible interval contains 0.

Part 4: Modeling

First, we delete empty observations from the data frame.

data_movies = na.omit(data_movies)

To develop a Bayesian regression model to predict the “audience score” we run the “Bayesian adaptive sampling without replecement for variable selection (package”BAS" https://www.r-project.org, https://github.com/merliseclyde/BAS): as a prior we will consider “BIC” and as a method “Bayesian Model Averaging”. We assign equal prior probabilities to all models using the uniform function (https://www.coursera.org/learn/bayesian):

movies_bma = bas.lm(audience_score ~ feature_film+drama+runtime+mpaa_rating_R+thtr_rel_year+oscar_season+summer_season+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box, data = data_movies, n.models=2^16, prior = "BIC", modelprior = uniform(), initprobs="eplogp")
summary(movies_bma)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.05876309     0.0000     0.0000000     0.0000000
## dramayes               0.04508592     0.0000     0.0000000     0.0000000
## runtime                0.51399873     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.16498090     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.08088668     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.06525993     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.07935408     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06242230     0.0000     0.0000000     0.0000000
## critics_score          0.92015573     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13200908     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04076727     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11565473     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14770126     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06701259     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04875994     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1558     0.1358000     0.0319000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2039916     0.1851908
## PostProbs               0.0318000     0.0289000
## R2                      0.7496000     0.7495000
## dim                     5.0000000     5.0000000
## logmarg             -3436.3416317 -3436.4383237

We run also models with replacement by the MCMC method to look at convergence:

movies_mcmc = bas.lm(audience_score ~ feature_film+drama+runtime+mpaa_rating_R+thtr_rel_year+oscar_season+summer_season+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box, data = data_movies, method="MCMC",MCMC.iterations=2*10^6, n.models=2^16, prior = "BIC", modelprior = uniform(), initprobs="eplogp")
summary(movies_mcmc)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept               1.0000000     1.0000     1.0000000     1.0000000
## feature_filmyes         0.0581185     0.0000     0.0000000     0.0000000
## dramayes                0.0451245     0.0000     0.0000000     0.0000000
## runtime                 0.5137515     1.0000     0.0000000     1.0000000
## mpaa_rating_Ryes        0.1650660     0.0000     0.0000000     1.0000000
## thtr_rel_year           0.0815650     0.0000     0.0000000     0.0000000
## oscar_seasonyes         0.0645865     0.0000     0.0000000     0.0000000
## summer_seasonyes        0.0799880     0.0000     0.0000000     0.0000000
## imdb_rating             0.9999930     1.0000     1.0000000     1.0000000
## imdb_num_votes          0.0623855     0.0000     0.0000000     0.0000000
## critics_score           0.9191680     1.0000     1.0000000     1.0000000
## best_pic_nomyes         0.1329805     0.0000     0.0000000     0.0000000
## best_pic_winyes         0.0409610     0.0000     0.0000000     0.0000000
## best_actor_winyes       0.1160390     0.0000     0.0000000     0.0000000
## best_actress_winyes     0.1486245     0.0000     0.0000000     0.0000000
## best_dir_winyes         0.0674020     0.0000     0.0000000     0.0000000
## top200_boxyes           0.0492430     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2039916
## PostProbs                      NA     0.1560     0.1362000     0.0312000
## R2                             NA     0.7483     0.7455000     0.7496000
## dim                            NA     4.0000     3.0000000     5.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3416317
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        0.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     1.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2048238     0.1851908
## PostProbs               0.0312000     0.0284000
## R2                      0.7470000     0.7495000
## dim                     4.0000000     5.0000000
## logmarg             -3436.3375603 -3436.4383237

Now we will look at plots of the models:

diagnostics(movies_mcmc)

plot(movies_mcmc,which=2,add.smooth=F)

plot(movies_bma,which=1,add.smooth=F)

plot(movies_bma,which=2,add.smooth=F)

plot(movies_bma,which=3)

image(movies_bma,subset=-1)

The plots of the model run by MCMC shows that there is a covergence meaning that we can use the Bayesian estimates. The residuals plot of BMA models show that there are outliers.Cummulative model probabilities plot of BMA models shows that we discovered about 5000 unique models out of 2^16 models with bayesian adaptive sampling.The plot of model size versus the log of marginal likelihood shows that models with the highest Bayes factors or marginal likelihoods have from 2 to 5 predictors. In summary of the BMA models we see that the highest marginal posterior inclusion probabilities have the intercept and variable“imdb_rating”(1.0), the next is “critics_score”(0.920) and the forth is “runtime” (0.514). Image of the BMA models shows that top ranked model includes just these four variables. This model has the highest prosterior probability and is the median probability model.

No we will look at credible interval of coefficients:

coef_movies_bicprior=coefficients(movies_bma, estimator="BMA")
plot(coef_movies_bicprior)

confint(coef_movies_bicprior)

##                              2.5%        97.5%          beta
## Intercept            6.138969e+01 6.297492e+01  6.221002e+01
## feature_filmyes     -7.166579e-01 1.156079e-01 -9.162435e-02
## dramayes             0.000000e+00 0.000000e+00  1.956484e-02
## runtime             -9.078604e-02 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes    -1.987046e+00 1.548335e-03 -2.409288e-01
## thtr_rel_year       -4.244719e-02 2.506749e-05 -3.878883e-03
## oscar_seasonyes     -7.071126e-01 0.000000e+00 -6.293171e-02
## summer_seasonyes     0.000000e+00 9.601729e-01  8.660919e-02
## imdb_rating          1.358424e+01 1.656623e+01  1.490680e+01
## imdb_num_votes      -8.120657e-10 2.638531e-06  2.466254e-07
## critics_score       -1.564034e-06 1.108884e-01  6.951964e-02
## best_pic_nomyes      0.000000e+00 4.954424e+00  5.130600e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -6.383140e-03
## best_actor_winyes   -2.214586e+00 0.000000e+00 -2.123363e-01
## best_actress_winyes -2.970196e+00 0.000000e+00 -3.323225e-01
## best_dir_winyes     -1.397184e+00 8.083505e-02 -1.188402e-01
## top200_boxyes        0.000000e+00 0.000000e+00  8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The coefficients’ credible intervals show that there is evidence that imdb_rating, critics_score are positively associated with the audience score while longer films are scored less by the audience.

Part 5: Prediction

In this part we have choosen the film “La La Land” released in 2016 to predict the audience score by the model from the previous step. Data from the web-sites the IMDb and Tomatometer rating were addded to the data frame:

data_movies[nrow(data_movies)+1,] = list(title="La La Land",title_type="Feature Film", genre="Drama",runtime=128, mpaa_rating="PG-13",studio="Lionsgate Films",thtr_rel_year=2016,thtr_rel_month=12, thtr_rel_day=9,dvd_rel_year=2017, dvd_rel_month=4,dvd_rel_day=25,imdb_rating=8.1, imdb_num_votes=310814, critics_rating="Certified Fresh",critics_score=93, audience_rating="Spilled",audience_score=81,best_pic_nom="yes", best_pic_win="yes", best_actor_win="yes",best_actress_win="yes", best_dir_win="yes", top200_box="yes", director="Damien Chazelle", actor1="Ryan Gosling", actor2="Emma Stone", actor3="Amiee Conn",actor4="Terry Walters",actor5="Thom Shelton",imdb_url="http://www.imdb.com/title/tt3783958", rt_url="https://www.rottentomatoes.com/m/la_la_land", feature_film="yes", drama="yes", mpaa_rating_R="no", oscar_season="yes",summer_season="no")
tail(data_movies)

## # A tibble: 6 x 37
##                         title   title_type              genre runtime
##                         <chr>       <fctr>             <fctr>   <dbl>
## 1          Death Defying Acts Feature Film              Drama      97
## 2                  Half Baked Feature Film             Comedy      82
## 3           Dance of the Dead Feature Film Action & Adventure      87
## 4 Around the World in 80 Days Feature Film Action & Adventure     120
## 5                         LOL Feature Film             Comedy      97
## 6                  La La Land Feature Film              Drama     128
## # ... with 33 more variables: mpaa_rating <fctr>, studio <fctr>,
## #   thtr_rel_year <dbl>, thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## #   imdb_rating <dbl>, imdb_num_votes <dbl>, critics_rating <fctr>,
## #   critics_score <dbl>, audience_rating <fctr>, audience_score <dbl>,
## #   best_pic_nom <fctr>, best_pic_win <fctr>, best_actor_win <fctr>,
## #   best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>,
## #   feature_film <fctr>, drama <fctr>, mpaa_rating_R <fctr>,
## #   oscar_season <fctr>, summer_season <fctr>

Now we are going to make a prediction under model averaging and other estimator (HMP - highest probability model, MPM - the median probability model and BMP):

movies.BMA=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="BMA")
confint(movies.BMA)

##          2.5%    97.5%     pred
## [1,] 67.92074 108.0154 88.13245
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

head(movies.BMA$se.pred)

##          [,1]
## [1,] 10.17927
## [2,] 10.22202
## [3,] 10.25961
## [4,] 10.17068
## [5,] 10.36133
## [6,] 10.24785

movies.BPM=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="BPM")
confint(movies.BPM)

##          2.5%   97.5%     pred
## [1,] 59.24021 117.025 88.13259
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

movies.BPM$se.pred

##        1 
## 14.71188

movies.HPM=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="HPM")
confint(movies.HPM)

##          2.5%    97.5%     pred
## [1,] 67.79164 107.7723 87.78198
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

movies.HPM$se.pred

##        1 
## 10.17927

movies.MPM=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="MPM")
confint(movies.MPM)

##          2.5%    97.5%     pred
## [1,] 67.80094 107.7626 87.78178
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

movies.MPM$se.pred

##        1 
## 10.17456

We choose the BMA models prediction of the audience score for the film “La La Land”: with the probability of 95% the audience score will fall between 68.57 and 117.02.

Part 6: Conclusion

Regarding data and research methodoloigy I have critics about combining data from different sources such as different ratings.In the future research I would use new variables: “age of the film” and age squared instead the the thtr_rel_year. I would keep doing Bayesian estimates with other priors.