Setup

# set defaults: cache chunks to speed compiling subsequent edits.
knitr::opts_chunk$set(cache=TRUE, echo = TRUE)

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)
library(BAS)
library(MASS)
library(ggthemes)
library(plotly)
library(kableExtra)
library(moments)

Load data

load("movies.Rdata")

Part 1: Data

Type of Study

This is an observational study due to the nature of data collection. It only establishes associations and trends and cannot be used to infer causality.

Generalizability & Potential Bias

No random assignment was used in the surveys and the sampling methodology is randomised.

It can be considered to be generalisable to the entire population of movies that appear on Rotten Tomatoes and IMDB but with consideration that this does not represent the general population of movies.

Note also, that when considering audience scores, IMDB ratings and critics scores, these are representative of people that both use that site and actively contribute to ratings. It is not a representation of the population at large.

Part 2: Data manipulation

# Add new variables
movies2 <- movies %>%
    mutate(feature_film = factor(title_type == 'Feature Film', labels=c("No", "Yes"))) %>%
    mutate(drama  = factor(genre == 'Drama', labels=c("No", "Yes"))) %>%
    mutate(mpaa_rating_R = factor(mpaa_rating == 'R', labels=c("No", "Yes"))) %>%
    mutate(oscar_season = factor(thtr_rel_month %in% 10:12, labels=c("No", "Yes"))) %>%
    mutate(summer_season = factor(thtr_rel_month %in% 5:8, labels=c("No", "Yes"))) 

# Print sample of rows with new variable
sample_n(movies2, 10) %>%
    select(title, feature_film:summer_season) %>%
    kbl() %>% 
    kable_styling(bootstrap_options = c("condensed"))

title	feature_film	drama	mpaa_rating_R	oscar_season	summer_season
Vampire Hunter D: Bloodlust	Yes	No	Yes	No	No
The Package	Yes	No	Yes	No	Yes
Welcome to Mooseport	Yes	No	No	No	No
The Last Remake of Beau Geste	Yes	No	No	No	Yes
Captain Phillips	Yes	Yes	No	Yes	No
Django Unchained	Yes	No	Yes	Yes	No
The Sum of All Fears	Yes	Yes	No	No	Yes
Blind Spot: Hitler’s Secretary	No	No	No	Yes	No
Bottle Rocket	Yes	Yes	Yes	No	No
Love and Death	Yes	Yes	No	No	Yes

# Drop unused columns, drop rows with NA's
movies2 <- movies2 %>%
    select(
        audience_score, feature_film, drama, runtime, mpaa_rating_R, 
        thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, 
        critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win,
        best_dir_win, top200_box) %>% 
    drop_na()

Part 3: Exploratory data analysis

Each of the new variables are boolean type, each to be examined against Audience Score. Since analysis methodology is the same for each, we create a function for summary stats and box plot.

summ_var <- function(variable, description){
    summ_list = list(
        box = ggplot(data = movies2, aes(x=(!!as.name(variable)), y=audience_score)) + 
            geom_boxplot(fill="lightblue") + 
            theme_excel_new() +
            xlab(description) +
            ylab("Audience Score") +
            ggtitle(paste("Audience Score vs.", description)) +
            theme(
                plot.margin = margin(0, 0, 0.5, 0.5, "cm"),
                axis.title = element_text()
        ),
    
    tbl = movies2 %>% group_by((!!as.name(variable))) %>% 
        summarise(
            count = n(),
            min = min(audience_score), 
            q25 = quantile(audience_score, 0.25), 
            median = median(audience_score), 
            q75 = quantile(audience_score, 0.75), 
            max = max(audience_score),
            mean = round(mean(audience_score),2), 
            sd = round(sd(audience_score),2),
            skew = round(skewness(audience_score),2)
        ) %>%
        kbl() %>% 
        kable_styling(bootstrap_options = c("condensed"))
    )
}

Feature Film

feature_film <- summ_var("feature_film", "Feature Film")

feature_film$box

feature_film$tbl

feature_film	count	min	q25	median	q75	max	mean	sd	skew
No	59	19	77.0	86	89	96	81.20	13.64	-3.07
Yes	591	11	44.5	62	78	97	60.47	19.82	-0.26

There is a much greater count of feature films reviewed
Both mean and median are much greater for non-feature films, the difference is greater than one feature-film SD
The range is similar for both
Variance is greater for feature films
Feature films are mostly normally distributed while non-feature films are heavily left-skewed.

bi <- bayes_inference(y = audience_score, x = feature_film, data = movies2, 
            statistic = "mean", type = "ht", null=0, alternative = "twosided")

Hypotheses:
H1: mu_No = mu_Yes
H2: mu_No != mu_Yes

Priors:
P(H1) = 0.5 P(H2) = 0.5

Results:
BF[H2:H1] = 322781721529
P(H1|data) = 0
P(H2|data) = 1

Posterior summaries for under H2:
95% Cred. Int.: (15.1737, 25.7915)

The bayes inference output confirms that there is a significant difference in audience score between feature film and non-feature film.

Drama

drama <- summ_var("drama", "Drama")

drama$box

drama$tbl

drama	count	min	q25	median	q75	max	mean	sd	skew
No	345	11	41	61	79	97	59.70	21.30	-0.11
Yes	305	13	52	70	80	95	65.35	18.54	-0.66

There is a more even number of drama vs non-drama, though non-drama is higher
Both mean and median are higher for drama films, the difference is approximately one half of the drama-film SD
The range is similar for both
Variance is slightly higher for non-drama films
Non-drama films are mostly normally distributed while drama films are mildly left-skewed.

bi <- bayes_inference(y = audience_score, x = drama, data = movies2, 
            statistic = "mean", type = "ht", null=0, alternative = "twosided")

Hypotheses:
H1: mu_No = mu_Yes
H2: mu_No != mu_Yes

Priors:
P(H1) = 0.5 P(H2) = 0.5

Results:
BF[H2:H1] = 34.0362
P(H1|data) = 0.0285
P(H2|data) = 0.9715

Posterior summaries for under H2:
95% Cred. Int.: (-8.6372, -2.5226)

The bayes inference output confirms that there is a significant difference in audience score between drama film and non-drama film with no overlap of zero in the credible interval.

MPAA Rating R

mpaa_rating_R <- summ_var("mpaa_rating_R", "MPAA Rating R")

mpaa_rating_R$box

mpaa_rating_R$tbl

mpaa_rating_R	count	min	q25	median	q75	max	mean	sd	skew
No	321	11	48	65	80	96	62.66	20.34	-0.44
Yes	329	14	44	64	79	97	62.04	20.16	-0.29

There is an approximately even number of R vs non-R movies
Both mean and median are similar for both types of film
The range is similar for both
Variance is nearly identical
Both film types show a small left-skew.

bi <- bayes_inference(y = audience_score, x = mpaa_rating_R, data = movies2, 
            statistic = "mean", type = "ht", null=0, alternative = "twosided")

Hypotheses:
H1: mu_No = mu_Yes
H2: mu_No != mu_Yes

Priors:
P(H1) = 0.5 P(H2) = 0.5

Results:
BF[H2:H1] = 14.9135
P(H1|data) = 0.9372
P(H2|data) = 0.0628

Posterior summaries for under H2:
95% Cred. Int.: (-2.4516, 3.6660)

The bayes inference output confirms a strong overlap between the two types of movies.

Oscar Season

oscar_season <- summ_var("oscar_season", "Oscar Season")

oscar_season$box

oscar_season$tbl

oscar_season	count	min	q25	median	q75	max	mean	sd	skew
No	460	11	46.00	64.0	79	96	61.81	20.12	-0.34
Yes	190	13	47.25	68.5	81	97	63.64	20.51	-0.42

There is a much greater number of non-Oscar Season movies
Both mean and median are roughly similar for both types of film
The range is similar for both
Variance is nearly identical
Both film types show a small left-skew.

bi <- bayes_inference(y = audience_score, x = oscar_season, data = movies2, 
            statistic = "mean", type = "ht", null=0, alternative = "twosided")

Hypotheses:
H1: mu_No = mu_Yes
H2: mu_No != mu_Yes

Priors:
P(H1) = 0.5 P(H2) = 0.5

Results:
BF[H2:H1] = 8.5148
P(H1|data) = 0.8949
P(H2|data) = 0.1051

Posterior summaries for under H2:
95% Cred. Int.: (-5.2649, 1.5886)

The bayes inference output shows a higher rating for Oscar Season movies, however there is also significant overlap between the two types.

Summer Season

summer_season <- summ_var("summer_season", "Summer Season")

summer_season$box

summer_season$tbl

summer_season	count	min	q25	median	q75	max	mean	sd	skew
No	442	13	46.00	65.5	80	97	62.60	20.40	-0.36
Yes	208	11	44.75	65.0	78	94	61.81	19.91	-0.38

There is a much greater number of non-Summer Season movies (approximately 2:1)
Both mean and median are roughly similar for both types of film
The range is similar for both
Variance is very similar
Both types of film have a small left-skew.

bi <- bayes_inference(y = audience_score, x = summer_season, data = movies2, 
            statistic = "mean", type = "ht", null=0, alternative = "twosided")

Hypotheses:
H1: mu_No = mu_Yes
H2: mu_No != mu_Yes

Priors:
P(H1) = 0.5 P(H2) = 0.5

Results:
BF[H2:H1] = 13.4802
P(H1|data) = 0.9309
P(H2|data) = 0.0691

Posterior summaries for under H2:
95% Cred. Int.: (-2.5584, 4.1228)

The bayes inference output shows a slightly higher rating for Summer Season movies, however there is also significant overlap between the two types.

Number of votes on IMDB

This variable has an extreme right skew distribution over a large range.

hist(movies2$imdb_num_votes)

ggplot(data = movies2, aes(x=audience_score, y=imdb_num_votes)) + geom_point()

Taking the log of this variable normalises the distribution and the relationship with audience score.

hist(log(movies2$imdb_num_votes))

ggplot(data = movies2, aes(x=audience_score, y=log(imdb_num_votes))) + geom_point()

For modelling, we will use the log of Number of votes on IMDB:

movies2 <- movies2 %>%
    mutate(l_imdb_num_votes = log(imdb_num_votes)) %>%
    select(-imdb_num_votes)

Part 4: Modelling

Base Model

We create an initial model using the bas.lm function from the BAS package.

Here we have 16 potential predictors. The total number of models is \(2^{16}=65536\).

This can take a long time to process, so we can use the argument method = MCMC inside the bas.lm function to speed processing by only sampling most likely models.

We also use the Zellner-Siow cauchy prior (prior = "ZS-null") for the prior distributions of the coefficients in this regression.

We assume all variables have an equal likelihood of being included in the final model using the uniform distribution modelprior = uniform().

m_movies <- bas.lm(audience_score ~ ., 
                          data = movies2,
                          prior = "ZS-null", 
                          method = "MCMC",
                          modelprior = uniform()
                        )
m_movies

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = movies2, prior = "ZS-null", 
##     modelprior = uniform(), method = "MCMC")
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmYes             dramaYes  
##             1.00000              0.08251              0.04699  
##             runtime     mpaa_rating_RYes        thtr_rel_year  
##             0.47373              0.20470              0.10808  
##     oscar_seasonYes     summer_seasonYes          imdb_rating  
##             0.07846              0.08209              0.99999  
##       critics_score      best_pic_nomyes      best_pic_winyes  
##             0.88311              0.13804              0.04130  
##   best_actor_winyes  best_actress_winyes      best_dir_winyes  
##             0.14701              0.14342              0.06986  
##       top200_boxyes     l_imdb_num_votes  
##             0.05009              0.10413

Model Diagnostics

Posterior inclusion probabilities

We will run the convergence plots to double-check the base model first:

diagnostics(m_movies, type="pip", col = "blue", pch = 16, cex = 1.5)

diagnostics(m_movies, type="model", col = "blue", pch = 16, cex = 1.5)

The posterior inclusion & model probabilities converge well to the theoretical posterior inclusion probability.

Residuals Versus Fitted Values.

plot(m_movies, which = 1, add.smooth = F, 
     ask = F, pch = 16, sub.caption="", caption="")
abline(a = 0, b = 0, col = "darkgrey", lwd = 2)

Since the Best Predictive Model is the model whose predictions are closest to those given by BMA, we use BMA to check the residuals.

For predictions over 25, there is a fairly even distribution of residuals, however this becomes skewed to the positive below that. Potential outliers are cases 126, 216 & 251.

Cumulative Sampled Probability

plot(m_movies, which=2, add.smooth = F, sub.caption="", caption="")

We can see that after we have discovered about 7,000 unique models with MCMC sampling, the probability is starting to level off after less than 100, indicating that these additional models have very small probability and do not contribute substantially to the posterior distribution.

These probabilities are proportional to the product of marginal likelihoods of models and priors, \(p(data | M_m)p(M_m)\), rather than Monte Carlo frequencies.

Model Complexity

plot(m_movies, which=3, ask=F, caption="", sub.caption="")

The models with 2 & 3 variables have both the lowest and highest BF, throwing the graph scale out.

The model selection methodology will remove these low values.

Marginal Inclusion Probability

plot(m_movies, which = 4, ask = F, caption = "", sub.caption = "", 
     col.in = "blue", col.ex = "darkgrey", lwd = 3)

Only two variables have marginal inclusion probabilities greater than 0.5: imdb_rating and critics_score.

Additionally, runtime has a value close to 0.5 and may be considered for the model.

It should be kept in mind that low scores may be a result of multicollinearity.

Model Selection - Best Predictive Model

Since the purpose of the model is to predict future audience movie scores, the approach used will be Best Predictive Model (BPM).

movies.BPM = predict(m_movies, estimator = "BPM")
movies.BPM$best.vars

## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

BPM estimation with the predict function tells us that only three significant variables exist for creating the best predictive model: runtime, imdb_rating, critics_score

Using the se.fit = TRUE option with predict we can calculate standard deviations for the predictions or for the mean. Then we can use this as input for the confint function for the prediction object. Here we only show the results of the first 10 data points.

movies.BPM = predict(m_movies, estimator = "BPM", se.fit = TRUE)
movies.BPM.conf.fit = confint(movies.BPM, parm = "mean")
movies.BPM.conf.pred = confint(movies.BPM, parm = "pred")
head(cbind(movies.BPM$fit, movies.BPM.conf.fit, movies.BPM.conf.pred), 10)

##                    2.5%    97.5%     mean     2.5%     97.5%     pred
##  [1,] 48.00963 46.62389 49.39537 48.00963 28.24466  67.77460 48.00963
##  [2,] 77.39261 76.01322 78.77200 77.39261 57.62808  97.15713 77.39261
##  [3,] 82.43884 80.81128 84.06640 82.43884 62.65544 102.22223 82.43884
##  [4,] 72.73853 71.14737 74.32968 72.73853 52.95809  92.51896 72.73853
##  [5,] 40.64808 39.32988 41.96629 40.64808 20.88773  60.40843 40.64808
##  [6,] 85.75164 83.87994 87.62335 85.75164 65.94667 105.55662 85.75164
##  [7,] 70.96218 69.23199 72.69236 70.96218 51.17007  90.75428 70.96218
##  [8,] 45.34669 43.92780 46.76559 45.34669 25.57937  65.11402 45.34669
##  [9,] 80.65865 79.17317 82.14413 80.65865 60.88643 100.43086 80.65865
## [10,] 65.04514 63.66929 66.42099 65.04514 45.28086  84.80942 65.04514

The option estimator = "BPM" is not yet available in coef(), so we need to extract a vector of zeros and ones representing which variables are included in the BPM model.

BPM = as.vector(which.matrix(m_movies$which[movies.BPM$best],
                             m_movies$n.vars))
BPM

##  [1] 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0

Next, we will refit the model with bas.lm using the optional argument bestmodel = BPM.

We will also specify that want to have 1 model by setting n.models = 1. In this way, bas.lm starts with the BPM and fits only that model.

movies.BPM = bas.lm(audience_score ~ ., data = movies2,
                      prior = "ZS-null",
                      modelprior = uniform(),
                      bestmodel = BPM, n.models = 1)

Now, since we have only one model in our new object representing the BPM, we can use the coef function to obtain the summaries.

movies.BPM.coef <- coef(movies.BPM)
movies.BPM.coef

## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  1 models 
##                      post mean  post SD   post p(B != 0)
## Intercept            62.34769    0.39383   1.00000      
## feature_filmYes       0.00000    0.00000   0.00000      
## dramaYes              0.00000    0.00000   0.00000      
## runtime              -0.05353    0.02105   1.00000      
## mpaa_rating_RYes      0.00000    0.00000   0.00000      
## thtr_rel_year         0.00000    0.00000   0.00000      
## oscar_seasonYes       0.00000    0.00000   0.00000      
## summer_seasonYes      0.00000    0.00000   0.00000      
## imdb_rating          14.95802    0.57691   1.00000      
## critics_score         0.07025    0.02154   1.00000      
## best_pic_nomyes       0.00000    0.00000   0.00000      
## best_pic_winyes       0.00000    0.00000   0.00000      
## best_actor_winyes     0.00000    0.00000   0.00000      
## best_actress_winyes   0.00000    0.00000   0.00000      
## best_dir_winyes       0.00000    0.00000   0.00000      
## top200_boxyes         0.00000    0.00000   0.00000      
## l_imdb_num_votes      0.00000    0.00000   0.00000

par(mfrow = c(1, 3))
plot(movies.BPM.coef, subset = c(4,10,9))

Here, we see the model predicts a small negative correlation with runtime and audience score, a small positive correlation with critics score, while the greatest factor in prediction is the IMDB rating.

From the above table, the approximate predictions at the 95% credible interval are:

for each minute of movie length, we would expect a decrease in audience score of between 0.095 and 0.013
for each point on critic’s score, we would expect the audience score to increase by between 0.027 and 0.113
for each point on IMDB rating, we would expect to see an audience score increase by between 13.83 and 16.09

Part 5: Prediction

For prediction, we will use the movie Arrival with a Rotten Tomatoes audience score of 82%.

For this, we will need the critics score (94%), runtime (116 minutes) and IMDB rating (7.9).

We start by creating a dataframe with the necessary movie properties then set up a new model using the g-prior and our previously selected variables.

Finally, this is passed to the predict function, the prediction is in the fit element of the returned list object while the standard error is found in se.pred.

movie_props <- data.frame(
    imdb_rating = 7.9, 
    critics_score = 94, 
    runtime = 116
    )

movies.g_prior = bas.lm(audience_score ~ imdb_rating + critics_score + runtime, 
                       data = movies2,
                       alpha=3, 
                       prior="g-prior"
                       )

prediction <- predict(movies.g_prior, newdata=movie_props, estimator="BPM", se.fit=T)
as.numeric(prediction$fit)

## [1] 79.68117

as.numeric(prediction$se.pred)

## [1] 13.35252

The model predicts an audience score of 80%.

More specifically, at the 95% credible interval, the model predicts that a movie with these attributes will have an audience score of 80 \(\pm\) 26.2%.

Part 6: Conclusion

We have created a Bayesian Multilinear Regression model using the Zellner-Siow prior with the Markov Chain Monte Carlo method and a uniform model prior.

The fitted model was selected using the Best Predictive approach and employed just 3 of the original 16 variables: runtime, IMDB rating, critic’s score to predict a Rotten Tomatoes audience rating.

While the predicted value (80%) is very close to the actual value (82%), the large standard error makes this somewhat meaningless. To reduce the standard error, we would need a much larger sample size than 650 observations (since the precision g is proportional to the sample size).

On a final note, using IMDB rating and critic’s score to predict an audience rating doesn’t have much value as the movie needs to be already released and rated by critics and members of the public. A model as a predictive tool for unreleased movies would need to take these two variables out of consideration when constructing the model.

Bayesian Statistics: Movie Popularity Data Analysis

Setup

Load packages

Load data

Part 1: Data

Type of Study

Generalizability & Potential Bias

Part 2: Data manipulation

Part 3: Exploratory data analysis

Feature Film

Drama

MPAA Rating R

Oscar Season

Summer Season

Number of votes on IMDB

Part 4: Modelling

Base Model

Model Diagnostics

Posterior inclusion probabilities

Residuals Versus Fitted Values.

Cumulative Sampled Probability

Model Complexity

Marginal Inclusion Probability

Model Selection - Best Predictive Model

Part 5: Prediction

Part 6: Conclusion