library(ggplot2)
library(ggthemes)
library(dplyr)
library(tidyverse)
library(statsr)
library(BAS)
library(kableExtra)
load("~/Desktop/R Programming/Statistics_Coursera/Bayesian/movies.Rdata")
The dataset contains information about movies obtained from the IMDb and Rotten Tomatoes websites. It contains 651 randomly sampled movies released between 1970 and 2014.
Because the movies were randomly sampled, the results should generalise to movies released in that period. However, the ratings themselves are subject to voluntary response bias: voting is voluntary, so viewers with a strong interest in movies are over-represented.
This is an observational study with no random assignment, so causal inferences cannot be drawn from the results.
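A quick structural check (a simple sketch) confirms the sample size described above:
# confirm the number of movies and variables in the raw data
dim(movies)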
The following new variables are created from the existing dataset: feature_film, drama, mpaa_rating_R, oscar_season, and summer_season.
movies <- movies %>%
mutate(feature_film = as.factor(ifelse(title_type == "Feature Film", "yes","no")),
drama = as.factor(ifelse(genre == "Drama", "yes", "no")),
mpaa_rating_R = as.factor(ifelse(mpaa_rating == "R", "yes", "no")),
oscar_season = as.factor(ifelse(thtr_rel_month %in% 10:12, "yes", "no")),
summer_season = as.factor(ifelse(thtr_rel_month %in% 5:8, "yes", "no")))
Before performing the analysis, it is important to check for duplication. Table 1 shows four movie titles that each appear twice in the dataset. Most of these appear to be remakes, as indicated by different release years and studios. However, Man on Wire looks like a true duplicate: both rows share the same studio (Magnolia Pictures) and release year (2008).
In addition, rows with missing values are removed from the dataset.
DupTitle <- movies %>%
dplyr::select(title) %>%
table() %>%
data.frame() %>%
filter(Freq > 1)
colnames(DupTitle) <- c("title", "Frequency")
DupTitle <- DupTitle %>%
inner_join(movies, by = "title") %>%
dplyr::select(title, studio, thtr_rel_year)
colnames(DupTitle) <- c("title", "studio", "released year")
DupTitle %>%
kable(align = "c", caption ="Table 1 - Movies with duplicated title") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| title | studio | release year |
|---|---|---|
| Hairspray | New Line Cinema | 2007 |
| Hairspray | New Line Home Entertainment | 1988 |
| Man on Wire | Magnolia Pictures | 2008 |
| Man on Wire | Magnolia Pictures | 2008 |
| Maniac | Analysis | 1980 |
| Maniac | IFC Midnight | 2013 |
| Where the Heart Is | Touchstone Pictures | 1990 |
| Where the Heart Is | Twentieth Century Fox Home Entertainment | 2000 |
movies <- movies[!duplicated(movies),]
movies %>%
sapply(is.na) %>%
colSums()
## title title_type genre runtime
## 0 0 0 1
## mpaa_rating studio thtr_rel_year thtr_rel_month
## 0 8 0 0
## thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## 0 8 8 8
## imdb_rating imdb_num_votes critics_rating critics_score
## 0 0 0 0
## audience_rating audience_score best_pic_nom best_pic_win
## 0 0 0 0
## best_actor_win best_actress_win best_dir_win top200_box
## 0 0 0 0
## director actor1 actor2 actor3
## 2 2 7 9
## actor4 actor5 imdb_url rt_url
## 13 15 0 0
## feature_film drama mpaa_rating_R oscar_season
## 0 0 0 0
## summer_season
## 0
movies <- na.omit(movies)
Relationship between audience_score and the new variables
Figure 1 illustrates the relationship between audience_score and the new variables drama, feature_film, mpaa_rating_R, oscar_season, and summer_season, showing how each variable relates to the audience score.
The distribution of audience_score is fairly similar across most of these variables. However, non-feature films score noticeably higher than feature films, with a median of 86 points, albeit over a much narrower score range.
In addition, drama movies tend to receive higher audience scores than non-drama movies: the median audience_score is 70 for dramas versus 59 for non-dramas.
Movies classified as non-feature films or dramas are therefore more likely to receive a high audience_score on Rotten Tomatoes.
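The medians quoted above can be verified with a quick grouped summary; the sketch below assumes the cleaned movies data from the previous steps:
# median audience_score for feature vs non-feature films
movies %>%
  group_by(feature_film) %>%
  summarise(median_score = median(audience_score))
# median audience_score for drama vs non-drama movies
movies %>%
  group_by(drama) %>%
  summarise(median_score = median(audience_score))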
selectedvars <- movies %>%
dplyr::select(feature_film, drama, mpaa_rating_R, oscar_season, summer_season, audience_score)
selectedvars <- selectedvars %>%
pivot_longer(feature_film:summer_season, names_to = "type", values_to = "val")
ggplot(data = selectedvars, aes(x = val, y = audience_score, fill = val)) +
geom_boxplot() +
facet_grid(~type) +
xlab("") + ylab("Audience Score") +
theme_wsj() +
labs(title="Figure 1", subtitle = "Audience Score by Variable", fill = "")
In total, 17 variables are selected for the analysis: runtime, thtr_rel_year, imdb_rating, imdb_num_votes, critics_score, audience_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box, feature_film, drama, mpaa_rating_R, oscar_season, and summer_season.
selected <- movies %>%
dplyr::select(-title, -title_type, -genre, -mpaa_rating, -studio, -thtr_rel_day,
-thtr_rel_month, -dvd_rel_day, -dvd_rel_month, -dvd_rel_year,
-director, -actor1,-actor2, -actor3, -actor4, -actor5, -imdb_url,
-rt_url, -critics_rating, -audience_rating)
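As a quick sanity check (a sketch), the number and names of the retained columns can be confirmed:
# confirm that 17 variables remain after the selection
ncol(selected)
names(selected)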
Bayesian regression is conducted using Bayesian Model Averaging (BMA) with the following settings:
- Coefficient prior: Zellner-Siow Cauchy (ZS-null)
- Model prior: uniform (equal prior probability for every model)
- Method: Markov chain Monte Carlo (MCMC), for efficient sampling of the model space
set.seed(12356789)
movies.ZS <- bas.lm(audience_score ~ ., data = selected,
prior = "ZS-null", modelprior = uniform(),
method = "MCMC")
A) Check whether the MCMC has been run long enough
Because MCMC sampling is used for this model search, the diagnostics function is used to check convergence: it plots the posterior inclusion probabilities (pip) estimated from the MCMC sampling frequencies against those from the renormalised posterior probabilities, and the points should lie close to the 45-degree diagonal line. The convergence plot shows that all points fall on the 45-degree line, suggesting that the posterior inclusion probabilities have converged.
diagnostics(movies.ZS, type = "pip", pch = 16, cex = 1.5)
B) Residuals vs Fitted Values
The residuals plot does not show a random scatter around 0, and the variance is not constant across fitted values; the scatter looks roughly random only for fitted scores around 75 and above. This suggests that the model is likely to predict better for higher-scoring movies.
plot(movies.ZS, which = 1)
C) Model Probabilities
This plot shows the cumulative probability of the models in the order they were sampled. The cumulative probability starts to level off after roughly 2,000 models, as each additional model contributes little extra probability, and the search stops at about 6,000 models instead of enumerating all \(2^{16}\) possible models.
plot(movies.ZS, which = 2)
D) Model Complexity
This plot shows model complexity, the number of predictors in each sampled model, against the log marginal likelihood. The models with the highest Bayes factors (marginal likelihoods) have around 3 or 4 predictors.
plot(movies.ZS, which = 3)
E) Marginal Inclusion Probabilities
The marginal posterior inclusion probabilities are plotted below. The lines in red correspond to variables with a posterior inclusion probability (pip) greater than 0.5, suggesting that these variables are important for prediction.
plot(movies.ZS, which = 4, ask=F, cex.lab = 0.75, cex.axis=0.75)
The print method reports the marginal posterior inclusion probability (pip) of each variable. It is apparent that imdb_rating, critics_score, and runtime have posterior inclusion probabilities greater than 0.5 and are therefore likely to be included in the final model.
print(movies.ZS)
##
## Call:
## bas.lm(formula = audience_score ~ ., data = selected, prior = "ZS-null",
## modelprior = uniform(), method = "MCMC")
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept runtime thtr_rel_year
## 1.00000 0.51010 0.08603
## imdb_rating imdb_num_votes critics_score
## 0.99998 0.06715 0.91346
## best_pic_nomyes best_pic_winyes best_actor_winyes
## 0.14098 0.04324 0.11904
## best_actress_winyes best_dir_winyes top200_boxyes
## 0.15371 0.07186 0.05252
## feature_filmyes dramayes mpaa_rating_Ryes
## 0.06235 0.04930 0.17089
## oscar_seasonyes summer_seasonyes
## 0.06948 0.08290
The summary shows the top 5 models, with an indicator (0 or 1) for whether each variable is included, together with each model's posterior probability, computed from the probability distribution over all possible models. With 16 candidate predictors there are \(2^{16} = 65536\) possible models in total.
The top model contains three predictors, runtime, imdb_rating, and critics_score, and has the highest posterior probability, 0.1446.
summary(movies.ZS)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000000 1.0000 1.0000000 1.0000000 1.0000000
## runtime 0.51010132 1.0000 0.0000000 0.0000000 1.0000000
## thtr_rel_year 0.08603134 0.0000 0.0000000 0.0000000 0.0000000
## imdb_rating 0.99998169 1.0000 1.0000000 1.0000000 1.0000000
## imdb_num_votes 0.06714859 0.0000 0.0000000 0.0000000 0.0000000
## critics_score 0.91346436 1.0000 1.0000000 1.0000000 1.0000000
## best_pic_nomyes 0.14097977 0.0000 0.0000000 0.0000000 0.0000000
## best_pic_winyes 0.04323502 0.0000 0.0000000 0.0000000 0.0000000
## best_actor_winyes 0.11903687 0.0000 0.0000000 0.0000000 0.0000000
## best_actress_winyes 0.15370941 0.0000 0.0000000 1.0000000 0.0000000
## best_dir_winyes 0.07185745 0.0000 0.0000000 0.0000000 0.0000000
## top200_boxyes 0.05251770 0.0000 0.0000000 0.0000000 0.0000000
## feature_filmyes 0.06234512 0.0000 0.0000000 0.0000000 0.0000000
## dramayes 0.04930115 0.0000 0.0000000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.17088547 0.0000 0.0000000 0.0000000 1.0000000
## oscar_seasonyes 0.06947708 0.0000 0.0000000 0.0000000 0.0000000
## summer_seasonyes 0.08290253 0.0000 0.0000000 0.0000000 0.0000000
## BF NA 1.0000 0.9992109 0.2081412 0.2053665
## PostProbs NA 0.1446 0.1421000 0.0294000 0.0291000
## R2 NA 0.7477 0.7449000 0.7464000 0.7490000
## dim NA 4.0000 3.0000000 4.0000000 5.0000000
## logmarg NA 412.4443 412.4435507 410.8748017 410.8613812
## model 5
## Intercept 1.0000000
## runtime 1.0000000
## thtr_rel_year 0.0000000
## imdb_rating 1.0000000
## imdb_num_votes 0.0000000
## critics_score 1.0000000
## best_pic_nomyes 1.0000000
## best_pic_winyes 0.0000000
## best_actor_winyes 0.0000000
## best_actress_winyes 0.0000000
## best_dir_winyes 0.0000000
## top200_boxyes 0.0000000
## feature_filmyes 0.0000000
## dramayes 0.0000000
## mpaa_rating_Ryes 0.0000000
## oscar_seasonyes 0.0000000
## summer_seasonyes 0.0000000
## BF 0.1855645
## PostProbs 0.0267000
## R2 0.7489000
## dim 5.0000000
## logmarg 410.7599872
The log posterior odds and model rank can be visualised with the image function. It shows the same result: the model with the highest posterior odds appears in the leftmost column and includes runtime, imdb_rating, and critics_score.
image(movies.ZS, rotate = F, cex.lab = 0.75, cex.axis=0.75)
The confint function provides 95% credible intervals for the coefficients. Given the data, there is a 95% chance that audience_score increases by 13.55 to 16.50 points for each additional point of imdb_rating, all else held constant.
coef_movies <- coef(movies.ZS)
confint(coef_movies)
## 2.5% 97.5% beta
## Intercept 6.133800e+01 6.293762e+01 6.216990e+01
## runtime -9.031633e-02 0.000000e+00 -3.017506e-02
## thtr_rel_year -5.306542e-02 1.527681e-04 -4.168073e-03
## imdb_rating 1.355340e+01 1.650167e+01 1.488618e+01
## imdb_num_votes -1.035726e-07 3.691489e-06 2.691560e-07
## critics_score 0.000000e+00 1.111471e-01 6.882612e-02
## best_pic_nomyes -7.951543e-04 5.124192e+00 5.517638e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -7.938603e-03
## best_actor_winyes -2.282444e+00 2.257890e-02 -2.164581e-01
## best_actress_winyes -3.003674e+00 1.682111e-03 -3.446054e-01
## best_dir_winyes -1.521390e+00 1.399334e-03 -1.260273e-01
## top200_boxyes -1.159643e-02 1.353316e-01 9.631643e-02
## feature_filmyes -1.207731e+00 6.452415e-03 -9.729139e-02
## dramayes 0.000000e+00 0.000000e+00 2.228857e-02
## mpaa_rating_Ryes -2.070041e+00 0.000000e+00 -2.493305e-01
## oscar_seasonyes -8.331063e-01 0.000000e+00 -6.625360e-02
## summer_seasonyes 0.000000e+00 1.027877e+00 9.018730e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Eddie The Eagle (2016), a movie not in the dataset, is used to test prediction. Using Bayesian Model Averaging (BMA), the model predicts, with a 95% credible interval, that Eddie The Eagle has an audience_score between 57 and 97 points. Its actual audience_score of 82 falls within this interval.
EddieTheEagle <- data.frame(runtime = 106,
thtr_rel_year = 2016,
imdb_rating = 7.4,
imdb_num_votes = 85797,
critics_score = 82,
best_pic_nom = "no",
best_pic_win = "no",
best_actor_win = "no",
best_actress_win = "no",
best_dir_win = "no",
top200_box = "no",
feature_film = "yes",
drama = "yes",
mpaa_rating_R = "no",
oscar_season = "no",
summer_season = "no")
movies.BMA <- predict(movies.ZS, estimator = "BMA")
BMA <- as.vector(which.matrix(movies.ZS$which[movies.BMA$best],
movies.ZS$n.vars))
movies.ZS.BMA <- bas.lm(audience_score ~ ., data = selected,
prior = "ZS-null",
modelprior = uniform(),
bestmodel = BMA, n.models = 1)
pred.movies.BMA <- predict(movies.ZS.BMA, newdata = EddieTheEagle, estimator = "BMA",
se.fit = TRUE)
ci.pred.movies.BMA <- confint(pred.movies.BMA, estimator = "BMA")
actual.EddieTheEagle <- data.frame(actual = 82)
tbl.EddieTheEagle <- cbind(ci.pred.movies.BMA[1], ci.pred.movies.BMA[2],
ci.pred.movies.BMA[3])
tbl.EddieTheEagle.out <- data.frame(title = "Eddie The Eagle",
prediction = tbl.EddieTheEagle,
actual = actual.EddieTheEagle)
colnames(tbl.EddieTheEagle.out) <- c("title", "2.5%", "97.5%", "prediction", "actual score")
tibble(tbl.EddieTheEagle.out) %>%
kbl(caption ="Table 2 - Movie Prediction") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| title | 2.5% | 97.5% | prediction | actual score |
|---|---|---|---|---|
| Eddie The Eagle | 56.97195 | 96.70901 | 77.27737 | 82 |
In conclusion, the variables most useful for audience_score prediction on Rotten Tomatoes are runtime, imdb_rating, and critics_score.
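As a supplementary sketch (not part of the original analysis), the single highest probability model and the variables it contains can be extracted directly; this assumes BAS's predict() with estimator = "HPM" and its variable.names() method for prediction objects:
# sketch: extract the highest probability model (HPM) and list its predictors
HPM_pred <- predict(movies.ZS, estimator = "HPM")
variable.names(HPM_pred)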