library(ggplot2)
library(ggthemes)
library(dplyr)
library(tidyverse)
library(statsr)
library(BAS)
library(kableExtra)
load("~/Desktop/R Programming/Statistics_Coursera/Bayesian/movies.Rdata")
The dataset contains information about movies obtained from the IMDb and Rotten Tomatoes websites. It contains 651 randomly sampled movies released between 1970 and 2014.
Because the movies were randomly sampled, the results should generalise to movies released in that period. However, the ratings themselves are subject to voluntary response bias: voting is voluntary, so viewers with a strong interest in movies are over-represented.
This is an observational study with no random assignment, so causal inferences cannot be drawn from the results.
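A quick structural check (a simple sketch) confirms the sample size described above:
# confirm the number of movies and variables in the raw data
dim(movies)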
The following new variables are created from the existing dataset: feature_film, drama, mpaa_rating_R, oscar_season, and summer_season.
movies <- movies %>%
mutate(feature_film = as.factor(ifelse(title_type == "Feature Film", "yes","no")),
drama = as.factor(ifelse(genre == "Drama", "yes", "no")),
mpaa_rating_R = as.factor(ifelse(mpaa_rating == "R", "yes", "no")),
oscar_season = as.factor(ifelse(thtr_rel_month %in% 10:12, "yes", "no")),
summer_season = as.factor(ifelse(thtr_rel_month %in% 5:8, "yes", "no")))
Before performing the analysis, it is important to check for duplication. Table 1 shows four movie titles that each appear twice in the dataset. Most of these appear to be remakes, as indicated by different release years and studios. However, Man on Wire looks like a true duplicate: both rows share the same studio (Magnolia Pictures) and release year (2008).
In addition, rows with missing values are removed from the dataset.
DupTitle <- movies %>%
dplyr::select(title) %>%
table() %>%
data.frame() %>%
filter(Freq > 1)
colnames(DupTitle) <- c("title", "Frequency")
DupTitle <- DupTitle %>%
inner_join(movies, by = "title") %>%
dplyr::select(title, studio, thtr_rel_year)
colnames(DupTitle) <- c("title", "studio", "released year")
DupTitle %>%
kable(align = "c", caption ="Table 1 - Movies with duplicated title") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| title | studio | release year |
|---|---|---|
| Hairspray | New Line Cinema | 2007 |
| Hairspray | New Line Home Entertainment | 1988 |
| Man on Wire | Magnolia Pictures | 2008 |
| Man on Wire | Magnolia Pictures | 2008 |
| Maniac | Analysis | 1980 |
| Maniac | IFC Midnight | 2013 |
| Where the Heart Is | Touchstone Pictures | 1990 |
| Where the Heart Is | Twentieth Century Fox Home Entertainment | 2000 |
movies <- movies[!duplicated(movies),]
movies %>%
sapply(is.na) %>%
colSums()
## title title_type genre runtime
## 0 0 0 1
## mpaa_rating studio thtr_rel_year thtr_rel_month
## 0 8 0 0
## thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## 0 8 8 8
## imdb_rating imdb_num_votes critics_rating critics_score
## 0 0 0 0
## audience_rating audience_score best_pic_nom best_pic_win
## 0 0 0 0
## best_actor_win best_actress_win best_dir_win top200_box
## 0 0 0 0
## director actor1 actor2 actor3
## 2 2 7 9
## actor4 actor5 imdb_url rt_url
## 13 15 0 0
## feature_film drama mpaa_rating_R oscar_season
## 0 0 0 0
## summer_season
## 0
movies <- na.omit(movies)
Relationship between audience_score and the new variables
Figure 1 illustrates the relationship between audience_score and the new variables drama, feature_film, mpaa_rating_R, oscar_season, and summer_season, showing how each variable relates to the audience score.
The distribution of audience_score is fairly similar across most of these variables. However, non-feature films score noticeably higher than feature films, with a median of 86 points, albeit over a much narrower score range.
In addition, drama movies tend to receive higher audience scores than non-drama movies: the median audience_score is 70 for dramas versus 59 for non-dramas.
Movies classified as non-feature films or dramas are therefore more likely to receive a high audience_score on Rotten Tomatoes.
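The medians quoted above can be verified with a quick grouped summary; the sketch below assumes the cleaned movies data from the previous steps:
# median audience_score for feature vs non-feature films
movies %>%
  group_by(feature_film) %>%
  summarise(median_score = median(audience_score))
# median audience_score for drama vs non-drama movies
movies %>%
  group_by(drama) %>%
  summarise(median_score = median(audience_score))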
selectedvars <- movies %>%
dplyr::select(feature_film, drama, mpaa_rating_R, oscar_season, summer_season, audience_score)
selectedvars <- selectedvars %>%
pivot_longer(feature_film:summer_season, names_to = "type", values_to = "val")
ggplot(data = selectedvars, aes(x = val, y = audience_score, fill = val)) +
geom_boxplot() +
facet_grid(~type) +
xlab("") + ylab("Audience Score") +
theme_wsj() +
labs(title="Figure 1", subtitle = "Audience Score by Variable", fill = "")
In total, 17 variables are selected for the analysis: runtime, thtr_rel_year, imdb_rating, imdb_num_votes, critics_score, audience_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box, feature_film, drama, mpaa_rating_R, oscar_season, and summer_season.
selected <- movies %>%
dplyr::select(-title, -title_type, -genre, -mpaa_rating, -studio, -thtr_rel_day,
-thtr_rel_month, -dvd_rel_day, -dvd_rel_month, -dvd_rel_year,
-director, -actor1,-actor2, -actor3, -actor4, -actor5, -imdb_url,
-rt_url, -critics_rating, -audience_rating)
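As a quick sanity check (a sketch), the number and names of the retained columns can be confirmed:
# confirm that 17 variables remain after the selection
ncol(selected)
names(selected)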
Bayesian regression is conducted using Bayesian Model Averaging (BMA) with the following settings:
- Coefficient prior: Zellner-Siow Cauchy (ZS-null)
- Model prior: uniform (equal prior probability for every model)
- Method: Markov chain Monte Carlo (MCMC), for efficient sampling of the model space
set.seed(12356789)
movies.ZS <- bas.lm(audience_score ~ ., data = selected,
prior = "ZS-null", modelprior = uniform(),
method = "MCMC")
A) Check whether the MCMC has been run long enough
Because MCMC sampling is used for this model search, the diagnostics function is used to check convergence: it plots the posterior inclusion probabilities (pip) estimated from the MCMC sampling frequencies against those from the renormalised posterior probabilities, and the points should lie close to the 45-degree diagonal line. The convergence plot shows that all points fall on the 45-degree line, suggesting that the posterior inclusion probabilities have converged.
diagnostics(movies.ZS, type = "pip", pch = 16, cex = 1.5)
B) Residuals vs Fitted Values
The residuals plot does not show a random scatter around 0, and the variance is not constant across fitted values; the scatter looks roughly random only for fitted scores around 75 and above. This suggests that the model is likely to predict better for higher-scoring movies.
plot(movies.ZS, which = 1)
C) Model Probabilities
This plot shows the cumulative probability of the models in the order they were sampled. The cumulative probability starts to level off after roughly 2,000 models, as each additional model contributes little extra probability, and the search stops at about 6,000 models instead of enumerating all \(2^{16}\) possible models.
plot(movies.ZS, which = 2)
D) Model Complexity
This plot shows model complexity, the number of predictors in each sampled model, against the log marginal likelihood. The models with the highest Bayes factors (marginal likelihoods) have around 3 or 4 predictors.
plot(movies.ZS, which = 3)
E) Marginal Inclusion Probabilities
The marginal posterior inclusion probabilities are plotted below. The lines in red correspond to variables with a posterior inclusion probability (pip) greater than 0.5, suggesting that these variables are important for prediction.
plot(movies.ZS, which = 4, ask=F, cex.lab = 0.75, cex.axis=0.75)
The print method reports the marginal posterior inclusion probability (pip) of each variable. It is apparent that imdb_rating, critics_score, and runtime have posterior inclusion probabilities greater than 0.5 and are therefore likely to be included in the final model.
print(movies.ZS)
##
## Call:
## bas.lm(formula = audience_score ~ ., data = selected, prior = "ZS-null",
## modelprior = uniform(), method = "MCMC")
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept runtime thtr_rel_year
## 1.00000 0.51010 0.08603
## imdb_rating imdb_num_votes critics_score
## 0.99998 0.06715 0.91346
## best_pic_nomyes best_pic_winyes best_actor_winyes
## 0.14098 0.04324 0.11904
## best_actress_winyes best_dir_winyes top200_boxyes
## 0.15371 0.07186 0.05252
## feature_filmyes dramayes mpaa_rating_Ryes
## 0.06235 0.04930 0.17089
## oscar_seasonyes summer_seasonyes
## 0.06948 0.08290
The summary shows the top 5 models, with an indicator (0 or 1) for whether each variable is included, together with each model's posterior probability, computed from the probability distribution over all possible models. With 16 candidate predictors there are \(2^{16} = 65536\) possible models in total.
The top model contains three predictors, runtime, imdb_rating, and critics_score, and has the highest posterior probability, 0.1446.
summary(movies.ZS)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000000 1.0000 1.0000000 1.0000000 1.0000000
## runtime 0.51010132 1.0000 0.0000000 0.0000000 1.0000000
## thtr_rel_year 0.08603134 0.0000 0.0000000 0.0000000 0.0000000
## imdb_rating 0.99998169 1.0000 1.0000000 1.0000000 1.0000000
## imdb_num_votes 0.06714859 0.0000 0.0000000 0.0000000 0.0000000
## critics_score 0.91346436 1.0000 1.0000000 1.0000000 1.0000000
## best_pic_nomyes 0.14097977 0.0000 0.0000000 0.0000000 0.0000000
## best_pic_winyes 0.04323502 0.0000 0.0000000 0.0000000 0.0000000
## best_actor_winyes 0.11903687 0.0000 0.0000000 0.0000000 0.0000000
## best_actress_winyes 0.15370941 0.0000 0.0000000 1.0000000 0.0000000
## best_dir_winyes 0.07185745 0.0000 0.0000000 0.0000000 0.0000000
## top200_boxyes 0.05251770 0.0000 0.0000000 0.0000000 0.0000000
## feature_filmyes 0.06234512 0.0000 0.0000000 0.0000000 0.0000000
## dramayes 0.04930115 0.0000 0.0000000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.17088547 0.0000 0.0000000 0.0000000 1.0000000
## oscar_seasonyes 0.06947708 0.0000 0.0000000 0.0000000 0.0000000
## summer_seasonyes 0.08290253 0.0000 0.0000000 0.0000000 0.0000000
## BF NA 1.0000 0.9992109 0.2081412 0.2053665
## PostProbs NA 0.1446 0.1421000 0.0294000 0.0291000
## R2 NA 0.7477 0.7449000 0.7464000 0.7490000
## dim NA 4.0000 3.0000000 4.0000000 5.0000000
## logmarg NA 412.4443 412.4435507 410.8748017 410.8613812
## model 5
## Intercept 1.0000000
## runtime 1.0000000
## thtr_rel_year 0.0000000
## imdb_rating 1.0000000
## imdb_num_votes 0.0000000
## critics_score 1.0000000
## best_pic_nomyes 1.0000000
## best_pic_winyes 0.0000000
## best_actor_winyes 0.0000000
## best_actress_winyes 0.0000000
## best_dir_winyes 0.0000000
## top200_boxyes 0.0000000
## feature_filmyes 0.0000000
## dramayes 0.0000000
## mpaa_rating_Ryes 0.0000000
## oscar_seasonyes 0.0000000
## summer_seasonyes 0.0000000
## BF 0.1855645
## PostProbs 0.0267000
## R2 0.7489000
## dim 5.0000000
## logmarg 410.7599872
The log posterior odds and model rank can be visualised with the image function. It shows the same result: the model with the highest posterior odds appears in the leftmost column and includes runtime, imdb_rating, and critics_score.
image(movies.ZS, rotate = F, cex.lab = 0.75, cex.axis=0.75)
The confint function provides 95% credible intervals for the coefficients. Given the data, there is a 95% chance that audience_score increases by 13.55 to 16.50 points for each additional point of imdb_rating, all else held constant.
coef_movies <- coef(movies.ZS)
confint(coef_movies)
## 2.5% 97.5% beta
## Intercept 6.133800e+01 6.293762e+01 6.216990e+01
## runtime -9.031633e-02 0.000000e+00 -3.017506e-02
## thtr_rel_year -5.306542e-02 1.527681e-04 -4.168073e-03
## imdb_rating 1.355340e+01 1.650167e+01 1.488618e+01
## imdb_num_votes -1.035726e-07 3.691489e-06 2.691560e-07
## critics_score 0.000000e+00 1.111471e-01 6.882612e-02
## best_pic_nomyes -7.951543e-04 5.124192e+00 5.517638e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -7.938603e-03
## best_actor_winyes -2.282444e+00 2.257890e-02 -2.164581e-01
## best_actress_winyes -3.003674e+00 1.682111e-03 -3.446054e-01
## best_dir_winyes -1.521390e+00 1.399334e-03 -1.260273e-01
## top200_boxyes -1.159643e-02 1.353316e-01 9.631643e-02
## feature_filmyes -1.207731e+00 6.452415e-03 -9.729139e-02
## dramayes 0.000000e+00 0.000000e+00 2.228857e-02
## mpaa_rating_Ryes -2.070041e+00 0.000000e+00 -2.493305e-01
## oscar_seasonyes -8.331063e-01 0.000000e+00 -6.625360e-02
## summer_seasonyes 0.000000e+00 1.027877e+00 9.018730e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Eddie The Eagle (2016), a movie not in the dataset, is used to test prediction. Using Bayesian Model Averaging (BMA), the model predicts, with a 95% credible interval, that Eddie The Eagle has an audience_score between 57 and 97 points. Its actual audience_score of 82 falls within this interval.
EddieTheEagle <- data.frame(runtime = 106,
thtr_rel_year = 2016,
imdb_rating = 7.4,
imdb_num_votes = 85797,
critics_score = 82,
best_pic_nom = "no",
best_pic_win = "no",
best_actor_win = "no",
best_actress_win = "no",
best_dir_win = "no",
top200_box = "no",
feature_film = "yes",
drama = "yes",
mpaa_rating_R = "no",
oscar_season = "no",
summer_season = "no")
movies.BMA <- predict(movies.ZS, estimator = "BMA")
BMA <- as.vector(which.matrix(movies.ZS$which[movies.BMA$best],
movies.ZS$n.vars))
movies.ZS.BMA <- bas.lm(audience_score ~ ., data = selected,
prior = "ZS-null",
modelprior = uniform(),
bestmodel = BMA, n.models = 1)
pred.movies.BMA <- predict(movies.ZS.BMA, newdata = EddieTheEagle, estimator = "BMA",
se.fit = TRUE)
ci.pred.movies.BMA <- confint(pred.movies.BMA, estimator = "BMA")
actual.EddieTheEagle <- data.frame(actual = 82)
tbl.EddieTheEagle <- cbind(ci.pred.movies.BMA[1], ci.pred.movies.BMA[2],
ci.pred.movies.BMA[3])
tbl.EddieTheEagle.out <- data.frame(title = "Eddie The Eagle",
prediction = tbl.EddieTheEagle,
actual = actual.EddieTheEagle)
colnames(tbl.EddieTheEagle.out) <- c("title", "2.5%", "97.5%", "prediction", "actual score")
tibble(tbl.EddieTheEagle.out) %>%
kbl(caption ="Table 2 - Movie Prediction") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| title | 2.5% | 97.5% | prediction | actual score |
|---|---|---|---|---|
| Eddie The Eagle | 56.97195 | 96.70901 | 77.27737 | 82 |
In conclusion, the variables most useful for audience_score prediction on Rotten Tomatoes are runtime, imdb_rating, and critics_score.
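As a supplementary sketch (not part of the original analysis), the single highest probability model and the variables it contains can be extracted directly; this assumes BAS's predict() with estimator = "HPM" and its variable.names() method for prediction objects:
# sketch: extract the highest probability model (HPM) and list its predictors
HPM_pred <- predict(movies.ZS, estimator = "HPM")
variable.names(HPM_pred)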