Bayesian modeling and prediction for movies
Setup
Load packages
library(ggplot2)
library(dplyr)## Warning: package 'dplyr' was built under R version 4.1.3
library(statsr)## Warning: package 'statsr' was built under R version 4.1.3
## Warning: package 'BayesFactor' was built under R version 4.1.3
## Warning: package 'coda' was built under R version 4.1.3
library(tidyverse)## Warning: package 'tidyr' was built under R version 4.1.3
library(kableExtra)
library(purrr)
library(viridis)## Warning: package 'viridis' was built under R version 4.1.3
library(gridExtra)
library(fabletools)
library(BAS)## Warning: package 'BAS' was built under R version 4.1.3
Load data
# getwd()
# Clear workspace
rm(list = ls())
# Import data
load("movies.Rdata")Part 1: Data
We will list some remarks on the data set as follows:
- The
moviesdata set provided for this project includes 651 randomly sampled movies released in the United States before 2016 (1970-2016). - No other information is given on the random sampling method was applied.
- The data may indicate a simple random sample (SRS) of movies. For instance, there are varying amounts of movies for each theater release year as well as different proportions of movies for each genre Thus, there are no controls in place for the SRS to closely resemble actual distributions of movies from 1970-2016. This means the data is highly susceptible to sampling variation when considering its representatives of all movies from this time.
In addition, the measures of movie popularity such as the Rotten Tomatoes scores and the IMDB rating are themselves samples of opinions. These websites are dependent on voluntary participation by users and; therefore, these popularity scores are impossible to avoid bias, and such degrees of bias could be different for each individual movie. Therefore, any conclusions drawn from the data are generalizable to US movies since a simple random sample was used.
The simple random sample does not involve random assignment of movies to the factors under consideration. In addition, the data do not represent an experiment or observational study, and, thus, there are limitations to what sort of conclusions can be drawn from the data. No causality can be determined from inference on this data set, only evidence for associations. Such relationships are useful for hypothesis settings for the design of future researches that may form causality. Nonetheless, the relationships illustrated in this data set may be informative and useful for features of concurrent US movies and their popularity. * * *
Part 2: Data manipulation
Assumed that the genre associated with all movies are correctly specified, we will first examine the “Other” genre to inspect whether there is anything that could be wrongly categorized. Some movies such as “Django Unchained” could be labeled as “Drama”; thereby, we will update the dataset with new label for some movies.
# Movies with genre "Other" that are comedy
movies$genre[which(movies$title == "The Groove Tube")] <- "Comedy"
movies$genre[which(movies$title == "The Butcher's Wife")] <- "Comedy"
movies$genre[which(movies$title == "Groundhog Day")] <- "Comedy"
movies$genre[which(movies$title == "The Main Event")] <- "Comedy"
movies$genre[which(movies$title == "Goin' South")] <- "Comedy"
movies$genre[which(movies$title == "Russian Dolls (Les Poupees Russes)")] <- "Comedy"
movies$genre[which(movies$title == "UHF")] <- "Comedy"
movies$genre[which(movies$title == "Grease")] <- "Comedy"
# Movies with genre "Other" that are dramas
movies$genre[which(movies$title == "Django Unchained")] <- "Drama"
movies$genre[which(movies$title == "The Color of Money")] <- "Drama"
movies$genre[which(movies$title == "Down in the Valley")] <- "Drama"
movies$genre[which(movies$title == "The Fighter")] <- "Drama"
movies$genre[which(movies$title == "Urban Cowboy")] <- "Drama"
# Movies with genre "Other" that are Action & Adventure
movies$genre[which(movies$title == "Maverick")] <- "Action & Adventure"
# Movies with genre "Other" that are Scifi
movies$genre[which(movies$title == "Attack of the 50 Foot Woman")] <- "Science Fiction & Fantasy"Next, we will perform data preprocessing as required. We only keep variables that are meaningful to the modeling and discard those with little anticipated contribution.
# retain old version of data for future use
movies_pre <- movies
# remove na
movies <- na.omit(movies)
# Create new variables
movies %>% mutate(feature_film = factor(ifelse(title_type == "Feature Film", "yes", "no")),
drama = factor(ifelse(genre == "Drama", "yes","no")),
mpaa_rating_R = factor(ifelse(mpaa_rating == "R", "yes", "no")),
oscar_season = factor(ifelse(thtr_rel_month %in% c(10:12), "yes", "no")),
summer_season=
factor(ifelse(thtr_rel_month %in% c(5:8), "yes", "no"))) -> movies
# Select requested variables
movies <- movies %>% select(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season,imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)Part 3: Exploratory data analysis
At this stage, we usually visualize the relationship between the
dependent variable (or regresssand) audience_score
with independent variables (or regressors). The concentration
will be placed around those newly added variables and their effects on
our score will be explored in this section. Each new feature will be
attached with a summary statistic table and a boxplot to examine those
relationships.
Summary Statistics Table and Plots
We will create a general function to compute some underlying statistics and return a dataframe.
summary_stats <- function(feature, kable_return = TRUE){
feature <- enquo(feature)
movies %>%
as_tibble() %>%
group_by(!!feature) %>%
summarize(Length = n(), Mean = round(mean(audience_score),1),
Min = round(min(audience_score),1),
Max = round(max(audience_score),1),
Std = round(sd(audience_score),1)) -> stats_df
if (kable_return){
return(stats_df %>% kbl(caption = "Summary statistics", escape = TRUE) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")) }
else{return(stats_df)}
}
mytheme = list(
theme_classic() +
theme(legend.position=c(0.9,0.9), plot.title = element_text(face = "bold",hjust = 0.5),
panel.grid = element_blank(),
panel.background = element_rect(fill = "white", colour = "grey50"),
strip.background = element_rect(colour=NA, fill=NA),
panel.border = element_rect(fill = NA, color = "black"),
axis.text=element_text(face="bold"),
axis.title = element_text(face="bold"),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")))feature_film variable
Using the above defined function, we easily extract some statistics about the feature of interest
summary_stats(feature = feature_film)| feature_film | Length | Mean | Min | Max | Std |
|---|---|---|---|---|---|
| no | 46 | 82.5 | 19 | 96 | 11.9 |
| yes | 573 | 60.6 | 11 | 97 | 19.8 |
ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = factor(feature_film), color = factor(feature_film))) +
geom_boxplot(alpha =0.6) +
scale_fill_viridis(discrete = TRUE,
option = "viridis",
begin = 0.2 ) +
labs(title = "Boxplot of `audience_score` by `feature_film`", y = "`audience_score`", x = "`feature_film`", fill = "`feature_film`",color = "`feature_film`") + mytheme
Movies that are not feature films have an average of 20.5 higher
audience score than those movies that are feature films. Furthermore,
the scores of the two groups appear to be dissimilar if variability
(measured by standard deviation) are accounted.
drama variable
summary_stats(feature = drama)| drama | Length | Mean | Min | Max | Std |
|---|---|---|---|---|---|
| no | 316 | 59.1 | 11 | 97 | 21.1 |
| yes | 303 | 65.5 | 13 | 95 | 18.6 |
ggplot(data = movies, aes(x = drama, y = audience_score, fill = factor(drama), color = factor(drama))) +
geom_boxplot(alpha =0.6) +
scale_fill_viridis(discrete = TRUE,
option = "viridis",
begin = 0.2 ) +
labs(title = "Boxplot of `audience_score` by `drama`", y = "`audience_score`", x = "`drama`", fill = "`drama`",color = "`drama`") + mytheme
Drama movies constitutes up to half of the movies in the data set. The
IQR of two groups are pretty similar, implies that there may be no
significant difference in the impact of those two on audience score.
oscar_season variable
summary_stats(feature = oscar_season)| oscar_season | Length | Mean | Min | Max | Std |
|---|---|---|---|---|---|
| no | 440 | 61.5 | 11 | 96 | 20.1 |
| yes | 179 | 63.9 | 13 | 97 | 20.3 |
ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = factor(oscar_season), color = factor(oscar_season))) +
geom_boxplot(alpha =0.6) +
scale_fill_viridis(discrete = TRUE,
option = "viridis",
begin = 0.2 ) +
labs(title = "Boxplot of `audience_score` by `oscar_season`", y = "`audience_score`", x = "`oscar_season`", fill = "`oscar_season`",color = "`oscar_season`") + mytheme
Fewer movies are released during Oscar season than other times in year.
The IQR of two groups are pretty similar, implies that there may be no
significant difference in the impact of those two on audience score.
####
summer_season variable
summary_stats(feature = summer_season)| summer_season | Length | Mean | Min | Max | Std |
|---|---|---|---|---|---|
| no | 418 | 62.4 | 13 | 97 | 20.3 |
| yes | 201 | 61.9 | 11 | 94 | 19.9 |
ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = factor(summer_season), color = factor(summer_season))) +
geom_boxplot(alpha =0.6) +
scale_fill_viridis(discrete = TRUE,
option = "viridis",
begin = 0.2 ) +
labs(title = "Boxplot of `audience_score` by `summer_season`", y = "`audience_score`", x = "`summer_season`", fill = "`summer_season`",color = "`summer_season`") + mytheme
Fewer movies are released during summer season than other seasons within
a year. The IQR of two groups are pretty similar and so are the
distributions, implies that there may be no significant difference in
the impact of those two on audience score. ####
mpaa_rating_R variable
summary_stats(feature = mpaa_rating_R)| mpaa_rating_R | Length | Mean | Min | Max | Std |
|---|---|---|---|---|---|
| no | 300 | 62.0 | 11 | 96 | 20.3 |
| yes | 319 | 62.4 | 14 | 97 | 20.1 |
ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = factor(mpaa_rating_R), color = factor(mpaa_rating_R))) +
geom_boxplot(alpha =0.6) +
scale_fill_viridis(discrete = TRUE,
option = "viridis",
begin = 0.2 ) +
labs(title = "Boxplot of `audience_score` by `mpaa_rating_R`", y = "`audience_score`", x = "`mpaa_rating_R`", fill = "`mpaa_rating_R`",color = "`mpaa_rating_R`") + mythemeOne-half of the movies are labeled “R” by the MPAA. The IQR of two groups are pretty similar and so are the distributions, implies that there may be no significant difference in the impact of those two on audience score.
Part 4: Modeling
Fit the BAS Model
In this section we are supposed to use specific variables to create a final Bayesian Regression model.
The formula for the regression model can be written as follows: \[\operatorname{audience\_score} = \beta_0 + \beta_1 \operatorname{feature\_film} + \beta_2 \operatorname{drama} + \beta_3 \operatorname{mpaa\_rating\_R} + \beta_4 \operatorname{oscar\_season} + \beta_5 \operatorname{summer\_season} \]
Using the BAS package, a Bayesian regression is
conducted for the data set. A BIC prior is used in addition to the
uniform model prior:
bas_mod = bas.lm(audience_score ~ .,
data=movies,
initprobs = "eplogp",
prior="ZS-null",
modelprior=uniform(),
method = "MCMC")mod_coef <- coef(bas_mod, estimator = "BMA")
# find posterior probabilities
mod_coef_clean <- data.frame(parameter = mod_coef$namesx, post_mean = mod_coef$postmean, post_SD = mod_coef$postsd, post_pne0 = mod_coef$probne0) %>%
arrange(post_pne0) %>%
filter(parameter != "Intercept")
mod_coef_clean$parameter <- factor(mod_coef_clean$parameter, levels = mod_coef_clean$parameter[order(mod_coef_clean$post_pne0, decreasing = TRUE)])
high_post_pne0 <- data.frame(parameter = mod_coef_clean$parameter, post_pne0 = mod_coef_clean$post_pne0) %>%
filter(post_pne0 > 0.5)
# Plot the Data
ggplot(mod_coef_clean, aes(x = parameter, y = post_pne0)) +
geom_pointrange(aes(ymax = post_pne0, ymin = 0)) +
geom_pointrange(data=high_post_pne0, aes(x = parameter, y = post_pne0, ymax = post_pne0), ymin = 0, color = "darkred") +
geom_hline(yintercept = 0.5, color = "red") +
labs(title = "Posterior Marginal Inclusion Probabilities of Explanatory Variables",
x="Explanatory Variable",y = "Marginal Inclusion Probability") +
mytheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
The three explanatory variables with a posterior probability (measured
by marginal inclusion probability) are greater than 0.5 are
imdb_rating, critics_score, and
runtime. These three features have a high chance of
establishing a relationship with audience_score. However,
among those three, the size effect of imdb_rating
approximately, \(15\), is significantly
higher than others, \(0.06\) and \(-0.03\) (two hundred times larger than the
second one which is critics_score). This is illustrated via
two figures below: \(95\%\) credible
intervals of all variables and top three highest marginal inclusion
probabilities variables.
# find credible intervals for betas
coef_int <- data.frame(confint(mod_coef, c(2:17))[,])
coef_int$parameter <- rownames(coef_int)
rownames(coef_int) <- NULL
mod_coef_clean <- mod_coef_clean %>% left_join(coef_int)## Joining, by = "parameter"
mod_coef_clean$parameter <- factor(mod_coef_clean$parameter, levels = mod_coef_clean$parameter[order(mod_coef_clean$beta)])
high_post_pne0 <- high_post_pne0 %>% left_join(mod_coef_clean) %>%
rename(conf_25 = X2.5.,
conf_97_5 = X97.5.)## Joining, by = c("parameter", "post_pne0")
mod_coef_clean %>% rename(conf_25 = X2.5.,
conf_97_5 = X97.5.) -> mod_coef_clean
# Plot the Data
all_var <- ggplot(mod_coef_clean, aes(y = beta, x = parameter)) +
geom_pointrange(aes(ymax = conf_97_5, ymin = conf_25),colour="black", position=position_dodge(0.1)) +
geom_point(data=high_post_pne0, aes(y = beta, x = parameter), color = "red", size = 2.5) +
geom_hline(yintercept = 0, color = "red") +
labs(title = "Coefficients Values with 95% Credible Intervals",x="Explanatory Variable",y = "Beta Value") +
coord_flip()
top_3 <- ggplot(high_post_pne0, aes(y = beta, x = parameter)) +
geom_pointrange(aes(ymax = conf_97_5, ymin = conf_25),colour="black", position=position_dodge(0.1)) +
geom_point(data=high_post_pne0, aes(y = beta, x = parameter), color = "red", size = 2.5) +
geom_hline(yintercept = 0, color = "red") +
labs(title = "Coefficients Values with 95% Credible Intervals",x="Explanatory Variable",y = "Beta Value") +
coord_flip()
grid.arrange(all_var, top_3, ncol = 1)diagnostics(bas_mod, type="pip", col = "blue", pch = 16, cex = 1.5)
As seen from the figure above, the posterior inclusion probability of
each variable from MCMC have converged well enough to the theoretical
posterior inclusion probability.
diagnostics(bas_mod, type = "model", col = "blue", pch = 16, cex = 1.5)
We can see that all of the points lie on the 45 degree line, validating
the covergence of the model given the filtration of all current
data.
plot(bas_mod, which = 1, add.smooth = F,
ask = F, pch = 16, sub.caption="", caption="")
abline(a = 0, b = 0, col = "darkgrey", lwd = 2)
For most of the fitted values, the residuals are centered around zero,
and the model appears to show a decrease in residual variance at is
higher predicted values. However, for fitted values that are less than
40, the model tends to underestimate the audience scores of the movies
with high variance (strong variability) and this bias may deteriorate
the predicted value. The figure also labels three possible outliers in
this region.
fit <- data.frame(resid = movies$audience_score - predict(bas_mod,estimator = "BMA")$fit)
imdb_check <- ggplot(fit, aes(x=movies$imdb_rating, y = resid))+
geom_point() +
geom_smooth(color="red", lwd = 0.5, se = TRUE, span = 0.5)+
geom_vline(xintercept = 4.5, color = "blue")+
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs IMDB Rating Plot",x="IMDB Rating",y = "Residuals") +
mytheme
critic_check <- ggplot(fit, aes(x=movies$critics_score, y = resid))+
geom_point() +
geom_smooth(color="red", lwd = 0.5, se = TRUE, span = 0.5)+
geom_vline(xintercept = 30, color = "blue")+
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs Critics Score Plot",x="Critics Score",y = "Residuals") +
mytheme
grid.arrange(imdb_check, critic_check,ncol = 2)## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Number of movies having imdb_rating < 4.5
movies %>% filter(imdb_rating < 4.5) %>%
select(imdb_rating) %>%
summarise(n = n())## # A tibble: 1 x 1
## n
## <int>
## 1 32
We check the fitted values with imdb_rating factor. As
can be seen from the left-hand side plot, movies with a
imdb_rating of less than about 4.5 points (marked on the
plot with a blue vertical line) may suffer from the issue of
underestimation of audience_score. The possible explanation
is that most movies do not have imdb_rating less than this
threshold (32 over 619 or \(5\%\) of
the data set). This can cause a bias in our modeling.
p1 <- ggplot(data=movies, aes(imdb_rating)) + geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, args = list(mean = mean(movies$imdb_rating), sd = sd(movies$imdb_rating)), aes(colour = 'Normal')) + geom_histogram(aes(y = ..density..), alpha = 0.4) +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
theme(legend.position = c(0.2, 0.85))
p2 <- ggplot(data=movies, aes(critics_score)) + geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, args = list(mean = mean(movies$critics_score), sd = sd(movies$critics_score)), aes(colour = 'Normal')) + geom_histogram(aes(y = ..density..), alpha = 0.4) +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
theme(legend.position = c(0.2, 0.85))
grid.arrange(p1, p2, ncol = 2)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
here we check the distribution of two most essential predictors in the
model. While
imdb_rating exihibits left-skewed, the
critics-score distribution are multi-modal.
# Summary of the model's performance on training phase
set.seed(28)
n = nrow(movies)
n_cv = 80
ape = matrix(NA, ncol=2, nrow=n_cv) # initialize output matrix
colnames(ape) = c("BMA", "Mean_RMSE")
split_ratio <- 0.9
pb <- txtProgressBar(min = 0, max = n_cv, style = 3) # intialize progress bar##
|
| | 0%
for (i in 1:n_cv) {
setTxtProgressBar(pb, i) # Update progress bar
idx_train = sample(1:n, size=round(split_ratio*n), replace=FALSE) # Determine data split
train_set = movies[idx_train,-17]
test_set = movies[-idx_train,-17]
mod_train = bas.lm(audience_score ~ ., data=train_set,
prior="BIC", modelprior=uniform(), initprobs="eplogp") # Fit model to just training data
yhat_train = predict(mod_train, test_set, estimator="BMA")$fit # predict test data from train model
ape[i, "BMA"] = cv.summary.bas(yhat_train, test_set$audience_score)
ape[i, "Mean_RMSE"] = NA
if (i > 1) {
ape[i, "Mean_RMSE"] = mean(ape[, "BMA"], na.rm = TRUE)
}
}##
|
|= | 1%
|
|== | 2%
|
|=== | 4%
|
|==== | 5%
|
|==== | 6%
|
|===== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======== | 11%
|
|========= | 12%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 16%
|
|============ | 18%
|
|============= | 19%
|
|============== | 20%
|
|=============== | 21%
|
|================ | 22%
|
|================= | 24%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|======================= | 32%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 36%
|
|========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================== | 42%
|
|=============================== | 44%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|===================================== | 52%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 56%
|
|======================================== | 58%
|
|========================================= | 59%
|
|========================================== | 60%
|
|=========================================== | 61%
|
|============================================ | 62%
|
|============================================= | 64%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|=================================================== | 72%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================== | 82%
|
|=========================================================== | 84%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================= | 92%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
ape_df <- tibble(batch = 1:80, RMSE = ape[,1], rolling_RMSE = ape[,2])
summary_RMSE <- ape_df %>%
as_tibble() %>%
summarize(n = n(), Mean = mean(RMSE, na.rm = TRUE),
Std = sd(RMSE, na.rm = TRUE), Median = median(rolling_RMSE, na.rm = TRUE))
# plot th data
ggplot(ape_df, aes(x=batch, y = rolling_RMSE)) +
geom_line() +
geom_hline(yintercept = summary_RMSE[[1,4]], color = "#F8766D", lty = 2) +
labs(title = "Cross Validation RMSE Point Estimates by Iteration", x="Iteration",y = "Mean RMSE Estimates") +
mytheme## Warning: Removed 1 row(s) containing missing values (geom_path).
This figure supports the convergence of the accuracy metrics (here we
used Root Mean Squared Error) after a fixed number of
iterations of random sampling. At some beginning iterations, the RMSE
tends to behave like a random walk, implying that the model does not
actually learn the underlying patterns. However, the convergence of RMSE
indicates the ability of learn of our model eventually.
# Quantile-Quantile Plot of Residuals
test <- predict(bas_mod,estimator = "BMA")
resid <- movies$audience_score - test$fit
mu_resid <- mean(resid)
sd_resid <- sd(resid)
std_resid <- (resid-mu_resid)/sd_resid
par(mfrow=c(1,2))
qqnorm(std_resid, lty = 2)
qqline(std_resid)
plot(density(std_resid), main="Probability Density of Std. Residuals",
xlab="Std. Residuals", ylab="P(Std. Residuals)")
The quantile-quantile and density plots of the standardized residuals
represents the quality of the fitted values via residuals (or
standardized residuals). For the majority of fitted values, the
standardized residuals are alike normally distributed around zero. In
evaluating the of the model’s residuals, there is a fairly right-skewed
pattern to the residuals which suggests the model tends to underestimate
the
audience_score rather than overestimate it. However,
this is a marginally trivial stray away from normality compared to the
overall normal distribution of the residuals and does not contradict
with the model’s assumption.
set.seed(28)
n = nrow(movies)
outliers_mat = cbind(movies, diag(1, nrow=n))
outliers_movies = bas.lm(audience_score ~ ., data=outliers_mat,
prior="ZS-null", a=n,
modelprior=tr.beta.binomial(a=1, b=1, trunc=n/2),
method="MCMC",
initprobs="marg-eplogp",
MCMC.iterations=2*10^6)
outliers_movies$namesx[outliers_movies$probne0 > .5][c(3,4,5)] %>%
kbl(caption = "Outliers", escape = TRUE) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")| x |
|---|
122
|
206
|
239
|
From the outlier check, it appears rows 122, 206, and 239 are the significant outliers for the model.
Part 5: Prediction
Given the model settings, we will opt for randomly one new movie at the beginning of 2017 (01-01-2017) to test the ability of predicitng audience score. The chosen movie is the Oscar winner Get Out by Jordan Peele. Features of new data will be updated by collecting information from imdb.
get_out <- tibble(audience_score = 85, feature_film = "yes", drama = "yes", runtime=104, mpaa_rating_R = "yes",thtr_rel_year = 2017, oscar_season = "no", summer_season = "yes", imdb_rating = 8.0, imdb_num_votes = 337591,critics_score=94, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "yes")
prediction_bma <- predict(bas_mod, get_out, estimator = "BMA", se.fit=TRUE)
prediction_hpm <- predict(bas_mod, get_out, estimator = "HPM", se.fit=TRUE)
prediction_mpm <- predict(bas_mod, get_out, estimator = "MPM", se.fit=TRUE)
ci_score<- list(confint(prediction_bma, parm="pred"), confint(prediction_hpm, parm="pred"), confint(prediction_mpm, parm="pred"))
result_df <- tibble(type = c("BMA", "HPM", "MPM"), lwr = c(ci_score[[1]][1],ci_score[[2]][1], ci_score[[3]][1]), pred = c(ci_score[[1]][3],ci_score[[2]][3],ci_score[[3]][3]), upr = c(ci_score[[1]][2], ci_score[[2]][2], ci_score[[3]][2]))
resid <- result_df$pred - get_out$audience_score
acc_df <- tibble(Acc.Metrics = c("BMA", "HPM","MPM"), rmse = c(RMSE(resid[1]), RMSE(resid[2]), RMSE(resid[3])),
mae = c(MAE(resid[1]), MAE(resid[2]), MAE(resid[3])),
mape = c(MAE(resid[1], 85), MAE(resid[2], 85), MAE(resid[3], 85)))
print(acc_df)## # A tibble: 3 x 4
## Acc.Metrics rmse mae mape
## <chr> <dbl> <dbl> <dbl>
## 1 BMA 2.46 2.46 2.46
## 2 HPM 2.71 2.71 2.71
## 3 MPM 2.71 2.71 2.71
print(result_df)## # A tibble: 3 x 4
## type lwr pred upr
## <chr> <dbl> <dbl> <dbl>
## 1 BMA 67.9 87.5 108.
## 2 HPM 67.7 87.7 108.
## 3 MPM 67.7 87.7 108.
The predicted audience_score for Get Out was
approximately 87.6 for BMA predicted value, 87.12 for Highest
Posterior Model and 87.71 for Median Posterior Model.
Three 95% credible intervals contain the actual
audience_score value, 85, which implies that the model are
capable of capturing uncertainty of the forecast. Three accuracy
metrics, including scale-dependent metrics and percentage metric are
quite small, uphold the ability of predicting the audience score of the
proposed model.
Part 6: Conclusion
var_bma <- variable.names(prediction_hpm)
var_hpm <- variable.names(prediction_bma)
var_mpm <- variable.names(prediction_mpm)
Reduce(intersect, list(var_bma,var_hpm,var_mpm))## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
From three models, the two most important features for predicting
score of a given movie are imdb_rating and
critics_score. However, in diagnosing the outliers in
the model’s predictions, it appears that the model fails to predict the
audience_score when the imdb_rating (after we
transform it to match the scale of the audience score) is not
approximately equal to the audience_score, indicating a
contradiction in ratings between users of IMDB and Rotten Tomatoes.
However, there are only a few movies with this issue in the data set.
The credible interval is still large which signifies the forecast
variance in the underlying model.
It is straightforward that more information is necessary to the enhancement of the performance of the model in forecasting a movie’s popularity. Larger sample sizes may help, but even better changes would include advanced sampling designs that control for sources of confounding in the data and make the sample more representative of the typical movies being released. In addition, more specific and distinguishable features of movies should be incorporated to the data set. Features that detail the general theme of the movie’s story, or advertising campaign/budget for the film are possible features that may affect popularity. Eventually, a study on the cases where IMDB and Rotten Tomato scores given by website users disagreed by a significant degree should be the next objective in a near future to address the failure on some movies’ score of the proposed model.