Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
library(statsr)
## Warning: package 'statsr' was built under R version 4.1.3
## Warning: package 'BayesFactor' was built under R version 4.1.3
## Warning: package 'coda' was built under R version 4.1.3
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.1.3
library(kableExtra)
library(purrr)
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.3
library(gridExtra)
library(fabletools)
library(BAS)
## Warning: package 'BAS' was built under R version 4.1.3

Load data

# getwd()
# Clear workspace
rm(list = ls())
# Import data
load("movies.Rdata")

Part 1: Data

We will list some remarks on the data set as follows:

  • The movies data set provided for this project includes 651 randomly sampled movies released in the United States before 2016 (1970-2016).
  • No other information is given on the random sampling method was applied.
  • The data may indicate a simple random sample (SRS) of movies. For instance, there are varying amounts of movies for each theater release year as well as different proportions of movies for each genre Thus, there are no controls in place for the SRS to closely resemble actual distributions of movies from 1970-2016. This means the data is highly susceptible to sampling variation when considering its representatives of all movies from this time.

In addition, the measures of movie popularity such as the Rotten Tomatoes scores and the IMDB rating are themselves samples of opinions. These websites are dependent on voluntary participation by users and; therefore, these popularity scores are impossible to avoid bias, and such degrees of bias could be different for each individual movie. Therefore, any conclusions drawn from the data are generalizable to US movies since a simple random sample was used.

The simple random sample does not involve random assignment of movies to the factors under consideration. In addition, the data do not represent an experiment or observational study, and, thus, there are limitations to what sort of conclusions can be drawn from the data. No causality can be determined from inference on this data set, only evidence for associations. Such relationships are useful for hypothesis settings for the design of future researches that may form causality. Nonetheless, the relationships illustrated in this data set may be informative and useful for features of concurrent US movies and their popularity. * * *

Part 2: Data manipulation

Assumed that the genre associated with all movies are correctly specified, we will first examine the “Other” genre to inspect whether there is anything that could be wrongly categorized. Some movies such as “Django Unchained” could be labeled as “Drama”; thereby, we will update the dataset with new label for some movies.

# Movies with genre "Other" that are comedy 

movies$genre[which(movies$title == "The Groove Tube")] <- "Comedy"
movies$genre[which(movies$title == "The Butcher's Wife")] <- "Comedy"
movies$genre[which(movies$title == "Groundhog Day")] <- "Comedy"
movies$genre[which(movies$title == "The Main Event")] <- "Comedy"
movies$genre[which(movies$title == "Goin' South")] <- "Comedy"
movies$genre[which(movies$title == "Russian Dolls (Les Poupees Russes)")] <- "Comedy"
movies$genre[which(movies$title == "UHF")] <- "Comedy"

movies$genre[which(movies$title == "Grease")] <- "Comedy"

# Movies with genre "Other" that are dramas
movies$genre[which(movies$title == "Django Unchained")] <- "Drama"
movies$genre[which(movies$title == "The Color of Money")] <- "Drama"
movies$genre[which(movies$title == "Down in the Valley")] <- "Drama"
movies$genre[which(movies$title == "The Fighter")] <- "Drama"
movies$genre[which(movies$title == "Urban Cowboy")] <- "Drama"

# Movies with genre "Other" that are Action & Adventure
movies$genre[which(movies$title == "Maverick")] <- "Action & Adventure"

# Movies with genre "Other" that are Scifi
movies$genre[which(movies$title == "Attack of the 50 Foot Woman")] <- "Science Fiction & Fantasy"

Next, we will perform data preprocessing as required. We only keep variables that are meaningful to the modeling and discard those with little anticipated contribution.

# retain old version of data for future use
movies_pre <- movies
# remove na
movies <- na.omit(movies)

# Create new variables
movies %>% mutate(feature_film = factor(ifelse(title_type == "Feature Film", "yes", "no")),
                  drama = factor(ifelse(genre == "Drama", "yes","no")),
                  mpaa_rating_R = factor(ifelse(mpaa_rating == "R", "yes", "no")),
                  oscar_season = factor(ifelse(thtr_rel_month %in% c(10:12), "yes", "no")),
                  summer_season=
factor(ifelse(thtr_rel_month %in% c(5:8), "yes", "no"))) -> movies

# Select requested variables
movies <- movies %>% select(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season,imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)

Part 3: Exploratory data analysis

At this stage, we usually visualize the relationship between the dependent variable (or regresssand) audience_score with independent variables (or regressors). The concentration will be placed around those newly added variables and their effects on our score will be explored in this section. Each new feature will be attached with a summary statistic table and a boxplot to examine those relationships.

Summary Statistics Table and Plots

We will create a general function to compute some underlying statistics and return a dataframe.

summary_stats <- function(feature, kable_return = TRUE){
  feature <- enquo(feature)
  movies %>%
    as_tibble() %>%
    group_by(!!feature) %>%
    summarize(Length = n(), Mean = round(mean(audience_score),1),
              Min = round(min(audience_score),1),
              Max = round(max(audience_score),1),
              Std = round(sd(audience_score),1)) -> stats_df
  if (kable_return){
    return(stats_df %>% kbl(caption = "Summary statistics", escape = TRUE) %>%
      kable_classic(full_width = FALSE, html_font = "Cambria")) }
  else{return(stats_df)}
}


mytheme = list(
  theme_classic() + 
    theme(legend.position=c(0.9,0.9), plot.title = element_text(face = "bold",hjust = 0.5),
  panel.grid = element_blank(),
  panel.background = element_rect(fill = "white", colour = "grey50"),
  strip.background = element_rect(colour=NA, fill=NA),
  panel.border = element_rect(fill = NA, color = "black"),
  axis.text=element_text(face="bold"),
  axis.title = element_text(face="bold"),
  legend.background = element_blank(),
  legend.box.background = element_rect(colour = "black")))

feature_film variable

Using the above defined function, we easily extract some statistics about the feature of interest

summary_stats(feature = feature_film)
Summary statistics
feature_film Length Mean Min Max Std
no 46 82.5 19 96 11.9
yes 573 60.6 11 97 19.8
ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = factor(feature_film), color = factor(feature_film))) +
  geom_boxplot(alpha =0.6) +
  scale_fill_viridis(discrete = TRUE,
                     option = "viridis",
                     begin = 0.2 )  +
  labs(title = "Boxplot of `audience_score` by `feature_film`", y = "`audience_score`", x = "`feature_film`", fill = "`feature_film`",color = "`feature_film`") + mytheme

Movies that are not feature films have an average of 20.5 higher audience score than those movies that are feature films. Furthermore, the scores of the two groups appear to be dissimilar if variability (measured by standard deviation) are accounted.

drama variable

summary_stats(feature = drama)
Summary statistics
drama Length Mean Min Max Std
no 316 59.1 11 97 21.1
yes 303 65.5 13 95 18.6
ggplot(data = movies, aes(x = drama, y = audience_score, fill = factor(drama), color = factor(drama))) +
  geom_boxplot(alpha =0.6) +
  scale_fill_viridis(discrete = TRUE,
                     option = "viridis",
                     begin = 0.2 )  +
  labs(title = "Boxplot of `audience_score` by `drama`", y = "`audience_score`", x = "`drama`", fill = "`drama`",color = "`drama`") + mytheme

Drama movies constitutes up to half of the movies in the data set. The IQR of two groups are pretty similar, implies that there may be no significant difference in the impact of those two on audience score.

oscar_season variable

summary_stats(feature = oscar_season)
Summary statistics
oscar_season Length Mean Min Max Std
no 440 61.5 11 96 20.1
yes 179 63.9 13 97 20.3
ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = factor(oscar_season), color = factor(oscar_season))) +
  geom_boxplot(alpha =0.6) +
  scale_fill_viridis(discrete = TRUE,
                     option = "viridis",
                     begin = 0.2 )  +
  labs(title = "Boxplot of `audience_score` by `oscar_season`", y = "`audience_score`", x = "`oscar_season`", fill = "`oscar_season`",color = "`oscar_season`") + mytheme

Fewer movies are released during Oscar season than other times in year. The IQR of two groups are pretty similar, implies that there may be no significant difference in the impact of those two on audience score. #### summer_season variable

summary_stats(feature = summer_season)
Summary statistics
summer_season Length Mean Min Max Std
no 418 62.4 13 97 20.3
yes 201 61.9 11 94 19.9
ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = factor(summer_season), color = factor(summer_season))) +
  geom_boxplot(alpha =0.6) +
  scale_fill_viridis(discrete = TRUE,
                     option = "viridis",
                     begin = 0.2 )  +
  labs(title = "Boxplot of `audience_score` by `summer_season`", y = "`audience_score`", x = "`summer_season`", fill = "`summer_season`",color = "`summer_season`") + mytheme

Fewer movies are released during summer season than other seasons within a year. The IQR of two groups are pretty similar and so are the distributions, implies that there may be no significant difference in the impact of those two on audience score. #### mpaa_rating_R variable

summary_stats(feature = mpaa_rating_R)
Summary statistics
mpaa_rating_R Length Mean Min Max Std
no 300 62.0 11 96 20.3
yes 319 62.4 14 97 20.1
ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = factor(mpaa_rating_R), color = factor(mpaa_rating_R))) +
  geom_boxplot(alpha =0.6) +
  scale_fill_viridis(discrete = TRUE,
                     option = "viridis",
                     begin = 0.2 )  +
  labs(title = "Boxplot of `audience_score` by `mpaa_rating_R`", y = "`audience_score`", x = "`mpaa_rating_R`", fill = "`mpaa_rating_R`",color = "`mpaa_rating_R`") + mytheme

One-half of the movies are labeled “R” by the MPAA. The IQR of two groups are pretty similar and so are the distributions, implies that there may be no significant difference in the impact of those two on audience score.


Part 4: Modeling

Fit the BAS Model

In this section we are supposed to use specific variables to create a final Bayesian Regression model.

The formula for the regression model can be written as follows: \[\operatorname{audience\_score} = \beta_0 + \beta_1 \operatorname{feature\_film} + \beta_2 \operatorname{drama} + \beta_3 \operatorname{mpaa\_rating\_R} + \beta_4 \operatorname{oscar\_season} + \beta_5 \operatorname{summer\_season} \]

Using the BAS package, a Bayesian regression is conducted for the data set. A BIC prior is used in addition to the uniform model prior:

bas_mod =  bas.lm(audience_score ~ ., 
                   data=movies,
                  initprobs = "eplogp",
                   prior="ZS-null",
                   modelprior=uniform(),
                  method = "MCMC")
mod_coef <- coef(bas_mod, estimator = "BMA")
# find posterior probabilities 
mod_coef_clean <- data.frame(parameter = mod_coef$namesx, post_mean = mod_coef$postmean, post_SD = mod_coef$postsd, post_pne0 = mod_coef$probne0) %>% 
  arrange(post_pne0) %>% 
  filter(parameter != "Intercept")

mod_coef_clean$parameter <- factor(mod_coef_clean$parameter, levels = mod_coef_clean$parameter[order(mod_coef_clean$post_pne0, decreasing = TRUE)])

high_post_pne0 <- data.frame(parameter = mod_coef_clean$parameter, post_pne0 = mod_coef_clean$post_pne0) %>% 
  filter(post_pne0 > 0.5)
# Plot the Data
ggplot(mod_coef_clean, aes(x = parameter, y = post_pne0)) + 
  geom_pointrange(aes(ymax = post_pne0, ymin = 0)) +
  geom_pointrange(data=high_post_pne0, aes(x = parameter, y = post_pne0, ymax = post_pne0), ymin = 0, color = "darkred") +
  geom_hline(yintercept = 0.5, color = "red") +
  labs(title = "Posterior Marginal Inclusion Probabilities of Explanatory Variables",
         x="Explanatory Variable",y = "Marginal Inclusion Probability") +
  mytheme + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
          plot.title = element_text(hjust = 0.5))

The three explanatory variables with a posterior probability (measured by marginal inclusion probability) are greater than 0.5 are imdb_rating, critics_score, and runtime. These three features have a high chance of establishing a relationship with audience_score. However, among those three, the size effect of imdb_rating approximately, \(15\), is significantly higher than others, \(0.06\) and \(-0.03\) (two hundred times larger than the second one which is critics_score). This is illustrated via two figures below: \(95\%\) credible intervals of all variables and top three highest marginal inclusion probabilities variables.

# find credible intervals for betas
coef_int <- data.frame(confint(mod_coef, c(2:17))[,])
coef_int$parameter <- rownames(coef_int)
rownames(coef_int) <- NULL
mod_coef_clean <- mod_coef_clean %>% left_join(coef_int)
## Joining, by = "parameter"
mod_coef_clean$parameter <- factor(mod_coef_clean$parameter, levels = mod_coef_clean$parameter[order(mod_coef_clean$beta)])
high_post_pne0 <- high_post_pne0 %>% left_join(mod_coef_clean) %>%
  rename(conf_25 = X2.5.,
         conf_97_5 = X97.5.)
## Joining, by = c("parameter", "post_pne0")
mod_coef_clean %>% rename(conf_25 = X2.5.,
         conf_97_5 = X97.5.) -> mod_coef_clean
# Plot the Data
all_var <- ggplot(mod_coef_clean, aes(y = beta, x = parameter)) + 
    geom_pointrange(aes(ymax = conf_97_5, ymin = conf_25),colour="black", position=position_dodge(0.1)) +
    geom_point(data=high_post_pne0, aes(y = beta, x = parameter), color = "red", size = 2.5) +
    geom_hline(yintercept = 0, color = "red") +
    labs(title = "Coefficients Values with 95% Credible Intervals",x="Explanatory Variable",y = "Beta Value") +
    coord_flip() 

top_3 <- ggplot(high_post_pne0, aes(y = beta, x = parameter)) + 
    geom_pointrange(aes(ymax = conf_97_5, ymin = conf_25),colour="black", position=position_dodge(0.1)) +
    geom_point(data=high_post_pne0, aes(y = beta, x = parameter), color = "red", size = 2.5) +
    geom_hline(yintercept = 0, color = "red") +
    labs(title = "Coefficients Values with 95% Credible Intervals",x="Explanatory Variable",y = "Beta Value") +
    coord_flip() 
grid.arrange(all_var, top_3, ncol = 1)

diagnostics(bas_mod, type="pip", col = "blue", pch = 16, cex = 1.5)

As seen from the figure above, the posterior inclusion probability of each variable from MCMC have converged well enough to the theoretical posterior inclusion probability.

diagnostics(bas_mod, type = "model", col = "blue", pch = 16, cex = 1.5)

We can see that all of the points lie on the 45 degree line, validating the covergence of the model given the filtration of all current data.

plot(bas_mod, which = 1, add.smooth = F, 
     ask = F, pch = 16, sub.caption="", caption="")
abline(a = 0, b = 0, col = "darkgrey", lwd = 2)

For most of the fitted values, the residuals are centered around zero, and the model appears to show a decrease in residual variance at is higher predicted values. However, for fitted values that are less than 40, the model tends to underestimate the audience scores of the movies with high variance (strong variability) and this bias may deteriorate the predicted value. The figure also labels three possible outliers in this region.

fit <- data.frame(resid = movies$audience_score - predict(bas_mod,estimator = "BMA")$fit) 
imdb_check <- ggplot(fit, aes(x=movies$imdb_rating, y = resid))+ 
    geom_point() + 
    geom_smooth(color="red", lwd = 0.5, se = TRUE, span = 0.5)+
    geom_vline(xintercept = 4.5, color = "blue")+
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(title = "Residuals vs IMDB Rating Plot",x="IMDB Rating",y = "Residuals") +
    mytheme

critic_check <- ggplot(fit, aes(x=movies$critics_score, y = resid))+ 
    geom_point() + 
    geom_smooth(color="red", lwd = 0.5, se = TRUE, span = 0.5)+
    geom_vline(xintercept = 30, color = "blue")+
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(title = "Residuals vs Critics Score Plot",x="Critics Score",y = "Residuals") +
    mytheme

grid.arrange(imdb_check, critic_check,ncol = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Number of movies having imdb_rating < 4.5
movies %>% filter(imdb_rating < 4.5) %>%
  select(imdb_rating) %>%
  summarise(n = n())
## # A tibble: 1 x 1
##       n
##   <int>
## 1    32

We check the fitted values with imdb_rating factor. As can be seen from the left-hand side plot, movies with a imdb_rating of less than about 4.5 points (marked on the plot with a blue vertical line) may suffer from the issue of underestimation of audience_score. The possible explanation is that most movies do not have imdb_rating less than this threshold (32 over 619 or \(5\%\) of the data set). This can cause a bias in our modeling.

p1 <- ggplot(data=movies, aes(imdb_rating)) +  geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
  stat_function(fun = dnorm, args = list(mean = mean(movies$imdb_rating), sd = sd(movies$imdb_rating)), aes(colour = 'Normal')) +   geom_histogram(aes(y = ..density..), alpha = 0.4) +                        
  scale_colour_manual(name = 'Density', values = c('red', 'blue')) + 
  theme(legend.position = c(0.2, 0.85))

p2 <- ggplot(data=movies, aes(critics_score)) +  geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
  stat_function(fun = dnorm, args = list(mean = mean(movies$critics_score), sd = sd(movies$critics_score)), aes(colour = 'Normal')) +   geom_histogram(aes(y = ..density..), alpha = 0.4) +                        
  scale_colour_manual(name = 'Density', values = c('red', 'blue')) + 
  theme(legend.position = c(0.2, 0.85))

grid.arrange(p1, p2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

here we check the distribution of two most essential predictors in the model. While imdb_rating exihibits left-skewed, the critics-score distribution are multi-modal.

# Summary of the model's performance on training phase
set.seed(28)
n = nrow(movies)
n_cv = 80
ape = matrix(NA, ncol=2, nrow=n_cv) # initialize output matrix
colnames(ape) = c("BMA", "Mean_RMSE")
split_ratio <- 0.9
pb <- txtProgressBar(min = 0, max = n_cv, style = 3) # intialize progress bar
## 
  |                                                                            
  |                                                                      |   0%
for (i in 1:n_cv) {
  setTxtProgressBar(pb, i) # Update progress bar
  idx_train = sample(1:n, size=round(split_ratio*n), replace=FALSE) # Determine data split
  train_set = movies[idx_train,-17]
  test_set = movies[-idx_train,-17]
  mod_train = bas.lm(audience_score ~ ., data=train_set, 
                          prior="BIC", modelprior=uniform(), initprobs="eplogp") # Fit model to just training data
  yhat_train = predict(mod_train, test_set, estimator="BMA")$fit # predict test data from train model
  ape[i, "BMA"] = cv.summary.bas(yhat_train, test_set$audience_score)
  ape[i, "Mean_RMSE"] = NA
  if (i > 1) {
      ape[i, "Mean_RMSE"] = mean(ape[, "BMA"], na.rm = TRUE)
    }
  
}
## 
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%
ape_df <- tibble(batch = 1:80, RMSE = ape[,1], rolling_RMSE = ape[,2])
summary_RMSE <- ape_df %>%
  as_tibble() %>%
  summarize(n = n(), Mean = mean(RMSE, na.rm = TRUE), 
        Std = sd(RMSE, na.rm = TRUE), Median = median(rolling_RMSE, na.rm = TRUE))

# plot th data
ggplot(ape_df, aes(x=batch, y = rolling_RMSE)) +
  geom_line() +
  geom_hline(yintercept = summary_RMSE[[1,4]], color = "#F8766D", lty = 2) +
  labs(title = "Cross Validation RMSE Point Estimates by Iteration", x="Iteration",y = "Mean RMSE Estimates") + 
  mytheme
## Warning: Removed 1 row(s) containing missing values (geom_path).

This figure supports the convergence of the accuracy metrics (here we used Root Mean Squared Error) after a fixed number of iterations of random sampling. At some beginning iterations, the RMSE tends to behave like a random walk, implying that the model does not actually learn the underlying patterns. However, the convergence of RMSE indicates the ability of learn of our model eventually.

# Quantile-Quantile Plot of Residuals
test <- predict(bas_mod,estimator = "BMA")
resid <- movies$audience_score - test$fit
mu_resid <- mean(resid)
sd_resid <- sd(resid)
std_resid <- (resid-mu_resid)/sd_resid
par(mfrow=c(1,2))
qqnorm(std_resid, lty = 2)
qqline(std_resid)
plot(density(std_resid), main="Probability Density of Std. Residuals", 
    xlab="Std. Residuals", ylab="P(Std. Residuals)")

The quantile-quantile and density plots of the standardized residuals represents the quality of the fitted values via residuals (or standardized residuals). For the majority of fitted values, the standardized residuals are alike normally distributed around zero. In evaluating the of the model’s residuals, there is a fairly right-skewed pattern to the residuals which suggests the model tends to underestimate the audience_score rather than overestimate it. However, this is a marginally trivial stray away from normality compared to the overall normal distribution of the residuals and does not contradict with the model’s assumption.

set.seed(28)
n = nrow(movies)
outliers_mat = cbind(movies, diag(1, nrow=n))
outliers_movies = bas.lm(audience_score ~ ., data=outliers_mat, 
                        prior="ZS-null", a=n,
                        modelprior=tr.beta.binomial(a=1, b=1, trunc=n/2),
                        method="MCMC",
                        initprobs="marg-eplogp",
                        MCMC.iterations=2*10^6)

outliers_movies$namesx[outliers_movies$probne0 > .5][c(3,4,5)] %>% 
  kbl(caption = "Outliers", escape = TRUE) %>%
      kable_classic(full_width = FALSE, html_font = "Cambria")
Outliers
x
122
206
239

From the outlier check, it appears rows 122, 206, and 239 are the significant outliers for the model.


Part 5: Prediction

Given the model settings, we will opt for randomly one new movie at the beginning of 2017 (01-01-2017) to test the ability of predicitng audience score. The chosen movie is the Oscar winner Get Out by Jordan Peele. Features of new data will be updated by collecting information from imdb.

get_out <- tibble(audience_score = 85, feature_film = "yes", drama = "yes", runtime=104, mpaa_rating_R = "yes",thtr_rel_year = 2017, oscar_season = "no", summer_season = "yes", imdb_rating = 8.0, imdb_num_votes = 337591,critics_score=94, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "yes")


prediction_bma <- predict(bas_mod, get_out, estimator = "BMA", se.fit=TRUE)
prediction_hpm <- predict(bas_mod, get_out, estimator = "HPM", se.fit=TRUE)
prediction_mpm <- predict(bas_mod, get_out, estimator = "MPM", se.fit=TRUE)
ci_score<- list(confint(prediction_bma, parm="pred"), confint(prediction_hpm, parm="pred"), confint(prediction_mpm, parm="pred"))
result_df <- tibble(type = c("BMA", "HPM", "MPM"), lwr = c(ci_score[[1]][1],ci_score[[2]][1], ci_score[[3]][1]), pred = c(ci_score[[1]][3],ci_score[[2]][3],ci_score[[3]][3]), upr = c(ci_score[[1]][2], ci_score[[2]][2], ci_score[[3]][2])) 

resid <- result_df$pred - get_out$audience_score

acc_df <- tibble(Acc.Metrics = c("BMA", "HPM","MPM"), rmse = c(RMSE(resid[1]), RMSE(resid[2]), RMSE(resid[3])),
       mae = c(MAE(resid[1]), MAE(resid[2]), MAE(resid[3])),
       mape = c(MAE(resid[1], 85), MAE(resid[2], 85), MAE(resid[3], 85)))

print(acc_df)
## # A tibble: 3 x 4
##   Acc.Metrics  rmse   mae  mape
##   <chr>       <dbl> <dbl> <dbl>
## 1 BMA          2.46  2.46  2.46
## 2 HPM          2.71  2.71  2.71
## 3 MPM          2.71  2.71  2.71
print(result_df)
## # A tibble: 3 x 4
##   type    lwr  pred   upr
##   <chr> <dbl> <dbl> <dbl>
## 1 BMA    67.9  87.5  108.
## 2 HPM    67.7  87.7  108.
## 3 MPM    67.7  87.7  108.

The predicted audience_score for Get Out was approximately 87.6 for BMA predicted value, 87.12 for Highest Posterior Model and 87.71 for Median Posterior Model. Three 95% credible intervals contain the actual audience_score value, 85, which implies that the model are capable of capturing uncertainty of the forecast. Three accuracy metrics, including scale-dependent metrics and percentage metric are quite small, uphold the ability of predicting the audience score of the proposed model.


Part 6: Conclusion

var_bma <- variable.names(prediction_hpm)
var_hpm <- variable.names(prediction_bma)
var_mpm <- variable.names(prediction_mpm)
Reduce(intersect, list(var_bma,var_hpm,var_mpm))
## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

From three models, the two most important features for predicting score of a given movie are imdb_rating and critics_score. However, in diagnosing the outliers in the model’s predictions, it appears that the model fails to predict the audience_score when the imdb_rating (after we transform it to match the scale of the audience score) is not approximately equal to the audience_score, indicating a contradiction in ratings between users of IMDB and Rotten Tomatoes. However, there are only a few movies with this issue in the data set. The credible interval is still large which signifies the forecast variance in the underlying model.

It is straightforward that more information is necessary to the enhancement of the performance of the model in forecasting a movie’s popularity. Larger sample sizes may help, but even better changes would include advanced sampling designs that control for sources of confounding in the data and make the sample more representative of the typical movies being released. In addition, more specific and distinguishable features of movies should be incorporated to the data set. Features that detail the general theme of the movie’s story, or advertising campaign/budget for the film are possible features that may affect popularity. Eventually, a study on the cases where IMDB and Rotten Tomato scores given by website users disagreed by a significant degree should be the next objective in a near future to address the failure on some movies’ score of the proposed model.