Intro

This is a project for the Bayesian Statistics course, which is part of Coursera’s Statistics with R Specialization.

The project aims to find out what attributes make a movie popular. To achieve this, an exploratory data analysis (EDA) has been completed, and a Bayesian regression model predicting the audience score of a movie has been developed.



Setup

Load packages & data

library(ggplot2); library(dplyr); library(statsr)
library(BAS); library(tidyr); library(corrplot)

# create the data directory if needed, then download the data only once
if(!dir.exists("./data/1209_SR-LR-w4_Movies")) {
  dir.create("./data/1209_SR-LR-w4_Movies", recursive = TRUE)
}
if(!file.exists("./data/1209_SR-LR-w4_Movies/movies.RData")) {
  download.file("https://d3c33hcgiwev3.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1609459200&Signature=IQ4DIjVl-9DlmYVsHOIVdkNqCtFCLKrCV5mGIq15iVdqGl280YSagXpxBNSsn7iIKQYbaDRtynFeSet~09G22DA5KXvM7aa3U661bXIIu0JO3QW6ZMvzcW6sx1ZbpuvcAuy7wZ~iS78-rNUQnGGxOsbWXYnyyUF5UWSuJrdbJJ8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
                destfile = "./data/1209_SR-LR-w4_Movies/movies.RData",
                method = "curl")
}
load("./data/1209_SR-LR-w4_Movies/movies.RData")

Part 1: Data

The data set includes \(651\) observations on \(32\) variables, combining information from Rotten Tomatoes and IMDb for a random sample of movies. The movies were released from 1970 to 2014 and come from US-based studios, since the MPAA ratings in the data apply only to American films.


  • Generalization. The data were collected through a retrospective study using random sampling to select a representative sample of US movies, so the findings can be generalized to US movies, with some reservations:
    • all findings refer only to the respective years, i.e. to US movies released from 1970 to 2014;
    • there could be some biases relative to everyone who has seen a movie, due to
      • ratings coming from only two sources (Rotten Tomatoes/ IMDb);
      • the requirement to register on these sites before rating;
      • the time required to rate a movie
  • Causation/ Correlation. The data come from available information and not from a random-assignment experiment. So it is an observational study, i.e. no causality can be established, only correlation/ association between the variables examined

Part 2: Data manipulation

Start by examining the data set variables; their descriptions can be found in the Codebook.

Get rid of unnecessary variables, and create new ones for analysis:

  • feature_film (yes if title_type is Feature Film, no otherwise)
  • drama (yes if genre is Drama, no otherwise)
  • mpaa_rating_R (yes if mpaa_rating is R, no otherwise)
  • oscar_season (based on thtr_rel_month, yes if the movie is released in Oct/ Nov/ Dec, no otherwise)
  • summer_season (based on thtr_rel_month, yes if the movie is released in May/ Jun/ Jul/ Aug, no otherwise)
movies <- movies %>% 
          transmute(
            audience_score,
            feature_film=as.factor(
              ifelse(title_type == 'Feature Film', 'yes', 'no')),
            drama=as.factor(ifelse(genre == 'Drama', 'yes', 'no')),
            runtime,
            mpaa_rating_R=as.factor(ifelse(mpaa_rating == 'R', 'yes', 'no')),
            thtr_rel_year,
            oscar_season=as.factor(
              ifelse(thtr_rel_month %in% c(10:12), 'yes', 'no')),
            summer_season=as.factor(
              ifelse(thtr_rel_month %in% c(5:8), 'yes', 'no')),
            imdb_rating, imdb_num_votes, critics_score, best_pic_nom,
            best_pic_win, best_actor_win, best_actress_win, best_dir_win,
            top200_box)
NArows <- sum(!complete.cases(movies))

Now there is only 1 row (\(0.15\%\)) with NA values, so it can be removed:

movies<- na.omit(movies)

Part 3: Exploratory data analysis

Examine the structure of the resulting data set:

str(movies)
tibble [650 × 17] (S3: tbl_df/tbl/data.frame)
 $ audience_score  : num [1:650] 73 81 91 76 27 86 76 47 89 66 ...
 $ feature_film    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
 $ drama           : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
 $ runtime         : num [1:650] 80 101 84 139 90 78 142 93 88 119 ...
 $ mpaa_rating_R   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
 $ thtr_rel_year   : num [1:650] 2013 2001 1996 1993 2004 ...
 $ oscar_season    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ summer_season   : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ imdb_rating     : num [1:650] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
 $ imdb_num_votes  : int [1:650] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ critics_score   : num [1:650] 45 96 91 80 33 91 57 17 90 83 ...
 $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "na.action")= 'omit' Named int 334
  ..- attr(*, "names")= chr "334"

There are 11 categorical variables (all with two levels, yes/ no), 5 numeric variables, and one integer variable (imdb_num_votes).

Histogram

Take a closer look at audience_score summary and histogram:

summ<-summary(movies$audience_score)
summ
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11.00   46.00   65.00   62.35   80.00   97.00 

The mean of audience_score is about 62.35, the median is 65, and the IQR is 34.
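As a quick check, the IQR reported above can also be computed directly with base R:

# interquartile range of audience_score: Q3 - Q1 = 80 - 46 = 34
IQR(movies$audience_score)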

ggplot(movies, aes(x = audience_score, y = ..density..)) +
  geom_histogram(fill = "cornflowerblue", colour = "darkgrey") + 
  geom_density(size = 1, colour = "brown") 

  • the distribution of audience_score is slightly left-skewed

Correlogram

Understand the relationship between audience_score and other numeric/ integer variables:

res <- cor.mtest(movies[,c(1,4,6,9:11)], conf.level = .95)
corrmat<- corrplot.mixed(cor(movies[,c(1,4,6,9:11)]), diag = NULL, order="FPC",
                         p.mat = res$p, number.font = 4, number.cex=.8,
                         tl.col = "purple3", tl.cex=.7, mar = c(0, 0, 1, 0))

The correlation matrix shows:

  • strong positive correlations between audience_score and both imdb_rating and critics_score
  • a high positive correlation between the explanatory variables themselves, imdb_rating and critics_score, which signals potential collinearity
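The specific pairwise correlation behind the collinearity concern can be pulled out directly:

# correlation between the two rating-based explanatory variables
cor(movies$imdb_rating, movies$critics_score)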

Boxplots

Compare variability in audience_score for the main categorical variables:

data <- movies %>%
  gather('category','yesno', feature_film, drama, mpaa_rating_R,
         oscar_season,summer_season)
ggplot(data=data,aes(x=category, y= audience_score,fill=yesno))+
  geom_boxplot(color="darkgrey", outlier.colour="cornflowerblue") + 
  scale_fill_manual(name="", values =
                      c("firebrick3","springgreen3")) +
  theme(legend.title = element_blank())

The boxplots show:

  • drama movies have a higher median audience_score than other genres
  • audience_score for feature films is noticeably lower than for non-feature titles
  • audience_score for movies released in Oct/ Nov/ Dec is on average higher than for other months
  • for mpaa_rating_R and summer_season, there is almost no difference whether a movie belongs to the respective category or not

Part 4: Modelling

The goal is to predict the popularity of a movie quantified by audience_score using a linear regression model and Bayesian model averaging.

Model selection

  • Bayesian regression model
    • Zellner-Siow Cauchy prior (ZS-null)
    • equal prior probabilities for all models (uniform() model prior)
    • MCMC sampling over the model space (with \(2^{16}\) candidate models, full enumeration would also be feasible; see the cross-check sketch after the model fit)
model<- bas.lm(data=movies, audience_score ~ ., method='MCMC',
                 prior="ZS-null", modelprior=uniform())
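Since the model space here contains only \(2^{16}\) models, a deterministic enumeration run could be used to cross-check the MCMC results. A minimal sketch (model_enum is a hypothetical object, not used elsewhere in this analysis):

# enumerate the full model space instead of sampling it, for comparison
model_enum <- bas.lm(data=movies, audience_score ~ ., method='BAS',
                     prior="ZS-null", modelprior=uniform())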

Model diagnostics

REGRESSION FITTING:

par(mfrow=c(2,2))
plot(model, ask=FALSE, cex.lab=0.6)

  • Residuals vs Fitted: a wave pattern; the model tends to underestimate scores below 30 and overestimate scores between 30 and 75
    • predictors not included in the model could be affecting audience_score
  • Model (cumulative) Probabilities: the cumulative model posterior probability levels off at 1 after about \(5000\)-\(6000\) sampled model combinations
    • beyond this point, additional models add almost no probability (there is no need to continue the evaluation over all \(2^{16}\) combinations)
  • Model Complexity: models of dimension about 3-4 have the highest Bayes factors
  • Inclusion Probability: imdb_rating and critics_score are the most important predictors (listed numerically below)
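The marginal posterior inclusion probabilities behind the last panel can also be listed numerically; a small sketch, assuming the fitted bas object exposes the probne0 and namesx fields as in current versions of BAS:

# posterior inclusion probabilities, matched to variable names, highest first
sort(setNames(round(model$probne0, 4), model$namesx), decreasing = TRUE)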

MODELS RANKS:

image(model, rotate = F, cex.axis=0.9)

  • imdb_rating and critics_score are present in almost all of the top 20 highest-probability models

CONVERGENCE

diagnostics(model, type="model",  col = "purple", pch=16)

  • the posterior model probabilities estimated from MCMC frequencies line up with the renormalized posterior probabilities along the diagonal, indicating convergence
    • had this not been the case, more MCMC iterations would have been necessary
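The same convergence check can be run on the posterior inclusion probabilities rather than the model probabilities:

# compare MCMC-based and renormalized posterior inclusion probabilities
diagnostics(model, type="pip", col = "purple", pch=16)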

Model coefficients interpretation

COEFFICIENT VALUES

coefficients(model)

 Marginal Posterior Summaries of Coefficients: 

 Using  BMA 

 Based on the top  6318 models 
                     post mean      post SD        post p(B != 0)
Intercept            62.3476923077   0.3946336397   1.0000000000 
feature_filmyes      -0.1056235443   0.5674997344   0.0663467407 
dramayes              0.0185910799   0.2052516528   0.0474327087 
runtime              -0.0254840615   0.0311544044   0.4662933350 
mpaa_rating_Ryes     -0.3060973255   0.7049385287   0.2018829346 
thtr_rel_year        -0.0047953020   0.0186969548   0.0952659607 
oscar_seasonyes      -0.0821517208   0.3812201276   0.0775772095 
summer_seasonyes      0.0880613194   0.3850596057   0.0817718506 
imdb_rating          14.9667087348   0.7400006234   0.9999870300 
imdb_num_votes        0.0000002254   0.0000013659   0.0615058899 
critics_score         0.0621940361   0.0307559771   0.8796325684 
best_pic_nomyes       0.5375804370   1.6132832773   0.1380172729 
best_pic_winyes      -0.0122632925   0.8816404902   0.0422111511 
best_actor_winyes    -0.2849496573   0.8267121907   0.1440933228 
best_actress_winyes  -0.3167611464   0.9150215809   0.1453155518 
best_dir_winyes      -0.1231790787   0.6315837028   0.0699012756 
top200_boxyes         0.0906963186   0.7215089841   0.0499572754 
  • imdb_rating and critics_score have the highest posterior probabilities that their coefficients are non-zero: about \(100\%\) and \(88\%\) respectively
  • a one-point increase in imdb_rating is associated with an increase of almost 15 points in audience_score
    • a one-point increase in critics_score is associated with an increase of only 0.06 points, likely because of the high correlation between this variable and imdb_rating
  • besides the two above, 5 more variables have a positive effect on audience_score
  • 9 variables have a negative effect on audience_score
  • for categorical variables, the coefficients show the change relative to the no level
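The credible intervals visualized in the next subsection can also be obtained numerically via the confint method that BAS provides for coefficient objects:

# 95% credible intervals for the BMA posterior coefficient distributions
confint(coefficients(model))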

CREDIBLE INTERVALS

par(mfrow=c(1,3))
plot(coefficients(model), subset=c(1, 9, 11), ask=FALSE, col="cornflowerblue")

  • the posterior distribution of the imdb_rating coefficient does not look entirely normal; there appears to be a small asymmetric bump on its right side
  • the regression intercept (\(62.35\)) matches the sample mean of audience_score
    • this constant shift may contribute to the wave pattern in the residuals plot above

Part 5: Prediction

Now test the predictive capability of Bayesian model averaging on a movie not included in the original data set: Mad Max: Fury Road, named the best film of 2015 by the International Federation of Film Critics. According to IMDb and Rotten Tomatoes, the movie was released in May 2015, runs two hours, and earned an imdb_rating of \(8.1\), a critics_score of \(97\%\), and an audience_score of \(85\%\). Compare the latter to the predicted value:

# audience_score is set to a placeholder value; it is the quantity being predicted
madmax <- tibble(runtime=120,
                         thtr_rel_year=2015,
                         imdb_rating=8.1,
                         imdb_num_votes=870737,
                         critics_score=97,
                         audience_score=0,
                         best_pic_nom=factor("yes", levels=c("no", "yes")),
                         best_pic_win=factor("no", levels=c("no", "yes")),
                         best_actor_win=factor("no", levels=c("no", "yes")),
                         best_actress_win=factor("no", levels=c("no", "yes")),
                         best_dir_win=factor("no", levels=c("no", "yes")),
                         top200_box=factor("no", levels=c("no", "yes")),
                         feature_film=factor("yes", levels=c("no", "yes")),
                         drama=factor("no", levels=c("no", "yes")),
                         mpaa_rating_R=factor("yes", levels=c("no", "yes")),
                         oscar_season=factor("no", levels=c("no", "yes")),
                         summer_season=factor("yes", levels=c("no", "yes")))
score_madmax <- predict(model, newdata=madmax, estimator="BMA", se.fit=TRUE)
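The table below summarizes the result; the predicted score and its 95% prediction interval can be extracted from the prediction object (Ybma holds the BMA fit when estimator="BMA"):

# BMA point prediction for the new movie
score_madmax$Ybma
# 95% prediction interval (available because se.fit=TRUE above)
confint(score_madmax, parm="pred")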
Movie Title          Release Year   Runtime, min   IMDb Rating   Critics Score   Predicted Audience Score   True Audience Score
Mad Max: Fury Road   2015           120            8.1           97%             89.12%                     85%

NOTE: although the theatrical release of the selected film was in 2015, its streaming release came in 2016, and the film is not included in the original data set


Part 6: Conclusion

  • The linear regression model using Bayesian model averaging has shown some capability of predicting movie popularity from certain characteristics
    • nevertheless, the model may carry some bias, as the predicted value overestimates the true audience_score (\(89\%\) vs. \(85\%\))
  • The most important explanatory variables for predicting audience_score were found to be critics_score and imdb_rating
    • at the same time, the original data set has some limits, i.e. it contains only movies that have been rated on the Rotten Tomatoes and IMDb websites
  • Further analyses could be conducted to answer the following questions:
    • Why doesn’t the distribution of the imdb_rating coefficient show a normal curve?
      • Could it be related to the non-normal distribution of audience_score values?
    • What is the nature of the wave pattern in the residuals plot?
    • What other variables could affect audience_score?
      • Could using more levels of the genre variable lead to better results?