Intro

This is a project for the Bayesian Statistics course, which is part of Coursera’s Statistics with R Specialization.

The project aims to find out what attributes make a movie popular. To achieve this, an exploratory data analysis (EDA) has been completed, and a Bayesian regression model predicting the audience score of a movie has been developed.



Setup

Load packages & data

library(ggplot2); library(dplyr); library(statsr)
library(BAS); library(tidyr); library(corrplot)

# create the data directory if needed, then download the data only once
if(!dir.exists("./data/1209_SR-LR-w4_Movies")) {
  dir.create("./data/1209_SR-LR-w4_Movies", recursive = TRUE)
}
if(!file.exists("./data/1209_SR-LR-w4_Movies/movies.RData")) {
  download.file("https://d3c33hcgiwev3.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1609459200&Signature=IQ4DIjVl-9DlmYVsHOIVdkNqCtFCLKrCV5mGIq15iVdqGl280YSagXpxBNSsn7iIKQYbaDRtynFeSet~09G22DA5KXvM7aa3U661bXIIu0JO3QW6ZMvzcW6sx1ZbpuvcAuy7wZ~iS78-rNUQnGGxOsbWXYnyyUF5UWSuJrdbJJ8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
                destfile = "./data/1209_SR-LR-w4_Movies/movies.RData",
                method = "curl")
}
load("./data/1209_SR-LR-w4_Movies/movies.RData")

Part 1: Data

The data set includes \(651\) observations on \(32\) variables, combining information from Rotten Tomatoes and IMDb for a random sample of movies. The movies were released from 1970 to 2014 and come from US-based studios, since the MPAA ratings in the data apply only to American films.


  • Generalization. The data were collected through a retrospective study using random sampling to select a representative sample of US movies, so the findings can be generalized to US movies, with some reservations:
    • all findings refer only to the respective years, i.e. to US movies released from 1970 to 2014;
    • there could be some biases relative to everyone who has seen a movie, due to
      • ratings coming from only two sources (Rotten Tomatoes/ IMDb);
      • the requirement to register on these sites before rating;
      • the time required to rate a movie
  • Causation/ Correlation. The data come from available information and not from a random-assignment experiment. So it is an observational study, i.e. no causality can be established, only correlation/ association between the variables examined

Part 2: Data manipulation

Start by examining the data set variables; their descriptions can be found in the Codebook.

Get rid of unnecessary variables, and create new ones for analysis:

  • feature_film (yes if title_type is Feature Film, no otherwise)
  • drama (yes if genre is Drama, no otherwise)
  • mpaa_rating_R (yes if mpaa_rating is R, no otherwise)
  • oscar_season (based on thtr_rel_month, yes if the movie is released in Oct/ Nov/ Dec, no otherwise)
  • summer_season (based on thtr_rel_month, yes if the movie is released in May/ Jun/ Jul/ Aug, no otherwise)
movies <- movies %>% 
          transmute(
            audience_score,
            feature_film=as.factor(
              ifelse(title_type == 'Feature Film', 'yes', 'no')),
            drama=as.factor(ifelse(genre == 'Drama', 'yes', 'no')),
            runtime,
            mpaa_rating_R=as.factor(ifelse(mpaa_rating == 'R', 'yes', 'no')),
            thtr_rel_year,
            oscar_season=as.factor(
              ifelse(thtr_rel_month %in% c(10:12), 'yes', 'no')),
            summer_season=as.factor(
              ifelse(thtr_rel_month %in% c(5:8), 'yes', 'no')),
            imdb_rating, imdb_num_votes, critics_score, best_pic_nom,
            best_pic_win, best_actor_win, best_actress_win, best_dir_win,
            top200_box)
NArows <- sum(!complete.cases(movies))

Now there is only 1 row (\(0.15\%\)) with NA values, so it can be removed:

movies<- na.omit(movies)

Part 3: Exploratory data analysis

Examine the structure of the resulting data set:

str(movies)
tibble [650 × 17] (S3: tbl_df/tbl/data.frame)
 $ audience_score  : num [1:650] 73 81 91 76 27 86 76 47 89 66 ...
 $ feature_film    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
 $ drama           : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
 $ runtime         : num [1:650] 80 101 84 139 90 78 142 93 88 119 ...
 $ mpaa_rating_R   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
 $ thtr_rel_year   : num [1:650] 2013 2001 1996 1993 2004 ...
 $ oscar_season    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ summer_season   : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ imdb_rating     : num [1:650] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
 $ imdb_num_votes  : int [1:650] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ critics_score   : num [1:650] 45 96 91 80 33 91 57 17 90 83 ...
 $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "na.action")= 'omit' Named int 334
  ..- attr(*, "names")= chr "334"

There are 11 categorical variables (all with two levels, yes/ no), 5 numeric variables, and one integer variable (imdb_num_votes).

Histogram

Take a closer look at audience_score summary and histogram:

summ<-summary(movies$audience_score)
summ
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11.00   46.00   65.00   62.35   80.00   97.00 

The mean of audience_score is about 62.35, the median is 65, and the IQR is 34.
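As a quick check, the IQR reported above can also be computed directly with base R:

# interquartile range of audience_score: Q3 - Q1 = 80 - 46 = 34
IQR(movies$audience_score)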

ggplot(movies, aes(x = audience_score, y = ..density..)) +
  geom_histogram(fill = "cornflowerblue", colour = "darkgrey") + 
  geom_density(size = 1, colour = "brown") 

  • the distribution of audience_score is slightly left-skewed

Correlogram

Understand the relationship between audience_score and other numeric/ integer variables:

res <- cor.mtest(movies[,c(1,4,6,9:11)], conf.level = .95)
corrmat<- corrplot.mixed(cor(movies[,c(1,4,6,9:11)]), diag = NULL, order="FPC",
                         p.mat = res$p, number.font = 4, number.cex=.8,
                         tl.col = "purple3", tl.cex=.7, mar = c(0, 0, 1, 0))

The correlation matrix shows:

  • strong positive correlations between audience_score and both imdb_rating and critics_score
  • a high positive correlation between the explanatory variables themselves, imdb_rating and critics_score, which signals potential collinearity
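The specific pairwise correlation behind the collinearity concern can be pulled out directly:

# correlation between the two rating-based explanatory variables
cor(movies$imdb_rating, movies$critics_score)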

Boxplots

Compare variability in audience_score for the main categorical variables:

data <- movies %>%
  gather('category','yesno', feature_film, drama, mpaa_rating_R,
         oscar_season,summer_season)
ggplot(data=data,aes(x=category, y= audience_score,fill=yesno))+
  geom_boxplot(color="darkgrey", outlier.colour="cornflowerblue") + 
  scale_fill_manual(name="", values =
                      c("firebrick3","springgreen3")) +
  theme(legend.title = element_blank())

The boxplots show:

  • drama movies have a higher median audience_score than other genres
  • audience_score for feature films is noticeably lower than for non-feature titles
  • audience_score for movies released in Oct/ Nov/ Dec is on average higher than for other months
  • for mpaa_rating_R and summer_season, there is almost no difference whether a movie belongs to the respective category or not

Part 4: Modelling

The goal is to predict the popularity of a movie quantified by audience_score using a linear regression model and Bayesian model averaging.

Model selection

  • Bayesian regression model
    • Zellner-Siow Cauchy prior (ZS-null)
    • equal prior probabilities for all models (uniform() model prior)
    • MCMC sampling over the model space (with \(2^{16}\) candidate models, full enumeration would also be feasible; see the cross-check sketch after the model fit)
model<- bas.lm(data=movies, audience_score ~ ., method='MCMC',
                 prior="ZS-null", modelprior=uniform())
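Since the model space here contains only \(2^{16}\) models, a deterministic enumeration run could be used to cross-check the MCMC results. A minimal sketch (model_enum is a hypothetical object, not used elsewhere in this analysis):

# enumerate the full model space instead of sampling it, for comparison
model_enum <- bas.lm(data=movies, audience_score ~ ., method='BAS',
                     prior="ZS-null", modelprior=uniform())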

Model diagnostics

REGRESSION FITTING:

par(mfrow=c(2,2))
plot(model, ask=FALSE, cex.lab=0.6)

  • Residuals vs Fitted: a wave pattern; the model tends to underestimate scores below 30 and overestimate scores between 30 and 75
    • predictors not included in the model could be affecting audience_score
  • Model (cumulative) Probabilities: the cumulative model posterior probability levels off at 1 after about \(5000\)-\(6000\) sampled model combinations
    • beyond this point, additional models add almost no probability (there is no need to continue the evaluation over all \(2^{16}\) combinations)
  • Model Complexity: models of dimension about 3-4 have the highest Bayes factors
  • Inclusion Probability: imdb_rating and critics_score are the most important predictors (listed numerically below)
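The marginal posterior inclusion probabilities behind the last panel can also be listed numerically; a small sketch, assuming the fitted bas object exposes the probne0 and namesx fields as in current versions of BAS:

# posterior inclusion probabilities, matched to variable names, highest first
sort(setNames(round(model$probne0, 4), model$namesx), decreasing = TRUE)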

MODELS RANKS:

image(model, rotate = F, cex.axis=0.9)

  • imdb_rating and critics_score are present in almost all of the top 20 highest-probability models

CONVERGENCE

diagnostics(model, type="model",  col = "purple", pch=16)

  • the posterior model probabilities estimated from MCMC frequencies line up with the renormalized posterior probabilities along the diagonal, indicating convergence
    • had this not been the case, more MCMC iterations would have been necessary
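The same convergence check can be run on the posterior inclusion probabilities rather than the model probabilities:

# compare MCMC-based and renormalized posterior inclusion probabilities
diagnostics(model, type="pip", col = "purple", pch=16)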

Model coefficients interpretation

COEFFICIENT VALUES

coefficients(model)

 Marginal Posterior Summaries of Coefficients: 

 Using  BMA 

 Based on the top  6318 models 
                     post mean      post SD        post p(B != 0)
Intercept            62.3476923077   0.3946336397   1.0000000000 
feature_filmyes      -0.1056235443   0.5674997344   0.0663467407 
dramayes              0.0185910799   0.2052516528   0.0474327087 
runtime              -0.0254840615   0.0311544044   0.4662933350 
mpaa_rating_Ryes     -0.3060973255   0.7049385287   0.2018829346 
thtr_rel_year        -0.0047953020   0.0186969548   0.0952659607 
oscar_seasonyes      -0.0821517208   0.3812201276   0.0775772095 
summer_seasonyes      0.0880613194   0.3850596057   0.0817718506 
imdb_rating          14.9667087348   0.7400006234   0.9999870300 
imdb_num_votes        0.0000002254   0.0000013659   0.0615058899 
critics_score         0.0621940361   0.0307559771   0.8796325684 
best_pic_nomyes       0.5375804370   1.6132832773   0.1380172729 
best_pic_winyes      -0.0122632925   0.8816404902   0.0422111511 
best_actor_winyes    -0.2849496573   0.8267121907   0.1440933228 
best_actress_winyes  -0.3167611464   0.9150215809   0.1453155518 
best_dir_winyes      -0.1231790787   0.6315837028   0.0699012756 
top200_boxyes         0.0906963186   0.7215089841   0.0499572754 
  • imdb_rating and critics_score have the highest posterior probabilities that their coefficients are non-zero: about \(100\%\) and \(88\%\) respectively
  • a one-point increase in imdb_rating is associated with an increase of almost 15 points in audience_score
    • a one-point increase in critics_score is associated with an increase of only 0.06 points, likely because of the high correlation between this variable and imdb_rating
  • besides the two above, 5 more variables have a positive effect on audience_score
  • 9 variables have a negative effect on audience_score
  • for categorical variables, the coefficients show the change relative to the no level
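The credible intervals visualized in the next subsection can also be obtained numerically via the confint method that BAS provides for coefficient objects:

# 95% credible intervals for the BMA posterior coefficient distributions
confint(coefficients(model))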

CREDIBLE INTERVALS

par(mfrow=c(1,3))
plot(coefficients(model), subset=c(1, 9, 11), ask=FALSE, col="cornflowerblue")

  • the posterior distribution of the imdb_rating coefficient does not look entirely normal; there appears to be a small asymmetric bump on its right side
  • the regression intercept (\(62.35\)) matches the sample mean of audience_score
    • this constant shift may contribute to the wave pattern in the residuals plot above

Part 5: Prediction

Now test the predictive capability of Bayesian model averaging on a movie not included in the original data set: Mad Max: Fury Road, named the best film of 2015 by the International Federation of Film Critics. According to IMDb and Rotten Tomatoes, the movie was released in May 2015, runs two hours, and earned an imdb_rating of \(8.1\), a critics_score of \(97\%\), and an audience_score of \(85\%\). Compare the latter to the predicted value:

# audience_score is set to a placeholder value; it is the quantity being predicted
madmax <- tibble(runtime=120,
                         thtr_rel_year=2015,
                         imdb_rating=8.1,
                         imdb_num_votes=870737,
                         critics_score=97,
                         audience_score=0,
                         best_pic_nom=factor("yes", levels=c("no", "yes")),
                         best_pic_win=factor("no", levels=c("no", "yes")),
                         best_actor_win=factor("no", levels=c("no", "yes")),
                         best_actress_win=factor("no", levels=c("no", "yes")),
                         best_dir_win=factor("no", levels=c("no", "yes")),
                         top200_box=factor("no", levels=c("no", "yes")),
                         feature_film=factor("yes", levels=c("no", "yes")),
                         drama=factor("no", levels=c("no", "yes")),
                         mpaa_rating_R=factor("yes", levels=c("no", "yes")),
                         oscar_season=factor("no", levels=c("no", "yes")),
                         summer_season=factor("yes", levels=c("no", "yes")))
score_madmax <- predict(model, newdata=madmax, estimator="BMA", se.fit=TRUE)
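The table below summarizes the result; the predicted score and its 95% prediction interval can be extracted from the prediction object (Ybma holds the BMA fit when estimator="BMA"):

# BMA point prediction for the new movie
score_madmax$Ybma
# 95% prediction interval (available because se.fit=TRUE above)
confint(score_madmax, parm="pred")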
Movie Title          Release Year   Runtime, min   IMDb Rating   Critics Score   Predicted Audience Score   True Audience Score
Mad Max: Fury Road   2015           120            8.1           97%             89.12%                     85%

NOTE: although the theatrical release of the selected film was in 2015, its streaming release came in 2016, and the film is not included in the original data set


Part 6: Conclusion

  • The linear regression model using Bayesian model averaging has shown some capability of predicting movie popularity from certain characteristics
    • nevertheless, the model may carry some bias, as the predicted value overestimates the true audience_score (\(89\%\) vs. \(85\%\))
  • The most important explanatory variables for predicting audience_score were found to be critics_score and imdb_rating
    • at the same time, the original data set has some limits, i.e. it contains only movies that have been rated on the Rotten Tomatoes and IMDb websites
  • Further analyses could be conducted to answer the following questions:
    • Why doesn’t the distribution of the imdb_rating coefficient show a normal curve?
      • Could it be related to the non-normal distribution of audience_score values?
    • What is the nature of the wave pattern in the residuals plot?
    • What other variables could affect audience_score?
      • Could using more levels of the genre variable lead to better results?