Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

load("movies.Rdata")

Part 1: Data

The objective of this assignment is to determine an algorithm of predicting audience_score based on various characteristics about a movie. Hence 651 movies were randomly sampled from IMDB. This random sampling seems reasonable and thus a good data set for statistical inference. We can see more information about all the variables here:

https://d18ky98rnyall9.cloudfront.net/_73393031e98b997cf2445132f89606a1_movies_codebook.html?Expires=1477526400&Signature=M-pcsQUzBmpaRhxPTzOKOl3d81DeP7pDeaVQoaXwiu18gx~c-I7JaZMZZ~ot~LBBWGBvpvHZruW8oh4Ws5ZJB5u-A1E~13Xzox4516AMoIMA4KcvKLASt3Vzs0NXw9WIYB7UuXvmZjz-7AL8uX-WVkri3k5Ym6gv1HNeebmmHXU_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A

Despite the fact that these are random selections it is difficult to establish a causal relationship from regression and without running an experiment. We can hypothesize that certain variables include should have a causal relationship but further investigation should be done to confirm this is the case. For example, it is highly likely that audience_score is related or correlated with being a top 200 box office movie but it is unclear which way the causation runs. Is it in the top 200 because people enjoy the movie or do people see movies because they are in the top 200? Or perhaps they cycle on each other?

Part 2: Data manipulation

movies<-mutate(movies,feature_film=ifelse(title_type %in% c("Feature Film"),"yes","no"),
               drama=ifelse(genre %in% c("Drama"),"yes","no"),
               mpaa_rating_R=ifelse(mpaa_rating %in% c("R"),"yes","no"),
               oscar_season=ifelse(thtr_rel_month %in% c(10,11,12),"yes","no"),
               summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8),"yes","no")
               )%>%
        select(audience_score,feature_film,drama,runtime,mpaa_rating_R,thtr_rel_year,
               oscar_season,summer_season,imdb_rating,imdb_num_votes,critics_score,
               best_pic_nom,best_pic_win,best_actor_win,best_actress_win,best_dir_win,
               top200_box)
movies$feature_film<-as.factor(movies$feature_film)
movies$drama<-as.factor(movies$drama)
movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
movies$oscar_season<-as.factor(movies$oscar_season)
movies$summer_season<-as.factor(movies$summer_season)

Part 3: Exploratory data analysis

General EDA

summary(select(movies,audience_score,feature_film,drama,mpaa_rating_R,oscar_season,
               summer_season))

##  audience_score  feature_film drama     mpaa_rating_R oscar_season
##  Min.   :11.00   no : 60      no :346   no :322       no :460     
##  1st Qu.:46.00   yes:591      yes:305   yes:329       yes:191     
##  Median :65.00                                                    
##  Mean   :62.36                                                    
##  3rd Qu.:80.00                                                    
##  Max.   :97.00                                                    
##  summer_season
##  no :443      
##  yes:208      
##               
##               
##               
##

About half the movies are R rated and another half are drama with some overlap. A quarter are feature films whichis reasonable as many independent film studios release movies which don’t get much attention. Oscar Season and Summer Season each seem to have around 1/4 of the movies which makes since if movies are roughly evenly distributed throughout the year.

ggplot(data = movies,aes(audience_score)) +    
  geom_histogram(fill="gold3", binwidth = 5)+
  ggtitle("Histogram of Audience Score") +
  labs(x="Audience Score",y="Count")

Audience Score seems to be skewed to the left with very few movies getting below 30 in audience rating. Peak is around 80 to 90 audience score.

Feature Film EDA

ggplot(data = movies, aes(factor(feature_film), audience_score)) +
  geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
  ggtitle("Audience Score by Feature Film") +
  labs(x="Is Feature Film?",y="Audience Score") +
  theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
          axis.title=element_text(size=15))

data.frame(group_by(movies,feature_film)%>%summarise(Min=min(audience_score),
                                       Q1=quantile(audience_score,.25),
                                       Med=median(audience_score),
                                       Mean=mean(audience_score),
                                       Q3=quantile(audience_score,.75),
                                       Max=max(audience_score)))

##   feature_film Min   Q1  Med     Mean Q3 Max
## 1           no  19 76.5 85.5 81.05000 89  96
## 2          yes  11 44.5 62.0 60.46531 78  97

For Feature Films it seems that they are negatively correlated with Audience Score. However, there is much more variation in Feature Films than non-Feature Films. There are 2 outliers in the non-Feature Films that got very bad audience ratings. This distribution could arise from the fact that only certain kinds of people watch particular kinds of non-Feature Films so they are pretty likely to enjoy that movie. Whereas if a random person watched the movie they would probably not enjoy it.

Drama EDA

ggplot(data = movies, aes(factor(drama), audience_score)) +
  geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
  ggtitle("Audience Score by Drama") +
  labs(x="Is Drama?",y="Audience Score") +
  theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
          axis.title=element_text(size=15))

data.frame(group_by(movies,drama)%>%summarise(Min=min(audience_score),
                                       Q1=quantile(audience_score,.25),
                                       Med=median(audience_score),
                                       Mean=mean(audience_score),
                                       Q3=quantile(audience_score,.75),
                                       Max=max(audience_score)))

##   drama Min Q1 Med     Mean Q3 Max
## 1    no  11 41  61 59.73121 79  97
## 2   yes  13 52  70 65.34754 80  95

Audience Score seems slightly higher for Dramas while Not Drama has a bit more variation. Overall, scores don’t appear that different.

MPAA Rating R

ggplot(data = movies, aes(factor(mpaa_rating_R), audience_score)) +
  geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
  ggtitle("Audience Score by R rating") +
  labs(x="Is R rated?",y="Audience Score") +
  theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
          axis.title=element_text(size=15))

data.frame(group_by(movies,mpaa_rating_R)%>%summarise(Min=min(audience_score),
                                       Q1=quantile(audience_score,.25),
                                       Med=median(audience_score),
                                       Mean=mean(audience_score),
                                       Q3=quantile(audience_score,.75),
                                       Max=max(audience_score)))

##   mpaa_rating_R Min    Q1  Med     Mean Q3 Max
## 1            no  11 48.25 65.5 62.68944 80  96
## 2           yes  14 44.00 64.0 62.04255 79  97

Audience score looks roughly the same for R and not R rated movies. Can’t tell much difference.

Oscar Season

ggplot(data = movies, aes(factor(oscar_season), audience_score)) +
  geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
  ggtitle("Audience Score by Oscar Season") +
  labs(x="Is Oscar Season?",y="Audience Score") +
  theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
          axis.title=element_text(size=15))

data.frame(group_by(movies,oscar_season)%>%summarise(Min=min(audience_score),
                                       Q1=quantile(audience_score,.25),
                                       Med=median(audience_score),
                                       Mean=mean(audience_score),
                                       Q3=quantile(audience_score,.75),
                                       Max=max(audience_score)))

##   oscar_season Min   Q1 Med     Mean Q3 Max
## 1           no  11 46.0  64 61.81304 79  96
## 2          yes  13 47.5  69 63.68586 81  97

Again can’t tell much difference between Oscar Season movies and non-Oscar Season movies. Audience score is about the same.

Summer Season

ggplot(data = movies, aes(factor(summer_season), audience_score)) +
  geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
  ggtitle("Audience Score by Summer Season") +
  labs(x="Is Summer Season?",y="Audience Score") +
  theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
          axis.title=element_text(size=15))

data.frame(group_by(movies,summer_season)%>%summarise(Min=min(audience_score),
                                       Q1=quantile(audience_score,.25),
                                       Med=median(audience_score),
                                       Mean=mean(audience_score),
                                       Q3=quantile(audience_score,.75),
                                       Max=max(audience_score)))

##   summer_season Min    Q1 Med     Mean Q3 Max
## 1            no  13 46.00  66 62.62302 80  97
## 2           yes  11 44.75  65 61.80769 78  94

Not much difference in Audience Score for Summer and Non-Summer Season movies.

Part 4: Modeling

Next we conduct the modeling for audience score. Since we are trying to create a predictive model we’ll use AIC (least restrictive) for the priors for the Betas as we don’t care too much if non-informative variables are added in. We don’t have too many variables so we’ll use enurmeration as well.

bma_movies = bas.lm(audience_score ~ . -audience_score, data = movies,
                   prior = "AIC", modelprior = uniform())

## Warning in bas.lm(audience_score ~ . - audience_score, data = movies, prior
## = "AIC", : dropping 1 rows due to missing data

bma_movies

## 
## Call:
## bas.lm(formula = audience_score ~ . - audience_score, data = movies,     prior = "AIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##              1.0000               0.3812               0.3795  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##              0.8560               0.6831               0.5764  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##              0.3648               0.4160               1.0000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##              0.4506               0.9667               0.6933  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##              0.3033               0.5155               0.5941  
##     best_dir_winyes        top200_boxyes  
##              0.3622               0.3051

summary(bma_movies)

##      Intercept feature_filmyes dramayes runtime mpaa_rating_Ryes
## [1,]         1               0        0       1                1
## [2,]         1               0        0       1                1
## [3,]         1               0        0       1                1
## [4,]         1               0        0       1                1
## [5,]         1               0        0       1                1
##      thtr_rel_year oscar_seasonyes summer_seasonyes imdb_rating
## [1,]             1               0                0           1
## [2,]             0               0                0           1
## [3,]             0               0                0           1
## [4,]             1               0                0           1
## [5,]             0               0                1           1
##      imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,]              0             1               1               0
## [2,]              0             1               1               0
## [3,]              0             1               1               0
## [4,]              0             1               1               0
## [5,]              0             1               1               0
##      best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,]                 1                   1               0             0
## [2,]                 1                   1               0             0
## [3,]                 0                   1               0             0
## [4,]                 0                   1               0             0
## [5,]                 0                   1               0             0
##             BF PostProbs     R2 dim   logmarg
## [1,] 1.0000000    0.0017 0.7601   9 -3604.421
## [2,] 0.9783606    0.0017 0.7593   8 -3604.442
## [3,] 0.9298905    0.0016 0.7585   7 -3604.493
## [4,] 0.8903369    0.0015 0.7592   8 -3604.537
## [5,] 0.7914465    0.0013 0.7592   8 -3604.654

image(bma_movies, rotate = FALSE)

Under the AIC prior we see that the top models all include runtime, mpaa_rating_R, thtr_rel_year,imdb_rating, critics score, best_pic_nominees, best actor wins and best actress wins. In the display of the coeficients below we see that these are the ones with the largest coeffients and highest posterior probabilities of Beta not equal to 0.

coef.bas(bma_movies)

## 
##  Marginal Posterior Summaries of Coefficients: 
##                      post mean   post SD     post p(B != 0)
## Intercept             6.235e+01   3.925e-01   1.000e+00    
## feature_filmyes      -6.123e-01   1.308e+00   3.812e-01    
## dramayes              3.209e-01   6.796e-01   3.795e-01    
## runtime              -4.745e-02   2.941e-02   8.560e-01    
## mpaa_rating_Ryes     -1.023e+00   9.621e-01   6.831e-01    
## thtr_rel_year        -3.483e-02   4.158e-02   5.764e-01    
## oscar_seasonyes      -3.098e-01   7.093e-01   3.648e-01    
## summer_seasonyes      4.164e-01   7.588e-01   4.160e-01    
## imdb_rating           1.492e+01   6.373e-01   1.000e+00    
## imdb_num_votes        2.411e-06   4.008e-06   4.506e-01    
## critics_score         6.275e-02   2.460e-02   9.667e-01    
## best_pic_nomyes       3.233e+00   2.989e+00   6.933e-01    
## best_pic_winyes      -5.685e-01   2.728e+00   3.033e-01    
## best_actor_winyes    -8.819e-01   1.210e+00   5.155e-01    
## best_actress_winyes  -1.281e+00   1.465e+00   5.941e-01    
## best_dir_winyes      -5.575e-01   1.261e+00   3.622e-01    
## top200_boxyes         4.617e-01   1.681e+00   3.051e-01

For every minute of additional run time we expect audience score to decrease by .047.
If the movie is rated R then we expect a 1.023 decrease in the audience score.
For each additional theater release year we expect audience score to decrase by .048.
For every 1 point increase in imbd rating we expect an increase of 1.5 in the audience score. Very substantial in relation to other variables.
For every 1 point incrase in critics score we expect an increaseo of .063 in audience score.
If the film has a best picture nomination then we expect a 3.23 bump in audience score.
If the film has a best actor then we expect a .88 decrease in audience score.
If the film has a best actress then we expect a 1.28 decrease in audience score.

plot(bma_movies, which=1)

plot(bma_movies, which=2)

plot(bma_movies, which=3)

plot(bma_movies, which=4)

We do notice that there are a few outliers at rows 126, 216, and 251.Additionally the residual plot doesn’t look quite random. There are many positive residuals for the lower predicted scores. Hence movies with low scores aren’t quite as low as we are predicting.

The model completity appears to peak around the 5 to 10 mark which makes sense given that we have 8 variables in our model.

Part 5: Prediction

Looked at movie Suicide Squad:

IMDb: http://www.imdb.com/title/tt1386697/

Rotten Tomatoes: https://www.rottentomatoes.com/m/suicide_squad_2016/

Box Office Mojo: http://www.boxofficemojo.com/showdowns/chart/?id=superhero2016.htm

Best Actor Win: Yes, Jared Leto.

For the prediction I’m going to use the Best Predictive Model (BPM) as we are making a prediction.

movies3<-rbind(movies,c(66,"yes","no",123,"no",2016,"no","yes",6.6,241744,
                    47,"no","no","yes","no","no","yes"))
movies3[] <- mapply(FUN = as,movies3,sapply(movies,class),SIMPLIFY = FALSE)
movies3_pred = predict.bas(bma_movies,newdata = movies3, estimator="BPM")
suicide_squad_pred=movies3_pred$fit[651]

My predicted value for Suicide Squad is 61.6836967 . This is pretty close to the actual value of 66 on Rotten Tomotas. The residual is 4.3163033 .

Part 6: Conclusion

Overall this Bayesian Regression performed quite well with a residual value of around 4 for Suicide Squad. The model gives us very interpretable coefficients along with probabilities of how likely that are to be included in high posterior probability models. While we can’t quite establish a causal relationship it does help us to see strong correlations.

Some areas for improvement could be considering some other variable transformations or interaction terms. Or additional variables could be created from some of the text variables not used. This could also help to alievate the small issue of positive residuals near the lower end of the audience score spectrum.