Part 0: Setup

Load packages

library(tidyverse)  # also attaches ggplot2 and dplyr
library(statsr)
library(BAS)
library(ggpubr)

Load data

load("movies.rda")

Part 1: Data

1.1 Project Background

The data set provided for this project contains information on 651 randomly sampled movies produced and released before 2016, drawn from Rotten Tomatoes and IMDb. It records how much audiences and critics liked each movie, along with numerous other attributes.

In this project we investigate which attributes make a movie popular, carrying out exploratory data analysis (EDA), modeling, and prediction.

The specific modeling task is as follows: develop a Bayesian regression model to predict audience_score from the explanatory variables constructed and selected below.

1.2 Data Collection

Generalizability: The data set contains information on 651 movies, randomly sampled, with theatrical release dates from 1970 to 2014. Since the sample is reasonably large and randomly drawn, the results should generalize to the population of movies released in that period.

Causality: This is an observational study; the movies were randomly sampled but nothing was randomly assigned, so the data cannot be used to establish causality, only association.


Part 2: Data manipulation

Create new variables for EDA

# Create new variable based on `title_type`: `feature_film`, with levels yes (feature films) and no
movies <- movies %>% mutate(feature_film = as.factor(if_else(title_type == "Feature Film", "yes", "no")))

# Create new variable based on `genre`: `drama`, with levels yes (dramas) and no
movies <- movies %>% mutate(drama = as.factor(if_else(genre == "Drama", "yes", "no")))

# Create new variable based on `mpaa_rating`: `mpaa_rating_R`, with levels yes (R-rated movies) and no
movies <- movies %>% mutate(mpaa_rating_R = as.factor(if_else(mpaa_rating == "R", "yes", "no")))

# Create new variable based on `thtr_rel_month`: `oscar_season`, with levels yes (released in October, November, or December) and no
movies <- movies %>% mutate(oscar_season = as.factor(if_else(thtr_rel_month %in% c(10, 11, 12), "yes", "no")))

# Create new variable based on `thtr_rel_month`: `summer_season`, with levels yes (released in May, June, July, or August) and no
movies <- movies %>% mutate(summer_season = as.factor(if_else(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no")))

# Subset of audience_score and the newly constructed variables
movies_eda <- movies %>% select(audience_score, feature_film, drama, mpaa_rating_R, oscar_season, summer_season)
summary(movies_eda)
##  audience_score  feature_film drama     mpaa_rating_R oscar_season
##  Min.   :11.00   no : 60      no :346   no :322       no :460     
##  1st Qu.:46.00   yes:591      yes:305   yes:329       yes:191     
##  Median :65.00                                                    
##  Mean   :62.36                                                    
##  3rd Qu.:80.00                                                    
##  Max.   :97.00                                                    
##  summer_season
##  no :443      
##  yes:208      
##               
##               
##               
## 

Part 3: Exploratory data analysis

3.0 Audience Score distribution

ggplot(movies_eda)+geom_histogram(aes(x=audience_score),binwidth=5,color="black")+labs(title="Audience Score Distribution")+theme_classic()

From the histogram, the distribution of audience scores is slightly left-skewed: the mean (62.4) is a little lower than the median (65.0).
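To make the skew visible, the mean and median can be overlaid on the histogram (a small optional sketch using the same movies_eda data):

# Mark the mean (dashed) and median (solid): the dashed line sits to the left,
# consistent with a left-skewed distribution
ggplot(movies_eda, aes(x = audience_score)) +
  geom_histogram(binwidth = 5, color = "black") +
  geom_vline(xintercept = mean(movies_eda$audience_score), linetype = "dashed") +
  geom_vline(xintercept = median(movies_eda$audience_score)) +
  theme_classic()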

3.1 Feature Film

ggplot(movies_eda)+geom_boxplot(aes(x=feature_film,y=audience_score,fill=feature_film))+
  labs(title="Audience Score Distribution for Feature & Non-feature Film")+theme_classic()

movies_eda %>% group_by(feature_film) %>% summarize(count = n(), mean = mean(audience_score), median = median(audience_score), min = min(audience_score), max = max(audience_score))
## # A tibble: 2 x 6
##   feature_film count  mean median   min   max
##   <fct>        <int> <dbl>  <dbl> <dbl> <dbl>
## 1 no              60  81.0   85.5    19    96
## 2 yes            591  60.5   62      11    97

There are 591 feature films (91%) and 60 non-feature films (9%) in the data set. Feature films generally receive lower audience scores (mean 60.5, median 62) than non-feature films (mean 81.0, median 85.5). The box plot also shows that feature films span a noticeably wider range of audience scores than non-feature films.
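The percentage shares quoted here and in the following subsections can be reproduced with a small count-and-proportion pipeline, for example:

# Counts and shares of feature vs non-feature films;
# the same pattern applies to the other yes/no variables
movies_eda %>% count(feature_film) %>% mutate(prop = round(n / sum(n), 2))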

3.2 Drama

ggplot(movies_eda)+geom_boxplot(aes(x=drama,y=audience_score,fill=drama))+
  labs(title="Audience Score Distribution for Drama & Non-Drama")+theme_classic()

movies_eda %>% group_by(drama) %>% summarize(count = n(), mean = mean(audience_score), median = median(audience_score), min = min(audience_score), max = max(audience_score))
## # A tibble: 2 x 6
##   drama count  mean median   min   max
##   <fct> <int> <dbl>  <dbl> <dbl> <dbl>
## 1 no      346  59.7     61    11    97
## 2 yes     305  65.3     70    13    95

There are 305 dramas (47%) and 346 non-dramas (53%) in the data set. From the box plot, the audience score distributions for dramas and non-dramas are quite similar, with comparable ranges. Dramas (mean 65.3, median 70.0) have a slightly higher mean and median than non-dramas (mean 59.7, median 61.0).

3.3 MPAA rating (Restricted/Non-Restricted)

ggplot(movies_eda)+geom_boxplot(aes(x=mpaa_rating_R,y=audience_score,fill=mpaa_rating_R))+
  labs(title="Audience Score Distribution for Restricted & Non-Restricted")+theme_classic()

movies_eda %>% group_by(mpaa_rating_R) %>% summarize(count = n(), mean = mean(audience_score), median = median(audience_score), min = min(audience_score), max = max(audience_score))
## # A tibble: 2 x 6
##   mpaa_rating_R count  mean median   min   max
##   <fct>         <int> <dbl>  <dbl> <dbl> <dbl>
## 1 no              322  62.7   65.5    11    96
## 2 yes             329  62.0   64      14    97

There are 329 R-rated (50%) and 322 non-R-rated (50%) movies in the data set. From the box plot, the audience score distributions for the two groups are nearly identical, with very similar ranges. R-rated movies (mean 62.0, median 64) and non-R-rated movies (mean 62.7, median 65.5) have very close means and medians.

3.4 Oscar Season

ggplot(movies_eda)+geom_boxplot(aes(x=oscar_season,y=audience_score,fill=oscar_season))+
  labs(title="Audience Score Distribution for Oscar & Non-Oscar Season")+theme_classic()

movies_eda %>% group_by(oscar_season) %>% summarize(count = n(), mean = mean(audience_score), median = median(audience_score), min = min(audience_score), max = max(audience_score))
## # A tibble: 2 x 6
##   oscar_season count  mean median   min   max
##   <fct>        <int> <dbl>  <dbl> <dbl> <dbl>
## 1 no             460  61.8     64    11    96
## 2 yes            191  63.7     69    13    97

There are 191 Oscar-season releases (29%) and 460 non-Oscar-season releases (71%) in the data set. As with the MPAA rating above, the box plots of audience score for Oscar-season and non-Oscar-season releases are very close, with similar ranges. Oscar-season films (mean 63.7, median 69) have a slightly higher mean and median than non-Oscar-season films (mean 61.8, median 64).

3.5 Summer Season

ggplot(movies_eda)+geom_boxplot(aes(x=summer_season,y=audience_score,fill=summer_season))+
  labs(title="Audience Score Distribution for Summer & Non-Summer Season")+theme_classic()

movies_eda %>% group_by(summer_season) %>% summarize(count = n(), mean = mean(audience_score), median = median(audience_score), min = min(audience_score), max = max(audience_score))
## # A tibble: 2 x 6
##   summer_season count  mean median   min   max
##   <fct>         <int> <dbl>  <dbl> <dbl> <dbl>
## 1 no              443  62.6     66    13    97
## 2 yes             208  61.8     65    11    94

There are 208 summer-season releases (32%) and 443 non-summer releases (68%) in the data set. The box plots of audience score for summer and non-summer releases are very similar, with comparable ranges. Summer-season films (mean 61.8, median 65) and non-summer films (mean 62.6, median 66.0) differ only negligibly in mean and median. A compact recap of all five binary variables is sketched below.
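Since all five constructed variables are yes/no factors with identical levels, they can be recapped in a single faceted box plot (an optional sketch using tidyr's pivot_longer, attached via tidyverse):

# All five yes/no variables against audience score in one faceted plot
movies_eda %>%
  pivot_longer(-audience_score, names_to = "variable", values_to = "level") %>%
  ggplot(aes(x = level, y = audience_score, fill = level)) +
  geom_boxplot() +
  facet_wrap(~ variable, nrow = 1) +
  theme_classic()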


Part 4: Modeling

4.1 Preparation

We collect the names of all required columns into a vector, then check for missing values.

variable_list <- c("feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")

summary(movies %>% select(all_of(variable_list)))
##  feature_film drama        runtime      mpaa_rating_R thtr_rel_year 
##  no : 60      no :346   Min.   : 39.0   no :322       Min.   :1970  
##  yes:591      yes:305   1st Qu.: 92.0   yes:329       1st Qu.:1990  
##                         Median :103.0                 Median :2000  
##                         Mean   :105.8                 Mean   :1998  
##                         3rd Qu.:115.8                 3rd Qu.:2007  
##                         Max.   :267.0                 Max.   :2014  
##                         NA's   :1                                   
##  oscar_season summer_season  imdb_rating    imdb_num_votes   critics_score   
##  no :460      no :443       Min.   :1.900   Min.   :   180   Min.   :  1.00  
##  yes:191      yes:208       1st Qu.:5.900   1st Qu.:  4546   1st Qu.: 33.00  
##                             Median :6.600   Median : 15116   Median : 61.00  
##                             Mean   :6.493   Mean   : 57533   Mean   : 57.69  
##                             3rd Qu.:7.300   3rd Qu.: 58301   3rd Qu.: 83.00  
##                             Max.   :9.000   Max.   :893008   Max.   :100.00  
##                                                                              
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box
##  no :636   
##  yes: 15   
##            
##            
##            
##            
## 

There is only one missing value (in runtime). We can simply drop that row, since it is a tiny proportion of the whole data set.

movies_model <- movies %>% select(audience_score, all_of(variable_list)) %>% drop_na()

4.2 BMA model

Traditional model-building strategies often use stepwise variable selection (e.g. backward elimination with BIC) to choose a single model, but stepwise methods can yield biased estimates and overly narrow confidence intervals.

In this project we instead use Bayesian Model Averaging (BMA): the prediction is the sum of each candidate model's posterior predictive mean, weighted by that model's posterior probability (see the sketch after the model fit below).

bma_model <- bas.lm(audience_score ~ ., prior = "BIC", modelprior = uniform(), data = movies_model)
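To make the averaging concrete, here is a minimal sketch of what the BMA estimator computes; fitted() with estimator = "BMA" is the BAS call that performs the weighting, and the comment spells out the formula.

# BMA fitted values: yhat_BMA = sum over models k of P(M_k | data) * yhat_k,
# i.e. each model's fitted values weighted by its posterior model probability
yhat_bma <- fitted(bma_model, estimator = "BMA")
head(yhat_bma)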

4.3 Diagnostics for model

4.3.1 Residual variability

# Residuals vs Fitted Values
plot(bma_model,1,add.smooth=TRUE)

The residuals-vs-fitted plot does not show random scatter around the zero line. It indicates non-constant variability (variance) in the data, especially for movies with low audience scores. The model is therefore likely to predict better at the high end of the score range.

4.3.2 Model Probabilities

# Search Order vs Cumulative Probability
plot(bma_model,2,add.smooth=TRUE)

The model search enumerates all 2^16 = 65,536 candidate models (one for each subset of the 16 predictors). The cumulative probability grows roughly linearly with search order, indicating that posterior probability is spread fairly evenly across many models rather than concentrated in a few.

4.3.3 Model Complexity

# Model dimension vs Marginal likelihood (Bayes Factor)
plot(bma_model,3)

This plot shows the marginal likelihood (Bayes factor, on the log scale) against the number of parameters in each model. The models with 6 to 10 predictors attain the largest marginal likelihoods.

4.3.4 Posterior Marginal Inclusion Probabilities

# Variables vs Posterior Marginal Inclusion Probabilities
plot(bma_model,4)

The plot suggests that imdb_rating and critics_score have posterior inclusion probabilities greater than 0.5 and are thus the variables most likely to be included in the model.
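The same inclusion probabilities can also be read off the fitted object directly (probne0 and namesx are components of a bas object per the BAS documentation; treat the exact field names as an assumption):

# Posterior inclusion probabilities P(beta != 0 | Y), largest first
pip <- setNames(bma_model$probne0, bma_model$namesx)
sort(pip, decreasing = TRUE)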

4.4 Model Selection

summary(bma_model)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127

Looking at the summary above, the top two models have posterior probabilities of 12.97% and 12.93% respectively. Their predictor sets are very similar (both include imdb_rating and critics_score); runtime appears only in model 1. Considering the 2^16 possible combinations, these two posterior probabilities are in fact far higher than those of the other models.

Now we take a look at the Highest Probability Model (HPM) and the Best Predictive Model (BPM).

# Best Predictive Model (BPM)
BPM_movies <- predict(bma_model, estimator = "BPM")
variable.names(BPM_movies)
## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"
# High Probability Model (HPM)
HPM_movies <- predict(bma_model, estimator = "HPM")
variable.names(HPM_movies)
## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

Both the BPM and the HPM include runtime, imdb_rating, and critics_score, which supports treating these as important predictors for our model.

However, the top two models together account for only about 26% of the posterior probability, leaving 74% of the model uncertainty to the remaining models. It is therefore better to use the Bayesian Model Averaging (BMA) estimator, which averages over all models, to obtain a comprehensive prediction.
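As a quick check of how much the estimator choice matters in practice, the fitted values under the three estimators can be compared (an optional sketch; all three calls use the fitted() method of BAS):

# How closely do the single-model estimators track the full BMA average?
yhat_hpm <- fitted(bma_model, estimator = "HPM")
yhat_bpm <- fitted(bma_model, estimator = "BPM")
cor(cbind(BMA = fitted(bma_model, estimator = "BMA"), HPM = yhat_hpm, BPM = yhat_bpm))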

4.5 Interpretation of model coefficients

image(bma_model, rotate=F)

From the visualization we can see that imdb_rating, critics_score, and runtime have the highest posterior probabilities of appearing in a model. This indicates that these variables should be critical in determining the audience score. Let's make a scatter plot of each of them against audience score for a better understanding.

# imdb_rating vs audience score
imdb_plot <- ggplot(movies_model, aes(x = imdb_rating, y = audience_score)) + geom_jitter() + labs(title = "IMDB Rating vs Audience Score") + geom_smooth(method = lm, se = FALSE, formula = y ~ x) + theme_classic()
# critics_score vs audience score
crit_plot <- ggplot(movies_model, aes(x = critics_score, y = audience_score)) + geom_jitter() + labs(title = "Critics Score vs Audience Score") + geom_smooth(method = lm, se = FALSE, formula = y ~ x) + theme_classic()
# runtime vs audience score
rt_plot <- ggplot(movies_model, aes(x = runtime, y = audience_score)) + geom_jitter() + labs(title = "Run-Time vs Audience Score") + geom_smooth(method = lm, se = FALSE, formula = y ~ x) + theme_classic()

ggarrange(imdb_plot,crit_plot,rt_plot,ncol=3,nrow = 1)

# correlation coefficient with audience score 
data.frame("imdb_rating"=c(cor(movies_model$imdb_rating,movies_model$audience_score)),
           "critics_score"=c(cor(movies_model$critics_score,movies_model$audience_score)),
           "runtime"=c(cor(movies_model$runtime,movies_model$audience_score)))
##   imdb_rating critics_score   runtime
## 1   0.8649091     0.7041573 0.1809629

From the scatter plots and correlation coefficients, imdb_rating (r = 0.86) and critics_score (r = 0.70) are both strongly positively correlated with audience score, while runtime (r = 0.18) is only weakly positively correlated.

confint(coef(bma_model))
##                              2.5%        97.5%          beta
## Intercept            6.157802e+01 6.312567e+01  6.234769e+01
## feature_filmyes     -1.387448e+00 6.607672e-03 -1.046908e-01
## dramayes             0.000000e+00 0.000000e+00  1.604413e-02
## runtime             -8.201438e-02 2.461195e-05 -2.567772e-02
## mpaa_rating_Ryes    -2.178943e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year       -5.286942e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes     -9.627987e-01 2.432513e-03 -8.034940e-02
## summer_seasonyes    -1.024394e-02 9.955197e-01  8.704545e-02
## imdb_rating          1.372394e+01 1.661822e+01  1.498203e+01
## imdb_num_votes      -3.856746e-07 8.115848e-07  2.080713e-07
## critics_score        0.000000e+00 1.054844e-01  6.296648e-02
## best_pic_nomyes     -4.065101e-04 5.049915e+00  5.068035e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes   -2.568434e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.841653e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes     -1.425996e+00 7.944417e-03 -1.195011e-01
## top200_boxyes        0.000000e+00 2.844025e-01  8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The table above gives the posterior mean and a 95% credible interval for every coefficient in the BMA model. For example, holding the other variables constant, a one-point increase in imdb_rating is associated with an increase of about 15.0 in audience_score (95% credible interval: 13.7 to 16.6).
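The full posterior distribution of any single coefficient can also be plotted. Here is an optional sketch for the two strongest predictors; the subset indices are assumed to follow the row order of the table above:

# Posterior densities of the imdb_rating (row 9) and critics_score (row 11) coefficients
par(mfrow = c(1, 2))
plot(coef(bma_model), subset = c(9, 11), ask = FALSE)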


Part 5: Prediction

In this part we select some movies for prediction check.

5.1 Deadpool 2

Film Data Source:

  1. https://www.imdb.com/title/tt5463162/?ref_=nv_sr_srsg_3

  2. https://www.rottentomatoes.com/m/deadpool_2

dp2_df <- data.frame(runtime = 119,
                     thtr_rel_year = 2018,
                     imdb_rating = 7.7,
                     imdb_num_votes = 551253,  # no thousands separator: a comma here would create a spurious extra column
                     critics_score = 84,
                     best_pic_nom = "no",
                     best_pic_win = "no",
                     best_actor_win = "no",
                     best_actress_win = "no",
                     best_dir_win = "no",
                     top200_box = "no",
                     feature_film = "yes",
                     drama = "yes",
                     mpaa_rating_R = "yes",
                     oscar_season = "no",
                     summer_season = "yes")

predict_result <- predict(bma_model, newdata = dp2_df, estimator = "BMA", se.fit = TRUE)
confint(predict_result)
##          2.5%    97.5%     pred
## [1,] 62.35507 101.4039 81.66654
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The predicted score is 81.7, with a 95% credible interval from 62.4 to 101.4 (the upper bound exceeds the maximum attainable score of 100). The actual audience score on Rotten Tomatoes is 88, so the residual is 88 - 81.7 = 6.3, which is quite small. The model therefore makes a good prediction of the audience score.
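The residual can also be computed directly from the prediction object (fit is the component of a BAS prediction holding the point estimate):

# Residual: actual Rotten Tomatoes audience score minus the BMA point prediction
88 - predict_result$fit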

5.2 Batman v Superman: Dawn of Justice

Film Data Source:

  1. https://www.imdb.com/title/tt2975590/?ref_=ttawd_awd_tt

  2. https://www.rottentomatoes.com/m/batman_v_superman_dawn_of_justice

bvs_df <- data.frame(runtime = 152,
                     thtr_rel_year = 2016,
                     imdb_rating = 6.4,
                     imdb_num_votes = 685249,  # again, no thousands separator
                     critics_score = 29,
                     best_pic_nom = "no",
                     best_pic_win = "no",
                     best_actor_win = "no",
                     best_actress_win = "no",
                     best_dir_win = "no",
                     top200_box = "no",
                     feature_film = "yes",
                     drama = "yes",
                     mpaa_rating_R = "no",
                     oscar_season = "no",
                     summer_season = "no")

predict_result <- predict(bma_model, newdata = bvs_df, estimator = "BMA", se.fit = TRUE)
confint(predict_result)
##          2.5%   97.5%     pred
## [1,] 37.27349 77.3494 58.10504
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The predicted score is 58.1, with a 95% credible interval from 37.3 to 77.3. The actual audience score on Rotten Tomatoes is 63, so the residual is 63 - 58.1 = 4.9, which is quite small. Again, the model makes a good prediction of the audience score.


Part 6: Conclusion

6.1 Summary

This project set out to build a multivariable linear model that predicts a movie's audience score from a variety of factors. The Bayesian Information Criterion (BIC) prior was used to compute posterior probabilities over the candidate models; each model was then weighted by its posterior probability and combined into a Bayesian Model Averaging (BMA) model. Finally, we used the model to predict the audience scores of two movies.

6.2 Findings

The prediction results are fairly good: both movies' residuals are well below 10 points of audience score. The model should be a useful tool for estimating how much a movie is liked on Rotten Tomatoes. Within the model, imdb_rating and critics_score play the most important roles, as they have the highest posterior inclusion probabilities in the BMA model. On the other hand, most of the newly created variables (from Part 2) contribute little to the prediction.

6.3 Shortcomings

The sample size is small compared to the full IMDb or Rotten Tomatoes movie databases; a larger sample should improve prediction accuracy. In addition, as noted above, most of the newly created variables (from Part 2) are not useful in this model, so a next iteration might consolidate the existing factors or explore new ones.