library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(BMS)
library(tidyr)
library(MASS)
load("movies.Rdata")
data <- movies
John Eugene Driscoll | Module 4 - Data Analysis Project | Sept 2017 Submission
The purpose of this document is to complete the data analysis project required during week 5 of the Bayesian Statistics course by Duke University (Coursera).
The background context regarding the assignment can be found at: https://www.coursera.org/learn/bayesian/supplement/VuNpl/project-information
Research question
Investigate which parameters are major influences on the audience score (audience_score) using Bayesian statistical methods. Parameters are drawn from the movies data set.
If a meaningful prediction can be drawn from this exercise, I would be interested in repurposing this approach for video game ratings, a form of media I am passionate about.
Scope of Inference
For the purposes of inference, this should be considered an observational study that uses a random sampling approach to obtain a representative sample of U.S. movies released between 1974 and 2016. Since a random sampling method is applied in data collection, the results can be generalized to movies released between 1974 and 2016.
Causation can only be inferred from a randomized experiment. This study does not meet the requirements of a randomized experiment; therefore, causation cannot be determined.
Sources of Bias
As the Rotten Tomatoes audience score is created by volunteers, the study may suffer from voluntary response bias, since people with strong opinions are more likely to participate. The voluntary participants may not be representative of the U.S. population.
Per the assignment, we must create new variables from the movies data source to complement the original variables. Then we must strip the original data set down to only the relevant features.
The new variables are:
Summary - feature_film, drama, mpaa_rating_R, oscar_season, summer_season
Detail -
Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no
Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no
Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no
Create new variable called oscar_season with levels yes (if movie is released in November, October, or December) and no
Create new variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no
The relevant features for our model per the instructions are:
feature_film
drama
runtime
mpaa_rating_R
thtr_rel_year
oscar_season
summer_season
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box
Code for the creation of the new parameters, along with quick summary statistics of the new and relevant features, is provided below.
# Create the new indicator variables required by the assignment
movies <- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"))
movies <- mutate(movies, oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"))
movies <- mutate(movies, drama = ifelse(genre == "Drama", "yes", "no"))
movies <- mutate(movies, feature_film = ifelse(title_type == "Feature Film", "yes", "no"))
movies <- mutate(movies, summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
# Convert the new character variables to factors
movies$feature_film <- as.factor(movies$feature_film)
movies$drama <- as.factor(movies$drama)
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R)
movies$oscar_season <- as.factor(movies$oscar_season)
movies$summer_season <- as.factor(movies$summer_season)
vars <- names(movies) %in% c("feature_film", "drama", "runtime", "mpaa_rating",
  "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating",
  "imdb_num_votes", "critics_score", "best_pic_win", "best_actor_win",
  "best_actress_win", "best_dir_win", "top200_box", "audience_score")
subset <- movies[vars]
summary(subset)
## runtime mpaa_rating thtr_rel_year imdb_rating
## Min. : 39.0 G : 19 Min. :1970 Min. :1.900
## 1st Qu.: 92.0 NC-17 : 2 1st Qu.:1990 1st Qu.:5.900
## Median :103.0 PG :118 Median :2000 Median :6.600
## Mean :105.8 PG-13 :133 Mean :1998 Mean :6.493
## 3rd Qu.:115.8 R :329 3rd Qu.:2007 3rd Qu.:7.300
## Max. :267.0 Unrated: 50 Max. :2014 Max. :9.000
## NA's :1
## imdb_num_votes critics_score audience_score best_pic_win
## Min. : 180 Min. : 1.00 Min. :11.00 no :644
## 1st Qu.: 4546 1st Qu.: 33.00 1st Qu.:46.00 yes: 7
## Median : 15116 Median : 61.00 Median :65.00
## Mean : 57533 Mean : 57.69 Mean :62.36
## 3rd Qu.: 58301 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :893008 Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box mpaa_rating_R
## no :558 no :579 no :608 no :636 no :322
## yes: 93 yes: 72 yes: 43 yes: 15 yes:329
##
##
##
##
##
## oscar_season drama feature_film summer_season
## no :460 no :346 no : 60 no :443
## yes:191 yes:305 yes:591 yes:208
##
##
##
##
##
df <- arrange(subset,audience_score,critics_score,runtime,imdb_rating,imdb_num_votes,feature_film,
drama,mpaa_rating,mpaa_rating_R,thtr_rel_year,oscar_season,summer_season,best_pic_win,
best_actor_win,best_dir_win,top200_box)
arrange(df, desc(audience_score))
## # A tibble: 651 x 17
## runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes
## <dbl> <fctr> <dbl> <dbl> <int>
## 1 202 R 1974 9.0 749783
## 2 106 Unrated 2002 8.5 2362
## 3 121 R 1991 7.5 2551
## 4 133 R 1993 8.1 103378
## 5 154 PG-13 1985 7.8 58668
## 6 113 R 2000 8.5 806911
## 7 126 R 1997 8.3 572236
## 8 137 R 1986 8.4 466400
## 9 94 Unrated 2010 7.4 1935
## 10 94 R 1996 8.2 448434
## # ... with 641 more rows, and 12 more variables: critics_score <dbl>,
## # audience_score <dbl>, best_pic_win <fctr>, best_actor_win <fctr>,
## # best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## # mpaa_rating_R <fctr>, oscar_season <fctr>, drama <fctr>,
## # feature_film <fctr>, summer_season <fctr>
Analysis of the newly created features:
For the EDA, we have created a boxplot for each newly created feature, comparing them with the audience_score. We will also analyse the variability of the new features by comparing them to each other.
In the first graph, we compare the variability of all of the newly created features at once.
The new features show little difference in audience_score between their yes and no levels.
This suggests that none of the above features may be valuable for predicting the audience score. The only feature that might be relevant is feature_film, where there appears to be an observable difference. More investigation follows below.
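As a quick numeric check on that apparent feature_film gap, here is a minimal sketch summarising audience_score by group (a sanity check only; the exact values are not reproduced here):
# Sketch: compare audience_score between feature films and non-feature films.
movies %>%
  group_by(feature_film) %>%
  summarise(n = n(),
            mean_score = mean(audience_score),
            median_score = median(audience_score))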
grouped_movies <- gather(movies, 'variable', 'flag', 33:37)
ggplot(grouped_movies, aes(x = variable, y = audience_score, fill = flag)) + geom_boxplot()
ggplot(df, aes(x = feature_film, y = audience_score, fill = feature_film)) + geom_boxplot()
ggplot(df, aes(x = drama, y = audience_score, fill = drama)) + geom_boxplot()
ggplot(df, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()
ggplot(df, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + geom_boxplot()
ggplot(df, aes(x = summer_season, y = audience_score, fill = summer_season)) + geom_boxplot()
We will use a stepwise backwards elimination process to construct the final model, starting with the full model and removing variables until the information criterion (AIC, the stepAIC default) can no longer be lowered; a BIC-penalised variant is sketched after the stepAIC output below. BMA will then be used to find the best model.
In Bayesian model averaging (BMA), multiple models are averaged, weighted by their posterior probabilities, to obtain posterior distributions of coefficients and of predictions for new data.
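Concretely, for a quantity of interest $\Delta$ (a coefficient or a prediction for new data), BMA averages over the candidate models $M_1, \ldots, M_K$, weighting each by its posterior probability:
$$P(\Delta \mid Y) = \sum_{k=1}^{K} P(\Delta \mid M_k, Y)\, P(M_k \mid Y), \qquad P(M_k \mid Y) \propto P(Y \mid M_k)\, P(M_k).$$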
Full model
library(MASS)
lm1 <- lm(audience_score ~ ., data = df)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ runtime + mpaa_rating + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + mpaa_rating_R +
## oscar_season + drama + feature_film + summer_season
##
## Final Model:
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 630 63172.60 3014.825
## 2 - mpaa_rating_R 0 0.000000 630 63172.60 3014.825
## 3 - mpaa_rating 5 563.905967 635 63736.51 3010.601
## 4 - best_pic_win 1 1.948927 636 63738.46 3008.621
## 5 - oscar_season 1 9.743905 637 63748.20 3006.721
## 6 - top200_box 1 16.676369 638 63764.88 3004.891
## 7 - best_dir_win 1 90.517834 639 63855.39 3003.813
## 8 - best_actor_win 1 119.151581 640 63974.55 3003.024
## 9 - summer_season 1 168.616474 641 64143.16 3002.735
## 10 - drama 1 167.988533 642 64311.15 3002.435
## 11 - feature_film 1 166.871295 643 64478.02 3002.120
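The stepAIC output above uses the default AIC penalty (k = 2). Since BIC was the stated target criterion, here is a minimal sketch of a BIC-penalised backward elimination, assuming we simply switch the penalty to k = log(n):
# Sketch: backward elimination with the BIC penalty (k = log(n)) instead of AIC.
n <- nobs(lm1)                      # number of observations actually used in the fit
score_step_bic <- stepAIC(lm1, trace = FALSE, k = log(n))
score_step_bic$anova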
Reduced Model
vars1 <- names(movies) %in% c("audience_score","thtr_rel_year","critics_score","imdb_rating",
"imdb_num_votes")
df1 <- movies[vars1]
df1 <- na.omit(df1)
mf <- bms(df1, mprior = "unif", g = 1700, mcmc = "enumeration", user.int = FALSE)
image(mf[1:5], order = FALSE)
Using the BMA, BPM, and MPM estimators with the reduced dataset (df1, with five numerical variables); a sketch comparing these estimators follows the model summary below.
movies_no_na = na.omit(df1)
bma_aud = bas.lm(audience_score ~ ., data = movies_no_na,
prior = "BIC",
modelprior = uniform())
bma_aud
##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_no_na, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept thtr_rel_year imdb_rating imdb_num_votes
## 1.00000 0.07265 1.00000 0.04135
## critics_score
## 0.91989
summary(bma_aud)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## thtr_rel_year 0.07264667 0.0000 0.000000e+00 1.000000e+00
## imdb_rating 1.00000000 1.0000 1.000000e+00 1.000000e+00
## imdb_num_votes 0.04134915 0.0000 0.000000e+00 0.000000e+00
## critics_score 0.91988729 1.0000 0.000000e+00 1.000000e+00
## BF NA 1.0000 8.469154e-02 7.534253e-02
## PostProbs NA 0.8199 6.940000e-02 6.180000e-02
## R2 NA 0.7524 7.480000e-01 7.529000e-01
## dim NA 3.0000 2.000000e+00 4.000000e+00
## logmarg NA -3621.0579 -3.623527e+03 -3.623644e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## thtr_rel_year 0.000000e+00 1.000000e+00
## imdb_rating 1.000000e+00 1.000000e+00
## imdb_num_votes 1.000000e+00 0.000000e+00
## critics_score 1.000000e+00 0.000000e+00
## BF 4.304024e-02 9.247604e-03
## PostProbs 3.530000e-02 7.600000e-03
## R2 7.524000e-01 7.488000e-01
## dim 4.000000e+00 3.000000e+00
## logmarg -3.624203e+03 -3.625741e+03
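As noted above, the BPM and MPM estimators can be used alongside BMA. A minimal sketch of how they could be compared on this fit, assuming the estimator argument of the BAS fitted()/predict() methods:
# Sketch: in-sample fitted values under different BAS estimators.
fit_bma <- fitted(bma_aud, estimator = "BMA")   # posterior-mean (model-averaged) fit
fit_mpm <- fitted(bma_aud, estimator = "MPM")   # median probability model
fit_bpm <- fitted(bma_aud, estimator = "BPM")   # best predictive model
head(cbind(BMA = fit_bma, MPM = fit_mpm, BPM = fit_bpm))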
library(MASS)
full_model <- bas.lm(data=na.omit(movies), audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes
+ critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win
+ best_dir_win + top200_box, prior = 'BIC', modelprior = uniform())
full_model
##
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.05876 0.04509
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.51400 0.16498 0.08089
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.06526 0.07935 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04077 0.11565 0.14770
## best_dir_winyes top200_boxyes
## 0.06701 0.04876
coef_aud = coefficients(full_model)
confint(coef_aud)
## 2.5% 97.5% beta
## Intercept 6.141029e+01 6.301862e+01 6.221002e+01
## feature_filmyes -8.335074e-01 0.000000e+00 -9.162435e-02
## dramayes 0.000000e+00 0.000000e+00 1.956484e-02
## runtime -9.002118e-02 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes -2.011820e+00 1.120155e-02 -2.409288e-01
## thtr_rel_year -4.990744e-02 9.710504e-05 -3.878883e-03
## oscar_seasonyes -7.714591e-01 2.688653e-03 -6.293171e-02
## summer_seasonyes -5.748289e-03 1.123023e+00 8.660919e-02
## imdb_rating 1.359866e+01 1.654841e+01 1.490680e+01
## imdb_num_votes -2.477967e-08 2.543358e-06 2.466254e-07
## critics_score 0.000000e+00 1.119926e-01 6.951964e-02
## best_pic_nomyes -4.635467e-04 5.172906e+00 5.130600e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -6.383140e-03
## best_actor_winyes -2.240907e+00 7.032379e-03 -2.123363e-01
## best_actress_winyes -3.005555e+00 0.000000e+00 -3.323225e-01
## best_dir_winyes -1.221851e+00 2.782579e-02 -1.188402e-01
## top200_boxyes -1.499603e-01 4.110591e-01 8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(full_model)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.05876309 0.0000 0.0000000 0.0000000
## dramayes 0.04508592 0.0000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.16498090 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.06525993 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07935408 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1558 0.1358000 0.0319000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0318000 0.0289000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
Reduced_model <- lm(data = na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + critics_score + thtr_rel_year)
final_reduced <- bas.lm(data = na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + critics_score + thtr_rel_year, prior = "BIC", modelprior = uniform())
coef_reduced = coefficients(final_reduced)
confint(coef_reduced)
## 2.5% 97.5% beta
## Intercept 61.37940834 62.975467028 6.221002e+01
## imdb_rating 13.39974861 16.190537288 1.468210e+01
## imdb_num_votes 0.00000000 0.000000000 8.008232e-08
## critics_score 0.00000000 0.113115780 7.319061e-02
## thtr_rel_year -0.01926706 0.003306831 -2.454141e-03
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
From the output above, we see a similar outcome when comparing the frequentist model analysis (previous module) with the Bayesian model analysis. Most of the variables, including the newly created ones, were eliminated from the full model to arrive at the reduced model.
Looking at the model summaries above, the highest posterior probability model contains the intercept, imdb_rating, and critics_score. The imdb_rating variable is the single biggest predictor of audience score.
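As a quick illustration of that top model (a sketch only, refitting it by ordinary least squares to eyeball the coefficient magnitudes; it is not part of the Bayesian analysis):
# Sketch: OLS refit of the highest posterior probability model.
summary(lm(audience_score ~ imdb_rating + critics_score, data = movies_no_na))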
We now predict the audience score for a new movie that was not used to fit the model: “Kung Fu Panda 3.” The data below, obtained from IMDB and Rotten Tomatoes, provide each data point required by the reduced model.
Impressively, the model predicts an audience score of 73 (rounded down to the nearest whole number). The actual audience score for Kung Fu Panda 3 per Rotten Tomatoes is 79.
Sources: Rotten Tomato Page for Kung Fu Panda 3 https://www.rottentomatoes.com/m/kung_fu_panda_3/
IMDB Page for Kung Fu Panda 3 http://www.imdb.com/title/tt2267968/
kfp <- data.frame(runtime=95,mpaa_rating="PG",thtr_rel_year=2016,imdb_rating=7.1,
imdb_num_votes=94961,critics_score=87,audience_score=79,
best_pic_win="no",best_actor_win="no",best_actress_win="no",
best_dir_win="no",top200_box="yes",mpaa_rating_R="no",oscar_season="no",
drama="no",feature_film="yes",summer_season="no")
df<-rbind(df,kfp)
df2<-tail(df,1)
names(kfp)
## [1] "runtime" "mpaa_rating" "thtr_rel_year"
## [4] "imdb_rating" "imdb_num_votes" "critics_score"
## [7] "audience_score" "best_pic_win" "best_actor_win"
## [10] "best_actress_win" "best_dir_win" "top200_box"
## [13] "mpaa_rating_R" "oscar_season" "drama"
## [16] "feature_film" "summer_season"
kfp <- data.frame(audience_score=79, imdb_rating=7.1, critics_score=87,imdb_num_votes=94961, thtr_rel_year=2016 )
BMA_pred_aud = predict(bma_aud, newdata= df2, estimator="BMA", se.fit=T)
predict(final_reduced, kfp)
## $fit
## [,1]
## [1,] 73.35345
##
## $Ybma
## [,1]
## [1,] 73.35345
##
## $Ypred
## [,1]
## [1,] 73.47649
## [2,] 72.75486
## [3,] 72.13615
## [4,] 73.51227
## [5,] 71.23846
## [6,] 72.71466
## [7,] 72.14702
## [8,] 71.19747
## [9,] 77.22007
## [10,] 76.05445
## [11,] 76.98682
## [12,] 76.84748
## [13,] 60.03568
## [14,] 64.08266
## [15,] 62.21002
## [16,] 59.72481
##
## $postprobs
## [1] 8.454053e-01 5.564066e-02 5.095065e-02 3.817022e-02 4.707887e-03
## [6] 2.809584e-03 2.103112e-03 2.125490e-04 1.927465e-87 1.517291e-88
## [11] 2.736049e-92 1.110021e-93 1.769794e-169 2.304423e-170 5.770755e-182
## [16] 1.306043e-182
##
## $se.fit
## NULL
##
## $se.pred
## NULL
##
## $se.bma.fit
## NULL
##
## $se.bma.pred
## NULL
##
## $df
## [1] 616 615 617 615 616 614 616 615 616 615 617 616 616 617 618 617
##
## $best
## [1] 13 10 6 4 11 5 3 8 12 2 9 15 7 16 1 14
##
## $bestmodel
## $bestmodel[[1]]
## [1] 0 1 3
##
## $bestmodel[[2]]
## [1] 0 1 3 4
##
## $bestmodel[[3]]
## [1] 0 1
##
## $bestmodel[[4]]
## [1] 0 1 2 3
##
## $bestmodel[[5]]
## [1] 0 1 4
##
## $bestmodel[[6]]
## [1] 0 1 2 3 4
##
## $bestmodel[[7]]
## [1] 0 1 2
##
## $bestmodel[[8]]
## [1] 0 1 2 4
##
## $bestmodel[[9]]
## [1] 0 2 3
##
## $bestmodel[[10]]
## [1] 0 2 3 4
##
## $bestmodel[[11]]
## [1] 0 3
##
## $bestmodel[[12]]
## [1] 0 3 4
##
## $bestmodel[[13]]
## [1] 0 2 4
##
## $bestmodel[[14]]
## [1] 0 2
##
## $bestmodel[[15]]
## [1] 0
##
## $bestmodel[[16]]
## [1] 0 4
##
##
## $prediction
## [1] FALSE
##
## $estimator
## [1] "BMA"
##
## attr(,"class")
## [1] "pred.bas"
The predictive model presented here is used to predict the audience score for a movie.
Using Bayesian model averaging, many models can be combined to produce better predictions. The proposed linear model gives a reasonably accurate prediction, but it is not perfectly accurate.
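To quantify "reasonably accurate", a 95% prediction interval around the BMA prediction could be reported, assuming the confint method for BAS prediction objects (se.fit must be requested in the predict call):
# Sketch: 95% prediction interval for Kung Fu Panda 3 under BMA.
pred_kfp <- predict(final_reduced, newdata = kfp, estimator = "BMA", se.fit = TRUE)
confint(pred_kfp, parm = "pred")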
It may be helpful to use this model along with the expert opinion of a well-practiced movie critic to create a better prior expectation and ultimately a stronger Bayesian prediction machine.
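Expert opinion would most naturally enter through an informative prior. As a simpler illustration of how sensitive the results are to the prior choice, here is a sketch refitting the reduced model under a Zellner-Siow prior instead of BIC (prior names as documented in the BAS package are assumed):
# Sketch: refit the reduced model with an alternative coefficient prior.
final_reduced_zs <- bas.lm(audience_score ~ imdb_rating + imdb_num_votes +
                             critics_score + thtr_rel_year,
                           data = na.omit(movies),
                           prior = "ZS-null",      # Zellner-Siow null-based prior
                           modelprior = uniform())
summary(final_reduced_zs)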