Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

Setting the file path and loading the data

setwd('C:/Users/sanso/Documents/Git/StatsR/Bayesian')
load("movies.Rdata")
dim(movies)

## [1] 651  32

Part 1: Data

The data contains 651 observations with 32 variables.

The data pertains to movies released before 2016 and have been gathered from Rotten tomatoes and IMDB.

This data has been randomly sampled. Because it is randomly sampled, inferences on this data can be generalized to the population.

However, since no random assignment has been used (this is an observational study), we would not be able to make any causal statements irresepective of the kind of inference

Part 2: Data manipulation

The main purpose of this project is to develop a model for predicting “audience_score” using the following variables. Some already exist which we will not be changing. The other vraibles are new variables which would have to be created

feature_film: “yes” if title_type is Feature Film, “no” otherwise (NEW)
drama: “yes” if genre is Drama, “no” otherwise (NEW)
runtime (EXISTING)
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise (NEW)
thtr_rel_year (EXISTING)
oscar_season: “yes” if movie is released in November, October, or December - (based on thtr_rel_month), “no” otherwise (NEW)
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise (NEW)
imdb_rating (EXISTING)
imdb_num_votes (EXISTING)
critics_score (EXISTING)
best_pic_nom (EXISTING)
best_pic_win (EXISTING)
best_actor_win (EXISTING)
best_actress_win (EXISTING)
best_dir_win (EXISTING)
top200_box (EXISTING)

Hence we would be creating 5 new variables namely

feature_film
drama
mpaa_rating_R
oscar_season
Summer_season

Code for creating the new variables

movies$feature_film <- ifelse(movies$title_type == "Feature Film", "yes", "no")

movies$drama <- ifelse(movies$genre =="Drama", "yes", "no")

movies$mpaa_rating_R <- ifelse(movies$mpaa_rating == "R", "yes", "no")

movies$oscar_season <- ifelse(movies$thtr_rel_month == 10 | movies$thtr_rel_month == 11 | movies$thtr_rel_month ==12,                 "yes","no")


movies$summer_season <- ifelse(movies$thtr_rel_month == 5 | movies$thtr_rel_month == 6 | movies$thtr_rel_month ==7 | movies$thtr_rel_month ==8,
"yes","no")

We have added 5 new variables which should now make the number of variables 37

dim(movies)

## [1] 651  37

However, of this 37 variables, we would only be using some- the 16 variables specified. I would be creating a new data set with only these specific variables for all further analysis along with the outcome variable which is audience_score

movie1 <- data.frame(movies$feature_film, movies$drama, movies$runtime,
                     movies$mpaa_rating_R, movies$thtr_rel_year,
                     movies$oscar_season, movies$summer_season,
                     movies$imdb_rating, movies$imdb_num_votes,
                     movies$critics_score, movies$best_pic_nom,
                     movies$best_pic_win, movies$best_actor_win,
                     movies$best_actress_win, movies$best_dir_win,
                     movies$top200_box, movies$audience_score)

dim(movie1)

## [1] 651  17

We have created a new data set “movie1”, which we would be using for all further analysis. I would be checking what the new data set looks like

names(movie1)

##  [1] "movies.feature_film"     "movies.drama"           
##  [3] "movies.runtime"          "movies.mpaa_rating_R"   
##  [5] "movies.thtr_rel_year"    "movies.oscar_season"    
##  [7] "movies.summer_season"    "movies.imdb_rating"     
##  [9] "movies.imdb_num_votes"   "movies.critics_score"   
## [11] "movies.best_pic_nom"     "movies.best_pic_win"    
## [13] "movies.best_actor_win"   "movies.best_actress_win"
## [15] "movies.best_dir_win"     "movies.top200_box"      
## [17] "movies.audience_score"

I would be renaming all the column names with the names in the original data set without the “movies” prefiix

colnames(movie1) <- c ("feature_film","drama","runtime","mpaa_rating_R",
                      "thtr_rel_year", "oscar_season",
                      "summer_season", "imdb_rating",
                      "imdb_num_votes", "critics_score",
                      "best_pic_nom","best_pic_win",
                      "best_actor_win", "best_actress_win",
                      "best_dir_win", "top_200_box",
                      "audience_score")
names(movie1)

##  [1] "feature_film"     "drama"            "runtime"         
##  [4] "mpaa_rating_R"    "thtr_rel_year"    "oscar_season"    
##  [7] "summer_season"    "imdb_rating"      "imdb_num_votes"  
## [10] "critics_score"    "best_pic_nom"     "best_pic_win"    
## [13] "best_actor_win"   "best_actress_win" "best_dir_win"    
## [16] "top_200_box"      "audience_score"

Part 3: Exploratory data analysis

In this section, we would be conducting exploratory analysis to check what are the factors that contribute to audience_score which is our outcome variable

First I would like to see the distribution of the outcome variable - audience_score. To do this I would be making a histogram

hist(movie1$audience_score, col = "light blue", probability = T, 
     main = "Distribution of Audience Score",
     xlab = "Audience Score"); lines(density(movie1$audience_score))

We see that audience score is a left skewed distribution with the highest density being in the 70 - 90 range

Now, I would first check the impact of the new variables that we created

The first variable is feature_film

g <- ggplot( data = movie1, aes(x = feature_film, y = audience_score))
g <- g + geom_boxplot(aes(fill= feature_film))
g

This shows that there exists a major difference in audince score of feature films vs non feature films.

I will be validating the difference with a t-test

t.test(audience_score ~ feature_film, data = movie1, conf.level = 0.99)

## 
##  Welch Two Sample t-test
## 
## data:  audience_score by feature_film
## t = 10.648, df = 86.899, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  15.49363 25.67574
## sample estimates:
##  mean in group no mean in group yes 
##          81.05000          60.46531

Here we see that p-value is almost zero. Also the 99% confidence interval for difference of means is between 15.5 and 25.7 signifying that there is indeed a statistically signficiant difference in audience scores of feature films vs non featire films

The second variable I would be checking is “drama”

g <- ggplot( data = movie1, aes(x = drama, y = audience_score))
g <- g + geom_boxplot(aes(fill= drama))
g

Visually there does not exist any major difference in audience score between drama and non drama movies.

I would be further conducting a t-test to validate this. Since there does not seem to be any major difference I would be using a 95% confidence level instead of the 99% I used in the earlier test

t.test(audience_score ~ drama, data = movie1)

## 
##  Welch Two Sample t-test
## 
## data:  audience_score by drama
## t = -3.5987, df = 648.92, p-value = 0.0003442
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.680895 -2.551760
## sample estimates:
##  mean in group no mean in group yes 
##          59.73121          65.34754

While visually I did not perceive any major difference, the test tells me that there is a statistically significant difference between audience score of Drama vs Non drama movies.

the p- value is 0.0003 helping us reject the Null hypothesis of no difference at a 5% level of significance

Also the 95% confidence interval does not contain 0

We can infer that there is indeed a difference in the observed values of audience scores of Drama vs Non Drama movies

The next variable I would be checking is mpaa_rating_R

g <- ggplot( data = movie1, aes(x = mpaa_rating_R, y = audience_score))
g <- g + geom_boxplot(aes(fill= mpaa_rating_R))
g

Again, I do not visually see any major difference visually.

I would be validating using a t-test

t.test(audience_score ~ mpaa_rating_R, data = movie1)

## 
##  Welch Two Sample t-test
## 
## data:  audience_score by mpaa_rating_R
## t = 0.40777, df = 648.44, p-value = 0.6836
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.468251  3.762027
## sample estimates:
##  mean in group no mean in group yes 
##          62.68944          62.04255

Here the p-value is 0.68. Hence we will accept the Null Hypothesis that there is no difference

The 95% confidence level also contains 0 ranging from -2.5 to 3.8.

Hence we can infer that we do not observe any difference in audience score between movies rated R and movies that are not

the 4th variable I would be using is oscar_season

g <- ggplot( data = movie1, aes(x = oscar_season, y = audience_score))
g <- g + geom_boxplot(aes(fill= oscar_season))
g

Again, I do not see any major visual difference (except medians) between audience scores of movies which have released in the oscar season vs movies that have released in other times

Validating with a t-test

t.test(audience_score ~ oscar_season, data = movie1)

## 
##  Welch Two Sample t-test
## 
## data:  audience_score by oscar_season
## t = -1.0685, df = 349.84, p-value = 0.286
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.319956  1.574316
## sample estimates:
##  mean in group no mean in group yes 
##          61.81304          63.68586

The t-test findings also corroborate the visual findings with a p value of 0.29. The 95% confidence interval also contains 0 ranging from -5.3 to 1.6

We accept the Null hypothesis and infer that there is no observed difference between the audience scores of movies released in the oscar season vs movies that have not

The 5th variable that I would be checking is summer_season

g <- ggplot( data = movie1, aes(x = summer_season, y = audience_score))
g <- g + geom_boxplot(aes(fill= summer_season))
g

Again, no visual difference between audience scores of movies released in the summer season vs those in other seasons

Validating with a t-test

t.test(audience_score ~ summer_season, data = movie1)

## 
##  Welch Two Sample t-test
## 
## data:  audience_score by summer_season
## t = 0.48351, df = 414, p-value = 0.629
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.499424  4.130089
## sample estimates:
##  mean in group no mean in group yes 
##          62.62302          61.80769

The t- test also reiterates what we observed visually. The p-value is 0.63 and the 95% confidence interval contains 0 ranging from -2.5 to 4.13. Hence, we would be accepting the Null hypothesis thereby infering that there is no observed difference between movies released in the summer season vs movies that have released in other seasons

The 6th parameter I want to check is the thtr_rel_year

The reason I want to check this is to see if there is any relation between the year of release and the audience score – to also see of it behaves in any way like a time series and if there are any trends available.

g <- ggplot(data = movie1, aes(x=thtr_rel_year, y = audience_score))
g <- g + geom_line() + geom_smooth(method = "lm",se = F)
g

There does not seem to be any trend emerging from this. While this shows that audience scores of movies have decreased very marginally over time, this is debatable since the strength of the relation is suspect.

The 7th and 8th variable I would be checking is to study the linear relation between imdb rating and audience score and also between imdb_num_votes and audience score.

g <- ggplot(data = movie1, aes(x  = imdb_rating, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm" , se = FALSE)
g <- g + ggtitle(label = "IMDB Rating vs Score")
g

Very Strong linear relation is observed

Validating with the correlaion coefficient

cor(movie1$imdb_rating, movie1$audience_score)

## [1] 0.8648652

The correlation also corroborates the visual finding

g <- ggplot(data = movie1, aes(x  = imdb_num_votes, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm" , se = FALSE)
g <- g + ggtitle(label = "IMDB num Votes  vs Score")
g

Not a particularly strong linear relation but it exists nevertheless

Checking the correlation coefficient

cor(movie1$imdb_num_votes, movie1$audience_score)

## [1] 0.2898128

The correlation coefficient also shows that the linear relationship is not particularly strong

Part 4: Modeling

In this section we would be developing a bayesian linear model for predicting Audience Score (audience_score) using all the other variables specified ( data set movie1)

fit1 <- bas.lm(audience_score ~ ., data = movie1, method = 'MCMC',
               modelprior = uniform())

## Warning in bas.lm(audience_score ~ ., data = movie1, method = "MCMC",
## modelprior = uniform()): dropping 1 rows due to missing data

## Warning in bas.lm(audience_score ~ ., data = movie1, method = "MCMC",
## modelprior = uniform()): We recommend using the implementation using the
## Jeffreys-Zellner-Siow prior (prior='JZS') which uses numerical integration
## rahter than the Laplace approximation

summary(fit1)

##                     P(B != 0 | Y)  model 1     model 2     model 3
## Intercept              1.00000000   1.0000   1.0000000   1.0000000
## feature_filmyes        0.06777115   0.0000   0.0000000   0.0000000
## dramayes               0.04671097   0.0000   0.0000000   0.0000000
## runtime                0.46314926   0.0000   1.0000000   0.0000000
## mpaa_rating_Ryes       0.20214996   0.0000   0.0000000   1.0000000
## thtr_rel_year          0.09446869   0.0000   0.0000000   0.0000000
## oscar_seasonyes        0.07770004   0.0000   0.0000000   0.0000000
## summer_seasonyes       0.08284302   0.0000   0.0000000   0.0000000
## imdb_rating            0.99999619   1.0000   1.0000000   1.0000000
## imdb_num_votes         0.06122208   0.0000   0.0000000   0.0000000
## critics_score          0.88149872   1.0000   1.0000000   1.0000000
## best_pic_nomyes        0.13775024   0.0000   0.0000000   0.0000000
## best_pic_winyes        0.04161072   0.0000   0.0000000   0.0000000
## best_actor_winyes      0.14723282   0.0000   0.0000000   0.0000000
## best_actress_winyes    0.14515762   0.0000   0.0000000   0.0000000
## best_dir_winyes        0.06972046   0.0000   0.0000000   0.0000000
## top_200_boxyes         0.05005798   0.0000   0.0000000   0.0000000
## BF                             NA   1.0000   0.8702806   0.2217602
## PostProbs                      NA   0.1383   0.1203000   0.0309000
## R2                             NA   0.7525   0.7549000   0.7539000
## dim                            NA   3.0000   4.0000000   4.0000000
## logmarg                        NA 443.9495 443.8105657 442.4433468
##                         model 4     model 5
## Intercept             1.0000000   1.0000000
## feature_filmyes       0.0000000   0.0000000
## dramayes              0.0000000   0.0000000
## runtime               0.0000000   1.0000000
## mpaa_rating_Ryes      0.0000000   1.0000000
## thtr_rel_year         0.0000000   0.0000000
## oscar_seasonyes       0.0000000   0.0000000
## summer_seasonyes      0.0000000   0.0000000
## imdb_rating           1.0000000   1.0000000
## imdb_num_votes        0.0000000   0.0000000
## critics_score         1.0000000   1.0000000
## best_pic_nomyes       0.0000000   0.0000000
## best_pic_winyes       0.0000000   0.0000000
## best_actor_winyes     1.0000000   0.0000000
## best_actress_winyes   0.0000000   0.0000000
## best_dir_winyes       0.0000000   0.0000000
## top_200_boxyes        0.0000000   0.0000000
## BF                    0.2236679   0.2055844
## PostProbs             0.0303000   0.0287000
## R2                    0.7539000   0.7563000
## dim                   4.0000000   5.0000000
## logmarg             442.4519125 442.3676066

I would be plotting the residuals

plot(fit1, which = 1)

This does not show a random scattering of data points. There is a bit of a wave pattern which indicates that the model predicts skewed rating for ratings under 30

plot(fit1, which = 2)

image(fit1)

The model with the highest probablity contains only 2 parameters imdb_rating and critics_score

Checking for Normal Distribution of posterior probabilities

diagnostics(fit1, type = "model", pch = 5)

This shows that the posterior model probabilities follow a normal distribution.

Part 5: Prediction

In this section, I would be using the model I built to predict the audience score of a movie which is not part of the data set and is a movie which released in 2016

The link for the data set is given below

http://www.imdb.com/title/tt1431045/?ref_=nv_sr_2

https://www.rottentomatoes.com/m/deadpool/

The movie is deadpool which was released in 2016

I would be first creating a dataframe with the parameters that I would be needing

deadpool <- data.frame(runtime=108,
                         thtr_rel_year=2016,
                         imdb_rating=8.1,
                         imdb_num_votes=500049,
                         critics_score=84,
                         audience_score=0,
                         best_pic_nom=factor("no", levels=c("no", "yes")),
                         best_pic_win=factor("no", levels=c("no", "yes")),
                         best_actor_win=factor("no", levels=c("no", "yes")),
                         best_actress_win=factor("no", levels=c("no", "yes")),
                         best_dir_win=factor("no", levels=c("no", "yes")),
                         top200_box=factor("no", levels=c("no", "yes")),
                         feature_film=factor("yes", levels=c("no", "yes")),
                         drama=factor("no", levels=c("no", "yes")),
                         mpaa_rating_R=factor("yes", levels=c("no", "yes")),
                         oscar_season=factor("no", levels=c("no", "yes")),
                         summer_season=factor("no", levels=c("no", "yes")),
                         top_200_box = factor("no", levels = c("no", "yes")))

I would now use the model that we developed “fit1” to predict the audience score of Deadpool and check what the variance with the actual is.

While prediccting the score, I would be using the estimator as “BMA” which is the bayesian

pred <- predict(fit1, newdata = deadpool, estimator = "BMA")
pred$Ybma

##          [,1]
## [1,] 87.90591

The predicted value is 88 which is close to the actual value of 90

Part 6: Conclusion

Overall this model has performed well for the movie Deadpool.

However, most of the parameters that have been used are data points that would be available after a movies has released and running.

If I was a producer, who wanted to know what is the kind of movie to be made so as to maximise profits (Assuming high audience score relates to higher profits), this analysis and model might not be very helpful.

Ideally I would start with parameters which can be set while designing the movie like runtime, genre, rating