Bayesian modeling and prediction for movies

Setup
- Load packages
- Load data
Part 1: Data
Part 2: Data manipulation
Part 3: Exploratory data analysis
Part 4: Modeling
Part 5: Prediction
Part 6: Conclusion

student 1 的工作还不错！

The purpose of these report is to find out what attributes make a movie popular and other interesting things(It is observational study rather than a causality study). Of course, the process of doing data analysis with the tool of Bayesian regression is very important for me. And I will do the analysis using the lab 4 of this course and the project for the second course project of this specialization as references.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016, so random sampling was used for generalizability. Random assignment was not used which means this study is a observational study rather than experimental study, so it cannot be used for causality inference. Let’s see the basics of the data set by:

str(movies)
head(movies)

There are three types of variables in the dataset, which are numeric, factor and character. To build a prediction model, we need to convert charater variables into factor variables.

Part 2: Data manipulation

Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no
Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no
Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no (2 pt)
Create two new variables based on thtr_rel_month:
- New variable called oscar_season with levels yes (if movie is released in November, October, or December) and no
- New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no

feature_film_ind <- movies$title_type == "Feature Film"
feature_film <- factor(levels = c("yes", "no"))
feature_film[feature_film_ind] <- "yes"
feature_film[!feature_film_ind] <- "no"


drama_ind <- movies$genre == "Drama"
drama <- factor(levels = c("yes", "no"))
drama[drama_ind] <- "yes"
drama[!drama_ind] <- "no"

mpaa_rating_R_ind <- movies$mpaa_rating == "R"
mpaa_rating_R <- factor(levels = c("yes", "no"))
mpaa_rating_R[mpaa_rating_R_ind] <- "yes"
mpaa_rating_R[!mpaa_rating_R_ind] <- "no"

oscar_season_ind <- movies$thtr_rel_month >= 10
oscar_season <- factor(levels = c("yes", "no"))
oscar_season[oscar_season_ind] <- "yes"
oscar_season[!oscar_season_ind] <- "no"

summer_season_ind <- movies$thtr_rel_month >= 5 & movies$thtr_rel_month <= 8 
summer_season <- factor(levels = c("yes", "no"))
summer_season[summer_season_ind] <- "yes"
summer_season[!summer_season_ind] <- "no"

特别好的代码， The best way to do the job.

movies  = mutate(movies,
       feature_film = ifelse(title_type == 'Feature Film', "yes", "no"),
       drama = ifelse(genre == 'Drama', "yes", "no"),
       mpaa_rating_R = ifelse(mpaa_rating == 'R', "yes", "no"),
       oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"),
       summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))

Part 3: Exploratory data analysis

the relationship between audience score and whether it’s a feature movie.

ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = feature_film)) + geom_boxplot()

It show that feature film have relatively lower audience score.

The numbers of “yes” and “no” are 591, 60.

We can do a bayesian hypothesis testing that whether there differents in the audience scores between feature and non-feature films.

bayes_inference(y = audience_score, x = feature_film, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 4.221452e+13
## P(H1|data) = 0 
## P(H2|data) = 1

From the results, we can see that there are evidence that difference existed between those two groups.

For the other variables, we can do the exaclty the same testing.

the relationship between audience score and the genre of the movie.

ggplot(data = movies, aes(x = drama, y = audience_score, fill = drama)) + geom_boxplot()

bayes_inference(y = audience_score, x = drama, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

It shows that drama have a slightly lower audience score than other movies.

The relationship between auddience score and rating.

ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()

bayes_inference(y = audience_score, x = mpaa_rating_R, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

It shows no significant difference of audience score between the rating R films and others.

The relationship between the auddience score and the time of the movie released

ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + geom_boxplot()

ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = summer_season)) + geom_boxplot()

bayes_inference(y = audience_score, x = oscar_season, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

bayes_inference(y = audience_score, x = summer_season, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")

As we can see that the movie in the oscar season have slightly higher audience score than those who do not, while the movies in the summer season have slightly lower audience score.

Part 4: Modeling

Model selection

4 ways to conducted model selection:

Baysian model selection is conduccted by minimize the BIA or AIC（Bayes model selection 标准）
AIC model selection
pick best predictive model
include costs associated with using model

当模型的BIC相差不是特别大的时候，再按照那些规则选择并不是很好。

Then certain situations appeared, we can’t use those criteria to select models, and intead, we use Bayesian model average. image( fit, rotate = F) will help a lot to understand the models.

attach(movies)
newdata <- data.frame(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
detach(movies)

fit <- bas.lm(audience_score ~ ., data = na.omit(newdata), prior = "BIC", modelprior = uniform())
summary(fit)

##      Intercept feature_filmno dramano runtime mpaa_rating_Rno
## [1,]         1              0       0       1               0
## [2,]         1              0       0       0               0
## [3,]         1              0       0       0               0
## [4,]         1              0       0       0               1
## [5,]         1              0       0       1               1
##      thtr_rel_year oscar_seasonno summer_seasonno imdb_rating
## [1,]             0              0               0           1
## [2,]             0              0               0           1
## [3,]             0              0               0           1
## [4,]             0              0               0           1
## [5,]             0              0               0           1
##      imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,]              0             1               0               0
## [2,]              0             1               0               0
## [3,]              0             1               0               0
## [4,]              0             1               0               0
## [5,]              0             1               0               0
##      best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,]                 0                   0               0             0
## [2,]                 0                   0               0             0
## [3,]                 1                   0               0             0
## [4,]                 0                   0               0             0
## [5,]                 0                   0               0             0
##             BF PostProbs     R2 dim   logmarg
## [1,] 1.0000000    0.1297 0.7549   4 -3615.279
## [2,] 0.9968489    0.1293 0.7525   3 -3615.282
## [3,] 0.2543185    0.0330 0.7539   4 -3616.648
## [4,] 0.2521327    0.0327 0.7539   4 -3616.657
## [5,] 0.2391994    0.0310 0.7563   5 -3616.710

image( fit, rotate = F)

Printing the model object and the summary command gives us both the posterior model inclusion probability for each variable and the most probable models. Further, posterior probability of models are given. In this case the model with the highest posterior prob is the model which includes imdb_rating and critics_score as explanatory variables.

The naive model with all variables included has posterior probability greater than 0.5.

fit <- bas.lm(audience_score ~ ., data = newdata, prior = "ZS-null", modelprior = uniform())

## Warning in bas.lm(audience_score ~ ., data = newdata, prior = "ZS-null", :
## dropping 1 rows due to missing data

summary(fit)

##      Intercept feature_filmno dramano runtime mpaa_rating_Rno
## [1,]         1              0       0       0               0
## [2,]         1              0       0       1               0
## [3,]         1              0       0       0               0
## [4,]         1              0       0       0               1
## [5,]         1              0       0       1               1
##      thtr_rel_year oscar_seasonno summer_seasonno imdb_rating
## [1,]             0              0               0           1
## [2,]             0              0               0           1
## [3,]             0              0               0           1
## [4,]             0              0               0           1
## [5,]             0              0               0           1
##      imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,]              0             1               0               0
## [2,]              0             1               0               0
## [3,]              0             1               0               0
## [4,]              0             1               0               0
## [5,]              0             1               0               0
##      best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,]                 0                   0               0             0
## [2,]                 0                   0               0             0
## [3,]                 1                   0               0             0
## [4,]                 0                   0               0             0
## [5,]                 0                   0               0             0
##             BF PostProbs     R2 dim  logmarg
## [1,] 1.0000000    0.1388 0.7525   3 443.9495
## [2,] 0.8702806    0.1208 0.7549   4 443.8106
## [3,] 0.2236679    0.0311 0.7539   4 442.4519
## [4,] 0.2217602    0.0308 0.7539   4 442.4433
## [5,] 0.2055844    0.0285 0.7563   5 442.3676

So the naive model includes imdb_rating and critics_score as explanatory variables.

Model diagnostic

fit1 <- bas.lm(audience_score ~ . , data = newdata)

## Warning in bas.lm(audience_score ~ ., data = newdata): dropping 1 rows due
## to missing data

plot(fit1, which=1)

it appears that there may be some outliers in the data - observations 126, 216 and 251 have been flagged as the points with the three largest absolute residuals. We need to check out whether these points are outliers or not.

Coefficents

coef <- coefficients(fit1)
confint(coef)

##                          2.5  %    97.5  %          beta
## Intercept           61.55739623 63.1459557  6.234769e+01
## feature_filmno       0.00000000  0.0000000  3.538671e-02
## dramano              0.00000000  0.0000000 -1.933600e-03
## runtime             -0.06620746  0.0000000 -8.805543e-03
## mpaa_rating_Rno      0.00000000  0.0000000  7.490287e-02
## thtr_rel_year        0.00000000  0.0000000 -9.038291e-04
## oscar_seasonno       0.00000000  0.0000000  2.253495e-02
## summer_seasonno      0.00000000  0.0000000 -2.250274e-02
## imdb_rating         13.80491172 16.8421667  1.532291e+01
## imdb_num_votes       0.00000000  0.0000000  2.649436e-08
## critics_score        0.00000000  0.1028322  4.155546e-02
## best_pic_nomyes      0.00000000  0.0000000  7.996231e-02
## best_pic_winyes      0.00000000  0.0000000 -2.294183e-03
## best_actor_winyes    0.00000000  0.0000000 -8.273357e-02
## best_actress_winyes  0.00000000  0.0000000 -7.940600e-02
## best_dir_winyes      0.00000000  0.0000000 -2.971645e-02
## top200_boxyes        0.00000000  0.0000000  1.714854e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

As we can see, the posterior mean of intercept, imdb_rating and critics_score are 62.26, 14.72 and 0.07, and their 95% credible interval are also shown above.

Part 5: Prediction

We pick the movie the accountant https://www.rottentomatoes.com/m/the_accountant_2016. The audience score of this movie is 84 and critics score is 52. The imdb rating is 7.8 http://www.imdb.com/title/tt2140479/?ref_=nv_sr_1

Simulation is used in BAS to construct predictive intervals with Bayesian Model averaging, while exact inference is often possible with predictive intervals under model selection.

A 95% credible interval for predicting audience score can be obtained by

newdata = data.frame(imdb_rating = 7.8, critics_score = 52, audience_score = 84)
fit <- bas.lm(audience_score ~ imdb_rating + critics_score , data = movies)
predict_fit <- predict(fit, newdata = newdata, interval = "confidence")
predict_fit$fit

##          [,1]
## [1,] 81.20801

sd(predict_fit$Ypred)

## [1] 12.39544

This is just a so-so job….

Honestly, I guess it is very hard to understand course material of this part. I tried very hard to do it, and found that I need to read more materials other than vedios in the course to be able to understand completely. And I look forward to see better projects from my classmates, so I can learn from them.

Part 6: Conclusion

The purpose of these report is to find out what attributes make a movie popular and other interesting things, and it turns out the critics score and the imdb rating play the most important variables.

I choose the model with predictor with high including probability. This is model is very simple but it may be not be very precise.