Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data


load("movies.Rdata")

Part 1: Data

Generalizability: The data set is comprised of 651 randomly sampled movies produced and released before 2016. Because the sample was drawn at random rather than from specific groups, the results should generalize to comparable movies from that period. Random sampling matters here: a non-random sample would make the data biased and unreliable. (Separately, during modeling we should watch for collinearity among predictors and favor parsimonious models.)

Causality: This is an observational study with no random assignment, so we cannot draw causal conclusions. The variables in this dataset are associated with, but are not necessarily the actual reasons for, how audiences score movies in the IMDb system. I analyze the correlations between these variables in the EDA below.


Part 2: Data manipulation

Create new variables

# create feature_film
movies <- mutate(movies, feature_film = factor(ifelse(title_type == 'Feature Film', 'yes', 'no')))
# create drama
movies <- mutate(movies, drama = factor(ifelse(genre =='Drama', 'yes', 'no')))
# create mpaa_rating_R
movies <- mutate(movies, mpaa_rating_R = factor(ifelse(mpaa_rating == 'R', 'yes', 'no')))
# create oscar_season
movies <- mutate(movies, oscar_season = factor(ifelse(thtr_rel_month >= 10, 'yes', 'no')))
# create summer_season
movies <- mutate(movies, summer_season = factor(ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no')))
head(movies[c("title_type", "feature_film", "genre", "drama", "mpaa_rating", "mpaa_rating_R")])
## # A tibble: 6 x 6
##     title_type feature_film       genre  drama mpaa_rating mpaa_rating_R
##         <fctr>       <fctr>      <fctr> <fctr>      <fctr>        <fctr>
## 1 Feature Film          yes       Drama    yes           R           yes
## 2 Feature Film          yes       Drama    yes       PG-13            no
## 3 Feature Film          yes      Comedy     no           R           yes
## 4 Feature Film          yes       Drama    yes          PG            no
## 5 Feature Film          yes      Horror     no           R           yes
## 6  Documentary           no Documentary     no     Unrated            no
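
As a quick sanity check beyond the head() call above (an addition, not in the original analysis), the derived season flags can be cross-tabulated against the release month to confirm the cutoffs:

# Verify the recoding: oscar_season should be 'yes' only for months
# 10-12, and summer_season only for months 5-8.
table(movies$oscar_season, movies$thtr_rel_month)
table(movies$summer_season, movies$thtr_rel_month)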

Part 3: Exploratory data analysis

“I want to build a model that predicts the IMDb rating of a 2016 movie. The outcome of this analysis will therefore be the set of variables that can be used for that prediction.”

Thus, I will try to identify four or five variables that can be used for the prediction. These variables should be roughly independent of one another and highly correlated with ‘imdb_rating’. I’ll use the methods I’ve learned in this class, including ‘ggpairs’ from the ‘GGally’ package to survey the associations within the dataset, backward selection to pick the most appropriate variables, and linear modeling to predict the 2016 IMDb rating from the related variables.
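
As a quick sketch of the ‘ggpairs’ overview mentioned above (not run in the original analysis; assumes the GGally package is installed):

library(GGally)
# Overview of pairwise associations among the candidate numeric
# predictors and the response.
ggpairs(movies[c("runtime", "imdb_rating", "imdb_num_votes",
                 "critics_score", "audience_score")])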

Since I have added five new variables, I will use boxplots and Bayesian inference to examine the relationship between each of those variables and audience_score.

Test_Movies <- movies[c("feature_film", "drama", "mpaa_rating_R", "oscar_season", "summer_season", "audience_score")]
summary(Test_Movies)
##  feature_film drama     mpaa_rating_R oscar_season summer_season
##  no : 60      no :346   no :322       no :460      no :443      
##  yes:591      yes:305   yes:329       yes:191      yes:208      
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  audience_score 
##  Min.   :11.00  
##  1st Qu.:46.00  
##  Median :65.00  
##  Mean   :62.36  
##  3rd Qu.:80.00  
##  Max.   :97.00

Drama (no: 346, yes: 305) and mpaa_rating_R (no: 322, yes: 329) each split the movies roughly in half, so I suspect these two variables have a weak relationship with audience score. The boxplots below will help check this assumption.

Audience Score Distribution by Feature Film

ggplot(Test_Movies, aes(factor(feature_film), audience_score, fill=feature_film)) + 
    geom_boxplot() +
    ggtitle('Audience Score Distribution by Feature Film') +
    xlab('feature_film') +
    ylab('audience_score')

Feature film status shows a clear impact on audience score: the ‘no’ group (non-feature films) has noticeably higher scores.

Hypothesis test on the mean audience score by feature_film. H1: mu_no = mu_yes; H2: mu_no != mu_yes

bayes_inference(y = audience_score, x =feature_film, data = Test_Movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 4.221452e+13
## P(H1|data) = 0 
## P(H2|data) = 1


The Bayes factor for H2 against H1 is enormous (about 4.2e+13), which is very strong evidence that feature film status is associated with audience score.
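
To complement the hypothesis test, the size of this difference could also be estimated. A sketch using the same statsr::bayes_inference interface with type = "ci" (not run in the original analysis):

# Posterior credible interval for the difference in mean audience
# score between non-feature and feature films.
bayes_inference(y = audience_score, x = feature_film, data = Test_Movies,
                statistic = "mean", type = "ci")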

Audience Score Distribution by Drama

ggplot(Test_Movies, aes(factor(drama), audience_score, fill=drama)) + 
    geom_boxplot() +
    ggtitle('Audience Score Distribution by Drama') +
    xlab('Drama') +
    ylab('audience_score')

There is no obvious impact of drama on audience score: the distributions for ‘yes’ and ‘no’ look similar.

Hypothesis test on the mean audience score by drama. H1: mu_no = mu_yes; H2: mu_no != mu_yes

bayes_inference(y = audience_score, x =drama, data = Test_Movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 346, y_bar_no = 59.7312, s_no = 21.2775
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 22.6567
## P(H1|data) = 0.0423 
## P(H2|data) = 0.9577


The Bayes factor of about 22.7 for H2 against H1 is positive evidence of a difference in means, so drama can stay in the pool of candidate predictors during model selection.

Audience Score Distribution by MPAA Rating (R)

ggplot(Test_Movies, aes(factor(mpaa_rating_R), audience_score, fill=mpaa_rating_R)) + 
    geom_boxplot() +
    ggtitle('Audience Score Distribution by mpaa_rating_R') +
    xlab('mpaa_rating_R') +
    ylab('audience_score')

There is no obvious impact of the MPAA R rating on audience score: the distributions for ‘yes’ and ‘no’ look nearly identical.

Hypothesis test on the mean audience score by mpaa_rating_R. H1: mu_no = mu_yes; H2: mu_no != mu_yes

bayes_inference(y = audience_score, x =mpaa_rating_R, data = Test_Movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 322, y_bar_no = 62.6894, s_no = 20.3167
## n_yes = 329, y_bar_yes = 62.0426, s_yes = 20.1559
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 23.9679
## P(H1|data) = 0.9599 
## P(H2|data) = 0.0401


Here the Bayes factor favors the null hypothesis: BF[H1:H2] ≈ 24 is positive evidence that the mean audience score does not differ by MPAA R rating. So I can remove mpaa_rating_R from consideration for the model.

Audience Score Distribution by Oscar Season

ggplot(Test_Movies, aes(factor(oscar_season), audience_score, fill=oscar_season)) + 
    geom_boxplot() +
    ggtitle('Audience Score Distribution by oscar_season') +
    xlab('oscar_season') +
    ylab('audience_score')

There is no obvious impact of Oscar season on audience score: the distributions for ‘yes’ and ‘no’ look similar.

Hypothesis test on the mean audience score by oscar_season. H1: mu_no = mu_yes; H2: mu_no != mu_yes

bayes_inference(y = audience_score, x =oscar_season, data = Test_Movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 191, y_bar_yes = 63.6859, s_yes = 20.4612
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 13.3993
## P(H1|data) = 0.9306 
## P(H2|data) = 0.0694


Again the Bayes factor favors the null (BF[H1:H2] ≈ 13.4), positive evidence that Oscar season makes no meaningful difference to audience score, so I can remove oscar_season from consideration for the model.

Audience Score Distribution by Summer Season

ggplot(Test_Movies, aes(factor(summer_season), audience_score, fill=summer_season)) + 
    geom_boxplot() +
    ggtitle('Audience Score Distribution by summer_season') +
    xlab('summer_season') +
    ylab('audience_score')

There is no obvious impact of summer season on audience score: the distributions for ‘yes’ and ‘no’ look similar.

Hypothesis test on the mean audience score by summer_season. H1: mu_no = mu_yes; H2: mu_no != mu_yes

bayes_inference(y = audience_score, x =summer_season, data = Test_Movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 443, y_bar_no = 62.623, s_no = 20.3857
## n_yes = 208, y_bar_yes = 61.8077, s_yes = 19.9083
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 21.7118
## P(H1|data) = 0.956 
## P(H2|data) = 0.044


Once more the Bayes factor favors the null (BF[H1:H2] ≈ 21.7), positive evidence that summer season makes no meaningful difference to audience score, so I can remove summer_season as well.


Part 4: Modeling

Next we model the audience score. Since we are building a predictive model, we’ll use the BIC prior for the coefficients together with a uniform prior over models.

# Drop rows with missing values, then fit Bayesian model averaging
# over all candidate predictors with the BIC prior.
movies_no_na = na.omit(movies)
bma_audience_score = bas.lm(audience_score ~ feature_film + drama + runtime +
                                thtr_rel_year + imdb_rating + imdb_num_votes +
                                critics_score + best_pic_nom + best_pic_win +
                                best_actor_win + best_actress_win +
                                best_dir_win + top200_box,
                            data = movies_no_na, prior = "BIC",
                            modelprior = uniform())
bma_audience_score
## 
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime +     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +     best_pic_nom + best_pic_win + best_actor_win + best_actress_win +     best_dir_win + top200_box, data = movies_no_na, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06047              0.04375  
##             runtime        thtr_rel_year          imdb_rating  
##             0.52045              0.08059              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06213              0.92585              0.13003  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04074              0.11687              0.14649  
##     best_dir_winyes        top200_boxyes  
##             0.06776              0.04959
summary(bma_audience_score)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06046778     0.0000     0.0000000     0.0000000
## dramayes               0.04374962     0.0000     0.0000000     0.0000000
## runtime                0.52044741     1.0000     0.0000000     0.0000000
## thtr_rel_year          0.08059047     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06213435     0.0000     0.0000000     0.0000000
## critics_score          0.92585282     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13002756     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04073743     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11686578     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14648702     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06776343     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04958820     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.2173     0.1894000     0.0445000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         1.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     1.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.1851908     0.1745817
## PostProbs               0.0402000     0.0379000
## R2                      0.7495000     0.7469000
## dim                     5.0000000     4.0000000
## logmarg             -3436.4383237 -3436.4973172


I’ll choose the predictors with a marginal posterior inclusion probability above 0.5 (runtime, imdb_rating, and critics_score) to build the model; a programmatic version of this rule is sketched below.
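
As a sketch of that threshold rule in code (probne0 and namesx are fields of the fitted BAS object):

# Marginal posterior inclusion probabilities by predictor name; keep
# those above 0.5 (the intercept is always included).
pip <- setNames(bma_audience_score$probne0, bma_audience_score$namesx)
pip[pip > 0.5]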

Then I’ll refit using the Zellner-Siow Cauchy prior to see whether it selects a different model.

# Refit with the Zellner-Siow null prior; initprobs = "eplogp" sets the
# initial sampling probabilities.
movies_ZS = bas.lm(audience_score ~ feature_film + drama + runtime +
                       thtr_rel_year + imdb_rating + imdb_num_votes +
                       critics_score + best_pic_nom + best_pic_win +
                       best_actor_win + best_actress_win + best_dir_win +
                       top200_box,
                   data = movies_no_na, prior = "ZS-null",
                   modelprior = uniform(), initprobs = "eplogp")
plot(movies_ZS, ask = F)

From these plots, I can confirm that a model built from runtime, imdb_rating, and critics_score is the best choice, as shown below:

plot(movies_ZS, which = 4, ask=FALSE, caption="", sub.caption="")

I can also visualize the model space:

image(movies_ZS, rotate=F)

Posterior Distributions of Coefficients

coef.ZS = coef(movies_ZS)
plot(coef.ZS,  ask=F)
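
Credible intervals for the coefficients can also be printed directly, as a small sketch using the confint method that BAS provides for coefficient objects:

# 95% posterior credible intervals for the coefficients under BMA.
confint(coef.ZS)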

Now we can build the final model to do the prediction:

model_movies <- movies[c("runtime","imdb_rating","critics_score","audience_score")]
colSums(is.na(model_movies))
##        runtime    imdb_rating  critics_score audience_score 
##              1              0              0              0
model_movies <- na.omit(model_movies)

model_movies[c("runtime", "imdb_rating", "critics_score")] <- lapply(model_movies[c("runtime", "imdb_rating", "critics_score")], function(x) {log(1 + x)})
bma_movies_final = bas.lm(audience_score ~ ., data = model_movies,
                   prior = "ZS-null", 
                   method = "MCMC",
                   modelprior = uniform())

summary(bma_movies_final)
##               P(B != 0 | Y)  model 1     model 2     model 3      model 4
## Intercept           1.00000   1.0000   1.0000000   1.0000000   1.00000000
## runtime             0.10000   0.0000   1.0000000   0.0000000   1.00000000
## imdb_rating         0.99375   1.0000   1.0000000   1.0000000   1.00000000
## critics_score       0.05000   0.0000   0.0000000   1.0000000   1.00000000
## BF                       NA   1.0000   0.2037049   0.1099445   0.02536395
## PostProbs                NA   0.8688   0.0750000   0.0312000   0.01880000
## R2                       NA   0.6830   0.6848000   0.6842000   0.68590000
## dim                      NA   2.0000   3.0000000   3.0000000   4.00000000
## logmarg                  NA 367.4689 365.8778050 365.2611087 363.79446186
##                     model 5
## Intercept      1.000000e+00
## runtime        1.000000e+00
## imdb_rating    0.000000e+00
## critics_score  0.000000e+00
## BF            2.406588e-158
## PostProbs      6.200000e-03
## R2             2.670000e-02
## dim            2.000000e+00
## logmarg        4.538654e+00
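
Since the final model was fit with method = "MCMC", it is worth checking convergence. A sketch using BAS’s diagnostics plot (not part of the original analysis):

# Compare MCMC and renormalized estimates of the posterior inclusion
# probabilities; points near the 45-degree line suggest convergence.
diagnostics(bma_movies_final, type = "pip")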

Part 5: Prediction

We use only runtime, imdb_rating, and critics_score for the prediction. Test movie: Hidden Figures (runtime: 127, imdb_rating: 7.8, critics_score: 93, actual audience_score: 93).

Hidden <- data.frame(runtime = 127, imdb_rating = 7.8, critics_score = 93,
                     audience_score = 93)
# Log transformation to match the scale used when fitting the model.
# Note: audience_score is transformed here as well, which is inconsistent
# with model_movies (the response was left untransformed there); this is
# harmless for the prediction, since audience_score is not a predictor.
Hidden[c("runtime", "imdb_rating", "critics_score", "audience_score")] <-
    lapply(Hidden[c("runtime", "imdb_rating", "critics_score",
                    "audience_score")], function(x) log(1 + x))

# Append the new observation and pull it back out as a one-row tibble.
model_movies1 <- rbind(model_movies, Hidden)
Hidden <- tail(model_movies1, 1)
str(Hidden)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  4 variables:
##  $ runtime       : num 4.85
##  $ imdb_rating   : num 2.17
##  $ critics_score : num 4.54
##  $ audience_score: num 4.54
Hidden.pred <- predict(bma_movies_final, newdata=Hidden, estimator="BMA", se.fit=TRUE, interval="predict")

Hidden.pred$Ybma
##          [,1]
## [1,] 80.01584
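
Because the prediction was made with se.fit = TRUE and interval = "predict", a 95% prediction interval should also be available; a sketch using BAS’s confint method for prediction objects (behavior may vary by BAS version):

# 95% prediction interval around the BMA point prediction.
confint(Hidden.pred, parm = "pred")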

Part 6: Conclusion

The final prediction (about 80) falls short of the actual audience score of 93, so I should keep looking for a better model. Still, I learned a lot from this assignment.

  1. We cannot work through a dataset purely on personal intuition; we need to stay objective and examine all of the data. Some variables that seem unrelated to the response turn out to matter, and vice versa. For example, I expected a relationship between release month and rating, reasoning that movies released during summer vacation are more popular than those in other seasons, but the correlation analysis showed little difference between them.

  2. We should be careful when dealing with outliers, and some variables are not appropriate to put into the model; that is the main weakness of my prediction model. Variables such as ‘best_actor_win’, which has only one winner per year and takes only yes/no values, contribute little to the outcome and waste effort to record for prediction. The purpose of the model should determine which variables to include or drop.

  3. We need to double-check whether the assumptions we set at the start are reasonable. At first I put the movie type into my model because it had a high correlation with the IMDb rating, but the model’s R-squared was extremely low, so I looked for the reason. I realized I had overlooked the large amount of “Unrated” information in that variable, so I removed the unrated data, built a new model, and compared it with the original to check whether removing those data improved the fit. To tell the truth, I still think movie type could influence the prediction, but the data we collected can only tell which kinds of movies tend to earn higher ratings, not predict which type should earn a higher rating.

  4. Back to the research question: some variables can predict the IMDb rating, such as the audience and critics scores from Rotten Tomatoes. On the other hand, critics’ scores did not matter as much as audience scores. In my opinion, critics give more even-handed scores than audiences, since they need to be fair to all filmmakers, whereas audiences readily show their preferences. In addition, critics’ scores were collected from only around 50 critics, while audience scores came from thousands of people. That may be why ‘audience_score’ and ‘audience_rating’ carry more influence in my model.

  5. To sum up, this project shows that even though IMDb and Rotten Tomatoes collect their data through different channels and methods, they still produce highly correlated ratings for each movie, which suggests that their records have great credibility.