Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

library(statsr)
library(BAS)

## Warning: package 'BAS' was built under R version 3.3.3

library(MASS)

Load data

load("movies.Rdata")

Part 1: Data

1- This study is observational as is it conducted after the data collection and no experiment was design to control variables or test some parameters prior to the data collection.

2- Additionally, as described in the codebook of this dataset, the sample was built using random sampling of the movies produced and released before 2016. This makes the study generalizable to the entire population of movies the sample was drawn from. Since this is an generalizable observational study, we will only be able to conclude correlations between the population’s variables but no causal inference can be made since there was no experiment performed.

3- Potential source of bias: Since there is no mean for us to know if a voter is allowed to rate the same movie several time, hence influencing the rating parameter of the movies, this can be considered a source of potential bias to the study.

Part 2: Data manipulation

Let’s create few variables as required before proceeding with the data exploration.

# New variable title_type
movies <- movies %>% mutate(feature_film = ifelse(grepl("feature", tolower(title_type)), "yes", "no"))

## Warning: package 'bindrcpp' was built under R version 3.3.3

# New variable drama

movies <- movies %>% mutate(drama = ifelse(grepl("drama", tolower(genre)), "yes", "no"))

# New variable mpaa_rating_R

movies <- movies %>% mutate(mpaa_rating_R = ifelse(grepl("r", tolower(mpaa_rating)), "yes", "no"))

# Two new variables oscar_season & summer_season

movies <- movies %>% mutate(oscar_season = factor(thtr_rel_month %in% c(10, 11, 12), labels=c("no", "yes")))

movies <- movies %>% mutate(summer_season = factor(thtr_rel_month %in% c(5, 6, 7, 8), labels=c("no", "yes")))

Part 3: Exploratory data analysis

Let’s explore this dataset for the relationship between audience_score and the five (05) variables previously created.

Audience score summary statistics by feature_film status

movies %>% group_by(feature_film) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   feature_film Count      avg Median   IQR Total_score
##          <chr> <int>    <dbl>  <dbl> <dbl>       <dbl>
## 1           no    60 81.05000   85.5  12.5        4863
## 2          yes   591 60.46531   62.0  33.5       35735

ggplot(data = movies, aes(x=feature_film, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=feature_film, y=audience_score))+geom_boxplot(fill="blue")

Summary statistics show that feature films score on average lower than non-feature films. Thi is confirmed by the bar plot showing the total audience score per feature status. Additionally, the box plot indicates consistence of the middle 50% of the audience in scoring on non-features movies as the Inter-quartile indicates 12.5 points variation.

Audience score summary statistics by drama status

movies %>% group_by(drama) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   drama Count      avg Median   IQR Total_score
##   <chr> <int>    <dbl>  <dbl> <dbl>       <dbl>
## 1    no   346 59.73121     61    38       20667
## 2   yes   305 65.34754     70    28       19931

ggplot(data = movies, aes(x=drama, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=drama, y=audience_score))+geom_boxplot(fill="blue")

The drama variable shows consistent scoring with a very slight difference between drama movies and others. This could indicate that the variable drama may not influence the scoring of the audience. This will be investigated further.

Audience score summary statistics by mpaa_rating_R status

movies %>% group_by(mpaa_rating_R) %>% summarise(Count = n() ,avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   mpaa_rating_R Count      avg Median   IQR Total_score
##           <chr> <int>    <dbl>  <dbl> <dbl>       <dbl>
## 1            no   272 59.78676     61    31       16262
## 2           yes   379 64.21108     69    34       24336

ggplot(data = movies, aes(x=mpaa_rating_R, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=mpaa_rating_R, y=audience_score))+geom_boxplot(fill="blue")

The Audience score appaears to be slightly higher for the films with an mpaa_rating of value R.This seems to indicate that the mpaa_rating may slightly influence the audience scoring value. Howerver IQRs seems to indicate quite a lot of variablility within ofaudience scoring withing each mpaa_rating category.

Audience score summary statistics by oscar_season status

movies %>% group_by(oscar_season) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   oscar_season Count      avg Median   IQR Total_score
##         <fctr> <int>    <dbl>  <dbl> <dbl>       <dbl>
## 1           no   460 61.81304     64  33.0       28434
## 2          yes   191 63.68586     69  33.5       12164

ggplot(data = movies, aes(x=oscar_season, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=oscar_season, y=audience_score))+geom_boxplot(fill="blue")

The mean and median audience score by oscar_season àre not too different when the film was released in theater during an oscar season or not.

Audience score summary statistics by summer_season status

movies %>% group_by(summer_season) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   summer_season Count      avg Median   IQR Total_score
##          <fctr> <int>    <dbl>  <dbl> <dbl>       <dbl>
## 1            no   443 62.62302     66 34.00       27742
## 2           yes   208 61.80769     65 33.25       12856

ggplot(data = movies, aes(x=summer_season, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=summer_season, y=audience_score))+geom_boxplot(fill="blue")

The summer_season variable seems to indicate the same trend as for the oscar_season.

Part 4: Modeling

Model selection

We will first fit the multi-linear model and perform a BIC based model selection.

movies_lm <- lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data =na.omit(movies))

n<- nrow(na.omit(movies))

movies_BIC <- stepAIC(movies_lm, direction = "backward", trace = 0, k=log(n))

summary(movies_BIC)

## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = na.omit(movies))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.996  -6.706   0.738   5.435  52.550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -32.90825    3.35451  -9.810  < 2e-16 ***
## runtime        -0.05791    0.02238  -2.588 0.009891 ** 
## imdb_rating    14.95043    0.60249  24.814  < 2e-16 ***
## critics_score   0.07531    0.02227   3.381 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.15 on 615 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.747 
## F-statistic: 609.4 on 3 and 615 DF,  p-value: < 2.2e-16

The model selected based on the BIC considers that runtime, imdb_rating and the critics_score are the most suitable predictors for the the response variable audience_score.

The intercept is not significant as it predicts a penalty of -32.9 points on the audience_score if all predictors are given a value of zero which is not realistic.

The issue here is that we are unable to quantify the level of uncertainty of this model and we might be discarding some variables which are important for this prediction. To ensure we have a look at all the models and quantify the uncertainties associated to each of them.

Using BMA to fit all possible models with their posterior probabilities

movies_bma <- bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())

movies_bma

## 
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime +     mpaa_rating_R + thtr_rel_year + oscar_season + summer_season +     imdb_rating + imdb_num_votes + critics_score + best_pic_nom +     best_pic_win + best_actor_win + best_actress_win + best_dir_win +     top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06121              0.04404  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.51230              0.09620              0.07969  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.06536              0.07927              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06126              0.92333              0.13121  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04076              0.11620              0.14787  
##     best_dir_winyes        top200_boxyes  
##             0.06788              0.04909

summary(movies_bma)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06120788     0.0000     0.0000000     0.0000000
## dramayes               0.04404252     0.0000     0.0000000     0.0000000
## runtime                0.51230392     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.09619567     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.07969499     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.06536044     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.07927283     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06125933     0.0000     0.0000000     0.0000000
## critics_score          0.92332791     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13120975     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04075939     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11620418     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14787055     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06788223     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04908857     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1686     0.1470000     0.0345000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     0.0000000
## mpaa_rating_Ryes        0.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         1.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     1.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.1851908     0.1745817
## PostProbs               0.0312000     0.0294000
## R2                      0.7495000     0.7469000
## dim                     5.0000000     4.0000000
## logmarg             -3436.4383237 -3436.4973172

Let’s plot this BMA to have a visual of eahc model

image(movies_bma, rotate = FALSE)

Looking at the BMA summary executed with BIC as the model selection factor, we can clearly see that model 1 which has the Bayes Factor of 1 and the highest posterior probability of 16.86%, corresponds to the model from the simple BIC analysis. However, we can also see that the model 2 which was silently discarded in the previous BIC model selection has 14.7% probability and a Bayesian factor of 0.87 which is quite high.

The visual shows that the imdb_rating and the critics_score are the main two predictors as they are represented in all the models with a high odds posterior probability log.

The runtime is however present in the first model but looking at the reduction in BIC value or at the increase of the adjusted R-square when adding the runtime in the model, one may say the model 2 is likely fitted as the model 1.

We will stick to the model 1 in which audience_score is predicted by imdb_rating, critics_score and runtime. It is important to point that the runtime is actually predicted to be a penalty factor to the audience score as its correlation corefficient is negative in this case.

par(mfrow = c(1,3))
coef_movies_bma = coefficients(movies_bma)
plot(coef_movies_bma, subset = c(4, 9, 11) ,ask=FALSE)

Part 5: Prediction

Since our objective here is to predict estimates using the Best Predictive Model option withing the predict function for predicting the estimators. Our prior here will be the BIC and the model prior may be uniform as we do not have previous information for believing otherwise.

The “Trolls”, thtr_rel_year:2016, runtime:92 minutes; audience_score: 68; critics_score: 102; All these informations were collected from Rotten Tomatoes website https://www.rottentomatoes.com/m/trolls#audience_reviews

Based on the above, we will run the predict function on the dataframe created with those parameters and compare the result with the observed audience score of the movie which currently 68% from Rottentomatoes website https://www.rottentomatoes.com/m/trolls#audience_reviews.

# Here we fit the best final model

movies_best_model <- bas.lm(formula = audience_score ~ imdb_rating + critics_score + runtime, data = na.omit(movies), na.action = "na.omit", prior = "BIC", modelprior = uniform())

# Here we do prediction

trolls_data <- data.frame(audience_score=68, critics_score=102, runtime=92, imdb_rating=6.5)

trolls_predict <- predict(movies_best_model, newdata = trolls_data, estimator = "BPM", se.fit = TRUE, prediction = TRUE)

confint(trolls_predict)

##          2.5%    97.5%     pred
## [1,] 45.75539 86.01813 65.88676
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The predicted value using the selected model for the Trolls movies (2016) is 65.88676 an which deviates from the observed value of 2.11324 points. But looking at the confidence interval [45.75 - 86.01], the actual value 68 falls largely within this interval. The point estimage of the audience_score under this model is 65.88 ± 20.13

Part 6: Conclusion

We can see from the output of the that the model has 95% probability of predicting the audience score based on the predictors selected. However the margin of error ME=20.13 is quite large yielding so a low precision of the model.

Bayesian modeling and prediction for movies