library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(statsr)
library(BAS)

## Warning: package 'BAS' was built under R version 3.4.4

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

load("~/statistical  with R /week 5 /_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata")

Part 1: Data

1- This study is observational as is it conducted after the data collection and no experiment was design to control variables or test some parameters prior to the data collection.

2- Additionally, as described in the codebook of this dataset, the sample was built using random sampling of the movies produced and released before 2016. This makes the study generalizable to the entire population of movies the sample was drawn from. Since this is an generalizable observational study, we will only be able to conclude correlations between the population’s variables but no causal inference can be made since there was no experiment performed.

3- Potential source of bias: Since there is no mean for us to know if a voter is allowed to rate the same movie several time, hence influencing the rating parameter of the movies, this can be considered a source of potential bias to the study.

Part 2: Data manipulation

Let’s create few variables as required before proceeding with the data exploration.

# New variable title_type
movies <- movies %>% mutate(feature_film = ifelse(grepl("feature", tolower(title_type)), "yes", "no"))

# New variable drama

movies <- movies %>% mutate(drama = ifelse(grepl("drama", tolower(genre)), "yes", "no"))

# New variable mpaa_rating_R

movies <- movies %>% mutate(mpaa_rating_R = ifelse(grepl("r", tolower(mpaa_rating)), "yes", "no"))

# Two new variables oscar_season & summer_season

movies <- movies %>% mutate(oscar_season = factor(thtr_rel_month %in% c(10, 11, 12), labels=c("no", "yes")))

movies <- movies %>% mutate(summer_season = factor(thtr_rel_month %in% c(5, 6, 7, 8), labels=c("no", "yes")))

Part 3: Exploratory data analysis

Let’s explore this dataset for the relationship between audience_score and the five (05) variables previously created.

Audience score summary statistics by feature_film status

movies %>% group_by(feature_film) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   feature_film Count   avg Median   IQR Total_score
##   <chr>        <int> <dbl>  <dbl> <dbl>       <dbl>
## 1 no              60  81.0   85.5  12.5        4863
## 2 yes            591  60.5   62.0  33.5       35735

ggplot(data=movies, aes(x=feature_film, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=feature_film, y=audience_score))+geom_boxplot(fill="blue")

summary statistics show that feature films on average lower than non-feature films. Thi is confirmed by the bar plot showing the total audience score per feature status. Additionally, the box plot indicates consistence of the middle 50% of the audience in scoring on non-features movies as the Inter-quartile indicates 12.5 points variation.

Audience score summary statistics by drama status

movies %>% group_by(drama) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   drama Count   avg Median   IQR Total_score
##   <chr> <int> <dbl>  <dbl> <dbl>       <dbl>
## 1 no      346  59.7   61.0  38.0       20667
## 2 yes     305  65.3   70.0  28.0       19931

ggplot(data = movies, aes(x=mpaa_rating_R, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=mpaa_rating_R, y=audience_score))+geom_boxplot(fill="blue")

The Audience score appaears to be slightly higher for the films with an mpaa_rating of value R.This seems to indicate that the mpaa_rating may slightly influence the audience scoring value. Howerver IQRs seems to indicate quite a lot of variablility within of audience scoring withing each mpaa_rating category.

Audience score summary statistics by oscar_season status

movies %>% group_by(oscar_season) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   oscar_season Count   avg Median   IQR Total_score
##   <fct>        <int> <dbl>  <dbl> <dbl>       <dbl>
## 1 no             460  61.8   64.0  33.0       28434
## 2 yes            191  63.7   69.0  33.5       12164

ggplot(data = movies, aes(x=oscar_season, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=oscar_season, y=audience_score))+geom_boxplot(fill="blue")

The mean and median audience score by oscar_season àre not too different when the film was released in theater during an oscar season or not. Audience score summary statistics by summer_season status

movies %>% group_by(summer_season) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))

## # A tibble: 2 x 6
##   summer_season Count   avg Median   IQR Total_score
##   <fct>         <int> <dbl>  <dbl> <dbl>       <dbl>
## 1 no              443  62.6   66.0  34.0       27742
## 2 yes             208  61.8   65.0  33.2       12856

ggplot(data = movies, aes(x=summer_season, y=audience_score))+geom_col(fill="blue")

ggplot(data = movies, aes(x=summer_season, y=audience_score))+geom_boxplot(fill="blue")

The summer_season variable seems to indicate the same trend as for the oscar_season.

Part 4: Modeling

Model selection We will first fit the multi-linear model and perform a BIC based model selection.

movies_lm <- lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data =na.omit(movies))

n<- nrow(na.omit(movies))

movies_BIC <- stepAIC(movies_lm, direction = "backward", trace = 0, k=log(n))

summary(movies_BIC)

## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = na.omit(movies))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.996  -6.706   0.738   5.435  52.550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -32.90825    3.35451  -9.810  < 2e-16 ***
## runtime        -0.05791    0.02238  -2.588 0.009891 ** 
## imdb_rating    14.95043    0.60249  24.814  < 2e-16 ***
## critics_score   0.07531    0.02227   3.381 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.15 on 615 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.747 
## F-statistic: 609.4 on 3 and 615 DF,  p-value: < 2.2e-16

The model selected based on the BIC considers that runtime, imdb_rating and the critics_score are the most suitable predictors for the the response variable audience_score.

The intercept is not significant as it predicts a penalty of -32.9 points on the audience_score if all predictors are given a value of zero which is not realistic.

The issue here is that we are unable to quantify the level of uncertainty of this model and we might be discarding some variables which are important for this prediction. To ensure we have a look at all the models and quantify the uncertainties associated to each of them. Using BMA to fit all possible models with their posterior probabilities

movies_bma <- bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())

movies_bma

## 
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime + 
##     mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_pic_win + best_actor_win + best_actress_win + best_dir_win + 
##     top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06121              0.04404  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.51230              0.09620              0.07969  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.06536              0.07927              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06126              0.92333              0.13121  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04076              0.11620              0.14787  
##     best_dir_winyes        top200_boxyes  
##             0.06788              0.04909

summary(movies_bma)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06120788     0.0000     0.0000000     0.0000000
## dramayes               0.04404252     0.0000     0.0000000     0.0000000
## runtime                0.51230392     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.09619567     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.07969499     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.06536044     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.07927283     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06125933     0.0000     0.0000000     0.0000000
## critics_score          0.92332791     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13120975     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04075939     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11620418     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14787055     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06788223     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04908857     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1686     0.1470000     0.0345000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 1.0000000     0.0000000
## mpaa_rating_Ryes        0.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         1.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     1.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.1851908     0.1745817
## PostProbs               0.0312000     0.0294000
## R2                      0.7495000     0.7469000
## dim                     5.0000000     4.0000000
## logmarg             -3436.4383237 -3436.4973172

Let’s plot this BMA to have a visual of eahc model

image(movies_bma,rotate = FALSE)

Looking at the BMA summary executed with BIC as the model selection factor, we can clearly see that model 1 which has the Bayes Factor of 1 and the highest posterior probability of 16.86%, corresponds to the model from the simple BIC analysis. However, we can also see that the model 2 which was silently discarded in the previous BIC model selection has 14.7% probability and a Bayesian factor of 0.87 which is quite high.

The visual shows that the imdb_rating and the critics_score are the main two predictors as they are represented in all the models with a high odds posterior probability log.

The runtime is however present in the first model but looking at the reduction in BIC value or at the increase of the adjusted R-square when adding the runtime in the model, one may say the model 2 is likely fitted as the model 1.

We will stick to the model 1 in which audience_score is predicted by imdb_rating, critics_score and runtime. It is important to point that the runtime is actually predicted to be a penalty factor to the audience score as its correlation corefficient is negative in this case.

par(mfrow = c(1,3))
coef_movies_bma = coefficients(movies_bma)
plot(coef_movies_bma, subset = c(4, 9, 11) ,ask=FALSE)

Part 5: Prediction

Since our objective here is to predict estimates using the Best Predictive Model option withing the predict function for predicting the estimators. Our prior here will be the BIC and the model prior may be uniform as we do not have previous information for believing otherwise.

The “Trolls”, thtr_rel_year:2016, runtime:92 minutes; audience_score: 68; critics_score: 102; All these informations were collected from Rotten Tomatoes website https://www.rottentomatoes.com/m/trolls#audience_reviews

Based on the above, we will run the predict function on the dataframe created with those parameters and compare the result with the observed audience score of the movie which currently 68% from Rottentomatoes website https://www.rottentomatoes.com/m/trolls#audience_reviews.

# Here we fit the best final model

movies_best_model <- bas.lm(formula = audience_score ~ imdb_rating + critics_score + runtime, data = na.omit(movies), na.action = "na.omit", prior = "BIC", modelprior = uniform())

# Here we do prediction

trolls_data <- data.frame(audience_score=68, critics_score=102, runtime=92, imdb_rating=6.5)

trolls_predict <- predict(movies_best_model, newdata = trolls_data, estimator = "BPM", se.fit = TRUE, prediction = TRUE)

confint(trolls_predict)

##          2.5%    97.5%     pred
## [1,] 45.75539 86.01813 65.88676
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The predicted value using the selected model for the Trolls movies (2016) is 65.88676 an which deviates from the observed value of 2.11324 points. But looking at the confidence interval [45.75 - 86.01], the actual value 68 falls largely within this interval. The point estimage of the audience_score under this model is 65.88 ± 20.13

Part 6: Conclusion

We can see from the output of the that the model has 95% probability of predicting the audience score based on the predictors selected. However the margin of error ME=20.13 is quite large yielding so a low precision of the model.

Bayesian modeling and prediction for Movies project

Michelle Tan