library(ggplot2)## Warning: package 'ggplot2' was built under R version 3.3.3
library(dplyr)## Warning: package 'dplyr' was built under R version 3.3.3
library(statsr)
library(BAS)## Warning: package 'BAS' was built under R version 3.3.3
library(MASS)load("movies.Rdata")1- This study is observational as is it conducted after the data collection and no experiment was design to control variables or test some parameters prior to the data collection.
2- Additionally, as described in the codebook of this dataset, the sample was built using random sampling of the movies produced and released before 2016. This makes the study generalizable to the entire population of movies the sample was drawn from. Since this is an generalizable observational study, we will only be able to conclude correlations between the population’s variables but no causal inference can be made since there was no experiment performed.
3- Potential source of bias: Since there is no mean for us to know if a voter is allowed to rate the same movie several time, hence influencing the rating parameter of the movies, this can be considered a source of potential bias to the study.
Let’s create few variables as required before proceeding with the data exploration.
# New variable title_type
movies <- movies %>% mutate(feature_film = ifelse(grepl("feature", tolower(title_type)), "yes", "no"))## Warning: package 'bindrcpp' was built under R version 3.3.3
# New variable drama
movies <- movies %>% mutate(drama = ifelse(grepl("drama", tolower(genre)), "yes", "no"))
# New variable mpaa_rating_R
movies <- movies %>% mutate(mpaa_rating_R = ifelse(grepl("r", tolower(mpaa_rating)), "yes", "no"))
# Two new variables oscar_season & summer_season
movies <- movies %>% mutate(oscar_season = factor(thtr_rel_month %in% c(10, 11, 12), labels=c("no", "yes")))
movies <- movies %>% mutate(summer_season = factor(thtr_rel_month %in% c(5, 6, 7, 8), labels=c("no", "yes")))Let’s explore this dataset for the relationship between audience_score and the five (05) variables previously created.
movies %>% group_by(feature_film) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))## # A tibble: 2 x 6
## feature_film Count avg Median IQR Total_score
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 60 81.05000 85.5 12.5 4863
## 2 yes 591 60.46531 62.0 33.5 35735
ggplot(data = movies, aes(x=feature_film, y=audience_score))+geom_col(fill="blue")ggplot(data = movies, aes(x=feature_film, y=audience_score))+geom_boxplot(fill="blue") Summary statistics show that feature films score on average lower than non-feature films. Thi is confirmed by the bar plot showing the total audience score per feature status. Additionally, the box plot indicates consistence of the middle 50% of the audience in scoring on non-features movies as the Inter-quartile indicates 12.5 points variation.
movies %>% group_by(drama) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))## # A tibble: 2 x 6
## drama Count avg Median IQR Total_score
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 346 59.73121 61 38 20667
## 2 yes 305 65.34754 70 28 19931
ggplot(data = movies, aes(x=drama, y=audience_score))+geom_col(fill="blue")ggplot(data = movies, aes(x=drama, y=audience_score))+geom_boxplot(fill="blue") The drama variable shows consistent scoring with a very slight difference between drama movies and others. This could indicate that the variable drama may not influence the scoring of the audience. This will be investigated further.
movies %>% group_by(mpaa_rating_R) %>% summarise(Count = n() ,avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))## # A tibble: 2 x 6
## mpaa_rating_R Count avg Median IQR Total_score
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 272 59.78676 61 31 16262
## 2 yes 379 64.21108 69 34 24336
ggplot(data = movies, aes(x=mpaa_rating_R, y=audience_score))+geom_col(fill="blue")ggplot(data = movies, aes(x=mpaa_rating_R, y=audience_score))+geom_boxplot(fill="blue")The Audience score appaears to be slightly higher for the films with an mpaa_rating of value R.This seems to indicate that the mpaa_rating may slightly influence the audience scoring value. Howerver IQRs seems to indicate quite a lot of variablility within ofaudience scoring withing each mpaa_rating category.
movies %>% group_by(oscar_season) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))## # A tibble: 2 x 6
## oscar_season Count avg Median IQR Total_score
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 460 61.81304 64 33.0 28434
## 2 yes 191 63.68586 69 33.5 12164
ggplot(data = movies, aes(x=oscar_season, y=audience_score))+geom_col(fill="blue")ggplot(data = movies, aes(x=oscar_season, y=audience_score))+geom_boxplot(fill="blue") The mean and median audience score by oscar_season àre not too different when the film was released in theater during an oscar season or not.
movies %>% group_by(summer_season) %>% summarise(Count = n(), avg=mean(audience_score), Median=median(audience_score), IQR=IQR(audience_score), Total_score=sum(audience_score))## # A tibble: 2 x 6
## summer_season Count avg Median IQR Total_score
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 443 62.62302 66 34.00 27742
## 2 yes 208 61.80769 65 33.25 12856
ggplot(data = movies, aes(x=summer_season, y=audience_score))+geom_col(fill="blue")ggplot(data = movies, aes(x=summer_season, y=audience_score))+geom_boxplot(fill="blue") The summer_season variable seems to indicate the same trend as for the oscar_season.
We will first fit the multi-linear model and perform a BIC based model selection.
movies_lm <- lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data =na.omit(movies))
n<- nrow(na.omit(movies))
movies_BIC <- stepAIC(movies_lm, direction = "backward", trace = 0, k=log(n))
summary(movies_BIC)##
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score,
## data = na.omit(movies))
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.996 -6.706 0.738 5.435 52.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.90825 3.35451 -9.810 < 2e-16 ***
## runtime -0.05791 0.02238 -2.588 0.009891 **
## imdb_rating 14.95043 0.60249 24.814 < 2e-16 ***
## critics_score 0.07531 0.02227 3.381 0.000769 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.15 on 615 degrees of freedom
## Multiple R-squared: 0.7483, Adjusted R-squared: 0.747
## F-statistic: 609.4 on 3 and 615 DF, p-value: < 2.2e-16
The model selected based on the BIC considers that runtime, imdb_rating and the critics_score are the most suitable predictors for the the response variable audience_score.
The intercept is not significant as it predicts a penalty of -32.9 points on the audience_score if all predictors are given a value of zero which is not realistic.
The issue here is that we are unable to quantify the level of uncertainty of this model and we might be discarding some variables which are important for this prediction. To ensure we have a look at all the models and quantify the uncertainties associated to each of them.
movies_bma <- bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
movies_bma##
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06121 0.04404
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.51230 0.09620 0.07969
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.06536 0.07927 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06126 0.92333 0.13121
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04076 0.11620 0.14787
## best_dir_winyes top200_boxyes
## 0.06788 0.04909
summary(movies_bma)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06120788 0.0000 0.0000000 0.0000000
## dramayes 0.04404252 0.0000 0.0000000 0.0000000
## runtime 0.51230392 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.09619567 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.07969499 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.06536044 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07927283 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06125933 0.0000 0.0000000 0.0000000
## critics_score 0.92332791 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13120975 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04075939 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11620418 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14787055 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06788223 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04908857 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1686 0.1470000 0.0345000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 0.0000000
## mpaa_rating_Ryes 0.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 1.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 1.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.1851908 0.1745817
## PostProbs 0.0312000 0.0294000
## R2 0.7495000 0.7469000
## dim 5.0000000 4.0000000
## logmarg -3436.4383237 -3436.4973172
Let’s plot this BMA to have a visual of eahc model
image(movies_bma, rotate = FALSE)Looking at the BMA summary executed with BIC as the model selection factor, we can clearly see that model 1 which has the Bayes Factor of 1 and the highest posterior probability of 16.86%, corresponds to the model from the simple BIC analysis. However, we can also see that the model 2 which was silently discarded in the previous BIC model selection has 14.7% probability and a Bayesian factor of 0.87 which is quite high.
The visual shows that the imdb_rating and the critics_score are the main two predictors as they are represented in all the models with a high odds posterior probability log.
The runtime is however present in the first model but looking at the reduction in BIC value or at the increase of the adjusted R-square when adding the runtime in the model, one may say the model 2 is likely fitted as the model 1.
We will stick to the model 1 in which audience_score is predicted by imdb_rating, critics_score and runtime. It is important to point that the runtime is actually predicted to be a penalty factor to the audience score as its correlation corefficient is negative in this case.
par(mfrow = c(1,3))
coef_movies_bma = coefficients(movies_bma)
plot(coef_movies_bma, subset = c(4, 9, 11) ,ask=FALSE)Since our objective here is to predict estimates using the Best Predictive Model option withing the predict function for predicting the estimators. Our prior here will be the BIC and the model prior may be uniform as we do not have previous information for believing otherwise.
The “Trolls”, thtr_rel_year:2016, runtime:92 minutes; audience_score: 68; critics_score: 102; All these informations were collected from Rotten Tomatoes website https://www.rottentomatoes.com/m/trolls#audience_reviews
Based on the above, we will run the predict function on the dataframe created with those parameters and compare the result with the observed audience score of the movie which currently 68% from Rottentomatoes website https://www.rottentomatoes.com/m/trolls#audience_reviews.
# Here we fit the best final model
movies_best_model <- bas.lm(formula = audience_score ~ imdb_rating + critics_score + runtime, data = na.omit(movies), na.action = "na.omit", prior = "BIC", modelprior = uniform())
# Here we do prediction
trolls_data <- data.frame(audience_score=68, critics_score=102, runtime=92, imdb_rating=6.5)
trolls_predict <- predict(movies_best_model, newdata = trolls_data, estimator = "BPM", se.fit = TRUE, prediction = TRUE)
confint(trolls_predict)## 2.5% 97.5% pred
## [1,] 45.75539 86.01813 65.88676
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
The predicted value using the selected model for the Trolls movies (2016) is 65.88676 an which deviates from the observed value of 2.11324 points. But looking at the confidence interval [45.75 - 86.01], the actual value 68 falls largely within this interval. The point estimage of the audience_score under this model is 65.88 ± 20.13
We can see from the output of the that the model has 95% probability of predicting the audience score based on the predictors selected. However the margin of error ME=20.13 is quite large yielding so a low precision of the model.