library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
load("movies.Rdata")
The objective of this assignment is to determine an algorithm of predicting audience_score based on various characteristics about a movie. Hence 651 movies were randomly sampled from IMDB. This random sampling seems reasonable and thus a good data set for statistical inference. We can see more information about all the variables here:
Despite the fact that these are random selections it is difficult to establish a causal relationship from regression and without running an experiment. We can hypothesize that certain variables include should have a causal relationship but further investigation should be done to confirm this is the case. For example, it is highly likely that audience_score is related or correlated with being a top 200 box office movie but it is unclear which way the causation runs. Is it in the top 200 because people enjoy the movie or do people see movies because they are in the top 200? Or perhaps they cycle on each other?
movies<-mutate(movies,feature_film=ifelse(title_type %in% c("Feature Film"),"yes","no"),
drama=ifelse(genre %in% c("Drama"),"yes","no"),
mpaa_rating_R=ifelse(mpaa_rating %in% c("R"),"yes","no"),
oscar_season=ifelse(thtr_rel_month %in% c(10,11,12),"yes","no"),
summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8),"yes","no")
)%>%
select(audience_score,feature_film,drama,runtime,mpaa_rating_R,thtr_rel_year,
oscar_season,summer_season,imdb_rating,imdb_num_votes,critics_score,
best_pic_nom,best_pic_win,best_actor_win,best_actress_win,best_dir_win,
top200_box)
movies$feature_film<-as.factor(movies$feature_film)
movies$drama<-as.factor(movies$drama)
movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
movies$oscar_season<-as.factor(movies$oscar_season)
movies$summer_season<-as.factor(movies$summer_season)
summary(select(movies,audience_score,feature_film,drama,mpaa_rating_R,oscar_season,
summer_season))
## audience_score feature_film drama mpaa_rating_R oscar_season
## Min. :11.00 no : 60 no :346 no :322 no :460
## 1st Qu.:46.00 yes:591 yes:305 yes:329 yes:191
## Median :65.00
## Mean :62.36
## 3rd Qu.:80.00
## Max. :97.00
## summer_season
## no :443
## yes:208
##
##
##
##
About half the movies are R rated and another half are drama with some overlap. A quarter are feature films whichis reasonable as many independent film studios release movies which don’t get much attention. Oscar Season and Summer Season each seem to have around 1/4 of the movies which makes since if movies are roughly evenly distributed throughout the year.
ggplot(data = movies,aes(audience_score)) +
geom_histogram(fill="gold3", binwidth = 5)+
ggtitle("Histogram of Audience Score") +
labs(x="Audience Score",y="Count")
Audience Score seems to be skewed to the left with very few movies getting below 30 in audience rating. Peak is around 80 to 90 audience score.
ggplot(data = movies, aes(factor(feature_film), audience_score)) +
geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
ggtitle("Audience Score by Feature Film") +
labs(x="Is Feature Film?",y="Audience Score") +
theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
axis.title=element_text(size=15))
data.frame(group_by(movies,feature_film)%>%summarise(Min=min(audience_score),
Q1=quantile(audience_score,.25),
Med=median(audience_score),
Mean=mean(audience_score),
Q3=quantile(audience_score,.75),
Max=max(audience_score)))
## feature_film Min Q1 Med Mean Q3 Max
## 1 no 19 76.5 85.5 81.05000 89 96
## 2 yes 11 44.5 62.0 60.46531 78 97
For Feature Films it seems that they are negatively correlated with Audience Score. However, there is much more variation in Feature Films than non-Feature Films. There are 2 outliers in the non-Feature Films that got very bad audience ratings. This distribution could arise from the fact that only certain kinds of people watch particular kinds of non-Feature Films so they are pretty likely to enjoy that movie. Whereas if a random person watched the movie they would probably not enjoy it.
ggplot(data = movies, aes(factor(drama), audience_score)) +
geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
ggtitle("Audience Score by Drama") +
labs(x="Is Drama?",y="Audience Score") +
theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
axis.title=element_text(size=15))
data.frame(group_by(movies,drama)%>%summarise(Min=min(audience_score),
Q1=quantile(audience_score,.25),
Med=median(audience_score),
Mean=mean(audience_score),
Q3=quantile(audience_score,.75),
Max=max(audience_score)))
## drama Min Q1 Med Mean Q3 Max
## 1 no 11 41 61 59.73121 79 97
## 2 yes 13 52 70 65.34754 80 95
Audience Score seems slightly higher for Dramas while Not Drama has a bit more variation. Overall, scores don’t appear that different.
ggplot(data = movies, aes(factor(mpaa_rating_R), audience_score)) +
geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
ggtitle("Audience Score by R rating") +
labs(x="Is R rated?",y="Audience Score") +
theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
axis.title=element_text(size=15))
data.frame(group_by(movies,mpaa_rating_R)%>%summarise(Min=min(audience_score),
Q1=quantile(audience_score,.25),
Med=median(audience_score),
Mean=mean(audience_score),
Q3=quantile(audience_score,.75),
Max=max(audience_score)))
## mpaa_rating_R Min Q1 Med Mean Q3 Max
## 1 no 11 48.25 65.5 62.68944 80 96
## 2 yes 14 44.00 64.0 62.04255 79 97
Audience score looks roughly the same for R and not R rated movies. Can’t tell much difference.
ggplot(data = movies, aes(factor(oscar_season), audience_score)) +
geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
ggtitle("Audience Score by Oscar Season") +
labs(x="Is Oscar Season?",y="Audience Score") +
theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
axis.title=element_text(size=15))
data.frame(group_by(movies,oscar_season)%>%summarise(Min=min(audience_score),
Q1=quantile(audience_score,.25),
Med=median(audience_score),
Mean=mean(audience_score),
Q3=quantile(audience_score,.75),
Max=max(audience_score)))
## oscar_season Min Q1 Med Mean Q3 Max
## 1 no 11 46.0 64 61.81304 79 96
## 2 yes 13 47.5 69 63.68586 81 97
Again can’t tell much difference between Oscar Season movies and non-Oscar Season movies. Audience score is about the same.
ggplot(data = movies, aes(factor(summer_season), audience_score)) +
geom_boxplot(fill="gold3", outlier.colour = "navy", outlier.size = 3)+
ggtitle("Audience Score by Summer Season") +
labs(x="Is Summer Season?",y="Audience Score") +
theme(plot.title=element_text(size = 20, face="bold"), axis.text=element_text(size=10),
axis.title=element_text(size=15))
data.frame(group_by(movies,summer_season)%>%summarise(Min=min(audience_score),
Q1=quantile(audience_score,.25),
Med=median(audience_score),
Mean=mean(audience_score),
Q3=quantile(audience_score,.75),
Max=max(audience_score)))
## summer_season Min Q1 Med Mean Q3 Max
## 1 no 13 46.00 66 62.62302 80 97
## 2 yes 11 44.75 65 61.80769 78 94
Not much difference in Audience Score for Summer and Non-Summer Season movies.
Next we conduct the modeling for audience score. Since we are trying to create a predictive model we’ll use AIC (least restrictive) for the priors for the Betas as we don’t care too much if non-informative variables are added in. We don’t have too many variables so we’ll use enurmeration as well.
bma_movies = bas.lm(audience_score ~ . -audience_score, data = movies,
prior = "AIC", modelprior = uniform())
## Warning in bas.lm(audience_score ~ . - audience_score, data = movies, prior
## = "AIC", : dropping 1 rows due to missing data
bma_movies
##
## Call:
## bas.lm(formula = audience_score ~ . - audience_score, data = movies, prior = "AIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.0000 0.3812 0.3795
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.8560 0.6831 0.5764
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.3648 0.4160 1.0000
## imdb_num_votes critics_score best_pic_nomyes
## 0.4506 0.9667 0.6933
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.3033 0.5155 0.5941
## best_dir_winyes top200_boxyes
## 0.3622 0.3051
summary(bma_movies)
## Intercept feature_filmyes dramayes runtime mpaa_rating_Ryes
## [1,] 1 0 0 1 1
## [2,] 1 0 0 1 1
## [3,] 1 0 0 1 1
## [4,] 1 0 0 1 1
## [5,] 1 0 0 1 1
## thtr_rel_year oscar_seasonyes summer_seasonyes imdb_rating
## [1,] 1 0 0 1
## [2,] 0 0 0 1
## [3,] 0 0 0 1
## [4,] 1 0 0 1
## [5,] 0 0 1 1
## imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,] 0 1 1 0
## [2,] 0 1 1 0
## [3,] 0 1 1 0
## [4,] 0 1 1 0
## [5,] 0 1 1 0
## best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,] 1 1 0 0
## [2,] 1 1 0 0
## [3,] 0 1 0 0
## [4,] 0 1 0 0
## [5,] 0 1 0 0
## BF PostProbs R2 dim logmarg
## [1,] 1.0000000 0.0017 0.7601 9 -3604.421
## [2,] 0.9783606 0.0017 0.7593 8 -3604.442
## [3,] 0.9298905 0.0016 0.7585 7 -3604.493
## [4,] 0.8903369 0.0015 0.7592 8 -3604.537
## [5,] 0.7914465 0.0013 0.7592 8 -3604.654
image(bma_movies, rotate = FALSE)
Under the AIC prior we see that the top models all include runtime, mpaa_rating_R, thtr_rel_year,imdb_rating, critics score, best_pic_nominees, best actor wins and best actress wins. In the display of the coeficients below we see that these are the ones with the largest coeffients and highest posterior probabilities of Beta not equal to 0.
coef.bas(bma_movies)
##
## Marginal Posterior Summaries of Coefficients:
## post mean post SD post p(B != 0)
## Intercept 6.235e+01 3.925e-01 1.000e+00
## feature_filmyes -6.123e-01 1.308e+00 3.812e-01
## dramayes 3.209e-01 6.796e-01 3.795e-01
## runtime -4.745e-02 2.941e-02 8.560e-01
## mpaa_rating_Ryes -1.023e+00 9.621e-01 6.831e-01
## thtr_rel_year -3.483e-02 4.158e-02 5.764e-01
## oscar_seasonyes -3.098e-01 7.093e-01 3.648e-01
## summer_seasonyes 4.164e-01 7.588e-01 4.160e-01
## imdb_rating 1.492e+01 6.373e-01 1.000e+00
## imdb_num_votes 2.411e-06 4.008e-06 4.506e-01
## critics_score 6.275e-02 2.460e-02 9.667e-01
## best_pic_nomyes 3.233e+00 2.989e+00 6.933e-01
## best_pic_winyes -5.685e-01 2.728e+00 3.033e-01
## best_actor_winyes -8.819e-01 1.210e+00 5.155e-01
## best_actress_winyes -1.281e+00 1.465e+00 5.941e-01
## best_dir_winyes -5.575e-01 1.261e+00 3.622e-01
## top200_boxyes 4.617e-01 1.681e+00 3.051e-01
plot(bma_movies, which=1)
plot(bma_movies, which=2)
plot(bma_movies, which=3)
plot(bma_movies, which=4)
We do notice that there are a few outliers at rows 126, 216, and 251.Additionally the residual plot doesn’t look quite random. There are many positive residuals for the lower predicted scores. Hence movies with low scores aren’t quite as low as we are predicting.
The model completity appears to peak around the 5 to 10 mark which makes sense given that we have 8 variables in our model.
Looked at movie Suicide Squad:
IMDb: http://www.imdb.com/title/tt1386697/
Rotten Tomatoes: https://www.rottentomatoes.com/m/suicide_squad_2016/
Box Office Mojo: http://www.boxofficemojo.com/showdowns/chart/?id=superhero2016.htm
Best Actor Win: Yes, Jared Leto.
For the prediction I’m going to use the Best Predictive Model (BPM) as we are making a prediction.
movies3<-rbind(movies,c(66,"yes","no",123,"no",2016,"no","yes",6.6,241744,
47,"no","no","yes","no","no","yes"))
movies3[] <- mapply(FUN = as,movies3,sapply(movies,class),SIMPLIFY = FALSE)
movies3_pred = predict.bas(bma_movies,newdata = movies3, estimator="BPM")
suicide_squad_pred=movies3_pred$fit[651]
My predicted value for Suicide Squad is 61.6836967 . This is pretty close to the actual value of 66 on Rotten Tomotas. The residual is 4.3163033 .
Overall this Bayesian Regression performed quite well with a residual value of around 4 for Suicide Squad. The model gives us very interpretable coefficients along with probabilities of how likely that are to be included in high posterior probability models. While we can’t quite establish a causal relationship it does help us to see strong correlations.
Some areas for improvement could be considering some other variable transformations or interaction terms. Or additional variables could be created from some of the text variables not used. This could also help to alievate the small issue of positive residuals near the lower end of the audience score spectrum.