library(ggplot2)## Warning: package 'ggplot2' was built under R version 3.4.2
library(dplyr)## Warning: package 'dplyr' was built under R version 3.4.2
library(statsr)
library(BAS)## Warning: package 'BAS' was built under R version 3.4.2
library(psych)movies<-get(load("movies.Rdata"))The study is an observational one. The main contribution to the generazability of inference results is that the used data set represents a simple random sample of movies released before 2016. So we can tell about causality which is unbiased in the sense of free of collection error( see https://en.wikipedia.org/wiki/Simple_random_sample.). But there are biases in the data set that is an obstacle to the generazability as data structure uses several ratings results which can be biased, films released only before 2016 and produced only in the US.
We create new variables according to the assignment:
data_movies<-movies %>% mutate(feature_film = factor(ifelse(title_type=="Feature Film",c("yes"), c("no")),levels=c("no","yes"))) %>%
mutate(drama = factor(ifelse(genre=="Drama", c("yes"), c("no")),levels=c("no","yes")))%>%
mutate(mpaa_rating_R =factor(ifelse(mpaa_rating=="R", c("yes"), c("no")),levels=c("no","yes")))%>%
mutate(oscar_season = factor(ifelse(thtr_rel_month%in% 10:12, c("yes"), c("no")),levels=c("no","yes")))%>%
mutate(summer_season = factor(ifelse(thtr_rel_month %in% 5:8, c("yes"), c("no")),levels=c("no","yes")))Now we perform an explorartory data analysis of the relationships between variable “audience_score” and newly constructed variables: First, we look at summary statistics of the variables and their relationships:
describe(data_movies$audience_score)## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 651 62.36 20.22 65 63.39 23.72 11 97 86 -0.36 -0.91
## se
## X1 0.79
summary(data_movies$feature_film)## no yes
## 60 591
summary(data_movies$drama)## no yes
## 346 305
summary(data_movies$oscar_season)## no yes
## 460 191
summary(data_movies$summer_season)## no yes
## 443 208
aggr1<-aggregate(audience_score ~ feature_film, data_movies,mean)
print(aggr1)## feature_film audience_score
## 1 no 81.05000
## 2 yes 60.46531
aggr2<-aggregate(audience_score ~ drama, data_movies,mean)
print(aggr2)## drama audience_score
## 1 no 59.73121
## 2 yes 65.34754
aggr3<-aggregate(audience_score ~ oscar_season, data_movies,mean)
print(aggr3)## oscar_season audience_score
## 1 no 61.81304
## 2 yes 63.68586
aggr4<-aggregate(audience_score ~ summer_season, data_movies,mean)
print(aggr4)## summer_season audience_score
## 1 no 62.62302
## 2 yes 61.80769
The summary statistics of variables show that the ditribution of “audience score” is left-skewed, most films in the data set are fearture films, about half of films are drama by genre, about one of three films are released in a summer season and about 0.3 films are released in an oscar season. The summary statistics of ‘audience score’ by the factor variables show that feature films at average are scored lower the other types films; drama films are scored higher than other genres; films released in an oscar season in general are scored a little bit higher while in released in summer season a little bit less. We visualize the relationships by creating boxplots and credible interval by “bayes inference” function:
boxplot(audience_score~feature_film,data=data_movies)bayes_inference(y = audience_score, x = feature_film, data = data_movies, statistic = "mean", type = "ci")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming independent Jeffrey's priors for mu and sigma^2)
##
## 95% Cred. Int.: (16.7349 , 24.4434)
##
## Post. mean = 20.5863
## Post. median = 20.5856
## Post. mode = 20.7087
We see that the average audience score for feature film is lower than other type films and the credible interval proves the difference of means.
boxplot(audience_score~drama,data=data_movies)bayes_inference(y = audience_score, x = drama, data = data_movies, statistic = "mean", type = "ci")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 346, y_bar_no = 59.7312, s_no = 21.2775
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming independent Jeffrey's priors for mu and sigma^2)
##
## 95% Cred. Int.: (-8.6847 , -2.5438)
##
## Post. mean = -5.6159
## Post. median = -5.6152
## Post. mode = -5.603
We see that average score for drama film is higher than other genre films and credible interval proves the difference of means.
boxplot(audience_score~oscar_season,data=data_movies)bayes_inference(y = audience_score, x = oscar_season, data = data_movies, statistic = "mean", type = "ci")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 191, y_bar_yes = 63.6859, s_yes = 20.4612
## (Assuming independent Jeffrey's priors for mu and sigma^2)
##
## 95% Cred. Int.: (-5.3181 , 1.5847)
##
## Post. mean = -1.8697
## Post. median = -1.8695
## Post. mode = -1.8299
We see that there is very small difference in the average audience score for films released during the oscar season and credible interval of the bayes inference contains 0.
boxplot(audience_score~summer_season,data=data_movies)bayes_inference(y = audience_score, x = summer_season, data = data_movies, statistic = "mean", type = "ci")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 443, y_bar_no = 62.623, s_no = 20.3857
## n_yes = 208, y_bar_yes = 61.8077, s_yes = 19.9083
## (Assuming independent Jeffrey's priors for mu and sigma^2)
##
## 95% Cred. Int.: (-2.5144 , 4.1297)
##
## Post. mean = 0.8122
## Post. median = 0.8149
## Post. mode = 0.928
We see that there is almost no difference in the average audience score for films released during the summer season and the bayes credible interval contains 0.
First, we delete empty observations from the data frame.
data_movies = na.omit(data_movies)To develop a Bayesian regression model to predict the “audience score” we run the “Bayesian adaptive sampling without replecement for variable selection (package”BAS" https://www.r-project.org, https://github.com/merliseclyde/BAS): as a prior we will consider “BIC” and as a method “Bayesian Model Averaging”. We assign equal prior probabilities to all models using the uniform function (https://www.coursera.org/learn/bayesian):
movies_bma = bas.lm(audience_score ~ feature_film+drama+runtime+mpaa_rating_R+thtr_rel_year+oscar_season+summer_season+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box, data = data_movies, n.models=2^16, prior = "BIC", modelprior = uniform(), initprobs="eplogp")
summary(movies_bma)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.05876309 0.0000 0.0000000 0.0000000
## dramayes 0.04508592 0.0000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.16498090 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.06525993 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07935408 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1558 0.1358000 0.0319000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0318000 0.0289000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
We run also models with replacement by the MCMC method to look at convergence:
movies_mcmc = bas.lm(audience_score ~ feature_film+drama+runtime+mpaa_rating_R+thtr_rel_year+oscar_season+summer_season+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box, data = data_movies, method="MCMC",MCMC.iterations=2*10^6, n.models=2^16, prior = "BIC", modelprior = uniform(), initprobs="eplogp")
summary(movies_mcmc)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.0581185 0.0000 0.0000000 0.0000000
## dramayes 0.0451245 0.0000 0.0000000 0.0000000
## runtime 0.5137515 1.0000 0.0000000 1.0000000
## mpaa_rating_Ryes 0.1650660 0.0000 0.0000000 1.0000000
## thtr_rel_year 0.0815650 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.0645865 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.0799880 0.0000 0.0000000 0.0000000
## imdb_rating 0.9999930 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.0623855 0.0000 0.0000000 0.0000000
## critics_score 0.9191680 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.1329805 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.0409610 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.1160390 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.1486245 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.0674020 0.0000 0.0000000 0.0000000
## top200_boxyes 0.0492430 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2039916
## PostProbs NA 0.1560 0.1362000 0.0312000
## R2 NA 0.7483 0.7455000 0.7496000
## dim NA 4.0000 3.0000000 5.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3416317
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 0.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 1.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2048238 0.1851908
## PostProbs 0.0312000 0.0284000
## R2 0.7470000 0.7495000
## dim 4.0000000 5.0000000
## logmarg -3436.3375603 -3436.4383237
Now we will look at plots of the models:
diagnostics(movies_mcmc)plot(movies_mcmc,which=2,add.smooth=F)plot(movies_bma,which=1,add.smooth=F)plot(movies_bma,which=2,add.smooth=F)plot(movies_bma,which=3)image(movies_bma,subset=-1)The plots of the model run by MCMC shows that there is a covergence meaning that we can use the Bayesian estimates. The residuals plot of BMA models show that there are outliers.Cummulative model probabilities plot of BMA models shows that we discovered about 5000 unique models out of 2^16 models with bayesian adaptive sampling.The plot of model size versus the log of marginal likelihood shows that models with the highest Bayes factors or marginal likelihoods have from 2 to 5 predictors. In summary of the BMA models we see that the highest marginal posterior inclusion probabilities have the intercept and variable“imdb_rating”(1.0), the next is “critics_score”(0.920) and the forth is “runtime” (0.514). Image of the BMA models shows that top ranked model includes just these four variables. This model has the highest prosterior probability and is the median probability model.
No we will look at credible interval of coefficients:
coef_movies_bicprior=coefficients(movies_bma, estimator="BMA")
plot(coef_movies_bicprior)confint(coef_movies_bicprior)## 2.5% 97.5% beta
## Intercept 6.138969e+01 6.297492e+01 6.221002e+01
## feature_filmyes -7.166579e-01 1.156079e-01 -9.162435e-02
## dramayes 0.000000e+00 0.000000e+00 1.956484e-02
## runtime -9.078604e-02 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes -1.987046e+00 1.548335e-03 -2.409288e-01
## thtr_rel_year -4.244719e-02 2.506749e-05 -3.878883e-03
## oscar_seasonyes -7.071126e-01 0.000000e+00 -6.293171e-02
## summer_seasonyes 0.000000e+00 9.601729e-01 8.660919e-02
## imdb_rating 1.358424e+01 1.656623e+01 1.490680e+01
## imdb_num_votes -8.120657e-10 2.638531e-06 2.466254e-07
## critics_score -1.564034e-06 1.108884e-01 6.951964e-02
## best_pic_nomyes 0.000000e+00 4.954424e+00 5.130600e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -6.383140e-03
## best_actor_winyes -2.214586e+00 0.000000e+00 -2.123363e-01
## best_actress_winyes -2.970196e+00 0.000000e+00 -3.323225e-01
## best_dir_winyes -1.397184e+00 8.083505e-02 -1.188402e-01
## top200_boxyes 0.000000e+00 0.000000e+00 8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
The coefficients’ credible intervals show that there is evidence that imdb_rating, critics_score are positively associated with the audience score while longer films are scored less by the audience.
In this part we have choosen the film “La La Land” released in 2016 to predict the audience score by the model from the previous step. Data from the web-sites the IMDb and Tomatometer rating were addded to the data frame:
data_movies[nrow(data_movies)+1,] = list(title="La La Land",title_type="Feature Film", genre="Drama",runtime=128, mpaa_rating="PG-13",studio="Lionsgate Films",thtr_rel_year=2016,thtr_rel_month=12, thtr_rel_day=9,dvd_rel_year=2017, dvd_rel_month=4,dvd_rel_day=25,imdb_rating=8.1, imdb_num_votes=310814, critics_rating="Certified Fresh",critics_score=93, audience_rating="Spilled",audience_score=81,best_pic_nom="yes", best_pic_win="yes", best_actor_win="yes",best_actress_win="yes", best_dir_win="yes", top200_box="yes", director="Damien Chazelle", actor1="Ryan Gosling", actor2="Emma Stone", actor3="Amiee Conn",actor4="Terry Walters",actor5="Thom Shelton",imdb_url="http://www.imdb.com/title/tt3783958", rt_url="https://www.rottentomatoes.com/m/la_la_land", feature_film="yes", drama="yes", mpaa_rating_R="no", oscar_season="yes",summer_season="no")
tail(data_movies)## # A tibble: 6 x 37
## title title_type genre runtime
## <chr> <fctr> <fctr> <dbl>
## 1 Death Defying Acts Feature Film Drama 97
## 2 Half Baked Feature Film Comedy 82
## 3 Dance of the Dead Feature Film Action & Adventure 87
## 4 Around the World in 80 Days Feature Film Action & Adventure 120
## 5 LOL Feature Film Comedy 97
## 6 La La Land Feature Film Drama 128
## # ... with 33 more variables: mpaa_rating <fctr>, studio <fctr>,
## # thtr_rel_year <dbl>, thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## # dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## # imdb_rating <dbl>, imdb_num_votes <dbl>, critics_rating <fctr>,
## # critics_score <dbl>, audience_rating <fctr>, audience_score <dbl>,
## # best_pic_nom <fctr>, best_pic_win <fctr>, best_actor_win <fctr>,
## # best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## # director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## # actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>,
## # feature_film <fctr>, drama <fctr>, mpaa_rating_R <fctr>,
## # oscar_season <fctr>, summer_season <fctr>
Now we are going to make a prediction under model averaging and other estimator (HMP - highest probability model, MPM - the median probability model and BMP):
movies.BMA=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="BMA")
confint(movies.BMA)## 2.5% 97.5% pred
## [1,] 67.92074 108.0154 88.13245
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
head(movies.BMA$se.pred)## [,1]
## [1,] 10.17927
## [2,] 10.22202
## [3,] 10.25961
## [4,] 10.17068
## [5,] 10.36133
## [6,] 10.24785
movies.BPM=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="BPM")
confint(movies.BPM)## 2.5% 97.5% pred
## [1,] 59.24021 117.025 88.13259
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
movies.BPM$se.pred## 1
## 14.71188
movies.HPM=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="HPM")
confint(movies.HPM)## 2.5% 97.5% pred
## [1,] 67.79164 107.7723 87.78198
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
movies.HPM$se.pred## 1
## 10.17927
movies.MPM=predict(movies_bma,newdata=data_movies[nrow(data_movies),],prediction=TRUE, se.fit=TRUE, estimator="MPM")
confint(movies.MPM)## 2.5% 97.5% pred
## [1,] 67.80094 107.7626 87.78178
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
movies.MPM$se.pred## 1
## 10.17456
We choose the BMA models prediction of the audience score for the film “La La Land”: with the probability of 95% the audience score will fall between 68.57 and 117.02.
Regarding data and research methodoloigy I have critics about combining data from different sources such as different ratings.In the future research I would use new variables: “age of the film” and age squared instead the the thtr_rel_year. I would keep doing Bayesian estimates with other priors.