The purpose of this report is to find out which attributes make a movie popular, along with other interesting findings (this is an observational study rather than a causal one). Of course, practicing data analysis with Bayesian regression is also valuable for me. I will use Lab 4 of this course and my project from the second course of this specialization as references.
library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
load("movies.Rdata")
The data set is comprised of 651 randomly sampled movies produced and released before 2016, so random sampling supports generalizability. Random assignment was not used, which means this is an observational study rather than an experimental one, so it cannot be used for causal inference. Let's look at the basics of the data set:
str(movies)
head(movies)
There are three types of variables in the dataset: numeric, factor, and character. To build the prediction model, we first derive several new yes/no indicator variables from the existing columns.
# First approach: build each yes/no indicator one at a time as a factor vector
feature_film_ind <- movies$title_type == "Feature Film"
feature_film <- factor(levels = c("yes", "no"))
feature_film[feature_film_ind] <- "yes"
feature_film[!feature_film_ind] <- "no"
drama_ind <- movies$genre == "Drama"
drama <- factor(levels = c("yes", "no"))
drama[drama_ind] <- "yes"
drama[!drama_ind] <- "no"
mpaa_rating_R_ind <- movies$mpaa_rating == "R"
mpaa_rating_R <- factor(levels = c("yes", "no"))
mpaa_rating_R[mpaa_rating_R_ind] <- "yes"
mpaa_rating_R[!mpaa_rating_R_ind] <- "no"
oscar_season_ind <- movies$thtr_rel_month >= 10
oscar_season <- factor(levels = c("yes", "no"))
oscar_season[oscar_season_ind] <- "yes"
oscar_season[!oscar_season_ind] <- "no"
summer_season_ind <- movies$thtr_rel_month >= 5 & movies$thtr_rel_month <= 8
summer_season <- factor(levels = c("yes", "no"))
summer_season[summer_season_ind] <- "yes"
summer_season[!summer_season_ind] <- "no"
A particularly clean way to do the same job is with mutate() from dplyr:
movies = mutate(movies,
feature_film = ifelse(title_type == 'Feature Film', "yes", "no"),
drama = ifelse(genre == 'Drama', "yes", "no"),
mpaa_rating_R = ifelse(mpaa_rating == 'R', "yes", "no"),
oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"),
summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = feature_film)) + geom_boxplot()
The boxplot shows that feature films have a relatively lower audience score. The numbers of "yes" and "no" are 591 and 60, respectively.
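These counts can be checked directly from the recoded variable (a quick check, assuming the mutate() call above has been run):
# Count feature vs. non-feature films
table(movies$feature_film)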
We can run a Bayesian hypothesis test of whether there is a difference in audience score between feature and non-feature films.
bayes_inference(y = audience_score, x = feature_film, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 4.221452e+13
## P(H1|data) = 0
## P(H2|data) = 1
From the results (BF[H2:H1] is about 4.2e13), there is overwhelming evidence of a difference in audience score between the two groups.
For the other new variables, we can run exactly the same test.
ggplot(data = movies, aes(x = drama, y = audience_score, fill = drama)) + geom_boxplot()
bayes_inference(y = audience_score, x = drama, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")
It shows that dramas have a slightly lower audience score than other movies.
ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()
bayes_inference(y = audience_score, x = mpaa_rating_R, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")
It shows no clear difference in audience score between R-rated films and the others.
ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + geom_boxplot()
ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = summer_season)) + geom_boxplot()
bayes_inference(y = audience_score, x = oscar_season, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")
bayes_inference(y = audience_score, x = summer_season, data = movies, statistic = "mean", type = "ht",null=0, alternative = "twosided")
As we can see, movies released in the Oscar season have slightly higher audience scores than those that are not, while movies released in the summer season have slightly lower audience scores.
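As a quick numerical check of these visual impressions (a sketch using the dplyr verbs already loaded), the group means can be computed directly:
# Mean audience score by release season
movies %>% group_by(oscar_season) %>% summarise(mean_score = mean(audience_score))
movies %>% group_by(summer_season) %>% summarise(mean_score = mean(audience_score))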
There are several criteria for conducting model selection, but when the BIC values of the candidate models differ only slightly, picking a single model by such rules is not very satisfactory. In these situations, rather than selecting one model, we use Bayesian model averaging (BMA), which averages over the candidate models weighted by their posterior probabilities. The model-space plot produced by image(fit, rotate = F) below helps a lot in understanding the models.
# Keep the response and the candidate predictors for the modeling stage
newdata <- select(movies, audience_score, feature_film, drama, runtime, mpaa_rating_R,
                  thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes,
                  critics_score, best_pic_nom, best_pic_win, best_actor_win,
                  best_actress_win, best_dir_win, top200_box)
fit <- bas.lm(audience_score ~ ., data = na.omit(newdata), prior = "BIC", modelprior = uniform())
summary(fit)
## Intercept feature_filmno dramano runtime mpaa_rating_Rno
## [1,] 1 0 0 1 0
## [2,] 1 0 0 0 0
## [3,] 1 0 0 0 0
## [4,] 1 0 0 0 1
## [5,] 1 0 0 1 1
## thtr_rel_year oscar_seasonno summer_seasonno imdb_rating
## [1,] 0 0 0 1
## [2,] 0 0 0 1
## [3,] 0 0 0 1
## [4,] 0 0 0 1
## [5,] 0 0 0 1
## imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,] 0 1 0 0
## [2,] 0 1 0 0
## [3,] 0 1 0 0
## [4,] 0 1 0 0
## [5,] 0 1 0 0
## best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 1 0 0 0
## [4,] 0 0 0 0
## [5,] 0 0 0 0
## BF PostProbs R2 dim logmarg
## [1,] 1.0000000 0.1297 0.7549 4 -3615.279
## [2,] 0.9968489 0.1293 0.7525 3 -3615.282
## [3,] 0.2543185 0.0330 0.7539 4 -3616.648
## [4,] 0.2521327 0.0327 0.7539 4 -3616.657
## [5,] 0.2391994 0.0310 0.7563 5 -3616.710
image( fit, rotate = F)
Printing the model object and calling summary() give us both the posterior inclusion probability of each variable and the most probable models, together with the posterior probabilities of those models. In this case, the model with the highest posterior probability includes runtime, imdb_rating and critics_score as explanatory variables, followed very closely by the model with only imdb_rating and critics_score.
Note that no single model dominates: even the best model has a posterior probability of only about 0.13, which is another reason to rely on Bayesian model averaging rather than committing to any one model.
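The marginal posterior inclusion probability of each predictor can also be read directly off the fitted object; a quick check, assuming the fit object above:
# Marginal posterior inclusion probabilities P(beta_j != 0 | data)
print(fit)               # printed as part of the default print method for bas objects
round(fit$probne0, 3)    # or accessed directly from the probne0 component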
fit <- bas.lm(audience_score ~ ., data = newdata, prior = "ZS-null", modelprior = uniform())
## Warning in bas.lm(audience_score ~ ., data = newdata, prior = "ZS-null", :
## dropping 1 rows due to missing data
summary(fit)
## Intercept feature_filmno dramano runtime mpaa_rating_Rno
## [1,] 1 0 0 0 0
## [2,] 1 0 0 1 0
## [3,] 1 0 0 0 0
## [4,] 1 0 0 0 1
## [5,] 1 0 0 1 1
## thtr_rel_year oscar_seasonno summer_seasonno imdb_rating
## [1,] 0 0 0 1
## [2,] 0 0 0 1
## [3,] 0 0 0 1
## [4,] 0 0 0 1
## [5,] 0 0 0 1
## imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,] 0 1 0 0
## [2,] 0 1 0 0
## [3,] 0 1 0 0
## [4,] 0 1 0 0
## [5,] 0 1 0 0
## best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 1 0 0 0
## [4,] 0 0 0 0
## [5,] 0 0 0 0
## BF PostProbs R2 dim logmarg
## [1,] 1.0000000 0.1388 0.7525 3 443.9495
## [2,] 0.8702806 0.1208 0.7549 4 443.8106
## [3,] 0.2236679 0.0311 0.7539 4 442.4519
## [4,] 0.2217602 0.0308 0.7539 4 442.4433
## [5,] 0.2055844 0.0285 0.7563 5 442.3676
Under the Zellner-Siow null prior, the most probable model includes only imdb_rating and critics_score as explanatory variables.
fit1 <- bas.lm(audience_score ~ . , data = newdata)
## Warning in bas.lm(audience_score ~ ., data = newdata): dropping 1 rows due
## to missing data
plot(fit1, which=1)
It appears that there may be some outliers in the data: observations 126, 216 and 251 have been flagged as the points with the three largest absolute residuals. We should check whether these points are genuine outliers or not.
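A first step is simply to look at the flagged rows in the data actually used for the fit (a quick check; na.omit() drops the same row with missing data that bas.lm dropped, so the row numbers should line up):
# Inspect the three observations with the largest absolute residuals
na.omit(newdata)[c(126, 216, 251), c("audience_score", "imdb_rating", "critics_score", "runtime")]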
coef <- coefficients(fit1)
confint(coef)
## 2.5 % 97.5 % beta
## Intercept 61.55739623 63.1459557 6.234769e+01
## feature_filmno 0.00000000 0.0000000 3.538671e-02
## dramano 0.00000000 0.0000000 -1.933600e-03
## runtime -0.06620746 0.0000000 -8.805543e-03
## mpaa_rating_Rno 0.00000000 0.0000000 7.490287e-02
## thtr_rel_year 0.00000000 0.0000000 -9.038291e-04
## oscar_seasonno 0.00000000 0.0000000 2.253495e-02
## summer_seasonno 0.00000000 0.0000000 -2.250274e-02
## imdb_rating 13.80491172 16.8421667 1.532291e+01
## imdb_num_votes 0.00000000 0.0000000 2.649436e-08
## critics_score 0.00000000 0.1028322 4.155546e-02
## best_pic_nomyes 0.00000000 0.0000000 7.996231e-02
## best_pic_winyes 0.00000000 0.0000000 -2.294183e-03
## best_actor_winyes 0.00000000 0.0000000 -8.273357e-02
## best_actress_winyes 0.00000000 0.0000000 -7.940600e-02
## best_dir_winyes 0.00000000 0.0000000 -2.971645e-02
## top200_boxyes 0.00000000 0.0000000 1.714854e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
As we can see, the posterior means (under model averaging) of the intercept, imdb_rating and critics_score coefficients are about 62.35, 15.32 and 0.04, and their 95% credible intervals are shown above.
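To see the full posterior distributions behind these summaries, the coefficient object can be plotted directly (a sketch; columns 1, 9 and 11 correspond to the intercept, imdb_rating and critics_score in the output above):
# Posterior densities of the intercept, imdb_rating and critics_score coefficients;
# the mass at zero reflects the posterior probability that the coefficient is excluded
plot(coef, subset = c(1, 9, 11), ask = FALSE)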
We pick the movie The Accountant (https://www.rottentomatoes.com/m/the_accountant_2016). The audience score of this movie is 84 and the critics score is 52. The IMDb rating is 7.8 (http://www.imdb.com/title/tt2140479/?ref_=nv_sr_1).
Simulation is used in BAS to construct predictive intervals with Bayesian Model averaging, while exact inference is often possible with predictive intervals under model selection.
A 95% credible interval for predicting audience score can be obtained by
new_movie <- data.frame(imdb_rating = 7.8, critics_score = 52, audience_score = 84)
fit <- bas.lm(audience_score ~ imdb_rating + critics_score, data = movies)
predict_fit <- predict(fit, newdata = new_movie, interval = "confidence")
predict_fit$fit
## [,1]
## [1,] 81.20801
sd(predict_fit$Ypred)
## [1] 12.39544
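The code above reports only the BMA point prediction and the spread of the per-model predictions. A sketch of how an actual 95% credible interval could be obtained in BAS, assuming the same fit and new_movie objects (se.fit = TRUE is needed so that confint() has the standard errors it requires):
# BMA prediction with standard errors, then a 95% predictive credible interval
pred_bma <- predict(fit, newdata = new_movie, estimator = "BMA", se.fit = TRUE)
confint(pred_bma, parm = "pred")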
The prediction is only so-so: the predicted audience score of about 81.2 is reasonably close to the true score of 84, although the predictions vary quite a bit across the candidate models (sd about 12.4).
Honestly, I found the course material for this part very hard to understand. I tried very hard, and found that I needed to read more material beyond the course videos to understand it completely. I look forward to seeing better projects from my classmates so that I can learn from them.
To return to the purpose of this report, which was to find out what attributes make a movie popular (and other interesting things): it turns out that imdb_rating and critics_score are the most important variables.
I chose the model built from the predictors with high posterior inclusion probability. This model is very simple, but it may not be very precise.