library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(GGally)
library(devtools)
library(mvtnorm)
library(corrplot)#load("movies.Rdata")
load("movies.Rdata")
movies<-na.omit(movies)
summary(movies)## title title_type genre
## Length:619 Documentary : 42 Drama :298
## Class :character Feature Film:573 Comedy : 86
## Mode :character TV Movie : 4 Action & Adventure: 62
## Mystery & Suspense: 56
## Documentary : 40
## Horror : 22
## (Other) : 55
## runtime mpaa_rating studio
## Min. : 65.0 G : 16 Paramount Pictures : 37
## 1st Qu.: 93.0 NC-17 : 1 Warner Bros. Pictures : 30
## Median :103.0 PG :111 Sony Pictures Home Entertainment: 27
## Mean :106.5 PG-13 :131 Universal Pictures : 23
## 3rd Qu.:116.0 R :319 Warner Home Video : 19
## Max. :267.0 Unrated: 41 Miramax Films : 18
## (Other) :465
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1972 Min. : 1.000 Min. : 1.00 Min. :1991
## 1st Qu.:1991 1st Qu.: 4.000 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.000 Median :15.00 Median :2004
## Mean :1998 Mean : 6.733 Mean :14.43 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.000 3rd Qu.:22.00 3rd Qu.:2008
## Max. :2014 Max. :12.000 Max. :31.00 Max. :2015
##
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 183
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 5026
## Median : 6.000 Median :15.00 Median :6.600 Median : 16480
## Mean : 6.346 Mean :15.08 Mean :6.486 Mean : 60014
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 62507
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
##
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:131 Min. : 1.00 Spilled:264 Min. :11.00
## Fresh :195 1st Qu.: 33.00 Upright:355 1st Qu.:46.00
## Rotten :293 Median : 61.00 Median :65.00
## Mean : 57.43 Mean :62.21
## 3rd Qu.: 82.50 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :597 no :612 no :528 no :548 no :576
## yes: 22 yes: 7 yes: 91 yes: 71 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :604 Length:619 Length:619 Length:619
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:619 Length:619 Length:619
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:619 Length:619
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
This concludes 651 observations from the independent movie internet-resources such as Rotten Tomatoes and IMDB, random sampled. 651 movies, which are less than 10% of all movies of the cinematographic era and less than 10% of movies on both web resources. So, because data is got from 2 popular movie resources, we should keep in mind that the prediction model, based on the data from this resources, is applicable to make a prediction for them only. It could not be spread through entire population, except IMDB and Rotten Tomatoes web-site variability of which should be considered in a separate work.
During this work we’ll use Oscar Season and Summer Season, Drama and Featured film variables to predict movie popularity (audience_score) applying Bayesian Regression model
movies$feature_film<-ifelse(movies$title_type == "Feature Film",
c("yes"), c("no"))
movies$drama<-ifelse(movies$genre == "Drama",
c("yes"), c("no"))
movies$mpaa_rating_R<-ifelse(movies$mpaa_rating == "R",
c("yes"), c("no"))
movies$oscar_season<-ifelse(movies$thtr_rel_month > 9,
c("yes"), c("no"))
movies <- movies %>%
mutate(summer_season = ifelse( thtr_rel_month >4 & thtr_rel_month <9, "yes", "no"))
movies_now<-movies%>%
select(title_type,feature_film, genre, drama, mpaa_rating, mpaa_rating_R, thtr_rel_month,oscar_season,summer_season )
movies_now## # A tibble: 619 × 9
## title_type feature_film genre drama mpaa_rating
## <fctr> <chr> <fctr> <chr> <fctr>
## 1 Feature Film yes Drama yes R
## 2 Feature Film yes Drama yes PG-13
## 3 Feature Film yes Comedy no R
## 4 Feature Film yes Drama yes PG
## 5 Feature Film yes Horror no R
## 6 Feature Film yes Drama yes PG-13
## 7 Feature Film yes Drama yes R
## 8 Documentary no Documentary no Unrated
## 9 Feature Film yes Drama yes Unrated
## 10 Feature Film yes Action & Adventure no PG
## # ... with 609 more rows, and 4 more variables: mpaa_rating_R <chr>,
## # thtr_rel_month <dbl>, oscar_season <chr>, summer_season <chr>
Audience score
movies<-movies%>%
mutate(audience_score_log=log(audience_score))
movies## # A tibble: 619 × 38
## title title_type genre runtime
## <chr> <fctr> <fctr> <dbl>
## 1 Filly Brown Feature Film Drama 80
## 2 The Dish Feature Film Drama 101
## 3 Waiting for Guffman Feature Film Comedy 84
## 4 The Age of Innocence Feature Film Drama 139
## 5 Malevolence Feature Film Horror 90
## 6 Lady Jane Feature Film Drama 142
## 7 Mad Dog Time Feature Film Drama 93
## 8 Beauty Is Embarrassing Documentary Documentary 88
## 9 The Snowtown Murders Feature Film Drama 119
## 10 Superman II Feature Film Action & Adventure 127
## # ... with 609 more rows, and 34 more variables: mpaa_rating <fctr>,
## # studio <fctr>, thtr_rel_year <dbl>, thtr_rel_month <dbl>,
## # thtr_rel_day <dbl>, dvd_rel_year <dbl>, dvd_rel_month <dbl>,
## # dvd_rel_day <dbl>, imdb_rating <dbl>, imdb_num_votes <int>,
## # critics_rating <fctr>, critics_score <dbl>, audience_rating <fctr>,
## # audience_score <dbl>, best_pic_nom <fctr>, best_pic_win <fctr>,
## # best_actor_win <fctr>, best_actress_win <fctr>, best_dir_win <fctr>,
## # top200_box <fctr>, director <chr>, actor1 <chr>, actor2 <chr>,
## # actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## # rt_url <chr>, feature_film <chr>, drama <chr>, mpaa_rating_R <chr>,
## # oscar_season <chr>, summer_season <chr>, audience_score_log <dbl>
ggplot(movies, aes(x=audience_score_log))+geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Applying a logariphm function to audience_score, we can see the audience score has sharp left-skewed distribution and with its peak at ~ 4.3 (between 75-90).
To eliminate in advance any correlated variables, it’s a good idea to run a correlation analysis and remove any high correlated sectors.
movies_numeric <- movies %>%
select(audience_score,imdb_rating,imdb_num_votes,critics_score)
corr_mat <- cor(movies_numeric,use="complete.obs", method="pearson")
knitr::kable(corr_mat)| audience_score | imdb_rating | imdb_num_votes | critics_score | |
|---|---|---|---|---|
| audience_score | 1.0000000 | 0.8605425 | 0.3035904 | 0.7015256 |
| imdb_rating | 0.8605425 | 1.0000000 | 0.3476431 | 0.7619990 |
| imdb_num_votes | 0.3035904 | 0.3476431 | 1.0000000 | 0.2217208 |
| critics_score | 0.7015256 | 0.7619990 | 0.2217208 | 1.0000000 |
corrplot(corr_mat, type = "upper")Here we can see that all 4 variables are highly correlated, so to eliminate “heavy models” in the next steps, we chose the only one response variable for our model - audience score.
ggplot(data = movies, aes(x = audience_score, y = feature_film)) +
geom_point()f_film.lm=lm(audience_score~feature_film, data = movies)
summary(f_film.lm)##
## Call:
## lm(formula = audience_score ~ feature_film, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.543 -14.578 2.422 15.422 36.422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.543 2.853 28.929 < 2e-16 ***
## feature_filmyes -21.966 2.966 -7.407 4.27e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.35 on 617 degrees of freedom
## Multiple R-squared: 0.08166, Adjusted R-squared: 0.08017
## F-statistic: 54.86 on 1 and 617 DF, p-value: 4.266e-13
This variable has an estimated slope of -21.966 and estimate intercept 82.543. So, if the movie is featured film, it gets -21.966 points in rating. this correlation, according to the p-value, has an impact on movie rating formation. We can see one specific low case, which could be a potential outlier.
ggplot(data = movies, aes(x = audience_score, y = drama)) +
geom_point()drama.lm=lm(audience_score~drama, data = movies)
summary(drama.lm)##
## Call:
## lm(formula = audience_score ~ drama, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.289 -16.289 2.648 15.711 37.648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.352 1.115 53.238 < 2e-16 ***
## dramayes 5.937 1.607 3.695 0.00024 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.97 on 617 degrees of freedom
## Multiple R-squared: 0.02165, Adjusted R-squared: 0.02006
## F-statistic: 13.65 on 1 and 617 DF, p-value: 0.0002397
This variable has an estimated slope at 5,937 and estimate intercept at 59.352. So, if the movie has a drama genre, it gets 5,937 points in rating. this correlation, according to the p-value, has an impact on movie rating formation.
ggplot(data = movies, aes(x = audience_score, y = mpaa_rating_R)) +
geom_point()rating_R.lm=lm(audience_score~mpaa_rating_R, data = movies)
summary(rating_R.lm)##
## Call:
## lm(formula = audience_score ~ mpaa_rating_R, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.037 -16.037 2.627 17.627 34.627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.0367 1.1659 53.212 <2e-16 ***
## mpaa_rating_Ryes 0.3364 1.6240 0.207 0.836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.19 on 617 degrees of freedom
## Multiple R-squared: 6.953e-05, Adjusted R-squared: -0.001551
## F-statistic: 0.0429 on 1 and 617 DF, p-value: 0.836
This variable has an estimated slope at 0.3364, and estimate intercept at 62.0367 So, if the movie has a drama genre, it gets 0.3364 points in rating. this correlation, according to the p-value, has a low impact on movie rating formation.
ggplot(data = movies, aes(x = audience_score, y = oscar_season)) +
geom_point()oscar_s.lm=lm(audience_score~oscar_season, data = movies)
summary(oscar_s.lm)##
## Call:
## lm(formula = audience_score ~ oscar_season, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.860 -16.199 2.461 17.461 34.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5386 0.9614 64.010 <2e-16 ***
## oscar_seasonyes 2.3217 1.7878 1.299 0.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.17 on 617 degrees of freedom
## Multiple R-squared: 0.002726, Adjusted R-squared: 0.00111
## F-statistic: 1.686 on 1 and 617 DF, p-value: 0.1946
This variable has an estimated slope at 2.3217, and estimate intercept at 61.5386 So, if the movie has a drama genre, it gets 2.3217 points in rating. This correlation, according to the p-value, has a low impact on movie rating formation.
ggplot(data = movies, aes(x = audience_score, y = summer_season)) +
geom_point()summer_s.lm=lm(audience_score~summer_season, data = movies)
summary(summer_s.lm)##
## Call:
## lm(formula = audience_score ~ summer_season, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.851 -16.383 2.617 17.617 34.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.3828 0.9876 63.164 <2e-16 ***
## summer_seasonyes -0.5320 1.7332 -0.307 0.759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.19 on 617 degrees of freedom
## Multiple R-squared: 0.0001527, Adjusted R-squared: -0.001468
## F-statistic: 0.09423 on 1 and 617 DF, p-value: 0.759
This variable has an estimate slope at -0.5320, and estimate intercept at 62.3828 So, if the movie has a drama genre, it gets - 0.5320 points in rating. This correlation, according to p-value, has a low impact on movie rating formation.
Now we need to check that standard errors of the models are normally distributed with a constant variance. As with the frequentist approach, we check this assumption by examining the distribution of the residuals for the model. If the residuals are highly non-normal or skewed, the assumption is violated and any subsequent inference is not valid. We are using logarithmic values of each variable to show off any abnormalities.
ggplot(data = movies, aes(x = audience_score_log, y = feature_film)) +
geom_point()f_film.lm=lm(audience_score_log~feature_film, data = movies)
summary(f_film.lm)##
## Call:
## lm(formula = audience_score_log ~ feature_film, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63992 -0.20917 0.08932 0.29292 0.53690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.39437 0.05602 78.448 < 2e-16 ***
## feature_filmyes -0.35656 0.05822 -6.124 1.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3799 on 617 degrees of freedom
## Multiple R-squared: 0.0573, Adjusted R-squared: 0.05578
## F-statistic: 37.51 on 1 and 617 DF, p-value: 1.623e-09
plot(residuals(f_film.lm)~fitted(f_film.lm))
abline(h=0)hist(f_film.lm$residuals)qqnorm(f_film.lm$residuals)
qqline(f_film.lm$residuals) Knowing that the audience score variable is right-skewed, we can trace this abnormality in all plots. With the exception of one observation, the residual plot suggests that the linear regression is a reasonable approximation.
ggplot(data = movies, aes(x = audience_score_log, y = drama)) +
geom_point()drama.lm=lm(audience_score_log~drama, data = movies)
plot(residuals(drama.lm)~fitted(drama.lm))
abline(h=0)hist(drama.lm$residuals)qqnorm(drama.lm$residuals)
qqline(drama.lm$residuals) With the exception of one observation, the residual plot suggests that the linear regression is a reasonable approximation. But still, we see some potential data errors or outliers.
ggplot(data = movies, aes(x = audience_score_log, y = mpaa_rating_R)) +
geom_point()rating_R.lm=lm(audience_score_log~mpaa_rating_R, data = movies)
plot(residuals(rating_R.lm)~fitted(rating_R.lm))
abline(h=0)hist(rating_R.lm$residuals)qqnorm(rating_R.lm$residuals)
qqline(rating_R.lm$residuals) With the exception of one observation (upper right corner on QQ Plot), the residual plot suggests that the linear regression is a reasonable approximation. The diagnostic plot (QQ) fitting for checking residuals, shows that residual quantiles are plotted against their theoretical values which are expected with normally distributed data. Here again, we trace potential outliers or data errors.
ggplot(data = movies, aes(x = audience_score_log, y = oscar_season)) +
geom_point()summer_s.lm=lm(audience_score_log~oscar_season, data = movies)
plot(residuals(oscar_s.lm)~fitted(oscar_s.lm))
abline(h=0)hist(oscar_s.lm$residuals)qqnorm(oscar_s.lm$residuals)
qqline(oscar_s.lm$residuals) With the exception of one observation (upper right corner on QQ plot), the residual plot suggests that the linear regression is a reasonable approximation. The diagnostic plot (QQ) fitting for checking residuals, shows that residual quantiles are plotted against their theoretical values which are expected with normally distributed data.Also here we see some potential outliers, which could be the case of audience score variable.
ggplot(data = movies, aes(x = audience_score_log, y = summer_season)) +
geom_point()summer_s.lm=lm(audience_score_log~summer_season, data = movies)
plot(residuals(summer_s.lm)~fitted(summer_s.lm))
abline(h=0)hist(summer_s.lm$residuals)qqnorm(summer_s.lm$residuals)
qqline(summer_s.lm$residuals) With the exception of one observation (upper right corner on QQ plot), the residual plot suggests that the linear regression is a reasonable approximation. The diagnostic plot (QQ) fitting for checking residuals, shows that residual quantiles are plotted against theoretical values which are expected with normally distributed data. Potential outliers or data errors are traced here, too.
We need to create the different “clean” data frame base on movies, where we exclude variables which were used to create the new ones (drama, Oscar season, summer season, mpaa_rating_R) and not interesting variables like actor1, actor2, etc, and highly correlated to audience score variables which were discussed previously.
movies2<-movies%>%
select(title, runtime, feature_film, drama, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box, audience_score )We can start to build a multiple linear regration model by including all parameters from the renewed data set.
model_movie=lm(audience_score ~. -audience_score -title, data=movies2 )
model_movie##
## Call:
## lm(formula = audience_score ~ . - audience_score - title, data = movies2)
##
## Coefficients:
## (Intercept) runtime feature_filmyes
## 375.4030 0.1286 -28.1035
## dramayes mpaa_rating_Ryes thtr_rel_year
## 7.2844 2.0143 -0.1534
## oscar_seasonyes summer_seasonyes best_pic_nomyes
## -0.1181 1.3703 20.9336
## best_pic_winyes best_actor_winyes best_actress_winyes
## -3.1341 -1.8388 -2.7785
## best_dir_winyes top200_boxyes
## 4.2985 11.4685
A way to get around the problem with choosing an appropriate model is to implement Bayesian model averaging (BMA), in which multiple models are averaged to obtain posteriors of coefficients and predictions from new data. We can use this for either implementing BMA or selecting models.For model prior uniform distribution is chosen and the BIC is selected as a prior distribution because it provides highest Posterior Probabilities values among all others methods.
model_bma = bas.lm(audience_score ~. -audience_score -title , data = movies2,
prior = "BIC",
modelprior = uniform(),
method="MCMC")
model_bma##
## Call:
## bas.lm(formula = audience_score ~ . - audience_score - title, data = movies2, prior = "BIC", modelprior = uniform(), method = "MCMC")
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept runtime feature_filmyes
## 1.00000 0.88651 1.00000
## dramayes mpaa_rating_Ryes thtr_rel_year
## 0.99942 0.08647 0.41791
## oscar_seasonyes summer_seasonyes best_pic_nomyes
## 0.04280 0.06140 0.99971
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04036 0.05565 0.07672
## best_dir_winyes top200_boxyes
## 0.12779 0.38098
summary(model_bma)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000 1.0000000
## runtime 0.88650513 1.0000 1.000000 1.0000000
## feature_filmyes 1.00000000 1.0000 1.000000 1.0000000
## dramayes 0.99942017 1.0000 1.000000 1.0000000
## mpaa_rating_Ryes 0.08647461 0.0000 0.000000 0.0000000
## thtr_rel_year 0.41791382 0.0000 1.000000 0.0000000
## oscar_seasonyes 0.04280396 0.0000 0.000000 0.0000000
## summer_seasonyes 0.06140137 0.0000 0.000000 0.0000000
## best_pic_nomyes 0.99970703 1.0000 1.000000 1.0000000
## best_pic_winyes 0.04036255 0.0000 0.000000 0.0000000
## best_actor_winyes 0.05564575 0.0000 0.000000 0.0000000
## best_actress_winyes 0.07672119 0.0000 0.000000 0.0000000
## best_dir_winyes 0.12778931 0.0000 0.000000 0.0000000
## top200_boxyes 0.38098145 0.0000 0.000000 1.0000000
## BF NA 1.0000 0.744528 0.5955936
## PostProbs NA 0.1982 0.149200 0.1207000
## R2 NA 0.1912 0.198800 0.1982000
## dim NA 5.0000 6.000000 6.0000000
## logmarg NA -3799.2340 -3799.529049 -3799.7522408
## model 4 model 5
## Intercept 1.0000000 1.0000000
## runtime 1.0000000 1.0000000
## feature_filmyes 1.0000000 1.0000000
## dramayes 1.0000000 1.0000000
## mpaa_rating_Ryes 0.0000000 0.0000000
## thtr_rel_year 1.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## best_pic_nomyes 1.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 1.0000000
## top200_boxyes 1.0000000 0.0000000
## BF 0.3408385 0.1351441
## PostProbs 0.0696000 0.0273000
## R2 0.2050000 0.1943000
## dim 7.0000000 6.0000000
## logmarg -3800.3103906 -3801.2354577
Printing the model object and the summary command gives us both the posterior model inclusion probability for each variable and the most probable models. For example, the posterior probability that feature_filmes is included in the model is 1. Running the BAS function, we can see that the model 1 includes: runtime, best_pic_nom - yes,feature_film - yes, drama - yes. Posterior probability of this model is .2101 with R^2 - .1912 and Base Factor - 1
image(model_bma)pd<-par(mfrow = c(1,2))
coef_lwage = coefficients(model_bma)
coef_lwage##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 444 models
## post mean post SD post p(B != 0)
## Intercept 62.21002 0.73014 1.00000
## runtime 0.11995 0.05824 0.88651
## feature_filmyes -26.88250 2.90932 1.00000
## dramayes 7.20117 1.58547 0.99942
## mpaa_rating_Ryes 0.16366 0.69259 0.08647
## thtr_rel_year -0.06850 0.09221 0.41791
## oscar_seasonyes -0.01859 0.36878 0.04280
## summer_seasonyes 0.09502 0.54050 0.06140
## best_pic_nomyes 19.98468 4.25396 0.99971
## best_pic_winyes 0.03972 1.63768 0.04036
## best_actor_winyes -0.10303 0.66915 0.05565
## best_actress_winyes -0.20943 0.98386 0.07672
## best_dir_winyes 0.62193 1.96351 0.12779
## top200_boxyes 4.29951 6.24713 0.38098
plot(coef_lwage, subset = c(1,2,3,4), ask=FALSE) `
Applying visualization of the posterior distribution of the coefficients under the model averaging approach, we graph the posterior distribution of the coefficients o below. Here we can see that the only runtime represents a small probability of coefficient equals to 0.
model_movie=lm(audience_score ~ runtime + best_pic_nom + feature_film + drama, data = movies)
confint(model_movie)## 2.5 % 97.5 %
## (Intercept) 57.55539518 76.9189227
## runtime 0.06378568 0.2233382
## best_pic_nomyes 11.78346447 27.9421451
## feature_filmyes -31.97039264 -20.7221895
## dramayes 3.95310290 10.0038629
Recalling the problems of outliers, appeared in the beginning of EDA, after choosing the model with appropriate components, we should check the model again.
Plotting model against residuals shows fun-shape distribution which is explained by the fact that people mostly rate movies with high scores. To understand if they are outliers or just a data error, we could run Bayes.outlier.prob function, but I have troubles to find an appropriate package. That’s why we decided to go with model plotting.
plot(model_movie) Graph “Residuals vs leverage” shows outliers of the cases 179, 385 and 443. * * *
movie_2016<-data.frame(runtime = 123, best_pic_nom = "yes", feature_film = "no", drama = "no")
movie_2016 = predict(model_bma, estimator="BPM", se.fit=TRUE)
ci<-confint(movie_2016, estimator="BPM")We took Scuicide Sqade movie featured in 2016 and applied our prediction model which found the audience score of 59.35431 with CI (23.46142 , 95.24719) when the actual movie rating, according to Rotten Tomatoes is 62 https://www.rottentomatoes.com/m/suicide_squad_2016
In conclusion, we must say that during the data analysis we found a few outliers which could be explained by very negative reviews with the lowest rating score. We showed that people mostly rate movies with high ratings between 75 and 90 and having some ratings which are out of this diapason appeared on the graphs as outliers and mostly are exceptions rathers than the standard behavior of customers. During this work, we found out the audience score of Suicide Squad 59.35 when the real current score, according to Rotten Tomatoes is 62 which is not a big difference but still leaves some space for improvements.