Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(GGally)
library(devtools)
library(mvtnorm)
library(corrplot)

Load data

#load("movies.Rdata")
load("movies.Rdata")
movies<-na.omit(movies)
summary(movies)

##     title                  title_type                 genre    
##  Length:619         Documentary : 42   Drama             :298  
##  Class :character   Feature Film:573   Comedy            : 86  
##  Mode  :character   TV Movie    :  4   Action & Adventure: 62  
##                                        Mystery & Suspense: 56  
##                                        Documentary       : 40  
##                                        Horror            : 22  
##                                        (Other)           : 55  
##     runtime       mpaa_rating                               studio   
##  Min.   : 65.0   G      : 16   Paramount Pictures              : 37  
##  1st Qu.: 93.0   NC-17  :  1   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :111   Sony Pictures Home Entertainment: 27  
##  Mean   :106.5   PG-13  :131   Universal Pictures              : 23  
##  3rd Qu.:116.0   R      :319   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 41   Miramax Films                   : 18  
##                                (Other)                         :465  
##  thtr_rel_year  thtr_rel_month    thtr_rel_day    dvd_rel_year 
##  Min.   :1972   Min.   : 1.000   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1991   1st Qu.: 4.000   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.000   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.733   Mean   :14.43   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.000   3rd Qu.:22.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.000   Max.   :31.00   Max.   :2015  
##                                                                
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   183  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  5026  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 16480  
##  Mean   : 6.346   Mean   :15.08   Mean   :6.486   Mean   : 60014  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 62507  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##                                                                   
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:131    Min.   :  1.00   Spilled:264     Min.   :11.00  
##  Fresh          :195    1st Qu.: 33.00   Upright:355     1st Qu.:46.00  
##  Rotten         :293    Median : 61.00                   Median :65.00  
##                         Mean   : 57.43                   Mean   :62.21  
##                         3rd Qu.: 82.50                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :597      no :612      no :528        no :548          no :576     
##  yes: 22      yes:  7      yes: 91        yes: 71          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :604    Length:619         Length:619         Length:619        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:619         Length:619         Length:619        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:619         Length:619        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Part 1: Data

This concludes 651 observations from the independent movie internet-resources such as Rotten Tomatoes and IMDB, random sampled. 651 movies, which are less than 10% of all movies of the cinematographic era and less than 10% of movies on both web resources. So, because data is got from 2 popular movie resources, we should keep in mind that the prediction model, based on the data from this resources, is applicable to make a prediction for them only. It could not be spread through entire population, except IMDB and Rotten Tomatoes web-site variability of which should be considered in a separate work.

During this work we’ll use Oscar Season and Summer Season, Drama and Featured film variables to predict movie popularity (audience_score) applying Bayesian Regression model

Part 2: Data manipulation

movies$feature_film<-ifelse(movies$title_type == "Feature Film", 
c("yes"), c("no"))

movies$drama<-ifelse(movies$genre == "Drama", 
c("yes"), c("no"))

movies$mpaa_rating_R<-ifelse(movies$mpaa_rating == "R", 
c("yes"), c("no"))


movies$oscar_season<-ifelse(movies$thtr_rel_month > 9,  
c("yes"), c("no"))


movies <- movies %>%
  mutate(summer_season = ifelse( thtr_rel_month >4 & thtr_rel_month <9, "yes", "no"))
    
movies_now<-movies%>%
  select(title_type,feature_film, genre, drama, mpaa_rating, mpaa_rating_R, thtr_rel_month,oscar_season,summer_season )
movies_now

## # A tibble: 619 × 9
##      title_type feature_film              genre drama mpaa_rating
##          <fctr>        <chr>             <fctr> <chr>      <fctr>
## 1  Feature Film          yes              Drama   yes           R
## 2  Feature Film          yes              Drama   yes       PG-13
## 3  Feature Film          yes             Comedy    no           R
## 4  Feature Film          yes              Drama   yes          PG
## 5  Feature Film          yes             Horror    no           R
## 6  Feature Film          yes              Drama   yes       PG-13
## 7  Feature Film          yes              Drama   yes           R
## 8   Documentary           no        Documentary    no     Unrated
## 9  Feature Film          yes              Drama   yes     Unrated
## 10 Feature Film          yes Action & Adventure    no          PG
## # ... with 609 more rows, and 4 more variables: mpaa_rating_R <chr>,
## #   thtr_rel_month <dbl>, oscar_season <chr>, summer_season <chr>

Part 3: Exploratory data analysis

Audience score

movies<-movies%>%
  mutate(audience_score_log=log(audience_score))
movies

## # A tibble: 619 × 38
##                     title   title_type              genre runtime
##                     <chr>       <fctr>             <fctr>   <dbl>
## 1             Filly Brown Feature Film              Drama      80
## 2                The Dish Feature Film              Drama     101
## 3     Waiting for Guffman Feature Film             Comedy      84
## 4    The Age of Innocence Feature Film              Drama     139
## 5             Malevolence Feature Film             Horror      90
## 6               Lady Jane Feature Film              Drama     142
## 7            Mad Dog Time Feature Film              Drama      93
## 8  Beauty Is Embarrassing  Documentary        Documentary      88
## 9    The Snowtown Murders Feature Film              Drama     119
## 10            Superman II Feature Film Action & Adventure     127
## # ... with 609 more rows, and 34 more variables: mpaa_rating <fctr>,
## #   studio <fctr>, thtr_rel_year <dbl>, thtr_rel_month <dbl>,
## #   thtr_rel_day <dbl>, dvd_rel_year <dbl>, dvd_rel_month <dbl>,
## #   dvd_rel_day <dbl>, imdb_rating <dbl>, imdb_num_votes <int>,
## #   critics_rating <fctr>, critics_score <dbl>, audience_rating <fctr>,
## #   audience_score <dbl>, best_pic_nom <fctr>, best_pic_win <fctr>,
## #   best_actor_win <fctr>, best_actress_win <fctr>, best_dir_win <fctr>,
## #   top200_box <fctr>, director <chr>, actor1 <chr>, actor2 <chr>,
## #   actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## #   rt_url <chr>, feature_film <chr>, drama <chr>, mpaa_rating_R <chr>,
## #   oscar_season <chr>, summer_season <chr>, audience_score_log <dbl>

ggplot(movies, aes(x=audience_score_log))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Applying a logariphm function to audience_score, we can see the audience score has sharp left-skewed distribution and with its peak at ~ 4.3 (between 75-90).

To eliminate in advance any correlated variables, it’s a good idea to run a correlation analysis and remove any high correlated sectors.

movies_numeric <- movies %>%
  select(audience_score,imdb_rating,imdb_num_votes,critics_score)
corr_mat <- cor(movies_numeric,use="complete.obs", method="pearson")
knitr::kable(corr_mat)

	audience_score	imdb_rating	imdb_num_votes	critics_score
audience_score	1.0000000	0.8605425	0.3035904	0.7015256
imdb_rating	0.8605425	1.0000000	0.3476431	0.7619990
imdb_num_votes	0.3035904	0.3476431	1.0000000	0.2217208
critics_score	0.7015256	0.7619990	0.2217208	1.0000000

corrplot(corr_mat, type = "upper")

Here we can see that all 4 variables are highly correlated, so to eliminate “heavy models” in the next steps, we chose the only one response variable for our model - audience score.

Building correlation charts

ggplot(data = movies, aes(x = audience_score, y = feature_film)) +
  geom_point()

f_film.lm=lm(audience_score~feature_film, data = movies)
summary(f_film.lm)

## 
## Call:
## lm(formula = audience_score ~ feature_film, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.543 -14.578   2.422  15.422  36.422 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       82.543      2.853  28.929  < 2e-16 ***
## feature_filmyes  -21.966      2.966  -7.407 4.27e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.35 on 617 degrees of freedom
## Multiple R-squared:  0.08166,    Adjusted R-squared:  0.08017 
## F-statistic: 54.86 on 1 and 617 DF,  p-value: 4.266e-13

This variable has an estimated slope of -21.966 and estimate intercept 82.543. So, if the movie is featured film, it gets -21.966 points in rating. this correlation, according to the p-value, has an impact on movie rating formation. We can see one specific low case, which could be a potential outlier.

ggplot(data = movies, aes(x = audience_score, y = drama)) +
  geom_point()

drama.lm=lm(audience_score~drama, data = movies)
summary(drama.lm)

## 
## Call:
## lm(formula = audience_score ~ drama, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.289 -16.289   2.648  15.711  37.648 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   59.352      1.115  53.238  < 2e-16 ***
## dramayes       5.937      1.607   3.695  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.97 on 617 degrees of freedom
## Multiple R-squared:  0.02165,    Adjusted R-squared:  0.02006 
## F-statistic: 13.65 on 1 and 617 DF,  p-value: 0.0002397

This variable has an estimated slope at 5,937 and estimate intercept at 59.352. So, if the movie has a drama genre, it gets 5,937 points in rating. this correlation, according to the p-value, has an impact on movie rating formation.

ggplot(data = movies, aes(x = audience_score, y = mpaa_rating_R)) +
  geom_point()

rating_R.lm=lm(audience_score~mpaa_rating_R, data = movies)
summary(rating_R.lm)

## 
## Call:
## lm(formula = audience_score ~ mpaa_rating_R, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.037 -16.037   2.627  17.627  34.627 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       62.0367     1.1659  53.212   <2e-16 ***
## mpaa_rating_Ryes   0.3364     1.6240   0.207    0.836    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.19 on 617 degrees of freedom
## Multiple R-squared:  6.953e-05,  Adjusted R-squared:  -0.001551 
## F-statistic: 0.0429 on 1 and 617 DF,  p-value: 0.836

This variable has an estimated slope at 0.3364, and estimate intercept at 62.0367 So, if the movie has a drama genre, it gets 0.3364 points in rating. this correlation, according to the p-value, has a low impact on movie rating formation.

ggplot(data = movies, aes(x = audience_score, y = oscar_season)) +
  geom_point()

oscar_s.lm=lm(audience_score~oscar_season, data = movies)
summary(oscar_s.lm)

## 
## Call:
## lm(formula = audience_score ~ oscar_season, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.860 -16.199   2.461  17.461  34.461 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.5386     0.9614  64.010   <2e-16 ***
## oscar_seasonyes   2.3217     1.7878   1.299    0.195    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.17 on 617 degrees of freedom
## Multiple R-squared:  0.002726,   Adjusted R-squared:  0.00111 
## F-statistic: 1.686 on 1 and 617 DF,  p-value: 0.1946

This variable has an estimated slope at 2.3217, and estimate intercept at 61.5386 So, if the movie has a drama genre, it gets 2.3217 points in rating. This correlation, according to the p-value, has a low impact on movie rating formation.

ggplot(data = movies, aes(x = audience_score, y = summer_season)) +
  geom_point()

summer_s.lm=lm(audience_score~summer_season, data = movies)
summary(summer_s.lm)

## 
## Call:
## lm(formula = audience_score ~ summer_season, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.851 -16.383   2.617  17.617  34.617 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       62.3828     0.9876  63.164   <2e-16 ***
## summer_seasonyes  -0.5320     1.7332  -0.307    0.759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.19 on 617 degrees of freedom
## Multiple R-squared:  0.0001527,  Adjusted R-squared:  -0.001468 
## F-statistic: 0.09423 on 1 and 617 DF,  p-value: 0.759

This variable has an estimate slope at -0.5320, and estimate intercept at 62.3828 So, if the movie has a drama genre, it gets - 0.5320 points in rating. This correlation, according to p-value, has a low impact on movie rating formation.

Checking for residuals normal distribution

Now we need to check that standard errors of the models are normally distributed with a constant variance. As with the frequentist approach, we check this assumption by examining the distribution of the residuals for the model. If the residuals are highly non-normal or skewed, the assumption is violated and any subsequent inference is not valid. We are using logarithmic values of each variable to show off any abnormalities.

ggplot(data = movies, aes(x = audience_score_log, y = feature_film)) +
  geom_point()

f_film.lm=lm(audience_score_log~feature_film, data = movies)
summary(f_film.lm)

## 
## Call:
## lm(formula = audience_score_log ~ feature_film, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63992 -0.20917  0.08932  0.29292  0.53690 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.39437    0.05602  78.448  < 2e-16 ***
## feature_filmyes -0.35656    0.05822  -6.124 1.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3799 on 617 degrees of freedom
## Multiple R-squared:  0.0573, Adjusted R-squared:  0.05578 
## F-statistic: 37.51 on 1 and 617 DF,  p-value: 1.623e-09

plot(residuals(f_film.lm)~fitted(f_film.lm))
abline(h=0)

hist(f_film.lm$residuals)

qqnorm(f_film.lm$residuals)
qqline(f_film.lm$residuals)

Knowing that the audience score variable is right-skewed, we can trace this abnormality in all plots. With the exception of one observation, the residual plot suggests that the linear regression is a reasonable approximation.

ggplot(data = movies, aes(x = audience_score_log, y = drama)) +
  geom_point()

drama.lm=lm(audience_score_log~drama, data = movies)

plot(residuals(drama.lm)~fitted(drama.lm))
abline(h=0)

hist(drama.lm$residuals)

qqnorm(drama.lm$residuals)
qqline(drama.lm$residuals)

With the exception of one observation, the residual plot suggests that the linear regression is a reasonable approximation. But still, we see some potential data errors or outliers.

ggplot(data = movies, aes(x = audience_score_log, y = mpaa_rating_R)) +
  geom_point()

rating_R.lm=lm(audience_score_log~mpaa_rating_R, data = movies)

plot(residuals(rating_R.lm)~fitted(rating_R.lm))
abline(h=0)

hist(rating_R.lm$residuals)

qqnorm(rating_R.lm$residuals)
qqline(rating_R.lm$residuals)

With the exception of one observation (upper right corner on QQ Plot), the residual plot suggests that the linear regression is a reasonable approximation. The diagnostic plot (QQ) fitting for checking residuals, shows that residual quantiles are plotted against their theoretical values which are expected with normally distributed data. Here again, we trace potential outliers or data errors.

ggplot(data = movies, aes(x = audience_score_log, y = oscar_season)) +
  geom_point()

summer_s.lm=lm(audience_score_log~oscar_season, data = movies)

plot(residuals(oscar_s.lm)~fitted(oscar_s.lm))
abline(h=0)

hist(oscar_s.lm$residuals)

qqnorm(oscar_s.lm$residuals)
qqline(oscar_s.lm$residuals)

With the exception of one observation (upper right corner on QQ plot), the residual plot suggests that the linear regression is a reasonable approximation. The diagnostic plot (QQ) fitting for checking residuals, shows that residual quantiles are plotted against their theoretical values which are expected with normally distributed data.Also here we see some potential outliers, which could be the case of audience score variable.

ggplot(data = movies, aes(x = audience_score_log, y = summer_season)) +
  geom_point()

summer_s.lm=lm(audience_score_log~summer_season, data = movies)

plot(residuals(summer_s.lm)~fitted(summer_s.lm))
abline(h=0)

hist(summer_s.lm$residuals)

qqnorm(summer_s.lm$residuals)
qqline(summer_s.lm$residuals)

With the exception of one observation (upper right corner on QQ plot), the residual plot suggests that the linear regression is a reasonable approximation. The diagnostic plot (QQ) fitting for checking residuals, shows that residual quantiles are plotted against theoretical values which are expected with normally distributed data. Potential outliers or data errors are traced here, too.

We need to create the different “clean” data frame base on movies, where we exclude variables which were used to create the new ones (drama, Oscar season, summer season, mpaa_rating_R) and not interesting variables like actor1, actor2, etc, and highly correlated to audience score variables which were discussed previously.

movies2<-movies%>%
  select(title, runtime, feature_film, drama, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box, audience_score )

Part 4: Modeling

We can start to build a multiple linear regration model by including all parameters from the renewed data set.

model_movie=lm(audience_score ~. -audience_score -title, data=movies2 )
model_movie

## 
## Call:
## lm(formula = audience_score ~ . - audience_score - title, data = movies2)
## 
## Coefficients:
##         (Intercept)              runtime      feature_filmyes  
##            375.4030               0.1286             -28.1035  
##            dramayes     mpaa_rating_Ryes        thtr_rel_year  
##              7.2844               2.0143              -0.1534  
##     oscar_seasonyes     summer_seasonyes      best_pic_nomyes  
##             -0.1181               1.3703              20.9336  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             -3.1341              -1.8388              -2.7785  
##     best_dir_winyes        top200_boxyes  
##              4.2985              11.4685

A way to get around the problem with choosing an appropriate model is to implement Bayesian model averaging (BMA), in which multiple models are averaged to obtain posteriors of coefficients and predictions from new data. We can use this for either implementing BMA or selecting models.For model prior uniform distribution is chosen and the BIC is selected as a prior distribution because it provides highest Posterior Probabilities values among all others methods.

model_bma = bas.lm(audience_score ~. -audience_score -title , data = movies2,
                   prior = "BIC", 
                   modelprior = uniform(),
                   method="MCMC")
model_bma

## 
## Call:
## bas.lm(formula = audience_score ~ . - audience_score - title,     data = movies2, prior = "BIC", modelprior = uniform(), method = "MCMC")
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept              runtime      feature_filmyes  
##             1.00000              0.88651              1.00000  
##            dramayes     mpaa_rating_Ryes        thtr_rel_year  
##             0.99942              0.08647              0.41791  
##     oscar_seasonyes     summer_seasonyes      best_pic_nomyes  
##             0.04280              0.06140              0.99971  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04036              0.05565              0.07672  
##     best_dir_winyes        top200_boxyes  
##             0.12779              0.38098

summary(model_bma)

##                     P(B != 0 | Y)    model 1      model 2       model 3
## Intercept              1.00000000     1.0000     1.000000     1.0000000
## runtime                0.88650513     1.0000     1.000000     1.0000000
## feature_filmyes        1.00000000     1.0000     1.000000     1.0000000
## dramayes               0.99942017     1.0000     1.000000     1.0000000
## mpaa_rating_Ryes       0.08647461     0.0000     0.000000     0.0000000
## thtr_rel_year          0.41791382     0.0000     1.000000     0.0000000
## oscar_seasonyes        0.04280396     0.0000     0.000000     0.0000000
## summer_seasonyes       0.06140137     0.0000     0.000000     0.0000000
## best_pic_nomyes        0.99970703     1.0000     1.000000     1.0000000
## best_pic_winyes        0.04036255     0.0000     0.000000     0.0000000
## best_actor_winyes      0.05564575     0.0000     0.000000     0.0000000
## best_actress_winyes    0.07672119     0.0000     0.000000     0.0000000
## best_dir_winyes        0.12778931     0.0000     0.000000     0.0000000
## top200_boxyes          0.38098145     0.0000     0.000000     1.0000000
## BF                             NA     1.0000     0.744528     0.5955936
## PostProbs                      NA     0.1982     0.149200     0.1207000
## R2                             NA     0.1912     0.198800     0.1982000
## dim                            NA     5.0000     6.000000     6.0000000
## logmarg                        NA -3799.2340 -3799.529049 -3799.7522408
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## runtime                 1.0000000     1.0000000
## feature_filmyes         1.0000000     1.0000000
## dramayes                1.0000000     1.0000000
## mpaa_rating_Ryes        0.0000000     0.0000000
## thtr_rel_year           1.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## best_pic_nomyes         1.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     1.0000000
## top200_boxyes           1.0000000     0.0000000
## BF                      0.3408385     0.1351441
## PostProbs               0.0696000     0.0273000
## R2                      0.2050000     0.1943000
## dim                     7.0000000     6.0000000
## logmarg             -3800.3103906 -3801.2354577

Printing the model object and the summary command gives us both the posterior model inclusion probability for each variable and the most probable models. For example, the posterior probability that feature_filmes is included in the model is 1. Running the BAS function, we can see that the model 1 includes: runtime, best_pic_nom - yes,feature_film - yes, drama - yes. Posterior probability of this model is .2101 with R^2 - .1912 and Base Factor - 1

image(model_bma)

Credible intervals

pd<-par(mfrow = c(1,2))
coef_lwage = coefficients(model_bma)
coef_lwage

## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  444 models 
##                      post mean  post SD    post p(B != 0)
## Intercept             62.21002    0.73014    1.00000     
## runtime                0.11995    0.05824    0.88651     
## feature_filmyes      -26.88250    2.90932    1.00000     
## dramayes               7.20117    1.58547    0.99942     
## mpaa_rating_Ryes       0.16366    0.69259    0.08647     
## thtr_rel_year         -0.06850    0.09221    0.41791     
## oscar_seasonyes       -0.01859    0.36878    0.04280     
## summer_seasonyes       0.09502    0.54050    0.06140     
## best_pic_nomyes       19.98468    4.25396    0.99971     
## best_pic_winyes        0.03972    1.63768    0.04036     
## best_actor_winyes     -0.10303    0.66915    0.05565     
## best_actress_winyes   -0.20943    0.98386    0.07672     
## best_dir_winyes        0.62193    1.96351    0.12779     
## top200_boxyes          4.29951    6.24713    0.38098

plot(coef_lwage, subset = c(1,2,3,4), ask=FALSE)

`Applying visualization of the posterior distribution of the coefficients under the model averaging approach, we graph the posterior distribution of the coefficients o below. Here we can see that the only runtime represents a small probability of coefficient equals to 0.

model_movie=lm(audience_score ~ runtime + best_pic_nom + feature_film + drama, data = movies)
confint(model_movie)

##                        2.5 %      97.5 %
## (Intercept)      57.55539518  76.9189227
## runtime           0.06378568   0.2233382
## best_pic_nomyes  11.78346447  27.9421451
## feature_filmyes -31.97039264 -20.7221895
## dramayes          3.95310290  10.0038629

outliers

Recalling the problems of outliers, appeared in the beginning of EDA, after choosing the model with appropriate components, we should check the model again.

Plotting model against residuals shows fun-shape distribution which is explained by the fact that people mostly rate movies with high scores. To understand if they are outliers or just a data error, we could run Bayes.outlier.prob function, but I have troubles to find an appropriate package. That’s why we decided to go with model plotting.

plot(model_movie)

Graph “Residuals vs leverage” shows outliers of the cases 179, 385 and 443. * * *

Part 5: Prediction

movie_2016<-data.frame(runtime = 123, best_pic_nom = "yes", feature_film = "no", drama = "no")
movie_2016 =  predict(model_bma, estimator="BPM", se.fit=TRUE)
ci<-confint(movie_2016, estimator="BPM")

We took Scuicide Sqade movie featured in 2016 and applied our prediction model which found the audience score of 59.35431 with CI (23.46142 , 95.24719) when the actual movie rating, according to Rotten Tomatoes is 62 https://www.rottentomatoes.com/m/suicide_squad_2016

Part 6: Conclusion

In conclusion, we must say that during the data analysis we found a few outliers which could be explained by very negative reviews with the lowest rating score. We showed that people mostly rate movies with high ratings between 75 and 90 and having some ratings which are out of this diapason appeared on the graphs as outliers and mostly are exceptions rathers than the standard behavior of customers. During this work, we found out the audience score of Suicide Squad 59.35 when the real current score, according to Rotten Tomatoes is 62 which is not a big difference but still leaves some space for improvements.

Bayesian modeling and prediction for movies