Introduction

The specific modeling task you need to complete is as follows: Develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables you will need to construct in the data manipulation section using the mutate function in dplyr:

Load Packages

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(statsr)
library(BAS)
## Warning: package 'BAS' was built under R version 4.0.3
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(grid)
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Load Data

load("movies.Rdata")
dim(movies)
## [1] 651  32
str(movies)
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movies)
##     title                  title_type                 genre        runtime     
##  Length:651         Documentary : 55   Drama             :305   Min.   : 39.0  
##  Class :character   Feature Film:591   Comedy            : 87   1st Qu.: 92.0  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65   Median :103.0  
##                                        Mystery & Suspense: 59   Mean   :105.8  
##                                        Documentary       : 52   3rd Qu.:115.8  
##                                        Horror            : 23   Max.   :267.0  
##                                        (Other)           : 60   NA's   :1      
##   mpaa_rating                               studio    thtr_rel_year 
##  G      : 19   Paramount Pictures              : 37   Min.   :1970  
##  NC-17  :  2   Warner Bros. Pictures           : 30   1st Qu.:1990  
##  PG     :118   Sony Pictures Home Entertainment: 27   Median :2000  
##  PG-13  :133   Universal Pictures              : 23   Mean   :1998  
##  R      :329   Warner Home Video               : 19   3rd Qu.:2007  
##  Unrated: 50   (Other)                         :507   Max.   :2014  
##                NA's                            :  8                 
##  thtr_rel_month   thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.00   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.00   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.74   Mean   :14.42   Mean   :2004   Mean   : 6.333  
##  3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.00   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                  NA's   :8      NA's   :8       
##   dvd_rel_day     imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 1.00   Min.   :1.900   Min.   :   180   Certified Fresh:135   
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546   Fresh          :209   
##  Median :15.00   Median :6.600   Median : 15116   Rotten         :307   
##  Mean   :15.01   Mean   :6.493   Mean   : 57533                         
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301                         
##  Max.   :31.00   Max.   :9.000   Max.   :893008                         
##  NA's   :8                                                              
##  critics_score    audience_rating audience_score  best_pic_nom best_pic_win
##  Min.   :  1.00   Spilled:275     Min.   :11.00   no :629      no :644     
##  1st Qu.: 33.00   Upright:376     1st Qu.:46.00   yes: 22      yes:  7     
##  Median : 61.00                   Median :65.00                            
##  Mean   : 57.69                   Mean   :62.36                            
##  3rd Qu.: 83.00                   3rd Qu.:80.00                            
##  Max.   :100.00                   Max.   :97.00                            
##                                                                            
##  best_actor_win best_actress_win best_dir_win top200_box   director        
##  no :558        no :579          no :608      no :636    Length:651        
##  yes: 93        yes: 72          yes: 43      yes: 15    Class :character  
##                                                          Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##     actor1             actor2             actor3             actor4         
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor5            imdb_url            rt_url         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
movies <- na.omit(movies)
dim(movies)
## [1] 619  32

Scope of Inference

Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Thus, the study is obsevational and only shows associational relationship.

The movies data from IMDB was used for the analysis at hand. Some of variables are in the original dataset provided, and others are new variables. This will need be to construct in the data manipulation section.

Data Manipulation

Create new variables using the mutate function in the dplyr package following guidelines:

movies <- mutate(movies, feature_film = ifelse(title_type == "Feature Film", "Yes", "No"))
movies$feature_film <- as.factor(movies$feature_film)
summary(movies$feature_film)
##  No Yes 
##  46 573
movies <- mutate(movies, drama = ifelse(genre == "Drama", "Yes", "No"))
movies$drama <- as.factor(movies$drama)
summary(movies$drama)
##  No Yes 
## 321 298
movies<- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating == "R", "Yes", "No"))
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R)
summary(movies$mpaa_rating_R)
##  No Yes 
## 300 319
movies <- mutate(movies, oscar_season = ifelse(thtr_rel_month %in% c(10,11,12), "Yes", "No"))
movies$oscar_season <- as.factor(movies$oscar_season)
summary(movies$oscar_season)
##  No Yes 
## 440 179
movies <- mutate(movies, summer_season = ifelse(thtr_rel_month %in% c(5,6,7,8), "Yes", "No"))
movies$summer_season <- as.factor(movies$summer_season)
summary(movies$summer_season)
##  No Yes 
## 418 201

create a data frame with the new variables

df <- movies[c("feature_film","drama","mpaa_rating_R","oscar_season","summer_season","audience_score")]

summary(df)
##  feature_film drama     mpaa_rating_R oscar_season summer_season
##  No : 46      No :321   No :300       No :440      No :418      
##  Yes:573      Yes:298   Yes:319       Yes:179      Yes:201      
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  audience_score 
##  Min.   :11.00  
##  1st Qu.:46.00  
##  Median :65.00  
##  Mean   :62.21  
##  3rd Qu.:80.00  
##  Max.   :97.00

Exploratory Data Analysis

Analysing Audience score

summary(df$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.21   80.00   97.00
IQR(df$audience_score)
## [1] 34
mean(df$audience_score)
## [1] 62.21002
ggplot(df, aes(x = audience_score, y = ..density..)) +
  geom_histogram(bins = 40, fill = 'blue', colour = 'black') + 
  geom_density(size = 1, colour = 'brown') 

comparing audience score with the rest of the data set

pf1 <- ggplot(df, aes(audience_score, fill = feature_film)) + 
  geom_density() + 
  ggtitle("Audience score vs. feature_film") + 
  labs(x = "feature film", y = "Density")

    
pf2 <- ggplot(df, aes(audience_score, fill = drama)) +
  geom_density () + 
  labs(title = "Audience score vs. drama") + 
  labs(x = "drama", y = "Density")


pf3 <- ggplot(movies, aes(audience_score, fill = top200_box))+ 
  geom_density () + 
  labs(title = "Audience score vs. top200_box") + 
  labs(x = "top200 box", y = "Density")

    
pf4 <- ggplot(df, aes(audience_score, fill = oscar_season)) + 
  geom_density() + 
  labs(title = "Audience score vs. oscar_season") +
  labs(x = "oscar season", y = "Density")


pf5 <- ggplot(df, aes(audience_score, fill = summer_season))+ 
  geom_density () + 
  labs(title = "Audience score vs. summer_season") + 
  labs(x = "summer season", y = "Density")

    
pf6 <- ggplot(movies, aes(audience_score, fill = best_pic_nom))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_pic_nom") + 
  labs(x = "best pic nom", y = "Density")


pf7 <- ggplot(movies, aes(audience_score, fill = best_pic_win)) + 
  geom_density() + 
  labs(title = "Audience score vs. best pic win") + 
  labs(x = "best pic win", y = "Density")

    
pf8 <- ggplot(movies, aes(audience_score, fill = best_actor_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_actor_win") + 
  labs(x = "best actor win", y = "Density")

  
pf9 <- ggplot(movies, aes(audience_score, fill = best_dir_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_dir_win") + 
  labs(x = "best dir win", y = "Density")

    
pf10 <- ggplot(movies, aes(audience_score, fill = best_actress_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_actress_win") + 
  labs(x = "best actress win", y = "Density")

   
grid.arrange(pf1, pf2, pf3, pf4, pf5, pf6, pf7, pf8, pf9, pf10, ncol = 2)

Modeling

use the bayes_inference function, which will allow us to construct credible intervals perform a hypothesis test and calculate Bayes factors for a variety of different circumstances. The main goal is to investigate if the newly created features(feature_film, drama, mpaa_rating_R, oscar_season and summer_season) influence the audience_score.

bayes_inference(y = audience_score, x = feature_film, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 1.212332e+13
## P(H1|data) = 0 
## P(H2|data) = 1

- Audience score with drama

bayes_inference(y = audience_score, x = drama, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 34.6357
## P(H1|data) = 0.0281 
## P(H2|data) = 0.9719

- Audience score with mpaa_rating_r

bayes_inference(y = audience_score, x = mpaa_rating_R, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 24.8392
## P(H1|data) = 0.9613 
## P(H2|data) = 0.0387

- Audience score with oscar_season

bayes_inference(y = audience_score, x = oscar_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 10.019
## P(H1|data) = 0.9092 
## P(H2|data) = 0.0908

- Audience Score with summer_season

bayes_inference(y = audience_score, x = summer_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 22.7623
## P(H1|data) = 0.9579 
## P(H2|data) = 0.0421

Backward Elimination we will use backwards elimination to pick significant predictors and first, we will start with full model.

data.model <- movies[c("feature_film","drama","runtime","mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating","imdb_num_votes","critics_score","best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box","audience_score")]

str(data.model)
## tibble [619 x 17] (S3: tbl_df/tbl/data.frame)
##  $ feature_film    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ drama           : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 2 1 ...
##  $ runtime         : num [1:619] 80 101 84 139 90 142 93 88 119 127 ...
##  $ mpaa_rating_R   : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 1 ...
##  $ thtr_rel_year   : num [1:619] 2013 2001 1996 1993 2004 ...
##  $ oscar_season    : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
##  $ summer_season   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 2 ...
##  $ imdb_rating     : num [1:619] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ imdb_num_votes  : int [1:619] 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ critics_score   : num [1:619] 45 96 91 80 33 57 17 90 83 89 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ audience_score  : num [1:619] 73 81 91 76 27 76 47 89 66 75 ...
##  - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
##   ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
lm1 <- lm(audience_score ~ ., data = data.model)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
## 
##                Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                       602   61285.24 2878.422
## 2      - top200_box  1   8.158011       603   61293.40 2876.504
## 3    - oscar_season  1   9.834264       604   61303.24 2874.604
## 4    - best_pic_win  1  43.432561       605   61346.67 2873.042
## 5    - best_dir_win  1  91.172934       606   61437.84 2871.961
## 6  - best_actor_win  1 125.157576       607   61563.00 2871.221
## 7   - summer_season  1 176.422141       608   61739.42 2870.992
## 8    - feature_film  1 188.009104       609   61927.43 2870.875
## 9           - drama  1 124.155090       610   62051.59 2870.114
## 10 - imdb_num_votes  1 167.442064       611   62219.03 2869.782
## 11  - thtr_rel_year  1 168.854308       612   62387.88 2869.460
bma_audience_score <- bas.lm(audience_score ~., data = data.model, prior = "BIC", modelprior = uniform())
bma_audience_score
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = data.model, prior = "BIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmYes             dramaYes  
##             1.00000              0.05876              0.04509  
##             runtime     mpaa_rating_RYes        thtr_rel_year  
##             0.51400              0.16498              0.08089  
##     oscar_seasonYes     summer_seasonYes          imdb_rating  
##             0.06526              0.07935              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06242              0.92016              0.13201  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04077              0.11565              0.14770  
##     best_dir_winyes        top200_boxyes  
##             0.06701              0.04876
summary(bma_audience_score)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmYes        0.05876309     0.0000     0.0000000     0.0000000
## dramaYes               0.04508592     0.0000     0.0000000     0.0000000
## runtime                0.51399873     1.0000     0.0000000     0.0000000
## mpaa_rating_RYes       0.16498090     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.08088668     0.0000     0.0000000     0.0000000
## oscar_seasonYes        0.06525993     0.0000     0.0000000     0.0000000
## summer_seasonYes       0.07935408     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06242230     0.0000     0.0000000     0.0000000
## critics_score          0.92015573     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13200908     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04076727     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11565473     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14770126     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06701259     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04875994     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1558     0.1358000     0.0319000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmYes         0.0000000     0.0000000
## dramaYes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_RYes        1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonYes         0.0000000     0.0000000
## summer_seasonYes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2039916     0.1851908
## PostProbs               0.0318000     0.0289000
## R2                      0.7496000     0.7495000
## dim                     5.0000000     5.0000000
## logmarg             -3436.3416317 -3436.4383237
image(bma_audience_score, rotate = FALSE)

We can see from the model rank that there are three variables that have high posterior odds which are runtime, imdb_rating, critic_score.

coef_bma_audience_score <- coef(bma_audience_score)
par(ask=F)
plot(coef_bma_audience_score)

Model for Prediction

finalmodel <- data.model[c("runtime","imdb_rating","critics_score","audience_score")]

bma_finalmodel <- bas.lm(audience_score ~., data = finalmodel, prior = "ZS-null", method = "MCMC", modelprior = uniform())

summary(bma_finalmodel)
##               P(B != 0 | Y)     model 1  model 2      model 3      model 4
## Intercept           1.00000   1.0000000   1.0000   1.00000000   1.00000000
## runtime             0.50625   0.0000000   1.0000   1.00000000   0.00000000
## imdb_rating         0.99375   1.0000000   1.0000   1.00000000   1.00000000
## critics_score       0.91875   1.0000000   1.0000   0.00000000   0.00000000
## BF                       NA   0.9928814   1.0000   0.09635143   0.08035048
## PostProbs                NA   0.4720000   0.4410   0.05590000   0.01860000
## R2                       NA   0.7455000   0.7483   0.74360000   0.74050000
## dim                      NA   3.0000000   4.0000   3.00000000   2.00000000
## logmarg                  NA 413.8748206 413.8820 411.54221157 411.36060745
##                     model 5
## Intercept      1.000000e+00
## runtime        0.000000e+00
## imdb_rating    0.000000e+00
## critics_score  0.000000e+00
## BF            1.792035e-180
## PostProbs      6.200000e-03
## R2             0.000000e+00
## dim            1.000000e+00
## logmarg        0.000000e+00

Prediction

Build test data cases for the movie “Black Panther (2018)” using the data gathered from IMDB (imdb_rating = 7.4) and rotten tomatoes website (audience_score = 79) and storing the data in the variable named blackpanther (test data case) using the following code

blackpanther <- data.frame(feature_film="yes",drama="no",runtime=135,mpaa_rating_R="no",thtr_rel_year=2018,oscar_season="no",summer_season="no",imdb_rating=7.4,imdb_num_votes=443501,critics_score=97,best_pic_nom="no",best_pic_win="no",best_actor_win="no",best_actress_win="no",best_dir_win="no",top200_box="yes",audience_score=79)

data.predict <- rbind(data.model, blackpanther)

blackpanther <- tail(blackpanther, 1)

str(blackpanther)
## 'data.frame':    1 obs. of  17 variables:
##  $ feature_film    : chr "yes"
##  $ drama           : chr "no"
##  $ runtime         : num 135
##  $ mpaa_rating_R   : chr "no"
##  $ thtr_rel_year   : num 2018
##  $ oscar_season    : chr "no"
##  $ summer_season   : chr "no"
##  $ imdb_rating     : num 7.4
##  $ imdb_num_votes  : num 443501
##  $ critics_score   : num 97
##  $ best_pic_nom    : chr "no"
##  $ best_pic_win    : chr "no"
##  $ best_actor_win  : chr "no"
##  $ best_actress_win: chr "no"
##  $ best_dir_win    : chr "no"
##  $ top200_box      : chr "yes"
##  $ audience_score  : num 79

We will predict the audience_score

audience_score_prediction <-predict(bma_finalmodel, newdata=blackpanther, estimator="BMA", se.fit=TRUE, interval="predict", level = 0.95)
audience_score_prediction$Ybma
##          [,1]
## [1,] 77.61483

The prediction is lower than the actual audience_score.

Conclusion

The project uses data from movies dataset to determine if there is any association between audience_score and other variables. Doing exploratory data analysis and modeling help us to know which variables are significant predictors. Yet, together with Bayes model, we can see different model and can pick the model that have the highest prediction.