Introduction

This project will focus on attributes make a movie popular as well as how much audiences and critics like movies and numerous other variables about the movies. The dataset was randomly collected from Rotten Tomatoes and IMDB in the US.

Set up

2.1 Load packages

2.2 Load data

load("movies.Rdata")

Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Thus, the study is obsevational and only shows associational relationship.

The movies data from IMDB was used for the analysis at hand. Some of variables are in the original dataset provided, and others are new variables. This will need be to constructed in the data manipulation section.

feature_film: ‘yes’ if title_type is Feature Film, ‘no’ otherwise, drama: “yes” if genre is Drama, “no” otherwise, mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise, runtime, thtr_rel_year, oscar_season: “yes” if movie is released in November, October, or December (based on, thtr_rel_month), “no” otherwise, summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box

#evaluate the size of the dataset
dim(movies)
## [1] 651  32
#types and summary of each variable 
str(movies)
## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movies)
##     title                  title_type                 genre        runtime     
##  Length:651         Documentary : 55   Drama             :305   Min.   : 39.0  
##  Class :character   Feature Film:591   Comedy            : 87   1st Qu.: 92.0  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65   Median :103.0  
##                                        Mystery & Suspense: 59   Mean   :105.8  
##                                        Documentary       : 52   3rd Qu.:115.8  
##                                        Horror            : 23   Max.   :267.0  
##                                        (Other)           : 60   NA's   :1      
##   mpaa_rating                               studio    thtr_rel_year 
##  G      : 19   Paramount Pictures              : 37   Min.   :1970  
##  NC-17  :  2   Warner Bros. Pictures           : 30   1st Qu.:1990  
##  PG     :118   Sony Pictures Home Entertainment: 27   Median :2000  
##  PG-13  :133   Universal Pictures              : 23   Mean   :1998  
##  R      :329   Warner Home Video               : 19   3rd Qu.:2007  
##  Unrated: 50   (Other)                         :507   Max.   :2014  
##                NA's                            :  8                 
##  thtr_rel_month   thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.00   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.00   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.74   Mean   :14.42   Mean   :2004   Mean   : 6.333  
##  3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.00   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                  NA's   :8      NA's   :8       
##   dvd_rel_day     imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 1.00   Min.   :1.900   Min.   :   180   Certified Fresh:135   
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546   Fresh          :209   
##  Median :15.00   Median :6.600   Median : 15116   Rotten         :307   
##  Mean   :15.01   Mean   :6.493   Mean   : 57533                         
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300                         
##  Max.   :31.00   Max.   :9.000   Max.   :893008                         
##  NA's   :8                                                              
##  critics_score    audience_rating audience_score  best_pic_nom best_pic_win
##  Min.   :  1.00   Spilled:275     Min.   :11.00   no :629      no :644     
##  1st Qu.: 33.00   Upright:376     1st Qu.:46.00   yes: 22      yes:  7     
##  Median : 61.00                   Median :65.00                            
##  Mean   : 57.69                   Mean   :62.36                            
##  3rd Qu.: 83.00                   3rd Qu.:80.00                            
##  Max.   :100.00                   Max.   :97.00                            
##                                                                            
##  best_actor_win best_actress_win best_dir_win top200_box   director        
##  no :558        no :579          no :608      no :636    Length:651        
##  yes: 93        yes: 72          yes: 43      yes: 15    Class :character  
##                                                          Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##     actor1             actor2             actor3             actor4         
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor5            imdb_url            rt_url         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

We see that there are some NA values for some variables. In this project, we will assume that the missing NA is not important and will not affect the result of the analysis so much that we should concern. So, we can eliminate all NA values

movies_no_na <- na.omit(movies)
dim(movies_no_na)
## [1] 619  32

The data has 619 observations and 37 variables.

####2.3 Data manipulation

feature_film: “yes” if title_type is Feature Film, “no” otherwise

movies_no_na <- mutate(movies_no_na, feature_film = ifelse(title_type == "Feature Film", "Yes", "No"))
movies_no_na$feature_film <- as.factor(movies_no_na$feature_film)
summary(movies_no_na$feature_film)
##  No Yes 
##  46 573

drama: “yes” if genre is Drama, “no” otherwise

movies_no_na <- mutate(movies_no_na, drama = ifelse(genre == "Drama", "Yes", "No"))
movies_no_na$drama <- as.factor(movies_no_na$drama)
summary(movies_no_na$drama)
##  No Yes 
## 321 298

mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise

movies_no_na <- mutate(movies_no_na, mpaa_rating_R = ifelse(mpaa_rating == "R", "Yes", "No"))
movies_no_na$mpaa_rating_R <- as.factor(movies_no_na$mpaa_rating_R)
summary(movies_no_na$mpaa_rating_R)
##  No Yes 
## 300 319

oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise

movies_no_na <- mutate(movies_no_na, oscar_season = ifelse(thtr_rel_month %in% c(10,11,12), "Yes", "No"))
movies_no_na$oscar_season <- as.factor(movies_no_na$oscar_season)
summary(movies_no_na$oscar_season)
##  No Yes 
## 440 179

summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise

movies_no_na <- mutate(movies_no_na, summer_season = ifelse(thtr_rel_month %in% c(5,6,7,8), "Yes", "No"))
movies_no_na$summer_season <- as.factor(movies_no_na$summer_season)
summary(movies_no_na$summer_season)
##  No Yes 
## 418 201
df <- movies_no_na[c("feature_film","drama","mpaa_rating_R","oscar_season","summer_season","audience_score")]
summary(df)
##  feature_film drama     mpaa_rating_R oscar_season summer_season
##  No : 46      No :321   No :300       No :440      No :418      
##  Yes:573      Yes:298   Yes:319       Yes:179      Yes:201      
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  audience_score 
##  Min.   :11.00  
##  1st Qu.:46.00  
##  Median :65.00  
##  Mean   :62.21  
##  3rd Qu.:80.00  
##  Max.   :97.00

####2.4Exploratory Data Analysis

First, we will overview the association of the audience score and feature film

summary(df$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.21   80.00   97.00
IQR(df$audience_score)
## [1] 34
mean(df$audience_score)
## [1] 62.21002

The IQR of the audience_score is 34 (1st Qu - 46 and 3rd Qu. 80), while the mean is around 62.2, the median is 65

ggplot(df, aes(x = audience_score, y = ..density..)) +
  geom_histogram(bins = 40, fill = 'blue', colour = 'black') + 
  geom_density(size = 1, colour = 'brown') 

The audience_score shows a slight left skewed structure.

We will create a boxplot for each feature and compare them with the audience_score. We will also analyse the variability of the new features by comparing them to each other.

p1 <- ggplot(df, aes(y=audience_score, x=feature_film, fill = feature_film)) +
  geom_boxplot() +
  ggtitle("Relationship between Audience Score and Feature Films") +
  xlab("Feature Films") +
  ylab("Audience Score") + 
  scale_fill_brewer(name = "feature film")
p2 <- ggplot(df, aes(x = summer_season, y = audience_score, fill = summer_season)) + 
  geom_boxplot() + 
  ggtitle('Audience score vs summer season') + 
  xlab('summer season') + 
  ylab('Audience Score') + scale_fill_brewer(name = "summer season")
p3 <- ggplot(df, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + 
  geom_boxplot() + 
  ggtitle('Audience score vs oscar_season') +
  xlab('oscar_season') + 
  ylab('Audience Score') + 
  scale_fill_brewer(name = "oscar_season")
p4 <- ggplot(df, aes(x = drama, y = audience_score, fill = drama)) + 
  geom_boxplot() +
  ggtitle('Audience score vs drama') + 
  xlab('drama') + 
  ylab('Audience Score') + 
  scale_fill_brewer(name = "drama")
p5 <- ggplot(df, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + 
  geom_boxplot() + 
  ggtitle('Audience score vs mpaa_rating_R') +
  xlab('mpaa_rating_R') + 
  ylab('Audience Score') + 
  scale_fill_brewer(name = "mpaa_rating_R")
# arrange the previously created plots 
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)

From above plots, we can see that:

Audience score for no feature film is higher than featur film. Drama movies have higher avergae audience score. Audience score for no mpaa_rating_R is slightly higher than mpaa_rating_R. Audience score for movies released in Nov, Oct or Dec has higher score than movies released in other months. Finally we will have a look at the other features used in the dataset

pf1 <- ggplot(df, aes(audience_score, fill = feature_film)) + 
  geom_density() + 
  ggtitle("Audience score vs. feature_film") + 
  labs(x = "feature film", y = "Density")
pf2 <- ggplot(df, aes(audience_score, fill = drama)) +
  geom_density () + 
  labs(title = "Audience score vs. drama") + 
  labs(x = "drama", y = "Density")
pf3 <- ggplot(movies_no_na, aes(audience_score, fill = top200_box))+ 
  geom_density () + 
  labs(title = "Audience score vs. top200_box") + 
  labs(x = "top200 box", y = "Density")
pf4 <- ggplot(df, aes(audience_score, fill = oscar_season)) + 
  geom_density() + 
  labs(title = "Audience score vs. oscar_season") +
  labs(x = "oscar season", y = "Density")
pf5 <- ggplot(df, aes(audience_score, fill = summer_season))+ 
  geom_density () + 
  labs(title = "Audience score vs. summer_season") + 
  labs(x = "summer season", y = "Density")
pf6 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_nom))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_pic_nom") + 
  labs(x = "best pic nom", y = "Density")
pf7 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_win)) + 
  geom_density() + 
  labs(title = "Audience score vs. best pic win") + 
  labs(x = "best pic win", y = "Density")
pf8 <- ggplot(movies_no_na, aes(audience_score, fill = best_actor_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_actor_win") + 
  labs(x = "best actor win", y = "Density")
pf9 <- ggplot(movies_no_na, aes(audience_score, fill = best_dir_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_dir_win") + 
  labs(x = "best dir win", y = "Density")
pf10 <- ggplot(movies_no_na, aes(audience_score, fill = best_actress_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_actress_win") + 
  labs(x = "best actress win", y = "Density")
grid.arrange(pf1, pf2, pf3, pf4, pf5, pf6, pf7, pf8, pf9, pf10, ncol = 2)

####2.5 Hypothesis testing

We will use the bayes_inference function, which will allow us to construct credible intervals perform a hypothesis test and calculate Bayes factors for a variety of different circumstances. The main goal is to investigate if the newly created features(feature_film, drama, mpaa_rating_R, oscar_season and summer_season) influence the audience_score.

bayes_inference(y = audience_score, x = feature_film, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 1.212332e+13
## P(H1|data) = 0 
## P(H2|data) = 1

bayes_inference(y = audience_score, x = drama, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 34.6357
## P(H1|data) = 0.0281 
## P(H2|data) = 0.9719

bayes_inference(y = audience_score, x = mpaa_rating_R, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 24.8392
## P(H1|data) = 0.9613 
## P(H2|data) = 0.0387

bayes_inference(y = audience_score, x = oscar_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 10.019
## P(H1|data) = 0.9092 
## P(H2|data) = 0.0908

bayes_inference(y = audience_score, x = summer_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 22.7623
## P(H1|data) = 0.9579 
## P(H2|data) = 0.0421

####Summary

feature_film: BF[H2:H1] = 14.6e+9 –> Evidence against H1 is Very Strong

drama: BF[H2:H1] = 50.77 –> Evidence against H1 is positive

mpaa_rating_R: BF[H1:H2] = 15.354 –> Evidence against H2 is Positive

oscar_season: BF[H1:H2] = 6.2077 –> Evidence against H2 is Positive

summer_season: BF[H1:H2] = 14.0392 –> Evidence against H2 is Positive

For feature_film, there is strong evidence against H1, which means that there is a significant difference in mean audience_score for feature- and non-feature films

2.6 Modeling

Based on the research question, we will just focus on the relationship between audience_score and variables mentioned in part 2. In this section, we will use backwards elimination to pick significant predictors and first, we will start with full model.

data.model <- movies_no_na[c("feature_film","drama","runtime","mpaa_rating_R","thtr_rel_year","imdb_rating","imdb_num_votes","critics_score","best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box","audience_score")]
str(data.model)
## tibble [619 × 15] (S3: tbl_df/tbl/data.frame)
##  $ feature_film    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ drama           : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 2 1 ...
##  $ runtime         : num [1:619] 80 101 84 139 90 142 93 88 119 127 ...
##  $ mpaa_rating_R   : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 1 ...
##  $ thtr_rel_year   : num [1:619] 2013 2001 1996 1993 2004 ...
##  $ imdb_rating     : num [1:619] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ imdb_num_votes  : int [1:619] 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ critics_score   : num [1:619] 45 96 91 80 33 57 17 90 83 89 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ audience_score  : num [1:619] 73 81 91 76 27 76 47 89 66 75 ...
##  - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
##   ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
lm1 <- lm(audience_score ~ ., data = data.model)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_pic_win + best_actor_win + best_actress_win + 
##     best_dir_win + top200_box
## 
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
## 
##               Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                      604   61461.67 2876.201
## 2     - top200_box  1   8.119148       605   61469.79 2874.283
## 3   - best_pic_win  1  40.723899       606   61510.51 2872.693
## 4   - best_dir_win  1  84.558038       607   61595.07 2871.544
## 5 - best_actor_win  1 144.351490       608   61739.42 2870.992
## 6   - feature_film  1 188.009104       609   61927.43 2870.875
## 7          - drama  1 124.155090       610   62051.59 2870.114
## 8 - imdb_num_votes  1 167.442064       611   62219.03 2869.782
## 9  - thtr_rel_year  1 168.854308       612   62387.88 2869.460

We do not see mpaa_rating_R in this model but summer_season, oscar_season are still included in this model. We will use Bayesian Model Average (BMA) next

bma_audience_score <- bas.lm(audience_score ~., data = data.model, prior = "BIC", modelprior = uniform())
bma_audience_score
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = data.model, prior = "BIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmYes             dramaYes  
##             1.00000              0.05829              0.04505  
##             runtime     mpaa_rating_RYes        thtr_rel_year  
##             0.52445              0.16532              0.08061  
##         imdb_rating       imdb_num_votes        critics_score  
##             1.00000              0.06301              0.92393  
##     best_pic_nomyes      best_pic_winyes    best_actor_winyes  
##             0.12969              0.04073              0.11716  
## best_actress_winyes      best_dir_winyes        top200_boxyes  
##             0.14758              0.06684              0.04878
summary(bma_audience_score)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmYes        0.05828654     0.0000     0.0000000     0.0000000
## dramaYes               0.04504589     0.0000     0.0000000     0.0000000
## runtime                0.52445351     1.0000     0.0000000     0.0000000
## mpaa_rating_RYes       0.16532487     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.08061489     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06300703     0.0000     0.0000000     0.0000000
## critics_score          0.92392984     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.12968919     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04072981     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11715515     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14758294     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06684245     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04877824     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1814     0.1581000     0.0371000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmYes         0.0000000     0.0000000
## dramaYes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_RYes        1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2039916     0.1851908
## PostProbs               0.0370000     0.0336000
## R2                      0.7496000     0.7495000
## dim                     5.0000000     5.0000000
## logmarg             -3436.3416317 -3436.4383237
image(bma_audience_score, rotate = FALSE)

We can see from the model rank that there are three variables that have high posterior odds which are runtime, imdb_rating, critic_score.

coef_bma_audience_score <- coef(bma_audience_score)
plot(coef_bma_audience_score)

Now we can build up a model for prediction:

finalmodel <- data.model[c("runtime","imdb_rating","critics_score","audience_score")]
bma_finalmodel <- bas.lm(audience_score ~., data = finalmodel, prior = "ZS-null", method = "MCMC", modelprior = uniform())
summary(bma_finalmodel)
##               P(B != 0 | Y)  model 1     model 2      model 3      model 4
## Intercept           1.00000   1.0000   1.0000000   1.00000000   1.00000000
## runtime             0.63750   1.0000   0.0000000   0.00000000   1.00000000
## imdb_rating         0.99375   1.0000   1.0000000   1.00000000   1.00000000
## critics_score       0.80000   1.0000   1.0000000   0.00000000   0.00000000
## BF                       NA   1.0000   0.9928814   0.08035048   0.09635143
## PostProbs                NA   0.5404   0.2547000   0.10560000   0.08700000
## R2                       NA   0.7483   0.7455000   0.74050000   0.74360000
## dim                      NA   4.0000   3.0000000   2.00000000   3.00000000
## logmarg                  NA 413.8820 413.8748206 411.36060745 411.54221157
##                     model 5
## Intercept      1.000000e+00
## runtime        0.000000e+00
## imdb_rating    0.000000e+00
## critics_score  0.000000e+00
## BF            1.792035e-180
## PostProbs      6.200000e-03
## R2             0.000000e+00
## dim            1.000000e+00
## logmarg        0.000000e+00

2.7. Prediction

Build test data cases for the movie “Black Panther (2018)” using the data gathered from IMDB (imdb_rating = 7.4) and rotten tomatoes website (audience_score = 79) and storing the data in the variable named blackpanther (test data case) using the following code

blackpanther <- data.frame(feature_film="yes",drama="no",runtime=135,mpaa_rating_R="no",thtr_rel_year=2018,imdb_rating=7.4,imdb_num_votes=443501,critics_score=97,best_pic_nom="no",best_pic_win="no",best_actor_win="no",best_actress_win="no",best_dir_win="no",top200_box="yes",audience_score=79)

data.predict <- rbind(data.model, blackpanther)
blackpanther <- tail(blackpanther, 1)
str(blackpanther)
## 'data.frame':    1 obs. of  15 variables:
##  $ feature_film    : chr "yes"
##  $ drama           : chr "no"
##  $ runtime         : num 135
##  $ mpaa_rating_R   : chr "no"
##  $ thtr_rel_year   : num 2018
##  $ imdb_rating     : num 7.4
##  $ imdb_num_votes  : num 443501
##  $ critics_score   : num 97
##  $ best_pic_nom    : chr "no"
##  $ best_pic_win    : chr "no"
##  $ best_actor_win  : chr "no"
##  $ best_actress_win: chr "no"
##  $ best_dir_win    : chr "no"
##  $ top200_box      : chr "yes"
##  $ audience_score  : num 79

We will predict the audience_score

audience_score_prediction <-predict(bma_finalmodel, newdata=blackpanther, estimator="BMA", se.fit=TRUE, interval="predict", level = 0.95)
audience_score_prediction$Ybma
##          [,1]
## [1,] 77.23458

The prediction is lower than the actual audience_score.

2.8 Conclusion

The project uses data from movies dataset to determine if there is any association between audience_score and other variables. Doing exploratory data analysis and modeling help us to know which variables are significant predictors. Yet, together with Bayes model, we can see different model and can pick the model that have the highest prediction.