Modeling and prediction for movies

1. Introduction
2. Set up
3 Exploratory Data Analysis
4. Hypothesis testing
5. Modeling
6. Prediction
7. Conclusion

1. Introduction

This project will focus on attributes make a movie popular as well as how much audiences and critics like movies and numerous other variables about the movies. The dataset was randomly collected from Rotten Tomatoes and IMDB in the US.

2. Set up

2.1 Load packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)
library(BAS)
library(MASS)
library(grid)
library(GGally)
library(gridExtra)

2.2 Load data

load("movies.Rdata")

Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Thus, the study is obsevational and only shows associational relationship.

The movies data from IMDB was used for the analysis at hand. Some of variables are in the original dataset provided, and others are new variables. This will need be to construct in the data manipulation section.

feature_film: ‘yes’ if title_type is Feature Film, ‘no’ otherwise
drama: “yes” if genre is Drama, “no” otherwise
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise
runtime
thtr_rel_year
oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box

# evaluate the size of the dataset
dim(movies)

## [1] 651  32

# types and summary of each variable 
str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

We see that there are some NA values for some variables. In this project, we will assume that the missing NA is not important and will not affect the result of the analysis so much that we should concern. So, we can eliminate all NA values

movies_no_na <- na.omit(movies)
dim(movies_no_na)

## [1] 619  32

The data has 619 observations and 37 variables.

2.3 Data manipulation

feature_film: “yes” if title_type is Feature Film, “no” otherwise

movies_no_na <- mutate(movies_no_na, feature_film = ifelse(title_type == "Feature Film", "Yes", "No"))
movies_no_na$feature_film <- as.factor(movies_no_na$feature_film)
summary(movies_no_na$feature_film)

##  No Yes 
##  46 573

drama: “yes” if genre is Drama, “no” otherwise

movies_no_na <- mutate(movies_no_na, drama = ifelse(genre == "Drama", "Yes", "No"))
movies_no_na$drama <- as.factor(movies_no_na$drama)
summary(movies_no_na$drama)

##  No Yes 
## 321 298

mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise

movies_no_na <- mutate(movies_no_na, mpaa_rating_R = ifelse(mpaa_rating == "R", "Yes", "No"))
movies_no_na$mpaa_rating_R <- as.factor(movies_no_na$mpaa_rating_R)
summary(movies_no_na$mpaa_rating_R)

##  No Yes 
## 300 319

oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise

movies_no_na <- mutate(movies_no_na, oscar_season = ifelse(thtr_rel_month %in% c(10,11,12), "Yes", "No"))
movies_no_na$oscar_season <- as.factor(movies_no_na$oscar_season)
summary(movies_no_na$oscar_season)

##  No Yes 
## 440 179

summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise

movies_no_na <- mutate(movies_no_na, summer_season = ifelse(thtr_rel_month %in% c(5,6,7,8), "Yes", "No"))
movies_no_na$summer_season <- as.factor(movies_no_na$summer_season)
summary(movies_no_na$summer_season)

##  No Yes 
## 418 201

df <- movies_no_na[c("feature_film","drama","mpaa_rating_R","oscar_season","summer_season","audience_score")]

summary(df)

##  feature_film drama     mpaa_rating_R oscar_season summer_season
##  No : 46      No :321   No :300       No :440      No :418      
##  Yes:573      Yes:298   Yes:319       Yes:179      Yes:201      
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  audience_score 
##  Min.   :11.00  
##  1st Qu.:46.00  
##  Median :65.00  
##  Mean   :62.21  
##  3rd Qu.:80.00  
##  Max.   :97.00

3 Exploratory Data Analysis

First, we will overview the association of the audience score and feature film

summary(df$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.21   80.00   97.00

IQR(df$audience_score)

## [1] 34

mean(df$audience_score)

## [1] 62.21002

The IQR of the audience_score is 34 (1st Qu - 46 and 3rd Qu. 80), while the mean is around 62.2, the median is 65

ggplot(df, aes(x = audience_score, y = ..density..)) +
  geom_histogram(bins = 40, fill = 'blue', colour = 'black') + 
  geom_density(size = 1, colour = 'brown')

The audience_score shows a slight left skewed structure.

We will create a boxplot for each feature and compare them with the audience_score. We will also analyse the variability of the new features by comparing them to each other.

p1 <- ggplot(df, aes(y=audience_score, x=feature_film, fill = feature_film)) +
  geom_boxplot() +
  ggtitle("Relationship between Audience Score and Feature Films") +
  xlab("Feature Films") +
  ylab("Audience Score") + 
  scale_fill_brewer(name = "feature film")


p2 <- ggplot(df, aes(x = summer_season, y = audience_score, fill = summer_season)) + 
  geom_boxplot() + 
  ggtitle('Audience score vs summer season') + 
  xlab('summer season') + 
  ylab('Audience Score') + scale_fill_brewer(name = "summer season")


p3 <- ggplot(df, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + 
  geom_boxplot() + 
  ggtitle('Audience score vs oscar_season') +
  xlab('oscar_season') + 
  ylab('Audience Score') + 
  scale_fill_brewer(name = "oscar_season")


p4 <- ggplot(df, aes(x = drama, y = audience_score, fill = drama)) + 
  geom_boxplot() +
  ggtitle('Audience score vs drama') + 
  xlab('drama') + 
  ylab('Audience Score') + 
  scale_fill_brewer(name = "drama")


p5 <- ggplot(df, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + 
  geom_boxplot() + 
  ggtitle('Audience score vs mpaa_rating_R') +
  xlab('mpaa_rating_R') + 
  ylab('Audience Score') + 
  scale_fill_brewer(name = "mpaa_rating_R")

# arrange the previously created plots 
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)

From above plots, we can see that:

Audience score for no feature film is higher than featur film.
Drama movies have higher avergae audience score.
Audience score for no mpaa_rating_R is slightly higher than mpaa_rating_R.
Audience score for movies released in Nov, Oct or Dec has higher score than movies released in other months.

Finally we will have a look at the other features used in the dataset

pf1 <- ggplot(df, aes(audience_score, fill = feature_film)) + 
  geom_density() + 
  ggtitle("Audience score vs. feature_film") + 
  labs(x = "feature film", y = "Density")

    
pf2 <- ggplot(df, aes(audience_score, fill = drama)) +
  geom_density () + 
  labs(title = "Audience score vs. drama") + 
  labs(x = "drama", y = "Density")


pf3 <- ggplot(movies_no_na, aes(audience_score, fill = top200_box))+ 
  geom_density () + 
  labs(title = "Audience score vs. top200_box") + 
  labs(x = "top200 box", y = "Density")

    
pf4 <- ggplot(df, aes(audience_score, fill = oscar_season)) + 
  geom_density() + 
  labs(title = "Audience score vs. oscar_season") +
  labs(x = "oscar season", y = "Density")


pf5 <- ggplot(df, aes(audience_score, fill = summer_season))+ 
  geom_density () + 
  labs(title = "Audience score vs. summer_season") + 
  labs(x = "summer season", y = "Density")

    
pf6 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_nom))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_pic_nom") + 
  labs(x = "best pic nom", y = "Density")


pf7 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_win)) + 
  geom_density() + 
  labs(title = "Audience score vs. best pic win") + 
  labs(x = "best pic win", y = "Density")

    
pf8 <- ggplot(movies_no_na, aes(audience_score, fill = best_actor_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_actor_win") + 
  labs(x = "best actor win", y = "Density")

  
pf9 <- ggplot(movies_no_na, aes(audience_score, fill = best_dir_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_dir_win") + 
  labs(x = "best dir win", y = "Density")

    
pf10 <- ggplot(movies_no_na, aes(audience_score, fill = best_actress_win))+ 
  geom_density () + 
  labs(title = "Audience score vs. best_actress_win") + 
  labs(x = "best actress win", y = "Density")

   
grid.arrange(pf1, pf2, pf3, pf4, pf5, pf6, pf7, pf8, pf9, pf10, ncol = 2)

4. Hypothesis testing

we will use the bayes_inference function, which will allow us to construct credible intervals perform a hypothesis test and calculate Bayes factors for a variety of different circumstances. The main goal is to investigate if the newly created features(feature_film, drama, mpaa_rating_R, oscar_season and summer_season) influence the audience_score.

bayes_inference(y = audience_score, x = feature_film, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors: P(H1) = 0.5  P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 14614401072
## P(H1|data) = 0 
## P(H2|data) = 1 
## 
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
## 
## 
## Posterior Summaries
##                       2.5%        25%       50%        75%       97.5%
## overall mean    68.5515270  70.411637  71.38884  72.383111   74.246452
## mu_No - mu_Yes  15.8806636  19.595658  21.55900  23.518128   27.218463
## sigma^2        336.1636955 360.773617 374.94964 389.500759  420.280587
## effect size      0.8109977   1.009448   1.11570   1.216969    1.411302
## n_0             13.7298943 159.386470 383.07153 775.309194 2142.883789
## 95% Cred. Int.: (15.8807 , 27.2185)

bayes_inference(y = audience_score, x = drama, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors: P(H1) = 0.5  P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 50.7705
## P(H1|data) = 0.0193 
## P(H2|data) = 0.9807 
## 
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
## 
## 
## Posterior Summaries
##                       2.5%         25%         50%          75%
## overall mean    60.7105099  61.7820198  62.3158053   62.8533773
## mu_No - mu_Yes  -9.0521102  -6.9365112  -5.8646481   -4.7746478
## sigma^2        358.2014534 384.1593371 398.9600947  414.7302604
## effect size     -0.4523203  -0.3476954  -0.2942123   -0.2389429
## n_0             29.3100817 317.9508340 780.6478066 1544.4094407
##                       97.5%
## overall mean     63.8919523
## mu_No - mu_Yes   -2.7044426
## sigma^2         446.1217457
## effect size      -0.1353839
## n_0            4142.7222153
## 95% Cred. Int.: (-9.0521 , -2.7044)

bayes_inference(y = audience_score, x = mpaa_rating_R, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors: P(H1) = 0.5  P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 15.354
## P(H1|data) = 0.9389 
## P(H2|data) = 0.0611 
## 
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
## 
## 
## Posterior Summaries
##                       2.5%          25%          50%          75%
## overall mean    60.6208708  61.66028575  62.21158440 6.275903e+01
## mu_No - mu_Yes  -3.5243613  -1.44083605  -0.34393238 7.396143e-01
## sigma^2        365.0174696 392.19005313 407.53617585 4.235716e+02
## effect size     -0.1757209  -0.07164857  -0.01703257 3.667843e-02
## n_0             29.6638881 352.10063692 858.36892112 1.705595e+03
##                       97.5%
## overall mean     63.7987863
## mu_No - mu_Yes    2.7445470
## sigma^2         456.2057888
## effect size       0.1371334
## n_0            4557.4941083
## 95% Cred. Int.: (-3.5244 , 2.7445)

bayes_inference(y = audience_score, x = oscar_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors: P(H1) = 0.5  P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 6.2077
## P(H1|data) = 0.8613 
## P(H2|data) = 0.1387 
## 
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
## 
## 
## Posterior Summaries
##                       2.5%        25%         50%           75%
## overall mean    60.9162681  62.092184  62.7002082   63.30879379
## mu_No - mu_Yes  -5.8310567  -3.511795  -2.2835284   -1.06811530
## sigma^2        364.9257678 391.105564 406.1201338  421.69427581
## effect size     -0.2894951  -0.173561  -0.1130275   -0.05312607
## n_0             33.1871629 355.907792 854.4906625 1671.06291154
##                       97.5%
## overall mean   6.445812e+01
## mu_No - mu_Yes 1.235673e+00
## sigma^2        4.539947e+02
## effect size    6.080273e-02
## n_0            4.507536e+03
## 95% Cred. Int.: (-5.8311 , 1.2357)

bayes_inference(y = audience_score, x = summer_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No  = mu_Yes
## H2: mu_No != mu_Yes
## 
## Priors: P(H1) = 0.5  P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 14.0392
## P(H1|data) = 0.9335 
## P(H2|data) = 0.0665 
## 
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
## 
## 
## Posterior Summaries
##                       2.5%          25%          50%          75%
## overall mean    60.3496334  61.50994512  62.09766475 6.268886e+01
## mu_No - mu_Yes  -2.8726836  -0.55635330   0.58209277 1.756742e+00
## sigma^2        366.5515586 392.60108984 407.92120298 4.236298e+02
## effect size     -0.1401389  -0.02767417   0.02892943 8.671627e-02
## n_0             29.5635751 336.31717094 834.54925726 1.703500e+03
##                       97.5%
## overall mean     63.7877739
## mu_No - mu_Yes    4.0059776
## sigma^2         455.9898100
## effect size       0.1980793
## n_0            4510.5593331
## 95% Cred. Int.: (-2.8727 , 4.006)

Summary

feature_film: BF[H2:H1] = 14.6e+9 –> Evidence against H1 is Very Strong

drama: BF[H2:H1] = 50.77 –> Evidence against H1 is positive

mpaa_rating_R: BF[H1:H2] = 15.354 –> Evidence against H2 is Positive

oscar_season: BF[H1:H2] = 6.2077 –> Evidence against H2 is Positive

summer_season: BF[H1:H2] = 14.0392 –> Evidence against H2 is Positive

For feature_film, there is strong evidence against H1, which means that there is a significant difference in mean audience_score for feature- and non-feature films

5. Modeling

Based on the research question, we will just focus on the relationship between audience_score and variables mentioned in part 2. In this section, we will use backwards elimination to pick significant predictors and first, we will start with full model.

data.model <- movies_no_na[c("feature_film","drama","runtime","mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating","imdb_num_votes","critics_score","best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box","audience_score")]

str(data.model)

## Classes 'tbl_df', 'tbl' and 'data.frame':    619 obs. of  17 variables:
##  $ feature_film    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ drama           : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 2 1 ...
##  $ runtime         : num  80 101 84 139 90 142 93 88 119 127 ...
##  $ mpaa_rating_R   : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 1 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ oscar_season    : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
##  $ summer_season   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 2 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ critics_score   : num  45 96 91 80 33 57 17 90 83 89 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ audience_score  : num  73 81 91 76 27 76 47 89 66 75 ...

lm1 <- lm(audience_score ~ ., data = data.model)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
## 
##                Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                       602   61285.24 2878.422
## 2      - top200_box  1   8.158011       603   61293.40 2876.504
## 3    - oscar_season  1   9.834264       604   61303.24 2874.604
## 4    - best_pic_win  1  43.432561       605   61346.67 2873.042
## 5    - best_dir_win  1  91.172934       606   61437.84 2871.961
## 6  - best_actor_win  1 125.157576       607   61563.00 2871.221
## 7   - summer_season  1 176.422141       608   61739.42 2870.992
## 8    - feature_film  1 188.009104       609   61927.43 2870.875
## 9           - drama  1 124.155090       610   62051.59 2870.114
## 10 - imdb_num_votes  1 167.442064       611   62219.03 2869.782
## 11  - thtr_rel_year  1 168.854308       612   62387.88 2869.460

We do not see mpaa_rating_R in this model but summer_season, oscar_season are still included in this model. We will use Bayesian Model Average (BMA) next

bma_audience_score <- bas.lm(audience_score ~., data = data.model, prior = "BIC", modelprior = uniform())
bma_audience_score

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = data.model, prior = "BIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmYes             dramaYes  
##             1.00000              0.05876              0.04509  
##             runtime     mpaa_rating_RYes        thtr_rel_year  
##             0.51400              0.16498              0.08089  
##     oscar_seasonYes     summer_seasonYes          imdb_rating  
##             0.06526              0.07935              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06242              0.92016              0.13201  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04077              0.11565              0.14770  
##     best_dir_winyes        top200_boxyes  
##             0.06701              0.04876

summary(bma_audience_score)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmYes        0.05876309     0.0000     0.0000000     0.0000000
## dramaYes               0.04508592     0.0000     0.0000000     0.0000000
## runtime                0.51399873     1.0000     0.0000000     0.0000000
## mpaa_rating_RYes       0.16498090     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.08088668     0.0000     0.0000000     0.0000000
## oscar_seasonYes        0.06525993     0.0000     0.0000000     0.0000000
## summer_seasonYes       0.07935408     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.06242230     0.0000     0.0000000     0.0000000
## critics_score          0.92015573     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13200908     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.04076727     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.11565473     0.0000     0.0000000     0.0000000
## best_actress_winyes    0.14770126     0.0000     0.0000000     1.0000000
## best_dir_winyes        0.06701259     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04875994     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238
## PostProbs                      NA     0.1558     0.1358000     0.0319000
## R2                             NA     0.7483     0.7455000     0.7470000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmYes         0.0000000     0.0000000
## dramaYes                0.0000000     0.0000000
## runtime                 1.0000000     1.0000000
## mpaa_rating_RYes        1.0000000     0.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonYes         0.0000000     0.0000000
## summer_seasonYes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     1.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2039916     0.1851908
## PostProbs               0.0318000     0.0289000
## R2                      0.7496000     0.7495000
## dim                     5.0000000     5.0000000
## logmarg             -3436.3416317 -3436.4383237

image(bma_audience_score, rotate = FALSE)

We can see from the model rank that there are three variables that have high posterior odds which are runtime, imdb_rating, critic_score.

coef_bma_audience_score <- coef(bma_audience_score)
plot(coef_bma_audience_score)

Now we can build up a model for prediction:

finalmodel <- data.model[c("runtime","imdb_rating","critics_score","audience_score")]

bma_finalmodel <- bas.lm(audience_score ~., data = finalmodel, prior = "ZS-null", method = "MCMC", modelprior = uniform())

summary(bma_finalmodel)

##               P(B != 0 | Y)     model 1  model 2      model 3      model 4
## Intercept           1.00000   1.0000000   1.0000   1.00000000   1.00000000
## runtime             0.48125   0.0000000   1.0000   0.00000000   1.00000000
## imdb_rating         1.00000   1.0000000   1.0000   1.00000000   1.00000000
## critics_score       0.90000   1.0000000   1.0000   0.00000000   0.00000000
## BF                       NA   0.9928814   1.0000   0.08035048   0.09635143
## PostProbs                NA   0.4534000   0.4410   0.06210000   0.03730000
## R2                       NA   0.7455000   0.7483   0.74050000   0.74360000
## dim                      NA   3.0000000   4.0000   2.00000000   3.00000000
## logmarg                  NA 413.8748206 413.8820 411.36060745 411.54221157
##                     model 5
## Intercept      1.000000e+00
## runtime        0.000000e+00
## imdb_rating    0.000000e+00
## critics_score  0.000000e+00
## BF            1.792035e-180
## PostProbs      6.200000e-03
## R2             0.000000e+00
## dim            1.000000e+00
## logmarg        0.000000e+00

6. Prediction

Build test data cases for the movie “Black Panther (2018)” using the data gathered from IMDB (imdb_rating = 7.4) and rotten tomatoes website (audience_score = 79) and storing the data in the variable named blackpanther (test data case) using the following code

blackpanther <- data.frame(feature_film="yes",drama="no",runtime=135,mpaa_rating_R="no",thtr_rel_year=2018,oscar_season="no",summer_season="no",imdb_rating=7.4,imdb_num_votes=443501,critics_score=97,best_pic_nom="no",best_pic_win="no",best_actor_win="no",best_actress_win="no",best_dir_win="no",top200_box="yes",audience_score=79)

data.predict <- rbind(data.model, blackpanther)

blackpanther <- tail(blackpanther, 1)

str(blackpanther)

## 'data.frame':    1 obs. of  17 variables:
##  $ feature_film    : Factor w/ 1 level "yes": 1
##  $ drama           : Factor w/ 1 level "no": 1
##  $ runtime         : num 135
##  $ mpaa_rating_R   : Factor w/ 1 level "no": 1
##  $ thtr_rel_year   : num 2018
##  $ oscar_season    : Factor w/ 1 level "no": 1
##  $ summer_season   : Factor w/ 1 level "no": 1
##  $ imdb_rating     : num 7.4
##  $ imdb_num_votes  : num 443501
##  $ critics_score   : num 97
##  $ best_pic_nom    : Factor w/ 1 level "no": 1
##  $ best_pic_win    : Factor w/ 1 level "no": 1
##  $ best_actor_win  : Factor w/ 1 level "no": 1
##  $ best_actress_win: Factor w/ 1 level "no": 1
##  $ best_dir_win    : Factor w/ 1 level "no": 1
##  $ top200_box      : Factor w/ 1 level "yes": 1
##  $ audience_score  : num 79

We will predict the audience_score

audience_score_prediction <-predict(bma_finalmodel, newdata=blackpanther, estimator="BMA", se.fit=TRUE, interval="predict", level = 0.95)
audience_score_prediction$Ybma

##          [,1]
## [1,] 77.66515

The prediction is lower than the actual audience_score.

7. Conclusion

The project uses data from movies dataset to determine if there is any association between audience_score and other variables. Doing exploratory data analysis and modeling help us to know which variables are significant predictors. Yet, together with Bayes model, we can see different model and can pick the model that have the highest prediction.