Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

The full directory path information is requied becasue the R markdown file and the movies.Rdata file are saved in different directory from working directory for R.

load("~/Documents/Personal/Career Development/Coursera/Coursera-Statistics with R Specialization/04. Bayesian Statistics/movies.Rdata")

Part 1: Data

The model fitted from this dataset can be generalized to predict the parameters of the sample population. First, the data is indepedent. According to Rotten Tomatoes and IMDB APIs, the dataset ‘movies’ contains data regarding 651 randomly sampled movies released before 2016, which is less than 10% of total number of movies released before 2016. Second, the sample size is large enough to assume nearly normal distribution. The causal conclusion can’t be drawm since the dataset is comprised of observational data. The relationship can be discussed instead.

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Part 2: Data manipulation

The data should be restructured for modelling.

First of all, create a new variable with 2 levels (yes if movie is featured film or no if not) from title_type and name it as feature_film.

movies$feature_film <- if_else(movies$title_type == "Feature Film", "yes", "no")

Next, let’s create another variable related to genre. It will be called drama with levels yes (movies that are dramas) and no.

movies <- movies %>% mutate(drama = if_else(genre == "Drama", "yes", "no"))

Keep creating enw variables for modelling using mpaa_rating and thtr_rel_month into mpaa_rating_R, oscar_season and summer_season. All new variables have 2 levels of yes' orno. Formpaa_rating_R, assignyesif the movie is rated R based on 'mpaa_rating variable. oscar_season and summer_season variable are based on thtr_rel_month. For oscar_season, place yes if the movie is released in October, November or December. For summer_season',yes` if the movie is released in months of May, June, July or August.

# `mpaa_rating_R` variable
movies <- movies %>% mutate(mpaa_rating_R = if_else(mpaa_rating == "R", "yes", "no"))

# `oscar_season` variable
movies <- movies %>% mutate(oscar_season = if_else(thtr_rel_month > 9, "yes", "no"))

# `summer_season` variable
movies <- movies %>% mutate(summer_season = if_else(thtr_rel_month > 4 & thtr_rel_month < 9, "yes", "no"))

For the analysis purpose, the format of the new variables needs to be converted from character to factor

# Check the format and the summary of the new variables
str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  37 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
##  $ feature_film    : chr  "yes" "yes" "yes" "yes" ...
##  $ drama           : chr  "yes" "yes" "no" "yes" ...
##  $ mpaa_rating_R   : chr  "yes" "no" "yes" "no" ...
##  $ oscar_season    : chr  "no" "no" "no" "yes" ...
##  $ summer_season   : chr  "no" "no" "yes" "no" ...

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url          feature_film      
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     drama           mpaa_rating_R      oscar_season      
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  summer_season     
##  Length:651        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

movies$feature_film <- as.factor(movies$feature_film)
movies$drama <- as.factor(movies$drama)
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R)
movies$oscar_season <-as.factor(movies$oscar_season)
movies$summer_season <- as.factor(movies$summer_season)

# Check the changes
str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  37 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
##  $ feature_film    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
##  $ drama           : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
##  $ mpaa_rating_R   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
##  $ oscar_season    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ summer_season   : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url          feature_film drama    
##  Length:651         Length:651         no : 60      no :346  
##  Class :character   Class :character   yes:591      yes:305  
##  Mode  :character   Mode  :character                         
##                                                              
##                                                              
##                                                              
##                                                              
##  mpaa_rating_R oscar_season summer_season
##  no :322       no :460      no :443      
##  yes:329       yes:191      yes:208      
##                                          
##                                          
##                                          
##                                          
##

In addition to creating and adding new variables usefule for the modelling, omit some of variable and observations which are not relevant for the modelling. There are variables which are not meaningful for the modelling such as director, actor1~5 and website addresses. There was 1 observation with NA in runtime variable, which we have to omit from the dataset for the modelling.

# Taking out the unmeaningful varialbles.
movies_red <- subset(movies, select = c(runtime, thtr_rel_year, imdb_rating, imdb_num_votes,
                                        critics_score, audience_score, best_pic_nom, best_pic_win,
                                        best_actor_win, best_actress_win, best_dir_win, top200_box,
                                        feature_film, drama, mpaa_rating_R, oscar_season, summer_season))

# Taking out `NA` observations.
movies_red <- na.omit(movies_red)

Part 3: Exploratory data analysis

It is important to explore the relationship between audience_score and the newly created variables since those newly added variables will be used as explanatory variables for the modelling. Understanding the relationship will provide insight for the model selection as well.

There are not much difference in the distributions between yes and no except for drama and feature_film. But, the difference in the distributions are larger in feature_film than drama. The distribution of audience_score is much narrower for not feature film than for feature film. The audience_score of not feature film concentrated in the higher score with possible outliers in the lower score. You may have to consider these possible outliers when developing model. The audience_score for drama is more right-skewed than for non-drama.

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

p1 <- ggplot(movies_red, aes(feature_film, audience_score))+geom_boxplot()
p2 <- ggplot(movies_red, aes(drama, audience_score))+geom_boxplot()
p3 <- ggplot(movies_red, aes(mpaa_rating_R, audience_score))+geom_boxplot()
p4 <- ggplot(movies_red, aes(oscar_season, audience_score))+geom_boxplot()
p5 <- ggplot(movies_red, aes(summer_season, audience_score))+geom_boxplot()

grid.arrange(p1, p2, p3, p4, p5, ncol = 3)

The non-feature film shows higher audience score than feature film with distribution concentrated highly toward high score.

movies_red %>%
    group_by(feature_film) %>%
    summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
              Q75=quantile(audience_score, 0.75), Max=max(audience_score))

## # A tibble: 2 x 7
##   feature_film   Min   Q25 Median     Mean   Q75   Max
##         <fctr> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1           no    19  77.0     86 81.20339    89    96
## 2          yes    11  44.5     62 60.46531    78    97

print(p1)

The drama gets higer approval from audience than non-drama. The drama is also right skewed toward high score but not as extreme as non-feature film.

movies_red %>%
    group_by(drama) %>%
    summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
              Q75=quantile(audience_score, 0.75), Max=max(audience_score))

## # A tibble: 2 x 7
##    drama   Min   Q25 Median     Mean   Q75   Max
##   <fctr> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1     no    11    41     61 59.69565    79    97
## 2    yes    13    52     70 65.34754    80    95

print(p2)

The distribution of audience_score for the other variables are much similar in terms of range of distribution with similar IQR and median.

movies_red %>%
    group_by(mpaa_rating_R) %>%
    summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
              Q75=quantile(audience_score, 0.75), Max=max(audience_score))

## # A tibble: 2 x 7
##   mpaa_rating_R   Min   Q25 Median     Mean   Q75   Max
##          <fctr> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1            no    11    48     65 62.66044    80    96
## 2           yes    14    44     64 62.04255    79    97

print(p3)

movies_red %>%
    group_by(oscar_season) %>%
    summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
              Q75=quantile(audience_score, 0.75), Max=max(audience_score))

## # A tibble: 2 x 7
##   oscar_season   Min   Q25 Median     Mean   Q75   Max
##         <fctr> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1           no    11 46.00   64.0 61.81304    79    96
## 2          yes    13 47.25   68.5 63.64211    81    97

print(p4)

movies_red %>%
    group_by(summer_season) %>%
    summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
              Q75=quantile(audience_score, 0.75), Max=max(audience_score))

## # A tibble: 2 x 7
##   summer_season   Min   Q25 Median     Mean   Q75   Max
##          <fctr> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1            no    13 46.00   65.5 62.60181    80    97
## 2           yes    11 44.75   65.0 61.80769    78    94

print(p5)

Since audience_score is our response, we would like to explore the relationship of the other variables as predictors. One possible, simplistic, explanation for the variation in audience_score that we see in the data is that greater critics get better audience score. The plot below visualizes a scatterplot between audience_score and `critics_score.

ggplot(movies_red, aes(x=critics_score, y=audience_score))+geom_point()

This plot is rather noisy. While there may be a slight positive linear relationship between critics_score and audience_score, critics_score is at best a crude predictor of wages. We can quantify this by fitting a simple linear regression.

m_audience_critics <- lm(audience_score ~ critics_score, data = movies_red)
summary(m_audience_critics)

## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies_red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.047  -9.580   0.711  10.418  43.545 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.43408    1.27665   26.19   <2e-16 ***
## critics_score  0.50150    0.01987   25.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.38 on 648 degrees of freedom
## Multiple R-squared:  0.4958, Adjusted R-squared:  0.4951 
## F-statistic: 637.3 on 1 and 648 DF,  p-value: < 2.2e-16

The Bayesian model specification assumes that the errors are normally distributed with a constant variance. However, the residuals are normally distributed with right skewness but not with a constant variance.

ggplot(m_audience_critics, aes(x=fitted(m_audience_critics), y=residuals(m_audience_critics)))+geom_point()+geom_hline(yintercept = 0)

plot(m_audience_critics, which = 2)

One way to accommodate the right-skewness in the data is to (natural) log transform the dependent variable. So we need to create laudience_score which is transformed natural log of audience_score.

movies_red <- movies_red %>% mutate(laudience_score=log(audience_score))

We can try to fit a linear model with log-audience_score as a dependence variable.

m_laudience_critics <- lm(laudience_score ~ critics_score, data = movies_red)
summary(m_laudience_critics)

## 
## Call:
## lm(formula = laudience_score ~ critics_score, data = movies_red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15137 -0.13186  0.02633  0.17198  0.79799 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5209026  0.0255006  138.07   <2e-16 ***
## critics_score 0.0094548  0.0003968   23.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2872 on 648 degrees of freedom
## Multiple R-squared:  0.467,  Adjusted R-squared:  0.4662 
## F-statistic: 567.7 on 1 and 648 DF,  p-value: < 2.2e-16

Once again, check the assumption of the residuals being normally distrbuted with a constant variance. However the linear model with log is worse than without log in meeting the assupmtion of the residuals being normally distributed with a constant variance.

ggplot(m_laudience_critics, aes(x=fitted(m_laudience_critics), y=residuals(m_laudience_critics)))+geom_point()+geom_hline(yintercept = 0)

plot(m_laudience_critics, which = 2)

Part 4: Modeling

As results of the simple linear model with critics_score, we will use the audience_score as a dependent variable which is not with natural log. Other than critics_score, there are more predictors such as imdb_num_votes, audience_rating, ’top200_boxand so on, which can explain theaudience_score`. In this case, we have 2^n (n: number of predictors) models to choose from. So, Bayesian Model Averaging will provide a solution averaging multiple models to obtain posteriors of coefficients and predictions from new data.

The summary report provide with information regarding to the posterior inclusion probability for each variable and the most probable model. The model 1 is the most likely model with posterior inclusion porbaility of 0.1297, closely followed by model 2 with 0.1293. Considering the uniform prior probaility of 1/2^16 (number of models with 16 predictors), the posterior inclusion probability of 0.1297 is way much larger. The predictors included in model 1 are Intercept, runtime, imdb_rating and critics_score. The posterior probabilities that runtime, imdb_rating and critics_score are included in the model are 0.469, 1.0 and 0.888 respectively.

bma_aud <- bas.lm(audience_score ~ . -laudience_score, data = movies_red,
                  prior = "BIC",
                  modelprior = uniform())
bma_aud

## 
## Call:
## bas.lm(formula = audience_score ~ . - laudience_score, data = movies_red,     prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept              runtime        thtr_rel_year  
##             1.00000              0.46971              0.09069  
##         imdb_rating       imdb_num_votes        critics_score  
##             1.00000              0.05774              0.88855  
##     best_pic_nomyes      best_pic_winyes    best_actor_winyes  
##             0.13119              0.03985              0.14435  
## best_actress_winyes      best_dir_winyes        top200_boxyes  
##             0.14128              0.06694              0.04762  
##     feature_filmyes             dramayes     mpaa_rating_Ryes  
##             0.06537              0.04320              0.19984  
##     oscar_seasonyes     summer_seasonyes  
##             0.07506              0.08042

summary(bma_aud)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## runtime                 0.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127

The coefficient with negative is reverse relationship with audience_score and vice-versa. The absoulte value of the coeffients represent the strength of the relationship. The larger the absolute value is, the stronger the relationship with audience_score is.

coeff_aud <- coefficients(bma_aud)

We graph the posterior distribution of coefficients of predictors, which are included in model 1, under Bayesian model averaging.

par(mfrow = c(2,2))
plot(coeff_aud, subset = c(2, 4, 6), ask = FALSE)

The following information is 95% credible interval for these coefficients.

confint(coeff_aud)

##                              2.5%        97.5%          beta
## Intercept            6.157841e+01 6.312879e+01  6.234769e+01
## runtime             -8.269274e-02 3.081946e-05 -2.567772e-02
## thtr_rel_year       -5.342548e-02 1.890224e-04 -4.532635e-03
## imdb_rating          1.367993e+01 1.655751e+01  1.498203e+01
## imdb_num_votes      -2.524976e-07 1.136800e-06  2.080713e-07
## critics_score        0.000000e+00 1.060466e-01  6.296648e-02
## best_pic_nomyes      0.000000e+00 4.894688e+00  5.068035e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes   -2.611887e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.842719e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes     -1.160611e+00 1.489674e-01 -1.195011e-01
## top200_boxyes        0.000000e+00 0.000000e+00  8.648185e-02
## feature_filmyes     -1.032673e+00 4.369181e-02 -1.046908e-01
## dramayes             0.000000e+00 0.000000e+00  1.604413e-02
## mpaa_rating_Ryes    -2.121643e+00 0.000000e+00 -3.036174e-01
## oscar_seasonyes     -1.039402e+00 2.670058e-03 -8.034940e-02
## summer_seasonyes    -2.940979e-03 1.044027e+00  8.704545e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 5: Prediction

The audience_score will be predicted with best predictive model BPM for a new movie released in 2016. The new movies called “Accountant” released in 2016 starring Ben Afflect is selected for the prediction. The related information of the movies for the prediction was collected either from IMDB, Rotten Tomato or BoxOfficeMojo. The information are 128 mins for runtime, year 2016 for thtr_rel_year, 7.4 for imdb_rating, 196,667 for imdb_num_votes, 52 for critics_score, 77 for audience_score, nos for nomiation or win in Osacr, yes for top200_box, yes for feature_film, yes for drama, yes for mpaa_rating_R, yes for oscar_season and no for summer_season. Based on this information, will first create an data frame called Accountant with 18 colums and 2 rows before the prediction.

Accountant <- data_frame(runtime = 128, thtr_rel_year = 2016, imdb_rating =7.4, imdb_num_votes = 196667,
                         critics_score=52, audience_score=77, best_pic_nom="no", best_pic_win="no",
                         best_actor_win="no", best_actress_win="no", best_dir_win="no", top200_box="yes",
                         feature_film="yes", drama="yes", mpaa_rating_R="yes", oscar_season="yes",
                         summer_season="no", laudience_score=log(audience_score))

The prediciton model with BPM includes thtr_rel_year, imdb_rating, best_pci_win, feature_film and mpaa_rating_R as predictors. Based on this model, the predicted audience_score is approximately 75 with 95% credible interval from 55 to 95 points.

BPM_pred_aud <- predict(bma_aud, Accountant, estimator = "BPM", se.fit = TRUE)
bma_aud$namesx[BPM_pred_aud$bestmodel+1]

## [1] "Intercept"        "thtr_rel_year"    "imdb_rating"     
## [4] "best_pic_winyes"  "feature_filmyes"  "mpaa_rating_Ryes"

confint(BPM_pred_aud, parm = "pred")

##          2.5%    97.5%     pred
## [1,] 54.91379 94.87586 74.89483
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The prediction model with HPM uses the same predictors which we found the most liely model in Bayesian model averaging runtime, imdb_rating, and critics_score. So the final selection will be HPM (high probaility model) for the predicting new movie Accountant. Based on this model, the predicted audience_score of movie Accountant is 74 points with 95% credible interval between 55 to 94 points.

HPM_pred_aud <- predict(bma_aud, Accountant, estimator = "HPM", se.fit = TRUE)
bma_aud$namesx[HPM_pred_aud$bestmodel+1]

## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

confint(HPM_pred_aud, parm = "pred")

##         2.5%   97.5%     pred
## [1,] 54.5907 94.1498 74.37025
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 6: Discussion

Bayesian model averaging is used to select the mst likely models with predictors having higher posterior probabilities in a way improving the posterior inclusion probability. We expected the the ratings from sites and critics may have impact on the perception of the audience. According to our results, we also found that popular movies seems to be shoter movies. The poularity of movie is more strongly influenced by critics_score as we expected, which is followed with large gap by imdb_rating. We have witnessed a couple of outliers in runtime variable which is one of predcitor. This study has not instestigated the effect of these outliers on the dependent variable. For futher refinement of the model, we need to address what to do with the outliers.