library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)The full directory path information is requied becasue the R markdown file and the movies.Rdata file are saved in different directory from working directory for R.
load("~/Documents/Personal/Career Development/Coursera/Coursera-Statistics with R Specialization/04. Bayesian Statistics/movies.Rdata")The model fitted from this dataset can be generalized to predict the parameters of the sample population. First, the data is indepedent. According to Rotten Tomatoes and IMDB APIs, the dataset ‘movies’ contains data regarding 651 randomly sampled movies released before 2016, which is less than 10% of total number of movies released before 2016. Second, the sample size is large enough to assume nearly normal distribution. The causal conclusion can’t be drawm since the dataset is comprised of observational data. The relationship can be discussed instead.
str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movies)## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The data should be restructured for modelling.
First of all, create a new variable with 2 levels (yes if movie is featured film or no if not) from title_type and name it as feature_film.
movies$feature_film <- if_else(movies$title_type == "Feature Film", "yes", "no")Next, let’s create another variable related to genre. It will be called drama with levels yes (movies that are dramas) and no.
movies <- movies %>% mutate(drama = if_else(genre == "Drama", "yes", "no"))Keep creating enw variables for modelling using mpaa_rating and thtr_rel_month into mpaa_rating_R, oscar_season and summer_season. All new variables have 2 levels of yes' orno. Formpaa_rating_R, assignyesif the movie is rated R based on 'mpaa_rating variable. oscar_season and summer_season variable are based on thtr_rel_month. For oscar_season, place yes if the movie is released in October, November or December. For summer_season',yes` if the movie is released in months of May, June, July or August.
# `mpaa_rating_R` variable
movies <- movies %>% mutate(mpaa_rating_R = if_else(mpaa_rating == "R", "yes", "no"))
# `oscar_season` variable
movies <- movies %>% mutate(oscar_season = if_else(thtr_rel_month > 9, "yes", "no"))
# `summer_season` variable
movies <- movies %>% mutate(summer_season = if_else(thtr_rel_month > 4 & thtr_rel_month < 9, "yes", "no"))For the analysis purpose, the format of the new variables needs to be converted from character to factor
# Check the format and the summary of the new variables
str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 37 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## $ feature_film : chr "yes" "yes" "yes" "yes" ...
## $ drama : chr "yes" "yes" "no" "yes" ...
## $ mpaa_rating_R : chr "yes" "no" "yes" "no" ...
## $ oscar_season : chr "no" "no" "no" "yes" ...
## $ summer_season : chr "no" "no" "yes" "no" ...
summary(movies)## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url feature_film
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## drama mpaa_rating_R oscar_season
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## summer_season
## Length:651
## Class :character
## Mode :character
##
##
##
##
movies$feature_film <- as.factor(movies$feature_film)
movies$drama <- as.factor(movies$drama)
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R)
movies$oscar_season <-as.factor(movies$oscar_season)
movies$summer_season <- as.factor(movies$summer_season)
# Check the changes
str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 37 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## $ feature_film : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
## $ drama : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
## $ mpaa_rating_R : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
## $ oscar_season : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ summer_season : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
summary(movies)## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url feature_film drama
## Length:651 Length:651 no : 60 no :346
## Class :character Class :character yes:591 yes:305
## Mode :character Mode :character
##
##
##
##
## mpaa_rating_R oscar_season summer_season
## no :322 no :460 no :443
## yes:329 yes:191 yes:208
##
##
##
##
##
In addition to creating and adding new variables usefule for the modelling, omit some of variable and observations which are not relevant for the modelling. There are variables which are not meaningful for the modelling such as director, actor1~5 and website addresses. There was 1 observation with NA in runtime variable, which we have to omit from the dataset for the modelling.
# Taking out the unmeaningful varialbles.
movies_red <- subset(movies, select = c(runtime, thtr_rel_year, imdb_rating, imdb_num_votes,
critics_score, audience_score, best_pic_nom, best_pic_win,
best_actor_win, best_actress_win, best_dir_win, top200_box,
feature_film, drama, mpaa_rating_R, oscar_season, summer_season))
# Taking out `NA` observations.
movies_red <- na.omit(movies_red)It is important to explore the relationship between audience_score and the newly created variables since those newly added variables will be used as explanatory variables for the modelling. Understanding the relationship will provide insight for the model selection as well.
There are not much difference in the distributions between yes and no except for drama and feature_film. But, the difference in the distributions are larger in feature_film than drama. The distribution of audience_score is much narrower for not feature film than for feature film. The audience_score of not feature film concentrated in the higher score with possible outliers in the lower score. You may have to consider these possible outliers when developing model. The audience_score for drama is more right-skewed than for non-drama.
library(gridExtra)##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(movies_red, aes(feature_film, audience_score))+geom_boxplot()
p2 <- ggplot(movies_red, aes(drama, audience_score))+geom_boxplot()
p3 <- ggplot(movies_red, aes(mpaa_rating_R, audience_score))+geom_boxplot()
p4 <- ggplot(movies_red, aes(oscar_season, audience_score))+geom_boxplot()
p5 <- ggplot(movies_red, aes(summer_season, audience_score))+geom_boxplot()
grid.arrange(p1, p2, p3, p4, p5, ncol = 3)The non-feature film shows higher audience score than feature film with distribution concentrated highly toward high score.
movies_red %>%
group_by(feature_film) %>%
summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
Q75=quantile(audience_score, 0.75), Max=max(audience_score))## # A tibble: 2 x 7
## feature_film Min Q25 Median Mean Q75 Max
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 no 19 77.0 86 81.20339 89 96
## 2 yes 11 44.5 62 60.46531 78 97
print(p1)The drama gets higer approval from audience than non-drama. The drama is also right skewed toward high score but not as extreme as non-feature film.
movies_red %>%
group_by(drama) %>%
summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
Q75=quantile(audience_score, 0.75), Max=max(audience_score))## # A tibble: 2 x 7
## drama Min Q25 Median Mean Q75 Max
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 no 11 41 61 59.69565 79 97
## 2 yes 13 52 70 65.34754 80 95
print(p2)The distribution of audience_score for the other variables are much similar in terms of range of distribution with similar IQR and median.
movies_red %>%
group_by(mpaa_rating_R) %>%
summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
Q75=quantile(audience_score, 0.75), Max=max(audience_score))## # A tibble: 2 x 7
## mpaa_rating_R Min Q25 Median Mean Q75 Max
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 no 11 48 65 62.66044 80 96
## 2 yes 14 44 64 62.04255 79 97
print(p3)movies_red %>%
group_by(oscar_season) %>%
summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
Q75=quantile(audience_score, 0.75), Max=max(audience_score))## # A tibble: 2 x 7
## oscar_season Min Q25 Median Mean Q75 Max
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 no 11 46.00 64.0 61.81304 79 96
## 2 yes 13 47.25 68.5 63.64211 81 97
print(p4)movies_red %>%
group_by(summer_season) %>%
summarize(Min=min(audience_score), Q25=quantile(audience_score, 0.25), Median=median(audience_score), Mean=mean(audience_score),
Q75=quantile(audience_score, 0.75), Max=max(audience_score))## # A tibble: 2 x 7
## summer_season Min Q25 Median Mean Q75 Max
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 no 13 46.00 65.5 62.60181 80 97
## 2 yes 11 44.75 65.0 61.80769 78 94
print(p5)Since audience_score is our response, we would like to explore the relationship of the other variables as predictors. One possible, simplistic, explanation for the variation in audience_score that we see in the data is that greater critics get better audience score. The plot below visualizes a scatterplot between audience_score and `critics_score.
ggplot(movies_red, aes(x=critics_score, y=audience_score))+geom_point()This plot is rather noisy. While there may be a slight positive linear relationship between critics_score and audience_score, critics_score is at best a crude predictor of wages. We can quantify this by fitting a simple linear regression.
m_audience_critics <- lm(audience_score ~ critics_score, data = movies_red)
summary(m_audience_critics)##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.047 -9.580 0.711 10.418 43.545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.43408 1.27665 26.19 <2e-16 ***
## critics_score 0.50150 0.01987 25.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.38 on 648 degrees of freedom
## Multiple R-squared: 0.4958, Adjusted R-squared: 0.4951
## F-statistic: 637.3 on 1 and 648 DF, p-value: < 2.2e-16
The Bayesian model specification assumes that the errors are normally distributed with a constant variance. However, the residuals are normally distributed with right skewness but not with a constant variance.
ggplot(m_audience_critics, aes(x=fitted(m_audience_critics), y=residuals(m_audience_critics)))+geom_point()+geom_hline(yintercept = 0)plot(m_audience_critics, which = 2)One way to accommodate the right-skewness in the data is to (natural) log transform the dependent variable. So we need to create laudience_score which is transformed natural log of audience_score.
movies_red <- movies_red %>% mutate(laudience_score=log(audience_score))We can try to fit a linear model with log-audience_score as a dependence variable.
m_laudience_critics <- lm(laudience_score ~ critics_score, data = movies_red)
summary(m_laudience_critics)##
## Call:
## lm(formula = laudience_score ~ critics_score, data = movies_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15137 -0.13186 0.02633 0.17198 0.79799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5209026 0.0255006 138.07 <2e-16 ***
## critics_score 0.0094548 0.0003968 23.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2872 on 648 degrees of freedom
## Multiple R-squared: 0.467, Adjusted R-squared: 0.4662
## F-statistic: 567.7 on 1 and 648 DF, p-value: < 2.2e-16
Once again, check the assumption of the residuals being normally distrbuted with a constant variance. However the linear model with log is worse than without log in meeting the assupmtion of the residuals being normally distributed with a constant variance.
ggplot(m_laudience_critics, aes(x=fitted(m_laudience_critics), y=residuals(m_laudience_critics)))+geom_point()+geom_hline(yintercept = 0)plot(m_laudience_critics, which = 2)As results of the simple linear model with critics_score, we will use the audience_score as a dependent variable which is not with natural log. Other than critics_score, there are more predictors such as imdb_num_votes, audience_rating, ’top200_boxand so on, which can explain theaudience_score`. In this case, we have 2^n (n: number of predictors) models to choose from. So, Bayesian Model Averaging will provide a solution averaging multiple models to obtain posteriors of coefficients and predictions from new data.
The summary report provide with information regarding to the posterior inclusion probability for each variable and the most probable model. The model 1 is the most likely model with posterior inclusion porbaility of 0.1297, closely followed by model 2 with 0.1293. Considering the uniform prior probaility of 1/2^16 (number of models with 16 predictors), the posterior inclusion probability of 0.1297 is way much larger. The predictors included in model 1 are Intercept, runtime, imdb_rating and critics_score. The posterior probabilities that runtime, imdb_rating and critics_score are included in the model are 0.469, 1.0 and 0.888 respectively.
bma_aud <- bas.lm(audience_score ~ . -laudience_score, data = movies_red,
prior = "BIC",
modelprior = uniform())
bma_aud##
## Call:
## bas.lm(formula = audience_score ~ . - laudience_score, data = movies_red, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept runtime thtr_rel_year
## 1.00000 0.46971 0.09069
## imdb_rating imdb_num_votes critics_score
## 1.00000 0.05774 0.88855
## best_pic_nomyes best_pic_winyes best_actor_winyes
## 0.13119 0.03985 0.14435
## best_actress_winyes best_dir_winyes top200_boxyes
## 0.14128 0.06694 0.04762
## feature_filmyes dramayes mpaa_rating_Ryes
## 0.06537 0.04320 0.19984
## oscar_seasonyes summer_seasonyes
## 0.07506 0.08042
summary(bma_aud)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## runtime 0.46971477 1.0000 0.0000000 0.0000000
## thtr_rel_year 0.09068970 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05773502 0.0000 0.0000000 0.0000000
## critics_score 0.88855056 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13119140 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03984766 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14434896 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.14128087 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06693898 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04762234 0.0000 0.0000000 0.0000000
## feature_filmyes 0.06536947 0.0000 0.0000000 0.0000000
## dramayes 0.04319833 0.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.19984016 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07505684 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08042023 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9968489 0.2543185
## PostProbs NA 0.1297 0.1293000 0.0330000
## R2 NA 0.7549 0.7525000 0.7539000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3615.2791 -3615.2822108 -3616.6482224
## model 4 model 5
## Intercept 1.0000000 1.0000000
## runtime 0.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## BF 0.2521327 0.2391994
## PostProbs 0.0327000 0.0310000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg -3616.6568544 -3616.7095127
The coefficient with negative is reverse relationship with audience_score and vice-versa. The absoulte value of the coeffients represent the strength of the relationship. The larger the absolute value is, the stronger the relationship with audience_score is.
coeff_aud <- coefficients(bma_aud)We graph the posterior distribution of coefficients of predictors, which are included in model 1, under Bayesian model averaging.
par(mfrow = c(2,2))
plot(coeff_aud, subset = c(2, 4, 6), ask = FALSE)The following information is 95% credible interval for these coefficients.
confint(coeff_aud)## 2.5% 97.5% beta
## Intercept 6.157841e+01 6.312879e+01 6.234769e+01
## runtime -8.269274e-02 3.081946e-05 -2.567772e-02
## thtr_rel_year -5.342548e-02 1.890224e-04 -4.532635e-03
## imdb_rating 1.367993e+01 1.655751e+01 1.498203e+01
## imdb_num_votes -2.524976e-07 1.136800e-06 2.080713e-07
## critics_score 0.000000e+00 1.060466e-01 6.296648e-02
## best_pic_nomyes 0.000000e+00 4.894688e+00 5.068035e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes -2.611887e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.842719e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes -1.160611e+00 1.489674e-01 -1.195011e-01
## top200_boxyes 0.000000e+00 0.000000e+00 8.648185e-02
## feature_filmyes -1.032673e+00 4.369181e-02 -1.046908e-01
## dramayes 0.000000e+00 0.000000e+00 1.604413e-02
## mpaa_rating_Ryes -2.121643e+00 0.000000e+00 -3.036174e-01
## oscar_seasonyes -1.039402e+00 2.670058e-03 -8.034940e-02
## summer_seasonyes -2.940979e-03 1.044027e+00 8.704545e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
The audience_score will be predicted with best predictive model BPM for a new movie released in 2016. The new movies called “Accountant” released in 2016 starring Ben Afflect is selected for the prediction. The related information of the movies for the prediction was collected either from IMDB, Rotten Tomato or BoxOfficeMojo. The information are 128 mins for runtime, year 2016 for thtr_rel_year, 7.4 for imdb_rating, 196,667 for imdb_num_votes, 52 for critics_score, 77 for audience_score, nos for nomiation or win in Osacr, yes for top200_box, yes for feature_film, yes for drama, yes for mpaa_rating_R, yes for oscar_season and no for summer_season. Based on this information, will first create an data frame called Accountant with 18 colums and 2 rows before the prediction.
Accountant <- data_frame(runtime = 128, thtr_rel_year = 2016, imdb_rating =7.4, imdb_num_votes = 196667,
critics_score=52, audience_score=77, best_pic_nom="no", best_pic_win="no",
best_actor_win="no", best_actress_win="no", best_dir_win="no", top200_box="yes",
feature_film="yes", drama="yes", mpaa_rating_R="yes", oscar_season="yes",
summer_season="no", laudience_score=log(audience_score))The prediciton model with BPM includes thtr_rel_year, imdb_rating, best_pci_win, feature_film and mpaa_rating_R as predictors. Based on this model, the predicted audience_score is approximately 75 with 95% credible interval from 55 to 95 points.
BPM_pred_aud <- predict(bma_aud, Accountant, estimator = "BPM", se.fit = TRUE)
bma_aud$namesx[BPM_pred_aud$bestmodel+1]## [1] "Intercept" "thtr_rel_year" "imdb_rating"
## [4] "best_pic_winyes" "feature_filmyes" "mpaa_rating_Ryes"
confint(BPM_pred_aud, parm = "pred")## 2.5% 97.5% pred
## [1,] 54.91379 94.87586 74.89483
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
The prediction model with HPM uses the same predictors which we found the most liely model in Bayesian model averaging runtime, imdb_rating, and critics_score. So the final selection will be HPM (high probaility model) for the predicting new movie Accountant. Based on this model, the predicted audience_score of movie Accountant is 74 points with 95% credible interval between 55 to 94 points.
HPM_pred_aud <- predict(bma_aud, Accountant, estimator = "HPM", se.fit = TRUE)
bma_aud$namesx[HPM_pred_aud$bestmodel+1]## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
confint(HPM_pred_aud, parm = "pred")## 2.5% 97.5% pred
## [1,] 54.5907 94.1498 74.37025
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Bayesian model averaging is used to select the mst likely models with predictors having higher posterior probabilities in a way improving the posterior inclusion probability. We expected the the ratings from sites and critics may have impact on the perception of the audience. According to our results, we also found that popular movies seems to be shoter movies. The poularity of movie is more strongly influenced by critics_score as we expected, which is followed with large gap by imdb_rating. We have witnessed a couple of outliers in runtime variable which is one of predcitor. This study has not instestigated the effect of these outliers on the dependent variable. For futher refinement of the model, we need to address what to do with the outliers.