This project will focus on attributes make a movie popular as well as how much audiences and critics like movies and numerous other variables about the movies. The dataset was randomly collected from Rotten Tomatoes and IMDB in the US.
library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)
library(BAS)
library(MASS)
library(grid)
library(GGally)
library(gridExtra)
load("movies.Rdata")
Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Thus, the study is obsevational and only shows associational relationship.
The movies data from IMDB was used for the analysis at hand. Some of variables are in the original dataset provided, and others are new variables. This will need be to construct in the data manipulation section.
# evaluate the size of the dataset
dim(movies)
## [1] 651 32
# types and summary of each variable
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movies)
## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
We see that there are some NA values for some variables. In this project, we will assume that the missing NA is not important and will not affect the result of the analysis so much that we should concern. So, we can eliminate all NA values
movies_no_na <- na.omit(movies)
dim(movies_no_na)
## [1] 619 32
The data has 619 observations and 37 variables.
feature_film
: “yes” if title_type is Feature Film, “no” otherwise
movies_no_na <- mutate(movies_no_na, feature_film = ifelse(title_type == "Feature Film", "Yes", "No"))
movies_no_na$feature_film <- as.factor(movies_no_na$feature_film)
summary(movies_no_na$feature_film)
## No Yes
## 46 573
drama
: “yes” if genre is Drama, “no” otherwise
movies_no_na <- mutate(movies_no_na, drama = ifelse(genre == "Drama", "Yes", "No"))
movies_no_na$drama <- as.factor(movies_no_na$drama)
summary(movies_no_na$drama)
## No Yes
## 321 298
mpaa_rating_R
: “yes” if mpaa_rating is R, “no” otherwise
movies_no_na <- mutate(movies_no_na, mpaa_rating_R = ifelse(mpaa_rating == "R", "Yes", "No"))
movies_no_na$mpaa_rating_R <- as.factor(movies_no_na$mpaa_rating_R)
summary(movies_no_na$mpaa_rating_R)
## No Yes
## 300 319
oscar_season
: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
movies_no_na <- mutate(movies_no_na, oscar_season = ifelse(thtr_rel_month %in% c(10,11,12), "Yes", "No"))
movies_no_na$oscar_season <- as.factor(movies_no_na$oscar_season)
summary(movies_no_na$oscar_season)
## No Yes
## 440 179
summer_season
: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
movies_no_na <- mutate(movies_no_na, summer_season = ifelse(thtr_rel_month %in% c(5,6,7,8), "Yes", "No"))
movies_no_na$summer_season <- as.factor(movies_no_na$summer_season)
summary(movies_no_na$summer_season)
## No Yes
## 418 201
df <- movies_no_na[c("feature_film","drama","mpaa_rating_R","oscar_season","summer_season","audience_score")]
summary(df)
## feature_film drama mpaa_rating_R oscar_season summer_season
## No : 46 No :321 No :300 No :440 No :418
## Yes:573 Yes:298 Yes:319 Yes:179 Yes:201
##
##
##
##
## audience_score
## Min. :11.00
## 1st Qu.:46.00
## Median :65.00
## Mean :62.21
## 3rd Qu.:80.00
## Max. :97.00
First, we will overview the association of the audience score and feature film
summary(df$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.21 80.00 97.00
IQR(df$audience_score)
## [1] 34
mean(df$audience_score)
## [1] 62.21002
The IQR of the audience_score is 34 (1st Qu - 46 and 3rd Qu. 80), while the mean is around 62.2, the median is 65
ggplot(df, aes(x = audience_score, y = ..density..)) +
geom_histogram(bins = 40, fill = 'blue', colour = 'black') +
geom_density(size = 1, colour = 'brown')
The audience_score shows a slight left skewed structure.
We will create a boxplot for each feature and compare them with the audience_score. We will also analyse the variability of the new features by comparing them to each other.
p1 <- ggplot(df, aes(y=audience_score, x=feature_film, fill = feature_film)) +
geom_boxplot() +
ggtitle("Relationship between Audience Score and Feature Films") +
xlab("Feature Films") +
ylab("Audience Score") +
scale_fill_brewer(name = "feature film")
p2 <- ggplot(df, aes(x = summer_season, y = audience_score, fill = summer_season)) +
geom_boxplot() +
ggtitle('Audience score vs summer season') +
xlab('summer season') +
ylab('Audience Score') + scale_fill_brewer(name = "summer season")
p3 <- ggplot(df, aes(x = oscar_season, y = audience_score, fill = oscar_season)) +
geom_boxplot() +
ggtitle('Audience score vs oscar_season') +
xlab('oscar_season') +
ylab('Audience Score') +
scale_fill_brewer(name = "oscar_season")
p4 <- ggplot(df, aes(x = drama, y = audience_score, fill = drama)) +
geom_boxplot() +
ggtitle('Audience score vs drama') +
xlab('drama') +
ylab('Audience Score') +
scale_fill_brewer(name = "drama")
p5 <- ggplot(df, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) +
geom_boxplot() +
ggtitle('Audience score vs mpaa_rating_R') +
xlab('mpaa_rating_R') +
ylab('Audience Score') +
scale_fill_brewer(name = "mpaa_rating_R")
# arrange the previously created plots
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)
From above plots, we can see that:
Finally we will have a look at the other features used in the dataset
pf1 <- ggplot(df, aes(audience_score, fill = feature_film)) +
geom_density() +
ggtitle("Audience score vs. feature_film") +
labs(x = "feature film", y = "Density")
pf2 <- ggplot(df, aes(audience_score, fill = drama)) +
geom_density () +
labs(title = "Audience score vs. drama") +
labs(x = "drama", y = "Density")
pf3 <- ggplot(movies_no_na, aes(audience_score, fill = top200_box))+
geom_density () +
labs(title = "Audience score vs. top200_box") +
labs(x = "top200 box", y = "Density")
pf4 <- ggplot(df, aes(audience_score, fill = oscar_season)) +
geom_density() +
labs(title = "Audience score vs. oscar_season") +
labs(x = "oscar season", y = "Density")
pf5 <- ggplot(df, aes(audience_score, fill = summer_season))+
geom_density () +
labs(title = "Audience score vs. summer_season") +
labs(x = "summer season", y = "Density")
pf6 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_nom))+
geom_density () +
labs(title = "Audience score vs. best_pic_nom") +
labs(x = "best pic nom", y = "Density")
pf7 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_win)) +
geom_density() +
labs(title = "Audience score vs. best pic win") +
labs(x = "best pic win", y = "Density")
pf8 <- ggplot(movies_no_na, aes(audience_score, fill = best_actor_win))+
geom_density () +
labs(title = "Audience score vs. best_actor_win") +
labs(x = "best actor win", y = "Density")
pf9 <- ggplot(movies_no_na, aes(audience_score, fill = best_dir_win))+
geom_density () +
labs(title = "Audience score vs. best_dir_win") +
labs(x = "best dir win", y = "Density")
pf10 <- ggplot(movies_no_na, aes(audience_score, fill = best_actress_win))+
geom_density () +
labs(title = "Audience score vs. best_actress_win") +
labs(x = "best actress win", y = "Density")
grid.arrange(pf1, pf2, pf3, pf4, pf5, pf6, pf7, pf8, pf9, pf10, ncol = 2)
we will use the bayes_inference function, which will allow us to construct credible intervals perform a hypothesis test and calculate Bayes factors for a variety of different circumstances. The main goal is to investigate if the newly created features(feature_film, drama, mpaa_rating_R, oscar_season and summer_season) influence the audience_score.
bayes_inference(y = audience_score, x = feature_film, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors: P(H1) = 0.5 P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 14614401072
## P(H1|data) = 0
## P(H2|data) = 1
##
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
##
##
## Posterior Summaries
## 2.5% 25% 50% 75% 97.5%
## overall mean 68.5515270 70.411637 71.38884 72.383111 74.246452
## mu_No - mu_Yes 15.8806636 19.595658 21.55900 23.518128 27.218463
## sigma^2 336.1636955 360.773617 374.94964 389.500759 420.280587
## effect size 0.8109977 1.009448 1.11570 1.216969 1.411302
## n_0 13.7298943 159.386470 383.07153 775.309194 2142.883789
## 95% Cred. Int.: (15.8807 , 27.2185)
bayes_inference(y = audience_score, x = drama, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors: P(H1) = 0.5 P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 50.7705
## P(H1|data) = 0.0193
## P(H2|data) = 0.9807
##
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
##
##
## Posterior Summaries
## 2.5% 25% 50% 75%
## overall mean 60.7105099 61.7820198 62.3158053 62.8533773
## mu_No - mu_Yes -9.0521102 -6.9365112 -5.8646481 -4.7746478
## sigma^2 358.2014534 384.1593371 398.9600947 414.7302604
## effect size -0.4523203 -0.3476954 -0.2942123 -0.2389429
## n_0 29.3100817 317.9508340 780.6478066 1544.4094407
## 97.5%
## overall mean 63.8919523
## mu_No - mu_Yes -2.7044426
## sigma^2 446.1217457
## effect size -0.1353839
## n_0 4142.7222153
## 95% Cred. Int.: (-9.0521 , -2.7044)
bayes_inference(y = audience_score, x = mpaa_rating_R, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors: P(H1) = 0.5 P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 15.354
## P(H1|data) = 0.9389
## P(H2|data) = 0.0611
##
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
##
##
## Posterior Summaries
## 2.5% 25% 50% 75%
## overall mean 60.6208708 61.66028575 62.21158440 6.275903e+01
## mu_No - mu_Yes -3.5243613 -1.44083605 -0.34393238 7.396143e-01
## sigma^2 365.0174696 392.19005313 407.53617585 4.235716e+02
## effect size -0.1757209 -0.07164857 -0.01703257 3.667843e-02
## n_0 29.6638881 352.10063692 858.36892112 1.705595e+03
## 97.5%
## overall mean 63.7987863
## mu_No - mu_Yes 2.7445470
## sigma^2 456.2057888
## effect size 0.1371334
## n_0 4557.4941083
## 95% Cred. Int.: (-3.5244 , 2.7445)
bayes_inference(y = audience_score, x = oscar_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors: P(H1) = 0.5 P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 6.2077
## P(H1|data) = 0.8613
## P(H2|data) = 0.1387
##
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
##
##
## Posterior Summaries
## 2.5% 25% 50% 75%
## overall mean 60.9162681 62.092184 62.7002082 63.30879379
## mu_No - mu_Yes -5.8310567 -3.511795 -2.2835284 -1.06811530
## sigma^2 364.9257678 391.105564 406.1201338 421.69427581
## effect size -0.2894951 -0.173561 -0.1130275 -0.05312607
## n_0 33.1871629 355.907792 854.4906625 1671.06291154
## 97.5%
## overall mean 6.445812e+01
## mu_No - mu_Yes 1.235673e+00
## sigma^2 4.539947e+02
## effect size 6.080273e-02
## n_0 4.507536e+03
## 95% Cred. Int.: (-5.8311 , 1.2357)
bayes_inference(y = audience_score, x = summer_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided', prior = "JZS", rscale = 1, method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming Zellner-Siow Cauchy prior on the difference of means. )
## (Assuming independent Jeffreys prior on the overall mean and variance. )
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors: P(H1) = 0.5 P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 14.0392
## P(H1|data) = 0.9335
## P(H2|data) = 0.0665
##
## Posterior summaries for under H2:
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming Zellner-Siow Cauchy prior for difference in means)
## (Assuming independent Jeffrey's priors for overall mean and variance)
##
##
## Posterior Summaries
## 2.5% 25% 50% 75%
## overall mean 60.3496334 61.50994512 62.09766475 6.268886e+01
## mu_No - mu_Yes -2.8726836 -0.55635330 0.58209277 1.756742e+00
## sigma^2 366.5515586 392.60108984 407.92120298 4.236298e+02
## effect size -0.1401389 -0.02767417 0.02892943 8.671627e-02
## n_0 29.5635751 336.31717094 834.54925726 1.703500e+03
## 97.5%
## overall mean 63.7877739
## mu_No - mu_Yes 4.0059776
## sigma^2 455.9898100
## effect size 0.1980793
## n_0 4510.5593331
## 95% Cred. Int.: (-2.8727 , 4.006)
Summary
feature_film: BF[H2:H1] = 14.6e+9 –> Evidence against H1 is Very Strong
drama: BF[H2:H1] = 50.77 –> Evidence against H1 is positive
mpaa_rating_R: BF[H1:H2] = 15.354 –> Evidence against H2 is Positive
oscar_season: BF[H1:H2] = 6.2077 –> Evidence against H2 is Positive
summer_season: BF[H1:H2] = 14.0392 –> Evidence against H2 is Positive
For feature_film, there is strong evidence against H1, which means that there is a significant difference in mean audience_score for feature- and non-feature films
Based on the research question, we will just focus on the relationship between audience_score and variables mentioned in part 2. In this section, we will use backwards elimination to pick significant predictors and first, we will start with full model.
data.model <- movies_no_na[c("feature_film","drama","runtime","mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating","imdb_num_votes","critics_score","best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box","audience_score")]
str(data.model)
## Classes 'tbl_df', 'tbl' and 'data.frame': 619 obs. of 17 variables:
## $ feature_film : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ drama : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 2 1 ...
## $ runtime : num 80 101 84 139 90 142 93 88 119 127 ...
## $ mpaa_rating_R : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 1 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ oscar_season : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
## $ summer_season : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 2 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ critics_score : num 45 96 91 80 33 57 17 90 83 89 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ audience_score : num 73 81 91 76 27 76 47 89 66 75 ...
lm1 <- lm(audience_score ~ ., data = data.model)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 602 61285.24 2878.422
## 2 - top200_box 1 8.158011 603 61293.40 2876.504
## 3 - oscar_season 1 9.834264 604 61303.24 2874.604
## 4 - best_pic_win 1 43.432561 605 61346.67 2873.042
## 5 - best_dir_win 1 91.172934 606 61437.84 2871.961
## 6 - best_actor_win 1 125.157576 607 61563.00 2871.221
## 7 - summer_season 1 176.422141 608 61739.42 2870.992
## 8 - feature_film 1 188.009104 609 61927.43 2870.875
## 9 - drama 1 124.155090 610 62051.59 2870.114
## 10 - imdb_num_votes 1 167.442064 611 62219.03 2869.782
## 11 - thtr_rel_year 1 168.854308 612 62387.88 2869.460
We do not see mpaa_rating_R in this model but summer_season, oscar_season are still included in this model. We will use Bayesian Model Average (BMA) next
bma_audience_score <- bas.lm(audience_score ~., data = data.model, prior = "BIC", modelprior = uniform())
bma_audience_score
##
## Call:
## bas.lm(formula = audience_score ~ ., data = data.model, prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmYes dramaYes
## 1.00000 0.05876 0.04509
## runtime mpaa_rating_RYes thtr_rel_year
## 0.51400 0.16498 0.08089
## oscar_seasonYes summer_seasonYes imdb_rating
## 0.06526 0.07935 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04077 0.11565 0.14770
## best_dir_winyes top200_boxyes
## 0.06701 0.04876
summary(bma_audience_score)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmYes 0.05876309 0.0000 0.0000000 0.0000000
## dramaYes 0.04508592 0.0000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000
## mpaa_rating_RYes 0.16498090 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000
## oscar_seasonYes 0.06525993 0.0000 0.0000000 0.0000000
## summer_seasonYes 0.07935408 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1558 0.1358000 0.0319000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmYes 0.0000000 0.0000000
## dramaYes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_RYes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonYes 0.0000000 0.0000000
## summer_seasonYes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0318000 0.0289000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
image(bma_audience_score, rotate = FALSE)
We can see from the model rank that there are three variables that have high posterior odds which are runtime, imdb_rating, critic_score.
coef_bma_audience_score <- coef(bma_audience_score)
plot(coef_bma_audience_score)
Now we can build up a model for prediction:
finalmodel <- data.model[c("runtime","imdb_rating","critics_score","audience_score")]
bma_finalmodel <- bas.lm(audience_score ~., data = finalmodel, prior = "ZS-null", method = "MCMC", modelprior = uniform())
summary(bma_finalmodel)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000 1.0000000 1.0000 1.00000000 1.00000000
## runtime 0.48125 0.0000000 1.0000 0.00000000 1.00000000
## imdb_rating 1.00000 1.0000000 1.0000 1.00000000 1.00000000
## critics_score 0.90000 1.0000000 1.0000 0.00000000 0.00000000
## BF NA 0.9928814 1.0000 0.08035048 0.09635143
## PostProbs NA 0.4534000 0.4410 0.06210000 0.03730000
## R2 NA 0.7455000 0.7483 0.74050000 0.74360000
## dim NA 3.0000000 4.0000 2.00000000 3.00000000
## logmarg NA 413.8748206 413.8820 411.36060745 411.54221157
## model 5
## Intercept 1.000000e+00
## runtime 0.000000e+00
## imdb_rating 0.000000e+00
## critics_score 0.000000e+00
## BF 1.792035e-180
## PostProbs 6.200000e-03
## R2 0.000000e+00
## dim 1.000000e+00
## logmarg 0.000000e+00
Build test data cases for the movie “Black Panther (2018)” using the data gathered from IMDB (imdb_rating = 7.4) and rotten tomatoes website (audience_score = 79) and storing the data in the variable named blackpanther (test data case) using the following code
blackpanther <- data.frame(feature_film="yes",drama="no",runtime=135,mpaa_rating_R="no",thtr_rel_year=2018,oscar_season="no",summer_season="no",imdb_rating=7.4,imdb_num_votes=443501,critics_score=97,best_pic_nom="no",best_pic_win="no",best_actor_win="no",best_actress_win="no",best_dir_win="no",top200_box="yes",audience_score=79)
data.predict <- rbind(data.model, blackpanther)
blackpanther <- tail(blackpanther, 1)
str(blackpanther)
## 'data.frame': 1 obs. of 17 variables:
## $ feature_film : Factor w/ 1 level "yes": 1
## $ drama : Factor w/ 1 level "no": 1
## $ runtime : num 135
## $ mpaa_rating_R : Factor w/ 1 level "no": 1
## $ thtr_rel_year : num 2018
## $ oscar_season : Factor w/ 1 level "no": 1
## $ summer_season : Factor w/ 1 level "no": 1
## $ imdb_rating : num 7.4
## $ imdb_num_votes : num 443501
## $ critics_score : num 97
## $ best_pic_nom : Factor w/ 1 level "no": 1
## $ best_pic_win : Factor w/ 1 level "no": 1
## $ best_actor_win : Factor w/ 1 level "no": 1
## $ best_actress_win: Factor w/ 1 level "no": 1
## $ best_dir_win : Factor w/ 1 level "no": 1
## $ top200_box : Factor w/ 1 level "yes": 1
## $ audience_score : num 79
We will predict the audience_score
audience_score_prediction <-predict(bma_finalmodel, newdata=blackpanther, estimator="BMA", se.fit=TRUE, interval="predict", level = 0.95)
audience_score_prediction$Ybma
## [,1]
## [1,] 77.66515
The prediction is lower than the actual audience_score.
The project uses data from movies dataset to determine if there is any association between audience_score and other variables. Doing exploratory data analysis and modeling help us to know which variables are significant predictors. Yet, together with Bayes model, we can see different model and can pick the model that have the highest prediction.