This project focuses on the attributes that make a movie popular, on how much audiences and critics like movies, and on numerous other variables about the movies. The dataset is a random sample of movies collected from Rotten Tomatoes and IMDB in the US.
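The analysis relies on several packages that are never loaded explicitly in this document; below is a minimal setup sketch, with the package choices inferred from the functions called later (mutate, ggplot, grid.arrange, bayes_inference, stepAIC, bas.lm).
library(dplyr)      # mutate() for the derived variables
library(ggplot2)    # histograms, density plots, boxplots
library(gridExtra)  # grid.arrange() for plot layouts
library(statsr)     # bayes_inference() for Bayesian hypothesis tests
library(MASS)       # stepAIC() for stepwise model selection
library(BAS)        # bas.lm() for Bayesian model averaging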
load("movies.Rdata")
Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies, so the results can be generalized to movies listed on those sites. However, the study is observational, so it can only show association, not causation.
The movies dataset is used for the analysis at hand. Some of the variables are in the original dataset provided; others are new variables that will be constructed in the data manipulation section:
feature_film: "yes" if title_type is Feature Film, "no" otherwise
drama: "yes" if genre is Drama, "no" otherwise
mpaa_rating_R: "yes" if mpaa_rating is R, "no" otherwise
runtime
thtr_rel_year
oscar_season: "yes" if movie is released in October, November, or December (based on thtr_rel_month), "no" otherwise
summer_season: "yes" if movie is released in May, June, July, or August (based on thtr_rel_month), "no" otherwise
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box
# evaluate the size of the dataset
dim(movies)
## [1] 651 32
# types and summary of each variable
str(movies)
## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movies)
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 37 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :133 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.00 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.00 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.74 Mean :14.42 Mean :2004 Mean : 6.333
## 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.00 Max. :31.00 Max. :2015 Max. :12.000
## NA's :8 NA's :8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 180 Certified Fresh:135
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546 Fresh :209
## Median :15.00 Median :6.600 Median : 15116 Rotten :307
## Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :31.00 Max. :9.000 Max. :893008
## NA's :8
## critics_score audience_rating audience_score best_pic_nom best_pic_win
## Min. : 1.00 Spilled:275 Min. :11.00 no :629 no :644
## 1st Qu.: 33.00 Upright:376 1st Qu.:46.00 yes: 22 yes: 7
## Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box director
## no :558 no :579 no :608 no :636 Length:651
## yes: 93 yes: 72 yes: 43 yes: 15 Class :character
## Mode :character
##
##
##
##
## actor1 actor2 actor3 actor4
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor5 imdb_url rt_url
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
We see that some variables contain NA values. In this project, we assume that these missing values are not systematically related to the outcome and will not materially affect the results, so we remove all rows containing NA values.
movies_no_na <- na.omit(movies)
dim(movies_no_na)
## [1] 619 32
After removing missing values, the data has 619 observations and 32 variables.
#### 2.3 Data manipulation
feature_film: “yes” if title_type is Feature Film, “no” otherwise
movies_no_na <- mutate(movies_no_na, feature_film = ifelse(title_type == "Feature Film", "Yes", "No"))
movies_no_na$feature_film <- as.factor(movies_no_na$feature_film)
summary(movies_no_na$feature_film)
## No Yes
## 46 573
drama: “yes” if genre is Drama, “no” otherwise
movies_no_na <- mutate(movies_no_na, drama = ifelse(genre == "Drama", "Yes", "No"))
movies_no_na$drama <- as.factor(movies_no_na$drama)
summary(movies_no_na$drama)
## No Yes
## 321 298
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise
movies_no_na <- mutate(movies_no_na, mpaa_rating_R = ifelse(mpaa_rating == "R", "Yes", "No"))
movies_no_na$mpaa_rating_R <- as.factor(movies_no_na$mpaa_rating_R)
summary(movies_no_na$mpaa_rating_R)
## No Yes
## 300 319
oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
movies_no_na <- mutate(movies_no_na, oscar_season = ifelse(thtr_rel_month %in% c(10,11,12), "Yes", "No"))
movies_no_na$oscar_season <- as.factor(movies_no_na$oscar_season)
summary(movies_no_na$oscar_season)
## No Yes
## 440 179
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
movies_no_na <- mutate(movies_no_na, summer_season = ifelse(thtr_rel_month %in% c(5,6,7,8), "Yes", "No"))
movies_no_na$summer_season <- as.factor(movies_no_na$summer_season)
summary(movies_no_na$summer_season)
## No Yes
## 418 201
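For reference, the same five derived factors can be built in a single dplyr pipeline; a minimal sketch that is equivalent to the step-by-step mutate calls above:
# one-step construction of the derived factors (same logic as the individual mutate calls)
movies_no_na <- movies_no_na %>%
  mutate(
    feature_film  = factor(ifelse(title_type == "Feature Film", "Yes", "No")),
    drama         = factor(ifelse(genre == "Drama", "Yes", "No")),
    mpaa_rating_R = factor(ifelse(mpaa_rating == "R", "Yes", "No")),
    oscar_season  = factor(ifelse(thtr_rel_month %in% 10:12, "Yes", "No")),
    summer_season = factor(ifelse(thtr_rel_month %in% 5:8, "Yes", "No"))
  )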
df <- movies_no_na[c("feature_film","drama","mpaa_rating_R","oscar_season","summer_season","audience_score")]
summary(df)
## feature_film drama mpaa_rating_R oscar_season summer_season
## No : 46 No :321 No :300 No :440 No :418
## Yes:573 Yes:298 Yes:319 Yes:179 Yes:201
##
##
##
##
## audience_score
## Min. :11.00
## 1st Qu.:46.00
## Median :65.00
## Mean :62.21
## 3rd Qu.:80.00
## Max. :97.00
#### 2.4 Exploratory Data Analysis
First, we take an overview of the distribution of audience_score; its association with the new features is examined afterwards.
summary(df$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.21 80.00 97.00
IQR(df$audience_score)
## [1] 34
mean(df$audience_score)
## [1] 62.21002
The IQR of audience_score is 34 (1st quartile 46, 3rd quartile 80), the mean is about 62.2, and the median is 65.
ggplot(df, aes(x = audience_score, y = ..density..)) +
geom_histogram(bins = 40, fill = 'blue', colour = 'black') +
geom_density(size = 1, colour = 'brown')
The distribution of audience_score is slightly left-skewed.
Next, we create a boxplot of audience_score against each of the new features and compare them, which also shows how the variability of audience_score differs across the levels of each feature.
p1 <- ggplot(df, aes(y=audience_score, x=feature_film, fill = feature_film)) +
geom_boxplot() +
ggtitle("Relationship between Audience Score and Feature Films") +
xlab("Feature Films") +
ylab("Audience Score") +
scale_fill_brewer(name = "feature film")
p2 <- ggplot(df, aes(x = summer_season, y = audience_score, fill = summer_season)) +
geom_boxplot() +
ggtitle('Audience score vs summer season') +
xlab('summer season') +
ylab('Audience Score') + scale_fill_brewer(name = "summer season")
p3 <- ggplot(df, aes(x = oscar_season, y = audience_score, fill = oscar_season)) +
geom_boxplot() +
ggtitle('Audience score vs oscar_season') +
xlab('oscar_season') +
ylab('Audience Score') +
scale_fill_brewer(name = "oscar_season")
p4 <- ggplot(df, aes(x = drama, y = audience_score, fill = drama)) +
geom_boxplot() +
ggtitle('Audience score vs drama') +
xlab('drama') +
ylab('Audience Score') +
scale_fill_brewer(name = "drama")
p5 <- ggplot(df, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) +
geom_boxplot() +
ggtitle('Audience score vs mpaa_rating_R') +
xlab('mpaa_rating_R') +
ylab('Audience Score') +
scale_fill_brewer(name = "mpaa_rating_R")
# arrange the previously created plots
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)
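Before interpreting the boxplots, it is useful to back up the visual comparison with numbers. A minimal sketch (assuming dplyr is loaded) that computes group-wise summaries of audience_score, shown here for feature_film; the same pattern applies to drama, mpaa_rating_R, oscar_season, and summer_season:
# group-wise summaries of audience_score by feature_film
df %>%
  group_by(feature_film) %>%
  summarise(mean_score   = mean(audience_score),
            median_score = median(audience_score),
            n            = n())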
From the plots above, we can see that:
Non-feature films have noticeably higher audience scores than feature films.
Drama movies have a higher average audience score than other genres.
Audience scores for R-rated and non-R-rated movies are very similar.
Movies released in October, November, or December (the Oscar season) score slightly higher than movies released in other months.
Finally, we look at the density of audience_score across the other categorical features in the dataset.
pf1 <- ggplot(df, aes(audience_score, fill = feature_film)) +
geom_density() +
ggtitle("Audience score vs. feature_film") +
labs(x = "feature film", y = "Density")
pf2 <- ggplot(df, aes(audience_score, fill = drama)) +
geom_density () +
labs(title = "Audience score vs. drama") +
labs(x = "drama", y = "Density")
pf3 <- ggplot(movies_no_na, aes(audience_score, fill = top200_box))+
geom_density () +
labs(title = "Audience score vs. top200_box") +
labs(x = "top200 box", y = "Density")
pf4 <- ggplot(df, aes(audience_score, fill = oscar_season)) +
geom_density() +
labs(title = "Audience score vs. oscar_season") +
labs(x = "oscar season", y = "Density")
pf5 <- ggplot(df, aes(audience_score, fill = summer_season))+
geom_density () +
labs(title = "Audience score vs. summer_season") +
labs(x = "summer season", y = "Density")
pf6 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_nom))+
geom_density () +
labs(title = "Audience score vs. best_pic_nom") +
labs(x = "best pic nom", y = "Density")
pf7 <- ggplot(movies_no_na, aes(audience_score, fill = best_pic_win)) +
geom_density() +
labs(title = "Audience score vs. best pic win") +
labs(x = "best pic win", y = "Density")
pf8 <- ggplot(movies_no_na, aes(audience_score, fill = best_actor_win))+
geom_density () +
labs(title = "Audience score vs. best_actor_win") +
labs(x = "best actor win", y = "Density")
pf9 <- ggplot(movies_no_na, aes(audience_score, fill = best_dir_win))+
geom_density () +
labs(title = "Audience score vs. best_dir_win") +
labs(x = "best dir win", y = "Density")
pf10 <- ggplot(movies_no_na, aes(audience_score, fill = best_actress_win))+
geom_density () +
labs(title = "Audience score vs. best_actress_win") +
labs(x = "best actress win", y = "Density")
grid.arrange(pf1, pf2, pf3, pf4, pf5, pf6, pf7, pf8, pf9, pf10, ncol = 2)
#### 2.5 Hypothesis testing
We will use the bayes_inference function, which allows us to construct credible intervals, perform hypothesis tests, and compute Bayes factors in a variety of settings. The main goal is to investigate whether the newly created features (feature_film, drama, mpaa_rating_R, oscar_season, and summer_season) influence audience_score.
bayes_inference(y = audience_score, x = feature_film, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 1.212332e+13
## P(H1|data) = 0
## P(H2|data) = 1
bayes_inference(y = audience_score, x = drama, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 34.6357
## P(H1|data) = 0.0281
## P(H2|data) = 0.9719
bayes_inference(y = audience_score, x = mpaa_rating_R, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 24.8392
## P(H1|data) = 0.9613
## P(H2|data) = 0.0387
bayes_inference(y = audience_score, x = oscar_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 10.019
## P(H1|data) = 0.9092
## P(H2|data) = 0.0908
bayes_inference(y = audience_score, x = summer_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 22.7623
## P(H1|data) = 0.9579
## P(H2|data) = 0.0421
#### Summary
feature_film: BF[H2:H1] = 1.21e+13 -> evidence against H1 is very strong
drama: BF[H2:H1] = 34.64 -> evidence against H1 is strong
mpaa_rating_R: BF[H1:H2] = 24.84 -> evidence against H2 is strong
oscar_season: BF[H1:H2] = 10.02 -> evidence against H2 is positive
summer_season: BF[H1:H2] = 22.76 -> evidence against H2 is strong
For feature_film there is very strong evidence against H1, i.e. a clear difference in mean audience_score between feature and non-feature films. For the other four features the data favor H1 (equal means), so on their own these variables tell us little about audience_score.
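As a quick sanity check on these numbers: with equal prior probabilities on the two hypotheses, the posterior probability of H2 follows directly from the Bayes factor, P(H2 | data) = BF[H2:H1] / (1 + BF[H2:H1]). A minimal sketch using the drama result reported above:
# posterior probability of H2 under equal priors, computed from the Bayes factor
bf_drama <- 34.6357          # BF[H2:H1] for drama, from the bayes_inference output
bf_drama / (1 + bf_drama)    # ~0.9719, matching P(H2|data) in the output above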
Based on the research question, we focus on the relationship between audience_score and the variables listed in part 2. In this section we use backward elimination to select significant predictors, starting from the full model.
data.model <- movies_no_na[c("feature_film","drama","runtime","mpaa_rating_R","thtr_rel_year","imdb_rating","imdb_num_votes","critics_score","best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box","audience_score")]
str(data.model)
## tibble [619 × 15] (S3: tbl_df/tbl/data.frame)
## $ feature_film : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ drama : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 2 1 ...
## $ runtime : num [1:619] 80 101 84 139 90 142 93 88 119 127 ...
## $ mpaa_rating_R : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 1 ...
## $ thtr_rel_year : num [1:619] 2013 2001 1996 1993 2004 ...
## $ imdb_rating : num [1:619] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ imdb_num_votes : int [1:619] 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ critics_score : num [1:619] 45 96 91 80 33 57 17 90 83 89 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ audience_score : num [1:619] 73 81 91 76 27 76 47 89 66 75 ...
## - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
## ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
lm1 <- lm(audience_score ~ ., data = data.model)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
## best_dir_win + top200_box
##
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 604 61461.67 2876.201
## 2 - top200_box 1 8.119148 605 61469.79 2874.283
## 3 - best_pic_win 1 40.723899 606 61510.51 2872.693
## 4 - best_dir_win 1 84.558038 607 61595.07 2871.544
## 5 - best_actor_win 1 144.351490 608 61739.42 2870.992
## 6 - feature_film 1 188.009104 609 61927.43 2870.875
## 7 - drama 1 124.155090 610 62051.59 2870.114
## 8 - imdb_num_votes 1 167.442064 611 62219.03 2869.782
## 9 - thtr_rel_year 1 168.854308 612 62387.88 2869.460
The stepwise (AIC) selection keeps runtime, mpaa_rating_R, imdb_rating, critics_score, best_pic_nom, and best_actress_win, and drops feature_film, drama, and the remaining award and box-office variables. Next, we apply Bayesian Model Averaging (BMA).
bma_audience_score <- bas.lm(audience_score ~., data = data.model, prior = "BIC", modelprior = uniform())
bma_audience_score
##
## Call:
## bas.lm(formula = audience_score ~ ., data = data.model, prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmYes dramaYes
## 1.00000 0.05829 0.04505
## runtime mpaa_rating_RYes thtr_rel_year
## 0.52445 0.16532 0.08061
## imdb_rating imdb_num_votes critics_score
## 1.00000 0.06301 0.92393
## best_pic_nomyes best_pic_winyes best_actor_winyes
## 0.12969 0.04073 0.11716
## best_actress_winyes best_dir_winyes top200_boxyes
## 0.14758 0.06684 0.04878
summary(bma_audience_score)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmYes 0.05828654 0.0000 0.0000000 0.0000000
## dramaYes 0.04504589 0.0000 0.0000000 0.0000000
## runtime 0.52445351 1.0000 0.0000000 0.0000000
## mpaa_rating_RYes 0.16532487 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08061489 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06300703 0.0000 0.0000000 0.0000000
## critics_score 0.92392984 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.12968919 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04072981 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11715515 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14758294 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06684245 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04877824 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1814 0.1581000 0.0371000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmYes 0.0000000 0.0000000
## dramaYes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_RYes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0370000 0.0336000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
image(bma_audience_score, rotate = FALSE)
The model-rank image shows that three variables have high marginal posterior inclusion probabilities: runtime, imdb_rating, and critics_score.
coef_bma_audience_score <- coef(bma_audience_score)
plot(coef_bma_audience_score)
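To make this variable choice explicit, here is a short sketch that pulls the marginal posterior inclusion probabilities out of the bas.lm fit and keeps those above 0.5; the component names probne0 and namesx are standard for BAS fit objects, but treat the exact structure as an assumption about the package version.
# variables with marginal posterior inclusion probability above 0.5
pip <- setNames(bma_audience_score$probne0, bma_audience_score$namesx)
sort(pip[pip > 0.5], decreasing = TRUE)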
Now we can build up a model for prediction:
finalmodel <- data.model[c("runtime","imdb_rating","critics_score","audience_score")]
bma_finalmodel <- bas.lm(audience_score ~., data = finalmodel, prior = "ZS-null", method = "MCMC", modelprior = uniform())
summary(bma_finalmodel)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000 1.0000 1.0000000 1.00000000 1.00000000
## runtime 0.63750 1.0000 0.0000000 0.00000000 1.00000000
## imdb_rating 0.99375 1.0000 1.0000000 1.00000000 1.00000000
## critics_score 0.80000 1.0000 1.0000000 0.00000000 0.00000000
## BF NA 1.0000 0.9928814 0.08035048 0.09635143
## PostProbs NA 0.5404 0.2547000 0.10560000 0.08700000
## R2 NA 0.7483 0.7455000 0.74050000 0.74360000
## dim NA 4.0000 3.0000000 2.00000000 3.00000000
## logmarg NA 413.8820 413.8748206 411.36060745 411.54221157
## model 5
## Intercept 1.000000e+00
## runtime 0.000000e+00
## imdb_rating 0.000000e+00
## critics_score 0.000000e+00
## BF 1.792035e-180
## PostProbs 6.200000e-03
## R2 0.000000e+00
## dim 1.000000e+00
## logmarg 0.000000e+00
We build a test case for the movie "Black Panther (2018)" using data gathered from IMDB (imdb_rating = 7.4) and the Rotten Tomatoes website (audience_score = 79), storing it in a data frame named blackpanther:
blackpanther <- data.frame(feature_film="yes",drama="no",runtime=135,mpaa_rating_R="no",thtr_rel_year=2018,imdb_rating=7.4,imdb_num_votes=443501,critics_score=97,best_pic_nom="no",best_pic_win="no",best_actor_win="no",best_actress_win="no",best_dir_win="no",top200_box="yes",audience_score=79)
str(blackpanther)
## 'data.frame': 1 obs. of 15 variables:
## $ feature_film : chr "yes"
## $ drama : chr "no"
## $ runtime : num 135
## $ mpaa_rating_R : chr "no"
## $ thtr_rel_year : num 2018
## $ imdb_rating : num 7.4
## $ imdb_num_votes : num 443501
## $ critics_score : num 97
## $ best_pic_nom : chr "no"
## $ best_pic_win : chr "no"
## $ best_actor_win : chr "no"
## $ best_actress_win: chr "no"
## $ best_dir_win : chr "no"
## $ top200_box : chr "yes"
## $ audience_score : num 79
We now predict the audience_score for this test case:
audience_score_prediction <-predict(bma_finalmodel, newdata=blackpanther, estimator="BMA", se.fit=TRUE, interval="predict", level = 0.95)
audience_score_prediction$Ybma
## [,1]
## [1,] 77.23458
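Because the prediction was requested with se.fit = TRUE, we can also report an interval around the point estimate; a sketch assuming BAS's confint method for prediction objects:
# 95% credible (prediction) interval for the BMA prediction
confint(audience_score_prediction, parm = "pred")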
The predicted audience score of about 77.2 is slightly lower than, but close to, the actual audience_score of 79.
This project uses the movies dataset to determine whether there is an association between audience_score and the other variables. Exploratory data analysis and modeling help identify which variables are significant predictors, and Bayesian model averaging lets us compare candidate models and pick the one with the highest posterior probability; here that model is based on runtime, imdb_rating, and critics_score.