The specific modeling task you need to complete is as follows: Develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables you will need to construct in the data manipulation section using the mutate function in dplyr:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(statsr)
library(BAS)
## Warning: package 'BAS' was built under R version 4.0.3
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(grid)
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
load("movies.Rdata")
dim(movies)
## [1] 651 32
str(movies)
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movies)
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 37 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :133 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.00 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.00 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.74 Mean :14.42 Mean :2004 Mean : 6.333
## 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.00 Max. :31.00 Max. :2015 Max. :12.000
## NA's :8 NA's :8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 180 Certified Fresh:135
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546 Fresh :209
## Median :15.00 Median :6.600 Median : 15116 Rotten :307
## Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :31.00 Max. :9.000 Max. :893008
## NA's :8
## critics_score audience_rating audience_score best_pic_nom best_pic_win
## Min. : 1.00 Spilled:275 Min. :11.00 no :629 no :644
## 1st Qu.: 33.00 Upright:376 1st Qu.:46.00 yes: 22 yes: 7
## Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box director
## no :558 no :579 no :608 no :636 Length:651
## yes: 93 yes: 72 yes: 43 yes: 15 Class :character
## Mode :character
##
##
##
##
## actor1 actor2 actor3 actor4
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor5 imdb_url rt_url
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
movies <- na.omit(movies)
dim(movies)
## [1] 619 32
Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Thus, the study is obsevational and only shows associational relationship.
The movies data from IMDB was used for the analysis at hand. Some of variables are in the original dataset provided, and others are new variables. This will need be to construct in the data manipulation section.
Create new variables using the mutate function in the dplyr package following guidelines:
movies <- mutate(movies, feature_film = ifelse(title_type == "Feature Film", "Yes", "No"))
movies$feature_film <- as.factor(movies$feature_film)
summary(movies$feature_film)
## No Yes
## 46 573
movies <- mutate(movies, drama = ifelse(genre == "Drama", "Yes", "No"))
movies$drama <- as.factor(movies$drama)
summary(movies$drama)
## No Yes
## 321 298
movies<- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating == "R", "Yes", "No"))
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R)
summary(movies$mpaa_rating_R)
## No Yes
## 300 319
movies <- mutate(movies, oscar_season = ifelse(thtr_rel_month %in% c(10,11,12), "Yes", "No"))
movies$oscar_season <- as.factor(movies$oscar_season)
summary(movies$oscar_season)
## No Yes
## 440 179
movies <- mutate(movies, summer_season = ifelse(thtr_rel_month %in% c(5,6,7,8), "Yes", "No"))
movies$summer_season <- as.factor(movies$summer_season)
summary(movies$summer_season)
## No Yes
## 418 201
create a data frame with the new variables
df <- movies[c("feature_film","drama","mpaa_rating_R","oscar_season","summer_season","audience_score")]
summary(df)
## feature_film drama mpaa_rating_R oscar_season summer_season
## No : 46 No :321 No :300 No :440 No :418
## Yes:573 Yes:298 Yes:319 Yes:179 Yes:201
##
##
##
##
## audience_score
## Min. :11.00
## 1st Qu.:46.00
## Median :65.00
## Mean :62.21
## 3rd Qu.:80.00
## Max. :97.00
Analysing Audience score
summary(df$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.21 80.00 97.00
IQR(df$audience_score)
## [1] 34
mean(df$audience_score)
## [1] 62.21002
ggplot(df, aes(x = audience_score, y = ..density..)) +
geom_histogram(bins = 40, fill = 'blue', colour = 'black') +
geom_density(size = 1, colour = 'brown')
comparing audience score with the rest of the data set
pf1 <- ggplot(df, aes(audience_score, fill = feature_film)) +
geom_density() +
ggtitle("Audience score vs. feature_film") +
labs(x = "feature film", y = "Density")
pf2 <- ggplot(df, aes(audience_score, fill = drama)) +
geom_density () +
labs(title = "Audience score vs. drama") +
labs(x = "drama", y = "Density")
pf3 <- ggplot(movies, aes(audience_score, fill = top200_box))+
geom_density () +
labs(title = "Audience score vs. top200_box") +
labs(x = "top200 box", y = "Density")
pf4 <- ggplot(df, aes(audience_score, fill = oscar_season)) +
geom_density() +
labs(title = "Audience score vs. oscar_season") +
labs(x = "oscar season", y = "Density")
pf5 <- ggplot(df, aes(audience_score, fill = summer_season))+
geom_density () +
labs(title = "Audience score vs. summer_season") +
labs(x = "summer season", y = "Density")
pf6 <- ggplot(movies, aes(audience_score, fill = best_pic_nom))+
geom_density () +
labs(title = "Audience score vs. best_pic_nom") +
labs(x = "best pic nom", y = "Density")
pf7 <- ggplot(movies, aes(audience_score, fill = best_pic_win)) +
geom_density() +
labs(title = "Audience score vs. best pic win") +
labs(x = "best pic win", y = "Density")
pf8 <- ggplot(movies, aes(audience_score, fill = best_actor_win))+
geom_density () +
labs(title = "Audience score vs. best_actor_win") +
labs(x = "best actor win", y = "Density")
pf9 <- ggplot(movies, aes(audience_score, fill = best_dir_win))+
geom_density () +
labs(title = "Audience score vs. best_dir_win") +
labs(x = "best dir win", y = "Density")
pf10 <- ggplot(movies, aes(audience_score, fill = best_actress_win))+
geom_density () +
labs(title = "Audience score vs. best_actress_win") +
labs(x = "best actress win", y = "Density")
grid.arrange(pf1, pf2, pf3, pf4, pf5, pf6, pf7, pf8, pf9, pf10, ncol = 2)
use the bayes_inference function, which will allow us to construct credible intervals perform a hypothesis test and calculate Bayes factors for a variety of different circumstances. The main goal is to investigate if the newly created features(feature_film, drama, mpaa_rating_R, oscar_season and summer_season) influence the audience_score.
bayes_inference(y = audience_score, x = feature_film, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 46, y_bar_No = 82.5435, s_No = 11.9177
## n_Yes = 573, y_bar_Yes = 60.5777, s_Yes = 19.8187
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 1.212332e+13
## P(H1|data) = 0
## P(H2|data) = 1
- Audience score with drama
bayes_inference(y = audience_score, x = drama, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 321, y_bar_No = 59.352, s_No = 21.1448
## n_Yes = 298, y_bar_Yes = 65.2886, s_Yes = 18.6305
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 34.6357
## P(H1|data) = 0.0281
## P(H2|data) = 0.9719
- Audience score with mpaa_rating_r
bayes_inference(y = audience_score, x = mpaa_rating_R, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 300, y_bar_No = 62.0367, s_No = 20.3187
## n_Yes = 319, y_bar_Yes = 62.373, s_Yes = 20.0743
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 24.8392
## P(H1|data) = 0.9613
## P(H2|data) = 0.0387
- Audience score with oscar_season
bayes_inference(y = audience_score, x = oscar_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 440, y_bar_No = 61.5386, s_No = 20.107
## n_Yes = 179, y_bar_Yes = 63.8603, s_Yes = 20.3118
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 10.019
## P(H1|data) = 0.9092
## P(H2|data) = 0.0908
- Audience Score with summer_season
bayes_inference(y = audience_score, x = summer_season, data = df, statistic = 'mean', type = 'ht', null = 0, alternative = 'twosided')
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_No = 418, y_bar_No = 62.3828, s_No = 20.3266
## n_Yes = 201, y_bar_Yes = 61.8507, s_Yes = 19.9092
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_No = mu_Yes
## H2: mu_No != mu_Yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 22.7623
## P(H1|data) = 0.9579
## P(H2|data) = 0.0421
Backward Elimination we will use backwards elimination to pick significant predictors and first, we will start with full model.
data.model <- movies[c("feature_film","drama","runtime","mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating","imdb_num_votes","critics_score","best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box","audience_score")]
str(data.model)
## tibble [619 x 17] (S3: tbl_df/tbl/data.frame)
## $ feature_film : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ drama : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 2 1 ...
## $ runtime : num [1:619] 80 101 84 139 90 142 93 88 119 127 ...
## $ mpaa_rating_R : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 1 ...
## $ thtr_rel_year : num [1:619] 2013 2001 1996 1993 2004 ...
## $ oscar_season : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
## $ summer_season : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 2 ...
## $ imdb_rating : num [1:619] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ imdb_num_votes : int [1:619] 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ critics_score : num [1:619] 45 96 91 80 33 57 17 90 83 89 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ audience_score : num [1:619] 73 81 91 76 27 76 47 89 66 75 ...
## - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
## ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
lm1 <- lm(audience_score ~ ., data = data.model)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Final Model:
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 602 61285.24 2878.422
## 2 - top200_box 1 8.158011 603 61293.40 2876.504
## 3 - oscar_season 1 9.834264 604 61303.24 2874.604
## 4 - best_pic_win 1 43.432561 605 61346.67 2873.042
## 5 - best_dir_win 1 91.172934 606 61437.84 2871.961
## 6 - best_actor_win 1 125.157576 607 61563.00 2871.221
## 7 - summer_season 1 176.422141 608 61739.42 2870.992
## 8 - feature_film 1 188.009104 609 61927.43 2870.875
## 9 - drama 1 124.155090 610 62051.59 2870.114
## 10 - imdb_num_votes 1 167.442064 611 62219.03 2869.782
## 11 - thtr_rel_year 1 168.854308 612 62387.88 2869.460
bma_audience_score <- bas.lm(audience_score ~., data = data.model, prior = "BIC", modelprior = uniform())
bma_audience_score
##
## Call:
## bas.lm(formula = audience_score ~ ., data = data.model, prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmYes dramaYes
## 1.00000 0.05876 0.04509
## runtime mpaa_rating_RYes thtr_rel_year
## 0.51400 0.16498 0.08089
## oscar_seasonYes summer_seasonYes imdb_rating
## 0.06526 0.07935 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04077 0.11565 0.14770
## best_dir_winyes top200_boxyes
## 0.06701 0.04876
summary(bma_audience_score)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmYes 0.05876309 0.0000 0.0000000 0.0000000
## dramaYes 0.04508592 0.0000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000
## mpaa_rating_RYes 0.16498090 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000
## oscar_seasonYes 0.06525993 0.0000 0.0000000 0.0000000
## summer_seasonYes 0.07935408 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1558 0.1358000 0.0319000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmYes 0.0000000 0.0000000
## dramaYes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_RYes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonYes 0.0000000 0.0000000
## summer_seasonYes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0318000 0.0289000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
image(bma_audience_score, rotate = FALSE)
We can see from the model rank that there are three variables that have high posterior odds which are runtime, imdb_rating, critic_score.
coef_bma_audience_score <- coef(bma_audience_score)
par(ask=F)
plot(coef_bma_audience_score)
Model for Prediction
finalmodel <- data.model[c("runtime","imdb_rating","critics_score","audience_score")]
bma_finalmodel <- bas.lm(audience_score ~., data = finalmodel, prior = "ZS-null", method = "MCMC", modelprior = uniform())
summary(bma_finalmodel)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000 1.0000000 1.0000 1.00000000 1.00000000
## runtime 0.50625 0.0000000 1.0000 1.00000000 0.00000000
## imdb_rating 0.99375 1.0000000 1.0000 1.00000000 1.00000000
## critics_score 0.91875 1.0000000 1.0000 0.00000000 0.00000000
## BF NA 0.9928814 1.0000 0.09635143 0.08035048
## PostProbs NA 0.4720000 0.4410 0.05590000 0.01860000
## R2 NA 0.7455000 0.7483 0.74360000 0.74050000
## dim NA 3.0000000 4.0000 3.00000000 2.00000000
## logmarg NA 413.8748206 413.8820 411.54221157 411.36060745
## model 5
## Intercept 1.000000e+00
## runtime 0.000000e+00
## imdb_rating 0.000000e+00
## critics_score 0.000000e+00
## BF 1.792035e-180
## PostProbs 6.200000e-03
## R2 0.000000e+00
## dim 1.000000e+00
## logmarg 0.000000e+00
Build test data cases for the movie “Black Panther (2018)” using the data gathered from IMDB (imdb_rating = 7.4) and rotten tomatoes website (audience_score = 79) and storing the data in the variable named blackpanther (test data case) using the following code
blackpanther <- data.frame(feature_film="yes",drama="no",runtime=135,mpaa_rating_R="no",thtr_rel_year=2018,oscar_season="no",summer_season="no",imdb_rating=7.4,imdb_num_votes=443501,critics_score=97,best_pic_nom="no",best_pic_win="no",best_actor_win="no",best_actress_win="no",best_dir_win="no",top200_box="yes",audience_score=79)
data.predict <- rbind(data.model, blackpanther)
blackpanther <- tail(blackpanther, 1)
str(blackpanther)
## 'data.frame': 1 obs. of 17 variables:
## $ feature_film : chr "yes"
## $ drama : chr "no"
## $ runtime : num 135
## $ mpaa_rating_R : chr "no"
## $ thtr_rel_year : num 2018
## $ oscar_season : chr "no"
## $ summer_season : chr "no"
## $ imdb_rating : num 7.4
## $ imdb_num_votes : num 443501
## $ critics_score : num 97
## $ best_pic_nom : chr "no"
## $ best_pic_win : chr "no"
## $ best_actor_win : chr "no"
## $ best_actress_win: chr "no"
## $ best_dir_win : chr "no"
## $ top200_box : chr "yes"
## $ audience_score : num 79
We will predict the audience_score
audience_score_prediction <-predict(bma_finalmodel, newdata=blackpanther, estimator="BMA", se.fit=TRUE, interval="predict", level = 0.95)
audience_score_prediction$Ybma
## [,1]
## [1,] 77.61483
The prediction is lower than the actual audience_score.
The project uses data from movies dataset to determine if there is any association between audience_score and other variables. Doing exploratory data analysis and modeling help us to know which variables are significant predictors. Yet, together with Bayes model, we can see different model and can pick the model that have the highest prediction.