This file will introduce you to the capabilities of the Bayesian approach to forecasting, plotting credible intervals, and shifting the focus of the model based on the same data.
The data set is comprised of 651 randomly sampled movies produced and released before 2016. Sources for this data set were Rotten Tomatoes and IMDB APIs. Since these sources mostly focusing on English-spoken auditory we need to be very careful to generalizable our conclusion for all world population, but we can be more confident about the US, UK, and other English spoken countries. Also, we need to mention that this data set covert for the most part active internet users. We need to be careful to generalize conclusions for the least active internet users, probably, such as old people and others.
Since we do not have an experiment, but observation study, we can consider the only association, and do not can make causal conclusions.
We are interested in learning what attributes make a movie popular. We going to develop a Bayesian regression model to predict audience_score from the dataset explanatory variables. Also, we going to figure out the same ways to build a more accurate model by using the same data.
In this part, we are going to make some data manipulation whose were given by course project terms of Bayesian Statistics.
feature_filmtitle_type: The new variable would be called feature_film with levels yes (movies that are feature films) and no.## Documentary Feature Film TV Movie
## 55 591 5
movies <- movies %>%
mutate(feature_film = as.factor(if_else(title_type == "Feature Film", "yes", "no")))
summary(movies$feature_film)## no yes
## 60 591
dramagenre: The new variable would be called drama with levels yes (movies that are dramas) and no.## Action & Adventure Animation Art House & International
## 65 9 14
## Comedy Documentary Drama
## 87 52 305
## Horror Musical & Performing Arts Mystery & Suspense
## 23 12 59
## Other Science Fiction & Fantasy
## 16 9
movies <- movies %>%
mutate(drama = as.factor(if_else(genre == "Drama", "yes", "no")))
summary(movies$drama)## no yes
## 346 305
mpaa_rating_Rmpaa_rating: The new variable would be called mpaa_rating_R with levels yes (movies that are R rated) and no.## G NC-17 PG PG-13 R Unrated
## 19 2 118 133 329 50
movies <- movies %>%
mutate(mpaa_rating_R = as.factor(if_else(mpaa_rating == "R", "yes", "no")))
summary(movies$mpaa_rating_R)## no yes
## 322 329
thtr_rel_month:oscar_seasonoscar_season with levels yes (if movie is released in November, October, or December) and no.## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 7.00 6.74 10.00 12.00
movies <- movies %>%
mutate(oscar_season = as.factor(if_else(thtr_rel_month %in% c(10,11,12), "yes", "no")))
summary(movies$oscar_season)## no yes
## 460 191
summer_seasonsummer_season with levels yes (if movie is released in May, June, July, or August) and no.movies <- movies %>%
mutate(summer_season = as.factor(if_else(thtr_rel_month %in% c(5,6,7,8), "yes", "no")))
summary(movies$summer_season)## no yes
## 443 208
audience_score and some explanatory numeric variablesAlso, we are going transfom the predicting variable audience_score and some explanatory numeric variables.
Below in Part3, we will see that the distribution of audience_score is left-skewed which makes mean of this variable less robust measurement for predicting purposes. Our goal for transforming audience_score variable is to make the distribution more symmetric.
Transformation of explanatory numeric variables has similar purposes, and, also, if we going to Bayesian Multiple Linear Regression we will see that range of credible intervals of explanatory variables for predicting value also depends on how close stands sizes of explanatory variables for a predicted case and mean of explanatory variables that we have in our data. That leads us to the situation there for different ranges of explanatory variables we should use a different derivative of original variables.
\[\begin{equation} y_{\text{score}, i} = \beta_0 + \beta_1 (x_{\text{hs},i}-\bar{x}_{\text{hs}}) + \beta_2 (x_{\text{IQ},i}-\bar{x}_{\text{IQ}}) + \beta_3(x_{\text{work},i}-\bar{x}_{\text{work}}) + \beta_4 (x_{\text{age},i}-\bar{x}_{\text{age}}) + \epsilon_i. \end{equation}\]
movies <- movies %>%
mutate(cs_log = log(critics_score),
cs_sqrt = sqrt(critics_score),
cs_power_2 = critics_score^2,
cs_power_4 = critics_score^4,
ir_log = log(imdb_rating),
ir_sqrt = sqrt(imdb_rating),
ir_power_2 = imdb_rating^2,
ir_power_4 = imdb_rating^4,
rt_log = log(runtime),
rt_sqrt = sqrt(runtime),
rt_power_2 = runtime^2,
rt_power_4 = runtime^4,
inv_log = log(imdb_num_votes),
inv_sqrt = sqrt(imdb_num_votes),
inv_power_2 = imdb_num_votes^2,
inv_power_4 = imdb_num_votes^4)For the very beginning let’s see audience score on Rotten Tomatoes audience_score distributions:
movies %>%
ggplot(aes(x = audience_score)) +
geom_histogram(bins = 10) +
ggtitle('Distribution of audience score on Rotten Tomatoes')## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.0 46.0 65.0 62.4 80.0 97.0
Distribution of audience_score is left-skewed with center around 65.
movies %>%
ggplot(aes(x = ascore_power )) +
geom_histogram(bins = 10) +
ggtitle('Distribution of audience score on Rotten Tomatoes')movies %>%
summarise(Mean_n_power = mean((audience_score)^(power))^(1/power),
Mean_n = mean(audience_score),
median_n = median(audience_score))## # A tibble: 1 x 3
## Mean_n_power Mean_n median_n
## <dbl> <dbl> <dbl>
## 1 65.0 62.4 65
Distribution of ascore_power is roughly symmetric with the center around 1850.
feature_filmmovies %>%
ggplot(aes(x = audience_score)) +
geom_histogram(bins = 10) +
facet_wrap("feature_film") +
ggtitle('Distribution of audience score on Rotten Tomatoes')movies %>%
group_by(feature_film) %>%
summarise(Min_n = min(audience_score),
Q1_n = quantile(audience_score,0.25,type = 1),
Median_n = median(audience_score),
Q3_n = quantile(audience_score,0.75,type = 1),
Max_n = max(audience_score),
Mean_n = mean(audience_score),
Sd_n = sd(audience_score),
N_n = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
## feature_film Min_n Q1_n Median_n Q3_n Max_n Mean_n Sd_n N_n
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 no 19 75 85.5 89 96 81.0 13.6 60
## 2 yes 11 44 62 78 97 60.5 19.8 591
Distributions of audience_score are left-skewed with center around 85.5 and 62 for “no” and “yes” feature_film groups respectively.
dramamovies %>%
ggplot(aes(x = audience_score)) +
geom_histogram(bins = 10) +
facet_wrap("drama") +
ggtitle('Distribution of audience score on Rotten Tomatoes')movies %>%
group_by(drama) %>%
summarise(Min_n = min(audience_score),
Q1_n = quantile(audience_score,0.25,type = 1),
Median_n = median(audience_score),
Q3_n = quantile(audience_score,0.75,type = 1),
Max_n = max(audience_score),
Mean_n = mean(audience_score),
Sd_n = sd(audience_score),
N_n = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
## drama Min_n Q1_n Median_n Q3_n Max_n Mean_n Sd_n N_n
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 no 11 41 61 79 97 59.7 21.3 346
## 2 yes 13 52 70 80 95 65.3 18.5 305
Distributions of audience_score are left-skewed with center around 61 and 70 for “no” and “yes” drama groups respectively.
mpaa_rating_Rmovies %>%
ggplot(aes(x = audience_score)) +
geom_histogram(bins = 10) +
facet_wrap("mpaa_rating_R") +
ggtitle('Distribution of audience score on Rotten Tomatoes')movies %>%
group_by(mpaa_rating_R) %>%
summarise(Min_n = min(audience_score),
Q1_n = quantile(audience_score,0.25,type = 1),
Median_n = median(audience_score),
Q3_n = quantile(audience_score,0.75,type = 1),
Max_n = max(audience_score),
Mean_n = mean(audience_score),
Sd_n = sd(audience_score),
N_n = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
## mpaa_rating_R Min_n Q1_n Median_n Q3_n Max_n Mean_n Sd_n N_n
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 no 11 48 65.5 80 96 62.7 20.3 322
## 2 yes 14 44 64 79 97 62.0 20.2 329
Distributions of audience_score are left-skewed with center around 65.5 and 64.0 for “no” and “yes” mpaa_rating_R groups respectively.
oscar_seasonmovies %>%
ggplot(aes(x = audience_score)) +
geom_histogram(bins = 10) +
facet_wrap("oscar_season") +
ggtitle('Distribution of audience score on Rotten Tomatoes')movies %>%
group_by(oscar_season) %>%
summarise(Min_n = min(audience_score),
Q1_n = quantile(audience_score,0.25,type = 1),
Median_n = median(audience_score),
Q3_n = quantile(audience_score,0.75,type = 1),
Max_n = max(audience_score),
Mean_n = mean(audience_score),
Sd_n = sd(audience_score),
N_n = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
## oscar_season Min_n Q1_n Median_n Q3_n Max_n Mean_n Sd_n N_n
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 no 11 46 64 79 96 61.8 20.1 460
## 2 yes 13 47 69 81 97 63.7 20.5 191
Distributions of audience_score are left-skewed with center around 64.0 and 69.0 for “no” and “yes” oscar_season groups respectively.
summer_seasonmovies %>%
ggplot(aes(x = audience_score)) +
geom_histogram(bins = 10) +
facet_wrap("summer_season") +
ggtitle('Distribution of audience score on Rotten Tomatoes')movies %>%
group_by(summer_season) %>%
summarise(Min_n = min(audience_score),
Q1_n = quantile(audience_score,0.25,type = 1),
Median_n = median(audience_score),
Q3_n = quantile(audience_score,0.75,type = 1),
Max_n = max(audience_score),
Mean_n = mean(audience_score),
Sd_n = sd(audience_score),
N_n = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 9
## summer_season Min_n Q1_n Median_n Q3_n Max_n Mean_n Sd_n N_n
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 no 13 46 66 80 97 62.6 20.4 443
## 2 yes 11 44 65 78 94 61.8 19.9 208
Distributions of audience_score are left-skewed with center around 66.0 and 65.0 for “no” and “yes” summer_season groups respectively.
critics_scoreplot <- ggplot(data = movies) +
geom_histogram(aes(x = critics_score), bins = 20) +
ggtitle('Critics score distribution')
log_plot <- ggplot(data = movies) +
geom_histogram(aes(x = cs_log), bins = 20) +
ggtitle('Log')
sqrt_plot <- ggplot(data = movies) +
geom_histogram(aes(x = cs_sqrt), bins = 20) +
ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) +
geom_histogram(aes(x = cs_power_2), bins = 20) +
ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) +
geom_histogram(aes(x = cs_power_4), bins = 20) +
ggtitle('Power 4')
require(gridExtra)## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(plot,
grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot,
ncol = 4), ncol = 1)mean <- movies %>%
summarise("critics_score" = mean(critics_score),
"cs_log" = exp(mean(cs_log)),
"cs_sqrt" = mean(cs_sqrt)^2,
"cs_power_2" = sqrt(mean(cs_power_2)),
"cs_power_4" = mean(cs_power_4)^(1/4)) %>%
t()
median <- movies %>%
summarise("critics_score" = median(critics_score),
"cs_log" = exp(median(cs_log)),
"cs_sqrt" = median(cs_sqrt)^2,
"cs_power_2" = sqrt(median(cs_power_2)),
"cs_power_4" = median(cs_power_4)^(1/4)) %>%
t()
summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output## mean median median/mean median-mean
## critics_score 57.69 61 1.0574 3.312
## cs_log 47.16 61 1.2934 13.836
## cs_sqrt 53.07 61 1.1494 7.927
## cs_power_2 64.29 61 0.9488 -3.292
## cs_power_4 72.03 61 0.8468 -11.034
Distribution of critics_score is left-skewed with center around 61.
imdb_ratingplot <- ggplot(data = movies) + geom_histogram(aes(x = imdb_rating), bins = 20) + ggtitle('IMDB rating distribution')
log_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_log), bins = 20) + ggtitle('Log')
sqrt_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_sqrt), bins = 20) + ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_power_2), bins = 20) + ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_power_4), bins = 20) + ggtitle('Power 4')
require(gridExtra)
grid.arrange(plot, grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, ncol = 4), ncol = 1)mean <- movies %>%
summarise("imdb_rating" = mean(imdb_rating),
"ir_log" = exp(mean(ir_log)),
"ir_sqrt" = mean(ir_sqrt)^2,
"ir_power_2" = sqrt(mean(ir_power_2)),
"ir_power_4" = mean(ir_power_4)^(1/4)) %>%
t()
median <- movies %>%
summarise("imdb_rating" = median(imdb_rating),
"ir_log" = exp(median(ir_log)),
"ir_sqrt" = median(ir_sqrt)^2,
"ir_power_2" = sqrt(median(ir_power_2)),
"ir_power_4" = median(ir_power_4)^(1/4)) %>%
t()
summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output## mean median median/mean median-mean
## imdb_rating 6.493 6.6 1.0165 0.10691
## ir_log 6.386 6.6 1.0336 0.21449
## ir_sqrt 6.442 6.6 1.0245 0.15802
## ir_power_2 6.583 6.6 1.0026 0.01706
## ir_power_4 6.729 6.6 0.9808 -0.12938
Distribution of imdb_rating is left-skewed with center around 6.6.
runtimeplot <- ggplot(data = movies) + geom_histogram(aes(x = runtime), bins = 20) + ggtitle('Runtime distribution')
log_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_log), bins = 20) + ggtitle('Log')
sqrt_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_sqrt), bins = 20) + ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_power_2), bins = 20) + ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_power_4), bins = 20) + ggtitle('Power 4')
require(gridExtra)
grid.arrange(plot, grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, ncol = 4), ncol = 1)## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
mean <- movies %>%
filter(!is.na(runtime)) %>%
summarise("runtime" = mean(runtime),
"rt_log" = exp(mean(rt_log)),
"rt_sqrt" = mean(rt_sqrt)^2,
"rt_power_2" = sqrt(mean(rt_power_2)),
"rt_power_4" = mean(rt_power_4)^(1/4)) %>%
t()
median <- movies %>%
filter(!is.na(runtime)) %>%
summarise("runtime" = median(runtime),
"rt_log" = exp(median(rt_log)),
"rt_sqrt" = median(rt_sqrt)^2,
"rt_power_2" = sqrt(median(rt_power_2)),
"rt_power_4" = median(rt_power_4)^(1/4)) %>%
t()
summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output## mean median median/mean median-mean
## runtime 105.8 103 0.9733 -2.822
## rt_log 104.2 103 0.9884 -1.213
## rt_sqrt 105.0 103 0.9809 -2.003
## rt_power_2 107.6 103 0.9573 -4.591
## rt_power_4 112.1 103 0.9188 -9.100
Distribution of runtime is right-skewed with center around 103.
imdb_num_votesplot <- ggplot(data = movies) + geom_histogram(aes(x = imdb_num_votes), bins = 20) + ggtitle('IMDB number votes distribution')
log_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_log), bins = 20) + ggtitle('Log')
sqrt_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_sqrt), bins = 20) + ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_power_2), bins = 20) + ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_power_4), bins = 20) + ggtitle('Power 4')
require(gridExtra)
grid.arrange(plot, grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, ncol = 4), ncol = 1)mean <- movies %>%
filter(!is.na(imdb_num_votes)) %>%
summarise("imdb_num_votes" = mean(imdb_num_votes),
"inv_log" = exp(mean(inv_log)),
"inv_sqrt" = mean(inv_sqrt)^2,
"inv_power_2" = sqrt(mean(inv_power_2)),
"inv_power_4" = mean(inv_power_4)^(1/4)) %>%
t()
median <- movies %>%
filter(!is.na(imdb_num_votes)) %>%
summarise("imdb_num_votes" = median(imdb_num_votes),
"inv_log" = exp(median(inv_log)),
"inv_sqrt" = median(inv_sqrt)^2,
"inv_power_2" = sqrt(median(inv_power_2)),
"inv_power_4" = median(inv_power_4)^(1/4)) %>%
t()
summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output## mean median median/mean median-mean
## imdb_num_votes 57533 15116 0.26274 -42417
## inv_log 16403 15116 0.92156 -1287
## inv_sqrt 32367 15116 0.46702 -17251
## inv_power_2 125947 15116 0.12002 -110831
## inv_power_4 268792 15116 0.05624 -253676
Distribution of imdb_num_votes is right-skewed with center around 15116.
At the end of the project, we going to predict “Ford v Ferrari” (2019) movie. In that case, we should choose which explanatory variables we going to include in the predicting model. Including variables in which explanatory variables values would be close to values of the movies may increase the accuracy of prediction in that area of variables.
From imdb.com and rottentomatoes.com we gote that “Ford v Ferrari” (2019) movie has:
In that case we should stay on derivatives which normalize mean would be closes to value abowe. From data manipulation section we can get folowing result:
In the last case, we chouse inv_power_2 instead inv_power_4 with normalized mean 268792 because of singularity in prediction processes. We should mention that if we chouse to predict moves with low explanatory variables we probably would include log derivatives.
Also, we will construct the predicting model with original variables to make a prediction for comparison, but in this paper, we are not going to describe calculating processes for that additional model.
# Get interesting us columns
col_names <- c("audience_score", "ascore_power",
# constructed explanatory categorical variables
"feature_film", "drama", "mpaa_rating_R",
"oscar_season", "summer_season",
# original explanatory numeric variables
"thtr_rel_year", "top200_box",
"best_pic_nom", "best_pic_win",
"best_actor_win", "best_actress_win", "best_dir_win",
# original explanatory numeric variables
"runtime",
"imdb_rating",
"imdb_num_votes",
"critics_score",
# transform explanatory numeric variables
"cs_power_4",
"ir_power_4",
"rt_power_4",
"inv_power_2",
"title")
movies_no_na <- movies[,col_names]
# Exclude observations with missing values in the data set
movies_no_na <- na.omit(movies_no_na)Number of possible models.
drops <- c("audience_score","title","runtime", "imdb_rating", "imdb_num_votes", "critics_score")
n <- dim(movies_no_na)[2]-length(drops) - 1
2^n## [1] 65536
Because the total number of possible models is relatively small - 65`536, we are not going to use the common stochastic methods - MARKOV CHAIN MONTE CARLO (MCMC), instead, we will iterate all possible models.
In our model, we going to explain ascore_power variable which is audience_score in 1.8 power, which we discussed previously in the data manipulation section. Also, we choose the most conservative method “BIC” prior distribution (Bayesian information criterion) for regression coefficients and “uniform” modelprior.
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
movies_score_bas.lm <- bas.lm(ascore_power ~ .,
data = movies_no_na[ , !(names(movies_no_na) %in% drops)],
prior = "BIC",
modelprior = uniform())
# original
drops <- c("ascore_power", "title", "cs_power_4", "ir_power_4", "rt_power_4", "inv_power_2")
movies_score_bas.lm_original <- bas.lm(audience_score ~ .,
data = movies_no_na[ , !(names(movies_no_na) %in% drops)],
prior = "BIC",
modelprior = uniform())Let’s examine the summary output of the model we got.
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000 1.0000 1.0000 1.0000 1.0000
## feature_filmyes 0.65782 1.0000 0.0000 1.0000 1.0000
## dramayes 0.13818 0.0000 0.0000 0.0000 0.0000
## mpaa_rating_Ryes 0.09033 0.0000 0.0000 0.0000 0.0000
## oscar_seasonyes 0.12259 0.0000 0.0000 0.0000 0.0000
## summer_seasonyes 0.16687 0.0000 0.0000 0.0000 0.0000
## thtr_rel_year 0.19477 0.0000 0.0000 0.0000 0.0000
## top200_boxyes 0.22812 0.0000 0.0000 0.0000 0.0000
## best_pic_nomyes 0.04213 0.0000 0.0000 0.0000 0.0000
## best_pic_winyes 0.04747 0.0000 0.0000 0.0000 0.0000
## best_actor_winyes 0.33614 0.0000 0.0000 1.0000 0.0000
## best_actress_winyes 0.05137 0.0000 0.0000 0.0000 0.0000
## best_dir_winyes 0.09775 0.0000 0.0000 0.0000 0.0000
## cs_power_4 0.21866 0.0000 0.0000 0.0000 1.0000
## ir_power_4 1.00000 1.0000 1.0000 1.0000 1.0000
## rt_power_4 0.68553 1.0000 1.0000 1.0000 1.0000
## inv_power_2 0.99991 1.0000 1.0000 1.0000 1.0000
## BF NA 1.0000 0.5780 0.5043 0.4470
## PostProbs NA 0.0634 0.0367 0.0320 0.0284
## R2 NA 0.8152 0.8130 0.8166 0.8165
## dim NA 5.0000 4.0000 6.0000 6.0000
## logmarg NA -6028.1402 -6028.6884 -6028.8248 -6028.9453
## model 5
## Intercept 1.0000
## feature_filmyes 1.0000
## dramayes 0.0000
## mpaa_rating_Ryes 0.0000
## oscar_seasonyes 0.0000
## summer_seasonyes 0.0000
## thtr_rel_year 0.0000
## top200_boxyes 0.0000
## best_pic_nomyes 0.0000
## best_pic_winyes 0.0000
## best_actor_winyes 1.0000
## best_actress_winyes 0.0000
## best_dir_winyes 0.0000
## cs_power_4 0.0000
## ir_power_4 1.0000
## rt_power_4 0.0000
## inv_power_2 1.0000
## BF 0.4022
## PostProbs 0.0255
## R2 0.8146
## dim 5.0000
## logmarg -6029.0510
plot(movies_score_bas.lm, which=2, add.smooth = F, sub.caption="", caption="")
title(main = paste0("Cumulative sampled model probability from ",movies_score_bas.lm$n.models, " models"))From summary output and cumulative sampled model probability plot, we can conclude that the best estimator for predictions would be Bayesian model averaging (BMA). It is so because the posterior probability for the highest probability model is roughly 0.0634.
We can visualize including coefficient into the top 20 models.
We can see that for the top 20 models mostly includes feature_film and derivatives of imdb_rating, runtime. Interesting that those models just a few times include the derivative of critics_score.
For the next step, we can plot the marginal inclusion coefficient probability.
plot(movies_score_bas.lm, which = 4, ask = F, caption = "", sub.caption = "",
col.in = "blue", col.ex = "darkgrey", lwd = 3)This plot clearly demonstrates the importance of coefficients. From this perspective, we can see that best_actor_win is more important than derivative of critics_score.
For our next step let’s visualize our predictions and real audience_score values from different perspectives.
plot(movies_score_bas.lm, which = 1, add.smooth = F,
ask = F, pch = 16, sub.caption="", caption="")
abline(a = 0, b = 0, col = "darkgrey", lwd = 2)At the plot, we can see a relatively stable distribution of residuals, but for roughly 50-62 predicted audience score (1100-1700) distribution wider that for 75-96 score (2400-3700), and range increase again after 96 scores (3700).
Outliers.
## # A tibble: 3 x 2
## title audience_score
## <chr> <dbl>
## 1 The Wood 92
## 2 Dirty Love 67
## 3 Madea Goes to Jail 70
It’s interesting with “Madea Goes to Jail” outliers. This film was in outliers too in many other cases of explanatory variables. In a real-world tasks, we should surely need to make attention to this example to understand what’s going on.
# makе prediction
BMA_pred_movies_score <- predict(movies_score_bas.lm, estimator = "BMA", se.fit = TRUE, nsim = 10000)
# get credible intervals
ci_coef_movies_score <- confint(BMA_pred_movies_score, parm = "coef")
ci_pred_movies_score <- confint(BMA_pred_movies_score, parm = "pred")
# unite result in the data frame
df_movies_score <- data.frame((ci_coef_movies_score[,1:3]^(1/power)), (ci_pred_movies_score[,1:2]^(1/power)),
movies_no_na$critics_score, movies_no_na$audience_score, movies_no_na$imdb_rating)
colnames(df_movies_score) = c("low_mu","hight_mu","pred",
"low","hight",
"critics_score","audience_score", "imdb_rating")
# calculate residuals
df_movies_score$resid <- (df_movies_score$audience_score - df_movies_score$pred)
# potential outliers
df_outliers <- cbind(movies_no_na[outliers_index,c("title","audience_score")], df_movies_score[outliers_index,"pred"])
colnames(df_outliers) <- c("title","audience_score","pred")Now, we can illustrate our uncertainty about predicted values through the plot constructed below.
ggplot(data = df_movies_score) +
geom_point(aes(x = pred, y = audience_score), color = "blue", size = 0.6) +
geom_line(aes(x = pred, y = low_mu, lty = "second")) +
geom_line(aes(x = pred, y = hight_mu, lty = "second")) +
geom_line(aes(x = pred, y = low, lty = "third")) +
geom_line(aes(x = pred, y = hight, lty = "third")) +
geom_line(aes(x = pred, y = pred, color = "orange")) +
scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions")
, name = "") +
theme_bw() +
theme(legend.position = c(1, 0), legend.justification = c(1.5, 0)) +
geom_point(data = df_outliers, aes(x = pred, y = audience_score), color = "orange", pch = 1, cex = 6) +
ggtitle('Model uncertainty distribution')## Warning: Removed 63 row(s) containing missing values (geom_path).
The plot shows us that roughly under 50 predicted audience score our uncertainty significantly increase compared with the uncertainty of the total model.
For our next step, lets print coefficients from the model.
# Obtain the coefficients from the model
coef_bma_ascore <- coefficients(movies_score_bas.lm)
coef_bma_ascore_ci <- confint(coef_bma_ascore)
# converted measurement of predicting variable back
coef <- cbind(round(coef_bma_ascore_ci, 6),
`P(B != 0 | Y)` = round(coef_bma_ascore$probne0, 3))
coef## 2.5% 97.5% beta P(B != 0 | Y)
## Intercept 1799.03321 1862.486446 1831.4077 1.000
## feature_filmyes 0.00000 282.023404 121.5143 0.658
## dramayes -0.02023 81.096352 8.0824 0.138
## mpaa_rating_Ryes -48.61841 0.322871 -4.0625 0.090
## oscar_seasonyes -75.80410 0.000000 -7.2184 0.123
## summer_seasonyes 0.00000 88.792518 10.5660 0.167
## thtr_rel_year -4.07432 0.009322 -0.5692 0.195
## top200_boxyes -0.03932 316.609408 50.6231 0.228
## best_pic_nomyes 0.00000 0.000000 1.6630 0.042
## best_pic_winyes 0.00000 0.000000 -5.3074 0.047
## best_actor_winyes -159.86718 0.000000 -35.9863 0.336
## best_actress_winyes 0.00000 1.283068 -2.1158 0.051
## best_dir_winyes -107.88909 1.964775 -9.3770 0.098
## cs_power_4 0.00000 0.000002 0.0000 0.219
## ir_power_4 0.76373 0.855985 0.8093 1.000
## rt_power_4 0.00000 0.000000 0.0000 0.686
## inv_power_2 0.00000 0.000000 0.0000 1.000
Our model predict ascore_power = audience_score in 1.8 power. More then that our explanatory variables act as derivatives of runtime, imdb_rating, imdb_num_votes, critics_score. In that case, we have a model with multilinear relationships, and it is easy to explain how one variable affects another. But from that point of view, it’s challenging to understand what it really means.
For that case, we can introvert derivatives into original values. In this situation, it’s relatively easy to understand numbers because measurements are understandable, but after back transformation, we have no more linear relationship. In different areas of explanatory and predicted variables, we have different approximations of how would change predicting audience_score from explanatory variables.
For example, we going to plot the relationship between categorical variable feature_filmyes and predicting audience_score.
min_as <- min(movies_no_na$audience_score)
max_as <- max(movies_no_na$audience_score)
df_as <- data.frame(as = seq(from = min_as, to = max_as, length.out = 1000))
df_as$as_power <- df_as$as^power
df_as$ff_yes_power <- df_as$as_power+121.514270597514
df_as$ff_yes_low_power <- df_as$as_power+0
df_as$ff_yes_hight_power <- df_as$as_power+283.7847953641650
df_as$ff_yes <- df_as$ff_yes_power^(1/power)-df_as$as
df_as$ff_yes_low <- df_as$ff_yes_low_power^(1/power)-df_as$as
df_as$ff_yes_hight <- df_as$ff_yes_hight_power^(1/power)-df_as$as
ggplot(data = df_as) +
geom_line(aes(x = as, y = ff_yes, color = "orange")) +
geom_line(aes(x = as, y = ff_yes_low)) +
geom_line(aes(x = as, y = ff_yes_hight)) +
ggtitle("Relationship between feature_filmyes and audience_score") +
ylab("Changes score") +
xlab("Pred. audience score")The plot show as a clear picture that for low predicted audience score feature_filmyes variable has a bigger impact than for high variable. For instance, for 25 scores feature_filmyes variable adds approximately 5 points, but for 75 it would be only 2 points.
We going to express this relationship for all explanatory variables in audience_score area nearly 92 scores.
score <- 92
Intercept <- coef_bma_ascore_ci[1,"beta"]
Intercept_low <- coef_bma_ascore_ci[1,"2.5%"]
Intercept_hight <- coef_bma_ascore_ci[1,"97.5%"]
score <- score^power
score_low <- score + Intercept_low - Intercept
score_hight <- score + Intercept_hight - Intercept
coef_ci <- coef_bma_ascore_ci[-1,]
coef_ci <- data.frame(coef_ci)
coef_ci$beta_power <- as.vector(coef_bma_ascore_ci[-1,"beta"] + score)
coef_ci$beta_low_power <- as.vector(coef_bma_ascore_ci[-1,"2.5%"] + score_low)
coef_ci$beta_hight_power <- as.vector(coef_bma_ascore_ci[-1,"97.5%"] + score_hight)
coef_ci$beta <- coef_ci$beta_power^(1/power) - score^(1/power)
coef_ci$beta_low <- coef_ci$beta_low_power^(1/power) - score_low^(1/power)
coef_ci$beta_hight <- coef_ci$beta_hight_power^(1/power) - score_hight^(1/power)
coef <- rbind(data.frame("beta" = Intercept^(1/power),
"beta_low" = Intercept_low^(1/power),
"beta_hight" = Intercept_hight^(1/power)),
coef_ci[,c("beta","beta_low","beta_hight")])
coef <- cbind(coef, `P(B != 0 | Y)` = round(coef_bma_ascore$probne0, 3))
rownames(coef) <- c("Intercept", rownames(coef)[-1])
coef <- coef[order(-coef[,"P(B != 0 | Y)"]),]
round(coef, 20)## beta beta_low beta_hight
## Intercept 64.96184434861301 64.32134242627214 65.57199551221359
## ir_power_4 0.01207245864869 0.01144059169204 0.01271734145087
## inv_power_2 -0.00000000002086 -0.00000000002932 -0.00000000001228
## rt_power_4 -0.00000000205532 -0.00000000457295 0.00000000000000
## feature_filmyes 1.79864653047836 0.00000000000000 4.11712981605788
## best_actor_winyes -0.53808586955788 -2.42056386815170 0.00000000000000
## top200_boxyes 0.75270948406494 -0.00058911138517 4.61238516315210
## cs_power_4 0.00000000524112 -0.00000000001833 0.00000003410270
## thtr_rel_year -0.00849194624540 -0.06105204573318 0.00013849722012
## summer_seasonyes 0.15751138415058 0.00000000000000 1.31182314185546
## dramayes 0.12050661021073 -0.00030307349095 1.19870203515380
## oscar_seasonyes -0.10773079362508 -1.14129027097823 0.00000000000000
## best_dir_winyes -0.13996660311540 -1.62784112504021 0.02918853911457
## mpaa_rating_Ryes -0.06061826380078 -0.73066864704901 0.00479704566106
## best_actress_winyes -0.03156682228452 0.00000000000000 0.01906198394596
## best_pic_winyes -0.07920030828470 0.00000000000000 0.00000000000000
## best_pic_nomyes 0.02480592414193 0.00000000000000 0.00000000000000
## P(B != 0 | Y)
## Intercept 1.000
## ir_power_4 1.000
## inv_power_2 1.000
## rt_power_4 0.686
## feature_filmyes 0.658
## best_actor_winyes 0.336
## top200_boxyes 0.228
## cs_power_4 0.219
## thtr_rel_year 0.195
## summer_seasonyes 0.167
## dramayes 0.138
## oscar_seasonyes 0.123
## best_dir_winyes 0.098
## mpaa_rating_Ryes 0.090
## best_actress_winyes 0.051
## best_pic_winyes 0.047
## best_pic_nomyes 0.042
Our model predicts, that the mean audience_score of the average film would be 64.96 scores with a 95% chance that it is between 64.35 to 65.58 points. That score includes that moves include all categorical explanatory variable with value “no”. In other cases, they change score roughly the points in “beta” columns in the table above. Also in the table, we have credible intervals for those coefficients.
In that case, we still do not have a clear understanding picture of ir_power_4, inv_power_2, rt_power_4, and cs_power_4 because these variables have different measurements from originals.
Now, we going to plot the relationship between the original imdb_rating and predicting audience_score in our model. To make plot easy to the understanding we construct it for different imdb_rating and audience_score, but with condition that all other categorical variables are “no” and other numeric variables have an average value from the original data frame.
min_imdb_rating <- min(movies$imdb_rating)
max_imdb_rating <- max(movies$imdb_rating)
df_imdb_rating <- data.frame(imdb_rating = seq(from = min_imdb_rating, to = max_imdb_rating, length.out = 200))
for (value in names(movies_score_bas.lm$mean.x)) {
if (grepl("yes",value) == T) {
df_imdb_rating[,gsub("yes","",value)] <- "no"
}
df_imdb_rating[,value] <- movies_score_bas.lm$mean.x[value]
}
df_imdb_rating$imdb_rating <- seq(from = min_imdb_rating, to = max_imdb_rating, length.out = 200)
df_imdb_rating <- df_imdb_rating %>%
mutate(ir_log = log(imdb_rating),
ir_sqrt = sqrt(imdb_rating),
ir_power_2 = imdb_rating^2,
ir_power_4 = imdb_rating^4)
imdb_rating.new = predict(movies_score_bas.lm,
df_imdb_rating,
estimator = "BMA",
se.fit = TRUE,
nsim = 1000)
# get credible intervals
ci_coef_imdb_rating <- confint(imdb_rating.new, parm = "coef")
ci_pred_imdb_rating <- confint(imdb_rating.new, parm = "pred")
# unite result in the data frame
df_imdb_rating.new <- data.frame((ci_coef_imdb_rating[,1:3]^(1/power)),
(ci_pred_imdb_rating[,1:2]^(1/power)),
df_imdb_rating$imdb_rating)
colnames(df_imdb_rating.new) = c("low_mu","hight_mu","pred",
"low","hight",
"imdb_rating")
ggplot(data = df_imdb_rating.new) +
geom_line(aes(x = imdb_rating, y = pred, color = "orange")) +
geom_line(aes(x = imdb_rating, y = low_mu, lty = "second")) +
geom_line(aes(x = imdb_rating, y = hight_mu, lty = "second")) +
geom_line(aes(x = imdb_rating, y = low, lty = "third")) +
geom_line(aes(x = imdb_rating, y = hight, lty = "third")) +
scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions")
, name = "") +
theme_bw() +
theme(legend.position = c(1, 0), legend.justification = c(1.5, 0)) +
ggtitle('Model uncertainty distribution under imdb_rating variable') +
ylab("predicting `audience_score`") +
xlab("original `imdb_rating`")## Warning: Removed 48 row(s) containing missing values (geom_path).
## Warning: Removed 102 row(s) containing missing values (geom_path).
Plot cleary expres model uncertainty about imdb_rating impact on audience_score. Also, we can see that each additional imdb_rating point for low it value add much less additional audience_score scores, then for hight imdb_rating.
Let’s make a prediction for “Ford v Ferrari” (2019) movie and compare the result with the average current rating. The data about movie we got from imdb.com and rottentomatoes.com.
As the first step, we going to build the data frame with needed “Ford v Ferrari” (2019) movie variables for our model. After that, we will make predictions, construct credible intervals, and prepare data for subsequent plot.
# add variables from our imdb.com and rottentomatoes.com resources
ford_v_ferrari <- data.frame(feature_film = "yes", drama = "no", mpaa_rating_R = "no",
oscar_season = "yes", summer_season = "no", thtr_rel_year = 2019,
top200_box = "yes", best_pic_nom = "yes", best_pic_win = "yes",
best_actor_win = "yes", best_actress_win = "no", best_dir_win = "yes",
critics_score = 92, imdb_rating = 8.1, runtime = 152,
imdb_num_votes = 244880, audience_score = 98)
# create derivatives
ford_v_ferrari <- ford_v_ferrari %>%
mutate(cs_power_4 = critics_score^4,
ir_power_4 = imdb_rating^4,
rt_power_4 = runtime^4,
inv_power_2 = imdb_num_votes^2)
# make prediction
ford_v_ferrari.new = predict(movies_score_bas.lm,
newdata = ford_v_ferrari,
estimator = "BMA",
se.fit = TRUE,
nsim = 10000)
# construct credible intervals
ci_coef_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "coef")
ci_pred_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "pred")
ci_ford_v_ferrari.new <- data.frame((ci_coef_ford_v_ferrari.new[,1])^(1/power),
(ci_coef_ford_v_ferrari.new[,2])^(1/power),
(ci_coef_ford_v_ferrari.new[,3])^(1/power),
(ci_pred_ford_v_ferrari.new[,1])^(1/power),
(ci_pred_ford_v_ferrari.new[,2])^(1/power),
ford_v_ferrari$audience_score)
colnames(ci_ford_v_ferrari.new) = c("low_mu","hight_mu","pred", "low","hight", "audience_score")
# prepare data for plot
df <- median(ford_v_ferrari.new$df)
pred <- ford_v_ferrari.new$Ybma[1,1]
sd_fit <- ford_v_ferrari.new$se.bma.fit
sd_pred <- ford_v_ferrari.new$se.bma.pred
df_ci_coef <- data.frame(mu = (rt(10000,df)*sd_fit+pred)^(1/power))
df_ci_coef$pred <- (rt(10000,median(df))*sd_pred +pred)^(1/power)Almost the same steps we going to do with the original variables model.
# make prediction
ford_v_ferrari.new = predict(movies_score_bas.lm_original,
newdata = ford_v_ferrari,
estimator = "BMA",
se.fit = TRUE,
nsim = 10000)
# construct credible intervals
ci_coef_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "coef")
ci_pred_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "pred")
ci_ford_v_ferrari.new_original <- data.frame((ci_coef_ford_v_ferrari.new[,1]),
(ci_coef_ford_v_ferrari.new[,2]),
(ci_coef_ford_v_ferrari.new[,3]),
(ci_pred_ford_v_ferrari.new[,1]),
(ci_pred_ford_v_ferrari.new[,2]),
ford_v_ferrari$audience_score)
colnames(ci_ford_v_ferrari.new_original) = c("low_mu","hight_mu","pred","low","hight","audience_score")
# prepare data for plot
df <- median(ford_v_ferrari.new$df)
pred_original <- ford_v_ferrari.new$Ybma[1,1]
sd_fit <- ford_v_ferrari.new$se.bma.fit
sd_pred <- ford_v_ferrari.new$se.bma.pred
df_ci_coef_original <- data.frame(mu = (rt(10000,df)*sd_fit+pred_original))
df_ci_coef_original$pred <- (rt(10000,median(df))*sd_pred +pred_original)plot_original <- ggplot() +
geom_density(data = df_ci_coef_original, aes(x = mu), color = "darkblue", fill="blue", alpha=.3) +
geom_density(data = df_ci_coef_original, aes(x = pred), linetype="dotted", color = "darkblue", fill="blue", alpha=.1) +
geom_vline(xintercept = pred_original, color = "darkblue", size=1, linetype="dashed") +
geom_vline(xintercept = 98, color = "darkred", size=1) +
ggtitle("Original model prediction distribution") +
xlim(60,120)
plot <- ggplot() +
geom_density(data = df_ci_coef, aes(x = mu), color = "darkgreen", fill="green", alpha=.3) +
geom_density(data = df_ci_coef, aes(x = pred), linetype="dotted", color = "darkgreen", fill="green", alpha=.1) +
geom_vline(xintercept = pred^(1/power), color = "darkgreen", size=1, linetype="dashed") +
geom_vline(xintercept = 98, color = "darkred", size=1) +
ggtitle("Derivative model prediction distribution") +
xlim(60,120)
require(gridExtra)
grid.arrange(plot, plot_original, ncol = 1)## Warning: Removed 45 rows containing non-finite values (stat_density).
rownames(ci_ford_v_ferrari.new_original) <- "original model"
rownames(ci_ford_v_ferrari.new) <- "derivative model"
print(rbind(ci_ford_v_ferrari.new_original, ci_ford_v_ferrari.new))## low_mu hight_mu pred low hight audience_score
## original model 83.49 92.43 87.66 67.92 107.9 98
## derivative model 90.19 98.51 93.90 81.23 105.8 98
Our main derivative model predicts roughly 93.9 audience score with a 95% chance that it is between 80.5 to 100 points (audience score can’t reach 105.5).
The model with original variables for this film shows the less accurate results - 87.7 points with a 95% chance that it is between 67.9 to 100 points. Compare with the main model result mean shifted to the left and the range of expected value is wider 32.1 points versus 19.5 derivative model.
To summarize our findings we can make some conclusions.
First, we create and compare 2 models:
All models were created by Bayesian approach which includes specifying a prior Normal distribution for the error, prior Student’s t-distributions for all in the following equation \[\begin{equation} y_{\text{score}, i} = \beta_0 + \beta_1 (x_{\text{hs},i}-\bar{x}_{\text{hs}}) + \beta_2 (x_{\text{IQ},i}-\bar{x}_{\text{IQ}}) + \beta_3(x_{\text{work},i}-\bar{x}_{\text{work}}) + \beta_4 (x_{\text{age},i}-\bar{x}_{\text{age}}) + \epsilon_i. \end{equation}\].
We used “BIC” prior distribution for finding appropriate regression coefficients and Bayesian model averaging to proportionally aggregate information from all possible models. Also, we used the advantage of the Bayesian approach and transform data to make it more informative for interesting prediction areas (“Ford v Ferrari” (2019) movie) and as the result, we get a more accurate prediction result. That’s would be very interesting because we can specify our sphere of interest and manage our model to reach the best result using all data knowledge we have with almost no retraining.
For shortcomings, we can name difficulties of the Bayesian approach and hight computational resource intensity.
If you find this project interesting and you want to have contact with someone who also interesting in Bayesian statistics - you are welcome.
Instagram: @ditiatev_iurii