Bayesian modeling and prediction.

Info

This file will introduce you to the capabilities of the Bayesian approach to forecasting, plotting credible intervals, and shifting the focus of the model based on the same data.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(GGally)

Load data

load("movies.Rdata")

Setting a seed

set.seed(11082020)

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016. Sources for this data set were Rotten Tomatoes and IMDB APIs. Since these sources mostly focusing on English-spoken auditory we need to be very careful to generalizable our conclusion for all world population, but we can be more confident about the US, UK, and other English spoken countries. Also, we need to mention that this data set covert for the most part active internet users. We need to be careful to generalize conclusions for the least active internet users, probably, such as old people and others.

Since we do not have an experiment, but observation study, we can consider the only association, and do not can make causal conclusions.

Research question

We are interested in learning what attributes make a movie popular. We going to develop a Bayesian regression model to predict audience_score from the dataset explanatory variables. Also, we going to figure out the same ways to build a more accurate model by using the same data.

Part 2: Data manipulation

In this part, we are going to make some data manipulation whose were given by course project terms of Bayesian Statistics.

`feature_film`

New variable based on title_type: The new variable would be called feature_film with levels yes (movies that are feature films) and no.

summary(movies$title_type)

##  Documentary Feature Film     TV Movie 
##           55          591            5

movies <- movies %>%
  mutate(feature_film = as.factor(if_else(title_type == "Feature Film", "yes", "no")))

summary(movies$feature_film)

##  no yes 
##  60 591

`drama`

New variable based on genre: The new variable would be called drama with levels yes (movies that are dramas) and no.

summary(movies$genre)

##        Action & Adventure                 Animation Art House & International 
##                        65                         9                        14 
##                    Comedy               Documentary                     Drama 
##                        87                        52                       305 
##                    Horror Musical & Performing Arts        Mystery & Suspense 
##                        23                        12                        59 
##                     Other Science Fiction & Fantasy 
##                        16                         9

movies <- movies %>%
  mutate(drama = as.factor(if_else(genre == "Drama", "yes", "no")))

summary(movies$drama)

##  no yes 
## 346 305

`mpaa_rating_R`

New variable based on mpaa_rating: The new variable would be called mpaa_rating_R with levels yes (movies that are R rated) and no.

summary(movies$mpaa_rating)

##       G   NC-17      PG   PG-13       R Unrated 
##      19       2     118     133     329      50

movies <- movies %>%
  mutate(mpaa_rating_R = as.factor(if_else(mpaa_rating == "R", "yes", "no")))

summary(movies$mpaa_rating_R)

##  no yes 
## 322 329

Two new variables based on thtr_rel_month:

`oscar_season`

The new variable would be called oscar_season with levels yes (if movie is released in November, October, or December) and no.

summary(movies$thtr_rel_month)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    7.00    6.74   10.00   12.00

movies <- movies %>%
  mutate(oscar_season = as.factor(if_else(thtr_rel_month %in% c(10,11,12), "yes", "no")))

summary(movies$oscar_season)

##  no yes 
## 460 191

`summer_season`

The new variable would be called summer_season with levels yes (if movie is released in May, June, July, or August) and no.

movies <- movies %>%
  mutate(summer_season = as.factor(if_else(thtr_rel_month %in% c(5,6,7,8), "yes", "no")))

summary(movies$summer_season)

##  no yes 
## 443 208

`audience_score and some explanatory numeric variables`

Also, we are going transfom the predicting variable audience_score and some explanatory numeric variables.

Below in Part3, we will see that the distribution of audience_score is left-skewed which makes mean of this variable less robust measurement for predicting purposes. Our goal for transforming audience_score variable is to make the distribution more symmetric.

Transformation of explanatory numeric variables has similar purposes, and, also, if we going to Bayesian Multiple Linear Regression we will see that range of credible intervals of explanatory variables for predicting value also depends on how close stands sizes of explanatory variables for a predicted case and mean of explanatory variables that we have in our data. That leads us to the situation there for different ranges of explanatory variables we should use a different derivative of original variables.

\[\begin{equation} y_{\text{score}, i} = \beta_0 + \beta_1 (x_{\text{hs},i}-\bar{x}_{\text{hs}}) + \beta_2 (x_{\text{IQ},i}-\bar{x}_{\text{IQ}}) + \beta_3(x_{\text{work},i}-\bar{x}_{\text{work}}) + \beta_4 (x_{\text{age},i}-\bar{x}_{\text{age}}) + \epsilon_i. \end{equation}\]

power <- 1.8
movies <- movies %>%
  mutate(ascore_power = audience_score^(power))

movies <- movies %>%
  mutate(cs_log     = log(critics_score),
         cs_sqrt    = sqrt(critics_score),
         cs_power_2 = critics_score^2,
         cs_power_4 = critics_score^4,
         
         ir_log     = log(imdb_rating),
         ir_sqrt    = sqrt(imdb_rating),
         ir_power_2 = imdb_rating^2,
         ir_power_4 = imdb_rating^4,
         
         rt_log     = log(runtime),
         rt_sqrt    = sqrt(runtime),
         rt_power_2 = runtime^2,
         rt_power_4 = runtime^4,
         
         inv_log     = log(imdb_num_votes),
         inv_sqrt    = sqrt(imdb_num_votes),
         inv_power_2 = imdb_num_votes^2,
         inv_power_4 = imdb_num_votes^4)

Part 3: Exploratory data analysis

For the very beginning let’s see audience score on Rotten Tomatoes audience_score distributions:

movies %>%
    ggplot(aes(x = audience_score)) +
    geom_histogram(bins = 10) +
    ggtitle('Distribution of audience score on Rotten Tomatoes')

summary(movies$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.0    46.0    65.0    62.4    80.0    97.0

Distribution of audience_score is left-skewed with center around 65.

movies %>%
    ggplot(aes(x = ascore_power )) +
    geom_histogram(bins = 10) +
    ggtitle('Distribution of audience score on Rotten Tomatoes')

movies %>%
  summarise(Mean_n_power = mean((audience_score)^(power))^(1/power),
            Mean_n = mean(audience_score),
            median_n = median(audience_score))

## # A tibble: 1 x 3
##   Mean_n_power Mean_n median_n
##          <dbl>  <dbl>    <dbl>
## 1         65.0   62.4       65

Distribution of ascore_power is roughly symmetric with the center around 1850.

`feature_film`

movies %>%
        ggplot(aes(x = audience_score)) +
        geom_histogram(bins = 10) +
        facet_wrap("feature_film") +
        ggtitle('Distribution of audience score on Rotten Tomatoes')

movies %>%
        group_by(feature_film) %>%
        summarise(Min_n = min(audience_score),
                  Q1_n =  quantile(audience_score,0.25,type = 1),
                  Median_n = median(audience_score),
                  Q3_n = quantile(audience_score,0.75,type = 1),
                  Max_n = max(audience_score),
                  Mean_n = mean(audience_score),
                  Sd_n = sd(audience_score),
                  N_n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 9
##   feature_film Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n   N_n
##   <fct>        <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <int>
## 1 no              19    75     85.5    89    96   81.0  13.6    60
## 2 yes             11    44     62      78    97   60.5  19.8   591

Distributions of audience_score are left-skewed with center around 85.5 and 62 for “no” and “yes” feature_film groups respectively.

`drama`

movies %>%
        ggplot(aes(x = audience_score)) +
        geom_histogram(bins = 10) +
        facet_wrap("drama") +
        ggtitle('Distribution of audience score on Rotten Tomatoes')

movies %>%
        group_by(drama) %>%
        summarise(Min_n = min(audience_score),
                  Q1_n =  quantile(audience_score,0.25,type = 1),
                  Median_n = median(audience_score),
                  Q3_n = quantile(audience_score,0.75,type = 1),
                  Max_n = max(audience_score),
                  Mean_n = mean(audience_score),
                  Sd_n = sd(audience_score),
                  N_n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 9
##   drama Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n   N_n
##   <fct> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <int>
## 1 no       11    41       61    79    97   59.7  21.3   346
## 2 yes      13    52       70    80    95   65.3  18.5   305

Distributions of audience_score are left-skewed with center around 61 and 70 for “no” and “yes” drama groups respectively.

`mpaa_rating_R`

movies %>%
        ggplot(aes(x = audience_score)) +
        geom_histogram(bins = 10) +
        facet_wrap("mpaa_rating_R") +
        ggtitle('Distribution of audience score on Rotten Tomatoes')

movies %>%
        group_by(mpaa_rating_R) %>%
        summarise(Min_n = min(audience_score),
                  Q1_n =  quantile(audience_score,0.25,type = 1),
                  Median_n = median(audience_score),
                  Q3_n = quantile(audience_score,0.75,type = 1),
                  Max_n = max(audience_score),
                  Mean_n = mean(audience_score),
                  Sd_n = sd(audience_score),
                  N_n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 9
##   mpaa_rating_R Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n   N_n
##   <fct>         <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <int>
## 1 no               11    48     65.5    80    96   62.7  20.3   322
## 2 yes              14    44     64      79    97   62.0  20.2   329

Distributions of audience_score are left-skewed with center around 65.5 and 64.0 for “no” and “yes” mpaa_rating_R groups respectively.

`oscar_season`

movies %>%
        ggplot(aes(x = audience_score)) +
        geom_histogram(bins = 10) +
        facet_wrap("oscar_season") +
        ggtitle('Distribution of audience score on Rotten Tomatoes')

movies %>%
        group_by(oscar_season) %>%
        summarise(Min_n = min(audience_score),
                  Q1_n =  quantile(audience_score,0.25,type = 1),
                  Median_n = median(audience_score),
                  Q3_n = quantile(audience_score,0.75,type = 1),
                  Max_n = max(audience_score),
                  Mean_n = mean(audience_score),
                  Sd_n = sd(audience_score),
                  N_n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 9
##   oscar_season Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n   N_n
##   <fct>        <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <int>
## 1 no              11    46       64    79    96   61.8  20.1   460
## 2 yes             13    47       69    81    97   63.7  20.5   191

Distributions of audience_score are left-skewed with center around 64.0 and 69.0 for “no” and “yes” oscar_season groups respectively.

`summer_season`

movies %>%
        ggplot(aes(x = audience_score)) +
        geom_histogram(bins = 10) +
        facet_wrap("summer_season") +
        ggtitle('Distribution of audience score on Rotten Tomatoes')

movies %>%
        group_by(summer_season) %>%
        summarise(Min_n = min(audience_score),
                  Q1_n =  quantile(audience_score,0.25,type = 1),
                  Median_n = median(audience_score),
                  Q3_n = quantile(audience_score,0.75,type = 1),
                  Max_n = max(audience_score),
                  Mean_n = mean(audience_score),
                  Sd_n = sd(audience_score),
                  N_n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 9
##   summer_season Min_n  Q1_n Median_n  Q3_n Max_n Mean_n  Sd_n   N_n
##   <fct>         <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <int>
## 1 no               13    46       66    80    97   62.6  20.4   443
## 2 yes              11    44       65    78    94   61.8  19.9   208

Distributions of audience_score are left-skewed with center around 66.0 and 65.0 for “no” and “yes” summer_season groups respectively.

`critics_score`

plot <- ggplot(data = movies) + 
  geom_histogram(aes(x = critics_score), bins = 20) + 
  ggtitle('Critics score distribution')

log_plot <- ggplot(data = movies) + 
  geom_histogram(aes(x = cs_log), bins = 20) + 
  ggtitle('Log')

sqrt_plot <- ggplot(data = movies) + 
  geom_histogram(aes(x = cs_sqrt), bins = 20) + 
  ggtitle('Sqrt')

power_2_plot <- ggplot(data = movies) + 
  geom_histogram(aes(x = cs_power_2), bins = 20) + 
  ggtitle('Power 2')

power_4_plot <- ggplot(data = movies) + 
  geom_histogram(aes(x = cs_power_4), bins = 20) + 
  ggtitle('Power 4')

require(gridExtra)

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(plot, 
             grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, 
                          ncol = 4), ncol = 1)

mean <- movies %>%
  summarise("critics_score" = mean(critics_score),
            "cs_log" = exp(mean(cs_log)),
            "cs_sqrt" = mean(cs_sqrt)^2,
            "cs_power_2" = sqrt(mean(cs_power_2)),
            "cs_power_4" = mean(cs_power_4)^(1/4)) %>% 
  t()
median <- movies %>%
  summarise("critics_score" = median(critics_score),
            "cs_log" = exp(median(cs_log)),
            "cs_sqrt" = median(cs_sqrt)^2,
            "cs_power_2" = sqrt(median(cs_power_2)),
            "cs_power_4" = median(cs_power_4)^(1/4)) %>% 
  t()
summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output

##                mean median median/mean median-mean
## critics_score 57.69     61      1.0574       3.312
## cs_log        47.16     61      1.2934      13.836
## cs_sqrt       53.07     61      1.1494       7.927
## cs_power_2    64.29     61      0.9488      -3.292
## cs_power_4    72.03     61      0.8468     -11.034

Distribution of critics_score is left-skewed with center around 61.

`imdb_rating`

plot <- ggplot(data = movies) + geom_histogram(aes(x = imdb_rating), bins = 20) + ggtitle('IMDB rating distribution')
log_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_log), bins = 20) + ggtitle('Log')
sqrt_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_sqrt), bins = 20) + ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_power_2), bins = 20) + ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) + geom_histogram(aes(x = ir_power_4), bins = 20) + ggtitle('Power 4')

require(gridExtra)
grid.arrange(plot, grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, ncol = 4), ncol = 1)

mean <- movies %>%
  summarise("imdb_rating" = mean(imdb_rating),
            "ir_log" = exp(mean(ir_log)),
            "ir_sqrt" = mean(ir_sqrt)^2,
            "ir_power_2" = sqrt(mean(ir_power_2)),
            "ir_power_4" = mean(ir_power_4)^(1/4)) %>% 
  t()
median <- movies %>%
  summarise("imdb_rating" = median(imdb_rating),
            "ir_log" = exp(median(ir_log)),
            "ir_sqrt" = median(ir_sqrt)^2,
            "ir_power_2" = sqrt(median(ir_power_2)),
            "ir_power_4" = median(ir_power_4)^(1/4)) %>% 
  t()

summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output

##              mean median median/mean median-mean
## imdb_rating 6.493    6.6      1.0165     0.10691
## ir_log      6.386    6.6      1.0336     0.21449
## ir_sqrt     6.442    6.6      1.0245     0.15802
## ir_power_2  6.583    6.6      1.0026     0.01706
## ir_power_4  6.729    6.6      0.9808    -0.12938

Distribution of imdb_rating is left-skewed with center around 6.6.

`runtime`

plot <- ggplot(data = movies) + geom_histogram(aes(x = runtime), bins = 20) + ggtitle('Runtime distribution')
log_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_log), bins = 20) + ggtitle('Log')
sqrt_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_sqrt), bins = 20) + ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_power_2), bins = 20) + ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) + geom_histogram(aes(x = rt_power_4), bins = 20) + ggtitle('Power 4')

require(gridExtra)
grid.arrange(plot, grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, ncol = 4), ncol = 1)

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing non-finite values (stat_bin).

mean <- movies %>%
  filter(!is.na(runtime)) %>%
  summarise("runtime" = mean(runtime),
            "rt_log" = exp(mean(rt_log)),
            "rt_sqrt" = mean(rt_sqrt)^2,
            "rt_power_2" = sqrt(mean(rt_power_2)),
            "rt_power_4" = mean(rt_power_4)^(1/4)) %>% 
  t()
median <- movies %>%
  filter(!is.na(runtime)) %>%
  summarise("runtime" = median(runtime),
            "rt_log" = exp(median(rt_log)),
            "rt_sqrt" = median(rt_sqrt)^2,
            "rt_power_2" = sqrt(median(rt_power_2)),
            "rt_power_4" = median(rt_power_4)^(1/4)) %>% 
  t()

summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output

##             mean median median/mean median-mean
## runtime    105.8    103      0.9733      -2.822
## rt_log     104.2    103      0.9884      -1.213
## rt_sqrt    105.0    103      0.9809      -2.003
## rt_power_2 107.6    103      0.9573      -4.591
## rt_power_4 112.1    103      0.9188      -9.100

Distribution of runtime is right-skewed with center around 103.

`imdb_num_votes`

plot <- ggplot(data = movies) + geom_histogram(aes(x = imdb_num_votes), bins = 20) + ggtitle('IMDB number votes distribution')
log_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_log), bins = 20) + ggtitle('Log')
sqrt_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_sqrt), bins = 20) + ggtitle('Sqrt')
power_2_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_power_2), bins = 20) + ggtitle('Power 2')
power_4_plot <- ggplot(data = movies) + geom_histogram(aes(x = inv_power_4), bins = 20) + ggtitle('Power 4')

require(gridExtra)
grid.arrange(plot, grid.arrange(log_plot, sqrt_plot, power_2_plot, power_4_plot, ncol = 4), ncol = 1)

mean <- movies %>%
  filter(!is.na(imdb_num_votes)) %>%
  summarise("imdb_num_votes" = mean(imdb_num_votes),
            "inv_log" = exp(mean(inv_log)),
            "inv_sqrt" = mean(inv_sqrt)^2,
            "inv_power_2" = sqrt(mean(inv_power_2)),
            "inv_power_4" = mean(inv_power_4)^(1/4)) %>% 
  t()
median <- movies %>%
  filter(!is.na(imdb_num_votes)) %>%
  summarise("imdb_num_votes" = median(imdb_num_votes),
            "inv_log" = exp(median(inv_log)),
            "inv_sqrt" = median(inv_sqrt)^2,
            "inv_power_2" = sqrt(median(inv_power_2)),
            "inv_power_4" = median(inv_power_4)^(1/4)) %>% 
  t()

summary_output <- cbind(mean, median, median/mean, median-mean)
colnames(summary_output) <- c("mean", "median", "median/mean", "median-mean")
summary_output

##                  mean median median/mean median-mean
## imdb_num_votes  57533  15116     0.26274      -42417
## inv_log         16403  15116     0.92156       -1287
## inv_sqrt        32367  15116     0.46702      -17251
## inv_power_2    125947  15116     0.12002     -110831
## inv_power_4    268792  15116     0.05624     -253676

Distribution of imdb_num_votes is right-skewed with center around 15116.

Part 4: Modeling

At the end of the project, we going to predict “Ford v Ferrari” (2019) movie. In that case, we should choose which explanatory variables we going to include in the predicting model. Including variables in which explanatory variables values would be close to values of the movies may increase the accuracy of prediction in that area of variables.

From imdb.com and rottentomatoes.com we gote that “Ford v Ferrari” (2019) movie has:

critics_score = 92;
imdb_rating = 8.1;
runtime = 152;
imdb_num_votes = 244880.

In that case we should stay on derivatives which normalize mean would be closes to value abowe. From data manipulation section we can get folowing result:

for mean of critics_score (57.69) - cs_power_4 with normalized mean: 72.03;
for mean of imdb_rating (6.493) - ir_power_4 with normalized mean: 6.729;
for mean of runtime (105.8) - rt_power_4 with normalized mean: 112.1;
for mean of imdb_num_votes (57533) - inv_power_2 with normalized mean: 125947.

In the last case, we chouse inv_power_2 instead inv_power_4 with normalized mean 268792 because of singularity in prediction processes. We should mention that if we chouse to predict moves with low explanatory variables we probably would include log derivatives.

Also, we will construct the predicting model with original variables to make a prediction for comparison, but in this paper, we are not going to describe calculating processes for that additional model.

Prepare data

# Get interesting us columns
col_names <- c("audience_score", "ascore_power",
               
               # constructed explanatory categorical variables
               "feature_film", "drama", "mpaa_rating_R",
               "oscar_season", "summer_season",
               
               # original explanatory numeric variables
               "thtr_rel_year", "top200_box",
               "best_pic_nom", "best_pic_win", 
               "best_actor_win", "best_actress_win", "best_dir_win",
               
               # original explanatory numeric variables
               "runtime",
               "imdb_rating",
               "imdb_num_votes",
               "critics_score",
               
               # transform explanatory numeric variables
               "cs_power_4",
               "ir_power_4",
               "rt_power_4",
               "inv_power_2",
               
               "title")
movies_no_na <- movies[,col_names]

# Exclude observations with missing values in the data set
movies_no_na <- na.omit(movies_no_na)

Carrying out the model selection

Number of possible models.

drops <- c("audience_score","title","runtime", "imdb_rating", "imdb_num_votes", "critics_score")
n <- dim(movies_no_na)[2]-length(drops) - 1
2^n

## [1] 65536

Because the total number of possible models is relatively small - 65`536, we are not going to use the common stochastic methods - MARKOV CHAIN MONTE CARLO (MCMC), instead, we will iterate all possible models.

In our model, we going to explain ascore_power variable which is audience_score in 1.8 power, which we discussed previously in the data manipulation section. Also, we choose the most conservative method “BIC” prior distribution (Bayesian information criterion) for regression coefficients and “uniform” modelprior.

# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
movies_score_bas.lm <- bas.lm(ascore_power ~ ., 
                              data = movies_no_na[ , !(names(movies_no_na) %in% drops)],
                              prior = "BIC", 
                              modelprior = uniform())

# original
drops <- c("ascore_power", "title", "cs_power_4", "ir_power_4", "rt_power_4", "inv_power_2")
movies_score_bas.lm_original <- bas.lm(audience_score ~ .,
                                       data = movies_no_na[ , !(names(movies_no_na) %in% drops)],
                                       prior = "BIC", 
                                       modelprior = uniform())

Model diagnostics

Posterior inclusion probability

Let’s examine the summary output of the model we got.

summary(movies_score_bas.lm)

##                     P(B != 0 | Y)    model 1    model 2    model 3    model 4
## Intercept                 1.00000     1.0000     1.0000     1.0000     1.0000
## feature_filmyes           0.65782     1.0000     0.0000     1.0000     1.0000
## dramayes                  0.13818     0.0000     0.0000     0.0000     0.0000
## mpaa_rating_Ryes          0.09033     0.0000     0.0000     0.0000     0.0000
## oscar_seasonyes           0.12259     0.0000     0.0000     0.0000     0.0000
## summer_seasonyes          0.16687     0.0000     0.0000     0.0000     0.0000
## thtr_rel_year             0.19477     0.0000     0.0000     0.0000     0.0000
## top200_boxyes             0.22812     0.0000     0.0000     0.0000     0.0000
## best_pic_nomyes           0.04213     0.0000     0.0000     0.0000     0.0000
## best_pic_winyes           0.04747     0.0000     0.0000     0.0000     0.0000
## best_actor_winyes         0.33614     0.0000     0.0000     1.0000     0.0000
## best_actress_winyes       0.05137     0.0000     0.0000     0.0000     0.0000
## best_dir_winyes           0.09775     0.0000     0.0000     0.0000     0.0000
## cs_power_4                0.21866     0.0000     0.0000     0.0000     1.0000
## ir_power_4                1.00000     1.0000     1.0000     1.0000     1.0000
## rt_power_4                0.68553     1.0000     1.0000     1.0000     1.0000
## inv_power_2               0.99991     1.0000     1.0000     1.0000     1.0000
## BF                             NA     1.0000     0.5780     0.5043     0.4470
## PostProbs                      NA     0.0634     0.0367     0.0320     0.0284
## R2                             NA     0.8152     0.8130     0.8166     0.8165
## dim                            NA     5.0000     4.0000     6.0000     6.0000
## logmarg                        NA -6028.1402 -6028.6884 -6028.8248 -6028.9453
##                        model 5
## Intercept               1.0000
## feature_filmyes         1.0000
## dramayes                0.0000
## mpaa_rating_Ryes        0.0000
## oscar_seasonyes         0.0000
## summer_seasonyes        0.0000
## thtr_rel_year           0.0000
## top200_boxyes           0.0000
## best_pic_nomyes         0.0000
## best_pic_winyes         0.0000
## best_actor_winyes       1.0000
## best_actress_winyes     0.0000
## best_dir_winyes         0.0000
## cs_power_4              0.0000
## ir_power_4              1.0000
## rt_power_4              0.0000
## inv_power_2             1.0000
## BF                      0.4022
## PostProbs               0.0255
## R2                      0.8146
## dim                     5.0000
## logmarg             -6029.0510

plot(movies_score_bas.lm, which=2, add.smooth = F, sub.caption="", caption="")
title(main = paste0("Cumulative sampled model probability from ",movies_score_bas.lm$n.models, " models"))

From summary output and cumulative sampled model probability plot, we can conclude that the best estimator for predictions would be Bayesian model averaging (BMA). It is so because the posterior probability for the highest probability model is roughly 0.0634.

We can visualize including coefficient into the top 20 models.

image(movies_score_bas.lm, rotate = F)

We can see that for the top 20 models mostly includes feature_film and derivatives of imdb_rating, runtime. Interesting that those models just a few times include the derivative of critics_score.

For the next step, we can plot the marginal inclusion coefficient probability.

plot(movies_score_bas.lm, which = 4, ask = F, caption = "", sub.caption = "", 
     col.in = "blue", col.ex = "darkgrey", lwd = 3)

This plot clearly demonstrates the importance of coefficients. From this perspective, we can see that best_actor_win is more important than derivative of critics_score.

For our next step let’s visualize our predictions and real audience_score values from different perspectives.

Residuals plot

plot(movies_score_bas.lm, which = 1, add.smooth = F, 
     ask = F, pch = 16, sub.caption="", caption="")
abline(a = 0, b = 0, col = "darkgrey", lwd = 2)

At the plot, we can see a relatively stable distribution of residuals, but for roughly 50-62 predicted audience score (1100-1700) distribution wider that for 75-96 score (2400-3700), and range increase again after 96 scores (3700).

Outliers.

outliers_index <- c(18,126,216)
movies_no_na[outliers_index,c("title","audience_score")]

## # A tibble: 3 x 2
##   title              audience_score
##   <chr>                       <dbl>
## 1 The Wood                       92
## 2 Dirty Love                     67
## 3 Madea Goes to Jail             70

It’s interesting with “Madea Goes to Jail” outliers. This film was in outliers too in many other cases of explanatory variables. In a real-world tasks, we should surely need to make attention to this example to understand what’s going on.

Makе prediction for current data

# makе prediction
BMA_pred_movies_score <- predict(movies_score_bas.lm, estimator = "BMA", se.fit = TRUE, nsim = 10000)

# get credible intervals
ci_coef_movies_score <- confint(BMA_pred_movies_score, parm = "coef")
ci_pred_movies_score <- confint(BMA_pred_movies_score, parm = "pred")

# unite result in the data frame
df_movies_score <- data.frame((ci_coef_movies_score[,1:3]^(1/power)), (ci_pred_movies_score[,1:2]^(1/power)),
                              movies_no_na$critics_score, movies_no_na$audience_score, movies_no_na$imdb_rating)

colnames(df_movies_score) = c("low_mu","hight_mu","pred",
                              "low","hight",
                              "critics_score","audience_score", "imdb_rating")
# calculate residuals
df_movies_score$resid <- (df_movies_score$audience_score - df_movies_score$pred)

# potential outliers
df_outliers <- cbind(movies_no_na[outliers_index,c("title","audience_score")], df_movies_score[outliers_index,"pred"])
colnames(df_outliers) <- c("title","audience_score","pred")

Now, we can illustrate our uncertainty about predicted values through the plot constructed below.

ggplot(data = df_movies_score) +
  geom_point(aes(x = pred, y = audience_score), color = "blue", size = 0.6) +
  
  geom_line(aes(x = pred, y = low_mu, lty = "second")) +
  geom_line(aes(x = pred, y = hight_mu, lty = "second")) +
  
  geom_line(aes(x = pred, y = low, lty = "third")) +
  geom_line(aes(x = pred, y = hight, lty = "third")) +
  geom_line(aes(x = pred, y = pred, color = "orange")) +
  
  scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
  scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions")
                        , name = "") +
  theme_bw() +
  theme(legend.position = c(1, 0), legend.justification = c(1.5, 0)) +
  
  geom_point(data = df_outliers, aes(x = pred, y = audience_score), color = "orange", pch = 1, cex = 6) +
  ggtitle('Model uncertainty distribution')

## Warning: Removed 63 row(s) containing missing values (geom_path).

The plot shows us that roughly under 50 predicted audience score our uncertainty significantly increase compared with the uncertainty of the total model.

Interpretation of model coefficients

For our next step, lets print coefficients from the model.

# Obtain the coefficients from the model
coef_bma_ascore <- coefficients(movies_score_bas.lm)
coef_bma_ascore_ci <- confint(coef_bma_ascore)

# converted measurement of predicting variable back
coef <- cbind(round(coef_bma_ascore_ci, 6), 
              `P(B != 0 | Y)` = round(coef_bma_ascore$probne0, 3))
coef

##                           2.5%       97.5%      beta P(B != 0 | Y)
## Intercept           1799.03321 1862.486446 1831.4077         1.000
## feature_filmyes        0.00000  282.023404  121.5143         0.658
## dramayes              -0.02023   81.096352    8.0824         0.138
## mpaa_rating_Ryes     -48.61841    0.322871   -4.0625         0.090
## oscar_seasonyes      -75.80410    0.000000   -7.2184         0.123
## summer_seasonyes       0.00000   88.792518   10.5660         0.167
## thtr_rel_year         -4.07432    0.009322   -0.5692         0.195
## top200_boxyes         -0.03932  316.609408   50.6231         0.228
## best_pic_nomyes        0.00000    0.000000    1.6630         0.042
## best_pic_winyes        0.00000    0.000000   -5.3074         0.047
## best_actor_winyes   -159.86718    0.000000  -35.9863         0.336
## best_actress_winyes    0.00000    1.283068   -2.1158         0.051
## best_dir_winyes     -107.88909    1.964775   -9.3770         0.098
## cs_power_4             0.00000    0.000002    0.0000         0.219
## ir_power_4             0.76373    0.855985    0.8093         1.000
## rt_power_4             0.00000    0.000000    0.0000         0.686
## inv_power_2            0.00000    0.000000    0.0000         1.000

Our model predict ascore_power = audience_score in 1.8 power. More then that our explanatory variables act as derivatives of runtime, imdb_rating, imdb_num_votes, critics_score. In that case, we have a model with multilinear relationships, and it is easy to explain how one variable affects another. But from that point of view, it’s challenging to understand what it really means.

For that case, we can introvert derivatives into original values. In this situation, it’s relatively easy to understand numbers because measurements are understandable, but after back transformation, we have no more linear relationship. In different areas of explanatory and predicted variables, we have different approximations of how would change predicting audience_score from explanatory variables.

For example, we going to plot the relationship between categorical variable feature_filmyes and predicting audience_score.

min_as <- min(movies_no_na$audience_score)
max_as <- max(movies_no_na$audience_score)
df_as <- data.frame(as = seq(from = min_as, to = max_as, length.out = 1000))
df_as$as_power <- df_as$as^power
df_as$ff_yes_power <- df_as$as_power+121.514270597514
df_as$ff_yes_low_power <- df_as$as_power+0
df_as$ff_yes_hight_power <- df_as$as_power+283.7847953641650
df_as$ff_yes <- df_as$ff_yes_power^(1/power)-df_as$as
df_as$ff_yes_low <- df_as$ff_yes_low_power^(1/power)-df_as$as
df_as$ff_yes_hight <- df_as$ff_yes_hight_power^(1/power)-df_as$as

ggplot(data = df_as) +
  geom_line(aes(x = as, y = ff_yes, color = "orange")) +
  geom_line(aes(x = as, y = ff_yes_low)) +
  geom_line(aes(x = as, y = ff_yes_hight)) +
  ggtitle("Relationship between feature_filmyes and audience_score") +
  ylab("Changes score") +
  xlab("Pred. audience score")

The plot show as a clear picture that for low predicted audience score feature_filmyes variable has a bigger impact than for high variable. For instance, for 25 scores feature_filmyes variable adds approximately 5 points, but for 75 it would be only 2 points.

We going to express this relationship for all explanatory variables in audience_score area nearly 92 scores.

score <- 92

Intercept       <- coef_bma_ascore_ci[1,"beta"]
Intercept_low   <- coef_bma_ascore_ci[1,"2.5%"]
Intercept_hight <- coef_bma_ascore_ci[1,"97.5%"]

score       <- score^power
score_low   <- score + Intercept_low   - Intercept
score_hight <- score + Intercept_hight - Intercept

coef_ci <- coef_bma_ascore_ci[-1,]
coef_ci <- data.frame(coef_ci)

coef_ci$beta_power       <- as.vector(coef_bma_ascore_ci[-1,"beta"]  + score)
coef_ci$beta_low_power   <- as.vector(coef_bma_ascore_ci[-1,"2.5%"]  + score_low)
coef_ci$beta_hight_power <- as.vector(coef_bma_ascore_ci[-1,"97.5%"] + score_hight)

coef_ci$beta       <- coef_ci$beta_power^(1/power)       - score^(1/power)
coef_ci$beta_low   <- coef_ci$beta_low_power^(1/power)   - score_low^(1/power)
coef_ci$beta_hight <- coef_ci$beta_hight_power^(1/power) - score_hight^(1/power)

coef <- rbind(data.frame("beta" = Intercept^(1/power), 
                         "beta_low" = Intercept_low^(1/power), 
                         "beta_hight" = Intercept_hight^(1/power)),
              coef_ci[,c("beta","beta_low","beta_hight")])

coef <- cbind(coef, `P(B != 0 | Y)` = round(coef_bma_ascore$probne0, 3))

rownames(coef) <- c("Intercept", rownames(coef)[-1])
coef <- coef[order(-coef[,"P(B != 0 | Y)"]),]

round(coef, 20)

##                                  beta          beta_low        beta_hight
## Intercept           64.96184434861301 64.32134242627214 65.57199551221359
## ir_power_4           0.01207245864869  0.01144059169204  0.01271734145087
## inv_power_2         -0.00000000002086 -0.00000000002932 -0.00000000001228
## rt_power_4          -0.00000000205532 -0.00000000457295  0.00000000000000
## feature_filmyes      1.79864653047836  0.00000000000000  4.11712981605788
## best_actor_winyes   -0.53808586955788 -2.42056386815170  0.00000000000000
## top200_boxyes        0.75270948406494 -0.00058911138517  4.61238516315210
## cs_power_4           0.00000000524112 -0.00000000001833  0.00000003410270
## thtr_rel_year       -0.00849194624540 -0.06105204573318  0.00013849722012
## summer_seasonyes     0.15751138415058  0.00000000000000  1.31182314185546
## dramayes             0.12050661021073 -0.00030307349095  1.19870203515380
## oscar_seasonyes     -0.10773079362508 -1.14129027097823  0.00000000000000
## best_dir_winyes     -0.13996660311540 -1.62784112504021  0.02918853911457
## mpaa_rating_Ryes    -0.06061826380078 -0.73066864704901  0.00479704566106
## best_actress_winyes -0.03156682228452  0.00000000000000  0.01906198394596
## best_pic_winyes     -0.07920030828470  0.00000000000000  0.00000000000000
## best_pic_nomyes      0.02480592414193  0.00000000000000  0.00000000000000
##                     P(B != 0 | Y)
## Intercept                   1.000
## ir_power_4                  1.000
## inv_power_2                 1.000
## rt_power_4                  0.686
## feature_filmyes             0.658
## best_actor_winyes           0.336
## top200_boxyes               0.228
## cs_power_4                  0.219
## thtr_rel_year               0.195
## summer_seasonyes            0.167
## dramayes                    0.138
## oscar_seasonyes             0.123
## best_dir_winyes             0.098
## mpaa_rating_Ryes            0.090
## best_actress_winyes         0.051
## best_pic_winyes             0.047
## best_pic_nomyes             0.042

Our model predicts, that the mean audience_score of the average film would be 64.96 scores with a 95% chance that it is between 64.35 to 65.58 points. That score includes that moves include all categorical explanatory variable with value “no”. In other cases, they change score roughly the points in “beta” columns in the table above. Also in the table, we have credible intervals for those coefficients.

In that case, we still do not have a clear understanding picture of ir_power_4, inv_power_2, rt_power_4, and cs_power_4 because these variables have different measurements from originals.

Now, we going to plot the relationship between the original imdb_rating and predicting audience_score in our model. To make plot easy to the understanding we construct it for different imdb_rating and audience_score, but with condition that all other categorical variables are “no” and other numeric variables have an average value from the original data frame.

min_imdb_rating <- min(movies$imdb_rating)
max_imdb_rating <- max(movies$imdb_rating)
df_imdb_rating <- data.frame(imdb_rating = seq(from = min_imdb_rating, to = max_imdb_rating, length.out = 200))

for (value in names(movies_score_bas.lm$mean.x)) {
  if (grepl("yes",value) == T) {
    df_imdb_rating[,gsub("yes","",value)] <- "no"
  }
  df_imdb_rating[,value] <- movies_score_bas.lm$mean.x[value]
}

df_imdb_rating$imdb_rating <- seq(from = min_imdb_rating, to = max_imdb_rating, length.out = 200)

df_imdb_rating <- df_imdb_rating %>%
  mutate(ir_log     = log(imdb_rating),
         ir_sqrt    = sqrt(imdb_rating),
         ir_power_2 = imdb_rating^2,
         ir_power_4 = imdb_rating^4)

imdb_rating.new = predict(movies_score_bas.lm, 
                          df_imdb_rating, 
                          estimator = "BMA",
                          se.fit = TRUE, 
                          nsim = 1000)

# get credible intervals
ci_coef_imdb_rating <- confint(imdb_rating.new, parm = "coef")
ci_pred_imdb_rating <- confint(imdb_rating.new, parm = "pred")

# unite result in the data frame
df_imdb_rating.new <- data.frame((ci_coef_imdb_rating[,1:3]^(1/power)), 
                                 (ci_pred_imdb_rating[,1:2]^(1/power)),
                                 df_imdb_rating$imdb_rating)

colnames(df_imdb_rating.new) = c("low_mu","hight_mu","pred",
                                 "low","hight",
                                 "imdb_rating")

ggplot(data = df_imdb_rating.new) +
  geom_line(aes(x = imdb_rating, y = pred, color = "orange")) +  
  geom_line(aes(x = imdb_rating, y = low_mu, lty = "second")) +
  geom_line(aes(x = imdb_rating, y = hight_mu, lty = "second")) +
  geom_line(aes(x = imdb_rating, y = low, lty = "third")) +
  geom_line(aes(x = imdb_rating, y = hight, lty = "third")) +
  
  scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
  scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions")
                        , name = "") +
  theme_bw() +
  theme(legend.position = c(1, 0), legend.justification = c(1.5, 0)) +
  
  ggtitle('Model uncertainty distribution under imdb_rating variable') +
  ylab("predicting `audience_score`") +
  xlab("original `imdb_rating`")

## Warning: Removed 48 row(s) containing missing values (geom_path).

## Warning: Removed 102 row(s) containing missing values (geom_path).

Plot cleary expres model uncertainty about imdb_rating impact on audience_score. Also, we can see that each additional imdb_rating point for low it value add much less additional audience_score scores, then for hight imdb_rating.

Part 5: Prediction

Let’s make a prediction for “Ford v Ferrari” (2019) movie and compare the result with the average current rating. The data about movie we got from imdb.com and rottentomatoes.com.

Construct main prediction with derivative variables.

As the first step, we going to build the data frame with needed “Ford v Ferrari” (2019) movie variables for our model. After that, we will make predictions, construct credible intervals, and prepare data for subsequent plot.

# add variables from our imdb.com and rottentomatoes.com resources 
ford_v_ferrari <- data.frame(feature_film = "yes", drama = "no", mpaa_rating_R = "no",
                             oscar_season = "yes", summer_season = "no", thtr_rel_year = 2019,
                             top200_box = "yes", best_pic_nom = "yes", best_pic_win = "yes",
                             best_actor_win = "yes", best_actress_win = "no", best_dir_win = "yes",
                             critics_score = 92, imdb_rating = 8.1, runtime = 152,
                             imdb_num_votes = 244880, audience_score = 98)
# create derivatives
ford_v_ferrari <- ford_v_ferrari %>%
  mutate(cs_power_4 = critics_score^4,
         ir_power_4 = imdb_rating^4,
         rt_power_4 = runtime^4,
         inv_power_2 = imdb_num_votes^2)

# make prediction
ford_v_ferrari.new = predict(movies_score_bas.lm, 
                             newdata = ford_v_ferrari, 
                             estimator = "BMA",
                             se.fit = TRUE, 
                             nsim = 10000)

# construct credible intervals
ci_coef_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "coef")
ci_pred_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "pred")
ci_ford_v_ferrari.new <- data.frame((ci_coef_ford_v_ferrari.new[,1])^(1/power),
                                    (ci_coef_ford_v_ferrari.new[,2])^(1/power),
                                    (ci_coef_ford_v_ferrari.new[,3])^(1/power),
                                    (ci_pred_ford_v_ferrari.new[,1])^(1/power),
                                    (ci_pred_ford_v_ferrari.new[,2])^(1/power),
                                    ford_v_ferrari$audience_score)
colnames(ci_ford_v_ferrari.new) = c("low_mu","hight_mu","pred", "low","hight", "audience_score")

# prepare data for plot
df <- median(ford_v_ferrari.new$df)
pred <- ford_v_ferrari.new$Ybma[1,1]
sd_fit <- ford_v_ferrari.new$se.bma.fit
sd_pred <- ford_v_ferrari.new$se.bma.pred
df_ci_coef <- data.frame(mu = (rt(10000,df)*sd_fit+pred)^(1/power))
df_ci_coef$pred <- (rt(10000,median(df))*sd_pred +pred)^(1/power)

Construct additional prediction with original variables.

Almost the same steps we going to do with the original variables model.

# make prediction
ford_v_ferrari.new = predict(movies_score_bas.lm_original, 
                             newdata = ford_v_ferrari, 
                             estimator = "BMA",
                             se.fit = TRUE, 
                             nsim = 10000)

# construct credible intervals
ci_coef_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "coef")
ci_pred_ford_v_ferrari.new <- confint(ford_v_ferrari.new, parm = "pred")
ci_ford_v_ferrari.new_original <- data.frame((ci_coef_ford_v_ferrari.new[,1]),
                                             (ci_coef_ford_v_ferrari.new[,2]),
                                             (ci_coef_ford_v_ferrari.new[,3]),
                                             (ci_pred_ford_v_ferrari.new[,1]),
                                             (ci_pred_ford_v_ferrari.new[,2]),
                                             ford_v_ferrari$audience_score)
colnames(ci_ford_v_ferrari.new_original) = c("low_mu","hight_mu","pred","low","hight","audience_score")

# prepare data for plot
df <- median(ford_v_ferrari.new$df)
pred_original <- ford_v_ferrari.new$Ybma[1,1]
sd_fit <- ford_v_ferrari.new$se.bma.fit
sd_pred <- ford_v_ferrari.new$se.bma.pred

df_ci_coef_original <- data.frame(mu = (rt(10000,df)*sd_fit+pred_original))
df_ci_coef_original$pred <- (rt(10000,median(df))*sd_pred +pred_original)

Get result from both predictions

plot_original <- ggplot() +
  geom_density(data = df_ci_coef_original, aes(x = mu), color = "darkblue", fill="blue", alpha=.3) +
  geom_density(data = df_ci_coef_original, aes(x = pred), linetype="dotted", color = "darkblue", fill="blue", alpha=.1) +
  geom_vline(xintercept = pred_original, color = "darkblue", size=1, linetype="dashed") +
  geom_vline(xintercept = 98, color = "darkred", size=1) +
  ggtitle("Original model prediction distribution") +
  xlim(60,120)

plot <- ggplot() +  
  geom_density(data = df_ci_coef, aes(x = mu), color = "darkgreen", fill="green", alpha=.3) +
  geom_density(data = df_ci_coef, aes(x = pred), linetype="dotted", color = "darkgreen", fill="green", alpha=.1) +
  geom_vline(xintercept = pred^(1/power), color = "darkgreen", size=1, linetype="dashed") +
  geom_vline(xintercept = 98, color = "darkred", size=1) +
  ggtitle("Derivative model prediction distribution") +
  xlim(60,120)

require(gridExtra)
grid.arrange(plot, plot_original, ncol = 1)

## Warning: Removed 45 rows containing non-finite values (stat_density).

rownames(ci_ford_v_ferrari.new_original) <- "original model"
rownames(ci_ford_v_ferrari.new) <- "derivative model"

print(rbind(ci_ford_v_ferrari.new_original, ci_ford_v_ferrari.new))

##                  low_mu hight_mu  pred   low hight audience_score
## original model    83.49    92.43 87.66 67.92 107.9             98
## derivative model  90.19    98.51 93.90 81.23 105.8             98

Our main derivative model predicts roughly 93.9 audience score with a 95% chance that it is between 80.5 to 100 points (audience score can’t reach 105.5).

The model with original variables for this film shows the less accurate results - 87.7 points with a 95% chance that it is between 67.9 to 100 points. Compare with the main model result mean shifted to the left and the range of expected value is wider 32.1 points versus 19.5 derivative model.

Part 6: Conclusion

To summarize our findings we can make some conclusions.

First, we create and compare 2 models:

the model with original variables;
the model with derivatives variables.

All models were created by Bayesian approach which includes specifying a prior Normal distribution for the error, prior Student’s t-distributions for all in the following equation \[\begin{equation} y_{\text{score}, i} = \beta_0 + \beta_1 (x_{\text{hs},i}-\bar{x}_{\text{hs}}) + \beta_2 (x_{\text{IQ},i}-\bar{x}_{\text{IQ}}) + \beta_3(x_{\text{work},i}-\bar{x}_{\text{work}}) + \beta_4 (x_{\text{age},i}-\bar{x}_{\text{age}}) + \epsilon_i. \end{equation}\].

We used “BIC” prior distribution for finding appropriate regression coefficients and Bayesian model averaging to proportionally aggregate information from all possible models. Also, we used the advantage of the Bayesian approach and transform data to make it more informative for interesting prediction areas (“Ford v Ferrari” (2019) movie) and as the result, we get a more accurate prediction result. That’s would be very interesting because we can specify our sphere of interest and manage our model to reach the best result using all data knowledge we have with almost no retraining.

For shortcomings, we can name difficulties of the Bayesian approach and hight computational resource intensity.

Contacts

If you find this project interesting and you want to have contact with someone who also interesting in Bayesian statistics - you are welcome.

Instagram: @ditiatev_iurii