library(tidyverse)
library(brms)
load("movies.Rdata")
The idea is to use machine learning to help with model specification in Bayesian regression. Sampling can be quite slow when too many parameters need to be sampled, so in this article I use a random forest to identify the most important features, then use only the few most important ones to predict the audience score.
The data is a small sample of the IMDb data set with some extra interesting features.
movies_df <- movies %>% select(-ends_with("url"))
movies_df
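A quick way to see the remaining variables and their types is dplyr's glimpse(); this inspection step is my addition for illustration.
# inspect column names and types after dropping the url columns
glimpse(movies_df)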
First, to get a rough idea of the data, I plotted the audience score against some intuitively relevant features, such as the critics score and critics rating.
movies_df %>%
  ggplot(aes(x = critics_score, y = audience_score)) +
  geom_point(aes(col = critics_rating)) +
  scale_color_manual(values = c("blue", "orange", "gray")) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
movies_df %>%
  ggplot(aes(x = critics_rating, y = audience_score)) +
  geom_boxplot()
Next comes some light data cleaning: drop the character variables and the rows with NA values.
library(mlr)
## Loading required package: ParamHelpers
movie_no_char <- movies_df %>%
  select(-actor1, -actor2, -actor3, -actor4, -actor5, -director, -title)
movie_cln <- movie_no_char %>% drop_na()
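As a quick sanity check (not in the original analysis), we can confirm how many rows survive the cleaning and that no NAs remain.
# rows/columns after cleaning, and a check that drop_na() worked
dim(movie_cln)
anyNA(movie_cln)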
Make a regression task, following the usual mlr procedure.
mod_task <- makeRegrTask(data = movie_cln, target = "audience_score")
Set up a simple random forest learner (ranger) with impurity-based feature importance.
lr <- makeLearner("regr.ranger", importance = "impurity")
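Note that importance = "impurity" requests impurity-based importance from ranger. Permutation importance is a common alternative that tends to be less biased toward high-cardinality features; a sketch of that variant, not used below:
# alternative learner with permutation-based importance (sketch only)
lr_perm <- makeLearner("regr.ranger", importance = "permutation")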
Check the performance of the random forest; this serves as a benchmark for later reference. I use cross-validation to make the measure more reliable.
sample_desc <- makeResampleDesc("CV")
res <- resample(lr, mod_task, sample_desc, rmse)
## Resampling: cross-validation
## Measures: rmse
## [Resample] iter 1: 6.7092821
## [Resample] iter 2: 7.8045171
## [Resample] iter 3: 8.0942407
## [Resample] iter 4: 7.2091167
## [Resample] iter 5: 8.2472928
## [Resample] iter 6: 7.2830116
## [Resample] iter 7: 6.6373611
## [Resample] iter 8: 7.0552352
## [Resample] iter 9: 7.0969398
## [Resample] iter 10: 8.7423425
##
## Aggregated Result: rmse.test.rmse=7.5173078
##
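The aggregated cross-validated RMSE can also be pulled out of the resample result programmatically; this small convenience is my addition.
# named numeric vector holding the aggregated measure
res$aggr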
Then train the model and get the feature importance.
imp <- lr %>%
  train(mod_task) %>%
  getFeatureImportance()
imp$res %>%
  as.data.frame() %>%
  gather(feature, importance) %>%
  arrange(desc(importance))
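A bar chart (my addition) makes the ranking easier to read than the raw table; it reuses the same long format as above.
# plot impurity importance, most important feature on top
imp$res %>%
  as.data.frame() %>%
  gather(feature, importance) %>%
  ggplot(aes(x = reorder(feature, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(x = "feature", y = "impurity importance")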
For the Bayesian model, I picked the few most important features. The categorical variables enter as group-level intercepts only (varying intercepts, no varying slopes).
Check the default priors first.
priors <- get_prior(audience_score ~ imdb_rating + critics_score +
                      (1 | studio) + (1 | genre), data = movie_cln)
I replaced the default priors for the group-level standard deviations and the residual sigma with the more common half-Cauchy prior.
priors$prior[priors$class == "sd"] <- "cauchy(0, 1)"
priors$prior[priors$class == "sigma"] <- "cauchy(0, 1)"
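Alternatively, instead of editing the prior data frame in place, brms lets you declare priors directly with its prior() helper; a minimal sketch of the equivalent specification:
# equivalent declaration with brms' prior() helper (sketch)
priors_alt <- c(prior(cauchy(0, 1), class = "sd"),
                prior(cauchy(0, 1), class = "sigma"))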
Run the model with brms, setting the warmup and the number of iterations a little higher than the defaults.
brmod <- brm(audience_score ~ imdb_rating + critics_score +
               (1 | studio) + (1 | genre), data = movie_cln,
             iter = 4000, warmup = 2000, prior = priors)
Check the sampling paths and the posterior distributions.
plot(brmod)
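A posterior predictive check is another useful diagnostic (my addition): it overlays draws from the fitted model on the observed audience scores, so systematic misfit is easy to spot.
# compare replicated data to the observed outcome
pp_check(brmod)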
Here are the regression results. The IMDb rating has a very strong effect, which is no surprise since the IMDb rating is itself an audience score in nature. Another thing worth noticing is that the coefficient of critics_score is close to zero. This suggests that critics' opinions usually don't represent the audience's opinions.
summary(brmod)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: audience_score ~ imdb_rating + critics_score + (1 | studio) + (1 | genre)
## Data: movie_cln (Number of observations: 634)
## Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
## total post-warmup samples = 8000
##
## Group-Level Effects:
## ~genre (Number of levels: 11)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 3.16 1.14 1.43 5.91 3226 1.00
##
## ~studio (Number of levels: 210)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 0.73 0.58 0.03 2.13 2469 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept -37.03 3.21 -43.29 -30.67 8000 1.00
## imdb_rating 14.75 0.58 13.63 15.87 8000 1.00
## critics_score 0.07 0.02 0.03 0.11 7202 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 9.84 0.29 9.29 10.42 8000 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Then we can plot the original scores against the estimated ones. Because the model is Bayesian, we can also draw credible intervals along with the point estimates.
pred <- predict(brmod, movie_cln)
pred %>%
  as_tibble() %>%
  bind_cols(audience_score = movie_cln$audience_score) %>%
  ggplot(aes(audience_score, Estimate)) +
  geom_point() +
  geom_ribbon(aes(ymin = Q2.5, ymax = Q97.5), alpha = 0.3)
Finally, let's check the RMSE of the Bayesian model as well. Technically this comparison is flawed, since here the data was never split into training and test sets, so the RMSE below is in-sample. The focus, though, is on finding the relations between variables, not on prediction accuracy.
pred %>%
  as_tibble() %>%
  bind_cols(audience_score = movie_cln$audience_score) %>%
  mutate(err = (audience_score - Estimate)^2) %>%
  summarise(rmse = sqrt(sum(err) / n()))
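For a comparison on the same footing as the cross-validated random-forest benchmark, one would refit on a training set and score a held-out test set. A minimal sketch, where the 80/20 split and the seed are my own choices:
# illustrative train/test split for an honest out-of-sample RMSE
set.seed(42)
train_idx <- sample(nrow(movie_cln), size = floor(0.8 * nrow(movie_cln)))
brmod_tr <- update(brmod, newdata = movie_cln[train_idx, ])
# allow_new_levels handles studios/genres unseen during training
pred_te <- predict(brmod_tr, newdata = movie_cln[-train_idx, ],
                   allow_new_levels = TRUE)
sqrt(mean((movie_cln$audience_score[-train_idx] - pred_te[, "Estimate"])^2))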