## Warning: package 'dplyr' was built under R version 4.0.2
## Warning: package 'BAS' was built under R version 4.0.2
## Warning: package 'GGally' was built under R version 4.0.2
## Warning: package 'gplots' was built under R version 4.0.2
## Warning: package 'plotmo' was built under R version 4.0.2
## Warning: package 'TeachingDemos' was built under R version 4.0.2
## Warning: package 'tidyverse' was built under R version 4.0.2
## Warning: package 'readr' was built under R version 4.0.2
## Warning: package 'forcats' was built under R version 4.0.2
The goal of this project is to develop a Bayesian regression model to predict audience score of a film.
The data draws from APIs from imdb.com, rottentomatoes.com, and flixster.com. It is comprised of 651 randomly sampled movies produced and released before 2016.
Since random sampling is used, only correlations can be drawn, hence we can generalize our conclusions about audience_score to the population of movies rated by users of the two web-systems. One source of bias comes from the fact that the data is drawn from sources that cater to English speakers only, the data will be biased towards movies where English is the main language.
Another source of bias is that users who rated a movie on IMDb may also have rated the movies on Rotten Tomatoes. Therefore, if we create a model that uses a Rotten Tomatoes measure as response variable and an IMDb measure as explanatory variable, we may have bias, since part of the information are given by the same users in both systems. Another fact is that the ratings are expected to be correlated, given the bias described. Random assignment was not used, so our conclusions cannot be of causation.
In this section, we would do the following: 1. Remove Observations with NUll values 2. Create new variables to aid exploratory data analysis. 3. Create a subset of the movies data
Finally, We are create some new variables to aid our data analysis. Below is a summary of the new variables.
feature_film: “yes” if title_type is Feature Film, “no” otherwise.
drama: “yes” if genre is Drama, “no” otherwise runtime.
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise
oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise.
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.
# Create new variable based on `title_type`: New variable should be called
# `feature_film` with levels yes (movies that are feature films) and no
movies <- movies %>%
mutate(feature_film=ifelse(title_type=="Feature Film", "yes", "no"))
# Create new variable based on `genre`: New variable should be called `drama`
# with levels yes (movies that are dramas) and no
movies <- movies %>%
mutate(drama=ifelse(genre=="Drama", "yes", "no"))
# Create new variable based on `mpaa_rating`: New variable should be called
# `mpaa_rating_R` with levels yes (movies that are R rated) and no
movies <- movies %>%
mutate(mpaa_rating_R=ifelse(mpaa_rating=="R", "yes", "no"))
# Create two new variables based on `thtr_rel_month`:
# + New variable called `oscar_season` with levels yes (if movie is released in November, October, or December) and no
# + New variable called `summer_season` with levels yes (if movie is released in May, June, July, or August) and no
movies <- movies %>%
mutate(oscar_season=ifelse(thtr_rel_month >= 10 & thtr_rel_month <= 12 , "yes", "no"),
summer_season=ifelse(thtr_rel_month >= 5 & thtr_rel_month <= 8, "yes", "no"))movies dataWe’ll then create a subset of the movies data
movies_features <- c("audience_score", "feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
movies <- movies[movies_features]We’ll start out at a higher, broader level by taking a look at a summary of the variables in movies.
## audience_score feature_film drama runtime mpaa_rating_R
## Min. :11.00 Length:619 Length:619 Min. : 65.0 Length:619
## 1st Qu.:46.00 Class :character Class :character 1st Qu.: 93.0 Class :character
## Median :65.00 Mode :character Mode :character Median :103.0 Mode :character
## Mean :62.21 Mean :106.5
## 3rd Qu.:80.00 3rd Qu.:116.0
## Max. :97.00 Max. :267.0
## thtr_rel_year oscar_season summer_season imdb_rating imdb_num_votes
## Min. :1972 Length:619 Length:619 Min. :1.900 Min. : 183
## 1st Qu.:1991 Class :character Class :character 1st Qu.:5.900 1st Qu.: 5026
## Median :2000 Mode :character Mode :character Median :6.600 Median : 16480
## Mean :1998 Mean :6.486 Mean : 60014
## 3rd Qu.:2007 3rd Qu.:7.300 3rd Qu.: 62507
## Max. :2014 Max. :9.000 Max. :893008
## critics_score best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win top200_box
## Min. : 1.00 no :597 no :612 no :528 no :548 no :576 no :604
## 1st Qu.: 33.00 yes: 22 yes: 7 yes: 91 yes: 71 yes: 43 yes: 15
## Median : 61.00
## Mean : 57.43
## 3rd Qu.: 82.50
## Max. :100.00
This summary gives us a look at the spread of each variable.
Let’s also take a look at the levels of each variable.
## tibble [619 x 17] (S3: tbl_df/tbl/data.frame)
## $ audience_score : num [1:619] 73 81 91 76 27 76 47 89 66 75 ...
## $ feature_film : chr [1:619] "yes" "yes" "yes" "yes" ...
## $ drama : chr [1:619] "yes" "yes" "no" "yes" ...
## $ runtime : num [1:619] 80 101 84 139 90 142 93 88 119 127 ...
## $ mpaa_rating_R : chr [1:619] "yes" "no" "yes" "no" ...
## $ thtr_rel_year : num [1:619] 2013 2001 1996 1993 2004 ...
## $ oscar_season : chr [1:619] "no" "no" "no" "yes" ...
## $ summer_season : chr [1:619] "no" "no" "yes" "no" ...
## $ imdb_rating : num [1:619] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ imdb_num_votes : int [1:619] 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ critics_score : num [1:619] 45 96 91 80 33 57 17 90 83 89 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
## ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
We are going to conduct analyses on the relation of the new variables feature_film, drama, mpaa_rating_R, oscar_season and summer_season to the response variable audience_score.
The following boxplots show the visualization of distribution of audience_score from each group of yes and no of our new binary-categorical variables. The histogram shows the distribution of audience_score.
# plotting variables as horizontal -> Labels are read horizontally
a <- ggplot(aes(x=audience_score), data=movies) + geom_histogram(binwidth = 7, color = "steelblue", fill="white")
b <- ggplot(aes(x=feature_film, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
c <- ggplot(aes(x=drama, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
d <- ggplot(aes(x=mpaa_rating_R, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
e <- ggplot(aes(x=oscar_season, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
f <- ggplot(aes(x=summer_season, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
# Create a grid of plots
grid.arrange(
a, b, c, d, e, f,
nrow = 2,
top = "Audience Score",
bottom = textGrob(
"",
gp = gpar(fontface = 3, fontsize = 9),
hjust = 1,
x = 1
)
)From the plots above we see that the biggest difference is from audience_scores from feature_film. > Other variables have similar and small differences in the median.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.21 80.00 97.00
The median of audience_score is 65 and the mean is 62, which are not easily understood form the histogram.
We’ll then map out correlation charts that will show the relationships between audience_score and all other variables in movies.
Notice the high correlation between audience_score and critics_score. Let’s visualize this correlation with a scatter plot with a regression line.
## [1] 0.7015256
ggplot(data=movies, aes(x = audience_score, y = critics_score)) +
geom_jitter() +
geom_smooth(method = "lm")## `geom_smooth()` using formula 'y ~ x'
Let’s do the same with imdb_rating, again due to the high correlation it has with audience_score.
## [1] 0.8605425
ggplot(data=movies, aes(x = audience_score, y = imdb_rating)) +
geom_jitter() +
geom_smooth(method = "lm")## `geom_smooth()` using formula 'y ~ x'
We can see strong positive correlations with both sets of variables.
We’ll first create the full linear model with all the variables in movies.
We will use the stepAIC function from library MASS to build a model (backwards) until the AIC can not be lowered.
##
## Call:
## lm(formula = audience_score ~ ., data = na.omit(movies))
##
## Coefficients:
## (Intercept) feature_filmyes dramayes runtime
## 1.279e+02 -2.478e+00 1.340e+00 -6.438e-02
## mpaa_rating_Ryes thtr_rel_year oscar_seasonyes summer_seasonyes
## -1.417e+00 -7.778e-02 -3.442e-01 9.833e-01
## imdb_rating imdb_num_votes critics_score best_pic_nomyes
## 1.467e+01 7.547e-06 6.176e-02 5.258e+00
## best_pic_winyes best_actor_winyes best_actress_winyes best_dir_winyes
## -3.097e+00 -1.347e+00 -2.311e+00 -1.233e+00
## top200_boxyes
## 7.978e-01
We will use the stepAIC function, tuned to optimize for AIC, to find the best model. The model will be built backwards.
## Start: AIC=2878.42
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 8 61293 2876.5
## - oscar_season 1 11 61296 2876.5
## - best_pic_win 1 45 61330 2876.9
## - best_dir_win 1 50 61336 2876.9
## - summer_season 1 102 61388 2877.5
## - best_actor_win 1 126 61411 2877.7
## - feature_film 1 177 61463 2878.2
## <none> 61285 2878.4
## - drama 1 226 61511 2878.7
## - imdb_num_votes 1 272 61557 2879.2
## - mpaa_rating_R 1 289 61574 2879.3
## - best_actress_win 1 308 61593 2879.5
## - thtr_rel_year 1 378 61663 2880.2
## - best_pic_nom 1 397 61682 2880.4
## - runtime 1 647 61932 2882.9
## - critics_score 1 737 62022 2883.8
## - imdb_rating 1 54331 115617 3269.3
##
## Step: AIC=2876.5
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 10 61303 2874.6
## - best_pic_win 1 45 61338 2875.0
## - best_dir_win 1 52 61345 2875.0
## - summer_season 1 105 61399 2875.6
## - best_actor_win 1 125 61419 2875.8
## - feature_film 1 176 61470 2876.3
## <none> 61293 2876.5
## - drama 1 224 61517 2876.8
## - mpaa_rating_R 1 303 61597 2877.6
## - best_actress_win 1 305 61598 2877.6
## - imdb_num_votes 1 320 61613 2877.7
## - best_pic_nom 1 393 61686 2878.5
## - thtr_rel_year 1 396 61689 2878.5
## - runtime 1 645 61939 2881.0
## - critics_score 1 748 62042 2882.0
## - imdb_rating 1 54384 115677 3267.7
##
## Step: AIC=2874.6
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 43 61347 2873.0
## - best_dir_win 1 54 61358 2873.2
## - best_actor_win 1 129 61432 2873.9
## - summer_season 1 167 61470 2874.3
## - feature_film 1 180 61483 2874.4
## <none> 61303 2874.6
## - drama 1 231 61534 2874.9
## - mpaa_rating_R 1 303 61607 2875.7
## - best_actress_win 1 305 61609 2875.7
## - imdb_num_votes 1 321 61624 2875.8
## - best_pic_nom 1 384 61687 2876.5
## - thtr_rel_year 1 394 61697 2876.6
## - runtime 1 692 61995 2879.6
## - critics_score 1 748 62052 2880.1
## - imdb_rating 1 54374 115677 3265.7
##
## Step: AIC=2873.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 91 61438 2872.0
## - best_actor_win 1 119 61466 2872.2
## - summer_season 1 164 61511 2872.7
## - feature_film 1 170 61517 2872.8
## <none> 61347 2873.0
## - drama 1 229 61576 2873.3
## - imdb_num_votes 1 289 61636 2874.0
## - mpaa_rating_R 1 305 61651 2874.1
## - best_actress_win 1 319 61665 2874.3
## - best_pic_nom 1 342 61689 2874.5
## - thtr_rel_year 1 381 61728 2874.9
## - runtime 1 692 62038 2878.0
## - critics_score 1 750 62097 2878.6
## - imdb_rating 1 54632 115979 3265.3
##
## Step: AIC=2871.96
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 125 61563 2871.2
## - summer_season 1 157 61595 2871.5
## - feature_film 1 185 61622 2871.8
## <none> 61438 2872.0
## - drama 1 238 61676 2872.4
## - imdb_num_votes 1 266 61704 2872.6
## - mpaa_rating_R 1 318 61756 2873.2
## - best_actress_win 1 324 61762 2873.2
## - best_pic_nom 1 327 61764 2873.2
## - thtr_rel_year 1 352 61790 2873.5
## - critics_score 1 719 62157 2877.2
## - runtime 1 789 62227 2877.9
## - imdb_rating 1 54671 116109 3264.0
##
## Step: AIC=2871.22
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 176 61739 2871.0
## <none> 61563 2871.2
## - feature_film 1 212 61775 2871.3
## - drama 1 238 61801 2871.6
## - imdb_num_votes 1 282 61845 2872.0
## - best_pic_nom 1 294 61857 2872.2
## - mpaa_rating_R 1 310 61873 2872.3
## - thtr_rel_year 1 349 61912 2872.7
## - best_actress_win 1 354 61917 2872.8
## - critics_score 1 712 62275 2876.3
## - runtime 1 964 62527 2878.8
## - imdb_rating 1 54679 116242 3262.7
##
## Step: AIC=2870.99
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 188 61927 2870.9
## <none> 61739 2871.0
## - drama 1 214 61953 2871.1
## - best_pic_nom 1 262 62002 2871.6
## - imdb_num_votes 1 300 62039 2872.0
## - mpaa_rating_R 1 319 62058 2872.2
## - thtr_rel_year 1 342 62082 2872.4
## - best_actress_win 1 353 62092 2872.5
## - critics_score 1 813 62553 2877.1
## - runtime 1 1004 62743 2879.0
## - imdb_rating 1 54557 116296 3261.0
##
## Step: AIC=2870.87
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 124 62052 2870.1
## - imdb_num_votes 1 196 62124 2870.8
## <none> 61927 2870.9
## - best_pic_nom 1 248 62176 2871.4
## - thtr_rel_year 1 251 62178 2871.4
## - best_actress_win 1 364 62291 2872.5
## - mpaa_rating_R 1 411 62339 2873.0
## - critics_score 1 982 62909 2878.6
## - runtime 1 1005 62932 2878.8
## - imdb_rating 1 58584 120511 3281.0
##
## Step: AIC=2870.11
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 167 62219 2869.8
## <none> 62052 2870.1
## - thtr_rel_year 1 244 62296 2870.5
## - best_pic_nom 1 266 62317 2870.8
## - best_actress_win 1 330 62381 2871.4
## - mpaa_rating_R 1 351 62403 2871.6
## - runtime 1 910 62961 2877.1
## - critics_score 1 1021 63073 2878.2
## - imdb_rating 1 58890 120941 3281.2
##
## Step: AIC=2869.78
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - thtr_rel_year 1 169 62388 2869.5
## <none> 62219 2869.8
## - best_actress_win 1 320 62539 2871.0
## - mpaa_rating_R 1 330 62549 2871.1
## - best_pic_nom 1 395 62614 2871.7
## - runtime 1 777 62996 2875.5
## - critics_score 1 979 63198 2877.4
## - imdb_rating 1 63047 125266 3300.9
##
## Step: AIC=2869.46
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## <none> 62388 2869.5
## - best_actress_win 1 313 62701 2870.6
## - mpaa_rating_R 1 333 62720 2870.8
## - best_pic_nom 1 400 62787 2871.4
## - runtime 1 713 63101 2874.5
## - critics_score 1 1066 63454 2877.9
## - imdb_rating 1 62879 125267 3299.0
The final model built using AIC consists of the following variables:
runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win
AIC.lm <- lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data=movies)The coefficients of this model:
## (Intercept) runtime mpaa_rating_Ryes thtr_rel_year imdb_rating
## 68.80456944 -0.05746878 -1.47061512 -0.05028853 14.97379533
## critics_score best_pic_nomyes best_actor_winyes best_actress_winyes
## 0.06895589 4.89109667 -1.55307354 -2.20922544
Taking a look at the standard deviation of the model:
## [1] 10.08554
Plotting the residuals of the model:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that the residuals are normally distributed.
We will use the stepAIC function, tuned to optimize for BIC, to find the best model. The model will be built backwards.
## Start: AIC=2953.7
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 8 61293 2947.4
## - oscar_season 1 11 61296 2947.4
## - best_pic_win 1 45 61330 2947.7
## - best_dir_win 1 50 61336 2947.8
## - summer_season 1 102 61388 2948.3
## - best_actor_win 1 126 61411 2948.5
## - feature_film 1 177 61463 2949.1
## - drama 1 226 61511 2949.6
## - imdb_num_votes 1 272 61557 2950.0
## - mpaa_rating_R 1 289 61574 2950.2
## - best_actress_win 1 308 61593 2950.4
## - thtr_rel_year 1 378 61663 2951.1
## - best_pic_nom 1 397 61682 2951.3
## <none> 61285 2953.7
## - runtime 1 647 61932 2953.8
## - critics_score 1 737 62022 2954.7
## - imdb_rating 1 54331 115617 3340.2
##
## Step: AIC=2947.35
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 10 61303 2941.0
## - best_pic_win 1 45 61338 2941.4
## - best_dir_win 1 52 61345 2941.5
## - summer_season 1 105 61399 2942.0
## - best_actor_win 1 125 61419 2942.2
## - feature_film 1 176 61470 2942.7
## - drama 1 224 61517 2943.2
## - mpaa_rating_R 1 303 61597 2944.0
## - best_actress_win 1 305 61598 2944.0
## - imdb_num_votes 1 320 61613 2944.1
## - best_pic_nom 1 393 61686 2944.9
## - thtr_rel_year 1 396 61689 2944.9
## <none> 61293 2947.4
## - runtime 1 645 61939 2947.4
## - critics_score 1 748 62042 2948.4
## - imdb_rating 1 54384 115677 3334.1
##
## Step: AIC=2941.03
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 43 61347 2935.0
## - best_dir_win 1 54 61358 2935.1
## - best_actor_win 1 129 61432 2935.9
## - summer_season 1 167 61470 2936.3
## - feature_film 1 180 61483 2936.4
## - drama 1 231 61534 2936.9
## - mpaa_rating_R 1 303 61607 2937.7
## - best_actress_win 1 305 61609 2937.7
## - imdb_num_votes 1 321 61624 2937.8
## - best_pic_nom 1 384 61687 2938.5
## - thtr_rel_year 1 394 61697 2938.6
## <none> 61303 2941.0
## - runtime 1 692 61995 2941.5
## - critics_score 1 748 62052 2942.1
## - imdb_rating 1 54374 115677 3327.6
##
## Step: AIC=2935.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 91 61438 2929.5
## - best_actor_win 1 119 61466 2929.8
## - summer_season 1 164 61511 2930.3
## - feature_film 1 170 61517 2930.3
## - drama 1 229 61576 2930.9
## - imdb_num_votes 1 289 61636 2931.5
## - mpaa_rating_R 1 305 61651 2931.7
## - best_actress_win 1 319 61665 2931.8
## - best_pic_nom 1 342 61689 2932.1
## - thtr_rel_year 1 381 61728 2932.4
## <none> 61347 2935.0
## - runtime 1 692 62038 2935.5
## - critics_score 1 750 62097 2936.1
## - imdb_rating 1 54632 115979 3322.8
##
## Step: AIC=2929.53
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 125 61563 2924.4
## - summer_season 1 157 61595 2924.7
## - feature_film 1 185 61622 2925.0
## - drama 1 238 61676 2925.5
## - imdb_num_votes 1 266 61704 2925.8
## - mpaa_rating_R 1 318 61756 2926.3
## - best_actress_win 1 324 61762 2926.4
## - best_pic_nom 1 327 61764 2926.4
## - thtr_rel_year 1 352 61790 2926.6
## <none> 61438 2929.5
## - critics_score 1 719 62157 2930.3
## - runtime 1 789 62227 2931.0
## - imdb_rating 1 54671 116109 3317.1
##
## Step: AIC=2924.36
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 176 61739 2919.7
## - feature_film 1 212 61775 2920.1
## - drama 1 238 61801 2920.3
## - imdb_num_votes 1 282 61845 2920.8
## - best_pic_nom 1 294 61857 2920.9
## - mpaa_rating_R 1 310 61873 2921.0
## - thtr_rel_year 1 349 61912 2921.4
## - best_actress_win 1 354 61917 2921.5
## <none> 61563 2924.4
## - critics_score 1 712 62275 2925.1
## - runtime 1 964 62527 2927.5
## - imdb_rating 1 54679 116242 3311.4
##
## Step: AIC=2919.7
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 188 61927 2915.2
## - drama 1 214 61953 2915.4
## - best_pic_nom 1 262 62002 2915.9
## - imdb_num_votes 1 300 62039 2916.3
## - mpaa_rating_R 1 319 62058 2916.5
## - thtr_rel_year 1 342 62082 2916.7
## - best_actress_win 1 353 62092 2916.8
## <none> 61739 2919.7
## - critics_score 1 813 62553 2921.4
## - runtime 1 1004 62743 2923.3
## - imdb_rating 1 54557 116296 3305.2
##
## Step: AIC=2915.16
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 124 62052 2910.0
## - imdb_num_votes 1 196 62124 2910.7
## - best_pic_nom 1 248 62176 2911.2
## - thtr_rel_year 1 251 62178 2911.2
## - best_actress_win 1 364 62291 2912.4
## - mpaa_rating_R 1 411 62339 2912.8
## <none> 61927 2915.2
## - critics_score 1 982 62909 2918.5
## - runtime 1 1005 62932 2918.7
## - imdb_rating 1 58584 120511 3320.8
##
## Step: AIC=2909.97
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 167 62219 2905.2
## - thtr_rel_year 1 244 62296 2906.0
## - best_pic_nom 1 266 62317 2906.2
## - best_actress_win 1 330 62381 2906.8
## - mpaa_rating_R 1 351 62403 2907.0
## <none> 62052 2910.0
## - runtime 1 910 62961 2912.5
## - critics_score 1 1021 63073 2913.6
## - imdb_rating 1 58890 120941 3316.6
##
## Step: AIC=2905.21
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - thtr_rel_year 1 169 62388 2900.5
## - best_actress_win 1 320 62539 2902.0
## - mpaa_rating_R 1 330 62549 2902.1
## - best_pic_nom 1 395 62614 2902.7
## <none> 62219 2905.2
## - runtime 1 777 62996 2906.5
## - critics_score 1 979 63198 2908.4
## - imdb_rating 1 63047 125266 3331.9
##
## Step: AIC=2900.46
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actress_win 1 313 62701 2897.1
## - mpaa_rating_R 1 333 62720 2897.3
## - best_pic_nom 1 400 62787 2898.0
## <none> 62388 2900.5
## - runtime 1 713 63101 2901.1
## - critics_score 1 1066 63454 2904.5
## - imdb_rating 1 62879 125267 3325.5
##
## Step: AIC=2897.13
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom
##
## Df Sum of Sq RSS AIC
## - best_pic_nom 1 303 63004 2893.7
## - mpaa_rating_R 1 323 63024 2893.9
## <none> 62701 2897.1
## - runtime 1 884 63585 2899.4
## - critics_score 1 1061 63762 2901.1
## - imdb_rating 1 62968 125669 3321.1
##
## Step: AIC=2893.69
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## - mpaa_rating_R 1 332 63336 2890.5
## <none> 63004 2893.7
## - runtime 1 704 63708 2894.1
## - critics_score 1 1135 64139 2898.3
## - imdb_rating 1 63737 126741 3319.9
##
## Step: AIC=2890.51
## audience_score ~ runtime + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## <none> 63336 2890.5
## - runtime 1 690 64025 2890.8
## - critics_score 1 1177 64513 2895.5
## - imdb_rating 1 63413 126749 3313.5
The final model will use the following variables:
audience_score ~ runtime + imdb_rating + critics_score
## (Intercept) runtime imdb_rating critics_score
## -32.90824894 -0.05790520 14.95043323 0.07530741
## [1] 10.14815
Taking a look at the residuals:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that the residuals are normally distributed.
as_full.bas <- bas.lm(audience_score ~ .,
prior ="BIC",
modelprior = uniform(),
data = na.omit(movies))
as_full.bas##
## Call:
## bas.lm(formula = audience_score ~ ., data = na.omit(movies),
## prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes runtime
## 1.00000 0.05876 0.04509 0.51400
## mpaa_rating_Ryes thtr_rel_year oscar_seasonyes summer_seasonyes
## 0.16498 0.08089 0.06526 0.07935
## imdb_rating imdb_num_votes critics_score best_pic_nomyes
## 1.00000 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes best_dir_winyes
## 0.04077 0.11565 0.14770 0.06701
## top200_boxyes
## 0.04876
According to this model, there is a 100% chance that imdb_rating will be included in the final model. Other noteworthy variables are runtime (~47%), critics_score (~89%). The variable with the nearest score to these is mpaa_rating_R:yes at ~20%.
## 2.5% 97.5% beta
## Intercept 6.143478e+01 6.301352e+01 6.221002e+01
## feature_filmyes -6.649503e-01 1.249856e-01 -9.162435e-02
## dramayes 0.000000e+00 0.000000e+00 1.956484e-02
## runtime -8.989580e-02 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes -1.964520e+00 0.000000e+00 -2.409288e-01
## thtr_rel_year -4.634386e-02 0.000000e+00 -3.878883e-03
## oscar_seasonyes -7.105776e-01 1.738174e-02 -6.293171e-02
## summer_seasonyes -2.130373e-04 1.097149e+00 8.660919e-02
## imdb_rating 1.351437e+01 1.645399e+01 1.490680e+01
## imdb_num_votes -6.167253e-08 2.423759e-06 2.466254e-07
## critics_score 0.000000e+00 1.111559e-01 6.951964e-02
## best_pic_nomyes 0.000000e+00 4.846975e+00 5.130600e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -6.383140e-03
## best_actor_winyes -2.259661e+00 1.732811e-02 -2.123363e-01
## best_actress_winyes -3.099880e+00 0.000000e+00 -3.323225e-01
## best_dir_winyes -1.553895e+00 4.523009e-02 -1.188402e-01
## top200_boxyes 0.000000e+00 0.000000e+00 8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.00000000 1.0000 1.0000000 1.0000000 1.0000000
## feature_filmyes 0.05876309 0.0000 0.0000000 0.0000000 0.0000000
## dramayes 0.04508592 0.0000 0.0000000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000 1.0000000
## mpaa_rating_Ryes 0.16498090 0.0000 0.0000000 0.0000000 1.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000 0.0000000
## oscar_seasonyes 0.06525993 0.0000 0.0000000 0.0000000 0.0000000
## summer_seasonyes 0.07935408 0.0000 0.0000000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000 0.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238 0.2039916
## PostProbs NA 0.1558 0.1358000 0.0319000 0.0318000
## R2 NA 0.7483 0.7455000 0.7470000 0.7496000
## dim NA 4.0000 3.0000000 4.0000000 5.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603 -3436.3416317
## model 5
## Intercept 1.0000000
## feature_filmyes 0.0000000
## dramayes 0.0000000
## runtime 1.0000000
## mpaa_rating_Ryes 0.0000000
## thtr_rel_year 0.0000000
## oscar_seasonyes 0.0000000
## summer_seasonyes 0.0000000
## imdb_rating 1.0000000
## imdb_num_votes 0.0000000
## critics_score 1.0000000
## best_pic_nomyes 1.0000000
## best_pic_winyes 0.0000000
## best_actor_winyes 0.0000000
## best_actress_winyes 0.0000000
## best_dir_winyes 0.0000000
## top200_boxyes 0.0000000
## BF 0.1851908
## PostProbs 0.0289000
## R2 0.7495000
## dim 5.0000000
## logmarg -3436.4383237
The best model chosen contains the variables runtime, imdb_rating, and critics_score. Notice that this is the same model created by the backwards stepwise BIC method above.
Below, we can visualize the goodness of each of the models analyzed using the bas.lm function. The best model (rank 1) shows on the left, with the colored squares representing variables that would be selected for that particular model.
We see a normal distribution here.
Now let’s plot the residuals against the fitted values.
We see some left-skewness here, but the data is generally scattered around 0.
Now let’s plot the absolute value of the residuals against the fitted values.
We do not see a fan shape, meeting the necessary condition.
The movie I’ve chosen is Captain America: Civil War. The information for the prediction comes from:
IMDB and Rotten Tomatoes.
I’ll create the data frames containing information on Captain America: Civil War information.
captain_america <- data.frame(imdb_rating = 8.1, runtime = 147, critics_score = 91, mpaa_rating_R="no", thtr_rel_year=2016, best_pic_nom="no",best_actor_win="no", best_actress_win="no")I will run predictions using both the BIC and AIC models, to contrast them. Note that the set of variables the BIC model uses is a subset of the variables the AIC model uses.
## fit lwr upr
## 1 86.53117 66.50016 106.5622
The BIC model predicts a score of 86.53.
## fit lwr upr
## 1 86.5377 66.51882 106.5566
The AIC model predicts a score of 86.54.
As the true score was 89, both the BIC and AIC models were very accurate, with the BIC model being marginally more precise than the AIC model.
The model created using the stepAIC tuned toward BIC was the same model found to be ideal by bas.lm.