library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(MASS)
library(GGally)
library(gridExtra)
load("movies.Rdata")
This project aims to identify the attributes that make a movie popular. We will also look for other attributes of interest using Exploratory Data Analysis (EDA), and then use Bayesian statistics for modeling and prediction.
summary(movies)
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 37 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :133 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.00 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.00 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.74 Mean :14.42 Mean :2004 Mean : 6.333
## 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.00 Max. :31.00 Max. :2015 Max. :12.000
## NA's :8 NA's :8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 180 Certified Fresh:135
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546 Fresh :209
## Median :15.00 Median :6.600 Median : 15116 Rotten :307
## Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :31.00 Max. :9.000 Max. :893008
## NA's :8
## critics_score audience_rating audience_score best_pic_nom best_pic_win
## Min. : 1.00 Spilled:275 Min. :11.00 no :629 no :644
## 1st Qu.: 33.00 Upright:376 1st Qu.:46.00 yes: 22 yes: 7
## Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box director
## no :558 no :579 no :608 no :636 Length:651
## yes: 93 yes: 72 yes: 43 yes: 15 Class :character
## Mode :character
##
##
##
##
## actor1 actor2 actor3 actor4
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor5 imdb_url rt_url
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
glimpse(movies)
## Rows: 651
## Columns: 32
## $ title <chr> "Filly Brown", "The Dish", "Waiting for Guffman", ...
## $ title_type <fct> Feature Film, Feature Film, Feature Film, Feature ...
## $ genre <fct> Drama, Drama, Comedy, Drama, Horror, Documentary, ...
## $ runtime <dbl> 80, 101, 84, 139, 90, 78, 142, 93, 88, 119, 127, 1...
## $ mpaa_rating <fct> R, PG-13, R, PG, R, Unrated, PG-13, R, Unrated, Un...
## $ studio <fct> Indomina Media Inc., Warner Bros. Pictures, Sony P...
## $ thtr_rel_year <dbl> 2013, 2001, 1996, 1993, 2004, 2009, 1986, 1996, 20...
## $ thtr_rel_month <dbl> 4, 3, 8, 10, 9, 1, 1, 11, 9, 3, 6, 12, 1, 9, 6, 8,...
## $ thtr_rel_day <dbl> 19, 14, 21, 1, 10, 15, 1, 8, 7, 2, 19, 18, 4, 23, ...
## $ dvd_rel_year <dbl> 2013, 2001, 2001, 2001, 2005, 2010, 2003, 2004, 20...
## $ dvd_rel_month <dbl> 7, 8, 8, 11, 4, 4, 2, 3, 1, 8, 5, 9, 7, 2, 3, 12, ...
## $ dvd_rel_day <dbl> 30, 28, 21, 6, 19, 20, 18, 2, 21, 14, 1, 23, 9, 13...
## $ imdb_rating <dbl> 5.5, 7.3, 7.6, 7.2, 5.1, 7.8, 7.2, 5.5, 7.5, 6.6, ...
## $ imdb_num_votes <int> 899, 12285, 22381, 35096, 2386, 333, 5016, 2272, 8...
## $ critics_rating <fct> Rotten, Certified Fresh, Certified Fresh, Certifie...
## $ critics_score <dbl> 45, 96, 91, 80, 33, 91, 57, 17, 90, 83, 89, 67, 80...
## $ audience_rating <fct> Upright, Upright, Upright, Upright, Spilled, Uprig...
## $ audience_score <dbl> 73, 81, 91, 76, 27, 86, 76, 47, 89, 66, 75, 46, 89...
## $ best_pic_nom <fct> no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ best_pic_win <fct> no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ best_actor_win <fct> no, no, no, yes, no, no, no, yes, no, no, yes, no,...
## $ best_actress_win <fct> no, no, no, no, no, no, no, no, no, no, no, no, ye...
## $ best_dir_win <fct> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ top200_box <fct> no, no, no, no, no, no, no, no, no, no, yes, no, n...
## $ director <chr> "Michael D. Olmos", "Rob Sitch", "Christopher Gues...
## $ actor1 <chr> "Gina Rodriguez", "Sam Neill", "Christopher Guest"...
## $ actor2 <chr> "Jenni Rivera", "Kevin Harrington", "Catherine O'H...
## $ actor3 <chr> "Lou Diamond Phillips", "Patrick Warburton", "Park...
## $ actor4 <chr> "Emilio Rivera", "Tom Long", "Eugene Levy", "Richa...
## $ actor5 <chr> "Joseph Julian Soria", "Genevieve Mooy", "Bob Bala...
## $ imdb_url <chr> "http://www.imdb.com/title/tt1869425/", "http://ww...
## $ rt_url <chr> "//www.rottentomatoes.com/m/filly_brown_2012/", "/...
The dataset consists of 651 randomly selected movies that were produced and released before 2016, with information taken from Rotten Tomatoes and IMDB. Because this is an observational study based on random sampling, with no random assignment, we can only establish association, not causation. Since the movies were sampled at random and the sample is reasonably large, the results can be generalized to the wider population of movies covered by these platforms. Since the data comes from English-language platforms that cater largely to English speakers, it is reasonable to expect a bias in favor of English-language movies over foreign-language films such as Bollywood, Chinese, or Korean productions.
We are going to create a few new variables to assist in our EDA. Below is their description:
feature_film: “yes” if title_type is Feature Film, “no” otherwise.
drama: “yes” if genre is Drama, “no” otherwise.
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise.
oscar_season: “yes” if movie is released in October, November, or December (based on thtr_rel_month), “no” otherwise.
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.
movies <- movies %>%
mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"),
drama = ifelse(genre == "Drama", "yes", "no"),
mpaa_rating_R = ifelse(mpaa_rating == "R","yes","no"),
oscar_season = ifelse(thtr_rel_month == 11 | thtr_rel_month == 10 | thtr_rel_month == 12, "yes", "no"),
summer_season = ifelse(thtr_rel_month == 5 | thtr_rel_month == 6 | thtr_rel_month == 7 | thtr_rel_month == 8, "yes","no"))
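The same kind of flag can be written a little more compactly with `%in%`, and stored as a factor so that `summary()` reports counts instead of `Length/Class/Mode`. The sketch below is only an illustration: `toy` is a made-up data frame standing in for a few rows of `movies`, and base R is used so that it runs without extra packages.

```r
# Hypothetical toy data standing in for a few rows of `movies`
toy <- data.frame(
  title_type     = c("Feature Film", "Documentary", "Feature Film", "TV Movie"),
  thtr_rel_month = c(11, 6, 3, 12)
)

# %in% tests membership in a set of months in one step
toy$feature_film  <- factor(ifelse(toy$title_type == "Feature Film", "yes", "no"))
toy$oscar_season  <- factor(ifelse(toy$thtr_rel_month %in% 10:12, "yes", "no"))
toy$summer_season <- factor(ifelse(toy$thtr_rel_month %in% 5:8, "yes", "no"))

# Factors are summarised as counts per level
summary(toy[c("feature_film", "oscar_season", "summer_season")])
```

Whether to keep the derived variables as character or factor is a design choice; the models fitted later treat both the same way.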
We’ll then create a new data frame, “movies2”, that contains only the subset of variables used in the rest of the analysis.
movies2_features <- c("audience_score", "feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
movies2 <- movies[movies2_features]
We will begin our EDA by looking at the summary of the newly created data frame “movies2”
summary(movies2)
## audience_score feature_film drama runtime
## Min. :11.00 Length:651 Length:651 Min. : 39.0
## 1st Qu.:46.00 Class :character Class :character 1st Qu.: 92.0
## Median :65.00 Mode :character Mode :character Median :103.0
## Mean :62.36 Mean :105.8
## 3rd Qu.:80.00 3rd Qu.:115.8
## Max. :97.00 Max. :267.0
## NA's :1
## mpaa_rating_R thtr_rel_year oscar_season summer_season
## Length:651 Min. :1970 Length:651 Length:651
## Class :character 1st Qu.:1990 Class :character Class :character
## Mode :character Median :2000 Mode :character Mode :character
## Mean :1998
## 3rd Qu.:2007
## Max. :2014
##
## imdb_rating imdb_num_votes critics_score best_pic_nom best_pic_win
## Min. :1.900 Min. : 180 Min. : 1.00 no :629 no :644
## 1st Qu.:5.900 1st Qu.: 4546 1st Qu.: 33.00 yes: 22 yes: 7
## Median :6.600 Median : 15116 Median : 61.00
## Mean :6.493 Mean : 57533 Mean : 57.69
## 3rd Qu.:7.300 3rd Qu.: 58301 3rd Qu.: 83.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## best_actor_win best_actress_win best_dir_win top200_box
## no :558 no :579 no :608 no :636
## yes: 93 yes: 72 yes: 43 yes: 15
##
##
##
##
##
This shows the range and spread of each variable in the dataset.
Now let’s look at the structure of the dataframe.
str(movies2)
## tibble [651 x 17] (S3: tbl_df/tbl/data.frame)
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ feature_film : chr [1:651] "yes" "yes" "yes" "yes" ...
## $ drama : chr [1:651] "yes" "yes" "no" "yes" ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating_R : chr [1:651] "yes" "no" "yes" "no" ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ oscar_season : chr [1:651] "no" "no" "no" "yes" ...
## $ summer_season : chr [1:651] "no" "no" "yes" "no" ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Let’s create a boxplot to understand how the newly created variables interact with “audience_score”
plot1 <- ggplot(movies2, aes(x=mpaa_rating_R,y=audience_score))+
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
plot2 <- ggplot(movies2, aes(x=oscar_season, y=audience_score))+
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
plot3 <- ggplot(movies2, aes(x=summer_season,y=audience_score))+
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
plot4 <- ggplot(movies2, aes(x=feature_film, y=audience_score))+
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
plot5 <- ggplot(movies2, aes(x=drama, y=audience_score))+
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
grid.arrange(plot1,plot2,plot3,plot4,plot5, ncol=3)
Let’s explore the correlation between the audience score and the newly created variables using more visualization charts.
suppressWarnings(suppressMessages(print(ggpairs(movies2, columns = 1:8))))
suppressWarnings(suppressMessages(print(ggpairs(movies2, columns = c(1,9:17)))))
From the charts above, we can infer that audience_score is strongly correlated with critics_score.
Let’s further explore its correlation using a scatterplot fitted with a regression line.
cor(movies2$audience_score, movies2$critics_score)
## [1] 0.7042762
ggplot(data=movies2, aes(x = audience_score, y = critics_score)) +
geom_jitter(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, colour = "red")
## `geom_smooth()` using formula 'y ~ x'
Let’s examine the relation between imdb_rating and audience_score similarly.
cor(movies2$audience_score, movies2$imdb_rating)
## [1] 0.8648652
ggplot(data=movies2, aes(x = audience_score, y = imdb_rating)) +
geom_jitter(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, colour = "red")
## `geom_smooth()` using formula 'y ~ x'
From the charts above, we can see that audience_score is strongly correlated with both critics_score and imdb_rating.
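As a reminder of what `cor()` reports, Pearson's correlation is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on the first six audience_score and imdb_rating values shown in the `glimpse()` output; six rows are far too few to say anything about the full data, so this is only a check of the formula.

```r
# First six values of audience_score and imdb_rating from the glimpse() output
audience <- c(73, 81, 91, 76, 27, 86)
imdb     <- c(5.5, 7.3, 7.6, 7.2, 5.1, 7.8)

# Pearson's r: covariance scaled by both standard deviations
r_manual  <- cov(audience, imdb) / (sd(audience) * sd(imdb))
r_builtin <- cor(audience, imdb)

# Both routes should agree up to floating-point error
all.equal(r_manual, r_builtin)
```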
We will start by fitting a linear model of the response variable, audience_score, on all of the predictors.
For model selection, we will use the stepAIC function from the MASS library in the backward direction, dropping one predictor at a time until the AIC can no longer be lowered.
as_model <- lm(audience_score ~ ., data= na.omit(movies2))
as_model
##
## Call:
## lm(formula = audience_score ~ ., data = na.omit(movies2))
##
## Coefficients:
## (Intercept) feature_filmyes dramayes
## 1.244e+02 -2.248e+00 1.292e+00
## runtime mpaa_rating_Ryes thtr_rel_year
## -5.614e-02 -1.444e+00 -7.657e-02
## oscar_seasonyes summer_seasonyes imdb_rating
## -5.333e-01 9.106e-01 1.472e+01
## imdb_num_votes critics_score best_pic_nomyes
## 7.234e-06 5.748e-02 5.321e+00
## best_pic_winyes best_actor_winyes best_actress_winyes
## -3.212e+00 -1.544e+00 -2.198e+00
## best_dir_winyes top200_boxyes
## -1.231e+00 8.478e-01
stepAIC.model <- stepAIC(as_model, direction = "backward", trace = TRUE)
## Start: AIC=3006.94
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 9 62999 3005.0
## - oscar_season 1 28 63018 3005.2
## - best_pic_win 1 48 63038 3005.4
## - best_dir_win 1 51 63040 3005.5
## - summer_season 1 92 63081 3005.9
## - best_actor_win 1 171 63160 3006.7
## - feature_film 1 177 63166 3006.8
## <none> 62990 3006.9
## - drama 1 216 63206 3007.2
## - imdb_num_votes 1 255 63244 3007.6
## - best_actress_win 1 283 63273 3007.9
## - mpaa_rating_R 1 314 63304 3008.2
## - thtr_rel_year 1 397 63386 3009.0
## - best_pic_nom 1 408 63398 3009.1
## - runtime 1 538 63527 3010.5
## - critics_score 1 669 63659 3011.8
## - imdb_rating 1 58556 121545 3432.2
##
## Step: AIC=3005.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 26 63025 3003.3
## - best_pic_win 1 49 63047 3003.5
## - best_dir_win 1 52 63051 3003.6
## - summer_season 1 94 63093 3004.0
## - best_actor_win 1 169 63168 3004.8
## - feature_film 1 176 63175 3004.8
## <none> 62999 3005.0
## - drama 1 214 63213 3005.2
## - best_actress_win 1 279 63278 3005.9
## - imdb_num_votes 1 302 63301 3006.1
## - mpaa_rating_R 1 330 63329 3006.4
## - best_pic_nom 1 404 63403 3007.2
## - thtr_rel_year 1 415 63414 3007.3
## - runtime 1 535 63534 3008.5
## - critics_score 1 681 63680 3010.0
## - imdb_rating 1 58606 121604 3430.5
##
## Step: AIC=3003.31
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 46 63071 3001.8
## - best_dir_win 1 56 63081 3001.9
## - best_actor_win 1 174 63200 3003.1
## - summer_season 1 177 63202 3003.1
## - feature_film 1 182 63207 3003.2
## <none> 63025 3003.3
## - drama 1 222 63247 3003.6
## - best_actress_win 1 281 63307 3004.2
## - imdb_num_votes 1 302 63328 3004.4
## - mpaa_rating_R 1 329 63354 3004.7
## - best_pic_nom 1 387 63412 3005.3
## - thtr_rel_year 1 410 63436 3005.5
## - runtime 1 587 63613 3007.3
## - critics_score 1 679 63704 3008.3
## - imdb_rating 1 58603 121628 3428.6
##
## Step: AIC=3001.78
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 94 63165 3000.7
## - best_actor_win 1 163 63234 3001.5
## - feature_film 1 171 63242 3001.5
## - summer_season 1 174 63245 3001.6
## <none> 63071 3001.8
## - drama 1 220 63291 3002.0
## - imdb_num_votes 1 271 63342 3002.6
## - best_actress_win 1 294 63365 3002.8
## - mpaa_rating_R 1 330 63401 3003.2
## - best_pic_nom 1 342 63414 3003.3
## - thtr_rel_year 1 397 63468 3003.9
## - runtime 1 586 63657 3005.8
## - critics_score 1 680 63751 3006.8
## - imdb_rating 1 58858 121929 3428.2
##
## Step: AIC=3000.75
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 167 63332 3000.5
## - best_actor_win 1 171 63336 3000.5
## - feature_film 1 183 63348 3000.6
## <none> 63165 3000.7
## - drama 1 228 63394 3001.1
## - imdb_num_votes 1 247 63412 3001.3
## - best_actress_win 1 299 63464 3001.8
## - best_pic_nom 1 326 63491 3002.1
## - mpaa_rating_R 1 345 63510 3002.3
## - thtr_rel_year 1 368 63533 3002.5
## - critics_score 1 651 63816 3005.4
## - runtime 1 673 63839 3005.6
## - imdb_rating 1 58895 122061 3426.9
##
## Step: AIC=3000.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 156 63488 3000.1
## <none> 63332 3000.5
## - best_actor_win 1 195 63527 3000.5
## - drama 1 204 63536 3000.6
## - imdb_num_votes 1 260 63592 3001.1
## - best_pic_nom 1 297 63629 3001.5
## - best_actress_win 1 297 63629 3001.5
## - mpaa_rating_R 1 356 63688 3002.1
## - thtr_rel_year 1 361 63693 3002.2
## - runtime 1 690 64022 3005.5
## - critics_score 1 732 64064 3005.9
## - imdb_rating 1 58763 122095 3425.1
##
## Step: AIC=3000.06
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 121 63609 2999.3
## - imdb_num_votes 1 173 63661 2999.8
## <none> 63488 3000.1
## - best_actor_win 1 219 63706 3000.3
## - thtr_rel_year 1 277 63765 3000.9
## - best_pic_nom 1 291 63778 3001.0
## - best_actress_win 1 306 63794 3001.2
## - mpaa_rating_R 1 453 63941 3002.7
## - runtime 1 715 64203 3005.3
## - critics_score 1 875 64363 3007.0
## - imdb_rating 1 63189 126677 3447.1
##
## Step: AIC=2999.3
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actor_win +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 148 63757 2998.8
## <none> 63609 2999.3
## - best_actor_win 1 209 63818 2999.4
## - thtr_rel_year 1 272 63881 3000.1
## - best_actress_win 1 274 63883 3000.1
## - best_pic_nom 1 307 63916 3000.4
## - mpaa_rating_R 1 391 64000 3001.3
## - runtime 1 631 64240 3003.7
## - critics_score 1 916 64525 3006.6
## - imdb_rating 1 63434 127043 3447.0
##
## Step: AIC=2998.81
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## <none> 63757 2998.8
## - thtr_rel_year 1 201 63958 2998.9
## - best_actor_win 1 219 63976 2999.0
## - best_actress_win 1 266 64023 2999.5
## - mpaa_rating_R 1 367 64124 3000.5
## - best_pic_nom 1 442 64199 3001.3
## - runtime 1 519 64276 3002.1
## - critics_score 1 879 64635 3005.7
## - imdb_rating 1 67356 131113 3465.4
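The AIC values in the trace above come from `extractAIC()`, which for a linear model computes n * log(RSS / n) + k * edf, where edf is the number of estimated coefficients and k = 2 by default. The sketch below checks this on the built-in `mtcars` data (not the movies data) and then reproduces the starting value of the trace from the printed RSS. The value n = 650 is an assumption, based on 651 rows minus the one row with a missing runtime.

```r
# Built-in mtcars data, used only to illustrate the formula
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)   # number of estimated coefficients

aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit, k = 2)[2]
all.equal(aic_manual, aic_step)

# Same arithmetic with the numbers printed at the start of the trace:
# RSS = 62990, 17 coefficients, and an assumed n of 650
aic_start <- 650 * log(62990 / 650) + 2 * 17
aic_start
```

Note that this differs from `AIC()` by an additive constant, so the two functions rank models identically but print different numbers.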
The final model built using AIC consists of the following variables:
runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win
AIC.lm_model <- lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data=movies2)
Let’s take a look at the coefficients of this model:
AIC.lm_model$coefficients
## (Intercept) runtime mpaa_rating_Ryes thtr_rel_year
## 70.10675281 -0.05115515 -1.50528039 -0.05122557
## imdb_rating critics_score best_pic_nomyes best_actor_winyes
## 15.00149242 0.06409989 4.88277038 -1.73481942
## best_actress_winyes
## -2.11568281
Let’s take a look at the residual standard error (sigma) of this model:
summary(AIC.lm_model)$sigma
## [1] 9.973201
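The `sigma` value is the residual standard error, sqrt(RSS / (n - p)), where p counts all estimated coefficients including the intercept. As a rough check, the sketch below plugs in the RSS of 63757 from the last stepAIC step and p = 9; n = 650 is again an assumption (651 rows minus the one row with a missing runtime). The second half confirms the formula on the built-in `mtcars` data.

```r
# Values read from the final stepAIC step; n = 650 is an assumption
rss <- 63757
n   <- 650
p   <- 9   # intercept plus 8 predictors
sigma_from_output <- sqrt(rss / (n - p))
sigma_from_output

# Same formula checked against summary()$sigma on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
sigma_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
all.equal(sigma_manual, summary(fit)$sigma)
```

Because the printed RSS is rounded, the first result only agrees with the reported sigma to a few decimal places.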
Let’s plot the residuals of this model:
ggplot(data.frame(resid = residuals(AIC.lm_model)), aes(x = resid)) + geom_histogram(bins = 30)
The histogram suggests that the residuals are approximately normally distributed and centered around zero.
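A histogram can hide departures from normality in the tails, so a normal Q-Q plot and a Shapiro-Wilk test are common complementary checks. The sketch below uses simulated data (not the movies data) with a hypothetical model `fit`; the same calls could be applied to the residuals of the model above.

```r
set.seed(1)

# Simulated data with normally distributed errors, for illustration only
x <- rnorm(200)
y <- 2 + 3 * x + rnorm(200)
fit <- lm(y ~ x)
res <- residuals(fit)

# Points close to the reference line indicate approximate normality
qqnorm(res)
qqline(res, col = "red")

# Shapiro-Wilk test: a small p-value would indicate a departure from normality
sw <- shapiro.test(res)
sw$p.value
```

With several hundred observations the test can flag deviations that are too small to matter in practice, so the plot is usually the more informative of the two.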
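Next, we will use the stepAIC function from the MASS library in the backward direction with the penalty k = log(n), so that the criterion being minimized is the BIC. Note that stepAIC still labels the column “AIC” in its output even though the values are BIC.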
stepBIC.model <- stepAIC(as_model, direction = "backward", k=log(nrow(movies2)), trace = TRUE)
## Start: AIC=3083.07
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 9 62999 3076.7
## - oscar_season 1 28 63018 3076.9
## - best_pic_win 1 48 63038 3077.1
## - best_dir_win 1 51 63040 3077.1
## - summer_season 1 92 63081 3077.5
## - best_actor_win 1 171 63160 3078.4
## - feature_film 1 177 63166 3078.4
## - drama 1 216 63206 3078.8
## - imdb_num_votes 1 255 63244 3079.2
## - best_actress_win 1 283 63273 3079.5
## - mpaa_rating_R 1 314 63304 3079.8
## - thtr_rel_year 1 397 63386 3080.7
## - best_pic_nom 1 408 63398 3080.8
## - runtime 1 538 63527 3082.1
## <none> 62990 3083.1
## - critics_score 1 669 63659 3083.5
## - imdb_rating 1 58556 121545 3503.9
##
## Step: AIC=3076.69
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 26 63025 3070.5
## - best_pic_win 1 49 63047 3070.7
## - best_dir_win 1 52 63051 3070.8
## - summer_season 1 94 63093 3071.2
## - best_actor_win 1 169 63168 3072.0
## - feature_film 1 176 63175 3072.0
## - drama 1 214 63213 3072.4
## - best_actress_win 1 279 63278 3073.1
## - imdb_num_votes 1 302 63301 3073.3
## - mpaa_rating_R 1 330 63329 3073.6
## - best_pic_nom 1 404 63403 3074.4
## - thtr_rel_year 1 415 63414 3074.5
## - runtime 1 535 63534 3075.7
## <none> 62999 3076.7
## - critics_score 1 681 63680 3077.2
## - imdb_rating 1 58606 121604 3497.7
##
## Step: AIC=3070.49
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 46 63071 3064.5
## - best_dir_win 1 56 63081 3064.6
## - best_actor_win 1 174 63200 3065.8
## - summer_season 1 177 63202 3065.8
## - feature_film 1 182 63207 3065.9
## - drama 1 222 63247 3066.3
## - best_actress_win 1 281 63307 3066.9
## - imdb_num_votes 1 302 63328 3067.1
## - mpaa_rating_R 1 329 63354 3067.4
## - best_pic_nom 1 387 63412 3068.0
## - thtr_rel_year 1 410 63436 3068.2
## - runtime 1 587 63613 3070.0
## <none> 63025 3070.5
## - critics_score 1 679 63704 3071.0
## - imdb_rating 1 58603 121628 3491.3
##
## Step: AIC=3064.48
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 94 63165 3059.0
## - best_actor_win 1 163 63234 3059.7
## - feature_film 1 171 63242 3059.8
## - summer_season 1 174 63245 3059.8
## - drama 1 220 63291 3060.3
## - imdb_num_votes 1 271 63342 3060.8
## - best_actress_win 1 294 63365 3061.0
## - mpaa_rating_R 1 330 63401 3061.4
## - best_pic_nom 1 342 63414 3061.5
## - thtr_rel_year 1 397 63468 3062.1
## - runtime 1 586 63657 3064.0
## <none> 63071 3064.5
## - critics_score 1 680 63751 3065.0
## - imdb_rating 1 58858 121929 3486.5
##
## Step: AIC=3058.97
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 167 63332 3054.2
## - best_actor_win 1 171 63336 3054.2
## - feature_film 1 183 63348 3054.4
## - drama 1 228 63394 3054.8
## - imdb_num_votes 1 247 63412 3055.0
## - best_actress_win 1 299 63464 3055.6
## - best_pic_nom 1 326 63491 3055.8
## - mpaa_rating_R 1 345 63510 3056.0
## - thtr_rel_year 1 368 63533 3056.3
## <none> 63165 3059.0
## - critics_score 1 651 63816 3059.2
## - runtime 1 673 63839 3059.4
## - imdb_rating 1 58895 122061 3480.7
##
## Step: AIC=3054.2
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 156 63488 3049.3
## - best_actor_win 1 195 63527 3049.7
## - drama 1 204 63536 3049.8
## - imdb_num_votes 1 260 63592 3050.4
## - best_pic_nom 1 297 63629 3050.8
## - best_actress_win 1 297 63629 3050.8
## - mpaa_rating_R 1 356 63688 3051.4
## - thtr_rel_year 1 361 63693 3051.4
## <none> 63332 3054.2
## - runtime 1 690 64022 3054.8
## - critics_score 1 732 64064 3055.2
## - imdb_rating 1 58763 122095 3474.4
##
## Step: AIC=3049.32
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 121 63609 3044.1
## - imdb_num_votes 1 173 63661 3044.6
## - best_actor_win 1 219 63706 3045.1
## - thtr_rel_year 1 277 63765 3045.7
## - best_pic_nom 1 291 63778 3045.8
## - best_actress_win 1 306 63794 3046.0
## - mpaa_rating_R 1 453 63941 3047.5
## <none> 63488 3049.3
## - runtime 1 715 64203 3050.1
## - critics_score 1 875 64363 3051.7
## - imdb_rating 1 63189 126677 3491.9
##
## Step: AIC=3044.09
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actor_win +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 148 63757 3039.1
## - best_actor_win 1 209 63818 3039.7
## - thtr_rel_year 1 272 63881 3040.4
## - best_actress_win 1 274 63883 3040.4
## - best_pic_nom 1 307 63916 3040.7
## - mpaa_rating_R 1 391 64000 3041.6
## - runtime 1 631 64240 3044.0
## <none> 63609 3044.1
## - critics_score 1 916 64525 3046.9
## - imdb_rating 1 63434 127043 3487.3
##
## Step: AIC=3039.12
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - thtr_rel_year 1 201 63958 3034.7
## - best_actor_win 1 219 63976 3034.9
## - best_actress_win 1 266 64023 3035.3
## - mpaa_rating_R 1 367 64124 3036.4
## - best_pic_nom 1 442 64199 3037.1
## - runtime 1 519 64276 3037.9
## <none> 63757 3039.1
## - critics_score 1 879 64635 3041.5
## - imdb_rating 1 67356 131113 3501.3
##
## Step: AIC=3034.68
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 207 64165 3030.3
## - best_actress_win 1 261 64219 3030.9
## - mpaa_rating_R 1 373 64331 3032.0
## - best_pic_nom 1 447 64405 3032.7
## - runtime 1 468 64425 3032.9
## <none> 63958 3034.7
## - critics_score 1 968 64926 3038.0
## - imdb_rating 1 67172 131129 3494.9
##
## Step: AIC=3030.3
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actress_win 1 296 64461 3026.8
## - mpaa_rating_R 1 366 64531 3027.5
## - best_pic_nom 1 396 64561 3027.8
## <none> 64165 3030.3
## - runtime 1 643 64808 3030.3
## - critics_score 1 968 65133 3033.6
## - imdb_rating 1 67296 131461 3490.0
##
## Step: AIC=3026.82
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom
##
## Df Sum of Sq RSS AIC
## - best_pic_nom 1 303 64765 3023.4
## - mpaa_rating_R 1 354 64815 3023.9
## <none> 64461 3026.8
## - runtime 1 814 65275 3028.5
## - critics_score 1 957 65418 3029.9
## - imdb_rating 1 67424 131885 3485.7
##
## Step: AIC=3023.39
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## - mpaa_rating_R 1 361 65126 3020.5
## - runtime 1 638 65403 3023.3
## <none> 64765 3023.4
## - critics_score 1 1027 65792 3027.1
## - imdb_rating 1 68173 132937 3484.3
##
## Step: AIC=3020.53
## audience_score ~ runtime + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## <none> 65126 3020.5
## - runtime 1 653 65779 3020.5
## - critics_score 1 1073 66199 3024.7
## - imdb_rating 1 67874 133000 3478.2
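Setting `k = log(n)` replaces the AIC penalty of 2 per coefficient with log(n) per coefficient, which is why this search removes more predictors. The sketch below, on the built-in `mtcars` data, checks that the two criteria differ by exactly (log(n) - 2) * edf. It also shows that `nobs()` and `nrow()` disagree when rows with missing values are dropped. That seems to be the situation in this analysis, where the model is fitted on `na.omit(movies2)` but the penalty uses `nrow(movies2)`; the effect on the printed values is very small.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
edf <- extractAIC(fit)[1]

aic_step <- extractAIC(fit, k = 2)[2]
bic_step <- extractAIC(fit, k = log(n))[2]

# The only difference between the two criteria is the per-coefficient penalty
all.equal(bic_step - aic_step, (log(n) - 2) * edf)

# With a missing value, lm() silently drops the row, so nobs() < nrow()
d <- mtcars
d$wt[1] <- NA
fit_na <- lm(mpg ~ wt + hp, data = d)
c(rows = nrow(d), used = nobs(fit_na))
```

Using `log(nobs(model))` for the penalty keeps the sample size consistent with the rows actually used in the fit.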
The final model will use the following variables:
audience_score ~ runtime + imdb_rating + critics_score
BIC.lm_model <- lm(audience_score ~ runtime + imdb_rating + critics_score, data=movies2)
BIC.lm_model
##
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score,
## data = movies2)
##
## Coefficients:
## (Intercept) runtime imdb_rating critics_score
## -33.28321 -0.05362 14.98076 0.07036
BIC.lm_model$coefficients
## (Intercept) runtime imdb_rating critics_score
## -33.28320569 -0.05361506 14.98076157 0.07035672
summary(BIC.lm_model)$sigma
## [1] 10.04062
Taking a look at the residuals:
ggplot(data=BIC.lm_model, aes(x=BIC.lm_model$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As with the AIC model, the histogram suggests that the residuals are approximately normally distributed.
model.bas <- bas.lm(audience_score ~ .,
prior ="BIC",
modelprior = uniform(),
data = na.omit(movies2))
model.bas
##
## Call:
## bas.lm(formula = audience_score ~ ., data = na.omit(movies2),
## prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06537 0.04320
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.46971 0.19984 0.09069
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.07506 0.08042 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.05774 0.88855 0.13119
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.03985 0.14435 0.14128
## best_dir_winyes top200_boxyes
## 0.06694 0.04762
According to this output, the marginal posterior inclusion probability of imdb_rating is essentially 1, meaning it appears in virtually every model with non-negligible posterior probability. Other noteworthy variables are critics_score (~89%) and runtime (~47%). The next highest inclusion probability belongs to mpaa_rating_Ryes at ~20%.
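A marginal posterior inclusion probability is the sum of the posterior probabilities of all models that contain a given predictor. The sketch below illustrates this with a made-up model space of three models. The inclusion matrix `incl` and the probabilities `post_prob` are hypothetical and are not taken from the BAS output.

```r
# Hypothetical model space: rows are models, columns are predictors (1 = included)
incl <- rbind(
  m1 = c(runtime = 1, imdb_rating = 1, critics_score = 1),
  m2 = c(runtime = 0, imdb_rating = 1, critics_score = 1),
  m3 = c(runtime = 0, imdb_rating = 1, critics_score = 0)
)

# Made-up posterior model probabilities, summing to 1
post_prob <- c(m1 = 0.5, m2 = 0.3, m3 = 0.2)

# Inclusion probability = total posterior mass of the models containing each predictor
pip <- drop(post_prob %*% incl)
pip
```

In this toy example imdb_rating appears in every model, so its inclusion probability is 1, which mirrors the pattern seen in the real output.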
confint(coef(model.bas))
## 2.5% 97.5% beta
## Intercept 6.155045e+01 6.311231e+01 6.234769e+01
## feature_filmyes -1.023984e+00 3.529045e-02 -1.046908e-01
## dramayes 0.000000e+00 0.000000e+00 1.604413e-02
## runtime -8.309256e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes -2.127469e+00 6.036765e-04 -3.036174e-01
## thtr_rel_year -4.768019e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes -9.562348e-01 0.000000e+00 -8.034940e-02
## summer_seasonyes 0.000000e+00 1.064320e+00 8.704545e-02
## imdb_rating 1.364619e+01 1.655342e+01 1.498203e+01
## imdb_num_votes -2.378714e-07 1.779947e-06 2.080713e-07
## critics_score 0.000000e+00 1.059350e-01 6.296648e-02
## best_pic_nomyes 0.000000e+00 5.055486e+00 5.068035e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes -2.629305e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.732913e+00 1.953780e-02 -3.088382e-01
## best_dir_winyes -1.548177e+00 0.000000e+00 -1.195011e-01
## top200_boxyes 0.000000e+00 0.000000e+00 8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(model.bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06536947 0.0000 0.0000000 0.0000000
## dramayes 0.04319833 0.0000 0.0000000 0.0000000
## runtime 0.46971477 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.19984016 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.09068970 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07505684 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08042023 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05773502 0.0000 0.0000000 0.0000000
## critics_score 0.88855056 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13119140 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03984766 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14434896 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.14128087 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06693898 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04762234 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9968489 0.2543185
## PostProbs NA 0.1297 0.1293000 0.0330000
## R2 NA 0.7549 0.7525000 0.7539000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3615.2791 -3615.2822108 -3616.6482224
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2521327 0.2391994
## PostProbs 0.0327000 0.0310000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg -3616.6568544 -3616.7095127
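The best model (model 1) contains the variables runtime, imdb_rating, and critics_score. Its posterior probability (0.1297) is only marginally higher than that of model 2 (0.1293), which drops runtime. Notice that model 1 is the same model created by the backwards stepwise BIC method above.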
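The BF row of the summary can be recovered from the logmarg row: each Bayes factor is the exponentiated difference between a model's log marginal likelihood and that of the best model. A small sketch using the values printed above (the vector names are just labels chosen here):

```r
# Log marginal likelihoods copied from the summary(model.bas) output
logmarg <- c(model1 = -3615.2791,
             model2 = -3615.2822108,
             model3 = -3616.6482224,
             model4 = -3616.6568544,
             model5 = -3616.7095127)

# Bayes factor of each model relative to the best one
bf <- exp(logmarg - max(logmarg))
round(bf, 4)
```

Small differences from the printed BF values are expected, because the logmarg of model 1 is shown with fewer decimal places.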
Below, we visualize the models considered by the bas.lm function, ranked by posterior probability. The best model (rank 1) is shown on the left, with the colored squares marking the variables included in each model.
image(model.bas, rotate = FALSE)
qqnorm(BIC.lm_model$residuals, col="aquamarine4")
qqline(BIC.lm_model$residuals)
The Q-Q plot suggests that the residuals are approximately normally distributed.
Let’s plot the residuals against the fitted values here.
plot(BIC.lm_model$residuals ~ BIC.lm_model$fitted, col="red")
abline(h=0, lty=2)
The plot suggests some left skewness, but the residuals are generally scattered around 0.
Let’s plot the absolute values of the residuals against the fitted values here.
plot(abs(BIC.lm_model$residuals) ~ BIC.lm_model$fitted, col="red")
abline(h=0, lty=2)
We don’t see a fan shape here; hence the constant-variance condition is met.
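A rough numerical counterpart to this plot is to regress the absolute residuals on the fitted values and look at the slope. The sketch below does this on simulated data with constant error variance (the names `sim`, `fit`, and `spread_fit` are hypothetical); it is an informal check, not a substitute for a dedicated heteroscedasticity test.

```r
# Simulated stand-in data with constant error variance by construction
set.seed(7)
sim <- data.frame(x = runif(300, 0, 10))
sim$y <- 5 + 2 * sim$x + rnorm(300, sd = 3)
fit <- lm(y ~ x, data = sim)

# Regress |residuals| on fitted values; a clearly non-zero slope
# would hint at a fan shape (non-constant variance)
spread_fit <- lm(abs(resid(fit)) ~ fitted(fit))
slope_p <- summary(spread_fit)$coefficients[2, "Pr(>|t|)"]
slope_p
```

In this analysis, `fit` would be replaced by BIC.lm_model.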
The movie I’ve chosen is Avengers: Endgame (2019). The information I will be using for the prediction comes from IMDb and Rotten Tomatoes.
Let’s create a data frame containing the information for Avengers: Endgame (2019).
# Note: thtr_rel_year is entered as 2016 here, although the film was released in 2019.
Endgame_df <- data.frame(imdb_rating = 8.4, runtime = 181, critics_score = 94,
                         mpaa_rating_R = "no", thtr_rel_year = 2016,
                         best_pic_nom = "no", best_actor_win = "no",
                         best_actress_win = "no")
Endgame_df
## imdb_rating runtime critics_score mpaa_rating_R thtr_rel_year best_pic_nom
## 1 8.4 181 94 no 2016 no
## best_actor_win best_actress_win
## 1 no no
We will now run predictions using both the BIC and AIC models, to contrast them. Note that the set of variables the BIC model uses is a subset of the variables the AIC model uses.
predict(BIC.lm_model, newdata = Endgame_df, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 89.4644 69.49893 109.4299
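The BIC model predicts an audience score of about 89.46, with a 95% prediction interval of roughly 69.5 to 109.4. Because audience scores are capped at 100, the upper bound is effectively 100.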
predict(AIC.lm_model, newdata = Endgame_df, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 89.61485 69.63368 109.596
The AIC model predicts an audience score of about 89.61, with a 95% prediction interval of roughly 69.6 to 109.6.
Taking the true audience score to be 88, the BIC model was marginally closer: its prediction is off by about 1.46 points, versus about 1.61 points for the AIC model.
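The comparison can be checked with a few lines of arithmetic on the predictions printed above. The true score of 88 is the value quoted in the text; the vector names are labels chosen for this sketch.

```r
true_score <- 88  # value quoted in the text

# Point predictions and upper prediction bounds copied from the output above
fit <- c(BIC = 89.4644, AIC = 89.61485)
upr <- c(BIC = 109.4299, AIC = 109.596)

# Absolute prediction error of each model
abs_err <- abs(fit - true_score)
abs_err

# Name of the model with the smaller error
names(which.min(abs_err))

# Audience scores cannot exceed 100, so cap the upper bounds
pmin(upr, 100)
```

Both errors are small relative to the width of the prediction intervals, so a single film gives little basis for preferring one model over the other.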
The model produced by stepAIC tuned toward BIC was the same model found to be ideal by bas.lm. In the end, the AIC and BIC models gave almost identical predictions. I believe that if the scope of this project were increased, non-normally distributed errors could become an issue. One method for dealing with this, which was not touched on in this project, is variable transformation, such as a log transformation.
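As a minimal sketch of what such a log transformation could look like, the example below simulates a right-skewed predictor and compares a fit on the raw scale with a fit on the log scale. The variables `votes` and `score` are hypothetical stand-ins, not columns of the movies data, and the relationship is linear in log(votes) by construction.

```r
# Simulated right-skewed predictor (hypothetical, loosely like a vote count)
set.seed(3)
n <- 300
votes <- rlnorm(n, meanlog = 10, sdlog = 1.5)

# Response that depends on the log of the predictor, by construction
score <- 1 + 2 * log(votes) + rnorm(n)

# Compare the fit on the raw scale with the fit on the log scale
r2_raw <- summary(lm(score ~ votes))$r.squared
r2_log <- summary(lm(score ~ log(votes)))$r.squared
c(raw = r2_raw, log = r2_log)
```

Whether a transformation actually helps on the real data would have to be judged by rerunning the residual diagnostics on the transformed model.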