R Markdown

Load required packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.4
library(statsr)
## Warning: package 'statsr' was built under R version 4.0.4
## Warning: package 'BayesFactor' was built under R version 4.0.4
## Warning: package 'coda' was built under R version 4.0.4
library(BAS)
## Warning: package 'BAS' was built under R version 4.0.4
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.4
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.4
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.4

Load Data

load("movies.Rdata")

Introduction

The aim of this project is to find the attributes that make a movie popular. We will also look for other attributes of interest using Exploratory Data Analysis (EDA), and then use Bayesian statistics for modelling and prediction.

Part 1: Data

summary(movies)
##     title                  title_type                 genre        runtime     
##  Length:651         Documentary : 55   Drama             :305   Min.   : 39.0  
##  Class :character   Feature Film:591   Comedy            : 87   1st Qu.: 92.0  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65   Median :103.0  
##                                        Mystery & Suspense: 59   Mean   :105.8  
##                                        Documentary       : 52   3rd Qu.:115.8  
##                                        Horror            : 23   Max.   :267.0  
##                                        (Other)           : 60   NA's   :1      
##   mpaa_rating                               studio    thtr_rel_year 
##  G      : 19   Paramount Pictures              : 37   Min.   :1970  
##  NC-17  :  2   Warner Bros. Pictures           : 30   1st Qu.:1990  
##  PG     :118   Sony Pictures Home Entertainment: 27   Median :2000  
##  PG-13  :133   Universal Pictures              : 23   Mean   :1998  
##  R      :329   Warner Home Video               : 19   3rd Qu.:2007  
##  Unrated: 50   (Other)                         :507   Max.   :2014  
##                NA's                            :  8                 
##  thtr_rel_month   thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.00   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.00   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.74   Mean   :14.42   Mean   :2004   Mean   : 6.333  
##  3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.00   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                  NA's   :8      NA's   :8       
##   dvd_rel_day     imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 1.00   Min.   :1.900   Min.   :   180   Certified Fresh:135   
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546   Fresh          :209   
##  Median :15.00   Median :6.600   Median : 15116   Rotten         :307   
##  Mean   :15.01   Mean   :6.493   Mean   : 57533                         
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301                         
##  Max.   :31.00   Max.   :9.000   Max.   :893008                         
##  NA's   :8                                                              
##  critics_score    audience_rating audience_score  best_pic_nom best_pic_win
##  Min.   :  1.00   Spilled:275     Min.   :11.00   no :629      no :644     
##  1st Qu.: 33.00   Upright:376     1st Qu.:46.00   yes: 22      yes:  7     
##  Median : 61.00                   Median :65.00                            
##  Mean   : 57.69                   Mean   :62.36                            
##  3rd Qu.: 83.00                   3rd Qu.:80.00                            
##  Max.   :100.00                   Max.   :97.00                            
##                                                                            
##  best_actor_win best_actress_win best_dir_win top200_box   director        
##  no :558        no :579          no :608      no :636    Length:651        
##  yes: 93        yes: 72          yes: 43      yes: 15    Class :character  
##                                                          Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##     actor1             actor2             actor3             actor4         
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor5            imdb_url            rt_url         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
glimpse(movies)
## Rows: 651
## Columns: 32
## $ title            <chr> "Filly Brown", "The Dish", "Waiting for Guffman", ...
## $ title_type       <fct> Feature Film, Feature Film, Feature Film, Feature ...
## $ genre            <fct> Drama, Drama, Comedy, Drama, Horror, Documentary, ...
## $ runtime          <dbl> 80, 101, 84, 139, 90, 78, 142, 93, 88, 119, 127, 1...
## $ mpaa_rating      <fct> R, PG-13, R, PG, R, Unrated, PG-13, R, Unrated, Un...
## $ studio           <fct> Indomina Media Inc., Warner Bros. Pictures, Sony P...
## $ thtr_rel_year    <dbl> 2013, 2001, 1996, 1993, 2004, 2009, 1986, 1996, 20...
## $ thtr_rel_month   <dbl> 4, 3, 8, 10, 9, 1, 1, 11, 9, 3, 6, 12, 1, 9, 6, 8,...
## $ thtr_rel_day     <dbl> 19, 14, 21, 1, 10, 15, 1, 8, 7, 2, 19, 18, 4, 23, ...
## $ dvd_rel_year     <dbl> 2013, 2001, 2001, 2001, 2005, 2010, 2003, 2004, 20...
## $ dvd_rel_month    <dbl> 7, 8, 8, 11, 4, 4, 2, 3, 1, 8, 5, 9, 7, 2, 3, 12, ...
## $ dvd_rel_day      <dbl> 30, 28, 21, 6, 19, 20, 18, 2, 21, 14, 1, 23, 9, 13...
## $ imdb_rating      <dbl> 5.5, 7.3, 7.6, 7.2, 5.1, 7.8, 7.2, 5.5, 7.5, 6.6, ...
## $ imdb_num_votes   <int> 899, 12285, 22381, 35096, 2386, 333, 5016, 2272, 8...
## $ critics_rating   <fct> Rotten, Certified Fresh, Certified Fresh, Certifie...
## $ critics_score    <dbl> 45, 96, 91, 80, 33, 91, 57, 17, 90, 83, 89, 67, 80...
## $ audience_rating  <fct> Upright, Upright, Upright, Upright, Spilled, Uprig...
## $ audience_score   <dbl> 73, 81, 91, 76, 27, 86, 76, 47, 89, 66, 75, 46, 89...
## $ best_pic_nom     <fct> no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ best_pic_win     <fct> no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ best_actor_win   <fct> no, no, no, yes, no, no, no, yes, no, no, yes, no,...
## $ best_actress_win <fct> no, no, no, no, no, no, no, no, no, no, no, no, ye...
## $ best_dir_win     <fct> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ top200_box       <fct> no, no, no, no, no, no, no, no, no, no, yes, no, n...
## $ director         <chr> "Michael D. Olmos", "Rob Sitch", "Christopher Gues...
## $ actor1           <chr> "Gina Rodriguez", "Sam Neill", "Christopher Guest"...
## $ actor2           <chr> "Jenni Rivera", "Kevin Harrington", "Catherine O'H...
## $ actor3           <chr> "Lou Diamond Phillips", "Patrick Warburton", "Park...
## $ actor4           <chr> "Emilio Rivera", "Tom Long", "Eugene Levy", "Richa...
## $ actor5           <chr> "Joseph Julian Soria", "Genevieve Mooy", "Bob Bala...
## $ imdb_url         <chr> "http://www.imdb.com/title/tt1869425/", "http://ww...
## $ rt_url           <chr> "//www.rottentomatoes.com/m/filly_brown_2012/", "/...

The dataset consists of 651 randomly selected movies that were produced and released before 2016, with information drawn from Rotten Tomatoes and IMDB. Because this is an observational study based on random sampling, with no random assignment, we can only establish correlation, not causation. Since the movies were sampled at random and the sample is reasonably large, the results can be generalized to a larger population of movies. However, since the data comes from English-language platforms that cater mostly to English speakers, the ratings are likely to be biased in favor of English-language movies over movies from other film industries such as Bollywood, Chinese, or Korean cinema.

Part 2: Data Manipulation

We are going to create a few new variables to assist in our EDA. Below is their description:

  1. feature_film: “yes” if title_type is Feature Film, “no” otherwise.
  2. drama: “yes” if genre is Drama, “no” otherwise.
  3. mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise.
  4. oscar_season: “yes” if the movie was released in October, November, or December (based on thtr_rel_month), “no” otherwise.
  5. summer_season: “yes” if the movie was released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.
movies <- movies %>%
  mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"),
         drama = ifelse(genre == "Drama", "yes", "no"),
         mpaa_rating_R = ifelse(mpaa_rating == "R","yes","no"),
         oscar_season = ifelse(thtr_rel_month == 11 | thtr_rel_month == 10 | thtr_rel_month == 12, "yes", "no"),
         summer_season = ifelse(thtr_rel_month == 5 | thtr_rel_month == 6 | thtr_rel_month == 7 | thtr_rel_month == 8, "yes","no"))
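
The chained comparisons above can also be written with %in%. The sketch below is a minimal illustration on a made-up vector of release months, not the movies data; the names months_toy, chained and with_in are hypothetical. The two forms agree on complete data but differ when a month is missing.

```r
# Toy release months, including one missing value (illustrative only).
months_toy <- c(5, 11, 2, NA)

# Chained comparisons, as in the mutate() call above: a missing month stays NA.
chained <- ifelse(months_toy == 10 | months_toy == 11 | months_toy == 12,
                  "yes", "no")

# %in% form: shorter, but a missing month is classified as "no".
with_in <- ifelse(months_toy %in% 10:12, "yes", "no")

print(chained)
print(with_in)
```

Since the summary in Part 1 shows no missing values for thtr_rel_month, either form would produce the same oscar_season column here.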

We’ll then create a new data frame, “movies2”, that contains only the subset of variables used in the rest of the analysis.

movies2_features <- c("audience_score", "feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
movies2 <- movies[movies2_features]

Part 3: Exploratory Data Analysis (EDA)

We will begin our EDA by looking at the summary of the newly created data frame “movies2”

summary(movies2)
##  audience_score  feature_film          drama              runtime     
##  Min.   :11.00   Length:651         Length:651         Min.   : 39.0  
##  1st Qu.:46.00   Class :character   Class :character   1st Qu.: 92.0  
##  Median :65.00   Mode  :character   Mode  :character   Median :103.0  
##  Mean   :62.36                                         Mean   :105.8  
##  3rd Qu.:80.00                                         3rd Qu.:115.8  
##  Max.   :97.00                                         Max.   :267.0  
##                                                        NA's   :1      
##  mpaa_rating_R      thtr_rel_year  oscar_season       summer_season     
##  Length:651         Min.   :1970   Length:651         Length:651        
##  Class :character   1st Qu.:1990   Class :character   Class :character  
##  Mode  :character   Median :2000   Mode  :character   Mode  :character  
##                     Mean   :1998                                        
##                     3rd Qu.:2007                                        
##                     Max.   :2014                                        
##                                                                         
##   imdb_rating    imdb_num_votes   critics_score    best_pic_nom best_pic_win
##  Min.   :1.900   Min.   :   180   Min.   :  1.00   no :629      no :644     
##  1st Qu.:5.900   1st Qu.:  4546   1st Qu.: 33.00   yes: 22      yes:  7     
##  Median :6.600   Median : 15116   Median : 61.00                            
##  Mean   :6.493   Mean   : 57533   Mean   : 57.69                            
##  3rd Qu.:7.300   3rd Qu.: 58301   3rd Qu.: 83.00                            
##  Max.   :9.000   Max.   :893008   Max.   :100.00                            
##                                                                             
##  best_actor_win best_actress_win best_dir_win top200_box
##  no :558        no :579          no :608      no :636   
##  yes: 93        yes: 72          yes: 43      yes: 15   
##                                                         
##                                                         
##                                                         
##                                                         
## 

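This shows how spread out each numeric variable in the dataset is. The newly created variables are stored as character vectors, so summary() reports only their length and class.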
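
The derived yes/no columns are character vectors, which is why summary() shows only Length, Class and Mode for them. A minimal sketch on made-up values (the toy_drama vector is hypothetical, not taken from the data) shows how table() or a factor conversion would give counts instead:

```r
# Made-up yes/no values standing in for a derived column such as drama.
toy_drama <- c("yes", "no", "yes", "yes")

summary(toy_drama)                          # character: Length / Class / Mode only
table(toy_drama)                            # counts per value
drama_counts <- summary(factor(toy_drama))  # factor: counts per level
print(drama_counts)
```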

Now let’s look at the structure of the dataframe.

str(movies2)
## tibble [651 x 17] (S3: tbl_df/tbl/data.frame)
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ feature_film    : chr [1:651] "yes" "yes" "yes" "yes" ...
##  $ drama           : chr [1:651] "yes" "yes" "no" "yes" ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating_R   : chr [1:651] "yes" "no" "yes" "no" ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ oscar_season    : chr [1:651] "no" "no" "no" "yes" ...
##  $ summer_season   : chr [1:651] "no" "no" "yes" "no" ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Let’s create boxplots to see how the newly created variables relate to “audience_score”.

plot1 <- ggplot(movies2, aes(x=mpaa_rating_R,y=audience_score))+
            geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)
  
plot2 <- ggplot(movies2, aes(x=oscar_season, y=audience_score))+
            geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)
  
plot3 <- ggplot(movies2, aes(x=summer_season,y=audience_score))+
            geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)
  
plot4 <- ggplot(movies2, aes(x=feature_film, y=audience_score))+
            geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)
  
plot5 <- ggplot(movies2, aes(x=drama, y=audience_score))+
            geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)
            
grid.arrange(plot1,plot2,plot3,plot4,plot5, ncol=3)

Let’s explore the correlation between the audience score and the variables in movies2 using pairwise plots.

suppressWarnings(suppressMessages(print(ggpairs(movies2, columns = 1:8))))

suppressWarnings(suppressMessages(print(ggpairs(movies2, columns = c(1,9:17)))))

From the charts above, we can infer that audience_score is highly correlated with critics_score.

Let’s further explore its correlation using a scatterplot fitted with a regression line.

cor(movies2$audience_score, movies2$critics_score)
## [1] 0.7042762
ggplot(data=movies2, aes(x = audience_score, y = critics_score)) +
  geom_jitter(alpha  = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "red")
## `geom_smooth()` using formula 'y ~ x'

Let’s examine the relation between imdb_rating and audience_score similarly.

cor(movies2$audience_score, movies2$imdb_rating)
## [1] 0.8648652
ggplot(data=movies2, aes(x = audience_score, y = imdb_rating)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "red")
## `geom_smooth()` using formula 'y ~ x'

From the charts above we can see that audience_score is highly correlated with both critics_score and imdb_rating.
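
As a rough guide to the strength of these relationships, squaring a correlation coefficient gives the share of variance in one variable accounted for by a simple linear fit on the other. The sketch below just squares the two correlations reported above; the variable names r and r_squared are illustrative.

```r
# Correlations with audience_score, copied from the cor() output above.
r <- c(critics_score = 0.7042762, imdb_rating = 0.8648652)

# Share of variance explained by a one-predictor linear fit.
r_squared <- round(r^2, 3)
print(r_squared)
```

On these figures, imdb_rating alone accounts for roughly three quarters of the variation in audience_score, and critics_score for about half.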

Part 4: Modeling

We will start by fitting a linear model that relates the response variable, audience_score, to all the predictors.

For model selection, we will then run the stepAIC function from the MASS package in the backward direction, dropping one variable at a time until the AIC can no longer be lowered.

as_model <- lm(audience_score ~ ., data= na.omit(movies2))
as_model
## 
## Call:
## lm(formula = audience_score ~ ., data = na.omit(movies2))
## 
## Coefficients:
##         (Intercept)      feature_filmyes             dramayes  
##           1.244e+02           -2.248e+00            1.292e+00  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##          -5.614e-02           -1.444e+00           -7.657e-02  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##          -5.333e-01            9.106e-01            1.472e+01  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##           7.234e-06            5.748e-02            5.321e+00  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##          -3.212e+00           -1.544e+00           -2.198e+00  
##     best_dir_winyes        top200_boxyes  
##          -1.231e+00            8.478e-01
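
Only runtime has a missing value in movies2, so na.omit() removes a single row. lm() would drop that row on its own, but fitting on na.omit() data keeps the row count fixed while stepAIC removes variables. A minimal sketch on made-up data (toy, fit_default and fit_omit are hypothetical names) shows that both fits use the same rows:

```r
# Toy data with one missing predictor value (illustrative only).
toy <- data.frame(
  y  = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(0.5, NA, 1.5, 1.0, 2.5, 2.0)
)

fit_default <- lm(y ~ ., data = toy)           # lm() drops the NA row itself
fit_omit    <- lm(y ~ ., data = na.omit(toy))  # NA row removed up front

# Number of rows actually used by each fit.
n_default <- nobs(fit_default)
n_omit    <- nobs(fit_omit)
print(c(n_default, n_omit))

# The estimated coefficients agree as well.
same_coef <- all.equal(coef(fit_default), coef(fit_omit))
print(same_coef)
```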

Creating the model based on AIC

stepAIC.model <- stepAIC(as_model, direction = "backward", trace = TRUE)
## Start:  AIC=3006.94
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         9  62999 3005.0
## - oscar_season      1        28  63018 3005.2
## - best_pic_win      1        48  63038 3005.4
## - best_dir_win      1        51  63040 3005.5
## - summer_season     1        92  63081 3005.9
## - best_actor_win    1       171  63160 3006.7
## - feature_film      1       177  63166 3006.8
## <none>                           62990 3006.9
## - drama             1       216  63206 3007.2
## - imdb_num_votes    1       255  63244 3007.6
## - best_actress_win  1       283  63273 3007.9
## - mpaa_rating_R     1       314  63304 3008.2
## - thtr_rel_year     1       397  63386 3009.0
## - best_pic_nom      1       408  63398 3009.1
## - runtime           1       538  63527 3010.5
## - critics_score     1       669  63659 3011.8
## - imdb_rating       1     58556 121545 3432.2
## 
## Step:  AIC=3005.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        26  63025 3003.3
## - best_pic_win      1        49  63047 3003.5
## - best_dir_win      1        52  63051 3003.6
## - summer_season     1        94  63093 3004.0
## - best_actor_win    1       169  63168 3004.8
## - feature_film      1       176  63175 3004.8
## <none>                           62999 3005.0
## - drama             1       214  63213 3005.2
## - best_actress_win  1       279  63278 3005.9
## - imdb_num_votes    1       302  63301 3006.1
## - mpaa_rating_R     1       330  63329 3006.4
## - best_pic_nom      1       404  63403 3007.2
## - thtr_rel_year     1       415  63414 3007.3
## - runtime           1       535  63534 3008.5
## - critics_score     1       681  63680 3010.0
## - imdb_rating       1     58606 121604 3430.5
## 
## Step:  AIC=3003.31
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        46  63071 3001.8
## - best_dir_win      1        56  63081 3001.9
## - best_actor_win    1       174  63200 3003.1
## - summer_season     1       177  63202 3003.1
## - feature_film      1       182  63207 3003.2
## <none>                           63025 3003.3
## - drama             1       222  63247 3003.6
## - best_actress_win  1       281  63307 3004.2
## - imdb_num_votes    1       302  63328 3004.4
## - mpaa_rating_R     1       329  63354 3004.7
## - best_pic_nom      1       387  63412 3005.3
## - thtr_rel_year     1       410  63436 3005.5
## - runtime           1       587  63613 3007.3
## - critics_score     1       679  63704 3008.3
## - imdb_rating       1     58603 121628 3428.6
## 
## Step:  AIC=3001.78
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        94  63165 3000.7
## - best_actor_win    1       163  63234 3001.5
## - feature_film      1       171  63242 3001.5
## - summer_season     1       174  63245 3001.6
## <none>                           63071 3001.8
## - drama             1       220  63291 3002.0
## - imdb_num_votes    1       271  63342 3002.6
## - best_actress_win  1       294  63365 3002.8
## - mpaa_rating_R     1       330  63401 3003.2
## - best_pic_nom      1       342  63414 3003.3
## - thtr_rel_year     1       397  63468 3003.9
## - runtime           1       586  63657 3005.8
## - critics_score     1       680  63751 3006.8
## - imdb_rating       1     58858 121929 3428.2
## 
## Step:  AIC=3000.75
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       167  63332 3000.5
## - best_actor_win    1       171  63336 3000.5
## - feature_film      1       183  63348 3000.6
## <none>                           63165 3000.7
## - drama             1       228  63394 3001.1
## - imdb_num_votes    1       247  63412 3001.3
## - best_actress_win  1       299  63464 3001.8
## - best_pic_nom      1       326  63491 3002.1
## - mpaa_rating_R     1       345  63510 3002.3
## - thtr_rel_year     1       368  63533 3002.5
## - critics_score     1       651  63816 3005.4
## - runtime           1       673  63839 3005.6
## - imdb_rating       1     58895 122061 3426.9
## 
## Step:  AIC=3000.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       156  63488 3000.1
## <none>                           63332 3000.5
## - best_actor_win    1       195  63527 3000.5
## - drama             1       204  63536 3000.6
## - imdb_num_votes    1       260  63592 3001.1
## - best_pic_nom      1       297  63629 3001.5
## - best_actress_win  1       297  63629 3001.5
## - mpaa_rating_R     1       356  63688 3002.1
## - thtr_rel_year     1       361  63693 3002.2
## - runtime           1       690  64022 3005.5
## - critics_score     1       732  64064 3005.9
## - imdb_rating       1     58763 122095 3425.1
## 
## Step:  AIC=3000.06
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       121  63609 2999.3
## - imdb_num_votes    1       173  63661 2999.8
## <none>                           63488 3000.1
## - best_actor_win    1       219  63706 3000.3
## - thtr_rel_year     1       277  63765 3000.9
## - best_pic_nom      1       291  63778 3001.0
## - best_actress_win  1       306  63794 3001.2
## - mpaa_rating_R     1       453  63941 3002.7
## - runtime           1       715  64203 3005.3
## - critics_score     1       875  64363 3007.0
## - imdb_rating       1     63189 126677 3447.1
## 
## Step:  AIC=2999.3
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       148  63757 2998.8
## <none>                           63609 2999.3
## - best_actor_win    1       209  63818 2999.4
## - thtr_rel_year     1       272  63881 3000.1
## - best_actress_win  1       274  63883 3000.1
## - best_pic_nom      1       307  63916 3000.4
## - mpaa_rating_R     1       391  64000 3001.3
## - runtime           1       631  64240 3003.7
## - critics_score     1       916  64525 3006.6
## - imdb_rating       1     63434 127043 3447.0
## 
## Step:  AIC=2998.81
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                           63757 2998.8
## - thtr_rel_year     1       201  63958 2998.9
## - best_actor_win    1       219  63976 2999.0
## - best_actress_win  1       266  64023 2999.5
## - mpaa_rating_R     1       367  64124 3000.5
## - best_pic_nom      1       442  64199 3001.3
## - runtime           1       519  64276 3002.1
## - critics_score     1       879  64635 3005.7
## - imdb_rating       1     67356 131113 3465.4
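
The AIC values in the trace can be reproduced by hand. For a linear model, stepAIC scores a fit as n * log(RSS / n) + k * edf, where edf is the number of estimated coefficients and k = 2 for AIC. The sketch below plugs in figures from the output above: the full model's RSS of 62990, n = 650 complete rows (651 minus the one row with a missing runtime), and 17 coefficients (16 predictors plus the intercept). The helper name step_criterion is hypothetical.

```r
# Criterion used in the stepwise trace for a linear model:
# n * log(RSS / n) + k * edf, with k = 2 giving AIC.
step_criterion <- function(rss, n, edf, k = 2) {
  n * log(rss / n) + k * edf
}

# Full model: RSS taken from the <none> row of the first step.
aic_full <- step_criterion(rss = 62990, n = 650, edf = 17)
print(round(aic_full, 2))
```

The result is approximately 3006.94, in line with the Start value in the trace. Small differences in the last digit are possible because the RSS printed in the trace is rounded.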

The final model built using AIC consists of the following variables:

runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win

AIC.lm_model <- lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data=movies2)

Let’s take a look at the coefficients of this model:

AIC.lm_model$coefficients
##         (Intercept)             runtime    mpaa_rating_Ryes       thtr_rel_year 
##         70.10675281         -0.05115515         -1.50528039         -0.05122557 
##         imdb_rating       critics_score     best_pic_nomyes   best_actor_winyes 
##         15.00149242          0.06409989          4.88277038         -1.73481942 
## best_actress_winyes 
##         -2.11568281

Let’s take a look at the residual standard error (sigma) of this model:

summary(AIC.lm_model)$sigma
## [1] 9.973201

Let’s plot the residuals of this model:

ggplot(data.frame(resid = residuals(AIC.lm_model)), aes(x = resid)) + geom_histogram(bins = 30)

The histogram suggests that the residuals are approximately normally distributed and centered around zero.
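
A histogram alone can hide departures from normality in the tails, so a normal Q-Q plot and a Shapiro-Wilk test are common complements. The sketch below uses R's built-in mtcars data as a stand-in, since it is only meant to show the calls; toy_fit and toy_resid are hypothetical names, and the same functions could be applied to residuals(AIC.lm_model).

```r
# Stand-in model on built-in data (not the movies model).
toy_fit   <- lm(mpg ~ wt + hp, data = mtcars)
toy_resid <- residuals(toy_fit)

# Normal Q-Q plot: points close to the line suggest approximate normality.
qqnorm(toy_resid)
qqline(toy_resid, col = "red")

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(toy_resid)
print(sw$p.value)
```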

Creating the model using BIC

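We will again use the stepAIC function from the MASS library in the backward direction, this time with the penalty k = log(n), until we reach a stage where we cannot further lower the BIC. Note that stepAIC still labels the criterion “AIC” in its output, although the values shown are BIC. The call below takes n from nrow(movies2), which is 651, whereas the model is fitted on the 650 complete rows; the effect on the penalty is negligible.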
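
The only difference between the two searches is the penalty per coefficient passed through k: 2 for AIC and log(n) for BIC. Because log(n) exceeds 2 once n is 8 or more, BIC tends to keep fewer variables. The sketch below illustrates this on R's built-in mtcars data rather than the movies data; toy_full, aic_fit and bic_fit are hypothetical names.

```r
library(MASS)

# Full model on built-in data (stand-in for the movies model).
toy_full <- lm(mpg ~ ., data = mtcars)
n <- nobs(toy_full)  # rows actually used in the fit

# Backward elimination with the AIC penalty (k = 2, the default).
aic_fit <- stepAIC(toy_full, direction = "backward", trace = FALSE)

# Same search with the BIC penalty (k = log(n)).
bic_fit <- stepAIC(toy_full, direction = "backward", k = log(n), trace = FALSE)

print(formula(aic_fit))
print(formula(bic_fit))
```

Taking n from nobs() on the fitted model ties the penalty to the rows actually used in the fit.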

stepBIC.model <- stepAIC(as_model, direction = "backward", k=log(nrow(movies2)), trace = TRUE)
## Start:  AIC=3083.07
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         9  62999 3076.7
## - oscar_season      1        28  63018 3076.9
## - best_pic_win      1        48  63038 3077.1
## - best_dir_win      1        51  63040 3077.1
## - summer_season     1        92  63081 3077.5
## - best_actor_win    1       171  63160 3078.4
## - feature_film      1       177  63166 3078.4
## - drama             1       216  63206 3078.8
## - imdb_num_votes    1       255  63244 3079.2
## - best_actress_win  1       283  63273 3079.5
## - mpaa_rating_R     1       314  63304 3079.8
## - thtr_rel_year     1       397  63386 3080.7
## - best_pic_nom      1       408  63398 3080.8
## - runtime           1       538  63527 3082.1
## <none>                           62990 3083.1
## - critics_score     1       669  63659 3083.5
## - imdb_rating       1     58556 121545 3503.9
## 
## Step:  AIC=3076.69
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        26  63025 3070.5
## - best_pic_win      1        49  63047 3070.7
## - best_dir_win      1        52  63051 3070.8
## - summer_season     1        94  63093 3071.2
## - best_actor_win    1       169  63168 3072.0
## - feature_film      1       176  63175 3072.0
## - drama             1       214  63213 3072.4
## - best_actress_win  1       279  63278 3073.1
## - imdb_num_votes    1       302  63301 3073.3
## - mpaa_rating_R     1       330  63329 3073.6
## - best_pic_nom      1       404  63403 3074.4
## - thtr_rel_year     1       415  63414 3074.5
## - runtime           1       535  63534 3075.7
## <none>                           62999 3076.7
## - critics_score     1       681  63680 3077.2
## - imdb_rating       1     58606 121604 3497.7
## 
## Step:  AIC=3070.49
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        46  63071 3064.5
## - best_dir_win      1        56  63081 3064.6
## - best_actor_win    1       174  63200 3065.8
## - summer_season     1       177  63202 3065.8
## - feature_film      1       182  63207 3065.9
## - drama             1       222  63247 3066.3
## - best_actress_win  1       281  63307 3066.9
## - imdb_num_votes    1       302  63328 3067.1
## - mpaa_rating_R     1       329  63354 3067.4
## - best_pic_nom      1       387  63412 3068.0
## - thtr_rel_year     1       410  63436 3068.2
## - runtime           1       587  63613 3070.0
## <none>                           63025 3070.5
## - critics_score     1       679  63704 3071.0
## - imdb_rating       1     58603 121628 3491.3
## 
## Step:  AIC=3064.48
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        94  63165 3059.0
## - best_actor_win    1       163  63234 3059.7
## - feature_film      1       171  63242 3059.8
## - summer_season     1       174  63245 3059.8
## - drama             1       220  63291 3060.3
## - imdb_num_votes    1       271  63342 3060.8
## - best_actress_win  1       294  63365 3061.0
## - mpaa_rating_R     1       330  63401 3061.4
## - best_pic_nom      1       342  63414 3061.5
## - thtr_rel_year     1       397  63468 3062.1
## - runtime           1       586  63657 3064.0
## <none>                           63071 3064.5
## - critics_score     1       680  63751 3065.0
## - imdb_rating       1     58858 121929 3486.5
## 
## Step:  AIC=3058.97
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       167  63332 3054.2
## - best_actor_win    1       171  63336 3054.2
## - feature_film      1       183  63348 3054.4
## - drama             1       228  63394 3054.8
## - imdb_num_votes    1       247  63412 3055.0
## - best_actress_win  1       299  63464 3055.6
## - best_pic_nom      1       326  63491 3055.8
## - mpaa_rating_R     1       345  63510 3056.0
## - thtr_rel_year     1       368  63533 3056.3
## <none>                           63165 3059.0
## - critics_score     1       651  63816 3059.2
## - runtime           1       673  63839 3059.4
## - imdb_rating       1     58895 122061 3480.7
## 
## Step:  AIC=3054.2
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       156  63488 3049.3
## - best_actor_win    1       195  63527 3049.7
## - drama             1       204  63536 3049.8
## - imdb_num_votes    1       260  63592 3050.4
## - best_pic_nom      1       297  63629 3050.8
## - best_actress_win  1       297  63629 3050.8
## - mpaa_rating_R     1       356  63688 3051.4
## - thtr_rel_year     1       361  63693 3051.4
## <none>                           63332 3054.2
## - runtime           1       690  64022 3054.8
## - critics_score     1       732  64064 3055.2
## - imdb_rating       1     58763 122095 3474.4
## 
## Step:  AIC=3049.32
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       121  63609 3044.1
## - imdb_num_votes    1       173  63661 3044.6
## - best_actor_win    1       219  63706 3045.1
## - thtr_rel_year     1       277  63765 3045.7
## - best_pic_nom      1       291  63778 3045.8
## - best_actress_win  1       306  63794 3046.0
## - mpaa_rating_R     1       453  63941 3047.5
## <none>                           63488 3049.3
## - runtime           1       715  64203 3050.1
## - critics_score     1       875  64363 3051.7
## - imdb_rating       1     63189 126677 3491.9
## 
## Step:  AIC=3044.09
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       148  63757 3039.1
## - best_actor_win    1       209  63818 3039.7
## - thtr_rel_year     1       272  63881 3040.4
## - best_actress_win  1       274  63883 3040.4
## - best_pic_nom      1       307  63916 3040.7
## - mpaa_rating_R     1       391  64000 3041.6
## - runtime           1       631  64240 3044.0
## <none>                           63609 3044.1
## - critics_score     1       916  64525 3046.9
## - imdb_rating       1     63434 127043 3487.3
## 
## Step:  AIC=3039.12
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - thtr_rel_year     1       201  63958 3034.7
## - best_actor_win    1       219  63976 3034.9
## - best_actress_win  1       266  64023 3035.3
## - mpaa_rating_R     1       367  64124 3036.4
## - best_pic_nom      1       442  64199 3037.1
## - runtime           1       519  64276 3037.9
## <none>                           63757 3039.1
## - critics_score     1       879  64635 3041.5
## - imdb_rating       1     67356 131113 3501.3
## 
## Step:  AIC=3034.68
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actor_win    1       207  64165 3030.3
## - best_actress_win  1       261  64219 3030.9
## - mpaa_rating_R     1       373  64331 3032.0
## - best_pic_nom      1       447  64405 3032.7
## - runtime           1       468  64425 3032.9
## <none>                           63958 3034.7
## - critics_score     1       968  64926 3038.0
## - imdb_rating       1     67172 131129 3494.9
## 
## Step:  AIC=3030.3
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actress_win  1       296  64461 3026.8
## - mpaa_rating_R     1       366  64531 3027.5
## - best_pic_nom      1       396  64561 3027.8
## <none>                           64165 3030.3
## - runtime           1       643  64808 3030.3
## - critics_score     1       968  65133 3033.6
## - imdb_rating       1     67296 131461 3490.0
## 
## Step:  AIC=3026.82
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom
## 
##                 Df Sum of Sq    RSS    AIC
## - best_pic_nom   1       303  64765 3023.4
## - mpaa_rating_R  1       354  64815 3023.9
## <none>                        64461 3026.8
## - runtime        1       814  65275 3028.5
## - critics_score  1       957  65418 3029.9
## - imdb_rating    1     67424 131885 3485.7
## 
## Step:  AIC=3023.39
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## - mpaa_rating_R  1       361  65126 3020.5
## - runtime        1       638  65403 3023.3
## <none>                        64765 3023.4
## - critics_score  1      1027  65792 3027.1
## - imdb_rating    1     68173 132937 3484.3
## 
## Step:  AIC=3020.53
## audience_score ~ runtime + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## <none>                        65126 3020.5
## - runtime        1       653  65779 3020.5
## - critics_score  1      1073  66199 3024.7
## - imdb_rating    1     67874 133000 3478.2

The final model will use the following variables:

audience_score ~ runtime + imdb_rating + critics_score

BIC.lm_model <- lm(audience_score ~ runtime + imdb_rating + critics_score, data=movies2)
BIC.lm_model
## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = movies2)
## 
## Coefficients:
##   (Intercept)        runtime    imdb_rating  critics_score  
##     -33.28321       -0.05362       14.98076        0.07036
BIC.lm_model$coefficients
##   (Intercept)       runtime   imdb_rating critics_score 
##  -33.28320569   -0.05361506   14.98076157    0.07035672
summary(BIC.lm_model)$sigma
## [1] 10.04062
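
As a sanity check, the last step of the trace and the residual standard error above can both be reproduced from the printed RSS. For linear models, stepwise selection scores each candidate as n*log(RSS/n) + k*edf, and setting k = log(n) turns the AIC penalty into the BIC penalty (the column is still labelled AIC). The sketch below assumes n = 650 complete cases, a value inferred from the output rather than stated here, and counts the intercept in edf.

```r
# Values copied from the final step of the trace; n is an assumption.
n   <- 650      # assumed number of complete cases
rss <- 65126    # RSS of audience_score ~ runtime + imdb_rating + critics_score
edf <- 4        # intercept + 3 slopes

# Criterion used by stepwise selection for lm fits: n*log(RSS/n) + k*edf
crit <- function(rss, edf, n, k = log(n)) n * log(rss / n) + k * edf

bic_full       <- crit(rss, edf, n)        # "<none>" row
bic_no_runtime <- crit(65779, edf - 1, n)  # "- runtime" row

# Residual standard error: sqrt(RSS / (n - edf))
sigma_hat <- sqrt(rss / (n - edf))

round(c(bic_full = bic_full, bic_no_runtime = bic_no_runtime,
        sigma_hat = sigma_hat), 2)
```

Both criterion values agree with the trace to its printed precision, and they differ by only a few hundredths at most, which is why runtime is only barely retained.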

Taking a look at the residuals:

ggplot(data=BIC.lm_model, aes(x=BIC.lm_model$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram suggests that the residuals are approximately normally distributed.

Creating the model using Bayesian Model Averaging (BMA)

model.bas <- bas.lm(audience_score ~ .,
       prior ="BIC",
       modelprior = uniform(),
       data = na.omit(movies2))
model.bas
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = na.omit(movies2), 
##     prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06537              0.04320  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.46971              0.19984              0.09069  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07506              0.08042              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.05774              0.88855              0.13119  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.03985              0.14435              0.14128  
##     best_dir_winyes        top200_boxyes  
##             0.06694              0.04762

According to this model, the marginal posterior inclusion probability of imdb_rating is 1 (to five decimal places), so it appears in essentially every model that carries posterior weight. Other noteworthy variables are critics_score (~89%) and runtime (~47%). The next highest is mpaa_rating_Ryes at ~20%.
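
To make the idea of an inclusion probability concrete, here is a minimal sketch on simulated data (not movies2) that enumerates every model by hand. It assumes a uniform model prior and approximates each model's posterior weight by exp(-BIC/2); this illustrates the concept and is not a description of how BAS is implemented. The inclusion probability of a variable is the total posterior weight of the models that contain it.

```r
set.seed(1)
n   <- 200
X   <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y   <- 1 + 2 * X$x1 + rnorm(n)   # only x1 truly matters
dat <- cbind(y = y, X)

vars <- names(X)
# All 2^3 inclusion patterns, one row per candidate model
patterns <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(vars))))
colnames(patterns) <- vars

# BIC of each candidate model
bic <- apply(patterns, 1, function(inc) {
  rhs <- if (any(inc)) paste(vars[inc], collapse = " + ") else "1"
  BIC(lm(as.formula(paste("y ~", rhs)), data = dat))
})

# Uniform model prior: posterior weight proportional to exp(-BIC/2)
post <- exp(-(bic - min(bic)) / 2)
post <- post / sum(post)

# Inclusion probability = summed weight of models containing the variable
pip <- colSums(patterns * post)
round(pip, 3)
```

Here x1 should receive an inclusion probability near 1, while the noise variables x2 and x3 receive much smaller values.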

confint(coef(model.bas))
##                              2.5%        97.5%          beta
## Intercept            6.155045e+01 6.311231e+01  6.234769e+01
## feature_filmyes     -1.023984e+00 3.529045e-02 -1.046908e-01
## dramayes             0.000000e+00 0.000000e+00  1.604413e-02
## runtime             -8.309256e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes    -2.127469e+00 6.036765e-04 -3.036174e-01
## thtr_rel_year       -4.768019e-02 0.000000e+00 -4.532635e-03
## oscar_seasonyes     -9.562348e-01 0.000000e+00 -8.034940e-02
## summer_seasonyes     0.000000e+00 1.064320e+00  8.704545e-02
## imdb_rating          1.364619e+01 1.655342e+01  1.498203e+01
## imdb_num_votes      -2.378714e-07 1.779947e-06  2.080713e-07
## critics_score        0.000000e+00 1.059350e-01  6.296648e-02
## best_pic_nomyes      0.000000e+00 5.055486e+00  5.068035e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes   -2.629305e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.732913e+00 1.953780e-02 -3.088382e-01
## best_dir_winyes     -1.548177e+00 0.000000e+00 -1.195011e-01
## top200_boxyes        0.000000e+00 0.000000e+00  8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(model.bas)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127

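The best model chosen (model 1, posterior probability 0.1297) contains the variables runtime, imdb_rating, and critics_score. Notice that this is the same model selected by the backwards stepwise BIC method above. Model 2, which drops runtime, is almost as probable (posterior probability 0.1293).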
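
The BF row of the summary can be recovered from the logmarg row: each Bayes factor is the exponentiated difference between a model's log marginal likelihood and that of the top model. The short check below uses only the values printed in the summary; the vector names m1 to m5 are labels introduced here for illustration.

```r
# Log marginal likelihoods copied from the summary output
logmarg <- c(m1 = -3615.2791, m2 = -3615.2822108, m3 = -3616.6482224,
             m4 = -3616.6568544, m5 = -3616.7095127)

# Bayes factor of each model relative to model 1
bf <- exp(logmarg - logmarg["m1"])
round(bf, 4)

# Under a uniform model prior, posterior probabilities share the same ratio
0.1293 / 0.1297
```

Small discrepancies from the printed BF row come from model 1's logmarg being shown to only four decimal places.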

Below, we can visualize the goodness of each of the models analyzed using the bas.lm function. The best model (rank 1) shows on the left, with the colored squares representing variables that would be selected for that particular model.

image(model.bas, rotate = FALSE)

qqnorm(BIC.lm_model$residuals, col="aquamarine4")
qqline(BIC.lm_model$residuals)

The Q-Q plot indicates that the residuals are approximately normally distributed.
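
A visual check like this can be backed up with numbers. The sketch below runs on simulated data (not movies2), so the variable names and values are hypothetical; it shows how a Shapiro-Wilk test and a sample skewness could be computed for the residuals of any lm fit.

```r
set.seed(42)
n   <- 300
x   <- runif(n, 0, 10)
y   <- 3 + 1.5 * x + rnorm(n, sd = 2)   # simulated data with normal errors
fit <- lm(y ~ x)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)

# Sample skewness: negative values indicate a longer left tail
skew <- mean((res - mean(res))^3) / sd(res)^3

c(W = unname(sw$statistic), p.value = sw$p.value, skewness = skew)
```

Applied to BIC.lm_model$residuals, the same two quantities would help judge whether the left skewness seen in the plots is large enough to matter.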

Let’s plot the residuals against the fitted values here.

plot(BIC.lm_model$residuals ~ BIC.lm_model$fitted, col="red")
abline(h=0, lty=2)

From the plot, we can see some left skewness, but the residuals are generally scattered around 0.

Let’s plot the absolute values of the residuals against the fitted values here.

plot(abs(BIC.lm_model$residuals) ~ BIC.lm_model$fitted, col="red")
abline(h=0, lty=2)

We don’t see a fan shape here; hence the constant variance condition is met.
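
One rough way to quantify the absence of a fan shape is to regress the absolute residuals on the fitted values and look at the slope. The sketch below contrasts two simulated data sets (hypothetical, not movies2), one with constant spread and one whose spread grows with x; the helper abs_resid_slope is a name introduced here for illustration.

```r
set.seed(7)
n <- 400
x <- runif(n, 1, 10)
y_const <- 2 + 3 * x + rnorm(n, sd = 2)        # constant spread
y_fan   <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)  # spread grows with x

# Slope (and p-value) of |residual| regressed on the fitted values
abs_resid_slope <- function(y, x) {
  fit <- lm(y ~ x)
  aux <- lm(abs(resid(fit)) ~ fitted(fit))
  summary(aux)$coefficients[2, c("Estimate", "Pr(>|t|)")]
}

res_const <- abs_resid_slope(y_const, x)
res_fan   <- abs_resid_slope(y_fan, x)
rbind(constant = res_const, fan = res_fan)
```

A clearly positive slope, as in the fan case, would signal non-constant variance; a slope near zero is consistent with the condition being met.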

Prediction

The movie I’ve chosen is Avengers: Endgame (2019). The information I will be using for the prediction comes from IMDb and Rotten Tomatoes.

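Let’s create a data frame containing the information for Avengers: Endgame (2019). Note that thtr_rel_year is entered as 2016 below although the film was released in 2019; the BIC model does not use this variable, so its prediction is unaffected.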

Endgame_df <- data.frame(imdb_rating = 8.4, runtime = 181, critics_score = 94, mpaa_rating_R="no", thtr_rel_year=2016, best_pic_nom="no",best_actor_win="no", best_actress_win="no")

Endgame_df
##   imdb_rating runtime critics_score mpaa_rating_R thtr_rel_year best_pic_nom
## 1         8.4     181            94            no          2016           no
##   best_actor_win best_actress_win
## 1             no               no

We will now run predictions using both the BIC and AIC models, to contrast them. Note that the set of variables the BIC model uses is a subset of the variables the AIC model uses.

predict(BIC.lm_model, newdata = Endgame_df, interval = "prediction", level = 0.95)
##       fit      lwr      upr
## 1 89.4644 69.49893 109.4299

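The BIC model predicts a score of 89.46, with a 95% prediction interval of (69.50, 109.43). The upper bound exceeds the maximum possible score of 100 because the linear model does not constrain predictions to the 0 to 100 range.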
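
The point prediction can be reproduced by hand from the coefficients printed for BIC.lm_model earlier: it is the intercept plus each coefficient multiplied by the film's value for that variable. The vector names below are chosen here for illustration.

```r
# Coefficients copied from BIC.lm_model$coefficients
coefs <- c(intercept = -33.28320569, runtime = -0.05361506,
           imdb_rating = 14.98076157, critics_score = 0.07035672)

# Endgame's inputs, with 1 standing in for the intercept term
endgame <- c(intercept = 1, runtime = 181, imdb_rating = 8.4,
             critics_score = 94)

# Linear predictor: sum of coefficient * value
fit <- sum(coefs * endgame[names(coefs)])
round(fit, 4)
```

The result matches the fit column returned by predict().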

predict(AIC.lm_model, newdata = Endgame_df, interval = "prediction", level = 0.95)
##        fit      lwr     upr
## 1 89.61485 69.63368 109.596

The AIC model predicts a score of 89.61, with a 95% prediction interval of (69.63, 109.60).

As the true score was 88, the BIC model was marginally more accurate: its prediction is off by about 1.46 points, versus about 1.61 points for the AIC model.
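
Taking 88 as the true score, as stated above, the comparison comes down to the absolute error of each point prediction. A minimal check using the two fitted values printed by predict():

```r
truth <- 88
pred  <- c(BIC = 89.4644, AIC = 89.61485)

# Absolute prediction error of each model
abs_err <- abs(pred - truth)
abs_err

# Model with the smaller error
closest <- names(which.min(abs_err))
closest
```

The two errors differ by only about 0.15 points, far less than the width of either prediction interval, so this single film cannot meaningfully separate the models.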

Conclusion

The model created using stepAIC tuned toward BIC was the same model found to be ideal by bas.lm. In the end, the AIC and BIC models scored almost identically. I believe that if the scope of this project were increased, non-normally distributed errors might appear. One method to deal with such issues, which was not touched on in this project, is variable transformation, such as a log transformation.