This post introduces the basic frequentist approach to modeling and forecasting, as well as to constructing confidence intervals. Compared to the Bayesian approach it is a simpler tool, but simpler does not always mean worse.
The data set comprises 651 randomly sampled movies produced and released before 2016, sourced from the Rotten Tomatoes and IMDB APIs. Since these sources mostly serve an English-speaking audience, we should be careful about generalizing our conclusions to the whole world population, though we can be more confident about the US, UK, and other English-speaking countries. We should also note that this data set covers, for the most part, active internet users, so we should be careful about generalizing conclusions to less active internet users, such as, probably, older people.
Since this is an observational study rather than an experiment, we can only speak of association and cannot draw causal conclusions.
The film market has recently been undergoing serious changes; companies such as Netflix and HBO have brought new ideas and approaches to the market. We are interested in learning which attributes are associated with popular movies, such as the inclusion of adult material, the presence of very famous actors/actresses, and others.
To begin, let's look at the rating distributions:
movies %>%
  ggplot(aes(x = imdb_rating)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of Rating on IMDB')
summary(movies$imdb_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.900   5.900   6.600   6.493   7.300   9.000
The distribution of imdb_rating is left-skewed with its center around 6.6; hence the most appropriate measure of center is the median.
movies %>%
  ggplot(aes(x = audience_score)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of audience score on Rotten Tomatoes')
summary(movies$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   11.00   46.00   65.00   62.36   80.00   97.00
The distribution of audience_score is also left-skewed with its center around 65; hence the most appropriate measure of center is the median.
movies %>%
  ggplot(aes(x = critics_score)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of critics score on Rotten Tomatoes')
summary(movies$critics_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   33.00   61.00   57.69   83.00  100.00
The distribution of critics_score is also left-skewed with its center around 61; hence the most appropriate measure of center is the median.
All of these distributions look roughly similar, so it is probably a good idea to create an average rating score.
For our next step, let’s check collinearity of our ratings.
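The pairwise plot itself did not survive this rendering. A minimal sketch of one way to draw it, assuming the GGally package (the original tool is not shown):

library(GGally)  # assumption: GGally's ggpairs for a scatterplot matrix
movies %>%
  select(imdb_rating, audience_score, critics_score) %>%
  ggpairs()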
As expected, they are collinear. We can also see that the IMDB rating and the Rotten Tomatoes audience score are much closer to each other, while the critics score differs more. We are going to create one average score.
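The code for this step is not shown in this rendering; a minimal sketch, reusing the scaling that appears in the prediction check at the end (the 10-point IMDB rating is multiplied by 10 to match the 100-point Rotten Tomatoes scales):

movies <- movies %>%
  mutate(score_avg = (imdb_rating * 10 + critics_score + audience_score) / 3)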
Let’s see what we got.
movies %>%
  ggplot(aes(x = score_avg)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of average score')
summary(movies$score_avg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12.67   48.33   62.67   61.66   77.67   94.67
The distribution of score_avg is left-skewed with its center around 62.7; hence the most appropriate measure of center is the median.
Before we move on, let's check the condition that each group of a categorical explanatory variable contains at least 5 observations.
title_type - all groups are good (TV Movie sits right at the minimum of 5); the counts come from a per-group tally, sketched below.
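The counts below were presumably produced with the same pattern used later for studio_mod2; a minimal sketch:

movies %>%
  group_by(title_type) %>%
  summarise(n = n())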
## # A tibble: 3 x 2
## title_type n
## <fct> <int>
## 1 Documentary 55
## 2 Feature Film 591
## 3 TV Movie 5
genre - all groups are good.
## # A tibble: 11 x 2
## genre n
## <fct> <int>
## 1 Action & Adventure 65
## 2 Animation 9
## 3 Art House & International 14
## 4 Comedy 87
## 5 Documentary 52
## 6 Drama 305
## 7 Horror 23
## 8 Musical & Performing Arts 12
## 9 Mystery & Suspense 59
## 10 Other 16
## 11 Science Fiction & Fantasy 9
mpaa_rating - all groups are good except “NC-17”. So we should keep in mind that if mpaa_rating makes it into our final model and turns out to be significant, we must be very careful about making predictions for that level.
## # A tibble: 6 x 2
## mpaa_rating n
## <fct> <int>
## 1 G 19
## 2 NC-17 2
## 3 PG 118
## 4 PG-13 133
## 5 R 329
## 6 Unrated 50
studio - Not good.
## # A tibble: 212 x 2
## studio n
## <fct> <int>
## 1 20th Century Fox 18
## 2 20th Century Fox Film Corporat 2
## 3 20th Century Fox Film Corporation 5
## 4 7-57 Releasing 1
## 5 7th art 1
## 6 905 Corporation 1
## 7 A24 1
## 8 A24 Films 2
## 9 All Girl Productions 1
## 10 Alliance Atlantis Communications 1
## # ... with 202 more rows
For studio we have a lot of unsuitable groups: many studios appear under several spellings, and many have too few movies. We should recode that variable.
movies <- movies %>%
mutate(studio_mod = case_when(studio == "20th Century Fox Film Corporat" ~ '20th Century Fox',
studio == "20th Century Fox Film Corporation" ~ '20th Century Fox',
studio == "A24" ~ 'A24 Films',
studio == "Buena Vista Distribution Compa" ~ 'Buena Vista',
studio == "Buena Vista Internationa" ~ 'Buena Vista',
studio == "Buena Vista Pictures" ~ 'Buena Vista',
studio == "Columbia Tristar Pictures" ~ 'Columbia Pictures',
studio == "Dreamworks" ~ 'DreamWorks Studios',
studio == "Fox" ~ '20th Century Fox',
studio == "Fox Atomic" ~ '20th Century Fox',
studio == "Fox Searchlight" ~ '20th Century Fox',
studio == "Fox Searchlight Pictures" ~ '20th Century Fox',
studio == "First Run Entertainment" ~ 'First Run Features',
studio == "HBO Documentary" ~ 'HBO',
studio == "HBO Video" ~ 'HBO',
studio == "IFC" ~ 'IFC Films',
studio == "IFC First Take" ~ 'IFC Films',
studio == "IFC Midnight" ~ 'IFC Films',
studio == "Lions Gate Films Inc." ~ 'Lions Gate Films',
studio == "Lions Gate Releasing" ~ 'Lions Gate Films',
studio == "Lionsgate" ~ 'Lions Gate Films',
studio == "LionsGate Entertainment" ~ 'Lions Gate Films',
studio == "Lions Gate Films Inc." ~ 'Lions Gate Films',
studio == "Lionsgate Films" ~ 'Lions Gate Films',
studio == "Lionsgate Releasing" ~ 'Lions Gate Films',
studio == "Magnet Releasing" ~ 'Magnolia Pictures',
studio == "Magnet/Magnolia Pictures" ~ 'Magnolia Pictures',
studio == "MGM Home Entertainment" ~ 'MGM',
studio == "MGM/UA" ~ 'MGM',
studio == "MGM/United Artists" ~ 'MGM',
studio == "Miramax" ~ 'Miramax Films',
studio == "National Geographic Entertainment" ~ 'National Geographic',
studio == "Orion Home Video" ~ 'Orion',
studio == "Orion Pictures" ~ 'Orion',
studio == "Orion Pictures Corporation" ~ 'Orion',
studio == "Paramount Classics" ~ 'Paramount Pictures',
studio == "Paramount" ~ 'Paramount Pictures',
studio == "Paramount Home Video" ~ 'Paramount Pictures',
studio == "Paramount Studios" ~ 'Paramount Pictures',
studio == "Paramount Vantage" ~ 'Paramount Pictures',
studio == "Sony Entertainment" ~ 'Sony Pictures',
studio == "Sony Pictures Classics" ~ 'Sony Pictures',
studio == "Sony Pictures Entertainment" ~ 'Sony Pictures',
studio == "Sony Pictures Home Entertainment" ~ 'Sony Pictures',
studio == "Sony Pictures/Columbia" ~ 'Sony Pictures',
studio == "Sony Pictures/Screen Gems" ~ 'Sony Pictures',
studio == "The Weinstein Co." ~ 'The Weinstein Company',
studio == "Touchstone Home Entertainment" ~ 'Touchstone Pictures',
studio == "TriStar" ~ 'Columbia Pictures',
studio == "TriStar Pictures" ~ 'Columbia Pictures',
studio == "Universal" ~ 'Universal Pictures',
studio == "Universal Studios" ~ 'Universal Pictures',
studio == "Walt Disney Home Entertainment" ~ 'Walt Disney Pictures',
studio == "Walt Disney Productions" ~ 'Walt Disney Pictures',
studio == "Disney" ~ 'Walt Disney Pictures',
studio == "Warner Bros Pictures" ~ 'Warner Bros. Pictures',
studio == "Warner Bros." ~ 'Warner Bros. Pictures',
studio == "WARNER BROTHERS PICTURES" ~ 'Warner Bros. Pictures',
studio == "Warner Home Video" ~ 'Warner Bros. Pictures',
studio == "Warner Independent" ~ 'Warner Bros. Pictures',
studio == "Warner Independent Pictures" ~ 'Warner Bros. Pictures',
studio == "Warners Bros. Pictures" ~ 'Warner Bros. Pictures',
studio == "Twentieth Century Fox Home Entertainment" ~ '20th Century Fox',
is.na(studio) ~ 'Others',
TRUE ~ as.character(studio)))
movies <- movies %>%
  group_by(studio_mod) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(studio_mod2 = if_else(n < 5, 'Others', studio_mod))
movies %>%
  group_by(studio_mod2) %>%
  summarise(n = n())
## # A tibble: 23 x 2
## studio_mod2 n
## <chr> <int>
## 1 20th Century Fox 51
## 2 Buena Vista 18
## 3 Columbia Pictures 11
## 4 First Run Features 7
## 5 HBO 9
## 6 IFC Films 18
## 7 Lions Gate Films 20
## 8 Magnolia Pictures 11
## 9 MCA Universal Home Video 13
## 10 MGM 27
## # ... with 13 more rows
Finally, we got studio_mod2, where all groups are good.
best_pic_nom - all groups are good.
## # A tibble: 2 x 2
## best_pic_nom n
## <fct> <int>
## 1 no 629
## 2 yes 22
best_pic_win - all groups are good.
## # A tibble: 2 x 2
## best_pic_win n
## <fct> <int>
## 1 no 644
## 2 yes 7
best_actor_win - all groups are good.
## # A tibble: 2 x 2
## best_actor_win n
## <fct> <int>
## 1 no 558
## 2 yes 93
best_actress_win - all groups are good.
## # A tibble: 2 x 2
## best_actress_win n
## <fct> <int>
## 1 no 579
## 2 yes 72
best_dir_win - all groups are good.
## # A tibble: 2 x 2
## best_dir_win n
## <fct> <int>
## 1 no 608
## 2 yes 43
Now let’s check numerical explanatory variables.
runtime - not good.
movies %>%
  filter(!is.na(runtime)) %>%
  ggplot(aes(x = runtime, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('runtime')
The plot doesn't show a clear relationship between runtime and score_avg; the apparent trend is probably driven by the influential point above 250 minutes. We shouldn't use runtime in our prediction model.
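One quick way to verify that suspicion (a sketch, not part of the original analysis) is to redraw the trend with the suspected influential point removed:

movies %>%
  filter(!is.na(runtime), runtime < 250) %>%  # drop the film above 250 minutes
  ggplot(aes(x = runtime, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('runtime, excluding the film above 250 minutes')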
thtr_rel_year - not good.
movies %>%
  ggplot(aes(x = thtr_rel_year, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_year')
movies %>%
  filter(thtr_rel_year >= 1990) %>%
  ggplot(aes(x = thtr_rel_year, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_year after 1990')
m <- lm(score_avg ~ thtr_rel_year, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('thtr_rel_year')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('thtr_rel_year')
summary(m)
##
## Call:
## lm(formula = score_avg ~ thtr_rel_year, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.762 -13.092 0.905 15.657 31.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 287.01806 130.02595 2.207 0.0276 *
## thtr_rel_year -0.11279 0.06508 -1.733 0.0835 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.21 on 649 degrees of freedom
## Multiple R-squared: 0.004607, Adjusted R-squared: 0.003074
## F-statistic: 3.004 on 1 and 649 DF, p-value: 0.08354
The plots show an unclear relationship between thtr_rel_year and score_avg. Also, we can see different trend directions when using all the data (negative) versus only data after 1990 (positive). At the same time, the residual distribution is left-skewed. Considering all this, we should not use thtr_rel_year in our prediction model.
dvd_rel_year - not good.
movies %>%
  filter(!is.na(dvd_rel_year)) %>%
  ggplot(aes(x = dvd_rel_year, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('dvd_rel_year')
m <- lm(score_avg ~ dvd_rel_year, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('dvd_rel_year')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('dvd_rel_year')
summary(m)
##
## Call:
## lm(formula = score_avg ~ dvd_rel_year, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.36 -13.17 1.07 15.81 32.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.75178 309.17429 0.665 0.506
## dvd_rel_year -0.07183 0.15425 -0.466 0.642
##
## Residual standard error: 18.15 on 641 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.0003382, Adjusted R-squared: -0.001221
## F-statistic: 0.2168 on 1 and 641 DF, p-value: 0.6416
Here we get roughly the same picture as with thtr_rel_year.
thtr_rel_month - not good.
movies %>%
  ggplot(aes(x = thtr_rel_month, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_month')
m <- lm(score_avg ~ thtr_rel_month, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('thtr_rel_month')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('thtr_rel_month')
summary(m)
##
## Call:
## lm(formula = score_avg ~ thtr_rel_month, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.940 -13.272 0.836 16.173 31.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.1652 1.5331 39.245 <2e-16 ***
## thtr_rel_month 0.2218 0.2012 1.103 0.271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.23 on 649 degrees of freedom
## Multiple R-squared: 0.001869, Adjusted R-squared: 0.0003314
## F-statistic: 1.216 on 1 and 649 DF, p-value: 0.2707
The plots show an unclear relationship between thtr_rel_month and score_avg, and the residual distribution is left-skewed. Considering all this, we should not use thtr_rel_month in our prediction model.
dvd_rel_month - not good.
movies %>%
  filter(!is.na(dvd_rel_month)) %>%
  ggplot(aes(x = dvd_rel_month, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('dvd_rel_month')
m <- lm(score_avg ~ dvd_rel_month, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('dvd_rel_month')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('dvd_rel_month')
summary(m)
##
## Call:
## lm(formula = score_avg ~ dvd_rel_month, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.678 -12.915 1.244 15.776 33.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.0749 1.5199 39.527 <2e-16 ***
## dvd_rel_month 0.2695 0.2118 1.273 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.13 on 641 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.00252, Adjusted R-squared: 0.000964
## F-statistic: 1.619 on 1 and 641 DF, p-value: 0.2036
Roughly the same picture as with thtr_rel_month.
dvd_rel_day - not good.
movies %>%
  filter(!is.na(dvd_rel_day)) %>%
  ggplot(aes(x = dvd_rel_day, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('dvd_rel_day')
m <- lm(score_avg ~ dvd_rel_day, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('dvd_rel_day')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('dvd_rel_day')
summary(m)
##
## Call:
## lm(formula = score_avg ~ dvd_rel_day, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.117 -13.135 1.195 15.893 32.872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.760221 1.407975 43.865 <2e-16 ***
## dvd_rel_day 0.001435 0.080787 0.018 0.986
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.15 on 641 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 4.92e-07, Adjusted R-squared: -0.00156
## F-statistic: 0.0003154 on 1 and 641 DF, p-value: 0.9858
The plots show an unclear relationship between dvd_rel_day and score_avg, and the residual distribution is left-skewed. Considering all this, we should not use dvd_rel_day in our prediction model.
thtr_rel_day - not good.
movies %>%
  filter(!is.na(thtr_rel_day)) %>%
  ggplot(aes(x = thtr_rel_day, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_day')
m <- lm(score_avg ~ thtr_rel_day, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('thtr_rel_day')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('thtr_rel_day')
summary(m)
##
## Call:
## lm(formula = score_avg ~ thtr_rel_day, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.217 -13.278 0.793 16.008 32.793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.11010 1.36653 44.719 <2e-16 ***
## thtr_rel_day 0.03818 0.08077 0.473 0.637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.25 on 649 degrees of freedom
## Multiple R-squared: 0.0003442, Adjusted R-squared: -0.001196
## F-statistic: 0.2234 on 1 and 649 DF, p-value: 0.6366
Roughly the same picture as with dvd_rel_day.
For our next step, let's create a function that iterates over the explanatory variables while building the prediction model. We are going to use the adjusted R-squared method for checking the usefulness of explanatory variables; it is more computationally intensive, but more reliable than the p-value method. Forward selection was chosen to keep the task simple.
lm_auto_forvard <- function(lm_data, pred_var, expl_var) {
  formula_lm <- paste0(pred_var, ' ~ ')
  adjR2 <- 0
  for (iter in seq_along(expl_var)) {
    # find the next best variable
    df_best_next_var <- data.frame(var = character(), adjR2 = double(), R2 = double(),
                                   stringsAsFactors = FALSE)
    for (v in expl_var) {
      fla <- as.formula(paste(formula_lm, v, sep = '+'))
      m <- lm(fla, data = lm_data)
      sum_m <- summary(m)
      # a list (not c()) keeps adjR2 and R2 numeric instead of coercing them to character
      df_best_next_var[nrow(df_best_next_var) + 1, ] <-
        list(v, sum_m[["adj.r.squared"]], sum_m[["r.squared"]])
    }
    # check adjR2: stop when no candidate improves on the current value
    best_var <- df_best_next_var %>%
      slice(which.max(adjR2))
    if (best_var$adjR2 < adjR2) {
      break
    }
    # update formula
    adjR2 <- best_var$adjR2
    formula_lm <- paste(formula_lm, best_var$var, sep = '+')
    # update available variables: drop the chosen one from the candidate pool
    expl_var <- setdiff(expl_var, best_var$var)
  }
  return(lm(as.formula(formula_lm), data = lm_data))
}

Let's create our prediction model. We are going to feed in the suitable explanatory variables from our data set and let the function find which of them to use, and in which order.
expl_var = c("title_type", "genre", "mpaa_rating", "studio_mod2",
             "best_pic_nom", "best_pic_win",
             "best_actor_win", "best_actress_win", "best_dir_win")
pred_var = "score_avg"
m_fun <- lm_auto_forvard(lm_data = movies, expl_var = expl_var, pred_var = pred_var)

Model diagnostics
All the explanatory variables our function selected are categorical, and there is no easy way to check collinearity for that type of data. We are going to live with this uncertainty.
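If we did want a rough check, one option (an assumption on our part, not something the original analysis did) is a chi-squared test of association between pairs of categorical predictors; a strong association hints at redundancy:

# e.g. genre vs. title_type; repeat for other pairs of predictors
chisq.test(table(movies$genre, movies$title_type))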
m_fun %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle("Histogram of residuals")
m_fun %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal probability plot")

The normal probability plot and the histogram of residuals show nearly normal residuals with mean 0.
m_fun %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('Residuals vs. y^')

The plot of residuals vs. y^ does not show constant variability of the residuals; this is something we should worry about. For low predicted scores the residuals are spread wider than for high predicted scores.
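A complementary check (a sketch, not in the original analysis) plots the absolute residuals against the fitted values; a clear trend confirms the non-constant variability:

m_fun %>%
  ggplot(aes(x = .fitted, y = abs(.resid))) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Predicted score") +
  ylab("|Residuals|")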
For the next step, let's get the summary output for our model.
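The generating call is missing from this rendering; presumably it was simply:

summary(m_fun)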
##
## Call:
## lm(formula = formula_lm, data = lm_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.546 -10.153 0.421 10.323 35.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.0548 7.5611 9.794 < 2e-16 ***
## genreAnimation 0.8455 6.0024 0.141 0.888031
## genreArt House & International 7.4262 4.6591 1.594 0.111475
## genreComedy -0.4653 2.5477 -0.183 0.855143
## genreDocumentary 10.9838 6.1044 1.799 0.072461 .
## genreDrama 12.5784 2.1756 5.781 1.18e-08 ***
## genreHorror -2.4238 3.8438 -0.631 0.528555
## genreMusical & Performing Arts 19.5520 5.2291 3.739 0.000202 ***
## genreMystery & Suspense 5.7979 2.8352 2.045 0.041289 *
## genreOther 12.5047 4.4098 2.836 0.004725 **
## genreScience Fiction & Fantasy -0.2580 5.4850 -0.047 0.962495
## best_pic_nomyes 19.1211 3.4229 5.586 3.50e-08 ***
## mpaa_ratingNC-17 2.1674 11.7085 0.185 0.853199
## mpaa_ratingPG -6.4749 4.4934 -1.441 0.150105
## mpaa_ratingPG-13 -10.8158 4.6433 -2.329 0.020168 *
## mpaa_ratingR -5.4506 4.5487 -1.198 0.231275
## mpaa_ratingUnrated 1.1602 5.1445 0.226 0.821651
## best_dir_winyes 8.1096 2.4931 3.253 0.001206 **
## title_typeFeature Film -16.4164 5.7116 -2.874 0.004192 **
## title_typeTV Movie -25.0210 9.1459 -2.736 0.006405 **
## studio_mod2Buena Vista 0.9398 4.2954 0.219 0.826877
## studio_mod2Columbia Pictures 6.5448 5.0938 1.285 0.199329
## studio_mod2First Run Features -4.0191 6.7136 -0.599 0.549633
## studio_mod2HBO 2.6817 5.8322 0.460 0.645819
## studio_mod2IFC Films 0.5690 4.3674 0.130 0.896391
## studio_mod2Lions Gate Films -0.3855 4.0649 -0.095 0.924479
## studio_mod2Magnolia Pictures 4.5064 5.1503 0.875 0.381932
## studio_mod2MCA Universal Home Video 3.6035 4.8255 0.747 0.455496
## studio_mod2MGM -6.2066 3.6408 -1.705 0.088751 .
## studio_mod2Miramax Films -2.6942 3.7672 -0.715 0.474775
## studio_mod2New Line Cinema -9.6836 5.2969 -1.828 0.068012 .
## studio_mod2New Line Home Entertainment -1.6168 6.2102 -0.260 0.794688
## studio_mod2Orion -2.3130 5.1080 -0.453 0.650842
## studio_mod2Others -1.2176 2.5276 -0.482 0.630175
## studio_mod2Paramount Pictures -1.3575 2.9483 -0.460 0.645375
## studio_mod2Sony Pictures -1.3521 2.9977 -0.451 0.652111
## studio_mod2The Weinstein Company -1.7299 5.3453 -0.324 0.746329
## studio_mod2Touchstone Pictures 3.0605 7.1870 0.426 0.670371
## studio_mod2United Artists 12.3171 6.7990 1.812 0.070538 .
## studio_mod2Universal Pictures 4.5828 3.7123 1.235 0.217489
## studio_mod2Walt Disney Pictures 10.7846 6.3486 1.699 0.089882 .
## studio_mod2Warner Bros. Pictures 0.1373 2.8061 0.049 0.961004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.17 on 609 degrees of freedom
## Multiple R-squared: 0.3518, Adjusted R-squared: 0.3082
## F-statistic: 8.063 on 41 and 609 DF, p-value: < 2.2e-16
We got a model with multiple R-squared equal to 0.3518, meaning that roughly 35% of the variability in scores can be explained by the model; this is not very much. Also, the residual standard error is quite large, 15.17, which means a 95% prediction interval will span roughly 60 score points (2 × 1.96 × 15.17 ≈ 59). With this, we should mention that we got a very low p-value, which means our model is statistically significant.
Since all our explanatory variables are categorical, let's take a quick look at which of their values give us the highest expected score:
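The view itself is missing from this rendering; a minimal sketch that ranks the fitted coefficients (each coefficient is the shift in expected score relative to the baseline level):

# levels with the largest positive coefficients push the expected score up the most
head(sort(coef(m_fun), decreasing = TRUE), 10)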
For the next step, let’s check ANOVA output.
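The generating call is not shown in this rendering; presumably it was:

anova(m_fun)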
## Analysis of Variance Table
##
## Response: score_avg
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 49983 4998.3 21.7241 < 2.2e-16 ***
## best_pic_nom 1 9130 9129.6 39.6796 5.736e-10 ***
## mpaa_rating 5 6199 1239.8 5.3885 7.370e-05 ***
## best_dir_win 1 2869 2869.2 12.4702 0.0004448 ***
## title_type 2 2328 1164.0 5.0592 0.0066203 **
## studio_mod2 22 5550 252.3 1.0965 0.3449917
## Residuals 609 140120 230.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Except for studio_mod2, all the variables are statistically significant. Returning to our opening questions, our model does not give statistically significant evidence of a relationship between famous actors/actresses, or the studio, and the average movie score.
Practically, we find that the inclusion of adult material (mpaa_rating) is relevant to the expected movie score; in our model that variable explains roughly 3% of the variability (6199/(49983+9130+6199+2869+2328+5550+140120) = 0.0287). The studio explains roughly 2.6% (5550/216179 = 0.0257).
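A sketch of that arithmetic taken straight from the ANOVA table:

anova_tab <- anova(m_fun)
# share of the total sum of squares attributed to each term
round(anova_tab$`Sum Sq` / sum(anova_tab$`Sum Sq`), 3)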
Let's make a prediction for the movie “Ford v Ferrari” (2019) and compare the result with its current average rating. The data about the movie comes from imdb.com and rottentomatoes.com, and the information about the director from wikipedia.org.
ford_v_ferrari <- data.frame(genre = "Drama",
                             best_pic_nom = "yes",
                             mpaa_rating = "PG-13",
                             best_dir_win = "yes",
                             title_type = "Feature Film",
                             studio_mod2 = "20th Century Fox")
predict(m_fun, ford_v_ferrari, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 86.6318 55.55224 117.7114
We got a pretty high expected score; let's check the current ratings:
imdb_rating <- 8.1
critics_score <- 92
audience_score <- 98
score_avg <- (imdb_rating*10 + critics_score + audience_score)/3
print(paste0("Current average score for 'Ford v Ferrari' is ", round(score_avg, 1)))
## [1] "Current average score for 'Ford v Ferrari' is 90.3"
Hence, the model predicts, with 95% confidence, that “Ford v Ferrari” should have an evaluation score between 55.6 and 100 (the raw upper bound, 117.7, exceeds the maximum possible score). We can also note that for this specific case we got a much more accurate result than we would expect on average.
To summarize our findings, we can draw several conclusions. First, we managed to get a statistically significant prediction model for movie scores. Second, the practical significance of our model is pretty small: it explains roughly 35% of the variability. That is to be expected, because we did not consider potentially very significant variables such as a movie's budget or its storyline. Third, we can probably point to the relevance of including adult material in a movie, and, as expected, we found an association between studio and average movie score. Fourth, unexpectedly, we did not find practical significance between movie stars and movie scores. Fifth, we tested our model on the movie “Ford v Ferrari” and got a very impressive result: a predicted score of 86.6 versus the current rating of 90.3. But maybe it is just good luck, or, probably, the result of the narrower variability of residuals for high predicted scores (roughly above 75).