This post introduces the basic frequentist approach to modeling and forecasting, as well as to constructing confidence intervals. Compared to the Bayesian approach it is a simpler tool, but simpler does not always mean worse.
The data set comprises 651 randomly sampled movies produced and released before 2016, sourced from the Rotten Tomatoes and IMDB APIs. Since these sources mostly serve an English-speaking audience, we should be careful about generalizing our conclusions to the whole world population, though we can be more confident about the US, UK, and other English-speaking countries. We should also note that this data set covers, for the most part, active internet users, so we should be careful about generalizing conclusions to less active internet users, such as, probably, older people.
Since this is an observational study rather than an experiment, we can only speak of association and cannot draw causal conclusions.
The film market has recently been undergoing serious changes; companies such as Netflix and HBO have brought new ideas and approaches to the market. We are interested in learning which attributes are associated with popular movies, such as the inclusion of adult material, the presence of very famous actors/actresses, and others.
To begin, let's look at the rating distributions:
movies %>%
  ggplot(aes(x = imdb_rating)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of Rating on IMDB')
summary(movies$imdb_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.900   5.900   6.600   6.493   7.300   9.000
The distribution of imdb_rating is left-skewed with its center around 6.6; hence the most appropriate measure of center is the median.
movies %>%
  ggplot(aes(x = audience_score)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of audience score on Rotten Tomatoes')
summary(movies$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   11.00   46.00   65.00   62.36   80.00   97.00
The distribution of audience_score is also left-skewed with its center around 65; hence the most appropriate measure of center is the median.
movies %>%
  ggplot(aes(x = critics_score)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of critics score on Rotten Tomatoes')
summary(movies$critics_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   33.00   61.00   57.69   83.00  100.00
The distribution of critics_score is also left-skewed with its center around 61; hence the most appropriate measure of center is the median.
All of these distributions look roughly similar, so it is probably a good idea to create an average rating score.
For our next step, let’s check collinearity of our ratings.
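The pairwise plot itself did not survive this rendering. A minimal sketch of one way to draw it, assuming the GGally package (the original tool is not shown):

library(GGally)  # assumption: GGally's ggpairs for a scatterplot matrix
movies %>%
  select(imdb_rating, audience_score, critics_score) %>%
  ggpairs()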
As expected, they are collinear. We can also see that the IMDB rating and the Rotten Tomatoes audience score are much closer to each other, while the critics score differs more. We are going to create one average score.
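The code for this step is not shown in this rendering; a minimal sketch, reusing the scaling that appears in the prediction check at the end (the 10-point IMDB rating is multiplied by 10 to match the 100-point Rotten Tomatoes scales):

movies <- movies %>%
  mutate(score_avg = (imdb_rating * 10 + critics_score + audience_score) / 3)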
Let’s see what we got.
movies %>%
  ggplot(aes(x = score_avg)) +
  geom_histogram(bins = 10) +
  ggtitle('Distribution of average score')
summary(movies$score_avg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12.67   48.33   62.67   61.66   77.67   94.67
The distribution of score_avg is left-skewed with its center around 62.7; hence the most appropriate measure of center is the median.
Before we move on, let's check the condition that each group of a categorical explanatory variable contains at least 5 observations.
title_type - all groups are good (TV Movie sits right at the minimum of 5); the counts come from a per-group tally, sketched below.
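The counts below were presumably produced with the same pattern used later for studio_mod2; a minimal sketch:

movies %>%
  group_by(title_type) %>%
  summarise(n = n())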
## # A tibble: 3 x 2
## title_type n
## <fct> <int>
## 1 Documentary 55
## 2 Feature Film 591
## 3 TV Movie 5
genre - all groups are good.
## # A tibble: 11 x 2
## genre n
## <fct> <int>
## 1 Action & Adventure 65
## 2 Animation 9
## 3 Art House & International 14
## 4 Comedy 87
## 5 Documentary 52
## 6 Drama 305
## 7 Horror 23
## 8 Musical & Performing Arts 12
## 9 Mystery & Suspense 59
## 10 Other 16
## 11 Science Fiction & Fantasy 9
mpaa_rating - all groups are good except “NC-17”. So we should keep in mind that if mpaa_rating makes it into our final model and turns out to be significant, we must be very careful about making predictions for that level.
## # A tibble: 6 x 2
## mpaa_rating n
## <fct> <int>
## 1 G 19
## 2 NC-17 2
## 3 PG 118
## 4 PG-13 133
## 5 R 329
## 6 Unrated 50
studio - Not good.
## # A tibble: 212 x 2
## studio n
## <fct> <int>
## 1 20th Century Fox 18
## 2 20th Century Fox Film Corporat 2
## 3 20th Century Fox Film Corporation 5
## 4 7-57 Releasing 1
## 5 7th art 1
## 6 905 Corporation 1
## 7 A24 1
## 8 A24 Films 2
## 9 All Girl Productions 1
## 10 Alliance Atlantis Communications 1
## # ... with 202 more rows
For studio we have a lot of unsuitable groups: many studios appear under several spellings, and many have too few movies. We should recode that variable.
movies <- movies %>%
mutate(studio_mod = case_when(studio == "20th Century Fox Film Corporat" ~ '20th Century Fox',
studio == "20th Century Fox Film Corporation" ~ '20th Century Fox',
studio == "A24" ~ 'A24 Films',
studio == "Buena Vista Distribution Compa" ~ 'Buena Vista',
studio == "Buena Vista Internationa" ~ 'Buena Vista',
studio == "Buena Vista Pictures" ~ 'Buena Vista',
studio == "Columbia Tristar Pictures" ~ 'Columbia Pictures',
studio == "Dreamworks" ~ 'DreamWorks Studios',
studio == "Fox" ~ '20th Century Fox',
studio == "Fox Atomic" ~ '20th Century Fox',
studio == "Fox Searchlight" ~ '20th Century Fox',
studio == "Fox Searchlight Pictures" ~ '20th Century Fox',
studio == "First Run Entertainment" ~ 'First Run Features',
studio == "HBO Documentary" ~ 'HBO',
studio == "HBO Video" ~ 'HBO',
studio == "IFC" ~ 'IFC Films',
studio == "IFC First Take" ~ 'IFC Films',
studio == "IFC Midnight" ~ 'IFC Films',
studio == "Lions Gate Films Inc." ~ 'Lions Gate Films',
studio == "Lions Gate Releasing" ~ 'Lions Gate Films',
studio == "Lionsgate" ~ 'Lions Gate Films',
studio == "LionsGate Entertainment" ~ 'Lions Gate Films',
studio == "Lions Gate Films Inc." ~ 'Lions Gate Films',
studio == "Lionsgate Films" ~ 'Lions Gate Films',
studio == "Lionsgate Releasing" ~ 'Lions Gate Films',
studio == "Magnet Releasing" ~ 'Magnolia Pictures',
studio == "Magnet/Magnolia Pictures" ~ 'Magnolia Pictures',
studio == "MGM Home Entertainment" ~ 'MGM',
studio == "MGM/UA" ~ 'MGM',
studio == "MGM/United Artists" ~ 'MGM',
studio == "Miramax" ~ 'Miramax Films',
studio == "National Geographic Entertainment" ~ 'National Geographic',
studio == "Orion Home Video" ~ 'Orion',
studio == "Orion Pictures" ~ 'Orion',
studio == "Orion Pictures Corporation" ~ 'Orion',
studio == "Paramount Classics" ~ 'Paramount Pictures',
studio == "Paramount" ~ 'Paramount Pictures',
studio == "Paramount Home Video" ~ 'Paramount Pictures',
studio == "Paramount Studios" ~ 'Paramount Pictures',
studio == "Paramount Vantage" ~ 'Paramount Pictures',
studio == "Sony Entertainment" ~ 'Sony Pictures',
studio == "Sony Pictures Classics" ~ 'Sony Pictures',
studio == "Sony Pictures Entertainment" ~ 'Sony Pictures',
studio == "Sony Pictures Home Entertainment" ~ 'Sony Pictures',
studio == "Sony Pictures/Columbia" ~ 'Sony Pictures',
studio == "Sony Pictures/Screen Gems" ~ 'Sony Pictures',
studio == "The Weinstein Co." ~ 'The Weinstein Company',
studio == "Touchstone Home Entertainment" ~ 'Touchstone Pictures',
studio == "TriStar" ~ 'Columbia Pictures',
studio == "TriStar Pictures" ~ 'Columbia Pictures',
studio == "Universal" ~ 'Universal Pictures',
studio == "Universal Studios" ~ 'Universal Pictures',
studio == "Walt Disney Home Entertainment" ~ 'Walt Disney Pictures',
studio == "Walt Disney Productions" ~ 'Walt Disney Pictures',
studio == "Disney" ~ 'Walt Disney Pictures',
studio == "Warner Bros Pictures" ~ 'Warner Bros. Pictures',
studio == "Warner Bros." ~ 'Warner Bros. Pictures',
studio == "WARNER BROTHERS PICTURES" ~ 'Warner Bros. Pictures',
studio == "Warner Home Video" ~ 'Warner Bros. Pictures',
studio == "Warner Independent" ~ 'Warner Bros. Pictures',
studio == "Warner Independent Pictures" ~ 'Warner Bros. Pictures',
studio == "Warners Bros. Pictures" ~ 'Warner Bros. Pictures',
studio == "Twentieth Century Fox Home Entertainment" ~ '20th Century Fox',
is.na(studio) ~ 'Others',
TRUE ~ as.character(studio)))
movies <- movies %>%
  group_by(studio_mod) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(studio_mod2 = if_else(n < 5, 'Others', studio_mod))
movies %>%
  group_by(studio_mod2) %>%
  summarise(n = n())
## # A tibble: 23 x 2
## studio_mod2 n
## <chr> <int>
## 1 20th Century Fox 51
## 2 Buena Vista 18
## 3 Columbia Pictures 11
## 4 First Run Features 7
## 5 HBO 9
## 6 IFC Films 18
## 7 Lions Gate Films 20
## 8 Magnolia Pictures 11
## 9 MCA Universal Home Video 13
## 10 MGM 27
## # ... with 13 more rows
Finally, we got studio_mod2, where all groups are good.
best_pic_nom - all groups are good.
## # A tibble: 2 x 2
## best_pic_nom n
## <fct> <int>
## 1 no 629
## 2 yes 22
best_pic_win - all groups are good.
## # A tibble: 2 x 2
## best_pic_win n
## <fct> <int>
## 1 no 644
## 2 yes 7
best_actor_win - all groups are good.
## # A tibble: 2 x 2
## best_actor_win n
## <fct> <int>
## 1 no 558
## 2 yes 93
best_actress_win - all groups are good.
## # A tibble: 2 x 2
## best_actress_win n
## <fct> <int>
## 1 no 579
## 2 yes 72
best_dir_win - all groups are good.
## # A tibble: 2 x 2
## best_dir_win n
## <fct> <int>
## 1 no 608
## 2 yes 43
Now let’s check numerical explanatory variables.
runtime - not good.
movies %>%
  filter(!is.na(runtime)) %>%
  ggplot(aes(x = runtime, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('runtime')
The plot doesn't show a clear relationship between runtime and score_avg; the apparent trend is probably driven by the influential point above 250 minutes. We shouldn't use runtime in our prediction model.
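One quick way to verify that suspicion (a sketch, not part of the original analysis) is to redraw the trend with the suspected influential point removed:

movies %>%
  filter(!is.na(runtime), runtime < 250) %>%  # drop the film above 250 minutes
  ggplot(aes(x = runtime, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('runtime, excluding the film above 250 minutes')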
thtr_rel_year - not good.
movies %>%
  ggplot(aes(x = thtr_rel_year, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_year')
movies %>%
  filter(thtr_rel_year >= 1990) %>%
  ggplot(aes(x = thtr_rel_year, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_year after 1990')
m <- lm(score_avg ~ thtr_rel_year, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('thtr_rel_year')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('thtr_rel_year')
summary(m)
##
## Call:
## lm(formula = score_avg ~ thtr_rel_year, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.762 -13.092 0.905 15.657 31.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 287.01806 130.02595 2.207 0.0276 *
## thtr_rel_year -0.11279 0.06508 -1.733 0.0835 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.21 on 649 degrees of freedom
## Multiple R-squared: 0.004607, Adjusted R-squared: 0.003074
## F-statistic: 3.004 on 1 and 649 DF, p-value: 0.08354
The plots show an unclear relationship between thtr_rel_year and score_avg. Also, we can see different trend directions when using all the data (negative) versus only data after 1990 (positive). At the same time, the residual distribution is left-skewed. Considering all this, we should not use thtr_rel_year in our prediction model.
dvd_rel_year - not good.
movies %>%
  filter(!is.na(dvd_rel_year)) %>%
  ggplot(aes(x = dvd_rel_year, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('dvd_rel_year')
m <- lm(score_avg ~ dvd_rel_year, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('dvd_rel_year')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('dvd_rel_year')
summary(m)
##
## Call:
## lm(formula = score_avg ~ dvd_rel_year, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.36 -13.17 1.07 15.81 32.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.75178 309.17429 0.665 0.506
## dvd_rel_year -0.07183 0.15425 -0.466 0.642
##
## Residual standard error: 18.15 on 641 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.0003382, Adjusted R-squared: -0.001221
## F-statistic: 0.2168 on 1 and 641 DF, p-value: 0.6416
Here we get roughly the same picture as with thtr_rel_year.
thtr_rel_month - not good.
movies %>%
  ggplot(aes(x = thtr_rel_month, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_month')
m <- lm(score_avg ~ thtr_rel_month, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('thtr_rel_month')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('thtr_rel_month')
summary(m)
##
## Call:
## lm(formula = score_avg ~ thtr_rel_month, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.940 -13.272 0.836 16.173 31.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.1652 1.5331 39.245 <2e-16 ***
## thtr_rel_month 0.2218 0.2012 1.103 0.271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.23 on 649 degrees of freedom
## Multiple R-squared: 0.001869, Adjusted R-squared: 0.0003314
## F-statistic: 1.216 on 1 and 649 DF, p-value: 0.2707
The plots show an unclear relationship between thtr_rel_month and score_avg, and the residual distribution is left-skewed. Considering all this, we should not use thtr_rel_month in our prediction model.
dvd_rel_month - not good.
movies %>%
  filter(!is.na(dvd_rel_month)) %>%
  ggplot(aes(x = dvd_rel_month, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('dvd_rel_month')
m <- lm(score_avg ~ dvd_rel_month, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('dvd_rel_month')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('dvd_rel_month')
summary(m)
##
## Call:
## lm(formula = score_avg ~ dvd_rel_month, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.678 -12.915 1.244 15.776 33.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.0749 1.5199 39.527 <2e-16 ***
## dvd_rel_month 0.2695 0.2118 1.273 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.13 on 641 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.00252, Adjusted R-squared: 0.000964
## F-statistic: 1.619 on 1 and 641 DF, p-value: 0.2036
Roughly the same picture as with thtr_rel_month.
dvd_rel_day - not good.
movies %>%
  filter(!is.na(dvd_rel_day)) %>%
  ggplot(aes(x = dvd_rel_day, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('dvd_rel_day')
m <- lm(score_avg ~ dvd_rel_day, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('dvd_rel_day')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('dvd_rel_day')
summary(m)
##
## Call:
## lm(formula = score_avg ~ dvd_rel_day, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.117 -13.135 1.195 15.893 32.872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.760221 1.407975 43.865 <2e-16 ***
## dvd_rel_day 0.001435 0.080787 0.018 0.986
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.15 on 641 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 4.92e-07, Adjusted R-squared: -0.00156
## F-statistic: 0.0003154 on 1 and 641 DF, p-value: 0.9858
The plots show an unclear relationship between dvd_rel_day and score_avg, and the residual distribution is left-skewed. Considering all this, we should not use dvd_rel_day in our prediction model.
thtr_rel_day - not good.
movies %>%
  filter(!is.na(thtr_rel_day)) %>%
  ggplot(aes(x = thtr_rel_day, y = score_avg)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ggtitle('thtr_rel_day')
m <- lm(score_avg ~ thtr_rel_day, data = movies)
m %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('thtr_rel_day')
m %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle('thtr_rel_day')
summary(m)
##
## Call:
## lm(formula = score_avg ~ thtr_rel_day, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.217 -13.278 0.793 16.008 32.793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.11010 1.36653 44.719 <2e-16 ***
## thtr_rel_day 0.03818 0.08077 0.473 0.637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.25 on 649 degrees of freedom
## Multiple R-squared: 0.0003442, Adjusted R-squared: -0.001196
## F-statistic: 0.2234 on 1 and 649 DF, p-value: 0.6366
Roughly the same picture as with dvd_rel_day.
For our next step, let's create a function that iterates over the explanatory variables while building the prediction model. We are going to use the adjusted R-squared method for checking the usefulness of explanatory variables; it is more computationally intensive, but more reliable than the p-value method. Forward selection was chosen to keep the task simple.
lm_auto_forvard <- function(lm_data, pred_var, expl_var) {
  formula_lm <- paste0(pred_var, ' ~ ')
  adjR2 <- 0
  for (iter in seq_along(expl_var)) {
    # find the next best variable
    df_best_next_var <- data.frame(var = character(), adjR2 = double(), R2 = double(),
                                   stringsAsFactors = FALSE)
    for (v in expl_var) {
      fla <- as.formula(paste(formula_lm, v, sep = '+'))
      m <- lm(fla, data = lm_data)
      sum_m <- summary(m)
      # a list (not c()) keeps adjR2 and R2 numeric instead of coercing them to character
      df_best_next_var[nrow(df_best_next_var) + 1, ] <-
        list(v, sum_m[["adj.r.squared"]], sum_m[["r.squared"]])
    }
    # check adjR2: stop when no candidate improves on the current value
    best_var <- df_best_next_var %>%
      slice(which.max(adjR2))
    if (best_var$adjR2 < adjR2) {
      break
    }
    # update formula
    adjR2 <- best_var$adjR2
    formula_lm <- paste(formula_lm, best_var$var, sep = '+')
    # update available variables: drop the chosen one from the candidate pool
    expl_var <- setdiff(expl_var, best_var$var)
  }
  return(lm(as.formula(formula_lm), data = lm_data))
}

Let's create our prediction model. We are going to feed in the suitable explanatory variables from our data set and let the function find which of them to use, and in which order.
expl_var = c("title_type", "genre", "mpaa_rating", "studio_mod2",
             "best_pic_nom", "best_pic_win",
             "best_actor_win", "best_actress_win", "best_dir_win")
pred_var = "score_avg"
m_fun <- lm_auto_forvard(lm_data = movies, expl_var = expl_var, pred_var = pred_var)

Model diagnostics
All the explanatory variables our function selected are categorical, and there is no easy way to check collinearity for that type of data. We are going to live with this uncertainty.
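If we did want a rough check, one option (an assumption on our part, not something the original analysis did) is a chi-squared test of association between pairs of categorical predictors; a strong association hints at redundancy:

# e.g. genre vs. title_type; repeat for other pairs of predictors
chisq.test(table(movies$genre, movies$title_type))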
m_fun %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab("Residuals") +
  ggtitle("Histogram of residuals")
m_fun %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal probability plot")

The normal probability plot and the histogram of residuals show nearly normal residuals with mean 0.
m_fun %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("score") +
  ylab("Residuals") +
  ggtitle('Residuals vs. y^')

The plot of residuals vs. y^ does not show constant variability of the residuals; this is something we should worry about. For low predicted scores the residuals are spread wider than for high predicted scores.
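A complementary check (a sketch, not in the original analysis) plots the absolute residuals against the fitted values; a clear trend confirms the non-constant variability:

m_fun %>%
  ggplot(aes(x = .fitted, y = abs(.resid))) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Predicted score") +
  ylab("|Residuals|")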
For the next step, let's get the summary output for our model.
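The generating call is missing from this rendering; presumably it was simply:

summary(m_fun)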
##
## Call:
## lm(formula = formula_lm, data = lm_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.546 -10.153 0.421 10.323 35.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.0548 7.5611 9.794 < 2e-16 ***
## genreAnimation 0.8455 6.0024 0.141 0.888031
## genreArt House & International 7.4262 4.6591 1.594 0.111475
## genreComedy -0.4653 2.5477 -0.183 0.855143
## genreDocumentary 10.9838 6.1044 1.799 0.072461 .
## genreDrama 12.5784 2.1756 5.781 1.18e-08 ***
## genreHorror -2.4238 3.8438 -0.631 0.528555
## genreMusical & Performing Arts 19.5520 5.2291 3.739 0.000202 ***
## genreMystery & Suspense 5.7979 2.8352 2.045 0.041289 *
## genreOther 12.5047 4.4098 2.836 0.004725 **
## genreScience Fiction & Fantasy -0.2580 5.4850 -0.047 0.962495
## best_pic_nomyes 19.1211 3.4229 5.586 3.50e-08 ***
## mpaa_ratingNC-17 2.1674 11.7085 0.185 0.853199
## mpaa_ratingPG -6.4749 4.4934 -1.441 0.150105
## mpaa_ratingPG-13 -10.8158 4.6433 -2.329 0.020168 *
## mpaa_ratingR -5.4506 4.5487 -1.198 0.231275
## mpaa_ratingUnrated 1.1602 5.1445 0.226 0.821651
## best_dir_winyes 8.1096 2.4931 3.253 0.001206 **
## title_typeFeature Film -16.4164 5.7116 -2.874 0.004192 **
## title_typeTV Movie -25.0210 9.1459 -2.736 0.006405 **
## studio_mod2Buena Vista 0.9398 4.2954 0.219 0.826877
## studio_mod2Columbia Pictures 6.5448 5.0938 1.285 0.199329
## studio_mod2First Run Features -4.0191 6.7136 -0.599 0.549633
## studio_mod2HBO 2.6817 5.8322 0.460 0.645819
## studio_mod2IFC Films 0.5690 4.3674 0.130 0.896391
## studio_mod2Lions Gate Films -0.3855 4.0649 -0.095 0.924479
## studio_mod2Magnolia Pictures 4.5064 5.1503 0.875 0.381932
## studio_mod2MCA Universal Home Video 3.6035 4.8255 0.747 0.455496
## studio_mod2MGM -6.2066 3.6408 -1.705 0.088751 .
## studio_mod2Miramax Films -2.6942 3.7672 -0.715 0.474775
## studio_mod2New Line Cinema -9.6836 5.2969 -1.828 0.068012 .
## studio_mod2New Line Home Entertainment -1.6168 6.2102 -0.260 0.794688
## studio_mod2Orion -2.3130 5.1080 -0.453 0.650842
## studio_mod2Others -1.2176 2.5276 -0.482 0.630175
## studio_mod2Paramount Pictures -1.3575 2.9483 -0.460 0.645375
## studio_mod2Sony Pictures -1.3521 2.9977 -0.451 0.652111
## studio_mod2The Weinstein Company -1.7299 5.3453 -0.324 0.746329
## studio_mod2Touchstone Pictures 3.0605 7.1870 0.426 0.670371
## studio_mod2United Artists 12.3171 6.7990 1.812 0.070538 .
## studio_mod2Universal Pictures 4.5828 3.7123 1.235 0.217489
## studio_mod2Walt Disney Pictures 10.7846 6.3486 1.699 0.089882 .
## studio_mod2Warner Bros. Pictures 0.1373 2.8061 0.049 0.961004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.17 on 609 degrees of freedom
## Multiple R-squared: 0.3518, Adjusted R-squared: 0.3082
## F-statistic: 8.063 on 41 and 609 DF, p-value: < 2.2e-16
We got a model with multiple R-squared equal to 0.3518, meaning that roughly 35% of the variability in scores can be explained by the model; this is not very much. Also, the residual standard error is quite large, 15.17, which means a 95% prediction interval will span roughly 60 score points (2 × 1.96 × 15.17 ≈ 59). With this, we should mention that we got a very low p-value, which means our model is statistically significant.
Since all our explanatory variables are categorical, let's take a quick look at which of their values give us the highest expected score:
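The view itself is missing from this rendering; a minimal sketch that ranks the fitted coefficients (each coefficient is the shift in expected score relative to the baseline level):

# levels with the largest positive coefficients push the expected score up the most
head(sort(coef(m_fun), decreasing = TRUE), 10)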
For the next step, let’s check ANOVA output.
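The generating call is not shown in this rendering; presumably it was:

anova(m_fun)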
## Analysis of Variance Table
##
## Response: score_avg
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 49983 4998.3 21.7241 < 2.2e-16 ***
## best_pic_nom 1 9130 9129.6 39.6796 5.736e-10 ***
## mpaa_rating 5 6199 1239.8 5.3885 7.370e-05 ***
## best_dir_win 1 2869 2869.2 12.4702 0.0004448 ***
## title_type 2 2328 1164.0 5.0592 0.0066203 **
## studio_mod2 22 5550 252.3 1.0965 0.3449917
## Residuals 609 140120 230.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Except for studio_mod2, all the variables are statistically significant. Returning to our opening questions, our model does not give statistically significant evidence of a relationship between famous actors/actresses, or the studio, and the average movie score.
Practically, we find that the inclusion of adult material (mpaa_rating) is relevant to the expected movie score; in our model that variable explains roughly 3% of the variability (6199/(49983+9130+6199+2869+2328+5550+140120) = 0.0287). The studio explains roughly 2.6% (5550/216179 = 0.0257).
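A sketch of that arithmetic taken straight from the ANOVA table:

anova_tab <- anova(m_fun)
# share of the total sum of squares attributed to each term
round(anova_tab$`Sum Sq` / sum(anova_tab$`Sum Sq`), 3)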
Let's make a prediction for the movie “Ford v Ferrari” (2019) and compare the result with its current average rating. The data about the movie comes from imdb.com and rottentomatoes.com, and the information about the director from wikipedia.org.
ford_v_ferrari <- data.frame(genre = "Drama",
                             best_pic_nom = "yes",
                             mpaa_rating = "PG-13",
                             best_dir_win = "yes",
                             title_type = "Feature Film",
                             studio_mod2 = "20th Century Fox")
predict(m_fun, ford_v_ferrari, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 86.6318 55.55224 117.7114
We got a pretty high expected score; let's check the current ratings:
imdb_rating <- 8.1
critics_score <- 92
audience_score <- 98
score_avg <- (imdb_rating*10 + critics_score + audience_score)/3
print(paste0("Current average score for 'Ford v Ferrari' is ", round(score_avg, 1)))
## [1] "Current average score for 'Ford v Ferrari' is 90.3"
Hence, the model predicts, with 95% confidence, that “Ford v Ferrari” should have an evaluation score between 55.6 and 100 (the raw upper bound, 117.7, exceeds the maximum possible score). We can also note that for this specific case we got a much more accurate result than we would expect on average.
To summarize our findings, we can draw several conclusions. First, we managed to get a statistically significant prediction model for movie scores. Second, the practical significance of our model is pretty small: it explains roughly 35% of the variability. That is to be expected, because we did not consider potentially very significant variables such as a movie's budget or its storyline. Third, we can probably point to the relevance of including adult material in a movie, and, as expected, we found an association between studio and average movie score. Fourth, unexpectedly, we did not find practical significance between movie stars and movie scores. Fifth, we tested our model on the movie “Ford v Ferrari” and got a very impressive result: a predicted score of 86.6 versus the current rating of 90.3. But maybe it is just good luck, or, probably, the result of the narrower variability of residuals for high predicted scores (roughly above 75).