Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(plyr)
library(statsr)
library(car)
library(ggthemes)
library(corrplot)
library(tidyverse)
library(caret)
library(MASS)
library(gvlma)
library(kableExtra)
library(broom)
library(graphics)
library(stats)
library(rcompanion)
library(PerformanceAnalytics)

Load data

load("~/Desktop/R Programming/Statistics_Coursera/Regression_Modelling/movies.Rdata")

Part 1: Data

The dataset contains information about movies obtained from IMDb and Rotten Tomatoes Websites. It contains 651 randomly sampled movies released between 1970 and 2014. There are 32 variables which include title, genre, runtime, studio, imdb_rating, audience score, critics_score, and etc.

Generalisation

The data should allow us to generalise statistical results. However, potential bias includes voluntary response bias as the voting is voluntary, and only those who have high interest in movies would take part in the voting.

Causality

It is an observational study, with no random assignment. Therefore, causal inference cannot be drawn from the statistical results.

Part 2: Research question

MakeUseOf, an online technology publication website, rated IMDb as a great source of movie ratings website to understand general public opinions. The following question will be investigated;

What are the important predictors contributing to movie ratings on IMDb website?

Part 3: Exploratory data analysis

Data Exploration

Figure 1 and 2 illustrate movie ratings of all genres from IMDb and Rotten Tomatoes websites. IMDb ratings range from 1.9 to 9.0, while Rotten Tomatoes website has a score between 11 and 97.

The figures show that both websites reported almost the same pattern of movie ratings, with, at the median, Documentary was the highest rated movie genre, followed by Musical & Performing Arts and Other. Therefore, it can be implied that factors contributing to movie ratings on IMDb and Rotten Tomatoes websites may possibly be similar. Thus, identifying potential predictors could help movie producers improve marketing strategies and increase popularity among general audiences.

Figure 3 and 4 show the distribution of IMDb ratings and Rotten Tomatoes audience scores. The distribution of movie ratings/scores on both websites are left skewed, suggesting that voters are likely to score highly. The median IMDb ratings is 6.6 which is considered fairly good, while Rotten Tomatoes has a score, at the median, of 65 which corresponds to the proportion of positive reviews given by audiences.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data Cleaning & Manipulation

Table 1 shows movies with duplicated titles. Duplicated movie titles can be understood as remakes due to different years of release. However, the movie titled “Man on Wire” appears to be a duplicated record by Magnolia Pictures in 2008, and it will be removed from the dataset.

Table 1 - Movies with duplicated title
title	studio	released year
Hairspray	New Line Cinema	2007
Hairspray	New Line Home Entertainment	1988
Man on Wire	Magnolia Pictures	2008
Man on Wire	Magnolia Pictures	2008
Maniac	Analysis	1980
Maniac	IFC Midnight	2013
Where the Heart Is	Touchstone Pictures	1990
Where the Heart Is	Twentieth Century Fox Home Entertainment	2000

Table 2 shows that there are 32 missing values, and these will be removed from the dataset.

movies %>%
  sapply(is.na) %>%
  colSums()

##            title       title_type            genre          runtime 
##                0                0                0                1 
##      mpaa_rating           studio    thtr_rel_year   thtr_rel_month 
##                0                8                0                0 
##     thtr_rel_day     dvd_rel_year    dvd_rel_month      dvd_rel_day 
##                0                8                8                8 
##      imdb_rating   imdb_num_votes   critics_rating    critics_score 
##                0                0                0                0 
##  audience_rating   audience_score     best_pic_nom     best_pic_win 
##                0                0                0                0 
##   best_actor_win best_actress_win     best_dir_win       top200_box 
##                0                0                0                0 
##         director           actor1           actor2           actor3 
##                2                2                7                9 
##           actor4           actor5         imdb_url           rt_url 
##               13               15                0                0

NACheck <- complete.cases(movies) %>%
  count()

colnames(NACheck) <- c(" ", "frequency")

NACheck %>%
  kable(caption ="Table 2 - Number of Complteted Cases") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 2 - Number of Complteted Cases
	frequency
FALSE	32
TRUE	619

A new parameter - oscar_win - based on the existing data on winning actors, actresses, and directors will be created.

movies <- movies %>%
  mutate(oscar_win = ifelse(best_pic_nom =="yes"| best_pic_win =="yes"| best_actor_win =="yes"|
                               best_actress_win =="yes"| best_dir_win =="yes" ,1,0))

Figure 5 shows the correlation coefficients for all pairs of selected variables. 11 numerical variables are selected to examine whether or not they are collinear. It is important to note that collinearity should be avoided in order to ensure reliable estimates.

The correlation between different pairs are significantly high; for example, imdb_rating is highly correlated with critics_rating, critics_score, audience_rating, audience_score, and top200_box. The variable critics_rating is associated with audience_rating, audience_score, and top200_box. This may suggest that the model is likely to cause a bias.

vars <- names(movies) %in% c("title_type", "genre", "runtime", "imdb_rating", "imdb_num_votes","critics_rating", "critics_score", "audience_rating", "audience_score","oscar_win","top200_box")

selected_vars <- movies[vars]
selected_vars$title_type <- as.numeric(selected_vars$title_type)
selected_vars$genre <- as.numeric(selected_vars$genre)
selected_vars$imdb_rating <- as.numeric(selected_vars$imdb_rating)
selected_vars$imdb_num_votes <- as.numeric(selected_vars$imdb_num_votes)
selected_vars$critics_rating <- as.numeric(selected_vars$audience_rating)
selected_vars$top200_box <- as.numeric(selected_vars$top200_box)
selected_vars$audience_rating <- as.numeric(selected_vars$audience_rating)

corr_matrix <- cor(selected_vars)
corrplot(corr_matrix , method="color",addCoef.col = "black", type="upper",
         tl.col = "black", tl.srt = 40,  number.cex=0.50)

Part 4: Modeling

Model Selection

Stepwise regression method is used for model selection. It is a combination of forward and backward eliminations to select the most contributive predictors.

The summary table shows that the potential predictors include genre, runtime, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, with \(Adj. R^2\) of 0.8318 - a slight increase from 0.8317 of the full model.

The correlation efficient \((R)\) is 0.91. The \(R^2\) is 0.84, which means that approximately 84% of the variability in IMDb rating can be explained by the explanatory variables.

FullModel <- lm(imdb_rating ~ genre + runtime + imdb_num_votes + critics_rating +
                  critics_score + audience_rating + audience_score +
                  oscar_win + top200_box, data = movies) 
summary(FullModel)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_rating + critics_score + audience_rating + audience_score + 
##     oscar_win + top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.32321 -0.17168  0.02619  0.24978  1.08579 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.680e+00  1.711e-01  15.667  < 2e-16 ***
## genreAnimation                 -3.179e-01  1.603e-01  -1.983 0.047786 *  
## genreArt House & International  3.382e-01  1.332e-01   2.539 0.011350 *  
## genreComedy                    -1.270e-01  7.376e-02  -1.722 0.085560 .  
## genreDocumentary                3.460e-01  9.341e-02   3.704 0.000231 ***
## genreDrama                      1.080e-01  6.417e-02   1.683 0.092915 .  
## genreHorror                     8.850e-02  1.095e-01   0.808 0.419120    
## genreMusical & Performing Arts  1.011e-01  1.447e-01   0.699 0.484957    
## genreMystery & Suspense         2.592e-01  8.194e-02   3.163 0.001636 ** 
## genreOther                     -4.419e-02  1.258e-01  -0.351 0.725580    
## genreScience Fiction & Fantasy -1.495e-01  1.591e-01  -0.940 0.347679    
## runtime                         3.176e-03  1.066e-03   2.980 0.002991 ** 
## imdb_num_votes                  9.410e-07  1.973e-07   4.770 2.29e-06 ***
## critics_ratingFresh             8.381e-02  5.550e-02   1.510 0.131515    
## critics_ratingRotten            3.493e-01  8.906e-02   3.922 9.76e-05 ***
## critics_score                   1.480e-02  1.501e-03   9.859  < 2e-16 ***
## audience_ratingUpright         -3.424e-01  7.157e-02  -4.784 2.14e-06 ***
## audience_score                  3.972e-02  2.043e-03  19.437  < 2e-16 ***
## oscar_win                       3.772e-02  4.316e-02   0.874 0.382435    
## top200_boxyes                  -1.011e-01  1.232e-01  -0.821 0.412203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4451 on 630 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8366, Adjusted R-squared:  0.8317 
## F-statistic: 169.8 on 19 and 630 DF,  p-value: < 2.2e-16

ModelSelect <- stepAIC(FullModel, direction = "both",
                       trace = FALSE)
summary(ModelSelect)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_rating + critics_score + audience_rating + audience_score, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33520 -0.17886  0.02612  0.24827  1.08840 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.654e+00  1.689e-01  15.708  < 2e-16 ***
## genreAnimation                 -3.040e-01  1.598e-01  -1.902 0.057569 .  
## genreArt House & International  3.421e-01  1.329e-01   2.574 0.010288 *  
## genreComedy                    -1.188e-01  7.338e-02  -1.619 0.105943    
## genreDocumentary                3.503e-01  9.300e-02   3.767 0.000181 ***
## genreDrama                      1.194e-01  6.342e-02   1.883 0.060165 .  
## genreHorror                     9.226e-02  1.092e-01   0.845 0.398410    
## genreMusical & Performing Arts  1.061e-01  1.443e-01   0.735 0.462433    
## genreMystery & Suspense         2.728e-01  8.113e-02   3.363 0.000818 ***
## genreOther                     -3.483e-02  1.255e-01  -0.277 0.781493    
## genreScience Fiction & Fantasy -1.543e-01  1.590e-01  -0.970 0.332244    
## runtime                         3.393e-03  1.023e-03   3.317 0.000962 ***
## imdb_num_votes                  9.170e-07  1.931e-07   4.748 2.55e-06 ***
## critics_ratingFresh             8.445e-02  5.538e-02   1.525 0.127782    
## critics_ratingRotten            3.508e-01  8.900e-02   3.942 8.99e-05 ***
## critics_score                   1.482e-02  1.498e-03   9.890  < 2e-16 ***
## audience_ratingUpright         -3.481e-01  7.138e-02  -4.877 1.36e-06 ***
## audience_score                  3.980e-02  2.041e-03  19.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4449 on 632 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8362, Adjusted R-squared:  0.8318 
## F-statistic: 189.8 on 17 and 632 DF,  p-value: < 2.2e-16

FitModel <- lm(imdb_rating ~ genre + runtime + imdb_num_votes + critics_rating +
                 critics_score + audience_rating + audience_score, data = movies) 
summary(FitModel)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_rating + critics_score + audience_rating + audience_score, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33520 -0.17886  0.02612  0.24827  1.08840 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.654e+00  1.689e-01  15.708  < 2e-16 ***
## genreAnimation                 -3.040e-01  1.598e-01  -1.902 0.057569 .  
## genreArt House & International  3.421e-01  1.329e-01   2.574 0.010288 *  
## genreComedy                    -1.188e-01  7.338e-02  -1.619 0.105943    
## genreDocumentary                3.503e-01  9.300e-02   3.767 0.000181 ***
## genreDrama                      1.194e-01  6.342e-02   1.883 0.060165 .  
## genreHorror                     9.226e-02  1.092e-01   0.845 0.398410    
## genreMusical & Performing Arts  1.061e-01  1.443e-01   0.735 0.462433    
## genreMystery & Suspense         2.728e-01  8.113e-02   3.363 0.000818 ***
## genreOther                     -3.483e-02  1.255e-01  -0.277 0.781493    
## genreScience Fiction & Fantasy -1.543e-01  1.590e-01  -0.970 0.332244    
## runtime                         3.393e-03  1.023e-03   3.317 0.000962 ***
## imdb_num_votes                  9.170e-07  1.931e-07   4.748 2.55e-06 ***
## critics_ratingFresh             8.445e-02  5.538e-02   1.525 0.127782    
## critics_ratingRotten            3.508e-01  8.900e-02   3.942 8.99e-05 ***
## critics_score                   1.482e-02  1.498e-03   9.890  < 2e-16 ***
## audience_ratingUpright         -3.481e-01  7.138e-02  -4.877 1.36e-06 ***
## audience_score                  3.980e-02  2.041e-03  19.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4449 on 632 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8362, Adjusted R-squared:  0.8318 
## F-statistic: 189.8 on 17 and 632 DF,  p-value: < 2.2e-16

Model Comparison

In order to ensure that the selected regression model provide better performance, it is important to compare the overall quality of the two models - Full Model VS Fit Model.

Table 3 shows the quality comparison of the full and fit model. The fit model has slightly better \(Adj. R^2\), and lower amount of residuals standard error (RSE or sigma).

The AIC and BIC metrics are further used for model evaluation. The metrics will increase error when including additional variables, ensuring unbiased estimate of the model prediction. The AIC and the BIC of the reduced model are lower than those of the full model. Therefore, the fit model with lower AIC and BIC is preferred.

The p-value of the fit model is lower than the full model. This means that the fit model is statistically more significant than the full model.

FullCompare <- glance(FullModel) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value) 

FitCompare <- glance(FitModel) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value) 

FullCompare$p.value <- sprintf("%6.4G", FullCompare$p.value)
FitCompare$p.value <- sprintf("%6.4G", FitCompare$p.value)

ModelBind <- rbind(FullCompare, FitCompare) 

ModelTitle <- data.frame(model = c("full model", "fit model"))

FullGlance <- cbind(ModelTitle, ModelBind) 

colnames(FullGlance) <- c("", "adj.r.squared", "sigma", "AIC", "BIC", "p-value")

FullGlance %>%
  kbl(caption ="Table 3 - Model Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c(" " = 1, "Model Comparison" = 5))

Table 3 - Model Comparison
	Model Comparison
	adj.r.squared	sigma	AIC	BIC	p-value
full model	0.8316596	0.445111	814.0449	908.0613	5.525E-233
fit model	0.8318041	0.444920	811.5471	896.6095	6.001E-235

Model Evaluation

Despite lower residuals standard error (RSE or sigma), in order to be able to ensure the best possible estimates for the mode, the following assumptions must be satisfied prior to modelling.

1. Check for Multicollinearity

In the presence of multicollinearity, the regression model will produce unstable solution. Multicollinearity can be assessed by computing the variance inflation factor (VIF) which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

Table 4 shows the VIF values of all predictors in the fit model. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. It appears that the VIF values of critics_score, audience_score, and critics_rating are larger than 5, but do not exceed 10. However, the amount is not severe enough to warrant correction.

variancefactor <- vif(FitModel)

variancefactor %>%
  kbl(caption ="Table 4 - Variance Inflation Factor of Fit model") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 4 - Variance Inflation Factor of Fit model
	GVIF	Df	GVIF^(1/(2*Df))
genre	1.714093	10	1.027311
runtime	1.297382	1	1.139027
imdb_num_votes	1.539171	1	1.240633
critics_rating	5.443540	2	1.527462
critics_score	5.941347	1	2.437488
audience_rating	4.083203	1	2.020694
audience_score	5.591765	1	2.364691

2. Check for Autocorrelation

Autocorrelation refers to the residuals not being independent of each other, which will affect the results of confidence intervals. Durbin-Watson test for autocorrelation will be performed.

The p-value corresponds to 0.418 which means that we cannot reject the null hypothesis that there is no autocorrelation.

\[ H_0: \text{ There is no autocorrelation.} \] \[ H_a: \text{ There is evidence for autocorrelation.} \]

durbinWatsonTest(FitModel)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.02185454      1.931961   0.404
##  Alternative hypothesis: rho != 0

2. Check for Homoskedasticity

Homoskedasticity means that the residuals show an even and random pattern across the range of observations. In order to check for Homoskedasticity, Global Validation Of Linear Model Assumptions (gvlma) will be used.

The result shows that heteroscedasticity is present in the model, and resolving the issue is essential.

gvlma(FitModel)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_rating + critics_score + audience_rating + audience_score, 
##     data = movies)
## 
## Coefficients:
##                    (Intercept)                  genreAnimation  
##                      2.654e+00                      -3.040e-01  
## genreArt House & International                     genreComedy  
##                      3.421e-01                      -1.188e-01  
##               genreDocumentary                      genreDrama  
##                      3.503e-01                       1.194e-01  
##                    genreHorror  genreMusical & Performing Arts  
##                      9.226e-02                       1.061e-01  
##        genreMystery & Suspense                      genreOther  
##                      2.728e-01                      -3.483e-02  
## genreScience Fiction & Fantasy                         runtime  
##                     -1.543e-01                       3.393e-03  
##                 imdb_num_votes             critics_ratingFresh  
##                      9.170e-07                       8.445e-02  
##           critics_ratingRotten                   critics_score  
##                      3.508e-01                       1.482e-02  
##         audience_ratingUpright                  audience_score  
##                     -3.481e-01                       3.980e-02  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = FitModel) 
## 
##                     Value   p-value                   Decision
## Global Stat        944.40 0.000e+00 Assumptions NOT satisfied!
## Skewness           200.95 0.000e+00 Assumptions NOT satisfied!
## Kurtosis           705.60 0.000e+00 Assumptions NOT satisfied!
## Link Function       31.70 1.801e-08 Assumptions NOT satisfied!
## Heteroscedasticity   6.15 1.314e-02 Assumptions NOT satisfied!

2.1 Resolving Heteroscedasticity

In order to correct the issue of heteroscedasticity, Boxcox transformation is used and applied on the response variable in the model.

The gvlma assumption shows that the issue of heteroscedasticity is no longer violated, suggesting an equal variance across the range of observation.

The \(Adj. R^2\) increases to 0.8673 (from 0.8318). The \(R^2\) increases to 0.87 (from 0.83), which means that approximately 87% of the variability in IMDb rating can be explained by the explanatory variables.

imdb_rating_BC <- BoxCoxTrans(movies$imdb_rating)
movies <- cbind(movies, imdb_rating_trans = predict(imdb_rating_BC, movies$imdb_rating))

FitBC <- lm(movies$imdb_rating_trans ~ genre + runtime + imdb_num_votes + critics_rating + critics_score + audience_rating + audience_score, data = movies) 

summary(FitBC)

## 
## Call:
## lm(formula = movies$imdb_rating_trans ~ genre + runtime + imdb_num_votes + 
##     critics_rating + critics_score + audience_rating + audience_score, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3125  -1.2420   0.1275   1.3818   5.8925 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.008e+00  9.098e-01  -1.108 0.268178    
## genreAnimation                 -1.599e+00  8.606e-01  -1.858 0.063663 .  
## genreArt House & International  1.992e+00  7.157e-01   2.784 0.005536 ** 
## genreComedy                    -6.317e-01  3.951e-01  -1.599 0.110384    
## genreDocumentary                3.038e+00  5.008e-01   6.066 2.26e-09 ***
## genreDrama                      6.484e-01  3.415e-01   1.899 0.058081 .  
## genreHorror                     4.115e-01  5.879e-01   0.700 0.484238    
## genreMusical & Performing Arts  1.126e+00  7.772e-01   1.448 0.148024    
## genreMystery & Suspense         1.406e+00  4.369e-01   3.218 0.001359 ** 
## genreOther                     -2.416e-01  6.760e-01  -0.357 0.720955    
## genreScience Fiction & Fantasy -3.507e-01  8.563e-01  -0.410 0.682240    
## runtime                         2.018e-02  5.509e-03   3.662 0.000271 ***
## imdb_num_votes                  7.875e-06  1.040e-06   7.571 1.31e-13 ***
## critics_ratingFresh             2.444e-01  2.982e-01   0.820 0.412786    
## critics_ratingRotten            1.439e+00  4.793e-01   3.002 0.002784 ** 
## critics_score                   7.889e-02  8.069e-03   9.777  < 2e-16 ***
## audience_ratingUpright         -1.748e+00  3.844e-01  -4.548 6.51e-06 ***
## audience_score                  2.349e-01  1.099e-02  21.376  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.396 on 632 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8708, Adjusted R-squared:  0.8673 
## F-statistic: 250.5 on 17 and 632 DF,  p-value: < 2.2e-16

gvlma(FitBC)

## 
## Call:
## lm(formula = movies$imdb_rating_trans ~ genre + runtime + imdb_num_votes + 
##     critics_rating + critics_score + audience_rating + audience_score, 
##     data = movies)
## 
## Coefficients:
##                    (Intercept)                  genreAnimation  
##                     -1.008e+00                      -1.599e+00  
## genreArt House & International                     genreComedy  
##                      1.992e+00                      -6.317e-01  
##               genreDocumentary                      genreDrama  
##                      3.038e+00                       6.484e-01  
##                    genreHorror  genreMusical & Performing Arts  
##                      4.115e-01                       1.126e+00  
##        genreMystery & Suspense                      genreOther  
##                      1.406e+00                      -2.416e-01  
## genreScience Fiction & Fantasy                         runtime  
##                     -3.507e-01                       2.018e-02  
##                 imdb_num_votes             critics_ratingFresh  
##                      7.875e-06                       2.444e-01  
##           critics_ratingRotten                   critics_score  
##                      1.439e+00                       7.889e-02  
##         audience_ratingUpright                  audience_score  
##                     -1.748e+00                       2.349e-01  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = FitBC) 
## 
##                        Value   p-value                   Decision
## Global Stat        221.14611 0.000e+00 Assumptions NOT satisfied!
## Skewness            59.12972 1.477e-14 Assumptions NOT satisfied!
## Kurtosis           160.25126 0.000e+00 Assumptions NOT satisfied!
## Link Function        0.01804 8.932e-01    Assumptions acceptable.
## Heteroscedasticity   1.74709 1.862e-01    Assumptions acceptable.

3. Check for Normal Distribution

Figure 6 shows that the distribution of residuals of the model is left-skewed. However, Central Limit Theorem is applied since the dataset is larger than 30. In addition, according to the classic OLS assumptions (Gauss-Markov Theorem), residuals are not required to be normally distributed.

hist(FitModel$residuals, main = "Figure 6 - Residuals Distribution of the model", xlab = "")

Diagnostics Plot

Figure 7 shows a random scatter around zero in the Residuals VS Fitted Plot, the linearity can be assumed.

The residuals distribution in the Normal Q-Q plot fairly follows the straight line.

The Scale - Location plot shows that the residuals are spread equally along the range of predictors. There is no presence of heteroscedasticity.

The Residuals VS Leverage plot highlights the top three most extreme points (330, 475, 597). However, the data do not present any influential points as the Cook’s distance lines are not shown on the plot.

par(mfrow=c(2,2))
plot(FitModel)

Part 5: Prediction

Movies Prediction

Table 5 shows the movie ratings’ prediction for Hacksaw Ridge, The Accountant, and Miss Peregrine’s Home for Peculiar Children.

For Hacksaw Ridge, the model predicts, with 95% confidence, that the movie is expected to have IMDb ratings of (7.5, 8.6). The actual IMDb rating of Hacksaw Ridge - 8.1 - falls within the prediction interval.

The model predicts, with 95% confidence, that the Accountant is expected to have IMDb ratings of (6.6, 7.9). Its actual rating is 7.3 and falls within the prediction interval.

Miss Peregrine’s Home for Peculiar Children is expected to have IMDb ratings between (5.8, 7.3). Its actual rating of 6.7 also falls within the prediction interval.

HacksawRidge <- data.frame(genre = "Drama",
                           runtime = 139,
                           imdb_num_votes = 431349,
                           critics_rating = "Certified Fresh",
                           critics_score = 84,
                           audience_rating = "Upright",
                           audience_score = 91)

TheAccountant <- data.frame(genre = "Drama",
                            runtime = 128,
                            imdb_num_votes = 258741,
                            critics_rating = "Rotten",
                            critics_score = 51,
                            audience_rating = "Upright",
                            audience_score = 76)

Peregrine <- data.frame(genre = "Drama",
                        runtime = 127,
                        imdb_num_votes = 154960,
                        critics_rating = "Fresh",
                        critics_score = 64,
                        audience_rating = "Upright",
                        audience_score = 60)

pred.int.HacksawRidge <- predict(FitBC, newdata = HacksawRidge, interval = "prediction") %>%
  transform(fit = (fit * 2 + 1)^(1/2),
            lwr = (lwr * 2 + 1)^(1/2),
            upr = (upr * 2 + 1)^(1/2)) 

pred.int.TheAccountant <- predict(FitBC, newdata = TheAccountant, interval = "prediction") %>%
  transform(fit = (fit * 2 + 1)^(1/2),
            lwr = (lwr * 2 + 1)^(1/2),
            upr = (upr * 2 + 1)^(1/2)) 

pred.int.Peregrine <- predict(FitBC, newdata = Peregrine, interval = "prediction") %>%
  transform(fit = (fit * 2 + 1)^(1/2),
            lwr = (lwr * 2 + 1)^(1/2),
            upr = (upr * 2 + 1)^(1/2)) 


actual.HacksawRidge <- data.frame("actual" = 8.1)
actual.TheAccountant <- data.frame("actual" = 7.3)
actual.Peregrine <- data.frame("actual" = 6.7)

tbl.HacksawRidge <- data.frame(title = "Hacksaw Ridge (2016)",
                               prediction = pred.int.HacksawRidge,
                               actual = actual.HacksawRidge)

tbl.TheAccountant <- data.frame(title = "The Accountant (2016)",
                           prediction = pred.int.TheAccountant,
                           actual = actual.TheAccountant)

tbl.Peregrine <- data.frame(title = "Miss Peregrine’s Home for Peculiar Children (2016)",
                           prediction = pred.int.Peregrine,
                           actual = actual.Peregrine)

table.pred <- rbind(tbl.HacksawRidge, tbl.TheAccountant, tbl.Peregrine) 
colnames(table.pred) <- c('Title', 'fit', 'lower bound', 'upper bound', 'Actual Rating')

tibble(table.pred) %>%
  kbl(caption ="Table 5 - Movies Prediction") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c(" " = 1, "Prediction" = 3, " " = 1))

Table 5 - Movies Prediction
	Prediction
Title	fit	lower bound	upper bound	Actual Rating
Hacksaw Ridge (2016)	8.074666	7.462234	8.643815	8.1
The Accountant (2016)	7.256784	6.570201	7.883799	7.3
Miss Peregrine’s Home for Peculiar Children (2016)	6.567332	5.799651	7.254223	6.7

Model Evaluation

In order to evaluate whether or not the model provides reliable estimates, the data are split into training and testing sets.

Table 6 shows that the model wrongly predicts 95 movies, while only 35 movies are correctly predicted and captured within the prediction interval.

set.seed(1234)

training.samples <- movies$imdb_rating %>%
  createDataPartition(p = 0.8, list = FALSE)
train.set <- movies[training.samples, ]
test.set <- movies[-training.samples, ]


preds.pi <- predict(FitBC, test.set, interval = "confidence") %>%
  transform(fit = (fit * 2 + 1)^(1/2),
            lwr = (lwr * 2 + 1)^(1/2),
            upr = (upr * 2 + 1)^(1/2)) 

modelEval.pi <- cbind(preds.pi, test.set$imdb_rating) %>%
  mutate(capture_pi = ifelse(lwr < test.set$imdb_rating &
                               upr > test.set$imdb_rating, "yes", "no"))

colnames(modelEval.pi) <- c("fit", "lower bound", "upper bound", "IMDb rating", "captured interval")

head(tibble(modelEval.pi)) %>%
  kbl(caption ="Table 6 - Prediction Evaluation") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c("Prediction" = 3, " " = 1, " " = 1))

Table 6 - Prediction Evaluation
Prediction
fit	lower bound	upper bound	IMDb rating	captured interval
6.656337	6.570826	6.740763	5.5	no
7.366241	7.292145	7.439598	7.3	yes
6.774342	6.687276	6.860303	6.6	no
7.502219	7.407494	7.595764	7.6	no
4.944425	4.824123	5.061869	3.6	no
6.842395	6.748122	6.935388	7.0	no

set.seed(123456)
count(modelEval.pi$`captured interval`)

##     x freq
## 1  no   95
## 2 yes   35

Part 6: Conclusion

genre, runtime, imdb_num_votes, critics_score, critics_rating, audience_rating, and audience_score are potential predictors contributing to popularity on IMDb website.
The fit model is evaluated using adjusted \(Adj. R^2\), RSE or sigma, AIC, BIC, and p-value. The result clearly indicates that the reduced model is significantly better than the full model.
The presence of heteroscedasticity is resolved using Box Cox Transformation, with an increase in \(Adj. R^2\) to 0.8673 (from 0.8318).
The predictive power is evaluated by splitting the dataset into training and test data. The result between predicted ratings and actual ratings suggests that the model produces a bias.
The selected model is used to predict IMDb ratings for Hacksaw Ridge, The Accountant, and Miss Peregrine’s Home for Peculiar Children.The followings are the prediction results;
- The actual rating of Hacksaw Ridge (8.1) falls within the 95% prediction confidence (7.5, 8.6).
- The actual rating of The Accountant (7.3) falls within the 95% prediction confidence (6.6, 7.9).
- The actual rating of Miss Peregrine’s Home for Peculiar Children (6.7) also falls within the 95% prediction confidence (5.8, 7.3).
The potential shortcomings include limited observations and moderately high correlation of given variables, leading to a lower predictive power.