The dataset contains information about movies obtained from IMDb and Rotten Tomatoes Websites. It contains 651 randomly sampled movies released between 1970 and 2014. There are 32 variables which include title, genre, runtime, studio, imdb_rating, audience score, critics_score, and etc.
The data should allow us to generalise statistical results. However, potential bias includes voluntary response bias as the voting is voluntary, and only those who have high interest in movies would take part in the voting.
It is an observational study, with no random assignment. Therefore, causal inference cannot be drawn from the statistical results.
MakeUseOf, an online technology publication website, rated IMDb as a great source of movie ratings website to understand general public opinions. The following question will be investigated;
What are the important predictors contributing to movie ratings on IMDb website?
Figure 1 and 2 illustrate movie ratings of all genres from IMDb and Rotten Tomatoes websites. IMDb ratings range from 1.9 to 9.0, while Rotten Tomatoes website has a score between 11 and 97.
The figures show that both websites reported almost the same pattern of movie ratings, with, at the median, Documentary was the highest rated movie genre, followed by Musical & Performing Arts and Other. Therefore, it can be implied that factors contributing to movie ratings on IMDb and Rotten Tomatoes websites may possibly be similar. Thus, identifying potential predictors could help movie producers improve marketing strategies and increase popularity among general audiences.
Figure 3 and 4 show the distribution of IMDb ratings and Rotten Tomatoes audience scores. The distribution of movie ratings/scores on both websites are left skewed, suggesting that voters are likely to score highly. The median IMDb ratings is 6.6 which is considered fairly good, while Rotten Tomatoes has a score, at the median, of 65 which corresponds to the proportion of positive reviews given by audiences.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Table 1 shows movies with duplicated titles. Duplicated movie titles can be understood as remakes due to different years of release. However, the movie titled “Man on Wire” appears to be a duplicated record by Magnolia Pictures in 2008, and it will be removed from the dataset.
| title | studio | released year |
|---|---|---|
| Hairspray | New Line Cinema | 2007 |
| Hairspray | New Line Home Entertainment | 1988 |
| Man on Wire | Magnolia Pictures | 2008 |
| Man on Wire | Magnolia Pictures | 2008 |
| Maniac | Analysis | 1980 |
| Maniac | IFC Midnight | 2013 |
| Where the Heart Is | Touchstone Pictures | 1990 |
| Where the Heart Is | Twentieth Century Fox Home Entertainment | 2000 |
Table 2 shows that there are 32 missing values, and these will be removed from the dataset.
## title title_type genre runtime
## 0 0 0 1
## mpaa_rating studio thtr_rel_year thtr_rel_month
## 0 8 0 0
## thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## 0 8 8 8
## imdb_rating imdb_num_votes critics_rating critics_score
## 0 0 0 0
## audience_rating audience_score best_pic_nom best_pic_win
## 0 0 0 0
## best_actor_win best_actress_win best_dir_win top200_box
## 0 0 0 0
## director actor1 actor2 actor3
## 2 2 7 9
## actor4 actor5 imdb_url rt_url
## 13 15 0 0
NACheck <- complete.cases(movies) %>%
count()
colnames(NACheck) <- c(" ", "frequency")
NACheck %>%
kable(caption ="Table 2 - Number of Complteted Cases") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| frequency | |
|---|---|
| FALSE | 32 |
| TRUE | 619 |
A new parameter - oscar_win - based on the existing data on winning actors, actresses, and directors will be created.
movies <- movies %>%
mutate(oscar_win = ifelse(best_pic_nom =="yes"| best_pic_win =="yes"| best_actor_win =="yes"|
best_actress_win =="yes"| best_dir_win =="yes" ,1,0))Figure 5 shows the correlation coefficients for all pairs of selected variables. 11 numerical variables are selected to examine whether or not they are collinear. It is important to note that collinearity should be avoided in order to ensure reliable estimates.
The correlation between different pairs are significantly high; for example, imdb_rating is highly correlated with critics_rating, critics_score, audience_rating, audience_score, and top200_box. The variable critics_rating is associated with audience_rating, audience_score, and top200_box. This may suggest that the model is likely to cause a bias.
vars <- names(movies) %in% c("title_type", "genre", "runtime", "imdb_rating", "imdb_num_votes","critics_rating", "critics_score", "audience_rating", "audience_score","oscar_win","top200_box")
selected_vars <- movies[vars]
selected_vars$title_type <- as.numeric(selected_vars$title_type)
selected_vars$genre <- as.numeric(selected_vars$genre)
selected_vars$imdb_rating <- as.numeric(selected_vars$imdb_rating)
selected_vars$imdb_num_votes <- as.numeric(selected_vars$imdb_num_votes)
selected_vars$critics_rating <- as.numeric(selected_vars$audience_rating)
selected_vars$top200_box <- as.numeric(selected_vars$top200_box)
selected_vars$audience_rating <- as.numeric(selected_vars$audience_rating)
corr_matrix <- cor(selected_vars)
corrplot(corr_matrix , method="color",addCoef.col = "black", type="upper",
tl.col = "black", tl.srt = 40, number.cex=0.50) Stepwise regression method is used for model selection. It is a combination of forward and backward eliminations to select the most contributive predictors.
The summary table shows that the potential predictors include genre, runtime, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, with \(Adj. R^2\) of 0.8318 - a slight increase from 0.8317 of the full model.
The correlation efficient \((R)\) is 0.91. The \(R^2\) is 0.84, which means that approximately 84% of the variability in IMDb rating can be explained by the explanatory variables.
FullModel <- lm(imdb_rating ~ genre + runtime + imdb_num_votes + critics_rating +
critics_score + audience_rating + audience_score +
oscar_win + top200_box, data = movies)
summary(FullModel)##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes +
## critics_rating + critics_score + audience_rating + audience_score +
## oscar_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.32321 -0.17168 0.02619 0.24978 1.08579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.680e+00 1.711e-01 15.667 < 2e-16 ***
## genreAnimation -3.179e-01 1.603e-01 -1.983 0.047786 *
## genreArt House & International 3.382e-01 1.332e-01 2.539 0.011350 *
## genreComedy -1.270e-01 7.376e-02 -1.722 0.085560 .
## genreDocumentary 3.460e-01 9.341e-02 3.704 0.000231 ***
## genreDrama 1.080e-01 6.417e-02 1.683 0.092915 .
## genreHorror 8.850e-02 1.095e-01 0.808 0.419120
## genreMusical & Performing Arts 1.011e-01 1.447e-01 0.699 0.484957
## genreMystery & Suspense 2.592e-01 8.194e-02 3.163 0.001636 **
## genreOther -4.419e-02 1.258e-01 -0.351 0.725580
## genreScience Fiction & Fantasy -1.495e-01 1.591e-01 -0.940 0.347679
## runtime 3.176e-03 1.066e-03 2.980 0.002991 **
## imdb_num_votes 9.410e-07 1.973e-07 4.770 2.29e-06 ***
## critics_ratingFresh 8.381e-02 5.550e-02 1.510 0.131515
## critics_ratingRotten 3.493e-01 8.906e-02 3.922 9.76e-05 ***
## critics_score 1.480e-02 1.501e-03 9.859 < 2e-16 ***
## audience_ratingUpright -3.424e-01 7.157e-02 -4.784 2.14e-06 ***
## audience_score 3.972e-02 2.043e-03 19.437 < 2e-16 ***
## oscar_win 3.772e-02 4.316e-02 0.874 0.382435
## top200_boxyes -1.011e-01 1.232e-01 -0.821 0.412203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4451 on 630 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8366, Adjusted R-squared: 0.8317
## F-statistic: 169.8 on 19 and 630 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes +
## critics_rating + critics_score + audience_rating + audience_score,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33520 -0.17886 0.02612 0.24827 1.08840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.654e+00 1.689e-01 15.708 < 2e-16 ***
## genreAnimation -3.040e-01 1.598e-01 -1.902 0.057569 .
## genreArt House & International 3.421e-01 1.329e-01 2.574 0.010288 *
## genreComedy -1.188e-01 7.338e-02 -1.619 0.105943
## genreDocumentary 3.503e-01 9.300e-02 3.767 0.000181 ***
## genreDrama 1.194e-01 6.342e-02 1.883 0.060165 .
## genreHorror 9.226e-02 1.092e-01 0.845 0.398410
## genreMusical & Performing Arts 1.061e-01 1.443e-01 0.735 0.462433
## genreMystery & Suspense 2.728e-01 8.113e-02 3.363 0.000818 ***
## genreOther -3.483e-02 1.255e-01 -0.277 0.781493
## genreScience Fiction & Fantasy -1.543e-01 1.590e-01 -0.970 0.332244
## runtime 3.393e-03 1.023e-03 3.317 0.000962 ***
## imdb_num_votes 9.170e-07 1.931e-07 4.748 2.55e-06 ***
## critics_ratingFresh 8.445e-02 5.538e-02 1.525 0.127782
## critics_ratingRotten 3.508e-01 8.900e-02 3.942 8.99e-05 ***
## critics_score 1.482e-02 1.498e-03 9.890 < 2e-16 ***
## audience_ratingUpright -3.481e-01 7.138e-02 -4.877 1.36e-06 ***
## audience_score 3.980e-02 2.041e-03 19.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4449 on 632 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8362, Adjusted R-squared: 0.8318
## F-statistic: 189.8 on 17 and 632 DF, p-value: < 2.2e-16
FitModel <- lm(imdb_rating ~ genre + runtime + imdb_num_votes + critics_rating +
critics_score + audience_rating + audience_score, data = movies)
summary(FitModel)##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes +
## critics_rating + critics_score + audience_rating + audience_score,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33520 -0.17886 0.02612 0.24827 1.08840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.654e+00 1.689e-01 15.708 < 2e-16 ***
## genreAnimation -3.040e-01 1.598e-01 -1.902 0.057569 .
## genreArt House & International 3.421e-01 1.329e-01 2.574 0.010288 *
## genreComedy -1.188e-01 7.338e-02 -1.619 0.105943
## genreDocumentary 3.503e-01 9.300e-02 3.767 0.000181 ***
## genreDrama 1.194e-01 6.342e-02 1.883 0.060165 .
## genreHorror 9.226e-02 1.092e-01 0.845 0.398410
## genreMusical & Performing Arts 1.061e-01 1.443e-01 0.735 0.462433
## genreMystery & Suspense 2.728e-01 8.113e-02 3.363 0.000818 ***
## genreOther -3.483e-02 1.255e-01 -0.277 0.781493
## genreScience Fiction & Fantasy -1.543e-01 1.590e-01 -0.970 0.332244
## runtime 3.393e-03 1.023e-03 3.317 0.000962 ***
## imdb_num_votes 9.170e-07 1.931e-07 4.748 2.55e-06 ***
## critics_ratingFresh 8.445e-02 5.538e-02 1.525 0.127782
## critics_ratingRotten 3.508e-01 8.900e-02 3.942 8.99e-05 ***
## critics_score 1.482e-02 1.498e-03 9.890 < 2e-16 ***
## audience_ratingUpright -3.481e-01 7.138e-02 -4.877 1.36e-06 ***
## audience_score 3.980e-02 2.041e-03 19.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4449 on 632 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8362, Adjusted R-squared: 0.8318
## F-statistic: 189.8 on 17 and 632 DF, p-value: < 2.2e-16
In order to ensure that the selected regression model provide better performance, it is important to compare the overall quality of the two models - Full Model VS Fit Model.
Table 3 shows the quality comparison of the full and fit model. The fit model has slightly better \(Adj. R^2\), and lower amount of residuals standard error (RSE or sigma).
The AIC and BIC metrics are further used for model evaluation. The metrics will increase error when including additional variables, ensuring unbiased estimate of the model prediction. The AIC and the BIC of the reduced model are lower than those of the full model. Therefore, the fit model with lower AIC and BIC is preferred.
The p-value of the fit model is lower than the full model. This means that the fit model is statistically more significant than the full model.
FullCompare <- glance(FullModel) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
FitCompare <- glance(FitModel) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
FullCompare$p.value <- sprintf("%6.4G", FullCompare$p.value)
FitCompare$p.value <- sprintf("%6.4G", FitCompare$p.value)
ModelBind <- rbind(FullCompare, FitCompare)
ModelTitle <- data.frame(model = c("full model", "fit model"))
FullGlance <- cbind(ModelTitle, ModelBind)
colnames(FullGlance) <- c("", "adj.r.squared", "sigma", "AIC", "BIC", "p-value")
FullGlance %>%
kbl(caption ="Table 3 - Model Comparison") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
add_header_above(c(" " = 1, "Model Comparison" = 5))| adj.r.squared | sigma | AIC | BIC | p-value | |
|---|---|---|---|---|---|
| full model | 0.8316596 | 0.445111 | 814.0449 | 908.0613 | 5.525E-233 |
| fit model | 0.8318041 | 0.444920 | 811.5471 | 896.6095 | 6.001E-235 |
Despite lower residuals standard error (RSE or sigma), in order to be able to ensure the best possible estimates for the mode, the following assumptions must be satisfied prior to modelling.
1. Check for Multicollinearity
In the presence of multicollinearity, the regression model will produce unstable solution. Multicollinearity can be assessed by computing the variance inflation factor (VIF) which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
Table 4 shows the VIF values of all predictors in the fit model. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. It appears that the VIF values of critics_score, audience_score, and critics_rating are larger than 5, but do not exceed 10. However, the amount is not severe enough to warrant correction.
variancefactor <- vif(FitModel)
variancefactor %>%
kbl(caption ="Table 4 - Variance Inflation Factor of Fit model") %>%
kable_styling(bootstrap_options = c("striped", "hover")) | GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| genre | 1.714093 | 10 | 1.027311 |
| runtime | 1.297382 | 1 | 1.139027 |
| imdb_num_votes | 1.539171 | 1 | 1.240633 |
| critics_rating | 5.443540 | 2 | 1.527462 |
| critics_score | 5.941347 | 1 | 2.437488 |
| audience_rating | 4.083203 | 1 | 2.020694 |
| audience_score | 5.591765 | 1 | 2.364691 |
2. Check for Autocorrelation
Autocorrelation refers to the residuals not being independent of each other, which will affect the results of confidence intervals. Durbin-Watson test for autocorrelation will be performed.
The p-value corresponds to 0.418 which means that we cannot reject the null hypothesis that there is no autocorrelation.
\[ H_0: \text{ There is no autocorrelation.} \] \[ H_a: \text{ There is evidence for autocorrelation.} \]
## lag Autocorrelation D-W Statistic p-value
## 1 0.02185454 1.931961 0.404
## Alternative hypothesis: rho != 0
2. Check for Homoskedasticity
Homoskedasticity means that the residuals show an even and random pattern across the range of observations. In order to check for Homoskedasticity, Global Validation Of Linear Model Assumptions (gvlma) will be used.
The result shows that heteroscedasticity is present in the model, and resolving the issue is essential.
##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes +
## critics_rating + critics_score + audience_rating + audience_score,
## data = movies)
##
## Coefficients:
## (Intercept) genreAnimation
## 2.654e+00 -3.040e-01
## genreArt House & International genreComedy
## 3.421e-01 -1.188e-01
## genreDocumentary genreDrama
## 3.503e-01 1.194e-01
## genreHorror genreMusical & Performing Arts
## 9.226e-02 1.061e-01
## genreMystery & Suspense genreOther
## 2.728e-01 -3.483e-02
## genreScience Fiction & Fantasy runtime
## -1.543e-01 3.393e-03
## imdb_num_votes critics_ratingFresh
## 9.170e-07 8.445e-02
## critics_ratingRotten critics_score
## 3.508e-01 1.482e-02
## audience_ratingUpright audience_score
## -3.481e-01 3.980e-02
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = FitModel)
##
## Value p-value Decision
## Global Stat 944.40 0.000e+00 Assumptions NOT satisfied!
## Skewness 200.95 0.000e+00 Assumptions NOT satisfied!
## Kurtosis 705.60 0.000e+00 Assumptions NOT satisfied!
## Link Function 31.70 1.801e-08 Assumptions NOT satisfied!
## Heteroscedasticity 6.15 1.314e-02 Assumptions NOT satisfied!
2.1 Resolving Heteroscedasticity
In order to correct the issue of heteroscedasticity, Boxcox transformation is used and applied on the response variable in the model.
The gvlma assumption shows that the issue of heteroscedasticity is no longer violated, suggesting an equal variance across the range of observation.
The \(Adj. R^2\) increases to 0.8673 (from 0.8318). The \(R^2\) increases to 0.87 (from 0.83), which means that approximately 87% of the variability in IMDb rating can be explained by the explanatory variables.
imdb_rating_BC <- BoxCoxTrans(movies$imdb_rating)
movies <- cbind(movies, imdb_rating_trans = predict(imdb_rating_BC, movies$imdb_rating))FitBC <- lm(movies$imdb_rating_trans ~ genre + runtime + imdb_num_votes + critics_rating + critics_score + audience_rating + audience_score, data = movies)
summary(FitBC) ##
## Call:
## lm(formula = movies$imdb_rating_trans ~ genre + runtime + imdb_num_votes +
## critics_rating + critics_score + audience_rating + audience_score,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3125 -1.2420 0.1275 1.3818 5.8925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.008e+00 9.098e-01 -1.108 0.268178
## genreAnimation -1.599e+00 8.606e-01 -1.858 0.063663 .
## genreArt House & International 1.992e+00 7.157e-01 2.784 0.005536 **
## genreComedy -6.317e-01 3.951e-01 -1.599 0.110384
## genreDocumentary 3.038e+00 5.008e-01 6.066 2.26e-09 ***
## genreDrama 6.484e-01 3.415e-01 1.899 0.058081 .
## genreHorror 4.115e-01 5.879e-01 0.700 0.484238
## genreMusical & Performing Arts 1.126e+00 7.772e-01 1.448 0.148024
## genreMystery & Suspense 1.406e+00 4.369e-01 3.218 0.001359 **
## genreOther -2.416e-01 6.760e-01 -0.357 0.720955
## genreScience Fiction & Fantasy -3.507e-01 8.563e-01 -0.410 0.682240
## runtime 2.018e-02 5.509e-03 3.662 0.000271 ***
## imdb_num_votes 7.875e-06 1.040e-06 7.571 1.31e-13 ***
## critics_ratingFresh 2.444e-01 2.982e-01 0.820 0.412786
## critics_ratingRotten 1.439e+00 4.793e-01 3.002 0.002784 **
## critics_score 7.889e-02 8.069e-03 9.777 < 2e-16 ***
## audience_ratingUpright -1.748e+00 3.844e-01 -4.548 6.51e-06 ***
## audience_score 2.349e-01 1.099e-02 21.376 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.396 on 632 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8673
## F-statistic: 250.5 on 17 and 632 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = movies$imdb_rating_trans ~ genre + runtime + imdb_num_votes +
## critics_rating + critics_score + audience_rating + audience_score,
## data = movies)
##
## Coefficients:
## (Intercept) genreAnimation
## -1.008e+00 -1.599e+00
## genreArt House & International genreComedy
## 1.992e+00 -6.317e-01
## genreDocumentary genreDrama
## 3.038e+00 6.484e-01
## genreHorror genreMusical & Performing Arts
## 4.115e-01 1.126e+00
## genreMystery & Suspense genreOther
## 1.406e+00 -2.416e-01
## genreScience Fiction & Fantasy runtime
## -3.507e-01 2.018e-02
## imdb_num_votes critics_ratingFresh
## 7.875e-06 2.444e-01
## critics_ratingRotten critics_score
## 1.439e+00 7.889e-02
## audience_ratingUpright audience_score
## -1.748e+00 2.349e-01
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = FitBC)
##
## Value p-value Decision
## Global Stat 221.14611 0.000e+00 Assumptions NOT satisfied!
## Skewness 59.12972 1.477e-14 Assumptions NOT satisfied!
## Kurtosis 160.25126 0.000e+00 Assumptions NOT satisfied!
## Link Function 0.01804 8.932e-01 Assumptions acceptable.
## Heteroscedasticity 1.74709 1.862e-01 Assumptions acceptable.
3. Check for Normal Distribution
Figure 6 shows that the distribution of residuals of the model is left-skewed. However, Central Limit Theorem is applied since the dataset is larger than 30. In addition, according to the classic OLS assumptions (Gauss-Markov Theorem), residuals are not required to be normally distributed.
Figure 7 shows a random scatter around zero in the Residuals VS Fitted Plot, the linearity can be assumed.
The residuals distribution in the Normal Q-Q plot fairly follows the straight line.
The Scale - Location plot shows that the residuals are spread equally along the range of predictors. There is no presence of heteroscedasticity.
The Residuals VS Leverage plot highlights the top three most extreme points (330, 475, 597). However, the data do not present any influential points as the Cook’s distance lines are not shown on the plot.
Table 5 shows the movie ratings’ prediction for Hacksaw Ridge, The Accountant, and Miss Peregrine’s Home for Peculiar Children.
For Hacksaw Ridge, the model predicts, with 95% confidence, that the movie is expected to have IMDb ratings of (7.5, 8.6). The actual IMDb rating of Hacksaw Ridge - 8.1 - falls within the prediction interval.
The model predicts, with 95% confidence, that the Accountant is expected to have IMDb ratings of (6.6, 7.9). Its actual rating is 7.3 and falls within the prediction interval.
Miss Peregrine’s Home for Peculiar Children is expected to have IMDb ratings between (5.8, 7.3). Its actual rating of 6.7 also falls within the prediction interval.
HacksawRidge <- data.frame(genre = "Drama",
runtime = 139,
imdb_num_votes = 431349,
critics_rating = "Certified Fresh",
critics_score = 84,
audience_rating = "Upright",
audience_score = 91)
TheAccountant <- data.frame(genre = "Drama",
runtime = 128,
imdb_num_votes = 258741,
critics_rating = "Rotten",
critics_score = 51,
audience_rating = "Upright",
audience_score = 76)
Peregrine <- data.frame(genre = "Drama",
runtime = 127,
imdb_num_votes = 154960,
critics_rating = "Fresh",
critics_score = 64,
audience_rating = "Upright",
audience_score = 60)
pred.int.HacksawRidge <- predict(FitBC, newdata = HacksawRidge, interval = "prediction") %>%
transform(fit = (fit * 2 + 1)^(1/2),
lwr = (lwr * 2 + 1)^(1/2),
upr = (upr * 2 + 1)^(1/2))
pred.int.TheAccountant <- predict(FitBC, newdata = TheAccountant, interval = "prediction") %>%
transform(fit = (fit * 2 + 1)^(1/2),
lwr = (lwr * 2 + 1)^(1/2),
upr = (upr * 2 + 1)^(1/2))
pred.int.Peregrine <- predict(FitBC, newdata = Peregrine, interval = "prediction") %>%
transform(fit = (fit * 2 + 1)^(1/2),
lwr = (lwr * 2 + 1)^(1/2),
upr = (upr * 2 + 1)^(1/2))
actual.HacksawRidge <- data.frame("actual" = 8.1)
actual.TheAccountant <- data.frame("actual" = 7.3)
actual.Peregrine <- data.frame("actual" = 6.7)
tbl.HacksawRidge <- data.frame(title = "Hacksaw Ridge (2016)",
prediction = pred.int.HacksawRidge,
actual = actual.HacksawRidge)
tbl.TheAccountant <- data.frame(title = "The Accountant (2016)",
prediction = pred.int.TheAccountant,
actual = actual.TheAccountant)
tbl.Peregrine <- data.frame(title = "Miss Peregrine’s Home for Peculiar Children (2016)",
prediction = pred.int.Peregrine,
actual = actual.Peregrine)
table.pred <- rbind(tbl.HacksawRidge, tbl.TheAccountant, tbl.Peregrine)
colnames(table.pred) <- c('Title', 'fit', 'lower bound', 'upper bound', 'Actual Rating')
tibble(table.pred) %>%
kbl(caption ="Table 5 - Movies Prediction") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
add_header_above(c(" " = 1, "Prediction" = 3, " " = 1))| Title | fit | lower bound | upper bound | Actual Rating |
|---|---|---|---|---|
| Hacksaw Ridge (2016) | 8.074666 | 7.462234 | 8.643815 | 8.1 |
| The Accountant (2016) | 7.256784 | 6.570201 | 7.883799 | 7.3 |
| Miss Peregrine’s Home for Peculiar Children (2016) | 6.567332 | 5.799651 | 7.254223 | 6.7 |
In order to evaluate whether or not the model provides reliable estimates, the data are split into training and testing sets.
Table 6 shows that the model wrongly predicts 95 movies, while only 35 movies are correctly predicted and captured within the prediction interval.
set.seed(1234)
training.samples <- movies$imdb_rating %>%
createDataPartition(p = 0.8, list = FALSE)
train.set <- movies[training.samples, ]
test.set <- movies[-training.samples, ]
preds.pi <- predict(FitBC, test.set, interval = "confidence") %>%
transform(fit = (fit * 2 + 1)^(1/2),
lwr = (lwr * 2 + 1)^(1/2),
upr = (upr * 2 + 1)^(1/2))
modelEval.pi <- cbind(preds.pi, test.set$imdb_rating) %>%
mutate(capture_pi = ifelse(lwr < test.set$imdb_rating &
upr > test.set$imdb_rating, "yes", "no"))
colnames(modelEval.pi) <- c("fit", "lower bound", "upper bound", "IMDb rating", "captured interval")
head(tibble(modelEval.pi)) %>%
kbl(caption ="Table 6 - Prediction Evaluation") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
add_header_above(c("Prediction" = 3, " " = 1, " " = 1))| fit | lower bound | upper bound | IMDb rating | captured interval |
|---|---|---|---|---|
| 6.656337 | 6.570826 | 6.740763 | 5.5 | no |
| 7.366241 | 7.292145 | 7.439598 | 7.3 | yes |
| 6.774342 | 6.687276 | 6.860303 | 6.6 | no |
| 7.502219 | 7.407494 | 7.595764 | 7.6 | no |
| 4.944425 | 4.824123 | 5.061869 | 3.6 | no |
| 6.842395 | 6.748122 | 6.935388 | 7.0 | no |
## x freq
## 1 no 95
## 2 yes 35
genre, runtime, imdb_num_votes, critics_score, critics_rating, audience_rating, and audience_score are potential predictors contributing to popularity on IMDb website.
The fit model is evaluated using adjusted \(Adj. R^2\), RSE or sigma, AIC, BIC, and p-value. The result clearly indicates that the reduced model is significantly better than the full model.
The presence of heteroscedasticity is resolved using Box Cox Transformation, with an increase in \(Adj. R^2\) to 0.8673 (from 0.8318).
The predictive power is evaluated by splitting the dataset into training and test data. The result between predicted ratings and actual ratings suggests that the model produces a bias.
The selected model is used to predict IMDb ratings for Hacksaw Ridge, The Accountant, and Miss Peregrine’s Home for Peculiar Children.The followings are the prediction results;
The potential shortcomings include limited observations and moderately high correlation of given variables, leading to a lower predictive power.