Modeling and Prediction for Movies

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.1

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.1

library(statsr)
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.5.1

library(knitr)
library(gridExtra)

Load data

load("movies.Rdata")

* * *

Part 1: Data

The data come from Rotten Tomatoes (www.rottentomatoes.com) and the Internet Movie Database (IMDB, www.imdb.com) and contain a variety of information about each movie (including genre, MPAA rating, release date, actors, etc.).

Generalizability: Observations were randomly sampled from the sources specified above, so it is reasonable to assume that the results generalize to the target population. Because the data were randomly sampled, they are representative of movies produced and released prior to 2016.

Causality: No causal relations can be derived from these data, ONLY CORRELATION. Although random sampling was used, the data are purely observational and no random assignment took place. Thus, the data cannot be used to evaluate causation of any kind.

* * *

Part 2: Research question

Premise/Backstory Behind the Research Question: I got a new job as a data scientist at Paramount Pictures. My boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies.

She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants me to figure it all out.

Research Question: Are there any factors (or a single factor) that correlate with a movie’s rating on IMDB? Can these factors be used to predict the quality/success of a movie (here, quality/success is measured by the IMDB rating: a higher rating means higher audience approval of the movie)? Is there evidence of any collinearity among the variables?

Motivation: Since elementary school, IMDB has been my go-to website to gauge whether a movie is worth watching or not. The IMDB website touts itself as “the world’s most popular and authoritative source for movie, TV and celebrity content.” It would be interesting to see which factors correlate with the IMDB score and whether any of them show a linear relationship with it.

* * *

Part 3: Exploratory data analysis

First, get a glimpse of the data:

summary(movies$imdb_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.493   7.300   9.000

The mean IMDB rating is 6.493 with a median rating of 6.6.

sd(movies$imdb_rating)

## [1] 1.084747

The standard deviation is 1.084747.

To see if the distribution of the ratings is normal (or nearly normal), a plot of the distribution is needed:

movies %>% 
  ggplot(aes(x=imdb_rating)) + 
      geom_bar(fill="blue") + 
      xlab("IMDB rating") +
      ylab("Frequency")

The distribution of IMDB ratings is visibly left-skewed (negative skewness): a long tail of low ratings pulls the mean (6.493) below the median (6.6), so most movies are rated above the mean.

Since the research question also asks about potential collinearity, a correlation plot is used to view the variables:
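The visual impression of skew can be cross-checked with a number. The following is a minimal sketch, assuming the simple moment-based definition of sample skewness; `sample_skewness()` is a hypothetical helper and the `toy_ratings` vector is made up for illustration, not taken from the movies data.

```r
# Moment-based sample skewness: negative values indicate a longer left tail.
# sample_skewness() is a hypothetical helper, not part of base R.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up ratings for illustration only (not taken from the movies data).
toy_ratings <- c(2.1, 4.8, 5.9, 6.3, 6.6, 6.8, 7.0, 7.3, 7.5, 8.1)

skew <- sample_skewness(toy_ratings)
skew                                      # negative for a left-skewed sample
mean(toy_ratings) < median(toy_ratings)   # the left tail pulls the mean below the median
```

With the real data loaded, the same helper could be called as `sample_skewness(movies$imdb_rating)`.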

numeric_movies = movies %>% 
  select(title_type:top200_box) %>% 
  mutate_all(as.numeric)

numeric_movies %>% 
  cor(use = "complete.obs") %>% 
  corrplot(method="shade",
        shade.col=NA,
        cl.pos = "n",
        tl.col="black",
        tl.srt=45)

Based on the plot above, there appears to be more than one variable that has a high correlation with the IMDB rating variable. The variables that appear to be highly correlated to the IMDB ratings are the following: audience_rating, audience_score, critics_rating, and critics_score.

Now that the plot showed these correlations, the four variables and their correlation values with the IMDB rating are extracted:

numeric_movies %>% 
  select(imdb_rating, audience_rating, audience_score, critics_rating, critics_score) %>% 
  cor(use = "complete.obs") %>% 
  round(2) %>% 
  kable()

	imdb_rating	audience_rating	audience_score	critics_rating	critics_score
imdb_rating	1.00	0.70	0.86	-0.62	0.77
audience_rating	0.70	1.00	0.86	-0.53	0.59
audience_score	0.86	0.86	1.00	-0.60	0.70
critics_rating	-0.62	-0.53	-0.60	1.00	-0.83
critics_score	0.77	0.59	0.70	-0.83	1.00

The four variables extracted above are all subjective opinions of people (the audience and critics). They represent a “human factor” in evaluation and are strongly correlated with one another (absolute correlations between 0.53 and 0.86), so they can be considered collinear. Since the research question focuses on IMDB ratings as the primary measure of performance (on the 1 to 10 scale), these four variables will be excluded from the dataset.

numeric_movies = numeric_movies %>% 
  select(-critics_rating, -critics_score, -audience_rating, -audience_score)
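Pairwise correlations are one way to spot collinearity; a common complement is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The sketch below is illustrative only: `vif_by_hand()` is a hypothetical helper, and the built-in `mtcars` data stand in for the movie variables.

```r
# Variance inflation factor for each column of a numeric data frame:
# regress the column on all the others and compute 1 / (1 - R^2).
# vif_by_hand() is a hypothetical helper written for this sketch.
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Stand-in data: three correlated columns from the built-in mtcars set.
predictors <- mtcars[, c("disp", "hp", "wt")]
vifs <- vif_by_hand(predictors)
round(vifs, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate increasing redundancy. The same helper could, under the same assumptions, be applied to the four rating columns before they are dropped.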

* * *

Part 4: Modeling

After removing the four collinear variables (audience rating, audience score, critics rating, critics score) mentioned in the previous section, the remaining variables that are neither numeric nor categorical (movie title, studio, actors, directors, URLs, etc.) will also be left out of the full model. These are essentially free-text identifiers with very many distinct values, so they cannot be modeled meaningfully.

Only the following numeric and categorical variables will be considered for the full model.

Variable - Description
title_type - Type of movie
genre - Genre of movie
runtime - Runtime of movie (in minutes)
mpaa_rating - MPAA rating of the movie
thtr_rel_year - Year the movie is released in theaters
imdb_rating - Rating on IMDB
best_pic_nom - Whether or not the movie was nominated for a best picture Oscar
best_pic_win - Whether or not the movie won a best picture Oscar
best_actor_win - Whether or not one of the main actors in the movie ever won an Oscar
best_actress_win - Whether or not one of the main actresses in the movie ever won an Oscar
best_dir_win - Whether or not the director of the movie ever won an Oscar
top200_box - Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo

Since the research question seeks statistically significant predictors, the variables will be further whittled down using backward stepwise selection based on p-value. This method starts from the full model and removes the variable with the highest p-value, one step at a time.

lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies) %>% 
  summary()

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8697 -0.4878  0.0616  0.5921  2.0046 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    12.074704   7.170786   1.684 0.092706 .  
## title_typeFeature Film         -0.795096   0.332657  -2.390 0.017138 *  
## title_typeTV Movie             -1.373207   0.523852  -2.621 0.008971 ** 
## genreAnimation                 -0.171716   0.355277  -0.483 0.629031    
## genreArt House & International  0.628210   0.271781   2.311 0.021132 *  
## genreComedy                    -0.083033   0.150776  -0.551 0.582035    
## genreDocumentary                0.918563   0.357447   2.570 0.010407 *  
## genreDrama                      0.612370   0.128470   4.767 2.33e-06 ***
## genreHorror                    -0.135842   0.224828  -0.604 0.545930    
## genreMusical & Performing Arts  0.945776   0.304956   3.101 0.002013 ** 
## genreMystery & Suspense         0.414523   0.167938   2.468 0.013842 *  
## genreOther                      0.498389   0.255731   1.949 0.051757 .  
## genreScience Fiction & Fantasy -0.304731   0.320228  -0.952 0.341665    
## runtime                         0.010373   0.002154   4.817 1.83e-06 ***
## mpaa_ratingNC-17               -0.362918   0.682024  -0.532 0.594833    
## mpaa_ratingPG                  -0.587293   0.248170  -2.366 0.018262 *  
## mpaa_ratingPG-13               -0.784461   0.259655  -3.021 0.002621 ** 
## mpaa_ratingR                   -0.468523   0.250175  -1.873 0.061566 .  
## mpaa_ratingUnrated             -0.310926   0.291280  -1.067 0.286183    
## thtr_rel_year                  -0.002951   0.003585  -0.823 0.410754    
## best_pic_nomyes                 0.888587   0.230827   3.850 0.000131 ***
## best_pic_winyes                -0.055243   0.409861  -0.135 0.892826    
## best_actor_winyes              -0.043021   0.107055  -0.402 0.687927    
## best_actress_winyes             0.001246   0.118535   0.011 0.991614    
## best_dir_winyes                 0.350842   0.154500   2.271 0.023498 *  
## top200_boxyes                   0.575735   0.243214   2.367 0.018228 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8972 on 624 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3423, Adjusted R-squared:  0.316 
## F-statistic: 12.99 on 25 and 624 DF,  p-value: < 2.2e-16

For the initial model, the adjusted R-squared value is 0.316. There are 13 statistically significant (p-value < 0.05) coefficients. The next step is to eliminate the variable with the highest p-value, which is best_actress_win (p-value = 0.99).

lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year + best_pic_nom + best_pic_win + best_actor_win + best_dir_win + top200_box, data = movies) %>% 
  summary()

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + best_pic_nom + best_pic_win + best_actor_win + 
##     best_dir_win + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8699 -0.4872  0.0615  0.5919  2.0045 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    12.075168   7.164912   1.685 0.092426 .  
## title_typeFeature Film         -0.795087   0.332390  -2.392 0.017050 *  
## title_typeTV Movie             -1.373054   0.523231  -2.624 0.008898 ** 
## genreAnimation                 -0.171521   0.354508  -0.484 0.628678    
## genreArt House & International  0.628344   0.271264   2.316 0.020862 *  
## genreComedy                    -0.082883   0.149979  -0.553 0.580716    
## genreDocumentary                0.918655   0.357055   2.573 0.010316 *  
## genreDrama                      0.612545   0.127294   4.812 1.87e-06 ***
## genreHorror                    -0.135773   0.224555  -0.605 0.545643    
## genreMusical & Performing Arts  0.945782   0.304712   3.104 0.001996 ** 
## genreMystery & Suspense         0.414720   0.166757   2.487 0.013143 *  
## genreOther                      0.498486   0.255362   1.952 0.051376 .  
## genreScience Fiction & Fantasy -0.304718   0.319969  -0.952 0.341295    
## runtime                         0.010376   0.002140   4.850 1.56e-06 ***
## mpaa_ratingNC-17               -0.363092   0.681279  -0.533 0.594254    
## mpaa_ratingPG                  -0.587267   0.247959  -2.368 0.018168 *  
## mpaa_ratingPG-13               -0.784428   0.259428  -3.024 0.002600 ** 
## mpaa_ratingR                   -0.468537   0.249972  -1.874 0.061347 .  
## mpaa_ratingUnrated             -0.310951   0.291037  -1.068 0.285740    
## thtr_rel_year                  -0.002952   0.003582  -0.824 0.410296    
## best_pic_nomyes                 0.888833   0.229447   3.874 0.000118 ***
## best_pic_winyes                -0.055026   0.409017  -0.135 0.893024    
## best_actor_winyes              -0.042957   0.106799  -0.402 0.687655    
## best_dir_winyes                 0.350843   0.154377   2.273 0.023386 *  
## top200_boxyes                   0.575869   0.242688   2.373 0.017952 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8965 on 625 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3423, Adjusted R-squared:  0.3171 
## F-statistic: 13.55 on 24 and 625 DF,  p-value: < 2.2e-16

After eliminating best_actress_win, the adjusted R-squared value of the reduced model is 0.3171, a very slight improvement over the 0.316 of the previous model. The same 13 statistically significant (p-value < 0.05) coefficients remain in this new model.

The next highest p-value belongs to best_pic_win (p-value = 0.89), which is the next variable to be eliminated from the model.
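The multiple R-squared is 0.3423 in both summaries, yet the adjusted value rises. This follows from the usual formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), which penalizes each extra coefficient. The sketch below reproduces the two reported values approximately; `adj_r2()` is a hypothetical helper, and small differences in the last digit are expected because the reported R-squared is itself rounded.

```r
# Adjusted R-squared from multiple R-squared, sample size n, and number of
# estimated slope coefficients k. adj_r2() is a hypothetical helper.
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

n <- 650  # observations used in the fit: residual df (624) + 25 slopes + intercept

full    <- adj_r2(0.3423, n, k = 25)  # model with best_actress_win
reduced <- adj_r2(0.3423, n, k = 24)  # model without best_actress_win

round(c(full = full, reduced = reduced), 3)
```

Dropping a term that explains almost nothing leaves R-squared unchanged but shrinks the penalty factor (n - 1)/(n - k - 1), so the adjusted value goes up.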

lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year + best_pic_nom + best_actor_win + best_dir_win + top200_box, data = movies) %>% 
  summary()

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + best_pic_nom + best_actor_win + best_dir_win + 
##     top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8699 -0.4875  0.0612  0.5937  2.0047 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    12.080326   7.159188   1.687  0.09203 .  
## title_typeFeature Film         -0.795008   0.332128  -2.394  0.01697 *  
## title_typeTV Movie             -1.373596   0.522805  -2.627  0.00882 ** 
## genreAnimation                 -0.171993   0.354212  -0.486  0.62745    
## genreArt House & International  0.627993   0.271039   2.317  0.02083 *  
## genreComedy                    -0.083700   0.149739  -0.559  0.57638    
## genreDocumentary                0.918317   0.356766   2.574  0.01028 *  
## genreDrama                      0.612447   0.127192   4.815 1.85e-06 ***
## genreHorror                    -0.136008   0.224372  -0.606  0.54462    
## genreMusical & Performing Arts  0.946067   0.304465   3.107  0.00197 ** 
## genreMystery & Suspense         0.414281   0.166594   2.487  0.01315 *  
## genreOther                      0.500318   0.254799   1.964  0.05002 .  
## genreScience Fiction & Fantasy -0.304229   0.319697  -0.952  0.34166    
## runtime                         0.010361   0.002135   4.853 1.54e-06 ***
## mpaa_ratingNC-17               -0.363764   0.680726  -0.534  0.59327    
## mpaa_ratingPG                  -0.587313   0.247765  -2.370  0.01807 *  
## mpaa_ratingPG-13               -0.783766   0.259177  -3.024  0.00260 ** 
## mpaa_ratingR                   -0.468314   0.249770  -1.875  0.06126 .  
## mpaa_ratingUnrated             -0.310817   0.290807  -1.069  0.28557    
## thtr_rel_year                  -0.002953   0.003580  -0.825  0.40964    
## best_pic_nomyes                 0.875100   0.205327   4.262 2.34e-05 ***
## best_actor_winyes              -0.041831   0.106387  -0.393  0.69431    
## best_dir_winyes                 0.344908   0.147822   2.333  0.01995 *  
## top200_boxyes                   0.574453   0.242270   2.371  0.01804 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8958 on 626 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3423, Adjusted R-squared:  0.3181 
## F-statistic: 14.17 on 23 and 626 DF,  p-value: < 2.2e-16

In this further reduced model, the adjusted R-squared value increases again, to 0.3181 (from 0.3171 previously). The variable with the highest p-value is now best_actor_win (p-value = 0.69), so it is eliminated.

The next highest p-values belong to genre (level: Animation) and mpaa_rating (level: NC-17). However, each of these is only one level of a multi-level variable, and other levels of the same variables have very small p-values (Drama, for example), so neither genre nor mpaa_rating can be eliminated.

The variable with the next highest p-value that can be eliminated is thtr_rel_year (p-value = 0.41). The model below is refit without both best_actor_win and thtr_rel_year.

lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + best_pic_nom + best_dir_win + top200_box, data = movies) %>% 
  summary()

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     best_pic_nom + best_dir_win + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8204 -0.4895  0.0624  0.5715  2.0121 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.200083   0.450653  13.758  < 2e-16 ***
## title_typeFeature Film         -0.793327   0.331769  -2.391  0.01709 *  
## title_typeTV Movie             -1.366333   0.522243  -2.616  0.00910 ** 
## genreAnimation                 -0.214358   0.350593  -0.611  0.54115    
## genreArt House & International  0.638106   0.270543   2.359  0.01865 *  
## genreComedy                    -0.082706   0.149579  -0.553  0.58051    
## genreDocumentary                0.917307   0.356418   2.574  0.01029 *  
## genreDrama                      0.613416   0.126917   4.833 1.69e-06 ***
## genreHorror                    -0.120068   0.223456  -0.537  0.59124    
## genreMusical & Performing Arts  0.947510   0.304158   3.115  0.00192 ** 
## genreMystery & Suspense         0.408227   0.165663   2.464  0.01400 *  
## genreOther                      0.515934   0.253709   2.034  0.04242 *  
## genreScience Fiction & Fantasy -0.288376   0.318910  -0.904  0.36621    
## runtime                         0.010423   0.002078   5.016 6.89e-07 ***
## mpaa_ratingNC-17               -0.362503   0.678481  -0.534  0.59333    
## mpaa_ratingPG                  -0.604628   0.246807  -2.450  0.01457 *  
## mpaa_ratingPG-13               -0.831252   0.252666  -3.290  0.00106 ** 
## mpaa_ratingR                   -0.504954   0.245621  -2.056  0.04021 *  
## mpaa_ratingUnrated             -0.364475   0.282897  -1.288  0.19809    
## best_pic_nomyes                 0.865622   0.204247   4.238 2.59e-05 ***
## best_dir_winyes                 0.351491   0.147323   2.386  0.01733 *  
## top200_boxyes                   0.573250   0.242019   2.369  0.01816 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.895 on 628 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3414, Adjusted R-squared:  0.3194 
## F-statistic:  15.5 on 21 and 628 DF,  p-value: < 2.2e-16

This final reduced model has the highest adjusted R-squared value yet, 0.3194 (an increase from the previous value of 0.3181). Every remaining variable is a statistically significant predictor (p-value < 0.05) for at least one of its levels. As noted above, genre and mpaa_rating cannot be eliminated because these variables have multiple levels, at least one of which is statistically significant.

Now that the final reduced model has been established, the model diagnostics can be plotted.
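The coefficient table tests each factor level against a reference level, which is why multi-level variables such as genre need special handling. An alternative is to test each term as a whole with a partial F-test, which base R provides through `drop1(..., test = "F")`. The following is a minimal sketch of p-value-based backward elimination built on that idea. It is not the procedure used above: `backward_by_p()` is a hypothetical helper, the 0.05 threshold is an assumption, and the built-in `mtcars` data stand in for the movies data.

```r
# Backward elimination by p-value using term-level F-tests, so a factor is
# kept or dropped as a whole. backward_by_p() is a hypothetical helper.
backward_by_p <- function(model, alpha = 0.05) {
  repeat {
    tests <- drop1(model, test = "F")
    pvals <- tests[["Pr(>F)"]][-1]          # first row is "<none>"
    term_names <- rownames(tests)[-1]
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- term_names[which.max(pvals)]   # term with the highest p-value
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in data: mtcars with cylinder count treated as a multi-level factor.
cars <- transform(mtcars, cyl = factor(cyl))
start <- lm(mpg ~ cyl + wt + hp + qsec + drat, data = cars)
final <- backward_by_p(start)
formula(final)
```

Applied to the movie model, `drop1()` would report a single p-value each for genre and mpaa_rating rather than one per level, which makes the keep-or-drop decision for those variables explicit.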

model_features = fortify(lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + best_pic_nom + best_dir_win + top200_box, data = movies))

p1 <- ggplot(model_features, aes(x=.fitted, y=.resid))+geom_point() +
      geom_smooth(se=FALSE)+geom_hline(yintercept=0, col="red", linetype="dashed") +
      xlab("Fitted Values")+ylab("Residuals") +
      ggtitle("Residual vs Fitted Plot")

model_features$.qqnorm <- qqnorm(model_features$.stdresid, plot.it=FALSE)$x  
y <- quantile(model_features$.stdresid, c(0.25, 0.75))
x <- quantile(model_features$.qqnorm, c(0.25, 0.75)) 

# Compute the line slope
slope <- diff(y) / diff(x)             
# Compute the line intercept
int <- y[1] - slope * x[1]             

p2 <- ggplot(model_features, aes(.qqnorm, .stdresid)) +
      geom_point(na.rm = TRUE) +
      geom_abline(intercept=int, slope=slope, color="red") +
      xlab("Theoretical Quantiles")+ylab("Standardized Residuals") +
      ggtitle("Normal Q-Q Plot")

p3 <- ggplot(data=model_features, aes(x=.resid)) + 
      geom_histogram(binwidth=0.19, fill="blue") +
      xlab("Residuals") +
      ggtitle("Distribution of Residuals")

grid.arrange(p1, p3, p2, nrow=3, top="Model Diagnostic Plots")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The model diagnostic plots above show that the model satisfies the MLR conditions at least partially. The residuals are randomly scattered around zero for fitted values < 7 in the scatter plot above, but the amount of scatter decreases when fitted values are > 7, which hints at non-constant variance. The histogram of the residuals is nearly normal (although slightly left-skewed). The Q-Q plot likewise shows a nearly normal distribution, with the same slight left skew.

Two of the multi-level variables deserve a closer look. Their coefficients are measured relative to the reference level of each variable (“Action & Adventure” for genre and “G” for mpaa_rating), which does not appear in the coefficient table.

* genre - The following levels show statistical significance: “Art House & International”, “Documentary”, “Drama”, “Musical & Performing Arts”, “Mystery & Suspense” and “Other”. Of these, “Musical & Performing Arts” has the largest effect on the IMDB rating, with a positive estimate of 0.95.

* mpaa_rating - The following levels show statistical significance: “PG”, “PG-13” and “R”. All three have a negative effect on the IMDB rating, with PG-13 having the strongest negative effect (-0.83), compared with PG (-0.60) and R (-0.50).

The best_pic_nom variable also has a positive effect on the IMDB rating (a movie nominated for best picture at the Oscars, whether or not it won), with an estimate of 0.87 and a tiny p-value of 2.59e-05.

Runtime is also statistically significant, with a minuscule p-value of 6.89e-07 and a small positive effect on the rating: an estimate of 0.01 per minute, or roughly 0.1 rating points for every additional 10 minutes. The remaining predictors (title_type, best_dir_win and top200_box) are significant at the 0.05 level as well.

* * *

Part 5: Prediction

Using the model, I’d like to predict the IMDB rating of “A Dog’s Purpose”, a film I’ve been interested in seeing, released in 2017. If the model has any predictive power, the predicted IMDB rating should be close to the actual IMDB score on the website.

model = lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + best_pic_nom + best_dir_win + top200_box, data = movies)
real_data = data.frame(title_type = "Feature Film", genre="Drama", runtime=100, mpaa_rating="PG", best_pic_nom = "no", best_dir_win="no", top200_box = "yes")
predicted_rating = predict(model, real_data, interval="prediction")

df = data.frame(t="A Dog's Purpose",
                 p=sprintf("%2.1f", predicted_rating[1]),
                 i=sprintf("%2.1f - %2.1f", predicted_rating[2], predicted_rating[3]), 
                 r="7.0")
kable(df, col.names=c("Movie Title", "Predicted Rating", "95% Prediction Interval", "Actual Rating"))

Movie Title	Predicted Rating	95% Prediction Interval	Actual Rating
A Dog’s Purpose	7.0	5.2 - 8.9	7.0

The predicted rating corresponds exactly to the actual rating found on the website: https://www.imdb.com/title/tt1753383/?ref_=nv_sr_1. This suggests that the model has some predictive power, although the 95% prediction interval (5.2 - 8.9) is wide and a single movie cannot confirm it.
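The point prediction can be traced back to the coefficient table of the final model. The sketch below adds up the relevant estimates by hand for the inputs used above (Feature Film, Drama, 100 minutes, PG, top200_box = yes); the names in `contributions` are labels chosen for this illustration. The sum comes to about 7.03, which rounds to the 7.0 shown in the table.

```r
# Point prediction assembled by hand from the final model's coefficient table.
# Reference levels and "no" flags (best_pic_nom, best_dir_win) contribute 0.
contributions <- c(
  intercept       = 6.200083,
  feature_film    = -0.793327,
  genre_drama     = 0.613416,
  runtime_100_min = 100 * 0.010423,
  mpaa_pg         = -0.604628,
  top200_box_yes  = 0.573250
)
predicted <- sum(contributions)
round(predicted, 2)

# Lower bound on the 95% prediction-interval half-width, using only the
# residual standard error (0.895 on 628 df) and ignoring uncertainty in the fit.
half_width_min <- qt(0.975, df = 628) * 0.895
round(half_width_min, 2)
```

The second calculation shows why the interval is so wide: the residual standard error alone implies a margin of well over one and a half rating points on either side of the prediction.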

* * *

Part 6: Conclusion

Based on the results above, it is possible to conclude that an MLR (multiple linear regression) model is an effective tool for predicting the response variable, given a linear relationship.

The above prediction successfully predicted the IMDB rating of the drama “A Dog’s Purpose” (2017) with the statistically significant variables: title_type, genre, runtime, mpaa_rating, best_pic_nom, best_dir_win, and top200_box. The predicted IMDB rating was 7.0, which was exactly the actual IMDB rating of 7.0. Although the total explained variability with the penalty (adjusted R-squared) was equal to 0.3194 in this case, the model perfectly predicted the actual rating.

The research question asked whether any factors (or a single factor) correlate with a movie’s IMDB rating and whether these variables could be used to predict it (here, a higher IMDB rating is taken to mean greater quality/success/audience approval). Taking into consideration all of the analysis done for this project, the answer is yes: several factors are associated with the IMDB rating, namely title_type, genre, runtime, mpaa_rating, best_pic_nom, best_dir_win, and top200_box. Combined, these variables have statistically significant predictive power for the IMDB rating of a movie. This is further confirmed by the p-values < 0.05 of all these variables (for at least one level, in the case of variables with multiple levels).

Still, there is always room for improvement, since the ratings and the model residuals show a slightly left-skewed distribution. It may be necessary to refine the model further using other methods (e.g. bootstrapping).

We can suggest that our new employer produce a movie in the Musical & Performing Arts genre with a longer runtime and a G rating for best results. Of course, the quality of the movie should be such that it is at least nominated for a best picture Oscar that year.
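As an illustration of what bootstrapping could look like here, the sketch below computes a percentile interval for a regression slope by refitting the model on resampled rows, which avoids relying on normally distributed residuals. It is only a sketch under stated assumptions: the built-in `mtcars` data stand in for the movies data, and the seed and the number of resamples are arbitrary choices.

```r
# Case-resampling bootstrap for a regression coefficient: refit the model on
# rows drawn with replacement and take percentile limits of the estimates.
set.seed(2018)                         # arbitrary seed, for reproducibility only
n_boot <- 2000                         # arbitrary number of resamples

boot_slopes <- replicate(n_boot, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[rows, ]))[["wt"]]
})

# 95% percentile interval for the wt slope
ci <- quantile(boot_slopes, c(0.025, 0.975))
ci
```

For the movie model, the same pattern would resample rows of `movies` and collect, for example, the runtime coefficient from each refit.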

Yousie Kim

August 3, 2018
