Modeling and prediction for movies

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

library(dplyr)
library(statsr)
library(GGally)

## Warning: package 'GGally' was built under R version 3.3.2

Load data

The dataset is comprised of 615 observations. The each observation has data regarding to 32 variables. There are about 21 categorical variables and the rest are numerical variables.

load("movies.Rdata")
str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Part 1: Data

The model fitted from this dataset can be generalized to predict the parameters of the sample population. First, the data is indepedent. The dataset ‘movies’ contains data regarding 651 randomly sampled movies released before 2016, which is less than 10% of total number of movies released before 2016. Second, the sample size is large enough to assume nealry normal distribution. The causal conclusion can’t be drawm since the dataset is comprised of observational data. The relationship can be discussed instead.

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Part 2: Research question

The poluarity of a movie to audience will be associated with the rating of critics, whether or not the movie was nominated for or won Oscar, whether or not was in the Top 200 Box Office list on BoxOfficeMojo. The genre of movies was excluded from the question since the indivual different preference across the genre would be by chance because of the random sampling. The popularity of the movie to audience can be selected from either IMDB or Rotten Tomatoes. However, in this analysis a new score combined from IMDB and Rotten Tomatoes will be created to represent the popularity/acceptance of audience to a movie. Finally the critics, who are leading opinions on the movie industries, as well as Oscar nomiation, Oscar wining and Box office performance would be related to the popularity of a movie.

Part 3: Exploratory data analysis

First, let’s create a new data set only containing the variables of interest and name it as movies_sm.

movies_sm <- movies %>%
    select(title, title_type, genre, mpaa_rating, imdb_rating, critics_score, audience_score, best_pic_nom, best_pic_win, top200_box)
str(movies_sm)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  10 variables:
##  $ title         : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type    : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre         : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ mpaa_rating   : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ imdb_rating   : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ critics_score : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_score: num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ top200_box    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Before combining scores of IMDB and Rotten Tomatoes, investigate the relationship between two variables whether they are positive or negative or how strong the relationship is. According to the plot and the correlation coefficient score, the relationship between two is very strong postive relationships. An increase in the IMDB rating is expected to an increase in Rotten Tomatoes’ audience score as well. Here we can choose one of the two for the response variables for the model.

ggplot(data = movies_sm, aes(x=imdb_rating, y=audience_score))+
    geom_jitter()+
    geom_smooth(method = "lm", se = FALSE)

movies_sm %>%
    summarise(cor(audience_score, imdb_rating))

## # A tibble: 1 × 1
##   `cor(audience_score, imdb_rating)`
##                                <dbl>
## 1                          0.8648652

However, the combining two scores into a single score representing the rating of audience can also be an option for the model fitting in hope that it will better predit the parameter. In order to combine the two score, the difference of the scale used by different variables to be addressed by multiplying 10 to imdb_rating and name it as imdb_100. Instead of normalizing the two scores, this approach will give more intuitive and understandable to audience. Then lastly, create a new variable audience_one_score with single score calculated the avergae score of two variables.

movie_new <- movies_sm %>%
    mutate(imdb_100 = imdb_rating*10, audience_one_score=(imdb_100+audience_score)/2)

To investigate a single score audience_one_score can be substituted for the original two variables, the relationships between the single score and the original two variables must be strongly related. As the plots present, the correlation coefficient of imdb_rating and audience_score to the single score are 0.941 and 0.948 respectively, which are very strong positive relationships.

ggpairs(movie_new, columns = c(5,7,11:12))

Part 4: Modeling

First, the relatsionship of explanaotry variables with the response variable as well as among the explanatory vairables should be check to see the relationships of each explanatory variables and the response variable and if there is any chance for collinearity. Based on the plot and the summary, best_pic_win: Whether or not the movie won a best picture Oscar has a very small observations with only 7. However, the analysis will include this variables to see its relationships with success of a movie.

ggpairs(movie_new, columns = c(6, 8:10, 12))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(movie_new)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##   mpaa_rating   imdb_rating    critics_score    audience_score 
##  G      : 19   Min.   :1.900   Min.   :  1.00   Min.   :11.00  
##  NC-17  :  2   1st Qu.:5.900   1st Qu.: 33.00   1st Qu.:46.00  
##  PG     :118   Median :6.600   Median : 61.00   Median :65.00  
##  PG-13  :133   Mean   :6.493   Mean   : 57.69   Mean   :62.36  
##  R      :329   3rd Qu.:7.300   3rd Qu.: 83.00   3rd Qu.:80.00  
##  Unrated: 50   Max.   :9.000   Max.   :100.00   Max.   :97.00  
##                                                                
##  best_pic_nom best_pic_win top200_box    imdb_100     audience_one_score
##  no :629      no :644      no :636    Min.   :19.00   Min.   :17.50     
##  yes: 22      yes:  7      yes: 15    1st Qu.:59.00   1st Qu.:52.50     
##                                       Median :66.00   Median :65.50     
##                                       Mean   :64.93   Mean   :63.65     
##                                       3rd Qu.:73.00   3rd Qu.:76.00     
##                                       Max.   :90.00   Max.   :93.50     
##

For model selection, the backward elemination from full model will be used. This is because the number of explanatory variables is not too large to start with a full model. Furthermore, based on the idea that the more explanatory variables, the better model fitted, will start with the best possible prediction power model.

movie_m_full = lm(audience_one_score ~ critics_score + best_pic_nom + best_pic_win + top200_box, data = movie_new)
summary(movie_m_full)

## 
## Call:
## lm(formula = audience_one_score ~ critics_score + best_pic_nom + 
##     best_pic_win + top200_box, data = movie_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.604  -6.495   0.326   6.733  33.441 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     41.00684    0.88651  46.257   <2e-16 ***
## critics_score    0.38804    0.01405  27.623   <2e-16 ***
## best_pic_nomyes  6.38371    2.48197   2.572   0.0103 *  
## best_pic_winyes -0.30405    4.29840  -0.071   0.9436    
## top200_boxyes    1.81876    2.61662   0.695   0.4873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.937 on 646 degrees of freedom
## Multiple R-squared:  0.5668, Adjusted R-squared:  0.5641 
## F-statistic: 211.3 on 4 and 646 DF,  p-value: < 2.2e-16

summary(movie_m_full)$adj.r.sqaured

## NULL

From now, we will eliminate the variables one by one to find the best fit model.
This is the first step of elimination.

movie_m1 = lm(audience_one_score ~ best_pic_nom + best_pic_win + top200_box, data = movie_new)
summary(movie_m1)$adj.r.squared

## [1] 0.05072184

movie_m2 = lm(audience_one_score ~ critics_score + best_pic_win + top200_box, data = movie_new)
summary(movie_m2)$adj.r.squared

## [1] 0.5603297

movie_m3 = lm(audience_one_score ~ critics_score + best_pic_nom + top200_box, data = movie_new)
summary(movie_m3)$adj.r.squared

## [1] 0.5647831

movie_m4 = lm(audience_one_score ~ critics_score + best_pic_nom + best_pic_win, data = movie_new)
summary(movie_m4)$adj.r.squared

## [1] 0.564461

The model movie_m3 has highest R-sqaured with 0.5647. The second step of elimination will be starting with expalanatory variables of movie_m3 model.

movie_m31 = lm(audience_one_score ~ best_pic_nom + top200_box, data = movie_new)
summary(movie_m31)$adj.r.squared

## [1] 0.0517663

movie_m32 = lm(audience_one_score ~ critics_score + top200_box, data = movie_new)
summary(movie_m32)$adj.r.squared

## [1] 0.5599482

movie_m33 = lm(audience_one_score ~ critics_score + best_pic_nom, data = movie_new)
summary(movie_m33)$adj.r.squared

## [1] 0.5651321

The second step elminination will the final step for multiple regression model. And the final model will be movie_m33 with the highest R-squared score with 0.5651. The summary of the model is presented below. Based on the model, we can interpret that a unit increase in critics score on Rotten Tomatoes is expected an average 0.389 increase in the audience’s scores on a movie with every other held constant. Likewise, the nomiation of the best picture for Oscar is expected to an average 6.4 increase in the audience’s score on a movie with every other held constant.

summary(movie_m33)

## 
## Call:
## lm(formula = audience_one_score ~ critics_score + best_pic_nom, 
##     data = movie_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.610  -6.438   0.229   6.781  33.444 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     41.00062    0.88517  46.320  < 2e-16 ***
## critics_score    0.38881    0.01398  27.817  < 2e-16 ***
## best_pic_nomyes  6.40546    2.19534   2.918  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.925 on 648 degrees of freedom
## Multiple R-squared:  0.5665, Adjusted R-squared:  0.5651 
## F-statistic: 423.4 on 2 and 648 DF,  p-value: < 2.2e-16

Part 5: Prediction

For prediction using model movie_m33, the movie La La Land released in 2016 was selected fro prediction. The critics score from Rotten Tomato is 93 (rottentomatoes.com). It was also nominated as the best picture for Oscar (Oscar.go.com). Now, we have all the data for the explanatory variable for prediction of rating of the audience. The model predicts that the audience score would be 83.57 and the calculated single score of audience is 83, which is close to the prediction. For the calculation of the single score of audience, we referred to the audience score from Rotten Tomatoes and rating on IMDB which is 83 and 8.3 respectively.

pred_lala <- data.frame(critics_score = 93, best_pic_nom = "yes")
predict(movie_m33, pred_lala)

##        1 
## 83.56521

# Calculated single score
lalaland_sing_rating <- (8.3*10+83)/2
print(lalaland_sing_rating)

## [1] 83

In order to quantify the uncertainty around the prediction, the prediction interval would provide the range of certainty that the prediction is correct. According to the prediction interval, we are 95% confident that the model predict that the audiences would score the movie La La Land between 63.64 and 103.49.

predict(movie_m33, pred_lala, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 83.56521 63.63769 103.4927

Part 6: Conclusion

The variability of the audience’s score is best explained by the combination of such explanatory variables as critics score from Rotten Tomatoes and whether or not the movie was the best picture nominated for Oscar. Including the variable best_pic_nom into model was problematic because the cases for the nominee was 7 which is very small for the model.
Besides, the operational definition of a movie success may differ. However, the amount of ticket sales will be the best mearue for a movie success. Instead of using variable whether or not the movie is top 200 in the box office, the total sales in dollar should be considered as a response variable in the future.
The final problem is the model condition, especially with the condition of cosntant variability of residuals. The residual plot is fan shaped which means that the variability of residuals is not constant.

Appedix

Checking for model condition.

Linear relationships bewteen each explanatory variables and the response variable
There are a few case for outliers in categorical explantory variable, which is not strongly influencing the relationships. A strong postive linear relationships are witnessed between critics_score and audience_one_score.

ggpairs(movie_new, columns = c(6, 8:10, 12))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Nearly normal residuals with mean 0:
The histogram of residuals of model movie_m33 shows that the distribution is nearly normal with mean 0.

ggplot(movie_m33, aes(x=.resid))+
    geom_histogram(binwidth = 5)+
    xlab("Residuals")

qqnorm(movie_m33$residuals)
qqline(movie_m33$residuals)

Constant variability of residuals:
According to the residual plot, it shows fan-out shape which means that there is suspicion on the constant variablity.

ggplot(movie_m33, aes(x=.fitted, y=.resid))+
    geom_point()+
    geom_hline(yintercept = 0, linetype="dashed")+
    xlab("Fitted Values")+
    ylab("Residuals")

Independence of residuals:
Since the data isn’t time series but is from one semester and the scatterplot of residuals are almost at random, the samples are assumed as indepedent which in turn the residuals are independent.

plot(movie_m33$residuals)