Setup

Load data

The data file sits in the same directory as the R Markdown file. Once loaded, the data frame is called movies.


Part 1: Data

This data set comprises 651 randomly sampled movies produced and released before 2016, with 32 variables recorded for each movie.

As the movies in this set were randomly sampled, the results can reasonably be generalized to the population of movies released before 2016. Every genre, studio type and MPAA rating is represented, although some categories are thin (for example, only 2 films are rated NC-17).

However, as random assignment was not used in this study, no causation can be established.


Part 2: Research question

In this study, we want to understand which attributes make a movie popular. The proxy that we use for movie popularity is the audience score, as it is crowdsourced and therefore more representative of the general population than of critics. As such, we are concerned with attributes that allow us to predict the audience score for a movie. Some key attributes to be tested are genre, runtime, thtr_rel_month, director and


Part 3: Exploratory data analysis

We first look at the variables in the data frame and their types.

title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes critics_rating critics_score audience_rating audience_score best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win top200_box director actor1 actor2 actor3 actor4 actor5 imdb_url rt_url
Filly Brown Feature Film Drama 80 R Indomina Media Inc.  2013 4 19 2013 7 30 5.5 899 Rotten 45 Upright 73 no no no no no no Michael D. Olmos Gina Rodriguez Jenni Rivera Lou Diamond Phillips Emilio Rivera Joseph Julian Soria http://www.imdb.com/title/tt1869425/ //www.rottentomatoes.com/m/filly_brown_2012/
The Dish Feature Film Drama 101 PG-13 Warner Bros. Pictures 2001 3 14 2001 8 28 7.3 12285 Certified Fresh 96 Upright 81 no no no no no no Rob Sitch Sam Neill Kevin Harrington Patrick Warburton Tom Long Genevieve Mooy http://www.imdb.com/title/tt0205873/ //www.rottentomatoes.com/m/dish/
Waiting for Guffman Feature Film Comedy 84 R Sony Pictures Classics 1996 8 21 2001 8 21 7.6 22381 Certified Fresh 91 Upright 91 no no no no no no Christopher Guest Christopher Guest Catherine O’Hara Parker Posey Eugene Levy Bob Balaban http://www.imdb.com/title/tt0118111/ //www.rottentomatoes.com/m/waiting_for_guffman/
The Age of Innocence Feature Film Drama 139 PG Columbia Pictures 1993 10 1 2001 11 6 7.2 35096 Certified Fresh 80 Upright 76 no no yes no yes no Martin Scorsese Daniel Day-Lewis Michelle Pfeiffer Winona Ryder Richard E. Grant Alec McCowen http://www.imdb.com/title/tt0106226/ //www.rottentomatoes.com/m/age_of_innocence/
Malevolence Feature Film Horror 90 R Anchor Bay Entertainment 2004 9 10 2005 4 19 5.1 2386 Rotten 33 Spilled 27 no no no no no no Stevan Mena Samantha Dark R. Brandon Johnson Brandon Johnson Heather Magee Richard Glover http://www.imdb.com/title/tt0388230/ //www.rottentomatoes.com/m/10004684-malevolence/
Old Partner Documentary Documentary 78 Unrated Shcalo Media Group 2009 1 15 2010 4 20 7.8 333 Fresh 91 Upright 86 no no no no no no Chung-ryoul Lee Choi Won-kyun Lee Sam-soon Moo NA NA http://www.imdb.com/title/tt1334549/ //www.rottentomatoes.com/m/old-partner/
##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
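The code chunks that produced the output above are not shown in this rendering. A minimal sketch of how it could be reproduced is given here, assuming the data file is named movies.Rdata (a hypothetical file name; the report only states that the loaded object is called movies):

```r
# Assumed file name; the report only says the loaded object is called `movies`.
data_file <- "movies.Rdata"

if (file.exists(data_file)) {
  load(data_file)          # expected to create the data frame `movies`
  print(head(movies))      # first six rows
  print(summary(movies))   # per-column summaries, including NA counts
} else {
  message("Place ", data_file, " next to this script to reproduce the output.")
}
```

The guard keeps the script from failing when the file is missing.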

Understanding the Underlying Distribution of Audience Scores
The audience score in this case follows this underlying distribution:

We can see that the distribution of audience scores is left skewed: the mean (62.4) is below the median (65), and scores are most concentrated in the 70-90 range with a longer tail towards low scores.
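The direction of the skew can be checked from the summary statistics alone. The sketch below uses only the numbers reported in the summary output for audience_score; the histogram call in the final comment is an assumption about how the plot could be drawn, not the original code.

```r
# Values copied from the summary() output for audience_score above.
audience_summary <- c(min = 11, q1 = 46, median = 65, mean = 62.36, q3 = 80, max = 97)

# A mean below the median points to a longer lower tail (left skew).
skew_direction <- if (audience_summary[["mean"]] < audience_summary[["median"]]) {
  "left"
} else {
  "right"
}

# The median sits closer to Q3 than to Q1, which points the same way.
lower_spread <- audience_summary[["median"]] - audience_summary[["q1"]]  # 19
upper_spread <- audience_summary[["q3"]] - audience_summary[["median"]]  # 15

print(skew_direction)

# With the data loaded, a histogram shows the shape directly:
# hist(movies$audience_score, breaks = 20, xlab = "Audience score", main = "")
```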

Let us try and visualize some of the other variables in relation to audience score:

Genre to Audience Score
There are a number of genres in which the movies are categorized, and the aim here is to determine if this variable is an influencing factor for the model.

The summary statistics of this are as follows:

## movies$genre: Action & Adventure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   37.00   52.00   53.78   65.00   94.00 
## -------------------------------------------------------- 
## movies$genre: Animation
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   59.00   65.00   62.44   70.00   88.00 
## -------------------------------------------------------- 
## movies$genre: Art House & International
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   51.25   65.50   64.00   80.25   86.00 
## -------------------------------------------------------- 
## movies$genre: Comedy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   37.00   50.00   52.51   67.50   93.00 
## -------------------------------------------------------- 
## movies$genre: Documentary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   76.25   86.00   82.75   89.00   96.00 
## -------------------------------------------------------- 
## movies$genre: Drama
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   52.00   70.00   65.35   80.00   95.00 
## -------------------------------------------------------- 
## movies$genre: Horror
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   36.00   43.00   45.83   53.50   84.00 
## -------------------------------------------------------- 
## movies$genre: Musical & Performing Arts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55.00   75.75   80.50   80.17   89.50   95.00 
## -------------------------------------------------------- 
## movies$genre: Mystery & Suspense
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   40.50   54.00   55.95   70.50   97.00 
## -------------------------------------------------------- 
## movies$genre: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   53.00   73.50   66.69   82.50   91.00 
## -------------------------------------------------------- 
## movies$genre: Science Fiction & Fantasy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   26.00   47.00   50.89   79.00   85.00

We can see from the above that documentaries, as well as musical and performing arts films, tend to get higher than average audience scores (means above 80), while genres like horror and comedy tend to fare below average (means of about 46 and 53). This suggests that genre is likely to be a good predictor of audience scores.
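To make the comparison with the overall average explicit, the genre means can be ranked against the overall mean audience score. The sketch below only reuses the means printed above; the by() call in the final comment is inferred from the format of the output and may differ from the original code.

```r
# Mean audience score per genre, copied from the output above.
genre_means <- c(
  "Action & Adventure"        = 53.78,
  "Animation"                 = 62.44,
  "Art House & International" = 64.00,
  "Comedy"                    = 52.51,
  "Documentary"               = 82.75,
  "Drama"                     = 65.35,
  "Horror"                    = 45.83,
  "Musical & Performing Arts" = 80.17,
  "Mystery & Suspense"        = 55.95,
  "Other"                     = 66.69,
  "Science Fiction & Fantasy" = 50.89
)
overall_mean <- 62.36  # from the summary() output for audience_score

# Difference from the overall mean, sorted from highest to lowest.
genre_gap <- sort(genre_means - overall_mean, decreasing = TRUE)
print(round(genre_gap, 2))

# With the data loaded, the per-genre summaries could come from:
# by(movies$audience_score, movies$genre, summary)
```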

Runtime to Audience Score
There have been anecdotal claims that longer runtimes influence audience scores, so we check whether such a relationship exists.

There is some linear relationship between audience score and runtime, but it is not very strong. As runtime could still have some predictive power, it is included in the construction of the linear model.
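The strength of a linear relationship like this one can be summarised with a correlation coefficient. The sketch below is illustrative only: it runs on simulated stand-in data with the same column names (not the movies data), and runtime_cor is a hypothetical helper. Note that runtime has one missing value in the real data, so missing rows have to be dropped explicitly.

```r
# Simulated stand-in data (NOT the movies data): same column names, made-up values.
set.seed(1)
toy <- data.frame(runtime = round(rnorm(200, mean = 105, sd = 20)))
toy$audience_score <- 50 + 0.1 * toy$runtime + rnorm(200, sd = 18)
toy$runtime[5] <- NA  # mimic the single missing runtime in the real data

# Pearson correlation, dropping rows with a missing value in either column.
runtime_cor <- function(df) {
  cor(df$runtime, df$audience_score, use = "complete.obs")
}

r <- runtime_cor(toy)
print(r)

# A scatter plot with a least-squares line could be drawn with:
# plot(audience_score ~ runtime, data = toy)
# abline(lm(audience_score ~ runtime, data = toy), col = "blue")
```

On the real data the call would be runtime_cor(movies).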

Month of Movie release to Audience Score
There might be a relationship between the month of release and the audience score, as one would suspect that seasonality influences how movies are received.

Although the mean audience score is highest in December (perhaps because of the festive season), the difference from the other months is not statistically significant, so we do not include release month in our model.
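The report does not show which test was used for the release month. One common choice, assumed here, is a one-way ANOVA that treats the month as a category. The sketch runs on simulated stand-in data built with no month effect, so the numbers it prints say nothing about the real movies data.

```r
# Simulated stand-in data (NOT the movies data), built with no month effect.
set.seed(2)
toy <- data.frame(
  thtr_rel_month = sample(1:12, 300, replace = TRUE),
  audience_score = round(runif(300, min = 11, max = 97))
)

# Mean score per release month.
month_means <- tapply(toy$audience_score, toy$thtr_rel_month, mean)
print(round(month_means, 1))

# One-way ANOVA: month is treated as a category, not as a number.
month_fit <- lm(audience_score ~ factor(thtr_rel_month), data = toy)
month_p <- anova(month_fit)[["Pr(>F)"]][1]
print(month_p)
```

A large p-value here would mean that the differences between monthly means are compatible with chance.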

Critics Rating to Audience Score
We do not use the numeric critics score as a predictor, as it is expected to be strongly correlated with the audience score. Instead, we test whether the categorical critics rating can be used to predict the scores audiences give, which might be the reason why Rotten Tomatoes was set up in the first place.

It is clear from this that critics rating has some influence on the final audience score, and it should be included in the model. The summary statistics are as follows:

## movies$critics_rating: Certified Fresh
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   71.00   81.00   79.37   87.50   97.00 
## -------------------------------------------------------- 
## movies$critics_rating: Fresh
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   58.00   74.00   69.97   83.00   94.00 
## -------------------------------------------------------- 
## movies$critics_rating: Rotten
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.0    36.0    48.0    49.7    64.0    95.0

Best Picture Nominee to Audience Score
Whether a movie is a best picture nominee might be a good indicator of whether audiences would like it. The EDA below tries to verify this:

The sharp difference between nominated and non-nominated films suggests that this could be a key factor in the linear model. However, only 22 films in the sample were nominated.

Number of Votes on IMDB to Audience Score
Good movies are likely to generate larger volumes of reviews online, and this EDA tests whether the number of votes on IMDB is a good predictor of the final audience score:

There is a linear relationship here as well, although it is weak. We still add the number of IMDB votes as a potential predictor of the final audience score.

Part 4: Modeling

Critics rating, genre, best picture nomination, runtime and number of votes on IMDB were considered for the full model. A linear model was fitted with these five predictors to determine how well they predict the audience score. The results are as follows:

## 
## Call:
## lm(formula = movies$audience_score ~ movies$critics_rating + 
##     movies$genre + movies$best_pic_nom + movies$runtime + movies$imdb_num_votes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.866  -9.794   0.187  10.261  42.617 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            6.303e+01  4.187e+00  15.054
## movies$critics_ratingFresh            -4.527e+00  1.795e+00  -2.522
## movies$critics_ratingRotten           -2.100e+01  1.776e+00 -11.824
## movies$genreAnimation                  6.121e+00  5.291e+00   1.157
## movies$genreArt House & International  8.248e+00  4.387e+00   1.880
## movies$genreComedy                    -1.218e-01  2.436e+00  -0.050
## movies$genreDocumentary                1.974e+01  2.958e+00   6.674
## movies$genreDrama                      5.471e+00  2.090e+00   2.618
## movies$genreHorror                    -6.304e+00  3.608e+00  -1.747
## movies$genreMusical & Performing Arts  1.874e+01  4.723e+00   3.968
## movies$genreMystery & Suspense        -2.399e+00  2.686e+00  -0.893
## movies$genreOther                      2.874e+00  4.170e+00   0.689
## movies$genreScience Fiction & Fantasy -8.275e+00  5.263e+00  -1.572
## movies$best_pic_nomyes                 6.002e+00  3.499e+00   1.715
## movies$runtime                         4.144e-02  3.412e-02   1.215
## movies$imdb_num_votes                  3.258e-05  6.322e-06   5.153
##                                       Pr(>|t|)    
## (Intercept)                            < 2e-16 ***
## movies$critics_ratingFresh             0.01193 *  
## movies$critics_ratingRotten            < 2e-16 ***
## movies$genreAnimation                  0.24776    
## movies$genreArt House & International  0.06058 .  
## movies$genreComedy                     0.96015    
## movies$genreDocumentary               5.44e-11 ***
## movies$genreDrama                      0.00907 ** 
## movies$genreHorror                     0.08105 .  
## movies$genreMusical & Performing Arts 8.08e-05 ***
## movies$genreMystery & Suspense         0.37207    
## movies$genreOther                      0.49100    
## movies$genreScience Fiction & Fantasy  0.11639    
## movies$best_pic_nomyes                 0.08682 .  
## movies$runtime                         0.22490    
## movies$imdb_num_votes                 3.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.76 on 634 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4804, Adjusted R-squared:  0.4681 
## F-statistic: 39.08 on 15 and 634 DF,  p-value: < 2.2e-16
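The call above passes columns as movies$..., which works but produces long coefficient names and makes predict() awkward to use with new data. A sketch of the same formula written with the data argument is shown below. It runs on simulated stand-in data with the real column names (only three genres, made-up values), so its coefficients are not comparable to the output above.

```r
# Simulated stand-in data (NOT the movies data): real column names, made-up values.
set.seed(3)
n <- 300
toy <- data.frame(
  critics_rating = factor(sample(c("Certified Fresh", "Fresh", "Rotten"), n, replace = TRUE)),
  genre          = factor(sample(c("Action & Adventure", "Comedy", "Drama"), n, replace = TRUE)),
  best_pic_nom   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))),
  runtime        = round(rnorm(n, mean = 105, sd = 20)),
  imdb_num_votes = round(rexp(n, rate = 1 / 50000))
)
toy$audience_score <- 60 - 15 * (toy$critics_rating == "Rotten") + rnorm(n, sd = 15)

# Same five predictors as the report, written with data= so that the
# coefficient names stay short and predict() can take new data later.
full_model <- lm(
  audience_score ~ critics_rating + genre + best_pic_nom + runtime + imdb_num_votes,
  data = toy
)
print(names(coef(full_model)))
print(summary(full_model)$adj.r.squared)
```

On the real data, data = movies would replace data = toy. Terms can be dropped from an existing fit with update(full_model, . ~ . - runtime - best_pic_nom).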

This model has reasonable predictive power (adjusted R-squared of 0.468), but the p-values for runtime (0.22) and best picture nomination (0.09) suggest that both can be removed without impacting the model too much. Let us try it here:

## 
## Call:
## lm(formula = audience_score ~ critics_rating + genre + imdb_num_votes, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.077  -9.929   0.531  10.061  42.003 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.730e+01  2.528e+00  26.623  < 2e-16 ***
## critics_ratingFresh            -4.851e+00  1.784e+00  -2.720  0.00672 ** 
## critics_ratingRotten           -2.142e+01  1.761e+00 -12.162  < 2e-16 ***
## genreAnimation                  5.458e+00  5.277e+00   1.034  0.30140    
## genreArt House & International  8.468e+00  4.395e+00   1.927  0.05448 .  
## genreComedy                    -1.671e-01  2.433e+00  -0.069  0.94528    
## genreDocumentary                1.947e+01  2.948e+00   6.605 8.41e-11 ***
## genreDrama                      6.107e+00  2.073e+00   2.946  0.00333 ** 
## genreHorror                    -6.527e+00  3.601e+00  -1.813  0.07036 .  
## genreMusical & Performing Arts  1.930e+01  4.712e+00   4.096 4.75e-05 ***
## genreMystery & Suspense        -1.960e+00  2.682e+00  -0.731  0.46522    
## genreOther                      3.645e+00  4.164e+00   0.875  0.38173    
## genreScience Fiction & Fantasy -8.480e+00  5.273e+00  -1.608  0.10827    
## imdb_num_votes                  3.739e-05  5.921e-06   6.316 5.05e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.79 on 637 degrees of freedom
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4652 
## F-statistic:  44.5 on 13 and 637 DF,  p-value: < 2.2e-16

The new model has 3 variables, but its adjusted R-squared of 0.465 is only slightly smaller than the 0.468 of the 5 variable model. This suggests that the most important variables are genre, critics rating and number of IMDB votes.
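The two adjusted R-squared values can be reproduced from the multiple R-squared, the sample size and the number of estimated slopes reported in the two summaries. The helper adjusted_r2 below is a hypothetical name introduced for this check.

```r
# Adjusted R-squared from R-squared, sample size n and number of
# estimated slopes p: 1 - (1 - R2) * (n - 1) / (n - p - 1).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Full model: 15 slopes and 634 residual df, so n = 634 + 15 + 1 = 650
# (one row was dropped because runtime is missing).
adj_full <- adjusted_r2(r2 = 0.4804, n = 650, p = 15)

# Reduced model: 13 slopes and 637 residual df, so n = 651.
adj_reduced <- adjusted_r2(r2 = 0.4759, n = 651, p = 13)

print(round(c(full = adj_full, reduced = adj_reduced), 4))
print(round(adj_full - adj_reduced, 4))
```

The two models are fitted on slightly different samples (650 and 651 rows), so the comparison is approximate.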

## Warning in plot.window(...): "yintercept" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "yintercept" is not a graphical
## parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "yintercept"
## is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "yintercept"
## is not a graphical parameter
## Warning in box(...): "yintercept" is not a graphical parameter
## Warning in title(...): "yintercept" is not a graphical parameter

The residuals are roughly normally distributed around zero, which is a good sign. However, there is less scatter at larger fitted values, so the variability of the residuals is not quite constant (mild heteroscedasticity). As the effect is small and the residuals are still centred around zero, we accept the model as it is.
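The warnings above indicate that yintercept, which is an argument of ggplot2's geom_hline(), was passed to a base graphics plot, where it is ignored. In base graphics a horizontal reference line is drawn with abline(h = 0). The sketch below shows one way to draw the usual residual checks; plot_residual_checks is a hypothetical helper, and the demo fit uses R's built-in cars data purely as a stand-in for the movie model.

```r
# Draws three common residual checks for a fitted lm object.
plot_residual_checks <- function(model) {
  res <- residuals(model)
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))

  # Residuals vs fitted values: look for curvature or a funnel shape.
  plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)  # base-graphics way to draw a horizontal line

  # Histogram and normal Q-Q plot: look for nearly normal residuals.
  hist(res, xlab = "Residuals", main = "")
  qqnorm(res, main = "")
  qqline(res)

  invisible(res)
}

# Stand-in model on the built-in `cars` data, only to show the call.
demo_model <- lm(dist ~ speed, data = cars)
demo_res <- plot_residual_checks(demo_model)
```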


Part 5: Prediction

As I just watched Thor: Ragnarok on Netflix and thought that it was hilarious, I wanted to see if my model could predict its final audience score. To me, as there is space exploration and mythology involved, this falls under the genre of “Science Fiction & Fantasy”. The following is the prediction, with a 95% prediction interval.

##       fit      lwr      upr
## 1 73.7176 42.58831 104.8469

The model predicted a score of 73.7, with a 95% prediction interval of 42.6 to 104.8. As audience scores cannot exceed 100, the upper bound is effectively 100. The actual audience score was 87%, which lies inside the interval and about 13 points above the prediction, so the prediction was not too bad.
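Since a linear model knows nothing about the 0 to 100 range of audience scores, the interval bounds can be clamped after prediction. The sketch below works only from the numbers printed above. The predict() call in the final comment is an assumption about how the interval was obtained, and reduced_model and thor are hypothetical names.

```r
# Values copied from the predict() output above.
pred <- data.frame(fit = 73.7176, lwr = 42.58831, upr = 104.8469)

# Audience scores are percentages, so clamp the bounds to [0, 100].
pred$lwr_capped <- pmax(pred$lwr, 0)
pred$upr_capped <- pmin(pred$upr, 100)

# Compare with the actual score quoted in the text.
actual_score <- 87
pred$error <- actual_score - pred$fit
pred$covered <- actual_score >= pred$lwr_capped & actual_score <= pred$upr_capped
print(pred)

# With a fitted model, the raw interval could come from a call such as:
# predict(reduced_model, newdata = thor, interval = "prediction", level = 0.95)
```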


Part 6: Conclusion

EDA was conducted on the movies dataset, and a linear model was fitted using 3 variables to predict audience scores. The model gave a reasonable prediction for a movie outside of the dataset (Thor: Ragnarok was released in 2017, which is outside of the sample), although the prediction interval is wide and a single movie is only a limited check. Hope you have found this useful!

Eric Tay