Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)

Load data

load("movies.Rdata")
movies <- na.omit(movies) # Removing the NA values from the dataset
head(movies)

## # A tibble: 6 x 32
##   title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month
##   <chr> <fct>      <fct>   <dbl> <fct>       <fct>          <dbl>          <dbl>
## 1 Fill… Feature F… Drama      80 R           Indom…          2013              4
## 2 The … Feature F… Drama     101 PG-13       Warne…          2001              3
## 3 Wait… Feature F… Come…      84 R           Sony …          1996              8
## 4 The … Feature F… Drama     139 PG          Colum…          1993             10
## 5 Male… Feature F… Horr…      90 R           Ancho…          2004              9
## 6 Lady… Feature F… Drama     142 PG-13       Param…          1986              1
## # … with 24 more variables: thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <fct>, critics_score <dbl>,
## #   audience_rating <fct>, audience_score <dbl>, best_pic_nom <fct>,
## #   best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## #   best_dir_win <fct>, top200_box <fct>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## #   rt_url <chr>

Part 1: Data

The dataset contains information on movies from the well-known websites Rotten Tomatoes and IMDB. It consists of 651 randomly sampled movies produced and released before 2016, described by 32 variables. Removing rows with missing values leaves 619 movies for the analysis. Because the data were collected without random assignment, this is an observational study and no causal inference can be drawn. Because the movies were randomly sampled, the results can generalize to movies produced and released before 2016.

Part 2: Research question

I have always wondered whether casting an established actor or actress, or hiring an established director, really influences the overall performance or rating of a movie. Do audiences agree with critics, or are the critics' ratings biased? Rather than sit and wonder, let us find out.

The goal of this project is to predict the overall rating of a movie and to find out whether factors such as cast, crew and critics have any influence on it.

Part 3: Exploratory data analysis

I have selected a few variables from the dataset that best suit the purpose of this study and are expected to be independent of each other. Variables such as DVD release date, number of IMDB votes and best picture nomination/win were therefore not chosen for the model. Variables with large domains, such as studio name and actor/director names, were excluded as well. The links to the IMDB and Rotten Tomatoes pages are irrelevant to identifying the popularity of a movie and were also dropped. Theater release timing was considered, on the assumption that movies released at certain times of the year may be more popular than others.

1. genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
2. runtime: Runtime of movie (in minutes)
3. mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, NC-17, Unrated)
4. thtr_rel_year: Year the movie is released in theaters
5. imdb_rating: Rating on IMDB
6. critics_score: Critics score on Rotten Tomatoes
7. audience_score: Audience score on Rotten Tomatoes
8. best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes)
9. best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes)
10. best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes)
11. top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)

Summarising the required data

summary(movies %>% select(genre,runtime,mpaa_rating,thtr_rel_year,imdb_rating,critics_score,audience_score,best_actor_win,best_actress_win,best_dir_win,top200_box))

##                 genre        runtime       mpaa_rating  thtr_rel_year 
##  Drama             :298   Min.   : 65.0   G      : 16   Min.   :1972  
##  Comedy            : 86   1st Qu.: 93.0   NC-17  :  1   1st Qu.:1991  
##  Action & Adventure: 62   Median :103.0   PG     :111   Median :2000  
##  Mystery & Suspense: 56   Mean   :106.5   PG-13  :131   Mean   :1998  
##  Documentary       : 40   3rd Qu.:116.0   R      :319   3rd Qu.:2007  
##  Horror            : 22   Max.   :267.0   Unrated: 41   Max.   :2014  
##  (Other)           : 55                                               
##   imdb_rating    critics_score    audience_score  best_actor_win
##  Min.   :1.900   Min.   :  1.00   Min.   :11.00   no :528       
##  1st Qu.:5.900   1st Qu.: 33.00   1st Qu.:46.00   yes: 91       
##  Median :6.600   Median : 61.00   Median :65.00                 
##  Mean   :6.486   Mean   : 57.43   Mean   :62.21                 
##  3rd Qu.:7.300   3rd Qu.: 82.50   3rd Qu.:80.00                 
##  Max.   :9.000   Max.   :100.00   Max.   :97.00                 
##                                                                 
##  best_actress_win best_dir_win top200_box
##  no :548          no :576      no :604   
##  yes: 71          yes: 43      yes: 15   
##                                          
##                                          
##                                          
##                                          
##

I am interested in how the IMDB rating relates to the runtime and the genre of the movie.

cor(movies$imdb_rating,movies$runtime)

## [1] 0.2974388

g <- ggplot(data = movies, aes(x = genre, y = imdb_rating, fill = genre))
g + theme_bw() +
  geom_violin(draw_quantiles = 0.5) + # horizontal line marks the median of each genre
  labs(title = "Violin Plot",
       x = "Genre",
       y = "IMDB Rating")

This violin plot shows the relationship of genre type to imdb_rating. The plot elements show the median rating for documentary and drama is higher than for other genre types. The shape of the distribution (extremely skinny on each end and wide in the middle) indicates the ratings of horror and documentary are highly concentrated around the median.
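The medians read off the violin plot can also be checked numerically. The following is a minimal sketch in base R; `genre_medians` is a helper name introduced here for illustration, and the toy data frame contains made-up ratings so that the block runs on its own. In the analysis itself the call would be `genre_medians(movies)`.

```r
# Hypothetical helper: median IMDB rating per genre, highest median first.
genre_medians <- function(df) {
  out <- aggregate(imdb_rating ~ genre, data = df, FUN = median)
  names(out)[2] <- "median_rating"
  out[order(-out$median_rating), ]
}

# Toy data with made-up values, used only to demonstrate the call.
toy <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Horror", "Horror", "Comedy"),
  imdb_rating = c(7.0, 6.5, 8.0, 5.0, 6.0, 6.2)
)

res <- genre_medians(toy)
print(res)
```

A table like this makes it easier to rank genres than comparing violin shapes by eye, especially for genres with few movies.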

Checking the correlation between audience_score and critics_score on Rotten Tomatoes

cor(movies$critics_score, movies$audience_score, use = "everything")

## [1] 0.7015256

g <- ggplot(data = movies, aes(x = critics_score, y = audience_score))
# draw the points once, jittered (alpha must lie in [0, 1]), and facet by the critics_rating column
g + geom_jitter(alpha = 0.3) + geom_smooth() + facet_wrap(~ critics_rating)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

With a correlation coefficient of about 0.70, critics and audiences tend to agree with each other on a movie's rating.

Distribution of Rating Scores

p1 <- ggplot(data=movies, aes(x=imdb_rating)) + 
      geom_histogram(binwidth=0.5, fill="purple") +
      xlab("IMDB Scores")
p2 <- ggplot(data=movies, aes(x=critics_score)) + 
      geom_histogram(binwidth=5, fill="pink") +
      xlab("Critics Scores")
p3 <- ggplot(data=movies, aes(x=audience_score)) + 
      geom_histogram(binwidth=5, fill="magenta") +
      xlab("Audience Scores")
grid.arrange(p1, p2, p3, nrow=1,
             top="Distribution of Rating Scores")

Unlike the two Rotten Tomatoes scores, the IMDB ratings show a roughly normal distribution centered around a mean of about 6.49, with a slight left skew.

Part 4: Modeling

The target response variable for the prediction model is a movie rating score, but with three to choose from, which one should be used? Two of the ratings come from the Rotten Tomatoes website: one is an average of reviews by movie critics and the other is an average of reviews from the public (the audience). The third rating is an average of reviews on the IMDB website, with no distinction made between critics and audience reviews. Given its distribution and the fact that it has the highest pairwise correlation with the other scores, the IMDB rating (imdb_rating) is the chosen response variable.
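The pairwise correlations mentioned above can be tabulated directly. This is a minimal sketch, assuming the three score columns are named as in the `movies` data; `score_correlations` is a helper name introduced here, and the toy values are made up so the block runs on its own. In the analysis the call would be `score_correlations(movies)`.

```r
# Hypothetical helper: Pearson correlation matrix of the three rating scores.
score_correlations <- function(df) {
  scores <- df[, c("imdb_rating", "critics_score", "audience_score")]
  round(cor(scores, use = "complete.obs"), 3)
}

# Toy data with made-up values, used only to demonstrate the call.
toy <- data.frame(
  imdb_rating    = c(6.1, 7.2, 5.4, 8.0, 6.8),
  critics_score  = c(55, 80, 30, 95, 60),
  audience_score = c(60, 75, 40, 90, 72)
)

m <- score_correlations(toy)
print(m)

# Average correlation of each score with the other two scores;
# the score with the largest value is the most "central" of the three.
avg_with_others <- (rowSums(m) - 1) / 2
print(avg_with_others)
```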

We will use backward elimination: we start with the full model containing all candidate variables and remove variables one at a time to arrive at a parsimonious model. The initial variables are:

genre, runtime, mpaa_rating, best_actor_win, best_actress_win, best_dir_win, critics_score, audience_score

The initial model is summarised below.

model <- lm(imdb_rating ~ genre + runtime + mpaa_rating + best_actor_win +
              best_actress_win + best_dir_win + critics_score + audience_score,
            data = movies)
summary(model)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + best_actor_win + 
##     best_actress_win + best_dir_win + critics_score + audience_score, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.37794 -0.18657  0.04211  0.27789  1.19448 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.2971065  0.1816755  18.148  < 2e-16 ***
## genreAnimation                 -0.5263118  0.1917443  -2.745  0.00624 ** 
## genreArt House & International  0.2459594  0.1528671   1.609  0.10815    
## genreComedy                    -0.1559576  0.0797020  -1.957  0.05084 .  
## genreDocumentary                0.2808051  0.1129641   2.486  0.01320 *  
## genreDrama                      0.0294810  0.0696210   0.423  0.67212    
## genreHorror                     0.0744675  0.1199214   0.621  0.53486    
## genreMusical & Performing Arts  0.0329651  0.1519943   0.217  0.82837    
## genreMystery & Suspense         0.2135603  0.0901403   2.369  0.01814 *  
## genreOther                     -0.0033619  0.1373917  -0.024  0.98049    
## genreScience Fiction & Fantasy -0.0869348  0.1768009  -0.492  0.62310    
## runtime                         0.0050418  0.0011441   4.407 1.25e-05 ***
## mpaa_ratingNC-17               -0.0165491  0.4886483  -0.034  0.97299    
## mpaa_ratingPG                  -0.1318487  0.1386581  -0.951  0.34204    
## mpaa_ratingPG-13               -0.0742621  0.1415873  -0.524  0.60013    
## mpaa_ratingR                   -0.0427815  0.1372020  -0.312  0.75529    
## mpaa_ratingUnrated             -0.1809026  0.1601397  -1.130  0.25908    
## best_actor_winyes               0.0256733  0.0562272   0.457  0.64812    
## best_actress_winyes             0.0686706  0.0615902   1.115  0.26532    
## best_dir_winyes                 0.0564405  0.0777405   0.726  0.46812    
## critics_score                   0.0104516  0.0009839  10.623  < 2e-16 ***
## audience_score                  0.0334177  0.0013661  24.463  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4686 on 597 degrees of freedom
## Multiple R-squared:  0.8163, Adjusted R-squared:  0.8098 
## F-statistic: 126.3 on 21 and 597 DF,  p-value: < 2.2e-16

The backward elimination method :

red_model <- step(model, direction = "backward", trace = FALSE)
summary(red_model)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + critics_score + 
##     audience_score, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34317 -0.19822  0.04111  0.26913  1.16952 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.182655   0.130365  24.413  < 2e-16 ***
## genreAnimation                 -0.490449   0.177001  -2.771  0.00576 ** 
## genreArt House & International  0.219695   0.148553   1.479  0.13969    
## genreComedy                    -0.148901   0.078375  -1.900  0.05793 .  
## genreDocumentary                0.216866   0.101665   2.133  0.03331 *  
## genreDrama                      0.047788   0.067161   0.712  0.47702    
## genreHorror                     0.087379   0.117252   0.745  0.45642    
## genreMusical & Performing Arts  0.011288   0.150541   0.075  0.94025    
## genreMystery & Suspense         0.251075   0.087187   2.880  0.00412 ** 
## genreOther                     -0.019440   0.136217  -0.143  0.88657    
## genreScience Fiction & Fantasy -0.074232   0.176285  -0.421  0.67384    
## runtime                         0.005435   0.001060   5.127 3.97e-07 ***
## critics_score                   0.010438   0.000962  10.850  < 2e-16 ***
## audience_score                  0.033525   0.001360  24.641  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4681 on 605 degrees of freedom
## Multiple R-squared:  0.8143, Adjusted R-squared:  0.8103 
## F-statistic:   204 on 13 and 605 DF,  p-value: < 2.2e-16

The reduced model keeps only 4 of the 8 predictors (genre, runtime, critics_score and audience_score), yet it has a slightly larger adjusted R-squared of 0.8103 (versus 0.8098 for the full model). Using half of the variables, this model explains almost the same share of the variability (about 81%) as the full model. The audience score and the critics score are the most significant predictors, followed by runtime.
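Note that `step()` selects variables by AIC by default rather than by adjusted R-squared, so it is useful to look at both criteria side by side. The sketch below uses the built-in `mtcars` data so that it runs on its own; `compare_models` is a helper name introduced here for illustration. In this analysis the call would be `compare_models(model, red_model)`.

```r
# Hypothetical helper: side-by-side summary of a full and a reduced lm fit.
compare_models <- function(full, reduced) {
  data.frame(
    model  = c("full", "reduced"),
    n_coef = c(length(coef(full)), length(coef(reduced))),
    adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
    aic    = c(AIC(full), AIC(reduced))
  )
}

# Stand-in example on built-in data (not the movies data).
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- step(full, direction = "backward", trace = FALSE)

res <- compare_models(full, reduced)
print(res)
```

Since the two models are nested, `anova(reduced, full)` would additionally give a partial F-test of whether the dropped terms matter jointly.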

plot(red_model$residuals~movies$runtime,xlab="Runtime",ylab="Residuals",main="a) Residuals vs. Runtime")

plot(red_model$residuals~movies$critics_score,xlab="Critics score",ylab="Residuals",main="b) Residuals vs critics score")

hist(red_model$residuals,xlab="Residuals",main="a) Histogram of reduced_model residuals")

qqnorm(red_model$residuals,main="b) Normal probability of residuals")
qqline(red_model$residuals)

The histogram of residuals (Figure 5a) shows a slight skew, while the normal probability plot (Figure 5b) shows deviations only in the tails.

plot(red_model$residuals~red_model$fitted,main="Residuals vs. fitted")
abline(0,0)

The residuals appear to be mostly homoscedastic, though there are a few discrepancies.

Part 5: Prediction

To predict the IMDB rating of a movie released in 2016, we consider “Doctor Strange”. Plugging in the required variable values, the predicted imdb_rating is 7.55. The actual IMDB rating is 7.5, so the model predicted the rating accurately for this movie. The lower and upper bounds of the 95% prediction interval are 6.57 and 8.52 respectively.

# mpaa_rating and best_dir_win are not terms of red_model, so predict() ignores them
new_movie <- data.frame(genre = "Science Fiction & Fantasy", runtime = 115, mpaa_rating = "PG-13", best_dir_win = "no", critics_score = 89, audience_score = 86)
predict(red_model, new_movie)

##        1 
## 7.545571

predict(red_model,new_movie, interval = "predict")

##        fit      lwr      upr
## 1 7.545571 6.568234 8.522908

For “The Lord of the Rings: The Fellowship of the Ring”, the predicted imdb_rating is 8.28, whereas the rating on the website is 8.8, a discrepancy of about 0.5. The lower and upper bounds of the 95% prediction interval are 7.34 and 9.23 respectively, so the actual rating falls within the interval.

# mpaa_rating and best_dir_win are not terms of red_model, so predict() ignores them
new_movie <- data.frame(genre = "Action & Adventure", runtime = 178, mpaa_rating = "PG-13", best_dir_win = "yes", critics_score = 91, audience_score = 95)
predict(red_model, new_movie)

##       1 
## 8.28481

predict(red_model,new_movie, interval = "predict")

##       fit      lwr      upr
## 1 8.28481 7.343928 9.225692
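The intervals above come from `interval = "predict"`, which `predict()` partially matches to `"prediction"`. A prediction interval covers the rating of a single new movie, whereas a confidence interval covers only the mean rating of all movies with the same predictor values, so it is narrower. The sketch below illustrates the difference on the built-in `mtcars` data so that it runs on its own; the model and the `new_car` values are stand-ins, not part of the movies analysis.

```r
# Stand-in model on built-in data (not the movies data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3.0, hp = 120)

# Interval for the mean response at these predictor values.
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)

# Interval for one new observation at these predictor values.
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]

print(conf_int)
print(pred_int)
```

Both calls return the same fitted value; only the bounds differ. For judging whether a single movie's actual rating is consistent with the model, the prediction interval is the relevant one.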

Part 6: Conclusion

The best actor, best actress and best director variables had high p-values and were dropped by backward elimination, whereas genre, runtime, audience score and critics score were retained as significant predictors. According to the model output, critics and audience scores play a significant role in predicting the overall rating of a movie, while the Oscar history of the cast and director adds little once those scores are accounted for. There may be many more contributing factors, but given the limitations of the dataset, this model is our best fit.