Modeling and prediction for movies - how to predict movie score

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.2

library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")

Part 1: Data

Dataset is made of information about movies in Rotten Tomatoes and IMDB. We have got 651 randomly sampled movies produced and released before 2016, with 32 variables. We cannot infere any causality here as this is observational. We are able to generalize results to all movies. Well, as there are lot of irrelevant variables (like title or link) for our analysis (as I am going to use score) and there are few NAs in dataset, I am not going to work with them, so I do not consider it to be important to deal with that.

Part 2: Research question

As I am fan of B-movies and I do think people tend to copy authorities (ie. critics) I would like to know if there is correlation between movie score and if audience is following the critics voice. I do suppose I can predict movie score with this. I do also wonder if runtime is what affect score of movie.

Part 3: Exploratory data analysis

First we will look on what kind of data we do have. First and last six observations will give us good image of data.

head(movies)

## # A tibble: 6 × 32
##                  title   title_type       genre runtime mpaa_rating
##                  <chr>       <fctr>      <fctr>   <dbl>      <fctr>
## 1          Filly Brown Feature Film       Drama      80           R
## 2             The Dish Feature Film       Drama     101       PG-13
## 3  Waiting for Guffman Feature Film      Comedy      84           R
## 4 The Age of Innocence Feature Film       Drama     139          PG
## 5          Malevolence Feature Film      Horror      90           R
## 6          Old Partner  Documentary Documentary      78     Unrated
## # ... with 27 more variables: studio <fctr>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <fctr>, critics_score <dbl>,
## #   audience_rating <fctr>, audience_score <dbl>, best_pic_nom <fctr>,
## #   best_pic_win <fctr>, best_actor_win <fctr>, best_actress_win <fctr>,
## #   best_dir_win <fctr>, top200_box <fctr>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>

tail(movies)

## # A tibble: 6 × 32
##                         title   title_type              genre runtime
##                         <chr>       <fctr>             <fctr>   <dbl>
## 1          Cocoon: The Return Feature Film              Drama     116
## 2          Death Defying Acts Feature Film              Drama      97
## 3                  Half Baked Feature Film             Comedy      82
## 4           Dance of the Dead Feature Film Action & Adventure      87
## 5 Around the World in 80 Days Feature Film Action & Adventure     120
## 6                         LOL Feature Film             Comedy      97
## # ... with 28 more variables: mpaa_rating <fctr>, studio <fctr>,
## #   thtr_rel_year <dbl>, thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## #   imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fctr>,
## #   critics_score <dbl>, audience_rating <fctr>, audience_score <dbl>,
## #   best_pic_nom <fctr>, best_pic_win <fctr>, best_actor_win <fctr>,
## #   best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>

dim(movies)

## [1] 651  32

We can see we have 651 entries with lot of variables - title, type of movie, genre, mpaa_rating, studio, release zear, month, day, etc. 32 variables and 651 observations…

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Well there are lot of feature film (591), which is huge number compared to 55 documentary or even 5 TV movies! Also we have lot of drama (305) compared to 87 of comedy…this can make analysis harder as these number can overweight others. These are not big numbers so our analysis will be not as accurate as we could hope for. Fortunately there are not a lot of NAs in data so we have not to worry about them.

p_genrerun <- ggplot(movies, aes(x=factor(genre), y=runtime)) +
  geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_genrerun + ggtitle("Genre to runtime") + geom_hline(yintercept =median(movies$runtime, na.rm = TRUE), col = "royalblue",lwd = 1)

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

We can see all genres are oscilating around median runtime value of 103 minutes. There are lot of outliers in data, mostly documentary genre.

p_genreimdb <- ggplot(movies, aes(x=factor(genre), y=imdb_rating)) +
  geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_genreimdb + ggtitle("Genre to IMDB rating") + geom_hline(yintercept =median(movies$imdb_rating, na.rm = TRUE), col = "royalblue",lwd = 1)

We can see sci-fi has biggest variance between 1Q a 3Q, median of sci-fi genre is lower then median of all genres. Documentary performs best as Musical and performing arts movies. There are outliers too for lot of genres. From median of genre we can see that people tend to give higher (around 7) rating on IMDB.

plot(movies$imdb_rating,movies$runtime, main="IMDB rating to runtime", xlab = "IMDB rating", ylab="Runtime in minutes")
abline(lm(movies$runtime~movies$imdb_rating),col = "royalblue",lwd = 1)

From simple linear regression used to IMDB rating and runtime of movie we can see that we can expect better IMDB rating if runtime is longer.

ggplot(data = movies, aes(x = critics_score, y = audience_score)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE) + ggtitle("Critics score to audience score - Rotten")

We see there is correlation between critics score of Rotten Tomato and audience score. What about IMDB?

ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE) + ggtitle("Critics score to audience score - IMDB")

Very strong correlation…

ggplot(data = movies, aes(x = critics_score, y = imdb_rating)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE) + ggtitle("IMDB vs. Rotten")

We definitely have correlation between critics and IMDB rating. I will use these values in model.

Part 4: Modeling

regres <- lm(movies$audience_score ~ movies$critics_score + movies$imdb_rating)
summary(regres)

## 
## Call:
## lm(formula = movies$audience_score ~ movies$critics_score + movies$imdb_rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.668  -6.758   0.723   5.513  52.438 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -37.03195    2.86401 -12.930  < 2e-16 ***
## movies$critics_score   0.07318    0.02161   3.386 0.000753 ***
## movies$imdb_rating    14.65760    0.56590  25.901  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.08 on 648 degrees of freedom
## Multiple R-squared:  0.7524, Adjusted R-squared:  0.7516 
## F-statistic: 984.4 on 2 and 648 DF,  p-value: < 2.2e-16

Not surprisingly we have got strong relationship between these values. I wonder if they are correlated.

reg <- lm( movies$critics_score ~ movies$imdb_rating)
summary(reg)

## 
## Call:
## lm(formula = movies$critics_score ~ movies$imdb_rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.855 -12.803   3.116  13.694  43.196 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -72.3792     4.3572  -16.61   <2e-16 ***
## movies$imdb_rating  20.0317     0.6619   30.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.31 on 649 degrees of freedom
## Multiple R-squared:  0.5853, Adjusted R-squared:  0.5846 
## F-statistic: 915.9 on 1 and 649 DF,  p-value: < 2.2e-16

These values are correlated, which means I do not need to add more of them as this will not be helpful. I will use only critics score, as IMDB score and audience_score are strongly correlated (both are audience score).So critics score is going to by my explanatory variable.

model <- lm(audience_score ~ critics_score, data=movies) #funny thing, if you will not use "data=" you will not be able to get predict() to work properly
par(mfrow=c(2,2)) #combine plots to 2x2 table
hist(model$residuals, main="residuals")
qqnorm(model$residuals)
qqline(model$residuals)  
plot(model$residuals ~ model$fitted)
summary(model)

## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.043  -9.571   0.504  10.422  43.544 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.43551    1.27561   26.21   <2e-16 ***
## critics_score  0.50144    0.01984   25.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared:  0.496,  Adjusted R-squared:  0.4952 
## F-statistic: 638.7 on 1 and 649 DF,  p-value: < 2.2e-16

We can see thet residuals are normaly distributed around zero. Few outliers do not infere with data, therefore we do not care for them. Also there is linear relationships between x and y, we have got nearly normal residuals and we see from constant variability of residuals the independence of residuals.

From summary table is clear what we did know - there is strong relationship between critics score and audience score. And we can see that critic score is very significant preedictor for audience score (as 1 unit increasing in critics score meand 0.5 increase for audience score).

Part 5: Prediction

As I am fan of B-movies I am constantly looking for low score on IMDB (bellow 5 at least, usualy bellow 3). Lowest IMDB score has Disaster Movie (1.9 on IMDB, 1 critics score, 19 audience score). So I am going to predict this movie score with last “model” variable…

film <- data.frame(title ="Disaster Movie", critics_score = 1)
predict(model, film, interval = "prediction", level = 0.95)

##        fit      lwr     upr
## 1 33.93695 5.616396 62.2575

We can see that we have got prediction of 33.94 audience score with 95% confidence, that our score will be between 5.62 and 62.26 score. Well, actualy, audience score of this movie was 19…

film2 <- data.frame(title ="Hellraiser - Bloodline", critics_score = 25)
predict(model, film2, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 45.97145 17.70844 74.23446

I tried it with another movie - Hellraiser, with average 25 critics score. Model predicted that this movie will get audience score of ~46 and that with 95% confidence will be between 17.7 and 74.23 score. As this movie got 47 audience score we see it was pretty accurate.

Part 6: Conclusion

I was trying to determine if there is any association between audience score and others scoring systems and how we can use it for predictions. As there is strong positive relationship between critics and audience score, critics are best predictors of audience scores.