Modeling and prediction for movies

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.2

library(statsr)

## Warning: package 'statsr' was built under R version 3.5.2

## Warning: package 'BayesFactor' was built under R version 3.5.2

## Warning: package 'coda' was built under R version 3.5.2

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called movies. Delete this note before you submit your work.

load("movies.Rdata")


Part 1: Data

The study claims to be a random sample of movies released before 2016. We can check the release years to see whether the sample looks consistent with that claim.

hist(movies$thtr_rel_year)

summary(movies$thtr_rel_year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1970    1990    2000    1998    2007    2014

The median release year is 2000, so the 14 years from 2000 to 2014 contribute about as many movies as the 30 years from 1970 to 2000. The sample is therefore denser in recent years. This is to be expected, since the industry has been growing, and the pattern is clearly visible in the histogram.

A second check is genre. We can look at whether the genres are evenly distributed.

plot(movies$genre)

For the most part the distribution looks unremarkable, but the number of dramas is far higher than for any other genre. This raises a question: are dramas related to movies released after 2000? We can cross-tabulate genre against release period to check for bias.

table(movies$genre, movies$thtr_rel_year>2000)

##                            
##                             FALSE TRUE
##   Action & Adventure           33   32
##   Animation                     2    7
##   Art House & International     6    8
##   Comedy                       45   42
##   Documentary                  10   42
##   Drama                       169  136
##   Horror                       12   11
##   Musical & Performing Arts     6    6
##   Mystery & Suspense           32   27
##   Other                        12    4
##   Science Fiction & Fantasy     6    3

Apart from documentaries, which are much more common after 2000 (42 versus 10), there seems to be little relation between genre and release period. It is difficult to find a pattern that suggests the sample is non-random, so we proceed on the assumption that it is a random sample. The movies were possibly selected with a random number generator.

Since the sample is random and reasonably large, the results can be generalized to the US population. The frequent occurrence of the genre "Drama" is a hint that the sample may not be fully representative even if it was drawn at random. This could affect the model's predictions if dramas tend to receive favourable audience and critics scores.

This is an observational study with no random assignment, so causal conclusions cannot be drawn from this data set.
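The visual comparison of the two columns can be backed by a chi-squared test of independence. The sketch below is only illustrative: it retypes the counts from the table above into a matrix (the object names are made up for this example) and uses a simulated p-value, because some expected counts are small.

```r
# Counts copied from the genre-by-period table above.
# Columns: released in 2000 or earlier (FALSE), released after 2000 (TRUE).
genre_by_period <- matrix(
  c( 33,  32,
      2,   7,
      6,   8,
     45,  42,
     10,  42,
    169, 136,
     12,  11,
      6,   6,
     32,  27,
     12,   4,
      6,   3),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    genre = c("Action & Adventure", "Animation", "Art House & International",
              "Comedy", "Documentary", "Drama", "Horror",
              "Musical & Performing Arts", "Mystery & Suspense", "Other",
              "Science Fiction & Fantasy"),
    after_2000 = c("FALSE", "TRUE")
  )
)

# Simulated p-value avoids relying on the chi-squared approximation
# when some expected cell counts are below 5.
set.seed(1)
independence_test <- chisq.test(genre_by_period, simulate.p.value = TRUE, B = 10000)
print(independence_test)

# Standardized residuals: large absolute values flag the cells
# that depart most from what independence would predict.
round(independence_test$stdres, 2)
```

With the original data frame loaded, the same test could be run directly on table(movies$genre, movies$thtr_rel_year > 2000).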

Part 2: Research question

The IMDb rating can be an indicator of how good a movie is. That does not show causality, however, because the relationship may run the other way: good movies receive higher ratings. We can nevertheless build a model around the IMDb rating. The research question is: which variables are good predictors of a movie's IMDb rating?

Part 3: Exploratory data analysis

Since we are predicting the IMDb rating, it is important that the number of voters is not biased toward a particular genre, because genre is a potential predictor in our model. First, let us look at the number of IMDb votes by film genre.

plot(movies$genre,log(movies$imdb_num_votes))

tapply(movies$imdb_num_votes,movies$genre, FUN= mean)

##        Action & Adventure                 Animation 
##                 79572.292                 56873.111 
## Art House & International                    Comedy 
##                  8810.214                 44547.345 
##               Documentary                     Drama 
##                  5459.673                 62487.131 
##                    Horror Musical & Performing Arts 
##                 27008.174                 25213.000 
##        Mystery & Suspense                     Other 
##                 81040.339                121956.312 
## Science Fiction & Fantasy 
##                 85783.333

The summary statistics show that the mean number of votes does differ by genre, from about 5,500 for Documentary to about 122,000 for Other, so any bias from vote counts cannot be ruled out. We nevertheless proceed on the assumption that including genre in the model does not introduce substantial bias due to differences in the number of votes.

Now we check whether a possible predictor such as audience score has a linear association with the IMDb rating.

plot(movies$imdb_rating,movies$audience_score)

A linear association does exist, so audience score is a possible predictor of the IMDb rating.

Part 4: Modeling

We start with a full model containing all candidate predictors and then check whether each one earns its place.

We use backward elimination based on adjusted R-squared to select the predictors for this model. The variables in the full model are: 1. audience score, 2. critics score, 3. genre and 4. whether the movie was nominated for the best picture award.

We exclude variables that do not directly affect the rating, such as runtime, release month and best actor/actress wins, and those that can appear in multiple observations, such as the main actor or other actors who appear in more than one movie.

result <- lm(imdb_rating~audience_score+critics_score+genre+best_pic_nom, data = movies)
summary(result)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + genre + 
##     best_pic_nom, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40248 -0.21900  0.03859  0.29028  1.17289 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.6856727  0.0818607  45.024  < 2e-16 ***
## audience_score                  0.0343323  0.0013505  25.422  < 2e-16 ***
## critics_score                   0.0105851  0.0009567  11.064  < 2e-16 ***
## genreAnimation                 -0.4611392  0.1690583  -2.728 0.006554 ** 
## genreArt House & International  0.1847032  0.1402630   1.317 0.188368    
## genreComedy                    -0.1781805  0.0778428  -2.289 0.022407 *  
## genreDocumentary                0.2074266  0.0952164   2.178 0.029737 *  
## genreDrama                      0.0780908  0.0666997   1.171 0.242124    
## genreHorror                     0.0365993  0.1158205   0.316 0.752106    
## genreMusical & Performing Arts  0.0504997  0.1521120   0.332 0.740006    
## genreMystery & Suspense         0.2870734  0.0861145   3.334 0.000907 ***
## genreOther                     -0.0475929  0.1339342  -0.355 0.722449    
## genreScience Fiction & Fantasy -0.2065030  0.1691725  -1.221 0.222664    
## best_pic_nomyes                 0.1354311  0.1072295   1.263 0.207051    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4746 on 637 degrees of freedom
## Multiple R-squared:  0.8124, Adjusted R-squared:  0.8085 
## F-statistic: 212.2 on 13 and 637 DF,  p-value: < 2.2e-16

The p-value of the F-test is less than 0.05, so the model as a whole is significant. Both the multiple R-squared (0.8124) and the adjusted R-squared (0.8085) are high. We first drop genre to check whether the adjusted R-squared improves.
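Adjusted R-squared penalizes the multiple R-squared for the number of predictors, which is why it is the criterion used for elimination here. As a rough check, the standard formula can be applied to the figures printed in the summaries. The function name is made up for this sketch, and n = 651 is inferred from the output (637 residual degrees of freedom plus 14 estimated coefficients).

```r
# Adjusted R-squared from the multiple R-squared,
# the sample size n and the number of slope coefficients p.
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Full model: 13 slope coefficients (2 scores, 10 genre dummies, 1 nomination dummy).
full_adj <- adjusted_r_squared(0.8124, n = 651, p = 13)

# Model without genre: 3 slope coefficients.
no_genre_adj <- adjusted_r_squared(0.7966, n = 651, p = 3)

round(c(full = full_adj, no_genre = no_genre_adj), 4)
```

Because the R-squared inputs are themselves rounded to four decimals, the results agree with the printed adjusted values only up to rounding.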

result <- lm(imdb_rating~audience_score+critics_score+best_pic_nom, data = movies)
summary(result)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + best_pic_nom, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51865 -0.20372  0.02617  0.30988  1.22731 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.656972   0.063106  57.950   <2e-16 ***
## audience_score  0.034548   0.001347  25.641   <2e-16 ***
## critics_score   0.011747   0.000956  12.288   <2e-16 ***
## best_pic_nomyes 0.118187   0.109085   1.083    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4903 on 647 degrees of freedom
## Multiple R-squared:  0.7966, Adjusted R-squared:  0.7957 
## F-statistic: 844.7 on 3 and 647 DF,  p-value: < 2.2e-16

The adjusted R-squared actually decreases in this model, to 0.7957, so it is not a good idea to drop genre. Now we drop best_pic_nom.

result <- lm(imdb_rating~audience_score+critics_score+genre,data = movies)
summary(result)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + genre, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39362 -0.21778  0.03771  0.28558  1.17120 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.6703854  0.0809987  45.314  < 2e-16 ***
## audience_score                  0.0345479  0.0013403  25.777  < 2e-16 ***
## critics_score                   0.0106741  0.0009546  11.182  < 2e-16 ***
## genreAnimation                 -0.4637894  0.1691241  -2.742 0.006272 ** 
## genreArt House & International  0.1815911  0.1403068   1.294 0.196050    
## genreComedy                    -0.1763025  0.0778649  -2.264 0.023896 *  
## genreDocumentary                0.1971816  0.0949145   2.077 0.038158 *  
## genreDrama                      0.0812951  0.0666825   1.219 0.223243    
## genreHorror                     0.0380908  0.1158685   0.329 0.742459    
## genreMusical & Performing Arts  0.0416735  0.1520222   0.274 0.784075    
## genreMystery & Suspense         0.2899956  0.0861235   3.367 0.000805 ***
## genreOther                     -0.0355336  0.1336557  -0.266 0.790433    
## genreScience Fiction & Fantasy -0.2066413  0.1692513  -1.221 0.222570    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4749 on 638 degrees of freedom
## Multiple R-squared:  0.8119, Adjusted R-squared:  0.8084 
## F-statistic: 229.5 on 12 and 638 DF,  p-value: < 2.2e-16

The adjusted R-squared (0.8084) is still lower than that of the initial model (0.8085), although only just. Now we drop critics_score to evaluate.

result <- lm(imdb_rating~audience_score+genre+best_pic_nom,data = movies)
summary(result)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + genre + best_pic_nom, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.87572 -0.19546  0.06426  0.31398  1.06206 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.620172   0.089077  40.641  < 2e-16 ***
## audience_score                  0.043704   0.001148  38.085  < 2e-16 ***
## genreAnimation                 -0.449237   0.184440  -2.436   0.0151 *  
## genreArt House & International  0.197064   0.153023   1.288   0.1983    
## genreComedy                    -0.172611   0.084925  -2.033   0.0425 *  
## genreDocumentary                0.411408   0.101915   4.037 6.08e-05 ***
## genreDrama                      0.184909   0.072003   2.568   0.0105 *  
## genreHorror                     0.137919   0.125965   1.095   0.2740    
## genreMusical & Performing Arts  0.176232   0.165491   1.065   0.2873    
## genreMystery & Suspense         0.406740   0.093207   4.364 1.49e-05 ***
## genreOther                      0.068721   0.145672   0.472   0.6373    
## genreScience Fiction & Fantasy -0.088659   0.184202  -0.481   0.6305    
## best_pic_nomyes                 0.222830   0.116670   1.910   0.0566 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5178 on 638 degrees of freedom
## Multiple R-squared:  0.7763, Adjusted R-squared:  0.7721 
## F-statistic: 184.5 on 12 and 638 DF,  p-value: < 2.2e-16

The adjusted R-squared falls again, to 0.7721. We now drop audience score.

result <- lm(imdb_rating~genre+critics_score+best_pic_nom,data = movies)
summary(result)

## 
## Call:
## lm(formula = imdb_rating ~ genre + critics_score + best_pic_nom, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.80776 -0.37584  0.04958  0.42076  2.01684 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.900171   0.094277  51.976  < 2e-16 ***
## genreAnimation                 -0.297946   0.239593  -1.244 0.214121    
## genreArt House & International  0.379629   0.198630   1.911 0.056421 .  
## genreComedy                    -0.218250   0.110377  -1.977 0.048437 *  
## genreDocumentary                0.516665   0.133934   3.858 0.000126 ***
## genreDrama                      0.138715   0.094536   1.467 0.142780    
## genreHorror                    -0.275167   0.163338  -1.685 0.092546 .  
## genreMusical & Performing Arts  0.418712   0.214752   1.950 0.051644 .  
## genreMystery & Suspense         0.143731   0.121869   1.179 0.238682    
## genreOther                     -0.005346   0.189937  -0.028 0.977556    
## genreScience Fiction & Fantasy -0.436648   0.239584  -1.823 0.068843 .  
## critics_score                   0.025841   0.001057  24.452  < 2e-16 ***
## best_pic_nomyes                 0.480097   0.150857   3.182 0.001531 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6731 on 638 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6149 
## F-statistic: 87.49 on 12 and 638 DF,  p-value: < 2.2e-16

The adjusted R-squared drops substantially here, to 0.6149. Dropping any single variable lowers the adjusted R-squared, so our initial model is the best of those considered and we do not drop any variable. Let us now look at the model for evaluation.
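The drop-one comparisons above were run by hand. A small helper could do the same bookkeeping in one call. This is a sketch, not the procedure used in this report: drop_one_adj_r2 is a hypothetical name, and the example uses the built-in mtcars data as a stand-in so that it runs without movies.Rdata.

```r
# Adjusted R-squared of a fitted lm and of each model
# obtained by dropping one term at a time.
drop_one_adj_r2 <- function(fit) {
  terms_in_model <- attr(terms(fit), "term.labels")
  dropped <- sapply(terms_in_model, function(term) {
    # Refit the model without this term.
    reduced <- update(fit, as.formula(paste(". ~ . -", term)))
    summary(reduced)$adj.r.squared
  })
  c(full = summary(fit)$adj.r.squared, dropped)
}

# Stand-in example on built-in data (not the movies data).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
adj_r2_table <- drop_one_adj_r2(fit)
round(adj_r2_table, 4)
```

With the movies data loaded, calling drop_one_adj_r2(result) on the full model should reproduce the adjusted R-squared values reported in the four reduced fits above. A term is a candidate for removal only if dropping it raises the value above the "full" entry.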

result <- lm(imdb_rating~audience_score+critics_score+genre+best_pic_nom,data = movies)
summary(result)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + genre + 
##     best_pic_nom, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40248 -0.21900  0.03859  0.29028  1.17289 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.6856727  0.0818607  45.024  < 2e-16 ***
## audience_score                  0.0343323  0.0013505  25.422  < 2e-16 ***
## critics_score                   0.0105851  0.0009567  11.064  < 2e-16 ***
## genreAnimation                 -0.4611392  0.1690583  -2.728 0.006554 ** 
## genreArt House & International  0.1847032  0.1402630   1.317 0.188368    
## genreComedy                    -0.1781805  0.0778428  -2.289 0.022407 *  
## genreDocumentary                0.2074266  0.0952164   2.178 0.029737 *  
## genreDrama                      0.0780908  0.0666997   1.171 0.242124    
## genreHorror                     0.0365993  0.1158205   0.316 0.752106    
## genreMusical & Performing Arts  0.0504997  0.1521120   0.332 0.740006    
## genreMystery & Suspense         0.2870734  0.0861145   3.334 0.000907 ***
## genreOther                     -0.0475929  0.1339342  -0.355 0.722449    
## genreScience Fiction & Fantasy -0.2065030  0.1691725  -1.221 0.222664    
## best_pic_nomyes                 0.1354311  0.1072295   1.263 0.207051    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4746 on 637 degrees of freedom
## Multiple R-squared:  0.8124, Adjusted R-squared:  0.8085 
## F-statistic: 212.2 on 13 and 637 DF,  p-value: < 2.2e-16

In this model, audience score, critics score and the genres Animation and Mystery & Suspense are the most significant terms. Holding the other variables constant, each additional point of audience score is associated with a 0.034 increase in IMDb rating, whereas each additional point of critics score is associated with a 0.011 increase. The genre coefficients are relative to the reference level, Action & Adventure: an Animation movie is predicted to be rated 0.46 points lower on average, and a Mystery & Suspense movie 0.287 points higher.

Model diagnostics

We plot the residuals to check whether the model is appropriate.
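To make the size of these coefficients concrete, the short calculation below works through two comparisons using estimates copied from the summary above. The variable names are illustrative only.

```r
# Estimates copied from the full-model summary above.
coefs <- c(
  audience_score            = 0.0343323,
  critics_score             = 0.0105851,
  genreAnimation            = -0.4611392,
  "genreMystery & Suspense" = 0.2870734
)

# Expected change in IMDb rating for a 10-point rise in audience score,
# holding the other predictors fixed.
ten_point_audience <- 10 * coefs[["audience_score"]]

# Expected rating gap between a Mystery & Suspense movie and an
# Animation movie with identical scores and nomination status.
genre_gap <- coefs[["genreMystery & Suspense"]] - coefs[["genreAnimation"]]

round(c(ten_point_audience = ten_point_audience, genre_gap = genre_gap), 3)
```

Both figures are in IMDb rating points, not percentages.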

plot(result$residuals)

The residuals, plotted in observation order, show roughly constant variability around zero. This suggests that a linear model is a reasonable fit for the research question.

hist(result$residuals, breaks = 100)

The histogram also shows the residuals centered close to 0. Furthermore, we can look at summary statistics of the residuals.

summary(result$residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.40248 -0.21900  0.03859  0.00000  0.29028  1.17289

sd(result$residuals)

## [1] 0.4698633
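The checks above look at the residuals in observation order and as a histogram. Plotting residuals against fitted values and adding a normal Q-Q plot is a common complement. The helper below is a sketch with a made-up name (check_residuals), shown on the built-in mtcars data as a stand-in so that it runs without movies.Rdata.

```r
# Draw three standard residual plots for a fitted lm and
# return the mean and standard deviation of the residuals.
check_residuals <- function(fit) {
  res <- residuals(fit)
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  # Constant spread around zero supports the constant-variance assumption.
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Roughly symmetric, bell-shaped residuals support normality.
  hist(res, main = "Histogram of residuals", xlab = "Residuals")
  qqnorm(res)
  qqline(res)

  invisible(c(mean = mean(res), sd = sd(res)))
}

# Stand-in example on built-in data (not the movies data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
residual_stats <- check_residuals(fit)
residual_stats
```

With the movies model, the call would be check_residuals(result). Note that sd(result$residuals) divides by n - 1 = 650, whereas the residual standard error in the summary divides by the 637 residual degrees of freedom, which is why 0.4699 is slightly smaller than 0.4746.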

Part 5: Prediction

I used the movie Manchester by the Sea, which is rated 7.8 on IMDb. It is classified as Drama. The critics score is 96, whereas the audience score is 77. Sources: https://www.rottentomatoes.com/m/manchester_by_the_sea and https://www.imdb.com/list/ls033133511/

The model's prediction of the IMDb rating is obtained as follows.

ManchesterBySea <- data.frame( audience_score=77, critics_score=96,genre="Drama",best_pic_nom="yes")
predict(result,ManchesterBySea, interval = "predict")

##        fit      lwr      upr
## 1 7.558948 6.604914 8.512983

The predicted value is 7.56, with a lower limit of 6.60 and an upper limit of 8.51. This means we are 95% confident that the actual IMDb rating of a movie with these characteristics lies within these limits. The real IMDb rating, 7.8, falls within the prediction interval.
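The point prediction can be reproduced by hand from the coefficients of the full model, which makes it clear how each input contributes. The sketch below retypes the relevant estimates and the interval limits from the output above. The variable names are illustrative.

```r
# Estimates copied from the full-model summary.
intercept      <- 3.6856727
b_audience     <- 0.0343323
b_critics      <- 0.0105851
b_genre_drama  <- 0.0780908   # relative to Action & Adventure
b_best_pic_yes <- 0.1354311

# Inputs for Manchester by the Sea.
audience_score <- 77
critics_score  <- 96

manual_fit <- intercept +
  b_audience * audience_score +
  b_critics * critics_score +
  b_genre_drama +
  b_best_pic_yes

# Prediction interval limits copied from the predict() output above.
lower <- 6.604914
upper <- 8.512983
actual_rating <- 7.8

round(manual_fit, 3)
actual_rating >= lower && actual_rating <= upper
```

The hand calculation matches the fit column of predict() up to rounding of the printed coefficients.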

Part 6: Conclusion

The model can be summarized in numerical terms, but we must also apply logical reasoning to arrive at a sensible prediction. Ratings are influenced by qualitative factors that are beyond the scope of statistical measurement. Not all scores may be rational, and the voters may differ from movie to movie.