Math 239 Final

Introduction

The goal of this paper is to explore trends in movie data; specifically, it attempts to model individual movies’ ratings as measured on IMDB, among other topics. Questions of interest include how to predict a movie’s critical reception from its descriptive features, how genres differ from one another in ratings, and how these statistics have evolved over time.

The paper presents a working model for ratings prediction, as well as some information on average rating across genres and observations about trends in movie ratings from different eras.

Data

The data used in this project were adapted in 2016 by Kaggle user Orges Leka from IMDB using the \(\texttt{wget}\) utility. The specific dataset was chosen because of its large sample size and because it had been conveniently preprocessed to be easy to work with. It takes the form of a matrix of 14,762 movies and TV shows, of which 13,153 were easily usable.

The data were reasonably tidy as presented but needed some wrangling before being usable. In several rows the data had been entered incorrectly, likely because certain German characters were mistranscribed at some point in the process and interpreted as a cell break, which misaligned every cell after them in that row. After rows containing NA values were removed, the misaligned rows still needed to be cleaned out. One column, imdbRating, was chosen as a check because it should always be numeric and so was always affected when a row was shifted. Each of its entries was converted to character and then to numeric; any entry from a misaligned row returned NA, and that row was removed.
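The coercion check described above can be illustrated on a toy data frame. This is a minimal sketch with made-up values; only the column name imdbRating comes from the dataset, and the name `cleaned` is hypothetical.

```r
# Toy data frame standing in for the raw import (values are made up).
raw <- data.frame(
  title      = c("Film A", "Film B", "Film C", "Film D"),
  imdbRating = c("7.1", "Drama", "6.4", NA),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA (with a warning) for any entry that is not a number,
# so shifted rows show up as NA after coercion.
ratingNum <- suppressWarnings(as.numeric(as.character(raw$imdbRating)))

# Keep only rows whose rating survived the coercion, then store it as numeric.
cleaned <- raw[!is.na(ratingNum), ]
cleaned$imdbRating <- as.numeric(cleaned$imdbRating)

print(cleaned)
```

In this toy example the row holding "Drama" in the rating column plays the role of a misaligned row, and it is dropped along with the row whose rating is missing.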

Even after the available import arguments had been applied, a number of variables that were meant to be numeric had been read in as factors or character strings. These columns were coerced to numeric using a manually selected list of column names.

The variables for each movie include useful details like IMDB rating, number of ratings, duration, release year, number of awards won, and a sequence of dummy variables indicating whether the film falls into each of around 30 genre categories like “Comedy,” “Talk Show,” or “FilmNoir.”

Methods

Many of the variables are unevenly distributed and heavily skewed, simply because there are far more little-known films than big-name films.

To counteract this, we can use the bootstrap to simulate a sampling distribution in order to estimate the means and other characteristics of our skewed data. We do this by resampling with replacement from the observed data, computing the desired statistic on each resample, and taking percentiles of the resulting values to form a confidence interval. In this case bootstrapping may not have been fully necessary because of the large sample size; the bootstrapped mean values varied from the means of their base distributions by less than 0.01%.
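As a minimal sketch of this procedure, the example below bootstraps the mean of simulated skewed data and compares it with the plain sample mean. The data, the seed, and the names `bootMeans`, `ci`, and `relDiff` are assumptions made for illustration and are not taken from the movie dataset.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Simulated right-skewed data standing in for a column of the movie data.
x <- rexp(500, rate = 1 / 5)

# Resample with replacement and record the mean of each resample.
nsim <- 2000
bootMeans <- replicate(nsim, mean(sample(x, length(x), replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrapped means.
ci <- quantile(bootMeans, c(0.025, 0.975))

# Relative gap between the bootstrap mean and the plain sample mean.
relDiff <- abs(mean(bootMeans) - mean(x)) / mean(x)

print(ci)
print(relDiff)
```

With a few hundred observations the bootstrap mean sits very close to the sample mean, which is the same pattern noted above for the much larger movie dataset.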

One major goal of the paper was to perform prediction on imdbRating. To find the best model, the first step was to explore the significance of genre by training a multiple linear regression model with each of the genre variables. By incrementally trimming the least significant predictors, the list was narrowed to 19 genre variables with effects on the average rating that were statistically significant at the 0.01 level. A few other predictors proved to be meaningful, specifically year released and number of awards won. The residuals plot for year suggested that a higher-order polynomial would model the data better.
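One way to automate this kind of trimming is a simple backward-elimination loop. The sketch below is only an illustration on simulated data: the predictors g1 to g4, the response `rating`, and the stopping rule are assumptions, and the trimming in this paper may well have been done by hand.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated data: g1 and g2 affect the rating, g3 and g4 are pure noise.
n <- 300
sim <- data.frame(
  g1 = rbinom(n, 1, 0.5), g2 = rbinom(n, 1, 0.5),
  g3 = rbinom(n, 1, 0.5), g4 = rbinom(n, 1, 0.5)
)
sim$rating <- 6.5 + 0.8 * sim$g1 - 0.6 * sim$g2 + rnorm(n)

predictors <- c("g1", "g2", "g3", "g4")
alpha <- 0.01

repeat {
  fit <- lm(reformulate(predictors, response = "rating"), data = sim)
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]  # drop the intercept
  pvals <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
  # Stop once every remaining predictor is significant at the chosen level.
  if (max(pvals) < alpha || length(predictors) == 1) break
  # Otherwise remove the least significant predictor and refit.
  predictors <- setdiff(predictors, names(which.max(pvals)))
}

print(predictors)
```

Removing one predictor at a time and refitting matters because p-values change as correlated predictors leave the model.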

In order to choose the best degree for year in regression, the \(\texttt{boot}\) package was employed to calculate the MSE of each possible degree from 1 to 10 using k-fold cross validation on 10 folds. This made it easy to observe when increasing the degree stopped being effective in reducing the model error.
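To show what the cross-validation step computes, here is a base R sketch of 10-fold cross-validation over polynomial degree on simulated data. It is written by hand rather than with the \(\texttt{boot}\) package, and the data and the helper `cvMSE` are hypothetical.

```r
set.seed(239)

# Simulated curved data (hypothetical; not the movie data).
n <- 200
d <- data.frame(x = runif(n, -2, 2))
d$y <- 1 + d$x - 0.5 * d$x^2 + 0.3 * d$x^3 + rnorm(n, sd = 0.5)

# Randomly assign each row to one of K folds.
K <- 10
fold <- sample(rep(1:K, length.out = n))

# Cross-validated MSE for one polynomial degree.
cvMSE <- function(degree) {
  sqErr <- numeric(n)
  for (k in 1:K) {
    held <- fold == k
    # Fit on the other K - 1 folds, then predict the held-out fold.
    fit <- lm(y ~ poly(x, degree), data = d[!held, ])
    pred <- predict(fit, newdata = d[held, ])
    sqErr[held] <- (d$y[held] - pred)^2
  }
  mean(sqErr)
}

cvError <- sapply(1:6, cvMSE)
print(round(cvError, 3))
```

Plotting `cvError` against degree gives the kind of curve used to judge where extra polynomial terms stop paying for themselves.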

The error curve’s odd ledge shape offers a choice between degree 2 and degree 4; degree 4 was chosen for its noticeably lower bias overall, particularly for the last 10 years of the data.

Results

The genre with the lowest mean rating was Science Fiction, with a rating of 6.449022 out of 10. The genre with the highest mean rating was Animation, with a rating of 7.188239. The graph below shows the confidence intervals for the mean rating of the following genres, in order: Science Fiction, Thriller, Action, Comedy, Drama, Documentary, Animation.

The best model based on these data used polynomial regression to incorporate genre information as well as awards won, duration, and release year. It achieved an adjusted R-squared value of 0.175, which is mildly predictive although not extremely useful.

## 
## Call:
## lm(formula = imdbRating ~ nrOfWins + duration + poly(year, 4) + 
##     Action + Animation + Documentary + Drama + Family + Fantasy + 
##     GameShow + History + Horror + Mystery + News + RealityTV + 
##     Romance + SciFi + Short + Thriller + War, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9267 -0.4857  0.1005  0.6165  3.8763 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.795e+00  2.760e-02 246.203  < 2e-16 ***
## nrOfWins        2.134e-02  9.375e-04  22.763  < 2e-16 ***
## duration       -1.459e-05  3.812e-06  -3.828 0.000130 ***
## poly(year, 4)1 -1.817e+01  1.075e+00 -16.907  < 2e-16 ***
## poly(year, 4)2  4.089e+00  1.061e+00   3.852 0.000118 ***
## poly(year, 4)3  6.955e+00  1.079e+00   6.448 1.18e-10 ***
## poly(year, 4)4  4.513e+00  1.098e+00   4.108 4.02e-05 ***
## Action         -2.360e-01  2.588e-02  -9.116  < 2e-16 ***
## Animation       6.317e-01  3.855e-02  16.389  < 2e-16 ***
## Documentary     4.910e-01  3.950e-02  12.430  < 2e-16 ***
## Drama           3.386e-01  2.114e-02  16.018  < 2e-16 ***
## Family         -3.477e-01  3.696e-02  -9.409  < 2e-16 ***
## Fantasy        -1.495e-01  3.767e-02  -3.969 7.25e-05 ***
## GameShow        6.053e-01  1.201e-01   5.041 4.69e-07 ***
## History         1.416e-01  4.679e-02   3.027 0.002472 ** 
## Horror         -7.063e-01  3.611e-02 -19.557  < 2e-16 ***
## Mystery         2.078e-01  3.527e-02   5.892 3.92e-09 ***
## News           -7.730e-01  1.488e-01  -5.194 2.09e-07 ***
## RealityTV      -4.828e-01  1.316e-01  -3.667 0.000246 ***
## Romance        -1.548e-01  2.777e-02  -5.575 2.53e-08 ***
## SciFi          -1.501e-01  3.617e-02  -4.151 3.33e-05 ***
## Short          -3.339e-01  5.434e-02  -6.145 8.24e-10 ***
## Thriller       -8.318e-02  3.163e-02  -2.630 0.008547 ** 
## War             1.183e-01  4.961e-02   2.384 0.017155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 12381 degrees of freedom
##   (748 observations deleted due to missingness)
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.175 
## F-statistic: 115.4 on 23 and 12381 DF,  p-value: < 2.2e-16
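
The adjusted R-squared in this output can be recovered from the multiple R-squared and the degrees of freedom. As a quick check, assuming the standard definition with \(n\) observations and \(p\) predictors (here \(p = 23\) and \(n - p - 1 = 12381\), so \(n = 12405\)):

\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.1765)\cdot\frac{12404}{12381} \approx 0.175
\]

Because the sample is so large relative to the number of predictors, the adjustment barely changes the value.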

Analysis

One interesting result was in the relationship between year and rating. To visualize this relationship, it proved useful to calculate and plot the average movie rating for each year. This keeps the same overall shape as the full distribution but is significantly more readable and shows an interesting trend. The analysis leaves out years before 1920, as prior to that point there are not enough data points to form a consistent average and the data that do exist are anomalous, which is to be expected since the medium evolved rapidly over its first several decades of existence.
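The yearly averaging can be sketched on a toy data frame as follows. The values are made up, the column names mirror the dataset, and `aggregate` is used here as one possible approach rather than the method used in the appendix.

```r
# Toy ratings (made-up values) standing in for the year and imdbRating columns.
toy <- data.frame(
  year       = c(1919, 1950, 1950, 1950, 2010, 2010),
  imdbRating = c(8.0, 7.0, 7.5, 8.0, 6.0, 7.0)
)

# Drop the sparse early years, then average the rating within each year.
recent <- toy[toy$year >= 1920, ]
avgByYear <- aggregate(imdbRating ~ year, data = recent, FUN = mean)
names(avgByYear) <- c("year", "avgRating")

print(avgByYear)
```

Grouping this way also skips years with no films automatically, so no missing averages need to be removed afterwards.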

In the plot of yearly averages the trend can be clearly seen, and it suggests interesting things about the nature of IMDB’s ratings: average ratings decline steadily from 1920 until around 2000, when they make a sudden and distinct upturn. One hypothesized explanation for this trend is a combination of hindsight and recency biases. That is, reviewers have a tendency to romanticize historical movies and give the “classics” the highest marks, but at the same time are more familiar with the most recent movies and more willing to rate them highly than the no-longer-hip but not-yet-classic films of a decade or two ago. (The data were collected largely around 2016.)

This is just one possible explanation; the relationship may instead reflect an objective change in the quality of films over the past century, or be the result of some other factor.

Conclusion

Future exploration may seek to test the cause of the change in ratings over time. It may also investigate in more detail correlations between genres. Are some genres more likely to be paired than others? Has the prevalence of certain genres changed with various time periods? Finally, while the predictive model presented in this paper is somewhat significant, the task of building a more successful model remains, whether by gathering data on predictors that more accurately explain the variance or by improving the modeling techniques used.

Appendix: Code

#two packages used throughout the paper
library(tidyverse)
library(ggplot2)

movies <- read.csv("Data/imdb.csv", 
                   header=TRUE, 
                   stringsAsFactors = FALSE)
movies <- na.omit(movies)

#removes all rows for which the entry for imdbRating is nonnumeric
movies <- movies[!is.na(as.numeric(as.character(movies$imdbRating))),]

## Warning in `[.data.frame`(movies, !
## is.na(as.numeric(as.character(movies$imdbRating))), : NAs introduced by
## coercion

#coerces a specified list of columns to be numeric
numericCols <-c("imdbRating","ratingCount","duration","year","nrOfWins", "nrOfNominations","nrOfPhotos","nrOfNewsArticles")
movies[numericCols] <- lapply(movies[numericCols],as.numeric)

#define subset of films released after 1920, which are less anomalous as a group
modernMovies <- filter(movies, year > 1920)

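#defining the one sample bootstrap:
#returns a vector of nsim bootstrapped means of data
bootStrapCI1<-function(data, nsim){
  n<-length(data)
  #preallocate the result instead of growing it on every iteration
  bootCI<-numeric(nsim)
  
  for(i in seq_len(nsim)){
    #draw n indices with replacement and take the mean of that resample
    bootSamp<-sample(1:n, n, replace=TRUE)
    bootCI[i]<-mean(data[bootSamp])
  }
  return(bootCI)
}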

#calculating each bootstrapped mean
thrillerBoot <- bootStrapCI1(movies$imdbRating[movies$Thriller==1],nsim=1000)
actionBoot <- bootStrapCI1(movies$imdbRating[movies$Action==1],nsim=1000)
comedyBoot <- bootStrapCI1(movies$imdbRating[movies$Comedy==1],nsim=1000)
sciFiBoot <- bootStrapCI1(movies$imdbRating[movies$SciFi==1],nsim=1000)
dramaBoot <- bootStrapCI1(movies$imdbRating[movies$Drama==1],nsim=1000)
documentaryBoot <- bootStrapCI1(movies$imdbRating[movies$Documentary==1],nsim=1000)
animationBoot <- bootStrapCI1(movies$imdbRating[movies$Animation==1],nsim=1000)

#print each 95% confidence interval
quantile(thrillerBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 6.548670 6.665488

quantile(actionBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 6.574355 6.672734

quantile(comedyBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 6.801529 6.865085

quantile(sciFiBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 6.358308 6.544740

quantile(dramaBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 7.062066 7.107780

quantile(documentaryBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 7.051793 7.169442

quantile(animationBoot,c(0.025,0.975))

##     2.5%    97.5% 
## 7.133256 7.242755

#calculate the average rating of all movies released in a given year
avgRatingByYear = c()
for(i in 1920:2014) {
  avgRatingByYear = c(avgRatingByYear,mean(movies$imdbRating[movies$year == i]))
}

#create dataframe to hold average ratings alongside year data
ratingDF <- data.frame(1920:2014,avgRatingByYear)
names(ratingDF)<-c("year","avgRating")
#a few early years have no movies recorded so need to be removed
ratingDF <- na.omit(ratingDF)

#plot a basic model with a linear year term
avgRatingMod <- lm(avgRating ~ year,data=ratingDF)
plot(ratingDF$avgRating~ratingDF$year)
points(ratingDF$year, fitted(avgRatingMod), col='red', pch=20)

#observe the residual-fitted plot is unsatisfactory
plot(avgRatingMod,which=1)

#run CV on the model to select the proper polynomial degree

library(boot)
set.seed(239)

#calculate MSE for degree 1-10 using k-fold cross validation
cv.error.k10<-rep(0, 10)
for(i in 1:10){
  glm.fit<-glm(avgRating~poly(year, i), data=ratingDF)
  cv.error.k10[i]<-cv.glm(ratingDF, glm.fit, K=10)$delta[1]
}

#put MSE values into a dataframe
cvDF<-data.frame(degree=1:10, cv.error.k10)


#plot the error as a function of degree
ggplot(data=cvDF, aes(x=degree, y=cv.error.k10))+
  geom_point()+
  geom_line()

#compare visually the fit of degree 2 and 4:
degFourMod <- lm(avgRating ~ poly(year,4),data=ratingDF)
degTwoMod <- lm(avgRating ~ poly(year,2),data=ratingDF)

plot(ratingDF$avgRating~ratingDF$year)
points(ratingDF$year, fitted(degTwoMod), col='blue', pch=20)

plot(ratingDF$avgRating~ratingDF$year)
points(ratingDF$year, fitted(degFourMod), col='red', pch=20)

#genre exploration model
genreMod <- lm(imdbRating ~ Action+Adult+ Adventure+Animation+ Biography+Comedy+Crime+Documentary+Drama+
                    Family+Fantasy+FilmNoir+GameShow+History+Horror+Music+Musical+Mystery+News+RealityTV+     
                    Romance+SciFi+Short+Sport+TalkShow+Thriller+War+Western +Western:year, data=movies)
summary(genreMod)

## 
## Call:
## lm(formula = imdbRating ~ Action + Adult + Adventure + Animation + 
##     Biography + Comedy + Crime + Documentary + Drama + Family + 
##     Fantasy + FilmNoir + GameShow + History + Horror + Music + 
##     Musical + Mystery + News + RealityTV + Romance + SciFi + 
##     Short + Sport + TalkShow + Thriller + War + Western + Western:year, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7387 -0.5315  0.1154  0.6768  3.6908 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.784953   0.032758 207.125  < 2e-16 ***
## Action       -0.256962   0.028278  -9.087  < 2e-16 ***
## Adult        -0.624648   0.264986  -2.357 0.018424 *  
## Adventure    -0.050941   0.030955  -1.646 0.099857 .  
## Animation     0.599322   0.039033  15.354  < 2e-16 ***
## Biography     0.121121   0.047243   2.564 0.010364 *  
## Comedy       -0.007137   0.024814  -0.288 0.773642    
## Crime         0.010704   0.029105   0.368 0.713047    
## Documentary   0.338223   0.043868   7.710 1.35e-14 ***
## Drama         0.353716   0.024265  14.577  < 2e-16 ***
## Family       -0.338926   0.037953  -8.930  < 2e-16 ***
## Fantasy      -0.119283   0.039820  -2.996 0.002745 ** 
## FilmNoir      0.210380   0.080165   2.624 0.008691 ** 
## GameShow      0.485809   0.116629   4.165 3.13e-05 ***
## History       0.180454   0.049074   3.677 0.000237 ***
## Horror       -0.726734   0.039323 -18.481  < 2e-16 ***
## Music         0.093743   0.054171   1.731 0.083562 .  
## Musical       0.019095   0.057774   0.331 0.741016    
## Mystery       0.217891   0.037933   5.744 9.45e-09 ***
## News         -0.766057   0.153096  -5.004 5.69e-07 ***
## RealityTV    -0.551863   0.116491  -4.737 2.19e-06 ***
## Romance      -0.122443   0.029172  -4.197 2.72e-05 ***
## SciFi        -0.148699   0.038537  -3.859 0.000115 ***
## Short        -0.241887   0.053307  -4.538 5.74e-06 ***
## Sport        -0.149965   0.072126  -2.079 0.037617 *  
## TalkShow      0.003952   0.090778   0.044 0.965279    
## Thriller     -0.124492   0.034548  -3.603 0.000315 ***
## War           0.233249   0.052347   4.456 8.42e-06 ***
## Western      11.225249   6.295873   1.783 0.074617 .  
## Western:year -0.005689   0.003201  -1.777 0.075598 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 13123 degrees of freedom
## Multiple R-squared:  0.1151, Adjusted R-squared:  0.1131 
## F-statistic: 58.85 on 29 and 13123 DF,  p-value: < 2.2e-16

#highest accuracy model
mod4 <- lm(imdbRating ~ nrOfWins+duration+poly(year,4)+Action+Animation+Documentary+Drama+
             Family+Fantasy+GameShow+History+Horror+Mystery+News+RealityTV+     
             Romance+SciFi+Short+Thriller+War, data=movies)
summary(mod4)

## 
## Call:
## lm(formula = imdbRating ~ nrOfWins + duration + poly(year, 4) + 
##     Action + Animation + Documentary + Drama + Family + Fantasy + 
##     GameShow + History + Horror + Mystery + News + RealityTV + 
##     Romance + SciFi + Short + Thriller + War, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9267 -0.4857  0.1005  0.6165  3.8763 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.795e+00  2.760e-02 246.203  < 2e-16 ***
## nrOfWins        2.134e-02  9.375e-04  22.763  < 2e-16 ***
## duration       -1.459e-05  3.812e-06  -3.828 0.000130 ***
## poly(year, 4)1 -1.817e+01  1.075e+00 -16.907  < 2e-16 ***
## poly(year, 4)2  4.089e+00  1.061e+00   3.852 0.000118 ***
## poly(year, 4)3  6.955e+00  1.079e+00   6.448 1.18e-10 ***
## poly(year, 4)4  4.513e+00  1.098e+00   4.108 4.02e-05 ***
## Action         -2.360e-01  2.588e-02  -9.116  < 2e-16 ***
## Animation       6.317e-01  3.855e-02  16.389  < 2e-16 ***
## Documentary     4.910e-01  3.950e-02  12.430  < 2e-16 ***
## Drama           3.386e-01  2.114e-02  16.018  < 2e-16 ***
## Family         -3.477e-01  3.696e-02  -9.409  < 2e-16 ***
## Fantasy        -1.495e-01  3.767e-02  -3.969 7.25e-05 ***
## GameShow        6.053e-01  1.201e-01   5.041 4.69e-07 ***
## History         1.416e-01  4.679e-02   3.027 0.002472 ** 
## Horror         -7.063e-01  3.611e-02 -19.557  < 2e-16 ***
## Mystery         2.078e-01  3.527e-02   5.892 3.92e-09 ***
## News           -7.730e-01  1.488e-01  -5.194 2.09e-07 ***
## RealityTV      -4.828e-01  1.316e-01  -3.667 0.000246 ***
## Romance        -1.548e-01  2.777e-02  -5.575 2.53e-08 ***
## SciFi          -1.501e-01  3.617e-02  -4.151 3.33e-05 ***
## Short          -3.339e-01  5.434e-02  -6.145 8.24e-10 ***
## Thriller       -8.318e-02  3.163e-02  -2.630 0.008547 ** 
## War             1.183e-01  4.961e-02   2.384 0.017155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 12381 degrees of freedom
##   (748 observations deleted due to missingness)
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.175 
## F-statistic: 115.4 on 23 and 12381 DF,  p-value: < 2.2e-16

plot(mod4)