Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

# set working directory 
setwd('C:/Users/sanso/Documents/Git/statsR/regression')
load("movies.Rdata")

Part 1: Data

This data has been collected from IMDB with a number of movies

The data seems sufficiently random to be able to generalize this to a population. However, since this is observed data, we will not be able to make any causal statements regarding this.

Our job is also to make a model which would be able to predict the audience score with a set of regressors

Pre processing the data

Checking what the data looks like

dim(movies)

## [1] 651  32

This data has 651 observations with 32 vaeiables.

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

To start with I would be removing, the last 2 variables which are imdb_url and rt_url

movies$rt_url <- NULL ; movies$imdb_url <- NULL 

names(movies)

##  [1] "title"            "title_type"       "genre"           
##  [4] "runtime"          "mpaa_rating"      "studio"          
##  [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
## [10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
## [16] "critics_score"    "audience_rating"  "audience_score"  
## [19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [22] "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"          
## [28] "actor3"           "actor4"           "actor5"

Part 2: Research question

Since this project involves trying to understand what are the factors that make a movie popular, I would be using the variable " audience_score" as an outcome variable and would try and find what are the other factors which impact this variable

Part 3: Exploratory data analysis

As part of the exploratory analysis I would be checking what are the major factors that I would be using to build this model by checking if factors are relevant to the outcome variable or not

Checking for Variables

1st check would be on the Character Variables - Director and actor

length(unique(movies$director))

## [1] 533

length(unique(movies$actor1))

## [1] 486

since there are so many unique directors(533) and actors(486), I would not be using these are regressors,

Removing the regressors

movies <- movies[,-c(25:30)]
dim(movies)

## [1] 651  24

movies <- movies[,-1]
names(movies)

##  [1] "title_type"       "genre"            "runtime"         
##  [4] "mpaa_rating"      "studio"           "thtr_rel_year"   
##  [7] "thtr_rel_month"   "thtr_rel_day"     "dvd_rel_year"    
## [10] "dvd_rel_month"    "dvd_rel_day"      "imdb_rating"     
## [13] "imdb_num_votes"   "critics_rating"   "critics_score"   
## [16] "audience_rating"  "audience_score"   "best_pic_nom"    
## [19] "best_pic_win"     "best_actor_win"   "best_actress_win"
## [22] "best_dir_win"     "top200_box"

Now would be checking for the individual regressors

Checking if runtime has an impact on audience score

g <- ggplot(data = movies, aes(x = runtime, y = audience_score))
g <- g + geom_point() 
g <- g + geom_smooth(method = "lm", se = FALSE)
g

concluding that there is some degree of correlation between runtime and audience score

Now would be checking if there is any impact of “title type” on audience score

g <- ggplot(data = movies, aes( x= title_type, y = audience_score))
g <- g + geom_boxplot(aes(fill = title_type))
g <- g + xlab("Title Type") + ylab("Audience Score") 
g

Concluding that title type is required

Checking for Genre Type

g <- ggplot(data = movies, aes( x= genre, y = audience_score))
g <- g + geom_boxplot(aes(fill = genre))
g <- g + xlab("Genre") + ylab("Audience Score") 
g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
g

I would be keeping Genre as a regressor

I would be checking if Studio is required

length(unique(movies$studio))

## [1] 212

Since there are 212 unique Studio Names, I would not be using this as a regressor.

Deleting the column Studio

movies$studio <- NULL

Now would be checking if MPAA rating is important for predicting audience score

g <- ggplot(data = movies, aes( x= mpaa_rating, y = audience_score))
  g <- g + geom_boxplot(aes(fill = mpaa_rating))
  g <- g + xlab("MPAA Rating ") + ylab("Audience Score") 
  g <- g + theme(axis.text.x = element_text(size = 10, angle = 0))
  g

Rating wiill be used as a regressor

I will not be using release year since we would be using preidctions of different years

Checking if the release month has any impact on the outcome variable

g <- ggplot(data = movies, aes( x= factor(thtr_rel_month), y = audience_score))
  g <- g + geom_boxplot(aes(fill = thtr_rel_month))
  g <- g + xlab("Month ") + ylab("Audience Score") 
  g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
  g

There does not seem to much impact of the month. Hence I would be deleting the release date, month and year variables

movies$thtr_rel_year <- NULL; movies$thtr_rel_month <- NULL; movies$thtr_rel_day <- NULL

movies$dvd_rel_day <- NULL; movies$dvd_rel_month <- NULL; movies$dvd_rel_year <- NULL

Checking to see if IMDB ratings have any impact on audience score

 g <- ggplot(data = movies, aes( x= imdb_rating, y = audience_score))
  g <- g + geom_point() + geom_smooth(method = "lm", se = FALSE)
  g <- g + xlab("IMDB Rating ") + ylab("Audience Score") 
  g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
  g

There is a very strong linear relation between these

Now checking for IMDB Number Votes and if that has any impact

g <- ggplot(data = movies, aes( x= imdb_num_votes, y = audience_score))
  g <- g + geom_point() + geom_smooth(method = "lm", se = FALSE)
  g <- g + xlab("IMDB Votes # ") + ylab("Audience Score") 
  g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
  g

while the linear relation does not seem to be particularly strong, it does merit using it in the model

checking for Crtics Rating type

g <- ggplot(data = movies, aes( x= critics_rating, y = audience_score))
  g <- g + geom_boxplot(aes(fill = critics_rating))
  g <- g + xlab("Critics Rating  ") + ylab("Audience Score") 
  g <- g + theme(axis.text.x = element_text(size = 10, angle = 0))
  g

There does seem to be an impact; hence would be considered

Checking for Critics Score

g <- ggplot(data = movies, aes( x= critics_score, y = audience_score))
  g <- g + geom_point() + geom_smooth(method = "lm", se = FALSE)
  g <- g + xlab("Critics Score ") + ylab("Audience Score") 
  g <- g + theme(axis.text.x = element_text(size = 10, angle = 0))
  g

There is a very strong Linear relation

Part 4: Modeling

For this section I would be using the results I have gotten in the exploratory data analysis section in terms of regressors to be used

# Checking what the current regressors are 

names(movies)

##  [1] "title_type"       "genre"            "runtime"         
##  [4] "mpaa_rating"      "imdb_rating"      "imdb_num_votes"  
##  [7] "critics_rating"   "critics_score"    "audience_rating" 
## [10] "audience_score"   "best_pic_nom"     "best_pic_win"    
## [13] "best_actor_win"   "best_actress_win" "best_dir_win"    
## [16] "top200_box"

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  16 variables:
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Make the 1st Model

Making a model with all the regressors

fit1 <- lm(audience_score ~ ., data = movies)
summary(fit1)

## 
## Call:
## lm(formula = audience_score ~ ., data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3327  -4.3939   0.3987   4.2976  24.3401 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -9.088e+00  4.494e+00  -2.023   0.0436 *  
## title_typeFeature Film          2.469e+00  2.577e+00   0.958   0.3383    
## title_typeTV Movie              9.358e-01  4.042e+00   0.232   0.8170    
## genreAnimation                  2.477e+00  2.717e+00   0.912   0.3623    
## genreArt House & International -2.337e+00  2.111e+00  -1.107   0.2686    
## genreComedy                     1.618e+00  1.159e+00   1.397   0.1630    
## genreDocumentary                2.545e+00  2.762e+00   0.921   0.3572    
## genreDrama                     -4.135e-01  1.024e+00  -0.404   0.6866    
## genreHorror                    -1.691e+00  1.727e+00  -0.979   0.3280    
## genreMusical & Performing Arts  3.785e+00  2.371e+00   1.597   0.1109    
## genreMystery & Suspense        -2.770e+00  1.305e+00  -2.123   0.0342 *  
## genreOther                     -1.547e-01  1.968e+00  -0.079   0.9374    
## genreScience Fiction & Fantasy -2.709e-01  2.463e+00  -0.110   0.9125    
## runtime                        -2.706e-02  1.702e-02  -1.590   0.1123    
## mpaa_ratingNC-17               -6.803e-01  5.235e+00  -0.130   0.8966    
## mpaa_ratingPG                  -1.861e-01  1.907e+00  -0.098   0.9223    
## mpaa_ratingPG-13               -1.125e+00  1.974e+00  -0.570   0.5692    
## mpaa_ratingR                   -1.296e+00  1.904e+00  -0.681   0.4963    
## mpaa_ratingUnrated             -3.254e-01  2.179e+00  -0.149   0.8813    
## imdb_rating                     9.403e+00  4.893e-01  19.218   <2e-16 ***
## imdb_num_votes                  3.332e-06  3.234e-06   1.030   0.3033    
## critics_ratingFresh             1.535e-01  8.691e-01   0.177   0.8599    
## critics_ratingRotten           -6.669e-01  1.402e+00  -0.476   0.6344    
## critics_score                   9.438e-03  2.527e-02   0.373   0.7089    
## audience_ratingUpright          2.004e+01  7.924e-01  25.289   <2e-16 ***
## best_pic_nomyes                 3.719e+00  1.811e+00   2.053   0.0405 *  
## best_pic_winyes                -2.523e+00  3.192e+00  -0.790   0.4297    
## best_actor_winyes              -6.606e-03  8.224e-01  -0.008   0.9936    
## best_actress_winyes            -1.267e+00  9.100e-01  -1.392   0.1643    
## best_dir_winyes                 3.046e-01  1.192e+00   0.256   0.7984    
## top200_boxyes                  -6.564e-01  1.936e+00  -0.339   0.7347    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.871 on 619 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:   0.89,  Adjusted R-squared:  0.8847 
## F-statistic:   167 on 30 and 619 DF,  p-value: < 2.2e-16

Here we get a R-squared value of 89% and and adjusted R-squared of 88.5%. This means that this model explains 89% of the Variance of the outcome variable

Make the 2nd Model.

I would now be checking if there is a way in which I can reduce the number of regressors. I would be using the Step function which basically checks for maximization of the adjusted R-squared by reducing regressors

fit2 <- step(lm(audience_score ~ ., data = movies), trace = 0) # step function

summary(fit2) # Printing out the results

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     critics_score + audience_rating + best_pic_nom + best_actress_win, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.2299  -4.3817   0.5075   4.3523  24.1505 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -8.75008    2.66718  -3.281  0.00109 ** 
## genreAnimation                  3.20421    2.45639   1.304  0.19256    
## genreArt House & International -2.66418    2.02406  -1.316  0.18856    
## genreComedy                     1.35890    1.13110   1.201  0.23005    
## genreDocumentary                0.16974    1.39624   0.122  0.90328    
## genreDrama                     -0.79769    0.96859  -0.824  0.41050    
## genreHorror                    -2.09686    1.67267  -1.254  0.21045    
## genreMusical & Performing Arts  2.60624    2.18644   1.192  0.23371    
## genreMystery & Suspense        -3.13253    1.24911  -2.508  0.01240 *  
## genreOther                     -0.02899    1.92785  -0.015  0.98801    
## genreScience Fiction & Fantasy -0.08638    2.43485  -0.035  0.97171    
## runtime                        -0.02547    0.01567  -1.626  0.10445    
## imdb_rating                     9.47342    0.45974  20.606  < 2e-16 ***
## critics_score                   0.02177    0.01502   1.450  0.14759    
## audience_ratingUpright         20.05406    0.77665  25.821  < 2e-16 ***
## best_pic_nomyes                 3.52360    1.59063   2.215  0.02710 *  
## best_actress_winyes            -1.29501    0.89570  -1.446  0.14873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.823 on 633 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8891, Adjusted R-squared:  0.8863 
## F-statistic: 317.2 on 16 and 633 DF,  p-value: < 2.2e-16

We are now down to 7 regressors where we get an adjusted R-squared value of 88.63% ( a slight increase over the last model) and a R-squared value of 88.91% ( which is drop of 0.1% drop over the last model)

The regressors would be :

genre + runtime + imdb_rating + critics_score + audience_rating + best_pic_nom + best_actress_win

Hence we see that even with 7 regressors, we can explain 88.91% of the variance of the outcome variable with this model.

I would be selecting this model ( fit2) as the final model

plotting the residuals of this model

par(mfrow = c(2,2))
plot(fit2)

The residuals plot shows that the residuals are scattered normally and also the normal QQ line shows that it is near normal

Hence we can conclude that it is a good fit

Part 5: Prediction

I would be using the movie " Rogue One - A Star Wars story" for predictions

Link : http://www.imdb.com/title/tt3748528/?ref_=adv_li_tt

The variables I require for this model are

genre,runtime,imdb_rating,critics_score, audience_rating,best_pic_nom,best_actress_win

I have created a csv file with this data which I would be importing

movie1 <- read.csv('newdata.csv')
str(movie1)

## 'data.frame':    1 obs. of  8 variables:
##  $ Title           : Factor w/ 1 level "Rogue One - A Star Wars story": 1
##  $ genre           : Factor w/ 1 level "Action & Adventure": 1
##  $ runtime         : int 133
##  $ imdb_rating     : num 7.8
##  $ critics_score   : int 65
##  $ audience_rating : Factor w/ 1 level "Spilled": 1
##  $ best_pic_nom    : Factor w/ 1 level "no": 1
##  $ best_actress_win: Factor w/ 1 level "no": 1

using this data I would be predicting an audience score using the predict function

predict(fit2, newdata = movie1, interval = "predict")

##        fit      lwr     upr
## 1 63.16992 49.54014 76.7997

This gives us an audience score of 63 with a lower interval of 49.5 and an upper interval of 76.8

Part 6: Conclusion

In conclusion we can say that using 7 regressors, namely the following 7

genre,runtime,imdb_rating,critics_score, audience_rating,best_pic_nom,best_actress_win

we can build a linear model which explains 89% of the variance of the audience score.

We can also use this to predict the results.

Since some of this variables can only be understood after the movie is released, this model might not be a great method to understand what kind of movie to make before making it. Of the regressors in this model, only 2 “Genre” and “Runtime” are something that can be used before making the movie. For deciding which kind of movie to be made nd which would be popular, we should ideally be building a model with only those regressors where data is available before making the movie.