Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

## Warning: package 'GGally' was built under R version 3.4.4

Load data

load("movies.Rdata")
# data sneak peak, right of the bat!
str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Part 1: Data

Movies data set contains data from a randomly sampled movies which were released before 2K16. Altough this is a random sample but it’s a very small one when compared to all the movies that were released before the year 2K16. But this data still can help create a generalised result set. There is no random assignment in the data set or in the parent database from where this data was pulled out of, since the data contains information about movies, their release dates and ratings on different platforms etc. So it would be safe to say that there is no causality.

And, when it comes to public opinion about something on a massive scale like the movies here for example, the ratings can be biased. There are movies based on novels, then we have documentaries the list is long. Ususally all the movies that are based on novels don’t do justice to the imagination of the author. For example, many people like Inferno(the movie), while those who have read Dan Brown’s real work have a completely different view about the movie, and as a result the public opinion is spilled. And this can be taken as the basis for a bias in the dataset. (I’ve watched Batman Vs Superman, 3 times in a theatre!, sorry if you hated it but I liked it!)

Part 2: Research question

It so often happens that the ratings a movie receives on Rotten Tomatoes(RT) and IMDB vary by a huge margin. For example, we have “A Law Abiding Citizen”, this movie scored 25 critics rating on RT while it received a score of 7.2 on IMDB. And there are many more instances on the same, the list is long, but for those who’s interests I’ve peaked, have a look yourself at the link below: $http://www.listal.com/list/imdb-rotten-tomatoes-disagree$

And while all this happens there are times when the ratings on both these sites align with each other.

So, the research question is, if there is a relation between the ratings a movie receives on RT vs those they receive on IMDB.

Part 3: Exploratory data analysis

To start with let’s have a quick look at the scores that the movies have received on IMDB vs those scored on Rotten Tomatoes.

ggplot(data = movies, aes(x = imdb_rating, fill = factor(movies$critics_score)))+
  geom_histogram(width = 0.05, stat = "count")+
  # facet_wrap(~genre)+
  xlab("ratings")+
  ylab("typefactor")+
  labs(fill = "title_type")

## Warning: Ignoring unknown parameters: binwidth, bins, pad

A movie that scores a 7+ on IMDB is usually one of those Hit movies of the year, but as can be seen from the histogram above, that while there is an understanding between the scores on RT and IMDB, there are still a lot of movies where the scores don’t tally at all.

Well, I guess this is a bit too messy to look at! Next let’s create something more…informative.

# cor(movies$critics_score,movies$imdb_rating)
# # 0.7650355

plot(movies$imdb_rating,movies$critics_score, xlab = "IMDB_Scoreboard", ylab = "Critic'sReviews", cex = 2.0, pch = 20, las = 1, col =8)
abline(lm(movies$imdb_rating~movies$critics_score))

It’s kinda wierd, right! To have a straight line for $lm(ImdbRatings~and~CriticsScores)$. When trying to create a smooth line on the distrubution here,

plot(movies$imdb_rating,movies$critics_score, xlab = "IMDB_Scoreboard", ylab = "Critic'sReviews", cex = 2.0, pch = 20, las = 1, col = 8)
abline(lm(movies$imdb_rating~movies$critics_score))
lines(smooth.spline(movies$imdb_rating,movies$critics_score), lty = 2, lwd = 5, col = 1)

We can see that, the relationship is not exactly linear here for IMDB scores and the scores recorded on RT.

IMDB rates movies on a scale of 10 while RT does that on a scale of 1 to 100. TO bring a balance let’s consider the $IMDB scores*10$ while taking the RT’s scores as is. Let’s summarise the average ratings received for all the movies in the Movies dataset. But before we do that, let’s first clean the data and remove all the NAs from the movie data set.

ratings <- movies%>%
  filter(!is.na(imdb_rating), !is.na(critics_rating), !is.na(critics_score), !is.na(audience_score), !is.na(audience_rating), !is.na(thtr_rel_year), !is.na(runtime), !is.na(title_type), !is.na(mpaa_rating), !is.na(imdb_num_votes), !is.na(best_pic_nom), !is.na(best_pic_win), !is.na(best_actor_win), !is.na(best_actress_win), !is.na(best_dir_win), !is.na(top200_box))

ratings%>%
    summarise(total = n(), mean_IMDB = mean(imdb_rating), mean_RT = mean(critics_score), mean_public = mean(audience_score))

## # A tibble: 1 x 4
##   total mean_IMDB  mean_RT mean_public
##   <int>     <dbl>    <dbl>       <dbl>
## 1   650  6.491538 57.65385    62.34769

Although the mean Audience ratings seems to be in agreement with the mean IMDB ratings, but there is a substantial gap between the means for IMDB and RottenTomatoes(RT).

Part 4: Modeling

In order to create a model with the highest $Adjusted~R-Square$ value, let’s start with the backward approach where in we’ll take all the relevant values for the model and then we drop them one by one to check for a difference in the values for $Adjusted~R-Sqaure$. We take the new model only if it results in giving us a value for $Adjusted~R-Square$ which is higher than our existing model. We then repeat this process until we arrive at a model after which we cannot see any furhter rise in the value for the $Adjusted~R-Square$.

The initial model is going to look something like:

m1 = lm(imdb_rating~critics_rating+critics_score+audience_score+audience_rating+thtr_rel_year+runtime+title_type+mpaa_rating+imdb_num_votes+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box+genre, data = movies)

summary(m1)$adj.r.squared

## [1] 0.8304156

Next, let’s remove the variables one by one creating a new model in each scenario and keep comparing the values for the $adj.R.squared$.

After a long elimination process, things boil down to the final model:

m63 = lm(imdb_rating~critics_rating+critics_score+audience_score+audience_rating+runtime+imdb_num_votes+genre, data = movies)
summary(m63)$adj.r.squared

## [1] 0.8318041

Next let’s check if this model is valid enough for predictions:

# condition1:
plot(m63$residuals~ratings$critics_score+ratings$audience_score+ratings$runtime+ratings$imdb_num_votes)

We are not able to meet the first condition for Movie Runtime and IMDB Number of Votes. Let’s try to refit this model; refining it further will yield:

mX1 = lm(imdb_rating~critics_rating+critics_score+audience_score+audience_rating+genre, data = ratings)
summary(mX1)$adj.r.squared

## [1] 0.8198888

The $Adj.R.Squared$ for this model is slightly lower than the previous model. Let’s see if this new model is valid enough for prediction purposes:

#condition1: linear relationship b/w x and y
plot(mX1$residuals~ratings$critics_score+ratings$audience_score)

Since the data seems to be randomly distributed in a given band;first condition is met

#condition2: nearly normal residuals
hist(mX1$residuals)

qqnorm(mX1$residuals)
qqline(mX1$residuals)

The hist. is slightly left skewed but it is okay..ish! The qqnorm graph shows a little deviation from the mean at the tail areas, but overall it would be safe to assume that the condition 2 is successfully met here.

#condition3: constant variability of residuals
plot(mX1$residuals~mX1$fitted)

plot(abs(mX1$residuals)~mX1$fitted)

The predicted vals.(fitted) are randomly scattered in a band with a constant width around the residuals..so this condition is met as well!

#condition 4: Independence of Observations!
plot(mX1$residuals)

Since there is no time series structure as such, all four conditions are met!

Next let’s look at its diagnostic plots:

plot(mX1)

This gives out four graphs:

Residuals vs Fitted plot: This graph tells us if the model is either good or bad. The good model data are simulated in a way that meets the regression assumptions very well, while the bad model data are not. There is no distinctive pattern in this plot, which showcases that the model spread is almost linear.

Normal QQ Plot: This plot shows if residuals are normally distributed. The residuals are quite normally distributed on this model which is a desired outcome.

Scale-Location Plot: It’s also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points. But here there is a slight deviation from the normal horizontal is the line is somewhat curved. Since the curve is smooth and it’s not steep in the slightest, it is safe to take it as a good plot for the model as well.

Residuals vs Leverage: This plot helps us to find influential cases (i.e., subjects) if any. Not all outliers are influential in linear regression analysis. Cook’s distance here helps us filter out the influential outliers; i.e., if an outlier ranges outside the red dashed line, it is an influential outlier and can have a great impact at the model outputs. But since there is none, the Residual spread is safe for evaluations.

Part 5: Prediction

Well, the model is well cooked and ready for making predictions now. Let’s try it out on the following two movies: >Spiderman Homecoming and, >Batman Vs Superman: Dawn of Justice

# All the values have been taken from Rotten Tomatoes and IMDB.
# PREDICTIONS!!
# Spiderman Homecoming
Spidey <- data.frame(critics_rating = "Certified Fresh", critics_score = 93, audience_score = 90, audience_rating = "Upright", genre = "Action & Adventure")
# Batman VS Superman: Dawn of Justice
Bats <- data.frame(critics_rating = "Rotten", critics_score = 24, audience_score = 63, audience_rating = "Spilled", genre = "Action & Adventure")
predict(mX1,Spidey, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 7.847308 6.931492 8.763124

predict(mX1,Bats, interval = "prediction", level = 0.95)

##        fit      lwr     upr
## 1 6.325965 5.409449 7.24248

109663 487155 The real IMDB score for Spiderman Homecoming is 8.0/10, while that for Batman Vs Superman is 6.8! The predicted values are at 7.84 for Spiderman and 6.32 for Batman vs Superman.

The confidence interval of 95 percent gives us a range on both movies which successfully captures the actual ratings on IMDB for both the movies. The interval model ranges from 6.9 to 8.76 for Spiderman Homecoming, this can be explained by the variability of the rest of the model factors and this goes the same for Dawn of Justice.

The model fitted values differ slightly for these movies when compared to the real-time ratings on IMDB, this might be due to various factors that were not introduced in the model, like, Franchise fandom, some like MCU over DCEU while some take it the other way round. Some are influenced by the critic ratings while some are geeky comic nerds who don’t leave a single detail out.

There are many other factors like these which, if included in the model might help predict values as close as 99% in the vicinity of the real value.

Part 6: Conclusion

The two movies selected here were both great hits. The sole reason behind choosing these two movies is that, one of them received great ratings on Rotten Tomatoes(Spiderman) whille the other received a Rotten core on Rotten Tomatoes(Batman vs Superman). We used the same model to predict the results on both the movies,

To say this on the basis of number of voters for each of these movies on IMDB, Batman vs Superman(vote count: 487155) was a bigger sensation than Spiderman(vote count: 109663).

Batman vs Superman earned close to $855M, while Spiderman fished out a max of $280M worldwide.

The model predicted the values for Spiderman real close to it’s acutal ratings on Imdb whilst the value differ by a margin value of 0.48 for Dawn of Justice.

Since the ratings on Rotten tomatoes were less for Batman Vs Superman, it did had an impact on the ratings for this movie on IMDB compared to Spiderman where the ratings are skyrocketing.