IMDB Data Analysis & Prediction

Pavan Singh

2022-05-13

Preliminary Discussion

A movie is defined (as you would expect) as a work of visual art that simulates experiences and otherwise communicates ideas, stories, perceptions, feelings, beauty, or atmosphere through the use of moving images. These images are generally accompanied by sound and, more rarely, other sensory stimulation. The movie industry has grown substantially over the past two decades, where films can spend millions of dollars for production, and earn even more in gross profits. They have also dictated and held a strong influence on the growing culture of the world.

Launched online in 1990 and a subsidiary of Amazon.com since 1998, IMDb is the world’s most popular and authoritative source for movie, TV and celebrity content, designed to help fans explore the world of movies and shows and decide what to watch. Rotten Tomatoes is an American review-aggregation website for film and television.

Aim and Project Goal

This micro project involves analyzing a movie data set that contains information from Rotten Tomatoes and IMDB for a random sample of movies. The data set was taken and provided on Kaggle, and it is also used in the Coursera course “Linear Regression and Modelling” provided by Duke University. The data-set was originally used as a project for the course.

This data set serves as a basis to apply simple and multiple linear regression models. Essentially, these are models which allow you to assess the relationship between variables in a data set and a continuous response variable. We shall use these regression methods to understand what attributes make a movie popular, whilst learning something new about movies and how they are rated by viewers and critics alike. Specifically we shall be seeking to evaluate the relationship between audience perception of a movies rating (audience_score) given a set of predictors. We shall use this relationship to evaluate the audience score for movies that the data is not trained on, to assess accuracy.

Data Description

The data set,movies, is comprised of 32 variables and 651 randomly sampled movies produced and released before 2016. Most of the variables from the data set are provided for informational purposes and do not make sense to include in a statistical analysis of this aim and goal. We decided on a specific selection of variables to include in our analysis and which maybe are meaningful and which should be omitted. For example information in the the actor1 through actor5 variables was used to determine whether the movie casts an actor or actress who won a best actor or actress Oscar. An extract of the variables chosen and their relevant descriptions are given below in the Table.

Description of variables in the movie data set.
Variable	Description
title	Title of the movie
title_type	Type of movie (Documentary, Feature Film, TV Movie)
genre	Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime	Run-time of movie (in minutes)
imdb_rating	Rating on IMDB
critics_rating	Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
best_pic_nom	Whether or not the movie was nominated for a best picture Oscar (no, yes)
actor1	First main actor/actress in the abridged cast of the movie

The full list of variables in the data set and their descriptions can be found at the end of this paper.

Setup

We read in the data used for this project as well as load packages to be used in this project.

load("movies.Rdata") #data contained in file

library(ggplot2) #graphics
library(dplyr) #manipulation
library(statsr)
library(gridExtra)
library(ggpubr)
library(corrplot)
library(GGally) #for correlation matrix
library(plotly) #interactive plots
library(DT) #interactive tables

Research Question

The data has contains many variables, which can lead to many questions of interest or expected associations between variables. Specifically we shall start by exploring what attributes make a movie popular. That is, whether a movie’s popularity, audience_score, is related to the the type of movie, genre, runtime, IMDb rating, IMDb number of votes, critics rating, critics score, audience rating. By modelling this, we can observe any apparent relationships and associations between these variables and a movies [popularity as well as be able to predict a movie’s popularity. This can also help in identify possible roots of multicollinearity - allowing us to adjust our modelling approach. Essentially we begin by getting a thorough understanding of our data. From here we shall move onto modelling, where we seek to build the best model fit to the data. After this we briefly interpret the model, conduct diagnostics and look at making a prediction. We conclude the paper with a brief discussion of the project and further applications.

Exploratory Data Analysis

The research question defined only concerns itself with a select subset of the variables in the data set. As such, we take a subset of the original data frame (the original data) and conduct EDA (exploratory data analysis) only with the reduced data frame that include only the variables of presumable concern. We note that this is not usually what should happen in a statistical analysis, however we are entering this project with reasonable amount of knowledge with respect to the variables of importance. Moving on, these specific variables included in our subset for the start of our analysis include: title_type, genre, runtime, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_win, best_actor_win, best_actress_win, best_dir_win.

#Create Sub-set
movie_data <- movies %>% select(title, title_type, genre, runtime, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_win, best_actor_win, best_actress_win, best_dir_win)
attach(movie_data)

Looking at the structure of the resulting subset.

#Structure
str(movie_data)

## tibble [651 × 14] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...

All variables appear to be the correct type and is appropriate for conducting further analysis. Our new data set has 651 observations (movies) and 14 variables associated with the movies. Below we look at the summary statistics for the specific variables in our analysis.

#Summary 
summary(movie_data[,c("runtime", "imdb_rating", "imdb_num_votes", "critics_score", "audience_score")])

##     runtime       imdb_rating    imdb_num_votes   critics_score   
##  Min.   : 39.0   Min.   :1.900   Min.   :   180   Min.   :  1.00  
##  1st Qu.: 92.0   1st Qu.:5.900   1st Qu.:  4546   1st Qu.: 33.00  
##  Median :103.0   Median :6.600   Median : 15116   Median : 61.00  
##  Mean   :105.8   Mean   :6.493   Mean   : 57533   Mean   : 57.69  
##  3rd Qu.:115.8   3rd Qu.:7.300   3rd Qu.: 58300   3rd Qu.: 83.00  
##  Max.   :267.0   Max.   :9.000   Max.   :893008   Max.   :100.00  
##  NA's   :1                                                        
##  audience_score 
##  Min.   :11.00  
##  1st Qu.:46.00  
##  Median :65.00  
##  Mean   :62.36  
##  3rd Qu.:80.00  
##  Max.   :97.00  
##

We observe that one movie has no runtime provided. We removed that movie (The End of America) from the data set.

# Remove NA's
movie_data[which(is.na(movie_data$runtime)),"title"] #The End of America
movie_data[which(title == "The End of America"),]
movie_data <- movie_data[-which(title == "The End of America"),]

We split our data into training and testing, so as to assess the performance of our model at the end. That is, to use the model to predict a movie’s audience score. So this movie should not be part of the data. Our test set, as such, will have only one row (one movie).

#Test and Training Data sets
set.seed(2022)
split <- sample(1:nrow(movie_data), size = 1)
test <- movie_data[split, ]
train <- movie_data[-split, ]
list("train" = dim(train), "test" = dim(test))

## $train
## [1] 649  14
## 
## $test
## [1]  1 14

An extract of the training data set is shown below. We used \(datatable()\) from DT package, which supports interactive tables for filtering, pagination, and sorting.

library(DT) #interactive tables
datatable(movie_data[,1:8], options = list(pageLength = 5))

Again, this is an observational study which uses a random sample of movies - there is no random assignment - so we are able to make only generalisable inferences (not causal). We shall now explore our data set. First, by looking at some of the univariate distributions of the variables in the data set.

#Type
a1 <- ggplot(movie_data) +
 aes(x = title_type) +
 geom_bar(fill = "chocolate4") + xlab("") + theme_minimal()

#Genre
a2 <- ggplot(movie_data) +
 aes(x = genre) +
 geom_bar(fill = "darkorange1") +
 labs(x = " ", y = "Count") +
 coord_flip() +
 theme_minimal()

#Runtime
a3 <- ggplot(movie_data) +
 aes(x = runtime) +
 geom_density(adjust = 1L, fill = "darkgoldenrod") +
 labs(x = "Runtime", 
 y = "Density") +
 theme_minimal()

ggarrange(a2, ggarrange(a3, a1, ncol = 2), nrow=2)

Looking at the distribution of genre and type of movies (top and bottom right plots) ,we see that the majority of movies in our sample are part of the genre Drama. The least abundant genre types appear to be Science Fiction and Animation, as well as Musical Arts. With respect to type of movies, we see that feature films make up the majority of the films in our sample, and we have very few TV movie type of films (only 5). Looking at the distribution of runtime of the movies, we observe that the majority of the movies in our data set have a runtime between 75mins and 125mins. The longest movie in our data set is Hotel Terminus: The Life and Times of Klaus Barbie (267mins), while the shortest movie is Africa: The Serengeti (39mins). The mean runtime is around 105mins (as indicated from summary in earlier analysis). Most movies in the data are in the “Feature Film” title type and majority of the movies are drama. Therefore, we must be aware that the results could be biased toward drama movies.

Now exploring at some of the distributions for the ratings, we look at audience score as well as imdb_rating and critics_score.

#Audience Score
a2 <- ggplot(movie_data) +
 aes(x = audience_score) +
 geom_density(adjust = 1L, fill = "#112446") +
 labs(x = "Audience Score", 
 y = "Density") +
 theme_minimal()

#IMDB Rating
a3 <- ggplot(movie_data) +
 aes(x = imdb_rating) +
 geom_density(adjust = 1L, fill = "#112446") +
 labs(x = "IMDB Rating", 
 y = "Density") +
 theme_minimal()

#Critics Score
a4 <- ggplot(movie_data) +
 aes(x = critics_score) +
 geom_density(adjust = 1L, fill = "#112446") +
 labs(x = "Critics Score", 
 y = "Density") +
 theme_minimal()

ggarrange(a1, a2, a3, a4, nrow = 2, ncol = 2)

From these density plots above, we can identify several characteristics of our data. The majority of the critic and audience scores appear to be high (over 75). Also the majority of movies in our data have a high IMDb rating around 7. Audience score and critic score density plots can be considered as relatively left skew given the shape of the distributions. Interestingly, it looks like critics score and audience score density plots have similar shape and have peaks in roughly the same regions (scores are similar). We can further investigate this relationship by looking at a correlation plot. We shall do that later on in this section.

Looking specifically at the audience score, since it is our response variable, we can highlight some key features of the variable and its distribution.

summary(audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

The median of our response variable - audience score distribution is 65; 25% of the movie in the training set have an audience score higher than 80; 25% of the movie in the training set have an audience score lower than 46; very few movie have an audience score lower than 20 or higher than 90. This essentially suggests that the audience, given our data, are unlike to give very low or very high score. Lets show this on our density plot for the audience score.

# Define a "histogram" function
distribution<-function(var){
  
  ggplot(movie_data, aes(x=var)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")+
  geom_vline(aes(xintercept=mean(var)),
             color="blue", linetype="dashed", size=1)+
  geom_vline(aes(xintercept=median(var)),
             color="red", linetype="dashed", size=1)+
  ylab("Density")+
  labs(caption="Note: Red/Blue vertical line marks the median/mean value.
       The histogram is overlaid with the density plot.")+
  theme_minimal()
}

# Draw imdb_rating distribution
distribution(movie_data$audience_score)+xlab('Audience Score') + ggtitle("Audience Score Distribution Superimposed with Summary Statistics")

We can also investigate the correlations and show simultaneously the distributions of the variety of variables. We shall do this below using ggpair() function in R.

ggpairs(movie_data[,c("runtime", "imdb_rating", "imdb_num_votes", "critics_score", "audience_score")])

We see that critics_score and imdb_rating have a very high correlation of 0.765. Variables audience_score and imdb_rating also have a very strong positive association. this was picked up on by examining the similar shape and peaks of their density plots in prior analysis in this section - it is also shown by their density plots in the output above. The correlations can also be visualized by using a correlation heat map (see below).

# Check correlation between variables
corr <- cor(movie_data[,c("runtime", "imdb_rating", "imdb_num_votes", "critics_score", "audience_score")]) 
ggcorr(movie_data[,c("runtime", "imdb_rating", "imdb_num_votes", "critics_score", "audience_score")], method = c("everything", "pearson"))

It can be seen here that imdb_rating, audience_score, and critics_score are all strongly correlated. Therefore, including more than one of these variables would introduce multicollinearity into the model, which we wish to avoid. Additionally, since the correlation values for these variables are all high (\(> 0.7\)), including two, or all three, of these variables would not significantly improve the predicting power of the model compared compared to using just one. One of them will be used for the model - decided to use imdb_rating.

To see which critics rating does tend to have higher audience score, we used the following scatter plot. As the plot depicts, we see that movies generally with higher critic scores also have high audience scores. This is exemplary of the findings from the correlation heat maps.

qplot(critics_score,audience_score,colour=critics_score, data=movie_data)

Lets take a loom at the the IMDb data abase and the data we have pulled from there. Specifically, the number of votes and ratings on the IMDb page for movies.

#Scatter for Votes vs Ratings with Titles
plot_ly(movie_data, x = imdb_num_votes, y = imdb_rating, text = paste("Titles: ", title),
        mode = "markers", color = genre, size = imdb_num_votes) %>%
  layout(title = 'Ratings vs Number Votes (IMDb)', plot_bgcolor = "#e5ecf6", xaxis = list(title = 'IMDb Number of Votes'), 
         yaxis = list(title = 'IMDb Rating'), legend = list(title=list(text='<b> Genre of Movies </b>')))

The plotly figure, indicates that only a few movies have over 400 000 votes. It also appears to be a non-linear relationship between the two variables. Very few movies all in this data set) which hare rated low have high voter counts. Movies rated higher appear to have a large range of votes. The highest rated movie - The Godfather Part II - does not have the highest number of votes in IMDb data base; that title belongs to Django Unchained (8.5 rating) with close to 900 000 votes on IMDb platform. Interestingly, but also understandably, that the lowest rated movies have a significantly higher number of votes than many of the other movies in the data base which are rated higher.

The tendency for movies with large number of votes to also be movies which are rated highly, is a consequence of general consumer behavior; where viewed are most likely to watch a movie given that it has a high rating.

To see what genre of movies has higher audience score and their distribution, we utilized the following graph. As the plot demonstrates, documentary films, tend to have higher audience score.

qplot(audience_score,colour=genre, data=movie_data, geom="density")

This density distribution plot for the movie genres, shows that Horror movies generally have a low audience score. A large proportion of documentaries have a high rating by the audience.

We now seek to explore the ratings from the critics and audience separately. We look at the value counts of movies categorized across each rating system. The visualizations are interactive, to allow for exploration of the exact value count.

#Genre and Rating by Critics
p <- ggplot(data = movie_data, aes(x = genre, fill = critics_rating)) +
            geom_bar(position = "dodge")
ggplotly(p)

In most of the m,ovie genres we see that there are far greater rotten rated movies than fresh or certified fresh. Te exceptions to this appear to be documentaries and musicals. Comedy and Action movies have the greatest disparity, with far greater proportion of movies being rated as rotten. Documentaries have a significant proportion of movies classified as fresh and certified fresh, and a distinctly low proportion of rotten movies.

In next step, the movies were categorized based on the critics rating. As the plot shows, Certified Fresh and Fresh, tend to have a higher audience score.

qplot(critics_score,audience_score,colour=critics_rating, data=movie_data)

There are 3 clear categorizes identified in the plot - or clusters rather. We can see that cluster of red and green points at the top right of plot seems to have a lot of overlap. Interestingly, some movies rated very high by the audience are in fact rated rotten by critics - there are many e examples of this in the data. And in fact, some movies which are related below 30 by audience are in fact rated as fresh by critics.

We now look at the audience rating for movies, which can either be classified as spilled (bad) or upright (good).

#Genre and Rating by Audience
p <- ggplot(data = movie_data, aes(x = genre, fill = audience_rating)) +
            geom_bar(position = "dodge")
ggplotly(p)

The results are some what analogous to those of the rotten tomatoes - action and comedy have a distinctly greater proportion of spilled movies than other genres. We also see that Drama has a greater proportion of upright movies than spilled.

We analyzed the relationship between audience_score and critics_score using a scatter plot and movies colored with regards to their genre to see which genre will get higher critics score so we can assess the relationship between critics score and audience score further. As we can see, movies in comedy, documentary, and drama get a higher critics score.

qplot(critics_score,audience_score,colour=genre, data=movie_data)

This concludes our detailed look into the data and the various ratings by audience and critics. We shall now look to model audience score through several possible predictors, in order to establish a relationship.

Modelling

The procedure for selection involves the forward selection framework - here, we start with an empty model, then add variables one at a time until a parsimonious model is reached.

forward selection: starts with no predictors in the model, iteratively adds the most contributive predictors, and stops when the improvement is no longer statistically significant

We first run the full model - model with all predictors, to first identify the most contributive predictors (i.i,e lowest \(p\)-value). A summary is given below.

full_model <- lm(audience_score~imdb_rating+title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=train)
sum <- summary(full_model)
sum$coefficients

##                                     Estimate   Std. Error     t value
## (Intercept)                    -9.992662e+00 4.057269e+00 -2.46290323
## imdb_rating                     9.556148e+00 4.286621e-01 22.29296403
## title_typeFeature Film          2.227635e+00 2.544750e+00  0.87538464
## title_typeTV Movie              6.930186e-01 4.031839e+00  0.17188645
## genreAnimation                  3.225081e+00 2.476956e+00  1.30203386
## genreArt House & International -2.569402e+00 2.061416e+00 -1.24642610
## genreComedy                     1.518818e+00 1.142713e+00  1.32913324
## genreDocumentary                2.487119e+00 2.725212e+00  0.91263337
## genreDrama                     -6.163990e-01 9.919925e-01 -0.62137465
## genreHorror                    -1.886847e+00 1.686221e+00 -1.11897942
## genreMusical & Performing Arts  3.559602e+00 2.358877e+00  1.50902404
## genreMystery & Suspense        -3.165923e+00 1.268556e+00 -2.49569015
## genreOther                      2.637299e-01 1.952119e+00  0.13509927
## genreScience Fiction & Fantasy -2.326789e-01 2.459616e+00 -0.09459968
## runtime                        -2.517436e-02 1.664940e-02 -1.51202798
## imdb_num_votes                  2.879118e-06 3.081575e-06  0.93430072
## critics_ratingFresh             1.462643e-02 8.454173e-01  0.01730083
## critics_ratingRotten           -1.198396e+00 9.321282e-01 -1.28565588
## audience_ratingUpright          2.007945e+01 7.905661e-01 25.39882498
## best_pic_winyes                 4.366316e-01 2.919313e+00  0.14956659
## best_actor_winyes               2.541833e-01 8.125383e-01  0.31282626
## best_actress_winyes            -1.037288e+00 9.022669e-01 -1.14964627
## best_dir_winyes                 1.429280e-01 1.186872e+00  0.12042417
##                                    Pr(>|t|)
## (Intercept)                    1.404979e-02
## imdb_rating                    1.743445e-81
## title_typeFeature Film         3.817004e-01
## title_typeTV Movie             8.635824e-01
## genreAnimation                 1.933837e-01
## genreArt House & International 2.130742e-01
## genreComedy                    1.842882e-01
## genreDocumentary               3.617867e-01
## genreDrama                     5.345793e-01
## genreHorror                    2.635781e-01
## genreMusical & Performing Arts 1.317972e-01
## genreMystery & Suspense        1.282752e-02
## genreOther                     8.925768e-01
## genreScience Fiction & Fantasy 9.246631e-01
## runtime                        1.310314e-01
## imdb_num_votes                 3.505090e-01
## critics_ratingFresh            9.862021e-01
## critics_ratingRotten           1.990386e-01
## audience_ratingUpright         2.341900e-98
## best_pic_winyes                8.811547e-01
## best_actor_winyes              7.545169e-01
## best_actress_winyes            2.507286e-01
## best_dir_winyes                9.041858e-01

From the following full model, we can see that IMDb rating has one of the lowest \(p\)-value and is the most correlated variable to our response variable. So we choose imdb_rating as the first predictor. next, we shall add this predictor to our model and assess the \(p\)-value and R-squared. The output of the model with predictors being only imdb_rating is given below.

fit1 <- lm(audience_score ~ imdb_rating, data=train)
summary(fit1)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.802  -6.542   0.648   5.684  52.902 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.3449     2.4234  -17.47   <2e-16 ***
## imdb_rating  16.1267     0.3683   43.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.17 on 647 degrees of freedom
## Multiple R-squared:  0.7477, Adjusted R-squared:  0.7473 
## F-statistic:  1917 on 1 and 647 DF,  p-value: < 2.2e-16

The 0.75 R-squared and almost zero \(p\)-value indicate that imdb_rating is a statistically significant predictor of audience score. In order to find out the second predictor, we shall look at the full model again but without the imdb_rating variable.

fit_model <- lm(audience_score~title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=train)
sum <- summary(fit_model)
sum$coefficients

##                                     Estimate   Std. Error     t value
## (Intercept)                     4.487020e+01 4.316893e+00 10.39409447
## title_typeFeature Film         -8.282989e-01 3.400678e+00 -0.24356873
## title_typeTV Movie             -6.013707e+00 5.380744e+00 -1.11763472
## genreAnimation                  9.782635e-02 3.309575e+00  0.02955858
## genreArt House & International  5.538741e-01 2.752401e+00  0.20123303
## genreComedy                     3.380287e-01 1.527643e+00  0.22127469
## genreDocumentary                8.081179e+00 3.631640e+00  2.22521470
## genreDrama                      1.627388e+00 1.320726e+00  1.23219161
## genreHorror                    -7.827174e-01 2.255686e+00 -0.34699755
## genreMusical & Performing Arts  8.018663e+00 3.145501e+00  2.54924847
## genreMystery & Suspense        -4.161508e-01 1.689657e+00 -0.24629312
## genreOther                      5.405908e-01 2.612456e+00  0.20692816
## genreScience Fiction & Fantasy -3.494838e+00 3.285859e+00 -1.06359976
## runtime                         2.287556e-02 2.209430e-02  1.03536050
## imdb_num_votes                  1.835107e-05 4.018099e-06  4.56710272
## critics_ratingFresh            -4.107426e-01 1.131129e+00 -0.36312637
## critics_ratingRotten           -7.258311e+00 1.193238e+00 -6.08286824
## audience_ratingUpright          2.897264e+01 9.134331e-01 31.71840989
## best_pic_winyes                -1.217142e+00 3.905637e+00 -0.31163718
## best_actor_winyes               8.500891e-01 1.086827e+00  0.78217555
## best_actress_winyes            -5.916370e-01 1.207202e+00 -0.49008957
## best_dir_winyes                 1.462172e+00 1.586407e+00  0.92168736
##                                     Pr(>|t|)
## (Intercept)                     1.866314e-23
## title_typeFeature Film          8.076445e-01
## title_typeTV Movie              2.641511e-01
## genreAnimation                  9.764285e-01
## genreArt House & International  8.405816e-01
## genreComedy                     8.249506e-01
## genreDocumentary                2.642180e-02
## genreDrama                      2.183394e-01
## genreHorror                     7.287096e-01
## genreMusical & Performing Arts  1.103252e-02
## genreMystery & Suspense         8.055359e-01
## genreOther                      8.361331e-01
## genreScience Fiction & Fantasy  2.879195e-01
## runtime                         3.008997e-01
## imdb_num_votes                  5.954800e-06
## critics_ratingFresh             7.166329e-01
## critics_ratingRotten            2.053639e-09
## audience_ratingUpright         1.884960e-132
## best_pic_winyes                 7.554199e-01
## best_actor_winyes               4.344067e-01
## best_actress_winyes             6.242420e-01
## best_dir_winyes                 3.570462e-01

We add audience rating as the second predictor because of the lowest \(p\)-value. Similar to previous step, we shall add this predictor along with our initial predictor (imdb_rating) and assess the impact - whether or not the additional variable improves fit, using the anova() function.

fit2 <- lm(audience_score ~ imdb_rating + audience_rating, data=train)
summary(fit2)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1531  -4.7727   0.6087   4.3696  24.3247 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -11.5166     2.0081  -5.735  1.5e-08 ***
## imdb_rating              9.5244     0.3499  27.217  < 2e-16 ***
## audience_ratingUpright  20.8564     0.7676  27.169  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.954 on 646 degrees of freedom
## Multiple R-squared:  0.8822, Adjusted R-squared:  0.8819 
## F-statistic:  2420 on 2 and 646 DF,  p-value: < 2.2e-16

The models’ R-squared and Adjusted R-Squared both increased significantly, the almost zero \(p\)-value indicate that audience rating is another statistically significant predictor of audience score.

anova(fit1, fit2)

## Analysis of Variance Table
## 
## Model 1: audience_score ~ imdb_rating
## Model 2: audience_score ~ imdb_rating + audience_rating
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    647 66931                                  
## 2    646 31237  1     35694 738.17 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the first model, with one predictor and the second model with two, we see that the additional predictor improves model fit substantially. We now proceed with our forward selection for variables in this modelling stage. We apply the same techniques as discussed above - remove predictors from full model to see next largest contributing variable, add this variable to current model and see if the model fit is improved with this additional variable.

A detailed look at the process and stages used in this forward selection framework is summarised below:

Added critics rating to the model but the Adjust R-squared only increased from 0.8817 to 0.8819, the p value is insignificant at 0.61896 and 0.10116. Therefore, we will not include critics rating as a predictor.
Added imdb_num_votes to the model but the Adjust R-squared decreased from 0.8817 to 0.8815 and the p value is not significant at 0.734. So, we will not include imdb_num_votes to the model.
Added genre to the model and the Adjust R-squared increased from 0.8817 to 0.8847, the amount variance it explains at 0.8868 versus 0.8812 without. From the anova analysis we can see that the p value is significant at 0.0033.
It is obvious that title type, runtime, best_pic_win, best_actor_win, best_actress_win, best_dir_win are not significant predictors, therefore, they will not be included in the model.

Therefore, we decide to add genre as one of the predictors. So, we arrived at our final model - parsimonious model, with three predictors: imdb_rating, audience rating and genre. We shall show the output of the summary function for our final model below.

fit.final <- lm(audience_score ~ imdb_rating + audience_rating + genre, data=train)
summary(fit.final)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating + 
##     genre, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6393  -4.4364   0.5833   4.2991  25.0850 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.5619     2.1972  -5.717 1.67e-08 ***
## imdb_rating                      9.8031     0.3692  26.551  < 2e-16 ***
## audience_ratingUpright          20.3175     0.7752  26.210  < 2e-16 ***
## genreAnimation                   3.6230     2.4531   1.477  0.14020    
## genreArt House & International  -2.7912     2.0335  -1.373  0.17037    
## genreComedy                      1.5109     1.1278   1.340  0.18082    
## genreDocumentary                 0.5774     1.3774   0.419  0.67523    
## genreDrama                      -0.8340     0.9596  -0.869  0.38515    
## genreHorror                     -1.6199     1.6705  -0.970  0.33257    
## genreMusical & Performing Arts   2.5415     2.1916   1.160  0.24663    
## genreMystery & Suspense         -3.2745     1.2472  -2.626  0.00886 ** 
## genreOther                       0.2743     1.9266   0.142  0.88684    
## genreScience Fiction & Fantasy   0.2560     2.4425   0.105  0.91657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.865 on 636 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8849 
## F-statistic:   416 on 12 and 636 DF,  p-value: < 2.2e-16

We also shall investigate the ANOVA for the model we have built. We see all the F-values for the 3 variables. All 3 variables are significant.

anova(fit.final)

## Analysis of Variance Table
## 
## Response: audience_score
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## imdb_rating       1 198327  198327 4207.8961 < 2.2e-16 ***
## audience_rating   1  35694   35694  757.3159 < 2.2e-16 ***
## genre            10   1261     126    2.6758  0.003268 ** 
## Residuals       636  29976      47                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We also implemented step wise regression. The results are illustrated below:

library(leaps) #for computing stepwise regression

The stepAIC() function from MASS package chooses the best model by AIC. It has an option named direction, which can take the following values: i) “both” (for stepwise regression, both forward and backward selection); “backward” (for backward selection) and “forward” (for forward selection). It return the best final model.

library(MASS)
# Fit the full model 
full.model <- lm(audience_score~imdb_rating+title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=train)

# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", 
                      trace = FALSE)

summary(step.model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + genre + runtime + 
##     audience_rating, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5131  -4.5898   0.7554   4.4080  25.1882 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -10.93067    2.44673  -4.467 9.37e-06 ***
## imdb_rating                      9.93488    0.37904  26.210  < 2e-16 ***
## genreAnimation                   3.26860    2.46190   1.328   0.1848    
## genreArt House & International  -2.89262    2.03261  -1.423   0.1552    
## genreComedy                      1.38076    1.12997   1.222   0.2222    
## genreDocumentary                 0.22921    1.39523   0.164   0.8696    
## genreDrama                      -0.74742    0.96040  -0.778   0.4367    
## genreHorror                     -1.87504    1.67741  -1.118   0.2641    
## genreMusical & Performing Arts   2.63048    2.19021   1.201   0.2302    
## genreMystery & Suspense         -3.19328    1.24707  -2.561   0.0107 *  
## genreOther                       0.37732    1.92587   0.196   0.8447    
## genreScience Fiction & Fantasy   0.21642    2.44020   0.089   0.9294    
## runtime                         -0.02305    0.01528  -1.509   0.1318    
## audience_ratingUpright          20.25218    0.77562  26.111  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.858 on 635 degrees of freedom
## Multiple R-squared:  0.8874, Adjusted R-squared:  0.8851 
## F-statistic: 384.9 on 13 and 635 DF,  p-value: < 2.2e-16

anova(fit.final, step.model)

## Analysis of Variance Table
## 
## Model 1: audience_score ~ imdb_rating + audience_rating + genre
## Model 2: audience_score ~ imdb_rating + genre + runtime + audience_rating
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    636 29976                           
## 2    635 29869  1    107.12 2.2772 0.1318

We compared our final model from forward selection to the final model from stepwise selection using anova() function. We see that the additional variables not important predictors. Choose simpler model - forward selection model.

Model Interpretation

summary(fit.final)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating + 
##     genre, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6393  -4.4364   0.5833   4.2991  25.0850 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.5619     2.1972  -5.717 1.67e-08 ***
## imdb_rating                      9.8031     0.3692  26.551  < 2e-16 ***
## audience_ratingUpright          20.3175     0.7752  26.210  < 2e-16 ***
## genreAnimation                   3.6230     2.4531   1.477  0.14020    
## genreArt House & International  -2.7912     2.0335  -1.373  0.17037    
## genreComedy                      1.5109     1.1278   1.340  0.18082    
## genreDocumentary                 0.5774     1.3774   0.419  0.67523    
## genreDrama                      -0.8340     0.9596  -0.869  0.38515    
## genreHorror                     -1.6199     1.6705  -0.970  0.33257    
## genreMusical & Performing Arts   2.5415     2.1916   1.160  0.24663    
## genreMystery & Suspense         -3.2745     1.2472  -2.626  0.00886 ** 
## genreOther                       0.2743     1.9266   0.142  0.88684    
## genreScience Fiction & Fantasy   0.2560     2.4425   0.105  0.91657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.865 on 636 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8849 
## F-statistic:   416 on 12 and 636 DF,  p-value: < 2.2e-16

Intercept(-12.5142) is the estimated audience score for a movie with imdb_rating, audience_rating and genre at zero. It does not provide any meaningful interpretation here. imdb_rating coefficient(9.7844): All else hold constant, for every one unit increase in imdb_rating, the model predicts a 9.7844 increase in audience_score on average. audience_ratingUpright coefficient(20.3246): All else hold constant, the model predicts rating Upright movie is 20.3246 higher in audience score on average than rating Spilled movie. genreAnimation coefficient(3.6812): The model predicts that Animation films get an audience score that is 3.6812 higher than Action & Adventure(reference category) films on average after controlling for imdb_rating and audience rating. genreArt House & International coefficient(-2.7199): The model predicts that Art House & International films get an audience score that is 2.7199 lower than Action & Adventure films on average after controlling for imdb_rating and audience rating. There are total 11 genre categories in the data set, the audience score can higher or lower than Action & Adventure films depends on what genre is selected. R-Squared(0.8847): 88.47% of the variability in audience score can be explained by the model.

Model Diagnostics

For the model to provide valid results, there are certain conditions that need to be met. These are:

Linear relationships between (numerical) x and y
Nearly normal residuals with mean 0
Constant variability of residuals
Independent residuals

The condition for nearly normal residuals can be checked using histogram and/or Q-Q plot of the residuals:

the mean of the residuals can be seen to equal zero. However, the histogram shows that the residuals are somewhat right skewed, due to a small proportion of outliers. The Q-Q plot also indicates an almost normal distribution, with some deviation from normality occurring on the right hand side, indicating again that the data is right skewed. As both plots show a nearly normal distribution, with the skew that is occurs appearing to be for more extreme values, we can be satisfied condition 2 has been met.

par(mfrow = c(2,2))
plot(fit.final)

g.res1 <- ggplot(data = fit3, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

g.res2 <- ggplot(data = fit3, aes(x = .resid)) +
  geom_histogram(binwidth = 1, fill='whit  e', color='black') +
  xlab("Residuals")


g.res3 <- ggplot(data = fit3, aes(sample = .resid)) +
  stat_qq()

ggarrange(g.res1, ggarrange(g.res2, g.res3, nrow = 1, ncol = 2), nrow = 2)

Constant variance of residuals condition met, No fan shape in residuals plot.

The residuals are nearly symmetric, hence it would be appropriate to deem the the normal distribution of residuals condition met.

Prediction

Using our final model (fit.final) to predict the audience score for the movie in the test set - Nostalgia for the Light. First we create a new data frame for this movie.

library(dplyr)
newmovie <- test[, c("genre", "imdb_rating", "audience_rating")]
predict(fit.final, newmovie)

##        1 
## 82.83659

The model predicts movie Aliens in the test set will have an audience score at approximate 90.

predict(fit.final, newmovie, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 82.83659 69.22095 96.45223

Our model predicts, with 95% confidence, that the movie Aliens is expected to have an audience score between 76.34 and 103.65

Our prediction interval contains this value. Is very close.

Conclusion

Our model demonstrates that it is possible to predict a movie’s popularity, as measured by audience score with only three predictors - imdb score, audience rating and genre. Movie industries can use the similar methods when producing movies that are more likely to be liked by the target audience.

However, the potential shortcoming is that our model’s predictive power is limited because the sample data is not representative. Therefore, a larger number of observations to capture more variability in the population data in our testing data set is required to have a better measure of the model’s accuracy.

Further Work

An interesting expansion on this work would be to include directors and possibly actors into the modelling. We can look to see if specific actors in a film on average have an effect of a movies rating, measured by audiences score. Likewise we can see if specific directors are associated with increased scores etc.

Appendix

title: Title of the movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – not that this is not necessarily whether the actresses won an Oscar for their role in the given movie best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) – not that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie

References

Markdown template provided and taken from Juba

Data Sources: Rotten Tomatoes and IMDB APIs