Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(kableExtra)
library(grid)
library(gridExtra)
library(scales)
library(GGally)
library(corrplot)

Load data

load("movies.Rdata")

Part 1: Data

According to my Boss who just acquired and entrusted me the dataset for obtaining valuable insights from it, the observations are about movies that were randomly sampled and includes information about these movies from Rotten Tomatoes and IMDB. The dataset is comprised of 651 randomly sampled movies produced and released before 2016.

Therefore, the results obtained can be generalized to all Hollywood movies. But, since this is just an observational study, no causation can be implied from the results.

The data set suffers from some biases since ratings and voting are done voluntarily by users on IMDB and Rotten tomatoes. Moreover, there is an imbalance of types of genres within the data set.

Part 2: Research question

What attributes makes a movies popular among the audience? Is there an association between a movies genre, MPAA rating, length of the movies, etc. and a movie’s popularity rating?

I am also interested in knowing which is a good metric for a assessing a movies popularity.

Part 3: Exploratory data analysis

First, let us simply eyeball the dataset structure.

str(movies)

## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

The movies dataset is made up of 651 observations and 32 variables. There are several variables that won’t be useful where we are creating a linear regression model.

Searching for NA values in the dataset.

table(is.na(movies))

## 
## FALSE  TRUE 
## 20751    81

There are about 81 NAs within the data set. We will take care of that right away.

The variables I prefer to choose are:

genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score.

Other variables were excluded as I believe movie ratings does not depend on the actor, actress, director etc. Also, variables such as the studio, DVD dates, whether the movies got any awards, etc as these happens way later when the movie is released on theater. Finally,URL links provides no use information too.

Exploring title type variable.

table(movies$title_type)

## 
##  Documentary Feature Film     TV Movie 
##           55          591            5

I have decided to filter only the observations with feature film since documentary and TV movie only makes up a total of 60 observations compared to 591 observations of feature film.

Creating the dataset we will be using:

film_df <- movies %>%
  na.omit() %>%
  filter(title_type == "Feature Film") %>%
  select(genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating, 
         imdb_num_votes, critics_rating, critics_score, audience_rating, 
         audience_score)

dim(film_df)

## [1] 573  10

Let us get some quick descriptive statistics of the data set.

summary(film_df)

##                 genre        runtime       mpaa_rating  thtr_rel_year 
##  Drama             :294   Min.   : 68.0   G      : 15   Min.   :1972  
##  Comedy            : 85   1st Qu.: 94.0   NC-17  :  1   1st Qu.:1990  
##  Action & Adventure: 62   Median :104.0   PG     :104   Median :1999  
##  Mystery & Suspense: 56   Mean   :106.9   PG-13  :128   Mean   :1998  
##  Horror            : 22   3rd Qu.:117.0   R      :309   3rd Qu.:2006  
##  Other             : 15   Max.   :202.0   Unrated: 16   Max.   :2014  
##  (Other)           : 39                                               
##   imdb_rating    imdb_num_votes           critics_rating critics_score  
##  Min.   :1.900   Min.   :   390   Certified Fresh:114    Min.   :  1.0  
##  1st Qu.:5.900   1st Qu.:  6552   Fresh          :167    1st Qu.: 31.0  
##  Median :6.500   Median : 18712   Rotten         :292    Median : 59.0  
##  Mean   :6.395   Mean   : 64347                          Mean   : 54.9  
##  3rd Qu.:7.100   3rd Qu.: 68429                          3rd Qu.: 79.0  
##  Max.   :9.000   Max.   :893008                          Max.   :100.0  
##                                                                         
##  audience_rating audience_score 
##  Spilled:263     Min.   :11.00  
##  Upright:310     1st Qu.:45.00  
##                  Median :63.00  
##                  Mean   :60.58  
##                  3rd Qu.:78.00  
##                  Max.   :97.00  
##

As seen above, we have four categorical variables and six numerical variables. Next, I will move on to visualizing the dataset using ggplot in order to get better insights from it.

Visualizing EDA

p1 <- ggplot(data = film_df, aes(x = genre, fill = genre)) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + 
  scale_fill_brewer(palette="Paired") + xlab("Film Genre") + ylab("Percentage (%)") + 
  coord_flip() 

p2 <- ggplot(data = film_df, aes(x = mpaa_rating, fill = mpaa_rating)) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Paired") + 
  xlab("Film MPAA Rating") + ylab("Percentage (%)") + coord_flip()

grid.arrange(p1, p2, nrow = 1, top = "Movie Genre & MPAA Rating")

From the above plots, almost 50% of the movie in the data set are of Drama genre. Therefore, we must keep in mind that the predictions made could be biased toward drama genre movies. Similarly, ‘R’ rated movies dominates by almost 50% compared to other movie ratings.

For the uninitiated, MPAA stands for ‘Motion Picture Association of America’. These are the meanings of each of the ratings:

Rated G: General audiences – All ages admitted.

Rated PG: Parental guidance suggested – Some material may not be suitable for children.

Rated PG-13: Parents strongly cautioned – Some material may be inappropriate for children under 13.

Rated R: Restricted – Under 17 requires accompanying parent or adult guardian.

Rated NC-17: Adults Only – No one 17 and under admitted.

Unrated are movies that were not rated by the ratings board of the MPAA.

p3 <- ggplot(data = film_df, aes(x = runtime)) + 
  geom_histogram(binwidth = 5, color="black", fill="grey", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Film Run Time (min)") + ylab("Density")

p4 <- ggplot(data = film_df, aes(x = thtr_rel_year)) + 
  geom_histogram(binwidth = 1, color="darkblue", fill="lightblue", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Film Theatre Release Year (year)") + ylab("Density")

grid.arrange(p3, p4, nrow = 2, top = "Movie Run Times and Release Year")

The median run time of movies are approximately around 100 min (1 hr 40 min). Very few movies are longer than two hours. Most of the movies in the dataset were released before the 2000s. The years seems to be almost uniformly distributed though from 1970 to 2014.

p5 <- ggplot(data = film_df, aes(x = imdb_rating)) + 
  geom_histogram(binwidth = 0.5, color="black", fill="yellow", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("IMDB Rating") + ylab("Density")


p6 <- ggplot(data = film_df, aes(x = imdb_num_votes)) + 
  geom_histogram(binwidth = 20000, color="black", fill="yellow", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("IMDB No. of Votes") + ylab("Density") + scale_y_continuous(labels = comma)

grid.arrange(p5, p6, nrow = 1, top = "IMDB Ratings and Number of Votes")

Average ratings for movies on IMDB are approximately 6.5, suggesting most of the movies are “Average”. The other plot shows an extreme right skew, suggesting a median number of votes of approximately 18000 numbers while the mean is approximately 65000 votes.

p7 <- ggplot(data = film_df, aes(x = critics_rating , fill = critics_rating )) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Dark2") + 
  xlab("Critics Rating") + ylab("Percentage (%)")


p8 <- ggplot(data = film_df, aes(x = critics_score)) + 
  geom_histogram(binwidth = 5, color="darkred", fill="pink", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Critics Score") + ylab("Density")

grid.arrange(p7, p8, nrow = 1, top = "Rotten Tomato's Critic's Rating Categories & and Critic's Score")

Almost 50% of the movies in the dataset have a ‘Rotten’ rating meaning they were below average movies. The critic’s score for the movies in the dataset follows almost a uniform distribution, suggesting there are all kinds of good and bad movies in the data set.

p9 <- ggplot(data = film_df, aes(x = audience_rating , fill = audience_rating )) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Dark2") + xlab("Audience Rating") + 
  ylab("Percentage (%)")


p10 <- ggplot(data = film_df, aes(x = audience_score)) + 
  geom_histogram(binwidth = 5, color="darkgreen", fill="lightgreen", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Audience Score") + ylab("Density")

grid.arrange(p9, p10, nrow = 1, top = "Rotten Tomatoe's Audience Rating Categories & Audience Score")

Audience rating signifies the percentage of users who have rated the movie or TV Show positively. The spilled category of the audience rating means less than 60% of the voters rated the movies positively, where as the opposite is true for upright rating. The movies having upright rating are higher in the dataset. Finally, the movies have a median audience score of approximately 65, similar to that of IMDB ratings.

In the next session, we will look for any linear relationship between all of these variables and choose an appropriate response variable.

Explanatory Variables and Response Variable.

The quickest way to visualize the relationship and correlation of all variables one time is using the ggpairs function.

lowerFn <- function(data, mapping, method = "lm", ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(colour = "black") +
    geom_smooth(method = method, color = "red", ...)
  p
} #A function to help add a regression line to the scatter plots to the ggpairs function.


#Next, I will add a sapply function to the ggpair function to ensures all categorical variables 
#are excluded from the plot..

ggpairs(film_df[ ,sapply(film_df,is.numeric)], 
        lower = list(continuous = wrap(lowerFn, method = "lm")), 
        diag = list(continuous = wrap("barDiag", colour = "white")), 
        upper = list(continuous = wrap("cor", size = 8)))

The above pair plot illustrates a clear linear relationship between imdb_rating,critics_score, and audience_score variables. The rest of the variables have very less linearity with each other. In order to see the correlation results more clearly, I am going to plot a correlation matrix plot too.

cor_matrix <- cor(film_df[ ,sapply(film_df,is.numeric)], use= "complete.obs")

corrplot(cor_matrix, method="shade", shade.col=NA, cl.pos="n", tl.col="black", 
         tl.srt=30, addCoef.col="black")

Finally, let us look closer into just three variables mentioned earlier.

scores <- film_df %>%
  select(imdb_rating, audience_score, critics_score)

ggpairs(scores, lower = list(continuous = wrap(lowerFn, method = "lm")),
  diag = list(continuous = wrap("barDiag", colour = "white")),
  upper = list(continuous = wrap("cor", size = 8)))

Out of these three variables, we have to choose a suitable response variable and an explanatory variable. The remaining variable has to be left out of the model as two explanatory variables having strong collinearity with each other might complicate the model and violates the conditions for linear regression modeling.

Since imdb_rating has a higher correlation with both of these variables, I have decided to choose it as the explanatory variable. For the response variable, I chose the audience_score variable as the response variable. Since the critics_score variable have lower collinearity with other variables, this will be excluded from the model.

bp1 <- ggplot(data = film_df, aes(y = audience_score, fill = genre)) + 
  geom_boxplot() + scale_fill_brewer(palette="Paired") + 
  labs(title = "Audience Score by Genre", y = "Audience Score", 
       x = "Critics Rating") + theme(plot.title = element_text(hjust = 0.5))

bp2 <- ggplot(data = film_df, aes(y = audience_score, fill = mpaa_rating)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Audience Score by MPAA Rating", y = "Audience Score", 
       x = "MPAA Rating") + theme(plot.title = element_text(hjust = 0.5))

bp3 <- ggplot(data = film_df, aes(y = audience_score, fill = critics_rating)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Audience Score by Critics Rating", y = "Audience Score", 
       x = "Critics Rating") + theme(plot.title = element_text(hjust = 0.5))

bp4 <- ggplot(data = film_df, aes(y = audience_score, fill = audience_rating)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Audience Score by Audience Rating", y = "Audience Score", 
       x = "Audience Rating") + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(bp1, bp2, bp3, bp4, ncol = 1, top = "Audience Score and Categorical Variables")

In the first plot, movies of ‘Musical & Historical Arts’ genre has the highest median audience score. Horror movies on the other hand has the lowest score although there is one outlier having a high score above 75. Drama movies have a median score of above 60, which signifies ‘above-average’ rating.

In the second plot, Unrated movies have a higher median score. Interestingly, NC-17 rated movies (meaning Adults only) have a consistent median score above 60 (BUT, movies with this rating has very less number of observations in the dataset). PG-13 movies have the lowest median score below 60.

In the third plot, movies rated ‘Certified fresh’ always have a median score above 75, followed by ‘Fresh’ rating having a median score above 60. ‘Rotten’ movies have median score below 60.

In the fourth plot, ‘Upright’ rated movies have a high median score of approximately 75 while ‘Spilled’ rating movies have a median score below 50.

Part 4: Modeling

We will now move to creating the linear regression model. In order to create a model to predict the audience score, we need to pick the right predictors (explanatory variables). Therefore, we shall use the step function to achieve a parsimonious model based on the lowest AIC (Akaike information criterion) and best R-squared and Adjusted R-squared.

#Removing the critics score variable

f <- film_df %>%
  select(genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating, 
         imdb_num_votes, critics_rating, audience_rating, audience_score)

str(f)

## tibble [573 x 9] (S3: tbl_df/tbl/data.frame)
##  $ genre          : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 6 1 6 ...
##  $ runtime        : num [1:573] 80 101 84 139 90 142 93 119 127 108 ...
##  $ mpaa_rating    : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 4 5 6 3 4 ...
##  $ thtr_rel_year  : num [1:573] 2013 2001 1996 1993 2004 ...
##  $ imdb_rating    : num [1:573] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 6.6 6.8 6 ...
##  $ imdb_num_votes : int [1:573] 899 12285 22381 35096 2386 5016 2272 12496 71979 9669 ...
##  $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 3 3 1 1 2 ...
##  $ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 1 2 2 1 ...
##  $ audience_score : num [1:573] 73 81 91 76 27 76 47 66 75 46 ...
##  - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
##   ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...

We have 8 explanatory variables and a response variable. Now, we will create a linear regression model with all the variables and assess the results.

full_model <- lm(audience_score~., data = f) #~. represents use the rest of the explanatory variables.
summary(full_model)

## 
## Call:
## lm(formula = audience_score ~ ., data = f)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3225  -4.2882   0.5771   4.2375  24.1551 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.939e+01  6.533e+01   0.909   0.3637    
## genreAnimation                  3.427e+00  2.926e+00   1.171   0.2420    
## genreArt House & International -1.245e+00  2.339e+00  -0.532   0.5948    
## genreComedy                     1.644e+00  1.190e+00   1.382   0.1675    
## genreDocumentary                4.531e+00  4.217e+00   1.074   0.2831    
## genreDrama                      4.182e-02  1.050e+00   0.040   0.9683    
## genreHorror                    -1.136e+00  1.800e+00  -0.631   0.5285    
## genreMusical & Performing Arts  5.629e+00  2.687e+00   2.095   0.0366 *  
## genreMystery & Suspense        -2.582e+00  1.342e+00  -1.924   0.0548 .  
## genreOther                      5.619e-01  2.054e+00   0.274   0.7846    
## genreScience Fiction & Fantasy -9.861e-02  2.639e+00  -0.037   0.9702    
## runtime                        -2.999e-02  2.012e-02  -1.491   0.1365    
## mpaa_ratingNC-17               -8.737e+00  7.322e+00  -1.193   0.2333    
## mpaa_ratingPG                  -1.555e+00  2.157e+00  -0.721   0.4712    
## mpaa_ratingPG-13               -2.428e+00  2.254e+00  -1.077   0.2818    
## mpaa_ratingR                   -2.790e+00  2.168e+00  -1.287   0.1987    
## mpaa_ratingUnrated             -2.514e+00  2.945e+00  -0.854   0.3936    
## thtr_rel_year                  -3.117e-02  3.251e-02  -0.959   0.3381    
## imdb_rating                     9.189e+00  4.568e-01  20.115   <2e-16 ***
## imdb_num_votes                  4.407e-06  3.304e-06   1.334   0.1828    
## critics_ratingFresh            -6.695e-01  9.786e-01  -0.684   0.4942    
## critics_ratingRotten           -1.681e+00  1.025e+00  -1.640   0.1016    
## audience_ratingUpright          2.027e+01  8.230e-01  24.631   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.974 on 550 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8762 
## F-statistic:   185 on 22 and 550 DF,  p-value: < 2.2e-16

At present, we have a R-squared value of 88%, meaning around 88% of the variability of the response variables is explained by our model. The adjusted R-squared is a modified version of R-squared that adjusts for predictors that are not significant in a regression model. In this case, we achieved an adjusted R-squared of 87.62%, which is not too less from the R-squared values.

anova(full_model)

## Analysis of Variance Table
## 
## Response: audience_score
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## genre            10  27815    2781   57.190 < 2.2e-16 ***
## runtime           1   7464    7464  153.478 < 2.2e-16 ***
## mpaa_rating       5   6364    1273   26.171 < 2.2e-16 ***
## thtr_rel_year     1    949     949   19.506 1.208e-05 ***
## imdb_rating       1 123706  123706 2543.524 < 2.2e-16 ***
## imdb_num_votes    1    566     566   11.629 0.0006969 ***
## critics_rating    2   1550     775   15.937 1.870e-07 ***
## audience_rating   1  29506   29506  606.679 < 2.2e-16 ***
## Residuals       550  26750      49                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the predictors are significant as they all have a lower P value below 0.05.

Next, we will use the step function use “both” direction for both backward and front step-wise elimination to remove model features with low predictive value.

lr_model <- step(full_model, direction = "both", scope = full_model)

## Start:  AIC=2248.26
## audience_score ~ genre + runtime + mpaa_rating + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_rating + audience_rating
## 
##                   Df Sum of Sq   RSS    AIC
## - mpaa_rating      5     191.7 26941 2242.4
## - thtr_rel_year    1      44.7 26794 2247.2
## - critics_rating   2     148.5 26898 2247.4
## - imdb_num_votes   1      86.5 26836 2248.1
## <none>                         26750 2248.3
## - genre           10     964.7 27714 2248.6
## - runtime          1     108.1 26858 2248.6
## - imdb_rating      1   19678.1 46428 2562.2
## - audience_rating  1   29506.3 56256 2672.2
## 
## Step:  AIC=2242.36
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_rating + audience_rating
## 
##                   Df Sum of Sq   RSS    AIC
## - critics_rating   2     144.3 27086 2241.4
## - imdb_num_votes   1      87.1 27028 2242.2
## <none>                         26941 2242.4
## - thtr_rel_year    1     112.5 27054 2242.7
## - runtime          1     121.9 27063 2242.9
## - genre           10    1231.2 28173 2248.0
## - imdb_rating      1   20076.7 47018 2559.4
## - audience_rating  1   29420.1 56361 2663.3
## 
## Step:  AIC=2241.42
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + audience_rating
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         27086 2241.4
## - thtr_rel_year    1     119.1 27205 2241.9
## - imdb_num_votes   1     142.1 27228 2242.4
## - runtime          1     148.0 27234 2242.5
## - genre           10    1248.9 28335 2247.2
## - imdb_rating      1   24551.4 51637 2609.1
## - audience_rating  1   30924.4 58010 2675.8

We have achieved a parsimonious model with just six predictors from 8 predictors. Initially, the AIC value was 2248.26 with 8 predictors. This has been reduced to a value of 2241.42 after removing 2 non significant variables with low predictive power.

The predictors chosen by the model are thtr_rel_year, imdb_num_votes, runtime, genre, imdb_rating, audience_rating.

summary(lr_model)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + audience_rating, data = f)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6801  -4.6320   0.5414   4.3269  24.2810 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.231e+01  5.820e+01   1.414   0.1579    
## genreAnimation                  5.418e+00  2.652e+00   2.043   0.0415 *  
## genreArt House & International -1.594e+00  2.252e+00  -0.708   0.4794    
## genreComedy                     1.524e+00  1.175e+00   1.297   0.1951    
## genreDocumentary                3.729e+00  4.164e+00   0.896   0.3709    
## genreDrama                     -2.008e-01  1.018e+00  -0.197   0.8438    
## genreHorror                    -1.631e+00  1.749e+00  -0.933   0.3513    
## genreMusical & Performing Arts  5.435e+00  2.671e+00   2.035   0.0423 *  
## genreMystery & Suspense        -2.952e+00  1.304e+00  -2.263   0.0240 *  
## genreOther                      6.472e-01  2.032e+00   0.319   0.7502    
## genreScience Fiction & Fantasy  2.257e-01  2.626e+00   0.086   0.9315    
## runtime                        -3.408e-02  1.953e-02  -1.745   0.0816 .  
## thtr_rel_year                  -4.514e-02  2.884e-02  -1.565   0.1181    
## imdb_rating                     9.502e+00  4.229e-01  22.470   <2e-16 ***
## imdb_num_votes                  5.266e-06  3.081e-06   1.710   0.0879 .  
## audience_ratingUpright          2.043e+01  8.101e-01  25.218   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.973 on 557 degrees of freedom
## Multiple R-squared:  0.8794, Adjusted R-squared:  0.8762 
## F-statistic: 270.9 on 15 and 557 DF,  p-value: < 2.2e-16

After the elimination step, we haven’t lost much in term both R-squared and adjust R-squared value. The current R-squared value is 87.94%, just slightly lower than 88.09% previously. Meanwhile, the adjust R-squared has remained the same at 87.62%.

Finally, the p-value based on the F-statistics test is below 0.05, meaning all of the current predictors are statistically significant.

Interpretation of model coefficients

An intercept of 82.31 is the estimated audience score if all other variables is zero. This does not make any sense is only used to adjust the intercept height.
Out of the 11 genre categories, the audience score can be higher or lower than Action & Adventure films based on what genre is selected and its corresponding coefficient. For example, movie with comedy genre is expected to have an audience score 1.524 higher than Action & Adventure films. And, movies with Horror genre is expected to have an audience score 1.631 lower than that of Action & Adventure films.
All else hold constant, for every one unit increase in runtime, the model predicts a 0.034 decrease in audience_score on average.
All else hold constant, for every one unit increase in theater release year, the model predicts a 0.045 decrease in audience_score on average.
All else hold constant, for every one unit increase in IMDB rating, the model predicts a 9.5 increase in audience_score on average.
All else hold constant, for every one unit increase in runtime, the model predicts a 0.034 decrease in audience_score on average.
All else hold constant, the model predicts rating Upright movie is 20.43 higher in audience score on average compared to Spilled rating movie.
All else hold constant, for every one unit increase in the IMDB number of votes, the model predicts a 0.0000052 increase in audience_score on average.

Model Diagnostics

Now, we will check whether our model follows the four assumptions of linear regression.

1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.

2. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.

3. Homoscedasticity: The residuals have constant variance at every level of x.

4. Normality: The residuals of the model are normally distributed.

We have already checked for linearity in the previous section. Only variables such as thtr_rel_year, imdb_num_votes, and runtime have low linearity.

We have also taken care for multicollinearity in the previous section by dropping the critics score from our model.

Let us now visualize our model for checking the remaining conditions.

g1 <- ggplot(data = lr_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals") + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(title = "a) Variability Condition")

g2 <- ggplot(data = lr_model, aes(x = .resid)) +
  geom_histogram(binwidth = 1, fill='white', color='black', aes(y=..density..)) +
  xlab("Residuals") + geom_density(lwd = 0.8) + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(title = "b) Normality Condition 1")

g3 <- ggplot(data = lr_model, aes(sample = .resid)) +
  stat_qq() + stat_qq_line(col = "red") + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(title = "c) Normality Condition 2")

grid.arrange(g1,g2,g3, nrow = 1)

From above plot a, the residuals is randomly scattered around the zero axis. This signifies that there is a constant variance of residuals and there is no fan shape in residuals plot.

From plot b, we can see a nearly normal distribution of residuals around zero. Similarly from plot c, majority of the points stays on the line, with just slight skewness around the tail. Overall, the normal distribution of residuals condition is met.

Part 5: Prediction

Building a test data frame

I am going to create a data frame with randomly chosen movies of different genres, imdb ratings, critics rating and release year in order to predict the corresponding audience score. This will then be compared with the actual audience score from Rotten tomatoes website.

test_df <- data.frame(genre = c("Horror", "Action & Adventure", "Animation", 
                                "Comedy", "Science Fiction & Fantasy", "Drama", 
                                "Action & Adventure","Drama","Animation",
                                "Action & Adventure"), 
                      runtime = c(117,133,97,107,118,130,120,125,86,154), 
                      thtr_rel_year = c(2016,2016,2016,2016,2017,2015,2015,2015,
                                        2017,2017), 
                      imdb_rating = c(7.3,7.8,6.3,6.3,6.6,7.8,8.1,4.1,3.3,5.2), 
                      imdb_num_votes = c(425738, 557110, 80416, 159855, 270621,
                                         364011,883042,297834,56163,135701), 
                      audience_rating = c("Upright", "Upright", "Spilled", 
                                          "Upright", "Upright", "Upright", 
                                          "Upright","Spilled","Spilled","Spilled"))

test_df

##                        genre runtime thtr_rel_year imdb_rating imdb_num_votes
## 1                     Horror     117          2016         7.3         425738
## 2         Action & Adventure     133          2016         7.8         557110
## 3                  Animation      97          2016         6.3          80416
## 4                     Comedy     107          2016         6.3         159855
## 5  Science Fiction & Fantasy     118          2017         6.6         270621
## 6                      Drama     130          2015         7.8         364011
## 7         Action & Adventure     120          2015         8.1         883042
## 8                      Drama     125          2015         4.1         297834
## 9                  Animation      86          2017         3.3          56163
## 10        Action & Adventure     154          2017         5.2         135701
##    audience_rating
## 1          Upright
## 2          Upright
## 3          Spilled
## 4          Upright
## 5          Upright
## 6          Upright
## 7          Upright
## 8          Spilled
## 9          Spilled
## 10         Spilled

Next, I will predict the score and the predict the intervals of the audience score using my model. Subsequently, I will save the results to a new data frame and add the corresponding movie titles and actual audience score and then analyze the result.

predict_df <- predict(lr_model, test_df, interval = "predict")

mydf <- cbind(test_df, round(predict_df))

mydf$Movie_Title <- c("Split", "Rogue One: A Star Wars Story", 
                      "The Angry Birds Movie", "Central Intelligence", 
                      "Kong: Skull Island", "The Big Short", 
                      "Mad Max: Fury Road", "Fifty Shades of Grey", 
                      "The Emoji Movie","Transformers: The Last Knight")

mydf <- mydf %>%
  select(Movie_Title, everything()) %>%
  mutate(Actual_score = c(79,86,46,62,69,88,86,41,37,43), 
         estimate = ifelse(abs(Actual_score - fit) <= 5, "Good Prediction", 
                           ifelse(Actual_score >= lwr & Actual_score <= upr, 
                                  "Not great but within the range", "Bad prediction")))
         
colnames(mydf) <- c("Movie Title", "Genre", "Runtime", "Release Year", "IMDB Rating", 
         "IMDB Number of Votes", "Audience Rating", "Predicted Audience Score", 
         "Predicted Lower Interval", "Prediceted Upper Interval", "Actual Audience Score", "Prediction Accuracy")

Now let us look at the results.

mydf %>%
  kbl() %>%
  column_spec(c(1,8,11), bold = T) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                full_width = F, position = "center", font_size = 11)

Movie Title	Genre	Runtime	Release Year	IMDB Rating	IMDB Number of Votes	Audience Rating	Predicted Audience Score	Predicted Lower Interval	Prediceted Upper Interval	Actual Audience Score	Prediction Accuracy
Split	Horror	117	2016	7.3	425738	Upright	78	64	92	79	Good Prediction
Rogue One: A Star Wars Story	Action & Adventure	133	2016	7.8	557110	Upright	84	70	98	86	Good Prediction
The Angry Birds Movie	Animation	97	2016	6.3	80416	Spilled	54	39	68	46	Not great but within the range
Central Intelligence	Comedy	107	2016	6.3	159855	Upright	70	56	84	62	Not great but within the range
Kong: Skull Island	Science Fiction & Fantasy	118	2017	6.6	270621	Upright	72	57	87	69	Good Prediction
The Big Short	Drama	130	2015	7.8	364011	Upright	83	69	97	88	Good Prediction
Mad Max: Fury Road	Action & Adventure	120	2015	8.1	883042	Upright	89	75	104	86	Good Prediction
Fifty Shades of Grey	Drama	125	2015	4.1	297834	Spilled	27	13	41	41	Not great but within the range
The Emoji Movie	Animation	86	2017	3.3	56163	Spilled	25	11	40	37	Not great but within the range
Transformers: The Last Knight	Action & Adventure	154	2017	5.2	135701	Spilled	36	22	50	43	Not great but within the range

Our model succeeded in predicting almost accurately the audience score of half of the movies in the table with an absolute difference of 5 points. Predictions for the rest of the movies were not great but still was within the range of the predicted interval. It is clear that the model improves its prediction accuracy for movies with a higher IMDB score, “Upright” category and higher number of votes. Movies with lower IMDB rating and “Spilled” critic rating suffers from lower prediction accuracy.

Part 6: Conclusion

To conclude, we have created a linear regression model that can predict the audience score of a movie based on certain conditions. Therefore, we have almost narrowed down the required attributes of a movies in order to predict a movies popularity.

But, the model is far from perfect. More can be done to improve the model’s predictive ability. Addition of more observations with movies of equal variation of different genres could help our model to capture more variability in the data.