According to my Boss who just acquired and entrusted me the dataset for obtaining valuable insights from it, the observations are about movies that were randomly sampled and includes information about these movies from Rotten Tomatoes and IMDB. The dataset is comprised of 651 randomly sampled movies produced and released before 2016.
Therefore, the results obtained can be generalized to all Hollywood movies. But, since this is just an observational study, no causation can be implied from the results.
The data set suffers from some biases since ratings and voting are done voluntarily by users on IMDB and Rotten tomatoes. Moreover, there is an imbalance of types of genres within the data set.
What attributes makes a movies popular among the audience? Is there an association between a movies genre, MPAA rating, length of the movies, etc. and a movie’s popularity rating?
I am also interested in knowing which is a good metric for a assessing a movies popularity.
First, let us simply eyeball the dataset structure.
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
The movies dataset is made up of 651 observations and 32 variables. There are several variables that won’t be useful where we are creating a linear regression model.
Searching for NA values in the dataset.
##
## FALSE TRUE
## 20751 81
There are about 81 NAs within the data set. We will take care of that right away.
The variables I prefer to choose are:
genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score.
Other variables were excluded as I believe movie ratings does not depend on the actor, actress, director etc. Also, variables such as the studio, DVD dates, whether the movies got any awards, etc as these happens way later when the movie is released on theater. Finally,URL links provides no use information too.
Exploring title type variable.
##
## Documentary Feature Film TV Movie
## 55 591 5
I have decided to filter only the observations with feature film since documentary and TV movie only makes up a total of 60 observations compared to 591 observations of feature film.
Creating the dataset we will be using:
film_df <- movies %>%
na.omit() %>%
filter(title_type == "Feature Film") %>%
select(genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating,
imdb_num_votes, critics_rating, critics_score, audience_rating,
audience_score)
dim(film_df)## [1] 573 10
Let us get some quick descriptive statistics of the data set.
## genre runtime mpaa_rating thtr_rel_year
## Drama :294 Min. : 68.0 G : 15 Min. :1972
## Comedy : 85 1st Qu.: 94.0 NC-17 : 1 1st Qu.:1990
## Action & Adventure: 62 Median :104.0 PG :104 Median :1999
## Mystery & Suspense: 56 Mean :106.9 PG-13 :128 Mean :1998
## Horror : 22 3rd Qu.:117.0 R :309 3rd Qu.:2006
## Other : 15 Max. :202.0 Unrated: 16 Max. :2014
## (Other) : 39
## imdb_rating imdb_num_votes critics_rating critics_score
## Min. :1.900 Min. : 390 Certified Fresh:114 Min. : 1.0
## 1st Qu.:5.900 1st Qu.: 6552 Fresh :167 1st Qu.: 31.0
## Median :6.500 Median : 18712 Rotten :292 Median : 59.0
## Mean :6.395 Mean : 64347 Mean : 54.9
## 3rd Qu.:7.100 3rd Qu.: 68429 3rd Qu.: 79.0
## Max. :9.000 Max. :893008 Max. :100.0
##
## audience_rating audience_score
## Spilled:263 Min. :11.00
## Upright:310 1st Qu.:45.00
## Median :63.00
## Mean :60.58
## 3rd Qu.:78.00
## Max. :97.00
##
As seen above, we have four categorical variables and six numerical variables. Next, I will move on to visualizing the dataset using ggplot in order to get better insights from it.
p1 <- ggplot(data = film_df, aes(x = genre, fill = genre)) +
geom_bar(aes(y=100*(..count..)/sum(..count..))) +
scale_fill_brewer(palette="Paired") + xlab("Film Genre") + ylab("Percentage (%)") +
coord_flip()
p2 <- ggplot(data = film_df, aes(x = mpaa_rating, fill = mpaa_rating)) +
geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Paired") +
xlab("Film MPAA Rating") + ylab("Percentage (%)") + coord_flip()
grid.arrange(p1, p2, nrow = 1, top = "Movie Genre & MPAA Rating")From the above plots, almost 50% of the movie in the data set are of Drama genre. Therefore, we must keep in mind that the predictions made could be biased toward drama genre movies. Similarly, ‘R’ rated movies dominates by almost 50% compared to other movie ratings.
For the uninitiated, MPAA stands for ‘Motion Picture Association of America’. These are the meanings of each of the ratings:
Rated G: General audiences – All ages admitted.
Rated PG: Parental guidance suggested – Some material may not be suitable for children.
Rated PG-13: Parents strongly cautioned – Some material may be inappropriate for children under 13.
Rated R: Restricted – Under 17 requires accompanying parent or adult guardian.
Rated NC-17: Adults Only – No one 17 and under admitted.
Unrated are movies that were not rated by the ratings board of the MPAA.
p3 <- ggplot(data = film_df, aes(x = runtime)) +
geom_histogram(binwidth = 5, color="black", fill="grey",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Film Run Time (min)") + ylab("Density")
p4 <- ggplot(data = film_df, aes(x = thtr_rel_year)) +
geom_histogram(binwidth = 1, color="darkblue", fill="lightblue",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Film Theatre Release Year (year)") + ylab("Density")
grid.arrange(p3, p4, nrow = 2, top = "Movie Run Times and Release Year")The median run time of movies are approximately around 100 min (1 hr 40 min). Very few movies are longer than two hours. Most of the movies in the dataset were released before the 2000s. The years seems to be almost uniformly distributed though from 1970 to 2014.
p5 <- ggplot(data = film_df, aes(x = imdb_rating)) +
geom_histogram(binwidth = 0.5, color="black", fill="yellow",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("IMDB Rating") + ylab("Density")
p6 <- ggplot(data = film_df, aes(x = imdb_num_votes)) +
geom_histogram(binwidth = 20000, color="black", fill="yellow",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("IMDB No. of Votes") + ylab("Density") + scale_y_continuous(labels = comma)
grid.arrange(p5, p6, nrow = 1, top = "IMDB Ratings and Number of Votes")Average ratings for movies on IMDB are approximately 6.5, suggesting most of the movies are “Average”. The other plot shows an extreme right skew, suggesting a median number of votes of approximately 18000 numbers while the mean is approximately 65000 votes.
p7 <- ggplot(data = film_df, aes(x = critics_rating , fill = critics_rating )) +
geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Dark2") +
xlab("Critics Rating") + ylab("Percentage (%)")
p8 <- ggplot(data = film_df, aes(x = critics_score)) +
geom_histogram(binwidth = 5, color="darkred", fill="pink",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Critics Score") + ylab("Density")
grid.arrange(p7, p8, nrow = 1, top = "Rotten Tomato's Critic's Rating Categories & and Critic's Score")Almost 50% of the movies in the dataset have a ‘Rotten’ rating meaning they were below average movies. The critic’s score for the movies in the dataset follows almost a uniform distribution, suggesting there are all kinds of good and bad movies in the data set.
p9 <- ggplot(data = film_df, aes(x = audience_rating , fill = audience_rating )) +
geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Dark2") + xlab("Audience Rating") +
ylab("Percentage (%)")
p10 <- ggplot(data = film_df, aes(x = audience_score)) +
geom_histogram(binwidth = 5, color="darkgreen", fill="lightgreen",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Audience Score") + ylab("Density")
grid.arrange(p9, p10, nrow = 1, top = "Rotten Tomatoe's Audience Rating Categories & Audience Score")Audience rating signifies the percentage of users who have rated the movie or TV Show positively. The spilled category of the audience rating means less than 60% of the voters rated the movies positively, where as the opposite is true for upright rating. The movies having upright rating are higher in the dataset. Finally, the movies have a median audience score of approximately 65, similar to that of IMDB ratings.
In the next session, we will look for any linear relationship between all of these variables and choose an appropriate response variable.
The quickest way to visualize the relationship and correlation of all variables one time is using the ggpairs function.
lowerFn <- function(data, mapping, method = "lm", ...) {
p <- ggplot(data = data, mapping = mapping) +
geom_point(colour = "black") +
geom_smooth(method = method, color = "red", ...)
p
} #A function to help add a regression line to the scatter plots to the ggpairs function.
#Next, I will add a sapply function to the ggpair function to ensures all categorical variables
#are excluded from the plot..
ggpairs(film_df[ ,sapply(film_df,is.numeric)],
lower = list(continuous = wrap(lowerFn, method = "lm")),
diag = list(continuous = wrap("barDiag", colour = "white")),
upper = list(continuous = wrap("cor", size = 8)))The above pair plot illustrates a clear linear relationship between imdb_rating,critics_score, and audience_score variables. The rest of the variables have very less linearity with each other. In order to see the correlation results more clearly, I am going to plot a correlation matrix plot too.
cor_matrix <- cor(film_df[ ,sapply(film_df,is.numeric)], use= "complete.obs")
corrplot(cor_matrix, method="shade", shade.col=NA, cl.pos="n", tl.col="black",
tl.srt=30, addCoef.col="black")Finally, let us look closer into just three variables mentioned earlier.
scores <- film_df %>%
select(imdb_rating, audience_score, critics_score)
ggpairs(scores, lower = list(continuous = wrap(lowerFn, method = "lm")),
diag = list(continuous = wrap("barDiag", colour = "white")),
upper = list(continuous = wrap("cor", size = 8)))Out of these three variables, we have to choose a suitable response variable and an explanatory variable. The remaining variable has to be left out of the model as two explanatory variables having strong collinearity with each other might complicate the model and violates the conditions for linear regression modeling.
Since imdb_rating has a higher correlation with both of these variables, I have decided to choose it as the explanatory variable. For the response variable, I chose the audience_score variable as the response variable. Since the critics_score variable have lower collinearity with other variables, this will be excluded from the model.
bp1 <- ggplot(data = film_df, aes(y = audience_score, fill = genre)) +
geom_boxplot() + scale_fill_brewer(palette="Paired") +
labs(title = "Audience Score by Genre", y = "Audience Score",
x = "Critics Rating") + theme(plot.title = element_text(hjust = 0.5))
bp2 <- ggplot(data = film_df, aes(y = audience_score, fill = mpaa_rating)) +
geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Audience Score by MPAA Rating", y = "Audience Score",
x = "MPAA Rating") + theme(plot.title = element_text(hjust = 0.5))
bp3 <- ggplot(data = film_df, aes(y = audience_score, fill = critics_rating)) +
geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Audience Score by Critics Rating", y = "Audience Score",
x = "Critics Rating") + theme(plot.title = element_text(hjust = 0.5))
bp4 <- ggplot(data = film_df, aes(y = audience_score, fill = audience_rating)) +
geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Audience Score by Audience Rating", y = "Audience Score",
x = "Audience Rating") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(bp1, bp2, bp3, bp4, ncol = 1, top = "Audience Score and Categorical Variables")In the first plot, movies of ‘Musical & Historical Arts’ genre has the highest median audience score. Horror movies on the other hand has the lowest score although there is one outlier having a high score above 75. Drama movies have a median score of above 60, which signifies ‘above-average’ rating.
In the second plot, Unrated movies have a higher median score. Interestingly, NC-17 rated movies (meaning Adults only) have a consistent median score above 60 (BUT, movies with this rating has very less number of observations in the dataset). PG-13 movies have the lowest median score below 60.
In the third plot, movies rated ‘Certified fresh’ always have a median score above 75, followed by ‘Fresh’ rating having a median score above 60. ‘Rotten’ movies have median score below 60.
In the fourth plot, ‘Upright’ rated movies have a high median score of approximately 75 while ‘Spilled’ rating movies have a median score below 50.
We will now move to creating the linear regression model. In order to create a model to predict the audience score, we need to pick the right predictors (explanatory variables). Therefore, we shall use the step function to achieve a parsimonious model based on the lowest AIC (Akaike information criterion) and best R-squared and Adjusted R-squared.
#Removing the critics score variable
f <- film_df %>%
select(genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating,
imdb_num_votes, critics_rating, audience_rating, audience_score)
str(f)## tibble [573 x 9] (S3: tbl_df/tbl/data.frame)
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 6 1 6 ...
## $ runtime : num [1:573] 80 101 84 139 90 142 93 119 127 108 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 4 5 6 3 4 ...
## $ thtr_rel_year : num [1:573] 2013 2001 1996 1993 2004 ...
## $ imdb_rating : num [1:573] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 6.6 6.8 6 ...
## $ imdb_num_votes : int [1:573] 899 12285 22381 35096 2386 5016 2272 12496 71979 9669 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 3 3 1 1 2 ...
## $ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 1 2 2 1 ...
## $ audience_score : num [1:573] 73 81 91 76 27 76 47 66 75 46 ...
## - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
## ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
We have 8 explanatory variables and a response variable. Now, we will create a linear regression model with all the variables and assess the results.
full_model <- lm(audience_score~., data = f) #~. represents use the rest of the explanatory variables.
summary(full_model)##
## Call:
## lm(formula = audience_score ~ ., data = f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3225 -4.2882 0.5771 4.2375 24.1551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.939e+01 6.533e+01 0.909 0.3637
## genreAnimation 3.427e+00 2.926e+00 1.171 0.2420
## genreArt House & International -1.245e+00 2.339e+00 -0.532 0.5948
## genreComedy 1.644e+00 1.190e+00 1.382 0.1675
## genreDocumentary 4.531e+00 4.217e+00 1.074 0.2831
## genreDrama 4.182e-02 1.050e+00 0.040 0.9683
## genreHorror -1.136e+00 1.800e+00 -0.631 0.5285
## genreMusical & Performing Arts 5.629e+00 2.687e+00 2.095 0.0366 *
## genreMystery & Suspense -2.582e+00 1.342e+00 -1.924 0.0548 .
## genreOther 5.619e-01 2.054e+00 0.274 0.7846
## genreScience Fiction & Fantasy -9.861e-02 2.639e+00 -0.037 0.9702
## runtime -2.999e-02 2.012e-02 -1.491 0.1365
## mpaa_ratingNC-17 -8.737e+00 7.322e+00 -1.193 0.2333
## mpaa_ratingPG -1.555e+00 2.157e+00 -0.721 0.4712
## mpaa_ratingPG-13 -2.428e+00 2.254e+00 -1.077 0.2818
## mpaa_ratingR -2.790e+00 2.168e+00 -1.287 0.1987
## mpaa_ratingUnrated -2.514e+00 2.945e+00 -0.854 0.3936
## thtr_rel_year -3.117e-02 3.251e-02 -0.959 0.3381
## imdb_rating 9.189e+00 4.568e-01 20.115 <2e-16 ***
## imdb_num_votes 4.407e-06 3.304e-06 1.334 0.1828
## critics_ratingFresh -6.695e-01 9.786e-01 -0.684 0.4942
## critics_ratingRotten -1.681e+00 1.025e+00 -1.640 0.1016
## audience_ratingUpright 2.027e+01 8.230e-01 24.631 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.974 on 550 degrees of freedom
## Multiple R-squared: 0.8809, Adjusted R-squared: 0.8762
## F-statistic: 185 on 22 and 550 DF, p-value: < 2.2e-16
At present, we have a R-squared value of 88%, meaning around 88% of the variability of the response variables is explained by our model. The adjusted R-squared is a modified version of R-squared that adjusts for predictors that are not significant in a regression model. In this case, we achieved an adjusted R-squared of 87.62%, which is not too less from the R-squared values.
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 27815 2781 57.190 < 2.2e-16 ***
## runtime 1 7464 7464 153.478 < 2.2e-16 ***
## mpaa_rating 5 6364 1273 26.171 < 2.2e-16 ***
## thtr_rel_year 1 949 949 19.506 1.208e-05 ***
## imdb_rating 1 123706 123706 2543.524 < 2.2e-16 ***
## imdb_num_votes 1 566 566 11.629 0.0006969 ***
## critics_rating 2 1550 775 15.937 1.870e-07 ***
## audience_rating 1 29506 29506 606.679 < 2.2e-16 ***
## Residuals 550 26750 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All the predictors are significant as they all have a lower P value below 0.05.
Next, we will use the step function use “both” direction for both backward and front step-wise elimination to remove model features with low predictive value.
## Start: AIC=2248.26
## audience_score ~ genre + runtime + mpaa_rating + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_rating + audience_rating
##
## Df Sum of Sq RSS AIC
## - mpaa_rating 5 191.7 26941 2242.4
## - thtr_rel_year 1 44.7 26794 2247.2
## - critics_rating 2 148.5 26898 2247.4
## - imdb_num_votes 1 86.5 26836 2248.1
## <none> 26750 2248.3
## - genre 10 964.7 27714 2248.6
## - runtime 1 108.1 26858 2248.6
## - imdb_rating 1 19678.1 46428 2562.2
## - audience_rating 1 29506.3 56256 2672.2
##
## Step: AIC=2242.36
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_rating + audience_rating
##
## Df Sum of Sq RSS AIC
## - critics_rating 2 144.3 27086 2241.4
## - imdb_num_votes 1 87.1 27028 2242.2
## <none> 26941 2242.4
## - thtr_rel_year 1 112.5 27054 2242.7
## - runtime 1 121.9 27063 2242.9
## - genre 10 1231.2 28173 2248.0
## - imdb_rating 1 20076.7 47018 2559.4
## - audience_rating 1 29420.1 56361 2663.3
##
## Step: AIC=2241.42
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + audience_rating
##
## Df Sum of Sq RSS AIC
## <none> 27086 2241.4
## - thtr_rel_year 1 119.1 27205 2241.9
## - imdb_num_votes 1 142.1 27228 2242.4
## - runtime 1 148.0 27234 2242.5
## - genre 10 1248.9 28335 2247.2
## - imdb_rating 1 24551.4 51637 2609.1
## - audience_rating 1 30924.4 58010 2675.8
We have achieved a parsimonious model with just six predictors from 8 predictors. Initially, the AIC value was 2248.26 with 8 predictors. This has been reduced to a value of 2241.42 after removing 2 non significant variables with low predictive power.
The predictors chosen by the model are thtr_rel_year, imdb_num_votes, runtime, genre, imdb_rating, audience_rating.
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + audience_rating, data = f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6801 -4.6320 0.5414 4.3269 24.2810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.231e+01 5.820e+01 1.414 0.1579
## genreAnimation 5.418e+00 2.652e+00 2.043 0.0415 *
## genreArt House & International -1.594e+00 2.252e+00 -0.708 0.4794
## genreComedy 1.524e+00 1.175e+00 1.297 0.1951
## genreDocumentary 3.729e+00 4.164e+00 0.896 0.3709
## genreDrama -2.008e-01 1.018e+00 -0.197 0.8438
## genreHorror -1.631e+00 1.749e+00 -0.933 0.3513
## genreMusical & Performing Arts 5.435e+00 2.671e+00 2.035 0.0423 *
## genreMystery & Suspense -2.952e+00 1.304e+00 -2.263 0.0240 *
## genreOther 6.472e-01 2.032e+00 0.319 0.7502
## genreScience Fiction & Fantasy 2.257e-01 2.626e+00 0.086 0.9315
## runtime -3.408e-02 1.953e-02 -1.745 0.0816 .
## thtr_rel_year -4.514e-02 2.884e-02 -1.565 0.1181
## imdb_rating 9.502e+00 4.229e-01 22.470 <2e-16 ***
## imdb_num_votes 5.266e-06 3.081e-06 1.710 0.0879 .
## audience_ratingUpright 2.043e+01 8.101e-01 25.218 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.973 on 557 degrees of freedom
## Multiple R-squared: 0.8794, Adjusted R-squared: 0.8762
## F-statistic: 270.9 on 15 and 557 DF, p-value: < 2.2e-16
After the elimination step, we haven’t lost much in term both R-squared and adjust R-squared value. The current R-squared value is 87.94%, just slightly lower than 88.09% previously. Meanwhile, the adjust R-squared has remained the same at 87.62%.
Finally, the p-value based on the F-statistics test is below 0.05, meaning all of the current predictors are statistically significant.
An intercept of 82.31 is the estimated audience score if all other variables is zero. This does not make any sense is only used to adjust the intercept height.
Out of the 11 genre categories, the audience score can be higher or lower than Action & Adventure films based on what genre is selected and its corresponding coefficient. For example, movie with comedy genre is expected to have an audience score 1.524 higher than Action & Adventure films. And, movies with Horror genre is expected to have an audience score 1.631 lower than that of Action & Adventure films.
All else hold constant, for every one unit increase in runtime, the model predicts a 0.034 decrease in audience_score on average.
All else hold constant, for every one unit increase in theater release year, the model predicts a 0.045 decrease in audience_score on average.
All else hold constant, for every one unit increase in IMDB rating, the model predicts a 9.5 increase in audience_score on average.
All else hold constant, for every one unit increase in runtime, the model predicts a 0.034 decrease in audience_score on average.
All else hold constant, the model predicts rating Upright movie is 20.43 higher in audience score on average compared to Spilled rating movie.
All else hold constant, for every one unit increase in the IMDB number of votes, the model predicts a 0.0000052 increase in audience_score on average.
Now, we will check whether our model follows the four assumptions of linear regression.
1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
2. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
3. Homoscedasticity: The residuals have constant variance at every level of x.
4. Normality: The residuals of the model are normally distributed.
We have already checked for linearity in the previous section. Only variables such as thtr_rel_year, imdb_num_votes, and runtime have low linearity.
We have also taken care for multicollinearity in the previous section by dropping the critics score from our model.
Let us now visualize our model for checking the remaining conditions.
g1 <- ggplot(data = lr_model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") + theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "a) Variability Condition")
g2 <- ggplot(data = lr_model, aes(x = .resid)) +
geom_histogram(binwidth = 1, fill='white', color='black', aes(y=..density..)) +
xlab("Residuals") + geom_density(lwd = 0.8) + theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "b) Normality Condition 1")
g3 <- ggplot(data = lr_model, aes(sample = .resid)) +
stat_qq() + stat_qq_line(col = "red") + theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "c) Normality Condition 2")
grid.arrange(g1,g2,g3, nrow = 1)From above plot a, the residuals is randomly scattered around the zero axis. This signifies that there is a constant variance of residuals and there is no fan shape in residuals plot.
From plot b, we can see a nearly normal distribution of residuals around zero. Similarly from plot c, majority of the points stays on the line, with just slight skewness around the tail. Overall, the normal distribution of residuals condition is met.
I am going to create a data frame with randomly chosen movies of different genres, imdb ratings, critics rating and release year in order to predict the corresponding audience score. This will then be compared with the actual audience score from Rotten tomatoes website.
test_df <- data.frame(genre = c("Horror", "Action & Adventure", "Animation",
"Comedy", "Science Fiction & Fantasy", "Drama",
"Action & Adventure","Drama","Animation",
"Action & Adventure"),
runtime = c(117,133,97,107,118,130,120,125,86,154),
thtr_rel_year = c(2016,2016,2016,2016,2017,2015,2015,2015,
2017,2017),
imdb_rating = c(7.3,7.8,6.3,6.3,6.6,7.8,8.1,4.1,3.3,5.2),
imdb_num_votes = c(425738, 557110, 80416, 159855, 270621,
364011,883042,297834,56163,135701),
audience_rating = c("Upright", "Upright", "Spilled",
"Upright", "Upright", "Upright",
"Upright","Spilled","Spilled","Spilled"))
test_df## genre runtime thtr_rel_year imdb_rating imdb_num_votes
## 1 Horror 117 2016 7.3 425738
## 2 Action & Adventure 133 2016 7.8 557110
## 3 Animation 97 2016 6.3 80416
## 4 Comedy 107 2016 6.3 159855
## 5 Science Fiction & Fantasy 118 2017 6.6 270621
## 6 Drama 130 2015 7.8 364011
## 7 Action & Adventure 120 2015 8.1 883042
## 8 Drama 125 2015 4.1 297834
## 9 Animation 86 2017 3.3 56163
## 10 Action & Adventure 154 2017 5.2 135701
## audience_rating
## 1 Upright
## 2 Upright
## 3 Spilled
## 4 Upright
## 5 Upright
## 6 Upright
## 7 Upright
## 8 Spilled
## 9 Spilled
## 10 Spilled
Next, I will predict the score and the predict the intervals of the audience score using my model. Subsequently, I will save the results to a new data frame and add the corresponding movie titles and actual audience score and then analyze the result.
predict_df <- predict(lr_model, test_df, interval = "predict")
mydf <- cbind(test_df, round(predict_df))
mydf$Movie_Title <- c("Split", "Rogue One: A Star Wars Story",
"The Angry Birds Movie", "Central Intelligence",
"Kong: Skull Island", "The Big Short",
"Mad Max: Fury Road", "Fifty Shades of Grey",
"The Emoji Movie","Transformers: The Last Knight")
mydf <- mydf %>%
select(Movie_Title, everything()) %>%
mutate(Actual_score = c(79,86,46,62,69,88,86,41,37,43),
estimate = ifelse(abs(Actual_score - fit) <= 5, "Good Prediction",
ifelse(Actual_score >= lwr & Actual_score <= upr,
"Not great but within the range", "Bad prediction")))
colnames(mydf) <- c("Movie Title", "Genre", "Runtime", "Release Year", "IMDB Rating",
"IMDB Number of Votes", "Audience Rating", "Predicted Audience Score",
"Predicted Lower Interval", "Prediceted Upper Interval", "Actual Audience Score", "Prediction Accuracy")Now let us look at the results.
mydf %>%
kbl() %>%
column_spec(c(1,8,11), bold = T) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center", font_size = 11)| Movie Title | Genre | Runtime | Release Year | IMDB Rating | IMDB Number of Votes | Audience Rating | Predicted Audience Score | Predicted Lower Interval | Prediceted Upper Interval | Actual Audience Score | Prediction Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Split | Horror | 117 | 2016 | 7.3 | 425738 | Upright | 78 | 64 | 92 | 79 | Good Prediction |
| Rogue One: A Star Wars Story | Action & Adventure | 133 | 2016 | 7.8 | 557110 | Upright | 84 | 70 | 98 | 86 | Good Prediction |
| The Angry Birds Movie | Animation | 97 | 2016 | 6.3 | 80416 | Spilled | 54 | 39 | 68 | 46 | Not great but within the range |
| Central Intelligence | Comedy | 107 | 2016 | 6.3 | 159855 | Upright | 70 | 56 | 84 | 62 | Not great but within the range |
| Kong: Skull Island | Science Fiction & Fantasy | 118 | 2017 | 6.6 | 270621 | Upright | 72 | 57 | 87 | 69 | Good Prediction |
| The Big Short | Drama | 130 | 2015 | 7.8 | 364011 | Upright | 83 | 69 | 97 | 88 | Good Prediction |
| Mad Max: Fury Road | Action & Adventure | 120 | 2015 | 8.1 | 883042 | Upright | 89 | 75 | 104 | 86 | Good Prediction |
| Fifty Shades of Grey | Drama | 125 | 2015 | 4.1 | 297834 | Spilled | 27 | 13 | 41 | 41 | Not great but within the range |
| The Emoji Movie | Animation | 86 | 2017 | 3.3 | 56163 | Spilled | 25 | 11 | 40 | 37 | Not great but within the range |
| Transformers: The Last Knight | Action & Adventure | 154 | 2017 | 5.2 | 135701 | Spilled | 36 | 22 | 50 | 43 | Not great but within the range |
Our model succeeded in predicting almost accurately the audience score of half of the movies in the table with an absolute difference of 5 points. Predictions for the rest of the movies were not great but still was within the range of the predicted interval. It is clear that the model improves its prediction accuracy for movies with a higher IMDB score, “Upright” category and higher number of votes. Movies with lower IMDB rating and “Spilled” critic rating suffers from lower prediction accuracy.
To conclude, we have created a linear regression model that can predict the audience score of a movie based on certain conditions. Therefore, we have almost narrowed down the required attributes of a movies in order to predict a movies popularity.
But, the model is far from perfect. More can be done to improve the model’s predictive ability. Addition of more observations with movies of equal variation of different genres could help our model to capture more variability in the data.