Submitted by: Preetha Rajan
library(ggplot2)
library(dplyr)
library(statsr)
library(varhandle)
## Warning: package 'varhandle' was built under R version 3.5.2
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.5.2
library(janitor)
## Warning: package 'janitor' was built under R version 3.5.2
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Warning: package 'carData' was built under R version 3.5.2
library(SignifReg)
## Warning: package 'SignifReg' was built under R version 3.5.2
load("movies.Rdata")The data set is comprised of 651 randomly sampled movies produced and released before 2016. Each row in the dataset is a movie and each column is a characteristic of a movie. Therefore, the data should allow us to generalize to the population of interest. It must be kept in mind that this is an observational study in which individuals are observed and/or certain outcomes are measured (in contrast to an experiment where there are treatment and control groups), the data collected cannot be utilized in the establishment of causal relations. Because observational studies are not randomized, they cannot control for all the other un-measurable and confounding factors that may actually be driving the results. Thus, any “link” between cause and effect in observational studies is speculative at best. Caution must be exercised while utilizing ratings data from movie rating websites such as IMDB due to the following reasons:
“IMDB ratings were biased upwards for young movies. When new movies come out, the first people to see them are early adopters and critics. As an example, someone disinterested in a new film may see it eventually, but they are unlikely to see it the first day it comes to their local theater. My contention was that seeking out such pre-releases, in combination with marketing and release hype, would select and reinforce overly-positive movie reviews” - IMDB Score Bias (2018).
The algorithms behind IMDB’s famous top 250 movie list are likely to lean heavily towards the demographic segment of young men between the ages of 18 and 29, as this segment is most likely to make up IMDB’s regular voters. Hence, IMDB’s top 250 movie list is shaped only by its regular users and not by the site’s casual users. As stated by IMDB: “To maintain the effectiveness of the Top Rated Movies and Top Rated TV Shows lists, we do not disclose the criteria used for an IMDb user to be counted as a regular voter.”
According to a report published by Phys.Org, a study published in the Journal of the Academy of Marketing Science, entitled Debates and assumptions about motion picture performance: a meta-analysis, explored what makes a movie a box office success. The report states the following:
“When a movie is first released, it is the ‘star power’ of a popular actor that has the strongest impact at the box office. However, if the movie has been out for a while, the pull of a popular star tends to wane whereas the influence of an actor who has been recognised for their acting abilities remains steady. In fact, choosing a film where the main actor has received awards and recognition for acting is one of the best predictors of movie success.
The other key factor in predicting whether moviegoers will be sitting on the edge of their power-reclining seats is reviews, both by professional movie critics and the general public. There is a widespread assumption that ratings by the general public are gaining more influence over box office performance compared to professional critics’ reviews, but we found that wasn’t the case. So there’s no need for studios to switch their promotional efforts to target users at the expense of critics.
The researchers found critics have a dual role, where they both influence consumers’ movie choice and predict box office performance by reflecting moviegoers’ tastes. And it’s not just how positive the reviews are but also the number of reviews that can predict box office success. The lesson for studios is that they should aim to have their movies reviewed by as many critics as possible.
And for moviegoers - if the movie only has a couple of reviews it’s perhaps not a good sign.
Bottom line: So if you want to improve your chances of picking a great movie, make sure it has a popular actor - preferably someone who has won an Oscar. Check out the critical reviews, they could well be on the money. For movie makers, it’s worth bearing in mind that even with great actors and stellar reviews a movie will not perform well unless distributors are on side - a key determinant of box office success is the number of screens where the movie is released.”
Keeping in mind the above findings regarding the key attributes of a movie that determine its popularity, my research question is as follows:
Is a movie’s audience score (a metric that measures a movie’s popularity) associated with certain potentially key attributes of that movie, such as the genre of the movie, the number of votes and ratings (from the general public) that the movie received on ratings websites such as IMDB, the critics score, Oscar nominations and wins, and whether or not the movie is in the Top 200 Box Office list?
Identifying the key attributes of a movie that determine its popularity is an important task, in that it will allow the developed multiple linear regression model (trained on a portion of the data known as the training set) to be used to make audience score predictions on ‘unseen’ data that was not used to train the model, also called the test set. This step is critical in order to ascertain, at least to some extent, the generalizability of the results by making ‘out of sample’ predictions.
Determining what makes a movie popular is no easy task. Apart from the study cited above, there are other studies that seem to produce contradictory results regarding the key attributes that make a movie popular with the general public. While the study published in the Journal of the Academy of Marketing Science highlighted the key positive role that critics play in determining a movie’s box office success, an NYU study that appeared in a journal called Projections seems to state otherwise: “Critics may be adept at evaluating films, but that doesn’t mean their assessments will accurately predict how much the public will like what they see,” adds co-author Jake Whritner, a graduate of the Cinema Studies Program at NYU’s Tisch School of the Arts and currently part of the Cognitive and Data Science Lab at Rutgers University-Newark. In other words, ‘our taste in movies is highly idiosyncratic and at odds with critics’.
The NYU study explored “the agreement between critics and the general public by considering more than 200 major motion pictures, taking into account popularity, financial success, and critics’ reviews and surveying over 3,000 participants, asking them to give a rating of how much they liked each of the films in the sample that they had previously seen. The researchers also asked participants to provide demographic information (e.g., age, gender) and whether they consider movie critics’ recommendations in choosing which movies to see. Additionally, the researchers gathered reviews from 42 publicly accessible critics or rating sites for each of the 200 films in the sample. The results generally showed low levels of correlation in movie preferences among study participants. Turning to correlations with movie critics, the connection between the ratings of critics and any given participant was no better than the average correlation between participants. Even a critic as well regarded as the late Roger Ebert did no better in predicting how well someone would like a movie than a randomly picked participant in the sample. In contrast, critics agreed with each other relatively strongly” - NYU (2017).
Given contradictory results from such studies, it will be interesting to see the kinds of insights that can be derived from the data set being considered for this project, and whether the findings derived from the developed multiple linear regression model confirm the findings of the studies cited above.
Getting a sense of how R has read in the data set is important, in particular whether R has read in the categorical variables as factors, since the majority of the chosen variables that could potentially serve as appropriate predictors of audience score happen to be categorical.
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
The first step prior to the EDA process is to subset the movies data set to include only those potentially important predictors that logically have an impact on the movie popularity metric ‘audience score’.
#The dplyr package comes in handy here - we use dplyr's select function
#Step 1: Selection of relevant variables. The selected variables are audience_score, genre, critics_score, critics_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win and top200_box
#I am keeping another copy of subset.data.1, named subset.data.1.copy, because R requires categorical variables to remain factors when running a regression model (the EDA copy below will have its factors converted to character vectors)
#The final data set to be used in the EDA process will be named as Movies.Final
#The final data set to be used in the model building process will be named as Movies.Final.Modelling
subset.data.1 <- movies %>% select(title, audience_score, audience_rating, genre, critics_score, title_type, critics_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box, imdb_rating, imdb_num_votes)
subset.data.1.copy <- movies %>% select(title, audience_score, audience_rating, genre, critics_score, title_type, critics_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box, imdb_rating, imdb_num_votes)
#Step 2: Convert the categorical variables that were imported as factors to character variables so as to be able to obtain the counts
#It is for this purpose that the varhandle library has been imported. We use the unfactor function
subset.data.1$critics_rating <- unfactor(subset.data.1$critics_rating)
subset.data.1$title_type <- unfactor(subset.data.1$title_type)
subset.data.1$best_pic_nom <- unfactor(subset.data.1$best_pic_nom)
subset.data.1$best_pic_win <- unfactor(subset.data.1$best_pic_win)
subset.data.1$best_actor_win <- unfactor(subset.data.1$best_actor_win)
subset.data.1$best_actress_win <- unfactor(subset.data.1$best_actress_win)
subset.data.1$best_dir_win <- unfactor(subset.data.1$best_dir_win)
subset.data.1$top200_box <- unfactor(subset.data.1$top200_box)
subset.data.1$audience_rating <- unfactor(subset.data.1$audience_rating)
#Step 3: Include only those rows in the data set that do not have null values
#None of the selected variables contain missing values, so all 651 rows are retained (as the frequency tables below confirm)
Movies.Final <- subset.data.1[complete.cases(subset.data.1),]
Movies.Final.Modelling <- subset.data.1.copy[complete.cases(subset.data.1.copy),]
Let us first get a sense of the data set.
The aim here is to get a general idea of the predominant characteristics of the movies in the data set. For example, are only a few of these movies Oscar winning? What is the most common genre? Are only a few of the movies listed in the Top 200 list?
These questions can be answered through the effective utilization of data visualization and data summarization techniques such as bar plots and contingency tables on categorical variables such as genre, studio, critics rating, best picture wins and nominations, whether the main actor has ever won an Oscar, whether the main actress has ever won an Oscar and whether the director has ever won an Oscar.
For each relevant categorical variable that could serve as a good potential predictor of audience score, a percentage bar plot and a contingency table are presented below as a means of summarizing counts and counts as percentages.
#Critics Rating
p1 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=critics_rating, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#best_pic_nom
p2 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=best_pic_nom, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#best_pic_win
p3 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=best_pic_win, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#best_actor_win
p4 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=best_actor_win, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#best_actress_win
p5 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=best_actress_win, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#best_dir_win
p6 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=best_dir_win, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#top200_box
p7 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=top200_box, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#audience_rating
p8 <- ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=audience_rating, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
grid.arrange(p1, p2, p3, p4, ncol=2)
grid.arrange(p5, p6, p7, p8, ncol=2)
#Genre
ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=genre, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
ggplot(data=Movies.Final) + geom_bar(mapping=aes(x=title_type, y=..prop.., group=1), stat="count") + scale_y_continuous(label=scales::percent_format()) + ylab("percentage of movies")
#Creation of one-way frequency distribution tables by relevant categorical variables
tabyl(Movies.Final, critics_rating) %>%
adorn_totals(c('row', 'col'))
## critics_rating n percent Total
## Certified Fresh 135 0.2073733 135.2074
## Fresh 209 0.3210445 209.3210
## Rotten 307 0.4715822 307.4716
## Total 651 1.0000000 652.0000
tabyl(Movies.Final, best_pic_nom) %>%
adorn_totals(c('row', 'col'))
## best_pic_nom n percent Total
## no 629 0.96620584 629.96621
## yes 22 0.03379416 22.03379
## Total 651 1.00000000 652.00000
tabyl(Movies.Final, best_pic_win) %>%
adorn_totals(c('row', 'col'))
## best_pic_win n percent Total
## no 644 0.98924731 644.989247
## yes 7 0.01075269 7.010753
## Total 651 1.00000000 652.000000
tabyl(Movies.Final, best_actor_win) %>%
adorn_totals(c('row', 'col'))
## best_actor_win n percent Total
## no 558 0.8571429 558.85714
## yes 93 0.1428571 93.14286
## Total 651 1.0000000 652.00000
tabyl(Movies.Final, best_actress_win) %>%
adorn_totals(c('row', 'col'))
## best_actress_win n percent Total
## no 579 0.8894009 579.8894
## yes 72 0.1105991 72.1106
## Total 651 1.0000000 652.0000
tabyl(Movies.Final, best_dir_win) %>%
adorn_totals(c('row', 'col'))
## best_dir_win n percent Total
## no 608 0.93394777 608.93395
## yes 43 0.06605223 43.06605
## Total 651 1.00000000 652.00000
tabyl(Movies.Final, top200_box) %>%
adorn_totals(c('row', 'col'))
## top200_box n percent Total
## no 636 0.97695853 636.97696
## yes 15 0.02304147 15.02304
## Total 651 1.00000000 652.00000
tabyl(Movies.Final, audience_rating) %>%
adorn_totals(c('row', 'col'))
## audience_rating n percent Total
## Spilled 275 0.422427 275.4224
## Upright 376 0.577573 376.5776
## Total 651 1.000000 652.0000
From the above percentage bar plots and one-way frequency distribution tables, the data set seems to consist predominantly of drama feature films that received lower ratings from critics yet are favourable among audiences (as indicated by the ‘Upright’ audience rating from Rotten Tomatoes), and whose lead actor, actress and director have not, so far, won an Oscar during their film industry careers. The majority of the movies were neither nominated for a Best Picture Oscar nor won one, and most do not figure in the Top 200 Box Office list.
Getting a sense of the distribution of audience score across various potential explanatory categorical variables
The next logical step is to get a sense of the distribution of the movie popularity metric audience score (regarded here as the dependent variable) across the various levels of the relevant categorical variables. With this step, we can answer questions such as: are feature films more likely to receive a higher audience score than documentaries? Are Oscar-nominated movies more likely to receive a higher audience score than movies that were not nominated? Are movies that figure in the Top 200 Box Office list more likely to receive a higher audience score than movies that do not? This step in the EDA process is important because it not only reveals important trends in the data but also gives a sense of the relationship between the dependent variable and these categorical variables, for example the likely statistical significance of each potential predictor and the expected sign of its estimated slope coefficient during the modelling process. It involves the construction of boxplots and summary statistics tables for the audience score variable as grouped by the selected categorical variables, giving an idea of the average audience score, the median audience score and the audience score ranges that make up the top 25% (above the 75th percentile) and the bottom 25% (below the 25th percentile).
Judging by the boxplots and Summary Statistics tables below for the chosen categorical variables, the trends in the data seem consistent with logic.
I have also undertaken detailed write-ups for the sake of completeness.
Critics Rating
#Critics Rating - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$critics_rating, data=Movies.Final, main='Audience Score vs. Critics Rating', xlab='Critics Rating', ylab='Audience Score')
#Critics Rating - Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$critics_rating, summary)
## Movies.Final$critics_rating: Certified Fresh
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 71.00 81.00 79.37 87.50 97.00
## --------------------------------------------------------
## Movies.Final$critics_rating: Fresh
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 58.00 74.00 69.97 83.00 94.00
## --------------------------------------------------------
## Movies.Final$critics_rating: Rotten
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.0 36.0 48.0 49.7 64.0 95.0
The boxplot depicting the distribution of the audience score by the three Critics Rating types ‘Certified Fresh’, ‘Fresh’ and ‘Rotten’ seems to convey that movies with a ‘Certified Fresh’ rating receive higher audience scores than movies with a ‘Fresh’ or a ‘Rotten’ rating. While the distributions for movies that received either a ‘Certified Fresh’ or a ‘Fresh’ rating appear skewed to the left, the distribution for movies that received a ‘Rotten’ rating appears skewed to the right. Also, the median and average audience scores for movies with a Certified Fresh and a Fresh rating seem fairly similar. The difference between movies with either a Certified Fresh or a Fresh rating and movies with a Rotten rating in terms of average and median score is striking, and this conforms with logic (in fact, the audience score range that constitutes the bottom 25% for movies with a Certified Fresh or Fresh rating lies above the average and median audience score for movies with a Rotten rating). Also, looking at the maximum audience score for movies that received a Rotten rating (a high score of 95), there might be potential outliers here and an indication of instances where audiences do not necessarily agree with a critic’s assessment of a movie!
Best Picture Nomination
#Best Picture Nomination - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$best_pic_nom, data=Movies.Final, main='Audience Score vs. Best Picture Nomination', xlab='Best Picture Nomination', ylab='Audience Score')
#Best Picture Nomination - Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$best_pic_nom, summary)
## Movies.Final$best_pic_nom: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 45.00 64.00 61.56 79.00 96.00
## --------------------------------------------------------
## Movies.Final$best_pic_nom: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69.00 81.00 86.50 85.32 90.50 97.00
The boxplot depicting the distribution of the audience score by whether or not the movie received a Best Picture Oscar nomination, seems to convey the fact that movies that receive a Best Picture Oscar nomination seem to receive a higher audience score in comparison to movies that have not received such a nomination. As per the box plot, both distributions of audience score by whether or not the concerned movie received a Best Picture Nomination, seem fairly symmetric but at the same time, as per the summary statistics table, there is considerable divergence with regard to which audience score range constitutes the bottom 25% (first quartile), which audience score range constitutes the top 25% (third quartile), the median audience score and the average audience score. This divergence does seem to decrease upon reaching the right end of the distribution (when considering metrics like the top 25% and the maximum audience score). Once again, when a movie is classified as not having received a Best Picture Oscar nomination, perhaps the maximum audience score of 96 might potentially be an outlier or an instance where the assigned audience score is not dependent on whether or not the movie in question received a Best Picture Oscar nomination.
Best Picture Winner
#Best Picture Win - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$best_pic_win, data=Movies.Final, main='Audience Score vs. Best Picture Win', xlab='Best Picture Win', ylab='Audience Score')
#Best Picture Win - Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$best_pic_win, summary)
## Movies.Final$best_pic_win: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.12 79.00 96.00
## --------------------------------------------------------
## Movies.Final$best_pic_win: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69.00 82.00 84.00 84.71 89.50 97.00
The boxplot depicting the distribution of the audience score by whether or not the movie won a Best Picture Oscar seems to convey that movies that won a Best Picture Oscar receive higher audience scores than movies that did not win that award. As per the box plot, the distribution of audience score for movies that did not win a Best Picture Oscar seems fairly symmetric, whereas the distribution for movies that won a Best Picture Oscar seems skewed to the right, with its scores concentrated in the 80s and a longer tail stretching toward the highest scores. This seems logical. As per the summary statistics table, there is considerable divergence with regard to which audience score range (concerning Oscar winning movies and non-Oscar winning movies) constitutes the bottom 25% (first quartile), which audience score range constitutes the top 25% (third quartile), the median audience score and the average audience score. Naturally, the summary statistics pertaining to audience scores for movies that won a Best Picture Oscar are higher than those for movies that did not. This divergence does seem to decrease toward the right end of the distribution (when considering metrics such as the top 25% and the maximum audience score). Once again, when a movie is classified as not having won a Best Picture Oscar, the maximum audience score of 96 might potentially be an outlier, or an instance where the assigned audience score does not depend on whether or not the movie in question won a Best Picture Oscar.
Collinearity Issues between best_pic_nom and best_pic_winner?
Reflecting upon the write-ups for the categorical variables ‘best_pic_nom’ and ‘best_pic_win’, both variables seem to have extremely similar distributions with regard to audience score. Care should be taken before including both variables as potential predictors in the final regression model, as both may be conveying the same thing.
As a quick method of investigation, let us run a linear model taking audience score as the dependent variable and ‘best_pic_nom’ and ‘best_pic_win’ as potential predictors. A metric that can be used to test for multicollinearity among these predictors is the Variance Inflation Factor (VIF): the ratio of the variance of a coefficient estimate in a model with multiple terms to its variance in a model with that term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis and provides an index of how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.
Collinearity.Investigation <- lm(Movies.Final.Modelling$audience_score~Movies.Final.Modelling$best_pic_nom + Movies.Final.Modelling$best_pic_win)
vif(Collinearity.Investigation)
## Movies.Final.Modelling$best_pic_nom Movies.Final.Modelling$best_pic_win
## 1.291434 1.291434
As a rule of thumb, a VIF of 5 or less indicates that the predictors are not problematically collinear. In this case, there does not seem to be a multicollinearity issue. Hence, during the modelling process we can attempt to see whether ‘best_pic_nom’ or ‘best_pic_win’ is the better predictor of audience score.
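To make the definition above concrete, the VIF reported by vif() for a two-predictor model can be reproduced by hand: regress one predictor on the other and compute 1 / (1 - R squared) of that auxiliary regression. The short sketch below is purely illustrative (the 0/1 recoding of the factors is my own assumption for the auxiliary regression, not part of the workflow above) and should reproduce the vif() output shown earlier.
#Illustrative check of the VIF definition (not part of the model-building workflow)
#Recode the two factors as 0/1 indicators for the auxiliary regression
nom01 <- as.numeric(Movies.Final.Modelling$best_pic_nom == "yes")
win01 <- as.numeric(Movies.Final.Modelling$best_pic_win == "yes")
#R squared from regressing one predictor on the other
aux_r_squared <- summary(lm(nom01 ~ win01))$r.squared
#VIF = 1 / (1 - R squared); with only two predictors this matches both vif() values above
1 / (1 - aux_r_squared)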
Best Actor Winner
#Best Actor Win - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$best_actor_win, data=Movies.Final, main='Audience Score vs. Best Actor Win', xlab='Best Actor Win', ylab='Audience Score')
#Best Actor Win - Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$best_actor_win, summary)
## Movies.Final$best_actor_win: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 46.00 65.00 62.21 80.00 96.00
## --------------------------------------------------------
## Movies.Final$best_actor_win: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 48.00 64.00 63.26 81.00 97.00
The boxplot depicting the distribution of the audience score by whether or not one of the main actors had ever won an Oscar during his career in the film industry, seems to convey the fact that movies with main actors having won an Oscar at least once seem to receive fairly similar audience scores in comparison to movies that do not have a main actor who won an Oscar during his film career. Also as per the Box plot, both the distributions of audience score (as classified by whether or not the movie in question features a main actor who has won an Oscar at some point in his movie career), seem fairly symmetric.
These aspects are further conveyed by the summary statistics table where the average audience score, median audience score and the audience score range that constitutes the top 25% (third quartile) and the bottom 25% (the first quartile) seem to be in the same range across levels (levels are defined as an actor with no Oscar win vs. an actor with an Oscar win). Perhaps the variable ‘best_actor_win’ may not be a significant predictor of audience score. But this statement can be confirmed only during the modelling process.
Best Actress Winner
#Best Actress Win - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$best_actress_win, data=Movies.Final, main='Audience Score vs. Best Actress Win', xlab='Best Actress Win', ylab='Audience Score')
#Best Actress Win - Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$best_actress_win, summary)
## Movies.Final$best_actress_win: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.17 80.00 96.00
## --------------------------------------------------------
## Movies.Final$best_actress_win: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 49.00 69.00 63.88 79.25 97.00
The boxplot depicting the distribution of the audience score by whether or not one of the main actresses ever won an Oscar during her career in the film industry seems to convey that movies featuring a main actress who won an Oscar at least once receive fairly similar audience scores compared with movies that do not. Also, as per the box plot, the distribution of audience score for movies whose main actress has never won an Oscar seems fairly symmetric. On the other hand, the distribution for movies whose main actress has won an Oscar at some point in her career seems skewed to the left, with a longer tail stretching toward the lower audience scores while most of the scores lie at the higher end of the distribution.
These aspects are further conveyed by the summary statistics table where the average audience score, median audience score and the audience score range that constitutes the top 25% (third quartile) and the bottom 25% (the first quartile) seem to be in the same range across levels (levels are defined as actress with no Oscar win vs. actress with an Oscar win). Perhaps the variable ‘best_actress_win’ may not be a significant predictor of audience score. But this statement can be confirmed only during the modelling process.
Best Director Winner
#Best Director Win - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$best_dir_win, data=Movies.Final, main='Audience Score vs. Best Director Win', xlab='Best Director Win', ylab='Audience Score')
#Best Director Win - Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$best_dir_win, summary)
## Movies.Final$best_dir_win: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 45.00 65.00 61.86 79.00 96.00
## --------------------------------------------------------
## Movies.Final$best_dir_win: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.00 53.00 73.00 69.51 85.50 97.00
The boxplot depicting the distribution of the audience score by whether or not the director of the movie ever won an Oscar seems to convey that movies whose director won an Oscar at least once during his or her career in the film industry receive higher audience scores than movies whose director has never won an Oscar.
As per the box plot, the distribution of audience score for movies whose director has never won an Oscar seems fairly symmetric. On the other hand, the distribution for movies whose director has won an Oscar at least once seems skewed to the left, with a longer tail stretching toward the lower audience scores while the bulk of the scores lie at the higher end of the distribution.
It must be noted, though, that unlike categorical variables such as ‘best_pic_win’ and ‘best_pic_nom’, where the divergence across levels in summary statistics such as the median audience score, the average audience score and the audience score ranges that constitute the upper and lower quartiles was very large, the divergence across levels for the Best Director winner variable is fairly marginal. Hence, with the summary statistics for audience score across levels (defined in terms of whether or not the director of the movie ever received an Oscar during his or her career) being fairly similar, perhaps the ‘best_dir_win’ variable may not be a significant predictor of audience score.
Top 200 Movies Box Office List
#Top 200 Movies Box Office List - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$top200_box, data=Movies.Final, main='Audience Score vs. Top 200 Box Office List', xlab='Top 200 Box Office List', ylab='Audience Score')
#Top 200 Movies Box Office List- Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$top200_box, summary)
## Movies.Final$top200_box: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.08 79.00 97.00
## --------------------------------------------------------
## Movies.Final$top200_box: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.00 71.00 81.00 74.53 83.50 92.00
The boxplot depicting the distribution of the audience score by whether or not the movie is included in the top 200 box office list seems to convey the fact that movies that were featured in this list, seem to receive a higher audience score in comparison to movies that were not featured in this list.
As per the box plot, the distribution of audience score for movies not featured in this list seems fairly symmetric, but the distribution for movies featured in this list (as compiled by Box Office Mojo) seems heavily skewed to the left, with most scores concentrated at the higher end and a long tail stretching toward a few much less favourable scores.
It must be noted that the audience score has been compiled from the Rotten Tomatoes website and is fairly subjective; it may be subject to biases, such as a movie that was a box office hit (in terms of revenue) nevertheless receiving a mixed or left-skewed score distribution on Rotten Tomatoes depending on the number of voters and their preferences. Furthermore, the Top 200 list from Box Office Mojo is based on a movie’s actual revenue at the box office both in and outside the United States. These might be some of the reasons behind the rather interesting distribution of audience score by whether or not a movie was featured in the Top 200 list. As per the summary statistics table, there is considerable divergence with regard to which audience score range (concerning movies that were featured in the list vs. movies that were not) constitutes the bottom 25% (first quartile), which audience score range constitutes the top 25% (third quartile), the median audience score and the average audience score. This divergence does seem to decrease toward the right end of the distribution (when considering metrics such as the top 25% and the maximum audience score). Once again, a movie not featured in the Top 200 list can still achieve a very high audience score (the maximum of 97), an indication that the assigned audience score does not necessarily depend on whether or not the movie in question was featured in that list.
Genre
#Genre - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$genre, data=Movies.Final, main='Audience Score vs. Genre', xlab='Genre', ylab='Audience Score')
#Genre- Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$genre, summary)
## Movies.Final$genre: Action & Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 37.00 52.00 53.78 65.00 94.00
## --------------------------------------------------------
## Movies.Final$genre: Animation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 59.00 65.00 62.44 70.00 88.00
## --------------------------------------------------------
## Movies.Final$genre: Art House & International
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 51.25 65.50 64.00 80.25 86.00
## --------------------------------------------------------
## Movies.Final$genre: Comedy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 37.00 50.00 52.51 67.50 93.00
## --------------------------------------------------------
## Movies.Final$genre: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 76.25 86.00 82.75 89.00 96.00
## --------------------------------------------------------
## Movies.Final$genre: Drama
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 52.00 70.00 65.35 80.00 95.00
## --------------------------------------------------------
## Movies.Final$genre: Horror
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 36.00 43.00 45.83 53.50 84.00
## --------------------------------------------------------
## Movies.Final$genre: Musical & Performing Arts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.00 75.75 80.50 80.17 89.50 95.00
## --------------------------------------------------------
## Movies.Final$genre: Mystery & Suspense
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 40.50 54.00 55.95 70.50 97.00
## --------------------------------------------------------
## Movies.Final$genre: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 53.00 73.50 66.69 82.50 91.00
## --------------------------------------------------------
## Movies.Final$genre: Science Fiction & Fantasy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 26.00 47.00 50.89 79.00 85.00
From the boxplots above, popular movie genres include documentaries, dramas and musicals (we can expect higher audience scores for these movie genres). Genres like Horror seem less popular. There is also a lot of divergence across the genres (which will be considered as levels by R while running the linear model), concerning the summary statistics such as the mean, median, the lower quartile and upper quartile for audience score.
Type of Movie
#Title Type - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$title_type, data=Movies.Final, main= 'Audience Score vs. Type of Movie', xlab='Type of Movie', ylab='Audience Score')
#Title Type- Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$title_type, summary)
## Movies.Final$title_type: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.00 77.50 86.00 83.25 89.00 96.00
## --------------------------------------------------------
## Movies.Final$title_type: Feature Film
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.50 62.00 60.47 78.00 97.00
## --------------------------------------------------------
## Movies.Final$title_type: TV Movie
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 21.0 75.0 56.8 83.0 86.0
The boxplot depicting the distribution of the audience score by movie type (consisting of three levels: Documentary, Feature Film and TV Movie) seems to convey an audience preference for documentaries! This is an interesting trend, especially since the majority of the movies featured in the data set happen to be Feature Films. The audience in the sample data set also seem to prefer TV Movies to Feature Films! It has to be said, though, that the distributions of audience score for both TV Movies and Documentaries are skewed to the left, with long tails stretching toward the lower scores while most of the scores across these levels lie at the higher end. The distribution of audience scores for Feature Films, on the other hand, is fairly symmetric.
Also, as per the summary statistics table, the divergence between various summary statistics seems narrower when comparing the two levels ‘Documentary’ and ‘TV Movie’, in comparison to ‘Documentary’ vs. ‘Feature Film’.
Including this variable in the model as a potential predictor of audience score might be a bit tricky, given that the audience seems to prefer documentaries to feature films, in that the distribution of audience scores is upwardly biased towards documentaries.
Audience Rating
#Audience Rating - Box Plot
boxplot(Movies.Final$audience_score~Movies.Final$audience_rating, data=Movies.Final, main= 'Audience Score vs. Audience Rating', xlab='Audience Rating', ylab='Audience Score')
#Audience Rating- Summary Statistics Table
by(Movies.Final$audience_score, Movies.Final$audience_rating, summary)
## Movies.Final$audience_rating: Spilled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 35.00 43.00 41.93 51.00 59.00
## --------------------------------------------------------
## Movies.Final$audience_rating: Upright
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 70.0 78.0 77.3 85.0 97.0
The boxplot depicting the distribution of the audience score by the audience rating that the movie received on Rotten Tomatoes seems to convey the fact that movies that receive an upright audience rating, seem to receive higher audience scores in comparison to movies that receive a spilled audience rating. This makes sense with the spilled rating being an unfavourable audience rating and the upright rating being a favourable audience rating.
As per the Box plot, the distribution of audience score by both levels of audience ratings seem to be fairly symmetric.
As per the summary statistics table, there is considerable divergence with regard to which audience score range (concerning movies that received an upright audience rating vs. movies that received a spilled audience rating) constitutes the bottom 25% (first quartile), which audience score range constitutes the top 25% (third quartile), the median audience score and the average audience score. This divergence does seem to marginally decrease upon reaching the right end of the distribution (when considering metrics like the top 25% and the maximum audience score). Given the extent of such a divergence, perhaps audience ratings could be a good predictor of audience score.
Also, recall the following statement from the earlier cited study published in the Journal of the Academy of Marketing Science entitled Debates and assumptions about motion picture performance: a meta-analysis, “The other key factor in predicting whether moviegoers will be sitting on the edge of their power-reclining seats is reviews, both by professional movie critics and the general public”.
Getting a sense of the relationship between audience score and potential explanatory numeric variables
It is also imperative to take a look at the relationship between the dependent variable audience score and numerical variables that could potentially serve as key predictors of the audience score such as imdb_rating, imdb_num_votes and critics_score. Such relationships can be explored by means of scatterplots to assess the form (linear or non-linear), strength (weak, moderate or strong) and direction (positive or negative).
Relationship between Audience Score and IMDB Rating
ggplot(data=Movies.Final, aes(x=imdb_rating, y=audience_score)) + geom_point() + geom_smooth(method='lm', se=FALSE)
cor(Movies.Final$audience_score, Movies.Final$imdb_rating)
## [1] 0.8648652
Relationship between Audience Score and IMDB - Number of Votes
ggplot(data=Movies.Final, aes(x=imdb_num_votes, y=audience_score)) + geom_point() + geom_smooth(method='lm', se=FALSE)
cor(Movies.Final$audience_score, Movies.Final$imdb_num_votes)
## [1] 0.2898128
Relationship between Audience Score and Critics Score
ggplot(data=Movies.Final, aes(x=critics_score, y=audience_score)) + geom_point() + geom_smooth(method='lm', se=FALSE)
cor(Movies.Final$audience_score, Movies.Final$critics_score)
## [1] 0.7042762
Judging by the scatterplots and the signs and magnitudes of the correlation coefficients, imdb_rating has a strong positive linear relationship with audience score, which may be an indication that it could serve as a good predictor of the audience score. imdb_num_votes, on the other hand, has a weak positive linear relationship with audience score, which may be an indication that it might not serve as a good predictor of the audience score. critics_score has a moderately strong positive linear relationship with audience score, which may be an indication that it could serve as a good predictor of the audience score. Note: a moderately strong positive linear relationship is taken here to mean a correlation coefficient in the range of roughly +0.55 to +0.75.
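As a supplementary check (a minimal sketch, not part of the output above), the pairwise correlations among audience_score and the numeric candidate predictors can also be computed in a single matrix; the entry between imdb_rating and critics_score hints at possible collinearity between predictors that are later considered within the same full model.
#Correlation matrix for audience_score and the numeric candidate predictors
numeric_vars <- Movies.Final[, c("audience_score", "imdb_rating", "imdb_num_votes", "critics_score")]
round(cor(numeric_vars), 3)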
Getting a sense of the distribution of the dependent variable audience score
The final step in the EDA process is to examine the distribution of the dependent variable audience score by means of a histogram:
hist(Movies.Final$audience_score, main="Histogram depicting the distribution of Audience Score", xlab="Audience Score")
From the histogram above, interestingly, the audience score seems concentrated in the 70-90 range, which lies towards the right tail of the distribution. This is interesting especially since the EDA process has revealed that the data set consists predominantly of drama feature films that received lower ratings from critics yet are favourable among audiences (as indicated by the ‘Upright’ audience rating from Rotten Tomatoes), and whose lead actor, actress and director have not won an Oscar during their film industry careers so far. As mentioned earlier, the majority of the movies were neither nominated for a Best Picture Oscar nor won one, and most do not figure in the Top 200 Box Office list.
Before undertaking the modelling process, we must be aware that there are certain variables in the data set such as imdb_rating and audience_rating that convey the same information, namely the rating assigned to a specific movie by users of movie ratings websites such as IMDB and Rotten Tomatoes. The only aspect that is different across these variables is their type. imdb_rating is a numeric variable and audience_rating is a categorical variable. Similarly, the variables critics_score and critics_rating convey the same information, namely the ratings assigned by critics to a specific movie. Once again, the only differing aspect is the variable type, wherein critics_rating is a categorical variable and critics_score is a numeric variable. Separate models will be run to decipher whether or not audience_rating is a more statistically significant predictor in comparison to imdb_rating and whether or not critics_rating is a more statistically significant predictor in comparison to critics_score.
The best model will be chosen on the basis of the Adjusted R Squared criterion via the backward selection process.
The first step is to split the original data set into a training set and test set. A training set is the segment of the original data set that you can use to train your model and find optimal parameters. A test set is the segment of the data that you can use to test your trained model and see how well it generalises, in terms of making predictions utilizing the developed modelling framework on ‘unseen’ data.
People usually tend to start with an 80%-20% split (80% training set - 20% test set). This is a rule of thumb, and the split percentage can be adjusted depending on the amount of available data.
The key principle to understand is that the more samples the lower the variance. So you need the training set to be big enough to achieve low variance over the model parameters.
Similarly for test data, you also want enough data to observe low variance among the performance results.
The idea is to split the data to achieve low variance in both cases. If your data set is big enough to achieve low variance on the training parameters, increasing the training set any further won’t help much but will increase the training time.
The iterative process will be undertaken using the SignifReg function.
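For intuition, the backward elimination on adjusted R squared that SignifReg automates can be sketched manually: starting from the full model, repeatedly drop the single predictor whose removal increases the adjusted R squared the most, and stop when no removal improves it. The function below is only an illustration of that idea under my own naming (it is not the package's internal code and is not used to produce the results that follow).
#Illustrative sketch of backward elimination using the adjusted R squared criterion
backward_adj_r2 <- function(response, predictors, data) {
  current <- predictors
  repeat {
    best_adj <- summary(lm(reformulate(current, response), data = data))$adj.r.squared
    to_drop <- NULL
    for (p in current) {
      reduced <- setdiff(current, p)
      if (length(reduced) == 0) next
      adj <- summary(lm(reformulate(reduced, response), data = data))$adj.r.squared
      if (adj > best_adj) {
        best_adj <- adj
        to_drop <- p
      }
    }
    if (is.null(to_drop)) break  #no single removal improves adjusted R squared
    current <- setdiff(current, to_drop)
  }
  lm(reformulate(current, response), data = data)
}
#Example usage (train_80 is created in the next code chunk):
#backward_adj_r2("audience_score", c("genre", "critics_score", "imdb_rating"), train_80)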
#Split the data set into the training set and test set in the 80:20 ratio
#The floor(x) function in R rounds x down to the largest integer not greater than x
#80% of 651 rows is 520.8 - the floor function is rounding this down to 520 to give you a
#training set size of 520
#Setting a seed is an extremely important step to be able to make your results reproducible
set.seed(3)
samp_size <- floor(0.80*nrow(Movies.Final.Modelling))
#The seq_len function creates a sequence that starts at 1 and, in steps of 1, finishes at the supplied value (which in this case is the number of rows in the final movie data set). A common use of this function is to create indexes that match the length of a vector, for example when making plots.
#What train_ind contains are the row numbers from the original data set that constitutes the training set
#The sample function ensures that the original data set is split at random into the training and test set
train_ind <- sample(seq_len(nrow(Movies.Final.Modelling)), size = samp_size)
train_80 <- Movies.Final.Modelling[train_ind, ]
test_20 <- Movies.Final.Modelling[-train_ind, ]
We will be considering four full models in order to choose between audience_rating and imdb_rating, and between critics_score and critics_rating, as potential predictors of audience_score. As alluded to earlier, it does not make sense to include predictors in the model that convey the same information just to bump up fit statistics such as the R Squared.
#Define the scope parameter for each of the full four models
#This model takes critics_score and imdb_rating into account
scope.full.model.1 <- audience_score ~ genre + critics_score + title_type + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box + imdb_rating + imdb_num_votes
#This model takes critics_score and audience_rating into account
scope.full.model.2 <- audience_score ~ genre + critics_score + title_type + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box + audience_rating + imdb_num_votes
#This model takes critics_rating and audience_rating into account
scope.full.model.3 <- audience_score ~ genre + critics_rating + title_type + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box + audience_rating + imdb_num_votes
#This model takes critics_rating and imdb_rating into account
scope.full.model.4 <- audience_score ~ genre + critics_rating + title_type + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box + imdb_rating + imdb_num_votes
#Using the SignifReg function to get the output for the 1st model
#We are utilizing backward selection and the Adjusted R Squared criterion to choose the best
#parsimonious model
model_one_80_20_Adj_R <- SignifReg(scope=scope.full.model.1, data =train_80, direction= 'backward', criterion='r-adj', correction='None')
summary(model_one_80_20_Adj_R)
##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.135 -6.366 0.261 5.245 44.810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -38.21854 3.35030 -11.408 < 2e-16 ***
## genreAnimation 3.58739 4.45726 0.805 0.42129
## genreArt House & International 1.17783 2.94663 0.400 0.68953
## genreComedy 1.28974 1.73408 0.744 0.45737
## genreDocumentary 0.27531 2.10101 0.131 0.89580
## genreDrama 0.08727 1.49956 0.058 0.95362
## genreHorror -6.39109 2.58137 -2.476 0.01362 *
## genreMusical & Performing Arts 5.64342 3.46362 1.629 0.10386
## genreMystery & Suspense -5.39738 1.91122 -2.824 0.00493 **
## genreOther -1.34552 3.07286 -0.438 0.66167
## genreScience Fiction & Fantasy 0.05778 3.81564 0.015 0.98792
## critics_score 0.06403 0.02302 2.782 0.00561 **
## best_pic_nomyes 4.65207 2.33877 1.989 0.04723 *
## best_actress_winyes -3.42244 1.35057 -2.534 0.01158 *
## imdb_rating 14.99136 0.62044 24.162 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.466 on 505 degrees of freedom
## Multiple R-squared: 0.7904, Adjusted R-squared: 0.7846
## F-statistic: 136 on 14 and 505 DF, p-value: < 2.2e-16
model_two_80_20_Adj_R <- SignifReg(scope=scope.full.model.2, data =train_80, direction= 'backward', criterion='r-adj', correction='None')
summary(model_two_80_20_Adj_R)
##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9422 -6.3511 0.2577 6.4476 21.2042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.416e+01 1.358e+00 25.157 < 2e-16 ***
## genreAnimation 2.237e+00 4.109e+00 0.544 0.586386
## genreArt House & International 1.001e+00 2.748e+00 0.364 0.715917
## genreComedy 7.143e-01 1.598e+00 0.447 0.654960
## genreDocumentary 6.637e+00 1.975e+00 3.361 0.000836 ***
## genreDrama 6.292e-01 1.386e+00 0.454 0.649985
## genreHorror -2.217e+00 2.398e+00 -0.924 0.355781
## genreMusical & Performing Arts 8.188e+00 3.213e+00 2.549 0.011105 *
## genreMystery & Suspense 4.752e-02 1.751e+00 0.027 0.978355
## genreOther -2.003e+00 2.833e+00 -0.707 0.479876
## genreScience Fiction & Fantasy -4.878e+00 3.511e+00 -1.389 0.165399
## critics_score 1.804e-01 1.819e-02 9.917 < 2e-16 ***
## best_pic_nomyes 4.434e+00 2.157e+00 2.056 0.040341 *
## audience_ratingUpright 2.698e+01 1.028e+00 26.243 < 2e-16 ***
## imdb_num_votes 1.939e-05 3.800e-06 5.103 4.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.735 on 505 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8165
## F-statistic: 166 on 14 and 505 DF, p-value: < 2.2e-16
model_three_80_20_Adj_R <- SignifReg(scope=scope.full.model.3, data =train_80, direction= 'backward', criterion='r-adj', correction='None')
summary(model_three_80_20_Adj_R)
##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.2869 -6.3763 0.3203 6.8362 19.1575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.577e+01 1.874e+00 24.428 < 2e-16 ***
## genreAnimation 3.447e+00 4.318e+00 0.798 0.42502
## genreArt House & International 6.876e-01 2.890e+00 0.238 0.81206
## genreComedy 8.627e-01 1.679e+00 0.514 0.60758
## genreDocumentary 9.307e+00 2.040e+00 4.563 6.34e-06 ***
## genreDrama 1.771e+00 1.448e+00 1.223 0.22194
## genreHorror -1.451e+00 2.519e+00 -0.576 0.56503
## genreMusical & Performing Arts 1.003e+01 3.368e+00 2.978 0.00304 **
## genreMystery & Suspense 8.845e-01 1.843e+00 0.480 0.63146
## genreOther -4.772e-01 2.975e+00 -0.160 0.87262
## genreScience Fiction & Fantasy -5.490e+00 3.695e+00 -1.486 0.13797
## critics_ratingFresh -5.032e-01 1.239e+00 -0.406 0.68477
## critics_ratingRotten -6.786e+00 1.333e+00 -5.090 5.07e-07 ***
## best_pic_nomyes 5.516e+00 2.280e+00 2.420 0.01589 *
## audience_ratingUpright 2.907e+01 1.045e+00 27.811 < 2e-16 ***
## imdb_num_votes 2.036e-05 4.218e-06 4.828 1.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.181 on 504 degrees of freedom
## Multiple R-squared: 0.8032, Adjusted R-squared: 0.7973
## F-statistic: 137.1 on 15 and 504 DF, p-value: < 2.2e-16
model_four_80_20_Adj_R <- SignifReg(scope=scope.full.model.4, data =train_80, direction= 'backward', criterion='r-adj', correction='None')
summary(model_four_80_20_Adj_R)##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.173 -5.906 0.477 5.272 45.297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.1394 4.0939 -8.095 4.32e-15 ***
## genreAnimation 3.2699 4.4481 0.735 0.462607
## genreArt House & International 0.8181 2.9405 0.278 0.780955
## genreComedy 1.4209 1.7316 0.821 0.412289
## genreDocumentary 0.5819 2.0962 0.278 0.781417
## genreDrama 0.3215 1.4971 0.215 0.830040
## genreHorror -6.2231 2.5758 -2.416 0.016046 *
## genreMusical & Performing Arts 5.7190 3.4537 1.656 0.098367 .
## genreMystery & Suspense -5.2169 1.9209 -2.716 0.006836 **
## genreOther -0.4443 3.0512 -0.146 0.884277
## genreScience Fiction & Fantasy -0.1783 3.8162 -0.047 0.962749
## critics_ratingFresh -2.4858 1.1932 -2.083 0.037728 *
## critics_ratingRotten -5.0495 1.3704 -3.685 0.000254 ***
## best_actress_winyes -2.9955 1.3344 -2.245 0.025217 *
## imdb_rating 15.2610 0.5410 28.210 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.453 on 505 degrees of freedom
## Multiple R-squared: 0.791, Adjusted R-squared: 0.7852
## F-statistic: 136.5 on 14 and 505 DF, p-value: < 2.2e-16
From the regression output for each of the four candidate models, the model that includes critics score and audience rating among its potential predictors of audience score (model_two_80_20_Adj_R) appears to have the best predictive power of the four, judging by the Adjusted R Squared statistic. The signs of its coefficients are also consistent with the trends uncovered during the EDA process: for example, documentary movies receive higher audience scores on average, movies with higher critics scores receive higher audience scores on average, and movies with a favourable (Upright) audience rating receive higher audience scores on average.
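As a compact side-by-side comparison (a minimal sketch, assuming the four SignifReg fits above behave like standard lm objects, as their summaries suggest), the Adjusted R Squared values can be extracted directly from the fitted objects:
#Extracting the Adjusted R Squared value from each of the four candidate models
adj_r_sq <- c(model_one   = summary(model_one_80_20_Adj_R)$adj.r.squared,
              model_two   = summary(model_two_80_20_Adj_R)$adj.r.squared,
              model_three = summary(model_three_80_20_Adj_R)$adj.r.squared,
              model_four  = summary(model_four_80_20_Adj_R)$adj.r.squared)
round(adj_r_sq, 4)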
Model Diagnostics
Now that the final parsimonious model is in place, the next step (prior to interpreting the slope coefficients to assess the magnitude of the impact of the final set of predictors on audience score) is model diagnostics, where, with the help of visualization tools such as a histogram, a normal probability plot and a scatterplot, we check whether the conditions required for the method of Ordinary Least Squares to be valid actually hold.
Normality of Residuals
The first condition to test is the normality of the residuals: the residuals should be nearly normally distributed and centered at a mean of 0. It is important that this condition is satisfied in order to suitably apply tests of statistical significance and to construct confidence intervals for each individual predictor in the model, as such inference procedures rest on the normality assumption. The distribution of residuals may be highly skewed due to the presence of influential observations called outliers, which can exert a disproportionate influence on the parameter estimates. In the presence of non-normal residuals caused by outliers, it is difficult to gauge aspects such as the width of a confidence interval (it may be too wide or too narrow).
The normality condition is being tested utilizing two data visualization tools - a histogram and a normal probability plot.
The histogram enables us to gauge the length of the tails of the distribution (in terms of how skewed the distribution of residuals really is) and the centre of the distribution (whether the centre of the distribution lies at 0).
The normal probability plot is a plot of the fractiles of the residual distribution vs. the fractiles of a normal distribution having the same mean and variance. A fractile is the cut off point for a certain fraction of a sample. If your distribution is known, then the fractile is just the cut-off point where the distribution reaches a certain probability. The normal probability plot enables us to examine the extent of the deviation of the residual distribution from the diagonal line. Small deviations are alright but large deviations are indicative of the disproportionate influence exerted by outliers, that causes the distribution of the residuals to be skewed and non-normal.
#The following R code plots a histogram of the residuals
#We use the hist function
hist(model_two_80_20_Adj_R$residuals, main="Histogram Depicting the Distribution of Residuals for the Chosen Parsimonious Model", xlab="Model Residuals", ylab="Frequency Counts")#The following R code plots a normal probability plot of the residuals
#We use the qqnorm function to construct the normal probability plot
#We use the qqline function to add the diagonal reference line against which the distribution of the residuals is judged.
#Deviations away from the diagonal at the left-hand or right-hand corner indicate the extent to which the distribution is skewed.
#Small deviations from the diagonal line are acceptable; large deviations point to skewness and the influence of outliers.
qqnorm(model_two_80_20_Adj_R$residuals)
qqline(model_two_80_20_Adj_R$residuals)It appears from the histogram and the normal probability plot that the normality condition has been reasonably met.
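As a supplementary numeric check (not part of the visual assessment above, and quite sensitive with a training set of this size), a Shapiro-Wilk test from base R can be applied to the residuals; its result should be weighed alongside the plots rather than in place of them.
#Supplementary Shapiro-Wilk test of normality on the model residuals
shapiro.test(model_two_80_20_Adj_R$residuals)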
Homoscedasticity or Constant Spread of Residuals and Linearity
The next condition to verify is whether the residuals have a constant spread across different values of the explanatory variables. Remember that the regression line in essence captures the average relationship between the dependent variable and the explanatory variables. Homoscedasticity implies that the variation about the regression line remains the same across different values of an explanatory variable; it does not change as the explanatory variable changes. For example, the chosen model indicates that audience score increases on average with an 'Upright' audience rating; homoscedasticity requires that the variation of audience score remains constant while moving from the reference level (a 'Spilled' audience rating) to the non-reference level (an 'Upright' audience rating).
Therefore, there is neither a tendency for negative residuals at smaller fitted values of the dependent variable nor a tendency for positive residuals at larger fitted values of the dependent variable.
Non-linear patterns in the data present themselves as residuals that are clustered at small fitted values of the dependent variable and widely scattered at larger fitted values. Like the homoscedasticity condition, the linearity condition will be tested by means of a scatterplot with the residuals on the y-axis and the fitted values of the dependent variable on the x-axis, looking out for clustered and widely scattered residuals. If the linearity condition is not satisfied, then we are in essence specifying the incorrect model form for predicting movie audience score!
Hence, there should be no tendency for the residuals to assume a ‘fan shape’ in relation to the fitted values of the dependent variable. We are looking out for randomly scattered residuals about the zero line in relation to the dependent variable.
It must also be noted that the OLS estimator is known as the BLUE estimator (the best linear unbiased estimator). The term 'best' in this case means that the estimator has minimum variance. If the homoscedasticity condition is violated, the OLS estimator is no longer the minimum-variance estimator.
Also, ideally, a good predictive model will have the majority of the variation in the dependent variable explained by the chosen predictors and hence, we should not expect the residuals to show any specific patterns in the residual plot in relation to either the dependent variable or the explanatory variables.
#The following is the R code to plot a scatter diagram that examines the residuals vs. fitted values of the dependent variable
#We will be making use of the plot function and the abline function
plot(model_two_80_20_Adj_R$residuals~model_two_80_20_Adj_R$fitted.values, main="Scatter Plot Depicting Residuals vs. Fitted Values of Audience Score", ylab="Residuals", xlab="Fitted Values - Audience Score")
#abline draws a horizontal line in the plot at y=0; the line type argument (lty) has also been specified
abline(h=0, lty=3)Judging from the above scatterplot, the residuals seem fairly randomly scattered about the zero line. There is neither a tendency for negative residuals at smaller fitted values of the dependent variable nor a tendency for positive residuals at larger fitted values. There is also neither a tendency for clustered residuals at smaller fitted values nor a tendency for widely scattered residuals at larger fitted values.
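As an additional visual check of the constant-spread condition against an individual numeric predictor (a sketch, assuming no rows of train_80 were dropped for missing values, so that the residuals line up row-wise with the training data), the residuals can also be plotted against critics score:
#Residuals plotted against a single numeric predictor rather than the fitted values
plot(model_two_80_20_Adj_R$residuals ~ train_80$critics_score, main="Scatter Plot Depicting Residuals vs. Critics Score", ylab="Residuals", xlab="Critics Score")
abline(h=0, lty=3)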
Interpretation of the Final Model Output
Recall the summary output of the final chosen parsimonious model:
summary(model_two_80_20_Adj_R)##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9422 -6.3511 0.2577 6.4476 21.2042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.416e+01 1.358e+00 25.157 < 2e-16 ***
## genreAnimation 2.237e+00 4.109e+00 0.544 0.586386
## genreArt House & International 1.001e+00 2.748e+00 0.364 0.715917
## genreComedy 7.143e-01 1.598e+00 0.447 0.654960
## genreDocumentary 6.637e+00 1.975e+00 3.361 0.000836 ***
## genreDrama 6.292e-01 1.386e+00 0.454 0.649985
## genreHorror -2.217e+00 2.398e+00 -0.924 0.355781
## genreMusical & Performing Arts 8.188e+00 3.213e+00 2.549 0.011105 *
## genreMystery & Suspense 4.752e-02 1.751e+00 0.027 0.978355
## genreOther -2.003e+00 2.833e+00 -0.707 0.479876
## genreScience Fiction & Fantasy -4.878e+00 3.511e+00 -1.389 0.165399
## critics_score 1.804e-01 1.819e-02 9.917 < 2e-16 ***
## best_pic_nomyes 4.434e+00 2.157e+00 2.056 0.040341 *
## audience_ratingUpright 2.698e+01 1.028e+00 26.243 < 2e-16 ***
## imdb_num_votes 1.939e-05 3.800e-06 5.103 4.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.735 on 505 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8165
## F-statistic: 166 on 14 and 505 DF, p-value: < 2.2e-16
The final model explains roughly 82% of the variation in audience score using the chosen predictors: movie genre, audience rating, critics score, the number of votes on IMDB, and whether or not the movie was nominated for a Best Picture Oscar.
The genre of the movie, the critics score, the number of votes the movie received on IMDB, the Rotten Tomatoes audience rating and whether the movie received a Best Picture Oscar nomination all appear to be statistically significant predictors of audience score. We began with a full model of 11 potential predictors of audience score; through backward selection using the Adjusted R Squared criterion, we arrived at a parsimonious model with five predictors.
Critics score slope coefficient interpretation: if the critics score for a movie increases by 1 point, we expect, on average, the audience score for that movie to increase by about 0.2 points, holding everything else constant.
IMDB number of votes slope coefficient interpretation: if the movie receives one additional vote on the IMDB ratings website, we expect, on average, the audience score for that movie to increase marginally, by about 0.00002 points, holding everything else constant.
Audience rating slope coefficient interpretation: in the regression output produced by R, the audience rating level 'Spilled' is the reference level, and the audience score for that level is estimated by plugging in zero for the slope coefficient labelled audience_ratingUpright. So, holding everything else constant, moving from the reference level (an audience rating of Spilled, the less favourable rating on the Rotten Tomatoes website) to the non-reference level (an audience rating of Upright, the more favourable rating), we expect the audience score to increase by about 27 points on average.
Genre slope coefficients interpretation: I am aware that many levels of the explanatory variable genre are statistically insignificant, but I am not removing the variable from the model because some levels are statistically significant predictors of audience score. Judging by the slope coefficients, the most popular genres appear to be Documentary, Musical & Performing Arts, Animation and Drama, which reflects the findings from the EDA process depicted earlier. For example, holding everything else constant, moving from a movie that is not a documentary to a movie that is a documentary, we expect the audience score to increase by about 7 points on average. Genres such as Horror and Science Fiction & Fantasy appear less popular with audiences: holding everything else constant, moving from a movie that is not a horror movie to a movie that is a horror movie, we expect the audience score to decrease by about 2 points on average.
Best Picture nomination slope coefficient interpretation: in the regression output produced by R, the Best Picture nomination level 'no' is the reference level, and the audience score for that level is estimated by plugging in zero for the slope coefficient labelled best_pic_nomyes. So, holding everything else constant, moving from the reference level (the movie not receiving a Best Picture Oscar nomination) to the non-reference level (the movie having received a Best Picture Oscar nomination), we expect the audience score to increase by about 4 points on average.
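To see how these slope coefficients combine into a single fitted value, a hypothetical movie can be passed to predict(); the predictor values below are made up purely for illustration and do not correspond to any movie in the data set.
#Illustrative prediction for a single hypothetical movie; all values are invented for illustration only
hypothetical_movie <- data.frame(genre = "Documentary", critics_score = 85, best_pic_nom = "no", audience_rating = "Upright", imdb_num_votes = 50000)
predict(model_two_80_20_Adj_R, hypothetical_movie)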
We are going to use the model created earlier (model_two_80_20_Adj_R) to predict the audience scores for the 131 movies in the test set. Recall that we earlier allocated 80% of the data to the training set and 20% to the test set. First we create a new data frame containing the predictor columns for these movies.
newmovies <- test_20 %>% select(genre, critics_score, best_pic_nom, audience_rating, imdb_num_votes)
#We can also construct a prediction interval around this prediction, which will
#provide a measure of uncertainty around the prediction.
predictions <- predict(model_two_80_20_Adj_R, newmovies, interval="prediction", level=0.95)
head(predictions, n=5)## fit lwr upr
## 1 69.90025 52.65761 87.14290
## 2 79.32125 62.08278 96.55973
## 3 37.89764 20.63345 55.16183
## 4 70.07060 52.82262 87.31859
## 5 82.11224 64.75850 99.46598
As an example for the validation of predictions, consider the first 5 movies in the test set: a) “Filly Brown”
b) “The Dish”
c) “Mad Dog Time”
d) “Fallen”
e) “The Yes Men Fix the World”
The model predicts that the movie “Filly Brown” in the test set will have an audience score of approximately 70.
The model predicts that the movie “The Dish” in the test set will have an audience score of approximately 79.
The model predicts that the movie “Mad Dog Time” in the test set will have an audience score of approximately 38.
The model predicts that the movie “Fallen” in the test set will have an audience score of approximately 70.
The model predicts that the movie “The Yes Men Fix the World” in the test set will have an audience score of approximately 82.
With 95% confidence, the model predicts that the movie “Filly Brown” will have an audience score between 53 and 87.
With 95% confidence, the model predicts that the movie “The Dish” will have an audience score between 62 and 97.
With 95% confidence, the model predicts that the movie “Mad Dog Time” will have an audience score between 21 and 55.
With 95% confidence, the model predicts that the movie “Fallen” will have an audience score between 53 and 87.
With 95% confidence, the model predicts that the movie “The Yes Men Fix the World” will have an audience score between 65 and 99.
Recall the first 5 movie titles and audience scores for the 131 movies present in the test set:
test_set <- test_20 %>% select(title, audience_score) %>% head(n=5)
test_set## # A tibble: 5 x 2
## title audience_score
## <chr> <dbl>
## 1 Filly Brown 73
## 2 The Dish 81
## 3 Mad Dog Time 47
## 4 Fallen 73
## 5 The Yes Men Fix the World 77
The actual audience score for the movie “Filly Brown” is 73. Our prediction interval contains this value.
The actual audience score for the movie “The Dish” is 81. Our prediction interval contains this value.
The actual audience score for the movie “Mad Dog Time” is 47. Our prediction interval contains this value.
The actual audience score for the movie “Fallen” is 73. Our prediction interval contains this value.
The actual audience score for the movie “The Yes Men Fix The World” is 77. Our prediction interval contains this value.
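Beyond these first five movies, overall out-of-sample accuracy can be summarised across all 131 test movies with the root mean squared error of the point predictions and the empirical coverage of the 95% prediction intervals (a sketch, assuming the predictions matrix computed above is in the same row order as test_20):
#Root mean squared error of the point predictions on the test set
actual_scores <- test_20$audience_score
sqrt(mean((actual_scores - predictions[, "fit"])^2))
#Proportion of test-set movies whose actual audience score falls inside the 95% prediction interval
mean(actual_scores >= predictions[, "lwr"] & actual_scores <= predictions[, "upr"])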
The chosen model framework demonstrates that it is possible to predict a movie’s popularity, as measured by audience score, with only five predictors: audience rating, genre, critics score, the number of votes the movie received on the IMDB website, and whether or not the movie was nominated for a Best Picture Oscar. These results reflect the findings of the Journal of the Academy of Marketing Science study cited earlier in this document, namely that a movie’s popularity (measured here in terms of audience score) is driven by ratings from both the general public and critics alike, as well as the number of votes the movie receives on ratings websites such as IMDB. Interestingly, a movie’s genre (a factor not explicitly accounted for in the cited studies) also plays a role as a predictor of popularity. And while the cited studies stressed the importance of the main actor’s Oscar wins and nominations, in this analysis it was whether or not the movie had been nominated for a Best Picture Oscar that played a statistically significant role in predicting audience score.
The NYU study, however, stated that audiences appear to be at odds with critics concerning what constitutes a good movie. The modelling framework here does not confirm this: if the critics score for a movie increases by 1 point, the audience score can be expected to increase by about 0.2 points, indicating a positive relationship between critics score and audience score.
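As a quick supplementary check on the direction of this relationship in the training data (separate from the model itself), the simple correlation between critics score and audience score can be computed:
#Simple correlation between critics score and audience score in the training data
cor(train_80$critics_score, train_80$audience_score)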
A potential shortcoming is that the model’s predictive power is limited by how representative the sample data is; a larger number of observations in the test set, capturing more of the variability in the population, would be required to obtain a better measure of the model’s accuracy.
Another point worth noting is that the data set could also include numerical metrics of box office success, such as the revenue the movie generated both in the US and overseas; information on the number of screens on which a movie was released could likewise be a potential predictor of box office success. Future studies could consider a much larger data set of movies, including not just films released in the last 10-40 years but also the so-called classics, which would make it possible to examine how revenue varies between newer releases and the classics. Two studies could be carried out: the first using revenue as the dependent measure of box office success, and the second using audience score, supplemented with additional information such as the number of screens on which the movie was released. Such studies could assess which is the more realistic dependent variable for measuring box office success.