library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(DT)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)load("movies.Rdata")The data set comprises of 651 randomly sampled movies produced and released before 2016. Although random sampling was used, the population of movies considered is only Hollywood. Further, the rating of the movie is drawn from popular sites like IMDB and Rotten Tomatoes, whose users might not be representative of the world’s population in general.
Codebook
Conclusions
As our sample might not be a good representation of the world population, we have to careful in making generalized inferences. Further, any inference drawn would be applicable to only Hollywood movies.
Considering this is an observational study (as opposed to an experimental study with random assignment), we will refrain from establishing casual relationships between variables.
Perhaps the greatest art form of the modern age, movies embody all that’s great in the world of media. All rolled into one, we get beautiful visuals, gorgeous music, thought-provoking stories, parables on morality, commentaries on society, insights into history, and so much more. Not to mention, they’re just really friggin’ fun to watch; so why wouldn’t you want to learn more about them?
In this analysis we want to answer the question what makes a movie popular?
We have a data of 651 randomly sampled movies with 32 variables
datatable(movies)Removing 1 duplicate row
movies <- movies[!duplicated(movies),]
nrow(movies)## [1] 650
Audience Scores vary from 11-97 and 50% of these scores range between 46-80.
summary(movies$audience_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.32 80.00 97.00
Audience Scores are left-skewed as well.
ggplot(movies,aes(x=audience_score)) +
geom_histogram()movies %>%
select(audience_score,runtime,imdb_rating,imdb_num_votes,critics_score) %>%
ggpairs()title_type: Documentaries have a higher audience score as compared to Feature Films and TV Movies.
df<-
tapply(movies$audience_score,movies$title_type,mean) %>%
as.data.frame(names="mean_audience_score") %>%
rename("mean_audience_score" = ".") %>%
rownames_to_column("title_type")
ggplot(df,aes(x=title_type,y=mean_audience_score)) +
geom_bar(stat="identity") +
labs(x=NULL, y = "Audience Score")genre: Documentaries and Musicals are rated highly whereas Horror movies are generally not that well accepted.
df<-
sort(tapply(movies$audience_score,movies$genre,mean),decreasing = TRUE) %>%
as.data.frame(names="mean_audience_score") %>%
rename("mean_audience_score" = ".") %>%
rownames_to_column("genre")
ggplot(df,aes(x=reorder(genre, mean_audience_score),y=mean_audience_score)) +
geom_bar(stat="identity") +
labs(x=NULL, y = "Audience Score") +
coord_flip()mpaa_rating: Movies for general audiences are rated slightly higher than movies meant for more matured audience.
df <-
sort(tapply(movies$audience_score,movies$mpaa_rating,mean),decreasing = TRUE) %>%
as.data.frame(names="mean_audience_score") %>%
rename("mean_audience_score" = ".") %>%
rownames_to_column("mpaa_rating")
ggplot(df,aes(x=reorder(mpaa_rating,-mean_audience_score),y=mean_audience_score)) +
geom_bar(stat="identity") +
labs(x=NULL, y = "Audience_Score")If the movie has won or been nominated for the best film at Oscars or featured in top 200 Box Office list on BoxOfficeMojo then it’s more popular among the audience.
At the same time having Oscar awarded actors, actresses or directors in the movie does not seem to impact the audience rating.
df <-
movies %>%
select(best_pic_nom,best_pic_win,best_actor_win,best_actress_win,
best_dir_win,top200_box,audience_score) %>%
gather(Variable,Value, -audience_score) %>%
group_by(Variable,Value) %>%
summarize(mean_audience_score = mean(audience_score))
ggplot(df,aes(x=Variable,y=mean_audience_score,fill=Value)) +
geom_bar(position="dodge",stat="identity") +
labs(x=NULL, y = "Audience_Score")Using the Exploratory Data Analysis, these are the predictor variables which seemed to have some association with the audience score:
Now we will run a forward-selection algorithm to determine the best possible predictor variables based on Adjusted R-Squared
scope <- audience_score~
runtime+
imdb_rating+
imdb_num_votes+
critics_score+
genre+mpaa_rating+
best_pic_nom+
best_pic_win+
top200_box
model <- SignifReg(scope=scope,
data=movies,
alpha=0.05,
direction="forward",
criterion="r-adj",
correction="FDR")
summary(model)##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.708 -6.449 0.625 5.485 50.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.15256 3.11144 -11.941 < 2e-16 ***
## imdb_rating 14.76629 0.57332 25.756 < 2e-16 ***
## genreAnimation 9.11682 3.50121 2.604 0.009432 **
## genreArt House & International 0.03001 2.90680 0.010 0.991767
## genreComedy 2.09165 1.61538 1.295 0.195849
## genreDocumentary 1.20116 1.97792 0.607 0.543879
## genreDrama -0.20340 1.38128 -0.147 0.882977
## genreHorror -5.02802 2.38927 -2.104 0.035733 *
## genreMusical & Performing Arts 4.39752 3.14076 1.400 0.161956
## genreMystery & Suspense -6.25294 1.78054 -3.512 0.000476 ***
## genreOther 1.58199 2.76483 0.572 0.567400
## genreScience Fiction & Fantasy -0.29096 3.50593 -0.083 0.933884
## critics_score 0.06688 0.02144 3.120 0.001893 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.825 on 637 degrees of freedom
## Multiple R-squared: 0.7682, Adjusted R-squared: 0.7638
## F-statistic: 175.9 on 12 and 637 DF, p-value: < 2.2e-16
The final model has an Adjusted R-Squared of 0.7638 and the chosen predictors are:
We will check all the assumptions of a linear regression model:
Using a Histogram, errors seem to be normally distributed and centred at 0.
df <- data.frame(res=model$residuals)
df %>%
ggplot(aes(x=res)) +
geom_histogram() Using a Q-Q Plot, errors seem to be normally distributed as well.
qqnorm(model$residuals)
qqline(model$residuals, col='red')There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors (hence the observations) are uncorrelated.
plot(model$residuals)This assumption is not satisfied. We can clearly see that the errors are more varied for lesser predicted values as compared to higher predicted values. We might need to transform our response variable or predictor variables or both to satisfy this condition.
df <- data.frame(residuals = model$residuals, predicted_values = model$fitted.values)
ggplot(df, aes(x=predicted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0)Using box-cox transformation, the best transformation of response variable is very close to the response variable itself (lambda ~ 1). Thus we will not transform the response variable.
bc <- MASS::boxcox(audience_score ~
imdb_rating+
critics_score+
runtime +
best_pic_nom +
genre
,data=movies)Applying Inverse Transformation to the predictor variable imdb_rating
new_model <- lm(audience_score ~
I(1/imdb_rating)+
genre+
critics_score,
data=movies)We can see that after the transformation the errors - although not perfectly - are randomly scattered around 0. Thus we can asssume that the condition of constant variance is met.
df <- data.frame(residuals = new_model$residuals, predicted_values = new_model$fitted.values)
ggplot(df, aes(x=predicted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0)As the Variation Inflation Factor < 5 for each predictor variable, we can assume that there is no multi-collinearity.
vif <- vif(new_model)
vif[,1]## I(1/imdb_rating) genre critics_score
## 1.816703 1.297577 1.892534
Almost all standardized errors are below the absolute value of 4. Thus we do not have any extreme outliers that may be influencing the regression line.
rstan <- rstandard(new_model) # studentized
plot(rstan)Interpreting the Model:
summary(new_model)##
## Call:
## lm(formula = audience_score ~ I(1/imdb_rating) + genre + critics_score,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.043 -8.557 0.586 8.567 58.276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.98952 3.91377 21.716 < 2e-16 ***
## I(1/imdb_rating) -237.28740 16.65417 -14.248 < 2e-16 ***
## genreAnimation 8.07374 4.35753 1.853 0.06437 .
## genreArt House & International 2.02195 3.61527 0.559 0.57617
## genreComedy 0.35921 2.00657 0.179 0.85798
## genreDocumentary 6.13712 2.44137 2.514 0.01219 *
## genreDrama 0.31551 1.71942 0.183 0.85447
## genreHorror -8.43713 2.96655 -2.844 0.00460 **
## genreMusical & Performing Arts 7.77701 3.90137 1.993 0.04664 *
## genreMystery & Suspense -6.25401 2.21826 -2.819 0.00496 **
## genreOther 1.64148 3.44009 0.477 0.63341
## genreScience Fiction & Fantasy -1.68851 4.36556 -0.387 0.69905
## critics_score 0.26343 0.02326 11.324 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.22 on 637 degrees of freedom
## Multiple R-squared: 0.6411, Adjusted R-squared: 0.6343
## F-statistic: 94.82 on 12 and 637 DF, p-value: < 2.2e-16
confint(new_model)## 2.5 % 97.5 %
## (Intercept) 77.3040726 92.6749693
## I(1/imdb_rating) -269.9911088 -204.5836992
## genreAnimation -0.4831081 16.6305949
## genreArt House & International -5.0773497 9.1212399
## genreComedy -3.5810772 4.2994924
## genreDocumentary 1.3430199 10.9312164
## genreDrama -3.0609153 3.6919290
## genreHorror -14.2625310 -2.6117372
## genreMusical & Performing Arts 0.1159193 15.4381055
## genreMystery & Suspense -10.6099955 -1.8980273
## genreOther -5.1137973 8.3967632
## genreScience Fiction & Fantasy -10.2611487 6.8841194
## critics_score 0.2177493 0.3091118
The current audience score on Rotten Tomatoes is 81 and our model predicts this score to be 80.
Links:
new_movie <- data.frame(title = "La La Land",
genre = "Comedy",
imdb_rating = 8.1,
critics_score = 91)
predict(new_model,new_movie) %>%
round(2)## 1
## 80.03
We are 95% confident that, all else being equal, the predicted audience score for the movie ‘La La Land’ will be between 80 and 83 on average.
predict(new_model, new_movie, interval="confidence")## fit lwr upr
## 1 80.02617 76.79458 83.25775