library(ggplot2)
library(dplyr)
library(statsr)
library(broom)
library(ggplot2)
library(car)
library(kableExtra)
library(DT)
library(SignifReg)
library(GGally)load("movies.Rdata")According to the codebook, these are randomly sampled 651 movies that were produced before 2016. It is an observational study.
datatable(movies)However, we observe that data contains movies only from hollywood and thus it can not be used to make sound judgement about movies from other countries.
Also, few very people actually provide ratings, so the provided data would be biased.
What makes a movie popular?
How accurately can we predict the populatiry?
I will look into various data columns and study if there is any association between the movie ratings and audience and critic scores. I would also see if the ratings depend on the genre type of the movies.
Recently I watched a movie without reading the reviews or checking the ratings and my experience wasn’t good. So, I would like to see how the audience rating is calculated and what factors do they consider when they like a movie. I would also like to see how if IMDB rating is associated with the rating in Rotten Tomatoes.
Please note: I am checking if there are any NA values in my variables of interest.
df <- movies[,c("runtime","imdb_rating","critics_score","audience_score")]
ggpairs(df)sum(is.na(movies$runtime))## [1] 1
cor(movies$audience_score, movies$runtime)## [1] NA
There is exactly 1 movie for which we aren’t provided with the data for runtime. We would look into the movie and compare it with the movie of similar genre so as to fix the runtime.
which(is.na(movies$runtime))## [1] 334
movies[334,c("title","title_type","runtime","genre")]## # A tibble: 1 x 4
## title title_type runtime genre
## <chr> <fct> <dbl> <fct>
## 1 The End of America Documentary NA Documentary
movies[334,"runtime"] <- 74So, we find the row number and observe that the runtime was NA for ‘The end of America’ documentary. Hence,
I google the runtime of the documentary - ‘The end of America’, it is around 74 minutes, so I replace NA with 74
cor(movies$audience_score, movies$runtime)## [1] 0.1793614
Here the strength of associaltion less than 20%.
We will look into the distribution of runtime.
hist(movies$runtime)To check the distribution of runtime across the title_type-
boxplot(movies$runtime ~ movies$title_type)There are few outliers for documentary and feature films.
The average runtime across the different movie title_type-
movies %>%
select(title_type, runtime) %>%
group_by(title_type) %>%
summarize(avg_run = mean(runtime)) ## # A tibble: 3 x 2
## title_type avg_run
## <fct> <dbl>
## 1 Documentary 97.3
## 2 Feature Film 107.
## 3 TV Movie 102.
unique(movies$title_type)## [1] Feature Film Documentary TV Movie
## Levels: Documentary Feature Film TV Movie
unique(movies$genre)## [1] Drama Comedy
## [3] Horror Documentary
## [5] Action & Adventure Art House & International
## [7] Musical & Performing Arts Mystery & Suspense
## [9] Animation Science Fiction & Fantasy
## [11] Other
## 11 Levels: Action & Adventure Animation ... Science Fiction & Fantasy
unique(movies$critics_rating)## [1] Rotten Certified Fresh Fresh
## Levels: Certified Fresh Fresh Rotten
unique(movies$audience_rating)## [1] Upright Spilled
## Levels: Spilled Upright
Now that we have looked at the different variables, we would fit a model using the model using forward selection method, based on Adjusted R^2
scope <- audience_score ~
runtime + imdb_rating + critics_score+ critics_rating + audience_rating
model <- SignifReg(scope=scope,
data=movies,
alpha=0.05,
direction="forward",
criterion="r-adj",
correction="FDR")
summary(model)##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1570 -4.7630 0.6209 4.3572 24.3224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.4864 2.0068 -5.724 1.59e-08 ***
## imdb_rating 9.5191 0.3497 27.220 < 2e-16 ***
## audience_ratingUpright 20.8473 0.7674 27.166 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.952 on 648 degrees of freedom
## Multiple R-squared: 0.8822, Adjusted R-squared: 0.8818
## F-statistic: 2426 on 2 and 648 DF, p-value: < 2.2e-16
Thus, we obtain the final model as -
model_first <- lm( audience_score ~ imdb_rating + audience_rating, data = movies)round(coef(model_first), digits = 3)## (Intercept) imdb_rating audience_ratingUpright
## -11.486 9.519 20.847
95% Confidence intervals for the coefficients
confint(model_first, level = 0.95)## 2.5 % 97.5 %
## (Intercept) -15.42705 -7.545717
## imdb_rating 8.83236 10.205747
## audience_ratingUpright 19.34045 22.354207
We observe that
R^2 and adjusted R^2 are close, the model seems to be well-fitted
We can also perform a hypothesis test to know if the predictors are associated with the response variable.
Our null hypothesis i.e. H0: the slopes associated with all the repsonse variables are 0
the alternate hypothesis HA: there are associated and hence atleast one of the slope is non-zero.
This result is provided by the F-statistic value, which is much greater than 1 signifying that atleast one of the response variable is significant and is related to our response variable
fit1Augment <- augment(model_first) %>% mutate(row_num = 1:n())
# Plot of residuals against fitted values (non-constant variance and non-linearity)
ggplot(fit1Augment, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
xlab("Fitted value") +
ylab("Standardized residuals") +
ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))# Plot of residuals against predictor variable (checking non-linearity).
ggplot(fit1Augment, aes(x = imdb_rating, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nimdb_rating is equally spread around 0") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Boxplot of residuals (outlying observations)
ggplot(fit1Augment, aes(y = .std.resid)) +
geom_boxplot() +
ggtitle("Outlying observations test\nFew outliers") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))# Normal probability Q-Q plot of residuals (non-normality).
ggplot(fit1Augment, aes(sample = .std.resid)) +
geom_qq(alpha = 0.3) +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") +
ggtitle("Non-normality test\nThe residuals follow a normal distribution") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))The standardised residuals in the graphs follow rstudent function. The graphs indicate that all the typical assumptions of the linear regression model hold true. Hence the model can be accepted as adequate
Finally, we check for multicollinearity using the VIF function. If the value of VIF > 10 then we can say that multicollineraity exists. However, here the values are less than 10, there is no multicollinearity issue.
vif(model_first)## imdb_rating audience_rating
## 1.935264 1.935264
I would like to predict the IMDB rating for the documentary movie- ‘City of Gold’.
In this richly penetrating documentary odyssey, Pulitzer Prize-winning food critic Jonathan Gold shows us a Los Angeles where ethnic cooking is a kaleidoscopic portal to the mysteries of an unwieldy city and the soul of America. As piping-hot platters are served up, so are stories of immigrants whose secret family recipes are like sacred offerings pledged for the opportunity to build their American Dream. With eternal curiosity, razor-sharp intellect, and existential longing, Gold is a culinary geographer taking us where no critic has gone before.
It was released in 2016 and the runtime of the movie is 91 minutes. It was directed and written by Laura Gabbert and was in theaters on Mar 11,2016. The audience_score is 81, which we will try to observe in our prediction.
new_movie <- data.frame( imdb_rating = 7.1 , audience_rating = "Upright")predict(model_first, new_movie, interval = "prediction", level = 0.95)## fit lwr upr
## 1 76.94623 63.27664 90.61581
I have used the prediction interval of 95%. Also, the resulted prediction value is very close to the true audience_score i.e. 81. Thus, we can say that the model is pretty good in that case.
Thus, we observe that the audience score is associated with the IMDB rating and audience rating.
This open-ended question helped me to look data from various angles.
In future, I would like to treat the outliers and leverage points, and verify the results obtained thereafter. I would also look into the scope for polynomial terms to see if we obtain any better result.
However as mentioned already, not many people actually rate the movies in IMDB and rotten tomatoes, so it is quite biased in that manner.