This is a project for Linear Regression and Modelling course, which is a part of Coursera’s Statistics with R Specialization.
The Project aims to find out what attributes make a movie top-rated. To achieve this, an exploratory data analysis (EDA) has been completed, and Linear regression model predicting average rating for a movie has been developed.
Code buttonlibrary(ggplot2); library(dplyr); library(statsr)
library(plotly); library(GGally); library(tidyr)
if(!file.exists("./data/1209_SR-LR-w4_Movies/movies.RData")) {
download.file("https://d3c33hcgiwev3.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1609459200&Signature=IQ4DIjVl-9DlmYVsHOIVdkNqCtFCLKrCV5mGIq15iVdqGl280YSagXpxBNSsn7iIKQYbaDRtynFeSet~09G22DA5KXvM7aa3U661bXIIu0JO3QW6ZMvzcW6sx1ZbpuvcAuy7wZ~iS78-rNUQnGGxOsbWXYnyyUF5UWSuJrdbJJ8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
destfile = "./data/1209_SR-LR-w4_Movies/movies.RData",
method = "curl")
}
load("./data/1209_SR-LR-w4_Movies/movies.RData")The data set includes \(651\) observations on \(32\) variables composed of information from Rotten Tomatoes and IMDb for a random sample of movies. The movies are released from 1970 to 2014 and comes from US-based Studios as MPAA Ratings in the data applies only to American films.
I never use ratings to select a movie to watch. For me, a much more important criterion is the opinion of my son or friend. In my understanding, the sinking of my heart when watching a movie is incalculable at all. As a mathematician, though, I wonder if it’s possible to decompose such a sensitive factor into numerical components.
Start with examination the data set variables, their description can be found in the Codebook.
Get rid of unnecessary variables, and create average_rating variable by averaging of rating types presented (imdb_rating, critics_score, audience_score; last two come from Rotten Tomatoes, are scaled by division by 10):
movies <- movies %>% transmute(
average_rating =round((imdb_rating + critics_score/10 +
audience_score/10)/3, 2),
genre, runtime, thtr_rel_month, imdb_num_votes, best_pic_nom,
best_pic_win, best_actor_win, best_actress_win,best_dir_win, top200_box)NArows <- sum(!complete.cases(movies))
duo <- sum(duplicated(movies))Now, there is 1 row (0.15\(\%\)) with NA values, and 1 duplicated row. So, they can be removed
movies<-na.omit(movies)
movies<-movies[!duplicated(movies),]Examine the structure of the resulting data set, and levels of genre variable:
str(movies)tibble [649 × 11] (S3: tbl_df/tbl/data.frame)
$ average_rating : num [1:649] 5.77 8.33 8.6 7.6 3.7 8.5 6.83 3.97 8.47 7.17 ...
$ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
$ runtime : num [1:649] 80 101 84 139 90 78 142 93 88 119 ...
$ thtr_rel_month : num [1:649] 4 3 8 10 9 1 1 11 9 3 ...
$ imdb_num_votes : int [1:649] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
$ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
$ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
$ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'omit' Named int 334
..- attr(*, "names")= chr "334"
levels(movies$genre) [1] "Action & Adventure" "Animation"
[3] "Art House & International" "Comedy"
[5] "Documentary" "Drama"
[7] "Horror" "Musical & Performing Arts"
[9] "Mystery & Suspense" "Other"
[11] "Science Fiction & Fantasy"
yes/no), three numeric (one of them is average_rating), and one integer (imdb_num_votes)Take a closer look at average_rating summary and interactive histogram:
summ<-summary(movies$average_rating)
summ Min. 1st Qu. Median Mean 3rd Qu. Max.
1.27 4.83 6.27 6.16 7.77 9.47
The mean of average_rating is around 6.16, the median is 6.27, the IQR is 2.94.
gg <-ggplot(movies, aes(x = average_rating, y = ..density..)) +
geom_histogram(fill = "cornflowerblue", colour = "darkgray") +
geom_density(size = 1, colour = "brown")
ggplotly(gg)average_rating distribution has a left skewed structureUnderstand the relationship between audience_score and other numeric/ integer variables:
ggpairs(movies, columns = c(1, 4:6),
upper = list(continuous = wrap("cor", size = 8)),
diag = list(continuous = wrap("barDiag", colour = "cornflowerblue")),
lower = list(continuous = wrap("smooth", colour = 'cornflowerblue')))average_rating and thtr_rel_month is statistically insignificantCompare variability in average_rating for two-levels (yes/no) categorical variables:
data <- movies %>%
gather('category','yesno', best_pic_nom, best_pic_win, best_actor_win,
best_actress_win, best_dir_win, top200_box)
ggplot(data=data,aes(x=category, y= average_rating,fill=yesno))+
geom_boxplot(color="darkgrey", outlier.colour="cornflowerblue") +
scale_fill_manual(name="", values =
c("firebrick3","springgreen3"))+
theme(legend.title = element_blank(), axis.text.x = element_text(angle=20))The boxplots shows:
best_actor_win and best_actress_win, there is almost no difference in if movie is from the respective category, or notaverage_rating for best_dir_win yes level is noticeably higher in relation to *no levelaverage_rating for the other three categories yes level (best_pic_nom, best_pic_win, top200_box) is significantly higher than for no oneAlso, look at genre levels interactive boxplots:
gg<-ggplot(data = movies, aes(x = genre, y = average_rating, fill = genre)) +
geom_boxplot(color="darkgrey")+
theme(axis.text.x = element_text(angle=30),
legend.title = element_text(size=8.5))+
scale_fill_discrete(name="CLICK on legend to isolate one trace")
ggplotly(gg)Documentary and Musical Performing & Arts movies have noticeably higher medians (\(>7.5\)) than other genresComedy and Horror movies have lower medians than other genres (\(<5\))Science Fiction & Fantasy movies have the widest IQR with average_rating of \(7.73-3.12=4.61\)The goal is to predict the success of a movie quantified by average_rating using a linear regression model.
As the aim is to pick up variables most affecting average_rating, consider initial regression of average_rating on all other variables (all.model) with further selection of the best.model by step-wise backward method:
all.model <- lm(average_rating ~ ., data = movies)
best.model <- step(all.model, direction = "backward" , data = movies, trace = 0)
bestrsq<- summary(best.model)$adj.r.squared
best.model$calllm(formula = average_rating ~ genre + imdb_num_votes + best_pic_nom +
best_dir_win, data = movies)
step() function found best.model includes four best predictors: genre, imdb_num_votes, best_pic_nom, best_dir_winbest.model (Adj.R-squared) is 33.9\(\%\)To model provide valid results, the following conditions should be met:
LINEARITY: has already been verified in chapter EDA, section Correlogram for numerical/ integer variables
NORMALITY: can be checked with histogram and/ or Q-Q plot of the residuals
hist<-ggplot(data = best.model, aes(x = best.model$residuals, y = ..density..))+
geom_histogram(color= "darkgray",fill = "cornflowerblue")+
geom_density(size = 1, colour = "brown")+
xlab("residuals")+
ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model, aes(sample = .resid))+
stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
xlab("theoretical quantiles")+
ylab("sample quantiles")+
ggtitle("Normal Q-Q Plot")
multiplot(hist, qq, cols=2) # Appendix: code `multiplot`CONSTANT VARIABILITY: can be checked by plotting residuals against predicted values
INDEPENDENCY: residuals can be plotted as they appear in the data set, which would reveal any time series dependence
var<-ggplot(data = best.model, aes(x = .fitted, y = .resid))+
geom_point(colour= "cornflowerblue") +
geom_smooth(colour="brown") +
xlab("prediction") +ylab("residuals")+
ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model, aes(x = 1:nrow(movies), y = .resid))+
geom_point(colour= "cornflowerblue") +
geom_smooth(method=lm, colour="brown") +
xlab("index") +ylab("residuals")+
ggtitle("Model Residuals (independence)")
multiplot(var, indep, cols=2)round(coefficients(best.model),6) (Intercept) genreAnimation
4.759862 0.698380
genreArt House & International genreComedy
1.257816 0.024026
genreDocumentary genreDrama
3.401586 1.320504
genreHorror genreMusical & Performing Arts
0.003018 2.736404
genreMystery & Suspense genreOther
0.626110 1.099557
genreScience Fiction & Fantasy imdb_num_votes
0.059695 0.000005
best_pic_nomyes best_dir_winyes
1.070874 0.570473
average_rating, i.e. increase it
best_pic_nom and best_dir_win of yes level, increase (comparing to no level) average_rating by 1.07 and 0.57 points respectivelygenre levels coefficients show changes relative to Action & Adventure movies (initial level)Action & Adventure movie with zero number of votes, without nomination to best picture and director non-won Oscar, gets average_rating of 4.76 pointsimdb_num_votes by 100 voices, raises audience_score by 0.0005 pointsNow test the predictive capability of best.model on movie not included in the original data set: Lion. According to IMDb and Rotten Tomatoes, the movie was released in 2016, earned imdb_rating of \(8.0\), critics_score of \(85\%\), and audience_score of \(91\%\). Compare true average rating to predicted value, and look at confidence interval:
lion <- tibble(imdb_num_votes=210903,
best_pic_nom=factor("yes", levels=c("no", "yes")),
best_dir_win=factor("no", levels=c("no", "yes")),
genre= factor("Drama", levels=
levels(movies$genre)))
rating_lion <- predict(best.model, lion, interval="prediction", level=0.95)
rating_lion fit lwr upr
1 8.130159 5.151084 11.10923
average_rating for “Lion” is between 5.15 and 11.11 on average.| Movie Title | genre | imdb_num_votes | best_pic_nom | best_dir_win | Predicted avg Rating | True avg Rating |
|---|---|---|---|---|---|---|
| Lion | Drama | 210903 | yes | no | 8.13 | 8.53 |
average_rating (\(8.53\) vs. \(8.13\))average_rating increase are Documentary (+3.4) and Musical & Performing Arts (+2.74 points), while nomination for the best picture adds 1.07 points to movie average_rating
average_rating distribution doesn’t show a normal curve?
audience_score values?average_rating?multiplotmultiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
plots <- c(list(...), plotlist)
numPlots = length(plots)
if (is.null(layout)) {
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
for (i in 1:numPlots) {
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}