Background
What are the most important factors in determining the IMDB rating of a movie and can we use a multiple linear regression model to predict this rating?
This project is based on a fictitious scenario where I’ve been hired as a data scientist at Paramount Pictures. The data presents numerous variables on movies such as audience/critic ratings, number of votes, runtime, genre, etc. Paramount Pictures is looking to gather insights into determining the acclaim of a film and other novel patterns or ideas. The data set is comprised of 651 randomly sampled movies produced and released before 2016.
Part 1: Data
Generalizability
The dataset comprises 651 randomly sampled movies produced and released before 2016. Because of the random sampling, we can reasonably assume that the data is representative of movies produced in that period.
However, as seen below, the earliest release year included is 1970 and some years contain only a handful of movies. Thus, the data is not representative of each individual year within our sample, and this should be considered when interpreting the results. We should also avoid extrapolating outside of this year range when calculating predictions.
movies%>%
group_by(thtr_rel_year)%>%
summarise(count = n())%>%
ggplot(aes(thtr_rel_year, count))+
geom_col(col="black", fill = "dark red", width = 1)+
geom_text(aes(label = count), nudge_y = 3, size = 3)+
theme_few()+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
labs(x = NULL,
y = NULL,
title = "Number of movies included within each year")

Causality
Because this is an observational study with no random assignment (no test/control groups), causality cannot be inferred.
Part 2: Research question
What are the most important factors in determining the IMDB rating of a movie and can we use a multiple linear regression model to predict this rating?
The ability to predict ratings based on film metrics could help:
- understand the popularity of a movie before the full ratings are generated
- inform us if a movie is performing better or worse than expected
- provide insight into which areas to focus for the most desirable outcome
Part 3: Exploratory data analysis (EDA)
Collinearity and Parsimony
We can observe high correlations between our various ratings (IMDB rating, audience score and critics score). We should consider only keeping one of our variables for our final model, but we will explore all of them throughout our EDA.
There’s also a high correlation between theatre release year and DVD release year, so we should include only one of them. With the current trend of online streaming (e.g. Netflix, Hulu, Amazon), physical DVDs are not as relevant, so we exclude the DVD variables in favour of the theatre ones.
#select all to start
raw_data_corr <- select_if(movies, is.numeric)
# Compute a correlation matrix
corr <- round(cor(raw_data_corr, use="complete.obs"),2)
# Compute a matrix of correlation p-values
p.mat <- cor_pmat(raw_data_corr)
# Visualize the correlation matrix
ggcorrplot(corr, method = "square",
ggtheme = ggthemes::theme_few,
#title = "We can observe some clear patterns",
outline.col = "black",
colors = c("blue","white", "red"),
lab = TRUE,
lab_size = 2.5,
digits = 2,
type = "lower",
legend = "",
tl.cex = 8,
#show insignificant ones as blank
p.mat = p.mat,
hc.order = TRUE,
insig = "blank")

movies <- movies%>%
select(-dvd_rel_year, -dvd_rel_month, -dvd_rel_day)
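The visual check of the correlation matrix can be complemented by listing the highly correlated pairs explicitly. The following is a minimal sketch in base R on made-up data; the data frame `toy`, its columns and the 0.8 cutoff are assumptions for illustration and are not taken from the analysis.

```r
# Toy data (all values are made up): x2 is nearly a copy of x1, x3 is unrelated.
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.1),
                  x3 = rnorm(100))

# Correlation matrix, computed the same way as in the analysis above
corr <- cor(toy, use = "complete.obs")

# Keep each pair once (upper triangle) and flag |r| above an assumed cutoff
cutoff <- 0.8
idx <- which(abs(corr) > cutoff & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(corr)[idx[, "row"]],
                         var2 = colnames(corr)[idx[, "col"]],
                         r = corr[idx])
high_pairs
```

Each row of `high_pairs` is a candidate pair from which only one variable would be kept, which is the reasoning applied to the DVD and theatre release variables.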
Which types of film are included in our analysis?
We can see that the majority of the movies were within the Feature Film category. Given that Paramount operates in this area (rather than documentaries or TV shows), we will also focus our analysis/model on this category.
movies%>%
group_by(title_type)%>%
summarise(count = n())%>%
mutate(prop = round(count/sum(count)*100,0))%>%
ggplot(aes(title_type, prop))+
geom_col(col="black", fill = "dark red")+
geom_text(aes(label = paste(count, " - (", prop, "%)", sep = "")), nudge_y = 5, size = 5)+
theme_few()+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
labs(x = NULL,
y = NULL)

movies <- movies%>%
filter(title_type == "Feature Film")%>%
select(-title_type)
Which genres are included in our analysis?
The majority of our data consists of Drama movies (51%), followed by Comedy (14%) and Action & Adventure (11%). We can see that we still have a few documentaries left, even though we filtered on title type. We will also remove this subset for the same reason as before.
movie_genre <- movies%>%
group_by(genre)%>%
summarise(count = n())%>%
mutate(proportion = round(count/sum(count)*100,1))%>%
arrange(desc(count))
movie_genre%>%
kbl(caption = "Summary of movie genres")%>%
kable_paper("hover", full_width = F)
Summary of movie genres
| genre | count | proportion |
|---|---|---|
| Drama | 301 | 50.9 |
| Comedy | 85 | 14.4 |
| Action & Adventure | 65 | 11.0 |
| Mystery & Suspense | 59 | 10.0 |
| Horror | 23 | 3.9 |
| Other | 15 | 2.5 |
| Art House & International | 14 | 2.4 |
| Animation | 9 | 1.5 |
| Science Fiction & Fantasy | 9 | 1.5 |
| Musical & Performing Arts | 8 | 1.4 |
| Documentary | 3 | 0.5 |
movies <- movies%>%
filter(genre != "Documentary")
Are the IMDB ratings impacted by the genre?
The boxplots show noticeable differences between genres in their median rating, spread, range and number of outliers.
ggplot(movies, aes(genre, imdb_rating, col = genre))+
geom_boxplot(show.legend = FALSE)+
geom_jitter(alpha = 0.1, show.legend = FALSE)+
coord_flip()+
labs(y = "IMDB rating")+
theme_few()+
labs(x=NULL)

Is there an association between the IMDB rating and the total amount of IMDB votes?
The high-performing movies (as measured by IMDB rating) also have a higher number of votes. This trend seems exponential towards the higher IMDB ratings/votes, where the increase in total number of votes is much higher than in the low/medium ranges.
movies%>%
ggplot(aes(imdb_rating, imdb_num_votes))+
geom_jitter(width=0.1, height = 0.1, alpha = 0.5)+
geom_smooth(se = FALSE)+
theme_few()+
labs(x = "IMDB rating",
y = "Total IMDB votes")+
theme(legend.position = "none")

Do movie critics and audiences share the same taste in movies?
Judging by the variances in our scatter plots, we can observe fairly large discrepancies between how audiences and critics perceive movies. This is even more apparent in genres such as Action & Adventure and Comedy, where audiences tend to give much higher ratings than the critics.
movies%>%
ggplot(aes(audience_score, critics_score, col = genre))+
geom_jitter(width=0.1, height = 0.1, show.legend = FALSE)+
facet_wrap(.~genre)+
theme_few()+
annotate("segment", x=-Inf, xend=Inf, y=-Inf, yend=Inf, alpha = 0.5, lty = 2)+
theme(legend.position = "none")+
labs(x = "Audience score",
y = "Critics score")

We can focus more on these differences by looking at the average scores for each genre. Below, we can observe very large differences between the various genres.
movies %>%
select(genre, audience_score, critics_score)%>%
group_by(genre)%>%
summarise(audience_score = mean(audience_score),
critics_score = mean(critics_score))%>%
mutate(difference_score = round((audience_score - critics_score), 1),
status = ifelse(abs(difference_score) <=3, "good (lower than 3)",
ifelse(abs(difference_score) <=5, "ok (between 3 and 5)", "bad (higher than 5)")))%>%
#create plot
ggplot(aes(reorder(genre, difference_score), difference_score, fill = status))+
geom_col(col = "black")+
scale_fill_manual(values = c("dark red", "#009E73", "gold3"))+
scale_colour_manual(values = c("dark red", "#009E73", "gold3"))+
geom_label(aes(genre, difference_score + ifelse(difference_score >= 0, +0.2, -0.2), label = difference_score), show.legend = FALSE, size = 3, fill = "white")+
labs(x=NULL,
y="Points disagreement",
title = "Rating disagreement between critics and audiences",
subtitle = "A positive score means the audience rated higher than critics",
fill = "Status")+
#apply the base theme first so that the tweaks below are not overwritten
theme_few()+
theme(legend.position = "top",
axis.ticks.x = element_blank(),
axis.text.x = element_blank())+
coord_flip()

Is there an association between the year a movie was released in and the IMDB rating?
Given that we have the number of votes for each IMDB rating, we will use a weighted mean for this comparison.
While the total number of votes increased over time, as shown by the size of the points, the overall rating was fairly consistent. Note again the relatively small number of movies released before ~1990.
movies%>%
group_by(thtr_rel_year)%>%
summarise(imdb_wmean = weighted.mean(imdb_rating, imdb_num_votes),
count = n(),
votes = sum(imdb_num_votes))%>%
ggplot(aes(thtr_rel_year, imdb_wmean))+
geom_point(aes(size = votes), show.legend = FALSE)+
geom_line()+
coord_cartesian(ylim = c(0,10))+
labs(x = NULL,
y = "IMDB rating",
subtitle = "Size of point shows the relative sample of number of votes within each year")+
theme_few()
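To see why the weighted mean is used here, consider a small worked example in base R. The ratings and vote counts below are made up for illustration; they show how a movie with many votes dominates the weighted mean while the plain mean treats all movies equally.

```r
# Made-up ratings and vote counts for three movies
ratings <- c(8.5, 6.0, 7.0)
votes   <- c(100000, 500, 2000)

plain_mean    <- mean(ratings)
weighted_mean <- weighted.mean(ratings, votes)

# The weighted mean is the vote-weighted sum divided by the total votes
manual <- sum(ratings * votes) / sum(votes)

c(plain = plain_mean, weighted = weighted_mean, manual = manual)
```

The weighted value sits close to 8.5 because the first movie accounts for almost all of the votes.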

Are there any associations between the IMDB rating and the month of release?
The releases appear to be fairly evenly distributed across the year (shown by the width of the boxes), and the ratings have a similar spread in each month. Thus, the rating and the month of release look independent of each other.
movies%>%
mutate(thtr_rel_month = as.factor(thtr_rel_month))%>%
ggplot(aes(thtr_rel_month, imdb_rating, col = thtr_rel_month))+
geom_boxplot(aes(group = thtr_rel_month), varwidth = TRUE, show.legend = FALSE)+
geom_jitter(alpha = 0.1, show.legend = FALSE)+
labs(x = "Month",
y = "IMDB rating")+
theme_few()

Is there an association between the day of the week a movie was released and the IMDB rating?
Both the boxplot and the table below show that roughly 73% of movie releases happen on a Friday and 11% on a Wednesday. The remaining days account for around 2-4% each. While the sample sizes differ across weekdays, there doesn’t seem to be a clear trend in the IMDB rating. Audience and critics scores, however, are both lower on average for movies released on Monday, Tuesday and Friday than for the other days.
day_of_week <- movies%>%
mutate(date = make_date(thtr_rel_year, thtr_rel_month, thtr_rel_day),
wday = weekdays.Date(date),
wday_num = wday(date, week_start = 1))%>%
group_by(wday, wday_num)%>%
summarise(count = n(),
votes = round(mean(imdb_num_votes)),
imdb_wmean = weighted.mean(imdb_rating, imdb_num_votes),
audience_score = round(mean(audience_score),1),
critics_score = round(mean(critics_score),1),
imdb_wmean = round(imdb_wmean, 1))%>%
ungroup()%>%
mutate(prop = round(count/sum(count)*100,1))%>%
arrange(wday_num)
#add the weekday column to our model
movies <- movies%>%
mutate(date = make_date(thtr_rel_year, thtr_rel_month, thtr_rel_day),
wday = as.factor(weekdays.Date(date)),
wday_num = wday(date, week_start = 1))%>%
select(-date)
ggplot(movies, aes(reorder(wday,wday_num), imdb_rating, col = wday))+
geom_boxplot(show.legend = FALSE, varwidth = TRUE)+
geom_jitter(alpha = 0.1, show.legend = FALSE)+
labs(x = NULL,
y = "IMDB rating")+
theme_few()

day_of_week%>%
kbl(caption = "Proportion of releases by day of week")%>%
kable_paper("hover", full_width = F)
Proportion of releases by day of week
| wday | wday_num | count | votes | imdb_wmean | audience_score | critics_score | prop |
|---|---|---|---|---|---|---|---|
| Monday | 1 | 12 | 35906 | 7.0 | 57.8 | 52.1 | 2.0 |
| Tuesday | 2 | 20 | 104076 | 8.1 | 58.1 | 57.3 | 3.4 |
| Wednesday | 3 | 65 | 62649 | 7.3 | 67.4 | 65.7 | 11.1 |
| Thursday | 4 | 23 | 54167 | 6.9 | 66.2 | 65.4 | 3.9 |
| Friday | 5 | 430 | 65142 | 7.1 | 58.8 | 51.2 | 73.1 |
| Saturday | 6 | 20 | 43208 | 7.3 | 67.1 | 71.9 | 3.4 |
| Sunday | 7 | 18 | 22631 | 7.1 | 62.4 | 68.6 | 3.1 |
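The weekday feature above is built with `make_date()` and `wday()` from lubridate. For readers without that package, the same derivation can be sketched in base R. The three release dates below are made up for illustration, and the weekday names are mapped from a fixed vector so that the result does not depend on the system locale.

```r
# Made-up release dates, stored in separate columns as in the movies data
releases <- data.frame(thtr_rel_year  = c(2016, 2015, 2016),
                       thtr_rel_month = c(1, 12, 1),
                       thtr_rel_day   = c(1, 25, 6))

# Assemble a Date from the three components
releases$date <- as.Date(sprintf("%04d-%02d-%02d",
                                 releases$thtr_rel_year,
                                 releases$thtr_rel_month,
                                 releases$thtr_rel_day))

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday)
releases$wday_num <- as.integer(format(releases$date, "%u"))

# Locale-independent weekday names
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
releases$wday <- factor(day_names[releases$wday_num], levels = day_names)

releases[, c("date", "wday_num", "wday")]
```

Declaring the factor levels in calendar order is a design choice of this sketch: it makes Monday the reference level in a regression, whereas the alphabetical ordering used in the analysis makes Friday the reference.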
Does the runtime of the movie affect the IMDB rating?
We can see a slight trend where longer movies tend to have a higher IMDB rating. However, this is not necessarily causal: it could be an indirect effect of the genre of the movie (that is, longer movies may belong to more favourably rated genres).
ggplot(movies, aes(runtime, imdb_rating))+
geom_jitter(width = 0.1, height = 0.1, alpha = 0.75)+
geom_smooth(se=FALSE)+
theme_few()+
labs(x = "Runtime (min)",
y = "IMDB rating")

Where does Paramount Pictures rank amongst the various other studios?
Based on movie averages, Paramount Pictures ranks near the top in terms of IMDB rating and average number of votes, and just above average in terms of audience score.
studios <- movies%>%
group_by(studio)%>%
summarise(imdb_wmean = weighted.mean(imdb_rating, imdb_num_votes),
audience_score = mean(audience_score, na.rm = TRUE),
count = n(),
avg_votes = sum(imdb_num_votes)/count)
paramount_studio <- studios%>%
filter(studio == "Paramount Pictures")
ggplot()+
geom_point(data = studios, aes(audience_score, imdb_wmean, size = avg_votes), show.legend = FALSE, alpha = 0.6)+
geom_text(data = paramount_studio, aes(audience_score, imdb_wmean, label = "Paramount"), col = "red", size = 5, nudge_y = 0.3)+
geom_point(data = paramount_studio, aes(audience_score, imdb_wmean), col = "red", size = 3.7)+
labs(x = "Audience score",
y = "IMDB rating",
subtitle = "Size of point shows the average number of votes per movie for each studio")+
theme_few()

Is there an association between the type of an award a movie receives and the rating?
In the boxplot below, we can notice clear differences for the three awards in the bottom row of panels, but nothing comparable for the three awards in the top row.
awards <- movies%>%
select(audience_score, critics_score, imdb_rating, imdb_num_votes, best_pic_nom:top200_box)%>%
gather(Award, Status, -audience_score, -critics_score, -imdb_rating, -imdb_num_votes)
ggplot(awards, aes(Status, imdb_rating, col = Status))+
geom_boxplot()+
theme_few()+
labs(col = "Award received",
x = NULL,
y = "IMDB rating")+
facet_wrap(.~Award)+
theme(legend.position = "top")

The same trend can be seen below, where movies that have received one of the awards in the bottom row of panels tend to perform better than those that have not, on both rating scores.
award_yes <- awards%>%
filter(Status == "yes")
ggplot(awards, aes(critics_score, imdb_rating, col = Status))+
geom_point(alpha = 0.50)+
geom_point(data = award_yes, aes(critics_score, imdb_rating), col = "dark red")+
theme_few()+
scale_colour_manual(values = c("gold3", "dark red"))+
labs(col = "Award received",
x = "Critics score",
y = "IMDB rating")+
facet_wrap(.~Award)+
theme(legend.position = "top")

We can also spot a trend where, generally speaking, the rarer the award, the higher the chance it would impact any of the ratings in a favourable way.
awards_table <- awards%>%
group_by(Award, Status)%>%
summarise(count = n(),
audience_score = round(mean(audience_score),1),
critics_score = round(mean(critics_score),1),
imdb_rating = round(weighted.mean(imdb_rating, imdb_num_votes),1),
imdb_num_votes = round(mean(imdb_num_votes),0))%>%
mutate(proportion = round(count/sum(count)*100,1))%>%
arrange(proportion)
awards_table%>%
kbl(caption = "Summary of awards and movie ratings")%>%
kable_paper("hover", full_width = F)
Summary of awards and movie ratings
| Award | Status | count | audience_score | critics_score | imdb_rating | imdb_num_votes | proportion |
|---|---|---|---|---|---|---|---|
| best_pic_win | yes | 7 | 84.7 | 91.3 | 8.1 | 399420 | 1.2 |
| top200_box | yes | 15 | 74.5 | 75.5 | 7.4 | 269876 | 2.6 |
| best_pic_nom | yes | 22 | 85.3 | 87.5 | 8.1 | 253804 | 3.7 |
| best_dir_win | yes | 43 | 69.5 | 72.1 | 7.7 | 137816 | 7.3 |
| best_actress_win | yes | 70 | 63.6 | 61.7 | 7.5 | 98932 | 11.9 |
| best_actor_win | yes | 91 | 62.7 | 60.2 | 7.4 | 82086 | 15.5 |
| best_actor_win | no | 497 | 60.0 | 53.9 | 7.1 | 59644 | 84.5 |
| best_actress_win | no | 518 | 60.0 | 53.9 | 7.1 | 58278 | 88.1 |
| best_dir_win | no | 545 | 59.7 | 53.5 | 7.1 | 57224 | 92.7 |
| best_pic_nom | no | 566 | 59.4 | 53.6 | 7.0 | 55705 | 96.3 |
| top200_box | no | 573 | 60.0 | 54.3 | 7.2 | 57705 | 97.4 |
| best_pic_win | no | 581 | 60.1 | 54.4 | 7.1 | 59066 | 98.8 |
Part 4: Modeling
Feature selection
We will be excluding the following variables from our full model:
- title and studio, as they might cause our model to overfit
- critics/audience ratings and scores, as they are highly correlated with our response variable (IMDB rating)
- number of IMDB votes, as it would act as a proxy for the IMDB rating (they’re also correlated)
- DVD release year, month and day, for the reasons mentioned before (high correlation with theatre release dates and lack of importance given current streaming trends)
Feature engineering: we also added day of the week to see if there are any trends
Selection method
We chose backward elimination based on p-values, because in the context of this study we could not use an automated procedure to reduce the number of variables. Backward elimination based on adjusted R² would be very time-consuming to carry out manually and prone to human error. Furthermore, the p-value approach ensures that the retained variables are statistically significant (which is not necessarily true when selecting on adjusted R²).
model_initial <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + thtr_rel_day + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box + wday, data=movies)
tidy(model_initial)%>%
arrange(term, p.value)%>%
kbl(caption = "Initial model - summary")%>%
kable_paper("hover", full_width = F)
Initial model - summary
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.1742438 | 7.9644193 | 1.0263452 | 0.3051744 |
| best_actor_winyes | -0.0711392 | 0.1125253 | -0.6322066 | 0.5275113 |
| best_actress_winyes | 0.0044981 | 0.1237113 | 0.0363600 | 0.9710083 |
| best_dir_winyes | 0.3188387 | 0.1590998 | 2.0040174 | 0.0455519 |
| best_pic_nomyes | 0.8816059 | 0.2389923 | 3.6888459 | 0.0002474 |
| best_pic_winyes | -0.1331604 | 0.4210997 | -0.3162206 | 0.7519535 |
| genreAnimation | -0.1133041 | 0.3741316 | -0.3028455 | 0.7621207 |
| genreArt House & International | 0.5010347 | 0.2879901 | 1.7397636 | 0.0824529 |
| genreComedy | -0.0658930 | 0.1555755 | -0.4235434 | 0.6720624 |
| genreDrama | 0.5711587 | 0.1325883 | 4.3077606 | 0.0000195 |
| genreHorror | -0.1636790 | 0.2321635 | -0.7050161 | 0.4810949 |
| genreMusical & Performing Arts | 0.9702023 | 0.3466066 | 2.7991452 | 0.0053014 |
| genreMystery & Suspense | 0.4188761 | 0.1730444 | 2.4206272 | 0.0158125 |
| genreOther | 0.5762551 | 0.2713937 | 2.1233178 | 0.0341675 |
| genreScience Fiction & Fantasy | -0.2579984 | 0.3285955 | -0.7851551 | 0.4326965 |
| mpaa_ratingNC-17 | -0.3952647 | 0.7098442 | -0.5568330 | 0.5778652 |
| mpaa_ratingPG | -0.5859253 | 0.2845822 | -2.0588966 | 0.0399685 |
| mpaa_ratingPG-13 | -0.7732551 | 0.2979153 | -2.5955536 | 0.0096929 |
| mpaa_ratingR | -0.4397807 | 0.2897715 | -1.5176811 | 0.1296621 |
| mpaa_ratingUnrated | 0.0134913 | 0.3910641 | 0.0344990 | 0.9724916 |
| runtime | 0.0124955 | 0.0027001 | 4.6277832 | 0.0000046 |
| thtr_rel_day | 0.0005832 | 0.0044814 | 0.1301413 | 0.8965016 |
| thtr_rel_month | 0.0042229 | 0.0113959 | 0.3705581 | 0.7111076 |
| thtr_rel_year | -0.0015561 | 0.0039896 | -0.3900363 | 0.6966588 |
| top200_boxyes | 0.5304606 | 0.2501291 | 2.1207471 | 0.0343843 |
| wdayMonday | -0.1617262 | 0.2759457 | -0.5860798 | 0.5580593 |
| wdaySaturday | 0.4156167 | 0.2156469 | 1.9273016 | 0.0544503 |
| wdaySunday | 0.2682264 | 0.2287683 | 1.1724810 | 0.2415052 |
| wdayThursday | 0.3603500 | 0.2023244 | 1.7810508 | 0.0754490 |
| wdayTuesday | 0.0488718 | 0.2158849 | 0.2263792 | 0.8209896 |
| wdayWednesday | 0.1893176 | 0.1287452 | 1.4704825 | 0.1419958 |
# removed best_actor_win
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + thtr_rel_day + best_pic_nom + best_pic_win + best_actress_win + best_dir_win + top200_box + wday, data=movies)
# removed best_actress_win
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + thtr_rel_day + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed thtr_rel_day
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed thtr_rel_month
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed thtr_rel_year
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed best_pic_win
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + best_pic_nom + best_dir_win + top200_box + wday, data=movies)
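The elimination above was carried out by hand. If automation were permitted, a similar procedure could be sketched as a loop. The version below is an illustration only and differs from the manual steps in two ways that should be read as assumptions: it tests whole terms with an F-test via `drop1()` rather than reading individual coefficient p-values, and it runs on the built-in `mtcars` data rather than the movies data. The helper name `backward_p` and the 0.05 threshold are hypothetical choices.

```r
# Backward elimination by p-value: repeatedly drop the least significant term
# until every remaining term has a p-value at or below alpha.
backward_p <- function(model, alpha = 0.05) {
  repeat {
    tests <- drop1(model, test = "F")
    tests <- tests[-1, , drop = FALSE]   # first row is "<none>"
    if (nrow(tests) == 0) break          # nothing left to drop
    pvals <- tests[["Pr(>F)"]]
    worst <- which.max(pvals)
    if (pvals[worst] <= alpha) break     # all remaining terms are significant
    drop_term <- rownames(tests)[worst]
    model <- update(model, as.formula(paste(". ~ . -", drop_term)))
  }
  model
}

# Illustration on a built-in dataset (not the movies data)
full    <- lm(mpg ~ wt + hp + qsec + drat + factor(cyl), data = mtcars)
reduced <- backward_p(full)
formula(reduced)
```

Testing whole terms keeps a categorical variable such as genre in or out as a unit, which matches how the manual procedure treated the factors.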
Final model
Reason for excluding certain variables: following a backward model adjustment, the following variables were not proven to be significant and were excluded from the final model (in order of exclusion):
- best_actor_win
- best_actress_win
- thtr_rel_day
- thtr_rel_month
- thtr_rel_year
- best_pic_win
That leaves us with the following variables included in our final model:
- genre
- runtime
- mpaa_rating
- best_pic_nom
- best_dir_win
- top200_box
- wday (day of the week of release)
Interpretation of our prediction model
Overall, we can see that the model was able to capture about 26% of the variability in our data (adjusted R² = 0.259; unadjusted R² = 0.289).
glance(model_final)%>%
kbl(caption = "Final model - overall summary")%>%
kable_paper("hover", full_width = F)
Final model - overall summary
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2890144 | 0.2587059 | 0.9103221 | 9.535769 | 0 | 24 | -766.3157 | 1584.631 | 1698.426 | 466.5504 | 563 | 588 |
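The relationship between the two R² values in the summary can be checked by hand. The sketch below applies the standard adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1), to the reported figures, where n is the number of observations (`nobs`) and k is the number of estimated slopes (`df`).

```r
# Values copied from the model summary table
r_squared <- 0.2890144
n <- 588   # nobs
k <- 24    # df: number of coefficients excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
round(adj_r_squared, 4)   # 0.2587, matching the reported adj.r.squared
```

The gap of about three percentage points between the two values reflects the 24 coefficients spent on mostly categorical predictors.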
At the coefficient level, every variable retained in the final model has at least one coefficient that is significant at the 5% level. The most significant predictor was runtime, followed by whether a movie belonged to the “Drama” genre and whether it was nominated for Best Picture.
We can conclude that, with everything else held constant:
- for each additional minute of runtime, the IMDB rating is expected to increase by 0.012 points
- a movie nominated for the Best Picture award is estimated to have a higher rating, on average, by 0.85 points
- a movie that won the Best Director award is predicted to have a higher rating, on average, by 0.31 points
- movies included in the Top 200 Box Office list are estimated to be rated higher by 0.53 points
- a movie within the Drama genre is expected to score 0.57 points higher than an Action & Adventure movie (the reference level). Three other genres are also estimated to have a significantly higher rating than the reference level, based on their coefficients and p-values:
  - Musical & Performing Arts
  - Mystery & Suspense
  - Other
- movies with an MPAA rating of PG and PG-13 are estimated to score lower than G-rated movies by 0.59 and 0.80 points, respectively
- day of the week was also significant: a movie released on a Saturday was predicted to score 0.42 points higher than one released on a Friday (the reference level)
tidy(model_final)%>%
arrange(term, p.value)%>%
kbl(caption = "Final model - summary")%>%
kable_paper("hover", full_width = F)
Final model - summary
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.1202358 | 0.3674164 | 13.9357829 | 0.0000000 |
| best_dir_winyes | 0.3074598 | 0.1516608 | 2.0272859 | 0.0431038 |
| best_pic_nomyes | 0.8476648 | 0.2101322 | 4.0339600 | 0.0000624 |
| genreAnimation | -0.1310169 | 0.3679277 | -0.3560942 | 0.7219034 |
| genreArt House & International | 0.5161796 | 0.2839500 | 1.8178537 | 0.0696178 |
| genreComedy | -0.0659667 | 0.1536537 | -0.4293205 | 0.6678543 |
| genreDrama | 0.5713490 | 0.1306699 | 4.3724595 | 0.0000146 |
| genreHorror | -0.1514570 | 0.2296387 | -0.6595445 | 0.5098159 |
| genreMusical & Performing Arts | 0.9816999 | 0.3446402 | 2.8484779 | 0.0045535 |
| genreMystery & Suspense | 0.4064704 | 0.1702345 | 2.3877091 | 0.0172826 |
| genreOther | 0.5816296 | 0.2678480 | 2.1714917 | 0.0303109 |
| genreScience Fiction & Fantasy | -0.2464390 | 0.3258388 | -0.7563218 | 0.4497726 |
| mpaa_ratingNC-17 | -0.4266973 | 0.7044137 | -0.6057482 | 0.5449261 |
| mpaa_ratingPG | -0.5940688 | 0.2825116 | -2.1028121 | 0.0359255 |
| mpaa_ratingPG-13 | -0.7953364 | 0.2902341 | -2.7403270 | 0.0063326 |
| mpaa_ratingR | -0.4562930 | 0.2840804 | -1.6062106 | 0.1087883 |
| mpaa_ratingUnrated | -0.0221070 | 0.3768923 | -0.0586561 | 0.9532468 |
| runtime | 0.0123815 | 0.0025315 | 4.8909967 | 0.0000013 |
| top200_boxyes | 0.5299325 | 0.2479966 | 2.1368538 | 0.0330402 |
| wdayMonday | -0.1414877 | 0.2696405 | -0.5247271 | 0.5999794 |
| wdaySaturday | 0.4245825 | 0.2098264 | 2.0234936 | 0.0434942 |
| wdaySunday | 0.2597699 | 0.2249641 | 1.1547171 | 0.2486962 |
| wdayThursday | 0.3673926 | 0.1997341 | 1.8394085 | 0.0663815 |
| wdayTuesday | 0.0494167 | 0.2117060 | 0.2334211 | 0.8155193 |
| wdayWednesday | 0.1969639 | 0.1272882 | 1.5473853 | 0.1223323 |
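To make the role of the reference levels concrete, a fitted value can be assembled by hand from the coefficient table. The sketch below does this for one profile: a 128-minute, PG-13 Musical & Performing Arts movie released on a Friday, with a Best Picture nomination, a Best Director win and a place in the Top 200 Box Office list. Friday is the reference level for `wday`, so it contributes nothing beyond the intercept.

```r
# Coefficients copied from the final model summary table
intercept     <- 5.1202358
coef_runtime  <- 0.0123815
contributions <- c(
  genre_musical = 0.9816999,   # vs. Action & Adventure (reference)
  mpaa_pg13     = -0.7953364,  # vs. G (reference)
  best_pic_nom  = 0.8476648,
  best_dir_win  = 0.3074598,
  top200_box    = 0.5299325,
  wday_friday   = 0            # reference level: no coefficient
)

runtime_min <- 128
fitted_rating <- intercept + coef_runtime * runtime_min + sum(contributions)
round(fitted_rating, 2)   # 8.58
```

This is the same arithmetic that `predict()` performs; small differences in later decimals come from the rounding of the tabulated coefficients.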
Note, however, that while our model performs well in the middle of the range, it clearly over-predicts for low IMDB ratings and under-predicts for high IMDB ratings.
ggplot(model_final, aes(imdb_rating, .resid))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_jitter(alpha = 0.5, color = "blue", height = 0.1, width = 0.1)+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "IMDB rating",
y="Residuals")

Model diagnostics
1. linear relationship between each (numerical) explanatory variable and response
We only have one numerical explanatory variable (runtime). The random scatter in the residual plot below suggests that it has a linear relationship with our response variable (IMDB rating). Note, however, that there weren’t many movies with a runtime of over 150 minutes.
ggplot(model_final, aes(runtime, .resid))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_point(alpha = 0.5, color = "blue")+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "Runtime",
y="Residuals")

2. nearly normal residuals with mean 0
Despite a slight skew to the left, the majority of our residuals within the below residual histogram are centered around the mean 0.
ggplot(model_final, aes(.resid))+
#geom_histogram(binwidth = 0.1, col = "black", fill = "blue")+
geom_histogram(aes(y=..density..), color="black", fill="white", lwd = 0.75) +
geom_density(alpha=0.2, fill="#FF6666") +
geom_vline(aes(xintercept=mean(.resid)), col = 'red', lwd = 1, lty = 2) +
labs(x="Residuals",
y= "Density")+
theme_few()

The normal probability plot of the residuals (QQ plot) below shows a similar picture. Outside of the tail areas, we do not see any notable deviations from the reference line.
ggplot(model_final, aes(sample=.resid))+
stat_qq()+
stat_qq_line()+
theme_few()+
theme(legend.position = "none")

3. constant variability of residuals
We can see that our residuals are equally variable for low and high values of the predicted response variable (IMDB rating).
ggplot(model_final, aes(.fitted, .resid))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_point(alpha = 0.5, color = "blue")+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "IMDB rating prediction",
y="Residuals")

We can also plot the absolute values of the residuals, as seen below. This can be thought of as the above plot folded in half. Thus, a fan shape in the above plot would look like a triangle in the below plot. This is not the case here, and thus this condition is also met.
ggplot(model_final, aes(.fitted, abs(.resid)))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_point(alpha = 0.5, color = "blue")+
#geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "IMDB rating prediction",
y="Absolute residuals")
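The visual check of the folded plot can be complemented with a rough numerical one: regress the absolute residuals on the fitted values and look at the slope. The sketch below is an illustration on the built-in `mtcars` data, not on the movies model; the model formula and object names are assumptions. A slope close to zero with a large p-value is consistent with constant variability, while a clearly positive slope would correspond to the triangle shape described above.

```r
# Illustrative model on a built-in dataset (not the movies data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: does the size of the residuals grow with the fitted value?
spread_fit <- lm(abs(resid(fit)) ~ fitted(fit))
spread_tab <- summary(spread_fit)$coefficients
spread_tab
```

This is only a screening device and does not replace looking at the plots.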

4. independence of residuals (and hence observations)
There is no apparent trend in the plot below, which suggests that the residuals are independent of the order of data collection.
df <- augment(model_final)
ggplot(data=df, aes(x = 1:nrow(df), y = .resid)) +
labs(x = "Index",
y = "Residuals")+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52", lty = 2)+
geom_hline(yintercept=0, col="red", linetype="dashed")+
geom_point(alpha = 0.5, color = "blue")+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()

Part 5: Prediction
Prediction and interpretation
Here are the details of La La Land, a movie released in 2016.
Reference - IMDB, Oscars and Box Office Mojo official websites.
movie_lalaland <- data.frame("genre" = c("Musical & Performing Arts"),
"runtime" = c(128),
"best_dir_win" = "yes",
"best_pic_nom" = "yes",
"mpaa_rating" = "PG-13",
"wday" = "Friday",
"top200_box" = "yes")
movie_lalaland%>%
kbl(caption = "Movie data")%>%
kable_paper("hover", full_width = F)
Movie data
| genre | runtime | best_dir_win | best_pic_nom | mpaa_rating | wday | top200_box |
|---|---|---|---|---|---|---|
| Musical & Performing Arts | 128 | yes | yes | PG-13 | Friday | yes |
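Given this data, the model predicted an IMDB rating of 8.58. Because the call below uses interval = "confidence", the lower and upper values give a 95% confidence interval for the mean rating of movies with these characteristics: 7.65 to 9.50. A prediction interval for a single movie (interval = "prediction") would be wider.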
pred_lalaland <- predict(model_final, newdata = movie_lalaland, interval = "confidence")
pred_lalaland%>%
kbl(caption = "Predictions")%>%
kable_paper("hover", full_width = F)
Predictions
| fit | lwr | upr |
|---|---|---|
| 8.576486 | 7.650619 | 9.502354 |
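The `predict()` call above uses `interval = "confidence"`, which describes the uncertainty in the mean rating for movies with this profile. For a single new movie, `interval = "prediction"` also accounts for the residual spread and is therefore wider. The sketch below shows the difference on the built-in `mtcars` data; the model and the new observation are assumptions for illustration and are unrelated to the movies model.

```r
# Illustrative model on a built-in dataset (not the movies data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)

# Interval for the mean response at these predictor values
ci_mean <- predict(fit, newdata = new_car, interval = "confidence")

# Interval for one new observation at these predictor values
pi_single <- predict(fit, newdata = new_car, interval = "prediction")

rbind(confidence = ci_mean[1, ], prediction = pi_single[1, ])
```

Both intervals share the same point estimate (`fit`); only the bounds differ.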
Lastly, we can plot this data to interpret it visually. The actual rating (8.0) was relatively close to the prediction and fell within the interval.
ggplot(movies, aes(imdb_rating, imdb_num_votes))+
geom_jitter(alpha = 0.5, size = 0.75, width = 0.1, height = 0.1)+
geom_smooth(se=FALSE)+
#plot prediction
geom_point(aes(x=8.576486 , y = 513225), size = 5, col = "blue")+
geom_text(aes(x=8.576486 , y = 513225), label = "Prediction", nudge_x = 0.25, nudge_y = -75000)+
#correct value
geom_point(aes(x=8.0, y = 513225), size = 5, col = "red")+
geom_text(aes(x=8.0, y = 513225), label = "Actual", nudge_y = -55000)+
#confidence interval
geom_errorbar(aes(xmax = 9.502354, xmin = 7.650619, x =8.576486, y=513225), width = 0.5)+
geom_text(aes(x=9.2, y = 513225), label = "(CI)", col = "black", nudge_y = 48000)+
theme_few()+
labs(x = "IMDB rating",
y = "IMDB number of votes")

