library(tidyverse)
library(statsr)
library(corrgram)
Data provided by instructor.
Every year a lot of new movies get released. So it is very difficult to choose which movie should watch.
Imdb rating is one way to tell that I should watch the movie or not. Now It would be very interesting If I could build a model that can predict imdb rating. So the question is:
Can imdb_rating be predicted from
* runtime
* title_type
* genre
* mpaa_rating
* critics_rating
* audience_rating
* audience_score
* best_pic_nom
* best_pic_win
* top200_box
I am going to start with a correlation matrix of numerical variables.
# configure output
# corrgram libraby is used here
corrgram(movies, order = TRUE,
lower.panel = panel.shade,
upper.panel = panel.pie,
text.panel = panel.txt)
From the correlation matrix I can see that ibdb_rating is highly correlated with critics score and audience score. Also imdb rating correlated with runtime and imdb number of votes.
I will build a dataframe with column of my interests. The reason for choosing certain variable explained in modeling part. I will my dataframe movies_1. There are few way handing missing data but I will drop NA value from my data set for now.
movies_1 <- select( movies,
imdb_rating,
runtime,
title_type,
genre,
mpaa_rating,
critics_rating,
audience_rating,
audience_score,
best_pic_nom,
best_pic_win,
top200_box)
movies_1 <- movies_1 %>% drop_na()
lets check the structure of the dataset with str() function.
there are few numeric column and rest are factor column.
str(movies_1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 650 obs. of 11 variables:
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I can see summary of my data set using summary() function. Summary function come back with minimum, maximum, mean, median, 1st quartile and 3rd quartile value. For the factor variable it come back with number of case of each level. There is no missing values in the dataset.
summary(movies_1)
## imdb_rating runtime title_type
## Min. :1.900 Min. : 39.0 Documentary : 54
## 1st Qu.:5.900 1st Qu.: 92.0 Feature Film:591
## Median :6.600 Median :103.0 TV Movie : 5
## Mean :6.492 Mean :105.8
## 3rd Qu.:7.300 3rd Qu.:115.8
## Max. :9.000 Max. :267.0
##
## genre mpaa_rating critics_rating
## Drama :305 G : 19 Certified Fresh:135
## Comedy : 87 NC-17 : 2 Fresh :208
## Action & Adventure: 65 PG :118 Rotten :307
## Mystery & Suspense: 59 PG-13 :133
## Documentary : 51 R :329
## Horror : 23 Unrated: 49
## (Other) : 60
## audience_rating audience_score best_pic_nom best_pic_win top200_box
## Spilled:275 Min. :11.00 no :628 no :643 no :635
## Upright:375 1st Qu.:46.00 yes: 22 yes: 7 yes: 15
## Median :65.00
## Mean :62.35
## 3rd Qu.:80.00
## Max. :97.00
##
# I will use plot_grid function from cowplot here.
library(cowplot)
q1 <- ggplot(movies_1, aes(x ='', y = runtime))+
geom_boxplot()+
theme(panel.background = element_rect(fill = 'light blue'))
q2 <- ggplot(movies_1, aes(x ='', y = audience_score))+
geom_boxplot()+
theme(panel.background = element_rect(fill = 'light blue'))
q3 <- ggplot(movies_1, aes(x ='', y = imdb_rating))+
geom_boxplot()+
theme(panel.background = element_rect(fill = 'light blue'))
plot_grid(q1,q2,q3, ncol = 3, nrow = 1)
From the above boxplot I can see there are outliers in runtime and imdb rating. Right now, I am not worried about outliers. I will check later on whether these outlier are influential or not.
h1 <- ggplot(movies_1, aes(runtime))+
geom_histogram(binwidth = 10, fill = 'salmon')+
theme(panel.background = element_rect(fill = 'light blue'))
h2 <- ggplot(movies_1, aes(audience_score))+
geom_histogram(binwidth = 10, fill = 'salmon')+
theme(panel.background = element_rect(fill = 'light blue'))
h3 <- ggplot(movies_1, aes(imdb_rating))+
geom_histogram(fill = 'salmon')+
theme(panel.background = element_rect(fill = 'light blue'))
plot_grid(h1,h2,h3, ncol = 3, nrow = 1)
From above plot, Except some outliers all numeric varible seems normally distributed.
qq1 <- ggplot(movies_1, aes(sample = runtime) )+
stat_qq()+
theme(panel.background = element_rect(fill = 'light blue'))+
ggtitle('runtime')
qq2 <- ggplot(movies_1, aes(sample = audience_score) )+
stat_qq()+
theme(panel.background = element_rect(fill = 'light blue'))+
ggtitle('audience score')
qq3 <- ggplot(movies_1, aes(sample = imdb_rating) )+
stat_qq()+theme(panel.background = element_rect(fill = 'light blue'))+
ggtitle('imdb_rating')
plot_grid(qq1,qq2,qq3, ncol = 3, nrow = 1)
The qq plot is not stright because there are outliers, otherwise it seems fairly linear.
Factor variable studio excluded because it has 211 level. all charcter variable excluded beacause it is not appropriate for regression model. critic score is excluded since it highly correlated with audience score and I have already added audience score in my model. imdb number of vote also excluded for same reason. otherwise, either theater release year or dvd release year is not much related to imdb rating so all year day month variable are excluded.
Since adjusted \(R^2\) is more reliable than p value selection so I will use backward elimination using adjusted \(R^2\) value.
model_full <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating+
audience_rating+
audience_score+
best_pic_nom+
best_pic_win+
top200_box, data = movies_1)
Here is the summary and anova table for model.
summary(model_full)
##
## Call:
## lm(formula = imdb_rating ~ runtime + title_type + genre + mpaa_rating +
## critics_rating + audience_rating + audience_score + best_pic_nom +
## best_pic_win + top200_box, data = movies_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56893 -0.19197 0.03524 0.25793 1.11120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.651036 0.267944 13.626 < 2e-16 ***
## runtime 0.005719 0.001114 5.135 3.77e-07 ***
## title_typeFeature Film -0.247236 0.180088 -1.373 0.170287
## title_typeTV Movie -0.405054 0.283519 -1.429 0.153601
## genreAnimation -0.417820 0.190199 -2.197 0.028405 *
## genreArt House & International 0.241813 0.146782 1.647 0.099973 .
## genreComedy -0.136525 0.080911 -1.687 0.092037 .
## genreDocumentary 0.180958 0.193645 0.934 0.350416
## genreDrama 0.131726 0.070156 1.878 0.060898 .
## genreHorror 0.123050 0.121207 1.015 0.310399
## genreMusical & Performing Arts 0.024397 0.166037 0.147 0.883229
## genreMystery & Suspense 0.300666 0.090370 3.327 0.000929 ***
## genreOther 0.037498 0.138287 0.271 0.786360
## genreScience Fiction & Fantasy -0.183601 0.172985 -1.061 0.288932
## mpaa_ratingNC-17 -0.167508 0.367491 -0.456 0.648681
## mpaa_ratingPG -0.184847 0.133803 -1.381 0.167623
## mpaa_ratingPG-13 -0.182783 0.137780 -1.327 0.185116
## mpaa_ratingR -0.121028 0.133321 -0.908 0.364335
## mpaa_ratingUnrated -0.188623 0.153116 -1.232 0.218454
## critics_ratingFresh -0.092189 0.057115 -1.614 0.107012
## critics_ratingRotten -0.354180 0.063595 -5.569 3.80e-08 ***
## audience_ratingUpright -0.432681 0.077420 -5.589 3.42e-08 ***
## audience_score 0.047599 0.002081 22.876 < 2e-16 ***
## best_pic_nomyes -0.023536 0.125670 -0.187 0.851501
## best_pic_winyes 0.123255 0.211790 0.582 0.560800
## top200_boxyes 0.032252 0.132990 0.243 0.808462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4836 on 624 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.8013
## F-statistic: 105.7 on 25 and 624 DF, p-value: < 2.2e-16
anova(model_full)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| runtime | 1 | 54.9593548 | 54.9593548 | 235.0439460 | 0.0000000 |
| title_type | 2 | 101.7089717 | 50.8544859 | 217.4887073 | 0.0000000 |
| genre | 10 | 61.5015476 | 6.1501548 | 26.3022855 | 0.0000000 |
| mpaa_rating | 5 | 16.5493186 | 3.3098637 | 14.1552504 | 0.0000000 |
| critics_rating | 2 | 163.2651891 | 81.6325945 | 349.1170377 | 0.0000000 |
| audience_rating | 1 | 96.0498188 | 96.0498188 | 410.7749902 | 0.0000000 |
| audience_score | 1 | 123.7862228 | 123.7862228 | 529.3949025 | 0.0000000 |
| best_pic_nom | 1 | 0.0010000 | 0.0010000 | 0.0042766 | 0.9478799 |
| best_pic_win | 1 | 0.0809442 | 0.0809442 | 0.3461728 | 0.5565005 |
| top200_box | 1 | 0.0137520 | 0.0137520 | 0.0588128 | 0.8084620 |
| Residuals | 624 | 145.9073419 | 0.2338259 | NA | NA |
I will start with full model and I will just simply drop one variable at a time from the end.
model_full <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating+
audience_rating+
audience_score+
best_pic_nom+
best_pic_win+
top200_box, data = movies_1)
model_1 <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating+
audience_rating+
audience_score+
best_pic_nom+
best_pic_win, data = movies_1)
model_2 <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating+
audience_rating+
audience_score+
best_pic_nom, data = movies_1)
model_3 <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating+
audience_rating+
audience_score
, data = movies_1)
model_4 <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating+
audience_rating, data = movies_1)
model_5 <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating+
critics_rating, data = movies_1)
model_6 <- lm(imdb_rating ~
runtime+
title_type+
genre+
mpaa_rating, data = movies_1)
model_7 <- lm(imdb_rating ~
runtime+
title_type+
genre,
data = movies_1)
model_8 <- lm(imdb_rating ~
runtime+
title_type,
data = movies_1)
model_9 <- lm(imdb_rating ~
runtime,
data = movies_1)
Now I will compare the \(R^2\) value for each model.
# model_full
sprintf('R-squared value of model_full is : %s ',
round(summary(model_full)$adj.r.squared,digit=5))
[1] “R-squared value of model_full is : 0.80132”
# model_1
sprintf('R-squared value of model_1 is : %s ',
round(summary(model_1)$adj.r.squared,digit=5))
[1] “R-squared value of model_1 is : 0.80162”
# model_2
sprintf('R-squared value of model_2 is : %s ',
round(summary(model_2)$adj.r.squared,digit=5))
[1] “R-squared value of model_2 is : 0.80183”
# model_3
sprintf('R-squared value of model_3 is : %s ',
round(summary(model_3)$adj.r.squared,digit=5))
[1] “R-squared value of model_3 is : 0.80215”
# model_4
sprintf('R-squared value of model_4 is : %s ',
round(summary(model_4)$adj.r.squared,digit=5))
[1] “R-squared value of model_4 is : 0.63498”
# model_5
sprintf('R-squared value of model_5 is : %s ',
round(summary(model_5)$adj.r.squared,digit=5))
[1] “R-squared value of model_5 is : 0.50581”
# model_6
sprintf('R-squared value of model_6 is : %s ',
round(summary(model_6)$adj.r.squared,digit=5))
[1] “R-squared value of model_6 is : 0.28753”
# model_7
sprintf('R-squared value of model_7 is : %s ',
round(summary(model_7)$adj.r.squared,digit=5))
[1] “R-squared value of model_7 is : 0.27103”
# model_8
sprintf('R-squared value of model_8 is : %s ',
round(summary(model_8)$adj.r.squared,digit=5))
[1] “R-squared value of model_8 is : 0.20142”
# model_9
sprintf('R-squared value of model_9 is : %s ',
round(summary(model_9)$adj.r.squared,digit=5))
[1] “R-squared value of model_9 is : 0.07052”
Lets build a table for \(R^2\) values.
model_number <- c("model_full", "model_1","model_2","model_3","model_4","model_5",
"model_6","model_7","model_8","model_9")
R_squared_value <- c(0.80132,0.80162,0.80183,0.80215,0.63498,0.50581,
0.28753,0.27103,0.20142,0.07052)
R_square_table <- cbind(model_number, R_squared_value)
R_square_table <- as.data.frame(R_square_table)
R_square_table
| model_number | R_squared_value |
|---|---|
| model_full | 0.80132 |
| model_1 | 0.80162 |
| model_2 | 0.80183 |
| model_3 | 0.80215 |
| model_4 | 0.63498 |
| model_5 | 0.50581 |
| model_6 | 0.28753 |
| model_7 | 0.27103 |
| model_8 | 0.20142 |
| model_9 | 0.07052 |
From the above output, model_3 is with highest R-squared value.
Residual vs fitted plot check if there is any non linear relationshiop between predictor variable and outcome variable. Since below plot show no disticnt pattern and the residuals is equally spread around horizontal line so there is no non linear relationship.
linearity_plot <- plot(model_3,1)
Normal Q-Q plot check whether residuals are normally distributed. Residuals would be normally distributed if qq plot is straight line. In my case the line is not totally stright, the lower portion of the line is curved. Eventhough few residuals are very far from stright line, overall the line does not violate the normality.
normality_plot <- plot(model_3, 2)
Here is the histogram of residuals. Mean of residulas around 0.
hist(model_3$residuals,
main = 'Histogram for Residuals', xlab = 'Residuals')
This is scale-location or spread location plot. Some data points are far away from mean that make red horizontal line little curved. But over all residuals are homogenously spread out.
homoscedas_plot <- plot(model_3, 3)
Cook’s distance measure the influence of each observation on regression coefficients. Cook’s distance value more than one should be further investigate. In my plot no cook’s distance value more than 1. But some cook’s distance value are far more than others like observation number 629, 502, and 66.
cook_distance_plot <- plot(model_3, 4)
from the summary of the model, coeffcient of audience_score is 0.047596, which mean as audience score increase by one unit, imdb_rating increase by 0.047596 where all other variable is held constant.
similarly, coeffcient of runtime is 0.005829, which mean as runtime increase by one unit, imdb_rating increase by 0.005829 where all other variable is held constant.
all coefficients can be explained in the same way. when all coefficient is zero the imdb rating is 3.65 which is intercept.
I have selected movie ‘fantastic beast and where to find them’ from 2016.
I will build a test data frame with all data that I collected for movie ‘fantastic beast and where to find them’.
imdb_rating <- c(7.4)
runtime <- c(132)
title_type <- c('Feature Film')
genre <- c('Action & Adventure')
mpaa_rating <- c('PG-13')
critics_rating <- c('Certified Fresh')
audience_rating <- c('Upright')
audience_score <- c(79)
test <- data.frame(imdb_rating,runtime, title_type, genre,
mpaa_rating, critics_rating, audience_rating, audience_score)
prediction_interval <- predict(model_3, test, interval = 'predict', level = 0.95 )
prediction_interval
## fit lwr upr
## 1 7.31255 6.349724 8.275377
prediction <- predict(model_3, test )
prediction
## 1
## 7.31255
results <- cbind(prediction,test$imdb_rating )
colnames(results) <- c('pred', 'real')
results <- as.data.frame(results)
results
| pred | real |
|---|---|
| 7.31255 | 7.4 |
mse <- mean((results$real - results$pred)^2)
sprintf("Mean Squared Error (MSE) is: %s ", round(mse, 4))
## [1] "Mean Squared Error (MSE) is: 0.0076 "
rmse <- mse^0.5
sprintf("Root Mean Squared Error(RMSE) is: %s ", round(rmse, 4))
## [1] "Root Mean Squared Error(RMSE) is: 0.0874 "
SSE <- sum((results$pred - results$real)^2)
SST <- sum((mean(movies_1$imdb_rating)- results$real)^2)
R2 <- 1- SSE/SST
sprintf("R-squared value is: %s ", round(R2, 4))
## [1] "R-squared value is: 0.9907 "
Movie - FANTASTIC BEASTS AND WHERE TO FIND THEM (2016)
data source link - imdb and rottentomatoes
There are some variables that significatly importrant to predict rating others are not. In prediction we use only one data point so prediction might not represents overall scenario.