Movies Week 4 Project of the course Linear regression model under the course track Statistics with R
Submitted by Olusola Afuwape
May 24th 2019
library(ggplot2)
library(dplyr)
library(statsr)load("movies.Rdata")# Find out the range of years of release of movies in the dataset
temporal <- movies %>% filter(!is.na(thtr_rel_year)) %>% select(thtr_rel_year)
range(temporal$thtr_rel_year)[1] 1970 2014
The movies dataset was collected by random sampling of movies produced and released before 2016. The dataset included variables like title of movie, date of release, type of movie etc. The movies included in this dataset were gotten from Rotten Tomatoes and IMDb
Scope of inference can be divided into spatial, temporal and socio-economic aspects. Spatial scope of inference included movies produced all over the world. The temporal scope of inference of the movies dataset ranges from 1970 to 2014. Socio-economic scope of inference looks at the impact of movies on socio-economic growth.
According to the codebook of the movies dataset, the dataset included 651 randomly selected movies that were produced and released before 2016. From this piece of information, it is clear that not all required principles of experimental design were engaged for the compilation of this dataset. The principles of experimental design include control, randomize, replicate and block. The movies dataset can only infer generalizability and association/correlation. The dataset cannot infer causality.
How do some specific variables in the movies dataset affect or determine nomination for Oscar?
# Check the data dimension and observe variables
dim(movies)[1] 651 32
names(movies) [1] "title" "title_type" "genre"
[4] "runtime" "mpaa_rating" "studio"
[7] "thtr_rel_year" "thtr_rel_month" "thtr_rel_day"
[10] "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
[13] "imdb_rating" "imdb_num_votes" "critics_rating"
[16] "critics_score" "audience_rating" "audience_score"
[19] "best_pic_nom" "best_pic_win" "best_actor_win"
[22] "best_actress_win" "best_dir_win" "top200_box"
[25] "director" "actor1" "actor2"
[28] "actor3" "actor4" "actor5"
[31] "imdb_url" "rt_url"
# Observe the count of movies that are nominated or win Oscar
oscar <- movies %>% filter(!is.na(best_pic_nom), !is.na(best_pic_win)) %>% select(best_pic_nom, best_pic_win)
table(oscar$best_pic_nom)
no yes
629 22
table(oscar$best_pic_win)
no yes
644 7
A movie that wins Oscar will first be nominated for Oscar before it can be given an Oscar award. Thus, this linear regression model will focus on how the variable best_pic_nom i.e. whether or not the movie was nominated for a best picture Oscar (no, yes) is affected by the following variables in the movies dataset:
# Get the variables
oscar_model <- movies %>% filter(!is.na(best_pic_nom), !is.na(imdb_rating), !is.na(imdb_num_votes), !is.na(critics_rating), !is.na(critics_score), !is.na(audience_rating), !is.na(audience_score), !is.na(top200_box)) %>% select(best_pic_nom, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, top200_box)
# Observe variables properties
str(oscar_model)Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 8 variables:
$ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
$ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
$ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
$ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
$ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
$ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
$ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Graphical representattion of variable best_pic_nom in relation to other variables
boxplot(imdb_rating ~ best_pic_nom, data = oscar_model, col = c("green", "deep pink"), xlab = "Best picture Oscar nomination", ylab = "Rating on IMDB")boxplot(imdb_num_votes ~ best_pic_nom, data = oscar_model, col = c("purple", "light blue"), xlab = "Best picture Oscar nomination", ylab = "Number of votes on IMDB")boxplot(critics_score ~ best_pic_nom, data = oscar_model, col = c("yellow", "violet"), xlab = "Best picture Oscar nomination", ylab = "Critics score on Rotten Tomatoes")boxplot(audience_score ~ best_pic_nom, data = oscar_model, col = c("blue", "grey"), xlab = "Best picture Oscar nomination", ylab = "Audience rating on Rotten Tomatoes")c_rating <- ggplot(oscar_model) + aes(x = critics_rating, fill = best_pic_nom) + geom_bar(position = "dodge")
c_rating <- c_rating + xlab("Critics rating") + ylab("Count") + scale_fill_discrete(name = "Oscar nomination")
c_ratinga_rating <- ggplot(oscar_model) + aes(x = audience_rating, fill = best_pic_nom) + geom_bar(position = "dodge")
a_rating <- a_rating + xlab("Audience rating") + ylab("Count") + scale_fill_discrete(name = "Oscar nomination")
a_ratingtop200 <- ggplot(oscar_model) + aes(x = top200_box, fill = best_pic_nom) + geom_bar(position = "dodge")
top200 <- top200 + xlab("Top 200 Box Office list") + ylab("Count") + scale_fill_discrete(name = "Oscar nomination")
top200# Convert variable best_pic_nom from a factor to a numeric
oscar_model$best_pic_nom <- as.numeric(oscar_model$best_pic_nom)
str(oscar_model)Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 8 variables:
$ best_pic_nom : num 1 1 1 1 1 1 1 1 1 1 ...
$ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
$ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
$ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
$ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
$ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
$ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
$ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
head(oscar_model)# A tibble: 6 x 8
best_pic_nom imdb_rating imdb_num_votes critics_rating critics_score
<dbl> <dbl> <int> <fct> <dbl>
1 1 5.5 899 Rotten 45
2 1 7.3 12285 Certified Fre~ 96
3 1 7.6 22381 Certified Fre~ 91
4 1 7.2 35096 Certified Fre~ 80
5 1 5.1 2386 Rotten 33
6 1 7.8 333 Fresh 91
# ... with 3 more variables: audience_rating <fct>, audience_score <dbl>,
# top200_box <fct>
# Check how these variables are related to variable best_pic_nom usng pairs plot
panel.cor <- function(x, y, ...)
{
par(usr = c(0, 2, 0, 2))
txt <- as.character(format(cor(x, y), digits = 2))
text(0.7, 0.7, txt, cex = 2* abs(cor(x, y)))
}
pairs(oscar_model[1:8], upper.panel=panel.cor)From the above data graphical representations, the boxplots and bar charts indicate that movies nominated for Oscar exhibited higher positive score when compared with movies that were not nominated for each of the variables plotted against the variable best_pic_nom.
An observation of the pairs plot depicts some of the explanatory variables have some level of collinearity. This means some of the variables are correlated and not independent of each other. Multicollinearity should be avoided which results in complication of the model. A parsimonious model is prefered.
Since this model is for predictions, adjusted R2 is more reliable for predictions than p-value. p-value is used for statistically significant predictors and not for model prediction. This model prediction will use backwards elimination method to get at the model that gives the best and more parsimonious model with the highest adjusted R2 value.
From the pairs plot and using each of the numeric explanatory variable to play the role of the response variable and regress it on the remaining explanatory variables, it was seen that variables imdb_rating (R2 = 0.8148), critics_rating (R2 = 0.8505) and audience_score (R2 = 0.8816) exhibit multicollinearity. Variable imdb_rating was then completely removed from the model because its removal yields the highest adjusted R2 value (0.1261).
Below is the code for the model using all the variables and the model the yields the highest adjusted R2 value after performing Backwards elimination method.
# Full model
full_movies <- lm(best_pic_nom ~ imdb_num_votes + critics_rating + critics_score + audience_rating + audience_score + top200_box, data = oscar_model)
summary(full_movies)
Call:
lm(formula = best_pic_nom ~ imdb_num_votes + critics_rating +
critics_score + audience_rating + audience_score + top200_box,
data = oscar_model)
Residuals:
Min 1Q Median 3Q Max
-0.42046 -0.04528 -0.01192 0.00558 1.00583
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.530e-01 5.077e-02 18.770 < 2e-16 ***
imdb_num_votes 4.183e-07 6.856e-08 6.101 1.82e-09 ***
critics_ratingFresh -4.927e-02 2.083e-02 -2.365 0.0183 *
critics_ratingRotten -2.833e-02 3.345e-02 -0.847 0.3972
critics_score 5.812e-04 5.566e-04 1.044 0.2968
audience_ratingUpright -2.518e-02 2.696e-02 -0.934 0.3506
audience_score 1.081e-03 7.616e-04 1.420 0.1561
top200_boxyes -2.300e-02 4.629e-02 -0.497 0.6194
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1691 on 643 degrees of freedom
Multiple R-squared: 0.1355, Adjusted R-squared: 0.1261
F-statistic: 14.4 on 7 and 643 DF, p-value: < 2.2e-16
# Best model
best_model <- lm(best_pic_nom ~ imdb_num_votes + critics_rating + critics_score + audience_score, data = oscar_model)
summary(best_model)
Call:
lm(formula = best_pic_nom ~ imdb_num_votes + critics_rating +
critics_score + audience_score, data = oscar_model)
Residuals:
Min 1Q Median 3Q Max
-0.40998 -0.04583 -0.01235 0.00839 0.99930
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.673e-01 4.822e-02 20.060 < 2e-16 ***
imdb_num_votes 4.151e-07 6.617e-08 6.274 6.47e-10 ***
critics_ratingFresh -4.701e-02 2.071e-02 -2.270 0.0235 *
critics_ratingRotten -2.475e-02 3.324e-02 -0.745 0.4568
critics_score 6.284e-04 5.531e-04 1.136 0.2563
audience_score 5.307e-04 4.730e-04 1.122 0.2623
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1689 on 645 degrees of freedom
Multiple R-squared: 0.134, Adjusted R-squared: 0.1273
F-statistic: 19.96 on 5 and 645 DF, p-value: < 2.2e-16
# Linear relationship between numerical explanatory variables and the response variable
best_model <- lm(best_pic_nom ~ imdb_num_votes + critics_rating + critics_score + audience_score, data = oscar_model)
plot(best_model$residuals ~ oscar_model$critics_score)plot(best_model$residuals ~ oscar_model$audience_score)# Residuals normality
hist(best_model$residuals)qqnorm(best_model$residuals)
qqline(best_model$residuals)# Constant residuals variability
plot(best_model$residuals ~ best_model$fitted)plot(abs(best_model$residuals) ~ best_model$fitted)# Residuals independence
plot(best_model$residuals)This model will be used to predict the movies Acrimony produced by Tyler Perry. The variable values are gotten from the sites Rotten Tomatoes and IMDb
imdb_num_votes <- 1002
critics_rating <- "Fresh"
critics_score <- 32
audience_score <- 69
model_predict <- data.frame(imdb_num_votes, critics_rating, critics_score, audience_score)
predict(best_model, model_predict, interval = "prediction") fit lwr upr
1 0.9774569 0.6412887 1.313625
From the result of the prediction function, it is clear that the model does not give a good prediction. Let’s recall that the focus of this research is: How do some specific variables in the movies dataset affect or determine nomination for Oscar? Thus, the set of variables selected in this model generation are not the best combination of variable data to use to be able to adequately predict nomination for Oscar award. To get a better prediction of how some of the variables in the movies dataset determine nomination for Oscar, a new set of variables from the dataset should be engaged for linear modeling.