library(ggplot2)
library(dplyr)
library(statsr)
# set working directory
setwd('C:/Users/sanso/Documents/Git/statsR/regression')
load("movies.Rdata")
This data has been collected from IMDB with a number of movies
The data seems sufficiently random to be able to generalize this to a population. However, since this is observed data, we will not be able to make any causal statements regarding this.
Our job is also to make a model which would be able to predict the audience score with a set of regressors
Checking what the data looks like
dim(movies)
## [1] 651 32
This data has 651 observations with 32 vaeiables.
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
To start with I would be removing, the last 2 variables which are imdb_url and rt_url
movies$rt_url <- NULL ; movies$imdb_url <- NULL
names(movies)
## [1] "title" "title_type" "genre"
## [4] "runtime" "mpaa_rating" "studio"
## [7] "thtr_rel_year" "thtr_rel_month" "thtr_rel_day"
## [10] "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
## [13] "imdb_rating" "imdb_num_votes" "critics_rating"
## [16] "critics_score" "audience_rating" "audience_score"
## [19] "best_pic_nom" "best_pic_win" "best_actor_win"
## [22] "best_actress_win" "best_dir_win" "top200_box"
## [25] "director" "actor1" "actor2"
## [28] "actor3" "actor4" "actor5"
Since this project involves trying to understand what are the factors that make a movie popular, I would be using the variable " audience_score" as an outcome variable and would try and find what are the other factors which impact this variable
As part of the exploratory analysis I would be checking what are the major factors that I would be using to build this model by checking if factors are relevant to the outcome variable or not
1st check would be on the Character Variables - Director and actor
length(unique(movies$director))
## [1] 533
length(unique(movies$actor1))
## [1] 486
since there are so many unique directors(533) and actors(486), I would not be using these are regressors,
Removing the regressors
movies <- movies[,-c(25:30)]
dim(movies)
## [1] 651 24
movies <- movies[,-1]
names(movies)
## [1] "title_type" "genre" "runtime"
## [4] "mpaa_rating" "studio" "thtr_rel_year"
## [7] "thtr_rel_month" "thtr_rel_day" "dvd_rel_year"
## [10] "dvd_rel_month" "dvd_rel_day" "imdb_rating"
## [13] "imdb_num_votes" "critics_rating" "critics_score"
## [16] "audience_rating" "audience_score" "best_pic_nom"
## [19] "best_pic_win" "best_actor_win" "best_actress_win"
## [22] "best_dir_win" "top200_box"
Now would be checking for the individual regressors
Checking if runtime has an impact on audience score
g <- ggplot(data = movies, aes(x = runtime, y = audience_score))
g <- g + geom_point()
g <- g + geom_smooth(method = "lm", se = FALSE)
g
concluding that there is some degree of correlation between runtime and audience score
Now would be checking if there is any impact of “title type” on audience score
g <- ggplot(data = movies, aes( x= title_type, y = audience_score))
g <- g + geom_boxplot(aes(fill = title_type))
g <- g + xlab("Title Type") + ylab("Audience Score")
g
Concluding that title type is required
Checking for Genre Type
g <- ggplot(data = movies, aes( x= genre, y = audience_score))
g <- g + geom_boxplot(aes(fill = genre))
g <- g + xlab("Genre") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
g
I would be keeping Genre as a regressor
I would be checking if Studio is required
length(unique(movies$studio))
## [1] 212
Since there are 212 unique Studio Names, I would not be using this as a regressor.
Deleting the column Studio
movies$studio <- NULL
Now would be checking if MPAA rating is important for predicting audience score
g <- ggplot(data = movies, aes( x= mpaa_rating, y = audience_score))
g <- g + geom_boxplot(aes(fill = mpaa_rating))
g <- g + xlab("MPAA Rating ") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 0))
g
Rating wiill be used as a regressor
I will not be using release year since we would be using preidctions of different years
Checking if the release month has any impact on the outcome variable
g <- ggplot(data = movies, aes( x= factor(thtr_rel_month), y = audience_score))
g <- g + geom_boxplot(aes(fill = thtr_rel_month))
g <- g + xlab("Month ") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
g
There does not seem to much impact of the month. Hence I would be deleting the release date, month and year variables
movies$thtr_rel_year <- NULL; movies$thtr_rel_month <- NULL; movies$thtr_rel_day <- NULL
movies$dvd_rel_day <- NULL; movies$dvd_rel_month <- NULL; movies$dvd_rel_year <- NULL
Checking to see if IMDB ratings have any impact on audience score
g <- ggplot(data = movies, aes( x= imdb_rating, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm", se = FALSE)
g <- g + xlab("IMDB Rating ") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
g
There is a very strong linear relation between these
Now checking for IMDB Number Votes and if that has any impact
g <- ggplot(data = movies, aes( x= imdb_num_votes, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm", se = FALSE)
g <- g + xlab("IMDB Votes # ") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 90))
g
while the linear relation does not seem to be particularly strong, it does merit using it in the model
checking for Crtics Rating type
g <- ggplot(data = movies, aes( x= critics_rating, y = audience_score))
g <- g + geom_boxplot(aes(fill = critics_rating))
g <- g + xlab("Critics Rating ") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 0))
g
There does seem to be an impact; hence would be considered
Checking for Critics Score
g <- ggplot(data = movies, aes( x= critics_score, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm", se = FALSE)
g <- g + xlab("Critics Score ") + ylab("Audience Score")
g <- g + theme(axis.text.x = element_text(size = 10, angle = 0))
g
There is a very strong Linear relation
For this section I would be using the results I have gotten in the exploratory data analysis section in terms of regressors to be used
# Checking what the current regressors are
names(movies)
## [1] "title_type" "genre" "runtime"
## [4] "mpaa_rating" "imdb_rating" "imdb_num_votes"
## [7] "critics_rating" "critics_score" "audience_rating"
## [10] "audience_score" "best_pic_nom" "best_pic_win"
## [13] "best_actor_win" "best_actress_win" "best_dir_win"
## [16] "top200_box"
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 16 variables:
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Making a model with all the regressors
fit1 <- lm(audience_score ~ ., data = movies)
summary(fit1)
##
## Call:
## lm(formula = audience_score ~ ., data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.3327 -4.3939 0.3987 4.2976 24.3401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.088e+00 4.494e+00 -2.023 0.0436 *
## title_typeFeature Film 2.469e+00 2.577e+00 0.958 0.3383
## title_typeTV Movie 9.358e-01 4.042e+00 0.232 0.8170
## genreAnimation 2.477e+00 2.717e+00 0.912 0.3623
## genreArt House & International -2.337e+00 2.111e+00 -1.107 0.2686
## genreComedy 1.618e+00 1.159e+00 1.397 0.1630
## genreDocumentary 2.545e+00 2.762e+00 0.921 0.3572
## genreDrama -4.135e-01 1.024e+00 -0.404 0.6866
## genreHorror -1.691e+00 1.727e+00 -0.979 0.3280
## genreMusical & Performing Arts 3.785e+00 2.371e+00 1.597 0.1109
## genreMystery & Suspense -2.770e+00 1.305e+00 -2.123 0.0342 *
## genreOther -1.547e-01 1.968e+00 -0.079 0.9374
## genreScience Fiction & Fantasy -2.709e-01 2.463e+00 -0.110 0.9125
## runtime -2.706e-02 1.702e-02 -1.590 0.1123
## mpaa_ratingNC-17 -6.803e-01 5.235e+00 -0.130 0.8966
## mpaa_ratingPG -1.861e-01 1.907e+00 -0.098 0.9223
## mpaa_ratingPG-13 -1.125e+00 1.974e+00 -0.570 0.5692
## mpaa_ratingR -1.296e+00 1.904e+00 -0.681 0.4963
## mpaa_ratingUnrated -3.254e-01 2.179e+00 -0.149 0.8813
## imdb_rating 9.403e+00 4.893e-01 19.218 <2e-16 ***
## imdb_num_votes 3.332e-06 3.234e-06 1.030 0.3033
## critics_ratingFresh 1.535e-01 8.691e-01 0.177 0.8599
## critics_ratingRotten -6.669e-01 1.402e+00 -0.476 0.6344
## critics_score 9.438e-03 2.527e-02 0.373 0.7089
## audience_ratingUpright 2.004e+01 7.924e-01 25.289 <2e-16 ***
## best_pic_nomyes 3.719e+00 1.811e+00 2.053 0.0405 *
## best_pic_winyes -2.523e+00 3.192e+00 -0.790 0.4297
## best_actor_winyes -6.606e-03 8.224e-01 -0.008 0.9936
## best_actress_winyes -1.267e+00 9.100e-01 -1.392 0.1643
## best_dir_winyes 3.046e-01 1.192e+00 0.256 0.7984
## top200_boxyes -6.564e-01 1.936e+00 -0.339 0.7347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.871 on 619 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.89, Adjusted R-squared: 0.8847
## F-statistic: 167 on 30 and 619 DF, p-value: < 2.2e-16
Here we get a R-squared value of 89% and and adjusted R-squared of 88.5%. This means that this model explains 89% of the Variance of the outcome variable
I would now be checking if there is a way in which I can reduce the number of regressors. I would be using the Step function which basically checks for maximization of the adjusted R-squared by reducing regressors
fit2 <- step(lm(audience_score ~ ., data = movies), trace = 0) # step function
summary(fit2) # Printing out the results
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2299 -4.3817 0.5075 4.3523 24.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.75008 2.66718 -3.281 0.00109 **
## genreAnimation 3.20421 2.45639 1.304 0.19256
## genreArt House & International -2.66418 2.02406 -1.316 0.18856
## genreComedy 1.35890 1.13110 1.201 0.23005
## genreDocumentary 0.16974 1.39624 0.122 0.90328
## genreDrama -0.79769 0.96859 -0.824 0.41050
## genreHorror -2.09686 1.67267 -1.254 0.21045
## genreMusical & Performing Arts 2.60624 2.18644 1.192 0.23371
## genreMystery & Suspense -3.13253 1.24911 -2.508 0.01240 *
## genreOther -0.02899 1.92785 -0.015 0.98801
## genreScience Fiction & Fantasy -0.08638 2.43485 -0.035 0.97171
## runtime -0.02547 0.01567 -1.626 0.10445
## imdb_rating 9.47342 0.45974 20.606 < 2e-16 ***
## critics_score 0.02177 0.01502 1.450 0.14759
## audience_ratingUpright 20.05406 0.77665 25.821 < 2e-16 ***
## best_pic_nomyes 3.52360 1.59063 2.215 0.02710 *
## best_actress_winyes -1.29501 0.89570 -1.446 0.14873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.823 on 633 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8863
## F-statistic: 317.2 on 16 and 633 DF, p-value: < 2.2e-16
We are now down to 7 regressors where we get an adjusted R-squared value of 88.63% ( a slight increase over the last model) and a R-squared value of 88.91% ( which is drop of 0.1% drop over the last model)
The regressors would be :
genre + runtime + imdb_rating + critics_score + audience_rating + best_pic_nom + best_actress_win
Hence we see that even with 7 regressors, we can explain 88.91% of the variance of the outcome variable with this model.
I would be selecting this model ( fit2) as the final model
plotting the residuals of this model
par(mfrow = c(2,2))
plot(fit2)
The residuals plot shows that the residuals are scattered normally and also the normal QQ line shows that it is near normal
Hence we can conclude that it is a good fit
I would be using the movie " Rogue One - A Star Wars story" for predictions
Link : http://www.imdb.com/title/tt3748528/?ref_=adv_li_tt
The variables I require for this model are
genre,runtime,imdb_rating,critics_score, audience_rating,best_pic_nom,best_actress_win
I have created a csv file with this data which I would be importing
movie1 <- read.csv('newdata.csv')
str(movie1)
## 'data.frame': 1 obs. of 8 variables:
## $ Title : Factor w/ 1 level "Rogue One - A Star Wars story": 1
## $ genre : Factor w/ 1 level "Action & Adventure": 1
## $ runtime : int 133
## $ imdb_rating : num 7.8
## $ critics_score : int 65
## $ audience_rating : Factor w/ 1 level "Spilled": 1
## $ best_pic_nom : Factor w/ 1 level "no": 1
## $ best_actress_win: Factor w/ 1 level "no": 1
using this data I would be predicting an audience score using the predict function
predict(fit2, newdata = movie1, interval = "predict")
## fit lwr upr
## 1 63.16992 49.54014 76.7997
This gives us an audience score of 63 with a lower interval of 49.5 and an upper interval of 76.8
In conclusion we can say that using 7 regressors, namely the following 7
genre,runtime,imdb_rating,critics_score, audience_rating,best_pic_nom,best_actress_win
we can build a linear model which explains 89% of the variance of the audience score.
We can also use this to predict the results.
Since some of this variables can only be understood after the movie is released, this model might not be a great method to understand what kind of movie to make before making it. Of the regressors in this model, only 2 “Genre” and “Runtime” are something that can be used before making the movie. For deciding which kind of movie to be made nd which would be popular, we should ideally be building a model with only those regressors where data is available before making the movie.