library(ggplot2)## Warning: package 'ggplot2' was built under R version 3.3.2
library(dplyr)
library(statsr)
library(GGally)## Warning: package 'GGally' was built under R version 3.3.2
The dataset is comprised of 615 observations. The each observation has data regarding to 32 variables. There are about 21 categorical variables and the rest are numerical variables.
load("movies.Rdata")
str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
The model fitted from this dataset can be generalized to predict the parameters of the sample population. First, the data is indepedent. The dataset ‘movies’ contains data regarding 651 randomly sampled movies released before 2016, which is less than 10% of total number of movies released before 2016. Second, the sample size is large enough to assume nealry normal distribution. The causal conclusion can’t be drawm since the dataset is comprised of observational data. The relationship can be discussed instead.
summary(movies)## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The poluarity of a movie to audience will be associated with the rating of critics, whether or not the movie was nominated for or won Oscar, whether or not was in the Top 200 Box Office list on BoxOfficeMojo. The genre of movies was excluded from the question since the indivual different preference across the genre would be by chance because of the random sampling. The popularity of the movie to audience can be selected from either IMDB or Rotten Tomatoes. However, in this analysis a new score combined from IMDB and Rotten Tomatoes will be created to represent the popularity/acceptance of audience to a movie. Finally the critics, who are leading opinions on the movie industries, as well as Oscar nomiation, Oscar wining and Box office performance would be related to the popularity of a movie.
First, let’s create a new data set only containing the variables of interest and name it as movies_sm.
movies_sm <- movies %>%
select(title, title_type, genre, mpaa_rating, imdb_rating, critics_score, audience_score, best_pic_nom, best_pic_win, top200_box)
str(movies_sm)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 10 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_score: num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Before combining scores of IMDB and Rotten Tomatoes, investigate the relationship between two variables whether they are positive or negative or how strong the relationship is. According to the plot and the correlation coefficient score, the relationship between two is very strong postive relationships. An increase in the IMDB rating is expected to an increase in Rotten Tomatoes’ audience score as well. Here we can choose one of the two for the response variables for the model.
ggplot(data = movies_sm, aes(x=imdb_rating, y=audience_score))+
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)movies_sm %>%
summarise(cor(audience_score, imdb_rating))## # A tibble: 1 × 1
## `cor(audience_score, imdb_rating)`
## <dbl>
## 1 0.8648652
However, the combining two scores into a single score representing the rating of audience can also be an option for the model fitting in hope that it will better predit the parameter. In order to combine the two score, the difference of the scale used by different variables to be addressed by multiplying 10 to imdb_rating and name it as imdb_100. Instead of normalizing the two scores, this approach will give more intuitive and understandable to audience. Then lastly, create a new variable audience_one_score with single score calculated the avergae score of two variables.
movie_new <- movies_sm %>%
mutate(imdb_100 = imdb_rating*10, audience_one_score=(imdb_100+audience_score)/2)To investigate a single score audience_one_score can be substituted for the original two variables, the relationships between the single score and the original two variables must be strongly related. As the plots present, the correlation coefficient of imdb_rating and audience_score to the single score are 0.941 and 0.948 respectively, which are very strong positive relationships.
ggpairs(movie_new, columns = c(5,7,11:12))First, the relatsionship of explanaotry variables with the response variable as well as among the explanatory vairables should be check to see the relationships of each explanatory variables and the response variable and if there is any chance for collinearity. Based on the plot and the summary, best_pic_win: Whether or not the movie won a best picture Oscar has a very small observations with only 7. However, the analysis will include this variables to see its relationships with success of a movie.
ggpairs(movie_new, columns = c(6, 8:10, 12))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(movie_new)## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## mpaa_rating imdb_rating critics_score audience_score
## G : 19 Min. :1.900 Min. : 1.00 Min. :11.00
## NC-17 : 2 1st Qu.:5.900 1st Qu.: 33.00 1st Qu.:46.00
## PG :118 Median :6.600 Median : 61.00 Median :65.00
## PG-13 :133 Mean :6.493 Mean : 57.69 Mean :62.36
## R :329 3rd Qu.:7.300 3rd Qu.: 83.00 3rd Qu.:80.00
## Unrated: 50 Max. :9.000 Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win top200_box imdb_100 audience_one_score
## no :629 no :644 no :636 Min. :19.00 Min. :17.50
## yes: 22 yes: 7 yes: 15 1st Qu.:59.00 1st Qu.:52.50
## Median :66.00 Median :65.50
## Mean :64.93 Mean :63.65
## 3rd Qu.:73.00 3rd Qu.:76.00
## Max. :90.00 Max. :93.50
##
For model selection, the backward elemination from full model will be used. This is because the number of explanatory variables is not too large to start with a full model. Furthermore, based on the idea that the more explanatory variables, the better model fitted, will start with the best possible prediction power model.
movie_m_full = lm(audience_one_score ~ critics_score + best_pic_nom + best_pic_win + top200_box, data = movie_new)
summary(movie_m_full)##
## Call:
## lm(formula = audience_one_score ~ critics_score + best_pic_nom +
## best_pic_win + top200_box, data = movie_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.604 -6.495 0.326 6.733 33.441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.00684 0.88651 46.257 <2e-16 ***
## critics_score 0.38804 0.01405 27.623 <2e-16 ***
## best_pic_nomyes 6.38371 2.48197 2.572 0.0103 *
## best_pic_winyes -0.30405 4.29840 -0.071 0.9436
## top200_boxyes 1.81876 2.61662 0.695 0.4873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.937 on 646 degrees of freedom
## Multiple R-squared: 0.5668, Adjusted R-squared: 0.5641
## F-statistic: 211.3 on 4 and 646 DF, p-value: < 2.2e-16
summary(movie_m_full)$adj.r.sqaured## NULL
From now, we will eliminate the variables one by one to find the best fit model.
This is the first step of elimination.
movie_m1 = lm(audience_one_score ~ best_pic_nom + best_pic_win + top200_box, data = movie_new)
summary(movie_m1)$adj.r.squared## [1] 0.05072184
movie_m2 = lm(audience_one_score ~ critics_score + best_pic_win + top200_box, data = movie_new)
summary(movie_m2)$adj.r.squared## [1] 0.5603297
movie_m3 = lm(audience_one_score ~ critics_score + best_pic_nom + top200_box, data = movie_new)
summary(movie_m3)$adj.r.squared## [1] 0.5647831
movie_m4 = lm(audience_one_score ~ critics_score + best_pic_nom + best_pic_win, data = movie_new)
summary(movie_m4)$adj.r.squared## [1] 0.564461
The model movie_m3 has highest R-sqaured with 0.5647. The second step of elimination will be starting with expalanatory variables of movie_m3 model.
movie_m31 = lm(audience_one_score ~ best_pic_nom + top200_box, data = movie_new)
summary(movie_m31)$adj.r.squared## [1] 0.0517663
movie_m32 = lm(audience_one_score ~ critics_score + top200_box, data = movie_new)
summary(movie_m32)$adj.r.squared## [1] 0.5599482
movie_m33 = lm(audience_one_score ~ critics_score + best_pic_nom, data = movie_new)
summary(movie_m33)$adj.r.squared## [1] 0.5651321
The second step elminination will the final step for multiple regression model. And the final model will be movie_m33 with the highest R-squared score with 0.5651. The summary of the model is presented below. Based on the model, we can interpret that a unit increase in critics score on Rotten Tomatoes is expected an average 0.389 increase in the audience’s scores on a movie with every other held constant. Likewise, the nomiation of the best picture for Oscar is expected to an average 6.4 increase in the audience’s score on a movie with every other held constant.
summary(movie_m33)##
## Call:
## lm(formula = audience_one_score ~ critics_score + best_pic_nom,
## data = movie_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.610 -6.438 0.229 6.781 33.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.00062 0.88517 46.320 < 2e-16 ***
## critics_score 0.38881 0.01398 27.817 < 2e-16 ***
## best_pic_nomyes 6.40546 2.19534 2.918 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.925 on 648 degrees of freedom
## Multiple R-squared: 0.5665, Adjusted R-squared: 0.5651
## F-statistic: 423.4 on 2 and 648 DF, p-value: < 2.2e-16
For prediction using model movie_m33, the movie La La Land released in 2016 was selected fro prediction. The critics score from Rotten Tomato is 93 (rottentomatoes.com). It was also nominated as the best picture for Oscar (Oscar.go.com). Now, we have all the data for the explanatory variable for prediction of rating of the audience. The model predicts that the audience score would be 83.57 and the calculated single score of audience is 83, which is close to the prediction. For the calculation of the single score of audience, we referred to the audience score from Rotten Tomatoes and rating on IMDB which is 83 and 8.3 respectively.
pred_lala <- data.frame(critics_score = 93, best_pic_nom = "yes")
predict(movie_m33, pred_lala)## 1
## 83.56521
# Calculated single score
lalaland_sing_rating <- (8.3*10+83)/2
print(lalaland_sing_rating)## [1] 83
In order to quantify the uncertainty around the prediction, the prediction interval would provide the range of certainty that the prediction is correct. According to the prediction interval, we are 95% confident that the model predict that the audiences would score the movie La La Land between 63.64 and 103.49.
predict(movie_m33, pred_lala, interval = "prediction", level = 0.95)## fit lwr upr
## 1 83.56521 63.63769 103.4927
The variability of the audience’s score is best explained by the combination of such explanatory variables as critics score from Rotten Tomatoes and whether or not the movie was the best picture nominated for Oscar. Including the variable best_pic_nom into model was problematic because the cases for the nominee was 7 which is very small for the model.
Besides, the operational definition of a movie success may differ. However, the amount of ticket sales will be the best mearue for a movie success. Instead of using variable whether or not the movie is top 200 in the box office, the total sales in dollar should be considered as a response variable in the future.
The final problem is the model condition, especially with the condition of cosntant variability of residuals. The residual plot is fan shaped which means that the variability of residuals is not constant.
Checking for model condition.
critics_score and audience_one_score. ggpairs(movie_new, columns = c(6, 8:10, 12))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
movie_m33 shows that the distribution is nearly normal with mean 0. ggplot(movie_m33, aes(x=.resid))+
geom_histogram(binwidth = 5)+
xlab("Residuals")qqnorm(movie_m33$residuals)
qqline(movie_m33$residuals)ggplot(movie_m33, aes(x=.fitted, y=.resid))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed")+
xlab("Fitted Values")+
ylab("Residuals")plot(movie_m33$residuals)