The data were obtained from IMDB and Rotten Tomatoes. They cover 651 randomly sampled movies produced and released before 2016, with 32 variables recorded for each movie.
The raw data are not a complete list of all movies released prior to 2016; they are a random sample taken from the full data set, although the exact sampling method is not documented. Because the sample is random, the results can be generalized to movies released between 1970 and 2014. Because this is an observational study, it can show only associations. Association does not imply causation.
One possible violation of independence involves sequels: the popularity of a sequel may be influenced by that of the previous release.

* * *
Can we predict a movie’s popularity based on type of movie, genre, runtime, imdb rating, imdb number of votes, critics rating, critics score, audience rating, Oscar awards obtained (actor, actress, director and picture)?
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
We can see that 9 columns are character variables, 12 columns are factor variables, and 11 columns are numeric or integer variables, 6 of which are date-related. We then check for missing values.

* * *
## [1] 619 32
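The chunk that produced these dimensions is not shown. A minimal sketch of such a check follows, assuming incomplete rows were dropped with `na.omit()`; the data frame `toy` and its values are made up for illustration and are not the movies data.

```r
# Toy stand-in for the movies data (values are invented for illustration)
toy <- data.frame(
  runtime = c(80, 101, NA, 139),
  imdb_rating = c(5.5, 7.3, 7.6, NA),
  genre = factor(c("Drama", "Drama", "Comedy", "Drama"))
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values, then check the dimensions
toy_complete <- na.omit(toy)
print(dim(toy_complete))
```

Applied to the real data, the same two calls would show which columns carry the missing values and how many rows survive.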
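‘Studio’ has 211 levels, which is too granular for the regression model, so we will remove it. The columns for the director, the actors and actresses, and the IMDB and Rotten Tomatoes URLs (columns 25-32) will also be removed. As the structure below shows, the DVD release date columns are dropped as well.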
## Classes 'tbl_df', 'tbl' and 'data.frame': 619 obs. of 20 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 2 2 1 2 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 5 6 1 ...
## $ runtime : num 80 101 84 139 90 142 93 88 119 127 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 4 5 6 6 3 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 11 9 3 6 ...
## $ thtr_rel_day : num 19 14 21 1 10 1 8 7 2 19 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ critics_score : num 45 96 91 80 33 57 17 90 83 89 ...
## $ audience_score : num 73 81 91 76 27 76 47 89 66 75 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 3 3 2 1 1 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 1 2 2 2 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
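The code that removed the columns is not shown. One way to do it is sketched below, selecting by name instead of by position; `toy` and `drop_cols` are hypothetical and reuse only a few of the column names listed above.

```r
# Toy stand-in with a few of the column names from the structure above
toy <- data.frame(
  title = c("A", "B"),
  studio = factor(c("S1", "S2")),
  runtime = c(80, 101),
  dvd_rel_year = c(2013, 2001),
  director = c("D1", "D2"),
  imdb_url = c("u1", "u2"),
  stringsAsFactors = FALSE
)

# Columns to drop: too granular, or not useful as predictors
drop_cols <- c("studio", "dvd_rel_year", "director", "imdb_url")

# Select by name so the step still works if the column order changes
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
str(toy_reduced)
```

Selecting by name is a design choice here: positional ranges such as 25-32 break silently if a column is added or reordered upstream.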
library(ggplot2)

# Bar chart of movie counts by genre
ggplot(data=dataset, aes(x=genre)) +
  geom_bar(fill="green") +
  xlab("Genre Distribution") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0))

# Bar chart of movie counts by title type
ggplot(data=dataset, aes(x=title_type)) +
  geom_bar(fill="blue") +
  xlab("Movie Type") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0))

# Bar chart of movie counts by MPAA rating
ggplot(data=dataset, aes(x=mpaa_rating)) +
  geom_bar(fill="yellow") +
  xlab("MPAA Rating") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0))

First, we need to check for collinearity, especially among the numerical explanatory variables. We create a subset containing all numerical explanatory variables and compute its correlation matrix to identify any collinearity.
library(corrplot)

# Numerical explanatory variables, by position in the reduced data set shown above:
# runtime, thtr_rel_year/month/day, critics_score, audience_score, imdb_num_votes
num_expl_var <- dataset[c(4,6:8,10:12)]
corr <- cor(num_expl_var)
cex.before <- par("cex")
par(cex = 0.55)
col <- colorRampPalette(c("dark red","red","pink", "yellow","light green", "dark green"))(20)
corrplot(corr, method="circle", type="lower", col=col, sig.level = 0.01, tl.col="black")
par(cex = cex.before)  # restore the previous text size

Critics score and audience score are highly correlated, which could distort our regression model. To decide which of the two to remove, I plotted each of these explanatory variables against the response variable.

Audience score shows the higher correlation with the response variable, so it is chosen over critics score.
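The comparison can also be made numerically. The sketch below uses simulated scores (not the movies data) in which the audience score is deliberately built to track the response more closely; only the variable names are taken from the report.

```r
set.seed(1)
n <- 100

# Simulated data: audience_score is constructed with less noise than critics_score
imdb_rating <- rnorm(n, mean = 6.5, sd = 1)
audience_score <- 10 * imdb_rating + rnorm(n, sd = 5)
critics_score <- 10 * imdb_rating + rnorm(n, sd = 15)
toy <- data.frame(imdb_rating, audience_score, critics_score)

# Correlation of each candidate predictor with the response
cors <- sapply(toy[c("audience_score", "critics_score")],
               function(x) cor(x, toy$imdb_rating))
print(round(cors, 3))

# Keep the candidate with the larger absolute correlation
keep <- names(which.max(abs(cors)))
print(keep)
```

On the real data the same `cor()` calls against `imdb_rating` would back up the choice made from the plots.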
##
## Call:
## lm(formula = imdb_rating ~ ., data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5391 -0.1681 0.0466 0.2527 1.0848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.047e+00 4.341e+00 0.932 0.351550
## title_typeFeature Film -3.515e-01 1.939e-01 -1.812 0.070439 .
## title_typeTV Movie -3.271e-01 3.110e-01 -1.052 0.293339
## genreAnimation -5.611e-01 2.005e-01 -2.798 0.005315 **
## genreArt House & International 3.368e-01 1.593e-01 2.114 0.034915 *
## genreComedy -1.437e-01 8.269e-02 -1.738 0.082744 .
## genreDocumentary 1.157e-01 2.053e-01 0.563 0.573329
## genreDrama 1.420e-01 7.300e-02 1.945 0.052260 .
## genreHorror 1.153e-01 1.241e-01 0.930 0.352968
## genreMusical & Performing Arts 5.785e-02 1.691e-01 0.342 0.732475
## genreMystery & Suspense 2.853e-01 9.354e-02 3.050 0.002388 **
## genreOther 5.667e-02 1.429e-01 0.396 0.691914
## genreScience Fiction & Fantasy -7.923e-02 1.827e-01 -0.434 0.664733
## runtime 4.307e-03 1.271e-03 3.389 0.000747 ***
## mpaa_ratingNC-17 9.980e-02 5.058e-01 0.197 0.843640
## mpaa_ratingPG -1.580e-01 1.434e-01 -1.102 0.271029
## mpaa_ratingPG-13 -1.761e-01 1.493e-01 -1.180 0.238644
## mpaa_ratingR -1.090e-01 1.441e-01 -0.757 0.449579
## mpaa_ratingUnrated -1.672e-01 1.717e-01 -0.974 0.330701
## thtr_rel_year -1.144e-04 2.158e-03 -0.053 0.957739
## thtr_rel_month 9.412e-03 5.849e-03 1.609 0.108115
## thtr_rel_day -1.000e-03 2.257e-03 -0.443 0.657910
## audience_score 4.616e-02 2.180e-03 21.173 < 2e-16 ***
## imdb_num_votes 7.955e-07 2.321e-07 3.427 0.000652 ***
## critics_ratingFresh -3.507e-02 6.299e-02 -0.557 0.577961
## critics_ratingRotten -3.019e-01 6.741e-02 -4.478 9.05e-06 ***
## audience_ratingUpright -4.243e-01 7.947e-02 -5.339 1.34e-07 ***
## best_pic_nomyes -1.075e-01 1.289e-01 -0.834 0.404418
## best_pic_winyes -3.971e-02 2.249e-01 -0.177 0.859895
## best_actor_winyes 2.219e-02 5.842e-02 0.380 0.704198
## best_actress_winyes 8.711e-02 6.435e-02 1.354 0.176339
## best_dir_winyes 5.314e-02 8.392e-02 0.633 0.526866
## top200_boxyes -9.827e-02 1.367e-01 -0.719 0.472376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 586 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.7992
## F-statistic: 77.88 on 32 and 586 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 145.3158 < 2.2e-16 ***
## genre 10 93.05 9.30 40.1252 < 2.2e-16 ***
## runtime 1 35.91 35.91 154.8520 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6211 1.361e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2388 0.03995 *
## thtr_rel_month 1 0.66 0.66 2.8573 0.09149 .
## thtr_rel_day 1 0.42 0.42 1.8061 0.17949
## audience_score 1 344.24 344.24 1484.4354 < 2.2e-16 ***
## imdb_num_votes 1 4.68 4.68 20.1951 8.429e-06 ***
## critics_rating 2 7.17 3.59 15.4689 2.840e-07 ***
## audience_rating 1 6.80 6.80 29.3277 8.930e-08 ***
## best_pic_nom 1 0.12 0.12 0.5253 0.46889
## best_pic_win 1 0.00 0.00 0.0015 0.96878
## best_actor_win 1 0.05 0.05 0.2177 0.64096
## best_actress_win 1 0.41 0.41 1.7553 0.18573
## best_dir_win 1 0.10 0.10 0.4305 0.51202
## top200_box 1 0.12 0.12 0.5171 0.47238
## Residuals 586 135.89 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
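The calls behind the two tables above are not shown; the `Call:` line suggests a fit of `imdb_rating` on all remaining columns. A minimal sketch on simulated data (the data frame `toy` is hypothetical) shows the pattern:

```r
set.seed(42)
n <- 60

# Simulated stand-in: one informative predictor, one noise predictor, one factor
toy <- data.frame(
  audience_score = runif(n, 20, 95),
  runtime = runif(n, 70, 150),
  genre = factor(sample(c("Comedy", "Drama", "Horror"), n, replace = TRUE))
)
toy$imdb_rating <- 3 + 0.045 * toy$audience_score + rnorm(n, sd = 0.5)

# Full model: regress the response on every other column
full_model <- lm(imdb_rating ~ ., data = toy)

print(summary(full_model))  # per-coefficient t-tests, R-squared, adjusted R-squared
print(anova(full_model))    # sequential F-tests, one row per term

adj_r2 <- summary(full_model)$adj.r.squared
```

Note that `anova()` on a single `lm` fit reports sequential sums of squares, so each row depends on the terms listed before it. That is why a variable such as `thtr_rel_year` can look significant in the ANOVA table while its coefficient p-value in the summary is large.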
We start with a relatively high adjusted R-squared of 0.7992. We work our way backward through the set, removing variables with high p-values first. The second model below drops best_pic_win and top200_box.

* * *
# Second model: best_pic_win and top200_box removed
second_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                     thtr_rel_year + thtr_rel_month + thtr_rel_day +
                     audience_score + imdb_num_votes + critics_rating +
                     audience_rating + best_pic_nom + best_actor_win +
                     best_actress_win + best_dir_win, data = model)
summary(second_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score +
## imdb_num_votes + critics_rating + audience_rating + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5427 -0.1683 0.0389 0.2524 1.0862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.719e+00 4.309e+00 0.863 0.388495
## title_typeFeature Film -3.523e-01 1.937e-01 -1.819 0.069420 .
## title_typeTV Movie -3.302e-01 3.106e-01 -1.063 0.288151
## genreAnimation -5.474e-01 1.993e-01 -2.746 0.006213 **
## genreArt House & International 3.401e-01 1.590e-01 2.140 0.032794 *
## genreComedy -1.400e-01 8.225e-02 -1.703 0.089184 .
## genreDocumentary 1.182e-01 2.050e-01 0.576 0.564518
## genreDrama 1.458e-01 7.264e-02 2.006 0.045261 *
## genreHorror 1.184e-01 1.238e-01 0.956 0.339465
## genreMusical & Performing Arts 6.193e-02 1.688e-01 0.367 0.713808
## genreMystery & Suspense 2.891e-01 9.322e-02 3.101 0.002019 **
## genreOther 6.195e-02 1.424e-01 0.435 0.663635
## genreScience Fiction & Fantasy -8.187e-02 1.824e-01 -0.449 0.653750
## runtime 4.283e-03 1.269e-03 3.376 0.000785 ***
## mpaa_ratingNC-17 1.169e-01 5.046e-01 0.232 0.816914
## mpaa_ratingPG -1.494e-01 1.427e-01 -1.047 0.295512
## mpaa_ratingPG-13 -1.644e-01 1.483e-01 -1.109 0.267989
## mpaa_ratingR -9.556e-02 1.427e-01 -0.669 0.503442
## mpaa_ratingUnrated -1.568e-01 1.708e-01 -0.918 0.359183
## thtr_rel_year 4.103e-05 2.143e-03 0.019 0.984733
## thtr_rel_month 9.214e-03 5.819e-03 1.583 0.113853
## thtr_rel_day -9.823e-04 2.253e-03 -0.436 0.662996
## audience_score 4.626e-02 2.171e-03 21.307 < 2e-16 ***
## imdb_num_votes 7.524e-07 2.228e-07 3.377 0.000780 ***
## critics_ratingFresh -3.087e-02 6.264e-02 -0.493 0.622322
## critics_ratingRotten -2.977e-01 6.709e-02 -4.437 1.09e-05 ***
## audience_ratingUpright -4.271e-01 7.928e-02 -5.387 1.03e-07 ***
## best_pic_nomyes -1.126e-01 1.179e-01 -0.955 0.339960
## best_actor_winyes 2.183e-02 5.819e-02 0.375 0.707655
## best_actress_winyes 8.454e-02 6.415e-02 1.318 0.188054
## best_dir_winyes 5.105e-02 8.052e-02 0.634 0.526354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.481 on 588 degrees of freedom
## Multiple R-squared: 0.8094, Adjusted R-squared: 0.7997
## F-statistic: 83.26 on 30 and 588 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 145.6759 < 2.2e-16 ***
## genre 10 93.05 9.30 40.2246 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.2357 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6548 1.259e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2493 0.03971 *
## thtr_rel_month 1 0.66 0.66 2.8644 0.09109 .
## thtr_rel_day 1 0.42 0.42 1.8106 0.17896
## audience_score 1 344.24 344.24 1488.1133 < 2.2e-16 ***
## imdb_num_votes 1 4.68 4.68 20.2451 8.213e-06 ***
## critics_rating 2 7.17 3.59 15.5073 2.735e-07 ***
## audience_rating 1 6.80 6.80 29.4003 8.606e-08 ***
## best_pic_nom 1 0.12 0.12 0.5266 0.46834
## best_actor_win 1 0.05 0.05 0.2153 0.64281
## best_actress_win 1 0.41 0.41 1.7641 0.18463
## best_dir_win 1 0.09 0.09 0.4019 0.52635
## Residuals 588 136.02 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our adjusted R-squared improved slightly, from 0.7992 to 0.7997. We will continue to remove insignificant variables. This time we remove best_actor_win.
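Each elimination step refits the model with one term removed and compares adjusted R-squared. As an alternative to retyping the formula, `update()` can drop a term from an existing fit. The sketch below runs on simulated data; `toy`, `bigger`, and `smaller` are hypothetical names.

```r
set.seed(7)
n <- 80

# Simulated stand-in: the award indicator has no real effect on the response
toy <- data.frame(
  audience_score = runif(n, 20, 95),
  best_actor_win = factor(sample(c("no", "yes"), n, replace = TRUE))
)
toy$imdb_rating <- 3 + 0.045 * toy$audience_score + rnorm(n, sd = 0.5)

bigger <- lm(imdb_rating ~ audience_score + best_actor_win, data = toy)

# update() refits the model with one term removed
smaller <- update(bigger, . ~ . - best_actor_win)

# Compare adjusted R-squared before and after the removal
adj_r2 <- c(bigger = summary(bigger)$adj.r.squared,
            smaller = summary(smaller)$adj.r.squared)
print(round(adj_r2, 4))
```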
# Third model: best_actor_win removed
third_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                    thtr_rel_year + thtr_rel_month + thtr_rel_day +
                    audience_score + imdb_num_votes + critics_rating +
                    audience_rating + best_pic_nom +
                    best_actress_win + best_dir_win, data = model)
summary(third_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score +
## imdb_num_votes + critics_rating + audience_rating + best_pic_nom +
## best_actress_win + best_dir_win, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54348 -0.16429 0.03922 0.25058 1.08392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.709e+00 4.306e+00 0.861 0.389437
## title_typeFeature Film -3.510e-01 1.935e-01 -1.814 0.070172 .
## title_typeTV Movie -3.318e-01 3.104e-01 -1.069 0.285437
## genreAnimation -5.456e-01 1.991e-01 -2.740 0.006327 **
## genreArt House & International 3.379e-01 1.587e-01 2.129 0.033706 *
## genreComedy -1.402e-01 8.219e-02 -1.705 0.088653 .
## genreDocumentary 1.178e-01 2.048e-01 0.575 0.565341
## genreDrama 1.464e-01 7.257e-02 2.018 0.044068 *
## genreHorror 1.169e-01 1.237e-01 0.945 0.344967
## genreMusical & Performing Arts 6.077e-02 1.686e-01 0.360 0.718691
## genreMystery & Suspense 2.921e-01 9.280e-02 3.148 0.001728 **
## genreOther 6.263e-02 1.423e-01 0.440 0.659933
## genreScience Fiction & Fantasy -8.394e-02 1.822e-01 -0.461 0.645200
## runtime 4.367e-03 1.248e-03 3.499 0.000502 ***
## mpaa_ratingNC-17 1.164e-01 5.042e-01 0.231 0.817450
## mpaa_ratingPG -1.478e-01 1.426e-01 -1.037 0.300153
## mpaa_ratingPG-13 -1.637e-01 1.482e-01 -1.105 0.269627
## mpaa_ratingR -9.508e-02 1.426e-01 -0.667 0.505272
## mpaa_ratingUnrated -1.566e-01 1.707e-01 -0.917 0.359406
## thtr_rel_year 4.112e-05 2.141e-03 0.019 0.984687
## thtr_rel_month 9.321e-03 5.808e-03 1.605 0.109038
## thtr_rel_day -9.664e-04 2.251e-03 -0.429 0.667841
## audience_score 4.628e-02 2.169e-03 21.342 < 2e-16 ***
## imdb_num_votes 7.491e-07 2.224e-07 3.368 0.000807 ***
## critics_ratingFresh -3.038e-02 6.258e-02 -0.486 0.627500
## critics_ratingRotten -2.976e-01 6.704e-02 -4.439 1.08e-05 ***
## audience_ratingUpright -4.285e-01 7.912e-02 -5.416 8.89e-08 ***
## best_pic_nomyes -1.087e-01 1.173e-01 -0.927 0.354523
## best_actress_winyes 8.590e-02 6.400e-02 1.342 0.180049
## best_dir_winyes 5.165e-02 8.045e-02 0.642 0.521138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4806 on 589 degrees of freedom
## Multiple R-squared: 0.8094, Adjusted R-squared: 0.8
## F-statistic: 86.25 on 29 and 589 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 145.8887 < 2.2e-16 ***
## genre 10 93.05 9.30 40.2833 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.4625 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6748 1.203e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2555 0.03956 *
## thtr_rel_month 1 0.66 0.66 2.8686 0.09085 .
## thtr_rel_day 1 0.42 0.42 1.8132 0.17864
## audience_score 1 344.24 344.24 1490.2874 < 2.2e-16 ***
## imdb_num_votes 1 4.68 4.68 20.2747 8.089e-06 ***
## critics_rating 2 7.17 3.59 15.5299 2.675e-07 ***
## audience_rating 1 6.80 6.80 29.4433 8.421e-08 ***
## best_pic_nom 1 0.12 0.12 0.5273 0.46802
## best_actress_win 1 0.42 0.42 1.8316 0.17645
## best_dir_win 1 0.10 0.10 0.4121 0.52114
## Residuals 589 136.05 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adjusted R-squared increased to 0.8. This time we will remove best_dir_win.
# Fourth model: best_dir_win removed
fourth_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                     thtr_rel_year + thtr_rel_month + thtr_rel_day +
                     audience_score + imdb_num_votes + critics_rating +
                     audience_rating + best_pic_nom +
                     best_actress_win, data = model)
summary(fourth_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score +
## imdb_num_votes + critics_rating + audience_rating + best_pic_nom +
## best_actress_win, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54251 -0.16980 0.03898 0.25007 1.07756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.898e+00 4.294e+00 0.908 0.364422
## title_typeFeature Film -3.473e-01 1.933e-01 -1.797 0.072894 .
## title_typeTV Movie -3.299e-01 3.102e-01 -1.064 0.287919
## genreAnimation -5.446e-01 1.990e-01 -2.737 0.006396 **
## genreArt House & International 3.352e-01 1.586e-01 2.114 0.034959 *
## genreComedy -1.400e-01 8.215e-02 -1.704 0.088952 .
## genreDocumentary 1.187e-01 2.047e-01 0.580 0.562240
## genreDrama 1.455e-01 7.252e-02 2.007 0.045204 *
## genreHorror 1.173e-01 1.236e-01 0.949 0.342819
## genreMusical & Performing Arts 6.113e-02 1.685e-01 0.363 0.716988
## genreMystery & Suspense 2.924e-01 9.276e-02 3.152 0.001703 **
## genreOther 5.847e-02 1.421e-01 0.412 0.680797
## genreScience Fiction & Fantasy -8.223e-02 1.821e-01 -0.452 0.651762
## runtime 4.493e-03 1.232e-03 3.647 0.000289 ***
## mpaa_ratingNC-17 1.183e-01 5.040e-01 0.235 0.814531
## mpaa_ratingPG -1.439e-01 1.424e-01 -1.011 0.312445
## mpaa_ratingPG-13 -1.608e-01 1.480e-01 -1.086 0.277748
## mpaa_ratingR -9.120e-02 1.424e-01 -0.640 0.522205
## mpaa_ratingUnrated -1.547e-01 1.706e-01 -0.907 0.364890
## thtr_rel_year -6.205e-05 2.134e-03 -0.029 0.976818
## thtr_rel_month 9.375e-03 5.804e-03 1.615 0.106806
## thtr_rel_day -9.780e-04 2.250e-03 -0.435 0.663947
## audience_score 4.633e-02 2.166e-03 21.386 < 2e-16 ***
## imdb_num_votes 7.599e-07 2.217e-07 3.428 0.000651 ***
## critics_ratingFresh -3.019e-02 6.255e-02 -0.483 0.629565
## critics_ratingRotten -3.003e-01 6.686e-02 -4.492 8.49e-06 ***
## audience_ratingUpright -4.312e-01 7.898e-02 -5.459 7.05e-08 ***
## best_pic_nomyes -1.052e-01 1.171e-01 -0.898 0.369509
## best_actress_winyes 8.660e-02 6.396e-02 1.354 0.176236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4804 on 590 degrees of freedom
## Multiple R-squared: 0.8093, Adjusted R-squared: 0.8002
## F-statistic: 89.4 on 28 and 590 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 146.0342 < 2.2e-16 ***
## genre 10 93.05 9.30 40.3235 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.6175 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6884 1.166e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2598 0.03946 *
## thtr_rel_month 1 0.66 0.66 2.8715 0.09069 .
## thtr_rel_day 1 0.42 0.42 1.8150 0.17842
## audience_score 1 344.24 344.24 1491.7737 < 2.2e-16 ***
## imdb_num_votes 1 4.68 4.68 20.2949 8.004e-06 ***
## critics_rating 2 7.17 3.59 15.5454 2.634e-07 ***
## audience_rating 1 6.80 6.80 29.4727 8.295e-08 ***
## best_pic_nom 1 0.12 0.12 0.5279 0.46779
## best_actress_win 1 0.42 0.42 1.8335 0.17624
## Residuals 590 136.15 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adjusted R-squared improved slightly to 0.8002. We will remove best_pic_nom this time.
# Fifth model: best_pic_nom removed
fifth_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                    thtr_rel_year + thtr_rel_month + thtr_rel_day +
                    audience_score + imdb_num_votes + critics_rating +
                    audience_rating +
                    best_actress_win, data = model)
summary(fifth_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score +
## imdb_num_votes + critics_rating + audience_rating + best_actress_win,
## data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54575 -0.17002 0.03634 0.25551 1.07724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.707e+00 4.288e+00 0.865 0.387641
## title_typeFeature Film -3.490e-01 1.933e-01 -1.806 0.071503 .
## title_typeTV Movie -3.335e-01 3.101e-01 -1.075 0.282592
## genreAnimation -5.454e-01 1.990e-01 -2.741 0.006309 **
## genreArt House & International 3.338e-01 1.586e-01 2.105 0.035719 *
## genreComedy -1.423e-01 8.209e-02 -1.734 0.083452 .
## genreDocumentary 1.169e-01 2.047e-01 0.571 0.568027
## genreDrama 1.420e-01 7.240e-02 1.962 0.050256 .
## genreHorror 1.126e-01 1.235e-01 0.912 0.362291
## genreMusical & Performing Arts 6.323e-02 1.685e-01 0.375 0.707624
## genreMystery & Suspense 2.894e-01 9.268e-02 3.122 0.001881 **
## genreOther 5.015e-02 1.417e-01 0.354 0.723574
## genreScience Fiction & Fantasy -8.128e-02 1.821e-01 -0.446 0.655447
## runtime 4.388e-03 1.226e-03 3.578 0.000374 ***
## mpaa_ratingNC-17 1.241e-01 5.038e-01 0.246 0.805524
## mpaa_ratingPG -1.471e-01 1.423e-01 -1.034 0.301746
## mpaa_ratingPG-13 -1.637e-01 1.480e-01 -1.106 0.268974
## mpaa_ratingR -9.207e-02 1.424e-01 -0.647 0.518178
## mpaa_ratingUnrated -1.546e-01 1.706e-01 -0.906 0.365172
## thtr_rel_year 4.435e-05 2.131e-03 0.021 0.983399
## thtr_rel_month 8.720e-03 5.757e-03 1.515 0.130402
## thtr_rel_day -9.169e-04 2.248e-03 -0.408 0.683571
## audience_score 4.617e-02 2.159e-03 21.388 < 2e-16 ***
## imdb_num_votes 7.296e-07 2.191e-07 3.330 0.000921 ***
## critics_ratingFresh -2.333e-02 6.207e-02 -0.376 0.707183
## critics_ratingRotten -2.938e-01 6.646e-02 -4.421 1.17e-05 ***
## audience_ratingUpright -4.276e-01 7.886e-02 -5.422 8.61e-08 ***
## best_actress_winyes 7.906e-02 6.339e-02 1.247 0.212848
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4803 on 591 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.8003
## F-statistic: 92.72 on 27 and 591 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 146.0820 < 2.2e-16 ***
## genre 10 93.05 9.30 40.3367 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.6685 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6929 1.151e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2612 0.03943 *
## thtr_rel_month 1 0.66 0.66 2.8724 0.09064 .
## thtr_rel_day 1 0.42 0.42 1.8156 0.17835
## audience_score 1 344.24 344.24 1492.2622 < 2.2e-16 ***
## imdb_num_votes 1 4.68 4.68 20.3016 7.975e-06 ***
## critics_rating 2 7.17 3.59 15.5505 2.619e-07 ***
## audience_rating 1 6.80 6.80 29.4823 8.251e-08 ***
## best_actress_win 1 0.36 0.36 1.5553 0.21285
## Residuals 591 136.33 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the sixth model, we will remove best_actress_win. I also started to notice a pattern: we have been removing variables that have to do with Oscar awards (best actress, best picture nomination, best director…).
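When a group of related variables keeps dropping out one at a time, a partial F-test on the whole group is a useful cross-check. This is not what the report does; it is a sketch of a swapped-in technique on simulated data, with hypothetical object names.

```r
set.seed(11)
n <- 120

# Simulated stand-in: neither award indicator affects the response
toy <- data.frame(
  audience_score = runif(n, 20, 95),
  best_actor_win = factor(sample(c("no", "yes"), n, replace = TRUE)),
  best_dir_win = factor(sample(c("no", "yes"), n, replace = TRUE))
)
toy$imdb_rating <- 3 + 0.045 * toy$audience_score + rnorm(n, sd = 0.5)

with_awards <- lm(imdb_rating ~ audience_score + best_actor_win + best_dir_win,
                  data = toy)
without_awards <- lm(imdb_rating ~ audience_score, data = toy)

# Partial F-test: do the award indicators jointly add explanatory power?
joint_test <- anova(without_awards, with_awards)
print(joint_test)
```

A large p-value in the second row would support removing the award variables as a block.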
# Sixth model: best_actress_win removed
sixth_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                    thtr_rel_year + thtr_rel_month + thtr_rel_day +
                    audience_score + imdb_num_votes + critics_rating +
                    audience_rating, data = model)
summary(sixth_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score +
## imdb_num_votes + critics_rating + audience_rating, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54920 -0.17859 0.03308 0.25704 1.07702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.933e+00 4.286e+00 0.918 0.359174
## title_typeFeature Film -3.490e-01 1.934e-01 -1.805 0.071604 .
## title_typeTV Movie -3.203e-01 3.101e-01 -1.033 0.302092
## genreAnimation -5.323e-01 1.988e-01 -2.678 0.007620 **
## genreArt House & International 3.435e-01 1.584e-01 2.168 0.030557 *
## genreComedy -1.320e-01 8.171e-02 -1.616 0.106720
## genreDocumentary 1.241e-01 2.047e-01 0.606 0.544576
## genreDrama 1.546e-01 7.173e-02 2.156 0.031515 *
## genreHorror 1.165e-01 1.235e-01 0.944 0.345663
## genreMusical & Performing Arts 6.406e-02 1.686e-01 0.380 0.704105
## genreMystery & Suspense 3.042e-01 9.196e-02 3.309 0.000995 ***
## genreOther 5.892e-02 1.416e-01 0.416 0.677545
## genreScience Fiction & Fantasy -8.064e-02 1.822e-01 -0.443 0.658151
## runtime 4.589e-03 1.216e-03 3.774 0.000177 ***
## mpaa_ratingNC-17 1.115e-01 5.040e-01 0.221 0.825043
## mpaa_ratingPG -1.450e-01 1.423e-01 -1.018 0.308901
## mpaa_ratingPG-13 -1.628e-01 1.480e-01 -1.100 0.271811
## mpaa_ratingR -9.354e-02 1.425e-01 -0.657 0.511692
## mpaa_ratingUnrated -1.565e-01 1.707e-01 -0.917 0.359489
## thtr_rel_year -7.682e-05 2.130e-03 -0.036 0.971235
## thtr_rel_month 8.783e-03 5.760e-03 1.525 0.127819
## thtr_rel_day -8.508e-04 2.249e-03 -0.378 0.705311
## audience_score 4.611e-02 2.159e-03 21.358 < 2e-16 ***
## imdb_num_votes 7.401e-07 2.190e-07 3.379 0.000775 ***
## critics_ratingFresh -3.021e-02 6.186e-02 -0.488 0.625476
## critics_ratingRotten -2.986e-01 6.638e-02 -4.498 8.25e-06 ***
## audience_ratingUpright -4.282e-01 7.890e-02 -5.427 8.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4805 on 592 degrees of freedom
## Multiple R-squared: 0.8085, Adjusted R-squared: 0.8001
## F-statistic: 96.13 on 26 and 592 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 145.9451 < 2.2e-16 ***
## genre 10 93.05 9.30 40.2989 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.5226 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6801 1.180e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2572 0.03952 *
## thtr_rel_month 1 0.66 0.66 2.8697 0.09079 .
## thtr_rel_day 1 0.42 0.42 1.8139 0.17855
## audience_score 1 344.24 344.24 1490.8638 < 2.2e-16 ***
## imdb_num_votes 1 4.68 4.68 20.2825 8.050e-06 ***
## critics_rating 2 7.17 3.59 15.5359 2.654e-07 ***
## audience_rating 1 6.80 6.80 29.4547 8.358e-08 ***
## Residuals 592 136.69 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared decreased slightly (adjusted R-squared went from 0.8003 to 0.8001), but the decrease is negligible. The model still contains variables with high p-values, such as thtr_rel_year, thtr_rel_month, thtr_rel_day and mpaa_rating. I will remove them one by one to see whether the model improves.
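For a multi-level factor such as mpaa_rating, the summary prints one t-test per level, which makes it hard to judge the variable as a whole. One option, not used in the report, is `drop1()`, which gives a single F-test per term. A minimal sketch on simulated data with hypothetical names:

```r
set.seed(3)
n <- 150

# Simulated stand-in: the rating factor has no real effect on the response
toy <- data.frame(
  audience_score = runif(n, 20, 95),
  mpaa_rating = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)
toy$imdb_rating <- 3 + 0.045 * toy$audience_score + rnorm(n, sd = 0.5)

fit <- lm(imdb_rating ~ audience_score + mpaa_rating, data = toy)

# drop1() tests each term as a whole: one F-test per variable,
# rather than one t-test per factor level as in summary()
term_tests <- drop1(fit, test = "F")
print(term_tests)
```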
# Seventh model: thtr_rel_month removed
seventh_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                      thtr_rel_year + thtr_rel_day +
                      audience_score + imdb_num_votes + critics_rating +
                      audience_rating, data = model)
summary(seventh_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + thtr_rel_day + audience_score + imdb_num_votes +
## critics_rating + audience_rating, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57159 -0.17542 0.03625 0.25896 1.09890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.817e+00 4.290e+00 0.890 0.374043
## title_typeFeature Film -3.482e-01 1.936e-01 -1.799 0.072593 .
## title_typeTV Movie -3.517e-01 3.097e-01 -1.135 0.256666
## genreAnimation -5.241e-01 1.990e-01 -2.635 0.008646 **
## genreArt House & International 3.429e-01 1.586e-01 2.162 0.031037 *
## genreComedy -1.261e-01 8.171e-02 -1.543 0.123423
## genreDocumentary 1.270e-01 2.049e-01 0.620 0.535652
## genreDrama 1.530e-01 7.180e-02 2.131 0.033489 *
## genreHorror 1.191e-01 1.236e-01 0.963 0.335890
## genreMusical & Performing Arts 6.737e-02 1.688e-01 0.399 0.689889
## genreMystery & Suspense 2.958e-01 9.189e-02 3.219 0.001359 **
## genreOther 4.538e-02 1.415e-01 0.321 0.748520
## genreScience Fiction & Fantasy -8.574e-02 1.823e-01 -0.470 0.638333
## runtime 5.021e-03 1.184e-03 4.241 2.58e-05 ***
## mpaa_ratingNC-17 1.429e-01 5.041e-01 0.283 0.776981
## mpaa_ratingPG -1.391e-01 1.425e-01 -0.977 0.329077
## mpaa_ratingPG-13 -1.667e-01 1.482e-01 -1.125 0.260873
## mpaa_ratingR -8.911e-02 1.426e-01 -0.625 0.532254
## mpaa_ratingUnrated -1.543e-01 1.708e-01 -0.903 0.366799
## thtr_rel_year -1.485e-05 2.131e-03 -0.007 0.994443
## thtr_rel_day -4.948e-04 2.239e-03 -0.221 0.825173
## audience_score 4.608e-02 2.161e-03 21.319 < 2e-16 ***
## imdb_num_votes 7.505e-07 2.192e-07 3.424 0.000659 ***
## critics_ratingFresh -2.996e-02 6.192e-02 -0.484 0.628725
## critics_ratingRotten -2.990e-01 6.645e-02 -4.500 8.19e-06 ***
## audience_ratingUpright -4.283e-01 7.899e-02 -5.423 8.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4811 on 593 degrees of freedom
## Multiple R-squared: 0.8077, Adjusted R-squared: 0.7996
## F-statistic: 99.66 on 25 and 593 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 145.6197 < 2.2e-16 ***
## genre 10 93.05 9.30 40.2091 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.1758 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6495 1.257e-12 ***
## thtr_rel_year 1 0.98 0.98 4.2477 0.03974 *
## thtr_rel_day 1 0.53 0.53 2.2978 0.13009
## audience_score 1 344.10 344.10 1486.9493 < 2.2e-16 ***
## imdb_num_votes 1 4.80 4.80 20.7253 6.435e-06 ***
## critics_rating 2 7.21 3.60 15.5706 2.567e-07 ***
## audience_rating 1 6.81 6.81 29.4076 8.549e-08 ***
## Residuals 593 137.23 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Eighth model: thtr_rel_year removed
eighth_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating +
                     thtr_rel_day +
                     audience_score + imdb_num_votes + critics_rating +
                     audience_rating, data = model)
summary(eighth_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_day + audience_score + imdb_num_votes + critics_rating +
## audience_rating, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57161 -0.17534 0.03626 0.25899 1.09904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.787e+00 2.875e-01 13.173 < 2e-16 ***
## title_typeFeature Film -3.482e-01 1.934e-01 -1.800 0.072312 .
## title_typeTV Movie -3.517e-01 3.095e-01 -1.136 0.256268
## genreAnimation -5.243e-01 1.968e-01 -2.664 0.007922 **
## genreArt House & International 3.429e-01 1.585e-01 2.164 0.030892 *
## genreComedy -1.261e-01 8.163e-02 -1.544 0.123095
## genreDocumentary 1.269e-01 2.046e-01 0.620 0.535246
## genreDrama 1.530e-01 7.172e-02 2.133 0.033307 *
## genreHorror 1.191e-01 1.231e-01 0.967 0.333752
## genreMusical & Performing Arts 6.732e-02 1.685e-01 0.400 0.689590
## genreMystery & Suspense 2.957e-01 9.180e-02 3.222 0.001345 **
## genreOther 4.545e-02 1.410e-01 0.322 0.747380
## genreScience Fiction & Fantasy -8.568e-02 1.820e-01 -0.471 0.637908
## runtime 5.023e-03 1.153e-03 4.357 1.56e-05 ***
## mpaa_ratingNC-17 1.430e-01 5.035e-01 0.284 0.776552
## mpaa_ratingPG -1.392e-01 1.421e-01 -0.979 0.327840
## mpaa_ratingPG-13 -1.669e-01 1.455e-01 -1.148 0.251637
## mpaa_ratingR -8.926e-02 1.408e-01 -0.634 0.526496
## mpaa_ratingUnrated -1.546e-01 1.659e-01 -0.932 0.351895
## thtr_rel_day -4.960e-04 2.231e-03 -0.222 0.824124
## audience_score 4.608e-02 2.152e-03 21.412 < 2e-16 ***
## imdb_num_votes 7.501e-07 2.136e-07 3.513 0.000477 ***
## critics_ratingFresh -2.988e-02 6.087e-02 -0.491 0.623713
## critics_ratingRotten -2.990e-01 6.629e-02 -4.510 7.81e-06 ***
## audience_ratingUpright -4.283e-01 7.892e-02 -5.428 8.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4807 on 594 degrees of freedom
## Multiple R-squared: 0.8077, Adjusted R-squared: 0.8
## F-statistic: 104 on 24 and 594 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 145.8652 < 2.2e-16 ***
## genre 10 93.05 9.30 40.2769 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.4375 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.6726 1.193e-12 ***
## thtr_rel_day 1 0.38 0.38 1.6601 0.1981
## audience_score 1 345.02 345.02 1493.4320 < 2.2e-16 ***
## imdb_num_votes 1 4.97 4.97 21.5339 4.280e-06 ***
## critics_rating 2 7.24 3.62 15.6691 2.336e-07 ***
## audience_rating 1 6.81 6.81 29.4601 8.326e-08 ***
## Residuals 594 137.23 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ninth_model <- lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + audience_score + imdb_num_votes + critics_rating + audience_rating, data = model)
summary(ninth_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## audience_score + imdb_num_votes + critics_rating + audience_rating,
## data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57790 -0.17857 0.03747 0.25433 1.10365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.779e+00 2.848e-01 13.265 < 2e-16 ***
## title_typeFeature Film -3.479e-01 1.932e-01 -1.800 0.072295 .
## title_typeTV Movie -3.532e-01 3.092e-01 -1.142 0.253759
## genreAnimation -5.244e-01 1.966e-01 -2.667 0.007866 **
## genreArt House & International 3.405e-01 1.580e-01 2.155 0.031546 *
## genreComedy -1.256e-01 8.154e-02 -1.540 0.124036
## genreDocumentary 1.265e-01 2.045e-01 0.619 0.536481
## genreDrama 1.526e-01 7.165e-02 2.130 0.033557 *
## genreHorror 1.177e-01 1.229e-01 0.958 0.338479
## genreMusical & Performing Arts 6.655e-02 1.683e-01 0.395 0.692648
## genreMystery & Suspense 2.950e-01 9.166e-02 3.218 0.001361 **
## genreOther 4.692e-02 1.408e-01 0.333 0.739058
## genreScience Fiction & Fantasy -8.297e-02 1.814e-01 -0.457 0.647588
## runtime 5.025e-03 1.152e-03 4.362 1.52e-05 ***
## mpaa_ratingNC-17 1.389e-01 5.027e-01 0.276 0.782385
## mpaa_ratingPG -1.399e-01 1.420e-01 -0.985 0.325009
## mpaa_ratingPG-13 -1.684e-01 1.452e-01 -1.159 0.246736
## mpaa_ratingR -8.941e-02 1.407e-01 -0.635 0.525478
## mpaa_ratingUnrated -1.541e-01 1.658e-01 -0.930 0.352837
## audience_score 4.609e-02 2.150e-03 21.438 < 2e-16 ***
## imdb_num_votes 7.491e-07 2.133e-07 3.512 0.000479 ***
## critics_ratingFresh -2.832e-02 6.042e-02 -0.469 0.639446
## critics_ratingRotten -2.978e-01 6.604e-02 -4.510 7.81e-06 ***
## audience_ratingUpright -4.287e-01 7.884e-02 -5.438 7.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4803 on 595 degrees of freedom
## Multiple R-squared: 0.8077, Adjusted R-squared: 0.8003
## F-statistic: 108.7 on 23 and 595 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 146.099 < 2.2e-16 ***
## genre 10 93.05 9.30 40.341 < 2.2e-16 ***
## runtime 1 35.91 35.91 155.686 < 2.2e-16 ***
## mpaa_rating 5 15.79 3.16 13.694 1.136e-12 ***
## audience_score 1 345.40 345.40 1497.484 < 2.2e-16 ***
## imdb_num_votes 1 4.95 4.95 21.477 4.402e-06 ***
## critics_rating 2 7.24 3.62 15.685 2.299e-07 ***
## audience_rating 1 6.82 6.82 29.567 7.896e-08 ***
## Residuals 595 137.24 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tenth_model <- lm(imdb_rating ~ title_type + genre + runtime + audience_score + imdb_num_votes + critics_rating + audience_rating, data = model)
summary(tenth_model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + audience_score +
## imdb_num_votes + critics_rating + audience_rating, data = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54779 -0.18685 0.04423 0.25968 1.05127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.671e+00 2.544e-01 14.429 < 2e-16 ***
## title_typeFeature Film -3.267e-01 1.910e-01 -1.710 0.087747 .
## title_typeTV Movie -3.411e-01 3.080e-01 -1.107 0.268631
## genreAnimation -4.681e-01 1.818e-01 -2.574 0.010287 *
## genreArt House & International 3.302e-01 1.541e-01 2.143 0.032484 *
## genreComedy -1.394e-01 8.050e-02 -1.732 0.083746 .
## genreDocumentary 1.152e-01 2.022e-01 0.569 0.569232
## genreDrama 1.532e-01 6.973e-02 2.197 0.028411 *
## genreHorror 1.333e-01 1.202e-01 1.109 0.267731
## genreMusical & Performing Arts 6.554e-02 1.672e-01 0.392 0.695270
## genreMystery & Suspense 3.081e-01 8.961e-02 3.438 0.000626 ***
## genreOther 3.427e-02 1.398e-01 0.245 0.806471
## genreScience Fiction & Fantasy -6.879e-02 1.810e-01 -0.380 0.704109
## runtime 4.728e-03 1.138e-03 4.156 3.71e-05 ***
## audience_score 4.627e-02 2.134e-03 21.684 < 2e-16 ***
## imdb_num_votes 7.335e-07 2.103e-07 3.487 0.000523 ***
## critics_ratingFresh -3.492e-02 6.009e-02 -0.581 0.561359
## critics_ratingRotten -3.089e-01 6.536e-02 -4.726 2.86e-06 ***
## audience_ratingUpright -4.279e-01 7.855e-02 -5.448 7.46e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4798 on 600 degrees of freedom
## Multiple R-squared: 0.8065, Adjusted R-squared: 0.8007
## F-statistic: 138.9 on 18 and 600 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: imdb_rating
## Df Sum Sq Mean Sq F value Pr(>F)
## title_type 2 67.40 33.70 146.401 < 2.2e-16 ***
## genre 10 93.05 9.30 40.425 < 2.2e-16 ***
## runtime 1 35.91 35.91 156.008 < 2.2e-16 ***
## audience_score 1 360.03 360.03 1564.111 < 2.2e-16 ***
## imdb_num_votes 1 4.79 4.79 20.813 6.143e-06 ***
## critics_rating 2 7.69 3.85 16.706 8.695e-08 ***
## audience_rating 1 6.83 6.83 29.677 7.456e-08 ***
## Residuals 600 138.11 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After removing the variables with high p-values one at a time, the adjusted R-squared of the tenth model is 0.8007, the highest of the models fitted. The tenth model is therefore our final model.
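The comparison above can be checked by hand from the printed summaries. The sketch below recomputes adjusted R-squared from the rounded R-squared values, and runs a partial F-test for dropping `mpaa_rating` (5 degrees of freedom) using the residual sums of squares in the two ANOVA tables. The helper name `adj_r2` is ours, not part of the original analysis, and because the inputs are rounded the results only match the printed values to about three decimals.

```r
# Adjusted R-squared from R-squared, sample size n and number of slope
# coefficients p (intercept excluded). Hypothetical helper for illustration.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual df + p + 1 = 619 for all three models.
adj_eighth <- adj_r2(0.8077, n = 619, p = 24)
adj_ninth  <- adj_r2(0.8077, n = 619, p = 23)
adj_tenth  <- adj_r2(0.8065, n = 619, p = 18)
round(c(eighth = adj_eighth, ninth = adj_ninth, tenth = adj_tenth), 3)

# Partial F-test for dropping mpaa_rating (ninth -> tenth model),
# using the residual sums of squares printed in the ANOVA tables.
rss_ninth <- 137.24; df_ninth <- 595
rss_tenth <- 138.11; df_tenth <- 600
f_stat <- ((rss_tenth - rss_ninth) / (df_tenth - df_ninth)) / (rss_ninth / df_ninth)
p_val  <- pf(f_stat, df_tenth - df_ninth, df_ninth, lower.tail = FALSE)
c(F = f_stat, p = p_val)
```

An F statistic below 1 with a large p-value is consistent with the decision to drop `mpaa_rating`. With the fitted objects available, `anova(tenth_model, ninth_model)` would run the same test directly.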
Now we need to check whether the model meets the following conditions:
the residuals are scattered randomly:
par(mfrow = c(1, 3))
plot(tenth_model$residuals~dataset$runtime, ylab="Residuals", xlab="RunTime", main="Residuals vs RunTime")
plot(tenth_model$residuals~dataset$audience_score, ylab="Residuals", xlab="Audience_Score", main="Residuals vs Audience_Score")
plot(tenth_model$residuals~dataset$imdb_num_votes, ylab="Residuals", xlab="IMDB Num Votes", main="Residuals vs IMDB Votes")
the residuals are nearly normally distributed:
par(mfrow = c(1, 2))
hist(tenth_model$residuals, col="blue", main="Histogram-Model Residuals")
qqnorm(tenth_model$residuals)
qqline(tenth_model$residuals)
There is some skew in the tail, but it is not significant.
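One caveat for the residual-versus-predictor plots above: they pair `tenth_model$residuals` with columns of `dataset`, which only lines up if `dataset` holds exactly the rows used in the fit. The summary output implies 619 observations (600 residual degrees of freedom plus 19 coefficients), so if `dataset` still had all 651 rows the lengths would not match. A minimal sketch of one way to guarantee alignment, using `model.frame()` on R's built-in `mtcars` data as a stand-in (the variables and the injected missing values are illustrative, not from the movie data):

```r
# Illustrative only: mtcars stands in for the movie data.
cars <- mtcars
cars$hp[c(3, 7)] <- NA             # introduce missing values on purpose

# lm() drops incomplete rows by default (na.action = na.omit).
fit <- lm(mpg ~ hp + wt, data = cars)

# model.frame() returns only the rows that were actually used in the fit,
# so its columns always have the same length as the residuals.
used <- model.frame(fit)
n_resid <- length(residuals(fit))  # 30, not 32
n_used  <- nrow(used)              # 30

plot(residuals(fit) ~ used$hp, ylab = "Residuals", xlab = "hp",
     main = "Residuals vs hp")
abline(h = 0, lty = 2)             # reference line at zero
```

Applied here, the same idea would mean plotting against `model.frame(tenth_model)$runtime` and so on, rather than against a separately maintained data frame.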
the residuals display constant variability:
par(mfrow = c(1, 2))
plot(tenth_model$residuals~tenth_model$fitted.values, main="Residuals vs Fitted")
plot(abs(tenth_model$residuals)~tenth_model$fitted.values, main="Absolute Residuals vs Fitted")
the residuals are independent:
There is no trend over time in the residuals.

* * *

## Part 5: Prediction

We can use the model to predict the rating of Deadpool, a movie that was released in 2016 and was not in the sample.
df_deadpool <- data.frame(title_type="Feature Film", genre="Action & Adventure", runtime=98, audience_score=90, imdb_num_votes=764199, critics_rating="Certified Fresh", audience_rating="Upright")
predict(tenth_model, df_deadpool, interval="prediction")
##        fit      lwr     upr
## 1 8.104227 7.121324 9.08713
The predicted value of 8.1 is very close to the actual IMDb rating of 8.0. The 95% prediction interval runs from 7.12 to 9.09.
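To see where the fitted value comes from, the prediction can be rebuilt by hand from the tenth model's coefficient table. This is only a sketch: the coefficients are copied from the printed summary at four significant digits, so the result agrees with `predict()` up to rounding. "Documentary", "Action & Adventure" and "Certified Fresh" do not appear in the coefficient table, which indicates they are the reference levels and contribute nothing beyond the intercept.

```r
# Coefficients copied (rounded) from the printed tenth_model summary.
b <- c(intercept        = 3.671,
       feature_film     = -0.3267,
       runtime          = 4.728e-03,
       audience_score   = 4.627e-02,
       imdb_num_votes   = 7.335e-07,
       audience_upright = -0.4279)

# Deadpool: Feature Film, Action & Adventure (reference genre),
# Certified Fresh (reference critics rating), audience rating Upright.
fit_by_hand <- b[["intercept"]] + b[["feature_film"]] +
  b[["runtime"]] * 98 +
  b[["audience_score"]] * 90 +
  b[["imdb_num_votes"]] * 764199 +
  b[["audience_upright"]]

round(fit_by_hand, 2)  # about 8.10, in line with the fit column above
```

Most of the prediction comes from the `audience_score` term (about 4.16 rating points), which fits with it having by far the largest sum of squares in the ANOVA table.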
We will now use the model to predict the rating of Inside Out.
df_Insideout <- data.frame(title_type="Feature Film", genre="Animation", runtime=94, audience_score=89, imdb_num_votes=499323, critics_rating="Certified Fresh", audience_rating="Upright")
predict(tenth_model, df_Insideout, interval="prediction")
##        fit      lwr      upr
## 1 7.376668 6.365044 8.388292
The prediction for Inside Out was not as accurate as that for Deadpool. The actual rating is 8.2 while the predicted value is 7.4, although the actual rating still falls inside the 95% prediction interval (6.37 to 8.39).

## Part 6: Conclusion

With an adjusted R-squared of about 0.80, the model has good predictive power and can be used to predict the IMDb rating of a particular movie.