Setup

Part 1: Data

The data were obtained from IMDb and Rotten Tomatoes. The sample comprises 651 randomly selected movies produced and released before 2016, with 32 variables recorded for each movie.

The raw data are not a complete list of all movies released prior to 2016; they are a random sample taken from the full data set, and the sampling method is not known. Because the sample is random, the results can be generalized to all movies released between 1970 and 2014. Because this is an observational study, it can show only associations. Association does not imply causation.

One possible source of bias is non-independence among sequels: the popularity of a sequel may be influenced by that of the previous release.

Part 2: Research question

Can we predict a movie’s popularity from its type, genre, runtime, IMDb rating, IMDb number of votes, critics rating, critics score, audience rating, and Oscar awards obtained (actor, actress, director and picture)?


Part 3: Exploratory data analysis

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

We can see that 9 columns are character variables, 12 are factor variables, and 11 are numeric/integer variables, 6 of which are date-related. We then check for missing values; removing the rows that contain them leaves 619 of the 651 observations.


## [1] 619  32
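The report does not show how the missing values were handled. A minimal sketch of one common approach, assuming incomplete rows are simply dropped with na.omit(), could look like this; the small data frame is a hypothetical stand-in for the real movies data.

```r
# Hypothetical stand-in for the movies data frame (the real one has 651 rows).
movies <- data.frame(
  title   = c("A", "B", "C", "D"),
  runtime = c(80, NA, 84, 139),
  studio  = c("X", "Y", NA, "Z"),
  stringsAsFactors = TRUE
)

# Count missing values per column.
na_counts <- colSums(is.na(movies))
print(na_counts)

# Keep only the rows with no missing values, then check the dimensions.
movies_complete <- na.omit(movies)
print(dim(movies_complete))
```

Dropping rows is only reasonable when few observations are affected, which appears to be the case here (32 of 651).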

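The studio variable has 211 levels, which is too granular for the regression model, so we remove it. The columns for the director, the actors, and the IMDb and Rotten Tomatoes URLs (columns 25-32) are removed as well. The resulting data frame, shown below, also omits the three DVD release date columns.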
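As a rough illustration of this step, the sketch below drops columns by name rather than by position. The miniature data frame and the names movies, drop_cols and model are assumptions for the example, and only a few of the real columns are included.

```r
# Hypothetical miniature of the data; only the column names matter here.
movies <- data.frame(
  title = c("A", "B"), studio = c("X", "Y"), runtime = c(80, 101),
  dvd_rel_year = c(2013, 2001), director = c("D1", "D2"),
  actor1 = c("P", "Q"), imdb_url = c("u1", "u2"), rt_url = c("r1", "r2")
)

# Columns to drop, selected by name so that the result does not
# silently change if the column order does.
drop_cols <- c("studio", "dvd_rel_year", "director", "actor1",
               "imdb_url", "rt_url")
model <- movies[, !(names(movies) %in% drop_cols), drop = FALSE]
print(names(model))
```

Selecting by name is slightly more verbose than an index range such as 25:32, but it documents exactly which variables leave the model.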

## Classes 'tbl_df', 'tbl' and 'data.frame':    619 obs. of  20 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 5 6 1 ...
##  $ runtime         : num  80 101 84 139 90 142 93 88 119 127 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 4 5 6 6 3 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 11 9 3 6 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 1 8 7 2 19 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ critics_score   : num  45 96 91 80 33 57 17 90 83 89 ...
##  $ audience_score  : num  73 81 91 76 27 76 47 89 66 75 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 3 3 2 1 1 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 1 2 2 2 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...


Part 4: Modeling

First, we check for collinearity, especially among the numerical explanatory variables. We create a subset consisting of all numerical explanatory variables and compute its correlation matrix to identify any collinear pairs.
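The correlation check can be sketched as follows. The data are simulated, with critics_score and audience_score made correlated by construction, so the numbers say nothing about the real movies; only the use of cor() on the numeric subset is the point.

```r
set.seed(1)
n <- 100
# Hypothetical numeric predictors; critics_score and audience_score are
# constructed to be correlated.
audience_score <- runif(n, 10, 100)
critics_score  <- 0.8 * audience_score + rnorm(n, sd = 10)
runtime        <- rnorm(n, mean = 105, sd = 20)
nums <- data.frame(runtime, critics_score, audience_score)

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(nums), 2)
print(cor_mat)
```

Off-diagonal entries close to 1 or -1 flag pairs of predictors that carry nearly the same information.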

Critics score and audience score are highly correlated, which could distort our regression model. To decide which of the two to remove, I plotted the correlation of each of these explanatory variables with the response variable.

Audience score has the higher correlation with the response variable, so it is chosen over critics score.
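The comparison can also be made numerically. In this sketch the data are simulated so that the response tracks audience_score more closely than critics_score; the variable names follow the data set, while d, cors and keep are names introduced for the example.

```r
set.seed(2)
n <- 100
# Hypothetical data: imdb_rating depends directly on audience_score,
# and critics_score is a noisy copy of audience_score.
audience_score <- runif(n, 10, 100)
critics_score  <- 0.8 * audience_score + rnorm(n, sd = 20)
imdb_rating    <- 3 + 0.04 * audience_score + rnorm(n, sd = 0.3)
d <- data.frame(imdb_rating, critics_score, audience_score)

# Correlation of each candidate predictor with the response.
cors <- sapply(d[c("critics_score", "audience_score")],
               function(x) cor(x, d$imdb_rating))
print(round(cors, 2))

# Keep whichever candidate is more strongly correlated with the response.
keep <- names(which.max(abs(cors)))
print(keep)
```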

## 
## Call:
## lm(formula = imdb_rating ~ ., data = model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5391 -0.1681  0.0466  0.2527  1.0848 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.047e+00  4.341e+00   0.932 0.351550    
## title_typeFeature Film         -3.515e-01  1.939e-01  -1.812 0.070439 .  
## title_typeTV Movie             -3.271e-01  3.110e-01  -1.052 0.293339    
## genreAnimation                 -5.611e-01  2.005e-01  -2.798 0.005315 ** 
## genreArt House & International  3.368e-01  1.593e-01   2.114 0.034915 *  
## genreComedy                    -1.437e-01  8.269e-02  -1.738 0.082744 .  
## genreDocumentary                1.157e-01  2.053e-01   0.563 0.573329    
## genreDrama                      1.420e-01  7.300e-02   1.945 0.052260 .  
## genreHorror                     1.153e-01  1.241e-01   0.930 0.352968    
## genreMusical & Performing Arts  5.785e-02  1.691e-01   0.342 0.732475    
## genreMystery & Suspense         2.853e-01  9.354e-02   3.050 0.002388 ** 
## genreOther                      5.667e-02  1.429e-01   0.396 0.691914    
## genreScience Fiction & Fantasy -7.923e-02  1.827e-01  -0.434 0.664733    
## runtime                         4.307e-03  1.271e-03   3.389 0.000747 ***
## mpaa_ratingNC-17                9.980e-02  5.058e-01   0.197 0.843640    
## mpaa_ratingPG                  -1.580e-01  1.434e-01  -1.102 0.271029    
## mpaa_ratingPG-13               -1.761e-01  1.493e-01  -1.180 0.238644    
## mpaa_ratingR                   -1.090e-01  1.441e-01  -0.757 0.449579    
## mpaa_ratingUnrated             -1.672e-01  1.717e-01  -0.974 0.330701    
## thtr_rel_year                  -1.144e-04  2.158e-03  -0.053 0.957739    
## thtr_rel_month                  9.412e-03  5.849e-03   1.609 0.108115    
## thtr_rel_day                   -1.000e-03  2.257e-03  -0.443 0.657910    
## audience_score                  4.616e-02  2.180e-03  21.173  < 2e-16 ***
## imdb_num_votes                  7.955e-07  2.321e-07   3.427 0.000652 ***
## critics_ratingFresh            -3.507e-02  6.299e-02  -0.557 0.577961    
## critics_ratingRotten           -3.019e-01  6.741e-02  -4.478 9.05e-06 ***
## audience_ratingUpright         -4.243e-01  7.947e-02  -5.339 1.34e-07 ***
## best_pic_nomyes                -1.075e-01  1.289e-01  -0.834 0.404418    
## best_pic_winyes                -3.971e-02  2.249e-01  -0.177 0.859895    
## best_actor_winyes               2.219e-02  5.842e-02   0.380 0.704198    
## best_actress_winyes             8.711e-02  6.435e-02   1.354 0.176339    
## best_dir_winyes                 5.314e-02  8.392e-02   0.633 0.526866    
## top200_boxyes                  -9.827e-02  1.367e-01  -0.719 0.472376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 586 degrees of freedom
## Multiple R-squared:  0.8096, Adjusted R-squared:  0.7992 
## F-statistic: 77.88 on 32 and 586 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                   Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type         2  67.40   33.70  145.3158 < 2.2e-16 ***
## genre             10  93.05    9.30   40.1252 < 2.2e-16 ***
## runtime            1  35.91   35.91  154.8520 < 2.2e-16 ***
## mpaa_rating        5  15.79    3.16   13.6211 1.361e-12 ***
## thtr_rel_year      1   0.98    0.98    4.2388   0.03995 *  
## thtr_rel_month     1   0.66    0.66    2.8573   0.09149 .  
## thtr_rel_day       1   0.42    0.42    1.8061   0.17949    
## audience_score     1 344.24  344.24 1484.4354 < 2.2e-16 ***
## imdb_num_votes     1   4.68    4.68   20.1951 8.429e-06 ***
## critics_rating     2   7.17    3.59   15.4689 2.840e-07 ***
## audience_rating    1   6.80    6.80   29.3277 8.930e-08 ***
## best_pic_nom       1   0.12    0.12    0.5253   0.46889    
## best_pic_win       1   0.00    0.00    0.0015   0.96878    
## best_actor_win     1   0.05    0.05    0.2177   0.64096    
## best_actress_win   1   0.41    0.41    1.7553   0.18573    
## best_dir_win       1   0.10    0.10    0.4305   0.51202    
## top200_box         1   0.12    0.12    0.5171   0.47238    
## Residuals        586 135.89    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We start with a relatively high adjusted R-squared of 0.7992. We work through the model by removing the variables with the highest p-values first. The next model omits best_pic_win and top200_box.
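One elimination step might be written as below. This is a sketch on simulated data in which best_pic_win and top200_box are pure noise by construction; update() is one convenient way to refit without retyping the formula, and it is an assumption that the original analysis did the same.

```r
set.seed(3)
n <- 200
# Hypothetical data: only audience_score and runtime drive the response.
model <- data.frame(
  audience_score = runif(n, 10, 100),
  runtime        = rnorm(n, 105, 20),
  best_pic_win   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  top200_box     = factor(sample(c("no", "yes"), n, replace = TRUE))
)
model$imdb_rating <- 3 + 0.04 * model$audience_score +
  0.005 * model$runtime + rnorm(n, sd = 0.5)

# Full model: the response against every other column.
fit_full <- lm(imdb_rating ~ ., data = model)

# Drop two terms and refit; update() keeps the rest of the formula.
fit_2 <- update(fit_full, . ~ . - best_pic_win - top200_box)

# Compare adjusted R-squared before and after the removal.
adj <- c(full    = summary(fit_full)$adj.r.squared,
         reduced = summary(fit_2)$adj.r.squared)
print(round(adj, 4))
```

Adjusted R-squared penalizes extra parameters, so it can rise when an uninformative variable is removed even though plain R-squared cannot.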

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score + 
##     imdb_num_votes + critics_rating + audience_rating + best_pic_nom + 
##     best_actor_win + best_actress_win + best_dir_win, data = model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5427 -0.1683  0.0389  0.2524  1.0862 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.719e+00  4.309e+00   0.863 0.388495    
## title_typeFeature Film         -3.523e-01  1.937e-01  -1.819 0.069420 .  
## title_typeTV Movie             -3.302e-01  3.106e-01  -1.063 0.288151    
## genreAnimation                 -5.474e-01  1.993e-01  -2.746 0.006213 ** 
## genreArt House & International  3.401e-01  1.590e-01   2.140 0.032794 *  
## genreComedy                    -1.400e-01  8.225e-02  -1.703 0.089184 .  
## genreDocumentary                1.182e-01  2.050e-01   0.576 0.564518    
## genreDrama                      1.458e-01  7.264e-02   2.006 0.045261 *  
## genreHorror                     1.184e-01  1.238e-01   0.956 0.339465    
## genreMusical & Performing Arts  6.193e-02  1.688e-01   0.367 0.713808    
## genreMystery & Suspense         2.891e-01  9.322e-02   3.101 0.002019 ** 
## genreOther                      6.195e-02  1.424e-01   0.435 0.663635    
## genreScience Fiction & Fantasy -8.187e-02  1.824e-01  -0.449 0.653750    
## runtime                         4.283e-03  1.269e-03   3.376 0.000785 ***
## mpaa_ratingNC-17                1.169e-01  5.046e-01   0.232 0.816914    
## mpaa_ratingPG                  -1.494e-01  1.427e-01  -1.047 0.295512    
## mpaa_ratingPG-13               -1.644e-01  1.483e-01  -1.109 0.267989    
## mpaa_ratingR                   -9.556e-02  1.427e-01  -0.669 0.503442    
## mpaa_ratingUnrated             -1.568e-01  1.708e-01  -0.918 0.359183    
## thtr_rel_year                   4.103e-05  2.143e-03   0.019 0.984733    
## thtr_rel_month                  9.214e-03  5.819e-03   1.583 0.113853    
## thtr_rel_day                   -9.823e-04  2.253e-03  -0.436 0.662996    
## audience_score                  4.626e-02  2.171e-03  21.307  < 2e-16 ***
## imdb_num_votes                  7.524e-07  2.228e-07   3.377 0.000780 ***
## critics_ratingFresh            -3.087e-02  6.264e-02  -0.493 0.622322    
## critics_ratingRotten           -2.977e-01  6.709e-02  -4.437 1.09e-05 ***
## audience_ratingUpright         -4.271e-01  7.928e-02  -5.387 1.03e-07 ***
## best_pic_nomyes                -1.126e-01  1.179e-01  -0.955 0.339960    
## best_actor_winyes               2.183e-02  5.819e-02   0.375 0.707655    
## best_actress_winyes             8.454e-02  6.415e-02   1.318 0.188054    
## best_dir_winyes                 5.105e-02  8.052e-02   0.634 0.526354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.481 on 588 degrees of freedom
## Multiple R-squared:  0.8094, Adjusted R-squared:  0.7997 
## F-statistic: 83.26 on 30 and 588 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                   Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type         2  67.40   33.70  145.6759 < 2.2e-16 ***
## genre             10  93.05    9.30   40.2246 < 2.2e-16 ***
## runtime            1  35.91   35.91  155.2357 < 2.2e-16 ***
## mpaa_rating        5  15.79    3.16   13.6548 1.259e-12 ***
## thtr_rel_year      1   0.98    0.98    4.2493   0.03971 *  
## thtr_rel_month     1   0.66    0.66    2.8644   0.09109 .  
## thtr_rel_day       1   0.42    0.42    1.8106   0.17896    
## audience_score     1 344.24  344.24 1488.1133 < 2.2e-16 ***
## imdb_num_votes     1   4.68    4.68   20.2451 8.213e-06 ***
## critics_rating     2   7.17    3.59   15.5073 2.735e-07 ***
## audience_rating    1   6.80    6.80   29.4003 8.606e-08 ***
## best_pic_nom       1   0.12    0.12    0.5266   0.46834    
## best_actor_win     1   0.05    0.05    0.2153   0.64281    
## best_actress_win   1   0.41    0.41    1.7641   0.18463    
## best_dir_win       1   0.09    0.09    0.4019   0.52635    
## Residuals        588 136.02    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our adjusted R-squared improved slightly, from 0.7992 to 0.7997. We will continue to remove insignificant variables. This time we will remove best_actor_win.
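Repeating this refit-and-compare step by hand is error-prone, so a loop that records adjusted R-squared after each removal can help. The sketch below uses simulated data and a removal order fixed in advance; the names history and term are introduced for the example.

```r
set.seed(4)
n <- 200
# Hypothetical data: the three award variables are noise by construction.
model <- data.frame(
  audience_score = runif(n, 10, 100),
  best_actor_win = factor(sample(c("no", "yes"), n, replace = TRUE)),
  best_dir_win   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  best_pic_nom   = factor(sample(c("no", "yes"), n, replace = TRUE))
)
model$imdb_rating <- 3 + 0.04 * model$audience_score + rnorm(n, sd = 0.5)

fit <- lm(imdb_rating ~ ., data = model)
history <- c(full = summary(fit)$adj.r.squared)

# Remove one term at a time, refitting and recording adjusted R-squared.
for (term in c("best_actor_win", "best_dir_win", "best_pic_nom")) {
  fit <- update(fit, as.formula(paste(". ~ . -", term)))
  history[paste0("-", term)] <- summary(fit)$adj.r.squared
}
print(round(history, 4))
```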

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score + 
##     imdb_num_votes + critics_rating + audience_rating + best_pic_nom + 
##     best_actress_win + best_dir_win, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54348 -0.16429  0.03922  0.25058  1.08392 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.709e+00  4.306e+00   0.861 0.389437    
## title_typeFeature Film         -3.510e-01  1.935e-01  -1.814 0.070172 .  
## title_typeTV Movie             -3.318e-01  3.104e-01  -1.069 0.285437    
## genreAnimation                 -5.456e-01  1.991e-01  -2.740 0.006327 ** 
## genreArt House & International  3.379e-01  1.587e-01   2.129 0.033706 *  
## genreComedy                    -1.402e-01  8.219e-02  -1.705 0.088653 .  
## genreDocumentary                1.178e-01  2.048e-01   0.575 0.565341    
## genreDrama                      1.464e-01  7.257e-02   2.018 0.044068 *  
## genreHorror                     1.169e-01  1.237e-01   0.945 0.344967    
## genreMusical & Performing Arts  6.077e-02  1.686e-01   0.360 0.718691    
## genreMystery & Suspense         2.921e-01  9.280e-02   3.148 0.001728 ** 
## genreOther                      6.263e-02  1.423e-01   0.440 0.659933    
## genreScience Fiction & Fantasy -8.394e-02  1.822e-01  -0.461 0.645200    
## runtime                         4.367e-03  1.248e-03   3.499 0.000502 ***
## mpaa_ratingNC-17                1.164e-01  5.042e-01   0.231 0.817450    
## mpaa_ratingPG                  -1.478e-01  1.426e-01  -1.037 0.300153    
## mpaa_ratingPG-13               -1.637e-01  1.482e-01  -1.105 0.269627    
## mpaa_ratingR                   -9.508e-02  1.426e-01  -0.667 0.505272    
## mpaa_ratingUnrated             -1.566e-01  1.707e-01  -0.917 0.359406    
## thtr_rel_year                   4.112e-05  2.141e-03   0.019 0.984687    
## thtr_rel_month                  9.321e-03  5.808e-03   1.605 0.109038    
## thtr_rel_day                   -9.664e-04  2.251e-03  -0.429 0.667841    
## audience_score                  4.628e-02  2.169e-03  21.342  < 2e-16 ***
## imdb_num_votes                  7.491e-07  2.224e-07   3.368 0.000807 ***
## critics_ratingFresh            -3.038e-02  6.258e-02  -0.486 0.627500    
## critics_ratingRotten           -2.976e-01  6.704e-02  -4.439 1.08e-05 ***
## audience_ratingUpright         -4.285e-01  7.912e-02  -5.416 8.89e-08 ***
## best_pic_nomyes                -1.087e-01  1.173e-01  -0.927 0.354523    
## best_actress_winyes             8.590e-02  6.400e-02   1.342 0.180049    
## best_dir_winyes                 5.165e-02  8.045e-02   0.642 0.521138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4806 on 589 degrees of freedom
## Multiple R-squared:  0.8094, Adjusted R-squared:    0.8 
## F-statistic: 86.25 on 29 and 589 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                   Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type         2  67.40   33.70  145.8887 < 2.2e-16 ***
## genre             10  93.05    9.30   40.2833 < 2.2e-16 ***
## runtime            1  35.91   35.91  155.4625 < 2.2e-16 ***
## mpaa_rating        5  15.79    3.16   13.6748 1.203e-12 ***
## thtr_rel_year      1   0.98    0.98    4.2555   0.03956 *  
## thtr_rel_month     1   0.66    0.66    2.8686   0.09085 .  
## thtr_rel_day       1   0.42    0.42    1.8132   0.17864    
## audience_score     1 344.24  344.24 1490.2874 < 2.2e-16 ***
## imdb_num_votes     1   4.68    4.68   20.2747 8.089e-06 ***
## critics_rating     2   7.17    3.59   15.5299 2.675e-07 ***
## audience_rating    1   6.80    6.80   29.4433 8.421e-08 ***
## best_pic_nom       1   0.12    0.12    0.5273   0.46802    
## best_actress_win   1   0.42    0.42    1.8316   0.17645    
## best_dir_win       1   0.10    0.10    0.4121   0.52114    
## Residuals        589 136.05    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adjusted R-squared increased to 0.8. This time we will remove best_dir_win.

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score + 
##     imdb_num_votes + critics_rating + audience_rating + best_pic_nom + 
##     best_actress_win, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54251 -0.16980  0.03898  0.25007  1.07756 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.898e+00  4.294e+00   0.908 0.364422    
## title_typeFeature Film         -3.473e-01  1.933e-01  -1.797 0.072894 .  
## title_typeTV Movie             -3.299e-01  3.102e-01  -1.064 0.287919    
## genreAnimation                 -5.446e-01  1.990e-01  -2.737 0.006396 ** 
## genreArt House & International  3.352e-01  1.586e-01   2.114 0.034959 *  
## genreComedy                    -1.400e-01  8.215e-02  -1.704 0.088952 .  
## genreDocumentary                1.187e-01  2.047e-01   0.580 0.562240    
## genreDrama                      1.455e-01  7.252e-02   2.007 0.045204 *  
## genreHorror                     1.173e-01  1.236e-01   0.949 0.342819    
## genreMusical & Performing Arts  6.113e-02  1.685e-01   0.363 0.716988    
## genreMystery & Suspense         2.924e-01  9.276e-02   3.152 0.001703 ** 
## genreOther                      5.847e-02  1.421e-01   0.412 0.680797    
## genreScience Fiction & Fantasy -8.223e-02  1.821e-01  -0.452 0.651762    
## runtime                         4.493e-03  1.232e-03   3.647 0.000289 ***
## mpaa_ratingNC-17                1.183e-01  5.040e-01   0.235 0.814531    
## mpaa_ratingPG                  -1.439e-01  1.424e-01  -1.011 0.312445    
## mpaa_ratingPG-13               -1.608e-01  1.480e-01  -1.086 0.277748    
## mpaa_ratingR                   -9.120e-02  1.424e-01  -0.640 0.522205    
## mpaa_ratingUnrated             -1.547e-01  1.706e-01  -0.907 0.364890    
## thtr_rel_year                  -6.205e-05  2.134e-03  -0.029 0.976818    
## thtr_rel_month                  9.375e-03  5.804e-03   1.615 0.106806    
## thtr_rel_day                   -9.780e-04  2.250e-03  -0.435 0.663947    
## audience_score                  4.633e-02  2.166e-03  21.386  < 2e-16 ***
## imdb_num_votes                  7.599e-07  2.217e-07   3.428 0.000651 ***
## critics_ratingFresh            -3.019e-02  6.255e-02  -0.483 0.629565    
## critics_ratingRotten           -3.003e-01  6.686e-02  -4.492 8.49e-06 ***
## audience_ratingUpright         -4.312e-01  7.898e-02  -5.459 7.05e-08 ***
## best_pic_nomyes                -1.052e-01  1.171e-01  -0.898 0.369509    
## best_actress_winyes             8.660e-02  6.396e-02   1.354 0.176236    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4804 on 590 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8002 
## F-statistic:  89.4 on 28 and 590 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                   Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type         2  67.40   33.70  146.0342 < 2.2e-16 ***
## genre             10  93.05    9.30   40.3235 < 2.2e-16 ***
## runtime            1  35.91   35.91  155.6175 < 2.2e-16 ***
## mpaa_rating        5  15.79    3.16   13.6884 1.166e-12 ***
## thtr_rel_year      1   0.98    0.98    4.2598   0.03946 *  
## thtr_rel_month     1   0.66    0.66    2.8715   0.09069 .  
## thtr_rel_day       1   0.42    0.42    1.8150   0.17842    
## audience_score     1 344.24  344.24 1491.7737 < 2.2e-16 ***
## imdb_num_votes     1   4.68    4.68   20.2949 8.004e-06 ***
## critics_rating     2   7.17    3.59   15.5454 2.634e-07 ***
## audience_rating    1   6.80    6.80   29.4727 8.295e-08 ***
## best_pic_nom       1   0.12    0.12    0.5279   0.46779    
## best_actress_win   1   0.42    0.42    1.8335   0.17624    
## Residuals        590 136.15    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adjusted R-squared improved slightly to 0.8002. We will remove best_pic_nom this time.

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score + 
##     imdb_num_votes + critics_rating + audience_rating + best_actress_win, 
##     data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54575 -0.17002  0.03634  0.25551  1.07724 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.707e+00  4.288e+00   0.865 0.387641    
## title_typeFeature Film         -3.490e-01  1.933e-01  -1.806 0.071503 .  
## title_typeTV Movie             -3.335e-01  3.101e-01  -1.075 0.282592    
## genreAnimation                 -5.454e-01  1.990e-01  -2.741 0.006309 ** 
## genreArt House & International  3.338e-01  1.586e-01   2.105 0.035719 *  
## genreComedy                    -1.423e-01  8.209e-02  -1.734 0.083452 .  
## genreDocumentary                1.169e-01  2.047e-01   0.571 0.568027    
## genreDrama                      1.420e-01  7.240e-02   1.962 0.050256 .  
## genreHorror                     1.126e-01  1.235e-01   0.912 0.362291    
## genreMusical & Performing Arts  6.323e-02  1.685e-01   0.375 0.707624    
## genreMystery & Suspense         2.894e-01  9.268e-02   3.122 0.001881 ** 
## genreOther                      5.015e-02  1.417e-01   0.354 0.723574    
## genreScience Fiction & Fantasy -8.128e-02  1.821e-01  -0.446 0.655447    
## runtime                         4.388e-03  1.226e-03   3.578 0.000374 ***
## mpaa_ratingNC-17                1.241e-01  5.038e-01   0.246 0.805524    
## mpaa_ratingPG                  -1.471e-01  1.423e-01  -1.034 0.301746    
## mpaa_ratingPG-13               -1.637e-01  1.480e-01  -1.106 0.268974    
## mpaa_ratingR                   -9.207e-02  1.424e-01  -0.647 0.518178    
## mpaa_ratingUnrated             -1.546e-01  1.706e-01  -0.906 0.365172    
## thtr_rel_year                   4.435e-05  2.131e-03   0.021 0.983399    
## thtr_rel_month                  8.720e-03  5.757e-03   1.515 0.130402    
## thtr_rel_day                   -9.169e-04  2.248e-03  -0.408 0.683571    
## audience_score                  4.617e-02  2.159e-03  21.388  < 2e-16 ***
## imdb_num_votes                  7.296e-07  2.191e-07   3.330 0.000921 ***
## critics_ratingFresh            -2.333e-02  6.207e-02  -0.376 0.707183    
## critics_ratingRotten           -2.938e-01  6.646e-02  -4.421 1.17e-05 ***
## audience_ratingUpright         -4.276e-01  7.886e-02  -5.422 8.61e-08 ***
## best_actress_winyes             7.906e-02  6.339e-02   1.247 0.212848    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4803 on 591 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.8003 
## F-statistic: 92.72 on 27 and 591 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                   Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type         2  67.40   33.70  146.0820 < 2.2e-16 ***
## genre             10  93.05    9.30   40.3367 < 2.2e-16 ***
## runtime            1  35.91   35.91  155.6685 < 2.2e-16 ***
## mpaa_rating        5  15.79    3.16   13.6929 1.151e-12 ***
## thtr_rel_year      1   0.98    0.98    4.2612   0.03943 *  
## thtr_rel_month     1   0.66    0.66    2.8724   0.09064 .  
## thtr_rel_day       1   0.42    0.42    1.8156   0.17835    
## audience_score     1 344.24  344.24 1492.2622 < 2.2e-16 ***
## imdb_num_votes     1   4.68    4.68   20.3016 7.975e-06 ***
## critics_rating     2   7.17    3.59   15.5505 2.619e-07 ***
## audience_rating    1   6.80    6.80   29.4823 8.251e-08 ***
## best_actress_win   1   0.36    0.36    1.5553   0.21285    
## Residuals        591 136.33    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the sixth model, we will remove best_actress_win. A pattern is also emerging: we have been removing variables related to Oscar awards (best actress, best picture nomination, best director…).
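When several related variables look unimportant, they can be tested as a group rather than one at a time. The sketch below uses a partial F-test via anova() on two nested models. This is a different technique from the one-by-one removal used in the report, and the data are simulated.

```r
set.seed(5)
n <- 200
# Hypothetical data: the award variables have no effect by construction.
model <- data.frame(
  audience_score   = runif(n, 10, 100),
  best_actress_win = factor(sample(c("no", "yes"), n, replace = TRUE)),
  best_dir_win     = factor(sample(c("no", "yes"), n, replace = TRUE)),
  best_pic_nom     = factor(sample(c("no", "yes"), n, replace = TRUE))
)
model$imdb_rating <- 3 + 0.04 * model$audience_score + rnorm(n, sd = 0.5)

fit_full    <- lm(imdb_rating ~ ., data = model)
fit_reduced <- lm(imdb_rating ~ audience_score, data = model)

# Partial F-test: do the three award variables jointly improve the fit?
cmp <- anova(fit_reduced, fit_full)
print(cmp)
```

A large p-value in the second row would suggest that the whole group can be dropped in a single step.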

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + thtr_rel_month + thtr_rel_day + audience_score + 
##     imdb_num_votes + critics_rating + audience_rating, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54920 -0.17859  0.03308  0.25704  1.07702 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.933e+00  4.286e+00   0.918 0.359174    
## title_typeFeature Film         -3.490e-01  1.934e-01  -1.805 0.071604 .  
## title_typeTV Movie             -3.203e-01  3.101e-01  -1.033 0.302092    
## genreAnimation                 -5.323e-01  1.988e-01  -2.678 0.007620 ** 
## genreArt House & International  3.435e-01  1.584e-01   2.168 0.030557 *  
## genreComedy                    -1.320e-01  8.171e-02  -1.616 0.106720    
## genreDocumentary                1.241e-01  2.047e-01   0.606 0.544576    
## genreDrama                      1.546e-01  7.173e-02   2.156 0.031515 *  
## genreHorror                     1.165e-01  1.235e-01   0.944 0.345663    
## genreMusical & Performing Arts  6.406e-02  1.686e-01   0.380 0.704105    
## genreMystery & Suspense         3.042e-01  9.196e-02   3.309 0.000995 ***
## genreOther                      5.892e-02  1.416e-01   0.416 0.677545    
## genreScience Fiction & Fantasy -8.064e-02  1.822e-01  -0.443 0.658151    
## runtime                         4.589e-03  1.216e-03   3.774 0.000177 ***
## mpaa_ratingNC-17                1.115e-01  5.040e-01   0.221 0.825043    
## mpaa_ratingPG                  -1.450e-01  1.423e-01  -1.018 0.308901    
## mpaa_ratingPG-13               -1.628e-01  1.480e-01  -1.100 0.271811    
## mpaa_ratingR                   -9.354e-02  1.425e-01  -0.657 0.511692    
## mpaa_ratingUnrated             -1.565e-01  1.707e-01  -0.917 0.359489    
## thtr_rel_year                  -7.682e-05  2.130e-03  -0.036 0.971235    
## thtr_rel_month                  8.783e-03  5.760e-03   1.525 0.127819    
## thtr_rel_day                   -8.508e-04  2.249e-03  -0.378 0.705311    
## audience_score                  4.611e-02  2.159e-03  21.358  < 2e-16 ***
## imdb_num_votes                  7.401e-07  2.190e-07   3.379 0.000775 ***
## critics_ratingFresh            -3.021e-02  6.186e-02  -0.488 0.625476    
## critics_ratingRotten           -2.986e-01  6.638e-02  -4.498 8.25e-06 ***
## audience_ratingUpright         -4.282e-01  7.890e-02  -5.427 8.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4805 on 592 degrees of freedom
## Multiple R-squared:  0.8085, Adjusted R-squared:  0.8001 
## F-statistic: 96.13 on 26 and 592 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type        2  67.40   33.70  145.9451 < 2.2e-16 ***
## genre            10  93.05    9.30   40.2989 < 2.2e-16 ***
## runtime           1  35.91   35.91  155.5226 < 2.2e-16 ***
## mpaa_rating       5  15.79    3.16   13.6801 1.180e-12 ***
## thtr_rel_year     1   0.98    0.98    4.2572   0.03952 *  
## thtr_rel_month    1   0.66    0.66    2.8697   0.09079 .  
## thtr_rel_day      1   0.42    0.42    1.8139   0.17855    
## audience_score    1 344.24  344.24 1490.8638 < 2.2e-16 ***
## imdb_num_votes    1   4.68    4.68   20.2825 8.050e-06 ***
## critics_rating    2   7.17    3.59   15.5359 2.654e-07 ***
## audience_rating   1   6.80    6.80   29.4547 8.358e-08 ***
## Residuals       592 136.69    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The adjusted R-squared decreased slightly, from 0.8003 to 0.8001, a negligible change. The model still contains variables with high p-values, such as thtr_rel_year, thtr_rel_month, thtr_rel_day and mpaa_rating. I will remove them one by one to see whether the model improves.
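A factor such as mpaa_rating contributes several coefficients, each with its own p-value, which makes it awkward to judge from the coefficient table alone. One option, shown here as a sketch on simulated data rather than as the method used in this report, is drop1() with an F-test, which reports a single test per term.

```r
set.seed(6)
n <- 200
# Hypothetical data: only audience_score affects the response.
model <- data.frame(
  audience_score = runif(n, 10, 100),
  thtr_rel_month = sample(1:12, n, replace = TRUE),
  mpaa_rating    = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)
model$imdb_rating <- 3 + 0.04 * model$audience_score + rnorm(n, sd = 0.5)

fit <- lm(imdb_rating ~ ., data = model)

# One F-test per term: each row compares the full model with the model
# that omits that term, so mpaa_rating is judged as a whole.
tests <- drop1(fit, test = "F")
print(tests)
```

Unlike the sequential ANOVA tables shown above, each test here is conditional on all the other terms being in the model, which matches the question asked at every elimination step.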

## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_year + thtr_rel_day + audience_score + imdb_num_votes + 
##     critics_rating + audience_rating, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57159 -0.17542  0.03625  0.25896  1.09890 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.817e+00  4.290e+00   0.890 0.374043    
## title_typeFeature Film         -3.482e-01  1.936e-01  -1.799 0.072593 .  
## title_typeTV Movie             -3.517e-01  3.097e-01  -1.135 0.256666    
## genreAnimation                 -5.241e-01  1.990e-01  -2.635 0.008646 ** 
## genreArt House & International  3.429e-01  1.586e-01   2.162 0.031037 *  
## genreComedy                    -1.261e-01  8.171e-02  -1.543 0.123423    
## genreDocumentary                1.270e-01  2.049e-01   0.620 0.535652    
## genreDrama                      1.530e-01  7.180e-02   2.131 0.033489 *  
## genreHorror                     1.191e-01  1.236e-01   0.963 0.335890    
## genreMusical & Performing Arts  6.737e-02  1.688e-01   0.399 0.689889    
## genreMystery & Suspense         2.958e-01  9.189e-02   3.219 0.001359 ** 
## genreOther                      4.538e-02  1.415e-01   0.321 0.748520    
## genreScience Fiction & Fantasy -8.574e-02  1.823e-01  -0.470 0.638333    
## runtime                         5.021e-03  1.184e-03   4.241 2.58e-05 ***
## mpaa_ratingNC-17                1.429e-01  5.041e-01   0.283 0.776981    
## mpaa_ratingPG                  -1.391e-01  1.425e-01  -0.977 0.329077    
## mpaa_ratingPG-13               -1.667e-01  1.482e-01  -1.125 0.260873    
## mpaa_ratingR                   -8.911e-02  1.426e-01  -0.625 0.532254    
## mpaa_ratingUnrated             -1.543e-01  1.708e-01  -0.903 0.366799    
## thtr_rel_year                  -1.485e-05  2.131e-03  -0.007 0.994443    
## thtr_rel_day                   -4.948e-04  2.239e-03  -0.221 0.825173    
## audience_score                  4.608e-02  2.161e-03  21.319  < 2e-16 ***
## imdb_num_votes                  7.505e-07  2.192e-07   3.424 0.000659 ***
## critics_ratingFresh            -2.996e-02  6.192e-02  -0.484 0.628725    
## critics_ratingRotten           -2.990e-01  6.645e-02  -4.500 8.19e-06 ***
## audience_ratingUpright         -4.283e-01  7.899e-02  -5.423 8.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4811 on 593 degrees of freedom
## Multiple R-squared:  0.8077, Adjusted R-squared:  0.7996 
## F-statistic: 99.66 on 25 and 593 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type        2  67.40   33.70  145.6197 < 2.2e-16 ***
## genre            10  93.05    9.30   40.2091 < 2.2e-16 ***
## runtime           1  35.91   35.91  155.1758 < 2.2e-16 ***
## mpaa_rating       5  15.79    3.16   13.6495 1.257e-12 ***
## thtr_rel_year     1   0.98    0.98    4.2477   0.03974 *  
## thtr_rel_day      1   0.53    0.53    2.2978   0.13009    
## audience_score    1 344.10  344.10 1486.9493 < 2.2e-16 ***
## imdb_num_votes    1   4.80    4.80   20.7253 6.435e-06 ***
## critics_rating    2   7.21    3.60   15.5706 2.567e-07 ***
## audience_rating   1   6.81    6.81   29.4076 8.549e-08 ***
## Residuals       593 137.23    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     thtr_rel_day + audience_score + imdb_num_votes + critics_rating + 
##     audience_rating, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57161 -0.17534  0.03626  0.25899  1.09904 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.787e+00  2.875e-01  13.173  < 2e-16 ***
## title_typeFeature Film         -3.482e-01  1.934e-01  -1.800 0.072312 .  
## title_typeTV Movie             -3.517e-01  3.095e-01  -1.136 0.256268    
## genreAnimation                 -5.243e-01  1.968e-01  -2.664 0.007922 ** 
## genreArt House & International  3.429e-01  1.585e-01   2.164 0.030892 *  
## genreComedy                    -1.261e-01  8.163e-02  -1.544 0.123095    
## genreDocumentary                1.269e-01  2.046e-01   0.620 0.535246    
## genreDrama                      1.530e-01  7.172e-02   2.133 0.033307 *  
## genreHorror                     1.191e-01  1.231e-01   0.967 0.333752    
## genreMusical & Performing Arts  6.732e-02  1.685e-01   0.400 0.689590    
## genreMystery & Suspense         2.957e-01  9.180e-02   3.222 0.001345 ** 
## genreOther                      4.545e-02  1.410e-01   0.322 0.747380    
## genreScience Fiction & Fantasy -8.568e-02  1.820e-01  -0.471 0.637908    
## runtime                         5.023e-03  1.153e-03   4.357 1.56e-05 ***
## mpaa_ratingNC-17                1.430e-01  5.035e-01   0.284 0.776552    
## mpaa_ratingPG                  -1.392e-01  1.421e-01  -0.979 0.327840    
## mpaa_ratingPG-13               -1.669e-01  1.455e-01  -1.148 0.251637    
## mpaa_ratingR                   -8.926e-02  1.408e-01  -0.634 0.526496    
## mpaa_ratingUnrated             -1.546e-01  1.659e-01  -0.932 0.351895    
## thtr_rel_day                   -4.960e-04  2.231e-03  -0.222 0.824124    
## audience_score                  4.608e-02  2.152e-03  21.412  < 2e-16 ***
## imdb_num_votes                  7.501e-07  2.136e-07   3.513 0.000477 ***
## critics_ratingFresh            -2.988e-02  6.087e-02  -0.491 0.623713    
## critics_ratingRotten           -2.990e-01  6.629e-02  -4.510 7.81e-06 ***
## audience_ratingUpright         -4.283e-01  7.892e-02  -5.428 8.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4807 on 594 degrees of freedom
## Multiple R-squared:  0.8077, Adjusted R-squared:    0.8 
## F-statistic:   104 on 24 and 594 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## title_type        2  67.40   33.70  145.8652 < 2.2e-16 ***
## genre            10  93.05    9.30   40.2769 < 2.2e-16 ***
## runtime           1  35.91   35.91  155.4375 < 2.2e-16 ***
## mpaa_rating       5  15.79    3.16   13.6726 1.193e-12 ***
## thtr_rel_day      1   0.38    0.38    1.6601    0.1981    
## audience_score    1 345.02  345.02 1493.4320 < 2.2e-16 ***
## imdb_num_votes    1   4.97    4.97   21.5339 4.280e-06 ***
## critics_rating    2   7.24    3.62   15.6691 2.336e-07 ***
## audience_rating   1   6.81    6.81   29.4601 8.326e-08 ***
## Residuals       594 137.23    0.23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     audience_score + imdb_num_votes + critics_rating + audience_rating, 
##     data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57790 -0.17857  0.03747  0.25433  1.10365 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.779e+00  2.848e-01  13.265  < 2e-16 ***
## title_typeFeature Film         -3.479e-01  1.932e-01  -1.800 0.072295 .  
## title_typeTV Movie             -3.532e-01  3.092e-01  -1.142 0.253759    
## genreAnimation                 -5.244e-01  1.966e-01  -2.667 0.007866 ** 
## genreArt House & International  3.405e-01  1.580e-01   2.155 0.031546 *  
## genreComedy                    -1.256e-01  8.154e-02  -1.540 0.124036    
## genreDocumentary                1.265e-01  2.045e-01   0.619 0.536481    
## genreDrama                      1.526e-01  7.165e-02   2.130 0.033557 *  
## genreHorror                     1.177e-01  1.229e-01   0.958 0.338479    
## genreMusical & Performing Arts  6.655e-02  1.683e-01   0.395 0.692648    
## genreMystery & Suspense         2.950e-01  9.166e-02   3.218 0.001361 ** 
## genreOther                      4.692e-02  1.408e-01   0.333 0.739058    
## genreScience Fiction & Fantasy -8.297e-02  1.814e-01  -0.457 0.647588    
## runtime                         5.025e-03  1.152e-03   4.362 1.52e-05 ***
## mpaa_ratingNC-17                1.389e-01  5.027e-01   0.276 0.782385    
## mpaa_ratingPG                  -1.399e-01  1.420e-01  -0.985 0.325009    
## mpaa_ratingPG-13               -1.684e-01  1.452e-01  -1.159 0.246736    
## mpaa_ratingR                   -8.941e-02  1.407e-01  -0.635 0.525478    
## mpaa_ratingUnrated             -1.541e-01  1.658e-01  -0.930 0.352837    
## audience_score                  4.609e-02  2.150e-03  21.438  < 2e-16 ***
## imdb_num_votes                  7.491e-07  2.133e-07   3.512 0.000479 ***
## critics_ratingFresh            -2.832e-02  6.042e-02  -0.469 0.639446    
## critics_ratingRotten           -2.978e-01  6.604e-02  -4.510 7.81e-06 ***
## audience_ratingUpright         -4.287e-01  7.884e-02  -5.438 7.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4803 on 595 degrees of freedom
## Multiple R-squared:  0.8077, Adjusted R-squared:  0.8003 
## F-statistic: 108.7 on 23 and 595 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## title_type        2  67.40   33.70  146.099 < 2.2e-16 ***
## genre            10  93.05    9.30   40.341 < 2.2e-16 ***
## runtime           1  35.91   35.91  155.686 < 2.2e-16 ***
## mpaa_rating       5  15.79    3.16   13.694 1.136e-12 ***
## audience_score    1 345.40  345.40 1497.484 < 2.2e-16 ***
## imdb_num_votes    1   4.95    4.95   21.477 4.402e-06 ***
## critics_rating    2   7.24    3.62   15.685 2.299e-07 ***
## audience_rating   1   6.82    6.82   29.567 7.896e-08 ***
## Residuals       595 137.24    0.23                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + audience_score + 
##     imdb_num_votes + critics_rating + audience_rating, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54779 -0.18685  0.04423  0.25968  1.05127 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.671e+00  2.544e-01  14.429  < 2e-16 ***
## title_typeFeature Film         -3.267e-01  1.910e-01  -1.710 0.087747 .  
## title_typeTV Movie             -3.411e-01  3.080e-01  -1.107 0.268631    
## genreAnimation                 -4.681e-01  1.818e-01  -2.574 0.010287 *  
## genreArt House & International  3.302e-01  1.541e-01   2.143 0.032484 *  
## genreComedy                    -1.394e-01  8.050e-02  -1.732 0.083746 .  
## genreDocumentary                1.152e-01  2.022e-01   0.569 0.569232    
## genreDrama                      1.532e-01  6.973e-02   2.197 0.028411 *  
## genreHorror                     1.333e-01  1.202e-01   1.109 0.267731    
## genreMusical & Performing Arts  6.554e-02  1.672e-01   0.392 0.695270    
## genreMystery & Suspense         3.081e-01  8.961e-02   3.438 0.000626 ***
## genreOther                      3.427e-02  1.398e-01   0.245 0.806471    
## genreScience Fiction & Fantasy -6.879e-02  1.810e-01  -0.380 0.704109    
## runtime                         4.728e-03  1.138e-03   4.156 3.71e-05 ***
## audience_score                  4.627e-02  2.134e-03  21.684  < 2e-16 ***
## imdb_num_votes                  7.335e-07  2.103e-07   3.487 0.000523 ***
## critics_ratingFresh            -3.492e-02  6.009e-02  -0.581 0.561359    
## critics_ratingRotten           -3.089e-01  6.536e-02  -4.726 2.86e-06 ***
## audience_ratingUpright         -4.279e-01  7.855e-02  -5.448 7.46e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4798 on 600 degrees of freedom
## Multiple R-squared:  0.8065, Adjusted R-squared:  0.8007 
## F-statistic: 138.9 on 18 and 600 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: imdb_rating
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## title_type        2  67.40   33.70  146.401 < 2.2e-16 ***
## genre            10  93.05    9.30   40.425 < 2.2e-16 ***
## runtime           1  35.91   35.91  156.008 < 2.2e-16 ***
## audience_score    1 360.03  360.03 1564.111 < 2.2e-16 ***
## imdb_num_votes    1   4.79    4.79   20.813 6.143e-06 ***
## critics_rating    2   7.69    3.85   16.706 8.695e-08 ***
## audience_rating   1   6.83    6.83   29.677 7.456e-08 ***
## Residuals       600 138.11    0.23                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing the insignificant variables with high p-values, the adjusted R-squared for the 10th model is 0.8007, the highest. Our tenth model is the final model.
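The two R-squared figures reported for this last model can be checked by hand from its ANOVA table. The sketch below only re-does that arithmetic with the rounded sums of squares printed above. The variable names are chosen for the example.

```r
# Sums of squares copied from the last ANOVA table in the output above
ss_terms <- c(title_type = 67.40, genre = 93.05, runtime = 35.91,
              audience_score = 360.03, imdb_num_votes = 4.79,
              critics_rating = 7.69, audience_rating = 6.83)
ss_resid <- 138.11
df_resid <- 600   # residual degrees of freedom
df_model <- 18    # model degrees of freedom from the F-statistic line

ss_total <- sum(ss_terms) + ss_resid
n <- df_resid + df_model + 1   # observations used in the fit

# Multiple R-squared: share of total variability explained
r2 <- 1 - ss_resid / ss_total

# Adjusted R-squared: same ratio, but each sum of squares is divided
# by its degrees of freedom, which penalises extra terms
adj_r2 <- 1 - (ss_resid / df_resid) / (ss_total / (n - 1))

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```

The results agree with the 0.8065 and 0.8007 in the summary, and `n` comes out at 619, so 32 of the 651 sampled movies were not used in the fit.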

Now we need to see if the model fits the following conditions:

the residuals are scattered randomly:

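the residuals are nearly normally distributed: there is a tail (the residuals range from -2.55 to 1.05, so it is on the left), but it is not significant.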

the residuals display constant variability

the residuals are independent
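The plots behind these checks might be drawn in base R as follows. This is a sketch on simulated data: `sim`, `fit` and the function name `plot_residual_checks()` are assumptions for illustration, and the report's own plotting code is not shown in the output.

```r
# Simulated stand-in data and model (values are random, names are borrowed)
set.seed(2)
n <- 300
sim <- data.frame(audience_score = runif(n, 10, 100),
                  runtime        = rnorm(n, 105, 20))
sim$imdb_rating <- 3.5 + 0.045 * sim$audience_score +
  0.005 * sim$runtime + rnorm(n, sd = 0.5)
fit <- lm(imdb_rating ~ audience_score + runtime, data = sim)

# Hypothetical helper: one panel per condition being checked
plot_residual_checks <- function(fit) {
  res <- resid(fit)
  fv  <- fitted(fit)
  op  <- par(mfrow = c(2, 3))
  on.exit(par(op))

  # Linearity: residuals should scatter randomly around zero
  plot(fv, res, xlab = "Fitted", ylab = "Residual",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Near-normality: histogram and normal Q-Q plot
  hist(res, xlab = "Residual", main = "Histogram of residuals")
  qqnorm(res)
  qqline(res)

  # Constant variability: absolute residuals should show no fan shape
  plot(fv, abs(res), xlab = "Fitted", ylab = "|Residual|",
       main = "Absolute residuals vs fitted")

  # Independence: residuals in data order should show no trend
  plot(seq_along(res), res, xlab = "Order of collection",
       ylab = "Residual", main = "Residuals in data order")

  invisible(res)
}
```

Calling `plot_residual_checks(fit)` draws the five panels on one page and invisibly returns the residuals.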

there is no trend over time for the residuals.

* * *

Part 5: Prediction

We can use the model to predict the rating of Deadpool, a movie released in 2016 that was not in the sample.

##        fit      lwr     upr
## 1 8.104227 7.121324 9.08713

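The predicted value of 8.1 is very close to the actual IMDb rating of 8.0. The 95% interval runs from 7.12 to 9.09. Its half-width of about 0.98 is roughly twice the residual standard error, which is consistent with a prediction interval for a single movie rather than a confidence interval for the mean rating.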
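The fit/lwr/upr columns above have the shape returned by `predict()` on an `lm` object when an `interval` argument is given. Which interval type the report requested is not shown, so the sketch below is an assumption: it uses simulated data and a hypothetical `new_movie` row to contrast the two options.

```r
# Simulated stand-in data and model (values are random, names are borrowed)
set.seed(3)
n <- 300
sim <- data.frame(audience_score = runif(n, 10, 100),
                  runtime        = rnorm(n, 105, 20))
sim$imdb_rating <- 3.5 + 0.045 * sim$audience_score +
  0.005 * sim$runtime + rnorm(n, sd = 0.5)
fit <- lm(imdb_rating ~ audience_score + runtime, data = sim)

# Hypothetical movie to score; these are not Deadpool's real values
new_movie <- data.frame(audience_score = 90, runtime = 108)

# Prediction interval: where one new movie's rating is expected to fall
pred_int <- predict(fit, newdata = new_movie,
                    interval = "prediction", level = 0.95)

# Confidence interval: where the mean rating of such movies is expected
conf_int <- predict(fit, newdata = new_movie,
                    interval = "confidence", level = 0.95)

print(pred_int)
print(conf_int)

# The prediction half-width always exceeds t * residual standard error
half_width <- (pred_int[1, "upr"] - pred_int[1, "lwr"]) / 2
t_crit     <- qt(0.975, df = df.residual(fit))
sigma_hat  <- summary(fit)$sigma
print(c(half_width = unname(half_width), t_times_sigma = t_crit * sigma_hat))
```

Both calls return the same `fit`. Only the bounds differ, and the prediction interval is the wider of the two because it adds the scatter of individual movies around the mean.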

We will now use it to predict the rating of Inside Out.

##        fit      lwr      upr
## 1 7.376668 6.365044 8.388292

The prediction for Inside Out was not as close as that for Deadpool: the actual rating is 8.2 while the predicted value is 7.4, although 8.2 still falls inside the interval (6.37 to 8.39).

Part 6: Conclusion

The model has good predictive power (adjusted R-squared of 0.8007) and can be used to predict the IMDb rating of a particular movie.