Modeling and prediction for movies

Setup

Load packages

library(ggplot2)

library(dplyr)

library(statsr)

Load data

load("movies.Rdata")

For ease of use, we subset the dataset and continue only with the variables that are relevant to our analysis.

# tbl_df() is deprecated in current dplyr releases; as_tibble() is its replacement
movie_sub <- dplyr::as_tibble(movies)
movie_sub <- dplyr::select(movie_sub, title, genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, top200_box)
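Because lm() drops rows with missing values by default, it can help to count missing values per column before modeling. The following is a minimal sketch on a toy data frame with made-up values; only the column names mirror movie_sub, and the object names toy, na_counts and n_complete are chosen for illustration.

```r
# Toy data frame with made-up values; only the column names mirror movie_sub.
toy <- data.frame(
  runtime        = c(95, NA, 120, 101),
  imdb_rating    = c(6.1, 7.3, 5.8, 8.0),
  audience_score = c(55, 81, 40, 90)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows without any missing value, i.e. the rows a model
# using all of these columns would keep.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

The same two calls could be applied to movie_sub to see which variable is responsible for any dropped observation.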

Part 1: Data

The dataset “movies” contains 651 randomly sampled movies produced and released before 2016. The movies are from U.S.-based studios, because the MPAA ratings in the data apply only to American films. The dataset includes information from both Rotten Tomatoes and IMDb.

In this report, we try to identify the factors associated with the popularity of movies using this data. Any generalization might only be applicable to U.S.-based movies. Since the movies in the data were sampled randomly, we might be able to use this data to see what is associated with a movie’s popularity.

This information suggests the dataset should be considered the result of an observational retrospective study that uses a random sampling design to select a representative sample of U.S. movies. Because random sampling was employed in the data collection, the results should be generalizable to the target population, that is, to all the movies released between 1970 and 2014. Because the study is observational and involves no random assignment, the results can show association but not causation.

A potential source of bias is that the audience ratings are collected only from IMDb or Rotten Tomatoes. They therefore reflect only the people who register and take the time to rate movies on these sites, not everyone who has seen the movie.

Part 2: Research question

To obtain the best information about the quality of a movie before a movie is released, consumers rely on movie critics as their guide. Movie critics act as advisors to consumers telling them which movies will be worth their money. Their reviews can tell their readers, before they decide to see a movie or not, how funny, entertaining, well-acted, and gripping a variety of movies are. Readers can then take this information and use it to decide whether to spend their money on a movie or a more useful alternative.

The main question is: is there an association between the critics score (critics_score) and the audience score (audience_score)? Moreover, among the chosen explanatory variables (critics score on Rotten Tomatoes, critics_score; number of votes on IMDb, imdb_num_votes; rating on IMDb, imdb_rating; critics rating on Rotten Tomatoes, critics_rating), which one is most strongly associated with the audience score?

Part 3: Exploratory data analysis

We analyzed the relationship between audience_score and critics_score using a scatter plot, with movies colored by genre to see which genres get higher critics scores, so that we can assess the relationship between critics score and audience score further. As we can see, comedy, documentary, and drama movies get higher critics scores.

qplot(critics_score,audience_score,colour=genre, data=movie_sub)

In the next step, the movies were categorized by critics rating. As the plot shows, movies rated Certified Fresh and Fresh tend to have a higher audience score.

qplot(critics_score,audience_score,colour=critics_rating, data=movie_sub)

To see which MPAA rating tends to have a higher audience score, we used the following scatter plot. As the plot depicts, the ratings are well mixed and no MPAA rating clearly stands out.

qplot(critics_score,audience_score,colour=mpaa_rating, data=movie_sub)

The relationship between critics score and audience score, with movies colored by whether they are in the top 200 box office list, is shown in the following scatter plot. This plot demonstrates that being in the top 200 does not necessarily lead to a higher audience score.

qplot(critics_score,audience_score,colour=top200_box, data=movie_sub)

To see which genres have a higher audience score, and how the scores are distributed, we used the following density plot. As the plot demonstrates, documentary films tend to have a higher audience score.

qplot(audience_score,colour=genre, data=movie_sub, geom="density")

Now we can check whether there is a linear relationship between audience score and critics score. As the plot depicts, there appears to be a positive linear relationship between the two scores (critics_score and audience_score).

ggplot(data = movie_sub, aes(x = critics_score, y = audience_score)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = TRUE, col = 2)
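The visual impression of a linear relationship can be backed by a correlation coefficient. The sketch below uses a toy data frame with made-up values (not the movies data); the object names toy, r and ct are chosen for illustration.

```r
# Toy data with made-up values; column names mirror movie_sub for illustration only.
toy <- data.frame(
  critics_score  = c(15, 30, 45, 60, 75, 90),
  audience_score = c(35, 40, 55, 60, 70, 85)
)

# Pearson correlation between the two scores (a value between -1 and 1).
r <- cor(toy$critics_score, toy$audience_score)

# Test of the null hypothesis that the true correlation is zero,
# together with a 95% confidence interval for the correlation.
ct <- cor.test(toy$critics_score, toy$audience_score)

print(r)
print(ct$conf.int)
```

On the real data, the same calls would take movie_sub$critics_score and movie_sub$audience_score; if either column had missing values, cor() would need use = "complete.obs".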

Part 4: Modeling

To get started, we build a linear model using all candidate explanatory variables:

model_full <- lm(audience_score ~ genre + runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + critics_score, data=movie_sub)

summary(model_full)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating + 
##     imdb_rating + imdb_num_votes + critics_rating + critics_score, 
##     data = movie_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.270  -6.304   0.468   5.667  49.182 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.723e+01  4.900e+00  -5.556 4.07e-08 ***
## genreAnimation                  7.776e+00  3.827e+00   2.032   0.0426 *  
## genreArt House & International -1.500e-01  2.990e+00  -0.050   0.9600    
## genreComedy                     2.035e+00  1.631e+00   1.247   0.2127    
## genreDocumentary                1.052e+00  2.292e+00   0.459   0.6464    
## genreDrama                      4.848e-01  1.434e+00   0.338   0.7354    
## genreHorror                    -5.073e+00  2.444e+00  -2.076   0.0383 *  
## genreMusical & Performing Arts  5.198e+00  3.187e+00   1.631   0.1034    
## genreMystery & Suspense        -5.506e+00  1.828e+00  -3.012   0.0027 ** 
## genreOther                      1.510e+00  2.770e+00   0.545   0.5859    
## genreScience Fiction & Fantasy -9.231e-01  3.502e+00  -0.264   0.7922    
## runtime                        -4.596e-02  2.296e-02  -2.002   0.0457 *  
## mpaa_ratingNC-17               -4.448e+00  7.421e+00  -0.599   0.5491    
## mpaa_ratingPG                   9.615e-01  2.702e+00   0.356   0.7220    
## mpaa_ratingPG-13               -6.938e-01  2.789e+00  -0.249   0.8036    
## mpaa_ratingR                   -3.703e-01  2.684e+00  -0.138   0.8903    
## mpaa_ratingUnrated              9.499e-01  3.064e+00   0.310   0.7566    
## imdb_rating                     1.494e+01  6.211e-01  24.055  < 2e-16 ***
## imdb_num_votes                  3.595e-06  4.364e-06   0.824   0.4103    
## critics_ratingFresh            -2.052e+00  1.217e+00  -1.686   0.0923 .  
## critics_ratingRotten           -4.762e+00  1.977e+00  -2.408   0.0163 *  
## critics_score                   1.949e-03  3.574e-02   0.055   0.9565    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.781 on 628 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7739, Adjusted R-squared:  0.7664 
## F-statistic: 102.4 on 21 and 628 DF,  p-value: < 2.2e-16

Since every level of MPAA rating (mpaa_rating) has a large p-value, it may not be a good predictor of audience score, so we remove it first.
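Because mpaa_rating is a factor with several levels, each p-value in the coefficient table compares a single level with the baseline level. One common way to assess such a variable as a whole is an F-test between nested models. The sketch below uses simulated data (not the movies data), and the names sim, rating, mpaa, score, full and reduced are made up for illustration.

```r
set.seed(1)
n <- 200

# Simulated data: a numeric predictor and a four-level factor.
sim <- data.frame(
  rating = rnorm(n, mean = 6.5, sd = 1),
  mpaa   = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)

# By construction the outcome depends on rating only; mpaa is pure noise.
sim$score <- -30 + 15 * sim$rating + rnorm(n, sd = 10)

full    <- lm(score ~ rating + mpaa, data = sim)
reduced <- lm(score ~ rating, data = sim)

# One F-test for all mpaa levels jointly, instead of one t-test per level.
ftest <- anova(reduced, full)
print(ftest)
```

A large p-value in the second row of the table would support dropping the factor as a whole.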

model1 <- lm(audience_score ~ genre + runtime + imdb_rating + imdb_num_votes + critics_rating + critics_score, data = movie_sub)

summary(model1)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     imdb_num_votes + critics_rating + critics_score, data = movie_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.118  -5.979   0.452   5.650  49.099 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.753e+01  4.207e+00  -6.543 1.25e-10 ***
## genreAnimation                  8.057e+00  3.501e+00   2.301   0.0217 *  
## genreArt House & International -1.503e-01  2.925e+00  -0.051   0.9590    
## genreComedy                     1.861e+00  1.611e+00   1.155   0.2484    
## genreDocumentary                1.418e+00  2.062e+00   0.688   0.4919    
## genreDrama                      1.756e-01  1.394e+00   0.126   0.8998    
## genreHorror                    -5.274e+00  2.387e+00  -2.210   0.0275 *  
## genreMusical & Performing Arts  5.187e+00  3.161e+00   1.641   0.1014    
## genreMystery & Suspense        -5.884e+00  1.780e+00  -3.305   0.0010 ** 
## genreOther                      1.721e+00  2.753e+00   0.625   0.5320    
## genreScience Fiction & Fantasy -8.504e-01  3.491e+00  -0.244   0.8076    
## runtime                        -4.652e-02  2.256e-02  -2.062   0.0397 *  
## imdb_rating                     1.494e+01  6.176e-01  24.194  < 2e-16 ***
## imdb_num_votes                  2.863e-06  4.308e-06   0.665   0.5065    
## critics_ratingFresh            -1.957e+00  1.212e+00  -1.615   0.1067    
## critics_ratingRotten           -4.546e+00  1.965e+00  -2.314   0.0210 *  
## critics_score                   7.852e-03  3.532e-02   0.222   0.8241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.762 on 633 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.7672 
## F-statistic: 134.7 on 16 and 633 DF,  p-value: < 2.2e-16

Critics score (critics_score) has a high p-value as well. In the next step, we remove it to see how the model changes:

model2 <- lm(audience_score ~ genre + runtime + imdb_rating + imdb_num_votes + critics_rating, data = movie_sub)

summary(model2)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     imdb_num_votes + critics_rating, data = movie_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.035  -6.046   0.477   5.592  49.049 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.740e+01  4.164e+00  -6.579  9.9e-11 ***
## genreAnimation                  8.070e+00  3.498e+00   2.307 0.021367 *  
## genreArt House & International -1.964e-01  2.915e+00  -0.067 0.946312    
## genreComedy                     1.867e+00  1.610e+00   1.160 0.246623    
## genreDocumentary                1.445e+00  2.057e+00   0.702 0.482679    
## genreDrama                      1.870e-01  1.392e+00   0.134 0.893147    
## genreHorror                    -5.254e+00  2.383e+00  -2.204 0.027865 *  
## genreMusical & Performing Arts  5.213e+00  3.157e+00   1.651 0.099185 .  
## genreMystery & Suspense        -5.882e+00  1.779e+00  -3.307 0.000998 ***
## genreOther                      1.739e+00  2.750e+00   0.633 0.527224    
## genreScience Fiction & Fantasy -8.510e-01  3.489e+00  -0.244 0.807354    
## runtime                        -4.629e-02  2.252e-02  -2.055 0.040268 *  
## imdb_rating                     1.501e+01  5.232e-01  28.700  < 2e-16 ***
## imdb_num_votes                  2.721e-06  4.257e-06   0.639 0.522986    
## critics_ratingFresh            -2.015e+00  1.183e+00  -1.704 0.088830 .  
## critics_ratingRotten           -4.873e+00  1.298e+00  -3.756 0.000189 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.755 on 634 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.7676 
## F-statistic: 143.9 on 15 and 634 DF,  p-value: < 2.2e-16

Removing critics_score raised the adjusted R-squared slightly (from 0.7672 to 0.7676). Now we will see what happens if we remove genre as well:

model3 <- lm(audience_score ~ runtime + imdb_rating + imdb_num_votes + critics_rating, data = movie_sub)

summary(model3)

## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + imdb_num_votes + 
##     critics_rating, data = movie_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.821  -6.525   0.555   5.614  51.214 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.667e+01  3.918e+00  -6.808 2.27e-11 ***
## runtime              -5.525e-02  2.192e-02  -2.520   0.0120 *  
## imdb_rating           1.515e+01  4.890e-01  30.987  < 2e-16 ***
## imdb_num_votes        4.345e-07  4.124e-06   0.105   0.9161    
## critics_ratingFresh  -2.809e+00  1.200e+00  -2.340   0.0196 *  
## critics_ratingRotten -5.550e+00  1.314e+00  -4.225 2.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.992 on 644 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7562 
## F-statistic: 403.5 on 5 and 644 DF,  p-value: < 2.2e-16
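The model comparison used in this part can also be tabulated in one step by fitting several candidate formulas and collecting their adjusted R-squared values. This is a minimal sketch on simulated data (not the movies data); the names sim, rating, minutes, noise, score and candidates are assumptions made for illustration.

```r
set.seed(7)
n <- 150

# Simulated predictors; 'noise' is unrelated to the outcome by construction.
sim <- data.frame(
  rating  = rnorm(n, mean = 6.5, sd = 1),
  minutes = rnorm(n, mean = 105, sd = 20),
  noise   = rnorm(n)
)
sim$score <- -25 + 15 * sim$rating - 0.05 * sim$minutes + rnorm(n, sd = 10)

# Candidate models, each dropping a different term from the full model.
candidates <- list(
  full      = score ~ rating + minutes + noise,
  no_noise  = score ~ rating + minutes,
  no_rating = score ~ minutes + noise
)

# Fit each candidate and extract its adjusted R-squared.
adj_r2 <- sapply(candidates, function(f) summary(lm(f, data = sim))$adj.r.squared)
print(round(sort(adj_r2, decreasing = TRUE), 4))
```

Dropping a variable that carries real signal (here rating) lowers the adjusted R-squared sharply, while dropping an uninformative one changes it very little.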

With genre removed, the adjusted R-squared decreased (from 0.7676 to 0.7562), so we keep genre in the model. Instead, we remove imdb_num_votes, which has a large p-value:

model4 <- lm(audience_score ~  genre + runtime + imdb_rating + critics_rating, data = movie_sub,na.action="na.exclude")

summary(model4)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     critics_rating, data = movie_sub, na.action = "na.exclude")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.036  -6.019   0.599   5.575  49.175 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -27.90462    4.08624  -6.829 2.00e-11 ***
## genreAnimation                   8.03284    3.49580   2.298 0.021894 *  
## genreArt House & International  -0.44754    2.88716  -0.155 0.876862    
## genreComedy                      1.81976    1.60740   1.132 0.258014    
## genreDocumentary                 1.11257    1.98951   0.559 0.576212    
## genreDrama                       0.04222    1.37264   0.031 0.975471    
## genreHorror                     -5.32295    2.37988  -2.237 0.025656 *  
## genreMusical & Performing Arts   4.88807    3.11428   1.570 0.117014    
## genreMystery & Suspense         -5.92951    1.77663  -3.337 0.000895 ***
## genreOther                       1.76351    2.74809   0.642 0.521285    
## genreScience Fiction & Fantasy  -0.81231    3.48644  -0.233 0.815843    
## runtime                         -0.04258    0.02175  -1.957 0.050730 .  
## imdb_rating                     15.10176    0.50487  29.912  < 2e-16 ***
## critics_ratingFresh             -2.27666    1.10903  -2.053 0.040498 *  
## critics_ratingRotten            -5.07158    1.25938  -4.027 6.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.75 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7728, Adjusted R-squared:  0.7678 
## F-statistic: 154.3 on 14 and 635 DF,  p-value: < 2.2e-16

The summary above shows that genre, runtime, imdb_rating and critics_rating together explain about 77% of the variability in audience_score (adjusted R-squared is 0.7678).
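As a quick check, the adjusted R-squared can be recomputed from the multiple R-squared, the number of observations and the number of slope coefficients. The sketch below only uses numbers copied from the summary(model4) output above; the variable names r2, n, k and adj_r2 are chosen for illustration.

```r
# Values copied from the summary(model4) output above.
r2 <- 0.7728   # multiple R-squared
k  <- 14       # number of estimated slope coefficients
n  <- 650      # observations used: 635 residual df + 14 slopes + 1 intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))
```

Up to rounding of the inputs, this reproduces the adjusted R-squared of 0.7678 reported in the summary.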

model_final <- model4

Multiple linear regression has some inherent assumptions that we should evaluate:

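1-Each numerical variable is linearly related to the outcome

We extract the model residuals and plot them against the numerical predictor imdb_rating. If the relationship is linear, the residuals should scatter randomly around zero.

model_resid <- residuals(model_final, type = "deviance")

# model4 was fitted with na.action = "na.exclude", so residuals() is padded with NA and lines up with the rows of movie_sub
plot(movie_sub$imdb_rating, model_resid, main = "Residuals vs. IMDb Rating", xlab = "imdb_rating", ylab = "Residuals")
abline(h=0)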

2-The residuals of the model are nearly normal

par(mfrow=c(1,2))
hist(model_final$residuals, main = "Histogram of Residuals")
qqnorm(model_final$residuals, main = "Normal Probability Plot of Residuals")
qqline(model_final$residuals)

3-The variability of the residuals is almost constant

par(mfrow=c(1,2))
plot(model_final$residuals ~ model_final$fitted.values, main = "Residuals vs. Fitted")
abline(h=0)
plot(abs(model_final$residuals) ~ model_final$fitted.values, main = "Absolute Value of Residuals vs. Fitted")
abline(h=0)

We do not see a fan shape here. It appears that the variability of the residual stays constant as the value of the fitted or the predicted values change, so, the constant variability condition seems to be met.

The absolute value of residuals plot can be thought of as the first plot folded in half. If we were to see a fan shape in the first plot, we would see a triangle in the absolute value of residuals versus fitted plot. That does not seem to be the case, so this condition appears to be met as well.

4-The residuals are independent

Independent residuals mean independent observations. Unless the data has a time series structure, there is no additional diagnostic plot for checking whether the residuals are independent. The sampling of the data to obtain independent observations was discussed at the beginning of this analysis, where we concluded that the data is a random sample and is generalizable.

Part 5: Prediction

The movie whose audience score we will try to predict is Passengers (2016). Using data from IMDb and Rotten Tomatoes, a data frame is created:

passen <- data.frame(genre = "Action & Adventure", runtime = 116, imdb_rating = 7.0,  critics_rating = "Rotten")

predict(model_final, newdata=passen, interval='confidence')

##        fit      lwr      upr
## 1 67.79719 65.05149 70.54289

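The interval above is a confidence interval for the mean audience score of movies with these characteristics. To predict the audience score of this individual movie, we use a prediction interval instead.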

predict(model_final, newdata=passen, interval='prediction')

##        fit      lwr      upr
## 1 67.79719 48.45451 87.13987

Note: The actual audience_score for Passengers (2016) is 65%.

With this information we can conclude that we are 95% confident that the actual audience_score for Passengers (2016) is between 48.45 and 87.14. The interval contains the actual audience score of 65, and the point prediction of 67.80 is close to it.
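The prediction interval is much wider than the confidence interval because it adds the residual variance of a single observation to the uncertainty of the estimated mean. The following sketch reconstructs the prediction interval from the numbers printed above, under the usual linear model formulas; the variable names are chosen for illustration.

```r
# Numbers copied from the outputs above.
fit   <- 67.79719   # point prediction
ci_lo <- 65.05149   # lower bound of the confidence interval
ci_hi <- 70.54289   # upper bound of the confidence interval
sigma <- 9.75       # residual standard error of model4
df    <- 635        # residual degrees of freedom of model4

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df)

# Standard error of the fitted mean, recovered from the confidence interval width.
se_fit <- (ci_hi - ci_lo) / 2 / t_crit

# The prediction interval combines the variance of the mean with the residual variance.
pi_half <- t_crit * sqrt(se_fit^2 + sigma^2)
pi_bounds <- c(lwr = fit - pi_half, upr = fit + pi_half)
print(round(pi_bounds, 2))
```

The result agrees with the interval returned by predict() up to the rounding of the residual standard error.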

Part 6: Conclusion

The model predicted the audience score of a 2016 movie reasonably well: the actual value fell within the 95% prediction interval. However, it is clear the model needs improvement. We can explain only about 77% of the variance of audience_score. In the future we might consider:

- Perhaps not all the explanatory variables are linearly related to the outcome. Using a polynomial or other non-linear regression might give a better-performing predictive model.

- Adding other data to the model to improve its accuracy.

- Using a larger set of data, which might help improve the model.

- Testing variable transformations, which may contribute to improving the model.
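The first suggestion can be sketched as follows: fit the same outcome once with a linear term and once with a second-degree polynomial, then compare adjusted R-squared. The data here is simulated with curvature built in (an assumption for illustration, not a property of the movies data), and the names sim, x, y, linear and quadratic are made up.

```r
set.seed(42)
n <- 300

# Simulated predictor and an outcome with genuine curvature.
sim <- data.frame(x = runif(n, min = 1, max = 10))
sim$y <- 5 + 2 * sim$x + 0.8 * sim$x^2 + rnorm(n, sd = 3)

# Straight-line fit versus a second-degree polynomial fit.
linear    <- lm(y ~ x, data = sim)
quadratic <- lm(y ~ poly(x, 2), data = sim)

adj <- c(
  linear    = summary(linear)$adj.r.squared,
  quadratic = summary(quadratic)$adj.r.squared
)
print(round(adj, 4))
```

If a polynomial term in, say, runtime or imdb_rating raised the adjusted R-squared of the movie model in a similar way, that would point to a non-linear relationship worth modeling.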