Modeling and prediction for movies

Movies Week 4 Project of the course Linear regression model under the course track Statistics with R

Submitted by Olusola Afuwape

May 24^th 2019

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("movies.Rdata")

Part 1: Data

# Find out the range of years of release of movies in the dataset

temporal <- movies %>% filter(!is.na(thtr_rel_year)) %>% select(thtr_rel_year)
range(temporal$thtr_rel_year)

[1] 1970 2014

Overview

The movies dataset was collected by random sampling of movies produced and released before 2016. The dataset included variables like title of movie, date of release, type of movie etc. The movies included in this dataset were gotten from Rotten Tomatoes and IMDb

Scope of Inference

Scope of inference can be divided into spatial, temporal and socio-economic aspects. Spatial scope of inference included movies produced all over the world. The temporal scope of inference of the movies dataset ranges from 1970 to 2014. Socio-economic scope of inference looks at the impact of movies on socio-economic growth.

According to the codebook of the movies dataset, the dataset included 651 randomly selected movies that were produced and released before 2016. From this piece of information, it is clear that not all required principles of experimental design were engaged for the compilation of this dataset. The principles of experimental design include control, randomize, replicate and block. The movies dataset can only infer generalizability and association/correlation. The dataset cannot infer causality.

Part 2: Research question

How do some specific variables in the movies dataset affect or determine nomination for Oscar?

Part 3: Exploratory data analysis

# Check the data dimension and observe variables
dim(movies)

[1] 651  32

names(movies)

 [1] "title"            "title_type"       "genre"           
 [4] "runtime"          "mpaa_rating"      "studio"          
 [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
[10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
[13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
[16] "critics_score"    "audience_rating"  "audience_score"  
[19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
[22] "best_actress_win" "best_dir_win"     "top200_box"      
[25] "director"         "actor1"           "actor2"          
[28] "actor3"           "actor4"           "actor5"          
[31] "imdb_url"         "rt_url"

# Observe the count of movies that are nominated or win Oscar
oscar <- movies %>% filter(!is.na(best_pic_nom), !is.na(best_pic_win)) %>% select(best_pic_nom, best_pic_win)

table(oscar$best_pic_nom)


 no yes 
629  22

table(oscar$best_pic_win)


 no yes 
644   7

Discussion

A movie that wins Oscar will first be nominated for Oscar before it can be given an Oscar award. Thus, this linear regression model will focus on how the variable best_pic_nom i.e. whether or not the movie was nominated for a best picture Oscar (no, yes) is affected by the following variables in the movies dataset:

imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)

# Get the variables

oscar_model <- movies %>% filter(!is.na(best_pic_nom), !is.na(imdb_rating), !is.na(imdb_num_votes), !is.na(critics_rating), !is.na(critics_score), !is.na(audience_rating), !is.na(audience_score), !is.na(top200_box)) %>% select(best_pic_nom, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, top200_box)

# Observe variables properties

str(oscar_model)

Classes 'tbl_df', 'tbl' and 'data.frame':   651 obs. of  8 variables:
 $ best_pic_nom   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ imdb_rating    : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
 $ imdb_num_votes : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
 $ critics_score  : num  45 96 91 80 33 91 57 17 90 83 ...
 $ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
 $ audience_score : num  73 81 91 76 27 86 76 47 89 66 ...
 $ top200_box     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Graphical visualization

# Graphical representattion of variable best_pic_nom in relation to other variables

boxplot(imdb_rating ~ best_pic_nom, data = oscar_model, col = c("green", "deep pink"), xlab = "Best picture Oscar nomination", ylab = "Rating on IMDB")

boxplot(imdb_num_votes ~ best_pic_nom, data = oscar_model, col = c("purple", "light blue"), xlab = "Best picture Oscar nomination", ylab = "Number of votes on IMDB")

boxplot(critics_score ~ best_pic_nom, data = oscar_model, col = c("yellow", "violet"), xlab = "Best picture Oscar nomination", ylab = "Critics score on Rotten Tomatoes")

boxplot(audience_score ~ best_pic_nom, data = oscar_model, col = c("blue", "grey"), xlab = "Best picture Oscar nomination", ylab = "Audience rating on Rotten Tomatoes")

c_rating <- ggplot(oscar_model) + aes(x = critics_rating, fill = best_pic_nom) + geom_bar(position = "dodge")
c_rating <- c_rating + xlab("Critics rating") + ylab("Count") + scale_fill_discrete(name = "Oscar nomination")
c_rating

a_rating <- ggplot(oscar_model) + aes(x = audience_rating, fill = best_pic_nom) + geom_bar(position = "dodge")
a_rating <- a_rating + xlab("Audience rating") + ylab("Count") + scale_fill_discrete(name = "Oscar nomination")
a_rating

top200 <- ggplot(oscar_model) + aes(x = top200_box, fill = best_pic_nom) + geom_bar(position = "dodge")
top200 <- top200 + xlab("Top 200 Box Office list") + ylab("Count") + scale_fill_discrete(name = "Oscar nomination")
top200

# Convert variable best_pic_nom from a factor to a numeric

oscar_model$best_pic_nom <- as.numeric(oscar_model$best_pic_nom)
str(oscar_model)

Classes 'tbl_df', 'tbl' and 'data.frame':   651 obs. of  8 variables:
 $ best_pic_nom   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ imdb_rating    : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
 $ imdb_num_votes : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
 $ critics_score  : num  45 96 91 80 33 91 57 17 90 83 ...
 $ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
 $ audience_score : num  73 81 91 76 27 86 76 47 89 66 ...
 $ top200_box     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

head(oscar_model)

# A tibble: 6 x 8
  best_pic_nom imdb_rating imdb_num_votes critics_rating critics_score
         <dbl>       <dbl>          <int> <fct>                  <dbl>
1            1         5.5            899 Rotten                    45
2            1         7.3          12285 Certified Fre~            96
3            1         7.6          22381 Certified Fre~            91
4            1         7.2          35096 Certified Fre~            80
5            1         5.1           2386 Rotten                    33
6            1         7.8            333 Fresh                     91
# ... with 3 more variables: audience_rating <fct>, audience_score <dbl>,
#   top200_box <fct>

# Check how these variables are related to variable best_pic_nom usng pairs plot

panel.cor <- function(x, y, ...)
{
par(usr = c(0, 2, 0, 2))
txt <- as.character(format(cor(x, y), digits = 2))
text(0.7, 0.7, txt, cex = 2* abs(cor(x, y)))
}
pairs(oscar_model[1:8], upper.panel=panel.cor)

Discussion

From the above data graphical representations, the boxplots and bar charts indicate that movies nominated for Oscar exhibited higher positive score when compared with movies that were not nominated for each of the variables plotted against the variable best_pic_nom.

An observation of the pairs plot depicts some of the explanatory variables have some level of collinearity. This means some of the variables are correlated and not independent of each other. Multicollinearity should be avoided which results in complication of the model. A parsimonious model is prefered.

Since this model is for predictions, adjusted R² is more reliable for predictions than p-value. p-value is used for statistically significant predictors and not for model prediction. This model prediction will use backwards elimination method to get at the model that gives the best and more parsimonious model with the highest adjusted R² value.

From the pairs plot and using each of the numeric explanatory variable to play the role of the response variable and regress it on the remaining explanatory variables, it was seen that variables imdb_rating (R² = 0.8148), critics_rating (R² = 0.8505) and audience_score (R² = 0.8816) exhibit multicollinearity. Variable imdb_rating was then completely removed from the model because its removal yields the highest adjusted R² value (0.1261).

Part 4: Modeling

Below is the code for the model using all the variables and the model the yields the highest adjusted R² value after performing Backwards elimination method.

# Full model

full_movies <- lm(best_pic_nom ~ imdb_num_votes + critics_rating + critics_score + audience_rating + audience_score + top200_box, data = oscar_model)
summary(full_movies)


Call:
lm(formula = best_pic_nom ~ imdb_num_votes + critics_rating + 
    critics_score + audience_rating + audience_score + top200_box, 
    data = oscar_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42046 -0.04528 -0.01192  0.00558  1.00583 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             9.530e-01  5.077e-02  18.770  < 2e-16 ***
imdb_num_votes          4.183e-07  6.856e-08   6.101 1.82e-09 ***
critics_ratingFresh    -4.927e-02  2.083e-02  -2.365   0.0183 *  
critics_ratingRotten   -2.833e-02  3.345e-02  -0.847   0.3972    
critics_score           5.812e-04  5.566e-04   1.044   0.2968    
audience_ratingUpright -2.518e-02  2.696e-02  -0.934   0.3506    
audience_score          1.081e-03  7.616e-04   1.420   0.1561    
top200_boxyes          -2.300e-02  4.629e-02  -0.497   0.6194    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1691 on 643 degrees of freedom
Multiple R-squared:  0.1355,    Adjusted R-squared:  0.1261 
F-statistic:  14.4 on 7 and 643 DF,  p-value: < 2.2e-16

# Best model

best_model <- lm(best_pic_nom ~ imdb_num_votes + critics_rating + critics_score + audience_score, data = oscar_model)
summary(best_model)


Call:
lm(formula = best_pic_nom ~ imdb_num_votes + critics_rating + 
    critics_score + audience_score, data = oscar_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40998 -0.04583 -0.01235  0.00839  0.99930 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           9.673e-01  4.822e-02  20.060  < 2e-16 ***
imdb_num_votes        4.151e-07  6.617e-08   6.274 6.47e-10 ***
critics_ratingFresh  -4.701e-02  2.071e-02  -2.270   0.0235 *  
critics_ratingRotten -2.475e-02  3.324e-02  -0.745   0.4568    
critics_score         6.284e-04  5.531e-04   1.136   0.2563    
audience_score        5.307e-04  4.730e-04   1.122   0.2623    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1689 on 645 degrees of freedom
Multiple R-squared:  0.134, Adjusted R-squared:  0.1273 
F-statistic: 19.96 on 5 and 645 DF,  p-value: < 2.2e-16

Model Diagnostics

# Linear relationship between numerical explanatory variables and the response variable

best_model <- lm(best_pic_nom ~ imdb_num_votes + critics_rating + critics_score + audience_score, data = oscar_model)
plot(best_model$residuals ~ oscar_model$critics_score)

plot(best_model$residuals ~ oscar_model$audience_score)

# Residuals normality

hist(best_model$residuals)

qqnorm(best_model$residuals)
qqline(best_model$residuals)

# Constant residuals variability

plot(best_model$residuals ~ best_model$fitted)

plot(abs(best_model$residuals) ~ best_model$fitted)

# Residuals independence

plot(best_model$residuals)

Discussion

Linear relationship between numeric explanatory variable and response variable should display complete random scatter and randomly scattered around zero. The linear relationship between the numeric variable critics_score and the residuals is not completely random scatter.
Nearly normal residuals with mean zero condition is not met as displayed by the histogram and the Q-Q plot
Constant variability of residuals is not met. The plot displayed fan and triangle shapes respectively.
Independent residuals condition is not fulfilled. Plot showed weak independence.

Part 5: Prediction

This model will be used to predict the movies Acrimony produced by Tyler Perry. The variable values are gotten from the sites Rotten Tomatoes and IMDb

imdb_num_votes <- 1002
critics_rating <- "Fresh"
critics_score <- 32
audience_score <- 69

model_predict <- data.frame(imdb_num_votes, critics_rating, critics_score, audience_score)
predict(best_model, model_predict, interval = "prediction")

        fit       lwr      upr
1 0.9774569 0.6412887 1.313625

Part 6: Conclusion

From the result of the prediction function, it is clear that the model does not give a good prediction. Let’s recall that the focus of this research is: How do some specific variables in the movies dataset affect or determine nomination for Oscar? Thus, the set of variables selected in this model generation are not the best combination of variable data to use to be able to adequately predict nomination for Oscar award. To get a better prediction of how some of the variables in the movies dataset determine nomination for Oscar, a new set of variables from the dataset should be engaged for linear modeling.