Abstract

Our purpose for this exercise is to develop a multiple linear regression model that will explain what makes movies popular given the variables in a dataset that contains information from Rotten Tomatoes, a website that keeps track of all reviews for each films and aggregates the results and Internet Movie Database IMDB, an online database of information related to film, television programs and video games.

Load Packages

library(ggplot2)
library(dplyr)
library(caret)
library(statsr)
library(gridExtra)
library(GGally)
library(ggthemes)

Load Data

load('movies.Rdata')

We find that the dataset provided consists of 651 rows and 32 column variables. Each row is a movie and each column is a characteristic of the movie. In order not to distract the reader from the analysis, I have placed all the codes in the appendix section in the interest of reproducible research and as a learning tool for others who have little background in computer programming like myself. in some instances the codes are visible to provide understanding of the output.

Furthermore, the list of variables in the dataset provided is quite long. It can be found in the appendix section at the bottom of this document for those who are interested. We will describe the reasons for choosing the variables that will be selected for use in our model.

Part of this exercise is to use the model to predict on a movie that is not part of the dataset provided. I tried finding a movie and the information that are needed for the variables in our model but this was a difficult task than expected. I therefore decided to pick one row in the dataset and separate that from the rest of the data we will use for making our model. it will be used later for our prediction. We will call our data with a single observation as our testing set and the main bulk of our data as our training set. ## Divide into training set and testing set

set.seed(3974)
inTrain <- createDataPartition(y=movies$imdb_rating, p=0.994, list=FALSE)
training <- movies[inTrain,]
testing <- movies[-inTrain,]
dim(training)
## [1] 650  32
dim(testing)
## [1]  1 32

Part -1 Data

Our first task for this assignment is to choose which variables to include in our model. It would be easier to start with eliminating variables that will obviously not be of use for our model. Uniform Resource Locators or commonly known as URLs provide an easy way to find more information for each movie but will not provide information whether a movie is popular or not. If the variable provide information about how many visitors visited their wesite, that would be even more useful for our purpose.

Runtime or length of the movie in minutes is not a key ingredient for popularity for a movie. Most movies are of similar length. Runtime would probably be a good predictor of movie genre. Animation and Documentaries are generally shorter than feature films.

The title of a movie is usually what a moviegoer remembers when a movie is popular but it is not what makes a movie popular. However, this not the case when it comes to actors, actresses or directors. Movie goers turn into devoted fans when a certain actor, actress or director captures their imagination and becomes a key determiner whether subsequent movies from the same person is a must see.

Let us now focus our attention to choosing our response variable. Below is a list of the variables that measures a movie’s popularity.

library(knitr)
minv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(min) %>% sapply(round,2)
maxv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(max) %>% sapply(round,2)
meanv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(mean) %>% sapply(round,2)
medianv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(median) %>% sapply(round,2)
df <- rbind(minv, maxv, meanv, medianv)
rownames(df) <- c("min", "max", "mean", "median")
kable(df)
imdb_rating imdb_num_votes critics_score audience_score
min 1.90 180.00 1.00 11.00
max 9.00 893008.00 100.00 97.00
mean 6.49 57519.87 57.73 62.41
median 6.60 15070.50 61.00 65.00

imdb_rating is scale of 1 to 10 scored by users of IMDb. An IMDb user have one vote per title per user and can change their vote anytime.The totals are converted into a weighted mean-rating that is displayed beside each title in the website. The imdb_num_votes is the number of users of IMDb who voted for a particular film. It is an open ended scale begining with 0 and the maximum votes is dependent on the number of users of the website that voted for a particular film.

critics_score and audience_score are ratings from the website Rotten tomatoes. Both have scale of 1 to 100. Each movie features a “user average,” which calculates the percentage of users who have rated the film positively. Users rate the movie on a scale of 0-10, while critics reviews generally use 4-star ratings and are often qualitative.

p1 <- ggplot(data = training, aes(x = imdb_rating)) + geom_histogram(colour = "black", fill = "skyblue", binwidth = .3)
p2 <- ggplot(data = training, aes(x = imdb_num_votes)) + geom_histogram(colour = "black", fill = "salmon", binwidth = 40000, alpha = 0.5)
p3 <- ggplot(data = training, aes(x = critics_score)) + geom_histogram(colour = "black", fill = "cyan", binwidth = 5, alpha = 0.5)
p4 <- ggplot(data = training, aes(x = audience_score)) + geom_histogram(colour = "black", fill = "yellow", binwidth = 5, alpha = 0.7)
grid.arrange(p1, p2, p3, p4, nrow = 1, ncol = 4)

Let’s take a look at the distribution of these variables.

imdb_rating appears to have the closest afinity to a shape of a normal distribution. imdb_num_votes is heavily skewed with 90% of movies having a score of 151,934 and below.

quantile(training$imdb_num_votes, c(0, 0.25, 0.5, 0.75, 0.9, 1))
##        0%       25%       50%       75%       90%      100% 
##    180.00   4543.25  15070.50  57762.50 151962.20 893008.00

The distribution of critics_score and audience_score appear similar except that the audience score taper more in both ends.

We will choose imdb_rating as our response variable.

We will only include one of the two categorical variables above to be a predictor in our model, since they measure the same things.

d1 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = critics_rating)) + geom_point()
d2 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = audience_rating)) + geom_point()
grid.arrange(d1, d2, nrow = 1, ncol = 2)

We can see from the left plot above that as the value of critics_score increases the color of the points changes correspondingly to the levels of critics_rating with a clear demarcation between the level rotten and fresh around the value 60. This is because both were computed from the same data from the same company, Rotten Tomatoes. There is an overlap between the level of fresh and certified fresh. It seems that the numerical scale and the categorical scale find it easier to agree on which movie is rotten than it is to agree which movie is fresh or certified fresh.

The same observations can be seen on the right plot between imdb_rating and audience_rating. Both were computed from data collected by IMDb. There are only 2 levels however in their categorical variable.

We’ll choose critics_rating because it has 3 levels/categories. A movie is not either popular or not popular, there are many shades of gray in-between.

Our sample data is a random sample of movies released in the United States, the results of our analysis can therefore be generalized to the population of movies released in the US.

The characteristics of movies that form the variables in our dataset are derived from the the website Rotten Tomatoes and IMDb. Since voting and rating are voluntary in these websites, the data suffers from positive respondent bias, where the votes and ratings of those who did not respond were not counted. Random assignment was not performed. Causality therefore cannot be inferred from the results of our analysis.

Part -2 Research Question

Is movie popularity, as measured by IMDb rating, associated with other film characteristics like: having actors, actresses, directors who won an oscar, critic reviews, studios that produced the movie, and year that the movie was released in theaters.

Part -2 Exploratory Data Analysis

We will now examine the relationship of the variables in our dataset. An actor, actress, or director who has won an oscar award is a great motivation to see a movie. An oscar award is like a guarantee of excellence. Does an actress starring in a movie who has won an oscar has the same weight as an actor or a director in making a movie popular?

actr <- table(training$best_actor_win)
acts <- table(training$best_actress_win)
dir <- table(training$best_dir_win)
flm2 <- training %>% mutate(oscar = ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes", "yes", "no"))
osc <- flm2 %>% select(oscar) %>% group_by(oscar) %>% table() %>% rbind(actr, acts, dir)
rownames(osc) <- c("At.least.one.Oscar", "best.actor", "best.actress", "best.director")
osc
##                     no yes
## At.least.one.Oscar 479 171
## best.actor         557  93
## best.actress       578  72
## best.director      607  43

In addition we have created a new variable combining all the 3 variables on winning an Oscar and see if having at least one oscar winner, either a director, actress, or actor is a better predictor of movie popularity. We will call this new variable oscar.

oscar_in_cast <- flm2 %>% filter(oscar == "yes") %>% arrange(imdb_rating) %>% select(title) %>% data.frame() %>% head(6)
x1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = oscar)) + geom_point() + scale_colour_discrete(name="Combined") + scale_fill_hue(name="Combined")
x2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_actor_win)) + geom_point() + scale_colour_discrete(name="Actor") + scale_fill_hue(name="Actor")
x3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_actress_win)) + geom_point() + scale_colour_discrete(name="Actress") + scale_fill_hue(name="Actress")
x4 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_dir_win)) + geom_point() + scale_colour_discrete(name="Director") + scale_fill_hue(name="Director")
grid.arrange(x1, x2, x3, x4, nrow = 1, ncol = 4)

From the plots above we can see the distribution of having an oscar winner as a cast in a movie. Oscar winners appear to be randomly distributed across the range of both the imdb_rating and critics_score and follows the linear trend in the data. It seems having an oscar winner in the movie is not a guarantee of popularity. The three observations in the plot with an oscar in the cast but with the least imd_brating are Battlefield Earth, Jack and Jill, and Town & Country.

We will also be comparing whether getting an oscar nomination and winning an Oscar for best picture are the same in predicting movie popularity.

nom <- table(flm2$best_pic_nom)
win <- table(flm2$best_pic_win)
flm2 <- flm2 %>% mutate(oscar_nom_win = ifelse(best_pic_nom == "yes" | best_pic_win == "yes", "yes", "no"))
nom_win <- table(flm2$oscar_nom_win)
comb_nom_win <- rbind(nom_win, nom, win)
rownames(comb_nom_win) <- c("combined", "nominations", "wins")
comb_nom_win
##              no yes
## combined    627  23
## nominations 628  22
## wins        643   7

From the results of the table above, we can see that out of the 22 movies nominated for a best picture award in the Oscars only 7 won. However, when we combined these two variables the combined number of yes is 23. This is either an error in data entry or one movie won a best picture award that didn’t received a nomination.

w1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = oscar_nom_win)) + geom_point()
w2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_pic_nom)) + geom_point()
w3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_pic_win)) + geom_point()
grid.arrange(w1, w2, w3, nrow = 1, ncol = 3)

We can see that an oscar nomination for best picture has almost the same effect as winning an Oscar in terms of movie popularity. The points for the nominated films and those that won are all clustered in the extreme high end of both the imdb_rating and critics_score scale except for one point that is in the middle. We identified that movie to be A Star is Born, released in 1976, starring Barbra Streisand. I love Barbra Streisand, I love that movie and i love her especially for her role in “Nuts” with Richard Dreyfus and “Prince of Tides” with Nick Nolte. Barbra Streisand won the Oscar even though she was not as popular then as she is now, a testament to her skill in acting. We also combined the best_pic_nom and best_pic_win variables into one.

outlier_best_pic_nom <- flm2 %>% filter(best_pic_nom == "yes") %>% arrange(imdb_rating) %>% data.frame() %>% head(1)
g1 <- ggplot(data = flm2, aes(x = thtr_rel_year)) + geom_histogram(colour = "black", fill = "orange", alpha = 0.5)
g2 <- ggplot(data = flm2, aes(x = thtr_rel_month)) + geom_histogram(colour = "black", fill = "blue", alpha = 0.5)
g3 <- ggplot(data = flm2, aes(x = thtr_rel_day)) + geom_histogram(colour = "black", fill = "green", alpha = 0.5)
g4 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_year))) + geom_point()
g5 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_month))) + geom_point()
g6 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_day))) + geom_point()
grid.arrange(g1, g2, g3, g4, g5, g6, nrow = 2, ncol = 3)

The histograms above show that there are particular years, months, and days where more movies are released in theaters for the first time. We do not observe any clustering of these points in the scatterplot along the imdb_rating and critics_score scale.

g7 <- ggplot(data = flm2, aes(x = dvd_rel_year)) + geom_histogram(colour = "black", fill = "orange", alpha = 0.5)
g8 <- ggplot(data = flm2, aes(x = dvd_rel_month)) + geom_histogram(colour = "black", fill = "blue", alpha = 0.5)
g9 <- ggplot(data = flm2, aes(x = dvd_rel_day)) + geom_histogram(colour = "black", fill = "green", alpha = 0.5)
g10 <-ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_year))) + geom_point()
g11 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_month))) + geom_point()
g12 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_day))) + geom_point()
grid.arrange(g7, g8, g9, g10, g11, g12 ,nrow = 2, ncol = 3)

The histograms above show that there are particular years, months, and days where more dvds are released for the first time. We do not observe any clustering of these points in the scatterplot along the imdb_rating and critics_score scale.

Part - 4 Modeling

num_var <- flm2 %>% select(imdb_rating,imdb_num_votes, critics_score,audience_score)
ggcorr(num_var, name = "Correlation", label = TRUE, alpha = TRUE, palette = "PuOr") + ggtitle("correlation matrix plot") + theme_dark()

We choose critics_score as our second predictor variable. The model’s R-squared increased together with the adjusted R-squared and the p-values indicate that they are both significant predictors.

fit1 <- lm(imdb_rating ~ audience_score, data = flm2)
fit2 <- lm(imdb_rating ~ audience_score + critics_score, data = flm2)
summary(fit2)
## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score, data = flm2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51919 -0.19927  0.03135  0.30596  1.22869 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.6422376  0.0625758   58.20   <2e-16 ***
## audience_score 0.0347717  0.0013404   25.94   <2e-16 ***
## critics_score  0.0118129  0.0009536   12.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4902 on 647 degrees of freedom
## Multiple R-squared:  0.7966, Adjusted R-squared:  0.796 
## F-statistic:  1267 on 2 and 647 DF,  p-value: < 2.2e-16

5 We tried adding a categorical variable to our model to see the effect of having an Oscar nomination or having a member of the cast who has won an Oscar already. We tried all the other categorical variable but the only variable that seems logical with a significant value of less than 0.05 is the new variable we created oscar which gives a yes when an actor, actress, or director in the movie has won an Oscar. Other categorical variables like studio, genre, and mpaa_rating have too many levels to be useful. Furthermore, some of these levels have very few observations and may appear as outliers.

fit3 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes, data = flm2)
fit4 <- lm(imdb_rating ~ audience_score + critics_score + oscar, data = flm2)
summary(fit4)
## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + oscar, 
##     data = flm2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49612 -0.22044  0.01617  0.30003  1.26053 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.612300   0.062953  57.381  < 2e-16 ***
## audience_score 0.034887   0.001332  26.182  < 2e-16 ***
## critics_score  0.011603   0.000950  12.214  < 2e-16 ***
## oscaryes       0.132492   0.043529   3.044  0.00243 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4871 on 646 degrees of freedom
## Multiple R-squared:  0.7995, Adjusted R-squared:  0.7985 
## F-statistic: 858.5 on 3 and 646 DF,  p-value: < 2.2e-16

A model with audience_score and critics_score as a predictor accounts for 79.66 % of the variation in imdb_rating. Adding the categorical variable improves this to only 79.95 %. However, I am interested in quantifying the difference between imdb_rating of movies with a director, actor or actress who has won an Oscar award compared to one without an oscar award.

Model Inerpretation

We used a forward selection approach with a combined criteria of p value, adjusted R squared, and logical reasoning.

imdb_rating = 3.61 + 0.03audience_score + 0.01critics_score + 0.13oscar:yes Intercept - The intercept (3.61) is interpreted as the predicted mean imdb_rating when audience_score, critics_score and oscar is 0. The intercept is usually meaningless when the predictor variables are continuous. Furthermore, the scale of critics_score starts at one and the minimum value in our data for audience_score is 11. The value of 0 therefore is beyond the range of our data points. We will be extrapolating if we assume a value of 0 for our predictors.

Coefficient for audience_score - All else held constant, for every one unit increase in audience_score the model predicts a 0.03 increase in imdb_rating on average. The coefficients are small because the scale of audience_score is from 1 to 100, while the scale of imdb_rating is 1 to 10. Another way of saying this is, all else held constant, for every ten unit increase in audience_score the model predicts a 0.3 increase in imdb_rating on average.

Coefficient for critics_score - All else held constant, for every one unit increase in critics_score the model predicts a 0.01 increase in imdb_rating on average.

Coefficient for oscar - All else held constant, the model predicts that the imdb_rating for movies with at least one member of the cast (actor, actress, including director) with an Oscar award is 0.13 points higher than movies without a member of the cast with an Oscar award on average. The difference in popularity based on imdb_rating between the two is not that much.

R squared - 79.95 % of the variability of imdb_rating is explained by the model.

P values - All the coefficients in our model has a p-value that is less than 0.5. The explanatory variables in our model (audience_score, critics_score, and critics_score) are significant predictors of the response variable (imdb_rating). Their slopes are different from 0.

Conditions for Linear Regression

We will now proceed to perform model diagnostics to see if our model meets the requirements for the linear conditions to be valid. ### linear relationships between each (numerical) predictor and response

t1 <-  ggplot(data = flm2, aes(x = audience_score, y = resid(fit4))) + geom_hline(yintercept = 0, size = 1)  + xlab("audience_score") + ylab("Residual") + geom_point()  
t2 <-  ggplot(data = flm2, aes(x = critics_score, y = resid(fit4))) + geom_hline(yintercept = 0, size = 2)  + xlab("critics_score") + ylab("Residual") + geom_point()
grid.arrange(t1, t2, nrow = 1, ncol = 2)

Based on the residual plots above, we can see a linear trend between our residuals and our predictor variables. This condition is met by our model.

nearly normal residuals with mean 0 and independent residuals

par(mfrow = c(1,3))
hist(fit4$residuals, breaks = 25, main = "Histogram of Residuals", col = "blue", border = "pink", prob = TRUE)
curve(dnorm(x, mean = mean(fit4$residuals), sd = sd(fit4$residuals)), col="red", add=T, lwd = 3)  
qqnorm(fit4$residuals)
qqline(fit4$residuals)
plot(fit4$residuals, main = "Plot of Residuals VS order of Observations")

We can see a strong left skew in the histogram of our residuals (left) and the normal probability plots (center) above. Our model fails to meet this requirement for linearity. This means that our model is not very reliable when audience score or critics score is low.

The plot of the residuals and the order of observations (right) above reveal a random pattern. The observations appear to be independent of each other. This condition is met.

constant variability of residuals

t3 <- ggplot(data.frame(x = fit4$fitted.values, y = resid(fit4)), aes(x=x, y=y)) + geom_hline(yintercept = 0, size = 1)  + xlab("fitted.values") + ylab("Residual") + geom_point()  
t4 <- ggplot(data.frame(x =fit4$fitted.values, y = abs(resid(fit4))), aes(x=x, y=y)) + geom_hline(yintercept = 0, size = 1)  + xlab("fitted.values") + ylab("Residual") + geom_point() 
grid.arrange(t3, t4, nrow = 1, ncol = 2)

Part -5 Prediction

testing2 <- testing %>% mutate(oscar = ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes", "yes", "no"))
newprof <- testing2 %>% select(audience_score, critics_score, oscar)
pred <- predict(fit4, newprof, interval = "prediction", level = 0.95, se.fit = TRUE)
pred
## $fit
##        fit      lwr      upr
## 1 5.158152 4.199223 6.117081
## 
## $se.fit
## [1] 0.03468925
## 
## $df
## [1] 646
## 
## $residual.scale
## [1] 0.487108

The model predicts that the movie Prison will have an imdb_rating of 5.15 with a 95% prediction interval of 4.199 points to 6.117 points. The imdb_rating for this movie from the original dataset provided is 5.9. Our prediction interval contains this value.

Part -6 Conclusion

Our model has demonstrated that with only three predictors, we can predict with a certain amount of accuracy the popularity of movies using imdb_rating as a measure of popularity. However, we have to remember that our predictors, audience_score and critics_score are subjective measures that are easily prone to bias. Meaning both critics and audience are both finicky in their choices, determined by many variables not found in our dataset such as age, race, and gender.

Looking at the model’s residual plots, it seems that there is greater variance when rating bad movies compared to good movies and critics are more likely than the audience to rate extreme values.

One of the shortcoming of our model is that it wasn’t able to include the variable box office list as one of its predictors. It seems reasonable that an increase in the amount of revenues of movie sales is associated with popularity. The small number of movies in our dataset that’s included in the list might have lessened the variables ability to have a small p value. It might have been better to have used a continuous variable of gross movie sales.

We also could have used a larger number of observations in our testing data set to have a better measure of the model’s accuracy.