Intro

This is a project for the Linear Regression and Modelling course, which is part of Coursera’s Statistics with R Specialization.

The project aims to find out which attributes make a movie top-rated. To achieve this, an exploratory data analysis (EDA) has been completed, and a linear regression model predicting the average rating of a movie has been developed.


  • Code chunks can be displayed by clicking the Code button

Setup

Load packages & data

library(ggplot2); library(dplyr); library(statsr)
library(plotly); library(GGally); library(tidyr)

if (!file.exists("./data/1209_SR-LR-w4_Movies/movies.RData")) {
  # create the target directory if needed before downloading the data file
  dir.create("./data/1209_SR-LR-w4_Movies", recursive = TRUE, showWarnings = FALSE)
download.file("https://d3c33hcgiwev3.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1609459200&Signature=IQ4DIjVl-9DlmYVsHOIVdkNqCtFCLKrCV5mGIq15iVdqGl280YSagXpxBNSsn7iIKQYbaDRtynFeSet~09G22DA5KXvM7aa3U661bXIIu0JO3QW6ZMvzcW6sx1ZbpuvcAuy7wZ~iS78-rNUQnGGxOsbWXYnyyUF5UWSuJrdbJJ8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
              destfile = "./data/1209_SR-LR-w4_Movies/movies.RData",
              method = "curl")
  }
load("./data/1209_SR-LR-w4_Movies/movies.RData")

Part 1: Data

The data set includes \(651\) observations on \(32\) variables composed of information from Rotten Tomatoes and IMDb for a random sample of movies. The movies were released from 1970 to 2014 and come from US-based studios, as the MPAA ratings in the data apply only to American films.


  • Generalization. The data were collected in a retrospective study using random sampling to select a representative sample of US movies. So, the findings can be generalized to US movies, with some reservations:
    • all findings refer only to the respective years, i.e. to US movies released from 1970 to 2014;
    • there could be some bias relative to everyone who has seen a movie, due to
      • ratings being taken from only two sources (Rotten Tomatoes/ IMDb);
      • the requirement to register on these sites before rating;
      • the time required to rate the movies
  • Causation/ Correlation. The data come from available information and not from a random assignment experiment. So, this is an observational study, i.e. there is no causality, only correlation/ association between the variables examined

Part 2: Research question

I never use ratings to select a movie to watch. For me, a much more important criterion is the opinion of my son or a friend. To my mind, the way a movie makes my heart sink cannot really be calculated. As a mathematician, though, I wonder whether it is possible to decompose such a subtle factor into numerical components.


  • The question is: What are the attributes behind successful, top-rated films?
    • i.e. What factors are associated with a high movie rating?

Part 3: Exploratory data analysis

Data manipulation

Start by examining the data set variables; their description can be found in the Codebook.

Get rid of unnecessary variables, and create the average_rating variable by averaging the rating types present (imdb_rating, critics_score, audience_score; the last two come from Rotten Tomatoes and are rescaled by dividing by 10):

movies <- movies %>% transmute(
        average_rating = round((imdb_rating + critics_score/10 +
                                audience_score/10)/3, 2),
        genre, runtime, thtr_rel_month, imdb_num_votes, best_pic_nom,
        best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
NArows <- sum(!complete.cases(movies))
duo <- sum(duplicated(movies))

Now there is 1 row (0.15\(\%\)) with NA values and 1 duplicated row, so they can be removed:

movies <- na.omit(movies)
movies <- movies[!duplicated(movies), ]

Structure

Examine the structure of the resulting data set, and levels of genre variable:

str(movies)
tibble [649 × 11] (S3: tbl_df/tbl/data.frame)
 $ average_rating  : num [1:649] 5.77 8.33 8.6 7.6 3.7 8.5 6.83 3.97 8.47 7.17 ...
 $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
 $ runtime         : num [1:649] 80 101 84 139 90 78 142 93 88 119 ...
 $ thtr_rel_month  : num [1:649] 4 3 8 10 9 1 1 11 9 3 ...
 $ imdb_num_votes  : int [1:649] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "na.action")= 'omit' Named int 334
  ..- attr(*, "names")= chr "334"
levels(movies$genre)
 [1] "Action & Adventure"        "Animation"                
 [3] "Art House & International" "Comedy"                   
 [5] "Documentary"               "Drama"                    
 [7] "Horror"                    "Musical & Performing Arts"
 [9] "Mystery & Suspense"        "Other"                    
[11] "Science Fiction & Fantasy"
  • there are 7 categorical variables (six of them have two levels yes/no), three numeric (one of them is average_rating), and one integer (imdb_num_votes)
  • there are 11 various genres

Histogram

Take a closer look at the average_rating summary and an interactive histogram:

summ<-summary(movies$average_rating)
summ
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.27    4.83    6.27    6.16    7.77    9.47 

The mean of average_rating is around 6.16, the median is 6.27, and the IQR is 2.94 (confirmed below).
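The IQR is not printed by summary() directly; it can be confirmed with a one-line check using base R:

IQR(movies$average_rating)   # 3rd Qu. - 1st Qu. = 7.77 - 4.83 = 2.94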

gg <-ggplot(movies, aes(x = average_rating, y = ..density..)) +
  geom_histogram(fill = "cornflowerblue", colour = "darkgray") + 
  geom_density(size = 1, colour = "brown")
ggplotly(gg)
  • the average_rating distribution is left-skewed (consistent with the mean lying below the median)

Correlogram

Understand the relationship between average_rating and the other numeric/ integer variables:

ggpairs(movies, columns = c(1, 4:6),
        upper = list(continuous = wrap("cor", size = 8)),
        diag = list(continuous = wrap("barDiag", colour = "cornflowerblue")),
        lower = list(continuous = wrap("smooth", colour = 'cornflowerblue')))

  • correlations between all variables are positive
  • there is no strong relationship between any pair of variables
  • the correlation between average_rating and thtr_rel_month is not statistically significant (checked directly below)
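The last point can be verified with a significance test on the correlation coefficient; a quick sketch using base R's cor.test():

# H0: the true correlation between average_rating and thtr_rel_month is zero
cor.test(movies$average_rating, movies$thtr_rel_month)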

Boxplots

Compare variability in average_rating across the two-level (yes/no) categorical variables:

data <- movies %>%
  gather('category', 'yesno', best_pic_nom, best_pic_win, best_actor_win,
         best_actress_win, best_dir_win, top200_box)
ggplot(data = data, aes(x = category, y = average_rating, fill = yesno)) +
  geom_boxplot(color = "darkgrey", outlier.colour = "cornflowerblue") +
  scale_fill_manual(name = "", values = c("firebrick3", "springgreen3")) +
  theme(legend.title = element_blank(), axis.text.x = element_text(angle = 20))

The boxplots show:

  • for best_actor_win and best_actress_win, there is almost no difference in average_rating whether or not a movie falls into the respective category
  • average_rating for the yes level of best_dir_win is noticeably higher than for the no level
  • average_rating for the yes level of the other three categories (best_pic_nom, best_pic_win, top200_box) is substantially higher than for the no level

Also, look at interactive boxplots of the genre levels:

gg<-ggplot(data = movies, aes(x = genre, y = average_rating, fill = genre)) +
        geom_boxplot(color="darkgrey")+
        theme(axis.text.x = element_text(angle=30),
              legend.title = element_text(size=8.5))+
        scale_fill_discrete(name="CLICK on legend to isolate one trace")
ggplotly(gg)
  • Documentary and Musical & Performing Arts movies have noticeably higher medians (\(>7.5\)) than other genres
  • Comedy and Horror movies have lower medians than other genres (\(<5\))
  • Science Fiction & Fantasy movies have the widest IQR of average_rating, \(7.73-3.12=4.61\) (the per-genre figures are reproduced in the summary below)
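The medians and IQRs quoted above can be reproduced with a per-genre summary; a small sketch using dplyr (already loaded):

movies %>%
  group_by(genre) %>%
  summarise(median = median(average_rating),
            IQR    = IQR(average_rating)) %>%
  arrange(desc(median))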

Part 4: Modelling

The goal is to predict the success of a movie quantified by average_rating using a linear regression model.

Model selection

As the aim is to identify the variables most strongly associated with average_rating, start with a regression of average_rating on all other variables (all.model), then select best.model by backward stepwise elimination:

all.model <- lm(average_rating ~ ., data = movies)
best.model <- step(all.model, direction = "backward", trace = 0)
bestrsq <- summary(best.model)$adj.r.squared
best.model$call
lm(formula = average_rating ~ genre + imdb_num_votes + best_pic_nom + 
    best_dir_win, data = movies)
  • the step() function (backward elimination, using AIC by default) selected a best.model with four predictors: genre, imdb_num_votes, best_pic_nom, best_dir_win
  • the variance explained by best.model (Adj. R-squared) is 33.9\(\%\); a comparison with the full model is sketched below
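For context, the adjusted R-squared of the reduced model can be set against that of the full model (a quick sketch; the full model's value is not reported elsewhere in this report):

# Adj. R-squared of the full model vs. the stepwise-selected model
c(all.model = summary(all.model)$adj.r.squared, best.model = bestrsq)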

Model diagnostics

For the model to provide valid results, the following conditions should be met:

  • LINEARITY - linear relationships between explanatory and response variables
  • NORMALITY - nearly normal residuals with mean\(=0\)
  • CONSTANT VARIABILITY of residuals
  • INDEPENDENCE of residuals

LINEARITY: has already been verified in Part 3 (EDA), section Correlogram, for the numeric/ integer variables

NORMALITY: can be checked with a histogram and/ or a Q-Q plot of the residuals

hist <- ggplot(data = best.model, aes(x = .resid, y = ..density..)) +
        geom_histogram(color = "darkgray", fill = "cornflowerblue") +
        geom_density(size = 1, colour = "brown") +
        xlab("residuals") +
        ggtitle("Distribution of Model Residuals")
qq <- ggplot(data = best.model, aes(sample = .resid)) +
        stat_qq(colour = "cornflowerblue") + stat_qq_line(colour = "brown", size = 1) +
        xlab("theoretical quantiles") +
        ylab("sample quantiles") +
        ggtitle("Normal Q-Q Plot")
multiplot(hist, qq, cols = 2) # Appendix: code `multiplot`

  • histogram: shows a nearly normal distribution with slight left-skewness
  • Q-Q plot: most of the points follow the qq-line, with deviations only in the tails \(=>\) the residuals are nearly normally distributed (a supplementary numeric check follows)
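As a supplementary check that was not part of the original diagnostics, a Shapiro-Wilk test could be run on the residuals:

# H0: the residuals are drawn from a normal distribution
shapiro.test(residuals(best.model))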

CONSTANT VARIABILITY: can be checked by plotting residuals against predicted values

INDEPENDENCE: residuals can be plotted in the order they appear in the data set, which would reveal any time-series dependence

var<-ggplot(data = best.model, aes(x = .fitted, y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(colour="brown") +
        xlab("prediction") +ylab("residuals")+
        ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model, aes(x = 1:nrow(movies), y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("index") +ylab("residuals")+
        ggtitle("Model Residuals (independence)")
multiplot(var, indep, cols=2)

  • variability: there is some heteroscedasticity, which could affect the accuracy of the predictions (it could be addressed by altering the model, which is beyond the scope of this project); a formal check is sketched below
  • independence: residuals are scattered randomly \(=>\) independent (to be expected, as the movie data were sampled randomly)
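The suspected heteroscedasticity could also be assessed formally, e.g. with a Breusch-Pagan test; this sketch assumes the lmtest package is installed (it is not loaded elsewhere in this report):

# H0: the residuals have constant variance
library(lmtest)
bptest(best.model)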

Model coefficients interpretation

round(coefficients(best.model),6)
                   (Intercept)                 genreAnimation 
                      4.759862                       0.698380 
genreArt House & International                    genreComedy 
                      1.257816                       0.024026 
              genreDocumentary                     genreDrama 
                      3.401586                       1.320504 
                   genreHorror genreMusical & Performing Arts 
                      0.003018                       2.736404 
       genreMystery & Suspense                     genreOther 
                      0.626110                       1.099557 
genreScience Fiction & Fantasy                 imdb_num_votes 
                      0.059695                       0.000005 
               best_pic_nomyes                best_dir_winyes 
                      1.070874                       0.570473 
  • all the predictors have a positive effect on average_rating, i.e. they increase it
    • the yes levels of best_pic_nom and best_dir_win increase average_rating (compared to the no level) by 1.07 and 0.57 points respectively
    • the genre level coefficients show changes relative to Action & Adventure movies (the reference level)
  • regression intercept: an Action & Adventure movie with zero IMDb votes, no Best Picture nomination, and a director who has not won an Oscar gets an average_rating of 4.76 points
  • an increase of 100 in imdb_num_votes raises average_rating by 0.0005 points (see the check below)
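The last figure follows from the imdb_num_votes coefficient; a quick arithmetic check:

# expected change in average_rating for 100 additional IMDb votes (approx. 0.0005)
coefficients(best.model)["imdb_num_votes"] * 100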

Part 5: Prediction

Now test the predictive capability of best.model on a movie not included in the original data set: Lion. According to IMDb and Rotten Tomatoes, the movie was released in 2016 and earned an imdb_rating of \(8.0\), a critics_score of \(85\%\), and an audience_score of \(91\%\). Compare the true average rating to the predicted value, and look at the prediction interval:

lion <- tibble(imdb_num_votes=210903,
                         best_pic_nom=factor("yes", levels=c("no", "yes")),
                         best_dir_win=factor("no", levels=c("no", "yes")),
                         genre= factor("Drama", levels=
                                               levels(movies$genre)))
rating_lion <- predict(best.model, lion, interval="prediction", level=0.95)
rating_lion
       fit      lwr      upr
1 8.130159 5.151084 11.10923
  • With \(95\%\) confidence, all else being equal, the average_rating for “Lion” is predicted to lie between 5.15 and 11.11 (a prediction interval; note that its upper bound exceeds the maximum possible rating of 10).
Movie Title  genre  imdb_num_votes  best_pic_nom  best_dir_win  Predicted avg Rating  True avg Rating
Lion         Drama  210903          yes           no            8.13                  8.53
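For reference, the true average rating of 8.53 is computed from the Lion scores quoted above in the same way average_rating was built for the training data:

# (imdb_rating + critics_score/10 + audience_score/10) / 3
round((8.0 + 85/10 + 91/10)/3, 2)   # equals 8.53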

Part 6: Conclusion

  • The linear regression model has shown some capability of predicting a movie's average rating based on certain characteristics
    • nevertheless, the model could have some bias, as the predicted value underestimates the true average_rating (\(8.13\) vs. \(8.53\))
  • The modest predictive capability reflects the Adj. R-squared value of only 0.339, i.e. approx. 66.1\(\%\) of the variability in the data is not explained by the model
  • The model found that the genres associated with the largest increase in average_rating are Documentary (+3.40 points) and Musical & Performing Arts (+2.74 points), while a Best Picture nomination adds 1.07 points to a movie's average_rating
    • at the same time, the original data set has some limitations, i.e. it contains only movies that have been rated on the Rotten Tomatoes and IMDb websites
  • Further analyses could be conducted to answer the following questions:
    • Why doesn't the average_rating distribution follow a normal curve?
      • Could it be related to the non-normal distribution of audience_score values?
    • What is the nature of the residuals' heteroscedasticity?
      • How should it be dealt with?
    • What other variables could affect average_rating?

Appendix

multiplot

# Arrange several ggplot objects on one page in a grid with `cols` columns
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # combine plots passed via ... and via plotlist into a single list
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # if no layout is supplied, build a grid filled column by column
  if (is.null(layout)) {
          layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                           ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # set up the page and print each plot into its cell of the layout
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    for (i in 1:numPlots) {
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}