Intro

This is a project for the Linear Regression and Modelling course, which is part of Coursera’s Statistics with R Specialization.

The project aims to find out which attributes make a movie top-rated. To achieve this, an exploratory data analysis (EDA) has been completed, and a linear regression model predicting the average rating of a movie has been developed.


  • Code chunks can be displayed by clicking the Code button

Setup

Load packages & data

library(ggplot2); library(dplyr); library(statsr)
library(plotly); library(GGally); library(tidyr)

if (!file.exists("./data/1209_SR-LR-w4_Movies/movies.RData")) {
  # create the target directory if needed before downloading the data file
  dir.create("./data/1209_SR-LR-w4_Movies", recursive = TRUE, showWarnings = FALSE)
download.file("https://d3c33hcgiwev3.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1609459200&Signature=IQ4DIjVl-9DlmYVsHOIVdkNqCtFCLKrCV5mGIq15iVdqGl280YSagXpxBNSsn7iIKQYbaDRtynFeSet~09G22DA5KXvM7aa3U661bXIIu0JO3QW6ZMvzcW6sx1ZbpuvcAuy7wZ~iS78-rNUQnGGxOsbWXYnyyUF5UWSuJrdbJJ8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
              destfile = "./data/1209_SR-LR-w4_Movies/movies.RData",
              method = "curl")
  }
load("./data/1209_SR-LR-w4_Movies/movies.RData")

Part 1: Data

The data set includes \(651\) observations on \(32\) variables composed of information from Rotten Tomatoes and IMDb for a random sample of movies. The movies were released from 1970 to 2014 and come from US-based studios, as the MPAA ratings in the data apply only to American films.


  • Generalization. The data were collected in a retrospective study using random sampling to select a representative sample of US movies. So, the findings can be generalized to US movies, with some reservations:
    • all findings refer only to the respective years, i.e. to US movies released from 1970 to 2014;
    • there could be some bias relative to everyone who has seen a movie, due to
      • ratings being taken from only two sources (Rotten Tomatoes/ IMDb);
      • the requirement to register on these sites before rating;
      • the time required to rate the movies
  • Causation/ Correlation. The data come from available information and not from a random assignment experiment. So, this is an observational study, i.e. there is no causality, only correlation/ association between the variables examined

Part 2: Research question

I never use ratings to select a movie to watch. For me, a much more important criterion is the opinion of my son or a friend. To my mind, the way a movie makes my heart sink cannot really be calculated. As a mathematician, though, I wonder whether it is possible to decompose such a subtle factor into numerical components.


  • The question is: What are the attributes behind successful, top-rated films?
    • i.e. What factors are associated with a high movie rating?

Part 3: Exploratory data analysis

Data manipulation

Start by examining the data set variables; their description can be found in the Codebook.

Get rid of unnecessary variables, and create the average_rating variable by averaging the rating types present (imdb_rating, critics_score, audience_score; the last two come from Rotten Tomatoes and are rescaled by dividing by 10):

movies <- movies %>% transmute(
        average_rating = round((imdb_rating + critics_score/10 +
                                audience_score/10)/3, 2),
        genre, runtime, thtr_rel_month, imdb_num_votes, best_pic_nom,
        best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
NArows <- sum(!complete.cases(movies))
duo <- sum(duplicated(movies))

Now there is 1 row (0.15\(\%\)) with NA values and 1 duplicated row, so they can be removed:

movies <- na.omit(movies)
movies <- movies[!duplicated(movies), ]

Structure

Examine the structure of the resulting data set, and levels of genre variable:

str(movies)
tibble [649 × 11] (S3: tbl_df/tbl/data.frame)
 $ average_rating  : num [1:649] 5.77 8.33 8.6 7.6 3.7 8.5 6.83 3.97 8.47 7.17 ...
 $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
 $ runtime         : num [1:649] 80 101 84 139 90 78 142 93 88 119 ...
 $ thtr_rel_month  : num [1:649] 4 3 8 10 9 1 1 11 9 3 ...
 $ imdb_num_votes  : int [1:649] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "na.action")= 'omit' Named int 334
  ..- attr(*, "names")= chr "334"
levels(movies$genre)
 [1] "Action & Adventure"        "Animation"                
 [3] "Art House & International" "Comedy"                   
 [5] "Documentary"               "Drama"                    
 [7] "Horror"                    "Musical & Performing Arts"
 [9] "Mystery & Suspense"        "Other"                    
[11] "Science Fiction & Fantasy"
  • there are 7 categorical variables (six of them have two levels yes/no), three numeric (one of them is average_rating), and one integer (imdb_num_votes)
  • there are 11 various genres

Histogram

Take a closer look at the average_rating summary and an interactive histogram:

summ<-summary(movies$average_rating)
summ
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.27    4.83    6.27    6.16    7.77    9.47 

The mean of average_rating is around 6.16, the median is 6.27, and the IQR is 2.94 (confirmed below).
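The IQR is not printed by summary() directly; it can be confirmed with a one-line check using base R:

IQR(movies$average_rating)   # 3rd Qu. - 1st Qu. = 7.77 - 4.83 = 2.94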

gg <-ggplot(movies, aes(x = average_rating, y = ..density..)) +
  geom_histogram(fill = "cornflowerblue", colour = "darkgray") + 
  geom_density(size = 1, colour = "brown")
ggplotly(gg)
  • the average_rating distribution is left-skewed (consistent with the mean lying below the median)

Correlogram

Understand the relationship between average_rating and the other numeric/ integer variables:

ggpairs(movies, columns = c(1, 4:6),
        upper = list(continuous = wrap("cor", size = 8)),
        diag = list(continuous = wrap("barDiag", colour = "cornflowerblue")),
        lower = list(continuous = wrap("smooth", colour = 'cornflowerblue')))

  • correlations between all variables are positive
  • there is no strong relationship between any pair of variables
  • the correlation between average_rating and thtr_rel_month is not statistically significant (checked directly below)
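The last point can be verified with a significance test on the correlation coefficient; a quick sketch using base R's cor.test():

# H0: the true correlation between average_rating and thtr_rel_month is zero
cor.test(movies$average_rating, movies$thtr_rel_month)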

Boxplots

Compare variability in average_rating across the two-level (yes/no) categorical variables:

data <- movies %>%
  gather('category', 'yesno', best_pic_nom, best_pic_win, best_actor_win,
         best_actress_win, best_dir_win, top200_box)
ggplot(data = data, aes(x = category, y = average_rating, fill = yesno)) +
  geom_boxplot(color = "darkgrey", outlier.colour = "cornflowerblue") +
  scale_fill_manual(name = "", values = c("firebrick3", "springgreen3")) +
  theme(legend.title = element_blank(), axis.text.x = element_text(angle = 20))

The boxplots show:

  • for best_actor_win and best_actress_win, there is almost no difference in average_rating whether or not a movie falls into the respective category
  • average_rating for the yes level of best_dir_win is noticeably higher than for the no level
  • average_rating for the yes level of the other three categories (best_pic_nom, best_pic_win, top200_box) is substantially higher than for the no level

Also, look at interactive boxplots of the genre levels:

gg<-ggplot(data = movies, aes(x = genre, y = average_rating, fill = genre)) +
        geom_boxplot(color="darkgrey")+
        theme(axis.text.x = element_text(angle=30),
              legend.title = element_text(size=8.5))+
        scale_fill_discrete(name="CLICK on legend to isolate one trace")
ggplotly(gg)
  • Documentary and Musical & Performing Arts movies have noticeably higher medians (\(>7.5\)) than other genres
  • Comedy and Horror movies have lower medians than other genres (\(<5\))
  • Science Fiction & Fantasy movies have the widest IQR of average_rating, \(7.73-3.12=4.61\) (the per-genre figures are reproduced in the summary below)
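The medians and IQRs quoted above can be reproduced with a per-genre summary; a small sketch using dplyr (already loaded):

movies %>%
  group_by(genre) %>%
  summarise(median = median(average_rating),
            IQR    = IQR(average_rating)) %>%
  arrange(desc(median))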

Part 4: Modelling

The goal is to predict the success of a movie quantified by average_rating using a linear regression model.

Model selection

As the aim is to identify the variables most strongly associated with average_rating, start with a regression of average_rating on all other variables (all.model), then select best.model by backward stepwise elimination:

all.model <- lm(average_rating ~ ., data = movies)
best.model <- step(all.model, direction = "backward", trace = 0)
bestrsq <- summary(best.model)$adj.r.squared
best.model$call
lm(formula = average_rating ~ genre + imdb_num_votes + best_pic_nom + 
    best_dir_win, data = movies)
  • the step() function (backward elimination, using AIC by default) selected a best.model with four predictors: genre, imdb_num_votes, best_pic_nom, best_dir_win
  • the variance explained by best.model (Adj. R-squared) is 33.9\(\%\); a comparison with the full model is sketched below
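For context, the adjusted R-squared of the reduced model can be set against that of the full model (a quick sketch; the full model's value is not reported elsewhere in this report):

# Adj. R-squared of the full model vs. the stepwise-selected model
c(all.model = summary(all.model)$adj.r.squared, best.model = bestrsq)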

Model diagnostics

For the model to provide valid results, the following conditions should be met:

  • LINEARITY - linear relationships between explanatory and response variables
  • NORMALITY - nearly normal residuals with mean\(=0\)
  • CONSTANT VARIABILITY of residuals
  • INDEPENDENCE of residuals

LINEARITY: has already been verified in Part 3 (EDA), section Correlogram, for the numeric/ integer variables

NORMALITY: can be checked with a histogram and/ or a Q-Q plot of the residuals

hist <- ggplot(data = best.model, aes(x = .resid, y = ..density..)) +
        geom_histogram(color = "darkgray", fill = "cornflowerblue") +
        geom_density(size = 1, colour = "brown") +
        xlab("residuals") +
        ggtitle("Distribution of Model Residuals")
qq <- ggplot(data = best.model, aes(sample = .resid)) +
        stat_qq(colour = "cornflowerblue") + stat_qq_line(colour = "brown", size = 1) +
        xlab("theoretical quantiles") +
        ylab("sample quantiles") +
        ggtitle("Normal Q-Q Plot")
multiplot(hist, qq, cols = 2) # Appendix: code `multiplot`

  • histogram: shows a nearly normal distribution with slight left-skewness
  • Q-Q plot: most of the points follow the qq-line, with deviations only in the tails \(=>\) the residuals are nearly normally distributed (a supplementary numeric check follows)
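As a supplementary check that was not part of the original diagnostics, a Shapiro-Wilk test could be run on the residuals:

# H0: the residuals are drawn from a normal distribution
shapiro.test(residuals(best.model))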

CONSTANT VARIABILITY: can be checked by plotting residuals against predicted values

INDEPENDENCE: residuals can be plotted in the order they appear in the data set, which would reveal any time-series dependence

var<-ggplot(data = best.model, aes(x = .fitted, y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(colour="brown") +
        xlab("prediction") +ylab("residuals")+
        ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model, aes(x = 1:nrow(movies), y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("index") +ylab("residuals")+
        ggtitle("Model Residuals (independence)")
multiplot(var, indep, cols=2)

  • variability: there is some heteroscedasticity, which could affect the accuracy of the predictions (it could be addressed by altering the model, which is beyond the scope of this project); a formal check is sketched below
  • independence: residuals are scattered randomly \(=>\) independent (to be expected, as the movie data were sampled randomly)
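The suspected heteroscedasticity could also be assessed formally, e.g. with a Breusch-Pagan test; this sketch assumes the lmtest package is installed (it is not loaded elsewhere in this report):

# H0: the residuals have constant variance
library(lmtest)
bptest(best.model)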

Model coefficients interpretation

round(coefficients(best.model),6)
                   (Intercept)                 genreAnimation 
                      4.759862                       0.698380 
genreArt House & International                    genreComedy 
                      1.257816                       0.024026 
              genreDocumentary                     genreDrama 
                      3.401586                       1.320504 
                   genreHorror genreMusical & Performing Arts 
                      0.003018                       2.736404 
       genreMystery & Suspense                     genreOther 
                      0.626110                       1.099557 
genreScience Fiction & Fantasy                 imdb_num_votes 
                      0.059695                       0.000005 
               best_pic_nomyes                best_dir_winyes 
                      1.070874                       0.570473 
  • all the predictors have a positive effect on average_rating, i.e. they increase it
    • the yes levels of best_pic_nom and best_dir_win increase average_rating (compared to the no level) by 1.07 and 0.57 points respectively
    • the genre level coefficients show changes relative to Action & Adventure movies (the reference level)
  • regression intercept: an Action & Adventure movie with zero IMDb votes, no Best Picture nomination, and a director who has not won an Oscar gets an average_rating of 4.76 points
  • an increase of 100 in imdb_num_votes raises average_rating by 0.0005 points (see the check below)
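The last figure follows from the imdb_num_votes coefficient; a quick arithmetic check:

# expected change in average_rating for 100 additional IMDb votes (approx. 0.0005)
coefficients(best.model)["imdb_num_votes"] * 100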

Part 5: Prediction

Now test the predictive capability of best.model on a movie not included in the original data set: Lion. According to IMDb and Rotten Tomatoes, the movie was released in 2016 and earned an imdb_rating of \(8.0\), a critics_score of \(85\%\), and an audience_score of \(91\%\). Compare the true average rating to the predicted value, and look at the prediction interval:

lion <- tibble(imdb_num_votes=210903,
                         best_pic_nom=factor("yes", levels=c("no", "yes")),
                         best_dir_win=factor("no", levels=c("no", "yes")),
                         genre= factor("Drama", levels=
                                               levels(movies$genre)))
rating_lion <- predict(best.model, lion, interval="prediction", level=0.95)
rating_lion
       fit      lwr      upr
1 8.130159 5.151084 11.10923
  • With \(95\%\) confidence, all else being equal, the average_rating for “Lion” is predicted to lie between 5.15 and 11.11 (a prediction interval; note that its upper bound exceeds the maximum possible rating of 10).
Movie Title  genre  imdb_num_votes  best_pic_nom  best_dir_win  Predicted avg Rating  True avg Rating
Lion         Drama  210903          yes           no            8.13                  8.53
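For reference, the true average rating of 8.53 is computed from the Lion scores quoted above in the same way average_rating was built for the training data:

# (imdb_rating + critics_score/10 + audience_score/10) / 3
round((8.0 + 85/10 + 91/10)/3, 2)   # equals 8.53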

Part 6: Conclusion

  • The linear regression model has shown some capability of predicting a movie's average rating based on certain characteristics
    • nevertheless, the model could have some bias, as the predicted value underestimates the true average_rating (\(8.13\) vs. \(8.53\))
  • The modest predictive capability reflects the Adj. R-squared value of only 0.339, i.e. approx. 66.1\(\%\) of the variability in the data is not explained by the model
  • The model found that the genres associated with the largest increase in average_rating are Documentary (+3.40 points) and Musical & Performing Arts (+2.74 points), while a Best Picture nomination adds 1.07 points to a movie's average_rating
    • at the same time, the original data set has some limitations, i.e. it contains only movies that have been rated on the Rotten Tomatoes and IMDb websites
  • Further analyses could be conducted to answer the following questions:
    • Why doesn't the average_rating distribution follow a normal curve?
      • Could it be related to the non-normal distribution of audience_score values?
    • What is the nature of the residuals' heteroscedasticity?
      • How should it be dealt with?
    • What other variables could affect average_rating?

Appendix

multiplot

# Arrange several ggplot objects on one page in a grid with `cols` columns
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # combine plots passed via ... and via plotlist into a single list
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # if no layout is supplied, build a grid filled column by column
  if (is.null(layout)) {
          layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                           ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # set up the page and print each plot into its cell of the layout
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    for (i in 1:numPlots) {
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}