Setup

This project details our analysis of the movie dataset that contains information from Rotten Tomatos and IMDB for a random sample of movies. The purpose of this project is to develop a multiple linear regression model to understand what attributes make a movie popular. In the meantime, learning something new about movies.

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)
library(corrplot)

Load data

load("movies.Rdata")

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016, each row in the dataset is a movie and each column is a characteristic of a movie. Therefore, the data should allow us to generalize to the population of interest. However, there is no causation can be established because random assignment is not used in this study. In addition, potential biases are associated with non-voting or non_rating because the voting and rating are voluntary on IMDB and Rotten Tomatos website.

From common sense, we realized that many of the variables are irrelevant to the purpose of identifying the popularity of a movie. As such, we select the following variables to start our analysis.

Part 2: Research question

Is a movie’s popularity, as measured by audience score, related to the type of movie, genre, runtime, imdb rating, imdb number of votes, critics rating, critics score, audience rating, Oscar awards obtained (actor, actress, director and picture)? Being able to answer this question will help us to predict a movie’s popularity.

Part 3: Exploratory data analysis and feature selection

Abstracting the data of the above potential predictors for the model.

movies_new <- movies %>% select(title, title_type, genre, runtime, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_win, best_actor_win, best_actress_win, best_dir_win)

Look at the structure of the data

str(movies_new)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  14 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...

Summary statistics

summary(movies_new)
##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 39.0   Min.   :1.900   Min.   :   180   Certified Fresh:135   
##  1st Qu.: 92.0   1st Qu.:5.900   1st Qu.:  4546   Fresh          :209   
##  Median :103.0   Median :6.600   Median : 15116   Rotten         :307   
##  Mean   :105.8   Mean   :6.493   Mean   : 57533                         
##  3rd Qu.:115.8   3rd Qu.:7.300   3rd Qu.: 58301                         
##  Max.   :267.0   Max.   :9.000   Max.   :893008                         
##  NA's   :1                                                              
##  critics_score    audience_rating audience_score  best_pic_win
##  Min.   :  1.00   Spilled:275     Min.   :11.00   no :644     
##  1st Qu.: 33.00   Upright:376     1st Qu.:46.00   yes:  7     
##  Median : 61.00                   Median :65.00               
##  Mean   : 57.69                   Mean   :62.36               
##  3rd Qu.: 83.00                   3rd Qu.:80.00               
##  Max.   :100.00                   Max.   :97.00               
##                                                               
##  best_actor_win best_actress_win best_dir_win
##  no :558        no :579          no :608     
##  yes: 93        yes: 72          yes: 43     
##                                              
##                                              
##                                              
##                                              
## 

I find there is one missing value, and decide to drop it.

movies_new <- na.omit(movies_new)

Part of this project is to use the model to predict a movie’s audience score and this movie should not be part of the data. Therefore, I split the data into traning and testing, and there is only one row in the test set.

set.seed(2017)
split <- sample(seq_len(nrow(movies_new)), size = floor(0.999 * nrow(movies_new)))
train <- movies_new[split, ]
test <- movies_new[-split, ]
dim(train)
## [1] 649  14
dim(test)
## [1]  1 14

Histogram of Numeric variables

hist(train$audience_score)

summary(train$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.0    46.0    65.0    62.3    80.0    97.0

The median of our response variable - audience score distribution is 65; 25% of the movie in the training set have an audience score higher than 80; 25% of the movie in the training set have an audience score lower than 46; very few movie have an audience score lower than 20 or higher than 90 (i.e.Audience in the data are unlikey to give very low or very high score).

p1 <- ggplot(aes(x=runtime), data=train) + 
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white', binwidth = 5) + ylab('percentage') + ggtitle('Run Time')
p2 <- ggplot(aes(x=imdb_rating), data=train) +
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white', binwidth = 0.2) + ylab('percentage') + ggtitle('IMDB rating')
p3 <- ggplot(aes(x=log10(imdb_num_votes)), data=train) +
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white') + ylab('percentage') + ggtitle('log(IMDB number of votes)')
p4 <- ggplot(aes(x=critics_score), data=train) +
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white', binwidth = 2) + ylab('percentage') + ggtitle('Critics Score')
grid.arrange(p1, p2, p3, p4, ncol=2)

Regression analysis: Run time, IMDB rating, log(IMDB number of votes) and Critics Scores all have reasonable broad distribution, therefore, they will be considered for the regression analysis.

Bar plot of categorical variables

p1 <- ggplot(aes(x=title_type), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
  ggtitle('Title Type') + coord_flip()
p2 <- ggplot(aes(x=genre), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
  ggtitle('Genre') + coord_flip()
p3 <- ggplot(aes(x=critics_rating), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
  ggtitle('Critics Rating') + coord_flip()
p4 <- ggplot(aes(x=audience_rating), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
  ggtitle('Audience Rating') + coord_flip()
grid.arrange(p1, p2, p3, p4, ncol=2)

Not all those categorical variables have reasonable spread of distribution. Most movies in the data are in the “Feature Film” title type and majority of the movies are drama. Therefore, we must be aware that the results could be biased toward drama movies.

Correlation between numerical variables

vars <- names(train) %in% c('runtime', 'imdb_rating', 'imdb_num_votes', 'critics_score')
selected_train <- train[vars]
corr.matrix <- cor(selected_train)
corrplot(corr.matrix, main="\n\nCorrelation Plot of numerical variables", method="number")