This project presents our analysis of a movie dataset that contains information from Rotten Tomatoes and IMDB for a random sample of movies. The purpose of this project is to develop a multiple linear regression model to understand what attributes make a movie popular, and to learn something new about movies along the way.
The data set comprises 651 randomly sampled movies produced and released before 2016. Each row in the dataset is a movie and each column is a characteristic of a movie. Because the movies were randomly sampled, the results should generalize to the population of interest. However, causation cannot be established because random assignment was not used in this study. In addition, there is potential non-response bias, because voting and rating on the IMDB and Rotten Tomatoes websites are voluntary.
Common sense suggests that many of the variables are irrelevant to identifying the popularity of a movie. We therefore select the following variables to start our analysis.
Is a movie’s popularity, as measured by audience score, related to the type of movie, genre, runtime, imdb rating, imdb number of votes, critics rating, critics score, audience rating, Oscar awards obtained (actor, actress, director and picture)? Being able to answer this question will help us to predict a movie’s popularity.
Extracting the data for the above potential predictors for the model.
movies_new <- movies %>% select(title, title_type, genre, runtime, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_win, best_actor_win, best_actress_win, best_dir_win)
Look at the structure of the data
## tibble [651 x 14] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
Summary statistics
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## imdb_rating imdb_num_votes critics_rating critics_score
## Min. :1.900 Min. : 180 Certified Fresh:135 Min. : 1.00
## 1st Qu.:5.900 1st Qu.: 4546 Fresh :209 1st Qu.: 33.00
## Median :6.600 Median : 15116 Rotten :307 Median : 61.00
## Mean :6.493 Mean : 57533 Mean : 57.69
## 3rd Qu.:7.300 3rd Qu.: 58301 3rd Qu.: 83.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## audience_rating audience_score best_pic_win best_actor_win best_actress_win
## Spilled:275 Min. :11.00 no :644 no :558 no :579
## Upright:376 1st Qu.:46.00 yes: 7 yes: 93 yes: 72
## Median :65.00
## Mean :62.36
## 3rd Qu.:80.00
## Max. :97.00
##
## best_dir_win
## no :608
## yes: 43
##
##
##
##
##
The summary shows one missing value (in runtime), so I drop that row.
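The code for this step is not shown in the write-up. A minimal sketch of one way to do it, using a small made-up data frame (`toy`, a hypothetical stand-in for `movies_new`) and base R's `complete.cases()`:

```r
# Hypothetical stand-in for movies_new; the real data has 651 rows.
toy <- data.frame(
  title   = c("A", "B", "C"),
  runtime = c(90, NA, 120)
)

# Count missing values per column to see where the NA lives.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing value in any column.
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

Applied to the data here, the same idea would read `movies_new <- movies_new[complete.cases(movies_new), ]`. That would leave 650 rows, which matches the 649 training rows plus 1 test row reported in the split.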
Part of this project is to use the model to predict a movie’s audience score, and this movie should not be part of the training data. Therefore, I split the data into training and testing sets, with only one row in the test set.
set.seed(2017)
split <- sample(seq_len(nrow(movies_new)), size = floor(0.999 * nrow(movies_new)))
train <- movies_new[split, ]
test <- movies_new[-split, ]
dim(train)
## [1] 649 14
## [1] 1 14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.32 80.00 97.00
The median of our response variable, audience score, is 65. 25% of the movies in the training set have an audience score higher than 80, and 25% have an audience score lower than 46. Very few movies have an audience score lower than 20 or higher than 90 (i.e. audiences in the data are unlikely to give very low or very high scores).
p1 <- ggplot(aes(x=runtime), data=train) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white', binwidth = 5) + ylab('percentage') + ggtitle('Run Time')
p2 <- ggplot(aes(x=imdb_rating), data=train) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white', binwidth = 0.2) + ylab('percentage') + ggtitle('IMDB rating')
p3 <- ggplot(aes(x=log10(imdb_num_votes)), data=train) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white') + ylab('percentage') + ggtitle('log(IMDB number of votes)')
p4 <- ggplot(aes(x=critics_score), data=train) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='white', binwidth = 2) + ylab('percentage') + ggtitle('Critics Score')
grid.arrange(p1, p2, p3, p4, ncol=2)
Run time, IMDB rating, log(IMDB number of votes) and critics score all have reasonably broad distributions, so they will be considered for the regression analysis.
p1 <- ggplot(aes(x=title_type), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
ggtitle('Title Type') + coord_flip()
p2 <- ggplot(aes(x=genre), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
ggtitle('Genre') + coord_flip()
p3 <- ggplot(aes(x=critics_rating), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
ggtitle('Critics Rating') + coord_flip()
p4 <- ggplot(aes(x=audience_rating), data=train) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('percentage') +
ggtitle('Audience Rating') + coord_flip()
grid.arrange(p1, p2, p3, p4, ncol=2)
Not all of these categorical variables are evenly spread across their levels. Most movies in the data have the “Feature Film” title type, and drama is by far the most common genre (just under half of the movies). Therefore, we must be aware that the results could be biased toward drama movies.
vars <- names(train) %in% c('runtime', 'imdb_rating', 'imdb_num_votes', 'critics_score')
selected_train <- train[vars]
corr.matrix <- cor(selected_train)
corrplot(corr.matrix, main="\n\nCorrelation Plot of numerical variables", method="number")
Two predictors, critics score and imdb rating, are highly correlated at 0.76 (collinearity). Therefore, one of them will be removed from the model; I decided to remove critics score.
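Reading pairs off a correlation plot by eye works for four variables, but the same check can be done programmatically. The sketch below uses the built-in `mtcars` data as a stand-in for the training set, and the 0.75 cutoff is an assumption chosen for illustration, not a threshold used in this analysis:

```r
# Built-in mtcars is a stand-in for the numeric columns of the training data.
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
corr <- cor(num_vars)

# Blank out the lower triangle and diagonal so each pair appears once.
corr[lower.tri(corr, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the assumed cutoff.
high <- which(abs(corr) > 0.75, arr.ind = TRUE)

# Report each flagged pair together with its correlation.
flagged <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(flagged)
```

Any pair listed in `flagged` is a candidate for dropping one of its two variables, which is the decision made above for critics score.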
boxplot(audience_score~critics_rating, data=train, main='Audience score vs. Critics rating', xlab='Critics Rating', ylab='Audience Score')
## train$critics_rating: Certified Fresh
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 71.00 81.00 79.37 87.50 97.00
## ------------------------------------------------------------
## train$critics_rating: Fresh
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 58.00 74.00 69.96 83.00 94.00
## ------------------------------------------------------------
## train$critics_rating: Rotten
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 36.00 48.00 49.60 63.75 95.00
boxplot(audience_score~audience_rating, data=train, main='Audience Score vs. Audience Rating', xlab='Audience rating', ylab='Audience Score')
## train$audience_rating: Spilled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 35.00 43.00 41.93 51.00 59.00
## ------------------------------------------------------------
## train$audience_rating: Upright
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.00 70.00 78.00 77.31 85.00 97.00
boxplot(audience_score~title_type, data=train, main='Audience score vs. Title type', xlab='Title_type', ylab='Audience Score')
## train$title_type: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.00 78.00 86.00 83.46 89.00 96.00
## ------------------------------------------------------------
## train$title_type: Feature Film
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.25 62.00 60.43 78.00 97.00
## ------------------------------------------------------------
## train$title_type: TV Movie
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 21.0 75.0 56.8 83.0 86.0
boxplot(audience_score~genre, data=train, main='Audience score vs. Genre', xlab='Genre', ylab='Audience score')
## train$genre: Action & Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 37.00 52.00 53.78 65.00 94.00
## ------------------------------------------------------------
## train$genre: Animation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 59.00 65.00 62.44 70.00 88.00
## ------------------------------------------------------------
## train$genre: Art House & International
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 51.25 65.50 64.00 80.25 86.00
## ------------------------------------------------------------
## train$genre: Comedy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 37.00 50.00 52.51 67.50 93.00
## ------------------------------------------------------------
## train$genre: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 77.50 86.00 82.96 89.00 96.00
## ------------------------------------------------------------
## train$genre: Drama
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.0 52.0 70.0 65.3 80.0 95.0
## ------------------------------------------------------------
## train$genre: Horror
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 36.00 43.00 45.83 53.50 84.00
## ------------------------------------------------------------
## train$genre: Musical & Performing Arts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.00 75.75 80.50 80.17 89.50 95.00
## ------------------------------------------------------------
## train$genre: Mystery & Suspense
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 40.50 54.00 55.95 70.50 97.00
## ------------------------------------------------------------
## train$genre: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 53.00 73.50 66.69 82.50 91.00
## ------------------------------------------------------------
## train$genre: Science Fiction & Fantasy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 26.00 47.00 50.89 79.00 85.00
All of the categorical variables appear to be meaningfully associated with audience score.
We will use forward stepwise selection: we start with an empty model, then add variables one at a time until a parsimonious model is reached. In the following full model, imdb rating has one of the lowest p-values (below 2e-16) and is strongly correlated with our response variable, so we choose imdb rating as the first predictor.
full_model <- lm(audience_score~imdb_rating+title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=train)
summary(full_model)
##
## Call:
## lm(formula = audience_score ~ imdb_rating + title_type + genre +
## runtime + imdb_num_votes + critics_rating + audience_rating +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5184 -4.4880 0.5766 4.3247 24.5522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.713e+00 4.061e+00 -2.392 0.0171 *
## imdb_rating 9.558e+00 4.284e-01 22.312 <2e-16 ***
## title_typeFeature Film 2.232e+00 2.542e+00 0.878 0.3804
## title_typeTV Movie 6.893e-01 4.029e+00 0.171 0.8642
## genreAnimation 3.184e+00 2.476e+00 1.286 0.1988
## genreArt House & International -2.583e+00 2.060e+00 -1.254 0.2103
## genreComedy 1.497e+00 1.142e+00 1.311 0.1904
## genreDocumentary 2.477e+00 2.722e+00 0.910 0.3631
## genreDrama -6.352e-01 9.915e-01 -0.641 0.5220
## genreHorror -1.927e+00 1.686e+00 -1.143 0.2534
## genreMusical & Performing Arts 3.570e+00 2.357e+00 1.515 0.1304
## genreMystery & Suspense -3.162e+00 1.268e+00 -2.494 0.0129 *
## genreOther 2.711e-01 1.951e+00 0.139 0.8895
## genreScience Fiction & Fantasy -2.615e-01 2.458e+00 -0.106 0.9153
## runtime -2.732e-02 1.679e-02 -1.628 0.1041
## imdb_num_votes 2.837e-06 3.079e-06 0.921 0.3572
## critics_ratingFresh -1.029e-02 8.423e-01 -0.012 0.9903
## critics_ratingRotten -1.262e+00 9.318e-01 -1.354 0.1763
## audience_ratingUpright 2.005e+01 7.908e-01 25.350 <2e-16 ***
## best_pic_winyes 4.731e-01 2.917e+00 0.162 0.8712
## best_actor_winyes 1.991e-01 8.140e-01 0.245 0.8069
## best_actress_winyes -9.973e-01 9.026e-01 -1.105 0.2696
## best_dir_winyes 1.841e-01 1.187e+00 0.155 0.8768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.867 on 626 degrees of freedom
## Multiple R-squared: 0.8888, Adjusted R-squared: 0.8849
## F-statistic: 227.4 on 22 and 626 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = audience_score ~ imdb_rating, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.795 -6.533 0.655 5.692 52.905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42.3426 2.4212 -17.49 <2e-16 ***
## imdb_rating 16.1251 0.3679 43.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.17 on 647 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7476
## F-statistic: 1921 on 1 and 647 DF, p-value: < 2.2e-16
An R-squared of 0.75 and a p-value below 2e-16 indicate that imdb rating is a statistically significant predictor of audience score.
To find the second predictor, I look at the following model, which contains the remaining candidate variables.
fit_model <- lm(audience_score~title_type+genre+runtime+imdb_num_votes+critics_rating+audience_rating+best_pic_win+best_actor_win+best_actress_win+best_dir_win, data=train)
summary(fit_model)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime +
## imdb_num_votes + critics_rating + audience_rating + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.4370 -6.0200 0.8101 6.5995 19.2891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.511e+01 4.329e+00 10.420 < 2e-16 ***
## title_typeFeature Film -8.178e-01 3.399e+00 -0.241 0.8099
## title_typeTV Movie -6.012e+00 5.379e+00 -1.118 0.2641
## genreAnimation 6.284e-02 3.309e+00 0.019 0.9849
## genreArt House & International 5.425e-01 2.752e+00 0.197 0.8438
## genreComedy 3.189e-01 1.527e+00 0.209 0.8347
## genreDocumentary 8.062e+00 3.629e+00 2.222 0.0267 *
## genreDrama 1.611e+00 1.321e+00 1.220 0.2230
## genreHorror -8.176e-01 2.256e+00 -0.362 0.7171
## genreMusical & Performing Arts 8.031e+00 3.144e+00 2.554 0.0109 *
## genreMystery & Suspense -4.128e-01 1.689e+00 -0.244 0.8070
## genreOther 5.464e-01 2.612e+00 0.209 0.8343
## genreScience Fiction & Fantasy -3.521e+00 3.285e+00 -1.072 0.2843
## runtime 2.100e-02 2.228e-02 0.942 0.3464
## imdb_num_votes 1.833e-05 4.016e-06 4.563 6.06e-06 ***
## critics_ratingFresh -4.235e-01 1.127e+00 -0.376 0.7073
## critics_ratingRotten -7.308e+00 1.194e+00 -6.123 1.62e-09 ***
## audience_ratingUpright 2.894e+01 9.142e-01 31.663 < 2e-16 ***
## best_pic_winyes -1.183e+00 3.905e+00 -0.303 0.7620
## best_actor_winyes 8.012e-01 1.089e+00 0.736 0.4623
## best_actress_winyes -5.554e-01 1.208e+00 -0.460 0.6459
## best_dir_winyes 1.498e+00 1.587e+00 0.944 0.3456
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.193 on 627 degrees of freedom
## Multiple R-squared: 0.8003, Adjusted R-squared: 0.7936
## F-statistic: 119.7 on 21 and 627 DF, p-value: < 2.2e-16
We add audience rating as the second predictor because it has the lowest p-value.
##
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1510 -4.7695 0.6162 4.3688 24.3404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.5337 2.0073 -5.746 1.41e-08 ***
## imdb_rating 9.5275 0.3498 27.238 < 2e-16 ***
## audience_ratingUpright 20.8470 0.7677 27.154 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.953 on 646 degrees of freedom
## Multiple R-squared: 0.8823, Adjusted R-squared: 0.882
## F-statistic: 2422 on 2 and 646 DF, p-value: < 2.2e-16
Both the R-squared and the adjusted R-squared of the model increased substantially (from about 0.75 to 0.88), and the near-zero p-value indicates that audience rating is another statistically significant predictor of audience score.
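As a quick check on how the two figures relate, the adjusted R-squared can be recomputed from the reported R-squared with the standard formula. The values below are read from the summary output; `n`, `p` and `r2` are names introduced here for illustration:

```r
# Values read from the summary output of the two-predictor model.
n  <- 649      # training observations (646 residual df + 3 coefficients)
p  <- 2        # predictor terms: imdb_rating and audience_rating
r2 <- 0.8823   # multiple R-squared

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

With only two predictors and 649 observations the penalty is tiny, which is why the two figures are nearly identical.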
After this second fit, I tried adding further predictors. The analysis of variance for the model with genre added is as follows:
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## imdb_rating 1 198512 198512 4214.4653 < 2.2e-16 ***
## audience_rating 1 35641 35641 756.6690 < 2.2e-16 ***
## genre 10 1269 127 2.6938 0.003069 **
## Residuals 636 29957 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Therefore, I decided to add genre as one of the predictors. This gives our final, parsimonious model with three predictors: imdb rating, audience rating and genre.
##
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating +
## genre, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6319 -4.4264 0.5933 4.2973 25.0928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.5588 2.1965 -5.718 1.66e-08 ***
## imdb_rating 9.8033 0.3691 26.560 < 2e-16 ***
## audience_ratingUpright 20.3058 0.7752 26.195 < 2e-16 ***
## genreAnimation 3.6263 2.4524 1.479 0.13971
## genreArt House & International -2.7874 2.0329 -1.371 0.17081
## genreComedy 1.5106 1.1275 1.340 0.18077
## genreDocumentary 0.6068 1.3702 0.443 0.65805
## genreDrama -0.8457 0.9595 -0.881 0.37843
## genreHorror -1.6223 1.6700 -0.971 0.33170
## genreMusical & Performing Arts 2.5474 2.1909 1.163 0.24539
## genreMystery & Suspense -3.2744 1.2468 -2.626 0.00884 **
## genreOther 0.2776 1.9260 0.144 0.88542
## genreScience Fiction & Fantasy 0.2554 2.4417 0.105 0.91672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.863 on 636 degrees of freedom
## Multiple R-squared: 0.8871, Adjusted R-squared: 0.885
## F-statistic: 416.5 on 12 and 636 DF, p-value: < 2.2e-16
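For comparison, a single forward-selection step can also be run with base R's `add1()`, which F-tests each candidate when added to the current model. This is a sketch of the general technique, not the exact procedure used above (which read p-values from larger fitted models); the built-in `mtcars` data and the candidate set `wt`, `hp`, `qsec` are stand-ins:

```r
# Built-in mtcars stands in for the training data.
# Start from an intercept-only model, as forward selection does.
fit0 <- lm(mpg ~ 1, data = mtcars)

# F-test each candidate predictor when it is added on its own.
step1 <- add1(fit0, scope = ~ wt + hp + qsec, test = "F")
print(step1)

# Pick the candidate with the smallest p-value (row 1 is "<none>").
pvals <- step1[["Pr(>F)"]][-1]
best  <- rownames(step1)[-1][which.min(pvals)]

# Refit with the chosen predictor; the next step would repeat add1()
# on fit1 with the remaining candidates.
fit1 <- update(fit0, as.formula(paste(". ~ . +", best)))
print(formula(fit1))
```

Repeating the step until no candidate has a small enough p-value (or until adjusted R-squared stops improving) gives the same kind of parsimonious model as the one reached here.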
ggplot(data = fit3, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
There is clearly a linear relationship between imdb rating and audience score, so the linearity condition is met by our model.
The constant variance of residuals condition is also met: there is no fan shape in the residuals plot.
ggplot(data = fit3, aes(x = .resid)) +
geom_histogram(binwidth = 1, fill='white', color='black') +
xlab("Residuals")
The residuals are nearly symmetric, hence it would be appropriate to deem the normal distribution of residuals condition met.
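A normal Q-Q plot is a common companion to the histogram for judging this condition. The sketch below uses base graphics and a model fitted to the built-in `mtcars` data as a stand-in for `fit3`:

```r
# Stand-in model; in this analysis the residuals would come from fit3.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Points close to the dashed reference line suggest roughly normal residuals.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, lty = 2)
```

Systematic curvature at the ends of the plot would point to heavier or lighter tails than a normal distribution.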
We are going to use the model created earlier (fit3) to predict the audience score for the movie in the test set, Aliens. First we create a new data frame for this movie.
## 1
## 76.50501
The model predicts that the movie Aliens in the test set will have an audience score of approximately 76.5.
## fit lwr upr
## 1 76.50501 62.99971 90.01032
Our model predicts, with 95% confidence, that the movie Aliens is expected to have an audience score between 63.00 and 90.01.
## [1] 81
The actual audience score for this movie is 81. Our prediction interval contains this value.
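The prediction code itself is not shown in the write-up. A minimal sketch of the pattern with `predict()`, assuming a model fitted to the built-in `mtcars` data in place of `fit3`; `new_car` and `observed` are hypothetical values made up for illustration:

```r
# Stand-in model; in this analysis the fitted model is fit3.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One-row data frame with the predictor values for the new case.
# Column names must match the predictors in the model formula.
new_car <- data.frame(wt = 3.0, hp = 120)

# Point prediction plus a 95% prediction interval (columns fit, lwr, upr).
pred <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)
print(pred)

# Check whether a hypothetical observed value falls inside the interval.
observed <- 21
inside <- observed >= pred[1, "lwr"] && observed <= pred[1, "upr"]
print(inside)
```

Using `interval = "prediction"` rather than `"confidence"` matters here: the prediction interval covers a single new observation and is therefore wider than the interval for the mean response.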
Our model demonstrates that it is possible to predict a movie’s popularity, as measured by audience score, with only three predictors: imdb rating, audience rating and genre. The movie industry can use similar methods when producing movies that are more likely to be liked by the target audience.
However, a potential shortcoming is that our model’s predictive power was evaluated on a single movie, and the sample may not capture all the variability in the population. A larger number of observations in our testing data set is therefore required to get a better measure of the model’s accuracy.