Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

The data set comprises 651 randomly sampled movies produced and released before 2016. Each row is a movie and each column is a characteristic of that movie. The data describe how much audiences and critics liked each movie, along with numerous other variables, and combine information from Rotten Tomatoes and IMDB. Because this is an observational study, we cannot establish causal relationships; confounding factors may affect the results. Because the movies were randomly sampled, the findings can be generalized to the population of interest.

Part 2: Research question

Which characteristics of a movie are associated with its popularity among audiences? Because this is an observational study, we can only identify associations, not causation.

Part 3: Exploratory data analysis

This dataset has 32 variables, but we do not need all of them for the analysis. We will subset the data and use only the variables listed below:

1.title
2.title_type
3.genre
4.runtime
5.mpaa_rating
6.imdb_rating
7.critics_rating
8.critics_score
9.audience_rating
10.audience_score
11.best_pic_nom
12.best_pic_win
13.best_actor_win
14.best_actress_win
15.best_dir_win
16.top200_box

Let’s subset the data
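The subsetting code itself does not appear in this write-up. A minimal sketch of what it could look like is given here, assuming that load("movies.Rdata") creates a data frame named movies and that the columns are kept in the order listed above, so that positions 6, 8 and 10 are imdb_rating, critics_score and audience_score:

```r
# Variables to keep, in the order listed above
keep_vars <- c("title", "title_type", "genre", "runtime", "mpaa_rating",
               "imdb_rating", "critics_rating", "critics_score",
               "audience_rating", "audience_score", "best_pic_nom",
               "best_pic_win", "best_actor_win", "best_actress_win",
               "best_dir_win", "top200_box")

# Assumption: load("movies.Rdata") created a data frame called `movies`.
# The guard lets the snippet run even when that object is absent.
if (exists("movies")) {
  movies1 <- movies[, keep_vars]
  print(dim(movies1))  # number of rows, number of columns
}
```

Selecting by name rather than by position keeps the subset stable if the column order of the source data ever changes.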

The movies1 dataset contains the 16 variables listed above and 651 observations (the regression output later in this report uses 650 of them and drops 1 for a missing value). We will use this dataset for our analysis. Because we are interested in the characteristics that make a movie popular, we first need to find out which variables in the dataset measure how well a movie is rated.

The variables which measure ratings of a movie are:

1.imdb_rating
2.critics_score
3.audience_score
4.best_pic_nom
5.top200_box

We first have to check these variables for collinearity. We will use the numerical variables for that.

ggpairs(movies1, columns = c(6,8,10))

As we can see, these variables are strongly correlated with one another, so adding more than one of them to the model would not add much value. It is therefore reasonable to use audience_score as the single representative of the variables that measure movie ratings, so audience_score is the response variable for the model.

We would now like to explore the relationship between our response variable audience_score and the other variables: title_type, genre, runtime, critics_rating, mpaa_rating, best_actor_win, best_actress_win, and best_dir_win.

We will explore the relationship of audience_score with each of these variables in turn.

ggplot(movies1, aes(x = title_type, y = audience_score)) + geom_boxplot() +ggtitle("Audience score vs Title type")

From the plot above we can see that Documentary has the highest median and the smallest IQR, while Feature Film has the lowest median and TV Movie has the largest IQR. Documentaries appear to be the most popular title type.
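Medians and IQRs read off a box plot can be checked numerically. A small base R sketch is shown here; the helper name median_iqr_by and the toy scores are made up for illustration and are not taken from the movies data:

```r
# Median and IQR of `value` within each level of `group`.
median_iqr_by <- function(data, group, value) {
  g <- data[[group]]
  v <- data[[value]]
  med <- tapply(v, g, median, na.rm = TRUE)
  iqr <- tapply(v, g, IQR, na.rm = TRUE)
  data.frame(level = names(med),
             median = as.numeric(med),
             iqr = as.numeric(iqr))
}

# Toy data with made-up scores, only to show the shape of the result.
toy <- data.frame(
  title_type = c("Documentary", "Documentary", "Documentary",
                 "Feature Film", "Feature Film", "Feature Film"),
  audience_score = c(80, 85, 90, 40, 60, 80)
)
out <- median_iqr_by(toy, "title_type", "audience_score")
print(out)

# On the real data the call would be, for example:
# median_iqr_by(movies1, "title_type", "audience_score")
```

The same helper can be reused for genre, mpaa_rating and the other categorical variables explored in this section.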

ggplot(movies1, aes(x = genre, y = audience_score)) + geom_boxplot() + ggtitle("Audience Score vs Genre") + scale_x_discrete(labels = function(attend) lapply(strwrap(attend, width = 5, simplify = FALSE), paste, collapse="\n"))

Across genres, Documentary again has the highest median and the smallest IQR.

ggplot(movies1,aes(y = audience_score , x = runtime)) + geom_point() + ggtitle("Audience Score vs Runtime")

## Warning: Removed 1 rows containing missing values (geom_point).

From the plot above, runtime does not look like a good predictor of audience score.

ggplot(movies1, aes(x = critics_rating , y = audience_score)) + geom_boxplot() +ggtitle("Audience Score vs Critics Rating ")

The box plot above shows that movies rated Certified Fresh by critics get higher audience scores.

ggplot(movies1, aes(x = mpaa_rating, y = audience_score)) + geom_boxplot() +ggtitle("Audience Score vs Mpaa Rating ")

Unrated movies have the highest median audience score.

ggplot(movies1, aes(x = best_actor_win , y = audience_score)) + geom_boxplot() +ggtitle("Audience Score vs Main Actor won Oscar ")

Both box plots have almost the same median and almost the same IQR.

ggplot(movies1, aes(x = best_actress_win, y = audience_score)) + geom_boxplot() +ggtitle("Audience Score vs Main Actress won Oscar")

Movies whose main actress has won an Oscar have a relatively higher median audience score and a smaller IQR.

ggplot(movies1, aes(x = best_dir_win , y = audience_score)) + geom_boxplot() +ggtitle("Audience Score vs Directors won Oscar")

Movies whose director has won an Oscar likewise have a higher median audience score and a smaller IQR.

ggplot(movies1,aes(y = audience_score , x = imdb_rating)) + geom_point() + ggtitle("Audience Score vs IMDB Ratings")

The plot shows a strong positive correlation between the two variables, so it makes sense to evaluate imdb_rating as a predictor in a regression model.

ggplot(movies1,aes(y = audience_score , x = audience_rating)) + geom_boxplot() + ggtitle("Audience Score vs Audience Rating")

The box plot shows that movies with an audience rating of Upright have higher audience scores.

Part 4: Modeling

We will now build a regression model with multiple predictors using backward elimination. In this method we start with the full model containing all predictors and drop one predictor at a time until we reach a parsimonious model.

slr<- lm(audience_score~title_type + genre + runtime + mpaa_rating +  critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating, data = movies1)

summary(slr)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + critics_rating + best_actor_win + best_actress_win + 
##     best_dir_win + imdb_rating + audience_rating, data = movies1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3708  -4.5269   0.5551   4.2251  25.0429 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -10.2430     4.3491  -2.355   0.0188 *  
## title_typeFeature Film           2.5794     2.5635   1.006   0.3147    
## title_typeTV Movie               1.0334     4.0398   0.256   0.7982    
## genreAnimation                   2.5628     2.7029   0.948   0.3434    
## genreArt House & International  -2.6285     2.0895  -1.258   0.2089    
## genreComedy                      1.6243     1.1521   1.410   0.1591    
## genreDocumentary                 2.3419     2.7523   0.851   0.3952    
## genreDrama                      -0.4912     1.0029  -0.490   0.6244    
## genreHorror                     -1.5967     1.7220  -0.927   0.3541    
## genreMusical & Performing Arts   3.4219     2.3504   1.456   0.1459    
## genreMystery & Suspense         -2.8098     1.3004  -2.161   0.0311 *  
## genreOther                       0.2652     1.9605   0.135   0.8925    
## genreScience Fiction & Fantasy  -0.2103     2.4632  -0.085   0.9320    
## runtime                         -0.0200     0.0165  -1.212   0.2260    
## mpaa_ratingNC-17                -0.8976     5.2305  -0.172   0.8638    
## mpaa_ratingPG                   -0.1122     1.9007  -0.059   0.9530    
## mpaa_ratingPG-13                -0.9232     1.9507  -0.473   0.6362    
## mpaa_ratingR                    -1.2015     1.8818  -0.639   0.5234    
## mpaa_ratingUnrated              -0.4039     2.1671  -0.186   0.8522    
## critics_ratingFresh             -0.3813     0.7933  -0.481   0.6309    
## critics_ratingRotten            -1.4675     0.9088  -1.615   0.1069    
## best_actor_winyes                0.1697     0.8165   0.208   0.8354    
## best_actress_winyes             -1.0706     0.9011  -1.188   0.2352    
## best_dir_winyes                  0.2822     1.1399   0.248   0.8045    
## imdb_rating                      9.6459     0.4198  22.975   <2e-16 ***
## audience_ratingUpright          20.0604     0.7923  25.320   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.876 on 624 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.889,  Adjusted R-squared:  0.8845 
## F-statistic: 199.9 on 25 and 624 DF,  p-value: < 2.2e-16

This is the full model with all predictors. Its R-squared is 0.889 and its adjusted R-squared is 0.8845. Adjusted R-squared applies a penalty for the number of predictors included in the model, so at each step we prefer the model with the higher adjusted R-squared. We will now drop one predictor at a time and compare the adjusted R-squared of the resulting models.
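For reference, adjusted R-squared can be recomputed from the quantities in the summary output with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations used and p is the number of estimated slopes. In the full model n = 650 (624 residual degrees of freedom plus 26 coefficients) and p = 25. A small sketch, in which the helper name adj_r2_from is illustrative and not part of any package:

```r
# Adjusted R-squared from R-squared, sample size n and number of slopes p.
adj_r2_from <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from the printed summary of the full model.
full_adj <- adj_r2_from(0.889, n = 650, p = 25)
full_adj  # about 0.885; small differences come from R-squared being rounded
```

The penalty factor (n - 1)/(n - p - 1) grows with p, which is why dropping a weak predictor can raise adjusted R-squared even though plain R-squared never increases when a predictor is removed.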

slr<- lm(audience_score~title_type + genre + runtime + mpaa_rating + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr)$adj.r.squared

## [1] 0.8845305

slr1<- lm(audience_score~ genre + runtime + mpaa_rating + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr1)$adj.r.squared

## [1] 0.8846721

slr2<- lm(audience_score~title_type + runtime + mpaa_rating + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr2)$adj.r.squared

## [1] 0.8828145

slr3<- lm(audience_score~title_type + genre + mpaa_rating + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr3)$adj.r.squared

## [1] 0.8841277

slr4<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4)$adj.r.squared

## [1] 0.8850372

slr5<- lm(audience_score~title_type + genre + runtime + mpaa_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr5)$adj.r.squared

## [1] 0.8843364

slr6<- lm(audience_score~title_type + genre + runtime + mpaa_rating +   critics_rating  + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr6)$adj.r.squared

## [1] 0.8847072

slr7<- lm(audience_score~title_type + genre + runtime + mpaa_rating + critics_rating + best_actor_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr7)$adj.r.squared

## [1] 0.8844544

slr8<- lm(audience_score~title_type + genre + runtime + mpaa_rating + critics_rating + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr8)$adj.r.squared

## [1] 0.8847039

slr9<- lm(audience_score~title_type + genre + runtime + mpaa_rating + critics_rating + best_actor_win + best_actress_win + best_dir_win + audience_rating , data = movies1)

summary(slr9)$adj.r.squared

## [1] 0.7871917

slr10<- lm(audience_score~title_type + genre + runtime + mpaa_rating + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating  , data = movies1)

summary(slr10)$adj.r.squared

## [1] 0.7662753

After dropping one predictor at a time, we observe that the slr4 model (the one without mpaa_rating) has the highest adjusted R-squared, 0.8850372. We now start from slr4 and again drop one predictor at a time.

slr4<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4)$adj.r.squared

## [1] 0.8850372

slr4.1<- lm(audience_score~ genre + runtime + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.1)$adj.r.squared

## [1] 0.8851937

slr4.2<- lm(audience_score~title_type + runtime + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.2)$adj.r.squared

## [1] 0.8824985

slr4.3<- lm(audience_score~title_type + genre + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.3)$adj.r.squared

## [1] 0.8845851

slr4.4<- lm(audience_score~title_type + genre + runtime + best_actor_win + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.4)$adj.r.squared

## [1] 0.8848306

slr4.5<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actress_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.5)$adj.r.squared

## [1] 0.8852044

slr4.6<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_dir_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.6)$adj.r.squared

## [1] 0.8849901

slr4.7<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7)$adj.r.squared

## [1] 0.885209

slr4.8<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + best_dir_win + audience_rating , data = movies1)

summary(slr4.8)$adj.r.squared

## [1] 0.7874538

slr4.9<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + best_dir_win + imdb_rating  , data = movies1)

summary(slr4.9)$adj.r.squared

## [1] 0.7671332

Based on the analysis above, the slr4.7 model (the one without best_dir_win) has the highest adjusted R-squared. We choose this model and again drop one predictor at a time to see whether the model improves.

slr4.7<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7)$adj.r.squared

## [1] 0.885209

slr4.7.1<- lm(audience_score~ genre + runtime + critics_rating + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.1)$adj.r.squared

## [1] 0.8853605

slr4.7.2<- lm(audience_score~title_type + runtime + critics_rating + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.2)$adj.r.squared

## [1] 0.8826625

slr4.7.3<- lm(audience_score~title_type + genre + critics_rating + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.3)$adj.r.squared

## [1] 0.8847677

slr4.7.4<- lm(audience_score~title_type + genre + runtime + best_actor_win + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.4)$adj.r.squared

## [1] 0.8849839

slr4.7.5<- lm(audience_score~title_type + genre + runtime + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5)$adj.r.squared

## [1] 0.8853749

slr4.7.6<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.6)$adj.r.squared

## [1] 0.8851641

slr4.7.7<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + audience_rating , data = movies1)

summary(slr4.7.7)$adj.r.squared

## [1] 0.787269

slr4.7.8<- lm(audience_score~title_type + genre + runtime + critics_rating + best_actor_win + best_actress_win + imdb_rating  , data = movies1)

summary(slr4.7.8)$adj.r.squared

## [1] 0.7673716

This round shows that slr4.7.5 (the model without best_actor_win) has the highest adjusted R-squared, so we retain it and repeat the process in order to obtain a parsimonious model.

slr4.7.5<- lm(audience_score~title_type + genre + runtime + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5)$adj.r.squared

## [1] 0.8853749

slr4.7.5.1<- lm(audience_score~ genre + runtime + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.1)$adj.r.squared

## [1] 0.8855215

slr4.7.5.2<- lm(audience_score~title_type +  runtime + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.2)$adj.r.squared

## [1] 0.8828427

slr4.7.5.3<- lm(audience_score~title_type + genre +  critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.3)$adj.r.squared

## [1] 0.8849495

slr4.7.5.4<- lm(audience_score~title_type + genre + runtime + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.4)$adj.r.squared

## [1] 0.8851468

slr4.7.5.5<- lm(audience_score~title_type + genre + runtime + critics_rating + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.5)$adj.r.squared

## [1] 0.8853372

slr4.7.5.6<- lm(audience_score~title_type + genre + runtime + critics_rating  + best_actress_win +  audience_rating , data = movies1)

summary(slr4.7.5.6)$adj.r.squared

## [1] 0.7874144

slr4.7.5.7<- lm(audience_score~title_type + genre + runtime + critics_rating  + best_actress_win + imdb_rating  , data = movies1)

summary(slr4.7.5.7)$adj.r.squared

## [1] 0.7674823

This round shows that slr4.7.5.1 (the model without title_type) has the highest adjusted R-squared.

slr4.7.5.1<- lm(audience_score~ genre + runtime + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.1)$adj.r.squared

## [1] 0.8855215

slr4.7.5.1.1<- lm(audience_score~  runtime + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.1.1)$adj.r.squared

## [1] 0.883093

slr4.7.5.1.2<- lm(audience_score~ genre + critics_rating  + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.1.2)$adj.r.squared

## [1] 0.8851025

slr4.7.5.1.3<- lm(audience_score~ genre + runtime + best_actress_win + imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.1.3)$adj.r.squared

## [1] 0.8853302

slr4.7.5.1.4<- lm(audience_score~ genre + runtime + critics_rating+ imdb_rating + audience_rating , data = movies1)

summary(slr4.7.5.1.4)$adj.r.squared

## [1] 0.8854804

slr4.7.5.1.5<- lm(audience_score~ genre + runtime + critics_rating  + best_actress_win + audience_rating , data = movies1)

summary(slr4.7.5.1.5)$adj.r.squared

## [1] 0.7874236

slr4.7.5.1.6<- lm(audience_score~ genre + runtime + critics_rating  + best_actress_win + imdb_rating , data = movies1)

summary(slr4.7.5.1.6)$adj.r.squared

## [1] 0.7681463

The highest adjusted R-squared still belongs to slr4.7.5.1, so dropping any further predictor does not improve the model. This is the final model.

final_model<- slr4.7.5.1
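The manual search above can also be written as a loop. The sketch below is one possible way to do it, not the procedure actually run in this report: the function names adj_r2 and backward_adj_r2 are hypothetical, the demonstration uses simulated toy data, and rows with missing values are removed up front so that every candidate model is fitted to the same observations.

```r
# Adjusted R-squared of a linear model with the given predictors.
adj_r2 <- function(predictors, response, data) {
  f <- reformulate(predictors, response = response)
  summary(lm(f, data = data))$adj.r.squared
}

# Backward elimination: repeatedly drop the predictor whose removal
# gives the largest gain in adjusted R-squared; stop when no drop helps.
backward_adj_r2 <- function(data, response, predictors) {
  # Keep complete rows only, so all models use the same observations.
  data <- data[complete.cases(data[, c(response, predictors)]), ]
  current <- predictors
  best <- adj_r2(current, response, data)
  while (length(current) > 1) {
    # Adjusted R-squared after dropping each remaining predictor in turn.
    trial <- sapply(current, function(p) {
      adj_r2(setdiff(current, p), response, data)
    })
    if (max(trial) <= best) break  # no single drop improves the model
    best <- max(trial)
    current <- setdiff(current, names(which.max(trial)))
  }
  list(predictors = current, adj_r2 = best)
}

# Toy demonstration: y depends on x1 and x2 but not on noise.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n)
result <- backward_adj_r2(toy, "y", c("x1", "x2", "noise"))
print(result)

# On the movie data the call would look like:
# backward_adj_r2(movies1, "audience_score",
#                 c("title_type", "genre", "runtime", "mpaa_rating",
#                   "critics_rating", "best_actor_win", "best_actress_win",
#                   "best_dir_win", "imdb_rating", "audience_rating"))
```

Because the sketch filters to complete cases first, its adjusted R-squared values may differ slightly from the ones printed above for models that exclude runtime, where lm() would otherwise keep the one movie with a missing runtime.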

P-values and parameter estimates should only be trusted if the conditions for regression are reasonably met. Using the diagnostic plots below, we conclude that the conditions for this model are reasonably met.

hist(final_model$residuals)

ggplot(data = final_model, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted") + ylab("Residual")

We can see that the residual plot shows random scatter around the zero line.

qqnorm(final_model$residuals)
qqline(final_model$residuals)

It appears from the histogram and the normal probability plot that the normality condition has been reasonably met.
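The plots above address normality and the overall scatter of the residuals. Two further checks are often made for multiple regression: whether the spread of the residuals is roughly constant across fitted values, and whether the residuals are independent of the order of the observations. A minimal sketch in base R graphics is given here; the helper name residual_checks and the toy data are illustrative assumptions, not part of the original analysis.

```r
# Draw two extra diagnostic plots for a fitted lm object and return
# the fitted values and residuals invisibly for further inspection.
residual_checks <- function(model) {
  diag <- data.frame(fitted = fitted(model), resid = resid(model))
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  # Constant variance: absolute residuals should show no trend or fan shape.
  plot(diag$fitted, abs(diag$resid),
       xlab = "Fitted", ylab = "|Residual|", main = "Constant variance")
  # Independence: residuals in data order should show no pattern.
  plot(diag$resid, xlab = "Order of observation", ylab = "Residual",
       main = "Independence")
  abline(h = 0, lty = 2)
  invisible(diag)
}

# Toy demonstration on simulated data.
set.seed(42)
toy <- data.frame(x = runif(50))
toy$y <- 1 + 2 * toy$x + rnorm(50, sd = 0.3)
toy_diag <- residual_checks(lm(y ~ x, data = toy))

# On the movie model the call would be: residual_checks(final_model)
```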

summary(final_model)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_rating + 
##     best_actress_win + imdb_rating + audience_rating, data = movies1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5630  -4.6321   0.6255   4.3120  24.7858 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -8.51743    2.98381  -2.855  0.00445 ** 
## genreAnimation                  3.22076    2.46625   1.306  0.19205    
## genreArt House & International -2.85552    2.03077  -1.406  0.16018    
## genreComedy                     1.42804    1.13438   1.259  0.20854    
## genreDocumentary                0.07968    1.39805   0.057  0.95457    
## genreDrama                     -0.76964    0.97228  -0.792  0.42890    
## genreHorror                    -1.96974    1.67636  -1.175  0.24043    
## genreMusical & Performing Arts  2.47270    2.18889   1.130  0.25905    
## genreMystery & Suspense        -3.17702    1.25905  -2.523  0.01187 *  
## genreOther                      0.21157    1.93309   0.109  0.91288    
## genreScience Fiction & Fantasy -0.19922    2.44818  -0.081  0.93517    
## runtime                        -0.01906    0.01550  -1.230  0.21915    
## critics_ratingFresh            -0.30626    0.78545  -0.390  0.69673    
## critics_ratingRotten           -1.41288    0.89693  -1.575  0.11570    
## best_actress_winyes            -0.98821    0.89180  -1.108  0.26824    
## imdb_rating                     9.65742    0.41395  23.330  < 2e-16 ***
## audience_ratingUpright         20.03025    0.78502  25.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.846 on 633 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8855 
## F-statistic: 314.8 on 16 and 633 DF,  p-value: < 2.2e-16

We have built the final regression model using backward elimination. Its explanatory variables are genre, runtime, critics_rating, best_actress_win, imdb_rating and audience_rating, and its response variable is audience_score. The model has an adjusted R-squared of 0.8855 and a multiple R-squared of 0.8883, so about 88.8% of the variation in audience score is explained by these predictors.

Genre slope coefficient interpretation: genre is a factor variable with several levels, and Action & Adventure is the reference level. All else held constant, if a movie's genre is Animation rather than Action & Adventure, we expect its audience score to be about 3.2 points higher on average.

Runtime slope coefficient interpretation: if the runtime of a movie increases by 1 minute, we expect its audience score to decrease by about 0.02 points on average, holding everything else constant.

IMDB rating slope coefficient interpretation: if the IMDB rating of a movie increases by 1 point, we expect its audience score to increase by about 9.66 points on average, holding everything else constant.

Part 5: Prediction

We want to pick a movie from 2016 (a new movie that is not in the sample) and predict its audience score using the model we developed and the predict function in R. We also quantify the uncertainty around this prediction using an appropriate interval.

We pick Kung Fu Panda 3. The values below, obtained from IMDB and Rotten Tomatoes, supply each input that the regression model needs.

kung_fu_3<- data.frame(genre = "Animation", runtime = 95, critics_rating = "Certified Fresh", best_actress_win = "yes", imdb_rating = 7.1, audience_rating = "Upright")

predict(final_model, kung_fu_3, interval = "prediction", level = .95)

##        fit      lwr     upr
## 1 80.50241 66.22421 94.7806

Our model predicts an audience score of about 81 for Kung Fu Panda 3, and Rotten Tomatoes reports an audience score of 78, which is close to our prediction. The 95% prediction interval is roughly 66 to 95, and the actual score falls within it. We can say that the model predicts, with 95% confidence, that Kung Fu Panda 3 will have an audience score between 66 and 95.
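The fitted value can be reproduced by hand from the coefficient table of the final model, which shows how each input contributes to the prediction. The sketch below copies the relevant estimates from the summary output; the variable names b and fit_by_hand are illustrative.

```r
# Coefficients copied from summary(final_model); only those that apply
# to this movie are needed. "Certified Fresh" is the reference level of
# critics_rating, so it contributes 0.
b <- c(intercept = -8.51743, genreAnimation = 3.22076, runtime = -0.01906,
       best_actress_winyes = -0.98821, imdb_rating = 9.65742,
       audience_ratingUpright = 20.03025)

# Intercept + genre effect + runtime slope * 95 minutes + actress effect
# + IMDB slope * 7.1 + Upright effect.
fit_by_hand <- unname(
  b["intercept"] + b["genreAnimation"] + b["runtime"] * 95 +
    b["best_actress_winyes"] + b["imdb_rating"] * 7.1 +
    b["audience_ratingUpright"]
)
fit_by_hand  # about 80.50, matching the fit from predict() up to rounding
```

Most of the predicted score comes from the IMDB rating term (9.65742 × 7.1, roughly 68.6 points) and the Upright audience rating (about 20 points); the remaining terms shift the prediction by only a few points.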

Part 6: Conclusion

The regression model we have created predicts a movie's popularity, as measured by its audience score. We used six predictors: genre, runtime, critics rating, best actress win, IMDB rating, and audience rating. These predictors have different slopes; some are positively related to audience score while others are negatively related. Some of the predictors may be biased. In future work we could also add other predictors, such as social media reviews, to see how they affect the model.