Movie Data Analysis

Initial Criteria

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(broom)
library(ggplot2)
library(car)
library(kableExtra)
library(DT)
library(SignifReg)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

According to the codebook, these are randomly sampled 651 movies that were produced before 2016. It is an observational study.

datatable(movies)

However, we observe that data contains movies only from hollywood and thus it can not be used to make sound judgement about movies from other countries.

Also, few very people actually provide ratings, so the provided data would be biased.

Part 2: Research question

What am I trying to answer?

What makes a movie popular?

How accurately can we predict the populatiry?

How to go about this?

I will look into various data columns and study if there is any association between the movie ratings and audience and critic scores. I would also see if the ratings depend on the genre type of the movies.

How will it be useful?

Recently I watched a movie without reading the reviews or checking the ratings and my experience wasn’t good. So, I would like to see how the audience rating is calculated and what factors do they consider when they like a movie. I would also like to see how if IMDB rating is associated with the rating in Rotten Tomatoes.

Part 3: Exploratory data analysis

Numeric Variables

  1. As I begin exploring the dataset, I would like to look into few variables to see how correlated they are.

Please note: I am checking if there are any NA values in my variables of interest.

df <- movies[,c("runtime","imdb_rating","critics_score","audience_score")]

ggpairs(df)

  1. I would like to see if runtime impacts the popularity of the movie.
sum(is.na(movies$runtime))
## [1] 1
cor(movies$audience_score, movies$runtime)
## [1] NA

There is exactly 1 movie for which we aren’t provided with the data for runtime. We would look into the movie and compare it with the movie of similar genre so as to fix the runtime.

which(is.na(movies$runtime))
## [1] 334
movies[334,c("title","title_type","runtime","genre")]
## # A tibble: 1 x 4
##   title              title_type  runtime genre      
##   <chr>              <fct>         <dbl> <fct>      
## 1 The End of America Documentary      NA Documentary
movies[334,"runtime"] <- 74

So, we find the row number and observe that the runtime was NA for ‘The end of America’ documentary. Hence,
I google the runtime of the documentary - ‘The end of America’, it is around 74 minutes, so I replace NA with 74

  1. Now, we will check the correlation of runtime with the audience_score.
cor(movies$audience_score, movies$runtime)
## [1] 0.1793614

Here the strength of associaltion less than 20%.

We will look into the distribution of runtime.

hist(movies$runtime)

Categorical Variables

To check the distribution of runtime across the title_type-

boxplot(movies$runtime ~ movies$title_type)

There are few outliers for documentary and feature films.

The average runtime across the different movie title_type-

movies %>% 
  select(title_type, runtime) %>%  
  group_by(title_type) %>% 
  summarize(avg_run = mean(runtime)) 
## # A tibble: 3 x 2
##   title_type   avg_run
##   <fct>          <dbl>
## 1 Documentary     97.3
## 2 Feature Film   107. 
## 3 TV Movie       102.
  1. We will look at the unique values for the title_type, critical_rating and audience_rating values
unique(movies$title_type)
## [1] Feature Film Documentary  TV Movie    
## Levels: Documentary Feature Film TV Movie
unique(movies$genre)
##  [1] Drama                     Comedy                   
##  [3] Horror                    Documentary              
##  [5] Action & Adventure        Art House & International
##  [7] Musical & Performing Arts Mystery & Suspense       
##  [9] Animation                 Science Fiction & Fantasy
## [11] Other                    
## 11 Levels: Action & Adventure Animation ... Science Fiction & Fantasy
unique(movies$critics_rating)
## [1] Rotten          Certified Fresh Fresh          
## Levels: Certified Fresh Fresh Rotten
unique(movies$audience_rating)
## [1] Upright Spilled
## Levels: Spilled Upright

Part 4: Modeling

Fitted Model

Now that we have looked at the different variables, we would fit a model using the model using forward selection method, based on Adjusted R^2

scope <- audience_score   ~ 
  runtime  + imdb_rating + critics_score+ critics_rating + audience_rating


model <- SignifReg(scope=scope,
                   data=movies,
                   alpha=0.05,
                   direction="forward",
                   criterion="r-adj",
                   correction="FDR")

summary(model)
## 
## Call:
## lm(formula = reg, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1570  -4.7630   0.6209   4.3572  24.3224 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -11.4864     2.0068  -5.724 1.59e-08 ***
## imdb_rating              9.5191     0.3497  27.220  < 2e-16 ***
## audience_ratingUpright  20.8473     0.7674  27.166  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.952 on 648 degrees of freedom
## Multiple R-squared:  0.8822, Adjusted R-squared:  0.8818 
## F-statistic:  2426 on 2 and 648 DF,  p-value: < 2.2e-16

Thus, we obtain the final model as -

model_first <- lm( audience_score   ~  imdb_rating +  audience_rating, data = movies)
round(coef(model_first), digits = 3)
##            (Intercept)            imdb_rating audience_ratingUpright 
##                -11.486                  9.519                 20.847

95% Confidence intervals for the coefficients

confint(model_first, level = 0.95)
##                            2.5 %    97.5 %
## (Intercept)            -15.42705 -7.545717
## imdb_rating              8.83236 10.205747
## audience_ratingUpright  19.34045 22.354207

We observe that

  1. R^2 and adjusted R^2 are close, the model seems to be well-fitted

  2. We can also perform a hypothesis test to know if the predictors are associated with the response variable.

Our null hypothesis i.e. H0: the slopes associated with all the repsonse variables are 0

the alternate hypothesis HA: there are associated and hence atleast one of the slope is non-zero.

This result is provided by the F-statistic value, which is much greater than 1 signifying that atleast one of the response variable is significant and is related to our response variable

Residual Analysis

fit1Augment <- augment(model_first) %>% mutate(row_num = 1:n())
# Plot of residuals against fitted values (non-constant variance and non-linearity)
ggplot(fit1Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).

 ggplot(fit1Augment, aes(x = imdb_rating, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nimdb_rating is equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Boxplot of residuals (outlying observations)
ggplot(fit1Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nFew outliers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
ggplot(fit1Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

The standardised residuals in the graphs follow rstudent function. The graphs indicate that all the typical assumptions of the linear regression model hold true. Hence the model can be accepted as adequate

Finally, we check for multicollinearity using the VIF function. If the value of VIF > 10 then we can say that multicollineraity exists. However, here the values are less than 10, there is no multicollinearity issue.

vif(model_first)
##     imdb_rating audience_rating 
##        1.935264        1.935264

Part 5: Prediction

I would like to predict the IMDB rating for the documentary movie- ‘City of Gold’.

In this richly penetrating documentary odyssey, Pulitzer Prize-winning food critic Jonathan Gold shows us a Los Angeles where ethnic cooking is a kaleidoscopic portal to the mysteries of an unwieldy city and the soul of America. As piping-hot platters are served up, so are stories of immigrants whose secret family recipes are like sacred offerings pledged for the opportunity to build their American Dream. With eternal curiosity, razor-sharp intellect, and existential longing, Gold is a culinary geographer taking us where no critic has gone before.

It was released in 2016 and the runtime of the movie is 91 minutes. It was directed and written by Laura Gabbert and was in theaters on Mar 11,2016. The audience_score is 81, which we will try to observe in our prediction.

new_movie <- data.frame( imdb_rating = 7.1 , audience_rating = "Upright")
predict(model_first, new_movie, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 76.94623 63.27664 90.61581

I have used the prediction interval of 95%. Also, the resulted prediction value is very close to the true audience_score i.e. 81. Thus, we can say that the model is pretty good in that case.

Part 6: Conclusion

Thus, we observe that the audience score is associated with the IMDB rating and audience rating.

This open-ended question helped me to look data from various angles.

In future, I would like to treat the outliers and leverage points, and verify the results obtained thereafter. I would also look into the scope for polynomial terms to see if we obtain any better result.

However as mentioned already, not many people actually rate the movies in IMDB and rotten tomatoes, so it is quite biased in that manner.