Introduction

In real world, a common issue in multivariate dataset is that some confounding variable(s) can distort the relationship between other pair of variables. Adjustment studies the impact of such third variable on the relationship by adding it as a regressor into a linear model.

In this blog, we use a dataset about movie Box sourced from Kaggle as our study object. Through the below analysis, we will check whether some features are confounding variables or not.

Analysis

To make the analysis straightforward and relevant, we first picked up a subset of the features from the dataset and applied some transformation on it. Here are the final features in our dataset.

movie_df <- read.csv("train.csv")

mv_thin <- movie_df %>%
  filter(budget > 0 & (str_detect(genres, "Action")|str_detect(genres, "Romance")) & original_language == 'en') %>%
  select(c(title, budget, genres, popularity, revenue)) %>%
  arrange(desc(revenue))

genres_fun <- function(x) {
  if(str_detect(x, 'Romance')) {
    return ('Romance')
  }
  else {
    return ('Action')
  }
}
mv_thin['genres_cl'] <- apply(mv_thin['genres'], 1, genres_fun)
mv_thin['rev_log'] <- log(mv_thin['revenue'])
#, layout="l-body-outset"}
kable(head(mv_thin))
title budget genres popularity revenue genres_cl rev_log
The Avengers 2.20e+08 [{‘id’: 878, ‘name’: ‘Science Fiction’}, {‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}] 89.887648 1519557910 Action 21.14169
Furious 7 1.90e+08 [{‘id’: 28, ‘name’: ‘Action’}] 27.275687 1506249360 Action 21.13289
Avengers: Age of Ultron 2.80e+08 [{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}, {‘id’: 878, ‘name’: ‘Science Fiction’}] 37.379420 1405403694 Action 21.06359
Beauty and the Beast 1.60e+08 [{‘id’: 10751, ‘name’: ‘Family’}, {‘id’: 14, ‘name’: ‘Fantasy’}, {‘id’: 10749, ‘name’: ‘Romance’}] 287.253654 1262886337 Romance 20.95667
Transformers: Dark of the Moon 1.95e+08 [{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 878, ‘name’: ‘Science Fiction’}, {‘id’: 12, ‘name’: ‘Adventure’}] 4.503505 1123746996 Action 20.83993
The Dark Knight Rises 2.50e+08 [{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 80, ‘name’: ‘Crime’}, {‘id’: 18, ‘name’: ‘Drama’}, {‘id’: 53, ‘name’: ‘Thriller’}] 20.582580 1084939099 Action 20.80479
genres_average <- mv_thin %>% group_by(genres_cl) %>% summarise(n=n(), avg_revenue=mean(revenue), .groups = 'drop') 
kable(genres_average)
genres_cl n avg_revenue
Action 500 148372491
Romance 356 76749889
action_mean_rev = genres_average[1,3]$avg_revenue
romance_mean_rev = genres_average[2,3]$avg_revenue

lm1 <- lm(revenue~budget, data=mv_thin)
coef <- lm1$coefficients
intercept <- coef[1]
slope <- coef[2]
g <- ggplot(mv_thin, aes(budget, revenue, colour = genres_cl)) + 
  geom_smooth(method='lm') +
  geom_point() +
  geom_hline(yintercept = action_mean_rev[1], linetype="dashed", colour='blue') + 
  geom_hline(yintercept = romance_mean_rev[1], linetype="dashed", colour='red') 
# +geom_abline(slope=slope, intercept=intercept)

Below is the plot showing data distribution and the linear model cross different two genres (Action and Romance). In addition, the blue dashed line demonstrates the average revenue for Action movies and the red dashed line demonstrates the same metric for Romance movies.

fit1 <- lm(revenue~genres_cl, data = mv_thin)
summary(fit1)$coef
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      148372491    8485760 17.484880 9.500469e-59
## genres_clRomance -71622601   13158386 -5.443115 6.844826e-08

We can see that the coefficient for Genres (Romance) is -71622601 when we only consider Genres as regressor. Suppose now we would like to analyze if Budget is a confounding variable, i.e., whether there will be more than 10% changes on the coefficient of Genres when Budget is introduced as another regressor in the model.

fit2 <- lm(revenue~genres_cl + budget, data = mv_thin)
summary(fit2)$coef
##                       Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)      -2.081192e+07 7.711132e+06 -2.698945  7.093529e-03
## genres_clRomance  2.366569e+07 9.285381e+06  2.548704  1.098653e-02
## budget            3.057850e+00 9.424847e-02 32.444555 4.745204e-151
pct_change <- ((summary(fit2)$coef[2,1]) - (summary(fit1)$coef[2,1]))/(summary(fit1)$coef[2,1])
print(pct_change[1])
## [1] -1.330422

Now we can see that the coefficient for Genres (Romance) changes to 23665690, indicating a change in estimate of 133%. Using the informal rule (i.e., a change in the coefficient in either direction by 10% or more), we meet the criteria for confounding. Thus, part of the association between Movie Genres and Revenue is explained by Budget.