In real world, a common issue in multivariate dataset is that some confounding variable(s) can distort the relationship between other pair of variables. Adjustment studies the impact of such third variable on the relationship by adding it as a regressor into a linear model.
In this blog, we use a dataset about movie Box sourced from Kaggle as our study object. Through the below analysis, we will check whether some features are confounding variables or not.
To make the analysis straightforward and relevant, we first picked up a subset of the features from the dataset and applied some transformation on it. Here are the final features in our dataset.
movie_df <- read.csv("train.csv")
mv_thin <- movie_df %>%
filter(budget > 0 & (str_detect(genres, "Action")|str_detect(genres, "Romance")) & original_language == 'en') %>%
select(c(title, budget, genres, popularity, revenue)) %>%
arrange(desc(revenue))
genres_fun <- function(x) {
if(str_detect(x, 'Romance')) {
return ('Romance')
}
else {
return ('Action')
}
}
mv_thin['genres_cl'] <- apply(mv_thin['genres'], 1, genres_fun)
mv_thin['rev_log'] <- log(mv_thin['revenue'])
title - the English Title of the movie
budget - the budget of each movie (in our case, we exclude records without budget)
genres - the genres of each movie (in our case, we only select two values: Action and Romance)
popularity - a popularity score based on customer review
revenue - the revenue of each movie in US dollars
#, layout="l-body-outset"}
kable(head(mv_thin))
| title | budget | genres | popularity | revenue | genres_cl | rev_log |
|---|---|---|---|---|---|---|
| The Avengers | 2.20e+08 | [{‘id’: 878, ‘name’: ‘Science Fiction’}, {‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}] | 89.887648 | 1519557910 | Action | 21.14169 |
| Furious 7 | 1.90e+08 | [{‘id’: 28, ‘name’: ‘Action’}] | 27.275687 | 1506249360 | Action | 21.13289 |
| Avengers: Age of Ultron | 2.80e+08 | [{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}, {‘id’: 878, ‘name’: ‘Science Fiction’}] | 37.379420 | 1405403694 | Action | 21.06359 |
| Beauty and the Beast | 1.60e+08 | [{‘id’: 10751, ‘name’: ‘Family’}, {‘id’: 14, ‘name’: ‘Fantasy’}, {‘id’: 10749, ‘name’: ‘Romance’}] | 287.253654 | 1262886337 | Romance | 20.95667 |
| Transformers: Dark of the Moon | 1.95e+08 | [{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 878, ‘name’: ‘Science Fiction’}, {‘id’: 12, ‘name’: ‘Adventure’}] | 4.503505 | 1123746996 | Action | 20.83993 |
| The Dark Knight Rises | 2.50e+08 | [{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 80, ‘name’: ‘Crime’}, {‘id’: 18, ‘name’: ‘Drama’}, {‘id’: 53, ‘name’: ‘Thriller’}] | 20.582580 | 1084939099 | Action | 20.80479 |
genres_average <- mv_thin %>% group_by(genres_cl) %>% summarise(n=n(), avg_revenue=mean(revenue), .groups = 'drop')
kable(genres_average)
| genres_cl | n | avg_revenue |
|---|---|---|
| Action | 500 | 148372491 |
| Romance | 356 | 76749889 |
action_mean_rev = genres_average[1,3]$avg_revenue
romance_mean_rev = genres_average[2,3]$avg_revenue
lm1 <- lm(revenue~budget, data=mv_thin)
coef <- lm1$coefficients
intercept <- coef[1]
slope <- coef[2]
g <- ggplot(mv_thin, aes(budget, revenue, colour = genres_cl)) +
geom_smooth(method='lm') +
geom_point() +
geom_hline(yintercept = action_mean_rev[1], linetype="dashed", colour='blue') +
geom_hline(yintercept = romance_mean_rev[1], linetype="dashed", colour='red')
# +geom_abline(slope=slope, intercept=intercept)
Below is the plot showing data distribution and the linear model cross different two genres (Action and Romance). In addition, the blue dashed line demonstrates the average revenue for Action movies and the red dashed line demonstrates the same metric for Romance movies.
fit1 <- lm(revenue~genres_cl, data = mv_thin)
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148372491 8485760 17.484880 9.500469e-59
## genres_clRomance -71622601 13158386 -5.443115 6.844826e-08
We can see that the coefficient for Genres (Romance) is -71622601 when we only consider Genres as regressor. Suppose now we would like to analyze if Budget is a confounding variable, i.e., whether there will be more than 10% changes on the coefficient of Genres when Budget is introduced as another regressor in the model.
fit2 <- lm(revenue~genres_cl + budget, data = mv_thin)
summary(fit2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.081192e+07 7.711132e+06 -2.698945 7.093529e-03
## genres_clRomance 2.366569e+07 9.285381e+06 2.548704 1.098653e-02
## budget 3.057850e+00 9.424847e-02 32.444555 4.745204e-151
pct_change <- ((summary(fit2)$coef[2,1]) - (summary(fit1)$coef[2,1]))/(summary(fit1)$coef[2,1])
print(pct_change[1])
## [1] -1.330422
Now we can see that the coefficient for Genres (Romance) changes to 23665690, indicating a change in estimate of 133%. Using the informal rule (i.e., a change in the coefficient in either direction by 10% or more), we meet the criteria for confounding. Thus, part of the association between Movie Genres and Revenue is explained by Budget.