Adjustment and Confounding Analysis on Movie Data

Analysis

To make the analysis straightforward and relevant, we first picked up a subset of the features from the dataset and applied some transformation on it. Here are the final features in our dataset.

movie_df <- read.csv("train.csv")

mv_thin <- movie_df %>%
  filter(budget > 0 & (str_detect(genres, "Action")|str_detect(genres, "Romance")) & original_language == 'en') %>%
  select(c(title, budget, genres, popularity, revenue)) %>%
  arrange(desc(revenue))

genres_fun <- function(x) {
  if(str_detect(x, 'Romance')) {
    return ('Romance')
  }
  else {
    return ('Action')
  }
}
mv_thin['genres_cl'] <- apply(mv_thin['genres'], 1, genres_fun)
mv_thin['rev_log'] <- log(mv_thin['revenue'])

title - the English Title of the movie
budget - the budget of each movie (in our case, we exclude records without budget)
genres - the genres of each movie (in our case, we only select two values: Action and Romance)
popularity - a popularity score based on customer review
revenue - the revenue of each movie in US dollars

#, layout="l-body-outset"}
kable(head(mv_thin))

title	budget	genres	popularity	revenue	genres_cl	rev_log
The Avengers	2.20e+08	[{‘id’: 878, ‘name’: ‘Science Fiction’}, {‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}]	89.887648	1519557910	Action	21.14169
Furious 7	1.90e+08	[{‘id’: 28, ‘name’: ‘Action’}]	27.275687	1506249360	Action	21.13289
Avengers: Age of Ultron	2.80e+08	[{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 12, ‘name’: ‘Adventure’}, {‘id’: 878, ‘name’: ‘Science Fiction’}]	37.379420	1405403694	Action	21.06359
Beauty and the Beast	1.60e+08	[{‘id’: 10751, ‘name’: ‘Family’}, {‘id’: 14, ‘name’: ‘Fantasy’}, {‘id’: 10749, ‘name’: ‘Romance’}]	287.253654	1262886337	Romance	20.95667
Transformers: Dark of the Moon	1.95e+08	[{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 878, ‘name’: ‘Science Fiction’}, {‘id’: 12, ‘name’: ‘Adventure’}]	4.503505	1123746996	Action	20.83993
The Dark Knight Rises	2.50e+08	[{‘id’: 28, ‘name’: ‘Action’}, {‘id’: 80, ‘name’: ‘Crime’}, {‘id’: 18, ‘name’: ‘Drama’}, {‘id’: 53, ‘name’: ‘Thriller’}]	20.582580	1084939099	Action	20.80479

genres_average <- mv_thin %>% group_by(genres_cl) %>% summarise(n=n(), avg_revenue=mean(revenue), .groups = 'drop')

kable(genres_average)

genres_cl	n	avg_revenue
Action	500	148372491
Romance	356	76749889

action_mean_rev = genres_average[1,3]$avg_revenue
romance_mean_rev = genres_average[2,3]$avg_revenue

lm1 <- lm(revenue~budget, data=mv_thin)
coef <- lm1$coefficients
intercept <- coef[1]
slope <- coef[2]
g <- ggplot(mv_thin, aes(budget, revenue, colour = genres_cl)) + 
  geom_smooth(method='lm') +
  geom_point() +
  geom_hline(yintercept = action_mean_rev[1], linetype="dashed", colour='blue') + 
  geom_hline(yintercept = romance_mean_rev[1], linetype="dashed", colour='red') 
# +geom_abline(slope=slope, intercept=intercept)

Below is the plot showing data distribution and the linear model cross different two genres (Action and Romance). In addition, the blue dashed line demonstrates the average revenue for Action movies and the red dashed line demonstrates the same metric for Romance movies.

fit1 <- lm(revenue~genres_cl, data = mv_thin)
summary(fit1)$coef

##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      148372491    8485760 17.484880 9.500469e-59
## genres_clRomance -71622601   13158386 -5.443115 6.844826e-08

We can see that the coefficient for Genres (Romance) is -71622601 when we only consider Genres as regressor. Suppose now we would like to analyze if Budget is a confounding variable, i.e., whether there will be more than 10% changes on the coefficient of Genres when Budget is introduced as another regressor in the model.

fit2 <- lm(revenue~genres_cl + budget, data = mv_thin)
summary(fit2)$coef

##                       Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)      -2.081192e+07 7.711132e+06 -2.698945  7.093529e-03
## genres_clRomance  2.366569e+07 9.285381e+06  2.548704  1.098653e-02
## budget            3.057850e+00 9.424847e-02 32.444555 4.745204e-151

pct_change <- ((summary(fit2)$coef[2,1]) - (summary(fit1)$coef[2,1]))/(summary(fit1)$coef[2,1])
print(pct_change[1])

## [1] -1.330422

Now we can see that the coefficient for Genres (Romance) changes to 23665690, indicating a change in estimate of 133%. Using the informal rule (i.e., a change in the coefficient in either direction by 10% or more), we meet the criteria for confounding. Thus, part of the association between Movie Genres and Revenue is explained by Budget.

Adjustment and Confounding Analysis on Movie Data

Sai Wang

10/11/2020

Introduction

Analysis