Background and Motivation

Objectives

Looking for connection between passing the Bechdel Test and box office performance.

Methodology

Data collection

Variable Details

The columns are as follows-

  1. index -serial no of observations

  2. year - releasing year of the movie

  3. imdb - IMDb code for the movie

  4. title- name of the movie

  5. imdbRating - rating of the movie on IMDb (on a scale of 1 to 10)

  6. age.certification - Us certification of the movie on IMDb( whether G or PG or PG-13 or R or Unrated)

  7. Runtime.mins- length of the movie in minutes

  8. genre- genre of the movie (subcategory- Action ,Thriller ,Crime, Fantsay ,Animation, Horror ,Drama, Romance ,Biography etc)

  9. imdb.votes- no. of votes the movie received on IMDb

  10. test- whether or not the movie passed the Bechdel Test

  11. clean_test- whether or not the movie passed the Bechdel Test without any caveats

  12. binary- whether or not test and clean_test passed together

  13. budget- budget of the movie in millions

  14. domgross- domestic gross of the movie in millions

  15. intgross- international gross of the movie in millions

  16. code- release year along with whether binary is passed or failed

  17. budget_2013. - budget of the movie in 2013 dollars

  18. domgross_2013. - domestic gross of the movie in 2013 dollars

  19. intgross_2013. - international gross of the movie in 2013 dollars

  20. period.code - period code of the movie

  21. decade.code - decade code of the movie

Data preparation

Getting the required packages:

library(pacman,quietly=T)
pacman::p_load(dplyr,ggpubr,plyr,ggplot2,MASS,ggcorrplot,tidytable,Matrix,glmnet,car,lmtest,sandwich)

Importing and cleaning the dataset:

data=read.csv("C:\\Users\\zeeda\\OneDrive\\Desktop\\movies.csv",header=T)
data=data[,-c(20,21)] #-- removing period.code and decade.code
data=na.omit(data) #--removing the observations corresponding to NA values
data$imdb.votes[which(data$imdb.votes=="None")]=c(0,0) #--treating none voting as 0 votes  
data$imdb.votes=as.numeric(data$imdb.votes)
data$income=data$domgross+data$intgross #--definiing total income of the movie
data$profit=data$income-data$budget #--profit of movie
Summary of the data
summary(data)
     index             year          imdb              title          
 Min.   :   0.0   Min.   :1970   Length:1774        Length:1774       
 1st Qu.: 447.2   1st Qu.:1998   Class :character   Class :character  
 Median : 903.5   Median :2005   Mode  :character   Mode  :character  
 Mean   : 898.4   Mean   :2003                                        
 3rd Qu.:1346.8   3rd Qu.:2009                                        
 Max.   :1793.0   Max.   :2013                                        
   imdbRating    age.certification   Runtime.mins      genre          
 Min.   :2.300   Length:1774        Min.   :  2.0   Length:1774       
 1st Qu.:6.200   Class :character   1st Qu.:131.0   Class :character  
 Median :6.800   Mode  :character   Median :142.0   Mode  :character  
 Mean   :6.745                      Mean   :139.1                     
 3rd Qu.:7.400                      3rd Qu.:154.0                     
 Max.   :9.300                      Max.   :431.0                     
   imdb.votes          test            clean_test           binary         
 Min.   :      0   Length:1774        Length:1774        Length:1774       
 1st Qu.:  60182   Class :character   Class :character   Class :character  
 Median : 134244   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 239791                                                           
 3rd Qu.: 291757                                                           
 Max.   :2886071                                                           
     budget             domgross            intgross             code          
 Min.   :     7000   Min.   :      828   Min.   :8.280e+02   Length:1774       
 1st Qu.: 12000000   1st Qu.: 16314670   1st Qu.:2.634e+07   Class :character  
 Median : 29500000   Median : 42147342   Median :7.684e+07   Mode  :character  
 Mean   : 45090050   Mean   : 69141289   Mean   :1.509e+08                     
 3rd Qu.: 60000000   3rd Qu.: 93278718   3rd Qu.:1.903e+08                     
 Max.   :425000000   Max.   :760507625   Max.   :2.784e+09                     
  budget_2013.       domgross_2013.      intgross_2013.     
 Min.   :     8632   Min.   :8.990e+02   Min.   :8.990e+02  
 1st Qu.: 16234217   1st Qu.:2.050e+07   1st Qu.:3.371e+07  
 Median : 37143282   Median :5.599e+07   Median :9.679e+07  
 Mean   : 55802234   Mean   :9.516e+07   Mean   :1.986e+08  
 3rd Qu.: 78885934   3rd Qu.:1.216e+08   3rd Qu.:2.418e+08  
 Max.   :461435929   Max.   :1.772e+09   Max.   :3.172e+09  
     income              profit          
 Min.   :1.656e+03   Min.   : -89057484  
 1st Qu.:4.259e+07   1st Qu.:  20520157  
 Median :1.201e+08   Median :  85670865  
 Mean   :2.201e+08   Mean   : 174962520  
 3rd Qu.:2.846e+08   3rd Qu.: 222381997  
 Max.   :3.544e+09   Max.   :3119426607  

Exploratory Data analysis

par(mfrow=c(2,2))
plot(data$imdbRating,ylim=c(0,10),col=ifelse(data$imdbRating>=6,ifelse(data$imdbRating>=8,"black","blue"),"red"),ylab="IMDb Ratings",xlab="Serial no. of movies",main="IMDb Movie Ratings") 
abline(h=6,col="red") 
abline(h=8,col="black")
legend(x=1200,y=2,legend=c("Below Average","Decent","Topclass"),fill=c("red","blue","black"))
data$age.certification=as.factor(data$age.certification) #--converting in factor variable
lab=names(table(data$age.certification)) 
pct=round((table(data$age.certification)/sum(table(data$age.certification)))*100,1) 
pie(labels=paste(lab,pct,"%",sep=":"),table(data$age.certification),main="Pie Chart of Age Certification of Movies")
plot(data$Runtime.mins,ylab="Runtime",xlab="serial no. of movies",main="Runtime of Movies(in mins)",col=ifelse(data$Runtime.mins<=45,"dark green"," red"),pch=3)
legend(x=0,y=450,c("Feature flim","Short flim"),fill=c("red","dark green"))
data$genre=as.factor(data$genre)
levels(data$genre)=c("Action","Adventure","Animation","Biography","Comedy","Crime","Documentary","Drama", "Fantasy","Horror","Mystery","Sci-Fi","Crime")
barplot(horiz=T,sort(table(data$genre)),col=rainbow(12),main="Barplot of Genre",xlab="Frequency") 
pct=round((table(data$genre)/nrow(data))*100,1) 
labs=(names(table(data$genre))) 
legend("bottomright",paste(labs,pct,"%",sep=":"),fill=rainbow(12))

Insights from the above plots
  • The variable imdb.Rating ranges between 0-10 using a plot we could observe how the movies here are being rated.

  • Here most of the movies(82%) have DECENT rating.

  • 6 of them are OSCAR worthy let’s find out their names

data$title[which(data$imdbRating>=9)]
[1] "The Dark Knight"                            
[2] "The Lord of theRings: TheReturn of the King"
[3] "The ShawshankRedemption"                    
[4] "Schindler&#39;s List"                       
[5] "TheGodfather: Part II"                      
[6] "TheGodfather"                               
  • The variable age.certification is a categorical variable having levels “G”,“PG”,“PG-13”,“R” and “Unrated”

  • Most of the movies has PG-13 certificate here followed by R certificated movies.

  • The variable Runtime.mins shows the range of the movies in minutes(older has higher length)

  • The variable genre is categorical variable . Here is the barplot to observe the types of genre with the percentage from different genre

Here clean_test and test are categorical variables and based on their levels corresponding to a movie we decide whether it will PASS or FAIL the Bechdel test. Lets understand it by a plot

data$test=as.factor(data$test) 
data$clean_test=as.factor(data$clean_test) 
data$binary=as.factor(data$binary)
ggplot(data,aes(y=test,x=clean_test,shape=binary))+ geom_point(aes(colour=binary),size=5)+geom_line(aes(group=binary),lty=2)+ ggtitle("Bechdel Test")+ xlab(label="Bechdel test without any caveats")+ ylab(label="Bechdel test status")

Insights from the plot
test clean_test Binary
Dubious Dubious & Dubious-disagree FAIL
Men Men &Men-disagree FAIL
Nowomen Nowomen & Nowomen-disagree FAIL
Ok Ok & Ok-disagree PASS

Histogram

The variable year, imdbRating, Runtime.mins, imdb.votes, budget, income gives the releasing year of each movie let us visualize from a histogram how many movies we have in the dataset over different year.

par(mfrow=c(2,3)) 
for(i in c(2,5,7,9,13,20)){ 
  hist(data[,i],col="light blue",xlab=colnames(data)[i],       
       main=paste("Histogram of",colnames(data)[i])) }

  • Since most of the variable do not have normal curve we will check for kendall’s Tau correlation coefficient.

Correlation plot

cor.data=cor(data[,c(5,7,9,13,20)],method="spearman")
ggcorrplot(cor.data,method="square",type="lower",lab=TRUE)

Insight from the plot
  • budget have good positive correlation Income .

  • imdb votes and income also have good poisitive correlation.

  • imdb votes and imdb rating have good poisitive correlation.

Rest of the correlation are very small can be thought of uncorrelated.

Key Findings

1. Passed and Failed movies mostly belong to which genre?

tapply(data,data$binary,function(x){return(names(which.max(table(x$genre))))}) 
    FAIL     PASS 
"Action" "Comedy" 

2. Passed and Failed movies mostly belong to which age certification?

tapply(data,data$binary,function(x){return(names(which.max(table(x$age.certification))))}) 
   FAIL    PASS 
"PG-13" "PG-13" 

3. Does passed movies have on an higher average profit?

obj_profit=daply(data,.(data$year,data$binary),function(x){return(mean(x$profit))}) 
dec_1970=colMeans(obj_profit[1:10,][-c(1,2,6),]) 
dec_1980=colMeans(obj_profit[11:20,])  
dec_1990=colMeans(obj_profit[21:30,])  
dec_2000=colMeans(obj_profit[31:44,]) 
obj_main_profit=rbind(dec_1970,dec_1980,dec_1990,dec_2000) 
barplot(t(obj_main_profit),beside=T,col=rainbow(2),xlab="Decade",ylab="Profit") 
legend("topright",c("FAIL","PASS"),fill=rainbow(2))

  • clearly failed movies has higher average profit in all decades but we cannot say that a movie is having less profit just because of pass or fail as there are effects of the other variables.

3. Does older flims mostly failed the Bechdel Test?

Approach
  1. Obtained the no. of movies that failed and passed over different years.

  2. Subgroup them in different decades (1971’s,1980’s,1990’s and 2000’s ) and add up the count of failed and passed movies and scale them down to 1.

  3. Create a Barplot using the data.

object=daply(data,.(data$year),function(x){return(table(x$binary))}) 
decade_1970=colSums(object[1:10,]) 
decade_1980=colSums(object[11:20,]) 
decade_1990=colSums(object[21:30,]) 
decade_2000=colSums(object[31:44,]) 
object12=rbind(decade_1970,decade_1980,decade_1990,decade_2000) 
new_obj=object12*(1/rowSums(object12)) 
barplot(t(new_obj),col=rainbow(7),ylim=c(0,1.2)) 
legend("topleft",c("FAIL","PASS"),fill=rainbow(7))

  • Based on this particular dataset , we can conclude that OLDER FLIMS FAILED THE BECHDEL TEST OFTEN THAN THE NEW ONE’S.

4. Does movies passing Bechdel Test has on higher ratings and higher income?

par(mfrow=c(1,2))
boxplot(data$imdbRating~data$binary,col="light pink",xlab="Bechdel Test pass/fail",ylab="IMDb Rating",main="IMDb rating Vs Bechdel Test pass/fail")
boxplot(data$income~data$binary,col="light pink",xlab="Bechdel Test pass/fail",ylab="Income",main="Income Vs Bechdel Test pass/fail") 

  • Clearly the median is less for the movies passing the test in both the cases, so we can conclude MOVIES PASSING THE BECHDEL TEST TEND TO HAVE LOWER RATINGS AND INCOME.

  • Is budget a confounding factor incase of income?

To answer that we will first verify if budget for passed and failed flim is different using boxplot.

boxplot(data$budget~data$binary,col="light pink",xlab=" Bechdel Test pass/fail",         
    ylab="Budget",main="Boxplot of Income Vs Bechdel Test pass/fail")

  • There is a difference in budget in particular failed flim has higher budget.

  • From the correlation plot , we can observe budget seems to have more positive correlation Income than with IMDb rating . So incase of income ,budget can be a confounding factor i.e. since the failed movie has higher budget so it has higher income.

5. Does the imdb ratings and income remain same over different genre?

par(mfrow=c(1,2))
plot(data$imdbRating~data$genre,col="light blue",xlab="Genre",ylab="IMDb Rating",main="IMDb rating vs Genre")
plot(data$income~data$genre,col="light blue",xlab="Genre",ylab="Income",main="Income vs Genre")

  • Clearly the median IMDb rating varies over different genre.
  • The income varies significantly over two genres Animation and Sci-Fi from most of the genre.

6. Does the imdb ratings and income remain same over different certification?

par(mfrow=c(1,2))
plot(data$imdbRating~data$age.certification,col="light green",xlab="age certification",ylab="IMDb Rating",main="IMDb rating vs age certification ")
plot(data$income~data$age.certification,col="light green",xlab="age certification",ylab="Income",main="Income vs age certification")

  • Clearly the median IMDb rating varies over different age certification but not as much as it varies with different age certification.

  • The income varies significantly over two age certification G and PG from most of the age certification.

7. Does the average IMDb rating and income change over different decades(1971’s,1980’s,1990’s and 2000’s ) for passed and failed (in Bechdel Test) movies ?

Approach
  1. Subset the data over different decade .

  2. split the data based on Bechdel pass/fail .

  3. Calculated mean IMDb rating and Income seperately.

  4. plot two different curves of mean IMDb rating and Income over different decades using two plots.

par(mfrow=c(1,2))
decade2_1970=subset(data,data$year %in% 1970:1979) 
p_1=tapply(decade2_1970,decade2_1970$binary,function(x){return(mean(x$imdbRating))})
decade2_1980=subset(data,data$year %in% 1980:1989)
p_2=tapply(decade2_1980,decade2_1980$binary,function(x){return(mean(x$imdbRating))}) 
decade2_1990=subset(data,data$year %in% 1990:1999)
p_3=tapply(decade2_1990,decade2_1990$binary,function(x){return(mean(x$imdbRating))}) 
decade2_2000=subset(data,data$year %in% 2000:2013)
p_4=tapply(decade2_2000,decade2_2000$binary,function(x){return(mean(x$imdbRating))}) 
object2=rbind(p_1,p_2,p_3,p_4) 
plot(object2[,1],type="b",ylim=c(5,8),col="red",lwd=2,xaxt="n",xlab="Decade",ylab="mean IMDb Rating")
par(new=T) 
plot(object2[,2],type="b",ylim=c(5,8),col="blue",xlab="",ylab="",yaxt="n",xaxt="n",lwd=2)
axis(1,at=1:4,labels=c("70's","80's","90's","2000's")) 
legend("bottomleft",c("FAIL","PASS"),fill=c("red","blue")) 
object3=ddply(data,.(data$year),function(x){return(tapply(x,x$binary,function(y){return(mean(y$income))}))})
index=c(which(object3$FAIL=="NaN"),which(object3$PASS=="NaN")) 
object3=object3[-index,] 
decade3_1970=colMeans(object3[1:7,c(2,3)]) 
decade3_1980=colMeans(object3[8:17,c(2,3)]) 
decade3_1990=colMeans(object3[18:27,c(2,3)])
decade3_2000=colMeans(object3[28:41,c(2,3)]) 
object32=rbind(decade3_1970,decade3_1980,decade3_1990,decade3_2000) 
plot(object32[,1],type="b",col="red",lwd=2,xaxt="n",xlab="Decade",ylab="mean Income")
par(new=T)
plot(object32[,2],type="b",col="blue",yaxt="n",xaxt="n",lwd=2,xlab="",ylab="") 
axis(1,at=1:4,labels=c("70's","80's","90's","2000's")) 
legend("bottomright",c("FAIL","PASS"),fill=c("red","blue"))

  • We can observe Mostly the failed movies performed in terms of mean IMDb rating and mean income better.

  • The difference has become less over the latter decades specially in 90s where the mean income of passed movies crossed the failed one.

Conformatory analysis

Comparing Passed and Failed movies(in Bechdel Test)

1. Does the passed and failed movies have same mean and variance in IMDb rating?

we will conduct two sample F test where \(\sigma_{1}^{2}\) is the variance of IMDb rating of passed movies and \(\sigma_{2}^{2}\) is the variance of IMDb rating of failed movies. \[ H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2} \hspace{1cm} Against \hspace{1cm}H_{1}:\sigma_{1}^{2}\not=\sigma_{2}^{2} \]

test_pass=subset(data,data$binary=="PASS") 
test_fail=subset(data,data$binary=="FAIL")
var.test(test_pass$imdbRating,test_fail$imdbRating,alternative="two.sided")

    F test to compare two variances

data:  test_pass$imdbRating and test_fail$imdbRating
F = 0.97999, num df = 792, denom df = 980, p-value = 0.7668
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8587585 1.1193928
sample estimates:
ratio of variances 
         0.9799902 
  • Since the P-value is greater than 0.05, in the light of the given sample we have equal variance .

Now we will look for equality of mean and hence conduct two sample t test where \(\mu_{1}\)is the mean of IMDb rating of passed movies and \(\mu_{2}\)is the mean of IMDb rating of failed movies. \[H_{0}:\mu_{1}=\mu_{2}\hspace{1cm}Against\hspace{1cm}H_{1}:\mu_{1}\ne\mu_{2}\]

t.test(test_pass$imdbRating,test_fail$imdbRating,alternative="two.sided",var.equal=TRUE)

    Two Sample t-test

data:  test_pass$imdbRating and test_fail$imdbRating
t = -5.3677, df = 1772, p-value = 9.026e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3129715 -0.1454636
sample estimates:
mean of x mean of y 
 6.618285  6.847503 
  • Since the P-value is less than 0.05, in the light of the given sample the mean IMDb rating for passed is significantly different than failed movies.

2. Does the passed and failed movies have same mean and variance in income?

we will conduct two sample F test where \(\sigma_{1}^{2}\)is the variance of income of passed movies and \(\sigma_{2}^{2}\)is the variance of income of failed movies. \[ H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2} \hspace{1cm} Against \hspace{1cm}H_{1}:\sigma_{1}^{2}\not=\sigma_{2}^{2} \]

test_pass=subset(data,data$binary=="PASS") 
test_fail=subset(data,data$binary=="FAIL")
var.test(test_pass$income,test_fail$income,alternative="two.sided")

    F test to compare two variances

data:  test_pass$income and test_fail$income
F = 0.80707, num df = 792, denom df = 980, p-value = 0.001612
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7072326 0.9218786
sample estimates:
ratio of variances 
         0.8070733 
  • Since the P-value is less than 0.05, in the light of the given sample we have equal variance

Now we will look for equality of mean and hence conduct welch t testwhere \(\mu_{1}\) is the mean of income of passed movies and \(\mu_{2}\) is the mean of income of failed movies. \[H_{0}:\mu_{1}=\mu_{2}\hspace{1cm}Against\hspace{1cm}H_{1}:\mu_{1}\ne\mu_{2}\]

t.test(test_pass$income,test_fail$income,alternative="two.sided",var.equal=FALSE)

    Welch Two Sample t-test

data:  test_pass$income and test_fail$income
t = -3.3143, df = 1752.3, p-value = 0.0009376
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -71437389 -18320406
sample estimates:
mean of x mean of y 
195235097 240113994 
  • Since the P-value is less than 0.05, in the light of the given sample the mean income for passed is significantly different than failed movies.

3. Is budget for Passed and Failed flim is different?

.we will conduct two sample F test where \(\sigma_{1}^{2}\) is the variance of budget of passed movies and \(\sigma_{2}^{2}\) is the variance of budget of failed movies. \[ H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2} \hspace{1cm} Against \hspace{1cm}H_{1}:\sigma_{1}^{2}\not=\sigma_{2}^{2} \]

var.test(test_pass$budget,test_fail$budget,alternative="two.sided")

    F test to compare two variances

data:  test_pass$budget and test_fail$budget
F = 0.69554, num df = 792, denom df = 980, p-value = 1.033e-07
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6094929 0.7944748
sample estimates:
ratio of variances 
         0.6955356 
  • Since the P-value is less than 0.05, in the light of the given sample we have unequal variance .

Now we will look for equality of mean and hence conduct welch t test where \(\mu_{1}\)is the mean of budget of passed movies and \(\mu_{2}\)is the mean of budget of failed movies. \[H_{0}:\mu_{1}=\mu_{2}\hspace{1cm}Against\hspace{1cm}H_{1}:\mu_{1}\ne\mu_{2}\]

t.test(test_pass$budget,test_fail$budget,alternative="two.sided",var.equal=F)

    Welch Two Sample t-test

data:  test_pass$budget and test_fail$budget
t = -5.5472, df = 1770.3, p-value = 3.341e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16826749  -8036075
sample estimates:
mean of x mean of y 
 38215633  50647046 
  • Since the P-value is greater than 0.05, in the light of the given sample the mean budget for passed flim is significantly different than failed movies.

4. Does higher budget mean higher Income for a movie?

cor.test(data$income,data$budget,alternative="greater",method="spearman") 
Warning in cor.test.default(data$income, data$budget, alternative = "greater",
: Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  data$income and data$budget
S = 285386793, p-value < 2.2e-16
alternative hypothesis: true rho is greater than 0
sample estimates:
      rho 
0.6932925 
  • Since the P-value is less than 0.05, in the light of the given sample higher budget mean higher Income for a movie.

5. Is there a relationship between income and binary keeping the effect of budget out?

We will construct a logistic regression model of Binary on budget and income and perform the coefficient test for income to observe whether it is significantly different than 0.

logistic_mod1=glm(data$binary~data$budget+data$income-1,data=data,family="binomial") 
logistic_mod11=glm(data$binary~data$budget-1,data=data,family="binomial") 
anova(logistic_mod11,logistic_mod1,test="Chisq")
Analysis of Deviance Table

Model 1: data$binary ~ data$budget - 1
Model 2: data$binary ~ data$budget + data$income - 1
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      1773     2409.3                     
2      1772     2408.8  1  0.49068   0.4836
  • Since the P-value is greater than 0.05, in the light of the given sample failed movies are having higher income because of higher budget i.e Budget is a confounding factor.

Model Building

Regression analysis on IMDB rating

Here is our chosen predictors for predicting IMDb ratings

  1. decade

  2. age.certification

  3. Runtime.mins

  4. genre

  5. imdb.votes

  6. binary

  7. budget

Since in EDA we have observed decade-wise change instead of directly using year so here we will convert year into decade-

  • year 1970-1979 is 70’s

  • year 1980-1989 is 80’s

  • year 1990-1999 is 90’s

  • rest is 2000’s

Also scaled the continuous variables to reduce interpretability issues

new_data=data[,c(5,6,7,8,9,12,13)]
new_data$decade=as.factor(ifelse(data$year<1980,"70",ifelse(data$year<1990,"80",ifelse(data$year<2000,"90","2000"))))
new_data[,c(3,5,7)]=scale(new_data[,c(3,5,7)])

Computing the model

basic_model=lm(new_data$imdbRating~.,data=new_data)
summary(basic_model)

Call:
lm(formula = new_data$imdbRating ~ ., data = new_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7542 -0.3382  0.0802  0.4401  2.1118 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               6.280432   0.100978  62.196  < 2e-16 ***
age.certificationPG       0.023358   0.096693   0.242 0.809145    
age.certificationPG-13    0.136685   0.098357   1.390 0.164801    
age.certificationR        0.283472   0.099610   2.846 0.004481 ** 
age.certificationUnrated  0.359381   0.136171   2.639 0.008384 ** 
Runtime.mins              0.002226   0.015988   0.139 0.889291    
genreAdventure            0.363034   0.073029   4.971 7.31e-07 ***
genreAnimation            0.688739   0.075305   9.146  < 2e-16 ***
genreBiography            0.780885   0.079109   9.871  < 2e-16 ***
genreComedy               0.212469   0.047818   4.443 9.42e-06 ***
genreCrime                0.400869   0.071771   5.585 2.70e-08 ***
genreDocumentary          1.193309   0.296021   4.031 5.79e-05 ***
genreDrama                0.531745   0.052306  10.166  < 2e-16 ***
genreFantasy              0.012731   0.139345   0.091 0.927214    
genreHorror              -0.400078   0.076305  -5.243 1.77e-07 ***
genreMystery              0.046804   0.293748   0.159 0.873425    
genreSci-Fi               0.256343   0.150107   1.708 0.087864 .  
imdb.votes                0.543541   0.017042  31.895  < 2e-16 ***
binaryPASS               -0.079281   0.032771  -2.419 0.015655 *  
budget                   -0.128967   0.020387  -6.326 3.19e-10 ***
decade70                  0.699166   0.094431   7.404 2.04e-13 ***
decade80                  0.383076   0.064760   5.915 3.98e-09 ***
decade90                  0.146013   0.040798   3.579 0.000354 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6506 on 1751 degrees of freedom
Multiple R-squared:  0.4853,    Adjusted R-squared:  0.4788 
F-statistic: 75.04 on 22 and 1751 DF,  p-value: < 2.2e-16

Removing the outliers and model rebuilding

First we will look for potential outliers that are influential
Process
  • use cooks distance to measure influential points

  • calculate high residuals using studentized residual

  • take the intersection of both influential points and high residual and remove them

  • create the model again for the revised data

cd=cooks.distance(basic_model) 
a=which(cd>(4/nrow(new_data)))
abs_res_vec=abs(rstudent(basic_model)) 
b=which(abs_res_vec>2)
del_obs=intersect(a,b)
del_obs #--outlier observation
 [1]   10   82   89  117  151  159  222  293  383  406  432  481  577  608  626
[16]  643  769  815  816  828  840  876  882  890  922  954 1041 1071 1190 1271
[31] 1275 1282 1317 1364 1387 1414 1457 1491 1496 1515 1519 1521 1524 1531 1545
[46] 1568 1623 1642 1649 1654 1661 1682 1709 1766 1768
new_data2=new_data[-del_obs,]
basic_model2=lm(new_data2$imdbRating~.,data=new_data2)
Checking model assumpion :

Heteroscadasticity

plot(residuals(basic_model2)~fitted(basic_model2),col="darkgreen",xlab="Fitted values",ylab="Residuals",main="Residual vs Fitted",ylim=c(-2,2))
abline(col="red",h=mean(resid(basic_model2))+3*sd(resid(basic_model2)))
abline(col="red",h=mean(resid(basic_model2))-3*sd(resid(basic_model2)))

Insight
  • There is a band around 0 line mostly which means residuals are not effected by fitted values.
  • But small amount of points outside the 3*Sigma limit of model residuals so slight amount of heteroscadasticity is present in the model.

We conduct Breusch Pagan test for formal checking

bptest(basic_model2,studentize=F)

    Breusch-Pagan test

data:  basic_model2
BP = 39.695, df = 22, p-value = 0.01173
  • p value less than 0.05, so the model has heteroscadasticity.

Attempting to remove Heteroscadasticity

Process
  • Estimate \(\hat{\sigma_{i}}\) by regressing its absolute value on the predictors and hence obtain the weights as \(\frac{1}{\hat{\sigma_{i}}}\) .

  • Obtaining weighted least square model .

  • Create the plot and Bp-test again

#--Obtaining weighted least square model 
sgm=lm((resid(basic_model2)^2)~.,data=new_data2[,-1])
basic_model3=lm(new_data2$imdbRating~.,data=new_data2,weights=1/sqrt(fitted(basic_model2)))
#--Create the plot again. 
plot(resid(basic_model3)~fitted(basic_model3),ylim=c(-2,2),xlab="Fitted values",ylab="Residuals",main="Residual vs Fitted")
par(new=T)
plot(residuals(basic_model2)~fitted(basic_model2),col="grey",ylim=c(-2,2),xlab="",ylab="",xaxt="n",yaxt="n")
abline(col="red",h=mean(resid(basic_model3))+3*sd(resid(basic_model3)))
abline(col="red",h=mean(resid(basic_model3))-3*sd(resid(basic_model3)))
legend("topright",legend = c("Old model","New Model"),fill=c("black","grey"))

#--Bp test
bptest(basic_model3,studentize=F)

    Breusch-Pagan test

data:  basic_model3
BP = 941.19, df = 22, p-value < 2.2e-16
  • Since P value is still less than 0.05, clearly we are not been able to remove this small amount of heteroscasticity.

Multicollinearity

We will use conditional number to judge it.

X=model.matrix(basic_model2) 
egn=eigen(t(X)%*%X) 
egn_max=max(egn$values) 
egn_min=min(egn$values) 
cn=sqrt(egn_max/egn_min) 
cn 
[1] 25.13592
  • Since it is less than 30 , We can conclude multicollinearity is not present here.

Autocorrelation

Plot of lagged residuals

lagged_residual=c("NA",resid(basic_model2)[-length(resid(basic_model2))]) 
plot(lagged_residual,resid(basic_model2),col="darkblue",xlab="Lagged residuals",ylab="Residuals",main="Lagged Residual plot")
Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion

  • There is no pattern in plot so autocorrelation might not be present.

Durbin watson test to confirm

durbinWatsonTest(basic_model2)
 lag Autocorrelation D-W Statistic p-value
   1      0.03503798      1.927132   0.122
 Alternative hypothesis: rho != 0
  • p value greater than 0.05, so the model has no autocorrelation.

Checking Normality

qqnorm(resid(basic_model2),col="brown") 
qqline(resid(basic_model2),col="black")

  • Data mostly obeys normality except at the edges.
Final model with summary
summary(basic_model2)

Call:
lm(formula = new_data2$imdbRating ~ ., data = new_data2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.01227 -0.34877  0.04392  0.39270  1.60810 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               6.311070   0.088039  71.685  < 2e-16 ***
age.certificationPG       0.082229   0.084369   0.975  0.32988    
age.certificationPG-13    0.162699   0.085905   1.894  0.05840 .  
age.certificationR        0.265934   0.086911   3.060  0.00225 ** 
age.certificationUnrated  0.343105   0.118592   2.893  0.00386 ** 
Runtime.mins             -0.002852   0.013802  -0.207  0.83634    
genreAdventure            0.368772   0.063518   5.806 7.63e-09 ***
genreAnimation            0.673740   0.065377  10.305  < 2e-16 ***
genreBiography            0.764485   0.068067  11.231  < 2e-16 ***
genreComedy               0.172675   0.041206   4.191 2.93e-05 ***
genreCrime                0.385794   0.062001   6.222 6.16e-10 ***
genreDocumentary          1.183715   0.251530   4.706 2.73e-06 ***
genreDrama                0.515416   0.045087  11.432  < 2e-16 ***
genreFantasy             -0.037848   0.118444  -0.320  0.74935    
genreHorror              -0.376462   0.066509  -5.660 1.77e-08 ***
genreMystery             -0.022865   0.249552  -0.092  0.92701    
genreSci-Fi               0.294143   0.130708   2.250  0.02455 *  
imdb.votes                0.577228   0.016160  35.720  < 2e-16 ***
binaryPASS               -0.061686   0.028323  -2.178  0.02955 *  
budget                   -0.147388   0.017715  -8.320  < 2e-16 ***
decade70                  0.641666   0.081640   7.860 6.79e-15 ***
decade80                  0.437188   0.056488   7.739 1.70e-14 ***
decade90                  0.193425   0.035348   5.472 5.11e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5525 on 1696 degrees of freedom
Multiple R-squared:  0.5453,    Adjusted R-squared:  0.5394 
F-statistic: 92.45 on 22 and 1696 DF,  p-value: < 2.2e-16
Inference
  • The main goal was to judge whether passing the bechdel test is a significant variable or not and also for the rest of the variables.
  • Since the model is heteroscadastic model we will be using Robust standard errors.(Huber White heteroscadasticity consistent standard error)
robust_res = vcovHC(basic_model2,type="HC1") 
Anova(basic_model2,white.adjust=T,robust="HC1")
Coefficient covariances computed by hccm()
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
     arithmetic operators in their names;
  the printed representation of the hypothesis will be omitted
Analysis of Deviance Table (Type II tests)

Response: new_data2$imdbRating
                    Df        F    Pr(>F)    
age.certification    4   6.4713 3.629e-05 ***
Runtime.mins         1   0.0426    0.8365    
genre               11  41.9836 < 2.2e-16 ***
imdb.votes           1 789.5755 < 2.2e-16 ***
binary               1   4.4722    0.0346 *  
budget               1  64.5264 1.766e-15 ***
decade               3  42.5553 < 2.2e-16 ***
Residuals         1696                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

  • Except from Runtime.mins rest of the variable have slope significantly different from 0.

  • Clearly Bechdel test passing or failing has a significant effect on IMDb Rating.

Model Selection

We will try reduce the mse using Ridge regression estimates

Approach
  • Choosing a sequence of \(\lambda\)

  • Finding the \(\lambda\) minimizes the cross-validation error

  • Then fitting a ridge model using that \(\lambda\)

x_ridge=as.matrix(model.matrix(basic_model2)[,-1]) 
y_ridge=new_data2$imdbRating 
lam=seq(0.001,1,length=500) 
cv_ridge=cv.glmnet(x_ridge,y_ridge,alpha=0,standardize=F,lambda = lam) 
plot(cv_ridge)

  • Clearly we could observe the mse is minimized at a \(\lambda\) very close to zero which implies we get no improvement in shrinking the coefficients so OLS is much better option here.

Prediction of Profit

  • We will construct a linear model to predict how Passing or Faling the bechdel test impact profit of movies.
profit_data=data[,c(5,6,7,8,9,12,13,21)]
profit_data$decade=as.factor(ifelse(data$year<1980,"70",ifelse(data$year<1990,"80",ifelse(data$year<2000,"90","2000"))))

Computing the model

profit_model=lm(profit_data$profit~.,data=profit_data)

Average predicted profit of passed and failed movies by the this model

tapply(profit_data,profit_data$binary,function(x){
  return(mean(predict(profit_model,newdata = x)))
})
     FAIL      PASS 
189466948 157019463 

Findings

  • Movies that failed the test are having on an average higher income than movies that are passing the test.

  • Here is a catch! What if we fix all the covariates except binary for a collection of movies.

  • Compare their average profit based on only they are passed or failed.

Approach
  1. First fix their binary variable as PASS and predict their average profit.

  2. Repeat the process fixing binary ass FAIL.

  3. Compare the average gain or loss(in percentage) of a movie that occured just because of binary.

test_movies=profit_data
pass_movies=test_movies
levels(pass_movies$binary)=c("PASS","PASS")
failed_movies=test_movies
levels(failed_movies$binary)=c("FAIL","FAIL")
failed_movies_pred=predict(profit_model,newdata=failed_movies)
pass_movies_pred=predict(profit_model,newdata=pass_movies)
result=((mean(pass_movies_pred)-mean(failed_movies_pred))/mean(failed_movies_pred))*100
cat(paste("Average extra gain of a movie that occured just because it passed the bechdel test",round(result,2),"%"))
Average extra gain of a movie that occured just because it passed the bechdel test 23.34 %

Project Report

Summary Findings

  • Older films failed the bechdel test more often than the new ones.

  • Failed movies have had higher average profits in all decades.

  • Movies passing the bechdel test tend to have lower ratings and the difference in their mean is statistically significant.

  • There is a difference in budget and in particular, failed films have a higher budget and the difference in their mean is statistically significant.

  • Movies passing the bechdel test tend to have lower income and the difference in their mean is statistically significant, but the difference has become less over the latter decades.

  • Median IMDB rating and income of a movie varies a lot over different genres.

  • Median IMDB rating and income of a movie varies a lot over different age certifications.

  • The positive correlation between budget and income of the movie is statistically significant.

  • relationship between income and binary keeping the effect of budget out is statistically significant.

  • Failed movies are having higher income because of higher budget, i.e., budget is a confounding factor.

  • Bechdel test passing or failing has a statistically significant effect on IMDb Rating.

  • A movie gains 0.06 IMDB rating at the expense of failing the Bechdel test.

  • Predicted profit of a passed film by the model is less than the failed one’s But Fixing all other features in the model a passed movie gains 23.34 % extra profit than a failed movie.

Discussion

Bechdel Test-compliant movies prove more profitable, with a slight dip in IMDB ratings. This underscores the importance of creating inclusive content in the industry. Increased profits incentivize producers to prioritize gender diversity, fostering a more inclusive cinematic landscape and mitigating criticisms of feminist agendas in filmmaking.

Limitation

Potential limitation of the study is it includes only a collection of hollywood pop culture movies from 1970 to 2024 so -

  • The findings of this study might differ with that of studies that includes all categories of movies.

  • Since the data is available only upto 2013 , some findings might differ with presence of more recent data.

Session Information
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sandwich_3.1-0     lmtest_0.9-40      zoo_1.8-12         car_3.1-2         
 [5] carData_3.0-5      glmnet_4.1-8       Matrix_1.7-0       tidytable_0.11.1  
 [9] ggcorrplot_0.1.4.1 MASS_7.3-60.2      plyr_1.8.9         ggpubr_0.6.0      
[13] ggplot2_3.5.1      dplyr_1.1.4        pacman_0.5.1      

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      shape_1.4.6.1     xfun_0.46         bslib_0.8.0      
 [5] rstatix_0.7.2     lattice_0.22-6    vctrs_0.6.5       tools_4.4.1      
 [9] generics_0.1.3    tibble_3.2.1      fansi_1.0.6       highr_0.11       
[13] pkgconfig_2.0.3   data.table_1.15.4 lifecycle_1.0.4   compiler_4.4.1   
[17] farver_2.1.2      stringr_1.5.1     munsell_0.5.1     codetools_0.2-20 
[21] htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.10       pillar_1.9.0     
[25] jquerylib_0.1.4   tidyr_1.3.1       cachem_1.1.0      iterators_1.0.14 
[29] abind_1.4-5       foreach_1.5.2     tidyselect_1.2.1  digest_0.6.36    
[33] stringi_1.8.4     reshape2_1.4.4    purrr_1.0.2       labeling_0.4.3   
[37] splines_4.4.1     fastmap_1.2.0     grid_4.4.1        colorspace_2.1-1 
[41] cli_3.6.3         magrittr_2.0.3    survival_3.6-4    utf8_1.2.4       
[45] broom_1.0.6       withr_3.0.1       scales_1.3.0      backports_1.5.0  
[49] rmarkdown_2.27    ggsignif_0.6.4    evaluate_0.24.0   knitr_1.48       
[53] rlang_1.1.4       Rcpp_1.0.13       glue_1.7.0        rstudioapi_0.16.0
[57] jsonlite_1.8.8    R6_2.5.1