Cartoonist ALISON BECHDEL in 1980’s found a trend in movies where most of them only have a single female participation or even if there were multiple female ,the storyline almost always revolves arround the male character.
She found it frustrating and decided to create BECHDEL TEST which is also known as BECHDEL-WALLACE TEST to judge whether the movie has enough female representation .
BECHDEL TEST has three simple rules-
Several analyses have indicated that passing the Bechdel test is associated with a film’s financial success.
Vocativ’s authors found that the films from 2013 that passed the test earned a total of $4.22 billion in the United States, while those that failed earned $2.66 billion in total, leading them to conclude that a way for Hollywood to make more money might be to “PUT MORE WOMEN ONSCREEN.”
So there exists a great motivation to observe how female representation impacts the performance of a movie in terms of IMDb rating and Box office collection as it will eventually help us to analyse the viewers sentiment as well as producer’s mindset in this regard .
Looking for connection between passing the Bechdel Test and box office performance.
To examine the relationship between Bechdel Test results and IMDb ratings.
To examine the relationship between Bechdel Test results and profit of the movie.
Analyise the movies (in terms of Genre , age certification , imdb votes etc) based on Bechdel test.
Some other variable relationships.
The columns are as follows-
index -serial no of observations
year - releasing year of the movie
imdb - IMDb code for the movie
title- name of the movie
imdbRating - rating of the movie on IMDb (on a scale of 1 to 10)
age.certification - Us certification of the movie on IMDb( whether G or PG or PG-13 or R or Unrated)
Runtime.mins- length of the movie in minutes
genre- genre of the movie (subcategory- Action ,Thriller ,Crime, Fantsay ,Animation, Horror ,Drama, Romance ,Biography etc)
imdb.votes- no. of votes the movie received on IMDb
test- whether or not the movie passed the Bechdel Test
clean_test- whether or not the movie passed the Bechdel Test without any caveats
binary- whether or not test and clean_test passed together
budget- budget of the movie in millions
domgross- domestic gross of the movie in millions
intgross- international gross of the movie in millions
code- release year along with whether binary is passed or failed
budget_2013. - budget of the movie in 2013 dollars
domgross_2013. - domestic gross of the movie in 2013 dollars
intgross_2013. - international gross of the movie in 2013 dollars
period.code - period code of the movie
decade.code - decade code of the movie
Getting the required packages:
library(pacman,quietly=T)
pacman::p_load(dplyr,ggpubr,plyr,ggplot2,MASS,ggcorrplot,tidytable,Matrix,glmnet,car,lmtest,sandwich)
Importing and cleaning the dataset:
data=read.csv("C:\\Users\\zeeda\\OneDrive\\Desktop\\movies.csv",header=T)
data=data[,-c(20,21)] #-- removing period.code and decade.code
data=na.omit(data) #--removing the observations corresponding to NA values
data$imdb.votes[which(data$imdb.votes=="None")]=c(0,0) #--treating none voting as 0 votes
data$imdb.votes=as.numeric(data$imdb.votes)
data$income=data$domgross+data$intgross #--definiing total income of the movie
data$profit=data$income-data$budget #--profit of movie
summary(data)
index year imdb title
Min. : 0.0 Min. :1970 Length:1774 Length:1774
1st Qu.: 447.2 1st Qu.:1998 Class :character Class :character
Median : 903.5 Median :2005 Mode :character Mode :character
Mean : 898.4 Mean :2003
3rd Qu.:1346.8 3rd Qu.:2009
Max. :1793.0 Max. :2013
imdbRating age.certification Runtime.mins genre
Min. :2.300 Length:1774 Min. : 2.0 Length:1774
1st Qu.:6.200 Class :character 1st Qu.:131.0 Class :character
Median :6.800 Mode :character Median :142.0 Mode :character
Mean :6.745 Mean :139.1
3rd Qu.:7.400 3rd Qu.:154.0
Max. :9.300 Max. :431.0
imdb.votes test clean_test binary
Min. : 0 Length:1774 Length:1774 Length:1774
1st Qu.: 60182 Class :character Class :character Class :character
Median : 134244 Mode :character Mode :character Mode :character
Mean : 239791
3rd Qu.: 291757
Max. :2886071
budget domgross intgross code
Min. : 7000 Min. : 828 Min. :8.280e+02 Length:1774
1st Qu.: 12000000 1st Qu.: 16314670 1st Qu.:2.634e+07 Class :character
Median : 29500000 Median : 42147342 Median :7.684e+07 Mode :character
Mean : 45090050 Mean : 69141289 Mean :1.509e+08
3rd Qu.: 60000000 3rd Qu.: 93278718 3rd Qu.:1.903e+08
Max. :425000000 Max. :760507625 Max. :2.784e+09
budget_2013. domgross_2013. intgross_2013.
Min. : 8632 Min. :8.990e+02 Min. :8.990e+02
1st Qu.: 16234217 1st Qu.:2.050e+07 1st Qu.:3.371e+07
Median : 37143282 Median :5.599e+07 Median :9.679e+07
Mean : 55802234 Mean :9.516e+07 Mean :1.986e+08
3rd Qu.: 78885934 3rd Qu.:1.216e+08 3rd Qu.:2.418e+08
Max. :461435929 Max. :1.772e+09 Max. :3.172e+09
income profit
Min. :1.656e+03 Min. : -89057484
1st Qu.:4.259e+07 1st Qu.: 20520157
Median :1.201e+08 Median : 85670865
Mean :2.201e+08 Mean : 174962520
3rd Qu.:2.846e+08 3rd Qu.: 222381997
Max. :3.544e+09 Max. :3119426607
par(mfrow=c(2,2))
plot(data$imdbRating,ylim=c(0,10),col=ifelse(data$imdbRating>=6,ifelse(data$imdbRating>=8,"black","blue"),"red"),ylab="IMDb Ratings",xlab="Serial no. of movies",main="IMDb Movie Ratings")
abline(h=6,col="red")
abline(h=8,col="black")
legend(x=1200,y=2,legend=c("Below Average","Decent","Topclass"),fill=c("red","blue","black"))
data$age.certification=as.factor(data$age.certification) #--converting in factor variable
lab=names(table(data$age.certification))
pct=round((table(data$age.certification)/sum(table(data$age.certification)))*100,1)
pie(labels=paste(lab,pct,"%",sep=":"),table(data$age.certification),main="Pie Chart of Age Certification of Movies")
plot(data$Runtime.mins,ylab="Runtime",xlab="serial no. of movies",main="Runtime of Movies(in mins)",col=ifelse(data$Runtime.mins<=45,"dark green"," red"),pch=3)
legend(x=0,y=450,c("Feature flim","Short flim"),fill=c("red","dark green"))
data$genre=as.factor(data$genre)
levels(data$genre)=c("Action","Adventure","Animation","Biography","Comedy","Crime","Documentary","Drama", "Fantasy","Horror","Mystery","Sci-Fi","Crime")
barplot(horiz=T,sort(table(data$genre)),col=rainbow(12),main="Barplot of Genre",xlab="Frequency")
pct=round((table(data$genre)/nrow(data))*100,1)
labs=(names(table(data$genre)))
legend("bottomright",paste(labs,pct,"%",sep=":"),fill=rainbow(12))
The variable imdb.Rating ranges between 0-10 using a plot we could observe how the movies here are being rated.
Here most of the movies(82%) have DECENT rating.
6 of them are OSCAR worthy let’s find out their names
data$title[which(data$imdbRating>=9)]
[1] "The Dark Knight"
[2] "The Lord of theRings: TheReturn of the King"
[3] "The ShawshankRedemption"
[4] "Schindler's List"
[5] "TheGodfather: Part II"
[6] "TheGodfather"
The variable age.certification is a categorical variable having levels “G”,“PG”,“PG-13”,“R” and “Unrated”
Most of the movies has PG-13 certificate here followed by R certificated movies.
The variable Runtime.mins shows the range of the movies in minutes(older has higher length)
The variable genre is categorical variable . Here is the barplot to observe the types of genre with the percentage from different genre
Here clean_test and test are categorical variables and based on their levels corresponding to a movie we decide whether it will PASS or FAIL the Bechdel test. Lets understand it by a plot
data$test=as.factor(data$test)
data$clean_test=as.factor(data$clean_test)
data$binary=as.factor(data$binary)
ggplot(data,aes(y=test,x=clean_test,shape=binary))+ geom_point(aes(colour=binary),size=5)+geom_line(aes(group=binary),lty=2)+ ggtitle("Bechdel Test")+ xlab(label="Bechdel test without any caveats")+ ylab(label="Bechdel test status")
| test | clean_test | Binary |
|---|---|---|
| Dubious | Dubious & Dubious-disagree | FAIL |
| Men | Men &Men-disagree | FAIL |
| Nowomen | Nowomen & Nowomen-disagree | FAIL |
| Ok | Ok & Ok-disagree | PASS |
The variable year, imdbRating, Runtime.mins, imdb.votes, budget, income gives the releasing year of each movie let us visualize from a histogram how many movies we have in the dataset over different year.
par(mfrow=c(2,3))
for(i in c(2,5,7,9,13,20)){
hist(data[,i],col="light blue",xlab=colnames(data)[i],
main=paste("Histogram of",colnames(data)[i])) }
cor.data=cor(data[,c(5,7,9,13,20)],method="spearman")
ggcorrplot(cor.data,method="square",type="lower",lab=TRUE)
budget have good positive correlation Income .
imdb votes and income also have good poisitive correlation.
imdb votes and imdb rating have good poisitive correlation.
1. Passed and Failed movies mostly belong to which genre?
tapply(data,data$binary,function(x){return(names(which.max(table(x$genre))))})
FAIL PASS
"Action" "Comedy"
2. Passed and Failed movies mostly belong to which age certification?
tapply(data,data$binary,function(x){return(names(which.max(table(x$age.certification))))})
FAIL PASS
"PG-13" "PG-13"
3. Does passed movies have on an higher average profit?
obj_profit=daply(data,.(data$year,data$binary),function(x){return(mean(x$profit))})
dec_1970=colMeans(obj_profit[1:10,][-c(1,2,6),])
dec_1980=colMeans(obj_profit[11:20,])
dec_1990=colMeans(obj_profit[21:30,])
dec_2000=colMeans(obj_profit[31:44,])
obj_main_profit=rbind(dec_1970,dec_1980,dec_1990,dec_2000)
barplot(t(obj_main_profit),beside=T,col=rainbow(2),xlab="Decade",ylab="Profit")
legend("topright",c("FAIL","PASS"),fill=rainbow(2))
3. Does older flims mostly failed the Bechdel Test?
Obtained the no. of movies that failed and passed over different years.
Subgroup them in different decades (1971’s,1980’s,1990’s and 2000’s ) and add up the count of failed and passed movies and scale them down to 1.
Create a Barplot using the data.
object=daply(data,.(data$year),function(x){return(table(x$binary))})
decade_1970=colSums(object[1:10,])
decade_1980=colSums(object[11:20,])
decade_1990=colSums(object[21:30,])
decade_2000=colSums(object[31:44,])
object12=rbind(decade_1970,decade_1980,decade_1990,decade_2000)
new_obj=object12*(1/rowSums(object12))
barplot(t(new_obj),col=rainbow(7),ylim=c(0,1.2))
legend("topleft",c("FAIL","PASS"),fill=rainbow(7))
4. Does movies passing Bechdel Test has on higher ratings and higher income?
par(mfrow=c(1,2))
boxplot(data$imdbRating~data$binary,col="light pink",xlab="Bechdel Test pass/fail",ylab="IMDb Rating",main="IMDb rating Vs Bechdel Test pass/fail")
boxplot(data$income~data$binary,col="light pink",xlab="Bechdel Test pass/fail",ylab="Income",main="Income Vs Bechdel Test pass/fail")
Clearly the median is less for the movies passing the test in both the cases, so we can conclude MOVIES PASSING THE BECHDEL TEST TEND TO HAVE LOWER RATINGS AND INCOME.
Is budget a confounding factor incase of income?
To answer that we will first verify if budget for passed and failed flim is different using boxplot.
boxplot(data$budget~data$binary,col="light pink",xlab=" Bechdel Test pass/fail",
ylab="Budget",main="Boxplot of Income Vs Bechdel Test pass/fail")
There is a difference in budget in particular failed flim has higher budget.
From the correlation plot , we can observe budget seems to have more positive correlation Income than with IMDb rating . So incase of income ,budget can be a confounding factor i.e. since the failed movie has higher budget so it has higher income.
5. Does the imdb ratings and income remain same over different genre?
par(mfrow=c(1,2))
plot(data$imdbRating~data$genre,col="light blue",xlab="Genre",ylab="IMDb Rating",main="IMDb rating vs Genre")
plot(data$income~data$genre,col="light blue",xlab="Genre",ylab="Income",main="Income vs Genre")
6. Does the imdb ratings and income remain same over different certification?
par(mfrow=c(1,2))
plot(data$imdbRating~data$age.certification,col="light green",xlab="age certification",ylab="IMDb Rating",main="IMDb rating vs age certification ")
plot(data$income~data$age.certification,col="light green",xlab="age certification",ylab="Income",main="Income vs age certification")
Clearly the median IMDb rating varies over different age certification but not as much as it varies with different age certification.
The income varies significantly over two age certification G and PG from most of the age certification.
7. Does the average IMDb rating and income change over different decades(1971’s,1980’s,1990’s and 2000’s ) for passed and failed (in Bechdel Test) movies ?
Subset the data over different decade .
split the data based on Bechdel pass/fail .
Calculated mean IMDb rating and Income seperately.
plot two different curves of mean IMDb rating and Income over different decades using two plots.
par(mfrow=c(1,2))
decade2_1970=subset(data,data$year %in% 1970:1979)
p_1=tapply(decade2_1970,decade2_1970$binary,function(x){return(mean(x$imdbRating))})
decade2_1980=subset(data,data$year %in% 1980:1989)
p_2=tapply(decade2_1980,decade2_1980$binary,function(x){return(mean(x$imdbRating))})
decade2_1990=subset(data,data$year %in% 1990:1999)
p_3=tapply(decade2_1990,decade2_1990$binary,function(x){return(mean(x$imdbRating))})
decade2_2000=subset(data,data$year %in% 2000:2013)
p_4=tapply(decade2_2000,decade2_2000$binary,function(x){return(mean(x$imdbRating))})
object2=rbind(p_1,p_2,p_3,p_4)
plot(object2[,1],type="b",ylim=c(5,8),col="red",lwd=2,xaxt="n",xlab="Decade",ylab="mean IMDb Rating")
par(new=T)
plot(object2[,2],type="b",ylim=c(5,8),col="blue",xlab="",ylab="",yaxt="n",xaxt="n",lwd=2)
axis(1,at=1:4,labels=c("70's","80's","90's","2000's"))
legend("bottomleft",c("FAIL","PASS"),fill=c("red","blue"))
object3=ddply(data,.(data$year),function(x){return(tapply(x,x$binary,function(y){return(mean(y$income))}))})
index=c(which(object3$FAIL=="NaN"),which(object3$PASS=="NaN"))
object3=object3[-index,]
decade3_1970=colMeans(object3[1:7,c(2,3)])
decade3_1980=colMeans(object3[8:17,c(2,3)])
decade3_1990=colMeans(object3[18:27,c(2,3)])
decade3_2000=colMeans(object3[28:41,c(2,3)])
object32=rbind(decade3_1970,decade3_1980,decade3_1990,decade3_2000)
plot(object32[,1],type="b",col="red",lwd=2,xaxt="n",xlab="Decade",ylab="mean Income")
par(new=T)
plot(object32[,2],type="b",col="blue",yaxt="n",xaxt="n",lwd=2,xlab="",ylab="")
axis(1,at=1:4,labels=c("70's","80's","90's","2000's"))
legend("bottomright",c("FAIL","PASS"),fill=c("red","blue"))
We can observe Mostly the failed movies performed in terms of mean IMDb rating and mean income better.
The difference has become less over the latter decades specially in 90s where the mean income of passed movies crossed the failed one.
Comparing Passed and Failed movies(in Bechdel Test)
1. Does the passed and failed movies have same mean and variance in IMDb rating?
we will conduct two sample F test where \(\sigma_{1}^{2}\) is the variance of IMDb rating of passed movies and \(\sigma_{2}^{2}\) is the variance of IMDb rating of failed movies. \[ H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2} \hspace{1cm} Against \hspace{1cm}H_{1}:\sigma_{1}^{2}\not=\sigma_{2}^{2} \]
test_pass=subset(data,data$binary=="PASS")
test_fail=subset(data,data$binary=="FAIL")
var.test(test_pass$imdbRating,test_fail$imdbRating,alternative="two.sided")
F test to compare two variances
data: test_pass$imdbRating and test_fail$imdbRating
F = 0.97999, num df = 792, denom df = 980, p-value = 0.7668
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8587585 1.1193928
sample estimates:
ratio of variances
0.9799902
Now we will look for equality of mean and hence conduct two sample t test where \(\mu_{1}\)is the mean of IMDb rating of passed movies and \(\mu_{2}\)is the mean of IMDb rating of failed movies. \[H_{0}:\mu_{1}=\mu_{2}\hspace{1cm}Against\hspace{1cm}H_{1}:\mu_{1}\ne\mu_{2}\]
t.test(test_pass$imdbRating,test_fail$imdbRating,alternative="two.sided",var.equal=TRUE)
Two Sample t-test
data: test_pass$imdbRating and test_fail$imdbRating
t = -5.3677, df = 1772, p-value = 9.026e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3129715 -0.1454636
sample estimates:
mean of x mean of y
6.618285 6.847503
2. Does the passed and failed movies have same mean and variance in income?
we will conduct two sample F test where \(\sigma_{1}^{2}\)is the variance of income of passed movies and \(\sigma_{2}^{2}\)is the variance of income of failed movies. \[ H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2} \hspace{1cm} Against \hspace{1cm}H_{1}:\sigma_{1}^{2}\not=\sigma_{2}^{2} \]
test_pass=subset(data,data$binary=="PASS")
test_fail=subset(data,data$binary=="FAIL")
var.test(test_pass$income,test_fail$income,alternative="two.sided")
F test to compare two variances
data: test_pass$income and test_fail$income
F = 0.80707, num df = 792, denom df = 980, p-value = 0.001612
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7072326 0.9218786
sample estimates:
ratio of variances
0.8070733
Now we will look for equality of mean and hence conduct welch t testwhere \(\mu_{1}\) is the mean of income of passed movies and \(\mu_{2}\) is the mean of income of failed movies. \[H_{0}:\mu_{1}=\mu_{2}\hspace{1cm}Against\hspace{1cm}H_{1}:\mu_{1}\ne\mu_{2}\]
t.test(test_pass$income,test_fail$income,alternative="two.sided",var.equal=FALSE)
Welch Two Sample t-test
data: test_pass$income and test_fail$income
t = -3.3143, df = 1752.3, p-value = 0.0009376
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-71437389 -18320406
sample estimates:
mean of x mean of y
195235097 240113994
3. Is budget for Passed and Failed flim is different?
.we will conduct two sample F test where \(\sigma_{1}^{2}\) is the variance of budget of passed movies and \(\sigma_{2}^{2}\) is the variance of budget of failed movies. \[ H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2} \hspace{1cm} Against \hspace{1cm}H_{1}:\sigma_{1}^{2}\not=\sigma_{2}^{2} \]
var.test(test_pass$budget,test_fail$budget,alternative="two.sided")
F test to compare two variances
data: test_pass$budget and test_fail$budget
F = 0.69554, num df = 792, denom df = 980, p-value = 1.033e-07
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.6094929 0.7944748
sample estimates:
ratio of variances
0.6955356
Now we will look for equality of mean and hence conduct welch t test where \(\mu_{1}\)is the mean of budget of passed movies and \(\mu_{2}\)is the mean of budget of failed movies. \[H_{0}:\mu_{1}=\mu_{2}\hspace{1cm}Against\hspace{1cm}H_{1}:\mu_{1}\ne\mu_{2}\]
t.test(test_pass$budget,test_fail$budget,alternative="two.sided",var.equal=F)
Welch Two Sample t-test
data: test_pass$budget and test_fail$budget
t = -5.5472, df = 1770.3, p-value = 3.341e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-16826749 -8036075
sample estimates:
mean of x mean of y
38215633 50647046
4. Does higher budget mean higher Income for a movie?
cor.test(data$income,data$budget,alternative="greater",method="spearman")
Warning in cor.test.default(data$income, data$budget, alternative = "greater",
: Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: data$income and data$budget
S = 285386793, p-value < 2.2e-16
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6932925
5. Is there a relationship between income and binary keeping the effect of budget out?
We will construct a logistic regression model of Binary on budget and income and perform the coefficient test for income to observe whether it is significantly different than 0.
logistic_mod1=glm(data$binary~data$budget+data$income-1,data=data,family="binomial")
logistic_mod11=glm(data$binary~data$budget-1,data=data,family="binomial")
anova(logistic_mod11,logistic_mod1,test="Chisq")
Analysis of Deviance Table
Model 1: data$binary ~ data$budget - 1
Model 2: data$binary ~ data$budget + data$income - 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1773 2409.3
2 1772 2408.8 1 0.49068 0.4836
Here is our chosen predictors for predicting IMDb ratings
decade
age.certification
Runtime.mins
genre
imdb.votes
binary
budget
Since in EDA we have observed decade-wise change instead of directly using year so here we will convert year into decade-
year 1970-1979 is 70’s
year 1980-1989 is 80’s
year 1990-1999 is 90’s
rest is 2000’s
Also scaled the continuous variables to reduce interpretability issues
new_data=data[,c(5,6,7,8,9,12,13)]
new_data$decade=as.factor(ifelse(data$year<1980,"70",ifelse(data$year<1990,"80",ifelse(data$year<2000,"90","2000"))))
new_data[,c(3,5,7)]=scale(new_data[,c(3,5,7)])
Computing the model
basic_model=lm(new_data$imdbRating~.,data=new_data)
summary(basic_model)
Call:
lm(formula = new_data$imdbRating ~ ., data = new_data)
Residuals:
Min 1Q Median 3Q Max
-3.7542 -0.3382 0.0802 0.4401 2.1118
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.280432 0.100978 62.196 < 2e-16 ***
age.certificationPG 0.023358 0.096693 0.242 0.809145
age.certificationPG-13 0.136685 0.098357 1.390 0.164801
age.certificationR 0.283472 0.099610 2.846 0.004481 **
age.certificationUnrated 0.359381 0.136171 2.639 0.008384 **
Runtime.mins 0.002226 0.015988 0.139 0.889291
genreAdventure 0.363034 0.073029 4.971 7.31e-07 ***
genreAnimation 0.688739 0.075305 9.146 < 2e-16 ***
genreBiography 0.780885 0.079109 9.871 < 2e-16 ***
genreComedy 0.212469 0.047818 4.443 9.42e-06 ***
genreCrime 0.400869 0.071771 5.585 2.70e-08 ***
genreDocumentary 1.193309 0.296021 4.031 5.79e-05 ***
genreDrama 0.531745 0.052306 10.166 < 2e-16 ***
genreFantasy 0.012731 0.139345 0.091 0.927214
genreHorror -0.400078 0.076305 -5.243 1.77e-07 ***
genreMystery 0.046804 0.293748 0.159 0.873425
genreSci-Fi 0.256343 0.150107 1.708 0.087864 .
imdb.votes 0.543541 0.017042 31.895 < 2e-16 ***
binaryPASS -0.079281 0.032771 -2.419 0.015655 *
budget -0.128967 0.020387 -6.326 3.19e-10 ***
decade70 0.699166 0.094431 7.404 2.04e-13 ***
decade80 0.383076 0.064760 5.915 3.98e-09 ***
decade90 0.146013 0.040798 3.579 0.000354 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6506 on 1751 degrees of freedom
Multiple R-squared: 0.4853, Adjusted R-squared: 0.4788
F-statistic: 75.04 on 22 and 1751 DF, p-value: < 2.2e-16
Removing the outliers and model rebuilding
First we will look for potential outliers that are influentialuse cooks distance to measure influential points
calculate high residuals using studentized residual
take the intersection of both influential points and high residual and remove them
create the model again for the revised data
cd=cooks.distance(basic_model)
a=which(cd>(4/nrow(new_data)))
abs_res_vec=abs(rstudent(basic_model))
b=which(abs_res_vec>2)
del_obs=intersect(a,b)
del_obs #--outlier observation
[1] 10 82 89 117 151 159 222 293 383 406 432 481 577 608 626
[16] 643 769 815 816 828 840 876 882 890 922 954 1041 1071 1190 1271
[31] 1275 1282 1317 1364 1387 1414 1457 1491 1496 1515 1519 1521 1524 1531 1545
[46] 1568 1623 1642 1649 1654 1661 1682 1709 1766 1768
new_data2=new_data[-del_obs,]
basic_model2=lm(new_data2$imdbRating~.,data=new_data2)
Heteroscadasticity
plot(residuals(basic_model2)~fitted(basic_model2),col="darkgreen",xlab="Fitted values",ylab="Residuals",main="Residual vs Fitted",ylim=c(-2,2))
abline(col="red",h=mean(resid(basic_model2))+3*sd(resid(basic_model2)))
abline(col="red",h=mean(resid(basic_model2))-3*sd(resid(basic_model2)))
We conduct Breusch Pagan test for formal checking
bptest(basic_model2,studentize=F)
Breusch-Pagan test
data: basic_model2
BP = 39.695, df = 22, p-value = 0.01173
Attempting to remove Heteroscadasticity
Estimate \(\hat{\sigma_{i}}\) by regressing its absolute value on the predictors and hence obtain the weights as \(\frac{1}{\hat{\sigma_{i}}}\) .
Obtaining weighted least square model .
Create the plot and Bp-test again
#--Obtaining weighted least square model
sgm=lm((resid(basic_model2)^2)~.,data=new_data2[,-1])
basic_model3=lm(new_data2$imdbRating~.,data=new_data2,weights=1/sqrt(fitted(basic_model2)))
#--Create the plot again.
plot(resid(basic_model3)~fitted(basic_model3),ylim=c(-2,2),xlab="Fitted values",ylab="Residuals",main="Residual vs Fitted")
par(new=T)
plot(residuals(basic_model2)~fitted(basic_model2),col="grey",ylim=c(-2,2),xlab="",ylab="",xaxt="n",yaxt="n")
abline(col="red",h=mean(resid(basic_model3))+3*sd(resid(basic_model3)))
abline(col="red",h=mean(resid(basic_model3))-3*sd(resid(basic_model3)))
legend("topright",legend = c("Old model","New Model"),fill=c("black","grey"))
#--Bp test
bptest(basic_model3,studentize=F)
Breusch-Pagan test
data: basic_model3
BP = 941.19, df = 22, p-value < 2.2e-16
Multicollinearity
We will use conditional number to judge it.
X=model.matrix(basic_model2)
egn=eigen(t(X)%*%X)
egn_max=max(egn$values)
egn_min=min(egn$values)
cn=sqrt(egn_max/egn_min)
cn
[1] 25.13592
Autocorrelation
Plot of lagged residuals
lagged_residual=c("NA",resid(basic_model2)[-length(resid(basic_model2))])
plot(lagged_residual,resid(basic_model2),col="darkblue",xlab="Lagged residuals",ylab="Residuals",main="Lagged Residual plot")
Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion
Durbin watson test to confirm
durbinWatsonTest(basic_model2)
lag Autocorrelation D-W Statistic p-value
1 0.03503798 1.927132 0.122
Alternative hypothesis: rho != 0
Checking Normality
qqnorm(resid(basic_model2),col="brown")
qqline(resid(basic_model2),col="black")
summary(basic_model2)
Call:
lm(formula = new_data2$imdbRating ~ ., data = new_data2)
Residuals:
Min 1Q Median 3Q Max
-2.01227 -0.34877 0.04392 0.39270 1.60810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.311070 0.088039 71.685 < 2e-16 ***
age.certificationPG 0.082229 0.084369 0.975 0.32988
age.certificationPG-13 0.162699 0.085905 1.894 0.05840 .
age.certificationR 0.265934 0.086911 3.060 0.00225 **
age.certificationUnrated 0.343105 0.118592 2.893 0.00386 **
Runtime.mins -0.002852 0.013802 -0.207 0.83634
genreAdventure 0.368772 0.063518 5.806 7.63e-09 ***
genreAnimation 0.673740 0.065377 10.305 < 2e-16 ***
genreBiography 0.764485 0.068067 11.231 < 2e-16 ***
genreComedy 0.172675 0.041206 4.191 2.93e-05 ***
genreCrime 0.385794 0.062001 6.222 6.16e-10 ***
genreDocumentary 1.183715 0.251530 4.706 2.73e-06 ***
genreDrama 0.515416 0.045087 11.432 < 2e-16 ***
genreFantasy -0.037848 0.118444 -0.320 0.74935
genreHorror -0.376462 0.066509 -5.660 1.77e-08 ***
genreMystery -0.022865 0.249552 -0.092 0.92701
genreSci-Fi 0.294143 0.130708 2.250 0.02455 *
imdb.votes 0.577228 0.016160 35.720 < 2e-16 ***
binaryPASS -0.061686 0.028323 -2.178 0.02955 *
budget -0.147388 0.017715 -8.320 < 2e-16 ***
decade70 0.641666 0.081640 7.860 6.79e-15 ***
decade80 0.437188 0.056488 7.739 1.70e-14 ***
decade90 0.193425 0.035348 5.472 5.11e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5525 on 1696 degrees of freedom
Multiple R-squared: 0.5453, Adjusted R-squared: 0.5394
F-statistic: 92.45 on 22 and 1696 DF, p-value: < 2.2e-16
robust_res = vcovHC(basic_model2,type="HC1")
Anova(basic_model2,white.adjust=T,robust="HC1")
Coefficient covariances computed by hccm()
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
arithmetic operators in their names;
the printed representation of the hypothesis will be omitted
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
arithmetic operators in their names;
the printed representation of the hypothesis will be omitted
Analysis of Deviance Table (Type II tests)
Response: new_data2$imdbRating
Df F Pr(>F)
age.certification 4 6.4713 3.629e-05 ***
Runtime.mins 1 0.0426 0.8365
genre 11 41.9836 < 2.2e-16 ***
imdb.votes 1 789.5755 < 2.2e-16 ***
binary 1 4.4722 0.0346 *
budget 1 64.5264 1.766e-15 ***
decade 3 42.5553 < 2.2e-16 ***
Residuals 1696
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion
Except from Runtime.mins rest of the variable have slope significantly different from 0.
Clearly Bechdel test passing or failing has a significant effect on IMDb Rating.
We will try reduce the mse using Ridge regression estimates
Choosing a sequence of \(\lambda\)
Finding the \(\lambda\) minimizes the cross-validation error
Then fitting a ridge model using that \(\lambda\)
x_ridge=as.matrix(model.matrix(basic_model2)[,-1])
y_ridge=new_data2$imdbRating
lam=seq(0.001,1,length=500)
cv_ridge=cv.glmnet(x_ridge,y_ridge,alpha=0,standardize=F,lambda = lam)
plot(cv_ridge)
profit_data=data[,c(5,6,7,8,9,12,13,21)]
profit_data$decade=as.factor(ifelse(data$year<1980,"70",ifelse(data$year<1990,"80",ifelse(data$year<2000,"90","2000"))))
Computing the model
profit_model=lm(profit_data$profit~.,data=profit_data)
Average predicted profit of passed and failed movies by the this model
tapply(profit_data,profit_data$binary,function(x){
return(mean(predict(profit_model,newdata = x)))
})
FAIL PASS
189466948 157019463
Findings
Movies that failed the test are having on an average higher income than movies that are passing the test.
Here is a catch! What if we fix all the covariates except binary for a collection of movies.
Compare their average profit based on only they are passed or failed.
First fix their binary variable as PASS and predict their average profit.
Repeat the process fixing binary ass FAIL.
Compare the average gain or loss(in percentage) of a movie that occured just because of binary.
test_movies=profit_data
pass_movies=test_movies
levels(pass_movies$binary)=c("PASS","PASS")
failed_movies=test_movies
levels(failed_movies$binary)=c("FAIL","FAIL")
failed_movies_pred=predict(profit_model,newdata=failed_movies)
pass_movies_pred=predict(profit_model,newdata=pass_movies)
result=((mean(pass_movies_pred)-mean(failed_movies_pred))/mean(failed_movies_pred))*100
cat(paste("Average extra gain of a movie that occured just because it passed the bechdel test",round(result,2),"%"))
Average extra gain of a movie that occured just because it passed the bechdel test 23.34 %
Older films failed the bechdel test more often than the new ones.
Failed movies have had higher average profits in all decades.
Movies passing the bechdel test tend to have lower ratings and the difference in their mean is statistically significant.
There is a difference in budget and in particular, failed films have a higher budget and the difference in their mean is statistically significant.
Movies passing the bechdel test tend to have lower income and the difference in their mean is statistically significant, but the difference has become less over the latter decades.
Median IMDB rating and income of a movie varies a lot over different genres.
Median IMDB rating and income of a movie varies a lot over different age certifications.
The positive correlation between budget and income of the movie is statistically significant.
relationship between income and binary keeping the effect of budget out is statistically significant.
Failed movies are having higher income because of higher budget, i.e., budget is a confounding factor.
Bechdel test passing or failing has a statistically significant effect on IMDb Rating.
A movie gains 0.06 IMDB rating at the expense of failing the Bechdel test.
Predicted profit of a passed film by the model is less than the failed one’s But Fixing all other features in the model a passed movie gains 23.34 % extra profit than a failed movie.
Bechdel Test-compliant movies prove more profitable, with a slight dip in IMDB ratings. This underscores the importance of creating inclusive content in the industry. Increased profits incentivize producers to prioritize gender diversity, fostering a more inclusive cinematic landscape and mitigating criticisms of feminist agendas in filmmaking.
Limitation
Potential limitation of the study is it includes only a collection of hollywood pop culture movies from 1970 to 2024 so -
The findings of this study might differ with that of studies that includes all categories of movies.
Since the data is available only upto 2013 , some findings might differ with presence of more recent data.
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
[5] LC_TIME=English_India.utf8
time zone: Asia/Calcutta
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sandwich_3.1-0 lmtest_0.9-40 zoo_1.8-12 car_3.1-2
[5] carData_3.0-5 glmnet_4.1-8 Matrix_1.7-0 tidytable_0.11.1
[9] ggcorrplot_0.1.4.1 MASS_7.3-60.2 plyr_1.8.9 ggpubr_0.6.0
[13] ggplot2_3.5.1 dplyr_1.1.4 pacman_0.5.1
loaded via a namespace (and not attached):
[1] gtable_0.3.5 shape_1.4.6.1 xfun_0.46 bslib_0.8.0
[5] rstatix_0.7.2 lattice_0.22-6 vctrs_0.6.5 tools_4.4.1
[9] generics_0.1.3 tibble_3.2.1 fansi_1.0.6 highr_0.11
[13] pkgconfig_2.0.3 data.table_1.15.4 lifecycle_1.0.4 compiler_4.4.1
[17] farver_2.1.2 stringr_1.5.1 munsell_0.5.1 codetools_0.2-20
[21] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10 pillar_1.9.0
[25] jquerylib_0.1.4 tidyr_1.3.1 cachem_1.1.0 iterators_1.0.14
[29] abind_1.4-5 foreach_1.5.2 tidyselect_1.2.1 digest_0.6.36
[33] stringi_1.8.4 reshape2_1.4.4 purrr_1.0.2 labeling_0.4.3
[37] splines_4.4.1 fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
[41] cli_3.6.3 magrittr_2.0.3 survival_3.6-4 utf8_1.2.4
[45] broom_1.0.6 withr_3.0.1 scales_1.3.0 backports_1.5.0
[49] rmarkdown_2.27 ggsignif_0.6.4 evaluate_0.24.0 knitr_1.48
[53] rlang_1.1.4 Rcpp_1.0.13 glue_1.7.0 rstudioapi_0.16.0
[57] jsonlite_1.8.8 R6_2.5.1