Test
Before we begin testing, let’s convert the categorical columns of the data frame into numeric codes.
df$Pclass = as.integer(df$Pclass)
df$Sex = as.integer(df$Sex)
df$Age = as.integer(df$Age)
df$Embarked = as.integer(df$Embarked)
df$Survived = as.integer(df$Survived)
head(df)
## Survived Pclass Sex Age Embarked
## 1 1 3 2 1 3
## 2 2 1 1 1 1
## 3 2 3 1 1 3
## 4 2 1 1 1 3
## 5 1 3 2 1 3
## 6 1 1 2 1 3
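The recoding above relies on how `as.integer()` treats a factor: it returns the 1-based index of each level, not the original label. Below is a minimal sketch on toy vectors (the values are made up and are not taken from the dataset), assuming the columns were factors whose levels are in the default sorted order.

```r
# Toy factors standing in for df$Survived and df$Sex (hypothetical values).
survived <- factor(c("0", "1", "1", "0"))
sex      <- factor(c("male", "female", "female", "male"))

# as.integer() on a factor returns the level index, starting at 1.
survived_code <- as.integer(survived)   # "0" -> 1, "1" -> 2
sex_code      <- as.integer(sex)        # "female" -> 1, "male" -> 2

# Lookup table from integer code back to the original label.
sex_lookup <- data.frame(code = seq_along(levels(sex)), label = levels(sex))

# With a 1/2 coding, the mean minus 1 is the proportion of the second level.
survival_rate <- mean(survived_code) - 1
```

Under that assumption, a mean of Survived close to 2 indicates that most passengers in the group survived, and a mean close to 1 that most did not.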
Main Effects
On average, more passengers survived from the 1st class than from the other two classes. The lowest mean survival was in the 3rd class.
me_pclass = c(0,0,0)
me_pclass[1] = mean(df$Survived[df$Pclass==1])
me_pclass[2] = mean(df$Survived[df$Pclass==2])
me_pclass[3] = mean(df$Survived[df$Pclass==3])
plot(me_pclass, type="o", main="Main Effect of Passenger Class", xlab="Passenger Class", ylab="Main Effect",
xaxt="n")
axis(1, at=c(1,2,3), labels=c("1st", "2nd", "3rd"))
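The per-class means can also be computed in a single call. The sketch below uses `tapply()` on a small made-up data frame (`toy` is hypothetical and is not the Titanic data).

```r
# Hypothetical toy data: Survived coded 1 or 2, Pclass coded 1 to 3.
toy <- data.frame(
  Survived = c(2, 2, 1, 2, 1, 1, 1, 2, 1),
  Pclass   = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# Mean of Survived within each passenger class, named by class code.
me_toy <- tapply(toy$Survived, toy$Pclass, mean)

# Subtracting 1 turns the 1/2 coding into a proportion per class.
rate_toy <- me_toy - 1
```

The same pattern would apply to Sex, Age and Embarked by swapping the grouping column.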

On average, far more females survived than males.
me_sex = c(0,0)
me_sex[1] = mean(df$Survived[df$Sex==1])
me_sex[2] = mean(df$Survived[df$Sex==2])
plot(me_sex, type="o", main="Main Effect of Sex", xlab="Sex", ylab="Main Effect", xaxt="n")
axis(1, at=c(1,2), labels=c("Female", "Male"))

Among the age categories, children had the highest mean survival, followed by adults and then senior citizens.
me_age = c(0,0,0)
me_age[1] = mean(df$Survived[df$Age==1])
me_age[2] = mean(df$Survived[df$Age==2])
me_age[3] = mean(df$Survived[df$Age==3])
plot(me_age, type="o", main="Main Effect of Age", xlab="Age", ylab="Main Effect", xaxt="n")
axis(1, at=c(1,2,3), labels=c("Adult", "Children", "Senior Citizen"))

Passengers who boarded at Cherbourg had the highest mean survival, followed by those who boarded at Southampton and lastly Queenstown.
me_emb = c(0,0,0)
me_emb[1] = mean(df$Survived[df$Embarked==1])
me_emb[2] = mean(df$Survived[df$Embarked==2])
me_emb[3] = mean(df$Survived[df$Embarked==3])
plot(me_emb, type="o", main="Main Effect of Port of Embarkment", xlab="Port of Embarkment", ylab="Main Effect",
xaxt="n")
axis(1, at=c(1,2,3), labels=c("Cherbourg", "Queenstown", "Southampton"))

Interaction Effects
There is a clear interaction effect between Passenger Class and Sex. First and second class female passengers had a higher mean survival than third class female passengers, whereas first class male passengers had a higher mean survival than second and third class male passengers.
interaction.plot(df$Pclass, df$Sex, df$Survived, xlab="Passenger Class", ylab="Mean number of Survivors",
main="Interaction Effect between Passenger Class and Sex", legend=FALSE)
legend("topright", c("Female","Male"), lty=c("dashed", "solid"), title="Sex")
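When the built-in legend is switched off, the manual legend has to reproduce the line types that `interaction.plot()` actually used. Its default is `lty = nc:1`, where `nc` is the number of trace levels, so the first level gets the highest line-type number (with three levels: dotted, dashed, solid). The sketch below uses made-up data (`toy` and its columns are hypothetical) and passes `lty` explicitly to both calls so that plot and legend cannot drift apart.

```r
# Hypothetical toy data with a 3-level trace factor.
set.seed(1)
toy <- data.frame(
  x     = rep(1:3, times = 6),
  trace = rep(c("A", "B", "C"), each = 6),
  y     = runif(18)
)

trace_levels <- sort(unique(toy$trace))
nc           <- length(trace_levels)
line_types   <- nc:1          # same values as the interaction.plot() default

pdf(NULL)                     # null device; drop this line to see the plot
interaction.plot(toy$x, toy$trace, toy$y, lty = line_types, legend = FALSE)
legend("topright", legend = trace_levels, lty = line_types, title = "trace")
invisible(dev.off())
```

With two trace levels the default is dashed for the first level and solid for the second, which is what the Sex legend above assumes.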

There is an interaction effect between Passenger Class and Age. Overall, children had a higher mean survival than adults, who in turn did better than senior citizens. Mean survival for children was highest in the 1st class, followed by the 2nd class and then the 3rd class, and the same trend holds for adults. For senior citizens, however, mean survival was highest in the 2nd class and lowest in the 1st class.
interaction.plot(df$Pclass, df$Age, df$Survived, xlab="Passenger Class", ylab="Mean number of Survivors",
main="Interaction Effect between Passenger Class and Age", legend=FALSE)
# interaction.plot() draws three trace levels with lty = 3:1 by default, so the legend uses the same order
legend("topright", c("Adult","Child", "Senior"), lty=c("dotted", "dashed", "solid"), title="Age")

There is also an interaction effect between Passenger Class and Port of Embarkment. For passengers who boarded at Southampton and Cherbourg, the 1st class passengers survived the most and the 3rd class passengers survived the least. For passengers who boarded at Queenstown, mean survival was almost the same for 1st and 2nd class passengers, and higher than for 3rd class passengers. For 2nd and 3rd class passengers who boarded at Queenstown and Southampton, there doesn’t seem to be any interaction effect.
interaction.plot(df$Pclass, df$Embarked, df$Survived, xlab="Passenger Class", ylab="Mean number of Survivors",
main="Interaction Effect between Passenger Class and Port of Embarkment", legend=FALSE)
# interaction.plot() draws three trace levels with lty = 3:1 by default, so the legend uses the same order
legend("topright", c("Cherbourg","Queenstown","Southampton"), lty=c("dotted", "dashed", "solid"),
title="Port of Embarkment")

It is interesting to note that at such a critical time (when the ship was sinking), mean survival for the 1st class was clearly higher than for the other two classes. This reflects and reinforces their power and stature in society: they were given priority in reaching safety even when the whole ship was sinking and almost everybody had a substantial chance of dying that day.
There seems to be an interaction effect between Age and Sex. Among males, children had the highest mean survival, followed by adults and then senior citizens. Among females the order is reversed: senior citizens had the highest mean survival, followed by adults and then children. Overall, females had a higher mean survival than males in every age group.
interaction.plot(df$Sex, df$Age, df$Survived, xlab="Sex", ylab="Mean number of Survivors",
main="Interaction Effect between Sex and Age", legend=FALSE, xtick=FALSE, xaxt="n")
axis(1, c(1,2), labels=c("Female", "Male"))
# interaction.plot() draws three trace levels with lty = 3:1 by default, so the legend uses the same order
legend("topright", c("Adult","Child", "Senior"), lty=c("dotted", "dashed", "solid"), title="Age")

There doesn’t seem to be any interaction effect between Sex and Port of Embarkment.
interaction.plot(df$Sex, df$Embarked, df$Survived, xlab="Sex", ylab="Mean number of Survivors",
main="Interaction Effect between Sex and Port of Embarkment", legend=FALSE, xtick = FALSE, xaxt="n")
axis(1, c(1,2), labels=c("Female", "Male"))
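# interaction.plot() draws three trace levels with lty = 3:1 by default, so the legend uses the same order
legend("topright", c("Cherbourg","Queenstown","Southampton"), lty=c("dotted", "dashed", "solid"),
title="Port of Embarkment")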

There is an interaction effect between Age and Port of Embarkment. Senior citizens who boarded at Cherbourg and Queenstown had a lower mean survival than those who boarded at Southampton. For adults and children, those who boarded at Cherbourg had the highest mean survival, followed by Southampton and then Queenstown. There doesn’t seem to be an interaction effect in the adults-children part of the graph.
interaction.plot(df$Age, df$Embarked, df$Survived, xlab="Age", ylab="Mean number of Survivors",
main="Interaction Effect between Age and Port of Embarkment", legend=FALSE, xtick = FALSE, xaxt="n")
axis(1, c(1,2,3), labels=c("Adults", "Children", "Senior Citizens"))
# interaction.plot() draws three trace levels with lty = 3:1 by default, so the legend uses the same order
legend("topright", c("Cherbourg","Queenstown","Southampton"), lty=c("dotted", "dashed", "solid"),
title="Port of Embarkment")

ANOVA
The main effects of Passenger Class, Sex and Port of Embarkment are significant. Mean survival was higher in the 1st class than in the 2nd and 3rd classes, and higher for females than for males. The effect of Age is not significant (p = 0.46), meaning that mean survival did not differ detectably across the age groups.
me1 = aov(df$Survived ~ df$Pclass)
anova(me1)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Pclass 1 21.792 21.7923 103.35 < 2.2e-16 ***
## Residuals 710 149.713 0.2109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
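The F value and p-value in the table follow directly from the sums of squares and degrees of freedom. As a check, the sketch below recomputes them from the printed numbers (these are the rounded values shown above, so the result agrees only to the displayed precision).

```r
# Values copied from the ANOVA table for Pclass.
ss_effect <- 21.7923; df_effect <- 1
ss_resid  <- 149.713; df_resid  <- 710

ms_effect <- ss_effect / df_effect     # mean square of the effect
ms_resid  <- ss_resid / df_resid       # about 0.2109, as printed

f_value <- ms_effect / ms_resid        # about 103.35, as printed
p_value <- pf(f_value, df_effect, df_resid, lower.tail = FALSE)
```

The upper tail of the F distribution gives the p-value, which here falls below the `2.2e-16` threshold that R prints.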
me2 = aov(df$Survived ~ df$Sex)
anova(me2)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Sex 1 49.413 49.413 287.35 < 2.2e-16 ***
## Residuals 710 122.093 0.172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
me3 = aov(df$Survived ~ df$Age)
anova(me3)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Age 1 0.129 0.12945 0.5363 0.4642
## Residuals 710 171.376 0.24138
me4 = aov(df$Survived ~ df$Embarked)
anova(me4)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Embarked 1 5.68 5.6797 24.318 1.018e-06 ***
## Residuals 710 165.83 0.2336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
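Because the predictors were converted to integers, each model above spends a single degree of freedom per predictor, which amounts to fitting a straight-line trend across the level codes. The sketch below uses simulated data (everything in it is made up) to show how wrapping a predictor in `factor()` changes that to one degree of freedom per level, minus one.

```r
set.seed(42)
# Hypothetical data: a 3-level grouping variable stored as integer codes.
toy <- data.frame(group = rep(1:3, each = 20))
toy$y <- rnorm(60, mean = c(1.6, 1.3, 1.5)[toy$group], sd = 0.4)

fit_numeric <- aov(y ~ group, data = toy)          # linear trend in the codes
fit_factor  <- aov(y ~ factor(group), data = toy)  # separate mean per level

df_numeric <- anova(fit_numeric)$Df[1]   # 1
df_factor  <- anova(fit_factor)$Df[1]    # 2 (three levels minus one)
```

For a two-level predictor such as Sex the two codings give the same fit, so the distinction only matters for Pclass, Age and Embarked.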
There are significant main effects of Passenger Class, Sex and Port of Embarkment, and the 2-way ANOVAs reinforce this. The interaction of Passenger Class is significant with Sex and with Age, but not with Port of Embarkment. The interaction of Sex is not significant at the 0.05 level with either Age (p = 0.07) or Port of Embarkment. The interaction of Age with Port of Embarkment is not significant either.
ie12 = aov(df$Survived ~ df$Pclass * df$Sex)
anova(ie12)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Pclass 1 21.792 21.792 145.192 < 2.2e-16 ***
## df$Sex 1 40.941 40.941 272.774 < 2.2e-16 ***
## df$Pclass:df$Sex 1 2.506 2.506 16.697 4.887e-05 ***
## Residuals 708 106.266 0.150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie13 = aov(df$Survived ~ df$Pclass * df$Age)
anova(ie13)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Pclass 1 21.792 21.7923 105.0161 < 2.2e-16 ***
## df$Age 1 0.426 0.4257 2.0513 0.1525146
## df$Pclass:df$Age 1 2.368 2.3676 11.4091 0.0007708 ***
## Residuals 708 146.920 0.2075
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie14 = aov(df$Survived ~ df$Pclass * df$Embarked)
anova(ie14)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Pclass 1 21.792 21.7923 104.4346 < 2.2e-16 ***
## df$Embarked 1 1.644 1.6443 7.8797 0.005137 **
## df$Pclass:df$Embarked 1 0.331 0.3309 1.5856 0.208364
## Residuals 708 147.738 0.2087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie23 = aov(df$Survived ~ df$Sex * df$Age)
anova(ie23)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Sex 1 49.413 49.413 287.9032 <2e-16 ***
## df$Age 1 0.011 0.011 0.0659 0.7975
## df$Sex:df$Age 1 0.567 0.567 3.3025 0.0696 .
## Residuals 708 121.514 0.172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie24 = aov(df$Survived ~ df$Sex * df$Embarked)
anova(ie24)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Sex 1 49.413 49.413 292.901 < 2.2e-16 ***
## df$Embarked 1 2.632 2.632 15.600 8.607e-05 ***
## df$Sex:df$Embarked 1 0.020 0.020 0.118 0.7314
## Residuals 708 119.441 0.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie34 = aov(df$Survived ~ df$Age * df$Embarked)
anova(ie34)
## Analysis of Variance Table
##
## Response: df$Survived
## Df Sum Sq Mean Sq F value Pr(>F)
## df$Age 1 0.129 0.1294 0.5536 0.4571
## df$Embarked 1 5.641 5.6410 24.1234 1.123e-06 ***
## df$Age:df$Embarked 1 0.176 0.1758 0.7517 0.3862
## Residuals 708 165.559 0.2338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
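In the model formulas above, `A * B` is shorthand for both main effects plus their interaction, which is why each table has three effect rows. A short sketch of that expansion follows (the names `y`, `a` and `b` are placeholders, not columns of `df`).

```r
# terms() expands a formula symbolically, so no data is needed.
expanded <- attr(terms(y ~ a * b), "term.labels")
explicit <- attr(terms(y ~ a + b + a:b), "term.labels")

# Both formulas describe the same model.
same_model <- identical(expanded, explicit)
```

The rows of `anova()` on an `aov` fit are tested sequentially, in formula order. That is why the sum of squares for Embarked is 5.68 when it is the only predictor but 1.644 when it enters after Pclass.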
Estimation
Based on the exploratory data analysis, the main effects, the 2-way interaction effects, and the ANOVA for all main and interaction effects, we can estimate that if the Titanic (or, for that matter, any ship) were to set sail again and meet the same fate, then on average there would be more survivors from the upper class, more survivors who are women and children, and more survivors who boarded the ship at Cherbourg.
Diagnostics and Model Adequacy Checking
The Q-Q plot is a tool for comparing two distributions against each other by comparing their quantiles. Here the quantiles of the model residuals are compared with those of a normal distribution. None of the Q-Q plots below adhere to the line, which suggests that the residuals are not normally distributed. This is to be expected when the response takes only two values.
The residual plot is used in statistical data analysis to detect non-linearity, unequal error variances and outliers. In most of the following residuals vs. fits plots, the residuals do not bounce randomly around the zero line; instead they fall along lines with a negative slope. This suggests that a linear model does not describe the data well. The residuals stay within a roughly horizontal band around the zero line, suggesting that the variances of the error terms are roughly equal. Lastly, no residual stands out from the basic pattern of the residuals, which suggests that there are no outliers.
These observations suggest that not all of the assumptions for deploying ANOVA have been met by the dataset. This is unsurprising because, as explained above, there was no scope for replication or repeated measures. However, we still get reasonable results because some of the ANOVA assumptions do hold in the dataset.
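The sloped pattern in the residuals vs. fits plots is what a two-valued response produces. Every residual equals the observed value minus the fitted value, so with Survived coded 1 and 2 the points can only fall on the lines `1 - fitted` and `2 - fitted`. The sketch below checks this on simulated data (not the Titanic data; the probabilities are arbitrary).

```r
set.seed(7)
# Hypothetical binary outcome coded 1/2 and a 3-level numeric predictor.
toy <- data.frame(x = rep(1:3, each = 30))
toy$y <- 1 + rbinom(90, size = 1, prob = c(0.7, 0.5, 0.3)[toy$x])

fit      <- aov(y ~ x, data = toy)
res      <- residuals(fit)
fit_vals <- fitted(fit)

# Each residual lies on one of two parallel lines with slope -1.
on_lower_line <- abs(res - (1 - fit_vals)) < 1e-8
on_upper_line <- abs(res - (2 - fit_vals)) < 1e-8
all_on_lines  <- all(on_lower_line | on_upper_line)
```

Plotting `fit_vals` against `res` for this toy fit would show the same two descending bands as the plots below.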
qqnorm(residuals(me1))
qqline(residuals(me1))

plot(fitted(me1), residuals(me1))

qqnorm(residuals(me2))
qqline(residuals(me2))

plot(fitted(me2), residuals(me2))

qqnorm(residuals(me3))
qqline(residuals(me3))

plot(fitted(me3), residuals(me3))

qqnorm(residuals(me4))
qqline(residuals(me4))

plot(fitted(me4), residuals(me4))

qqnorm(residuals(ie12))
qqline(residuals(ie12))

plot(fitted(ie12), residuals(ie12))

qqnorm(residuals(ie13))
qqline(residuals(ie13))

plot(fitted(ie13), residuals(ie13))

qqnorm(residuals(ie14))
qqline(residuals(ie14))

plot(fitted(ie14), residuals(ie14))

qqnorm(residuals(ie23))
qqline(residuals(ie23))

plot(fitted(ie23), residuals(ie23))

qqnorm(residuals(ie24))
qqline(residuals(ie24))

plot(fitted(ie24), residuals(ie24))

qqnorm(residuals(ie34))
qqline(residuals(ie34))

plot(fitted(ie34), residuals(ie34))
