Titanic Dataset: Analysis of Survivors

Prasanna Date

RPI

October 11, 2016 V1.0

1. Setting

System Under Test

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning of April 15, 1912 after colliding with an iceberg during her maiden voyage from Southampton to New York City. The ship carried an estimated 2,224 passengers and crew, of whom more than 1,500 died in the disaster.

In this study, we perform a statistical analysis of the fatalities on the ship using the Titanic dataset on Kaggle. The main question we address is whether there is a statistically significant relationship between a person's death and their passenger class, age, sex and/or the port where they boarded.

The following code shows how the data has been read into the R workspace.

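# stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0.0); the summaries below rely on it
df = read.csv('titanic.csv', header=TRUE, stringsAsFactors=TRUE)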

The remaining subsections of this section describe the factors and levels, continuous variables, response variables and the data. Section 2 describes how the experiment is organized, which hypotheses are being tested and the rationale behind them. It also discusses Randomization, Replication and Blocking. Section 3 considers what to do when the assumptions fail. In Section 4, we perform the statistical analysis: exploratory data analysis, hypothesis testing, estimation, and diagnostics and model adequacy checking.

Factors and Levels

For the purposes of this study, we operate with the following factors and their associated levels:

  1. Passenger Class: 1st, 2nd, 3rd
df$Pclass = as.factor(df$Pclass)
summary(df$Pclass)
##   1   2   3 
## 214 184 491
  2. Sex: Male, Female
summary(df$Sex)
## female   male 
##    312    577
  3. Age: In years (discrete numeric). Age is fractional if less than one.
summary(df$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.00   28.00   29.64   38.00   80.00     177
  4. Embarked: Port where the passenger boarded. One of Cherbourg, Queenstown or Southampton
summary(df$Embarked)
##   C   Q   S 
## 168  77 644

Continuous Variables

No continuous variables are used in the study. Three of the above-mentioned variables (Passenger Class, Sex and Embarked) are categorical and the fourth (Age) is discrete numeric.

Response Variables

The response variable in this study is whether the passenger survived or not: it is a binary variable with 0 indicating that they died and 1 indicating that they survived.

df$Survived = as.factor(df$Survived)
summary(df$Survived)
##   0   1 
## 549 340

The Data: How is it organized and what does it look like?

The Titanic dataset consists of twelve variables: PassengerId, Survived, Passenger Class, Name, Sex, Age, Number of Siblings/Spouse(s) On Board, Number of Parents/Child(ren) On Board, Ticket Information, Fare, Cabin, and Port of Embarkment.

The data looks as follows:

head(df)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

The following command shows the summary of the data:

summary(df)
##   PassengerId  Survived Pclass 
##  Min.   :  1   0:549    1:214  
##  1st Qu.:224   1:340    2:184  
##  Median :446            3:491  
##  Mean   :446                   
##  3rd Qu.:668                   
##  Max.   :891                   
##                                
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:312   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.00  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.64  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :883                NA's   :177    
##      SibSp            Parch             Ticket         Fare        
##  Min.   :0.0000   Min.   :0.0000   1601    :  7   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.896  
##  Median :0.0000   Median :0.0000   CA. 2343:  7   Median : 14.454  
##  Mean   :0.5242   Mean   :0.3825   3101295 :  6   Mean   : 32.097  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.000  
##  Max.   :8.0000   Max.   :6.0000   CA 2144 :  6   Max.   :512.329  
##                                    (Other) :850                    
##          Cabin     Embarked
##             :687   C:168   
##  B96 B98    :  4   Q: 77   
##  C23 C25 C27:  4   S:644   
##  G6         :  4           
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :184

For the purposes of this study, we work with only four input variables and one response variable. As mentioned above, the four input variables are Passenger Class, Sex, Age, and Port of Embarkment. The response variable is whether they survived or not.

We can trim the data as per our needs using the following commands:

df = df[,c(2,3,5,6,12)]
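As a minimal sketch, the same trimming can be written with column names instead of positions, which keeps working if the column order ever changes. The data frame `toy` and its values below are made up for illustration; only the twelve column names follow the output of head(df) above.

```r
# Toy data frame with the same twelve column names as the Titanic data (values are made up).
toy <- data.frame(
  PassengerId = 1:2, Survived = c(0, 1), Pclass = c(3, 1),
  Name = c("A", "B"), Sex = c("male", "female"), Age = c(22, 38),
  SibSp = c(1, 1), Parch = c(0, 0), Ticket = c("T1", "T2"),
  Fare = c(7.25, 71.28), Cabin = c("", "C85"), Embarked = c("S", "C")
)

keep <- c("Survived", "Pclass", "Sex", "Age", "Embarked")
by_name  <- toy[, keep]                  # selection by column name
by_index <- toy[, c(2, 3, 5, 6, 12)]     # selection by position, as in the report

# Both selections return the same five columns.
print(identical(by_name, by_index))
```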

We remove the rows where any field is equal to NA, and perform some data cleaning steps:

df = na.omit(df)
rownames(df) <- 1:nrow(df)
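To illustrate what these two cleaning steps do, here is a small sketch on a made-up data frame (`toy` is hypothetical). na.omit() drops every row containing an NA but keeps the original row labels, which is why the row names are renumbered afterwards.

```r
# Made-up data with two missing ages.
toy <- data.frame(Survived = c(0, 1, 1, 0), Age = c(22, NA, 26, NA))

clean <- na.omit(toy)               # drops the rows where Age is NA
print(rownames(clean))              # the surviving rows keep their old labels

rownames(clean) <- 1:nrow(clean)    # renumber the rows consecutively
print(clean)
```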

We also convert Age into a categorical variable as follows:

  • If age <= 18, then age = child
  • If 18 < age <= 60, then age = adult
  • If age > 60, then age = senior
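# Bin the numeric ages directly; assigning "child" into the column first would coerce it to
# character and make the later > and <= comparisons lexicographic rather than numeric.
age_group = cut(df$Age, breaks=c(-Inf, 18, 60, Inf), labels=c("child", "adult", "senior"))
# Keep the alphabetical level order (adult, child, senior) that as.factor() would produce.
df$Age = factor(as.character(age_group), levels=c("adult", "child", "senior"))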
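The three age rules above can be checked at their boundaries with a small sketch. The ages below are made up, and cut() from base R is used here as one possible way to express the rules; with its default right-closed intervals, 18 falls in "child" and 60 in "adult".

```r
# Hypothetical ages chosen to sit on and around the category boundaries.
ages <- c(0.42, 18, 18.5, 60, 60.5, 80)

# Intervals are (-Inf, 18], (18, 60] and (60, Inf) because cut() closes on the right by default.
groups <- cut(ages, breaks = c(-Inf, 18, 60, Inf), labels = c("child", "adult", "senior"))

print(data.frame(ages, groups))
```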

After performing this operation, our data looks like this:

head(df)
##   Survived Pclass    Sex   Age Embarked
## 1        0      3   male adult        S
## 2        1      1 female adult        C
## 3        1      3 female adult        S
## 4        1      1 female adult        S
## 5        0      3   male adult        S
## 6        0      1   male adult        S
summary(df)
##  Survived Pclass      Sex          Age      Embarked
##  0:424    1:184   female:259   adult :552   C:130   
##  1:288    2:173   male  :453   child :139   Q: 28   
##           3:355                senior: 21   S:554

2. Experimental Design

Experiment Organization, Conduct and Hypothesis Testing

The data analysis is organized in the following way:

  1. Computing main effects for all four factors
  2. Computing interaction effects for all six pairs of factors
  3. Computing analysis of variance (ANOVA) for all four main effects and all six interaction effects.

Rationale for the Design

The rationale behind this design is as follows. We have a data set comprising four input variables, three of which are categorical and the fourth discrete numeric. The response variable is also categorical.

The first thing that comes to mind is to individually analyze the effect of each of the input variables on the response variable. This is nothing but computing the main effects.

The second level of investigation is to check whether any pair of factors has a synergistic effect on the response variable that exceeds the combined individual effects of the two factors. This is nothing but computing the interaction effects.

Thirdly, from a purely statistical viewpoint, we are dealing with samples of four random variables as inputs, several of which have more than two levels. If we only had to compare the mean response between two groups, we would have opted for a z-test or a t-test. However, since there are more than two groups to compare, ANOVA is preferred, as it tests for differences among several group means at once.
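The link between the t-test and ANOVA can be illustrated with a small sketch on simulated data (the groups, seed and effect size below are arbitrary assumptions, not Titanic data). With exactly two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic, so ANOVA can be seen as the extension of the t-test to more than two groups.

```r
set.seed(1)                                   # arbitrary seed for reproducibility
g <- factor(rep(c("a", "b"), each = 20))      # two hypothetical groups
y <- rnorm(40, mean = ifelse(g == "a", 0, 0.5))

# Pooled-variance two-sample t-test and one-way ANOVA on the same data.
t_stat <- unname(t.test(y ~ g, var.equal = TRUE)$statistic)
f_stat <- anova(aov(y ~ g))[["F value"]][1]

print(c(t_squared = t_stat^2, F = f_stat))
```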

The following subsections shed light on the topics of Randomization, Replication and Repeated Measures, and Blocking from a purely theoretical standpoint as well as from the point of view of this study.

Randomization

Randomization is done in order to allow the greatest reliability and validity of statistical estimates of treatment effects. More precisely, it refers to randomly allocating the experimental units across the treatment groups. Randomization reduces bias by minimizing the effect of nuisance factors or stochastic noise in the data.
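As a minimal sketch of what random allocation looks like in a designed experiment, the code below assigns twelve hypothetical experimental units to three hypothetical treatments with sample(). The unit names, treatment labels and seed are assumptions made purely for illustration.

```r
set.seed(42)                                    # arbitrary seed for reproducibility
units <- paste0("unit", 1:12)                   # hypothetical experimental units
treatments <- rep(c("A", "B", "C"), each = 4)   # balanced design: 4 units per treatment

# sample() shuffles the treatment labels, so each unit receives its treatment at random.
allocation <- data.frame(unit = units, treatment = sample(treatments))

print(allocation)
print(table(allocation$treatment))
```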

In this study, the passengers of the Titanic were divided into three separate classes, determined by the price of their ticket, their wealth and their social class. Passengers in first class were the wealthiest of the lot and included prominent members of the upper class: businessmen, politicians, high-ranking military personnel, industrialists, bankers etc. Passengers in second class included professors, authors, clergymen, tourists etc. Passengers in third class were emigrants moving to the US and Canada. The passengers also varied in ethnicity: some came from the Ottoman Empire, others were of Arab origin, and so on. So, for the purposes of this study, the passengers were well mixed in nationality, ethnicity, economic status, social status, wealth etc., although they were not randomly assigned to the factor levels as they would be in a designed experiment.

Replication and/or Repeated Measures

Replication refers to the repetition of an experiment in order to reduce the variability associated with the phenomenon being studied. The variation inherent in any sampling process cannot be removed; the best that can be done is to remove the variation due to special causes. This is what Replication achieves.

There is a thin line that separates replication from repeated measures. Repeated measurement refers to using the same subjects within every branch of the study, including the control group. For example, to test a person’s logical reasoning skills under the influence of marijuana, they might be subjected to the same test with and without a dose of marijuana.

In this study, the available data is set in stone. Performing replication would mean setting the Titanic to sail multiple times and collecting the data over all those voyages, some of which may sink and others may not. Performing repeated measures would mean setting the Titanic to sail in a number of parallel universes with the same set of passengers. Clearly, neither of these is practical, and hence our study has neither replication nor repeated measures.

Blocking

Blocking refers to arranging experimental units in groups (blocks) that are similar to one another in some way. These blocks are analyzed together in order to reduce known variability. The main idea behind blocking is that variability in any of the input variables that cannot be overcome is confounded with an interaction, so that its influence on the response variable is eliminated.

The theoretical basis for blocking can be understood from the following equation. Given two random variables \(X\) and \(Y\),

\[ Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X,Y) \]

The variance of the difference can be minimized by maximizing the covariance (or the correlation) between \(X\) and \(Y\).
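The identity above can be checked numerically. The sketch below uses simulated values (the seed, sample size and noise level are arbitrary assumptions) and shows that the variance of the difference is much smaller when \(X\) and \(Y\) are strongly positively correlated than when they are independent.

```r
set.seed(7)                             # arbitrary seed for reproducibility
x <- rnorm(1000)
y_indep <- rnorm(1000)                  # independent of x
y_corr  <- x + rnorm(1000, sd = 0.3)    # strongly positively correlated with x

# Right-hand side of the identity: Var(X) + Var(Y) - 2 Cov(X, Y).
var_diff <- function(a, b) var(a) + var(b) - 2 * cov(a, b)

print(c(direct = var(x - y_corr), identity = var_diff(x, y_corr)))
print(c(independent = var(x - y_indep), correlated = var(x - y_corr)))
```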

In this study, we have analyzed the data as a whole because there was no reason to suspect occurrence of known variability for any of the input variables under consideration.

3. In Case the Assumptions Fail…

The previous section outlined a number of assumptions needed for Completely Randomized Designs. However, in real life (for instance, in this study), not all of these assumptions may hold. If some of the assumptions hold (as in this study), it might still be a good idea to proceed with the DoE analysis; this is generally the case with real-world data. If none of the assumptions hold, the DoE analysis methodology won’t yield meaningful results, and it might be a good idea to switch to a different data analysis methodology. Depending on the problem at hand, several techniques from the massive pool of unsupervised learning, supervised learning, reinforcement learning etc. can be applied.

4. Statistical Analysis

Exploratory Data Analysis

Summary of the clean data:

summary(df)
##  Survived Pclass      Sex          Age      Embarked
##  0:424    1:184   female:259   adult :552   C:130   
##  1:288    2:173   male  :453   child :139   Q: 28   
##           3:355                senior: 21   S:554

Plotting the frequencies of all four input variables as bar plots:

barplot(table(df$Pclass), xlab="Class", ylab="Frequency", main="Histogram of Passenger Class")

barplot(table(df$Sex), xlab="Sex", ylab="Frequency", main="Histogram of Sex")

barplot(table(df$Age), xlab="Age", ylab="Frequency", main="Histogram of Age")

barplot(table(df$Embarked), xlab="Port of Embarkment", ylab="Frequency", main="Histogram of Port of Embarkment")

Test

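Before we begin testing, we convert the categorical data frame into a numeric one. Since as.integer() returns the level codes of a factor, Survived is now coded as 1 (died) and 2 (survived), and the other factors are coded in the order of their levels.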

df$Pclass = as.integer(df$Pclass)
df$Sex = as.integer(df$Sex)
df$Age = as.integer(df$Age)
df$Embarked = as.integer(df$Embarked)
df$Survived = as.integer(df$Survived)
head(df)
##   Survived Pclass Sex Age Embarked
## 1        1      3   2   1        3
## 2        2      1   1   1        1
## 3        2      3   1   1        3
## 4        2      1   1   1        3
## 5        1      3   2   1        3
## 6        1      1   2   1        3
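The shift from 0/1 to 1/2 in the Survived column comes from how R converts factors. A small sketch on a made-up factor shows the difference between the level codes returned by as.integer() and the original labels, which can be recovered by going through as.character() first.

```r
survived <- factor(c(0, 1, 1, 0))                # made-up response values

codes  <- as.integer(survived)                   # level codes of the factor
values <- as.integer(as.character(survived))     # the original 0/1 labels

print(rbind(codes, values))

# The two encodings differ by a constant shift of 1, so their means do as well.
print(mean(codes) - mean(values))
```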

Main Effects

On average, there seem to be more survivors from the 1st class than from the other two classes. The 3rd class had the fewest survivors on average.

me_pclass = c(0,0,0)
me_pclass[1] = mean(df$Survived[df$Pclass==1])
me_pclass[2] = mean(df$Survived[df$Pclass==2])
me_pclass[3] = mean(df$Survived[df$Pclass==3])
plot(me_pclass, type="o", main="Main Effect of Passenger Class", xlab="Passenger Class", ylab="Main Effect",
     xaxt="n")
axis(1, at=c(1,2,3), labels=c("1st", "2nd", "3rd"))
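The per-level means computed one by one above can also be obtained in a single call. The following is a sketch on a made-up data frame (`toy` and its values are hypothetical, using the same 1/2 coding for Survived); tapply() from base R groups the response by the factor and applies mean() to each group.

```r
# Made-up data: Survived coded 1 (died) / 2 (survived), two passengers per class.
toy <- data.frame(Survived = c(2, 2, 2, 1, 1, 1),
                  Pclass   = c(1, 1, 2, 2, 3, 3))

# One mean per passenger class, equivalent to mean(toy$Survived[toy$Pclass == k]) for each k.
me <- tapply(toy$Survived, toy$Pclass, mean)
print(me)
```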

On average, far more females survived than males.

me_sex = c(0,0)
me_sex[1] = mean(df$Survived[df$Sex==1])
me_sex[2] = mean(df$Survived[df$Sex==2])
plot(me_sex, type="o", main="Main Effect of Sex", xlab="Sex", ylab="Main Effect", xaxt="n")
axis(1, at=c(1,2), labels=c("Female", "Male"))

Among the age categories, children had the most survivors on average, followed by adults and then senior citizens.

me_age = c(0,0,0)
me_age[1] = mean(df$Survived[df$Age==1])
me_age[2] = mean(df$Survived[df$Age==2])
me_age[3] = mean(df$Survived[df$Age==3])
plot(me_age, type="o", main="Main Effect of Age", xlab="Age", ylab="Main Effect", xaxt="n")
axis(1, at=c(1,2,3), labels=c("Adult", "Children", "Senior Citizen"))

The people who boarded at Cherbourg had the most survivors on average, followed by those who boarded at Southampton and lastly those who boarded at Queenstown.

me_emb = c(0,0,0)
me_emb[1] = mean(df$Survived[df$Embarked==1])
me_emb[2] = mean(df$Survived[df$Embarked==2])
me_emb[3] = mean(df$Survived[df$Embarked==3])
plot(me_emb, type="o", main="Main Effect of Port of Embarkment", xlab="Port of Embarkment", ylab="Main Effect",
     xaxt="n")
axis(1, at=c(1,2,3), labels=c("Cherbourg", "Queenstown", "Southampton"))

Interaction Effects

There is a clear interaction effect between Passenger Class and Sex. First and second class female passengers had a higher mean number of survivors than the third class female passengers. First class male passengers had a higher mean number of survivors than the second and third class male passengers.

interaction.plot(df$Pclass, df$Sex, df$Survived, xlab="Passenger Class", ylab="Mean number of Survivors",
                  main="Interaction Effect between Passenger Class and Sex", legend=FALSE)
legend("topright", c("Female","Male"), lty=c("dashed", "solid"), title="Sex")
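An interaction plot is a plot of the cell means for every combination of two factors. As a sketch on made-up data (`toy` and its values are hypothetical), the cell means can be tabulated with tapply(), and the interaction shows up as a difference of differences that is not zero, which corresponds to non-parallel lines in the plot.

```r
# Made-up data: Survived coded 1/2, two classes and two sexes, two passengers per cell.
toy <- data.frame(
  Survived = c(2, 2, 1, 2, 1, 1, 2, 1),
  Pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  Sex      = c(1, 1, 2, 2, 1, 1, 2, 2)
)

# Rows are passenger classes, columns are sexes; each entry is a cell mean.
cells <- tapply(toy$Survived, list(toy$Pclass, toy$Sex), mean)
print(cells)

# Class effect within sex 1 minus class effect within sex 2; zero would mean parallel lines.
contrast <- (cells["1", "1"] - cells["3", "1"]) - (cells["1", "2"] - cells["3", "2"])
print(contrast)
```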

There is an interaction effect between Passenger Class and Age. Overall, children had a higher mean number of survivors than adults, who in turn fared better than senior citizens. More 1st class children survived than 2nd class children, who had more survivors than 3rd class children. The same trend holds for adults. However, for senior citizens the mean number of survivors was highest in the 2nd class and lowest in the 1st class.

interaction.plot(df$Pclass, df$Age, df$Survived, xlab="Passenger Class", ylab="Mean number of Survivors",
                 main="Interaction Effect between Passenger Class and Age", legend=FALSE)
# interaction.plot draws the trace levels with lty = 3:1 by default (1st level dotted, 2nd dashed, 3rd solid)
legend("topright", c("Adult","Child", "Senior"), lty=c("dotted", "dashed", "solid"), title="Age")

There is also an interaction effect between Passenger Class and Port of Embarkment. For passengers who boarded the ship at Southampton and Cherbourg, the 1st class passengers survived the most and the 3rd class passengers survived the least. For passengers who boarded the ship at Queenstown, the mean number of survivors was almost the same for 1st and 2nd class passengers, and greater than for the 3rd class passengers. For 2nd and 3rd class passengers who boarded at Queenstown and Southampton, there doesn’t seem to be any interaction effect.

interaction.plot(df$Pclass, df$Embarked, df$Survived, xlab="Passenger Class", ylab="Mean number of Survivors",
                 main="Interaction Effect between Passenger Class and Port of Embarkment", legend=FALSE)
# interaction.plot draws the trace levels with lty = 3:1 by default (1st level dotted, 2nd dashed, 3rd solid)
legend("topright", c("Cherbourg","Queenstown","Southampton"), lty=c("dotted", "dashed", "solid"),
       title="Port of Embarkment")

It is interesting to note that at such a critical time (when the ship was sinking), the mean number of survivors for the 1st class was clearly greater than for the other two classes. This reflects and reinforces their power and stature in society: they were given priority in reaching safety even at a time when the whole ship was sinking and almost everybody had a substantial chance of meeting their death that day.

There seems to be an interaction effect between Age and Sex. Among males, children had more survivors on average than adults, who had more than senior citizens. In contrast, among females, senior citizens had more survivors on average than adults, who had more than children. Overall, there were more female survivors than male survivors across all age groups.

interaction.plot(df$Sex, df$Age, df$Survived, xlab="Sex", ylab="Mean number of Survivors",
                 main="Interaction Effect between Sex and Age", legend=FALSE, xtick=FALSE, xaxt="n")
axis(1, c(1,2), labels=c("Female", "Male"))
# interaction.plot draws the trace levels with lty = 3:1 by default (1st level dotted, 2nd dashed, 3rd solid)
legend("topright", c("Adult","Child", "Senior"), lty=c("dotted", "dashed", "solid"), title="Age")

There doesn’t seem to be any interaction effect between Sex and Port of Embarkment.

interaction.plot(df$Sex, df$Embarked, df$Survived, xlab="Sex", ylab="Mean number of Survivors", 
                 main="Interaction Effect between Sex and Port of Embarkment", legend=FALSE, xtick = FALSE, xaxt="n")
axis(1, c(1,2), labels=c("Female", "Male"))
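# interaction.plot draws the trace levels with lty = 3:1 by default (1st level dotted, 2nd dashed, 3rd solid)
legend("topright", c("Cherbourg","Queenstown","Southampton"), lty=c("dotted", "dashed", "solid"),
       title="Port of Embarkment")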

There is an interaction effect between Age and Port of Embarkment. The senior citizens who boarded at Cherbourg and Queenstown survived in fewer numbers than those who boarded the ship at Southampton. For adults and children, the people who boarded at Cherbourg survived more than those from Southampton, who survived more than those from Queenstown. There doesn’t seem to be an interaction effect in the adults-children part of the graph.

interaction.plot(df$Age, df$Embarked, df$Survived, xlab="Age", ylab="Mean number of Survivors", 
                 main="Interaction Effect between Age and Port of Embarkment", legend=FALSE, xtick = FALSE, xaxt="n")
axis(1, c(1,2,3), labels=c("Adults", "Children", "Senior Citizens"))
# interaction.plot draws the trace levels with lty = 3:1 by default (1st level dotted, 2nd dashed, 3rd solid)
legend("topright", c("Cherbourg","Queenstown","Southampton"), lty=c("dotted", "dashed", "solid"),
       title="Port of Embarkment")

ANOVA

The main effects of Passenger Class, Sex and Port of Embarkment are significant. There were more survivors in the 1st class than in the 2nd and 3rd classes, and more female survivors than male. Since the inputs were converted to integers above, each one enters the model as a single numeric term with one degree of freedom. Under this coding there is no significant effect of Age, meaning the mean number of survivors does not change significantly across the age codes.

me1 = aov(df$Survived ~ df$Pclass)
anova(me1)
## Analysis of Variance Table
## 
## Response: df$Survived
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## df$Pclass   1  21.792 21.7923  103.35 < 2.2e-16 ***
## Residuals 710 149.713  0.2109                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
me2 = aov(df$Survived ~ df$Sex)
anova(me2)
## Analysis of Variance Table
## 
## Response: df$Survived
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## df$Sex      1  49.413  49.413  287.35 < 2.2e-16 ***
## Residuals 710 122.093   0.172                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
me3 = aov(df$Survived ~ df$Age)
anova(me3)
## Analysis of Variance Table
## 
## Response: df$Survived
##            Df  Sum Sq Mean Sq F value Pr(>F)
## df$Age      1   0.129 0.12945  0.5363 0.4642
## Residuals 710 171.376 0.24138
me4 = aov(df$Survived ~ df$Embarked)
anova(me4)
## Analysis of Variance Table
## 
## Response: df$Survived
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## df$Embarked   1   5.68  5.6797  24.318 1.018e-06 ***
## Residuals   710 165.83  0.2336                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
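The single degree of freedom reported for three-level inputs in the tables above follows from the integer coding. As a hedged sketch on simulated data (the data frame `toy`, its values and the seed are assumptions, and this is not the analysis performed in this report), wrapping a numeric column in factor() makes aov() treat it as categorical, which changes the degrees of freedom from 1 to the number of levels minus one.

```r
set.seed(3)                                   # arbitrary seed for reproducibility
toy <- data.frame(Pclass = rep(1:3, each = 30),
                  y      = rnorm(90))         # made-up response

as_numeric <- anova(aov(y ~ Pclass, data = toy))           # Pclass as a numeric slope
as_factor  <- anova(aov(y ~ factor(Pclass), data = toy))   # Pclass as a 3-level factor

print(as_numeric$Df)   # degrees of freedom: effect, residuals
print(as_factor$Df)
```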

There are significant main effects of Passenger Class, Sex and Port of Embarkment. This is reinforced yet again by the 2-way interaction ANOVAs. The interaction of Passenger Class is significant with Sex and with Age, but not with Port of Embarkment. The interaction of Sex is not significant with either Age or Port of Embarkment. The interaction of Age with Port of Embarkment is not significant.

ie12 = aov(df$Survived ~ df$Pclass * df$Sex)
anova(ie12)
## Analysis of Variance Table
## 
## Response: df$Survived
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## df$Pclass          1  21.792  21.792 145.192 < 2.2e-16 ***
## df$Sex             1  40.941  40.941 272.774 < 2.2e-16 ***
## df$Pclass:df$Sex   1   2.506   2.506  16.697 4.887e-05 ***
## Residuals        708 106.266   0.150                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie13 = aov(df$Survived ~ df$Pclass * df$Age)
anova(ie13)
## Analysis of Variance Table
## 
## Response: df$Survived
##                   Df  Sum Sq Mean Sq  F value    Pr(>F)    
## df$Pclass          1  21.792 21.7923 105.0161 < 2.2e-16 ***
## df$Age             1   0.426  0.4257   2.0513 0.1525146    
## df$Pclass:df$Age   1   2.368  2.3676  11.4091 0.0007708 ***
## Residuals        708 146.920  0.2075                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie14 = aov(df$Survived ~ df$Pclass * df$Embarked)
anova(ie14)
## Analysis of Variance Table
## 
## Response: df$Survived
##                        Df  Sum Sq Mean Sq  F value    Pr(>F)    
## df$Pclass               1  21.792 21.7923 104.4346 < 2.2e-16 ***
## df$Embarked             1   1.644  1.6443   7.8797  0.005137 ** 
## df$Pclass:df$Embarked   1   0.331  0.3309   1.5856  0.208364    
## Residuals             708 147.738  0.2087                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie23 = aov(df$Survived ~ df$Sex * df$Age)
anova(ie23)
## Analysis of Variance Table
## 
## Response: df$Survived
##                Df  Sum Sq Mean Sq  F value Pr(>F)    
## df$Sex          1  49.413  49.413 287.9032 <2e-16 ***
## df$Age          1   0.011   0.011   0.0659 0.7975    
## df$Sex:df$Age   1   0.567   0.567   3.3025 0.0696 .  
## Residuals     708 121.514   0.172                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie24 = aov(df$Survived ~ df$Sex * df$Embarked)
anova(ie24)
## Analysis of Variance Table
## 
## Response: df$Survived
##                     Df  Sum Sq Mean Sq F value    Pr(>F)    
## df$Sex               1  49.413  49.413 292.901 < 2.2e-16 ***
## df$Embarked          1   2.632   2.632  15.600 8.607e-05 ***
## df$Sex:df$Embarked   1   0.020   0.020   0.118    0.7314    
## Residuals          708 119.441   0.169                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ie34 = aov(df$Survived ~ df$Age * df$Embarked)
anova(ie34)
## Analysis of Variance Table
## 
## Response: df$Survived
##                     Df  Sum Sq Mean Sq F value    Pr(>F)    
## df$Age               1   0.129  0.1294  0.5536    0.4571    
## df$Embarked          1   5.641  5.6410 24.1234 1.123e-06 ***
## df$Age:df$Embarked   1   0.176  0.1758  0.7517    0.3862    
## Residuals          708 165.559  0.2338                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimation

Based on exploratory data analysis, main effects, 2-way interaction effects, and ANOVA for all the main and interaction effects, we can estimate that if the Titanic (or, for that matter, any ship) were to set sail again and meet the same fate, then on average there would be more survivors from the upper class, more survivors who are women and children, and more survivors who boarded the ship at Cherbourg.

Diagnostics and Model Adequacy Checking

The Q-Q plot is a tool to compare two distributions against each other based on their quantiles; here, the quantiles of the residuals are compared with those of a normal distribution. None of the Q-Q plots below adhere to the line, which suggests that the residuals are not normally distributed.

The residual plot is used in statistical data analysis to detect non-linearity, unequal error variances and outliers. In most of the following residuals vs. fits plots, the residuals do not bounce randomly around the zero line; instead, they fall along lines with a negative slope, which is what happens when the response takes only two values. This suggests that a linear model does not describe the data well. The residuals roughly stay within a horizontal band around the zero line, suggesting that the variances of the error terms are equal. Lastly, no residual stands out from the basic pattern of the residuals, which suggests that there are no outliers.

These observations suggest that not all the assumptions for deploying ANOVA are met by the dataset. This is to be expected because, as explained above, there was no scope for replication or repeated measures. However, we still get reasonable results because some of the assumptions for ANOVA do hold for the dataset.

qqnorm(residuals(me1))
qqline(residuals(me1))

plot(fitted(me1), residuals(me1))

qqnorm(residuals(me2))
qqline(residuals(me2))

plot(fitted(me2), residuals(me2))

qqnorm(residuals(me3))
qqline(residuals(me3))

plot(fitted(me3), residuals(me3))

qqnorm(residuals(me4))
qqline(residuals(me4))

plot(fitted(me4), residuals(me4))

qqnorm(residuals(ie12))
qqline(residuals(ie12))

plot(fitted(ie12), residuals(ie12))

qqnorm(residuals(ie13))
qqline(residuals(ie13))

plot(fitted(ie13), residuals(ie13))

qqnorm(residuals(ie14))
qqline(residuals(ie14))

plot(fitted(ie14), residuals(ie14))

qqnorm(residuals(ie23))
qqline(residuals(ie23))

plot(fitted(ie23), residuals(ie23))

qqnorm(residuals(ie24))
qqline(residuals(ie24))

plot(fitted(ie24), residuals(ie24))

qqnorm(residuals(ie34))
qqline(residuals(ie34))

plot(fitted(ie34), residuals(ie34))
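The ten pairs of diagnostic plots above repeat the same two steps for each model. As a sketch of how this could be written once, the code below defines a hypothetical helper, plot_diagnostics(), and applies it to a list of models. The models here are fitted on simulated data (`toy`, the seed and the model names are assumptions); in the report, the list would hold me1 to me4 and ie12 to ie34.

```r
set.seed(5)                                   # arbitrary seed for reproducibility
toy <- data.frame(y = rnorm(60),
                  a = rep(1:3, each = 20),
                  b = rep(1:2, times = 30))   # made-up inputs and response

models <- list(a = aov(y ~ a, data = toy), ab = aov(y ~ a * b, data = toy))

# Hypothetical helper: draws the Q-Q plot and the residuals vs. fits plot for one model.
plot_diagnostics <- function(model, label) {
  res <- residuals(model)
  qqnorm(res, main = paste("Normal Q-Q:", label))
  qqline(res)
  plot(fitted(model), res, xlab = "Fitted", ylab = "Residuals",
       main = paste("Residuals vs. fits:", label))
  abline(h = 0, lty = "dashed")               # reference line at zero
  invisible(length(res))                      # number of residuals plotted
}

# One pair of plots per model; the result holds the residual counts.
n_res <- mapply(plot_diagnostics, models, names(models))
print(n_res)
```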

6. Appendices

Data downloaded from Kaggle.com

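# stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0.0)
df = read.csv('titanic.csv', header=TRUE, stringsAsFactors=TRUE)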