I found an interesting historical data set about the Titanic, the famous British luxury passenger liner that sank in \(1912\). The data is available in the package carData under the name TitanicSurvival. You can see this source for more historical information about the Titanic. But for now, let’s have a closer look at our data.

Our data has \(1309\) observations, one per passenger, on four variables: whether the passenger survived, the passenger’s sex, the passenger’s age, and the passenger class (first, second, or third).

#load the package containing the data.
library(carData)

#load the data itself.
data("TitanicSurvival")

#have a look!
TitanicSurvival[1:15,]
                                survived    sex     age passengerClass
Allen, Miss. Elisabeth Walton        yes female 29.0000            1st
Allison, Master. Hudson Trevor       yes   male  0.9167            1st
Allison, Miss. Helen Loraine          no female  2.0000            1st
Allison, Mr. Hudson Joshua Crei       no   male 30.0000            1st
Allison, Mrs. Hudson J C (Bessi       no female 25.0000            1st
Anderson, Mr. Harry                  yes   male 48.0000            1st
Andrews, Miss. Kornelia Theodos      yes female 63.0000            1st
Andrews, Mr. Thomas Jr                no   male 39.0000            1st
Appleton, Mrs. Edward Dale (Cha      yes female 53.0000            1st
Artagaveytia, Mr. Ramon               no   male 71.0000            1st
Astor, Col. John Jacob                no   male 47.0000            1st
Astor, Mrs. John Jacob (Madelei      yes female 18.0000            1st
Aubart, Mme. Leontine Pauline        yes female 24.0000            1st
Barber, Miss. Ellen Nellie           yes female 26.0000            1st
Barkworth, Mr. Algernon Henry W      yes   male 80.0000            1st
#create a summary for our variables.
summary(TitanicSurvival)
 survived      sex           age          passengerClass
 no :809   female:466   Min.   : 0.1667   1st:323       
 yes:500   male  :843   1st Qu.:21.0000   2nd:277       
                        Median :28.0000   3rd:709       
                        Mean   :29.8811                 
                        3rd Qu.:39.0000                 
                        Max.   :80.0000                 
                        NA's   :263                     
As you can see, fewer than \(40\%\) of the passengers survived, and the majority of passengers, more than \(60\%\), were male. The passengers’ ages ranged from \(2\) months to \(80\) years, but note that age is missing for \(263\) passengers, so there may have been even younger or older passengers.

Third-class passengers outnumbered the first-class and second-class passengers combined, and there were more than twice as many of them as first-class passengers.
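
The percentages quoted above can be checked directly from the counts printed by summary(). The following is a minimal sketch in base R; the counts are copied by hand from the output above, and the helper name pct is only an illustrative choice, not something provided by carData.

```r
# Counts copied from the summary(TitanicSurvival) output above.
survived <- c(no = 809, yes = 500)
sex <- c(female = 466, male = 843)
passenger_class <- c("1st" = 323, "2nd" = 277, "3rd" = 709)

# pct() is a hypothetical helper: it turns counts into percentages
# of their total, rounded to one decimal place.
pct <- function(counts) round(100 * counts / sum(counts), 1)

pct(survived)         # share of passengers who did / did not survive
pct(sex)              # share of female / male passengers
pct(passenger_class)  # share of passengers in each class
```

Each vector sums to \(1309\), so the percentages are shares of all passengers in the data.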

Let’s look at survival by sex and by class.

library(ggplot2)

ggplot(data=TitanicSurvival,mapping=aes(x=survived))+geom_bar(aes(fill=sex))

ggplot(data=TitanicSurvival,mapping=aes(x=survived))+geom_bar(aes(fill=passengerClass))

#passengers' gender by class.
table(TitanicSurvival$passengerClass,TitanicSurvival$sex)
     
      female male
  1st    144  179
  2nd    106  171
  3rd    216  493
#boxplot of the passengers' age, with a vertical line at the mean age (missing ages dropped).
boxplot(TitanicSurvival$age,horizontal = TRUE,xlab="Passengers' Age")
abline(v=mean(TitanicSurvival$age,na.rm = TRUE))

#passengers' age by class.
ggplot(data=TitanicSurvival,mapping=aes(x=age,fill=passengerClass))+geom_bar()

As you can tell, most of the second and third class passengers were younger than \(40\).

#passengers' age by survival status (counts, not proportions).
ggplot(data=TitanicSurvival,mapping=aes(x=age,fill=survived))+geom_bar()

The Probability of survival:

We want to see whether the probability of survival differs by class. Let \(S_i\) denote the survival of passenger \(i\), and \(C_i\) the class of passenger \(i\). Then, \[\begin{equation} \mathrm P(S_i \mid C_i) = \frac{\mathrm P (S_i \cap C_i)}{\mathrm P(C_i)} \end{equation}\]

The probability of survival conditional on holding, for example, a first-class ticket is the ratio of the probability that passenger \(i\) both survives and holds a first-class ticket to the probability that passenger \(i\) holds a first-class ticket.
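
As a worked illustration, assume each probability is estimated by its sample frequency (an assumption of this sketch). In the data, \(200\) of the \(323\) first-class passengers survived, out of \(1309\) passengers in total, so

\[ \mathrm P(S_i \mid C_i = \text{1st}) = \frac{\mathrm P(S_i \cap C_i = \text{1st})}{\mathrm P(C_i = \text{1st})} \approx \frac{200/1309}{323/1309} = \frac{200}{323} \approx 0.619 \]

The total number of passengers cancels, which is why the conditional probability reduces to the share of survivors within the class.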

#calculate the probability of survival conditional on having a first-class ticket. You can also use this code to get the same result: nrow(dplyr::filter(TitanicSurvival,survived=="yes",passengerClass=="1st"))/nrow(dplyr::filter(TitanicSurvival,passengerClass=="1st"))
mean(TitanicSurvival$survived[TitanicSurvival$passengerClass=="1st"]=="yes")
[1] 0.619195
#calculate the probability of survival conditional on having a second-class ticket.
mean(TitanicSurvival$survived[TitanicSurvival$passengerClass=="2nd"]=="yes")
[1] 0.4296029
#calculate the probability of survival conditional on having a third-class ticket.
mean(TitanicSurvival$survived[TitanicSurvival$passengerClass=="3rd"]=="yes")
[1] 0.2552891
#calculate the probability of survival. You can also use this code to get the same result: nrow(dplyr::filter(TitanicSurvival,survived=="yes"))/nrow(TitanicSurvival)
mean(TitanicSurvival$survived=="yes")
[1] 0.381971
The probabilities of survival conditional on having a first-class, a second-class, and a third-class ticket are \(62\%\), \(43\%\), and \(26\%\), respectively, while the unconditional probability of survival is \(38\%\), so we can’t assume independence. Mathematically, \(\mathrm{P}(S_i \mid C_i) \neq \mathrm{P}(S_i)\).
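
All three conditional probabilities can also be obtained in one step from the cross-tabulation of survival by class. This is a minimal sketch in base R: the counts are entered by hand (they are the ones returned by table(TitanicSurvival$survived,TitanicSurvival$passengerClass)), and the object names counts, cond, and overall are illustrative.

```r
# Survival-by-class counts, entered by hand so the sketch runs without carData.
# With the data loaded, table(TitanicSurvival$survived,
# TitanicSurvival$passengerClass) gives the same counts.
counts <- matrix(
  c(123, 200,   # 1st class: no, yes
    158, 119,   # 2nd class: no, yes
    528, 181),  # 3rd class: no, yes
  nrow = 2,
  dimnames = list(survived = c("no", "yes"),
                  passengerClass = c("1st", "2nd", "3rd"))
)

# margin = 2 divides each column by its total, so every column holds
# the distribution of survival conditional on that class.
cond <- prop.table(counts, margin = 2)
round(cond["yes", ], 3)

# Unconditional probability of survival, for comparison.
overall <- sum(counts["yes", ]) / sum(counts)
round(overall, 3)
```

If survival and class were independent, every entry of the "yes" row would be close to the overall survival rate.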

We can also run a \(\chi^2\) test of independence and look at the magnitude of the p-value.

#run chi-squared test.
chisq.test(TitanicSurvival$survived,TitanicSurvival$passengerClass)

    Pearson's Chi-squared test

data:  TitanicSurvival$survived and TitanicSurvival$passengerClass
X-squared = 127.86, df = 2, p-value < 2.2e-16

The p-value is almost equal to \(0\), so we reject the null hypothesis that survival and ticket class are independent at any conventional significance level.
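
To see where the test statistic comes from, it can be rebuilt by hand. The sketch below, in base R, compares the observed counts with the counts expected under independence (row total times column total, divided by the number of passengers). The counts are entered manually from the survival-by-class cross-tabulation, and the object names are illustrative.

```r
# Observed survival-by-class counts, entered by hand.
observed <- matrix(
  c(123, 200, 158, 119, 528, 181),
  nrow = 2,
  dimnames = list(survived = c("no", "yes"),
                  passengerClass = c("1st", "2nd", "3rd"))
)
n <- sum(observed)

# Expected counts if survival and class were independent.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson's chi-squared statistic and its degrees of freedom.
chi2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

round(chi2, 2)
df
p_value
```

The statistic and degrees of freedom should agree with the chisq.test() output above, since no continuity correction is applied to tables larger than \(2 \times 2\).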

To get a sense of the strength of the association between the two variables, we can calculate the Contingency Coefficient and Cramer’s V:

#first have a look at our two variables.
table(TitanicSurvival$survived,TitanicSurvival$passengerClass)
     
      1st 2nd 3rd
  no  123 158 528
  yes 200 119 181

Contingency Coefficient \(=\sqrt{\frac {127.86}{1309+127.86}} \approx 0.298\).

Cramer’s V \(= \sqrt{\frac {127.86}{1309}} \approx 0.313\).

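In general, Cramer’s V \(=\sqrt{\frac{\chi^2}{n \cdot \min{(r-1,c-1)}}}\), where \(n\) is the number of observations and \(r\) and \(c\) are the numbers of rows and columns of the table. The extra factor drops out of the calculation above since \(\min{(2-1,3-1)}=1\).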
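
The two hand calculations above can be reproduced in a few lines of base R. This is a minimal sketch that plugs in the rounded test statistic reported by chisq.test(); the variable names are illustrative.

```r
# Inputs taken from the chi-squared test output and the table dimensions.
chi2 <- 127.86   # rounded X-squared reported by chisq.test()
n <- 1309        # number of passengers
n_rows <- 2      # survived: no, yes
n_cols <- 3      # passengerClass: 1st, 2nd, 3rd

# Contingency coefficient: sqrt(chi2 / (n + chi2)).
contingency_coef <- sqrt(chi2 / (n + chi2))

# Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1))).
cramers_v <- sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

round(contingency_coef, 3)
round(cramers_v, 3)
```

Because the statistic is rounded to two decimals, the results may differ from an exact calculation in the later decimal places.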

In R, we can calculate the Contingency Coefficient and Cramer’s V as well, using the DescTools package.

library(DescTools)

#calculate Contingency Coefficient.
ContCoef(TitanicSurvival$survived,TitanicSurvival$passengerClass)
[1] 0.2983038
#calculate Cramer's V.
CramerV(TitanicSurvival$survived,TitanicSurvival$passengerClass)
[1] 0.3125332

The two variables are not independent, as we said before, but the association is not very strong either.

Finally, I hope you had fun!