Final report

First of all, we will describe the aim of our project:

The aim of this project is to delve into the factors influencing the likelihood of survival, focusing particularly on the role of ticket prices and classes.

The results of such investigation may not just improve the history knowledge but also save someones life (or even show how valuable money were while facing the death). Moreover, the outcome may finally end the discussions about privileges of the rich.

1. Read the data

First, we’ll need some libraries:

require(BSDA)
## Loading required package: BSDA
## Loading required package: lattice
## 
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
## 
##     Orange
library(BSDA)
require(EnvStats)
## Loading required package: EnvStats
## 
## Attaching package: 'EnvStats'
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
library(EnvStats)
library(vcd)
## Loading required package: grid
## 
## Attaching package: 'vcd'
## The following object is masked from 'package:BSDA':
## 
##     Trucks
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(nortest)
library(aod)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ResourceSelection)
## ResourceSelection 0.3-6   2023-06-27

Then, we will read the data from the CSV file and select the columns we’re planning to work with:

data <- read.csv("titanic.csv", sep=",")

data <- data %>% select("Survived", "Pclass", "Sex","Age","Fare", "SibSp")
head(data)
##   Survived Pclass    Sex Age    Fare SibSp
## 1        0      3   male  22  7.2500     1
## 2        1      1 female  38 71.2833     1
## 3        1      3 female  26  7.9250     0
## 4        1      1 female  35 53.1000     1
## 5        0      3   male  35  8.0500     0
## 6        0      3   male  NA  8.4583     0

Let’s take a quick look at data we’ll be using:

data$Survival <- cut(data$Survived, 2, labels=c('Did not survive','Survived'))
p1 <- ggplot(data, aes(x=Fare, fill = Survival)) + 
  geom_histogram(binwidth = 20, color="white")+
  scale_fill_manual(values=c("#a2d2ff", "#ffafcc"))

p2 <- ggplot(data, aes(x=Pclass, fill = Survival)) + 
  geom_histogram(binwidth = 1, color="white")+
  scale_fill_manual(values=c("#a2d2ff", "#ffafcc"))

grid.arrange(p1, p2) 

From these histograms it seems that people, who paid more and were higher class (consider Rose from famous movie) actually survived more. So, we constructed a couple hypotheses according to this observation.

To test our hypotheses, we decided to split the data into two parts: those passengers, who survived and those, who did not:

data_survived <- data[data$Survived == 1,]
head(data_survived)
##    Survived Pclass    Sex Age    Fare SibSp Survival
## 2         1      1 female  38 71.2833     1 Survived
## 3         1      3 female  26  7.9250     0 Survived
## 4         1      1 female  35 53.1000     1 Survived
## 9         1      3 female  27 11.1333     0 Survived
## 10        1      2 female  14 30.0708     1 Survived
## 11        1      3 female   4 16.7000     1 Survived
data_died <- data[data$Survived == 0,]
head(data_died)
##    Survived Pclass  Sex Age    Fare SibSp        Survival
## 1         0      3 male  22  7.2500     1 Did not survive
## 5         0      3 male  35  8.0500     0 Did not survive
## 6         0      3 male  NA  8.4583     0 Did not survive
## 7         0      1 male  54 51.8625     0 Did not survive
## 8         0      3 male   2 21.0750     3 Did not survive
## 13        0      3 male  20  8.0500     0 Did not survive

Part 1: Testing means of ticket price.

Assume \(X_k\) is the ticket price of passenger who did not survive of Titanic and \(Y_k\) is the ticket price of passenger who survived, with means \(\mu_0\) and \(\mu_1\) and variances \(\sigma_0\) and \(\sigma_1\) respectively. The hypotheses we will be testing:

\[ H_0: \mu_0=\mu_1 \quad vs \quad H_1: \mu_0<\mu_1 \]

To test this hypothesis we will use Welch’s t-test for two samples

But first, we need to extract the samples from our dataframe:

x <- data_died$Fare
y <- data_survived$Fare

To find out if our tests would be approximate or exact, let’s find out if our data is normally distributed or not. To do it let’s use Pearson chi-square normality test. Let our hypotheses be

\(H_0:\) the data is sampled from a normal distribution.

\(H_1:\) the data is not distributed normally.

Given the mean and standard deviation we have to calculate the expected values under the normal distribution for every data point. Then use formula\[ t:= \sum_i \frac{(o_i-e_i)^2}{e_i} \]where \(o_i\) is the observed values, \(e_i\) is predicted/expected values under the normal distribution for every data point.

Under \(H_0\) \(t\) has approximately a \(\chi^2_k\) with \(k\) degrees of freedom (degrees of freedom of \(H_1\) minus degrees of freedom of \(H_0\)). Under \(H_1\) \(t\) takes larger values.

pearson.test(x)
## 
##  Pearson chi-square normality test
## 
## data:  x
## P = 2023.7, p-value < 2.2e-16
pearson.test(y)
## 
##  Pearson chi-square normality test
## 
## data:  y
## P = 598.46, p-value < 2.2e-16

We can see that both of our sample are not normally distributed as the \(p\)-value is less than \(0.05\). So our tests will be approximate.

What is Welch’s t-test for Two Samples?

Welch’s Two Sample t-test is a modification of the one sample Student t-test. The latter is used to test mean if variance is unknown and it uses the following statistic:

\[ T(\mathbf{X}) := \frac{\sqrt{n}}{S(\mathbf{X})}(\overline{\mathbf{X}}-\mu_0)=\frac{Z(\mathbf{X})}{S(\mathbf{X})/\sigma} \]

Which has under \(H_0\) has Student distribution distribution with \(n-1\) degrees of freedom.

But what if we have two samples with unknown variances and want to test if their means are equal?

We can use \(\overline{\mathbf{Y}}-\overline{\mathbf{X}}\) to estimate \(\mu_1-\mu_0\). It’s natural to reject \(H_0\) if \(\overline{\mathbf{Y}}-\overline{\mathbf{X}}\) if far from zero. Thus, we have:

\(C_\alpha = \left\{\mathbf{x,y}\in \mathbb{R}|\space \overline{\mathbf{Y}}-\overline{\mathbf{X}} > c_\alpha\right\}\)

Under \(H_0\), \(\overline{\mathbf{X}}\) and \(\overline{\mathbf{Y}}\) have approximately Normal distribution with means \(\mu_0\) and \(\mu_1\) and variances \(\sigma_0/n\) and \(\sigma_1/m\) (n and m are sizes of samples) respectively. We can conclude that \(\overline{\mathbf{Y}}-\overline{\mathbf{X}}\) is approximately has \(N(0, \sigma_0^2/n+\sigma_1^2/m)\) distribution.

So, we can also use z-statistic for two samples and replace variances with their estimates \(S^2(\mathbf{X})=S_{\mathbf{xx}}/(n-1)\) and \(S^2(\mathbf{Y})=S_{\mathbf{yy}}/(m-1)\). We will get the following statistic:

\[ T(\mathbf{X},\mathbf{Y}) = \sqrt{\frac{mn}{S^2(\mathbf{X})m+S^2(\mathbf{Y})n}}(\overline{\mathbf{Y}}-\overline{\mathbf{X}}) \]

Which has approximately Student distribution \(T_{df}\),

where \(df=\frac{\left(\frac{S^2(\mathbf{X})}{n}+\frac{S^2(\mathbf{Y})}{m}\right)^2}{\frac{1}{n-1}\left(\frac{S^2(\mathbf{X})}{n}\right)^2+\frac{1}{m-1}\left(\frac{S^2(\mathbf{Y})}{m}\right)^2}\)

Thus, we can use quantiles of Student distribution:

\(C_\alpha = \left\{\mathbf{x,y}\in \mathbb{R}|\space T(\mathbf{x},\mathbf{y}) > t_{1-\alpha}^{df}\right\}\), where \(t_{1-\alpha}^{df}\) is the \(100(1-\alpha)\%\) quantile of Student distribution.

P-value:

\(P(T(\mathbf{x},\mathbf{y}) > c_{\alpha})=\alpha\\ P(T(\mathbf{x},\mathbf{y}) \leq c_{\alpha})=1-\alpha\\\alpha=1-F_{{T}_{df}}(T(\mathbf{x},\mathbf{y}))\)

t_test <- t.test(x, y, var.equal=FALSE, alternative = "l")
t_test
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -6.8391, df = 436.7, p-value = 1.35e-11
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -19.94415
## sample estimates:
## mean of x mean of y 
##  22.11789  48.39541

Conclusion: p-value we got is very small, so we can (with very little probability of mistake) reject \(H_0\). As we have only one sample available, we will not be able to conduct more tests, thus we have to accept \(H_1\): mean ticket price of people who survived on titanic is higher than of those who did not.


But… wait, maybe our result is not valid?

Let’s take a look at our data:

qplot(x=factor(c(rep(1, length(x)), rep(2, length(y)))),
      y=c(x, y),
      fill = I("#a2d2ff"),
      col = I("#023e8a"),
      xlab="",
      ylab="Ticket price",
      geom="boxplot") + scale_x_discrete(labels = c("Died", "Survived"))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

hist(y, col="violet", ylim=c(0,600),
     border=F,
     breaks=seq(0,600,by=5), main="Histogram of ticket prices for people who survived and did not", xlab="Ticket price")

hist(x, col=scales::alpha("skyblue", 0.5),
     add=TRUE,
     border=F,
     breaks=seq(0,600,by=5))

As we can see, there is a lot outliers in our sample. So, we decided to delete them to see what will be the result if we make our data a little bit more symmetrical

Boxplot The “boxplot function” typically uses a method to identify and visualize outliers based on the interquartile range (IQR) - the range between the first quartile \((Q_1)\) and the third quartile \((Q_3)\)

It represents the interquartile range, the median, and “whiskers” that extend from the box to the minimum and maximum values within a certain range. Outliers are points that fall outside a specified range beyond the whiskers.

Boxplot identifies outliers by calculating

\[ IQR = (Q_3) - (Q_1) \]

Determining the whisker length (in R it is \(1.5\) times the IQR by default) - resemble only the range \([(Q_1) - 1.5 * IQR; \space (Q_3) + 1.5 * IQR]\);

This part of data analysis is extremely useful in our case as there were few tickets which price was extremely high. That caused a skewness which drastically caused our observations

outliers <- boxplot(x, plot = FALSE)$out
x_no_outlier <- x[-which(x %in% outliers)] 
boxplot(x_no_outlier, plot = FALSE)$out
##  [1] 51.8625 52.0000 46.9000 46.9000 47.1000 52.0000 53.1000 50.0000 46.9000
## [10] 52.0000 46.9000 49.5042 42.4000 52.0000 46.9000 46.9000 41.5792 53.1000
## [19] 50.4958
outliers <- boxplot(y, plot = FALSE)$out
y_no_outlier <- y[-which(y %in% outliers)] 
boxplot(y_no_outlier, plot = FALSE)$out
## [1] 120 120 120 120
qplot(x=factor(c(rep(1, length(x_no_outlier)), rep(2, length(y_no_outlier)))),
      y=c(x_no_outlier, y_no_outlier),
      fill = I("#a2d2ff"),
      col = I("#023e8a"),
      xlab = "",
      ylab="Ticket price",
      geom="boxplot") + scale_x_discrete(labels = c("Died", "Survived"))

hist(y_no_outlier, col="violet", ylim=c(0,300),
     border=F,
     breaks=seq(0,200,by=5),
     main="Histogram of data without outliers",
     xlab="Ticket prices with no outliers")

hist(x_no_outlier, col=scales::alpha("skyblue", 0.5),
     add=TRUE,
     border=F,
     breaks=seq(0,200,by=5))

Now, we can perform the Welch’s t-test one more time and check the results:

t_test <- t.test(x_no_outlier, y_no_outlier, var.equal=FALSE, alternative = "l")
t_test
## 
##  Welch Two Sample t-test
## 
## data:  x_no_outlier and y_no_outlier
## t = -10.591, df = 372.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -15.16588
## sample estimates:
## mean of x mean of y 
##   15.0281   32.9905

Conclusion: we still got very small p-value, so we can state that \(H_0\) can be rejected with very small probability of mistake.

Interesting result: p-value after removing outliers got even smaller


Part 2: decide, whether ticket class and survival are independent.

Our hypotheses:

\(H_0: Ticket \space class \space and \space survival \space of \space passenger \space are \space independent\)

\(H_1: There \space is \space a \space dependency \space between \space those \space two \space metrics\)

Note: In our dataframe, ticket class is represented as Pclass

As the class and survival are categorical variables, we will use Pearson’s Chi-squared test for independence to test this hypothesis.

What is Pearson’s Chi-squared test for independence?

Pearson’s Chi-square test is oftenly used for categorical variables to determine whether your data are significantly different from what you expected. One can use it to test whether two categorical variables are independent. It uses the following statistic:

\[ t:= \sum_{k}\frac{(O_{k}-E_{k})^2}{E_{k}} \]

Where:

  • \(O_k\) is the observed outcome

  • \(E_k\) is the expected outcome under \(H_0\).

How to use this to test independence?

First of all, we have to build the contingency table for our data:

table(data$Survived, data$Pclass)
##    
##       1   2   3
##   0  80  97 372
##   1 136  87 119

The same table, but a little bit prettier:

We will use this table to calculate the expected values for each Survival-Class combination. To do this, we will use the following formula:

\[ E_{row,col}=\frac{Total_{(row)}\times Total_{(col)}}{N} \]

For example, the expected value of 1st class passengers who did not survive is calculated the following way:

\(\frac{549\times 216}{891}\approx 133\)

By doing the same calculations of each value in the table, we will get next table of expected values:

Pearson’s Chi-Square test for independence uses the following statistic:

\[ t:= \sum_{i}\sum_{j}\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}} \\ {t} \space has \space approximately \space \chi^{2}_{df}\space destribution \space under \space H_0 \]

Where:

\(O_{i,j}\) – the observed outcome in \(i^{th}\) row and \(j^{th}\) column

\(E_{i,j}\) – the expected outcome in \(i^{th}\) row and \(j^{th}\) column

To calculate degrees of freedom of Chi-square distribution we will use the next formula:

\[ df = (num. \space of \space rows-1)\times(num. \space of \space columns-1)=(3-1)\times(2-1)=2 \]

Thus, our statistic approximately has Chi-square distribution with 2 degrees of freedom

P-value:

The rejection region (naturally we want the difference between expected and observed values to be larger to reject null hypothesis):

\(C_\alpha = \left\{\mathbf{x}\in\mathbb{R}^n |\space t(\mathbf{x}) \geq c_\alpha \right\}\)

In order to have \(C_\alpha\) of size \(\alpha\) we must have:

\(P(\mathbf{x}\in C_\alpha\space|\space H_0 \space is\space true)=P(t(\mathbf{x}) \geq c_\alpha)=\alpha\)

As, \(t(\mathbf{x})\) has Chi-Square distribution, we have:

\(P(t(\mathbf{x}) \geq c_\alpha)=1-P(t(\mathbf{x}) \leq c_\alpha)=\alpha\\ P(t(\mathbf{x}) \leq c_\alpha)=1-\alpha\\c_\alpha = x^{2}_{1-\alpha}\)

Where \(x^{2}_{1-\alpha}\) is \((1-\alpha)\%\) quantile of Chi-square distribution with 2 degrees of freedom.

test <- chisq.test(table(data$Survived, data$Pclass))
test
## 
##  Pearson's Chi-squared test
## 
## data:  table(data$Survived, data$Pclass)
## X-squared = 102.89, df = 2, p-value < 2.2e-16
table(data$Survived, data$Pclass)
##    
##       1   2   3
##   0  80  97 372
##   1 136  87 119

Also some cool representation of contingency table:

mosaic(~ Pclass + Survived,
  direction = c("v", "h"),
  data = data,
  shade = TRUE
)

Conclusion: p-value we received is very small, so we can (with almost no mistake) conclude that survival and ticket class are not independent


Part 3. Logistic regression

To understand relationships between survival and fare we can also use the logistic regression.

Logistic regression predicts the probability of the categorical dependent variable. In our case the prediction is categorical (survived or not survived).

On the other hand, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Our goal in logistic regression is to draw the best fitting S-curve for given data points. And in logistic regression, we transform the y-axis from the probabilities to log(odds).

\[ log(\frac{p}{1-p}) = log(odds) \\ \frac{p}{1-p} = e^{a+b_1x} \\ p = \frac{1}{1+e^{-(a+b_1x)}} \]

Logistic regression uses the Maximum Likelihood for parameter estimation. The parameter to be estimated is a (that is called the intercept) and b (that is slope).

\(Y_i\) is a random variable and is independent with \(i = 1, 2, \dots, n\) and \(Y_i \sim \mathscr{B}(p)\).

\(L(a, b) = \prod_{i=1}^{n} p(x_i)^{y_i}\times(1-p(x_i))^{1-y_i}\)

\(l(a, b) = \sum_{i=1}^{n} y_i \space log(p(x_i)) + (1-y_i)\space log(1-p(x_i))\)

With substituting \(p(x)\) with its exponent form we will get the formula

\(l(a, b) = \sum_{i=1}^{n} y_i \space (a+b_1x) - \sum_{i=1}^n\space log(1+e^{a+b_1x})\)

To maximize Log-likelihood function we would need to take the derivative and equalize it to zero. We would get the equations:

\[ \frac{dl(a,b)}{da} = \sum_{i=1}^n (y_i - p(x_i)) = 0 \\ \frac{dl(a,b)}{db} = \sum_{i=1}^n (y_i - p(x_i))\space x_i = 0 \]

These equation does not contain a and b and it is difficult to find the solutions analytically. So Newton-Raphson numerical iteration method should be used. But we will not cover it in our report.

Interpreting the results

In the case of logistic regression, the Chi-square test tells whether the model is overall significant or not. The null deviance tells us how well the response variable can be predicted by a model with only an intercept term:

\[ p = \frac{1}{1+e^{-a}} \]

The residual deviance shows how well the response is predicted by the model when the predictors are included:

\[ p = \frac{1}{1+e^{-(a+bx_k)}} \]

The Chi-square test now tells us if there is a significant difference between these two results.

\(H_0:\) the null deviance and residual deviance are the same.

\(H_1\): the null deviance and residual deviance are not the same.

Equivalently, the null hypothesis can be stated as the predictor terms associated with the omitted coefficients have no relationship with the response, given the remaining predictor terms are already in the model. If we fit both models, we can compute the likelihood-ratio test (LRT) statistic:

\[ G^2 = -2(logL_0-logL_1) \]

where \(L_0\) and \(L_1\) are the max likelihood values for the reduced and full models, respectively. The degrees of freedom would be \(k\), the number of coefficients. The \(p\)-value is the area under the \(\chi^2_k\) curve under the right of \(G^2\).

It can also be shown that the difference in deviances between the full and reduced model follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the full and reduced models:

Thus,

\(C_\alpha = \left\{x \in \mathbb{R}^n, \space| \space G \geq \chi_{1-\frac{\alpha}{2}} \space or \space G\leq \chi_{\frac{\alpha}{2}}\right\}\)

We can then find the \(p\)-value associated with this Chi-Square statistic. The lower the p-value, the better the model is able to fit the dataset compared to a model with just an intercept term.

\(p(x) = 2min\left\{F_{\chi^2_k}(G), 1 - F_{\chi^2_k}(G)\right\}\)

So let’s dive into the results of our logit regression. We want to predict the survival of person based on the fare for the ticket.

ggplot(data, aes(x=Fare, y=Survived)) + geom_point() + 
      stat_smooth(method="glm", lwd = 1.5, color="#ffafcc", se=FALSE,
                method.args = list(family=binomial)) + labs(title="Probability of survival")
## `geom_smooth()` using formula = 'y ~ x'

mylogit <- glm(Survived ~ Fare, data = data, family=binomial)
summary(mylogit)
## 
## Call:
## glm(formula = Survived ~ Fare, family = binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.941330   0.095129  -9.895  < 2e-16 ***
## Fare         0.015197   0.002232   6.810 9.79e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance: 1117.6  on 889  degrees of freedom
## AIC: 1121.6
## 
## Number of Fisher Scoring iterations: 4

Let’s analyze the results we got:


Model Coefficients:

  • Intercept estimate: The intercept is -0.941330. It represents the estimated log(odds) of the reference group (assuming Fare is 0).

    That means that if the fare is equal to zero, the log(odds) of the person to survive are -0.941330.

    \[ log(\frac{p}{1-p}) = -0.941330\\ \frac{p}{1-p} = e^{-0.941330} \\ p = \frac{1}{1+e^{0.941330}} \]

    p <- 1/(1 + exp(0.941330))
    p
    ## [1] 0.2806318

    So when the fare is equal to zero, the probability that the person survives on the Titanic is approximately \(0.28\).

  • Fare estimate: The coefficient for Fare is 0.015197. It represents the change in the log(odds) of survival for a one-unit increase in Fare.

    Increasing the fare by one unit will result in 0.015197 increase in \(log(\frac{p}{1-p})\). Now, if \(log(\frac{p}{1-p})\) increases by 0.015197, that means that \(\frac{p}{1-p}\) will increase by \(exp(0.015197)=1.015313\). This is approximately \(1.5\)% increase in the odds of surviving on the Titanic.

    increase_in_odds <- (exp(0.015197) - 1)*100
    increase_in_odds
    ## [1] 1.531306
  • z-value: z-value is the estimate divided by the standard error. It show us how many standard deviations there are between the estimate and zero.

  • Pr(>|t|) returns the \(p\)-value for testing the hypothesis that the coefficient is zero. Hypotheses:

    \(H_0: a = 0\)

    \(H_1: a \neq 0\)

    In our case the \(p\)-value is less than 0.05. In practical terms, having an intercept significantly different from zero means that, under certain conditions (when fare is equal to zero), the log(odds) of survival are different from zero. This implies a non-zero probability of survival when other factors are at their reference levels.


To determine if a model is useful we can compute this statistics:

\[ G^2 = -2(logL_0-logL_1) \]

which also can be interpreted as

\[ G^2 = Null \space deviance - Residual \space deviance \\ G^2 = 1186.7 - 1117.6 = 69.1\\ k=890-889 =1\\ G^2 \sim\Gamma(\frac{1}{2}, \frac{1}{2}) \]

Let’s find \(p\)-value for our test

\(p(x) = 2min\left\{F_{\chi^2_k}(G), 1 - F_{\chi^2_k}(G)\right\}\)

degrees_of_freedom <- 1
value <- 69.1

p_value <- 2*min(pchisq(value, degrees_of_freedom), 1-pchisq(value, degrees_of_freedom))
p_value
## [1] 2.220446e-16

Conclusion: Since this \(p\)-value is much less than \(.05\), we would conclude that the model can be considered useful for predicting the probability that a given person survives. Thus, we can conclude that there is a dependency between survival and ticket fare.


CONCLUSIONS

After conducting Welch’s test and applying logistic regression approach, we found out that there is a dependency between survival and ticket fare of a passenger on Titanic. Also, using Pearson’s Chi-square test for independence, we discovered that there is also a dependency between survival and ticket class.

So, how to survive on Titanic?

Study Probability&Statistics and be wealthy :)