Eitan Tzelgov
15/11/2022
Is there an association between the social class of passengers and their survival?
Independent variable: social class (ordinal: 1st / 2nd / 3rd class)
Dependent variable: survival (nominal: survived / died)
load("C:/Users/vxs15qru/OneDrive - University of East Anglia/PPLX7012A 19-20/labs/data/titanic.RData")
names(titanic)
## [1] "row.names" "pclass" "survived" "name" "age" "embarked"
## [7] "home.dest" "room" "ticket" "boat" "sex"
## Rows: 1,313
## Columns: 11
## $ row.names <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ pclass <fct> 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, ~
## $ survived <int> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, ~
## $ name <fct> "Allen, Miss Elisabeth Walton", "Allison, Miss Helen Loraine~
## $ age <dbl> 29.0000, 2.0000, 30.0000, 25.0000, 0.9167, 47.0000, 63.0000,~
## $ embarked <fct> Southampton, Southampton, Southampton, Southampton, Southamp~
## $ home.dest <fct> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal~
## $ room <fct> B-5, C26, C26, C26, C22, E-12, D-7, A-36, C-101, , , , B-35,~
## $ ticket <fct> 24160 L221, , , , , , 13502 L77, , , , 17754 L224 10s 6d, 17~
## $ boat <fct> 2, , (135), , 11, 3, 10, , 2, (22), (124), 4, 9, B, , 6, , ,~
## $ sex <fct> female, female, male, female, male, male, female, male, fema~
## pclass survived n
## 1 1st 0 129
## 2 1st 1 193
## 3 2nd 0 161
## 4 2nd 1 119
## 5 3rd 0 574
## 6 3rd 1 137
## pclass 0 1
## 1st 129 193
## 2nd 161 119
## 3rd 574 137
| | Died | Survived |
|---|---|---|
| 1st Class | 129 | 193 |
| 2nd Class | 161 | 119 |
| 3rd Class | 574 | 137 |
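The cross-tabulation can be rebuilt without the original data file. The sketch below types the six counts in by hand; the object names `observed` and `observed_with_totals` and the labels "Died" / "Survived" (for `survived` = 0 / 1) are illustrative choices, not names from the lab script.

```r
# Observed counts copied from the cross-tabulation above
# (rows: passenger class; columns: survived = 0 / 1, relabelled for readability)
observed <- as.table(matrix(
  c(129, 193,
    161, 119,
    574, 137),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  survived = c("Died", "Survived"))
))

# Add row and column totals (labelled "Sum" by addmargins)
observed_with_totals <- addmargins(observed)
print(observed_with_totals)
```

The margins printed here are the row and column totals that the expected-value calculation relies on.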
Null hypothesis: there is no association between class and survival
To reject the null, we have to show that the numbers we observe look very different from what we would see if there were no association
Two approaches:
Show proportions (intuitive)
Calculate the expected values, that is, what we would observe if there were no association, and then:
Compare those values to the observed data (less intuitive, but we do it all the time)
## # A tibble: 3 x 2
## pclass `mean(survived)`
## <fct> <dbl>
## 1 1st 0.599
## 2 2nd 0.425
## 3 3rd 0.193
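The same proportions can be recovered from the counts alone. This is a minimal sketch that assumes the counts in the cross-tabulation above; `observed` and `survival_rate` are illustrative names, and the original output was probably produced from the full data set rather than this way.

```r
# Observed counts (rows: class; columns: died / survived)
observed <- matrix(
  c(129, 193,
    161, 119,
    574, 137),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  survived = c("Died", "Survived"))
)

# margin = 1 divides each cell by its row total, so each row sums to 1
row_proportions <- prop.table(observed, margin = 1)

# Share of each class that survived
survival_rate <- row_proportions[, "Survived"]
print(round(survival_rate, 3))
```

If class and survival were unrelated, the three rates would all be close to the overall survival rate of 449 / 1313, about 0.342.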
This seems to provide evidence against the null. Make sure you understand why!
Remember that we always contrast the null with our research hypothesis
But what exactly is the null?
The null, in this case at least, implies independence between the two variables
This means that we can use the laws of probability to work out what the data would look like in the null world
\(P(A \cap B) = P(A) \times P(B)\) or: the joint probability of independent events is calculated as the probability of event A multiplied by the probability of event B.
| Class | Died | Survived | Totals |
|---|---|---|---|
| 1st Class | ? | ? | 322 |
| 2nd Class | ? | ? | 280 |
| Third Class | ? | ? | 711 |
| Totals | 864 | 449 | 1313 |
\[\texttt{Expected value for a cell}= \frac{\texttt{Row total}\times \texttt{Column total}}{\texttt{Sample Size}}\]
\(\texttt{E}=\frac{\texttt{R}\times\texttt{C}}{\texttt{N}}\)
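The formula follows from the independence rule: under the null, the probability of landing in a cell is the row proportion times the column proportion, and multiplying that probability by the sample size turns it into a count. As a worked example, for first-class passengers who died:

\[\texttt{E} = \texttt{N}\times\frac{\texttt{R}}{\texttt{N}}\times\frac{\texttt{C}}{\texttt{N}} = \frac{\texttt{R}\times\texttt{C}}{\texttt{N}}, \qquad \texttt{E}_{\text{1st, died}} = \frac{322 \times 864}{1313} \approx 211.89\]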
| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | 211.8873 | 110.1127 | 322 |
| Second Class | 184.2498 | 95.7502 | 280 |
| Third Class | 467.8629 | 243.1371 | 711 |
| Totals | 864 | 449 | 1313 |
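The expected counts can be computed for all six cells at once. The sketch below assumes the observed counts from the cross-tabulation; the object names are illustrative, and `outer()` is just one convenient way to form every row total times column total product.

```r
# Observed counts (rows: class; columns: died / survived)
observed <- matrix(
  c(129, 193,
    161, 119,
    574, 137),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  survived = c("Died", "Survived"))
)

row_totals <- rowSums(observed)   # R in the formula
col_totals <- colSums(observed)   # C in the formula
n <- sum(observed)                # N in the formula

# E = R * C / N for every combination of row and column
expected <- outer(row_totals, col_totals) / n
print(round(expected, 4))
```

The expected counts keep the same row and column totals as the observed table; only the way passengers are spread across the cells changes.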
Chi-square (pron. “kai”, written \(\chi^2\))
Proposed by Karl Pearson, who noticed that relying on normal distributions can lead to errors
Allows us to summarize the differences between observed and expected
We also know how the chi-square statistic is distributed under the null (it follows the \(\chi^2\) distribution),
so we can work out the p-value!
Call O the observed number
Call E the expected number
We subtract E from O, square the results, divide by E and sum
\(\chi^2=\sum{\frac{(\texttt{O}-\texttt{E})^2}{\texttt{E}}}\)
| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | -82.89 | 82.89 | 322 |
| Second Class | -23.25 | 23.25 | 280 |
| Third Class | 106.14 | -106.14 | 711 |
| Totals | 864 | 449 | 1313 |
| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | 6870.30 | 6870.30 | 322 |
| Second Class | 540.55 | 540.55 | 280 |
| Third Class | 11265.08 | 11265.08 | 711 |
| Totals | 864 | 449 | 1313 |
| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | 32.42 | 62.39 | 322 |
| Second Class | 2.93 | 5.65 | 280 |
| Third Class | 24.08 | 46.33 | 711 |
| Totals | 864 | 449 | 1313 |
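The three tables above correspond to three lines of arithmetic. A minimal sketch, assuming the observed counts from the cross-tabulation (object names are illustrative):

```r
# Observed counts (rows: class; columns: died / survived)
observed <- matrix(
  c(129, 193,
    161, 119,
    574, 137),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  survived = c("Died", "Survived"))
)

# Expected counts under independence: row total * column total / N
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

differences   <- observed - expected          # O - E
squared       <- differences^2                # (O - E)^2
contributions <- squared / expected           # (O - E)^2 / E

# The chi-square statistic is the sum of the six cell contributions
chi_square <- sum(contributions)

print(round(differences, 2))
print(round(squared, 2))
print(round(contributions, 2))
print(round(chi_square, 2))
```

Within each row the two differences cancel out, which is why the squared values are identical across the two columns while the final contributions differ: they are divided by different expected counts.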
The \(\chi^2\) statistic is 173.8
Is that statistically significant?
The \(\chi^2\) distribution with \(df\) degrees of freedom is the distribution of the sums of the squares of \(df\) independent standard normal random variables…
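That definition can be checked by simulation. The sketch below is illustrative only: the seed, the number of draws and all object names are arbitrary choices. It squares and sums pairs of standard normal draws and compares the result with the theoretical \(\chi^2\) distribution with 2 degrees of freedom.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
df <- 2              # degrees of freedom to illustrate
n_draws <- 100000    # number of simulated sums

# Each row holds df independent standard normal draws
z <- matrix(rnorm(n_draws * df), ncol = df)

# Sum of squares across each row: one simulated chi-square value per row
sum_of_squares <- rowSums(z^2)

# A chi-square variable has mean equal to its degrees of freedom
simulated_mean <- mean(sum_of_squares)

# Compare the simulated 95th percentile with the theoretical one
simulated_q95   <- unname(quantile(sum_of_squares, 0.95))
theoretical_q95 <- qchisq(0.95, df = df)

print(c(simulated_mean = simulated_mean,
        simulated_q95 = simulated_q95,
        theoretical_q95 = theoretical_q95))
```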
So, we need to know the degrees of freedom for each test:
Degrees of freedom equal (number of rows minus one) \(\times\) (number of columns minus one)
For a 3 by 2 table: (3 − 1) \(\times\) (2 − 1) = 2 \(\times\) 1, which equals 2
Hence, two degrees of freedom
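With the statistic and the degrees of freedom in hand, the p-value is the area of the \(\chi^2\) distribution to the right of the statistic. A minimal sketch using the rounded statistic reported in these notes; the variable names and the 5% significance level are illustrative assumptions.

```r
n_rows <- 3   # passenger classes
n_cols <- 2   # died / survived

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (n_rows - 1) * (n_cols - 1)

chi_square <- 173.81   # statistic reported in these notes

# Upper-tail probability: chance of a statistic at least this large under the null
p_value <- pchisq(chi_square, df = df, lower.tail = FALSE)

# Smallest statistic that would be significant at the 5% level
critical_value <- qchisq(0.95, df = df)

print(df)
print(p_value)
print(critical_value)
```

The statistic is far above the 5% critical value, so the p-value is extremely small.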
##
## Pearson's Chi-squared test
##
## data: with(titanic, table(pclass, survived))
## X-squared = 173.81, df = 2, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: my.table
## X-squared = 173.81, df = 2, p-value < 2.2e-16
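The `data:` lines in the output above indicate that `chisq.test()` was called on a table of `pclass` by `survived`. The sketch below reproduces the test from the counts alone, so it runs without the data file; `observed` and `result` are illustrative names.

```r
# Observed counts (rows: class; columns: died / survived)
observed <- matrix(
  c(129, 193,
    161, 119,
    574, 137),
  nrow = 3, byrow = TRUE,
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  survived = c("Died", "Survived"))
)

# Pearson's chi-squared test of independence on the 3 x 2 table
result <- chisq.test(observed)
print(result)

# The pieces computed by hand earlier are stored in the result object
result$statistic   # chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$expected    # expected counts under independence
```

Comparing `result$expected` with the hand-computed expected table is a quick way to confirm that the manual calculation and the built-in test agree.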
“To test whether passenger class and survival are associated, I carried out a chi-square test on the cross-tabulation of class and survival. The value of the chi-square statistic was 173.81 (d.f. = 2, N = 1313), and the corresponding p-value was less than 0.001, allowing us to reject the null hypothesis that there is no association.”