Nominal Association

Eitan Tzelgov

15/11/2022

How it started

How it ended

Our research question:

Loading our Titanic data

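library(dplyr)    # provides %>%, glimpse(), count(), group_by(), summarise()
library(janitor)  # provides tabyl()
load("C:/Users/vxs15qru/OneDrive - University of East Anglia/PPLX7012A 19-20/labs/data/titanic.RData")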
names(titanic)
##  [1] "row.names" "pclass"    "survived"  "name"      "age"       "embarked" 
##  [7] "home.dest" "room"      "ticket"    "boat"      "sex"
glimpse(titanic)
## Rows: 1,313
## Columns: 11
## $ row.names <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ pclass    <fct> 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, ~
## $ survived  <int> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, ~
## $ name      <fct> "Allen, Miss Elisabeth Walton", "Allison, Miss Helen Loraine~
## $ age       <dbl> 29.0000, 2.0000, 30.0000, 25.0000, 0.9167, 47.0000, 63.0000,~
## $ embarked  <fct> Southampton, Southampton, Southampton, Southampton, Southamp~
## $ home.dest <fct> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal~
## $ room      <fct> B-5, C26, C26, C26, C22, E-12, D-7, A-36, C-101, , , , B-35,~
## $ ticket    <fct> 24160 L221, , , , , , 13502 L77, , , , 17754 L224 10s 6d, 17~
## $ boat      <fct> 2, , (135), , 11, 3, 10, , 2, (22), (124), 4, 9, B, , 6, , ,~
## $ sex       <fct> female, female, male, female, male, male, female, male, fema~

Our variables

titanic %>%
  count(pclass, survived)
##   pclass survived   n
## 1    1st        0 129
## 2    1st        1 193
## 3    2nd        0 161
## 4    2nd        1 119
## 5    3rd        0 574
## 6    3rd        1 137

Or, slightly nicer, using the tabyl() function from the janitor package:

titanic %>%
  tabyl(pclass, survived)
##  pclass   0   1
##     1st 129 193
##     2nd 161 119
##     3rd 574 137

As you would report it:

| Class | Died | Survived |
|---|---|---|
| 1st Class | 129 | 193 |
| 2nd Class | 161 | 119 |
| 3rd Class | 574 | 137 |

The Expected Value

The null hypothesis:

Proportions: chances of survival by class

titanic %>%
  group_by(pclass) %>%
  summarise(mean(survived))
## # A tibble: 3 x 2
##   pclass `mean(survived)`
##   <fct>             <dbl>
## 1 1st               0.599
## 2 2nd               0.425
## 3 3rd               0.193

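This seems to provide evidence against the null: if class and survival were unrelated, we would expect roughly the same survival rate in every class (close to the overall rate of about 0.34), not rates ranging from 0.60 down to 0.19. Make sure you understand why!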

The independence / null world

The expected value under the null

| Class | Died | Survived | Totals |
|---|---|---|---|
| 1st Class | ? | ? | 322 |
| 2nd Class | ? | ? | 280 |
| 3rd Class | ? | ? | 711 |
| Totals | 864 | 449 | 1313 |

The expected value

\[\texttt{Expected value for a cell}= \frac{\texttt{Row total}\times \texttt{Column total}}{\texttt{Sample Size}}\]

\(\texttt{E}=\frac{\texttt{R}\times\texttt{C}}{\texttt{N}}\)

In our case (using the row and column totals from the table above):

| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | 211.8873 | 110.1127 | 322 |
| Second Class | 184.2498 | 95.7502 | 280 |
| Third Class | 467.8629 | 243.1371 | 711 |
| Totals | 864 | 449 | 1313 |
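
The expected counts can be reproduced in R from the observed counts alone. This is a minimal sketch, not part of the original lab code: the object names `observed` and `expected` are illustrative, and the counts are typed in by hand from the cross-tabulation. `outer()` multiplies every row total by every column total, so dividing by the sample size applies the formula to all six cells at once.

```r
# Observed counts, typed in from the class-by-survival cross-tabulation
observed <- matrix(c(129, 193,
                     161, 119,
                     574, 137),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(pclass = c("1st", "2nd", "3rd"),
                                   survived = c("0", "1")))

row_totals <- rowSums(observed)   # 322, 280, 711
col_totals <- colSums(observed)   # 864, 449
n          <- sum(observed)       # 1313

# E = (row total x column total) / sample size, for every cell
expected <- outer(row_totals, col_totals) / n
round(expected, 4)
```

Note that the expected counts keep the same row and column totals as the observed table; only the distribution across cells changes.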

Chi-Square

The statistical test: the idea

Building the formula

\(\chi^2=\sum{\frac{(\texttt{O}-\texttt{E})^2}{\texttt{E}}}\)

Calculating: step 1 (O-E)

| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | -82.89 | 82.89 | 322 |
| Second Class | -23.25 | 23.25 | 280 |
| Third Class | 106.14 | -106.14 | 711 |
| Totals | 864 | 449 | 1313 |

Calculating: step 2 (Squaring)

| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | 6870.30 | 6870.30 | 322 |
| Second Class | 540.55 | 540.55 | 280 |
| Third Class | 11265.08 | 11265.08 | 711 |
| Totals | 864 | 449 | 1313 |

Calculating: step 3 (Dividing by E and summing)

| Class | Died | Survived | Totals |
|---|---|---|---|
| First Class | 32.42 | 62.39 | 322 |
| Second Class | 2.93 | 5.65 | 280 |
| Third Class | 24.08 | 46.33 | 711 |
| Totals | 864 | 449 | 1313 |
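
The three steps can be checked in R. The sketch below assumes only the observed counts from the cross-tabulation; the object names (`deviation`, `squared`, `contribution`, `chi_square`) are illustrative and not part of the original lab code.

```r
# Observed counts and the expected counts under independence
observed <- matrix(c(129, 193, 161, 119, 574, 137),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(pclass = c("1st", "2nd", "3rd"),
                                   survived = c("0", "1")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

deviation    <- observed - expected   # step 1: O - E
squared      <- deviation^2           # step 2: (O - E)^2
contribution <- squared / expected    # step 3: (O - E)^2 / E, per cell

round(contribution, 2)

# Summing the six cell contributions gives the chi-square statistic
chi_square <- sum(contribution)
round(chi_square, 2)   # 173.81
```

The largest contributions come from first-class and third-class survivors, which is where the observed counts depart most from what independence would predict.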

We check against the distribution:

The \(\chi^2\) distribution with \(df\) degrees of freedom is the distribution of the sum of the squares of \(df\) independent standard normal random variables…

So, we need to know the degrees of freedom for each test:

Degrees of freedom
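
For a cross-tabulation, the degrees of freedom depend only on the shape of the table, not on the sample size. With \(r\) rows and \(c\) columns (here 3 passenger classes and 2 survival outcomes), the standard rule gives:

\[df = (r-1)\times(c-1) = (3-1)\times(2-1) = 2\]

Intuitively, once the row and column totals are fixed, only two of the six cells can vary freely; the remaining cells are determined by the totals.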

And the distribution for 2 degrees of freedom is:
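
A minimal sketch for drawing this distribution and locating our result on it, using base R's `dchisq()`, `qchisq()` and `pchisq()`. The 5% significance level and the plotting range are assumptions chosen for illustration; 173.81 is the statistic that `chisq.test()` reports for our table.

```r
df <- 2

# Density of the chi-square distribution with 2 degrees of freedom
curve(dchisq(x, df = df), from = 0, to = 15,
      xlab = "Chi-square value", ylab = "Density")

# Critical value: 5% of the distribution lies to the right of this point
critical <- qchisq(0.95, df = df)
abline(v = critical, lty = 2)
critical

# p-value: probability of a statistic at least as large as ours under the null
p_value <- pchisq(173.81, df = df, lower.tail = FALSE)
p_value
```

Any statistic above the critical value (about 5.99) leads us to reject the null at the 5% level; a value of 173.81 lies far out in the right tail.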

Let R do it

Using Base R

chisq.test(with(titanic, table(pclass, survived)))
## 
##  Pearson's Chi-squared test
## 
## data:  with(titanic, table(pclass, survived))
## X-squared = 173.81, df = 2, p-value < 2.2e-16

Using tabyl()

my.table <- titanic %>%
  tabyl(pclass, survived)
chisq.test(my.table)
## 
##  Pearson's Chi-squared test
## 
## data:  my.table
## X-squared = 173.81, df = 2, p-value < 2.2e-16
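
The object returned by `chisq.test()` also stores the pieces needed for a write-up. The sketch below is an illustration rather than part of the original lab code: it rebuilds the table from the published counts (so it runs without the `titanic` data), and the names `tab` and `result` are hypothetical. The "expected counts of at least 5" check is a common rule of thumb, not a requirement stated in these slides.

```r
# Rebuild the cross-tabulation from the counts reported above
tab <- matrix(c(129, 193, 161, 119, 574, 137),
              nrow = 3, byrow = TRUE,
              dimnames = list(pclass = c("1st", "2nd", "3rd"),
                              survived = c("0", "1")))

result <- chisq.test(tab)

result$statistic            # the chi-square statistic (X-squared)
result$parameter            # degrees of freedom
result$p.value              # p-value
sum(tab)                    # N, the sample size

# Expected counts under the null; a common rule of thumb asks for all >= 5
round(result$expected, 2)
all(result$expected >= 5)
```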

Writing this up:

"To test whether passenger class and survival are associated, I carried out a chi-square test on the cross-tabulation of class and survival. The value of the chi-square statistic was 173.81 (d.f. = 2, N = 1313), and the corresponding p-value was well below 0.05 (p < 0.001), allowing me to reject the null hypothesis that there is no association."