Contingency tables - qualitative data

Introductory exercise

Do you believe in the afterlife? (https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life) A survey was conducted, and the answers from a random sample of 1091 questionnaires are summarized in the following contingency table:

##         Believe
## Gender   Yes  No
##   Female 435 375
##   Male   147 134

Our task is to check whether there is a significant relationship between belief in the afterlife and gender. For a two-way 2x2 table we can do this with the chi-square test of independence, and then measure the strength of the association with a chosen correlation coefficient for qualitative data.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dane
## X-squared = 0.11103, df = 1, p-value = 0.739
##         Believe
## Gender         Yes        No
##   Female 0.3987168 0.3437214
##   Male   0.1347388 0.1228231
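
The chunk behind the output above is not shown. A minimal sketch that should reproduce it, assuming the counts are typed in by hand and stored under the name `dane` (the name that appears in the test output), might look like this:

```r
# Counts entered by hand from the contingency table above
dane <- as.table(matrix(c(435, 147, 375, 134), nrow = 2,
                        dimnames = list(Gender = c("Female", "Male"),
                                        Believe = c("Yes", "No"))))

# For 2x2 tables chisq.test() applies Yates' continuity correction by default
test <- chisq.test(dane)
test

# Joint proportions: each cell divided by the grand total
prop.table(dane)
```

`matrix()` fills column by column, so the first two numbers form the "Yes" column and the last two the "No" column.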

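As you can see, the chi-square statistic is quick to compute for 2x2 or larger two-way tables. Here the p-value of 0.739 gives no evidence of a relationship between gender and belief. The statistic itself grows with the sample size, so we now standardize it into a contingency coefficient to see how strong the association is.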

## 
## Attaching package: 'DescTools'
## The following objects are masked from 'package:psych':
## 
##     AUC, ICC, SD
## [1] 0.01218871
## [1] 0.0121878
## [1] 0.01218871
## [1] 0.01218871
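
The four unlabeled values above are presumably Phi, the contingency coefficient, Cramér's V and Tschuprow's T, since the same DescTools calls appear in that order in the laboratory section. As a rough illustration of what "standardizing" means, the sketch below recomputes them in base R from the chi-square statistic without the continuity correction (an assumption that is consistent with the printed values). All variable names are illustrative.

```r
# Afterlife counts: rows = Gender (Female, Male), columns = Believe (Yes, No)
dane <- matrix(c(435, 147, 375, 134), nrow = 2)

n <- sum(dane)   # sample size
r <- nrow(dane)  # number of rows
k <- ncol(dane)  # number of columns

# Pearson chi-square statistic without Yates' correction
chi2 <- unname(chisq.test(dane, correct = FALSE)$statistic)

phi       <- sqrt(chi2 / n)                              # Phi
cont      <- sqrt(chi2 / (chi2 + n))                     # contingency coefficient
cramer    <- sqrt(chi2 / (n * (min(r, k) - 1)))          # Cramer's V
tschuprow <- sqrt(chi2 / (n * sqrt((r - 1) * (k - 1))))  # Tschuprow's T

round(c(phi = phi, cont = cont, cramer = cramer, tschuprow = tschuprow), 7)
```

For a 2x2 table the divisors in Cramér's V and Tschuprow's T both equal n, which is why three of the four coefficients coincide.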

Laboratory - 21/04/2021. Bivariate analysis for the ‘Titanic’ data.

Let’s consider the titanic dataset, which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person survived the sinking of the RMS Titanic on April 15, 1912. The data frame contains 2456 observations on 14 variables.

The website http://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.

8 musicians and 9 employees of the shipyard company are listed as passengers but travelled on free tickets, which is why they have NA values in fare. In addition, fare is truly missing for a few regular passengers.

In the following chunk, please find a few significant associations between nominal variables, and present their distributions in a plot and in the form of a contingency table.

How to visualize cross-tabulations? Please find some hints here and here.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(psych)
library(DescTools)  # provides Phi(), ContCoef(), CramerV() and TschuprowT() used below


# Count each Gender x Status combination
male_survivors <- sum(titanic$Status == "Survivor" & titanic$Gender == "Male")
female_survivors <- sum(titanic$Status == "Survivor" & titanic$Gender == "Female")
male_victims <- sum(titanic$Status == "Victim" & titanic$Gender == "Male")
female_victims <- sum(titanic$Status == "Victim" & titanic$Gender == "Female")
# The vector fills the 2x2 matrix column by column: rows = Gender, columns = Status
x <- c(female_survivors, male_survivors, female_victims, male_victims)
dim(x) <- c(2, 2)
dane <- as.table(x)
dimnames(dane) <- list(Gender = c("Female", "Male"), Status = c("Survivor", "Victim"))
dane
##         Status
## Gender   Survivor Victim
##   Female      359    130
##   Male        352   1366
chisq.test(dane)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dane
## X-squared = 485.87, df = 1, p-value < 2.2e-16
prop.table(dane)
##         Status
## Gender     Survivor     Victim
##   Female 0.16266425 0.05890349
##   Male   0.15949252 0.61893974
Phi(dane)
## [1] 0.4703662
ContCoef(dane)
## [1] 0.4256325
CramerV(dane)
## [1] 0.4703662
TschuprowT(dane)
## [1] 0.4703662
mosaicplot(dane, col = c("lightgreen", "red"))

barplot(dane, col = c("pink", "lightblue"))

fourfoldplot(dane, col = c("lightblue"))
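
`prop.table(dane)` above gives joint proportions, while the question of who survived is easier to read from proportions within each gender. The sketch below is one possible ggplot2 alternative to the base plots. It assumes the counts from the Gender by Status table above, typed in by hand so that the example runs on its own. The names `row_prop`, `counts` and `p` are illustrative.

```r
library(ggplot2)

# Counts copied from the Gender x Status table above
dane <- as.table(matrix(c(359, 352, 130, 1366), nrow = 2,
                        dimnames = list(Gender = c("Female", "Male"),
                                        Status = c("Survivor", "Victim"))))

# Row-wise proportions: share of survivors and victims within each gender
row_prop <- prop.table(dane, margin = 1)
round(row_prop, 3)

# Long format (columns Gender, Status, Freq) is what ggplot2 expects
counts <- as.data.frame(dane)

# position = "fill" rescales every bar to height 1, so the bars show
# the same within-gender proportions as row_prop
p <- ggplot(counts, aes(x = Gender, y = Freq, fill = Status)) +
  geom_col(position = "fill") +
  labs(y = "Proportion within gender", title = "Titanic: survival by gender")
print(p)
```

Using `margin = 1` conditions on the row variable; `margin = 2` would instead give the gender split within survivors and within victims.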

Data on being a crew member and surviving the sinking

# Count each Crew.or.Passenger. x Status combination
crew_survivors <- sum(titanic$Status == "Survivor" & titanic$Crew.or.Passenger. == "Crew")
passenger_survivors <- sum(titanic$Status == "Survivor" & titanic$Crew.or.Passenger. == "Passenger")
crew_victims <- sum(titanic$Status == "Victim" & titanic$Crew.or.Passenger. == "Crew")
passenger_victims <- sum(titanic$Status == "Victim" & titanic$Crew.or.Passenger. == "Passenger")
# The vector fills the 2x2 matrix column by column, so rows are Crew/Passenger
# and columns are Survivor/Victim; the dimnames must follow that order
x <- c(crew_survivors, passenger_survivors, crew_victims, passenger_victims)
dim(x) <- c(2, 2)
data <- as.table(x)
dimnames(data) <- list(Crew.or.Passenger. = c("Crew", "Passenger"), Status = c("Survivor", "Victim"))
data
##                   Status
## Crew.or.Passenger. Survivor Victim
##   Crew                  211    679
##   Passenger             500    817
chisq.test(data)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data
## X-squared = 48.786, df = 1, p-value = 2.855e-12
prop.table(data)
##                   Status
## Crew.or.Passenger.   Survivor     Victim
##   Crew             0.09560489 0.30765745
##   Passenger        0.22655188 0.37018577
Phi(data)
## [1] 0.1496655
ContCoef(data)
## [1] 0.1480169
CramerV(data)
## [1] 0.1496655
TschuprowT(data)
## [1] 0.1496655
mosaicplot(data, col = c("lightgreen", "lightblue"))

barplot(data, col = c("lightgreen", "red"))

Here, please interpret your findings.

The association coefficients show a moderate association between gender and survival (Phi and Cramér's V are about 0.47), and the chi-square test confirms that it is highly significant (p-value < 2.2e-16). The plots show the same pattern: about 73% of women survived (359 of 489) compared with about 20% of men (352 of 1718). This suggests that the “women and children first” rule was applied during the disaster. The coefficients for crew membership and survival are much lower (about 0.15), which is also visible in the plots. This relationship is statistically significant as well (p-value = 2.855e-12), but it is weak: being part of the crew mattered far less than gender, although crew members' chances of survival were somewhat lower than those of passengers.