Contingency tables - qualitative data

Introductory exercise

Do you believe in the afterlife? (Source: https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life) A survey was conducted, and a random sample of 1091 questionnaires is summarized in the following contingency table:

##         Believe
## Gender   Yes  No
##   Female 435 375
##   Male   147 134

Our task is to check whether there is a significant relationship between belief in the afterlife and gender. For a two-way 2x2 table, we can do this with a chi-square test of independence and a suitable measure of association for qualitative variables.

## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dane
## X-squared = 0.11103, df = 1, p-value = 0.739
##         Believe
## Gender         Yes        No
##   Female 0.3987168 0.3437214
##   Male   0.1347388 0.1228231
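
The output above can be reproduced from the printed counts alone. The sketch below rebuilds the table under the name `dane` (the name shown in the test output). How the original chunk built the object is not displayed, so the construction here is an assumption.

```r
# Rebuild the 2x2 table from the printed counts. Only the object name
# `dane` appears in the original output; the construction is assumed.
dane <- as.table(matrix(
  c(435, 147, 375, 134), nrow = 2,
  dimnames = list(Gender = c("Female", "Male"),
                  Believe = c("Yes", "No"))
))

# Chi-square test of independence. For 2x2 tables, chisq.test() applies
# Yates' continuity correction by default (correct = TRUE).
test <- chisq.test(dane)
test$statistic  # about 0.111
test$p.value    # about 0.739

# Expected counts under independence, useful for checking the usual
# rule of thumb that expected counts should not be too small.
test$expected

# Cell proportions relative to the grand total (n = 1091).
prop.table(dane)
```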

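As you can see, the chi-square statistic can be computed very quickly for 2x2 tables as well as for larger two-way tables. Here the p-value of 0.739 is far above 0.05, so there is no evidence of a relationship between gender and belief in the afterlife. Because the chi-square statistic depends on the sample size, we now standardize it into a contingency coefficient that measures how strong the relationship is.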

## 
## Attaching package: 'DescTools'
## The following objects are masked from 'package:psych':
## 
##     AUC, ICC, SD
## [1] 0.01218871
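
The printed value is consistent with the phi coefficient (identical to Cramér's V for a 2x2 table) computed from the chi-square statistic without Yates' correction. Which DescTools function the original chunk called is not shown, so the base-R sketch below is an assumption about what was computed. It rebuilds the table from the printed counts.

```r
# 2x2 table rebuilt from the printed counts (construction is assumed).
dane <- as.table(matrix(
  c(435, 147, 375, 134), nrow = 2,
  dimnames = list(Gender = c("Female", "Male"),
                  Believe = c("Yes", "No"))
))

# Association coefficients are based on the UNCORRECTED chi-square statistic.
chi2 <- unname(chisq.test(dane, correct = FALSE)$statistic)
n <- sum(dane)

# Phi coefficient for a 2x2 table: sqrt(chi-square / n).
phi <- sqrt(chi2 / n)
phi  # about 0.0122

# Cramer's V generalizes phi to r x c tables by dividing by
# min(rows, columns) - 1; for a 2x2 table that divisor is 1.
cramers_v <- sqrt(chi2 / (n * (min(dim(dane)) - 1)))
cramers_v
```

A value this close to 0 indicates practically no association, which agrees with the non-significant test result.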

Laboratory - 21/04/2021. Bivariate analysis for the ‘Titanic’ data.

Let’s consider the titanic dataset, which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person survived the sinking of the RMS Titanic on April 15, 1912. The data frame contains 2456 observations on 14 variables.

The website http://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.

8 musicians and 9 employees of the shipyard company are listed as passengers but travelled on free tickets, which is why they have NA values in fare. In addition, fare is truly missing for a few regular passengers.

In the following chunk, please find a few significant associations between nominal variables, and present their distributions in a plot and in the form of a contingency table.

How to visualize cross-tabulations? Please find some hints here and here.

# Survival by gender, restricted to people who did not disembark before the sinking
ND <- titanic[which(titanic$Disembarked.at == 'Not Disembarked'), ]
tb1 <- table(ND$Status, ND$Gender)
tb1
##           
##            Female Male
##   Survivor    359  352
##   Victim      130 1366
fourfoldplot(tb1)

chisq.test(tb1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tb1
## X-squared = 485.87, df = 1, p-value < 2.2e-16
prop.table(tb1)
##           
##                Female       Male
##   Survivor 0.16266425 0.15949252
##   Victim   0.05890349 0.61893974
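# Survival by age group
# Note: with breaks starting at 1 and right = TRUE, the first bin is (1, 15],
# so ages of 1 or less (and missing ages) fall outside the bins and become NA.
labels <- c("< 15", "15 - 30", "30 - 45", "45 - 60", "60 - 75")
breaks <- c(1, 15, 30, 45, 60, 75)
agetable <- cut(ND$Age, breaks = breaks, labels = labels, right = TRUE)
tb2 <- table(ND$Status, Age = agetable)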
tb2
##           Age
##            < 15 15 - 30 30 - 45 45 - 60 60 - 75
##   Survivor   52     334     233      69       7
##   Victim     60     774     504     118      31
tb2.df <- as.data.frame(tb2)
names(tb2.df) <- c("Status", "Age", "Frequency")
ggplot(tb2.df, aes(x=Status, y=Frequency, fill=Age)) + geom_col()

chisq.test(tb2)
## 
##  Pearson's Chi-squared test
## 
## data:  tb2
## X-squared = 17.823, df = 4, p-value = 0.001337
prop.table(tb2)
##           Age
##                   < 15     15 - 30     30 - 45     45 - 60     60 - 75
##   Survivor 0.023831347 0.153070577 0.106782768 0.031622365 0.003208066
##   Victim   0.027497709 0.354720440 0.230980752 0.054078827 0.014207149
# Survival by crew/passenger status
tb3 <- table(ND$Status, ND$Crew.or.Passenger.)
tb3
##           
##            Crew Passenger
##   Survivor  211       500
##   Victim    679       817
mosaicplot(tb3,shade=TRUE)

chisq.test(tb3)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tb3
## X-squared = 48.786, df = 1, p-value = 2.855e-12
prop.table(tb3)
##           
##                  Crew  Passenger
##   Survivor 0.09560489 0.22655188
##   Victim   0.30765745 0.37018577
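# All variables at once: mosaic plot of the built-in four-way `Titanic` table
# (Class x Sex x Age x Survived), which is a different object from `titanic`.
# install.packages("vcd")
library(vcd)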
## Warning: package 'vcd' was built under R version 4.0.5
## Loading required package: grid
mosaic(Titanic,shade=TRUE,legend=TRUE) 
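
The `Titanic` object passed to `mosaic()` is the four-way table that ships with R in the datasets package, not the `titanic` data frame analysed above. It covers 2201 people, so its counts differ slightly from the 2207 people in `ND`. As a minimal sketch, the table can be collapsed to any pair of variables with `margin.table()` and then tested in the same way as `tb1`. The name `sex_surv` is introduced here only for illustration.

```r
# `Titanic` is a built-in 4-way table: Class x Sex x Age x Survived.
names(dimnames(Titanic))
sum(Titanic)  # 2201 people in total

# Collapse to a 2-way table by summing over Class and Age
# (dimension 2 is Sex, dimension 4 is Survived).
sex_surv <- margin.table(Titanic, margin = c(2, 4))
sex_surv

# Same chi-square test of independence as used for tb1.
chisq.test(sex_surv)

# Survival rate within each sex (rows sum to 1).
round(prop.table(sex_surv, margin = 1), 3)
```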

Here, please interpret your findings.

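From picture 1 (the fourfold plot), women make up about half of the survivors (359 of 711) although they were a small minority of those aboard, while men make up the large majority of the victims (1366 of 1496). About 73% of the women survived compared with about 20% of the men, which is consistent with women being given priority during the evacuation.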
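
Survival rates within each gender are easier to read from column-wise proportions than from the overall proportions given by `prop.table(tb1)`. The sketch below rebuilds `tb1` from the printed counts, because the `titanic` data frame itself is not included here.

```r
# tb1 rebuilt from the printed contingency table.
tb1 <- as.table(matrix(
  c(359, 130, 352, 1366), nrow = 2,
  dimnames = list(Status = c("Survivor", "Victim"),
                  Gender = c("Female", "Male"))
))

# margin = 2: proportions within each column (gender); columns sum to 1.
# The Survivor row is then the survival rate per gender:
# about 0.734 for women and about 0.205 for men.
by_gender <- prop.table(tb1, margin = 2)
round(by_gender, 3)

# margin = 1: proportions within each row (status); rows sum to 1.
# This answers a different question: who makes up the survivors/victims?
by_status <- prop.table(tb1, margin = 1)
round(by_status, 3)
```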

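From picture 2, survival rates are much more similar across age groups than across genders, ranging from about 18% (ages 60 - 75) to about 46% (under 15). This suggests that gender took priority over age.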
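
The survival rate per age group can be checked directly from the printed counts of `tb2`. This is a sketch that rebuilds the table by hand; the name `surv_rate` is introduced only for illustration.

```r
# tb2 rebuilt from the printed contingency table.
tb2 <- as.table(matrix(
  c(52, 334, 233,  69,  7,
    60, 774, 504, 118, 31),
  nrow = 2, byrow = TRUE,
  dimnames = list(Status = c("Survivor", "Victim"),
                  Age = c("< 15", "15 - 30", "30 - 45", "45 - 60", "60 - 75"))
))

# Share of survivors within each age group (column-wise proportions).
surv_rate <- prop.table(tb2, margin = 2)["Survivor", ]
round(surv_rate, 3)
# about 0.464, 0.301, 0.316, 0.369, 0.184

# Spread between the best and worst group, for comparison with the
# gap between the survival rates of women and men.
diff(range(surv_rate))
```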

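The chi-square tests point the same way: both associations are statistically significant, but the evidence is far stronger for gender (X-squared = 485.87, df = 1, p-value < 2.2e-16) than for age (X-squared = 17.823, df = 4, p-value = 0.001337).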

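From picture 3 and its test (X-squared = 48.786, p-value = 2.855e-12), crew members fared significantly worse than passengers: about 76% of the crew died (679 of 890), compared with about 62% of the passengers (817 of 1317).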
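
P-values show that an association exists but not how strong it is, and chi-square statistics with different degrees of freedom are not directly comparable. One way to put the three tables on a common scale is Cramér's V, as in the introductory exercise. The helper `cramers_v()` below is a hypothetical base-R function written for this sketch (DescTools offers `CramerV()` as an alternative), and the tables are rebuilt from the printed counts.

```r
# Hypothetical helper: Cramer's V from the uncorrected chi-square statistic.
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Tables rebuilt from the printed counts (rows: Survivor, Victim).
tb1 <- as.table(matrix(c(359, 130, 352, 1366), nrow = 2))  # Female, Male
tb2 <- as.table(matrix(c(52, 334, 233,  69,  7,
                         60, 774, 504, 118, 31),
                       nrow = 2, byrow = TRUE))             # five age groups
tb3 <- as.table(matrix(c(211, 679, 500, 817), nrow = 2))   # Crew, Passenger

v <- c(gender = cramers_v(tb1),
       age    = cramers_v(tb2),
       crew   = cramers_v(tb3))
round(v, 2)
# about 0.47 (gender), 0.09 (age), 0.15 (crew/passenger)
```

On this scale gender is by far the strongest of the three associations, followed by crew/passenger status, with age the weakest.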