1.1. Data
Description of the variables:
PassengerID: Passenger ID.
Survived: Whether a passenger survived the shipwreck or not (0:No,
1:Yes).
Pclass: Ticket class of the passenger (1:first class, 2:second
class, 3:third class).
Name: Name of the passenger.
Sex: Gender of the passenger.
Age: Age of the person when boarding the Titanic.
SibSp: Number of siblings/spouses aboard the Titanic.
Parch: Number of parents/children aboard the Titanic.
Ticket: Ticket number of the passenger.
Fare: The fare the passenger paid for the ticket.
Cabin: Cabin number of the passenger.
Embarked: Port at which the passenger boarded the Titanic.
Variables used to test for an association (whether the two are
related):
Survived - whether a passenger survived or not.
Sex - Whether the passenger is male or female.
#Converting the categorical variables into factors.
mydata$SurvivedF <- factor(mydata$Survived,
                           levels = c(0, 1),
                           labels = c("No", "Yes"))
mydata$SexF <- factor(mydata$Sex,
                      levels = c("male", "female"),
                      labels = c("Male", "Female"))
summary(mydata[colnames(mydata) %in% c("SurvivedF", "SexF")]) #Descriptive statistics (frequencies) for the two variables.
## SurvivedF SexF
## No :549 Male :577
## Yes:342 Female:314
Explanation of descriptive statistics:
SurvivedF: 549 passengers did not survive (No) and 342 passengers
survived (Yes).
SexF: 577 passengers were male (Male) and 314 passengers were
female (Female).
1.2. Analysis
chi_square <- chisq.test(mydata$SurvivedF, mydata$SexF, #Pearson's chi-square test of association between two categorical variables (a non-parametric test).
correct = TRUE) #Yates' continuity correction, since the table is 2x2.
chi_square
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$SurvivedF and mydata$SexF
## X-squared = 260.72, df = 1, p-value < 2.2e-16
addmargins(chi_square$observed) #Observed values
## mydata$SexF
## mydata$SurvivedF Male Female Sum
## No 468 81 549
## Yes 109 233 342
## Sum 577 314 891
addmargins(round(chi_square$expected, 2)) #Expected values
## mydata$SexF
## mydata$SurvivedF Male Female Sum
## No 355.53 193.47 549
## Yes 221.47 120.53 342
## Sum 577.00 314.00 891
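The expected values above can be reproduced by hand, which also makes it easy to check the usual requirement for the chi-square test. This is a minimal sketch, assuming the observed counts are typed in from the table above rather than taken from mydata; the names observed and expected are illustrative only.

```r
# Observed counts copied from the observed table (rows: Survived, columns: Sex).
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Expected count of a cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Common rule of thumb: every expected count should be at least 5.
all(expected >= 5)
```

If the last line returned FALSE, that would be the situation in which Fisher's exact test is normally preferred.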
round(chi_square$residuals, 2) #Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
## mydata$SexF
## mydata$SurvivedF Male Female
## No 5.97 -8.09
## Yes -7.56 10.25
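The residuals above follow directly from the observed and expected counts. The sketch below recomputes them by hand, assuming the counts are typed in from the observed table; the object names are illustrative only. Cells with residuals well beyond roughly ±2 are commonly read as deviating notably from what independence would predict.

```r
# Observed counts copied from the observed table (rows: Survived, columns: Sex).
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual of a cell = (observed - expected) / sqrt(expected).
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# Their squared sum is the chi-square statistic WITHOUT Yates' correction,
# so it is slightly larger than the corrected value reported by chisq.test().
chi_sq_uncorrected <- sum(pearson_res^2)
chi_sq_uncorrected
```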
addmargins(round(prop.table(chi_square$observed), 3)) #Proportions of the grand total: 0.525 means that 52.5% of all passengers were males who did not survive.
## mydata$SexF
## mydata$SurvivedF Male Female Sum
## No 0.525 0.091 0.616
## Yes 0.122 0.262 0.384
## Sum 0.647 0.353 1.000
addmargins(round(prop.table(chi_square$observed, 1), 3), 2) # Out of all passengers who did not survive 85.2% were male.
## mydata$SexF
## mydata$SurvivedF Male Female Sum
## No 0.852 0.148 1.000
## Yes 0.319 0.681 1.000
addmargins(round(prop.table(chi_square$observed, 2), 3), 1) #Out of all passengers who were male 81.1% did not survive.
## mydata$SexF
## mydata$SurvivedF Male Female
## No 0.811 0.258
## Yes 0.189 0.742
## Sum 1.000 1.000
#install.packages("effectsize")
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
effectsize::cramers_v(mydata$SurvivedF, mydata$SexF) #Effect size: the (bias-adjusted) Cramer's V is 0.54.
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.54 | [0.49, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.54) #According to the funder2019 rules, the effect size is very large.
## [1] "very large"
## (Rules: funder2019)
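For reference, the unadjusted Cramér's V can be computed by hand as sqrt(chi-square / (n * (k - 1))), where k is the smaller of the number of rows and columns. This is a sketch under the assumption that the uncorrected chi-square statistic is used; effectsize reports a bias-adjusted version (marked "adj." in the output), so the two values need not agree exactly. The object names are illustrative only.

```r
# Observed counts copied from the observed table (rows: Survived, columns: Sex).
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Chi-square statistic without Yates' continuity correction.
chi_sq <- as.numeric(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)       # total number of passengers
k <- min(dim(observed))  # smaller table dimension (2 for a 2x2 table)

# Unadjusted Cramer's V.
cramers_v_manual <- sqrt(chi_sq / (n * (k - 1)))
round(cramers_v_manual, 3)
```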
oddsratio(mydata$SurvivedF, mydata$SexF) #Odds ratio: the odds of not surviving were 12.35 times higher for male than for female passengers.
## Odds ratio | 95% CI
## --------------------------
## 12.35 | [8.90, 17.14]
interpret_oddsratio(12.35) #According to the chen2010 rules, an odds ratio of 12.35 is large.
## [1] "large"
## (Rules: chen2010)
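The odds ratio can also be traced back to the four cell counts. The sketch below assumes the counts are typed in from the observed table and uses a Wald interval on the log scale, which is an assumption about how such an interval is typically built rather than a statement about the internals of effectsize; the object names are illustrative only.

```r
# Observed counts copied from the observed table (rows: Survived, columns: Sex).
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Odds of not surviving within each sex.
odds_male   <- observed["No", "Male"]   / observed["Yes", "Male"]
odds_female <- observed["No", "Female"] / observed["Yes", "Female"]

# Odds ratio: how many times higher the odds are for males.
odds_ratio <- odds_male / odds_female
round(odds_ratio, 2)

# Wald 95% confidence interval, computed on the log scale.
se_log_or <- sqrt(sum(1 / observed))
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)
round(ci, 2)
```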
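fisher.test(mydata$SurvivedF, mydata$SexF) #Fisher's exact test is used instead of the chi-square test when the requirement on expected frequencies is not met (expected counts that are too small). Here all expected counts are far above 5, so the chi-square test is appropriate and Fisher's test only confirms the result.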
##
## Fisher's Exact Test for Count Data
##
## data: mydata$SurvivedF and mydata$SexF
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 8.787262 17.381934
## sample estimates:
## odds ratio
## 12.30265