Homework with the program R. Made by Martin Nastoski.

1. First research question

Research question: Is there an association between a passenger's sex and whether they survived the Titanic shipwreck?

1.1. Data

Description of the variables:

PassengerID: Passenger ID.

Survived: Whether a passenger survived the shipwreck or not (0:No, 1:Yes).

Pclass: Ticket class of the passenger (1:first class, 2:second class, 3:third class).

Name: Name of the passenger.

Sex: Gender of the passenger.

Age: Age of the person when boarding the Titanic.

SibSp: Number of siblings/spouses aboard the Titanic.

Parch: Number of parents/children aboard the Titanic.

Ticket: Ticket number of the passenger.

Fare: The fare the passenger paid for the ticket.

Cabin: Cabin number of the passenger.

Embarked: Port where the passenger boarded the Titanic.

Variables used to test for an association:

Survived - Whether the passenger survived or not.

Sex - Whether the passenger is male or female.

# Convert the categorical variables into factors.
mydata$SurvivedF <- factor(mydata$Survived,
                           levels = c(0, 1),
                           labels = c("No", "Yes"))

mydata$SexF <- factor(mydata$Sex,
                      levels = c("male", "female"),
                      labels = c("Male", "Female"))

# Descriptive statistics for the two variables.
summary(mydata[c("SurvivedF", "SexF")])
##  SurvivedF     SexF    
##  No :549   Male  :577  
##  Yes:342   Female:314

Explanation of descriptive statistics:

SurvivedF: No is the number of passengers who did not survive and Yes is the number of passengers who survived.

SexF: Male is the number of male passengers and Female is the number of female passengers.

1.2. Analysis

chi_square <- chisq.test(mydata$SurvivedF, mydata$SexF, # Pearson's chi-square test of association between two categorical variables.
                         correct = TRUE) # Yates' continuity correction, since the table is 2x2.

chi_square
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$SurvivedF and mydata$SexF
## X-squared = 260.72, df = 1, p-value < 2.2e-16
addmargins(chi_square$observed) #Observed values
##                 mydata$SexF
## mydata$SurvivedF Male Female Sum
##              No   468     81 549
##              Yes  109    233 342
##              Sum  577    314 891
addmargins(round(chi_square$expected, 2)) #Expected values
##                 mydata$SexF
## mydata$SurvivedF   Male Female Sum
##              No  355.53 193.47 549
##              Yes 221.47 120.53 342
##              Sum 577.00 314.00 891
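
The expected counts and the Yates-corrected statistic can be reproduced by hand. The following is a minimal sketch that rebuilds the 2x2 table from the observed counts printed above instead of from mydata; the names obs, expected, x2_yates and p_value are illustrative and not part of the original analysis.

```r
# Observed counts copied from the output above (rows: Survived, columns: Sex).
obs <- matrix(c(468, 109, 81, 233), nrow = 2,
              dimnames = list(Survived = c("No", "Yes"),
                              Sex = c("Male", "Female")))

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)

# Assumption check for the chi-square test: every expected count is at least 5.
all(expected >= 5)

# Yates-corrected statistic: sum over the four cells of (|O - E| - 0.5)^2 / E.
x2_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)
round(x2_yates, 2)

# p-value from the chi-square distribution with (2 - 1) * (2 - 1) = 1 degree of freedom.
p_value <- pchisq(x2_yates, df = 1, lower.tail = FALSE)
p_value
```

The rounded statistic should agree with the X-squared value reported by chisq.test() above, and the smallest expected count (about 120) is far above 5.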
round(chi_square$residuals, 2) # Pearson residuals: (observed - expected) / sqrt(expected)
##                 mydata$SexF
## mydata$SurvivedF  Male Female
##              No   5.97  -8.09
##              Yes -7.56  10.25
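
chisq.test() returns two kinds of residuals, and the values printed above are the Pearson residuals. A minimal sketch of the difference, using a table rebuilt from the printed counts (obs, res and pearson are illustrative names, not objects from the original analysis):

```r
# Observed counts copied from the output above (rows: Survived, columns: Sex).
obs <- matrix(c(468, 109, 81, 233), nrow = 2,
              dimnames = list(Survived = c("No", "Yes"),
                              Sex = c("Male", "Female")))

res <- chisq.test(obs, correct = TRUE)

# Pearson residuals computed by hand: (observed - expected) / sqrt(expected).
pearson <- (res$observed - res$expected) / sqrt(res$expected)
round(pearson, 2)

# They are the same values that chisq.test() stores in $residuals.
all.equal(pearson, res$residuals)

# $stdres holds the adjusted standardized residuals, which also
# account for the row and column proportions and are larger in magnitude.
round(res$stdres, 2)
```

Either set of residuals points the same way here: more men than expected did not survive, and more women than expected survived.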
addmargins(round(prop.table(chi_square$observed), 3)) # Proportions of the grand total: 52.5% of all passengers were men who did not survive.
##                 mydata$SexF
## mydata$SurvivedF  Male Female   Sum
##              No  0.525  0.091 0.616
##              Yes 0.122  0.262 0.384
##              Sum 0.647  0.353 1.000
addmargins(round(prop.table(chi_square$observed, 1), 3), 2) # Out of all passengers who did not survive 85.2% were male.  
##                 mydata$SexF
## mydata$SurvivedF  Male Female   Sum
##              No  0.852  0.148 1.000
##              Yes 0.319  0.681 1.000
addmargins(round(prop.table(chi_square$observed, 2), 3), 1) #Out of all passengers who were male 81.1% did not survive.
##                 mydata$SexF
## mydata$SurvivedF  Male Female
##              No  0.811  0.258
##              Yes 0.189  0.742
##              Sum 1.000  1.000
#install.packages("effectsize")
library(effectsize) 
effectsize::cramers_v(mydata$SurvivedF, mydata$SexF) # Cramér's V (bias-adjusted) is 0.54.
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.54              | [0.49, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.54) # The effect size is very large according to the funder2019 rules (Funder and Ozer, 2019).
## [1] "very large"
## (Rules: funder2019)
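
For reference, the unadjusted Cramér's V can be computed directly from the chi-square statistic as sqrt(X2 / (n * min(r - 1, c - 1))). The sketch below uses base R only and a table rebuilt from the printed counts; obs, x2, n, k and v are illustrative names. It assumes the uncorrected statistic (no Yates correction) is the right input, and since effectsize reports a bias-adjusted value, the two may differ slightly.

```r
# Observed counts copied from the output above (rows: Survived, columns: Sex).
obs <- matrix(c(468, 109, 81, 233), nrow = 2,
              dimnames = list(Survived = c("No", "Yes"),
                              Sex = c("Male", "Female")))

# Uncorrected Pearson chi-square statistic.
x2 <- unname(chisq.test(obs, correct = FALSE)$statistic)

n <- sum(obs)            # total number of passengers
k <- min(dim(obs)) - 1   # min(rows - 1, columns - 1); equals 1 for a 2x2 table

# Unadjusted Cramér's V.
v <- sqrt(x2 / (n * k))
round(v, 2)
```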
oddsratio(mydata$SurvivedF, mydata$SexF) # Odds ratio: the odds of survival for female passengers were 12.35 times the odds for male passengers.
## Odds ratio |        95% CI
## --------------------------
## 12.35      | [8.90, 17.14]
interpret_oddsratio(12.35) # The odds ratio of 12.35 is large according to the chen2010 rules (Chen et al., 2010).
## [1] "large"
## (Rules: chen2010)
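
The odds ratio can also be worked out by hand from the four observed counts. This is a minimal sketch in base R with illustrative names (obs, or, odds_female, odds_male, se_log, ci). The Wald interval on the log scale is an assumption about the method; on these counts it gives approximately the same interval as the one reported above.

```r
# Observed counts copied from the output above (rows: Survived, columns: Sex).
obs <- matrix(c(468, 109, 81, 233), nrow = 2,
              dimnames = list(Survived = c("No", "Yes"),
                              Sex = c("Male", "Female")))

# Sample odds ratio: cross-product ratio of the 2x2 table.
or <- (obs["No", "Male"] * obs["Yes", "Female"]) /
      (obs["No", "Female"] * obs["Yes", "Male"])

# Equivalent reading: odds of survival among women divided by odds among men.
odds_female <- obs["Yes", "Female"] / obs["No", "Female"]
odds_male   <- obs["Yes", "Male"] / obs["No", "Male"]
odds_female / odds_male

# Wald 95% confidence interval, built on the log odds ratio.
se_log <- sqrt(sum(1 / obs))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log)

round(or, 2)
round(ci, 2)
```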
fisher.test(mydata$SurvivedF, mydata$SexF) # Fisher's exact test is the alternative when the chi-square assumptions are not met (for example, expected frequencies below 5). Here every expected frequency is well above 5, so the chi-square test is appropriate.
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata$SurvivedF and mydata$SexF
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   8.787262 17.381934
## sample estimates:
## odds ratio 
##   12.30265
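
Fisher's test reports an odds ratio of 12.30 rather than 12.35 because fisher.test() returns the conditional maximum likelihood estimate, not the sample cross-product ratio. A small sketch of the comparison on a table rebuilt from the printed counts (obs, ft and sample_or are illustrative names):

```r
# Observed counts copied from the output above (rows: Survived, columns: Sex).
obs <- matrix(c(468, 109, 81, 233), nrow = 2,
              dimnames = list(Survived = c("No", "Yes"),
                              Sex = c("Male", "Female")))

ft <- fisher.test(obs)

# Conditional maximum likelihood estimate of the odds ratio.
unname(ft$estimate)

# Sample odds ratio (cross-product ratio) for comparison.
sample_or <- (obs[1, 1] * obs[2, 2]) / (obs[1, 2] * obs[2, 1])
sample_or

# Exact 95% confidence interval for the odds ratio.
ft$conf.int
```

The two estimates are close, and both intervals exclude 1, so the choice of test does not change the conclusion.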

1.3. Conclusion

To conclude, I ran both Pearson's chi-square test and Fisher's exact test. Because the assumptions of the chi-square test are met (all expected frequencies are well above 5), its result is the one to report, although both tests lead to the same conclusion (p < 0.001 in each). The association between the two variables is very large (Cramér's V = 0.54), and the odds of survival for female passengers were about 12 times the odds for male passengers. The research question can therefore be answered in the affirmative: there is a relationship between a passenger's sex and whether they survived the sinking of the Titanic.