Correlation

Homework with the program R. Made by Martin Nastoski.

1. First research question

Research question: Is there a relationship between the sex of a passenger and whether the passenger survived the Titanic shipwreck?

The data come from this source: https://www.kaggle.com/code/alexisbcook/titanic-tutorial

mydata <- read.table("C:/Users/Martin/Desktop/Fangzz/Faculty of Economics/Year 1/Semester 1/Research Methods and Techniques/Program R/Homework/Homework part 1/Titanic/train.csv", header = TRUE, sep = ",", dec = ",")

head(mydata) #Showing the first six rows of the data set

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male         0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171    7.25              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282   7.925              S
## 4           113803    53.1  C123        S
## 5           373450    8.05              S
## 6           330877  8.4583              Q

tail(mydata) #Showing the last six rows of the data set

##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536     13              S
## 888  19     0     0     112053     30   B42        S
## 889         1     2 W./C. 6607  23.45              S
## 890  26     0     0     111369     30  C148        C
## 891  32     0     0     370376   7.75              Q

1.1. Data

Description of the variables:

PassengerId: Passenger ID.

Survived: Whether a passenger survived the shipwreck or not (0:No, 1:Yes).

Pclass: Ticket class of the passenger (1:first class, 2:second class, 3:third class).

Name: Name of the passenger.

Sex: Gender of the passenger.

Age: Age of the person when boarding the Titanic.

SibSp: Number of siblings/spouses aboard the Titanic.

Parch: Number of parents/children aboard the Titanic.

Ticket: Ticket number of the passenger.

Fare: The fare the passenger paid for the ticket.

Cabin: Cabin number of the passenger.

Embarked: Port where the passenger boarded the Titanic.

Variables used in the analysis (to test whether there is a relationship between them):

Survived - Whether the passenger survived or not.

Sex - Whether the passenger is male or female.

mydata$SurvivedF <- factor(mydata$Survived, 
                          levels = c(0, 1), 
                          labels = c("No", "Yes"))

mydata$SexF <- factor(mydata$Sex, 
                            levels = c("male", "female"),
                            labels = c("Male", "Female")) #Converting categorical variables into factors.

summary(mydata[colnames(mydata) %in% c("SurvivedF", "SexF")]) #Descriptive statistics for the two variables.

##  SurvivedF     SexF    
##  No :549   Male  :577  
##  Yes:342   Female:314

Explanation of descriptive statistics:

SurvivedF: No is the number of passengers who did not survive and Yes is the number of passengers who survived.

SexF: Male is the number of male passengers and Female is the number of female passengers.

1.2. Analysis

chi_square <- chisq.test(mydata$SurvivedF, mydata$SexF, #Pearson's chi-squared test of association between two categorical variables.
                        correct = TRUE) #Yates' continuity correction, since the table is 2x2.

chi_square

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$SurvivedF and mydata$SexF
## X-squared = 260.72, df = 1, p-value < 2.2e-16
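
As a check, the Yates-corrected statistic can be reproduced by hand from the observed counts printed in this report. This is only a minimal sketch: the object names `observed`, `expected` and `x2_yates` are my own and are not part of the original analysis.

```r
# Observed counts taken from the contingency table in this report
# (rows: Survived No/Yes, columns: Sex Male/Female).
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Expected counts under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction subtracts 0.5 from each |O - E| before squaring.
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
round(x2_yates, 2)

# p-value from the chi-squared distribution with 1 degree of freedom.
pchisq(x2_yates, df = 1, lower.tail = FALSE)
```

The rounded statistic should agree with the X-squared value reported by chisq.test().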

addmargins(chi_square$observed) #Observed values

##                 mydata$SexF
## mydata$SurvivedF Male Female Sum
##              No   468     81 549
##              Yes  109    233 342
##              Sum  577    314 891

addmargins(round(chi_square$expected, 2)) #Expected values

##                 mydata$SexF
## mydata$SurvivedF   Male Female Sum
##              No  355.53 193.47 549
##              Yes 221.47 120.53 342
##              Sum 577.00 314.00 891
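
The expected values come from the row and column totals of the observed table. As a worked example (the symbols R_i, C_j and n are my own notation for the row total, column total and sample size), the expected count of male passengers who did not survive is:

```latex
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
E_{\text{No},\,\text{Male}} = \frac{549 \times 577}{891} \approx 355.53
```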

round(chi_square$residuals, 2) # Pearson residuals, (observed - expected) / sqrt(expected)

##                 mydata$SexF
## mydata$SurvivedF  Male Female
##              No   5.97  -8.09
##              Yes -7.56  10.25
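
The residuals above can be recomputed from the observed counts as a check. The sketch below assumes they are Pearson residuals, (O - E) / sqrt(E); the object names are mine and only illustrative.

```r
# Observed counts from the contingency table in this report.
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual for each cell: positive means more cases than expected.
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)
```

Cells with large absolute residuals contribute most to the chi-squared statistic; here the largest one belongs to female passengers who survived.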

addmargins(round(prop.table(chi_square$observed), 3)) #52.5% of all passengers were males who did not survive.

##                 mydata$SexF
## mydata$SurvivedF  Male Female   Sum
##              No  0.525  0.091 0.616
##              Yes 0.122  0.262 0.384
##              Sum 0.647  0.353 1.000

addmargins(round(prop.table(chi_square$observed, 1), 3), 2) # Out of all passengers who did not survive 85.2% were male.

##                 mydata$SexF
## mydata$SurvivedF  Male Female   Sum
##              No  0.852  0.148 1.000
##              Yes 0.319  0.681 1.000

addmargins(round(prop.table(chi_square$observed, 2), 3), 1) #Out of all passengers who were male 81.1% did not survive.

##                 mydata$SexF
## mydata$SurvivedF  Male Female
##              No  0.811  0.258
##              Yes 0.189  0.742
##              Sum 1.000  1.000

#install.packages("effectsize")
library(effectsize)

## Warning: package 'effectsize' was built under R version 4.3.2

effectsize::cramers_v(mydata$SurvivedF, mydata$SexF) #The effect size, measured by the adjusted Cramer's V, is 0.54.

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.54              | [0.49, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
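
For reference, the unadjusted Cramer's V can be computed from the chi-squared statistic without continuity correction. This is a sketch with my own object names; the output above is labelled as an adjusted version, so a small difference from the value computed here is to be expected.

```r
# Observed counts from the contingency table in this report.
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Chi-squared statistic without Yates' continuity correction.
x2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V: sqrt(X^2 / (n * (min(rows, columns) - 1))).
n <- sum(observed)
v <- sqrt(x2 / (n * (min(dim(observed)) - 1)))
round(v, 2)
```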

interpret_cramers_v(0.54) #The effect size is very large according to the funder2019 rules.

## [1] "very large"
## (Rules: funder2019)

oddsratio(mydata$SurvivedF, mydata$SexF) #Odds ratio between the two variables.

## Odds ratio |        95% CI
## --------------------------
## 12.35      | [8.90, 17.14]
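
The odds ratio can also be obtained directly from the four cell counts. The sketch below adds a Wald confidence interval on the log scale; I am assuming, not asserting, that this is the kind of interval the package reports, although it gives the same rounded bounds here. The object names are mine.

```r
# Cell counts from the contingency table in this report.
no_male    <- 468  # male, did not survive
no_female  <- 81   # female, did not survive
yes_male   <- 109  # male, survived
yes_female <- 233  # female, survived

# Sample odds ratio: cross-product of the 2x2 table.
or <- (no_male * yes_female) / (no_female * yes_male)

# Wald 95% confidence interval, built on the log odds ratio.
se_log_or <- sqrt(1 / no_male + 1 / no_female + 1 / yes_male + 1 / yes_female)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(or, 2)
round(ci, 2)
```

With this orientation of the table, the value means that the odds of not surviving were about 12 times higher for male passengers than for female passengers.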

interpret_oddsratio(12.35) # The odds ratio of 12.35 is large according to the chen2010 rules.

## [1] "large"
## (Rules: chen2010)

fisher.test(mydata$SurvivedF, mydata$SexF) # Fisher's exact test is typically used when the assumptions of the chi-squared test are not met (e.g. small expected frequencies). Here the assumptions are met, so the chi-squared result is the one reported.

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata$SurvivedF and mydata$SexF
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   8.787262 17.381934
## sample estimates:
## odds ratio 
##   12.30265

1.3. Conclusion

To conclude, I performed both Pearson's chi-squared test and Fisher's exact test. Since the assumptions of the chi-squared test are met (all expected frequencies are far above 5), its result is the one reported, although both tests give very similar results (p < 0.001 in each case). The association between the two variables is very large (Cramer's V = 0.54), and the odds ratio of 12.35 means that the odds of surviving were about 12 times higher for female passengers than for male passengers. The research question can therefore be answered with Yes: there is a relationship between the sex of a passenger and whether the passenger survived the shipwreck of the Titanic.
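
The statement that the assumptions are met can be checked from the expected frequencies. The sketch below uses the common rule of thumb that every expected count should be at least 5; that threshold is my assumption about which requirement is meant, and the object names are mine.

```r
# Observed counts from the contingency table in this report.
observed <- matrix(c(468, 109, 81, 233), nrow = 2,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))

# Expected counts as computed by chisq.test().
expected <- chisq.test(observed)$expected

# Rule of thumb: all expected counts should be at least 5.
all(expected >= 5)
round(min(expected), 2)
```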

2024-01-08