HOMEWORK 2 - CORRELATION

LARISA O.

titanic_data <- read.csv("Titanic-Dataset.csv")
head(titanic_data, 10) #Table with first 10 rows.
##    PassengerId Survived Pclass
## 1            1        0      3
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 5            5        0      3
## 6            6        0      3
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
##                                                   Name    Sex Age SibSp Parch
## 1                              Braund, Mr. Owen Harris   male  22     1     0
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                             Allen, Mr. William Henry   male  35     0     0
## 6                                     Moran, Mr. James   male  NA     0     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 8                       Palsson, Master. Gosta Leonard   male   2     3     1
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
##              Ticket    Fare Cabin Embarked
## 1         A/5 21171  7.2500              S
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250              S
## 4            113803 53.1000  C123        S
## 5            373450  8.0500              S
## 6            330877  8.4583              Q
## 7             17463 51.8625   E46        S
## 8            349909 21.0750              S
## 9            347742 11.1333              S
## 10           237736 30.0708              C

Description of data:

  • PassengerId: identifier assigned to each passenger

  • Survived: binary variable; 1 if the individual survived, 0 if not survived

  • Pclass (class of ticket): 1 - 1st class, 2 - 2nd class, 3 - 3rd class

  • Name: passenger’s name

  • Sex: categorical variable; male or female

  • Age: age in years

  • SibSp: number of siblings/spouses aboard the Titanic

  • Parch: number of parents/children aboard the Titanic

  • Ticket: ticket number

  • Fare: the amount paid for the ticket in dollars

  • Cabin: cabin number where the passenger stayed

  • Embarked: port of embarkation; C: Cherbourg, Q: Queenstown, S: Southampton

This is the Titanic passenger data. The unit of observation is an individual passenger who was aboard the RMS Titanic during its last voyage in 1912. The dataset contains 891 passengers (the sample size) and 12 variables capturing key details about them (listed above).

The Titanic dataset is from the Kaggle website (https://www.kaggle.com/datasets/yasserh/titanic-dataset?resource=download).

titanic_data <- titanic_data[, c("Survived", "Sex", "Age", "Fare")] #I kept only the four variables used in the analysis and removed class, name, siblings/spouses aboard, parents/children aboard, ticket number, cabin number, passenger ID and port of embarkation.
any(is.na(titanic_data)) #I checked whether the dataset contains any missing values.
## [1] TRUE

There is some missing data.

missing_counts <- colSums(is.na(titanic_data))  

print(missing_counts[missing_counts > 0]) #I checked which variables have missing values.
## Age 
## 177

Age is missing in 177 rows.

titanic_data <- titanic_data[!is.na(titanic_data$Age), ] #I removed all rows where Age was missing (Age is the only variable with missing values).
titanic_data$Age <- round(titanic_data$Age, 0) #I rounded Age to a whole number.
titanic_data <- titanic_data[titanic_data$Age != 0, , drop = FALSE] #I removed passengers whose rounded age is 0.
titanic_data$Fare <- round(titanic_data$Fare, 2) #I rounded Fare to two decimals.
titanic_data$Survived <- factor(titanic_data$Survived, 
                                levels = c(1, 0), 
                                labels = c("yes", "no"))

titanic_data$Sex  <- factor(titanic_data$Sex, 
                            levels = c("male", "female"), 
                            labels = c("male", "female")) #I converted Survived and Sex to factors.
head(titanic_data, 10)
##    Survived    Sex Age  Fare
## 1        no   male  22  7.25
## 2       yes female  38 71.28
## 3       yes female  26  7.92
## 4       yes female  35 53.10
## 5        no   male  35  8.05
## 7        no   male  54 51.86
## 8        no   male   2 21.08
## 9       yes female  27 11.13
## 10      yes female  14 30.07
## 11      yes female   4 16.70
#install.packages("psych")
library(psych)
describe(titanic_data)
##           vars   n  mean    sd median trimmed   mad min    max  range  skew
## Survived*    1 713  1.59  0.49   2.00    1.62  0.00   1   2.00   1.00 -0.38
## Sex*         2 713  1.37  0.48   1.00    1.33  0.00   1   2.00   1.00  0.55
## Age          3 713 29.73 14.49  28.00   29.29 13.34   1  80.00  79.00  0.40
## Fare         4 713 34.73 52.95  15.74   23.22 12.22   0 512.33 512.33  4.63
##           kurtosis   se
## Survived*    -1.85 0.02
## Sex*         -1.69 0.02
## Age           0.16 0.54
## Fare         30.57 1.98

Survived is a factor, so describe() summarises its integer codes (1 = yes, 2 = no) rather than a survival rate. The mean of 1.59 therefore means that about 59% of passengers fall in the second level, i.e. did not survive. The slightly negative skew (-0.38) says the same thing: more observations take the higher code (no) than the lower one (yes).
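
To see where a mean of 1.59 comes from, the sketch below rebuilds the Survived factor from the counts in the cleaned data (289 yes, 424 no) and averages its integer codes. The object names are illustrative only and are not part of the analysis above.

```r
# Rebuild the Survived factor from the counts in the cleaned data
# (289 "yes", 424 "no"); `survived` is an illustrative name.
survived <- factor(rep(c("yes", "no"), times = c(289, 424)),
                   levels = c("yes", "no"))

# A factor is stored as integer codes: yes = 1, no = 2.
codes <- as.integer(survived)
mean_code <- mean(codes)

# Share of passengers in the second level ("no").
share_not_survived <- mean(survived == "no")

round(mean_code, 2)           # mean of the codes, as reported by describe()
round(share_not_survived, 3)  # equals mean_code - 1
```

Because the codes are 1 and 2, the mean code minus 1 is exactly the share of passengers who did not survive.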

The average age of passengers is approximately 29.73 years.

Fare is highly variable: its standard deviation (52.95) is larger than its mean (34.73). The distribution is strongly positively skewed (4.63), indicating a long right tail.

summary(titanic_data)
##  Survived      Sex           Age             Fare       
##  yes:289   male  :452   Min.   : 1.00   Min.   :  0.00  
##  no :424   female:261   1st Qu.:20.00   1st Qu.:  8.05  
##                         Median :28.00   Median : 15.74  
##                         Mean   :29.73   Mean   : 34.73  
##                         3rd Qu.:38.00   3rd Qu.: 33.50  
##                         Max.   :80.00   Max.   :512.33

Among the 713 passengers in the cleaned data, 289 survived and 424 did not. There were 452 male and 261 female passengers. Age ranged from 1 (min) to 80 (max). The median age was 28, meaning that half of the passengers were 28 or younger, and the mean (average) age was 29.73. Fares ranged from 0 (min) to 512.33 (max) dollars. The median fare was 15.74 dollars, meaning that half of the passengers paid 15.74 dollars or less, and the mean (average) fare was 34.73 dollars. 25% of the passengers paid 8.05 dollars or less for their ticket.

#install.packages("ggplot2")
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(titanic_data, aes(x = Age, fill = Sex)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.7) +
  facet_wrap(~Survived) +
  labs(title = "Histogram of Age by Survival Status and Gender")

From this histogram we can observe that more females survived than did not, while the opposite holds for males.

Correlation analysis

RQ 1: Is there a correlation between passengers’ age and the fare they paid?

#install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
numeric <- titanic_data[, c("Age", "Fare")]
scatterplotMatrix(numeric, smooth = FALSE)

There seem to be two outliers, but the Age/Fare scatterplot suggests they are plausible values (very high fares) rather than data errors, so I won’t remove them.

#install.packages("ggplot2")
library(ggplot2)
ggplot(titanic_data, aes(x = Age, y = Fare)) +
  geom_point() +
  ggtitle("Scatterplot of Age vs. Fare")

#install.packages("Hmisc")
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(titanic_data[, c("Age", "Fare")]), 
      type = "pearson") #More detailed information about correlation coefficients, including p-values and the number of valid observations for each pair.
##      Age Fare
## Age  1.0  0.1
## Fare 0.1  1.0
## 
## n= 713 
## 
## 
## P
##      Age   Fare 
## Age        0.011
## Fare 0.011

The correlation between “Age” and “Fare” is weak and positive (r = 0.1): on average, as the age of passengers increases, the fare they paid tends to increase slightly. The p-value indicates that the observed correlation is statistically significant (p = 0.011).

cor(titanic_data$Age, titanic_data$Fare,
    method = "pearson",
    use = "complete.obs") #Basic function for calculating correlation coefficients.
## [1] 0.09516346

There is a weak positive linear relationship between Age and Fare (0.095).

cor.test(titanic_data$Age, titanic_data$Fare,
         method = "pearson") #Function calculates the correlation coefficient and performs a hypothesis test to determine if the observed correlation is significantly different from zero.
## 
##  Pearson's product-moment correlation
## 
## data:  titanic_data$Age and titanic_data$Fare
## t = 2.5491, df = 711, p-value = 0.01101
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02189266 0.16741744
## sample estimates:
##        cor 
## 0.09516346

Null hypothesis: The correlation is equal to 0.

Alternative hypothesis: The correlation is not equal to 0.

Based on the sample data, we can reject the null hypothesis (p = 0.011). The correlation between Age and Fare is statistically significant but weak and positive (r = 0.095, about 0.1): on average, as the age of passengers increases, the fare they paid tends to increase slightly.
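
The t statistic, p-value and confidence interval printed by cor.test() can be reproduced from r and n alone. The sketch below does this with the values from the output above (r = 0.09516346, n = 713). The variable names are illustrative, and the interval is built with the Fisher z-transformation.

```r
# Values copied from the cor.test() output above.
r <- 0.09516346
n <- 713

# Test statistic for H0: correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z-transformation.
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(t_stat, 4)
round(p_value, 5)
round(ci, 4)
```

The interval excludes 0, which agrees with rejecting the null hypothesis, but both of its ends are small, which agrees with calling the relationship weak.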

RQ 2 - Is there any relationship between the survival status and the gender of Titanic passengers?

library(ggplot2)
ggplot(titanic_data, aes(x = Survived, fill = Sex)) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Survival by Gender",
       x = "survival",
       y = "count") +
  scale_fill_manual(values = c("blue", "pink"), labels = c("Male", "Female"))

Before proceeding with the Pearson chi-squared test, we need to check the following assumptions:

  • Observations are independent - true; each person was either male or female and has survived or not (single option).

  • All expected frequencies are greater than 5 for all combinations.

None of the assumptions are violated with our data set.

results <- chisq.test(titanic_data$Survived, titanic_data$Sex, 
                      correct = TRUE)
results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  titanic_data$Survived and titanic_data$Sex
## X-squared = 206.31, df = 1, p-value < 2.2e-16

Null hypothesis: There is no association between survival and gender.

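Alternative hypothesis: There is an association between survival and gender.

Based on the sample data, we reject the null hypothesis (p < 0.001).

The observed chi-squared value of 206.31 is very large for 1 degree of freedom, which is why the p-value is so small. The statistic itself grows with sample size, so the strength of the association is better judged from an effect size such as Cramér’s V.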

addmargins(results$observed)
##                      titanic_data$Sex
## titanic_data$Survived male female Sum
##                   yes   92    197 289
##                   no   360     64 424
##                   Sum  452    261 713
round(results$expected)
##                      titanic_data$Sex
## titanic_data$Survived male female
##                   yes  183    106
##                   no   269    155

All expected frequencies are greater than 5, so this assumption of the test is met.
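
The expected frequencies and the test statistic can also be computed by hand from the observed table. The sketch below copies the observed counts from the output above; the object names are illustrative, and the formula shown is the Yates-corrected statistic that corresponds to correct = TRUE.

```r
# Observed counts copied from addmargins(results$observed) above.
observed <- matrix(c(92, 360, 197, 64), nrow = 2,
                   dimnames = list(Survived = c("yes", "no"),
                                   Sex = c("male", "female")))

# Expected count = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each |O - E|.
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)

round(expected)        # expected counts under independence
all(expected > 5)      # assumption check
round(x_squared, 2)    # chi-squared statistic with Yates' correction
```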

round(results$residuals, 2)
##                      titanic_data$Sex
## titanic_data$Survived  male female
##                   yes -6.74   8.87
##                   no   5.56  -7.32

A Pearson residual of -6.74 for males who survived indicates that significantly fewer males survived than would be expected if survival and gender were independent.
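
chisq.test() stores two kinds of residuals, and they are easy to mix up. The sketch below rebuilds the observed table from the counts above (object names are illustrative) and shows both: Pearson residuals in the residuals element and standardized residuals in the stdres element.

```r
# Observed counts copied from the output above.
observed <- matrix(c(92, 360, 197, 64), nrow = 2,
                   dimnames = list(Survived = c("yes", "no"),
                                   Sex = c("male", "female")))
test <- chisq.test(observed)

# Pearson residuals: (O - E) / sqrt(E); same as test$residuals.
pearson_res <- (observed - test$expected) / sqrt(test$expected)

# Standardized residuals also account for the row and column shares:
# (O - E) / sqrt(E * (1 - row share) * (1 - column share)).
standardized_res <- test$stdres

round(pearson_res, 2)
round(standardized_res, 2)
```

The standardized residuals are larger in absolute value than the Pearson residuals, so either version leads to the same reading here: far fewer males survived than expected under independence.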

addmargins(round(prop.table(results$observed), 3))
##                      titanic_data$Sex
## titanic_data$Survived  male female   Sum
##                   yes 0.129  0.276 0.405
##                   no  0.505  0.090 0.595
##                   Sum 0.634  0.366 1.000

12.9% of the individuals in the data set are males who survived.

addmargins(round(prop.table(results$observed, 1), 3), 2)
##                      titanic_data$Sex
## titanic_data$Survived  male female   Sum
##                   yes 0.318  0.682 1.000
##                   no  0.849  0.151 1.000

Approximately 31.8% of all passengers who survived were male.

addmargins(round(prop.table(results$observed, 2), 3), 1)
##                      titanic_data$Sex
## titanic_data$Survived  male female
##                   yes 0.204  0.755
##                   no  0.796  0.245
##                   Sum 1.000  1.000

Approximately 20.4% of male passengers survived.

#install.packages("effectsize")
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(titanic_data$Survived, titanic_data$Sex)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.54              | [0.48, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.54)
## [1] "very large"
## (Rules: funder2019)

There is a very large association between the variables Survived and Sex.
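
As a rough cross-check, Cramér's V can be computed in base R from the chi-squared statistic without continuity correction. This sketch gives the unadjusted value, whereas the effectsize output above is labelled "(adj.)", so small differences between the two are to be expected. The object names are illustrative.

```r
# Observed counts copied from the output above.
observed <- matrix(c(92, 360, 197, 64), nrow = 2,
                   dimnames = list(Survived = c("yes", "no"),
                                   Sex = c("male", "female")))

# Chi-squared statistic without Yates' continuity correction.
x_squared <- unname(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)       # number of passengers
k <- min(dim(observed))  # smaller of the number of rows and columns

# Unadjusted Cramer's V.
cramers_v_unadjusted <- sqrt(x_squared / (n * (k - 1)))
round(cramers_v_unadjusted, 2)
```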

oddsratio(titanic_data$Survived, titanic_data$Sex)
## Odds ratio |       95% CI
## -------------------------
## 0.08       | [0.06, 0.12]
interpret_oddsratio(0.08)
## [1] "large"
## (Rules: chen2010)

The odds of survival for males are 0.08 times the odds of survival for females, i.e. the odds for females are roughly 12 times higher. This is a large difference in the odds of survival between genders.
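
The odds ratio follows directly from the four observed counts. The sketch below computes it by hand and adds a Wald interval on the log scale. The Wald interval is one common choice and an assumption of this sketch; it is not necessarily the method effectsize uses internally.

```r
# Observed counts copied from the output above.
male_yes <- 92;    male_no <- 360
female_yes <- 197; female_no <- 64

# Odds of survival within each gender.
odds_male <- male_yes / male_no
odds_female <- female_yes / female_no

# Odds ratio: male odds relative to female odds.
odds_ratio <- odds_male / odds_female

# 95% Wald confidence interval, built on the log scale.
se_log_or <- sqrt(1 / male_yes + 1 / male_no + 1 / female_yes + 1 / female_no)
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(odds_ratio, 3)
round(ci, 2)
round(1 / odds_ratio, 1)  # female odds relative to male odds
```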

fisher.test(titanic_data$Survived, titanic_data$Sex)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  titanic_data$Survived and titanic_data$Sex
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.05678004 0.12125183
## sample estimates:
## odds ratio 
## 0.08340755

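Fisher’s exact test is typically used when the expected-frequency assumption of the chi-squared test is not met (small expected counts). That is not the case here, so I run it only for practice and expect results similar to those above.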

Null Hypothesis: The odds of survival for individuals with different genders are equal.

Alternative Hypothesis: The odds of survival for individuals with different genders are not equal.

Based on the data, we reject the null hypothesis (p < 0.001). The odds of survival are significantly lower for males compared with females (OR = 0.083), which supports the conclusion that there is a significant association between survival and gender.

I therefore conclude that survival status does vary with the gender of Titanic passengers. As described earlier, there is a significant association between survival status and gender (the odds of survival are greater for females).