titanic_data <- read.csv("Titanic-Dataset.csv")
head(titanic_data, 10) #Table with first 10 rows.
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## 7 7 0 1
## 8 8 0 3
## 9 9 1 3
## 10 10 1 2
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 8 Palsson, Master. Gosta Leonard male 2 3 1
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
## 7 17463 51.8625 E46 S
## 8 349909 21.0750 S
## 9 347742 11.1333 S
## 10 237736 30.0708 C
Description of data:
PassengerId: identifier assigned to each passenger
Survived: binary variable; 1 if the individual survived, 0 if not survived
Pclass (class of ticket): 1 - 1st class, 2 - 2nd class, 3 - 3rd class
Name: passenger’s name
Sex: categorical variable; male or female
Age: age in years
SibSp: number of siblings/spouses aboard the Titanic
Parch: number of parents/children aboard the Titanic
Ticket: ticket number
Fare: the amount paid for the ticket in dollars
Cabin: cabin number where the passenger stayed
Embarked: port of embarkation; C: Cherbourg, Q: Queenstown, S: Southampton
This is the Titanic passenger data. The unit of observation is an individual passenger who was aboard the RMS Titanic during its last voyage in 1912. The dataset includes information on 891 passengers (the sample size) and has 12 variables capturing key details about them (described above).
The Titanic dataset is from the Kaggle website (https://www.kaggle.com/datasets/yasserh/titanic-dataset?resource=download).
#I kept only Survived, Sex, Age and Fare. This removes the passenger ID, class, name,
#siblings/spouses aboard, parents/children aboard, ticket number, cabin number and port of embarkation.
titanic_data <- titanic_data[, c("Survived", "Sex", "Age", "Fare")]
any(is.na(titanic_data)) #I looked if there are some missing data in dataset.
## [1] TRUE
There is some missing data.
missing_counts <- colSums(is.na(titanic_data))
print(missing_counts[missing_counts > 0]) #I looked at which columns have missing values.
## Age
## 177
Age is missing in 177 rows.
titanic_data <- titanic_data[!is.na(titanic_data$Age), ] #I removed all rows where Age was missing (na.omit() has no cols argument for data frames, so it was silently ignored).
titanic_data$Age <- round(titanic_data$Age, 0) #I rounded Age to a whole number.
titanic_data <- titanic_data[titanic_data$Age != 0, , drop = FALSE] #I removed all passengers whose rounded Age is 0.
titanic_data$Fare <- round(titanic_data$Fare, 2) #I rounded Fare to two decimals.
titanic_data$Survived <- factor(titanic_data$Survived,
                                levels = c(1, 0),
                                labels = c("yes", "no"))
titanic_data$Sex <- factor(titanic_data$Sex,
                           levels = c("male", "female"),
                           labels = c("male", "female")) #I converted Survived and Sex to factors.
head(titanic_data, 10)
## Survived Sex Age Fare
## 1 no male 22 7.25
## 2 yes female 38 71.28
## 3 yes female 26 7.92
## 4 yes female 35 53.10
## 5 no male 35 8.05
## 7 no male 54 51.86
## 8 no male 2 21.08
## 9 yes female 27 11.13
## 10 yes female 14 30.07
## 11 yes female 4 16.70
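The rounding step above interacts with the Age != 0 filter in a way that is easy to overlook: R's round() rounds halves to the nearest even number, so 0.5 becomes 0 while 1.5 becomes 2. The following is a minimal sketch with made-up ages (these values are illustrative assumptions, not taken from the data set):

```r
# Hypothetical ages, chosen only to illustrate the rounding behaviour.
ages <- c(0.42, 0.5, 0.67, 1.5, 2.5, 28.5)

# round() rounds exact halves to the nearest even number.
rounded <- round(ages, 0)
print(rounded)

# Ages that round to 0 are the ones a filter like Age != 0 would drop.
dropped <- ages[rounded == 0]
print(dropped)
```

In this sketch only the two youngest ages round to 0, so a filter on the rounded value removes infants younger than about six months rather than records with a literal age of 0.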
#install.packages("psych")
library(psych)
describe(titanic_data)
## vars n mean sd median trimmed mad min max range skew
## Survived* 1 713 1.59 0.49 2.00 1.62 0.00 1 2.00 1.00 -0.38
## Sex* 2 713 1.37 0.48 1.00 1.33 0.00 1 2.00 1.00 0.55
## Age 3 713 29.73 14.49 28.00 29.29 13.34 1 80.00 79.00 0.40
## Fare 4 713 34.73 52.95 15.74 23.22 12.22 0 512.33 512.33 4.63
## kurtosis se
## Survived* -1.85 0.02
## Sex* -1.69 0.02
## Age 0.16 0.54
## Fare 30.57 1.98
Survived and Sex are factors, so describe() summarizes their integer codes (marked with *) rather than a rate. Survived is coded 1 = yes and 2 = no, so a mean of 1.59 means that about 59% of passengers did not survive and about 41% did. The negative skew (-0.38) says the same thing: most observations sit at the higher code (no).
The average age of passengers is approximately 29.73 years.
The fare distribution has a high amount of variability, with a large standard deviation of 52.95. The distribution is highly positively skewed (4.63), indicating a long right tail.
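The value 1.59 that describe() reports for Survived* is a mean of integer factor codes (yes = 1, no = 2), not a survival rate. A minimal sketch that rebuilds the factor from the counts of 289 "yes" and 424 "no" found in this data set shows how the two quantities relate (the variable names are illustrative):

```r
# Rebuild Survived from its counts: 289 "yes" and 424 "no".
survived <- factor(rep(c("yes", "no"), times = c(289, 424)),
                   levels = c("yes", "no"))

mean_code <- mean(as.integer(survived))  # mean of the codes 1 (yes) and 2 (no)
share_no  <- mean(survived == "no")      # proportion that did not survive
share_yes <- mean(survived == "yes")     # proportion that survived

print(round(c(mean_code = mean_code, share_no = share_no, share_yes = share_yes), 2))
```

The mean code equals 1 plus the share of "no", so 1.59 corresponds to roughly 59% non-survivors.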
summary(titanic_data)
## Survived Sex Age Fare
## yes:289 male :452 Min. : 1.00 Min. : 0.00
## no :424 female:261 1st Qu.:20.00 1st Qu.: 8.05
## Median :28.00 Median : 15.74
## Mean :29.73 Mean : 34.73
## 3rd Qu.:38.00 3rd Qu.: 33.50
## Max. :80.00 Max. :512.33
Among the passengers, 289 survived and 424 did not. There were 452 male and 261 female passengers in the sample. Age ranged from 1 (min) to 80 (max). The median age was 28, meaning that half of the passengers were 28 or younger, and the mean (average) age was 29.73. Fares ranged from 0 (min) to 512.33 (max) dollars. The median fare was 15.74 dollars, meaning that half of the passengers paid 15.74 dollars or less, and the mean (average) fare was 34.73 dollars. 25% of the passengers paid 8.05 dollars or less for their ticket.
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(titanic_data, aes(x = Age, fill = Sex)) +
geom_histogram(binwidth = 5, position = "identity", alpha = 0.7) +
facet_wrap(~Survived) +
labs(title = "Histogram of Age by Survival Status and Gender")
From this histogram we can observe that more females survived than did not, and that the reverse holds for males.
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
numeric <- titanic_data[, c("Age", "Fare")]
scatterplotMatrix(numeric, smooth = FALSE)
There seem to be two outliers, but looking at the Age/Fare scatterplot we can see that they are plausible values that follow the “logic” of the data, so I won't remove them.
#install.packages("ggplot2")
library(ggplot2)
ggplot(titanic_data, aes(x = Age, y = Fare)) +
geom_point() +
ggtitle("Scatterplot of Age vs. Fare")
#install.packages("Hmisc")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(titanic_data[ , c("Age", "Fare")]),
      type = "pearson") #Correlation coefficients with p-values and the number of valid observations for each pair. I select the numeric columns by name, because the old index -c(1, 2, 5) referred to a fifth column that no longer exists.
## Age Fare
## Age 1.0 0.1
## Fare 0.1 1.0
##
## n= 713
##
##
## P
## Age Fare
## Age 0.011
## Fare 0.011
The correlation between Age and Fare is weak and positive (r = 0.1): on average, as the age of passengers increases, the fare they paid tends to increase slightly. The p-value indicates that the observed correlation is statistically significant (p = 0.011).
cor(titanic_data$Age, titanic_data$Fare,
method = "pearson",
use = "complete.obs") #Basic function for calculating correlation coefficients.
## [1] 0.09516346
There is a weak positive linear relationship between Age and Fare (0.095).
cor.test(titanic_data$Age, titanic_data$Fare,
         method = "pearson") #Calculates the correlation coefficient and tests whether it differs significantly from zero. cor.test() has no use argument; it drops incomplete pairs automatically.
##
## Pearson's product-moment correlation
##
## data: titanic_data$Age and titanic_data$Fare
## t = 2.5491, df = 711, p-value = 0.01101
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02189266 0.16741744
## sample estimates:
## cor
## 0.09516346
Null hypothesis: The correlation is equal to 0.
Alternative hypothesis: The correlation is not equal to 0.
Based on the sample data, we reject the null hypothesis (p = 0.011), so the correlation between Age and Fare is statistically significant. As noted previously, the linear relationship is weak and positive (r = 0.095, 95% confidence interval from 0.02 to 0.17): on average, as the age of passengers increases, the fare they paid tends to increase slightly.
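The numbers in the cor.test() output can be reproduced by hand from the sample correlation and the sample size alone. The sketch below uses the standard t statistic for a Pearson correlation and the Fisher z-transformation for the confidence interval; the variable names are illustrative.

```r
r <- 0.09516346  # sample correlation reported by cor.test()
n <- 713         # number of complete (Age, Fare) pairs

# t statistic with n - 2 degrees of freedom and its two-sided p-value.
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

print(round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4))
```

With a sample of this size, even a correlation below 0.1 gives a t statistic above 2.5, which is why a weak relationship can still be statistically significant.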
library(ggplot2)
ggplot(titanic_data, aes(x = Survived, fill = Sex)) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Survival by Gender",
x = "survival",
y = "count") +
scale_fill_manual(values = c("blue", "pink"), labels = c("Male", "Female"))
Before proceeding with the Pearson chi-squared test, we need to check the following assumptions:
Observations are independent: true; each passenger appears once and falls into exactly one combination of sex and survival status.
All expected frequencies are greater than 5 for all combinations (checked below with results$expected).
None of the assumptions are violated in our data set.
results <- chisq.test(titanic_data$Survived, titanic_data$Sex,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: titanic_data$Survived and titanic_data$Sex
## X-squared = 206.31, df = 1, p-value < 2.2e-16
Null hypothesis: There is no association between survival and gender.
Alternative hypothesis: There is association between survival and gender.
Based on the sample data, we reject the null hypothesis (p<0.001).
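The observed chi-squared value of 206.31 is very large for 1 degree of freedom, which is strong evidence against independence. The statistic grows with the sample size, so the strength of the association is judged by the effect size (Cramer's V) below.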
addmargins(results$observed)
## titanic_data$Sex
## titanic_data$Survived male female Sum
## yes 92 197 289
## no 360 64 424
## Sum 452 261 713
round(results$expected)
## titanic_data$Sex
## titanic_data$Survived male female
## yes 183 106
## no 269 155
All expected frequencies are higher than 5, meaning the assumption of the test is met.
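The expected frequencies and the test statistic can be reproduced from the observed table alone. The sketch below assumes the 2x2 counts printed by addmargins() above and applies the Yates continuity correction by hand; the object names are illustrative.

```r
# Observed counts from the table above (filled column by column).
observed <- matrix(c(92, 360, 197, 64), nrow = 2,
                   dimnames = list(Survived = c("yes", "no"),
                                   Sex = c("male", "female")))
n <- sum(observed)

# Expected count under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected))

# Chi-squared statistic with Yates' continuity correction.
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
print(round(chi2_yates, 2))
```

Each expected count depends only on the margins, which is why the smallest one (female survivors, about 106) is still far above the threshold of 5.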
round(results$residuals, 2) #Pearson residuals; $res only worked through partial matching of $residuals.
## titanic_data$Sex
## titanic_data$Survived male female
## yes -6.74 8.87
## no 5.56 -7.32
These are Pearson residuals, (observed - expected) / sqrt(expected). A residual of -6.74 for males who survived means that far fewer males survived than expected under independence (92 observed versus about 183 expected).
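The residuals printed above match the Pearson formula, (observed - expected) / sqrt(expected), which chisq.test() stores in $residuals; the standardized residuals in $stdres additionally divide by a term based on the row and column shares. A minimal sketch computing both by hand from the observed counts (object names are illustrative):

```r
# Observed counts from the contingency table (filled column by column).
observed <- matrix(c(92, 360, 197, 64), nrow = 2,
                   dimnames = list(Survived = c("yes", "no"),
                                   Sex = c("male", "female")))
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals: what chisq.test() returns as $residuals.
pearson <- (observed - expected) / sqrt(expected)

# Standardized residuals: what chisq.test() returns as $stdres.
row_share <- rowSums(observed) / n
col_share <- colSums(observed) / n
standardized <- (observed - expected) /
  sqrt(expected * outer(1 - row_share, 1 - col_share))

print(round(pearson, 2))
print(round(standardized, 2))
```

The standardized residuals are larger in absolute value than the Pearson residuals, so the conclusion is the same under either definition: every cell deviates strongly from independence.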
addmargins(round(prop.table(results$observed), 3))
## titanic_data$Sex
## titanic_data$Survived male female Sum
## yes 0.129 0.276 0.405
## no 0.505 0.090 0.595
## Sum 0.634 0.366 1.000
12.9% of the individuals in the data set are males who survived.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## titanic_data$Sex
## titanic_data$Survived male female Sum
## yes 0.318 0.682 1.000
## no 0.849 0.151 1.000
Approximately 31.8% of all passengers who survived were male.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## titanic_data$Sex
## titanic_data$Survived male female
## yes 0.204 0.755
## no 0.796 0.245
## Sum 1.000 1.000
Approximately 20.4% of male passengers survived.
#install.packages("effectsize")
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(titanic_data$Survived, titanic_data$Sex)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.54 | [0.48, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.54)
## [1] "very large"
## (Rules: funder2019)
There is a very large association between the variables Survived and Sex.
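As a rough cross-check, Cramer's V can be computed in base R from the uncorrected chi-squared statistic as sqrt(chi2 / (n * (min(rows, cols) - 1))). This is the plain, unadjusted formula; the effectsize output above is labelled "(adj.)", so its value may differ slightly in later decimals. Object names are illustrative.

```r
# Observed counts from the contingency table (filled column by column).
observed <- matrix(c(92, 360, 197, 64), nrow = 2,
                   dimnames = list(Survived = c("yes", "no"),
                                   Sex = c("male", "female")))
n <- sum(observed)

# Chi-squared statistic without the continuity correction.
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V; for a 2x2 table min(dim) - 1 equals 1.
k <- min(dim(observed)) - 1
cramers_v <- sqrt(chi2 / (n * k))
print(round(cramers_v, 2))
```

Unlike the chi-squared statistic, V is scaled by the sample size, which makes it a measure of the strength of the association rather than of the evidence for it.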
oddsratio(titanic_data$Survived, titanic_data$Sex)
## Odds ratio | 95% CI
## -------------------------
## 0.08 | [0.06, 0.12]
interpret_oddsratio(0.08)
## [1] "large"
## (Rules: chen2010)
The odds of survival for males are 0.08 times the odds of survival for females, which is a large difference in the odds of survival by gender.
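The odds ratio follows directly from the four cell counts. The sketch below computes it by hand and adds a Wald confidence interval on the log scale; the Wald interval is an assumption for illustration, and the effectsize package may use a different method internally.

```r
# Cell counts from the contingency table.
male_yes <- 92;    male_no <- 360
female_yes <- 197; female_no <- 64

# Odds of survival within each gender, and their ratio.
odds_male <- male_yes / male_no
odds_female <- female_yes / female_no
odds_ratio <- odds_male / odds_female

# Wald 95% confidence interval, computed on the log scale.
se_log_or <- sqrt(1 / male_yes + 1 / male_no + 1 / female_yes + 1 / female_no)
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

print(round(c(odds_male = odds_male, odds_female = odds_female,
              odds_ratio = odds_ratio, lower = ci[1], upper = ci[2]), 3))
```

Taking the reciprocal gives the same result from the other side: the odds of survival for females are roughly 12 times those for males.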
fisher.test(titanic_data$Survived, titanic_data$Sex)
##
## Fisher's Exact Test for Count Data
##
## data: titanic_data$Survived and titanic_data$Sex
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.05678004 0.12125183
## sample estimates:
## odds ratio
## 0.08340755
Fisher's exact test is used when the expected-frequency assumption of the chi-squared test is not met. The assumption holds here, so I run the test only for practice and expect results similar to those above.
Null hypothesis: The odds of survival are equal for males and females (odds ratio = 1).
Alternative hypothesis: The odds of survival are not equal for males and females (odds ratio is not 1).
Based on the data, we reject the null hypothesis (p < 0.001). The odds of survival are significantly lower for males than for females (OR = 0.083), which supports the conclusion that there is a significant association between survival and gender.
I therefore conclude that survival status does vary with the gender of Titanic passengers. As described earlier, there is a significant association between survival status and gender (the odds of survival are greater for females).