In this markdown file, we continue exploring the Titanic Data.csv file and use it to compare the ages of the survivors with the ages of the deceased.
We start with the script that reads the required CSV file and calls attach() on it, so that its columns can be referenced directly by name.
setwd("~/Muyeena/Internship/Case Studies/Titanic")
titanic = read.csv("Titanic Data.csv")
#View(titanic)
attach(titanic)
str(titanic)
## 'data.frame': 889 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 29.7 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
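As an aside, attach() is not the only way to reach the columns by name. The following is a minimal sketch, using a small made-up data frame called `toy` (hypothetical values, not the Titanic data), of how `$` and with() give the same result without changing the search path:

```r
# Toy data frame with made-up values; NOT the Titanic data.
toy <- data.frame(
  Survived = c(0, 0, 1, 1),
  Age      = c(40, 30, 22, 28)
)

# Option 1: reference the column through the data frame.
mean.dollar <- mean(toy$Age)

# Option 2: evaluate the expression inside the data frame.
mean.with <- with(toy, mean(Age))

# Both expressions read the same column, so the results match.
identical(mean.dollar, mean.with)
```

Either form keeps the data frame explicit in each call, which can help when several datasets share column names.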
We use the aggregate() function to get a table which contains the average age of the survivors and the average age of the people who died.
mean.age = aggregate(Age ~ Survived, data = titanic, mean)
mean.age
## Survived Age
## 1 0 30.41530
## 2 1 28.42382
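The group means alone do not show how many passengers fall in each group or how spread out their ages are. As a rough sketch, aggregate() can return several statistics at once when the summary function returns a named vector. The data frame `toy` below holds made-up values for illustration only (it is not the Titanic data), and the name `group.summary` is likewise hypothetical:

```r
# Toy data frame with made-up values; NOT the Titanic data.
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1),
  Age      = c(40, 35, 30, 20, 25, 30)
)

# Return the count, mean and standard deviation of Age for each group.
group.summary <- aggregate(
  Age ~ Survived,
  data = toy,
  FUN  = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# The Age column of the result is a matrix with one column per statistic.
group.summary
```

The same call with `data = titanic` should produce the corresponding table for the real dataset.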
The following code is used to visualize the ages of the survivors and those of the deceased.
boxplot(Age ~ Survived, data = titanic,
        horizontal = TRUE,
        yaxt = "n",
        ylab = "Survival Status",
        xlab = "Age",
        col = c("red", "blue"),
        main = "Comparison of ages of survivors and victims")
axis(side = 2, at = c(1, 2), labels = c("Deceased", "Survivors"))
The given hypothesis is -
The Titanic survivors were younger than the passengers who died.
Therefore, the null hypothesis will be -
There is no significant difference between the mean age of the survivors and that of the deceased.
To test the hypothesis, we call the t.test() function in R. By default, it performs the Welch Two Sample t-test, which does not assume equal variances in the two groups.
# Because the dataset was attached earlier, the columns can be referenced by name without a data argument.
p = t.test(Age ~ Survived)
p
##
## Welch Two Sample t-test
##
## data: Age by Survived
## t = 2.1816, df = 667.56, p-value = 0.02949
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1990628 3.7838912
## sample estimates:
## mean in group 0 mean in group 1
## 30.41530 28.42382
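To see how the numbers in this output relate to each other, the p-value and the confidence interval can be approximately reconstructed from the reported t statistic, degrees of freedom and group means. This is only a sketch: the inputs are the rounded values printed above, so the results should match the output to about three decimal places rather than exactly. The variable names are chosen here for illustration.

```r
# Values copied from the t.test() output above (rounded as printed).
t.stat <- 2.1816
df     <- 667.56
mean.0 <- 30.41530  # group 0: deceased
mean.1 <- 28.42382  # group 1: survivors

# Two-sided p-value: probability of a |t| at least this large under the null.
p.two.sided <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)

# Standard error implied by t = (difference in means) / (standard error).
diff.means <- mean.0 - mean.1
std.error  <- diff.means / t.stat

# 95% confidence interval for the difference in means.
ci <- diff.means + c(-1, 1) * qt(0.975, df = df) * std.error

p.two.sided
ci
```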
In the above t-test, the p-value obtained is 0.0294879. This value is less than 0.05 but greater than 0.01.
Therefore, the conclusion depends on the chosen significance level: at the 5% level we would reject the null hypothesis, whereas at the stricter 1% level applied here we cannot reject it.
Hence, at the 1% significance level, there is no significant difference between the mean age of the survivors and that of the deceased.
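Note that the stated hypothesis is directional (survivors younger than the deceased), while t.test() runs a two-sided test by default. As a hedged sketch, the one-sided p-value can be derived from the t statistic and degrees of freedom reported above, assuming the difference is taken as group 0 (deceased) minus group 1 (survivors), as in the printed output. The names `p.one.sided` and `decision` are illustrative.

```r
# Values copied from the t.test() output above (rounded as printed).
t.stat <- 2.1816
df     <- 667.56

# One-sided p-value for the alternative "mean age of deceased > mean age of survivors".
p.one.sided <- pt(t.stat, df = df, lower.tail = FALSE)

# Compare the p-value against two common significance levels.
alpha.levels <- c(0.05, 0.01)
decision <- data.frame(
  alpha       = alpha.levels,
  reject.null = p.one.sided < alpha.levels
)

p.one.sided
decision
```

On the dataset itself, the equivalent call would be `t.test(Age ~ Survived, data = titanic, alternative = "greater")`. The one-sided p-value is half the two-sided one, so the decision still depends on whether the 5% or the 1% level is used.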