Load and review the Titanic Dataset
titanic.df <- read.csv("Titanic Data.csv")
View(titanic.df)
head(titanic.df)
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 22.0 1 0 7.2500 S
## 2 1 1 female 38.0 1 0 71.2833 C
## 3 1 3 female 26.0 0 0 7.9250 S
## 4 1 1 female 35.0 1 0 53.1000 S
## 5 0 3 male 35.0 0 0 8.0500 S
## 6 0 3 male 29.7 0 0 8.4583 Q
Create a table showing the mean age of passengers who died and passengers who survived (Survived: 0 = died, 1 = survived)
mean_age <- aggregate(titanic.df$Age, list(titanic.df$Survived), mean)
mean_age
## Group.1 x
## 1 0 30.41530
## 2 1 28.42382
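The table above comes back with the generic column names Group.1 and x. As a minimal sketch of one alternative, the formula interface of aggregate() keeps the original column names, and the 0/1 codes can be relabelled afterwards. The data frame `toy` below is only the six rows printed by head() above, so its means will not match the full-data table; the labels "Died" and "Survived" are an assumption for illustration.

```r
# Six rows copied from the head() output above; the real analysis
# would pass titanic.df instead of this small illustrative frame.
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Age      = c(22, 38, 26, 35, 35, 29.7)
)

# Formula interface: the result keeps the column names Survived and Age.
mean_age_toy <- aggregate(Age ~ Survived, data = toy, FUN = mean)

# Replace the 0/1 codes with readable labels (labels are illustrative).
mean_age_toy$Survived <- factor(mean_age_toy$Survived,
                                levels = c(0, 1),
                                labels = c("Died", "Survived"))
mean_age_toy
```

Note that the formula interface drops rows with missing values by default, whereas mean() applied to a column containing NA returns NA unless na.rm = TRUE is passed.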
Plot a box plot of age for each survival group to inspect the distributions and identify outliers
boxplot(Age ~ Survived, data = titanic.df,
        main = "Age of passengers who died vs. survived",
        xlab = "Survived (0 = No, 1 = Yes)", ylab = "Age",
        col = c("green", "blue"))
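The points that a box plot draws beyond the whiskers can also be listed numerically. The sketch below uses boxplot.stats(), which applies the same default 1.5 x IQR whisker rule as boxplot(). The data frame `toy` is hypothetical: its ages are made up for illustration (including one deliberately extreme value of 74) and are not taken from the Titanic file.

```r
# Hypothetical ages, invented for illustration only; with the real data
# you would use titanic.df$Age and titanic.df$Survived instead.
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  Age      = c(22, 35, 29.7, 28, 31, 74, 38, 26, 35, 30, 27)
)

# Split ages by survival group, then collect the values that fall
# outside the whiskers in each group.
ages_by_group <- split(toy$Age, toy$Survived)
outliers <- lapply(ages_by_group, function(a) boxplot.stats(a)$out)
outliers
```

Each list element is named after the Survived code, so outliers[["0"]] holds the flagged ages among passengers who died; an empty vector means no points lie beyond the whiskers for that group.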
Run a two-sample t-test (assuming equal variances) to test the following hypothesis: H2: The Titanic survivors were younger than the passengers who died.
t.test(Age ~ Survived, data = titanic.df, var.equal = TRUE)
##
## Two Sample t-test
##
## data: Age by Survived
## t = 2.2302, df = 887, p-value = 0.02599
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.238890 3.744064
## sample estimates:
## mean in group 0 mean in group 1
## 30.41530 28.42382
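H2 is a directional hypothesis, while the call above runs a two-sided test. A minimal sketch of the one-sided version follows. With a formula, t.test() takes the difference as the first group minus the second (here group 0 minus group 1), so alternative = "greater" corresponds to "passengers who died were older". Because the reported t statistic is positive, the one-sided p-value on the full data would be half of the two-sided 0.02599, roughly 0.013. The sketch also shows Welch's test, which is R's default and does not assume equal variances. The data frame `toy` is just the six rows printed by head(), far too few to support any conclusion; it only makes the example runnable.

```r
# Six rows copied from the head() output above; substitute titanic.df
# for the real analysis.
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Age      = c(22, 38, 26, 35, 35, 29.7)
)

# Two-sided test, same form as the call in the text.
two_sided <- t.test(Age ~ Survived, data = toy, var.equal = TRUE)

# One-sided test: is mean age in group 0 (died) greater than in group 1?
one_sided <- t.test(Age ~ Survived, data = toy, var.equal = TRUE,
                    alternative = "greater")

# Welch's test: same one-sided question without the equal-variance assumption.
welch <- t.test(Age ~ Survived, data = toy, alternative = "greater")

c(two_sided = two_sided$p.value,
  one_sided = one_sided$p.value,
  welch     = welch$p.value)
```

Choosing the direction of a one-sided test should happen before looking at the data; picking it afterwards to halve the p-value overstates the evidence.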
Interpret the results
The two-sided p-value from the t-test is 0.02599, which is below the conventional 0.05 significance level.
The t-test shows a statistically significant difference in mean age between passengers who survived and passengers who died on the Titanic. We therefore reject the null hypothesis that the two groups have the same mean age. Passengers who died were older on average (30.4 years) than survivors (28.4 years), a difference of about 2 years (95% confidence interval: 0.24 to 3.74), which is consistent with H2.