Task 4b

Load and review the Titanic Dataset

titanic.df <- read.csv("Titanic Data.csv")
View(titanic.df)
head(titanic.df)
##   Survived Pclass    Sex  Age SibSp Parch    Fare Embarked
## 1        0      3   male 22.0     1     0  7.2500        S
## 2        1      1 female 38.0     1     0 71.2833        C
## 3        1      3 female 26.0     0     0  7.9250        S
## 4        1      1 female 35.0     1     0 53.1000        S
## 5        0      3   male 35.0     0     0  8.0500        S
## 6        0      3   male 29.7     0     0  8.4583        Q

Create a table showing the average age of survivors and non-survivors, where 0 = did not survive and 1 = survived

mean_age <- aggregate(titanic.df$Age, list(titanic.df$Survived), mean)
mean_age
##   Group.1        x
## 1       0 30.41530
## 2       1 28.42382
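The default column names `Group.1` and `x` are hard to read. As a minimal sketch, the formula interface of `aggregate()` labels the columns with the variable names instead. To keep the example self-contained it uses only the six rows printed by `head()` above, so the means differ from the full-data table; `sample.df` and `mean_age_named` are illustrative names, not part of the original analysis.

```r
# Six rows copied from the head() output above (not the full dataset)
sample.df <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Age      = c(22.0, 38.0, 26.0, 35.0, 35.0, 29.7)
)

# Formula interface: result columns are named Survived and Age
mean_age_named <- aggregate(Age ~ Survived, data = sample.df, FUN = mean)
mean_age_named
```

Running the same call with `data = titanic.df` should reproduce the means in the table above, with clearer column labels.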

Draw a box plot of age for each survival group to compare the distributions and identify outliers

boxplot(Age ~ Survived, data = titanic.df,
        main = "Ages of survivors and non-survivors",
        xlab = "Survived (0 = No, 1 = Yes)", ylab = "Age",
        col = c("green", "blue"))
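The points a box plot draws beyond the whiskers can also be listed numerically. The sketch below uses `boxplot.stats()`, which applies the same 1.5 x IQR rule that `boxplot()` uses by default. The helper `age_outliers()` and the data frame `toy` are hypothetical and exist only for illustration; the ages in `toy` are made up and are not Titanic data.

```r
# Hypothetical toy data, made up for illustration (not from the Titanic file)
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  Age      = c(20, 22, 24, 26, 28, 80, 30, 31, 32, 33, 34)
)

# Hypothetical helper: ages flagged as outliers within each survival group,
# using the 1.5 * IQR rule from boxplot.stats()
age_outliers <- function(df) {
  lapply(split(df$Age, df$Survived), function(a) boxplot.stats(a)$out)
}

out <- age_outliers(toy)
out
```

Calling `age_outliers(titanic.df)` would list the ages plotted as individual points in each box of the plot above.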

Task 4c

Run a t-test to test the following hypothesis: H2: The Titanic survivors were younger than the passengers who died.

t.test(Age ~ Survived, data = titanic.df, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  Age by Survived
## t = 2.2302, df = 887, p-value = 0.02599
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.238890 3.744064
## sample estimates:
## mean in group 0 mean in group 1 
##        30.41530        28.42382
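H2 is directional, while the test above is two-sided. As a rough check, the sketch below recomputes the p-values from the reported t statistic and degrees of freedom using the t distribution function `pt()`. The variable names are illustrative. The one-sided value applies because the observed difference (group 0 minus group 1) is positive, which is the direction H2 predicts.

```r
t_stat <- 2.2302  # t statistic reported by t.test() above
df     <- 887     # degrees of freedom reported by t.test() above

# Two-sided p-value: probability in both tails beyond |t|
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# One-sided p-value for "mean age of group 0 > mean age of group 1"
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

c(two_sided = p_two_sided, one_sided = p_one_sided)

# The one-sided test can also be requested directly (needs the full data):
# t.test(Age ~ Survived, data = titanic.df, var.equal = TRUE,
#        alternative = "greater")
```

The two-sided value should agree with the 0.02599 printed above, and the one-sided value is half of it.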

Task 4d

Interpret the results

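  1. The two-sided p-value from the t-test is 0.02599.

  2. Because p = 0.02599 < 0.05, we reject the null hypothesis that the mean ages of survivors and non-survivors are equal at the 5% significance level. Passengers who died were older on average (30.42 years) than those who survived (28.42 years), a difference of about 1.99 years (95% confidence interval: 0.24 to 3.74). H2 is directional, and the observed difference lies in the hypothesized direction, so the corresponding one-sided p-value is half the two-sided value, about 0.013. The data therefore support H2: the Titanic survivors were younger than the passengers who died.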