Titanic.HW1 <- read.csv("~/Boston College/Data Analysis/Datasets/Titanic HW1.csv")
Passenger ID is a nominal, qualitative variable. It exists only to identify which passenger is which and implies no ordering or relationship among passengers. Age, by contrast, is quantitative and measured on a ratio scale: passengers can be ranked by age, the size of an age gap is meaningful, and an age of zero is a true zero, the absence of any time spent alive.
Using the code below, we can see that the variable with the most missing values is Age, with 177 NAs. All other variables report zero NAs.
colSums(is.na(Titanic.HW1))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0           0           0
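One caveat may be worth checking here. With the default `na.strings = "NA"`, `read.csv` reads blank fields in text columns as empty strings rather than `NA`, so `is.na` would not count them. The sketch below illustrates the difference on a small, made-up data frame (the `toy` object and its values are hypothetical, not taken from the Titanic file).

```r
# Hypothetical toy data, not the Titanic file: one true NA and two blank strings.
toy <- data.frame(
  Age   = c(22, NA, 35),
  Cabin = c("C85", "", ""),
  stringsAsFactors = FALSE
)

# is.na() only sees the real NA in Age.
na_counts <- colSums(is.na(toy))

# Counting empty strings separately exposes the blank Cabin entries.
blank_counts <- colSums(toy == "", na.rm = TRUE)

print(na_counts)
print(blank_counts)
```

Running the same `colSums(... == "", na.rm = TRUE)` check on the real data frame would show whether any text column hides blanks of this kind.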
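Since Age is a ratio variable, the median is a meaningful summary of it, so I replaced the missing values with the median of the observed ages using the code below. Since no other variables have missing observations, I applied this only to the Age column.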
Titanic.HW1$Age[is.na(Titanic.HW1$Age)] <- median(Titanic.HW1$Age, na.rm=TRUE)
colSums(is.na(Titanic.HW1))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0           0
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0           0           0
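The replacement pattern used above can be seen in isolation on a short vector. This is only a sketch with made-up ages (`age`, `age_was_missing`, and `age_filled` are hypothetical names). It also records which entries were filled in, which can be handy later, and shows that filling with the median tends to shrink the standard deviation.

```r
# Hypothetical ages, not taken from the Titanic file.
age <- c(22, NA, 35, 54, NA, 2)

# Record which entries were missing before overwriting them.
age_was_missing <- is.na(age)

# Same pattern as above: fill NAs with the median of the observed values.
age_filled <- age
age_filled[age_was_missing] <- median(age, na.rm = TRUE)

print(age_filled)

# The filled values sit in the middle of the distribution, so the spread shrinks.
print(sd(age, na.rm = TRUE) > sd(age_filled))
```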
The describe function reveals a couple of interesting facts. First, Age is by far the variable with the most variation, with a range of almost 80 years (79.58) from youngest to oldest. SibSp (how many siblings or spouses are travelling with a given passenger) and Parch (how many parents or children are travelling with a given passenger) give us an idea of how connected the passengers are by family ties, which in this case is not much. Because the median of both variables is zero, at least half of the passengers had no sibling or spouse on board, and at least half had no parent or child on board.
library(psych)
describe(Titanic.HW1[c("Age","SibSp","Parch")])
##       vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## Age      1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
## SibSp    2 891  0.52  1.10      0    0.27 0.0 0.00   8  8.00 3.68    17.73 0.04
## Parch    3 891  0.38  0.81      0    0.18 0.0 0.00   6  6.00 2.74     9.69 0.03
The most noticeable characteristic of this data is the massive gap in survival rates between male and female passengers. About 74.2% of female passengers survived (233 of 314), while only about 18.9% of male passengers did (109 of 577). This is consistent with the policy of "women and children first" having been applied, at least in part, in practice.
fm_table <- table(Titanic.HW1$Survived, Titanic.HW1$Sex)
print(fm_table)
##
##     female male
##   0     81  468
##   1    233  109
barplot(fm_table, legend.text = c("Died", "Survived"), ylab = "Number of Passengers")
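The percentages quoted for each sex can be reproduced from the table itself. The sketch below rebuilds the counts by hand from the printed output so that it runs without the CSV (`fm_counts` and `sex_rates` are names introduced here for illustration). On the real data, `prop.table(fm_table, margin = 2)` should give the same proportions.

```r
# Counts copied from the printed table above (rows: 0 = died, 1 = survived).
fm_counts <- as.table(matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# margin = 2 divides each column by its own total, giving rates within each sex.
sex_rates <- round(100 * prop.table(fm_counts, margin = 2), 1)
print(sex_rates)
```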
It seems that the group of survivors has the larger age range. Though anecdotal, the fact that both the youngest and the oldest passenger survived might be indicative of a larger trend. The very young and the elderly are often seen as vulnerable and in need of assistance, and could have been given preferred access to lifeboats. It is also interesting to note that the median age looks to be unchanged between the two groups, although this is partly a side effect of the imputation: the 177 filled-in ages all equal the overall median of 28, which pulls both group medians toward that value.
boxplot(Age ~ Survived, data = Titanic.HW1, notch = TRUE, horizontal = TRUE, xlab = "Passenger Age", ylab = "Survived?")
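Visual impressions from a boxplot can be backed up with numbers. The sketch below computes the median and the age span per group with `tapply`, using a small made-up data frame (the `toy` rows are hypothetical and do not come from the Titanic file). The same two calls could be pointed at the real Age and Survived columns.

```r
# Hypothetical rows for illustration only; not the Titanic data.
toy <- data.frame(
  Age      = c(0.5, 28, 30, 80, 19, 28, 40, 61),
  Survived = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Median age within each survival group.
group_median <- tapply(toy$Age, toy$Survived, median)

# Span (max minus min) of ages within each survival group.
group_span <- tapply(toy$Age, toy$Survived, function(x) diff(range(x)))

print(group_median)
print(group_span)
```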
class_table <- table(Titanic.HW1$Survived, Titanic.HW1$Pclass)
print(class_table)
##
##       1   2   3
##   0  80  97 372
##   1 136  87 119
barplot(class_table, legend.text = c("Died", "Survived"), xlab ="Passenger Class", ylab = "Number of Passengers")
The survival rates by passenger class come out to roughly the following: 1st: 63.0% (136 of 216), 2nd: 47.3% (87 of 184), 3rd: 24.2% (119 of 491). Though the difference is not as stark as the one shown by sex, it appears that socioeconomic status was an important factor in whether one survived or not.
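As a check on these figures, the rates can be derived by dividing the survivor row by the column totals. The counts below are typed in from the printed class table so the sketch runs on its own (`class_counts` and `class_rates` are names introduced here for illustration).

```r
# Counts copied from the printed class table (rows: 0 = died, 1 = survived).
class_counts <- as.table(matrix(
  c(80, 136, 97, 87, 372, 119),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Pclass = c("1", "2", "3"))
))

# Survivors in each class divided by the total number of passengers in that class.
class_rates <- round(100 * class_counts["1", ] / colSums(class_counts), 1)
print(class_rates)
```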