Titanic.HW1 <- read.csv("~/Boston College/Data Analysis/Datasets/Titanic HW1.csv")

Q1a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

PassengerId is a qualitative variable measured at the nominal level. It only identifies which passenger is which and implies no order or relation among passengers. Age, by contrast, is quantitative and measured at the ratio level: passengers can be ranked by age, the size of an age gap is meaningful, and an age of zero is a true zero, the absence of any time spent alive.

Q1b. Which variable has the most missing observations? You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.

Using the code below, we can see that the variable with the most missing observations is Age, with 177 NAs. Every other variable shows zero NAs.

colSums(is.na(Titanic.HW1))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
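One caveat on the count above: is.na() only counts values that R read in as NA. By default read.csv() keeps blank cells in text columns as empty strings rather than NA, so a column such as Cabin could report zero NAs while still containing blanks. Whether this file actually has blank cells is an assumption I have not verified; the sketch below uses a small made-up data frame (toy) to show how blanks can be counted alongside NAs.

```r
# Toy data standing in for the real file; these values are made up for illustration
toy <- data.frame(Age = c(22, NA, 35),
                  Cabin = c("C85", "", ""),
                  stringsAsFactors = FALSE)

# is.na() alone misses the blank strings in Cabin
na_only <- colSums(is.na(toy))
print(na_only)

# Count a value as missing if it is NA or an empty string
na_or_blank <- sapply(toy, function(col) sum(is.na(col) | col == ""))
print(na_or_blank)
```

An alternative is to pass na.strings = c("NA", "") to read.csv() so that blank cells are converted to NA when the file is loaded.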

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

Because Age is a ratio variable, I imputed its missing values with the column median. SibSp and Parch have no missing observations, so only the Age column needed imputation.

Titanic.HW1$Age[is.na(Titanic.HW1$Age)] <- median(Titanic.HW1$Age, na.rm = TRUE)
colSums(is.na(Titanic.HW1))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
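The question also allows imputing with the column mode, which would apply to a nominal column. Base R has no built-in function for the statistical mode, so the sketch below defines a hypothetical helper, col_mode, and applies it to a made-up vector; neither the helper nor the data comes from the homework file.

```r
# Hypothetical helper: returns the most frequent non-missing value as a character string.
# If several values tie, the first one in sorted order is returned.
col_mode <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}

# Made-up example vector with one missing value
embarked <- c("S", "C", NA, "S", "Q")
embarked[is.na(embarked)] <- col_mode(embarked)
print(embarked)
```

Because the helper returns a character string, a numeric column would need as.numeric() applied to the result before assignment.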

Q3. Install the psych package in R: install.packages("psych"). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch. Please comment on what you observe from the summary statistics.

The describe function highlights a couple of interesting facts. First, Age is by far the variable with the most variation, with a range of almost 80 years from youngest (0.42) to oldest (80) and a standard deviation of about 13 years. SibSp (the number of siblings or spouses aboard with a given passenger) and Parch (the number of parents or children aboard with a given passenger) give us an idea of how connected the passengers were by family ties, which in this case is not much. Because the median of both variables is zero, at least half of the passengers had no siblings or spouse aboard, and at least half had no parents or children aboard. Both variables are also strongly right-skewed (skew of 3.68 and 2.74), which points to a small number of large families.

library(psych)
describe(Titanic.HW1[c("Age","SibSp","Parch")])
##       vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## Age      1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
## SibSp    2 891  0.52  1.10      0    0.27 0.0 0.00   8  8.00 3.68    17.73 0.04
## Parch    3 891  0.38  0.81      0    0.18 0.0 0.00   6  6.00 2.74     9.69 0.03

Q4. Provide a cross-tabulation of Survived and Sex. What do you notice?

The most noticeable feature of this table is the massive gap in survival rates between male and female passengers. About 74.2% of female passengers survived (233 of 314), while only about 18.9% of male passengers did (109 of 577). The policy of "women and children first" seems to have been followed, at least in part.

fm_table <- table(Titanic.HW1$Survived, Titanic.HW1$Sex)
print(fm_table)
##    
##     female male
##   0     81  468
##   1    233  109
barplot(fm_table, legend.text = c("Died", "Survived"), ylab = "Number of Passengers")
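The survival rates by sex can be reproduced from the counts in the cross-tabulation. This is a minimal sketch: the matrix fm_counts is rebuilt by hand from the printed table so the snippet runs on its own, and the names fm_counts and fm_rates are my own. With the data loaded, prop.table(fm_table, margin = 2) should give the same proportions.

```r
# Counts copied from the printed cross-tabulation (rows: Survived, columns: Sex)
fm_counts <- matrix(c(81, 233, 468, 109), nrow = 2,
                    dimnames = list(Survived = c("0", "1"),
                                    Sex = c("female", "male")))

# margin = 2 divides each cell by its column total, giving rates within each sex
fm_rates <- round(100 * prop.table(fm_counts, margin = 2), 1)
print(fm_rates)
```

Row "1" of the result holds the survival percentages, and each column sums to 100.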

Q5. Provide notched boxplots for Survived and Age. What do you notice?

The survivors appear to span the wider age range. Though anecdotal, the fact that both the youngest and the oldest passengers survived might point to a broader pattern: the very young and the elderly are often seen as needing assistance, and could have been given preferred access to lifeboats. The median age also looks essentially unchanged between the two groups, which is partly expected because the median imputation in Q2 assigned the same age to all 177 passengers with a missing Age.

boxplot(Age ~ Survived, data = Titanic.HW1, notch = TRUE, horizontal = TRUE, xlab = "Passenger Age", ylab = "Survived?")

Though not a required question, I was curious enough to do the following short analysis:

class_table <- table(Titanic.HW1$Survived, Titanic.HW1$Pclass)
print(class_table)
##    
##       1   2   3
##   0  80  97 372
##   1 136  87 119
barplot(class_table, legend.text = c("Died", "Survived"), xlab ="Passenger Class", ylab = "Number of Passengers")

The survival rates by passenger class come out to roughly 63.0% for 1st class, 47.3% for 2nd class, and 24.2% for 3rd class. Though the difference is not as stark as the one by sex, socioeconomic status appears to have been an important factor in whether a passenger survived.
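The arithmetic behind these percentages is the number of survivors in each class divided by the class total. As a sketch, the counts below are copied by hand from the printed table so the snippet is self-contained; class_counts and survival_pct are names I chose for illustration.

```r
# Counts copied from the printed cross-tabulation (rows: Survived, columns: Pclass)
class_counts <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                       dimnames = list(Survived = c("0", "1"),
                                       Pclass = c("1", "2", "3")))

# Survivors in each class divided by that class's total, as a percentage
survival_pct <- round(100 * class_counts["1", ] / colSums(class_counts), 1)
print(survival_pct)
```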