Week 1 - Homework 1

Grace Inorio
titanic_train_data <- read.csv('/Users/grace/Desktop/MSAE Spring 25/titanic_train.csv', skip = 1, header = TRUE)
# The first row of the file holds only the title of the data, "titanic". Without skip = 1, R reads that title as a single-column header with several columns of data beneath it, and an error is returned. skip = 1 skips the title row, and header = TRUE indicates that the next row contains the column names.
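
As a small illustration of what skip = 1 does, the sketch below reads a made-up CSV from a string instead of the real file. The toy text and the names csv_text and toy are assumptions for this example only.

```r
# Toy CSV text standing in for the real file: a stray title line,
# then the header row, then two data rows (values are illustrative)
csv_text <- "titanic\nPassengerId,Survived,Age\n1,0,22\n2,1,38\n"

# skip = 1 drops the title line, so the next line supplies the column names
toy <- read.csv(text = csv_text, skip = 1, header = TRUE)

names(toy)  # column names taken from the second line of the text
nrow(toy)   # number of data rows that remain
```

Reading from a string with the text argument keeps the example independent of any file path.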

Q1a.

What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

library(psych)
str(titanic_train_data)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

PassengerId is qualitative, nominal data. It is an ID that lets us, the data users, identify and differentiate between cases, and its only purpose is classification. Age is quantitative, ratio data. It can be measured, and it has a true zero.

Q1b.

Which variable has the most missing observations?

titanic_train_data[titanic_train_data == ""] = NA
colSums(is.na(titanic_train_data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Cabin has the most missing observations, with 687 of 891 (about 77%) missing.
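
Counts of missing values are easier to compare as shares of the column. The sketch below shows one way to get them with colMeans(), using a made-up four-row data frame (toy) rather than the Titanic data, and then applies the same arithmetic to the counts reported above.

```r
# Toy data frame with missing values already coded as NA (illustrative only)
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c(NA, "C85", NA, NA)
)

# is.na() gives a logical matrix; column means of it are the missing shares
missing_share <- colMeans(is.na(toy))
round(100 * missing_share, 1)

# Name of the column with the largest share of missing values
names(which.max(missing_share))

# Same arithmetic applied to the Cabin count from the output above
round(100 * 687 / 891, 1)
```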

Q2.

Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

# There are no missing observations for SibSp and Parch
titanic_train_data$Age[is.na(titanic_train_data$Age)] = median(titanic_train_data$Age, na.rm = TRUE)
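
The question also mentions imputing with the column mode, which would apply to a nominal variable such as Embarked. Base R has no built-in statistical mode function, so the sketch below defines a hypothetical helper, column_mode(), and applies it to a made-up character vector. It is an illustration, not part of the submitted analysis.

```r
# Hypothetical helper: most frequent non-missing value of a character vector.
# table() drops NA by default; ties go to the first level in sorted order.
column_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Toy vector standing in for a nominal column with missing values
embarked <- c("S", "C", "S", NA, "Q", "S", NA)

# Replace each NA with the most frequent observed value
embarked[is.na(embarked)] <- column_mode(embarked)
embarked
```

The helper returns a character value, so a numeric column would need the result converted back with as.numeric().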

Q3.

Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch. Please comment on what you observe from the summary statistics.

library(psych)
titanic_train_data_with_NA <- read.csv('/Users/grace/Desktop/MSAE Spring 25/titanic_train.csv', skip = 1, header = TRUE)
describe(titanic_train_data_with_NA$Age)
##    vars   n mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 714 29.7 14.53     28   29.27 13.34 0.42  80 79.58 0.39     0.16 0.54
describe(titanic_train_data$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44

I have found the descriptive statistics of the Age variable before and after replacing the missing observations with the median, 28.

Replacing the missing observations with 28 has caused some interesting changes in our statistics. The statistics cannot be calculated on NA observations, so when we replace NA with 28, we are working with 177 more cases. The mean decreases by only 0.34 years of age, and the min, max, and range all remain the same, as expected. The youngest passenger on the Titanic was 0.42 years old, and the oldest passenger was 80 years old. The standard deviation decreases by 1.51, the trimmed mean decreases by 0.44, and the mad (median absolute deviation) drops by 4.44. All of this makes sense: as we inject more observations that sit exactly at the median, a smaller share of the data falls away from the central peak. As the peak grows higher, kurtosis grows as well, and we see it increase by 0.81.

Both distributions are right-skewed and slightly leptokurtic, with a peak only a little higher than that of a normal distribution. The data with the median replacing the NA values are more leptokurtic and more right-skewed (kurtosis 0.97 versus 0.16, skew 0.51 versus 0.39).
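
The effect of median imputation on spread and peakedness can be reproduced on simulated data. The sketch below is an illustration under stated assumptions: the ages are random draws, not the Titanic ages, and excess_kurtosis() is a hypothetical helper using the simple moment-based definition, so its values may differ slightly from the kurtosis that describe() reports.

```r
set.seed(1)

# Simulated ages (NOT the Titanic data) with 175 values set to missing
with_gaps <- c(rnorm(700, mean = 30, sd = 14), rep(NA, 175))

# Median imputation, as done for Age above
imputed <- with_gaps
imputed[is.na(imputed)] <- median(with_gaps, na.rm = TRUE)

# Hypothetical helper: moment-based excess kurtosis (0 for a normal curve)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2 - 3
}

# Spread and peakedness before and after imputation
comparison <- rbind(
  before = c(sd = sd(with_gaps, na.rm = TRUE),
             mad = mad(with_gaps, na.rm = TRUE),
             kurtosis = excess_kurtosis(with_gaps)),
  after  = c(sd = sd(imputed),
             mad = mad(imputed),
             kurtosis = excess_kurtosis(imputed))
)
round(comparison, 2)
```

Stacking extra values exactly at the median shrinks sd and mad and raises kurtosis, which is the same direction of change seen in the two describe() outputs.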

describe(titanic_train_data$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04

In this dataset, a sibling is a brother, sister, stepbrother, or stepsister, and a spouse is a husband or wife (mistresses and fiancés were ignored). The reported value is the number of siblings or spouses each individual had aboard the Titanic.

The mean is 0.52, the min is 0, and the max is 8. The high skew (3.68) and kurtosis (17.73) indicate that the distribution is strongly right-skewed and leptokurtic. The median of 0 shows that at least half of the passengers did not have a sibling or spouse with them, but some had several sibling/spouse connections, as many as 8.

describe(titanic_train_data$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03

Parch is defined as the number of parents and children aboard the Titanic. Parents were mothers or fathers, and children were daughters, sons, stepdaughters, or stepsons. Some children were traveling with only a nanny, and their Parch = 0.

The mean is 0.38, the min is 0, and the max is 6. The distribution is skewed to the right (skew 2.74) and is leptokurtic, with a kurtosis value of 9.69. Together with the median of 0, this means that the majority of passengers were not traveling with their parents or children, but some were traveling with as many as 6.

Q4.

Provide a cross-tabulation of Survived and Sex. What do you notice?

table(titanic_train_data$Survived, titanic_train_data$Sex)
##    
##     female male
##   0     81  468
##   1    233  109
# Store the cross-tabulation once, then divide each column by its own total
sex_survival = table(titanic_train_data$Survived, titanic_train_data$Sex)
female = round(sex_survival[, "female"] / sum(sex_survival[, "female"]), digits = 2)
male = round(sex_survival[, "male"] / sum(sex_survival[, "male"]), digits = 2)
cbind(female, male)
##   female male
## 0   0.26 0.81
## 1   0.74 0.19

Overall, there were more males on board than females (577 versus 314). However, more women than men survived the sinking, both in absolute numbers and proportionally. Only 109 (19%) of the male passengers survived. Meanwhile, 233 (74%) of the female passengers survived. This is in line with the historical accounts that describe women and children being prioritized for boarding lifeboats.
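
The column proportions can also be computed in one step with prop.table(). The sketch below rebuilds the cross-tabulation from the counts printed above, so it runs without the data file. The names counts and col_props are assumptions for this example.

```r
# Counts copied from the Survived-by-Sex cross-tabulation above
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides every cell by its column total (within-sex proportions)
col_props <- round(prop.table(counts, margin = 2), 2)
col_props

# Column totals: number of female and male passengers
colSums(counts)
```

With the full data frame loaded, the same call would be prop.table(table(titanic_train_data$Survived, titanic_train_data$Sex), margin = 2).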

Q5.

Provide notched boxplots for Survived and Age. What do you notice?

boxplot(titanic_train_data$Age ~ titanic_train_data$Survived, notch = TRUE, horizontal = TRUE, names = c("No", "Yes"), main = "Titanic Survivors", ylab = "Survived", xlab = "Age", col = "lightblue")

The median for both boxplots is the same. However, compared with the boxplot of those who died, the boxplot of passengers who survived has a larger IQR and fewer outliers, and all of its outliers are on the high end.

These findings reinforce the idea that children were given seats first in the lifeboats. The lower whisker for passengers who survived extends almost all the way to zero, capturing even the youngest passenger, Assad Alexander Thomas, who was 0.42 years old. The whiskers on the boxplot of the passengers who died do not extend as far. The majority of those who died were between the ages of 20 and 40.
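
The numbers behind a notched boxplot can be inspected with boxplot.stats(), which returns the five box statistics, the group size, the notch limits, and the outliers. The sketch below uses simulated ages (not the Titanic data) and checks the notch against the formula median ± 1.58 × IQR / sqrt(n). The names toy_age, toy_survived, and stats_by_group are assumptions for this example.

```r
set.seed(2)

# Simulated ages for 60 non-survivors and 40 survivors (illustrative only);
# pmax() keeps the simulated ages positive
toy_age <- pmax(0.5, c(rnorm(60, mean = 30, sd = 12), rnorm(40, mean = 28, sd = 15)))
toy_survived <- rep(c(0, 1), times = c(60, 40))

# Box statistics for each group
stats_by_group <- lapply(split(toy_age, toy_survived), boxplot.stats)

# Survivors: hinges and median, notch limits, and any outliers
survivors <- stats_by_group[["1"]]
survivors$stats
survivors$conf
survivors$out

# The notch limits recomputed by hand from the hinges and the group size
manual_notch <- survivors$stats[3] +
  c(-1.58, 1.58) * diff(survivors$stats[c(2, 4)]) / sqrt(survivors$n)
manual_notch
```

According to the boxplot() documentation, notches that do not overlap are strong evidence that two medians differ, so comparing the conf intervals of the two groups gives a numeric version of the visual check.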