Titanic Data Set

The training CSV from Kaggle contains 891 observations of 12 variables: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation).

# Read the training data from a local copy of the Kaggle .csv file
titanic <- read.csv("D:\\eric\\Boston College\\1. ADEC7310.02 - Data Analysis\\Week 1\\train.csv")

# Load psych and Hmisc packages
library(psych)
library(Hmisc)

Question 1a

Question

What are the variable types (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

# Display the structure of each variable in question
str(titanic$PassengerId)
##  int [1:891] 1 2 3 4 5 6 7 8 9 10 ...
str(titanic$Age)
##  num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...

Answer

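Passenger ID is a qualitative variable and Age is a quantitative variable. Passenger ID is nominal: it is stored as an integer, but it is only a unique label assigned to each passenger, so arithmetic on it is meaningless. Age is ratio: it has a true zero, so a statement such as "twice as old" is meaningful.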

Question 1b

Question

Which variable has the most missing observations?

# I explicitly called out the package to use since both psych and Hmisc have
# describe() functions
Hmisc::describe(titanic)
## titanic 
## 
##  12  Variables      891  Observations
## --------------------------------------------------------------------------------
## PassengerId 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      891        0      891        1      446    297.3     45.5     90.0 
##      .25      .50      .75      .90      .95 
##    223.5    446.0    668.5    802.0    846.5 
## 
## lowest :   1   2   3   4   5, highest: 887 888 889 890 891
## --------------------------------------------------------------------------------
## Survived 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      891        0        2     0.71      342   0.3838   0.4735 
## 
## --------------------------------------------------------------------------------
## Pclass 
##        n  missing distinct     Info     Mean      Gmd 
##      891        0        3     0.81    2.309   0.8631 
##                             
## Value          1     2     3
## Frequency    216   184   491
## Proportion 0.242 0.207 0.551
## --------------------------------------------------------------------------------
## Name 
##        n  missing distinct 
##      891        0      891 
## 
## lowest : Abbing, Mr. Anthony                    Abbott, Mr. Rossmore Edward            Abbott, Mrs. Stanton (Rosa Hunt)       Abelson, Mr. Samuel                    Abelson, Mrs. Samuel (Hannah Wizosky) 
## highest: Yousseff, Mr. Gerious                  Yrois, Miss. Henriette ("Mrs Harbeck") Zabour, Miss. Hileni                   Zabour, Miss. Thamine                  Zimmerman, Mr. Leo                    
## --------------------------------------------------------------------------------
## Sex 
##        n  missing distinct 
##      891        0        2 
##                         
## Value      female   male
## Frequency     314    577
## Proportion  0.352  0.648
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      714      177       88    0.999     29.7    16.21     4.00    14.00 
##      .25      .50      .75      .90      .95 
##    20.12    28.00    38.00    50.00    56.00 
## 
## lowest :  0.42  0.67  0.75  0.83  0.92, highest: 70.00 70.50 71.00 74.00 80.00
## --------------------------------------------------------------------------------
## SibSp 
##        n  missing distinct     Info     Mean      Gmd 
##      891        0        7    0.669    0.523    0.823 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 8
##                                                     
## Value          0     1     2     3     4     5     8
## Frequency    608   209    28    16    18     5     7
## Proportion 0.682 0.235 0.031 0.018 0.020 0.006 0.008
## --------------------------------------------------------------------------------
## Parch 
##        n  missing distinct     Info     Mean      Gmd 
##      891        0        7    0.556   0.3816   0.6259 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency    678   118    80     5     4     5     1
## Proportion 0.761 0.132 0.090 0.006 0.004 0.006 0.001
## --------------------------------------------------------------------------------
## Ticket 
##        n  missing distinct 
##      891        0      681 
## 
## lowest : 110152      110413      110465      110564      110813     
## highest: W./C. 6608  W./C. 6609  W.E.P. 5734 W/C 14208   WE/P 5735  
## --------------------------------------------------------------------------------
## Fare 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      891        0      248        1     32.2    36.78    7.225    7.550 
##      .25      .50      .75      .90      .95 
##    7.910   14.454   31.000   77.958  112.079 
## 
## lowest :   0.0000   4.0125   5.0000   6.2375   6.4375
## highest: 227.5250 247.5208 262.3750 263.0000 512.3292
## --------------------------------------------------------------------------------
## Cabin 
##        n  missing distinct 
##      204      687      147 
## 
## lowest : A10 A14 A16 A19 A20, highest: F33 F38 F4  G6  T  
## --------------------------------------------------------------------------------
## Embarked 
##        n  missing distinct 
##      889        2        3 
##                             
## Value          C     Q     S
## Frequency    168    77   644
## Proportion 0.189 0.087 0.724
## --------------------------------------------------------------------------------

Answer

Cabin has the most missing values: 687 of 891 entries are missing, a proportion of 0.771. Age has the second most, with 177 missing entries (a proportion of 0.199). Embarked is missing 2 entries, and every other variable is complete.
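As a cross-check that does not depend on Hmisc, the missing counts can also be tallied in base R. One caveat: with default settings, read.csv() keeps blank fields in text columns as "" rather than NA, so a plain is.na() count may not flag blank Cabin or Embarked entries. The sketch below uses a small made-up data frame (toy is hypothetical, not the real data) and counts both NA and empty strings as missing.

```r
# Toy stand-in for the Titanic data; the values are made up for illustration
toy <- data.frame(
  Age      = c(22, NA, 26, NA, 35),
  Cabin    = c("", "C85", "", "", "E46"),
  Embarked = c("S", "C", "", "S", "Q"),
  stringsAsFactors = FALSE
)

# Count entries that are NA or an empty string, column by column
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))

# Proportion of missing entries per column
missing_props <- missing_counts / nrow(toy)

# Name of the column with the most missing entries
most_missing <- names(which.max(missing_counts))

print(missing_counts)
print(missing_props)
print(most_missing)
```

An alternative, assuming the file is read again, is to pass na.strings = c("", "NA") to read.csv() so that blanks become NA at import time.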

Question 2

Question

Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)]=median(mydata$Age, na.rm=TRUE).

# Determine variable median
median_age <- median(titanic$Age, na.rm = TRUE)

# Store original data in a variable in case we make a mistake or want it for 
# comparison
original_age_data <- titanic$Age

# replace NAs with median age
titanic$Age[is.na(titanic$Age)] <- median_age
modified_age_data <- titanic$Age

# Comparison between original data and adjusted data
summary(original_age_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00

Answer

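The code above imputes the missing Age values. SibSp and Parch have no missing observations (see the Hmisc::describe() output in Question 1b), so no imputation is needed for them. As expected, the minimum, median, and maximum of Age are unchanged. The mean decreased slightly (29.70 to 29.36) because the imputed value, 28, is below the original mean. Both quartiles were pulled closer to the median because 177 observations were added exactly at the median, which makes the set of non-missing ages roughly 25% larger (714 to 891).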
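The effect of median imputation can be seen on a small made-up vector. The sketch below is only an illustration: ages and ports are invented values, and col_mode() is a hypothetical helper, since base R has no built-in function for the statistical mode (mode() returns the storage type instead).

```r
# Made-up ages with two missing values
ages <- c(2, 20, 28, 30, 70, NA, NA)

# Impute the missing values with the median of the observed values
med <- median(ages, na.rm = TRUE)
imputed <- ages
imputed[is.na(imputed)] <- med

# The median is unchanged, the mean moves toward the median,
# and the spread shrinks because new values sit exactly at the center
print(c(median_before = med, median_after = median(imputed)))
print(c(mean_before = mean(ages, na.rm = TRUE), mean_after = mean(imputed)))
print(c(sd_before = sd(ages, na.rm = TRUE), sd_after = sd(imputed)))

# Hypothetical helper: most frequent non-missing value (first one wins ties)
col_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up categorical column, imputed with its mode
ports <- c("S", "C", "S", NA, "Q")
ports[is.na(ports)] <- col_mode(ports)
print(ports)
```

The shrinking standard deviation is the same effect seen in the real data, where the quartiles moved toward 28 after imputation.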

Question 3

Question

Install the psych package in R: install.packages("psych"). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).

# Package has already been loaded at this point so we'll jump straight to 
# descriptive statistics

# Age (for this one we'll also run descriptive statistics on the original data)
psych::describe(original_age_data)
##    vars   n mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 714 29.7 14.53     28   29.27 13.34 0.42  80 79.58 0.39     0.16 0.54
psych::describe(titanic$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
hist(titanic$Age, xlab = "Age", main = "Age")

# SibSp
psych::describe(titanic$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
hist(titanic$SibSp, xlab = "Siblings / Spouses", main = "Siblings / Spouses")

# Parch
psych::describe(titanic$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03
hist(titanic$Parch, xlab = "Parents / Children", main = "Parents / Children")

Answer

The code above provides the descriptive statistics for the three variables. Additionally, I generated histograms to better visualize the distribution and skewness of each variable.
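To make the skew column more concrete, here is a minimal sketch of a plain moment-based skewness calculation on made-up counts shaped like SibSp (mostly zeros with a long right tail). This is an assumption-laden illustration: psych::describe() may apply a different small-sample adjustment, so its reported value need not match this formula exactly.

```r
# Made-up counts: mostly zeros with one large value, similar in shape to SibSp
x <- c(0, 0, 0, 0, 1, 1, 2, 8)

n <- length(x)
m <- mean(x)

# Second and third central moments
m2 <- sum((x - m)^2) / n
m3 <- sum((x - m)^3) / n

# Moment-based skewness: positive values indicate a long right tail
skew_moment <- m3 / m2^(3 / 2)

print(c(mean = m, median = median(x), skew = skew_moment))
```

A positive result together with a mean above the median matches the right-skewed histograms for SibSp and Parch.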

Question 4

Question

Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?

table(titanic$Survived, titanic$Sex)
##    
##     female male
##   0     81  468
##   1    233  109

Answer

The table shows a large disparity between the sexes: 233 of 314 women survived (about 74%), compared with 109 of 577 men (about 19%). This makes sense given that “women and children first” was the standard policy for allocating lifeboats when evacuating a ship at the time. An interesting follow-up would be to categorize each passenger as a child or an adult and then cross-tabulate that category against survival, with one table for females and one for males. Doing so would require some research into the age at which a boy was no longer considered a child in 1912.
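The survival proportions by sex can be computed directly from the cross-tabulation with prop.table(). The sketch below rebuilds the table from the counts printed above so that it runs on its own; with the real data, the same call could be applied to table(titanic$Survived, titanic$Sex).

```r
# Rebuild the cross-tabulation from the counts shown in the output above
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each column by its total, giving proportions within each sex
rates <- prop.table(counts, margin = 2)

print(round(rates, 3))
```

The "1" row of the result holds the survival proportion for each sex.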

Question 5

Question

Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

# NOTE: for this one I was curious to see the differences in the notched 
# boxplots if I used the original age data or the modified data (NAs replaced 
# with the median age)

# First with the Modified Age Data
boxplot(titanic$Age ~ titanic$Survived, notch = TRUE, horizontal = TRUE,
        ylab = "Survived", xlab = "Age",
        main = "Modified Age and Survival Rate (1 = Survived)")

# Additional Summaries to view the specific numbers
print("Survived")
## [1] "Survived"
summary(titanic$Age[titanic$Survived==1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   28.00   28.29   35.00   80.00
print("Perished")
## [1] "Perished"
summary(titanic$Age[titanic$Survived==0])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   23.00   28.00   30.03   35.00   74.00
# Replace age data with original
titanic$Age <- original_age_data

# Now with the Unmodified Age Data
boxplot(titanic$Age ~ titanic$Survived, notch = TRUE, horizontal = TRUE,
        ylab = "Survived", xlab = "Age",
        main = "Original Age and Survival Rate (1 = Survived)")

# Additional summaries to view the specific numbers
print("Survived")
## [1] "Survived"
summary(titanic$Age[titanic$Survived==1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   19.00   28.00   28.34   36.00   80.00      52
print("Perished")
## [1] "Perished"
summary(titanic$Age[titanic$Survived==0])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   21.00   28.00   30.63   39.00   74.00     125
# Put the modified data back into the data frame
titanic$Age <- modified_age_data

Answer

The boxplots overlap quite a bit around the median age; both groups have a median of 28. A lot more children survived than perished (although there were four outliers showing that some children did not manage to get off the ship). More passengers over the age of 60 perished than managed to board a lifeboat. The perished boxplot is more tightly centered about the median, so its range is narrower than that of the survived data. These observations should be taken with a grain of salt, given that ~20% of the ages are not truly known and were replaced with the median age.
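The notches are what make the overlap around the median meaningful. Per the R documentation for boxplot.stats(), each notch extends to the median plus or minus 1.58 × IQR / sqrt(n), and notches that do not overlap are taken as evidence that two medians differ. The sketch below reproduces that interval by hand on a made-up vector (x is not Titanic data) and compares it with the conf element that boxplot.stats() returns.

```r
# Made-up sample used only to illustrate the notch calculation
x <- c(1, 5, 9, 12, 15, 18, 22, 26, 30, 40)

s <- boxplot.stats(x)

# s$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
hinge_spread <- s$stats[4] - s$stats[2]

# Notch limits computed by hand from the documented formula
manual_notch <- s$stats[3] + c(-1, 1) * 1.58 * hinge_spread / sqrt(length(x))

print(s$conf)
print(manual_notch)
```

Because the notch width shrinks with sqrt(n), adding 177 imputed values at the median changes both the hinges and n, which is one reason the modified and original boxplots can look different.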