HW 1 - Titanic

Preparing Workspace

  rm(list = ls()) # Clear environment
  gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528198 28.3    1174427 62.8         NA   669420 35.8
## Vcells 976244  7.5    8388608 64.0      16384  1851787 14.2

  cat("\f")       # Clear the console

  if(!is.null(dev.list())) dev.off() # Clear all plots

## null device 
##           1

#Installing Packages

# Prepare needed libraries

   packages <- c("psych",
                 "tidyverse",
                 "ggplot2",
                 "lemon",
                 "gridExtra",
                 "ggrepel",
                 "scales"
   )
   
  for (i in 1:length(packages)) {
    if (!packages[i] %in% rownames(installed.packages())) {
      install.packages(packages[i], dependencies = TRUE)
    }
    library(packages[i], character.only = TRUE)
  }

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'lemon'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     %||%
## 
## 
## 
## Attaching package: 'gridExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale

  rm(packages)

#Set WD

  setwd("/Users/josephmancuso/Documents/BC/Spring'24/Week 1/HW")
  getwd()

## [1] "/Users/josephmancuso/Documents/BC/Spring'24/Week 1/HW"

#Importing Data

Train <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Week 1/HW/train.csv"
                                    , check.names = FALSE
                                    , stringsAsFactors = FALSE
                                    , na.strings = ""
                                    )

Question 1a:

What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

#PassengerId
summary(Train$PassengerId)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   223.5   446.0   446.0   668.5   891.0

str(Train$PassengerId)

##  int [1:891] 1 2 3 4 5 6 7 8 9 10 ...

PassengerId is a qualitative variable (identifier used to categorize passengers). The level of measurement is nominal given the variable has no intrinsic order or ranking.

#Age
summary(Train$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

str(Train$Age)

##  num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...

Age is a quantitative variable (represents a measurable quantity). The level of measurement is ratio given age has a natural order, quantifiable difference between values, and an aboslute “zero-point.”

#Question 1b:

Which variable has the most missing observations?

colSums(is.na(Train))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

The Cabin variable has the most missing observations (687), followed by the Age variable (177), and Embarked variable (2).

Question 2:

Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

#age
age.source.data <- Train$Age #Storing source data
summary(age.source.data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

age.median <- median(Train$Age, na.rm = T) #Calculating median age 

Train$Age[is.na(Train$Age)] <- median(Train$Age, na.rm=TRUE) #Replacing NAs with median age 

sum(is.na(Train$Age)) #Check

## [1] 0

summary(Train$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00

summary(age.source.data) #source data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

In order to address missing observations associated with the Age variable, median imputation was applied to replace all missing observations with the median age of Titanic passengers. Comparing the original age data with the imputed age data reveals a slight decrease in the mean following imputation. The downward shift is driven by the median value, which is lower than the mean, pulling the distribution to the left and reducing the right skewness of the overall age distribution. The upward shift in the 25th percentile results from the median value pulling lower observed values closer to the center of the distribution. The reverse is true for the 75th percentile.

sum(is.na(Train$SibSp))

## [1] 0

sum(is.na(Train$Parch))

## [1] 0

There are no missing values associated with the SibSp and Parch variables - median imputation was not applied.

Question 3:

Install the psych package in R. Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch. Please comment on what you observe from the summary statistics.

#Age

psych::describe(Train$Age)

The range of distribution for the Age variable is 79.58 years old. The observations extend from a minimum of .42 years old to a maximum of 80 years old. The mean age of of 29.36 is higher than the median age of 28, indicating the distribution of values is positively skewed (right tail) with a skewness value of .51. The standard deviation of 13.02 indicates a wide dispersion of values relative to the mean. The kurtosis value of .97 indicates the distribution is Leptokurtic and slightly more peaked than a normal distribution.

ggplot(data = Train, aes(x = Age)) +
  geom_density(fill = "green") + 
  labs(title = "Passengers by Age",
       x = "Age") +
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, #formatting title
                                                    face = "bold",
                                                    size = 18,
                                            ))

#SibSp

psych::describe(Train$SibSp)

The range of distribution for the SibSp is 8 siblings/spouses. The observations extend from a minimum of 0 siblings/spouses aboard the Titanic to a maximum of 8 siblings/spouses. The mean number of siblings/spouses on the Titanic is .52, higher than the median of 0. The distribution of values is positively skewed (right tail) with a skewness value of 3.68. The standard deviation of 1.1 indicates a tighter dispersion of values relative to the mean as compared to the Age variable. The kurtosis value of 17.73 indicates a Leptokurtic distribution with a more pronounced peak than the Age variable. The below density plot clearly shows a high concentration of values and sharp peak on the left side of the distribution.

ggplot(data = Train, aes(x = SibSp)) +
  geom_density(fill = "green") + 
  labs(title = "Siblings/Spouses Aboard the Titanic",
       x = "Siblings/Spouses") +
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, #formatting title
                                                    face = "bold",
                                                    size = 18,
                                            ))

psych::describe(Train$Parch)

The range of distribution for the Parch variable is 6 parents/children aboard the Titanic. The observations extend from a minimum of 0 parents/children to a maximum of 6 parents/children aboard the Titanic. The mean number of parents/children aboard the Titanic is .38, higher than the median of 0. The distribution of values is positively skewed (right tail) with a skewness value of 2.74. The standard deviation of .81 is the lowest of the three variables observed and indicates a tighter dispersion of values relative to the mean as compared to the Age and SipSp variables. The kurtosis value of 9.69 indicates the distribution is Leptokurtic. The below density plot clearly shows a high concentration of values and sharp peak on the left side of the distribution.

ggplot(data = Train, aes(x = Parch)) +
  geom_density(fill = "green") + 
  labs(title = "Parents/Children Aboard the Titanic",
       x = "Parents/Children") +
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, #formatting title
                                                    face = "bold",
                                                    size = 18,
                                            ))

Question 4:

Provide a cross-tabulation of Survived and Sex. What do you notice?

table(Train$Survived, Train$Sex)

##    
##     female male
##   0     81  468
##   1    233  109

Records indicate 468 males and 81 females did not survive the voyage. 233 females and 109 males survived the voyage.

proportion.survived.sex <- table(Train$Survived, Train$Sex)
proportion.survived.sex <- prop.table(proportion.survived.sex, margin = 2)
proportion.survived.sex

##    
##        female      male
##   0 0.2579618 0.8110919
##   1 0.7420382 0.1889081

Female passengers exhibit a higher survival rate than male passengers (74.2% versus 18.8%). The findings could be attributable to the Captain’s order to evacuate “women and children first.”

survival.rate.Pclass <- tapply(Train$Survived, list(Train$Sex, Train$Pclass), mean, na.rm = TRUE)
survival.rate.Pclass

##                1         2         3
## female 0.9680851 0.9210526 0.5000000
## male   0.3688525 0.1574074 0.1354467

Female and male first-class passengers had the highest survival rates at 96.8% and 36.8% respectively. Third-class passengers for both sexes recorded the lowest survival rates (50% and 13.5%). These results corroborate previous research which found passengers who paid higher fair rates were more likely to survive.

Question 5:

Provide notched boxplots for Survived and Age. What do you notice?

boxplot(Train$Age~Train$Survived, notch=TRUE, horizontal=T, 
        xlab = "Age",
        ylab = "Survived",
        main = "Age & Survival Rate"
                                  )

Notched boxplots were created to compare the median age of Titanic passengers who survived or perished. The top boxplot (“Survived”) indicates most Titanic survivors were young adults. The interquartile range (25th percentile, median, and 75th percentile of survivors) all fall within this age range. The trend holds for passengers who did not survive the voyage as indicated by the lower boxplot (“Perished”). The Survived boxplot demonstrates greater variability in age as indicated by the larger interquartile range.

The outliers observed correspond to both children and the elderly. Children were more likely to survive the voyage as indicated by the lower minimum line in the Survived boxplot. The reverse hold true for the elderly who were less likely to survive the voyage (lower maximum and greater number of outliers in the Perished boxplot). The overlapping notches provide strong evidence there is no statistically significant difference in median values at the 95% confidence interval. The smaller notch in the Perished boxplot indicates less uncertainty in this observation as compared to the Survived boxplot (further analysis is required to validate findings).