rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528198 28.3 1174427 62.8 NA 669420 35.8
## Vcells 976244 7.5 8388608 64.0 16384 1851787 14.2
cat("\f") # Clear the console
if(!is.null(dev.list())) dev.off() # Clear all plots
## null device
## 1
#Installing Packages
# Prepare needed libraries
packages <- c("psych",
"tidyverse",
"ggplot2",
"lemon",
"gridExtra",
"ggrepel",
"scales"
)
for (i in 1:length(packages)) {
if (!packages[i] %in% rownames(installed.packages())) {
install.packages(packages[i], dependencies = TRUE)
}
library(packages[i], character.only = TRUE)
}
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'lemon'
##
##
## The following object is masked from 'package:purrr':
##
## %||%
##
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## The following objects are masked from 'package:psych':
##
## alpha, rescale
rm(packages)
#Set WD
setwd("/Users/josephmancuso/Documents/BC/Spring'24/Week 1/HW")
getwd()
## [1] "/Users/josephmancuso/Documents/BC/Spring'24/Week 1/HW"
#Importing Data
Train <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Week 1/HW/train.csv"
, check.names = FALSE
, stringsAsFactors = FALSE
, na.strings = ""
)
#PassengerId
summary(Train$PassengerId)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 223.5 446.0 446.0 668.5 891.0
str(Train$PassengerId)
## int [1:891] 1 2 3 4 5 6 7 8 9 10 ...
PassengerId is a qualitative variable (identifier used to categorize passengers). The level of measurement is nominal given the variable has no intrinsic order or ranking.
#Age
summary(Train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
str(Train$Age)
## num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
Age is a quantitative variable (represents a measurable quantity). The level of measurement is ratio given age has a natural order, quantifiable difference between values, and an aboslute “zero-point.”
colSums(is.na(Train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
The Cabin variable has the most missing observations (687), followed by the Age variable (177), and Embarked variable (2).
#age
age.source.data <- Train$Age #Storing source data
summary(age.source.data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
age.median <- median(Train$Age, na.rm = T) #Calculating median age
Train$Age[is.na(Train$Age)] <- median(Train$Age, na.rm=TRUE) #Replacing NAs with median age
sum(is.na(Train$Age)) #Check
## [1] 0
summary(Train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
summary(age.source.data) #source data
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
In order to address missing observations associated with the Age variable, median imputation was applied to replace all missing observations with the median age of Titanic passengers. Comparing the original age data with the imputed age data reveals a slight decrease in the mean following imputation. The downward shift is driven by the median value, which is lower than the mean, pulling the distribution to the left and reducing the right skewness of the overall age distribution. The upward shift in the 25th percentile results from the median value pulling lower observed values closer to the center of the distribution. The reverse is true for the 75th percentile.
sum(is.na(Train$SibSp))
## [1] 0
sum(is.na(Train$Parch))
## [1] 0
There are no missing values associated with the SibSp and Parch variables - median imputation was not applied.
#Age
psych::describe(Train$Age)
The range of distribution for the Age variable is 79.58 years old. The observations extend from a minimum of .42 years old to a maximum of 80 years old. The mean age of of 29.36 is higher than the median age of 28, indicating the distribution of values is positively skewed (right tail) with a skewness value of .51. The standard deviation of 13.02 indicates a wide dispersion of values relative to the mean. The kurtosis value of .97 indicates the distribution is Leptokurtic and slightly more peaked than a normal distribution.
ggplot(data = Train, aes(x = Age)) +
geom_density(fill = "green") +
labs(title = "Passengers by Age",
x = "Age") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, #formatting title
face = "bold",
size = 18,
))
#SibSp
psych::describe(Train$SibSp)
The range of distribution for the SibSp is 8 siblings/spouses. The observations extend from a minimum of 0 siblings/spouses aboard the Titanic to a maximum of 8 siblings/spouses. The mean number of siblings/spouses on the Titanic is .52, higher than the median of 0. The distribution of values is positively skewed (right tail) with a skewness value of 3.68. The standard deviation of 1.1 indicates a tighter dispersion of values relative to the mean as compared to the Age variable. The kurtosis value of 17.73 indicates a Leptokurtic distribution with a more pronounced peak than the Age variable. The below density plot clearly shows a high concentration of values and sharp peak on the left side of the distribution.
ggplot(data = Train, aes(x = SibSp)) +
geom_density(fill = "green") +
labs(title = "Siblings/Spouses Aboard the Titanic",
x = "Siblings/Spouses") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, #formatting title
face = "bold",
size = 18,
))
psych::describe(Train$Parch)
The range of distribution for the Parch variable is 6 parents/children aboard the Titanic. The observations extend from a minimum of 0 parents/children to a maximum of 6 parents/children aboard the Titanic. The mean number of parents/children aboard the Titanic is .38, higher than the median of 0. The distribution of values is positively skewed (right tail) with a skewness value of 2.74. The standard deviation of .81 is the lowest of the three variables observed and indicates a tighter dispersion of values relative to the mean as compared to the Age and SipSp variables. The kurtosis value of 9.69 indicates the distribution is Leptokurtic. The below density plot clearly shows a high concentration of values and sharp peak on the left side of the distribution.
ggplot(data = Train, aes(x = Parch)) +
geom_density(fill = "green") +
labs(title = "Parents/Children Aboard the Titanic",
x = "Parents/Children") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, #formatting title
face = "bold",
size = 18,
))
table(Train$Survived, Train$Sex)
##
## female male
## 0 81 468
## 1 233 109
Records indicate 468 males and 81 females did not survive the voyage. 233 females and 109 males survived the voyage.
proportion.survived.sex <- table(Train$Survived, Train$Sex)
proportion.survived.sex <- prop.table(proportion.survived.sex, margin = 2)
proportion.survived.sex
##
## female male
## 0 0.2579618 0.8110919
## 1 0.7420382 0.1889081
Female passengers exhibit a higher survival rate than male passengers (74.2% versus 18.8%). The findings could be attributable to the Captain’s order to evacuate “women and children first.”
survival.rate.Pclass <- tapply(Train$Survived, list(Train$Sex, Train$Pclass), mean, na.rm = TRUE)
survival.rate.Pclass
## 1 2 3
## female 0.9680851 0.9210526 0.5000000
## male 0.3688525 0.1574074 0.1354467
Female and male first-class passengers had the highest survival rates at 96.8% and 36.8% respectively. Third-class passengers for both sexes recorded the lowest survival rates (50% and 13.5%). These results corroborate previous research which found passengers who paid higher fair rates were more likely to survive.
boxplot(Train$Age~Train$Survived, notch=TRUE, horizontal=T,
xlab = "Age",
ylab = "Survived",
main = "Age & Survival Rate"
)
Notched boxplots were created to compare the median age of Titanic passengers who survived or perished. The top boxplot (“Survived”) indicates most Titanic survivors were young adults. The interquartile range (25th percentile, median, and 75th percentile of survivors) all fall within this age range. The trend holds for passengers who did not survive the voyage as indicated by the lower boxplot (“Perished”). The Survived boxplot demonstrates greater variability in age as indicated by the larger interquartile range.
The outliers observed correspond to both children and the elderly. Children were more likely to survive the voyage as indicated by the lower minimum line in the Survived boxplot. The reverse hold true for the elderly who were less likely to survive the voyage (lower maximum and greater number of outliers in the Perished boxplot). The overlapping notches provide strong evidence there is no statistically significant difference in median values at the 95% confidence interval. The smaller notch in the Perished boxplot indicates less uncertainty in this observation as compared to the Survived boxplot (further analysis is required to validate findings).