setwd("/Users/jiwonban/ADEC7301/HW1/titanic")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hw1.data <- read.csv("train.csv") # Load the training data as hw1.data
and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?
PassengerId is a de-identified unique code assigned to each
passenger in the data set, ranging from 1 to 891, and Age is the
age of the passenger in years. PassengerId is an identifier, so it is a
qualitative, nominal variable: the numbers label passengers and carry no
order or magnitude. Age is quantitative and measured on a ratio scale
because it has a true zero point, i.e., a newborn is 0 years old. In R,
the two columns are stored as integer and numeric, respectively.
range(hw1.data$PassengerId) #range of IDs
## [1] 1 891
class(hw1.data$PassengerId) #Integer
## [1] "integer"
class(hw1.data$Age) #Numeric
## [1] "numeric"
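The storage class R reports is separate from the level of measurement. The sketch below illustrates this with a small made-up data frame (`toy` is a hypothetical stand-in, not the real `hw1.data`); converting the identifier to character is an optional choice shown for illustration, not something the assignment asks for.

```r
# Toy stand-in for hw1.data; the values are invented for illustration.
toy <- data.frame(PassengerId = 1:4, Age = c(22, 38, 0.42, 80))

# Storage class of every column at once
classes <- sapply(toy, class)
print(classes)

# PassengerId is nominal, so arithmetic on it is not meaningful.
# Storing it as character makes that explicit (optional).
toy$PassengerId <- as.character(toy$PassengerId)
print(class(toy$PassengerId))
```

An integer column can therefore still be nominal; the class only describes how the values are stored.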
You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.
Cabin is missing 687 values and Age is missing 177; no other
variable has any NA values. Blank strings were converted to NA for
Cabin only, so blanks in other character columns would not be counted.
# Convert empty strings in Cabin to NA
# (source: https://www.statology.org/r-replace-blank-with-na/)
hw1.data <- hw1.data %>%
  mutate(Cabin = na_if(Cabin, ""))
#Using the code provided in Q
sum(is.na(hw1.data$PassengerId))
## [1] 0
sum(is.na(hw1.data$Survived))
## [1] 0
sum(is.na(hw1.data$Pclass))
## [1] 0
sum(is.na(hw1.data$Name))
## [1] 0
sum(is.na(hw1.data$Sex))
## [1] 0
sum(is.na(hw1.data$Age)) # 177 missing cases
## [1] 177
sum(is.na(hw1.data$SibSp))
## [1] 0
sum(is.na(hw1.data$Parch))
## [1] 0
sum(is.na(hw1.data$Ticket))
## [1] 0
sum(is.na(hw1.data$Fare))
## [1] 0
sum(is.na(hw1.data$Cabin)) # 687 missing cases
## [1] 687
sum(is.na(hw1.data$Embarked))
## [1] 0
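The column-by-column checks above can also be done in one pass. The following is a minimal base-R sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not the Titanic data): it first turns blank strings into NA in every character column, then counts NA values per column with `colSums(is.na(...))`.

```r
# Toy stand-in for hw1.data; the values are invented for illustration.
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c("C85", "", "", "E46"),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Replace blank strings with NA in every character column
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], function(col) {
  col[which(col == "")] <- NA_character_
  col
})

# Count missing values for all columns in one call
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the real data, the same two steps would report every column together and would also catch blanks outside Cabin.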
To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE) . You can read up on indexing in dataframe. (i.e. “dataframe_name$variable_name”).
For Age, missing values are replaced with the median age:
hw1.data$Age[is.na(hw1.data$Age)] <- median(hw1.data$Age,
  na.rm = TRUE) # na.rm = TRUE ignores missing values when calculating the median
summary(hw1.data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
The same imputation is applied to SibSp and Parch, although neither
has any missing values, so their values are unchanged.
#SibSp
hw1.data$SibSp[is.na(hw1.data$SibSp)] <- median(hw1.data$SibSp,
na.rm = TRUE)
summary(hw1.data$SibSp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.523 1.000 8.000
#Parch
hw1.data$Parch[is.na(hw1.data$Parch)] <- median(hw1.data$Parch,
na.rm = TRUE)
summary(hw1.data$Parch)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3816 0.0000 6.0000
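The three imputation steps repeat the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that; `impute_median` and the `toy` data frame are hypothetical names and values introduced here for illustration. It also shows a side effect worth keeping in mind: filling gaps with the median shrinks the standard deviation.

```r
# Hypothetical helper: replace NA values in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy stand-in for hw1.data; the values are invented for illustration.
toy <- data.frame(Age = c(22, NA, 38, 26, NA), SibSp = c(1, 0, 1, 0, 0))

sd_before <- sd(toy$Age, na.rm = TRUE)

# Apply the helper to several columns at once
cols <- c("Age", "SibSp")
toy[cols] <- lapply(toy[cols], impute_median)

sd_after <- sd(toy$Age)
print(toy)
print(c(before = sd_before, after = sd_after))
```

Columns without missing values, such as SibSp here, pass through with the same values.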
Install and invoke the psych package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.
library(psych)
Age: After imputing the missing values (in Q2), Age now has 891 values.
On average, passengers were 29.36 years old (SD = 13.02 years); the
trimmed mean is 28.83. The median age was 28 years, which means half the
passengers were at or below that age and half were at or above it. The
skewness of 0.51 indicates a positively skewed distribution (a longer
right tail); the kurtosis of 0.97 indicates a slightly leptokurtic
distribution (i.e., a sharper peak and heavier tails than a normal
distribution). Part of that peak comes from the 177 values imputed at
the median. The youngest passenger was under a year old (0.42 years) and
the oldest is recorded as 80 years old. The histogram confirms this shape.
describe(hw1.data$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
hist(hw1.data$Age, labels = TRUE, ylim = c(0, 450))
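To make the skewness and kurtosis figures less of a black box, the sketch below computes both from their moment-based definitions in base R. The function names and the two toy vectors are assumptions for illustration. This is one common definition (excess kurtosis, using the sample standard deviation); `psych::describe` may apply a different variant, so its values need not match exactly.

```r
# Moment-based skewness: average cubed deviation, scaled by sd^3
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}

# Excess kurtosis: average fourth-power deviation scaled by sd^4, minus 3
# (so a normal distribution is near 0)
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^4) / (n * sd(x)^4) - 3
}

sym   <- c(1, 2, 3, 4, 5)             # symmetric toy vector
right <- c(0, 0, 0, 0, 1, 1, 2, 8)    # toy count-like vector with a long right tail

print(moment_skew(sym))
print(moment_skew(right))
print(moment_kurtosis(right))
```

A symmetric vector gives a skewness of zero, while a vector with a few large values on the right gives a positive one.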
SibSp: This variable is the number of siblings or spouses each passenger had aboard the Titanic. The mean is 0.52 siblings/spouses, which is not very meaningful because 1) the distribution is not normal, and 2) the variable is a discrete count (half a sibling or spouse makes no practical sense). The median is 0, which means at least half the passengers in the sample had no siblings or spouses on the ship. The histogram (and the mode) shows that the majority of passengers traveled without a sibling or spouse, but there are some outliers (presumably passengers who boarded as whole families). The high, positive skewness (3.68) indicates a right-skewed distribution with a very long right tail. The high kurtosis (17.73) indicates a very sharp peak with heavy tails, driven by those outliers.
describe(hw1.data$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
hist(hw1.data$SibSp, labels = TRUE, ylim=c(0,850))
Parch: The variable Parch is the number of parents/children each
passenger had aboard the Titanic. The mean is 0.38 (SD = 0.81), with a
skewness of 2.74 and a kurtosis of 9.69. Like SibSp, this variable is
not normally distributed; it is skewed right, with values heavily
concentrated at 0 (the median is 0, so at least half the passengers had
no parents or children aboard).
describe(hw1.data$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
hist(hw1.data$Parch, labels = TRUE, ylim=c(0,750))
(e.g., table(mydata$Survived, mydata$Sex). What do you notice?
The frequency counts show that more females than males survived the
sinking of the Titanic (Survived = 1: 233 vs. 109), even though the
sample contains fewer females than males (314 vs. 577); about 74% of
females survived compared with about 19% of males. Consistent with that
pattern, far more males than females did not survive (Survived = 0:
468 vs. 81). This is consistent with evacuation practices that
prioritize women, children, and the elderly.
table(hw1.data$Survived, hw1.data$Sex) # 0=No, 1=Yes
##
## female male
## 0 81 468
## 1 233 109
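Because the two groups differ in size, survival rates are easier to compare than raw counts. The sketch below rebuilds the cross-tab from the counts printed above (typed in by hand, so the object name `counts` is an assumption rather than something read from the file) and uses `prop.table()` to convert each column to proportions.

```r
# Cross-tab counts copied from the table output above
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each cell by its column total (i.e., within each sex)
rates <- prop.table(counts, margin = 2)
print(round(rates, 3))
# Survival row ("1"): female 0.742, male 0.189
```

With the real data, `prop.table(table(hw1.data$Survived, hw1.data$Sex), margin = 2)` should give the same proportions.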
We can also visualize the cross-tab; the bar graph below confirms the same pattern.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
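# Treat Survived as a category so the x-axis shows only 0 and 1
ggplot(hw1.data, aes(x = factor(Survived), fill = Sex)) +
  geom_bar() +
  labs(x = "Survived (0 = No, 1 = Yes)", y = "Count")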
(e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?
Surprisingly, the age distributions of those who survived and those who did not differ little, as shown by the overlap of the two boxes. The whiskers and outlier points suggest that the survivors span a wider range of ages, including the very young and the very old, whereas in the non-survived group those ages appear only as outliers. This seems to support the typical evacuation protocol, although outlier points mark ages far from a group's quartiles rather than survival rates, so this reading is tentative.
boxplot(Age ~ Survived, data = hw1.data, notch = TRUE, horizontal = TRUE)
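To see which points a boxplot draws as outliers, the sketch below applies `boxplot.stats()` to a made-up vector of ages (`ages` is an assumption for illustration, not the Titanic data). By default a point is flagged when it lies more than 1.5 times the box length beyond the box edges, so the flag describes distance from the quartiles within a group and nothing else.

```r
# Made-up ages for illustration only
ages <- c(1, 20, 22, 24, 26, 28, 30, 32, 34, 80)

stats <- boxplot.stats(ages)

# stats$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# stats$out: values drawn as individual outlier points
outliers <- stats$out
print(outliers)
```

Here the box runs from 22 to 32, so the fences sit at 7 and 47, and only the ages 1 and 80 fall outside them.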