HW1: Titanic

(train.csv)

setwd("/Users/jiwonban/ADEC7301/HW1/titanic")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
hw1.data <- read.csv("train.csv") # read the training data into a data frame

Q1a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

1a. Answer:

PassengerId represents the de-identified unique code assigned to each passenger on the Titanic, ranging from 1 to 891, and Age is the passenger's age in years. PassengerId is an identifier, so it is a qualitative variable measured at the nominal level: the numbers label passengers but carry no order or magnitude. Age is quantitative and measured at the ratio level because it has a true zero point (a newborn is 0 years old), so ratios are meaningful (a 40-year-old is twice as old as a 20-year-old). In R, the two columns are stored as integer and numeric, respectively.

range(hw1.data$PassengerId) #range of IDs
## [1]   1 891
class(hw1.data$PassengerId) #Integer
## [1] "integer"
class(hw1.data$Age) #Numeric
## [1] "numeric"

Q1b.  Which variable has the most missing observations?

  You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that. 

1b. Answer:

After empty strings are recoded as NA, the variable Cabin is missing 687 cases, compared with 177 for Age; no other variable shows any NA values.

#make empty cells == NA for Cabin  (Found code here https://www.statology.org/r-replace-blank-with-na/)
hw1.data <- hw1.data %>% 
  mutate(Cabin = na_if(Cabin,""))

#Using the code provided in Q
sum(is.na(hw1.data$PassengerId))
## [1] 0
sum(is.na(hw1.data$Survived))
## [1] 0
sum(is.na(hw1.data$Pclass))
## [1] 0
sum(is.na(hw1.data$Name))
## [1] 0
sum(is.na(hw1.data$Sex))
## [1] 0
sum(is.na(hw1.data$Age)) # 177 missing cases
## [1] 177
sum(is.na(hw1.data$SibSp))
## [1] 0
sum(is.na(hw1.data$Parch))
## [1] 0
sum(is.na(hw1.data$Ticket))
## [1] 0
sum(is.na(hw1.data$Fare))
## [1] 0
sum(is.na(hw1.data$Cabin)) # 687 missing cases
## [1] 687
sum(is.na(hw1.data$Embarked))
## [1] 0
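The column-by-column counts above can also be produced in one call. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical stand-ins for hw1.data); it recodes empty strings as NA in every character column first, on the assumption that blanks should count as missing everywhere, not only in Cabin.

```r
# Toy data frame with made-up values, standing in for hw1.data
toy <- data.frame(
  Age   = c(22, NA, 35, NA),
  Cabin = c("C85", "", "", ""),
  Fare  = c(7.25, 71.28, 8.05, 53.10),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA in every character column
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

# One call counts the NAs in every column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Name of the column with the most missing values
most_missing <- names(which.max(na_counts))
print(most_missing)
```

Running `colSums(is.na(hw1.data))` on the real data should reproduce the individual `sum(is.na(...))` results in a single named vector.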

Q2.  Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

  To do so, use something like this:  mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE) . You can read up on indexing in dataframe. (i.e. “dataframe_name$variable_name”).  

2. Answer:

For Age

hw1.data$Age[is.na(hw1.data$Age)] <- median(hw1.data$Age, 
                                            na.rm = TRUE) #signifies: ignore missing values when calculating median

summary(hw1.data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00

SibSp and Parch did not have any missing values, so the same imputation leaves them unchanged.

#SibSp
hw1.data$SibSp[is.na(hw1.data$SibSp)] <- median(hw1.data$SibSp, 
                                            na.rm = TRUE)
summary(hw1.data$SibSp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.523   1.000   8.000
#Parch
hw1.data$Parch[is.na(hw1.data$Parch)] <- median(hw1.data$Parch, 
                                            na.rm = TRUE)
summary(hw1.data$Parch)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3816  0.0000  6.0000
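Since the question allows either the column median or the column mode, the repeated assignments above could be wrapped in a small helper. This is only a sketch: `stat_mode` and `impute_column` are hypothetical names (base R has no built-in statistical mode function), and the toy values are made up.

```r
# Most frequent non-missing value (hypothetical helper; ties go to the first value seen)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill NAs with the column median (numeric columns) or the mode (everything else)
impute_column <- function(x) {
  fill <- if (is.numeric(x)) median(x, na.rm = TRUE) else stat_mode(x)
  x[is.na(x)] <- fill
  x
}

# Made-up example data
toy <- data.frame(
  Age      = c(20, NA, 30, 40, NA),
  Embarked = c("S", "C", NA, "S", "S"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, impute_column)
print(toy)
```

Columns without missing values pass through unchanged, which matches what happens to SibSp and Parch above.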

Q3.  Descriptive statistics

  Install and invoke the psych package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).  Please comment on what you observe from the summary statistics.  

3. Answer:

library(psych)

Age:

After imputing the missing values (in Q2), there are now 891 values for Age. On average, passengers were 29.36 years old (SD = 13.02 years); the trimmed mean is 28.83. The median age was 28 years, meaning that about half the passengers were younger than that and half were older. The skewness of 0.51 indicates a positively skewed distribution (longer right tail). The kurtosis of 0.97, which psych reports as excess kurtosis, indicates a slightly leptokurtic distribution (i.e., a sharper peak and heavier tails than a normal distribution); part of that peak comes from the 177 missing ages that were all set to the median of 28. The youngest passenger was less than a year old (0.42 years) and the oldest was recorded as 80 years old. The histogram confirms this distribution.

describe(hw1.data$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
hist(hw1.data$Age, labels = TRUE, ylim = c(0, 450))

SibSp:

This variable indicates the number of siblings or spouses each passenger had aboard the Titanic. The mean is 0.52 siblings/spouses per passenger, but it is not very meaningful because 1) the distribution is far from normal, and 2) the variable is a count, so only whole numbers make practical sense (no one travels with half a sibling or spouse). The median is 0, which means at least half of the passengers had no siblings or spouses on the ship. The histogram (and the mode of 0) shows that the majority of passengers had no siblings or spouses aboard, but there were some outliers (presumably those who boarded as whole families). The high positive skewness (3.68) implies that the distribution is skewed right, with a very long right tail. The high kurtosis (17.73) indicates a sharp peak at 0 combined with heavy tails, driven by those outliers.

describe(hw1.data$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
hist(hw1.data$SibSp, labels = TRUE, ylim=c(0,850))

Parch:

The variable Parch represents the number of parents/children each passenger had aboard the Titanic. The mean is 0.38 (SD = 0.81), with a skewness of 2.74 and kurtosis of 9.69. Similar to SibSp, this variable is not normally distributed; rather, it is skewed right with a high concentration of values at 0, as the median of 0 indicates.

describe(hw1.data$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03
hist(hw1.data$Parch, labels = TRUE, ylim=c(0,750))
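To make the skewness and kurtosis readings above more concrete, here is a base-R sketch of the moment-based definitions. It assumes the plain population-moment formulas; psych's describe() may apply a slightly different small-sample correction, so its numbers need not match to the last digit. `moment_shape` and both example vectors are hypothetical.

```r
# Moment-based skewness and excess kurtosis (hypothetical helper)
moment_shape <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)       # deviations from the mean
  m2 <- mean(d^2)        # second central moment (variance, dividing by n)
  m3 <- mean(d^3)        # third central moment
  m4 <- mean(d^4)        # fourth central moment
  c(skew = m3 / m2^1.5, excess_kurtosis = m4 / m2^2 - 3)
}

# A symmetric vector has zero skewness
symmetric <- c(1, 2, 3, 4, 5)
print(moment_shape(symmetric))

# A count-like vector piled up at 0 with one large value, similar in shape to SibSp
right_skewed <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 8)
print(moment_shape(right_skewed))
```

Positive skew reflects the long right tail, and positive excess kurtosis reflects tails heavier than a normal distribution's, which is the pattern reported for SibSp and Parch.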

Q4.  Provide a cross-tabulation of Survived and Sex

(e.g., table(mydata$Survived, mydata$Sex)).  What do you notice?

4. Answer:

The frequency count shows that more females than males survived the sinking of the Titanic (Survived = 1, 233 vs 109); consistent with that pattern, many more males than females did not survive (Survived = 0, 468 vs 81). Because there were more males (577) than females (314) in the data, the rates are more telling than the raw counts: about 74% of females survived (233 of 314) compared with about 19% of males (109 of 577). This is consistent with evacuation practices that prioritize women, children, and the elderly.

table(hw1.data$Survived, hw1.data$Sex) # 0=No, 1=Yes
##    
##     female male
##   0     81  468
##   1    233  109
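The counts can be converted to within-sex survival rates with prop.table(). The sketch below rebuilds the table from the four counts printed above so that it runs on its own; with the real data, `prop.table(table(hw1.data$Survived, hw1.data$Sex), margin = 2)` would be the equivalent call.

```r
# Rebuild the cross-tabulation from the counts shown above
tab <- matrix(
  c(81, 233, 468, 109),   # filled column by column: female (0, 1), then male (0, 1)
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each cell by its column total, giving rates within each sex
rates <- prop.table(tab, margin = 2)
print(round(rates, 3))
```

Each column of `rates` sums to 1, so the "1" row can be read directly as the survival rate for each sex.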

We can also visualize the cross-tabulation; the bar graph below confirms the same pattern.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(hw1.data, aes(x = factor(Survived), fill = Sex)) + # treat 0/1 as categories
  geom_bar() +
  labs(x = "Survived (0 = No, 1 = Yes)", y = "Number of passengers")

Q5.  Provide notched boxplots for Survived and Age

(e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)).  What do you notice?

5. Answer:

Surprisingly, there was not much difference in the age distribution of those who survived versus those who did not, as shown by the overlap of the two boxes. The whiskers and the points beyond them suggest that survivors span a somewhat wider range of ages, including the very young and the very old, while among non-survivors such extreme ages show up only as outliers, meaning they were unusual for that group. A boxplot shows spread rather than survival rates, so this is only suggestive, but it is in line with the typical evacuation protocol.

boxplot(hw1.data$Age ~ hw1.data$Survived, notch = TRUE, horizontal = TRUE)
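For reading the notches themselves: base R draws each notch at the median plus or minus 1.58 times the hinge spread divided by the square root of n, and the boxplot documentation describes non-overlapping notches as strong evidence that two medians differ. The sketch below reproduces that interval by hand on a made-up vector of ages (`ages` is hypothetical, not taken from the data).

```r
# Made-up ages for one group
ages <- c(2, 18, 21, 24, 28, 28, 30, 35, 41, 63)

# boxplot.stats() returns the notch limits in $conf
stats <- boxplot.stats(ages)
print(stats$conf)

# The same limits by hand: median +/- 1.58 * (upper hinge - lower hinge) / sqrt(n)
hinges <- fivenum(ages)[c(2, 4)]
manual_notch <- median(ages) + c(-1.58, 1.58) * diff(hinges) / sqrt(length(ages))
print(manual_notch)
```

Computing `boxplot.stats(hw1.data$Age[hw1.data$Survived == 1])$conf` and the same for `Survived == 0` would give the two notch intervals to compare numerically instead of by eye.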