Titanic Data Analysis
Packages
Loading necessary packages:
# A custom function for loading packages
# Source - https://github.com/MdAhsanulHimel/load_packages_R/blob/master/script.R
load_packages <- function(packages) {
not_installed <- packages[!packages %in% rownames(installed.packages())]
if(length(not_installed)>0){
message("Installing packages:", paste(" ", not_installed))
install.packages(not_installed, character.only = TRUE) |> suppressMessages()
}
lapply(X = packages, FUN = library, character.only = TRUE) |> invisible()
message("Packages loaded:", paste(" ", packages))
}load_packages("data.table") # to import data
load_packages("dplyr") # for data wrangling
load_packages("naniar") # to visualize missing values
load_packages("table1") # for pretty frequency table
load_packages("skimr") # for summarizing dataData
Using fread() to import the dataset since it is the fastestSee more:
data <- fread("train.csv", na.strings=c("",NA,"NULL"))
# na.strings=c("",NA,"NULL") from https://stackoverflow.com/a/24212891/13323413Data dictionary:
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Explanatory Data Analysis
Structure of the data:
str(data)## Classes 'data.table' and 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
## - attr(*, ".internal.selfref")=<externalptr>
Converting all character variables to factor:
data <- data %>%
mutate_if(is.character, as.factor)Summary of the dataset:
summary(data) ## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## B96 B98 : 4 C :168
## C23 C25 C27: 4 Q : 77
## G6 : 4 S :644
## C22 C26 : 3 NA's: 2
## D : 3
## (Other) :186
## NA's :687
Removing categorical variables that seem not useful and factorizing some numeric variables:
data2 <- data %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = factor(Survived, levels = c(0,1), labels = c("No", "Yes")),
Pclass = factor(Pclass, levels = c(1,2,3), labels = c("1st", "2nd", "3rd")),
Embarked = recode(Embarked,
"C" = "Cherbourg", "Q" = "Queenstown", "S" = "Southampton"),
Sex = recode(Sex, "female" = "Female", "male" = "Male"))Summary of new dataset:
summary(data2)## Survived Pclass Sex Age SibSp
## No :549 1st:216 Female:314 Min. : 0.42 Min. :0.000
## Yes:342 2nd:184 Male :577 1st Qu.:20.12 1st Qu.:0.000
## 3rd:491 Median :28.00 Median :0.000
## Mean :29.70 Mean :0.523
## 3rd Qu.:38.00 3rd Qu.:1.000
## Max. :80.00 Max. :8.000
## NA's :177
## Parch Fare Embarked
## Min. :0.0000 Min. : 0.00 Cherbourg :168
## 1st Qu.:0.0000 1st Qu.: 7.91 Queenstown : 77
## Median :0.0000 Median : 14.45 Southampton:644
## Mean :0.3816 Mean : 32.20 NA's : 2
## 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :6.0000 Max. :512.33
##
Detailed summary of the new dataset:
skim(data2) %>% yank("numeric")Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0 | 6.00 | ▇▁▁▁▁ |
| Fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31 | 512.33 | ▇▁▁▁▁ |
skim(data2) %>% yank("factor")Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Survived | 0 | 1 | FALSE | 2 | No: 549, Yes: 342 |
| Pclass | 0 | 1 | FALSE | 3 | 3rd: 491, 1st: 216, 2nd: 184 |
| Sex | 0 | 1 | FALSE | 2 | Mal: 577, Fem: 314 |
| Embarked | 2 | 1 | FALSE | 3 | Sou: 644, Che: 168, Que: 77 |
Visualizing missing values:
gg_miss_var(data2)Summary statistics by survival status:
# Defining custom function for continuous variable
my.render.cont <- function(x) {
with(stats.apply.rounding(stats.default(x), digits=2),
c("",
"Mean (SD)"=sprintf("%s (± %s)", MEAN, SD),
"Min - Max"=sprintf("%s - %s", MIN, MAX)
)
)
}
# Defining custom function for categorical variable
my.render.cat <- function(x) { c("",
sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.0f %%)", FREQ, PCT))))
}
# Perform ANOVA for continuous var. and Chi-square test for categorical var.
# https://github.com/benjaminrich/table1/issues/52#issuecomment-841916472
pvalue <- function(x, ...) {
x <- x[-length(x)] # Remove "overall" group
# Construct vectors of data y, and groups (strata) g
y <- unlist(x)
g <- factor(rep(1:length(x), times=sapply(x, length)))
if (is.numeric(y)) {
# For numeric variables, perform an ANOVA
p <- summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]
} else {
# For categorical variables, perform a chi-squared test of independence
p <- chisq.test(table(y, g))$p.value
}
# Format the p-value, using an HTML entity for the less-than sign.
# The initial empty string places the output on the line below the variable label.
c("", sub("<", "<", format.pval(p, digits=3, eps=0.0001)))
}# first column for total and next columns for stratified data by Survived variable
strata <- c(split(data2, data2$Survived), list(Total=data2))
# Labeling to make the table readable
labels <- list(variables=list(Age = "Age",
Sex = "Sex",
Pclass = "Passenger Class",
Fare = "Fare",
SibSp = "Number of Siblings / Spouses",
Parch = "Number of Parents / Children",
Embarked = "Port of Embarkation"),
groups=list("Survived", "", "")
)
table1(strata, labels,
groupspan=c(2, 1, 1),
topclass="Rtable1-grid",
render.continuous=my.render.cont,
render.categorical=my.render.cat,
extra.col=list(`P value`=pvalue),
extra.col.pos=4 # adding p-value column at the end
)Survived |
||||
|---|---|---|---|---|
| No (N=549) |
Yes (N=342) |
Total (N=891) |
P value | |
| Age | ||||
| Mean (SD) | 31 (± 14) | 28 (± 15) | 30 (± 15) | 0.0391 |
| Min - Max | 1.0 - 74 | 0.42 - 80 | 0.42 - 80 | |
| Missing | 125 (22.8%) | 52 (15.2%) | 177 (19.9%) | |
| Sex | ||||
| Female | 81 (15 %) | 233 (68 %) | 314 (35 %) | <1e-04 |
| Male | 468 (85 %) | 109 (32 %) | 577 (65 %) | |
| Passenger Class | ||||
| 1st | 80 (15 %) | 136 (40 %) | 216 (24 %) | <1e-04 |
| 2nd | 97 (18 %) | 87 (25 %) | 184 (21 %) | |
| 3rd | 372 (68 %) | 119 (35 %) | 491 (55 %) | |
| Fare | ||||
| Mean (SD) | 22 (± 31) | 48 (± 67) | 32 (± 50) | <1e-04 |
| Min - Max | 0 - 260 | 0 - 510 | 0 - 510 | |
| Number of Siblings / Spouses | ||||
| Mean (SD) | 0.55 (± 1.3) | 0.47 (± 0.71) | 0.52 (± 1.1) | 0.292 |
| Min - Max | 0 - 8.0 | 0 - 4.0 | 0 - 8.0 | |
| Number of Parents / Children | ||||
| Mean (SD) | 0.33 (± 0.82) | 0.46 (± 0.77) | 0.38 (± 0.81) | 0.0148 |
| Min - Max | 0 - 6.0 | 0 - 5.0 | 0 - 6.0 | |
| Port of Embarkation | ||||
| Cherbourg | 75 (14 %) | 93 (27 %) | 168 (19 %) | <1e-04 |
| Queenstown | 47 (9 %) | 30 (9 %) | 77 (9 %) | |
| Southampton | 427 (78 %) | 217 (63 %) | 644 (72 %) | |
| Missing | 0 (0%) | 2 (0.6%) | 2 (0.2%) | |
It can be seen that all the variables except SibSp have
significant association at 5% level with Survived.
Note: ANOVA was performed on continuous variables and Chi-square test was performed on categorical variables.
Replacing missing age values for each group in Survived by their median age (since median is not influenced by extreme observations) and removing the two passengers whose Embarked information is missing:
# https://www.codingprof.com/3-ways-to-replace-missing-values-with-the-median-per-group-in-r/
data3 <- data2 %>%
filter(!is.na(Embarked)) %>%
group_by(Survived) %>%
mutate(Age = ifelse(is.na(Age), floor(median(Age, na.rm = TRUE)), Age))Detailed summary of the new dataset:
table1(c(split(data3, data3$Survived), list(Total=data3)),
labels,
groupspan=c(2, 1, 1),
topclass="Rtable1-grid",
render.continuous=my.render.cont,
render.categorical=my.render.cat,
extra.col=list(`P value`=pvalue),
extra.col.pos=4 # adding p-value column at the end
)Survived |
||||
|---|---|---|---|---|
| No (N=549) |
Yes (N=340) |
Total (N=889) |
P value | |
| Age | ||||
| Mean (SD) | 30 (± 13) | 28 (± 14) | 29 (± 13) | 0.0374 |
| Min - Max | 1.0 - 74 | 0.42 - 80 | 0.42 - 80 | |
| Sex | ||||
| Female | 81 (15 %) | 231 (68 %) | 312 (35 %) | <1e-04 |
| Male | 468 (85 %) | 109 (32 %) | 577 (65 %) | |
| Passenger Class | ||||
| 1st | 80 (15 %) | 134 (39 %) | 214 (24 %) | <1e-04 |
| 2nd | 97 (18 %) | 87 (26 %) | 184 (21 %) | |
| 3rd | 372 (68 %) | 119 (35 %) | 491 (55 %) | |
| Fare | ||||
| Mean (SD) | 22 (± 31) | 48 (± 67) | 32 (± 50) | <1e-04 |
| Min - Max | 0 - 260 | 0 - 510 | 0 - 510 | |
| Number of Siblings / Spouses | ||||
| Mean (SD) | 0.55 (± 1.3) | 0.48 (± 0.71) | 0.52 (± 1.1) | 0.311 |
| Min - Max | 0 - 8.0 | 0 - 4.0 | 0 - 8.0 | |
| Number of Parents / Children | ||||
| Mean (SD) | 0.33 (± 0.82) | 0.47 (± 0.77) | 0.38 (± 0.81) | 0.0131 |
| Min - Max | 0 - 6.0 | 0 - 5.0 | 0 - 6.0 | |
| Port of Embarkation | ||||
| Cherbourg | 75 (14 %) | 93 (27 %) | 168 (19 %) | <1e-04 |
| Queenstown | 47 (9 %) | 30 (9 %) | 77 (9 %) | |
| Southampton | 427 (78 %) | 217 (64 %) | 644 (72 %) | |
Training and Test Set
Creating train and test datasets on 80:20 ratio:
set.seed(0) # setting seed for reproducibility
train_ids <- sample(seq_len(nrow(data3)), size = floor(0.80*nrow(data3)))
train <- data3[train_ids, ]
test <- data3[-train_ids, ]Logistic Regression
logistic_model <- glm(Survived ~ .,
data = train,
family = binomial(link = 'logit'))
epiDisplay::logistic.display(logistic_model, simplified = TRUE)##
## OR lower95ci upper95ci Pr(>|Z|)
## Pclass2nd 0.45170433 0.23754842 0.8589272 1.536125e-02
## Pclass3rd 0.12932715 0.06837099 0.2446288 3.184129e-10
## SexMale 0.06876174 0.04428593 0.1067647 8.704131e-33
## Age 0.95788282 0.94113165 0.9749321 1.749924e-06
## SibSp 0.70934024 0.55797593 0.9017657 5.042322e-03
## Parch 0.93038895 0.71881270 1.2042408 5.836084e-01
## Fare 1.00103972 0.99625625 1.0058462 6.706813e-01
## EmbarkedQueenstown 0.72329095 0.31703384 1.6501387 4.414296e-01
## EmbarkedSouthampton 0.52142803 0.31146111 0.8729411 1.325589e-02
# Step wise regression with backward direction
summary(step(logistic_model, direction="backward"))## Start: AIC=650.33
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Fare 1 630.51 648.51
## - Parch 1 630.63 648.63
## <none> 630.33 650.33
## - Embarked 2 636.74 652.74
## - SibSp 1 639.80 657.80
## - Age 1 655.40 673.40
## - Pclass 2 677.43 693.43
## - Sex 1 808.21 826.21
##
## Step: AIC=648.51
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked
##
## Df Deviance AIC
## - Parch 1 630.74 646.74
## <none> 630.51 648.51
## - Embarked 2 637.47 651.47
## - SibSp 1 639.80 655.80
## - Age 1 656.08 672.08
## - Pclass 2 699.52 713.52
## - Sex 1 809.59 825.59
##
## Step: AIC=646.74
## Survived ~ Pclass + Sex + Age + SibSp + Embarked
##
## Df Deviance AIC
## <none> 630.74 646.74
## - Embarked 2 637.68 649.68
## - SibSp 1 642.33 656.33
## - Age 1 656.21 670.21
## - Pclass 2 699.92 711.92
## - Sex 1 816.35 830.35
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5572 -0.6246 -0.4001 0.6419 2.4871
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.338160 0.466434 9.301 < 2e-16 ***
## Pclass2nd -0.854608 0.297963 -2.868 0.00413 **
## Pclass3rd -2.119471 0.279486 -7.583 3.36e-14 ***
## SexMale -2.656733 0.218390 -12.165 < 2e-16 ***
## Age -0.043180 0.008969 -4.814 1.48e-06 ***
## SibSp -0.355654 0.116927 -3.042 0.00235 **
## EmbarkedQueenstown -0.312078 0.416582 -0.749 0.45377
## EmbarkedSouthampton -0.665305 0.259314 -2.566 0.01030 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 942.14 on 710 degrees of freedom
## Residual deviance: 630.74 on 703 degrees of freedom
## AIC: 646.74
##
## Number of Fisher Scoring iterations: 5