This is a survey of the Titanic data. The univariate analysis characterizes the predictor distributions. The bivariate analysis characterizes relationships among the predictors and with the response variable. The influential outliers analysis searches for problematic observations.

Conclusions:

Factor variables Title and Employee have near-zero-variance.
Nearly all of the quantitative predictors suffer from skew.
Sex moderates Pclass and Embarked.
Created grouping variable AgeCohort. It should interact with Age and Sex.
Created grouping variable TicketNCohort. It should interact with TicketN.
Created net survivors predictor NetSurv with values truncated at -1 and +1.

Setup

library(tidyverse)
library(caret)  # for nearZeroVar()
library(e1071)  # for skewness()
library(broom)  # for tidy()
library(GGally) # for ggpairs()
library(gridExtra) 
library(janitor)

Load Data

The initial data management created the data set full, with training rows indexed by train_index.

load("./titanic_01.RData")

glimpse(full)

## Rows: 1,309
## Columns: 18
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, ...
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Surname     <chr> "Braund", "Cumings", "Heikkinen", "Futrelle", "Allen", ...
## $ Title       <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss,...
## $ Name        <chr> "Owen Harris", "John Bradley (Florence Briggs Thayer)",...
## $ Sex         <fct> male, female, female, female, male, male, male, male, f...
## $ Age         <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...
## $ TicketN     <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2...
## $ FarePerPass <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8.4...
## $ Employee    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Deck        <fct> F, C, F, C, F, F, E, C, E, F, F, C, C, B, F, D, B, F, D...

Univariate Analysis

In this section I will look at data distributions. For factor variables, I am interested in which have near-zero-variance. For quantitative variables, I am looking for significant skew.

Conclusions from this section are:

Factor variables Title and Employee have near-zero-variance.
Nearly all of the quantitative predictors suffer from skew.

It will be handy to classify the predictors by data type.

preds <- full %>% 
  select(-Survived, -PassengerId, -Surname, -Name, -Ticket, -Cabin, -Fare) %>% 
  colnames()
preds_class <- full[, preds] %>% map(class) %>% unlist()
preds_factor <- subset(preds_class, preds_class == "factor") %>% names()
preds_numeric <- subset(preds_class, preds_class %in% c("numeric", "integer")) %>% names()
mdl_vars <- c("Survived", preds)
rm(preds_class)

assertthat::are_equal(length(c(preds_factor, preds_numeric)), length(preds))

## [1] TRUE

Factor Variables

Inspect each factor variable, looking for near-zero-variance (NZV). I might collapse them to prevent zero-variance in CV folds. caret::nearZeroVar() defines NZV as a frequency ratio of the most common value to the second most common value frequency >= 19/5 and a unique value percentage <= 10%. (DataCamp course Machine Learning Toolbox suggests more aggressive thresholds of frequency ratio >= 2 and unique value percentage <= 20%).

dummies <- dummyVars(~., data = full[, preds_factor], fullRank = FALSE)
dummy_dat <- as.data.frame(predict(dummies, full[, preds_factor]))
(nzv <- dummy_dat %>%
    nearZeroVar(saveMetrics= TRUE)
)

##                      freqRatio percentUnique zeroVar   nzv
## Pclass.1              3.052632     0.1527884   FALSE FALSE
## Pclass.2              3.725632     0.1527884   FALSE FALSE
## Pclass.3              1.181667     0.1527884   FALSE FALSE
## Title.Capt         1308.000000     0.1527884   FALSE  TRUE
## Title.Col           326.250000     0.1527884   FALSE  TRUE
## Title.Don          1308.000000     0.1527884   FALSE  TRUE
## Title.Dona         1308.000000     0.1527884   FALSE  TRUE
## Title.Dr            162.625000     0.1527884   FALSE  TRUE
## Title.Jonkheer     1308.000000     0.1527884   FALSE  TRUE
## Title.Lady         1308.000000     0.1527884   FALSE  TRUE
## Title.Major         653.500000     0.1527884   FALSE  TRUE
## Title.Master         20.459016     0.1527884   FALSE  TRUE
## Title.Miss            4.034615     0.1527884   FALSE FALSE
## Title.Mlle          653.500000     0.1527884   FALSE  TRUE
## Title.Mme          1308.000000     0.1527884   FALSE  TRUE
## Title.Mr              1.371377     0.1527884   FALSE FALSE
## Title.Mrs             5.644670     0.1527884   FALSE FALSE
## Title.Ms            653.500000     0.1527884   FALSE  TRUE
## Title.Rev           162.625000     0.1527884   FALSE  TRUE
## Title.Sir          1308.000000     0.1527884   FALSE  TRUE
## Title.the Countess 1308.000000     0.1527884   FALSE  TRUE
## Sex.male              1.809013     0.1527884   FALSE FALSE
## Sex.female            1.809013     0.1527884   FALSE FALSE
## Embarked.C            3.812500     0.1527884   FALSE FALSE
## Embarked.Q            9.642276     0.1527884   FALSE FALSE
## Embarked.S            2.313924     0.1527884   FALSE FALSE
## Employee.0           76.000000     0.1527884   FALSE  TRUE
## Employee.1           76.000000     0.1527884   FALSE  TRUE
## Deck.A               12.494845     0.1527884   FALSE FALSE
## Deck.B                6.745562     0.1527884   FALSE FALSE
## Deck.C                4.113281     0.1527884   FALSE FALSE
## Deck.D                5.480198     0.1527884   FALSE FALSE
## Deck.E                5.263158     0.1527884   FALSE FALSE
## Deck.F                2.481383     0.1527884   FALSE FALSE

The problematic predictors are Title and Employee.

Here is a visualization of the data. Ideally, you want the vars to land in the top left quadrant. NZV are in the lower right quadrant.

nzv %>%
  data.frame() %>% rownames_to_column(var = "col") %>%
  separate(col, sep = "\\.", into = c("col", "level"), ) %>%
  filter(!is.na(level)) %>%
  ggplot(aes(x = freqRatio, y = percentUnique, 
             color = fct_rev(factor(nzv, labels = c("(okay)", "NZV"))), 
             label = level)) +
  geom_text(check_overlap = TRUE, size = 2, na.rm = TRUE) +
  geom_point(size = 3, alpha = 0.6, na.rm = TRUE) +
  geom_hline(yintercept = 10, linetype = "dashed") +
  geom_vline(xintercept = 95/5, linetype = "dashed") +
  theme(legend.position = "top") +
  labs(title = "Near-Zero Variance of Factor Variables", color = "") +
  facet_wrap(~col)

rm(dummies, dummy_dat, nzv)

Quantitative Variables

Skew can contribute to violation of linearity in linear regressions. I’ll check which variables have significant skew. Skew between 0.5 and 1.0 is generally considered moderate, and skew greater than 1 severe. In the following charts, negligibly skewed predictors are colored teal (there are none), moderately skewed predictors are colored gold and the severely skewed predictors are colored red.

col_skew <- map(full[, preds_numeric], skewness) %>% unlist()
col_skew_is_mod <- names(col_skew[abs(col_skew) > .5 & abs(col_skew) <= 1.0])
col_skew_is_high <- names(col_skew[abs(col_skew) > 1.0])
p <- map(
  colnames(full[, preds_numeric]),
  ~ ggplot(full, aes_string(x = .x)) +
    geom_histogram(fill = case_when(.x %in% col_skew_is_mod ~ "goldenrod", 
                                    .x %in% col_skew_is_high ~ "orangered4",
                                    TRUE ~ "cadetblue")) +
    labs(y = "", x = "", title = .x) +
    theme(axis.text.y=element_blank(), plot.title = element_text(size = 10))
)
exec(grid.arrange, ncol = 2, !!!p)

rm(col_skew, col_skew_is_high, col_skew_is_mod)

Nearly all of the quantitative predictors suffer from skew. It may make sense to transform these variables in modeling.

Bivariate Analysis

In this section I will look at inter-variable relationships. For factor variables, I am interested in which levels have significantly different log survival odds. For quantitative variables, I am looking for linear relationships with log survival odds and low correlations with each other. I am also interested in any variable interactions that might improve the linearity assumptions of linear models.

Conclusions:

Sex moderates Pclass and Embarked.
Created grouping variable AgeCohort. It should interact with Age and Sex.
Created grouping variable TicketNCohort. It should interact with TicketN.
Created net survivors predictor NetSurv with values truncated at -1 and +1.

Before I begin, let’s get some initial perspective which variables are correlated with survival.

dummies_fullrank <- dummyVars(~., data = full[, mdl_vars], fullRank = TRUE)
full_dum <- as.data.frame(predict(dummies_fullrank, full[, mdl_vars]))

cor_out <- full_dum[train_index, ] %>% cor()
cor_out[, "Survived.Yes"] %>% abs() %>% sort(decreasing = TRUE) %>% head(11)

## Survived.Yes     Title.Mr   Sex.female    Title.Mrs   Title.Miss     Pclass.3 
##   1.00000000   0.54919918   0.54335138   0.33904025   0.32709255   0.32230836 
##  FarePerPass   Embarked.S     Pclass.2       Deck.B   Employee.1 
##   0.28189162   0.15566027   0.09334857   0.08850991   0.08534273

Title and Sex are by the far the best predictors of survival (Title basically combines sex and age). Pclass and FarePerPass are next. Embarkation port is also important. So far it seems your best bet for survival is to be a rich woman.

For this analysis I’ll define a plotting function with a logistic regression fit, and an odds ratio table function.

model_bin <- function(dat, fmla) {
  mdl <- train(
    as.formula(fmla),
    dat,
    method = "glm", family = "binomial",
    metric = "Accuracy",
    trControl = trainControl(method = "cv", n = 10)      
  )
}

plot_bin <- function(dat, x_var, facet_var, fmla) {
  dat %>% 
    group_by(!!sym(x_var), !!sym(facet_var), Survived, pred) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = "Survived", values_from = n, names_prefix = "Survived.") %>%
    replace_na(list(Survived.No = 0, Survived.Yes = 0)) %>%
    mutate(
      N = Survived.No + Survived.Yes,
      prop = Survived.Yes / N,
      odds = prop / (1 - prop),
      log_odds = log(odds),
      log_odds = case_when(log_odds == -Inf ~ -3, 
                           log_odds ==  Inf ~  3, 
                           TRUE ~ log_odds),
      pred_log_odds = log(pred / (1 - pred))
    ) %>%
    ggplot(aes(x = !!sym(x_var))) +
      geom_point(aes(y = log_odds, size = N, color = "Actual"), alpha = 0.6) +
      geom_point(aes(y = pred_log_odds, color = "Predicted")) +
      scale_color_manual(values = c("cadetblue", "goldenrod")) +
      theme(axis.text.x = element_text(angle = 90)) +
      labs(title = fmla, x = x_var, color = "") +
      facet_wrap(facets = facet_var)
}

write_tab <- function(dat, group_var){
  fmla <- paste0("Survived ~ ", group_var)

  mdl_obj <- glm(as.formula(fmla), data = dat, family = "binomial") %>%
    tidy() %>%
    mutate(term = str_remove(term, group_var), OR = exp(estimate)) %>% 
    filter(term != "(Intercept)") %>%
    select(term, OR)

  dat %>%
  group_by(!!sym(group_var)) %>%
  summarise(
    Lived = sum(if_else(Survived == "Yes", 1, 0)),
    Died = sum(if_else(Survived == "No", 1, 0)),
    N = Lived + Died,
    Surv = Lived / N,
    Odds = Surv / (1 - Surv) 
  ) %>%
  ungroup() %>%
  rename(term = !!sym(group_var)) %>%
  mutate(term = as.character(term), Surv = Surv * 100) %>%
  left_join(mdl_obj, by = "term") %>%
  select(term, N, Surv, Odds, OR) %>%
  flextable::flextable() %>% 
  flextable::colformat_int(j = 2) %>% 
  flextable::colformat_num(j = 3, digits = 0, suffix = "%") %>%
  flextable::colformat_num(j = c(4:5)) %>%
  flextable::set_caption(fmla)
}

Sex

In the training data set, 19% of males survived and 74% of the females survived. We should get 74-81% accuracy just by predicting all males perish and all females survive.

write_tab(full[train_index, ], group_var = "Sex")

Survived ~ Sex
term	N	Surv	Odds	OR
male	577	19%	0.23
female	314	74%	2.88	12.35

Age

Here is the survival log-odds for Age. Children of both sexes had relatively good survival odds. For males, the children reverse a generally positive relationship with age. The female relationship seems okay.

fmla <- "Survived ~ Sex*Age"
dat <- full[train_index, ] %>% mutate(age_bin = cut(Age, 16))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "age_bin", facet_var = "Sex", fmla = fmla)

So maybe there should be an AgeCohort = c(lte10, gt10) factor variable that moderates Age.

fmla <- "Survived ~ Sex*Age*AgeCohort"
dat <- full[train_index, ] %>% 
  mutate(
    AgeCohort = factor(if_else(Age <= 10, "lte10", "gt10")),
    AgeBin = cut(Age, 16)
  )
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "AgeBin", facet_var = "Sex", fmla = fmla)

tidy(mdl$finalModel) %>% flextable::regulartable() %>% 
  flextable::colformat_num(j = 2:5, digits = 3) %>%
  flextable::autofit()

term	estimate	std.error	statistic	p.value
(Intercept)	-1.729	0.307	-5.634	0.000
Sexfemale	2.258	0.504	4.484	0.000
Age	0.004	0.009	0.415	0.678
AgeCohortlte10	2.419	0.636	3.803	0.000
`Sexfemale:Age`	0.019	0.016	1.194	0.233
`Sexfemale:AgeCohortlte10`	-1.796	1.016	-1.767	0.077
`Age:AgeCohortlte10`	-0.155	0.116	-1.334	0.182
`Sexfemale:Age:AgeCohortlte10`	-0.082	0.173	-0.471	0.638

Better. I’ll create the AgeCohort predictor and plan to interact it with Age and Sex since the slopes differ.

full <- full %>%
  mutate(AgeCohort = factor(if_else(Age <= 10, "lte10", "gt10"), 
                            levels = c("lte10", "gt10")))

Title

Title is closely related to Sex and Age. Is it still useful?

Here is the sex/age breakdown. 53-54% of children survive (same for males and females). Over age 10, 17% of males survive and 77% of females survive.

write_tab(full[train_index, ] %>% filter(Sex == "male"), group_var = "AgeCohort")

Survived ~ AgeCohort
term	N	Surv	Odds	OR
lte10	36	53%	1.12
gt10	541	17%	0.20	0.18

write_tab(full[train_index, ] %>% filter(Sex == "female"), group_var = "AgeCohort")

Survived ~ AgeCohort
term	N	Surv	Odds	OR
lte10	35	54%	1.19
gt10	279	77%	3.29	2.77

full[train_index, ] %>% 
  tabyl(AgeCohort, Survived, Sex) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
#  adorn_title("combined") %>%
#  untabyl() %>% 
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

male.AgeCohort	male.No	male.Yes	male.Total	female.AgeCohort	female.No	female.Yes	female.Total
lte10	47.2% (17)	52.8% (19)	100.0% (36)	lte10	45.7% (16)	54.3% (19)	100.0% (35)
gt10	83.4% (451)	16.6% (90)	100.0% (541)	gt10	23.3% (65)	76.7% (214)	100.0% (279)
Total	81.1% (468)	18.9% (109)	100.0% (577)	Total	25.8% (81)	74.2% (233)	100.0% (314)

Combining AgeCohort and Title, 34/40 “Master” are under age 10, and 515/517 “Mr” are over age 10. “Miss” has less separation: 147/182 “Miss” are over age 10. I suspect “Miss” means any female who has not yet married.

full[train_index, ] %>% 
  tabyl(Title, Survived, AgeCohort) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
  adorn_title("combined") %>%
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

lte10.Title.Survived	lte10.No	lte10.Yes	lte10.Total	gt10.Title.Survived	gt10.No	gt10.Yes	gt10.Total
Capt	- (0)	- (0)	100.0% (0)	Capt	100.0% (1)	0.0% (0)	100.0% (1)
Col	- (0)	- (0)	100.0% (0)	Col	50.0% (1)	50.0% (1)	100.0% (2)
Don	- (0)	- (0)	100.0% (0)	Don	100.0% (1)	0.0% (0)	100.0% (1)
Dona	- (0)	- (0)	100.0% (0)	Dona	- (0)	- (0)	100.0% (0)
Dr	- (0)	- (0)	100.0% (0)	Dr	57.1% (4)	42.9% (3)	100.0% (7)
Jonkheer	- (0)	- (0)	100.0% (0)	Jonkheer	100.0% (1)	0.0% (0)	100.0% (1)
Lady	- (0)	- (0)	100.0% (0)	Lady	0.0% (0)	100.0% (1)	100.0% (1)
Major	- (0)	- (0)	100.0% (0)	Major	50.0% (1)	50.0% (1)	100.0% (2)
Master	44.1% (15)	55.9% (19)	100.0% (34)	Master	33.3% (2)	66.7% (4)	100.0% (6)
Miss	45.7% (16)	54.3% (19)	100.0% (35)	Miss	26.5% (39)	73.5% (108)	100.0% (147)
Mlle	- (0)	- (0)	100.0% (0)	Mlle	0.0% (0)	100.0% (2)	100.0% (2)
Mme	- (0)	- (0)	100.0% (0)	Mme	0.0% (0)	100.0% (1)	100.0% (1)
Mr	100.0% (2)	0.0% (0)	100.0% (2)	Mr	84.3% (434)	15.7% (81)	100.0% (515)
Mrs	- (0)	- (0)	100.0% (0)	Mrs	20.8% (26)	79.2% (99)	100.0% (125)
Ms	- (0)	- (0)	100.0% (0)	Ms	0.0% (0)	100.0% (1)	100.0% (1)
Rev	- (0)	- (0)	100.0% (0)	Rev	100.0% (6)	0.0% (0)	100.0% (6)
Sir	- (0)	- (0)	100.0% (0)	Sir	0.0% (0)	100.0% (1)	100.0% (1)
the Countess	- (0)	- (0)	100.0% (0)	the Countess	0.0% (0)	100.0% (1)	100.0% (1)
Total	46.5% (33)	53.5% (38)	100.0% (71)	Total	62.9% (516)	37.1% (304)	100.0% (820)

#  knitr::kable()

So, do passengers with a title other than Master/Miss/Mrs/Mr have a better chance of surviving? A little: 44% of titled persons survived compared to 38% of untitled. But only 27/891 = 3% of passengers have a title.

full[train_index, ] %>% 
  mutate(HasTitle = if_else(Title %in% c("Mr", "Mrs", "Master", "Miss"), "No", "Yes")) %>%
  tabyl(HasTitle, Survived) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
  adorn_title("combined") %>%
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

HasTitle.Survived	No	Yes	Total
No	61.8% (534)	38.2% (330)	100.0% (864)
Yes	55.6% (15)	44.4% (12)	100.0% (27)
Total	61.6% (549)	38.4% (342)	100.0% (891)

I think the conclusion here is that Title is redundant to AgeCohort and Sex and I should not include it in the models.

Pclass

Third class passengers (Pclass = 3) had a rough go of it. Male survival odds are cut in half when dropping to 2nd class, then stay about the same for 3rd class. Female survival odds stayed the same for 2nd class, but halved when dropping to 3rd class. N-size is small for the females though. Looks like all three classes are important, but should be moderated by Sex.

fmla <- "Survived ~ Pclass*Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Pclass", facet_var = "Sex", fmla = fmla)

TicketN

Traveling with with others helps - but only to a point! Both male and female survival odds increase with ticket size to n = 4, then they fall.

fmla <- "Survived ~ TicketN + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "TicketN", facet_var = "Sex", fmla = fmla)

How about a dummy variable TicketNCohort = c("lte4", "gt4").

fmla <- "Survived ~ TicketN*TicketNCohort + Sex"
dat <- full[train_index, ] %>%
  mutate(TicketNCohort = factor(if_else(TicketN <= 4, "lte4", "gt4")))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "TicketN", facet_var = "Sex", fmla = fmla)

Better. I’ll create the TicketNCohort predictor and plan to interact it with TicketN.

full <- full %>%
  mutate(TicketNCohort = factor(if_else(TicketN <= 4, "lte4", "gt4"), 
                            levels = c("lte4", "gt4")))

SibSp

Survival odds fall with increasing numbers of spouse+siblings. Maybe people made sure at least on child in a family got aboard a raft. Males traveling with no spouse or sibling is a conspicuously frequent and dubious group.

fmla <- "Survived ~ SibSp + Sex"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "SibSp", facet_var = "Sex", fmla = fmla)

Including the newly-created TicketNCohort variable seems to help.

fmla <- "Survived ~ SibSp + Sex + TicketN*TicketNCohort"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "SibSp", facet_var = "Sex", fmla = fmla)

No actions need to be taken with this variable - just include it in the model as-is.

Parch

Survival odds fall with increasing numbers of parents and children. Again, males traveling alone are in trouble.

fmla <- "Survived ~ Parch + Sex"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Parch", facet_var = "Sex", fmla = fmla)

Including the newly-created TicketNCohort variable seems to help. Hard to tell, actually.

fmla <- "Survived ~ Parch + Sex + TicketN*TicketNCohort"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Parch", facet_var = "Sex", fmla = fmla)

No actions need to be taken with this variable - just include it in the model as-is.

Embarked

Embarking from Queenstown was a death sentence for males for some reason. Embarked should interact with Sex.

fmla <- "Survived ~ Embarked*Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Embarked", facet_var = "Sex", fmla = fmla)

FarePerPass

Here is the survival log-odds for FarePerPass. I don’t really see much of a relationship. I think the thing to do is to leave the variable alone, include it in the model and see what happens.

fmla <- "Survived ~ Sex + FarePerPass"
dat <- full[train_index, ] %>% mutate(FarePerPassBin = cut(FarePerPass, 10))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "FarePerPassBin", facet_var = "Sex", fmla = fmla)

Employee

93% of employees perished, but n = 14. Maybe this variable is useful in a tree model, I don’t know. I’ll leave it in, but I don’t expect much.

tabyl(full[train_index, ], Employee, Survived) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
  adorn_title("combined") %>%
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

Employee.Survived	No	Yes	Total
0	61.1% (535)	38.9% (341)	100.0% (876)
1	93.3% (14)	6.7% (1)	100.0% (15)
Total	61.6% (549)	38.4% (342)	100.0% (891)

Deck

Generally, it’s better to have a cabin on a deck at the top of the alphabet. No interaction with Sex. Not a lot of separation among the decks. Deck A has a low n-size. I’ll leave this variable alone.

fmla <- "Survived ~ Deck + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Deck", facet_var = "Sex", fmla = fmla)

NetSurv

Is the survival status of the other passengers on the ticket predictive of your survival odds? Create feature NetSurv to check.

full <- full %>%
  group_by(Ticket) %>%
  mutate(
    SurvN = sum(Survived == "Yes", na.rm = TRUE),
    PrshN = sum(Survived == "No", na.rm = TRUE)
  ) %>% 
  ungroup() %>%
  mutate(
    SurvN = SurvN - if_else(!is.na(Survived) & Survived == "Yes", 1, 0),
    PrshN = PrshN - if_else(!is.na(Survived) & Survived == "No", 1, 0),
    NetSurv = SurvN - PrshN,
  ) %>%
  select(-SurvN, -PrshN)

NetSurv certainly does appear to be predictive. The tails are thinly populated and the linearity breaks down.

fmla <- "Survived ~ NetSurv + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "NetSurv", facet_var = "Sex", fmla = fmla)

I can probably collapse the tails.

full[train_index, ] %>% group_by(NetSurv) %>% summarize(n = n(), mean = mean(Survived == "Yes"))

## # A tibble: 11 x 3
##    NetSurv     n   mean
##      <dbl> <int>  <dbl>
##  1      -6    14 0     
##  2      -5    18 0     
##  3      -4    10 0     
##  4      -3    13 0.0769
##  5      -2     7 0.143 
##  6      -1    87 0.391 
##  7       0   571 0.324 
##  8       1   121 0.719 
##  9       2    37 0.703 
## 10       3    11 0.727 
## 11       4     2 0

full$NetSurv <- case_when(full$NetSurv <= -1 ~ -1,
                          full$NetSurv >=  1 ~  1,
                          TRUE ~ 0)

This could be an ordinal factor variable, but I think it works fine as a numeric.

fmla <- "Survived ~ NetSurv + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "NetSurv", facet_var = "Sex", fmla = fmla)

Collinearity

There predictor set has grown from 11 variables to 13. I’m adding factor variables AgeCohort and TicketNCohort, and numeric variable NetSurv, and I am dropping factor variable Title.

preds <- c(preds, "AgeCohort", "TicketNCohort", "NetSurv") 
preds <- subset(preds, preds != "Title")
preds_numeric <- c(preds_numeric, "NetSurv")

Let’s measure the collinearity of the numeric predictors. TicketN has a high correlation with SibSp and Parch. Parch has a moderate correlatoin with SibSp. When predictors are correlated, one or both may show low statistical significance.

ggpairs(full[train_index, preds_numeric])

plot_bv_box <- function(dat, num_var, color){
  dat %>% 
    ggplot(aes(y = !!sym(num_var)))
}

The following tables break down the factor levels and associated odds ratios with the base level.

Save Work

Another look at the data.

mdl_vars <- c("Survived", preds)
skimr::skim(full[, mdl_vars])

Data summary
Name	full[, mdl_vars]
Number of rows	1309
Number of columns	14
_______________________
Column type frequency:
factor	8
numeric	6
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Survived	418	0.68	FALSE	2	No: 549, Yes: 342
Pclass	0	1.00	FALSE	3	3: 709, 1: 323, 2: 277
Sex	0	1.00	FALSE	2	mal: 843, fem: 466
Embarked	0	1.00	FALSE	3	S: 914, C: 272, Q: 123
Employee	0	1.00	FALSE	2	0: 1292, 1: 17
Deck	0	1.00	FALSE	6	F: 376, C: 256, E: 209, D: 202
AgeCohort	0	1.00	FALSE	2	gt1: 1211, lte: 98
TicketNCohort	0	1.00	FALSE	2	lte: 1188, gt4: 121

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Age	1	29.46	14.09	0.17	21.00	27.00	38.00	80.00	▂▇▃▂▁
SibSp	1	0.50	1.04	0.00	0.00	0.00	1.00	8.00	▇▁▁▁▁
Parch	1	0.39	0.87	0.00	0.00	0.00	0.00	9.00	▇▁▁▁▁
TicketN	1	2.10	1.78	1.00	1.00	1.00	3.00	11.00	▇▁▁▁▁
FarePerPass	1	14.88	13.47	3.17	7.67	8.05	15.03	128.08	▇▁▁▁▁
NetSurv	1	0.03	0.59	-1.00	0.00	0.00	0.00	1.00	▂▁▇▁▂

For the modeling phase, I’ll split the full data set into training and testing, then 80:20 split training into training_80 for training and training_20 as a holdout set to compare models.

training <- full[train_index, ]
testing <- full[-train_index, ]
 
set.seed(1920)
partition <- createDataPartition(training$Survived, p = 0.80, list = FALSE)[, 1]
training_80 <- training[partition, ]
training_20 <- training[-partition, ]

Save the objects for the modeling.

save(full, train_index, training, testing, training_80, training_20, 
     preds, mdl_vars, file = "./titanic_02.RData")

Kaggle - Titanic

Step 2: Exploratory Analysis

Michael Foley

5/7/2020