This is a survey of the Titanic data. The univariate analysis characterizes the predictor distributions. The bivariate analysis characterizes relationships among the predictors and with the response variable. The influential outliers analysis searches for problematic observations.

Conclusions:

Setup

library(tidyverse)
library(caret)  # for nearZeroVar()
library(e1071)  # for skewness()
library(broom)  # for tidy()
library(GGally) # for ggpairs()
library(gridExtra) 
library(janitor)

Load Data

The initial data management created the data set full, with training rows indexed by train_index.

load("./titanic_01.RData")

glimpse(full)
## Rows: 1,309
## Columns: 18
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, ...
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Surname     <chr> "Braund", "Cumings", "Heikkinen", "Futrelle", "Allen", ...
## $ Title       <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss,...
## $ Name        <chr> "Owen Harris", "John Bradley (Florence Briggs Thayer)",...
## $ Sex         <fct> male, female, female, female, male, male, male, male, f...
## $ Age         <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...
## $ TicketN     <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2...
## $ FarePerPass <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8.4...
## $ Employee    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Deck        <fct> F, C, F, C, F, F, E, C, E, F, F, C, C, B, F, D, B, F, D...

Univariate Analysis

In this section I will look at data distributions. For factor variables, I am interested in which have near-zero-variance. For quantitative variables, I am looking for significant skew.

Conclusions from this section are:

It will be handy to classify the predictors by data type.

preds <- full %>% 
  select(-Survived, -PassengerId, -Surname, -Name, -Ticket, -Cabin, -Fare) %>% 
  colnames()
preds_class <- full[, preds] %>% map(class) %>% unlist()
preds_factor <- subset(preds_class, preds_class == "factor") %>% names()
preds_numeric <- subset(preds_class, preds_class %in% c("numeric", "integer")) %>% names()
mdl_vars <- c("Survived", preds)
rm(preds_class)

assertthat::are_equal(length(c(preds_factor, preds_numeric)), length(preds))
## [1] TRUE

Factor Variables

Inspect each factor variable, looking for near-zero-variance (NZV). I might collapse them to prevent zero-variance in CV folds. caret::nearZeroVar() defines NZV as a frequency ratio of the most common value to the second most common value frequency >= 19/5 and a unique value percentage <= 10%. (DataCamp course Machine Learning Toolbox suggests more aggressive thresholds of frequency ratio >= 2 and unique value percentage <= 20%).

dummies <- dummyVars(~., data = full[, preds_factor], fullRank = FALSE)
dummy_dat <- as.data.frame(predict(dummies, full[, preds_factor]))
(nzv <- dummy_dat %>%
    nearZeroVar(saveMetrics= TRUE)
)
##                      freqRatio percentUnique zeroVar   nzv
## Pclass.1              3.052632     0.1527884   FALSE FALSE
## Pclass.2              3.725632     0.1527884   FALSE FALSE
## Pclass.3              1.181667     0.1527884   FALSE FALSE
## Title.Capt         1308.000000     0.1527884   FALSE  TRUE
## Title.Col           326.250000     0.1527884   FALSE  TRUE
## Title.Don          1308.000000     0.1527884   FALSE  TRUE
## Title.Dona         1308.000000     0.1527884   FALSE  TRUE
## Title.Dr            162.625000     0.1527884   FALSE  TRUE
## Title.Jonkheer     1308.000000     0.1527884   FALSE  TRUE
## Title.Lady         1308.000000     0.1527884   FALSE  TRUE
## Title.Major         653.500000     0.1527884   FALSE  TRUE
## Title.Master         20.459016     0.1527884   FALSE  TRUE
## Title.Miss            4.034615     0.1527884   FALSE FALSE
## Title.Mlle          653.500000     0.1527884   FALSE  TRUE
## Title.Mme          1308.000000     0.1527884   FALSE  TRUE
## Title.Mr              1.371377     0.1527884   FALSE FALSE
## Title.Mrs             5.644670     0.1527884   FALSE FALSE
## Title.Ms            653.500000     0.1527884   FALSE  TRUE
## Title.Rev           162.625000     0.1527884   FALSE  TRUE
## Title.Sir          1308.000000     0.1527884   FALSE  TRUE
## Title.the Countess 1308.000000     0.1527884   FALSE  TRUE
## Sex.male              1.809013     0.1527884   FALSE FALSE
## Sex.female            1.809013     0.1527884   FALSE FALSE
## Embarked.C            3.812500     0.1527884   FALSE FALSE
## Embarked.Q            9.642276     0.1527884   FALSE FALSE
## Embarked.S            2.313924     0.1527884   FALSE FALSE
## Employee.0           76.000000     0.1527884   FALSE  TRUE
## Employee.1           76.000000     0.1527884   FALSE  TRUE
## Deck.A               12.494845     0.1527884   FALSE FALSE
## Deck.B                6.745562     0.1527884   FALSE FALSE
## Deck.C                4.113281     0.1527884   FALSE FALSE
## Deck.D                5.480198     0.1527884   FALSE FALSE
## Deck.E                5.263158     0.1527884   FALSE FALSE
## Deck.F                2.481383     0.1527884   FALSE FALSE

The problematic predictors are Title and Employee.

Here is a visualization of the data. Ideally, you want the vars to land in the top left quadrant. NZV are in the lower right quadrant.

nzv %>%
  data.frame() %>% rownames_to_column(var = "col") %>%
  separate(col, sep = "\\.", into = c("col", "level"), ) %>%
  filter(!is.na(level)) %>%
  ggplot(aes(x = freqRatio, y = percentUnique, 
             color = fct_rev(factor(nzv, labels = c("(okay)", "NZV"))), 
             label = level)) +
  geom_text(check_overlap = TRUE, size = 2, na.rm = TRUE) +
  geom_point(size = 3, alpha = 0.6, na.rm = TRUE) +
  geom_hline(yintercept = 10, linetype = "dashed") +
  geom_vline(xintercept = 95/5, linetype = "dashed") +
  theme(legend.position = "top") +
  labs(title = "Near-Zero Variance of Factor Variables", color = "") +
  facet_wrap(~col)

rm(dummies, dummy_dat, nzv)

Quantitative Variables

Skew can contribute to violation of linearity in linear regressions. I’ll check which variables have significant skew. Skew between 0.5 and 1.0 is generally considered moderate, and skew greater than 1 severe. In the following charts, negligibly skewed predictors are colored teal (there are none), moderately skewed predictors are colored gold and the severely skewed predictors are colored red.

col_skew <- map(full[, preds_numeric], skewness) %>% unlist()
col_skew_is_mod <- names(col_skew[abs(col_skew) > .5 & abs(col_skew) <= 1.0])
col_skew_is_high <- names(col_skew[abs(col_skew) > 1.0])
p <- map(
  colnames(full[, preds_numeric]),
  ~ ggplot(full, aes_string(x = .x)) +
    geom_histogram(fill = case_when(.x %in% col_skew_is_mod ~ "goldenrod", 
                                    .x %in% col_skew_is_high ~ "orangered4",
                                    TRUE ~ "cadetblue")) +
    labs(y = "", x = "", title = .x) +
    theme(axis.text.y=element_blank(), plot.title = element_text(size = 10))
)
exec(grid.arrange, ncol = 2, !!!p)

rm(col_skew, col_skew_is_high, col_skew_is_mod)

Nearly all of the quantitative predictors suffer from skew. It may make sense to transform these variables in modeling.

Bivariate Analysis

In this section I will look at inter-variable relationships. For factor variables, I am interested in which levels have significantly different log survival odds. For quantitative variables, I am looking for linear relationships with log survival odds and low correlations with each other. I am also interested in any variable interactions that might improve the linearity assumptions of linear models.

Conclusions:

Before I begin, let’s get some initial perspective which variables are correlated with survival.

dummies_fullrank <- dummyVars(~., data = full[, mdl_vars], fullRank = TRUE)
full_dum <- as.data.frame(predict(dummies_fullrank, full[, mdl_vars]))

cor_out <- full_dum[train_index, ] %>% cor()
cor_out[, "Survived.Yes"] %>% abs() %>% sort(decreasing = TRUE) %>% head(11)
## Survived.Yes     Title.Mr   Sex.female    Title.Mrs   Title.Miss     Pclass.3 
##   1.00000000   0.54919918   0.54335138   0.33904025   0.32709255   0.32230836 
##  FarePerPass   Embarked.S     Pclass.2       Deck.B   Employee.1 
##   0.28189162   0.15566027   0.09334857   0.08850991   0.08534273

Title and Sex are by the far the best predictors of survival (Title basically combines sex and age). Pclass and FarePerPass are next. Embarkation port is also important. So far it seems your best bet for survival is to be a rich woman.

For this analysis I’ll define a plotting function with a logistic regression fit, and an odds ratio table function.

model_bin <- function(dat, fmla) {
  mdl <- train(
    as.formula(fmla),
    dat,
    method = "glm", family = "binomial",
    metric = "Accuracy",
    trControl = trainControl(method = "cv", n = 10)      
  )
}

plot_bin <- function(dat, x_var, facet_var, fmla) {
  dat %>% 
    group_by(!!sym(x_var), !!sym(facet_var), Survived, pred) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = "Survived", values_from = n, names_prefix = "Survived.") %>%
    replace_na(list(Survived.No = 0, Survived.Yes = 0)) %>%
    mutate(
      N = Survived.No + Survived.Yes,
      prop = Survived.Yes / N,
      odds = prop / (1 - prop),
      log_odds = log(odds),
      log_odds = case_when(log_odds == -Inf ~ -3, 
                           log_odds ==  Inf ~  3, 
                           TRUE ~ log_odds),
      pred_log_odds = log(pred / (1 - pred))
    ) %>%
    ggplot(aes(x = !!sym(x_var))) +
      geom_point(aes(y = log_odds, size = N, color = "Actual"), alpha = 0.6) +
      geom_point(aes(y = pred_log_odds, color = "Predicted")) +
      scale_color_manual(values = c("cadetblue", "goldenrod")) +
      theme(axis.text.x = element_text(angle = 90)) +
      labs(title = fmla, x = x_var, color = "") +
      facet_wrap(facets = facet_var)
}

write_tab <- function(dat, group_var){
  fmla <- paste0("Survived ~ ", group_var)

  mdl_obj <- glm(as.formula(fmla), data = dat, family = "binomial") %>%
    tidy() %>%
    mutate(term = str_remove(term, group_var), OR = exp(estimate)) %>% 
    filter(term != "(Intercept)") %>%
    select(term, OR)

  dat %>%
  group_by(!!sym(group_var)) %>%
  summarise(
    Lived = sum(if_else(Survived == "Yes", 1, 0)),
    Died = sum(if_else(Survived == "No", 1, 0)),
    N = Lived + Died,
    Surv = Lived / N,
    Odds = Surv / (1 - Surv) 
  ) %>%
  ungroup() %>%
  rename(term = !!sym(group_var)) %>%
  mutate(term = as.character(term), Surv = Surv * 100) %>%
  left_join(mdl_obj, by = "term") %>%
  select(term, N, Surv, Odds, OR) %>%
  flextable::flextable() %>% 
  flextable::colformat_int(j = 2) %>% 
  flextable::colformat_num(j = 3, digits = 0, suffix = "%") %>%
  flextable::colformat_num(j = c(4:5)) %>%
  flextable::set_caption(fmla)
}

Sex

In the training data set, 19% of males survived and 74% of the females survived. We should get 74-81% accuracy just by predicting all males perish and all females survive.

write_tab(full[train_index, ], group_var = "Sex")
Survived ~ Sex

term

N

Surv

Odds

OR

male

577

19%

0.23

female

314

74%

2.88

12.35

Age

Here is the survival log-odds for Age. Children of both sexes had relatively good survival odds. For males, the children reverse a generally positive relationship with age. The female relationship seems okay.

fmla <- "Survived ~ Sex*Age"
dat <- full[train_index, ] %>% mutate(age_bin = cut(Age, 16))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "age_bin", facet_var = "Sex", fmla = fmla)

So maybe there should be an AgeCohort = c(lte10, gt10) factor variable that moderates Age.

fmla <- "Survived ~ Sex*Age*AgeCohort"
dat <- full[train_index, ] %>% 
  mutate(
    AgeCohort = factor(if_else(Age <= 10, "lte10", "gt10")),
    AgeBin = cut(Age, 16)
  )
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "AgeBin", facet_var = "Sex", fmla = fmla)

tidy(mdl$finalModel) %>% flextable::regulartable() %>% 
  flextable::colformat_num(j = 2:5, digits = 3) %>%
  flextable::autofit()

term

estimate

std.error

statistic

p.value

(Intercept)

-1.729

0.307

-5.634

0.000

Sexfemale

2.258

0.504

4.484

0.000

Age

0.004

0.009

0.415

0.678

AgeCohortlte10

2.419

0.636

3.803

0.000

`Sexfemale:Age`

0.019

0.016

1.194

0.233

`Sexfemale:AgeCohortlte10`

-1.796

1.016

-1.767

0.077

`Age:AgeCohortlte10`

-0.155

0.116

-1.334

0.182

`Sexfemale:Age:AgeCohortlte10`

-0.082

0.173

-0.471

0.638

Better. I’ll create the AgeCohort predictor and plan to interact it with Age and Sex since the slopes differ.

full <- full %>%
  mutate(AgeCohort = factor(if_else(Age <= 10, "lte10", "gt10"), 
                            levels = c("lte10", "gt10")))

Title

Title is closely related to Sex and Age. Is it still useful?

Here is the sex/age breakdown. 53-54% of children survive (same for males and females). Over age 10, 17% of males survive and 77% of females survive.

write_tab(full[train_index, ] %>% filter(Sex == "male"), group_var = "AgeCohort")
Survived ~ AgeCohort

term

N

Surv

Odds

OR

lte10

36

53%

1.12

gt10

541

17%

0.20

0.18

write_tab(full[train_index, ] %>% filter(Sex == "female"), group_var = "AgeCohort")
Survived ~ AgeCohort

term

N

Surv

Odds

OR

lte10

35

54%

1.19

gt10

279

77%

3.29

2.77

full[train_index, ] %>% 
  tabyl(AgeCohort, Survived, Sex) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
#  adorn_title("combined") %>%
#  untabyl() %>% 
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

male.AgeCohort

male.No

male.Yes

male.Total

female.AgeCohort

female.No

female.Yes

female.Total

lte10

47.2% (17)

52.8% (19)

100.0% (36)

lte10

45.7% (16)

54.3% (19)

100.0% (35)

gt10

83.4% (451)

16.6% (90)

100.0% (541)

gt10

23.3% (65)

76.7% (214)

100.0% (279)

Total

81.1% (468)

18.9% (109)

100.0% (577)

Total

25.8% (81)

74.2% (233)

100.0% (314)

Combining AgeCohort and Title, 34/40 “Master” are under age 10, and 515/517 “Mr” are over age 10. “Miss” has less separation: 147/182 “Miss” are over age 10. I suspect “Miss” means any female who has not yet married.

full[train_index, ] %>% 
  tabyl(Title, Survived, AgeCohort) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
  adorn_title("combined") %>%
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

lte10.Title.Survived

lte10.No

lte10.Yes

lte10.Total

gt10.Title.Survived

gt10.No

gt10.Yes

gt10.Total

Capt

- (0)

- (0)

100.0% (0)

Capt

100.0% (1)

0.0% (0)

100.0% (1)

Col

- (0)

- (0)

100.0% (0)

Col

50.0% (1)

50.0% (1)

100.0% (2)

Don

- (0)

- (0)

100.0% (0)

Don

100.0% (1)

0.0% (0)

100.0% (1)

Dona

- (0)

- (0)

100.0% (0)

Dona

- (0)

- (0)

100.0% (0)

Dr

- (0)

- (0)

100.0% (0)

Dr

57.1% (4)

42.9% (3)

100.0% (7)

Jonkheer

- (0)

- (0)

100.0% (0)

Jonkheer

100.0% (1)

0.0% (0)

100.0% (1)

Lady

- (0)

- (0)

100.0% (0)

Lady

0.0% (0)

100.0% (1)

100.0% (1)

Major

- (0)

- (0)

100.0% (0)

Major

50.0% (1)

50.0% (1)

100.0% (2)

Master

44.1% (15)

55.9% (19)

100.0% (34)

Master

33.3% (2)

66.7% (4)

100.0% (6)

Miss

45.7% (16)

54.3% (19)

100.0% (35)

Miss

26.5% (39)

73.5% (108)

100.0% (147)

Mlle

- (0)

- (0)

100.0% (0)

Mlle

0.0% (0)

100.0% (2)

100.0% (2)

Mme

- (0)

- (0)

100.0% (0)

Mme

0.0% (0)

100.0% (1)

100.0% (1)

Mr

100.0% (2)

0.0% (0)

100.0% (2)

Mr

84.3% (434)

15.7% (81)

100.0% (515)

Mrs

- (0)

- (0)

100.0% (0)

Mrs

20.8% (26)

79.2% (99)

100.0% (125)

Ms

- (0)

- (0)

100.0% (0)

Ms

0.0% (0)

100.0% (1)

100.0% (1)

Rev

- (0)

- (0)

100.0% (0)

Rev

100.0% (6)

0.0% (0)

100.0% (6)

Sir

- (0)

- (0)

100.0% (0)

Sir

0.0% (0)

100.0% (1)

100.0% (1)

the Countess

- (0)

- (0)

100.0% (0)

the Countess

0.0% (0)

100.0% (1)

100.0% (1)

Total

46.5% (33)

53.5% (38)

100.0% (71)

Total

62.9% (516)

37.1% (304)

100.0% (820)

#  knitr::kable()

So, do passengers with a title other than Master/Miss/Mrs/Mr have a better chance of surviving? A little: 44% of titled persons survived compared to 38% of untitled. But only 27/891 = 3% of passengers have a title.

full[train_index, ] %>% 
  mutate(HasTitle = if_else(Title %in% c("Mr", "Mrs", "Master", "Miss"), "No", "Yes")) %>%
  tabyl(HasTitle, Survived) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
  adorn_title("combined") %>%
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

HasTitle.Survived

No

Yes

Total

No

61.8% (534)

38.2% (330)

100.0% (864)

Yes

55.6% (15)

44.4% (12)

100.0% (27)

Total

61.6% (549)

38.4% (342)

100.0% (891)

I think the conclusion here is that Title is redundant to AgeCohort and Sex and I should not include it in the models.

Pclass

Third class passengers (Pclass = 3) had a rough go of it. Male survival odds are cut in half when dropping to 2nd class, then stay about the same for 3rd class. Female survival odds stayed the same for 2nd class, but halved when dropping to 3rd class. N-size is small for the females though. Looks like all three classes are important, but should be moderated by Sex.

fmla <- "Survived ~ Pclass*Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Pclass", facet_var = "Sex", fmla = fmla)

TicketN

Traveling with with others helps - but only to a point! Both male and female survival odds increase with ticket size to n = 4, then they fall.

fmla <- "Survived ~ TicketN + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "TicketN", facet_var = "Sex", fmla = fmla)

How about a dummy variable TicketNCohort = c("lte4", "gt4").

fmla <- "Survived ~ TicketN*TicketNCohort + Sex"
dat <- full[train_index, ] %>%
  mutate(TicketNCohort = factor(if_else(TicketN <= 4, "lte4", "gt4")))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "TicketN", facet_var = "Sex", fmla = fmla)

Better. I’ll create the TicketNCohort predictor and plan to interact it with TicketN.

full <- full %>%
  mutate(TicketNCohort = factor(if_else(TicketN <= 4, "lte4", "gt4"), 
                            levels = c("lte4", "gt4")))

SibSp

Survival odds fall with increasing numbers of spouse+siblings. Maybe people made sure at least on child in a family got aboard a raft. Males traveling with no spouse or sibling is a conspicuously frequent and dubious group.

fmla <- "Survived ~ SibSp + Sex"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "SibSp", facet_var = "Sex", fmla = fmla)

Including the newly-created TicketNCohort variable seems to help.

fmla <- "Survived ~ SibSp + Sex + TicketN*TicketNCohort"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "SibSp", facet_var = "Sex", fmla = fmla)

No actions need to be taken with this variable - just include it in the model as-is.

Parch

Survival odds fall with increasing numbers of parents and children. Again, males traveling alone are in trouble.

fmla <- "Survived ~ Parch + Sex"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Parch", facet_var = "Sex", fmla = fmla)

Including the newly-created TicketNCohort variable seems to help. Hard to tell, actually.

fmla <- "Survived ~ Parch + Sex + TicketN*TicketNCohort"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Parch", facet_var = "Sex", fmla = fmla)

No actions need to be taken with this variable - just include it in the model as-is.

Embarked

Embarking from Queenstown was a death sentence for males for some reason. Embarked should interact with Sex.

fmla <- "Survived ~ Embarked*Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Embarked", facet_var = "Sex", fmla = fmla)

FarePerPass

Here is the survival log-odds for FarePerPass. I don’t really see much of a relationship. I think the thing to do is to leave the variable alone, include it in the model and see what happens.

fmla <- "Survived ~ Sex + FarePerPass"
dat <- full[train_index, ] %>% mutate(FarePerPassBin = cut(FarePerPass, 10))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "FarePerPassBin", facet_var = "Sex", fmla = fmla)

Employee

93% of employees perished, but n = 14. Maybe this variable is useful in a tree model, I don’t know. I’ll leave it in, but I don’t expect much.

tabyl(full[train_index, ], Employee, Survived) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
  adorn_title("combined") %>%
  data.frame() %>%
  flextable::regulartable() %>% 
  flextable::autofit()

Employee.Survived

No

Yes

Total

0

61.1% (535)

38.9% (341)

100.0% (876)

1

93.3% (14)

6.7% (1)

100.0% (15)

Total

61.6% (549)

38.4% (342)

100.0% (891)

Deck

Generally, it’s better to have a cabin on a deck at the top of the alphabet. No interaction with Sex. Not a lot of separation among the decks. Deck A has a low n-size. I’ll leave this variable alone.

fmla <- "Survived ~ Deck + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "Deck", facet_var = "Sex", fmla = fmla)

NetSurv

Is the survival status of the other passengers on the ticket predictive of your survival odds? Create feature NetSurv to check.

full <- full %>%
  group_by(Ticket) %>%
  mutate(
    SurvN = sum(Survived == "Yes", na.rm = TRUE),
    PrshN = sum(Survived == "No", na.rm = TRUE)
  ) %>% 
  ungroup() %>%
  mutate(
    SurvN = SurvN - if_else(!is.na(Survived) & Survived == "Yes", 1, 0),
    PrshN = PrshN - if_else(!is.na(Survived) & Survived == "No", 1, 0),
    NetSurv = SurvN - PrshN,
  ) %>%
  select(-SurvN, -PrshN)

NetSurv certainly does appear to be predictive. The tails are thinly populated and the linearity breaks down.

fmla <- "Survived ~ NetSurv + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "NetSurv", facet_var = "Sex", fmla = fmla)

I can probably collapse the tails.

full[train_index, ] %>% group_by(NetSurv) %>% summarize(n = n(), mean = mean(Survived == "Yes"))
## # A tibble: 11 x 3
##    NetSurv     n   mean
##      <dbl> <int>  <dbl>
##  1      -6    14 0     
##  2      -5    18 0     
##  3      -4    10 0     
##  4      -3    13 0.0769
##  5      -2     7 0.143 
##  6      -1    87 0.391 
##  7       0   571 0.324 
##  8       1   121 0.719 
##  9       2    37 0.703 
## 10       3    11 0.727 
## 11       4     2 0
full$NetSurv <- case_when(full$NetSurv <= -1 ~ -1,
                          full$NetSurv >=  1 ~  1,
                          TRUE ~ 0)

This could be an ordinal factor variable, but I think it works fine as a numeric.

fmla <- "Survived ~ NetSurv + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values  
plot_bin(dat, x_var = "NetSurv", facet_var = "Sex", fmla = fmla)

Collinearity

There predictor set has grown from 11 variables to 13. I’m adding factor variables AgeCohort and TicketNCohort, and numeric variable NetSurv, and I am dropping factor variable Title.

preds <- c(preds, "AgeCohort", "TicketNCohort", "NetSurv") 
preds <- subset(preds, preds != "Title")
preds_numeric <- c(preds_numeric, "NetSurv") 

Let’s measure the collinearity of the numeric predictors. TicketN has a high correlation with SibSp and Parch. Parch has a moderate correlatoin with SibSp. When predictors are correlated, one or both may show low statistical significance.

ggpairs(full[train_index, preds_numeric])

plot_bv_box <- function(dat, num_var, color){
  dat %>% 
    ggplot(aes(y = !!sym(num_var)))
}

The following tables break down the factor levels and associated odds ratios with the base level.

Save Work

Another look at the data.

mdl_vars <- c("Survived", preds)
skimr::skim(full[, mdl_vars])
Data summary
Name full[, mdl_vars]
Number of rows 1309
Number of columns 14
_______________________
Column type frequency:
factor 8
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Survived 418 0.68 FALSE 2 No: 549, Yes: 342
Pclass 0 1.00 FALSE 3 3: 709, 1: 323, 2: 277
Sex 0 1.00 FALSE 2 mal: 843, fem: 466
Embarked 0 1.00 FALSE 3 S: 914, C: 272, Q: 123
Employee 0 1.00 FALSE 2 0: 1292, 1: 17
Deck 0 1.00 FALSE 6 F: 376, C: 256, E: 209, D: 202
AgeCohort 0 1.00 FALSE 2 gt1: 1211, lte: 98
TicketNCohort 0 1.00 FALSE 2 lte: 1188, gt4: 121

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 29.46 14.09 0.17 21.00 27.00 38.00 80.00 ▂▇▃▂▁
SibSp 0 1 0.50 1.04 0.00 0.00 0.00 1.00 8.00 ▇▁▁▁▁
Parch 0 1 0.39 0.87 0.00 0.00 0.00 0.00 9.00 ▇▁▁▁▁
TicketN 0 1 2.10 1.78 1.00 1.00 1.00 3.00 11.00 ▇▁▁▁▁
FarePerPass 0 1 14.88 13.47 3.17 7.67 8.05 15.03 128.08 ▇▁▁▁▁
NetSurv 0 1 0.03 0.59 -1.00 0.00 0.00 0.00 1.00 ▂▁▇▁▂

For the modeling phase, I’ll split the full data set into training and testing, then 80:20 split training into training_80 for training and training_20 as a holdout set to compare models.

training <- full[train_index, ]
testing <- full[-train_index, ]
 
set.seed(1920)
partition <- createDataPartition(training$Survived, p = 0.80, list = FALSE)[, 1]
training_80 <- training[partition, ]
training_20 <- training[-partition, ]

Save the objects for the modeling.

save(full, train_index, training, testing, training_80, training_20, 
     preds, mdl_vars, file = "./titanic_02.RData")