This is a survey of the Titanic data. The univariate analysis characterizes the predictor distributions. The bivariate analysis characterizes relationships among the predictors and with the response variable. The influential outliers analysis searches for problematic observations.
Conclusions:
Title and Employee have near-zero-variance.Sex moderates Pclass and Embarked.AgeCohort. It should interact with Age and Sex.TicketNCohort. It should interact with TicketN.NetSurv with values truncated at -1 and +1.The initial data management created the data set full, with training rows indexed by train_index.
## Rows: 1,309
## Columns: 18
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, ...
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Surname <chr> "Braund", "Cumings", "Heikkinen", "Futrelle", "Allen", ...
## $ Title <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss,...
## $ Name <chr> "Owen Harris", "John Bradley (Florence Briggs Thayer)",...
## $ Sex <fct> male, female, female, female, male, male, male, male, f...
## $ Age <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...
## $ TicketN <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2...
## $ FarePerPass <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8.4...
## $ Employee <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Deck <fct> F, C, F, C, F, F, E, C, E, F, F, C, C, B, F, D, B, F, D...
In this section I will look at data distributions. For factor variables, I am interested in which have near-zero-variance. For quantitative variables, I am looking for significant skew.
Conclusions from this section are:
Title and Employee have near-zero-variance.It will be handy to classify the predictors by data type.
preds <- full %>%
select(-Survived, -PassengerId, -Surname, -Name, -Ticket, -Cabin, -Fare) %>%
colnames()
preds_class <- full[, preds] %>% map(class) %>% unlist()
preds_factor <- subset(preds_class, preds_class == "factor") %>% names()
preds_numeric <- subset(preds_class, preds_class %in% c("numeric", "integer")) %>% names()
mdl_vars <- c("Survived", preds)
rm(preds_class)
assertthat::are_equal(length(c(preds_factor, preds_numeric)), length(preds))## [1] TRUE
Inspect each factor variable, looking for near-zero-variance (NZV). I might collapse them to prevent zero-variance in CV folds. caret::nearZeroVar() defines NZV as a frequency ratio of the most common value to the second most common value frequency >= 19/5 and a unique value percentage <= 10%. (DataCamp course Machine Learning Toolbox suggests more aggressive thresholds of frequency ratio >= 2 and unique value percentage <= 20%).
dummies <- dummyVars(~., data = full[, preds_factor], fullRank = FALSE)
dummy_dat <- as.data.frame(predict(dummies, full[, preds_factor]))
(nzv <- dummy_dat %>%
nearZeroVar(saveMetrics= TRUE)
)## freqRatio percentUnique zeroVar nzv
## Pclass.1 3.052632 0.1527884 FALSE FALSE
## Pclass.2 3.725632 0.1527884 FALSE FALSE
## Pclass.3 1.181667 0.1527884 FALSE FALSE
## Title.Capt 1308.000000 0.1527884 FALSE TRUE
## Title.Col 326.250000 0.1527884 FALSE TRUE
## Title.Don 1308.000000 0.1527884 FALSE TRUE
## Title.Dona 1308.000000 0.1527884 FALSE TRUE
## Title.Dr 162.625000 0.1527884 FALSE TRUE
## Title.Jonkheer 1308.000000 0.1527884 FALSE TRUE
## Title.Lady 1308.000000 0.1527884 FALSE TRUE
## Title.Major 653.500000 0.1527884 FALSE TRUE
## Title.Master 20.459016 0.1527884 FALSE TRUE
## Title.Miss 4.034615 0.1527884 FALSE FALSE
## Title.Mlle 653.500000 0.1527884 FALSE TRUE
## Title.Mme 1308.000000 0.1527884 FALSE TRUE
## Title.Mr 1.371377 0.1527884 FALSE FALSE
## Title.Mrs 5.644670 0.1527884 FALSE FALSE
## Title.Ms 653.500000 0.1527884 FALSE TRUE
## Title.Rev 162.625000 0.1527884 FALSE TRUE
## Title.Sir 1308.000000 0.1527884 FALSE TRUE
## Title.the Countess 1308.000000 0.1527884 FALSE TRUE
## Sex.male 1.809013 0.1527884 FALSE FALSE
## Sex.female 1.809013 0.1527884 FALSE FALSE
## Embarked.C 3.812500 0.1527884 FALSE FALSE
## Embarked.Q 9.642276 0.1527884 FALSE FALSE
## Embarked.S 2.313924 0.1527884 FALSE FALSE
## Employee.0 76.000000 0.1527884 FALSE TRUE
## Employee.1 76.000000 0.1527884 FALSE TRUE
## Deck.A 12.494845 0.1527884 FALSE FALSE
## Deck.B 6.745562 0.1527884 FALSE FALSE
## Deck.C 4.113281 0.1527884 FALSE FALSE
## Deck.D 5.480198 0.1527884 FALSE FALSE
## Deck.E 5.263158 0.1527884 FALSE FALSE
## Deck.F 2.481383 0.1527884 FALSE FALSE
The problematic predictors are Title and Employee.
Here is a visualization of the data. Ideally, you want the vars to land in the top left quadrant. NZV are in the lower right quadrant.
nzv %>%
data.frame() %>% rownames_to_column(var = "col") %>%
separate(col, sep = "\\.", into = c("col", "level"), ) %>%
filter(!is.na(level)) %>%
ggplot(aes(x = freqRatio, y = percentUnique,
color = fct_rev(factor(nzv, labels = c("(okay)", "NZV"))),
label = level)) +
geom_text(check_overlap = TRUE, size = 2, na.rm = TRUE) +
geom_point(size = 3, alpha = 0.6, na.rm = TRUE) +
geom_hline(yintercept = 10, linetype = "dashed") +
geom_vline(xintercept = 95/5, linetype = "dashed") +
theme(legend.position = "top") +
labs(title = "Near-Zero Variance of Factor Variables", color = "") +
facet_wrap(~col)Skew can contribute to violation of linearity in linear regressions. I’ll check which variables have significant skew. Skew between 0.5 and 1.0 is generally considered moderate, and skew greater than 1 severe. In the following charts, negligibly skewed predictors are colored teal (there are none), moderately skewed predictors are colored gold and the severely skewed predictors are colored red.
col_skew <- map(full[, preds_numeric], skewness) %>% unlist()
col_skew_is_mod <- names(col_skew[abs(col_skew) > .5 & abs(col_skew) <= 1.0])
col_skew_is_high <- names(col_skew[abs(col_skew) > 1.0])
p <- map(
colnames(full[, preds_numeric]),
~ ggplot(full, aes_string(x = .x)) +
geom_histogram(fill = case_when(.x %in% col_skew_is_mod ~ "goldenrod",
.x %in% col_skew_is_high ~ "orangered4",
TRUE ~ "cadetblue")) +
labs(y = "", x = "", title = .x) +
theme(axis.text.y=element_blank(), plot.title = element_text(size = 10))
)
exec(grid.arrange, ncol = 2, !!!p)Nearly all of the quantitative predictors suffer from skew. It may make sense to transform these variables in modeling.
In this section I will look at inter-variable relationships. For factor variables, I am interested in which levels have significantly different log survival odds. For quantitative variables, I am looking for linear relationships with log survival odds and low correlations with each other. I am also interested in any variable interactions that might improve the linearity assumptions of linear models.
Conclusions:
Sex moderates Pclass and Embarked.AgeCohort. It should interact with Age and Sex.TicketNCohort. It should interact with TicketN.NetSurv with values truncated at -1 and +1.Before I begin, let’s get some initial perspective which variables are correlated with survival.
dummies_fullrank <- dummyVars(~., data = full[, mdl_vars], fullRank = TRUE)
full_dum <- as.data.frame(predict(dummies_fullrank, full[, mdl_vars]))
cor_out <- full_dum[train_index, ] %>% cor()
cor_out[, "Survived.Yes"] %>% abs() %>% sort(decreasing = TRUE) %>% head(11)## Survived.Yes Title.Mr Sex.female Title.Mrs Title.Miss Pclass.3
## 1.00000000 0.54919918 0.54335138 0.33904025 0.32709255 0.32230836
## FarePerPass Embarked.S Pclass.2 Deck.B Employee.1
## 0.28189162 0.15566027 0.09334857 0.08850991 0.08534273
Title and Sex are by the far the best predictors of survival (Title basically combines sex and age). Pclass and FarePerPass are next. Embarkation port is also important. So far it seems your best bet for survival is to be a rich woman.
For this analysis I’ll define a plotting function with a logistic regression fit, and an odds ratio table function.
model_bin <- function(dat, fmla) {
mdl <- train(
as.formula(fmla),
dat,
method = "glm", family = "binomial",
metric = "Accuracy",
trControl = trainControl(method = "cv", n = 10)
)
}
plot_bin <- function(dat, x_var, facet_var, fmla) {
dat %>%
group_by(!!sym(x_var), !!sym(facet_var), Survived, pred) %>%
summarize(n = n()) %>%
pivot_wider(names_from = "Survived", values_from = n, names_prefix = "Survived.") %>%
replace_na(list(Survived.No = 0, Survived.Yes = 0)) %>%
mutate(
N = Survived.No + Survived.Yes,
prop = Survived.Yes / N,
odds = prop / (1 - prop),
log_odds = log(odds),
log_odds = case_when(log_odds == -Inf ~ -3,
log_odds == Inf ~ 3,
TRUE ~ log_odds),
pred_log_odds = log(pred / (1 - pred))
) %>%
ggplot(aes(x = !!sym(x_var))) +
geom_point(aes(y = log_odds, size = N, color = "Actual"), alpha = 0.6) +
geom_point(aes(y = pred_log_odds, color = "Predicted")) +
scale_color_manual(values = c("cadetblue", "goldenrod")) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = fmla, x = x_var, color = "") +
facet_wrap(facets = facet_var)
}
write_tab <- function(dat, group_var){
fmla <- paste0("Survived ~ ", group_var)
mdl_obj <- glm(as.formula(fmla), data = dat, family = "binomial") %>%
tidy() %>%
mutate(term = str_remove(term, group_var), OR = exp(estimate)) %>%
filter(term != "(Intercept)") %>%
select(term, OR)
dat %>%
group_by(!!sym(group_var)) %>%
summarise(
Lived = sum(if_else(Survived == "Yes", 1, 0)),
Died = sum(if_else(Survived == "No", 1, 0)),
N = Lived + Died,
Surv = Lived / N,
Odds = Surv / (1 - Surv)
) %>%
ungroup() %>%
rename(term = !!sym(group_var)) %>%
mutate(term = as.character(term), Surv = Surv * 100) %>%
left_join(mdl_obj, by = "term") %>%
select(term, N, Surv, Odds, OR) %>%
flextable::flextable() %>%
flextable::colformat_int(j = 2) %>%
flextable::colformat_num(j = 3, digits = 0, suffix = "%") %>%
flextable::colformat_num(j = c(4:5)) %>%
flextable::set_caption(fmla)
}In the training data set, 19% of males survived and 74% of the females survived. We should get 74-81% accuracy just by predicting all males perish and all females survive.
term | N | Surv | Odds | OR |
male | 577 | 19% | 0.23 | |
female | 314 | 74% | 2.88 | 12.35 |
Here is the survival log-odds for Age. Children of both sexes had relatively good survival odds. For males, the children reverse a generally positive relationship with age. The female relationship seems okay.
fmla <- "Survived ~ Sex*Age"
dat <- full[train_index, ] %>% mutate(age_bin = cut(Age, 16))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "age_bin", facet_var = "Sex", fmla = fmla)So maybe there should be an AgeCohort = c(lte10, gt10) factor variable that moderates Age.
fmla <- "Survived ~ Sex*Age*AgeCohort"
dat <- full[train_index, ] %>%
mutate(
AgeCohort = factor(if_else(Age <= 10, "lte10", "gt10")),
AgeBin = cut(Age, 16)
)
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "AgeBin", facet_var = "Sex", fmla = fmla)tidy(mdl$finalModel) %>% flextable::regulartable() %>%
flextable::colformat_num(j = 2:5, digits = 3) %>%
flextable::autofit()term | estimate | std.error | statistic | p.value |
(Intercept) | -1.729 | 0.307 | -5.634 | 0.000 |
Sexfemale | 2.258 | 0.504 | 4.484 | 0.000 |
Age | 0.004 | 0.009 | 0.415 | 0.678 |
AgeCohortlte10 | 2.419 | 0.636 | 3.803 | 0.000 |
`Sexfemale:Age` | 0.019 | 0.016 | 1.194 | 0.233 |
`Sexfemale:AgeCohortlte10` | -1.796 | 1.016 | -1.767 | 0.077 |
`Age:AgeCohortlte10` | -0.155 | 0.116 | -1.334 | 0.182 |
`Sexfemale:Age:AgeCohortlte10` | -0.082 | 0.173 | -0.471 | 0.638 |
AgeCohort predictor and plan to interact it with Age and Sex since the slopes differ.
Title is closely related to Sex and Age. Is it still useful?
Here is the sex/age breakdown. 53-54% of children survive (same for males and females). Over age 10, 17% of males survive and 77% of females survive.
term | N | Surv | Odds | OR |
lte10 | 36 | 53% | 1.12 | |
gt10 | 541 | 17% | 0.20 | 0.18 |
term | N | Surv | Odds | OR |
lte10 | 35 | 54% | 1.19 | |
gt10 | 279 | 77% | 3.29 | 2.77 |
full[train_index, ] %>%
tabyl(AgeCohort, Survived, Sex) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
# adorn_title("combined") %>%
# untabyl() %>%
data.frame() %>%
flextable::regulartable() %>%
flextable::autofit()male.AgeCohort | male.No | male.Yes | male.Total | female.AgeCohort | female.No | female.Yes | female.Total |
lte10 | 47.2% (17) | 52.8% (19) | 100.0% (36) | lte10 | 45.7% (16) | 54.3% (19) | 100.0% (35) |
gt10 | 83.4% (451) | 16.6% (90) | 100.0% (541) | gt10 | 23.3% (65) | 76.7% (214) | 100.0% (279) |
Total | 81.1% (468) | 18.9% (109) | 100.0% (577) | Total | 25.8% (81) | 74.2% (233) | 100.0% (314) |
Combining AgeCohort and Title, 34/40 “Master” are under age 10, and 515/517 “Mr” are over age 10. “Miss” has less separation: 147/182 “Miss” are over age 10. I suspect “Miss” means any female who has not yet married.
full[train_index, ] %>%
tabyl(Title, Survived, AgeCohort) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
adorn_title("combined") %>%
data.frame() %>%
flextable::regulartable() %>%
flextable::autofit()lte10.Title.Survived | lte10.No | lte10.Yes | lte10.Total | gt10.Title.Survived | gt10.No | gt10.Yes | gt10.Total |
Capt | - (0) | - (0) | 100.0% (0) | Capt | 100.0% (1) | 0.0% (0) | 100.0% (1) |
Col | - (0) | - (0) | 100.0% (0) | Col | 50.0% (1) | 50.0% (1) | 100.0% (2) |
Don | - (0) | - (0) | 100.0% (0) | Don | 100.0% (1) | 0.0% (0) | 100.0% (1) |
Dona | - (0) | - (0) | 100.0% (0) | Dona | - (0) | - (0) | 100.0% (0) |
Dr | - (0) | - (0) | 100.0% (0) | Dr | 57.1% (4) | 42.9% (3) | 100.0% (7) |
Jonkheer | - (0) | - (0) | 100.0% (0) | Jonkheer | 100.0% (1) | 0.0% (0) | 100.0% (1) |
Lady | - (0) | - (0) | 100.0% (0) | Lady | 0.0% (0) | 100.0% (1) | 100.0% (1) |
Major | - (0) | - (0) | 100.0% (0) | Major | 50.0% (1) | 50.0% (1) | 100.0% (2) |
Master | 44.1% (15) | 55.9% (19) | 100.0% (34) | Master | 33.3% (2) | 66.7% (4) | 100.0% (6) |
Miss | 45.7% (16) | 54.3% (19) | 100.0% (35) | Miss | 26.5% (39) | 73.5% (108) | 100.0% (147) |
Mlle | - (0) | - (0) | 100.0% (0) | Mlle | 0.0% (0) | 100.0% (2) | 100.0% (2) |
Mme | - (0) | - (0) | 100.0% (0) | Mme | 0.0% (0) | 100.0% (1) | 100.0% (1) |
Mr | 100.0% (2) | 0.0% (0) | 100.0% (2) | Mr | 84.3% (434) | 15.7% (81) | 100.0% (515) |
Mrs | - (0) | - (0) | 100.0% (0) | Mrs | 20.8% (26) | 79.2% (99) | 100.0% (125) |
Ms | - (0) | - (0) | 100.0% (0) | Ms | 0.0% (0) | 100.0% (1) | 100.0% (1) |
Rev | - (0) | - (0) | 100.0% (0) | Rev | 100.0% (6) | 0.0% (0) | 100.0% (6) |
Sir | - (0) | - (0) | 100.0% (0) | Sir | 0.0% (0) | 100.0% (1) | 100.0% (1) |
the Countess | - (0) | - (0) | 100.0% (0) | the Countess | 0.0% (0) | 100.0% (1) | 100.0% (1) |
Total | 46.5% (33) | 53.5% (38) | 100.0% (71) | Total | 62.9% (516) | 37.1% (304) | 100.0% (820) |
So, do passengers with a title other than Master/Miss/Mrs/Mr have a better chance of surviving? A little: 44% of titled persons survived compared to 38% of untitled. But only 27/891 = 3% of passengers have a title.
full[train_index, ] %>%
mutate(HasTitle = if_else(Title %in% c("Mr", "Mrs", "Master", "Miss"), "No", "Yes")) %>%
tabyl(HasTitle, Survived) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
adorn_title("combined") %>%
data.frame() %>%
flextable::regulartable() %>%
flextable::autofit()HasTitle.Survived | No | Yes | Total |
No | 61.8% (534) | 38.2% (330) | 100.0% (864) |
Yes | 55.6% (15) | 44.4% (12) | 100.0% (27) |
Total | 61.6% (549) | 38.4% (342) | 100.0% (891) |
I think the conclusion here is that Title is redundant to AgeCohort and Sex and I should not include it in the models.
Third class passengers (Pclass = 3) had a rough go of it. Male survival odds are cut in half when dropping to 2nd class, then stay about the same for 3rd class. Female survival odds stayed the same for 2nd class, but halved when dropping to 3rd class. N-size is small for the females though. Looks like all three classes are important, but should be moderated by Sex.
fmla <- "Survived ~ Pclass*Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "Pclass", facet_var = "Sex", fmla = fmla)Traveling with with others helps - but only to a point! Both male and female survival odds increase with ticket size to n = 4, then they fall.
fmla <- "Survived ~ TicketN + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "TicketN", facet_var = "Sex", fmla = fmla)How about a dummy variable TicketNCohort = c("lte4", "gt4").
fmla <- "Survived ~ TicketN*TicketNCohort + Sex"
dat <- full[train_index, ] %>%
mutate(TicketNCohort = factor(if_else(TicketN <= 4, "lte4", "gt4")))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "TicketN", facet_var = "Sex", fmla = fmla)Better. I’ll create the TicketNCohort predictor and plan to interact it with TicketN.
Survival odds fall with increasing numbers of spouse+siblings. Maybe people made sure at least on child in a family got aboard a raft. Males traveling with no spouse or sibling is a conspicuously frequent and dubious group.
fmla <- "Survived ~ SibSp + Sex"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "SibSp", facet_var = "Sex", fmla = fmla)Including the newly-created TicketNCohort variable seems to help.
fmla <- "Survived ~ SibSp + Sex + TicketN*TicketNCohort"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "SibSp", facet_var = "Sex", fmla = fmla)No actions need to be taken with this variable - just include it in the model as-is.
Survival odds fall with increasing numbers of parents and children. Again, males traveling alone are in trouble.
fmla <- "Survived ~ Parch + Sex"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "Parch", facet_var = "Sex", fmla = fmla)Including the newly-created TicketNCohort variable seems to help. Hard to tell, actually.
fmla <- "Survived ~ Parch + Sex + TicketN*TicketNCohort"
dat <- full[train_index, ]
mdl <- full[train_index, ] %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "Parch", facet_var = "Sex", fmla = fmla)No actions need to be taken with this variable - just include it in the model as-is.
Embarking from Queenstown was a death sentence for males for some reason. Embarked should interact with Sex.
fmla <- "Survived ~ Embarked*Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "Embarked", facet_var = "Sex", fmla = fmla)Here is the survival log-odds for FarePerPass. I don’t really see much of a relationship. I think the thing to do is to leave the variable alone, include it in the model and see what happens.
fmla <- "Survived ~ Sex + FarePerPass"
dat <- full[train_index, ] %>% mutate(FarePerPassBin = cut(FarePerPass, 10))
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "FarePerPassBin", facet_var = "Sex", fmla = fmla)93% of employees perished, but n = 14. Maybe this variable is useful in a tree model, I don’t know. I’ll leave it in, but I don’t expect much.
tabyl(full[train_index, ], Employee, Survived) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages() %>% adorn_pct_formatting() %>% adorn_ns() %>%
adorn_title("combined") %>%
data.frame() %>%
flextable::regulartable() %>%
flextable::autofit()Employee.Survived | No | Yes | Total |
0 | 61.1% (535) | 38.9% (341) | 100.0% (876) |
1 | 93.3% (14) | 6.7% (1) | 100.0% (15) |
Total | 61.6% (549) | 38.4% (342) | 100.0% (891) |
Generally, it’s better to have a cabin on a deck at the top of the alphabet. No interaction with Sex. Not a lot of separation among the decks. Deck A has a low n-size. I’ll leave this variable alone.
fmla <- "Survived ~ Deck + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "Deck", facet_var = "Sex", fmla = fmla)Is the survival status of the other passengers on the ticket predictive of your survival odds? Create feature NetSurv to check.
full <- full %>%
group_by(Ticket) %>%
mutate(
SurvN = sum(Survived == "Yes", na.rm = TRUE),
PrshN = sum(Survived == "No", na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
SurvN = SurvN - if_else(!is.na(Survived) & Survived == "Yes", 1, 0),
PrshN = PrshN - if_else(!is.na(Survived) & Survived == "No", 1, 0),
NetSurv = SurvN - PrshN,
) %>%
select(-SurvN, -PrshN)NetSurv certainly does appear to be predictive. The tails are thinly populated and the linearity breaks down.
fmla <- "Survived ~ NetSurv + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "NetSurv", facet_var = "Sex", fmla = fmla)I can probably collapse the tails.
## # A tibble: 11 x 3
## NetSurv n mean
## <dbl> <int> <dbl>
## 1 -6 14 0
## 2 -5 18 0
## 3 -4 10 0
## 4 -3 13 0.0769
## 5 -2 7 0.143
## 6 -1 87 0.391
## 7 0 571 0.324
## 8 1 121 0.719
## 9 2 37 0.703
## 10 3 11 0.727
## 11 4 2 0
This could be an ordinal factor variable, but I think it works fine as a numeric.
fmla <- "Survived ~ NetSurv + Sex"
dat <- full[train_index, ]
mdl <- dat %>% model_bin(fmla)
dat$pred <- mdl$finalModel$fitted.values
plot_bin(dat, x_var = "NetSurv", facet_var = "Sex", fmla = fmla)There predictor set has grown from 11 variables to 13. I’m adding factor variables AgeCohort and TicketNCohort, and numeric variable NetSurv, and I am dropping factor variable Title.
preds <- c(preds, "AgeCohort", "TicketNCohort", "NetSurv")
preds <- subset(preds, preds != "Title")
preds_numeric <- c(preds_numeric, "NetSurv") Let’s measure the collinearity of the numeric predictors. TicketN has a high correlation with SibSp and Parch. Parch has a moderate correlatoin with SibSp. When predictors are correlated, one or both may show low statistical significance.
The following tables break down the factor levels and associated odds ratios with the base level.
Another look at the data.
| Name | full[, mdl_vars] |
| Number of rows | 1309 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Survived | 418 | 0.68 | FALSE | 2 | No: 549, Yes: 342 |
| Pclass | 0 | 1.00 | FALSE | 3 | 3: 709, 1: 323, 2: 277 |
| Sex | 0 | 1.00 | FALSE | 2 | mal: 843, fem: 466 |
| Embarked | 0 | 1.00 | FALSE | 3 | S: 914, C: 272, Q: 123 |
| Employee | 0 | 1.00 | FALSE | 2 | 0: 1292, 1: 17 |
| Deck | 0 | 1.00 | FALSE | 6 | F: 376, C: 256, E: 209, D: 202 |
| AgeCohort | 0 | 1.00 | FALSE | 2 | gt1: 1211, lte: 98 |
| TicketNCohort | 0 | 1.00 | FALSE | 2 | lte: 1188, gt4: 121 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 29.46 | 14.09 | 0.17 | 21.00 | 27.00 | 38.00 | 80.00 | ▂▇▃▂▁ |
| SibSp | 0 | 1 | 0.50 | 1.04 | 0.00 | 0.00 | 0.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1 | 0.39 | 0.87 | 0.00 | 0.00 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
| TicketN | 0 | 1 | 2.10 | 1.78 | 1.00 | 1.00 | 1.00 | 3.00 | 11.00 | ▇▁▁▁▁ |
| FarePerPass | 0 | 1 | 14.88 | 13.47 | 3.17 | 7.67 | 8.05 | 15.03 | 128.08 | ▇▁▁▁▁ |
| NetSurv | 0 | 1 | 0.03 | 0.59 | -1.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▂▁▇▁▂ |
For the modeling phase, I’ll split the full data set into training and testing, then 80:20 split training into training_80 for training and training_20 as a holdout set to compare models.
training <- full[train_index, ]
testing <- full[-train_index, ]
set.seed(1920)
partition <- createDataPartition(training$Survived, p = 0.80, list = FALSE)[, 1]
training_80 <- training[partition, ]
training_20 <- training[-partition, ]Save the objects for the modeling.