The Titanic: Machine Learning from Disaster competition challenges participants to predict the survivorship (1 = survived, 0 = perished) of passengers aboard the April 1912 maiden voyage of the Titanic. In total, 1,502 of the 2,224 passengers and crew perished (68%). The training (n = 891) and test (n = 418) data sets comprise only a fraction (59%) of the total. The data includes 10 features. Competitors build a predictive model with the training data set, then apply the model to the test data set to produce a submission file consisting of PassengerId and Survived (1|0). Kaggle evaluates submissions on accuracy.

This file handles the data management and feature engineering. I got a lot of ideas from here.

Setup

library(tidyverse)
library(caret)
library(mice)
library(janitor)

Load Data

Here are the train/test data set column definitions from Kaggle.

Variable Definition
PassengerId
Survived Survival (0 = No, 1 = Yes)
Pclass Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name
Sex
Age Passenger age (years)
SibSp # of siblings / spouses aboard the Titanic
Parch # of parents / children aboard the Titanic
Ticket Ticket number
Fare Passenger fare ($s)
Cabin Cabin number
Embarked Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
train <- read_csv("./train.csv") 
test <- read_csv("./test.csv")
full <- bind_rows(train, test)
train_index <- c(1:891)

glimpse(full)
## Rows: 1,309
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...

Feature Engineering

At this point, I know the full data set is 1,309 x 12, including 10 feature columns. How are the data types and missingness?

skimr::skim(full)
Data summary
Name full
Number of rows 1309
Number of columns 12
_______________________
Column type frequency:
character 5
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1.00 12 82 0 1307 0
Sex 0 1.00 4 6 0 2 0
Ticket 0 1.00 3 18 0 929 0
Cabin 1014 0.23 1 15 0 186 0
Embarked 2 1.00 1 1 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PassengerId 0 1.00 655.00 378.02 1.00 328.0 655.00 982.00 1309.00 ▇▇▇▇▇
Survived 418 0.68 0.38 0.49 0.00 0.0 0.00 1.00 1.00 ▇▁▁▁▅
Pclass 0 1.00 2.29 0.84 1.00 2.0 3.00 3.00 3.00 ▃▁▃▁▇
Age 263 0.80 29.88 14.41 0.17 21.0 28.00 39.00 80.00 ▂▇▅▂▁
SibSp 0 1.00 0.50 1.04 0.00 0.0 0.00 1.00 8.00 ▇▁▁▁▁
Parch 0 1.00 0.39 0.87 0.00 0.0 0.00 0.00 9.00 ▇▁▁▁▁
Fare 1 1.00 33.30 51.76 0.00 7.9 14.45 31.27 512.33 ▇▁▁▁▁

Survived, Pclass, Sex, and Embarked should be factors. Assign character labels for Survived because train() will create variable names from the levels.

full$Survived <- factor(full$Survived, labels = c("No", "Yes"))
full$Pclass <- factor(full$Pclass)
full$Sex <- factor(full$Sex, levels = c("male", "female"))
full$Embarked <- factor(full$Embarked)

Cabin is 77% null, Age is 20% null, and there are 2 nulls for Embarked and 1 for Fare.

Fare -> Fare per Ticket

Fare is the price of the ticket, but there can be multiple passengers per ticket. The per-person price is Fare divided number of passengers on the ticket. Let’s untangle that by 1) creating a variable TicketN for the number of passengers on the ticket, and 2) creating a variable FarePerPass for the per-passenger price.

Passengers per Ticket

Create new predictor, TicketN, for number of passengers per ticket.

full <- full %>% group_by(Ticket) %>% mutate(TicketN = n()) %>% ungroup()

full %>% ggplot(aes(x = TicketN)) + geom_histogram(bins = 30) +
  scale_x_continuous(n.breaks = 15) +
  labs(title = "Passenger Count Distribution") +
  theme_minimal()

Impute NA: Fare

Define FarePerPass as Fare divided by TicketN. First, however, I need to deal with the passenger ticket with missing Fare.

full %>% filter(is.na(Fare))
## # A tibble: 1 x 13
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl> <fct>    <fct>  <chr> <fct> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1        1044 <NA>     3      Stor~ male   60.5     0     0 3701      NA <NA> 
## # ... with 2 more variables: Embarked <fct>, TicketN <int>

PassengerId 1044 was a third-class traveler boarding at Southampton. What is the typical fare per passenger for in this profile?

full %>% 
  filter(Embarked == "S" & Pclass == 3) %>% 
  mutate(FarePerPass = Fare / TicketN) %>% 
  select(FarePerPass) %>% 
  summary()
##   FarePerPass    
##  Min.   : 0.000  
##  1st Qu.: 6.981  
##  Median : 7.796  
##  Mean   : 7.372  
##  3rd Qu.: 8.050  
##  Max.   :19.967  
##  NA's   :1

The median fare per passenger was $7.796.

full[full$PassengerId == 1044, ]$Fare <- 7.796

Fare per Passenger

Calculate FarePerPass now.

full$FarePerPass <- full$Fare / full$TicketN

Let’s see what the distribution of fares look like.

p1 <- full %>% ggplot(aes(x = FarePerPass, color = Pclass)) + geom_density() +
  theme_minimal() + 
  theme(legend.position = "top") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "FarePerPass Distribution")
p2 <- full %>% ggplot(aes(x = Pclass, y = FarePerPass, color = Pclass)) + 
  geom_boxplot() + geom_jitter(alpha = 0.6) +
  theme_minimal() + 
  theme(legend.position = "top") + 
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "FarePerPass Box Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)

First class tickets can cost up to $125 per passenger! Even more surprising is the tickets costing $0 per passenger.

Fare

17 passengers paid $0 for their ticket. All were male. Of the 15 in the training set, 14 perished.

full %>% filter(Fare == 0) %>% arrange(Name)
## # A tibble: 17 x 14
##    PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##          <dbl> <fct>    <fct>  <chr> <fct> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
##  1         807 No       1      "And~ male     39     0     0 112050     0 A36  
##  2         467 No       2      "Cam~ male     NA     0     0 239853     0 <NA> 
##  3        1158 <NA>     1      "Chi~ male     NA     0     0 112051     0 <NA> 
##  4         414 No       2      "Cun~ male     NA     0     0 239853     0 <NA> 
##  5         482 No       2      "Fro~ male     NA     0     0 239854     0 <NA> 
##  6         816 No       1      "Fry~ male     NA     0     0 112058     0 B102 
##  7         264 No       1      "Har~ male     40     0     0 112059     0 B94  
##  8        1264 <NA>     1      "Ism~ male     49     0     0 112058     0 B52 ~
##  9         598 No       3      "Joh~ male     49     0     0 LINE       0 <NA> 
## 10         303 No       3      "Joh~ male     19     0     0 LINE       0 <NA> 
## 11         733 No       2      "Kni~ male     NA     0     0 239855     0 <NA> 
## 12         180 No       3      "Leo~ male     36     0     0 LINE       0 <NA> 
## 13         278 No       2      "Par~ male     NA     0     0 239853     0 <NA> 
## 14         634 No       1      "Par~ male     NA     0     0 112052     0 <NA> 
## 15         823 No       1      "Reu~ male     38     0     0 19972      0 <NA> 
## 16         272 Yes      3      "Tor~ male     25     0     0 LINE       0 <NA> 
## 17         675 No       2      "Wat~ male     NA     0     0 239856     0 <NA> 
## # ... with 3 more variables: Embarked <fct>, TicketN <int>, FarePerPass <dbl>

After a little research here, it appears that nine of the passengers who paid $0 were from the Guarantee Group from Harland & Wolff, the shipbuilding firm. They were regarded as crew. Of the others, Reuchlin (PassengerId 823) was sort of an employee. So were Fry, Harrison, and Ismay (816, 264, and 1264). Perhaps the best way to deal with these passengers is to create a new employee variable, Employee (1 = yes, 0 = no), set their FarePerPass to NA, and impute a “market-rate” fare.

full <- full %>% 
  mutate(
    Employee = factor(if_else(Fare == 0, 1, 0)),
    FarePerPass = if_else(FarePerPass == 0, as.numeric(NA), FarePerPass)
  )

I’ll use Mice to impute. You might think the response variable Survived would be useful to include, but 1) it feels cicular to use it, and 2) it is often missing. Anyway, I experimented with and without it, and leaving it out yielded more accurate imputations.

set.seed(2020)
mice_obj <- mice(
  full[, !names(full) %in% c("PassengerId", "Name", "Ticket", "Cabin", "Survived")],
  method = "rf"
)
## 
##  iter imp variable
##   1   1  Age  Embarked  FarePerPass
##   1   2  Age  Embarked  FarePerPass
##   1   3  Age  Embarked  FarePerPass
##   1   4  Age  Embarked  FarePerPass
##   1   5  Age  Embarked  FarePerPass
##   2   1  Age  Embarked  FarePerPass
##   2   2  Age  Embarked  FarePerPass
##   2   3  Age  Embarked  FarePerPass
##   2   4  Age  Embarked  FarePerPass
##   2   5  Age  Embarked  FarePerPass
##   3   1  Age  Embarked  FarePerPass
##   3   2  Age  Embarked  FarePerPass
##   3   3  Age  Embarked  FarePerPass
##   3   4  Age  Embarked  FarePerPass
##   3   5  Age  Embarked  FarePerPass
##   4   1  Age  Embarked  FarePerPass
##   4   2  Age  Embarked  FarePerPass
##   4   3  Age  Embarked  FarePerPass
##   4   4  Age  Embarked  FarePerPass
##   4   5  Age  Embarked  FarePerPass
##   5   1  Age  Embarked  FarePerPass
##   5   2  Age  Embarked  FarePerPass
##   5   3  Age  Embarked  FarePerPass
##   5   4  Age  Embarked  FarePerPass
##   5   5  Age  Embarked  FarePerPass
## Warning: Number of logged events: 25
mice_cmpl <- complete(mice_obj)
data.frame(full = full$FarePerPass, mice = mice_cmpl$FarePerPass) %>%
  replace_na(list(full = 0)) %>%
  pivot_longer(cols = c("full", "mice"), names_to = "Data", values_to= "Fare") %>%
  ggplot(aes(x = Fare, color = Data)) + 
  geom_density(na.rm = TRUE) +
  theme_minimal() + 
  scale_x_continuous(labels = scales::dollar) +
  theme(legend.position = "top") +
  labs(title = "Fare Distribution", color = "")

Those 17 fit in invisibly. Assign the imputed values. Recalculate Fare too.

full$FarePerPass <- mice_cmpl$FarePerPass
full$Fare <- full$FarePerPass * full$TicketN

One more look at FarePerPass.

p1 <- full %>% ggplot(aes(x = FarePerPass, color = Pclass)) + geom_density() +
  theme_minimal() + 
  theme(legend.position = "top") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "FarePerPass Distribution")
p2 <- full %>% ggplot(aes(x = Pclass, y = FarePerPass, color = Pclass)) + 
  geom_boxplot() + geom_jitter(alpha = 0.6) +
  theme_minimal() + 
  theme(legend.position = "top") + 
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "FarePerPass Box Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)

Impute NA: Embarked

Passengers Embarked from three ports (C = Cherbourg, Q = Queenstown, S = Southampton). Two passengers have NA.

full %>% filter(is.na(Embarked))
## # A tibble: 2 x 15
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl> <fct>    <fct>  <chr> <fct> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1          62 Yes      1      Icar~ fema~    38     0     0 113572    80 B28  
## 2         830 Yes      1      Ston~ fema~    62     0     0 113572    80 B28  
## # ... with 4 more variables: Embarked <fct>, TicketN <int>, FarePerPass <dbl>,
## #   Employee <fct>

They traveled on the same ticket, so they probably embarked from the same port. They paid $40 per passenger. Where did other passengers fitting this profile embark?

plot_prof <- function(hline, subtitle){
  full %>%
    filter(!is.na(Embarked)) %>%
    ggplot(aes(x = Embarked, y = FarePerPass, color = Pclass)) + 
    geom_boxplot(na.rm = TRUE) +
    geom_hline(yintercept = hline, linetype = 2) +
    scale_y_continuous(labels = scales::dollar) +
    scale_color_brewer(palette = "Paired") +
    labs(
      y = "Fare",
      title = "Passenger Profiles",
      subtitle = subtitle
    )
}
plot_prof(40, "First Class passengers paying $80/ticket probably embarked from Cherbourg")

Seems like port Cherbourg is a good guess, but it could be any of them. What did mice() predict?

mice_cmpl[c(62, 830), ] %>% select(Embarked, everything())
##     Embarked Pclass    Sex Age SibSp Parch Fare TicketN FarePerPass Employee
## 62         C      1 female  38     0     0   80       2          40        0
## 830        C      1 female  62     0     0   80       2          40        0

mice() thinks Chebourg too.

full$Embarked <- mice_cmpl$Embarked

Impute NA: Age

20% of Age values are missing. I have no good ideas for imputation, so I’ll just go with mice.

data.frame(full = full$Age, mice = mice_cmpl$Age) %>%
  pivot_longer(cols = c("full", "mice"), names_to = "Data", values_to= "Age") %>%
  ggplot(aes(x = Age, color = Data)) + 
  geom_density(na.rm = TRUE) +
  labs(title = "Age Distribution", color = "")

Mice looks great.

full$Age <- mice_cmpl$Age

Name -> Surname, Title, Name

Name includes a title and surname. The title may be predictive of survivorship. The surname may be useful for constructing family units (do families sink or swim together?), but I’m not sure how to go about doing that, so I’ll focus on title.

Separate Name into Surname, Title, Name.

full <- full %>% 
  separate(Name, ", ", into = c("Surname", "Name")) %>%
  separate(Name, "\\. ", into = c("Title", "Name"), extra = "merge") %>%
  mutate(Title = factor(Title))

full %>% count(Title, sort = TRUE) 
## # A tibble: 18 x 2
##    Title            n
##    <fct>        <int>
##  1 Mr             757
##  2 Miss           260
##  3 Mrs            197
##  4 Master          61
##  5 Dr               8
##  6 Rev              8
##  7 Col              4
##  8 Major            2
##  9 Mlle             2
## 10 Ms               2
## 11 Capt             1
## 12 Don              1
## 13 Dona             1
## 14 Jonkheer         1
## 15 Lady             1
## 16 Mme              1
## 17 Sir              1
## 18 the Countess     1

Cabin -> Deck

Cabin contains both the deck [A-G, T] (T = Tank Top?) and cabin number (deck descriptions here). There are a few rows with multiple cabins, and in a few of these instances the multiple cabins are on multiple decks.

unique(full$Cabin)
##   [1] NA                "C85"             "C123"            "E46"            
##   [5] "G6"              "C103"            "D56"             "A6"             
##   [9] "C23 C25 C27"     "B78"             "D33"             "B30"            
##  [13] "C52"             "B28"             "C83"             "F33"            
##  [17] "F G73"           "E31"             "A5"              "D10 D12"        
##  [21] "D26"             "C110"            "B58 B60"         "E101"           
##  [25] "F E69"           "D47"             "B86"             "F2"             
##  [29] "C2"              "E33"             "B19"             "A7"             
##  [33] "C49"             "F4"              "A32"             "B4"             
##  [37] "B80"             "A31"             "D36"             "D15"            
##  [41] "C93"             "C78"             "D35"             "C87"            
##  [45] "B77"             "E67"             "B94"             "C125"           
##  [49] "C99"             "C118"            "D7"              "A19"            
##  [53] "B49"             "D"               "C22 C26"         "C106"           
##  [57] "C65"             "E36"             "C54"             "B57 B59 B63 B66"
##  [61] "C7"              "E34"             "C32"             "B18"            
##  [65] "C124"            "C91"             "E40"             "T"              
##  [69] "C128"            "D37"             "B35"             "E50"            
##  [73] "C82"             "B96 B98"         "E10"             "E44"            
##  [77] "A34"             "C104"            "C111"            "C92"            
##  [81] "E38"             "D21"             "E12"             "E63"            
##  [85] "A14"             "B37"             "C30"             "D20"            
##  [89] "B79"             "E25"             "D46"             "B73"            
##  [93] "C95"             "B38"             "B39"             "B22"            
##  [97] "C86"             "C70"             "A16"             "C101"           
## [101] "C68"             "A10"             "E68"             "B41"            
## [105] "A20"             "D19"             "D50"             "D9"             
## [109] "A23"             "B50"             "A26"             "D48"            
## [113] "E58"             "C126"            "B71"             "B51 B53 B55"    
## [117] "D49"             "B5"              "B20"             "F G63"          
## [121] "C62 C64"         "E24"             "C90"             "C45"            
## [125] "E8"              "B101"            "D45"             "C46"            
## [129] "D30"             "E121"            "D11"             "E77"            
## [133] "F38"             "B3"              "D6"              "B82 B84"        
## [137] "D17"             "A36"             "B102"            "B69"            
## [141] "E49"             "C47"             "D28"             "E17"            
## [145] "A24"             "C50"             "B42"             "C148"           
## [149] "B45"             "B36"             "A21"             "D34"            
## [153] "A9"              "C31"             "B61"             "C53"            
## [157] "D43"             "C130"            "C132"            "C55 C57"        
## [161] "C116"            "F"               "A29"             "C6"             
## [165] "C28"             "C51"             "C97"             "D22"            
## [169] "B10"             "E45"             "E52"             "A11"            
## [173] "B11"             "C80"             "C89"             "F E46"          
## [177] "B26"             "F E57"           "A18"             "E60"            
## [181] "E39 E41"         "B52 B54 B56"     "C39"             "B24"            
## [185] "D40"             "D38"             "C105"

I will asssume the first listed cabin contains the deck number.

full$Deck <- map_chr(full$Cabin, str_sub, start = 1, end = 1) %>% factor()

Here is the distribution of values, grouped by passenger class.

full %>%
  tabyl(Deck, Pclass) %>% 
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined") %>%
  knitr::kable()
Deck/Pclass 1 2 3 Total
A 6.8% (22) 0.0% (0) 0.0% (0) 1.7% (22)
B 20.1% (65) 0.0% (0) 0.0% (0) 5.0% (65)
C 29.1% (94) 0.0% (0) 0.0% (0) 7.2% (94)
D 12.4% (40) 2.2% (6) 0.0% (0) 3.5% (46)
E 10.5% (34) 1.4% (4) 0.4% (3) 3.1% (41)
F 0.0% (0) 4.7% (13) 1.1% (8) 1.6% (21)
G 0.0% (0) 0.0% (0) 0.7% (5) 0.4% (5)
T 0.3% (1) 0.0% (0) 0.0% (0) 0.1% (1)
NA 20.7% (67) 91.7% (254) 97.7% (693) 77.5% (1014)
Total 100.0% (323) 100.0% (277) 100.0% (709) 100.0% (1309)

There are 8 decks listed. One fellow seems to have been assigned to the tank top, meaning he’s sleeping with the ship boilers.

full %>% filter(Deck == "T")
## # A tibble: 1 x 18
##   PassengerId Survived Pclass Surname Title Name  Sex     Age SibSp Parch Ticket
##         <dbl> <fct>    <fct>  <chr>   <fct> <chr> <fct> <dbl> <dbl> <dbl> <chr> 
## 1         340 No       1      Blackw~ Mr    Step~ male     45     0     0 113784
## # ... with 7 more variables: Fare <dbl>, Cabin <chr>, Embarked <fct>,
## #   TicketN <int>, FarePerPass <dbl>, Employee <fct>, Deck <fct>

First class passenger too - that seems wrong. What else… The decks are labeled in descending level above water, so the highest deck is “A” and it would have been the most exclusive. That is evident from the table above - decks A, B, and C are 100% first class. Lots of missing values though. Seems like a good guess that the third-class passengers are on deck F or G. How about average ticket fare by deck?

full %>% 
  ggplot(aes(x = fct_explicit_na(Deck), y = FarePerPass, 
             color = as.factor(Pclass))) +
  geom_boxplot() +
  scale_color_brewer(palette = "Paired") +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar) +
  theme(legend.position = "top") +
  labs(title = "FarePerPass by Deck", color = "Class", x = "Deck", y = "Fare per Passenger")

Okay, so what am I going to do with Deck? First, I’ll un-assign the deck from Mr. Tank Top since that just seems wrong, and I can’t have a factor variable level with one row. Then I’ll let mice impute the NAs. After that, I’ll decide if I should collapse factors F and G since G was pretty sparse.

full[340, ]$Deck <- NA
full$Deck = fct_drop(full$Deck)

set.seed(2020)
mice_obj <- mice(
  full[, !names(full) %in% c("PassengerId", "Name", "Ticket", "Cabin", "Survived")],
  method = "rf"
)
## 
##  iter imp variable
##   1   1  Deck
##   1   2  Deck
##   1   3  Deck
##   1   4  Deck
##   1   5  Deck
##   2   1  Deck
##   2   2  Deck
##   2   3  Deck
##   2   4  Deck
##   2   5  Deck
##   3   1  Deck
##   3   2  Deck
##   3   3  Deck
##   3   4  Deck
##   3   5  Deck
##   4   1  Deck
##   4   2  Deck
##   4   3  Deck
##   4   4  Deck
##   4   5  Deck
##   5   1  Deck
##   5   2  Deck
##   5   3  Deck
##   5   4  Deck
##   5   5  Deck
## Warning: Number of logged events: 26
mice_cmpl <- complete(mice_obj)
addmargins(table(mice_cmpl$Deck, mice_cmpl$Pclass, useNA = "ifany"))
##      
##          1    2    3  Sum
##   A     26   11   60   97
##   B     80   30   59  169
##   C    125   34   97  256
##   D     48   72   82  202
##   E     43   43  123  209
##   F      1   84  254  339
##   G      0    3   34   37
##   Sum  323  277  709 1309
mice_cmpl %>%
  tabyl(Deck, Pclass) %>% 
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined") %>%
  knitr::kable()
Deck/Pclass 1 2 3 Total
A 8.0% (26) 4.0% (11) 8.5% (60) 7.4% (97)
B 24.8% (80) 10.8% (30) 8.3% (59) 12.9% (169)
C 38.7% (125) 12.3% (34) 13.7% (97) 19.6% (256)
D 14.9% (48) 26.0% (72) 11.6% (82) 15.4% (202)
E 13.3% (43) 15.5% (43) 17.3% (123) 16.0% (209)
F 0.3% (1) 30.3% (84) 35.8% (254) 25.9% (339)
G 0.0% (0) 1.1% (3) 4.8% (34) 2.8% (37)
Total 100.0% (323) 100.0% (277) 100.0% (709) 100.0% (1309)

Deck G is still sparse, so let’s go ahead and collapse the level into F.

full$Deck <- mice_cmpl$Deck
full$Deck <- fct_collapse(full$Deck, F = "G")

Oh, where did Mr. Tank Top go?

full %>% filter(PassengerId == 340) %>% select(Deck, everything())
## # A tibble: 1 x 18
##   Deck  PassengerId Survived Pclass Surname Title Name  Sex     Age SibSp Parch
##   <fct>       <dbl> <fct>    <fct>  <chr>   <fct> <chr> <fct> <dbl> <dbl> <dbl>
## 1 A             340 No       1      Blackw~ Mr    Step~ male     45     0     0
## # ... with 7 more variables: Ticket <chr>, Fare <dbl>, Cabin <chr>,
## #   Embarked <fct>, TicketN <int>, FarePerPass <dbl>, Employee <fct>

Deck B!

Save Work

One final look.

skimr::skim(full)
Data summary
Name full
Number of rows 1309
Number of columns 18
_______________________
Column type frequency:
character 4
factor 7
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Surname 0 1.00 3 22 0 875 0
Name 0 1.00 3 55 0 1126 0
Ticket 0 1.00 3 18 0 929 0
Cabin 1014 0.23 1 15 0 186 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Survived 418 0.68 FALSE 2 No: 549, Yes: 342
Pclass 0 1.00 FALSE 3 3: 709, 1: 323, 2: 277
Title 0 1.00 FALSE 18 Mr: 757, Mis: 260, Mrs: 197, Mas: 61
Sex 0 1.00 FALSE 2 mal: 843, fem: 466
Embarked 0 1.00 FALSE 3 S: 914, C: 272, Q: 123
Employee 0 1.00 FALSE 2 0: 1292, 1: 17
Deck 0 1.00 FALSE 6 F: 376, C: 256, E: 209, D: 202

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PassengerId 0 1 655.00 378.02 1.00 328.00 655.00 982.00 1309.00 ▇▇▇▇▇
Age 0 1 29.46 14.09 0.17 21.00 27.00 38.00 80.00 ▂▇▃▂▁
SibSp 0 1 0.50 1.04 0.00 0.00 0.00 1.00 8.00 ▇▁▁▁▁
Parch 0 1 0.39 0.87 0.00 0.00 0.00 0.00 9.00 ▇▁▁▁▁
Fare 0 1 33.53 51.65 3.17 7.92 14.50 31.27 512.33 ▇▁▁▁▁
TicketN 0 1 2.10 1.78 1.00 1.00 1.00 3.00 11.00 ▇▁▁▁▁
FarePerPass 0 1 14.88 13.47 3.17 7.67 8.05 15.03 128.08 ▇▁▁▁▁
save(full, train_index, file = "./titanic_01.RData")