The Titanic: Machine Learning from Disaster competition challenges participants to predict the survivorship (1 = survived, 0 = perished) of passengers aboard the April 1912 maiden voyage of the Titanic. In total, 1,502 of the 2,224 passengers and crew perished (68%). The training (n = 891) and test (n = 418) data sets comprise only a fraction (59%) of the total. The data includes 10 features. Competitors build a predictive model with the training data set, then apply the model to the test data set to produce a submission file consisting of PassengerId and Survived (1|0). Kaggle evaluates submissions on accuracy.
This file handles the data management and feature engineering. I got a lot of ideas from here.
Here are the train/test data set column definitions from Kaggle.
| Variable | Definition |
|---|---|
| PassengerId | |
| Survived | Survival (0 = No, 1 = Yes) |
| Pclass | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) |
| Name | |
| Sex | |
| Age | Passenger age (years) |
| SibSp | # of siblings / spouses aboard the Titanic |
| Parch | # of parents / children aboard the Titanic |
| Ticket | Ticket number |
| Fare | Passenger fare ($s) |
| Cabin | Cabin number |
| Embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
train <- read_csv("./train.csv")
test <- read_csv("./test.csv")
full <- bind_rows(train, test)
train_index <- c(1:891)
glimpse(full)## Rows: 1,309
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...
At this point, I know the full data set is 1,309 x 12, including 10 feature columns. How are the data types and missingness?
| Name | full |
| Number of rows | 1309 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1.00 | 12 | 82 | 0 | 1307 | 0 |
| Sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1.00 | 3 | 18 | 0 | 929 | 0 |
| Cabin | 1014 | 0.23 | 1 | 15 | 0 | 186 | 0 |
| Embarked | 2 | 1.00 | 1 | 1 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.00 | 655.00 | 378.02 | 1.00 | 328.0 | 655.00 | 982.00 | 1309.00 | ▇▇▇▇▇ |
| Survived | 418 | 0.68 | 0.38 | 0.49 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| Pclass | 0 | 1.00 | 2.29 | 0.84 | 1.00 | 2.0 | 3.00 | 3.00 | 3.00 | ▃▁▃▁▇ |
| Age | 263 | 0.80 | 29.88 | 14.41 | 0.17 | 21.0 | 28.00 | 39.00 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.00 | 0.50 | 1.04 | 0.00 | 0.0 | 0.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.00 | 0.39 | 0.87 | 0.00 | 0.0 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
| Fare | 1 | 1.00 | 33.30 | 51.76 | 0.00 | 7.9 | 14.45 | 31.27 | 512.33 | ▇▁▁▁▁ |
Survived, Pclass, Sex, and Embarked should be factors. Assign character labels for Survived because train() will create variable names from the levels.
full$Survived <- factor(full$Survived, labels = c("No", "Yes"))
full$Pclass <- factor(full$Pclass)
full$Sex <- factor(full$Sex, levels = c("male", "female"))
full$Embarked <- factor(full$Embarked)Cabin is 77% null, Age is 20% null, and there are 2 nulls for Embarked and 1 for Fare.
Fare is the price of the ticket, but there can be multiple passengers per ticket. The per-person price is Fare divided number of passengers on the ticket. Let’s untangle that by 1) creating a variable TicketN for the number of passengers on the ticket, and 2) creating a variable FarePerPass for the per-passenger price.
Create new predictor, TicketN, for number of passengers per ticket.
full <- full %>% group_by(Ticket) %>% mutate(TicketN = n()) %>% ungroup()
full %>% ggplot(aes(x = TicketN)) + geom_histogram(bins = 30) +
scale_x_continuous(n.breaks = 15) +
labs(title = "Passenger Count Distribution") +
theme_minimal()Define FarePerPass as Fare divided by TicketN. First, however, I need to deal with the passenger ticket with missing Fare.
## # A tibble: 1 x 13
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <fct> <fct> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1044 <NA> 3 Stor~ male 60.5 0 0 3701 NA <NA>
## # ... with 2 more variables: Embarked <fct>, TicketN <int>
PassengerId 1044 was a third-class traveler boarding at Southampton. What is the typical fare per passenger for in this profile?
full %>%
filter(Embarked == "S" & Pclass == 3) %>%
mutate(FarePerPass = Fare / TicketN) %>%
select(FarePerPass) %>%
summary()## FarePerPass
## Min. : 0.000
## 1st Qu.: 6.981
## Median : 7.796
## Mean : 7.372
## 3rd Qu.: 8.050
## Max. :19.967
## NA's :1
The median fare per passenger was $7.796.
Calculate FarePerPass now.
Let’s see what the distribution of fares look like.
p1 <- full %>% ggplot(aes(x = FarePerPass, color = Pclass)) + geom_density() +
theme_minimal() +
theme(legend.position = "top") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "FarePerPass Distribution")
p2 <- full %>% ggplot(aes(x = Pclass, y = FarePerPass, color = Pclass)) +
geom_boxplot() + geom_jitter(alpha = 0.6) +
theme_minimal() +
theme(legend.position = "top") +
scale_y_continuous(labels = scales::dollar) +
labs(title = "FarePerPass Box Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)First class tickets can cost up to $125 per passenger! Even more surprising is the tickets costing $0 per passenger.
17 passengers paid $0 for their ticket. All were male. Of the 15 in the training set, 14 perished.
## # A tibble: 17 x 14
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <fct> <fct> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 807 No 1 "And~ male 39 0 0 112050 0 A36
## 2 467 No 2 "Cam~ male NA 0 0 239853 0 <NA>
## 3 1158 <NA> 1 "Chi~ male NA 0 0 112051 0 <NA>
## 4 414 No 2 "Cun~ male NA 0 0 239853 0 <NA>
## 5 482 No 2 "Fro~ male NA 0 0 239854 0 <NA>
## 6 816 No 1 "Fry~ male NA 0 0 112058 0 B102
## 7 264 No 1 "Har~ male 40 0 0 112059 0 B94
## 8 1264 <NA> 1 "Ism~ male 49 0 0 112058 0 B52 ~
## 9 598 No 3 "Joh~ male 49 0 0 LINE 0 <NA>
## 10 303 No 3 "Joh~ male 19 0 0 LINE 0 <NA>
## 11 733 No 2 "Kni~ male NA 0 0 239855 0 <NA>
## 12 180 No 3 "Leo~ male 36 0 0 LINE 0 <NA>
## 13 278 No 2 "Par~ male NA 0 0 239853 0 <NA>
## 14 634 No 1 "Par~ male NA 0 0 112052 0 <NA>
## 15 823 No 1 "Reu~ male 38 0 0 19972 0 <NA>
## 16 272 Yes 3 "Tor~ male 25 0 0 LINE 0 <NA>
## 17 675 No 2 "Wat~ male NA 0 0 239856 0 <NA>
## # ... with 3 more variables: Embarked <fct>, TicketN <int>, FarePerPass <dbl>
After a little research here, it appears that nine of the passengers who paid $0 were from the Guarantee Group from Harland & Wolff, the shipbuilding firm. They were regarded as crew. Of the others, Reuchlin (PassengerId 823) was sort of an employee. So were Fry, Harrison, and Ismay (816, 264, and 1264). Perhaps the best way to deal with these passengers is to create a new employee variable, Employee (1 = yes, 0 = no), set their FarePerPass to NA, and impute a “market-rate” fare.
full <- full %>%
mutate(
Employee = factor(if_else(Fare == 0, 1, 0)),
FarePerPass = if_else(FarePerPass == 0, as.numeric(NA), FarePerPass)
)I’ll use Mice to impute. You might think the response variable Survived would be useful to include, but 1) it feels cicular to use it, and 2) it is often missing. Anyway, I experimented with and without it, and leaving it out yielded more accurate imputations.
set.seed(2020)
mice_obj <- mice(
full[, !names(full) %in% c("PassengerId", "Name", "Ticket", "Cabin", "Survived")],
method = "rf"
)##
## iter imp variable
## 1 1 Age Embarked FarePerPass
## 1 2 Age Embarked FarePerPass
## 1 3 Age Embarked FarePerPass
## 1 4 Age Embarked FarePerPass
## 1 5 Age Embarked FarePerPass
## 2 1 Age Embarked FarePerPass
## 2 2 Age Embarked FarePerPass
## 2 3 Age Embarked FarePerPass
## 2 4 Age Embarked FarePerPass
## 2 5 Age Embarked FarePerPass
## 3 1 Age Embarked FarePerPass
## 3 2 Age Embarked FarePerPass
## 3 3 Age Embarked FarePerPass
## 3 4 Age Embarked FarePerPass
## 3 5 Age Embarked FarePerPass
## 4 1 Age Embarked FarePerPass
## 4 2 Age Embarked FarePerPass
## 4 3 Age Embarked FarePerPass
## 4 4 Age Embarked FarePerPass
## 4 5 Age Embarked FarePerPass
## 5 1 Age Embarked FarePerPass
## 5 2 Age Embarked FarePerPass
## 5 3 Age Embarked FarePerPass
## 5 4 Age Embarked FarePerPass
## 5 5 Age Embarked FarePerPass
## Warning: Number of logged events: 25
data.frame(full = full$FarePerPass, mice = mice_cmpl$FarePerPass) %>%
replace_na(list(full = 0)) %>%
pivot_longer(cols = c("full", "mice"), names_to = "Data", values_to= "Fare") %>%
ggplot(aes(x = Fare, color = Data)) +
geom_density(na.rm = TRUE) +
theme_minimal() +
scale_x_continuous(labels = scales::dollar) +
theme(legend.position = "top") +
labs(title = "Fare Distribution", color = "")Those 17 fit in invisibly. Assign the imputed values. Recalculate Fare too.
One more look at FarePerPass.
p1 <- full %>% ggplot(aes(x = FarePerPass, color = Pclass)) + geom_density() +
theme_minimal() +
theme(legend.position = "top") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "FarePerPass Distribution")
p2 <- full %>% ggplot(aes(x = Pclass, y = FarePerPass, color = Pclass)) +
geom_boxplot() + geom_jitter(alpha = 0.6) +
theme_minimal() +
theme(legend.position = "top") +
scale_y_continuous(labels = scales::dollar) +
labs(title = "FarePerPass Box Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)Passengers Embarked from three ports (C = Cherbourg, Q = Queenstown, S = Southampton). Two passengers have NA.
## # A tibble: 2 x 15
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <fct> <fct> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 62 Yes 1 Icar~ fema~ 38 0 0 113572 80 B28
## 2 830 Yes 1 Ston~ fema~ 62 0 0 113572 80 B28
## # ... with 4 more variables: Embarked <fct>, TicketN <int>, FarePerPass <dbl>,
## # Employee <fct>
They traveled on the same ticket, so they probably embarked from the same port. They paid $40 per passenger. Where did other passengers fitting this profile embark?
plot_prof <- function(hline, subtitle){
full %>%
filter(!is.na(Embarked)) %>%
ggplot(aes(x = Embarked, y = FarePerPass, color = Pclass)) +
geom_boxplot(na.rm = TRUE) +
geom_hline(yintercept = hline, linetype = 2) +
scale_y_continuous(labels = scales::dollar) +
scale_color_brewer(palette = "Paired") +
labs(
y = "Fare",
title = "Passenger Profiles",
subtitle = subtitle
)
}
plot_prof(40, "First Class passengers paying $80/ticket probably embarked from Cherbourg")Seems like port Cherbourg is a good guess, but it could be any of them. What did mice() predict?
## Embarked Pclass Sex Age SibSp Parch Fare TicketN FarePerPass Employee
## 62 C 1 female 38 0 0 80 2 40 0
## 830 C 1 female 62 0 0 80 2 40 0
mice() thinks Chebourg too.
20% of Age values are missing. I have no good ideas for imputation, so I’ll just go with mice.
data.frame(full = full$Age, mice = mice_cmpl$Age) %>%
pivot_longer(cols = c("full", "mice"), names_to = "Data", values_to= "Age") %>%
ggplot(aes(x = Age, color = Data)) +
geom_density(na.rm = TRUE) +
labs(title = "Age Distribution", color = "")Mice looks great.
Name includes a title and surname. The title may be predictive of survivorship. The surname may be useful for constructing family units (do families sink or swim together?), but I’m not sure how to go about doing that, so I’ll focus on title.
Separate Name into Surname, Title, Name.
full <- full %>%
separate(Name, ", ", into = c("Surname", "Name")) %>%
separate(Name, "\\. ", into = c("Title", "Name"), extra = "merge") %>%
mutate(Title = factor(Title))
full %>% count(Title, sort = TRUE) ## # A tibble: 18 x 2
## Title n
## <fct> <int>
## 1 Mr 757
## 2 Miss 260
## 3 Mrs 197
## 4 Master 61
## 5 Dr 8
## 6 Rev 8
## 7 Col 4
## 8 Major 2
## 9 Mlle 2
## 10 Ms 2
## 11 Capt 1
## 12 Don 1
## 13 Dona 1
## 14 Jonkheer 1
## 15 Lady 1
## 16 Mme 1
## 17 Sir 1
## 18 the Countess 1
Cabin contains both the deck [A-G, T] (T = Tank Top?) and cabin number (deck descriptions here). There are a few rows with multiple cabins, and in a few of these instances the multiple cabins are on multiple decks.
## [1] NA "C85" "C123" "E46"
## [5] "G6" "C103" "D56" "A6"
## [9] "C23 C25 C27" "B78" "D33" "B30"
## [13] "C52" "B28" "C83" "F33"
## [17] "F G73" "E31" "A5" "D10 D12"
## [21] "D26" "C110" "B58 B60" "E101"
## [25] "F E69" "D47" "B86" "F2"
## [29] "C2" "E33" "B19" "A7"
## [33] "C49" "F4" "A32" "B4"
## [37] "B80" "A31" "D36" "D15"
## [41] "C93" "C78" "D35" "C87"
## [45] "B77" "E67" "B94" "C125"
## [49] "C99" "C118" "D7" "A19"
## [53] "B49" "D" "C22 C26" "C106"
## [57] "C65" "E36" "C54" "B57 B59 B63 B66"
## [61] "C7" "E34" "C32" "B18"
## [65] "C124" "C91" "E40" "T"
## [69] "C128" "D37" "B35" "E50"
## [73] "C82" "B96 B98" "E10" "E44"
## [77] "A34" "C104" "C111" "C92"
## [81] "E38" "D21" "E12" "E63"
## [85] "A14" "B37" "C30" "D20"
## [89] "B79" "E25" "D46" "B73"
## [93] "C95" "B38" "B39" "B22"
## [97] "C86" "C70" "A16" "C101"
## [101] "C68" "A10" "E68" "B41"
## [105] "A20" "D19" "D50" "D9"
## [109] "A23" "B50" "A26" "D48"
## [113] "E58" "C126" "B71" "B51 B53 B55"
## [117] "D49" "B5" "B20" "F G63"
## [121] "C62 C64" "E24" "C90" "C45"
## [125] "E8" "B101" "D45" "C46"
## [129] "D30" "E121" "D11" "E77"
## [133] "F38" "B3" "D6" "B82 B84"
## [137] "D17" "A36" "B102" "B69"
## [141] "E49" "C47" "D28" "E17"
## [145] "A24" "C50" "B42" "C148"
## [149] "B45" "B36" "A21" "D34"
## [153] "A9" "C31" "B61" "C53"
## [157] "D43" "C130" "C132" "C55 C57"
## [161] "C116" "F" "A29" "C6"
## [165] "C28" "C51" "C97" "D22"
## [169] "B10" "E45" "E52" "A11"
## [173] "B11" "C80" "C89" "F E46"
## [177] "B26" "F E57" "A18" "E60"
## [181] "E39 E41" "B52 B54 B56" "C39" "B24"
## [185] "D40" "D38" "C105"
I will asssume the first listed cabin contains the deck number.
Here is the distribution of values, grouped by passenger class.
full %>%
tabyl(Deck, Pclass) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("col") %>%
adorn_pct_formatting() %>%
adorn_ns() %>%
adorn_title("combined") %>%
knitr::kable()| Deck/Pclass | 1 | 2 | 3 | Total |
|---|---|---|---|---|
| A | 6.8% (22) | 0.0% (0) | 0.0% (0) | 1.7% (22) |
| B | 20.1% (65) | 0.0% (0) | 0.0% (0) | 5.0% (65) |
| C | 29.1% (94) | 0.0% (0) | 0.0% (0) | 7.2% (94) |
| D | 12.4% (40) | 2.2% (6) | 0.0% (0) | 3.5% (46) |
| E | 10.5% (34) | 1.4% (4) | 0.4% (3) | 3.1% (41) |
| F | 0.0% (0) | 4.7% (13) | 1.1% (8) | 1.6% (21) |
| G | 0.0% (0) | 0.0% (0) | 0.7% (5) | 0.4% (5) |
| T | 0.3% (1) | 0.0% (0) | 0.0% (0) | 0.1% (1) |
| NA | 20.7% (67) | 91.7% (254) | 97.7% (693) | 77.5% (1014) |
| Total | 100.0% (323) | 100.0% (277) | 100.0% (709) | 100.0% (1309) |
There are 8 decks listed. One fellow seems to have been assigned to the tank top, meaning he’s sleeping with the ship boilers.
## # A tibble: 1 x 18
## PassengerId Survived Pclass Surname Title Name Sex Age SibSp Parch Ticket
## <dbl> <fct> <fct> <chr> <fct> <chr> <fct> <dbl> <dbl> <dbl> <chr>
## 1 340 No 1 Blackw~ Mr Step~ male 45 0 0 113784
## # ... with 7 more variables: Fare <dbl>, Cabin <chr>, Embarked <fct>,
## # TicketN <int>, FarePerPass <dbl>, Employee <fct>, Deck <fct>
First class passenger too - that seems wrong. What else… The decks are labeled in descending level above water, so the highest deck is “A” and it would have been the most exclusive. That is evident from the table above - decks A, B, and C are 100% first class. Lots of missing values though. Seems like a good guess that the third-class passengers are on deck F or G. How about average ticket fare by deck?
full %>%
ggplot(aes(x = fct_explicit_na(Deck), y = FarePerPass,
color = as.factor(Pclass))) +
geom_boxplot() +
scale_color_brewer(palette = "Paired") +
theme_minimal() +
scale_y_continuous(labels = scales::dollar) +
theme(legend.position = "top") +
labs(title = "FarePerPass by Deck", color = "Class", x = "Deck", y = "Fare per Passenger")Okay, so what am I going to do with Deck? First, I’ll un-assign the deck from Mr. Tank Top since that just seems wrong, and I can’t have a factor variable level with one row. Then I’ll let mice impute the NAs. After that, I’ll decide if I should collapse factors F and G since G was pretty sparse.
full[340, ]$Deck <- NA
full$Deck = fct_drop(full$Deck)
set.seed(2020)
mice_obj <- mice(
full[, !names(full) %in% c("PassengerId", "Name", "Ticket", "Cabin", "Survived")],
method = "rf"
)##
## iter imp variable
## 1 1 Deck
## 1 2 Deck
## 1 3 Deck
## 1 4 Deck
## 1 5 Deck
## 2 1 Deck
## 2 2 Deck
## 2 3 Deck
## 2 4 Deck
## 2 5 Deck
## 3 1 Deck
## 3 2 Deck
## 3 3 Deck
## 3 4 Deck
## 3 5 Deck
## 4 1 Deck
## 4 2 Deck
## 4 3 Deck
## 4 4 Deck
## 4 5 Deck
## 5 1 Deck
## 5 2 Deck
## 5 3 Deck
## 5 4 Deck
## 5 5 Deck
## Warning: Number of logged events: 26
##
## 1 2 3 Sum
## A 26 11 60 97
## B 80 30 59 169
## C 125 34 97 256
## D 48 72 82 202
## E 43 43 123 209
## F 1 84 254 339
## G 0 3 34 37
## Sum 323 277 709 1309
mice_cmpl %>%
tabyl(Deck, Pclass) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("col") %>%
adorn_pct_formatting() %>%
adorn_ns() %>%
adorn_title("combined") %>%
knitr::kable()| Deck/Pclass | 1 | 2 | 3 | Total |
|---|---|---|---|---|
| A | 8.0% (26) | 4.0% (11) | 8.5% (60) | 7.4% (97) |
| B | 24.8% (80) | 10.8% (30) | 8.3% (59) | 12.9% (169) |
| C | 38.7% (125) | 12.3% (34) | 13.7% (97) | 19.6% (256) |
| D | 14.9% (48) | 26.0% (72) | 11.6% (82) | 15.4% (202) |
| E | 13.3% (43) | 15.5% (43) | 17.3% (123) | 16.0% (209) |
| F | 0.3% (1) | 30.3% (84) | 35.8% (254) | 25.9% (339) |
| G | 0.0% (0) | 1.1% (3) | 4.8% (34) | 2.8% (37) |
| Total | 100.0% (323) | 100.0% (277) | 100.0% (709) | 100.0% (1309) |
Deck G is still sparse, so let’s go ahead and collapse the level into F.
Oh, where did Mr. Tank Top go?
## # A tibble: 1 x 18
## Deck PassengerId Survived Pclass Surname Title Name Sex Age SibSp Parch
## <fct> <dbl> <fct> <fct> <chr> <fct> <chr> <fct> <dbl> <dbl> <dbl>
## 1 A 340 No 1 Blackw~ Mr Step~ male 45 0 0
## # ... with 7 more variables: Ticket <chr>, Fare <dbl>, Cabin <chr>,
## # Embarked <fct>, TicketN <int>, FarePerPass <dbl>, Employee <fct>
Deck B!
One final look.
| Name | full |
| Number of rows | 1309 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| factor | 7 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Surname | 0 | 1.00 | 3 | 22 | 0 | 875 | 0 |
| Name | 0 | 1.00 | 3 | 55 | 0 | 1126 | 0 |
| Ticket | 0 | 1.00 | 3 | 18 | 0 | 929 | 0 |
| Cabin | 1014 | 0.23 | 1 | 15 | 0 | 186 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Survived | 418 | 0.68 | FALSE | 2 | No: 549, Yes: 342 |
| Pclass | 0 | 1.00 | FALSE | 3 | 3: 709, 1: 323, 2: 277 |
| Title | 0 | 1.00 | FALSE | 18 | Mr: 757, Mis: 260, Mrs: 197, Mas: 61 |
| Sex | 0 | 1.00 | FALSE | 2 | mal: 843, fem: 466 |
| Embarked | 0 | 1.00 | FALSE | 3 | S: 914, C: 272, Q: 123 |
| Employee | 0 | 1.00 | FALSE | 2 | 0: 1292, 1: 17 |
| Deck | 0 | 1.00 | FALSE | 6 | F: 376, C: 256, E: 209, D: 202 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1 | 655.00 | 378.02 | 1.00 | 328.00 | 655.00 | 982.00 | 1309.00 | ▇▇▇▇▇ |
| Age | 0 | 1 | 29.46 | 14.09 | 0.17 | 21.00 | 27.00 | 38.00 | 80.00 | ▂▇▃▂▁ |
| SibSp | 0 | 1 | 0.50 | 1.04 | 0.00 | 0.00 | 0.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1 | 0.39 | 0.87 | 0.00 | 0.00 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
| Fare | 0 | 1 | 33.53 | 51.65 | 3.17 | 7.92 | 14.50 | 31.27 | 512.33 | ▇▁▁▁▁ |
| TicketN | 0 | 1 | 2.10 | 1.78 | 1.00 | 1.00 | 1.00 | 3.00 | 11.00 | ▇▁▁▁▁ |
| FarePerPass | 0 | 1 | 14.88 | 13.47 | 3.17 | 7.67 | 8.05 | 15.03 | 128.08 | ▇▁▁▁▁ |