1 Introduction

This is an R approach to the Titanic Exploratory Data Analysis and modelling using the tidyverse packages. In a first step, I will focus on visualisations of the various features and their (inter-related) properties. Later, I will explore the usage of different classifiers.

I’m mainly using dplyr for data manipulation and ggplot2 for visualisation. Since I’m new to both packages, I will add some explanations of how they work. Being in the process of learning these tools myself, I (hopefully) understand which parts are less intuitive.

The script was heavily inspired by the great works of Megan Risdal and others, which I will reference here as the script evolves.

Note: If you’re more interested in Python then you might find my Pytanic kernel useful.

1.1 Load libraries, functions, and data files

I suggest loading all the necessary packages at the top of the script, so that you can keep an overview of what you need:

We use the multiplot function, courtesy of R Cookbooks, to create multi-panel plots.

We also define a helper function to compute 95% binomial confidence limits:
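The helper's code is not shown in this text. As a rough sketch of what it could look like, base R's `binom.test` already returns exact 95% limits. The function name `get_binCI` and the element names `lwr` and `upr` are assumptions for illustration, not necessarily the kernel's own definition.

```r
# Hypothetical helper: exact (Clopper-Pearson) 95% confidence limits
# for x successes out of n trials, taken from stats::binom.test().
get_binCI <- function(x, n) {
  ci <- binom.test(x, n, conf.level = 0.95)$conf.int
  list(lwr = ci[1], upr = ci[2])
}

# Example: 342 survivors among the 891 training passengers
ci <- get_binCI(342, 891)
ci
```

Returning a named list makes it easy to use the limits as error bars in a plot later on.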

Load the data:

We are using readr’s read_csv function to read in the data sets, instead of the default read.csv. This helps to make our data work a bit better with dplyr and friends (and computes a bit faster, although not as fast as data.table’s fread, which you want to use for really large files).

Ironically (since one of the things read_csv is good at is not converting strings to factors automatically but storing them as characters), we decide to convert Sex, Pclass, Embarked, and Survived to factors. This better represents the different levels that the values of these features take. (We do this transformation using mutate and the pipe %>%, both of which we will discuss in more detail below.)

We then combine the train and test data sets in case we want to have a closer look at the overall distributions.
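The steps above (factor conversion with mutate and the pipe, then stacking both sets) can be sketched as follows. This is a minimal illustration on invented toy tibbles, since the actual file paths and chunk code are not part of this text. The object name `combine` follows the wording used later in the kernel.

```r
library(dplyr)

# Toy stand-ins for the two data sets. In the kernel they would come from
# readr::read_csv(); the file paths are not given in the text.
train <- tibble(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 2),
  Sex         = c("male", "female", "female", "male"),
  Embarked    = c("S", "C", "S", "Q")
)
test <- tibble(
  PassengerId = 5:6,
  Pclass      = c(3, 1),
  Sex         = c("male", "female"),
  Embarked    = c("Q", "S")
)

# Convert the categorical columns to factors with mutate() and the pipe
train <- train %>%
  mutate(Survived = factor(Survived), Pclass = factor(Pclass),
         Sex = factor(Sex), Embarked = factor(Embarked))
test <- test %>%
  mutate(Pclass = factor(Pclass), Sex = factor(Sex), Embarked = factor(Embarked))

# Stack both sets; Survived is filled with NA for the test rows
combine <- bind_rows(train, test)
```

Because the test set has no Survived column, the number of NAs in `combine$Survived` equals the number of test rows.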

1.2 Data overview

##   PassengerId   Survived   Pclass      Name               Sex     
##  Min.   :   1   0   :549   1:323   Length:1309        female:466  
##  1st Qu.: 328   1   :342   2:277   Class :character   male  :843  
##  Median : 655   NA's:418   3:709   Mode  :character               
##  Mean   : 655                                                     
##  3rd Qu.: 982                                                     
##  Max.   :1309                                                     
##                                                                   
##       Age            SibSp            Parch          Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.000   Length:1309       
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000   Class :character  
##  Median :28.00   Median :0.0000   Median :0.000   Mode  :character  
##  Mean   :29.88   Mean   :0.4989   Mean   :0.385                     
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000                     
##  Max.   :80.00   Max.   :8.0000   Max.   :9.000                     
##  NA's   :263                                                        
##       Fare            Cabin           Embarked  
##  Min.   :  0.000   Length:1309        C   :270  
##  1st Qu.:  7.896   Class :character   Q   :123  
##  Median : 14.454   Mode  :character   S   :914  
##  Mean   : 33.295                      NA's:  2  
##  3rd Qu.: 31.275                                
##  Max.   :512.329                                
##  NA's   :1

The summary function gives us an overview of the different feature columns, their type (character, numerical) and basic distribution information. We also see that the features Age, Fare, and Embarked have missing values, and that there is a large range in Fare. Naturally, Survived is missing for all test data rows. (This summary does not show whether there are missing values in the character features.)

## Rows: 1,309
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex         <fct> male, female, female, female, male, male, male, male, f...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...

The aptly named dplyr function glimpse allows us to get a quick impression of the data we are dealing with. Together, summary and glimpse are an effective first exploration step.

Together with the PassengerId, which is just a running index, and the indication whether this passenger survived (1) or not (0), we have the following information for each person:

  • Pclass is the ticket class: first (1), second (2), and third (3) class tickets were used. We turned it into a factor.

  • Name is the name of the passenger. The names also contain titles, and some persons might share the same surname, indicating family relations. We know that some titles can indicate a certain age group. For instance, Master is a boy while Mr is a man. This feature is a character string of variable length but similar format.

  • Sex is an indicator whether the passenger was female or male. This is another factor we created from a categorical text string.

  • Age is the age of the passenger in years. Most values are whole numbers, but a few are fractional (e.g. 0.17 or 60.5), so the column is stored as a double. There are NA values in this column.

  • SibSp is an ordinal integer feature describing the number of siblings or spouses travelling with each passenger.

  • Parch is another ordinal integer feature that gives the number of parents or children travelling with each passenger.

  • Ticket is a character string of variable length that gives the ticket number.

  • Fare is a float feature showing how much each passenger paid for their rather memorable journey.

  • Cabin gives the cabin number of each passenger. This is another string feature.

  • Embarked shows the port of embarkation as a categorical factor.

In summary, we have 1 floating point feature (Fare), 1 mostly integer-valued numerical feature (Age), 2 ordinal integer features (SibSp, Parch), 3 categorical factors (Sex, Pclass, Embarked), and 3 text string features (Ticket, Cabin, Name).

The factor encoding immediately allows us to see that the testing data set has 418 rows (NAs in Survived), that there were almost twice as many male passengers as female ones, and that a bit less than 40% of the training passengers survived. The latter can also be extracted as follows:
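A chain along these lines would produce such a table. This is a sketch on a toy tibble that only reproduces the two survival counts quoted in this text; the kernel's own chunk is not shown here.

```r
library(dplyr)

# Toy stand-in for train that reproduces the survival counts quoted in the text
train <- tibble(Survived = factor(rep(c(0, 1), times = c(549, 342))))

# Number of rows per level of Survived
surv_counts <- train %>% count(Survived)
surv_counts
```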

## # A tibble: 2 x 2
##   Survived     n
##   <fct>    <int>
## 1 0          549
## 2 1          342

Here we also have the pipe, which uses the symbol %>%. The simplest way of thinking about this powerful tool is that it passes the output of one operation as an input to the next one. It’s a bit similar to the “.” in Python/pandas and even more similar to the pipe “|” in Unix shell scripting, in case this helps anyone. The pipe allows you to build your code in a modular way that is easy to extend and to adjust to different applications.

The next code block is simply extracting the numbers of survivors and non-survivors for the example afterwards:

Now we will see how R code can be directly included in the markdown text: `r surv/(surv+nosurv)*100`. This allows us to insert simple calculations in the narrative flow, such as the fact that 38.4 percent of passengers survived the disaster.
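To make the idea of the pipe concrete, here is a small sketch that writes the same computation twice: once as nested function calls and once as a piped chain. The Age values are the first few from the glimpse output; the object names are chosen only for this illustration.

```r
library(dplyr)

# First Age values from the glimpse output, including one NA
ages <- tibble(Age = c(22, 38, 26, 35, 35, NA))

# Nested call: has to be read from the inside out
nested <- summarise(filter(ages, !is.na(Age)), mean_age = mean(Age))

# Piped version: reads top to bottom, one step per line
piped <- ages %>%
  filter(!is.na(Age)) %>%
  summarise(mean_age = mean(Age))
```

Both versions give the same result. The piped form is easier to extend, because a new step is simply one more line at the end of the chain.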
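One way to extract these two numbers is sketched below. The names `surv` and `nosurv` are the ones used in the inline expression that follows; how the kernel actually computes them is not shown in this text, so the chain is an assumption. The toy tibble only reproduces the known survival counts.

```r
library(dplyr)

# Toy stand-in for train with the survival counts from the summary output
train <- tibble(Survived = factor(rep(c(0, 1), times = c(549, 342))))

surv   <- train %>% filter(Survived == "1") %>% nrow()
nosurv <- train %>% filter(Survived == "0") %>% nrow()

# Survival percentage, rounded to one decimal place
surv_pct <- round(surv / (surv + nosurv) * 100, 1)
surv_pct
```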

1.3 More about missing values

Knowing about missing values is important because they indicate how much we don’t know about our data. Making inferences based on just a few cases is often unwise. In addition, many modelling procedures break down when missing values are involved and the corresponding rows will either have to be removed completely or the values need to be estimated somehow.

The next plot was inspired by this well-organised kernel. What we see are the different combinations of missing values for the individual features. For instance, there are 529 NA’s in Cabin alone, 158 in Cabin and Age simultaneously, 1 in Fare, and so on.

Fig. 1


## 
##  Variables sorted by number of missings: 
##     Variable Count
##        Cabin  1014
##     Survived   418
##          Age   263
##     Embarked     2
##         Fare     1
##  PassengerId     0
##       Pclass     0
##         Name     0
##          Sex     0
##        SibSp     0
##        Parch     0
##       Ticket     0

For the two text features, Cabin and Ticket, we use a neat property of boolean vectors (which also works in Python):

## [1] 0
## [1] 1014

The function is.na determines which of the elements are missing and gives a true/false output. In numerical functions like sum, “true” is always represented as “1” and “false” as “0”. This way, we see immediately that all the Ticket information is complete but the vast majority of the Cabin numbers are missing.
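The counting trick described above can be tried on a small vector. This sketch uses the first few Cabin values from the glimpse output rather than the full column.

```r
# First Cabin values from the glimpse output
cabin <- c(NA, "C85", NA, "C123", NA, NA, "E46")

is.na(cabin)                     # logical vector: TRUE where the value is missing
n_missing <- sum(is.na(cabin))   # TRUE counts as 1, so this is the number of NAs
frac_missing <- mean(is.na(cabin))  # the mean gives the fraction of missing values
```

The same pattern with `mean` instead of `sum` is a handy way to express missingness as a proportion.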

2 Initial Exploration / Visualisation

Look at your data in as many different ways as possible. Some properties and connections will be immediately obvious. Others will require you to examine the data, or parts of it, in more specific ways.

2.1 Individual features

The ggplot2 approach, introduced by Hadley Wickham, uses a common ‘grammar’ to describe a large variety of plotting functions. This style contains the following building blocks:

  • data: what we are plotting (the input)

  • aesthetics: where we are plotting it (assignment of representation)

  • geoms: how we are plotting it (the plotting style)

These blocks are the same for any kind of plot, which makes it easy to switch from one visualisation to another once you’ve understood the basic principle. Geom layers can be added on top of one another to build more complex plots from these simple elements. This kernel contains a number of plotting examples for you to play with.
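As a minimal sketch of these building blocks, the following builds two plots from the same data and aesthetics and only swaps the geom. The toy data are the first rows of the glimpse output with known Age; the choice of geoms and styling is an assumption for illustration and not the kernel's actual plot.

```r
library(ggplot2)

# First rows of the glimpse output, with the missing Age dropped
toy <- data.frame(
  Age      = c(22, 38, 26, 35, 35, 54, 2, 27, 14, 4),
  Survived = factor(c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1))
)

p_hist <- ggplot(toy, aes(x = Age, fill = Survived)) +  # data + aesthetics
  geom_histogram(binwidth = 5, alpha = 0.7) +           # geom: how to draw it
  theme(legend.position = "bottom")                     # styling only

# A different geom on the same data and aesthetics
p_density <- ggplot(toy, aes(x = Age, fill = Survived)) +
  geom_density(alpha = 0.5)
```

Printing `p_hist` or `p_density` draws the plot. Switching from one to the other changes a single line, which is the main appeal of the grammar.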

We start with a relatively complex overview plot for which we create a dashboard-like view that shows the survival distribution in all the accessible features (i.e. without the text ones). However, the most complicated thing here is the multiplot functionality, which as a beginner you can ignore for the moment and come back to later. Also, the stuff with guides and theme is just styling. The important things happen in the first (data) and second (geom + aesthetics) line.

Fig. 2


There’s a lot going on in this figure, so take your time to look at all the details.

We learn the following things from studying the individual features:

  • Age: The medians are identical (see below). However, it’s noticeable that fewer young adults survived (ages 18 to 30-ish), whereas children younger than 10-ish had a better survival rate. Also, there are no obvious outliers that would indicate problematic input data. The highest ages are consistent with the overall distribution. There is a notable shortage of teenagers compared to the crowd of younger kids, but this could have natural reasons. Here we choose a small binwidth for the plotted graph to emphasise that the data is not smooth. Later, we will use density plots to study the behaviour of Age (and Fare) on a more global scale.

  • Pclass: There’s a clear trend that being a 1st class passenger gives you better chances of survival. Life just isn’t fair.

  • SibSp & Parch: Having 1-3 siblings/spouses/parents/children on board (SibSp = 1-2, Parch = 1-3) suggests proportionally better survival numbers than being alone (SibSp + Parch = 0) or having a large family travelling with you.

  • Embarked: Well, that does look more interesting than expected. Embarking at “C” resulted in a higher survival rate than embarking at “S”. There might be a correlation with other variables here, though.

  • Fare: This is a case where a linear scaling isn’t of much help because there is a small number of extreme values. A natural choice in this case is to use a logarithmic axis. The plot tells us that the survival chances were much lower for the cheaper cabins (i.e. the big red spike that’s not mirrored by a blue spike). Naively, one would assume that those cheap cabins were mostly located deeper inside the ship, i.e. further away from the life boats.

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Survived median_age
##   <fct>         <dbl>
## 1 0                28
## 2 1                28

The tidyverse ecosystem, also championed by Wickham, is a collection of packages which share a common ‘language’ to display and modify data. The tidy in tidyverse refers to the convention that the input data should be ‘well behaved’; i.e. presented in a rectangular table with columns corresponding to different features and all rows being completely filled with observations for the features. The data we’re working with here is pretty tidy already, but that won’t always be the case (see the tidyr package for these messy cases).

The dplyr package provides tools to work with tidy data. Together with the pipe, “%>%”, we can chain together individual dplyr operations in a clean and powerful way. In the example above, we pass a data frame (train) to be grouped (group_by) along one (or more) feature(s) and then compute the median age for these groups. Similarly, we can count occurrences:
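The two chains described here might look roughly like this. The toy tibble reproduces the Survived/Sex counts shown in the table, while the Age values are invented, so only the counts are meaningful.

```r
library(dplyr)

# Toy stand-in for train: the Survived/Sex counts match the table in the text,
# the ages are made up for illustration
train <- tibble(
  Survived = factor(rep(c(0, 0, 1, 1), times = c(81, 468, 233, 109))),
  Sex      = factor(rep(c("female", "male", "female", "male"),
                        times = c(81, 468, 233, 109))),
  Age      = rep(c(20, 28, 36), length.out = 891)
)

# Median age per survival group
median_ages <- train %>%
  group_by(Survived) %>%
  summarise(median_age = median(Age, na.rm = TRUE))

# Number of passengers for each Survived/Sex combination
sex_counts <- train %>%
  group_by(Survived, Sex) %>%
  count()
```

Note the `na.rm = TRUE` in the median: without it, a single missing Age in a group would turn that group's median into NA.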

## # A tibble: 4 x 3
## # Groups:   Survived, Sex [4]
##   Survived Sex        n
##   <fct>    <fct>  <int>
## 1 0        female    81
## 2 0        male     468
## 3 1        female   233
## 4 1        male     109

The dplyr function mutate allows us to add new columns to data frames (and assign column names) in the following, straightforward way:
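A sketch of such a chain for the SibSp feature is given below. The toy tibble only reproduces the split of 608 passengers with SibSp == 0 versus 283 with SibSp > 0 that is quoted in the output; the exact chunk used in the kernel is not shown, so the grouping details may differ.

```r
library(dplyr)

# Toy stand-in for train with the SibSp split quoted in the text
train <- tibble(SibSp = rep(c(0, 1), times = c(608, 283)))

single_freq <- train %>%
  mutate(single = SibSp == 0) %>%   # new logical column
  count(single) %>%                 # rows per value of single
  mutate(freq = n / sum(n))         # new column: share of all passengers
```

The first mutate adds a column to the original data, the second adds one to the summary table. In both cases the new column name is simply the left-hand side of the `=`.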

## # A tibble: 2 x 3
## # Groups:   single [2]
##   single     n  freq
##   <lgl>  <int> <dbl>
## 1 FALSE    283 0.318
## 2 TRUE     608 0.682
## # A tibble: 2 x 3
## # Groups:   single [2]
##   single     n  freq
##   <lgl>  <int> <dbl>
## 1 FALSE    213 0.239
## 2 TRUE     678 0.761

This shows us that 32% of passengers had siblings or spouses on board and, independently of that, 24% had parents/children on board.

2.2 Feature relations

2.2.1 Correlation overview

After inspecting the available features individually you might have realised that some of them are likely to be connected. Does the age-dependent survival change with sex? How are Pclass and Fare related? Are they strongly enough connected so that one of them is superfluous? Let’s find out.

We start with visualising the correlation matrix of the numerical and categorical variables we identified above. For this process, we use our first “proper” dplyr-style chain of commands. First, we remove the PassengerId and the three text features. This is accomplished using the select function. Intuitively, a minus sign in front of a column name removes that column. Then we use dplyr’s mutate to change the columns: we recode the Sex factor as integers and then convert all the factors to integer values. The output data frame is passed to the standard cor function to compute the correlation matrix.

The visualisation uses the corrplot function from the eponymous (one of my favourite words!) package. Corrplot gives us great flexibility in manipulating the style of our plot. What we see are the correlation coefficients for each combination of two features. In simplest terms: this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero, the weaker the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation, respectively.

Here, we are of course interested in whether features correlate with the “Survived” variable, since this is what we ultimately want to predict. But we also want to know whether our potential predictors are correlated among each other, so that we can reduce the redundancy in our data set and improve the accuracy of our prediction.
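The chain described above can be sketched as follows. The toy tibble is invented and only mimics the column layout of train; the use of `if_else`, `mutate_if`, and `use = "complete.obs"` are assumptions about details the text does not spell out. The corrplot step is left out here.

```r
library(dplyr)

# Toy stand-in for train (values invented for illustration)
train <- tibble(
  PassengerId = 1:8,
  Survived = factor(c(0, 1, 1, 1, 0, 0, 1, 0)),
  Pclass   = factor(c(3, 1, 3, 1, 3, 2, 2, 3)),
  Name     = paste("Passenger", 1:8),
  Sex      = factor(c("male", "female", "female", "female",
                      "male", "male", "female", "male")),
  Age      = c(22, 38, 26, 35, 35, 54, 27, 14),
  SibSp    = c(1, 1, 0, 1, 0, 0, 0, 3),
  Parch    = c(0, 0, 0, 0, 0, 1, 2, 0),
  Ticket   = paste0("T", 1:8),
  Fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 21.08),
  Cabin    = NA_character_,
  Embarked = factor(c("S", "C", "S", "S", "S", "Q", "S", "S"))
)

cor_mat <- train %>%
  select(-PassengerId, -Name, -Ticket, -Cabin) %>%   # drop the id and the text columns
  mutate(Sex = if_else(Sex == "male", 1L, 0L)) %>%   # recode Sex as integers
  mutate_if(is.factor, as.integer) %>%               # remaining factors -> level codes
  cor(use = "complete.obs")                          # Pearson correlation matrix

round(cor_mat, 2)
```

Converting a factor with `as.integer` returns its level codes, which is fine for an ordered feature like Pclass but gives an arbitrary ordering for Embarked, so those coefficients should be read with care.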

Fig. 3


In this kind of plot we want to look for the bright, large circles which immediately show the strong correlations (size and shading depend on the absolute values of the coefficients; colour depends on direction). Anything that you would have to squint to see is usually not worth seeing. We see the following:

  • Survived correlates most strongly with Sex, and then with Pclass. Fare and Embarked might play a secondary role; the other features are pretty weak

  • Fare and Pclass are strongly related (1st-class cabins will be more expensive)

  • A correlation of SibSp and Parch makes intuitive sense (both indicate family size)

  • Pclass and Age seem related (richer people are on average older? not inconceivable)

We take this overview plot as a starting point to investigate specific multi-feature comparisons in the following. Those examinations will likely result in more questions, which we can also examine (to a certain extent) in the same step. Here, we stop at defining new features which will be the subject of another section.

When conducting your detailed studies of individual features it is useful to set out a (preliminary) plan that you want to be following to avoid getting distracted. Here we set the following targets, based on the correlation overview:

  • Pclass vs Fare

  • Pclass vs Age

  • Pclass vs Embarked

  • Sex vs Parch

  • Age vs SibSp

  • SibSp vs Parch

  • Fare vs Embarked

We won’t necessarily investigate them in this specific order, but it’s good to have a list we can come back to and check what still needs to be done.

2.2.2 Multi-feature comparisons

Now we continue to examine these initial indications in more detail. Earlier, we had a look at the Survived statistics of the individual features in Fig. 2. Here, we want to look at correlations between the predictor features and how they could affect the target Survived behaviour.

Usually it’s most interesting to start with the strong signals in the correlation plot and to examine them more in detail.

2.2.2.1 Pclass vs Fare

To compare a categorical variable like Pclass with a continuous variable like Fare there are several useful visualisations. We will use a boxplot here, and try some of the other ones later. Note the logarithmic y-axis, which we add using the scale function:

Fig. 4


In a boxplot, we display the median value (inside the box), the 1st and 3rd quartiles (lower and upper hinges), and the outliers (individual data points). An outlier is any point that is further than 1.5 times the distance between the 1st and 3rd quartile (the inter-quartile range) away from the hinge.
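The outlier rule can be worked through by hand in base R. The fares below are invented, and `quantile` with its default settings is used as a stand-in for the hinges, so this is an illustration of the rule rather than the exact computation behind the plot.

```r
# Toy fares (invented) with one extreme value
fare <- c(7.25, 7.90, 8.05, 13.00, 26.00, 30.00, 512.33)

q1  <- quantile(fare, 0.25, names = FALSE)  # 1st quartile (lower hinge)
q3  <- quantile(fare, 0.75, names = FALSE)  # 3rd quartile (upper hinge)
iqr <- q3 - q1                              # inter-quartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points beyond the fences would be drawn as individual outliers
outliers <- fare[fare < lower_fence | fare > upper_fence]
outliers
```

Here only the largest fare lies beyond the upper fence, so it is the single point that would be drawn separately.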

We find:

  • The different Pclass categories are clustered around different average levels of Fare. This is not very surprising, as 1st class tickets are usually more expensive than 3rd class ones.

  • In 2nd Pclass, and especially in 1st, the median Fare for the Survived == 1 passengers is notably higher than for those who died. This suggests that there is a sub-division into more/less expensive cabins (i.e. closer/further from the life boats) even within each Pclass.

It’s certainly worth having a closer look at the Fare distributions depending on Pclass. In order to contrast similar plots for different factor feature levels, ggplot2 offers the facet mechanism. Adding facets to a plot allows us to separate it along one or two factor variables. Naturally, this visualisation approach works best for relatively small numbers of factor levels.
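A facetted plot of this kind could be sketched as follows. The toy data are invented, and the choice of a density geom, the log axis, and free y scales are assumptions for illustration; the kernel's actual plotting code is not shown in this text.

```r
library(ggplot2)

# Toy data (invented) standing in for train
toy <- data.frame(
  Fare     = c(7.25, 8.05, 7.90, 15.50, 13.00, 10.50,
               26.00, 21.00, 30.00, 53.10, 71.28, 151.55),
  Pclass   = factor(rep(c(3, 2, 1), each = 4)),
  Survived = factor(rep(c(0, 0, 1, 1), times = 3))
)

p_facet <- ggplot(toy, aes(x = Fare, fill = Survived)) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +                         # logarithmic Fare axis
  facet_wrap(~ Pclass, scales = "free_y")   # one panel per Pclass level
```

`facet_wrap` splits along a single factor, while `facet_grid(rows ~ cols)` arranges panels along two factors.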

Fig. 5


We learn:

  • There is a surprisingly broad distribution among the 1st class passenger fares

  • There’s an interesting bimodality in the 2nd class cabins and a long tail in the 3rd class ones

  • For each class there is strong evidence that the cheaper cabins were worse for survival. A similar effect can be seen in a boxplot:

2.2.2.2 Pclass vs Embarked

First, we plot the frequency of the Embarked ports for the different Pclass factors and add a facet to split by the Survived factor:

Fig. 6


We find:

  • Embarked == Q contains almost exclusively 3rd class passengers

  • The survival chances for 1st class passengers are better for every port. In contrast, the chances for the 2nd class passengers were relatively worse for Embarked == S whereas the frequencies for Embarked == C look comparable.

  • 3rd class passengers had bad chances everywhere, but the relative difference for Embarked == S looks particularly strong.

2.2.2.3 Pclass vs Age and multi-dimensional plots

By now, we’ve got some practice with simple plots and facetted ones, so let’s get a bit more adventurous:

We will plot Age vs Fare and then facet by 2 variables, Embarked and Pclass, to create a grid. In addition, we use different colours for the Survived status and different symbols for Sex. The result is a comprehensive overview plot for the relationship between many of the main features:
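Such a grid can be built by mapping the extra features to additional aesthetics and facets. The following is a sketch on invented toy data; the orientation of the grid (Embarked in rows, Pclass in columns) is an assumption.

```r
library(ggplot2)

# Toy data (invented) standing in for train
toy <- data.frame(
  Age      = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare     = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Survived = factor(c(0, 1, 1, 1, 0, 0, 1, 1)),
  Sex      = factor(c("male", "female", "female", "female",
                      "male", "male", "female", "female")),
  Pclass   = factor(c(3, 1, 3, 1, 1, 3, 3, 2)),
  Embarked = factor(c("S", "C", "S", "S", "S", "S", "S", "C"))
)

p_grid <- ggplot(toy, aes(x = Age, y = Fare,
                          colour = Survived,   # colour encodes survival
                          shape  = Sex)) +     # symbol encodes sex
  geom_point() +
  facet_grid(Embarked ~ Pclass)                # rows: Embarked, columns: Pclass
```

Each additional aesthetic or facet adds one more feature to the same picture, which is what makes this plot so dense.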

Fig. 7


We find:

  • Pclass == 1 passengers seem indeed on average older than those in 3rd (and maybe 2nd) class. Not many children seem to have travelled 1st class.

  • Most Pclass == 2 children appear to have survived, regardless of Sex

  • More men than women seem to have travelled 3rd Pclass, whereas for 1st Pclass the ratio looks comparable. Note that those are only the ones for which we know the Age, which might introduce a systematic bias.

Admittedly, there’s a lot going on in this plot, but the difference between the upper left and lower right corner is striking. Armed with these insights, we can study individual relations in more detail.

2.2.2.4 Age vs Sex

This wasn’t in our original list, but the multi-facet plot above prompts us to examine the interplay between Age and Sex more closely. Feel free to follow interesting signals in this exploratory stage, but keep the big picture in mind.

Here we are using a density plot with colour overlap and facetting:

Fig. 8


We find:

  • The peaks of the distributions are similar, but some substructures are different.

  • Younger boys had a notable survival advantage over male teenagers, whereas the same was not true for girls to nearly the same extent.

  • Most women over 60 survived, whereas for men the high-Age tail of the distribution falls off more slowly.

  • Note that those are normalised densities, not histogrammed numbers. Also, remember that Age contains many missing values.

2.2.2.5 Pclass vs Sex

From Megan’s [super-popular kernel](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic) we borrow the idea of using a mosaic plot visualisation, because it’s insightful and looks fancy at the same time. Here we apply it to the Survived statistics of Pclass vs Sex:

Fig. 9


In a mosaic plot, the size of the boxes corresponds to the number of observations contained in their specific categories. Here we also use a colour coding corresponding to a statistical test that tells us whether some boxes are more (blue) or less (red) populated than when assuming independent distributions for the categories. The overall separation is between Survived == 0 (left) and Survived == 1 (right). Within these categories, the Pclass values are split horizontally and the Sex categories are split vertically.

We find:

  • Almost all females that died were 3rd Pclass passengers.

  • For the males, being in 3rd Pclass resulted in a significant disadvantage regarding their Survived status.

2.2.2.6 Parch vs SibSp

Next, we will have a closer look at the family relation features Parch (number of parents or children on board) and SibSp (number of siblings or spouses on board). In order to see how many cases there were for each combination we will use a count plot:

Fig. 10


Here, the size of the circles is proportional to the number of cases. The colours show which Survived status dominates (“1” is plotted over “0”, as you can see at Parch == SibSp == 0).

We find:

  • A large number of passengers were travelling alone.

  • Passengers with the largest number of parents/children had relatively few siblings on board.

  • Survival was better for smaller families, but not for passengers travelling alone.

2.2.2.7 Parch vs Sex

Another correlation that piqued our interest in the overview plot was the one between Parch and Sex. Here we examine it in more detail using a barplot:

Fig. 11


We find:

  • Many more men travelled without parents or children than women did. The difference might look small here but that’s because of the logarithmic y-axis.

  • The log axis helps us to examine the less frequent Parch levels in more detail: Parch == 2,3 still look comparable. Beyond that, it seems that women were somewhat more likely to travel with more relatives. However, beware of small numbers:

## # A tibble: 13 x 3
## # Groups:   Parch, Sex [13]
##    Parch Sex        n
##    <dbl> <fct>  <int>
##  1     0 female   194
##  2     0 male     484
##  3     1 female    60
##  4     1 male      58
##  5     2 female    49
##  6     2 male      31
##  7     3 female     4
##  8     3 male       1
##  9     4 female     2
## 10     4 male       2
## 11     5 female     4
## 12     5 male       1
## 13     6 female     1

The difference between 4 women and 1 man with Parch == 3 is close to being significant, as a simple binomial test will readily tell us. Here we test whether a finding of 1 of 5 passengers being male would still be expected given the overall ratio of men to women:
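This test is available in base R as `binom.test`. In the sketch below, the overall male fraction is taken from the Survived/Sex count table shown earlier (468 + 109 = 577 men among 891 training passengers), which matches the hypothesised probability of 0.647587 in the output.

```r
# Overall fraction of male passengers in the training data
p_male <- 577 / 891

# Is 1 male out of 5 passengers compatible with that overall fraction?
res <- binom.test(x = 1, n = 5, p = p_male)
res$p.value
```

By default the test is two-sided, i.e. it asks whether the observed share of men is either too low or too high compared to `p_male`.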

## 
##  Exact binomial test
## 
## data:  1 and 5
## number of successes = 1, number of trials = 5, p-value = 0.05538
## alternative hypothesis: true probability of success is not equal to 0.647587
## 95 percent confidence interval:
##  0.005050763 0.716417936
## sample estimates:
## probability of success 
##                    0.2

A p-value of just above 5% does not normally count as significant. And even if it were just below 5%, there are several other variables here that could influence the statistics. Therefore, it’s better to look at the larger numbers for a useful signal.

2.2.2.8 Age vs SibSp

The final correlation we noticed was between the Age and SibSp features. Naively, one would expect that a larger number of siblings would indicate a younger age; i.e. families with several kids travelling together. (Larger numbers of spouses would be unusual.) Let’s see whether the data confirms our idea:

Fig. 12


We find:

  • The highest SibSp values (4 and 5) are indeed associated with a narrower distribution peaking at a lower Age. These are most likely groups of children from large families.

  • This will lead to a certain degree of interaction between Age and SibSp with respect to the impact on the Survived status. It might also allow us to predict Age from SibSp with a relatively decent accuracy for the higher SibSp values.

2.3 Missing values imputation

After studying the relations between the different features let’s fill in a few missing values based on what we learned.

In my opinion, the only training feature for which it makes sense to fill in the NAs is Embarked. Too many Cabin numbers are missing. And for Age we will choose a different approach below. We fill in the 1 missing Fare value in the test data frame accordingly.

We are performing these imputations on the combined data set, which we will also use as a basis for the step thereafter.

Let’s find the two passengers and assign the most likely port based on what we found so far:

## # A tibble: 2 x 12
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl> <fct>    <fct>  <chr> <fct> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1          62 1        1      Icar~ fema~    38     0     0 113572    80 B28  
## 2         830 1        1      Ston~ fema~    62     0     0 113572    80 B28  
## # ... with 1 more variable: Embarked <fct>

These are two women who travelled together in 1st class, were 38 and 62 years old, and had no family on board.

## `summarise()` regrouping output by 'Embarked', 'Pclass', 'Sex', 'Parch' (override with `.groups` argument)
## # A tibble: 16 x 6
## # Groups:   Embarked, Pclass, Sex, Parch [8]
##    Embarked Pclass Sex    Parch SibSp count
##    <fct>    <fct>  <fct>  <dbl> <dbl> <int>
##  1 C        1      female     0     0    30
##  2 C        1      female     0     1    20
##  3 C        1      female     1     0    10
##  4 C        1      female     1     1     6
##  5 C        1      female     2     0     2
##  6 C        1      female     2     2     2
##  7 C        1      female     3     1     1
##  8 S        1      female     0     0    20
##  9 S        1      female     0     1    20
## 10 S        1      female     0     2     3
## 11 S        1      female     1     0     7
## 12 S        1      female     1     1     6
## 13 S        1      female     2     0     4
## 14 S        1      female     2     1     5
## 15 S        1      female     2     3     3
## 16 S        1      female     4     1     1

Admittedly, these are quite a few grouping levels, but 30 (“C”) vs 20 (“S”) are numbers that are still large enough to be useful in this context. In addition, even a grouping without the Parch and SibSp features suggests similar numbers for women in 1st class embarking from “C” (71) vs “S” (69) (in contrast to the larger overall number of all 1st class passengers leaving from “S”).

Another kernel (definitely worth checking out) makes a convincing case for predicting Embarked == “S” for these two passengers (see also the comments). However, in my opinion we have better reasons to impute “C” instead. I recommend that you weigh the arguments and make your own decision.

(How much does it actually matter? Well, in the big picture these are only 2 passengers and their impact on our model accuracy won’t be large. However, since the main point of this challenge is to practice data analysis, it is certainly worth taking your time to examine the question in a bit more detail.)

Practically, we can replace values using dplyr’s mutate and a handy case_when statement, since we are modifying both NA’s to the same value:
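A replacement of this kind might look as follows. The toy tibble is invented; converting the factor to character and back is one possible way to keep `case_when` working on a single type and is an assumption about the implementation.

```r
library(dplyr)

# Toy stand-in for the combined data set (values invented)
combine <- tibble(
  PassengerId = 1:5,
  Embarked    = factor(c("S", NA, "C", "Q", NA))
)

combine <- combine %>%
  mutate(Embarked = as.character(Embarked)) %>%   # case_when needs one common type
  mutate(Embarked = case_when(
    is.na(Embarked) ~ "C",        # fill both NAs with the same port
    TRUE            ~ Embarked    # keep everything else unchanged
  )) %>%
  mutate(Embarked = as.factor(Embarked))          # back to a factor
```

`case_when` evaluates its conditions in order, so the final `TRUE` acts as the "otherwise" branch.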

Next, this is the passenger for whom Fare is missing:

## # A tibble: 1 x 12
##   PassengerId Survived Pclass Name               Sex     Age SibSp Parch Ticket
##         <dbl> <fct>    <fct>  <chr>              <fct> <dbl> <dbl> <dbl> <chr> 
## 1        1044 <NA>     3      Storey, Mr. Thomas male   60.5     0     0 3701  
##    Fare Cabin Embarked
##   <dbl> <chr> <fct>   
## 1    NA <NA>  S

A 60-yr old 3rd class passenger without family on board. We will base our Fare prediction on the median of the 3rd-class fares:
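The median-based imputation can be sketched like this. The toy tibble and the name `med_fare_3rd` are invented for illustration; the kernel's own chunk is not shown in this text.

```r
library(dplyr)

# Toy stand-in for the combined data set (values invented)
combine <- tibble(
  PassengerId = 1:6,
  Pclass      = factor(c(3, 3, 3, 1, 3, 2)),
  Fare        = c(7.25, 8.05, NA, 71.28, 7.90, 13.00)
)

# Median fare of the 3rd-class passengers, ignoring the missing value
med_fare_3rd <- combine %>%
  filter(Pclass == "3") %>%
  summarise(med_fare = median(Fare, na.rm = TRUE)) %>%
  pull(med_fare)

# Replace the missing Fare with that median
combine <- combine %>%
  mutate(Fare = if_else(is.na(Fare), med_fare_3rd, Fare))
```

The median is preferred over the mean here because Fare has a long tail of expensive tickets that would pull the mean upwards.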

## `summarise()` ungrouping output (override with `.groups` argument)

This concludes the missing values imputation. We will deal with the Age feature in a different way.

3 Derived (engineered) features

The next idea is to define new features based on the existing ones that allow for a split into survived/not-survived with higher confidence than the existing features. An example would be “rich woman” vs “poor man”, but this particular distinction should be handled well by most classifiers. We’re looking for something a bit more subtle here.

This part of the analysis is called Feature Engineering. I prefer the approach of listing all the new features that we define together in one place, to keep an overview. Every time we can think of a new feature, we come back here to define it and then study it further down. We compute the new features in the combined data set, to make sure that all feature realisations are complete, and then split the combined data again into train and test.

Practically, we use dplyr’s mutate to add new features. The extraction of the new title feature using regular expressions was contributed by Nad13 in the comments (many thanks!). Finally, we use the fct_lump function from the forcats package (factor manipulation) to lump together all the rare titles into an “Other” category.
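The title extraction and lumping can be sketched as follows. The names in the toy tibble are invented but follow the same "Surname, Title. Given names" format as the data. The regular expression and the choice of `n = 4` are assumptions for illustration and not necessarily the contributed version used in the kernel.

```r
library(dplyr)
library(forcats)

# Invented names in the same format as the Name column
combine <- tibble(Name = c(
  "Adams, Mr. John",    "Baker, Mr. Paul",
  "Clark, Miss. Anna",  "Davis, Miss. Ruth",
  "Evans, Mrs. Mary",   "Field, Mrs. Jane",
  "Grant, Master. Tom", "Hayes, Master. Sam",
  "Irwin, Dr. Hugh",    "Jones, Rev. Carl"
))

combine <- combine %>%
  mutate(
    # drop everything up to ", " and everything from the first "." onwards
    title_orig = factor(gsub("(.*, )|(\\..*)", "", Name)),
    # keep the 4 most frequent titles, lump the rest into "Other"
    title      = fct_lump(title_orig, n = 4)
  )
```

Lumping the rare titles keeps the number of factor levels small, which helps both the plots and the classifiers later on.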

In the same way as for the missing values, we’re adding all features to the combine sample and then splitting it back into train and test afterwards. Besides being more efficient to write, this approach also allows us to catch factor levels that are missing in one set versus the other.
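To illustrate the point about factor levels, here is a small sketch with invented rows. I am assuming that the test rows can be recognised by their missing Survived value, as in the tibble output shown earlier; the actual split may use a different criterion:

```r
library(dplyr)
library(tibble)

# Toy combined data: the last two rows play the role of the test set.
combine_toy <- tibble(
  PassengerId = 1:5,
  Survived    = factor(c("0", "1", "0", NA, NA)),
  Embarked    = c("S", "C", "S", "Q", "S")
) %>%
  # Defining the factor on the combined data registers all levels at once.
  mutate(Embarked = factor(Embarked))

# Assumption: test rows are exactly those without a Survived label.
train_toy <- combine_toy %>% filter(!is.na(Survived))
test_toy  <- combine_toy %>% filter(is.na(Survived))

levels(train_toy$Embarked)
levels(test_toy$Embarked)
```

Both subsets carry the same three Embarked levels, even though "Q" never occurs in the toy training rows and "C" never occurs in the toy test rows.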

## `summarise()` ungrouping output (override with `.groups` argument)

These are the new features we are defining here:

Again, this list will grow as new features are being added.

3.5 Title

We extracted the title of each passenger (e.g. Mrs or Master) from the Name character string. Originally, there were 18 different titles (feature title_orig) with different frequencies of occurrence in the train + test data sets. For our analysis we further group all the rare titles into an “Other” category and then compare their Age distributions and Survived fractions:

Fig. 25

We find:

  • The most frequent titles are “Mr”, “Miss”, “Mrs”, and “Master”. The overview plot of the original titles showcases how to rotate axis labels to improve their readability.

  • We visualise the Age distributions of the title groups using violin plots. A violin plot is similar to a boxplot in that it shows the range of the data. In addition, the outline of the “violin” shows the density of the distribution; thereby highlighting its frequency profile. Here we see especially that “Master” is capturing the male children/teenagers very well.

  • The “Miss” title applies to girls as well as younger women up to about 40. Mrs does not contain many teenagers, but has a sizeable overlap with Miss; especially in the range of 20-30 years old. Nevertheless, Miss is more likely to indicate a younger woman. Overall, there is a certain amount of variance and we’re not going to be able to pinpoint a certain age based on the title.

  • In the Survival statistics we recover the Sex dependency that results in “Mr” having far worse survival chances than the other titles.
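The two plotting techniques mentioned in this list, rotated axis labels and violin plots, can be sketched as follows. The ages are made up and the styling choices (45 degree angle, hidden legend) are my assumptions rather than the settings of the figure above:

```r
library(ggplot2)
library(tibble)

# Toy data with invented ages for the four frequent titles.
toy <- tibble(
  title = factor(rep(c("Mr", "Miss", "Mrs", "Master"), each = 3)),
  Age   = c(30, 45, 22, 5, 19, 35, 28, 40, 55, 2, 6, 11)
)

# Bar plot of the title counts with rotated x-axis labels.
p_counts <- ggplot(toy, aes(title)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Violin plot: the outline reflects the density of Age within each title.
p_violin <- ggplot(toy, aes(title, Age, fill = title)) +
  geom_violin() +
  theme(legend.position = "none")

# print(p_counts); print(p_violin)  # uncomment to render the plots
```

Setting hjust = 1 right-aligns the rotated labels so that they end at their tick marks instead of being centred on them.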

In consequence, we decide to model 2 Age Groups through the Young feature variable we have already studied above. The idea behind this is to address the issue of missing Age values by combining the Age and title features into a single feature that should still contain some of the signal regarding survival.

For this, we define everyone under 30 or with a title of Master, Miss, or Mlle (Mademoiselle) as Young. All the other titles we group into Not Young. This is a bit of a generalisation in terms of how Miss and Mrs overlap, but it might be a useful starting point. All the other rare titles (like Don or Lady) have average ages that are high enough to count as Not Young.
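A minimal sketch of this rule on a few invented passengers. The column name young and its logical encoding are assumptions for illustration; passengers with a missing Age fall back on their title alone:

```r
library(dplyr)
library(tibble)

# Toy passengers (made up); two of them have no recorded Age.
toy <- tibble(
  title_orig = c("Mr", "Mr", "Miss", "Master", "Mrs", "Mlle", "Mr"),
  Age        = c(22, 45, 35, 4, NA, 24, NA)
)

toy_young <- toy %>%
  mutate(young = case_when(
    title_orig %in% c("Master", "Miss", "Mlle") ~ TRUE,  # young by title
    Age < 30                                    ~ TRUE,  # young by age
    TRUE                                        ~ FALSE  # everyone else, incl. NA ages
  ))
```

case_when treats a missing condition (here NA < 30) as not fulfilled, so a “Mr” or “Mrs” without an Age ends up as Not Young, while a 35-year-old “Miss” still counts as Young; this is exactly the generalisation mentioned above.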

3.6 Overview

We finish this chapter by visualising the correlations and connections between the original and the engineered features.

First is another correlation plot like we had at the beginning of the “Features relations” section. This time we use the ggcorr tool provided by the GGally package, which has better integration into the overall ggplot2 framework. It also gives us a slightly different set of styling options, like the coordinate flip to make the labels more readable. Here, the stronger correlations have brighter colours in either red (positive correlation) or blue (negative correlation). The closer to white the weaker the correlation:
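For readers who have not used ggcorr before, a minimal call could look like this sketch. The data are invented, and the choice of Spearman correlations and of the exact colours is my assumption; the coordinate flip is left out to keep the example short:

```r
library(ggplot2)
library(GGally)
library(tibble)

# Toy data with numerically encoded features (made-up values).
toy <- tibble(
  Pclass   = c(1, 1, 2, 2, 3, 3, 3, 3),
  Fare     = c(80, 70, 30, 25, 8, 7, 9, 10),
  Survived = c(1, 1, 1, 0, 0, 0, 1, 0),
  SibSp    = c(1, 0, 0, 1, 0, 2, 0, 0)
)

# Correlation matrix plot: blue = negative, white = none, red = positive.
p_corr <- ggcorr(
  toy,
  method      = c("pairwise", "spearman"),
  label       = TRUE,
  label_round = 2,
  low = "blue", mid = "white", high = "red"
)

# print(p_corr)  # uncomment to render the plot
```

Since ggcorr only works with numeric columns, factors such as Pclass or Survived need to be converted to numbers before they can show up in the plot.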

Fig. 26

We find:

  • There are obvious strong correlations between the new engineered features and those from which they were derived; for instance fclass and Fare. We also see significant coefficients for features that were designed to show complementary viewpoints (e.g. alone vs family) or a more differentiated view of the same aspect (e.g. ticket_group and shared_ticket).

  • Other, more interesting correlations include for instance that the new ttype feature is strongly related to Pclass, suggesting that Tickets were primarily encoded by passenger class. The fact that family is correlated to shared_ticket indicates that mostly families travelled on the same Ticket.

To examine the strong correlations in more detail there is another useful overview visualisation in the form of a “pairplot”, once more realised via GGally. This kind of plot provides a more detailed visualisation of the relationships between variables than a simple correlation coefficient. The ggpairs tool automatically picks the type of visualisation (barplot, histogram, scatterplot) according to the format of the underlying feature:
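A small sketch of a ggpairs call on invented data. The selection of columns and the colouring by Survived are assumptions chosen for illustration and not necessarily the settings behind the figure:

```r
library(ggplot2)
library(GGally)
library(tibble)

# Toy data mixing factors and a numeric feature (made-up values).
toy <- tibble(
  Survived      = factor(c(1, 1, 1, 0, 0, 0, 1, 0)),
  Sex           = factor(c("female", "female", "male", "male",
                           "male", "male", "female", "male")),
  Fare          = c(80, 70, 30, 25, 8, 7, 9, 10),
  shared_ticket = factor(c(1, 1, 0, 0, 0, 1, 1, 0))
)

# ggpairs picks the panel type from the column types:
# factor vs factor, factor vs numeric, and numeric vs numeric differ.
p_pairs <- ggpairs(
  toy,
  columns = c("Sex", "Fare", "shared_ticket", "Survived"),
  mapping = aes(colour = Survived)
)

# print(p_pairs)  # uncomment to render the plot matrix
```

With many features the plot matrix becomes slow and crowded, so it helps to restrict the columns argument to the features of interest.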

Fig. 27

We find:

  • We recover the familiar survival statistics for the important features Pclass or Sex together with new features such as cabin_known or shared_ticket in one comprehensive overview layout. This is the kind of plot that almost gives you a ready-made dashboard for key performance indicators or similarly important features.

  • For example, in the right-most column we can go through an almost complete analysis of the new shared_ticket feature. We see its positive impact on the Survived status, then realise that sharing a ticket is more likely for female than male passengers or for those in 1st class vs 3rd class. We also find that we are more likely to know the cabin of a passenger with a shared ticket.

A final overview visualisation, before we move on to modelling, is the “alluvial plot”. Provided via the new ggalluvial package, those plots are a kind of mix between a flow chart and a bar plot and they show how the target categories relate to various discrete features. In other words: those plots allow us to see the flow of the target between different predictor features. See my blog post for more details on the plot parameters in comparison to the more established alluvial package.

Shout out to retrospectprospect for introducing me to alluvial plots in their very elegant Titanic EDA kernel.

Here is the plot for the 4 features Pclass, Sex, shared_ticket, and cabin_known:
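As a hedged sketch of how such a plot can be put together with ggalluvial: the counts below are invented, the widths and axis expansion are my assumptions, and only the cut at 20 counts follows the description in the text:

```r
library(ggplot2)
library(ggalluvial)
library(dplyr)
library(tibble)

# Made-up counts per feature combination; the last row falls below the cut.
toy_counts <- tribble(
  ~Survived, ~Pclass, ~Sex,     ~shared_ticket, ~cabin_known, ~n,
  "1",       "1",     "female", "1",            "1",           60,
  "1",       "3",     "female", "0",            "0",           40,
  "0",       "3",     "female", "1",            "0",           35,
  "0",       "3",     "male",   "0",            "0",          200,
  "1",       "3",     "male",   "0",            "0",           30,
  "0",       "1",     "male",   "1",            "1",           25,
  "1",       "2",     "female", "1",            "0",            5
) %>%
  filter(n >= 20)  # drop alluvia with fewer than 20 counts

# One axis per feature; the flows are coloured by the target.
p_alluvial <- ggplot(toy_counts,
                     aes(y = n, axis1 = Pclass, axis2 = Sex,
                         axis3 = shared_ticket, axis4 = cabin_known)) +
  geom_alluvium(aes(fill = Survived), width = 1/12) +
  geom_stratum(width = 1/12) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Pclass", "Sex", "shared_ticket", "cabin_known"),
                   expand = c(0.05, 0.05))

# print(p_alluvial)  # uncomment to render the plot
```

In the real data, the counts would come from something like a count() over Survived and the four features instead of being typed in by hand.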

We find:

  • The categories Survived (blue) and Not Survived (red) are colour-coded as usual and their frequencies for each feature are connected to adjacent features. We don’t show alluvials with fewer than 20 counts to keep the plot tidy.

  • Almost all female passengers in Pclass 1 or 2 survived. Those women travelling in class 3 who died were more likely to have a shared ticket, even though the overall percentage of passengers with shared tickets is less than 50%.

  • Similarly, the few male passengers in Pclass = 3 who survived were somewhat more likely to not share a ticket. Among all of those who did not share a ticket we have almost no survivors for whom the cabin is known.


To be continued