This report summarizes the most important findings from validating the Titanic data set from the Kaggle competition. The challenge in this competition is to predict which passengers survived the sinking of the Titanic.

The training data set provided by Kaggle consists of 891 observations and 12 features, including the ‘Survived’ feature. The description of the variables, as well as the original data set, can be found on the competition site.

The validation was performed using RStudio.

Below we present the features that have issues, which must be fixed before any analysis is performed, along with a recommendation for cleaning up the data where necessary.


Age

A considerable number of values are missing: the age is missing for 177 passengers, 19.9% of the total number of observations.
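
As a minimal sketch, the count and share of missing ages can be computed as follows. The data frame `toy` is a made-up stand-in for the training set, so the numbers it produces are not the ones reported above.

```r
# Made-up stand-in for the training data (hypothetical values)
toy <- data.frame(Age = c(22, NA, 38, NA, 35))

n_missing   <- sum(is.na(toy$Age))         # number of rows without an age
pct_missing <- 100 * mean(is.na(toy$Age))  # share of all rows, in percent

n_missing
round(pct_missing, 1)
```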

Different methods can be used to impute missing values. A KNN approach is recommended in this case, since it performs well in classification problems for the amount of missing data we are dealing with, as shown by Acuña (2004).

Creating a new feature that indicates whether the age is missing is also recommended, since the absence of an age could itself be a predictor.
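
One possible way to build such an indicator is sketched below. The column name `AgeMissing` and the data frame `toy` are assumptions made for illustration. The flag has to be created before any imputation, otherwise the information is lost.

```r
# Made-up stand-in for the training data (hypothetical values)
toy <- data.frame(Age = c(22, NA, 38, NA, 35))

# 1 when the age is missing, 0 otherwise; `AgeMissing` is a hypothetical name
toy$AgeMissing <- as.integer(is.na(toy$Age))
toy
```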

Another new feature, ‘Under 18’, can also be created. In this data set, males under the age of 18 are addressed as ‘Master’. We can use this information to assign an ‘Under 18’ value to a few passengers whose Age is ‘NA’.

library(dplyr)
# Passengers addressed as "Master" whose age is missing
master <- filter(data, grepl("[Mm]aster", Name), is.na(Age))
select(master, Name, Age, Survived)
##                                                Name Age Survived
## 1                          Moubarek, Master. Gerios  NA        1
## 2                        Sage, Master. Thomas Henry  NA        0
## 3                     Lefebre, Master. Henry Forbes  NA        0
## 4 Moubarek, Master. Halim Gonios ("William George")  NA        1
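
The ‘Under 18’ feature described above could be derived along these lines. This is only a sketch in base R: the names and ages in `toy` are invented, the column name `Under18` is an assumption, and passengers with neither a known age nor the ‘Master’ title are left as NA.

```r
# Invented passengers, for illustration only
toy <- data.frame(
  Name = c("Alpha, Master. Adam", "Beta, Mr. Brian",
           "Gamma, Master. Carl", "Delta, Miss. Dora"),
  Age  = c(NA, 22, 2, NA)
)

is_master <- grepl("Master\\.", toy$Name)

# Known age: compare with 18. Unknown age: TRUE for "Master", otherwise NA.
toy$Under18 <- ifelse(is.na(toy$Age),
                      ifelse(is_master, TRUE, NA),
                      toy$Age < 18)
toy
```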


Tickets

Ticket numbers are not unique. Moreover, the duplicates don’t match the passengers who embarked in the company of a relative, so the reason for the duplicate numbers is not entirely clear. Encyclopedia Titanica doesn’t clarify this issue.

Since this feature is unlikely to be useful for predicting the Titanic survivors, its apparent inconsistencies should not pose a significant problem.
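
The mismatch between shared tickets and relatives can be checked with a cross-tabulation such as the one sketched here. The data in `toy` are invented, and treating `SibSp + Parch > 0` as "travelling with a relative" is an assumption about how the comparison was made.

```r
# Invented stand-in for the training data
toy <- data.frame(
  Ticket = c("A1", "A1", "B2", "C3", "C3"),
  SibSp  = c(1, 1, 0, 0, 0),
  Parch  = c(0, 0, 0, 0, 0)
)

# TRUE for every row whose ticket number appears more than once
toy$SharedTicket <- duplicated(toy$Ticket) | duplicated(toy$Ticket, fromLast = TRUE)
# TRUE when the passenger has at least one sibling, spouse, parent or child aboard
toy$HasRelatives <- (toy$SibSp + toy$Parch) > 0

# Rows with a shared ticket but no relatives are the unexplained duplicates
table(SharedTicket = toy$SharedTicket, HasRelatives = toy$HasRelatives)
unexplained <- sum(toy$SharedTicket & !toy$HasRelatives)
```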


Cabin

For this feature, a large share of the observations are missing. For 687 passengers, the cabin label is not present. This is 77.1% of the total number of passengers in the training set.

These missing values likely correspond to passengers for whom there are serious doubts regarding the cabin they occupied. Encyclopedia Titanica describes the unreliability, confusion and speculation surrounding the cabin allocations on the Titanic.

In addition, a few values don’t correspond to any of the possible cabin labels. A cabin label consists of a letter from A to G (matching the deck where the cabin was located) followed by a number of up to three digits, except for the boat deck cabin, which is labeled with a “T”. In the data presented, the cabin for three of the passengers is labeled just “D”, because the only information known is that these passengers occupied a cabin on deck “D”. The data for these passengers are presented below, as well as their names with a hyperlink to their page in Encyclopedia Titanica.

# Rows whose cabin label is exactly "D" (deck letter without a number)
wrong_label <- data[grep("^D$", data$Cabin),]
select(wrong_label, Name, Age, Survived)
##                                             Name Age Survived
## 293                       Levy, Mr. Rene Jacques  36        0
## 328                      Ball, Mrs. (Ada E Hall)  36        1
## 474 Jerwan, Mrs. Amin S (Marie Marthe Thuillard)  23        1
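
The labeling rule described above can also be written as a regular expression, which gives a general check instead of a search for one specific bad value. The sketch below uses invented labels and assumes that entries listing several cabins separate them with spaces, so each part is tested on its own.

```r
# Invented cabin labels; "" stands for a missing value
cabin <- c("C85", "D", "T", "B28", "", "C23 C25 C27")

# Hypothetical helper: TRUE when every space-separated part is either
# a deck letter A-G followed by one to three digits, or the single label "T"
valid_label <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)[[1]]
  length(parts) > 0 && all(grepl("^([A-G][0-9]{1,3}|T)$", parts))
}

is_missing <- cabin == ""
is_valid   <- vapply(cabin, valid_label, logical(1), USE.NAMES = FALSE)

# Labels that are present but do not follow the rule
invalid <- cabin[!is_missing & !is_valid]
invalid
```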

Mrs. Jerwan and Mrs. Ball turned out to share the same cabin on D deck. Mr. Levy shared a cabin with a man named Noël Malachard and an unknown man. The name Malachard isn’t among the passenger names in the training data set, so the corresponding observation is presumably in the test set.

Due to the large amount of missing data for this feature, an easy recommendation would be to drop it from the analysis, but it may be worth checking, at least in the exploratory stage, whether passengers who shared the same cabin were likely to meet the same fate.
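
A rough version of that exploratory check is sketched below on invented data. It measures, among cabins with more than one known occupant, the share in which all occupants had the same value of ‘Survived’.

```r
# Invented stand-in for the training data; "" stands for a missing cabin
toy <- data.frame(
  Cabin    = c("B28", "B28", "C85", "", "E10", "E10", ""),
  Survived = c(1, 1, 1, 0, 0, 1, 0)
)

known <- toy[toy$Cabin != "", ]

# Occupants per cabin, and whether they all share one outcome
n_occupants <- tapply(known$Survived, known$Cabin, length)
same_fate   <- tapply(known$Survived, known$Cabin,
                      function(s) length(unique(s)) == 1)

# Only cabins with at least two known occupants are informative
shared <- n_occupants > 1
share_same_fate <- mean(same_fate[shared])
share_same_fate
```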


Embarked

The feature ‘Embarked’ records the port from which the passenger embarked. It can be one of three ports: Southampton, Cherbourg or Queenstown, labeled in the data set as “S”, “C” and “Q” respectively. However, a call to the levels function for this feature returns these levels plus a blank level.

levels(as.factor(data$Embarked))
## [1] ""  "C" "Q" "S"


More concretely, there are two instances where ‘Embarked’ is blank. These instances correspond to observations 62 and 830.

table(as.factor(data$Embarked))
## 
##       C   Q   S 
##   2 168  77 644
data[c(62, 830),]
##     PassengerId Survived Pclass                                      Name
## 62           62        1      1                       Icard, Miss. Amelie
## 830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##        Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62  female  38     0     0 113572   80   B28         
## 830 female  62     0     0 113572   80   B28
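
The affected rows can also be located programmatically rather than read off by eye. A minimal sketch on invented data:

```r
# Invented stand-in for the training data; "" is a blank port label
toy <- data.frame(Embarked = c("S", "", "C", "Q", ""))

# Row indices where the port of embarkation is blank
blank_rows <- which(toy$Embarked == "")
blank_rows
```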


The two passengers occupied the same cabin. According to Encyclopedia Titanica, they embarked at Southampton, so an “S” label must be assigned to these observations.

data[c(62, 830), "Embarked"] <- "S"
data[c(62, 830),]
##     PassengerId Survived Pclass                                      Name
## 62           62        1      1                       Icard, Miss. Amelie
## 830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##        Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62  female  38     0     0 113572   80   B28        S
## 830 female  62     0     0 113572   80   B28        S


The remaining features are OK. No outliers have been found.


Cleaning plan

For most of the features presented here, the cleaning process should be extremely easy, involving just the assignment of values and the creation of new features as specified.

KNN with mixed value types

Most of the workload will correspond to Age, where we would have to perform a statistical learning analysis in its own right in order to assign the missing values. The challenge here is that we are presented with both categorical and numeric variables. The knn package used must support this kind of data.

These are the possible options:

  1. Create dummy variables and use the knn function: knn only works with numeric values. The original categorical features can be transformed into dummies, and knn can then be computed using Euclidean distance. Most R functions that create dummies automatically drop the base category; the dummy package can take care of this issue.

  2. Try the knncat function from the knncat package: it apparently supports both numerical and categorical predictors, but the response must be categorical. It also applies some prior transformations to the numeric features, setting knots, spline style. To use this approach we would have to convert age into a categorical variable, with the consequent loss of information.

  3. Impute the missing values directly with the caret package: caret takes care of all the steps in a machine learning analysis, including imputation using a knn algorithm. A clear understanding of how the imputation functions work is necessary.

  4. Create a knn function that can deal with both types of variables at the same time: the knn algorithm is simple. The main workload is creating a distance measure that can be used in our case. It could combine Euclidean distance with simple matching (simple matching is proportional to the squared Euclidean distance for categorical variables previously transformed into dummies).
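
To give an idea of the effort involved in option 4, the following base R sketch imputes a numeric target with the mean of its k nearest complete rows. It is one possible design, not a tested solution: the function `knn_impute`, its arguments and the data in `toy` are all hypothetical, the distance is the square root of the squared Euclidean distance on scaled numeric predictors plus the number of mismatching categorical predictors, and the predictors themselves are assumed to have no missing values.

```r
# Hypothetical helper: impute `target` using a mixed distance over
# numeric predictors (`num_vars`) and categorical predictors (`cat_vars`)
knn_impute <- function(df, target, num_vars, cat_vars, k = 3) {
  # Scale numeric predictors so that no single one dominates the distance
  num <- scale(as.matrix(df[, num_vars, drop = FALSE]))
  cat <- as.matrix(df[, cat_vars, drop = FALSE])

  known   <- which(!is.na(df[[target]]))  # rows usable as neighbours
  missing <- which(is.na(df[[target]]))   # rows to impute

  for (i in missing) {
    # Repeat row i so it can be compared with every known row at once
    num_i <- matrix(num[i, ], nrow = length(known), ncol = ncol(num), byrow = TRUE)
    cat_i <- matrix(cat[i, ], nrow = length(known), ncol = ncol(cat), byrow = TRUE)

    d_num <- rowSums((num[known, , drop = FALSE] - num_i)^2)  # squared Euclidean part
    d_cat <- rowSums(cat[known, , drop = FALSE] != cat_i)     # simple matching part
    d <- sqrt(d_num + d_cat)

    nearest <- known[order(d)[seq_len(min(k, length(known)))]]
    df[i, target] <- mean(df[nearest, target])
  }
  df
}

# Invented data: two children with cheap tickets, two adults with expensive ones
toy <- data.frame(
  Age  = c(4, 6, 40, 44, NA, NA),
  Fare = c(10, 12, 50, 52, 11, 51),
  Sex  = c("male", "male", "female", "female", "male", "female")
)

imputed <- knn_impute(toy, target = "Age", num_vars = "Fare", cat_vars = "Sex", k = 2)
imputed
```

Giving each categorical mismatch a weight of 1 is an arbitrary choice here; how to balance the numeric and categorical parts of the distance is exactly the design question this option would have to settle.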