This report covers the cleaning of the Titanic data set from the Kaggle competition. The audit of the data can be found here.
The original data set is taken from the introductory Kaggle competition.
Part of the cleaning process consists merely of updating a few values in the original data set. The major workload lies in the imputation of the missing values of the Age variable. Three alternatives are discussed below.
As shown in the validity report, "Embarked" presents a very minor problem related to the variable levels: there is a blank level, associated with two observations, which actually correspond to passengers who embarked in Southampton ('S').
titanic[c(62, 830), "Embarked"] <- "S"
titanic$Embarked <- as.factor(as.character(titanic$Embarked))
Due to its high number of missing values, the Cabin variable is dropped directly from the analysis.
About 20% of the observations have a missing Age value, which can be assigned using some imputation method. Below, three ways of assigning values are discussed: the arithmetic mean, linear regression, and KNN. We'll use the squared error as the loss function, reporting the root mean squared error measured over 5-fold cross-validation samples, so that it is directly comparable with the standard deviation of 'Age'.
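For reference, the error function assumed throughout is sketched below; the cross-validation functions sourced later compute it internally, so this helper is purely illustrative.
# Root mean squared error between observed and imputed ages (illustrative only;
# the cv functions used later compute this internally).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))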
The data are split into a training set and a test set, where the test set contains the observations with missing Age values. The final imputation will be performed on the test set; the cross-validation resampling will be carried out on the training set.
idx_na <- is.na(titanic$Age)
age_train <- titanic[!idx_na, ]
age_test <- titanic[idx_na, ]
If we take a look at the other covariates in the data set, we can see that Pclass and SibSp are potential Age predictors.
summary(age_train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 20.12 28.00 29.70 38.00 80.00
ggvis(age_train, ~Age) %>% layer_histograms()
## Guessing width = 2 # range / 40
cvidx <- rep(1:5, each = ceiling(nrow(age_train)/5))
cvidx <- sample(cvidx, nrow(age_train))
age_train <- select(age_train, Age, Survived, Pclass, Sex,
                    SibSp, Parch, Fare, Embarked) %>%
  mutate(cvidx = cvidx)
age_test <- select(age_test, PassengerId, Age, Survived, Pclass, Sex, SibSp, Parch, Fare, Embarked)
par(mfrow = c(3, 3))
for (col in 2:(ncol(age_train) - 1)){
  boxplot(age_train$Age ~ age_train[, col],
          xlab = names(age_train)[col],
          col = "steelblue")
}
We'll use this information in some of the imputation methods discussed next.
One of the first things that usually comes to mind when confronted with a data imputation task is the arithmetic mean. Although prone to misrepresent the data in the presence of missing values, it is a simple method that is usually better than just dropping the missing data.
If we assign the arithmetic mean to all the missing values, the mean prediction error will be the standard deviation of 'Age', assuming the distribution of 'Age' is the same for missing and non-missing values. Since Pclass, at least, is related to Age, if the proportion of passengers in each class differs between observations with missing and non-missing 'Age', the 'Age' distribution will differ as well.
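As a quick illustration of the first point, the in-sample error of imputing the mean coincides, up to the n versus n - 1 normalisation, with the standard deviation of the observed ages:
# Sketch: the in-sample RMSE of imputing the mean equals the standard deviation
# of the observed ages, up to the n vs. n - 1 normalisation.
obs_age <- age_train$Age
sqrt(mean((obs_age - mean(obs_age))^2))   # RMSE of constant (mean) imputation
sd(obs_age)                               # essentially identical for this sample size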
An association analysis leads us to conclude that there is an association between the missing values and Pclass:
pc <- titanic$Pclass
mnm <- rep(NA, nrow(titanic))
mnm[is.na(titanic$Age)] <- "missing"
mnm[!is.na(titanic$Age)] <- "non-missing"
mnm <- as.factor(mnm)
chisq.test(pc, mnm)
##
## Pearson's Chi-squared test
##
## data: pc and mnm
## X-squared = 46.063, df = 2, p-value = 9.945e-11
An analysis of variance can be used to verify that there is a relationship between Pclass and Age. First, we must check the ANOVA assumptions, that is, normality within each group and homoscedasticity.
tapply(titanic$Age, titanic$Pclass, shapiro.test)
## $`1`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99169, p-value = 0.3643
##
##
## $`2`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.97695, p-value = 0.005648
##
##
## $`3`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.97344, p-value = 4.186e-06
bartlett.test(titanic$Age~titanic$Pclass)
##
## Bartlett test of homogeneity of variances
##
## data: titanic$Age by titanic$Pclass
## Bartlett's K-squared = 7.818, df = 2, p-value = 0.02006
As the requirements for ANOVA are not met, we'll perform the Welch test instead.
oneway.test(titanic$Age ~ titanic$Pclass)
##
## One-way analysis of means (not assuming equal variances)
##
## data: titanic$Age and titanic$Pclass
## F = 53.355, num df = 2.00, denom df = 359.44, p-value < 2.2e-16
The null hypothesis is rejected, so the 'Age' distribution cannot be assumed to be the same across classes. Since the proportion of missing and non-missing 'Age' values also differs between classes, the distribution of 'Age' among the missing observations is likely to differ from that of the observed ones. Consequently, we cannot take the standard deviation of the observed 'Age' values as an unbiased estimator of the error of mean imputation. We do know, however, that this standard deviation is an optimistic (low) estimate of that error, since the mean minimizes the squared error over the very sample it was computed from.
sd(age_train$Age)
## [1] 14.5265
So, this standard deviation becomes a lower bound: for another imputation method to be picked, its cross-validation prediction error must fall below it. If the methods proposed next do not beat this bound, the average Age will be imputed to the missing values.
A linear regression was carried out with different covariates, prioritizing the ones that seem to have more predictive power. Functions performing cross-validation on this very data set have been built. The code can be found in this GitHub repo.
We'll choose the best model according to the 'one standard deviation' rule, which implies that a standard deviation has to be estimated for each cv error. The resample_cvlr function, built specifically for this task, deals with this issue: it carries out several cross-validation analyses over different resamplings of the data.
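For illustration, this is roughly what such a routine might look like. This is only a minimal sketch; the actual resample_cvlr in the repo may handle the cvidx column, the model formula, and the error definition differently.
# Minimal sketch of a resampling cross-validation routine for a linear model.
# Assumptions: the data frame contains Age plus the chosen predictors (and a
# cvidx column, ignored here); fresh 5-fold splits are drawn on every repetition.
resample_cvlr_sketch <- function(dat, n_rep, k = 5) {
  dat <- dat[, setdiff(names(dat), "cvidx")]
  replicate(n_rep, {
    folds <- sample(rep(1:k, length.out = nrow(dat)))
    fold_rmse <- sapply(1:k, function(i) {
      fit <- lm(Age ~ ., data = dat[folds != i, ])
      pred <- predict(fit, newdata = dat[folds == i, ])
      sqrt(mean((dat$Age[folds == i] - pred)^2))
    })
    mean(fold_rmse)   # one cv error per resample
  })
}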
setwd("C:/Users/Martin/Desktop/kaggle/titanic")
source("text of analysis/scripts/titanic_agelm_crossvalidation.R")
A data frame has been built where each column corresponds to a different pick of variables and each row is a 5-fold cv error. 100 resamplings have been performed for each pick.
cv <- data.frame(
  Pclass =
    resample_cvlr(select(age_train, Age, Pclass, cvidx), 100),
  Pclass.Survived =
    resample_cvlr(select(age_train, Age, Pclass, Survived, cvidx), 100),
  Pclass.Survived.SibSp =
    resample_cvlr(select(age_train, Age, Pclass, Survived, SibSp, cvidx), 100),
  Pclass.Survived.SibSp.Sex =
    resample_cvlr(select(age_train, Age, Pclass, Survived, SibSp, Sex, cvidx), 100),
  Pclass.Survived.SibSp.Sex.Fare =
    resample_cvlr(select(age_train, Age, Pclass, Survived, SibSp, Sex,
                         Fare, cvidx), 100),
  Pclass.Survived.SibSp.Sex.Fare.Parch =
    resample_cvlr(select(age_train, Age, Pclass, Survived, SibSp, Sex,
                         Fare, Parch, cvidx), 100))
df <- data.frame(er_mean = apply(cv, 2, mean),
                 er_sd = apply(cv, 2, sd))
df$vars <- row.names(df)
The figure below represents these cv errors with error bars. We can see that the simplest model with the lowest cv error is the one with the variables 'Pclass', 'Survived', and 'SibSp'.
theme_set(theme_bw())
ggplot(data = df, aes(x = vars, y = er_mean)) +
  geom_errorbar(aes(ymin = er_mean - er_sd, ymax = er_mean + er_sd),
                width = 0.15) +
  geom_point(size = 3) +
  scale_x_discrete(labels = c("Pclass", "Pclass Survived",
                              "Pclass Survived \n SibSp",
                              "Pclass Survived \n SibSp Sex",
                              "Pclass Survived \n SibSp Sex \n Fare",
                              "Pclass Survived \n SibSp Sex \n Fare Parch")) +
  theme(axis.text = element_text(size = 6)) +
  ylab("Cross validation error") + xlab("Predictors")
The linear regression model outperforms the mean imputation method in terms of error.
K Nearest Neighbours is a method that shows good performance for this amount of missing data. In order to run KNN here we need a function able to deal with categorical data. A set of functions that carry out a KNN algorithm and a cross-validation method for the Titanic data set has been developed. These functions can deal with both categorical and numerical data, using a mixed distance (Euclidean + simple matching) as the dissimilarity measure. The code can be found in this GitHub repo.
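As an illustration of the idea, here is a minimal sketch of such a mixed dissimilarity between two rows (one-row data frames): Euclidean distance on the numeric columns plus a simple-matching count on the categorical ones. The actual functions in knn_mixed.R may weight or combine these parts differently.
# Sketch of a mixed dissimilarity between two single-row data frames:
# Euclidean distance on numeric columns + simple matching on factor columns.
mixed_dist <- function(a, b) {
  is_num <- vapply(a, is.numeric, logical(1))
  num_part <- if (any(is_num))
    sqrt(sum(mapply(function(x, y) (x - y)^2, a[is_num], b[is_num]))) else 0
  cat_part <- if (any(!is_num))
    sum(mapply(function(x, y) as.character(x) != as.character(y),
               a[!is_num], b[!is_num])) else 0
  num_part + cat_part
}
# Example: dissimilarity between the first two passengers on a few predictors
mixed_dist(age_train[1, c("Pclass", "SibSp", "Sex")],
           age_train[2, c("Pclass", "SibSp", "Sex")])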
Data has to be standardized before the KNN algorithm is executed. The numeric variables are scaled to values between 0 and 1.
norm01 <- function(x) x/max(x)
age_train <- mutate(age_train, cvidx = cvidx, Pclass = norm01(Pclass),
                    SibSp = norm01(SibSp), Parch = norm01(Parch),
                    Fare = norm01(Fare))
age_test <- mutate(age_test, Pclass = norm01(Pclass),
                   SibSp = norm01(SibSp),
                   Fare = norm01(Fare))
Now we'll run the KNN cross-validation for three different variable subsets, chosen among the variables that showed the most predictive potential in the exploratory analysis, with the K parameter varying from 1 to 20.
source("scripts/knn_mixed.R")
# Pclass, SibSp, Parch
neig_PSP <- rep(NA, 20)
for (n_idx in 1:20){
  neig_PSP[n_idx] <- mknn_cv(select(age_train, Age, Pclass, SibSp, Parch, cvidx),
                             "Age", n_idx, "cvidx")$mrss
}
# Pclass, SibSp, Parch, Fare
neig_PSPF <- rep(NA, 20)
for (n_idx in 1:20){
  neig_PSPF[n_idx] <- mknn_cv(select(age_train, Age, Pclass, SibSp, Parch, Fare, cvidx),
                              "Age", n_idx, "cvidx")$mrss
}
# Pclass, SibSp
neig_PS <- rep(NA, 20)
for (n_idx in 1:20){
  neig_PS[n_idx] <- mknn_cv(select(age_train, Age, Pclass, SibSp, cvidx),
                            "Age", n_idx, "cvidx")$mrss
}
The cross-validation error for the different values of K is presented below.
df1 <- data.frame(Error = c(neig_PSP, neig_PS, neig_PSPF),
                  Predictors = rep(c("Pclass SibSp \n Parch",
                                     "Pclass SibSp",
                                     "Pclass SibSp \n Parch Fare"), each = 20),
                  K = rep(1:20, 3))
ggplot(df1, aes(x = K, y = Error, color = Predictors)) +
  geom_line() + geom_point(size = 3.5) +
  scale_color_manual(values = c("black", "steelblue", "orange")) +
  theme_bw() + theme(legend.text = element_text(size = 8))
For the three subsets we obtain similar results: the cross-validation error reaches its lowest value at K = 20 (around 14.7). This is similar to, but still higher than, the error of the arithmetic mean, and also higher than the cv error obtained through linear regression.
setwd("C:/Users/Martin/Desktop/kaggle/titanic")
source("text of analysis/scripts/knn_mixed.R")
According to the criteria established before and the results obtained, we impute the missing values via linear regression with the three predictors selected above: Pclass, Survived, and SibSp.
lmPSS <- lm(Age ~ Pclass + Survived + SibSp, data = age_train)
age_test$Age <- predict(lmPSS, newdata = age_test)
titanic[titanic$PassengerId %in% age_test$PassengerId, "Age"] <- age_test$Age