Applying Machine Learning Techniques on Titanic Data

This analysis uses the Titanic data, readily available on Kaggle. Logistic regression, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA) are used to predict which passengers would have survived the tragedy.

Loading Libraries

library(MASS)
library(ggplot2)
library(dplyr)

Data Exploration and Data Cleaning

In this part, the data is explored and the necessary cleaning steps are performed to make it ready for predictive modeling.

Loading the data from the directory

The following code is used to load the data:

setwd("C:/Users/rohit/Documents/RPubs_uploads/proj_3/")
train <- read.csv("train.csv")
test <- read.csv("test.csv")

test$Survived <- NA        # the Kaggle test set has no Survived column
all <- rbind(train, test)  # combine so that cleaning steps apply to both sets
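
Note that the factor columns in the glimpse() output below rely on the stringsAsFactors = TRUE default of R versions before 4.0; on R 4.0 or later the string columns would need converting explicitly, for example:

# Explicit conversion needed on R >= 4.0, where read.csv defaults to character columns
train <- read.csv("train.csv", stringsAsFactors = TRUE)
test <- read.csv("test.csv", stringsAsFactors = TRUE)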

glimpse(all)
## Observations: 1,309
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <fctr> Braund, Mr. Owen Harris, Cumings, Mrs. John Bradl...
## $ Sex         <fctr> male, female, female, female, male, male, male, m...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <fctr> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 37...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <fctr> , C85, , C123, , , E46, , , , G6, C103, , , , , ,...
## $ Embarked    <fctr> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q...
colSums(is.na(all))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0
space <- function(x) sum(x == "")  # counts empty strings; returns NA for columns containing NA
apply(all, 2, space)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0          NA           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0          NA        1014           2
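
The NA entries above arise because sum(x == "") propagates missing values; a small variant with na.rm = TRUE counts the blanks directly:

space_na <- function(x) sum(x == "", na.rm = TRUE)  # ignore NAs when counting blanks
apply(all, 2, space_na)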

It is observed that Age is missing for 263 observations. Since age can be an important factor in determining whether a passenger survived, it is imputed in the next section.

Predicting missing age

The missing values of Age are predicted using Fare and passenger class (Pclass).

# Smoothed Age against Fare by passenger class, motivating the imputation model below
ggplot(data = all, aes(x = Fare, y = Age, col = factor(Pclass))) +
  geom_smooth()

index <- which(is.na(all$Age))
predict_age <- all[index,]   # rows with missing Age, to be imputed
train_age <- all[-index,]    # rows with known Age, used to fit the model

# Linear model for Age with a Pclass:Fare interaction
lm_age <- lm(Age ~ Pclass + Fare + Pclass:Fare, data = train_age)
summary(lm_age)
## 
## Call:
## lm(formula = Age ~ Pclass + Fare + Pclass:Fare, data = train_age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.908  -8.117  -1.360   7.775  47.876 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.60475    1.56595  29.123  < 2e-16 ***
## Pclass      -5.64937    0.67503  -8.369  < 2e-16 ***
## Fare         0.14181    0.02824   5.021 6.03e-07 ***
## Pclass:Fare -0.15585    0.02583  -6.033 2.23e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.88 on 1041 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2012, Adjusted R-squared:  0.1988 
## F-statistic: 87.38 on 3 and 1041 DF,  p-value: < 2.2e-16
predict_age$Age <- as.integer(predict(lm_age, newdata = predict_age))  # truncate predictions to whole years

all <- rbind(train_age, predict_age)
colSums(is.na(all))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0
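
A linear fit can extrapolate beyond plausible values, for example predicting negative ages for inexpensive third-class fares, so a quick sanity check on the imputed values is worthwhile. A minimal sketch; the floor of 0 is an assumption, not part of the original analysis:

# Range of the imputed ages (suggested sanity check)
summary(predict_age$Age)
# Clamp any negative predictions to 0 (assumed floor)
all$Age <- pmax(all$Age, 0)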
all$family <- all$SibSp + all$Parch + 1  # family size = siblings/spouses + parents/children + the passenger

Splitting the data for predictive modeling

The labeled data (the original Kaggle training set) is first separated from the unlabeled rows, and then randomly split: 80% is used to build the models and the remaining 20% is held out to test them.

# Convert categorical variables to factor type
all$Survived <- as.factor(all$Survived)
all$Pclass <- as.factor(all$Pclass)


index <- which(is.na(all$Survived))  # rows from the Kaggle test set (no label)
train <- all[-index,]
test <- all[index,]

str(all)
## 'data.frame':    1309 obs. of  13 variables:
##  $ PassengerId: int  1 2 3 4 5 7 8 9 10 11 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 520 629 417 581 732 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
##  $ Age        : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp      : int  1 1 0 1 0 0 3 0 1 1 ...
##  $ Parch      : int  0 0 0 0 0 0 1 2 0 1 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 86 396 345 133 617 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 131 1 1 1 147 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
##  $ family     : num  2 2 1 2 1 1 5 3 2 3 ...
index <- sample(1:nrow(train), nrow(train)*0.8)
train_t <- train[index,]
test_t <- train[-index,]
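
Because sample() is not seeded, the split (and every confusion matrix below) will vary from run to run. Seeding before the split, as sketched here with an arbitrary seed, would make the results reproducible:

set.seed(42)  # arbitrary seed, not part of the original analysis
index <- sample(1:nrow(train), nrow(train) * 0.8)  # same split as above, now reproducible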

Logistic Regression

Initial Model

An initial model is built using the important variables observed in the exploration above.

log.fit <- glm(Survived ~ Pclass + Sex + Age + Fare + Embarked + family, family = "binomial", data = train_t)
summary(log.fit)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Fare + Embarked + 
##     family, family = "binomial", data = train_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5914  -0.6101  -0.4234   0.6224   2.4739  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  16.839481 607.860034   0.028 0.977899    
## Pclass2      -0.912723   0.337197  -2.707 0.006794 ** 
## Pclass3      -2.207719   0.344811  -6.403 1.53e-10 ***
## Sexmale      -2.706839   0.223592 -12.106  < 2e-16 ***
## Age          -0.043054   0.009133  -4.714 2.43e-06 ***
## Fare          0.003047   0.002725   1.118 0.263556    
## EmbarkedC   -12.378034 607.859862  -0.020 0.983754    
## EmbarkedQ   -12.503025 607.859938  -0.021 0.983590    
## EmbarkedS   -12.734645 607.859840  -0.021 0.983286    
## family       -0.289555   0.075475  -3.836 0.000125 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 947.02  on 711  degrees of freedom
## Residual deviance: 626.90  on 702  degrees of freedom
## AIC: 646.9
## 
## Number of Fisher Scoring iterations: 13

Model with important variables

Fare and Embarked are not significant in the initial model; the very large standard errors on the Embarked coefficients likely reflect quasi-separation caused by the two blank Embarked entries (both of those passengers survived). A reduced model is therefore fitted with the remaining predictors.

log.fit2 <- glm(Survived ~ Pclass + Sex + Age + family, family = "binomial", data = train_t)
summary(log.fit2)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + family, family = "binomial", 
##     data = train_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5993  -0.5899  -0.4190   0.6141   2.4614  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.570095   0.504352   9.061  < 2e-16 ***
## Pclass2     -1.198594   0.297017  -4.035 5.45e-05 ***
## Pclass3     -2.484581   0.284655  -8.728  < 2e-16 ***
## Sexmale     -2.749358   0.221718 -12.400  < 2e-16 ***
## Age         -0.045152   0.009067  -4.980 6.36e-07 ***
## family      -0.284078   0.071789  -3.957 7.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 947.02  on 711  degrees of freedom
## Residual deviance: 631.35  on 706  degrees of freedom
## AIC: 643.35
## 
## Number of Fisher Scoring iterations: 5

Prediction on the test data

test_t$prob <- predict(log.fit2, newdata = test_t, type = "response")  # predicted survival probability
test_t$pred <- ifelse(test_t$prob > 0.5, 1, 0)                         # classify at the 0.5 threshold
table(test_t$pred, test_t$Survived)
##    
##       0   1
##   0 100  24
##   1   9  46
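
The confusion matrix can be summarised as an overall accuracy; for the split shown above this works out to (100 + 46) / 179 ≈ 0.82. A minimal sketch:

# Overall accuracy: correct classifications on the diagonal
log_cm <- table(test_t$pred, test_t$Survived)
sum(diag(log_cm)) / sum(log_cm)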

Linear Discriminant Analysis

LDA Modeling

lda.model <- lda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
lda.model
## Call:
## lda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
## 
## Prior probabilities of groups:
##         0         1 
## 0.6179775 0.3820225 
## 
## Group means:
##     Pclass2   Pclass3   Sexmale      Age     Parch
## 0 0.1750000 0.6886364 0.8363636 29.35909 0.3545455
## 1 0.2536765 0.3602941 0.3088235 27.58029 0.4926471
## 
## Coefficients of linear discriminants:
##                 LD1
## Pclass2 -0.76664664
## Pclass3 -1.64266162
## Sexmale -2.10072845
## Age     -0.02363916
## Parch   -0.17455310
plot(lda.model)

Prediction on the test data

lda_prob <- predict(lda.model, newdata = test_t)
table(Predicted = lda_prob$class, test_t$Survived)
##          
## Predicted   0   1
##         0 100  26
##         1   9  44
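
As before, the matrix can be condensed into an accuracy figure, here (100 + 44) / 179 ≈ 0.80 for this split; a one-line sketch:

# Share of correct LDA classifications
mean(lda_prob$class == test_t$Survived)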

Quadratic Discriminant Analysis

QDA Modeling

qda.model <- qda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
qda.model
## Call:
## qda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
## 
## Prior probabilities of groups:
##         0         1 
## 0.6179775 0.3820225 
## 
## Group means:
##     Pclass2   Pclass3   Sexmale      Age     Parch
## 0 0.1750000 0.6886364 0.8363636 29.35909 0.3545455
## 1 0.2536765 0.3602941 0.3088235 27.58029 0.4926471

Prediction on the test data

qda_prob <- predict(qda.model, newdata = test_t)
table(Predicted = qda_prob$class, test_t$Survived)
##          
## Predicted  0  1
##         0 93 18
##         1 16 52
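
For this split the QDA accuracy is (93 + 52) / 179 ≈ 0.81; compared with the logistic model, QDA identifies more survivors correctly (52 vs. 46) at the cost of more misclassified non-survivors (16 vs. 9). Computed the same way:

# Share of correct QDA classifications
mean(qda_prob$class == test_t$Survived)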

Conclusion

LDA and QDA work well when the classes are well separated and the normality assumption holds for the predictors. When the predictors are far from normal, logistic regression has an edge over LDA and QDA. Conversely, logistic regression becomes unstable when the response classes are perfectly separated. In general, logistic regression is the default for binary classification, while LDA and QDA are more popular when there are more than two response classes.