Step 1: Collecting Data

The Titanic datasets are obtained from Kaggle (https://www.kaggle.com/c/titanic/data). The website splits the data into three files: the training data, which includes the survival outcome (a binary value, 0 = died, 1 = survived); the test data, which omits the survival outcome; and a third file containing the survival outcomes of the passengers listed in the test data.

This study uses these datasets to predict the survival outcome of passengers in the test data with a model built from the training data. Since the outcome is binary, logistic regression is used for modeling.
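For reference, logistic regression models the log-odds of the binary outcome as a linear function of the predictors. With survival probability $p$ and predictors $x_1, \dots, x_k$, the model is

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$

so each estimated coefficient is interpreted on the log-odds scale, and exponentiating a coefficient gives an odds ratio; both interpretations are used in Step 3.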

Step 2: Exploring & Preparing the Data

Since the data are already split into training and test sets, they are kept separate. The only preparation needed at this stage is to attach the actual survival outcomes to the test data: the survival outcome column (the dependent variable) is merged with the passengers' other features (the independent variables) by PassengerId. The training dataset contains 891 observations (passengers) and 12 features, and the test dataset contains 418 observations.

setwd("C:/Users/Emily/Desktop/GRADUATE PROGRAM COURSES/Data Science Projects/Data Files/Titanic")
titanic.train = read.csv('train.csv',  header = T, na.strings = c(''))
str(titanic.train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
titanic.test = read.csv('test.csv', header = T, na.strings = c(""))
str(titanic.test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 76 levels "A11","A18","A21",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
test.label = read.csv('gender_submission.csv', header = T, na.strings = c(""))
str(test.label)
## 'data.frame':    418 obs. of  2 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Survived   : int  0 1 0 0 1 0 1 0 1 0 ...
titanic.test = merge(titanic.test, test.label, by = "PassengerId")

The Pclass (passenger class) feature is converted to a factor in both datasets, because the values 1, 2, and 3 should be treated as category levels rather than numbers in the analysis that follows.

titanic.train$Pclass = as.factor(titanic.train$Pclass)
titanic.test$Pclass = as.factor(titanic.test$Pclass)
str(titanic.train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
str(titanic.test)
## 'data.frame':    418 obs. of  12 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 76 levels "A11","A18","A21",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Survived   : int  0 1 0 0 1 0 1 0 1 0 ...

The Kaggle data description notes that ages less than 1 are recorded as fractions. Out of curiosity, the observations with an age below 1 are inspected in both datasets; the fractions appear to encode infants' ages in months as a proportion of 12.

titanic.train[which(titanic.train$Age < 1),'Age']
## [1] 0.83 0.92 0.75 0.75 0.67 0.42 0.83
titanic.test[which(titanic.test$Age < 1),'Age']
## [1] 0.33 0.92 0.75 0.83 0.17

The sapply() function is used to count the number of missing (NA) values in each feature. There are many missing Age and Cabin values in both datasets, while the Embarked feature is missing 2 values in the training data and the Fare feature is missing 1 value in the test data.

sapply(titanic.train, function(x) sum(is.na(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
sapply(titanic.test, function(x) sum(is.na(x)))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1         327           0           0

Similarly, the number of unique values per column is shown below.

sapply(titanic.train, function(x) length(unique(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4
sapply(titanic.test, function(x) length(unique(x)))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##         418           3         418           2          80           7 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           8         363         170          77           3           2

The missmap() function from the Amelia package visualizes the proportion of missing and observed values per feature. The Cabin feature is mostly missing in both datasets, and Age also has a substantial number of missing values.

library(Amelia) 
## Warning: package 'Amelia' was built under R version 3.3.3
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.3
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(titanic.train, main = "Missing Values vs. Observed")

missmap(titanic.test, main = "Missing Values vs. Observed")

Features with many missing values, such as Cabin, and features assumed to be uninformative for predicting survival, such as Ticket and Name, are excluded. The features kept for analysis are: Sex, Age, Pclass, SibSp, Parch, Fare, Embarked, and Survived.

titanic.train = subset(titanic.train, select = c(2,3,5,6,7,8,10,12))
titanic.test = subset(titanic.test, select = c(2,4,5,6,7,9,11,12))
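
Selecting by numeric index works but is fragile if the column order ever changes. An equivalent sketch of the same step using column names instead (assuming the columns listed above):

keep = c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")
titanic.train = titanic.train[, keep]   # same eight columns, selected by name
titanic.test  = titanic.test[, keep]    # also puts both datasets in the same column order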

There are several ways to deal with the missing values in the Age feature, such as filling them in with the mean, median, or mode. Here the mean age of the combined Titanic population is used to fill in the missing ages in both datasets.

age = c(titanic.train$Age, titanic.test$Age)
avg.age = mean(age, na.rm = T)
titanic.train$Age[is.na(titanic.train$Age)] = avg.age
titanic.test$Age[is.na(titanic.test$Age)] = avg.age
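
Mean imputation is the simplest option; the median is a common alternative because it is less sensitive to extreme values. A sketch of what median imputation would look like instead (not used here):

med.age = median(c(titanic.train$Age, titanic.test$Age), na.rm = TRUE)   # pooled median age
titanic.train$Age[is.na(titanic.train$Age)] = med.age   # fill missing training ages
titanic.test$Age[is.na(titanic.test$Age)] = med.age     # fill missing test ages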

Since only two observations are missing Embarked (in the training data) and one is missing Fare (in the test data), those observations are simply removed, on the assumption that little information is lost relative to the size of the datasets.

titanic.train = titanic.train[!is.na(titanic.train$Embarked),]
titanic.test = titanic.test[!is.na(titanic.test$Fare),]
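
A quick check of how many rows remain after dropping these observations (consistent with the 888 degrees of freedom reported for the null deviance later on):

nrow(titanic.train)   # 889 rows remain after dropping the 2 passengers missing Embarked
nrow(titanic.test)    # 417 rows remain after dropping the 1 passenger missing Fare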

Running missmap() again confirms that all missing values have been handled.

missmap(titanic.train, main = "Missing Values vs. Observed")

missmap(titanic.test, main = "Missing Values vs. Observed")

Step 3: Training a Model on the Data

Logistic regression is fit with the generalized linear model function, glm(), using Survived as the outcome and the remaining features in the training dataset as independent predictors. Specifying binomial(link = 'logit') in the family argument tells glm() to fit a logistic regression.

The summary of the fitted model object, reg.model, shows that the categorical features Pclass and Sex and the numerical features Age and SibSp are significant predictors of survival at the alpha = 0.05 level, while the remaining coefficients do not contribute significantly. For example, a one-unit increase in age decreases the log odds of survival by 0.039; being male decreases the log odds of survival by 2.71 compared with being female; and relative to first class, being in second class decreases the log odds of survival by 0.93 and being in third class decreases it by 2.15.

reg.model = glm(Survived ~., family = binomial(link = 'logit'), data = titanic.train)
summary(reg.model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = titanic.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6240  -0.6098  -0.4240   0.6111   2.4510  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.106628   0.476646   8.616  < 2e-16 ***
## Pclass2     -0.925239   0.297932  -3.106  0.00190 ** 
## Pclass3     -2.150054   0.297752  -7.221 5.16e-13 ***
## Sexmale     -2.709536   0.201347 -13.457  < 2e-16 ***
## Age         -0.039367   0.007889  -4.990 6.02e-07 ***
## SibSp       -0.322293   0.109595  -2.941  0.00327 ** 
## Parch       -0.095458   0.119045  -0.802  0.42263    
## Fare         0.002257   0.002462   0.917  0.35936    
## EmbarkedQ   -0.026843   0.381586  -0.070  0.94392    
## EmbarkedS   -0.446383   0.239749  -1.862  0.06262 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.82  on 888  degrees of freedom
## Residual deviance:  783.67  on 879  degrees of freedom
## AIC: 803.67
## 
## Number of Fisher Scoring iterations: 5
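
To make the log-odds scale concrete, the fitted model can turn a passenger profile into a survival probability via the logistic function. A minimal sketch for a hypothetical passenger (the profile below is made up purely for illustration):

# hypothetical 30-year-old male travelling alone in third class, embarked at Southampton
new.passenger = data.frame(Pclass = factor(3, levels = c("1", "2", "3")),
                           Sex = factor("male", levels = c("female", "male")),
                           Age = 30, SibSp = 0, Parch = 0, Fare = 8,
                           Embarked = factor("S", levels = c("C", "Q", "S")))
predict(reg.model, newdata = new.passenger, type = 'response')   # predicted probability of survival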

The 95% confidence interval for each predictor's coefficient in the logistic regression model is shown below.

confint(reg.model)
## Waiting for profiling to be done...
##                    2.5 %       97.5 %
## (Intercept)  3.189850970  5.061392867
## Pclass2     -1.512119676 -0.342095343
## Pclass3     -2.737781385 -1.567387117
## Sexmale     -3.113452224 -2.323201068
## Age         -0.055129287 -0.024164851
## SibSp       -0.547877170 -0.117971257
## Parch       -0.335304813  0.134816293
## Fare        -0.002278929  0.007628827
## EmbarkedQ   -0.779096791  0.718877318
## EmbarkedS   -0.915721407  0.025405303

Linear contrasts can be tested with the wald.test() function from the aod package. The first test below jointly compares the two Embarked dummy coefficients (terms 9 and 10, i.e. EmbarkedQ and EmbarkedS); its p-value of 0.11 indicates that port of embarkation is not a statistically significant predictor overall. The second test jointly compares the two Pclass dummy coefficients (terms 2 and 3), and its p-value indicates that the passenger classes differ significantly.

library(aod)
## Warning: package 'aod' was built under R version 3.3.3
wald.test(b = coef(reg.model), Sigma = vcov(reg.model), Terms = 9:10)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 4.5, df = 2, P(> X2) = 0.11
wald.test(b = coef(reg.model), Sigma = vcov(reg.model), Terms = 2:3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 58.6, df = 2, P(> X2) = 1.8e-13
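
The Terms argument refers to positions in the model's coefficient vector, so it is easy to confirm which coefficients each test covers:

names(coef(reg.model))[9:10]   # the two Embarked dummy coefficients tested first
names(coef(reg.model))[2:3]    # the two Pclass dummy coefficients tested second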

Exponentiating the model coefficients provides another way to interpret the results. Below is a table of the odds ratio for each predictor coefficient, together with the corresponding 95% confidence interval on the odds-ratio scale.

The results can now be interpreted as follows: for a one-unit increase in age, the odds of surviving decrease by a factor of 0.96; and being in third class (Pclass3) multiplies the odds of surviving by about 0.12 relative to first class, and so on.

exp(cbind(OR = coef(reg.model), confint(reg.model)))
## Waiting for profiling to be done...
##                      OR       2.5 %       97.5 %
## (Intercept) 60.74152102 24.28480801 157.81017182
## Pclass2      0.39643656  0.22044222   0.71028048
## Pclass3      0.11647791  0.06471376   0.20858949
## Sexmale      0.06656766  0.04444725   0.09795951
## Age          0.96139741  0.94636279   0.97612478
## SibSp        0.72448571  0.57817588   0.88872160
## Parch        0.90895624  0.71512008   1.14432654
## Fare         1.00225929  0.99772367   1.00765800
## EmbarkedQ    0.97351370  0.45882023   2.05212803
## EmbarkedS    0.63993835  0.40022779   1.02573077
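
As a quick sanity check, each odds ratio above is simply the exponential of the corresponding coefficient from the model summary:

exp(-0.039367)   # Age coefficient, gives an odds ratio of about 0.96
exp(-2.150054)   # Pclass3 coefficient, gives an odds ratio of about 0.12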

The anova() function applied to the model shows the null and residual deviances. The difference between them indicates how much better the model performs than the null model, and the residual deviance column shows how much the deviance drops as each predictor is added in turn. The table shows that adding Pclass, Sex, Age, and SibSp significantly reduces the residual deviance, while the remaining terms do not at the alpha = 0.05 level.

anova(reg.model, test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       888    1182.82              
## Pclass    2  101.571       886    1081.25 < 2.2e-16 ***
## Sex       1  254.743       885     826.50 < 2.2e-16 ***
## Age       1   21.846       884     804.66 2.954e-06 ***
## SibSp     1   14.392       883     790.27 0.0001484 ***
## Parch     1    0.504       882     789.76 0.4775463    
## Fare      1    1.633       881     788.13 0.2013133    
## Embarked  2    4.458       879     783.67 0.1076199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Furthermore, the pR2() function in the pscl package reports the McFadden R-squared index, a rough analogue of the linear regression R-squared. By this measure the logistic regression model explains about 34% of the variation in the survival outcome.

library(pscl)
## Warning: package 'pscl' was built under R version 3.3.3
## Loading required package: MASS
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.3.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(reg.model)
##          llh      llhNull           G2     McFadden         r2ML 
## -391.8348723 -591.4088880  399.1480313    0.3374552    0.3617246 
##         r2CU 
##    0.4917035
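
For reference, the McFadden index is one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model, so the same value can be reproduced directly:

null.model = update(reg.model, . ~ 1)   # refit with intercept only
1 - as.numeric(logLik(reg.model)) / as.numeric(logLik(null.model))   # about 0.337, matching pR2()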

Step 4: Model Performance Evaluation

The logistic regression model is then applied to the independent features of the test dataset. The resulting vector, titanic.predict, holds the predicted survival outcomes of the test passengers as probabilities between 0 and 1. The first few predicted survival probabilities are shown below, in order.

titanic.predict = predict(reg.model, newdata = subset(titanic.test, select = c(1:7)),
                          type = 'response')
head(titanic.predict)
##          1          2          3          4          5          6 
## 0.10713808 0.34376190 0.12196959 0.09597951 0.56318178 0.15062961

Since this is a binary prediction problem, the continuous probabilities need to be converted into 0/1 values. A natural way to do this is with an ifelse() statement that maps probabilities greater than 0.5 to 1 (predicted to survive) and probabilities of 0.5 or below to 0 (predicted not to survive). The first few binary survival predictions are shown below.

titanic.predict = ifelse(titanic.predict >0.5, 1, 0)
head(titanic.predict)
## 1 2 3 4 5 6 
## 0 0 0 0 1 0

The misClassifiError value is the proportion of predicted outcomes that do not match the actual survival outcomes; accuracy is 1 minus this error. The accuracy of predicting the Titanic survival outcome is 0.945, or 94.5%.

misClassifiError = mean(titanic.predict != titanic.test$Survived)
print(paste('Accuracy', 1 - misClassifiError))
## [1] "Accuracy 0.94484412470024"

The Receiver Operating Characteristic (ROC) curve is plotted below with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis; it shows how well the model detects true positives while avoiding false positives. Equivalently, it plots 1 - specificity against sensitivity. This particular ROC curve is very close to that of a perfect classifier, meaning the model is good at identifying the positive cases. A summary index of this is the AUC (area under the curve), which is 0.98 here.

library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
p = predict(reg.model, newdata = subset(titanic.test, select = c(1:7)),
            type = 'response')
pr = prediction(p, titanic.test$Survived)
prf = performance(pr, measure = 'tpr', x.measure = 'fpr')
plot(prf)
abline(0,1, lwd = 2, lty = 2)

auc = performance(pr, measure = 'auc')
str(auc)
## Formal class 'performance' [package "ROCR"] with 6 slots
##   ..@ x.name      : chr "None"
##   ..@ y.name      : chr "Area under the ROC curve"
##   ..@ alpha.name  : chr "none"
##   ..@ x.values    : list()
##   ..@ y.values    :List of 1
##   .. ..$ : num 0.98
##   ..@ alpha.values: list()
auc = auc@y.values[[1]]
auc
## [1] 0.9803128

Step 5: Model Performance Improvement

To try to improve on the model, an SVM (support vector machine) is used, which can handle both numerical and categorical predictors like those in the Titanic datasets. The kernel used is 'vanilladot' (linear). The fitted svm.model is then applied to the predictor values of the test dataset; the first few predicted values for the test passengers are shown below.

library(kernlab)
svm.model <- ksvm(Survived ~ ., data = titanic.train, kernel = 'vanilladot')
##  Setting default kernel parameters
svm.model
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 411 
## 
## Objective Function Value : -354.461 
## Training error : 0.825634
svm.predict <- predict(svm.model, titanic.test[,1:7])
head(svm.predict)
##         [,1]
## 1 0.04860103
## 2 0.95096222
## 3 0.04867597
## 4 0.04856677
## 5 0.95095819
## 6 0.04861916
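
One thing worth noting from the ksvm output above is that, because Survived is stored as an integer, the model was fit as eps-svr regression rather than classification, which is why the predictions are continuous values. A sketch of how a classification SVM could be requested instead, by converting the outcome to a factor (not run here):

# hypothetical alternative: a C-svc classification SVM on the same predictors
svm.class = ksvm(as.factor(Survived) ~ ., data = titanic.train, kernel = 'vanilladot')
class.pred = predict(svm.class, titanic.test[, 1:7])   # returns predicted class labels directly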

The vector of SVM predictions, svm.predict, is converted into binary 0/1 values just as before. The table() function compares the SVM predictions with the actual survival outcomes in the test dataset. Surprisingly, every passenger's survival outcome is predicted correctly, giving the SVM model 100% accuracy.

svm.predict = ifelse(svm.predict >0.5, 1, 0)
head(svm.predict)
##   [,1]
## 1    0
## 2    1
## 3    0
## 4    0
## 5    1
## 6    0
table(svm.predict, titanic.test$Survived)
##            
## svm.predict   0   1
##           0 265   0
##           1   0 152
misClassifiError = mean(svm.predict != titanic.test$Survived)
print(paste('Accuracy is', 1 - misClassifiError))
## [1] "Accuracy is 1"

The ROC curve is plotted again for the predictions from the SVM model. It looks like that of a perfect classifier: all positives are correctly identified before any negatives are misclassified. The AUC of this ROC curve is also 1.

p = predict(svm.model, newdata = subset(titanic.test, select = c(1:7)),
            type = 'response')
pr = prediction(p, titanic.test$Survived)
prf = performance(pr, measure = 'tpr', x.measure = 'fpr')
plot(prf)
abline(0,1, lwd = 2, lty = 2)

auc = performance(pr, measure = 'auc')
str(auc)
## Formal class 'performance' [package "ROCR"] with 6 slots
##   ..@ x.name      : chr "None"
##   ..@ y.name      : chr "Area under the ROC curve"
##   ..@ alpha.name  : chr "none"
##   ..@ x.values    : list()
##   ..@ y.values    :List of 1
##   .. ..$ : num 1
##   ..@ alpha.values: list()
auc = auc@y.values[[1]]
auc
## [1] 1

Conclusion: The Titanic dataset is used for the STAT-6620 Machine Learning course final project. The purpose of this project is to use the available features of the passengers aboard the Titanic to predict their survival outcome, with 0 meaning the passenger died and 1 meaning the passenger survived the disaster. Binary logistic regression is fit first with glm(), and performance is then improved with a Support Vector Machine (SVM) analysis. Through this model-improvement exercise, the SVM clearly achieves better prediction accuracy than the original logistic regression.

However, even from the logistic regression model alone, it is clear that the Titanic survival outcome depends strongly on several predictors, such as sex, age, and passenger class. In particular, holding the other predictors constant, females were more likely to survive than males, older passengers were less likely to survive, and passengers in lower classes were less likely to survive.