
Project description

========================================

This project predicts which passengers survived the Titanic disaster. On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. The intended outcome is a model that predicts, for each passenger, whether they survived.

Dataset description

=======================================

The datasets used in this project come from the Kaggle portal and are part of Kaggle’s competition to predict the survivors of the sinking.

The following two datasets are used:

+ train.csv
+ test.csv

train.csv is used to build the model. The model is then applied to test.csv to predict which passengers survived, using regression/classification.

Explanation of the variables in the train.csv dataset

======================================================================================

survival -: Survival (0 = No; 1 = Yes)
pclass -: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name -: Name
sex -: Sex
age -: Age
sibsp -: Number of Siblings/Spouses Aboard
parch -: Number of Parents/Children Aboard
ticket -: Ticket Number
fare -: Passenger Fare
cabin -: Cabin
embarked -: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

test.csv contains the same variables as train.csv except for the survival variable, which is the one to be predicted.

Reference Papers

==============================

Similar analyses have been done by others. The following are links to some of the submitted papers.

##   PassengerId Pclass Survived                            Name    Sex Age
## 1         601    3rd        0             Abbing, Mr. Anthony   male  42
## 2         602    3rd        0   Abbott, Master. Eugene Joseph   male  13
## 3         603    3rd        0     Abbott, Mr. Rossmore Edward   male  16
## 4         604    3rd        1 Abbott, Mrs. Stanton (Rosa Hunt female  35
## 5         605    3rd        1     Abelseth, Miss. Karen Marie female  16
## 6         606    3rd        1   Abelseth, Mr. Olaus Jorgensen   male  25
##   SibSp Parch    Ticket  Fare Cabin    Embarked Boat Body
## 1     0     0 C.A. 5547  7.55       Southampton        NA
## 2     0     2 C.A. 2673 20.25       Southampton        NA
## 3     1     1 C.A. 2673 20.25       Southampton       190
## 4     1     1 C.A. 2673 20.25       Southampton    A   NA
## 5     0     0    348125  7.65       Southampton   16   NA
## 6     0     0    348122  7.65 F G63 Southampton    A   NA
##                home.dest
## 1                       
## 2    East Providence, RI
## 3    East Providence, RI
## 4    East Providence, RI
## 5 Norway Los Angeles, CA
## 6     Perkins County, SD

## 
## Summary of Titanic Dataset
## ------------------------------------------
## Statistic     N   Mean  St. Dev. Min  Max 
## ------------------------------------------
## PASSENGERID 1,309 655.0  378.0    1  1,309
## SURVIVED    1,309  0.4    0.5     0    1  
## AGE         1,046 29.9    14.4   0.2 80.0 
## SIBSP       1,309  0.5    1.0     0    8  
## PARCH       1,309  0.4    0.9     0    9  
## FARE        1,308 33.3    51.8   0.0 512.3
## BODY         121  160.8   97.7    1   328 
## ------------------------------------------

## 'data.frame':    1309 obs. of  15 variables:
##  $ PassengerId: int  601 602 603 604 605 606 324 325 607 608 ...
##  $ Pclass     : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ Survived   : int  0 0 0 1 1 1 0 1 1 1 ...
##  $ Name       : chr  "Abbing, Mr. Anthony" "Abbott, Master. Eugene Joseph" "Abbott, Mr. Rossmore Edward" "Abbott, Mrs. Stanton (Rosa Hunt" ...
##  $ Sex        : chr  "male" "male" "male" "female" ...
##  $ Age        : num  42 13 16 35 16 25 30 28 20 18 ...
##  $ SibSp      : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ Parch      : int  0 2 1 1 0 0 0 0 0 0 ...
##  $ Ticket     : chr  "C.A. 5547" "C.A. 2673" "C.A. 2673" "C.A. 2673" ...
##  $ Fare       : num  7.55 20.25 20.25 20.25 7.65 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Southampton" "Southampton" "Southampton" "Southampton" ...
##  $ Boat       : chr  "" "" "" "A" ...
##  $ Body       : int  NA NA 190 NA NA NA NA NA NA NA ...
##  $ home.dest  : chr  "" "East Providence, RI" "East Providence, RI" "East Providence, RI" ...

## [1] 1309

Chisq test

==============================

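This test is used to examine the relationship between Age and Survived. Note that the output below is headed "Chi-squared test for given probabilities" and is computed on xt$Age alone, which makes it a goodness-of-fit test on the Age counts rather than a test of independence between Age and Survived.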

## 
##  Chi-squared test for given probabilities
## 
## data:  xt$Age
## X-squared = 7241.1, df = 1044, p-value < 2.2e-16
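A test of association between two categorical variables is run on their cross-tabulation rather than on a single column. The sketch below uses made-up counts (not the Titanic data) and a hypothetical grouping of Age into three bands. It only illustrates how base R's `chisq.test()` behaves when it is given a two-way table.

```r
# Hypothetical counts: rows are age bands, columns are Survived (0/1).
# These numbers are invented for illustration and are NOT from the Titanic data.
tab <- matrix(
  c(30, 20,
    60, 90,
    25, 35),
  nrow = 3, byrow = TRUE,
  dimnames = list(AgeBand = c("child", "adult", "senior"),
                  Survived = c("0", "1"))
)

# On a two-way table chisq.test() runs Pearson's test of independence,
# with (rows - 1) * (cols - 1) degrees of freedom.
res <- chisq.test(tab)
print(res$method)
print(res$parameter)  # degrees of freedom
print(res$p.value)
```

With a continuous variable such as Age, some binning of this kind is usually needed first; otherwise almost every distinct age forms its own sparse category.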

Cullen & Frey graph

===============================================

This graph is used to understand the distribution of the Age/PClass/Fare variables.

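The graph is read here as showing a roughly Normal distribution for all three variables. Judged by the summary statistics below, this holds best for Age (skewness 0.41, kurtosis 3.15, against 0 and 3 for a Normal distribution); the other two summaries have kurtosis of about 1.5 and 2.0, which is flatter than Normal.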

## summary statistics
## ------
## min:  0.1667   max:  80 
## median:  28 
## mean:  29.85183 
## estimated sd:  14.3892 
## estimated skewness:  0.4070077 
## estimated kurtosis:  3.154922

## summary statistics
## ------
## min:  1   max:  3 
## median:  2 
## mean:  2.206699 
## estimated sd:  0.8415418 
## estimated skewness:  -0.4053072 
## estimated kurtosis:  1.529315

## summary statistics
## ------
## min:  1   max:  256 
## median:  110 
## mean:  115.622 
## estimated sd:  71.99422 
## estimated skewness:  0.3300987 
## estimated kurtosis:  1.958779
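The skewness and kurtosis values in the summaries above are what place each variable on a Cullen and Frey graph. As a rough sketch, the function below computes the plain moment-based versions in base R; the helper name `moment_shape` and the toy vector are assumptions for illustration, and a plotting package may apply a small-sample correction, so its "estimated" values can differ slightly.

```r
# Moment-based skewness and kurtosis (population formulas, dividing by n).
# A Normal distribution has skewness 0 and kurtosis 3 on this scale.
moment_shape <- function(x) {
  x <- x[!is.na(x)]          # drop missing values, as Age has NAs
  d <- x - mean(x)           # deviations from the mean
  m2 <- mean(d^2)            # second central moment (variance, n denominator)
  c(skewness = mean(d^3) / m2^1.5,
    kurtosis = mean(d^4) / m2^2)
}

# Toy vector (hypothetical, not the Titanic data).
shape <- moment_shape(c(1, 2, 3, 4, 5))
print(shape)
```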

Balloon plot 1

===============================

This chart shows a graphical matrix of survivors, where each cell contains the number of survivors. Each cell is marked with a circle whose size reflects the number of survivors.

Observation:

========================

The balloon plot shows that the survival rate is higher for 1st class passengers. Comparing males and females, the graph shows that females had a higher survival rate than males.

By port of embarkation, passengers who embarked at Cherbourg and Southampton had a better chance of survival than those who embarked at Queenstown.
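The numbers behind a balloon plot are a cross-tabulation. A minimal sketch of that table in base R follows; the data frame `toy` is invented for illustration and only borrows the column names used in this report.

```r
# Hypothetical mini data set; the column names follow the variables above,
# but the rows are invented for illustration.
toy <- data.frame(
  Pclass   = c("1st", "1st", "2nd", "3rd", "3rd", "3rd", "1st", "2nd"),
  Sex      = c("female", "male", "female", "male", "male", "female", "female", "male"),
  Survived = c(1, 0, 1, 0, 0, 1, 1, 0)
)

# Number of survivors in each Pclass x Sex cell (what each balloon encodes).
survivors <- with(toy, tapply(Survived, list(Pclass = Pclass, Sex = Sex), sum))
print(survivors)

# Survival rate per cell: the mean of the 0/1 outcome within the cell.
rate <- with(toy, tapply(Survived, list(Pclass = Pclass, Sex = Sex), mean))
print(rate)
```

Counts and rates can tell different stories: a large group can contribute many survivors while still having a low survival rate, so it helps to look at both.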

Balloon Plot 2

==============================

This balloon plot shows the graphical matrix of non-survivors.

Observation:

============================

The balloon plot shows that the non-survival rate is higher for 2nd and 3rd class passengers. Comparing males and females, the survival rate is lower for males than for females.

Passengers who embarked at Southampton also had a lower chance of survival than the others.

Scatter plot 1

==============================

These comparison charts show the impact of Age/Sex/Pclass/Fare on the survival rate.

Observation:

======================================

The survival rate of females is higher than that of males. Age also seems to have had an impact on the survival rate, which appears to be higher for the age group under 20.
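An age-group comparison like the one above can be checked numerically by binning Age. The sketch below assumes a cut-off at 20 and uses invented ages and outcomes, so the printed rates say nothing about the real data.

```r
# Hypothetical ages and outcomes (invented; not the Titanic data).
age      <- c(4, 15, 19, 22, 35, 41, 58, 63)
survived <- c(1, 1, 0, 0, 1, 0, 0, 0)

# right = FALSE makes the intervals [0, 20) and [20, Inf),
# so a passenger aged exactly 20 falls in the older group.
age_group <- cut(age, breaks = c(0, 20, Inf), right = FALSE,
                 labels = c("under 20", "20 and over"))

# Survival rate per age group = mean of the 0/1 outcome.
rate_by_age <- tapply(survived, age_group, mean)
print(rate_by_age)
```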

Scatter plot 2

===============================

Observation:

=====================

This graph shows that the survival rate is higher for passengers under 20 in the 1st and 2nd class cabins.

Scatter Plot 3

=======================

Impact of Fare on the survival rate.

Observation:

================================

This shows the impact of low fares on the survival rate: a lower survival rate is observed when the fare was low.

GLM Model 1

===========================================

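Logistic regression using glm(). The formula Survived ~ . uses every column of the training data frame as a predictor; per the coefficient table below, these are Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.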

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2456  -0.6947  -0.4435   0.6935   2.3081  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.863088   0.528253   7.313 2.61e-13 ***
## Pclass2nd           -1.042704   0.337055  -3.094 0.001978 ** 
## Pclass3rd           -1.833942   0.341764  -5.366 8.05e-08 ***
## Sexmale             -2.472881   0.226489 -10.918  < 2e-16 ***
## Age                 -0.032616   0.008483  -3.845 0.000121 ***
## SibSp               -0.306722   0.131959  -2.324 0.020106 *  
## Parch                0.070353   0.143910   0.489 0.624934    
## Fare                 0.001537   0.002348   0.655 0.512716    
## EmbarkedQueenstown  -1.189361   0.534583  -2.225 0.026092 *  
## EmbarkedSouthampton -0.677110   0.271398  -2.495 0.012599 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.35  on 629  degrees of freedom
## Residual deviance: 595.95  on 620  degrees of freedom
## AIC: 615.95
## 
## Number of Fisher Scoring iterations: 4

## [1] 615.9482

##  FP  TP  TN  FN 
##  42 123 205  43 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.7941889 0.7409639 0.8299595 0.7454545 0.8266129 2.6363246 
## attr(,"negative")
## [1] "Died"

## [1] 166

## [1] 165

## [1] 247

## [1] 248
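The accuracy figures above follow directly from the four confusion-matrix counts. The sketch below recomputes them in base R; the function name `confusion_metrics` is made up for illustration, and the formula for `lor` (log odds ratio) is an assumption that happens to reproduce the printed value. The four single numbers printed after the metrics (166, 165, 247, 248) are consistent with the denominators TP + FN, TP + FP, TN + FP and TN + FN.

```r
# Counts reported for GLM Model 1 above.
cm <- c(FP = 42, TP = 123, TN = 205, FN = 43)

confusion_metrics <- function(cm) {
  tp <- cm[["TP"]]; fp <- cm[["FP"]]; tn <- cm[["TN"]]; fn <- cm[["FN"]]
  c(acc  = (tp + tn) / (tp + tn + fp + fn),  # overall accuracy
    sens = tp / (tp + fn),                   # sensitivity (true positive rate)
    spec = tn / (tn + fp),                   # specificity (true negative rate)
    ppv  = tp / (tp + fp),                   # positive predictive value
    npv  = tn / (tn + fn),                   # negative predictive value
    lor  = log((tp * tn) / (fp * fn)))       # log odds ratio (assumed definition)
}

m <- confusion_metrics(cm)
print(round(m, 7))
```

The same function can be applied to the counts reported for the other models to compare them on sensitivity and specificity rather than accuracy alone.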

GLM Model 2

==========================

Logistic regression using glm(). In this model, missing values (blank entries or NAs) in Age and Fare are replaced with the mean of the respective variable. The formula also includes a Parch * Pclass interaction.

## 
## Call:
## glm(formula = Survived ~ Sex + Age + SibSp + Parch * Pclass + 
##     Fare, family = "binomial", data = train3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3291  -0.5828  -0.4758   0.6317   2.3708  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.248872   0.443741   7.322 2.45e-13 ***
## Sexmale         -2.545496   0.203084 -12.534  < 2e-16 ***
## Age             -0.030926   0.008229  -3.758 0.000171 ***
## SibSp           -0.249122   0.113781  -2.189 0.028562 *  
## Parch           -0.021865   0.265659  -0.082 0.934404    
## Pclass2nd       -1.566125   0.336842  -4.649 3.33e-06 ***
## Pclass3rd       -1.797859   0.292860  -6.139 8.31e-10 ***
## Fare             0.003040   0.002477   1.227 0.219650    
## Parch:Pclass2nd  1.062037   0.415413   2.557 0.010571 *  
## Parch:Pclass3rd -0.269219   0.307315  -0.876 0.381009    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1045.18  on 784  degrees of freedom
## Residual deviance:  718.17  on 775  degrees of freedom
## AIC: 738.17
## 
## Number of Fisher Scoring iterations: 5

## [1] 738.1675

##  FP  TP  TN  FN 
##  41 137 284  62 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8034351 0.6884422 0.8738462 0.7696629 0.8208092 2.7282487 
## attr(,"negative")
## [1] "Died"
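The mean imputation described for this model could look roughly like the following. This is a sketch on an invented data frame, with a hypothetical helper `impute_mean`; it is not the code used to build train3.

```r
# Hypothetical data frame with missing Age and Fare values.
toy <- data.frame(
  Age  = c(22, NA, 30, 40, NA),
  Fare = c(7.25, 71.28, NA, 8.05, 53.10)
)

# Replace each NA with the mean of the observed values in that column.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy$Age  <- impute_mean(toy$Age)
toy$Fare <- impute_mean(toy$Fare)
print(toy)
```

Imputing keeps rows that would otherwise be dropped (785 training rows here against 630 in Model 1), at the cost of pulling the imputed variable's spread toward its mean.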

GLM Model 3

=========================

Prediction based on logistic regression using the significant variables. A second-degree polynomial transformation, poly(Age, 2), is applied to Age.

## 
## Call:
## glm(formula = Survived ~ Sex + poly(Age, 2) + Pclass * SibSp + 
##     Parch, family = "binomial", data = train5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3690  -0.6602  -0.4429   0.6433   2.4406  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.3814     0.3080   7.732 1.06e-14 ***
## Sexmale          -2.4679     0.2233 -11.052  < 2e-16 ***
## poly(Age, 2)1   -13.0071     3.1032  -4.191 2.77e-05 ***
## poly(Age, 2)2     6.6840     2.7485   2.432 0.015021 *  
## Pclass2nd        -1.2033     0.3472  -3.466 0.000528 ***
## Pclass3rd        -1.7919     0.3192  -5.614 1.97e-08 ***
## SibSp             0.2334     0.3135   0.744 0.456594    
## Parch             0.0534     0.1438   0.371 0.710421    
## Pclass2nd:SibSp  -0.2276     0.4537  -0.502 0.616009    
## Pclass3rd:SibSp  -0.8590     0.3527  -2.436 0.014868 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.35  on 629  degrees of freedom
## Residual deviance: 592.02  on 620  degrees of freedom
## AIC: 612.02
## 
## Number of Fisher Scoring iterations: 5

## [1] 612.0175

##  FP  TP  TN  FN 
##  31 126 216  40 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8280872 0.7590361 0.8744939 0.8025478 0.8437500 3.0886937 
## attr(,"negative")
## [1] "Died"
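The `poly(Age, 2)` term in this model uses an orthogonal polynomial basis rather than raw Age and Age squared, which is why its two coefficients (-13.0 and 6.68) are on the scale of that basis and not "per year of age". The sketch below shows the property on a few invented ages.

```r
# Hypothetical ages (invented for illustration).
age <- c(2, 9, 16, 25, 30, 35, 42, 58, 71)

# poly(age, 2) returns two columns: a linear and a quadratic term that are
# orthogonal to each other and to the intercept, unlike raw age and age^2.
P <- poly(age, 2)

print(round(crossprod(P), 10))  # identity matrix: orthonormal columns
print(round(colSums(P), 10))    # zeros: orthogonal to the intercept
print(cor(age, age^2))          # raw terms are strongly correlated
```

If coefficients in the original units are wanted, `poly(Age, 2, raw = TRUE)` gives the same fitted values with raw powers, though the two terms are then highly correlated.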

##               df      AIC
## glm.logistic2 10 738.1675
## glm.logistic3 10 612.0175
## glm.logistic1 10 615.9482

Observation:

=======================

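Comparing the AIC values of the GLM models, GLM Model 3 has the lowest AIC (612.02), so it is the best fit among them by this criterion. Note that AIC is only comparable between models fitted to the same observations: Models 1 and 3 were fitted to 630 rows, while Model 2 was fitted to 785 rows after imputation, so its higher AIC (738.17) cannot be compared directly with the other two.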
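The AIC values in the table can be traced back to the model summaries: for a logistic regression on 0/1 outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below applies that to the figures printed above and then checks the identity on a small invented data set (`x` and `y` are assumptions for illustration).

```r
# Residual deviances and coefficient counts read from the summaries above.
resid_dev <- c(glm.logistic1 = 595.95, glm.logistic2 = 718.17, glm.logistic3 = 592.02)
n_coef    <- c(10, 10, 10)

# For a logistic regression on 0/1 outcomes, AIC = residual deviance + 2 * k.
aic_from_summary <- resid_dev + 2 * n_coef
print(aic_from_summary)

# Check of the identity on a small invented data set.
x <- 1:20
y <- c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1)
fit <- glm(y ~ x, family = "binomial")
print(all.equal(AIC(fit), deviance(fit) + 2 * length(coef(fit))))
```

Because all three models estimate ten coefficients, the AIC ranking here is simply the ranking of residual deviances.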

C50 Model 1

========================

Rule-based model using C5.0 (rules = TRUE). This model uses all the variables in the training data as predictors.

## 
## Call:
## C5.0.formula(formula = Survived ~ ., data = train7, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 17 03:15:33 2015
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 630 cases (8 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (406/90, lift 1.3)
##  Sex = male
##  ->  class 0  [0.777]
## 
## Rule 2: (269/64, lift 1.3)
##  Pclass = 3rd
##  Embarked in {Southampton, Queenstown}
##  ->  class 0  [0.760]
## 
## Rule 3: (57/2, lift 2.3)
##  Sex = female
##  Embarked = Cherbourg
##  ->  class 1  [0.949]
## 
## Rule 4: (136/7, lift 2.3)
##  Pclass in {1st, 2nd}
##  Sex = female
##  ->  class 1  [0.942]
## 
## Rule 5: (8, lift 2.2)
##  Pclass = 1st
##  SibSp > 0
##  Fare <= 57.9792
##  Embarked = Cherbourg
##  ->  class 1  [0.900]
## 
## Rule 6: (18/2, lift 2.1)
##  Sex = male
##  Age <= 9
##  SibSp <= 3
##  ->  class 1  [0.850]
## 
## Rule 7: (4, lift 2.0)
##  Fare > 263
##  ->  class 1  [0.833]
## 
## Rule 8: (18/3, lift 1.9)
##  SibSp > 0
##  Fare > 73.5
##  Fare <= 136.7792
##  Embarked = Southampton
##  ->  class 1  [0.800]
## 
## Default class: 0
## 
## 
## Evaluation on training data (630 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       8  107(17.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     359    12    (a): class 0
##      95   164    (b): class 1
## 
## 
##  Attribute usage:
## 
##   88.10% Sex
##   65.08% Pclass
##   55.40% Embarked
##    6.98% SibSp
##    4.76% Fare
##    2.86% Age
## 
## 
## Time: 0.0 secs

##  FP  TP  TN  FN 
##  14 100 233  66 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8062954 0.6024096 0.9433198 0.8771930 0.7792642 3.2274966 
## attr(,"negative")
## [1] "Died"
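Each rule in the C5.0 output above is annotated as `(n/m, lift L)` followed by a bracketed confidence. As a sketch of how those figures relate: n is the number of training cases the rule covers, m the number it misclassifies, the confidence is the Laplace ratio (n - m + 1) / (n + 2), and the lift divides that confidence by the share of the predicted class in the training data. The helper names below are made up for illustration; the inputs are taken from Rules 1 and 3 and from the training confusion matrix (371 class-0 and 259 class-1 cases).

```r
# Laplace-corrected rule confidence: n cases covered, m of them misclassified.
rule_confidence <- function(n, m = 0) (n - m + 1) / (n + 2)

# Lift = rule confidence divided by the training share of the predicted class.
rule_lift <- function(n, m, class_share) rule_confidence(n, m) / class_share

# Class shares in the 630 training cases (371 died, 259 survived).
share0 <- 371 / 630
share1 <- 259 / 630

print(round(rule_confidence(406, 90), 3))    # Rule 1: Sex = male -> class 0
print(round(rule_lift(406, 90, share0), 1))
print(round(rule_confidence(57, 2), 3))      # Rule 3: female, Cherbourg -> class 1
print(round(rule_lift(57, 2, share1), 1))
```

These reproduce the printed 0.777 / 1.3 for Rule 1 and 0.949 / 2.3 for Rule 3. A lift above 1 means the rule predicts its class more reliably than always guessing that class would.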

C50 Model 2

==========================

C5.0 rule-based model using the significant variables.

## 
## Call:
## C5.0.formula(formula = Survived ~ Age + Sex * Pclass + Fare *
##  Embarked, data = train10, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 17 03:15:34 2015
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 630 cases (6 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (379/73, lift 1.4)
##  Age > 9
##  Sex = male
##  ->  class 0  [0.806]
## 
## Rule 2: (269/64, lift 1.3)
##  Pclass = 3rd
##  Embarked in {Southampton, Queenstown}
##  ->  class 0  [0.760]
## 
## Rule 3: (57/2, lift 2.3)
##  Sex = female
##  Embarked = Cherbourg
##  ->  class 1  [0.949]
## 
## Rule 4: (136/7, lift 2.3)
##  Sex = female
##  Pclass in {1st, 2nd}
##  ->  class 1  [0.942]
## 
## Rule 5: (52/19, lift 1.5)
##  Age <= 9
##  ->  class 1  [0.630]
## 
## Default class: 0
## 
## 
## Evaluation on training data (630 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       5  117(18.6%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     362     9    (a): class 0
##     108   151    (b): class 1
## 
## 
##  Attribute usage:
## 
##   83.81% Sex
##   68.41% Age
##   64.29% Pclass
##   51.75% Embarked
## 
## 
## Time: 0.0 secs

##  FP  TP  TN  FN 
##  13 100 234  66 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8087167 0.6024096 0.9473684 0.8849558 0.7800000 3.3058872 
## attr(,"negative")
## [1] "Died"

C50 Model 3

===========================

Third variation of the C5.0 rule-based model, in which NA values are replaced with the means of Age and Fare.

## 
## Call:
## C5.0.formula(formula = Survived ~ Age + Sex * Pclass + Fare, data
##  = train13, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 17 03:15:34 2015
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 785 cases (5 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (18/1, lift 1.5)
##  Sex = female
##  Pclass = 3rd
##  Fare > 23.25
##  ->  class 0  [0.900]
## 
## Rule 2: (295/48, lift 1.4)
##  Sex = male
##  Pclass = 3rd
##  ->  class 0  [0.835]
## 
## Rule 3: (478/82, lift 1.3)
##  Age > 14
##  Sex = male
##  ->  class 0  [0.827]
## 
## Rule 4: (21, lift 2.5)
##  Age <= 14
##  Pclass in {1st, 2nd}
##  ->  class 1  [0.957]
## 
## Rule 5: (272/73, lift 1.9)
##  Sex = female
##  ->  class 1  [0.730]
## 
## Default class: 0
## 
## 
## Evaluation on training data (785 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       5  147(18.7%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     428    56    (a): class 0
##      91   210    (b): class 1
## 
## 
##  Attribute usage:
## 
##   98.47% Sex
##   63.57% Age
##   42.55% Pclass
##    2.29% Fare
## 
## 
## Time: 0.0 secs

##  FP  TP  TN  FN 
##  42 142 283  57 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8110687 0.7135678 0.8707692 0.7717391 0.8323529 2.8205531 
## attr(,"negative")
## [1] "Died"

Decision Tree Model (DT Model 1)

=============================================================

This model uses the C50 algorithm to create decision trees.

## 
## Call:
## C5.0.default(x = train14[, -2], y = train14$Survived)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 17 03:15:34 2015
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 630 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## Sex = female:
## :...Pclass in {1st,2nd}: 1 (136/7)
## :   Pclass = 3rd:
## :   :...Embarked = Cherbourg: 1 (13/2)
## :       Embarked in {Southampton,Queenstown}: 0 (75/29)
## Sex = male:
## :...Age <= 9:
##     :...SibSp <= 3: 1 (18/2)
##     :   SibSp > 3: 0 (9/1)
##     Age > 9:
##     :...Pclass in {3rd,2nd}: 0 (284/40)
##         Pclass = 1st:
##         :...Embarked = Queenstown: 0 (0)
##             Embarked = Cherbourg:
##             :...SibSp <= 0:
##             :   :...Fare <= 263: 0 (22/7)
##             :   :   Fare > 263: 1 (2)
##             :   SibSp > 0:
##             :   :...Fare <= 57.9792: 1 (5)
##             :       Fare > 57.9792: 0 (13/4)
##             Embarked = Southampton:
##             :...SibSp <= 0:
##                 :...Fare > 31.3875: 0 (14)
##                 :   Fare <= 31.3875:
##                 :   :...Fare <= 26: 0 (7)
##                 :       Fare > 26: 1 (13/5)
##                 SibSp > 0:
##                 :...Fare <= 73.5: 0 (6/1)
##                     Fare > 73.5:
##                     :...Fare <= 136.7792: 1 (9/3)
##                         Fare > 136.7792: 0 (4)
## 
## 
## Evaluation on training data (630 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      16  101(16.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     352    19    (a): class 0
##      82   177    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Sex
##   95.71% Pclass
##   64.44% Age
##   29.05% Embarked
##   19.37% SibSp
##   15.08% Fare
## 
## 
## Time: 0.0 secs

##  FP  TP  TN  FN 
##  25 107 222  59 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.7966102 0.6445783 0.8987854 0.8106061 0.7900356 2.7790929 
## attr(,"negative")
## [1] "Died"

DT Model 2

==========================

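Second variation of the C5.0 tree model, in which the significant variables are chosen for model creation and prediction. Winnowing is enabled (winnow = TRUE), although the output reports that no attributes were winnowed.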

## 
## Call:
## C5.0.default(x = train15[, -c(2, 6, 7)], y = train15$Survived, control
##  = C5.0Control(winnow = TRUE))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 17 03:15:34 2015
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 630 cases (6 attributes) from undefined.data
## 
## No attributes winnowed
## 
## Decision tree:
## 
## Sex = male:
## :...Age <= 9: 1 (27/10)
## :   Age > 9: 0 (379/73)
## Sex = female:
## :...Pclass in {1st,2nd}: 1 (136/7)
##     Pclass = 3rd:
##     :...Embarked = Cherbourg: 1 (13/2)
##         Embarked in {Southampton,Queenstown}: 0 (75/29)
## 
## 
## Evaluation on training data (630 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       5  121(19.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     352    19    (a): class 0
##     102   157    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Sex
##   64.44% Age
##   35.56% Pclass
##   13.97% Embarked
## 
## 
## Time: 0.0 secs

##          Overall
## Age           25
## Embarked      25
## Pclass        25
## Sex           25
## SibSp          0

##  FP  TP  TN  FN 
##  20 103 227  63 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.7990315 0.6204819 0.9190283 0.8373984 0.7827586 2.9208120 
## attr(,"negative")
## [1] "Died"

RPart Model 1

======================================

Decision tree using the RPart package

## Call:
## rpart(formula = Survived ~ ., data = train16, method = "class")
##   n= 630 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.44015444      0 1.0000000 1.0000000 0.04768335
## 2 0.03281853      1 0.5598456 0.5598456 0.04079292
## 3 0.02702703      3 0.4942085 0.5675676 0.04098852
## 4 0.01000000      5 0.4401544 0.4671815 0.03817522
## 
## Variable importance
##      Sex   Pclass     Fare      Age Embarked    SibSp    Parch 
##       46       15       15        9        6        6        2 
## 
## Node number 1: 630 observations,    complexity param=0.4401544
##   predicted class=0  expected loss=0.4111111  P(node) =1
##     class counts:   371   259
##    probabilities: 0.589 0.411 
##   left son=2 (406 obs) right son=3 (224 obs)
##   Primary splits:
##       Sex      splits as  RL,           improve=81.95485, (0 missing)
##       Fare     < 52.2771  to the left,  improve=31.22644, (0 missing)
##       Pclass   splits as  RLL,          improve=27.44000, (0 missing)
##       Embarked splits as  RLL,          improve=16.34740, (0 missing)
##       Parch    < 0.5      to the left,  improve=10.22907, (0 missing)
##   Surrogate splits:
##       Fare     < 77.6229  to the left,  agree=0.673, adj=0.080, (0 split)
##       Parch    < 0.5      to the left,  agree=0.660, adj=0.045, (0 split)
##       Embarked splits as  LRL,          agree=0.651, adj=0.018, (0 split)
##       Age      < 5.5      to the right, agree=0.646, adj=0.004, (0 split)
## 
## Node number 2: 406 observations,    complexity param=0.02702703
##   predicted class=0  expected loss=0.2216749  P(node) =0.6444444
##     class counts:   316    90
##    probabilities: 0.778 0.222 
##   left son=4 (379 obs) right son=5 (27 obs)
##   Primary splits:
##       Age      < 9.5      to the right, improve=9.627302, (0 missing)
##       Pclass   splits as  RLL,          improve=5.482567, (0 missing)
##       Fare     < 26.26875 to the left,  improve=4.713136, (0 missing)
##       Embarked splits as  RLL,          improve=3.793760, (0 missing)
##       Parch    < 0.5      to the left,  improve=1.996034, (0 missing)
##   Surrogate splits:
##       SibSp < 3.5      to the left,  agree=0.943, adj=0.148, (0 split)
## 
## Node number 3: 224 observations,    complexity param=0.03281853
##   predicted class=1  expected loss=0.2455357  P(node) =0.3555556
##     class counts:    55   169
##    probabilities: 0.246 0.754 
##   left son=6 (88 obs) right son=7 (136 obs)
##   Primary splits:
##       Pclass   splits as  RRL,          improve=26.075300, (0 missing)
##       Fare     < 21.55    to the left,  improve=11.673610, (0 missing)
##       Embarked splits as  RLL,          improve= 6.772141, (0 missing)
##       Age      < 32.5     to the left,  improve= 5.080290, (0 missing)
##       SibSp    < 2.5      to the right, improve= 1.287827, (0 missing)
##   Surrogate splits:
##       Fare     < 20.7875  to the left,  agree=0.853, adj=0.625, (0 split)
##       Age      < 22.5     to the left,  agree=0.688, adj=0.205, (0 split)
##       Embarked splits as  RLR,          agree=0.661, adj=0.136, (0 split)
##       SibSp    < 3.5      to the right, agree=0.634, adj=0.068, (0 split)
##       Parch    < 2.5      to the right, agree=0.612, adj=0.011, (0 split)
## 
## Node number 4: 379 observations
##   predicted class=0  expected loss=0.1926121  P(node) =0.6015873
##     class counts:   306    73
##    probabilities: 0.807 0.193 
## 
## Node number 5: 27 observations,    complexity param=0.02702703
##   predicted class=1  expected loss=0.3703704  P(node) =0.04285714
##     class counts:    10    17
##    probabilities: 0.370 0.630 
##   left son=10 (9 obs) right son=11 (18 obs)
##   Primary splits:
##       SibSp  < 3        to the right, improve=7.25925900, (0 missing)
##       Pclass splits as  RRL,          improve=4.35729800, (0 missing)
##       Fare   < 29.0625  to the right, improve=1.79259300, (0 missing)
##       Parch  < 1.5      to the left,  improve=0.19698820, (0 missing)
##       Age    < 3.5      to the right, improve=0.09259259, (0 missing)
##   Surrogate splits:
##       Fare     < 29.0625  to the right, agree=0.778, adj=0.333, (0 split)
##       Embarked splits as  RLR,          agree=0.741, adj=0.222, (0 split)
##       Pclass   splits as  RRL,          agree=0.704, adj=0.111, (0 split)
## 
## Node number 6: 88 observations,    complexity param=0.03281853
##   predicted class=0  expected loss=0.4545455  P(node) =0.1396825
##     class counts:    48    40
##    probabilities: 0.545 0.455 
##   left son=12 (75 obs) right son=13 (13 obs)
##   Primary splits:
##       Embarked splits as  RLL,          improve=4.6784150, (0 missing)
##       Fare     < 23.0875  to the right, improve=3.3246750, (0 missing)
##       Age      < 19.5     to the right, improve=1.6813660, (0 missing)
##       SibSp    < 1.5      to the right, improve=0.3159003, (0 missing)
##       Parch    < 0.5      to the left,  improve=0.1302693, (0 missing)
##   Surrogate splits:
##       Fare < 7.2396   to the right, agree=0.886, adj=0.231, (0 split)
##       Age  < 0.875    to the right, agree=0.875, adj=0.154, (0 split)
## 
## Node number 7: 136 observations
##   predicted class=1  expected loss=0.05147059  P(node) =0.215873
##     class counts:     7   129
##    probabilities: 0.051 0.949 
## 
## Node number 10: 9 observations
##   predicted class=0  expected loss=0.1111111  P(node) =0.01428571
##     class counts:     8     1
##    probabilities: 0.889 0.111 
## 
## Node number 11: 18 observations
##   predicted class=1  expected loss=0.1111111  P(node) =0.02857143
##     class counts:     2    16
##    probabilities: 0.111 0.889 
## 
## Node number 12: 75 observations
##   predicted class=0  expected loss=0.3866667  P(node) =0.1190476
##     class counts:    46    29
##    probabilities: 0.613 0.387 
## 
## Node number 13: 13 observations
##   predicted class=1  expected loss=0.1538462  P(node) =0.02063492
##     class counts:     2    11
##    probabilities: 0.154 0.846

##  FP  TP  TN  FN 
##  13 103 234  63 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8159806 0.6204819 0.9473684 0.8879310 0.7878788 3.3819660 
## attr(,"negative")
## [1] "Died"
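The `rel error` and `xerror` columns in the rpart complexity table above are scaled so that the root node (no splits) has error 1. A rough sketch of converting them back to absolute figures, using the root node counts printed for Node number 1 (371 vs 259 out of 630), is shown below; the variable names are assumptions for illustration.

```r
# Values read from the RPart Model 1 output above.
n_obs       <- 630
root_errors <- 259   # minority-class count at the root node
rel_error   <- c(1.0000000, 0.5598456, 0.4942085, 0.4401544)
xerror      <- c(1.0000000, 0.5598456, 0.5675676, 0.4671815)

# Both columns are relative to the root node error,
# so multiplying by it gives an absolute misclassification rate.
root_rate       <- root_errors / n_obs
abs_train_error <- rel_error * root_rate
abs_cv_error    <- xerror * root_rate

print(round(abs_train_error * n_obs))  # misclassified training cases per tree size
print(round(abs_cv_error, 3))          # cross-validated error rate per tree size
```

For the full tree this gives 114 misclassified training cases, which matches the sum of the minority counts in the leaf nodes listed above (73 + 7 + 1 + 2 + 29 + 2).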

RPart Model 2

=======================================

Second rpart variation, based on the significant variables (Sex, Pclass and Fare).

## Call:
## rpart(formula = Survived ~ Sex + Pclass + Fare, data = train19, 
##     method = "class", control = rpart.control(minsplit = 3, cp = 0.01))
##   n= 630 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.44015444      0 1.0000000 1.0000000 0.04768335
## 2 0.03088803      1 0.5598456 0.5598456 0.04079292
## 3 0.01544402      2 0.5289575 0.5675676 0.04098852
## 4 0.01000000      5 0.4826255 0.5366795 0.04018631
## 
## Variable importance
##    Sex   Fare Pclass 
##     59     22     19 
## 
## Node number 1: 630 observations,    complexity param=0.4401544
##   predicted class=0  expected loss=0.4111111  P(node) =1
##     class counts:   371   259
##    probabilities: 0.589 0.411 
##   left son=2 (406 obs) right son=3 (224 obs)
##   Primary splits:
##       Sex    splits as  RL,          improve=81.95485, (0 missing)
##       Fare   < 52.2771 to the left,  improve=31.22644, (0 missing)
##       Pclass splits as  RLL,         improve=27.44000, (0 missing)
##   Surrogate splits:
##       Fare < 77.6229 to the left,  agree=0.673, adj=0.08, (0 split)
## 
## Node number 2: 406 observations
##   predicted class=0  expected loss=0.2216749  P(node) =0.6444444
##     class counts:   316    90
##    probabilities: 0.778 0.222 
## 
## Node number 3: 224 observations,    complexity param=0.03088803
##   predicted class=1  expected loss=0.2455357  P(node) =0.3555556
##     class counts:    55   169
##    probabilities: 0.246 0.754 
##   left son=6 (88 obs) right son=7 (136 obs)
##   Primary splits:
##       Pclass splits as  RRL,         improve=26.07530, (0 missing)
##       Fare   < 21.55   to the left,  improve=11.67361, (0 missing)
##   Surrogate splits:
##       Fare < 20.7875 to the left,  agree=0.853, adj=0.625, (0 split)
## 
## Node number 6: 88 observations,    complexity param=0.01544402
##   predicted class=0  expected loss=0.4545455  P(node) =0.1396825
##     class counts:    48    40
##    probabilities: 0.545 0.455 
##   left son=12 (11 obs) right son=13 (77 obs)
##   Primary splits:
##       Fare < 23.0875 to the right, improve=3.324675, (0 missing)
## 
## Node number 7: 136 observations
##   predicted class=1  expected loss=0.05147059  P(node) =0.215873
##     class counts:     7   129
##    probabilities: 0.051 0.949 
## 
## Node number 12: 11 observations
##   predicted class=0  expected loss=0.09090909  P(node) =0.01746032
##     class counts:    10     1
##    probabilities: 0.909 0.091 
## 
## Node number 13: 77 observations,    complexity param=0.01544402
##   predicted class=1  expected loss=0.4935065  P(node) =0.1222222
##     class counts:    38    39
##    probabilities: 0.494 0.506 
##   left son=26 (72 obs) right son=27 (5 obs)
##   Primary splits:
##       Fare < 7.5896  to the right, improve=2.604618, (0 missing)
## 
## Node number 26: 72 observations,    complexity param=0.01544402
##   predicted class=0  expected loss=0.4722222  P(node) =0.1142857
##     class counts:    38    34
##    probabilities: 0.528 0.472 
##   left son=52 (39 obs) right son=53 (33 obs)
##   Primary splits:
##       Fare < 10.825  to the left,  improve=2.182595, (0 missing)
## 
## Node number 27: 5 observations
##   predicted class=1  expected loss=0  P(node) =0.007936508
##     class counts:     0     5
##    probabilities: 0.000 1.000 
## 
## Node number 52: 39 observations
##   predicted class=0  expected loss=0.3589744  P(node) =0.06190476
##     class counts:    25    14
##    probabilities: 0.641 0.359 
## 
## Node number 53: 33 observations
##   predicted class=1  expected loss=0.3939394  P(node) =0.05238095
##     class counts:    13    20
##    probabilities: 0.394 0.606

##  FP  TP  TN  FN 
##  21 108 226  58 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8087167 0.6506024 0.9149798 0.8372093 0.7957746 2.9977008 
## attr(,"negative")
## [1] "Died"
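The six figures printed after each confusion matrix follow directly from the four counts. As a minimal sketch, the base R code below recomputes them for RPart Model 2. The helper name `diagnostics_from_counts` is hypothetical, and reading `lor` as the log odds ratio is an assumption, although it does reproduce the printed value.

```r
# Hypothetical helper: derive the summary metrics from the four confusion counts.
diagnostics_from_counts <- function(FP, TP, TN, FN) {
  c(
    acc  = (TP + TN) / (FP + TP + TN + FN),  # overall accuracy
    sens = TP / (TP + FN),                   # share of actual survivors predicted as survivors
    spec = TN / (TN + FP),                   # share of actual deaths predicted as deaths
    ppv  = TP / (TP + FP),                   # precision of "survived" predictions
    npv  = TN / (TN + FN),                   # precision of "died" predictions
    lor  = log((TP * TN) / (FP * FN))        # assumed meaning: log odds ratio
  )
}

# Counts printed above for RPart Model 2
m2 <- diagnostics_from_counts(FP = 21, TP = 108, TN = 226, FN = 58)
print(round(m2, 4))
```

The same helper can be applied to the counts of any other model in this report, provided "Died" is the negative class as in the printed output.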

RPart Model 3 :-

====================================

Third rpart variation: all predictors, with the tree-growing controls set to minsplit = 3 and cp = 0.01.

## Call:
## rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked, data = train21, method = "class", control = rpart.control(minsplit = 3, 
##     cp = 0.01))
##   n= 630 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.44015444      0 1.0000000 1.0000000 0.04768335
## 2 0.03281853      1 0.5598456 0.5598456 0.04079292
## 3 0.02702703      3 0.4942085 0.5675676 0.04098852
## 4 0.01000000      5 0.4401544 0.4671815 0.03817522
## 
## Variable importance
##      Sex   Pclass     Fare      Age Embarked    SibSp    Parch 
##       46       15       15        9        6        6        2 
## 
## Node number 1: 630 observations,    complexity param=0.4401544
##   predicted class=0  expected loss=0.4111111  P(node) =1
##     class counts:   371   259
##    probabilities: 0.589 0.411 
##   left son=2 (406 obs) right son=3 (224 obs)
##   Primary splits:
##       Sex      splits as  RL,           improve=81.95485, (0 missing)
##       Fare     < 52.2771  to the left,  improve=31.22644, (0 missing)
##       Pclass   splits as  RLL,          improve=27.44000, (0 missing)
##       Embarked splits as  RLL,          improve=16.34740, (0 missing)
##       Parch    < 0.5      to the left,  improve=10.22907, (0 missing)
##   Surrogate splits:
##       Fare     < 77.6229  to the left,  agree=0.673, adj=0.080, (0 split)
##       Parch    < 0.5      to the left,  agree=0.660, adj=0.045, (0 split)
##       Embarked splits as  LRL,          agree=0.651, adj=0.018, (0 split)
##       Age      < 5.5      to the right, agree=0.646, adj=0.004, (0 split)
## 
## Node number 2: 406 observations,    complexity param=0.02702703
##   predicted class=0  expected loss=0.2216749  P(node) =0.6444444
##     class counts:   316    90
##    probabilities: 0.778 0.222 
##   left son=4 (379 obs) right son=5 (27 obs)
##   Primary splits:
##       Age      < 9.5      to the right, improve=9.627302, (0 missing)
##       Pclass   splits as  RLL,          improve=5.482567, (0 missing)
##       Fare     < 26.26875 to the left,  improve=4.713136, (0 missing)
##       Embarked splits as  RLL,          improve=3.793760, (0 missing)
##       Parch    < 0.5      to the left,  improve=1.996034, (0 missing)
##   Surrogate splits:
##       SibSp < 3.5      to the left,  agree=0.943, adj=0.148, (0 split)
## 
## Node number 3: 224 observations,    complexity param=0.03281853
##   predicted class=1  expected loss=0.2455357  P(node) =0.3555556
##     class counts:    55   169
##    probabilities: 0.246 0.754 
##   left son=6 (88 obs) right son=7 (136 obs)
##   Primary splits:
##       Pclass   splits as  RRL,          improve=26.075300, (0 missing)
##       Fare     < 21.55    to the left,  improve=11.673610, (0 missing)
##       Embarked splits as  RLL,          improve= 6.772141, (0 missing)
##       Age      < 32.5     to the left,  improve= 5.080290, (0 missing)
##       SibSp    < 3.5      to the right, improve= 2.186790, (0 missing)
##   Surrogate splits:
##       Fare     < 20.7875  to the left,  agree=0.853, adj=0.625, (0 split)
##       Age      < 22.5     to the left,  agree=0.688, adj=0.205, (0 split)
##       Embarked splits as  RLR,          agree=0.661, adj=0.136, (0 split)
##       SibSp    < 3.5      to the right, agree=0.634, adj=0.068, (0 split)
##       Parch    < 2.5      to the right, agree=0.612, adj=0.011, (0 split)
## 
## Node number 4: 379 observations
##   predicted class=0  expected loss=0.1926121  P(node) =0.6015873
##     class counts:   306    73
##    probabilities: 0.807 0.193 
## 
## Node number 5: 27 observations,    complexity param=0.02702703
##   predicted class=1  expected loss=0.3703704  P(node) =0.04285714
##     class counts:    10    17
##    probabilities: 0.370 0.630 
##   left son=10 (9 obs) right son=11 (18 obs)
##   Primary splits:
##       SibSp    < 3        to the right, improve=7.2592590, (0 missing)
##       Pclass   splits as  RRL,          improve=4.3572980, (0 missing)
##       Fare     < 29.0625  to the right, improve=1.7925930, (0 missing)
##       Embarked splits as  RLR,          improve=1.7125930, (0 missing)
##       Age      < 0.5      to the left,  improve=0.8233618, (0 missing)
##   Surrogate splits:
##       Fare     < 29.0625  to the right, agree=0.778, adj=0.333, (0 split)
##       Embarked splits as  RLR,          agree=0.741, adj=0.222, (0 split)
##       Pclass   splits as  RRL,          agree=0.704, adj=0.111, (0 split)
## 
## Node number 6: 88 observations,    complexity param=0.03281853
##   predicted class=0  expected loss=0.4545455  P(node) =0.1396825
##     class counts:    48    40
##    probabilities: 0.545 0.455 
##   left son=12 (75 obs) right son=13 (13 obs)
##   Primary splits:
##       Embarked splits as  RLL,          improve=4.6784150, (0 missing)
##       Fare     < 23.0875  to the right, improve=3.3246750, (0 missing)
##       Age      < 19.5     to the right, improve=1.6813660, (0 missing)
##       Parch    < 2.5      to the right, improve=1.2834220, (0 missing)
##       SibSp    < 4.5      to the right, improve=0.4179728, (0 missing)
##   Surrogate splits:
##       Fare < 7.2396   to the right, agree=0.886, adj=0.231, (0 split)
##       Age  < 0.875    to the right, agree=0.875, adj=0.154, (0 split)
## 
## Node number 7: 136 observations
##   predicted class=1  expected loss=0.05147059  P(node) =0.215873
##     class counts:     7   129
##    probabilities: 0.051 0.949 
## 
## Node number 10: 9 observations
##   predicted class=0  expected loss=0.1111111  P(node) =0.01428571
##     class counts:     8     1
##    probabilities: 0.889 0.111 
## 
## Node number 11: 18 observations
##   predicted class=1  expected loss=0.1111111  P(node) =0.02857143
##     class counts:     2    16
##    probabilities: 0.111 0.889 
## 
## Node number 12: 75 observations
##   predicted class=0  expected loss=0.3866667  P(node) =0.1190476
##     class counts:    46    29
##    probabilities: 0.613 0.387 
## 
## Node number 13: 13 observations
##   predicted class=1  expected loss=0.1538462  P(node) =0.02063492
##     class counts:     2    11
##    probabilities: 0.154 0.846

##  FP  TP  TN  FN 
##  13 103 234  63 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8159806 0.6204819 0.9473684 0.8879310 0.7878788 3.3819660 
## attr(,"negative")
## [1] "Died"
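The CP table in the RPart Model 3 summary reports errors relative to the root node, so multiplying by the root node error (259 of 630 observations) gives absolute rates. The sketch below does that conversion and then applies the one-standard-error rule to pick a complexity parameter. That rule is a common pruning heuristic and is not something this report states it used; the table values are copied from the output above.

```r
# CP table copied from the RPart Model 3 summary above
cp_table <- data.frame(
  CP     = c(0.44015444, 0.03281853, 0.02702703, 0.01000000),
  nsplit = c(0, 1, 3, 5),
  xerror = c(1.0000000, 0.5598456, 0.5675676, 0.4671815),
  xstd   = c(0.04768335, 0.04079292, 0.04098852, 0.03817522)
)

# Root node error: predicting "0" for everyone misclassifies 259 of 630 rows
root_error <- 259 / 630

# Convert the relative cross-validated error into an absolute error rate
cp_table$abs_xerror <- cp_table$xerror * root_error

# One-standard-error rule (assumed heuristic): take the simplest tree whose
# xerror lies within one xstd of the minimum xerror
best      <- which.min(cp_table$xerror)
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
chosen    <- which(cp_table$xerror <= threshold)[1]

print(cp_table[chosen, ])
```

With these numbers the rule keeps the full five-split tree, because none of the smaller trees comes within one standard error of the lowest cross-validated error.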

RPart Model 4 :-

==========================================

rpart using engineered variables: Title (extracted from the passenger name) and FamilySize.

## [[1]]
## [1] "Abbing"   " Mr"      " Anthony"

## [1] "Abbing"   " Mr"      " Anthony"

## [1] " Mr"

## 
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1
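The output above suggests that each name is split on the comma and the period, with the second piece taken as the title. A minimal base R sketch of that step follows, using three names from the data preview. The helper `extract_title` is hypothetical, and the `FamilySize` definition (siblings/spouses plus parents/children plus the passenger) is an assumption, since the report does not show how it was built; the `sibsp` and `parch` values are illustrative only.

```r
# Names taken from the data preview at the top of the report
passenger_names <- c("Abbing, Mr. Anthony",
                     "Abbott, Master. Eugene Joseph",
                     "Abelseth, Miss. Karen Marie")

# Hypothetical helper: split on "," or "." and keep the second piece
extract_title <- function(full_name) {
  parts <- strsplit(full_name, split = "[,.]")[[1]]  # surname, title, given names
  sub("^ ", "", parts[2])                            # drop the leading space
}

titles <- vapply(passenger_names, extract_title, character(1), USE.NAMES = FALSE)
print(titles)

# Assumed definition of FamilySize; sibsp and parch are illustrative values
sibsp <- c(0, 1, 0)
parch <- c(0, 2, 0)
family_size <- sibsp + parch + 1
print(family_size)
```

Rare titles such as "Capt" or "Jonkheer" occur only once or twice in the table above, so grouping them into broader categories before modelling may be worth considering.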

## Call:
## rpart(formula = Survived ~ Sex + Title + FamilySize + Pclass, 
##     data = train23, method = "class")
##   n= 785 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.43189369      0 1.0000000 1.0000000 0.04525896
## 2 0.04651163      1 0.5681063 0.5813953 0.03874205
## 3 0.03986711      2 0.5215947 0.5847176 0.03882073
## 4 0.01000000      3 0.4817276 0.5249169 0.03732166
## 
## Variable importance
##      Title        Sex FamilySize     Pclass 
##         40         32         16         12 
## 
## Node number 1: 785 observations,    complexity param=0.4318937
##   predicted class=0  expected loss=0.3834395  P(node) =1
##     class counts:   484   301
##    probabilities: 0.617 0.383 
##   left son=2 (471 obs) right son=3 (314 obs)
##   Primary splits:
##       Title      splits as  RLRRRRLRLLR, improve=109.58130, (0 missing)
##       Sex        splits as  RL,          improve=100.91470, (0 missing)
##       Pclass     splits as  RLL,         improve= 34.25228, (0 missing)
##       FamilySize < 1.5 to the left,      improve= 17.68573, (0 missing)
##   Surrogate splits:
##       Sex        splits as  RL,      agree=0.944, adj=0.86, (0 split)
##       FamilySize < 1.5 to the left,  agree=0.716, adj=0.29, (0 split)
## 
## Node number 2: 471 observations
##   predicted class=0  expected loss=0.1677282  P(node) =0.6
##     class counts:   392    79
##    probabilities: 0.832 0.168 
## 
## Node number 3: 314 observations,    complexity param=0.04651163
##   predicted class=1  expected loss=0.2929936  P(node) =0.4
##     class counts:    92   222
##    probabilities: 0.293 0.707 
##   left son=6 (150 obs) right son=7 (164 obs)
##   Primary splits:
##       Pclass     splits as  RRL,         improve=36.962020, (0 missing)
##       FamilySize < 4.5 to the right,     improve=19.113200, (0 missing)
##       Title      splits as  L-RLLR-R--L, improve= 3.580400, (0 missing)
##       Sex        splits as  RL,          improve= 2.952126, (0 missing)
##   Surrogate splits:
##       Title      splits as  R-RLLR-R--R, agree=0.640, adj=0.247, (0 split)
##       FamilySize < 4.5 to the right,     agree=0.605, adj=0.173, (0 split)
##       Sex        splits as  RL,          agree=0.551, adj=0.060, (0 split)
## 
## Node number 6: 150 observations,    complexity param=0.03986711
##   predicted class=0  expected loss=0.4533333  P(node) =0.1910828
##     class counts:    82    68
##    probabilities: 0.547 0.453 
##   left son=12 (32 obs) right son=13 (118 obs)
##   Primary splits:
##       FamilySize < 4.5 to the right,     improve=10.51934, (0 missing)
##       Sex        splits as  RL,          improve= 1.33426, (0 missing)
##       Title      splits as  ---LR--R---, improve= 1.33426, (0 missing)
##   Surrogate splits:
##       Sex   splits as  RL,          agree=0.813, adj=0.125, (0 split)
##       Title splits as  ---LR--R---, agree=0.813, adj=0.125, (0 split)
## 
## Node number 7: 164 observations
##   predicted class=1  expected loss=0.06097561  P(node) =0.2089172
##     class counts:    10   154
##    probabilities: 0.061 0.939 
## 
## Node number 12: 32 observations
##   predicted class=0  expected loss=0.09375  P(node) =0.04076433
##     class counts:    29     3
##    probabilities: 0.906 0.094 
## 
## Node number 13: 118 observations
##   predicted class=1  expected loss=0.4491525  P(node) =0.1503185
##     class counts:    53    65
##    probabilities: 0.449 0.551

##  FP  TP  TN  FN 
##  48 149 277  50 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.8129771 0.7487437 0.8523077 0.7563452 0.8470948 2.8447398 
## attr(,"negative")
## [1] "Died"

Step function

==================================

Using the step() function to select a model formula by AIC, starting from the full logistic regression.

## 'data.frame':    785 obs. of  8 variables:
##  $ Pclass  : chr  "1st" "3rd" "3rd" "3rd" ...
##  $ Survived: int  0 0 0 0 0 0 1 0 1 1 ...
##  $ Sex     : chr  "male" "male" "female" "male" ...
##  $ Age     : num  39 42 1 20 18 64 34 47 23 15 ...
##  $ SibSp   : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Parch   : int  0 0 1 0 0 0 0 0 0 2 ...
##  $ Fare    : num  29.7 7.55 12.18 7.93 11.5 ...
##  $ Embarked: chr  "Cherbourg" "Southampton" "Southampton" "Southampton" ...

## 'data.frame':    524 obs. of  8 variables:
##  $ Pclass  : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ Survived: int  0 0 1 1 1 1 1 1 1 0 ...
##  $ Sex     : chr  "male" "male" "female" "female" ...
##  $ Age     : num  13 16 35 16 28 ...
##  $ SibSp   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ Parch   : int  2 1 1 0 0 0 1 1 0 0 ...
##  $ Fare    : num  20.25 20.25 20.25 7.65 24 ...
##  $ Embarked: chr  "Southampton" "Southampton" "Southampton" "Southampton" ...

## Start:  AIC=615.95
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Parch     1   596.19 614.19
## - Fare      1   596.40 614.40
## <none>          595.95 615.95
## - SibSp     1   601.80 619.80
## - Embarked  2   603.84 619.84
## - Age       1   611.56 629.56
## - Pclass    2   626.02 642.02
## - Sex       1   741.44 759.44
## 
## Step:  AIC=614.19
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
## 
##            Df Deviance    AIC
## - Fare      1   596.90 612.90
## <none>          596.19 614.19
## + Parch     1   595.95 615.95
## - SibSp     1   601.84 617.84
## - Embarked  2   604.15 618.15
## - Age       1   612.39 628.39
## - Pclass    2   626.07 640.07
## - Sex       1   745.47 761.47
## 
## Step:  AIC=612.9
## Survived ~ Pclass + Sex + Age + SibSp + Embarked
## 
##            Df Deviance    AIC
## <none>          596.90 612.90
## + Fare      1   596.19 614.19
## + Parch     1   596.40 614.40
## - SibSp     1   602.01 616.01
## - Embarked  2   605.60 617.60
## - Age       1   613.41 627.41
## - Pclass    2   644.57 656.57
## - Sex       1   750.40 764.40
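For a logistic regression on a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the three models in the trace above. The deviances are copied from the output, and the coefficient counts are derived from the formulas (intercept, two dummies each for Pclass and Embarked, one term for each remaining variable).

```r
# Residual deviances read off the step() trace above, with coefficient counts
# that include the intercept and one term per dummy variable
step_trace <- data.frame(
  model    = c("full", "without Parch", "without Parch and Fare"),
  deviance = c(595.95, 596.19, 596.90),
  n_coef   = c(10, 9, 8)
)

# AIC = residual deviance + 2 * number of coefficients (0/1 binomial response)
step_trace$AIC <- step_trace$deviance + 2 * step_trace$n_coef
print(step_trace)
```

Each removal raises the deviance by less than 2, which is why dropping Parch and then Fare lowers the AIC and the search stops at the five-variable model.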

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked, 
##     family = "binomial", data = train20)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1751  -0.7021  -0.4469   0.6893   2.3100  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          4.073112   0.477934   8.522  < 2e-16 ***
## Pclass2nd           -1.143384   0.305017  -3.749 0.000178 ***
## Pclass3rd           -1.955423   0.296704  -6.590 4.38e-11 ***
## Sexmale             -2.501689   0.224279 -11.154  < 2e-16 ***
## Age                 -0.033351   0.008437  -3.953 7.72e-05 ***
## SibSp               -0.270168   0.124246  -2.174 0.029671 *  
## EmbarkedQueenstown  -1.244848   0.532301  -2.339 0.019355 *  
## EmbarkedSouthampton -0.702309   0.268756  -2.613 0.008970 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.35  on 629  degrees of freedom
## Residual deviance: 596.90  on 622  degrees of freedom
## AIC: 612.9
## 
## Number of Fisher Scoring iterations: 4
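The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios, and the inverse logit of a linear combination gives a predicted survival probability. The sketch below illustrates both with the estimates copied from the summary. The example passenger is hypothetical and chosen only to show the arithmetic.

```r
# Estimates copied from the step-selected model summary above
coefs <- c(
  "(Intercept)"         =  4.073112,
  "Pclass2nd"           = -1.143384,
  "Pclass3rd"           = -1.955423,
  "Sexmale"             = -2.501689,
  "Age"                 = -0.033351,
  "SibSp"               = -0.270168,
  "EmbarkedQueenstown"  = -1.244848,
  "EmbarkedSouthampton" = -0.702309
)

# Odds ratios relative to the baseline (1st class, female, Cherbourg)
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical passenger: 30-year-old man, 3rd class, no siblings or spouse
# aboard, embarked at Southampton
x <- c("(Intercept)" = 1, "Pclass2nd" = 0, "Pclass3rd" = 1, "Sexmale" = 1,
       "Age" = 30, "SibSp" = 0, "EmbarkedQueenstown" = 0,
       "EmbarkedSouthampton" = 1)

eta       <- sum(coefs * x[names(coefs)])  # linear predictor (log-odds)
p_survive <- plogis(eta)                   # inverse logit
print(p_survive)
```

Read this way, being male multiplies the odds of survival by less than 0.1 with the other variables held fixed, which is consistent with Sex having by far the largest deviance increase in the step trace.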

##  Named num [1:630] 0.5673 0.0768 0.7525 0.1476 0.2943 ...
##  - attr(*, "names")= chr [1:630] "363" "1" "668" "19" ...

## 
##   0   1 
## 388 242
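The table above counts class labels obtained from the fitted probabilities. A minimal sketch of that conversion follows, using the five probabilities visible in the preceding output. The 0.5 cutoff is an assumption; the report does not state which threshold was applied.

```r
# First five fitted probabilities shown in the output above
p <- c(0.5673, 0.0768, 0.7525, 0.1476, 0.2943)

# Assumed cutoff: predict "survived" (1) when the probability exceeds 0.5
cutoff    <- 0.5
predicted <- ifelse(p > cutoff, 1L, 0L)

print(table(predicted))
```

Lowering the cutoff would trade false negatives for false positives, which matters here because the project gives more weight to correctly identified survivors.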

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked, family = "binomial", data = train20)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2456  -0.6947  -0.4435   0.6935   2.3081  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.863088   0.528253   7.313 2.61e-13 ***
## Pclass2nd           -1.042704   0.337055  -3.094 0.001978 ** 
## Pclass3rd           -1.833942   0.341764  -5.366 8.05e-08 ***
## Sexmale             -2.472881   0.226489 -10.918  < 2e-16 ***
## Age                 -0.032616   0.008483  -3.845 0.000121 ***
## SibSp               -0.306722   0.131959  -2.324 0.020106 *  
## Parch                0.070353   0.143910   0.489 0.624934    
## Fare                 0.001537   0.002348   0.655 0.512716    
## EmbarkedQueenstown  -1.189361   0.534583  -2.225 0.026092 *  
## EmbarkedSouthampton -0.677110   0.271398  -2.495 0.012599 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.35  on 629  degrees of freedom
## Residual deviance: 595.95  on 620  degrees of freedom
## AIC: 615.95
## 
## Number of Fisher Scoring iterations: 4

## [1] 615.9482

##  FP  TP  TN  FN 
##  40 121 207  45 
## attr(,"negative")
## [1] "Died"

##       acc      sens      spec       ppv       npv       lor 
## 0.7941889 0.7289157 0.8380567 0.7515528 0.8214286 2.6329674 
## attr(,"negative")
## [1] "Died"

Comparison of the metrics of all the models.

========================================================================

Per the business metric defined for this project, true positives (TP) are given more weight. On that basis, the best fit was determined using the formula (TP - TN + FP - FN).
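As an illustration of the formula, the sketch below applies it to the four confusion matrices printed in this part of the report. The model labels are shorthand chosen here, and the other models in the comparison table are not included. Note that these are raw counts and the test sets differ in size (413 versus 524 rows), so the scores are not directly comparable across all rows.

```r
# Confusion counts copied from the outputs above; labels are shorthand
counts <- data.frame(
  model = c("RPart Model 2", "RPart Model 3", "RPart Model 4",
            "Step function section"),
  FP = c(21, 13, 48, 40),
  TP = c(108, 103, 149, 121),
  TN = c(226, 234, 277, 207),
  FN = c(58, 63, 50, 45)
)

# Score as defined in the text: TP - TN + FP - FN
counts$score <- with(counts, TP - TN + FP - FN)

# Test set size behind each confusion matrix
counts$n <- with(counts, FP + TP + TN + FN)

print(counts)
```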

Observation:-

====================

Because the cost chosen for this project is that of correctly predicting survivors, the table suggests that “RPart Feature Engineering” is the best-fitting model, followed by C50 Decision Tree 1 and GLM Model 2.

Title: “Regression/Classification project”

Author: “Vijay”

Date: “12 August 2015”
