========================================
This project predicts which passengers survived the Titanic disaster. On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. The goal of the project is to predict which passengers survived the tragedy.
=======================================
The datasets used in this project come from the Kaggle portal and are part of Kaggle's competition to predict the survivors of the sinking.
The two datasets are:
+ train.csv
+ test.csv

train.csv is used to build the model, and test.csv is used to apply the model and predict which passengers survived, using regression/classification.
======================================================================================
test.csv contains the same variables as train.csv, except for the survival variable, which is the one to be predicted.
==============================
Similar analyses have been done by others. The following are links to some of the papers submitted.
##   PassengerId Pclass Survived                            Name    Sex Age
## 1         601    3rd        0             Abbing, Mr. Anthony   male  42
## 2         602    3rd        0   Abbott, Master. Eugene Joseph   male  13
## 3         603    3rd        0     Abbott, Mr. Rossmore Edward   male  16
## 4         604    3rd        1 Abbott, Mrs. Stanton (Rosa Hunt female  35
## 5         605    3rd        1     Abelseth, Miss. Karen Marie female  16
## 6         606    3rd        1   Abelseth, Mr. Olaus Jorgensen   male  25
##   SibSp Parch    Ticket  Fare Cabin    Embarked Boat Body
## 1     0     0 C.A. 5547  7.55       Southampton        NA
## 2     0     2 C.A. 2673 20.25       Southampton        NA
## 3     1     1 C.A. 2673 20.25       Southampton       190
## 4     1     1 C.A. 2673 20.25       Southampton    A   NA
## 5     0     0    348125  7.65       Southampton   16   NA
## 6     0     0    348122  7.65 F G63 Southampton    A   NA
##                home.dest
## 1
## 2    East Providence, RI
## 3    East Providence, RI
## 4    East Providence, RI
## 5 Norway Los Angeles, CA
## 6     Perkins County, SD
##
## Summary of Titanic Dataset
## ------------------------------------------
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------
## PASSENGERID 1,309 655.0 378.0 1 1,309
## SURVIVED 1,309 0.4 0.5 0 1
## AGE 1,046 29.9 14.4 0.2 80.0
## SIBSP 1,309 0.5 1.0 0 8
## PARCH 1,309 0.4 0.9 0 9
## FARE 1,308 33.3 51.8 0.0 512.3
## BODY 121 160.8 97.7 1 328
## ------------------------------------------
## 'data.frame': 1309 obs. of 15 variables:
## $ PassengerId: int 601 602 603 604 605 606 324 325 607 608 ...
## $ Pclass : chr "3rd" "3rd" "3rd" "3rd" ...
## $ Survived : int 0 0 0 1 1 1 0 1 1 1 ...
## $ Name : chr "Abbing, Mr. Anthony" "Abbott, Master. Eugene Joseph" "Abbott, Mr. Rossmore Edward" "Abbott, Mrs. Stanton (Rosa Hunt" ...
## $ Sex : chr "male" "male" "male" "female" ...
## $ Age : num 42 13 16 35 16 25 30 28 20 18 ...
## $ SibSp : int 0 0 1 1 0 0 1 1 0 0 ...
## $ Parch : int 0 2 1 1 0 0 0 0 0 0 ...
## $ Ticket : chr "C.A. 5547" "C.A. 2673" "C.A. 2673" "C.A. 2673" ...
## $ Fare : num 7.55 20.25 20.25 20.25 7.65 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Southampton" "Southampton" "Southampton" "Southampton" ...
## $ Boat : chr "" "" "" "A" ...
## $ Body : int NA NA 190 NA NA NA NA NA NA NA ...
## $ home.dest : chr "" "East Providence, RI" "East Providence, RI" "East Providence, RI" ...
## [1] 1309
==============================
##
## Chi-squared test for given probabilities
##
## data: xt$Age
## X-squared = 7241.1, df = 1044, p-value < 2.2e-16
===============================================
## summary statistics
## ------
## min: 0.1667 max: 80
## median: 28
## mean: 29.85183
## estimated sd: 14.3892
## estimated skewness: 0.4070077
## estimated kurtosis: 3.154922
## summary statistics
## ------
## min: 1 max: 3
## median: 2
## mean: 2.206699
## estimated sd: 0.8415418
## estimated skewness: -0.4053072
## estimated kurtosis: 1.529315
## summary statistics
## ------
## min: 1 max: 256
## median: 110
## mean: 115.622
## estimated sd: 71.99422
## estimated skewness: 0.3300987
## estimated kurtosis: 1.958779
===============================
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2456 -0.6947 -0.4435 0.6935 2.3081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.863088 0.528253 7.313 2.61e-13 ***
## Pclass2nd -1.042704 0.337055 -3.094 0.001978 **
## Pclass3rd -1.833942 0.341764 -5.366 8.05e-08 ***
## Sexmale -2.472881 0.226489 -10.918 < 2e-16 ***
## Age -0.032616 0.008483 -3.845 0.000121 ***
## SibSp -0.306722 0.131959 -2.324 0.020106 *
## Parch 0.070353 0.143910 0.489 0.624934
## Fare 0.001537 0.002348 0.655 0.512716
## EmbarkedQueenstown -1.189361 0.534583 -2.225 0.026092 *
## EmbarkedSouthampton -0.677110 0.271398 -2.495 0.012599 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 853.35 on 629 degrees of freedom
## Residual deviance: 595.95 on 620 degrees of freedom
## AIC: 615.95
##
## Number of Fisher Scoring iterations: 4
## [1] 615.9482
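As a reading aid for the coefficient table above, the following sketch (written in Python rather than the R that produced this output) converts the printed log-odds estimates into an odds ratio and a predicted survival probability. The coefficient values are copied from the summary. The example passenger and the helper name `survival_probability` are hypothetical, and the reference levels (1st class, female, Cherbourg) are inferred from the levels that are absent from the table.

```python
import math

# Estimates copied from the glm summary for Survived ~ . (data = train1).
COEF = {
    "(Intercept)": 3.863088,
    "Pclass2nd": -1.042704,
    "Pclass3rd": -1.833942,
    "Sexmale": -2.472881,
    "Age": -0.032616,
    "SibSp": -0.306722,
    "Parch": 0.070353,
    "Fare": 0.001537,
    "EmbarkedQueenstown": -1.189361,
    "EmbarkedSouthampton": -0.677110,
}


def survival_probability(features):
    """Inverse logit of the linear predictor for one passenger.

    `features` maps coefficient names to values. Dummy variables are 0/1,
    and any omitted term counts as 0 (the assumed reference level).
    """
    logit = COEF["(Intercept)"]
    for name, value in features.items():
        logit += COEF[name] * value
    return 1.0 / (1.0 + math.exp(-logit))


# Odds ratio for male versus female, all other terms held fixed.
odds_ratio_male = math.exp(COEF["Sexmale"])

# Hypothetical passenger: 30-year-old man, 3rd class, travelling alone,
# fare 7.65, embarked at Southampton.
p_example = survival_probability({
    "Pclass3rd": 1,
    "Sexmale": 1,
    "Age": 30,
    "Fare": 7.65,
    "EmbarkedSouthampton": 1,
})

print(f"odds ratio (male vs female): {odds_ratio_male:.3f}")
print(f"predicted survival probability: {p_example:.3f}")
```

An odds ratio well below 1 for `Sexmale` means that, with the other terms held fixed, this model assigns men much lower odds of survival than women.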
## FP TP TN FN
## 42 123 205 43
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.7941889 0.7409639 0.8299595 0.7454545 0.8266129 2.6363246
## attr(,"negative")
## [1] "Died"
## [1] 166
## [1] 165
## [1] 247
## [1] 248
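The six summary measures above follow directly from the four confusion counts. The sketch below (Python, not the R used for this report) recomputes them from FP, TP, TN and FN. The function name `confusion_metrics` is hypothetical, and reading `lor` as the natural log of the odds ratio (TP x TN) / (FP x FN) is an assumption that happens to match the printed value. The four numbers printed last (166, 165, 247, 248) appear to be the denominators TP+FN, TP+FP, TN+FP and TN+FN.

```python
import math


def confusion_metrics(fp, tp, tn, fn):
    """Return accuracy, sensitivity, specificity, PPV, NPV and log odds ratio."""
    total = fp + tp + tn + fn
    return {
        "acc": (tp + tn) / total,           # share of all cases classified correctly
        "sens": tp / (tp + fn),             # survivors correctly identified
        "spec": tn / (tn + fp),             # non-survivors correctly identified
        "ppv": tp / (tp + fp),              # predicted survivors who survived
        "npv": tn / (tn + fn),              # predicted deaths who died
        "lor": math.log((tp * tn) / (fp * fn)),  # natural-log odds ratio
    }


# Counts copied from the output above for the first logistic model.
metrics = confusion_metrics(fp=42, tp=123, tn=205, fn=43)
for name, value in metrics.items():
    print(f"{name}: {value:.7f}")
```

The same function can be applied to the counts reported for each of the later models to compare them on a common footing.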
==========================
##
## Call:
## glm(formula = Survived ~ Sex + Age + SibSp + Parch * Pclass +
## Fare, family = "binomial", data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3291 -0.5828 -0.4758 0.6317 2.3708
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.248872 0.443741 7.322 2.45e-13 ***
## Sexmale -2.545496 0.203084 -12.534 < 2e-16 ***
## Age -0.030926 0.008229 -3.758 0.000171 ***
## SibSp -0.249122 0.113781 -2.189 0.028562 *
## Parch -0.021865 0.265659 -0.082 0.934404
## Pclass2nd -1.566125 0.336842 -4.649 3.33e-06 ***
## Pclass3rd -1.797859 0.292860 -6.139 8.31e-10 ***
## Fare 0.003040 0.002477 1.227 0.219650
## Parch:Pclass2nd 1.062037 0.415413 2.557 0.010571 *
## Parch:Pclass3rd -0.269219 0.307315 -0.876 0.381009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1045.18 on 784 degrees of freedom
## Residual deviance: 718.17 on 775 degrees of freedom
## AIC: 738.17
##
## Number of Fisher Scoring iterations: 5
## [1] 738.1675
## FP TP TN FN
## 41 137 284 62
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8034351 0.6884422 0.8738462 0.7696629 0.8208092 2.7282487
## attr(,"negative")
## [1] "Died"
=========================
##
## Call:
## glm(formula = Survived ~ Sex + poly(Age, 2) + Pclass * SibSp +
## Parch, family = "binomial", data = train5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3690 -0.6602 -0.4429 0.6433 2.4406
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3814 0.3080 7.732 1.06e-14 ***
## Sexmale -2.4679 0.2233 -11.052 < 2e-16 ***
## poly(Age, 2)1 -13.0071 3.1032 -4.191 2.77e-05 ***
## poly(Age, 2)2 6.6840 2.7485 2.432 0.015021 *
## Pclass2nd -1.2033 0.3472 -3.466 0.000528 ***
## Pclass3rd -1.7919 0.3192 -5.614 1.97e-08 ***
## SibSp 0.2334 0.3135 0.744 0.456594
## Parch 0.0534 0.1438 0.371 0.710421
## Pclass2nd:SibSp -0.2276 0.4537 -0.502 0.616009
## Pclass3rd:SibSp -0.8590 0.3527 -2.436 0.014868 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 853.35 on 629 degrees of freedom
## Residual deviance: 592.02 on 620 degrees of freedom
## AIC: 612.02
##
## Number of Fisher Scoring iterations: 5
## [1] 612.0175
## FP TP TN FN
## 31 126 216 40
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8280872 0.7590361 0.8744939 0.8025478 0.8437500 3.0886937
## attr(,"negative")
## [1] "Died"
## df AIC
## glm.logistic2 10 738.1675
## glm.logistic3 10 612.0175
## glm.logistic1 10 615.9482
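The AIC values in this comparison can be reproduced from the residual deviances printed in the three model summaries. The sketch below (Python, not the original R) assumes the usual relation AIC = -2 log-likelihood + 2k, and that for a 0/1 response the residual deviance equals -2 log-likelihood. The pairing of model names with summaries is inferred from the matching AIC values, and the row counts are the null degrees of freedom plus one.

```python
# (residual deviance, number of estimated coefficients, training rows),
# all read from the three glm summaries above.
MODELS = {
    "glm.logistic1": (595.95, 10, 630),
    "glm.logistic2": (718.17, 10, 785),
    "glm.logistic3": (592.02, 10, 630),
}


def aic_from_deviance(residual_deviance, n_params):
    """AIC for a binary-response glm: residual deviance plus 2 per coefficient."""
    return residual_deviance + 2 * n_params


for name, (deviance, k, n_rows) in MODELS.items():
    aic = aic_from_deviance(deviance, k)
    print(f"{name}: AIC = {aic:.2f} (fit on {n_rows} rows)")
```

Because glm.logistic2 was fit on 785 rows while the other two were fit on 630, its larger AIC reflects the larger sample at least in part, so only glm.logistic1 and glm.logistic3 are directly comparable here.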
=======================
##
## Call:
## C5.0.formula(formula = Survived ~ ., data = train7, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 17 03:15:33 2015
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 630 cases (8 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (406/90, lift 1.3)
## Sex = male
## -> class 0 [0.777]
##
## Rule 2: (269/64, lift 1.3)
## Pclass = 3rd
## Embarked in {Southampton, Queenstown}
## -> class 0 [0.760]
##
## Rule 3: (57/2, lift 2.3)
## Sex = female
## Embarked = Cherbourg
## -> class 1 [0.949]
##
## Rule 4: (136/7, lift 2.3)
## Pclass in {1st, 2nd}
## Sex = female
## -> class 1 [0.942]
##
## Rule 5: (8, lift 2.2)
## Pclass = 1st
## SibSp > 0
## Fare <= 57.9792
## Embarked = Cherbourg
## -> class 1 [0.900]
##
## Rule 6: (18/2, lift 2.1)
## Sex = male
## Age <= 9
## SibSp <= 3
## -> class 1 [0.850]
##
## Rule 7: (4, lift 2.0)
## Fare > 263
## -> class 1 [0.833]
##
## Rule 8: (18/3, lift 1.9)
## SibSp > 0
## Fare > 73.5
## Fare <= 136.7792
## Embarked = Southampton
## -> class 1 [0.800]
##
## Default class: 0
##
##
## Evaluation on training data (630 cases):
##
## Rules
## ----------------
## No Errors
##
## 8 107(17.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 359 12 (a): class 0
## 95 164 (b): class 1
##
##
## Attribute usage:
##
## 88.10% Sex
## 65.08% Pclass
## 55.40% Embarked
## 6.98% SibSp
## 4.76% Fare
## 2.86% Age
##
##
## Time: 0.0 secs
## FP TP TN FN
## 14 100 233 66
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8062954 0.6024096 0.9433198 0.8771930 0.7792642 3.2274966
## attr(,"negative")
## [1] "Died"
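In the C5.0 rule listing above, each rule header has the form (n/m, lift x) and each conclusion carries a bracketed confidence. A minimal Python sketch of how those numbers relate, assuming the Laplace ratio (n - m + 1) / (n + 2) for confidence and confidence divided by the training frequency of the predicted class for lift, as described in the C5.0 documentation. The function names are hypothetical, and the class totals are the row sums of the training confusion matrix.

```python
def laplace_confidence(n_covered, n_wrong=0):
    """Laplace-corrected accuracy of a rule covering n cases with m errors."""
    return (n_covered - n_wrong + 1) / (n_covered + 2)


def lift(confidence, class_count, total):
    """Rule confidence relative to the base rate of the predicted class."""
    return confidence / (class_count / total)


# Training class totals: 359 + 12 = 371 (class 0), 95 + 164 = 259 (class 1).
CLASS_COUNTS = {0: 371, 1: 259}
TOTAL = 630

# (rule number, cases covered, cases misclassified, predicted class)
RULES = [(1, 406, 90, 0), (3, 57, 2, 1), (5, 8, 0, 1)]

for rule, n, m, predicted in RULES:
    conf = laplace_confidence(n, m)
    rule_lift = lift(conf, CLASS_COUNTS[predicted], TOTAL)
    print(f"Rule {rule}: confidence {conf:.3f}, lift {rule_lift:.1f}")
```

The Laplace correction pulls the confidence of small rules toward 0.5, which is why Rule 5 covers 8 cases without error yet is reported at 0.900 rather than 1.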
==========================
##
## Call:
## C5.0.formula(formula = Survived ~ Age + Sex * Pclass + Fare *
## Embarked, data = train10, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 17 03:15:34 2015
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 630 cases (6 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (379/73, lift 1.4)
## Age > 9
## Sex = male
## -> class 0 [0.806]
##
## Rule 2: (269/64, lift 1.3)
## Pclass = 3rd
## Embarked in {Southampton, Queenstown}
## -> class 0 [0.760]
##
## Rule 3: (57/2, lift 2.3)
## Sex = female
## Embarked = Cherbourg
## -> class 1 [0.949]
##
## Rule 4: (136/7, lift 2.3)
## Sex = female
## Pclass in {1st, 2nd}
## -> class 1 [0.942]
##
## Rule 5: (52/19, lift 1.5)
## Age <= 9
## -> class 1 [0.630]
##
## Default class: 0
##
##
## Evaluation on training data (630 cases):
##
## Rules
## ----------------
## No Errors
##
## 5 117(18.6%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 362 9 (a): class 0
## 108 151 (b): class 1
##
##
## Attribute usage:
##
## 83.81% Sex
## 68.41% Age
## 64.29% Pclass
## 51.75% Embarked
##
##
## Time: 0.0 secs
## FP TP TN FN
## 13 100 234 66
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8087167 0.6024096 0.9473684 0.8849558 0.7800000 3.3058872
## attr(,"negative")
## [1] "Died"
===========================
##
## Call:
## C5.0.formula(formula = Survived ~ Age + Sex * Pclass + Fare, data
## = train13, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 17 03:15:34 2015
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 785 cases (5 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (18/1, lift 1.5)
## Sex = female
## Pclass = 3rd
## Fare > 23.25
## -> class 0 [0.900]
##
## Rule 2: (295/48, lift 1.4)
## Sex = male
## Pclass = 3rd
## -> class 0 [0.835]
##
## Rule 3: (478/82, lift 1.3)
## Age > 14
## Sex = male
## -> class 0 [0.827]
##
## Rule 4: (21, lift 2.5)
## Age <= 14
## Pclass in {1st, 2nd}
## -> class 1 [0.957]
##
## Rule 5: (272/73, lift 1.9)
## Sex = female
## -> class 1 [0.730]
##
## Default class: 0
##
##
## Evaluation on training data (785 cases):
##
## Rules
## ----------------
## No Errors
##
## 5 147(18.7%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 428 56 (a): class 0
## 91 210 (b): class 1
##
##
## Attribute usage:
##
## 98.47% Sex
## 63.57% Age
## 42.55% Pclass
## 2.29% Fare
##
##
## Time: 0.0 secs
## FP TP TN FN
## 42 142 283 57
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8110687 0.7135678 0.8707692 0.7717391 0.8323529 2.8205531
## attr(,"negative")
## [1] "Died"
=============================================================
##
## Call:
## C5.0.default(x = train14[, -2], y = train14$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 17 03:15:34 2015
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 630 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1st,2nd}: 1 (136/7)
## :   Pclass = 3rd:
## :   :...Embarked = Cherbourg: 1 (13/2)
## :       Embarked in {Southampton,Queenstown}: 0 (75/29)
## Sex = male:
## :...Age <= 9:
##     :...SibSp <= 3: 1 (18/2)
##     :   SibSp > 3: 0 (9/1)
##     Age > 9:
##     :...Pclass in {3rd,2nd}: 0 (284/40)
##         Pclass = 1st:
##         :...Embarked = Queenstown: 0 (0)
##             Embarked = Cherbourg:
##             :...SibSp <= 0:
##             :   :...Fare <= 263: 0 (22/7)
##             :   :   Fare > 263: 1 (2)
##             :   SibSp > 0:
##             :   :...Fare <= 57.9792: 1 (5)
##             :       Fare > 57.9792: 0 (13/4)
##             Embarked = Southampton:
##             :...SibSp <= 0:
##                 :...Fare > 31.3875: 0 (14)
##                 :   Fare <= 31.3875:
##                 :   :...Fare <= 26: 0 (7)
##                 :       Fare > 26: 1 (13/5)
##                 SibSp > 0:
##                 :...Fare <= 73.5: 0 (6/1)
##                     Fare > 73.5:
##                     :...Fare <= 136.7792: 1 (9/3)
##                         Fare > 136.7792: 0 (4)
##
##
## Evaluation on training data (630 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 16 101(16.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 352 19 (a): class 0
## 82 177 (b): class 1
##
##
## Attribute usage:
##
## 100.00% Sex
## 95.71% Pclass
## 64.44% Age
## 29.05% Embarked
## 19.37% SibSp
## 15.08% Fare
##
##
## Time: 0.0 secs
## FP TP TN FN
## 25 107 222 59
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.7966102 0.6445783 0.8987854 0.8106061 0.7900356 2.7790929
## attr(,"negative")
## [1] "Died"
==========================
##
## Call:
## C5.0.default(x = train15[, -c(2, 6, 7)], y = train15$Survived, control
## = C5.0Control(winnow = TRUE))
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 17 03:15:34 2015
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 630 cases (6 attributes) from undefined.data
##
## No attributes winnowed
##
## Decision tree:
##
## Sex = male:
## :...Age <= 9: 1 (27/10)
## :   Age > 9: 0 (379/73)
## Sex = female:
## :...Pclass in {1st,2nd}: 1 (136/7)
##     Pclass = 3rd:
##     :...Embarked = Cherbourg: 1 (13/2)
##         Embarked in {Southampton,Queenstown}: 0 (75/29)
##
##
## Evaluation on training data (630 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 5 121(19.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 352 19 (a): class 0
## 102 157 (b): class 1
##
##
## Attribute usage:
##
## 100.00% Sex
## 64.44% Age
## 35.56% Pclass
## 13.97% Embarked
##
##
## Time: 0.0 secs
## Overall
## Age 25
## Embarked 25
## Pclass 25
## Sex 25
## SibSp 0
## FP TP TN FN
## 20 103 227 63
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.7990315 0.6204819 0.9190283 0.8373984 0.7827586 2.9208120
## attr(,"negative")
## [1] "Died"
======================================
## Call:
## rpart(formula = Survived ~ ., data = train16, method = "class")
## n= 630
##
## CP nsplit rel error xerror xstd
## 1 0.44015444 0 1.0000000 1.0000000 0.04768335
## 2 0.03281853 1 0.5598456 0.5598456 0.04079292
## 3 0.02702703 3 0.4942085 0.5675676 0.04098852
## 4 0.01000000 5 0.4401544 0.4671815 0.03817522
##
## Variable importance
## Sex Pclass Fare Age Embarked SibSp Parch
## 46 15 15 9 6 6 2
##
## Node number 1: 630 observations, complexity param=0.4401544
## predicted class=0 expected loss=0.4111111 P(node) =1
## class counts: 371 259
## probabilities: 0.589 0.411
## left son=2 (406 obs) right son=3 (224 obs)
## Primary splits:
## Sex splits as RL, improve=81.95485, (0 missing)
## Fare < 52.2771 to the left, improve=31.22644, (0 missing)
## Pclass splits as RLL, improve=27.44000, (0 missing)
## Embarked splits as RLL, improve=16.34740, (0 missing)
## Parch < 0.5 to the left, improve=10.22907, (0 missing)
## Surrogate splits:
## Fare < 77.6229 to the left, agree=0.673, adj=0.080, (0 split)
## Parch < 0.5 to the left, agree=0.660, adj=0.045, (0 split)
## Embarked splits as LRL, agree=0.651, adj=0.018, (0 split)
## Age < 5.5 to the right, agree=0.646, adj=0.004, (0 split)
##
## Node number 2: 406 observations, complexity param=0.02702703
## predicted class=0 expected loss=0.2216749 P(node) =0.6444444
## class counts: 316 90
## probabilities: 0.778 0.222
## left son=4 (379 obs) right son=5 (27 obs)
## Primary splits:
## Age < 9.5 to the right, improve=9.627302, (0 missing)
## Pclass splits as RLL, improve=5.482567, (0 missing)
## Fare < 26.26875 to the left, improve=4.713136, (0 missing)
## Embarked splits as RLL, improve=3.793760, (0 missing)
## Parch < 0.5 to the left, improve=1.996034, (0 missing)
## Surrogate splits:
## SibSp < 3.5 to the left, agree=0.943, adj=0.148, (0 split)
##
## Node number 3: 224 observations, complexity param=0.03281853
## predicted class=1 expected loss=0.2455357 P(node) =0.3555556
## class counts: 55 169
## probabilities: 0.246 0.754
## left son=6 (88 obs) right son=7 (136 obs)
## Primary splits:
## Pclass splits as RRL, improve=26.075300, (0 missing)
## Fare < 21.55 to the left, improve=11.673610, (0 missing)
## Embarked splits as RLL, improve= 6.772141, (0 missing)
## Age < 32.5 to the left, improve= 5.080290, (0 missing)
## SibSp < 2.5 to the right, improve= 1.287827, (0 missing)
## Surrogate splits:
## Fare < 20.7875 to the left, agree=0.853, adj=0.625, (0 split)
## Age < 22.5 to the left, agree=0.688, adj=0.205, (0 split)
## Embarked splits as RLR, agree=0.661, adj=0.136, (0 split)
## SibSp < 3.5 to the right, agree=0.634, adj=0.068, (0 split)
## Parch < 2.5 to the right, agree=0.612, adj=0.011, (0 split)
##
## Node number 4: 379 observations
## predicted class=0 expected loss=0.1926121 P(node) =0.6015873
## class counts: 306 73
## probabilities: 0.807 0.193
##
## Node number 5: 27 observations, complexity param=0.02702703
## predicted class=1 expected loss=0.3703704 P(node) =0.04285714
## class counts: 10 17
## probabilities: 0.370 0.630
## left son=10 (9 obs) right son=11 (18 obs)
## Primary splits:
## SibSp < 3 to the right, improve=7.25925900, (0 missing)
## Pclass splits as RRL, improve=4.35729800, (0 missing)
## Fare < 29.0625 to the right, improve=1.79259300, (0 missing)
## Parch < 1.5 to the left, improve=0.19698820, (0 missing)
## Age < 3.5 to the right, improve=0.09259259, (0 missing)
## Surrogate splits:
## Fare < 29.0625 to the right, agree=0.778, adj=0.333, (0 split)
## Embarked splits as RLR, agree=0.741, adj=0.222, (0 split)
## Pclass splits as RRL, agree=0.704, adj=0.111, (0 split)
##
## Node number 6: 88 observations, complexity param=0.03281853
## predicted class=0 expected loss=0.4545455 P(node) =0.1396825
## class counts: 48 40
## probabilities: 0.545 0.455
## left son=12 (75 obs) right son=13 (13 obs)
## Primary splits:
## Embarked splits as RLL, improve=4.6784150, (0 missing)
## Fare < 23.0875 to the right, improve=3.3246750, (0 missing)
## Age < 19.5 to the right, improve=1.6813660, (0 missing)
## SibSp < 1.5 to the right, improve=0.3159003, (0 missing)
## Parch < 0.5 to the left, improve=0.1302693, (0 missing)
## Surrogate splits:
## Fare < 7.2396 to the right, agree=0.886, adj=0.231, (0 split)
## Age < 0.875 to the right, agree=0.875, adj=0.154, (0 split)
##
## Node number 7: 136 observations
## predicted class=1 expected loss=0.05147059 P(node) =0.215873
## class counts: 7 129
## probabilities: 0.051 0.949
##
## Node number 10: 9 observations
## predicted class=0 expected loss=0.1111111 P(node) =0.01428571
## class counts: 8 1
## probabilities: 0.889 0.111
##
## Node number 11: 18 observations
## predicted class=1 expected loss=0.1111111 P(node) =0.02857143
## class counts: 2 16
## probabilities: 0.111 0.889
##
## Node number 12: 75 observations
## predicted class=0 expected loss=0.3866667 P(node) =0.1190476
## class counts: 46 29
## probabilities: 0.613 0.387
##
## Node number 13: 13 observations
## predicted class=1 expected loss=0.1538462 P(node) =0.02063492
## class counts: 2 11
## probabilities: 0.154 0.846
## FP TP TN FN
## 13 103 234 63
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8159806 0.6204819 0.9473684 0.8879310 0.7878788 3.3819660
## attr(,"negative")
## [1] "Died"
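The `rel error` column of the rpart complexity table above is scaled so that the root node has error 1. The Python sketch below (not the original R) converts it back to training misclassification counts by multiplying by the 259 cases the root node gets wrong (class counts 371 and 259). It also applies the one-standard-error rule for choosing a tree size, which is a common heuristic and not something the source states was used. The function names are hypothetical.

```python
ROOT_ERRORS = 259  # cases misclassified when every case is predicted as class 0
N_OBS = 630

# (CP, nsplit, rel error, xerror, xstd) rows copied from the table above,
# ordered from the smallest tree to the largest.
CP_TABLE = [
    (0.44015444, 0, 1.0000000, 1.0000000, 0.04768335),
    (0.03281853, 1, 0.5598456, 0.5598456, 0.04079292),
    (0.02702703, 3, 0.4942085, 0.5675676, 0.04098852),
    (0.01000000, 5, 0.4401544, 0.4671815, 0.03817522),
]


def training_errors(rel_error):
    """Number of training cases misclassified at a given relative error."""
    return round(rel_error * ROOT_ERRORS)


def one_se_choice(table):
    """Smallest tree whose xerror is within one xstd of the minimum xerror."""
    best = min(table, key=lambda row: row[3])
    threshold = best[3] + best[4]
    return next(row for row in table if row[3] <= threshold)


for cp, nsplit, rel_error, xerror, xstd in CP_TABLE:
    errors = training_errors(rel_error)
    print(f"nsplit={nsplit}: {errors} training errors ({errors / N_OBS:.1%})")

chosen = one_se_choice(CP_TABLE)
print(f"1-SE choice: nsplit={chosen[1]}, cp={chosen[0]}")
```

The count for the five-split tree agrees with the sum of the minority-class counts over the six leaf nodes listed in the summary (73 + 7 + 1 + 2 + 29 + 2).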
=======================================
## Call:
## rpart(formula = Survived ~ Sex + Pclass + Fare, data = train19,
## method = "class", control = rpart.control(minsplit = 3, cp = 0.01))
## n= 630
##
## CP nsplit rel error xerror xstd
## 1 0.44015444 0 1.0000000 1.0000000 0.04768335
## 2 0.03088803 1 0.5598456 0.5598456 0.04079292
## 3 0.01544402 2 0.5289575 0.5675676 0.04098852
## 4 0.01000000 5 0.4826255 0.5366795 0.04018631
##
## Variable importance
## Sex Fare Pclass
## 59 22 19
##
## Node number 1: 630 observations, complexity param=0.4401544
## predicted class=0 expected loss=0.4111111 P(node) =1
## class counts: 371 259
## probabilities: 0.589 0.411
## left son=2 (406 obs) right son=3 (224 obs)
## Primary splits:
## Sex splits as RL, improve=81.95485, (0 missing)
## Fare < 52.2771 to the left, improve=31.22644, (0 missing)
## Pclass splits as RLL, improve=27.44000, (0 missing)
## Surrogate splits:
## Fare < 77.6229 to the left, agree=0.673, adj=0.08, (0 split)
##
## Node number 2: 406 observations
## predicted class=0 expected loss=0.2216749 P(node) =0.6444444
## class counts: 316 90
## probabilities: 0.778 0.222
##
## Node number 3: 224 observations, complexity param=0.03088803
## predicted class=1 expected loss=0.2455357 P(node) =0.3555556
## class counts: 55 169
## probabilities: 0.246 0.754
## left son=6 (88 obs) right son=7 (136 obs)
## Primary splits:
## Pclass splits as RRL, improve=26.07530, (0 missing)
## Fare < 21.55 to the left, improve=11.67361, (0 missing)
## Surrogate splits:
## Fare < 20.7875 to the left, agree=0.853, adj=0.625, (0 split)
##
## Node number 6: 88 observations, complexity param=0.01544402
## predicted class=0 expected loss=0.4545455 P(node) =0.1396825
## class counts: 48 40
## probabilities: 0.545 0.455
## left son=12 (11 obs) right son=13 (77 obs)
## Primary splits:
## Fare < 23.0875 to the right, improve=3.324675, (0 missing)
##
## Node number 7: 136 observations
## predicted class=1 expected loss=0.05147059 P(node) =0.215873
## class counts: 7 129
## probabilities: 0.051 0.949
##
## Node number 12: 11 observations
## predicted class=0 expected loss=0.09090909 P(node) =0.01746032
## class counts: 10 1
## probabilities: 0.909 0.091
##
## Node number 13: 77 observations, complexity param=0.01544402
## predicted class=1 expected loss=0.4935065 P(node) =0.1222222
## class counts: 38 39
## probabilities: 0.494 0.506
## left son=26 (72 obs) right son=27 (5 obs)
## Primary splits:
## Fare < 7.5896 to the right, improve=2.604618, (0 missing)
##
## Node number 26: 72 observations, complexity param=0.01544402
## predicted class=0 expected loss=0.4722222 P(node) =0.1142857
## class counts: 38 34
## probabilities: 0.528 0.472
## left son=52 (39 obs) right son=53 (33 obs)
## Primary splits:
## Fare < 10.825 to the left, improve=2.182595, (0 missing)
##
## Node number 27: 5 observations
## predicted class=1 expected loss=0 P(node) =0.007936508
## class counts: 0 5
## probabilities: 0.000 1.000
##
## Node number 52: 39 observations
## predicted class=0 expected loss=0.3589744 P(node) =0.06190476
## class counts: 25 14
## probabilities: 0.641 0.359
##
## Node number 53: 33 observations
## predicted class=1 expected loss=0.3939394 P(node) =0.05238095
## class counts: 13 20
## probabilities: 0.394 0.606
## FP TP TN FN
## 21 108 226 58
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8087167 0.6506024 0.9149798 0.8372093 0.7957746 2.9977008
## attr(,"negative")
## [1] "Died"
====================================
## Call:
## rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked, data = train21, method = "class", control = rpart.control(minsplit = 3,
## cp = 0.01))
## n= 630
##
## CP nsplit rel error xerror xstd
## 1 0.44015444 0 1.0000000 1.0000000 0.04768335
## 2 0.03281853 1 0.5598456 0.5598456 0.04079292
## 3 0.02702703 3 0.4942085 0.5675676 0.04098852
## 4 0.01000000 5 0.4401544 0.4671815 0.03817522
##
## Variable importance
## Sex Pclass Fare Age Embarked SibSp Parch
## 46 15 15 9 6 6 2
##
## Node number 1: 630 observations, complexity param=0.4401544
## predicted class=0 expected loss=0.4111111 P(node) =1
## class counts: 371 259
## probabilities: 0.589 0.411
## left son=2 (406 obs) right son=3 (224 obs)
## Primary splits:
## Sex splits as RL, improve=81.95485, (0 missing)
## Fare < 52.2771 to the left, improve=31.22644, (0 missing)
## Pclass splits as RLL, improve=27.44000, (0 missing)
## Embarked splits as RLL, improve=16.34740, (0 missing)
## Parch < 0.5 to the left, improve=10.22907, (0 missing)
## Surrogate splits:
## Fare < 77.6229 to the left, agree=0.673, adj=0.080, (0 split)
## Parch < 0.5 to the left, agree=0.660, adj=0.045, (0 split)
## Embarked splits as LRL, agree=0.651, adj=0.018, (0 split)
## Age < 5.5 to the right, agree=0.646, adj=0.004, (0 split)
##
## Node number 2: 406 observations, complexity param=0.02702703
## predicted class=0 expected loss=0.2216749 P(node) =0.6444444
## class counts: 316 90
## probabilities: 0.778 0.222
## left son=4 (379 obs) right son=5 (27 obs)
## Primary splits:
## Age < 9.5 to the right, improve=9.627302, (0 missing)
## Pclass splits as RLL, improve=5.482567, (0 missing)
## Fare < 26.26875 to the left, improve=4.713136, (0 missing)
## Embarked splits as RLL, improve=3.793760, (0 missing)
## Parch < 0.5 to the left, improve=1.996034, (0 missing)
## Surrogate splits:
## SibSp < 3.5 to the left, agree=0.943, adj=0.148, (0 split)
##
## Node number 3: 224 observations, complexity param=0.03281853
## predicted class=1 expected loss=0.2455357 P(node) =0.3555556
## class counts: 55 169
## probabilities: 0.246 0.754
## left son=6 (88 obs) right son=7 (136 obs)
## Primary splits:
## Pclass splits as RRL, improve=26.075300, (0 missing)
## Fare < 21.55 to the left, improve=11.673610, (0 missing)
## Embarked splits as RLL, improve= 6.772141, (0 missing)
## Age < 32.5 to the left, improve= 5.080290, (0 missing)
## SibSp < 3.5 to the right, improve= 2.186790, (0 missing)
## Surrogate splits:
## Fare < 20.7875 to the left, agree=0.853, adj=0.625, (0 split)
## Age < 22.5 to the left, agree=0.688, adj=0.205, (0 split)
## Embarked splits as RLR, agree=0.661, adj=0.136, (0 split)
## SibSp < 3.5 to the right, agree=0.634, adj=0.068, (0 split)
## Parch < 2.5 to the right, agree=0.612, adj=0.011, (0 split)
##
## Node number 4: 379 observations
## predicted class=0 expected loss=0.1926121 P(node) =0.6015873
## class counts: 306 73
## probabilities: 0.807 0.193
##
## Node number 5: 27 observations, complexity param=0.02702703
## predicted class=1 expected loss=0.3703704 P(node) =0.04285714
## class counts: 10 17
## probabilities: 0.370 0.630
## left son=10 (9 obs) right son=11 (18 obs)
## Primary splits:
## SibSp < 3 to the right, improve=7.2592590, (0 missing)
## Pclass splits as RRL, improve=4.3572980, (0 missing)
## Fare < 29.0625 to the right, improve=1.7925930, (0 missing)
## Embarked splits as RLR, improve=1.7125930, (0 missing)
## Age < 0.5 to the left, improve=0.8233618, (0 missing)
## Surrogate splits:
## Fare < 29.0625 to the right, agree=0.778, adj=0.333, (0 split)
## Embarked splits as RLR, agree=0.741, adj=0.222, (0 split)
## Pclass splits as RRL, agree=0.704, adj=0.111, (0 split)
##
## Node number 6: 88 observations, complexity param=0.03281853
## predicted class=0 expected loss=0.4545455 P(node) =0.1396825
## class counts: 48 40
## probabilities: 0.545 0.455
## left son=12 (75 obs) right son=13 (13 obs)
## Primary splits:
## Embarked splits as RLL, improve=4.6784150, (0 missing)
## Fare < 23.0875 to the right, improve=3.3246750, (0 missing)
## Age < 19.5 to the right, improve=1.6813660, (0 missing)
## Parch < 2.5 to the right, improve=1.2834220, (0 missing)
## SibSp < 4.5 to the right, improve=0.4179728, (0 missing)
## Surrogate splits:
## Fare < 7.2396 to the right, agree=0.886, adj=0.231, (0 split)
## Age < 0.875 to the right, agree=0.875, adj=0.154, (0 split)
##
## Node number 7: 136 observations
## predicted class=1 expected loss=0.05147059 P(node) =0.215873
## class counts: 7 129
## probabilities: 0.051 0.949
##
## Node number 10: 9 observations
## predicted class=0 expected loss=0.1111111 P(node) =0.01428571
## class counts: 8 1
## probabilities: 0.889 0.111
##
## Node number 11: 18 observations
## predicted class=1 expected loss=0.1111111 P(node) =0.02857143
## class counts: 2 16
## probabilities: 0.111 0.889
##
## Node number 12: 75 observations
## predicted class=0 expected loss=0.3866667 P(node) =0.1190476
## class counts: 46 29
## probabilities: 0.613 0.387
##
## Node number 13: 13 observations
## predicted class=1 expected loss=0.1538462 P(node) =0.02063492
## class counts: 2 11
## probabilities: 0.154 0.846
## FP TP TN FN
## 13 103 234 63
## attr(,"negative")
## [1] "Died"
## acc sens spec ppv npv lor
## 0.8159806 0.6204819 0.9473684 0.8879310 0.7878788 3.3819660
## attr(,"negative")
## [1] "Died"
==========================================
## [[1]]
## [1] "Abbing" " Mr" " Anthony"
## [1] "Abbing" " Mr" " Anthony"
## [1] " Mr"
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
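The output above shows a name being split into surname, title and given names, followed by a tally of titles. A minimal Python sketch of the same idea (the original was produced in R), assuming the split is on commas and periods as the three printed pieces suggest. The function name `extract_title` is hypothetical, and the six names are the ones shown in the data preview earlier in this document.

```python
import re
from collections import Counter


def extract_title(name):
    """Return the title, i.e. the piece between the first comma and the next period."""
    parts = re.split(r"[,.]", name)
    return parts[1].strip()


names = [
    "Abbing, Mr. Anthony",
    "Abbott, Master. Eugene Joseph",
    "Abbott, Mr. Rossmore Edward",
    "Abbott, Mrs. Stanton (Rosa Hunt",
    "Abelseth, Miss. Karen Marie",
    "Abelseth, Mr. Olaus Jorgensen",
]

title_counts = Counter(extract_title(name) for name in names)
for title, count in sorted(title_counts.items()):
    print(f"{title}: {count}")
```

Stripping the whitespace avoids the leading space visible in the printed `" Mr"`, which would otherwise make the same title appear under two different spellings.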
## Call:
## rpart(formula = Survived ~ Sex + Title + FamilySize + Pclass,
## data = train23, method = "class")
## n= 785
##
## CP nsplit rel error xerror xstd
## 1 0.43189369 0 1.0000000 1.0000000 0.04525896
## 2 0.04651163 1 0.5681063 0.5813953 0.03874205
## 3 0.03986711 2 0.5215947 0.5847176 0.03882073
## 4 0.01000000 3 0.4817276 0.5249169 0.03732166
##
## Variable importance
## Title Sex FamilySize Pclass
## 40 32 16 12
##
## Node number 1: 785 observations, complexity param=0.4318937
## predicted class=0 expected loss=0.3834395 P(node) =1
## class counts: 484 301
## probabilities: 0.617 0.383
## left son=2 (471 obs) right son=3 (314 obs)
## Primary splits:
## Title splits as RLRRRRLRLLR, improve=109.58130, (0 missing)
## Sex splits as RL, improve=100.91470, (0 missing)
## Pclass splits as RLL, improve= 34.25228, (0 missing)
## FamilySize < 1.5 to the left, improve= 17.68573, (0 missing)
## Surrogate splits:
## Sex splits as RL, agree=0.944, adj=0.86, (0 split)
## FamilySize < 1.5 to the left, agree=0.716, adj=0.29, (0 split)
##
## Node number 2: 471 observations
## predicted class=0 expected loss=0.1677282 P(node) =0.6
## class counts: 392 79
## probabilities: 0.832 0.168
##
## Node number 3: 314 observations, complexity param=0.04651163
## predicted class=1 expected loss=0.2929936 P(node) =0.4
## class counts: 92 222
## probabilities: 0.293 0.707
## left son=6 (150 obs) right son=7 (164 obs)
## Primary splits:
## Pclass splits as RRL, improve=36.962020, (0 missing)
## FamilySize < 4.5 to the right, improve=19.113200, (0 missing)
## Title splits as L-RLLR-R--L, improve= 3.580400, (0 missing)
## Sex splits as RL, improve= 2.952126, (0 missing)
## Surrogate splits:
## Title splits as R-RLLR-R--R, agree=0.640, adj=0.247, (0 split)
## FamilySize < 4.5 to the right, agree=0.605, adj=0.173, (0 split)
## Sex splits as RL, agree=0.551, adj=0.060, (0 split)
##
## Node number 6: 150 observations, complexity param=0.03986711
## predicted class=0 expected loss=0.4533333 P(node) =0.1910828
## class counts: 82 68
## probabilities: 0.547 0.453
## left son=12 (32 obs) right son=13 (118 obs)
## Primary splits:
## FamilySize < 4.5 to the right, improve=10.51934, (0 missing)
## Sex splits as RL, improve= 1.33426, (0 missing)
## Title splits as ---LR--R---, improve= 1.33426, (0 missing)
## Surrogate splits:
## Sex splits as RL, agree=0.813, adj=0.125, (0 split)
## Title splits as ---LR--R---, agree=0.813, adj=0.125, (0 split)
##
## Node number 7: 164 observations
## predicted class=1 expected loss=0.06097561 P(node) =0.2089172
## class counts: 10 154
## probabilities: 0.061 0.939
##
## Node number 12: 32 observations
## predicted class=0 expected loss=0.09375 P(node) =0.04076433
## class counts: 29 3
## probabilities: 0.906 0.094
##
## Node number 13: 118 observations
## predicted class=1 expected loss=0.4491525 P(node) =0.1503185
## class counts: 53 65
## probabilities: 0.449 0.551
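The `rel error` column of the CP table can be cross-checked against the node summaries: it is the tree's training misclassification count divided by the misclassification count of the root node. The sketch below redoes that arithmetic with the class counts printed above. The variable names are illustrative and nothing here refits the tree.

```r
# Class counts at the root node (class 0, class 1), from "Node number 1".
root_counts <- c(class0 = 484, class1 = 301)
root_error <- min(root_counts) / sum(root_counts)

# Minority-class counts in the four terminal nodes (2, 7, 12, 13).
leaf_errors <- c(node2 = 79, node7 = 10, node12 = 3, node13 = 53)

# Relative error of the final tree: its errors divided by the root's errors.
rel_error <- sum(leaf_errors) / min(root_counts)

# Training accuracy implied by the same counts.
train_accuracy <- 1 - sum(leaf_errors) / sum(root_counts)

print(c(root_error = root_error, rel_error = rel_error,
        train_accuracy = train_accuracy))
```

The root error reproduces the `expected loss` of node 1, and the relative error reproduces the last row of the CP table.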
##  FP  TP  TN  FN
##  48 149 277  50
## attr(,"negative")
## [1] "Died"
##       acc      sens      spec       ppv       npv       lor
## 0.8129771 0.7487437 0.8523077 0.7563452 0.8470948 2.8447398
## attr(,"negative")
## [1] "Died"
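The six summary statistics follow directly from the four confusion-matrix counts. The sketch below recomputes them for the decision tree, with "Died" as the negative class. Reading `lor` as the log odds ratio is an inference from the printed numbers rather than something the report states, and the variable names are illustrative.

```r
# Confusion-matrix counts printed for the decision tree.
counts <- c(FP = 48, TP = 149, TN = 277, FN = 50)
tp <- counts[["TP"]]; tn <- counts[["TN"]]
fp <- counts[["FP"]]; fn <- counts[["FN"]]

metrics <- c(
  acc  = (tp + tn) / sum(counts),     # overall accuracy
  sens = tp / (tp + fn),              # sensitivity: survivors found
  spec = tn / (tn + fp),              # specificity: deaths found
  ppv  = tp / (tp + fp),              # positive predictive value
  npv  = tn / (tn + fn),              # negative predictive value
  lor  = log((tp * tn) / (fp * fn))   # log odds ratio (assumed meaning of "lor")
)

print(round(metrics, 7))
```

Swapping in the counts from any other confusion matrix in this report gives the matching row of statistics.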
==================================
## 'data.frame': 785 obs. of 8 variables:
##  $ Pclass  : chr  "1st" "3rd" "3rd" "3rd" ...
##  $ Survived: int  0 0 0 0 0 0 1 0 1 1 ...
##  $ Sex     : chr  "male" "male" "female" "male" ...
##  $ Age     : num  39 42 1 20 18 64 34 47 23 15 ...
##  $ SibSp   : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Parch   : int  0 0 1 0 0 0 0 0 0 2 ...
##  $ Fare    : num  29.7 7.55 12.18 7.93 11.5 ...
##  $ Embarked: chr  "Cherbourg" "Southampton" "Southampton" "Southampton" ...
## 'data.frame': 524 obs. of 8 variables:
##  $ Pclass  : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ Survived: int  0 0 1 1 1 1 1 1 1 0 ...
##  $ Sex     : chr  "male" "male" "female" "female" ...
##  $ Age     : num  13 16 35 16 28 ...
##  $ SibSp   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ Parch   : int  2 1 1 0 0 0 1 1 0 0 ...
##  $ Fare    : num  20.25 20.25 20.25 7.65 24 ...
##  $ Embarked: chr  "Southampton" "Southampton" "Southampton" "Southampton" ...
## Start: AIC=615.95
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
##            Df Deviance    AIC
## - Parch     1   596.19 614.19
## - Fare      1   596.40 614.40
## <none>          595.95 615.95
## - SibSp     1   601.80 619.80
## - Embarked  2   603.84 619.84
## - Age       1   611.56 629.56
## - Pclass    2   626.02 642.02
## - Sex       1   741.44 759.44
##
## Step: AIC=614.19
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
##
##            Df Deviance    AIC
## - Fare      1   596.90 612.90
## <none>          596.19 614.19
## + Parch     1   595.95 615.95
## - SibSp     1   601.84 617.84
## - Embarked  2   604.15 618.15
## - Age       1   612.39 628.39
## - Pclass    2   626.07 640.07
## - Sex       1   745.47 761.47
##
## Step: AIC=612.9
## Survived ~ Pclass + Sex + Age + SibSp + Embarked
##
##            Df Deviance    AIC
## <none>          596.90 612.90
## + Fare      1   596.19 614.19
## + Parch     1   596.40 614.40
## - SibSp     1   602.01 616.01
## - Embarked  2   605.60 617.60
## - Age       1   613.41 627.41
## - Pclass    2   644.57 656.57
## - Sex       1   750.40 764.40
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
## family = "binomial", data = train20)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.1751  -0.7021  -0.4469   0.6893   2.3100
##
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)          4.073112   0.477934   8.522  < 2e-16 ***
## Pclass2nd           -1.143384   0.305017  -3.749 0.000178 ***
## Pclass3rd           -1.955423   0.296704  -6.590 4.38e-11 ***
## Sexmale             -2.501689   0.224279 -11.154  < 2e-16 ***
## Age                 -0.033351   0.008437  -3.953 7.72e-05 ***
## SibSp               -0.270168   0.124246  -2.174 0.029671 *
## EmbarkedQueenstown  -1.244848   0.532301  -2.339 0.019355 *
## EmbarkedSouthampton -0.702309   0.268756  -2.613 0.008970 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 853.35 on 629 degrees of freedom
## Residual deviance: 596.90 on 622 degrees of freedom
## AIC: 612.9
##
## Number of Fisher Scoring iterations: 4
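The coefficients of the reduced model are on the log-odds scale, so exponentiating them gives odds ratios, and the inverse logit of the linear predictor gives a survival probability. The sketch below works directly from the printed estimates instead of the fitted model object. The passenger described in `passenger` is hypothetical and serves only to show the arithmetic.

```r
# Estimates copied from the reduced model's coefficient table.
coefs <- c("(Intercept)"       =  4.073112,
           Pclass2nd           = -1.143384,
           Pclass3rd           = -1.955423,
           Sexmale             = -2.501689,
           Age                 = -0.033351,
           SibSp               = -0.270168,
           EmbarkedQueenstown  = -1.244848,
           EmbarkedSouthampton = -0.702309)

# Odds ratios relative to the baseline (1st class, female, Cherbourg).
odds_ratios <- exp(coefs[-1])

# Hypothetical passenger: 3rd class, male, aged 25, no siblings or spouse
# aboard, embarked at Southampton.
passenger <- c("(Intercept)" = 1, Pclass2nd = 0, Pclass3rd = 1, Sexmale = 1,
               Age = 25, SibSp = 0, EmbarkedQueenstown = 0,
               EmbarkedSouthampton = 1)

linear_predictor <- sum(coefs * passenger[names(coefs)])
survival_prob <- plogis(linear_predictor)  # inverse logit

print(round(odds_ratios, 3))
print(survival_prob)
```

Every odds ratio falls below 1, which matches the negative signs in the table: each listed level lowers the odds of survival relative to the baseline.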
## Named num [1:630] 0.5673 0.0768 0.7525 0.1476 0.2943 ...
## - attr(*, "names")= chr [1:630] "363" "1" "668" "19" ...
##
##   0   1
## 388 242
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked, family = "binomial", data = train20)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2456  -0.6947  -0.4435   0.6935   2.3081
##
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)          3.863088   0.528253   7.313 2.61e-13 ***
## Pclass2nd           -1.042704   0.337055  -3.094 0.001978 **
## Pclass3rd           -1.833942   0.341764  -5.366 8.05e-08 ***
## Sexmale             -2.472881   0.226489 -10.918  < 2e-16 ***
## Age                 -0.032616   0.008483  -3.845 0.000121 ***
## SibSp               -0.306722   0.131959  -2.324 0.020106 *
## Parch                0.070353   0.143910   0.489 0.624934
## Fare                 0.001537   0.002348   0.655 0.512716
## EmbarkedQueenstown  -1.189361   0.534583  -2.225 0.026092 *
## EmbarkedSouthampton -0.677110   0.271398  -2.495 0.012599 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 853.35 on 629 degrees of freedom
## Residual deviance: 595.95 on 620 degrees of freedom
## AIC: 615.95
##
## Number of Fisher Scoring iterations: 4
## [1] 615.9482
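The AIC values reported for the two logistic fits can be reproduced from their residual deviances. For a binomial model fitted to individual 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The helper name below is illustrative, and the identity assumes ungrouped binary data, which matches the `Survived` column here.

```r
# AIC for a logistic model on 0/1 outcomes: deviance + 2 * parameters.
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

# Full model: 10 coefficients, residual deviance 595.95.
full_aic <- aic_from_deviance(595.95, 10)

# Reduced model chosen by stepwise selection: 8 coefficients, deviance 596.90.
reduced_aic <- aic_from_deviance(596.90, 8)

print(c(full = full_aic, reduced = reduced_aic))
```

Dropping `Parch` and `Fare` raises the deviance by less than one unit while saving two parameters, which is why the stepwise search prefers the smaller model.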
##  FP  TP  TN  FN
##  40 121 207  45
## attr(,"negative")
## [1] "Died"
##       acc      sens      spec       ppv       npv       lor
## 0.7941889 0.7289157 0.8380567 0.7515528 0.8214286 2.6329674
## attr(,"negative")
## [1] "Died"
========================================================================
====================