Data Dictionary
Variable: Definition (Key)
survival: Survival (0 = No, 1 = Yes)
pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex: Sex
age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
pclass: A proxy for socio-economic status (SES)
1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations as follows:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations as follows:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
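The age convention in the notes above can be expressed as a small check. This is only an illustrative Python sketch: `describe_age` is a hypothetical helper, and representing a missing age as `None` is an assumption (the R output shows missing ages as `NA`).

```python
def describe_age(age):
    """Classify an age value using the data dictionary's conventions.

    Hypothetical helper for illustration; None stands in for a missing age.
    """
    if age is None:
        return "missing"
    if age < 1:
        # Ages below 1 are recorded as fractions of a year.
        return "infant (fractional age)"
    if age % 1 == 0.5:
        # Estimated ages are stored in the form xx.5.
        return "estimated"
    return "recorded"


for value in (None, 0.42, 28.5, 22.0):
    print(value, "->", describe_age(value))
```

A check like this can help decide whether estimated ages should be treated differently from recorded ones before modelling.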
[1] 891 12
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
[1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
[6] "Age" "SibSp" "Parch" "Ticket" "Fare"
[11] "Cabin" "Embarked"
Call:
glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked, family = "binomial", data = ndata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.102784 0.476303 8.614 < 2e-16 ***
Pclass2 -0.924047 0.297882 -3.102 0.00192 **
Pclass3 -2.149626 0.297749 -7.220 5.21e-13 ***
Sexmale -2.709611 0.201336 -13.458 < 2e-16 ***
Age -0.039320 0.007888 -4.984 6.21e-07 ***
SibSp -0.322143 0.109545 -2.941 0.00327 **
Parch -0.095061 0.119028 -0.799 0.42450
Fare 0.002261 0.002462 0.918 0.35842
EmbarkedQ -0.029839 0.381534 -0.078 0.93766
EmbarkedS -0.445754 0.239730 -1.859 0.06297 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1182.82 on 888 degrees of freedom
Residual deviance: 783.74 on 879 degrees of freedom
AIC: 803.74
Number of Fisher Scoring iterations: 5
We can see that ‘Pclass2’, ‘Pclass3’, ‘Sexmale’, ‘Age’, and ‘SibSp’ are statistically significant at the 5% level, as their p-values are less than 0.05.
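The z values and p-values in the summary can be reproduced by hand, which makes the significance statement easier to check. The Python sketch below copies the estimates and standard errors from the output above, computes each Wald statistic as estimate / standard error with a two-sided normal p-value, and also exponentiates each coefficient to give an odds ratio. The variable names are illustrative; the analysis itself was run in R.

```python
import math

# (estimate, standard error) copied from the glm summary above.
coefficients = {
    "Pclass2": (-0.924047, 0.297882),
    "Pclass3": (-2.149626, 0.297749),
    "Sexmale": (-2.709611, 0.201336),
    "Age": (-0.039320, 0.007888),
    "SibSp": (-0.322143, 0.109545),
    "Parch": (-0.095061, 0.119028),
    "Fare": (0.002261, 0.002462),
    "EmbarkedQ": (-0.029839, 0.381534),
    "EmbarkedS": (-0.445754, 0.239730),
}


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


results = {}
for name, (estimate, std_error) in coefficients.items():
    z, p = wald_test(estimate, std_error)
    # exp(coefficient) converts a log-odds effect into an odds ratio.
    results[name] = {"z": z, "p": p, "odds_ratio": math.exp(estimate)}

significant = [name for name, r in results.items() if r["p"] < 0.05]
print(significant)
for name, r in results.items():
    print(f"{name:10s} z={r['z']:8.3f} p={r['p']:.5f} OR={r['odds_ratio']:.3f}")
```

The odds ratio for ‘Sexmale’ is roughly 0.07, so the model puts the odds of survival for men at about 7% of the odds for otherwise comparable women.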
To obtain a more parsimonious model, we apply backward stepwise selection based on AIC.
Start: AIC=803.74
Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
Df Deviance AIC
- Parch 1 784.38 802.38
- Fare 1 784.65 802.65
<none> 783.74 803.74
- Embarked 2 788.16 804.16
- SibSp 1 793.85 811.85
- Age 1 810.58 828.58
- Pclass 2 843.57 859.57
- Sex 1 1010.80 1028.80
Step: AIC=802.38
Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
Df Deviance AIC
- Fare 1 785.03 801.03
<none> 784.38 802.38
- Embarked 2 789.08 803.08
- SibSp 1 797.02 813.02
- Age 1 811.03 827.03
- Pclass 2 847.53 861.53
- Sex 1 1016.06 1032.06
Step: AIC=801.03
Survived ~ Pclass + Sex + Age + SibSp + Embarked
Df Deviance AIC
<none> 785.03 801.03
- Embarked 2 790.30 802.30
- SibSp 1 797.02 811.02
- Age 1 812.43 826.43
- Pclass 2 882.25 894.25
- Sex 1 1022.23 1036.23
Backward selection stops at AIC = 801.03, retaining the variables ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, and ‘Embarked’ for our regression model.
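The AIC values in the stepwise output can be checked by hand. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2 × (number of estimated parameters). The Python sketch below uses the deviances printed above to recompute the AIC of the full and final models and to replay the first elimination step. The function and variable names are illustrative.

```python
def aic(deviance, n_params):
    """AIC for a binary-response GLM: deviance plus twice the parameter count."""
    return deviance + 2 * n_params


# Full model: intercept + 9 coefficients = 10 parameters.
full_aic = aic(783.74, 10)
# Final model (Pclass + Sex + Age + SibSp + Embarked): 8 parameters.
final_aic = aic(785.03, 8)

# First step of the output: term -> (Df removed, deviance after removal).
first_step = {
    "Parch": (1, 784.38),
    "Fare": (1, 784.65),
    "<none>": (0, 783.74),
    "Embarked": (2, 788.16),
    "SibSp": (1, 793.85),
    "Age": (1, 810.58),
    "Pclass": (2, 843.57),
    "Sex": (1, 1010.80),
}

# Dropping a term removes Df parameters from the 10 in the full model.
candidate_aic = {
    term: aic(deviance, 10 - df) for term, (df, deviance) in first_step.items()
}

# Backward elimination drops the term whose removal gives the lowest AIC.
best = min(candidate_aic, key=candidate_aic.get)
print(best, round(candidate_aic[best], 2))
print(round(full_aic, 2), round(final_aic, 2))
```

Removing ‘Parch’ gives the lowest AIC in the first step, which matches the order of eliminations in the output; the procedure stops once `<none>` has the lowest AIC.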
Binomial model
Most of the residuals fall within the confidence bands, suggesting that the binomial model fits the data adequately.
The model’s AUC is 0.8567, indicating that it discriminates well between survivors and non-survivors.
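As a reminder of what the AUC measures: it is the probability that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor. The Python sketch below computes it by direct pair counting on a small made-up example. The labels and scores are toy values for illustration only, not the Titanic predictions, and `auc` is a hypothetical helper rather than the function used in the analysis.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = survived, 0 = did not survive.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(toy_labels, toy_scores))
```

In the toy example 8 of the 9 survivor/non-survivor pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.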