2.1. Data
Description of the variables:
PassengerID: Passenger ID.
Survived: Whether a passenger survived the shipwreck or not (0:No,
1:Yes).
Pclass: Ticket class of the passenger (1:first class, 2:second
class, 3:third class).
Name: Name of the passenger - will not use this since it is not
meaningful data.
Sex: Gender of the passenger.
Age: Age of the person when boarding the Titanic.
SibSp: Number of siblings/spouses aboard the Titanic.
Parch: Number of parents/children aboard the Titanic.
Ticket: Ticket number of the passenger - will not use this since it
is not meaningful data.
Fare: The fare the passenger paid for the ticket - will not use this
since it is not meaningful data.
Cabin: Cabin number of the passenger - will not use this since it
has many missing values and that can make the data set harder to
analyze.
Embarked: Port where the passengers boarded on the Titanic - will
not use this since it is not meaningful data.
Defining variables and their effect on the depended variable:
Depended variable:
Survived - whether a passenger survived or not.
Explanatory variables:
Pclass - higher the class might lead to higher access to lifeboats,
assistance etc…
Sex - women had the privilege to evacuate first.
Age - children had the privilege to evacuate first, also younger
passengers can endure more then older passengers.
SibSp - having family members on board means help and collaboration
in finding lifeboats and also moral support.
Parch - parents with children had the privilege to evacuate
first.
library(tidyr)
mydata <- drop_na(mydata) #Removing the units that have missing values
head(mydata)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 7 0 1
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 McCarthy, Mr. Timothy J male 54 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.25 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.925 S
## 4 113803 53.1 C123 S
## 5 373450 8.05 S
## 6 17463 51.8625 E46 S
mydata$SurvivedF <- factor(mydata$Survived,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata$PclassF <- factor(mydata$Pclass,
levels = c(1, 2, 3),
labels = c("First Class", "Second Class", "Third Class"))
mydata$SexF <- factor(mydata$Sex,
levels = c("male", "female"),
labels = c("Male", "Female")) #Converting categorical variables into factors
summary(mydata[colnames(mydata) %in% c("Age", "SibSp", "Parch", "SurvivedF", "PclassF", "SexF")]) #Descriptive statistics for the depended and explanatory variables
## Age SibSp Parch SurvivedF PclassF
## Min. : 0.00 Min. :0.0000 Min. :0.0000 No :424 First Class :186
## 1st Qu.:20.00 1st Qu.:0.0000 1st Qu.:0.0000 Yes:290 Second Class:173
## Median :28.00 Median :0.0000 Median :0.0000 Third Class :355
## Mean :29.68 Mean :0.5126 Mean :0.4314
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :80.00 Max. :5.0000 Max. :6.0000
## SexF
## Male :453
## Female:261
##
##
##
##
Explanation of descriptive statistics:
Age: The minimum age is 0 years, the maximum age is 80 years, while
the mean of age is 29.68 years.
SibSp: The minimum number of siblings/spouses is 0, the maximum
number is 5, while the mean is 0.5126 siblings/spouses.
Parch: The minimum number of parents/children is 0, the maximum
number is 6, while the mean is 0.4314 parents/children.
SurvivedF: No is the number of passengers who did
not survive and Yes is the number of passengers who
survived.
PclassF: First Class is the number of passengers
with first class tickets, Second Class is the number of
passengers with second class tickets and Third Class is
the number of passengers with third class tickets.
SexF: Male is the number of male passengers and
Female is the number of female passengers.
2.2. Analysis
fit0 <- glm(SurvivedF ~ 1, #Dependent and explanatory variables
family = binomial, #Binary logistic regression
data = mydata)
summary(fit0)
##
## Call:
## glm(formula = SurvivedF ~ 1, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3799 0.0762 -4.985 6.2e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 964.52 on 713 degrees of freedom
## AIC: 966.52
##
## Number of Fisher Scoring iterations: 4
exp(cbind(odds = fit0$coefficients, confint.default(fit0))) #Odds for Y=1
## odds 2.5 % 97.5 %
## (Intercept) 0.6839623 0.5890725 0.7941372
head(fitted(fit0)) #Estimated probability for Y=1
## 1 2 3 4 5 6
## 0.4061625 0.4061625 0.4061625 0.4061625 0.4061625 0.4061625
Pseudo_R2_fit1 <- 424/714 #Proportion of correctly classified units
Pseudo_R2_fit1
## [1] 0.5938375
fit1 <- glm(SurvivedF ~ Age,
family = binomial,
data = mydata)
summary(fit1)
##
## Call:
## glm(formula = SurvivedF ~ Age, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.057480 0.173380 -0.332 0.7402
## Age -0.010945 0.005326 -2.055 0.0399 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 960.24 on 712 degrees of freedom
## AIC: 964.24
##
## Number of Fisher Scoring iterations: 4
anova(fit0, fit1, test = "Chi") #Comparison of models based on -2LL statistics
## Analysis of Deviance Table
##
## Model 1: SurvivedF ~ 1
## Model 2: SurvivedF ~ Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 713 964.52
## 2 712 960.24 1 4.2796 0.03857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #Odds ratio for Y=1 (with 95% CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.9441407 0.6721335 1.3262271
## Age 0.9891146 0.9788435 0.9994935
fit2 <- glm(SurvivedF ~ Age + SibSp + Parch + PclassF + SexF,
family = binomial,
data = mydata)
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
vif(fit2) #Checking for multicolinearity
## GVIF Df GVIF^(1/(2*Df))
## Age 1.452938 1 1.205379
## SibSp 1.256252 1 1.120826
## Parch 1.206082 1 1.098218
## PclassF 1.434100 2 1.094321
## SexF 1.176588 1 1.084707
mydata$StdResid <- rstandard(fit2)
mydata$Cook <- cooks.distance(fit2) #Checking the cooks distance (to see if there are outliers)
hist(mydata$StdResid,
main = "Histogram of standardized residuals",
ylab = "Frequency",
xlab = "Standardized residuals")

#Checking the standardized residuals to see if there are outliers (no outliers - everything is in the range -3/+3)
head(mydata[order(-mydata$Cook), c("PassengerId", "Cook")]) #Checking the cooks distance to see if there are outliers (there is no distance bigger than 1 but there are some gaps so the PassengerId 262, 298 and 26 will be removed)
## PassengerId Cook
## 211 262 0.02699205
## 241 298 0.02458253
## 23 26 0.01958668
## 499 631 0.01529684
## 452 571 0.01406456
## 397 499 0.01247619
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!PassengerId == 262)
library(dplyr)
mydata <- mydata %>%
filter(!PassengerId == 298)
library(dplyr)
mydata <- mydata %>%
filter(!PassengerId == 26)
head(mydata[order(-mydata$Cook), c("PassengerId", "Cook")]) #By removing the outliers it is clear that there are no more significant gaps)
## PassengerId Cook
## 496 631 0.01529684
## 449 571 0.01406456
## 394 499 0.01247619
## 65 86 0.01173899
## 51 69 0.01140253
## 383 484 0.01127991
fit1 <- glm(SurvivedF ~ Age,
family = binomial,
data = mydata)
fit2 <- glm(SurvivedF ~ Age + SibSp + Parch + PclassF + SexF,
family = binomial,
data = mydata)
anova(fit1, fit2, test = "Chi") #Model comparison to see which one is better. In this case Model 2 is better or fit2 - it is statistically significant.
## Analysis of Deviance Table
##
## Model 1: SurvivedF ~ Age
## Model 2: SurvivedF ~ Age + SibSp + Parch + PclassF + SexF
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 709 955.53
## 2 704 620.83 5 334.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit2)
##
## Call:
## glm(formula = SurvivedF ~ Age + SibSp + Parch + PclassF + SexF,
## family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.889999 0.404017 4.678 2.90e-06 ***
## Age -0.047969 0.008409 -5.704 1.17e-08 ***
## SibSp -0.417193 0.131006 -3.185 0.00145 **
## Parch -0.072148 0.125994 -0.573 0.56690
## PclassFSecond Class -1.500019 0.289243 -5.186 2.15e-07 ***
## PclassFThird Class -2.784624 0.292889 -9.507 < 2e-16 ***
## SexFFemale 2.724103 0.225123 12.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 959.87 on 710 degrees of freedom
## Residual deviance: 620.83 on 704 degrees of freedom
## AIC: 634.83
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 6.61935955 2.9985828 14.6122100
## Age 0.95316371 0.9375830 0.9690033
## SibSp 0.65889359 0.5096863 0.8517802
## Parch 0.93039353 0.7268087 1.1910040
## PclassFSecond Class 0.22312596 0.1265744 0.3933276
## PclassFThird Class 0.06175228 0.0347813 0.1096378
## SexFFemale 15.24273167 9.8047749 23.6967060
mydata$EstProb <- fitted(fit2) #Estimates of probabilities for Y=1: P(Y=1)
head(mydata)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 7 0 1
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 McCarthy, Mr. Timothy J male 54 0 0
## Ticket Fare Cabin Embarked SurvivedF PclassF SexF
## 1 A/5 21171 7.25 S No Third Class Male
## 2 PC 17599 71.2833 C85 C Yes First Class Female
## 3 STON/O2. 3101282 7.925 S Yes Third Class Female
## 4 113803 53.1 C123 S Yes First Class Female
## 5 373450 8.05 S No Third Class Male
## 6 17463 51.8625 E46 S No First Class Male
## StdResid Cook EstProb
## 1 -0.4396242 5.141144e-05 0.08571391
## 2 0.4416866 8.959972e-05 0.91483093
## 3 0.9634587 9.226806e-04 0.64159488
## 4 0.4140949 7.162465e-05 0.92539456
## 5 -0.3966638 3.638144e-05 0.07086172
## 6 -0.9002776 7.813198e-04 0.33174431
mydata$Classification <- ifelse(test = mydata$EstProb < 0.50,
yes = "NO",
no = "YES")
#If estimated probability is below 0.50, person is classified into a group of passengers that did not survive the shipwreck (NO), otherwise YES.
mydata$ClassificationF <- factor(mydata$Classification,
levels = c("NO", "YES"),
labels = c("NO", "YES"))
ClassificationTable <- table(mydata$SurvivedF, mydata$ClassificationF) #Classification table based on 2 categorical variables.
ClassificationTable
##
## NO YES
## No 364 59
## Yes 77 211
Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit2 #Coefficient of determination
## [1] 0.8087201
2.3. Conclusion
To conclude, every explanatory variable in fit 2 will be
explained:
Age: If age is increased by one year the odds of
survival are equal to 0.954 times the original odds under the
assumptions that the remaining explanatory variables do not change. The
probability to survive decreases. (p<0.001)
SibSp: If the number of siblings/spouses is
increased by one sibling/spouse the odds of survival are equal to 0.659
times the original odds under the assumptions that the remaining
explanatory variables do not change. The probability to survive
decreases. (p<0.001)
Parch: We do not have enough evidence to say. It is
statistically insignificant. (p=0.57)
PclassFSecond Class: Given the values of other
explanatory variables, the odds for survival for passengers with second
class tickets are equal to 0.22 times the odds for survival for
passengers with first class tickets. Passengers with second class
tickets are less likely to survive.(p<0.001)
PclassFThird Class: Given the values of other
explanatory variables, the odds for survival for passengers with third
class tickets are equal to 0.062 times the odds for survival for
passengers with first class tickets. Passengers with third class tickets
are less likely to survive. (p<0.001)
SexFFemale: Given the values of other explanatory
variables, the odds for survival for passengers that are female are
equal to 15.243 times the odds for survival for passengers that are
male. Female passengers are more likely to survive. (p<0.001)
Pseudo R2 = 81%, which means that 81% of the
variance of the dependent variable being studied is explained by the
variance of the independent variable