Logistic Regression

2. Second Research Question

Research Question: What factors influence the outcome of passengers surviving the Titanic shipwreck?

The source of this data is from this link: https://www.kaggle.com/code/alexisbcook/titanic-tutorial

mydata <- read.table("C:/Users/Martin/Desktop/Fangzz/Faculty of Economics/Year 1/Semester 1/Research Methods and Techniques/Program R/Homework/Homework part 2/Titanic/train.csv", header = TRUE, sep = ",", dec = ",")

mydata$Age <- as.integer(mydata$Age) #Converting characters to numbers

summary(mydata[colnames(mydata) %in% c("Pclass")]) #Descriptive statistics to see the number of ticket classes

##      Pclass     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :2.309  
##  3rd Qu.:3.000  
##  Max.   :3.000

head(mydata) #Showing the first six rows of the data set

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171    7.25              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282   7.925              S
## 4           113803    53.1  C123        S
## 5           373450    8.05              S
## 6           330877  8.4583              Q

tail(mydata) #Showing the last six rows of the data set

##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536     13              S
## 888  19     0     0     112053     30   B42        S
## 889  NA     1     2 W./C. 6607  23.45              S
## 890  26     0     0     111369     30  C148        C
## 891  32     0     0     370376   7.75              Q

2.1. Data

Description of the variables:

PassengerID: Passenger ID.

Survived: Whether a passenger survived the shipwreck or not (0:No, 1:Yes).

Pclass: Ticket class of the passenger (1:first class, 2:second class, 3:third class).

Name: Name of the passenger - will not use this since it is not meaningful data.

Sex: Gender of the passenger.

Age: Age of the person when boarding the Titanic.

SibSp: Number of siblings/spouses aboard the Titanic.

Parch: Number of parents/children aboard the Titanic.

Ticket: Ticket number of the passenger - will not use this since it is not meaningful data.

Fare: The fare the passenger paid for the ticket - will not use this since it is not meaningful data.

Cabin: Cabin number of the passenger - will not use this since it has many missing values and that can make the data set harder to analyze.

Embarked: Port where the passengers boarded on the Titanic - will not use this since it is not meaningful data.

Defining variables and their effect on the depended variable:

Depended variable:

Survived - whether a passenger survived or not.

Explanatory variables:

Pclass - higher the class might lead to higher access to lifeboats, assistance etc…

Sex - women had the privilege to evacuate first.

Age - children had the privilege to evacuate first, also younger passengers can endure more then older passengers.

SibSp - having family members on board means help and collaboration in finding lifeboats and also moral support.

Parch - parents with children had the privilege to evacuate first.

library(tidyr)
mydata <- drop_na(mydata) #Removing the units that have missing values
head(mydata)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           7        0      1
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                             McCarthy, Mr. Timothy J   male  54     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171    7.25              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282   7.925              S
## 4           113803    53.1  C123        S
## 5           373450    8.05              S
## 6            17463 51.8625   E46        S

mydata$SurvivedF <- factor(mydata$Survived, 
                          levels = c(0, 1), 
                          labels = c("No", "Yes"))

mydata$PclassF <- factor(mydata$Pclass, 
                          levels = c(1, 2, 3), 
                          labels = c("First Class", "Second Class", "Third Class"))

mydata$SexF <- factor(mydata$Sex, 
                            levels = c("male", "female"),
                            labels = c("Male", "Female")) #Converting categorical variables into factors

summary(mydata[colnames(mydata) %in% c("Age", "SibSp", "Parch", "SurvivedF", "PclassF", "SexF")]) #Descriptive statistics for the depended and explanatory variables

##       Age            SibSp            Parch        SurvivedF         PclassF   
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000   No :424   First Class :186  
##  1st Qu.:20.00   1st Qu.:0.0000   1st Qu.:0.0000   Yes:290   Second Class:173  
##  Median :28.00   Median :0.0000   Median :0.0000             Third Class :355  
##  Mean   :29.68   Mean   :0.5126   Mean   :0.4314                               
##  3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000                               
##  Max.   :80.00   Max.   :5.0000   Max.   :6.0000                               
##      SexF    
##  Male  :453  
##  Female:261  
##              
##              
##              
##

Explanation of descriptive statistics:

Age: The minimum age is 0 years, the maximum age is 80 years, while the mean of age is 29.68 years.

SibSp: The minimum number of siblings/spouses is 0, the maximum number is 5, while the mean is 0.5126 siblings/spouses.

Parch: The minimum number of parents/children is 0, the maximum number is 6, while the mean is 0.4314 parents/children.

SurvivedF: No is the number of passengers who did not survive and Yes is the number of passengers who survived.

PclassF: First Class is the number of passengers with first class tickets, Second Class is the number of passengers with second class tickets and Third Class is the number of passengers with third class tickets.

SexF: Male is the number of male passengers and Female is the number of female passengers.

2.2. Analysis

fit0 <- glm(SurvivedF ~ 1, #Dependent and explanatory variables
            family = binomial, #Binary logistic regression
            data = mydata)

summary(fit0)

## 
## Call:
## glm(formula = SurvivedF ~ 1, family = binomial, data = mydata)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3799     0.0762  -4.985  6.2e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 964.52  on 713  degrees of freedom
## AIC: 966.52
## 
## Number of Fisher Scoring iterations: 4

exp(cbind(odds = fit0$coefficients, confint.default(fit0))) #Odds for Y=1

##                  odds     2.5 %    97.5 %
## (Intercept) 0.6839623 0.5890725 0.7941372

head(fitted(fit0)) #Estimated probability for Y=1

##         1         2         3         4         5         6 
## 0.4061625 0.4061625 0.4061625 0.4061625 0.4061625 0.4061625

Pseudo_R2_fit1 <- 424/714 #Proportion of correctly classified units

Pseudo_R2_fit1

## [1] 0.5938375

fit1 <- glm(SurvivedF ~ Age,  
            family = binomial, 
            data = mydata)

summary(fit1)

## 
## Call:
## glm(formula = SurvivedF ~ Age, family = binomial, data = mydata)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.057480   0.173380  -0.332   0.7402  
## Age         -0.010945   0.005326  -2.055   0.0399 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 960.24  on 712  degrees of freedom
## AIC: 964.24
## 
## Number of Fisher Scoring iterations: 4

anova(fit0, fit1, test = "Chi") #Comparison of models based on -2LL statistics

## Analysis of Deviance Table
## 
## Model 1: SurvivedF ~ 1
## Model 2: SurvivedF ~ Age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       713     964.52                       
## 2       712     960.24  1   4.2796  0.03857 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #Odds ratio for Y=1 (with 95% CI)

##                    OR     2.5 %    97.5 %
## (Intercept) 0.9441407 0.6721335 1.3262271
## Age         0.9891146 0.9788435 0.9994935

fit2 <- glm(SurvivedF ~ Age + SibSp + Parch + PclassF + SexF,  
            family = binomial, 
            data = mydata)

library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

vif(fit2) #Checking for multicolinearity

##             GVIF Df GVIF^(1/(2*Df))
## Age     1.452938  1        1.205379
## SibSp   1.256252  1        1.120826
## Parch   1.206082  1        1.098218
## PclassF 1.434100  2        1.094321
## SexF    1.176588  1        1.084707

mydata$StdResid <- rstandard(fit2) 
mydata$Cook <- cooks.distance(fit2) #Checking the cooks distance (to see if there are outliers)

hist(mydata$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")

#Checking the standardized residuals to see if there are outliers (no outliers - everything is in the range -3/+3)

head(mydata[order(-mydata$Cook), c("PassengerId", "Cook")]) #Checking the cooks distance to see if there are outliers (there is no distance bigger than 1 but there are some gaps so the PassengerId 262, 298 and 26 will be removed)

##     PassengerId       Cook
## 211         262 0.02699205
## 241         298 0.02458253
## 23           26 0.01958668
## 499         631 0.01529684
## 452         571 0.01406456
## 397         499 0.01247619

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  filter(!PassengerId == 262)

library(dplyr)
mydata <- mydata %>%
  filter(!PassengerId == 298)

library(dplyr)
mydata <- mydata %>%
  filter(!PassengerId == 26)

head(mydata[order(-mydata$Cook), c("PassengerId", "Cook")]) #By removing the outliers it is clear that there are no more significant gaps)

##     PassengerId       Cook
## 496         631 0.01529684
## 449         571 0.01406456
## 394         499 0.01247619
## 65           86 0.01173899
## 51           69 0.01140253
## 383         484 0.01127991

fit1 <- glm(SurvivedF ~ Age,  
            family = binomial, 
            data = mydata)

fit2 <- glm(SurvivedF ~ Age + SibSp + Parch + PclassF + SexF,  
            family = binomial, 
            data = mydata)

anova(fit1, fit2, test = "Chi") #Model comparison to see which one is better. In this case Model 2 is better or fit2 - it is statistically significant.

## Analysis of Deviance Table
## 
## Model 1: SurvivedF ~ Age
## Model 2: SurvivedF ~ Age + SibSp + Parch + PclassF + SexF
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       709     955.53                          
## 2       704     620.83  5    334.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fit2)

## 
## Call:
## glm(formula = SurvivedF ~ Age + SibSp + Parch + PclassF + SexF, 
##     family = binomial, data = mydata)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.889999   0.404017   4.678 2.90e-06 ***
## Age                 -0.047969   0.008409  -5.704 1.17e-08 ***
## SibSp               -0.417193   0.131006  -3.185  0.00145 ** 
## Parch               -0.072148   0.125994  -0.573  0.56690    
## PclassFSecond Class -1.500019   0.289243  -5.186 2.15e-07 ***
## PclassFThird Class  -2.784624   0.292889  -9.507  < 2e-16 ***
## SexFFemale           2.724103   0.225123  12.101  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 959.87  on 710  degrees of freedom
## Residual deviance: 620.83  on 704  degrees of freedom
## AIC: 634.83
## 
## Number of Fisher Scoring iterations: 5

exp(cbind(OR = fit2$coefficients, confint.default(fit2)))

##                              OR     2.5 %     97.5 %
## (Intercept)          6.61935955 2.9985828 14.6122100
## Age                  0.95316371 0.9375830  0.9690033
## SibSp                0.65889359 0.5096863  0.8517802
## Parch                0.93039353 0.7268087  1.1910040
## PclassFSecond Class  0.22312596 0.1265744  0.3933276
## PclassFThird Class   0.06175228 0.0347813  0.1096378
## SexFFemale          15.24273167 9.8047749 23.6967060

mydata$EstProb <- fitted(fit2) #Estimates of probabilities for Y=1: P(Y=1)
head(mydata)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           7        0      1
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                             McCarthy, Mr. Timothy J   male  54     0     0
##             Ticket    Fare Cabin Embarked SurvivedF     PclassF   SexF
## 1        A/5 21171    7.25              S        No Third Class   Male
## 2         PC 17599 71.2833   C85        C       Yes First Class Female
## 3 STON/O2. 3101282   7.925              S       Yes Third Class Female
## 4           113803    53.1  C123        S       Yes First Class Female
## 5           373450    8.05              S        No Third Class   Male
## 6            17463 51.8625   E46        S        No First Class   Male
##     StdResid         Cook    EstProb
## 1 -0.4396242 5.141144e-05 0.08571391
## 2  0.4416866 8.959972e-05 0.91483093
## 3  0.9634587 9.226806e-04 0.64159488
## 4  0.4140949 7.162465e-05 0.92539456
## 5 -0.3966638 3.638144e-05 0.07086172
## 6 -0.9002776 7.813198e-04 0.33174431

mydata$Classification <- ifelse(test = mydata$EstProb < 0.50, 
                                yes = "NO", 
                                no = "YES")

#If estimated probability is below 0.50, person is classified into a group of passengers that did not survive the shipwreck (NO), otherwise YES.

mydata$ClassificationF <- factor(mydata$Classification,
                                 levels = c("NO", "YES"), 
                                 labels = c("NO", "YES"))

ClassificationTable <- table(mydata$SurvivedF, mydata$ClassificationF) #Classification table based on 2 categorical variables.

ClassificationTable

##      
##        NO YES
##   No  364  59
##   Yes  77 211

Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit2 #Coefficient of determination

## [1] 0.8087201

2.3. Conclusion

To conclude, every explanatory variable in fit 2 will be explained:

Age: If age is increased by one year the odds of survival are equal to 0.954 times the original odds under the assumptions that the remaining explanatory variables do not change. The probability to survive decreases. (p<0.001)

SibSp: If the number of siblings/spouses is increased by one sibling/spouse the odds of survival are equal to 0.659 times the original odds under the assumptions that the remaining explanatory variables do not change. The probability to survive decreases. (p<0.001)

Parch: We do not have enough evidence to say. It is statistically insignificant. (p=0.57)

PclassFSecond Class: Given the values of other explanatory variables, the odds for survival for passengers with second class tickets are equal to 0.22 times the odds for survival for passengers with first class tickets. Passengers with second class tickets are less likely to survive.(p<0.001)

PclassFThird Class: Given the values of other explanatory variables, the odds for survival for passengers with third class tickets are equal to 0.062 times the odds for survival for passengers with first class tickets. Passengers with third class tickets are less likely to survive. (p<0.001)

SexFFemale: Given the values of other explanatory variables, the odds for survival for passengers that are female are equal to 15.243 times the odds for survival for passengers that are male. Female passengers are more likely to survive. (p<0.001)

Logistic Regression

2024-01-07

2. Second Research Question

Research Question: What factors influence the outcome of passengers surviving the Titanic shipwreck?

The source of this data is from this link: https://www.kaggle.com/code/alexisbcook/titanic-tutorial

2.1. Data

Description of the variables:

PassengerID: Passenger ID.

Survived: Whether a passenger survived the shipwreck or not (0:No, 1:Yes).

Pclass: Ticket class of the passenger (1:first class, 2:second class, 3:third class).

Name: Name of the passenger - will not use this since it is not meaningful data.

Sex: Gender of the passenger.

Age: Age of the person when boarding the Titanic.

SibSp: Number of siblings/spouses aboard the Titanic.

Parch: Number of parents/children aboard the Titanic.

Ticket: Ticket number of the passenger - will not use this since it is not meaningful data.

Fare: The fare the passenger paid for the ticket - will not use this since it is not meaningful data.

Cabin: Cabin number of the passenger - will not use this since it has many missing values and that can make the data set harder to analyze.

Embarked: Port where the passengers boarded on the Titanic - will not use this since it is not meaningful data.

Defining variables and their effect on the depended variable:

Depended variable:

Survived - whether a passenger survived or not.

Explanatory variables:

Pclass - higher the class might lead to higher access to lifeboats, assistance etc…

Sex - women had the privilege to evacuate first.

Age - children had the privilege to evacuate first, also younger passengers can endure more then older passengers.

SibSp - having family members on board means help and collaboration in finding lifeboats and also moral support.

Parch - parents with children had the privilege to evacuate first.

Explanation of descriptive statistics:

Age: The minimum age is 0 years, the maximum age is 80 years, while the mean of age is 29.68 years.

SibSp: The minimum number of siblings/spouses is 0, the maximum number is 5, while the mean is 0.5126 siblings/spouses.

Parch: The minimum number of parents/children is 0, the maximum number is 6, while the mean is 0.4314 parents/children.

SurvivedF: No is the number of passengers who did not survive and Yes is the number of passengers who survived.

PclassF: First Class is the number of passengers with first class tickets, Second Class is the number of passengers with second class tickets and Third Class is the number of passengers with third class tickets.

SexF: Male is the number of male passengers and Female is the number of female passengers.

2.2. Analysis

2.3. Conclusion

To conclude, every explanatory variable in fit 2 will be explained:

Age: If age is increased by one year the odds of survival are equal to 0.954 times the original odds under the assumptions that the remaining explanatory variables do not change. The probability to survive decreases. (p<0.001)

SibSp: If the number of siblings/spouses is increased by one sibling/spouse the odds of survival are equal to 0.659 times the original odds under the assumptions that the remaining explanatory variables do not change. The probability to survive decreases. (p<0.001)

Parch: We do not have enough evidence to say. It is statistically insignificant. (p=0.57)

PclassFSecond Class: Given the values of other explanatory variables, the odds for survival for passengers with second class tickets are equal to 0.22 times the odds for survival for passengers with first class tickets. Passengers with second class tickets are less likely to survive.(p<0.001)

PclassFThird Class: Given the values of other explanatory variables, the odds for survival for passengers with third class tickets are equal to 0.062 times the odds for survival for passengers with first class tickets. Passengers with third class tickets are less likely to survive. (p<0.001)

SexFFemale: Given the values of other explanatory variables, the odds for survival for passengers that are female are equal to 15.243 times the odds for survival for passengers that are male. Female passengers are more likely to survive. (p<0.001)

Pseudo R2 = 81%, which means that 81% of the variance of the dependent variable being studied is explained by the variance of the independent variable