The analysis here uses the Titanic dataset, easily available on Kaggle. Logistic regression, LDA, and QDA are used to predict which passengers would have survived the tragedy.
library(MASS)     # lda() and qda()
library(ggplot2)  # plotting
library(dplyr)    # glimpse() and data manipulation
In this part, data exploration is carried out and the necessary data-cleaning steps are performed to make the data ready for predictive modelling.
The following code is used to load the data:
setwd("C:/Users/rohit/Documents/RPubs_uploads/proj_3/")
train <- read.csv("train.csv")
test <- read.csv("test.csv")
test$Survived <- NA
all <- rbind(train, test)
glimpse(all)
## Observations: 1,309
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <fctr> Braund, Mr. Owen Harris, Cumings, Mrs. John Bradl...
## $ Sex <fctr> male, female, female, female, male, male, male, m...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <fctr> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 37...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin <fctr> , C85, , C123, , , E46, , , , G6, C103, , , , , ,...
## $ Embarked <fctr> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q...
colSums(is.na(all))
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
# Count empty strings in each column (blanks are not coded as NA)
space <- function(x) { sum(x == "") }
apply(all, 2, space)
## PassengerId Survived Pclass Name Sex Age
## 0 NA 0 0 0 NA
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 NA 1014 2
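Cabin is blank for 1014 passengers and Embarked for 2 (blanks, not NAs). Cabin is too sparse to repair, but the two blank Embarked values could be filled with the modal port; a minimal sketch, not applied in this analysis (so the empty level still appears in the later str() output):
# Hypothetical fix: replace blank Embarked values with the most common port, "S"
all$Embarked[all$Embarked == ""] <- "S"
all$Embarked <- droplevels(all$Embarked)   # drop the now-unused "" level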
It is observed that Age is missing for 263 observations. Since age can be one of the important factors in determining a passenger's survival, it is imputed in the next section. The missing values of Age are predicted using Fare and Pclass.
# Age vs. Fare, smoothed separately for each passenger class
ggplot(data = all, aes(x = Fare, y = Age, col = factor(Pclass))) +
geom_smooth()
index <- which(is.na(all$Age))
predict_age <- all[index,]   # rows with missing Age, to be imputed
train_age <- all[-index,]    # rows with known Age, used to fit the model
lm_age <- lm(Age ~ Pclass + Fare + Pclass:Fare, data = train_age)
summary(lm_age)
##
## Call:
## lm(formula = Age ~ Pclass + Fare + Pclass:Fare, data = train_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.908 -8.117 -1.360 7.775 47.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.60475 1.56595 29.123 < 2e-16 ***
## Pclass -5.64937 0.67503 -8.369 < 2e-16 ***
## Fare 0.14181 0.02824 5.021 6.03e-07 ***
## Pclass:Fare -0.15585 0.02583 -6.033 2.23e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.88 on 1041 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2012, Adjusted R-squared: 0.1988
## F-statistic: 87.38 on 3 and 1041 DF, p-value: < 2.2e-16
predict_age$Age <- as.integer(predict(lm_age, newdata = predict_age))  # impute predicted ages
all <- rbind(train_age, predict_age)                                   # reassemble the full dataset
colSums(is.na(all))
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
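Fare is still missing for one passenger. A minimal sketch of one way to fill it, using the median fare of that passenger's class (an assumption; the analysis below leaves the value as NA, and the affected row is in the unlabelled test set, so it does not influence model fitting):
# Hypothetical imputation: the single missing Fare gets the median fare of its class
i <- which(is.na(all$Fare))
all$Fare[i] <- median(all$Fare[all$Pclass == all$Pclass[i]], na.rm = TRUE)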
all$family <- all$SibSp + all$Parch + 1   # family size: siblings/spouses + parents/children + the passenger
The labelled data is then randomly split into training and test sets: 80% of it is used to build the models and the remaining 20% to evaluate them.
# Convert categorical variables to factors
all$Survived <- as.factor(all$Survived)
all$Pclass <- as.factor(all$Pclass)
index <- which(is.na(all$Survived))
train <- all[-index,]
test <- all[index,]
str(all)
## 'data.frame': 1309 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
## $ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 520 629 417 581 732 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
## $ SibSp : int 1 1 0 1 0 0 3 0 1 1 ...
## $ Parch : int 0 0 0 0 0 0 1 2 0 1 ...
## $ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 86 396 345 133 617 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 131 1 1 1 147 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
## $ family : num 2 2 1 2 1 1 5 3 2 3 ...
index <- sample(1:nrow(train), nrow(train) * 0.8)  # note: call set.seed() first for a reproducible split
train_t <- train[index,]
test_t <- train[-index,]
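A quick check that the split came out as intended (891 labelled rows give 712 training and 179 test observations):
nrow(train_t)   # 712
nrow(test_t)    # 179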
An initial model is built using the important variables observed in the EDA above.
log.fit <- glm(Survived ~ Pclass + Sex + Age + Fare + Embarked + family, family = "binomial", data = train_t)
summary(log.fit)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Fare + Embarked +
## family, family = "binomial", data = train_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5914 -0.6101 -0.4234 0.6224 2.4739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.839481 607.860034 0.028 0.977899
## Pclass2 -0.912723 0.337197 -2.707 0.006794 **
## Pclass3 -2.207719 0.344811 -6.403 1.53e-10 ***
## Sexmale -2.706839 0.223592 -12.106 < 2e-16 ***
## Age -0.043054 0.009133 -4.714 2.43e-06 ***
## Fare 0.003047 0.002725 1.118 0.263556
## EmbarkedC -12.378034 607.859862 -0.020 0.983754
## EmbarkedQ -12.503025 607.859938 -0.021 0.983590
## EmbarkedS -12.734645 607.859840 -0.021 0.983286
## family -0.289555 0.075475 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 947.02 on 711 degrees of freedom
## Residual deviance: 626.90 on 702 degrees of freedom
## AIC: 646.9
##
## Number of Fisher Scoring iterations: 13
Fare and Embarked are not significant; the huge standard errors on the Embarked terms suggest a near-separation problem driven by the two blank Embarked entries. A simpler model is therefore fitted without them.
log.fit2 <- glm(Survived ~ Pclass + Sex + Age + family, family = "binomial", data = train_t)
summary(log.fit2)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + family, family = "binomial",
## data = train_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5993 -0.5899 -0.4190 0.6141 2.4614
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.570095 0.504352 9.061 < 2e-16 ***
## Pclass2 -1.198594 0.297017 -4.035 5.45e-05 ***
## Pclass3 -2.484581 0.284655 -8.728 < 2e-16 ***
## Sexmale -2.749358 0.221718 -12.400 < 2e-16 ***
## Age -0.045152 0.009067 -4.980 6.36e-07 ***
## family -0.284078 0.071789 -3.957 7.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 947.02 on 711 degrees of freedom
## Residual deviance: 631.35 on 706 degrees of freedom
## AIC: 643.35
##
## Number of Fisher Scoring iterations: 5
test_t$prob <- predict(log.fit2, newdata = test_t, type = "response")  # predicted survival probabilities
test_t$pred <- ifelse(test_t$prob > 0.5, 1, 0)                         # classify with a 0.5 threshold
table(test_t$pred, test_t$Survived)
##
## 0 1
## 0 100 24
## 1 9 46
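The confusion matrix gives a test accuracy of (100 + 46) / 179, roughly 0.82. A small helper (not part of the original code) makes this explicit and is reused for the later models:
# Accuracy: correctly classified observations over the total
acc <- function(tab) sum(diag(tab)) / sum(tab)
acc(table(test_t$pred, test_t$Survived))   # ~ 0.82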
An LDA model is fitted next with the predictors Pclass, Sex, Age, and Parch.
lda.model <- lda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
lda.model
## Call:
## lda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
##
## Prior probabilities of groups:
## 0 1
## 0.6179775 0.3820225
##
## Group means:
## Pclass2 Pclass3 Sexmale Age Parch
## 0 0.1750000 0.6886364 0.8363636 29.35909 0.3545455
## 1 0.2536765 0.3602941 0.3088235 27.58029 0.4926471
##
## Coefficients of linear discriminants:
## LD1
## Pclass2 -0.76664664
## Pclass3 -1.64266162
## Sexmale -2.10072845
## Age -0.02363916
## Parch -0.17455310
plot(lda.model)
lda_prob <- predict(lda.model, newdata = test_t)
table(Predicted = lda_prob$class, test_t$Survived)
##
## Predicted 0 1
## 0 100 26
## 1 9 44
Finally, a QDA model is fitted with the same predictors.
qda.model <- qda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
qda.model
## Call:
## qda(Survived ~ Pclass + Sex + Age + Parch, data = train_t)
##
## Prior probabilities of groups:
## 0 1
## 0.6179775 0.3820225
##
## Group means:
## Pclass2 Pclass3 Sexmale Age Parch
## 0 0.1750000 0.6886364 0.8363636 29.35909 0.3545455
## 1 0.2536765 0.3602941 0.3088235 27.58029 0.4926471
qda_prob <- predict(qda.model, newdata = test_t)
table(Predicted = qda_prob$class, test_t$Survived)
##
## Predicted 0 1
## 0 93 18
## 1 16 52
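Using the same helper, the three classifiers can be compared on the identical held-out split:
c(logistic = acc(table(test_t$pred, test_t$Survived)),     # ~ 0.82
  LDA      = acc(table(lda_prob$class, test_t$Survived)),  # ~ 0.80
  QDA      = acc(table(qda_prob$class, test_t$Survived)))  # ~ 0.81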
LDA and QDA work well when the class-separation and normality assumptions hold for the dataset. If the data are not normal, logistic regression has an edge over LDA and QDA. Logistic regression, on the other hand, does not work properly when the response classes are fully separated. In general, logistic regression is used for binary classification, while LDA and QDA are more popular when there are more than two response classes.
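The separation problem is easy to demonstrate on toy data (a hypothetical illustration, not drawn from the Titanic data): with perfectly separated classes, glm() warns and its coefficient estimates diverge, while lda() fits without complaint.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)        # classes perfectly separated at x = 3.5
glm(y ~ x, family = "binomial") # warns: fitted probabilities numerically 0 or 1 occurred
lda(factor(y) ~ x)              # LDA has no such difficulty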