Case: from the Titanic passenger data, we want to predict whether a passenger survived or not. The data can be downloaded here: https://www.kaggle.com/c/titanic
titanic_train <- read.csv('train.csv')
titanic_test <- read.csv('test.csv')
head(titanic_train)
#head(titanic_test)
To make things easier, combine the train and test data into one dataframe.
library(dplyr)
titanic_test$Survived <- NA
titanic <- rbind(titanic_train, titanic_test) %>%
  mutate(Survived = as.factor(Survived))
str(titanic)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Here is some information about the columns in the Titanic data:
* PassengerId: ID number of the passenger
* Survived: whether the passenger survived or not
* Pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
* Name: name of the passenger
* Sex: sex of the passenger
* Age: age of the passenger
* SibSp: number of siblings/spouses aboard; Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
* Parch: number of parents/children aboard; Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, therefore Parch = 0 for them
* Ticket: ticket number
* Fare: passenger fare
* Cabin: cabin number
* Embarked: port of embarkation
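Before modeling, it can help to see which columns contain missing or empty values. A minimal sketch (run after the combined titanic dataframe is created above):
# count NA values per column (Age has NAs; Survived is NA for the test rows we added)
colSums(is.na(titanic))
# character columns such as Cabin store missing values as empty strings rather than NA
colSums(titanic == "", na.rm = TRUE)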
prop.table(table(titanic$Survived))
##
## 0 1
## 0.6161616 0.3838384
The class proportion is roughly 62:38. This is not severely imbalanced, but we downsample so that both classes are represented equally.
# downsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Loading required package: lattice
titanic <- downSample(x = titanic %>%
select(-Survived),
y = titanic$Survived,
yname = "Survived") #nama kolom target
head(titanic)
# check the class proportion after downsampling
prop.table(table(titanic$Survived))
##
## 0 1
## 0.5 0.5
Our data now has a balanced class proportion.
anyNA(titanic)
## [1] TRUE
# remove missing value
titanic <- titanic %>% na.omit()
anyNA(titanic)
## [1] FALSE
Split the data into 80% train and 20% test. The train proportion is larger than the test proportion because we want the model to learn from more data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# index sampling
index <- sample(x = nrow(titanic),
size = nrow(titanic)*0.8)
# splitting
titanic_train <- titanic[index,]
titanic_test <- titanic[-index,]
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.4920635 0.5079365
The class proportion in the training data is balanced enough.
model_logistic <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
data = titanic,
family = "binomial")summary(model_logistic)##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Embarked, family = "binomial", data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8565 -0.7227 0.1903 0.6342 2.2393
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.276310 604.089038 0.030 0.97586
## Pclass -1.173481 0.162927 -7.202 5.91e-13 ***
## Sexmale -2.729552 0.263541 -10.357 < 2e-16 ***
## Age -0.048796 0.009219 -5.293 1.20e-07 ***
## SibSp -0.436793 0.146866 -2.974 0.00294 **
## Parch -0.132034 0.137642 -0.959 0.33743
## EmbarkedC -11.636185 604.088855 -0.019 0.98463
## EmbarkedQ -13.033583 604.089100 -0.022 0.98279
## EmbarkedS -12.241519 604.088813 -0.020 0.98383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 763.81 on 551 degrees of freedom
## Residual deviance: 491.01 on 543 degrees of freedom
## AIC: 509.01
##
## Number of Fisher Scoring iterations: 13
The log-of-odds values from a logistic model are hard to interpret directly, so we convert them into odds by exponentiating the coefficients.
exp(model_logistic$coefficients)
## (Intercept) Pclass Sexmale Age SibSp Parch
## 8.655670e+07 3.092885e-01 6.524851e-02 9.523755e-01 6.461050e-01 8.763108e-01
## EmbarkedC EmbarkedQ EmbarkedS
## 8.840344e-06 2.185681e-06 4.825871e-06
From these odds we can interpret how each significant variable affects the prediction. Each one-step increase in Pclass (moving toward a lower class) multiplies the odds of survival by about 0.31, so higher-class passengers had much better odds. Male passengers have only about 0.065 times the odds of survival of female passengers, each additional year of age multiplies the odds by about 0.95, and each additional sibling/spouse aboard multiplies them by about 0.65. The Parch coefficient is not significant, so we cannot draw a conclusion from it.
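To make the conversion concrete, here is a minimal sketch of the log-odds → odds → probability chain for one hypothetical passenger (the profile values below are illustrative, not taken from the dataset):
log_odds <- predict(model_logistic,
                    newdata = data.frame(Pclass = 1, Sex = "female", Age = 30,
                                         SibSp = 0, Parch = 0, Embarked = "S"),
                    type = "link")    # linear predictor = log of odds
odds <- exp(log_odds)                 # odds of surviving
prob <- odds / (1 + odds)             # probability; identical to type = "response"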
Next, keep only the significant variables in our model using backward stepwise selection.
model_step <- step(model_logistic, direction = 'backward', trace = F)
summary(model_step)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9667 -0.7100 0.2112 0.6222 2.2415
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.240306 0.630638 9.895 < 2e-16 ***
## Pclass -1.270671 0.159013 -7.991 1.34e-15 ***
## Sexmale -2.646674 0.247856 -10.678 < 2e-16 ***
## Age -0.049091 0.009126 -5.379 7.48e-08 ***
## SibSp -0.483300 0.137687 -3.510 0.000448 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 763.81 on 551 degrees of freedom
## Residual deviance: 497.75 on 547 degrees of freedom
## AIC: 507.75
##
## Number of Fisher Scoring iterations: 5
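As a quick check on the stepwise result, we can compare the two models' AIC directly (lower is better); the values match the summaries above:
AIC(model_logistic, model_step)  # 509.01 vs 507.75: the stepwise model wins slightly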
titanic_test$predict_value <- predict(object = model_step,
newdata = titanic_test,
type = "response")Predict result type probability, transform it into label according to target variable(Survived)
unique(titanic_test$Survived)
## [1] 0 1
## Levels: 0 1
titanic_test$pred_label <- ifelse(test = titanic_test$predict_value > 0.5,
yes = "1",
no = "0")
titanic_test$pred_label <- as.factor(titanic_test$pred_label)
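The 0.5 cutoff is just the default; lowering it labels more passengers as survivors, trading specificity for sensitivity. A minimal sketch with a hypothetical cutoff of 0.4 (not used in the analysis below):
table(ifelse(titanic_test$predict_value > 0.4, "1", "0"))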
head(titanic_test)
library(caret)
confusionMatrix(data = titanic_test$pred_label,
reference = titanic_test$Survived,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 14
## 1 10 52
##
## Accuracy : 0.7838
## 95% CI : (0.6956, 0.8563)
## No Information Rate : 0.5946
## P-Value [Acc > NIR] : 1.946e-05
##
## Kappa : 0.5578
##
## Mcnemar's Test P-Value : 0.5403
##
## Sensitivity : 0.7879
## Specificity : 0.7778
## Pos Pred Value : 0.8387
## Neg Pred Value : 0.7143
## Prevalence : 0.5946
## Detection Rate : 0.4685
## Detection Prevalence : 0.5586
## Balanced Accuracy : 0.7828
##
## 'Positive' Class : 1
##
The logistic model has 78% accuracy, 79% sensitivity, 78% specificity, and 84% precision. Next, we compare this model with another model (K-Nearest Neighbor) to see which one predicts better.
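As a sanity check, these metrics can be recomputed by hand from the confusion matrix counts above:
TP <- 52; TN <- 35; FP <- 10; FN <- 14
(TP + TN) / (TP + TN + FP + FN)  # accuracy    = 0.7838
TP / (TP + FN)                   # sensitivity = 0.7879
TN / (TN + FP)                   # specificity = 0.7778
TP / (TP + FP)                   # precision   = 0.8387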
KNN requires numeric predictors, so we drop the character columns (and Fare) before modeling.
titanic_clean <- titanic %>% select(-c(Name, Sex, Ticket, Fare, Cabin, Embarked))
prop.table(table(titanic_clean$Survived))
##
## 0 1
## 0.4746377 0.5253623
RNGkind(sample.kind = "Rounding")
set.seed(257)
# data train 80%
index <- sample(x = nrow(titanic_clean),
size = nrow(titanic_clean)*0.8)
# splitting
titanic_train_for_knn <- titanic_clean[index, ]
titanic_test_for_knn <- titanic_clean[-index, ]
prop.table(table(titanic_train_for_knn$Survived)) # recheck class proportion
##
## 0 1
## 0.4829932 0.5170068
The class proportion of the training data is balanced enough.
# predictor train
titanic_train_predictor <- titanic_train_for_knn %>%
select(-Survived)
# target train
titanic_train_target <- titanic_train_for_knn %>%
pull(Survived)
# predictor test
titanic_test_predictor <- titanic_test_for_knn %>%
select(-Survived)
# target test
titanic_test_target <- titanic_test_for_knn %>%
pull(Survived)
titanic_train_predictor_scale <- titanic_train_predictor %>%
  scale()
titanic_test_predictor_scale <- titanic_test_predictor %>%
  scale(center = attr(titanic_train_predictor_scale, "scaled:center"),
        scale = attr(titanic_train_predictor_scale, "scaled:scale"))
The test predictors are scaled with the train set's center and scale values so that no information leaks from the test set. Next, find the optimum k.
sqrt(nrow(titanic_train_predictor))
## [1] 21
The square root of the number of training rows is 21. The usual rule of thumb is to pick k near this value and, for a two-class target, to prefer an odd k to avoid ties; the model below uses k = 20.
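Rather than committing to a single k, one way to sanity-check the choice (a sketch, not part of the original analysis) is to compare test accuracy across a few nearby values:
library(class)
for (k in c(17, 19, 21, 23)) {
  pred <- knn(train = titanic_train_predictor_scale,
              test = titanic_test_predictor_scale,
              cl = titanic_train_target,
              k = k)
  # proportion of test labels predicted correctly at this k
  print(c(k = k, accuracy = round(mean(pred == titanic_test_target), 3)))
}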
library(class)
titanic_pred <- knn(train = titanic_train_predictor_scale,
test = titanic_test_predictor_scale,
cl = titanic_train_target,
k = 20)
titanic_pred
## [1] 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1
## [38] 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1
## [75] 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1
## Levels: 0 1
library(caret)
confusionMatrix(data = titanic_pred, reference = titanic_test_target, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 17
## 1 13 45
##
## Accuracy : 0.7297
## 95% CI : (0.6372, 0.8096)
## No Information Rate : 0.5586
## P-Value [Acc > NIR] : 0.0001498
##
## Kappa : 0.4566
##
## Mcnemar's Test P-Value : 0.5838824
##
## Sensitivity : 0.7258
## Specificity : 0.7347
## Pos Pred Value : 0.7759
## Neg Pred Value : 0.6792
## Prevalence : 0.5586
## Detection Rate : 0.4054
## Detection Prevalence : 0.5225
## Balanced Accuracy : 0.7303
##
## 'Positive' Class : 1
##
The KNN model has 73% accuracy, 73% sensitivity, 73% specificity, and 78% precision.
To recap: the logistic model has 78% accuracy and 79% sensitivity, while the KNN model has 73% accuracy and 73% sensitivity (a full side-by-side table is sketched below).
In this case we focus on recall (sensitivity), because missing an actual survivor is the costlier error. Since the logistic model has the higher sensitivity, the logistic regression model is better than the KNN model at predicting which passengers survived the Titanic disaster.
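For reference, a minimal side-by-side recap of the two models (values taken from the confusion matrices above):
data.frame(model       = c("Logistic Regression", "KNN"),
           accuracy    = c(0.78, 0.73),
           sensitivity = c(0.79, 0.73),
           specificity = c(0.78, 0.73),
           precision   = c(0.84, 0.78))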