The RMS Titanic in port during 1912 (Source: Universal History Archive/Getty Images)
The infamous Titanic disaster of 1912 took the lives of at least 1,500 people, passengers and crew alike, leaving only 706 survivors. Quoting the Titanic Kaggle Competition: “While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: what sorts of people were more likely to survive?”
To answer that question, we will try to predict whether a passenger survived or not, based on the information given about each of the Titanic passengers. There are three data sets that we will use here: the train, test and gender submission data sets. For Titanic Survival Prediction Part 1, we will compare the performance of two classification algorithms: logistic regression and k-Nearest Neighbors (kNN).
# Import packages
library(readr)
library(dplyr)
library(tidyverse)
library(tracerer) # to automatically calculate mode
library(caret)    # model evaluation
library(psych)    # create dummy variables
library(class)    # kNN
# Import files
data_train <- read_csv("Titanic/train.csv")

## Rows: 891 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
data_train

Columns description:
- PassengerId: passenger identifier number
- Survived: survival (0 = No, 1 = Yes)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: passenger name
- Sex: sex
- Age: age in years
- SibSp: number of siblings / spouses aboard the Titanic
- Parch: number of parents / children aboard the Titanic
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

head(data_train)
glimpse(data_train)

## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
# maximum number of missing values we tolerate for each column (5% of the rows)
nrow(data_train) * 0.05

## [1] 44.55
# check missing values from all columns
colSums(is.na(data_train))

## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
For data wrangling, we need to change the data types of several columns into the correct ones and drop the columns Cabin, Name, PassengerId and Ticket:
- PassengerId only stores an identifier number for each passenger and doesn't contain any information that is valuable for the analysis.
- Cabin will be dropped since its number of missing values amounts to more than half of the total observations.
- Name, to be honest, actually stores valuable information that we could use to explore the family relations among the Titanic passengers, and it would be very useful with a more advanced classification algorithm. Since we are working with the much simpler logistic regression and kNN, however, we will not use it to create our models.
- Ticket has a character data type and stores no valuable information.

# change columns data types into the correct ones
titanic_train <- data_train %>%
mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>%
select(-c(PassengerId, Cabin, Name, Ticket))
head(titanic_train)

# check again for columns that have missing values
colSums(is.na(titanic_train))

## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 2
We have located the columns that contain missing values: Age and Embarked. Since Embarked is a factor, we will replace its two missing values with the mode, i.e. the value that appears most frequently. As for the Age column, we can replace the 177 missing values with either the mean or the median: if there are no outliers we'll use the mean, but if there are, it is better to use the median instead.
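We will use calc_mode() from the tracerer package below; for reference, a mode can also be computed with a couple of lines of base R. A minimal sketch (the helper name is ours):

# hypothetical helper: the mode is the most frequent value in the frequency table
calc_mode_base <- function(x) {
  tab <- table(x)            # count each unique value
  names(tab)[which.max(tab)] # return the most frequent one (as character)
}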
# check the mode for column `Embarked`
table(titanic_train$Embarked)

##
## C Q S
## 168 77 644
It turns out most of the passengers on the Titanic embarked from the Southampton (S) port: 644 passengers in total.
# replace missing values in column `Embarked` with the mode (Southampton port)
which(is.na(titanic_train$Embarked)) # locate the indexes of the missing values in Embarked

## [1] 62 830
titanic_train <- titanic_train %>%
mutate(Embarked = if_else(is.na(Embarked), calc_mode(Embarked), Embarked))# check outliers in Age
boxplot(titanic_train$Age)Since we have plenty of outliers in the Age column, it is better to replace the missing values with the median.
# replace missing values in column `Age` with the median
median(titanic_train$Age, na.rm = T) # the median age of all passengers, ignoring NAs

## [1] 28
titanic_train <- titanic_train %>%
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age))

# check again whether we still have missing values or not
anyNA(titanic_train)

## [1] FALSE
Great! All variables within the titanic_train dataframe are now stored in their correct data types and have no more missing values. We have finished the data cleansing and are ready to create classification models.
titanic_train

As we mentioned earlier, we will predict whether a passenger survived or not using two different classification models: logistic regression and k-Nearest Neighbors (kNN).
The first step in creating a prediction with both models is to decide which variable to set as the target and which as the predictors. Because our aim is to predict which passengers survived, we will use the Survived column as the target variable, while the rest of the variables will serve as predictors.
We will also take “1” (Survived) as the positive class for our models. Next, we check whether the target variable has balanced or imbalanced classes.
prop.table(table(titanic_train$Survived))

##
## 0 1
## 0.6161616 0.3838384
The proportion of our target classes seems to be quite balanced, so at this stage we won't downsample or upsample the target variable just yet (a sketch of how that would look follows below).
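For reference only: had the classes been more heavily skewed, caret's downSample() / upSample() helpers could rebalance the training data. A minimal sketch (the seed is arbitrary):

# not needed here, but this is how downsampling would look
set.seed(100)
balanced <- downSample(x = titanic_train %>% select(-Survived),
                       y = titanic_train$Survived,
                       yname = "Survived") # keep the target's original column name
prop.table(table(balanced$Survived))       # 0.5 / 0.5 after downsampling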
Since we were already provided with separate .csv files for the training and test data sets, we don't have to go through the cross-validation (train-test splitting) step ourselves and can simply import the test.csv and gender_submission.csv files into our environment.
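Had we been given a single file instead, a simple random split would do. A minimal sketch, with an arbitrary 80/20 ratio and hypothetical variable names:

# hypothetical 80/20 train-test split, in case the data came as one file
set.seed(100)
idx <- sample(nrow(data_train), size = floor(0.8 * nrow(data_train)))
split_train <- data_train[idx, ]
split_test  <- data_train[-idx, ]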
# importing test.csv and the gender_submission.csv
data_test <- read_csv("Titanic/test.csv")

## Rows: 418 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
gender_sub <- read_csv("Titanic/gender_submission.csv")

## Rows: 418 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (2): PassengerId, Survived
# left join column 'Survived' from gender_sub to data_test
data_test_merge <- merge(data_test, gender_sub, by = "PassengerId")
glimpse(data_test_merge)

## Rows: 418
## Columns: 12
## $ PassengerId <dbl> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903~
## $ Pclass <dbl> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3~
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M~
## $ Sex <chr> "male", "female", "male", "male", "female", "male", "femal~
## $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0~
## $ SibSp <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0~
## $ Parch <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",~
## $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2~
## $ Cabin <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45", NA,~
## $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"~
## $ Survived <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1~
colSums(is.na(data_test_merge))

## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 327 0 0
# change columns data types into the correct ones
titanic_test <- data_test_merge %>%
mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>%
select(-c(PassengerId, Cabin, Name, Ticket))
head(titanic_test)

For the titanic_test data set, the missing values are located in the Age and Fare columns. Since the Fare column is numeric, we'll decide whether to replace its missing value with the mean or the median.
# Check for outliers in columns 'Age' and 'Fare'
boxplot(titanic_test$Age)
boxplot(titanic_test$Fare)

Both columns have outliers.
# Check the range of fare
range(titanic_test$Fare, na.rm = T)

## [1] 0.0000 512.3292
# Check the range of passenger's age
range(titanic_test$Age, na.rm = T)

## [1] 0.17 76.00
titanic_test %>%
  filter(Fare == 0)

# Treatment for missing values in columns `Age` and `Fare`
titanic_test <- titanic_test %>%
mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age),
         Fare = if_else(is.na(Fare), median(Fare, na.rm = T), Fare))

anyNA(titanic_test)

## [1] FALSE
titanic_train

prop.table(table(titanic_train$Survived))

##
## 0 1
## 0.6161616 0.3838384
# logistic regression with all predictors
model_lm <- glm(Survived ~ ., data = titanic_train, family = "binomial")
summary(model_lm)

##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6199 -0.6089 -0.4176 0.6187 2.4514
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.064159 0.472813 8.596 < 2e-16 ***
## Pclass2 -0.919468 0.297326 -3.092 0.00199 **
## Pclass3 -2.150048 0.297720 -7.222 5.13e-13 ***
## Sexmale -2.719444 0.200977 -13.531 < 2e-16 ***
## Age -0.038517 0.007855 -4.903 9.43e-07 ***
## SibSp -0.321794 0.109193 -2.947 0.00321 **
## Parch -0.093329 0.118856 -0.785 0.43232
## Fare 0.002339 0.002469 0.947 0.34346
## EmbarkedQ -0.056267 0.381471 -0.148 0.88274
## EmbarkedS -0.434226 0.239530 -1.813 0.06986 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 785.04 on 881 degrees of freedom
## AIC: 805.04
##
## Number of Fisher Scoring iterations: 5
Several things to conclude from the summary above:
- Not every predictor is statistically significant: Parch, Fare and EmbarkedQ all have p-values well above 0.05.
- The Estimate column displays the log of odds for each predictor. To interpret them, we can use the exp() function to convert the log-odds into odds.
# log of odds -> odds (for each predictor variable)
data.frame(odds = exp(model_lm$coefficients))
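For example, the Sexmale coefficient is -2.719 and exp(-2.719) ≈ 0.066: holding the other predictors constant, a male passenger's odds of surviving were only about 6.6% of a female passenger's.

# feature selection: backward elimination with step(), which iteratively
# drops predictors that do not improve (lower) the AIC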
model_all <- step(model_lm, direction = "backward", trace = F)
summary(model_all)

##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
## family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6021 -0.6012 -0.4151 0.6218 2.4651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.22353 0.41876 10.086 < 2e-16 ***
## Pclass2 -1.03964 0.26956 -3.857 0.000115 ***
## Pclass3 -2.30238 0.25453 -9.046 < 2e-16 ***
## Sexmale -2.69944 0.19541 -13.814 < 2e-16 ***
## Age -0.03874 0.00782 -4.954 7.26e-07 ***
## SibSp -0.32720 0.10369 -3.155 0.001602 **
## EmbarkedQ -0.05588 0.37822 -0.148 0.882536
## EmbarkedS -0.46779 0.23610 -1.981 0.047557 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 786.37 on 883 degrees of freedom
## AIC: 802.37
##
## Number of Fisher Scoring iterations: 5
# model evaluation with AIC
model_lm$aic

## [1] 805.0398

model_all$aic

## [1] 802.3658
Based on the AIC, model_all scores slightly lower than model_lm. Therefore, we will use model_all instead of model_lm, since a lower AIC means a better trade-off between goodness of fit and model complexity.
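As a quick side check, the base stats::AIC() function reports both models in one table; AIC = 2k - 2 * log-likelihood, so model_all is rewarded for dropping the uninformative Parch and Fare terms:

AIC(model_lm, model_all) # returns the degrees of freedom and AIC for each model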
Add a new variable to titanic_test that stores the predicted survival probability for each passenger.

titanic_test$Prediction.Prob <- predict(object = model_all, newdata = titanic_test, type = "response")

Create another variable that contains the predicted survival label for each passenger. If a passenger's probability is greater than 0.5, we label them “1” (positive class / survived); otherwise, they are put in the negative class (“0” / did not survive).
titanic_test$Prediction.Label <- ifelse(titanic_test$Prediction.Prob > 0.5, "1","0")
head(titanic_test)

# change Prediction.Label variable to factor
titanic_test <- titanic_test %>%
  mutate(Prediction.Label = as.factor(Prediction.Label))

Now that we have made our predictions, we will evaluate the model using the confusion matrix from the caret package.
# from package `caret`
lm_model_eval <- confusionMatrix(data = titanic_test$Prediction.Label,
reference = titanic_test$Survived,
positive = "1")
lm_model_eval

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 246 8
## 1 20 144
##
## Accuracy : 0.933
## 95% CI : (0.9046, 0.955)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8577
##
## Mcnemar's Test P-Value : 0.03764
##
## Sensitivity : 0.9474
## Specificity : 0.9248
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.9685
## Prevalence : 0.3636
## Detection Rate : 0.3445
## Detection Prevalence : 0.3923
## Balanced Accuracy : 0.9361
##
## 'Positive' Class : 1
##
From the summary above, we can see that our model correctly predicts most observations in both the positive and the negative class. The raw count of correct positive predictions (144) is much lower than the count of correct negatives (246), largely because there are fewer positive cases in the test set; the per-class rates (a sensitivity of 94.74% and a specificity of 92.48%) show that the model learns both classes comparably well. Ideally, we want our model to learn equally well from both classes and not be limited to one class only.
Although the target class is somewhat imbalanced, our model turns out to score quite high on both recall (Sensitivity) and precision (Pos Pred Value). Because the target classes aren't balanced, we won't use Accuracy as the metric for our model evaluation; instead, we will focus on the other two metrics, recall and precision.
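To make those two metrics concrete, here they are recomputed by hand from the confusion matrix counts above:

# recall and precision by hand (positive class = "1")
TP <- 144; FN <- 8; FP <- 20
TP / (TP + FN) # recall / sensitivity: 144/152 ≈ 0.9474
TP / (TP + FP) # precision / pos pred value: 144/164 ≈ 0.8780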
glimpse(titanic_train)

## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0~
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2~
## $ Sex <fct> male, female, female, female, male, male, male, male, female,~
## $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,~
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0~
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21~
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S~
# data wrangling for kNN: encode Sex as 0/1, store Pclass as an integer, drop Embarked
titanic.train <- titanic_train %>%
mutate(Sex = if_else(Sex == "male", 0, 1),
Pclass = as.integer(Pclass)) %>%
select(-Embarked)
head(titanic.train)

prop.table(table(titanic.train$Survived))

##
## 0 1
## 0.6161616 0.3838384
Before we proceed with creating a kNN model, we first need to check the range of each variable. If the predictor variables have huge gaps between their ranges, we have to perform normalization or scaling. This is done particularly because kNN classifies based on the distance between observations, so variables with larger ranges (such as Fare) would otherwise dominate the distance calculation.
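The scaling we apply later is plain z-score standardization, z = (x - mean) / sd, with the mean and standard deviation taken from the training data. A minimal illustration with the Age column, equivalent to what scale() will do for us below:

# by-hand z-score for a single column, using training-set statistics
z_age <- (titanic.train$Age - mean(titanic.train$Age)) / sd(titanic.train$Age)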
summary(titanic.train)

## Survived Pclass Sex Age SibSp
## 0:549 Min. :1.000 Min. :0.0000 Min. : 0.42 Min. :0.000
## 1:342 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:0.000
## Median :3.000 Median :0.0000 Median :28.00 Median :0.000
## Mean :2.309 Mean :0.3524 Mean :29.36 Mean :0.523
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:35.00 3rd Qu.:1.000
## Max. :3.000 Max. :1.0000 Max. :80.00 Max. :8.000
## Parch Fare
## Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 7.91
## Median :0.0000 Median : 14.45
## Mean :0.3816 Mean : 32.20
## 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :6.0000 Max. :512.33
Summary from titanic.train:
1. The ages of the passengers range from 0.42 to 80 years, with a median of 28.
2. Judging from the SibSp column, at least half of the passengers traveled without a sibling or spouse aboard.
3. The median fare paid by the passengers is 14.45; the mean (32.20) is pulled up by a few very expensive tickets.
# data wrangling
# replace value "male" with 0 and "female" with 1 in column Sex
# remove the remaining categorical column and the prediction results from the
# titanic_test data set that we used for logistic regression
titanic.test <- titanic_test %>%
mutate(Sex = if_else(Sex == "male", 0, 1),
Pclass = as.integer(Pclass)) %>%
select(-c(Embarked, Prediction.Prob, Prediction.Label))
head(titanic.test)

# check the number of columns in both data sets
ncol(titanic.test)

## [1] 7

ncol(titanic.train)

## [1] 7
# filter out predictor variables
titanic_train_x <- titanic.train %>% select_if(is.numeric)
titanic_test_x <- titanic.test %>% select_if(is.numeric)
# filter out target variables
titanic_train_y <- titanic.train[,"Survived"]
titanic_test_y <- titanic.test[,"Survived"]

head(titanic_train_x)
head(titanic_train_y)

# scaling
titanic_train_xs <- scale(titanic_train_x)
titanic_test_xs <- scale(titanic_test_x,
center = attr(titanic_train_xs,"scaled:center"), # center = mean
                         scale = attr(titanic_train_xs, "scaled:scale")) # scale = standard deviation

head(titanic_train_xs)

## Pclass Sex Age SibSp Parch Fare
## [1,] 0.8269128 -0.737281 -0.5654189 0.4325504 -0.4734077 -0.5021631
## [2,] -1.5652278 1.354813 0.6634884 0.4325504 -0.4734077 0.7864036
## [3,] 0.8269128 1.354813 -0.2581921 -0.4742788 -0.4734077 -0.4885799
## [4,] -1.5652278 1.354813 0.4330683 0.4325504 -0.4734077 0.4204941
## [5,] 0.8269128 -0.737281 0.4330683 -0.4742788 -0.4734077 -0.4860644
## [6,] 0.8269128 -0.737281 -0.1045787 -0.4742788 -0.4734077 -0.4778481
head(titanic_test_xs)

## Pclass Sex Age SibSp Parch Fare
## [1,] 0.8269128 -0.737281 0.3946649 -0.4742788 -0.4734077 -0.4905077
## [2,] 0.8269128 1.354813 1.3547487 0.4325504 -0.4734077 -0.5071940
## [3,] -0.3691575 -0.737281 2.5068493 -0.4742788 -0.4734077 -0.4531124
## [4,] 0.8269128 -0.737281 -0.1813854 -0.4742788 -0.4734077 -0.4737389
## [5,] 0.8269128 1.354813 -0.5654189 0.4325504 0.7671990 -0.4007916
## [6,] 0.8269128 -0.737281 -1.1798725 -0.4742788 -0.4734077 -0.4624195
# find optimum k
sqrt(nrow(titanic_train_xs))

## [1] 29.84962
A common rule of thumb sets k near the square root of the number of training observations. When the target has an even number of classes (as here, with two), we should use an odd number for k to avoid tied votes, and vice versa. Rounding 29.85 down gives us k = 29, which is conveniently odd; see the small helper below.
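A short snippet makes that rule explicit (the variable name is ours):

# round sqrt(n) down, then bump to the next odd number if the result is even
k_opt <- floor(sqrt(nrow(titanic_train_xs)))
if (k_opt %% 2 == 0) k_opt <- k_opt + 1
k_opt # 29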
dim(titanic_train_xs)

## [1] 891 6

dim(titanic_train_y)

## [1] 891 1
# change titanic_train_y from a one-column data frame to a vector
titanic_y_vec <- titanic_train_y %>% pull("Survived")
titanic_pred <- knn(train = titanic_train_xs, # `train` : predictor variables in data train
test = titanic_test_xs, # `test` : predictor variables in data test
cl = titanic_y_vec, # target variables in data train
                    k = 29)

titanic_pred

## [1] 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
## [38] 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0
## [75] 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
## [112] 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
## [149] 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
## [186] 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
## [223] 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
## [260] 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0
## [297] 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
## [334] 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0
## [371] 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
## [408] 1 1 1 1 1 1 0 1 0 0 0
## Levels: 0 1
knn_model_eval <- confusionMatrix(data = as.factor(titanic_pred), reference = titanic_test_y, positive = "1")
knn_model_eval

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 247 8
## 1 19 144
##
## Accuracy : 0.9354
## 95% CI : (0.9074, 0.957)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8626
##
## Mcnemar's Test P-Value : 0.05429
##
## Sensitivity : 0.9474
## Specificity : 0.9286
## Pos Pred Value : 0.8834
## Neg Pred Value : 0.9686
## Prevalence : 0.3636
## Detection Rate : 0.3445
## Detection Prevalence : 0.3900
## Balanced Accuracy : 0.9380
##
## 'Positive' Class : 1
##
We measured how well our kNN model learned to predict the target variable using the same confusion matrix evaluation we applied to the logistic regression. Based on the three metrics above, kNN performs quite well, with a recall of 94.74%, a precision of 88.34% and an accuracy of 93.54%. Now let's compare the performance of the two models, logistic regression and kNN.
# comparing the model performances between knn and logistic regression
# model evaluation when we use logistic regression
lm_model_eval

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 246 8
## 1 20 144
##
## Accuracy : 0.933
## 95% CI : (0.9046, 0.955)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8577
##
## Mcnemar's Test P-Value : 0.03764
##
## Sensitivity : 0.9474
## Specificity : 0.9248
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.9685
## Prevalence : 0.3636
## Detection Rate : 0.3445
## Detection Prevalence : 0.3923
## Balanced Accuracy : 0.9361
##
## 'Positive' Class : 1
##
- Accuracy: logistic regression 93.30% vs. kNN 93.54% (kNN slightly better)
- Recall: logistic regression 94.74% vs. kNN 94.74% (equal)
- Precision: logistic regression 87.80% vs. kNN 88.34% (kNN slightly better)
We have gone through the process of creating models with both logistic regression and k-Nearest Neighbors (kNN) to find out which of the two better predicts whether a Titanic passenger survived. As it turns out, the kNN model predicts passenger survival slightly better than the logistic regression model. Since the proportion of the target classes is imbalanced, we chose recall and precision, rather than accuracy, as the metrics for our model evaluation.
Also, from the logistic regression model we identified several parameters that contribute significantly to predicting whether a passenger was likely to survive:
1. Ticket class (upper, middle or lower), a proxy for the passenger's socio-economic status
2. Sex
3. Age
4. Number of siblings / spouses aboard the Titanic
5. Port of embarkation