The Titanic tragedy is one of the most heartbreaking moments in human history. More than half of her passengers died in the accident, and only a few survived to tell how terrifying it was.
With all due respect to the victims, in this analysis I use the Titanic data set to predict which passengers survived the tragedy based on their characteristics. I only use the glm() function to build several models, then compare their performance to see which model best predicts whether a passenger survived. The data set was obtained from kaggle.com
1 IMPORT LIBRARY
Import all necessary libraries
library(dplyr)
library(mice)
library(GGally)
library(caret)
library(class)
library(gtools)
2 LOAD DATASET
This data already comes as two data sets: a train set and a test set. The train set will be used to build the models. In this step we will split the train set into two new data sets: the first contains 80% of the data and will be used to fit the models, while the remaining 20% will be used to validate the models, acting as unseen data so we can see which model performs best. Finally, we will predict on the test set to estimate how many passengers survived the tragedy.
Read the train data
titanic_train <- read.csv("train.csv")
rmarkdown::paged_table(titanic_train)
Read the test data
titanic_test <- read.csv("test.csv")
rmarkdown::paged_table(titanic_test)
Column description:
- Pclass: Ticket class (1 for Upper, 2 for Middle and 3 for Lower)
- Sex: Passenger’s Gender
- Age: Passenger’s Age
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket Number
- Fare: Passenger fare (amount paid for the ticket)
- Cabin: Cabin Number
- Embarked: Port of Embarkation
- Survived: Whether the passenger survived (1) or not (0)
3 DATA PREPROCESSING
Data pre-processing is the step where we prepare our raw data before doing any analysis or machine learning, so we can be sure of the quality, consistency and compatibility of our data. The only thing we need to do here is handle missing values.
3.1 Handling Missing Value
Check whether there are any missing values
colSums(is.na(titanic_train))
#> PassengerId Survived Pclass Name Sex Age
#> 0 0 0 0 0 177
#> SibSp Parch Ticket Fare Cabin Embarked
#> 0 0 0 0 0 0
colSums(is.na(titanic_test))
#> PassengerId Pclass Name Sex Age SibSp
#> 0 0 0 0 86 0
#> Parch Ticket Fare Cabin Embarked
#> 0 0 1 0 0
Our train data has 177 missing values in the Age column. That is a big number compared to the total rows of our train data, so we cannot simply remove those rows or the entire column. To handle it, we will use MICE imputation to fill in the missing values.
MICE (Multivariate Imputation by Chained Equations) is a method that fills missing values by looking at the data in the other columns and estimating the best prediction for each missing value. Here is a step-by-step explanation of MICE imputation.
Say we have a data set that contains 3 columns (age, experience, salary) with a missing value in each column.
What we will do is:
- Fill all missing values with a dummy value.
The question now is: what number should we use to replace the missing values? There are two ways to pick a dummy value: the univariate method (using the column's own mean, median, mode, most frequent value, etc.) and the multivariate method, which looks at the other columns to fill the missing value. Since at this stage it is hard to fill, say, the age column by looking at the other columns, as a first step we will use each column's mean to replace its missing values.
Just by observing the result, we might see that a 25-year-old has 7 years of work experience, which is not possible, and a person with 7 years of experience earning 50k is totally misleading. So for our next step we will use the multivariate method to fix these misleading fills.
- Multivariate method
As mentioned earlier, the multivariate method fills a missing value by looking at the other columns and estimating the best prediction for each missing value. In this toy case we perform linear regression over the three features, with experience and salary as predictors and age as the target variable; similarly, we estimate the missing values of the experience and salary features.
- First iteration
First, remove the dummy value from the age column. Second, perform linear regression to impute the missing value in the age column, then apply the same method to the experience and salary columns. Third, remove the dummy value from the salary column and do the same thing to get a result for the salary column.
After we have imputed all the missing values, we subtract the data set filled with dummy values (let's call it the zero data set) from this newly imputed data set (let's call it the first data set). The absolute differences between the zero data set and the first data set are large for a few of the imputed values. Our aim is to reduce these differences to close to zero, and to achieve this we have to iterate.
- Second iteration
In the second iteration we use the first data set as the new zero data set and perform the regressions again to obtain new values, following the same steps as before (remove one imputed value, then regress, until we get all three new values). This time the differences are negligible. And that is all the logic behind MICE imputation.
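Here is a minimal plain-R sketch of that logic on an assumed toy data frame (illustrative only; the mice package automates and generalizes this):
# Toy data: three columns with one missing value each
df <- data.frame(age        = c(25, 30, NA, 40, 45),
                 experience = c(2, 5, 8, NA, 20),
                 salary     = c(30, 50, 70, 90, NA))
# Step 1 (univariate start): fill every missing cell with its column mean
imp <- df
for (col in names(imp)) {
  imp[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
}
# Step 2 (one multivariate pass): re-estimate each originally missing cell
# by regressing its column on the other two, using the current fills
for (col in names(df)) {
  miss <- is.na(df[[col]])
  if (any(miss)) {
    fit <- lm(reformulate(setdiff(names(df), col), response = col),
              data = imp[!miss, ])
    imp[[col]][miss] <- predict(fit, newdata = imp[miss, ])
  }
}
imp # repeating this pass until the fills stop changing is the MICE iteration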
For our data set, we have two columns with missing values: Fare and Age. For the Fare column we will use the univariate method (the mean if the data is normally distributed, the median if it is skewed) to fill the missing value, while for the Age column we will use MICE imputation.
3.1.1 Handling Missing Value in Data Train
mice has various methods for filling missing values. For now we will use the midastouch method, a variant of predictive mean matching that fits a regression model behind the scenes, to replace the missing Age values in the train data.
# Build the mice object; mice chooses the predictors for each incomplete
# column automatically through its predictor matrix
mice_obj <- mice(titanic_train, method = "midastouch")
#>
#> iter imp variable
#> 1 1 Age
#> 1 2 Age
#> 1 3 Age
#> 1 4 Age
#> 1 5 Age
#> 2 1 Age
#> 2 2 Age
#> 2 3 Age
#> 2 4 Age
#> 2 5 Age
#> 3 1 Age
#> 3 2 Age
#> 3 3 Age
#> 3 4 Age
#> 3 5 Age
#> 4 1 Age
#> 4 2 Age
#> 4 3 Age
#> 4 4 Age
#> 4 5 Age
#> 5 1 Age
#> 5 2 Age
#> 5 3 Age
#> 5 4 Age
#> 5 5 Age
# Impute complete value
titanic_train_full <- complete(mice_obj)
Check for missing values in our new train data
anyNA(titanic_train_full)
#> [1] FALSE
No missing values detected
3.1.2 Handling Missing Value in Data Test
First, we need to know the distribution of the Fare column.
hist(titanic_test$Fare)
Since the Fare distribution is skewed (a long right tail), we will use the median to fill the missing value.
# Get median of fare column
med_fare <- median(titanic_test$Fare, na.rm = T)
# Replace missing value
titanic_test$Fare[is.na(titanic_test$Fare)] <- med_fare
Use MICE to fill the missing values in the Age column of the test data
# Build the mice object for the test data, with the same method as before
mice_obj <- mice(titanic_test, method = "midastouch")
#>
#> iter imp variable
#> 1 1 Age
#> 1 2 Age
#> 1 3 Age
#> 1 4 Age
#> 1 5 Age
#> 2 1 Age
#> 2 2 Age
#> 2 3 Age
#> 2 4 Age
#> 2 5 Age
#> 3 1 Age
#> 3 2 Age
#> 3 3 Age
#> 3 4 Age
#> 3 5 Age
#> 4 1 Age
#> 4 2 Age
#> 4 3 Age
#> 4 4 Age
#> 4 5 Age
#> 5 1 Age
#> 5 2 Age
#> 5 3 Age
#> 5 4 Age
#> 5 5 Age
# Impute complete value
titanic_test_full <- complete(mice_obj)
Check for missing values
anyNA(titanic_test_full)
#> [1] FALSE
3.2 Data Coercion
Before we do any processing or analysis on our data, we need to ensure that every column has an appropriate data type. The whole process of checking and converting column data types is called data coercion.
First, check every column's data type
glimpse(titanic_train_full)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
#> $ Age <dbl> 22, 38, 26, 35, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
glimpse(titanic_test_full)
#> Rows: 418
#> Columns: 11
#> $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
#> $ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
#> $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
#> $ Sex <chr> "male", "female", "male", "male", "female", "male", "femal…
#> $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
#> $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
#> $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
#> $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
#> $ Cabin <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
#> $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…
As we observed, there’s some column is not in their appropriate data type yet. In data train we have Embarked, Pcalss, Sex and Survived columns. The same column (except survived) also belongs to data test. We need to convert all this ccolumn data type into factor.
The other we need to do is remve any unused column. The reason why we need to remove them, is because this column contain only unique value of all passenger, so no insight we will get from this, and we cannot use this type of column as predictor in GLM (General Logistic Model)
titanic_train_full <- titanic_train_full %>%
select(-c(PassengerId, Ticket, Name, Cabin)) %>%
mutate(Embarked = as.factor(Embarked),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
Survived = as.factor(Survived))
titanic_test_full <- titanic_test_full %>%
select(-c(PassengerId, Ticket, Name, Cabin)) %>%
mutate(Embarked = as.factor(Embarked),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex))
4 DATA EXPLORATION
4.1 Checking Correlation
A strong correlation between the dependent variable and the independent variables helps the quality of the model: the stronger that correlation, the better the model. But high correlation among the independent variables themselves leads to a bad model, because the model will fail to determine which predictor most effectively explains the dependent variable (the target), which can skew or mislead the results. So we will plot the correlations between all numeric columns.
ggcorr(titanic_train_full, label = TRUE, label_size = 5, hjust = 1, layout.exp = 2)
From the plot above, we can say no predictor is highly correlated with another.
4.2 Target Class Proportion
One of the most important things to note is that the target variable should have balanced proportions across its classes. If the target is unbalanced, the model will only be good at predicting the majority class, which results in a bad model.
prop.table(table(titanic_train_full$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
As we can see, our target proportions are still unbalanced, so we need to handle this. There are two ways to balance a target variable: down sampling and up sampling. Down sampling randomly reduces the dominant class, which suits big data sets, while up sampling randomly duplicates rows of the class with the lower proportion; a sketch of the down-sampling alternative follows.
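For comparison, here is a minimal sketch of the down-sampling route using caret's downSample() (we will not keep this result; the up-sampled data below is what feeds the models):
# Down sampling (illustration only): randomly drop rows of the majority
# class until both classes have equal counts
titanic_train_down <- downSample(x = titanic_train_full %>% select(-Survived),
                                 y = titanic_train_full$Survived,
                                 yname = "Survived")
prop.table(table(titanic_train_down$Survived))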
Now perform up sampling on our train data
titanic_train_up <- upSample(x = titanic_train_full %>% select(-Survived), y = titanic_train_full$Survived , yname = "Survived")
Check our target proportion after doing up sampling
prop.table(table(titanic_train_up$Survived))
#>
#> 0 1
#> 0.5 0.5
Our target proportions are now balanced.
5 MODEL BUILDING
The model we use to predict which passengers survived the Titanic tragedy is a GLM (Generalized Linear Model), specifically logistic regression. Logistic regression is a regression model used to describe the relationship between independent variables (predictors), which can be continuous or discrete, and a dependent variable (target). This model differs from linear regression in a few ways:
- Linear regression predicts a continuous number with a range from -Inf to Inf, while logistic regression predicts a probability with a range from 0 to 1.
- The output of a linear regression model is the predicted number itself, while the output of logistic regression is a log of odds.
To get a better understanding of logistic regression, let's make an example.
Say an airport handles 100 flights in a day. Of all the flights, 20 are delayed and the rest are on time. If we want to know the probability of a delayed flight and of an on-time flight, we can simply calculate them:
p_ontime = 80/100
p_delay = 20/100
p_ontime
#> [1] 0.8
p_delay
#> [1] 0.2
The probabilities of an on-time flight and a delayed flight are 0.8 and 0.2 respectively. Then, what if we want to know how much more frequent on-time flights are than delayed flights? We can find this by calculating the odds. Generally, odds are another form of probability: the odds are the probability of the wanted event (an on-time flight) compared to the unwanted event (a delayed flight).
# Odds of on time flight
odds_ot <- p_ontime/p_delay
odds_ot
#> [1] 4
The interpretation of these odds is that an on-time flight is 4 times more likely than a delayed flight. Then, by taking the log of this number, we get the log of odds.
log_of_odds <- log(odds_ot)
log_of_odds
#> [1] 1.386294
The log of odds of an on-time flight is 1.3862. We cannot interpret a log of odds directly; we need to transform it back into odds or a probability. To do this we can use inv.logit() from the gtools library, for example:
inv.logit(log_of_odds)
#> [1] 0.8
We can use the inv.logit() function to convert a log of odds into a probability, and transform a probability back into a log of odds using the logit() function. Now that we understand the basics of logistic regression, let's build the model.
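As a quick check of the round trip (using the gtools functions already loaded):
# probability -> log of odds -> probability
logit(0.8)            # 1.386294, the log_of_odds we computed above
inv.logit(logit(0.8)) # back to 0.8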
5.1 Cross Validation
First, we need to split our train data into two new data sets, called data_train and data_test. The new data_train contains 80% of the original train data and will be used to train our GLM models, while data_test contains the remaining 20% and will act as unseen (new) data; later we use data_test to evaluate the models.
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(nrow(titanic_train_up), nrow(titanic_train_up)*0.8)
# splitting
data_train <- titanic_train_up[index, ]
data_test <- titanic_train_up[-index, ]
We will build more than one model for this prediction. The first model uses all columns except Survived as predictors. The second model uses only the categorical columns. The third model uses only the numeric columns. And the fourth model uses stepwise elimination in the backward direction, where R automatically keeps only the columns with high significance to the model. There is no specific reason for choosing only numeric or only categorical columns as predictors; I am using trial and error to see which choice gives the best model.
5.2 First Model
Use all columns as predictors
first_model <- glm(Survived ~ ., data = data_train, family = binomial)
Since our target variable has only two levels (survived = 1, not survived = 0), we use the binomial family.
summary(first_model)
#>
#> Call:
#> glm(formula = Survived ~ ., family = binomial, data = data_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 16.995152 601.686167 0.028 0.977466
#> Pclass2 -1.302202 0.325029 -4.006 0.0000616459940824 ***
#> Pclass3 -2.571630 0.328239 -7.835 0.0000000000000047 ***
#> Sexmale -2.805577 0.208497 -13.456 < 0.0000000000000002 ***
#> Age -0.053092 0.007949 -6.679 0.0000000000240845 ***
#> SibSp -0.416590 0.110822 -3.759 0.000171 ***
#> Parch -0.144535 0.121142 -1.193 0.232829
#> Fare 0.004283 0.003032 1.413 0.157758
#> EmbarkedC -11.810584 601.686034 -0.020 0.984339
#> EmbarkedQ -11.629307 601.686090 -0.019 0.984580
#> EmbarkedS -12.092645 601.686021 -0.020 0.983965
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1217.2 on 877 degrees of freedom
#> Residual deviance: 746.3 on 867 degrees of freedom
#> AIC: 768.3
#>
#> Number of Fisher Scoring iterations: 13
As we can see from the summary, our first model has these characteristics:
- It has an intercept of 16.995 (a log of odds).
- Of all the predictors, only five are highly significant to the model.
- Males have a lower log of odds of surviving than females.
- The older the passenger, the lower the log of odds of surviving.
- Without any predictors the model has an error (null deviance) of 1217.2, while with the predictors the error (residual deviance) is 746.3.
5.3 Second Model
The second model uses only the categorical columns
second_model <- glm(Survived ~ Pclass + Sex + Embarked, data = data_train, family = binomial)
summary(second_model)
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Embarked, family = binomial,
#> data = data_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 14.5661 624.1938 0.023 0.98138
#> Pclass2 -0.8012 0.2622 -3.056 0.00224 **
#> Pclass3 -1.9493 0.2296 -8.489 < 0.0000000000000002 ***
#> Sexmale -2.7068 0.1881 -14.394 < 0.0000000000000002 ***
#> EmbarkedC -11.3673 624.1939 -0.018 0.98547
#> EmbarkedQ -11.2657 624.1939 -0.018 0.98560
#> EmbarkedS -11.9702 624.1939 -0.019 0.98470
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1217.2 on 877 degrees of freedom
#> Residual deviance: 809.2 on 871 degrees of freedom
#> AIC: 823.2
#>
#> Number of Fisher Scoring iterations: 13
As we can see, our second model has these characteristics:
- It has an intercept of 14.5661 (a log of odds).
- Of all the predictors, only three are highly significant to the model.
- Males have a lower log of odds of surviving than females (the same as in our first model).
- Without any predictors the model has an error of 1217.2, while with the predictors the error is 809.2, a little higher than our first model's. That is because the second model uses no numeric predictors, and two of the numeric columns were highly significant in the first model.
5.4 Third Model
The third model uses only the numeric columns
third_model <- glm(Survived ~ Age + SibSp + Parch + Fare, data = data_train, family = binomial)
summary(third_model)
#>
#> Call:
#> glm(formula = Survived ~ Age + SibSp + Parch + Fare, family = binomial,
#> data = data_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.565857 0.185216 3.055 0.00225 **
#> Age -0.039194 0.005832 -6.721 0.000000000018038275 ***
#> SibSp -0.472580 0.088455 -5.343 0.000000091626827003 ***
#> Parch 0.006372 0.100101 0.064 0.94924
#> Fare 0.029135 0.003573 8.155 0.000000000000000349 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1217.2 on 877 degrees of freedom
#> Residual deviance: 1070.6 on 873 degrees of freedom
#> AIC: 1080.6
#>
#> Number of Fisher Scoring iterations: 6
As we can see, our third model has these characteristics:
- It has an intercept of 0.5659 (a log of odds).
- Of all the predictors, only three are highly significant to the model.
- Fare is highly significant here, which is interesting since it contributes little in our first model.
- Without any predictors the model has an error of 1217.2 (the same null deviance as the other models), while with the predictors the error is 1070.6.
5.5 Fourth Model
The fourth model uses stepwise elimination in the backward direction. This method automatically keeps only the columns that contribute substantially to the model.
model_step <- step(first_model, direction = "backward", trace = F)
Look closely at the direction argument in the model above. Besides backward, we can also use forward. The difference is that backward starts with the full model (all of its predictors) and then eliminates low-significance columns one by one, while forward starts with an empty model and adds high-significance predictors one by one, as sketched below.
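For illustration, a forward search could be set up like this (a sketch only; we keep the backward result stored in model_step):
# Forward stepwise: start from the intercept-only model and let step()
# add predictors one by one, bounded above by the full set of predictors
null_model <- glm(Survived ~ 1, data = data_train, family = binomial)
model_forward <- step(null_model,
                      scope = list(lower = ~ 1,
                                   upper = ~ Pclass + Sex + Age + SibSp +
                                           Parch + Fare + Embarked),
                      direction = "forward", trace = FALSE)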
summary(model_step)
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare, family = binomial,
#> data = data_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.128579 0.483683 10.603 < 0.0000000000000002 ***
#> Pclass2 -1.453051 0.313865 -4.630 0.00000366477812 ***
#> Pclass3 -2.651095 0.314373 -8.433 < 0.0000000000000002 ***
#> Sexmale -2.811730 0.201343 -13.965 < 0.0000000000000002 ***
#> Age -0.055130 0.007881 -6.995 0.00000000000264 ***
#> SibSp -0.483100 0.106779 -4.524 0.00000606023042 ***
#> Fare 0.003850 0.002744 1.403 0.161
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1217.2 on 877 degrees of freedom
#> Residual deviance: 751.3 on 871 degrees of freedom
#> AIC: 765.3
#>
#> Number of Fisher Scoring iterations: 5
Pay attention to the Fare predictor: it has a p-value (Pr(>|z|)) of 0.161, so it has low significance to the model. Why did the stepwise procedure decide to keep this predictor? Because a predictor with low significance can still give a better model if we keep it; in simple words, with Fare as a predictor the model has a lower error:
glm(Survived ~ Pclass + Sex + Age + SibSp, data = data_train, family = binomial)
#>
#> Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = binomial,
#> data = data_train)
#>
#> Coefficients:
#> (Intercept) Pclass2 Pclass3 Sexmale Age SibSp
#> 5.42882 -1.65624 -2.89157 -2.83357 -0.05574 -0.45178
#>
#> Degrees of Freedom: 877 Total (i.e. Null); 872 Residual
#> Null Deviance: 1217
#> Residual Deviance: 753.9 AIC: 765.9
As we can see, without Fare as a predictor the model has a slightly higher error (residual deviance 753.9 vs 751.3) and a slightly higher AIC (765.9 vs 765.3).
We can summarize all 4 models:
- First model: null deviance = 1217.2, residual deviance = 746.3 and AIC = 768.3
- Second model: null deviance = 1217.2, residual deviance = 809.2 and AIC = 823.2
- Third model: null deviance = 1217.2, residual deviance = 1070.6 and AIC = 1080.6
- Fourth model: null deviance = 1217.2, residual deviance = 751.3 and AIC = 765.3
Our fourth model has the lowest AIC of all the models, and its residual deviance is nearly as low as the first model's while using fewer predictors. AIC is different from deviance: while the null and residual deviance tell us about error on the training data, AIC tells us how well a model fits while penalizing its number of parameters, so it is the measure we use to compare the candidate models and pick the best one.
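We can also ask R for these AICs directly as a cross-check:
# Compare all four models by AIC in one call; lower is better
AIC(first_model, second_model, third_model, model_step)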
6 MODEL EVALUATION
Even though we already suspect our fourth model has the best performance, we still need to do model evaluation. The deviance numbers only describe how each model fits the training data; we also need to know whether a model performs well when predicting new data (data_test). To evaluate, we predict whether each passenger survived using the predictors in data_test; since data_test already has a Survived column holding each passenger's actual outcome, we can match the predictions against the actual outcomes to see how many false predictions each model makes.
6.1 First Model
Evaluation of first model
data_test$pred_survive <- predict(object = first_model,
newdata = data_test,
type = "response")
Generally, there’s two kind of type that widely use, they are link and response. Link argument will return the log of odds value, while response will return probability. Link is useful when we want to made further evaluation to the model, while response is useful if we want to know the result immediately. For now we will use response instead.
data_test$pred_label <- ifelse(data_test$pred_survive>0.5, 1, 0)
We use 0.5 as the model threshold, meaning any prediction with a probability greater than 0.5 is classified as survived.
Next, we will view the result using the table() function.
table(actual = data_test$Survived,
predict = data_test$pred_label)
#> predict
#> actual 0 1
#> 0 86 24
#> 1 28 82
As we can see, our first model makes some false predictions, both for survived and for not-survived passengers. For survived passengers, the model makes true predictions for 82 passengers and false predictions for 28 (the passenger actually survived, but the model predicted they did not); for not-survived passengers, the model makes true predictions for 86 passengers and false predictions for 24. To get better insight into this, we will use the confusionMatrix() function.
confusionMatrix(data = as.factor(data_test$pred_label),
reference = data_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 86 28
#> 1 24 82
#>
#> Accuracy : 0.7636
#> 95% CI : (0.7019, 0.8181)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.0000000000000009563
#>
#> Kappa : 0.5273
#>
#> Mcnemar's Test P-Value : 0.6774
#>
#> Sensitivity : 0.7455
#> Specificity : 0.7818
#> Pos Pred Value : 0.7736
#> Neg Pred Value : 0.7544
#> Prevalence : 0.5000
#> Detection Rate : 0.3727
#> Detection Prevalence : 0.4818
#> Balanced Accuracy : 0.7636
#>
#> 'Positive' Class : 1
#>
We choose 1 (survived) as the positive class because it is important for a rescue team to know how many people survived the incident, so they can prepare everything before taking action.
The metrics we must pay attention to are Accuracy, Sensitivity/Recall and Pos Pred Value/Precision. Accuracy tells us how often the model predicts correctly (survived predicted as survived and not survived predicted as not survived) out of all the data (formula: (TP+TN)/TOTAL). Sensitivity/Recall represents how many of the actually positive cases (survived) the model predicts as positive (formula: TP/(TP+FN)). And Pos Pred Value/Precision tells us how many of the model's positive predictions are truly positive (formula: TP/(TP+FP)). We can verify these against the confusion matrix above, as the snippet below does.
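# Headline metrics recomputed by hand from the first model's confusion matrix
TP <- 82; TN <- 86; FP <- 24; FN <- 28
c(accuracy    = (TP + TN) / (TP + TN + FP + FN), # 0.7636
  sensitivity = TP / (TP + FN),                  # 0.7455
  precision   = TP / (TP + FP))                  # 0.7736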
This model has decent accuracy, around 76%. Another important metric is sensitivity (recall); from this model I get about 75%. As I mentioned earlier, it is better for the model to predict 1 too often than 0, so that every survivor would be evacuated even if there were actually fewer survivors than predicted.
We will do the same for the remaining models.
6.2 Second Model
data_test$pred_survive <- predict(object = second_model,
newdata = data_test,
type = "response")
data_test$pred_label <- ifelse(data_test$pred_survive>0.5, 1, 0)
table(actual = data_test$Survived,
predict = data_test$pred_label)
#> predict
#> actual 0 1
#> 0 89 21
#> 1 35 75
confusionMatrix(data = as.factor(data_test$pred_label),
reference = data_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 89 35
#> 1 21 75
#>
#> Accuracy : 0.7455
#> 95% CI : (0.6825, 0.8016)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.00000000000008726
#>
#> Kappa : 0.4909
#>
#> Mcnemar's Test P-Value : 0.08235
#>
#> Sensitivity : 0.6818
#> Specificity : 0.8091
#> Pos Pred Value : 0.7812
#> Neg Pred Value : 0.7177
#> Prevalence : 0.5000
#> Detection Rate : 0.3409
#> Detection Prevalence : 0.4364
#> Balanced Accuracy : 0.7455
#>
#> 'Positive' Class : 1
#>
6.3 Third Model
data_test$pred_survive <- predict(object = third_model,
newdata = data_test,
type = "response")
data_test$pred_label <- ifelse(data_test$pred_survive>0.5, 1, 0)
table(actual = data_test$Survived,
predict = data_test$pred_label)
#> predict
#> actual 0 1
#> 0 78 32
#> 1 44 66
confusionMatrix(data = as.factor(data_test$pred_label),
reference = data_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 78 44
#> 1 32 66
#>
#> Accuracy : 0.6545
#> 95% CI : (0.5877, 0.7172)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.000002663
#>
#> Kappa : 0.3091
#>
#> Mcnemar's Test P-Value : 0.207
#>
#> Sensitivity : 0.6000
#> Specificity : 0.7091
#> Pos Pred Value : 0.6735
#> Neg Pred Value : 0.6393
#> Prevalence : 0.5000
#> Detection Rate : 0.3000
#> Detection Prevalence : 0.4455
#> Balanced Accuracy : 0.6545
#>
#> 'Positive' Class : 1
#>
6.4 Fourth Model
data_test$pred_survive <- predict(object = model_step,
newdata = data_test,
type = "response")
data_test$pred_label <- ifelse(data_test$pred_survive>0.5, 1, 0)
table(actual = data_test$Survived,
predict = data_test$pred_label)
#> predict
#> actual 0 1
#> 0 84 26
#> 1 25 85
confusionMatrix(data = as.factor(data_test$pred_label),
reference = data_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 84 25
#> 1 26 85
#>
#> Accuracy : 0.7682
#> 95% CI : (0.7067, 0.8223)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.0000000000000002911
#>
#> Kappa : 0.5364
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.7727
#> Specificity : 0.7636
#> Pos Pred Value : 0.7658
#> Neg Pred Value : 0.7706
#> Prevalence : 0.5000
#> Detection Rate : 0.3864
#> Detection Prevalence : 0.5045
#> Balanced Accuracy : 0.7682
#>
#> 'Positive' Class : 1
#>
Using data_test we can measure each model's performance in predicting new data. The most important metrics to note here are accuracy and sensitivity. Why sensitivity? Remember that its formula is TP/(TP+FN); the FN in the denominator counts passengers who actually survived but were predicted as not survived, so the higher the sensitivity, the better the model for our purpose. Of all four models, our fourth model seems the best: it has the highest accuracy (0.7682) and sensitivity (0.7727) among all the models.
7 MODEL TUNING
Simply put, model tuning is how we try to increase a model's performance. In this case we can adjust the classification threshold, so I'll try lowering it to increase the sensitivity.
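Before committing to a cutoff, here is a small sketch that sweeps several thresholds using the fourth model's probabilities already stored in data_test$pred_survive:
# Sweep a few cutoffs to see the accuracy/sensitivity trade-off
thresholds <- seq(0.3, 0.7, by = 0.1)
metrics <- sapply(thresholds, function(t) {
  pred <- factor(ifelse(data_test$pred_survive > t, 1, 0), levels = c(0, 1))
  cm   <- confusionMatrix(data = pred, reference = data_test$Survived,
                          positive = "1")
  c(accuracy    = unname(cm$overall["Accuracy"]),
    sensitivity = unname(cm$byClass["Sensitivity"]))
})
colnames(metrics) <- thresholds
round(metrics, 4)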
data_test$pred_label <- ifelse(data_test$pred_survive>0.4, 1, 0)
confusionMatrix(data = as.factor(data_test$pred_label),
reference = data_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 77 21
#> 1 33 89
#>
#> Accuracy : 0.7545
#> 95% CI : (0.6922, 0.8099)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.000000000000009586
#>
#> Kappa : 0.5091
#>
#> Mcnemar's Test P-Value : 0.1344
#>
#> Sensitivity : 0.8091
#> Specificity : 0.7000
#> Pos Pred Value : 0.7295
#> Neg Pred Value : 0.7857
#> Prevalence : 0.5000
#> Detection Rate : 0.4045
#> Detection Prevalence : 0.5545
#> Balanced Accuracy : 0.7545
#>
#> 'Positive' Class : 1
#>
As we can see, using 0.4 as the new threshold increases our sensitivity (to 0.8091), but we pay a penalty in the form of reduced accuracy. Let's try lowering the threshold again.
data_test$pred_label <- ifelse(data_test$pred_survive>0.3, 1, 0)
confusionMatrix(data = as.factor(data_test$pred_label),
reference = data_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 69 15
#> 1 41 95
#>
#> Accuracy : 0.7455
#> 95% CI : (0.6825, 0.8016)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.00000000000008726
#>
#> Kappa : 0.4909
#>
#> Mcnemar's Test P-Value : 0.0008355
#>
#> Sensitivity : 0.8636
#> Specificity : 0.6273
#> Pos Pred Value : 0.6985
#> Neg Pred Value : 0.8214
#> Prevalence : 0.5000
#> Detection Rate : 0.4318
#> Detection Prevalence : 0.6182
#> Balanced Accuracy : 0.7455
#>
#> 'Positive' Class : 1
#>
I think 86% sensitivity is good enough for the model, and I don't want the accuracy to drop any lower. So we will fix the threshold at 0.3.
8 PREDICTING
titanic_test$pred_survive <- predict(object = model_step,
newdata = titanic_test_full,
type = "response")
titanic_test$pred_label <- ifelse(titanic_test$pred_survive > 0.3, 1, 0)
table(titanic_test$pred_label)
#>
#> 0 1
#> 174 244
Using our fourth model with this threshold, we predict that 244 of the passengers in the test set survived the Titanic tragedy.
9 K-NEAREST NEIGHBOR
Another way to do classification with machine learning is K-Nearest Neighbors (K-NN). Unlike the previous model (GLM), KNN has no explicit learning process; it makes predictions directly. KNN predicts the class of a test point by calculating the distance between the test point and all the training points, then taking a majority vote among the K training points closest to it. To learn more about K-NN, you can visit this link
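The core idea in miniature, a plain-R sketch for a single test point (illustrative only; we will use class::knn() for the real prediction below):
# K-NN for one test point: Euclidean distance to every training row,
# then a majority vote among the k nearest labels
knn_one <- function(train_x, train_y, test_point, k = 5) {
  dists   <- sqrt(rowSums(sweep(as.matrix(train_x), 2, test_point)^2))
  nearest <- order(dists)[1:k]
  names(which.max(table(train_y[nearest])))
}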
But first, since my test data doesn't have the target variable, I'll use my data_train and data_test splits to measure the model's accuracy and sensitivity.
9.1 Scaling
If our features are not on comparable scales, we need to rescale the data before doing the K-NN prediction, using the scale() function; otherwise the largest-scaled feature dominates the distance calculation and can lead to poor results.
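For intuition, scale() standardizes each column by subtracting its mean and dividing by its standard deviation:
# What scale() does to a single column: z = (x - mean(x)) / sd(x)
x <- c(10, 20, 30)
(x - mean(x)) / sd(x) # -1, 0, 1
as.numeric(scale(x))  # identical result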
First, I need to separate the target from the predictors and select only the numeric columns (we can only apply scale() to numeric columns).
# Separating predictor and store it in new variable
data_train_x <- data_train %>% select_if(is.numeric)
data_test_x <- data_test %>% select_if(is.numeric) %>% select(-c(pred_survive, pred_label))
# Separating target variable and store it in new variable
data_train_y <- data_train$Survived
data_test_y <- data_test$Survived
summary(data_train_x)
#> Age SibSp Parch Fare
#> Min. : 0.42 Min. :0.0000 Min. :0.0000 Min. : 0.000
#> 1st Qu.:20.12 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.925
#> Median :28.00 Median :0.0000 Median :0.0000 Median : 15.023
#> Mean :29.32 Mean :0.5308 Mean :0.3929 Mean : 34.855
#> 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 34.375
#> Max. :80.00 Max. :8.0000 Max. :6.0000 Max. :512.329
From the summary above we can see that our features are on very different scales, and the same goes for our test data.
#Scaling our train data
train_x_scale <- scale(data_train_x)
#Using data train scale and center to scaling data test
test_x_scale <- scale(x = data_test_x,
center = attr(train_x_scale, "scaled:center"),
scale = attr(train_x_scale, "scaled:scale"))
summary(test_x_scale)
#> Age SibSp Parch Fare
#> Min. :-1.97318 Min. :-0.48583 Min. :-0.480114 Min. :-0.62724
#> 1st Qu.:-0.70461 1st Qu.:-0.48583 1st Qu.:-0.480114 1st Qu.:-0.48238
#> Median :-0.09012 Median :-0.48583 Median :-0.480114 Median :-0.26958
#> Mean : 0.04995 Mean :-0.08224 Mean : 0.003074 Mean : 0.01591
#> 3rd Qu.: 0.72919 3rd Qu.: 0.42953 3rd Qu.: 0.741741 3rd Qu.:-0.06814
#> Max. : 3.46025 Max. : 3.17563 Max. : 4.407304 Max. : 4.10561
Now our train and test data are on comparable scales.
9.2 Model Building
We need to determine the value of K. We cannot simply choose a very small K, because the model would overfit to noise, while a very large K oversmooths toward the majority class. A common starting point for K is the square root of the number of rows in the training data.
# Find optimum K value
sqrt(nrow(data_train))
#> [1] 29.63106
Since my target variable has only 2 classes, I'll use an odd K (29) so the vote cannot tie.
titanic_pred <- knn(train = train_x_scale,
test = test_x_scale,
cl = data_train_y,
k= 29)
class(titanic_pred)
#> [1] "factor"
9.3 Model Evaluation
confusionMatrix(data = titanic_pred,
reference = data_test_y,
positive= "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 73 41
#> 1 37 69
#>
#> Accuracy : 0.6455
#> 95% CI : (0.5783, 0.7086)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.000009526
#>
#> Kappa : 0.2909
#>
#> Mcnemar's Test P-Value : 0.7341
#>
#> Sensitivity : 0.6273
#> Specificity : 0.6636
#> Pos Pred Value : 0.6509
#> Neg Pred Value : 0.6404
#> Prevalence : 0.5000
#> Detection Rate : 0.3136
#> Detection Prevalence : 0.4818
#> Balanced Accuracy : 0.6455
#>
#> 'Positive' Class : 1
#>
SUMMARY:
In this analysis I used 2 machine learning methods to predict which passengers survived. First I used glm (logistic regression) to make predictions; of the four models, the fourth had the best performance in predicting the target variable. Using 0.3 as the model threshold I get about 74.5% accuracy and 86.4% sensitivity on the validation split, and of all the passengers in the test set, 244 are predicted to survive.
The second method I used is K-NN, but it performed worse than the GLM (about 64.5% accuracy and 62.7% sensitivity), so I did not use it.