Titanic Survival Classification v1 (Logistic Regression & K-NN)
A Brief History of the Titanic Disaster
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On Sunday, April 14, 1912, during her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg and sank in the early hours of the following morning. The Titanic’s distress signals were heard by a nearby ship. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the deaths of 1502 of the 2224 passengers and crew.
Federal law soon required all large ocean-going vessels to be equipped with wireless for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.
Purpose of the project :
- Understand the relationship between Survived and the other variables based on historical data.
- Learn to use Logistic Regression & K-NN to predict Survived based on the data set.
Explanation of the “Titanic” data :
- PassengerId : Row ID in the data set
- Survived : If a passenger survived or not (0 = No, 1 = Yes)
- Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name : Passenger’s name
- Sex : Passenger’s gender (male or female)
- Age : Passenger’s age (in years)
- SibSp : # of siblings / spouses aboard the Titanic
- Parch : # of parents / children aboard the Titanic
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Preparation
Load the libraries.
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(rmarkdown)
library(reshape)
library(lmtest)
library(dplyr)
library(rsample)
Load the dataset.
# Load data
titanic <- read.csv("dataInputs/train.csv")
# Show data as table
paged_table(titanic)
Check the structure of the new data frame.
# Check structure
titanic %>% glimpse()
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Check the unique values of the Ticket and Cabin predictors.
unique(titanic$Ticket)
## [1] "A/5 21171" "PC 17599" "STON/O2. 3101282"
## [4] "113803" "373450" "330877"
## [7] "17463" "349909" "347742"
## [10] "237736" "PP 9549" "113783"
## [13] "A/5. 2151" "347082" "350406"
## [16] "248706" "382652" "244373"
## [19] "345763" "2649" "239865"
## [22] "248698" "330923" "113788"
## [25] "347077" "2631" "19950"
## [28] "330959" "349216" "PC 17601"
## [31] "PC 17569" "335677" "C.A. 24579"
## [34] "PC 17604" "113789" "2677"
## [37] "A./5. 2152" "345764" "2651"
## [40] "7546" "11668" "349253"
## [43] "SC/Paris 2123" "330958" "S.C./A.4. 23567"
## [46] "370371" "14311" "2662"
## [49] "349237" "3101295" "A/4. 39886"
## [52] "PC 17572" "2926" "113509"
## [55] "19947" "C.A. 31026" "2697"
## [58] "C.A. 34651" "CA 2144" "2669"
## [61] "113572" "36973" "347088"
## [64] "PC 17605" "2661" "C.A. 29395"
## [67] "S.P. 3464" "3101281" "315151"
## [70] "C.A. 33111" "S.O.C. 14879" "2680"
## [73] "1601" "348123" "349208"
## [76] "374746" "248738" "364516"
## [79] "345767" "345779" "330932"
## [82] "113059" "SO/C 14885" "3101278"
## [85] "W./C. 6608" "SOTON/OQ 392086" "343275"
## [88] "343276" "347466" "W.E.P. 5734"
## [91] "C.A. 2315" "364500" "374910"
## [94] "PC 17754" "PC 17759" "231919"
## [97] "244367" "349245" "349215"
## [100] "35281" "7540" "3101276"
## [103] "349207" "343120" "312991"
## [106] "349249" "371110" "110465"
## [109] "2665" "324669" "4136"
## [112] "2627" "STON/O 2. 3101294" "370369"
## [115] "PC 17558" "A4. 54510" "27267"
## [118] "370372" "C 17369" "2668"
## [121] "347061" "349241" "SOTON/O.Q. 3101307"
## [124] "A/5. 3337" "228414" "C.A. 29178"
## [127] "SC/PARIS 2133" "11752" "7534"
## [130] "PC 17593" "2678" "347081"
## [133] "STON/O2. 3101279" "365222" "231945"
## [136] "C.A. 33112" "350043" "230080"
## [139] "244310" "S.O.P. 1166" "113776"
## [142] "A.5. 11206" "A/5. 851" "Fa 265302"
## [145] "PC 17597" "35851" "SOTON/OQ 392090"
## [148] "315037" "CA. 2343" "371362"
## [151] "C.A. 33595" "347068" "315093"
## [154] "363291" "113505" "PC 17318"
## [157] "111240" "STON/O 2. 3101280" "17764"
## [160] "350404" "4133" "PC 17595"
## [163] "250653" "LINE" "SC/PARIS 2131"
## [166] "230136" "315153" "113767"
## [169] "370365" "111428" "364849"
## [172] "349247" "234604" "28424"
## [175] "350046" "PC 17610" "368703"
## [178] "4579" "370370" "248747"
## [181] "345770" "3101264" "2628"
## [184] "A/5 3540" "347054" "2699"
## [187] "367231" "112277" "SOTON/O.Q. 3101311"
## [190] "F.C.C. 13528" "A/5 21174" "250646"
## [193] "367229" "35273" "STON/O2. 3101283"
## [196] "243847" "11813" "W/C 14208"
## [199] "SOTON/OQ 392089" "220367" "21440"
## [202] "349234" "19943" "PP 4348"
## [205] "SW/PP 751" "A/5 21173" "236171"
## [208] "347067" "237442" "C.A. 29566"
## [211] "W./C. 6609" "26707" "C.A. 31921"
## [214] "28665" "SCO/W 1585" "367230"
## [217] "W./C. 14263" "STON/O 2. 3101275" "2694"
## [220] "19928" "347071" "250649"
## [223] "11751" "244252" "362316"
## [226] "113514" "A/5. 3336" "370129"
## [229] "2650" "PC 17585" "110152"
## [232] "PC 17755" "230433" "384461"
## [235] "110413" "112059" "382649"
## [238] "C.A. 17248" "347083" "PC 17582"
## [241] "PC 17760" "113798" "250644"
## [244] "PC 17596" "370375" "13502"
## [247] "347073" "239853" "C.A. 2673"
## [250] "336439" "347464" "345778"
## [253] "A/5. 10482" "113056" "349239"
## [256] "345774" "349206" "237798"
## [259] "370373" "19877" "11967"
## [262] "SC/Paris 2163" "349236" "349233"
## [265] "PC 17612" "2693" "113781"
## [268] "19988" "9234" "367226"
## [271] "226593" "A/5 2466" "17421"
## [274] "PC 17758" "P/PP 3381" "PC 17485"
## [277] "11767" "PC 17608" "250651"
## [280] "349243" "F.C.C. 13529" "347470"
## [283] "29011" "36928" "16966"
## [286] "A/5 21172" "349219" "234818"
## [289] "345364" "28551" "111361"
## [292] "113043" "PC 17611" "349225"
## [295] "7598" "113784" "248740"
## [298] "244361" "229236" "248733"
## [301] "31418" "386525" "C.A. 37671"
## [304] "315088" "7267" "113510"
## [307] "2695" "2647" "345783"
## [310] "237671" "330931" "330980"
## [313] "SC/PARIS 2167" "2691" "SOTON/O.Q. 3101310"
## [316] "C 7076" "110813" "2626"
## [319] "14313" "PC 17477" "11765"
## [322] "3101267" "323951" "C 7077"
## [325] "113503" "2648" "347069"
## [328] "PC 17757" "2653" "STON/O 2. 3101293"
## [331] "349227" "27849" "367655"
## [334] "SC 1748" "113760" "350034"
## [337] "3101277" "350052" "350407"
## [340] "28403" "244278" "240929"
## [343] "STON/O 2. 3101289" "341826" "4137"
## [346] "315096" "28664" "347064"
## [349] "29106" "312992" "349222"
## [352] "394140" "STON/O 2. 3101269" "343095"
## [355] "28220" "250652" "28228"
## [358] "345773" "349254" "A/5. 13032"
## [361] "315082" "347080" "A/4. 34244"
## [364] "2003" "250655" "364851"
## [367] "SOTON/O.Q. 392078" "110564" "376564"
## [370] "SC/AH 3085" "STON/O 2. 3101274" "13507"
## [373] "C.A. 18723" "345769" "347076"
## [376] "230434" "65306" "33638"
## [379] "113794" "2666" "113786"
## [382] "65303" "113051" "17453"
## [385] "A/5 2817" "349240" "13509"
## [388] "17464" "F.C.C. 13531" "371060"
## [391] "19952" "364506" "111320"
## [394] "234360" "A/S 2816" "SOTON/O.Q. 3101306"
## [397] "113792" "36209" "323592"
## [400] "315089" "SC/AH Basle 541" "7553"
## [403] "31027" "3460" "350060"
## [406] "3101298" "239854" "A/5 3594"
## [409] "4134" "11771" "A.5. 18509"
## [412] "65304" "SOTON/OQ 3101317" "113787"
## [415] "PC 17609" "A/4 45380" "36947"
## [418] "C.A. 6212" "350035" "315086"
## [421] "364846" "330909" "4135"
## [424] "26360" "111427" "C 4001"
## [427] "382651" "SOTON/OQ 3101316" "PC 17473"
## [430] "PC 17603" "349209" "36967"
## [433] "C.A. 34260" "226875" "349242"
## [436] "12749" "349252" "2624"
## [439] "2700" "367232" "W./C. 14258"
## [442] "PC 17483" "3101296" "29104"
## [445] "2641" "2690" "315084"
## [448] "113050" "PC 17761" "364498"
## [451] "13568" "WE/P 5735" "2908"
## [454] "693" "SC/PARIS 2146" "244358"
## [457] "330979" "2620" "347085"
## [460] "113807" "11755" "345572"
## [463] "372622" "349251" "218629"
## [466] "SOTON/OQ 392082" "SOTON/O.Q. 392087" "A/4 48871"
## [469] "349205" "2686" "350417"
## [472] "S.W./PP 752" "11769" "PC 17474"
## [475] "14312" "A/4. 20589" "358585"
## [478] "243880" "2689" "STON/O 2. 3101286"
## [481] "237789" "13049" "3411"
## [484] "237565" "13567" "14973"
## [487] "A./5. 3235" "STON/O 2. 3101273" "A/5 3902"
## [490] "364848" "SC/AH 29037" "248727"
## [493] "2664" "349214" "113796"
## [496] "364511" "111426" "349910"
## [499] "349246" "113804" "SOTON/O.Q. 3101305"
## [502] "370377" "364512" "220845"
## [505] "31028" "2659" "11753"
## [508] "350029" "54636" "36963"
## [511] "219533" "349224" "334912"
## [514] "27042" "347743" "13214"
## [517] "112052" "237668" "STON/O 2. 3101292"
## [520] "350050" "349231" "13213"
## [523] "S.O./P.P. 751" "CA. 2314" "349221"
## [526] "8475" "330919" "365226"
## [529] "349223" "29751" "2623"
## [532] "5727" "349210" "STON/O 2. 3101285"
## [535] "234686" "312993" "A/5 3536"
## [538] "19996" "29750" "F.C. 12750"
## [541] "C.A. 24580" "244270" "239856"
## [544] "349912" "342826" "4138"
## [547] "330935" "6563" "349228"
## [550] "350036" "24160" "17474"
## [553] "349256" "2672" "113800"
## [556] "248731" "363592" "35852"
## [559] "348121" "PC 17475" "36864"
## [562] "350025" "223596" "PC 17476"
## [565] "PC 17482" "113028" "7545"
## [568] "250647" "348124" "34218"
## [571] "36568" "347062" "350048"
## [574] "12233" "250643" "113806"
## [577] "315094" "36866" "236853"
## [580] "STON/O2. 3101271" "239855" "28425"
## [583] "233639" "349201" "349218"
## [586] "16988" "376566" "STON/O 2. 3101288"
## [589] "250648" "113773" "335097"
## [592] "29103" "392096" "345780"
## [595] "349204" "350042" "29108"
## [598] "363294" "SOTON/O2 3101272" "2663"
## [601] "347074" "112379" "364850"
## [604] "8471" "345781" "350047"
## [607] "S.O./P.P. 3" "2674" "29105"
## [610] "347078" "383121" "36865"
## [613] "2687" "113501" "W./C. 6607"
## [616] "SOTON/O.Q. 3101312" "374887" "3101265"
## [619] "12460" "PC 17600" "349203"
## [622] "28213" "17465" "349244"
## [625] "2685" "2625" "347089"
## [628] "347063" "112050" "347087"
## [631] "248723" "3474" "28206"
## [634] "364499" "112058" "STON/O2. 3101290"
## [637] "S.C./PARIS 2079" "C 7075" "315098"
## [640] "19972" "368323" "367228"
## [643] "2671" "347468" "2223"
## [646] "PC 17756" "315097" "392092"
## [649] "11774" "SOTON/O2 3101287" "2683"
## [652] "315090" "C.A. 5547" "349213"
## [655] "347060" "PC 17592" "392091"
## [658] "113055" "2629" "350026"
## [661] "28134" "17466" "233866"
## [664] "236852" "SC/PARIS 2149" "PC 17590"
## [667] "345777" "349248" "695"
## [670] "345765" "2667" "349212"
## [673] "349217" "349257" "7552"
## [676] "C.A./SOTON 34068" "SOTON/OQ 392076" "211536"
## [679] "112053" "111369" "370376"
unique(titanic$Cabin)
## [1] "" "C85" "C123" "E46"
## [5] "G6" "C103" "D56" "A6"
## [9] "C23 C25 C27" "B78" "D33" "B30"
## [13] "C52" "B28" "C83" "F33"
## [17] "F G73" "E31" "A5" "D10 D12"
## [21] "D26" "C110" "B58 B60" "E101"
## [25] "F E69" "D47" "B86" "F2"
## [29] "C2" "E33" "B19" "A7"
## [33] "C49" "F4" "A32" "B4"
## [37] "B80" "A31" "D36" "D15"
## [41] "C93" "C78" "D35" "C87"
## [45] "B77" "E67" "B94" "C125"
## [49] "C99" "C118" "D7" "A19"
## [53] "B49" "D" "C22 C26" "C106"
## [57] "C65" "E36" "C54" "B57 B59 B63 B66"
## [61] "C7" "E34" "C32" "B18"
## [65] "C124" "C91" "E40" "T"
## [69] "C128" "D37" "B35" "E50"
## [73] "C82" "B96 B98" "E10" "E44"
## [77] "A34" "C104" "C111" "C92"
## [81] "E38" "D21" "E12" "E63"
## [85] "A14" "B37" "C30" "D20"
## [89] "B79" "E25" "D46" "B73"
## [93] "C95" "B38" "B39" "B22"
## [97] "C86" "C70" "A16" "C101"
## [101] "C68" "A10" "E68" "B41"
## [105] "A20" "D19" "D50" "D9"
## [109] "A23" "B50" "A26" "D48"
## [113] "E58" "C126" "B71" "B51 B53 B55"
## [117] "D49" "B5" "B20" "F G63"
## [121] "C62 C64" "E24" "C90" "C45"
## [125] "E8" "B101" "D45" "C46"
## [129] "D30" "E121" "D11" "E77"
## [133] "F38" "B3" "D6" "B82 B84"
## [137] "D17" "A36" "B102" "B69"
## [141] "E49" "C47" "D28" "E17"
## [145] "A24" "C50" "B42" "C148"
💡 Insight :
- PassengerId and Name aren’t usable variables for the prediction model. Therefore, they will be removed.
- Ticket and Cabin have far too many unique values to be usable as predictors, so they will be removed as well.
- Survived is the target of our prediction.
- Survived, Pclass, Sex, SibSp, Parch and Embarked should be converted to categorical (factor) type (note that the code below leaves SibSp numeric).
titanic <- titanic %>% 
  select(-c(PassengerId,
            Name,
            Ticket,
            Cabin)) %>% 
  mutate_at(vars(Survived,
                 Pclass,
                 Sex,
                 Parch,
                 Embarked), as.factor)
Check NA values in our data frame
# Check proportion of missing data
table(is.na(titanic))
##
## FALSE TRUE
## 6951 177
titanic <- titanic %>% na.omit()
titanic %>% is.na() %>% colSums()
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
The proportion of missing values (NA) in the data is only about 2.5% of all cells (177 of 7,128 in the table above). Therefore, the affected rows can be deleted.
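As a quick sanity check (a minimal sketch, assuming it is run before the na.omit() call above), the share of missing cells can be computed directly:

# Share of missing cells: 177 / (6951 + 177) ≈ 0.0248
prop.table(table(is.na(titanic)))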
Re-check missing value using Reshape
# Check missing value using "reshape" library
missing_data <- melt(apply(titanic[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
## [,1] [,2]
## [1,] "Embarked" "2"
Update the missing embarkation port with the most common value
titanic$Embarked[which(is.na(titanic$Embarked) | titanic$Embarked=="")] <- 'S'
Exploratory Data Analysis
Take a look at the data summary.
titanic %>% summary()
## Survived Pclass Sex Age SibSp Parch
## 0:424 1:186 female:261 Min. : 0.42 Min. :0.0000 0:521
## 1:290 2:173 male :453 1st Qu.:20.12 1st Qu.:0.0000 1:110
## 3:355 Median :28.00 Median :0.0000 2: 68
## Mean :29.70 Mean :0.5126 3: 5
## 3rd Qu.:38.00 3rd Qu.:1.0000 4: 4
## Max. :80.00 Max. :5.0000 5: 5
## 6: 1
## Fare Embarked
## Min. : 0.00 : 0
## 1st Qu.: 8.05 C:130
## Median : 15.74 Q: 28
## Mean : 34.69 S:556
## 3rd Qu.: 33.38
## Max. :512.33
##
💡 Insight :
- 424 passengers died during the tragedy and only 290 survived
- There are 453 male and 261 female passengers
- Age and Fare seem to have outliers
Check class imbalance
prop.table(table(titanic$Survived))
##
## 0 1
## 0.5938375 0.4061625
Based on the proportions above, the target variable class (Survived) is balanced enough that we do not need additional data pre-processing to balance the classes.
Train Test Split
Before we build a model, we need to split the data into training and test sets. This is a crucial step in the machine learning process, as it allows us to evaluate the performance of our models and make informed decisions about how to improve them. We will use 80% of the data for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic,
                       prop = 0.8,
                       strata = "Survived")

titanic_train <- training(index)
titanic_test <- testing(index)
Create Model
We will create two types of models (Generalized Linear Model and K Nearest Neighbor) to predict whether a passenger survived or not. Each model will be developed in several steps:
- Create a model
- Make predictions with the model
- Evaluate it using a Confusion Matrix
- Tune it (if necessary)
Logistic Regression
Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. It is a type of generalized linear model and is commonly used in the field of machine learning and statistics.
The binary dependent variable in logistic regression can take on one of two possible outcomes, typically represented as 0 and 1. The independent variables, also known as predictor variables or features, can be continuous, categorical, or a combination of both.
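As a small illustration (not part of the original analysis), the model combines the predictors into a log-odds value, and the logistic (sigmoid) function maps that value into a probability between 0 and 1:

# The logistic (sigmoid) function maps any log-odds value into (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))

sigmoid(0)    # log-odds 0 -> probability 0.5
sigmoid(2.5)  # positive log-odds -> probability ~0.92
sigmoid(-2.5) # negative log-odds -> probability ~0.08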
First, we will create a model using all variables.
# Create a basic model
titanic_model_all <- glm(Survived ~ .,
                         data = titanic_train,
                         family = "binomial")

summary(titanic_model_all)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7613 -0.6205 -0.3580 0.6072 2.4531
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.209e+00 6.240e-01 6.746 1.52e-11 ***
## Pclass2 -1.305e+00 3.759e-01 -3.473 0.000515 ***
## Pclass3 -2.537e+00 3.968e-01 -6.393 1.63e-10 ***
## Sexmale -2.709e+00 2.558e-01 -10.591 < 2e-16 ***
## Age -3.790e-02 9.532e-03 -3.976 7.00e-05 ***
## SibSp -4.120e-01 1.521e-01 -2.710 0.006737 **
## Parch1 4.969e-01 3.380e-01 1.470 0.141469
## Parch2 8.579e-02 4.353e-01 0.197 0.843779
## Parch3 6.735e-01 1.088e+00 0.619 0.536019
## Parch4 -1.409e+01 7.493e+02 -0.019 0.984994
## Parch5 -8.361e-01 1.189e+00 -0.703 0.481947
## Parch6 -1.501e+01 1.455e+03 -0.010 0.991773
## Fare 1.645e-03 2.890e-03 0.569 0.569145
## EmbarkedQ -4.991e-01 7.050e-01 -0.708 0.479023
## EmbarkedS -2.664e-01 3.216e-01 -0.828 0.407473
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 771.40 on 570 degrees of freedom
## Residual deviance: 487.36 on 556 degrees of freedom
## AIC: 517.36
##
## Number of Fisher Scoring iterations: 14
💡 Insight :
- Pclass, Sex, Age and SibSp are significant predictors
- Parch, Fare and Embarked aren’t significant predictors
Next, we will create a new model using backward stepwise selection: step() repeatedly drops the predictor whose removal lowers the AIC the most, stopping when no removal improves the AIC.
# Create a backward model
step(titanic_model_all, direction = "backward")
## Start: AIC=517.36
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Parch 6 492.97 510.97
## - Embarked 2 488.22 514.22
## - Fare 1 487.71 515.71
## <none> 487.36 517.36
## - SibSp 1 495.40 523.40
## - Age 1 504.42 532.42
## - Pclass 2 531.89 557.89
## - Sex 1 628.52 656.52
##
## Step: AIC=510.97
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 2 494.36 508.36
## - Fare 1 493.28 509.28
## <none> 492.97 510.97
## - SibSp 1 500.50 516.50
## - Age 1 516.70 532.70
## - Pclass 2 545.68 559.68
## - Sex 1 644.05 660.05
##
## Step: AIC=508.36
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 494.94 506.94
## <none> 494.36 508.36
## - SibSp 1 502.88 514.88
## - Age 1 519.72 531.72
## - Pclass 2 550.86 560.86
## - Sex 1 647.51 659.51
##
## Step: AIC=506.94
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 494.94 506.94
## - SibSp 1 503.02 513.02
## - Age 1 521.39 531.39
## - Pclass 2 590.39 598.39
## - Sex 1 652.32 662.32
##
## Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Coefficients:
## (Intercept) Pclass2 Pclass3 Sexmale Age SibSp
## 4.49646 -1.51383 -2.84367 -2.74482 -0.04411 -0.37814
##
## Degrees of Freedom: 570 Total (i.e. Null); 565 Residual
## Null Deviance: 771.4
## Residual Deviance: 494.9 AIC: 506.9
💡 Insight :
- The significant predictors are Pclass, Sex, Age and SibSp, the same as in the previous model.
We will create a new model using the significant predictors identified above.
titanic_model_back <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp,
                          family = "binomial",
                          data = titanic_train)

summary(titanic_model_back)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8453 -0.6361 -0.3619 0.6139 2.4819
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.496462 0.511501 8.791 < 2e-16 ***
## Pclass2 -1.513830 0.319908 -4.732 2.22e-06 ***
## Pclass3 -2.843666 0.327033 -8.695 < 2e-16 ***
## Sexmale -2.744822 0.248401 -11.050 < 2e-16 ***
## Age -0.044108 0.009023 -4.888 1.02e-06 ***
## SibSp -0.378139 0.139018 -2.720 0.00653 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 771.40 on 570 degrees of freedom
## Residual deviance: 494.94 on 565 degrees of freedom
## AIC: 506.94
##
## Number of Fisher Scoring iterations: 5
Create a prediction
First, we need to create a prediction from our previous model. We will use titanic_model_back as we have removed all non-significant predictors.
# Create a prediction
glm_predict <- predict(titanic_model_back,
                       titanic_test)
Next, we can check the class type of our prediction result.
# Check the class type
class(glm_predict)
## [1] "numeric"
Last, we need to save the “probability” result into our test data.
# Save the probability value
titanic_test$probability <- predict(titanic_model_back,
                                    titanic_test,
                                    type = "response")

paged_table(titanic_test)
Data distribution of probability values
We can check the distribution of our prediction result using geom_density from the ggplot2 library.
# Create a density plot
ggplot(titanic_test,
       aes(x = probability)) +
  geom_density(lwd = 0.5) +
  theme_minimal()
We can also write some code to compare our predictions against the actual results.
# Create comparison
titanic_test$prediction <- factor(ifelse(titanic_test$probability > 0.5, "1", "0"))

paged_table(titanic_test[1:10, c("prediction", "Survived")])
💡 Insight :
- The probability distribution is skewed toward 0, meaning most passengers are predicted as Not Survived
- Based on the table alone, our model looks quite good at predicting passenger survival, but we still need a proper model evaluation
Model evaluation
glm_evaluation <- confusionMatrix(titanic_test$prediction,
                                  titanic_test$Survived,
                                  positive = "1")

glm_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 15
## 1 17 43
##
## Accuracy : 0.7762
## 95% CI : (0.699, 0.8416)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 3.382e-06
##
## Kappa : 0.5384
##
## Mcnemar's Test P-Value : 0.8597
##
## Sensitivity : 0.7414
## Specificity : 0.8000
## Pos Pred Value : 0.7167
## Neg Pred Value : 0.8193
## Prevalence : 0.4056
## Detection Rate : 0.3007
## Detection Prevalence : 0.4196
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
💡 Insight :
- The accuracy of the model in predicting Not Survived or Survived is 77.62%
- The Sensitivity value is 74.14%
- The Specificity value is 80%
- The Precision value is 71.67%
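These figures follow directly from the confusion matrix counts above; a short verification (illustrative arithmetic only, using the counts as printed):

# Verify the reported metrics from the confusion matrix counts
TP <- 43; TN <- 68; FP <- 17; FN <- 15

(TP + TN) / (TP + TN + FP + FN) # Accuracy    = 111/143 ≈ 0.7762
TP / (TP + FN)                  # Sensitivity = 43/58   ≈ 0.7414
TN / (TN + FP)                  # Specificity = 68/85   = 0.8000
TP / (TP + FP)                  # Precision   = 43/60   ≈ 0.7167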
K Nearest Neighbor
K-Nearest Neighbors (KNN) is a type of supervised machine learning algorithm used for classification and regression problems. It works by finding the K closest data points in the training set to a given query data point, and then using the class (in classification) or the value (in regression) of the majority of these K points to predict the class or value of the query data point.
In other words, KNN predicts the output for a new data point by looking at the K nearest data points in the training set, where K is a user-defined parameter. The algorithm measures the distance between the new data point and each of the training data points using a distance metric such as Euclidean distance or Manhattan distance. The K nearest training data points are then used to predict the output for the new data point based on the majority class or value.
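As a toy illustration of these mechanics (hypothetical data, not part of the analysis), a single query point can be classified by hand with Euclidean distances and a majority vote:

# Toy K-NN by hand: 5 training points, 2 features, k = 3
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    8, 8,
                    9, 8), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "A", "B", "B")
query   <- c(2, 2)

# Euclidean distance from the query to every training point
dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))

# Majority vote among the k = 3 nearest neighbours
nearest <- order(dists)[1:3]
names(which.max(table(train_y[nearest]))) # "A"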
First, we will need to do some data wrangling.
# Convert the factor predictors to numeric and add labels for the target
titanic_knn <- titanic %>% 
  mutate_at(vars(Sex,
                 Embarked,
                 Pclass,
                 Parch), as.numeric)

titanic_knn$Survived <- factor(titanic_knn$Survived,
                               levels = c("0","1"),
                               labels = c("Not Survived", "Survived"))

titanic_knn %>% glimpse()
## Rows: 714
## Columns: 8
## $ Survived <fct> Not Survived, Survived, Survived, Survived, Not Survived, Not…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2…
## $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <dbl> 4, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4…
As we can see, Age and Fare values are on a much larger scale than the other predictors.
Z-Score scaling
The purpose of scaling using Z-score standardization is to transform the original data so that it has a mean of 0 and a standard deviation of 1. This allows for easier comparison and analysis of different variables that may have different scales or units.
By standardizing the data using the Z-score formula, which is calculated as (x - mean)/standard deviation, each data point is transformed into a value representing how many standard deviations it is away from the mean. This makes it possible to compare data points from different variables on the same scale.
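A minimal demonstration (illustrative values only): R's scale() function with default arguments applies exactly this formula column-wise.

# scale() applies (x - mean(x)) / sd(x)
x <- c(10, 20, 30, 40, 50)

as.numeric(scale(x))  # z-scores via scale()
(x - mean(x)) / sd(x) # identical values by the formula

mean(as.numeric(scale(x))) # ~0 after standardization
sd(as.numeric(scale(x)))   # 1 after standardization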
Before scaling, we need to redo the train-test split, as we are now working with the new data frame (titanic_knn).
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic_knn,
                       prop = 0.8,
                       strata = "Survived")

titanic_train_new <- training(index)
titanic_test_new <- testing(index)
Next, we need to scale our predictors.
# Predictors
titanic_train_scale <- scale(titanic_train_new[,-1])
titanic_test_scale <- scale(titanic_test_new[,-1],
                            center = attr(titanic_train_scale, "scaled:center"),
                            scale = attr(titanic_train_scale, "scaled:scale"))
We also need to separate the target variable (the target itself is not scaled).
# Target
titanic_train_target <- titanic_train_new[,1]
titanic_test_target <- titanic_test_new[,1]
Create a prediction
First, we need to find a reasonable K value. A common starting point is the square root of the number of training observations.
# Find optimum K
round(sqrt(nrow(titanic_train_new)),0)
## [1] 24
Since the result is an even number (24), we will use 25 as the k value; an odd k avoids ties in the majority vote for binary classification.
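The square-root rule is only a heuristic. If we wanted, k could also be tuned empirically; a minimal sketch (assuming the scaled matrices and target vectors created above; note that selecting k on the test set is optimistic, so proper tuning would use cross-validation on the training data):

# Optional: compare test accuracy across odd k values
ks <- seq(1, 41, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = titanic_train_scale,
              test = titanic_test_scale,
              cl = titanic_train_target,
              k = k)
  mean(pred == titanic_test_target)
})
ks[which.max(acc)] # k with the highest test accuracy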
Next, we will create predictions using the K-NN method.
knn_predict <- knn(train = titanic_train_scale,
                   test = titanic_test_scale,
                   cl = titanic_train_target,
                   k = 25)
Model evaluation
knn_evaluation <- confusionMatrix(data = knn_predict,
                                  reference = titanic_test_target,
                                  positive = "Survived")

knn_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Survived Survived
## Not Survived 77 22
## Survived 8 36
##
## Accuracy : 0.7902
## 95% CI : (0.7143, 0.8538)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 5.407e-07
##
## Kappa : 0.5476
##
## Mcnemar's Test P-Value : 0.01762
##
## Sensitivity : 0.6207
## Specificity : 0.9059
## Pos Pred Value : 0.8182
## Neg Pred Value : 0.7778
## Prevalence : 0.4056
## Detection Rate : 0.2517
## Detection Prevalence : 0.3077
## Balanced Accuracy : 0.7633
##
## 'Positive' Class : Survived
##
💡 Insight :
- The accuracy of the model in predicting Not Survived or Survived is 79.02%
- The Sensitivity value is only 62.07%
- The Specificity value is 90.59%
- The Precision value is 81.82%
Conclusion
- Logistic regression models the relationship between a binary target and one or more predictors. With the backward-selected predictors (Pclass, Sex, Age and SibSp), it reached 77.62% accuracy on the test set.
- K-Nearest Neighbors classifies a new observation by a majority vote of its K closest training points. With k = 25 on the scaled predictors, it reached 79.02% accuracy.
- For the full evaluation results, please refer to the table below.
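The eval_result object is not constructed in the code shown above; one possible construction (a sketch, assuming the two confusionMatrix objects created earlier) collects the headline metrics of both models into a single table:

# One possible construction of eval_result (not shown in the original)
eval_result <- data.frame(
  Model       = c("Logistic Regression", "K-NN (k = 25)"),
  Accuracy    = c(glm_evaluation$overall["Accuracy"],
                  knn_evaluation$overall["Accuracy"]),
  Sensitivity = c(glm_evaluation$byClass["Sensitivity"],
                  knn_evaluation$byClass["Sensitivity"]),
  Specificity = c(glm_evaluation$byClass["Specificity"],
                  knn_evaluation$byClass["Specificity"]),
  Precision   = c(glm_evaluation$byClass["Pos Pred Value"],
                  knn_evaluation$byClass["Pos Pred Value"])
)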
paged_table(eval_result)