Titanic Survival Classification v1 (Logistic Regression & K-NN)
A Brief History of the Titanic Disaster
The sinking of the Titanic is one of the most infamous shipwrecks in history.
During her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg late on the night of Sunday, April 14, 1912, and sank in the early hours of April 15. Her distress signals were heard by nearby ships, but there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.
Federal law soon required all large ocean-going vessels to be equipped with wireless radio for safety reasons. David Sarnoff later remarked that the Titanic disaster had brought radio to the forefront.
Purpose of the project:
- Understand the relationship between the passenger attributes and Survived, based on historical data.
- Learn to use Logistic Regression & K-NN to predict Survived from the data set.
Explanation of the “Titanic” data:
- PassengerId : Row ID in the data set
- Survived : If a passenger survived or not (0 = No, 1 = Yes)
- Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name : Passenger’s name
- Sex : Passenger’s gender (male or female)
- Age : Passenger’s age (in years)
- SibSp : # of siblings / spouses aboard the Titanic
- Parch : # of parents / children aboard the Titanic
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Preparation
Load the required libraries.
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(rmarkdown)
library(reshape)
library(lmtest)
library(dplyr)
library(rsample)
Load the dataset.
# Load data
titanic <- read.csv("dataInputs/train.csv")
# Show data as table
paged_table(titanic)
Check the structure of the new data frame.
# Check structure
titanic %>% glimpse()
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Check the unique values of the Ticket and Cabin predictors.
unique(titanic$Ticket)
## [1] "A/5 21171"        "PC 17599"         "STON/O2. 3101282"
## [4] "113803"           "373450"           "330877"
## ... (output truncated: 681 unique ticket values in total)
## [679] "112053"          "111369"          "370376"
unique(titanic$Cabin)
## [1] ""     "C85"  "C123" "E46"
## [5] "G6"   "C103" "D56"  "A6"
## ... (output truncated: 148 unique values in total, including the empty string)
## [145] "A24"  "C50"  "B42"  "C148"
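A more compact way to see the same thing is to count the distinct values directly (a small sketch; n_distinct() comes from dplyr, which tidyverse already loads):
# Count distinct values instead of printing them all
c(Ticket = n_distinct(titanic$Ticket), Cabin = n_distinct(titanic$Cabin)) # expected: 681 and 148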
💡 Insight :
- PassengerId and Name aren't usable as predictors, so they will be removed.
- Ticket and Cabin have far too many unique values to be useful predictors, so they will be removed as well.
- Survived is the target of our prediction.
- Survived, Pclass, Sex, Parch and Embarked will be converted to categorical (factor) type.
titanic <- titanic %>%
select(-c(PassengerId,
Name,
Ticket,
Cabin)) %>%
mutate_at(vars(Survived,
Pclass,
Sex,
Parch,
Embarked), as.factor)
Check for NA values in our data frame.
# Check proportion of missing data
table(is.na(titanic))
##
## FALSE TRUE
## 6951 177
titanic <- titanic %>% na.omit()
titanic %>% is.na() %>% colSums()
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
The proportion of missing values (NA) in the data is only about 2.5% (177 of 7,128 cells, all in Age), so those rows can safely be deleted.
Re-check missing values (including empty strings) using reshape
# Check missing value using "reshape" library
missing_data <- melt(apply(titanic[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
## [,1] [,2]
## [1,] "Embarked" "2"
Update the missing embarkation port with the most common value, 'S' (Southampton).
titanic$Embarked[which(is.na(titanic$Embarked) | titanic$Embarked=="")] <- 'S'
Exploratory Data Analysis
Take a look at the data summary.
titanic %>% summary()
## Survived Pclass Sex Age SibSp Parch
## 0:424 1:186 female:261 Min. : 0.42 Min. :0.0000 0:521
## 1:290 2:173 male :453 1st Qu.:20.12 1st Qu.:0.0000 1:110
## 3:355 Median :28.00 Median :0.0000 2: 68
## Mean :29.70 Mean :0.5126 3: 5
## 3rd Qu.:38.00 3rd Qu.:1.0000 4: 4
## Max. :80.00 Max. :5.0000 5: 5
## 6: 1
## Fare Embarked
## Min. : 0.00 : 0
## 1st Qu.: 8.05 C:130
## Median : 15.74 Q: 28
## Mean : 34.69 S:556
## 3rd Qu.: 33.38
## Max. :512.33
##
💡 Insight :
- 424 passengers died during the tragedy and only 290 survived.
- There are 453 male and 261 female passengers.
- Age and Fare seem to have outliers.
Check class imbalance
prop.table(table(titanic$Survived))
##
## 0 1
## 0.5938375 0.4061625
Based on the proportions above, the target variable (Survived) is balanced enough that we do not need any additional pre-processing to balance the classes.
Train Test Split
Before we build any models, we need to split the data into training and test sets. This is a crucial step in the machine learning process, as it allows us to evaluate the performance of our models and make informed decisions about how to improve them. We will use 80% of the data for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic,
prop = 0.8,
strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
Create Model
We will create two types of models (Generalized Linear Model and K Nearest Neighbor) to predict whether a passenger survived or not. Each model will be developed in several steps:
- Create a model
- Predict the model
- Create an evaluation using Confusion Matrix
- Tuning (if necessary)
Logistic Regression
Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. It is a type of generalized linear model and is commonly used in the field of machine learning and statistics.
The binary dependent variable in logistic regression can take on one of two possible outcomes, typically represented as 0 and 1. The independent variables, also known as predictor variables or features, can be continuous, categorical, or a combination of both.
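As a quick illustration (a minimal sketch, separate from the modelling below), logistic regression works on the log-odds scale, and probabilities are recovered with the logistic (sigmoid) function:
# Log-odds are mapped to probabilities by the logistic function
log_odds <- 2.0
1 / (1 + exp(-log_odds)) # 0.8808
plogis(log_odds)         # same value, using R's built-in logistic function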
First, we will create a model using all variables.
# Create a basic model
titanic_model_all <- glm(Survived~.,
titanic_train,
family = "binomial")
summary(titanic_model_all)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7613 -0.6205 -0.3580 0.6072 2.4531
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.209e+00 6.240e-01 6.746 1.52e-11 ***
## Pclass2 -1.305e+00 3.759e-01 -3.473 0.000515 ***
## Pclass3 -2.537e+00 3.968e-01 -6.393 1.63e-10 ***
## Sexmale -2.709e+00 2.558e-01 -10.591 < 2e-16 ***
## Age -3.790e-02 9.532e-03 -3.976 7.00e-05 ***
## SibSp -4.120e-01 1.521e-01 -2.710 0.006737 **
## Parch1 4.969e-01 3.380e-01 1.470 0.141469
## Parch2 8.579e-02 4.353e-01 0.197 0.843779
## Parch3 6.735e-01 1.088e+00 0.619 0.536019
## Parch4 -1.409e+01 7.493e+02 -0.019 0.984994
## Parch5 -8.361e-01 1.189e+00 -0.703 0.481947
## Parch6 -1.501e+01 1.455e+03 -0.010 0.991773
## Fare 1.645e-03 2.890e-03 0.569 0.569145
## EmbarkedQ -4.991e-01 7.050e-01 -0.708 0.479023
## EmbarkedS -2.664e-01 3.216e-01 -0.828 0.407473
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 771.40 on 570 degrees of freedom
## Residual deviance: 487.36 on 556 degrees of freedom
## AIC: 517.36
##
## Number of Fisher Scoring iterations: 14
💡 Insight :
- Pclass, Sex, Age and SibSp are significant predictors.
- Parch, Fare and Embarked aren't significant predictors.
- The AIC score is 517.36.
Next, we will create a new model using backward stepwise elimination.
# Create a backward model
step(titanic_model_all, direction = "backward")
## Start: AIC=517.36
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Parch 6 492.97 510.97
## - Embarked 2 488.22 514.22
## - Fare 1 487.71 515.71
## <none> 487.36 517.36
## - SibSp 1 495.40 523.40
## - Age 1 504.42 532.42
## - Pclass 2 531.89 557.89
## - Sex 1 628.52 656.52
##
## Step: AIC=510.97
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 2 494.36 508.36
## - Fare 1 493.28 509.28
## <none> 492.97 510.97
## - SibSp 1 500.50 516.50
## - Age 1 516.70 532.70
## - Pclass 2 545.68 559.68
## - Sex 1 644.05 660.05
##
## Step: AIC=508.36
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 494.94 506.94
## <none> 494.36 508.36
## - SibSp 1 502.88 514.88
## - Age 1 519.72 531.72
## - Pclass 2 550.86 560.86
## - Sex 1 647.51 659.51
##
## Step: AIC=506.94
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 494.94 506.94
## - SibSp 1 503.02 513.02
## - Age 1 521.39 531.39
## - Pclass 2 590.39 598.39
## - Sex 1 652.32 662.32
##
## Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Coefficients:
## (Intercept) Pclass2 Pclass3 Sexmale Age SibSp
## 4.49646 -1.51383 -2.84367 -2.74482 -0.04411 -0.37814
##
## Degrees of Freedom: 570 Total (i.e. Null); 565 Residual
## Null Deviance: 771.4
## Residual Deviance: 494.9 AIC: 506.9
We will now create a new model using the significant predictors identified above.
titanic_model_back <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp,
family = "binomial",
data = titanic_train)
summary(titanic_model_back)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8453 -0.6361 -0.3619 0.6139 2.4819
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.496462 0.511501 8.791 < 2e-16 ***
## Pclass2 -1.513830 0.319908 -4.732 2.22e-06 ***
## Pclass3 -2.843666 0.327033 -8.695 < 2e-16 ***
## Sexmale -2.744822 0.248401 -11.050 < 2e-16 ***
## Age -0.044108 0.009023 -4.888 1.02e-06 ***
## SibSp -0.378139 0.139018 -2.720 0.00653 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 771.40 on 570 degrees of freedom
## Residual deviance: 494.94 on 565 degrees of freedom
## AIC: 506.94
##
## Number of Fisher Scoring iterations: 5
Finally, we need to check the predictors for multicollinearity.
vif(titanic_model_back)
## GVIF Df GVIF^(1/(2*Df))
## Pclass 1.460384 2 1.099301
## Sex 1.163603 1 1.078704
## Age 1.441755 1 1.200731
## SibSp 1.174834 1 1.083898
💡 Insight :
- The significant predictors are Pclass, Sex, Age and SibSp, the same as in titanic_model_all.
- The AIC of titanic_model_back (506.94) is lower than that of titanic_model_all (517.36), which means titanic_model_back is the better model.
- All VIF values are below 1.5, which means there is no multicollinearity in our model.
Create prediction
First, we need to create predictions from our previous model. We will use titanic_model_back, since it has all non-significant predictors removed.
# Create a prediction
glm_predict <- predict(titanic_model_back,
titanic_test)
Next, we can check the class type of our prediction result.
# Check the class type
class(glm_predict)
## [1] "numeric"
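By default, predict() on a glm object returns values on the link (log-odds) scale, which is why the result is plain numeric rather than bounded to [0, 1]. As a small sketch, applying the logistic function to these values reproduces the probabilities that type = "response" gives below:
# Convert log-odds predictions to probabilities manually
head(plogis(glm_predict)) # matches predict(titanic_model_back, titanic_test, type = "response")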
Last, we need to save the “probability” result into our test data.
# Save the probability value
titanic_test$probability <- predict(titanic_model_back,
titanic_test,
type = "response")
paged_table(titanic_test)
Data distribution of the probability values
We can check the distribution of our prediction results using geom_density() from the ggplot2 library.
# Create a density plot
ggplot(titanic_test,
aes(x=probability))+
geom_density(lwd=0.5)+
theme_minimal()
We can also write some code to compare our predictions against the actual results.
# Create comparison
titanic_test$prediction <- factor(ifelse(titanic_test$probability > 0.5, "1", "0"))
paged_table(titanic_test[1:10, c("prediction", "Survived")])
💡 Insight :
- The probability distribution is skewed toward 0, meaning most predictions fall into the Not Survived class.
- Based on the table alone, the model appears quite good at predicting passenger survival, but we still need a formal model evaluation.
Model evaluation
glm_evaluation <- confusionMatrix(titanic_test$prediction,
titanic_test$Survived,
positive = "1")
glm_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 15
## 1 17 43
##
## Accuracy : 0.7762
## 95% CI : (0.699, 0.8416)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 3.382e-06
##
## Kappa : 0.5384
##
## Mcnemar's Test P-Value : 0.8597
##
## Sensitivity : 0.7414
## Specificity : 0.8000
## Pos Pred Value : 0.7167
## Neg Pred Value : 0.8193
## Prevalence : 0.4056
## Detection Rate : 0.3007
## Detection Prevalence : 0.4196
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
💡 Insight :
- The confusion matrix shows the number of instances that were correctly or incorrectly classified by a binary classifier. In this example, the rows correspond to the predicted class (0 or 1) and the columns correspond to the true class (0 or 1).
- The numbers in the first row indicate that the classifier predicted 68 instances to be in class 0 when they were actually in class 0, and predicted 15 instances to be in class 0 when they were actually in class 1. The numbers in the second row indicate that the classifier predicted 17 instances to be in class 1 when they were actually in class 0, and predicted 43 instances to be in class 1 when they were actually in class 1.
- The statistics provide additional information about the performance of the classifier. For example, the accuracy of the classifier is 0.7762, which means that it correctly classified 77.62% of instances. The kappa statistic is 0.5384, which measures the agreement between the classifier and the true classes, and takes into account the possibility of agreement occurring by chance.
- The sensitivity of the classifier is 0.7414, which is also known as the true positive rate or recall. It measures the proportion of actual positive instances that were correctly identified by the classifier. The specificity of the classifier is 0.8, which measures the proportion of actual negative instances that were correctly identified by the classifier.
- The positive predictive value (PPV) of the classifier is 0.7167, which measures the proportion of instances predicted to be positive that were actually positive. The negative predictive value (NPV) of the classifier is 0.8193, which measures the proportion of instances predicted to be negative that were actually negative.
- Overall, the statistics provide a summary of the performance of the classifier and can be used to evaluate how well the classifier is performing on the given data.
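To make these numbers concrete, the headline metrics can be recomputed by hand from the four cells of the matrix (a quick sanity-check sketch, with the positive class being "1" = survived):
# Cells taken from the confusion matrix above
TP <- 43; TN <- 68; FP <- 17; FN <- 15
(TP + TN) / (TP + TN + FP + FN) # accuracy    = 0.7762
TP / (TP + FN)                  # sensitivity = 0.7414
TN / (TN + FP)                  # specificity = 0.8000
TP / (TP + FP)                  # PPV         = 0.7167
TN / (TN + FN)                  # NPV         = 0.8193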
K Nearest Neighbor
K-Nearest Neighbors (KNN) is a type of supervised machine learning algorithm used for classification and regression problems. It works by finding the K closest data points in the training set to a given query data point, and then using the class (in classification) or the value (in regression) of the majority of these K points to predict the class or value of the query data point.
In other words, KNN predicts the output for a new data point by looking at the K nearest data points in the training set, where K is a user-defined parameter. The algorithm measures the distance between the new data point and each of the training data points using a distance metric such as Euclidean distance or Manhattan distance. The K nearest training data points are then used to predict the output for the new data point based on the majority class or value.
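For example, with Euclidean distance (the metric used by knn() from the class package), the distance between two passengers is a single formula over their predictor values. A toy sketch with made-up, already-scaled numbers:
# Euclidean distance between two hypothetical passengers
p1 <- c(age = -0.5, fare = 0.2)
p2 <- c(age = 1.1, fare = -0.3)
sqrt(sum((p1 - p2)^2)) # 1.676
dist(rbind(p1, p2))    # same result via base R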
First, we will need to do some data wrangling.
# Add label for target
titanic_knn <- titanic %>%
mutate_at(vars(Sex,
Embarked,
Pclass,
Parch), as.numeric)
titanic_knn$Survived <- factor(titanic_knn$Survived,
levels = c("0","1"),
labels = c("Not Survived", "Survived"))
titanic_knn %>% glimpse()
## Rows: 714
## Columns: 8
## $ Survived <fct> Not Survived, Survived, Survived, Survived, Not Survived, Not…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2…
## $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <dbl> 4, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4…
As we can see, Age and Fare are on much larger scales than the other predictors.
Z-Score scaling
The purpose of scaling using Z-score standardization is to transform the original data so that it has a mean of 0 and a standard deviation of 1. This allows for easier comparison and analysis of different variables that may have different scales or units.
By standardizing the data using the Z-score formula, which is calculated as (x - mean)/standard deviation, each data point is transformed into a value representing how many standard deviations it is away from the mean. This makes it possible to compare data points from different variables on the same scale.
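As a tiny illustration with made-up values, the manual formula and R's scale() agree:
# Z-score standardization: (x - mean) / sd
x <- c(22, 38, 26, 35)
(x - mean(x)) / sd(x)
scale(x) # same values, returned as a matrix with center/scale attributes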
Before scaling, we need to split the data into train and test sets again, since we are now using a new data frame (titanic_knn).
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic_knn,
prop = 0.8,
strata = "Survived")
titanic_train_new <- training(index)
titanic_test_new <- testing(index)
Next, we need to scale our predictors.
# Predictors
titanic_train_scale <- scale(titanic_train_new[,-1])
titanic_test_scale <- scale(titanic_test_new[,-1],
center = attr(titanic_train_scale, "scaled:center"),
scale = attr(titanic_train_scale, "scaled:scale"))
Note that the test set is scaled with the training set's center and scale values, so no information from the test data leaks into the pre-processing. We also need to separate the target variable.
# Target
titanic_train_target <- titanic_train_new[,1]
titanic_test_target <- titanic_test_new[,1]
Create a prediction
First, we need a starting value for K; a common rule of thumb is the square root of the number of training observations.
# Find optimum K
round(sqrt(nrow(titanic_train_new)),0)
## [1] 24
Since the result (24) is an even number, we will use 25 as the k value to avoid ties in the majority vote.
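The square-root rule is only a heuristic. If we wanted to tune k more carefully, a quick sweep over odd values is easy to sketch (an optional check, not part of the original workflow; in practice a separate validation set would be preferable to scoring on the test set):
# Test-set accuracy for a few odd k values (illustrative only)
sapply(seq(3, 31, by = 4), function(k) {
  pred <- knn(train = titanic_train_scale,
              test = titanic_test_scale,
              cl = titanic_train_target,
              k = k)
  mean(pred == titanic_test_target)
})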
Next, we generate predictions with the K-NN method. Note that knn() classifies the test set directly; it does not return a separate model object.
knn_predict <- knn(train = titanic_train_scale,
test = titanic_test_scale,
cl = titanic_train_target,
k = 25)
Model evaluation
knn_evaluation <- confusionMatrix(data = knn_predict,
reference = titanic_test_target,
positive = "Survived")
knn_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Survived Survived
## Not Survived 77 22
## Survived 8 36
##
## Accuracy : 0.7902
## 95% CI : (0.7143, 0.8538)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 5.407e-07
##
## Kappa : 0.5476
##
## Mcnemar's Test P-Value : 0.01762
##
## Sensitivity : 0.6207
## Specificity : 0.9059
## Pos Pred Value : 0.8182
## Neg Pred Value : 0.7778
## Prevalence : 0.4056
## Detection Rate : 0.2517
## Detection Prevalence : 0.3077
## Balanced Accuracy : 0.7633
##
## 'Positive' Class : Survived
##
💡 Insight :
- In this case, the model predicted “Not Survived” for 99 cases, of which 77 were correct and 22 were false negatives (actual survivors classified as not surviving). It predicted “Survived” for 44 cases, of which 36 were correct and 8 were false positives.
- The accuracy of the model is 0.79, which means that it correctly predicted the outcome for 79% of the cases. The kappa value of 0.55 indicates that the agreement between the predicted and actual classes is moderate.
- The sensitivity of the model is 0.62, which means that it correctly identified 62% of the cases that actually survived. The specificity of the model is 0.91, which means that it correctly identified 91% of the cases that did not survive.
- The positive predictive value (PPV) of the model is 0.82, which means that when the model predicted “Survived,” it was correct 82% of the time. The negative predictive value (NPV) of the model is 0.78, which means that when the model predicted “Not Survived,” it was correct 78% of the time.
- Overall, this model has reasonable accuracy and specificity, but lower sensitivity, indicating that it may have difficulty identifying the cases that actually survived.
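Rather than reading the metrics off the printed summary, they can also be pulled programmatically from the confusionMatrix object, which is handy for building a comparison table (a small sketch):
# Extract headline metrics from the caret confusionMatrix object
knn_evaluation$overall["Accuracy"]
knn_evaluation$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]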
Conclusion
- Logistic regression models the relationship between a binary target variable and one or more predictors via a generalized linear model; here it identified Pclass, Sex, Age and SibSp as the significant predictors of survival.
- K-Nearest Neighbors (KNN) classifies each test observation by the majority class among its K nearest (scaled) training observations; with k = 25 it achieved slightly higher accuracy but noticeably lower sensitivity than the logistic model.
- For the full evaluation results, refer to the comparison table below.
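The eval_result object is not built earlier in this section; one way to construct it from the two confusionMatrix objects above (a minimal sketch, assuming glm_evaluation and knn_evaluation are still in scope) is:
# Side-by-side comparison of the two models
eval_result <- data.frame(
  Model = c("Logistic Regression", "K-NN"),
  Accuracy = c(glm_evaluation$overall["Accuracy"], knn_evaluation$overall["Accuracy"]),
  Sensitivity = c(glm_evaluation$byClass["Sensitivity"], knn_evaluation$byClass["Sensitivity"]),
  Specificity = c(glm_evaluation$byClass["Specificity"], knn_evaluation$byClass["Specificity"]),
  Precision = c(glm_evaluation$byClass["Pos Pred Value"], knn_evaluation$byClass["Pos Pred Value"]))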
paged_table(eval_result)