Titanic Survival Classification v1 (Logistic Regression & K-NN)

Brief History of the Titanic Disaster

The sinking of the Titanic is one of the most infamous shipwrecks in history.

Late on Sunday, April 14, 1912, during her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg and sank in the early hours of April 15. The Titanic’s distress signals were heard by a nearby ship. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

Federal law soon required all large ocean-going vessels to be equipped with wireless for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.

Purpose of the project :

  • Understand how Survived relates to the other variables in the historical data.
  • Learn to use Logistic Regression and K-NN to predict Survived from the data set.

Explanation on “Titanic” data :

  • PassengerId : Row ID in the data set
  • Survived : If a passenger survived or not (0 = No, 1 = Yes)
  • Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name : Passenger’s name
  • Sex : Passenger’s gender (male or female)
  • Age : Passenger’s age (in years)
  • SibSp : # of siblings / spouses aboard the Titanic
  • Parch : # of parents / children aboard the Titanic
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data Preparation

Load the libraries.

library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(rmarkdown)
library(reshape)
library(lmtest)
library(dplyr) 
library(rsample)

Load dataset.

# Load data
titanic <- read.csv("dataInputs/train.csv")

# Show data as table
paged_table(titanic)

Check the structure of the new data frame.

# Check structure
titanic %>% glimpse()
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Check the unique values of the Ticket and Cabin predictors.

unique(titanic$Ticket)
##   [1] "A/5 21171"          "PC 17599"           "STON/O2. 3101282"  
##   [4] "113803"             "373450"             "330877"            
##   [7] "17463"              "349909"             "347742"            
##  [10] "237736"             "PP 9549"            "113783"            
##  [13] "A/5. 2151"          "347082"             "350406"            
##  [16] "248706"             "382652"             "244373"            
##  [19] "345763"             "2649"               "239865"            
##  [22] "248698"             "330923"             "113788"            
##  [25] "347077"             "2631"               "19950"             
##  [28] "330959"             "349216"             "PC 17601"          
##  [31] "PC 17569"           "335677"             "C.A. 24579"        
##  [34] "PC 17604"           "113789"             "2677"              
##  [37] "A./5. 2152"         "345764"             "2651"              
##  [40] "7546"               "11668"              "349253"            
##  [43] "SC/Paris 2123"      "330958"             "S.C./A.4. 23567"   
##  [46] "370371"             "14311"              "2662"              
##  [49] "349237"             "3101295"            "A/4. 39886"        
##  [52] "PC 17572"           "2926"               "113509"            
##  [55] "19947"              "C.A. 31026"         "2697"              
##  [58] "C.A. 34651"         "CA 2144"            "2669"              
##  [61] "113572"             "36973"              "347088"            
##  [64] "PC 17605"           "2661"               "C.A. 29395"        
##  [67] "S.P. 3464"          "3101281"            "315151"            
##  [70] "C.A. 33111"         "S.O.C. 14879"       "2680"              
##  [73] "1601"               "348123"             "349208"            
##  [76] "374746"             "248738"             "364516"            
##  [79] "345767"             "345779"             "330932"            
##  [82] "113059"             "SO/C 14885"         "3101278"           
##  [85] "W./C. 6608"         "SOTON/OQ 392086"    "343275"            
##  [88] "343276"             "347466"             "W.E.P. 5734"       
##  [91] "C.A. 2315"          "364500"             "374910"            
##  [94] "PC 17754"           "PC 17759"           "231919"            
##  [97] "244367"             "349245"             "349215"            
## [100] "35281"              "7540"               "3101276"           
## [103] "349207"             "343120"             "312991"            
## [106] "349249"             "371110"             "110465"            
## [109] "2665"               "324669"             "4136"              
## [112] "2627"               "STON/O 2. 3101294"  "370369"            
## [115] "PC 17558"           "A4. 54510"          "27267"             
## [118] "370372"             "C 17369"            "2668"              
## [121] "347061"             "349241"             "SOTON/O.Q. 3101307"
## [124] "A/5. 3337"          "228414"             "C.A. 29178"        
## [127] "SC/PARIS 2133"      "11752"              "7534"              
## [130] "PC 17593"           "2678"               "347081"            
## [133] "STON/O2. 3101279"   "365222"             "231945"            
## [136] "C.A. 33112"         "350043"             "230080"            
## [139] "244310"             "S.O.P. 1166"        "113776"            
## [142] "A.5. 11206"         "A/5. 851"           "Fa 265302"         
## [145] "PC 17597"           "35851"              "SOTON/OQ 392090"   
## [148] "315037"             "CA. 2343"           "371362"            
## [151] "C.A. 33595"         "347068"             "315093"            
## [154] "363291"             "113505"             "PC 17318"          
## [157] "111240"             "STON/O 2. 3101280"  "17764"             
## [160] "350404"             "4133"               "PC 17595"          
## [163] "250653"             "LINE"               "SC/PARIS 2131"     
## [166] "230136"             "315153"             "113767"            
## [169] "370365"             "111428"             "364849"            
## [172] "349247"             "234604"             "28424"             
## [175] "350046"             "PC 17610"           "368703"            
## [178] "4579"               "370370"             "248747"            
## [181] "345770"             "3101264"            "2628"              
## [184] "A/5 3540"           "347054"             "2699"              
## [187] "367231"             "112277"             "SOTON/O.Q. 3101311"
## [190] "F.C.C. 13528"       "A/5 21174"          "250646"            
## [193] "367229"             "35273"              "STON/O2. 3101283"  
## [196] "243847"             "11813"              "W/C 14208"         
## [199] "SOTON/OQ 392089"    "220367"             "21440"             
## [202] "349234"             "19943"              "PP 4348"           
## [205] "SW/PP 751"          "A/5 21173"          "236171"            
## [208] "347067"             "237442"             "C.A. 29566"        
## [211] "W./C. 6609"         "26707"              "C.A. 31921"        
## [214] "28665"              "SCO/W 1585"         "367230"            
## [217] "W./C. 14263"        "STON/O 2. 3101275"  "2694"              
## [220] "19928"              "347071"             "250649"            
## [223] "11751"              "244252"             "362316"            
## [226] "113514"             "A/5. 3336"          "370129"            
## [229] "2650"               "PC 17585"           "110152"            
## [232] "PC 17755"           "230433"             "384461"            
## [235] "110413"             "112059"             "382649"            
## [238] "C.A. 17248"         "347083"             "PC 17582"          
## [241] "PC 17760"           "113798"             "250644"            
## [244] "PC 17596"           "370375"             "13502"             
## [247] "347073"             "239853"             "C.A. 2673"         
## [250] "336439"             "347464"             "345778"            
## [253] "A/5. 10482"         "113056"             "349239"            
## [256] "345774"             "349206"             "237798"            
## [259] "370373"             "19877"              "11967"             
## [262] "SC/Paris 2163"      "349236"             "349233"            
## [265] "PC 17612"           "2693"               "113781"            
## [268] "19988"              "9234"               "367226"            
## [271] "226593"             "A/5 2466"           "17421"             
## [274] "PC 17758"           "P/PP 3381"          "PC 17485"          
## [277] "11767"              "PC 17608"           "250651"            
## [280] "349243"             "F.C.C. 13529"       "347470"            
## [283] "29011"              "36928"              "16966"             
## [286] "A/5 21172"          "349219"             "234818"            
## [289] "345364"             "28551"              "111361"            
## [292] "113043"             "PC 17611"           "349225"            
## [295] "7598"               "113784"             "248740"            
## [298] "244361"             "229236"             "248733"            
## [301] "31418"              "386525"             "C.A. 37671"        
## [304] "315088"             "7267"               "113510"            
## [307] "2695"               "2647"               "345783"            
## [310] "237671"             "330931"             "330980"            
## [313] "SC/PARIS 2167"      "2691"               "SOTON/O.Q. 3101310"
## [316] "C 7076"             "110813"             "2626"              
## [319] "14313"              "PC 17477"           "11765"             
## [322] "3101267"            "323951"             "C 7077"            
## [325] "113503"             "2648"               "347069"            
## [328] "PC 17757"           "2653"               "STON/O 2. 3101293" 
## [331] "349227"             "27849"              "367655"            
## [334] "SC 1748"            "113760"             "350034"            
## [337] "3101277"            "350052"             "350407"            
## [340] "28403"              "244278"             "240929"            
## [343] "STON/O 2. 3101289"  "341826"             "4137"              
## [346] "315096"             "28664"              "347064"            
## [349] "29106"              "312992"             "349222"            
## [352] "394140"             "STON/O 2. 3101269"  "343095"            
## [355] "28220"              "250652"             "28228"             
## [358] "345773"             "349254"             "A/5. 13032"        
## [361] "315082"             "347080"             "A/4. 34244"        
## [364] "2003"               "250655"             "364851"            
## [367] "SOTON/O.Q. 392078"  "110564"             "376564"            
## [370] "SC/AH 3085"         "STON/O 2. 3101274"  "13507"             
## [373] "C.A. 18723"         "345769"             "347076"            
## [376] "230434"             "65306"              "33638"             
## [379] "113794"             "2666"               "113786"            
## [382] "65303"              "113051"             "17453"             
## [385] "A/5 2817"           "349240"             "13509"             
## [388] "17464"              "F.C.C. 13531"       "371060"            
## [391] "19952"              "364506"             "111320"            
## [394] "234360"             "A/S 2816"           "SOTON/O.Q. 3101306"
## [397] "113792"             "36209"              "323592"            
## [400] "315089"             "SC/AH Basle 541"    "7553"              
## [403] "31027"              "3460"               "350060"            
## [406] "3101298"            "239854"             "A/5 3594"          
## [409] "4134"               "11771"              "A.5. 18509"        
## [412] "65304"              "SOTON/OQ 3101317"   "113787"            
## [415] "PC 17609"           "A/4 45380"          "36947"             
## [418] "C.A. 6212"          "350035"             "315086"            
## [421] "364846"             "330909"             "4135"              
## [424] "26360"              "111427"             "C 4001"            
## [427] "382651"             "SOTON/OQ 3101316"   "PC 17473"          
## [430] "PC 17603"           "349209"             "36967"             
## [433] "C.A. 34260"         "226875"             "349242"            
## [436] "12749"              "349252"             "2624"              
## [439] "2700"               "367232"             "W./C. 14258"       
## [442] "PC 17483"           "3101296"            "29104"             
## [445] "2641"               "2690"               "315084"            
## [448] "113050"             "PC 17761"           "364498"            
## [451] "13568"              "WE/P 5735"          "2908"              
## [454] "693"                "SC/PARIS 2146"      "244358"            
## [457] "330979"             "2620"               "347085"            
## [460] "113807"             "11755"              "345572"            
## [463] "372622"             "349251"             "218629"            
## [466] "SOTON/OQ 392082"    "SOTON/O.Q. 392087"  "A/4 48871"         
## [469] "349205"             "2686"               "350417"            
## [472] "S.W./PP 752"        "11769"              "PC 17474"          
## [475] "14312"              "A/4. 20589"         "358585"            
## [478] "243880"             "2689"               "STON/O 2. 3101286" 
## [481] "237789"             "13049"              "3411"              
## [484] "237565"             "13567"              "14973"             
## [487] "A./5. 3235"         "STON/O 2. 3101273"  "A/5 3902"          
## [490] "364848"             "SC/AH 29037"        "248727"            
## [493] "2664"               "349214"             "113796"            
## [496] "364511"             "111426"             "349910"            
## [499] "349246"             "113804"             "SOTON/O.Q. 3101305"
## [502] "370377"             "364512"             "220845"            
## [505] "31028"              "2659"               "11753"             
## [508] "350029"             "54636"              "36963"             
## [511] "219533"             "349224"             "334912"            
## [514] "27042"              "347743"             "13214"             
## [517] "112052"             "237668"             "STON/O 2. 3101292" 
## [520] "350050"             "349231"             "13213"             
## [523] "S.O./P.P. 751"      "CA. 2314"           "349221"            
## [526] "8475"               "330919"             "365226"            
## [529] "349223"             "29751"              "2623"              
## [532] "5727"               "349210"             "STON/O 2. 3101285" 
## [535] "234686"             "312993"             "A/5 3536"          
## [538] "19996"              "29750"              "F.C. 12750"        
## [541] "C.A. 24580"         "244270"             "239856"            
## [544] "349912"             "342826"             "4138"              
## [547] "330935"             "6563"               "349228"            
## [550] "350036"             "24160"              "17474"             
## [553] "349256"             "2672"               "113800"            
## [556] "248731"             "363592"             "35852"             
## [559] "348121"             "PC 17475"           "36864"             
## [562] "350025"             "223596"             "PC 17476"          
## [565] "PC 17482"           "113028"             "7545"              
## [568] "250647"             "348124"             "34218"             
## [571] "36568"              "347062"             "350048"            
## [574] "12233"              "250643"             "113806"            
## [577] "315094"             "36866"              "236853"            
## [580] "STON/O2. 3101271"   "239855"             "28425"             
## [583] "233639"             "349201"             "349218"            
## [586] "16988"              "376566"             "STON/O 2. 3101288" 
## [589] "250648"             "113773"             "335097"            
## [592] "29103"              "392096"             "345780"            
## [595] "349204"             "350042"             "29108"             
## [598] "363294"             "SOTON/O2 3101272"   "2663"              
## [601] "347074"             "112379"             "364850"            
## [604] "8471"               "345781"             "350047"            
## [607] "S.O./P.P. 3"        "2674"               "29105"             
## [610] "347078"             "383121"             "36865"             
## [613] "2687"               "113501"             "W./C. 6607"        
## [616] "SOTON/O.Q. 3101312" "374887"             "3101265"           
## [619] "12460"              "PC 17600"           "349203"            
## [622] "28213"              "17465"              "349244"            
## [625] "2685"               "2625"               "347089"            
## [628] "347063"             "112050"             "347087"            
## [631] "248723"             "3474"               "28206"             
## [634] "364499"             "112058"             "STON/O2. 3101290"  
## [637] "S.C./PARIS 2079"    "C 7075"             "315098"            
## [640] "19972"              "368323"             "367228"            
## [643] "2671"               "347468"             "2223"              
## [646] "PC 17756"           "315097"             "392092"            
## [649] "11774"              "SOTON/O2 3101287"   "2683"              
## [652] "315090"             "C.A. 5547"          "349213"            
## [655] "347060"             "PC 17592"           "392091"            
## [658] "113055"             "2629"               "350026"            
## [661] "28134"              "17466"              "233866"            
## [664] "236852"             "SC/PARIS 2149"      "PC 17590"          
## [667] "345777"             "349248"             "695"               
## [670] "345765"             "2667"               "349212"            
## [673] "349217"             "349257"             "7552"              
## [676] "C.A./SOTON 34068"   "SOTON/OQ 392076"    "211536"            
## [679] "112053"             "111369"             "370376"
unique(titanic$Cabin)
##   [1] ""                "C85"             "C123"            "E46"            
##   [5] "G6"              "C103"            "D56"             "A6"             
##   [9] "C23 C25 C27"     "B78"             "D33"             "B30"            
##  [13] "C52"             "B28"             "C83"             "F33"            
##  [17] "F G73"           "E31"             "A5"              "D10 D12"        
##  [21] "D26"             "C110"            "B58 B60"         "E101"           
##  [25] "F E69"           "D47"             "B86"             "F2"             
##  [29] "C2"              "E33"             "B19"             "A7"             
##  [33] "C49"             "F4"              "A32"             "B4"             
##  [37] "B80"             "A31"             "D36"             "D15"            
##  [41] "C93"             "C78"             "D35"             "C87"            
##  [45] "B77"             "E67"             "B94"             "C125"           
##  [49] "C99"             "C118"            "D7"              "A19"            
##  [53] "B49"             "D"               "C22 C26"         "C106"           
##  [57] "C65"             "E36"             "C54"             "B57 B59 B63 B66"
##  [61] "C7"              "E34"             "C32"             "B18"            
##  [65] "C124"            "C91"             "E40"             "T"              
##  [69] "C128"            "D37"             "B35"             "E50"            
##  [73] "C82"             "B96 B98"         "E10"             "E44"            
##  [77] "A34"             "C104"            "C111"            "C92"            
##  [81] "E38"             "D21"             "E12"             "E63"            
##  [85] "A14"             "B37"             "C30"             "D20"            
##  [89] "B79"             "E25"             "D46"             "B73"            
##  [93] "C95"             "B38"             "B39"             "B22"            
##  [97] "C86"             "C70"             "A16"             "C101"           
## [101] "C68"             "A10"             "E68"             "B41"            
## [105] "A20"             "D19"             "D50"             "D9"             
## [109] "A23"             "B50"             "A26"             "D48"            
## [113] "E58"             "C126"            "B71"             "B51 B53 B55"    
## [117] "D49"             "B5"              "B20"             "F G63"          
## [121] "C62 C64"         "E24"             "C90"             "C45"            
## [125] "E8"              "B101"            "D45"             "C46"            
## [129] "D30"             "E121"            "D11"             "E77"            
## [133] "F38"             "B3"              "D6"              "B82 B84"        
## [137] "D17"             "A36"             "B102"            "B69"            
## [141] "E49"             "C47"             "D28"             "E17"            
## [145] "A24"             "C50"             "B42"             "C148"
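
Printing every unique value is hard to scan. A minimal sketch of a more compact check, on a small made-up data frame (`toy` and its values are illustrative, not the Titanic data), counts the distinct values per column and compares them with the row count:

```r
# Toy data frame standing in for the Titanic data (values are made up)
toy <- data.frame(
  Ticket = c("A/5 21171", "PC 17599", "113803", "373450", "113803"),
  Cabin  = c("", "C85", "C123", "", "E46"),
  Sex    = c("male", "female", "female", "male", "male"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_unique <- sapply(toy, function(x) length(unique(x)))

# Share of distinct values relative to the number of rows;
# values close to 1 suggest an identifier-like column
cardinality <- n_unique / nrow(toy)
print(round(cardinality, 2))
```

In the output above, Ticket has 681 distinct values across 891 rows and Cabin has 148 (including the empty string), so both behave more like identifiers than like categories.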

💡 Insight :

  • PassengerId and Name are identifiers rather than usable predictors, so they will be removed.
  • Ticket and Cabin have too many distinct values (681 and 148, respectively) to be usable as predictors, so they will be removed as well.
  • Survived is the target of our prediction.
  • The code below converts Survived, Pclass, Sex, Parch and Embarked to factors; SibSp is left numeric.

titanic <- titanic %>% 
  select(-c(PassengerId, 
            Name, 
            Ticket, 
            Cabin)) %>% 
  mutate_at(vars(Survived, 
                 Pclass, 
                 Sex,
                 Parch,
                 Embarked), as.factor)

Missing (NA) values in our data frame

# Check proportion of missing data
table(is.na(titanic))
## 
## FALSE  TRUE 
##  6951   177
titanic <- titanic %>% na.omit()
titanic %>% is.na() %>% colSums()
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0

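Missing values (NA) make up 177 of the 7,128 cells (about 2.5%), and all of them are in Age. Because each one sits in a different row, na.omit() removes 177 of the 891 rows (about 20%), leaving 714 observations. We accept this loss here for simplicity.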
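
A pooled count of NA cells does not show which columns are affected or how many rows `na.omit()` would remove. A minimal sketch of both checks, on a made-up data frame (`toy` is illustrative only):

```r
# Toy data with missing ages (values are made up for illustration)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 26, NA, 35),
  Fare     = c(7.25, 71.28, 7.93, 8.05, 53.10)
)

# Missing values per column, rather than one pooled count
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Share of rows that na.omit() would drop
rows_lost <- 1 - nrow(na.omit(toy)) / nrow(toy)
print(rows_lost)
```

Running the same two checks on `titanic` before calling `na.omit()` would make the cost of deletion explicit.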

Re-check for missing or empty values using reshape

# Check missing value using "reshape" library
missing_data <- melt(apply(titanic[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
##      [,1]       [,2]
## [1,] "Embarked" "2"

Replace the missing embarkation port with the most common value (S)

titanic$Embarked[which(is.na(titanic$Embarked) | titanic$Embarked=="")] <- 'S'
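
The replacement above hard-codes 'S'. A minimal sketch, on a made-up factor, of deriving the most common value from the data and then dropping the empty level that is no longer used (`embarked` here is a toy vector, not the real column):

```r
# Toy factor with an empty-string level, as produced by read.csv() + as.factor()
embarked <- factor(c("S", "C", "", "S", "Q", "S", ""))

# Most frequent non-empty level (the "common value")
counts <- table(embarked[embarked != ""])
most_common <- names(counts)[which.max(counts)]

# Replace empty entries, then drop the now-unused "" level
embarked[embarked == ""] <- most_common
embarked <- droplevels(embarked)

print(table(embarked))
```

Without `droplevels()`, the unused `""` level stays in the factor, which is why the later `summary()` output lists an empty Embarked level with a count of 0.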

Exploratory Data Analysis

Take a look at the data summary

titanic %>% summary()
##  Survived Pclass      Sex           Age            SibSp        Parch  
##  0:424    1:186   female:261   Min.   : 0.42   Min.   :0.0000   0:521  
##  1:290    2:173   male  :453   1st Qu.:20.12   1st Qu.:0.0000   1:110  
##           3:355                Median :28.00   Median :0.0000   2: 68  
##                                Mean   :29.70   Mean   :0.5126   3:  5  
##                                3rd Qu.:38.00   3rd Qu.:1.0000   4:  4  
##                                Max.   :80.00   Max.   :5.0000   5:  5  
##                                                                 6:  1  
##       Fare        Embarked
##  Min.   :  0.00    :  0   
##  1st Qu.:  8.05   C:130   
##  Median : 15.74   Q: 28   
##  Mean   : 34.69   S:556   
##  3rd Qu.: 33.38           
##  Max.   :512.33           
## 

💡 Insight :

  • 424 passengers died in the tragedy and only 290 survived
  • There are 453 male and 261 female passengers
  • Age and Fare seem to have outliers

Check class imbalance

prop.table(table(titanic$Survived))
## 
##         0         1 
## 0.5938375 0.4061625

Based on the proportions above (about 59% vs. 41%), the target classes in Survived are balanced enough that no additional pre-processing is needed to balance them.

Train Test Split

Before building a model, we need to split the data into training and test sets. This is a crucial step in the machine learning process, as it allows us to evaluate how our models perform on unseen data and make informed decisions about how to improve them. We will use 80% of the data for training and the remaining 20% for testing.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = titanic,
                       prop = 0.8,
                       strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
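
Stratifying on Survived is meant to keep the class proportions similar in both partitions. The idea can be sketched by hand in base R on made-up data; this illustrates the concept and is not how `rsample` implements it (`toy`, `train_idx` and the 120/80 class counts are assumptions for the example):

```r
set.seed(100)

# Toy binary target with 40% positives (made-up data)
toy <- data.frame(
  id = 1:200,
  Survived = factor(rep(c("0", "1"), times = c(120, 80)))
)

# Hand-rolled stratified split: sample 80% of the row indices within each class
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$Survived),
  function(idx) idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should match in both partitions
print(prop.table(table(toy_train$Survived)))
print(prop.table(table(toy_test$Survived)))
```

With the real objects, the same check is `prop.table(table(titanic_train$Survived))` and its counterpart for `titanic_test`.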

Create Model

We will create two types of models (a generalized linear model and k-nearest neighbors) to predict whether a passenger survived or not. Each model will be developed in several steps:

  • Create a model
  • Predict with the model
  • Evaluate the predictions using a confusion matrix
  • Tune the model (if necessary)

Logistic Regression

Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. It is a type of generalized linear model and is commonly used in the field of machine learning and statistics.

The binary dependent variable in logistic regression can take on one of two possible outcomes, typically represented as 0 and 1. The independent variables, also known as predictor variables or features, can be continuous, categorical, or a combination of both.

First, we will create a model using all variables.

# Create a basic model

titanic_model_all <- glm(Survived~., 
                         titanic_train, 
                         family = "binomial")

summary(titanic_model_all)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7613  -0.6205  -0.3580   0.6072   2.4531  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.209e+00  6.240e-01   6.746 1.52e-11 ***
## Pclass2     -1.305e+00  3.759e-01  -3.473 0.000515 ***
## Pclass3     -2.537e+00  3.968e-01  -6.393 1.63e-10 ***
## Sexmale     -2.709e+00  2.558e-01 -10.591  < 2e-16 ***
## Age         -3.790e-02  9.532e-03  -3.976 7.00e-05 ***
## SibSp       -4.120e-01  1.521e-01  -2.710 0.006737 ** 
## Parch1       4.969e-01  3.380e-01   1.470 0.141469    
## Parch2       8.579e-02  4.353e-01   0.197 0.843779    
## Parch3       6.735e-01  1.088e+00   0.619 0.536019    
## Parch4      -1.409e+01  7.493e+02  -0.019 0.984994    
## Parch5      -8.361e-01  1.189e+00  -0.703 0.481947    
## Parch6      -1.501e+01  1.455e+03  -0.010 0.991773    
## Fare         1.645e-03  2.890e-03   0.569 0.569145    
## EmbarkedQ   -4.991e-01  7.050e-01  -0.708 0.479023    
## EmbarkedS   -2.664e-01  3.216e-01  -0.828 0.407473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 771.40  on 570  degrees of freedom
## Residual deviance: 487.36  on 556  degrees of freedom
## AIC: 517.36
## 
## Number of Fisher Scoring iterations: 14

💡 Insight :

  • Pclass, Sex, Age and SibSp are significant predictors
  • Parch, Fare and Embarked aren’t significant predictors
  • AIC score is 517.36

Next, we create a new model using backward stepwise selection, which removes one predictor at a time as long as doing so lowers the AIC.

# Create a backward model
step(titanic_model_all, direction = "backward")
## Start:  AIC=517.36
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Parch     6   492.97 510.97
## - Embarked  2   488.22 514.22
## - Fare      1   487.71 515.71
## <none>          487.36 517.36
## - SibSp     1   495.40 523.40
## - Age       1   504.42 532.42
## - Pclass    2   531.89 557.89
## - Sex       1   628.52 656.52
## 
## Step:  AIC=510.97
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  2   494.36 508.36
## - Fare      1   493.28 509.28
## <none>          492.97 510.97
## - SibSp     1   500.50 516.50
## - Age       1   516.70 532.70
## - Pclass    2   545.68 559.68
## - Sex       1   644.05 660.05
## 
## Step:  AIC=508.36
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance    AIC
## - Fare    1   494.94 506.94
## <none>        494.36 508.36
## - SibSp   1   502.88 514.88
## - Age     1   519.72 531.72
## - Pclass  2   550.86 560.86
## - Sex     1   647.51 659.51
## 
## Step:  AIC=506.94
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance    AIC
## <none>        494.94 506.94
## - SibSp   1   503.02 513.02
## - Age     1   521.39 531.39
## - Pclass  2   590.39 598.39
## - Sex     1   652.32 662.32
## 
## Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale          Age        SibSp  
##     4.49646     -1.51383     -2.84367     -2.74482     -0.04411     -0.37814  
## 
## Degrees of Freedom: 570 Total (i.e. Null);  565 Residual
## Null Deviance:       771.4 
## Residual Deviance: 494.9     AIC: 506.9
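
The AIC values in the stepwise output can be reproduced by hand. For a binary logistic regression fitted to ungrouped 0/1 data, the residual deviance equals minus twice the log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients. A short check using the figures printed above (`aic_from_deviance` is a helper defined only for this example):

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_from_deviance <- function(deviance, k) deviance + 2 * k

# Full model: 15 coefficients, residual deviance 487.36
aic_full <- aic_from_deviance(487.36, 15)

# Final backward model: 6 coefficients, residual deviance 494.94
aic_back <- aic_from_deviance(494.94, 6)

print(c(full = aic_full, backward = aic_back))
```

The backward model fits slightly worse (higher deviance) but uses nine fewer parameters, which is why its AIC is lower.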

We then refit the model using the predictors retained by the backward selection above.

titanic_model_back <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp, 
                      family = "binomial", 
                      data = titanic_train)

summary(titanic_model_back)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8453  -0.6361  -0.3619   0.6139   2.4819  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.496462   0.511501   8.791  < 2e-16 ***
## Pclass2     -1.513830   0.319908  -4.732 2.22e-06 ***
## Pclass3     -2.843666   0.327033  -8.695  < 2e-16 ***
## Sexmale     -2.744822   0.248401 -11.050  < 2e-16 ***
## Age         -0.044108   0.009023  -4.888 1.02e-06 ***
## SibSp       -0.378139   0.139018  -2.720  0.00653 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 771.40  on 570  degrees of freedom
## Residual deviance: 494.94  on 565  degrees of freedom
## AIC: 506.94
## 
## Number of Fisher Scoring iterations: 5
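
The coefficients are on the log-odds scale, which makes them hard to read directly. A minimal sketch that converts them to odds ratios and computes the predicted survival probability for one hypothetical passenger (the passenger profile is an assumption chosen for illustration; the coefficients are copied from the summary above):

```r
# Coefficients copied from the summary of titanic_model_back above
coefs <- c(
  "(Intercept)" = 4.496462,
  Pclass2 = -1.513830,
  Pclass3 = -2.843666,
  Sexmale = -2.744822,
  Age     = -0.044108,
  SibSp   = -0.378139
)

# Odds ratios: multiplicative change in the odds of survival per unit increase
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical passenger: 30-year-old man in 3rd class travelling alone
log_odds <- coefs[["(Intercept)"]] + coefs[["Pclass3"]] + coefs[["Sexmale"]] +
  coefs[["Age"]] * 30 + coefs[["SibSp"]] * 0

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_survive <- plogis(log_odds)
print(round(p_survive, 3))
```

Every odds ratio is below 1, so each predictor lowers the odds of survival relative to the baseline (a first-class female passenger); the hypothetical passenger's predicted survival probability comes out at roughly 0.08.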

Finally, we check for multicollinearity among the predictors.

vif(titanic_model_back)
##            GVIF Df GVIF^(1/(2*Df))
## Pclass 1.460384  2        1.099301
## Sex    1.163603  1        1.078704
## Age    1.441755  1        1.200731
## SibSp  1.174834  1        1.083898

💡 Insight :

  • The significant predictors are Pclass, Sex, Age and SibSp, the same as in titanic_model_all.
  • The AIC of titanic_model_back (506.94) is smaller than that of titanic_model_all (517.36). This means titanic_model_back gives the better trade-off between fit and model complexity.
  • All GVIF values are below 1.5, which indicates there is no problematic multicollinearity in our model.
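The AIC and the adjusted GVIF column can be checked by hand. The sketch below assumes only the numbers printed in the summary() and vif() output above. It relies on the fact that, for a binomial glm fitted to 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients. The variable names are illustrative.

```r
# Values copied from the summary() and vif() output above
residual_deviance <- 494.94
n_coefficients    <- 6   # intercept, Pclass2, Pclass3, Sexmale, Age, SibSp

# For a binomial glm on 0/1 data: AIC = residual deviance + 2 * number of coefficients
aic_by_hand <- residual_deviance + 2 * n_coefficients

# GVIF^(1/(2*Df)) makes factors with several levels comparable to single-column predictors
gvif <- c(Pclass = 1.460384, Sex = 1.163603, Age = 1.441755, SibSp = 1.174834)
df   <- c(Pclass = 2, Sex = 1, Age = 1, SibSp = 1)
gvif_adjusted <- gvif^(1 / (2 * df))

aic_by_hand
round(gvif_adjusted, 6)
```

Both results should agree with the AIC line of the summary and the last column of the vif() table.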

Create prediction

First, we need to create a prediction from our previous model. We will use titanic_model_back as we have removed all non-significant predictors.

# Create a prediction (without type = "response", predict() returns log-odds)

glm_predict <- predict(titanic_model_back, 
                       titanic_test)

Next, we can check the class type of our prediction result.

# Check the class type 
class(glm_predict)
## [1] "numeric"

Finally, we save the predicted probability (type = "response") into our test data.

# Save the probability value

titanic_test$probability <- predict(titanic_model_back,
                                    titanic_test,
                                    type = "response")

paged_table(titanic_test)
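The two predict() calls above differ only in scale: the default returns the linear predictor (log-odds), while type = "response" returns a probability. As a rough sketch, the conversion can be reproduced with the coefficients printed in the summary above. The passenger below is hypothetical and used only for illustration.

```r
# Coefficients copied from summary(titanic_model_back) above
coefs <- c(intercept = 4.496462, Pclass2 = -1.513830, Pclass3 = -2.843666,
           Sexmale = -2.744822, Age = -0.044108, SibSp = -0.378139)

# Hypothetical passenger (illustration only): 3rd class, male, age 22, 1 sibling/spouse
x <- c(intercept = 1, Pclass2 = 0, Pclass3 = 1, Sexmale = 1, Age = 22, SibSp = 1)

log_odds    <- sum(coefs * x)     # scale of predict(..., type = "link"), the default
probability <- plogis(log_odds)   # scale of predict(..., type = "response")

c(log_odds = log_odds, probability = probability)
```

plogis() is the logistic function 1 / (1 + exp(-x)), so a negative log-odds value always maps to a probability below 0.5.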

Data distribution on probability values

We can check the distribution of our predicted probabilities using geom_density() from the ggplot2 library.

# Create a density plot

ggplot(titanic_test, 
       aes(x=probability))+
  geom_density(lwd=0.5)+
  theme_minimal()

We can also write code to compare our predictions with the actual results.

# Create comparison

# Fix the levels so both classes exist even if one is never predicted
titanic_test$prediction <- factor(ifelse(titanic_test$probability > 0.5, "1", "0"),
                                  levels = c("0", "1"))
paged_table(titanic_test[1:10, c("prediction", "Survived")])
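The 0.5 cutoff is a choice, not a property of the model. The sketch below uses made-up probabilities and labels (not the Titanic data) and a hypothetical helper, classify(), to show how a different cutoff moves cases between the cells of the comparison table.

```r
# Hypothetical predicted probabilities and true labels, for illustration only
probability <- c(0.08, 0.15, 0.35, 0.48, 0.52, 0.70, 0.91)
actual      <- factor(c("0", "0", "1", "0", "1", "1", "1"), levels = c("0", "1"))

# classify() is a hypothetical helper, not part of any package
classify <- function(p, cutoff = 0.5) {
  # Fixed levels keep both classes present even if one is never predicted
  factor(ifelse(p > cutoff, "1", "0"), levels = c("0", "1"))
}

table(Prediction = classify(probability, 0.5), Reference = actual)
table(Prediction = classify(probability, 0.3), Reference = actual)
```

Lowering the cutoff catches more actual survivors, at the cost of more false alarms.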

💡 Insight :

  • The predicted probabilities are concentrated near 0. This means the model assigns most passengers a low probability of survival (Not Survived).
  • Judging from the table alone, our model appears to predict passenger survival fairly well, but we still need to confirm this in the “Model Evaluation” step.

Model evaluation

glm_evaluation <- confusionMatrix(titanic_test$prediction, 
                titanic_test$Survived, 
                positive = "1")

glm_evaluation
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 68 15
##          1 17 43
##                                          
##                Accuracy : 0.7762         
##                  95% CI : (0.699, 0.8416)
##     No Information Rate : 0.5944         
##     P-Value [Acc > NIR] : 3.382e-06      
##                                          
##                   Kappa : 0.5384         
##                                          
##  Mcnemar's Test P-Value : 0.8597         
##                                          
##             Sensitivity : 0.7414         
##             Specificity : 0.8000         
##          Pos Pred Value : 0.7167         
##          Neg Pred Value : 0.8193         
##              Prevalence : 0.4056         
##          Detection Rate : 0.3007         
##    Detection Prevalence : 0.4196         
##       Balanced Accuracy : 0.7707         
##                                          
##        'Positive' Class : 1              
## 

💡 Insight :

  • The confusion matrix shows the number of instances that were correctly or incorrectly classified by a binary classifier. In this example, the rows correspond to the predicted class (0 or 1) and the columns correspond to the true class (0 or 1).
  • The numbers in the first row indicate that the classifier predicted 68 instances to be in class 0 when they were actually in class 0, and predicted 15 instances to be in class 0 when they were actually in class 1. The numbers in the second row indicate that the classifier predicted 17 instances to be in class 1 when they were actually in class 0, and predicted 43 instances to be in class 1 when they were actually in class 1.
  • The statistics provide additional information about the performance of the classifier. For example, the accuracy of the classifier is 0.7762, which means that it correctly classified 77.62% of instances. The kappa statistic is 0.5384, which measures the agreement between the classifier and the true classes, and takes into account the possibility of agreement occurring by chance.
  • The sensitivity of the classifier is 0.7414, which is also known as the true positive rate or recall. It measures the proportion of actual positive instances that were correctly identified by the classifier. The specificity of the classifier is 0.8, which measures the proportion of actual negative instances that were correctly identified by the classifier.
  • The positive predictive value (PPV) of the classifier is 0.7167, which measures the proportion of instances predicted to be positive that were actually positive. The negative predictive value (NPV) of the classifier is 0.8193, which measures the proportion of instances predicted to be negative that were actually negative.
  • Overall, the statistics provide a summary of the performance of the classifier and can be used to evaluate how well the classifier is performing on the given data.
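The statistics discussed above can all be recomputed from the four cells of the confusion matrix. The following sketch uses only the counts printed above. The variable names (tp, fp, fn, tn, cohen_kappa) are illustrative and are not part of caret.

```r
# Counts copied from the confusion matrix above (positive class = "1")
tn <- 68; fn <- 15   # predicted 0: actually 0, actually 1
fp <- 17; tp <- 43   # predicted 1: actually 0, actually 1
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of actual survivors that were found
specificity <- tn / (tn + fp)   # share of actual non-survivors that were found
ppv         <- tp / (tp + fp)   # share of predicted survivors that were right
npv         <- tn / (tn + fn)   # share of predicted non-survivors that were right

# Cohen's kappa: observed agreement corrected for agreement expected by chance
expected    <- ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, kappa = cohen_kappa), 4)
```

The rounded values should match the corresponding lines of the confusionMatrix() output.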

K Nearest Neighbor

K-Nearest Neighbors (KNN) is a type of supervised machine learning algorithm used for classification and regression problems. It works by finding the K closest data points in the training set to a given query data point, and then using the majority class (in classification) or the average value (in regression) of these K points to predict the class or value of the query data point.

In other words, KNN predicts the output for a new data point by looking at the K nearest data points in the training set, where K is a user-defined parameter. The algorithm measures the distance between the new data point and each of the training data points using a distance metric such as Euclidean distance or Manhattan distance. Because the prediction depends directly on these distances, predictors measured on larger scales have more influence unless the data are scaled first.
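The description above can be sketched in a few lines of base R. This is a minimal illustration under simplifying assumptions (Euclidean distance, ties resolved in favour of the first class level), not the implementation used by class::knn(). The helper knn_sketch() and the toy data are made up.

```r
# Minimal K-NN classifier for illustration; knn_sketch() is a hypothetical helper
knn_sketch <- function(train, test, cl, k) {
  train <- as.matrix(train)
  test  <- as.matrix(test)
  cl    <- factor(cl)
  preds <- apply(test, 1, function(query) {
    # Euclidean distance from the query point to every training point
    distances <- sqrt(rowSums(sweep(train, 2, query)^2))
    nearest   <- order(distances)[seq_len(k)]
    # Majority vote among the k nearest labels (first level wins a tie)
    names(which.max(table(cl[nearest])))
  })
  factor(preds, levels = levels(cl))
}

# Toy data (made up): two well-separated groups
train  <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
labels <- c("A", "A", "A", "B", "B", "B")
test   <- rbind(c(1.5, 1.5), c(8.5, 8.5))

knn_sketch(train, test, labels, k = 3)
```

Each query point takes the label held by most of its three nearest training points.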

First, we will need to do some data wrangling.

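# Convert categorical predictors to numeric
# (for factor columns, as.numeric() returns the integer level codes)

titanic_knn <- titanic %>% 
  mutate_at(vars(Sex, 
                 Embarked, 
                 Pclass,
                 Parch), as.numeric)

# Add labels for the target
titanic_knn$Survived <- factor(titanic_knn$Survived, 
                      levels = c("0","1"), 
                      labels = c("Not Survived", "Survived"))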
titanic_knn %>% glimpse()
## Rows: 714
## Columns: 8
## $ Survived <fct> Not Survived, Survived, Survived, Survived, Not Survived, Not…
## $ Pclass   <dbl> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex      <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2…
## $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch    <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <dbl> 4, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4…

As we can see, Age and Fare take much larger values than the other predictors, so they would dominate the distance calculation unless we scale the data.

Z-Score scaling

The purpose of scaling using Z-score standardization is to transform the original data so that it has a mean of 0 and a standard deviation of 1. This allows for easier comparison and analysis of different variables that may have different scales or units.

By standardizing the data using the Z-score formula, which is calculated as (x - mean)/standard deviation, each data point is transformed into a value representing how many standard deviations it is away from the mean. This makes it possible to compare data points from different variables on the same scale.
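As a small illustration of the formula above, the sketch below standardizes a made-up vector of ages by hand and compares the result with base R's scale(), which applies the same calculation column by column.

```r
# Hypothetical ages, for illustration only
age <- c(22, 38, 26, 35, 54)

# Z-score by hand: subtract the mean, divide by the (sample) standard deviation
age_z <- (age - mean(age)) / sd(age)

# scale() centers and scales each column in the same way
age_scaled <- as.numeric(scale(age))

round(age_z, 3)
all.equal(age_z, age_scaled)
```

After the transformation the values have mean 0 and standard deviation 1, whatever the original unit was.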

Before we do that, we need to split the data into training and test sets again, because we are now using new data (titanic_knn).

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = titanic_knn,
                       prop = 0.8,
                       strata = "Survived")
titanic_train_new <- training(index)
titanic_test_new <- testing(index)

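Next, we need to scale our predictors. The test set is scaled with the mean and standard deviation of the training set, so that no information from the test data leaks into the preprocessing.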

# Predictors

titanic_train_scale <- scale(titanic_train_new[,-1])
titanic_test_scale <- scale(titanic_test_new[,-1],
                            center = attr(titanic_train_scale, "scaled:center"),
                            scale = attr(titanic_train_scale, "scaled:scale"))

We also need to separate the target variable. It is a class label, so it is not scaled.

# Target

titanic_train_target <- titanic_train_new[,1]
titanic_test_target <- titanic_test_new[,1]

Create a prediction

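First, we need to choose a K value. A common rule of thumb is the square root of the number of training rows.

# Rule-of-thumb K: square root of the number of training rows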

round(sqrt(nrow(titanic_train_new)),0)
## [1] 24

The result is an even number, 24, so we will use 25 as the K value. An odd K avoids tied votes when there are only two classes.
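The rule used here can be written as a small function. This is only a sketch of the square-root heuristic with the odd-number adjustment. choose_k() is a hypothetical helper, and the 571 training rows are an assumption based on an 80% split of 714 rows.

```r
# Hypothetical helper: square-root rule of thumb, bumped to the next odd number
choose_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1   # an odd k avoids tied votes between two classes
  k
}

choose_k(571)   # 571 rows is an assumption (80% of 714)
```

The heuristic gives a reasonable starting point only. Comparing several K values on held-out data would be a more reliable way to pick one.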

Next, we will create a model based on K-NN method.

knn_predict <- knn(train = titanic_train_scale, 
                   test = titanic_test_scale, 
                   cl = titanic_train_target, 
                   k = 25)

Model evaluation

knn_evaluation <- confusionMatrix(data = knn_predict, 
                reference = titanic_test_target, 
                positive = "Survived")

knn_evaluation
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Not Survived Survived
##   Not Survived           77       22
##   Survived                8       36
##                                           
##                Accuracy : 0.7902          
##                  95% CI : (0.7143, 0.8538)
##     No Information Rate : 0.5944          
##     P-Value [Acc > NIR] : 5.407e-07       
##                                           
##                   Kappa : 0.5476          
##                                           
##  Mcnemar's Test P-Value : 0.01762         
##                                           
##             Sensitivity : 0.6207          
##             Specificity : 0.9059          
##          Pos Pred Value : 0.8182          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.4056          
##          Detection Rate : 0.2517          
##    Detection Prevalence : 0.3077          
##       Balanced Accuracy : 0.7633          
##                                           
##        'Positive' Class : Survived        
## 

💡 Insight :

  • In this case, the model predicted “Not Survived” for 99 cases, of which 77 were correct and 22 were false negatives (passengers who actually survived). The model predicted “Survived” for 44 cases, of which 36 were correct and 8 were false positives (passengers who did not survive).
  • The accuracy of the model is 0.79, which means that it correctly predicted the outcome for 79% of the cases. The kappa value of 0.55 indicates that the agreement between the predicted and actual classes is moderate.
  • The sensitivity of the model is 0.62, which means that it correctly identified 62% of the cases that actually survived. The specificity of the model is 0.91, which means that it correctly identified 91% of the cases that did not survive.
  • The positive predictive value (PPV) of the model is 0.82, which means that when the model predicted “Survived,” it was correct 82% of the time. The negative predictive value (NPV) of the model is 0.78, which means that when the model predicted “Not Survived,” it was correct 78% of the time.
  • Overall, this model has reasonable accuracy and specificity, but lower sensitivity, indicating that it may have difficulty identifying the cases that actually survived.

Conclusion

  • Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. It is a type of generalized linear model and is commonly used in the field of machine learning and statistics.
  • K-Nearest Neighbors (KNN) is a type of supervised machine learning algorithm used for classification and regression problems. It works by finding the K closest data points in the training set to a given query data point, and then using the majority class (in classification) or the average value (in regression) of these K points to predict the class or value of the query data point.
  • For the evaluation results, please refer to the table below.
paged_table(eval_result)
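The eval_result object is built outside this section. As a rough sketch, a comparable summary can be assembled by hand from the two confusion matrices reported above. The name eval_sketch and the choice of columns are assumptions, and the actual eval_result table may be structured differently.

```r
# Counts copied from the two confusion matrices above (positive class = survived)
counts <- data.frame(
  model = c("Logistic Regression", "K-NN"),
  tp = c(43, 36), fp = c(17, 8), fn = c(15, 22), tn = c(68, 77)
)

# eval_sketch is a hypothetical stand-in; the document's eval_result may differ
eval_sketch <- data.frame(
  model       = counts$model,
  accuracy    = with(counts, (tp + tn) / (tp + fp + fn + tn)),
  sensitivity = with(counts, tp / (tp + fn)),
  specificity = with(counts, tn / (tn + fp)),
  precision   = with(counts, tp / (tp + fp))
)

eval_sketch
```

On this test split, K-NN has slightly higher accuracy and specificity, while logistic regression has higher sensitivity.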