Titanic Survival Classification v1 (Logistic Regression & K-NN)

Brief History of the Titanic Disaster

The sinking of the Titanic is one of the most infamous shipwrecks in history.

Late on Sunday, April 14, 1912, during her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg and sank in the early hours of April 15. The Titanic’s distress signals were heard by a nearby ship. Unfortunately, there weren’t enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

Federal law soon required all large ocean-going vessels to be equipped with wireless for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.

Purpose of the project :

  • Understand the relationship between Survived and the other variables based on historical data.
  • Learn to use Logistic Regression & K-NN to predict Survived based on the data set.

Explanation of the “Titanic” data :

  • PassengerId : Row ID in the data set
  • Survived : If a passenger survived or not (0 = No, 1 = Yes)
  • Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name : Passenger’s name
  • Sex : Passenger’s gender (male or female)
  • Age : Passenger’s age (in years)
  • SibSp : # of siblings / spouses aboard the Titanic
  • Parch : # of parents / children aboard the Titanic
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data Preparation

Load libraries.

library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(rmarkdown)
library(reshape)
library(lmtest)
library(dplyr) 
library(rsample)

Load dataset.

# Load data
titanic <- read.csv("dataInputs/train.csv")

# Show data as table
paged_table(titanic)

Check structure of the new data frame

# Check structure
titanic %>% glimpse()
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Check the unique values of the Ticket and Cabin predictors.

unique(titanic$Ticket)
##   [1] "A/5 21171"          "PC 17599"           "STON/O2. 3101282"  
##   [4] "113803"             "373450"             "330877"            
##   [7] "17463"              "349909"             "347742"            
##  [10] "237736"             "PP 9549"            "113783"            
##  [13] "A/5. 2151"          "347082"             "350406"            
##  [16] "248706"             "382652"             "244373"            
##  [19] "345763"             "2649"               "239865"            
##  [22] "248698"             "330923"             "113788"            
##  [25] "347077"             "2631"               "19950"             
##  [28] "330959"             "349216"             "PC 17601"          
##  [31] "PC 17569"           "335677"             "C.A. 24579"        
##  [34] "PC 17604"           "113789"             "2677"              
##  [37] "A./5. 2152"         "345764"             "2651"              
##  [40] "7546"               "11668"              "349253"            
##  [43] "SC/Paris 2123"      "330958"             "S.C./A.4. 23567"   
##  [46] "370371"             "14311"              "2662"              
##  [49] "349237"             "3101295"            "A/4. 39886"        
##  [52] "PC 17572"           "2926"               "113509"            
##  [55] "19947"              "C.A. 31026"         "2697"              
##  [58] "C.A. 34651"         "CA 2144"            "2669"              
##  [61] "113572"             "36973"              "347088"            
##  [64] "PC 17605"           "2661"               "C.A. 29395"        
##  [67] "S.P. 3464"          "3101281"            "315151"            
##  [70] "C.A. 33111"         "S.O.C. 14879"       "2680"              
##  [73] "1601"               "348123"             "349208"            
##  [76] "374746"             "248738"             "364516"            
##  [79] "345767"             "345779"             "330932"            
##  [82] "113059"             "SO/C 14885"         "3101278"           
##  [85] "W./C. 6608"         "SOTON/OQ 392086"    "343275"            
##  [88] "343276"             "347466"             "W.E.P. 5734"       
##  [91] "C.A. 2315"          "364500"             "374910"            
##  [94] "PC 17754"           "PC 17759"           "231919"            
##  [97] "244367"             "349245"             "349215"            
## [100] "35281"              "7540"               "3101276"           
## [103] "349207"             "343120"             "312991"            
## [106] "349249"             "371110"             "110465"            
## [109] "2665"               "324669"             "4136"              
## [112] "2627"               "STON/O 2. 3101294"  "370369"            
## [115] "PC 17558"           "A4. 54510"          "27267"             
## [118] "370372"             "C 17369"            "2668"              
## [121] "347061"             "349241"             "SOTON/O.Q. 3101307"
## [124] "A/5. 3337"          "228414"             "C.A. 29178"        
## [127] "SC/PARIS 2133"      "11752"              "7534"              
## [130] "PC 17593"           "2678"               "347081"            
## [133] "STON/O2. 3101279"   "365222"             "231945"            
## [136] "C.A. 33112"         "350043"             "230080"            
## [139] "244310"             "S.O.P. 1166"        "113776"            
## [142] "A.5. 11206"         "A/5. 851"           "Fa 265302"         
## [145] "PC 17597"           "35851"              "SOTON/OQ 392090"   
## [148] "315037"             "CA. 2343"           "371362"            
## [151] "C.A. 33595"         "347068"             "315093"            
## [154] "363291"             "113505"             "PC 17318"          
## [157] "111240"             "STON/O 2. 3101280"  "17764"             
## [160] "350404"             "4133"               "PC 17595"          
## [163] "250653"             "LINE"               "SC/PARIS 2131"     
## [166] "230136"             "315153"             "113767"            
## [169] "370365"             "111428"             "364849"            
## [172] "349247"             "234604"             "28424"             
## [175] "350046"             "PC 17610"           "368703"            
## [178] "4579"               "370370"             "248747"            
## [181] "345770"             "3101264"            "2628"              
## [184] "A/5 3540"           "347054"             "2699"              
## [187] "367231"             "112277"             "SOTON/O.Q. 3101311"
## [190] "F.C.C. 13528"       "A/5 21174"          "250646"            
## [193] "367229"             "35273"              "STON/O2. 3101283"  
## [196] "243847"             "11813"              "W/C 14208"         
## [199] "SOTON/OQ 392089"    "220367"             "21440"             
## [202] "349234"             "19943"              "PP 4348"           
## [205] "SW/PP 751"          "A/5 21173"          "236171"            
## [208] "347067"             "237442"             "C.A. 29566"        
## [211] "W./C. 6609"         "26707"              "C.A. 31921"        
## [214] "28665"              "SCO/W 1585"         "367230"            
## [217] "W./C. 14263"        "STON/O 2. 3101275"  "2694"              
## [220] "19928"              "347071"             "250649"            
## [223] "11751"              "244252"             "362316"            
## [226] "113514"             "A/5. 3336"          "370129"            
## [229] "2650"               "PC 17585"           "110152"            
## [232] "PC 17755"           "230433"             "384461"            
## [235] "110413"             "112059"             "382649"            
## [238] "C.A. 17248"         "347083"             "PC 17582"          
## [241] "PC 17760"           "113798"             "250644"            
## [244] "PC 17596"           "370375"             "13502"             
## [247] "347073"             "239853"             "C.A. 2673"         
## [250] "336439"             "347464"             "345778"            
## [253] "A/5. 10482"         "113056"             "349239"            
## [256] "345774"             "349206"             "237798"            
## [259] "370373"             "19877"              "11967"             
## [262] "SC/Paris 2163"      "349236"             "349233"            
## [265] "PC 17612"           "2693"               "113781"            
## [268] "19988"              "9234"               "367226"            
## [271] "226593"             "A/5 2466"           "17421"             
## [274] "PC 17758"           "P/PP 3381"          "PC 17485"          
## [277] "11767"              "PC 17608"           "250651"            
## [280] "349243"             "F.C.C. 13529"       "347470"            
## [283] "29011"              "36928"              "16966"             
## [286] "A/5 21172"          "349219"             "234818"            
## [289] "345364"             "28551"              "111361"            
## [292] "113043"             "PC 17611"           "349225"            
## [295] "7598"               "113784"             "248740"            
## [298] "244361"             "229236"             "248733"            
## [301] "31418"              "386525"             "C.A. 37671"        
## [304] "315088"             "7267"               "113510"            
## [307] "2695"               "2647"               "345783"            
## [310] "237671"             "330931"             "330980"            
## [313] "SC/PARIS 2167"      "2691"               "SOTON/O.Q. 3101310"
## [316] "C 7076"             "110813"             "2626"              
## [319] "14313"              "PC 17477"           "11765"             
## [322] "3101267"            "323951"             "C 7077"            
## [325] "113503"             "2648"               "347069"            
## [328] "PC 17757"           "2653"               "STON/O 2. 3101293" 
## [331] "349227"             "27849"              "367655"            
## [334] "SC 1748"            "113760"             "350034"            
## [337] "3101277"            "350052"             "350407"            
## [340] "28403"              "244278"             "240929"            
## [343] "STON/O 2. 3101289"  "341826"             "4137"              
## [346] "315096"             "28664"              "347064"            
## [349] "29106"              "312992"             "349222"            
## [352] "394140"             "STON/O 2. 3101269"  "343095"            
## [355] "28220"              "250652"             "28228"             
## [358] "345773"             "349254"             "A/5. 13032"        
## [361] "315082"             "347080"             "A/4. 34244"        
## [364] "2003"               "250655"             "364851"            
## [367] "SOTON/O.Q. 392078"  "110564"             "376564"            
## [370] "SC/AH 3085"         "STON/O 2. 3101274"  "13507"             
## [373] "C.A. 18723"         "345769"             "347076"            
## [376] "230434"             "65306"              "33638"             
## [379] "113794"             "2666"               "113786"            
## [382] "65303"              "113051"             "17453"             
## [385] "A/5 2817"           "349240"             "13509"             
## [388] "17464"              "F.C.C. 13531"       "371060"            
## [391] "19952"              "364506"             "111320"            
## [394] "234360"             "A/S 2816"           "SOTON/O.Q. 3101306"
## [397] "113792"             "36209"              "323592"            
## [400] "315089"             "SC/AH Basle 541"    "7553"              
## [403] "31027"              "3460"               "350060"            
## [406] "3101298"            "239854"             "A/5 3594"          
## [409] "4134"               "11771"              "A.5. 18509"        
## [412] "65304"              "SOTON/OQ 3101317"   "113787"            
## [415] "PC 17609"           "A/4 45380"          "36947"             
## [418] "C.A. 6212"          "350035"             "315086"            
## [421] "364846"             "330909"             "4135"              
## [424] "26360"              "111427"             "C 4001"            
## [427] "382651"             "SOTON/OQ 3101316"   "PC 17473"          
## [430] "PC 17603"           "349209"             "36967"             
## [433] "C.A. 34260"         "226875"             "349242"            
## [436] "12749"              "349252"             "2624"              
## [439] "2700"               "367232"             "W./C. 14258"       
## [442] "PC 17483"           "3101296"            "29104"             
## [445] "2641"               "2690"               "315084"            
## [448] "113050"             "PC 17761"           "364498"            
## [451] "13568"              "WE/P 5735"          "2908"              
## [454] "693"                "SC/PARIS 2146"      "244358"            
## [457] "330979"             "2620"               "347085"            
## [460] "113807"             "11755"              "345572"            
## [463] "372622"             "349251"             "218629"            
## [466] "SOTON/OQ 392082"    "SOTON/O.Q. 392087"  "A/4 48871"         
## [469] "349205"             "2686"               "350417"            
## [472] "S.W./PP 752"        "11769"              "PC 17474"          
## [475] "14312"              "A/4. 20589"         "358585"            
## [478] "243880"             "2689"               "STON/O 2. 3101286" 
## [481] "237789"             "13049"              "3411"              
## [484] "237565"             "13567"              "14973"             
## [487] "A./5. 3235"         "STON/O 2. 3101273"  "A/5 3902"          
## [490] "364848"             "SC/AH 29037"        "248727"            
## [493] "2664"               "349214"             "113796"            
## [496] "364511"             "111426"             "349910"            
## [499] "349246"             "113804"             "SOTON/O.Q. 3101305"
## [502] "370377"             "364512"             "220845"            
## [505] "31028"              "2659"               "11753"             
## [508] "350029"             "54636"              "36963"             
## [511] "219533"             "349224"             "334912"            
## [514] "27042"              "347743"             "13214"             
## [517] "112052"             "237668"             "STON/O 2. 3101292" 
## [520] "350050"             "349231"             "13213"             
## [523] "S.O./P.P. 751"      "CA. 2314"           "349221"            
## [526] "8475"               "330919"             "365226"            
## [529] "349223"             "29751"              "2623"              
## [532] "5727"               "349210"             "STON/O 2. 3101285" 
## [535] "234686"             "312993"             "A/5 3536"          
## [538] "19996"              "29750"              "F.C. 12750"        
## [541] "C.A. 24580"         "244270"             "239856"            
## [544] "349912"             "342826"             "4138"              
## [547] "330935"             "6563"               "349228"            
## [550] "350036"             "24160"              "17474"             
## [553] "349256"             "2672"               "113800"            
## [556] "248731"             "363592"             "35852"             
## [559] "348121"             "PC 17475"           "36864"             
## [562] "350025"             "223596"             "PC 17476"          
## [565] "PC 17482"           "113028"             "7545"              
## [568] "250647"             "348124"             "34218"             
## [571] "36568"              "347062"             "350048"            
## [574] "12233"              "250643"             "113806"            
## [577] "315094"             "36866"              "236853"            
## [580] "STON/O2. 3101271"   "239855"             "28425"             
## [583] "233639"             "349201"             "349218"            
## [586] "16988"              "376566"             "STON/O 2. 3101288" 
## [589] "250648"             "113773"             "335097"            
## [592] "29103"              "392096"             "345780"            
## [595] "349204"             "350042"             "29108"             
## [598] "363294"             "SOTON/O2 3101272"   "2663"              
## [601] "347074"             "112379"             "364850"            
## [604] "8471"               "345781"             "350047"            
## [607] "S.O./P.P. 3"        "2674"               "29105"             
## [610] "347078"             "383121"             "36865"             
## [613] "2687"               "113501"             "W./C. 6607"        
## [616] "SOTON/O.Q. 3101312" "374887"             "3101265"           
## [619] "12460"              "PC 17600"           "349203"            
## [622] "28213"              "17465"              "349244"            
## [625] "2685"               "2625"               "347089"            
## [628] "347063"             "112050"             "347087"            
## [631] "248723"             "3474"               "28206"             
## [634] "364499"             "112058"             "STON/O2. 3101290"  
## [637] "S.C./PARIS 2079"    "C 7075"             "315098"            
## [640] "19972"              "368323"             "367228"            
## [643] "2671"               "347468"             "2223"              
## [646] "PC 17756"           "315097"             "392092"            
## [649] "11774"              "SOTON/O2 3101287"   "2683"              
## [652] "315090"             "C.A. 5547"          "349213"            
## [655] "347060"             "PC 17592"           "392091"            
## [658] "113055"             "2629"               "350026"            
## [661] "28134"              "17466"              "233866"            
## [664] "236852"             "SC/PARIS 2149"      "PC 17590"          
## [667] "345777"             "349248"             "695"               
## [670] "345765"             "2667"               "349212"            
## [673] "349217"             "349257"             "7552"              
## [676] "C.A./SOTON 34068"   "SOTON/OQ 392076"    "211536"            
## [679] "112053"             "111369"             "370376"
unique(titanic$Cabin)
##   [1] ""                "C85"             "C123"            "E46"            
##   [5] "G6"              "C103"            "D56"             "A6"             
##   [9] "C23 C25 C27"     "B78"             "D33"             "B30"            
##  [13] "C52"             "B28"             "C83"             "F33"            
##  [17] "F G73"           "E31"             "A5"              "D10 D12"        
##  [21] "D26"             "C110"            "B58 B60"         "E101"           
##  [25] "F E69"           "D47"             "B86"             "F2"             
##  [29] "C2"              "E33"             "B19"             "A7"             
##  [33] "C49"             "F4"              "A32"             "B4"             
##  [37] "B80"             "A31"             "D36"             "D15"            
##  [41] "C93"             "C78"             "D35"             "C87"            
##  [45] "B77"             "E67"             "B94"             "C125"           
##  [49] "C99"             "C118"            "D7"              "A19"            
##  [53] "B49"             "D"               "C22 C26"         "C106"           
##  [57] "C65"             "E36"             "C54"             "B57 B59 B63 B66"
##  [61] "C7"              "E34"             "C32"             "B18"            
##  [65] "C124"            "C91"             "E40"             "T"              
##  [69] "C128"            "D37"             "B35"             "E50"            
##  [73] "C82"             "B96 B98"         "E10"             "E44"            
##  [77] "A34"             "C104"            "C111"            "C92"            
##  [81] "E38"             "D21"             "E12"             "E63"            
##  [85] "A14"             "B37"             "C30"             "D20"            
##  [89] "B79"             "E25"             "D46"             "B73"            
##  [93] "C95"             "B38"             "B39"             "B22"            
##  [97] "C86"             "C70"             "A16"             "C101"           
## [101] "C68"             "A10"             "E68"             "B41"            
## [105] "A20"             "D19"             "D50"             "D9"             
## [109] "A23"             "B50"             "A26"             "D48"            
## [113] "E58"             "C126"            "B71"             "B51 B53 B55"    
## [117] "D49"             "B5"              "B20"             "F G63"          
## [121] "C62 C64"         "E24"             "C90"             "C45"            
## [125] "E8"              "B101"            "D45"             "C46"            
## [129] "D30"             "E121"            "D11"             "E77"            
## [133] "F38"             "B3"              "D6"              "B82 B84"        
## [137] "D17"             "A36"             "B102"            "B69"            
## [141] "E49"             "C47"             "D28"             "E17"            
## [145] "A24"             "C50"             "B42"             "C148"
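
The printed indices above already hint at the problem: 681 distinct tickets and 148 distinct cabin strings (including the empty string) across 891 rows. Rather than printing every value, the same check can be done by counting distinct values per column. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the real data):

```r
# Toy stand-in for the Titanic data frame (values are illustrative only)
toy <- data.frame(
  Ticket = c("A/5 21171", "PC 17599", "113803", "113803"),
  Cabin  = c("", "C85", "C123", "C123"),
  Sex    = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column
distinct_counts <- sapply(toy, function(x) length(unique(x)))

# Share of rows that carry a distinct value; values near 1 behave like IDs
distinct_share <- distinct_counts / nrow(toy)

print(distinct_counts)
print(round(distinct_share, 2))
```

Applied to the real data, the same two lines could be run on `titanic` directly; columns whose share is close to 1 are candidates for removal.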

💡 Insight :

  • PassengerId and Name aren’t usable variables for a prediction model, so they will be removed.
  • Ticket and Cabin have too many unique values to be usable as predictors, so they will be removed as well.
  • Survived is the target of our prediction.
  • Survived, Pclass, Sex, Parch and Embarked should be converted to categorical type. SibSp is kept numeric.

# Drop identifier-like columns and convert categorical columns to factors
titanic <- titanic %>% 
  select(-c(PassengerId, 
            Name, 
            Ticket, 
            Cabin)) %>% 
  mutate_at(vars(Survived, 
                 Pclass, 
                 Sex,
                 Parch,
                 Embarked), as.factor)

Missing (NA) values in our data frame

# Check proportion of missing data
table(is.na(titanic))
## 
## FALSE  TRUE 
##  6951   177
titanic <- titanic %>% na.omit()
titanic %>% is.na() %>% colSums()
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0

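Missing values (NA) make up about 2.5% of all cells (177 of 7,128), so the affected rows are removed. Because each of the 177 missing values falls in a different row, this drops 177 of the 891 passengers and leaves 714 rows.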
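
A cell-level share of missing values can look small while the row-level loss from na.omit() is much larger, because a row is dropped as soon as any one of its columns is missing. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration) shows how to look at both numbers before deleting anything:

```r
# Toy data: NA values concentrated in one column (illustrative values)
toy <- data.frame(
  Age  = c(22, NA, 26, NA, 35),
  Fare = c(7.25, 71.28, 7.93, 53.10, 8.05),
  Sex  = c("male", "female", "female", "female", "male")
)

na_per_column <- colSums(is.na(toy))          # where the NAs sit
cell_share    <- mean(is.na(toy))             # share of all cells that are NA
row_share     <- mean(!complete.cases(toy))   # share of rows na.omit() would drop

print(na_per_column)
print(c(cell_share = cell_share, row_share = row_share))
```

If the row share turns out to be large, imputing the affected column may be worth considering instead of deleting rows; that choice is outside what this analysis does.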

Re-check missing values, including empty strings, using reshape

# Check missing value using "reshape" library
missing_data <- melt(apply(titanic[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
##      [,1]       [,2]
## [1,] "Embarked" "2"

Fill the missing port of embarkation with the most common value (S)

titanic$Embarked[which(is.na(titanic$Embarked) | titanic$Embarked=="")] <- 'S'
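
Because Embarked is already a factor at this point, replacing the blank entries does not remove the empty-string level itself; the summary in the next section still lists it with a count of 0. One possible follow-up, sketched here on a made-up factor (`embarked` is an illustrative stand-in, not the real column), is to drop unused levels with droplevels():

```r
# Toy factor with an empty-string level, mimicking Embarked after as.factor()
embarked <- factor(c("S", "C", "", "Q", "S", ""))

# Replace blank entries with the most common port
embarked[embarked == ""] <- "S"

# The "" level is still present although no row uses it
print(levels(embarked))

# droplevels() removes levels that no longer occur
embarked <- droplevels(embarked)
print(levels(embarked))
print(table(embarked))
```

On the real data this would amount to something like `titanic$Embarked <- droplevels(titanic$Embarked)`, which keeps an empty level from appearing in later summaries.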

Exploratory Data Analysis

Take a look at the data summary

titanic %>% summary()
##  Survived Pclass      Sex           Age            SibSp        Parch  
##  0:424    1:186   female:261   Min.   : 0.42   Min.   :0.0000   0:521  
##  1:290    2:173   male  :453   1st Qu.:20.12   1st Qu.:0.0000   1:110  
##           3:355                Median :28.00   Median :0.0000   2: 68  
##                                Mean   :29.70   Mean   :0.5126   3:  5  
##                                3rd Qu.:38.00   3rd Qu.:1.0000   4:  4  
##                                Max.   :80.00   Max.   :5.0000   5:  5  
##                                                                 6:  1  
##       Fare        Embarked
##  Min.   :  0.00    :  0   
##  1st Qu.:  8.05   C:130   
##  Median : 15.74   Q: 28   
##  Mean   : 34.69   S:556   
##  3rd Qu.: 33.38           
##  Max.   :512.33           
## 

💡 Insight :

  • 424 passengers died in the tragedy and only 290 survived
  • There are 453 males and 261 females
  • Age and Fare seem to have outliers

Check class imbalance

prop.table(table(titanic$Survived))
## 
##         0         1 
## 0.5938375 0.4061625

Based on the proportions above, the target variable (Survived) is balanced enough (about 59% vs. 41%) that we do not need additional pre-processing to balance the classes.

Train Test Split

Before we build a model, we need to split the data into train and test datasets. This is a crucial step in the machine learning process, as it allows us to evaluate how our models perform on data they have not seen and make informed decisions about how to improve them. We will use 80% of the data for training and the remaining 20% for testing.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = titanic,
                       prop = 0.8,
                       strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
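
Stratifying on Survived is meant to keep the class proportions similar in both partitions. A quick way to confirm that is to compare the proportions after splitting. The sketch below does this on made-up data (`toy`, `toy_train` and `toy_test` are assumptions for illustration) and assumes the rsample package is installed:

```r
library(rsample)

set.seed(100)

# Toy data with a 60/40 class split (illustrative, not the Titanic data)
toy <- data.frame(
  Survived = factor(rep(c("0", "1"), times = c(120, 80))),
  Age      = round(runif(200, min = 1, max = 80))
)

split <- initial_split(toy, prop = 0.8, strata = Survived)
toy_train <- training(split)
toy_test  <- testing(split)

# Row counts and class proportions in each partition
print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(round(prop.table(table(toy_train$Survived)), 3))
print(round(prop.table(table(toy_test$Survived)), 3))
```

The same two prop.table() calls could be run on titanic_train and titanic_test to check the real split.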

Create Model

We will create two types of models (Generalized Linear Model and K Nearest Neighbor) to predict whether a passenger survived or not. Each model will be developed in several steps:

  • Create a model
  • Predict with the model
  • Evaluate the predictions using a confusion matrix
  • Tune the model (if necessary)

Logistic Regression

Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. It is a type of generalized linear model and is commonly used in the field of machine learning and statistics.

The binary dependent variable in logistic regression can take on one of two possible outcomes, typically represented as 0 and 1. The independent variables, also known as predictor variables or features, can be continuous, categorical, or a combination of both.

First, we will create a model using all variables.

# Create a basic model

titanic_model_all <- glm(Survived~., 
                         titanic_train, 
                         family = "binomial")

summary(titanic_model_all)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7613  -0.6205  -0.3580   0.6072   2.4531  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.209e+00  6.240e-01   6.746 1.52e-11 ***
## Pclass2     -1.305e+00  3.759e-01  -3.473 0.000515 ***
## Pclass3     -2.537e+00  3.968e-01  -6.393 1.63e-10 ***
## Sexmale     -2.709e+00  2.558e-01 -10.591  < 2e-16 ***
## Age         -3.790e-02  9.532e-03  -3.976 7.00e-05 ***
## SibSp       -4.120e-01  1.521e-01  -2.710 0.006737 ** 
## Parch1       4.969e-01  3.380e-01   1.470 0.141469    
## Parch2       8.579e-02  4.353e-01   0.197 0.843779    
## Parch3       6.735e-01  1.088e+00   0.619 0.536019    
## Parch4      -1.409e+01  7.493e+02  -0.019 0.984994    
## Parch5      -8.361e-01  1.189e+00  -0.703 0.481947    
## Parch6      -1.501e+01  1.455e+03  -0.010 0.991773    
## Fare         1.645e-03  2.890e-03   0.569 0.569145    
## EmbarkedQ   -4.991e-01  7.050e-01  -0.708 0.479023    
## EmbarkedS   -2.664e-01  3.216e-01  -0.828 0.407473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 771.40  on 570  degrees of freedom
## Residual deviance: 487.36  on 556  degrees of freedom
## AIC: 517.36
## 
## Number of Fisher Scoring iterations: 14

💡 Insight :

  • Pclass, Sex, Age and SibSp are significant predictors
  • Parch, Fare and Embarked aren’t significant predictors

Next, we create a new model using backward stepwise elimination, which drops one predictor at a time as long as doing so lowers the AIC.

# Create a backward model
step(titanic_model_all, direction = "backward")
## Start:  AIC=517.36
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Parch     6   492.97 510.97
## - Embarked  2   488.22 514.22
## - Fare      1   487.71 515.71
## <none>          487.36 517.36
## - SibSp     1   495.40 523.40
## - Age       1   504.42 532.42
## - Pclass    2   531.89 557.89
## - Sex       1   628.52 656.52
## 
## Step:  AIC=510.97
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  2   494.36 508.36
## - Fare      1   493.28 509.28
## <none>          492.97 510.97
## - SibSp     1   500.50 516.50
## - Age       1   516.70 532.70
## - Pclass    2   545.68 559.68
## - Sex       1   644.05 660.05
## 
## Step:  AIC=508.36
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance    AIC
## - Fare    1   494.94 506.94
## <none>        494.36 508.36
## - SibSp   1   502.88 514.88
## - Age     1   519.72 531.72
## - Pclass  2   550.86 560.86
## - Sex     1   647.51 659.51
## 
## Step:  AIC=506.94
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance    AIC
## <none>        494.94 506.94
## - SibSp   1   503.02 513.02
## - Age     1   521.39 531.39
## - Pclass  2   590.39 598.39
## - Sex     1   652.32 662.32
## 
## Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale          Age        SibSp  
##     4.49646     -1.51383     -2.84367     -2.74482     -0.04411     -0.37814  
## 
## Degrees of Freedom: 570 Total (i.e. Null);  565 Residual
## Null Deviance:       771.4 
## Residual Deviance: 494.9     AIC: 506.9
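
step() returns the final fitted model, so its result can be assigned directly instead of retyping the selected formula by hand. The following is a minimal sketch on simulated data (`sim`, `full_model` and `back_model` are hypothetical names, and the data-generating coefficients are made up); `trace = 0` suppresses the step-by-step log:

```r
set.seed(100)

# Simulated data: y depends on x1 and x2, while noise is unrelated (illustrative)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

full_model <- glm(y ~ ., data = sim, family = "binomial")

# step() returns the final fitted model; trace = 0 silences the step-by-step log
back_model <- step(full_model, direction = "backward", trace = 0)

print(formula(back_model))
print(c(full = AIC(full_model), backward = AIC(back_model)))
```

Backward elimination by AIC never ends with a higher AIC than the model it started from, which is why the second value printed is at most the first.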

💡 Insight :

  • Backward elimination keeps Pclass, Sex, Age and SibSp, the same predictors that were significant in the full model.

We now fit a new model using the predictors selected above.

titanic_model_back <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp, 
                      family = "binomial", 
                      data = titanic_train)

summary(titanic_model_back)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8453  -0.6361  -0.3619   0.6139   2.4819  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.496462   0.511501   8.791  < 2e-16 ***
## Pclass2     -1.513830   0.319908  -4.732 2.22e-06 ***
## Pclass3     -2.843666   0.327033  -8.695  < 2e-16 ***
## Sexmale     -2.744822   0.248401 -11.050  < 2e-16 ***
## Age         -0.044108   0.009023  -4.888 1.02e-06 ***
## SibSp       -0.378139   0.139018  -2.720  0.00653 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 771.40  on 570  degrees of freedom
## Residual deviance: 494.94  on 565  degrees of freedom
## AIC: 506.94
## 
## Number of Fisher Scoring iterations: 5
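
The coefficients in the summary above are on the log-odds scale, which is hard to read directly. As a rough sketch, the estimates can be copied into a vector, exponentiated into odds ratios, and turned into survival probabilities for example passengers. The two passengers below are hypothetical, chosen only to illustrate the arithmetic:

```r
# Coefficients copied from the summary of titanic_model_back above
coefs <- c(
  "(Intercept)" = 4.496462,
  Pclass2       = -1.513830,
  Pclass3       = -2.843666,
  Sexmale       = -2.744822,
  Age           = -0.044108,
  SibSp         = -0.378139
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Two hypothetical 30-year-old passengers travelling without siblings or spouse
logit_female_1st <- coefs[["(Intercept)"]] + coefs[["Age"]] * 30
logit_male_3rd   <- coefs[["(Intercept)"]] + coefs[["Pclass3"]] +
  coefs[["Sexmale"]] + coefs[["Age"]] * 30

# plogis() is the inverse logit: 1 / (1 + exp(-x))
prob_female_1st <- plogis(logit_female_1st)
prob_male_3rd   <- plogis(logit_male_3rd)
print(round(c(female_1st = prob_female_1st, male_3rd = prob_male_3rd), 3))
```

An odds ratio below 1 means the predictor lowers the odds of survival when the other predictors are held fixed; with a fitted model object, `exp(coef(titanic_model_back))` would give the same ratios without copying numbers by hand.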

Create prediction

First, we create a prediction from the model above. We use titanic_model_back because all non-significant predictors have been removed from it.

# Create a prediction

glm_predict <- predict(titanic_model_back, 
                       titanic_test)

Next, we can check the class type of our prediction result.

# Check the class type 
class(glm_predict)
## [1] "numeric"
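
Note that predict() on a glm returns the linear predictor (log-odds) unless a type is given, so the numeric values above are not yet probabilities. The sketch below illustrates the difference on a small model fitted to the built-in mtcars data, which stands in for the Titanic data purely for illustration:

```r
# Small logistic model on a built-in dataset (mtcars stands in for the Titanic data)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Default predict() returns the linear predictor (log-odds)
log_odds <- predict(fit, newdata = mtcars)

# type = "response" returns probabilities between 0 and 1
prob <- predict(fit, newdata = mtcars, type = "response")

# Applying the inverse logit to the log-odds reproduces the probabilities
print(all.equal(unname(plogis(log_odds)), unname(prob)))
print(range(log_odds))
print(range(prob))
```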

Last, we save the predicted probabilities (type = "response") into our test data.

# Save the probability value

titanic_test$probability <- predict(titanic_model_back,
                                    titanic_test,
                                    type = "response")

paged_table(titanic_test)
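
The two predict() calls above differ only in scale. The following sketch shows the relationship on a toy model fitted to the built-in mtcars data (not the Titanic data), so toy_model, log_odds and prob are illustrative names only.

```r
# Toy logistic regression on the built-in mtcars data (not the Titanic data)
toy_model <- glm(am ~ wt, family = "binomial", data = mtcars)

# Default type = "link": predictions on the log-odds scale
log_odds <- predict(toy_model, newdata = mtcars)

# type = "response": predictions on the probability scale
prob <- predict(toy_model, newdata = mtcars, type = "response")

# The inverse logit, plogis(x) = 1 / (1 + exp(-x)), maps one onto the other
all.equal(plogis(log_odds), prob)
range(prob)
```

This is why we store the type = "response" result: it lies between 0 and 1 and can be compared directly with a probability threshold.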

Data distribution on probability values

We can check the distribution of the predicted probabilities using geom_density() from the ggplot2 library.

# Create a density plot

ggplot(titanic_test, 
       aes(x=probability))+
  geom_density(lwd=0.5)+
  theme_minimal()

We can also convert the probabilities into class predictions, using 0.5 as the threshold, and compare them with the actual results.

# Create comparison

titanic_test$prediction <- factor(ifelse(titanic_test$probability > 0.5, "1", "0"))
paged_table(titanic_test[1:10, c("prediction", "Survived")])

💡 Insight :

  • The predicted probabilities are concentrated near 0, which means the model assigns most passengers to the Not Survived class.
  • Judging from the first ten rows alone, the model predicts passenger survival quite well. However, we still need to check the “Model Evaluation”.

Model evaluation

glm_evaluation <- confusionMatrix(titanic_test$prediction, 
                titanic_test$Survived, 
                positive = "1")

glm_evaluation
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 68 15
##          1 17 43
##                                          
##                Accuracy : 0.7762         
##                  95% CI : (0.699, 0.8416)
##     No Information Rate : 0.5944         
##     P-Value [Acc > NIR] : 3.382e-06      
##                                          
##                   Kappa : 0.5384         
##                                          
##  Mcnemar's Test P-Value : 0.8597         
##                                          
##             Sensitivity : 0.7414         
##             Specificity : 0.8000         
##          Pos Pred Value : 0.7167         
##          Neg Pred Value : 0.8193         
##              Prevalence : 0.4056         
##          Detection Rate : 0.3007         
##    Detection Prevalence : 0.4196         
##       Balanced Accuracy : 0.7707         
##                                          
##        'Positive' Class : 1              
## 

💡 Insight :

  • Accuracy of the model to predict Not Survived or Survived is 77.62%
  • The Sensitivity value is 74.14%
  • The Specificity value is 80%
  • The Precision value is 71.67%
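
To see where these figures come from, we can recompute them by hand from the four cells of the confusion matrix. This is only a sketch with the counts typed in manually; the names tp, fn, fp and tn are ours and are not part of the caret output.

```r
# Counts copied from the confusion matrix above (positive class = "1")
tp <- 43  # predicted 1, actually 1
fn <- 15  # predicted 0, actually 1
fp <- 17  # predicted 1, actually 0
tn <- 68  # predicted 0, actually 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # also called recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # reported by caret as Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The rounded values match the Accuracy, Sensitivity, Specificity and Pos Pred Value lines printed by confusionMatrix().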

K Nearest Neighbor

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression problems. It finds the K data points in the training set that are closest to a given query point, then predicts the majority class of those K points (in classification) or the average of their values (in regression).

K is a user-defined parameter. The distance between the query point and each training point is measured with a distance metric such as Euclidean distance or Manhattan distance, so the result depends on how the predictors are scaled.
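
The idea can be sketched in a few lines of base R. This is not how the knn() function from the class package is implemented; knn_one is a hypothetical helper and the toy data are made up purely for illustration.

```r
# Classify one query point by majority vote among its k nearest training points
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up toy data: two well-separated groups in two dimensions
train <- rbind(c(1, 1), c(1, 2), c(2, 1),
               c(8, 8), c(8, 9), c(9, 8))
labels <- c("A", "A", "A", "B", "B", "B")

knn_one(train, labels, query = c(2, 2), k = 3)
knn_one(train, labels, query = c(7, 9), k = 3)
```

The first query sits next to the "A" group and the second next to the "B" group, so each receives the label of its three closest neighbours.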

First, we will need to do some data wrangling.

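# Convert the categorical predictors to numeric
# (as.numeric() on a factor returns its level codes, e.g. Sex: 1 = female, 2 = male),
# then add labels for the target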

titanic_knn <- titanic %>% 
  mutate_at(vars(Sex, 
                 Embarked, 
                 Pclass,
                 Parch), as.numeric)

titanic_knn$Survived <- factor(titanic_knn$Survived, 
                      levels = c("0","1"), 
                      labels = c("Not Survived", "Survived"))
titanic_knn %>% glimpse()
## Rows: 714
## Columns: 8
## $ Survived <fct> Not Survived, Survived, Survived, Survived, Not Survived, Not…
## $ Pclass   <dbl> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex      <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2…
## $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch    <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <dbl> 4, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4…

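As we can see, Age and Fare are on a much larger scale than the other predictors. Because K-NN relies on distances, they would dominate the result unless we scale the data first.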

Z-Score scaling

The purpose of scaling using Z-score standardization is to transform the original data so that it has a mean of 0 and a standard deviation of 1. This allows for easier comparison and analysis of different variables that may have different scales or units.

By standardizing the data using the Z-score formula, which is calculated as (x - mean)/standard deviation, each data point is transformed into a value representing how many standard deviations it is away from the mean. This makes it possible to compare data points from different variables on the same scale.
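
As a small sketch of the formula, we can compute Z-scores by hand and compare them with the result of scale(). The ages below are a handful of example values, not the full column.

```r
# A few example ages, for illustration only
age <- c(22, 38, 26, 35, 54)

# Z-score by hand: (x - mean) / standard deviation
z_manual <- (age - mean(age)) / sd(age)

# scale() does the same and stores the mean and sd as attributes
z_scaled <- scale(age)

all.equal(z_manual, as.numeric(z_scaled))
attr(z_scaled, "scaled:center")  # the mean that was subtracted
attr(z_scaled, "scaled:scale")   # the standard deviation that was divided by
```

The stored "scaled:center" and "scaled:scale" attributes are what allow the same transformation to be applied to other data later.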

Before we do that, we need to split the data into training and test sets again (hold-out validation), because we are now working with a new data frame (titanic_knn).

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = titanic_knn,
                       prop = 0.8,
                       strata = "Survived")
titanic_train_new <- training(index)
titanic_test_new <- testing(index)

Next, we need to scale our predictors.

# Predictors

titanic_train_scale <- scale(titanic_train_new[,-1])
titanic_test_scale <- scale(titanic_test_new[,-1],
                            center = attr(titanic_train_scale, "scaled:center"),
                            scale = attr(titanic_train_scale, "scaled:scale"))

We also need to separate the target variable. It is a class label, so it is not scaled.

# Target

titanic_train_target <- titanic_train_new[,1]
titanic_test_target <- titanic_test_new[,1]

Create a prediction

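First, we need a reasonable starting value for K. A common rule of thumb is the square root of the number of training observations.

# Rule-of-thumb K: square root of the number of training rows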

round(sqrt(nrow(titanic_train_new)),0)
## [1] 24

The result, 24, is an even number. With two classes, an even K can produce a tied vote, so we will use the next odd number, 25, as the K value.
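
The same rule can be written as a small function. This is only a sketch: choose_k is a hypothetical helper and not part of the class or caret packages.

```r
# Hypothetical helper: square-root rule of thumb, bumped to the next odd number
choose_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1
  k
}

# 571 training rows: 714 rows in titanic_knn minus the 143 rows in the test set
choose_k(571)
```

The square-root rule is only a starting point. Comparing accuracy over several odd values of K on held-out data is a more reliable way to choose it.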

Next, we will classify the test data using the K-NN method.

knn_predict <- knn(train = titanic_train_scale, 
                   test = titanic_test_scale, 
                   cl = titanic_train_target, 
                   k = 25)

Model evaluation

knn_evaluation <- confusionMatrix(data = knn_predict, 
                reference = titanic_test_target, 
                positive = "Survived")

knn_evaluation
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Not Survived Survived
##   Not Survived           77       22
##   Survived                8       36
##                                           
##                Accuracy : 0.7902          
##                  95% CI : (0.7143, 0.8538)
##     No Information Rate : 0.5944          
##     P-Value [Acc > NIR] : 5.407e-07       
##                                           
##                   Kappa : 0.5476          
##                                           
##  Mcnemar's Test P-Value : 0.01762         
##                                           
##             Sensitivity : 0.6207          
##             Specificity : 0.9059          
##          Pos Pred Value : 0.8182          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.4056          
##          Detection Rate : 0.2517          
##    Detection Prevalence : 0.3077          
##       Balanced Accuracy : 0.7633          
##                                           
##        'Positive' Class : Survived        
## 

💡 Insight :

  • Accuracy of the model to predict Not Survived or Survived is 79.02%
  • The Sensitivity value is only 62.07%
  • The Specificity value is 90.59%
  • The Precision value is 81.82%

Conclusion

  • Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. It is a type of generalized linear model and is commonly used in the field of machine learning and statistics.
  • K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression problems. It finds the K data points in the training set that are closest to a given query point, then predicts the majority class of those K points (in classification) or the average of their values (in regression).
  • For the evaluation results, please refer to the table below.

paged_table(eval_result)
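
The code that builds eval_result is not shown in this section. As a rough sketch of how such a comparison table could be assembled, the data frame below uses the metrics printed by the two confusionMatrix() calls above, typed in by hand. The name eval_sketch is ours; in the actual analysis the values could presumably be taken from the overall and byClass elements of glm_evaluation and knn_evaluation instead.

```r
# Metrics copied from the two confusionMatrix() outputs above
eval_sketch <- data.frame(
  Model       = c("Logistic Regression", "K-NN (k = 25)"),
  Accuracy    = c(0.7762, 0.7902),
  Sensitivity = c(0.7414, 0.6207),
  Specificity = c(0.8000, 0.9059),
  Precision   = c(0.7167, 0.8182)
)

eval_sketch

# Which model leads on each metric
sapply(eval_sketch[-1], function(x) eval_sketch$Model[which.max(x)])
```

On this test set, K-NN leads on Accuracy, Specificity and Precision, while logistic regression leads on Sensitivity, meaning it identifies a larger share of the passengers who actually survived.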