An Algorithmic EDA of the Complete Data Set - Titanic.csv


Step 1: Acquire and Load the Data


## 
## Loaded
## First few rows:
##   pclass survived                           name    sex     age sibsp parch
## 1      1        1  Allen, Miss. Elisabeth Walton female 29.0000     0     0
## 2      1        1 Allison, Master. Hudson Trevor   male  0.9167     1     2
## 3      1        0   Allison, Miss. Helen Loraine female  2.0000     1     2
##   ticket     fare   cabin embarked boat body                       home.dest
## 1  24160 211.3375      B5        S    2   NA                    St Louis, MO
## 2 113781 151.5500 C22 C26        S   11   NA Montreal, PQ / Chesterville, ON
## 3 113781 151.5500 C22 C26        S        NA Montreal, PQ / Chesterville, ON
## Last few rows:
##      pclass survived                      name  sex  age sibsp parch ticket
## 1307      3        0 Zakarian, Mr. Mapriededer male 26.5     0     0   2656
## 1308      3        0       Zakarian, Mr. Ortin male 27.0     0     0   2670
## 1309      3        0        Zimmerman, Mr. Leo male 29.0     0     0 315082
##       fare cabin embarked boat body home.dest
## 1307 7.225              C       304          
## 1308 7.225              C        NA          
## 1309 7.875              S        NA
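
The loading code is not shown above. A minimal sketch of this step, assuming the file is read with base `read.csv()`; the three-row temporary CSV is a hypothetical stand-in for Titanic.csv so that the sketch runs on its own:

```r
# Hypothetical stand-in for Titanic.csv: three rows written to a temp file
# so the sketch runs without the real data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "pclass,survived,name,sex,age,fare",
  "1,1,\"Allen, Miss. Elisabeth Walton\",female,29,211.3375",
  "1,1,\"Allison, Master. Hudson Trevor\",male,0.9167,151.55",
  "1,0,\"Allison, Miss. Helen Loraine\",female,2,151.55"
), csv_path)

# stringsAsFactors = TRUE yields Factor columns like those in the str() listing
titanic <- read.csv(csv_path, stringsAsFactors = TRUE)

cat("Loaded\n")
cat("First few rows:\n")
print(head(titanic, 3))
cat("Last few rows:\n")
print(tail(titanic, 3))
```

Reading strings as factors is a design choice that makes level counts visible immediately, at the cost of treating free-text columns such as name as factors too.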

Steps 2 and 3: Assess the Data Frame, Variable Structure, and Types


## 'data.frame':    1309 obs. of  14 variables:
##  $ pclass   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : int  1 1 0 0 0 1 1 0 1 0 ...
##  $ name     : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
##  $ sex      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age      : num  29 0.917 2 30 25 ...
##  $ sibsp    : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
##  $ fare     : num  211 152 152 152 152 ...
##  $ cabin    : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
##  $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
##  $ boat     : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
##  $ body     : int  NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
## Rows: 1,309
## Columns: 14
## $ pclass    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ survived  <int> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, …
## $ name      <fct> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Tr…
## $ sex       <fct> female, male, female, male, female, male, female, male, fema…
## $ age       <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.0000,…
## $ sibsp     <int> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ parch     <int> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, …
## $ ticket    <fct> 24160, 113781, 113781, 113781, 113781, 19952, 13502, 112050,…
## $ fare      <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 7…
## $ cabin     <fct> B5, C22 C26, C22 C26, C22 C26, C22 C26, E12, D7, A36, C101, …
## $ embarked  <fct> S, S, S, S, S, S, S, S, S, C, C, C, C, S, S, S, C, C, C, C, …
## $ boat      <fct> 2, 11, , , , 3, 10, , D, , , 4, 9, 6, B, , , 6, 8, A, 5, 5, …
## $ body      <int> NA, NA, NA, 135, NA, NA, NA, NA, NA, 22, 124, NA, NA, NA, NA…
## $ home.dest <fct> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal…
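
The two listings above have the layout of `str()` and of `dplyr::glimpse()`. A small sketch of the same inspection on a toy data frame (the toy columns are illustrative, and the `glimpse()` call runs only if dplyr is installed):

```r
# Toy data frame with one integer, one factor, and one numeric column
df <- data.frame(
  pclass = c(1L, 1L, 3L),
  sex    = factor(c("female", "male", "male")),
  age    = c(29, 0.9167, NA)
)

# Compact structure listing: dimensions, column types, first values
str(df)

# One-line summary of the storage class of every column
col_types <- vapply(df, function(col) class(col)[1], character(1))
print(col_types)

# Optional: the tidyverse equivalent, only when dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) dplyr::glimpse(df)
```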

Steps 4, 5, and 6: Redefine Variable Types, Levels and Extractions


## [1] Levels:
## x
##    0    1  Sum 
##  809  500 1309
## [1] Levels:
## x
##    1    2    3  Sum 
##  323  277  709 1309
## Missing  female    male     Sum 
##       0     466     843    1309
## Missing       0       1       2       3       4       5       8     Sum 
##       0     891     319      42      20      22       6       9    1309
## [1] Levels:
## x
##    0    1    2    3  Sum 
##  891  319   42   57 1309
## Missing       0       1       2       3       4       5       6       9     Sum 
##       0    1002     170     113       8       6       6       2       2    1309
## [1] Levels:
## x
##    0    1    2    3  Sum 
## 1002  170  113   24 1309
## [1] Levels:
## x
##    1    2    3    4  Sum 
##  337  361  492  119 1309
## [1] Levels:
## x
##    U    A    B    C    D    E  Sum 
## 1041   22   65   94   46   41 1309
## Missing       C       Q       S     Sum 
##       0     270     123     916    1309
## [1] Levels:
## x
##             U           USA OtherAmericas       EURASIA        Nordic 
##           564           505            75            50             9 
##       Ireland       England           Sum 
##            15            91          1309
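
The recoding code is not shown. The tables above are consistent with sparse counts being folded into a top category (for sibsp, 20 + 22 + 6 + 9 = 57 in level 3). A hedged sketch of that kind of recoding, plus deck and title extraction; the toy rows, the regular expression, and the rule that decks outside A to E become "U" are assumptions for illustration:

```r
toy <- data.frame(
  sibsp = c(0, 1, 2, 4, 8),
  cabin = c("B5", "C22 C26", "", "E12", "F33"),
  name  = c("Allen, Miss. Elisabeth Walton",
            "Allison, Master. Hudson Trevor",
            "Zakarian, Mr. Ortin",
            "Zimmerman, Mr. Leo",
            "Allison, Miss. Helen Loraine"),
  stringsAsFactors = FALSE
)

# Cap counts at 3 so that rare high values share one level
toy$sibsp3 <- factor(pmin(toy$sibsp, 3))
print(addmargins(table(toy$sibsp3)))

# Deck = first letter of the cabin code; anything else is labelled "U" (assumed rule)
deck <- substr(toy$cabin, 1, 1)
deck[!deck %in% c("A", "B", "C", "D", "E")] <- "U"
toy$deck <- factor(deck, levels = c("U", "A", "B", "C", "D", "E"))

# Title = text between the comma and the first period of the name
toy$title <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", toy$name)
print(toy[, c("sibsp3", "deck", "title")])
```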

Step 7: Identify Missing Data


##          age         fare 
## 0.2009167303 0.0007639419
## integer(0)
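
The proportions above match 263/1309 for age and 1/1309 for fare. One way to obtain such per-column shares is sketched below on a toy data frame; the second check (rows that are missing in every column) is a hypothetical example of a `which()` call that prints `integer(0)` when nothing matches, not necessarily the check used in the study:

```r
df <- data.frame(
  age  = c(29, NA, 2, NA, 25),
  fare = c(211.3, 151.6, NA, 7.2, 7.9),
  sex  = c("female", "male", "female", "male", "male")
)

# Share of missing values per column, keeping only columns with any NA
na_share <- colMeans(is.na(df))
na_share <- na_share[na_share > 0]
print(na_share)

# Hypothetical follow-up: indices of rows that are missing in every column
all_missing_rows <- which(rowSums(is.na(df)) == ncol(df))
print(all_missing_rows)
```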

Steps 8 and 9: Build Univariate and Multivariate Graphs and Statistics


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##                        vars    n   mean     sd median trimmed    mad min
## name*                     1 1309 653.69 377.31 653.00  653.62 484.81   1
## age                       2 1046  29.87  14.41  28.00   29.38  11.86   0
## fare                      3 1308  33.30  51.76  14.45   21.57  10.24   0
## home.dest*                4 1309  31.11  31.07  22.00   28.47  31.13   1
## survived_1                5 1309   0.38   0.49   0.00    0.35   0.00   0
## pclass_1                  6 1309   0.25   0.43   0.00    0.18   0.00   0
## pclass_2                  7 1309   0.21   0.41   0.00    0.14   0.00   0
## sex_female                8 1309   0.36   0.48   0.00    0.32   0.00   0
## sibsp_1                   9 1309   0.24   0.43   0.00    0.18   0.00   0
## sibsp_2                  10 1309   0.03   0.18   0.00    0.00   0.00   0
## sibsp_3                  11 1309   0.04   0.20   0.00    0.00   0.00   0
## parch_1                  12 1309   0.13   0.34   0.00    0.04   0.00   0
## parch_2                  13 1309   0.09   0.28   0.00    0.00   0.00   0
## parch_3                  14 1309   0.02   0.13   0.00    0.00   0.00   0
## ticket_1                 15 1309   0.26   0.44   0.00    0.20   0.00   0
## ticket_2                 16 1309   0.28   0.45   0.00    0.22   0.00   0
## ticket_4                 17 1309   0.09   0.29   0.00    0.00   0.00   0
## cabin_A                  18 1309   0.02   0.13   0.00    0.00   0.00   0
## cabin_B                  19 1309   0.05   0.22   0.00    0.00   0.00   0
## cabin_C                  20 1309   0.07   0.26   0.00    0.00   0.00   0
## cabin_D                  21 1309   0.04   0.18   0.00    0.00   0.00   0
## cabin_E                  22 1309   0.03   0.17   0.00    0.00   0.00   0
## embarked_C               23 1309   0.21   0.40   0.00    0.13   0.00   0
## embarked_Q               24 1309   0.09   0.29   0.00    0.00   0.00   0
## home_England             25 1309   0.07   0.25   0.00    0.00   0.00   0
## home_EURASIA             26 1309   0.04   0.19   0.00    0.00   0.00   0
## home_Ireland             27 1309   0.01   0.11   0.00    0.00   0.00   0
## home_Nordic              28 1309   0.01   0.08   0.00    0.00   0.00   0
## home_OtherAmericas       29 1309   0.06   0.23   0.00    0.00   0.00   0
## home_USA                 30 1309   0.39   0.49   0.00    0.36   0.00   0
## title_AdultFemale        31 1309   0.15   0.36   0.00    0.07   0.00   0
## title_MilitaryDocClass   32 1309   0.01   0.11   0.00    0.00   0.00   0
## title_Reverend           33 1309   0.01   0.08   0.00    0.00   0.00   0
## title_Royalty            34 1309   0.00   0.07   0.00    0.00   0.00   0
## title_YouthFemale        35 1309   0.20   0.40   0.00    0.12   0.00   0
## title_YouthMale          36 1309   0.05   0.21   0.00    0.00   0.00   0
## Index                    37 1309 655.00 378.02 655.00  655.00 484.81   1
##                            max   range  skew kurtosis    se
## name*                  1307.00 1306.00  0.00    -1.20 10.43
## age                      80.00   80.00  0.41     0.13  0.45
## fare                    512.33  512.33  4.36    26.87  1.43
## home.dest*              101.00  100.00  0.39    -1.37  0.86
## survived_1                1.00    1.00  0.49    -1.77  0.01
## pclass_1                  1.00    1.00  1.17    -0.62  0.01
## pclass_2                  1.00    1.00  1.41    -0.01  0.01
## sex_female                1.00    1.00  0.60    -1.64  0.01
## sibsp_1                   1.00    1.00  1.19    -0.58  0.01
## sibsp_2                   1.00    1.00  5.30    26.16  0.00
## sibsp_3                   1.00    1.00  4.47    17.98  0.01
## parch_1                   1.00    1.00  2.20     2.84  0.01
## parch_2                   1.00    1.00  2.94     6.66  0.01
## parch_3                   1.00    1.00  7.17    49.48  0.00
## ticket_1                  1.00    1.00  1.11    -0.77  0.01
## ticket_2                  1.00    1.00  1.00    -1.00  0.01
## ticket_4                  1.00    1.00  2.84     6.09  0.01
## cabin_A                   1.00    1.00  7.51    54.43  0.00
## cabin_B                   1.00    1.00  4.14    15.16  0.01
## cabin_C                   1.00    1.00  3.31     8.98  0.01
## cabin_D                   1.00    1.00  5.04    23.45  0.01
## cabin_E                   1.00    1.00  5.38    26.91  0.00
## embarked_C                1.00    1.00  1.45     0.10  0.01
## embarked_Q                1.00    1.00  2.78     5.73  0.01
## home_England              1.00    1.00  3.38     9.44  0.01
## home_EURASIA              1.00    1.00  4.81    21.18  0.01
## home_Ireland              1.00    1.00  9.17    82.15  0.00
## home_Nordic               1.00    1.00 11.92   140.23  0.00
## home_OtherAmericas        1.00    1.00  3.81    12.49  0.01
## home_USA                  1.00    1.00  0.47    -1.78  0.01
## title_AdultFemale         1.00    1.00  1.91     1.66  0.01
## title_MilitaryDocClass    1.00    1.00  9.17    82.15  0.00
## title_Reverend            1.00    1.00 12.66   158.38  0.00
## title_Royalty             1.00    1.00 14.65   212.84  0.00
## title_YouthFemale         1.00    1.00  1.51     0.28  0.01
## title_YouthMale           1.00    1.00  4.30    16.48  0.01
## Index                  1309.00 1308.00  0.00    -1.20 10.45
## 'data.frame':    1309 obs. of  36 variables:
##  $ name                  : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
##  $ age                   : num  29 1 2 30 25 48 63 39 53 71 ...
##  $ fare                  : num  211 152 152 152 152 ...
##  $ home.dest             : Factor w/ 101 levels "","AB","Argentina",..: 53 69 69 69 69 65 65 59 65 91 ...
##  $ survived_1            : int  1 1 0 0 0 1 1 0 1 0 ...
##  $ pclass_1              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pclass_2              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sex_female            : int  1 0 1 0 1 0 1 0 1 0 ...
##  $ sibsp_1               : int  0 1 1 1 1 0 1 0 0 0 ...
##  $ sibsp_2               : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ sibsp_3               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ parch_1               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ parch_2               : int  0 1 1 1 1 0 0 0 0 0 ...
##  $ parch_3               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ticket_1              : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ ticket_2              : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ ticket_4              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_A               : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ cabin_B               : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_C               : int  0 1 1 1 1 0 0 0 1 0 ...
##  $ cabin_D               : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ cabin_E               : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ embarked_C            : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ embarked_Q            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_England          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_EURASIA          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_Ireland          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ home_Nordic           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_OtherAmericas    : int  0 1 1 1 1 0 0 0 0 1 ...
##  $ home_USA              : int  1 0 0 0 0 1 1 0 1 0 ...
##  $ title_AdultFemale     : int  0 0 0 0 1 0 0 0 1 0 ...
##  $ title_MilitaryDocClass: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ title_Reverend        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ title_Royalty         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ title_YouthFemale     : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ title_YouthMale       : int  0 1 0 0 0 0 0 0 0 0 ...
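
The 36-column frame above holds 0/1 indicator columns with one level per factor left out. The encoding code is not shown; a base R sketch with `model.matrix()` follows. Choosing the reference levels by hand and renaming the columns to the `pclass_1` style are assumptions made to mirror the listing:

```r
df <- data.frame(
  pclass = factor(c(1, 2, 3, 3, 1)),
  sex    = factor(c("female", "male", "male", "female", "male"))
)

# The reference level is the one that gets no indicator column
df$pclass <- relevel(df$pclass, ref = "3")
df$sex    <- relevel(df$sex, ref = "male")

# model.matrix() builds the indicators; drop the intercept column
dummies <- model.matrix(~ pclass + sex, data = df)[, -1, drop = FALSE]
dummies <- as.data.frame(dummies)
names(dummies) <- c("pclass_1", "pclass_2", "sex_female")
print(dummies)
```

Dropping one level per factor avoids perfectly collinear indicator columns in later regression models.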
## #########################################################################

## #########################################################################
##                   survived_1 sex_female title_AdultFemale title_YouthFemale
## survived_1                                                                 
## sex_female        0.529***                                                 
## title_AdultFemale 0.356***   0.575***                                      
## title_YouthFemale 0.302***   0.67***    -0.213***                          
## pclass_1          0.279***   0.107***   0.148***          -0.018           
## ticket_1          0.256***   0.084**    0.126***          -0.026           
## home_USA          0.223***   0.122***   0.178***          -0.013           
## embarked_C        0.182***   0.067*     0.106***          -0.022           
## parch_1           0.164***   0.13***    0.131***          0.041            
## cabin_B           0.16***    0.094***   0.087**           0.027            
##                   pclass_1 ticket_1 home_USA embarked_C parch_1 cabin_B
## survived_1                                                             
## sex_female                                                             
## title_AdultFemale                                                      
## title_YouthFemale                                                      
## pclass_1                                                               
## ticket_1          0.83***                                              
## home_USA          0.296*** 0.226***                                    
## embarked_C        0.326*** 0.287*** 0.03                               
## parch_1           0.042    0.001    0.086**  0.089**                   
## cabin_B           0.399*** 0.324*** 0.064*   0.162***   0.09**
## #########################################################################
##                 survived_1 sibsp_1   ticket_4 cabin_E cabin_C cabin_D sibsp_3 
## survived_1                                                                    
## sibsp_1         0.151***                                                      
## ticket_4        -0.134***  -0.037                                             
## cabin_E         0.129***   0.041     -0.011                                   
## cabin_C         0.128***   0.145***  -0.078** -0.05                           
## cabin_D         0.123***   0.075**   -0.046   -0.034  -0.053                  
## sibsp_3         -0.098***  -0.121*** -0.015   -0.038  -0.001  -0.041          
## parch_2         0.077**    0.009     -0.003   -0.008  0.03    -0.029  0.401***
## title_Reverend  -0.062*    0.001     -0.025   -0.014  -0.022  -0.015  -0.017  
## title_YouthMale 0.057*     0.069*    -0.019   0.002   -0.047  -0.042  0.361***
## home_Ireland    -0.055*    -0.061*   -0.034   -0.019  -0.03   -0.021  -0.023  
##                 parch_2  title_Reverend title_YouthMale home_Ireland
## survived_1                                                          
## sibsp_1                                                             
## ticket_4                                                            
## cabin_E                                                             
## cabin_C                                                             
## cabin_D                                                             
## sibsp_3                                                             
## parch_2                                                             
## title_Reverend  -0.024                                              
## title_YouthMale 0.255*** -0.017                                     
## home_Ireland    -0.033   -0.008         -0.024
## #########################################################################
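
The tables above rank indicator variables by their correlation with survived_1 and mark significance with stars. The function that produced them is not shown. The sketch below illustrates the idea on synthetic data; the simulated variables, the seed, and the star cut-offs (0.001, 0.01, 0.05) are assumptions:

```r
set.seed(1)
n <- 200
# Synthetic 0/1 variables; survival depends on sex, not on class
sex_female <- rbinom(n, 1, 0.36)
survived_1 <- rbinom(n, 1, ifelse(sex_female == 1, 0.73, 0.19))
pclass_1   <- rbinom(n, 1, 0.25)
X <- data.frame(survived_1, sex_female, pclass_1)

# Pearson correlation of each predictor with the target
r_target <- cor(X)[-1, "survived_1"]

# p-value of each correlation, in the same column order
p <- vapply(X[-1], function(col) cor.test(col, X$survived_1)$p.value, numeric(1))

# Conventional star coding (assumed thresholds)
stars <- cut(p, breaks = c(0, 0.001, 0.01, 0.05, 1),
             labels = c("***", "**", "*", ""), include.lowest = TRUE)

ranked <- data.frame(variable = names(r_target),
                     r = round(unname(r_target), 3),
                     stars = as.character(stars))
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```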

Step 10: Split the Data and Avoid Leakage


## Seed 1234 set for reproducibility
## Training set size: 1047
## Testing set size: 262
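
The split code is not shown. A base R sketch that reproduces the reported sizes from the reported seed and row count, assuming a simple random 80/20 split with `sample()` (the study may have used a different function, so the selected rows need not match):

```r
n <- 1309        # row count reported in Steps 2 and 3
set.seed(1234)   # seed reported above
cat("Seed 1234 set for reproducibility\n")

# Draw 80% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

cat("Training set size:", length(train_idx), "\n")
cat("Testing set size:", length(test_idx), "\n")
```

Splitting before imputation, transformation, and scaling means every later parameter can be estimated from the training rows alone.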

Step 11: Handle Missing Data and Outliers


## [1] 0
## [1] 0
## 'data.frame':    1047 obs. of  36 variables:
##  $ name                  : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 1306 803 964 779 32 619 63 663 357 613 ...
##  $ age                   : num  27 21 28 28 4 23 40 26 42 27 ...
##  $ fare                  : num  7.22 7.78 8.14 23.25 31.27 ...
##  $ home.dest             : Factor w/ 101 levels "","AB","Argentina",..: 1 1 1 1 52 1 46 1 65 1 ...
##  $ survived_1            : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ pclass_1              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pclass_2              : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ sex_female            : int  0 0 1 0 0 0 0 0 0 1 ...
##  $ sibsp_1               : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ sibsp_2               : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ sibsp_3               : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ parch_1               : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ parch_2               : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ parch_3               : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ ticket_1              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ticket_2              : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ ticket_4              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_A               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_B               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_C               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_D               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cabin_E               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ embarked_C            : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ embarked_Q            : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ home_England          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_EURASIA          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_Ireland          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_Nordic           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_OtherAmericas    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_USA              : int  0 0 0 0 1 0 1 0 1 0 ...
##  $ title_AdultFemale     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ title_MilitaryDocClass: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ title_Reverend        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ title_Royalty         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ title_YouthFemale     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ title_YouthMale       : int  0 0 0 0 1 0 0 0 0 0 ...
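
The two zeros above would be consistent with missing-value counts after imputation. The imputation code is not shown; the sketch below estimates fill values on the training set only and reuses them on the test set. The toy data, the helper `stat_mode()`, and the use of the median for the numeric column are assumptions:

```r
train <- data.frame(
  age      = c(22, NA, 35, 28, NA, 41),
  embarked = factor(c("S", "S", "C", NA, "S", "Q"))
)
test <- data.frame(
  age      = c(NA, 30),
  embarked = factor(c(NA, "C"), levels = levels(train$embarked))
)

# Hypothetical helper: most frequent non-missing value, as a string
stat_mode <- function(x) names(which.max(table(x)))

# Fill values come from the training set only
age_fill <- median(train$age, na.rm = TRUE)
emb_fill <- stat_mode(train$embarked)

# Apply the same fill values to both sets
train$age[is.na(train$age)] <- age_fill
test$age[is.na(test$age)]   <- age_fill
train$embarked[is.na(train$embarked)] <- emb_fill
test$embarked[is.na(test$embarked)]   <- emb_fill

print(sum(is.na(train)))
print(sum(is.na(test)))
```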

Step 12: Transform Data and Build on the Training Set, then Test Set


## Estimated transformation parameter 
## train$age + 1 
##     0.7941027
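
The output above reports a power-transformation parameter of about 0.794 for `train$age + 1`. A sketch of how such a parameter can be applied to both sets, assuming the standard Box-Cox form; the helper `box_cox()` and the sample ages are illustrative, and the estimation step itself is not reproduced here:

```r
lambda <- 0.7941027  # value reported above, estimated on the training set

# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the limit at lambda = 0
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

train_age <- c(27, 21, 28, 4, 40)   # illustrative values
test_age  <- c(29, 0.9167)

# The +1 shift keeps every input strictly positive, as in the output above
train_bc <- box_cox(train_age + 1, lambda)
test_bc  <- box_cox(test_age + 1, lambda)  # same lambda, no re-estimation on test data
print(round(train_bc, 3))
print(round(test_bc, 3))
```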

Step 13: Scale Data and Build on the Training Set, then Test Set


###########################################################
# STEP 13. Scale data as necessary, build on the training set and apply it to the test set.
###########################################################

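# Fit the min-max scaler on the training set only, then reuse the same
# minimum and range on the test set so no test information leaks in.
mymin=min(train$age)
mymax=max(train$age)
myrange=mymax-mymin
train$age=(train$age-mymin)/myrange
test$age=(test$age-mymin)/myrange  # test values may fall outside [0, 1]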

Step 14: Clustering and Model Building with Performance Metrics


###########################################################
# STEP 14. Begin model building.
###########################################################
#tr=data.matrix(train, rownames.force = NA) 

euclid=dist(train,method='euclidean')  
eu=round(as.matrix(euclid)[1:36, 1:36],1)

#Function for predicting clusters
clusters <- function(x, centers) {
  tmp <- sapply(seq_len(nrow(x)), function(i) apply(centers, 1,function(v) sum((x[i, ]-v)^2)))
  max.col(-t(tmp))  # find index of min distance
}

#Select Gender, Passenger Class 1 and Passenger Class 2 for k-medoids (pam)
mykm=pam(train[,c(7,5,6)], 2, nstart=100)
fviz_cluster(mykm, data = train, palette = "jco", ellipse.type = "none", ggtheme = theme_minimal())

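#Test Set Performance
# Cluster index minus 1 is used as the predicted class (cluster 2 is taken to mean survived)
mypreds=t(clusters(test[,c(7,5,6)],mykm[["medoids"]])-1)
#u <- union(mypreds, test$survived_1)
predTest=table(mypreds, test$survived_1)
# Note: t(predTest) puts actual outcomes in rows, but confusionMatrix() reads rows as predictions,
# so Sensitivity/Pos Pred Value and Specificity/Neg Pred Value appear swapped in the output below.
confusionMatrix(t(predTest), positive='1')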
## Confusion Matrix and Statistics
## 
##    mypreds
##       0   1
##   0 139  25
##   1  33  65
##                                           
##                Accuracy : 0.7786          
##                  95% CI : (0.7234, 0.8274)
##     No Information Rate : 0.6565          
##     P-Value [Acc > NIR] : 1.121e-05       
##                                           
##                   Kappa : 0.5194          
##                                           
##  Mcnemar's Test P-Value : 0.358           
##                                           
##             Sensitivity : 0.7222          
##             Specificity : 0.8081          
##          Pos Pred Value : 0.6633          
##          Neg Pred Value : 0.8476          
##              Prevalence : 0.3435          
##          Detection Rate : 0.2481          
##    Detection Prevalence : 0.3740          
##       Balanced Accuracy : 0.7652          
##                                           
##        'Positive' Class : 1               
## 
fviz_cluster(list(data=test[,c(2,4:38)], cluster=mypreds+1), palette = "jco",ellipse.type = "none",  ggtheme = theme_minimal())

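#
# Measure performance of model: logistic regression benchmark
myglm=glm(data=train[,c(4, 7,5,6)], as.factor(survived_1)~., family='binomial')
mypred2=round(predict(myglm, test, type='response'),0)  # classify at a 0.5 threshold
# No positive= argument here, so the metrics below are reported for class 0
confusionMatrix(table(mypred2, test$survived_1))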
## Confusion Matrix and Statistics
## 
##        
## mypred2   0   1
##       0 139  33
##       1  25  65
##                                           
##                Accuracy : 0.7786          
##                  95% CI : (0.7234, 0.8274)
##     No Information Rate : 0.626           
##     P-Value [Acc > NIR] : 8.322e-08       
##                                           
##                   Kappa : 0.5194          
##                                           
##  Mcnemar's Test P-Value : 0.358           
##                                           
##             Sensitivity : 0.8476          
##             Specificity : 0.6633          
##          Pos Pred Value : 0.8081          
##          Neg Pred Value : 0.7222          
##              Prevalence : 0.6260          
##          Detection Rate : 0.5305          
##    Detection Prevalence : 0.6565          
##       Balanced Accuracy : 0.7554          
##                                           
##        'Positive' Class : 0               
## 
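
The F1 score is not part of the printed output, so it has to be derived from the confusion counts. The sketch below does that arithmetic, assuming the first printed table has actual outcomes in rows and cluster predictions in columns (the table is transposed before it is passed to `confusionMatrix()`); the object names are illustrative:

```r
# Counts copied from the first table: rows = actual outcome, columns = predicted
cm <- matrix(c(139, 33, 25, 65), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

TP <- cm["1", "1"]  # survivors predicted as survivors
FP <- cm["0", "1"]  # non-survivors predicted as survivors
FN <- cm["1", "0"]  # survivors predicted as non-survivors
TN <- cm["0", "0"]  # non-survivors predicted as non-survivors

accuracy    <- (TP + TN) / sum(cm)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              specificity = specificity, f1 = f1), 4))
```

F1 is symmetric in precision and recall, so it comes out the same whichever way the table is oriented; only the precision and recall labels depend on the orientation.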

Executive Summary


Overview

This study outlines a systematic, algorithmic approach to Exploratory Data Analysis (EDA) applied to the Titanic dataset. Using R and RStudio, the analysis covers data preprocessing, feature engineering, normalization, and clustering to derive insights and prepare the data for predictive modeling. Each step is designed to produce model-ready data while guarding against data leakage and overfitting.

Objective

The primary aim is to explore the Titanic dataset using repeatable EDA techniques to uncover insights, identify key features, and construct a reliable dataset for predictive modeling. The ultimate goal is to evaluate model performance on unseen test data through metrics like accuracy, recall, and F1 score.

Key Findings

  • Feature Engineering: Dummy variables and new features (e.g., Title extracted from names) made the data easier to interpret and gave the models additional predictors to work with.
  • Normalization: Right-skewed variables such as Age and Fare were transformed to approximate normal distributions. Min-max scaling ensured uniformity across features.
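  • Clustering and Model Performance: K-medoid clustering on three binary variables (gender and two passenger-class indicators) reached an accuracy of 77.86% and an F1 score of 0.6915 on the 262-row test set, the same confusion counts as a logistic regression on the same predictors.
  • Data Leakage Prevention: The data were split 80/20 (1,047 training rows, 262 testing rows) before imputation, transformation, and scaling, so those steps were fitted on the training set and then applied to the test set.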

Key Takeaways

  • Iterative Process: EDA involves repeated exploration and refinement, emphasizing data understanding and preprocessing.
  • Variable Classification: Correctly categorizing variables (e.g., categorical, ordinal, continuous) is crucial for meaningful analysis.
  • Outlier and Missing Data Management: Handling missing data via mode imputation and transforming outliers improved overall dataset quality.
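  • Performance Metrics: Evaluating several metrics gives a more balanced assessment of the model than accuracy alone. Reading the k-medoid confusion matrix with actual outcomes in rows, the survived class has precision 0.7222, recall 0.6633, and specificity 0.8476.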

Future Steps

  • Explore advanced clustering techniques such as Bayes-Bernoulli mixtures to improve robustness.
  • Investigate nonlinear transformation methods for features resistant to normalization.
  • Apply additional algorithms to benchmark model performance against the k-medoid approach.

Conclusion

This algorithmic EDA approach effectively prepared the Titanic dataset for predictive modeling, ensuring reliable performance metrics on unseen data. By combining feature engineering, robust data preprocessing, and careful validation, the study demonstrates a structured and repeatable framework for data analysis and model-building.

References

Xu, R., & Wunsch, D. C. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678. https://doi.org/10.1109/TNN.2005.845141

Sherlock, J., Muniswamaiah, M., Clarke, L., & Cicoria, S. (2018). Classification of Titanic passenger data and chances of surviving the disaster. arXiv preprint arXiv:1810.09851. https://arxiv.org/abs/1810.09851

Olson, D. W., Doescher, R. L., & Sinnott, R. W. (2012). Did the Moon sink the Titanic? Sky & Telescope, 123(4), 28–33. https://phys.org/news/2012-03-icebergs-accomplice-moon-titanic.html

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An introduction to statistical learning with applications in R (2nd ed.). Springer International Publishing. https://www.statlearning.com/

Wang, K., Wang, P., & Xu, C. (2022). Toward efficient automated feature engineering. arXiv preprint arXiv:2212.13152. https://arxiv.org/abs/2212.13152

Schubert, E., & Rousseeuw, P. J. (2023). Stop using the elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 25(1), 1–8. https://doi.org/10.1145/3606274.3606278

Hinton, W. (2024). Split, Transform, and Scale the Data Set. RPubs. https://www.rpubs.com/whinton/

Smeaton, A. (2003). NIST/SEMATECH Engineering Statistics Handbook. https://www.itl.nist.gov/div898/handbook/

R Programming for Statistics and Data Science. (2018). Packt Publishing (media available freely through O’Reilly Media Inc.).


This study was performed by Will Hinton.