##
## Loaded
## First few rows:
## pclass survived name sex age sibsp parch
## 1 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0
## 2 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2
## 3 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2
## ticket fare cabin embarked boat body home.dest
## 1 24160 211.3375 B5 S 2 NA St Louis, MO
## 2 113781 151.5500 C22 C26 S 11 NA Montreal, PQ / Chesterville, ON
## 3 113781 151.5500 C22 C26 S NA Montreal, PQ / Chesterville, ON
## Last few rows:
## pclass survived name sex age sibsp parch ticket
## 1307 3 0 Zakarian, Mr. Mapriededer male 26.5 0 0 2656
## 1308 3 0 Zakarian, Mr. Ortin male 27.0 0 0 2670
## 1309 3 0 Zimmerman, Mr. Leo male 29.0 0 0 315082
## fare cabin embarked boat body home.dest
## 1307 7.225 C 304
## 1308 7.225 C NA
## 1309 7.875 S NA
## 'data.frame': 1309 obs. of 14 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
## $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
## $ boat : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
## Rows: 1,309
## Columns: 14
## $ pclass <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ survived <int> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, …
## $ name <fct> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Tr…
## $ sex <fct> female, male, female, male, female, male, female, male, fema…
## $ age <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.0000,…
## $ sibsp <int> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ parch <int> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, …
## $ ticket <fct> 24160, 113781, 113781, 113781, 113781, 19952, 13502, 112050,…
## $ fare <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 7…
## $ cabin <fct> B5, C22 C26, C22 C26, C22 C26, C22 C26, E12, D7, A36, C101, …
## $ embarked <fct> S, S, S, S, S, S, S, S, S, C, C, C, C, S, S, S, C, C, C, C, …
## $ boat <fct> 2, 11, , , , 3, 10, , D, , , 4, 9, 6, B, , , 6, 8, A, 5, 5, …
## $ body <int> NA, NA, NA, 135, NA, NA, NA, NA, NA, 22, 124, NA, NA, NA, NA…
## $ home.dest <fct> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal…
## [1] Levels:
## x
## 0 1 Sum
## 809 500 1309
## [1] Levels:
## x
## 1 2 3 Sum
## 323 277 709 1309
## Missing female male Sum
## 0 466 843 1309
## Missing 0 1 2 3 4 5 8 Sum
## 0 891 319 42 20 22 6 9 1309
## [1] Levels:
## x
## 0 1 2 3 Sum
## 891 319 42 57 1309
## Missing 0 1 2 3 4 5 6 9 Sum
## 0 1002 170 113 8 6 6 2 2 1309
## [1] Levels:
## x
## 0 1 2 3 Sum
## 1002 170 113 24 1309
## [1] Levels:
## x
## 1 2 3 4 Sum
## 337 361 492 119 1309
## [1] Levels:
## x
## U A B C D E Sum
## 1041 22 65 94 46 41 1309
## Missing C Q S Sum
## 0 270 123 916 1309
## [1] Levels:
## x
## U USA OtherAmericas EURASIA Nordic
## 564 505 75 50 9
## Ireland England Sum
## 15 91 1309
## age fare
## 0.2009167303 0.0007639419
## integer(0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## vars n mean sd median trimmed mad min
## name* 1 1309 653.69 377.31 653.00 653.62 484.81 1
## age 2 1046 29.87 14.41 28.00 29.38 11.86 0
## fare 3 1308 33.30 51.76 14.45 21.57 10.24 0
## home.dest* 4 1309 31.11 31.07 22.00 28.47 31.13 1
## survived_1 5 1309 0.38 0.49 0.00 0.35 0.00 0
## pclass_1 6 1309 0.25 0.43 0.00 0.18 0.00 0
## pclass_2 7 1309 0.21 0.41 0.00 0.14 0.00 0
## sex_female 8 1309 0.36 0.48 0.00 0.32 0.00 0
## sibsp_1 9 1309 0.24 0.43 0.00 0.18 0.00 0
## sibsp_2 10 1309 0.03 0.18 0.00 0.00 0.00 0
## sibsp_3 11 1309 0.04 0.20 0.00 0.00 0.00 0
## parch_1 12 1309 0.13 0.34 0.00 0.04 0.00 0
## parch_2 13 1309 0.09 0.28 0.00 0.00 0.00 0
## parch_3 14 1309 0.02 0.13 0.00 0.00 0.00 0
## ticket_1 15 1309 0.26 0.44 0.00 0.20 0.00 0
## ticket_2 16 1309 0.28 0.45 0.00 0.22 0.00 0
## ticket_4 17 1309 0.09 0.29 0.00 0.00 0.00 0
## cabin_A 18 1309 0.02 0.13 0.00 0.00 0.00 0
## cabin_B 19 1309 0.05 0.22 0.00 0.00 0.00 0
## cabin_C 20 1309 0.07 0.26 0.00 0.00 0.00 0
## cabin_D 21 1309 0.04 0.18 0.00 0.00 0.00 0
## cabin_E 22 1309 0.03 0.17 0.00 0.00 0.00 0
## embarked_C 23 1309 0.21 0.40 0.00 0.13 0.00 0
## embarked_Q 24 1309 0.09 0.29 0.00 0.00 0.00 0
## home_England 25 1309 0.07 0.25 0.00 0.00 0.00 0
## home_EURASIA 26 1309 0.04 0.19 0.00 0.00 0.00 0
## home_Ireland 27 1309 0.01 0.11 0.00 0.00 0.00 0
## home_Nordic 28 1309 0.01 0.08 0.00 0.00 0.00 0
## home_OtherAmericas 29 1309 0.06 0.23 0.00 0.00 0.00 0
## home_USA 30 1309 0.39 0.49 0.00 0.36 0.00 0
## title_AdultFemale 31 1309 0.15 0.36 0.00 0.07 0.00 0
## title_MilitaryDocClass 32 1309 0.01 0.11 0.00 0.00 0.00 0
## title_Reverend 33 1309 0.01 0.08 0.00 0.00 0.00 0
## title_Royalty 34 1309 0.00 0.07 0.00 0.00 0.00 0
## title_YouthFemale 35 1309 0.20 0.40 0.00 0.12 0.00 0
## title_YouthMale 36 1309 0.05 0.21 0.00 0.00 0.00 0
## Index 37 1309 655.00 378.02 655.00 655.00 484.81 1
## max range skew kurtosis se
## name* 1307.00 1306.00 0.00 -1.20 10.43
## age 80.00 80.00 0.41 0.13 0.45
## fare 512.33 512.33 4.36 26.87 1.43
## home.dest* 101.00 100.00 0.39 -1.37 0.86
## survived_1 1.00 1.00 0.49 -1.77 0.01
## pclass_1 1.00 1.00 1.17 -0.62 0.01
## pclass_2 1.00 1.00 1.41 -0.01 0.01
## sex_female 1.00 1.00 0.60 -1.64 0.01
## sibsp_1 1.00 1.00 1.19 -0.58 0.01
## sibsp_2 1.00 1.00 5.30 26.16 0.00
## sibsp_3 1.00 1.00 4.47 17.98 0.01
## parch_1 1.00 1.00 2.20 2.84 0.01
## parch_2 1.00 1.00 2.94 6.66 0.01
## parch_3 1.00 1.00 7.17 49.48 0.00
## ticket_1 1.00 1.00 1.11 -0.77 0.01
## ticket_2 1.00 1.00 1.00 -1.00 0.01
## ticket_4 1.00 1.00 2.84 6.09 0.01
## cabin_A 1.00 1.00 7.51 54.43 0.00
## cabin_B 1.00 1.00 4.14 15.16 0.01
## cabin_C 1.00 1.00 3.31 8.98 0.01
## cabin_D 1.00 1.00 5.04 23.45 0.01
## cabin_E 1.00 1.00 5.38 26.91 0.00
## embarked_C 1.00 1.00 1.45 0.10 0.01
## embarked_Q 1.00 1.00 2.78 5.73 0.01
## home_England 1.00 1.00 3.38 9.44 0.01
## home_EURASIA 1.00 1.00 4.81 21.18 0.01
## home_Ireland 1.00 1.00 9.17 82.15 0.00
## home_Nordic 1.00 1.00 11.92 140.23 0.00
## home_OtherAmericas 1.00 1.00 3.81 12.49 0.01
## home_USA 1.00 1.00 0.47 -1.78 0.01
## title_AdultFemale 1.00 1.00 1.91 1.66 0.01
## title_MilitaryDocClass 1.00 1.00 9.17 82.15 0.00
## title_Reverend 1.00 1.00 12.66 158.38 0.00
## title_Royalty 1.00 1.00 14.65 212.84 0.00
## title_YouthFemale 1.00 1.00 1.51 0.28 0.01
## title_YouthMale 1.00 1.00 4.30 16.48 0.01
## Index 1309.00 1308.00 0.00 -1.20 10.45
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x151bea360>
## <environment: namespace:stats>
## 'data.frame': 1309 obs. of 36 variables:
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
## $ age : num 29 1 2 30 25 48 63 39 53 71 ...
## $ fare : num 211 152 152 152 152 ...
## $ home.dest : Factor w/ 101 levels "","AB","Argentina",..: 53 69 69 69 69 65 65 59 65 91 ...
## $ survived_1 : int 1 1 0 0 0 1 1 0 1 0 ...
## $ pclass_1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pclass_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sex_female : int 1 0 1 0 1 0 1 0 1 0 ...
## $ sibsp_1 : int 0 1 1 1 1 0 1 0 0 0 ...
## $ sibsp_2 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ sibsp_3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ parch_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ parch_2 : int 0 1 1 1 1 0 0 0 0 0 ...
## $ parch_3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ticket_1 : int 0 1 1 1 1 1 1 1 1 1 ...
## $ ticket_2 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ ticket_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cabin_A : int 0 0 0 0 0 0 0 1 0 0 ...
## $ cabin_B : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cabin_C : int 0 1 1 1 1 0 0 0 1 0 ...
## $ cabin_D : int 0 0 0 0 0 0 1 0 0 0 ...
## $ cabin_E : int 0 0 0 0 0 1 0 0 0 0 ...
## $ embarked_C : int 0 0 0 0 0 0 0 0 0 1 ...
## $ embarked_Q : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_England : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_EURASIA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_Ireland : int 0 0 0 0 0 0 0 1 0 0 ...
## $ home_Nordic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_OtherAmericas : int 0 1 1 1 1 0 0 0 0 1 ...
## $ home_USA : int 1 0 0 0 0 1 1 0 1 0 ...
## $ title_AdultFemale : int 0 0 0 0 1 0 0 0 1 0 ...
## $ title_MilitaryDocClass: int 0 0 0 0 0 0 0 0 0 0 ...
## $ title_Reverend : int 0 0 0 0 0 0 0 0 0 0 ...
## $ title_Royalty : int 0 0 0 0 0 0 0 0 0 0 ...
## $ title_YouthFemale : int 1 0 1 0 0 0 1 0 0 0 ...
## $ title_YouthMale : int 0 1 0 0 0 0 0 0 0 0 ...
## #########################################################################
## #########################################################################
## survived_1 sex_female title_AdultFemale title_YouthFemale
## survived_1
## sex_female 0.529***
## title_AdultFemale 0.356*** 0.575***
## title_YouthFemale 0.302*** 0.67*** -0.213***
## pclass_1 0.279*** 0.107*** 0.148*** -0.018
## ticket_1 0.256*** 0.084** 0.126*** -0.026
## home_USA 0.223*** 0.122*** 0.178*** -0.013
## embarked_C 0.182*** 0.067* 0.106*** -0.022
## parch_1 0.164*** 0.13*** 0.131*** 0.041
## cabin_B 0.16*** 0.094*** 0.087** 0.027
## pclass_1 ticket_1 home_USA embarked_C parch_1 cabin_B
## survived_1
## sex_female
## title_AdultFemale
## title_YouthFemale
## pclass_1
## ticket_1 0.83***
## home_USA 0.296*** 0.226***
## embarked_C 0.326*** 0.287*** 0.03
## parch_1 0.042 0.001 0.086** 0.089**
## cabin_B 0.399*** 0.324*** 0.064* 0.162*** 0.09**
## #########################################################################
## survived_1 sibsp_1 ticket_4 cabin_E cabin_C cabin_D sibsp_3
## survived_1
## sibsp_1 0.151***
## ticket_4 -0.134*** -0.037
## cabin_E 0.129*** 0.041 -0.011
## cabin_C 0.128*** 0.145*** -0.078** -0.05
## cabin_D 0.123*** 0.075** -0.046 -0.034 -0.053
## sibsp_3 -0.098*** -0.121*** -0.015 -0.038 -0.001 -0.041
## parch_2 0.077** 0.009 -0.003 -0.008 0.03 -0.029 0.401***
## title_Reverend -0.062* 0.001 -0.025 -0.014 -0.022 -0.015 -0.017
## title_YouthMale 0.057* 0.069* -0.019 0.002 -0.047 -0.042 0.361***
## home_Ireland -0.055* -0.061* -0.034 -0.019 -0.03 -0.021 -0.023
## parch_2 title_Reverend title_YouthMale home_Ireland
## survived_1
## sibsp_1
## ticket_4
## cabin_E
## cabin_C
## cabin_D
## sibsp_3
## parch_2
## title_Reverend -0.024
## title_YouthMale 0.255*** -0.017
## home_Ireland -0.033 -0.008 -0.024
## #########################################################################
## Seed 1234 set for reproducibility
## Training set size: 1047
## Testing set size: 262
## [1] 0
## [1] 0
## 'data.frame': 1047 obs. of 36 variables:
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 1306 803 964 779 32 619 63 663 357 613 ...
## $ age : num 27 21 28 28 4 23 40 26 42 27 ...
## $ fare : num 7.22 7.78 8.14 23.25 31.27 ...
## $ home.dest : Factor w/ 101 levels "","AB","Argentina",..: 1 1 1 1 52 1 46 1 65 1 ...
## $ survived_1 : int 0 1 0 1 0 0 0 0 0 1 ...
## $ pclass_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pclass_2 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ sex_female : int 0 0 1 0 0 0 0 0 0 1 ...
## $ sibsp_1 : int 0 0 0 0 0 0 1 0 1 0 ...
## $ sibsp_2 : int 0 0 0 1 0 0 0 1 0 0 ...
## $ sibsp_3 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ parch_1 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ parch_2 : int 0 0 0 0 1 0 0 0 0 1 ...
## $ parch_3 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ ticket_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ticket_2 : int 1 0 0 0 0 0 0 0 1 0 ...
## $ ticket_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cabin_A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cabin_B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cabin_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cabin_D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cabin_E : int 0 0 0 0 0 0 0 0 0 0 ...
## $ embarked_C : int 1 0 0 0 0 0 0 0 0 0 ...
## $ embarked_Q : int 0 0 1 1 0 0 0 0 0 0 ...
## $ home_England : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_EURASIA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_Ireland : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_Nordic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_OtherAmericas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_USA : int 0 0 0 0 1 0 1 0 1 0 ...
## $ title_AdultFemale : int 0 0 0 0 0 0 0 0 0 1 ...
## $ title_MilitaryDocClass: int 0 0 0 0 0 0 0 0 0 0 ...
## $ title_Reverend : int 0 0 0 0 0 0 0 0 0 0 ...
## $ title_Royalty : int 0 0 0 0 0 0 0 0 0 0 ...
## $ title_YouthFemale : int 0 0 1 0 0 0 0 0 0 0 ...
## $ title_YouthMale : int 0 0 0 0 1 0 0 0 0 0 ...
## Estimated transformation parameter
## train$age + 1
## 0.7941027
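###########################################################
# STEP 13. Scale data as necessary: fit on the training set, then apply to the test set.
###########################################################
# Min-max parameters come from the training set only, so no test-set information leaks in.
mymin = min(train$age)
mymax = max(train$age)
myrange = mymax - mymin
train$age = (train$age - mymin) / myrange
# Test ages outside the training range will scale to values below 0 or above 1.
test$age = (test$age - mymin) / myrange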
###########################################################
# STEP 14. Begin model building.
###########################################################
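#tr=data.matrix(train, rownames.force = NA)
# dist() needs numeric input, so restrict to numeric columns (factors such as name are skipped)
num_cols = sapply(train, is.numeric)
euclid = dist(train[, num_cols], method='euclidean')
# Inspect the pairwise distances among the first 36 passengers
eu = round(as.matrix(euclid)[1:36, 1:36], 1)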
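# Function for predicting clusters: assign each row of x to its nearest center
clusters <- function(x, centers) {
  # tmp[k, i] = squared Euclidean distance from row i of x to center k
  tmp <- sapply(seq_len(nrow(x)), function(i) apply(centers, 1, function(v) sum((x[i, ] - v)^2)))
  max.col(-t(tmp)) # index of the nearest center; ties are broken at random by default
}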
# Select gender, passenger class 1 and passenger class 2 for k-medoids (pam)
mykm=pam(train[,c(7,5,6)], 2, nstart=100)
fviz_cluster(mykm, data = train, palette = "jco", ellipse.type = "none", ggtheme = theme_minimal())
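# Test set performance
mypreds=t(clusters(test[,c(7,5,6)],mykm[["medoids"]])-1)
#u <- union(mypreds, test$survived_1)
predTest=table(mypreds, test$survived_1)
# Note: t() puts the actual classes in rows and the predictions in columns, whereas
# confusionMatrix() reads rows as predictions. In the output below, Sensitivity and
# Pos Pred Value are therefore exchanged, as are Specificity and Neg Pred Value.
confusionMatrix(t(predTest), positive='1')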
## Confusion Matrix and Statistics
##
## mypreds
## 0 1
## 0 139 25
## 1 33 65
##
## Accuracy : 0.7786
## 95% CI : (0.7234, 0.8274)
## No Information Rate : 0.6565
## P-Value [Acc > NIR] : 1.121e-05
##
## Kappa : 0.5194
##
## Mcnemar's Test P-Value : 0.358
##
## Sensitivity : 0.7222
## Specificity : 0.8081
## Pos Pred Value : 0.6633
## Neg Pred Value : 0.8476
## Prevalence : 0.3435
## Detection Rate : 0.2481
## Detection Prevalence : 0.3740
## Balanced Accuracy : 0.7652
##
## 'Positive' Class : 1
##
fviz_cluster(list(data=test[,c(2,4:38)], cluster=mypreds+1), palette = "jco",ellipse.type = "none", ggtheme = theme_minimal())
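#
# Measure performance of a logistic regression model on the test set
myglm=glm(data=train[,c(4, 7,5,6)], as.factor(survived_1)~., family='binomial')
# Round predicted probabilities to 0/1 class labels (0.5 threshold)
mypred2=round(predict(myglm, test, type='response'),0)
# Note: no `positive` argument is given, so class 0 is treated as the positive class here
confusionMatrix(table(mypred2, test$survived_1))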
## Confusion Matrix and Statistics
##
##
## mypred2 0 1
## 0 139 33
## 1 25 65
##
## Accuracy : 0.7786
## 95% CI : (0.7234, 0.8274)
## No Information Rate : 0.626
## P-Value [Acc > NIR] : 8.322e-08
##
## Kappa : 0.5194
##
## Mcnemar's Test P-Value : 0.358
##
## Sensitivity : 0.8476
## Specificity : 0.6633
## Pos Pred Value : 0.8081
## Neg Pred Value : 0.7222
## Prevalence : 0.6260
## Detection Rate : 0.5305
## Detection Prevalence : 0.6565
## Balanced Accuracy : 0.7554
##
## 'Positive' Class : 0
##
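The nearest-medoid assignment used above for the test-set cluster predictions can be sketched on toy data as follows. This is a minimal illustration, not the study's code: the helper name `assign_nearest`, the medoid values, and the new points are all hypothetical, and ties are broken deterministically here (`ties.method = "first"`), whereas `max.col()` breaks them at random by default.

```r
# Toy medoids (rows) over three binary features; values are illustrative only
medoids <- rbind(c(0, 0, 0),
                 c(1, 1, 0))

# Toy "test" observations to assign to a medoid
new_points <- rbind(c(0, 0, 1),
                    c(1, 1, 0),
                    c(1, 0, 0))

# Hypothetical helper: index of the nearest medoid for each row of x
assign_nearest <- function(x, centers) {
  # d[i, k] = squared Euclidean distance from row i of x to center k
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  # Smallest distance = largest negated distance; first center wins a tie
  max.col(-d, ties.method = "first")
}

assigned <- assign_nearest(new_points, medoids)
print(assigned)
```

With only a few binary features, several observations can sit at the same distance from both medoids (the third toy point does), so the tie-breaking rule can change the predicted cluster.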
Overview
This study outlines a systematic, algorithmic approach to Exploratory Data Analysis (EDA) applied to the Titanic dataset. Using R and RStudio, the analysis covers data preprocessing, feature engineering, normalization, and clustering to derive insights and prepare the data for predictive modeling. Each step is designed to produce model-ready data while guarding against data leakage and overfitting; for example, scaling parameters are computed on the training set and then applied unchanged to the test set.
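As a rough sketch of the leakage-safe normalization described here, the scaling parameters can be wrapped in a pair of helpers. The function names `fit_minmax` and `apply_minmax` and the toy age values are assumptions made for illustration; they do not appear in the original analysis.

```r
# Hypothetical helper: learn min-max parameters from training data only
fit_minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  list(min = rng[1], range = rng[2] - rng[1])
}

# Hypothetical helper: apply previously learned parameters to any vector
apply_minmax <- function(x, params) {
  (x - params$min) / params$range
}

# Toy ages, for illustration only
train_age <- c(27, 21, 28, 4, 40)
test_age  <- c(30, 80)

params       <- fit_minmax(train_age)            # fit on train
train_scaled <- apply_minmax(train_age, params)  # lies in [0, 1]
test_scaled  <- apply_minmax(test_age, params)   # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

Because the test set never contributes to `params`, a test value beyond the training maximum (80 in this toy example) scales to a number greater than 1 rather than shifting the training scale.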
Objective
The primary aim is to explore the Titanic dataset using repeatable EDA techniques to uncover insights, identify key features, and construct a reliable dataset for predictive modeling. The ultimate goal is to evaluate model performance on unseen test data through metrics like accuracy, recall, and F1 score.
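For reference, these metrics can be computed directly from confusion-matrix counts. The sketch below takes the four counts from the test-set confusion matrix printed earlier and treats survival (class 1) as the positive class; the variable names are illustrative.

```r
# Counts from the test-set confusion matrix, with class 1 (survived) as positive
tp <- 65   # predicted 1, actual 1
fp <- 25   # predicted 1, actual 0
fn <- 33   # predicted 0, actual 1
tn <- 139  # predicted 0, actual 0

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 4))
```

The accuracy works out to 204/262, which matches the 0.7786 reported in the output; the F1 score simplifies to 2·TP / (2·TP + FP + FN) = 130/188.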
Key Findings
Key Takeaways
Future Steps
Conclusion
This algorithmic EDA approach effectively prepared the Titanic dataset for predictive modeling, ensuring reliable performance metrics on unseen data. By combining feature engineering, robust data preprocessing, and careful validation, the study demonstrates a structured and repeatable framework for data analysis and model-building.
References
Xu, R., & Wunsch, D. C. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678. https://doi.org/10.1109/TNN.2005.845141
Sherlock, J., Muniswamaiah, M., Clarke, L., & Cicoria, S. (2018). Classification of Titanic passenger data and chances of surviving the disaster. arXiv preprint arXiv:1810.09851. https://arxiv.org/abs/1810.09851
Olson, D. W., Doescher, R. L., & Sinnott, R. W. (2012). Did the Moon sink the Titanic? Sky & Telescope, 123(4), 28–33. https://phys.org/news/2012-03-icebergs-accomplice-moon-titanic.html
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An introduction to statistical learning with applications in R (2nd ed.). Springer International Publishing. https://www.statlearning.com/
Wang, K., Wang, P., & Xu, C. (2022). Toward efficient automated feature engineering. arXiv preprint arXiv:2212.13152. https://arxiv.org/abs/2212.13152
Schubert, E., & Rousseeuw, P. J. (2023). Stop using the elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 25(1), 1–8. https://doi.org/10.1145/3606274.3606278
Hinton, W. (2024). Split, Transform, and Scale the Data Set. RPubs. https://www.rpubs.com/whinton/
Smeaton, A. (2003). NIST/SEMATECH Engineering Statistics Handbook. https://www.itl.nist.gov/div898/handbook/
R Programming for Statistics and Data Science (Media from Packt Publishing available freely through O’Reilly Media Inc.). (2018).
This study was performed by Will Hinton.