'data.frame': 1817 obs. of 18 variables:
$ aluminium : num -0.76 2.365 -0.694 1.236 -0.76 ...
$ ammonia : num -0.221 0.159 -1.264 -0.679 -1.53 ...
$ arsenic : num 0.802 2.303 0.984 -0.334 -0.289 ...
$ barium : num -0.00428 1.75753 0.2921 2.00451 -0.38298 ...
$ cadmium : num -0.0159 0.563 -0.5949 2.5892 2.0103 ...
$ chloramine : num 0.213 -0.584 1.765 0.358 -1.075 ...
$ chromium : num 0.2307 0.7676 -0.2705 -0.0557 -1.0221 ...
$ copper : num 0.777 1.263 0.95 0.699 -0.117 ...
$ bacteria : num 0.352 -0.985 2.024 -0.985 -0.134 ...
$ viruses : num -0.773 -0.786 1.863 -0.786 -0.768 ...
$ nitrates : num -0.877 0.262 0.968 -1.354 -0.967 ...
$ nitrites : num 1.453 1.158 2.298 -0.256 1.041 ...
$ perchlorate: num 1.98 0.242 0.496 1.75 -0.916 ...
$ radium : num 1.557 1.406 0.422 1.436 -0.867 ...
$ selenium : num -1.017 -0.67 1.063 -1.363 0.717 ...
$ silver : num 0.9325 1.1976 0.0707 0.9987 -0.791 ...
$ uranium : num -1.199 0.638 -0.464 -0.831 0.638 ...
$ label : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
Biomedical Engineering Program
A study of methods for classifying water samples by potability using the chemical characteristics of the samples
UFRJ-COPPE-PEB
COB820 - Neural Networks - 2025/3
Assignment 1
Student: Walner Passos
Instructors: Leticia Raposo and Diogo Antônio Tschoeke
Partial presentation:
Boruta - SVM - SHARP
Dataset Overview
We use the WaterQuality dataset, available on Kaggle at https://www.kaggle.com/datasets/mssmartypants/water-quality/data. The dataset was created from imaginary data on water quality in an urban environment.
Exploratory Data Analysis (EDA) & Preprocessing
Carried out previously
Data structure
Variables
Balancing
Boruta
Variable selection
Train: 1278 18
Test : 546 18
Boruta performed 15 iterations in 4.736029 secs.
17 attributes confirmed important: aluminium, ammonia, arsenic,
bacteria, barium and 12 more;
No attributes deemed unimportant.
All variables were selected
SVM
Splitting the data into training and test sets
Train: 1273 18
Test : 544 18
'data.frame': 1273 obs. of 18 variables:
$ aluminium : num -0.76 -0.694 1.236 -0.76 2.482 ...
$ ammonia : num -0.221 -1.264 -0.679 -1.53 -0.658 ...
$ arsenic : num 0.802 0.984 -0.334 -0.289 1.166 ...
$ barium : num -0.00428 0.2921 2.00451 -0.38298 -1.03337 ...
$ cadmium : num -0.0159 -0.5949 2.5892 2.0103 -0.5949 ...
$ chloramine : num 0.213 1.765 0.358 -1.075 0.136 ...
$ chromium : num 0.2307 -0.2705 -0.0557 -1.0221 0.9466 ...
$ copper : num 0.777 0.95 0.699 -0.117 0.542 ...
$ bacteria : num 0.352 2.024 -0.985 -0.134 -0.134 ...
$ viruses : num -0.773 1.863 -0.786 -0.768 -0.786 ...
$ nitrates : num -0.877 0.968 -1.354 -0.967 -1.549 ...
$ nitrites : num 1.453 2.298 -0.256 1.041 0.648 ...
$ perchlorate: num 1.98 0.496 1.75 -0.916 1.234 ...
$ radium : num 1.557 0.422 1.436 -0.867 -0.626 ...
$ selenium : num -1.0168 1.0633 -1.3634 0.7166 0.0233 ...
$ silver : num 0.9325 0.0707 0.9987 -0.791 1.7279 ...
$ uranium : num -1.1986 -0.4638 -0.8312 0.6383 -0.0964 ...
$ label : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
'data.frame': 544 obs. of 18 variables:
$ aluminium : num 2.365 -0.714 -0.714 -0.747 -0.714 ...
$ ammonia : num 0.159 -0.456 -1.221 0.607 0.846 ...
$ arsenic : num 2.303 0.166 -0.107 3.257 -0.47 ...
$ barium : num 1.758 0.712 -1.14 1.552 0.136 ...
$ cadmium : num 0.563 0.8524 0.563 -0.0159 -0.7975 ...
$ chloramine : num -0.584 2.029 -1.063 1.646 -0.91 ...
$ chromium : num 0.7676 0.0517 -0.9863 0.9466 0.9466 ...
$ copper : num 1.263 1.797 -1.231 0.856 -0.133 ...
$ bacteria : num -0.9852 -0.8028 -0.0126 0.1089 -0.1342 ...
$ viruses : num -0.786 -0.765 -0.786 -0.781 -0.781 ...
$ nitrates : num 0.262 -0.19 1.085 1.788 1.096 ...
$ nitrites : num 1.158 0.805 0.432 -0.276 -0.688 ...
$ perchlorate: num 0.242 -0.474 -0.789 -0.906 -0.286 ...
$ radium : num 1.406 1.05 1.26 -0.893 0.487 ...
$ selenium : num -0.6701 -0.3234 -0.6701 1.0633 0.0233 ...
$ silver : num 1.198 -0.327 -0.592 1.33 0.8 ...
$ uranium : num 0.6383 -1.1986 -0.0964 -1.1986 -0.0964 ...
$ label : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
Linear SVM
Training
Support Vector Machines with Linear Kernel
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
Pre-processing: centered (17), scaled (17)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 849, 849, 848
Resampling results across tuning parameters:
C ROC Sens Spec
1e-03 0.8104052 0.7318549 0.7668232
1e-02 0.8421539 0.7901726 0.7464789
1e-01 0.8517736 0.8075352 0.7370892
1e+00 0.8517390 0.8011863 0.7417840
1e+01 0.8517312 0.8043459 0.7433490
1e+02 0.8517460 0.8059257 0.7402191
1e+03 0.8517296 0.8075054 0.7386541
ROC was used to select the optimal model using the largest value.
The final value used for the model was C = 0.1.
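The selection rule stated above (largest resampled ROC) can be reproduced from the table itself. The following is a minimal Python sketch, not the R code used for the report; the dictionary simply copies the C and ROC columns, and the names `roc_by_cost` and `select_best` are illustrative.

```python
# Resampled ROC (mean AUC over the 3 folds) for each cost value C,
# copied from the tuning table above.
roc_by_cost = {
    1e-03: 0.8104052,
    1e-02: 0.8421539,
    1e-01: 0.8517736,
    1e+00: 0.8517390,
    1e+01: 0.8517312,
    1e+02: 0.8517460,
    1e+03: 0.8517296,
}


def select_best(results):
    """Return the tuning value whose metric is largest."""
    return max(results, key=results.get)


best_cost = select_best(roc_by_cost)

# Spread of ROC over the plateau C >= 0.1, to show how flat the curve is.
plateau = [roc for cost, roc in roc_by_cost.items() if cost >= 0.1]
plateau_spread = max(plateau) - min(plateau)

print("best C:", best_cost)
print("ROC spread for C >= 0.1:", round(plateau_spread, 7))
```

The ROC values for C between 0.1 and 1000 differ only in the fifth decimal place, so the choice of C = 0.1 rests on a very small margin.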
Prediction
Model performance
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 218 65
SIM 53 208
Accuracy : 0.7831
95% CI : (0.746, 0.817)
No Information Rate : 0.5018
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5662
Mcnemar's Test P-Value : 0.3112
Sensitivity : 0.7619
Specificity : 0.8044
Pos Pred Value : 0.7969
Neg Pred Value : 0.7703
Prevalence : 0.5018
Detection Rate : 0.3824
Detection Prevalence : 0.4798
Balanced Accuracy : 0.7832
'Positive' Class : SIM
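The statistics above follow directly from the four cells of the confusion matrix, with "SIM" as the positive class. The sketch below, in Python rather than the R used for the report, recomputes them from the counts; the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard two-class metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Linear SVM, default cutoff, positive class = "SIM":
#   predicted SIM / reference SIM = 208 (TP)
#   predicted NÃO / reference SIM =  65 (FN)
#   predicted SIM / reference NÃO =  53 (FP)
#   predicted NÃO / reference NÃO = 218 (TN)
linear_default = binary_metrics(tp=208, fn=65, fp=53, tn=218)

for name, value in linear_default.items():
    print(f"{name:18s} {value:.4f}")
```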
ROC curve and AUC
Using the new cutoff point
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 209 51
SIM 62 222
Accuracy : 0.7923
95% CI : (0.7557, 0.8256)
No Information Rate : 0.5018
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5845
Mcnemar's Test P-Value : 0.3468
Sensitivity : 0.8132
Specificity : 0.7712
Pos Pred Value : 0.7817
Neg Pred Value : 0.8038
Prevalence : 0.5018
Detection Rate : 0.4081
Detection Prevalence : 0.5221
Balanced Accuracy : 0.7922
'Positive' Class : SIM
95% CI: 0.835-0.8955 (DeLong)
95% CI (2000 stratified bootstrap replicates):
thresholds sp.low sp.median sp.high se.low se.median se.high
0.45 0.7196 0.7712 0.8192 0.7656 0.8132 0.8572
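The report does not state which criterion produced the 0.45 cutoff. One common choice is Youden's J (sensitivity + specificity - 1), and the Python sketch below uses it, as an assumption, only to compare the two linear-SVM confusion matrices shown above. The function names are illustrative.

```python
def sens_spec(tp, fn, fp, tn):
    """Sensitivity and specificity from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


def youden_j(sensitivity, specificity):
    """Youden's J statistic: 0 for a useless test, 1 for a perfect one."""
    return sensitivity + specificity - 1.0


# Linear SVM on the test set, positive class = "SIM".
default_cut = sens_spec(tp=208, fn=65, fp=53, tn=218)  # default cutoff
new_cut = sens_spec(tp=222, fn=51, fp=62, tn=209)      # cutoff 0.45

j_default = youden_j(*default_cut)
j_new = youden_j(*new_cut)

print("default cutoff: sens=%.4f spec=%.4f J=%.4f" % (*default_cut, j_default))
print("cutoff 0.45   : sens=%.4f spec=%.4f J=%.4f" % (*new_cut, j_new))
```

Lowering the cutoff trades specificity for sensitivity; under the J criterion assumed here, the new cutoff scores slightly higher.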
Radial SVM
Training
Support Vector Machines with Radial Basis Function Kernel
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
Pre-processing: centered (17), scaled (17)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 849, 849, 848
Resampling results across tuning parameters:
sigma C ROC Sens Spec
0.01897393 0.25 0.8674059 0.7964917 0.7824726
0.01897393 0.50 0.8892053 0.8185713 0.7965571
0.01897393 1.00 0.9115756 0.8217309 0.8184664
0.01897393 2.00 0.9275342 0.8154043 0.8716745
0.01897393 4.00 0.9353716 0.8327372 0.8654147
0.01897393 8.00 0.9383677 0.8406510 0.8638498
0.01897393 16.00 0.9387108 0.8548392 0.8716745
0.01897393 32.00 0.9359928 0.8390489 0.8669797
0.01897393 64.00 0.9318069 0.8374691 0.8810642
0.01897393 128.00 0.9267738 0.8169767 0.8779343
0.02550416 0.25 0.8788747 0.8043831 0.7840376
0.02550416 0.50 0.9016487 0.8122671 0.8043818
0.02550416 1.00 0.9228068 0.8264479 0.8482003
0.02550416 2.00 0.9336084 0.8280277 0.8654147
0.02550416 4.00 0.9374928 0.8343319 0.8716745
0.02550416 8.00 0.9390805 0.8469552 0.8826291
0.02550416 16.00 0.9371120 0.8406361 0.8763693
0.02550416 32.00 0.9330078 0.8374840 0.8779343
0.02550416 64.00 0.9271603 0.8264330 0.8607199
0.02550416 128.00 0.9249945 0.8311649 0.8544601
0.03203439 0.25 0.8868999 0.8043831 0.7902973
0.03203439 0.50 0.9106657 0.8075278 0.8215962
0.03203439 1.00 0.9288066 0.8169767 0.8732394
0.03203439 2.00 0.9358012 0.8280277 0.8732394
0.03203439 4.00 0.9393686 0.8374989 0.8685446
0.03203439 8.00 0.9384975 0.8564190 0.8716745
0.03203439 16.00 0.9347501 0.8485350 0.8732394
0.03203439 32.00 0.9286200 0.8201288 0.8763693
0.03203439 64.00 0.9261721 0.8248607 0.8575900
0.03203439 128.00 0.9242956 0.8359191 0.8482003
0.03856462 0.25 0.8920084 0.8075129 0.7949922
0.03856462 0.50 0.9168763 0.8122448 0.8372457
0.03856462 1.00 0.9315391 0.8201288 0.8685446
0.03856462 2.00 0.9368244 0.8296000 0.8701095
0.03856462 4.00 0.9388075 0.8422308 0.8763693
0.03856462 8.00 0.9361648 0.8485275 0.8669797
0.03856462 16.00 0.9309071 0.8406659 0.8748044
0.03856462 32.00 0.9274753 0.8327595 0.8622848
0.03856462 64.00 0.9258507 0.8280351 0.8544601
0.03856462 128.00 0.9229732 0.8375361 0.8466354
0.04509484 0.25 0.8956559 0.8075129 0.7949922
0.04509484 0.50 0.9203035 0.8075129 0.8497653
0.04509484 1.00 0.9326428 0.8169767 0.8763693
0.04509484 2.00 0.9371161 0.8311798 0.8669797
0.04509484 4.00 0.9378384 0.8516796 0.8591549
0.04509484 8.00 0.9339368 0.8501073 0.8607199
0.04509484 16.00 0.9285761 0.8359340 0.8638498
0.04509484 32.00 0.9265008 0.8343468 0.8528951
0.04509484 64.00 0.9245281 0.8469776 0.8372457
0.04509484 128.00 0.9215804 0.8327968 0.8356808
0.05162507 0.25 0.8986772 0.8027736 0.7965571
0.05162507 0.50 0.9226065 0.8122448 0.8638498
0.05162507 1.00 0.9330510 0.8232883 0.8716745
0.05162507 2.00 0.9370213 0.8374989 0.8685446
0.05162507 4.00 0.9363805 0.8422159 0.8669797
0.05162507 8.00 0.9317070 0.8437956 0.8638498
0.05162507 16.00 0.9276301 0.8343542 0.8622848
0.05162507 32.00 0.9256801 0.8390861 0.8482003
0.05162507 64.00 0.9234703 0.8359489 0.8450704
0.05162507 128.00 0.9209500 0.8391085 0.8294210
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.03203439 and C = 4.
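To make the role of sigma concrete, here is a minimal Python sketch of a radial basis function kernel. It assumes the parameterisation k(x, y) = exp(-sigma * ||x - y||^2), which is the form used by kernlab's `rbfdot`; the report does not show the fitting code, so this is an assumption, and the two points are hypothetical.

```python
import math

SIGMA = 0.03203439  # value selected in the tuning output above


def rbf_kernel(x, y, sigma=SIGMA):
    """RBF kernel, assuming the form exp(-sigma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# Hypothetical points; their squared distance is 3^2 + 4^2 = 25.
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

k_same = rbf_kernel(point_a, point_a)  # identical points give similarity 1
k_far = rbf_kernel(point_a, point_b)   # similarity decays with distance

print("k(a, a) =", k_same)
print("k(a, b) =", k_far)
```

A larger sigma makes the similarity decay faster with distance, which gives a more flexible decision boundary.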
Prediction
Model performance
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 231 21
SIM 40 252
Accuracy : 0.8879
95% CI : (0.8583, 0.9131)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7757
Mcnemar's Test P-Value : 0.02119
Sensitivity : 0.9231
Specificity : 0.8524
Pos Pred Value : 0.8630
Neg Pred Value : 0.9167
Prevalence : 0.5018
Detection Rate : 0.4632
Detection Prevalence : 0.5368
Balanced Accuracy : 0.8877
'Positive' Class : SIM
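The McNemar p-value above depends only on the two off-diagonal cells (21 and 40). The Python sketch below recomputes it with the continuity-corrected chi-squared statistic on 1 degree of freedom. That this is the variant used in the report is an assumption, although it reproduces the printed values.

```python
import math


def mcnemar_p_value(b, c):
    """McNemar's test with continuity correction.

    b and c are the two discordant (off-diagonal) counts.
    Returns (statistic, p_value).
    """
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-squared distribution with 1 degree of freedom:
    # P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Radial SVM, default cutoff: 21 and 40 misclassified samples.
stat_radial, p_radial = mcnemar_p_value(21, 40)
# Linear SVM, default cutoff: 65 and 53 misclassified samples.
stat_linear, p_linear = mcnemar_p_value(65, 53)

print("radial: statistic=%.4f p=%.5f" % (stat_radial, p_radial))
print("linear: statistic=%.4f p=%.5f" % (stat_linear, p_linear))
```

A small p-value indicates that the two kinds of error are not equally frequent; here the radial model at the default cutoff predicts "SIM" wrongly more often than it predicts "NÃO" wrongly.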
ROC curve and AUC
Using the new cutoff point
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 246 29
SIM 25 244
Accuracy : 0.9007
95% CI : (0.8725, 0.9245)
No Information Rate : 0.5018
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8015
Mcnemar's Test P-Value : 0.6831
Sensitivity : 0.8938
Specificity : 0.9077
Pos Pred Value : 0.9071
Neg Pred Value : 0.8945
Prevalence : 0.5018
Detection Rate : 0.4485
Detection Prevalence : 0.4945
Balanced Accuracy : 0.9008
'Positive' Class : SIM
95% CI: 0.9489-0.9754 (DeLong)
95% CI (2000 stratified bootstrap replicates):
thresholds sp.low sp.median sp.high se.low se.median se.high
0.617 0.8744 0.9077 0.941 0.8571 0.8938 0.9268
Polynomial SVM
Training
Support Vector Machines with Polynomial Kernel
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
Pre-processing: centered (17), scaled (17)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 849, 849, 848
Resampling results across tuning parameters:
degree scale C ROC Sens Spec
1 0.001 0.25 0.7980302 0.8769635 0.4960876
1 0.001 0.50 0.7987269 0.8532892 0.5539906
1 0.001 1.00 0.8097275 0.7365719 0.7668232
1 0.001 2.00 0.8205064 0.7665802 0.7558685
1 0.010 0.25 0.8246704 0.7760514 0.7558685
1 0.010 0.50 0.8361054 0.7902471 0.7527387
1 0.010 1.00 0.8441328 0.7981385 0.7449139
1 0.010 2.00 0.8489654 0.7981460 0.7433490
1 0.100 0.25 0.8503133 0.7950088 0.7449139
1 0.100 0.50 0.8526180 0.8013056 0.7417840
1 0.100 1.00 0.8539824 0.8060523 0.7402191
1 0.100 2.00 0.8544430 0.8154863 0.7386541
1 1.000 0.25 0.8541911 0.8123491 0.7386541
1 1.000 0.50 0.8543093 0.8170661 0.7386541
1 1.000 1.00 0.8540077 0.8029002 0.7496088
1 1.000 2.00 0.8539410 0.8139214 0.7417840
2 0.001 0.25 0.7989566 0.8532892 0.5508607
2 0.001 0.50 0.8100607 0.7349921 0.7683881
2 0.001 1.00 0.8209804 0.7697323 0.7558685
2 0.001 2.00 0.8331873 0.7918269 0.7511737
2 0.010 0.25 0.8449920 0.7886748 0.7621283
2 0.010 0.50 0.8574986 0.8044279 0.7668232
2 0.010 1.00 0.8709039 0.8170437 0.7668232
2 0.010 2.00 0.8886511 0.8217831 0.7887324
2 0.100 0.25 0.9386674 0.8549063 0.8591549
2 0.100 0.50 0.9388897 0.8580584 0.8607199
2 0.100 1.00 0.9388538 0.8502116 0.8701095
2 0.100 2.00 0.9391435 0.8548988 0.8763693
2 1.000 0.25 0.9216178 0.8485499 0.8607199
2 1.000 0.50 0.9216467 0.8406659 0.8575900
2 1.000 1.00 0.9207909 0.8343915 0.8701095
2 1.000 2.00 0.9189020 0.8123193 0.8716745
3 0.001 0.25 0.8039510 0.7397314 0.7636933
3 0.001 0.50 0.8170400 0.7523622 0.7652582
3 0.001 1.00 0.8291880 0.7870875 0.7574335
3 0.001 2.00 0.8406151 0.7949790 0.7496088
3 0.010 0.25 0.8626735 0.8012534 0.7715180
3 0.010 0.50 0.8812267 0.8186012 0.7840376
3 0.010 1.00 0.9028550 0.8186012 0.8059468
3 0.010 2.00 0.9210958 0.8201884 0.8419405
3 0.100 0.25 0.9372542 0.8486020 0.8732394
3 0.100 0.50 0.9340101 0.8296745 0.8779343
3 0.100 1.00 0.9314705 0.8359862 0.8763693
3 0.100 2.00 0.9294665 0.8281096 0.8810642
3 1.000 0.25 0.8606339 0.7618111 0.8090767
3 1.000 0.50 0.8616945 0.7571016 0.8200313
3 1.000 1.00 0.8630592 0.7602313 0.8200313
3 1.000 2.00 0.8597513 0.7697174 0.8137715
ROC was used to select the optimal model using the largest value.
The final values used for the model were degree = 2, scale = 0.1 and C = 2.
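The three tuned values map onto a polynomial kernel. The Python sketch below assumes the form k(x, y) = (scale * <x, y> + offset)^degree with offset = 1, as in kernlab's `polydot`; the offset does not appear in the output above, so it is an assumption, and the two vectors are hypothetical.

```python
DEGREE = 2    # selected degree
SCALE = 0.1   # selected scale
OFFSET = 1.0  # assumption: not reported in the tuning output


def poly_kernel(x, y, degree=DEGREE, scale=SCALE, offset=OFFSET):
    """Polynomial kernel, assuming (scale * dot(x, y) + offset) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (scale * dot + offset) ** degree


# Hypothetical vectors with dot product 1*3 + 2*4 = 11.
vec_x = [1.0, 2.0]
vec_y = [3.0, 4.0]

k_xy = poly_kernel(vec_x, vec_y)              # (0.1 * 11 + 1) ** 2
k_linear = poly_kernel(vec_x, vec_y, degree=1)  # degree 1 is an affine map of the dot product

print("degree 2:", k_xy)
print("degree 1:", k_linear)
```

With degree = 1 the kernel is the dot product rescaled and shifted, which is consistent with the degree-1 rows of the table reaching ROC values close to those of the linear SVM.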
Prediction
Performance
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 234 16
SIM 37 257
Accuracy : 0.9026
95% CI : (0.8745, 0.9262)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8051
Mcnemar's Test P-Value : 0.00601
Sensitivity : 0.9414
Specificity : 0.8635
Pos Pred Value : 0.8741
Neg Pred Value : 0.9360
Prevalence : 0.5018
Detection Rate : 0.4724
Detection Prevalence : 0.5404
Balanced Accuracy : 0.9024
'Positive' Class : SIM
ROC curve and AUC
Using the new cutoff point
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 237 18
SIM 34 255
Accuracy : 0.9044
95% CI : (0.8765, 0.9278)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8088
Mcnemar's Test P-Value : 0.03751
Sensitivity : 0.9341
Specificity : 0.8745
Pos Pred Value : 0.8824
Neg Pred Value : 0.9294
Prevalence : 0.5018
Detection Rate : 0.4688
Detection Prevalence : 0.5312
Balanced Accuracy : 0.9043
'Positive' Class : SIM
95% CI: 0.9465-0.9754 (DeLong)
95% CI (2000 stratified bootstrap replicates):
thresholds sp.low sp.median sp.high se.low se.median se.high
0.526 0.8339 0.8745 0.9114 0.9011 0.9341 0.9597
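Kappa corrects accuracy for the agreement expected by chance from the row and column totals. The Python sketch below recomputes it for the two polynomial-SVM confusion matrices shown above; the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Chance agreement: for each class, (predicted total) * (reference total).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return (observed - expected) / (1 - expected)


# Polynomial SVM on the test set, positive class = "SIM".
kappa_default = cohen_kappa(tp=257, fn=16, fp=37, tn=234)  # default cutoff
kappa_new = cohen_kappa(tp=255, fn=18, fp=34, tn=237)      # cutoff 0.526

print("kappa, default cutoff: %.4f" % kappa_default)
print("kappa, cutoff 0.526  : %.4f" % kappa_new)
```

Because the classes are almost perfectly balanced (prevalence 0.5018), chance agreement is close to 0.5 and kappa is roughly 2 * accuracy - 1.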
SVM - Comparing the results
Call:
summary.resamples(object = results)
Models: SVML, SVMR, SVMP
Number of resamples: 3
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVML 0.8397748 0.8428899 0.8460049 0.8517736 0.8577730 0.8695411 0
SVMR 0.9300225 0.9304675 0.9309125 0.9393686 0.9440416 0.9571707 0
SVMP 0.9332265 0.9366062 0.9399858 0.9391435 0.9421020 0.9442182 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVML 0.7867299 0.7962085 0.8056872 0.8075352 0.8179379 0.8301887 0
SVMR 0.7914692 0.8246445 0.8578199 0.8374989 0.8605137 0.8632075 0
SVMP 0.8490566 0.8510686 0.8530806 0.8548988 0.8578199 0.8625592 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVML 0.7042254 0.7276995 0.7511737 0.7370892 0.7535211 0.7558685 0
SVMR 0.8262911 0.8568075 0.8873239 0.8685446 0.8896714 0.8920188 0
SVMP 0.8450704 0.8638498 0.8826291 0.8763693 0.8920188 0.9014085 0
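With only 3 resamples, the Min., Median and Max. columns are the three per-fold values themselves, so the reported means can be checked and the fold-to-fold spread made explicit. A minimal Python sketch for the ROC block follows; the variable names are illustrative.

```python
from statistics import mean

# Per-fold ROC values, read from the Min., Median and Max. columns above.
fold_roc = {
    "SVML": [0.8397748, 0.8460049, 0.8695411],
    "SVMR": [0.9300225, 0.9309125, 0.9571707],
    "SVMP": [0.9332265, 0.9399858, 0.9442182],
}

# Means as printed in the summary, for comparison.
reported_mean = {"SVML": 0.8517736, "SVMR": 0.9393686, "SVMP": 0.9391435}

summary = {}
for model, values in fold_roc.items():
    summary[model] = {
        "mean": mean(values),
        "range": max(values) - min(values),
    }
    print(model, "mean=%.7f range=%.4f" % (summary[model]["mean"],
                                            summary[model]["range"]))

# Gap between the two best models, to set against the fold-to-fold range.
gap_radial_poly = abs(summary["SVMR"]["mean"] - summary["SVMP"]["mean"])
print("SVMR vs SVMP mean gap: %.7f" % gap_radial_poly)
```

The mean ROC of the radial and polynomial models differs by far less than either model varies across folds, so the cross-validation results alone do not separate the two; both are clearly ahead of the linear model.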