Biomedical Engineering Program

Study of methods for classifying water samples by potability using the chemical characteristics of the samples

Published

August 10, 2025

UFRJ-COPPE-PEB
COB820 - Neural Networks - 2025/3
Assignment 1

Student: Walner Passos

Instructors: Leticia Raposo and Diogo Antônio Tschoeke

Partial presentation:

Boruta - SVM - SHAP

Dataset Overview

We use the WaterQuality dataset, available on Kaggle at https://www.kaggle.com/datasets/mssmartypants/water-quality/data. It is a synthetic dataset, created from imaginary data on water quality in an urban environment.

Exploratory Data Analysis (EDA) & Preprocessing

Performed in an earlier stage of the work.

Data structure

Variables

'data.frame':   1817 obs. of  18 variables:
 $ aluminium  : num  -0.76 2.365 -0.694 1.236 -0.76 ...
 $ ammonia    : num  -0.221 0.159 -1.264 -0.679 -1.53 ...
 $ arsenic    : num  0.802 2.303 0.984 -0.334 -0.289 ...
 $ barium     : num  -0.00428 1.75753 0.2921 2.00451 -0.38298 ...
 $ cadmium    : num  -0.0159 0.563 -0.5949 2.5892 2.0103 ...
 $ chloramine : num  0.213 -0.584 1.765 0.358 -1.075 ...
 $ chromium   : num  0.2307 0.7676 -0.2705 -0.0557 -1.0221 ...
 $ copper     : num  0.777 1.263 0.95 0.699 -0.117 ...
 $ bacteria   : num  0.352 -0.985 2.024 -0.985 -0.134 ...
 $ viruses    : num  -0.773 -0.786 1.863 -0.786 -0.768 ...
 $ nitrates   : num  -0.877 0.262 0.968 -1.354 -0.967 ...
 $ nitrites   : num  1.453 1.158 2.298 -0.256 1.041 ...
 $ perchlorate: num  1.98 0.242 0.496 1.75 -0.916 ...
 $ radium     : num  1.557 1.406 0.422 1.436 -0.867 ...
 $ selenium   : num  -1.017 -0.67 1.063 -1.363 0.717 ...
 $ silver     : num  0.9325 1.1976 0.0707 0.9987 -0.791 ...
 $ uranium    : num  -1.199 0.638 -0.464 -0.831 0.638 ...
 $ label      : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...

Class balancing

Boruta

Feature selection

 Train:  1278 18 
 Test :  546 18
Boruta performed 15 iterations in 4.736029 secs.
 17 attributes confirmed important: aluminium, ammonia, arsenic,
bacteria, barium and 12 more;
 No attributes deemed unimportant.

All 17 predictors were confirmed as important, so every variable was kept.
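Boruta judges each real variable against "shadow" copies whose values have been shuffled, so that a variable only counts as important if it scores above what chance alone produces. The sketch below illustrates a single round of that comparison in Python (the report itself was produced in R). It is not the Boruta implementation: absolute correlation with the label stands in for the random forest importance that Boruta uses, the data are synthetic, and the names `abs_corr` and `beats_shadows` are hypothetical.

```python
import random
import statistics

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5

def beats_shadows(features, label, rng):
    """Per feature: does it score above the best shuffled (shadow) copy?"""
    shadow_scores = []
    for values in features.values():
        shadow = values[:]
        rng.shuffle(shadow)  # shuffling breaks any link with the label
        shadow_scores.append(abs_corr(shadow, label))
    best_shadow = max(shadow_scores)
    return {name: abs_corr(values, label) > best_shadow
            for name, values in features.items()}

rng = random.Random(0)
label = [rng.randint(0, 1) for _ in range(300)]
features = {
    "informative": [y + rng.gauss(0, 0.5) for y in label],  # depends on the label
    "noise": [rng.gauss(0, 1) for _ in label],              # unrelated to the label
}
hits = beats_shadows(features, label, rng)
print(hits)
```

Boruta repeats this comparison over many iterations and applies a statistical test to the number of wins, which is why the output above reports 15 iterations before confirming all 17 attributes.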

SVM

Splitting the data into training and test sets

 Train:  1273 18 
 Test :  544 18 
 
'data.frame':   1273 obs. of  18 variables:
 $ aluminium  : num  -0.76 -0.694 1.236 -0.76 2.482 ...
 $ ammonia    : num  -0.221 -1.264 -0.679 -1.53 -0.658 ...
 $ arsenic    : num  0.802 0.984 -0.334 -0.289 1.166 ...
 $ barium     : num  -0.00428 0.2921 2.00451 -0.38298 -1.03337 ...
 $ cadmium    : num  -0.0159 -0.5949 2.5892 2.0103 -0.5949 ...
 $ chloramine : num  0.213 1.765 0.358 -1.075 0.136 ...
 $ chromium   : num  0.2307 -0.2705 -0.0557 -1.0221 0.9466 ...
 $ copper     : num  0.777 0.95 0.699 -0.117 0.542 ...
 $ bacteria   : num  0.352 2.024 -0.985 -0.134 -0.134 ...
 $ viruses    : num  -0.773 1.863 -0.786 -0.768 -0.786 ...
 $ nitrates   : num  -0.877 0.968 -1.354 -0.967 -1.549 ...
 $ nitrites   : num  1.453 2.298 -0.256 1.041 0.648 ...
 $ perchlorate: num  1.98 0.496 1.75 -0.916 1.234 ...
 $ radium     : num  1.557 0.422 1.436 -0.867 -0.626 ...
 $ selenium   : num  -1.0168 1.0633 -1.3634 0.7166 0.0233 ...
 $ silver     : num  0.9325 0.0707 0.9987 -0.791 1.7279 ...
 $ uranium    : num  -1.1986 -0.4638 -0.8312 0.6383 -0.0964 ...
 $ label      : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
'data.frame':   544 obs. of  18 variables:
 $ aluminium  : num  2.365 -0.714 -0.714 -0.747 -0.714 ...
 $ ammonia    : num  0.159 -0.456 -1.221 0.607 0.846 ...
 $ arsenic    : num  2.303 0.166 -0.107 3.257 -0.47 ...
 $ barium     : num  1.758 0.712 -1.14 1.552 0.136 ...
 $ cadmium    : num  0.563 0.8524 0.563 -0.0159 -0.7975 ...
 $ chloramine : num  -0.584 2.029 -1.063 1.646 -0.91 ...
 $ chromium   : num  0.7676 0.0517 -0.9863 0.9466 0.9466 ...
 $ copper     : num  1.263 1.797 -1.231 0.856 -0.133 ...
 $ bacteria   : num  -0.9852 -0.8028 -0.0126 0.1089 -0.1342 ...
 $ viruses    : num  -0.786 -0.765 -0.786 -0.781 -0.781 ...
 $ nitrates   : num  0.262 -0.19 1.085 1.788 1.096 ...
 $ nitrites   : num  1.158 0.805 0.432 -0.276 -0.688 ...
 $ perchlorate: num  0.242 -0.474 -0.789 -0.906 -0.286 ...
 $ radium     : num  1.406 1.05 1.26 -0.893 0.487 ...
 $ selenium   : num  -0.6701 -0.3234 -0.6701 1.0633 0.0233 ...
 $ silver     : num  1.198 -0.327 -0.592 1.33 0.8 ...
 $ uranium    : num  0.6383 -1.1986 -0.0964 -1.1986 -0.0964 ...
 $ label      : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
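The sizes above (1273 training rows and 544 test rows out of 1817) are consistent with a split of roughly 70/30. The report does not show the call that produced it, so the following Python sketch is only an assumption about the procedure: it draws the training fraction separately within each class so that both sets keep the same class proportions. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, rng):
    """Return (train_idx, test_idx), sampling train_fraction within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy labels (hypothetical counts, not the report's data)
labels = ["SIM"] * 50 + ["NÃO"] * 50
train_idx, test_idx = stratified_split(labels, 0.7, random.Random(42))
print(len(train_idx), len(test_idx))  # 70 30
```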

Linear SVM

Training

Support Vector Machines with Linear Kernel 

1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

Pre-processing: centered (17), scaled (17) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 849, 849, 848 
Resampling results across tuning parameters:

  C      ROC        Sens       Spec     
  1e-03  0.8104052  0.7318549  0.7668232
  1e-02  0.8421539  0.7901726  0.7464789
  1e-01  0.8517736  0.8075352  0.7370892
  1e+00  0.8517390  0.8011863  0.7417840
  1e+01  0.8517312  0.8043459  0.7433490
  1e+02  0.8517460  0.8059257  0.7402191
  1e+03  0.8517296  0.8075054  0.7386541

ROC was used to select the optimal model using the largest value.
The final value used for the model was C = 0.1.
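The selection rule stated in the output (largest cross-validated ROC) can be reproduced from the table itself. A minimal Python sketch, using the rows copied from the resampling results above:

```python
# (C, ROC, Sens, Spec) rows copied from the resampling table
grid = [
    (1e-03, 0.8104052, 0.7318549, 0.7668232),
    (1e-02, 0.8421539, 0.7901726, 0.7464789),
    (1e-01, 0.8517736, 0.8075352, 0.7370892),
    (1e+00, 0.8517390, 0.8011863, 0.7417840),
    (1e+01, 0.8517312, 0.8043459, 0.7433490),
    (1e+02, 0.8517460, 0.8059257, 0.7402191),
    (1e+03, 0.8517296, 0.8075054, 0.7386541),
]

best = max(grid, key=lambda row: row[1])  # largest ROC wins
print(best[0])  # 0.1

# Spread of ROC among all candidates with C >= 0.1
plateau = [roc for c, roc, _, _ in grid if c >= 0.1]
print(max(plateau) - min(plateau) < 1e-4)  # True
```

From C = 0.1 upward the ROC values differ by less than 0.0001, so the choice of C = 0.1 over larger values is essentially a tie on this metric.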

Prediction

Model performance

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 218  65
       SIM  53 208
                                        
               Accuracy : 0.7831        
                 95% CI : (0.746, 0.817)
    No Information Rate : 0.5018        
    P-Value [Acc > NIR] : <2e-16        
                                        
                  Kappa : 0.5662        
                                        
 Mcnemar's Test P-Value : 0.3112        
                                        
            Sensitivity : 0.7619        
            Specificity : 0.8044        
         Pos Pred Value : 0.7969        
         Neg Pred Value : 0.7703        
             Prevalence : 0.5018        
         Detection Rate : 0.3824        
   Detection Prevalence : 0.4798        
      Balanced Accuracy : 0.7832        
                                        
       'Positive' Class : SIM           
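
The statistics in this output all follow from the four counts of the confusion matrix. As a check, a short Python sketch that recomputes them with SIM as the positive class (the function name is hypothetical; the counts are those printed above):

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics for a 2x2 confusion matrix; the caller decides which class is positive."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Rows are predictions, columns are the reference; positive class = SIM
m = binary_metrics(tp=208, fn=65, fp=53, tn=218)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```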
                                        

ROC curve and AUC

Using the new cutoff point

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 209  51
       SIM  62 222
                                          
               Accuracy : 0.7923          
                 95% CI : (0.7557, 0.8256)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.5845          
                                          
 Mcnemar's Test P-Value : 0.3468          
                                          
            Sensitivity : 0.8132          
            Specificity : 0.7712          
         Pos Pred Value : 0.7817          
         Neg Pred Value : 0.8038          
             Prevalence : 0.5018          
         Detection Rate : 0.4081          
   Detection Prevalence : 0.5221          
      Balanced Accuracy : 0.7922          
                                          
       'Positive' Class : SIM             
                                          
95% CI: 0.835-0.8955 (DeLong)
95% CI (2000 stratified bootstrap replicates):
 thresholds sp.low sp.median sp.high se.low se.median se.high
       0.45 0.7196    0.7712  0.8192 0.7656    0.8132  0.8572
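
The cutoff of 0.45 reproduces the sensitivity (222/273 = 0.8132) and specificity (209/271 = 0.7712) of the second confusion matrix. The report does not state how the cutoff was chosen. A common criterion, and the default "best" rule in pROC, is Youden's J = sensitivity + specificity - 1; the Python sketch below assumes that criterion and uses toy scores, not the model's predicted probabilities.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sens, spec) maximising sens + spec - 1.

    A sample is predicted positive when its score is >= threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = None
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)
        spec = sum(s < t for s in neg) / len(neg)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1:]

# Toy data (hypothetical): 1 = positive class, 0 = negative class
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,   1,   1,   0,   1,   1,   1]
threshold, sens, spec = youden_threshold(scores, labels)
print(threshold, sens, spec)  # 0.4 1.0 0.75
```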

Radial SVM

Training

Support Vector Machines with Radial Basis Function Kernel 

1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

Pre-processing: centered (17), scaled (17) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 849, 849, 848 
Resampling results across tuning parameters:

  sigma       C       ROC        Sens       Spec     
  0.01897393    0.25  0.8674059  0.7964917  0.7824726
  0.01897393    0.50  0.8892053  0.8185713  0.7965571
  0.01897393    1.00  0.9115756  0.8217309  0.8184664
  0.01897393    2.00  0.9275342  0.8154043  0.8716745
  0.01897393    4.00  0.9353716  0.8327372  0.8654147
  0.01897393    8.00  0.9383677  0.8406510  0.8638498
  0.01897393   16.00  0.9387108  0.8548392  0.8716745
  0.01897393   32.00  0.9359928  0.8390489  0.8669797
  0.01897393   64.00  0.9318069  0.8374691  0.8810642
  0.01897393  128.00  0.9267738  0.8169767  0.8779343
  0.02550416    0.25  0.8788747  0.8043831  0.7840376
  0.02550416    0.50  0.9016487  0.8122671  0.8043818
  0.02550416    1.00  0.9228068  0.8264479  0.8482003
  0.02550416    2.00  0.9336084  0.8280277  0.8654147
  0.02550416    4.00  0.9374928  0.8343319  0.8716745
  0.02550416    8.00  0.9390805  0.8469552  0.8826291
  0.02550416   16.00  0.9371120  0.8406361  0.8763693
  0.02550416   32.00  0.9330078  0.8374840  0.8779343
  0.02550416   64.00  0.9271603  0.8264330  0.8607199
  0.02550416  128.00  0.9249945  0.8311649  0.8544601
  0.03203439    0.25  0.8868999  0.8043831  0.7902973
  0.03203439    0.50  0.9106657  0.8075278  0.8215962
  0.03203439    1.00  0.9288066  0.8169767  0.8732394
  0.03203439    2.00  0.9358012  0.8280277  0.8732394
  0.03203439    4.00  0.9393686  0.8374989  0.8685446
  0.03203439    8.00  0.9384975  0.8564190  0.8716745
  0.03203439   16.00  0.9347501  0.8485350  0.8732394
  0.03203439   32.00  0.9286200  0.8201288  0.8763693
  0.03203439   64.00  0.9261721  0.8248607  0.8575900
  0.03203439  128.00  0.9242956  0.8359191  0.8482003
  0.03856462    0.25  0.8920084  0.8075129  0.7949922
  0.03856462    0.50  0.9168763  0.8122448  0.8372457
  0.03856462    1.00  0.9315391  0.8201288  0.8685446
  0.03856462    2.00  0.9368244  0.8296000  0.8701095
  0.03856462    4.00  0.9388075  0.8422308  0.8763693
  0.03856462    8.00  0.9361648  0.8485275  0.8669797
  0.03856462   16.00  0.9309071  0.8406659  0.8748044
  0.03856462   32.00  0.9274753  0.8327595  0.8622848
  0.03856462   64.00  0.9258507  0.8280351  0.8544601
  0.03856462  128.00  0.9229732  0.8375361  0.8466354
  0.04509484    0.25  0.8956559  0.8075129  0.7949922
  0.04509484    0.50  0.9203035  0.8075129  0.8497653
  0.04509484    1.00  0.9326428  0.8169767  0.8763693
  0.04509484    2.00  0.9371161  0.8311798  0.8669797
  0.04509484    4.00  0.9378384  0.8516796  0.8591549
  0.04509484    8.00  0.9339368  0.8501073  0.8607199
  0.04509484   16.00  0.9285761  0.8359340  0.8638498
  0.04509484   32.00  0.9265008  0.8343468  0.8528951
  0.04509484   64.00  0.9245281  0.8469776  0.8372457
  0.04509484  128.00  0.9215804  0.8327968  0.8356808
  0.05162507    0.25  0.8986772  0.8027736  0.7965571
  0.05162507    0.50  0.9226065  0.8122448  0.8638498
  0.05162507    1.00  0.9330510  0.8232883  0.8716745
  0.05162507    2.00  0.9370213  0.8374989  0.8685446
  0.05162507    4.00  0.9363805  0.8422159  0.8669797
  0.05162507    8.00  0.9317070  0.8437956  0.8638498
  0.05162507   16.00  0.9276301  0.8343542  0.8622848
  0.05162507   32.00  0.9256801  0.8390861  0.8482003
  0.05162507   64.00  0.9234703  0.8359489  0.8450704
  0.05162507  128.00  0.9209500  0.8391085  0.8294210

ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.03203439 and C = 4.
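
To give a sense of what the selected sigma means, the sketch below evaluates the radial basis function kernel in Python. It assumes the kernlab parameterization, k(x, y) = exp(-sigma * ||x - y||^2), which is what caret's radial SVM is understood to use; the two points are hypothetical.

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel in the kernlab form: exp(-sigma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)

sigma = 0.03203439   # value selected in the tuning output above
x = [0.0] * 17       # a point at the mean of all 17 standardized predictors
y = [1.0] * 17       # hypothetical point one standard deviation away on each

print(rbf_kernel(x, x, sigma))  # 1.0 (identical points)
print(rbf_kernel(x, y, sigma))  # similarity decays with squared distance
```

With a sigma this small the kernel decays slowly, so points that are fairly far apart still influence one another, which gives a comparatively smooth decision boundary.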

Prediction

Model performance

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 231  21
       SIM  40 252
                                          
               Accuracy : 0.8879          
                 95% CI : (0.8583, 0.9131)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.7757          
                                          
 Mcnemar's Test P-Value : 0.02119         
                                          
            Sensitivity : 0.9231          
            Specificity : 0.8524          
         Pos Pred Value : 0.8630          
         Neg Pred Value : 0.9167          
             Prevalence : 0.5018          
         Detection Rate : 0.4632          
   Detection Prevalence : 0.5368          
      Balanced Accuracy : 0.8877          
                                          
       'Positive' Class : SIM             
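
The McNemar p-value in this output (0.02119) depends only on the two off-diagonal counts, 21 and 40, and tests whether the two kinds of error are equally frequent. A Python sketch of the chi-square version with continuity correction, which appears to match the printed value:

```python
import math

def mcnemar_p(b, c):
    """McNemar chi-square test with continuity correction (1 degree of freedom)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2))

# Off-diagonal counts of the matrix above: 21 false negatives, 40 false positives
p = mcnemar_p(21, 40)
print(f"{p:.4f}")
```

A p-value below 0.05 indicates that, at the default cutoff, the model produces false positives and false negatives at different rates.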
                                          

ROC curve and AUC

Using the new cutoff point

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 246  29
       SIM  25 244
                                          
               Accuracy : 0.9007          
                 95% CI : (0.8725, 0.9245)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8015          
                                          
 Mcnemar's Test P-Value : 0.6831          
                                          
            Sensitivity : 0.8938          
            Specificity : 0.9077          
         Pos Pred Value : 0.9071          
         Neg Pred Value : 0.8945          
             Prevalence : 0.5018          
         Detection Rate : 0.4485          
   Detection Prevalence : 0.4945          
      Balanced Accuracy : 0.9008          
                                          
       'Positive' Class : SIM             
                                          
95% CI: 0.9489-0.9754 (DeLong)
95% CI (2000 stratified bootstrap replicates):
 thresholds sp.low sp.median sp.high se.low se.median se.high
      0.617 0.8744    0.9077   0.941 0.8571    0.8938  0.9268
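
The DeLong interval above refers to the AUC, which can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal Python sketch of that definition on toy scores (hypothetical data, not the model's output):

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = positive class, 0 = negative class
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,   1,   1,   0,   1,   1,   1]
print(auc(scores, labels))  # 0.9
```

Because the AUC depends only on the ranking of the scores, it does not change when the cutoff point is moved; only sensitivity and specificity do.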

Polynomial SVM

Training

Support Vector Machines with Polynomial Kernel 

1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

Pre-processing: centered (17), scaled (17) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 849, 849, 848 
Resampling results across tuning parameters:

  degree  scale  C     ROC        Sens       Spec     
  1       0.001  0.25  0.7980302  0.8769635  0.4960876
  1       0.001  0.50  0.7987269  0.8532892  0.5539906
  1       0.001  1.00  0.8097275  0.7365719  0.7668232
  1       0.001  2.00  0.8205064  0.7665802  0.7558685
  1       0.010  0.25  0.8246704  0.7760514  0.7558685
  1       0.010  0.50  0.8361054  0.7902471  0.7527387
  1       0.010  1.00  0.8441328  0.7981385  0.7449139
  1       0.010  2.00  0.8489654  0.7981460  0.7433490
  1       0.100  0.25  0.8503133  0.7950088  0.7449139
  1       0.100  0.50  0.8526180  0.8013056  0.7417840
  1       0.100  1.00  0.8539824  0.8060523  0.7402191
  1       0.100  2.00  0.8544430  0.8154863  0.7386541
  1       1.000  0.25  0.8541911  0.8123491  0.7386541
  1       1.000  0.50  0.8543093  0.8170661  0.7386541
  1       1.000  1.00  0.8540077  0.8029002  0.7496088
  1       1.000  2.00  0.8539410  0.8139214  0.7417840
  2       0.001  0.25  0.7989566  0.8532892  0.5508607
  2       0.001  0.50  0.8100607  0.7349921  0.7683881
  2       0.001  1.00  0.8209804  0.7697323  0.7558685
  2       0.001  2.00  0.8331873  0.7918269  0.7511737
  2       0.010  0.25  0.8449920  0.7886748  0.7621283
  2       0.010  0.50  0.8574986  0.8044279  0.7668232
  2       0.010  1.00  0.8709039  0.8170437  0.7668232
  2       0.010  2.00  0.8886511  0.8217831  0.7887324
  2       0.100  0.25  0.9386674  0.8549063  0.8591549
  2       0.100  0.50  0.9388897  0.8580584  0.8607199
  2       0.100  1.00  0.9388538  0.8502116  0.8701095
  2       0.100  2.00  0.9391435  0.8548988  0.8763693
  2       1.000  0.25  0.9216178  0.8485499  0.8607199
  2       1.000  0.50  0.9216467  0.8406659  0.8575900
  2       1.000  1.00  0.9207909  0.8343915  0.8701095
  2       1.000  2.00  0.9189020  0.8123193  0.8716745
  3       0.001  0.25  0.8039510  0.7397314  0.7636933
  3       0.001  0.50  0.8170400  0.7523622  0.7652582
  3       0.001  1.00  0.8291880  0.7870875  0.7574335
  3       0.001  2.00  0.8406151  0.7949790  0.7496088
  3       0.010  0.25  0.8626735  0.8012534  0.7715180
  3       0.010  0.50  0.8812267  0.8186012  0.7840376
  3       0.010  1.00  0.9028550  0.8186012  0.8059468
  3       0.010  2.00  0.9210958  0.8201884  0.8419405
  3       0.100  0.25  0.9372542  0.8486020  0.8732394
  3       0.100  0.50  0.9340101  0.8296745  0.8779343
  3       0.100  1.00  0.9314705  0.8359862  0.8763693
  3       0.100  2.00  0.9294665  0.8281096  0.8810642
  3       1.000  0.25  0.8606339  0.7618111  0.8090767
  3       1.000  0.50  0.8616945  0.7571016  0.8200313
  3       1.000  1.00  0.8630592  0.7602313  0.8200313
  3       1.000  2.00  0.8597513  0.7697174  0.8137715

ROC was used to select the optimal model using the largest value.
The final values used for the model were degree = 2, scale = 0.1 and C = 2.
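
The three tuned parameters enter the model through the kernel function. The Python sketch below assumes the kernlab form of the polynomial kernel, k(x, y) = (scale * <x, y> + offset)^degree, with the offset fixed at 1 (an assumption about the defaults, since the output does not print it); the input vectors are hypothetical.

```python
def poly_kernel(x, y, degree, scale, offset=1.0):
    """Polynomial kernel: (scale * dot(x, y) + offset) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (scale * dot + offset) ** degree

# Selected values from the tuning output: degree = 2, scale = 0.1
k = poly_kernel([1.0, 2.0], [3.0, 4.0], degree=2, scale=0.1)
print(round(k, 2))  # 4.41

# With degree = 1 the kernel is an affine function of the dot product,
# so those rows of the grid behave like a linear SVM.
k_linear = poly_kernel([1.0, 2.0], [3.0, 4.0], degree=1, scale=1.0)
print(k_linear)  # 12.0
```

This is consistent with the table: the degree = 1 rows reach a ROC of about 0.854, close to the linear SVM, while degree 2 and 3 reach about 0.94.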

Prediction

Model performance

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 234  16
       SIM  37 257
                                          
               Accuracy : 0.9026          
                 95% CI : (0.8745, 0.9262)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8051          
                                          
 Mcnemar's Test P-Value : 0.00601         
                                          
            Sensitivity : 0.9414          
            Specificity : 0.8635          
         Pos Pred Value : 0.8741          
         Neg Pred Value : 0.9360          
             Prevalence : 0.5018          
         Detection Rate : 0.4724          
   Detection Prevalence : 0.5404          
      Balanced Accuracy : 0.9024          
                                          
       'Positive' Class : SIM             
                                          

ROC curve and AUC

Using the new cutoff point

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 237  18
       SIM  34 255
                                          
               Accuracy : 0.9044          
                 95% CI : (0.8765, 0.9278)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8088          
                                          
 Mcnemar's Test P-Value : 0.03751         
                                          
            Sensitivity : 0.9341          
            Specificity : 0.8745          
         Pos Pred Value : 0.8824          
         Neg Pred Value : 0.9294          
             Prevalence : 0.5018          
         Detection Rate : 0.4688          
   Detection Prevalence : 0.5312          
      Balanced Accuracy : 0.9043          
                                          
       'Positive' Class : SIM             
                                          
95% CI: 0.9465-0.9754 (DeLong)
95% CI (2000 stratified bootstrap replicates):
 thresholds sp.low sp.median sp.high se.low se.median se.high
      0.526 0.8339    0.8745  0.9114 0.9011    0.9341  0.9597

SVM - Comparing the results


Call:
summary.resamples(object = results)

Models: SVML, SVMR, SVMP 
Number of resamples: 3 

ROC 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
SVML 0.8397748 0.8428899 0.8460049 0.8517736 0.8577730 0.8695411    0
SVMR 0.9300225 0.9304675 0.9309125 0.9393686 0.9440416 0.9571707    0
SVMP 0.9332265 0.9366062 0.9399858 0.9391435 0.9421020 0.9442182    0

Sens 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
SVML 0.7867299 0.7962085 0.8056872 0.8075352 0.8179379 0.8301887    0
SVMR 0.7914692 0.8246445 0.8578199 0.8374989 0.8605137 0.8632075    0
SVMP 0.8490566 0.8510686 0.8530806 0.8548988 0.8578199 0.8625592    0

Spec 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
SVML 0.7042254 0.7276995 0.7511737 0.7370892 0.7535211 0.7558685    0
SVMR 0.8262911 0.8568075 0.8873239 0.8685446 0.8896714 0.8920188    0
SVMP 0.8450704 0.8638498 0.8826291 0.8763693 0.8920188 0.9014085    0
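
With only 3 resamples, the Min., Median and Max. columns are the three fold values themselves, so the per-fold ROC can be recovered from the summary. The Python sketch below recomputes the means and adds the fold-to-fold standard deviation, which the summary does not print:

```python
import statistics

# Min., Median and Max. of the ROC block above, i.e. the three fold values
roc_folds = {
    "SVML": [0.8397748, 0.8460049, 0.8695411],
    "SVMR": [0.9300225, 0.9309125, 0.9571707],
    "SVMP": [0.9332265, 0.9399858, 0.9442182],
}

means = {name: statistics.fmean(folds) for name, folds in roc_folds.items()}
sds = {name: statistics.stdev(folds) for name, folds in roc_folds.items()}
for name in roc_folds:
    print(f"{name}: mean={means[name]:.7f} sd={sds[name]:.4f}")

# Gap between the radial and polynomial models versus the fold-to-fold spread
gap = abs(means["SVMR"] - means["SVMP"])
print(gap < sds["SVMR"])  # True
```

The difference in mean ROC between the radial and polynomial models (about 0.0002) is much smaller than the variation between folds, so this comparison does not separate the two; both are clearly ahead of the linear model.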

SHAP

Separating the predictors

Computing SHAP values with fastshap
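
fastshap approximates Shapley values by Monte Carlo sampling rather than by enumerating every coalition of features. The Python sketch below illustrates that sampling idea for a single instance; it is not the fastshap code, the function name is hypothetical, and a toy linear model is used because its exact Shapley values are known in closed form, w_j * (x_j - mean(z_j)).

```python
import random

def shapley_mc(predict, x, background, feature, rng):
    """Monte Carlo estimate of one feature's Shapley value for instance x.

    For each background row z a random feature order is drawn. Features up to
    and including `feature` in that order take their values from x, the rest
    from z. The contribution is the change in prediction when `feature` alone
    is switched from x's value to z's value.
    """
    n_features = len(x)
    total = 0.0
    for z in background:
        order = list(range(n_features))
        rng.shuffle(order)
        cut = order.index(feature)
        from_x = set(order[: cut + 1])
        with_feature = [x[j] if j in from_x else z[j] for j in range(n_features)]
        without_feature = with_feature[:]
        without_feature[feature] = z[feature]
        total += predict(with_feature) - predict(without_feature)
    return total / len(background)

# Hypothetical linear model and data, for illustration only
weights = [2.0, -1.0, 0.5]

def predict(row):
    return sum(w * v for w, v in zip(weights, row))

background = [[0.0, 1.0, 2.0], [1.0, 0.0, -2.0], [2.0, 2.0, 0.0]]
x = [3.0, 1.0, 1.0]
phi = [shapley_mc(predict, x, background, j, random.Random(0)) for j in range(3)]
print(phi)
```

For the linear model the estimates equal the closed-form values (4.0, 0.0 and 0.5 here) whatever feature order is drawn; for a nonlinear model such as the SVMs above, the estimate depends on the sampled orders and becomes more stable as the number of samples grows.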

Visualization