Biomedical Engineering Program

Study of methods for classifying water samples by potability using the chemical characteristics of the samples

Published

March 7, 2026

Student: Walner Passos
Instructors: Leticia Raposo and Diogo Antônio Tschoeke

Exploratory Data Analysis (EDA)

Dataset Overview

We use the WaterQuality dataset, available on Kaggle at https://www.kaggle.com/datasets/mssmartypants/water-quality/data. The dataset was created from imaginary data on water quality in an urban environment.

Loading and Initial Inspection of the Data

Data summary:

  • Number of rows: 7999
  • Number of variables: 20
  • Label: 1

Explanatory variable   Data type      Dangerous if greater than
---------------------  -------------  --------------------------
aluminium              quantitative   2.8
ammonia                quantitative   32.5
arsenic                quantitative   0.01
barium                 quantitative   2
cadmium                quantitative   0.005
chloramine             quantitative   4
chromium               quantitative   0.1
copper                 quantitative   1.3
flouride               quantitative   1.5
bacteria               quantitative   0
viruses                quantitative   0
lead                   quantitative   0.015
nitrates               quantitative   10
nitrites               quantitative   1
mercury                quantitative   0.002
perchlorate            quantitative   56
radium                 quantitative   5
selenium               quantitative   0.5
silver                 quantitative   0.1
uranium                quantitative   0.3

Response variable      Data type      Observation
---------------------  -------------  --------------------------
is_safe                chr            “0”, “1”
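
The limits in the table can be read as a simple per-sample check. The following is a minimal sketch in Python (used here only for illustration; the outputs in this report come from R). It assumes a sample is a dict keyed by the dataset column names. `LIMITS` and `exceeded` are hypothetical names, only a subset of the limits is copied, and the source does not state how `is_safe` itself was derived, so this is not a reconstruction of the label.

```python
# Limits copied from the table above (a subset, for brevity).
LIMITS = {
    "aluminium": 2.8,
    "arsenic": 0.01,
    "cadmium": 0.005,
    "lead": 0.015,
    "nitrates": 10,
}


def exceeded(sample: dict) -> list:
    """Return the variables whose value is strictly greater than its limit."""
    return sorted(
        name
        for name, limit in LIMITS.items()
        if name in sample and sample[name] > limit
    )


# First value of each column as printed in the str() output of this report.
row = {"aluminium": 1.65, "arsenic": 0.04, "cadmium": 0.007,
       "lead": 0.054, "nitrates": 16.08}
print(exceeded(row))
```

The comparison is strict (`>`), matching the "greater than" wording of the table header.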

Dataset Information

Variables

'data.frame':   7999 obs. of  21 variables:
 $ aluminium  : num  1.65 2.32 1.01 1.36 0.92 0.94 2.36 3.93 0.6 0.22 ...
 $ ammonia    : chr  "9.08" "21.16" "14.02" "11.33" ...
 $ arsenic    : num  0.04 0.01 0.04 0.04 0.03 0.03 0.01 0.04 0.01 0.02 ...
 $ barium     : num  2.85 3.31 0.58 2.96 0.2 2.88 1.35 0.66 0.71 1.37 ...
 $ cadmium    : num  0.007 0.002 0.008 0.001 0.006 0.003 0.004 0.001 0.005 0.007 ...
 $ chloramine : num  0.35 5.28 4.24 7.23 2.67 0.8 1.28 6.22 3.14 6.4 ...
 $ chromium   : num  0.83 0.68 0.53 0.03 0.69 0.43 0.62 0.1 0.77 0.49 ...
 $ copper     : num  0.17 0.66 0.02 1.66 0.57 1.38 1.88 1.86 1.45 0.82 ...
 $ flouride   : num  0.05 0.9 0.99 1.08 0.61 0.11 0.33 0.86 0.98 1.24 ...
 $ bacteria   : num  0.2 0.65 0.05 0.71 0.13 0.67 0.13 0.16 0.35 0.83 ...
 $ viruses    : num  0 0.65 0.003 0.71 0.001 0.67 0.007 0.005 0.002 0.83 ...
 $ lead       : num  0.054 0.1 0.078 0.016 0.117 0.135 0.021 0.197 0.167 0.109 ...
 $ nitrates   : num  16.08 2.01 14.16 1.41 6.74 ...
 $ nitrites   : num  1.13 1.93 1.11 1.29 1.11 1.89 1.78 1.81 1.84 1.46 ...
 $ mercury    : num  0.007 0.003 0.006 0.004 0.003 0.006 0.007 0.001 0.004 0.01 ...
 $ perchlorate: num  37.75 32.26 50.28 9.12 16.9 ...
 $ radium     : num  6.78 3.21 7.07 1.72 2.41 5.42 2.84 7.24 4.99 0.08 ...
 $ selenium   : num  0.08 0.08 0.07 0.02 0.02 0.08 0.1 0.08 0.08 0.03 ...
 $ silver     : num  0.34 0.27 0.44 0.45 0.06 0.19 0.24 0.08 0.25 0.31 ...
 $ uranium    : num  0.02 0.05 0.01 0.05 0.02 0.02 0.08 0.07 0.08 0.01 ...
 $ is_safe    : chr  "1" "1" "0" "1" ...

Statistical Summary of the Data

   aluminium        ammonia             arsenic           barium     
 Min.   :0.0000   Length:7999        Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0400   Class :character   1st Qu.:0.0300   1st Qu.:0.560  
 Median :0.0700   Mode  :character   Median :0.0500   Median :1.190  
 Mean   :0.6662                      Mean   :0.1614   Mean   :1.568  
 3rd Qu.:0.2800                      3rd Qu.:0.1000   3rd Qu.:2.480  
 Max.   :5.0500                      Max.   :1.0500   Max.   :4.940  
    cadmium          chloramine       chromium          copper      
 Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00800   1st Qu.:0.100   1st Qu.:0.0500   1st Qu.:0.0900  
 Median :0.04000   Median :0.530   Median :0.0900   Median :0.7500  
 Mean   :0.04281   Mean   :2.177   Mean   :0.2472   Mean   :0.8059  
 3rd Qu.:0.07000   3rd Qu.:4.240   3rd Qu.:0.4400   3rd Qu.:1.3900  
 Max.   :0.13000   Max.   :8.680   Max.   :0.9000   Max.   :2.0000  
    flouride         bacteria         viruses            lead        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.4050   1st Qu.:0.0000   1st Qu.:0.0020   1st Qu.:0.04800  
 Median :0.7700   Median :0.2200   Median :0.0080   Median :0.10200  
 Mean   :0.7716   Mean   :0.3197   Mean   :0.3286   Mean   :0.09945  
 3rd Qu.:1.1600   3rd Qu.:0.6100   3rd Qu.:0.7000   3rd Qu.:0.15100  
 Max.   :1.5000   Max.   :1.0000   Max.   :1.0000   Max.   :0.20000  
    nitrates         nitrites       mercury          perchlorate   
 Min.   : 0.000   Min.   :0.00   Min.   :0.000000   Min.   : 0.00  
 1st Qu.: 5.000   1st Qu.:1.00   1st Qu.:0.003000   1st Qu.: 2.17  
 Median : 9.930   Median :1.42   Median :0.005000   Median : 7.74  
 Mean   : 9.819   Mean   :1.33   Mean   :0.005194   Mean   :16.46  
 3rd Qu.:14.610   3rd Qu.:1.76   3rd Qu.:0.008000   3rd Qu.:29.48  
 Max.   :19.830   Max.   :2.93   Max.   :0.010000   Max.   :60.01  
     radium         selenium           silver          uranium       
 Min.   :0.000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.820   1st Qu.:0.02000   1st Qu.:0.0400   1st Qu.:0.02000  
 Median :2.410   Median :0.05000   Median :0.0800   Median :0.05000  
 Mean   :2.921   Mean   :0.04968   Mean   :0.1478   Mean   :0.04467  
 3rd Qu.:4.670   3rd Qu.:0.07000   3rd Qu.:0.2400   3rd Qu.:0.07000  
 Max.   :7.990   Max.   :0.10000   Max.   :0.5000   Max.   :0.09000  
   is_safe         
 Length:7999       
 Class :character  
 Mode  :character  
                   
                   
                   

Data Cleaning and Preparation

  • No missing data:
    Number of missing values:  0
  • Analysis of the ammonia variable
     Confirming the variable's data type:  character
     Changing the data type 
     Confirming the change:  double
     Checking for inconsistencies:  3
     Excluding records: 
     Confirming the exclusion: 0
  • Analysis of the labels
     Labels: 1 0
  • Changed the label descriptions (“0” - NÃO / “1” - SIM)
     Changing the labels...  
     Confirming the change:  NÃO SIM
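
The cleaning steps listed above can be sketched as follows. This is a minimal Python illustration, not the code used for the report: it assumes rows are dicts with string values, treats any `ammonia` value that cannot be parsed as a number as an inconsistent record, and uses a made-up bad value, since the source does not show what the three inconsistent records contained. `clean` is a hypothetical helper name.

```python
def clean(rows):
    """Convert ammonia to float, drop rows where that fails, relabel is_safe."""
    labels = {"0": "NÃO", "1": "SIM"}
    cleaned, dropped = [], 0
    for row in rows:
        try:
            ammonia = float(row["ammonia"])
        except ValueError:
            dropped += 1  # inconsistent record: ammonia is not a number
            continue
        cleaned.append(
            {**row, "ammonia": ammonia, "is_safe": labels[row["is_safe"]]}
        )
    return cleaned, dropped


raw = [
    {"ammonia": "9.08", "is_safe": "1"},
    {"ammonia": "21.16", "is_safe": "1"},
    {"ammonia": "not-a-number", "is_safe": "0"},  # hypothetical bad value
]
rows, n_dropped = clean(raw)
print(len(rows), n_dropped, rows[0])
```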

Data Analysis

Class Balance

Univariate Analysis

Non-numerical variable(s) ignored: is_safe
Descriptive Statistics  
dados  
N: 7996  

                    aluminium   ammonia   arsenic   bacteria    barium   cadmium   chloramine
----------------- ----------- --------- --------- ---------- --------- --------- ------------
             Mean        0.67     14.28      0.16       0.32      1.57      0.04         2.18
          Std.Dev        1.27      8.88      0.25       0.33      1.22      0.04         2.57
              Min        0.00     -0.08      0.00       0.00      0.00      0.00         0.00
               Q1        0.04      6.58      0.03       0.00      0.56      0.01         0.10
           Median        0.07     14.13      0.05       0.22      1.19      0.04         0.53
               Q3        0.28     22.13      0.10       0.61      2.49      0.07         4.24
              Max        5.05     29.84      1.05       1.00      4.94      0.13         8.68
              MAD        0.06     11.58      0.04       0.33      1.20      0.05         0.76
              IQR        0.24     15.55      0.07       0.61      1.92      0.06         4.14
               CV        1.90      0.62      1.56       1.03      0.78      0.84         1.18
         Skewness        2.01      0.03      1.98       0.55      0.66      0.48         0.89
      SE.Skewness        0.03      0.03      0.03       0.03      0.03      0.03         0.03
         Kurtosis        2.72     -1.23      2.68      -1.14     -0.70     -0.99        -0.68
          N.Valid     7996.00   7996.00   7996.00    7996.00   7996.00   7996.00      7996.00
                N     7996.00   7996.00   7996.00    7996.00   7996.00   7996.00      7996.00
        Pct.Valid      100.00    100.00    100.00     100.00    100.00    100.00       100.00

Table: Table continues below

 

                    chromium    copper   flouride      lead   mercury   nitrates   nitrites
----------------- ---------- --------- ---------- --------- --------- ---------- ----------
             Mean       0.25      0.81       0.77      0.10      0.01       9.82       1.33
          Std.Dev       0.27      0.65       0.44      0.06      0.00       5.54       0.57
              Min       0.00      0.00       0.00      0.00      0.00       0.00       0.00
               Q1       0.05      0.09       0.41      0.05      0.00       5.00       1.00
           Median       0.09      0.75       0.77      0.10      0.00       9.93       1.42
               Q3       0.44      1.39       1.16      0.15      0.01      14.61       1.76
              Max       0.90      2.00       1.50      0.20      0.01      19.83       2.93
              MAD       0.10      0.96       0.56      0.08      0.00       7.06       0.53
              IQR       0.39      1.30       0.75      0.10      0.00       9.61       0.76
               CV       1.09      0.81       0.56      0.59      0.57       0.56       0.43
         Skewness       1.03      0.25      -0.04     -0.06     -0.08      -0.04      -0.50
      SE.Skewness       0.03      0.03       0.03      0.03      0.03       0.03       0.03
         Kurtosis      -0.37     -1.35      -1.17     -1.16     -1.17      -1.19      -0.36
          N.Valid    7996.00   7996.00    7996.00   7996.00   7996.00    7996.00    7996.00
                N    7996.00   7996.00    7996.00   7996.00   7996.00    7996.00    7996.00
        Pct.Valid     100.00    100.00     100.00    100.00    100.00     100.00     100.00

Table: Table continues below

 

                    perchlorate    radium   selenium    silver   uranium   viruses
----------------- ------------- --------- ---------- --------- --------- ---------
             Mean         16.47      2.92       0.05      0.15      0.04      0.33
          Std.Dev         17.69      2.32       0.03      0.14      0.03      0.38
              Min          0.00      0.00       0.00      0.00      0.00      0.00
               Q1          2.17      0.82       0.02      0.04      0.02      0.00
           Median          7.74      2.41       0.05      0.08      0.05      0.01
               Q3         29.50      4.67       0.07      0.24      0.07      0.70
              Max         60.01      7.99       0.10      0.50      0.09      1.00
              MAD         10.73      2.64       0.03      0.09      0.03      0.01
              IQR         27.32      3.85       0.05      0.20      0.05      0.70
               CV          1.07      0.80       0.58      0.97      0.60      1.15
         Skewness          0.94      0.55       0.01      1.03     -0.03      0.42
      SE.Skewness          0.03      0.03       0.03      0.03      0.03      0.03
         Kurtosis         -0.50     -0.93      -1.10     -0.29     -1.17     -1.59
          N.Valid       7996.00   7996.00    7996.00   7996.00   7996.00   7996.00
                N       7996.00   7996.00    7996.00   7996.00   7996.00   7996.00
        Pct.Valid        100.00    100.00     100.00    100.00    100.00    100.00

Characteristic   N = 7,996 ¹
---------------  -------------------------------------------------
aluminium        0.07 (0.04 - 0.28); min=0.00, max=5.05
ammonia          14 (7 - 22); min=0, max=30
arsenic          0.05 (0.03 - 0.10); min=0.00, max=1.05
barium           1.19 (0.56 - 2.49); min=0.00, max=4.94
cadmium          0.040 (0.008 - 0.070); min=0.000, max=0.130
chloramine       0.53 (0.10 - 4.24); min=0.00, max=8.68
chromium         0.09 (0.05 - 0.44); min=0.00, max=0.90
copper           0.75 (0.09 - 1.39); min=0.00, max=2.00
flouride         0.77 (0.41 - 1.16); min=0.00, max=1.50
bacteria         0.22 (0.00 - 0.61); min=0.00, max=1.00
viruses          0.01 (0.00 - 0.70); min=0.00, max=1.00
lead             0.10 (0.05 - 0.15); min=0.00, max=0.20
nitrates         9.9 (5.0 - 14.6); min=0.0, max=19.8
nitrites         1.42 (1.00 - 1.76); min=0.00, max=2.93
mercury          0.0050 (0.0030 - 0.0080); min=0.0000, max=0.0100
perchlorate      8 (2 - 29); min=0, max=60
radium           2.41 (0.82 - 4.67); min=0.00, max=7.99
selenium         0.050 (0.020 - 0.070); min=0.000, max=0.100
silver           0.08 (0.04 - 0.24); min=0.00, max=0.50
uranium          0.050 (0.020 - 0.070); min=0.000, max=0.090
is_safe
    NÃO          7,084 (88.6%)
    SIM          912 (11.4%)

¹ Median (Q1 - Q3); min=Min, max=Max; n (%)

Bivariate Analysis


                       variavel diferenca_medias      t_stat       p_valor
mean in group SIM     aluminium     1.3292991619 -24.3680232 5.819921e-104
mean in group SIM1      ammonia    -0.6401259820   2.0996246  3.597473e-02
mean in group SIM2      arsenic    -0.0980346227  16.0088366  1.182082e-53
mean in group SIM3       barium     0.3476075255  -8.2015356  6.232701e-16
mean in group SIM4      cadmium    -0.0290337665  28.4627692 2.157232e-139
mean in group SIM5   chloramine     1.5077341451 -17.4349158  9.952540e-61
mean in group SIM6     chromium     0.1552173232 -16.0484307  2.119329e-52
mean in group SIM7       copper     0.0606479019  -2.7309314  6.409460e-03
mean in group SIM8     flouride     0.0089717686  -0.5944578  5.523211e-01
mean in group SIM9     bacteria    -0.0228827565   2.0265525  4.293343e-02
mean in group SIM10     viruses    -0.1154198955   9.3948392  2.740978e-20
mean in group SIM11        lead    -0.0018242116   0.9095015  3.632725e-01
mean in group SIM12    nitrates    -1.2569316944   6.3974553  2.293154e-10
mean in group SIM13    nitrites     0.0847301802  -5.2520191  1.742362e-07
mean in group SIM14     mercury    -0.0003436321   3.2309886  1.268728e-03
mean in group SIM15 perchlorate     4.2141229989  -8.0645368  1.646648e-15
mean in group SIM16      radium     0.4730600464  -5.8677652  5.749589e-09
mean in group SIM17    selenium    -0.0027988759   2.7878602  5.392131e-03
mean in group SIM18      silver     0.0464318405  -8.6657750  1.547952e-17
mean in group SIM19     uranium    -0.0064001221   6.7995294  1.676211e-11
                    relevancia_estatistica relevancia_diferenca
mean in group SIM          Muito relevante       Alta diferença
mean in group SIM1               Relevante      Média diferença
mean in group SIM2         Muito relevante      Baixa diferença
mean in group SIM3         Muito relevante      Média diferença
mean in group SIM4         Muito relevante      Baixa diferença
mean in group SIM5         Muito relevante       Alta diferença
mean in group SIM6         Muito relevante      Média diferença
mean in group SIM7               Relevante      Baixa diferença
mean in group SIM8         Pouco relevante      Baixa diferença
mean in group SIM9               Relevante      Baixa diferença
mean in group SIM10        Muito relevante      Média diferença
mean in group SIM11        Pouco relevante      Baixa diferença
mean in group SIM12        Muito relevante       Alta diferença
mean in group SIM13        Muito relevante      Baixa diferença
mean in group SIM14              Relevante      Baixa diferença
mean in group SIM15        Muito relevante       Alta diferença
mean in group SIM16        Muito relevante      Média diferença
mean in group SIM17              Relevante      Baixa diferença
mean in group SIM18        Muito relevante      Baixa diferença
mean in group SIM19        Muito relevante      Baixa diferença

Multivariate Analysis

Conclusion

  • Number of rows: 7996

  • Variable ammonia converted to num:
    $ ammonia : num 9.08 21.16 14.02 11.33 24.33 …

  • Label changed:
    $ is_safe : Factor w/ 2 levels “NÃO”,“SIM”

  • Data preprocessing

    • Check the class balance
    • Check for outliers
    • Normalize the data

Preprocessing

Data Balancing

We used undersampling to balance the data.
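
As an illustration of the idea, the sketch below implements plain random undersampling in Python: rows from the larger class are discarded at random until every class has as many rows as the smallest one. The report does not say which routine or target ratio was used, and its balanced data has 1817 rows (an odd number), so the classes there cannot be exactly equal in size. The exact equalization shown here is therefore an assumption, and `undersample` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until all match the smallest."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy data with a 90/10 imbalance (not the water dataset).
data = [{"x": i, "is_safe": "NÃO"} for i in range(90)] + \
       [{"x": i, "is_safe": "SIM"} for i in range(10)]
balanced = undersample(data, "is_safe")
print(len(balanced))
```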

Variable Selection

We excluded some variables that did not influence the classification, since their maximum values in the dataset were below the limit value.

Variable   Mean difference between groups   t_stat      p_valor        Statistical relevance   Mean-difference relevance
---------  -------------------------------  ----------  -------------  ----------------------  --------------------------
flouride   0.00897176864                    -0.594457   5.523211e-01   Pouco relevante         Baixa diferença
lead       -0.0018242116                    0.9095015   3.632725e-01   Pouco relevante         Baixa diferença
mercury    -0.0003436321                    3.2309886   1.268728e-03   Relevante               Baixa diferença
selenium   -0.0027988759                    2.7878602   5.392131e-03   Relevante               Baixa diferença

'data.frame':   1817 obs. of  18 variables:
 $ aluminium  : num  0.01 4.8 0.11 3.07 0.01 4.98 3.55 0.06 0.03 0.22 ...
 $ ammonia    : num  12.14 15.52 2.87 8.07 0.51 ...
 $ arsenic    : num  0.3 0.63 0.34 0.05 0.06 0.38 0.03 0.01 0.07 0.001 ...
 $ barium     : num  1.7 3.84 2.06 4.14 1.24 0.45 1.94 0.18 0.34 3.74 ...
 $ cadmium    : num  0.03 0.05 0.01 0.12 0.1 0.01 0.005 0.09 0.06 0.005 ...
 $ chloramine : num  3.38 1.3 7.43 3.76 0.02 3.18 0.01 0.02 0.08 5.24 ...
 $ chromium   : num  0.37 0.52 0.23 0.29 0.02 0.57 0.34 0.05 0.07 0.61 ...
 $ copper     : num  1.33 1.64 1.44 1.28 0.76 1.18 0.37 0.05 0.25 1.33 ...
 $ bacteria   : num  0.44 0 0.99 0 0.28 0.28 0.62 0 0.71 0.81 ...
 $ viruses    : num  0.005 0 0.99 0 0.007 0 0.62 0.79 0.71 0.81 ...
 $ nitrates   : num  4.37 10.69 14.61 1.72 3.87 ...
 $ nitrites   : num  2.1 1.95 2.53 1.23 1.89 1.69 1.05 0.66 1.44 1.81 ...
 $ perchlorate: num  50.31 21.93 26.07 46.55 3.02 ...
 $ radium     : num  6.81 6.46 4.17 6.53 1.17 1.73 2.73 0.78 3.32 1.29 ...
 $ selenium   : num  0.02 0.03 0.08 0.01 0.07 0.05 0.04 0.06 0.06 0 ...
 $ silver     : num  0.31 0.35 0.18 0.32 0.05 0.43 0.03 0.03 0.07 0.44 ...
 $ uranium    : num  0.01 0.06 0.03 0.02 0.06 0.04 0.08 0.04 0.03 0.07 ...
 $ is_safe    : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...

Outlier Treatment

The variables aluminium and arsenic show values within the ranges reported in the literature, so they were not treated as outliers.

Data Normalization

Normalization was performed with the z-score.
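
For reference, the z-score of a value x is (x - mean) / sd, computed per variable. A minimal Python sketch follows. It uses the sample standard deviation (n - 1 denominator), which is an assumption, since the report does not state which denominator was used, and `zscore` is a hypothetical helper name.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 denominator
    return [(v - m) / s for v in values]


z = zscore([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print([round(v, 3) for v in z])
```

After the transformation every variable is on the same scale, which matters for distance-based methods such as KNN.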

'data.frame':   1817 obs. of  18 variables:
 $ aluminium  : num  -0.76 2.365 -0.694 1.236 -0.76 ...
 $ ammonia    : num  -0.221 0.159 -1.264 -0.679 -1.53 ...
 $ arsenic    : num  0.802 2.303 0.984 -0.334 -0.289 ...
 $ barium     : num  -0.00428 1.75753 0.2921 2.00451 -0.38298 ...
 $ cadmium    : num  -0.0159 0.563 -0.5949 2.5892 2.0103 ...
 $ chloramine : num  0.213 -0.584 1.765 0.358 -1.075 ...
 $ chromium   : num  0.2307 0.7676 -0.2705 -0.0557 -1.0221 ...
 $ copper     : num  0.777 1.263 0.95 0.699 -0.117 ...
 $ bacteria   : num  0.352 -0.985 2.024 -0.985 -0.134 ...
 $ viruses    : num  -0.773 -0.786 1.863 -0.786 -0.768 ...
 $ nitrates   : num  -0.877 0.262 0.968 -1.354 -0.967 ...
 $ nitrites   : num  1.453 1.158 2.298 -0.256 1.041 ...
 $ perchlorate: num  1.98 0.242 0.496 1.75 -0.916 ...
 $ radium     : num  1.557 1.406 0.422 1.436 -0.867 ...
 $ selenium   : num  -1.017 -0.67 1.063 -1.363 0.717 ...
 $ silver     : num  0.9325 1.1976 0.0707 0.9987 -0.791 ...
 $ uranium    : num  -1.199 0.638 -0.464 -0.831 0.638 ...
 $ label      : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...

- A random seed was set

- Training and test data frames were created with a 70/30 split
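
A seeded 70/30 split can be sketched as below. This Python example is stratified: it samples 70% of each class, rounding up. That choice is an assumption made for illustration, as are the class counts of 905 NÃO and 912 SIM, which are inferred by adding the training distribution and the test confusion matrix reported in the KNN section. `stratified_split` is a hypothetical helper name and the seed value is arbitrary.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), sampling train_fraction of each class."""
    rng = random.Random(seed)  # fixed seed -> same split on every run
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx = []
    for indices in by_class.values():
        n_train = math.ceil(len(indices) * train_fraction)  # round up per class
        train_idx.extend(rng.sample(indices, n_train))
    train_set = set(train_idx)
    test_idx = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_idx), test_idx


labels = ["NÃO"] * 905 + ["SIM"] * 912
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Under these assumptions the sizes come out as 1273 training rows and 544 test rows, the same sizes as the data frames printed in the report.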

[1] "Treino"
'data.frame':   1273 obs. of  18 variables:
 $ aluminium  : num  2.365 -0.76 2.482 -0.727 -0.623 ...
 $ ammonia    : num  0.159 -1.53 -0.658 -1.18 1.117 ...
 $ arsenic    : num  2.303 -0.289 1.166 -0.516 -0.557 ...
 $ barium     : num  1.758 -0.383 -1.033 -1.256 1.675 ...
 $ cadmium    : num  0.563 2.01 -0.595 1.721 -0.74 ...
 $ chloramine : num  -0.584 -1.075 0.136 -1.075 0.925 ...
 $ chromium   : num  0.768 -1.022 0.947 -0.915 1.09 ...
 $ copper     : num  1.263 -0.117 0.542 -1.231 0.777 ...
 $ bacteria   : num  -0.985 -0.134 -0.134 -0.985 1.477 ...
 $ viruses    : num  -0.786 -0.768 -0.786 1.327 1.381 ...
 $ nitrates   : num  0.262 -0.967 -1.549 -1.547 1.179 ...
 $ nitrites   : num  1.158 1.041 0.648 -1.376 0.883 ...
 $ perchlorate: num  0.242 -0.916 1.234 -0.655 -0.196 ...
 $ radium     : num  1.406 -0.867 -0.626 -1.035 -0.815 ...
 $ selenium   : num  -0.6701 0.7166 0.0233 0.37 -1.7101 ...
 $ silver     : num  1.198 -0.791 1.728 -0.924 1.794 ...
 $ uranium    : num  0.6383 0.6383 -0.0964 -0.0964 1.0057 ...
 $ label      : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
[1] "-------"
[1] "Teste"
'data.frame':   544 obs. of  18 variables:
 $ aluminium  : num  -0.76 -0.694 1.236 1.55 -0.747 ...
 $ ammonia    : num  -0.221 -1.264 -0.679 1.342 0.206 ...
 $ arsenic    : num  0.802 0.984 -0.334 -0.425 -0.243 ...
 $ barium     : num  -0.00428 0.2921 2.00451 0.19331 -1.12393 ...
 $ cadmium    : num  -0.0159 -0.5949 2.5892 -0.7396 0.8524 ...
 $ chloramine : num  0.213 1.765 0.358 -1.079 -1.052 ...
 $ chromium   : num  0.2307 -0.2705 -0.0557 0.1233 -0.8432 ...
 $ copper     : num  0.777 0.95 0.699 -0.729 -0.917 ...
 $ bacteria   : num  0.352 2.024 -0.985 0.899 1.173 ...
 $ viruses    : num  -0.773 1.863 -0.786 0.873 1.113 ...
 $ nitrates   : num  -0.877 0.968 -1.354 0.572 -1.441 ...
 $ nitrites   : num  1.453 2.298 -0.256 -0.61 0.157 ...
 $ perchlorate: num  1.98 0.496 1.75 1.95 -0.734 ...
 $ radium     : num  1.557 0.422 1.436 -0.197 0.057 ...
 $ selenium   : num  -1.017 1.063 -1.363 -0.323 0.37 ...
 $ silver     : num  0.9325 0.0707 0.9987 -0.9236 -0.6584 ...
 $ uranium    : num  -1.199 -0.464 -0.831 1.373 -0.464 ...
 $ label      : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...

Classification (KNN)

- We evaluated the best K.
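
To make the selection of K concrete, here is a small self-contained Python sketch of k-nearest neighbors with 5-fold cross-validation over the same candidate values (5, 7, 9). It differs from the report in two labeled ways: it selects K by mean accuracy rather than by ROC, and it runs on synthetic two-cluster data, not on the water dataset. `knn_predict` and `cv_accuracy` are hypothetical helper names.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points nearest to point (Euclidean)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=5, seed=1):
    """Mean accuracy of k-NN over `folds` cross-validation folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        held_out = idx[f::folds]  # every folds-th shuffled index
        held_set = set(held_out)
        tr = [i for i in idx if i not in held_set]
        tr_x, tr_y = [xs[i] for i in tr], [ys[i] for i in tr]
        hits = sum(knn_predict(tr_x, tr_y, xs[i], k) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / folds


# Toy data: two well-separated clusters (not the water dataset).
rng = random.Random(0)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)] + \
     [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(40)]
ys = ["NÃO"] * 40 + ["SIM"] * 40
results = {k: cv_accuracy(xs, ys, k) for k in (5, 7, 9)}
best_k = max(results, key=results.get)
print(results, best_k)
```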

k-Nearest Neighbors 

1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1020, 1018, 1018, 1018, 1018 
Resampling results across tuning parameters:

  k  ROC        Sens       Spec     
  5  0.8789669  0.6938883  0.9217766
  7  0.8803394  0.6922760  0.9139764
  9  0.8865438  0.6844644  0.9218012

ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 9.

9-nearest neighbor model
Training set outcome distribution:

NÃO SIM 
634 639 

Running the model prediction with a cutoff of .50
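
The cutoff turns a predicted probability of the SIM class into a class label. A minimal Python sketch, assuming the label is SIM when the probability is at or above the cutoff (whether the comparison used for the report is inclusive is not stated). The probabilities are illustrative, and the cutoffs are the ones evaluated in this report.

```python
def classify(prob_sim, cutoff=0.50):
    """Label a sample SIM when its predicted SIM probability reaches the cutoff."""
    return "SIM" if prob_sim >= cutoff else "NÃO"


probs = [0.11, 0.33, 0.44, 0.56, 0.89]  # illustrative probabilities
for cutoff in (0.30, 0.40, 0.50, 0.60):
    print(cutoff, [classify(p, cutoff) for p in probs])
```

Lowering the cutoff labels more samples as SIM, which raises sensitivity at the expense of specificity.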

Analyzing model performance

- Confusion matrix

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 199  21
       SIM  72 252
                                          
               Accuracy : 0.829           
                 95% CI : (0.7947, 0.8597)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6578          
                                          
 Mcnemar's Test P-Value : 2.163e-07       
                                          
            Sensitivity : 0.9231          
            Specificity : 0.7343          
         Pos Pred Value : 0.7778          
         Neg Pred Value : 0.9045          
             Prevalence : 0.5018          
         Detection Rate : 0.4632          
   Detection Prevalence : 0.5956          
      Balanced Accuracy : 0.8287          
                                          
       'Positive' Class : SIM             
                                          
[1] 0.8105906
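
The statistics printed above follow directly from the four cell counts of the matrix, with SIM as the positive class. The Python sketch below redoes the arithmetic. The report does not label the last value, 0.8105906; it coincides with the F1 score computed for the NÃO class, which is an inference from the numbers and not something the source states.

```python
# Counts from the confusion matrix above (rows: prediction, columns: reference).
pred_nao_ref_nao, pred_nao_ref_sim = 199, 21
pred_sim_ref_nao, pred_sim_ref_sim = 72, 252

total = pred_nao_ref_nao + pred_nao_ref_sim + pred_sim_ref_nao + pred_sim_ref_sim
accuracy = (pred_nao_ref_nao + pred_sim_ref_sim) / total
sensitivity = pred_sim_ref_sim / (pred_sim_ref_sim + pred_nao_ref_sim)  # SIM recall
specificity = pred_nao_ref_nao / (pred_nao_ref_nao + pred_sim_ref_nao)  # NÃO recall
ppv = pred_sim_ref_sim / (pred_sim_ref_sim + pred_sim_ref_nao)
npv = pred_nao_ref_nao / (pred_nao_ref_nao + pred_nao_ref_sim)
# F1 with NÃO treated as the class of interest.
f1_nao = 2 * pred_nao_ref_nao / (
    2 * pred_nao_ref_nao + pred_nao_ref_sim + pred_sim_ref_nao
)

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
print(round(ppv, 4), round(npv, 4), round(f1_nao, 7))
```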

- ROC curve and AUC

Setting levels: control = NÃO, case = SIM
Setting direction: controls < cases

Evaluating other cutoff values

.30

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 144   4
       SIM 127 269
                                         
               Accuracy : 0.7592         
                 95% CI : (0.721, 0.7946)
    No Information Rate : 0.5018         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.5176         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9853         
            Specificity : 0.5314         
         Pos Pred Value : 0.6793         
         Neg Pred Value : 0.9730         
             Prevalence : 0.5018         
         Detection Rate : 0.4945         
   Detection Prevalence : 0.7279         
      Balanced Accuracy : 0.7584         
                                         
       'Positive' Class : SIM            
                                         
[1] 0.6873508

.40

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 171  12
       SIM 100 261
                                          
               Accuracy : 0.7941          
                 95% CI : (0.7577, 0.8273)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5877          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9560          
            Specificity : 0.6310          
         Pos Pred Value : 0.7230          
         Neg Pred Value : 0.9344          
             Prevalence : 0.5018          
         Detection Rate : 0.4798          
   Detection Prevalence : 0.6636          
      Balanced Accuracy : 0.7935          
                                          
       'Positive' Class : SIM             
                                          
[1] 0.753304

.60

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 218  42
       SIM  53 231
                                          
               Accuracy : 0.8254          
                 95% CI : (0.7908, 0.8564)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6507          
                                          
 Mcnemar's Test P-Value : 0.3049          
                                          
            Sensitivity : 0.8462          
            Specificity : 0.8044          
         Pos Pred Value : 0.8134          
         Neg Pred Value : 0.8385          
             Prevalence : 0.5018          
         Detection Rate : 0.4246          
   Detection Prevalence : 0.5221          
      Balanced Accuracy : 0.8253          
                                          
       'Positive' Class : SIM             
                                          
[1] 0.8210923

Model prediction with a cutoff of .50

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 199  21
       SIM  72 252
                                          
               Accuracy : 0.829           
                 95% CI : (0.7947, 0.8597)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6578          
                                          
 Mcnemar's Test P-Value : 2.163e-07       
                                          
            Sensitivity : 0.9231          
            Specificity : 0.7343          
         Pos Pred Value : 0.7778          
         Neg Pred Value : 0.9045          
             Prevalence : 0.5018          
         Detection Rate : 0.4632          
   Detection Prevalence : 0.5956          
      Balanced Accuracy : 0.8287          
                                          
       'Positive' Class : SIM             
                                          
[1] 0.8935228

Summary

Item                  Value
--------------------  ------
Number of records     1,822
Number of variables   17
K                     9
Cutoff                0.50
Sensitivity           0.8974
Specificity           0.7380
Accuracy              0.8180

Confusion matrix (rows: prediction, columns: reference):

       NÃO  SIM
NÃO    200   28
SIM     71  245

Bold: reference value

Classification (Random Forest)

Parameter Tuning

Cross-validation: 5-fold, mtry = 8, ntree = (500, 1000, 1500)
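
The size of this search can be worked out before running it. The Python sketch below enumerates the grid, assuming mtry ranges from 1 to 8 as in the results table that follows, and computes the training-set size of each fold when 1273 samples are divided as evenly as possible. `training_sizes` is a hypothetical helper name, and the order of the folds is arbitrary.

```python
from itertools import product

mtry_values = range(1, 9)  # 1..8, as in the results table
ntree_values = (500, 1000, 1500)
grid = list(product(mtry_values, ntree_values))


def training_sizes(n_samples, folds):
    """Training-set size per fold when n_samples is split as evenly as possible."""
    base, extra = divmod(n_samples, folds)
    fold_sizes = [base + 1 if f < extra else base for f in range(folds)]
    return [n_samples - size for size in fold_sizes]


print(len(grid), len(grid) * 5)  # combinations, model fits with 5 folds
print(training_sizes(1273, 5))
```

With 5 folds this gives 24 combinations and 120 model fits, and training sizes of 1018 and 1019, the same values that appear in the "Summary of sample sizes" line of the output.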

1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1018, 1018, 1019, 1019, 1018 
Resampling results across tuning parameters:

  mtry  ntree  Accuracy   Kappa    
  1      500   0.8436622  0.6873492
  1     1000   0.8475930  0.6951833
  1     1500   0.8515146  0.7030395
  2      500   0.8978786  0.7957609
  2     1000   0.8994473  0.7989038
  2     1500   0.9002254  0.8004633
  3      500   0.9151583  0.8303272
  3     1000   0.9159457  0.8319036
  3     1500   0.9198672  0.8397445
  4      500   0.9261633  0.8523313
  4     1000   0.9300818  0.8601695
  4     1500   0.9300849  0.8601767
  5      500   0.9324440  0.8648920
  5     1000   0.9324409  0.8648867
  5     1500   0.9300818  0.8601642
  6      500   0.9347970  0.8695966
  6     1000   0.9371530  0.8743101
  6     1500   0.9371561  0.8743158
  7      500   0.9387216  0.8774458
  7     1000   0.9379404  0.8758818
  7     1500   0.9379404  0.8758850
  8      500   0.9379373  0.8758766
  8     1000   0.9387216  0.8774458
  8     1500   0.9363656  0.8727354

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 7 and ntree = 500.

                Length Class      Mode     
call               5   -none-     call     
type               1   -none-     character
predicted       1273   factor     numeric  
err.rate        1500   -none-     numeric  
confusion          6   -none-     numeric  
votes           2546   matrix     numeric  
oob.times       1273   -none-     numeric  
classes            2   -none-     character
importance        17   -none-     numeric  
importanceSD       0   -none-     NULL     
localImportance    0   -none-     NULL     
proximity          0   -none-     NULL     
ntree              1   -none-     numeric  
mtry               1   -none-     numeric  
forest            14   -none-     list     
y               1273   factor     numeric  
test               0   -none-     NULL     
inbag              0   -none-     NULL     
xNames            17   -none-     character
problemType        1   -none-     character
tuneValue          2   data.frame list     
obsLevels          2   -none-     character
param              0   -none-     list     
1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1145, 1147, 1145, 1146, 1146, 1146, ... 
Resampling results across tuning parameters:

  mtry  ntree  Accuracy   Kappa    
  1      500   0.8522926  0.7045651
  1     1000   0.8467685  0.6935307
  1     1500   0.8491368  0.6982576
  2      500   0.9018013  0.8035853
  2     1000   0.9049510  0.8098915
  2     1500   0.9057261  0.8114336
  3      500   0.9285303  0.8570551
  3     1000   0.9285118  0.8570122
  3     1500   0.9261558  0.8522987
  4      500   0.9355862  0.8711625
  4     1000   0.9348049  0.8696027
  4     1500   0.9332363  0.8664636
  5      500   0.9371794  0.8743591
  5     1000   0.9395293  0.8790588
  5     1500   0.9395293  0.8790529
  6      500   0.9403106  0.8806162
  6     1000   0.9411041  0.8822058
  6     1500   0.9418854  0.8837722
  7      500   0.9426728  0.8853415
  7     1000   0.9426851  0.8853690
  7     1500   0.9442476  0.8884956
  8      500   0.9426728  0.8853446
  8     1000   0.9450227  0.8900459
  8     1500   0.9458101  0.8916186

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 8 and ntree = 1500.

                Length Class      Mode     
call               5   -none-     call     
type               1   -none-     character
predicted       1273   factor     numeric  
err.rate        4500   -none-     numeric  
confusion          6   -none-     numeric  
votes           2546   matrix     numeric  
oob.times       1273   -none-     numeric  
classes            2   -none-     character
importance        17   -none-     numeric  
importanceSD       0   -none-     NULL     
localImportance    0   -none-     NULL     
proximity          0   -none-     NULL     
ntree              1   -none-     numeric  
mtry               1   -none-     numeric  
forest            14   -none-     list     
y               1273   factor     numeric  
test               0   -none-     NULL     
inbag              0   -none-     NULL     
xNames            17   -none-     character
problemType        1   -none-     character
tuneValue          2   data.frame list     
obsLevels          2   -none-     character
param              0   -none-     list     
1273 samples
  17 predictor
   2 classes: 'NÃO', 'SIM' 

No pre-processing
Resampling: Cross-Validated (15 fold) 
Summary of sample sizes: 1187, 1188, 1188, 1189, 1188, 1189, ... 
Resampling results across tuning parameters:

  mtry  ntree  Accuracy   Kappa    
  1      200   0.8491075  0.6981572
  1      500   0.8452411  0.6904364
  1     1000   0.8475018  0.6949492
  2      200   0.9040690  0.8080906
  2      500   0.9009680  0.8018682
  2     1000   0.9048438  0.8096393
  3      200   0.9253111  0.8505769
  3      500   0.9292424  0.8584352
  3     1000   0.9284299  0.8568128
  4      200   0.9316042  0.8631759
  4      500   0.9378607  0.8756871
  4     1000   0.9378701  0.8757086
  5      200   0.9363294  0.8726111
  5      500   0.9418381  0.8836469
  5     1000   0.9371229  0.8741975
  6      200   0.9370947  0.8741539
  6      500   0.9426224  0.8852083
  6     1000   0.9410629  0.8820826
  7      200   0.9410538  0.8820663
  7      500   0.9418472  0.8836466
  7     1000   0.9426131  0.8851903
  8      200   0.9426131  0.8851964
  8      500   0.9434067  0.8867776
  8     1000   0.9441817  0.8883271

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 8 and ntree = 1000.
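
The selection rule stated above (largest resampled accuracy wins) can be reproduced by hand. The sketch below is plain Python, independent of the R pipeline that produced the output; the `grid` list is an excerpt of the 15-fold table (rows with mtry 6 to 8), and the variable names are illustrative.

```python
# Excerpt of the 15-fold resampling table above: (mtry, ntree, accuracy).
grid = [
    (6, 200, 0.9370947), (6, 500, 0.9426224), (6, 1000, 0.9410629),
    (7, 200, 0.9410538), (7, 500, 0.9418472), (7, 1000, 0.9426131),
    (8, 200, 0.9426131), (8, 500, 0.9434067), (8, 1000, 0.9441817),
]

# "Accuracy was used to select the optimal model using the largest value."
best = max(grid, key=lambda row: row[2])
best_mtry, best_ntree, best_acc = best
print(f"selected mtry={best_mtry}, ntree={best_ntree}, accuracy={best_acc:.4f}")

# Gap between the winner and the runner-up, to judge how decisive the choice is.
runner_up = sorted(grid, key=lambda row: row[2], reverse=True)[1]
gap = best_acc - runner_up[2]
print(f"runner-up mtry={runner_up[0]}, ntree={runner_up[1]}, gap={gap:.4f}")
```

The gap to the runner-up is below 0.001, so the top configurations may be practically equivalent.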

                Length Class      Mode     
call               5   -none-     call     
type               1   -none-     character
predicted       1273   factor     numeric  
err.rate        3000   -none-     numeric  
confusion          6   -none-     numeric  
votes           2546   matrix     numeric  
oob.times       1273   -none-     numeric  
classes            2   -none-     character
importance        17   -none-     numeric  
importanceSD       0   -none-     NULL     
localImportance    0   -none-     NULL     
proximity          0   -none-     NULL     
ntree              1   -none-     numeric  
mtry               1   -none-     numeric  
forest            14   -none-     list     
y               1273   factor     numeric  
test               0   -none-     NULL     
inbag              0   -none-     NULL     
xNames            17   -none-     character
problemType        1   -none-     character
tuneValue          2   data.frame list     
obsLevels          2   -none-     character
param              0   -none-     list     

Running the model - mtry = 8 - ntree = 1000

            MeanDecreaseAccuracy
aluminium              112.87229
ammonia                 29.45467
arsenic                 32.41532
barium                  15.63559
cadmium                 47.31261
chloramine              32.18045
chromium                19.73895
copper                  33.49641
bacteria                32.98521
viruses                 54.04992
nitrates                44.63777
nitrites                42.10530
perchlorate             58.88871
radium                  41.29022
selenium                12.78911
silver                  75.23968
uranium                 60.04828
            MeanDecreaseGini
aluminium         146.202336
ammonia            20.070994
arsenic            40.112426
barium             11.789671
cadmium            85.173910
chloramine         34.079657
chromium           17.646571
copper             21.033685
bacteria           15.392236
viruses            28.227770
nitrates           26.500947
nitrites           25.691018
perchlorate        51.408950
radium             23.546994
selenium            7.404903
silver             51.264629
uranium            30.006274
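
The two importance tables above can be compared by ranking the variables under each measure. The sketch below is plain Python with the values copied from the output; the dictionary names and the choice of a top-5 cut are illustrative assumptions, not part of the original analysis.

```python
# Values copied from the two importance tables above.
mean_decrease_accuracy = {
    "aluminium": 112.87229, "ammonia": 29.45467, "arsenic": 32.41532,
    "barium": 15.63559, "cadmium": 47.31261, "chloramine": 32.18045,
    "chromium": 19.73895, "copper": 33.49641, "bacteria": 32.98521,
    "viruses": 54.04992, "nitrates": 44.63777, "nitrites": 42.10530,
    "perchlorate": 58.88871, "radium": 41.29022, "selenium": 12.78911,
    "silver": 75.23968, "uranium": 60.04828,
}
mean_decrease_gini = {
    "aluminium": 146.202336, "ammonia": 20.070994, "arsenic": 40.112426,
    "barium": 11.789671, "cadmium": 85.173910, "chloramine": 34.079657,
    "chromium": 17.646571, "copper": 21.033685, "bacteria": 15.392236,
    "viruses": 28.227770, "nitrates": 26.500947, "nitrites": 25.691018,
    "perchlorate": 51.408950, "radium": 23.546994, "selenium": 7.404903,
    "silver": 51.264629, "uranium": 30.006274,
}

def top_k(scores, k=5):
    """Return the k variable names with the largest importance score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

top_acc = top_k(mean_decrease_accuracy)
top_gini = top_k(mean_decrease_gini)
shared = sorted(set(top_acc) & set(top_gini))

print("top 5 by MeanDecreaseAccuracy:", top_acc)
print("top 5 by MeanDecreaseGini:    ", top_gini)
print("in both top-5 lists:          ", shared)
```

Aluminium ranks first under both measures; the remaining positions differ, which suggests reading the two tables together rather than relying on one.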

 [1] 6348 5313 2929 4205 3831 4419 3899 4019 3121 4096 5423 5291 7167 5235 3019
[16] 5926 4203

Prediction

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 250  18
       SIM  21 255
                                          
               Accuracy : 0.9283          
                 95% CI : (0.9033, 0.9485)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8566          
                                          
 Mcnemar's Test P-Value : 0.7488          
                                          
            Sensitivity : 0.9341          
            Specificity : 0.9225          
         Pos Pred Value : 0.9239          
         Neg Pred Value : 0.9328          
             Prevalence : 0.5018          
         Detection Rate : 0.4688          
   Detection Prevalence : 0.5074          
      Balanced Accuracy : 0.9283          
                                          
       'Positive' Class : SIM             
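
The statistics above follow directly from the four cells of the confusion matrix, with SIM as the positive class. The sketch below is plain Python, independent of the R code that produced the output; the function name `confusion_metrics` is hypothetical, and the formulas are the standard definitions.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }

# Counts from the matrix above: rows are predictions, columns are reference.
m = confusion_metrics(tp=255, fn=18, fp=21, tn=250)
for name, value in m.items():
    print(f"{name:18s} {value:.4f}")
```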
                                          

ROC Curve and AUC

Setting levels: control = NÃO, case = SIM
Setting direction: controls < cases

Changing the Cutoff Point

Confusion Matrix and Statistics

          Reference
Prediction NÃO SIM
       NÃO 248  15
       SIM  23 258
                                          
               Accuracy : 0.9301          
                 95% CI : (0.9054, 0.9501)
    No Information Rate : 0.5018          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8603          
                                          
 Mcnemar's Test P-Value : 0.2561          
                                          
            Sensitivity : 0.9451          
            Specificity : 0.9151          
         Pos Pred Value : 0.9181          
         Neg Pred Value : 0.9430          
             Prevalence : 0.5018          
         Detection Rate : 0.4743          
   Detection Prevalence : 0.5165          
      Balanced Accuracy : 0.9301          
                                          
       'Positive' Class : SIM             
                                          
95% CI: 0.9687-0.9885 (DeLong)
95% CI (2000 stratified bootstrap replicates):
 thresholds sp.low sp.median sp.high se.low se.median se.high
      0.436 0.8819    0.9151  0.9446 0.9158    0.9451  0.9707
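
With the cutoff of 0.436 reported above, sensitivity rose from 0.9341 to 0.9451 while specificity fell from 0.9225 to 0.9151. The sketch below illustrates that trade-off in plain Python. The probabilities and labels are invented toy values, not data from this study, and `rates_at_cutoff` is a hypothetical helper.

```python
def rates_at_cutoff(probs, labels, cutoff):
    """Classify as positive when prob >= cutoff; return (sensitivity, specificity)."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Toy predicted probabilities of the positive class (SIM) and true labels (1 = SIM).
probs = [0.95, 0.80, 0.62, 0.47, 0.45, 0.30, 0.44, 0.52, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

default = rates_at_cutoff(probs, labels, 0.5)
lowered = rates_at_cutoff(probs, labels, 0.436)
print("cutoff 0.500 -> sensitivity %.2f, specificity %.2f" % default)
print("cutoff 0.436 -> sensitivity %.2f, specificity %.2f" % lowered)
```

Lowering the cutoff moves borderline cases into the positive class, so sensitivity can only stay the same or increase, and specificity can only stay the same or decrease.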

Results

KNN

Confusion matrix (rows: prediction, columns: reference):

           Reference
Prediction NÃO SIM
       NÃO 199  21
       SIM  72 252

Statistic    Value
Accuracy     0.829
95% CI       (0.7947, 0.8597)
Sensitivity  0.9231
Specificity  0.7343
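
The KNN accuracy interval above, (0.7947, 0.8597), can be checked by hand. The sketch below assumes it is an exact binomial (Clopper-Pearson) interval, which the source does not state explicitly. It is plain Python using bisection on the binomial tail probabilities, and all function names are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability P(X = k) for X ~ Binomial(n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def tail_ge(x, n, p):
    """P(X >= x); increasing in p."""
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(x, n + 1))

def tail_le(x, n, p):
    """P(X <= x); decreasing in p."""
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(0, x + 1))

def find_root(f, lo, hi, iters=60):
    """Bisection for an increasing function f with f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    eps = 1e-12
    lower = 0.0 if x == 0 else find_root(
        lambda p: tail_ge(x, n, p) - alpha / 2, eps, 1 - eps)
    upper = 1.0 if x == n else find_root(
        lambda p: alpha / 2 - tail_le(x, n, p), eps, 1 - eps)
    return lower, upper

# KNN: 199 + 252 = 451 correct predictions out of 544 test samples.
lo, hi = clopper_pearson(451, 544)
print(f"accuracy = {451 / 544:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```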

Random Forest

Confusion matrix (rows: prediction, columns: reference):

           Reference
Prediction NÃO SIM
       NÃO 239  10
       SIM  32 263

Statistic    Value
Accuracy     0.9228
95% CI       (0.8971, 0.9438)
Sensitivity  0.9634
Specificity  0.8819

Conclusion

  • RF was the best-performing model, with higher accuracy, sensitivity, and specificity than KNN;
  • The accuracy, sensitivity, and specificity obtained validate the model for academic purposes;
  • Real data are needed to evaluate the model more thoroughly;
  • Changing the cutoff point should be evaluated as a way to improve sensitivity.