'data.frame': 7999 obs. of 21 variables:
$ aluminium : num 1.65 2.32 1.01 1.36 0.92 0.94 2.36 3.93 0.6 0.22 ...
$ ammonia : chr "9.08" "21.16" "14.02" "11.33" ...
$ arsenic : num 0.04 0.01 0.04 0.04 0.03 0.03 0.01 0.04 0.01 0.02 ...
$ barium : num 2.85 3.31 0.58 2.96 0.2 2.88 1.35 0.66 0.71 1.37 ...
$ cadmium : num 0.007 0.002 0.008 0.001 0.006 0.003 0.004 0.001 0.005 0.007 ...
$ chloramine : num 0.35 5.28 4.24 7.23 2.67 0.8 1.28 6.22 3.14 6.4 ...
$ chromium : num 0.83 0.68 0.53 0.03 0.69 0.43 0.62 0.1 0.77 0.49 ...
$ copper : num 0.17 0.66 0.02 1.66 0.57 1.38 1.88 1.86 1.45 0.82 ...
$ flouride : num 0.05 0.9 0.99 1.08 0.61 0.11 0.33 0.86 0.98 1.24 ...
$ bacteria : num 0.2 0.65 0.05 0.71 0.13 0.67 0.13 0.16 0.35 0.83 ...
$ viruses : num 0 0.65 0.003 0.71 0.001 0.67 0.007 0.005 0.002 0.83 ...
$ lead : num 0.054 0.1 0.078 0.016 0.117 0.135 0.021 0.197 0.167 0.109 ...
$ nitrates : num 16.08 2.01 14.16 1.41 6.74 ...
$ nitrites : num 1.13 1.93 1.11 1.29 1.11 1.89 1.78 1.81 1.84 1.46 ...
$ mercury : num 0.007 0.003 0.006 0.004 0.003 0.006 0.007 0.001 0.004 0.01 ...
$ perchlorate: num 37.75 32.26 50.28 9.12 16.9 ...
$ radium : num 6.78 3.21 7.07 1.72 2.41 5.42 2.84 7.24 4.99 0.08 ...
$ selenium : num 0.08 0.08 0.07 0.02 0.02 0.08 0.1 0.08 0.08 0.03 ...
$ silver : num 0.34 0.27 0.44 0.45 0.06 0.19 0.24 0.08 0.25 0.31 ...
$ uranium : num 0.02 0.05 0.01 0.05 0.02 0.02 0.08 0.07 0.08 0.01 ...
$ is_safe : chr "1" "1" "0" "1" ...
Biomedical Engineering Program
A study of methods for classifying water samples by potability using the samples' chemical characteristics
Student: Walner Passos
Instructors: Leticia Raposo and Diogo Antônio Tschoeke
Exploratory Data Analysis (EDA)
Dataset Overview
We use the WaterQuality dataset, available on Kaggle at https://www.kaggle.com/datasets/mssmartypants/water-quality/data. The dataset was created from imaginary data on water quality in an urban environment.
Data Loading and Initial Inspection
Data summary:
- Number of rows: 7999
- Number of explanatory variables: 20
- Number of label variables: 1
| Explanatory variable | Data type | Dangerous above |
|---|---|---|
| aluminium | quantitative | 2.8 |
| ammonia | quantitative | 32.5 |
| arsenic | quantitative | 0.01 |
| barium | quantitative | 2 |
| cadmium | quantitative | 0.005 |
| chloramine | quantitative | 4 |
| chromium | quantitative | 0.1 |
| copper | quantitative | 1.3 |
| flouride | quantitative | 1.5 |
| bacteria | quantitative | 0 |
| viruses | quantitative | 0 |
| lead | quantitative | 0.015 |
| nitrates | quantitative | 10 |
| nitrites | quantitative | 1 |
| mercury | quantitative | 0.002 |
| perchlorate | quantitative | 56 |
| radium | quantitative | 5 |
| selenium | quantitative | 0.5 |
| silver | quantitative | 0.1 |
| uranium | quantitative | 0.3 |

| Response variable | Data type | Note |
|---|---|---|
| is_safe | chr | "0", "1" |
Dataset Information
Variables
Statistical Summary of the Data
aluminium ammonia arsenic barium
Min. :0.0000 Length:7999 Min. :0.0000 Min. :0.000
1st Qu.:0.0400 Class :character 1st Qu.:0.0300 1st Qu.:0.560
Median :0.0700 Mode :character Median :0.0500 Median :1.190
Mean :0.6662 Mean :0.1614 Mean :1.568
3rd Qu.:0.2800 3rd Qu.:0.1000 3rd Qu.:2.480
Max. :5.0500 Max. :1.0500 Max. :4.940
cadmium chloramine chromium copper
Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00800 1st Qu.:0.100 1st Qu.:0.0500 1st Qu.:0.0900
Median :0.04000 Median :0.530 Median :0.0900 Median :0.7500
Mean :0.04281 Mean :2.177 Mean :0.2472 Mean :0.8059
3rd Qu.:0.07000 3rd Qu.:4.240 3rd Qu.:0.4400 3rd Qu.:1.3900
Max. :0.13000 Max. :8.680 Max. :0.9000 Max. :2.0000
flouride bacteria viruses lead
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.4050 1st Qu.:0.0000 1st Qu.:0.0020 1st Qu.:0.04800
Median :0.7700 Median :0.2200 Median :0.0080 Median :0.10200
Mean :0.7716 Mean :0.3197 Mean :0.3286 Mean :0.09945
3rd Qu.:1.1600 3rd Qu.:0.6100 3rd Qu.:0.7000 3rd Qu.:0.15100
Max. :1.5000 Max. :1.0000 Max. :1.0000 Max. :0.20000
nitrates nitrites mercury perchlorate
Min. : 0.000 Min. :0.00 Min. :0.000000 Min. : 0.00
1st Qu.: 5.000 1st Qu.:1.00 1st Qu.:0.003000 1st Qu.: 2.17
Median : 9.930 Median :1.42 Median :0.005000 Median : 7.74
Mean : 9.819 Mean :1.33 Mean :0.005194 Mean :16.46
3rd Qu.:14.610 3rd Qu.:1.76 3rd Qu.:0.008000 3rd Qu.:29.48
Max. :19.830 Max. :2.93 Max. :0.010000 Max. :60.01
radium selenium silver uranium
Min. :0.000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.820 1st Qu.:0.02000 1st Qu.:0.0400 1st Qu.:0.02000
Median :2.410 Median :0.05000 Median :0.0800 Median :0.05000
Mean :2.921 Mean :0.04968 Mean :0.1478 Mean :0.04467
3rd Qu.:4.670 3rd Qu.:0.07000 3rd Qu.:0.2400 3rd Qu.:0.07000
Max. :7.990 Max. :0.10000 Max. :0.5000 Max. :0.09000
is_safe
Length:7999
Class :character
Mode :character
Data Cleaning and Preparation
- No missing data:
Number of missing values: 0
- Analysis of the ammonia variable
Confirming the variable's data type: character
Converting the data type
Confirming the conversion: double
Checking for inconsistencies: 3
Removing records:
Confirming the removal: 0
- Analysis of the labels
Labels: 1 0
- Relabeled the classes ("0" = NÃO / "1" = SIM)
Relabeling...
Confirming the change: NÃO SIM
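The cleaning steps listed above (convert a text column to numbers, drop the records that cannot be converted, then recode the labels) can be sketched as follows. This is a minimal Python illustration, not the report's own code: the toy rows, the `"n/a"` placeholder for an invalid entry, and the helper names `to_float_or_none` and `clean_rows` are all assumptions made for the example.

```python
import math

def to_float_or_none(text):
    """Parse a string as a finite float; return None when it is not numeric."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None

def clean_rows(rows, column):
    """Convert `column` to float and drop the rows where conversion fails."""
    kept, dropped = [], 0
    for row in rows:
        value = to_float_or_none(row[column])
        if value is None:
            dropped += 1
            continue
        kept.append({**row, column: value})
    return kept, dropped

# Hypothetical toy rows; "n/a" stands in for whatever the invalid entries were.
rows = [
    {"ammonia": "9.08", "is_safe": "1"},
    {"ammonia": "n/a", "is_safe": "0"},
    {"ammonia": "21.16", "is_safe": "0"},
]
cleaned, n_dropped = clean_rows(rows, "ammonia")

# Recode the labels the same way the report does: "0" -> NÃO, "1" -> SIM.
LABELS = {"0": "NÃO", "1": "SIM"}
relabeled = [{**row, "is_safe": LABELS[row["is_safe"]]} for row in cleaned]
```

Counting the dropped rows separately makes it easy to confirm that the number of removed records equals the number of inconsistencies found, as in the log above (3 found, 0 remaining).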
Data Analysis
Class Balance
Univariate Analysis
Non-numerical variable(s) ignored: is_safe
Descriptive Statistics
dados
N: 7996
aluminium ammonia arsenic bacteria barium cadmium chloramine
----------------- ----------- --------- --------- ---------- --------- --------- ------------
Mean 0.67 14.28 0.16 0.32 1.57 0.04 2.18
Std.Dev 1.27 8.88 0.25 0.33 1.22 0.04 2.57
Min 0.00 -0.08 0.00 0.00 0.00 0.00 0.00
Q1 0.04 6.58 0.03 0.00 0.56 0.01 0.10
Median 0.07 14.13 0.05 0.22 1.19 0.04 0.53
Q3 0.28 22.13 0.10 0.61 2.49 0.07 4.24
Max 5.05 29.84 1.05 1.00 4.94 0.13 8.68
MAD 0.06 11.58 0.04 0.33 1.20 0.05 0.76
IQR 0.24 15.55 0.07 0.61 1.92 0.06 4.14
CV 1.90 0.62 1.56 1.03 0.78 0.84 1.18
Skewness 2.01 0.03 1.98 0.55 0.66 0.48 0.89
SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03 0.03
Kurtosis 2.72 -1.23 2.68 -1.14 -0.70 -0.99 -0.68
N.Valid 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
N 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Table: Table continues below
chromium copper flouride lead mercury nitrates nitrites
----------------- ---------- --------- ---------- --------- --------- ---------- ----------
Mean 0.25 0.81 0.77 0.10 0.01 9.82 1.33
Std.Dev 0.27 0.65 0.44 0.06 0.00 5.54 0.57
Min 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Q1 0.05 0.09 0.41 0.05 0.00 5.00 1.00
Median 0.09 0.75 0.77 0.10 0.00 9.93 1.42
Q3 0.44 1.39 1.16 0.15 0.01 14.61 1.76
Max 0.90 2.00 1.50 0.20 0.01 19.83 2.93
MAD 0.10 0.96 0.56 0.08 0.00 7.06 0.53
IQR 0.39 1.30 0.75 0.10 0.00 9.61 0.76
CV 1.09 0.81 0.56 0.59 0.57 0.56 0.43
Skewness 1.03 0.25 -0.04 -0.06 -0.08 -0.04 -0.50
SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03 0.03
Kurtosis -0.37 -1.35 -1.17 -1.16 -1.17 -1.19 -0.36
N.Valid 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
N 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Table: Table continues below
perchlorate radium selenium silver uranium viruses
----------------- ------------- --------- ---------- --------- --------- ---------
Mean 16.47 2.92 0.05 0.15 0.04 0.33
Std.Dev 17.69 2.32 0.03 0.14 0.03 0.38
Min 0.00 0.00 0.00 0.00 0.00 0.00
Q1 2.17 0.82 0.02 0.04 0.02 0.00
Median 7.74 2.41 0.05 0.08 0.05 0.01
Q3 29.50 4.67 0.07 0.24 0.07 0.70
Max 60.01 7.99 0.10 0.50 0.09 1.00
MAD 10.73 2.64 0.03 0.09 0.03 0.01
IQR 27.32 3.85 0.05 0.20 0.05 0.70
CV 1.07 0.80 0.58 0.97 0.60 1.15
Skewness 0.94 0.55 0.01 1.03 -0.03 0.42
SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03
Kurtosis -0.50 -0.93 -1.10 -0.29 -1.17 -1.59
N.Valid 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
N 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
| Characteristic | N = 7,996¹ |
|---|---|
| aluminium | 0.07 (0.04 - 0.28); min=0.00, max=5.05 |
| ammonia | 14 (7 - 22); min=0, max=30 |
| arsenic | 0.05 (0.03 - 0.10); min=0.00, max=1.05 |
| barium | 1.19 (0.56 - 2.49); min=0.00, max=4.94 |
| cadmium | 0.040 (0.008 - 0.070); min=0.000, max=0.130 |
| chloramine | 0.53 (0.10 - 4.24); min=0.00, max=8.68 |
| chromium | 0.09 (0.05 - 0.44); min=0.00, max=0.90 |
| copper | 0.75 (0.09 - 1.39); min=0.00, max=2.00 |
| flouride | 0.77 (0.41 - 1.16); min=0.00, max=1.50 |
| bacteria | 0.22 (0.00 - 0.61); min=0.00, max=1.00 |
| viruses | 0.01 (0.00 - 0.70); min=0.00, max=1.00 |
| lead | 0.10 (0.05 - 0.15); min=0.00, max=0.20 |
| nitrates | 9.9 (5.0 - 14.6); min=0.0, max=19.8 |
| nitrites | 1.42 (1.00 - 1.76); min=0.00, max=2.93 |
| mercury | 0.0050 (0.0030 - 0.0080); min=0.0000, max=0.0100 |
| perchlorate | 8 (2 - 29); min=0, max=60 |
| radium | 2.41 (0.82 - 4.67); min=0.00, max=7.99 |
| selenium | 0.050 (0.020 - 0.070); min=0.000, max=0.100 |
| silver | 0.08 (0.04 - 0.24); min=0.00, max=0.50 |
| uranium | 0.050 (0.020 - 0.070); min=0.000, max=0.090 |
| is_safe | |
| NÃO | 7,084 (88.6%) |
| SIM | 912 (11.4%) |
| ¹ Median (Q1 - Q3); min=Min, max=Max; n (%) | |
Bivariate Analysis
| Variable | Difference in group means | t_stat | p_value | Statistical relevance | Relevance of mean difference |
|---|---|---|---|---|---|
| aluminium | 1.3292991619 | -24.3680232 | 5.819921e-104 | Highly relevant | High difference |
| ammonia | -0.6401259820 | 2.0996246 | 3.597473e-02 | Relevant | Medium difference |
| arsenic | -0.0980346227 | 16.0088366 | 1.182082e-53 | Highly relevant | Low difference |
| barium | 0.3476075255 | -8.2015356 | 6.232701e-16 | Highly relevant | Medium difference |
| cadmium | -0.0290337665 | 28.4627692 | 2.157232e-139 | Highly relevant | Low difference |
| chloramine | 1.5077341451 | -17.4349158 | 9.952540e-61 | Highly relevant | High difference |
| chromium | 0.1552173232 | -16.0484307 | 2.119329e-52 | Highly relevant | Medium difference |
| copper | 0.0606479019 | -2.7309314 | 6.409460e-03 | Relevant | Low difference |
| flouride | 0.0089717686 | -0.5944578 | 5.523211e-01 | Not very relevant | Low difference |
| bacteria | -0.0228827565 | 2.0265525 | 4.293343e-02 | Relevant | Low difference |
| viruses | -0.1154198955 | 9.3948392 | 2.740978e-20 | Highly relevant | Medium difference |
| lead | -0.0018242116 | 0.9095015 | 3.632725e-01 | Not very relevant | Low difference |
| nitrates | -1.2569316944 | 6.3974553 | 2.293154e-10 | Highly relevant | High difference |
| nitrites | 0.0847301802 | -5.2520191 | 1.742362e-07 | Highly relevant | Low difference |
| mercury | -0.0003436321 | 3.2309886 | 1.268728e-03 | Relevant | Low difference |
| perchlorate | 4.2141229989 | -8.0645368 | 1.646648e-15 | Highly relevant | High difference |
| radium | 0.4730600464 | -5.8677652 | 5.749589e-09 | Highly relevant | Medium difference |
| selenium | -0.0027988759 | 2.7878602 | 5.392131e-03 | Relevant | Low difference |
| silver | 0.0464318405 | -8.6657750 | 1.547952e-17 | Highly relevant | Low difference |
| uranium | -0.0064001221 | 6.7995294 | 1.676211e-11 | Highly relevant | Low difference |
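For readers who want to see how a per-variable comparison of the two groups can be computed, here is a minimal Python sketch of a two-sample t statistic. It assumes Welch's unequal-variance form, which the report does not confirm, and the function name `welch_t` and the toy values are hypothetical. Only the statistic and its degrees of freedom are computed; a p-value would additionally need the t distribution, which is not in the standard library.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return (mean difference, t statistic, degrees of freedom) for a - b."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = statistics.variance(group_a) / len(group_a)
    se2_b = statistics.variance(group_b) / len(group_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return mean_a - mean_b, t_stat, df

# Hypothetical toy measurements for the two classes.
nao = [0.1, 0.2, 0.3, 0.4]
sim = [1.0, 1.2, 1.4, 1.6]
diff_ab, t_ab, df_ab = welch_t(nao, sim)
diff_ba, t_ba, df_ba = welch_t(sim, nao)
```

Swapping the group order flips the sign of both the difference and the statistic. In the output above the two columns have opposite signs on every row, which suggests they were computed with the groups in opposite order.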
Multivariate Analysis
Conclusion
Number of rows: 7996
Variable ammonia converted to num:
$ ammonia : num 9.08 21.16 14.02 11.33 24.33 …
Label recoded:
$ is_safe : Factor w/ 2 levels "NÃO","SIM"
Data preprocessing
- Check the class balance
- Check for outliers
- Normalize the data
Preprocessing
Class Balancing
We used undersampling to balance the classes.
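As an illustration of the idea, random undersampling keeps every record of the minority class and a random subset of the majority class. The Python sketch below is a simplification under stated assumptions: it equalizes the class counts exactly, whereas the exact procedure used in the report is not shown, and the function name `undersample`, the seed, and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_key, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, target))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced

# Hypothetical toy data with roughly the report's imbalance (88 NÃO, 12 SIM).
rows = [{"id": i, "is_safe": "NÃO" if i < 88 else "SIM"} for i in range(100)]
balanced = undersample(rows, "is_safe", seed=42)
counts = Counter(row["is_safe"] for row in balanced)
```

The trade-off is that most majority-class records are discarded, so the balanced set is much smaller than the original one.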
Variable Selection
We excluded some variables that did not influence the classification, because their maximum values in the dataset were below the limit value.
| Variable | Difference in group means | t_stat | p_value | Statistical relevance | Relevance of mean difference |
|---|---|---|---|---|---|
| flouride | 0.0089717686 | -0.5944578 | 5.523211e-01 | Not very relevant | Low difference |
| lead | -0.0018242116 | 0.9095015 | 3.632725e-01 | Not very relevant | Low difference |
| mercury | -0.0003436321 | 3.2309886 | 1.268728e-03 | Relevant | Low difference |
| selenium | -0.0027988759 | 2.7878602 | 5.392131e-03 | Relevant | Low difference |
'data.frame': 1817 obs. of 18 variables:
$ aluminium : num 0.01 4.8 0.11 3.07 0.01 4.98 3.55 0.06 0.03 0.22 ...
$ ammonia : num 12.14 15.52 2.87 8.07 0.51 ...
$ arsenic : num 0.3 0.63 0.34 0.05 0.06 0.38 0.03 0.01 0.07 0.001 ...
$ barium : num 1.7 3.84 2.06 4.14 1.24 0.45 1.94 0.18 0.34 3.74 ...
$ cadmium : num 0.03 0.05 0.01 0.12 0.1 0.01 0.005 0.09 0.06 0.005 ...
$ chloramine : num 3.38 1.3 7.43 3.76 0.02 3.18 0.01 0.02 0.08 5.24 ...
$ chromium : num 0.37 0.52 0.23 0.29 0.02 0.57 0.34 0.05 0.07 0.61 ...
$ copper : num 1.33 1.64 1.44 1.28 0.76 1.18 0.37 0.05 0.25 1.33 ...
$ bacteria : num 0.44 0 0.99 0 0.28 0.28 0.62 0 0.71 0.81 ...
$ viruses : num 0.005 0 0.99 0 0.007 0 0.62 0.79 0.71 0.81 ...
$ nitrates : num 4.37 10.69 14.61 1.72 3.87 ...
$ nitrites : num 2.1 1.95 2.53 1.23 1.89 1.69 1.05 0.66 1.44 1.81 ...
$ perchlorate: num 50.31 21.93 26.07 46.55 3.02 ...
$ radium : num 6.81 6.46 4.17 6.53 1.17 1.73 2.73 0.78 3.32 1.29 ...
$ selenium : num 0.02 0.03 0.08 0.01 0.07 0.05 0.04 0.06 0.06 0 ...
$ silver : num 0.31 0.35 0.18 0.32 0.05 0.43 0.03 0.03 0.07 0.44 ...
$ uranium : num 0.01 0.06 0.03 0.02 0.06 0.04 0.08 0.04 0.03 0.07 ...
$ is_safe : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
Outlier Treatment
The values of the aluminium and arsenic variables fall within the ranges reported in the literature, so they were not treated as outliers.
Data Normalization
The data were normalized using the z-score.
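A z-score rescales each variable to mean 0 and standard deviation 1 by computing z = (x - mean) / sd. The Python sketch below shows this for one column; it assumes the sample standard deviation (divisor n - 1), which may or may not match the report's choice, and it reuses the first five aluminium values printed above only as example input.

```python
import statistics

def zscore(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, divisor n - 1 (an assumption)
    return [(value - mean) / sd for value in values]

# First five aluminium values from the balanced data frame shown above.
aluminium = [0.01, 4.8, 0.11, 3.07, 0.01]
standardized = zscore(aluminium)
```

Because this toy call uses five values rather than the full column, its results will not match the standardized values printed in the report.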
'data.frame': 1817 obs. of 18 variables:
$ aluminium : num -0.76 2.365 -0.694 1.236 -0.76 ...
$ ammonia : num -0.221 0.159 -1.264 -0.679 -1.53 ...
$ arsenic : num 0.802 2.303 0.984 -0.334 -0.289 ...
$ barium : num -0.00428 1.75753 0.2921 2.00451 -0.38298 ...
$ cadmium : num -0.0159 0.563 -0.5949 2.5892 2.0103 ...
$ chloramine : num 0.213 -0.584 1.765 0.358 -1.075 ...
$ chromium : num 0.2307 0.7676 -0.2705 -0.0557 -1.0221 ...
$ copper : num 0.777 1.263 0.95 0.699 -0.117 ...
$ bacteria : num 0.352 -0.985 2.024 -0.985 -0.134 ...
$ viruses : num -0.773 -0.786 1.863 -0.786 -0.768 ...
$ nitrates : num -0.877 0.262 0.968 -1.354 -0.967 ...
$ nitrites : num 1.453 1.158 2.298 -0.256 1.041 ...
$ perchlorate: num 1.98 0.242 0.496 1.75 -0.916 ...
$ radium : num 1.557 1.406 0.422 1.436 -0.867 ...
$ selenium : num -1.017 -0.67 1.063 -1.363 0.717 ...
$ silver : num 0.9325 1.1976 0.0707 0.9987 -0.791 ...
$ uranium : num -1.199 0.638 -0.464 -0.831 0.638 ...
$ label : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
- A random seed was set
- Training and test data frames were created with 70% and 30% of the data
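The two steps above can be sketched in Python as a seeded, stratified split. This is an illustration under assumptions rather than the report's code: the seed value, the function name `stratified_split`, and the choice to round each class's training share up are hypothetical. The class counts (905 NÃO, 912 SIM) are the ones implied by the training and test outputs in this report.

```python
import math
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=123):
    """Split row indices into train/test, sampling each class separately."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = math.ceil(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

labels = ["NÃO"] * 905 + ["SIM"] * 912
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # prints "1273 544"
```

With these class counts and per-class rounding up, the split sizes are 1273 and 544, the same sizes as the training and test data frames printed below. Calling the function again with the same seed returns the same indices.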
[1] "Training"
'data.frame': 1273 obs. of 18 variables:
$ aluminium : num 2.365 -0.76 2.482 -0.727 -0.623 ...
$ ammonia : num 0.159 -1.53 -0.658 -1.18 1.117 ...
$ arsenic : num 2.303 -0.289 1.166 -0.516 -0.557 ...
$ barium : num 1.758 -0.383 -1.033 -1.256 1.675 ...
$ cadmium : num 0.563 2.01 -0.595 1.721 -0.74 ...
$ chloramine : num -0.584 -1.075 0.136 -1.075 0.925 ...
$ chromium : num 0.768 -1.022 0.947 -0.915 1.09 ...
$ copper : num 1.263 -0.117 0.542 -1.231 0.777 ...
$ bacteria : num -0.985 -0.134 -0.134 -0.985 1.477 ...
$ viruses : num -0.786 -0.768 -0.786 1.327 1.381 ...
$ nitrates : num 0.262 -0.967 -1.549 -1.547 1.179 ...
$ nitrites : num 1.158 1.041 0.648 -1.376 0.883 ...
$ perchlorate: num 0.242 -0.916 1.234 -0.655 -0.196 ...
$ radium : num 1.406 -0.867 -0.626 -1.035 -0.815 ...
$ selenium : num -0.6701 0.7166 0.0233 0.37 -1.7101 ...
$ silver : num 1.198 -0.791 1.728 -0.924 1.794 ...
$ uranium : num 0.6383 0.6383 -0.0964 -0.0964 1.0057 ...
$ label : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
[1] "-------"
[1] "Test"
'data.frame': 544 obs. of 18 variables:
$ aluminium : num -0.76 -0.694 1.236 1.55 -0.747 ...
$ ammonia : num -0.221 -1.264 -0.679 1.342 0.206 ...
$ arsenic : num 0.802 0.984 -0.334 -0.425 -0.243 ...
$ barium : num -0.00428 0.2921 2.00451 0.19331 -1.12393 ...
$ cadmium : num -0.0159 -0.5949 2.5892 -0.7396 0.8524 ...
$ chloramine : num 0.213 1.765 0.358 -1.079 -1.052 ...
$ chromium : num 0.2307 -0.2705 -0.0557 0.1233 -0.8432 ...
$ copper : num 0.777 0.95 0.699 -0.729 -0.917 ...
$ bacteria : num 0.352 2.024 -0.985 0.899 1.173 ...
$ viruses : num -0.773 1.863 -0.786 0.873 1.113 ...
$ nitrates : num -0.877 0.968 -1.354 0.572 -1.441 ...
$ nitrites : num 1.453 2.298 -0.256 -0.61 0.157 ...
$ perchlorate: num 1.98 0.496 1.75 1.95 -0.734 ...
$ radium : num 1.557 0.422 1.436 -0.197 0.057 ...
$ selenium : num -1.017 1.063 -1.363 -0.323 0.37 ...
$ silver : num 0.9325 0.0707 0.9987 -0.9236 -0.6584 ...
$ uranium : num -1.199 -0.464 -0.831 1.373 -0.464 ...
$ label : Factor w/ 2 levels "NÃO","SIM": 1 1 1 1 1 1 1 1 1 1 ...
Classification (KNN)
- We evaluated the best value of k.
k-Nearest Neighbors
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1020, 1018, 1018, 1018, 1018
Resampling results across tuning parameters:
k ROC Sens Spec
5 0.8789669 0.6938883 0.9217766
7 0.8803394 0.6922760 0.9139764
9 0.8865438 0.6844644 0.9218012
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
9-nearest neighbor model
Training set outcome distribution:
NÃO SIM
634 639
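To make the k-nearest-neighbours step concrete, the Python sketch below scores a sample as the fraction of its k nearest training points that belong to the SIM class, then applies a cutoff to that fraction. It is a minimal illustration under assumptions: Euclidean distance, simple vote fractions, and the names `knn_probability` and `classify` are choices made for the example, and the toy points are hypothetical.

```python
import math

def knn_probability(train_x, train_y, query, k, positive="SIM"):
    """Fraction of the k nearest training points labelled `positive`."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_x, train_y)
    )
    votes = sum(1 for _, label in distances[:k] if label == positive)
    return votes / k

def classify(probability, cutoff=0.5, positive="SIM", negative="NÃO"):
    """Turn a vote fraction into a class label using a cutoff."""
    return positive if probability >= cutoff else negative

# Hypothetical toy data: two well-separated clusters.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["NÃO", "NÃO", "NÃO", "SIM", "SIM", "SIM"]

p_far = knn_probability(train_x, train_y, (4.5, 4.5), k=3)
p_near = knn_probability(train_x, train_y, (0.2, 0.2), k=3)
label_far = classify(p_far)
label_near = classify(p_near)
```

In this sketch the score can only take the values 0, 1/k, ..., 1, so with k = 9 lowering the cutoff from 0.50 to 0.40 changes the rule from "at least 5 of 9 neighbours" to "at least 4 of 9".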
Running the model prediction with a cutoff of 0.50
Analyzing model performance
- Confusion matrix
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 199 21
SIM 72 252
Accuracy : 0.829
95% CI : (0.7947, 0.8597)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6578
Mcnemar's Test P-Value : 2.163e-07
Sensitivity : 0.9231
Specificity : 0.7343
Pos Pred Value : 0.7778
Neg Pred Value : 0.9045
Prevalence : 0.5018
Detection Rate : 0.4632
Detection Prevalence : 0.5956
Balanced Accuracy : 0.8287
'Positive' Class : SIM
[1] 0.8105906
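The statistics in the output above follow directly from the four cells of the confusion matrix. The Python sketch below recomputes them from the counts printed for the 0.50 cutoff, treating SIM as the positive class as the output states. The function name `metrics` is hypothetical. The last line is an observation on the numbers rather than something the report states: the unlabelled value printed after the matrix agrees with the F1 score computed with NÃO as the positive class.

```python
def metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Counts from the matrix above (rows = prediction, columns = reference).
m = metrics(tp=252, fn=21, fp=72, tn=199)

# F1 score with NÃO treated as the positive class: 2*TN / (2*TN + FP + FN).
f1_nao = 2 * 199 / (2 * 199 + 72 + 21)
```

Rounded to four decimals, `m` gives accuracy 0.829, sensitivity 0.9231, specificity 0.7343, PPV 0.7778, and NPV 0.9045, matching the printed output.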
- ROC curve and AUC
Setting levels: control = NÃO, case = SIM
Setting direction: controls < cases
Evaluating other cutoff values
Cutoff 0.30
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 144 4
SIM 127 269
Accuracy : 0.7592
95% CI : (0.721, 0.7946)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5176
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9853
Specificity : 0.5314
Pos Pred Value : 0.6793
Neg Pred Value : 0.9730
Prevalence : 0.5018
Detection Rate : 0.4945
Detection Prevalence : 0.7279
Balanced Accuracy : 0.7584
'Positive' Class : SIM
[1] 0.6873508
Cutoff 0.40
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 171 12
SIM 100 261
Accuracy : 0.7941
95% CI : (0.7577, 0.8273)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5877
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9560
Specificity : 0.6310
Pos Pred Value : 0.7230
Neg Pred Value : 0.9344
Prevalence : 0.5018
Detection Rate : 0.4798
Detection Prevalence : 0.6636
Balanced Accuracy : 0.7935
'Positive' Class : SIM
[1] 0.753304
Cutoff 0.60
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 218 42
SIM 53 231
Accuracy : 0.8254
95% CI : (0.7908, 0.8564)
No Information Rate : 0.5018
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6507
Mcnemar's Test P-Value : 0.3049
Sensitivity : 0.8462
Specificity : 0.8044
Pos Pred Value : 0.8134
Neg Pred Value : 0.8385
Prevalence : 0.5018
Detection Rate : 0.4246
Detection Prevalence : 0.5221
Balanced Accuracy : 0.8253
'Positive' Class : SIM
[1] 0.8210923
Model prediction with a cutoff of 0.50
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 199 21
SIM 72 252
Accuracy : 0.829
95% CI : (0.7947, 0.8597)
No Information Rate : 0.5018
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6578
Mcnemar's Test P-Value : 2.163e-07
Sensitivity : 0.9231
Specificity : 0.7343
Pos Pred Value : 0.7778
Neg Pred Value : 0.9045
Prevalence : 0.5018
Detection Rate : 0.4632
Detection Prevalence : 0.5956
Balanced Accuracy : 0.8287
'Positive' Class : SIM
[1] 0.8935228
Summary
| Item | Value |
|---|---|
| Number of records | 1,822 |
| Number of variables | 17 |
| k | 9 |
| Cutoff | 0.50 |
| Sensitivity | 0.8974 |
| Specificity | 0.7380 |
| Accuracy | 0.8180 |

| Prediction / Reference | NÃO | SIM |
|---|---|---|
| NÃO | 200 | 28 |
| SIM | 71 | 245 |

Bold: reference value
Classification (Random Forest)
Parameter Tuning
Cross-validation - 5-fold - mtry = 1 to 8 - ntree = (500, 1000, 1500)
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1018, 1018, 1019, 1019, 1018
Resampling results across tuning parameters:
mtry ntree Accuracy Kappa
1 500 0.8436622 0.6873492
1 1000 0.8475930 0.6951833
1 1500 0.8515146 0.7030395
2 500 0.8978786 0.7957609
2 1000 0.8994473 0.7989038
2 1500 0.9002254 0.8004633
3 500 0.9151583 0.8303272
3 1000 0.9159457 0.8319036
3 1500 0.9198672 0.8397445
4 500 0.9261633 0.8523313
4 1000 0.9300818 0.8601695
4 1500 0.9300849 0.8601767
5 500 0.9324440 0.8648920
5 1000 0.9324409 0.8648867
5 1500 0.9300818 0.8601642
6 500 0.9347970 0.8695966
6 1000 0.9371530 0.8743101
6 1500 0.9371561 0.8743158
7 500 0.9387216 0.8774458
7 1000 0.9379404 0.8758818
7 1500 0.9379404 0.8758850
8 500 0.9379373 0.8758766
8 1000 0.9387216 0.8774458
8 1500 0.9363656 0.8727354
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 7 and ntree = 500.
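The search reported above evaluates every combination of mtry and ntree with 5-fold cross-validation and keeps the most accurate one. The Python sketch below shows only the bookkeeping: building the grid, partitioning 1273 samples into folds, and picking the best entry. It does not train a random forest. The function name `kfold_indices` is hypothetical, the folds here are plain contiguous blocks (the report's resampling may shuffle or stratify), and the three accuracies in `cv_accuracy` are copied from the table above.

```python
from itertools import product

def kfold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for plain, unshuffled k-fold CV."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Every (mtry, ntree) pair in the tuning grid: 8 x 3 = 24 combinations.
grid = list(product(range(1, 9), (500, 1000, 1500)))

# Training-set size of each fold for 1273 samples and 5 folds.
train_sizes = [len(train) for train, _ in kfold_indices(1273, 5)]

# Three rows copied from the results table; the best accuracy wins.
cv_accuracy = {(7, 500): 0.9387216, (8, 1000): 0.9387216, (6, 1500): 0.9371561}
best = max(cv_accuracy, key=cv_accuracy.get)
```

The fold training sizes come out as three folds of 1018 and two of 1019, the same sizes listed in the output above, and the grid has 24 entries, one per row of the results table. Two rows tie at the printed precision; `max` keeps the first one it meets, which here is mtry = 7 with ntree = 500.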
Length Class Mode
call 5 -none- call
type 1 -none- character
predicted 1273 factor numeric
err.rate 1500 -none- numeric
confusion 6 -none- numeric
votes 2546 matrix numeric
oob.times 1273 -none- numeric
classes 2 -none- character
importance 17 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 1273 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 17 -none- character
problemType 1 -none- character
tuneValue 2 data.frame list
obsLevels 2 -none- character
param 0 -none- list
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1145, 1147, 1145, 1146, 1146, 1146, ...
Resampling results across tuning parameters:
mtry ntree Accuracy Kappa
1 500 0.8522926 0.7045651
1 1000 0.8467685 0.6935307
1 1500 0.8491368 0.6982576
2 500 0.9018013 0.8035853
2 1000 0.9049510 0.8098915
2 1500 0.9057261 0.8114336
3 500 0.9285303 0.8570551
3 1000 0.9285118 0.8570122
3 1500 0.9261558 0.8522987
4 500 0.9355862 0.8711625
4 1000 0.9348049 0.8696027
4 1500 0.9332363 0.8664636
5 500 0.9371794 0.8743591
5 1000 0.9395293 0.8790588
5 1500 0.9395293 0.8790529
6 500 0.9403106 0.8806162
6 1000 0.9411041 0.8822058
6 1500 0.9418854 0.8837722
7 500 0.9426728 0.8853415
7 1000 0.9426851 0.8853690
7 1500 0.9442476 0.8884956
8 500 0.9426728 0.8853446
8 1000 0.9450227 0.8900459
8 1500 0.9458101 0.8916186
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 8 and ntree = 1500.
Length Class Mode
call 5 -none- call
type 1 -none- character
predicted 1273 factor numeric
err.rate 4500 -none- numeric
confusion 6 -none- numeric
votes 2546 matrix numeric
oob.times 1273 -none- numeric
classes 2 -none- character
importance 17 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 1273 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 17 -none- character
problemType 1 -none- character
tuneValue 2 data.frame list
obsLevels 2 -none- character
param 0 -none- list
1273 samples
17 predictor
2 classes: 'NÃO', 'SIM'
No pre-processing
Resampling: Cross-Validated (15 fold)
Summary of sample sizes: 1187, 1188, 1188, 1189, 1188, 1189, ...
Resampling results across tuning parameters:
mtry ntree Accuracy Kappa
1 200 0.8491075 0.6981572
1 500 0.8452411 0.6904364
1 1000 0.8475018 0.6949492
2 200 0.9040690 0.8080906
2 500 0.9009680 0.8018682
2 1000 0.9048438 0.8096393
3 200 0.9253111 0.8505769
3 500 0.9292424 0.8584352
3 1000 0.9284299 0.8568128
4 200 0.9316042 0.8631759
4 500 0.9378607 0.8756871
4 1000 0.9378701 0.8757086
5 200 0.9363294 0.8726111
5 500 0.9418381 0.8836469
5 1000 0.9371229 0.8741975
6 200 0.9370947 0.8741539
6 500 0.9426224 0.8852083
6 1000 0.9410629 0.8820826
7 200 0.9410538 0.8820663
7 500 0.9418472 0.8836466
7 1000 0.9426131 0.8851903
8 200 0.9426131 0.8851964
8 500 0.9434067 0.8867776
8 1000 0.9441817 0.8883271
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 8 and ntree = 1000.
Length Class Mode
call 5 -none- call
type 1 -none- character
predicted 1273 factor numeric
err.rate 3000 -none- numeric
confusion 6 -none- numeric
votes 2546 matrix numeric
oob.times 1273 -none- numeric
classes 2 -none- character
importance 17 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 1273 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 17 -none- character
problemType 1 -none- character
tuneValue 2 data.frame list
obsLevels 2 -none- character
param 0 -none- list
Running the model - mtry = 8 - ntree = 1000
MeanDecreaseAccuracy
aluminium 112.87229
ammonia 29.45467
arsenic 32.41532
barium 15.63559
cadmium 47.31261
chloramine 32.18045
chromium 19.73895
copper 33.49641
bacteria 32.98521
viruses 54.04992
nitrates 44.63777
nitrites 42.10530
perchlorate 58.88871
radium 41.29022
selenium 12.78911
silver 75.23968
uranium 60.04828
MeanDecreaseGini
aluminium 146.202336
ammonia 20.070994
arsenic 40.112426
barium 11.789671
cadmium 85.173910
chloramine 34.079657
chromium 17.646571
copper 21.033685
bacteria 15.392236
viruses 28.227770
nitrates 26.500947
nitrites 25.691018
perchlorate 51.408950
radium 23.546994
selenium 7.404903
silver 51.264629
uranium 30.006274
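The importance scores are easier to read once sorted. The short Python sketch below ranks the variables by the MeanDecreaseAccuracy values printed above; the dictionary simply copies those numbers, and the variable names `mean_decrease_accuracy`, `ranked`, and `top5` are hypothetical.

```python
# MeanDecreaseAccuracy values copied from the output above.
mean_decrease_accuracy = {
    "aluminium": 112.87229, "ammonia": 29.45467, "arsenic": 32.41532,
    "barium": 15.63559, "cadmium": 47.31261, "chloramine": 32.18045,
    "chromium": 19.73895, "copper": 33.49641, "bacteria": 32.98521,
    "viruses": 54.04992, "nitrates": 44.63777, "nitrites": 42.10530,
    "perchlorate": 58.88871, "radium": 41.29022, "selenium": 12.78911,
    "silver": 75.23968, "uranium": 60.04828,
}

# Sort from most to least important.
ranked = sorted(
    mean_decrease_accuracy.items(), key=lambda item: item[1], reverse=True
)
top5 = [name for name, _ in ranked[:5]]
print(top5)  # ['aluminium', 'silver', 'uranium', 'perchlorate', 'viruses']
```

By this measure aluminium stands well apart from the rest, while selenium and barium contribute least.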
[1] 6348 5313 2929 4205 3831 4419 3899 4019 3121 4096 5423 5291 7167 5235 3019
[16] 5926 4203
Prediction
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 250 18
SIM 21 255
Accuracy : 0.9283
95% CI : (0.9033, 0.9485)
No Information Rate : 0.5018
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8566
Mcnemar's Test P-Value : 0.7488
Sensitivity : 0.9341
Specificity : 0.9225
Pos Pred Value : 0.9239
Neg Pred Value : 0.9328
Prevalence : 0.5018
Detection Rate : 0.4688
Detection Prevalence : 0.5074
Balanced Accuracy : 0.9283
'Positive' Class : SIM
ROC Curve and AUC
Setting levels: control = NÃO, case = SIM
Setting direction: controls < cases
Changing the cutoff
Confusion Matrix and Statistics
Reference
Prediction NÃO SIM
NÃO 248 15
SIM 23 258
Accuracy : 0.9301
95% CI : (0.9054, 0.9501)
No Information Rate : 0.5018
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8603
Mcnemar's Test P-Value : 0.2561
Sensitivity : 0.9451
Specificity : 0.9151
Pos Pred Value : 0.9181
Neg Pred Value : 0.9430
Prevalence : 0.5018
Detection Rate : 0.4743
Detection Prevalence : 0.5165
Balanced Accuracy : 0.9301
'Positive' Class : SIM
95% CI: 0.9687-0.9885 (DeLong)
95% CI (2000 stratified bootstrap replicates):
thresholds sp.low sp.median sp.high se.low se.median se.high
0.436 0.8819 0.9151 0.9446 0.9158 0.9451 0.9707
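One common way to pick a cutoff such as the 0.436 reported above is to scan candidate thresholds along the ROC curve and keep the one that maximizes Youden's J = sensitivity + specificity - 1. The report does not say which criterion it used, so the Python sketch below is an assumption-laden illustration: the function name `best_cutoff`, the use of observed scores as candidate thresholds, and the toy scores are all hypothetical. It also assumes both classes are present.

```python
def best_cutoff(scores, labels, positive="SIM"):
    """Return (J, cutoff, sensitivity, specificity) maximizing Youden's J."""
    best = None
    for cutoff in sorted(set(scores)):
        tp = fn = fp = tn = 0
        for score, label in zip(scores, labels):
            predicted_positive = score >= cutoff
            if label == positive:
                tp += predicted_positive
                fn += not predicted_positive
            else:
                fp += predicted_positive
                tn += not predicted_positive
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:  # ties keep the lower cutoff
            best = (j, cutoff, sensitivity, specificity)
    return best

# Hypothetical predicted probabilities of SIM and the true labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]
labels = ["NÃO", "NÃO", "NÃO", "SIM", "NÃO", "SIM", "SIM", "SIM"]
j, cutoff, sensitivity, specificity = best_cutoff(scores, labels)
```

On this toy data two cutoffs tie on J; keeping the lower one favours sensitivity, which fits a setting where a missed positive and a false alarm do not cost the same.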
Results
KNN
Confusion matrix:
| Prediction / Reference | NÃO | SIM |
|---|---|---|
| NÃO | 199 | 21 |
| SIM | 72 | 252 |
| Statistic | Value |
|---|---|
| Accuracy | 0.829 |
| 95% CI | (0.7947, 0.8597) |
| Sensitivity | 0.9231 |
| Specificity | 0.7343 |
Random Forest
Confusion matrix:
| Prediction / Reference | NÃO | SIM |
|---|---|---|
| NÃO | 239 | 10 |
| SIM | 32 | 263 |
| Statistic | Value |
|---|---|
| Accuracy | 0.9228 |
| 95% CI | (0.8971, 0.9438) |
| Sensitivity | 0.9634 |
| Specificity | 0.8819 |
Conclusion
- Random Forest (RF) was the best-performing model;
- The accuracy, sensitivity, and specificity obtained validate the model for academic purposes;
- Real data are needed to evaluate the model more thoroughly;
- Changing the cutoff should be evaluated as a way to improve sensitivity.