'data.frame': 7999 obs. of 21 variables:
$ aluminium : num 1.65 2.32 1.01 1.36 0.92 0.94 2.36 3.93 0.6 0.22 ...
$ ammonia : chr "9.08" "21.16" "14.02" "11.33" ...
$ arsenic : num 0.04 0.01 0.04 0.04 0.03 0.03 0.01 0.04 0.01 0.02 ...
$ barium : num 2.85 3.31 0.58 2.96 0.2 2.88 1.35 0.66 0.71 1.37 ...
$ cadmium : num 0.007 0.002 0.008 0.001 0.006 0.003 0.004 0.001 0.005 0.007 ...
$ chloramine : num 0.35 5.28 4.24 7.23 2.67 0.8 1.28 6.22 3.14 6.4 ...
$ chromium : num 0.83 0.68 0.53 0.03 0.69 0.43 0.62 0.1 0.77 0.49 ...
$ copper : num 0.17 0.66 0.02 1.66 0.57 1.38 1.88 1.86 1.45 0.82 ...
$ flouride : num 0.05 0.9 0.99 1.08 0.61 0.11 0.33 0.86 0.98 1.24 ...
$ bacteria : num 0.2 0.65 0.05 0.71 0.13 0.67 0.13 0.16 0.35 0.83 ...
$ viruses : num 0 0.65 0.003 0.71 0.001 0.67 0.007 0.005 0.002 0.83 ...
$ lead : num 0.054 0.1 0.078 0.016 0.117 0.135 0.021 0.197 0.167 0.109 ...
$ nitrates : num 16.08 2.01 14.16 1.41 6.74 ...
$ nitrites : num 1.13 1.93 1.11 1.29 1.11 1.89 1.78 1.81 1.84 1.46 ...
$ mercury : num 0.007 0.003 0.006 0.004 0.003 0.006 0.007 0.001 0.004 0.01 ...
$ perchlorate: num 37.75 32.26 50.28 9.12 16.9 ...
$ radium : num 6.78 3.21 7.07 1.72 2.41 5.42 2.84 7.24 4.99 0.08 ...
$ selenium : num 0.08 0.08 0.07 0.02 0.02 0.08 0.1 0.08 0.08 0.03 ...
$ silver : num 0.34 0.27 0.44 0.45 0.06 0.19 0.24 0.08 0.25 0.31 ...
$ uranium : num 0.02 0.05 0.01 0.05 0.02 0.02 0.08 0.07 0.08 0.01 ...
$ is_safe : chr "1" "1" "0" "1" ...
Biomedical Engineering Program
COB754 - Data Analysis
Student: Walner Passos
Instructors: Leticia Raposo and Diogo Antônio Tschoeke
Exploratory Data Analysis (EDA)
Dataset Overview
We use the WaterQuality dataset, available on Kaggle at https://www.kaggle.com/datasets/mssmartypants/water-quality/data. The dataset was created from imaginary data about water quality in an urban environment.
Loading and Initial View of the Data
Data summary:
- Number of rows: 7999
- Number of explanatory variables: 20
- Label: 1
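A minimal sketch of this loading step, assuming the Kaggle CSV has been downloaded locally. The inline text below is a hypothetical three-row stand-in for the real file, with column names taken from the `str()` output, and `"bad"` is a made-up placeholder value. A single non-numeric token is enough for `read.csv()` to parse a whole column as character, which is one plausible reason `ammonia` shows up as `chr`.

```r
# Hypothetical stand-in for the downloaded CSV (the real file has 7999 rows and 21 columns).
csv_text <- "aluminium,ammonia,is_safe
1.65,9.08,1
2.32,21.16,1
1.01,bad,0"

# With the real file, the text = argument would be replaced by the file path.
dados <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(dados)                           # one line per variable, as at the top of this report
n_rows <- nrow(dados)                # number of observations
n_cols <- ncol(dados)                # number of variables, label included
col_classes <- sapply(dados, class)  # storage class of each column
```

Checking `col_classes` right after loading makes columns such as `ammonia` stand out before any statistics are computed.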
Explanatory variable | Data type | Dangerous above |
---|---|---|
aluminium | quantitative | 2.8 |
ammonia | quantitative | 32.5 |
arsenic | quantitative | 0.01 |
barium | quantitative | 2 |
cadmium | quantitative | 0.005 |
chloramine | quantitative | 4 |
chromium | quantitative | 0.1 |
copper | quantitative | 1.3 |
flouride | quantitative | 1.5 |
bacteria | quantitative | 0 |
viruses | quantitative | 0 |
lead | quantitative | 0.015 |
nitrates | quantitative | 10 |
nitrites | quantitative | 1 |
mercury | quantitative | 0.002 |
perchlorate | quantitative | 56 |
radium | quantitative | 5 |
selenium | quantitative | 0.5 |
silver | quantitative | 0.1 |
uranium | quantitative | 0.3 |
Response variable | Data type | Note |
---|---|---|
is_safe | chr | “0”, “1” |
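The thresholds in the first table can be applied column by column to count how many samples exceed each limit. The sketch below uses a hypothetical four-row data frame and only three of the tabulated thresholds; `thresholds`, `above`, and `n_above` are illustrative names, not part of the original analysis.

```r
# Thresholds copied from the table above (subset only).
thresholds <- c(aluminium = 2.8, arsenic = 0.01, lead = 0.015)

# Hypothetical toy data; the real data frame has 7999 rows.
dados <- data.frame(
  aluminium = c(1.65, 2.32, 3.93, 0.60),
  arsenic   = c(0.04, 0.01, 0.04, 0.02),
  lead      = c(0.054, 0.010, 0.197, 0.015)
)

# For each variable, flag values strictly greater than its threshold.
above <- mapply(function(x, lim) x > lim, dados[names(thresholds)], thresholds)

# Number of samples above the threshold, per variable.
n_above <- colSums(above)
```

The comparison is strict, so a value exactly equal to the limit is not counted as dangerous, matching the "above" wording of the table.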
Dataset Information
Variables
Statistical Analysis of the Data
aluminium ammonia arsenic barium
Min. :0.0000 Length:7999 Min. :0.0000 Min. :0.000
1st Qu.:0.0400 Class :character 1st Qu.:0.0300 1st Qu.:0.560
Median :0.0700 Mode :character Median :0.0500 Median :1.190
Mean :0.6662 Mean :0.1614 Mean :1.568
3rd Qu.:0.2800 3rd Qu.:0.1000 3rd Qu.:2.480
Max. :5.0500 Max. :1.0500 Max. :4.940
cadmium chloramine chromium copper
Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00800 1st Qu.:0.100 1st Qu.:0.0500 1st Qu.:0.0900
Median :0.04000 Median :0.530 Median :0.0900 Median :0.7500
Mean :0.04281 Mean :2.177 Mean :0.2472 Mean :0.8059
3rd Qu.:0.07000 3rd Qu.:4.240 3rd Qu.:0.4400 3rd Qu.:1.3900
Max. :0.13000 Max. :8.680 Max. :0.9000 Max. :2.0000
flouride bacteria viruses lead
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.4050 1st Qu.:0.0000 1st Qu.:0.0020 1st Qu.:0.04800
Median :0.7700 Median :0.2200 Median :0.0080 Median :0.10200
Mean :0.7716 Mean :0.3197 Mean :0.3286 Mean :0.09945
3rd Qu.:1.1600 3rd Qu.:0.6100 3rd Qu.:0.7000 3rd Qu.:0.15100
Max. :1.5000 Max. :1.0000 Max. :1.0000 Max. :0.20000
nitrates nitrites mercury perchlorate
Min. : 0.000 Min. :0.00 Min. :0.000000 Min. : 0.00
1st Qu.: 5.000 1st Qu.:1.00 1st Qu.:0.003000 1st Qu.: 2.17
Median : 9.930 Median :1.42 Median :0.005000 Median : 7.74
Mean : 9.819 Mean :1.33 Mean :0.005194 Mean :16.46
3rd Qu.:14.610 3rd Qu.:1.76 3rd Qu.:0.008000 3rd Qu.:29.48
Max. :19.830 Max. :2.93 Max. :0.010000 Max. :60.01
radium selenium silver uranium
Min. :0.000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.820 1st Qu.:0.02000 1st Qu.:0.0400 1st Qu.:0.02000
Median :2.410 Median :0.05000 Median :0.0800 Median :0.05000
Mean :2.921 Mean :0.04968 Mean :0.1478 Mean :0.04467
3rd Qu.:4.670 3rd Qu.:0.07000 3rd Qu.:0.2400 3rd Qu.:0.07000
Max. :7.990 Max. :0.10000 Max. :0.5000 Max. :0.09000
is_safe
Length:7999
Class :character
Mode :character
Data Cleaning and Preparation
- No missing data:
Number of missing values: 0
- Analysis of the ammonia variable
Confirming the variable's data type: character
Changing the data type
Confirming the change: double
Checking for inconsistencies: 3
Deleting records:
Confirming deletion: 0
- Analysis of the labels
Labels: 1 0
- Relabeling the label values (“0” - NÃO (no) / “1” - SIM (yes))
Changing the labels...
Confirming the change: NÃO SIM
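The cleaning steps above can be sketched in base R as follows. The toy data frame and the placeholder string `"bad"` are hypothetical, since the report does not show what the three inconsistent `ammonia` entries contained; the sketch assumes they were non-numeric strings that `as.numeric()` turns into `NA`.

```r
# Hypothetical toy data mimicking the raw columns: both are stored as character.
dados <- data.frame(
  ammonia = c("9.08", "21.16", "bad", "14.02"),
  is_safe = c("1", "1", "bad", "0"),
  stringsAsFactors = FALSE
)

# 1. Count missing values before any conversion.
n_missing <- sum(is.na(dados))

# 2. Convert ammonia to numeric; non-numeric strings become NA (with a warning).
dados$ammonia <- suppressWarnings(as.numeric(dados$ammonia))
n_inconsistent <- sum(is.na(dados$ammonia))

# 3. Drop the inconsistent records and confirm that none remain.
dados <- dados[!is.na(dados$ammonia), ]
n_remaining <- sum(is.na(dados$ammonia))

# 4. Recode the label as a factor with descriptive levels.
dados$is_safe <- factor(dados$is_safe, levels = c("0", "1"), labels = c("NÃO", "SIM"))
```

Counting the `NA` values introduced by the conversion, rather than silently dropping them, keeps a record of how many rows were lost to cleaning.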
Data Analysis
Class Balance
Univariate Analysis
Description:
Non-numerical variable(s) ignored: is_safe
Descriptive Statistics
dados
N: 7996
aluminium ammonia arsenic bacteria barium cadmium chloramine
----------------- ----------- --------- --------- ---------- --------- --------- ------------
Mean 0.67 14.28 0.16 0.32 1.57 0.04 2.18
Std.Dev 1.27 8.88 0.25 0.33 1.22 0.04 2.57
Min 0.00 -0.08 0.00 0.00 0.00 0.00 0.00
Q1 0.04 6.58 0.03 0.00 0.56 0.01 0.10
Median 0.07 14.13 0.05 0.22 1.19 0.04 0.53
Q3 0.28 22.13 0.10 0.61 2.49 0.07 4.24
Max 5.05 29.84 1.05 1.00 4.94 0.13 8.68
MAD 0.06 11.58 0.04 0.33 1.20 0.05 0.76
IQR 0.24 15.55 0.07 0.61 1.92 0.06 4.14
CV 1.90 0.62 1.56 1.03 0.78 0.84 1.18
Skewness 2.01 0.03 1.98 0.55 0.66 0.48 0.89
SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03 0.03
Kurtosis 2.72 -1.23 2.68 -1.14 -0.70 -0.99 -0.68
N.Valid 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
N 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Table: Table continues below
chromium copper flouride lead mercury nitrates nitrites
----------------- ---------- --------- ---------- --------- --------- ---------- ----------
Mean 0.25 0.81 0.77 0.10 0.01 9.82 1.33
Std.Dev 0.27 0.65 0.44 0.06 0.00 5.54 0.57
Min 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Q1 0.05 0.09 0.41 0.05 0.00 5.00 1.00
Median 0.09 0.75 0.77 0.10 0.00 9.93 1.42
Q3 0.44 1.39 1.16 0.15 0.01 14.61 1.76
Max 0.90 2.00 1.50 0.20 0.01 19.83 2.93
MAD 0.10 0.96 0.56 0.08 0.00 7.06 0.53
IQR 0.39 1.30 0.75 0.10 0.00 9.61 0.76
CV 1.09 0.81 0.56 0.59 0.57 0.56 0.43
Skewness 1.03 0.25 -0.04 -0.06 -0.08 -0.04 -0.50
SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03 0.03
Kurtosis -0.37 -1.35 -1.17 -1.16 -1.17 -1.19 -0.36
N.Valid 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
N 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Table: Table continues below
perchlorate radium selenium silver uranium viruses
----------------- ------------- --------- ---------- --------- --------- ---------
Mean 16.47 2.92 0.05 0.15 0.04 0.33
Std.Dev 17.69 2.32 0.03 0.14 0.03 0.38
Min 0.00 0.00 0.00 0.00 0.00 0.00
Q1 2.17 0.82 0.02 0.04 0.02 0.00
Median 7.74 2.41 0.05 0.08 0.05 0.01
Q3 29.50 4.67 0.07 0.24 0.07 0.70
Max 60.01 7.99 0.10 0.50 0.09 1.00
MAD 10.73 2.64 0.03 0.09 0.03 0.01
IQR 27.32 3.85 0.05 0.20 0.05 0.70
CV 1.07 0.80 0.58 0.97 0.60 1.15
Skewness 0.94 0.55 0.01 1.03 -0.03 0.42
SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03
Kurtosis -0.50 -0.93 -1.10 -0.29 -1.17 -1.59
N.Valid 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
N 7996.00 7996.00 7996.00 7996.00 7996.00 7996.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
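Several rows of the table above can be reproduced with base R, which helps when reading them. The sketch below uses only the first ten `aluminium` values from the `str()` output, so its results differ from the full-column figures. `desc_stats` is an illustrative helper, and the CV is taken here as standard deviation divided by mean, which agrees with the tabulated values (for example 1.27 / 0.67 ≈ 1.90 for `aluminium`).

```r
# First ten aluminium values from the str() output (not the full column).
x <- c(1.65, 2.32, 1.01, 1.36, 0.92, 0.94, 2.36, 3.93, 0.60, 0.22)

# Illustrative helper returning a few of the statistics shown in the table.
desc_stats <- function(x) {
  c(
    Mean    = mean(x),
    Std.Dev = sd(x),
    Median  = median(x),
    IQR     = IQR(x),          # Q3 - Q1
    MAD     = mad(x),          # median absolute deviation, scaled by 1.4826 by default
    CV      = sd(x) / mean(x)  # coefficient of variation
  )
}

res <- desc_stats(x)
round(res, 2)
```

A CV above 1, as reported for `aluminium`, `arsenic`, and `chloramine`, means the standard deviation exceeds the mean, which is typical of strongly right-skewed variables.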
Setting theme "language: pt"
Characteristic | N = 7.996¹ |
---|---|
aluminium | 0,07 (0,04-0,28) |
ammonia | 14 (7-22) |
arsenic | 0,05 (0,03-0,10) |
barium | 1,19 (0,56-2,49) |
cadmium | 0,040 (0,008-0,070) |
chloramine | 0,53 (0,10-4,24) |
chromium | 0,09 (0,05-0,44) |
copper | 0,75 (0,09-1,39) |
flouride | 0,77 (0,41-1,16) |
bacteria | 0,22 (0,00-0,61) |
viruses | 0,01 (0,00-0,70) |
lead | 0,10 (0,05-0,15) |
nitrates | 9,9 (5,0-14,6) |
nitrites | 1,42 (1,00-1,76) |
mercury | 0,0050 (0,0030-0,0080) |
perchlorate | 8 (2-29) |
radium | 2,41 (0,82-4,67) |
selenium | 0,050 (0,020-0,070) |
silver | 0,08 (0,04-0,24) |
uranium | 0,050 (0,020-0,070) |
is_safe | |
NÃO | 7.084 (88.6%) |
SIM | 912 (11.4%) |
¹ Median (Q1-Q3); n (%) |
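The class proportions in the table above follow directly from the counts. A small sketch, using the counts reported there (7084 NÃO and 912 SIM); `counts`, `pct`, and `imbalance_ratio` are illustrative names.

```r
# Class counts as reported in the summary table.
counts <- c("NÃO" = 7084, "SIM" = 912)

# Share of each class, in percent, rounded to one decimal place.
pct <- round(100 * prop.table(counts), 1)

# Ratio of majority to minority class, a quick indicator of imbalance.
imbalance_ratio <- max(counts) / min(counts)
```

With roughly 7.8 unsafe samples for every safe one, the label is clearly imbalanced, which any later modeling step would need to take into account.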
Analysis of the Quantitative Variables
Multivariate Analysis
Conclusion
Number of rows: 7996
Variable ammonia changed to num:
$ ammonia : num 9.08 21.16 14.02 11.33 24.33 …
Label changed:
$ is_safe : Factor w/ 2 levels “NÃO”,“SIM”
Data Preprocessing
- Normalize the data
- Check for outliers
- Assess dimensionality reduction
- Check the class balance
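The first three steps in this list could be sketched in base R as follows. This is only one possible approach and not necessarily the one the analysis will adopt: z-score standardization via `scale()`, the 1.5 × IQR rule for flagging outliers, and PCA via `prcomp()` are all assumptions, as is the randomly generated toy data.

```r
set.seed(1)

# Hypothetical numeric data standing in for the 20 explanatory variables.
dados <- data.frame(
  a = rnorm(50, mean = 10, sd = 2),
  b = runif(50, min = 0, max = 5),
  c = rexp(50, rate = 1)
)

# Normalization: center each column to mean 0 and scale to standard deviation 1.
dados_norm <- as.data.frame(scale(dados))

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x < q[1] - 1.5 * diff(q) | x > q[2] + 1.5 * diff(q)
}
n_outliers <- sapply(dados, function(x) sum(is_outlier(x)))

# Dimensionality reduction: PCA on the standardized data.
pca <- prcomp(dados_norm)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Inspecting `var_explained` shows how much of the total variance each principal component carries, which indicates whether a smaller set of components could replace the original variables.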