Exploratory Data Analysis

Author

Walner Passos

Published

July 15, 2025

1 Exploratory Data Analysis (EDA)

1.1 Dataset Overview

We will use the “WaterQuality” dataset, available on Kaggle at https://www.kaggle.com/datasets/mssmartypants/water-quality/data. It is a synthetic dataset built from imaginary data on water quality in an urban environment.

1.2 Loading and Initial Inspection of the Data

Code
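# Load the packages used in the analysis
library(dplyr)        # filter(), select(), %>%
library(summarytools) # descr()
library(gtsummary)    # tbl_summary() and theme helpers
# Load the dataset
dados <- read.csv("waterQuality.csv")
# Initial look at the data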
str(dados)
'data.frame':   7999 obs. of  21 variables:
 $ aluminium  : num  1.65 2.32 1.01 1.36 0.92 0.94 2.36 3.93 0.6 0.22 ...
 $ ammonia    : chr  "9.08" "21.16" "14.02" "11.33" ...
 $ arsenic    : num  0.04 0.01 0.04 0.04 0.03 0.03 0.01 0.04 0.01 0.02 ...
 $ barium     : num  2.85 3.31 0.58 2.96 0.2 2.88 1.35 0.66 0.71 1.37 ...
 $ cadmium    : num  0.007 0.002 0.008 0.001 0.006 0.003 0.004 0.001 0.005 0.007 ...
 $ chloramine : num  0.35 5.28 4.24 7.23 2.67 0.8 1.28 6.22 3.14 6.4 ...
 $ chromium   : num  0.83 0.68 0.53 0.03 0.69 0.43 0.62 0.1 0.77 0.49 ...
 $ copper     : num  0.17 0.66 0.02 1.66 0.57 1.38 1.88 1.86 1.45 0.82 ...
 $ flouride   : num  0.05 0.9 0.99 1.08 0.61 0.11 0.33 0.86 0.98 1.24 ...
 $ bacteria   : num  0.2 0.65 0.05 0.71 0.13 0.67 0.13 0.16 0.35 0.83 ...
 $ viruses    : num  0 0.65 0.003 0.71 0.001 0.67 0.007 0.005 0.002 0.83 ...
 $ lead       : num  0.054 0.1 0.078 0.016 0.117 0.135 0.021 0.197 0.167 0.109 ...
 $ nitrates   : num  16.08 2.01 14.16 1.41 6.74 ...
 $ nitrites   : num  1.13 1.93 1.11 1.29 1.11 1.89 1.78 1.81 1.84 1.46 ...
 $ mercury    : num  0.007 0.003 0.006 0.004 0.003 0.006 0.007 0.001 0.004 0.01 ...
 $ perchlorate: num  37.75 32.26 50.28 9.12 16.9 ...
 $ radium     : num  6.78 3.21 7.07 1.72 2.41 5.42 2.84 7.24 4.99 0.08 ...
 $ selenium   : num  0.08 0.08 0.07 0.02 0.02 0.08 0.1 0.08 0.08 0.03 ...
 $ silver     : num  0.34 0.27 0.44 0.45 0.06 0.19 0.24 0.08 0.25 0.31 ...
 $ uranium    : num  0.02 0.05 0.01 0.05 0.02 0.02 0.08 0.07 0.08 0.01 ...
 $ is_safe    : chr  "1" "1" "0" "1" ...

1.3 Data Dictionary and Variable Classification

  • aluminium - dangerous if greater than 2.8

  • ammonia - dangerous if greater than 32.5

  • arsenic - dangerous if greater than 0.01

  • barium - dangerous if greater than 2

  • cadmium - dangerous if greater than 0.005

  • chloramine - dangerous if greater than 4

  • chromium - dangerous if greater than 0.1

  • copper - dangerous if greater than 1.3

  • flouride - dangerous if greater than 1.5

  • bacteria - dangerous if greater than 0

  • viruses - dangerous if greater than 0

  • lead - dangerous if greater than 0.015

  • nitrates - dangerous if greater than 10

  • nitrites - dangerous if greater than 1

  • mercury - dangerous if greater than 0.002

  • perchlorate - dangerous if greater than 56

  • radium - dangerous if greater than 5

  • selenium - dangerous if greater than 0.5

  • silver - dangerous if greater than 0.1

  • uranium - dangerous if greater than 0.3

  • is_safe - class attribute {0 - not safe, 1 - safe}
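
The thresholds in the dictionary above can be stored as a named vector and compared against the data. The following is a minimal sketch, not part of the original analysis: the names `limites`, `amostra` and `excede` are hypothetical, only four of the variables are included, and the toy data frame stands in for `dados`.

```r
# Danger thresholds copied from the data dictionary
# (only a few variables shown; `limites` is a hypothetical name).
limites <- c(aluminium = 2.8, arsenic = 0.01, cadmium = 0.005, lead = 0.015)

# Toy data frame standing in for `dados` (values are illustrative only).
amostra <- data.frame(
  aluminium = c(1.65, 2.32, 3.93),
  arsenic   = c(0.04, 0.01, 0.04),
  cadmium   = c(0.007, 0.002, 0.008),
  lead      = c(0.054, 0.100, 0.010)
)

# For each variable, flag values strictly greater than its threshold
# and compute the share of flagged samples.
excede <- sapply(names(limites), function(v) mean(amostra[[v]] > limites[[v]]))
print(round(excede, 2))
```

Applied to the full data frame, the same `sapply()` call would show which contaminants exceed their limit most often, which can help when interpreting the class balance of `is_safe`.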

1.4 Variable Analysis

Code
summary(dados)
   aluminium        ammonia             arsenic           barium     
 Min.   :0.0000   Length:7999        Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0400   Class :character   1st Qu.:0.0300   1st Qu.:0.560  
 Median :0.0700   Mode  :character   Median :0.0500   Median :1.190  
 Mean   :0.6662                      Mean   :0.1614   Mean   :1.568  
 3rd Qu.:0.2800                      3rd Qu.:0.1000   3rd Qu.:2.480  
 Max.   :5.0500                      Max.   :1.0500   Max.   :4.940  
    cadmium          chloramine       chromium          copper      
 Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00800   1st Qu.:0.100   1st Qu.:0.0500   1st Qu.:0.0900  
 Median :0.04000   Median :0.530   Median :0.0900   Median :0.7500  
 Mean   :0.04281   Mean   :2.177   Mean   :0.2472   Mean   :0.8059  
 3rd Qu.:0.07000   3rd Qu.:4.240   3rd Qu.:0.4400   3rd Qu.:1.3900  
 Max.   :0.13000   Max.   :8.680   Max.   :0.9000   Max.   :2.0000  
    flouride         bacteria         viruses            lead        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.4050   1st Qu.:0.0000   1st Qu.:0.0020   1st Qu.:0.04800  
 Median :0.7700   Median :0.2200   Median :0.0080   Median :0.10200  
 Mean   :0.7716   Mean   :0.3197   Mean   :0.3286   Mean   :0.09945  
 3rd Qu.:1.1600   3rd Qu.:0.6100   3rd Qu.:0.7000   3rd Qu.:0.15100  
 Max.   :1.5000   Max.   :1.0000   Max.   :1.0000   Max.   :0.20000  
    nitrates         nitrites       mercury          perchlorate   
 Min.   : 0.000   Min.   :0.00   Min.   :0.000000   Min.   : 0.00  
 1st Qu.: 5.000   1st Qu.:1.00   1st Qu.:0.003000   1st Qu.: 2.17  
 Median : 9.930   Median :1.42   Median :0.005000   Median : 7.74  
 Mean   : 9.819   Mean   :1.33   Mean   :0.005194   Mean   :16.46  
 3rd Qu.:14.610   3rd Qu.:1.76   3rd Qu.:0.008000   3rd Qu.:29.48  
 Max.   :19.830   Max.   :2.93   Max.   :0.010000   Max.   :60.01  
     radium         selenium           silver          uranium       
 Min.   :0.000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.820   1st Qu.:0.02000   1st Qu.:0.0400   1st Qu.:0.02000  
 Median :2.410   Median :0.05000   Median :0.0800   Median :0.05000  
 Mean   :2.921   Mean   :0.04968   Mean   :0.1478   Mean   :0.04467  
 3rd Qu.:4.670   3rd Qu.:0.07000   3rd Qu.:0.2400   3rd Qu.:0.07000  
 Max.   :7.990   Max.   :0.10000   Max.   :0.5000   Max.   :0.09000  
   is_safe         
 Length:7999       
 Class :character  
 Mode  :character  
                   
                   
                   

1.5 Data Cleaning and Preparation

Code
# Check for missing values
print(paste('Number of missing values:', sum(is.na(dados))))
[1] "Number of missing values: 0"
Code
# Convert the ammonia variable from character to numeric
print("Converting the ammonia variable from chr to numeric")
[1] "Converting the ammonia variable from chr to numeric"
Code
dados$ammonia <- as.numeric(dados$ammonia)
Warning: NAs introduced by coercion
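
The coercion warning means that some strings in the column could not be parsed as numbers, and the earlier missing-value check ran before these NAs existed. The sketch below shows one way to list the offending strings. The toy vector `x` is hypothetical, and the presence of "#NUM!" in it is an assumption by analogy with the label column; the source does not show which `ammonia` values failed.

```r
# Toy vector mimicking a numeric column that was read as character.
# The "#NUM!" entry is an assumed example of an unparseable value.
x <- c("9.08", "21.16", "#NUM!", "11.33")

# suppressWarnings() silences the coercion warning; the NAs are inspected explicitly instead.
x_num <- suppressWarnings(as.numeric(x))

# Entries that were present as text but became NA after conversion.
problemas <- x[is.na(x_num) & !is.na(x)]
print(problemas)

# Number of NAs created by the conversion.
print(sum(is.na(x_num)))
```

Running `sum(is.na(dados))` again after the conversion would reveal whether the rows with unparseable values still need to be handled.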
Code
# Inspect the label values
print("Inspecting the label values")
[1] "Inspecting the label values"
Code
print(paste("", unique(dados$is_safe)))
[1] " 1"     " 0"     " #NUM!"
Code
print("Checking the number of records with value #NUM! ")
[1] "Checking the number of records with value #NUM! "
Code
table(dados$is_safe %in% c("#NUM!"))

FALSE  TRUE 
 7996     3 
Code
print("Deleting the records with value #NUM! ")
[1] "Deleting the records with value #NUM! "
Code
dados <- filter(dados, is_safe != "#NUM!")
print("Verifying the change ")
[1] "Verifying the change "
Code
table(dados$is_safe %in% c("#NUM!"))

FALSE 
 7996 
Code
# Set the factor levels for the label
print("Creating levels for the label")
[1] "Creating levels for the label"
Code
# "impropria" = unsafe, "Potável" = safe to drink
dados$is_safe <- factor(dados$is_safe, levels = c('0','1'), labels = c("impropria", "Potável"))
print("Confirming the change")
[1] "Confirming the change"
Code
print(paste("", levels(dados$is_safe)))
[1] " impropria" " Potável"  

1.6 Univariate Analysis

1.6.1 Description of the data

Code
# Descriptive statistics
descr(dados)
Non-numerical variable(s) ignored: is_safe
Descriptive Statistics  
dados  
N: 7996  

                    aluminium   ammonia   arsenic   bacteria    barium   cadmium   chloramine
----------------- ----------- --------- --------- ---------- --------- --------- ------------
             Mean        0.67     14.28      0.16       0.32      1.57      0.04         2.18
          Std.Dev        1.27      8.88      0.25       0.33      1.22      0.04         2.57
              Min        0.00     -0.08      0.00       0.00      0.00      0.00         0.00
               Q1        0.04      6.58      0.03       0.00      0.56      0.01         0.10
           Median        0.07     14.13      0.05       0.22      1.19      0.04         0.53
               Q3        0.28     22.13      0.10       0.61      2.49      0.07         4.24
              Max        5.05     29.84      1.05       1.00      4.94      0.13         8.68
              MAD        0.06     11.58      0.04       0.33      1.20      0.05         0.76
              IQR        0.24     15.55      0.07       0.61      1.92      0.06         4.14
               CV        1.90      0.62      1.56       1.03      0.78      0.84         1.18
         Skewness        2.01      0.03      1.98       0.55      0.66      0.48         0.89
      SE.Skewness        0.03      0.03      0.03       0.03      0.03      0.03         0.03
         Kurtosis        2.72     -1.23      2.68      -1.14     -0.70     -0.99        -0.68
          N.Valid     7996.00   7996.00   7996.00    7996.00   7996.00   7996.00      7996.00
                N     7996.00   7996.00   7996.00    7996.00   7996.00   7996.00      7996.00
        Pct.Valid      100.00    100.00    100.00     100.00    100.00    100.00       100.00

Table: Table continues below

 

                    chromium    copper   flouride      lead   mercury   nitrates   nitrites
----------------- ---------- --------- ---------- --------- --------- ---------- ----------
             Mean       0.25      0.81       0.77      0.10      0.01       9.82       1.33
          Std.Dev       0.27      0.65       0.44      0.06      0.00       5.54       0.57
              Min       0.00      0.00       0.00      0.00      0.00       0.00       0.00
               Q1       0.05      0.09       0.41      0.05      0.00       5.00       1.00
           Median       0.09      0.75       0.77      0.10      0.00       9.93       1.42
               Q3       0.44      1.39       1.16      0.15      0.01      14.61       1.76
              Max       0.90      2.00       1.50      0.20      0.01      19.83       2.93
              MAD       0.10      0.96       0.56      0.08      0.00       7.06       0.53
              IQR       0.39      1.30       0.75      0.10      0.00       9.61       0.76
               CV       1.09      0.81       0.56      0.59      0.57       0.56       0.43
         Skewness       1.03      0.25      -0.04     -0.06     -0.08      -0.04      -0.50
      SE.Skewness       0.03      0.03       0.03      0.03      0.03       0.03       0.03
         Kurtosis      -0.37     -1.35      -1.17     -1.16     -1.17      -1.19      -0.36
          N.Valid    7996.00   7996.00    7996.00   7996.00   7996.00    7996.00    7996.00
                N    7996.00   7996.00    7996.00   7996.00   7996.00    7996.00    7996.00
        Pct.Valid     100.00    100.00     100.00    100.00    100.00     100.00     100.00

Table: Table continues below

 

                    perchlorate    radium   selenium    silver   uranium   viruses
----------------- ------------- --------- ---------- --------- --------- ---------
             Mean         16.47      2.92       0.05      0.15      0.04      0.33
          Std.Dev         17.69      2.32       0.03      0.14      0.03      0.38
              Min          0.00      0.00       0.00      0.00      0.00      0.00
               Q1          2.17      0.82       0.02      0.04      0.02      0.00
           Median          7.74      2.41       0.05      0.08      0.05      0.01
               Q3         29.50      4.67       0.07      0.24      0.07      0.70
              Max         60.01      7.99       0.10      0.50      0.09      1.00
              MAD         10.73      2.64       0.03      0.09      0.03      0.01
              IQR         27.32      3.85       0.05      0.20      0.05      0.70
               CV          1.07      0.80       0.58      0.97      0.60      1.15
         Skewness          0.94      0.55       0.01      1.03     -0.03      0.42
      SE.Skewness          0.03      0.03       0.03      0.03      0.03      0.03
         Kurtosis         -0.50     -0.93      -1.10     -0.29     -1.17     -1.59
          N.Valid       7996.00   7996.00    7996.00   7996.00   7996.00   7996.00
                N       7996.00   7996.00    7996.00   7996.00   7996.00   7996.00
        Pct.Valid        100.00    100.00     100.00    100.00    100.00    100.00
Code
# Descriptive table
theme_gtsummary_language(
  language = "pt",       # Sets the table language to Portuguese
  decimal.mark = ",",    # Uses a comma as the decimal separator
  big.mark = ".",        # Uses a period as the thousands separator
  iqr.sep = "-",         # Uses a hyphen as the separator for interquartile ranges
  ci.sep = "-",          # Uses a hyphen as the separator for confidence intervals
  set_theme = TRUE       # Applies these settings as the default theme for the tables
)
Setting theme "language: pt"
Code
list("tbl_summary-fn:percent_fun" = function(x) sprintf(x * 100, fmt='%#.1f')) %>%
  set_gtsummary_theme()  # Applies the custom percentage formatter as the default theme

tbl_summary(dados) # later we will see how to change several characteristics of the table
Características    N = 7.996 [1]
aluminium          0,07 (0,04-0,28)
ammonia            14 (7-22)
arsenic            0,05 (0,03-0,10)
barium             1,19 (0,56-2,49)
cadmium            0,040 (0,008-0,070)
chloramine         0,53 (0,10-4,24)
chromium           0,09 (0,05-0,44)
copper             0,75 (0,09-1,39)
flouride           0,77 (0,41-1,16)
bacteria           0,22 (0,00-0,61)
viruses            0,01 (0,00-0,70)
lead               0,10 (0,05-0,15)
nitrates           9,9 (5,0-14,6)
nitrites           1,42 (1,00-1,76)
mercury            0,0050 (0,0030-0,0080)
perchlorate        8 (2-29)
radium             2,41 (0,82-4,67)
selenium           0,050 (0,020-0,070)
silver             0,08 (0,04-0,24)
uranium            0,050 (0,020-0,070)
is_safe
    impropria      7.084 (88.6%)
    Potável        912 (11.4%)
[1] Mediana (Q1-Q3); n (%)
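
The label is clearly imbalanced: 7,084 records are classed as "impropria" against 912 as "Potável". As a small sketch, the percentages can be reproduced from those counts with `prop.table()`. The vector `is_safe` below is rebuilt from the reported counts rather than read from the file, and `proporcoes` is a hypothetical name.

```r
# Rebuild the label from the class counts reported in the summary table.
is_safe <- factor(
  rep(c("impropria", "Potável"), times = c(7084, 912)),
  levels = c("impropria", "Potável")
)

# Relative frequency of each class.
proporcoes <- prop.table(table(is_safe))
print(round(100 * proporcoes, 1))
```

On the real data, `prop.table(table(dados$is_safe))` would give the same proportions directly.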

1.7 Multivariate Analysis

Code
# Select the numeric variables
numeric_vars <- dados %>% select(where(is.numeric))

# Compute the correlation matrix
correlation_matrix <- cor(numeric_vars, method = "pearson", use = "complete.obs")

# Visualize the correlation matrix
library(ggcorrplot)
ggcorrplot(
  correlation_matrix,
  lab = TRUE,
  hc.order = TRUE,
  type = "lower",
  colors = c("red", "white", "blue")
)
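
With 20 numeric variables the correlation plot can be hard to read, so it may help to list the strongest pairs as a table. The sketch below is one possible way to do this in base R. It uses the built-in `mtcars` data as a stand-in for `numeric_vars`, and the names `m`, `idx` and `pares` are hypothetical.

```r
# Built-in data used as a stand-in for `numeric_vars`.
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")], method = "pearson")

# Keep each pair once by taking only the lower triangle of the matrix.
idx <- which(lower.tri(m), arr.ind = TRUE)
pares <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)

# Sort by absolute correlation, strongest first.
pares <- pares[order(-abs(pares$r)), ]
print(head(pares, 3))
```

Replacing `m` with `correlation_matrix` would produce the same ranking for the water quality variables.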