Introdução
Sobre o conjunto de dados
- Verificação de dados faltantes
Gráfico com os valores faltantes
Valores faltantes em relação a variável resposta
Histogramas com a variável resposta em relação as demais variaveis
Box plot
Teste de normalidade
Correlação
Modelos
Modelo Random Forest
Quantidade de variáveis (colunas) sorteadas por árvore.
Variaveis importantes
Predição
Conclusão

Introdução

O acesso para economizar água potável é importante para a saúde pública. Segundo a Organização Mundial da Saúde, um melhor abastecimento de água e saneamento e uma melhor gestão dos recursos hídricos podem impulsionar o crescimento econômico e contribuir para a redução da pobreza.

Além disso, a água potável é muito importante para manter nossas funções corporais. Um corpo humano pode sobreviver até 4 semanas sem comida, mas apenas 3 dias sem água.

Portanto, é importante estudar quais variáveis influenciam na água potável.

O conjunto de dados neste estudo consiste em 10 variáveis, com uma variável dependente (1 = potável, 0 = não potável) e 9 variáveis independentes. As variáveis independentes são os parâmetros da água. O objetivo deste estudo é prever a água potável com base nesses parâmetros da água.

Sobre o conjunto de dados

Variáveis

Valor do pH: O pH é um parâmetro importante na avaliação do equilíbrio ácido-base da água.

Dureza: A dureza é causada principalmente por sais de cálcio e magnésio.

Sólidos (Sólidos dissolvidos totais - TDS): A água tem a capacidade de dissolver uma ampla gama de minerais ou sais inorgânicos e alguns orgânicos, como potássio, cálcio, sódio, bicarbonatos, cloretos, magnésio, sulfatos etc.

Cloraminas: Cloro e cloramina são os principais desinfetantes usados em sistemas públicos de água.

Sulfato: Os sulfatos são substâncias naturais encontradas em minerais, solo e rochas.

Condutividade: A água pura não é um bom condutor de corrente elétrica, mas sim um bom isolante. O aumento na concentração de íons aumenta a condutividade elétrica da água.

Organic_carbon: O carbono orgânico total (TOC) nas águas de nascente vem da matéria orgânica natural em decomposição (NOM), bem como de fontes sintéticas.

Trihalometanos: THMs são produtos químicos que podem ser encontrados em água tratada com cloro.

Turbidez: A turbidez da água depende da quantidade de matéria sólida presente no estado suspenso.

Potabilidade: Indica se a água é segura para consumo humano onde 1 significa Potável e 0 significa Não potável.

library(readr)
require(dplyr)
require(tidyverse)
require(ggplot2)
require(janitor)
require(skimr)
require(corrplot)
library(Hmisc)
require(GGally)
require(randomForest)
require(caret)
require(scales)
require(ggpubr)
library(MVN)

water_potability <- read_csv("/Users/Clevia/Documents/MLG/water_potability.csv")
knitr::kable( head(water_potability), "simple")

ph	Hardness	Solids	Chloramines	Sulfate	Conductivity	Organic_carbon	Trihalomethanes	Turbidity
NA	204.8905	20791.32	7.300212	368.5164	564.3087	10.379783	86.99097	2.963135
3.716080	129.4229	18630.06	6.635246	NA	592.8854	15.180013	56.32908	4.500656
8.099124	224.2363	19909.54	9.275884	NA	418.6062	16.868637	66.42009	3.055934
8.316766	214.3734	22018.42	8.059332	356.8861	363.2665	18.436525	100.34167	4.628770
9.092223	181.1015	17978.99	6.546600	310.1357	398.4108	11.558279	31.99799	4.075075
5.584087	188.3133	28748.69	7.544869	326.6784	280.4679	8.399735	54.91786	2.559708

water_potability <- water_potability %>% 
  mutate(Potability = as.factor(Potability)) %>%
  clean_names() %>% glimpse()

## Rows: 3,276
## Columns: 10
## $ ph              <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, ~
## $ hardness        <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.~
## $ solids          <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874~
## $ chloramines     <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54~
## $ sulfate         <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66~
## $ conductivity    <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.~
## $ organic_carbon  <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,~
## $ trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.~
## $ turbidity       <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55~
## $ potability      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

Verificação de dados faltantes

#dados faltantes 

water_potability %>%
  summarise_all(~ sum(is.na(.)))

## # A tibble: 1 x 10
##      ph hardness solids chloramines sulfate conductivity organic_carbon
##   <int>    <int>  <int>       <int>   <int>        <int>          <int>
## 1   491        0      0           0     781            0              0
## # ... with 3 more variables: trihalomethanes <int>, turbidity <int>,
## #   potability <int>

Podemos ver que há 3 variáveis com dados faltantes, PH, Sulfate e Trihalomethanes

Gráfico com os valores faltantes

#grafico com valores faltantes 

water_potability %>%  skim() %>%
  filter(n_missing != 0) %>%
  as_tibble() %>%
  select(skim_variable, n_missing, complete_rate) %>%
  mutate(missing_rate = round(abs(complete_rate - 1) * 100, 1)) %>%
  ggplot(aes(
    x = fct_reorder(skim_variable, n_missing),
    y = missing_rate,
    fill = skim_variable,
    label = paste0(missing_rate, "%")
  )) +
  geom_col(width = .6) +
  geom_text(
    size = 4.5,
    hjust = 1.2,
    vjust = .25,
    col = "white"
  ) +
  coord_flip() + theme(aspect.ratio = .4) +
  theme(
    legend.position = "none"
  ) +
  scale_y_continuous(label = label_percent(scale = 1)) +
  scale_fill_manual(values = c("#25C5DA",
                               "#25A69A",
                               "#66BA6A")) +
  labs(
    title = "Dados Faltantes",
    subtitle = "Distribuição dos dados faltantes",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Neste gráfico observamos a proporção de dados faltantes no banco de dados

Valores faltantes em relação a variável resposta

#valores faltantes em relação a variável resposta

water_potability %>% group_by(potability) %>%  skim() %>%
  filter(n_missing != 0) %>%
  as_tibble() %>%
  select(skim_variable, n_missing, complete_rate, potability) %>%
  mutate(missing_rate = round(abs(complete_rate - 1) * 100, 1)) %>%
  ggplot(aes(
    x = fct_reorder(skim_variable, n_missing),
    y = missing_rate,
    fill = skim_variable,
    label = paste0(missing_rate, "%")
  )) +
  geom_col(width = .6) +
  geom_text(
    size = 4,
    hjust = 1.2,
    vjust = 0.25,
    col = "white"
  ) +
  coord_flip() +
  facet_wrap(vars(potability)) +
  theme(aspect.ratio = .7) +
  theme(
    legend.position = "none",
    strip.background = element_rect(fill="#94246D"),
    strip.text = element_text(color = "white", face = "bold", size = 12)
  ) +
  scale_y_continuous(label = label_percent(scale = 1)) +
  scale_fill_manual(values = c("#D41C64",
                               "#4B6FB5",
                               "#6C3996")) +
  labs(
    title = "Dados faltantes VS Variável resposta",
    subtitle = "Plot, Missing Data distribution VS Target Variable",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Aqui observamos os dados faltantes em relação a variável resposta.

Neste caso optou-se por substituir esses valores pela média, pois com a remoção muita informação seria perdida.

#substituir valores faltantes 

water_potability <- water_potability %>% 
  group_by(potability) %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), mean(., na.rm = T), 
                                            as.numeric(.)))) %>% ungroup()

#variavel resposta

water_potability %>%
  select(potability) %>%
  count(potability) %>% mutate(percent = paste0(round(n / sum(n) * 100), "%"), 2) %>%
  ggplot(aes(
    x = potability,
    y = n,
    label = percent,
    fill = potability
  )) +
  geom_col() +
  geom_text(vjust = -0.2, color = "#7C4EA8") +
  scale_fill_manual(values = c("#EF1A25", "#0099D5")) +
  labs(
    title = "Potabilidade da água",
    subtitle = "Gráfico de barras com a distribuição da variável potabilidade",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL,
    fill = NULL
  )

Gráfico com a proporção da variável resposta no banco de dados

p1 <- ggplot(water_potability, aes(ph, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "pH", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom")+
  labs(title = "pH")

p2 <- ggplot(water_potability, aes(hardness, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Hardness", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom")+ 
  labs(title = "Hardness")

p3 <- ggplot(water_potability, aes(solids, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Solids", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Solids")

p4 <- ggplot(water_potability, aes(chloramines, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Chloramines", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Chloramines")

p5 <- ggplot(water_potability, aes(sulfate, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Sulfate", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Sulfate")

p6 <- ggplot(water_potability, aes(conductivity, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Conductivity", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Conductivity")

p7 <- ggplot(water_potability, aes(organic_carbon, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Organic Carbon", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Organic Carbon")

p8 <- ggplot(water_potability, aes(trihalomethanes, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Trihalomethanes", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Trihalomethanes")

p9 <- ggplot(water_potability, aes(turbidity, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Turbidity", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Turbidity")

Histogramas com a variável resposta em relação as demais variaveis

figure1 <- ggarrange(p1, p2, p3, p4, p5, nrow = 2, ncol = 3, labels = "AUTO")
figure2 <- ggarrange(p6, p7, p8, p9, nrow = 2, ncol = 2, labels = "AUTO")
figure1

figure2

Box plot

water_potability %>%
  pivot_longer(cols = -potability, names_to = "feature") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_jitter(aes(y = value, col = potability), alpha = 0.1) +
  geom_boxplot(aes(fill = potability)) +
  facet_wrap(vars(feature), ncol = 3, scales = "free") +
  scale_color_manual(values = c("#E4652E", "#0E8A41")) +
  scale_fill_manual(values = c("#E4652E", "#0E8A41")) +
  theme(
    legend.position = "right",
    strip.background = element_rect(fill = "#0B2D5B"),
    strip.text = element_text(color = "white", face = "bold", size = 8)
  ) +
  labs(
    title = "Detect Outliers With Boxplot",
    subtitle = "Plot, Box and Jitter Plot",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL,
    fill = NULL,
    color = NULL
  )

Teste de normalidade

aa<- mvn(data = water_potability[,-10],  univariatePlot =
      "qqplot")

teste <- mvn(data = water_potability[,-10], univariateTest =  
               "SW")
teste$'univariateNormality'

##           Test        Variable Statistic   p value Normality
## 1 Shapiro-Wilk       ph           0.9800  <0.001      NO    
## 2 Shapiro-Wilk    hardness        0.9960  <0.001      NO    
## 3 Shapiro-Wilk     solids         0.9777  <0.001      NO    
## 4 Shapiro-Wilk   chloramines      0.9968  <0.001      NO    
## 5 Shapiro-Wilk     sulfate        0.9610  <0.001      NO    
## 6 Shapiro-Wilk  conductivity      0.9930  <0.001      NO    
## 7 Shapiro-Wilk organic_carbon     0.9995  0.6251      YES   
## 8 Shapiro-Wilk trihalomethanes    0.9970  <0.001      NO    
## 9 Shapiro-Wilk    turbidity       0.9997  0.9336      YES

Correlação

corrplot( cor (water_potability[, -10]), method = "color", #col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", 
         tl.col = T, tl.srt = 45, 
         number.cex = 0.7, tl.cex = 0.7,
         #p.mat = m_corr$P, sig.level = 0.05, insig = "n",
         diag = FALSE, number.digits = 3
)

ggpairs(
  water_potability,
  aes(color = potability),
  columns = 1:9,
  lower = list(continuous = wrap(
    "smooth",
    alpha = 0.2,
    size = 0.5,
    color = "#DE942E"
  )),
  diag = list(continuous = "barDiag"),
  upper = list(continuous = wrap("cor", size = 4))
) +
  scale_color_manual(values = c("#1F5736", "#E94046")) +
  scale_fill_manual(values = c("#1F5736", "#E94046")) +
  theme(
    axis.text = element_text(size = 8),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "white"),
    strip.background.x = element_rect(colour = "black"),
    strip.background.y = element_rect(colour = "black"),
    strip.text = element_text(color = "black", face = "bold", size = 8)
  ) +
  labs(
    title = "Pair plot by Potability Var",
    subtitle = "Pair Plot, scatter plot, Histogram and Correlation coefficient",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Podemos ver que não há indícios de Multicolinearidade

Modelos

Foram testados os modelos de Regressão Logistica, Random Forest e KNN

# 80% training and 20% test dataset
set.seed(1) 
trn_index <- createDataPartition(y = water_potability$potability, p = 0.80, list = FALSE)
trn_water <- water_potability[trn_index, ]
tst_water <- water_potability[-trn_index, ]

# K-nearest neighbors
set.seed(1) 
potability_knn <- train(potability ~ ., method = "knn", data = trn_water,
                        trControl = trainControl(method = 'cv', number = 5, returnResamp = "all"))

# Random forest
set.seed(1)
potability_rf <- randomForest(potability ~ .,
                              data=trn_water, ntree= 1000
)

# Logistic regression
set.seed(1)
potability_lr <- train(potability ~ ., method = "glm", data = trn_water, 
                       family = binomial(link = "logit"),
                       trControl = trainControl(method = 'cv', number = 5))

# Prediction on test-dataset
predicted_outcomes_rf <- predict(potability_rf, tst_water)
predicted_outcomes_knn <- predict(potability_knn, tst_water)
predicted_outcomes_lr <- predict(potability_lr, tst_water)

# Create Confusion Matrices
rf_confm <- confusionMatrix(predicted_outcomes_rf, tst_water$potability, positive='1')
knn_confm <- confusionMatrix(predicted_outcomes_knn, tst_water$potability, positive='1')
logistic_confm <- confusionMatrix(predicted_outcomes_lr, tst_water$potability, positive='1')


# plot of confusion matrices
plot_rf <- as.data.frame(rf_confm$table)
plot_rf$Prediction <- factor(plot_rf$Prediction, levels=rev(levels(plot_rf$Prediction)))

plot_knn <- as.data.frame(knn_confm$table)
plot_knn$Prediction <- factor(plot_knn$Prediction, levels=rev(levels(plot_knn$Prediction)))

plot_logistic <- as.data.frame(logistic_confm$table)
plot_logistic$Prediction <- factor(plot_logistic$Prediction, 
                                   levels=rev(levels(plot_logistic$Prediction)))

plot_conf_rf <- ggplot(plot_rf, aes(Prediction,Reference, fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
  ggtitle("Random Forest. Accuracy: 79,2%, Sensitivity: 59,22%, Specificity: 91,98%")

plot_conf_knn <- ggplot(plot_knn, aes(Prediction,Reference, fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
  ggtitle("K-Nearest Neighbor. Accuracy: 58,26%, Sensitivity: 22,75%, Specificity: 80,95%")

plot_conf_logistic <- ggplot(plot_logistic, aes(Prediction,Reference, fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
  ggtitle("Logistic Regression. Accuracy: 61%, Sensitivity: 0%, Specificity: 100%")

mod2 <- glm(formula = trn_water$potability  ~ .,family = binomial(link = "logit"),
                      data = trn_water)

mod3 <- glm(formula = trn_water$potability ~ hardness + solids + sulfate + 
              organic_carbon, family = binomial(link = "logit"), data = trn_water)

summary(mod2)

## 
## Call:
## glm(formula = trn_water$potability ~ ., family = binomial(link = "logit"), 
##     data = trn_water)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1334  -0.9999  -0.9562   1.3536   1.5172  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -2.648e-01  6.832e-01  -0.388    0.698
## ph              -3.864e-03  2.766e-02  -0.140    0.889
## hardness        -1.095e-03  1.227e-03  -0.892    0.372
## solids           6.859e-06  4.627e-06   1.482    0.138
## chloramines      3.120e-02  2.550e-02   1.223    0.221
## sulfate         -1.108e-04  1.128e-03  -0.098    0.922
## conductivity    -4.281e-04  5.006e-04  -0.855    0.392
## organic_carbon  -9.031e-03  1.222e-02  -0.739    0.460
## trihalomethanes  1.254e-03  2.558e-03   0.490    0.624
## turbidity       -1.255e-02  5.155e-02  -0.244    0.808
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3507.3  on 2621  degrees of freedom
## Residual deviance: 3501.1  on 2612  degrees of freedom
## AIC: 3521.1
## 
## Number of Fisher Scoring iterations: 4

step(mod2, direction = "both")

## Start:  AIC=3521.05
## trn_water$potability ~ ph + hardness + solids + chloramines + 
##     sulfate + conductivity + organic_carbon + trihalomethanes + 
##     turbidity
## 
##                   Df Deviance    AIC
## - sulfate          1   3501.1 3519.1
## - ph               1   3501.1 3519.1
## - turbidity        1   3501.1 3519.1
## - trihalomethanes  1   3501.3 3519.3
## - organic_carbon   1   3501.6 3519.6
## - conductivity     1   3501.8 3519.8
## - hardness         1   3501.9 3519.9
## - chloramines      1   3502.6 3520.6
## <none>                 3501.1 3521.1
## - solids           1   3503.2 3521.2
## 
## Step:  AIC=3519.06
## trn_water$potability ~ ph + hardness + solids + chloramines + 
##     conductivity + organic_carbon + trihalomethanes + turbidity
## 
##                   Df Deviance    AIC
## - ph               1   3501.1 3517.1
## - turbidity        1   3501.1 3517.1
## - trihalomethanes  1   3501.3 3517.3
## - organic_carbon   1   3501.6 3517.6
## - conductivity     1   3501.8 3517.8
## - hardness         1   3501.9 3517.9
## - chloramines      1   3502.6 3518.6
## <none>                 3501.1 3519.1
## - solids           1   3503.4 3519.4
## + sulfate          1   3501.1 3521.1
## 
## Step:  AIC=3517.08
## trn_water$potability ~ hardness + solids + chloramines + conductivity + 
##     organic_carbon + trihalomethanes + turbidity
## 
##                   Df Deviance    AIC
## - turbidity        1   3501.1 3515.1
## - trihalomethanes  1   3501.3 3515.3
## - organic_carbon   1   3501.6 3515.6
## - conductivity     1   3501.8 3515.8
## - hardness         1   3501.9 3515.9
## - chloramines      1   3502.6 3516.6
## <none>                 3501.1 3517.1
## - solids           1   3503.4 3517.4
## + ph               1   3501.1 3519.1
## + sulfate          1   3501.1 3519.1
## 
## Step:  AIC=3515.14
## trn_water$potability ~ hardness + solids + chloramines + conductivity + 
##     organic_carbon + trihalomethanes
## 
##                   Df Deviance    AIC
## - trihalomethanes  1   3501.4 3513.4
## - organic_carbon   1   3501.7 3513.7
## - conductivity     1   3501.9 3513.9
## - hardness         1   3501.9 3513.9
## - chloramines      1   3502.6 3514.6
## <none>                 3501.1 3515.1
## - solids           1   3503.5 3515.5
## + turbidity        1   3501.1 3517.1
## + ph               1   3501.1 3517.1
## + sulfate          1   3501.1 3517.1
## 
## Step:  AIC=3513.39
## trn_water$potability ~ hardness + solids + chloramines + conductivity + 
##     organic_carbon
## 
##                   Df Deviance    AIC
## - organic_carbon   1   3501.9 3511.9
## - conductivity     1   3502.1 3512.1
## - hardness         1   3502.2 3512.2
## - chloramines      1   3502.9 3512.9
## <none>                 3501.4 3513.4
## - solids           1   3503.7 3513.7
## + trihalomethanes  1   3501.1 3515.1
## + turbidity        1   3501.3 3515.3
## + ph               1   3501.4 3515.4
## + sulfate          1   3501.4 3515.4
## 
## Step:  AIC=3511.95
## trn_water$potability ~ hardness + solids + chloramines + conductivity
## 
##                   Df Deviance    AIC
## - conductivity     1   3502.7 3510.7
## - hardness         1   3502.8 3510.8
## - chloramines      1   3503.5 3511.5
## <none>                 3501.9 3511.9
## - solids           1   3504.3 3512.3
## + organic_carbon   1   3501.4 3513.4
## + trihalomethanes  1   3501.7 3513.7
## + turbidity        1   3501.9 3513.9
## + ph               1   3501.9 3513.9
## + sulfate          1   3501.9 3513.9
## 
## Step:  AIC=3510.71
## trn_water$potability ~ hardness + solids + chloramines
## 
##                   Df Deviance    AIC
## - hardness         1   3503.5 3509.5
## - chloramines      1   3504.3 3510.3
## <none>                 3502.7 3510.7
## - solids           1   3505.0 3511.0
## + conductivity     1   3501.9 3511.9
## + organic_carbon   1   3502.1 3512.1
## + trihalomethanes  1   3502.5 3512.5
## + turbidity        1   3502.7 3512.7
## + ph               1   3502.7 3512.7
## + sulfate          1   3502.7 3512.7
## 
## Step:  AIC=3509.49
## trn_water$potability ~ solids + chloramines
## 
##                   Df Deviance    AIC
## - chloramines      1   3505.1 3509.1
## <none>                 3503.5 3509.5
## - solids           1   3505.9 3509.9
## + hardness         1   3502.7 3510.7
## + conductivity     1   3502.8 3510.8
## + organic_carbon   1   3502.9 3510.9
## + trihalomethanes  1   3503.2 3511.2
## + turbidity        1   3503.4 3511.4
## + ph               1   3503.4 3511.4
## + sulfate          1   3503.5 3511.5
## 
## Step:  AIC=3509.13
## trn_water$potability ~ solids
## 
##                   Df Deviance    AIC
## <none>                 3505.1 3509.1
## - solids           1   3507.3 3509.3
## + chloramines      1   3503.5 3509.5
## + hardness         1   3504.3 3510.3
## + conductivity     1   3504.4 3510.4
## + organic_carbon   1   3504.5 3510.5
## + trihalomethanes  1   3504.9 3510.9
## + ph               1   3505.1 3511.1
## + turbidity        1   3505.1 3511.1
## + sulfate          1   3505.1 3511.1

## 
## Call:  glm(formula = trn_water$potability ~ solids, family = binomial(link = "logit"), 
##     data = trn_water)
## 
## Coefficients:
## (Intercept)       solids  
##  -5.932e-01    6.652e-06  
## 
## Degrees of Freedom: 2621 Total (i.e. Null);  2620 Residual
## Null Deviance:       3507 
## Residual Deviance: 3505  AIC: 3509

summary(mod3)

## 
## Call:
## glm(formula = trn_water$potability ~ hardness + solids + sulfate + 
##     organic_carbon, family = binomial(link = "logit"), data = trn_water)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1135  -0.9992  -0.9679   1.3595   1.4700  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -2.011e-01  5.215e-01  -0.386    0.700
## hardness       -1.129e-03  1.223e-03  -0.923    0.356
## solids          6.403e-06  4.599e-06   1.392    0.164
## sulfate        -8.739e-05  1.127e-03  -0.078    0.938
## organic_carbon -9.539e-03  1.220e-02  -0.782    0.434
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3507.3  on 2621  degrees of freedom
## Residual deviance: 3503.7  on 2617  degrees of freedom
## AIC: 3513.7
## 
## Number of Fisher Scoring iterations: 4

Utilizando a seleção de modelos Stepwise o modelo que melhor explica os dados foi o modelo 3 entretanto ao fazer a predição o modelo prevê apenas valores 0. Dessa forma foram testados outros modelos que pudessem retornar resultados significativos sento eles Random Forest e KNN

plot_conf_rf

plot_conf_knn

plot_conf_logistic

Modelo Random Forest

set.seed(31967)

TrainIndex <-
  createDataPartition(water_potability$potability, p = 0.8, list = FALSE)

TrainingSet <- water_potability[TrainIndex, ]
TestSet <- water_potability[-TrainIndex, ]

xdf <- TrainingSet %>% select(-potability)
ydf <- TrainingSet %>% select(potability)

set.seed(31967)

Quantidade de variáveis (colunas) sorteadas por árvore.

#Quantidade de variáveis (colunas) sorteadas por árvore.

BestMtry <-
  tuneRF(
    xdf,
    ydf$potability,
    stepFactor = 1.5,
    improve = 1e-6,
    ntree = 1000, 
    plot = F
  )

## mtry = 3  OOB error = 21.13% 
## Searching left ...
## mtry = 2     OOB error = 20.63% 
## 0.0234657 1e-06 
## Searching right ...
## mtry = 4     OOB error = 20.52% 
## 0.005545287 1e-06 
## mtry = 6     OOB error = 21.24% 
## -0.03531599 1e-06

BestMtry %>% as_tibble() %>% 
  ggplot(aes(x = mtry, y = OOBError)) +
  geom_line(col = "steelblue", size = 1.5)+
  geom_point(col = "orange", size = 3)+
  labs(
    title = "Best Mtry",
    caption = "Data source: Kaggle.com, Water Quality",
    x = "Mtry",
    y = "OOB Error"
  )

set.seed(31967)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 7)

set.seed(31967)
RfFinal <- train(
  potability ~ .,
  data = TrainingSet,
  method = "rf",
  metric = "Accuracy",
  tuneGrid = expand.grid(.mtry = 4),
  trControl = control,
  ntree = 1000
)


plot(RfFinal$finalModel)

Variaveis importantes

#variaveis inportantes

VarsImp <- varImp(RfFinal, scale = FALSE)


VarsImp$importance %>% 
  rownames_to_column(var = "Variable") %>% 
  as_tibble() %>% 
  ggplot(aes(x = fct_reorder(Variable,Overall), y = Overall)) +
  geom_col(fill = "#1F5736", size = 1) +
  coord_flip()+
  labs(
    title = "Variables importance",
    subtitle = "Column Plot, Variables importance, RfFinal mpdel",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

#######################################################################

As 5 variáveis mais importantes no modelo são, Sulfate, PH,Hardness, Solids e Chloramines

Predição

#predição

set.seed(31967)

PredRf <- predict(RfFinal,TestSet, type = "raw")


confusionMatrix(data = PredRf, reference = TestSet$potability, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 351  84
##          1  48 171
##                                           
##                Accuracy : 0.7982          
##                  95% CI : (0.7653, 0.8283)
##     No Information Rate : 0.6101          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5647          
##                                           
##  Mcnemar's Test P-Value : 0.002316        
##                                           
##             Sensitivity : 0.6706          
##             Specificity : 0.8797          
##          Pos Pred Value : 0.7808          
##          Neg Pred Value : 0.8069          
##              Prevalence : 0.3899          
##          Detection Rate : 0.2615          
##    Detection Prevalence : 0.3349          
##       Balanced Accuracy : 0.7751          
##                                           
##        'Positive' Class : 1               
##

#######################################################3

Conclusão

O modelo de regressão logística falhou em encontrar um padrão nos dados. Ele prevê que nem tudo é potável. Sendo assim ele não é indicado para esse conjunto de dados.

O modelo de melhor desempenho é o random forest: tem uma precisão de teste de 79,2%.

Uma alta especificidade significa muitos verdadeiros negativos e poucos falsos positivos. Random Forest supera os demais modelos com especificidade de 91,98 %.

mlg agua potavel

Clevia

2022-12-04