Introdução

O acesso para economizar água potável é importante para a saúde pública. Segundo a Organização Mundial da Saúde, um melhor abastecimento de água e saneamento e uma melhor gestão dos recursos hídricos podem impulsionar o crescimento econômico e contribuir para a redução da pobreza.

Além disso, a água potável é muito importante para manter nossas funções corporais. Um corpo humano pode sobreviver até 4 semanas sem comida, mas apenas 3 dias sem água.

Portanto, é importante estudar quais variáveis influenciam na água potável.

O conjunto de dados neste estudo consiste em 10 variáveis, com uma variável dependente (1 = potável, 0 = não potável) e 9 variáveis independentes. As variáveis independentes são os parâmetros da água. O objetivo deste estudo é prever a água potável com base nesses parâmetros da água.

Sobre o conjunto de dados

O conjunto de dados neste estudo consiste em 10 variáveis, com uma variável dependente (1 = potável, 0 = não potável) e 9 variáveis independentes. As variáveis independentes são os parâmetros da água.

Variáveis

Valor do pH: O pH é um parâmetro importante na avaliação do equilíbrio ácido-base da água.

Dureza: A dureza é causada principalmente por sais de cálcio e magnésio.

Sólidos (Sólidos dissolvidos totais - TDS): A água tem a capacidade de dissolver uma ampla gama de minerais ou sais inorgânicos e alguns orgânicos, como potássio, cálcio, sódio, bicarbonatos, cloretos, magnésio, sulfatos etc.

Cloraminas: Cloro e cloramina são os principais desinfetantes usados em sistemas públicos de água.

Sulfato: Os sulfatos são substâncias naturais encontradas em minerais, solo e rochas.

Condutividade: A água pura não é um bom condutor de corrente elétrica, mas sim um bom isolante. O aumento na concentração de íons aumenta a condutividade elétrica da água.

Organic_carbon: O carbono orgânico total (TOC) nas águas de nascente vem da matéria orgânica natural em decomposição (NOM), bem como de fontes sintéticas.

Trihalometanos: THMs são produtos químicos que podem ser encontrados em água tratada com cloro.

Turbidez: A turbidez da água depende da quantidade de matéria sólida presente no estado suspenso.

Potabilidade: Indica se a água é segura para consumo humano onde 1 significa Potável e 0 significa Não potável.

library(readr)
require(dplyr)
require(tidyverse)
require(ggplot2)
require(janitor)
require(skimr)
require(corrplot)
library(Hmisc)
require(GGally)
require(randomForest)
require(caret)
require(scales)
require(ggpubr)
library(MVN)
water_potability <- read_csv("/Users/Clevia/Documents/MLG/water_potability.csv")
knitr::kable( head(water_potability), "simple")
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
NA 204.8905 20791.32 7.300212 368.5164 564.3087 10.379783 86.99097 2.963135 0
3.716080 129.4229 18630.06 6.635246 NA 592.8854 15.180013 56.32908 4.500656 0
8.099124 224.2363 19909.54 9.275884 NA 418.6062 16.868637 66.42009 3.055934 0
8.316766 214.3734 22018.42 8.059332 356.8861 363.2665 18.436525 100.34167 4.628770 0
9.092223 181.1015 17978.99 6.546600 310.1357 398.4108 11.558279 31.99799 4.075075 0
5.584087 188.3133 28748.69 7.544869 326.6784 280.4679 8.399735 54.91786 2.559708 0
water_potability <- water_potability %>% 
  mutate(Potability = as.factor(Potability)) %>%
  clean_names() %>% glimpse()
## Rows: 3,276
## Columns: 10
## $ ph              <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, ~
## $ hardness        <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.~
## $ solids          <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874~
## $ chloramines     <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54~
## $ sulfate         <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66~
## $ conductivity    <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.~
## $ organic_carbon  <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,~
## $ trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.~
## $ turbidity       <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55~
## $ potability      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

Verificação de dados faltantes

#dados faltantes 

water_potability %>%
  summarise_all(~ sum(is.na(.)))
## # A tibble: 1 x 10
##      ph hardness solids chloramines sulfate conductivity organic_carbon
##   <int>    <int>  <int>       <int>   <int>        <int>          <int>
## 1   491        0      0           0     781            0              0
## # ... with 3 more variables: trihalomethanes <int>, turbidity <int>,
## #   potability <int>

Podemos ver que há 3 variáveis com dados faltantes, PH, Sulfate e Trihalomethanes

Gráfico com os valores faltantes

#grafico com valores faltantes 

water_potability %>%  skim() %>%
  filter(n_missing != 0) %>%
  as_tibble() %>%
  select(skim_variable, n_missing, complete_rate) %>%
  mutate(missing_rate = round(abs(complete_rate - 1) * 100, 1)) %>%
  ggplot(aes(
    x = fct_reorder(skim_variable, n_missing),
    y = missing_rate,
    fill = skim_variable,
    label = paste0(missing_rate, "%")
  )) +
  geom_col(width = .6) +
  geom_text(
    size = 4.5,
    hjust = 1.2,
    vjust = .25,
    col = "white"
  ) +
  coord_flip() + theme(aspect.ratio = .4) +
  theme(
    legend.position = "none"
  ) +
  scale_y_continuous(label = label_percent(scale = 1)) +
  scale_fill_manual(values = c("#25C5DA",
                               "#25A69A",
                               "#66BA6A")) +
  labs(
    title = "Dados Faltantes",
    subtitle = "Distribuição dos dados faltantes",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Neste gráfico observamos a proporção de dados faltantes no banco de dados

Valores faltantes em relação a variável resposta

#valores faltantes em relação a variável resposta

water_potability %>% group_by(potability) %>%  skim() %>%
  filter(n_missing != 0) %>%
  as_tibble() %>%
  select(skim_variable, n_missing, complete_rate, potability) %>%
  mutate(missing_rate = round(abs(complete_rate - 1) * 100, 1)) %>%
  ggplot(aes(
    x = fct_reorder(skim_variable, n_missing),
    y = missing_rate,
    fill = skim_variable,
    label = paste0(missing_rate, "%")
  )) +
  geom_col(width = .6) +
  geom_text(
    size = 4,
    hjust = 1.2,
    vjust = 0.25,
    col = "white"
  ) +
  coord_flip() +
  facet_wrap(vars(potability)) +
  theme(aspect.ratio = .7) +
  theme(
    legend.position = "none",
    strip.background = element_rect(fill="#94246D"),
    strip.text = element_text(color = "white", face = "bold", size = 12)
  ) +
  scale_y_continuous(label = label_percent(scale = 1)) +
  scale_fill_manual(values = c("#D41C64",
                               "#4B6FB5",
                               "#6C3996")) +
  labs(
    title = "Dados faltantes VS Variável resposta",
    subtitle = "Plot, Missing Data distribution VS Target Variable",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Aqui observamos os dados faltantes em relação a variável resposta.

Neste caso optou-se por substituir esses valores pela média, pois com a remoção muita informação seria perdida.

#substituir valores faltantes 

water_potability <- water_potability %>% 
  group_by(potability) %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), mean(., na.rm = T), 
                                            as.numeric(.)))) %>% ungroup()
#variavel resposta

water_potability %>%
  select(potability) %>%
  count(potability) %>% mutate(percent = paste0(round(n / sum(n) * 100), "%"), 2) %>%
  ggplot(aes(
    x = potability,
    y = n,
    label = percent,
    fill = potability
  )) +
  geom_col() +
  geom_text(vjust = -0.2, color = "#7C4EA8") +
  scale_fill_manual(values = c("#EF1A25", "#0099D5")) +
  labs(
    title = "Potabilidade da água",
    subtitle = "Gráfico de barras com a distribuição da variável potabilidade",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL,
    fill = NULL
  )

Gráfico com a proporção da variável resposta no banco de dados

p1 <- ggplot(water_potability, aes(ph, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "pH", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom")+
  labs(title = "pH")

p2 <- ggplot(water_potability, aes(hardness, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Hardness", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom")+ 
  labs(title = "Hardness")

p3 <- ggplot(water_potability, aes(solids, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Solids", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Solids")

p4 <- ggplot(water_potability, aes(chloramines, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Chloramines", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Chloramines")

p5 <- ggplot(water_potability, aes(sulfate, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Sulfate", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Sulfate")

p6 <- ggplot(water_potability, aes(conductivity, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Conductivity", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Conductivity")

p7 <- ggplot(water_potability, aes(organic_carbon, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Organic Carbon", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Organic Carbon")

p8 <- ggplot(water_potability, aes(trihalomethanes, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Trihalomethanes", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Trihalomethanes")

p9 <- ggplot(water_potability, aes(turbidity, color = as.factor(potability)))+
  geom_histogram(bins = 30, fill = "white") +
  labs(x = "Turbidity", y = "Count", col = "Potability") +
  theme_bw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Turbidity")

Histogramas com a variável resposta em relação as demais variaveis

figure1 <- ggarrange(p1, p2, p3, p4, p5, nrow = 2, ncol = 3, labels = "AUTO")
figure2 <- ggarrange(p6, p7, p8, p9, nrow = 2, ncol = 2, labels = "AUTO")
figure1

figure2

Box plot

water_potability %>%
  pivot_longer(cols = -potability, names_to = "feature") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_jitter(aes(y = value, col = potability), alpha = 0.1) +
  geom_boxplot(aes(fill = potability)) +
  facet_wrap(vars(feature), ncol = 3, scales = "free") +
  scale_color_manual(values = c("#E4652E", "#0E8A41")) +
  scale_fill_manual(values = c("#E4652E", "#0E8A41")) +
  theme(
    legend.position = "right",
    strip.background = element_rect(fill = "#0B2D5B"),
    strip.text = element_text(color = "white", face = "bold", size = 8)
  ) +
  labs(
    title = "Detect Outliers With Boxplot",
    subtitle = "Plot, Box and Jitter Plot",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL,
    fill = NULL,
    color = NULL
  )

Teste de normalidade

aa<- mvn(data = water_potability[,-10],  univariatePlot =
      "qqplot")

teste <- mvn(data = water_potability[,-10], univariateTest =  
               "SW")
teste$'univariateNormality'
##           Test        Variable Statistic   p value Normality
## 1 Shapiro-Wilk       ph           0.9800  <0.001      NO    
## 2 Shapiro-Wilk    hardness        0.9960  <0.001      NO    
## 3 Shapiro-Wilk     solids         0.9777  <0.001      NO    
## 4 Shapiro-Wilk   chloramines      0.9968  <0.001      NO    
## 5 Shapiro-Wilk     sulfate        0.9610  <0.001      NO    
## 6 Shapiro-Wilk  conductivity      0.9930  <0.001      NO    
## 7 Shapiro-Wilk organic_carbon     0.9995  0.6251      YES   
## 8 Shapiro-Wilk trihalomethanes    0.9970  <0.001      NO    
## 9 Shapiro-Wilk    turbidity       0.9997  0.9336      YES

Correlação

corrplot( cor (water_potability[, -10]), method = "color", #col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", 
         tl.col = T, tl.srt = 45, 
         number.cex = 0.7, tl.cex = 0.7,
         #p.mat = m_corr$P, sig.level = 0.05, insig = "n",
         diag = FALSE, number.digits = 3
)

ggpairs(
  water_potability,
  aes(color = potability),
  columns = 1:9,
  lower = list(continuous = wrap(
    "smooth",
    alpha = 0.2,
    size = 0.5,
    color = "#DE942E"
  )),
  diag = list(continuous = "barDiag"),
  upper = list(continuous = wrap("cor", size = 4))
) +
  scale_color_manual(values = c("#1F5736", "#E94046")) +
  scale_fill_manual(values = c("#1F5736", "#E94046")) +
  theme(
    axis.text = element_text(size = 8),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "white"),
    strip.background.x = element_rect(colour = "black"),
    strip.background.y = element_rect(colour = "black"),
    strip.text = element_text(color = "black", face = "bold", size = 8)
  ) +
  labs(
    title = "Pair plot by Potability Var",
    subtitle = "Pair Plot, scatter plot, Histogram and Correlation coefficient",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Podemos ver que não há indícios de Multicolinearidade

Modelos

Foram testados os modelos de Regressão Logistica, Random Forest e KNN

# 80% training and 20% test dataset
set.seed(1) 
trn_index <- createDataPartition(y = water_potability$potability, p = 0.80, list = FALSE)
trn_water <- water_potability[trn_index, ]
tst_water <- water_potability[-trn_index, ]

# K-nearest neighbors
set.seed(1) 
potability_knn <- train(potability ~ ., method = "knn", data = trn_water,
                        trControl = trainControl(method = 'cv', number = 5, returnResamp = "all"))

# Random forest
set.seed(1)
potability_rf <- randomForest(potability ~ .,
                              data=trn_water, ntree= 1000
)

# Logistic regression
set.seed(1)
potability_lr <- train(potability ~ ., method = "glm", data = trn_water, 
                       family = binomial(link = "logit"),
                       trControl = trainControl(method = 'cv', number = 5))
# Prediction on test-dataset
predicted_outcomes_rf <- predict(potability_rf, tst_water)
predicted_outcomes_knn <- predict(potability_knn, tst_water)
predicted_outcomes_lr <- predict(potability_lr, tst_water)

# Create Confusion Matrices
rf_confm <- confusionMatrix(predicted_outcomes_rf, tst_water$potability, positive='1')
knn_confm <- confusionMatrix(predicted_outcomes_knn, tst_water$potability, positive='1')
logistic_confm <- confusionMatrix(predicted_outcomes_lr, tst_water$potability, positive='1')


# plot of confusion matrices
plot_rf <- as.data.frame(rf_confm$table)
plot_rf$Prediction <- factor(plot_rf$Prediction, levels=rev(levels(plot_rf$Prediction)))

plot_knn <- as.data.frame(knn_confm$table)
plot_knn$Prediction <- factor(plot_knn$Prediction, levels=rev(levels(plot_knn$Prediction)))

plot_logistic <- as.data.frame(logistic_confm$table)
plot_logistic$Prediction <- factor(plot_logistic$Prediction, 
                                   levels=rev(levels(plot_logistic$Prediction)))
plot_conf_rf <- ggplot(plot_rf, aes(Prediction,Reference, fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
  ggtitle("Random Forest. Accuracy: 79,2%, Sensitivity: 59,22%, Specificity: 91,98%")

plot_conf_knn <- ggplot(plot_knn, aes(Prediction,Reference, fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
  ggtitle("K-Nearest Neighbor. Accuracy: 58,26%, Sensitivity: 22,75%, Specificity: 80,95%")

plot_conf_logistic <- ggplot(plot_logistic, aes(Prediction,Reference, fill= Freq)) +
  geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
  ggtitle("Logistic Regression. Accuracy: 61%, Sensitivity: 0%, Specificity: 100%")
mod2 <- glm(formula = trn_water$potability  ~ .,family = binomial(link = "logit"),
                      data = trn_water)

mod3 <- glm(formula = trn_water$potability ~ hardness + solids + sulfate + 
              organic_carbon, family = binomial(link = "logit"), data = trn_water)
summary(mod2)
## 
## Call:
## glm(formula = trn_water$potability ~ ., family = binomial(link = "logit"), 
##     data = trn_water)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1334  -0.9999  -0.9562   1.3536   1.5172  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -2.648e-01  6.832e-01  -0.388    0.698
## ph              -3.864e-03  2.766e-02  -0.140    0.889
## hardness        -1.095e-03  1.227e-03  -0.892    0.372
## solids           6.859e-06  4.627e-06   1.482    0.138
## chloramines      3.120e-02  2.550e-02   1.223    0.221
## sulfate         -1.108e-04  1.128e-03  -0.098    0.922
## conductivity    -4.281e-04  5.006e-04  -0.855    0.392
## organic_carbon  -9.031e-03  1.222e-02  -0.739    0.460
## trihalomethanes  1.254e-03  2.558e-03   0.490    0.624
## turbidity       -1.255e-02  5.155e-02  -0.244    0.808
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3507.3  on 2621  degrees of freedom
## Residual deviance: 3501.1  on 2612  degrees of freedom
## AIC: 3521.1
## 
## Number of Fisher Scoring iterations: 4
step(mod2, direction = "both")
## Start:  AIC=3521.05
## trn_water$potability ~ ph + hardness + solids + chloramines + 
##     sulfate + conductivity + organic_carbon + trihalomethanes + 
##     turbidity
## 
##                   Df Deviance    AIC
## - sulfate          1   3501.1 3519.1
## - ph               1   3501.1 3519.1
## - turbidity        1   3501.1 3519.1
## - trihalomethanes  1   3501.3 3519.3
## - organic_carbon   1   3501.6 3519.6
## - conductivity     1   3501.8 3519.8
## - hardness         1   3501.9 3519.9
## - chloramines      1   3502.6 3520.6
## <none>                 3501.1 3521.1
## - solids           1   3503.2 3521.2
## 
## Step:  AIC=3519.06
## trn_water$potability ~ ph + hardness + solids + chloramines + 
##     conductivity + organic_carbon + trihalomethanes + turbidity
## 
##                   Df Deviance    AIC
## - ph               1   3501.1 3517.1
## - turbidity        1   3501.1 3517.1
## - trihalomethanes  1   3501.3 3517.3
## - organic_carbon   1   3501.6 3517.6
## - conductivity     1   3501.8 3517.8
## - hardness         1   3501.9 3517.9
## - chloramines      1   3502.6 3518.6
## <none>                 3501.1 3519.1
## - solids           1   3503.4 3519.4
## + sulfate          1   3501.1 3521.1
## 
## Step:  AIC=3517.08
## trn_water$potability ~ hardness + solids + chloramines + conductivity + 
##     organic_carbon + trihalomethanes + turbidity
## 
##                   Df Deviance    AIC
## - turbidity        1   3501.1 3515.1
## - trihalomethanes  1   3501.3 3515.3
## - organic_carbon   1   3501.6 3515.6
## - conductivity     1   3501.8 3515.8
## - hardness         1   3501.9 3515.9
## - chloramines      1   3502.6 3516.6
## <none>                 3501.1 3517.1
## - solids           1   3503.4 3517.4
## + ph               1   3501.1 3519.1
## + sulfate          1   3501.1 3519.1
## 
## Step:  AIC=3515.14
## trn_water$potability ~ hardness + solids + chloramines + conductivity + 
##     organic_carbon + trihalomethanes
## 
##                   Df Deviance    AIC
## - trihalomethanes  1   3501.4 3513.4
## - organic_carbon   1   3501.7 3513.7
## - conductivity     1   3501.9 3513.9
## - hardness         1   3501.9 3513.9
## - chloramines      1   3502.6 3514.6
## <none>                 3501.1 3515.1
## - solids           1   3503.5 3515.5
## + turbidity        1   3501.1 3517.1
## + ph               1   3501.1 3517.1
## + sulfate          1   3501.1 3517.1
## 
## Step:  AIC=3513.39
## trn_water$potability ~ hardness + solids + chloramines + conductivity + 
##     organic_carbon
## 
##                   Df Deviance    AIC
## - organic_carbon   1   3501.9 3511.9
## - conductivity     1   3502.1 3512.1
## - hardness         1   3502.2 3512.2
## - chloramines      1   3502.9 3512.9
## <none>                 3501.4 3513.4
## - solids           1   3503.7 3513.7
## + trihalomethanes  1   3501.1 3515.1
## + turbidity        1   3501.3 3515.3
## + ph               1   3501.4 3515.4
## + sulfate          1   3501.4 3515.4
## 
## Step:  AIC=3511.95
## trn_water$potability ~ hardness + solids + chloramines + conductivity
## 
##                   Df Deviance    AIC
## - conductivity     1   3502.7 3510.7
## - hardness         1   3502.8 3510.8
## - chloramines      1   3503.5 3511.5
## <none>                 3501.9 3511.9
## - solids           1   3504.3 3512.3
## + organic_carbon   1   3501.4 3513.4
## + trihalomethanes  1   3501.7 3513.7
## + turbidity        1   3501.9 3513.9
## + ph               1   3501.9 3513.9
## + sulfate          1   3501.9 3513.9
## 
## Step:  AIC=3510.71
## trn_water$potability ~ hardness + solids + chloramines
## 
##                   Df Deviance    AIC
## - hardness         1   3503.5 3509.5
## - chloramines      1   3504.3 3510.3
## <none>                 3502.7 3510.7
## - solids           1   3505.0 3511.0
## + conductivity     1   3501.9 3511.9
## + organic_carbon   1   3502.1 3512.1
## + trihalomethanes  1   3502.5 3512.5
## + turbidity        1   3502.7 3512.7
## + ph               1   3502.7 3512.7
## + sulfate          1   3502.7 3512.7
## 
## Step:  AIC=3509.49
## trn_water$potability ~ solids + chloramines
## 
##                   Df Deviance    AIC
## - chloramines      1   3505.1 3509.1
## <none>                 3503.5 3509.5
## - solids           1   3505.9 3509.9
## + hardness         1   3502.7 3510.7
## + conductivity     1   3502.8 3510.8
## + organic_carbon   1   3502.9 3510.9
## + trihalomethanes  1   3503.2 3511.2
## + turbidity        1   3503.4 3511.4
## + ph               1   3503.4 3511.4
## + sulfate          1   3503.5 3511.5
## 
## Step:  AIC=3509.13
## trn_water$potability ~ solids
## 
##                   Df Deviance    AIC
## <none>                 3505.1 3509.1
## - solids           1   3507.3 3509.3
## + chloramines      1   3503.5 3509.5
## + hardness         1   3504.3 3510.3
## + conductivity     1   3504.4 3510.4
## + organic_carbon   1   3504.5 3510.5
## + trihalomethanes  1   3504.9 3510.9
## + ph               1   3505.1 3511.1
## + turbidity        1   3505.1 3511.1
## + sulfate          1   3505.1 3511.1
## 
## Call:  glm(formula = trn_water$potability ~ solids, family = binomial(link = "logit"), 
##     data = trn_water)
## 
## Coefficients:
## (Intercept)       solids  
##  -5.932e-01    6.652e-06  
## 
## Degrees of Freedom: 2621 Total (i.e. Null);  2620 Residual
## Null Deviance:       3507 
## Residual Deviance: 3505  AIC: 3509
summary(mod3)
## 
## Call:
## glm(formula = trn_water$potability ~ hardness + solids + sulfate + 
##     organic_carbon, family = binomial(link = "logit"), data = trn_water)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1135  -0.9992  -0.9679   1.3595   1.4700  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -2.011e-01  5.215e-01  -0.386    0.700
## hardness       -1.129e-03  1.223e-03  -0.923    0.356
## solids          6.403e-06  4.599e-06   1.392    0.164
## sulfate        -8.739e-05  1.127e-03  -0.078    0.938
## organic_carbon -9.539e-03  1.220e-02  -0.782    0.434
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3507.3  on 2621  degrees of freedom
## Residual deviance: 3503.7  on 2617  degrees of freedom
## AIC: 3513.7
## 
## Number of Fisher Scoring iterations: 4

Utilizando a seleção de modelos Stepwise o modelo que melhor explica os dados foi o modelo 3 entretanto ao fazer a predição o modelo prevê apenas valores 0. Dessa forma foram testados outros modelos que pudessem retornar resultados significativos sento eles Random Forest e KNN

plot_conf_rf

plot_conf_knn

plot_conf_logistic

Modelo Random Forest

set.seed(31967)

TrainIndex <-
  createDataPartition(water_potability$potability, p = 0.8, list = FALSE)

TrainingSet <- water_potability[TrainIndex, ]
TestSet <- water_potability[-TrainIndex, ]
xdf <- TrainingSet %>% select(-potability)
ydf <- TrainingSet %>% select(potability)

set.seed(31967)

Quantidade de variáveis (colunas) sorteadas por árvore.

#Quantidade de variáveis (colunas) sorteadas por árvore.

BestMtry <-
  tuneRF(
    xdf,
    ydf$potability,
    stepFactor = 1.5,
    improve = 1e-6,
    ntree = 1000, 
    plot = F
  )
## mtry = 3  OOB error = 21.13% 
## Searching left ...
## mtry = 2     OOB error = 20.63% 
## 0.0234657 1e-06 
## Searching right ...
## mtry = 4     OOB error = 20.52% 
## 0.005545287 1e-06 
## mtry = 6     OOB error = 21.24% 
## -0.03531599 1e-06
BestMtry %>% as_tibble() %>% 
  ggplot(aes(x = mtry, y = OOBError)) +
  geom_line(col = "steelblue", size = 1.5)+
  geom_point(col = "orange", size = 3)+
  labs(
    title = "Best Mtry",
    caption = "Data source: Kaggle.com, Water Quality",
    x = "Mtry",
    y = "OOB Error"
  )

set.seed(31967)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 7)

set.seed(31967)
RfFinal <- train(
  potability ~ .,
  data = TrainingSet,
  method = "rf",
  metric = "Accuracy",
  tuneGrid = expand.grid(.mtry = 4),
  trControl = control,
  ntree = 1000
)


plot(RfFinal$finalModel)

Variaveis importantes

#variaveis inportantes

VarsImp <- varImp(RfFinal, scale = FALSE)


VarsImp$importance %>% 
  rownames_to_column(var = "Variable") %>% 
  as_tibble() %>% 
  ggplot(aes(x = fct_reorder(Variable,Overall), y = Overall)) +
  geom_col(fill = "#1F5736", size = 1) +
  coord_flip()+
  labs(
    title = "Variables importance",
    subtitle = "Column Plot, Variables importance, RfFinal mpdel",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

#######################################################################

As 5 variáveis mais importantes no modelo são, Sulfate, PH,Hardness, Solids e Chloramines

Predição

#predição

set.seed(31967)

PredRf <- predict(RfFinal,TestSet, type = "raw")


confusionMatrix(data = PredRf, reference = TestSet$potability, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 351  84
##          1  48 171
##                                           
##                Accuracy : 0.7982          
##                  95% CI : (0.7653, 0.8283)
##     No Information Rate : 0.6101          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5647          
##                                           
##  Mcnemar's Test P-Value : 0.002316        
##                                           
##             Sensitivity : 0.6706          
##             Specificity : 0.8797          
##          Pos Pred Value : 0.7808          
##          Neg Pred Value : 0.8069          
##              Prevalence : 0.3899          
##          Detection Rate : 0.2615          
##    Detection Prevalence : 0.3349          
##       Balanced Accuracy : 0.7751          
##                                           
##        'Positive' Class : 1               
## 
#######################################################3

Conclusão

O modelo de regressão logística falhou em encontrar um padrão nos dados. Ele prevê que nem tudo é potável. Sendo assim ele não é indicado para esse conjunto de dados.

O modelo de melhor desempenho é o random forest: tem uma precisão de teste de 79,2%.

Uma alta especificidade significa muitos verdadeiros negativos e poucos falsos positivos. Random Forest supera os demais modelos com especificidade de 91,98 %.