O acesso para economizar água potável é importante para a saúde pública. Segundo a Organização Mundial da Saúde, um melhor abastecimento de água e saneamento e uma melhor gestão dos recursos hídricos podem impulsionar o crescimento econômico e contribuir para a redução da pobreza.
Além disso, a água potável é muito importante para manter nossas funções corporais. Um corpo humano pode sobreviver até 4 semanas sem comida, mas apenas 3 dias sem água.
Portanto, é importante estudar quais variáveis influenciam na água potável.
O conjunto de dados neste estudo consiste em 10 variáveis, com uma variável dependente (1 = potável, 0 = não potável) e 9 variáveis independentes. As variáveis independentes são os parâmetros da água. O objetivo deste estudo é prever a água potável com base nesses parâmetros da água.
O conjunto de dados neste estudo consiste em 10 variáveis, com uma variável dependente (1 = potável, 0 = não potável) e 9 variáveis independentes. As variáveis independentes são os parâmetros da água.
Variáveis
Valor do pH: O pH é um parâmetro importante na avaliação do equilíbrio ácido-base da água.
Dureza: A dureza é causada principalmente por sais de cálcio e magnésio.
Sólidos (Sólidos dissolvidos totais - TDS): A água tem a capacidade de dissolver uma ampla gama de minerais ou sais inorgânicos e alguns orgânicos, como potássio, cálcio, sódio, bicarbonatos, cloretos, magnésio, sulfatos etc.
Cloraminas: Cloro e cloramina são os principais desinfetantes usados em sistemas públicos de água.
Sulfato: Os sulfatos são substâncias naturais encontradas em minerais, solo e rochas.
Condutividade: A água pura não é um bom condutor de corrente elétrica, mas sim um bom isolante. O aumento na concentração de íons aumenta a condutividade elétrica da água.
Organic_carbon: O carbono orgânico total (TOC) nas águas de nascente vem da matéria orgânica natural em decomposição (NOM), bem como de fontes sintéticas.
Trihalometanos: THMs são produtos químicos que podem ser encontrados em água tratada com cloro.
Turbidez: A turbidez da água depende da quantidade de matéria sólida presente no estado suspenso.
Potabilidade: Indica se a água é segura para consumo humano onde 1 significa Potável e 0 significa Não potável.
library(readr)
require(dplyr)
require(tidyverse)
require(ggplot2)
require(janitor)
require(skimr)
require(corrplot)
library(Hmisc)
require(GGally)
require(randomForest)
require(caret)
require(scales)
require(ggpubr)
library(MVN)water_potability <- read_csv("/Users/Clevia/Documents/MLG/water_potability.csv")
knitr::kable( head(water_potability), "simple")| ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|
| NA | 204.8905 | 20791.32 | 7.300212 | 368.5164 | 564.3087 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 3.716080 | 129.4229 | 18630.06 | 6.635246 | NA | 592.8854 | 15.180013 | 56.32908 | 4.500656 | 0 |
| 8.099124 | 224.2363 | 19909.54 | 9.275884 | NA | 418.6062 | 16.868637 | 66.42009 | 3.055934 | 0 |
| 8.316766 | 214.3734 | 22018.42 | 8.059332 | 356.8861 | 363.2665 | 18.436525 | 100.34167 | 4.628770 | 0 |
| 9.092223 | 181.1015 | 17978.99 | 6.546600 | 310.1357 | 398.4108 | 11.558279 | 31.99799 | 4.075075 | 0 |
| 5.584087 | 188.3133 | 28748.69 | 7.544869 | 326.6784 | 280.4679 | 8.399735 | 54.91786 | 2.559708 | 0 |
water_potability <- water_potability %>%
mutate(Potability = as.factor(Potability)) %>%
clean_names() %>% glimpse()## Rows: 3,276
## Columns: 10
## $ ph <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, ~
## $ hardness <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.~
## $ solids <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874~
## $ chloramines <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54~
## $ sulfate <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66~
## $ conductivity <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.~
## $ organic_carbon <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,~
## $ trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.~
## $ turbidity <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55~
## $ potability <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#dados faltantes
water_potability %>%
summarise_all(~ sum(is.na(.)))## # A tibble: 1 x 10
## ph hardness solids chloramines sulfate conductivity organic_carbon
## <int> <int> <int> <int> <int> <int> <int>
## 1 491 0 0 0 781 0 0
## # ... with 3 more variables: trihalomethanes <int>, turbidity <int>,
## # potability <int>
Podemos ver que há 3 variáveis com dados faltantes, PH, Sulfate e Trihalomethanes
#grafico com valores faltantes
water_potability %>% skim() %>%
filter(n_missing != 0) %>%
as_tibble() %>%
select(skim_variable, n_missing, complete_rate) %>%
mutate(missing_rate = round(abs(complete_rate - 1) * 100, 1)) %>%
ggplot(aes(
x = fct_reorder(skim_variable, n_missing),
y = missing_rate,
fill = skim_variable,
label = paste0(missing_rate, "%")
)) +
geom_col(width = .6) +
geom_text(
size = 4.5,
hjust = 1.2,
vjust = .25,
col = "white"
) +
coord_flip() + theme(aspect.ratio = .4) +
theme(
legend.position = "none"
) +
scale_y_continuous(label = label_percent(scale = 1)) +
scale_fill_manual(values = c("#25C5DA",
"#25A69A",
"#66BA6A")) +
labs(
title = "Dados Faltantes",
subtitle = "Distribuição dos dados faltantes",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL
)
Neste gráfico observamos a proporção de dados faltantes no banco de
dados
#valores faltantes em relação a variável resposta
water_potability %>% group_by(potability) %>% skim() %>%
filter(n_missing != 0) %>%
as_tibble() %>%
select(skim_variable, n_missing, complete_rate, potability) %>%
mutate(missing_rate = round(abs(complete_rate - 1) * 100, 1)) %>%
ggplot(aes(
x = fct_reorder(skim_variable, n_missing),
y = missing_rate,
fill = skim_variable,
label = paste0(missing_rate, "%")
)) +
geom_col(width = .6) +
geom_text(
size = 4,
hjust = 1.2,
vjust = 0.25,
col = "white"
) +
coord_flip() +
facet_wrap(vars(potability)) +
theme(aspect.ratio = .7) +
theme(
legend.position = "none",
strip.background = element_rect(fill="#94246D"),
strip.text = element_text(color = "white", face = "bold", size = 12)
) +
scale_y_continuous(label = label_percent(scale = 1)) +
scale_fill_manual(values = c("#D41C64",
"#4B6FB5",
"#6C3996")) +
labs(
title = "Dados faltantes VS Variável resposta",
subtitle = "Plot, Missing Data distribution VS Target Variable",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL
)Aqui observamos os dados faltantes em relação a variável resposta.
Neste caso optou-se por substituir esses valores pela média, pois com a remoção muita informação seria perdida.
#substituir valores faltantes
water_potability <- water_potability %>%
group_by(potability) %>%
mutate(across(where(is.numeric), ~if_else(is.na(.), mean(., na.rm = T),
as.numeric(.)))) %>% ungroup()#variavel resposta
water_potability %>%
select(potability) %>%
count(potability) %>% mutate(percent = paste0(round(n / sum(n) * 100), "%"), 2) %>%
ggplot(aes(
x = potability,
y = n,
label = percent,
fill = potability
)) +
geom_col() +
geom_text(vjust = -0.2, color = "#7C4EA8") +
scale_fill_manual(values = c("#EF1A25", "#0099D5")) +
labs(
title = "Potabilidade da água",
subtitle = "Gráfico de barras com a distribuição da variável potabilidade",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL,
fill = NULL
)Gráfico com a proporção da variável resposta no banco de dados
p1 <- ggplot(water_potability, aes(ph, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "pH", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom")+
labs(title = "pH")
p2 <- ggplot(water_potability, aes(hardness, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Hardness", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom")+
labs(title = "Hardness")
p3 <- ggplot(water_potability, aes(solids, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Solids", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Solids")
p4 <- ggplot(water_potability, aes(chloramines, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Chloramines", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Chloramines")
p5 <- ggplot(water_potability, aes(sulfate, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Sulfate", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Sulfate")
p6 <- ggplot(water_potability, aes(conductivity, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Conductivity", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Conductivity")
p7 <- ggplot(water_potability, aes(organic_carbon, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Organic Carbon", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Organic Carbon")
p8 <- ggplot(water_potability, aes(trihalomethanes, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Trihalomethanes", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Trihalomethanes")
p9 <- ggplot(water_potability, aes(turbidity, color = as.factor(potability)))+
geom_histogram(bins = 30, fill = "white") +
labs(x = "Turbidity", y = "Count", col = "Potability") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Turbidity")figure1 <- ggarrange(p1, p2, p3, p4, p5, nrow = 2, ncol = 3, labels = "AUTO")
figure2 <- ggarrange(p6, p7, p8, p9, nrow = 2, ncol = 2, labels = "AUTO")
figure1figure2water_potability %>%
pivot_longer(cols = -potability, names_to = "feature") %>%
ggplot(aes(x = feature, y = value)) +
geom_jitter(aes(y = value, col = potability), alpha = 0.1) +
geom_boxplot(aes(fill = potability)) +
facet_wrap(vars(feature), ncol = 3, scales = "free") +
scale_color_manual(values = c("#E4652E", "#0E8A41")) +
scale_fill_manual(values = c("#E4652E", "#0E8A41")) +
theme(
legend.position = "right",
strip.background = element_rect(fill = "#0B2D5B"),
strip.text = element_text(color = "white", face = "bold", size = 8)
) +
labs(
title = "Detect Outliers With Boxplot",
subtitle = "Plot, Box and Jitter Plot",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL,
fill = NULL,
color = NULL
)aa<- mvn(data = water_potability[,-10], univariatePlot =
"qqplot")teste <- mvn(data = water_potability[,-10], univariateTest =
"SW")
teste$'univariateNormality'## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk ph 0.9800 <0.001 NO
## 2 Shapiro-Wilk hardness 0.9960 <0.001 NO
## 3 Shapiro-Wilk solids 0.9777 <0.001 NO
## 4 Shapiro-Wilk chloramines 0.9968 <0.001 NO
## 5 Shapiro-Wilk sulfate 0.9610 <0.001 NO
## 6 Shapiro-Wilk conductivity 0.9930 <0.001 NO
## 7 Shapiro-Wilk organic_carbon 0.9995 0.6251 YES
## 8 Shapiro-Wilk trihalomethanes 0.9970 <0.001 NO
## 9 Shapiro-Wilk turbidity 0.9997 0.9336 YES
corrplot( cor (water_potability[, -10]), method = "color", #col = col(200),
type = "upper", order = "hclust",
addCoef.col = "black",
tl.col = T, tl.srt = 45,
number.cex = 0.7, tl.cex = 0.7,
#p.mat = m_corr$P, sig.level = 0.05, insig = "n",
diag = FALSE, number.digits = 3
)ggpairs(
water_potability,
aes(color = potability),
columns = 1:9,
lower = list(continuous = wrap(
"smooth",
alpha = 0.2,
size = 0.5,
color = "#DE942E"
)),
diag = list(continuous = "barDiag"),
upper = list(continuous = wrap("cor", size = 4))
) +
scale_color_manual(values = c("#1F5736", "#E94046")) +
scale_fill_manual(values = c("#1F5736", "#E94046")) +
theme(
axis.text = element_text(size = 8),
panel.background = element_rect(fill = "white"),
strip.background = element_rect(fill = "white"),
strip.background.x = element_rect(colour = "black"),
strip.background.y = element_rect(colour = "black"),
strip.text = element_text(color = "black", face = "bold", size = 8)
) +
labs(
title = "Pair plot by Potability Var",
subtitle = "Pair Plot, scatter plot, Histogram and Correlation coefficient",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL
)Podemos ver que não há indícios de Multicolinearidade
Foram testados os modelos de Regressão Logistica, Random Forest e KNN
# 80% training and 20% test dataset
set.seed(1)
trn_index <- createDataPartition(y = water_potability$potability, p = 0.80, list = FALSE)
trn_water <- water_potability[trn_index, ]
tst_water <- water_potability[-trn_index, ]
# K-nearest neighbors
set.seed(1)
potability_knn <- train(potability ~ ., method = "knn", data = trn_water,
trControl = trainControl(method = 'cv', number = 5, returnResamp = "all"))
# Random forest
set.seed(1)
potability_rf <- randomForest(potability ~ .,
data=trn_water, ntree= 1000
)
# Logistic regression
set.seed(1)
potability_lr <- train(potability ~ ., method = "glm", data = trn_water,
family = binomial(link = "logit"),
trControl = trainControl(method = 'cv', number = 5))# Prediction on test-dataset
predicted_outcomes_rf <- predict(potability_rf, tst_water)
predicted_outcomes_knn <- predict(potability_knn, tst_water)
predicted_outcomes_lr <- predict(potability_lr, tst_water)
# Create Confusion Matrices
rf_confm <- confusionMatrix(predicted_outcomes_rf, tst_water$potability, positive='1')
knn_confm <- confusionMatrix(predicted_outcomes_knn, tst_water$potability, positive='1')
logistic_confm <- confusionMatrix(predicted_outcomes_lr, tst_water$potability, positive='1')
# plot of confusion matrices
plot_rf <- as.data.frame(rf_confm$table)
plot_rf$Prediction <- factor(plot_rf$Prediction, levels=rev(levels(plot_rf$Prediction)))
plot_knn <- as.data.frame(knn_confm$table)
plot_knn$Prediction <- factor(plot_knn$Prediction, levels=rev(levels(plot_knn$Prediction)))
plot_logistic <- as.data.frame(logistic_confm$table)
plot_logistic$Prediction <- factor(plot_logistic$Prediction,
levels=rev(levels(plot_logistic$Prediction)))plot_conf_rf <- ggplot(plot_rf, aes(Prediction,Reference, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
ggtitle("Random Forest. Accuracy: 79,2%, Sensitivity: 59,22%, Specificity: 91,98%")
plot_conf_knn <- ggplot(plot_knn, aes(Prediction,Reference, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
ggtitle("K-Nearest Neighbor. Accuracy: 58,26%, Sensitivity: 22,75%, Specificity: 80,95%")
plot_conf_logistic <- ggplot(plot_logistic, aes(Prediction,Reference, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) + theme(legend.position = "None") +
ggtitle("Logistic Regression. Accuracy: 61%, Sensitivity: 0%, Specificity: 100%")mod2 <- glm(formula = trn_water$potability ~ .,family = binomial(link = "logit"),
data = trn_water)
mod3 <- glm(formula = trn_water$potability ~ hardness + solids + sulfate +
organic_carbon, family = binomial(link = "logit"), data = trn_water)summary(mod2)##
## Call:
## glm(formula = trn_water$potability ~ ., family = binomial(link = "logit"),
## data = trn_water)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1334 -0.9999 -0.9562 1.3536 1.5172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.648e-01 6.832e-01 -0.388 0.698
## ph -3.864e-03 2.766e-02 -0.140 0.889
## hardness -1.095e-03 1.227e-03 -0.892 0.372
## solids 6.859e-06 4.627e-06 1.482 0.138
## chloramines 3.120e-02 2.550e-02 1.223 0.221
## sulfate -1.108e-04 1.128e-03 -0.098 0.922
## conductivity -4.281e-04 5.006e-04 -0.855 0.392
## organic_carbon -9.031e-03 1.222e-02 -0.739 0.460
## trihalomethanes 1.254e-03 2.558e-03 0.490 0.624
## turbidity -1.255e-02 5.155e-02 -0.244 0.808
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3507.3 on 2621 degrees of freedom
## Residual deviance: 3501.1 on 2612 degrees of freedom
## AIC: 3521.1
##
## Number of Fisher Scoring iterations: 4
step(mod2, direction = "both")## Start: AIC=3521.05
## trn_water$potability ~ ph + hardness + solids + chloramines +
## sulfate + conductivity + organic_carbon + trihalomethanes +
## turbidity
##
## Df Deviance AIC
## - sulfate 1 3501.1 3519.1
## - ph 1 3501.1 3519.1
## - turbidity 1 3501.1 3519.1
## - trihalomethanes 1 3501.3 3519.3
## - organic_carbon 1 3501.6 3519.6
## - conductivity 1 3501.8 3519.8
## - hardness 1 3501.9 3519.9
## - chloramines 1 3502.6 3520.6
## <none> 3501.1 3521.1
## - solids 1 3503.2 3521.2
##
## Step: AIC=3519.06
## trn_water$potability ~ ph + hardness + solids + chloramines +
## conductivity + organic_carbon + trihalomethanes + turbidity
##
## Df Deviance AIC
## - ph 1 3501.1 3517.1
## - turbidity 1 3501.1 3517.1
## - trihalomethanes 1 3501.3 3517.3
## - organic_carbon 1 3501.6 3517.6
## - conductivity 1 3501.8 3517.8
## - hardness 1 3501.9 3517.9
## - chloramines 1 3502.6 3518.6
## <none> 3501.1 3519.1
## - solids 1 3503.4 3519.4
## + sulfate 1 3501.1 3521.1
##
## Step: AIC=3517.08
## trn_water$potability ~ hardness + solids + chloramines + conductivity +
## organic_carbon + trihalomethanes + turbidity
##
## Df Deviance AIC
## - turbidity 1 3501.1 3515.1
## - trihalomethanes 1 3501.3 3515.3
## - organic_carbon 1 3501.6 3515.6
## - conductivity 1 3501.8 3515.8
## - hardness 1 3501.9 3515.9
## - chloramines 1 3502.6 3516.6
## <none> 3501.1 3517.1
## - solids 1 3503.4 3517.4
## + ph 1 3501.1 3519.1
## + sulfate 1 3501.1 3519.1
##
## Step: AIC=3515.14
## trn_water$potability ~ hardness + solids + chloramines + conductivity +
## organic_carbon + trihalomethanes
##
## Df Deviance AIC
## - trihalomethanes 1 3501.4 3513.4
## - organic_carbon 1 3501.7 3513.7
## - conductivity 1 3501.9 3513.9
## - hardness 1 3501.9 3513.9
## - chloramines 1 3502.6 3514.6
## <none> 3501.1 3515.1
## - solids 1 3503.5 3515.5
## + turbidity 1 3501.1 3517.1
## + ph 1 3501.1 3517.1
## + sulfate 1 3501.1 3517.1
##
## Step: AIC=3513.39
## trn_water$potability ~ hardness + solids + chloramines + conductivity +
## organic_carbon
##
## Df Deviance AIC
## - organic_carbon 1 3501.9 3511.9
## - conductivity 1 3502.1 3512.1
## - hardness 1 3502.2 3512.2
## - chloramines 1 3502.9 3512.9
## <none> 3501.4 3513.4
## - solids 1 3503.7 3513.7
## + trihalomethanes 1 3501.1 3515.1
## + turbidity 1 3501.3 3515.3
## + ph 1 3501.4 3515.4
## + sulfate 1 3501.4 3515.4
##
## Step: AIC=3511.95
## trn_water$potability ~ hardness + solids + chloramines + conductivity
##
## Df Deviance AIC
## - conductivity 1 3502.7 3510.7
## - hardness 1 3502.8 3510.8
## - chloramines 1 3503.5 3511.5
## <none> 3501.9 3511.9
## - solids 1 3504.3 3512.3
## + organic_carbon 1 3501.4 3513.4
## + trihalomethanes 1 3501.7 3513.7
## + turbidity 1 3501.9 3513.9
## + ph 1 3501.9 3513.9
## + sulfate 1 3501.9 3513.9
##
## Step: AIC=3510.71
## trn_water$potability ~ hardness + solids + chloramines
##
## Df Deviance AIC
## - hardness 1 3503.5 3509.5
## - chloramines 1 3504.3 3510.3
## <none> 3502.7 3510.7
## - solids 1 3505.0 3511.0
## + conductivity 1 3501.9 3511.9
## + organic_carbon 1 3502.1 3512.1
## + trihalomethanes 1 3502.5 3512.5
## + turbidity 1 3502.7 3512.7
## + ph 1 3502.7 3512.7
## + sulfate 1 3502.7 3512.7
##
## Step: AIC=3509.49
## trn_water$potability ~ solids + chloramines
##
## Df Deviance AIC
## - chloramines 1 3505.1 3509.1
## <none> 3503.5 3509.5
## - solids 1 3505.9 3509.9
## + hardness 1 3502.7 3510.7
## + conductivity 1 3502.8 3510.8
## + organic_carbon 1 3502.9 3510.9
## + trihalomethanes 1 3503.2 3511.2
## + turbidity 1 3503.4 3511.4
## + ph 1 3503.4 3511.4
## + sulfate 1 3503.5 3511.5
##
## Step: AIC=3509.13
## trn_water$potability ~ solids
##
## Df Deviance AIC
## <none> 3505.1 3509.1
## - solids 1 3507.3 3509.3
## + chloramines 1 3503.5 3509.5
## + hardness 1 3504.3 3510.3
## + conductivity 1 3504.4 3510.4
## + organic_carbon 1 3504.5 3510.5
## + trihalomethanes 1 3504.9 3510.9
## + ph 1 3505.1 3511.1
## + turbidity 1 3505.1 3511.1
## + sulfate 1 3505.1 3511.1
##
## Call: glm(formula = trn_water$potability ~ solids, family = binomial(link = "logit"),
## data = trn_water)
##
## Coefficients:
## (Intercept) solids
## -5.932e-01 6.652e-06
##
## Degrees of Freedom: 2621 Total (i.e. Null); 2620 Residual
## Null Deviance: 3507
## Residual Deviance: 3505 AIC: 3509
summary(mod3)##
## Call:
## glm(formula = trn_water$potability ~ hardness + solids + sulfate +
## organic_carbon, family = binomial(link = "logit"), data = trn_water)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1135 -0.9992 -0.9679 1.3595 1.4700
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.011e-01 5.215e-01 -0.386 0.700
## hardness -1.129e-03 1.223e-03 -0.923 0.356
## solids 6.403e-06 4.599e-06 1.392 0.164
## sulfate -8.739e-05 1.127e-03 -0.078 0.938
## organic_carbon -9.539e-03 1.220e-02 -0.782 0.434
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3507.3 on 2621 degrees of freedom
## Residual deviance: 3503.7 on 2617 degrees of freedom
## AIC: 3513.7
##
## Number of Fisher Scoring iterations: 4
Utilizando a seleção de modelos Stepwise o modelo que melhor explica os dados foi o modelo 3 entretanto ao fazer a predição o modelo prevê apenas valores 0. Dessa forma foram testados outros modelos que pudessem retornar resultados significativos sento eles Random Forest e KNN
plot_conf_rfplot_conf_knnplot_conf_logisticset.seed(31967)
TrainIndex <-
createDataPartition(water_potability$potability, p = 0.8, list = FALSE)
TrainingSet <- water_potability[TrainIndex, ]
TestSet <- water_potability[-TrainIndex, ]xdf <- TrainingSet %>% select(-potability)
ydf <- TrainingSet %>% select(potability)
set.seed(31967)#Quantidade de variáveis (colunas) sorteadas por árvore.
BestMtry <-
tuneRF(
xdf,
ydf$potability,
stepFactor = 1.5,
improve = 1e-6,
ntree = 1000,
plot = F
)## mtry = 3 OOB error = 21.13%
## Searching left ...
## mtry = 2 OOB error = 20.63%
## 0.0234657 1e-06
## Searching right ...
## mtry = 4 OOB error = 20.52%
## 0.005545287 1e-06
## mtry = 6 OOB error = 21.24%
## -0.03531599 1e-06
BestMtry %>% as_tibble() %>%
ggplot(aes(x = mtry, y = OOBError)) +
geom_line(col = "steelblue", size = 1.5)+
geom_point(col = "orange", size = 3)+
labs(
title = "Best Mtry",
caption = "Data source: Kaggle.com, Water Quality",
x = "Mtry",
y = "OOB Error"
)set.seed(31967)
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 7)
set.seed(31967)
RfFinal <- train(
potability ~ .,
data = TrainingSet,
method = "rf",
metric = "Accuracy",
tuneGrid = expand.grid(.mtry = 4),
trControl = control,
ntree = 1000
)
plot(RfFinal$finalModel)#variaveis inportantes
VarsImp <- varImp(RfFinal, scale = FALSE)
VarsImp$importance %>%
rownames_to_column(var = "Variable") %>%
as_tibble() %>%
ggplot(aes(x = fct_reorder(Variable,Overall), y = Overall)) +
geom_col(fill = "#1F5736", size = 1) +
coord_flip()+
labs(
title = "Variables importance",
subtitle = "Column Plot, Variables importance, RfFinal mpdel",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL
)#######################################################################As 5 variáveis mais importantes no modelo são, Sulfate, PH,Hardness, Solids e Chloramines
#predição
set.seed(31967)
PredRf <- predict(RfFinal,TestSet, type = "raw")
confusionMatrix(data = PredRf, reference = TestSet$potability, positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 351 84
## 1 48 171
##
## Accuracy : 0.7982
## 95% CI : (0.7653, 0.8283)
## No Information Rate : 0.6101
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5647
##
## Mcnemar's Test P-Value : 0.002316
##
## Sensitivity : 0.6706
## Specificity : 0.8797
## Pos Pred Value : 0.7808
## Neg Pred Value : 0.8069
## Prevalence : 0.3899
## Detection Rate : 0.2615
## Detection Prevalence : 0.3349
## Balanced Accuracy : 0.7751
##
## 'Positive' Class : 1
##
#######################################################3O modelo de regressão logística falhou em encontrar um padrão nos dados. Ele prevê que nem tudo é potável. Sendo assim ele não é indicado para esse conjunto de dados.
O modelo de melhor desempenho é o random forest: tem uma precisão de teste de 79,2%.
Uma alta especificidade significa muitos verdadeiros negativos e poucos falsos positivos. Random Forest supera os demais modelos com especificidade de 91,98 %.