Predicting Water Potability With Classification: Logistic Regression and K-NN
Introduction
Water is an essential resource for sustaining life on our planet. We rely on it for various purposes, including hydration, sanitation, and agricultural needs. However, not all water sources are safe for consumption. The quality of drinking water is a matter of utmost importance as it directly affects our health and well-being. The distinction between potable (safe for drinking) and non-potable water is crucial for ensuring the safety of individuals and communities.
Dataset Overview
Content
The water_potability.csv file contains water quality metrics for 3,276 different water bodies:
- ph: the pH level of the water (double)
- Hardness: the hardness of the water (double)
- Solids: the total dissolved solids in the water (double)
- Chloramines: the concentration of chloramines in the water (double)
- Sulfate: the concentration of sulfate in the water (double)
- Conductivity: the electrical conductivity of the water (double)
- Organic_carbon: the amount of organic carbon in the water (double)
- Trihalomethanes: the concentration of trihalomethanes in the water (double)
- Turbidity: the turbidity level of the water (double)
- Potability: whether the water is potable (1) or not (0) (integer)
Objectives
We can define objectives for predicting whether water is safe for human consumption (potable) or not. Here are some possible objectives:
- Objective 1: Develop a classification model to predict water potability using the available features.
- Objective 2: Evaluate the performance of different classification algorithms in predicting water potability.
- Objective 3: Identify the most influential features in determining water potability.
- Objective 4: Optimize the classification model to improve the accuracy of predicting potable and non-potable water.
Data Preparation
Import package & Dataset
Import Packages:
library(dplyr) # Package for data manipulation
library(gtools) # Utility functions for data manipulation and statistical analysis
library(ggplot2) # Package for data visualization using the grammar of graphics
library(class) # Package for classification, including K-nearest neighbors (KNN) classification
library(tidyr) # Package for reshaping and tidying data
library(reshape2) # Package for data reshaping and restructuring
library(caret) # Comprehensive toolkit for machine learning
library(plotly) # Interactive plotting library
library(knitr) # Package for dynamic report generation
Import Dataset
water_potability <- read.csv("data_input/water_potability.csv", stringsAsFactors = T)
rmarkdown::paged_table(water_potability)
Data Wrangling
glimpse(water_potability)
#> Rows: 3,276
#> Columns: 10
#> $ ph <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, …
#> $ Hardness <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.…
#> $ Solids <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874…
#> $ Chloramines <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54…
#> $ Sulfate <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66…
#> $ Conductivity <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.…
#> $ Organic_carbon <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,…
#> $ Trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.…
#> $ Turbidity <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55…
#> $ Potability <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
According to the results of this inspection, Potability should be stored as a factor, so its data type must be changed. For this business case, we focus on building a good model that predicts Potability using all existing predictors.
Target: Potability
Predictors: all variables besides Potability
water_potability <- water_potability %>%
mutate_at(c("Potability"), as.factor)
Finding NA Values
colSums(is.na(water_potability))
#> ph Hardness Solids Chloramines Sulfate
#> 491 0 0 0 781
#> Conductivity Organic_carbon Trihalomethanes Turbidity Potability
#> 0 0 162 0 0
We can see that there are missing values in the ph, Sulfate, and Trihalomethanes columns. Given that the dataset has only 3276 rows, we cannot afford to discard these data points. Instead, we can replace them with the means of the corresponding columns.
So I perform mean imputation: overall column means for ph and Trihalomethanes, and potability-group means for Sulfate.
# Replace missing values in 'ph' and 'Trihalomethanes' with the mean of the entire column
water_potability$ph[is.na(water_potability$ph)] <- mean(water_potability$ph, na.rm = TRUE)
water_potability$Trihalomethanes[is.na(water_potability$Trihalomethanes)] <- mean(water_potability$Trihalomethanes, na.rm = TRUE)
# Calculate mean values of 'Sulfate' based on potability
mean_sulfate_potable <- mean(water_potability$Sulfate[water_potability$Potability == 1], na.rm = TRUE)
mean_sulfate_non_potable <- mean(water_potability$Sulfate[water_potability$Potability == 0], na.rm = TRUE)
# Replace missing values in 'Sulfate' based on potability
water_potability$Sulfate[is.na(water_potability$Sulfate) & water_potability$Potability == 1] <- mean_sulfate_potable
water_potability$Sulfate[is.na(water_potability$Sulfate) & water_potability$Potability == 0] <- mean_sulfate_non_potable
Exploratory Data Analysis
Check data distributions/patterns
Checking the range of our data frame
# explore with summary
summary(water_potability)
#> ph Hardness Solids Chloramines
#> Min. : 0.000 Min. : 47.43 Min. : 320.9 Min. : 0.352
#> 1st Qu.: 6.278 1st Qu.:176.85 1st Qu.:15666.7 1st Qu.: 6.127
#> Median : 7.081 Median :196.97 Median :20927.8 Median : 7.130
#> Mean : 7.081 Mean :196.37 Mean :22014.1 Mean : 7.122
#> 3rd Qu.: 7.870 3rd Qu.:216.67 3rd Qu.:27332.8 3rd Qu.: 8.115
#> Max. :14.000 Max. :323.12 Max. :61227.2 Max. :13.127
#> Sulfate Conductivity Organic_carbon Trihalomethanes
#> Min. :129.0 Min. :181.5 Min. : 2.20 Min. : 0.738
#> 1st Qu.:317.1 1st Qu.:365.7 1st Qu.:12.07 1st Qu.: 56.648
#> Median :334.6 Median :421.9 Median :14.22 Median : 66.396
#> Mean :333.8 Mean :426.2 Mean :14.28 Mean : 66.396
#> 3rd Qu.:350.4 3rd Qu.:481.8 3rd Qu.:16.56 3rd Qu.: 76.667
#> Max. :481.0 Max. :753.3 Max. :28.30 Max. :124.000
#> Turbidity Potability
#> Min. :1.450 0:1998
#> 1st Qu.:3.440 1:1278
#> Median :3.955
#> Mean :3.967
#> 3rd Qu.:4.500
#> Max. :6.739
Checking for outliers with histograms of our data frame
water_potability %>%
pivot_longer(cols = -Potability, names_to = "feature") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, aes(fill = feature)) +
facet_wrap(vars(feature, Potability), ncol = 4, scales = "free") +
scale_fill_brewer(palette = "Paired") +
theme(
legend.position = "none",
strip.background = element_rect(fill = "lightblue"),
strip.text = element_text(color = "black", face = "bold", size = 8)
) +
labs(
title = "Detect Outliers With Histogram",
subtitle = "Plot, Histogram",
caption = "Data source: Kaggle.com, Water Quality",
x = NULL,
y = NULL
)
Nothing here seems critical to me, and because the dataset is fairly large, I don't think the outliers will have much of an impact.
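Boxplots are another quick way to eyeball outliers. Below is a minimal sketch using the same tidyr/ggplot2 pattern as the histogram above (its output is not part of the original report):
# Boxplots per feature, split by potability, as a complementary outlier check
water_potability %>%
pivot_longer(cols = -Potability, names_to = "feature") %>%
ggplot(aes(x = Potability, y = value, fill = Potability)) +
geom_boxplot(outlier.colour = "red", outlier.size = 0.8) +
facet_wrap(vars(feature), ncol = 3, scales = "free_y") +
theme(legend.position = "none") +
labs(title = "Detect Outliers With Boxplots", x = NULL, y = NULL)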
Checking Correlation
# Remove 'Potability' column from the data frame
water_potability_without_potability <- water_potability[, -which(names(water_potability) == "Potability")]
# Calculate the correlation matrix
correlation_matrix <- cor(water_potability_without_potability, use = "pairwise.complete.obs")
# Melt the correlation matrix into long format for plotting
melted_correlation <- melt(correlation_matrix)
# Create a correlation plot
ggplot(melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "lightblue", mid = "gray", high = "darkblue", midpoint = 0, limits = c(-1, 1)) +
labs(title = "Correlation between Variables", x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the results of the analysis and visualization, there is clearly no strong correlation between any of the features.
Check class-imbalance
prop.table(table(water_potability$Potability))
#>
#> 0 1
#> 0.6098901 0.3901099
According to the analysis and visualization, there is no multicollinearity; the predictors are not highly associated, so let's keep them all. The class proportion is roughly 60:40, which is only mildly imbalanced, so I think it's fine to leave it as-is for now.
Modelling
Cross Validation
If we want to make predictions, we should avoid judging the model by its error on the data used to train it, because that only reflects the model's ability to fit past data, not unseen data.
The data used to train the model is called "train data," while the data used to test the model is called "test data."
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Index sampling
index <- sample(x = nrow(water_potability), size = nrow(water_potability) * 0.8)
# Splitting
water_potability_train <- water_potability[index, ]
water_potability_test <- water_potability[-index, ]
Logistic Regression
Logistic regression is the statistical modeling approach we employ when the outcome variable is binary: will the student pass or fail given the conditions, will it rain today, and so on. Even though the independent variables may be continuous or categorical, logistic regression lets us predict a binary outcome. In R, we can fit it with the glm() function on our target variable.
Modelling
model_lr <- glm(formula = Potability ~ .,
family = "binomial",
data = water_potability_train)
# stepwise
model_steplr <- step(model_lr, direction = 'both')
#> Start: AIC=3495
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
#> Conductivity + Organic_carbon + Trihalomethanes + Turbidity
#>
#> Df Deviance AIC
#> - Turbidity 1 3475.0 3493.0
#> - Conductivity 1 3475.1 3493.1
#> - Hardness 1 3475.5 3493.5
#> - Trihalomethanes 1 3475.5 3493.5
#> - ph 1 3475.6 3493.6
#> - Sulfate 1 3476.6 3494.6
#> <none> 3475.0 3495.0
#> - Solids 1 3477.3 3495.3
#> - Chloramines 1 3477.7 3495.7
#> - Organic_carbon 1 3482.2 3500.2
#>
#> Step: AIC=3493.02
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
#> Conductivity + Organic_carbon + Trihalomethanes
#>
#> Df Deviance AIC
#> - Conductivity 1 3475.1 3491.1
#> - Hardness 1 3475.5 3491.5
#> - Trihalomethanes 1 3475.5 3491.5
#> - ph 1 3475.7 3491.7
#> - Sulfate 1 3476.6 3492.6
#> <none> 3475.0 3493.0
#> - Solids 1 3477.3 3493.3
#> - Chloramines 1 3477.7 3493.7
#> + Turbidity 1 3475.0 3495.0
#> - Organic_carbon 1 3482.2 3498.2
#>
#> Step: AIC=3491.09
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
#> Organic_carbon + Trihalomethanes
#>
#> Df Deviance AIC
#> - Hardness 1 3475.6 3489.6
#> - Trihalomethanes 1 3475.6 3489.6
#> - ph 1 3475.7 3489.7
#> - Sulfate 1 3476.7 3490.7
#> <none> 3475.1 3491.1
#> - Solids 1 3477.3 3491.3
#> - Chloramines 1 3477.8 3491.8
#> + Conductivity 1 3475.0 3493.0
#> + Turbidity 1 3475.1 3493.1
#> - Organic_carbon 1 3482.3 3496.3
#>
#> Step: AIC=3489.59
#> Potability ~ ph + Solids + Chloramines + Sulfate + Organic_carbon +
#> Trihalomethanes
#>
#> Df Deviance AIC
#> - Trihalomethanes 1 3476.1 3488.1
#> - ph 1 3476.2 3488.2
#> - Sulfate 1 3477.0 3489.0
#> <none> 3475.6 3489.6
#> - Solids 1 3478.0 3490.0
#> - Chloramines 1 3478.4 3490.4
#> + Hardness 1 3475.1 3491.1
#> + Conductivity 1 3475.5 3491.5
#> + Turbidity 1 3475.6 3491.6
#> - Organic_carbon 1 3482.8 3494.8
#>
#> Step: AIC=3488.13
#> Potability ~ ph + Solids + Chloramines + Sulfate + Organic_carbon
#>
#> Df Deviance AIC
#> - ph 1 3476.7 3486.7
#> - Sulfate 1 3477.6 3487.6
#> <none> 3476.1 3488.1
#> - Solids 1 3478.4 3488.4
#> - Chloramines 1 3479.0 3489.0
#> + Trihalomethanes 1 3475.6 3489.6
#> + Hardness 1 3475.6 3489.6
#> + Conductivity 1 3476.1 3490.1
#> + Turbidity 1 3476.1 3490.1
#> - Organic_carbon 1 3483.5 3493.5
#>
#> Step: AIC=3486.7
#> Potability ~ Solids + Chloramines + Sulfate + Organic_carbon
#>
#> Df Deviance AIC
#> - Sulfate 1 3478.1 3486.1
#> <none> 3476.7 3486.7
#> - Solids 1 3478.9 3486.9
#> - Chloramines 1 3479.4 3487.4
#> + ph 1 3476.1 3488.1
#> + Trihalomethanes 1 3476.2 3488.2
#> + Hardness 1 3476.3 3488.3
#> + Conductivity 1 3476.7 3488.7
#> + Turbidity 1 3476.7 3488.7
#> - Organic_carbon 1 3483.9 3491.9
#>
#> Step: AIC=3486.13
#> Potability ~ Solids + Chloramines + Organic_carbon
#>
#> Df Deviance AIC
#> <none> 3478.1 3486.1
#> + Sulfate 1 3476.7 3486.7
#> - Chloramines 1 3480.8 3486.8
#> - Solids 1 3480.9 3486.9
#> + Trihalomethanes 1 3477.5 3487.5
#> + ph 1 3477.6 3487.6
#> + Hardness 1 3477.9 3487.9
#> + Conductivity 1 3478.1 3488.1
#> + Turbidity 1 3478.1 3488.1
#> - Organic_carbon 1 3485.5 3491.5
# Compare Value of AIC
model_lr$aic
#> [1] 3494.999
model_steplr$aic
#> [1] 3486.134
Because model_steplr has the lower AIC (lower is better, since AIC balances goodness of fit against model complexity), we will use model_steplr.
summary(model_steplr)
#>
#> Call:
#> glm(formula = Potability ~ Solids + Chloramines + Organic_carbon,
#> family = "binomial", data = water_potability_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.1897 -0.9938 -0.9331 1.3560 1.5730
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.470090403 0.278230522 -1.690 0.09111 .
#> Solids 0.000007648 0.000004597 1.664 0.09616 .
#> Chloramines 0.041921281 0.025629644 1.636 0.10191
#> Organic_carbon -0.032817924 0.012080176 -2.717 0.00659 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 3490.6 on 2619 degrees of freedom
#> Residual deviance: 3478.1 on 2616 degrees of freedom
#> AIC: 3486.1
#>
#> Number of Fisher Scoring iterations: 4
Model interpretation:
- The Organic_carbon coefficient is negative, so the greater the organic carbon value, the lower the odds that the water is potable.
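To make the coefficients easier to read, we can exponentiate them into odds ratios (a standard transformation for a binomial glm; this snippet is a sketch whose output is not shown in the original report):
# Convert log-odds coefficients into odds ratios
exp(coef(model_steplr))
# An odds ratio below 1 (e.g. Organic_carbon) lowers the odds of potability;
# one above 1 raises them.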
Prediction
Predict the potability probability for the water_potability_test data and store it in a new column named pred_RiskLs. Then classify the water_potability_test data based on pred_RiskLs and store the result in a new column named pred_Label.
water_potability_test$pred_RiskLs <- predict(object = model_lr,
newdata = water_potability_test,
type = "response")
water_potability_test$pred_Label <- ifelse(water_potability_test$pred_RiskLs > 0.5, 1, 0)
head(water_potability_test)
#> ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon
#> 1 7.080795 204.8905 20791.32 7.300212 368.5164 564.3087 10.37978
#> 3 8.099124 224.2363 19909.54 9.275884 334.5643 418.6062 16.86864
#> 16 6.347272 186.7329 41065.23 9.629596 364.4877 516.7433 11.53978
#> 18 9.181560 273.8138 24041.33 6.904990 398.3505 477.9746 13.38734
#> 36 5.115817 191.9527 19620.55 6.060713 323.8364 441.7484 10.96649
#> 42 5.331940 194.8741 16658.88 7.993830 316.6752 335.1204 10.18051
#> Trihalomethanes Turbidity Potability pred_RiskLs pred_Label
#> 1 86.99097 2.963135 0 0.4032852 0
#> 3 66.42009 3.055934 0 0.3803156 0
#> 16 75.07162 4.376348 0 0.4530251 0
#> 18 71.45736 4.503661 0 0.3670449 0
#> 36 49.23823 3.902089 0 0.3805967 0
#> 42 59.57271 4.434820 0 0.4132216 0
The factor levels determine which label the logistic regression model treats as the positive class ("1"):
- the first level, "0", is the base level;
- the second level, "1", is the modeled "success" class;
- predicted probabilities near 0 are classified as 0; and
- predicted probabilities near 1 are classified as 1.
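We can verify which level is which directly from the factor; for a binomial glm() with a factor response, R models the probability of the second level:
# The second level ("1") is the modeled class, so pred_RiskLs is the
# probability that a sample is potable
levels(water_potability_train$Potability)
#> [1] "0" "1"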
Model Evaluation
head(water_potability_test %>%
select(Potability, pred_Label))
#> Potability pred_Label
#> 1 0 0
#> 3 0 0
#> 16 0 0
#> 18 0 0
#> 36 0 0
#> 42 0 0
After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:
confusionMatrix(data = as.factor(water_potability_test$pred_Label),
reference = water_potability_test$Potability,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 385 269
#> 1 0 2
#>
#> Accuracy : 0.5899
#> 95% CI : (0.5512, 0.6279)
#> No Information Rate : 0.5869
#> P-Value [Acc > NIR] : 0.4536
#>
#> Kappa : 0.0087
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.007380
#> Specificity : 1.000000
#> Pos Pred Value : 1.000000
#> Neg Pred Value : 0.588685
#> Prevalence : 0.413110
#> Detection Rate : 0.003049
#> Detection Prevalence : 0.003049
#> Balanced Accuracy : 0.503690
#>
#> 'Positive' Class : 1
#>
- Accuracy = how often the model correctly predicts the target Y.
- Precision = of all positive predictions, the proportion the model gets right.
- Recall/Sensitivity = of all actually positive data, the proportion the model correctly identifies.
- Specificity = of all actually negative data, the proportion the model correctly identifies.
Accuracy <- (2+385)/nrow(water_potability_test)
Recall <- (2/(2 + 269))
Precision <- (2)/(2+0)
Specificity <- (385)/(385+0)
performanceLs <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performanceLs
#> Accuracy Recall Precision Specificity
#> 1 0.589939 0.007380074 1 1
Model Tuning
After further analysis and checking, it turns out that our model needs tuning to improve its recall. We want to increase recall/sensitivity because we need the model to correctly identify potable water. The first step is to upsample the minority class in the training data.
water_potability_train <- upSample(
x = water_potability_train %>% select(-Potability),
y = water_potability_train$Potability,
yname = "Potability"
)
prop.table(table(water_potability_train$Potability))
#>
#> 0 1
#> 0.5 0.5
The code below creates a logistic regression model in R using the glm() function. The model predicts the probability of water potability from the input variables. The step() function then performs stepwise variable selection, automatically choosing the subset of predictors that minimizes the AIC. The resulting model is stored in a variable called model_steplr.
model_lr <- glm(formula = Potability ~ .,
family = "binomial",
data = water_potability_train)
# stepwise
model_steplr <- step(model_lr, direction = 'both')
#> Start: AIC=4472.37
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
#> Conductivity + Organic_carbon + Trihalomethanes + Turbidity
#>
#> Df Deviance AIC
#> - ph 1 4452.4 4470.4
#> - Conductivity 1 4452.4 4470.4
#> - Turbidity 1 4452.8 4470.8
#> - Hardness 1 4453.2 4471.2
#> - Trihalomethanes 1 4453.2 4471.2
#> <none> 4452.4 4472.4
#> - Chloramines 1 4455.2 4473.2
#> - Solids 1 4455.2 4473.2
#> - Sulfate 1 4455.9 4473.9
#> - Organic_carbon 1 4459.5 4477.5
#>
#> Step: AIC=4470.37
#> Potability ~ Hardness + Solids + Chloramines + Sulfate + Conductivity +
#> Organic_carbon + Trihalomethanes + Turbidity
#>
#> Df Deviance AIC
#> - Conductivity 1 4452.4 4468.4
#> - Turbidity 1 4452.8 4468.8
#> - Hardness 1 4453.2 4469.2
#> - Trihalomethanes 1 4453.2 4469.2
#> <none> 4452.4 4470.4
#> - Chloramines 1 4455.2 4471.2
#> - Solids 1 4455.2 4471.2
#> - Sulfate 1 4455.9 4471.9
#> + ph 1 4452.4 4472.4
#> - Organic_carbon 1 4459.5 4475.5
#>
#> Step: AIC=4468.37
#> Potability ~ Hardness + Solids + Chloramines + Sulfate + Organic_carbon +
#> Trihalomethanes + Turbidity
#>
#> Df Deviance AIC
#> - Turbidity 1 4452.8 4466.8
#> - Hardness 1 4453.2 4467.2
#> - Trihalomethanes 1 4453.2 4467.2
#> <none> 4452.4 4468.4
#> - Chloramines 1 4455.2 4469.2
#> - Solids 1 4455.2 4469.2
#> - Sulfate 1 4455.9 4469.9
#> + Conductivity 1 4452.4 4470.4
#> + ph 1 4452.4 4470.4
#> - Organic_carbon 1 4459.5 4473.5
#>
#> Step: AIC=4466.84
#> Potability ~ Hardness + Solids + Chloramines + Sulfate + Organic_carbon +
#> Trihalomethanes
#>
#> Df Deviance AIC
#> - Hardness 1 4453.6 4465.6
#> - Trihalomethanes 1 4453.7 4465.7
#> <none> 4452.8 4466.8
#> - Solids 1 4455.7 4467.7
#> - Chloramines 1 4455.7 4467.7
#> - Sulfate 1 4456.4 4468.4
#> + Turbidity 1 4452.4 4468.4
#> + Conductivity 1 4452.8 4468.8
#> + ph 1 4452.8 4468.8
#> - Organic_carbon 1 4460.0 4472.0
#>
#> Step: AIC=4465.61
#> Potability ~ Solids + Chloramines + Sulfate + Organic_carbon +
#> Trihalomethanes
#>
#> Df Deviance AIC
#> - Trihalomethanes 1 4454.5 4464.5
#> <none> 4453.6 4465.6
#> - Solids 1 4456.6 4466.6
#> - Chloramines 1 4456.7 4466.7
#> - Sulfate 1 4456.8 4466.8
#> + Hardness 1 4452.8 4466.8
#> + Turbidity 1 4453.2 4467.2
#> + ph 1 4453.6 4467.6
#> + Conductivity 1 4453.6 4467.6
#> - Organic_carbon 1 4460.8 4470.8
#>
#> Step: AIC=4464.46
#> Potability ~ Solids + Chloramines + Sulfate + Organic_carbon
#>
#> Df Deviance AIC
#> <none> 4454.5 4464.5
#> - Solids 1 4457.5 4465.5
#> - Chloramines 1 4457.6 4465.6
#> + Trihalomethanes 1 4453.6 4465.6
#> + Hardness 1 4453.7 4465.7
#> - Sulfate 1 4457.7 4465.7
#> + Turbidity 1 4454.0 4466.0
#> + ph 1 4454.5 4466.5
#> + Conductivity 1 4454.5 4466.5
#> - Organic_carbon 1 4461.8 4469.8
Test dataset
After that, we double-check our logistic regression model on the training data and then apply it to the test dataset:
water_potability_train$pred_RiskLs <- predict(model_steplr,
water_potability_train,
type = "response")
water_potability_train$pred_Label <- ifelse(water_potability_train$pred_RiskLs > 0.5, 1, 0)
confusionMatrix(data = as.factor(water_potability_train$pred_Label),
reference = water_potability_train$Potability,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 876 772
#> 1 737 841
#>
#> Accuracy : 0.5322
#> 95% CI : (0.5148, 0.5496)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.0001334
#>
#> Kappa : 0.0645
#>
#> Mcnemar's Test P-Value : 0.3814355
#>
#> Sensitivity : 0.5214
#> Specificity : 0.5431
#> Pos Pred Value : 0.5330
#> Neg Pred Value : 0.5316
#> Prevalence : 0.5000
#> Detection Rate : 0.2607
#> Detection Prevalence : 0.4892
#> Balanced Accuracy : 0.5322
#>
#> 'Positive' Class : 1
#>
water_potability_test$pred_RiskLs <- predict(object = model_lr,
newdata = water_potability_test,
type = "response")
water_potability_test$pred_Label <- ifelse(water_potability_test$pred_RiskLs > 0.5, 1, 0)
head(water_potability_test)
#> ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon
#> 1 7.080795 204.8905 20791.32 7.300212 368.5164 564.3087 10.37978
#> 3 8.099124 224.2363 19909.54 9.275884 334.5643 418.6062 16.86864
#> 16 6.347272 186.7329 41065.23 9.629596 364.4877 516.7433 11.53978
#> 18 9.181560 273.8138 24041.33 6.904990 398.3505 477.9746 13.38734
#> 36 5.115817 191.9527 19620.55 6.060713 323.8364 441.7484 10.96649
#> 42 5.331940 194.8741 16658.88 7.993830 316.6752 335.1204 10.18051
#> Trihalomethanes Turbidity Potability pred_RiskLs pred_Label
#> 1 86.99097 2.963135 0 0.5262277 1
#> 3 66.42009 3.055934 0 0.4975655 0
#> 16 75.07162 4.376348 0 0.5637733 1
#> 18 71.45736 4.503661 0 0.4572410 0
#> 36 49.23823 3.902089 0 0.5068487 1
#> 42 59.57271 4.434820 0 0.5297354 1
After refitting the logistic regression, it's time for us to evaluate the model further:
confusionMatrix(data = as.factor(water_potability_test$pred_Label),
reference = water_potability_test$Potability,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 182 142
#> 1 203 129
#>
#> Accuracy : 0.4741
#> 95% CI : (0.4353, 0.5131)
#> No Information Rate : 0.5869
#> P-Value [Acc > NIR] : 1.000000
#>
#> Kappa : -0.0496
#>
#> Mcnemar's Test P-Value : 0.001237
#>
#> Sensitivity : 0.4760
#> Specificity : 0.4727
#> Pos Pred Value : 0.3886
#> Neg Pred Value : 0.5617
#> Prevalence : 0.4131
#> Detection Rate : 0.1966
#> Detection Prevalence : 0.5061
#> Balanced Accuracy : 0.4744
#>
#> 'Positive' Class : 1
#>
Model Score Evaluation After Upsampling
Accuracy_tuning <- (129+182)/nrow(water_potability_test)
Recall_tuning <- 129/(129+142)
Precision_tuning <- (129)/(129+203)
Specificity_tuning <- (182)/(182+203)
performanceLs_Tuning <- cbind.data.frame(Accuracy_tuning, Recall_tuning, Precision_tuning, Specificity_tuning)
performanceLs_Tuning
#> Accuracy_tuning Recall_tuning Precision_tuning Specificity_tuning
#> 1 0.4740854 0.4760148 0.3885542 0.4727273
Recall is still low, at roughly 48% correct guesses, so our next step is to focus on increasing the recall value by lowering the classification threshold.
# increase recall by lowering the classification threshold
water_potability_test$pred_Label_new <- ifelse(test = water_potability_test$pred_RiskLs > 0.45,
yes = 1,
no = 0)
confusionMatrix(data = as.factor(water_potability_test$pred_Label_new),
reference = water_potability_test$Potability,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 27 28
#> 1 358 243
#>
#> Accuracy : 0.4116
#> 95% CI : (0.3736, 0.4503)
#> No Information Rate : 0.5869
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : -0.0281
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.89668
#> Specificity : 0.07013
#> Pos Pred Value : 0.40433
#> Neg Pred Value : 0.49091
#> Prevalence : 0.41311
#> Detection Rate : 0.37043
#> Detection Prevalence : 0.91616
#> Balanced Accuracy : 0.48340
#>
#> 'Positive' Class : 1
#>
Model Evaluation After Tuning
Accuracy_tuning_Recall <- (243+27)/nrow(water_potability_test)
Recall_tuning_Recall <- 243/(243+28)
Precision_tuning_Recall <- (243)/(243+358)
Specificity_tuning_Recall <- (27)/(27+358)
performanceLs_Tuning_Recall <- cbind.data.frame(Accuracy_tuning_Recall, Recall_tuning_Recall, Precision_tuning_Recall, Specificity_tuning_Recall)
performanceLs_Tuning_Recall
#> Accuracy_tuning_Recall Recall_tuning_Recall Precision_tuning_Recall
#> 1 0.4115854 0.896679 0.4043261
#> Specificity_tuning_Recall
#> 1 0.07012987
- According to the confusion matrix, the model's accuracy in predicting the potability target (both potable and non-potable) is 41%.
- The model correctly identified about 90% of the actually potable water samples (recall).
- Of all the samples the model predicted as potable, about 40% were correct (precision).
- The model correctly identified only 7% of the actually non-potable water samples (specificity).
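Rather than picking the 0.45 cutoff by hand, we could sweep a grid of thresholds and inspect the recall/specificity trade-off directly. A minimal base-R sketch (not part of the original analysis; output not shown):
# Tabulate recall and specificity across candidate cutoffs
cutoffs <- seq(0.30, 0.70, by = 0.05)
sweep_metrics <- sapply(cutoffs, function(cut) {
pred <- ifelse(water_potability_test$pred_RiskLs > cut, 1, 0)
actual <- water_potability_test$Potability
c(recall = sum(pred == 1 & actual == 1) / sum(actual == 1),
specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
})
colnames(sweep_metrics) <- cutoffs
round(sweep_metrics, 3)
This makes the cost in specificity of each extra point of recall visible before committing to a cutoff.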
K-NN Model
K-nearest neighbor classifies new data by comparing the characteristics of the new data (test data) with existing data (train data). Similarity is measured by Euclidean distance to find the k data points (neighbors) closest to the new observation; the majority class among these neighbors becomes the class of the new data (majority voting).
Because distances are scale-sensitive, we must first scale the features using z-score standardization. The test data must be scaled using parameters (means and standard deviations) from the train data, because the test data is treated as unseen.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Index sampling
index <- sample(x = nrow(water_potability), size = nrow(water_potability) * 0.8)
# Splitting
water_potability_train <- water_potability[index, ]
water_potability_test <- water_potability[-index, ]
# Predictor
water_potability_train_x <- water_potability_train %>% select_if(is.numeric)
water_potability_test_x <- water_potability_test %>% select_if(is.numeric)
# Target
water_potability_train_y <- water_potability_train[,"Potability"]
water_potability_test_y <- water_potability_test[,"Potability"]
# Scaling data
# Train
scaled_train_x <- as.data.frame(scale(water_potability_train_x))
# Test
scaled_test_x <- as.data.frame(scale(water_potability_test_x,
center = colMeans(water_potability_train_x),
scale = apply(water_potability_train_x, 2, sd)))
Our next task is to pick a value of k and apply the model to the test data.
# Find optimum k
sqrt(nrow(scaled_train_x))
#> [1] 51.18594
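The square-root-of-n rule used above is only a heuristic. As an alternative sketch (my own suggestion, not something done in the original analysis), caret can tune k by cross-validation; the rest of the walkthrough stays with k = 51:
# Hypothetical alternative: tune k with 5-fold cross-validation
set.seed(100)
knn_cv <- train(Potability ~ ., data = water_potability_train,
method = "knn",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 5),
tuneGrid = data.frame(k = seq(3, 61, by = 2)))
knn_cv$bestTune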
# Perform k-NN classification
water_potabilityKNN_pred <- knn(train = scaled_train_x,
test = scaled_test_x,
cl = water_potability_train_y,
k = 51)
# Convert the predicted values to factors
water_potabilityKNN_pred <- as.factor(water_potabilityKNN_pred)
# Set the levels of the test target variable
levels(water_potability_test_y) <- levels(water_potability_train_y)
# Create the confusion matrix
confusion_matrix <- confusionMatrix(data = water_potabilityKNN_pred,
reference = water_potability_test_y,
positive = "1")
# Print the confusion matrix
print(confusion_matrix)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 375 241
#> 1 10 30
#>
#> Accuracy : 0.6174
#> 95% CI : (0.579, 0.6547)
#> No Information Rate : 0.5869
#> P-Value [Acc > NIR] : 0.0606
#>
#> Kappa : 0.097
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.11070
#> Specificity : 0.97403
#> Pos Pred Value : 0.75000
#> Neg Pred Value : 0.60877
#> Prevalence : 0.41311
#> Detection Rate : 0.04573
#> Detection Prevalence : 0.06098
#> Balanced Accuracy : 0.54236
#>
#> 'Positive' Class : 1
#>
The code above conducts k-nearest neighbors (k-NN) classification to predict water potability. It uses the training dataset (scaled_train_x) with corresponding class labels (water_potability_train_y) to train the model, and makes predictions on the test dataset (scaled_test_x). The code then converts the predicted values to factors and ensures the levels are consistent between the test and training targets. A confusion matrix is created to evaluate the performance of the k-NN model, and the results, including accuracy, sensitivity, specificity, and related statistics, are printed.
Our next task is to compute the evaluation metrics for the k-nearest neighbor model:
Accuracy_knn <- (30+375)/nrow(water_potability_test)
recall_knn <- 30/(30+241)
Precision_knn <- (30)/(30+10)
Specificity_knn <- (375)/(375+10)
performanceLs_knn <- cbind.data.frame(Accuracy_knn, recall_knn, Precision_knn, Specificity_knn)
performanceLs_knn
#> Accuracy_knn recall_knn Precision_knn Specificity_knn
#> 1 0.617378 0.1107011 0.75 0.974026
- According to the confusion matrix, the model's accuracy in predicting the potability target (both potable and non-potable) is 61%.
- The model correctly identified only 11% of the actually potable water samples (recall).
- Of all the samples the model predicted as potable, 75% were correct (precision).
- The model correctly identified 97% of the actually non-potable water samples (specificity).
Evaluation
After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:
Contents of the Confusion Matrix:
- True Positive (TP): predicted positive and correct (positive prediction; actual positive)
- True Negative (TN): predicted negative and correct (negative prediction; actual negative)
- False Positive (FP): predicted positive but wrong (positive prediction; actual negative)
- False Negative (FN): predicted negative but wrong (negative prediction; actual positive)
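These four counts translate directly into all the metrics used in this post. As a quick reference, here is a small helper (a sketch, not part of the original code):
# Compute the standard metrics from the four confusion-matrix counts
metrics_from_counts <- function(TP, TN, FP, FN) {
data.frame(Accuracy = (TP + TN) / (TP + TN + FP + FN),
Recall = TP / (TP + FN),
Precision = TP / (TP + FP),
Specificity = TN / (TN + FP))
}
# Example: the tuned logistic regression confusion matrix above
metrics_from_counts(TP = 243, TN = 27, FP = 358, FN = 28)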
But because we already did that in each modeling section, here we just compare the models to each other to see which one is best.
# Create dataframes
performanceLs_Tuning_Recall <- data.frame(
Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
Tuning_Recall = c(0.4115854, 0.896679, 0.4043261, 0.07012987)
)
performanceLs_knn <- data.frame(
Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
knn = c(0.617378, 0.1107011, 0.75, 0.974026)
)
# Merge dataframes into a single table
combined_table <- merge(performanceLs_Tuning_Recall, performanceLs_knn, by = "Metric", all = TRUE)
# Set row names as the metric names
row.names(combined_table) <- combined_table$Metric
combined_table$Metric <- NULL
# Print the table
combined_table
#> Tuning_Recall knn
#> Accuracy 0.41158540 0.6173780
#> Precision 0.40432610 0.7500000
#> Recall 0.89667900 0.1107011
#> Specificity 0.07012987 0.9740260
# Create dataframes
performanceLs_Tuning_Recall <- data.frame(
Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
Value = c(0.4115854, 0.896679, 0.4043261, 0.07012987),
Model = "Logistic Regression(After Tunning)"
)
performanceLs_knn <- data.frame(
Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
Value = c(0.617378, 0.1107011, 0.75, 0.974026),
Model = "K-nearest neighbor"
)
# Combine dataframes
combined_df <- rbind(performanceLs_Tuning_Recall, performanceLs_knn)
# Create stacked line plot with tooltips
p <- ggplot(combined_df, aes(x = Metric, y = Value, group = Model, color = Model)) +
geom_line(stat = "identity") +
geom_point(size = 3) +
labs(x = "Metric", y = "Value", color = "Model") +
ggtitle("Comparison of Performance Metrics") +
theme_minimal()
# Convert plot to interactive plotly object
p <- ggplotly(p, tooltip = c("Model", "Metric", "Value"), width = 600, height = 400)
p
When comparing the two approaches, Logistic Regression (After Tuning) achieves a recall of roughly 89.7%, far higher than K-NN's 11.1%. That recall is what improves the model's capacity to correctly identify, out of the real data, the water that is actually potable.
Conclusion
Both models have their own advantages and disadvantages: the K-nearest neighbor model has better accuracy, much better precision, and much better specificity. However, the logistic regression model has by far the best recall, at almost 90%. For the current case, we are more concerned with recall, because we need the model to catch as many of the actually potable samples as possible. If, instead, we mainly wanted confidence that water predicted as potable really is potable, we would prioritize the model with the higher precision. Given our focus on recall, we are better off choosing the logistic regression model (after tuning).