Predicting Water Potability With Classification: Logistic Regression and K-NN

Introduction

Water is an essential resource for sustaining life on our planet. We rely on it for various purposes, including hydration, sanitation, and agricultural needs. However, not all water sources are safe for consumption. The quality of drinking water is a matter of utmost importance as it directly affects our health and well-being. The distinction between potable (safe for drinking) and non-potable water is crucial for ensuring the safety of individuals and communities.

Dataset Overview

The water_potability.csv file contains water quality metrics for 3,276 different water bodies.

  • ph: Represents the pH level of water (double)
  • Hardness: Indicates the hardness of water (double)
  • Solids: Represents the total dissolved solids in water (double)
  • Chloramines: Indicates the concentration of chloramines in water (double)
  • Sulfate: Represents the concentration of sulfate in water (double)
  • Conductivity: Indicates the electrical conductivity of water (double)
  • Organic_carbon: Represents the amount of organic carbon in water (double)
  • Trihalomethanes: Indicates the concentration of trihalomethanes in water (double)
  • Turbidity: Represents the turbidity level of water (double)
  • Potability: Indicates whether the water is potable (1) or not (0) (integer)

Objectives

We can define objectives for predicting whether water is safe for human consumption (potable) or not. Here are some possible objectives:

  • Objective 1: Develop a classification model to predict water potability using the available features.
  • Objective 2: Evaluate the performance of different classification algorithms in predicting water potability.
  • Objective 3: Identify the most influential features in determining water potability.
  • Objective 4: Optimize the classification model to improve the accuracy of predicting potable and non-potable water.

Data Preparation

Import Packages & Dataset

Import Packages:

library(dplyr)       # Package for data manipulation
library(gtools)      # Utility functions for data manipulation and statistical analysis
library(ggplot2)     # Package for data visualization using the grammar of graphics
library(class)       # Package for classification, including K-nearest neighbors (KNN) classification
library(tidyr)       # Package for reshaping and tidying data
library(reshape2)    # Package for data reshaping and restructuring
library(caret)       # Comprehensive toolkit for machine learning
library(plotly)      # Interactive plotting library
library(knitr)       # Package for dynamic report generation

Import Dataset

water_potability <-read.csv("data_input/water_potability.csv", stringsAsFactors = T)
rmarkdown::paged_table(water_potability)

Data Wrangling

glimpse(water_potability)
#> Rows: 3,276
#> Columns: 10
#> $ ph              <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, …
#> $ Hardness        <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.…
#> $ Solids          <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874…
#> $ Chloramines     <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54…
#> $ Sulfate         <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66…
#> $ Conductivity    <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.…
#> $ Organic_carbon  <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,…
#> $ Trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.…
#> $ Turbidity       <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55…
#> $ Potability      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Based on the glimpse above, Potability is stored as an integer but is really a categorical target, so its data type should be changed to factor.

In this business case we focus on building a good model that predicts Potability using all of the available predictors.

Target: Potability. Predictors: all variables besides Potability.

water_potability <- water_potability %>%
  mutate_at(c("Potability"), as.factor)

Finding NA Values

colSums(is.na(water_potability))
#>              ph        Hardness          Solids     Chloramines         Sulfate 
#>             491               0               0               0             781 
#>    Conductivity  Organic_carbon Trihalomethanes       Turbidity      Potability 
#>               0               0             162               0               0

We can see that there are missing values in the columns ph (491), Sulfate (781), and Trihalomethanes (162). Given that there are only 3,276 rows in the dataset, we cannot afford to discard these data points. Instead, we can replace them with the means of the corresponding columns.

So I will perform mean imputation: the overall column mean for ph and Trihalomethanes, and Potability-group means for Sulfate.

# Replace missing values in 'ph' and 'Trihalomethanes' with the mean of the entire column
water_potability$ph[is.na(water_potability$ph)] <- mean(water_potability$ph, na.rm = TRUE)
water_potability$Trihalomethanes[is.na(water_potability$Trihalomethanes)] <- mean(water_potability$Trihalomethanes, na.rm = TRUE)

# Calculate mean values of 'Sulfate' based on potability
mean_sulfate_potable <- mean(water_potability$Sulfate[water_potability$Potability == 1], na.rm = TRUE)
mean_sulfate_non_potable <- mean(water_potability$Sulfate[water_potability$Potability == 0], na.rm = TRUE)

# Replace missing values in 'Sulfate' based on potability
water_potability$Sulfate[is.na(water_potability$Sulfate) & water_potability$Potability == 1] <- mean_sulfate_potable
water_potability$Sulfate[is.na(water_potability$Sulfate) & water_potability$Potability == 0] <- mean_sulfate_non_potable
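
Median imputation is a reasonable alternative when a column is skewed, since the median is less sensitive to extreme values. A minimal sketch of the same idea, shown for comparison only (the analysis below keeps the mean-imputed values above):

# Alternative (not run): impute 'ph' with the column median instead of the mean
ph_median <- median(water_potability$ph, na.rm = TRUE)
water_potability$ph[is.na(water_potability$ph)] <- ph_median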

Exploratory Data Analysis

Check data distributions/patterns

Checking the Range of our Dataframe

# explore with summary
summary(water_potability)
#>        ph            Hardness          Solids         Chloramines    
#>  Min.   : 0.000   Min.   : 47.43   Min.   :  320.9   Min.   : 0.352  
#>  1st Qu.: 6.278   1st Qu.:176.85   1st Qu.:15666.7   1st Qu.: 6.127  
#>  Median : 7.081   Median :196.97   Median :20927.8   Median : 7.130  
#>  Mean   : 7.081   Mean   :196.37   Mean   :22014.1   Mean   : 7.122  
#>  3rd Qu.: 7.870   3rd Qu.:216.67   3rd Qu.:27332.8   3rd Qu.: 8.115  
#>  Max.   :14.000   Max.   :323.12   Max.   :61227.2   Max.   :13.127  
#>     Sulfate       Conductivity   Organic_carbon  Trihalomethanes  
#>  Min.   :129.0   Min.   :181.5   Min.   : 2.20   Min.   :  0.738  
#>  1st Qu.:317.1   1st Qu.:365.7   1st Qu.:12.07   1st Qu.: 56.648  
#>  Median :334.6   Median :421.9   Median :14.22   Median : 66.396  
#>  Mean   :333.8   Mean   :426.2   Mean   :14.28   Mean   : 66.396  
#>  3rd Qu.:350.4   3rd Qu.:481.8   3rd Qu.:16.56   3rd Qu.: 76.667  
#>  Max.   :481.0   Max.   :753.3   Max.   :28.30   Max.   :124.000  
#>    Turbidity     Potability
#>  Min.   :1.450   0:1998    
#>  1st Qu.:3.440   1:1278    
#>  Median :3.955             
#>  Mean   :3.967             
#>  3rd Qu.:4.500             
#>  Max.   :6.739

Checking for Outliers with Histograms of our Dataframe

water_potability %>%
  pivot_longer(cols = -Potability, names_to = "feature") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, aes(fill = feature)) +
  facet_wrap(vars(feature, Potability), ncol = 4, scales = "free") +
  scale_fill_brewer(palette = "Paired") +
  theme(
    legend.position = "none",
    strip.background = element_rect(fill = "lightblue"),
    strip.text = element_text(color = "black", face = "bold", size = 8)
  ) +
  labs(
    title = "Detect Outliers With Histogram",
    subtitle = "Plot, Histogram",
    caption = "Data source: Kaggle.com, Water Quality",
    x = NULL,
    y = NULL
  )

Nothing here looks alarming, and because the dataset is fairly large, I don't think the outliers will have much of an impact.
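
For a second look at outliers, boxplots make extreme points easier to spot than histograms. A quick sketch reusing the same long-format reshaping as the plot above:

# Optional: boxplots of each feature, split by Potability, to eyeball outliers
water_potability %>%
  pivot_longer(cols = -Potability, names_to = "feature") %>%
  ggplot(aes(x = Potability, y = value, fill = Potability)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y") +
  theme(legend.position = "none") +
  labs(title = "Detect Outliers With Boxplots", x = NULL, y = NULL)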

Checking Correlation

# Remove 'Potability' column from the data frame
water_potability_without_potability <- water_potability[, -which(names(water_potability) == "Potability")]

# Calculate the correlation matrix
correlation_matrix <- cor(water_potability_without_potability, use = "pairwise.complete.obs")

# Melt the correlation matrix into long format for plotting
melted_correlation <- melt(correlation_matrix)

# Create a correlation plot
ggplot(melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "lightblue", mid = "gray", high = "darkblue", midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation between Variables", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the correlation plot, there is clearly no strong correlation between any of the features.

Check class-imbalance

prop.table(table(water_potability$Potability))
#> 
#>         0         1 
#> 0.6098901 0.3901099

As noted above, there is no multicollinearity and the predictors are not highly associated, so we keep them all. The class proportion is roughly 61% non-potable to 39% potable, a mild imbalance that I think is acceptable to leave as-is for now.


Modelling

Cross Validation

If we want to judge how well a model will forecast, we should not rely on the error measured on the data used to train it, because that only shows the model's ability to fit data it has already seen, not new data.

The data used to train the model is called “train data,” while the data used to test the model is called “test data.”

RNGkind(sample.kind = "Rounding")
set.seed(100)

# Index sampling
index <- sample(x = nrow(water_potability), size = nrow(water_potability) * 0.8)

# Splitting
water_potability_train <- water_potability[index, ]
water_potability_test <- water_potability[-index, ]
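
As a side note, caret (loaded above) offers a stratified alternative that preserves the 0/1 proportions in both subsets. A minimal sketch, shown for comparison only; the analysis below keeps the simple random split above:

# Alternative (not run): stratified 80/20 split that keeps the class proportions
set.seed(100)
idx_strat <- createDataPartition(water_potability$Potability, p = 0.8, list = FALSE)
strat_train <- water_potability[idx_strat, ]
strat_test  <- water_potability[-idx_strat, ]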

Logistic Regression

Logistic regression is the statistical modelling approach we employ when the outcome variable is binary, for example: will a student pass or fail under given conditions? Will it rain or not? The independent variables may be continuous or categorical, and the model predicts the probability of the outcome. In R we can fit it with the glm() function, using Potability as the target variable.
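
As a quick illustration of what glm() with family = "binomial" works with: the model is fitted on the log-odds scale, and plogis() maps log-odds back to probabilities (a small base-R sketch, independent of our data):

# Log-odds of 0 correspond to a probability of 0.5; large positive log-odds
# approach 1, large negative log-odds approach 0
plogis(0)                  # 0.5
plogis(c(-4, -1, 1, 4))    # approx. 0.018, 0.269, 0.731, 0.982
qlogis(0.5)                # 0, the inverse (logit) transformation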

Modelling

model_lr <- glm(formula = Potability ~ ., 
             family = "binomial", 
             data = water_potability_train)
# stepwise
model_steplr <- step(model_lr, direction = 'both')
#> Start:  AIC=3495
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + 
#>     Conductivity + Organic_carbon + Trihalomethanes + Turbidity
#> 
#>                   Df Deviance    AIC
#> - Turbidity        1   3475.0 3493.0
#> - Conductivity     1   3475.1 3493.1
#> - Hardness         1   3475.5 3493.5
#> - Trihalomethanes  1   3475.5 3493.5
#> - ph               1   3475.6 3493.6
#> - Sulfate          1   3476.6 3494.6
#> <none>                 3475.0 3495.0
#> - Solids           1   3477.3 3495.3
#> - Chloramines      1   3477.7 3495.7
#> - Organic_carbon   1   3482.2 3500.2
#> 
#> Step:  AIC=3493.02
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + 
#>     Conductivity + Organic_carbon + Trihalomethanes
#> 
#>                   Df Deviance    AIC
#> - Conductivity     1   3475.1 3491.1
#> - Hardness         1   3475.5 3491.5
#> - Trihalomethanes  1   3475.5 3491.5
#> - ph               1   3475.7 3491.7
#> - Sulfate          1   3476.6 3492.6
#> <none>                 3475.0 3493.0
#> - Solids           1   3477.3 3493.3
#> - Chloramines      1   3477.7 3493.7
#> + Turbidity        1   3475.0 3495.0
#> - Organic_carbon   1   3482.2 3498.2
#> 
#> Step:  AIC=3491.09
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + 
#>     Organic_carbon + Trihalomethanes
#> 
#>                   Df Deviance    AIC
#> - Hardness         1   3475.6 3489.6
#> - Trihalomethanes  1   3475.6 3489.6
#> - ph               1   3475.7 3489.7
#> - Sulfate          1   3476.7 3490.7
#> <none>                 3475.1 3491.1
#> - Solids           1   3477.3 3491.3
#> - Chloramines      1   3477.8 3491.8
#> + Conductivity     1   3475.0 3493.0
#> + Turbidity        1   3475.1 3493.1
#> - Organic_carbon   1   3482.3 3496.3
#> 
#> Step:  AIC=3489.59
#> Potability ~ ph + Solids + Chloramines + Sulfate + Organic_carbon + 
#>     Trihalomethanes
#> 
#>                   Df Deviance    AIC
#> - Trihalomethanes  1   3476.1 3488.1
#> - ph               1   3476.2 3488.2
#> - Sulfate          1   3477.0 3489.0
#> <none>                 3475.6 3489.6
#> - Solids           1   3478.0 3490.0
#> - Chloramines      1   3478.4 3490.4
#> + Hardness         1   3475.1 3491.1
#> + Conductivity     1   3475.5 3491.5
#> + Turbidity        1   3475.6 3491.6
#> - Organic_carbon   1   3482.8 3494.8
#> 
#> Step:  AIC=3488.13
#> Potability ~ ph + Solids + Chloramines + Sulfate + Organic_carbon
#> 
#>                   Df Deviance    AIC
#> - ph               1   3476.7 3486.7
#> - Sulfate          1   3477.6 3487.6
#> <none>                 3476.1 3488.1
#> - Solids           1   3478.4 3488.4
#> - Chloramines      1   3479.0 3489.0
#> + Trihalomethanes  1   3475.6 3489.6
#> + Hardness         1   3475.6 3489.6
#> + Conductivity     1   3476.1 3490.1
#> + Turbidity        1   3476.1 3490.1
#> - Organic_carbon   1   3483.5 3493.5
#> 
#> Step:  AIC=3486.7
#> Potability ~ Solids + Chloramines + Sulfate + Organic_carbon
#> 
#>                   Df Deviance    AIC
#> - Sulfate          1   3478.1 3486.1
#> <none>                 3476.7 3486.7
#> - Solids           1   3478.9 3486.9
#> - Chloramines      1   3479.4 3487.4
#> + ph               1   3476.1 3488.1
#> + Trihalomethanes  1   3476.2 3488.2
#> + Hardness         1   3476.3 3488.3
#> + Conductivity     1   3476.7 3488.7
#> + Turbidity        1   3476.7 3488.7
#> - Organic_carbon   1   3483.9 3491.9
#> 
#> Step:  AIC=3486.13
#> Potability ~ Solids + Chloramines + Organic_carbon
#> 
#>                   Df Deviance    AIC
#> <none>                 3478.1 3486.1
#> + Sulfate          1   3476.7 3486.7
#> - Chloramines      1   3480.8 3486.8
#> - Solids           1   3480.9 3486.9
#> + Trihalomethanes  1   3477.5 3487.5
#> + ph               1   3477.6 3487.6
#> + Hardness         1   3477.9 3487.9
#> + Conductivity     1   3478.1 3488.1
#> + Turbidity        1   3478.1 3488.1
#> - Organic_carbon   1   3485.5 3491.5
# Compare Value of AIC
model_lr$aic
#> [1] 3494.999
model_steplr$aic
#> [1] 3486.134

Because model_lr has the higher AIC value, we will use model_steplr, which achieves the lower (better) AIC.

summary(model_steplr)
#> 
#> Call:
#> glm(formula = Potability ~ Solids + Chloramines + Organic_carbon, 
#>     family = "binomial", data = water_potability_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.1897  -0.9938  -0.9331   1.3560   1.5730  
#> 
#> Coefficients:
#>                    Estimate   Std. Error z value Pr(>|z|)   
#> (Intercept)    -0.470090403  0.278230522  -1.690  0.09111 . 
#> Solids          0.000007648  0.000004597   1.664  0.09616 . 
#> Chloramines     0.041921281  0.025629644   1.636  0.10191   
#> Organic_carbon -0.032817924  0.012080176  -2.717  0.00659 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 3490.6  on 2619  degrees of freedom
#> Residual deviance: 3478.1  on 2616  degrees of freedom
#> AIC: 3486.1
#> 
#> Number of Fisher Scoring iterations: 4

Model interpretation: the coefficient of Organic_carbon is negative, so the greater the organic carbon value, the lower the odds that the water is potable.
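
To make the coefficients easier to read, we can exponentiate them into odds ratios; a value below 1 (as for Organic_carbon here) means the odds of potability decrease as that predictor increases. A short sketch:

# Odds ratios for the stepwise model: exp(coefficient) < 1 means the odds of
# potability go down as that predictor increases
exp(coef(model_steplr))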

Prediction

Predict the potability probability for the water_potability_test data and store it in a new column named pred_RiskLs. Then classify the water_potability_test data by pred_RiskLs and store the result in a new column named pred_Label.

water_potability_test$pred_RiskLs <- predict(object = model_lr,
                                newdata = water_potability_test,
                                type = "response")

water_potability_test$pred_Label <- ifelse(water_potability_test$pred_RiskLs > 0.5, 1, 0)
head(water_potability_test)
#>          ph Hardness   Solids Chloramines  Sulfate Conductivity Organic_carbon
#> 1  7.080795 204.8905 20791.32    7.300212 368.5164     564.3087       10.37978
#> 3  8.099124 224.2363 19909.54    9.275884 334.5643     418.6062       16.86864
#> 16 6.347272 186.7329 41065.23    9.629596 364.4877     516.7433       11.53978
#> 18 9.181560 273.8138 24041.33    6.904990 398.3505     477.9746       13.38734
#> 36 5.115817 191.9527 19620.55    6.060713 323.8364     441.7484       10.96649
#> 42 5.331940 194.8741 16658.88    7.993830 316.6752     335.1204       10.18051
#>    Trihalomethanes Turbidity Potability pred_RiskLs pred_Label
#> 1         86.99097  2.963135          0   0.4032852          0
#> 3         66.42009  3.055934          0   0.3803156          0
#> 16        75.07162  4.376348          0   0.4530251          0
#> 18        71.45736  4.503661          0   0.3670449          0
#> 36        49.23823  3.902089          0   0.3805967          0
#> 42        59.57271  4.434820          0   0.4132216          0

The factor levels determine which label the logistic regression model treats as the positive class (the "1"); they can be verified directly, as in the quick check below.

  • The first level, "0", is the base (reference) class.
  • The second level, "1", is the class whose probability glm() estimates.
  • Predicted probabilities near 0 are classified as 0, and probabilities near 1 are classified as 1 (here using a 0.5 cutoff).
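
If in doubt about which level glm() treats as the event of interest, the factor levels can be inspected directly; glm() models the probability of the second level. A quick check:

# The first level ("0") is the reference class; glm() models P(Potability = "1")
levels(water_potability_train$Potability)
contrasts(water_potability_train$Potability)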

Model Evaluation

head(water_potability_test %>% 
  select(Potability, pred_Label))
#>    Potability pred_Label
#> 1           0          0
#> 3           0          0
#> 16          0          0
#> 18          0          0
#> 36          0          0
#> 42          0          0

After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:

confusionMatrix(data = as.factor(water_potability_test$pred_Label),
                reference = water_potability_test$Potability,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 385 269
#>          1   0   2
#>                                              
#>                Accuracy : 0.5899             
#>                  95% CI : (0.5512, 0.6279)   
#>     No Information Rate : 0.5869             
#>     P-Value [Acc > NIR] : 0.4536             
#>                                              
#>                   Kappa : 0.0087             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.007380           
#>             Specificity : 1.000000           
#>          Pos Pred Value : 1.000000           
#>          Neg Pred Value : 0.588685           
#>              Prevalence : 0.413110           
#>          Detection Rate : 0.003049           
#>    Detection Prevalence : 0.003049           
#>       Balanced Accuracy : 0.503690           
#>                                              
#>        'Positive' Class : 1                  
#> 
  • Accuracy = how often the model guesses the target Y correctly overall.
  • Precision = of all positive predictions, what proportion is actually positive.
  • Recall/Sensitivity = of all actually positive data, what proportion the model guesses correctly.
  • Specificity = of all actually negative data, what proportion the model guesses correctly.
Accuracy <- (2+385)/nrow(water_potability_test)
Recall <- (2/(2 + 269))
Precision <- (2)/(2+0)
Specificity <- (385)/(385+0)
                
performanceLs <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performanceLs
#>   Accuracy      Recall Precision Specificity
#> 1 0.589939 0.007380074         1           1

Model Tuning

After further analysis and checking, it turns out that our model needs tuning to improve the recall value. We want to increase the recall/sensitivity, i.e., we need the model to catch more of the truly potable water. So the first thing we do is upsample the training data:

 water_potability_train <- upSample(
  x = water_potability_train %>% select(-Potability),
  y = water_potability_train$Potability,
  yname = "Potability"
 ) 
prop.table(table(water_potability_train$Potability))
#> 
#>   0   1 
#> 0.5 0.5
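
caret's downSample() is the mirror-image option: it balances the classes by discarding majority-class rows instead of duplicating minority-class rows. A sketch of the call, for comparison only and not run here; applied to the pre-upsampling training split it would shrink the majority class to the size of the minority class:

# Alternative (not run): balance the classes by downsampling the majority class
water_potability_down <- downSample(
  x = water_potability_train %>% select(-Potability),
  y = water_potability_train$Potability,
  yname = "Potability"
)
prop.table(table(water_potability_down$Potability))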

The code below fits a logistic regression model in R using the glm() function, now on the upsampled training data. The model predicts the probability of water potability from the available input variables. The step() function then performs stepwise variable selection, which automatically chooses the best subset of predictors based on AIC. The resulting model with the selected predictors is stored in a variable called model_steplr.

model_lr <- glm(formula = Potability ~ ., 
             family = "binomial", 
             data = water_potability_train)
# stepwise
model_steplr <- step(model_lr, direction = 'both')
#> Start:  AIC=4472.37
#> Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + 
#>     Conductivity + Organic_carbon + Trihalomethanes + Turbidity
#> 
#>                   Df Deviance    AIC
#> - ph               1   4452.4 4470.4
#> - Conductivity     1   4452.4 4470.4
#> - Turbidity        1   4452.8 4470.8
#> - Hardness         1   4453.2 4471.2
#> - Trihalomethanes  1   4453.2 4471.2
#> <none>                 4452.4 4472.4
#> - Chloramines      1   4455.2 4473.2
#> - Solids           1   4455.2 4473.2
#> - Sulfate          1   4455.9 4473.9
#> - Organic_carbon   1   4459.5 4477.5
#> 
#> Step:  AIC=4470.37
#> Potability ~ Hardness + Solids + Chloramines + Sulfate + Conductivity + 
#>     Organic_carbon + Trihalomethanes + Turbidity
#> 
#>                   Df Deviance    AIC
#> - Conductivity     1   4452.4 4468.4
#> - Turbidity        1   4452.8 4468.8
#> - Hardness         1   4453.2 4469.2
#> - Trihalomethanes  1   4453.2 4469.2
#> <none>                 4452.4 4470.4
#> - Chloramines      1   4455.2 4471.2
#> - Solids           1   4455.2 4471.2
#> - Sulfate          1   4455.9 4471.9
#> + ph               1   4452.4 4472.4
#> - Organic_carbon   1   4459.5 4475.5
#> 
#> Step:  AIC=4468.37
#> Potability ~ Hardness + Solids + Chloramines + Sulfate + Organic_carbon + 
#>     Trihalomethanes + Turbidity
#> 
#>                   Df Deviance    AIC
#> - Turbidity        1   4452.8 4466.8
#> - Hardness         1   4453.2 4467.2
#> - Trihalomethanes  1   4453.2 4467.2
#> <none>                 4452.4 4468.4
#> - Chloramines      1   4455.2 4469.2
#> - Solids           1   4455.2 4469.2
#> - Sulfate          1   4455.9 4469.9
#> + Conductivity     1   4452.4 4470.4
#> + ph               1   4452.4 4470.4
#> - Organic_carbon   1   4459.5 4473.5
#> 
#> Step:  AIC=4466.84
#> Potability ~ Hardness + Solids + Chloramines + Sulfate + Organic_carbon + 
#>     Trihalomethanes
#> 
#>                   Df Deviance    AIC
#> - Hardness         1   4453.6 4465.6
#> - Trihalomethanes  1   4453.7 4465.7
#> <none>                 4452.8 4466.8
#> - Solids           1   4455.7 4467.7
#> - Chloramines      1   4455.7 4467.7
#> - Sulfate          1   4456.4 4468.4
#> + Turbidity        1   4452.4 4468.4
#> + Conductivity     1   4452.8 4468.8
#> + ph               1   4452.8 4468.8
#> - Organic_carbon   1   4460.0 4472.0
#> 
#> Step:  AIC=4465.61
#> Potability ~ Solids + Chloramines + Sulfate + Organic_carbon + 
#>     Trihalomethanes
#> 
#>                   Df Deviance    AIC
#> - Trihalomethanes  1   4454.5 4464.5
#> <none>                 4453.6 4465.6
#> - Solids           1   4456.6 4466.6
#> - Chloramines      1   4456.7 4466.7
#> - Sulfate          1   4456.8 4466.8
#> + Hardness         1   4452.8 4466.8
#> + Turbidity        1   4453.2 4467.2
#> + ph               1   4453.6 4467.6
#> + Conductivity     1   4453.6 4467.6
#> - Organic_carbon   1   4460.8 4470.8
#> 
#> Step:  AIC=4464.46
#> Potability ~ Solids + Chloramines + Sulfate + Organic_carbon
#> 
#>                   Df Deviance    AIC
#> <none>                 4454.5 4464.5
#> - Solids           1   4457.5 4465.5
#> - Chloramines      1   4457.6 4465.6
#> + Trihalomethanes  1   4453.6 4465.6
#> + Hardness         1   4453.7 4465.7
#> - Sulfate          1   4457.7 4465.7
#> + Turbidity        1   4454.0 4466.0
#> + ph               1   4454.5 4466.5
#> + Conductivity     1   4454.5 4466.5
#> - Organic_carbon   1   4461.8 4469.8

Test dataset

After that, we double-check the logistic regression model's evaluation on the training data and then apply it to the test dataset:

water_potability_train$pred_RiskLs <- predict(model_steplr, 
                                            water_potability_train, 
                                            type = "response")

water_potability_train$pred_Label <- ifelse(water_potability_train$pred_RiskLs > 0.5, 1, 0)

confusionMatrix(data = as.factor(water_potability_train$pred_Label),
                reference = water_potability_train$Potability,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 876 772
#>          1 737 841
#>                                           
#>                Accuracy : 0.5322          
#>                  95% CI : (0.5148, 0.5496)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : 0.0001334       
#>                                           
#>                   Kappa : 0.0645          
#>                                           
#>  Mcnemar's Test P-Value : 0.3814355       
#>                                           
#>             Sensitivity : 0.5214          
#>             Specificity : 0.5431          
#>          Pos Pred Value : 0.5330          
#>          Neg Pred Value : 0.5316          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.2607          
#>    Detection Prevalence : 0.4892          
#>       Balanced Accuracy : 0.5322          
#>                                           
#>        'Positive' Class : 1               
#> 
water_potability_test$pred_RiskLs <- predict(object = model_lr,
                                newdata = water_potability_test,
                                type = "response")

water_potability_test$pred_Label <- ifelse(water_potability_test$pred_RiskLs > 0.5, 1, 0)
head(water_potability_test)
#>          ph Hardness   Solids Chloramines  Sulfate Conductivity Organic_carbon
#> 1  7.080795 204.8905 20791.32    7.300212 368.5164     564.3087       10.37978
#> 3  8.099124 224.2363 19909.54    9.275884 334.5643     418.6062       16.86864
#> 16 6.347272 186.7329 41065.23    9.629596 364.4877     516.7433       11.53978
#> 18 9.181560 273.8138 24041.33    6.904990 398.3505     477.9746       13.38734
#> 36 5.115817 191.9527 19620.55    6.060713 323.8364     441.7484       10.96649
#> 42 5.331940 194.8741 16658.88    7.993830 316.6752     335.1204       10.18051
#>    Trihalomethanes Turbidity Potability pred_RiskLs pred_Label
#> 1         86.99097  2.963135          0   0.5262277          1
#> 3         66.42009  3.055934          0   0.4975655          0
#> 16        75.07162  4.376348          0   0.5637733          1
#> 18        71.45736  4.503661          0   0.4572410          0
#> 36        49.23823  3.902089          0   0.5068487          1
#> 42        59.57271  4.434820          0   0.5297354          1

After refitting the logistic regression, it is time for us to evaluate the model further on the test set:

confusionMatrix(data = as.factor(water_potability_test$pred_Label),
                reference = water_potability_test$Potability,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 182 142
#>          1 203 129
#>                                           
#>                Accuracy : 0.4741          
#>                  95% CI : (0.4353, 0.5131)
#>     No Information Rate : 0.5869          
#>     P-Value [Acc > NIR] : 1.000000        
#>                                           
#>                   Kappa : -0.0496         
#>                                           
#>  Mcnemar's Test P-Value : 0.001237        
#>                                           
#>             Sensitivity : 0.4760          
#>             Specificity : 0.4727          
#>          Pos Pred Value : 0.3886          
#>          Neg Pred Value : 0.5617          
#>              Prevalence : 0.4131          
#>          Detection Rate : 0.1966          
#>    Detection Prevalence : 0.5061          
#>       Balanced Accuracy : 0.4744          
#>                                           
#>        'Positive' Class : 1               
#> 

Model Score Evaluation After Upsampling

Accuracy_tuning <- (129+182)/nrow(water_potability_test)
Recall_tuning <- 129/(129+142)
Precision_tuning <- (129)/(129+203)
Specificity_tuning <- (182)/(182+203)
                
performanceLs_Tuning <- cbind.data.frame(Accuracy_tuning, Recall_tuning, Precision_tuning, Specificity_tuning)
performanceLs_Tuning
#>   Accuracy_tuning Recall_tuning Precision_tuning Specificity_tuning
#> 1       0.4740854     0.4760148        0.3885542          0.4727273

The recall is still low, at about 47% correct on the positive class, so our next step is to focus on increasing the recall by lowering the classification threshold (a quick threshold scan is sketched below, before we settle on a cutoff of 0.45).
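
Before settling on a specific cutoff, a quick scan over candidate thresholds shows how recall trades off against specificity. A minimal base-R sketch, assuming pred_RiskLs already holds the predicted probabilities on the test set:

# Sketch: recall and specificity on the test set across a grid of thresholds
thresholds <- seq(0.30, 0.60, by = 0.05)
sapply(thresholds, function(t) {
  pred   <- ifelse(water_potability_test$pred_RiskLs > t, 1, 0)
  actual <- water_potability_test$Potability
  c(threshold   = t,
    recall      = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
})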

# increase recall
water_potability_test$pred_Label_new <- ifelse(test = water_potability_test$pred_RiskLs > 0.45,
                                    yes = 1,
                                    no = 0)

confusionMatrix(data = as.factor(water_potability_test$pred_Label_new),
                reference = water_potability_test$Potability, 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0  27  28
#>          1 358 243
#>                                              
#>                Accuracy : 0.4116             
#>                  95% CI : (0.3736, 0.4503)   
#>     No Information Rate : 0.5869             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : -0.0281            
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.89668            
#>             Specificity : 0.07013            
#>          Pos Pred Value : 0.40433            
#>          Neg Pred Value : 0.49091            
#>              Prevalence : 0.41311            
#>          Detection Rate : 0.37043            
#>    Detection Prevalence : 0.91616            
#>       Balanced Accuracy : 0.48340            
#>                                              
#>        'Positive' Class : 1                  
#> 

Model Evaluation After Tuning

Accuracy_tuning_Recall <- (243+27)/nrow(water_potability_test)
Recall_tuning_Recall <- 243/(243+28)
Precision_tuning_Recall <- (243)/(243+358)
Specificity_tuning_Recall <- (27)/(27+358)
                
performanceLs_Tuning_Recall <- cbind.data.frame(Accuracy_tuning_Recall, Recall_tuning_Recall, Precision_tuning_Recall, Specificity_tuning_Recall)
performanceLs_Tuning_Recall
#>   Accuracy_tuning_Recall Recall_tuning_Recall Precision_tuning_Recall
#> 1              0.4115854             0.896679               0.4043261
#>   Specificity_tuning_Recall
#> 1                0.07012987
  • According to the confusion matrix, the model's accuracy in predicting the target (both potable and non-potable) is about 41%.
  • The model recalls about 90% of the actually potable water (sensitivity).
  • Of all the water the model predicted to be potable, about 40% really is (precision).
  • The model correctly identifies only about 7% of the actually non-potable water (specificity).

K-NN Model

K-nearest neighbor classifies new data by comparing the characteristics of the new data (test data) with the existing data (train data). Similarity is measured by Euclidean distance to find the k data points (neighbors) that are closest. The class held by the majority of these neighbors becomes the class of the new data (majority voting).

After that we must scale the predictors using z-score standardization. The test data must be scaled using parameters (mean and standard deviation) taken from the train data, because the test data is treated as unseen data.
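
As a tiny illustration of why scaling matters for a distance-based method: without standardization, a feature such as Solids (tens of thousands) would dominate the Euclidean distance over a feature such as ph (0-14). A hedged base-R sketch with made-up numbers:

# Two hypothetical samples: ph differs a lot, Solids differs only slightly
a <- c(ph = 5, Solids = 20000)
b <- c(ph = 9, Solids = 20050)
dist(rbind(a, b))        # about 50, dominated almost entirely by Solids
# After z-score scaling, both features contribute on comparable scales
dist(scale(rbind(a, b)))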

RNGkind(sample.kind = "Rounding")
set.seed(100)

# Index sampling
index <- sample(x = nrow(water_potability), size = nrow(water_potability) * 0.8)

# Splitting
water_potability_train <- water_potability[index, ]
water_potability_test <- water_potability[-index, ]

# Predictor
water_potability_train_x <- water_potability_train %>% select_if(is.numeric)
water_potability_test_x <- water_potability_test %>% select_if(is.numeric)

# Target
water_potability_train_y <- water_potability_train[,"Potability"]
water_potability_test_y <- water_potability_test[,"Potability"]



# Scaling data
# Train
scaled_train_x <- as.data.frame(scale(water_potability_train_x))

# Test
scaled_test_x <- as.data.frame(scale(water_potability_test_x,
                            center = colMeans(water_potability_train_x),
                            scale = apply(water_potability_train_x, 2, sd)))

Our next task is to apply the k-NN model to the test data:

# Find optimum k
sqrt(nrow(scaled_train_x))
#> [1] 51.18594
# Perform k-NN classification
water_potabilityKNN_pred <- knn(train = scaled_train_x,
                 test = scaled_test_x,
                 cl = water_potability_train_y,
                 k = 51)

# Convert the predicted values to factors
water_potabilityKNN_pred <- as.factor(water_potabilityKNN_pred)

# Set the levels of the test target variable
levels(water_potability_test_y) <- levels(water_potability_train_y)

# Create the confusion matrix
confusion_matrix <- confusionMatrix(data = water_potabilityKNN_pred,
                                    reference = water_potability_test_y,
                                    positive = "1")

# Print the confusion matrix
print(confusion_matrix)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 375 241
#>          1  10  30
#>                                              
#>                Accuracy : 0.6174             
#>                  95% CI : (0.579, 0.6547)    
#>     No Information Rate : 0.5869             
#>     P-Value [Acc > NIR] : 0.0606             
#>                                              
#>                   Kappa : 0.097              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.11070            
#>             Specificity : 0.97403            
#>          Pos Pred Value : 0.75000            
#>          Neg Pred Value : 0.60877            
#>              Prevalence : 0.41311            
#>          Detection Rate : 0.04573            
#>    Detection Prevalence : 0.06098            
#>       Balanced Accuracy : 0.54236            
#>                                              
#>        'Positive' Class : 1                  
#> 

The code above performs k-Nearest Neighbors (k-NN) classification to predict water potability. It uses the scaled training predictors (scaled_train_x) with the corresponding class labels (water_potability_train_y) and makes predictions on the scaled test predictors (scaled_test_x). The predicted values are converted to a factor and the levels are kept consistent between the test and training targets. Finally, a confusion matrix is created to evaluate the k-NN model, and the results, including accuracy, sensitivity (recall), specificity, and precision (positive predictive value), are printed.
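
The choice k = 51 follows the common sqrt(n) rule of thumb. Cross-validation is a more systematic way to pick k; a sketch using caret (loaded above), shown as an option rather than something run here:

# Optional (not run): 5-fold cross-validation over a grid of odd k values
set.seed(100)
knn_cv <- train(
  x = scaled_train_x,
  y = water_potability_train_y,
  method = "knn",
  tuneGrid = data.frame(k = seq(31, 71, by = 10)),
  trControl = trainControl(method = "cv", number = 5)
)
knn_cv$bestTune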

Our next task is to compute the evaluation metrics for the K-nearest neighbor model:

Accuracy_knn <- (30+375)/nrow(water_potability_test)
recall_knn <- 30/(30+241)
Precision_knn <- (30)/(30+10)
Specificity_knn <- (375)/(375+10)
                
performanceLs_knn <- cbind.data.frame(Accuracy_knn, recall_knn, Precision_knn, Specificity_knn)
performanceLs_knn
#>   Accuracy_knn recall_knn Precision_knn Specificity_knn
#> 1     0.617378  0.1107011          0.75        0.974026
  • According to the confusion matrix, the model's accuracy in predicting the target (both potable and non-potable) is about 61%.
  • The model recalls only about 11% of the actually potable water (sensitivity).
  • Of all the water the model predicted to be potable, 75% really is (precision).
  • The model correctly identifies about 97% of the actually non-potable water (specificity).

Evaluation

After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:

  • Contents of the Confusion Matrix:

    • True Positive (TP): predicted positive and correct (positive prediction; actual positive)
    • True Negative (TN): predicted negative and correct (negative prediction; actual negative)
    • False Positive (FP): predicted positive but wrong (positive prediction; actual negative)
    • False Negative (FN): predicted negative but wrong (negative prediction; actual positive)
  • Because we have already computed these metrics in each modelling section, here we simply compare the two models to see which one is best.

# Create dataframes
performanceLs_Tuning_Recall <- data.frame(
  Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
  Tuning_Recall = c(0.4115854, 0.896679, 0.4043261, 0.07012987)
)

performanceLs_knn <- data.frame(
  Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
  knn = c(0.617378, 0.1107011, 0.75, 0.974026)
)

# Merge dataframes into a single table
combined_table <- merge(performanceLs_Tuning_Recall, performanceLs_knn, by = "Metric", all = TRUE)

# Set row names as the metric names
row.names(combined_table) <- combined_table$Metric
combined_table$Metric <- NULL

# Print the table
combined_table
#>             Tuning_Recall       knn
#> Accuracy       0.41158540 0.6173780
#> Precision      0.40432610 0.7500000
#> Recall         0.89667900 0.1107011
#> Specificity    0.07012987 0.9740260
# Create dataframes
performanceLs_Tuning_Recall <- data.frame(
  Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
  Value = c(0.4115854, 0.896679, 0.4043261, 0.07012987),
  Model = "Logistic Regression(After Tunning)"
)

performanceLs_knn <- data.frame(
  Metric = c("Accuracy", "Recall", "Precision", "Specificity"),
  Value = c(0.617378, 0.1107011, 0.75, 0.974026),
  Model = "K-nearest neighbor"
)

# Combine dataframes
combined_df <- rbind(performanceLs_Tuning_Recall, performanceLs_knn)

# Create stacked line plot with tooltips
p <- ggplot(combined_df, aes(x = Metric, y = Value, group = Model, color = Model)) +
  geom_line(stat = "identity") +
  geom_point(size = 3) +
  labs(x = "Metric", y = "Value", color = "Model") +
  ggtitle("Comparison of Performance Metrics") +
  theme_minimal()

# Convert plot to interactive plotly object
p <- ggplotly(p, tooltip = c("Model", "Metric", "Value"), width = 600, height = 400)
p

Comparing the two approaches, the Logistic Regression (after tuning) model reaches a recall of about 89.7%, far higher than K-NN's 11.1%. This means it is much better at correctly identifying the water that is actually potable, while K-NN wins on accuracy, precision, and specificity.


Conclusion

Both models have their own advantages and disadvantages: the K-nearest neighbor model has better accuracy, much better precision, and much better specificity. However, the Logistic Regression model has by far the best recall, at almost 90%. For the current case, we are most concerned with recall, because we need the model to catch as many of the truly potable samples as possible; if identifying actually potable water is the priority, we should choose the model with the higher recall. Thus, we may be better off choosing the logistic regression model (after tuning).