Water Quality Classification: Logistic Regression and K-NN

Intro

Water quality describes the condition of the water, including chemical, physical, and biological characteristics, usually with respect to its suitability for a particular purpose such as drinking or swimming.

In this project, we build machine learning models to classify water as potable or not potable. The algorithms we use are Logistic Regression and K-Nearest Neighbor (K-NN). The dataset can be downloaded from Kaggle: https://www.kaggle.com/datasets/adityakadiwal/water-potability.

Set Up

First, load the required packages.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(class)
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: ggplot2
## Loading required package: lattice
library(stringr)
library(ggplot2)

Logistic Regression

Data Import

water_data <- read.csv(file ="water_potability.csv")

rmarkdown::paged_table(water_data)

Data Manipulation

glimpse(water_data)
## Rows: 3,276
## Columns: 10
## $ ph              <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, …
## $ Hardness        <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.…
## $ Solids          <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874…
## $ Chloramines     <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54…
## $ Sulfate         <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66…
## $ Conductivity    <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.…
## $ Organic_carbon  <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,…
## $ Trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.…
## $ Turbidity       <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55…
## $ Potability      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

From the output above, our dataset has 3,276 rows and 10 columns. The target variable Potability is stored as an integer, so we convert it to a factor.

water_data <- water_data %>% 
  mutate(Potability = as.factor(Potability))

Next, we check whether the data has missing values.

colSums(is.na(water_data))
##              ph        Hardness          Solids     Chloramines         Sulfate 
##             491               0               0               0             781 
##    Conductivity  Organic_carbon Trihalomethanes       Turbidity      Potability 
##               0               0             162               0               0

We delete the rows that contain missing values.

water_data_clean <- water_data %>% 
  drop_na()

colSums(is.na(water_data_clean))
##              ph        Hardness          Solids     Chloramines         Sulfate 
##               0               0               0               0               0 
##    Conductivity  Organic_carbon Trihalomethanes       Turbidity      Potability 
##               0               0               0               0               0

Data Pre-Processing

Before we create the logistic regression model, let's check the class proportions of our target variable.

prop.table(table(water_data_clean$Potability))
## 
##         0         1 
## 0.5967181 0.4032819

With roughly 60% of the observations in class 0 and 40% in class 1, the target variable is reasonably balanced, so we do not apply any additional class-balancing step.

Cross-Validation

In this step we split our data into training data and testing data, using a simple hold-out split: 80% for training and 20% for testing. We use the training data to fit the model and the testing data to check whether the model classifies new, unseen data correctly.

set.seed(417)
intrain <- sample(nrow(water_data_clean), nrow(water_data_clean)*0.8)
water_train <- water_data_clean[intrain,]
water_test <- water_data_clean[-intrain,]

prop.table(table(water_train$Potability))
## 
##        0        1 
## 0.585199 0.414801

Modeling

We use the glm function to fit the logistic regression model, with all variables except Potability as predictors.

model_lr1 <- glm(formula = Potability~.,
                family = "binomial",
                data = water_train)
summary(model_lr1)
## 
## Call:
## glm(formula = Potability ~ ., family = "binomial", data = water_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2495  -1.0394  -0.9685   1.3057   1.5662  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -8.443e-01  8.282e-01  -1.019   0.3080  
## ph               2.415e-02  3.191e-02   0.757   0.4492  
## Hardness         1.310e-04  1.560e-03   0.084   0.9331  
## Solids           1.331e-05  6.006e-06   2.216   0.0267 *
## Chloramines      4.948e-02  3.210e-02   1.541   0.1232  
## Sulfate         -1.415e-03  1.248e-03  -1.134   0.2567  
## Conductivity    -1.225e-04  6.348e-04  -0.193   0.8469  
## Organic_carbon  -6.974e-03  1.546e-02  -0.451   0.6520  
## Trihalomethanes  5.610e-04  3.164e-03   0.177   0.8593  
## Turbidity        6.084e-02  6.552e-02   0.929   0.3531  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2182.2  on 1607  degrees of freedom
## Residual deviance: 2171.3  on 1598  degrees of freedom
## AIC: 2191.3
## 
## Number of Fisher Scoring iterations: 4

Model Fitting

From the summary of model_lr1, we see that almost none of the predictors is statistically significant; only Solids has a p-value below 0.05. Let's try backward stepwise selection.

model_lr2 <- step(object = model_lr1,
                  direction = "backward",
                  trace = FALSE)
summary(model_lr2)
## 
## Call:
## glm(formula = Potability ~ Solids + Chloramines, family = "binomial", 
##     data = water_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2644  -1.0384  -0.9749   1.3112   1.5132  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.019e+00  2.767e-01  -3.683 0.000231 ***
## Solids       1.431e-05  5.870e-06   2.437 0.014801 *  
## Chloramines  5.033e-02  3.199e-02   1.573 0.115659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2182.2  on 1607  degrees of freedom
## Residual deviance: 2174.4  on 1605  degrees of freedom
## AIC: 2180.4
## 
## Number of Fisher Scoring iterations: 4

Predicting

We will use model_lr2, the model produced by the stepwise method, to predict the potability of the test data.

water_test$prob_predict <- predict(object = model_lr2, newdata = water_test, type = "response")

rmarkdown::paged_table(as.data.frame(water_test$prob_predict))

Logistic regression returns the probability of the positive class. Let's convert these probabilities into classes using a threshold: a probability above 0.5 is classified as the positive class.

threshold <- 0.5
water_test$pred <- ifelse(water_test$prob_predict > threshold, 1, 0)

Model Evaluation

To evaluate the model, we use a confusion matrix.

confusionMatrix(as.factor(water_test$pred), water_test$Potability, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 256 141
##          1   3   3
##                                           
##                Accuracy : 0.6427          
##                  95% CI : (0.5937, 0.6895)
##     No Information Rate : 0.6427          
##     P-Value [Acc > NIR] : 0.5227          
##                                           
##                   Kappa : 0.0118          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.020833        
##             Specificity : 0.988417        
##          Pos Pred Value : 0.500000        
##          Neg Pred Value : 0.644836        
##              Prevalence : 0.357320        
##          Detection Rate : 0.007444        
##    Detection Prevalence : 0.014888        
##       Balanced Accuracy : 0.504625        
##                                           
##        'Positive' Class : 1               
## 

Based on the confusion matrix above, the model's Accuracy (the share of all predictions that are correct) is 64.27%. This equals the No Information Rate, so the model does no better than always predicting the majority class. The Sensitivity (the share of actual positives that the model identifies) is 2.08%, which is very low. The Specificity (the share of actual negatives that the model identifies) is 98.84%, which is very high. The Precision (the share of predicted positives that are actually positive) is 50%.

K-Nearest Neighbor (K-NN)

Data Pre-processing

summary(water_data_clean)
##        ph             Hardness          Solids         Chloramines    
##  Min.   : 0.2275   Min.   : 73.49   Min.   :  320.9   Min.   : 1.391  
##  1st Qu.: 6.0897   1st Qu.:176.74   1st Qu.:15615.7   1st Qu.: 6.139  
##  Median : 7.0273   Median :197.19   Median :20933.5   Median : 7.144  
##  Mean   : 7.0860   Mean   :195.97   Mean   :21917.4   Mean   : 7.134  
##  3rd Qu.: 8.0530   3rd Qu.:216.44   3rd Qu.:27182.6   3rd Qu.: 8.110  
##  Max.   :14.0000   Max.   :317.34   Max.   :56488.7   Max.   :13.127  
##     Sulfate       Conductivity   Organic_carbon  Trihalomethanes  
##  Min.   :129.0   Min.   :201.6   Min.   : 2.20   Min.   :  8.577  
##  1st Qu.:307.6   1st Qu.:366.7   1st Qu.:12.12   1st Qu.: 55.953  
##  Median :332.2   Median :423.5   Median :14.32   Median : 66.542  
##  Mean   :333.2   Mean   :426.5   Mean   :14.36   Mean   : 66.401  
##  3rd Qu.:359.3   3rd Qu.:482.4   3rd Qu.:16.68   3rd Qu.: 77.292  
##  Max.   :481.0   Max.   :753.3   Max.   :27.01   Max.   :124.000  
##    Turbidity     Potability
##  Min.   :1.450   0:1200    
##  1st Qu.:3.443   1: 811    
##  Median :3.968             
##  Mean   :3.970             
##  3rd Qu.:4.514             
##  Max.   :6.495

From the summary, we see that the variables are on very different scales. Because K-NN relies on distances between observations, we use z-score standardization to put the variables on a common scale.

water_data_norm <- data.frame(lapply(water_data_clean[,-10], scale))
rmarkdown::paged_table(water_data_norm)
summary(water_data_norm)
##        ph             Hardness           Solids         Chloramines       
##  Min.   :-4.3592   Min.   :-3.7529   Min.   :-2.4989   Min.   :-3.624051  
##  1st Qu.:-0.6332   1st Qu.:-0.5890   1st Qu.:-0.7292   1st Qu.:-0.628111  
##  Median :-0.0373   Median : 0.0375   Median :-0.1139   Median : 0.006037  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.6146   3rd Qu.: 0.6273   3rd Qu.: 0.6092   3rd Qu.: 0.615456  
##  Max.   : 4.3945   Max.   : 3.7190   Max.   : 4.0003   Max.   : 3.781289  
##     Sulfate          Conductivity      Organic_carbon     Trihalomethanes    
##  Min.   :-4.95629   Min.   :-2.78651   Min.   :-3.65650   Min.   :-3.596657  
##  1st Qu.:-0.62109   1st Qu.:-0.74147   1st Qu.:-0.67177   1st Qu.:-0.649880  
##  Median :-0.02409   Median :-0.03804   Median :-0.01073   Median : 0.008791  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.63356   3rd Qu.: 0.69192   3rd Qu.: 0.69936   3rd Qu.: 0.677427  
##  Max.   : 3.58707   Max.   : 4.04914   Max.   : 3.80426   Max.   : 3.582680  
##    Turbidity        
##  Min.   :-3.228989  
##  1st Qu.:-0.675102  
##  Median :-0.001988  
##  Mean   : 0.000000  
##  3rd Qu.: 0.697699  
##  Max.   : 3.235769

Cross-Validation

set.seed(471)
insample <- sample(nrow(water_data_norm), nrow(water_data_norm)*0.8)
water_train_knn <- water_data_norm[insample,]
water_test_knn <- water_data_norm[-insample,]

Modeling

Before we train the model, we need to choose the number of neighbors k. Here we use the square root of the number of training rows.

k <- round(sqrt(nrow(water_train_knn)))
k
## [1] 40
knn_pred <- knn(train = water_train_knn, test = water_test_knn, cl = water_data_clean$Potability[insample], k=k)

Model Evaluation

confusionMatrix(data = knn_pred, reference = water_data_clean$Potability[-insample], positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 210 143
##          1  11  39
##                                           
##                Accuracy : 0.6179          
##                  95% CI : (0.5685, 0.6655)
##     No Information Rate : 0.5484          
##     P-Value [Acc > NIR] : 0.002826        
##                                           
##                   Kappa : 0.1758          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.21429         
##             Specificity : 0.95023         
##          Pos Pred Value : 0.78000         
##          Neg Pred Value : 0.59490         
##              Prevalence : 0.45161         
##          Detection Rate : 0.09677         
##    Detection Prevalence : 0.12407         
##       Balanced Accuracy : 0.58226         
##                                           
##        'Positive' Class : 1               
## 

Based on the confusion matrix above, the model's Accuracy (the share of all predictions that are correct) is 61.79%. The Sensitivity (the share of actual positives that the model identifies) is 21.43%, which is low. The Specificity (the share of actual negatives that the model identifies) is 95.02%, which is very high. The Precision (the share of predicted positives that are actually positive) is 78%.

Conclusion

Both logistic regression and K-NN perform poorly in terms of accuracy and sensitivity. Both models struggle to identify true positives, with a sensitivity of only 2.08% for logistic regression and 21.43% for K-NN. Their specificity is similar: 98.84% for logistic regression and 95.02% for K-NN. If our focus is on identifying water that is not potable, we can choose the logistic regression model, which has higher accuracy and specificity than the K-NN model. Note, however, that the two models were evaluated on different random test splits (seeds 417 and 471), so the comparison is only approximate.

---
title: "Water Quality Classification: Logistic Regression and K-NN"
author: "Chaidar Aji Nugroho"
date: "2022-11-24"
output:
  rmdformats::readthedown:
    self_contained: true
    code_download: true
    toc_depth: 3
    df_print: paged
    code_folding: show
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Intro

Water quality describes the condition of the water, including chemical, physical, and biological characteristics, usually with respect to its suitability for a particular purpose such as drinking or swimming.

In this project, we build machine learning models to classify water as potable or not potable. The algorithms we use are Logistic Regression and K-Nearest Neighbor (K-NN). The dataset can be downloaded [here](https://www.kaggle.com/datasets/adityakadiwal/water-potability).

## Set Up

First, load the required packages.

```{r}
library(dplyr)
library(tidyr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
```

## Logistic Regression

### Data Import

```{r}
water_data <- read.csv(file ="water_potability.csv")

rmarkdown::paged_table(water_data)
```

### Data Manipulation

```{r}
glimpse(water_data)
```

From the output above, our dataset has 3,276 rows and 10 columns. The target variable `Potability` is stored as an integer, so we convert it to a *factor*.

```{r}
water_data <- water_data %>% 
  mutate(Potability = as.factor(Potability))
```

Next, we check whether the data has missing values.

```{r}
colSums(is.na(water_data))
```

We delete the rows that contain missing values.

```{r}
water_data_clean <- water_data %>% 
  drop_na()

colSums(is.na(water_data_clean))
```
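
To make the effect of dropping incomplete rows concrete, here is a minimal sketch on a toy data frame. The values are hypothetical, and it uses base R's `complete.cases()` as a stand-in for `drop_na()`. A row is removed as soon as any of its columns is missing, so the number of rows lost can be larger than any single column's NA count. In the water data, 2,011 of the 3,276 rows remain after this step.

```r
# Toy data frame with made-up values (not the water dataset)
toy <- data.frame(
  ph         = c(7.1, NA, 6.8, 8.0, NA),
  Sulfate    = c(330, 350, NA, 310, NA),
  Potability = c(0, 1, 0, 1, 0)
)

# Missing values per column, as colSums(is.na(...)) reports them
na_per_column <- colSums(is.na(toy))

# Keep only rows without any missing value (base R stand-in for drop_na())
toy_complete <- toy[complete.cases(toy), ]

# Share of rows that survive the cleaning step
share_kept <- nrow(toy_complete) / nrow(toy)
share_kept
```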

### Data Pre-Processing

Before we create the logistic regression model, let's check the class proportions of our target variable.

```{r}
prop.table(table(water_data_clean$Potability))
```

With roughly 60% of the observations in class 0 and 40% in class 1, the target variable is reasonably balanced, so we do not apply any additional class-balancing step.

### Cross-Validation

In this step we split our data into training data and testing data, using a simple hold-out split: 80% for training and 20% for testing. We use the training data to fit the model and the testing data to check whether the model classifies new, unseen data correctly.

```{r}
set.seed(417)
intrain <- sample(nrow(water_data_clean), nrow(water_data_clean)*0.8)
water_train <- water_data_clean[intrain,]
water_test <- water_data_clean[-intrain,]

prop.table(table(water_train$Potability))
```
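
The split above can be checked on a small example. The sketch below repeats the same idea on toy data (the data frame, its column names, and the seed are assumptions made for illustration) and confirms that the training and testing rows do not overlap. It makes the training size explicit with `floor()`, since 80% of the row count is not always a whole number.

```r
set.seed(1)  # arbitrary seed, used only for this illustration

# Toy data: 50 rows, 30 in class 0 and 20 in class 1
toy <- data.frame(
  x = rnorm(50),
  y = factor(rep(c(0, 1), times = c(30, 20)))
)

# Number of training rows, rounded down to a whole number
n_train <- floor(0.8 * nrow(toy))

# Draw the training row indices at random, without replacement
idx <- sample(nrow(toy), n_train)

toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class proportions in the training set
prop.table(table(toy_train$y))
```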

### Modeling

We use the `glm` function to fit the logistic regression model, with all variables except `Potability` as predictors.

```{r}
model_lr1 <- glm(formula = Potability~.,
                family = "binomial",
                data = water_train)
summary(model_lr1)
```

#### Model Fitting

From the summary of `model_lr1`, we see that almost none of the predictors is statistically significant; only `Solids` has a p-value below 0.05. Let's try backward stepwise selection with `step()`.

```{r}
model_lr2 <- step(object = model_lr1,
                  direction = "backward",
                  trace = FALSE)
summary(model_lr2)
```
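
`step()` compares candidate models by AIC. For a logistic regression on 0/1 responses, the AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below is a hand check using the deviances reported in the rendered summaries; the helper name `aic_from_deviance` is our own. It reproduces the two AIC values: the stepwise model has a slightly higher deviance but a lower AIC, because it estimates 3 coefficients instead of 10.

```r
# AIC for a binary (0/1) logistic regression:
# residual deviance + 2 * number of estimated coefficients
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

# Full model: intercept + 9 predictors, residual deviance 2171.3
aic_full <- aic_from_deviance(2171.3, 10)

# Stepwise model: intercept + Solids + Chloramines, residual deviance 2174.4
aic_step <- aic_from_deviance(2174.4, 3)

c(full = aic_full, stepwise = aic_step)
```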

### Predicting

We will use `model_lr2`, the model produced by the stepwise method, to predict the potability of the test data.

```{r}
water_test$prob_predict <- predict(object = model_lr2, newdata = water_test, type = "response")

rmarkdown::paged_table(as.data.frame(water_test$prob_predict))
```
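
To see what `predict(..., type = "response")` computes, here is a minimal sketch that applies the fitted equation by hand. The coefficients are copied from the rendered summary of `model_lr2` and the input values are the sample means from `summary(water_data_clean)`, so the result is only as precise as those rounded numbers. The predicted probability for this "average" sample comes out at roughly 0.41.

```r
# Coefficients copied (rounded) from the rendered summary of model_lr2
b_intercept   <- -1.019
b_solids      <- 1.431e-05
b_chloramines <- 5.033e-02

# Sample means of the two predictors, from summary(water_data_clean)
mean_solids      <- 21917.4
mean_chloramines <- 7.134

# Linear predictor (log-odds of being potable)
log_odds <- b_intercept +
  b_solids * mean_solids +
  b_chloramines * mean_chloramines

# The logistic function turns log-odds into a probability
p_mean <- plogis(log_odds)
p_mean
```

Because the coefficients are small, predicted probabilities stay close to this value for most samples, which is consistent with very few test rows being classified as potable at a 0.5 threshold.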

Logistic regression returns the probability of the positive class. Let's convert these probabilities into classes using a threshold: a probability above 0.5 is classified as the positive class.

```{r}
threshold <- 0.5
water_test$pred <- ifelse(water_test$prob_predict > threshold, 1, 0)
```
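
The choice of threshold directly controls how many samples are labelled positive. The sketch below uses made-up probabilities and a hypothetical helper `classify()` to show the effect of lowering the threshold. Building the factor with explicit levels keeps both classes in the result even when one of them is never predicted.

```r
# Made-up predicted probabilities, for illustration only
probs <- c(0.35, 0.42, 0.48, 0.51, 0.39)

# Hypothetical helper: turn probabilities into a two-level factor
classify <- function(p, threshold) {
  factor(ifelse(p > threshold, 1, 0), levels = c(0, 1))
}

pred_050 <- classify(probs, 0.5)  # the threshold used in this report
pred_040 <- classify(probs, 0.4)  # a lower threshold, for comparison

table(pred_050)
table(pred_040)
```

With these values, one sample is classified as positive at 0.5 and three at 0.4. A lower threshold raises sensitivity at the cost of specificity.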

### Model Evaluation

To evaluate the model, we use a confusion matrix.

```{r}
confusionMatrix(as.factor(water_test$pred), water_test$Potability, positive = "1")
```

Based on the confusion matrix above, the model's `Accuracy` (the share of all predictions that are correct) is 64.27%. This equals the No Information Rate, so the model does no better than always predicting the majority class. The `Sensitivity` (the share of actual positives that the model identifies) is 2.08%, which is very low. The `Specificity` (the share of actual negatives that the model identifies) is 98.84%, which is very high. The `Precision` (the share of predicted positives that are actually positive) is 50%.
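
These four metrics can be recomputed by hand from the cells of the confusion matrix. The sketch below copies the counts from the rendered output (positive class = 1) and applies the standard definitions; the variable names are our own.

```r
# Counts copied from the logistic regression confusion matrix
tp <- 3    # predicted 1, actually 1
fp <- 3    # predicted 1, actually 0
fn <- 141  # predicted 0, actually 1
tn <- 256  # predicted 0, actually 0

accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # correct / all
sensitivity <- tp / (tp + fn)                   # found positives / actual positives
specificity <- tn / (tn + fp)                   # found negatives / actual negatives
precision   <- tp / (tp + fp)                   # true positives / predicted positives

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```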

## K-Nearest Neighbor (K-NN)

### Data Pre-processing

```{r}
summary(water_data_clean)
```

From the summary, we see that the variables are on very different scales. Because K-NN relies on distances between observations, we use *z-score standardization* to put the variables on a common scale.

```{r}
water_data_norm <- data.frame(lapply(water_data_clean[,-10], scale))
rmarkdown::paged_table(water_data_norm)
```

```{r}
summary(water_data_norm)
```
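
As a quick check of what `scale()` does, the sketch below compares it with the z-score formula `(x - mean) / sd` on toy data. The second half shows a variant that this report does not use: estimating the mean and standard deviation on the training rows only and reusing them for the test rows, which avoids letting test data influence the preprocessing. The data, seed, and index choice are assumptions for illustration.

```r
set.seed(2)  # arbitrary seed, used only for this illustration

# Toy data with two variables on very different scales
toy <- data.frame(
  a = rnorm(20, mean = 100, sd = 15),
  b = runif(20)
)

# scale() on one column matches the z-score formula
z_manual <- (toy$a - mean(toy$a)) / sd(toy$a)
z_scale  <- as.numeric(scale(toy$a))
all.equal(z_manual, z_scale)

# Variant: fit the scaling on training rows, then apply it to test rows
train_idx    <- 1:15
train_scaled <- scale(toy[train_idx, ])
test_scaled  <- scale(toy[-train_idx, ],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
```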

### Cross-Validation

```{r}
set.seed(471)
insample <- sample(nrow(water_data_norm), nrow(water_data_norm)*0.8)
water_train_knn <- water_data_norm[insample,]
water_test_knn <- water_data_norm[-insample,]
```

### Modeling

Before we train the model, we need to choose the number of neighbors k. Here we use the square root of the number of training rows.

```{r}
k <- round(sqrt(nrow(water_train_knn)))
k
```
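
The square-root rule can be reproduced from the training size alone. The sketch below assumes 1,608 training rows (80% of the 2,011 complete rows, rounded down). It also shows an optional adjustment that this report does not apply: using an odd k so that a two-class vote cannot end in a tie, which `knn()` would otherwise break at random.

```r
# Number of training rows (assumed: 80% of 2,011, rounded down)
n_train <- 1608

# Rule of thumb used in this report: k is about sqrt(n)
k <- round(sqrt(n_train))

# Optional variant: bump an even k to the next odd number
k_odd <- if (k %% 2 == 0) k + 1 else k

c(k = k, k_odd = k_odd)
```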

```{r}
knn_pred <- knn(train = water_train_knn, test = water_test_knn, cl = water_data_clean$Potability[insample], k=k)
```

### Model Evaluation

```{r}
confusionMatrix(data = knn_pred, reference = water_data_clean$Potability[-insample], positive = "1")
```

Based on the confusion matrix above, the model's `Accuracy` (the share of all predictions that are correct) is 61.79%. The `Sensitivity` (the share of actual positives that the model identifies) is 21.43%, which is low. The `Specificity` (the share of actual negatives that the model identifies) is 95.02%, which is very high. The `Precision` (the share of predicted positives that are actually positive) is 78%.

## Conclusion

Both logistic regression and K-NN perform poorly in terms of accuracy and sensitivity. Both models struggle to identify true positives, with a sensitivity of only 2.08% for logistic regression and 21.43% for K-NN. Their specificity is similar: 98.84% for logistic regression and 95.02% for K-NN. If our focus is on identifying water that is not potable, we can choose the logistic regression model, which has higher accuracy and specificity than the K-NN model. Note, however, that the two models were evaluated on different random test splits (seeds 417 and 471), so the comparison is only approximate.
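
One way to summarise the trade-off between the two models in a single number is balanced accuracy, the average of sensitivity and specificity. The sketch below recomputes it from the two confusion matrices reported earlier; the helper name `balanced_accuracy` is our own. The values match the `Balanced Accuracy` lines in the rendered output.

```r
# Balanced accuracy = mean of sensitivity and specificity
balanced_accuracy <- function(tp, fp, fn, tn) {
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  (sensitivity + specificity) / 2
}

# Counts copied from the two confusion matrices (positive class = 1)
ba_logistic <- balanced_accuracy(tp = 3,  fp = 3,  fn = 141, tn = 256)
ba_knn      <- balanced_accuracy(tp = 39, fp = 11, fn = 143, tn = 210)

round(c(logistic = ba_logistic, knn = ba_knn), 4)
```

On this measure K-NN scores higher (about 0.58 versus 0.50), because logistic regression's higher specificity comes with almost no ability to detect potable samples.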