Water Quality Classification: Logistic Regression and K-NN
Intro
Water quality describes the condition of the water, including chemical, physical, and biological characteristics, usually with respect to its suitability for a particular purpose such as drinking or swimming.
In this project, we build machine learning models that classify water as potable or not potable. The algorithms we use are Logistic Regression and K-Nearest Neighbor (K-NN). The dataset can be downloaded here.
Set Up
First, load the required packages.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(class)
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: ggplot2
## Loading required package: lattice
library(stringr)
library(ggplot2)
Logistic Regression
Data Import
water_data <- read.csv(file = "water_potability.csv")
rmarkdown::paged_table(water_data)
Data Manipulation
glimpse(water_data)
## Rows: 3,276
## Columns: 10
## $ ph <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, …
## $ Hardness <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.…
## $ Solids <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874…
## $ Chloramines <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54…
## $ Sulfate <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66…
## $ Conductivity <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.…
## $ Organic_carbon <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,…
## $ Trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.…
## $ Turbidity <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55…
## $ Potability <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
From the output above, our dataset has 3,276 rows and 10 columns. Our target variable, Potability, is stored as an integer, which is not the correct data type for classification. We should convert Potability to a factor.
water_data <- water_data %>%
  mutate(Potability = as.factor(Potability))
Next, we check whether the data has missing values.
colSums(is.na(water_data))
## ph Hardness Solids Chloramines Sulfate
## 491 0 0 0 781
## Conductivity Organic_carbon Trihalomethanes Turbidity Potability
## 0 0 162 0 0
We will delete the rows with missing values.
water_data_clean <- water_data %>%
  drop_na()
colSums(is.na(water_data_clean))
## ph Hardness Solids Chloramines Sulfate
## 0 0 0 0 0
## Conductivity Organic_carbon Trihalomethanes Turbidity Potability
## 0 0 0 0 0
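The check-and-drop pattern above can be sketched on a small toy data frame. The `toy` values below are made up for illustration and are not from the water dataset. The sketch uses base R's `complete.cases()`, which keeps the same rows as `drop_na()` called without column arguments.

```r
# Toy data frame with made-up values (not from the water dataset)
toy <- data.frame(
  ph         = c(7.1, NA, 6.5, 8.2),
  Sulfate    = c(330, 310, NA, 350),
  Potability = c(0, 1, 0, 1)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows with no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

Dropping rows is the simplest option, but it discards every row that has even one missing measurement, so it is worth comparing the row counts before and after.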
Data Pre-Processing
Before we create the logistic regression model, let's check the proportion of our target variable.
prop.table(table(water_data_clean$Potability))
##
## 0 1
## 0.5967181 0.4032819
The classes are reasonably balanced (about 60% not potable and 40% potable), so we do not need additional pre-processing to balance the target classes.
Cross-Validation
Cross-validation here means splitting our data into training data and testing data (a single hold-out split). We use the training data to train our model, and the testing data to check whether the model classifies new, unseen data correctly.
set.seed(417)
intrain <- sample(nrow(water_data_clean), nrow(water_data_clean)*0.8)
water_train <- water_data_clean[intrain,]
water_test <- water_data_clean[-intrain,]
prop.table(table(water_train$Potability))
##
## 0 1
## 0.585199 0.414801
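A purely random split can shift the class proportions slightly, as the training proportions above show compared with the full data. As an optional alternative (not what this analysis uses), here is a minimal sketch of a stratified split in base R. The helper name `stratified_index` and the toy labels are assumptions made for illustration.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(length(idx) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(417)
# Made-up labels with a 60/40 split, similar in spirit to Potability
y <- factor(rep(c(0, 1), times = c(60, 40)))
train_idx <- stratified_index(y, p = 0.8)

print(prop.table(table(y[train_idx])))   # class proportions in the training part
print(prop.table(table(y[-train_idx])))  # class proportions in the testing part
```

Because the sampling happens inside each class, both parts keep the same 60/40 proportions as the toy labels.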
Modeling
We use the glm function to create the logistic regression model. We will use all variables except Potability as predictors.
model_lr1 <- glm(formula = Potability~.,
                 family = "binomial",
                 data = water_train)
summary(model_lr1)
##
## Call:
## glm(formula = Potability ~ ., family = "binomial", data = water_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2495 -1.0394 -0.9685 1.3057 1.5662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.443e-01 8.282e-01 -1.019 0.3080
## ph 2.415e-02 3.191e-02 0.757 0.4492
## Hardness 1.310e-04 1.560e-03 0.084 0.9331
## Solids 1.331e-05 6.006e-06 2.216 0.0267 *
## Chloramines 4.948e-02 3.210e-02 1.541 0.1232
## Sulfate -1.415e-03 1.248e-03 -1.134 0.2567
## Conductivity -1.225e-04 6.348e-04 -0.193 0.8469
## Organic_carbon -6.974e-03 1.546e-02 -0.451 0.6520
## Trihalomethanes 5.610e-04 3.164e-03 0.177 0.8593
## Turbidity 6.084e-02 6.552e-02 0.929 0.3531
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2182.2 on 1607 degrees of freedom
## Residual deviance: 2171.3 on 1598 degrees of freedom
## AIC: 2191.3
##
## Number of Fisher Scoring iterations: 4
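The estimates in this summary are on the log-odds scale, so exponentiating a coefficient gives an odds ratio. As a small worked example, the sketch below takes the Solids estimate copied from the output above and converts it. The 1000-unit step is an arbitrary choice made here to keep the number readable.

```r
# Estimate for Solids copied from the summary output above (log-odds per unit)
beta_solids <- 1.331e-05

# Odds ratio for a one-unit increase in Solids
or_1 <- exp(beta_solids)

# Odds ratio for a 1000-unit increase, which is easier to read at this scale
or_1000 <- exp(beta_solids * 1000)

print(c(per_unit = or_1, per_1000_units = or_1000))
```

An odds ratio this close to 1 means that even the one predictor flagged as significant changes the odds of potability only slightly.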
Model Fitting
From the summary of model_lr1, we see that almost none of the predictors is statistically significant for the target variable; only Solids has a p-value below 0.05. Let's try the step-wise method.
model_lr2 <- step(object = model_lr1,
                  direction = "backward",
                  trace = F)
summary(model_lr2)
##
## Call:
## glm(formula = Potability ~ Solids + Chloramines, family = "binomial",
## data = water_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2644 -1.0384 -0.9749 1.3112 1.5132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.019e+00 2.767e-01 -3.683 0.000231 ***
## Solids 1.431e-05 5.870e-06 2.437 0.014801 *
## Chloramines 5.033e-02 3.199e-02 1.573 0.115659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2182.2 on 1607 degrees of freedom
## Residual deviance: 2174.4 on 1605 degrees of freedom
## AIC: 2180.4
##
## Number of Fisher Scoring iterations: 4
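Backward step-wise selection removes one term at a time for as long as the AIC keeps decreasing, which is why the AIC fell from 2191.3 to 2180.4 here. A minimal sketch of the same mechanism on simulated data follows. The data frame `sim` and the variables `x1`, `x2`, `x3` are invented for illustration.

```r
set.seed(1)
n <- 200
# Simulated predictors: only x1 is related to the outcome
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.8 * sim$x1))

full_model <- glm(y ~ ., family = "binomial", data = sim)

# Backward elimination: drop one term at a time while AIC keeps decreasing
reduced_model <- step(full_model, direction = "backward", trace = 0)

print(formula(reduced_model))
print(c(full = AIC(full_model), reduced = AIC(reduced_model)))
```

A lower AIC only says the reduced model is a better trade-off between fit and complexity. It does not guarantee better predictions on the test data.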
Predicting
We will use model_lr2, the model produced by the step-wise method, to predict the potability of the test data.
water_test$prob_predict <- predict(object = model_lr2, newdata = water_test, type = "response")
rmarkdown::paged_table(as.data.frame(water_test$prob_predict))
Logistic regression returns the probability of the positive class. Let's convert the probabilities into classes using a threshold value: a probability above 0.5 is classified as the positive class.
threshold <- 0.5
water_test$pred <- ifelse(water_test$prob_predict > threshold, 1, 0)
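The threshold is a choice, and moving it changes how many observations are labelled positive. The sketch below shows this on a handful of made-up probabilities. The helper name `classify` is hypothetical and simply wraps the same `ifelse()` call used above.

```r
# Made-up predicted probabilities for five observations
prob <- c(0.32, 0.41, 0.48, 0.52, 0.61)

# Hypothetical helper: turn probabilities into classes for a given cut-off
classify <- function(prob, threshold) ifelse(prob > threshold, 1, 0)

pred_050 <- classify(prob, threshold = 0.5)
pred_040 <- classify(prob, threshold = 0.4)

print(pred_050)
print(pred_040)
```

Lowering the threshold labels more observations as positive, which generally trades specificity for sensitivity.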
Model Evaluation
To evaluate the model, we will use a confusion matrix.
confusionMatrix(as.factor(water_test$pred), water_test$Potability, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 256 141
## 1 3 3
##
## Accuracy : 0.6427
## 95% CI : (0.5937, 0.6895)
## No Information Rate : 0.6427
## P-Value [Acc > NIR] : 0.5227
##
## Kappa : 0.0118
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.020833
## Specificity : 0.988417
## Pos Pred Value : 0.500000
## Neg Pred Value : 0.644836
## Prevalence : 0.357320
## Detection Rate : 0.007444
## Detection Prevalence : 0.014888
## Balanced Accuracy : 0.504625
##
## 'Positive' Class : 1
##
Based on the confusion matrix above, the model's Accuracy (how often the model predicts correctly) is 64.27%, which is the same as the No Information Rate, so the model is no better than always predicting the majority class. The Sensitivity (the share of actual positives that the model predicts as positive) is 2.08%, which is very low. The Specificity (the share of actual negatives that the model predicts as negative) is 98.84%, which is very high. The Precision (the share of predicted positives that are actually positive) is 50%.
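These metrics can be recomputed by hand from the four cells of the confusion matrix, which helps to see what each one measures. The counts below are copied from the output above; the variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (positive class = 1)
tn <- 256  # predicted 0, actual 0
fn <- 141  # predicted 0, actual 1
fp <- 3    # predicted 1, actual 0
tp <- 3    # predicted 1, actual 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual positives found
specificity <- tn / (tn + fp)   # share of actual negatives found
precision   <- tp / (tp + fp)   # share of predicted positives that are correct

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

The model predicts the positive class only 6 times out of 403, which is why sensitivity is so low while specificity stays high.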
K-Nearest Neighbor (K-NN)
Data Pre-processing
summary(water_data_clean)
## ph Hardness Solids Chloramines
## Min. : 0.2275 Min. : 73.49 Min. : 320.9 Min. : 1.391
## 1st Qu.: 6.0897 1st Qu.:176.74 1st Qu.:15615.7 1st Qu.: 6.139
## Median : 7.0273 Median :197.19 Median :20933.5 Median : 7.144
## Mean : 7.0860 Mean :195.97 Mean :21917.4 Mean : 7.134
## 3rd Qu.: 8.0530 3rd Qu.:216.44 3rd Qu.:27182.6 3rd Qu.: 8.110
## Max. :14.0000 Max. :317.34 Max. :56488.7 Max. :13.127
## Sulfate Conductivity Organic_carbon Trihalomethanes
## Min. :129.0 Min. :201.6 Min. : 2.20 Min. : 8.577
## 1st Qu.:307.6 1st Qu.:366.7 1st Qu.:12.12 1st Qu.: 55.953
## Median :332.2 Median :423.5 Median :14.32 Median : 66.542
## Mean :333.2 Mean :426.5 Mean :14.36 Mean : 66.401
## 3rd Qu.:359.3 3rd Qu.:482.4 3rd Qu.:16.68 3rd Qu.: 77.292
## Max. :481.0 Max. :753.3 Max. :27.01 Max. :124.000
## Turbidity Potability
## Min. :1.450 0:1200
## 1st Qu.:3.443 1: 811
## Median :3.968
## Mean :3.970
## 3rd Qu.:4.514
## Max. :6.495
From the summary, we see that our predictors are on very different scales. Because K-NN relies on distances between observations, we will use z-score standardization to put all predictors on a common scale.
water_data_norm <- data.frame(lapply(water_data_clean[,-10], scale))
rmarkdown::paged_table(water_data_norm)
summary(water_data_norm)
## ph Hardness Solids Chloramines
## Min. :-4.3592 Min. :-3.7529 Min. :-2.4989 Min. :-3.624051
## 1st Qu.:-0.6332 1st Qu.:-0.5890 1st Qu.:-0.7292 1st Qu.:-0.628111
## Median :-0.0373 Median : 0.0375 Median :-0.1139 Median : 0.006037
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.6146 3rd Qu.: 0.6273 3rd Qu.: 0.6092 3rd Qu.: 0.615456
## Max. : 4.3945 Max. : 3.7190 Max. : 4.0003 Max. : 3.781289
## Sulfate Conductivity Organic_carbon Trihalomethanes
## Min. :-4.95629 Min. :-2.78651 Min. :-3.65650 Min. :-3.596657
## 1st Qu.:-0.62109 1st Qu.:-0.74147 1st Qu.:-0.67177 1st Qu.:-0.649880
## Median :-0.02409 Median :-0.03804 Median :-0.01073 Median : 0.008791
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.63356 3rd Qu.: 0.69192 3rd Qu.: 0.69936 3rd Qu.: 0.677427
## Max. : 3.58707 Max. : 4.04914 Max. : 3.80426 Max. : 3.582680
## Turbidity
## Min. :-3.228989
## 1st Qu.:-0.675102
## Median :-0.001988
## Mean : 0.000000
## 3rd Qu.: 0.697699
## Max. : 3.235769
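The z-score subtracts the column mean and divides by the column standard deviation, which is why every column above now has a mean of 0. A minimal sketch of that calculation, checked against `scale()`, follows. The five values are taken from the Solids summary earlier and serve only as a small example.

```r
# Five values from the Solids summary (min, quartiles, max), used as a small example
x <- c(320.9, 15615.7, 20933.5, 27182.6, 56488.7)

# z-score by hand: centre on the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same thing and returns a one-column matrix
z_scale <- as.numeric(scale(x))

print(round(z_manual, 3))
print(all.equal(z_manual, z_scale))
```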
Cross-Validation
set.seed(471)
insample <- sample(nrow(water_data_norm), nrow(water_data_norm)*0.8)
water_train_knn <- water_data_norm[insample,]
water_test_knn <- water_data_norm[-insample,]
Modeling
Before we train the model, we need to choose the number of neighbors, k. We use the square root of the number of training rows.
k <- round(sqrt(nrow(water_train_knn)))
k
## [1] 40
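The value of 40 follows directly from the row counts reported earlier, as the short worked example below shows. The last step, bumping an even k to the next odd number, is an optional tweak that is assumed here and is not part of this analysis. It is sometimes used to reduce tied votes in two-class problems.

```r
# Rows left after dropping missing values (1200 + 811 from the summary above)
n_clean <- 2011

# 80% of 2011 is 1608.8; the split keeps 1608 rows
n_train <- floor(n_clean * 0.8)

# Square-root rule of thumb: sqrt(1608) is about 40.1
k <- round(sqrt(n_train))

# Optional tweak (an assumption, not used in this analysis): prefer an odd k
k_odd <- if (k %% 2 == 0) k + 1 else k

print(c(n_train = n_train, k = k, k_odd = k_odd))
```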
knn_pred <- knn(train = water_train_knn, test = water_test_knn, cl = water_data_clean$Potability[insample], k=k)
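To make the arguments of `knn()` concrete, here is a minimal sketch on simulated data (not the water data). The names `train_x`, `train_y`, and `test_x` are invented for the example. The key point is that `cl` must hold the labels of the training rows in the same order as `train`, which is what indexing Potability with `insample` achieves above.

```r
library(class)

set.seed(471)
# Simulated, already-scaled predictors for two well-separated classes
train_x <- rbind(matrix(rnorm(40, mean = -3), ncol = 2),
                 matrix(rnorm(40, mean =  3), ncol = 2))
train_y <- factor(rep(c(0, 1), each = 20))

# Two new points, one near each class centre
test_x <- rbind(c(-3, -3), c(3, 3))

# Each test point gets the majority label among its k nearest training points
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
print(pred)
```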
Model Evaluation
confusionMatrix(data = knn_pred, reference = water_data_clean$Potability[-insample], positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 210 143
## 1 11 39
##
## Accuracy : 0.6179
## 95% CI : (0.5685, 0.6655)
## No Information Rate : 0.5484
## P-Value [Acc > NIR] : 0.002826
##
## Kappa : 0.1758
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.21429
## Specificity : 0.95023
## Pos Pred Value : 0.78000
## Neg Pred Value : 0.59490
## Prevalence : 0.45161
## Detection Rate : 0.09677
## Detection Prevalence : 0.12407
## Balanced Accuracy : 0.58226
##
## 'Positive' Class : 1
##
Based on the confusion matrix above, the model's Accuracy (how often the model predicts correctly) is 61.79%. The Sensitivity (the share of actual positives that the model predicts as positive) is 21.43%, which is low. The Specificity (the share of actual negatives that the model predicts as negative) is 95.02%, which is very high. The Precision (the share of predicted positives that are actually positive) is 78%.
Conclusion
Both logistic regression and K-NN perform poorly in accuracy and sensitivity. Both models struggle to identify true positives, with a sensitivity of just 2.08% for logistic regression and 21.43% for K-NN. The two models have similar specificity: 98.84% for logistic regression and 95.02% for K-NN. If our focus is on identifying water that is not potable, we can choose the logistic regression model, which has higher accuracy and specificity than the K-NN model. Note that the two models were evaluated on different random test splits (seeds 417 and 471), so the comparison is approximate.