R Markdown

→ The dataset has 9 attributes and 1 output variable, with 3276 observations. Three variables (ph, Sulfate, and Trihalomethanes) contain missing data, and the samples split into 1998 non-potable and 1278 potable.

## Convert Potability to a factor variable
water_potability$Potability <- as.factor(water_potability$Potability)
## Splitting data based on target
potability_0 <- subset(water_potability, Potability == 0)
potability_1 <- subset(water_potability, Potability == 1)
## Computing statistical data for each subset
summary(potability_0 [,1:9])
##        ph            Hardness          Solids         Chloramines    
##  Min.   : 0.000   Min.   : 98.45   Min.   :  320.9   Min.   : 1.684  
##  1st Qu.: 6.038   1st Qu.:177.82   1st Qu.:15663.1   1st Qu.: 6.156  
##  Median : 7.035   Median :197.12   Median :20809.6   Median : 7.090  
##  Mean   : 7.085   Mean   :196.73   Mean   :21777.5   Mean   : 7.092  
##  3rd Qu.: 8.156   3rd Qu.:216.12   3rd Qu.:27006.2   3rd Qu.: 8.066  
##  Max.   :14.000   Max.   :304.24   Max.   :61227.2   Max.   :12.653  
##  NA's   :314                                                         
##     Sulfate       Conductivity   Organic_carbon   Trihalomethanes  
##  Min.   :203.4   Min.   :181.5   Min.   : 4.372   Min.   :  0.738  
##  1st Qu.:311.3   1st Qu.:368.5   1st Qu.:12.101   1st Qu.: 55.707  
##  Median :333.4   Median :422.2   Median :14.294   Median : 66.542  
##  Mean   :334.6   Mean   :426.7   Mean   :14.364   Mean   : 66.304  
##  3rd Qu.:356.9   3rd Qu.:480.7   3rd Qu.:16.649   3rd Qu.: 77.278  
##  Max.   :460.1   Max.   :753.3   Max.   :28.300   Max.   :120.030  
##  NA's   :488                                      NA's   :107      
##    Turbidity    
##  Min.   :1.450  
##  1st Qu.:3.444  
##  Median :3.948  
##  Mean   :3.966  
##  3rd Qu.:4.496  
##  Max.   :6.739  
## 
summary(potability_1 [,1:9])
##        ph             Hardness          Solids         Chloramines    
##  Min.   : 0.2275   Min.   : 47.43   Min.   :  728.8   Min.   : 0.352  
##  1st Qu.: 6.1793   1st Qu.:174.33   1st Qu.:15669.0   1st Qu.: 6.094  
##  Median : 7.0368   Median :196.63   Median :21199.4   Median : 7.215  
##  Mean   : 7.0738   Mean   :195.80   Mean   :22384.0   Mean   : 7.169  
##  3rd Qu.: 7.9331   3rd Qu.:218.00   3rd Qu.:27973.2   3rd Qu.: 8.199  
##  Max.   :13.1754   Max.   :323.12   Max.   :56488.7   Max.   :13.127  
##  NA's   :177                                                          
##     Sulfate       Conductivity   Organic_carbon  Trihalomethanes  
##  Min.   :129.0   Min.   :201.6   Min.   : 2.20   Min.   :  8.176  
##  1st Qu.:300.8   1st Qu.:360.9   1st Qu.:12.03   1st Qu.: 56.014  
##  Median :331.8   Median :420.7   Median :14.16   Median : 66.678  
##  Mean   :332.6   Mean   :425.4   Mean   :14.16   Mean   : 66.540  
##  3rd Qu.:365.9   3rd Qu.:484.2   3rd Qu.:16.36   3rd Qu.: 77.381  
##  Max.   :481.0   Max.   :695.4   Max.   :23.60   Max.   :124.000  
##  NA's   :293                                     NA's   :55       
##    Turbidity    
##  Min.   :1.492  
##  1st Qu.:3.431  
##  Median :3.959  
##  Mean   :3.968  
##  3rd Qu.:4.510  
##  Max.   :6.494  
## 
## Checking missing values
WP <- water_potability
WP$ph[WP$ph == 0] <- NA
missing_rates <- colMeans(is.na(WP))
print(missing_rates)
##              ph        Hardness          Solids     Chloramines         Sulfate 
##      0.15018315      0.00000000      0.00000000      0.00000000      0.23840049 
##    Conductivity  Organic_carbon Trihalomethanes       Turbidity      Potability 
##      0.00000000      0.00000000      0.04945055      0.00000000      0.00000000
summary(missing_rates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04380 0.03709 0.23840
## Calculate the proportion of missing values for each variable
missing_proportion <- colMeans(is.na(water_potability))
## Replacing "NA" values in ph and Sulfate with the column mean
WP$ph[is.na(WP$ph)] <- mean(WP$ph, na.rm = TRUE)
WP$Sulfate[is.na(WP$Sulfate)] <- mean(WP$Sulfate, na.rm = TRUE)
## Dropping the rows that still have a missing Trihalomethanes value
cols_to_process <- c("Trihalomethanes")
for (col in cols_to_process) {WP <- WP[!is.na(WP[[col]]),]}
missing_rates_1 <- colMeans(is.na(WP))
summary(missing_rates_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0

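→ All missing data has been handled: missing ph and Sulfate values were replaced with the column mean, and rows with a missing Trihalomethanes value were removed.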
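The cleaning steps above can be illustrated on a small toy data frame. This is only a minimal sketch: the data frame `toy` and its values are made up for illustration and are not taken from the water potability data.

```r
# Toy data frame (hypothetical values, not the real dataset)
toy <- data.frame(
  ph              = c(7.1, NA, 6.4, 8.0, NA),
  Sulfate         = c(330, 340, NA, 350, 320),
  Trihalomethanes = c(60, 70, 65, NA, 80)
)

# Share of missing values per column before cleaning
missing_before <- colMeans(is.na(toy))

# Mean imputation for ph and Sulfate
for (col in c("ph", "Sulfate")) {
  toy[[col]][is.na(toy[[col]])] <- mean(toy[[col]], na.rm = TRUE)
}

# Drop the rows where Trihalomethanes is still missing
toy_clean <- toy[!is.na(toy$Trihalomethanes), ]

# Share of missing values per column after cleaning
missing_after <- colMeans(is.na(toy_clean))
print(missing_before)
print(missing_after)
```

Mean imputation keeps every row but places all imputed values at a single point, which can show up as a spike in a histogram and as a narrower box in a boxplot.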

library(ggplot2)
#Histogram of PH
hist(WP$ph, xlab="ph", ylab="Number of water samples", main="Histogram of PH level in water potability", ylim=c(0,1500), xlim=c(0,14), col="pink")

Remark: The histogram shows the pH levels of the 3114 water samples that remain after cleaning, with nearly all values lying between 2 and 12. The majority of the pH values fall between 6 and 8, and the tallest bar occurs at a pH level of 7, where the count reaches its maximum of about 1200. The number of samples rises moderately as pH increases from 2 to 6 and falls sharply between pH 7 and 12.
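Part of the tall bar around pH 7 may come from the mean imputation, since every imputed value lands in the same bin. The sketch below illustrates that effect on simulated data; the vector `ph_toy`, the seed, and the number of missing values are assumptions chosen for illustration only.

```r
set.seed(1)  # arbitrary seed, for illustration only

# Simulated pH-like values with 150 of 1000 entries set to missing
ph_toy <- rnorm(1000, mean = 7, sd = 1.5)
ph_toy[sample.int(1000, 150)] <- NA

# Fixed bin edges so the two histograms are comparable
breaks <- seq(-5, 20, by = 1)

# Bin counts before imputation (hist() ignores the missing values)
counts_before <- hist(ph_toy, breaks = breaks, plot = FALSE)$counts

# Mean imputation, as done for ph in this report
ph_imputed <- ph_toy
ph_imputed[is.na(ph_imputed)] <- mean(ph_toy, na.rm = TRUE)
counts_after <- hist(ph_imputed, breaks = breaks, plot = FALSE)$counts

# All imputed values fall into one single bin
added <- counts_after - counts_before
print(added[added != 0])
```

In this toy case exactly one bin grows, and it grows by the number of imputed values, so the peak height reflects the imputation as well as the measured data.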

#Histogram of Hardness
hist(WP$Hardness, xlab="Hardness", ylab="Number of water samples", main="Histogram of Hardness in water potability", xlim=c(0,350), ylim=c(0,1000), col="darkblue")

Remark: According to this histogram, water hardness lies within the range of approximately 60 mg/L to 340 mg/L. The number of samples rises steadily as Hardness increases from 60 to 200, then declines slowly as Hardness increases from 200 to 340.

#Histogram of Solids
hist(WP$Solids, xlab="Solids", ylab="Number of water samples", main="Histogram of Solids in water potability", ylim=c(0,1000), xlim=c(0,70000), col="darkred")

Remark: The total dissolved solids values lie between 0 and 65,000 ppm, with most of the samples in the range of 10,000 ppm to 30,000 ppm. There are some outliers between 50,000 ppm and about 65,000 ppm, and no values are observed beyond 65,000 ppm.

#Histogram of Chloramines
hist(WP$Chloramines,xlab="Chloramines",ylab="Number of water samples",main="Histogram of Chloramines in water potability",ylim=c(0,1000), xlim=c(0,15),col="lightgreen")

Remark: The histogram of Chloramines shows values ranging from about 2.5 ppm to 14 ppm. Most chloramine levels in the water samples fall within the range of 5 ppm to 10 ppm, with the peak of the histogram occurring between 6 ppm and 8 ppm.

#Histogram of Sulfate
hist(WP$Sulfate,xlab="Sulfate",ylab="Number of water samples",main="Histogram of Sulfate in water potability",ylim=c(0,1500),xlim=c(0,500),col="brown")

Remark: This histogram shows the amount of Sulfate in the water samples, which is mostly distributed in the range from 110 mg/L to 500 mg/L. The count reaches a peak of about 1200 in the range of approximately 320 mg/L to 340 mg/L. This bin also contains the mean-imputed Sulfate values (about 333.8 mg/L), which contributes to the height of the peak.

#Histogram of Conductivity
hist(WP$Conductivity,xlab="Conductivity",ylab="Number of water samples",main="Histogram of Conductivity in water potability",ylim=c(0,1000),xlim=c(0,900),col="purple")

Remark: Conductivity values are spread unevenly between approximately 180 and 800, and the most frequent values sit near the middle of this range. The counts increase gradually over the lower half of the range and decrease sharply over the upper half.

#Histogram of Organic Carbon
hist(WP$Organic_carbon,xlab="Organic_carbon",ylab="Number of water samples",main="Histogram of Organic carbon in water potability",ylim=c(0,1000),xlim=c(0,30),col="yellow")

Remark: The histogram shows the Total Organic Carbon (TOC) levels in the water samples, which range from 2 to 24. The counts rise gradually from a TOC value of 2, peak at around 13 to 15, and then decline sharply towards 24.

#Histogram of Trihalomethanes
hist(WP$Trihalomethanes,xlab="Trihalomethanes",ylab="Number of water samples",main="Histogram of Trihalomethanes in water potability",ylim=c(0,1000),xlim=c(0,150),col="orange")

Remark: In the histogram, Trihalomethanes values are concentrated between 10 ppm and 130 ppm. Values between 60 and 70 ppm form the highest peak, followed by the ranges 70-80 ppm and 50-60 ppm. Most samples lie below or close to 80 ppm, which is considered the safe limit for drinking water.

#Histogram of Turbidity
hist(WP$Turbidity,xlab="Turbidity",ylab="Number of water samples",main="Histogram of Turbidity in water potability",ylim=c(0,1000),xlim=c(0,10),col="darkgreen")

Remark: The histogram of Turbidity indicates that the most common values fall within the 2.5 to 5.5 NTU range, leaning towards the lower end of the scale. Since the third quartile in both groups is about 4.5 NTU, most samples fall below the World Health Organization’s recommended maximum turbidity level of 5.00 NTU.

#Barplot of water potability
barplot(table(WP$Potability),xlab="Condition (0=Not Potable,1=Potable)",ylab="Number of water samples",main="Water Potability",col="#6633FF",ylim=c(0,2000))

Remark: Overall, the count of non-potable water samples (0) is higher than that of potable water samples (1). Specifically, there are approximately 2000 non-potable samples, whereas there are around 1300 potable samples.

library(tidyverse)

#Boxplot of PH
Summary<-boxplot(WP$ph~WP$Potability, xlab = "Potability", ylab = "pH", main="Boxplot of pH", col="pink")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                        0        1
## Min             3.664711 4.238283
## First Quartile  6.213951 6.350789
## Median          7.083338 7.083338
## Third Quartile  7.924813 7.776855
## Maximum        10.464502 9.900815

The interquartile range (IQR) of pH for Potability 0 is larger than that for Potability 1. In boxplot “0”, the outliers are densely clustered at the lower end and scattered towards the top, whereas in boxplot “1” they are distributed in the opposite pattern. The highest pH value overall is the most extreme upper outlier in boxplot “0”.
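Note that the rows labelled "Min" and "Maximum" in these tables come from `boxplot(...)$stats`, which returns the ends of the whiskers rather than the raw extremes, so points drawn as outliers are excluded. The sketch below shows the difference on a made-up vector `x`; it uses `boxplot.stats()`, which computes the same statistics without drawing a plot.

```r
# Hypothetical values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Same five statistics that boxplot() reports in $stats
bs <- boxplot.stats(x)

five <- bs$stats
# The second and fourth entries are hinges, which are close to the quartiles
names(five) <- c("Lower whisker", "First Quartile", "Median",
                 "Third Quartile", "Upper whisker")
print(five)

# Points beyond the whiskers are reported separately as outliers
print(bs$out)

# The raw minimum and maximum still include the outlier
print(range(x))
```

This is why the pH table above shows a smallest value near 3.7 even though the raw data contain pH values close to 0.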

#Boxplot of Hardness
Summary<-boxplot(WP$Hardness~WP$Potability, xlab = "Potability", ylab = "Hardness", main="Boxplot of Hardness", col="darkblue")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                       0        1
## Min            119.8858 110.9036
## First Quartile 177.6205 174.5861
## Median         197.2270 196.6589
## Third Quartile 216.2644 218.1079
## Maximum        273.8138 282.7390

As the histogram showed, the number of samples rises steadily as Hardness increases from 50 to 200 and then declines slowly from 200 to 350. The medians of the two groups are almost the same, at about 197, and the outliers in boxplot 0 are more concentrated and lie closer to the whiskers than those in boxplot 1.

#Boxplot for Solids
Summary<-boxplot(WP$Solids~WP$Potability, xlab = "Potability", ylab = "Solids", main="Boxplot of Solids", col="darkred")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                         0          1
## Min              320.9426   728.7508
## First Quartile 15754.9580 15618.1527
## Median         20743.3484 21153.3228
## Third Quartile 26912.8010 27822.4371
## Maximum        43536.0209 46113.9575

In general, the two boxplots are very similar, with the middle half of the Solids values for both groups lying between about 15,000 and 27,500. The medians are also close, both around 21,000. In both boxplots the outliers are concentrated on the upper side, above the upper whisker.

#Boxplot for Chloramines
Summary<-boxplot(WP$Chloramines~WP$Potability, xlab = "Potability", ylab = "Chloramines", main= "Boxplot for Chloramines", col="lightgreen")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                        0         1
## Min             3.331266  3.016033
## First Quartile  6.167605  6.106169
## Median          7.079934  7.217409
## Third Quartile  8.064453  8.220887
## Maximum        10.908687 11.302831

The boxplot of Chloramines reveals different patterns of outlier distribution. Non-potable water has evenly spaced outliers that lie close together, whereas the outliers for potable water are more dispersed and lie farther from the box. The most extreme values are found on the upper side of the plot.

#Boxplot for Sulfate
Summary<-boxplot(WP$Sulfate~WP$Potability, xlab = "Potability", ylab = "Sulfate", main="Boxplot for Sulfate", col="brown")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                       0        1
## Min            275.0909 251.0624
## First Quartile 318.7909 313.1232
## Median         333.7758 333.7758
## Third Quartile 348.0171 354.9282
## Maximum        391.6669 417.6024

The topmost point in the Sulfate boxplot is an outlier that lies notably higher than the rest of the data. Moreover, the summary shows that both groups share the same median (about 333.8), while sulfate levels in the potable group (1) are more varied than in the non-potable group (0): both the box and the whiskers are wider for group 1.

#Boxplot for Conductivity
Summary<-boxplot(WP$Conductivity~WP$Potability, xlab = "Potability", ylab = "Conductivity", main="Boxplot for Conductivity", col="purple")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                       0        1
## Min            210.3192 201.6197
## First Quartile 368.9256 360.9452
## Median         422.2238 421.3437
## Third Quartile 481.2514 485.5352
## Maximum        647.3499 657.5704

Boxplot 0 contains more outliers, and both boxplots display a symmetric distribution. No particular range of Conductivity values separates potable from non-potable water. The lower whisker, first quartile, and third quartile differ slightly between boxplots 0 and 1, while the medians are almost identical.

#Boxplot for Organic_Carbon
Summary<-boxplot(WP$Organic_carbon~WP$Potability, xlab = "Potability", ylab = "Organic_carbon", main="Boxplot for Organic_carbon", col="yellow")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                        0         1
## Min             5.362371  5.567693
## First Quartile 12.118597 12.002958
## Median         14.321874 14.137120
## Third Quartile 16.669315 16.302706
## Maximum        23.399516 22.641598

The box plots for TOC values show a similar distribution. However, box plot 0 contains more outliers on the higher end, while box plot 1 has a notable outlier on the lower end. The minimum, maximum, median, and first and third quartile values differ slightly between the two box plots.

#Boxplot for Trihalomethanes
Summary<-boxplot(WP$Trihalomethanes~WP$Potability, xlab = "Potability", ylab = "Trihalomethanes", main="Boxplot for Trihalomethanes", col="orange")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                        0         1
## Min             23.79295  24.53277
## First Quartile  55.70653  56.01425
## Median          66.54220  66.67821
## Third Quartile  77.27770  77.38098
## Maximum        108.58941 108.84957

Both boxplots exhibit a symmetric distribution with only minor differences in their summary statistics, making it difficult to identify a range of Trihalomethanes values that indicates potable water. Additionally, the many outliers may affect skewness estimates if further calculations are performed.

#Boxplot for Turbidity
Summary<-boxplot(WP$Turbidity~WP$Potability, xlab = "Potability", ylab = "Turbidity", main="Boxplot for Turbidity", col="darkgreen")$stats

colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
##                       0        1
## Min            1.910117 1.844372
## First Quartile 3.451784 3.433837
## Median         3.949015 3.959637
## Third Quartile 4.481162 4.512200
## Maximum        5.989543 6.083772

The values in the two boxplots are nearly identical, making it difficult to distinguish between the groups. Boxplot 0 has more outliers below the lower whisker than boxplot 1, and fewer outliers above the upper whisker.

- Training set (80%): contains the known Potability values that the model learns from.
- Testing set (20%): held-out data that provides a real-world check of how well the model performs on new data.

#Split data into training and testing set
library (caTools)
sample <-sample.split (WP$Potability, SplitRatio = 0.8)
train_data<-subset(WP, sample==TRUE)
test_data<-subset(WP,sample==FALSE)
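Because the split is random, results such as the accuracy change from run to run unless a seed is set. The sketch below shows one way to make a class-stratified 80/20 split reproducible in base R. The seed value 42 and the data frame `toy` are arbitrary assumptions, and this is not the code that produced the output in this report.

```r
set.seed(42)  # arbitrary seed, chosen only for illustration

# Made-up data: 60 non-potable and 40 potable samples
toy <- data.frame(
  Turbidity  = runif(100, 1, 7),
  Potability = factor(rep(c(0, 1), times = c(60, 40)))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$Potability)

# Draw 80% of the rows within each class
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Both sets keep the 60/40 class ratio
print(table(train_toy$Potability))
print(table(test_toy$Potability))
```

Sampling within each class keeps the class proportions the same in the training and testing sets, which matters here because the two classes are imbalanced.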

-> In this step, a logistic regression model with a binomial family is fitted on the training data, using all nine water properties as predictors of the binary Potability outcome.

#Fit logistic regression model on the training data
logistic_training_data <- glm(Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + Conductivity + Organic_carbon + Trihalomethanes + Turbidity, data = train_data, family = binomial)
summary(logistic_training_data)
## 
## Call:
## glm(formula = Potability ~ ph + Hardness + Solids + Chloramines + 
##     Sulfate + Conductivity + Organic_carbon + Trihalomethanes + 
##     Turbidity, family = binomial, data = train_data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -7.826e-02  7.122e-01  -0.110   0.9125  
## ph              -1.471e-02  2.821e-02  -0.522   0.6020  
## Hardness        -7.927e-04  1.253e-03  -0.633   0.5268  
## Solids           7.402e-06  4.772e-06   1.551   0.1208  
## Chloramines      4.124e-02  2.616e-02   1.576   0.1150  
## Sulfate         -9.445e-04  1.154e-03  -0.818   0.4132  
## Conductivity    -3.011e-04  5.111e-04  -0.589   0.5558  
## Organic_carbon  -2.178e-02  1.240e-02  -1.757   0.0789 .
## Trihalomethanes  3.279e-04  2.527e-03   0.130   0.8967  
## Turbidity        4.477e-02  5.298e-02   0.845   0.3981  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3337.5  on 2490  degrees of freedom
## Residual deviance: 3326.1  on 2481  degrees of freedom
## AIC: 3346.1
## 
## Number of Fisher Scoring iterations: 4
best.model=step(logistic_training_data)
## Start:  AIC=3346.15
## Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + 
##     Conductivity + Organic_carbon + Trihalomethanes + Turbidity
## 
##                   Df Deviance    AIC
## - Trihalomethanes  1   3326.2 3344.2
## - ph               1   3326.4 3344.4
## - Conductivity     1   3326.5 3344.5
## - Hardness         1   3326.5 3344.5
## - Sulfate          1   3326.8 3344.8
## - Turbidity        1   3326.9 3344.9
## <none>                 3326.1 3346.1
## - Solids           1   3328.5 3346.5
## - Chloramines      1   3328.6 3346.6
## - Organic_carbon   1   3329.2 3347.2
## 
## Step:  AIC=3344.16
## Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + 
##     Conductivity + Organic_carbon + Turbidity
## 
##                  Df Deviance    AIC
## - ph              1   3326.4 3342.4
## - Conductivity    1   3326.5 3342.5
## - Hardness        1   3326.6 3342.6
## - Sulfate         1   3326.8 3342.8
## - Turbidity       1   3326.9 3342.9
## <none>                3326.2 3344.2
## - Solids          1   3328.6 3344.6
## - Chloramines     1   3328.7 3344.7
## - Organic_carbon  1   3329.3 3345.3
## 
## Step:  AIC=3342.43
## Potability ~ Hardness + Solids + Chloramines + Sulfate + Conductivity + 
##     Organic_carbon + Turbidity
## 
##                  Df Deviance    AIC
## - Conductivity    1   3326.8 3340.8
## - Hardness        1   3326.9 3340.9
## - Sulfate         1   3327.1 3341.1
## - Turbidity       1   3327.2 3341.2
## <none>                3326.4 3342.4
## - Solids          1   3329.0 3343.0
## - Chloramines     1   3329.0 3343.0
## - Organic_carbon  1   3329.6 3343.6
## 
## Step:  AIC=3340.79
## Potability ~ Hardness + Solids + Chloramines + Sulfate + Organic_carbon + 
##     Turbidity
## 
##                  Df Deviance    AIC
## - Hardness        1   3327.2 3339.2
## - Sulfate         1   3327.5 3339.5
## - Turbidity       1   3327.6 3339.6
## <none>                3326.8 3340.8
## - Solids          1   3329.3 3341.3
## - Chloramines     1   3329.4 3341.4
## - Organic_carbon  1   3330.0 3342.0
## 
## Step:  AIC=3339.22
## Potability ~ Solids + Chloramines + Sulfate + Organic_carbon + 
##     Turbidity
## 
##                  Df Deviance    AIC
## - Sulfate         1   3327.8 3337.8
## - Turbidity       1   3328.0 3338.0
## <none>                3327.2 3339.2
## - Solids          1   3329.8 3339.8
## - Chloramines     1   3330.0 3340.0
## - Organic_carbon  1   3330.4 3340.4
## 
## Step:  AIC=3337.77
## Potability ~ Solids + Chloramines + Organic_carbon + Turbidity
## 
##                  Df Deviance    AIC
## - Turbidity       1   3328.6 3336.6
## <none>                3327.8 3337.8
## - Chloramines     1   3330.5 3338.5
## - Solids          1   3330.8 3338.8
## - Organic_carbon  1   3331.1 3339.1
## 
## Step:  AIC=3336.61
## Potability ~ Solids + Chloramines + Organic_carbon
## 
##                  Df Deviance    AIC
## <none>                3328.6 3336.6
## - Chloramines     1   3331.3 3337.3
## - Solids          1   3331.8 3337.8
## - Organic_carbon  1   3332.0 3338.0
summary (best.model)$coef
##                     Estimate   Std. Error   z value   Pr(>|z|)
## (Intercept)    -6.032954e-01 2.847053e-01 -2.119017 0.03408900
## Solids          8.347844e-06 4.699198e-06  1.776440 0.07566043
## Chloramines     4.302107e-02 2.607369e-02  1.649980 0.09894705
## Organic_carbon -2.265578e-02 1.237260e-02 -1.831125 0.06708194

-> This summary outlines important aspects to consider. Nine predictors were analyzed, and each coefficient is reported with its estimate, standard error, z value, and Pr(>|z|) (the p-value, i.e. the probability of observing a z value at least this extreme if the true coefficient were zero). If a coefficient has a p-value below the chosen significance level (commonly 0.05), it is viewed as statistically significant; here no predictor reaches that level, and Organic_carbon comes closest (p = 0.0789 in the full model). AIC measures the model’s goodness of fit while penalizing the number of parameters, so lower AIC values suggest a better model; the stepwise search lowers it from 3346.15 to 3336.61 and retains Solids, Chloramines, and Organic_carbon. Fisher Scoring iterations refer to the number of iterations the optimization algorithm needed to converge.
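Two quantities from the output above can be checked with simple arithmetic. For a logistic model fitted to individual 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients, and exponentiating a coefficient gives the odds ratio for a one-unit increase in that predictor. The sketch below only re-uses numbers printed above; the variable names are chosen for illustration.

```r
# Values copied from the summary() and step() output above
dev_full  <- 3326.1; k_full  <- 10  # intercept + 9 predictors
dev_final <- 3328.6; k_final <- 4   # intercept + 3 predictors

# AIC = residual deviance + 2 * number of coefficients
aic_full  <- dev_full  + 2 * k_full
aic_final <- dev_final + 2 * k_final
print(c(full = aic_full, final = aic_final))

# Coefficients of the final model, copied from summary(best.model)$coef
coefs <- c(Solids         =  8.347844e-06,
           Chloramines    =  4.302107e-02,
           Organic_carbon = -2.265578e-02)

# Odds ratio per one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

An odds ratio slightly above 1 (Chloramines) means the estimated odds of potability rise slightly per unit, and one slightly below 1 (Organic_carbon) means they fall. With p-values above 0.05, neither effect is statistically significant at that level.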

#Convert Potability to a factor variable in test_data
test_data$Potability <- as.factor(test_data$Potability)
#Make predictions on the test data
predictions <- predict(logistic_training_data, newdata = test_data, type = "response")
test_data$predictions<-round(predictions)
head(test_data,10)
## # A tibble: 10 × 11
##       ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon
##    <dbl>    <dbl>  <dbl>       <dbl>   <dbl>        <dbl>          <dbl>
##  1  8.32     214. 22018.        8.06    357.         363.           18.4
##  2  7.36     166. 32453.        7.55    327.         425.           15.6
##  3  7.97     219. 18768.        8.11    334.         364.           14.5
##  4  7.08     150. 27331.        6.84    299.         380.           19.4
##  5  9.18     274. 24041.        6.90    398.         478.           13.4
##  6  7.37     214. 25630.        4.43    336.         470.           12.5
##  7  7.08     216. 17107.        5.61    327.         436.           14.2
##  8  3.90     197. 21168.        7.00    334.         444.           16.6
##  9  5.40     141. 17267.       10.1     328.         473.           11.3
## 10  7.08     266. 26363.        7.70    395.         364.           10.3
## # ℹ 4 more variables: Trihalomethanes <dbl>, Turbidity <dbl>, Potability <fct>,
## #   predictions <dbl>
#Convert predictions to a factor with the same levels as Potability
#(setting the levels explicitly keeps both classes even if only one is predicted)
predictions <- factor(ifelse(predictions > 0.5, 1, 0), levels = levels(test_data$Potability))
#Evaluate model performance
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
conf_matrix <- confusionMatrix(predictions, test_data$Potability)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 378 244
##          1   0   1
##                                           
##                Accuracy : 0.6083          
##                  95% CI : (0.5688, 0.6469)
##     No Information Rate : 0.6067          
##     P-Value [Acc > NIR] : 0.4848          
##                                           
##                   Kappa : 0.0049          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.000000        
##             Specificity : 0.004082        
##          Pos Pred Value : 0.607717        
##          Neg Pred Value : 1.000000        
##              Prevalence : 0.606742        
##          Detection Rate : 0.606742        
##    Detection Prevalence : 0.998395        
##       Balanced Accuracy : 0.502041        
##                                           
##        'Positive' Class : 0               
## 

-> The statistical results above show the accuracy of the Logistic Regression model on the test set. For the run shown, the accuracy is 60.83%, only marginally above the no-information rate of 60.67%, and the model assigns almost every test sample to class 0 (specificity 0.41%). Comparing the model before and after cleaning (treating pH = 0 as missing), the accuracy was 60.98% on the full 3276 samples and 61.16% on the cleaned 3114 samples. The accuracy is therefore slightly higher after cleaning, although the difference is small relative to the 95% confidence interval reported above.
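The headline statistics can be recomputed by hand from the four counts in the confusion matrix. The sketch below copies those counts from the output above and uses base R only; the variable names are chosen for illustration, and class 0 is treated as the positive class, as in the `caret` output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(378, 0, 244, 1), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)

accuracy    <- sum(diag(cm)) / n               # correct predictions / all
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # positive class is 0
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / n            # always predict the majority

# Cohen's kappa: agreement beyond what chance alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced,
              kappa = kappa), 4))
```

An accuracy close to the no-information rate, together with a kappa near zero, indicates that the model does little better than always predicting the majority class.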

#CONCLUSION

Our group examined a water potability dataset with 9 attributes and 1 output variable, with 3276 observations. The target feature was the quality of the water, coded as 0 (non-potable) or 1 (potable). The goal was to examine the given characteristics in order to determine their impact on water quality. Using the Logistic Regression model, approximately 61% of the test samples were classified correctly. After observing the distributions of the features and taking their properties into account, we identified three features as the most influential: Solids, Conductivity, and Turbidity. The stepwise AIC selection, by contrast, retained Solids, Chloramines, and Organic_carbon. The pH = 0 values were believed to reflect an unusual water condition and were therefore treated as missing at the start. Both Group 0 and Group 1 have outliers, which suggests that extreme values can occur even in potable water.