→ The dataset has 9 attributes and 1 output variable (Potability), with 3276 observations. Three variables (ph, Sulfate and Trihalomethanes) contain missing data. Of the water samples, 1998 are non-potable and 1278 are potable.
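The counts quoted above can be checked with a few base R calls. The sketch below is only an illustration: `toy` is a small invented data frame standing in for `water_potability`, so the numbers it prints are not the real ones.

```r
# Toy stand-in for water_potability (all values are invented for illustration)
toy <- data.frame(
  ph         = c(7.1, NA, 6.5, 8.0, NA, 7.4),
  Sulfate    = c(330, 350, NA, 310, 345, 360),
  Potability = c(0, 1, 0, 0, 1, 1)
)

dims <- dim(toy)                        # number of observations and columns
class_counts <- table(toy$Potability)   # samples per Potability class
na_counts <- colSums(is.na(toy))        # missing values per column
vars_with_na <- names(na_counts)[na_counts > 0]

print(dims)
print(class_counts)
print(na_counts)
print(vars_with_na)
```

Running the same four calls on `water_potability` should give the dimensions, the class counts and the three variables with missing data described above.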
## Convert Potability to a factor variable
water_potability$Potability <- as.factor(water_potability$Potability)
## Splitting data based on target
potability_0 <- subset(water_potability, Potability == 0)
potability_1 <- subset(water_potability, Potability == 1)
## Computing statistical data for each subset
summary(potability_0 [,1:9])
## ph Hardness Solids Chloramines
## Min. : 0.000 Min. : 98.45 Min. : 320.9 Min. : 1.684
## 1st Qu.: 6.038 1st Qu.:177.82 1st Qu.:15663.1 1st Qu.: 6.156
## Median : 7.035 Median :197.12 Median :20809.6 Median : 7.090
## Mean : 7.085 Mean :196.73 Mean :21777.5 Mean : 7.092
## 3rd Qu.: 8.156 3rd Qu.:216.12 3rd Qu.:27006.2 3rd Qu.: 8.066
## Max. :14.000 Max. :304.24 Max. :61227.2 Max. :12.653
## NA's :314
## Sulfate Conductivity Organic_carbon Trihalomethanes
## Min. :203.4 Min. :181.5 Min. : 4.372 Min. : 0.738
## 1st Qu.:311.3 1st Qu.:368.5 1st Qu.:12.101 1st Qu.: 55.707
## Median :333.4 Median :422.2 Median :14.294 Median : 66.542
## Mean :334.6 Mean :426.7 Mean :14.364 Mean : 66.304
## 3rd Qu.:356.9 3rd Qu.:480.7 3rd Qu.:16.649 3rd Qu.: 77.278
## Max. :460.1 Max. :753.3 Max. :28.300 Max. :120.030
## NA's :488 NA's :107
## Turbidity
## Min. :1.450
## 1st Qu.:3.444
## Median :3.948
## Mean :3.966
## 3rd Qu.:4.496
## Max. :6.739
##
summary(potability_1 [,1:9])
## ph Hardness Solids Chloramines
## Min. : 0.2275 Min. : 47.43 Min. : 728.8 Min. : 0.352
## 1st Qu.: 6.1793 1st Qu.:174.33 1st Qu.:15669.0 1st Qu.: 6.094
## Median : 7.0368 Median :196.63 Median :21199.4 Median : 7.215
## Mean : 7.0738 Mean :195.80 Mean :22384.0 Mean : 7.169
## 3rd Qu.: 7.9331 3rd Qu.:218.00 3rd Qu.:27973.2 3rd Qu.: 8.199
## Max. :13.1754 Max. :323.12 Max. :56488.7 Max. :13.127
## NA's :177
## Sulfate Conductivity Organic_carbon Trihalomethanes
## Min. :129.0 Min. :201.6 Min. : 2.20 Min. : 8.176
## 1st Qu.:300.8 1st Qu.:360.9 1st Qu.:12.03 1st Qu.: 56.014
## Median :331.8 Median :420.7 Median :14.16 Median : 66.678
## Mean :332.6 Mean :425.4 Mean :14.16 Mean : 66.540
## 3rd Qu.:365.9 3rd Qu.:484.2 3rd Qu.:16.36 3rd Qu.: 77.381
## Max. :481.0 Max. :695.4 Max. :23.60 Max. :124.000
## NA's :293 NA's :55
## Turbidity
## Min. :1.492
## 1st Qu.:3.431
## Median :3.959
## Mean :3.968
## 3rd Qu.:4.510
## Max. :6.494
##
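The two `summary()` tables above are easier to compare when the per-group statistics sit side by side. One possible way, sketched here on an invented data frame `toy` (hypothetical, not the real data), is to split the numeric columns by class and compute one statistic per column:

```r
# Toy stand-in for water_potability (values are invented for illustration)
toy <- data.frame(
  Hardness   = c(180, NA, 220, 190, 210, 230),
  Turbidity  = c(3.5, 4.0, 4.5, 3.0, 4.0, 5.0),
  Potability = factor(c(0, 0, 0, 1, 1, 1))
)

num_cols <- c("Hardness", "Turbidity")

# One column per Potability class, one row per variable; NAs are ignored
group_medians <- sapply(
  split(toy[num_cols], toy$Potability),
  function(d) sapply(d, median, na.rm = TRUE)
)

print(group_medians)
```

Replacing `median` with `mean` or `IQR` gives the corresponding table for another statistic.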
## Checking missing values
WP <- water_potability
WP$ph[WP$ph == 0] <- NA
missing_rates <- colMeans(is.na(WP))
print(missing_rates)
## ph Hardness Solids Chloramines Sulfate
## 0.15018315 0.00000000 0.00000000 0.00000000 0.23840049
## Conductivity Organic_carbon Trihalomethanes Turbidity Potability
## 0.00000000 0.00000000 0.04945055 0.00000000 0.00000000
summary(missing_rates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04380 0.03709 0.23840
## Calculate the proportion of missing values for each variable
missing_proportion <- colMeans(is.na(water_potability))
## Replacing "NA" values in ph and Sulfate with the column mean
WP$ph[is.na(WP$ph)] <- mean(WP$ph, na.rm = TRUE)
WP$Sulfate[is.na(WP$Sulfate)] <- mean(WP$Sulfate, na.rm = TRUE)
## Dropping rows where Trihalomethanes is missing
cols_to_process <- c("Trihalomethanes")
for (col in cols_to_process) {WP <- WP[!is.na(WP[[col]]),]}
missing_rates_1 <- colMeans(is.na(WP))
summary(missing_rates_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
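→ No missing values remain: missing ph and Sulfate values were replaced with the column mean, and the 162 rows with missing Trihalomethanes were dropped, leaving 3114 observations.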
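The cleaning steps can be sketched end to end on a small example. The data frame `toy` and the helper `impute_mean()` are hypothetical names introduced for illustration; the logic mirrors the steps used on `WP` (recode pH = 0 as missing, fill ph and Sulfate with the mean, drop rows with missing Trihalomethanes).

```r
# Toy stand-in for WP (values are invented for illustration)
toy <- data.frame(
  ph              = c(0, 6.5, NA, 7.5, 8.0),
  Sulfate         = c(300, NA, 340, 360, 320),
  Trihalomethanes = c(60, 70, NA, 80, 65)
)

# Treat a pH of exactly 0 as missing; which() ignores NA comparisons
toy$ph[which(toy$ph == 0)] <- NA

# Hypothetical helper: replace NAs in a numeric vector with its mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy$ph <- impute_mean(toy$ph)
toy$Sulfate <- impute_mean(toy$Sulfate)

# Drop rows where Trihalomethanes is still missing
cleaned <- toy[!is.na(toy$Trihalomethanes), ]
remaining_na <- sum(is.na(cleaned))

print(cleaned)
print(remaining_na)
```

Mean imputation keeps every row but piles the filled-in values onto a single point, which is worth keeping in mind when reading the histograms and boxplots of ph and Sulfate.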
library(ggplot2)
#Histogram of PH
hist(WP$ph, xlab="ph", ylab="Number of water samples", main="Histogram of PH level in water potability", ylim=c(0,1500), xlim=c(0,14), col="pink")
Remark: The histogram shows the pH levels of the 3114 water samples that
remain after cleaning. Most of the visible bars lie between pH 2 and 12,
and the majority of values fall between 6 and 8. The number of samples
increases moderately from pH 2 to 6, reaches its maximum of about 1200
in the bin around pH 7, and then decreases markedly between pH 7 and 12.
Part of this peak comes from the missing pH values that were replaced
with the mean (about 7.08).
#Histogram of Hardness
hist(WP$Hardness, xlab="Hardness", ylab="Number of water samples", main="Histogram of Hardness in water potability", xlim=c(0,350), ylim=c(0,1000), col="darkblue")
Remark: According to this histogram, the hardness of the water lies
within the range of approximately 60 mg/L to 340 mg/L. The number of
water samples rises as Hardness increases from 60 to about 200, then
slowly declines as Hardness increases from 200 to 340.
#Histogram of Solids
hist(WP$Solids, xlab="Solids", ylab="Number of water samples", main="Histogram of Solids in water potability", ylim=c(0,1000), xlim=c(0,70000), col="darkred")
Remark: The total dissolved solids in the water samples lie between 0
and about 65,000 ppm. Most recorded values fall in the range of
10,000 ppm to 30,000 ppm. There are some outliers between 50,000 ppm
and about 65,000 ppm, and no values are observed beyond 65,000 ppm.
#Histogram of Chloramines
hist(WP$Chloramines,xlab="Chloramines",ylab="Number of water samples",main="Histogram of Chloramines in water potability",ylim=c(0,1000), xlim=c(0,15),col="lightgreen")
Remark: The histogram of Chloramines shows most of the visible bars
between roughly 2.5 ppm and 14 ppm. Most chloramine levels in the water
samples fall within the range of 5 ppm to 10 ppm, with the peak of the
histogram occurring between 6 ppm and 8 ppm.
#Histogram of Sulfate
hist(WP$Sulfate,xlab="Sulfate",ylab="Number of water samples",main="Histogram of Sulfate in water potability",ylim=c(0,1500),xlim=c(0,500),col="brown")
Remark: This histogram shows the amount of Sulfate in the water samples, which is mostly distributed in the range from 110 mg/L to 500 mg/L. The count reaches a peak of about 1200 in the bin at approximately 320 mg/L to 340 mg/L.
#Histogram of Conductivity
hist(WP$Conductivity,xlab="Conductivity",ylab="Number of water samples",main="Histogram of Conductivity in water potability",ylim=c(0,1000),xlim=c(0,900),col="purple")
Remark: Conductivity values are distributed between approximately 180
and 800, and the distribution is not symmetric. The tallest bars, which
together hold nearly half of the samples, lie in the middle of the
range. The counts increase gradually over the first half of the range,
then decrease sharply over the second half.
#Histogram of Organic Carbon
hist(WP$Organic_carbon,xlab="Organic_carbon",ylab="Number of water samples",main="Histogram of Organic carbon in water potability",ylim=c(0,1000),xlim=c(0,30),col="yellow")
Remark: The histogram presents the total organic carbon (TOC) levels in
the water samples, which range from about 2 to 24. The counts increase
gradually from 2, reach their peak at around 13 to 15, and then decline
sharply until 24.
#Histogram of Trihalomethanes
hist(WP$Trihalomethanes,xlab="Trihalomethanes",ylab="Number of water samples",main="Histogram of Trihalomethanes in water potability",ylim=c(0,1000),xlim=c(0,150),col="orange")
Remark: In the histogram, the Trihalomethanes values are concentrated
between 10 ppm and 130 ppm. The 60-70 ppm range has the highest count,
followed by the 70-80 ppm and 50-60 ppm ranges. Most samples fall below
80 ppm (the third quartile is about 77 ppm), which is considered the
safe drinking limit.
#Histogram of Turbidity
hist(WP$Turbidity,xlab="Turbidity",ylab="Number of water samples",main="Histogram of Turbidity in water potability",ylim=c(0,1000),xlim=c(0,10),col="darkgreen")
Remark: The histogram of Turbidity indicates that the most common values
fall within the 2.5 to 5.5 NTU range, with a trend towards the lower end
of the scale. Because the third quartile is about 4.5 NTU, at least
three-quarters of the samples meet the World Health Organization’s
recommended maximum turbidity level of 5.00 NTU.
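Whether most samples meet a limit can be checked directly rather than read off a histogram. The sketch below uses an invented vector `turbidity` (hypothetical values) together with the 5.00 NTU limit quoted in the remark; with the real data, `WP$Turbidity` would take its place.

```r
# Invented turbidity readings in NTU, for illustration only
turbidity <- c(2.8, 3.4, 3.9, 4.1, 4.5, 4.9, 5.2, 6.0)
limit_ntu <- 5.00

# mean() of a logical vector is the share of TRUE values
share_within <- mean(turbidity <= limit_ntu)

# The third quartile is a quick cross-check: if it is below the limit,
# at least 75% of the readings are below the limit as well
q3 <- quantile(turbidity, 0.75, names = FALSE)

print(share_within)
print(q3)
```

The same pattern applies to the 80 ppm Trihalomethanes limit mentioned earlier.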
#Barplot of water potability
barplot(table(WP$Potability),xlab="Condition (0=Not Potable,1=Potable)",ylab="Number of water samples",main="Water Potability",col="#6633FF",ylim=c(0,2000))
library (tidyverse)
Remark: Overall, the number of non-potable samples (0) is higher than the number of potable samples (1). Specifically, about 1900 samples are non-potable (1891 after dropping the rows with missing Trihalomethanes), whereas about 1200 samples are potable (1223).
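The class imbalance also sets a baseline for any classifier. As a small worked example, the sketch below takes the class counts of the full dataset quoted at the top of this document (1998 and 1278) and converts them to shares; the variable names are illustrative.

```r
# Class counts of the full dataset, as quoted at the top of this document
counts <- c(`0` = 1998, `1` = 1278)

shares <- prop.table(counts)       # each count divided by the total
majority_baseline <- max(shares)   # accuracy of always predicting the larger class

print(round(shares, 4))
print(round(majority_baseline, 4))
```

A model that always answers "non-potable" would already be right for about 61% of these samples, so a useful classifier has to beat that figure rather than 50%.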
#Boxplot of PH
Summary<-boxplot(WP$ph~WP$Potability, xlab = "Potability", ylab = "pH", main="Boxplot of pH", col="pink")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 3.664711 4.238283
## First Quartile 6.213951 6.350789
## Median 7.083338 7.083338
## Third Quartile 7.924813 7.776855
## Maximum 10.464502 9.900815
The interquartile range (IQR) of pH is larger for Potability = 0 (about 1.71) than for Potability = 1 (about 1.43). In boxplot “0”, the outliers are densely clustered at the lower end and scattered towards the top. Conversely, in boxplot “1”, the outliers are distributed in the opposite pattern. The highest pH value overall is the furthest upper outlier in boxplot “0”.
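One detail helps when reading the summary tables for the boxplots: the `$stats` component returned by `boxplot()` holds the whisker ends, not the true minimum and maximum, so the rows labelled "Min" and "Maximum" exclude outliers. A minimal sketch with an invented vector `x` shows the difference, using `boxplot.stats()`, which computes the same five numbers without drawing a plot:

```r
# Invented values with one low and one high extreme point
x <- c(1, 5, 6, 6.5, 7, 7, 7.5, 8, 8.5, 14)

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Points beyond 1.5 * IQR from the hinges are reported separately
print(bs$out)

# The true extremes include the outliers
print(range(x))
```

Here the whiskers end at 5 and 8.5 while the data range from 1 to 14, which is the same effect that makes the pH whiskers end near 3.7 and 10.5 even though the `summary()` output reports values up to 14.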
#Boxplot of Hardness
Summary<-boxplot(WP$Hardness~WP$Potability, xlab = "Potability", ylab = "Hardness", main="Boxplot of Hardness", col="darkblue")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 119.8858 110.9036
## First Quartile 177.6205 174.5861
## Median 197.2270 196.6589
## Third Quartile 216.2644 218.1079
## Maximum 273.8138 282.7390
As in the histogram, most Hardness values cluster around 200. The two boxplots are very similar: the medians are almost the same (about 197 for both groups), and both boxes span roughly 175 to 218. The outliers in boxplot 0 are more concentrated and lie closer to the whiskers than those in boxplot 1.
#Boxplot for Solids
Summary<-boxplot(WP$Solids~WP$Potability, xlab = "Potability", ylab = "Solids", main="Boxplot of Solids", col="darkred")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 320.9426 728.7508
## First Quartile 15754.9580 15618.1527
## Median 20743.3484 21153.3228
## Third Quartile 26912.8010 27822.4371
## Maximum 43536.0209 46113.9575
In general, the two boxplots are similar in spread, with the boxes for both groups extending from about 15,700 to around 27,000-27,800. The medians are also similar, both around 21,000. The outliers in boxplots 0 and 1 are concentrated above the upper whisker.
#Boxplot for Chloramines
Summary<-boxplot(WP$Chloramines~WP$Potability, xlab = "Potability", ylab = "Chloramines", main= "Boxplot for Chloramines", col="lightgreen")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 3.331266 3.016033
## First Quartile 6.167605 6.106169
## Median 7.079934 7.217409
## Third Quartile 8.064453 8.220887
## Maximum 10.908687 11.302831
The box plot of Chloramines reveals different patterns of outlier distribution. Non-potable water has evenly spaced outliers that are close together, whereas outliers in potable water are more dispersed and farther from the boxplot. The highest values are concentrated in the upper half of the graph.
#Boxplot for Sulfate
Summary<-boxplot(WP$Sulfate~WP$Potability, xlab = "Potability", ylab = "Sulfate", main="Boxplot for Sulfate", col="brown")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 275.0909 251.0624
## First Quartile 318.7909 313.1232
## Median 333.7758 333.7758
## Third Quartile 348.0171 354.9282
## Maximum 391.6669 417.6024
The topmost point of the box plot for Sulfate is an outlier that lies notably higher than the rest of the data. The two medians are identical (333.7758) because the missing Sulfate values were replaced with the mean. The box is wider for potable water (an IQR of about 41.8 versus 29.2) and its whiskers extend further, so Sulfate levels are more varied in the potable group than in the non-potable group.
#Boxplot for Conductivity
Summary<-boxplot(WP$Conductivity~WP$Potability, xlab = "Potability", ylab = "Conductivity", main="Boxplot for Conductivity", col="purple")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 210.3192 201.6197
## First Quartile 368.9256 360.9452
## Median 422.2238 421.3437
## Third Quartile 481.2514 485.5352
## Maximum 647.3499 657.5704
Boxplot 0 contains more outliers, and both boxplots display a roughly symmetric distribution. No particular range of Conductivity values separates potable from non-potable water. The medians are almost identical (422.2 and 421.3), while the lower whisker, first quartile, and third quartile differ slightly between boxplots 0 and 1.
#Boxplot for Organic_Carbon
Summary<-boxplot(WP$Organic_carbon~WP$Potability, xlab = "Potability", ylab = "Organic_carbon", main="Boxplot for Organic_carbon", col="yellow")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 5.362371 5.567693
## First Quartile 12.118597 12.002958
## Median 14.321874 14.137120
## Third Quartile 16.669315 16.302706
## Maximum 23.399516 22.641598
The box plots for TOC values show a similar distribution. However, box plot 0 contains more outliers on the higher end, while box plot 1 has a notable outlier on the lower end. The minimum, maximum, median, and first and third quartile values differ slightly between the two box plots.
#Boxplot for Trihalomethanes
Summary<-boxplot(WP$Trihalomethanes~WP$Potability, xlab = "Potability", ylab = "Trihalomethanes", main="Boxplot for Trihalomethanes", col="orange")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 23.79295 24.53277
## First Quartile 55.70653 56.01425
## Median 66.54220 66.67821
## Third Quartile 77.27770 77.38098
## Maximum 108.58941 108.84957
Both boxplots exhibit a roughly symmetric distribution, with only minor differences in their summary statistics, which makes it difficult to identify a range of values that indicates potable water. Additionally, the many outliers may affect the skewness if further calculations are performed.
#Boxplot for Turbidity
Summary<-boxplot(WP$Turbidity~WP$Potability, xlab = "Potability", ylab = "Turbidity", main="Boxplot for Turbidity", col="darkgreen")$stats
colnames(Summary)<-c("0","1")
rownames(Summary)<-c("Min","First Quartile","Median","Third Quartile","Maximum")
Summary
## 0 1
## Min 1.910117 1.844372
## First Quartile 3.451784 3.433837
## Median 3.949015 3.959637
## Third Quartile 4.481162 4.512200
## Maximum 5.989543 6.083772
The values in the two types of boxplots are nearly identical, making it difficult to distinguish between them. Additionally, there are more outliers below the lower whisker of box 0 compared to box 1, and there are fewer outliers above the upper whisker.
- Training set (80%): contains the actual outcome values that the model
learns to predict.
- Testing set (20%): held back to provide a real-world check of how
well the model performs on new data.
#Split data into training and testing set
library(caTools)
sample <- sample.split(WP$Potability, SplitRatio = 0.8)
train_data <- subset(WP, sample == TRUE)
test_data <- subset(WP, sample == FALSE)
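The split above draws a new random sample on every run, so the model output and accuracy change slightly each time. A sketch of a repeatable 80/20 split that keeps the class ratio in both sets is shown below, in base R only. The seed value, the data frame `toy` and the index names are assumptions made for illustration; the same effect can be had by calling `set.seed()` before `sample.split()`.

```r
set.seed(42)  # arbitrary seed (an assumption); any fixed value makes the split repeatable

# Toy stand-in for WP with 60 non-potable and 40 potable rows (invented data)
toy <- data.frame(
  x          = rnorm(100),
  Potability = factor(rep(c(0, 1), times = c(60, 40)))
)

# Draw 80% of the row indices within each class so both sets keep the class ratio
train_idx <- unlist(
  lapply(
    split(seq_len(nrow(toy)), toy$Potability),
    function(idx) sample(idx, size = round(0.8 * length(idx)))
  ),
  use.names = FALSE
)

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

print(table(train_toy$Potability))
print(table(test_toy$Potability))
```

Both sets keep the 60:40 class ratio of the toy data, which matters when one class is noticeably larger than the other.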
-> In this step, a logistic regression model (binomial family) is fitted on the training data, with the binary variable Potability as the response and all nine water properties as predictors.
#Fit logistic regression model on the training data
logistic_training_data <- glm(Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + Conductivity + Organic_carbon + Trihalomethanes + Turbidity, data = train_data, family = binomial)
summary(logistic_training_data)
##
## Call:
## glm(formula = Potability ~ ph + Hardness + Solids + Chloramines +
## Sulfate + Conductivity + Organic_carbon + Trihalomethanes +
## Turbidity, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.826e-02 7.122e-01 -0.110 0.9125
## ph -1.471e-02 2.821e-02 -0.522 0.6020
## Hardness -7.927e-04 1.253e-03 -0.633 0.5268
## Solids 7.402e-06 4.772e-06 1.551 0.1208
## Chloramines 4.124e-02 2.616e-02 1.576 0.1150
## Sulfate -9.445e-04 1.154e-03 -0.818 0.4132
## Conductivity -3.011e-04 5.111e-04 -0.589 0.5558
## Organic_carbon -2.178e-02 1.240e-02 -1.757 0.0789 .
## Trihalomethanes 3.279e-04 2.527e-03 0.130 0.8967
## Turbidity 4.477e-02 5.298e-02 0.845 0.3981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3337.5 on 2490 degrees of freedom
## Residual deviance: 3326.1 on 2481 degrees of freedom
## AIC: 3346.1
##
## Number of Fisher Scoring iterations: 4
#Stepwise selection by AIC, starting from the full model
best.model <- step(logistic_training_data)
## Start: AIC=3346.15
## Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
## Conductivity + Organic_carbon + Trihalomethanes + Turbidity
##
## Df Deviance AIC
## - Trihalomethanes 1 3326.2 3344.2
## - ph 1 3326.4 3344.4
## - Conductivity 1 3326.5 3344.5
## - Hardness 1 3326.5 3344.5
## - Sulfate 1 3326.8 3344.8
## - Turbidity 1 3326.9 3344.9
## <none> 3326.1 3346.1
## - Solids 1 3328.5 3346.5
## - Chloramines 1 3328.6 3346.6
## - Organic_carbon 1 3329.2 3347.2
##
## Step: AIC=3344.16
## Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
## Conductivity + Organic_carbon + Turbidity
##
## Df Deviance AIC
## - ph 1 3326.4 3342.4
## - Conductivity 1 3326.5 3342.5
## - Hardness 1 3326.6 3342.6
## - Sulfate 1 3326.8 3342.8
## - Turbidity 1 3326.9 3342.9
## <none> 3326.2 3344.2
## - Solids 1 3328.6 3344.6
## - Chloramines 1 3328.7 3344.7
## - Organic_carbon 1 3329.3 3345.3
##
## Step: AIC=3342.43
## Potability ~ Hardness + Solids + Chloramines + Sulfate + Conductivity +
## Organic_carbon + Turbidity
##
## Df Deviance AIC
## - Conductivity 1 3326.8 3340.8
## - Hardness 1 3326.9 3340.9
## - Sulfate 1 3327.1 3341.1
## - Turbidity 1 3327.2 3341.2
## <none> 3326.4 3342.4
## - Solids 1 3329.0 3343.0
## - Chloramines 1 3329.0 3343.0
## - Organic_carbon 1 3329.6 3343.6
##
## Step: AIC=3340.79
## Potability ~ Hardness + Solids + Chloramines + Sulfate + Organic_carbon +
## Turbidity
##
## Df Deviance AIC
## - Hardness 1 3327.2 3339.2
## - Sulfate 1 3327.5 3339.5
## - Turbidity 1 3327.6 3339.6
## <none> 3326.8 3340.8
## - Solids 1 3329.3 3341.3
## - Chloramines 1 3329.4 3341.4
## - Organic_carbon 1 3330.0 3342.0
##
## Step: AIC=3339.22
## Potability ~ Solids + Chloramines + Sulfate + Organic_carbon +
## Turbidity
##
## Df Deviance AIC
## - Sulfate 1 3327.8 3337.8
## - Turbidity 1 3328.0 3338.0
## <none> 3327.2 3339.2
## - Solids 1 3329.8 3339.8
## - Chloramines 1 3330.0 3340.0
## - Organic_carbon 1 3330.4 3340.4
##
## Step: AIC=3337.77
## Potability ~ Solids + Chloramines + Organic_carbon + Turbidity
##
## Df Deviance AIC
## - Turbidity 1 3328.6 3336.6
## <none> 3327.8 3337.8
## - Chloramines 1 3330.5 3338.5
## - Solids 1 3330.8 3338.8
## - Organic_carbon 1 3331.1 3339.1
##
## Step: AIC=3336.61
## Potability ~ Solids + Chloramines + Organic_carbon
##
## Df Deviance AIC
## <none> 3328.6 3336.6
## - Chloramines 1 3331.3 3337.3
## - Solids 1 3331.8 3337.8
## - Organic_carbon 1 3332.0 3338.0
summary (best.model)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.032954e-01 2.847053e-01 -2.119017 0.03408900
## Solids 8.347844e-06 4.699198e-06 1.776440 0.07566043
## Chloramines 4.302107e-02 2.607369e-02 1.649980 0.09894705
## Organic_carbon -2.265578e-02 1.237260e-02 -1.831125 0.06708194
-> This summary outlines important aspects to consider. For each of the nine predictors, the table reports the estimated coefficient, its standard error, the z value, and Pr(>|z|) (the p-value, i.e. the probability of observing a z value at least this extreme if the true coefficient were zero). A coefficient whose p-value is below the chosen significance level (commonly 0.05) is viewed as statistically significant; in the full model none of the predictors meets this threshold, and only Organic_carbon is below 0.1. AIC measures the model’s goodness of fit while penalizing the number of parameters; lower AIC values suggest a better model. Stepwise selection lowered the AIC from 3346.1 to 3336.6 and retained Solids, Chloramines and Organic_carbon. Fisher Scoring iterations refer to the number of iterations the optimization algorithm needed to converge.
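The AIC and deviance figures in the model summary can be reproduced by hand. The sketch below assumes the usual identity for a logistic regression on a 0/1 response, AIC = residual deviance + 2k, where k counts the estimated coefficients. It also adds a likelihood-ratio test of the full model against the intercept-only model; that test is an addition for illustration and is not part of the original analysis.

```r
# Figures copied from the summary() output of the full model
null_dev  <- 3337.5
null_df   <- 2490
resid_dev <- 3326.1
resid_df  <- 2481

k <- 10  # intercept plus nine predictors

# AIC for a 0/1 logistic regression: residual deviance plus 2 per coefficient
aic <- resid_dev + 2 * k

# Likelihood-ratio test: drop in deviance compared with a chi-squared distribution
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(aic)
print(c(statistic = lr_stat, df = lr_df, p_value = lr_p))
```

The computed AIC matches the 3346.1 in the output. The nine predictors together reduce the deviance by only 11.4, and the resulting p-value is above 0.05, which is consistent with no individual coefficient being significant.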
#Convert Potability to a factor variable in test_data
test_data$Potability <- as.factor(test_data$Potability)
#Make predictions on the test data
predictions <- predict(logistic_training_data, newdata = test_data, type = "response")
test_data$predictions<-round(predictions)
head(test_data,10)
## # A tibble: 10 × 11
## ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.32 214. 22018. 8.06 357. 363. 18.4
## 2 7.36 166. 32453. 7.55 327. 425. 15.6
## 3 7.97 219. 18768. 8.11 334. 364. 14.5
## 4 7.08 150. 27331. 6.84 299. 380. 19.4
## 5 9.18 274. 24041. 6.90 398. 478. 13.4
## 6 7.37 214. 25630. 4.43 336. 470. 12.5
## 7 7.08 216. 17107. 5.61 327. 436. 14.2
## 8 3.90 197. 21168. 7.00 334. 444. 16.6
## 9 5.40 141. 17267. 10.1 328. 473. 11.3
## 10 7.08 266. 26363. 7.70 395. 364. 10.3
## # ℹ 4 more variables: Trihalomethanes <dbl>, Turbidity <dbl>, Potability <fct>,
## # predictions <dbl>
#Convert predicted probabilities to a factor with the same levels as Potability
#(setting levels inside factor() keeps the labels correct even if only one class is predicted)
predictions <- factor(ifelse(predictions > 0.5, 1, 0), levels = levels(test_data$Potability))
#Evaluate model performance
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
conf_matrix <- confusionMatrix(predictions, test_data$Potability)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 378 244
## 1 0 1
##
## Accuracy : 0.6083
## 95% CI : (0.5688, 0.6469)
## No Information Rate : 0.6067
## P-Value [Acc > NIR] : 0.4848
##
## Kappa : 0.0049
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.000000
## Specificity : 0.004082
## Pos Pred Value : 0.607717
## Neg Pred Value : 1.000000
## Prevalence : 0.606742
## Detection Rate : 0.606742
## Detection Prevalence : 0.998395
## Balanced Accuracy : 0.502041
##
## 'Positive' Class : 0
##
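-> The statistical results above indicate the accuracy of the Logistic Regression model on the cleaned dataset. In the run shown, the accuracy is 60.83% on the 623 test samples, which is essentially equal to the no-information rate of 60.67%: the model labels almost every sample as non-potable and correctly identifies only 1 of the 245 potable samples (specificity 0.004, Kappa 0.0049). Comparing the model before and after cleaning, the accuracy was 60.98% on the original 3276 water samples and 61.16% on the 3114 samples that remain after treating pH = 0 as missing and dropping the rows with missing Trihalomethanes. Because the train/test split is random, accuracy varies slightly between runs, so this small difference indicates at most a marginal improvement from cleaning.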
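The headline statistics in the `confusionMatrix()` output follow directly from the four cell counts. The sketch below recomputes them in base R from the matrix printed above, treating class 0 as the positive class as caret does here; the variable names are illustrative.

```r
# Cell counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(
  c(378, 0, 244, 1),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                # correct predictions / all predictions
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # share of true class 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # share of true class 1 predicted as 1
nir         <- max(colSums(cm)) / n             # no-information rate (largest class share)

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir, kappa = cohen_kappa), 4))
```

The near-zero specificity and kappa show that the accuracy comes almost entirely from predicting the majority class.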
#CONCLUSION
Our group examined a water potability dataset with 9 attributes and 1 output variable, with 3276 observations. The target feature was the quality of the water, coded as 0 (non-potable) or 1 (potable). The goal was to examine the given characteristics in order to determine their impact on the water quality. Using the Logistic Regression model, approximately 60.8% of the test samples were classified correctly, compared with a no-information rate of 60.67%. After observing the distributions of the features, we identified three features as the most influential: Solids, Conductivity and Turbidity. The AIC-based stepwise selection, however, retained Solids, Chloramines and Organic_carbon, none of which is significant at the 0.05 level. The pH = 0 values were believed to indicate an unusual water condition and were therefore treated as missing at the start. Both Group 0 and Group 1 have outliers, which suggests that extreme values can occur even in potable water.