Introduction

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variability (information) as possible in the data. It transforms the original variables into a new set of uncorrelated variables called principal components (PCs), which are linear combinations of the original variables.
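As a minimal sketch of this definition, the snippet below runs PCA on R's built-in `USArrests` data (a stand-in chosen for illustration, not the water dataset) and checks the two properties named above: each component score is a linear combination of the standardized variables, and the scores are mutually uncorrelated.

```r
# Toy illustration on the built-in USArrests data (not the water dataset)
toy <- scale(USArrests)                       # center and scale each variable
toy_pca <- prcomp(toy, center = FALSE, scale. = FALSE)

# Scores are the standardized data multiplied by the loading matrix,
# i.e. every PC is a linear combination of the original variables
manual_scores <- toy %*% toy_pca$rotation

# Off-diagonal correlations between the scores are zero up to rounding
score_cor <- cor(toy_pca$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

print(round(score_cor, 3))
```

The columns of `toy_pca$rotation` hold the weights of each linear combination, which is what the later contribution plots summarise.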

In this study, I try to better understand the data and the dependencies between variables, and to remove the less significant components in order to reduce noise in the dataset before predicting water quality.

Data Preparation

The dataset used for this analysis is Water Quality and Potability, provided by Kaggle (https://www.kaggle.com/datasets/uom190346a/water-quality-and-potability). It contains water quality indices including pH, hardness, solids, and chloramines.

The dataset contains 10 variables:
- pH: The pH level of the water.
- Hardness: Water hardness, a measure of mineral content.
- Solids: Total dissolved solids in the water.
- Chloramines: Chloramines concentration in the water.
- Sulfate: Sulfate concentration in the water.
- Conductivity: Electrical conductivity of the water.
- Organic_carbon: Organic carbon content in the water.
- Trihalomethanes: Trihalomethanes concentration in the water.
- Turbidity: Turbidity level, a measure of water clarity.
- Potability: Target variable; indicates water potability with values 1 (potable) and 0 (not potable).

# Packages & Libraries
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(corrplot,clusterSim,GGally,factoextra,gridExtra,devtools,ggbiplot)
# Import dataset
dataset<-read.csv("water_potability.csv", sep=",", dec=".", header=TRUE)
data <- as.matrix(dataset[1:9])
# Checking data
head(data)
##            ph Hardness   Solids Chloramines  Sulfate Conductivity
## [1,]       NA 204.8905 20791.32    7.300212 368.5164     564.3087
## [2,] 3.716080 129.4229 18630.06    6.635246       NA     592.8854
## [3,] 8.099124 224.2363 19909.54    9.275884       NA     418.6062
## [4,] 8.316766 214.3734 22018.42    8.059332 356.8861     363.2665
## [5,] 9.092223 181.1015 17978.99    6.546600 310.1357     398.4108
## [6,] 5.584087 188.3133 28748.69    7.544869 326.6784     280.4679
##      Organic_carbon Trihalomethanes Turbidity
## [1,]      10.379783        86.99097  2.963135
## [2,]      15.180013        56.32908  4.500656
## [3,]      16.868637        66.42009  3.055934
## [4,]      18.436524       100.34167  4.628771
## [5,]      11.558279        31.99799  4.075075
## [6,]       8.399735        54.91786  2.559708
dim(data)
## [1] 3276    9
summary(data)
##        ph            Hardness          Solids         Chloramines    
##  Min.   : 0.000   Min.   : 47.43   Min.   :  320.9   Min.   : 0.352  
##  1st Qu.: 6.093   1st Qu.:176.85   1st Qu.:15666.7   1st Qu.: 6.127  
##  Median : 7.037   Median :196.97   Median :20927.8   Median : 7.130  
##  Mean   : 7.081   Mean   :196.37   Mean   :22014.1   Mean   : 7.122  
##  3rd Qu.: 8.062   3rd Qu.:216.67   3rd Qu.:27332.8   3rd Qu.: 8.115  
##  Max.   :14.000   Max.   :323.12   Max.   :61227.2   Max.   :13.127  
##  NA's   :491                                                         
##     Sulfate       Conductivity   Organic_carbon  Trihalomethanes  
##  Min.   :129.0   Min.   :181.5   Min.   : 2.20   Min.   :  0.738  
##  1st Qu.:307.7   1st Qu.:365.7   1st Qu.:12.07   1st Qu.: 55.845  
##  Median :333.1   Median :421.9   Median :14.22   Median : 66.622  
##  Mean   :333.8   Mean   :426.2   Mean   :14.28   Mean   : 66.396  
##  3rd Qu.:360.0   3rd Qu.:481.8   3rd Qu.:16.56   3rd Qu.: 77.337  
##  Max.   :481.0   Max.   :753.3   Max.   :28.30   Max.   :124.000  
##  NA's   :781                                     NA's   :162      
##    Turbidity    
##  Min.   :1.450  
##  1st Qu.:3.440  
##  Median :3.955  
##  Mean   :3.967  
##  3rd Qu.:4.500  
##  Max.   :6.739  
## 

The dataset contains many NA values (491 in ph, 781 in Sulfate and 162 in Trihalomethanes), so I use complete.cases to remove every row that has at least one NA value.
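Because this kind of row-wise deletion can discard a large share of the observations, it can help to count what is lost before filtering. The sketch below does this on a small hypothetical matrix (the values are invented for illustration and are not taken from the water dataset).

```r
# Hypothetical 4 x 3 matrix with missing values (not the water dataset)
toy <- matrix(c(7.1, NA, 6.8, 7.4,
                200, 180, NA, 210,
                3.9, 4.1, 4.0, 3.7),
              ncol = 3,
              dimnames = list(NULL, c("ph", "Hardness", "Turbidity")))

na_per_column <- colSums(is.na(toy))    # missing values in each column
keep <- complete.cases(toy)             # TRUE for rows without any NA
toy_clean <- toy[keep, , drop = FALSE]  # drop = FALSE keeps the matrix shape

print(na_per_column)
cat("rows kept:", sum(keep), "of", nrow(toy), "\n")
```

The same two checks, `colSums(is.na(...))` and `sum(complete.cases(...))`, can be applied to the real matrix before the filtering step.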

data <- data[complete.cases(data), ]
any(is.na(data))
## [1] FALSE
# Check the relationship among variables
ggpairs(data)

corrplot(cor(data), method = "circle", order="hclust")

Most variables have very low correlation coefficients (Corr close to 0), suggesting weak linear relationships between them. The strongest correlation observed is Sulfate vs. Hardness (Corr: 0.163**), which is still relatively weak.
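Reading the strongest pair off a plot is error-prone, so it may be worth extracting it from the correlation matrix directly. The helper `strongest_pair` below is a hypothetical name introduced for this sketch; it is demonstrated on the built-in `USArrests` data rather than on the water dataset.

```r
# Hypothetical helper: return the pair of variables with the largest
# absolute off-diagonal correlation
strongest_pair <- function(x) {
  cm <- cor(x)
  cm[lower.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
  idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(var1 = rownames(cm)[idx[1]],
       var2 = colnames(cm)[idx[2]],
       corr = cm[idx[1], idx[2]])
}

# Demonstration on the built-in USArrests data (not the water dataset)
res <- strongest_pair(USArrests)
print(res)
```

Using the absolute value matters here, since a strong negative correlation is as relevant to PCA as a strong positive one.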

Principal Component Analysis (PCA)

pca<- prcomp(data, center = TRUE, scale. = TRUE)
fviz_eig(pca, choice='eigenvalue')

According to the Kaiser rule (Kaiser criterion), only principal components with eigenvalues > 1 are retained. Based on the graph, PC1 to PC5 can be retained.
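The rule can also be checked numerically instead of visually. The sketch below hard-codes the standard deviations that `summary(pca)` prints in this report and squares them to obtain the eigenvalues; with the fitted object at hand, `pca$sdev^2` gives the same quantities.

```r
# Standard deviations of PC1..PC9 as printed by summary(pca) in this report
sdev <- c(1.0986, 1.0819, 1.0227, 1.0053, 1.0025,
          0.9852, 0.9754, 0.93462, 0.87492)

eigenvalues <- sdev^2                 # eigenvalue = variance of each PC
retained <- which(eigenvalues > 1)    # Kaiser criterion: keep eigenvalues > 1

print(round(eigenvalues, 3))
print(retained)
```

Note that the eigenvalues of PC4 and PC5 are only marginally above 1, so the cut-off between PC5 and PC6 is a narrow one.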

fviz_eig(pca)

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6    PC7     PC8
## Standard deviation     1.0986 1.0819 1.0227 1.0053 1.0025 0.9852 0.9754 0.93462
## Proportion of Variance 0.1341 0.1300 0.1162 0.1123 0.1117 0.1078 0.1057 0.09706
## Cumulative Proportion  0.1341 0.2642 0.3804 0.4927 0.6043 0.7122 0.8179 0.91495
##                            PC9
## Standard deviation     0.87492
## Proportion of Variance 0.08505
## Cumulative Proportion  1.00000

In my view, six or seven components should be selected, based on the scree plot and the variance they explain. The first six components account for only about 71% of the variance, while adding the seventh raises the cumulative share to about 82%. Therefore, the seventh component is also considered in the study.
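The choice of the number of components can be made explicit with a cumulative-variance threshold. The sketch below reuses the standard deviations printed by `summary(pca)`; the 80% cut-off is an assumption made for illustration, not a value stated in this analysis.

```r
# Standard deviations of PC1..PC9 as printed by summary(pca) in this report
sdev <- c(1.0986, 1.0819, 1.0227, 1.0053, 1.0025,
          0.9852, 0.9754, 0.93462, 0.87492)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var <- cumsum(prop_var)        # cumulative proportion

threshold <- 0.80                  # assumed cut-off, for illustration only
n_components <- which(cum_var >= threshold)[1]

print(round(cum_var, 4))
print(n_components)
```

With this threshold the first component count that qualifies is seven, which matches the cumulative proportions in the summary table (0.7122 after six components, 0.8179 after seven).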

fviz_pca_var(pca, col.var = "blue")

The variables “Hardness”, “Solids”, “Sulfate” and “pH” are located far from the origin, indicating that they are strongly correlated with the first two principal components. This means that they contribute more to the variation captured by these components and are better represented on this plane than the variables whose arrows lie close to the origin.
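The distance of a variable from the origin in such a plot can be computed by hand, which makes the interpretation less dependent on eyeballing arrows. The sketch below assumes a PCA on standardized variables and uses the built-in `USArrests` data as a stand-in for the water dataset.

```r
# Toy illustration on the built-in USArrests data (not the water dataset)
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the PC standard deviations.
# For standardized data these equal the variable-PC correlations.
var_coord <- sweep(toy_pca$rotation, 2, toy_pca$sdev, `*`)

cos2 <- var_coord^2                           # quality of representation
arrow_length <- sqrt(cos2[, 1] + cos2[, 2])   # distance from origin on PC1-PC2

print(round(arrow_length, 3))
```

An arrow length close to 1 means the variable is almost fully described by the two plotted components, while a short arrow means most of its variance lies on components that are not shown.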

Contributions of individual variables to PC

var <- get_pca_var(pca)
a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
f<-fviz_contrib(pca, "var",axes = 6)
g<-fviz_contrib(pca, "var",axes = 7)
grid.arrange(a,b,c,d,e,f,g,top='Contribution to the Principal Components')

Some variables appear with high contributions in several components:
- Solids (Dim-1)
- Hardness (Dim-1, Dim-2)
- Sulfate (Dim-1, Dim-2)
- Chloramines (Dim-3, Dim-6, Dim-7)
- Trihalomethanes (Dim-4, Dim-5)
- Turbidity (Dim-4, Dim-7)

The principal components therefore reflect different aspects of the data:
- Component Dim-1 focuses on the total solids and pH factors.
- Components Dim-3 and Dim-6 focus on the compound bases.
- Components Dim-4 and Dim-7 are related to turbidity and disinfectants.
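The contributions shown in such plots can be reproduced with base R, which is useful for listing the dominant variables per component programmatically. The sketch below uses the built-in `USArrests` data as a stand-in and relies on the fact that each loading vector has unit length, so the squared loadings of one component sum to 1.

```r
# Toy illustration on the built-in USArrests data (not the water dataset)
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (in %) of each variable to each component
contrib <- 100 * toy_pca$rotation^2

# Share each variable would have if contributions were uniform
expected <- 100 / nrow(contrib)

# Variables contributing more than the uniform share, per component
above_average <- contrib > expected

print(round(contrib, 1))
print(above_average)
```

For the nine water-quality variables the uniform share would be 100 / 9, roughly 11.1%, which is the natural baseline for calling a contribution high.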

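# PCA with distinction of classes: 1 (potable) and 0 (not potable)
# Keep only the rows without missing values
dataset <- dataset[complete.cases(dataset), ]
# Note: this PCA is fitted on all 10 columns, so Potability itself enters the components
pca1 <- prcomp(dataset, center = TRUE, scale. = TRUE)
# Biplot of PC1 vs PC2, with points coloured by potability class
ggbiplot(pca1, obs.scale=1, var.scale=1, groups=as.factor(dataset$Potability), ellipse=TRUE, circle = TRUE)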

Overall, the data points belonging to groups 0 and 1 do not exhibit a clear separation within the space defined by the first two principal components (PC1 and PC2). These two groups significantly overlap, indicating no distinct differences between potable and non-potable water based solely on the first two principal components.

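This observation suggests that PC1 (accounting for 12.1% of the variance) and PC2 (accounting for 11.7% of the variance) are insufficient to effectively distinguish between the two groups. These shares are lower than in the earlier summary, presumably because this second PCA is fitted on all 10 columns, including Potability.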
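The degree of overlap can be expressed as a number rather than judged from the plot alone. The sketch below is one simple option, not the method used in this analysis: it computes the difference between class means on each component, divided by the overall standard deviation of that component. The data are randomly generated, and the class label is deliberately kept out of the PCA input.

```r
set.seed(1)
# Hypothetical data: 200 rows, 4 uncorrelated features, random 0/1 label
features <- matrix(rnorm(200 * 4), ncol = 4)
label <- rbinom(200, 1, 0.4)

# Scores on the first two components (label is not part of the PCA input)
scores <- prcomp(features, center = TRUE, scale. = TRUE)$x[, 1:2]

# Standardized difference between the class means on each component
std_diff <- apply(scores, 2, function(s) {
  m <- tapply(s, label, mean)
  unname(abs(m["1"] - m["0"]) / sd(s))
})

print(round(std_diff, 3))
```

Values far below 1 mean the class centroids differ by only a fraction of a standard deviation, which corresponds to heavily overlapping ellipses in a biplot.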

Conclusion

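The variables are only weakly correlated, which makes PCA ineffective here: every component explains a similar, small share of the variance, so few dimensions can be dropped without losing information. This shows how strongly the usefulness of PCA depends on the nature of the initial data and on the correlation between the variables.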
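This dependence on correlation can be illustrated with simulated data. The sketch below compares the variance share of the first component for five independent variables and for five variables driven by a common factor; all values are randomly generated, and the helper `first_pc_share` is a hypothetical name introduced for this example.

```r
set.seed(42)
n <- 1000
p <- 5

# Five independent variables
uncorrelated <- matrix(rnorm(n * p), ncol = p)

# Five variables sharing one common factor plus individual noise
common <- rnorm(n)
correlated <- common + 0.5 * matrix(rnorm(n * p), ncol = p)

# Hypothetical helper: share of total variance explained by PC1
first_pc_share <- function(x) {
  s <- prcomp(x, center = TRUE, scale. = TRUE)$sdev
  s[1]^2 / sum(s^2)
}

share_uncorrelated <- first_pc_share(uncorrelated)
share_correlated <- first_pc_share(correlated)

cat("PC1 share, uncorrelated:", round(share_uncorrelated, 3), "\n")
cat("PC1 share, correlated:  ", round(share_correlated, 3), "\n")
```

For uncorrelated variables the first component stays close to the uniform share of 1 / p. The water data behave similarly: PC1 explains 13.4% of the variance, only slightly above the 1 / 9 (about 11.1%) that a single standardized variable carries.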