Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variability (information) as possible in the data. It transforms the original variables into a new set of uncorrelated variables called principal components (PCs), which are linear combinations of the original variables.
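To make the definition concrete, here is a minimal sketch on simulated data. The matrix `x` and its three columns are made up for illustration and are not taken from the water dataset. It checks that the component scores are linear combinations of the standardized variables, that the components are uncorrelated, and that their variances equal the eigenvalues of the correlation matrix.

```r
# Minimal sketch on simulated data; `x` is hypothetical, not the water dataset.
set.seed(1)
n <- 200
a <- rnorm(n)
x <- cbind(v1 = a + rnorm(n, sd = 0.5),   # v1 and v2 share the signal `a`
           v2 = a + rnorm(n, sd = 0.5),
           v3 = rnorm(n))                 # v3 is independent noise

p <- prcomp(x, center = TRUE, scale. = TRUE)

# Each PC is a linear combination of the standardized variables
scores <- scale(x) %*% p$rotation

# The PCs are uncorrelated: off-diagonal entries are numerically zero
round(cor(p$x), 10)

# Variances of the PCs equal the eigenvalues of the correlation matrix
eig <- eigen(cor(x))$values
cbind(prcomp = p$sdev^2, eigen = eig)
```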
In this analysis, I attempt to better understand the data and the dependencies between variables, and to remove less significant components in order to reduce noise in the dataset before predicting water quality.
The dataset used for this analysis is Water Quality and Potability, provided by Kaggle (https://www.kaggle.com/datasets/uom190346a/water-quality-and-potability). It contains water quality indicators such as pH, hardness, solids and chloramines.
The dataset contains 10 variables:
- pH: The pH level of the water.
- Hardness: Water hardness, a measure of mineral content.
- Solids: Total dissolved solids in the water.
- Chloramines: Chloramines concentration in the water.
- Sulfate: Sulfate concentration in the water.
- Conductivity: Electrical conductivity of the water.
- Organic_carbon: Organic carbon content in the water.
- Trihalomethanes: Trihalomethanes concentration in the water.
- Turbidity: Turbidity level, a measure of water clarity.
- Potability: Target variable; indicates water potability with values 1 (potable) and 0 (not potable).
# Packages & Libraries
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(corrplot,clusterSim,GGally,factoextra,gridExtra,devtools,ggbiplot)
# Import dataset
dataset<-read.csv("water_potability.csv", sep=",", dec=".", header=TRUE)
data <- as.matrix(dataset[1:9])
# Checking data
head(data)
##            ph Hardness   Solids Chloramines  Sulfate Conductivity
## [1,]       NA 204.8905 20791.32    7.300212 368.5164     564.3087
## [2,] 3.716080 129.4229 18630.06    6.635246       NA     592.8854
## [3,] 8.099124 224.2363 19909.54    9.275884       NA     418.6062
## [4,] 8.316766 214.3734 22018.42    8.059332 356.8861     363.2665
## [5,] 9.092223 181.1015 17978.99    6.546600 310.1357     398.4108
## [6,] 5.584087 188.3133 28748.69    7.544869 326.6784     280.4679
##      Organic_carbon Trihalomethanes Turbidity
## [1,]      10.379783        86.99097  2.963135
## [2,]      15.180013        56.32908  4.500656
## [3,]      16.868637        66.42009  3.055934
## [4,]      18.436524       100.34167  4.628771
## [5,]      11.558279        31.99799  4.075075
## [6,]       8.399735        54.91786  2.559708
dim(data)
## [1] 3276 9
summary(data)
##        ph            Hardness          Solids         Chloramines
##  Min.   : 0.000   Min.   : 47.43   Min.   :  320.9   Min.   : 0.352
##  1st Qu.: 6.093   1st Qu.:176.85   1st Qu.:15666.7   1st Qu.: 6.127
##  Median : 7.037   Median :196.97   Median :20927.8   Median : 7.130
##  Mean   : 7.081   Mean   :196.37   Mean   :22014.1   Mean   : 7.122
##  3rd Qu.: 8.062   3rd Qu.:216.67   3rd Qu.:27332.8   3rd Qu.: 8.115
##  Max.   :14.000   Max.   :323.12   Max.   :61227.2   Max.   :13.127
##  NA's   :491
##     Sulfate       Conductivity   Organic_carbon  Trihalomethanes
##  Min.   :129.0   Min.   :181.5   Min.   : 2.20   Min.   :  0.738
##  1st Qu.:307.7   1st Qu.:365.7   1st Qu.:12.07   1st Qu.: 55.845
##  Median :333.1   Median :421.9   Median :14.22   Median : 66.622
##  Mean   :333.8   Mean   :426.2   Mean   :14.28   Mean   : 66.396
##  3rd Qu.:360.0   3rd Qu.:481.8   3rd Qu.:16.56   3rd Qu.: 77.337
##  Max.   :481.0   Max.   :753.3   Max.   :28.30   Max.   :124.000
##  NA's   :781                                     NA's   :162
##    Turbidity
##  Min.   :1.450
##  1st Qu.:3.440
##  Median :3.955
##  Mean   :3.967
##  3rd Qu.:4.500
##  Max.   :6.739
##
The dataset contains many NA values, so I use complete.cases() to remove every row that has at least one NA.
data <- data[complete.cases(data), ]
any(is.na(data))
## [1] FALSE
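Because listwise deletion drops a whole row for a single missing value, it can be worth checking how many rows are lost before continuing. The following is a small sketch on a made-up matrix `toy` (a hypothetical stand-in for `data`, with invented values); the same calls should work unchanged on the real matrix before it is filtered.

```r
# Hypothetical toy matrix standing in for `data`; the values are made up.
toy <- cbind(ph      = c(7.1, NA, 6.8, 8.0),
             Sulfate = c(350, 340, NA, 330),
             Solids  = c(20000, 18000, 21000, 19000))

colSums(is.na(toy))                     # missing values per column
keep <- complete.cases(toy)             # TRUE for rows without any NA
toy_clean <- toy[keep, , drop = FALSE]  # keep only the complete rows
nrow(toy) - nrow(toy_clean)             # number of rows dropped
```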
# Check the relationship among variables
ggpairs(data)
corrplot(cor(data), method = "circle", order="hclust")
Most variables have very low correlation coefficients (Corr close to 0), suggesting weak linear relationships between them. The strongest correlation observed is Sulfate vs. Hardness (Corr: 0.163**), which is still relatively weak.
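Reading the strongest pair off a plot is error-prone, so the ranking can also be computed directly. The sketch below uses a hypothetical helper, `strongest_pairs`, and simulated data (`toy`); neither is part of the original analysis.

```r
# Sketch: rank variable pairs by absolute correlation.
# `strongest_pairs` is a hypothetical helper; `toy` is simulated data.
strongest_pairs <- function(m, n = 3) {
  r <- cor(m)
  idx <- which(upper.tri(r), arr.ind = TRUE)   # visit each pair once
  out <- data.frame(var1 = rownames(r)[idx[, "row"]],
                    var2 = colnames(r)[idx[, "col"]],
                    corr = r[idx])
  out <- out[order(-abs(out$corr)), ]          # strongest first, sign kept
  head(out, n)
}

set.seed(42)
a <- rnorm(500)
toy <- cbind(x1 = a, x2 = a + rnorm(500), x3 = rnorm(500))
top <- strongest_pairs(toy, n = 1)
top
```

Applied to the cleaned matrix, a call such as `strongest_pairs(data)` would list the top three pairs together with the sign of each coefficient.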
pca<- prcomp(data, center = TRUE, scale. = TRUE)
fviz_eig(pca, choice='eigenvalue')
According to the Kaiser rule (Kaiser criterion), only principal components with eigenvalues greater than 1 are retained. Based on the graph, PC1 to PC5 can be retained.
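The rule can also be checked numerically, since each eigenvalue is the square of the corresponding component's standard deviation. The sketch below copies the standard deviations from the `summary(pca)` output in this document instead of refitting the model; with the fitted object, `pca$sdev^2` should give the same values.

```r
# Standard deviations copied from the summary(pca) output in this document
sdev <- c(1.0986, 1.0819, 1.0227, 1.0053, 1.0025,
          0.9852, 0.9754, 0.93462, 0.87492)

eigenvalues <- sdev^2              # eigenvalue = variance of each component
round(eigenvalues, 3)

kaiser <- which(eigenvalues > 1)   # indices of components that pass the rule
kaiser
```

The eigenvalues of PC4 and PC5 (about 1.011 and 1.005) clear the threshold only narrowly, so the cut-off at five components is not a sharp one.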
fviz_eig(pca)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6    PC7     PC8
## Standard deviation     1.0986 1.0819 1.0227 1.0053 1.0025 0.9852 0.9754 0.93462
## Proportion of Variance 0.1341 0.1300 0.1162 0.1123 0.1117 0.1078 0.1057 0.09706
## Cumulative Proportion  0.1341 0.2642 0.3804 0.4927 0.6043 0.7122 0.8179 0.91495
##                            PC9
## Standard deviation     0.87492
## Proportion of Variance 0.08505
## Cumulative Proportion  1.00000
In my view, six or seven components should be selected, based on the scree plot and the variance they explain. The first six components account for 71% of the variance, and adding the seventh raises the cumulative share to about 82%. Therefore, the seventh component is also considered in the rest of the study.
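The choice can be tied to an explicit cumulative-variance threshold. The sketch below again reuses the standard deviations printed by `summary(pca)` in this document; the helper `n_for` and the 70% and 80% thresholds are illustrative assumptions, not values fixed by the analysis.

```r
# Standard deviations copied from the summary(pca) output in this document
sdev <- c(1.0986, 1.0819, 1.0227, 1.0053, 1.0025,
          0.9852, 0.9754, 0.93462, 0.87492)

prop <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_prop <- cumsum(prop)       # cumulative proportion
round(cum_prop, 4)

# Hypothetical helper: smallest number of components reaching a threshold
n_for <- function(threshold) which(cum_prop >= threshold)[1]
n_for(0.70)
n_for(0.80)
```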
fviz_pca_var(pca, col.var = "blue")
The variables “Hardness”, “Solids”, “Sulfate” and “pH” are located far from the origin, which indicates that they are well represented by the first two principal components. In other words, they contribute more to the variation captured by PC1 and PC2 than the variables plotted close to the origin.
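For a PCA on standardized variables, the coordinates in such a variable plot are the correlations between each variable and each component, and the squared coordinates (often called cos2) measure how well a variable is represented. The sketch below illustrates this on simulated data (`x` is hypothetical) using only base R.

```r
# Sketch on simulated data; `x` is hypothetical, not the water dataset.
set.seed(7)
n <- 300
a <- rnorm(n)
x <- cbind(v1 = a + rnorm(n, sd = 0.4),
           v2 = a + rnorm(n, sd = 0.4),
           v3 = rnorm(n),
           v4 = rnorm(n))
p <- prcomp(x, center = TRUE, scale. = TRUE)

# Variable coordinates = loadings scaled by the component standard deviations
coords <- sweep(p$rotation, 2, p$sdev, "*")
cos2 <- coords^2                        # quality of representation per PC

# Arrow length of each variable in the PC1-PC2 plane (at most 1)
arrow_len <- sqrt(cos2[, 1] + cos2[, 2])
round(arrow_len, 2)

rowSums(cos2)                           # sums to 1 over all components
```

An arrow close to the unit circle means that the first two components capture almost all of that variable's variance; a short arrow means that most of its variance lies in later components.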
var <- get_pca_var(pca)
a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
f<-fviz_contrib(pca, "var",axes = 6)
g<-fviz_contrib(pca, "var",axes = 7)
grid.arrange(a,b,c,d,e,f,g,top='Contribution to the Principal Components')
Some variables show high contributions in several components:
- Solids (Dim-1)
- Hardness (Dim-1, Dim-2)
- Sulfate (Dim-1, Dim-2)
- Chloramines (Dim-3, Dim-6, Dim-7)
- Trihalomethanes (Dim-4, Dim-5)
- Turbidity (Dim-4, Dim-7)
These results suggest that the principal components reflect different aspects of the data. Component Dim-1 focuses on the total solids and pH factors. Components Dim-3 and Dim-6 focus on the compound bases. Components Dim-4 and Dim-7 relate to turbidity and disinfectants.
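The contributions shown in such bar charts can be reproduced by hand: because every column of the rotation matrix has unit length, the percentage contribution of a variable to a component is its squared loading times 100. The sketch below uses simulated data (`x` is hypothetical); the 100 / p benchmark is the value expected if all variables contributed equally, which, as far as I know, is what the dashed reference line in `fviz_contrib` marks.

```r
# Sketch on simulated data; `x` is hypothetical, not the water dataset.
set.seed(3)
x <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
p <- prcomp(x, center = TRUE, scale. = TRUE)

# Contribution (%) of each variable to each PC: squared loading times 100
contrib <- 100 * p$rotation^2
round(contrib, 1)

expected <- 100 / ncol(x)      # benchmark if contributions were uniform
contrib[, "PC1"] > expected    # variables above the benchmark on PC1
```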
# PCA with distinction of classes (1 = potable, 0 = not potable)
dataset <- dataset[complete.cases(dataset), ]
# Note: `dataset` still contains the Potability column, so this PCA uses all 10 variables
pca1 <- prcomp(dataset, center = TRUE, scale. = TRUE)
ggbiplot(pca1, obs.scale=1, var.scale=1, groups=as.factor(dataset$Potability), ellipse=TRUE, circle = TRUE)
Overall, the data points belonging to groups 0 and 1 do not exhibit a clear separation within the space defined by the first two principal components (PC1 and PC2). These two groups significantly overlap, indicating no distinct differences between potable and non-potable water based solely on the first two principal components.
This observation suggests that PC1 (accounting for 12.1% of the variance) and PC2 (accounting for 11.7% of the variance) are insufficient to effectively distinguish between the two groups.
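One variation worth considering is to fit the PCA on the predictors only and to use the class label purely for grouping, so that the target does not influence the components. The sketch below shows the idea on a simulated data frame: `toy` reuses three column names from the dataset, but its values are randomly generated and carry no real signal.

```r
# Hypothetical toy data frame; column names mimic the dataset, values are simulated.
set.seed(11)
toy <- data.frame(ph         = rnorm(100, 7, 1.5),
                  Hardness   = rnorm(100, 196, 30),
                  Solids     = rnorm(100, 22000, 8000),
                  Potability = rbinom(100, 1, 0.4))

# Fit the PCA on the predictors only
predictors <- toy[, setdiff(names(toy), "Potability")]
pca_x <- prcomp(predictors, center = TRUE, scale. = TRUE)

# The class label is used only to summarise (or colour) the scores
scores <- data.frame(pca_x$x[, 1:2], Potability = factor(toy$Potability))
aggregate(cbind(PC1, PC2) ~ Potability, data = scores, FUN = mean)
```

Group means that sit close together on both components would be consistent with the overlap seen in the biplot. With the real data, the same idea would amount to passing `dataset[1:9]` to `prcomp` while still taking `groups` from `dataset$Potability`.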
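The variables are only weakly correlated, which limits the effectiveness of PCA: each component explains a similar share of the variance (from about 13% for PC1 down to 8.5% for PC9), so few components can be dropped without losing information. This shows how much the usefulness of PCA depends on the nature of the initial data and on the correlation between the variables.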
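A small simulation illustrates the point. The sketch below compares nine independent variables with nine variables that share a common factor; all data are simulated, and the helper `var_explained` and the factor weight of 2 are arbitrary choices made for illustration.

```r
# Simulated comparison; all data and names here are hypothetical.
set.seed(2024)
n <- 1000
p <- 9
indep  <- matrix(rnorm(n * p), ncol = p)   # uncorrelated columns
shared <- rnorm(n)
correl <- indep + 2 * shared               # adds the same factor to every column

# Proportion of variance explained by each component
var_explained <- function(m) {
  s <- prcomp(m, center = TRUE, scale. = TRUE)$sdev
  s^2 / sum(s^2)
}

round(var_explained(indep), 3)    # roughly flat, close to 1/9 each
round(var_explained(correl), 3)   # the first component dominates
```

With independent columns every component explains roughly the same share, which resembles the pattern reported for the water data, whereas a shared factor lets a single component absorb most of the variance.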