PCA is a method for extracting the important variables (in the form of components) from a large set of variables available in a data set. It is a type of unsupervised linear transformation that takes a dataset with too many variables and untangles the original variables into a smaller set of new variables called "principal components".
Principal components are a set of new variables, each corresponding to a linear combination of the original variables. The number of principal components is less than or equal to the number of original variables.
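As a schematic illustration (the weights w below are my own notation, not part of the dataset description), the k-th principal component of p original variables x_1, ..., x_p has the form:

PC_k = w_k1 * x_1 + w_k2 * x_2 + ... + w_kp * x_p

where the weight vectors are chosen so that PC_1 captures the largest possible variance, PC_2 the next largest while being uncorrelated with PC_1, and so on.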
Singular value decomposition (SVD) is considered a general method for PCA; it examines the covariances/correlations between individuals. The functions prcomp() [stats package] and PCA() [FactoMineR package] use singular value decomposition.
The PCA() function comes from FactoMineR, so install this package along with factoextra, which will be used to visualize the results of the PCA. In this article, I will demonstrate the SVD approach using the PCA() function and visualize the variance results.
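A minimal setup sketch; dplyr is included here because mutate() and case_when() are used in the data-preparation step below:

# Install the required packages (only needed once)
install.packages(c("FactoMineR", "factoextra", "dplyr"))

# Load the libraries used throughout this analysis
library(FactoMineR)   # PCA()
library(factoextra)   # get_eigenvalue(), fviz_*() visualization helpers
library(dplyr)        # mutate(), case_when() for data preparation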
I will explore the principal components of a dataset extracted from the KEEL-dataset repository. This dataset was proposed in McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol. 15, 463-482. It contains 16 attributes describing 60 different pollution scenarios. The attributes are the following:
In order to define different ranges of the mortality rate, one extra column named "MORTReal_Type" has been created in the R data frame. This extra column will be useful for data visualization based on mortality rates.
# Read the KEEL pollution data set (the first 19 lines are header metadata)
pollution <- read.delim("pollution.dat", header = FALSE, skip = 19, sep = ",")
colnames(pollution) <- c("PRECReal","JANTReal","JULTReal","OVR65Real","POPNReal","EDUCReal","HOUSReal","DENSReal","NONWReal","WWDRKReal","POORReal","HCReal","NOXReal","SO@Real","HUMIDReal","MORTReal")

# Bucket the mortality rate into three categories for later visualization
pollution <- mutate(pollution,
                    MORTReal_Type = case_when(MORTReal < 900.0 ~ "Low Mortality",
                                              MORTReal >= 900.0 & MORTReal < 1000.0 ~ "Medium Mortality",
                                              MORTReal >= 1000.0 ~ "High Mortality"))
The PCA() function [FactoMineR package] is very useful to identify the principal components and the contributing variables associated with those PCs. By default it standardizes (centers and scales) the variables before the decomposition. Printing the resulting object gives an overview of what it contains:
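A minimal sketch of the call, assuming the result is stored in an object named res.pca and the PCA is run on the 15 explanatory attributes only (excluding MORTReal and the categorical MORTReal_Type):

# Standardize the 15 explanatory variables and run the PCA
res.pca <- PCA(pollution[, 1:15], scale.unit = TRUE, graph = FALSE)

# Printing the object lists the results stored in it
print(res.pca)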
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 60 individuals, described by 15 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
For better interpretation of the PCA results, we can visualize the components using the R functions provided in the factoextra package:
• get_eigenvalue(): extract the eigenvalues/variances of the principal components
• fviz_eig(): visualize the eigenvalues (scree plot)
• fviz_pca_ind(), fviz_pca_var(): visualize the results for individuals and variables, respectively
As described in the previous section, eigenvalues measure the amount of variance retained by each principal component. The first principal component has the largest eigenvalue, and the subsequent PCs have progressively smaller ones. To determine the eigenvalues and the proportion of variance held by the different PCs of a given data set, we can use the function get_eigenvalue() from the factoextra package, as shown below.
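A minimal sketch, assuming the PCA result from above is stored in res.pca (the object name eig.val matches the one referenced later in the text):

# Extract eigenvalues, percentage of variance and cumulative percentage per dimension
eig.val <- get_eigenvalue(res.pca)
eig.val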
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.528391602 30.18927735 30.18928
## Dim.2 2.754841543 18.36561028 48.55489
## Dim.3 2.054464043 13.69642695 62.25131
## Dim.4 1.348389581 8.98926387 71.24058
## Dim.5 1.223219959 8.15479973 79.39538
## Dim.6 0.960443977 6.40295985 85.79834
## Dim.7 0.612741552 4.08494368 89.88328
## Dim.8 0.472011722 3.14674481 93.03003
## Dim.9 0.370853024 2.47235350 95.50238
## Dim.10 0.216394684 1.44263122 96.94501
## Dim.11 0.166350401 1.10900267 98.05401
## Dim.12 0.127005110 0.84670073 98.90071
## Dim.13 0.113986775 0.75991183 99.66063
## Dim.14 0.046039741 0.30693161 99.96756
## Dim.15 0.004866287 0.03244191 100.00000
The sum of all the eigenvalues gives a total variance of 15 (one unit per standardized variable). The proportion of variance explained by each eigenvalue is given in the second column, variance.percent. For example, dividing the first eigenvalue 4.528 by 15 gives 0.302, i.e. almost 30.19% of the variance is explained by the first component/dimension. Based on the output of the eig.val object, we can see that the first six eigenvalues retain almost 86% of the total variance present in the data.
As an alternative approach, we can also examine the pattern of variances using a scree plot, which shows the eigenvalues ordered from largest to smallest. To produce the scree plot, we use the function fviz_eig() from the factoextra package:
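A minimal sketch (addlabels is an optional argument of fviz_eig() that prints the percentage of explained variance on top of each bar; the y-axis limit is my choice):

# Scree plot: percentage of explained variance per principal component
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 35))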
The next step is to determine the contribution and the correlation of the variables with the principal components of the dataset. To extract this information from a PCA object, we use the function get_pca_var(), which provides a list of matrices containing all the results for the active variables (coordinates, correlations between variables and dimensions, squared cosines and contributions). The contributions of the variables (in %) to the first five dimensions are shown below:
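A minimal sketch, again assuming the PCA result is stored in res.pca (the object name var_pollution matches the one referenced later in the text):

# All variable-level results: $coord, $cor, $cos2 and $contrib
var_pollution <- get_pca_var(res.pca)

# Contributions of the variables (in %) to each principal component
var_pollution$contrib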
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## PRECReal 11.9355926 1.05357142 0.07189902 11.07798964 1.49627116
## JANTReal 0.4258015 23.24787328 1.12382205 10.81161895 0.72518687
## JULTReal 11.8670832 3.81866721 0.60998601 0.06152172 15.85760709
## OVR65Real 2.6563871 13.31313109 2.43912037 27.06767987 0.12328338
## POPNReal 8.8371629 0.43540929 0.10226326 31.32854154 5.32611757
## EDUCReal 8.2085117 2.98791569 18.43785598 1.21316484 1.82517652
## HOUSReal 13.0148708 0.25431691 0.30170912 2.20081633 3.72887128
## DENSReal 0.5113210 0.04578729 19.37984452 0.31501300 17.44669037
## NONWReal 9.1211007 13.60707145 0.37432005 1.04976540 0.03167227
## WWDRKReal 3.8594569 5.77096941 11.11082305 0.20825609 14.73496720
## POORReal 12.8063648 7.19993803 0.05626669 7.78654620 1.76973069
## HCReal 7.9959675 13.47918281 5.73259908 0.27843632 4.70122745
## NOXReal 7.0208399 13.20532060 9.62354348 0.18708607 4.09889014
## SO@Real 0.4556910 0.90577945 26.98094776 3.16071390 1.03406283
## HUMIDReal 1.2838483 0.67506607 3.65499957 3.25285013 27.10024517
The quality of representation of the variables on the factor map is called cos2 (squared cosine, i.e. the squared coordinates). The previously created object var_pollution also holds the cos2 values:
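A minimal sketch (head() is used because only the first six rows are displayed below):

# Quality of representation (cos2) of the variables on each dimension
head(var_pollution$cos2)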
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## PRECReal 0.54049037 0.02902422 0.001477139 0.1493744582 0.018302687
## JANTReal 0.01928196 0.64044207 0.023088520 0.1457827434 0.008870631
## JULTReal 0.53738800 0.10519823 0.012531943 0.0008295525 0.193973415
## OVR65Real 0.12029161 0.36675567 0.050110851 0.3649777752 0.001508027
## POPNReal 0.40018134 0.01199484 0.002100962 0.4224307901 0.065150133
## EDUCReal 0.37171355 0.08231234 0.378799121 0.0163581883 0.022325923
After observing the quality of representation, the next step is to explore the contribution of the variables to the main PCs. Variable contributions to a given principal component are expressed as percentages.
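A hedged sketch of how these contributions can be plotted with factoextra; the choice of dimensions and the top argument are mine. fviz_contrib() also draws a dashed reference line at the expected average contribution if all variables contributed uniformly:

# Contributions of the variables to PC1 and PC2
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)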
Key points to remember:
• Variables with a high contribution rate should be retained, as they explain most of the variability in the dataset.
• Variables with a low contribution rate can be excluded in order to reduce the complexity of the analysis.
To showcase the relationship between the individuals, the variables, and the mortality type, a simple biplot can be drawn here:
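A minimal sketch using fviz_pca_biplot() from factoextra; colouring the individuals by the MORTReal_Type column via the habillage argument is my assumption about how the grouping was produced:

# Biplot of individuals and variables, with individuals coloured by mortality type
fviz_pca_biplot(res.pca,
                habillage = as.factor(pollution$MORTReal_Type),
                addEllipses = TRUE,      # concentration ellipses per mortality group
                repel = TRUE)            # avoid overlapping text labels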
A graphical representation of the variable contributions, in conjunction with the distribution of the mortality type variable, is depicted here:
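A hedged sketch of two plots matching this description (the exact calls and the colour gradient are my assumptions): a variable correlation circle coloured by contribution, and a plot of the individuals grouped by mortality type:

# Variable correlation circle, coloured by contribution to the retained PCs
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

# Individuals coloured by mortality type to show how the groups are distributed
fviz_pca_ind(res.pca,
             habillage = as.factor(pollution$MORTReal_Type),
             addEllipses = TRUE, repel = TRUE)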
In the figures above, the column MORTReal_Type has been used to group the mortality rate values with the corresponding key variables.
PCA is an unsupervised approach to capture the variability of a dataset using fewer attributes, which can then be used as predictors in future regression modelling to estimate the age-adjusted mortality rate. Here are the key observations from this exploratory PCA:
Six PCs retain almost 86% of the total variance of the data set.
The following factors are the key contributors to the variability of the data set: family income (POORReal), housing facilities (HOUSReal), non-white population (NONWReal), summer temperature (JULTReal), and air quality (NOXReal).
These key attributes form the basis for selecting predictors when building a regression model to predict the mortality rate of a given population.
Note: It is not recommended to use the NONWReal attribute, as doing so is deemed to be a violation of the Civil Rights Act.
1. Husson, François, Sébastien Lê, and Jérôme Pagès. 2017. Exploratory Multivariate Analysis by Example Using R. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://factominer.free.fr/bookV2/index.html
2. Abdi, Hervé, and Lynne J. Williams. 2010. "Principal Component Analysis." WIREs Computational Statistics 2: 433-59. John Wiley and Sons, Inc. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf
3. KEEL-dataset citation paper: Alcalá-Fdez, J., A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera. 2011. "KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework." Journal of Multiple-Valued Logic and Soft Computing 17 (2-3): 255-287.
4. Labib, Khaled, and V. Rao Vemuri. "An Application of Principal Component Analysis to the Detection and Visualization of Computer Network Attacks." https://web.cs.ucdavis.edu/~vemuri/papers/pcaVisualization.pdf
5. Yang, Libin. 2015. "An Application of Principal Component Analysis to Stock Portfolio Management." https://ir.canterbury.ac.nz/bitstream/handle/10092/10293/thesis.pdf
6. https://www.researchgate.net/publication/272576742_Principal_Component_Analysis_in_Medical_Image_Processing_A_Study
7. https://rdrr.io/cran/factoextra/man/fviz_pca.html