Principal Component Analysis using R

1. Overview:

PCA is a method for extracting the important information (in the form of components) from a large set of variables in a data set. It is an unsupervised linear transformation: we take a dataset with too many variables and untangle the original variables into a smaller set of new variables called “principal components”.

Principal components are new variables, each of which is a linear combination of the original variables. The number of principal components is less than or equal to the number of original variables.

Singular value decomposition (SVD) is considered a general method for PCA. It examines the covariances/correlations between individuals. The functions prcomp() [“stats” package] and PCA() [“FactoMineR” package] both use SVD.
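To make the link between SVD and PCA concrete, here is a minimal base-R sketch. It assumes the built-in USArrests data as a stand-in (not the pollution data used later) and shows that three routes give the same component variances on standardized data.

```r
# Stand-in data (assumption): the built-in USArrests data set.
X <- scale(USArrests)   # centre each column and scale it to unit variance
n <- nrow(X)

# Route 1: singular value decomposition of the standardized data matrix.
var_from_svd <- svd(X)$d^2 / (n - 1)

# Route 2: eigen-decomposition of the correlation matrix.
var_from_eigen <- eigen(cor(USArrests))$values

# Route 3: prcomp(), whose sdev values are the component standard deviations.
var_from_prcomp <- prcomp(USArrests, scale. = TRUE)$sdev^2

print(round(cbind(var_from_svd, var_from_eigen, var_from_prcomp), 4))
```

The three columns should agree up to rounding, and they sum to the number of variables because each standardized variable has variance 1.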

The PCA() function comes from FactoMineR, so install this package along with factoextra, which is used to visualize the PCA results. In this article, I demonstrate the SVD method using the PCA() function and visualize the resulting variances.

Dataset description:

I will explore the principal components of a dataset taken from the KEEL-dataset repository. This dataset was proposed in McDonald, G.C. and Schwing, R.C. (1973) ‘Instabilities of regression estimates relating air pollution to mortality’, Technometrics, vol.15, 463-482. It contains 16 attributes describing 60 different pollution scenarios. The attributes are the following:

PRECReal: Average annual precipitation in inches
JANTReal: Average January temperature in degrees F
JULTReal: Same for July
OVR65Real: % of 1960 SMSA population aged 65 or older
POPNReal: Average household size
EDUCReal: Median school years completed by those over 22
HOUSReal: % of housing units which are sound and with all facilities
DENSReal: Population per sq. mile in urbanized areas, 1960
NONWReal: % non-white population in urbanized areas, 1960
WWDRKReal: % employed in white collar occupations
POORReal: % of families with income less than $3000
HCReal: Relative hydrocarbon pollution potential
NOXReal: Same for nitric oxides
SO@Real: Same for sulphur dioxide
HUMIDReal: Annual average % relative humidity at 1pm
MORTReal: Total age-adjusted mortality rate per 100,000

To distinguish ranges of mortality rate, one extra column named “MORTReal_Type” is added to the R data frame. This column is used later to group the observations by mortality rate in the visualizations.

library(dplyr)  # provides mutate() and case_when()

# Read the KEEL file, skipping the first 19 lines that precede the data rows
pollution <- read.delim("pollution.dat", header = FALSE, skip = 19, sep = ",")

colnames(pollution) <- c("PRECReal","JANTReal","JULTReal","OVR65Real","POPNReal","EDUCReal","HOUSReal","DENSReal","NONWReal","WWDRKReal","POORReal","HCReal","NOXReal","SO@Real","HUMIDReal","MORTReal")

# Band the mortality rate; conditions are checked in order, so rates of exactly 900 or 1000 are not left as NA
pollution <- mutate(pollution, MORTReal_Type = case_when(MORTReal < 900.0 ~ "Low Mortality",
                                                         MORTReal < 1000.0 ~ "Medium Mortality",
                                                         TRUE ~ "High Mortality"))
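Band boundaries deserve a quick check, because case_when() returns NA for any row that matches none of its conditions. The following is a minimal sketch on made-up rates (the toy data frame is hypothetical, not part of the pollution data), with conditions evaluated in order so that 900 and 1000 each land in exactly one band.

```r
library(dplyr)

# Hypothetical mortality rates, chosen to include both cut-off values.
toy <- data.frame(MORTReal = c(850, 900, 950, 1000, 1050))

toy <- mutate(toy, MORTReal_Type = case_when(
  MORTReal < 900  ~ "Low Mortality",
  MORTReal < 1000 ~ "Medium Mortality",
  TRUE            ~ "High Mortality"    # everything from 1000 upwards
))

print(toy)

# Count rows that no condition matched; these would appear as NA.
print(sum(is.na(toy$MORTReal_Type)))
```

Which band the boundary values should fall into is a design choice; here they are assigned to the higher band.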

The PCA() function [FactoMineR package] is very useful for identifying the principal components and the variables that contribute to them. With the variables standardized (normalized), printing the fitted PCA object lists the results it holds:

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 60 individuals, described by 15 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
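A listing like the one above is what printing a FactoMineR PCA object produces. The sketch below shows one way such an object could be created. It is an assumption-laden illustration: it uses the numeric columns of the built-in iris data as a stand-in, and the object name res_pca is hypothetical. For the pollution data, the call would presumably be made on the 15 explanatory columns only.

```r
library(FactoMineR)

# Stand-in data (assumption): the four numeric columns of the built-in iris data.
X <- iris[, 1:4]

# scale.unit = TRUE standardizes each variable to unit variance;
# graph = FALSE suppresses the default plots.
res_pca <- PCA(X, scale.unit = TRUE, ncp = 4, graph = FALSE)

print(res_pca)       # lists the available result objects ($eig, $var, $ind, ...)
print(res_pca$eig)   # eigenvalues with percentage and cumulative percentage of variance
```

Setting scale.unit = TRUE matters when variables are measured in different units, as they are in the pollution data (inches, degrees F, percentages).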

For better interpretation of the PCA, we visualize the components using functions provided by the factoextra R package:

get_eigenvalue(): Extract the eigenvalues/variances of the principal components
fviz_eig(): Visualize the eigenvalues
fviz_pca_ind(), fviz_pca_var(): Visualize the results for individuals and variables, respectively

Determine eigenvalues of principal components:

Eigenvalues measure the variance retained by each principal component. The first principal component has the largest eigenvalue and each subsequent PC has a smaller one. To obtain the eigenvalues and the proportion of variance held by the different PCs of a data set, we use the function get_eigenvalue() provided by the factoextra package.

##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.528391602      30.18927735                    30.18928
## Dim.2  2.754841543      18.36561028                    48.55489
## Dim.3  2.054464043      13.69642695                    62.25131
## Dim.4  1.348389581       8.98926387                    71.24058
## Dim.5  1.223219959       8.15479973                    79.39538
## Dim.6  0.960443977       6.40295985                    85.79834
## Dim.7  0.612741552       4.08494368                    89.88328
## Dim.8  0.472011722       3.14674481                    93.03003
## Dim.9  0.370853024       2.47235350                    95.50238
## Dim.10 0.216394684       1.44263122                    96.94501
## Dim.11 0.166350401       1.10900267                    98.05401
## Dim.12 0.127005110       0.84670073                    98.90071
## Dim.13 0.113986775       0.75991183                    99.66063
## Dim.14 0.046039741       0.30693161                    99.96756
## Dim.15 0.004866287       0.03244191                   100.00000

The sum of all the eigenvalues gives a total variance of 15, one unit for each of the 15 standardized variables. The proportion of variance explained by each eigenvalue is given in the second column, “variance.percent”. For example, dividing 4.528 by 15 gives 0.3019, i.e. about 30.19% of the variance is explained by the first component/dimension. Based on the output of the eig.val object, the first six eigenvalues retain almost 86% of the total variance present in the data.
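The relationship between the columns of the eigenvalue table can be checked directly. The sketch below re-enters the eigenvalues printed in the table and recomputes the percentage and cumulative percentage in base R; the variable names are illustrative only.

```r
# Eigenvalues copied from the table in this article (15 dimensions).
eig <- c(4.528391602, 2.754841543, 2.054464043, 1.348389581, 1.223219959,
         0.960443977, 0.612741552, 0.472011722, 0.370853024, 0.216394684,
         0.166350401, 0.127005110, 0.113986775, 0.046039741, 0.004866287)

total <- sum(eig)                         # total variance of the standardized data
variance_percent <- 100 * eig / total     # share of variance per dimension
cumulative_percent <- cumsum(variance_percent)

print(round(total, 4))
print(round(variance_percent[1], 2))      # first dimension
print(round(cumulative_percent[6], 2))    # first six dimensions together
```

The recomputed values should match the variance.percent and cumulative.variance.percent columns of the table up to rounding.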

As an alternative approach, we can examine the pattern of variances using a scree plot, which shows the eigenvalues ordered from largest to smallest. To produce the scree plot, we use the function fviz_eig() available in the factoextra package.
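As a hedged illustration of these two factoextra functions, the sketch below fits a PCA on stand-in data (the numeric columns of the built-in iris data, an assumption) and then extracts the eigenvalues and draws a scree plot. The object names are hypothetical.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): numeric columns of the built-in iris data.
res_pca <- PCA(iris[, 1:4], scale.unit = TRUE, graph = FALSE)

# Eigenvalue, percentage of variance and cumulative percentage per dimension.
eig_val <- get_eigenvalue(res_pca)
print(eig_val)

# Scree plot of the percentage of variance explained by each dimension,
# with the percentages printed on top of the bars.
scree <- fviz_eig(res_pca, addlabels = TRUE)
print(scree)
```

A sharp drop followed by a flat tail in the scree plot is the usual visual cue for how many dimensions to keep.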

Variables contribution and correlation circle plot:

The next step is to determine how the original variables contribute to, and correlate with, the principal components of the dataset. To extract these results from a PCA object we use the function get_pca_var(), which returns a list of matrices containing all the results for the active variables (coordinates, correlations between variables and dimensions, squared cosines and contributions). The table below shows the contributions, in percent, of each variable to the first five dimensions:

##                Dim.1       Dim.2       Dim.3       Dim.4       Dim.5
## PRECReal  11.9355926  1.05357142  0.07189902 11.07798964  1.49627116
## JANTReal   0.4258015 23.24787328  1.12382205 10.81161895  0.72518687
## JULTReal  11.8670832  3.81866721  0.60998601  0.06152172 15.85760709
## OVR65Real  2.6563871 13.31313109  2.43912037 27.06767987  0.12328338
## POPNReal   8.8371629  0.43540929  0.10226326 31.32854154  5.32611757
## EDUCReal   8.2085117  2.98791569 18.43785598  1.21316484  1.82517652
## HOUSReal  13.0148708  0.25431691  0.30170912  2.20081633  3.72887128
## DENSReal   0.5113210  0.04578729 19.37984452  0.31501300 17.44669037
## NONWReal   9.1211007 13.60707145  0.37432005  1.04976540  0.03167227
## WWDRKReal  3.8594569  5.77096941 11.11082305  0.20825609 14.73496720
## POORReal  12.8063648  7.19993803  0.05626669  7.78654620  1.76973069
## HCReal     7.9959675 13.47918281  5.73259908  0.27843632  4.70122745
## NOXReal    7.0208399 13.20532060  9.62354348  0.18708607  4.09889014
## SO@Real    0.4556910  0.90577945 26.98094776  3.16071390  1.03406283
## HUMIDReal  1.2838483  0.67506607  3.65499957  3.25285013 27.10024517
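A contribution table of this kind, and the matching correlation circle, could be obtained along the following lines. This is a sketch on stand-in data (the built-in iris data, an assumption), and the object names are hypothetical.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): numeric columns of the built-in iris data.
res_pca <- PCA(iris[, 1:4], scale.unit = TRUE, graph = FALSE)

# List of matrices with the results for the variables.
var_res <- get_pca_var(res_pca)
print(names(var_res))      # coord, cor, cos2, contrib
print(var_res$contrib)     # contributions in percent, one column per dimension

# Correlation circle, with variables coloured by their contribution.
circle <- fviz_pca_var(res_pca, col.var = "contrib", repel = TRUE)
print(circle)
```

Within each dimension the contributions of all variables add up to 100, which is a convenient sanity check on the table.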

Quality of representation:

The quality of representation of the variables on the factor map is called cos2 (squared cosine). For a given variable and component, it equals the squared coordinate of the variable on that component. The previously created object var_pollution holds the cos2 values:

##                Dim.1      Dim.2       Dim.3        Dim.4       Dim.5
## PRECReal  0.54049037 0.02902422 0.001477139 0.1493744582 0.018302687
## JANTReal  0.01928196 0.64044207 0.023088520 0.1457827434 0.008870631
## JULTReal  0.53738800 0.10519823 0.012531943 0.0008295525 0.193973415
## OVR65Real 0.12029161 0.36675567 0.050110851 0.3649777752 0.001508027
## POPNReal  0.40018134 0.01199484 0.002100962 0.4224307901 0.065150133
## EDUCReal  0.37171355 0.08231234 0.378799121 0.0163581883 0.022325923
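The meaning of cos2 can be reproduced in base R. The sketch below assumes the built-in USArrests data as a stand-in and uses prcomp() rather than FactoMineR, so it illustrates the idea and is not the article's exact computation.

```r
# Stand-in data (assumption): the built-in USArrests data set.
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings multiplied by the standard deviation of each
# component. For standardized data these are the variable-component correlations.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2 is the squared coordinate.
cos2 <- coord^2

print(round(cos2, 3))
print(rowSums(cos2))   # summed over all components, each variable's cos2 is 1
```

A variable with a cos2 close to 1 on the first two components is well represented on the two-dimensional factor map; a low value means the map hides most of its variation.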

Contribution of variables to PCs:

After observing the quality of representation, the next step is to explore the contribution of the variables to the main PCs. The contribution of each variable to a given principal component is expressed as a percentage.

Key points to remember:
• Variables with a high contribution should be retained, as they are the most important in explaining the variability in the dataset.
• Variables with a low contribution can be excluded from the dataset to reduce the complexity of the data analysis.
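Contributions are tied to cos2 and the eigenvalues: the contribution of a variable to a component is its cos2 on that component divided by the component's eigenvalue, times 100. The sketch below checks this with numbers copied from the tables in this article. The 100/15 threshold is a common rule of thumb (the contribution expected if all 15 variables contributed equally) and is an addition here, not something stated by the source.

```r
# Values copied from the tables in this article (PRECReal on Dim.1).
cos2_prec_dim1 <- 0.54049037    # cos2 of PRECReal on Dim.1
eig_dim1       <- 4.528391602   # eigenvalue of Dim.1

# Contribution (in percent) of the variable to the component.
contrib_prec_dim1 <- 100 * cos2_prec_dim1 / eig_dim1
print(round(contrib_prec_dim1, 4))

# Rule-of-thumb threshold (assumption): expected contribution if all
# 15 variables contributed equally.
threshold <- 100 / 15
print(contrib_prec_dim1 > threshold)
```

The result agrees with the PRECReal entry for Dim.1 in the contribution table shown earlier.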

Individual Biplot:

A simple plot of the individuals, coloured by mortality type, shows how the observations relate to that grouping.
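A plot of this kind could be drawn with fviz_pca_ind(). The sketch below is an illustration on stand-in data: it uses the built-in iris data, with Species playing the role that the mortality type column plays in the article. The argument choices are assumptions, not the article's original call.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): numeric columns of the built-in iris data.
res_pca <- PCA(iris[, 1:4], scale.unit = TRUE, graph = FALSE)

# Individuals on the first two dimensions, coloured by a grouping factor.
ind_plot <- fviz_pca_ind(res_pca,
                         geom.ind = "point",       # points only, no row labels
                         col.ind = iris$Species,   # grouping factor (stand-in)
                         addEllipses = TRUE,       # one ellipse per group
                         legend.title = "Group")
print(ind_plot)
```

The grouping factor is passed separately and is not part of the PCA itself, so it only affects the colouring.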

Combo biplot:

The variable contributions can be shown together with the distribution of the mortality types in a combined biplot.
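A combined view of individuals and variables is what fviz_pca_biplot() provides. As before, this is a sketch on stand-in data (the built-in iris data, with Species standing in for the mortality type), and the argument choices are assumptions.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): numeric columns of the built-in iris data.
res_pca <- PCA(iris[, 1:4], scale.unit = TRUE, graph = FALSE)

# Biplot: individuals as points coloured by group, variables as arrows.
combo_plot <- fviz_pca_biplot(res_pca,
                              label = "var",            # label variables only
                              col.ind = iris$Species,   # grouping factor (stand-in)
                              col.var = "black",
                              addEllipses = TRUE,
                              repel = TRUE,             # avoid overlapping labels
                              legend.title = "Group")
print(combo_plot)
```

Arrows pointing towards a group of individuals indicate variables on which that group tends to score high.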

Export plots to PDF files:

## png 
##   2
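Output of this form is what R prints when dev.off() closes a graphics device and reports the device that becomes current. A minimal sketch of the export pattern follows; the plot, the file name and the page size are hypothetical choices for illustration.

```r
library(ggplot2)

# Hypothetical plot and output file, used only to illustrate the device calls.
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
out_file <- file.path(tempdir(), "example_plot.pdf")

pdf(out_file, width = 7, height = 5)   # open a PDF graphics device
print(p)                               # draw the plot into the open device
dev.off()                              # close the device so the file is written

print(file.exists(out_file))
```

The plots returned by the factoextra fviz_ functions are ggplot objects, so the same open, print, close pattern should apply to them.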

In the figure above, the column “MORTReal_Type” is used to group the observations by mortality rate alongside the corresponding key variables.

Summary:

PCA is an unsupervised approach for capturing the variability of a dataset with fewer attributes, which can then be used as predictors in future regression modelling work to estimate the age-adjusted mortality rate. Here are the key observations from this prototype exploratory PCA:

Six PCs account for almost 86% of the variance of the whole data set.
The following factors are the key contributors to the variability of the data set: family income (POORReal), housing facilities (HOUSReal), non-white population (NONWReal), summer temperature (JULTReal) and air quality (NOXReal).
These key attributes are the basis for selecting predictors when building a regression model to predict the mortality of a given population.

Note: It is not recommended to use the NONWReal attribute, as doing so may be deemed a violation of the Civil Rights Act.

References:

1. Husson, Francois, Sebastien Le, and Jérôme Pagès. 2017. Exploratory Multivariate Analysis by Example Using R. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://factominer.free.fr/bookV2/index.html

2. Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” John Wiley and Sons, Inc. WIREs Comp Stat 2: 433–59. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf

3. KEEL-dataset citation paper: J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing 17:2-3 (2011) 255-287.

4. Khaled Labib and V. Rao Vemuri. “An Application of Principal Component Analysis to the Detection and Visualization of Computer Network Attacks.” https://web.cs.ucdavis.edu/~vemuri/papers/pcaVisualization.pdf

5. Libin Yang. 2015. “An Application of Principal Component Analysis to Stock Portfolio Management.” https://ir.canterbury.ac.nz/bitstream/handle/10092/10293/thesis.pdf

6. https://www.researchgate.net/publication/272576742_Principal_Component_Analysis_in_Medical_Image_Processing_A_Study

7. https://rdrr.io/cran/factoextra/man/fviz_pca.html