R packages

Dataset

We’ll use the credit card default dataset of a Taiwanese bank for the PCA.

We start by subsetting the active variables for the principal component analysis.

##   CreditLimit Age BillOutstanding LastPayment
## 1       20000  24            3913           0
## 2      120000  26            2682           0
## 3       90000  34           29239        1518
## 4       50000  37           46990        2000
## 5       50000  57            8617        2000
## 6       50000  37           64400        2500
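The subsetting step can be sketched in base R as follows. The four active columns reuse the six rows printed above; the `ID` and `Default` columns, and the object names `credit` and `credit_active`, are hypothetical and only stand in for whatever else the raw table contains.

```r
# Hypothetical raw table: the four active columns use the six rows printed
# above; ID and Default are made-up extra columns for illustration only.
credit <- data.frame(
  ID              = 1:6,
  CreditLimit     = c(20000, 120000, 90000, 50000, 50000, 50000),
  Age             = c(24, 26, 34, 37, 57, 37),
  BillOutstanding = c(3913, 2682, 29239, 46990, 8617, 64400),
  LastPayment     = c(0, 0, 1518, 2000, 2000, 2500),
  Default         = c(1, 1, 0, 0, 0, 0)
)

# Keep only the numeric variables that should drive the PCA.
active_vars   <- c("CreditLimit", "Age", "BillOutstanding", "LastPayment")
credit_active <- credit[, active_vars]

head(credit_active)
```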

Data standardization
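The active variables are measured on very different scales (credit limits in the tens of thousands, ages in the tens), so they are usually standardized before a PCA. FactoMineR's `PCA()` does this by default through `scale.unit = TRUE`. The following is a minimal sketch of the idea with base R's `scale()`, applied to the six rows printed above; it is an illustration, not necessarily the exact computation used internally.

```r
# The six rows printed above, used here only as a small illustration.
credit_active <- data.frame(
  CreditLimit     = c(20000, 120000, 90000, 50000, 50000, 50000),
  Age             = c(24, 26, 34, 37, 57, 37),
  BillOutstanding = c(3913, 2682, 29239, 46990, 8617, 64400),
  LastPayment     = c(0, 0, 1518, 2000, 2000, 2500)
)

# Centre each column on its mean and divide by its standard deviation.
credit_std <- scale(credit_active, center = TRUE, scale = TRUE)

round(colMeans(credit_std), 10)   # every column now has mean 0
apply(credit_std, 2, sd)          # and standard deviation 1
```

Note that `scale()` divides by the sample standard deviation (divisor n - 1); an implementation that divides by n instead gives practically identical results at the sample size used here.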

The function PCA() [FactoMineR package] returns a list that includes the following components:

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 29601 individuals, described by 4 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
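A listing like the one above can be reproduced with a call along these lines. This is a sketch that assumes the FactoMineR package is installed; it runs on the six rows shown earlier rather than the full 29601 individuals, and the object names `credit_active` and `res_pca` are illustrative.

```r
# The six rows printed earlier, used here only as a small illustration.
credit_active <- data.frame(
  CreditLimit     = c(20000, 120000, 90000, 50000, 50000, 50000),
  Age             = c(24, 26, 34, 37, 57, 37),
  BillOutstanding = c(3913, 2682, 29239, 46990, 8617, 64400),
  LastPayment     = c(0, 0, 1518, 2000, 2000, 2500)
)

# FactoMineR is assumed to be installed; the block is skipped if it is not.
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # scale.unit = TRUE standardizes the variables; graph = FALSE suppresses plots.
  res_pca <- FactoMineR::PCA(credit_active, scale.unit = TRUE, graph = FALSE)

  print(names(res_pca))     # top-level components of the returned list
  print(res_pca$eig)        # eigenvalues and percentages of variance
  print(res_pca$var$coord)  # coordinates of the variables
}
```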

Visualization and Interpretation

Eigenvalues / Variances

The eigenvalues measure the amount of variation retained by each principal component. They are largest for the first PCs and decrease for the subsequent ones; that is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
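To make this concrete, here is a small base R sketch on synthetic data (not the bank data). It assumes standardized variables, in which case the eigenvalues are those of the correlation matrix and match the squared standard deviations reported by `prcomp()`.

```r
set.seed(1)

# Synthetic data: four numeric variables, three of them sharing a common factor.
n <- 200
z <- rnorm(n)
X <- cbind(
  x1 = z + rnorm(n),
  x2 = z + rnorm(n),
  x3 = rnorm(n),
  x4 = 0.5 * z + rnorm(n)
)

# PCA on standardized variables is an eigen decomposition of the
# correlation matrix; eigen() returns the eigenvalues in decreasing order.
eig <- eigen(cor(X), symmetric = TRUE)$values

# The same values come out of prcomp() as squared standard deviations.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

eig
pc$sdev^2
sum(eig)   # equals the number of variables
```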

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.4603748         36.50937                    36.50937
## Dim.2  0.9843115         24.60779                    61.11716
## Dim.3  0.8637435         21.59359                    82.71074
## Dim.4  0.6915702         17.28926                   100.00000
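The percentage columns follow directly from the eigenvalues: each percentage is the eigenvalue divided by the sum of all eigenvalues. The sketch below recomputes them from the values printed above; the variable names are illustrative.

```r
# Eigenvalues copied from the table above.
eigenvalue <- c(Dim.1 = 1.4603748, Dim.2 = 0.9843115,
                Dim.3 = 0.8637435, Dim.4 = 0.6915702)

# Share of total variance per component, and the running total.
variance_percent   <- 100 * eigenvalue / sum(eigenvalue)
cumulative_percent <- cumsum(variance_percent)

round(data.frame(eigenvalue, variance_percent, cumulative_percent), 5)
```

The eigenvalues sum to 4, the number of standardized variables, so each percentage is simply the eigenvalue divided by 4. The first two dimensions together retain about 61% of the variance.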

Scree Plot

Results

A simple way to extract the results for the variables from a PCA output is the function get_pca_var() [factoextra package]. It returns a list of matrices containing all the results for the active variables: coordinates, correlations between variables and axes, squared cosines, and contributions.

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
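For readers who want to see how these four matrices relate to each other, the following base R sketch rebuilds them from a `prcomp()` fit on synthetic data. It assumes a standardized PCA and uses the usual definitions (coordinates as loadings times component standard deviations, cos2 as squared coordinates, contributions as cos2 rescaled to sum to 100 per component); it is not the package's own code.

```r
set.seed(1)

# Synthetic data: four numeric variables (not the bank data).
n <- 200
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n), x2 = z + rnorm(n),
           x3 = rnorm(n),     x4 = 0.5 * z + rnorm(n))

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Coordinates: loadings multiplied by the component standard deviations.
coord <- sweep(pc$rotation, 2, pc$sdev, FUN = "*")

# Correlations between the original variables and the component scores.
cor_var <- cor(X, pc$x)

# Quality of representation: squared coordinates.
cos2 <- coord^2

# Contributions (in percent): cos2 rescaled so each component sums to 100.
contrib <- 100 * sweep(cos2, 2, colSums(cos2), FUN = "/")

var_results <- list(coord = coord, cor = cor_var, cos2 = cos2, contrib = contrib)
lapply(var_results, round, 3)
```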

Correlation circle

The correlation between a variable and a principal component (PC) is used as the coordinate of the variable on that PC. The representation of the variables therefore differs from the plot of the observations: the observations are represented by their projections, whereas the variables are represented by their correlations.
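The difference between the two representations, and the reason the variables fit inside a circle of radius 1, can be checked numerically. The sketch below uses synthetic data and base R only; all object names are illustrative.

```r
set.seed(1)

# Synthetic data: four numeric variables (not the bank data).
n <- 200
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n), x2 = z + rnorm(n),
           x3 = rnorm(n),     x4 = 0.5 * z + rnorm(n))

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Variables: coordinates are correlations with each component.
var_coord <- cor(X, pc$x)

# Observations: coordinates are projections (scores) on each component.
ind_coord <- pc$x

# On the PC1-PC2 plane, each variable lies at most 1 unit from the origin.
plane_dist <- sqrt(rowSums(var_coord[, 1:2]^2))
round(plane_dist, 3)

# Across all components, the squared correlations of a variable sum to 1.
rowSums(var_coord^2)
```

A variable whose arrow almost reaches the circle is well represented by the first two components; a short arrow means most of its variance lies on the remaining components.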

Quality of representation

## corrplot 0.84 loaded

The most important (that is, most contributing) variables can be highlighted on the correlation plot.
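The numbers behind such a highlight can be sketched in base R on synthetic data. The comparison against 100 / p (the contribution each variable would have if all contributed equally) is a common rule of thumb and is an assumption here, not something stated in the analysis above.

```r
set.seed(1)

# Synthetic data: four numeric variables (not the bank data).
n <- 200
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n), x2 = z + rnorm(n),
           x3 = rnorm(n),     x4 = 0.5 * z + rnorm(n))

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Contributions in percent: squared coordinates rescaled per component.
coord   <- sweep(pc$rotation, 2, pc$sdev, FUN = "*")
cos2    <- coord^2
contrib <- 100 * sweep(cos2, 2, colSums(cos2), FUN = "/")

# Rank the variables by their contribution to the first component.
contrib_pc1 <- sort(contrib[, "PC1"], decreasing = TRUE)
round(contrib_pc1, 2)

# With equal contributions, each variable would account for 100 / p percent.
expected <- 100 / ncol(X)
names(contrib_pc1)[contrib_pc1 > expected]
```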