R packages
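
The functions named in this walkthrough, PCA() and get_pca_var(), come from the FactoMineR and factoextra packages, and corrplot is a common companion for plotting the cos2 matrix. A minimal sketch for loading them, assuming all three are installed from CRAN:

```r
# Install once if needed (assumes a CRAN mirror is configured):
# install.packages(c("FactoMineR", "factoextra", "corrplot"))

library(FactoMineR)  # PCA()
library(factoextra)  # get_eigenvalue(), get_pca_var(), fviz_*() helpers
library(corrplot)    # corrplot(), used to display the cos2 matrix
```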

Dataset

We’ll use the Airline dataset for PCA.

We start by subsetting the active individuals and active variables for the principal component analysis. The first six rows of the active data are shown below.

##   FlyingMinutes Capacity SeatPitch SeatWidth Price AdvancedBookingDays
## 1           130      156        30        17  4051                  54
## 2           125      156        30        17 11587                  52
## 3           135      189        29        17  3977                  48
## 4           135      180        30        18  4234                  59
## 5           130      189        29        17  6837                  48
## 6           130      156        30        17  6518                  52
##   MarketShare LoadFactor
## 1        15.4      83.32
## 2        15.4      83.32
## 3        13.2      94.06
## 4        39.6      87.20
## 5        13.2      94.06
## 6        15.4      83.32
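
The subsetting step behind a table like the one above can be sketched as follows. The data frame rebuilds only the six rows printed above. The `Airline` column and its labels are hypothetical, added purely to stand for a non-active variable that gets dropped. The object names `airline` and `airline.active` are also assumptions.

```r
# Six rows copied from the printed output; `Airline` is a hypothetical
# identifier column standing in for any non-active variable.
airline <- data.frame(
  Airline             = c("A", "A", "B", "C", "B", "A"),  # hypothetical labels
  FlyingMinutes       = c(130, 125, 135, 135, 130, 130),
  Capacity            = c(156, 156, 189, 180, 189, 156),
  SeatPitch           = c(30, 30, 29, 30, 29, 30),
  SeatWidth           = c(17, 17, 17, 18, 17, 17),
  Price               = c(4051, 11587, 3977, 4234, 6837, 6518),
  AdvancedBookingDays = c(54, 52, 48, 59, 48, 52),
  MarketShare         = c(15.4, 15.4, 13.2, 39.6, 13.2, 15.4),
  LoadFactor          = c(83.32, 83.32, 94.06, 87.20, 94.06, 83.32)
)

# Keep only the eight quantitative variables that enter the PCA.
active_vars <- c("FlyingMinutes", "Capacity", "SeatPitch", "SeatWidth",
                 "Price", "AdvancedBookingDays", "MarketShare", "LoadFactor")
airline.active <- airline[, active_vars]

head(airline.active)
```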

Data standardization

The output of the function PCA() [FactoMineR package] is a list that includes the following components:

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 305 individuals, described by 8 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
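
The listing above is what printing a PCA() result shows. A sketch of a call that produces such an object is given here. Because the airline data are not reproduced in this section, a few numeric columns of the built-in `mtcars` data serve as a stand-in (an assumption), so the numbers will differ from the ones in this walkthrough. The argument scale.unit = TRUE, which is the FactoMineR default, is what performs the standardization: each variable is centered and scaled to unit variance before the analysis.

```r
library(FactoMineR)

# Stand-in data (assumption): replace with the active airline variables.
standin <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# scale.unit = TRUE centers and scales each variable to unit variance;
# graph = FALSE suppresses the automatic plots.
res.pca <- PCA(standin, scale.unit = TRUE, graph = FALSE)

print(res.pca)            # lists the available components
res.pca$eig               # eigenvalues and percentages of variance
res.pca$call$ecart.type   # per-variable scale values used for standardization
```

With standardized variables the eigenvalues sum to the number of variables, which is a quick sanity check on the result.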

Visualization and Interpretation

Eigenvalues / Variances

The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set. In the table below, for example, the first two dimensions together retain about 50% of the variance, and the first five about 89%.

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.1127851        26.409814                    26.40981
## Dim.2  1.8913123        23.641404                    50.05122
## Dim.3  1.1626064        14.532580                    64.58380
## Dim.4  0.9897769        12.372211                    76.95601
## Dim.5  0.9520241        11.900302                    88.85631
## Dim.6  0.6108899         7.636124                    96.49243
## Dim.7  0.1660760         2.075950                    98.56838
## Dim.8  0.1145292         1.431615                   100.00000
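
The two percentage columns follow directly from the eigenvalues, as this short base R sketch shows using the values copied from the table above. The table layout itself matches what factoextra's get_eigenvalue() returns for a PCA object, although the original call is not shown here.

```r
# Eigenvalues copied from the table above.
eig <- c(2.1127851, 1.8913123, 1.1626064, 0.9897769,
         0.9520241, 0.6108899, 0.1660760, 0.1145292)

# Share of total variance per dimension, and its running total.
variance.percent <- 100 * eig / sum(eig)
cumulative.variance.percent <- cumsum(variance.percent)

round(data.frame(eigenvalue = eig,
                 variance.percent,
                 cumulative.variance.percent), 2)

# With 8 standardized variables the eigenvalues sum to (approximately) 8.
sum(eig)
```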

Scree Plot
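
A scree plot displays the percentage of variance explained by each dimension, ordered from largest to smallest. With a PCA object at hand, factoextra's fviz_eig() draws it directly. The sketch below instead uses base graphics on the eigenvalues listed in the table above, so that it runs without the original data. The axis labels and title are illustrative choices.

```r
# Eigenvalues copied from the eigenvalue table.
eig <- c(2.1127851, 1.8913123, 1.1626064, 0.9897769,
         0.9520241, 0.6108899, 0.1660760, 0.1145292)
pct <- 100 * eig / sum(eig)

# Bars show the percentage of variance per dimension;
# barplot() returns the bar midpoints, used to overlay the line.
mids <- barplot(pct, names.arg = seq_along(pct), ylim = c(0, 30),
                xlab = "Dimensions",
                ylab = "Percentage of explained variance",
                main = "Scree plot")
lines(mids, pct, type = "b", pch = 19)
```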

Results

A simple method to extract the results for the variables from a PCA output is to use the function get_pca_var() [factoextra package]. This function provides a list of matrices containing all the results for the active variables (coordinates, correlations between variables and axes, squared cosines, and contributions).

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
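
A sketch of how these components can be accessed and how they relate to each other. As the airline data are not available here, `mtcars` columns are used as a stand-in (an assumption), so the values differ from the real analysis.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): replace with the active airline variables.
standin <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
res.pca <- PCA(standin, graph = FALSE)

var <- get_pca_var(res.pca)

head(var$coord)    # coordinates of the variables
head(var$cos2)     # quality of representation
head(var$contrib)  # contributions (in percent) to each dimension

# For standardized variables, cos2 is the squared coordinate,
# and the contributions within each dimension sum to 100.
all.equal(var$cos2, var$coord^2)
colSums(var$contrib)
```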

Correlation circle

The correlation between a variable and a principal component (PC) is used as the coordinate of the variable on that PC. The representation of variables therefore differs from the plot of the observations: the observations are represented by their projections, whereas the variables are represented by their correlations.
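
The statement that variable coordinates are correlations can be checked numerically. The sketch below uses base R's prcomp() rather than PCA(), and the built-in `mtcars` data as a stand-in for the airline data (both are assumptions made to keep the example self-contained). With a PCA object from FactoMineR, fviz_pca_var() from factoextra draws the corresponding correlation circle.

```r
# Stand-in data (assumption): replace with the active airline variables.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# PCA on standardized variables.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Correlation of each original variable with each component score.
cor_var_pc <- cor(X, pc$x)

# The same numbers obtained from the loadings, each column scaled by
# the standard deviation of its component.
coord <- sweep(pc$rotation, 2, pc$sdev, `*`)

all.equal(cor_var_pc, coord)
round(cor_var_pc[, 1:2], 2)  # coordinates on the first two PCs
```

Because every coordinate is a correlation, each variable falls inside or on the unit circle when plotted on any pair of components.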

Quality of representation
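
The quality of representation of a variable on a dimension is measured by its squared cosine (cos2): values close to 1 indicate a variable that is well represented on that dimension. One way to inspect the whole cos2 matrix is with corrplot, sketched below on `mtcars` as stand-in data (an assumption); the call shown is a plausible choice, not necessarily the one used to produce the original figure.

```r
library(FactoMineR)
library(factoextra)
library(corrplot)

# Stand-in data (assumption): replace with the active airline variables.
standin <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
res.pca <- PCA(standin, graph = FALSE)
var <- get_pca_var(res.pca)

# cos2 is not a correlation matrix, hence is.corr = FALSE.
corrplot(var$cos2, is.corr = FALSE)

# Bar plot of the total cos2 of each variable on dimensions 1 and 2.
fviz_cos2(res.pca, choice = "var", axes = 1:2)
```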


The most important (that is, most contributing) variables can be highlighted on the correlation plot.
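
One way to do this with factoextra is to color each variable by its contribution, as sketched below. The `mtcars` stand-in data and the particular color gradient are assumptions made for illustration.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): replace with the active airline variables.
standin <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
res.pca <- PCA(standin, graph = FALSE)

# Color the variable arrows by their contribution to the plotted dimensions.
p <- fviz_pca_var(res.pca,
                  col.var = "contrib",
                  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                  repel = TRUE)  # avoid overlapping labels
print(p)

# Bar plot of the contributions of the variables to dimension 1.
fviz_contrib(res.pca, choice = "var", axes = 1)
```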

Creating biplot
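
A biplot overlays the individuals (as points) and the variables (as arrows) on the same pair of dimensions. A minimal sketch using factoextra's fviz_pca_biplot(), again on `mtcars` as stand-in data; the colors are arbitrary choices.

```r
library(FactoMineR)
library(factoextra)

# Stand-in data (assumption): replace with the active airline variables.
standin <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
res.pca <- PCA(standin, graph = FALSE)

# Individuals in grey, variables in blue, labels repelled from each other.
biplot <- fviz_pca_biplot(res.pca,
                          repel = TRUE,
                          col.var = "#2E9FDF",
                          col.ind = "#696969")
print(biplot)
```

In a biplot the two clouds are drawn on different scales, so it is safer to read the direction of the arrows relative to the individuals than their exact positions.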