Principal Component Analysis using the FactoMineR package

This tutorial shows how to perform Principal Component Analysis (PCA) with the FactoMineR package; the factoextra package is used for visualization. PCA is used here for two tasks: (a) pattern recognition and (b) dimensionality reduction by identifying correlated columns. The dataset used in this tutorial is Cereals, which contains nutrition information for cereals along with their ratings by consumers.

The dataset contains 77 rows and 14 columns. The first 70 rows are used to train the model and the remaining 7 rows are used for testing the model (prediction). The `PCA()` function in the FactoMineR package has the following syntax: `PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)`.

`X`: a data frame. Rows are individuals and columns are numeric variables.

`scale.unit`: a logical value. If TRUE, the data are scaled to unit variance before the analysis. Standardizing to a common scale prevents some variables from dominating simply because of their large measurement units, and it makes the variables comparable.

`ncp`: the number of dimensions kept in the final results.

`graph`: a logical value. If TRUE, a graph is displayed.
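The code that produced the outputs below is not shown in this tutorial. A minimal sketch that would be consistent with them follows. The file name `cereals.csv`, the helper `split_train()` and the use of `as.matrix(names(...))` are assumptions made for illustration; `cereals`, `df.train` and `train.pca` are the object names that appear elsewhere in the text.

```r
# Hypothetical helper: keep the 9 nutrition columns (5 to 13) of the first 70 rows.
split_train <- function(cereals, n_train = 70, cols = 5:13) {
  cereals[seq_len(n_train), cols]
}

# Assumption: the data sit in a local file called "cereals.csv".
if (file.exists("cereals.csv") &&
    requireNamespace("FactoMineR", quietly = TRUE)) {
  cereals <- read.csv("cereals.csv")
  print(head(cereals, 3))             # first three rows
  print(as.matrix(names(cereals)))    # column names with their positions

  df.train  <- split_train(cereals)
  train.pca <- FactoMineR::PCA(df.train, graph = TRUE)
  print(train.pca)                    # list of the available result objects
  summary(train.pca)                  # eigenvalues, individuals, variables
}
```

The later calls to `get_eigenvalue()`, `get_pca_var()` and the `fviz_*()` functions additionally need `library(factoextra)`.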

##   record              name mfr type protein fat sodium fiber carbo sugars
## 1      1         100%_Bran   N    C       4   1    130    10     5      6
## 2      2 100%_Natural_Bran   Q    C       3   5     15     2     8      8
## 3      3          All-Bran   K    C       4   1    260     9     7      5
##   potass vitamins calories rating
## 1    280       25       70  68.40
## 2    135        0      120  33.98
## 3    320       25       70  59.43

##       [,1]      
##  [1,] "record"  
##  [2,] "name"    
##  [3,] "mfr"     
##  [4,] "type"    
##  [5,] "protein" 
##  [6,] "fat"     
##  [7,] "sodium"  
##  [8,] "fiber"   
##  [9,] "carbo"   
## [10,] "sugars"  
## [11,] "potass"  
## [12,] "vitamins"
## [13,] "calories"
## [14,] "rating"

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 70 individuals, described by 9 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

## 
## Call:
## PCA(X = df.train, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.715   2.062   1.590   1.045   0.622   0.505   0.374
## % of var.             30.172  22.912  17.663  11.615   6.906   5.612   4.157
## Cumulative % of var.  30.172  53.084  70.746  82.361  89.267  94.880  99.036
##                        Dim.8   Dim.9
## Variance               0.059   0.027
## % of var.              0.660   0.304
## Cumulative % of var.  99.696 100.000
## 
## Individuals (the 10 first)
##              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1        |  5.320 |  5.005 13.179  0.885 |  0.119  0.010  0.000 |  0.127  0.014
## 2        |  4.764 |  1.667  1.461  0.122 |  2.615  4.737  0.301 | -1.961  3.456
## 3        |  5.338 |  4.631 11.281  0.752 |  0.196  0.027  0.001 |  1.183  1.257
## 4        |  7.135 |  6.354 21.238  0.793 | -1.677  1.949  0.055 |  1.269  1.446
## 5        |  1.319 | -0.476  0.119  0.130 |  0.776  0.417  0.346 | -0.223  0.045
## 6        |  1.748 | -0.293  0.045  0.028 |  1.158  0.929  0.439 | -0.927  0.772
## 7        |  2.445 | -0.875  0.403  0.128 |  0.403  0.113  0.027 | -1.540  2.132
## 8        |  1.903 | -0.545  0.156  0.082 |  1.140  0.900  0.359 |  0.931  0.779
## 9        |  1.415 |  0.555  0.162  0.154 | -0.473  0.155  0.112 |  0.198  0.035
## 10       |  2.383 |  1.697  1.515  0.507 | -0.716  0.355  0.090 |  0.763  0.524
##            cos2  
## 1         0.001 |
## 2         0.169 |
## 3         0.049 |
## 4         0.032 |
## 5         0.028 |
## 6         0.281 |
## 7         0.397 |
## 8         0.239 |
## 9         0.019 |
## 10        0.103 |
## 
## Variables
##             Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2  
## protein  |  0.604 13.449  0.365 |  0.104  0.527  0.011 |  0.501 15.774  0.251 |
## fat      |  0.128  0.607  0.016 |  0.764 28.313  0.584 | -0.050  0.155  0.002 |
## sodium   | -0.341  4.284  0.116 |  0.237  2.729  0.056 |  0.591 21.986  0.350 |
## fiber    |  0.893 29.362  0.797 |  0.046  0.103  0.002 |  0.246  3.810  0.061 |
## carbo    | -0.549 11.098  0.301 | -0.375  6.832  0.141 |  0.576 20.880  0.332 |
## sugars   | -0.193  1.370  0.037 |  0.781 29.543  0.609 | -0.384  9.280  0.148 |
## potass   |  0.873 28.082  0.763 |  0.247  2.970  0.061 |  0.269  4.551  0.072 |
## vitamins | -0.356  4.674  0.127 |  0.177  1.527  0.031 |  0.561 19.804  0.315 |
## calories | -0.438  7.075  0.192 |  0.752 27.455  0.566 |  0.245  3.761  0.060 |

Eigenvalues / Variances

The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set. How many principal components should be retained? An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original variables does in standardized data. Under this rule, the components with eigenvalue > 1 are retained, which here means the first four components (82% of the variance). Another option is to use the scree plot. From the plot below, we might want to stop at the fifth principal component: 89% of the information (variance) contained in the data is retained by the first five principal components.

eig.val <- get_eigenvalue(train.pca)
eig.val

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.71546819       30.1718687                    30.17187
## Dim.2 2.06204993       22.9116659                    53.08353
## Dim.3 1.58965943       17.6628825                    70.74642
## Dim.4 1.04535281       11.6150312                    82.36145
## Dim.5 0.62150710        6.9056345                    89.26708
## Dim.6 0.50512083        5.6124536                    94.87954
## Dim.7 0.37411639        4.1568488                    99.03639
## Dim.8 0.05938214        0.6598016                    99.69619
## Dim.9 0.02734318        0.3038132                   100.00000
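As a quick check of the retention rules discussed above, the eigenvalues printed in this table can be pasted into base R. The vector below is copied from the output; nothing here calls FactoMineR or factoextra.

```r
# Eigenvalues copied from the get_eigenvalue() output above.
eig <- c(2.71546819, 2.06204993, 1.58965943, 1.04535281, 0.62150710,
         0.50512083, 0.37411639, 0.05938214, 0.02734318)

# With standardized variables the eigenvalues sum to the number of variables.
total <- sum(eig)

# Percentage of variance per component and its running total.
pct     <- 100 * eig / total
cum_pct <- cumsum(pct)

# Eigenvalue rule: keep components whose eigenvalue exceeds 1.
n_kaiser <- sum(eig > 1)

print(round(total, 4))
print(n_kaiser)
print(round(cum_pct, 2))
```

The eigenvalue rule keeps four components (about 82% of the variance), while the scree plot reading keeps five (about 89%).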

fviz_eig(train.pca, addlabels = TRUE, ylim = c(0, 50))

A simple method to extract the results, for variables, from a PCA output is to use the function `get_pca_var()`.

var <- get_pca_var(train.pca)
var

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

The correlation between a variable and a principal component is used as the coordinate of the variable on the PC, so (with `scale.unit = TRUE`) `$coord` and `$cor` give the same information. `$cos2` is the square of `$cor`.

head(var$coord,4)

##              Dim.1      Dim.2       Dim.3      Dim.4       Dim.5
## protein  0.6043195 0.10421830  0.50074864 -0.4105887 -0.12170102
## fat      0.1283722 0.76409269 -0.04956312 -0.4189385 -0.07653449
## sodium  -0.3410769 0.23720423  0.59118956  0.4013632  0.47178314
## fiber    0.8929254 0.04616886  0.24609789  0.2556498  0.05743455

To plot the variables on the first two dimensions:

fviz_pca_var(train.pca, col.var = "black", title = "Correlation Plot of Variables")

Quality of Representation

The quality of representation of the variables on the factor map is called cos2 (squared cosine, squared coordinates). A high cos2 indicates a good representation of the variable on the principal component; in that case the variable is positioned close to the circumference of the correlation circle. A low cos2 indicates that the variable is poorly represented by the component. We can obtain cos2 as follows:

##              Dim.1       Dim.2       Dim.3      Dim.4       Dim.5
## protein 0.36520203 0.010861453 0.250749201 0.16858307 0.014811137
## fat     0.01647943 0.583837641 0.002456503 0.17550950 0.005857528
## sodium  0.11633348 0.056265847 0.349505101 0.16109243 0.222579334
## fiber   0.79731576 0.002131563 0.060564172 0.06535682 0.003298728
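The table above looks like what `head(var$cos2, 4)` would print (the call itself is not shown, so this is an assumption). The relation between `$coord` and `$cos2` can be checked in base R with the values printed for the first three dimensions:

```r
vars <- c("protein", "fat", "sodium", "fiber")
dims <- c("Dim.1", "Dim.2", "Dim.3")

# Coordinates copied from the head(var$coord, 4) output.
coord <- matrix(c( 0.6043195, 0.10421830,  0.50074864,
                   0.1283722, 0.76409269, -0.04956312,
                  -0.3410769, 0.23720423,  0.59118956,
                   0.8929254, 0.04616886,  0.24609789),
                nrow = 4, byrow = TRUE, dimnames = list(vars, dims))

# cos2 values copied from the table above.
cos2 <- matrix(c(0.36520203, 0.010861453, 0.250749201,
                 0.01647943, 0.583837641, 0.002456503,
                 0.11633348, 0.056265847, 0.349505101,
                 0.79731576, 0.002131563, 0.060564172),
               nrow = 4, byrow = TRUE, dimnames = list(vars, dims))

# Squaring the coordinates should reproduce cos2 up to rounding of the printed digits.
max_diff <- max(abs(coord^2 - cos2))
print(max_diff)
```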

Contribution of variables to PC

The contributions of variables in accounting for the variability in a given principal component are expressed as percentages. Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set. Variables that are not correlated with any PC, or that are correlated only with the last dimensions, contribute little and might be removed to simplify the overall analysis. The contributions of the variables can be extracted as follows:

##          Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## protein  13.45  0.53 15.77 16.13  2.38
## fat       0.61 28.31  0.15 16.79  0.94
## sodium    4.28  2.73 21.99 15.41 35.81
## fiber    29.36  0.10  3.81  6.25  0.53
## carbo    11.10  6.83 20.88 11.07  1.72
## sugars    1.37 29.54  9.28  8.78  0.07
## potass   28.08  2.97  4.55  1.43  1.32
## vitamins  4.67  1.53 19.80 16.52 56.15
## calories  7.07 27.46  3.76  7.61  1.06

The larger the value of the contribution, the more the variable contributes to the component. We can use the function `corrplot()` from the corrplot package to highlight the most contributing variables for each dimension.
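A contribution is the variable's cos2 on a component expressed as a percentage of that component's eigenvalue, that is, `cos2 / eigenvalue * 100`. The sketch below checks this for Dim.1 with numbers copied from the outputs above; it is plain base R and does not call FactoMineR.

```r
# cos2 on Dim.1 for the first four variables (from the cos2 table).
cos2_dim1 <- c(protein = 0.36520203, fat = 0.01647943,
               sodium = 0.11633348, fiber = 0.79731576)

# Eigenvalue of Dim.1 (from the eigenvalue table).
eig_dim1 <- 2.71546819

# Contribution in percent: share of the component's variance due to each variable.
contrib_dim1 <- 100 * cos2_dim1 / eig_dim1
print(round(contrib_dim1, 2))
```

The rounded values can be compared with the Dim.1 column of the contribution table. With the corrplot package installed, `corrplot(var$contrib, is.corr = FALSE)` is one way to draw the plot mentioned above.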

Contribution of Variables to PC1

The red line in the plot above indicates the expected average contribution. If the contributions of the variables were uniform, the expected value would be 1/length(variables) = 1/9, or about 11%. For a given component, a variable with a contribution larger than this cutoff can be considered important in contributing to that component. The total contribution to PC1 and PC2 can be obtained in a similar way.
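The cutoff and the combined contribution to PC1 and PC2 can also be worked out by hand. The sketch below uses the Dim.1 and Dim.2 columns of the contribution table and the first two eigenvalues. Weighting each component by its eigenvalue is an assumed formula for the combined contribution (it is the one commonly described for factoextra's `fviz_contrib()`, which is presumably the function behind the plot, although the call is not shown).

```r
vars <- c("protein", "fat", "sodium", "fiber", "carbo",
          "sugars", "potass", "vitamins", "calories")

# Contributions (%) copied from the table above.
c1 <- setNames(c(13.45, 0.61, 4.28, 29.36, 11.10, 1.37, 28.08, 4.67, 7.07), vars)
c2 <- setNames(c(0.53, 28.31, 2.73, 0.10, 6.83, 29.54, 2.97, 1.53, 27.46), vars)

# Eigenvalues of Dim.1 and Dim.2.
eig1 <- 2.71546819
eig2 <- 2.06204993

# Expected contribution if all 9 variables contributed equally.
cutoff <- 100 / length(vars)

# Variables above the cutoff on PC1.
important_pc1 <- names(c1)[c1 > cutoff]

# Combined contribution to PC1 and PC2: eigenvalue-weighted average (assumed formula).
c12 <- (c1 * eig1 + c2 * eig2) / (eig1 + eig2)

print(round(cutoff, 2))
print(important_pc1)
print(round(sort(c12, decreasing = TRUE), 2))
```

If factoextra is loaded, `fviz_contrib(train.pca, choice = "var", axes = 1:2)` should draw the corresponding bar plot with its reference line.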

The most important contributing variables can be highlighted in the correlation plot as follows:

fviz_pca_var(train.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

`dimdesc()` can be used to identify the variables most significantly associated with a given principal component. In the output below, the variables are listed in decreasing order of correlation, and only those with a p-value below `proba` are shown. The function can be used as follows:

desc <- dimdesc(train.pca, axes = c(1,2), proba = 0.05)
# Description of dimension 1
desc$Dim.1

## $quanti
##          correlation      p.value
## fiber      0.8929254 2.907970e-25
## potass     0.8732524 6.445535e-23
## protein    0.6043195 3.035490e-08
## sodium    -0.3410769 3.859072e-03
## vitamins  -0.3562426 2.471195e-03
## calories  -0.4382992 1.476686e-04
## carbo     -0.5489537 8.629781e-07
## 
## attr(,"class")
## [1] "condes" "list"
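The p-values reported by `dimdesc()` appear to be those of the usual t-test for a Pearson correlation with n - 2 degrees of freedom (n = 70 training rows). This is an assumption about the internals, not something stated in the output; the sketch below recomputes the p-value for sodium from its printed correlation so the two can be compared. `cor_p_value()` is a hypothetical helper written for this check.

```r
# Two-sided p-value for a Pearson correlation r observed on n individuals.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 70                       # individuals used to fit the PCA
r_sodium <- -0.3410769        # correlation of sodium with Dim.1, from the output

p_sodium <- cor_p_value(r_sodium, n)
print(p_sodium)

# Smallest absolute correlation that is significant at the 5% level for n = 70.
t_crit <- qt(0.975, df = n - 2)
r_crit <- t_crit / sqrt(t_crit^2 + n - 2)
print(r_crit)
```

The correlations of fat (0.128) and sugars (-0.193) with Dim.1 are smaller in absolute value than this threshold, which is consistent with their absence from the output above.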

Supplementary Elements

Supplementary variables and individuals are not used to determine the principal components. Their coordinates are predicted using only the information from the PCA performed on the active variables and individuals. In the Cereals dataset, rows 71 to 77 are used as supplementary individuals and the rating column is used as the supplementary variable.

df <- cereals[, 5:14]
rownames(df)<- cereals$name
test.pca <- PCA(df, ind.sup = 71:77, quanti.sup = 10, graph=FALSE)
test.pca$quanti.sup

## $coord
##            Dim.1      Dim.2     Dim.3      Dim.4      Dim.5
## rating 0.6515088 -0.6907844 0.1880421 -0.1441747 -0.1021708
## 
## $cor
##            Dim.1      Dim.2     Dim.3      Dim.4      Dim.5
## rating 0.6515088 -0.6907844 0.1880421 -0.1441747 -0.1021708
## 
## $cos2
##            Dim.1     Dim.2      Dim.3      Dim.4      Dim.5
## rating 0.4244637 0.4771831 0.03535985 0.02078636 0.01043886

test.pca$ind.sup

## $coord
##                            Dim.1       Dim.2       Dim.3       Dim.4      Dim.5
## Total_Raisin_Bran   -0.002783963  2.63018990  2.43070478  1.91210983 -2.1068160
## Total_Whole_Grain   -0.410411400 -0.17050538  2.54136652  1.36482617 -2.5083678
## Triples             -1.642837134 -0.83840772  1.05478550 -0.54287939  0.8683578
## Trix                -1.540234050  0.46390685 -1.57342504  0.54699819 -0.0287648
## Wheat_Chex           0.290777661 -0.63541920  1.14841469 -0.11567405  0.5856445
## Wheaties             0.325287596 -0.70562425  0.96890341 -0.26171729  0.3678195
## Wheaties_Honey_Gold -1.013187901  0.04068697 -0.07168762  0.09649568  0.4142057
## 
## $cos2
##                            Dim.1       Dim.2       Dim.3       Dim.4
## Total_Raisin_Bran   3.266555e-07 0.291566304 0.249016170 0.154095085
## Total_Whole_Grain   1.083617e-02 0.001870308 0.415500410 0.119837165
## Triples             4.432308e-01 0.115438711 0.182712909 0.048400253
## Trix                4.399118e-01 0.039907433 0.459075728 0.055483519
## Wheat_Chex          3.445405e-02 0.164527646 0.537422314 0.005452430
## Wheaties            5.458474e-02 0.256852151 0.484280713 0.035334669
## Wheaties_Honey_Gold 8.200875e-01 0.001322486 0.004105524 0.007438681
##                            Dim.5
## Total_Raisin_Bran   0.1870752593
## Total_Whole_Grain   0.4047802217
## Triples             0.1238335546
## Trix                0.0001534317
## Wheat_Chex          0.1397611031
## Wheaties            0.0697920091
## Wheaties_Honey_Gold 0.1370605225
## 
## $dist
##   Total_Raisin_Bran   Total_Whole_Grain             Triples                Trix 
##            4.871003            3.942590            2.467627            2.322223 
##          Wheat_Chex            Wheaties Wheaties_Honey_Gold 
##            1.566538            1.392297            1.118819

fviz_pca_var(test.pca)

fviz_pca_ind(test.pca)
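To make the idea of supplementary elements concrete, the sketch below reproduces it with base R's `prcomp()` on the built-in `USArrests` data. Both are stand-ins chosen so that the example runs without the Cereals file or FactoMineR; the row and column choices are arbitrary. Supplementary individuals are standardized with the means and standard deviations of the active individuals and then projected onto the loadings; a supplementary variable gets its coordinates from its correlations with the component scores.

```r
active_rows <- 1:40           # individuals used to fit the PCA
sup_rows    <- 41:50          # supplementary individuals
active_cols <- c("Murder", "Assault", "Rape")
sup_col     <- "UrbanPop"     # supplementary quantitative variable

fit <- prcomp(USArrests[active_rows, active_cols], center = TRUE, scale. = TRUE)

# Supplementary individuals: standardize with the training center and scale, then project.
new_x   <- as.matrix(USArrests[sup_rows, active_cols])
new_std <- scale(new_x, center = fit$center, scale = fit$scale)
sup_ind <- new_std %*% fit$rotation

# predict() performs the same projection.
sup_ind_check <- predict(fit, newdata = USArrests[sup_rows, active_cols])

# Supplementary variable: correlation with the scores of the active individuals.
sup_var <- cor(USArrests[active_rows, sup_col], fit$x)

print(round(head(sup_ind, 3), 3))
print(round(sup_var, 3))
```

FactoMineR standardizes with the standard deviation computed by dividing by n, whereas `prcomp()` divides by n - 1, so coordinates from the two can differ by a constant factor (and possibly in sign); the mechanism is the same.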

Gokul Bhandari

18/02/2021

References

sthda.com
