Introduction

Dimension reduction is an unsupervised machine learning technique that transforms data from a high-dimensional space into a low-dimensional space while retaining the meaningful properties of the original data. In a nutshell, dimension reduction means representing the data using fewer predictor variables (features).

In this project I use dimension reduction to check whether a data set with many features can be described in a similar way with fewer variables.

Dataset - “Pistachio Dataset”

This data set contains features and parameters describing two types of pistachios. It is available on Kaggle: https://www.kaggle.com/datasets/muratkokludataset/pistachio-dataset.

Variables

Id Name
1 Area
2 Perimeter
3 Major_Axis
4 Minor_Axis
5 Eccentricity
6 Eqdiasq
7 Solidity
8 Convex_Area
9 Extent
10 Aspect_Ratio
11 Roundness
12 Compactness
13 Shapefactor_1
14 Shapefactor_2
15 Shapefactor_3
16 Shapefactor_4
17 Mean_RR
18 Mean_RG
19 Mean_RB
20 Skew_RR
21 Skew_RG
22 Skew_RB
23 Kurtosis_RR
24 Kurtosis_RG
25 Kurtosis_RB
26 StdDev_RR
27 StdDev_RG
28 StdDev_RB
29 Class

Statistics and Exploratory Data Analysis

First, I load the required packages and the data, and inspect the structure of the set:

# load required packages
library(readxl); library(Hmisc); library(corrplot)
library(factoextra); library(gridExtra)

# load data
pistacios <- read_excel("C:\\Users\\skarbowiak\\Desktop\\Studia II UW\\Semestr 1 DSaBA\\Unsupervised learning 2022Z\\Dimension reduction projects\\Pistachio_Dataset.xlsx")
str(pistacios)
## tibble [2,148 × 29] (S3: tbl_df/tbl/data.frame)
##  $ Area         : num [1:2148] 63391 68358 73589 71106 80087 ...
##  $ Perimeter    : num [1:2148] 1568 1942 1247 1445 1252 ...
##  $ Major_Axis   : num [1:2148] 390 411 452 430 469 ...
##  $ Minor_Axis   : num [1:2148] 237 235 221 216 221 ...
##  $ Eccentricity : num [1:2148] 0.795 0.821 0.873 0.864 0.882 ...
##  $ Eqdiasq      : num [1:2148] 284 295 306 301 319 ...
##  $ Solidity     : num [1:2148] 0.867 0.876 0.917 0.959 0.966 ...
##  $ Convex_Area  : num [1:2148] 73160 77991 80234 74153 82929 ...
##  $ Extent       : num [1:2148] 0.639 0.677 0.713 0.703 0.746 ...
##  $ Aspect_Ratio : num [1:2148] 1.65 1.75 2.05 1.99 2.12 ...
##  $ Roundness    : num [1:2148] 0.324 0.228 0.595 0.428 0.642 ...
##  $ Compactness  : num [1:2148] 0.728 0.718 0.677 0.701 0.68 ...
##  $ Shapefactor_1: num [1:2148] 0.0062 0.006 0.0061 0.006 0.0059 0.0073 0.0054 0.0062 0.0068 0.0063 ...
##  $ Shapefactor_2: num [1:2148] 0.0037 0.0034 0.003 0.003 0.0028 0.0038 0.0035 0.0035 0.0033 0.0026 ...
##  $ Shapefactor_3: num [1:2148] 0.53 0.516 0.458 0.491 0.463 ...
##  $ Shapefactor_4: num [1:2148] 0.873 0.902 0.939 0.976 0.983 ...
##  $ Mean_RR      : num [1:2148] 196 223 213 212 230 ...
##  $ Mean_RG      : num [1:2148] 180 209 203 205 218 ...
##  $ Mean_RB      : num [1:2148] 165 187 188 188 194 ...
##  $ StdDev_RR    : num [1:2148] 17.7 26.7 19 18.2 23.4 ...
##  $ StdDev_RG    : num [1:2148] 19.6 27.2 20.1 18.7 24.1 ...
##  $ StdDev_RB    : num [1:2148] 21.1 25.1 20.7 29.8 23.1 ...
##  $ Skew_RR      : num [1:2148] 0.458 -0.385 -0.601 -0.694 -0.929 ...
##  $ Skew_RG      : num [1:2148] 0.663 -0.271 -0.45 -0.628 -0.813 ...
##  $ Skew_RB      : num [1:2148] 0.759 -0.293 0.3 -0.78 -0.497 ...
##  $ Kurtosis_RR  : num [1:2148] 2.97 1.98 3.54 2.88 2.99 ...
##  $ Kurtosis_RG  : num [1:2148] 3.06 2.1 3.69 2.87 2.88 ...
##  $ Kurtosis_RB  : num [1:2148] 2.95 2.22 4.1 2.9 2.74 ...
##  $ Class        : chr [1:2148] "Kirmizi_Pistachio" "Kirmizi_Pistachio" "Kirmizi_Pistachio" "Kirmizi_Pistachio" ...
summary(pistacios[,-29]) 
##       Area          Perimeter        Major_Axis      Minor_Axis   
##  Min.   : 29808   Min.   : 858.4   Min.   :320.3   Min.   :133.5  
##  1st Qu.: 71937   1st Qu.:1171.0   1st Qu.:426.5   1st Qu.:217.9  
##  Median : 79906   Median :1262.8   Median :448.6   Median :236.4  
##  Mean   : 79951   Mean   :1426.0   Mean   :446.2   Mean   :238.3  
##  3rd Qu.: 89031   3rd Qu.:1607.9   3rd Qu.:468.5   3rd Qu.:257.8  
##  Max.   :124008   Max.   :2755.0   Max.   :542.0   Max.   :383.0  
##   Eccentricity       Eqdiasq         Solidity       Convex_Area    
##  Min.   :0.5049   Min.   :194.8   Min.   :0.5880   Min.   : 37935  
##  1st Qu.:0.8175   1st Qu.:302.6   1st Qu.:0.9198   1st Qu.: 76467  
##  Median :0.8497   Median :319.0   Median :0.9542   Median : 85076  
##  Mean   :0.8402   Mean   :317.9   Mean   :0.9401   Mean   : 85016  
##  3rd Qu.:0.8752   3rd Qu.:336.7   3rd Qu.:0.9769   3rd Qu.: 93894  
##  Max.   :0.9460   Max.   :397.4   Max.   :0.9951   Max.   :132478  
##      Extent        Aspect_Ratio     Roundness       Compactness    
##  Min.   :0.4272   Min.   :1.159   Min.   :0.0628   Min.   :0.4760  
##  1st Qu.:0.6870   1st Qu.:1.736   1st Qu.:0.3713   1st Qu.:0.6815  
##  Median :0.7265   Median :1.896   Median :0.6434   Median :0.7107  
##  Mean   :0.7161   Mean   :1.898   Mean   :0.5692   Mean   :0.7131  
##  3rd Qu.:0.7536   3rd Qu.:2.067   3rd Qu.:0.7441   3rd Qu.:0.7417  
##  Max.   :0.8204   Max.   :3.086   Max.   :0.9336   Max.   :0.8779  
##  Shapefactor_1      Shapefactor_2      Shapefactor_3    Shapefactor_4   
##  Min.   :0.004000   Min.   :0.002400   Min.   :0.2266   Min.   :0.6204  
##  1st Qu.:0.005200   1st Qu.:0.002800   1st Qu.:0.4645   1st Qu.:0.9440  
##  Median :0.005600   Median :0.003000   Median :0.5051   Median :0.9731  
##  Mean   :0.005701   Mean   :0.003017   Mean   :0.5105   Mean   :0.9552  
##  3rd Qu.:0.006100   3rd Qu.:0.003200   3rd Qu.:0.5501   3rd Qu.:0.9873  
##  Max.   :0.013100   Max.   :0.005300   Max.   :0.7706   Max.   :0.9990  
##     Mean_RR         Mean_RG         Mean_RB        StdDev_RR    
##  Min.   :167.2   Min.   :162.6   Min.   :146.8   Min.   :10.61  
##  1st Qu.:211.6   1st Qu.:200.4   1st Qu.:182.9   1st Qu.:19.25  
##  Median :219.5   Median :208.9   Median :192.0   Median :21.43  
##  Mean   :218.1   Mean   :208.0   Mean   :192.0   Mean   :21.38  
##  3rd Qu.:225.9   3rd Qu.:216.5   3rd Qu.:201.1   3rd Qu.:23.70  
##  Max.   :241.3   Max.   :240.5   Max.   :235.0   Max.   :30.84  
##    StdDev_RG       StdDev_RB        Skew_RR           Skew_RG       
##  Min.   :11.99   Min.   :11.20   Min.   :-1.9316   Min.   :-1.6582  
##  1st Qu.:20.04   1st Qu.:19.72   1st Qu.:-0.9909   1st Qu.:-0.8760  
##  Median :22.52   Median :22.28   Median :-0.7566   Median :-0.6531  
##  Mean   :22.59   Mean   :22.43   Mean   :-0.7352   Mean   :-0.6156  
##  3rd Qu.:25.24   3rd Qu.:25.14   3rd Qu.:-0.5025   3rd Qu.:-0.4050  
##  Max.   :33.61   Max.   :42.76   Max.   : 1.8654   Max.   : 2.2576  
##     Skew_RB         Kurtosis_RR     Kurtosis_RG      Kurtosis_RB    
##  Min.   :-2.3486   Min.   :1.662   Min.   : 1.665   Min.   : 1.522  
##  1st Qu.:-0.6458   1st Qu.:2.510   1st Qu.: 2.437   1st Qu.: 2.449  
##  Median :-0.4245   Median :2.942   Median : 2.807   Median : 2.783  
##  Mean   :-0.3671   Mean   :3.054   Mean   : 2.903   Mean   : 2.941  
##  3rd Qu.:-0.1584   3rd Qu.:3.446   3rd Qu.: 3.247   3rd Qu.: 3.225  
##  Max.   : 1.8521   Max.   :8.891   Max.   :10.454   Max.   :11.534

The data set contains 29 variables and 2,148 rows. First I drop the Class variable to unlabel the data, and then I check whether the set contains any NAs.

# drop the Class variable to unlabel the data
pistacios2 <- pistacios[ , !(names(pistacios) %in% "Class")]

# keep only the complete rows; all 2,148 rows come back, so there are no NAs
pistacios[complete.cases(pistacios), ]
## # A tibble: 2,148 × 29
##     Area Perime…¹ Major…² Minor…³ Eccen…⁴ Eqdiasq Solid…⁵ Conve…⁶ Extent Aspec…⁷
##    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 63391    1568.    390.    237.   0.795    284.   0.866   73160  0.639    1.65
##  2 68358    1942.    411.    235.   0.821    295.   0.876   77991  0.677    1.75
##  3 73589    1247.    452.    221.   0.873    306.   0.917   80234  0.713    2.05
##  4 71106    1445.    430.    216.   0.864    301.   0.959   74153  0.703    1.99
##  5 80087    1252.    469.    221.   0.882    319.   0.966   82929  0.746    2.12
##  6 52268    1154.    384.    198.   0.858    258.   0.856   61039  0.563    1.94
##  7 71693    1464.    388.    253.   0.759    302.   0.916   78304  0.689    1.54
##  8 62240    1898.    386.    218.   0.825    282.   0.895   69563  0.673    1.77
##  9 64319    2011.    436.    214.   0.872    286.   0.863   74502  0.654    2.04
## 10 78115    1239.    492.    205.   0.909    315.   0.962   81236  0.710    2.40
## # … with 2,138 more rows, 19 more variables: Roundness <dbl>,
## #   Compactness <dbl>, Shapefactor_1 <dbl>, Shapefactor_2 <dbl>,
## #   Shapefactor_3 <dbl>, Shapefactor_4 <dbl>, Mean_RR <dbl>, Mean_RG <dbl>,
## #   Mean_RB <dbl>, StdDev_RR <dbl>, StdDev_RG <dbl>, StdDev_RB <dbl>,
## #   Skew_RR <dbl>, Skew_RG <dbl>, Skew_RB <dbl>, Kurtosis_RR <dbl>,
## #   Kurtosis_RG <dbl>, Kurtosis_RB <dbl>, Class <chr>, and abbreviated variable
## #   names ¹​Perimeter, ²​Major_Axis, ³​Minor_Axis, ⁴​Eccentricity, ⁵​Solidity, …
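The same check can be expressed in one line (a small convenience sketch, not part of the original output):

# sketch: direct missing-value check; FALSE means no NAs anywhere in the set
anyNA(pistacios)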

The data set does not contain missing values. As the summary statistics show, the variables describing the pistachios are on very different scales, so I need to standardize them.

#standardization 
pistacios2_stand <- as.data.frame(lapply(pistacios2, scale))
summary(pistacios2_stand)
##       Area             Perimeter         Major_Axis         Minor_Axis      
##  Min.   :-3.821343   Min.   :-1.5113   Min.   :-3.88051   Min.   :-3.45760  
##  1st Qu.:-0.610735   1st Qu.:-0.6789   1st Qu.:-0.60842   1st Qu.:-0.67422  
##  Median :-0.003441   Median :-0.4345   Median : 0.07168   Median :-0.06253  
##  Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.691970   3rd Qu.: 0.4844   3rd Qu.: 0.68609   3rd Qu.: 0.64163  
##  Max.   : 3.357585   Max.   : 3.5389   Max.   : 2.95011   Max.   : 4.77502  
##   Eccentricity        Eqdiasq            Solidity        Convex_Area       
##  Min.   :-6.8771   Min.   :-4.57492   Min.   :-6.9787   Min.   :-3.578953  
##  1st Qu.:-0.4659   1st Qu.:-0.56771   1st Qu.:-0.4012   1st Qu.:-0.649859  
##  Median : 0.1934   Median : 0.03888   Median : 0.2786   Median : 0.004535  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.7174   3rd Qu.: 0.69741   3rd Qu.: 0.7300   3rd Qu.: 0.674855  
##  Max.   : 2.1695   Max.   : 2.95210   Max.   : 1.0903   Max.   : 3.607940  
##      Extent         Aspect_Ratio         Roundness        Compactness      
##  Min.   :-5.4988   Min.   :-3.080606   Min.   :-2.3800   Min.   :-5.32193  
##  1st Qu.:-0.5533   1st Qu.:-0.673799   1st Qu.:-0.9303   1st Qu.:-0.70771  
##  Median : 0.1986   Median :-0.007931   Median : 0.3489   Median :-0.05335  
##  Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.7145   3rd Qu.: 0.703334   3rd Qu.: 0.8223   3rd Qu.: 0.64142  
##  Max.   : 1.9861   Max.   : 4.946458   Max.   : 1.7129   Max.   : 3.69997  
##  Shapefactor_1     Shapefactor_2      Shapefactor_3      Shapefactor_4    
##  Min.   :-2.0817   Min.   :-1.81647   Min.   :-4.43936   Min.   :-6.4590  
##  1st Qu.:-0.6133   1st Qu.:-0.63919   1st Qu.:-0.71919   1st Qu.:-0.2168  
##  Median :-0.1239   Median :-0.05056   Median :-0.08307   Median : 0.3445  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.4879   3rd Qu.: 0.53807   3rd Qu.: 0.61952   3rd Qu.: 0.6185  
##  Max.   : 9.0532   Max.   : 6.71874   Max.   : 4.06835   Max.   : 0.8442  
##     Mean_RR           Mean_RG            Mean_RB            StdDev_RR       
##  Min.   :-4.7137   Min.   :-3.75625   Min.   :-3.469375   Min.   :-3.44298  
##  1st Qu.:-0.5984   1st Qu.:-0.62418   1st Qu.:-0.695647   1st Qu.:-0.67988  
##  Median : 0.1355   Median : 0.07554   Median : 0.003149   Median : 0.01439  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.7295   3rd Qu.: 0.70583   3rd Qu.: 0.698546   3rd Qu.: 0.74040  
##  Max.   : 2.1502   Max.   : 2.69468   Max.   : 3.300362   Max.   : 3.02391  
##    StdDev_RG          StdDev_RB           Skew_RR            Skew_RG        
##  Min.   :-2.92805   Min.   :-2.86017   Min.   :-3.11078   Min.   :-2.67875  
##  1st Qu.:-0.70531   1st Qu.:-0.68885   1st Qu.:-0.66476   1st Qu.:-0.66902  
##  Median :-0.01883   Median :-0.03824   Median :-0.05553   Median :-0.09627  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.73172   3rd Qu.: 0.69099   3rd Qu.: 0.60518   3rd Qu.: 0.54103  
##  Max.   : 3.04320   Max.   : 5.17775   Max.   : 6.76222   Max.   : 7.38191  
##     Skew_RB         Kurtosis_RR       Kurtosis_RG       Kurtosis_RB     
##  Min.   :-4.6408   Min.   :-1.8959   Min.   :-1.8998   Min.   :-1.8903  
##  1st Qu.:-0.6526   1st Qu.:-0.7416   1st Qu.:-0.7148   1st Qu.:-0.6547  
##  Median :-0.1345   Median :-0.1529   Median :-0.1473   Median :-0.2096  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4889   3rd Qu.: 0.5347   3rd Qu.: 0.5287   3rd Qu.: 0.3787  
##  Max.   : 5.1977   Max.   : 7.9519   Max.   :11.5921   Max.   :11.4552
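As the summary confirms, every variable now has mean 0. For reference, scale() applies the classic z-score transformation z = (x - mean(x)) / sd(x); a minimal sketch verifying this on a single column (Area is just a convenient example):

# sketch: scale() is the z-score (x - mean) / sd, checked on one column
z_manual <- (pistacios2$Area - mean(pistacios2$Area)) / sd(pistacios2$Area)
all.equal(z_manual, as.numeric(scale(pistacios2$Area)))  # TRUE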

Now I am going to check how my data is distributed.

# histograms of the standardized variables, in groups of columns
hist.data.frame(pistacios2_stand[1:9])

hist.data.frame(pistacios2_stand[10:18])

hist.data.frame(pistacios2_stand[19:28])

par(mar = c(8, 6, 4, 1) + .1)  # widen the margins for rotated labels
boxplot(pistacios2_stand[, 1:9], las = 2)

boxplot(pistacios2_stand[, 10:18], las = 2)

boxplot(pistacios2_stand[, 19:28], las = 2)

As the histograms and boxplots show, many variables in the pistachio data set contain numerous outliers, and many are not normally distributed. I am therefore going to check the correlation between the variables, but because of the outliers and the distributions I will use the Spearman rather than the Pearson correlation.

# Spearman rank correlation is robust to outliers and non-normality;
# the result is named cor_spearman to avoid shadowing the cor() function
cor_spearman <- cor(pistacios2_stand, method = "spearman")
corrplot(cor_spearman)

The number of variables makes the correlation plot harder to read, but we can observe that some features, such as Area and Eqdiasq, are highly positively correlated, while variables such as Compactness and Aspect_Ratio are negatively correlated. Dimension reduction should allow for a clearer analysis and more satisfying results.
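The two pairs highlighted above can be pulled directly out of the matrix (a quick sketch; the exact values are not part of the original output):

# sketch: inspect the two highlighted pairs in the Spearman matrix
cor_spearman["Area", "Eqdiasq"]              # close to 1: Eqdiasq is a monotone function of Area
cor_spearman["Compactness", "Aspect_Ratio"]  # strongly negative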

Dimension Reduction - PCA

Principal component analysis (PCA) is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information of the original set. Reducing the number of variables naturally comes at the expense of accuracy, but the trick in dimension reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data much faster without extraneous variables to process.
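To make the mechanics concrete, here is a minimal sketch (using the built-in iris measurements as stand-in data, not the pistachio set) showing that PCA amounts to an eigendecomposition of the covariance matrix of the standardized data; prcomp() uses SVD internally but yields the same components up to sign:

# sketch: PCA as the eigendecomposition of the covariance matrix
X   <- scale(iris[, 1:4])  # standardize: mean 0, sd 1 per column
eig <- eigen(cov(X))       # eigenvalues = variance captured by each PC
pca <- prcomp(X)
all.equal(eig$values, unname(pca$sdev^2))                       # TRUE
all.equal(abs(unname(eig$vectors)), abs(unname(pca$rotation)))  # TRUE, up to sign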

Now I am going to find the optimal number of dimensions for the pistachio data set using the eigenvalues, which measure the amount of variation retained by each principal component.

# run PCA; the data are already standardized, so no rescaling is needed
pca_pistacio <- prcomp(pistacios2_stand)
summary(pca_pistacio)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.9907 2.3413 2.0709 1.6774 1.61240 0.98869 0.81147
## Proportion of Variance 0.3195 0.1958 0.1532 0.1005 0.09285 0.03491 0.02352
## Cumulative Proportion  0.3195 0.5152 0.6684 0.7689 0.86172 0.89663 0.92014
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.72381 0.63638 0.6034 0.52440 0.43074 0.39738 0.26160
## Proportion of Variance 0.01871 0.01446 0.0130 0.00982 0.00663 0.00564 0.00244
## Cumulative Proportion  0.93885 0.95332 0.9663 0.97614 0.98277 0.98841 0.99085
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.22385 0.2183 0.19957 0.17952 0.16882 0.16318 0.11266
## Proportion of Variance 0.00179 0.0017 0.00142 0.00115 0.00102 0.00095 0.00045
## Cumulative Proportion  0.99264 0.9943 0.99577 0.99692 0.99793 0.99889 0.99934
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.10261 0.07269 0.03442 0.02696 0.02567 0.00866 0.0069
## Proportion of Variance 0.00038 0.00019 0.00004 0.00003 0.00002 0.00000 0.0000
## Cumulative Proportion  0.99972 0.99990 0.99995 0.99997 1.00000 1.00000 1.0000
pistacio_eigenvalue <- get_eigenvalue(pca_pistacio)
pistacio_eigenvalue
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  8.944545e+00     3.194480e+01                    31.94480
## Dim.2  5.481521e+00     1.957686e+01                    51.52166
## Dim.3  4.288599e+00     1.531643e+01                    66.83809
## Dim.4  2.813533e+00     1.004833e+01                    76.88642
## Dim.5  2.599848e+00     9.285173e+00                    86.17160
## Dim.6  9.775043e-01     3.491087e+00                    89.66268
## Dim.7  6.584773e-01     2.351705e+00                    92.01439
## Dim.8  5.239006e-01     1.871073e+00                    93.88546
## Dim.9  4.049841e-01     1.446372e+00                    95.33183
## Dim.10 3.640829e-01     1.300296e+00                    96.63213
## Dim.11 2.749957e-01     9.821276e-01                    97.61426
## Dim.12 1.855363e-01     6.626297e-01                    98.27689
## Dim.13 1.579137e-01     5.639777e-01                    98.84086
## Dim.14 6.843411e-02     2.444076e-01                    99.08527
## Dim.15 5.010822e-02     1.789579e-01                    99.26423
## Dim.16 4.763394e-02     1.701212e-01                    99.43435
## Dim.17 3.982958e-02     1.422485e-01                    99.57660
## Dim.18 3.222750e-02     1.150982e-01                    99.69170
## Dim.19 2.850015e-02     1.017863e-01                    99.79348
## Dim.20 2.662683e-02     9.509581e-02                    99.88858
## Dim.21 1.269226e-02     4.532950e-02                    99.93391
## Dim.22 1.052819e-02     3.760068e-02                    99.97151
## Dim.23 5.284090e-03     1.887175e-02                    99.99038
## Dim.24 1.184700e-03     4.231071e-03                    99.99461
## Dim.25 7.270880e-04     2.596743e-03                    99.99721
## Dim.26 6.589270e-04     2.353311e-03                    99.99956
## Dim.27 7.500127e-05     2.678617e-04                    99.99983
## Dim.28 4.761629e-05     1.700582e-04                   100.00000

As the PCA results show, just two components describe more than 50% of the variance, and 6 of the 28 components explain almost 90% of it. Dimension reduction is a powerful tool! Let's visualize the results.
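Before plotting, the same numbers can be read directly from the PCA object (a small sketch, not part of the original output):

# sketch: smallest number of components reaching a cumulative-variance target
cum_var <- cumsum(pca_pistacio$sdev^2) / sum(pca_pistacio$sdev^2)
which(cum_var >= 0.50)[1]  # 2: two components already pass 50%
which(cum_var >= 0.90)[1]  # 7: six components reach 89.7%, the seventh passes 90%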

fviz_eig(pca_pistacio, addlabels = TRUE, ylim = c(0, 35), main = "Scree plot")

The plot above shows the statistics calculated a moment before more clearly: five to six components explain almost 90% of the variance in the set.

fviz_pca_var(pca_pistacio, col.var = "darkblue", repel = TRUE, col.circle = "darkblue", addEllipses = TRUE)

The chart above shows the correlations between the principal components and the variables. Positively correlated variables are grouped together, while negatively correlated variables are positioned on opposite sides of the plot origin.
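The coordinates behind this plot are the correlations of each variable with the components; they can be inspected directly (a small sketch using factoextra's accessor):

# sketch: variable coordinates (correlations with the PCs) on Dim 1-2
head(get_pca_var(pca_pistacio)$coord[, 1:2])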

fviz_pca_ind(pca_pistacio, col.ind="cos2", geom = "point")

The graph of individuals shows the observations projected onto the first two components, colored by cos2, i.e., by how well each individual observation is represented by them.
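The same quality-of-representation measure can be extracted as a table (a small sketch):

# sketch: squared cosines of the first observations on Dim 1-2
ind_rez <- get_pca_ind(pca_pistacio)
head(ind_rez$cos2[, 1:2])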

fviz_eig(pca_pistacio, choice = "eigenvalue", addlabels = TRUE, ylim = c(0, 10), main = "Scree plot - eigenvalue")

An eigenvalue > 1 indicates that a PC accounts for more variance than any single original variable does in standardized data (the Kaiser criterion). I am therefore going to use 5 dimensions as the cutoff point.
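Counting the qualifying components directly from the eigenvalue table computed above (a one-line sketch):

# sketch: Kaiser criterion - components with an eigenvalue above 1
sum(pistacio_eigenvalue$eigenvalue > 1)  # 5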

With that knowledge, I would like to see the contributions of the variables to the PCs.

variables_rez <- get_pca_var(pca_pistacio)
corrplot(variables_rez$cos2, is.corr=FALSE)

fviz_contrib(pca_pistacio, "var", axes = 1:5)

The contribution of variables decreases with successive dimensions. The red line on the graph above shows the expected average contribution; the variables above the line can be considered important.
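For reference, the red line sits at 100% / 28 variables ≈ 3.57%. A sketch reproducing the list of variables above it, assuming factoextra's eigenvalue-weighted formula for contributions over several axes:

# sketch: weighted contribution of each variable to Dims 1-5, mirroring
# fviz_contrib(axes = 1:5); the threshold is the expected average 100/28
eig5     <- pistacio_eigenvalue$eigenvalue[1:5]
contrib5 <- as.vector(variables_rez$contrib[, 1:5] %*% eig5) / sum(eig5)
names(contrib5) <- rownames(variables_rez$contrib)
sort(contrib5[contrib5 > 100 / length(contrib5)], decreasing = TRUE)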

The contribution chart showed a lot, but it aggregated five dimensions. I will now take a look at the contributions to each dimension separately.

pc1 <- fviz_contrib(pca_pistacio, choice = "var", axes = 1, xtickslab.rt = 90)
pc2 <- fviz_contrib(pca_pistacio, choice = "var", axes = 2, xtickslab.rt = 90)
pc3 <- fviz_contrib(pca_pistacio, choice = "var", axes = 3, xtickslab.rt = 90)
pc4 <- fviz_contrib(pca_pistacio, choice = "var", axes = 4, xtickslab.rt = 90)
pc5 <- fviz_contrib(pca_pistacio, choice = "var", axes = 5, xtickslab.rt = 90)
grid.arrange(pc1,pc2,ncol=2)

grid.arrange(pc3,pc4,ncol=2)

plot(pc5)

Conclusions

According to the “Contribution of variables to Dim 1-5” chart, the variables Eqdiasq, Area, Compactness, Convex_Area, Minor_Axis, Shapefactor_3, Aspect_Ratio, Major_Axis, Shapefactor_2, Shapefactor_1, Eccentricity, Solidity, Mean_RG, StdDev_RG, Shapefactor_4, Skew_RR and Mean_RR can be used to explain the variance and to classify the pistachios. That is 17 of the 28 variables. Seventeen features are still a lot to visualize and analyse, but a reduction of this kind should make the analyst's job easier and more efficient.
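For downstream modeling, the reduced representation itself is simply the matrix of component scores; keeping the first five (a one-line sketch):

# sketch: the 5-dimensional representation of all 2,148 pistachios
pistacios_reduced <- as.data.frame(pca_pistacio$x[, 1:5])
dim(pistacios_reduced)  # 2148 5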

References

1. OZKAN IA., KOKLU M. and SARACOGLU R. (2021). Classification of Pistachio Species Using Improved K-NN Classifier. Progress in Nutrition, Vol. 23, N. 2. DOI: 10.23751/pn.v23i2.9686. (Open Access) https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178

2. SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)

3. https://rdrr.io/cran

4. https://www.simplilearn.com/what-is-dimensionality-reduction-article#what_is_dimensionality_reduction