Dimension reduction is a data transformation technique used in unsupervised machine learning to bring data from a high-dimensional space into a low-dimensional space while retaining the meaningful properties of the original data. In a nutshell, dimension reduction means representing data using fewer predictor variables (features).
In this project I use dimension reduction to check whether a data set with many features can be described in a similar way with fewer variables.
The data set contains features describing two types of pistachios. It is available on the Kaggle website: https://www.kaggle.com/datasets/muratkokludataset/pistachio-dataset.
| Id | Name |
|---|---|
| 1 | Area |
| 2 | Perimeter |
| 3 | Major_Axis |
| 4 | Minor_Axis |
| 5 | Eccentricity |
| 6 | Eqdiasq |
| 7 | Solidity |
| 8 | Convex_Area |
| 9 | Extent |
| 10 | Aspect_Ratio |
| 11 | Roundness |
| 12 | Compactness |
| 13 | Shapefactor_1 |
| 14 | Shapefactor_2 |
| 15 | Shapefactor_3 |
| 16 | Shapefactor_4 |
| 17 | Mean_RR |
| 18 | Mean_RG |
| 19 | Mean_RB |
| 20 | Skew_RR |
| 21 | Skew_RG |
| 22 | Skew_RB |
| 23 | Kurtosis_RR |
| 24 | Kurtosis_RG |
| 25 | Kurtosis_RB |
| 26 | StdDev_RR |
| 27 | StdDev_RG |
| 28 | StdDev_RB |
| 29 | Class |
# load data
library(readxl) # provides read_excel()
pistacios <- read_excel("C:\\Users\\skarbowiak\\Desktop\\Studia II UW\\Semestr 1 DSaBA\\Unsupervised learning 2022Z\\Dimension reduction projects\\Pistachio_Dataset.xlsx")
str(pistacios)
## tibble [2,148 × 29] (S3: tbl_df/tbl/data.frame)
## $ Area : num [1:2148] 63391 68358 73589 71106 80087 ...
## $ Perimeter : num [1:2148] 1568 1942 1247 1445 1252 ...
## $ Major_Axis : num [1:2148] 390 411 452 430 469 ...
## $ Minor_Axis : num [1:2148] 237 235 221 216 221 ...
## $ Eccentricity : num [1:2148] 0.795 0.821 0.873 0.864 0.882 ...
## $ Eqdiasq : num [1:2148] 284 295 306 301 319 ...
## $ Solidity : num [1:2148] 0.867 0.876 0.917 0.959 0.966 ...
## $ Convex_Area : num [1:2148] 73160 77991 80234 74153 82929 ...
## $ Extent : num [1:2148] 0.639 0.677 0.713 0.703 0.746 ...
## $ Aspect_Ratio : num [1:2148] 1.65 1.75 2.05 1.99 2.12 ...
## $ Roundness : num [1:2148] 0.324 0.228 0.595 0.428 0.642 ...
## $ Compactness : num [1:2148] 0.728 0.718 0.677 0.701 0.68 ...
## $ Shapefactor_1: num [1:2148] 0.0062 0.006 0.0061 0.006 0.0059 0.0073 0.0054 0.0062 0.0068 0.0063 ...
## $ Shapefactor_2: num [1:2148] 0.0037 0.0034 0.003 0.003 0.0028 0.0038 0.0035 0.0035 0.0033 0.0026 ...
## $ Shapefactor_3: num [1:2148] 0.53 0.516 0.458 0.491 0.463 ...
## $ Shapefactor_4: num [1:2148] 0.873 0.902 0.939 0.976 0.983 ...
## $ Mean_RR : num [1:2148] 196 223 213 212 230 ...
## $ Mean_RG : num [1:2148] 180 209 203 205 218 ...
## $ Mean_RB : num [1:2148] 165 187 188 188 194 ...
## $ StdDev_RR : num [1:2148] 17.7 26.7 19 18.2 23.4 ...
## $ StdDev_RG : num [1:2148] 19.6 27.2 20.1 18.7 24.1 ...
## $ StdDev_RB : num [1:2148] 21.1 25.1 20.7 29.8 23.1 ...
## $ Skew_RR : num [1:2148] 0.458 -0.385 -0.601 -0.694 -0.929 ...
## $ Skew_RG : num [1:2148] 0.663 -0.271 -0.45 -0.628 -0.813 ...
## $ Skew_RB : num [1:2148] 0.759 -0.293 0.3 -0.78 -0.497 ...
## $ Kurtosis_RR : num [1:2148] 2.97 1.98 3.54 2.88 2.99 ...
## $ Kurtosis_RG : num [1:2148] 3.06 2.1 3.69 2.87 2.88 ...
## $ Kurtosis_RB : num [1:2148] 2.95 2.22 4.1 2.9 2.74 ...
## $ Class : chr [1:2148] "Kirmizi_Pistachio" "Kirmizi_Pistachio" "Kirmizi_Pistachio" "Kirmizi_Pistachio" ...
summary(pistacios[,-29])
## Area Perimeter Major_Axis Minor_Axis
## Min. : 29808 Min. : 858.4 Min. :320.3 Min. :133.5
## 1st Qu.: 71937 1st Qu.:1171.0 1st Qu.:426.5 1st Qu.:217.9
## Median : 79906 Median :1262.8 Median :448.6 Median :236.4
## Mean : 79951 Mean :1426.0 Mean :446.2 Mean :238.3
## 3rd Qu.: 89031 3rd Qu.:1607.9 3rd Qu.:468.5 3rd Qu.:257.8
## Max. :124008 Max. :2755.0 Max. :542.0 Max. :383.0
## Eccentricity Eqdiasq Solidity Convex_Area
## Min. :0.5049 Min. :194.8 Min. :0.5880 Min. : 37935
## 1st Qu.:0.8175 1st Qu.:302.6 1st Qu.:0.9198 1st Qu.: 76467
## Median :0.8497 Median :319.0 Median :0.9542 Median : 85076
## Mean :0.8402 Mean :317.9 Mean :0.9401 Mean : 85016
## 3rd Qu.:0.8752 3rd Qu.:336.7 3rd Qu.:0.9769 3rd Qu.: 93894
## Max. :0.9460 Max. :397.4 Max. :0.9951 Max. :132478
## Extent Aspect_Ratio Roundness Compactness
## Min. :0.4272 Min. :1.159 Min. :0.0628 Min. :0.4760
## 1st Qu.:0.6870 1st Qu.:1.736 1st Qu.:0.3713 1st Qu.:0.6815
## Median :0.7265 Median :1.896 Median :0.6434 Median :0.7107
## Mean :0.7161 Mean :1.898 Mean :0.5692 Mean :0.7131
## 3rd Qu.:0.7536 3rd Qu.:2.067 3rd Qu.:0.7441 3rd Qu.:0.7417
## Max. :0.8204 Max. :3.086 Max. :0.9336 Max. :0.8779
## Shapefactor_1 Shapefactor_2 Shapefactor_3 Shapefactor_4
## Min. :0.004000 Min. :0.002400 Min. :0.2266 Min. :0.6204
## 1st Qu.:0.005200 1st Qu.:0.002800 1st Qu.:0.4645 1st Qu.:0.9440
## Median :0.005600 Median :0.003000 Median :0.5051 Median :0.9731
## Mean :0.005701 Mean :0.003017 Mean :0.5105 Mean :0.9552
## 3rd Qu.:0.006100 3rd Qu.:0.003200 3rd Qu.:0.5501 3rd Qu.:0.9873
## Max. :0.013100 Max. :0.005300 Max. :0.7706 Max. :0.9990
## Mean_RR Mean_RG Mean_RB StdDev_RR
## Min. :167.2 Min. :162.6 Min. :146.8 Min. :10.61
## 1st Qu.:211.6 1st Qu.:200.4 1st Qu.:182.9 1st Qu.:19.25
## Median :219.5 Median :208.9 Median :192.0 Median :21.43
## Mean :218.1 Mean :208.0 Mean :192.0 Mean :21.38
## 3rd Qu.:225.9 3rd Qu.:216.5 3rd Qu.:201.1 3rd Qu.:23.70
## Max. :241.3 Max. :240.5 Max. :235.0 Max. :30.84
## StdDev_RG StdDev_RB Skew_RR Skew_RG
## Min. :11.99 Min. :11.20 Min. :-1.9316 Min. :-1.6582
## 1st Qu.:20.04 1st Qu.:19.72 1st Qu.:-0.9909 1st Qu.:-0.8760
## Median :22.52 Median :22.28 Median :-0.7566 Median :-0.6531
## Mean :22.59 Mean :22.43 Mean :-0.7352 Mean :-0.6156
## 3rd Qu.:25.24 3rd Qu.:25.14 3rd Qu.:-0.5025 3rd Qu.:-0.4050
## Max. :33.61 Max. :42.76 Max. : 1.8654 Max. : 2.2576
## Skew_RB Kurtosis_RR Kurtosis_RG Kurtosis_RB
## Min. :-2.3486 Min. :1.662 Min. : 1.665 Min. : 1.522
## 1st Qu.:-0.6458 1st Qu.:2.510 1st Qu.: 2.437 1st Qu.: 2.449
## Median :-0.4245 Median :2.942 Median : 2.807 Median : 2.783
## Mean :-0.3671 Mean :3.054 Mean : 2.903 Mean : 2.941
## 3rd Qu.:-0.1584 3rd Qu.:3.446 3rd Qu.: 3.247 3rd Qu.: 3.225
## Max. : 1.8521 Max. :8.891 Max. :10.454 Max. :11.534
The data set contains 29 variables and 2,148 rows. First I drop the Class variable to remove the labels, and then I check whether the data contain any NA values.
pistacios2 <- pistacios[ , !(names(pistacios) %in% "Class")]
pistacios[complete.cases(pistacios), ]
## # A tibble: 2,148 × 29
## Area Perime…¹ Major…² Minor…³ Eccen…⁴ Eqdiasq Solid…⁵ Conve…⁶ Extent Aspec…⁷
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63391 1568. 390. 237. 0.795 284. 0.866 73160 0.639 1.65
## 2 68358 1942. 411. 235. 0.821 295. 0.876 77991 0.677 1.75
## 3 73589 1247. 452. 221. 0.873 306. 0.917 80234 0.713 2.05
## 4 71106 1445. 430. 216. 0.864 301. 0.959 74153 0.703 1.99
## 5 80087 1252. 469. 221. 0.882 319. 0.966 82929 0.746 2.12
## 6 52268 1154. 384. 198. 0.858 258. 0.856 61039 0.563 1.94
## 7 71693 1464. 388. 253. 0.759 302. 0.916 78304 0.689 1.54
## 8 62240 1898. 386. 218. 0.825 282. 0.895 69563 0.673 1.77
## 9 64319 2011. 436. 214. 0.872 286. 0.863 74502 0.654 2.04
## 10 78115 1239. 492. 205. 0.909 315. 0.962 81236 0.710 2.40
## # … with 2,138 more rows, 19 more variables: Roundness <dbl>,
## # Compactness <dbl>, Shapefactor_1 <dbl>, Shapefactor_2 <dbl>,
## # Shapefactor_3 <dbl>, Shapefactor_4 <dbl>, Mean_RR <dbl>, Mean_RG <dbl>,
## # Mean_RB <dbl>, StdDev_RR <dbl>, StdDev_RG <dbl>, StdDev_RB <dbl>,
## # Skew_RR <dbl>, Skew_RG <dbl>, Skew_RB <dbl>, Kurtosis_RR <dbl>,
## # Kurtosis_RG <dbl>, Kurtosis_RB <dbl>, Class <chr>, and abbreviated variable
## # names ¹Perimeter, ²Major_Axis, ³Minor_Axis, ⁴Eccentricity, ⁵Solidity, …
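Filtering with `complete.cases()` prints the whole table, so the absence of missing values has to be read off the row count. A more direct check counts the missing values explicitly. The sketch below shows this on a small made-up data frame (`toy` is hypothetical, not the pistachio data):

```r
# Hypothetical toy data frame standing in for the pistachio data
toy <- data.frame(
  Area      = c(63391, 68358, NA, 71106),
  Perimeter = c(1568, 1942, 1247, 1445)
)

# TRUE if at least one value is missing anywhere in the data frame
has_na <- anyNA(toy)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))

# Number of rows without any missing value
n_complete <- sum(complete.cases(toy))

print(has_na)
print(na_per_column)
print(n_complete)
```

For the pistachio data, `anyNA(pistacios)` should return `FALSE`, in line with the 2,148 complete rows printed above.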
All 2,148 rows are complete, so the data set contains no missing values. As the summary statistics show, the variables describing the pistachios are on very different scales, so I need to standardize them.
#standardization
pistacios2_stand <- as.data.frame(lapply(pistacios2, scale))
summary(pistacios2_stand)
## Area Perimeter Major_Axis Minor_Axis
## Min. :-3.821343 Min. :-1.5113 Min. :-3.88051 Min. :-3.45760
## 1st Qu.:-0.610735 1st Qu.:-0.6789 1st Qu.:-0.60842 1st Qu.:-0.67422
## Median :-0.003441 Median :-0.4345 Median : 0.07168 Median :-0.06253
## Mean : 0.000000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.691970 3rd Qu.: 0.4844 3rd Qu.: 0.68609 3rd Qu.: 0.64163
## Max. : 3.357585 Max. : 3.5389 Max. : 2.95011 Max. : 4.77502
## Eccentricity Eqdiasq Solidity Convex_Area
## Min. :-6.8771 Min. :-4.57492 Min. :-6.9787 Min. :-3.578953
## 1st Qu.:-0.4659 1st Qu.:-0.56771 1st Qu.:-0.4012 1st Qu.:-0.649859
## Median : 0.1934 Median : 0.03888 Median : 0.2786 Median : 0.004535
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.7174 3rd Qu.: 0.69741 3rd Qu.: 0.7300 3rd Qu.: 0.674855
## Max. : 2.1695 Max. : 2.95210 Max. : 1.0903 Max. : 3.607940
## Extent Aspect_Ratio Roundness Compactness
## Min. :-5.4988 Min. :-3.080606 Min. :-2.3800 Min. :-5.32193
## 1st Qu.:-0.5533 1st Qu.:-0.673799 1st Qu.:-0.9303 1st Qu.:-0.70771
## Median : 0.1986 Median :-0.007931 Median : 0.3489 Median :-0.05335
## Mean : 0.0000 Mean : 0.000000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.7145 3rd Qu.: 0.703334 3rd Qu.: 0.8223 3rd Qu.: 0.64142
## Max. : 1.9861 Max. : 4.946458 Max. : 1.7129 Max. : 3.69997
## Shapefactor_1 Shapefactor_2 Shapefactor_3 Shapefactor_4
## Min. :-2.0817 Min. :-1.81647 Min. :-4.43936 Min. :-6.4590
## 1st Qu.:-0.6133 1st Qu.:-0.63919 1st Qu.:-0.71919 1st Qu.:-0.2168
## Median :-0.1239 Median :-0.05056 Median :-0.08307 Median : 0.3445
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.4879 3rd Qu.: 0.53807 3rd Qu.: 0.61952 3rd Qu.: 0.6185
## Max. : 9.0532 Max. : 6.71874 Max. : 4.06835 Max. : 0.8442
## Mean_RR Mean_RG Mean_RB StdDev_RR
## Min. :-4.7137 Min. :-3.75625 Min. :-3.469375 Min. :-3.44298
## 1st Qu.:-0.5984 1st Qu.:-0.62418 1st Qu.:-0.695647 1st Qu.:-0.67988
## Median : 0.1355 Median : 0.07554 Median : 0.003149 Median : 0.01439
## Mean : 0.0000 Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.7295 3rd Qu.: 0.70583 3rd Qu.: 0.698546 3rd Qu.: 0.74040
## Max. : 2.1502 Max. : 2.69468 Max. : 3.300362 Max. : 3.02391
## StdDev_RG StdDev_RB Skew_RR Skew_RG
## Min. :-2.92805 Min. :-2.86017 Min. :-3.11078 Min. :-2.67875
## 1st Qu.:-0.70531 1st Qu.:-0.68885 1st Qu.:-0.66476 1st Qu.:-0.66902
## Median :-0.01883 Median :-0.03824 Median :-0.05553 Median :-0.09627
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.73172 3rd Qu.: 0.69099 3rd Qu.: 0.60518 3rd Qu.: 0.54103
## Max. : 3.04320 Max. : 5.17775 Max. : 6.76222 Max. : 7.38191
## Skew_RB Kurtosis_RR Kurtosis_RG Kurtosis_RB
## Min. :-4.6408 Min. :-1.8959 Min. :-1.8998 Min. :-1.8903
## 1st Qu.:-0.6526 1st Qu.:-0.7416 1st Qu.:-0.7148 1st Qu.:-0.6547
## Median :-0.1345 Median :-0.1529 Median :-0.1473 Median :-0.2096
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4889 3rd Qu.: 0.5347 3rd Qu.: 0.5287 3rd Qu.: 0.3787
## Max. : 5.1977 Max. : 7.9519 Max. :11.5921 Max. :11.4552
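`scale()` centres each column on its mean and divides it by its standard deviation, which is why every mean in the summary above is 0. As a minimal sketch, the same transformation can be written out by hand. The `toy` data frame below reuses the first five values of Area and Roundness from the `str()` output purely for illustration:

```r
# First five values of two columns, copied from the str() output
toy <- data.frame(
  Area      = c(63391, 68358, 73589, 71106, 80087),
  Roundness = c(0.324, 0.228, 0.595, 0.428, 0.642)
)

# z-score by hand: subtract the column mean, divide by the column sd
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Same result with scale(); as.numeric() drops the matrix attributes
z_scale <- as.data.frame(lapply(toy, function(x) as.numeric(scale(x))))

print(all.equal(z_manual, z_scale))
print(round(sapply(z_scale, sd), 6))
```

After this step every column has mean 0 and standard deviation 1, so no variable dominates the analysis just because of its units.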
Now I am going to check how my data is distributed.
library(Hmisc) # provides hist.data.frame()
hist.data.frame(pistacios2_stand[1:9])
hist.data.frame(pistacios2_stand[10:18])
hist.data.frame(pistacios2_stand[19:28])
par(mar = c(8, 6, 4, 1) + 0.1) # widen the margins so the rotated labels fit
boxplot(pistacios2_stand[,1:9],las=2)
boxplot(pistacios2_stand[,10:18],las=2)
boxplot(pistacios2_stand[,19:28],las=2)
As the histograms and boxplots show, many variables in the pistachio data set contain numerous outliers, and many of them are not normally distributed. I therefore check the correlation between the variables with Spearman rather than Pearson correlation, because Spearman works on ranks and is less sensitive to outliers and non-normal distributions.
library(corrplot)
# use a name that does not shadow the base function cor()
cor_spearman <- cor(pistacios2_stand, method = "spearman")
corrplot(cor_spearman)
The number of variables makes the correlation plot hard to read, but some patterns stand out: features such as Area and Eqdiasq are strongly positively correlated, while Compactness and Aspect_Ratio are negatively correlated. Such strong correlations point to redundancy among the features, which dimension reduction can exploit to make the analysis clearer.
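With 28 variables the plot is dense, so it can help to list the strongest pairs as a table. The sketch below does this for a small simulated data set. `top_correlations` is a hypothetical helper written for this illustration, not part of `corrplot`:

```r
# Hypothetical helper: return variable pairs sorted by absolute correlation
top_correlations <- function(data, method = "spearman", n = 5) {
  cm <- cor(data, method = method)
  # Keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    rho  = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$rho)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Simulated stand-in data: b rises with a, c falls with a, d is pure noise
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = exp(a), c = -a^3, d = rnorm(200))

result <- top_correlations(toy, n = 3)
print(result)
```

Because Spearman correlation only looks at ranks, the non-linear but monotone relations in the toy data still score as perfect correlations. Applied to `pistacios2_stand`, such a table would be expected to place pairs like Area and Eqdiasq near the top.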
Principal component analysis, or PCA, is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables naturally comes at the expense of accuracy, but the trick in dimension reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster without extraneous variables to process.
Next I look for the optimal number of dimensions for the pistachio data set using eigenvalues, which measure the amount of variation retained by each principal component.
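On standardized data, PCA amounts to an eigen-decomposition of the correlation matrix. The sketch below, run on simulated data rather than the pistachio set, checks that this by-hand route gives the same component variances as `prcomp()`. All variable names in it are made up for the illustration:

```r
set.seed(42)
n <- 300
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3), # strongly related to x1
  x3 = rnorm(n),
  x4 = rnorm(n)
)
toy_stand <- as.data.frame(scale(toy))

# PCA by hand: eigenvalues of the correlation matrix are the component variances
eig <- eigen(cor(toy_stand))

# PCA with prcomp(): sdev^2 holds the same variances
pca <- prcomp(toy_stand)

print(round(eig$values, 4))
print(round(pca$sdev^2, 4))

# Scores are the standardized data projected onto the eigenvectors
# (the sign of each component is arbitrary and may differ from prcomp)
scores_manual <- as.matrix(toy_stand) %*% eig$vectors
```

The eigenvalues sum to the number of variables, so each one can be read as "how many original variables' worth of variance" a component carries.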
pca_pistacio <- prcomp(pistacios2_stand)
summary(pca_pistacio)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.9907 2.3413 2.0709 1.6774 1.61240 0.98869 0.81147
## Proportion of Variance 0.3195 0.1958 0.1532 0.1005 0.09285 0.03491 0.02352
## Cumulative Proportion 0.3195 0.5152 0.6684 0.7689 0.86172 0.89663 0.92014
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.72381 0.63638 0.6034 0.52440 0.43074 0.39738 0.26160
## Proportion of Variance 0.01871 0.01446 0.0130 0.00982 0.00663 0.00564 0.00244
## Cumulative Proportion 0.93885 0.95332 0.9663 0.97614 0.98277 0.98841 0.99085
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.22385 0.2183 0.19957 0.17952 0.16882 0.16318 0.11266
## Proportion of Variance 0.00179 0.0017 0.00142 0.00115 0.00102 0.00095 0.00045
## Cumulative Proportion 0.99264 0.9943 0.99577 0.99692 0.99793 0.99889 0.99934
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.10261 0.07269 0.03442 0.02696 0.02567 0.00866 0.0069
## Proportion of Variance 0.00038 0.00019 0.00004 0.00003 0.00002 0.00000 0.0000
## Cumulative Proportion 0.99972 0.99990 0.99995 0.99997 1.00000 1.00000 1.0000
library(factoextra) # provides get_eigenvalue() and the fviz_* plots
pistacio_eigenvalue <- get_eigenvalue(pca_pistacio)
pistacio_eigenvalue
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 8.944545e+00 3.194480e+01 31.94480
## Dim.2 5.481521e+00 1.957686e+01 51.52166
## Dim.3 4.288599e+00 1.531643e+01 66.83809
## Dim.4 2.813533e+00 1.004833e+01 76.88642
## Dim.5 2.599848e+00 9.285173e+00 86.17160
## Dim.6 9.775043e-01 3.491087e+00 89.66268
## Dim.7 6.584773e-01 2.351705e+00 92.01439
## Dim.8 5.239006e-01 1.871073e+00 93.88546
## Dim.9 4.049841e-01 1.446372e+00 95.33183
## Dim.10 3.640829e-01 1.300296e+00 96.63213
## Dim.11 2.749957e-01 9.821276e-01 97.61426
## Dim.12 1.855363e-01 6.626297e-01 98.27689
## Dim.13 1.579137e-01 5.639777e-01 98.84086
## Dim.14 6.843411e-02 2.444076e-01 99.08527
## Dim.15 5.010822e-02 1.789579e-01 99.26423
## Dim.16 4.763394e-02 1.701212e-01 99.43435
## Dim.17 3.982958e-02 1.422485e-01 99.57660
## Dim.18 3.222750e-02 1.150982e-01 99.69170
## Dim.19 2.850015e-02 1.017863e-01 99.79348
## Dim.20 2.662683e-02 9.509581e-02 99.88858
## Dim.21 1.269226e-02 4.532950e-02 99.93391
## Dim.22 1.052819e-02 3.760068e-02 99.97151
## Dim.23 5.284090e-03 1.887175e-02 99.99038
## Dim.24 1.184700e-03 4.231071e-03 99.99461
## Dim.25 7.270880e-04 2.596743e-03 99.99721
## Dim.26 6.589270e-04 2.353311e-03 99.99956
## Dim.27 7.500127e-05 2.678617e-04 99.99983
## Dim.28 4.761629e-05 1.700582e-04 100.00000
As the PCA results show, just two components describe more than 50% of the variance, and 6 of the 28 components explain almost 90% of it. Dimension reduction is a powerful tool! Let’s visualize these results.
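The cumulative percentages follow directly from the eigenvalues. As a quick check, the sketch below takes the 28 eigenvalues printed above and finds how many components are needed to reach a given share of the variance. `components_needed` is a hypothetical helper introduced only for this illustration:

```r
# Eigenvalues copied from the get_eigenvalue() output above
eigenvalues <- c(
  8.944545, 5.481521, 4.288599, 2.813533, 2.599848,
  0.9775043, 0.6584773, 0.5239006, 0.4049841, 0.3640829,
  0.2749957, 0.1855363, 0.1579137, 0.06843411, 0.05010822,
  0.04763394, 0.03982958, 0.03222750, 0.02850015, 0.02662683,
  0.01269226, 0.01052819, 0.005284090, 0.001184700, 0.0007270880,
  0.0006589270, 0.00007500127, 0.00004761629
)

# Share of total variance accumulated after each component, in percent
cumulative_pct <- 100 * cumsum(eigenvalues) / sum(eigenvalues)

# Hypothetical helper: smallest number of components reaching the threshold
components_needed <- function(threshold_pct) {
  which(cumulative_pct >= threshold_pct)[1]
}

print(round(cumulative_pct[1:7], 2))
print(components_needed(50))
print(components_needed(90))
```

Two components pass the 50% mark, six reach about 89.7%, and a seventh is needed to cross 90%.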
fviz_eig(pca_pistacio, addlabels = TRUE,ylim=c(0,35) , main = "Scree plot")
The scree plot shows the same figures more clearly: the first 5 components explain about 86% of the total variance, and the first 6 almost 90%.
fviz_pca_var(pca_pistacio, col.var = "darkblue",repel=TRUE,col.circle="darkblue",addEllipses=TRUE)
The chart above shows the correlation between the variables and the first two principal components. Positively correlated variables are grouped together, while negatively correlated variables are positioned on opposite sides of the plot origin.
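The coordinates drawn on such a correlation circle are the correlations between each standardized variable and the principal components, which equal the loadings multiplied by the component standard deviations. A minimal sketch on simulated data (the variable names are made up and only loosely echo the pistachio features):

```r
set.seed(7)
n <- 250
size <- rnorm(n)
toy <- data.frame(
  area      = size + rnorm(n, sd = 0.2),
  perimeter = size + rnorm(n, sd = 0.4),
  compact   = -size + rnorm(n, sd = 0.5),
  noise     = rnorm(n)
)
pca <- prcomp(scale(toy))

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Direct correlation between the original variables and the component scores
var_cor <- cor(toy, pca$x)

print(round(var_coord[, 1:2], 3))
print(all.equal(unname(var_coord), unname(var_cor)))
```

In the toy data, `area` and `compact` end up with opposite signs on the first component, which is the "opposite sides of the origin" pattern described above.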
fviz_pca_ind(pca_pistacio, col.ind="cos2", geom = "point")
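The graph of individuals shows each observation, rather than each variable, projected onto the first two principal components. The colour encodes cos2, which measures how well an observation is represented on this plane.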
fviz_eig(pca_pistacio, choice="eigenvalue",addlabels = TRUE,ylim=c(0,10) , main = "Scree plot - eigenvalue")
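An eigenvalue greater than 1 indicates that a principal component accounts for more variance than a single original variable does in standardized data, where each variable has variance 1. The first five components meet this criterion, while the sixth has an eigenvalue of 0.98, so I use 5 dimensions as the cutoff point.
With that knowledge I would like to see the contributions of the variables to the PCs.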
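This rule is often called the Kaiser criterion. Applied to the first seven eigenvalues reported earlier (rounded here to four decimals), it can be sketched as:

```r
# First seven eigenvalues from the get_eigenvalue() output, rounded
eigenvalues <- c(
  Dim.1 = 8.9445, Dim.2 = 5.4815, Dim.3 = 4.2886,
  Dim.4 = 2.8135, Dim.5 = 2.5998, Dim.6 = 0.9775,
  Dim.7 = 0.6585
)

# Keep components whose variance exceeds that of one standardized variable
kept <- names(eigenvalues)[eigenvalues > 1]
n_kept <- length(kept)

print(kept)
print(n_kept)
```

Dim.6 falls just under the cutoff, so the choice between five and six components is a close call and depends on how strictly the rule is applied.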
variables_rez <- get_pca_var(pca_pistacio)
corrplot(variables_rez$cos2, is.corr=FALSE)
fviz_contrib(pca_pistacio, "var", axes = 1:5)
The contribution of variables decreases with successive dimensions. The red line on the graph above shows the expected average contribution, that is, the value each variable would have if all variables contributed equally. Variables above the line can be considered important.
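For a single component, a variable's contribution is commonly defined as its squared loading expressed as a percentage, and the uniform reference value is 100 / p for p variables. The sketch below reproduces that definition in base R on simulated data. It illustrates the idea under these assumptions and is not taken from the `factoextra` source:

```r
set.seed(3)
n <- 200
f <- rnorm(n)
toy <- data.frame(
  v1 = f + rnorm(n, sd = 0.3),
  v2 = f + rnorm(n, sd = 0.3),
  v3 = rnorm(n),
  v4 = rnorm(n)
)
pca <- prcomp(scale(toy))

# Each column of rotation is a unit vector, so squared loadings sum to 1
contrib_pct <- 100 * pca$rotation^2

# Contribution expected if all variables contributed equally
expected_pct <- 100 / ncol(toy)

# Variables above the reference value on the first component
important_pc1 <- rownames(contrib_pct)[contrib_pct[, "PC1"] > expected_pct]

print(round(contrib_pct[, "PC1"], 2))
print(important_pc1)
```

In the toy data only the two related variables `v1` and `v2` exceed the 25% reference value on the first component.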
The contribution chart shows a lot, but it accumulates all 5 dimensions. I will now look at the contributions in each dimension separately.
library(gridExtra) # provides grid.arrange()
pc1 <- fviz_contrib(pca_pistacio, choice = "var", axes = 1, xtickslab.rt = 90)
pc2 <- fviz_contrib(pca_pistacio, choice = "var", axes = 2, xtickslab.rt = 90)
pc3 <- fviz_contrib(pca_pistacio, choice = "var", axes = 3, xtickslab.rt = 90)
pc4 <- fviz_contrib(pca_pistacio, choice = "var", axes = 4, xtickslab.rt = 90)
pc5 <- fviz_contrib(pca_pistacio, choice = "var", axes = 5, xtickslab.rt = 90)
grid.arrange(pc1,pc2,ncol=2)
grid.arrange(pc3,pc4,ncol=2)
plot(pc5)
According to the “Contribution of variables to Dim 1-5” chart, the variables Eqdiasq, Area, Compactness, Convex_Area, Minor_Axis, Shapefactor_3, Aspect_Ratio, Major_Axis, Shapefactor_2, Shapefactor_1, Eccentricity, Solidity, Mean_RG, StdDev_RG, Shapefactor_4, Skew_RR and Mean_RR can be used to explain the variance and to classify pistachios. That is 17 of the 28 variables. Seventeen features are still a lot to visualize and analyse, but a reduction of this kind should make an analyst’s job easier and more efficient.
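When several components are combined, a commonly used definition weights the per-component contributions by the eigenvalues. Whether `fviz_contrib` matches this exactly is an assumption here. The sketch below applies the definition in base R to simulated data and keeps the variables above the uniform reference value; `combined_contrib` is a hypothetical helper:

```r
# Hypothetical helper: eigenvalue-weighted contribution over several components
combined_contrib <- function(pca, axes) {
  contrib_pct <- 100 * pca$rotation[, axes, drop = FALSE]^2
  eig <- pca$sdev[axes]^2
  total <- as.vector(contrib_pct %*% eig) / sum(eig)
  names(total) <- rownames(pca$rotation)
  total
}

# Simulated data: two pairs of related variables plus two noise variables
set.seed(11)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a = f1 + rnorm(n, sd = 0.3),
  b = f1 + rnorm(n, sd = 0.3),
  c = f2 + rnorm(n, sd = 0.3),
  d = f2 + rnorm(n, sd = 0.3),
  e = rnorm(n),
  g = rnorm(n)
)
pca <- prcomp(scale(toy))

total <- combined_contrib(pca, axes = 1:2)

# Keep the variables above the uniform reference value of 100 / p
selected <- names(total)[total > 100 / ncol(toy)]

print(round(sort(total, decreasing = TRUE), 2))
print(selected)
```

The same selection step applied to the first five pistachio components would give a list comparable to the 17 variables named above, although the exact result depends on how the plotting function computes its reference line.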
1. OZKAN IA., KOKLU M. and SARACOGLU R. (2021). Classification of Pistachio Species Using Improved K-NN Classifier. Progress in Nutrition, Vol. 23, N. 2, pp. DOI:10.23751/pn.v23i2.9686. (Open Access) https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178
2. SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)