Introduction

In this project I will use Dimension Reduction approach for popular UCI Wine data. Basically, Dimensionality reduction eliminates some features of the dataset and creates a restricted set of features that contains all of the information needed to predict the target variables more efficiently and accurately.

Exploratory Data Analysis

In this part, you can observe Exploratory Data Analysis of the given data with some visualization parts to get an insight about the distribution of the data. We will analyze the data by using specific R codes. We will add visualization parts as well to observe data deeply.

Review of data set

The data set was found on UC Irvine Machine Learning Repository.

The data set consists of 13 variables.The columns of the Dataset are as following;

Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD
Proline

Loading Packages

Firstly, we should install the necessary libraries to our environment.

requiredPackages = c("tidyverse", "dplyr", "ggplot2","stats","factoextra","knitr","gridExtra","corrplot","readr","psych")
for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.3     âś” readr     2.1.4
## âś” forcats   1.0.0     âś” stringr   1.5.0
## âś” ggplot2   3.4.4     âś” tibble    3.2.1
## âś” lubridate 1.9.3     âś” tidyr     1.3.0
## âś” purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: factoextra
## Warning: package 'factoextra' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: knitr
## Loading required package: gridExtra
## Warning: package 'gridExtra' was built under R version 4.3.2
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## Loading required package: corrplot
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
## Loading required package: psych
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE) } 

library(clusterSim) 
## Warning: package 'clusterSim' was built under R version 4.3.2
## Loading required package: cluster
## Warning: package 'cluster' was built under R version 4.3.2
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(smacof)
## Warning: package 'smacof' was built under R version 4.3.2
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
## 
## Attaching package: 'plotrix'
## 
## The following object is masked from 'package:psych':
## 
##     rescale
## 
## Loading required package: colorspace
## Loading required package: e1071
## Warning: package 'e1071' was built under R version 4.3.2
## 
## Attaching package: 'smacof'
## 
## The following object is masked from 'package:psych':
## 
##     Procrustes
## 
## The following object is masked from 'package:base':
## 
##     transform

Statistical Analysis of Data

wine <- read.csv("C:\\Users\\mfati\\Desktop\\Master\\Semester1\\UL\\Projects\\DimensionReduction\\wine.csv", sep = ",", header=TRUE)

wine <- wine[,-1]

wine$Mg <- as.numeric(wine$Mg)
wine$Proline <- as.numeric(wine$Proline)

dim(wine)

dim(wine)
## [1] 178  13

str(wine)

str(wine)
## 'data.frame':    178 obs. of  13 variables:
##  $ Alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic.acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Acl                 : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Mg                  : num  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols             : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid.phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanth             : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color.int           : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD                  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline             : num  1065 1050 1185 1480 735 ...

summary(wine)

summary(wine)
##     Alcohol        Malic.acid         Ash             Acl       
##  Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##  Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##  3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##  Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##        Mg            Phenols        Flavanoids    Nonflavanoid.phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##     Proanth        Color.int           Hue               OD       
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

head(wine)

kable(head(wine,10))
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
14.39 1.87 2.45 14.6 96 2.50 2.52 0.30 1.98 5.25 1.02 3.58 1290
14.06 2.15 2.61 17.6 121 2.60 2.51 0.31 1.25 5.05 1.06 3.58 1295
14.83 1.64 2.17 14.0 97 2.80 2.98 0.29 1.98 5.20 1.08 2.85 1045
13.86 1.35 2.27 16.0 98 2.98 3.15 0.22 1.85 7.22 1.01 3.55 1045

Checking for missing values

sapply(wine, function(x) sum(is.na(x)))
##              Alcohol           Malic.acid                  Ash 
##                    0                    0                    0 
##                  Acl                   Mg              Phenols 
##                    0                    0                    0 
##           Flavanoids Nonflavanoid.phenols              Proanth 
##                    0                    0                    0 
##            Color.int                  Hue                   OD 
##                    0                    0                    0 
##              Proline 
##                    0

normalization

winenormal<-data.Normalization(wine, type="n1",normalization="column")
head(winenormal)
##     Alcohol  Malic.acid        Ash        Acl         Mg   Phenols Flavanoids
## 1 1.5143408 -0.56066822  0.2313998 -1.1663032 1.90852151 0.8067217  1.0319081
## 2 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481  0.7315653
## 3 0.1963252  0.02117152  1.1062139 -0.2679823 0.08810981 0.8067217  1.2121137
## 4 1.6867914 -0.34583508  0.4865539 -0.8069748 0.92829983 2.4844372  1.4623994
## 5 0.2948684  0.22705328  1.8352256  0.4506745 1.27837900 0.8067217  0.6614853
## 6 1.4773871 -0.51591132  0.3043010 -1.2860793 0.85828399 1.5576991  1.3622851
##   Nonflavanoid.phenols    Proanth  Color.int        Hue        OD     Proline
## 1           -0.6577078  1.2214385  0.2510088  0.3611585 1.8427215  1.01015939
## 2           -0.8184106 -0.5431887 -0.2924962  0.4049085 1.1103172  0.96252635
## 3           -0.4970050  2.1299594  0.2682629  0.3174085 0.7863692  1.39122370
## 4           -0.9791134  1.0292513  1.1827317 -0.4263410 1.1807407  2.32800680
## 5            0.2261576  0.4002753 -0.3183774  0.3611585 0.4483365 -0.03776747
## 6           -0.1755994  0.6623487  0.7298108  0.4049085 0.3356589  2.23274072

Correlation

corrplot(cor(winenormal[sapply(winenormal, is.numeric)]), type = 'upper', method = 'number', tl.cex = 0.9,number.cex=0.65,)

From correlation analysis, there is a high correlation between Phenols and Flavanoids.

If we want we can reduce dimensionality by removing highly correlated features because if two features are highly correlated, then they have same effect for machine learning model.

Let’s see behavior of these two variables and others at MDS and PCA

MDS

MDS (multidimensional scaling) is the machine learning approach which is used in purpose to reduce dimensions of the data in order to be able to plot some meaningful results from the data.

Let’s create a dissimilarity matrix and MDS plot

w1 = dist(t(winenormal))
mds = smacofSym(w1, ndim = 2 )
mds1<-cmdscale(w1, k=2) 
summary(mds1)
##        V1                 V2        
##  Min.   :-11.1063   Min.   :-8.798  
##  1st Qu.: -7.1703   1st Qu.:-3.536  
##  Median :  0.1172   Median : 1.068  
##  Mean   :  0.0000   Mean   : 0.000  
##  3rd Qu.:  7.3681   3rd Qu.: 4.423  
##  Max.   :  9.3303   Max.   : 8.006
mdsplot <- data.frame(mds1)
ggplot(data = mdsplot, aes(x = X1, y = X2) ) +
  geom_point(color = "red") +
  geom_text(aes(label = rownames(mdsplot)), vjust = 1.8, color = "black") +
  labs(x = "Dim1", y = "Dim2", title = "MDS Plot of Wine") +
  theme_bw()

We are seing dispersion of parameters

Now we will check the MDS Goodness

gdns<-randomstress(n=13, ndim=2, nrep=5) 
mean(gdns)
## [1] 0.3516239

It shows the average of randomly generated MDS stress values for 13 objects and 2 dimensions. The average stress value is 0.3586362

PCA

PCA is a dimensionality-reduction approach for reducing the dimensionality of big data sets by converting a large collection of variables into a smaller set that nevertheless includes the majority of the information in the large set.

We can perform PCA using prcomp() function

pca <- prcomp(winenormal, center=FALSE, scale=FALSE)
summarypca <- summary(pca)
summarypca
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion  0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
##                            PC8     PC9   PC10    PC11    PC12    PC13
## Standard deviation     0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion  0.92018 0.94240 0.9617 0.97907 0.99205 1.00000

The quality of this PCA is good enough. We have 7 principal components amounting to 90 % of the variance explained, the first one explaining 36 %. However, later validation steps will show what is the optimal number of components. We can continue

Let’s find eigenvalue of the data and show on graph

get_eig(pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.7058503       36.1988481                    36.19885
## Dim.2   2.4969737       19.2074903                    55.40634
## Dim.3   1.4460720       11.1236305                    66.52997
## Dim.4   0.9189739        7.0690302                    73.59900
## Dim.5   0.8532282        6.5632937                    80.16229
## Dim.6   0.6416570        4.9358233                    85.09812
## Dim.7   0.5510283        4.2386793                    89.33680
## Dim.8   0.3484974        2.6807489                    92.01754
## Dim.9   0.2888799        2.2221534                    94.23970
## Dim.10  0.2509025        1.9300191                    96.16972
## Dim.11  0.2257886        1.7368357                    97.90655
## Dim.12  0.1687702        1.2982326                    99.20479
## Dim.13  0.1033779        0.7952149                   100.00000
fviz_eig(pca, addlabels = TRUE,ylim = c(0, 50))

Now let’s merge these eigenvalue results with variable names and see which variable are more important than the others.

coordpca <- get_pca_var(pca)
# Coordinates of variables
kable(coordpca$coord)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
Alcohol -0.3130934 -0.7642573 -0.2493833 -0.0171176 0.2453944 -0.1710519 -0.0418637 -0.2338555 -0.2733703 -0.1059932 0.1073495 0.1093949 -0.0048132
Malic.acid 0.5318847 -0.3554317 0.1070404 0.5146798 -0.0325270 -0.4300067 0.3121603 -0.0388599 0.0404628 0.1548187 -0.0363438 -0.0499947 -0.0083480
Ash 0.0044494 -0.4994461 0.7530514 -0.2053154 0.1321131 -0.1237396 -0.1107313 0.1005107 0.1653782 0.0135872 0.2369643 0.0203857 0.0454050
Acl 0.5191571 0.0167349 0.7360433 0.0583417 -0.0610595 0.0807640 -0.2130209 -0.2526465 -0.1077366 -0.0264473 -0.2277566 0.0229001 -0.0294783
Mg -0.3080229 -0.4734761 0.1572388 -0.3372432 -0.6715773 -0.0305546 0.2396804 0.0923059 -0.1458722 -0.0339963 -0.0338745 -0.0255610 -0.0182543
Phenols -0.8561367 -0.1027742 0.1757842 0.1898745 0.1379259 0.0673849 -0.0207291 0.2396378 -0.1537366 0.1603543 -0.1446145 0.1248399 0.1491577
Flavanoids -0.9174702 0.0053091 0.1811991 0.1459946 0.1007075 0.0151556 -0.0450474 0.1105378 -0.0266472 0.0817224 0.0122091 0.0176235 -0.2675909
Nonflavanoid.phenols 0.6476070 -0.0454768 0.2048724 -0.1948907 0.4625011 0.2071428 0.4420082 0.1377165 -0.1050772 -0.1079619 -0.0555457 -0.0173990 -0.0366666
Proanth -0.6799217 -0.0621039 0.1797229 0.3825481 -0.1264179 0.4275888 0.2762438 -0.2173777 0.1124102 -0.0672129 0.1127881 0.0392548 0.0375917
Color.int 0.1922360 -0.8374894 -0.1651145 0.0631984 0.0706049 0.3353486 -0.1690336 0.0199516 -0.0302155 0.1456498 -0.0151289 -0.2482242 0.0038560
Hue -0.6436621 0.4412422 0.1024817 -0.4100751 0.1603683 -0.0848959 0.1722727 -0.2577549 -0.0461307 0.2616705 0.0229090 -0.1064894 0.0289015
OD -0.8160189 0.2599338 0.1996251 0.1765039 0.0934428 -0.2129560 -0.0332287 0.0461100 -0.0737561 -0.2623251 -0.0220591 -0.2468838 0.0503887
Proline -0.6220508 -0.5766127 -0.1524154 -0.2224704 0.1458240 -0.0959044 0.0570130 -0.0708538 0.3094709 -0.0812042 -0.2562460 0.0326195 -0.0046452
# Contribution of variables
kable(coordpca$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
Alcohol 2.0830974 23.3918820 4.3007553 0.0318848 7.0577176 4.5598754 0.3180549 15.6926312 25.8693405 4.4776561 5.1038474 7.0908474 0.0224100
Malic.acid 6.0116950 5.0593925 0.7923294 28.8251177 0.1240000 28.8169110 17.6840358 0.4333160 0.5667537 9.5530412 0.5850038 1.4809925 0.0674116
Ash 0.0004207 9.9899495 39.2156374 4.5871171 2.0456286 2.3862422 2.2251871 2.8988475 9.4675872 0.0735787 24.8693128 0.2462380 1.9942532
Acl 5.7274256 0.0112159 37.4642355 0.3703868 0.4369599 1.0165581 8.2351285 18.3158477 4.0179927 0.2787779 22.9741700 0.3107267 0.8405746
Mg 2.0161740 8.9780536 1.7097376 12.3760834 52.8599543 0.1454960 10.4253627 2.4448898 7.3659356 0.4606367 0.5082109 0.3871342 0.3223312
Phenols 15.5757183 0.4230138 2.1368289 3.9231070 2.2295987 0.7076561 0.0779804 16.4782482 8.1815747 10.2484083 9.2623559 9.2344544 21.5210553
Flavanoids 17.8873419 0.0011288 2.2705035 2.3193704 1.1886633 0.0357967 0.3682695 3.5060824 0.2458027 2.6618089 0.0660186 0.1840310 69.2651822
Nonflavanoid.phenols 8.9122014 0.0828259 2.9025311 4.1331305 25.0703477 6.6870863 35.4557474 5.4421728 3.8220766 4.6455367 1.3664643 0.1793708 1.3005087
Proanth 9.8238044 0.1544625 2.2336591 15.9246116 1.8730611 28.4937517 13.8487693 13.5590940 4.3741576 1.8005319 5.6340987 0.9130381 1.3669601
Color.int 0.7852920 28.0895412 1.8852996 0.4346196 0.5842581 17.5262917 5.1852819 0.1142232 0.3160410 8.4550208 0.1013709 36.5083773 0.0143827
Hue 8.8039532 7.7972268 0.7262776 18.2988379 3.0142002 1.1232342 5.3859103 19.0640190 0.7366513 27.2900597 0.2324398 6.7191895 0.8080004
OD 14.1501921 2.7058997 2.7557523 3.3900448 1.0233546 7.0676794 0.2003789 0.6100843 1.8831223 27.4267836 0.2155123 36.1151384 2.4560574
Proline 8.2226840 13.3154077 1.6064528 5.3856884 2.4922558 1.4334213 0.5898932 1.4405441 33.1529642 2.6281597 29.0811947 0.6304618 0.0208726
# Control variable colors using their contributions
fviz_pca_var(pca, col.var="contrib",
             gradient.cols = c("orange", "green", "#CC0066"),
             repel = TRUE # Avoid text overlapping
)

According to the chart above, variables with the darkest shading are the most significant, while those with the lightest shading are the least important.

# Contributions of variables to PC1
fviz_contrib(pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(pca, choice = "var", axes = 2, top = 10)

Now find results for individuals.

indpca <- get_pca_ind(pca)
# Coordinates of individuals
kable(head(indpca$coord))
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
-3.307421 -1.4394023 -0.1652728 -0.2150246 -0.6910933 -0.2232504 0.5947488 0.0649559 -0.6396384 -1.0180840 0.4502932 -0.5392891 0.0660523
-2.203250 0.3324551 -2.0207571 -0.2905387 0.2569299 -0.9245123 0.0536243 1.0215343 0.3079780 -0.1592521 0.1422560 -0.3871456 -0.0036263
-2.509661 -1.0282507 0.9800541 0.7228632 0.2503270 0.5477310 0.4230122 -0.3432479 1.1745213 -0.1130420 0.2858665 -0.0005819 -0.0216554
-3.746497 -2.7486184 -0.1756962 0.5663856 0.3109644 0.1141091 -0.3822590 0.6417831 -0.0523966 -0.2387392 -0.7574476 0.2413388 0.3684442
-1.006071 -0.8673840 2.0209873 -0.4086131 -0.2976180 -0.4053761 0.4428253 0.4155283 -0.3258998 0.0781460 0.5244656 0.2160547 0.0791403
-3.041674 -2.1164309 -0.6276254 -0.5141870 0.6302409 0.1230834 0.4005239 0.3937826 0.1517181 0.1017089 -0.4044444 0.3783654 -0.1447470
# Contribution of variables
kable(head(indpca$contrib))
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
1.3059329 0.4661550 0.0106119 0.0282653 0.3144765 0.0436377 0.3606395 0.0068017 0.7956675 2.3208241 0.5045087 0.9681162 0.0237098
0.5795213 0.0248675 1.5864204 0.0516042 0.0434654 0.7483460 0.0029318 1.6822341 0.1844599 0.0567866 0.0503523 0.4989223 0.0000715
0.7519205 0.2378834 0.3731558 0.3194398 0.0412601 0.2626707 0.1824365 0.1899312 2.6827759 0.0286124 0.2033313 0.0000011 0.0025485
1.6756863 1.6997886 0.0119926 0.1961107 0.0636702 0.0114003 0.1489778 0.6639832 0.0053391 0.1276211 1.4275231 0.1938824 0.7377268
0.1208367 0.1692735 1.5867819 0.1020708 0.0583221 0.1438776 0.1999268 0.2783439 0.2065527 0.0136738 0.6844031 0.1553859 0.0340367
1.1045036 1.0077997 0.1530351 0.1616289 0.2615339 0.0132640 0.1635547 0.2499733 0.0447649 0.0231629 0.4070011 0.4765483 0.1138601
var<-get_pca_ind(pca)
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution of the Principal Components')

Result for PCA

As a results from the graphs above, both in results that we executed for all approach, We don’t have very strong evidence but we might drop Malic acid and Acl features from the dataset.

wine <- wine[!colnames(wine) %in% c("Malic acid", "Acl")]
kable(head(wine))
Alcohol Malic.acid Ash Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
14.23 1.71 2.43 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
13.20 1.78 2.14 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
13.16 2.36 2.67 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
14.37 1.95 2.50 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
13.24 2.59 2.87 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
14.20 1.76 2.45 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

REFERENCES

Lecture notes of Jacek Lewkowicz - Unsupervised Learning-University of Warsaw.

https://www.displayr.com/goodness-of-fit-in-mds-and-t-sne-with-shepard-diagrams/

https://www.geeksforgeeks.org/multidimensional-scaling-using-r/

https://medium.com/geekculture/principal-component-analysis-3d2b3a0bb93e

https://medium.com/swlh/principal-component-analysis-3b746802766

https://uclouvain-cbio.github.io/WSBIM1322/sec-dimred.html