Introduction

Dimensionality reduction replaces the original features of a dataset with a smaller set of features that retains most of the information needed to predict the target variables, which makes modelling more efficient and often more accurate. Here we will use Principal Component Analysis (PCA), a method for reducing dimensionality in high-dimensional datasets. We'll use the Wine Data Set from the UCI Machine Learning Repository. This data set contains the results of a chemical analysis of 178 wines from three cultivars. The observations record the quantities of 13 constituents found in each of the three types of wine.

library(HDclassif)
library(stats)

The code above loads the packages we need: HDclassif provides the wine data and stats provides the prcomp function.

data(wine)
str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ class: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V1   : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ V2   : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ V3   : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ V4   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ V5   : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ V6   : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ V7   : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ V8   : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ V9   : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ V10  : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ V11  : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ V12  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ V13  : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

Looking at the data, we can see that the variables are named V1 to V13, so I will first give them descriptive names.

names(wine) <- c("Type", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", 
"Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", 
"Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", 
"Proline")
str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol                     : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic acid                  : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                         : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity of ash           : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium                   : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Total phenols               : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids                  : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid phenols        : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins             : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color intensity             : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                         : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD280/OD315 of diluted wines: num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline                     : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
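
Several of the new names contain spaces or a slash, so they are not syntactic R names and need backticks when used with `$`. The sketch below illustrates this on a toy data frame; `toy` and `syntactic` are names introduced here, and the values are copied from the first three wines shown above.

```r
# Toy data frame with non-syntactic column names
# (values copied from the first three wines in the output above)
toy <- data.frame(
  "Malic acid" = c(1.71, 1.78, 2.36),
  "OD280/OD315 of diluted wines" = c(3.92, 3.40, 3.17),
  check.names = FALSE  # keep the names exactly as written
)

# Backticks are needed with `$` ...
toy$`Malic acid`

# ... while [[ ]] accepts the name as a plain string
toy[["OD280/OD315 of diluted wines"]]

# make.names() converts the names to syntactic ones, if that is preferred
syntactic <- make.names(names(toy))
syntactic
```

Keeping the descriptive names makes plots easier to read, while syntactic names are more convenient in formulas; either choice works for the PCA that follows.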

PCA using R

We can use principal component analysis (PCA) for the following purposes: to reduce the number of dimensions in the dataset; to find patterns in a high-dimensional dataset; to visualize high-dimensional data; to ignore noise; to improve classification; to get a compact description of the data; and to capture as much of the original variance in the data as possible.

(https://pub.towardsai.net/principal-component-analysis-pca-with-python-examples-tutorial-67a917bae9aa)
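
To make "capturing variance" concrete, here is a minimal sketch of what PCA on standardized data computes: the eigenvalues of the correlation matrix are the variances of the principal components. The data are synthetic, and the names `x`, `eig` and `pca` are illustrative only.

```r
set.seed(1)

# Synthetic data: 100 observations of 3 variables, two of them correlated
a <- rnorm(100)
x <- cbind(a = a, b = a + rnorm(100, sd = 0.5), c = rnorm(100))

# PCA by hand: eigendecomposition of the correlation matrix
eig <- eigen(cor(x))

# PCA with prcomp() on centered, scaled data
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# The component variances agree with the eigenvalues
all.equal(pca$sdev^2, eig$values)

# The eigenvalues sum to the number of variables, so each one divided
# by that sum is the proportion of variance explained by a component
eig$values / sum(eig$values)
```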

wine_pca <- prcomp(wine, center = TRUE, scale. = TRUE)
summary(wine_pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3529 1.5802 1.2025 0.96328 0.93675 0.82023 0.74418
## Proportion of Variance 0.3954 0.1784 0.1033 0.06628 0.06268 0.04806 0.03956
## Cumulative Proportion  0.3954 0.5738 0.6771 0.74336 0.80604 0.85409 0.89365
##                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.5916 0.54272 0.51216 0.47524 0.41085 0.35995 0.24044
## Proportion of Variance 0.0250 0.02104 0.01874 0.01613 0.01206 0.00925 0.00413
## Cumulative Proportion  0.9186 0.93969 0.95843 0.97456 0.98662 0.99587 1.00000

We use the prcomp function from the stats package to perform PCA. Because the whole data frame, including Type, is passed in, we obtain 14 principal components. Each of them explains a percentage of the total variation in the data. For example, PC1 explains about 40% of the total variance and PC2 about 18%, so with just the first two components (2 of 14, roughly 14% of the dimensions) we retain about 57% of the variance.
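
The call above passes the whole data frame to prcomp, so the class label is treated as one more feature. If the goal is to summarize only the 13 chemical measurements, one option is to drop the first column before running the PCA and to read the variance proportions directly from `sdev`. This is a sketch, not what was run above; `wine_raw`, `wine_chem_pca`, `var_explained` and `n_keep` are names introduced here, and the 80% cutoff is an arbitrary illustration.

```r
# Load the data into its own environment so that the renamed
# `wine` object from earlier is not overwritten
e <- new.env()
data("wine", package = "HDclassif", envir = e)
wine_raw <- e$wine

# Drop the class label (first column) and keep the 13 measurements
wine_chem_pca <- prcomp(wine_raw[, -1], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- wine_chem_pca$sdev^2 / sum(wine_chem_pca$sdev^2)
cum_explained <- cumsum(var_explained)
round(cum_explained, 3)

# Smallest number of components that reaches the chosen threshold
n_keep <- which(cum_explained >= 0.80)[1]
n_keep
```

Whether to include the class label depends on the purpose: leaving it out keeps the PCA unsupervised, so the label can later be used to check how well the components separate the cultivars.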

Visualizing the data set

Now I visualize the data set using the first two principal components.

biplot(wine_pca)

This plot shows all of the wines, with similar wines lying close to each other. For example, wines 4 and 19 appear next to each other at the top of the plot, so let's compare them.

wine[c(4, 19),]
##    Type Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
## 4     1   14.37       1.95 2.50              16.8       113          3.85
## 19    1   14.19       1.59 2.48              16.5       108          3.30
##    Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
## 4        3.49                 0.24            2.18             7.8 0.86
## 19       3.93                 0.32            1.86             8.7 1.23
##    OD280/OD315 of diluted wines Proline
## 4                          3.45    1480
## 19                         2.82    1680

As we can see above, these two wines are quite similar. We can hide the wine labels in the plot to make the variable vectors easier to see.

biplot(wine_pca, xlabs = rep("", nrow(wine)))
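
A biplot does not show which cultivar each wine belongs to. As a complementary view, the scores on the first two components can be plotted and colored by class. This is a sketch under the assumption that the class label is left out of the PCA so that it serves purely as a color; `wine_raw`, `scores` and `cultivar` are names introduced here.

```r
# Load the data into its own environment so that the renamed
# `wine` object from earlier is not overwritten
e <- new.env()
data("wine", package = "HDclassif", envir = e)
wine_raw <- e$wine

# PCA on the 13 measurements only; the first column is the class label
pca <- prcomp(wine_raw[, -1], center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]
cultivar <- factor(wine_raw[[1]])

# Scatter plot of the scores, one color per cultivar
plot(scores, col = as.integer(cultivar), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Wines on the first two principal components")
legend("topright", legend = levels(cultivar),
       col = seq_along(levels(cultivar)), pch = 19, title = "Type")
```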

Conclusion

The vectors show the relationship between the original variables and the principal components. For example, the vector for Alcalinity of ash is closely aligned with the PC1 axis, and the length of a vector represents the strength of the correlation between the original variable and the principal components. PCA is very useful for reducing the dimension of data. A full project would need more analysis, but PCA and biplots are enough to get a first understanding of high-dimensional data.
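
The relationship between variables and components that the biplot vectors depict can also be read numerically. The following is a sketch, assuming the class label is excluded and using names introduced here (`chem`, `loadings`, `var_cor`): the `rotation` matrix returned by prcomp holds the loadings, and correlating each variable with the scores gives the strength of its association with PC1 and PC2.

```r
# Load the data into its own environment so that the renamed
# `wine` object from earlier is not overwritten
e <- new.env()
data("wine", package = "HDclassif", envir = e)
chem <- e$wine[, -1]  # the 13 measurements, without the class label

pca <- prcomp(chem, center = TRUE, scale. = TRUE)

# Loadings: the coefficients of each variable in PC1 and PC2
loadings <- pca$rotation[, 1:2]

# Correlation of each original variable with the first two components
var_cor <- cor(chem, pca$x[, 1:2])

# For scaled data, the correlation equals the loading times the
# standard deviation of the component
all.equal(unname(var_cor),
          unname(sweep(loadings, 2, pca$sdev[1:2], `*`)))

# Variables ordered by the strength of their association with PC1
var_cor[order(abs(var_cor[, "PC1"]), decreasing = TRUE), ]
```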