Dimension reduction aims to explain the variance in a dataset using fewer variables. Extra variables are often redundant, and carrying them along produces models that perform poorly or take a long time to run. Essentially, we declutter the dataset by removing variables that are not needed. The first obvious step is to remove variables that are virtually identical (duplicated variables). Secondly, we may consider simplifying categorical variables by merging their levels.
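Both ideas can be done in a couple of lines of base R. The following is a minimal sketch using a made-up data frame (the column names and values are purely illustrative):
df <- data.frame(weight = c(60, 72, 80, 55),
                 kg     = c(60, 72, 80, 55),
                 grade  = factor(c("A+", "A", "B-", "B")))
df <- df[, !duplicated(as.list(df))]  # drop exact-duplicate columns (kg repeats weight)
levels(df$grade) <- list(A = c("A+", "A"), B = c("B-", "B"))  # collapse levels into coarser groups
df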
In this study we consider a wine dataset with the goal of identifying the membership of each wine in one of 3 cultivars (varieties), based on 13 chemical constituents. The data consists of 178 samples.
wine <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",", header = FALSE)
colnames(wine) <- c("Cvs", "alcohol","malic acid","ash","alkalinity of ash","magnesium","total phenols","flavanoids","nonflavanoid phenols","proanthocyanins","color intensity","hue", "OD280/OD315 of diluted wines","proline")
summary(wine)
## Cvs alcohol malic acid ash
## Min. :1.000 Min. :11.03 Min. :0.740 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210
## Median :2.000 Median :13.05 Median :1.865 Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.336 Mean :2.367
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558
## Max. :3.000 Max. :14.83 Max. :5.800 Max. :3.230
## alkalinity of ash magnesium total phenols flavanoids
## Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205
## Median :19.50 Median : 98.00 Median :2.355 Median :2.135
## Mean :19.49 Mean : 99.74 Mean :2.295 Mean :2.029
## 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875
## Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
## nonflavanoid phenols proanthocyanins color intensity hue
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800
## 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825
## Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650
## Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574
## 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200
## Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100
## OD280/OD315 of diluted wines proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.938 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.612 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
We begin by checking the data for correlations between variables, since strongly correlated variables carry largely redundant information and are candidates for removal. These correlations (computed with the cor() function) are displayed in the heat map below:
heatmap(cor(wine), Rowv = NA, Colv = NA)
Strong correlations are represented by the dark squares, which point to variables that may not be needed.
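The heat map is only qualitative. To read off the strongest relationships directly, we can list every pair of variables whose absolute correlation exceeds a chosen cutoff (0.7 here is an arbitrary choice):
cors <- cor(wine)
idx  <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[idx[, 1]],
           var2 = colnames(cors)[idx[, 2]],
           r    = round(cors[idx], 2))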
The main purpose of principal component analysis (PCA) is to reduce the number of variables in the data set by constructing a smaller set of new variables. The components it produces are orthogonal (perpendicular) to each other in order to maximize efficiency, where efficiency means using the minimum number of variables to explain the maximum amount of variance. These components can be illustrated graphically:
wine_data <- princomp(wine, cor = TRUE)  # quick first pass; the Cvs label column is still included
plot(wine_data, main = "Principal Component Analysis")
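As a quick check of the orthogonality claim above: the loading vectors produced by princomp are orthonormal, so the cross-product of the loading matrix should be (numerically) the identity matrix.
L <- unclass(loadings(wine_data))  # the loading matrix
zapsmall(crossprod(L))             # t(L) %*% L, with rounding noise zeroed out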
Although the graph above can be analysed, it is better to first normalize the data using the scale() function and to exclude the Cvs column (the class label) from the analysis.
classes <- factor(wine$Cvs)  # cultivar labels taken from the Cvs column, used to colour the plots below
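A quick tabulation confirms the three cultivar groups (59, 71 and 48 wines respectively, per the UCI documentation):
table(classes)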
winePCA <- prcomp(scale(wine[,-1]))
summary(winePCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
From this summary of the normalized dataset, we can read off each principal component's contribution. The first component (PC1) explains 36.2% of the variance in the data, and the cumulative proportion row shows how the explained percentage grows as more principal components are added.
It is evident from the summary that 92% of the variance in the data can be explained by the first 8 of the 13 principal components. This means that by dropping the last 5 components we lose only 8% of fidelity.
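These proportions come straight from the component variances (each component's variance is the square of its standard deviation), so the 92% figure can be reproduced by hand:
pvar <- winePCA$sdev^2 / sum(winePCA$sdev^2)  # proportion of variance per component
cumsum(pvar)[8]                               # ~0.92, matching the summary above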
We may consider two scenarios, illustrated in the graphs below: 1. a plot of principal components 3 and 4 (PC3, PC4); 2. a plot of the first two principal components (PC1, PC2).
It can be seen that the PC3/PC4 plot explains much less of the variance within the data than the PC1/PC2 plot; this is evident from the fact that many points are stacked on top of each other.
plot(winePCA$x[, 3:4], col = classes)  # components 3 and 4
plot(winePCA$x[, 1:2], col = classes)  # the first two components
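These visual impressions match the numbers from the summary:
pvar <- winePCA$sdev^2 / sum(winePCA$sdev^2)
sum(pvar[3:4])  # ~0.18 of the variance lies in PC3 + PC4
sum(pvar[1:2])  # ~0.55 lies in PC1 + PC2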
From this analysis, PC8 is a sensible place to cut off: the first eight components already capture about 92% of the variance, while the plot of PC8 against PC9 below shows the classes heavily overlapping, indicating that the later components add little useful structure.
plot(winePCA$x[, 8:9], col = classes)  # components 8 and 9
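Having settled on a cut-off, the practical payoff is a much narrower table of component scores that can feed any downstream model. A minimal sketch (the name wine_reduced is hypothetical, introduced here for illustration):
wine_reduced <- data.frame(Cvs = classes, winePCA$x[, 1:8])  # replace the 13 measurements with 8 scores
dim(wine_reduced)  # 178 rows, 9 columns (class label + 8 components)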