Dimension reduction aims to explain the variance in a dataset using fewer variables. Extra variables are often redundant, and carrying them along produces models that perform poorly or take a long time to run. Essentially, we declutter the dataset by removing variables that are not needed. The first obvious step is to remove variables that are virtually identical (duplicated variables). Secondly, we may consider simplifying categorical variables by merging their levels.
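Both ideas can be done in a couple of lines of base R. The following is a minimal sketch using a made-up data frame (the column names and values are purely illustrative):
df <- data.frame(weight = c(60, 72, 80, 55),
                 kg     = c(60, 72, 80, 55),
                 grade  = factor(c("A+", "A", "B-", "B")))
df <- df[, !duplicated(as.list(df))]  # drop exact-duplicate columns (kg repeats weight)
levels(df$grade) <- list(A = c("A+", "A"), B = c("B-", "B"))  # collapse levels into coarser groups
df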
In this study we consider a wine dataset with the goal of identifying the membership of each wine in one of 3 cultivars (varieties), based on 13 chemical constituents. The data consists of 178 samples.
wine <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",", header = FALSE)
colnames(wine) <- c("Cvs", "alcohol","malic acid","ash","alkalinity of ash","magnesium","total phenols","flavanoids","nonflavanoid phenols","proanthocyanins","color intensity","hue", "OD280/OD315 of diluted wines","proline")
summary(wine)
## Cvs alcohol malic acid ash
## Min. :1.000 Min. :11.03 Min. :0.740 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210
## Median :2.000 Median :13.05 Median :1.865 Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.336 Mean :2.367
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558
## Max. :3.000 Max. :14.83 Max. :5.800 Max. :3.230
## alkalinity of ash magnesium total phenols flavanoids
## Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205
## Median :19.50 Median : 98.00 Median :2.355 Median :2.135
## Mean :19.49 Mean : 99.74 Mean :2.295 Mean :2.029
## 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875
## Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
## nonflavanoid phenols proanthocyanins color intensity hue
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800
## 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825
## Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650
## Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574
## 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200
## Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100
## OD280/OD315 of diluted wines proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.938 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.612 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
We begin by checking the data for correlations between variables, since strongly correlated variables carry largely redundant information and are candidates for removal. These correlations (computed with the cor() function) are displayed in the heat map below:
heatmap(cor(wine), Rowv = NA, Colv = NA)
Strong correlations are represented by the dark squares, which point to variables that may not be needed.
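The heat map is only qualitative. To read off the strongest relationships directly, we can list every pair of variables whose absolute correlation exceeds a chosen cutoff (0.7 here is an arbitrary choice):
cors <- cor(wine)
idx  <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[idx[, 1]],
           var2 = colnames(cors)[idx[, 2]],
           r    = round(cors[idx], 2))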
The main purpose of principal component analysis (PCA) is to reduce the number of variables in the data set by constructing a smaller set of new variables. The components it produces are orthogonal (perpendicular) to each other in order to maximize efficiency, where efficiency means using the minimum number of variables to explain the maximum amount of variance. These components can be illustrated graphically:
wine_data <- princomp(wine, cor = TRUE)  # quick first pass; the Cvs label column is still included
plot(wine_data, main = "Principal Component Analysis")
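As a quick check of the orthogonality claim above: the loading vectors produced by princomp are orthonormal, so the cross-product of the loading matrix should be (numerically) the identity matrix.
L <- unclass(loadings(wine_data))  # the loading matrix
zapsmall(crossprod(L))             # t(L) %*% L, with rounding noise zeroed out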
Although the graph above can be analysed, it is better to first normalize the data using the scale() function and to exclude the Cvs column (the class label) from the analysis.
classes <- factor(wine$Cvs)  # cultivar labels taken from the Cvs column, used to colour the plots below
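A quick tabulation confirms the three cultivar groups (59, 71 and 48 wines respectively, per the UCI documentation):
table(classes)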
winePCA <- prcomp(scale(wine[,-1]))
summary(winePCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
From this summary of the normalized dataset, we can read off each principal component's contribution. The first component (PC1) explains 36.2% of the variance in the data, and the cumulative proportion row shows how the explained percentage grows as more principal components are added.
It is evident from the summary that 92% of the variance in the data can be explained by the first 8 of the 13 principal components. This means that by dropping the last 5 components we lose only 8% of fidelity.
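These proportions come straight from the component variances (each component's variance is the square of its standard deviation), so the 92% figure can be reproduced by hand:
pvar <- winePCA$sdev^2 / sum(winePCA$sdev^2)  # proportion of variance per component
cumsum(pvar)[8]                               # ~0.92, matching the summary above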
We may consider two scenarios, illustrated in the graphs below: 1. a plot of principal components 3 and 4 (PC3, PC4); 2. a plot of the first two principal components (PC1, PC2).
It can be seen that the PC3/PC4 plot explains much less of the variance within the data than the PC1/PC2 plot; this is evident from the fact that many points are stacked on top of each other.
plot(winePCA$x[, 3:4], col = classes)  # components 3 and 4
plot(winePCA$x[, 1:2], col = classes)  # the first two components
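These visual impressions match the numbers from the summary:
pvar <- winePCA$sdev^2 / sum(winePCA$sdev^2)
sum(pvar[3:4])  # ~0.18 of the variance lies in PC3 + PC4
sum(pvar[1:2])  # ~0.55 lies in PC1 + PC2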
From this analysis, PC8 is a sensible place to cut off: the first eight components already capture about 92% of the variance, while the plot of PC8 against PC9 below shows the classes heavily overlapping, indicating that the later components add little useful structure.
plot(winePCA$x[, 8:9], col = classes)  # components 8 and 9
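Having settled on a cut-off, the practical payoff is a much narrower table of component scores that can feed any downstream model. A minimal sketch (the name wine_reduced is hypothetical, introduced here for illustration):
wine_reduced <- data.frame(Cvs = classes, winePCA$x[, 1:8])  # replace the 13 measurements with 8 scores
dim(wine_reduced)  # 178 rows, 9 columns (class label + 8 components)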