Review Principal Component Analysis (PCA) techniques and try a few packages from GitHub.
Files are found locally at F:Directory. winequality-white.csv is examined here; winequality-red.csv will not be examined initially.
The variables are:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulphates
- Alcohol
- Quality
winequality-white.csv has 4898 records with 12 variables and no missing values.
setwd("F:/R/Working Directory/Rpubs/PCA")
whiteWine = read.csv(file = "F:/R/Working Directory/Rpubs/PCA/winequality-white.csv" , sep = ";", header = TRUE)
dim(whiteWine)
## [1] 4898 12
head(whiteWine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
A principal component analysis takes just a few lines of code. Scaling (scale. = TRUE) standardizes each variable to unit variance, so that variables measured on larger scales do not dominate the components.
whitePCA = prcomp(whiteWine, scale. = TRUE)
summary(whitePCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8294 1.2594 1.1710 1.04157 0.98756 0.96890 0.8771
## Proportion of Variance 0.2789 0.1322 0.1143 0.09041 0.08127 0.07823 0.0641
## Cumulative Proportion 0.2789 0.4111 0.5253 0.61573 0.69701 0.77524 0.8393
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.85082 0.74599 0.58561 0.53302 0.14307
## Proportion of Variance 0.06032 0.04638 0.02858 0.02368 0.00171
## Cumulative Proportion 0.89967 0.94604 0.97462 0.99829 1.00000
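To see why the scale. = TRUE argument matters, here is a minimal sketch comparing a scaled and an unscaled PCA. It uses the built-in USArrests data as a stand-in (an assumption, made only so the snippet runs without the wine CSV), and the names raw and scaled are illustrative.

```r
# Stand-in data: USArrests ships with base R; it is NOT the wine data.
# Its columns have very different variances, much like the wine variables.
sapply(USArrests, var)

raw    <- prcomp(USArrests, scale. = FALSE)  # variables keep their original units
scaled <- prcomp(USArrests, scale. = TRUE)   # each variable rescaled to unit variance

# Share of total variance captured by the first component in each case
raw_pc1    <- raw$sdev[1]^2 / sum(raw$sdev^2)
scaled_pc1 <- scaled$sdev[1]^2 / sum(scaled$sdev^2)
c(unscaled = raw_pc1, scaled = scaled_pc1)

# Largest absolute loading on PC1: close to 1 when a single variable dominates
max(abs(raw$rotation[, 1]))
max(abs(scaled$rotation[, 1]))
```

Without scaling, the first component mostly tracks whichever variable has the largest raw variance. With scaling, the total variance equals the number of variables and each one contributes on an equal footing.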
A detailed view of the PCA:
whitePCA
## Standard deviations:
## [1] 1.8293903 1.2594008 1.1709706 1.0415668 0.9875644 0.9688978 0.8770680
## [8] 0.8508195 0.7459900 0.5856051 0.5330248 0.1430703
##
## Rotation:
## PC1 PC2 PC3 PC4
## fixed.acidity -0.15690447 0.56066866 -0.20738436 0.03373494
## volatile.acidity -0.02428722 0.01606694 0.52491466 -0.13119747
## citric.acid -0.13294430 0.28938115 -0.44635554 0.32953335
## residual.sugar -0.40605288 -0.03882402 -0.03384313 -0.41615630
## chlorides -0.21754400 0.03691144 0.21471269 0.50961203
## free.sulfur.dioxide -0.27471931 -0.34554881 -0.31297088 -0.14892788
## total.sulfur.dioxide -0.39044148 -0.27232605 -0.12479447 -0.02161841
## density -0.50129557 -0.01773344 0.03196758 -0.10386393
## pH 0.13003701 -0.56714503 0.06848384 0.20410995
## sulphates -0.03364168 -0.24826266 -0.22699505 0.51924489
## alcohol 0.44279498 0.01698188 -0.15887556 -0.13438871
## quality 0.22713722 -0.14603134 -0.48884718 -0.27820033
## PC5 PC6 PC7 PC8
## fixed.acidity -0.24413933 0.105856235 0.22355921 0.13041311
## volatile.acidity -0.70298193 -0.123704688 -0.22363601 -0.22960669
## citric.acid -0.06510579 -0.131958661 -0.12037133 -0.69141866
## residual.sugar 0.01610213 0.289918546 -0.33860858 -0.11329401
## chlorides 0.17829248 -0.409317266 -0.55225504 0.21139734
## free.sulfur.dioxide -0.11117214 -0.488085145 0.22407108 0.12883115
## total.sulfur.dioxide -0.27144774 -0.272493820 0.20375343 0.01290262
## density 0.07834373 0.326008106 -0.12313568 -0.08667076
## pH 0.11270171 0.192688838 0.07704001 -0.47796137
## sulphates -0.45623099 0.479811894 -0.04462167 0.33642752
## alcohol -0.30855451 -0.135443327 -0.09801169 -0.08899029
## quality -0.04112191 -0.005524396 -0.58434519 0.14444197
## PC9 PC10 PC11 PC12
## fixed.acidity -0.63145048 0.20087123 -0.10411772 0.170792295
## volatile.acidity -0.03159628 -0.14175876 -0.27002270 0.013376718
## citric.acid 0.24949503 -0.10632912 -0.05395597 0.009648802
## residual.sugar 0.17730336 0.37427490 0.17987291 0.493565139
## chlorides -0.17916182 0.23552782 0.09108849 0.025168952
## free.sulfur.dioxide 0.10184710 0.32733415 -0.49921348 -0.029475198
## total.sulfur.dioxide -0.17800832 -0.34735757 0.64355326 0.035060193
## density -0.12538636 0.04349161 -0.06686042 -0.761184485
## pH -0.52031593 0.18375599 -0.07911267 0.141842640
## sulphates 0.23662489 0.05519364 -0.04102077 0.042787387
## alcohol 0.01278298 0.57530003 0.41895440 -0.350156811
## quality -0.29970621 -0.36771605 -0.14620225 -0.016069252
For the first principal component, the variables with the largest absolute loadings (greatest influence) are:
- density: -0.50129557
- alcohol: 0.44279498
- residual sugar: -0.40605288
For the second principal component, the variables with the largest absolute loadings are:
- pH: -0.56714503
- fixed acidity: 0.56066866
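The ranking above can also be done in code rather than by eye. This is a small sketch that hard-codes the PC1 column of the Rotation matrix printed earlier, so it runs without the CSV; the names pc1 and top3 are illustrative.

```r
# PC1 loadings copied from the Rotation matrix printed above
pc1 <- c(fixed.acidity = -0.15690447, volatile.acidity = -0.02428722,
         citric.acid = -0.13294430, residual.sugar = -0.40605288,
         chlorides = -0.21754400, free.sulfur.dioxide = -0.27471931,
         total.sulfur.dioxide = -0.39044148, density = -0.50129557,
         pH = 0.13003701, sulphates = -0.03364168,
         alcohol = 0.44279498, quality = 0.22713722)

# Each rotation column is a unit vector, so the squared loadings sum to 1
sum(pc1^2)

# Rank variables by absolute loading, keeping the sign for interpretation
top3 <- pc1[order(abs(pc1), decreasing = TRUE)][1:3]
top3
```

With the fitted object available, the same ranking should work on whitePCA$rotation[, "PC1"] directly.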
Two ways to view the scree plot:
plot(whitePCA)
screeplot(whitePCA, type = "lines") # default type is "barplot"
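A scree plot draws the variance of each component, which is the squared standard deviation. The sketch below rebuilds the "Proportion of Variance" and "Cumulative Proportion" rows from the standard deviations printed earlier. The 90% cutoff is an arbitrary threshold chosen for illustration, not something the analysis prescribes.

```r
# Standard deviations copied from the prcomp output above
sdev <- c(1.8293903, 1.2594008, 1.1709706, 1.0415668, 0.9875644, 0.9688978,
          0.8770680, 0.8508195, 0.7459900, 0.5856051, 0.5330248, 0.1430703)

variance   <- sdev^2                    # the heights drawn in the scree plot
proportion <- variance / sum(variance)  # "Proportion of Variance" row
cumulative <- cumsum(proportion)        # "Cumulative Proportion" row

round(rbind(proportion, cumulative), 4)

# Number of components needed to explain at least 90% of the variance
# (0.90 is a hypothetical cutoff)
n_90 <- which(cumulative >= 0.90)[1]
n_90
```

Because the data were scaled, the variances sum to 12, the number of variables.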
A biplot shows the observations (as points) and the variables (as arrows) projected onto the first two principal components, which are orthogonal to each other.
biplot(whitePCA)
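It may help to see where the points in a biplot come from. The sketch below uses the built-in USArrests data as a stand-in (an assumption, so that it runs without the wine CSV). It checks that the rotation matrix is orthogonal and that the scores are the standardized data multiplied by the loadings.

```r
# Stand-in data (assumption): USArrests from base R, not the wine data
pca <- prcomp(USArrests, scale. = TRUE)

# The rotation matrix is orthogonal: its columns are perpendicular unit vectors,
# so this cross-product is (numerically) the identity matrix
round(crossprod(pca$rotation), 10)

# Scores are the centered, scaled data projected onto the loadings
scores <- scale(USArrests) %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)

# The points in a default biplot are built from the first two score columns
head(pca$x[, 1:2])
```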
The biplot above is not very readable, so let’s install and try a different function, ggbiplot, from GitHub:
library(devtools)
install_github("vqv/ggbiplot")
And let’s see how it works on their data:
library(ggplot2)
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
data("wine")
dim(wine)
## [1] 178 13
head(wine)
## Alcohol MalicAcid Ash AlcAsh Mg Phenols Flav NonFlavPhenols Proa Color
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38
## 3 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68
## 4 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80
## 5 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32
## 6 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75
## Hue OD Proline
## 1 1.04 3.92 1065
## 2 1.05 3.40 1050
## 3 1.03 3.17 1185
## 4 0.86 3.45 1480
## 5 1.04 2.93 735
## 6 1.05 2.85 1450
wine.pca <- prcomp(wine, scale. = TRUE)
ggbiplot(wine.pca, obs.scale = 1, var.scale = 1,
         groups = wine.class, ellipse = TRUE, circle = TRUE) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')
That dataset has only 178 rows and 13 variables, so its biplot looks less cluttered than one built from our 4898 records. Our data also has no separate class vector (the ggbiplot example uses wine.class for its groups), so any legend modification would be ignored anyway.
library(ggplot2)
library(ggbiplot)
ggbiplot(whitePCA, obs.scale = 1, var.scale = 1, ellipse = TRUE, circle = TRUE) +
scale_color_discrete(name = '')
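Since ggbiplot colors points through its groups argument, one possible way to give the white wine data classes is to bin a numeric column into a factor. The sketch below is only an illustration. The first six quality values come from the head() output earlier, the remaining values are invented, and the band boundaries and the name quality_band are assumptions.

```r
# First six values come from head(whiteWine); the remaining ones are invented
quality <- c(6, 6, 6, 6, 6, 6, 5, 7, 8, 3, 4, 9)

# Hypothetical bands: <= 5 is "low", 6 is "medium", >= 7 is "high"
quality_band <- cut(quality,
                    breaks = c(-Inf, 5, 6, Inf),
                    labels = c("low", "medium", "high"))
table(quality_band)
```

A factor built this way from the full quality column could then be passed as groups = quality_band in the ggbiplot call. In that case it would probably make sense to leave quality out of the columns given to prcomp, so the grouping variable does not also shape the components.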
Here’s another way of “zooming in”, using only the base biplot function:
biplot(whitePCA, expand = 10, xlim=c(-0.5, 0.5), ylim=c(-0.5, 0.5))
Zooming this way gives a much clearer picture than the notebook rendering suggests; the plot looks a lot better when run as a script than within a notebook.