Purpose

Review Principal Component Analysis (PCA) techniques and try a few packages from GitHub.

Data Preparation

Files are found locally at F:Directory:
- winequality-red.csv (will not be examined initially)
- winequality-white.csv

The variables are:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulphates
- Alcohol
- Quality

winequality-white.csv has 4898 records with 12 variables and no missing values.

setwd("F:/R/Working Directory/Rpubs/PCA")

whiteWine = read.csv(file = "F:/R/Working Directory/Rpubs/PCA/winequality-white.csv" , sep = ";", header = TRUE)
dim(whiteWine)
## [1] 4898   12
head(whiteWine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
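
The "no missing values" claim is easy to check directly. The sketch below uses a small hypothetical data frame (toy, with one NA inserted on purpose) instead of the wine file, so that it runs on its own:

```r
# Hypothetical toy data standing in for whiteWine; the NA is deliberate
toy <- data.frame(
  alcohol = c(8.8, 9.5, NA, 9.9),
  pH      = c(3.00, 3.30, 3.26, 3.19)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# TRUE if there is at least one missing value anywhere in the data frame
any_missing <- anyNA(toy)
any_missing
```

Applied to the real data, colSums(is.na(whiteWine)) should return zero for every column if the claim holds.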

Initial Analysis

A principal component analysis takes just a few lines of code. Scaling (scale. = TRUE) standardizes each variable to unit variance, so that variables measured on larger scales do not dominate the components.

whitePCA = prcomp(whiteWine, scale. = TRUE)
summary(whitePCA)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.8294 1.2594 1.1710 1.04157 0.98756 0.96890 0.8771
## Proportion of Variance 0.2789 0.1322 0.1143 0.09041 0.08127 0.07823 0.0641
## Cumulative Proportion  0.2789 0.4111 0.5253 0.61573 0.69701 0.77524 0.8393
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.85082 0.74599 0.58561 0.53302 0.14307
## Proportion of Variance 0.06032 0.04638 0.02858 0.02368 0.00171
## Cumulative Proportion  0.89967 0.94604 0.97462 0.99829 1.00000
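
The proportions in the summary can be reproduced by hand: each component's variance is its standard deviation squared, and its proportion is that variance divided by the total. A minimal sketch, using the rounded standard deviations printed above (so results agree only to about three decimal places); the 80% cut-off is an arbitrary choice for illustration:

```r
# Standard deviations copied from the summary output above (rounded)
sdev <- c(1.8294, 1.2594, 1.1710, 1.04157, 0.98756, 0.96890,
          0.8771, 0.85082, 0.74599, 0.58561, 0.53302, 0.14307)

variances <- sdev^2                      # variance of each component
prop_var  <- variances / sum(variances)  # proportion of variance per component
cum_var   <- cumsum(prop_var)            # cumulative proportion

round(prop_var[1:3], 4)

# Smallest number of components whose cumulative proportion reaches 80%
# (the 0.80 threshold is an assumption, not something the analysis prescribes)
k <- which(cum_var >= 0.80)[1]
k
```

With the fitted object, the same quantities come from whitePCA$sdev. Because the 12 variables were scaled to unit variance, the variances sum to roughly 12.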

Numerical Analysis

A detailed view of the PCA:

whitePCA
## Standard deviations:
##  [1] 1.8293903 1.2594008 1.1709706 1.0415668 0.9875644 0.9688978 0.8770680
##  [8] 0.8508195 0.7459900 0.5856051 0.5330248 0.1430703
## 
## Rotation:
##                              PC1         PC2         PC3         PC4
## fixed.acidity        -0.15690447  0.56066866 -0.20738436  0.03373494
## volatile.acidity     -0.02428722  0.01606694  0.52491466 -0.13119747
## citric.acid          -0.13294430  0.28938115 -0.44635554  0.32953335
## residual.sugar       -0.40605288 -0.03882402 -0.03384313 -0.41615630
## chlorides            -0.21754400  0.03691144  0.21471269  0.50961203
## free.sulfur.dioxide  -0.27471931 -0.34554881 -0.31297088 -0.14892788
## total.sulfur.dioxide -0.39044148 -0.27232605 -0.12479447 -0.02161841
## density              -0.50129557 -0.01773344  0.03196758 -0.10386393
## pH                    0.13003701 -0.56714503  0.06848384  0.20410995
## sulphates            -0.03364168 -0.24826266 -0.22699505  0.51924489
## alcohol               0.44279498  0.01698188 -0.15887556 -0.13438871
## quality               0.22713722 -0.14603134 -0.48884718 -0.27820033
##                              PC5          PC6         PC7         PC8
## fixed.acidity        -0.24413933  0.105856235  0.22355921  0.13041311
## volatile.acidity     -0.70298193 -0.123704688 -0.22363601 -0.22960669
## citric.acid          -0.06510579 -0.131958661 -0.12037133 -0.69141866
## residual.sugar        0.01610213  0.289918546 -0.33860858 -0.11329401
## chlorides             0.17829248 -0.409317266 -0.55225504  0.21139734
## free.sulfur.dioxide  -0.11117214 -0.488085145  0.22407108  0.12883115
## total.sulfur.dioxide -0.27144774 -0.272493820  0.20375343  0.01290262
## density               0.07834373  0.326008106 -0.12313568 -0.08667076
## pH                    0.11270171  0.192688838  0.07704001 -0.47796137
## sulphates            -0.45623099  0.479811894 -0.04462167  0.33642752
## alcohol              -0.30855451 -0.135443327 -0.09801169 -0.08899029
## quality              -0.04112191 -0.005524396 -0.58434519  0.14444197
##                              PC9        PC10        PC11         PC12
## fixed.acidity        -0.63145048  0.20087123 -0.10411772  0.170792295
## volatile.acidity     -0.03159628 -0.14175876 -0.27002270  0.013376718
## citric.acid           0.24949503 -0.10632912 -0.05395597  0.009648802
## residual.sugar        0.17730336  0.37427490  0.17987291  0.493565139
## chlorides            -0.17916182  0.23552782  0.09108849  0.025168952
## free.sulfur.dioxide   0.10184710  0.32733415 -0.49921348 -0.029475198
## total.sulfur.dioxide -0.17800832 -0.34735757  0.64355326  0.035060193
## density              -0.12538636  0.04349161 -0.06686042 -0.761184485
## pH                   -0.52031593  0.18375599 -0.07911267  0.141842640
## sulphates             0.23662489  0.05519364 -0.04102077  0.042787387
## alcohol               0.01278298  0.57530003  0.41895440 -0.350156811
## quality              -0.29970621 -0.36771605 -0.14620225 -0.016069252
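
With scaling turned on, the standard deviations and rotation printed above correspond to the square roots of the eigenvalues and to the eigenvectors of the correlation matrix (up to the sign of each column). The sketch below checks this on R's built-in iris measurements, which are used only as a stand-in so the example runs without the wine file:

```r
X   <- iris[, 1:4]               # stand-in numeric data (assumption: not the wine data)
pca <- prcomp(X, scale. = TRUE)
eig <- eigen(cor(X))

# Component variances equal the eigenvalues of the correlation matrix
all.equal(pca$sdev^2, eig$values)

# Each column of the rotation matrix is a unit-length vector
colSums(pca$rotation^2)

# Largest difference between loadings and eigenvectors once signs are ignored
max(abs(abs(unname(pca$rotation)) - abs(eig$vectors)))
```

The sign of a whole column is arbitrary, which is why only the relative signs within a component are interpretable.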

For the first principal component, the variables with the largest loadings (in absolute value) are:
- density: -0.50129557
- alcohol: 0.44279498
- residual sugar: -0.40605288

Density and residual sugar load in the opposite direction from alcohol.

For the second principal component, the variables with the largest loadings are:
- pH: -0.56714503
- fixed acidity: 0.56066866
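
Rather than reading the largest loadings off the printout, they can be ranked programmatically. A small sketch, with the PC1 column copied from the Rotation output above; pc1 and top3 are illustrative names:

```r
# PC1 loadings copied from the Rotation output above
pc1 <- c(fixed.acidity = -0.15690447, volatile.acidity = -0.02428722,
         citric.acid = -0.13294430, residual.sugar = -0.40605288,
         chlorides = -0.21754400, free.sulfur.dioxide = -0.27471931,
         total.sulfur.dioxide = -0.39044148, density = -0.50129557,
         pH = 0.13003701, sulphates = -0.03364168,
         alcohol = 0.44279498, quality = 0.22713722)

# Order by absolute size, keeping the sign for interpretation
top3 <- pc1[order(abs(pc1), decreasing = TRUE)][1:3]
top3
```

With the fitted object, pc1 would be whitePCA$rotation[, "PC1"], and the same ordering works for any other component.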

Screeplots and Bi-Plots

Basics

Two ways to view the scree plot:

plot(whitePCA)

screeplot(whitePCA, type = "lines") # default type is "barplot"

A biplot plots each observation by its scores on the first two principal components and overlays the variable loadings as arrows:

biplot(whitePCA)

With 4898 observations drawn as labels, the above biplot is not very readable.
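
For reference, the points in a biplot are based on the observations' scores, which prcomp stores in the x element; biplot then rescales them for display. The sketch below shows where the scores come from, again using the built-in iris measurements as a stand-in for the wine data:

```r
X   <- iris[, 1:4]               # stand-in numeric data (assumption: not the wine data)
pca <- prcomp(X, scale. = TRUE)

# Scores are the centered, scaled data projected onto the rotation matrix
scores <- scale(X) %*% pca$rotation

# Should agree with the scores stored by prcomp
all.equal(unname(scores), unname(pca$x), check.attributes = FALSE)

head(scores[, 1:2])              # coordinates on the first two components
```

For the wine data, whitePCA$x[, 1:2] would hold one row of PC1 and PC2 scores per wine, which is why 4898 labels end up overplotted.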

ggbiplot

So let's install and try a new function, ggbiplot, from GitHub:

library(devtools)
install_github("vqv/ggbiplot")

And let’s see how it works on their data:

library(ggplot2)
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
data("wine")
dim(wine)
## [1] 178  13
head(wine)
##   Alcohol MalicAcid  Ash AlcAsh  Mg Phenols Flav NonFlavPhenols Proa Color
## 1   14.23      1.71 2.43   15.6 127    2.80 3.06           0.28 2.29  5.64
## 2   13.20      1.78 2.14   11.2 100    2.65 2.76           0.26 1.28  4.38
## 3   13.16      2.36 2.67   18.6 101    2.80 3.24           0.30 2.81  5.68
## 4   14.37      1.95 2.50   16.8 113    3.85 3.49           0.24 2.18  7.80
## 5   13.24      2.59 2.87   21.0 118    2.80 2.69           0.39 1.82  4.32
## 6   14.20      1.76 2.45   15.2 112    3.27 3.39           0.34 1.97  6.75
##    Hue   OD Proline
## 1 1.04 3.92    1065
## 2 1.05 3.40    1050
## 3 1.03 3.17    1185
## 4 0.86 3.45    1480
## 5 1.04 2.93     735
## 6 1.05 2.85    1450
wine.pca <- prcomp(wine, scale. = TRUE)
ggbiplot(wine.pca, obs.scale = 1, var.scale = 1,
         groups = wine.class, ellipse = TRUE, circle = TRUE) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')

ggbiplot with our data

That dataset has only 178 rows and 13 variables, so its plot looks less cluttered than ours will. Our data also has no distinct classes (the example passes wine.class as the groups argument of ggbiplot), so any legend modification would be ignored anyway.

library(ggplot2)
library(ggbiplot)

ggbiplot(whitePCA, obs.scale = 1, var.scale = 1, ellipse = TRUE, circle = TRUE) +
  scale_color_discrete(name = '') 

Zooming In

Here's a base-R biplot method of "zooming in":

biplot(whitePCA, expand = 10, xlim=c(-0.5, 0.5), ylim=c(-0.5, 0.5))
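
Another way to declutter, offered only as an alternative to zooming, is to plot a random subset of the observations. This sketch passes scores and loadings to the default biplot method; it uses the built-in iris measurements as a stand-in, and the sample size of 30 is arbitrary:

```r
set.seed(42)                     # make the random sample reproducible
X   <- iris[, 1:4]               # stand-in numeric data (assumption: not the wine data)
pca <- prcomp(X, scale. = TRUE)

# Keep a random subset of rows to reduce overplotting
idx <- sample(nrow(X), 30)

# Default biplot method: first matrix gives the points, second the arrows
biplot(pca$x[idx, 1:2], pca$rotation[, 1:2],
       xlabs = rep(".", length(idx)),  # draw dots instead of row labels
       cex = c(1.5, 0.8))
```

Note that the axes here are raw scores and loadings, so the scaling differs from what biplot(whitePCA) draws.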

Zooming gives a much clearer picture than the rendering here suggests; the plot looks a lot better when run as a script than within a notebook.