PCA on Wine Data

Reading Multivariate Analysis Data into R

wine <- read.table("wine.data", sep=",")
View(wine)

Plotting Multivariate Data

library(car)

## Loading required package: carData

library(carData)
wine[2:6]

##        V2   V3   V4   V5  V6
## 1   14.23 1.71 2.43 15.6 127
## 2   13.20 1.78 2.14 11.2 100
## 3   13.16 2.36 2.67 18.6 101
## 4   14.37 1.95 2.50 16.8 113
## 5   13.24 2.59 2.87 21.0 118
## 6   14.20 1.76 2.45 15.2 112
## 7   14.39 1.87 2.45 14.6  96
## 8   14.06 2.15 2.61 17.6 121
## 9   14.83 1.64 2.17 14.0  97
## 10  13.86 1.35 2.27 16.0  98
## 11  14.10 2.16 2.30 18.0 105
## 12  14.12 1.48 2.32 16.8  95
## 13  13.75 1.73 2.41 16.0  89
## 14  14.75 1.73 2.39 11.4  91
## 15  14.38 1.87 2.38 12.0 102
## 16  13.63 1.81 2.70 17.2 112
## 17  14.30 1.92 2.72 20.0 120
## 18  13.83 1.57 2.62 20.0 115
## 19  14.19 1.59 2.48 16.5 108
## 20  13.64 3.10 2.56 15.2 116
## 21  14.06 1.63 2.28 16.0 126
## 22  12.93 3.80 2.65 18.6 102
## 23  13.71 1.86 2.36 16.6 101
## 24  12.85 1.60 2.52 17.8  95
## 25  13.50 1.81 2.61 20.0  96
## 26  13.05 2.05 3.22 25.0 124
## 27  13.39 1.77 2.62 16.1  93
## 28  13.30 1.72 2.14 17.0  94
## 29  13.87 1.90 2.80 19.4 107
## 30  14.02 1.68 2.21 16.0  96
## 31  13.73 1.50 2.70 22.5 101
## 32  13.58 1.66 2.36 19.1 106
## 33  13.68 1.83 2.36 17.2 104
## 34  13.76 1.53 2.70 19.5 132
## 35  13.51 1.80 2.65 19.0 110
## 36  13.48 1.81 2.41 20.5 100
## 37  13.28 1.64 2.84 15.5 110
## 38  13.05 1.65 2.55 18.0  98
## 39  13.07 1.50 2.10 15.5  98
## 40  14.22 3.99 2.51 13.2 128
## 41  13.56 1.71 2.31 16.2 117
## 42  13.41 3.84 2.12 18.8  90
## 43  13.88 1.89 2.59 15.0 101
## 44  13.24 3.98 2.29 17.5 103
## 45  13.05 1.77 2.10 17.0 107
## 46  14.21 4.04 2.44 18.9 111
## 47  14.38 3.59 2.28 16.0 102
## 48  13.90 1.68 2.12 16.0 101
## 49  14.10 2.02 2.40 18.8 103
## 50  13.94 1.73 2.27 17.4 108
## 51  13.05 1.73 2.04 12.4  92
## 52  13.83 1.65 2.60 17.2  94
## 53  13.82 1.75 2.42 14.0 111
## 54  13.77 1.90 2.68 17.1 115
## 55  13.74 1.67 2.25 16.4 118
## 56  13.56 1.73 2.46 20.5 116
## 57  14.22 1.70 2.30 16.3 118
## 58  13.29 1.97 2.68 16.8 102
## 59  13.72 1.43 2.50 16.7 108
## 60  12.37 0.94 1.36 10.6  88
## 61  12.33 1.10 2.28 16.0 101
## 62  12.64 1.36 2.02 16.8 100
## 63  13.67 1.25 1.92 18.0  94
## 64  12.37 1.13 2.16 19.0  87
## 65  12.17 1.45 2.53 19.0 104
## 66  12.37 1.21 2.56 18.1  98
## 67  13.11 1.01 1.70 15.0  78
## 68  12.37 1.17 1.92 19.6  78
## 69  13.34 0.94 2.36 17.0 110
## 70  12.21 1.19 1.75 16.8 151
## 71  12.29 1.61 2.21 20.4 103
## 72  13.86 1.51 2.67 25.0  86
## 73  13.49 1.66 2.24 24.0  87
## 74  12.99 1.67 2.60 30.0 139
## 75  11.96 1.09 2.30 21.0 101
## 76  11.66 1.88 1.92 16.0  97
## 77  13.03 0.90 1.71 16.0  86
## 78  11.84 2.89 2.23 18.0 112
## 79  12.33 0.99 1.95 14.8 136
## 80  12.70 3.87 2.40 23.0 101
## 81  12.00 0.92 2.00 19.0  86
## 82  12.72 1.81 2.20 18.8  86
## 83  12.08 1.13 2.51 24.0  78
## 84  13.05 3.86 2.32 22.5  85
## 85  11.84 0.89 2.58 18.0  94
## 86  12.67 0.98 2.24 18.0  99
## 87  12.16 1.61 2.31 22.8  90
## 88  11.65 1.67 2.62 26.0  88
## 89  11.64 2.06 2.46 21.6  84
## 90  12.08 1.33 2.30 23.6  70
## 91  12.08 1.83 2.32 18.5  81
## 92  12.00 1.51 2.42 22.0  86
## 93  12.69 1.53 2.26 20.7  80
## 94  12.29 2.83 2.22 18.0  88
## 95  11.62 1.99 2.28 18.0  98
## 96  12.47 1.52 2.20 19.0 162
## 97  11.81 2.12 2.74 21.5 134
## 98  12.29 1.41 1.98 16.0  85
## 99  12.37 1.07 2.10 18.5  88
## 100 12.29 3.17 2.21 18.0  88
## 101 12.08 2.08 1.70 17.5  97
## 102 12.60 1.34 1.90 18.5  88
## 103 12.34 2.45 2.46 21.0  98
## 104 11.82 1.72 1.88 19.5  86
## 105 12.51 1.73 1.98 20.5  85
## 106 12.42 2.55 2.27 22.0  90
## 107 12.25 1.73 2.12 19.0  80
## 108 12.72 1.75 2.28 22.5  84
## 109 12.22 1.29 1.94 19.0  92
## 110 11.61 1.35 2.70 20.0  94
## 111 11.46 3.74 1.82 19.5 107
## 112 12.52 2.43 2.17 21.0  88
## 113 11.76 2.68 2.92 20.0 103
## 114 11.41 0.74 2.50 21.0  88
## 115 12.08 1.39 2.50 22.5  84
## 116 11.03 1.51 2.20 21.5  85
## 117 11.82 1.47 1.99 20.8  86
## 118 12.42 1.61 2.19 22.5 108
## 119 12.77 3.43 1.98 16.0  80
## 120 12.00 3.43 2.00 19.0  87
## 121 11.45 2.40 2.42 20.0  96
## 122 11.56 2.05 3.23 28.5 119
## 123 12.42 4.43 2.73 26.5 102
## 124 13.05 5.80 2.13 21.5  86
## 125 11.87 4.31 2.39 21.0  82
## 126 12.07 2.16 2.17 21.0  85
## 127 12.43 1.53 2.29 21.5  86
## 128 11.79 2.13 2.78 28.5  92
## 129 12.37 1.63 2.30 24.5  88
## 130 12.04 4.30 2.38 22.0  80
## 131 12.86 1.35 2.32 18.0 122
## 132 12.88 2.99 2.40 20.0 104
## 133 12.81 2.31 2.40 24.0  98
## 134 12.70 3.55 2.36 21.5 106
## 135 12.51 1.24 2.25 17.5  85
## 136 12.60 2.46 2.20 18.5  94
## 137 12.25 4.72 2.54 21.0  89
## 138 12.53 5.51 2.64 25.0  96
## 139 13.49 3.59 2.19 19.5  88
## 140 12.84 2.96 2.61 24.0 101
## 141 12.93 2.81 2.70 21.0  96
## 142 13.36 2.56 2.35 20.0  89
## 143 13.52 3.17 2.72 23.5  97
## 144 13.62 4.95 2.35 20.0  92
## 145 12.25 3.88 2.20 18.5 112
## 146 13.16 3.57 2.15 21.0 102
## 147 13.88 5.04 2.23 20.0  80
## 148 12.87 4.61 2.48 21.5  86
## 149 13.32 3.24 2.38 21.5  92
## 150 13.08 3.90 2.36 21.5 113
## 151 13.50 3.12 2.62 24.0 123
## 152 12.79 2.67 2.48 22.0 112
## 153 13.11 1.90 2.75 25.5 116
## 154 13.23 3.30 2.28 18.5  98
## 155 12.58 1.29 2.10 20.0 103
## 156 13.17 5.19 2.32 22.0  93
## 157 13.84 4.12 2.38 19.5  89
## 158 12.45 3.03 2.64 27.0  97
## 159 14.34 1.68 2.70 25.0  98
## 160 13.48 1.67 2.64 22.5  89
## 161 12.36 3.83 2.38 21.0  88
## 162 13.69 3.26 2.54 20.0 107
## 163 12.85 3.27 2.58 22.0 106
## 164 12.96 3.45 2.35 18.5 106
## 165 13.78 2.76 2.30 22.0  90
## 166 13.73 4.36 2.26 22.5  88
## 167 13.45 3.70 2.60 23.0 111
## 168 12.82 3.37 2.30 19.5  88
## 169 13.58 2.58 2.69 24.5 105
## 170 13.40 4.60 2.86 25.0 112
## 171 12.20 3.03 2.32 19.0  96
## 172 12.77 2.39 2.28 19.5  86
## 173 14.16 2.51 2.48 20.0  91
## 174 13.71 5.65 2.45 20.5  95
## 175 13.40 3.91 2.48 23.0 102
## 176 13.27 4.28 2.26 20.0 120
## 177 13.17 2.59 2.37 20.0 120
## 178 14.13 4.10 2.74 24.5  96

scatterplotMatrix(wine[2:6])

In this matrix scatterplot, the diagonal cells show histograms of each of the variables, in this case the concentrations of the first five chemicals (variables V2, V3, V4, V5, V6).

Each of the off-diagonal cells is a scatterplot of two of the five chemicals, for example, the second cell in the first row is a scatterplot of V2 (y-axis) against V3 (x-axis).

A Scatterplot with the Data Points Labelled by their Group

plot(wine$V4, wine$V5)
#labelling the datapoints with respect to their cultivars
text(wine$V4, wine$V5, wine$V1, cex=0.7, pos=4, col="red")

We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to have lower values of V4 compared to the wines of cultivar 1.

Making a Profile Plot

We will create a function first called “makeProfilePlot()” that requires to install the “RColorBrewer” R package.

makeProfilePlot <- function(mylist,names)
  {
     require(RColorBrewer)
     # find out how many variables we want to include
     numvariables <- length(mylist)
     # choose 'numvariables' random colours
     colours <- brewer.pal(numvariables,"Set1")
     # find out the minimum and maximum values of the variables:
     mymin <- 1e+20
     mymax <- 1e-20
     for (i in 1:numvariables)
     {
        vectori <- mylist[[i]]
        mini <- min(vectori)
        maxi <- max(vectori)
        if (mini < mymin) { mymin <- mini }
        if (maxi > mymax) { mymax <- maxi }
     }
     # plot the variables
     for (i in 1:numvariables)
     {
        vectori <- mylist[[i]]
        namei <- names[i]
        colouri <- colours[i]
        if (i == 1) { plot(vectori,col=colouri,type="l",ylim=c(mymin,mymax)) }
        else         { points(vectori, col=colouri,type="l")                                     }
        lastxval <- length(vectori)
        lastyval <- vectori[length(vectori)]
        text((lastxval-10),(lastyval),namei,col="black",cex=0.6)
     }
  }

To make a profile plot of the concentrations of the first five chemicals in the wine sample (stored in columns V2, V3, V4, V5, V6 of variable “wine”), we have:

library(RColorBrewer)
names <- c("V2","V3","V4","V5","V6")
mylist <- list(wine$V2,wine$V3,wine$V4,wine$V5,wine$V6)
makeProfilePlot(mylist,names)

It is clear from the profile plot that the mean and standard deviation for V6 is quite a lot higher than that for the other variables.

Calculating Summary Statistics for Multivariate Data

Getting the mean of each of the 13 chemical concentrations in the wine samples.

sapply(wine[2:14],mean)

##          V2          V3          V4          V5          V6          V7 
##  13.0006180   2.3363483   2.3665169  19.4949438  99.7415730   2.2951124 
##          V8          V9         V10         V11         V12         V13 
##   2.0292697   0.3618539   1.5908989   5.0580899   0.9574494   2.6116854 
##         V14 
## 746.8932584

Getting the standard deviation of each of the 13 chemical concentrations in the wine samples.

sapply(wine[2:14],sd)

##          V2          V3          V4          V5          V6          V7 
##   0.8118265   1.1171461   0.2743440   3.3395638  14.2824835   0.6258510 
##          V8          V9         V10         V11         V12         V13 
##   0.9988587   0.1244533   0.5723589   2.3182859   0.2285716   0.7099904 
##         V14 
## 314.9074743

As we can see from the results above, the extent at which the standard deviations of some variables differ in the data is very great. So before comparing these variables, we need to standardize them.

Standardizing Variables

We standardized variables of a certain data if we want to compare different variables that have different units as well as varying means and standard deviations, e.g. very different variances. With respect to the given data, it is not advisable to use the unstandardized chemical concentrations as the input for a Principal Component Analysis (PCA) of the wine samples, as if you did that, the first principal component would be dominated by the variables which show the largest variances, such as V14.

Thus, it would be a better idea to first standardize the variables so that they all have variance 1 and mean 0, and to then carry out the Principal Component Analysis (PCA) on the standardized data. This would allow us to find the principal components that provide the best low-dimensional representation of the variation in the original data, without being overly biased by those variables that show the most variance in the original data.

standardisedconcentrations <- as.data.frame(scale(wine[2:14]))

Now, checking if each variable has a mean approximately equal to 0 and standard deviation of 1.

sapply(standardisedconcentrations,mean)

##            V2            V3            V4            V5            V6 
## -8.591766e-16 -6.776446e-17  8.045176e-16 -7.720494e-17 -4.073935e-17 
##            V7            V8            V9           V10           V11 
## -1.395560e-17  6.958263e-17 -1.042186e-16 -1.221369e-16  3.649376e-17 
##           V12           V13           V14 
##  2.093741e-16  3.003459e-16 -1.034429e-16

Note that mean values are too small and approximately equal to 0.

sapply(standardisedconcentrations,sd)

##  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 
##   1   1   1   1   1   1   1   1   1   1   1   1   1

Doing the Principal Component Analysis (PCA)

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations. Note that before doing PCA, you need to standardized the variables involved. This is necessary if the input variables have very different variances.

wine.pca <- prcomp(standardisedconcentrations)
summary(wine.pca)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion  0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
##                            PC8     PC9   PC10    PC11    PC12    PC13
## Standard deviation     0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion  0.92018 0.94240 0.9617 0.97907 0.99205 1.00000

To view the standard deviation of each component, we have

wine.pca$sdev

##  [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128
##  [8] 0.5903367 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244

We can calculate the total variance explained by the components is the sum of the variances of the components by,

sum((wine.pca$sdev)^2)

## [1] 13

In this case, we see that the total variance is 13, which is equal to the number of standardised variables (13 variables). This is because for standardised data, the variance of each standardised variable is 1. The total variance is equal to the sum of the variances of the individual variables

Deciding How Many Principal Components to Retain in PCA

In order to decide how many principal components should be retained, it is common to summarize the results of a principal components analysis by making a scree plot.

screeplot(wine.pca, type="lines")

The most obvious change in slope in the scree plot occurs at component 4, which is the “elbow” of the scree plot. Therefore, it could be argued based on the basis of the scree plot that the first three components, component 1 up to component 3, should be retained.

Another way of deciding how many components to retain is to use Kaiser’s criterion: that we should only retain principal components for which the variance is above 1 (when principal component analysis was applied to standardized data).

(wine.pca$sdev)^2

##  [1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283
##  [8] 0.3484974 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779

We see from the result above that components 1,2, and 3 have variances greater than 1. By Kaiser’s criterion, we should retain these principal components.

A third way to decide how many principal components to retain is to decide to keep the number of components required to explain at least some minimum amount of the total variance. For example, if it is important to explain at least 80% of the variance, we would retain the first five principal components, as we can see from the output of “summary(wine.pca)” that the first five principal components explain 80.2% of the variance (while the first four components explain just 73.6%, so are not sufficient).

Loadings for the Principal Components

The loadings for the principal components are stored in a named element “rotation” of the variable returned by “prcomp()”. This contains a matrix with the loadings of each principal component, where the first column in the matrix contains the loadings for the first principal component, the second column contains the loadings for the second principal component, and so on.

Therefore, to obtain the loadings for the first principal component in our analysis of the 13 chemical concentrations in wine samples, we type:

wine.pca$rotation[,1]

##           V2           V3           V4           V5           V6           V7 
## -0.144329395  0.245187580  0.002051061  0.239320405 -0.141992042 -0.394660845 
##           V8           V9          V10          V11          V12          V13 
## -0.422934297  0.298533103 -0.313429488  0.088616705 -0.296714564 -0.376167411 
##          V14 
## -0.286752227

The results above indicate that the first principal component is a linear combination of the variables: -0.144Z2 + 0.245Z3 + 0.002Z4 + 0.239Z5 - 0.142Z6 - 0.395Z7 - 0.423Z8 + 0.299Z9 -0.313Z10 + 0.089Z11 - 0.297Z12 - 0.376Z13 - 0.287*Z14, where Z2, Z3, Z4…Z14 are the standardized versions of the variables V2, V3, V4…V14 (that each have mean of 0 and variance of 1).

Also, the first principal component has highest (in absolute value) loadings for V8 (-0.423), V7 (-0.395), V13 (-0.376), V10 (-0.313), V12 (-0.297), V14 (-0.287), V9 (0.299), V3 (0.245), and V5 (0.239). The loadings for V8, V7, V13, V10, V12 and V14 are negative, while those for V9, V3, and V5 are positive. Therefore, an interpretation of the first principal component is that it represents a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5.

Note that the square of the loadings sum to 1, as this is a constraint used in calculating the loadings,

sum((wine.pca$rotation[,1])^2)

## [1] 1

To calculate the values of the first principal component, we can define our own function to calculate a principal component given the loadings and the input variables’ values:

calcpc <- function(variables,loadings)
  {
     # find the number of samples in the data set
     as.data.frame(variables)
     numsamples <- nrow(variables)
     # make a vector to store the component
     pc <- numeric(numsamples)
     # find the number of variables
     numvariables <- length(variables)
     # calculate the value of the component for each sample
     for (i in 1:numsamples)
     {
        valuei <- 0
        for (j in 1:numvariables)
        {
           valueij <- variables[i,j]
           loadingj <- loadings[j]
           valuei <- valuei + (valueij * loadingj)
        }
        pc[i] <- valuei
     }
     return(pc)
  }

We can then use the function above to calculate the values of the first principal component for each sample in our wine data:

calcpc(standardisedconcentrations, wine.pca$rotation[,1])

##   [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373
##   [7] -2.44220051 -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688
##  [13] -2.10751729 -3.44842921 -4.30065228 -2.29870383 -2.16584568 -1.89362947
##  [19] -3.53202167 -2.07865856 -3.11561376 -1.08351361 -2.52809263 -1.64036108
##  [25] -1.75662066 -0.98729406 -1.77028387 -1.23194878 -2.18225047 -2.24976267
##  [31] -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118 -1.89847087
##  [37] -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
##  [43] -3.06216898 -0.46090897 -2.09544094 -1.13297020 -2.71893118 -2.81340300
##  [49] -2.00419725 -2.69987528 -3.20587409 -2.85091773 -3.49574328 -2.21853316
##  [55] -2.14094846 -2.46238340 -2.73380617 -2.16762631 -3.13054925  0.92596992
##  [61]  1.53814123  1.83108449 -0.03052074 -2.04449433  0.60796583 -0.89769555
##  [67] -2.24218226 -0.18286818  0.81051865 -1.97006319  1.56779366 -1.65301884
##  [73]  0.72333196 -2.55501977 -1.82741266  0.86555129 -0.36897357  1.45327752
##  [79] -1.25937829 -0.37509228 -0.75992026 -1.03166776  0.49348469  2.53183508
##  [85] -0.83297044 -0.78568828  0.80456258  0.55647288  1.11197430  0.55415961
##  [91]  1.34548982  1.56008180  1.92711944 -0.74456561 -0.95476209 -2.53670943
##  [97]  0.54242248 -1.02814946 -2.24557492 -1.40624916 -0.79547585  0.54798592
## [103]  0.16072037  0.65793897 -0.39125074  1.76751314  0.36523707  1.61611371
## [109] -0.08230361 -1.57383547 -1.41657326  0.27791878  1.29947929  0.45578615
## [115]  0.49279573 -0.48071836  0.25217752  0.10692601  2.42616867  0.54953935
## [121] -0.73754141 -1.33256273  1.17377592  0.46103449 -0.97572169  0.09653741
## [127] -0.03837888  1.59266578  0.47821593  1.78779033  1.32336859  2.37779336
## [133]  2.92867865  2.14077227  2.36320318  3.05522315  3.90473898  3.92539034
## [139]  3.08557209  2.36779237  2.77099630  2.28012931  2.97723506  2.36851341
## [145]  2.20364930  2.61823528  4.26859758  3.57256360  2.79916760  2.89150275
## [151]  2.31420887  2.54265841  1.80744271  2.75238051  2.72945105  3.59472857
## [157]  2.88169708  3.38261413  1.04523342  1.60538369  3.13428951  2.23385546
## [163]  2.83966343  2.59019044  2.94100316  3.52010248  2.39934228  2.92084537
## [169]  2.17527658  2.37423037  3.20258311  3.66757294  2.45862032  3.36104305
## [175]  2.59463669  2.67030685  2.38030254  3.19973210

In fact, the values of the first principal component are stored in the variable wine.pca$x[,1] that was returned by the “prcomp()” function, so we can compare those values to the ones that we calculated, and they should agree:

wine.pca$x[,1]

##   [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373
##   [7] -2.44220051 -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688
##  [13] -2.10751729 -3.44842921 -4.30065228 -2.29870383 -2.16584568 -1.89362947
##  [19] -3.53202167 -2.07865856 -3.11561376 -1.08351361 -2.52809263 -1.64036108
##  [25] -1.75662066 -0.98729406 -1.77028387 -1.23194878 -2.18225047 -2.24976267
##  [31] -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118 -1.89847087
##  [37] -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
##  [43] -3.06216898 -0.46090897 -2.09544094 -1.13297020 -2.71893118 -2.81340300
##  [49] -2.00419725 -2.69987528 -3.20587409 -2.85091773 -3.49574328 -2.21853316
##  [55] -2.14094846 -2.46238340 -2.73380617 -2.16762631 -3.13054925  0.92596992
##  [61]  1.53814123  1.83108449 -0.03052074 -2.04449433  0.60796583 -0.89769555
##  [67] -2.24218226 -0.18286818  0.81051865 -1.97006319  1.56779366 -1.65301884
##  [73]  0.72333196 -2.55501977 -1.82741266  0.86555129 -0.36897357  1.45327752
##  [79] -1.25937829 -0.37509228 -0.75992026 -1.03166776  0.49348469  2.53183508
##  [85] -0.83297044 -0.78568828  0.80456258  0.55647288  1.11197430  0.55415961
##  [91]  1.34548982  1.56008180  1.92711944 -0.74456561 -0.95476209 -2.53670943
##  [97]  0.54242248 -1.02814946 -2.24557492 -1.40624916 -0.79547585  0.54798592
## [103]  0.16072037  0.65793897 -0.39125074  1.76751314  0.36523707  1.61611371
## [109] -0.08230361 -1.57383547 -1.41657326  0.27791878  1.29947929  0.45578615
## [115]  0.49279573 -0.48071836  0.25217752  0.10692601  2.42616867  0.54953935
## [121] -0.73754141 -1.33256273  1.17377592  0.46103449 -0.97572169  0.09653741
## [127] -0.03837888  1.59266578  0.47821593  1.78779033  1.32336859  2.37779336
## [133]  2.92867865  2.14077227  2.36320318  3.05522315  3.90473898  3.92539034
## [139]  3.08557209  2.36779237  2.77099630  2.28012931  2.97723506  2.36851341
## [145]  2.20364930  2.61823528  4.26859758  3.57256360  2.79916760  2.89150275
## [151]  2.31420887  2.54265841  1.80744271  2.75238051  2.72945105  3.59472857
## [157]  2.88169708  3.38261413  1.04523342  1.60538369  3.13428951  2.23385546
## [163]  2.83966343  2.59019044  2.94100316  3.52010248  2.39934228  2.92084537
## [169]  2.17527658  2.37423037  3.20258311  3.66757294  2.45862032  3.36104305
## [175]  2.59463669  2.67030685  2.38030254  3.19973210

Similarly, we can obtain the loadings for the second principal component by typing:

wine.pca$rotation[,2]

##           V2           V3           V4           V5           V6           V7 
##  0.483651548  0.224930935  0.316068814 -0.010590502  0.299634003  0.065039512 
##           V8           V9          V10          V11          V12          V13 
## -0.003359812  0.028779488  0.039301722  0.529995672 -0.279235148 -0.164496193 
##          V14 
##  0.364902832

This means that the second principal component is a linear combination of the variables: 0.484Z2 + 0.225Z3 + 0.316Z4 - 0.011Z5 + 0.300Z6 + 0.065Z7 - 0.003Z8 + 0.029Z9 + 0.039Z10 + 0.530Z11 - 0.279Z12 - 0.164Z13 + 0.365*Z14, where Z1, Z2, Z3…Z14 are the standardised versions of variables V2, V3, … V14 that each have mean 0 and variance 1.

Also, the second principal component has highest loadings for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316), V6 (0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while the loading for V12 is negative. Therefore, an interpretation of the second principal component is that it represents a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. More importantly, the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between the concentrations of V11 and V2, and the concentration of V12.

Note that the square of the loadings sum to 1, as above:

sum((wine.pca$rotation[,2])^2)

## [1] 1

Scatterplots of the Principal Components

The values of the principal components are stored in a named element “x” of the variable returned by “prcomp()”. This contains a matrix with the principal components, where the first column in the matrix contains the first principal component, the second column the second component, and so on.

We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine samples come from by,

plot(wine.pca$x[,1],wine.pca$x[,2])
text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red")

The scatterplot shows the first principal component on the x-axis, and the second principal component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1 have much lower values of the first principal component than wine samples of cultivar 3. Therefore, the first principal component separates wine samples of cultivars 1 from those of cultivar 3.

We can also see that wine samples of cultivar 2 have much higher values of the second principal component than wine samples of cultivars 1 and 3. Therefore, the second principal component separates samples of cultivar 2 from samples of cultivars 1 and 3.

Therefore, the first two principal components are reasonably useful for distinguishing wine samples of the three different cultivars.

Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms of the concentrations of these chemicals in the different cultivars, by printing out the means of the standardised concentration variables in each cultivar, using the “printMeanAndSdByGroup()” function.

printMeanAndSdByGroup <- function(variables,groupvariable)
  {
     # find the names of the variables
     variablenames <- c(names(groupvariable),names(as.data.frame(variables)))
     # within each group, find the mean of each variable
     groupvariable <- groupvariable[,1] # ensures groupvariable is not a list
     means <- aggregate(as.matrix(variables) ~ groupvariable, FUN = mean)
     names(means) <- variablenames
     print(paste("Means:"))
     print(means)
     # within each group, find the standard deviation of each variable:
     sds <- aggregate(as.matrix(variables) ~ groupvariable, FUN = sd)
     names(sds) <- variablenames
     print(paste("Standard deviations:"))
     print(sds)
     # within each group, find the number of samples:
     samplesizes <- aggregate(as.matrix(variables) ~ groupvariable, FUN = length)
     names(samplesizes) <- variablenames
     print(paste("Sample sizes:"))
     print(samplesizes)
  }

printMeanAndSdByGroup(standardisedconcentrations,wine[1])

## [1] "Means:"
##   V1         V2         V3         V4         V5          V6          V7
## 1  1  0.9166093 -0.2915199  0.3246886 -0.7359212  0.46192317  0.87090552
## 2  2 -0.8892116 -0.3613424 -0.4437061  0.2225094 -0.36354162 -0.05790375
## 3  3  0.1886265  0.8928122  0.2572190  0.5754413 -0.03004191 -0.98483874
##            V8          V9        V10        V11        V12        V13
## 1  0.95419225 -0.57735640  0.5388633  0.2028288  0.4575567  0.7691811
## 2  0.05163434  0.01452785  0.0688079 -0.8503999  0.4323908  0.2446043
## 3 -1.24923710  0.68817813 -0.7641311  1.0085728 -1.2019916 -1.3072623
##          V14
## 1  1.1711967
## 2 -0.7220731
## 3 -0.3715295
## [1] "Standard deviations:"
##   V1        V2        V3        V4        V5        V6        V7        V8
## 1  1 0.5692415 0.6163463 0.8280333 0.7624716 0.7350927 0.5416007 0.3979478
## 2  2 0.6626591 0.9090742 1.1498967 1.0030563 1.1730101 0.8713912 0.7065071
## 3  3 0.6531461 0.9738258 0.6732065 0.6761844 0.7625055 0.5703767 0.2938394
##          V9       V10       V11       V12       V13       V14
## 1 0.5628555 0.7200189 0.5342623 0.5096112 0.5029315 0.7034472
## 2 0.9960462 1.0519061 0.3989712 0.8878480 0.6994087 0.4992299
## 3 0.9974790 0.7142999 0.9968323 0.5006795 0.3832607 0.3654948
## [1] "Sample sizes:"
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1  1 59 59 59 59 59 59 59 59  59  59  59  59  59
## 2  2 71 71 71 71 71 71 71 71  71  71  71  71  71
## 3  3 48 48 48 48 48 48 48 48  48  48  48  48  48

Does it make sense that the first principal component can separate cultivar 1 from cultivar 3? In cultivar 1, the mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171) are very high compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736). In cultivar 3, the mean values of V8 (-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372) are very low compared to the mean values of V9 (0.688), V3 (0.893) and V5 (0.575). Therefore, it does make sense that principal component 1 is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3 and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.

Above, we intepreted the second principal component as a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. In the light of the mean values of these variables in the different cultivars, does it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3? In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292) are not very different from the mean value of V12 (0.458). In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893) are also not very different from the mean value of V12 (-1.202). In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361) are much less than the mean value of V12 (0.432). Therefore, it makes sense that the second principal component is a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12; and this principal component can separate cultivar 2 from cultivars 1 and 3.

In conclusion, the first three principal components from PCA done on the data accounts for most of the variations in the original data according to the scree plot and Kaiser’s criterion.

PCA on Wine Data

Darwin Mangubat

2/8/2022