This Portfolio will walk you through using PCA for what I call pseudo-cluster analysis. You will also examine the PCA scores generated by PCA, how they correlate with themselves, and how they correlate to the original data features. This will help illustrate how the vectors in a biplot relate to the original data and the layout of the points within the biplot.

For the next exam you should be able to work with code and data in these formats and interpret the plots and correlation matrices that occur in this Portfolio.

Data for Worked Example - Amino Acid Properties

We will do PCA on the chemical properties of amino acids. We will then compare this continuous numeric data to the categories often used to summarize amino acids in text books.

PCA calculations themselves only operate on numeric data. Genotype data such as SNPs must be first converted to a numeric code before being used in PCA.

Categorical labels are only applied to data after the PCA is complete. Unlike true cluster analysis methods, PCA does not divide the data into clusters or a hierarchy. Clusters may emerge from the PCA and/or become apparent when categorical data is added to the biplot.

Numeric data

The following vectors are the columns that make up Table 2.2 of Higgs and Attwood 2005, Chapter 2. Some names have been changed for easier plotting.

# main data vectors
aa1        <-c('A', 'C', 'D', 'E', 'F', 'G', 'H', 
               'I', 'K', 'L','M','N',
                'P','Q','R','S','T','V','W','Y')
vol    <- c(67, 86, 91, 109, 135, 48, 118, 124, 
            135, 124, 124, 96, 90, 114,
            148, 73, 93, 105, 163, 141)
bulk <-c(11.5, 13.46, 11.68, 13.57, 19.8, 3.4, 
         13.69, 21.4, 15.71, 21.4, 16.25,
         12.28,17.43, 14.45, 14.28, 9.47, 15.77, 
         21.57, 21.67, 18.03)
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,
            6.02,9.74,5.98, 5.74, 5.41, 6.3, 
            5.65, 10.76, 5.68, 6.16, 5.96, 5.89, 5.66)
Hyd1 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,
         4.5, -3.9, 3.8, 1.9, -3.5, -1.6, 
              -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3)
Hyd2 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,
         -8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,
              0.6,1.2,2.6,1.9,-0.7)
sur_area       <-c(113, 140,151,183,218,
                   85,194,182,211,180, 204, 158, 143, 189, 
                 241, 122, 146,160,259,229)
frac_area    <-c(0.74,0.91,0.62,0.62,0.88,
                 0.72,0.78,0.88,0.52, 0.85, 0.85, 0.63,
                 0.64,0.62,0.64,0.66,0.7, 0.86, 0.85, 0.76)

Make the dataframe with the numeric data

##   Vol  Bulk Polarity pH0charge Hydro1 Hydro2 SAAbyH20 SALfold
## A  67 11.50     0.00      6.00    1.8    1.6      113    0.74
## C  86 13.46     1.48      5.07    2.5    2.0      140    0.91
## D  91 11.68    49.70      2.77   -3.5   -9.2      151    0.62
## E 109 13.57    49.90      3.22   -3.5   -8.2      183    0.62
## F 135 19.80     0.35      5.48    2.8    3.7      218    0.88
## G  48  3.40     0.00      5.97   -0.4    1.0       85    0.72
## H 118 13.69    51.60      7.59   -3.2   -3.0      194    0.78
## I 124 21.40     0.13      6.02    4.5    3.1      182    0.88
## K 135 15.71    49.50      9.74   -3.9   -8.8      211    0.52
## L 124 21.40     0.13      5.98    3.8    2.8      180    0.85
## M 124 16.25     1.43      5.74    1.9    3.4      204    0.85
## N  96 12.28     3.38      5.41   -3.5   -4.8      158    0.63
## P  90 17.43     1.58      6.30   -1.6   -0.2      143    0.64
## Q 114 14.45     3.53      5.65   -3.5   -4.1      189    0.62
## R 148 14.28    52.00     10.76   -4.5  -12.3      241    0.64
## S  73  9.47     1.67      5.68   -0.8    0.6      122    0.66
## T  93 15.77     1.66      6.16   -0.7    1.2      146    0.70
## V 105 21.57     0.13      5.96    4.2    2.6      160    0.86
## W 163 21.67     2.10      5.89   -0.9    1.9      259    0.85
## Y 141 18.03     1.61      5.66   -1.3   -0.7      229    0.76

Notes:

SA = Surface area of the amino acid
SAAbyH20 = Surface area accessed by H20
SALfold = Surface area lost to access by water when folded
Hydro = Hydropathy (there are 2 ways this is measured)

Categorical data

We’ll also look at some information about some common ways that textbooks classify amino acids with CATEGORICAL LABELS.

# Charge
## un = uncharged (0), neg = negative (-), pos = positive (+)
charge_cat<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

# Hydrophobicity
phob_cat <-c('phob', 'phob', 'phyll', 'phyll', 'phob',
             'neut', 'neut','phob',
             'phyll','phob','phob','phyll','neut',
              'phyll','phyll','neut','neut',
              'phob','phob','neut')
phob_cat <- factor(phob_cat, levels = c("phob","neut","phyll"))


# Volume 
# v = very
x<-c('v_small','small','small','medium',
              'v_large','v_small','medium','large','large',
              'large','large','small','small','medium','large',
           'v_small','small','medium','v_large','v_large')
x <- factor(x,
            levels = c("v_small","small","medium",
                       "large","v_large"))

# Polarity
pol_cat<-c('nonpol','nonpol','polar','polar',
                'nonpol','nonpol','polar','nonpol',
                'polar','nonpol','nonpol','polar','nonpol','polar',
                'polar','polar','polar','nonpol','nonpol','polar')



# Chemical category
chem_cat<-c('ali','sulfur','acid','acid','aro',
            'ali','base','ali','base','ali','sulfur',
            'amide','ali','amide','base','hydro','hydro',
            'ali','aro','aro')

Here’s an example of the categorical data:

# add table()
table(phob_cat)

## phob_cat
##  phob  neut phyll 
##     8     6     6

table(phob_cat, charge_cat)

##         charge_cat
## phob_cat neg pos un
##    phob    0   0  8
##    neut    0   1  5
##    phyll   2   2  2

Data exploration

Scatterplot matrix with smoothers

A scatterplot matrix of the raw data with smoothers added.

#add pairs() and fix any errors in the code
pairs(habk_df,
      col = phob_cat,
      pch = c(as.numeric(phob_cat)+14),
      panel = panel.smooth,
      upper.panel = NULL
      )
mtext(1, line = 4,
      text = "Black square = Hydrophobic, Red circle = neutral, Green triangle = hydrophyillic")

Correlation matrix

A correlation of the raw data.

# make correlation matrix
# add cor()
cor_mat <- cor(habk_df)

#clean up the matrix
# add round()
cor_mat_round <- round(cor_mat,3)
cor_mat_round[upper.tri(cor_mat_round)] <- NA

# plot the matrix
cor_mat_round

##              Vol   Bulk Polarity pH0charge Hydro1 Hydro2 SAAbyH20 SALfold
## Vol        1.000     NA       NA        NA     NA     NA       NA      NA
## Bulk       0.727  1.000       NA        NA     NA     NA       NA      NA
## Polarity   0.236 -0.202    1.000        NA     NA     NA       NA      NA
## pH0charge  0.363  0.082    0.266     1.000     NA     NA       NA      NA
## Hydro1    -0.080  0.444   -0.669    -0.204  1.000     NA       NA      NA
## Hydro2    -0.159  0.319   -0.854    -0.261  0.850  1.000       NA      NA
## SAAbyH20   0.987  0.636    0.288     0.347 -0.181 -0.226    1.000      NA
## SALfold    0.182  0.486   -0.529    -0.180  0.841  0.788    0.123       1

Scaled data

Data for PCA is usually centered and scaled.

#add scale()
habk_df_scale <-scale(habk_df, 
                      center = T, 
                      scale = T)

Scaling the data does NOT change the correlation between variables. We can see this by looking at a correlation matrix of the scaled data

# make correlation matrix
# add cor()
cor_mat_scale <- cor(habk_df_scale)

#clean up the matrix
cor_mat_round2 <- round(cor_mat_scale,3)
cor_mat_round2[upper.tri(cor_mat_round2)] <- NA

# plot the matrix
cor_mat_round2

##              Vol   Bulk Polarity pH0charge Hydro1 Hydro2 SAAbyH20 SALfold
## Vol        1.000     NA       NA        NA     NA     NA       NA      NA
## Bulk       0.727  1.000       NA        NA     NA     NA       NA      NA
## Polarity   0.236 -0.202    1.000        NA     NA     NA       NA      NA
## pH0charge  0.363  0.082    0.266     1.000     NA     NA       NA      NA
## Hydro1    -0.080  0.444   -0.669    -0.204  1.000     NA       NA      NA
## Hydro2    -0.159  0.319   -0.854    -0.261  0.850  1.000       NA      NA
## SAAbyH20   0.987  0.636    0.288     0.347 -0.181 -0.226    1.000      NA
## SALfold    0.182  0.486   -0.529    -0.180  0.841  0.788    0.123       1

We can confirm this by plotting each element of the matrix of the raw data against each element of matrix of scaled data.

This code pulls the correlations out of the matrices

# get the correlations
corr_raw   <- as.vector(cor_mat_round)
corr_scale <- as.vector(cor_mat_round2)

#remove NAs
# add na.omit()
corr_raw   <- na.omit(corr_raw)
corr_scale <- na.omit(corr_scale)

Now we can plot them. The data fall into a perfectly straight line because they are actually identical.

par(mfrow = c(1,1))

#add plot()
plot(corr_raw ~ corr_scale)

#add abline()
abline(a = 0, b = 1)

We can test this with a logical operation tooL

## add == to do a logical test
corr_raw == corr_scale

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE

PCA Analysis

Now time for the actual PCA.

PCA analysis with vegan

The function rda() in vegan runs PCA.

# install.package("vegan")
library(vegan)

# run the PCA
aa_pca <- vegan::rda(habk_df_scale,
                     scale = T)

Plotting PCA basic results as a biplot

PCA results are usually presented as a 2D or 3D biplot. 2D plots are most common, but sometimes 3D plots are published. Its called a biplot not because its 2D, but because it has two element:

points and
arrows (vectors).

A 3D plot with vectors is still called a biplot because it contains two types of data - points and vectors.

Points near each other are similar to each other in the original multidimensional data. Points that are far away are dis-similar. Note that at this stage groups are not indicated in any way, because PCA is not a true clustering algorithm. Also, a similarity matrix or distance matrix is not calculated for standard PCA, (though similar results could be achieved through a different route by calculating a Euclidean distance matrix).

Arrows (vectors) that point in approximately the same direction represent features that have a strong positive (+) correlation in the raw data. Arrows that point in approximately opposite directions represent features that have a strong negative correlation (-) in the raw data. Arrows that are approximately perpendicular (at right angles) to each other are called orthogonal and are uncorrelated with each other. If arrows are approximately 45 degrees apart they have a moderate correlation to each other. The closer two arrows are to pointing the same or opposite directions, the stronger the correlation. The closer they are to being orthogonal, the weaker the correlation.

# Make the plot
# add biplot()
biplot(aa_pca, 
       main = "PCA Biplot",
       xlab = "Principal Components Axis 1 (PC1) - most variation",
       ylab = "Principal Components Axis 2 (PC2) - 2nd variation"
       )

We can see these correlations by looking at the correlation matrix or scatterplot above, or by plotting the raw data. The Hydro2 and SALfold arrows point in the same direction in the biplot, and we can see this correlation in the raw data.

habk_df_scale <- data.frame(habk_df_scale)
plot(Hydro1 ~ SALfold, data  = habk_df_scale)

SALfold and pH0charge are at almost a 90 degree angle to each other. This indicates that have a weak correlation. We can check the raw data.

# add plot() and fix any errors in the code
plot(SALfold ~ pH0charge, data = habk_df_scale)

# add abline()
abline(lm(SALfold ~ pH0charge, data = habk_df_scale))

There’s a slight downward trend in the data, but the correlation is weak

#add cor() and fix any errors in the code
cor(habk_df_scale$SALfold, habk_df_scale$pH0charge)

## [1] -0.1798668

The x-axis is always plotted as Principal Component 1. This represents the axis of greatest variation in the data. This means that vectors that are close to parallel with the x-axis are major determinants of differences in the data. In the amino acid data, Hydro1, Hydro 2 (hydropathy), and Polarity are closest to parallel with PC1. This means that these features of amino acids are very meaningful in describing how amino acids are different. Its therefore not surprising that most biology textbook classify amino acids by hydropathy and polarity.

The y-axis is always plotted as Principal Component 2. This represents the axis of second greatest variation. Volume and SAAbyH20 are close to parallel with PC2. After Hydropathy and Polarity, size describes major differences in amino acids. Because of this, some textbooks also classify amino acids by their size.

Scree plots

A 3D plot allows PC3 to be plotted. A scree plot is used to judge if there is possibly value in using a third PC. The height of the bars in a scree plot tell you how important each PC is. A major vertical drop in importance indicates the subsequent PCs can be safely ignored.

In the scree plot for the amino data the bars for PC1 and PC2 are tall, and PC3 is less than 1/4 the size of PC2. This means that there is little to be gained by considering PC3. If PC3 was very close to PC2, then it would be worthwhile to make a 3D biplot.

# add screeplot()
screeplot(aa_pca, 
          main = "Scree plot for amino acid cata")

Plotting PCA with clusters

Pseudo - clustering 1

Here is an example of adding clusters using hulls. Hulls indicated groups by drawing a line around the outermost members of the group as defined by a categorical feature added by the data analyst; these are not groups defined by the PCA! Hulls that don’t overlap much or at all indicate distinct groups.

When there are many groups or few points in a group, the hulls may connect all members of the group. In the plot below, the black line dashed line connects all 3 points in its small group of A, S and G at the top of the plot. In the middle of the plot is a dashed green line of V, E, Q, and H. Q is within the triangle formed by connecting the other points, indicating its membership in that group.

#biplot()
# add biplot()
biplot(aa_pca, 
       main = "PCA Biplot with 5 size groups",
       xlab = "",
       ylab = "")

#hull with ordihull()
vegan::ordihull(aa_pca,
         group = x,
         lty = c(2,1),
         col = c(1:5))

Here is a simple set of 2 groups. Hulls can overlap. The greater the overlap the less distinct the groups are. Here there’s some ambiguity in the middle, but not much. Sometimes a 3D plot can reveal if there is space between the groups.

# add biplot()
biplot(aa_pca, 
       main = "PCA Biplot - Polarity groups")

#hull
vegan::ordihull(aa_pca,
         group = pol_cat,
         lty = c(1,2),
         lwd = c(1,3),
         col = c(1,2))

Polarity plot

The hulls in this plot are grouping amino acids by hydrophobicity.

# add biplot()
biplot(aa_pca, 
        main = "PCA Biplot - 3 hydrophobicity groups")

#ordihull()
vegan::ordihull(aa_pca,
         group = phob_cat,
         lty = c(1,2,1),
         lwd = c(1,3,6),
         col = c(1,2,3))

PCA scores

PCA takes in dataframe of original features, and returns a dataframe of the same number of modified, orthogonal (uncorrelated) features. I’ll get the modified features for us to look at (you don’t need to know the details of this).

#convert to df
pca_scores <- data.frame(aa_pca$CA$u)

Correlation matrix of the PCA scores

A correlation matrix of these PCA scores returns correlations of 0. This is because PCA re-engineers the data so that all the redundancies (visible as correlations) are gone.

# add cor()
pca_score_cor <- cor(pca_scores)

# round to make easier to read
# add round()
pca_score_cor <- round(pca_score_cor, 10)

# look at the matrix
pca_score_cor

##     PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## PC1   1   0   0   0   0   0   0   0
## PC2   0   1   0   0   0   0   0   0
## PC3   0   0   1   0   0   0   0   0
## PC4   0   0   0   1   0   0   0   0
## PC5   0   0   0   0   1   0   0   0
## PC6   0   0   0   0   0   1   0   0
## PC7   0   0   0   0   0   0   1   0
## PC8   0   0   0   0   0   0   0   1

We can see this also via a scatterplot matrix. The smoothers are all generally flat or bounce around with no pattern.

#add pairs()
pairs(pca_scores,
      upper.panel = NULL, 
      lower.panel = panel.smooth)

All PCs are orthogonal to all the other PCs. This means their correlations are close to zero. We can see this in the scatterplot matrix and correlation matrix above, but to emphasize this let’s look at PC1 and PC2.

# add plot() and fix any errors in the code
plot(PC2 ~ PC1, data = pca_scores)

# add abline() and fix any errors in the code
abline(lm(PC2 ~ PC1, data = pca_scores))

We can plot the PCA scores against the original features to see which ones are correlated and which ones aren’t By examining the biplot (shown again for reference) you can determine which features will be correlated with PC1 (x-axis of biplot), and which ones with PC2 (y- axis of the biplot).

First, the biplot

# add biplot()
par(mfrow = c(1,1))
    (aa_pca)

## Call: rda(X = habk_df_scale, scale = T)
## 
##               Inertia Rank
## Total               8     
## Unconstrained       8    8
## Inertia is correlations 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 3.570 2.807 0.792 0.426 0.279 0.075 0.047 0.003

Now the Hydropathy variable 2, which has a vector in the biplot which points to the LEFT on the biplot, which is in line with the x-axis (PC1). Its orthogonal to PC2 (y-axis).

# Hydropathy variable 2
par(mfrow = c(1,2))
plot(pca_scores$PC1,
     habk_df$Hydro2,
     main = "Hydropthy vs. PC1")
plot(pca_scores$PC2,habk_df$Hydro2,
     main = "Hydropthy vs. PC2")

What about Polarity, which points to the RIGHT, the opposite direction of Hydro2? I’ve add some code to add the regression line to help visualize this.

par(mfrow = c(1,1))
plot(pca_scores$PC1,
     habk_df$Polarity,
     main = "Hydropthy vs. PC1")
abline(lm(habk_df$Polarity~pca_scores$PC1))

Now the Volume variable, which has a vector which points to the DOWN on the biplot, which is in line with the y-axis (PC2).

par(mfrow = c(1,2))
plot(pca_scores$PC2,
     habk_df$Vol)
plot(pca_scores$PC1,
     habk_df$Vol)

Portfolio Task:

Copy the hydropathy variable 2 code chunk from above. Modify it so that the graph on the left plots SALfold against PC2, and the graph on right plot SALfold versus PC1. Change the titles using “Main” to be correct.

# Hydropathy variable 2
par(mfrow = c(1,2))
plot(pca_scores$PC2,habk_df$SALfold,
     main = "SALfold vs. PC2")
plot(pca_scores$PC1,
     habk_df$SALfold,
     main = "SALfold vs. PC1")

Under the hood of PCA