QUESTION: How do you carry out principal component analysis (PCA) using the princomp() function, and plot the results?

I will show how to conduct PCA with princomp(), a function in base R (the stats package). prcomp() is another base R function that is also used for PCA.

Data

We’ll use the “palmerpenguins” package (https://allisonhorst.github.io/palmerpenguins/) to address this question. You’ll need to install the package with install.packages("palmerpenguins") if you have not done so before, call library(palmerpenguins), and load the data with data(penguins).

#install.packages("palmerpenguins")
library(palmerpenguins)
data(penguins)

Acquiring the Data

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

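Here I’m selecting the columns of interest from penguins and then dropping year and sex (columns 5 and 6 of the selection). Note that sex is a factor rather than numeric data, and year is not a body measurement. This leaves the four numeric measurements.

# Select the four measurements plus year and sex
penguins.numeric <- penguins[,c("bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g","year","sex")]
# Drop columns 5 (year) and 6 (sex), keeping only the four measurements
penguins.numeric.mod <- penguins.numeric[,-c(5,6)]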
summary(penguins.numeric.mod)
##  bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g  
##  Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700  
##  1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
##  Median :44.45   Median :17.30   Median :197.0     Median :4050  
##  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202  
##  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
##  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300  
##  NA's   :2       NA's   :2       NA's   :2         NA's   :2
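
As an alternative to naming columns by hand, the numeric columns can be picked out programmatically. The following is only a minimal sketch: `toy` is a hypothetical data frame typed in from the first three rows printed above, not the full penguins data.

```r
# Small illustrative data frame copied from the first rows printed above
# (the name `toy` is made up for this example)
toy <- data.frame(
  species           = factor(c("Adelie", "Adelie", "Adelie")),
  bill_length_mm    = c(39.1, 39.5, 40.3),
  bill_depth_mm     = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181L, 186L, 195L),
  body_mass_g       = c(3750L, 3800L, 3250L)
)

# TRUE for each column stored as a number (double or integer)
is.num <- vapply(toy, is.numeric, logical(1))

# Keep only the numeric columns
toy.numeric <- toy[, is.num, drop = FALSE]
names(toy.numeric)
```

On the full penguins data this approach would also keep year, because it is stored as an integer, so year would still need to be dropped by name.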

Conducting Principal Component Analysis

Now, I’m using plot() to produce a scatterplot matrix of the four measurements.

plot(penguins.numeric.mod)

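Next, I convert the data into a data frame and use na.omit() to drop any rows that contain “NA”. princomp() cannot be run on data with missing values unless they are handled first.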

penguins.numeric.mod <- data.frame(penguins.numeric.mod)
penguins.numeric.mod <- na.omit(penguins.numeric.mod)
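
It can be useful to check how many rows na.omit() removes. Here is a small sketch, assuming a hypothetical data frame `toy` typed in from rows 1 to 5 of the output printed earlier (row 4 is the one with missing measurements):

```r
# Rows 1-5 of the measurements printed earlier; row 4 is all NA
# (the name `toy` is made up for this example)
toy <- data.frame(
  bill_length_mm    = c(39.1, 39.5, 40.3, NA, 36.7),
  bill_depth_mm     = c(18.7, 17.4, 18.0, NA, 19.3),
  flipper_length_mm = c(181L, 186L, 195L, NA, 193L),
  body_mass_g       = c(3750L, 3800L, 3250L, NA, 3450L)
)

# Number of missing values in each column
colSums(is.na(toy))

# Drop every row that has at least one NA
toy.complete <- na.omit(toy)

# Number of rows that were removed
nrow(toy) - nrow(toy.complete)
```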

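Using princomp(): Unlike prcomp(), princomp() has no scale argument. In prcomp() the argument is called scale. and its default is FALSE, so neither function standardizes the variables by default. To standardize in princomp(), set cor = TRUE, which runs the PCA on the correlation matrix instead of the covariance matrix. This matters here because body_mass_g is in grams and has a much larger variance than the bill and flipper measurements in millimeters, so it would otherwise dominate the first component. If you pass scale = TRUE to princomp(), R only warns that the extra argument 'scale' will be disregarded, and the data are not scaled.

pca.penguins <- princomp(penguins.numeric.mod, cor = TRUE)
# Display the PCA as a biplot
biplot(pca.penguins)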
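
For comparison, rda() from the vegan package also performs PCA when no constraining variables are supplied, and it does accept scale = TRUE to standardize the variables. You’ll need to install vegan with install.packages("vegan") if you have not done so before.

rda.out <- vegan::rda(penguins.numeric.mod, scale = TRUE)

# Plot only the observations ("sites"), not the variables
biplot(rda.out, display = "sites")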
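
To see that a PCA on the correlation matrix from princomp() matches a scaled PCA from prcomp(), the component standard deviations of the two can be compared. This is a minimal sketch rather than part of the original analysis: `toy` is a hypothetical data frame typed in from the nine complete rows printed earlier, so the numbers will differ from those for the full data set.

```r
# The nine complete rows from the output printed earlier
# (the name `toy` is made up for this example)
toy <- data.frame(
  bill_length_mm    = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0),
  bill_depth_mm     = c(18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2),
  flipper_length_mm = c(181, 186, 195, 193, 190, 181, 195, 193, 190),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250)
)

# princomp() standardizes via cor = TRUE (PCA on the correlation matrix)
pca.princomp <- princomp(toy, cor = TRUE)

# prcomp() standardizes via scale. = TRUE (note the trailing dot)
pca.prcomp <- prcomp(toy, center = TRUE, scale. = TRUE)

# The component standard deviations should agree
all.equal(unname(pca.princomp$sdev), pca.prcomp$sdev)

# Proportion of total variance explained by each component
prop.var <- pca.princomp$sdev^2 / sum(pca.princomp$sdev^2)
round(prop.var, 3)

# Loadings (may differ in sign between the two functions)
unclass(loadings(pca.princomp))
pca.prcomp$rotation
```

Without cor = TRUE the two functions would also differ slightly in their standard deviations, because princomp() divides by n and prcomp() divides by n - 1 when computing variances.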

Additional Reading

For more information on this topic, see https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/princomp

Keywords

  1. principal component analysis
  2. princomp()
  3. prcomp()
  4. scale
  5. na.omit()
  6. palmerpenguins