Introduction

Briefly describe the dataset you selected.

This dataset contains information like sex, bill length and depth, body mass, and flipper length for three different species of penguins; Adelie, Chinstrap, and Gentoo. It has 344 observations and 7 variables.

Explain the purpose of performing PCA on your data.

I am performing a PCA on this data because I want to know the differences in looks between the three species of penguins that distinguish them. As well as reducing the number of variables to simplify it for possibly more extensive data analysis.

2. Data Preparation

Load and head the dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(broom)
## Warning: package 'broom' was built under R version 4.4.2
setwd("C:/Users/lydia/Downloads/Comms/data 110")
penguins <- read_csv("penguins(1).csv")
## Rows: 344 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (4): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(penguins)
## # A tibble: 6 × 7
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 1 more variable: sex <chr>

Clean and filter.

penguins_clean <- penguins |>
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)|> 
  drop_na()

3. Principal Component Analysis (PCA) Setup

Perform and print PCA on relevant variables.

pca_fit <- penguins_clean |> 
  select(where(is.numeric)) |> 
  scale() |>                  
  prcomp()  

pca_fit
## Standard deviations (1, .., p=4):
## [1] 1.6594442 0.8789293 0.6043475 0.3293816
## 
## Rotation (n x k) = (4 x 4):
##                          PC1          PC2        PC3        PC4
## bill_length_mm     0.4552503 -0.597031143 -0.6443012  0.1455231
## bill_depth_mm     -0.4003347 -0.797766572  0.4184272 -0.1679860
## flipper_length_mm  0.5760133 -0.002282201  0.2320840 -0.7837987
## body_mass_g        0.5483502 -0.084362920  0.5966001  0.5798821

4. Visualize PCA Results

Scatter Plot: Plot the scatter plot along the two principal components (PC1 vs. PC2)

pca_fit |>
  augment(penguins_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = species))+
  xlim(-5,5)+
  ylim(-4,4)+
  xlab("PC1")+
  ylab("PC2")+
  guides(color = guide_legend(title = NULL))+
  theme_minimal()

Then, add PC1 and PC2 arrows and label them.

pca_fit |>
  augment(penguins_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = species)) +
geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC1 arrow
  geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black")+ # PC2 arrow
  geom_text(aes(x = 5, y = 0, label = "PC1"), 
            vjust = -0.5, color = "black") +  
   
  geom_text(aes(x = 0, y = 4, label = "PC2"), 
            hjust = -0.5, color = "black") +  
  xlim(-5, 5) +
  ylim(-4, 4) +
  xlab("PC1") +
  ylab("PC2") +
  guides(color = guide_legend(title = NULL)) +  # No title for the legend
  scale_color_manual(values = c("Adelie" = "hotpink", "Chinstrap" = "skyblue", "Gentoo" = "orange"))+
  theme_minimal()

Rotation Matrix

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)
pca_fit |>
  # extract rotation matrix
  tidy(matrix = "rotation") |>
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  xlim(-1, 1) + ylim(-1, 1) + 
  coord_fixed()+
  theme_minimal()

5. PCA Analysis Summary

What does PC1 show?

Adelie and Chinstrap penguins are seperated from Gentoo penguins along PC1 in the PCA plot. The separation happens around 0. Penguins with positive PC1 values tend to be Gentoo penguins and penguins with negative PC1 values tend to be Adelie and Chinstrap penguins. The rotation matrix for PC1 is showing that Gentoo penguins tend have longer flippers, greater body mass, and longer bills while Adelie and Chinstrap penguins tend have shorter flippers, lower body mass, and shorter bills.

What does PC2 show?

Adelie penguins are separated from Chinstrap penguins along PC2 in the PCA plot. The separation happens around 0. Penguins with positive PC2 values tend to be Adelie penguins and penguins with negative PC2 values tend to be Chinstrap penguins. The rotation matrix for PC2 is showing that Chinstrap penguins tend to have longer bills and greater bill depth compared to Adelie penguins, which tend to have shorter bills and shallower bill depths.

6. Variance Explained Plot

Percentage of variance explained by each principal component

pca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) + 
  geom_col(fill= "lightblue") + 
  scale_x_continuous(
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "variance explained",
    breaks = seq(0, 1, by = 0.1), 
    label = scales::label_percent(accuracy = 1)
  )+
  xlab("Principal Component (PC)") +
  theme_minimal()

How much variation is explained by PC1 and PC2?

pca_fit |>
  tidy(matrix = "eigenvalues")
## # A tibble: 4 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.66   0.688       0.688
## 2     2   0.879  0.193       0.882
## 3     3   0.604  0.0913      0.973
## 4     4   0.329  0.0271      1

Together PC1 and PC2 explain about 88% of all variation in the data set.