Briefly describe the dataset you selected.
This dataset contains information like sex, bill length and depth, body mass, and flipper length for three different species of penguins; Adelie, Chinstrap, and Gentoo. It has 344 observations and 7 variables.
Explain the purpose of performing PCA on your data.
I am performing a PCA on this data because I want to know the differences in looks between the three species of penguins that distinguish them. As well as reducing the number of variables to simplify it for possibly more extensive data analysis.
Load and head the dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(broom)
## Warning: package 'broom' was built under R version 4.4.2
setwd("C:/Users/lydia/Downloads/Comms/data 110")
penguins <- read_csv("penguins(1).csv")
## Rows: 344 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (4): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(penguins)
## # A tibble: 6 × 7
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 1 more variable: sex <chr>
Clean and filter.
penguins_clean <- penguins |>
select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)|>
drop_na()
Perform and print PCA on relevant variables.
pca_fit <- penguins_clean |>
select(where(is.numeric)) |>
scale() |>
prcomp()
pca_fit
## Standard deviations (1, .., p=4):
## [1] 1.6594442 0.8789293 0.6043475 0.3293816
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## bill_length_mm 0.4552503 -0.597031143 -0.6443012 0.1455231
## bill_depth_mm -0.4003347 -0.797766572 0.4184272 -0.1679860
## flipper_length_mm 0.5760133 -0.002282201 0.2320840 -0.7837987
## body_mass_g 0.5483502 -0.084362920 0.5966001 0.5798821
Scatter Plot: Plot the scatter plot along the two principal components (PC1 vs. PC2)
pca_fit |>
augment(penguins_clean) |>
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = species))+
xlim(-5,5)+
ylim(-4,4)+
xlab("PC1")+
ylab("PC2")+
guides(color = guide_legend(title = NULL))+
theme_minimal()
Then, add PC1 and PC2 arrows and label them.
pca_fit |>
augment(penguins_clean) |>
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = species)) +
geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC1 arrow
geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black")+ # PC2 arrow
geom_text(aes(x = 5, y = 0, label = "PC1"),
vjust = -0.5, color = "black") +
geom_text(aes(x = 0, y = 4, label = "PC2"),
hjust = -0.5, color = "black") +
xlim(-5, 5) +
ylim(-4, 4) +
xlab("PC1") +
ylab("PC2") +
guides(color = guide_legend(title = NULL)) + # No title for the legend
scale_color_manual(values = c("Adelie" = "hotpink", "Chinstrap" = "skyblue", "Gentoo" = "orange"))+
theme_minimal()
Rotation Matrix
arrow_style <- arrow(
angle = 20, length = grid::unit(8, "pt"),
ends = "first", type = "closed"
)
pca_fit |>
# extract rotation matrix
tidy(matrix = "rotation") |>
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) |>
ggplot(aes(PC1, PC2)) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style
) +
geom_text(aes(label = column), hjust = 1) +
xlim(-1, 1) + ylim(-1, 1) +
coord_fixed()+
theme_minimal()
What does PC1 show?
Adelie and Chinstrap penguins are seperated from Gentoo penguins along PC1 in the PCA plot. The separation happens around 0. Penguins with positive PC1 values tend to be Gentoo penguins and penguins with negative PC1 values tend to be Adelie and Chinstrap penguins. The rotation matrix for PC1 is showing that Gentoo penguins tend have longer flippers, greater body mass, and longer bills while Adelie and Chinstrap penguins tend have shorter flippers, lower body mass, and shorter bills.
What does PC2 show?
Adelie penguins are separated from Chinstrap penguins along PC2 in the PCA plot. The separation happens around 0. Penguins with positive PC2 values tend to be Adelie penguins and penguins with negative PC2 values tend to be Chinstrap penguins. The rotation matrix for PC2 is showing that Chinstrap penguins tend to have longer bills and greater bill depth compared to Adelie penguins, which tend to have shorter bills and shallower bill depths.
Percentage of variance explained by each principal component
pca_fit |>
tidy(matrix = "eigenvalues") |>
ggplot(aes(PC, percent)) +
geom_col(fill= "lightblue") +
scale_x_continuous(
breaks = 1:7
) +
scale_y_continuous(
name = "variance explained",
breaks = seq(0, 1, by = 0.1),
label = scales::label_percent(accuracy = 1)
)+
xlab("Principal Component (PC)") +
theme_minimal()
How much variation is explained by PC1 and PC2?
pca_fit |>
tidy(matrix = "eigenvalues")
## # A tibble: 4 × 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.66 0.688 0.688
## 2 2 0.879 0.193 0.882
## 3 3 0.604 0.0913 0.973
## 4 4 0.329 0.0271 1
Together PC1 and PC2 explain about 88% of all variation in the data set.