Introduction

The dataset selected contains records of Pokémon and their associated trainer information. Important variables include Pokémon species, trainer region, Pokémon region, levels, types, gender, nature, and more.

The purpose of performing PCA on this dataset is to understand which variables contribute most to the variance among Pokémon characteristics, focusing on the three numeric variables: Level, Level Met, and Perfect IVs.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)

setwd("~/Documents/Data 101:110")
pokemon_data <- read_csv("Pokemon.csv")
## Rows: 500 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): Date, Pokemon, Trainer Region, Trainer Subregion, Pokemon Region,...
## dbl   (3): Level, Level Met, Perfect IVs
## lgl   (1): Held Item
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(pokemon_data) <- tolower(names(pokemon_data))
names(pokemon_data) <- gsub(" ", "_", names(pokemon_data))
names(pokemon_data) <- gsub("/", "_", names(pokemon_data))
head(pokemon_data)
## # A tibble: 6 × 15
##   date       time  pokemon trainer_region trainer_subregion pokemon_region level
##   <chr>      <tim> <chr>   <chr>          <chr>             <chr>          <dbl>
## 1 12/13/2016 17:28 Oricor… South Korea    <NA>              <NA>              13
## 2 12/13/2016 17:30 Zubat   United States  Texas             GER                8
## 3 12/13/2016 17:31 Carbink United States  Oklahoma          <NA>              10
## 4 12/13/2016 17:33 Klefki  United States  Connecticut       <NA>              29
## 5 12/13/2016 17:34 Luvdisc United States  <NA>              <NA>              16
## 6 12/13/2016 17:35 Roggen… United Kingdom <NA>              SPA               10
## # ℹ 8 more variables: level_met <dbl>, gender <chr>, type1 <chr>, type2 <chr>,
## #   nature <chr>, pokeball <chr>, held_item <lgl>, perfect_ivs <dbl>
pokemon_clean <- pokemon_data |>
  select(pokemon, trainer_region, level, level_met, perfect_ivs) |>
  drop_na()


head(pokemon_clean)
## # A tibble: 6 × 5
##   pokemon    trainer_region level level_met perfect_ivs
##   <chr>      <chr>          <dbl>     <dbl>       <dbl>
## 1 Oricorio   South Korea       13        10           0
## 2 Zubat      United States      8         8           1
## 3 Carbink    United States     10        10           0
## 4 Klefki     United States     29        29           0
## 5 Luvdisc    United States     16        16           0
## 6 Roggenrola United Kingdom    10        10           1
pca_fit <- pokemon_clean |>
  select(where(is.numeric)) |>
  scale() |>
  prcomp()


pca_fit
## Standard deviations (1, .., p=3):
## [1] 1.4622077 0.9093526 0.1871537
## 
## Rotation (n x k) = (3 x 3):
##                    PC1        PC2         PC3
## level       -0.6612829 -0.2392569 -0.71095788
## level_met   -0.6559865 -0.2752668  0.70278723
## perfect_ivs  0.3638498 -0.9311199 -0.02507988
pca_fit |>
  augment(pokemon_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2, color = trainer_region)) +
  geom_point() +
  coord_fixed()

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

pca_fit |>
  tidy(matrix = "rotation") |>
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column)) +
  coord_fixed()

PCA Analysis

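What does PC1 represent? PC1 is mainly a level axis. Level and Level Met load strongly and in the same direction (about -0.66 each), while Perfect IVs loads more weakly in the opposite direction (about 0.36). Pokémon with low PC1 scores are therefore higher level, and those with high PC1 scores are lower level with somewhat more perfect IVs. The sign of a principal component is arbitrary, so only the contrast between the loadings matters.

What does PC2 represent? PC2 is dominated by Perfect IVs (loading about -0.93), with only small contributions from Level and Level Met (about -0.24 and -0.28). Lower PC2 scores correspond to Pokémon with more perfect IVs.

What does the analysis reveal about Pokémon? Pokémon in this dataset differ along two largely separate dimensions: level (Level and Level Met, which move together) and the number of perfect IVs.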
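
To make the loading interpretation concrete, here is a minimal sketch of how a PC score is built as a weighted sum of standardized variables. The loadings are copied from the prcomp() output earlier in this report; the two observations (`high_level`, `high_ivs`) are hypothetical z-scores chosen for illustration, not rows from the dataset.

```r
# Loadings copied from the prcomp() output (rows: variables, columns: PCs).
rotation <- matrix(
  c(-0.6612829, -0.2392569,
    -0.6559865, -0.2752668,
     0.3638498, -0.9311199),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("level", "level_met", "perfect_ivs"), c("PC1", "PC2"))
)

# Hypothetical standardized observations (z-scores), not taken from the data.
z <- rbind(
  high_level = c(level = 1, level_met = 1, perfect_ivs = 0),
  high_ivs   = c(level = 0, level_met = 0, perfect_ivs = 1)
)

# Each score is the dot product of an observation with a loading column.
scores <- z %*% rotation
scores
```

Under these assumptions, the hypothetical high-level Pokémon moves mostly along PC1 in the negative direction, while the hypothetical high-IV Pokémon moves mostly along PC2, which is consistent with reading PC1 as a level axis and PC2 as an IV axis.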

pca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(color = "red", fill = "purple") +
  scale_x_continuous(breaks = 1:3) +
  scale_y_continuous(labels = scales::label_percent())

summary(pca_fit)
## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     1.4622 0.9094 0.18715
## Proportion of Variance 0.7127 0.2756 0.01168
## Cumulative Proportion  0.7127 0.9883 1.00000

How much variation is explained by PC1 and PC2?

PC1 explains around 71% of the variance.

PC2 explains around 28% of the variance.

Together, PC1 and PC2 explain about 99% of the total variation in the three numeric Pokémon attributes.
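
As a cross-check, the proportions reported by summary() can be recomputed by hand from the component standard deviations: each component's variance is its standard deviation squared, divided by the total. The sketch below copies the standard deviations from the prcomp() output in this report; the variable names (`sdev`, `prop_var`, `cum_var`) are just illustrative.

```r
# Standard deviations copied from the prcomp() output.
sdev <- c(PC1 = 1.4622077, PC2 = 0.9093526, PC3 = 0.1871537)

# Variance of each component is the squared standard deviation.
variance <- sdev^2

# Share of total variance per component, and the running total.
prop_var <- variance / sum(variance)
cum_var <- cumsum(prop_var)

round(rbind(prop_var, cum_var), 4)
```

Because the three variables were standardized before PCA, the variances sum to approximately 3, one unit per input variable.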