Introduction

My dataset is 500 observations of different randomized anonymous pokemon trades of one individual between the days December 13th 2016 to December 18th 2016. It includes what country the trader is from, the random pokemon traded, its level, its type 1 and 2, gender, pokeball its in, and how many Perfect IVs it has. To my understanding from a google search and my mother who plays pokemon go, a perfect IV is a stat of a pokemon that is at its max potential. The max number a stat can be is 31 so a 31 stat is a perfect IV. The number of perfect IVs and the pokemon’s level heavily tells the value of the pokemon.

My purpose of performing PCA on this data is to see if stronger pokemon can be associated to different regions. This can give insight on if living in different countries can determine how strong your pokemon is and therefore give you a unfair advantage. Using numerical Variables such as the pokemon’s level and number of perfect IVs as the pokemon’s value and seeing where the pokemon is from can be done using PCA. It will also help to see if other variables such as

library(broom)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Jacob/Downloads")
poky <- read_csv("pokemon.csv") 
## Rows: 500 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): Date, Pokemon, Trainer Region, Trainer Subregion, Pokemon Region,...
## dbl   (3): Level, Level Met, Perfect IVs
## lgl   (1): Held Item
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(poky) <- gsub(" ", "_", names(poky))

Data Prep

poky_clean <- poky |>
  mutate(min_time = as.numeric(Time) / 60) |> # change time to minutes for numerical
  mutate(Continent = case_when( # adding new variable for the pokemon's original continent
  Trainer_Region %in% c("United States", "Canada", "Mexico") ~ "North America",
  Trainer_Region %in% c("Brazil", "Argentina", "Chile") ~ "South America",
  Trainer_Region %in% c("Germany", "France", "United Kingdom", "Italy", "Spain") ~ "Europe",
  Trainer_Region %in% c("China", "Japan", "India", "South Korea") ~ "Asia",
  Trainer_Region %in% c("South Africa", "Nigeria", "Egypt") ~ "Africa",
  Trainer_Region %in% c("Australia", "New Zealand") ~ "Oceania")) |>
  mutate(level_ups = Level - Level_Met) |> #Difference of level and level met to capture both variables. Represents how much the pokemon was upgraded. 
  select(Continent, min_time, level_ups, Perfect_IVs) |> # numeric variables correlated with regions
  drop_na() # remove missing values

head(poky_clean)
## # A tibble: 6 × 4
##   Continent     min_time level_ups Perfect_IVs
##   <chr>            <dbl>     <dbl>       <dbl>
## 1 Asia              1048         3           0
## 2 North America     1050         0           1
## 3 North America     1051         0           0
## 4 North America     1053         0           0
## 5 North America     1054         0           0
## 6 Europe            1055         0           1

PCA setup

pca <- poky_clean |>
  select(where(is.numeric)) |>
  scale() |>
  prcomp()


pca
## Standard deviations (1, .., p=3):
## [1] 1.0637050 1.0078656 0.9234385
## 
## Rotation (n x k) = (3 x 3):
##                     PC1        PC2        PC3
## min_time    -0.05194569 -0.9479081 -0.3142798
## level_ups   -0.69930684  0.2591931 -0.6661748
## Perfect_IVs  0.71293168  0.1851731 -0.6763426

Visualize PCA

pca |>
  augment(poky_clean) |>
  filter(!Continent %in% c("Oceania", "South America")) |> #Oceania and South America does not have enough data to be significant. Just noise. 
  ggplot(aes(x = .fittedPC1, y = .fittedPC2)) +
  geom_density2d(aes(color = Continent)) +
  facet_wrap(~ Continent, nrow = 2) +


  
  # Add the PCA1 and PCA2 arrows
  geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC1 arrow
  geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC2 arrow
  
   # Add text labels- Positioning and color of the label
  geom_text(aes(x = 5, y = 0, label = "PC1"), 
            vjust = -0.5, color = "black") +  
   
  geom_text(aes(x = 0, y = 4, label = "PC2"), 
            hjust = -0.5, color = "black") +  
  xlim(-5, 5) +
  ylim(-4, 4) +
  xlab("PC1") +
  ylab("PC2") +
  theme_minimal() 
## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_density2d()`).

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

pca |>
  tidy(matrix = "rotation") |>  # extract rotation matrix
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column)) +
  coord_fixed()

PCA Findings

PC1 is heavily correlated with the number of Perfect_IVs the pokemon has and the amount of times the pokemon leveled up while in control with it’s original owner before it was traded. Left of PC1 means more times the pokmeon leveled up, but much less perfect IVs. Right of PC1 means more perfect Ivs, but much less times the pokemon was leveled up.

PC2 is also heavily correlated with when the pokemon trade was done. The farther down PC2 you go, the later the time the trade was done.

PCA Variance

pca |>
  # extract eigenvalues
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) + 
  geom_col(fill= "lightblue") + 
  scale_x_continuous(
    # create one axis tick per PC
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "variance explained",
    breaks = seq(0, 1, by = 0.1), 
    # format y axis ticks as percent values
    label = scales::label_percent(accuracy = 1)
  )+
  xlab("Principal Component (PC)") +
  theme_minimal()

pca |>
  tidy(matrix = "eigenvalues")
## # A tibble: 3 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.06    0.377      0.377
## 2     2   1.01    0.339      0.716
## 3     3   0.923   0.284      1

PC1 explains about 38% of the variance and PC2 explains about 34%. Together PC1 and PC2 explains about 72% of the variation.