Introduction
My dataset is 500 observations of different randomized anonymous pokemon trades of one individual between the days December 13th 2016 to December 18th 2016. It includes what country the trader is from, the random pokemon traded, its level, its type 1 and 2, gender, pokeball its in, and how many Perfect IVs it has. To my understanding from a google search and my mother who plays pokemon go, a perfect IV is a stat of a pokemon that is at its max potential. The max number a stat can be is 31 so a 31 stat is a perfect IV. The number of perfect IVs and the pokemon’s level heavily tells the value of the pokemon.
My purpose of performing PCA on this data is to see if stronger pokemon can be associated to different regions. This can give insight on if living in different countries can determine how strong your pokemon is and therefore give you a unfair advantage. Using numerical Variables such as the pokemon’s level and number of perfect IVs as the pokemon’s value and seeing where the pokemon is from can be done using PCA. It will also help to see if other variables such as
library(broom)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Jacob/Downloads")
poky <- read_csv("pokemon.csv")
## Rows: 500 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Date, Pokemon, Trainer Region, Trainer Subregion, Pokemon Region,...
## dbl (3): Level, Level Met, Perfect IVs
## lgl (1): Held Item
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(poky) <- gsub(" ", "_", names(poky))
Data Prep
poky_clean <- poky |>
mutate(min_time = as.numeric(Time) / 60) |> # change time to minutes for numerical
mutate(Continent = case_when( # adding new variable for the pokemon's original continent
Trainer_Region %in% c("United States", "Canada", "Mexico") ~ "North America",
Trainer_Region %in% c("Brazil", "Argentina", "Chile") ~ "South America",
Trainer_Region %in% c("Germany", "France", "United Kingdom", "Italy", "Spain") ~ "Europe",
Trainer_Region %in% c("China", "Japan", "India", "South Korea") ~ "Asia",
Trainer_Region %in% c("South Africa", "Nigeria", "Egypt") ~ "Africa",
Trainer_Region %in% c("Australia", "New Zealand") ~ "Oceania")) |>
mutate(level_ups = Level - Level_Met) |> #Difference of level and level met to capture both variables. Represents how much the pokemon was upgraded.
select(Continent, min_time, level_ups, Perfect_IVs) |> # numeric variables correlated with regions
drop_na() # remove missing values
head(poky_clean)
## # A tibble: 6 × 4
## Continent min_time level_ups Perfect_IVs
## <chr> <dbl> <dbl> <dbl>
## 1 Asia 1048 3 0
## 2 North America 1050 0 1
## 3 North America 1051 0 0
## 4 North America 1053 0 0
## 5 North America 1054 0 0
## 6 Europe 1055 0 1
PCA setup
pca <- poky_clean |>
select(where(is.numeric)) |>
scale() |>
prcomp()
pca
## Standard deviations (1, .., p=3):
## [1] 1.0637050 1.0078656 0.9234385
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## min_time -0.05194569 -0.9479081 -0.3142798
## level_ups -0.69930684 0.2591931 -0.6661748
## Perfect_IVs 0.71293168 0.1851731 -0.6763426
Visualize PCA
pca |>
augment(poky_clean) |>
filter(!Continent %in% c("Oceania", "South America")) |> #Oceania and South America does not have enough data to be significant. Just noise.
ggplot(aes(x = .fittedPC1, y = .fittedPC2)) +
geom_density2d(aes(color = Continent)) +
facet_wrap(~ Continent, nrow = 2) +
# Add the PCA1 and PCA2 arrows
geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC1 arrow
geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC2 arrow
# Add text labels- Positioning and color of the label
geom_text(aes(x = 5, y = 0, label = "PC1"),
vjust = -0.5, color = "black") +
geom_text(aes(x = 0, y = 4, label = "PC2"),
hjust = -0.5, color = "black") +
xlim(-5, 5) +
ylim(-4, 4) +
xlab("PC1") +
ylab("PC2") +
theme_minimal()
## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 454 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_density2d()`).
arrow_style <- arrow(
angle = 20, length = grid::unit(8, "pt"),
ends = "first", type = "closed"
)
pca |>
tidy(matrix = "rotation") |> # extract rotation matrix
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) |>
ggplot(aes(PC1, PC2)) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style
) +
geom_text(aes(label = column)) +
coord_fixed()
PCA Findings
PC1 is heavily correlated with the number of Perfect_IVs the pokemon has and the amount of times the pokemon leveled up while in control with it’s original owner before it was traded. Left of PC1 means more times the pokmeon leveled up, but much less perfect IVs. Right of PC1 means more perfect Ivs, but much less times the pokemon was leveled up.
PC2 is also heavily correlated with when the pokemon trade was done. The farther down PC2 you go, the later the time the trade was done.
PCA Variance
pca |>
# extract eigenvalues
tidy(matrix = "eigenvalues") |>
ggplot(aes(PC, percent)) +
geom_col(fill= "lightblue") +
scale_x_continuous(
# create one axis tick per PC
breaks = 1:7
) +
scale_y_continuous(
name = "variance explained",
breaks = seq(0, 1, by = 0.1),
# format y axis ticks as percent values
label = scales::label_percent(accuracy = 1)
)+
xlab("Principal Component (PC)") +
theme_minimal()
pca |>
tidy(matrix = "eigenvalues")
## # A tibble: 3 × 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.06 0.377 0.377
## 2 2 1.01 0.339 0.716
## 3 3 0.923 0.284 1
PC1 explains about 38% of the variance and PC2 explains about 34%. Together PC1 and PC2 explains about 72% of the variation.