The selected dataset contains records of Pokémon and their associated trainer information. Key variables include Pokémon species, trainer region, Pokémon region, level, types, gender, and nature.
The purpose of running a principal component analysis (PCA) on this dataset is to see which variables contribute most to the variance among Pokémon. PCA only works on numeric variables, so the analysis uses Level, Level Met, and Perfect IVs.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
setwd("~/Documents/Data 101:110")
pokemon_data <- read_csv("Pokemon.csv")
## Rows: 500 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Date, Pokemon, Trainer Region, Trainer Subregion, Pokemon Region,...
## dbl (3): Level, Level Met, Perfect IVs
## lgl (1): Held Item
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(pokemon_data) <- tolower(names(pokemon_data))
names(pokemon_data) <- gsub(" ", "_", names(pokemon_data))
names(pokemon_data) <- gsub("/", "_", names(pokemon_data))
head(pokemon_data)
## # A tibble: 6 × 15
## date time pokemon trainer_region trainer_subregion pokemon_region level
## <chr> <tim> <chr> <chr> <chr> <chr> <dbl>
## 1 12/13/2016 17:28 Oricor… South Korea <NA> <NA> 13
## 2 12/13/2016 17:30 Zubat United States Texas GER 8
## 3 12/13/2016 17:31 Carbink United States Oklahoma <NA> 10
## 4 12/13/2016 17:33 Klefki United States Connecticut <NA> 29
## 5 12/13/2016 17:34 Luvdisc United States <NA> <NA> 16
## 6 12/13/2016 17:35 Roggen… United Kingdom <NA> SPA 10
## # ℹ 8 more variables: level_met <dbl>, gender <chr>, type1 <chr>, type2 <chr>,
## # nature <chr>, pokeball <chr>, held_item <lgl>, perfect_ivs <dbl>
pokemon_clean <- pokemon_data |>
select(pokemon, trainer_region, level, level_met, perfect_ivs) |>
drop_na()
head(pokemon_clean)
## # A tibble: 6 × 5
## pokemon trainer_region level level_met perfect_ivs
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Oricorio South Korea 13 10 0
## 2 Zubat United States 8 8 1
## 3 Carbink United States 10 10 0
## 4 Klefki United States 29 29 0
## 5 Luvdisc United States 16 16 0
## 6 Roggenrola United Kingdom 10 10 1
pca_fit <- pokemon_clean |>
select(where(is.numeric)) |>
scale() |>
prcomp()
pca_fit
## Standard deviations (1, .., p=3):
## [1] 1.4622077 0.9093526 0.1871537
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## level -0.6612829 -0.2392569 -0.71095788
## level_met -0.6559865 -0.2752668 0.70278723
## perfect_ivs 0.3638498 -0.9311199 -0.02507988
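The pipeline above standardizes the numeric columns with scale() before calling prcomp(). As a rough sketch of what that step does, the block below uses only the six rows printed by head(pokemon_clean), so its numbers will not match the full-data output above. The object names demo, demo_scaled, fit_piped, and fit_builtin are illustrative and not part of the original analysis.

```r
# Six rows copied from the head(pokemon_clean) output; illustration only.
demo <- data.frame(
  level       = c(13, 8, 10, 29, 16, 10),
  level_met   = c(10, 8, 10, 29, 16, 10),
  perfect_ivs = c(0, 1, 0, 0, 0, 1)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
demo_scaled <- scale(demo)
round(colMeans(demo_scaled), 10)
apply(demo_scaled, 2, sd)

# Scaling first and then calling prcomp() should match prcomp(scale. = TRUE).
fit_piped   <- prcomp(demo_scaled)
fit_builtin <- prcomp(demo, scale. = TRUE)
all.equal(fit_piped$sdev, fit_builtin$sdev)

# With standardized inputs, the component variances sum to the number of variables.
sum(fit_piped$sdev^2)
```

Standardizing matters here because level and perfect_ivs are counted in different units. Without it, the variable with the largest raw variance would tend to dominate PC1.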
pca_fit |>
augment(pokemon_clean) |>
ggplot(aes(.fittedPC1, .fittedPC2, color = trainer_region)) +
geom_point() +
coord_fixed()
arrow_style <- arrow(
angle = 20, length = grid::unit(8, "pt"),
ends = "first", type = "closed"
)
pca_fit |>
tidy(matrix = "rotation") |>
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) |>
ggplot(aes(PC1, PC2)) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style
) +
geom_text(aes(label = column)) +
coord_fixed()
# PCA Analysis
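What does PC1 represent? PC1 is mainly a level axis. level and level_met load strongly in the same direction (about -0.66 each), while perfect_ivs loads in the opposite direction (0.36). Higher PC1 scores therefore indicate lower-level Pokémon that tend to have more perfect IVs, and lower scores indicate Pokémon caught or trained at higher levels.
What does PC2 represent? PC2 is dominated by perfect_ivs (loading -0.93), with only small contributions from level and level_met. Lower PC2 scores indicate Pokémon with more perfect IVs.
What does the analysis reveal about Pokémon? Pokémon in this dataset differ mainly along two dimensions: level (current level and level met, which carry nearly the same information) and the number of perfect IVs. The sign of each component is arbitrary, so only the contrast between variables is meaningful.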
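One way to check an interpretation like this against the rotation matrix is to look at squared loadings. The sketch below retypes the loadings from the printed prcomp output instead of reading them from pca_fit, so it runs on its own. The names loadings, share, and dominant are illustrative.

```r
# Loadings copied from the printed rotation matrix above.
loadings <- matrix(
  c(-0.6612829, -0.2392569, -0.71095788,
    -0.6559865, -0.2752668,  0.70278723,
     0.3638498, -0.9311199, -0.02507988),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("level", "level_met", "perfect_ivs"),
                  c("PC1", "PC2", "PC3"))
)

# Each column is a unit vector, so squared loadings sum to 1 per component
# and can be read as each variable's share of that component.
share <- loadings^2
round(colSums(share), 4)
round(share, 3)

# Variable with the largest absolute loading on each component.
dominant <- apply(abs(loadings), 2, function(col) names(which.max(col)))
dominant
```

Read this way, level and level_met together account for most of PC1, and perfect_ivs accounts for most of PC2.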
pca_fit |>
tidy(matrix = "eigenvalues") |>
ggplot(aes(PC, percent)) +
geom_col(color = "red", fill = "purple") +
scale_x_continuous(breaks = 1:3) +
scale_y_continuous(labels = scales::label_percent())
summary(pca_fit)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.4622 0.9094 0.18715
## Proportion of Variance 0.7127 0.2756 0.01168
## Cumulative Proportion 0.7127 0.9883 1.00000
PC1 explains around 71% of the variance.
PC2 explains around 28% of the variance.
Together, PC1 and PC2 explain about 99% of the total variance in the three numeric variables, so PC3 adds almost nothing.
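The proportions reported by summary(pca_fit) follow directly from the component standard deviations. As a small sketch, the block below recomputes them from the values printed in the prcomp output. The variable names sdev, eigenvalues, prop_var, and cum_var are illustrative.

```r
# Standard deviations copied from the prcomp output above.
sdev <- c(PC1 = 1.4622077, PC2 = 0.9093526, PC3 = 0.1871537)

# Each component's variance is its squared standard deviation (the eigenvalue).
eigenvalues <- sdev^2

# Proportion of variance: eigenvalue divided by the total variance.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)
```

Because the inputs were standardized, the eigenvalues sum to roughly 3, the number of variables, which is why dividing by the total gives the same proportions as the summary table.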