Assignment 3

1. Setup

Run the code chunks below to install the necessary packages before proceeding to the next steps.

required_packages <- c(
  "tidyverse", "dplyr", "ggplot2", "fmsb",
  "xgboost", "caret", "SHAPforxgboost",
  "stats", "factoextra",
  # Needed for name cleaning and web scraping
  "stringi", "rvest", "jsonlite"
)

for (package in required_packages) {
  if (!requireNamespace(package, quietly = TRUE)) {
    install.packages(package, dependencies = TRUE)
  }
}

## Registered S3 methods overwritten by 'pROC':
##   method    from
##   print.roc fmsb
##   plot.roc  fmsb

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringi)

## Warning: package 'stringi' was built under R version 4.4.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'ggplot2' was built under R version 4.4.3

## Warning: package 'tibble' was built under R version 4.4.3

## Warning: package 'tidyr' was built under R version 4.4.3

## Warning: package 'readr' was built under R version 4.4.3

## Warning: package 'purrr' was built under R version 4.4.3

## Warning: package 'stringr' was built under R version 4.4.3

## Warning: package 'forcats' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.2     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)

## Warning: package 'rvest' was built under R version 4.4.3

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(jsonlite)

## Warning: package 'jsonlite' was built under R version 4.4.3

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

library(readr)
library(ggplot2)
library(fmsb)

## Warning: package 'fmsb' was built under R version 4.4.3

library(xgboost)

## Warning: package 'xgboost' was built under R version 4.4.3

## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(SHAPforxgboost)

## Warning: package 'SHAPforxgboost' was built under R version 4.4.3

library(stats)
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

2. Dataset

2.1. Battle Log & Pokedex

We will be using the dataset from a Kaggle competition called Pokemon - Weedle’s Cave. You can download the dataset using the link below:

https://www.kaggle.com/datasets/terminus7/pokemon-challenge?select=pokemon.csv

The dataset consists of three files:

combats.csv: Contains matchups between two Pokémon and a “Winner” label indicating which Pokémon won the battle.
pokemon.csv: A comprehensive Pokédex listing all Pokémon up to the 6th generation (X&Y). It includes details such as name, type, stats, generation, and legendary status.
tests.csv: A public test dataset. This file will not be used in this project.

Additionally, we will need to download another dataset from Kaggle called Complete Competitive Pokémon Dataset:

https://www.kaggle.com/datasets/n2cholas/competitive-pokemon-dataset?select=move-data.csv

From this dataset, we only require the move-data.csv file, which provides detailed information about Pokémon moves. This will help enrich our data with more specific insights into move attributes and effects.

pokemon_df <- read_csv("files/data/pokemon-challenge/pokemon.csv")

## Rows: 800 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Name, Type 1, Type 2
## dbl (8): #, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation
## lgl (1): Legendary
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(pokemon_df)

## # A tibble: 6 × 12
##     `#` Name    `Type 1` `Type 2`    HP Attack Defense `Sp. Atk` `Sp. Def` Speed
##   <dbl> <chr>   <chr>    <chr>    <dbl>  <dbl>   <dbl>     <dbl>     <dbl> <dbl>
## 1     1 Bulbas… Grass    Poison      45     49      49        65        65    45
## 2     2 Ivysaur Grass    Poison      60     62      63        80        80    60
## 3     3 Venusa… Grass    Poison      80     82      83       100       100    80
## 4     4 Mega V… Grass    Poison      80    100     123       122       120    80
## 5     5 Charma… Fire     <NA>        39     52      43        60        50    65
## 6     6 Charme… Fire     <NA>        58     64      58        80        65    80
## # ℹ 2 more variables: Generation <dbl>, Legendary <lgl>

2.2. Pokemon Moveset

As highlighted in the report, this project stands out by incorporating Pokémon moves, which makes the problem more complex because of the vast number of moves a Pokémon can potentially use. To address this, we will scrape the necessary data from the Smogon website: https://www.smogon.com/dex/xy/pokemon/.

Each Pokémon is limited to four moveslots, and Smogon, being a community-driven platform, provides commonly used movesets for competitive play. This significantly simplifies our task, as it eliminates the need to source and compile moveset data manually, allowing us to focus on leveraging this information effectively.
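
As a rough illustration of the extraction step, the sketch below pulls move names out of a JSON-like string using the same regular expressions the scraper relies on. The `json_snippet` fragment is a hand-written assumption about the shape of the page data, not real Smogon output.

```r
library(stringr)

# Hypothetical fragment shaped like the strategy JSON embedded in a dex page
# (an assumption for illustration only)
json_snippet <- paste0(
  '"strategies":[{"format":"OU","movesets":[{"name":"Example Set",',
  '"items":["Leftovers"],',
  '"moveslots":[[{"move":"Giga Drain","type":null}],',
  '[{"move":"Sludge Bomb","type":null}],',
  '[{"move":"Synthesis","type":null}],',
  '[{"move":"Giga Drain","type":null}]]}]}]'
)

# Lazily match every "move":"..." pair, then strip the key and the quotes
raw_moves <- str_extract_all(json_snippet, '"move":".*?"')[[1]]
moves <- unique(str_replace_all(raw_moves, '^"move":"|"$', ""))

# Keep at most four moves, one per moveslot
print(head(moves, 4))
```

Deduplicating with `unique()` matters because the same move can appear in several recommended sets for one Pokémon.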

get_mega_stone <- function(pokemon_name) {
    mega_stone <- NULL
    if(str_detect(pokemon_name, "Mega")) {
        parts <- str_split(pokemon_name, " ")[[1]]
        # Charizard and Mewtwo have 2 mega evolutions (X or Y)
        if(length(parts) == 3) {
            # switch() returns NULL for any other name, so mega_stone stays NULL
            stone <- switch(parts[2],
                            "Charizard" = "Charizardite",
                            "Mewtwo" = "Mewtwonite")
            if(!is.null(stone)) {
                mega_stone <- paste(stone, parts[3])
            }
        }
    }
    return(mega_stone)
}

handle_unique_pokemons <- function(pokemon_name) {
    # Initial pokemon name preprocessing
    pokemon_name <- str_to_lower(pokemon_name)
    pokemon_name <- str_replace_all(pokemon_name, "[.']", "")
    pokemon_name <- stri_trans_general(pokemon_name, "Latin-ASCII")
    
    # Handle pokemon with female/male symbols (e.g. Nidoran)
    if(str_detect(pokemon_name, "♀")) {
        pokemon_name <- str_replace_all(pokemon_name, "♀", "-f")
    } else if(str_detect(pokemon_name, "♂")) {
        pokemon_name <- str_replace_all(pokemon_name, "♂", "-m")
    } else if(str_detect(pokemon_name, "female")) {
        pokemon_name <- str_replace_all(pokemon_name, "female", "f")
    } else if(str_detect(pokemon_name, "male")) {
        pokemon_name <- str_replace_all(pokemon_name, "male", "m")
    }
    
    # Handle pokemon with multiple forms
    keywords_to_remove <- c("forme", "mode", "cloak", "normal",
                            "primal", "average", "size")
    if(any(str_detect(pokemon_name, keywords_to_remove))) {
        pokemon_name <- str_replace_all(pokemon_name, paste(keywords_to_remove, collapse = "|"), "")
    }
    
    # Handle pokemon with multiple variants
    pokemon_name_split <- str_split(pokemon_name, " ")
    variant <- pokemon_name_split[[1]][2]
    if(variant %in% c("confined", "standard", "altered", "land", 
                    "incarnate", "ordinary", "aria", "plant",
                    "blade", "shield", "half")) {
        pokemon_name <- str_replace_all(pokemon_name, variant, "")
    }
    
    # Handle the word order specifically for "Rotom"
    if(str_detect(pokemon_name, "rotom")) {
        pokemon_name <- str_replace(pokemon_name, "^(\\w+)\\s+(\\w+)$", "\\2 \\1")
    }
    
    # Final name preprocessing steps
    pokemon_name <- str_trim(pokemon_name)
    pokemon_name <- str_replace_all(pokemon_name, "mega | x| y", "")
    
    # Join multi-word names with '-' for scraping purposes (only the first two words are kept)
    if(str_detect(pokemon_name, " ")) {
        pokemon_name_split <- str_split(pokemon_name, " ")
        pokemon_name <- paste0(pokemon_name_split[[1]][1], "-", pokemon_name_split[[1]][2])
    }
    
    return(pokemon_name)
}

scrape_pokemon_moves <- function(pokemon_name, gen = "xy", debugging = FALSE) {
    # Get item and scraping name from the original pokemon name
    mega_stone <- get_mega_stone(pokemon_name)
    scraping_name <- handle_unique_pokemons(pokemon_name)
    
    # Construct the URL and get the pokemon info based on scraping name
    url <- paste0("https://www.smogon.com/dex/",
                  gen,
                  "/pokemon/",
                  scraping_name,
                  "/")
    html <- tryCatch({
                read_html(url)
            }, error = function(e) {
                return(NULL) # Return NULL if there's an error
            })
    scraping_successful <- !is.null(html)
    
    # Default to an empty moveset so that moveset[1] to moveset[4] evaluate to NA below
    moveset <- character(0)

    if (scraping_successful) {
        # Extract the embedded JSON that contains the Pokémon movesets
        json_data <- html %>%
            html_element("script:contains('dex')") %>%
            html_text() %>%
            str_extract('"strategies":.+')

        # str_extract() returns NA (not NULL) when nothing matches
        if (!is.na(json_data)) {
            if (!is.null(mega_stone)) {
                # Restrict the text to the moveset of the specific Mega evolution
                json_data <- str_extract(
                    json_data,
                    str_c('"items":\\["', mega_stone, '"\\].*?"moveslots":\\[\\[.*?\\]\\]')
                )
            }
            # Extract and clean the moves; one Pokémon might have the same move in different movesets
            moveset <- json_data %>%
                str_extract_all('"move":".*?"') %>%
                unlist() %>%
                str_replace_all('^"move":"|"$', "") %>%
                unique()
        }
    }

    # Return the first 4 moves
    return(list(
        Original_Name = pokemon_name,
        Standardised_Name = scraping_name,
        Move1 = moveset[1],
        Move2 = moveset[2],
        Move3 = moveset[3],
        Move4 = moveset[4],
        Smogon_URL = url,
        Scraping_Is_Successful = scraping_successful
    ))
}

convert_pokemon_moves_to_df <- function(pokemon_df, n_pokemon_sample = NULL) {
    if(is.null(n_pokemon_sample)) {
        moves_list <- map(pokemon_df$Name, scrape_pokemon_moves)
    } else {
        moves_list <- map(head(pokemon_df$Name, n = n_pokemon_sample), 
                          scrape_pokemon_moves)
    }
    moves_df <- moves_list %>%
        map_dfr(~ tibble(
            Original_Name = .x$Original_Name,
            Standardised_Name = .x$Standardised_Name,
            Move1 = .x$Move1,
            Move2 = .x$Move2,
            Move3 = .x$Move3,
            Move4 = .x$Move4,
            Smogon_URL = .x$Smogon_URL,
            Scraping_Is_Successful = .x$Scraping_Is_Successful # Include the success indicator
        )) %>%
        mutate(
            # Flag rows where no move was scraped at all
            Is_Filled_Backwards = is.na(Move1) & is.na(Move2) & is.na(Move3) & is.na(Move4)
        ) %>%
        # Fill blank moves from the closest preceding row that has a value
        fill(Move1, Move2, Move3, Move4, .direction = "down")
    
    return(moves_df)
}

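get_effectiveness_type_chart <- function() {
    # Each row is a defending type (stored in the column named `Attacking`, despite the name).
    # Each remaining column is an attacking (move) type, so a cell holds the multiplier
    # for a move of the column's type hitting a Pokémon of the row's type.
    return(data.frame(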
            Attacking = c("Normal", "Fire", "Water", "Electric", "Grass", "Ice", "Fighting", "Poison", "Ground", "Flying", "Psychic", "Bug", "Rock", "Ghost", "Dragon", "Dark", "Steel", "Fairy"),
            Normal = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0, 1, 1, 0.5, 1),
            Fire = c(1, 0.5, 0.5, 1, 2, 2, 1, 1, 1, 1, 1, 2, 0.5, 1, 0.5, 1, 2, 1),
            Water = c(1, 2, 0.5, 1, 0.5, 1, 1, 1, 2, 1, 1, 1, 2, 1, 0.5, 1, 1, 1),
            Electric = c(1, 1, 2, 0.5, 0.5, 1, 1, 1, 0, 2, 1, 1, 1, 1, 0.5, 1, 1, 1),
            Grass = c(1, 0.5, 2, 1, 0.5, 1, 1, 0.5, 2, 0.5, 1, 0.5, 2, 1, 0.5, 1, 0.5, 1),
            Ice = c(1, 0.5, 0.5, 1, 2, 0.5, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 0.5, 1),
            Fighting = c(2, 1, 1, 1, 1, 2, 1, 0.5, 1, 0.5, 0.5, 0.5, 2, 0, 1, 2, 2, 0.5),
            Poison = c(1, 1, 1, 1, 2, 1, 1, 0.5, 0.5, 1, 1, 1, 0.5, 0.5, 1, 1, 0, 2),
            Ground = c(1, 2, 1, 2, 0.5, 1, 1, 2, 1, 0, 1, 0.5, 2, 1, 1, 1, 2, 1),
            Flying = c(1, 1, 1, 0.5, 2, 1, 2, 1, 1, 1, 1, 2, 0.5, 1, 1, 1, 0.5, 1),
            Psychic = c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 0.5, 1, 1, 1, 1, 0, 0.5, 1),
            Bug = c(1, 0.5, 1, 1, 2, 1, 0.5, 0.5, 1, 0.5, 2, 1, 1, 0.5, 1, 2, 0.5, 0.5),
            Rock = c(1, 2, 1, 1, 1, 2, 0.5, 1, 0.5, 2, 1, 2, 1, 1, 1, 1, 0.5, 1),
            Ghost = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 0.5, 1, 1),
            Dragon = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0.5, 0),
            Dark = c(1, 1, 1, 1, 1, 1, 0.5, 1, 1, 1, 2, 1, 1, 2, 1, 0.5, 1, 0.5),
            Steel = c(1, 0.5, 0.5, 0.5, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0.5, 2),
            Fairy = c(1, 0.5, 1, 1, 1, 1, 2, 0.5, 1, 1, 1, 1, 1, 1, 2, 2, 0.5, 1)
    ))
}

Let’s try it on a sample of 10 Pokémon.

moves_df <- convert_pokemon_moves_to_df(pokemon_df, n_pokemon_sample = 10)
head(moves_df)

## # A tibble: 6 × 9
##   Original_Name Standardised_Name Move1        Move2      Move3 Move4 Smogon_URL
##   <chr>         <chr>             <chr>        <chr>      <chr> <chr> <chr>     
## 1 Bulbasaur     bulbasaur         Sludge Bomb  Solar Beam Giga… Slee… https://w…
## 2 Ivysaur       ivysaur           Knock Off    Sludge Bo… Giga… Leec… https://w…
## 3 Venusaur      venusaur          Giga Drain   Sludge Bo… Hidd… Synt… https://w…
## 4 Mega Venusaur venusaur          Giga Drain   Sludge Bo… Hidd… Synt… https://w…
## 5 Charmander    charmander        Flamethrower Fire Blast Over… Slee… https://w…
## 6 Charmeleon    charmeleon        Flamethrower Fire Blast Over… Slee… https://w…
## # ℹ 2 more variables: Scraping_Is_Successful <lgl>, Is_Filled_Backwards <lgl>
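
The helper relies on `tidyr::fill()` with `.direction = "down"`, which copies the most recent non-missing value from the rows above. A minimal sketch with made-up rows (the Pokémon names and the `Was_Empty` column are hypothetical) shows the effect:

```r
library(tidyr)

# Hypothetical rows: the second Pokémon has no scraped move
toy_moves <- data.frame(
  Name  = c("Pokemon A", "Pokemon B", "Pokemon C"),
  Move1 = c("Tackle", NA, "Ember"),
  stringsAsFactors = FALSE
)

# Record which rows were empty before filling, similar to the helper's flag column
toy_moves$Was_Empty <- is.na(toy_moves$Move1)

# "down" copies the last non-missing value from the rows above
filled <- fill(toy_moves, Move1, .direction = "down")
print(filled)
```

Because the flag is computed before the fill, rows that received borrowed moves can still be identified and filtered out later.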

Now run it on the full Pokémon list.

start_time <- Sys.time()

moves_df <- convert_pokemon_moves_to_df(pokemon_df)
saveRDS(moves_df, file = "files/data/pokemon-challenge/pokemon_moveset.rds")

end_time <- Sys.time()
execution_time <- end_time - start_time
cat("Scraping time:", round(as.numeric(execution_time, units = "secs"), 2), "seconds\n")

## Scraping time: 2110.45 seconds

3. Preprocess

We now have four key datasets:

Pokémon Compendium Dataset: Contains detailed information about each Pokémon, including their name, type, stats, generation, and legendary status.
Pokémon Moveset Data (Scraped from Smogon): Provides commonly used movesets for each Pokémon, sourced from the Smogon community.
Pokémon Move Details Dataset : Includes specific details about each move, such as its type, power, accuracy, and effects.
Pokémon Battle Logs : Contains records of matchups between two Pokémon and the outcome of each battle.

In addition to these, there is another critical dataset: the Effectiveness Table. This table outlines the type advantages and disadvantages for each Pokémon type. Its values are damage multipliers that determine how effective a move is against a target based on the move's type and the target's type(s). For example, Water-type moves deal 2x damage to Fire-type Pokémon, while they deal only 0.5x damage to Grass-type Pokémon. When the target has two types, the two multipliers are multiplied together. This dataset is essential for understanding and calculating the impact of type matchups in battles.
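
To make the multiplier logic concrete, here is a small sketch in base R. It uses a four-row excerpt of the type chart (rows are defending types, columns are attacking move types) and a hypothetical helper, `type_multiplier()`, which is an illustration rather than the function used later in this notebook.

```r
# Excerpt of the type chart: rows = defending type, columns = attacking move type
chart <- matrix(
  c(0.5, 2,      # Fire defending:  vs Fire move, vs Water move
    0.5, 0.5,    # Water defending
    2,   0.5,    # Grass defending
    0.5, 2),     # Rock defending
  nrow = 4, byrow = TRUE,
  dimnames = list(defender = c("Fire", "Water", "Grass", "Rock"),
                  attacker = c("Fire", "Water"))
)

# Hypothetical helper: multiply the multipliers of every type the defender has
type_multiplier <- function(move_type, defender_types, chart) {
  defender_types <- defender_types[!is.na(defender_types)]
  prod(chart[defender_types, move_type])
}

water_vs_fire_rock <- type_multiplier("Water", c("Fire", "Rock"), chart)
water_vs_grass     <- type_multiplier("Water", c("Grass", NA), chart)
fire_vs_grass      <- type_multiplier("Fire", c("Grass", NA), chart)

print(c(water_vs_fire_rock, water_vs_grass, fire_vs_grass))
```

A missing second type is simply dropped, which is equivalent to treating it as a multiplier of 1.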

pokemon_df <- read_csv("files/data/pokemon-challenge/pokemon.csv")

## Rows: 800 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Name, Type 1, Type 2
## dbl (8): #, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation
## lgl (1): Legendary
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pokemon_moveset_df <- readRDS("files/data/pokemon-challenge/pokemon_moveset.rds")
move_data_df <- read_csv("files/data/competitive-pokemon-dataset/move-data.csv")

## Rows: 728 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Name, Type, Category, Contest, Power, Accuracy
## dbl (3): Index, PP, Generation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pokemon_combat_df <- read_csv("files/data/pokemon-challenge/combats.csv")

## Rows: 50000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): First_pokemon, Second_pokemon, Winner
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

effectiveness_df <- get_effectiveness_type_chart()

3.1. Join & Assemble the Dataset

Next, we need to integrate all five data sources into a single unified table by performing the following steps:

Join pokemon_combat_df (Battle Logs) with pokemon_df (Pokémon Compendium):
Merge these datasets to retrieve detailed information about the two Pokémon involved in each battle, such as their names, types, stats, and other attributes.
Join with pokemon_moveset_df (Pokémon Moveset):
Add four additional columns to each Pokémon’s data, representing the moves assigned to them based on the Smogon movesets.
Join with move_data_df (Move Details):
Incorporate details for each move, including its type, power, accuracy, PP (Power Points), and other relevant attributes.
Calculate Effectiveness for Each Move and Pokémon Type:
Use the Effectiveness Table to compute the damage multiplier for each move against the opposing Pokémon’s type(s). This step ensures that type advantages and disadvantages are accurately reflected in the analysis.
Handle Missing Values:
- Fill missing values in the numeric columns, including the effectiveness columns, with 0. Here 0 marks a missing or invalid matchup; a neutral matchup has a multiplier of 1.
- Replace null values in string columns with empty strings to ensure consistency and avoid errors during analysis.

By completing these steps, we will have a comprehensive dataset that combines battle logs, Pokémon attributes, moveset details, and type effectiveness, ready for further analysis or modeling.
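
The first step, joining the battle log to the Pokédex once per side, can be sketched on toy data as follows. The Speed values come from the Pokédex preview shown earlier; the two battles and their winners are made up for illustration.

```r
library(dplyr)

# Toy Pokédex (Speed values as shown in the preview above)
pokedex <- tibble(
  No    = c(1, 2),
  Name  = c("Bulbasaur", "Charmander"),
  Speed = c(45, 65)
)

# Hypothetical battle log with the same column layout as combats.csv
battles <- tibble(
  First_pokemon  = c(1, 2),
  Second_pokemon = c(2, 1),
  Winner         = c(2, 2)
)

# Prefix the Pokédex columns before each join so the two sides do not collide
joined <- battles %>%
  left_join(rename_with(pokedex, ~ paste0("First_", .x), .cols = -No),
            by = c("First_pokemon" = "No")) %>%
  left_join(rename_with(pokedex, ~ paste0("Second_", .x), .cols = -No),
            by = c("Second_pokemon" = "No"))

print(joined)
```

Prefixing before the join is one possible design; the notebook's own function instead joins first and renames the new columns afterwards, which gives the same column layout.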

calculate_effectiveness <- function(move_type, opponent_type_1, opponent_type_2, effectiveness_df) {
    if(is.na(move_type)) {
        return(NA)
    }
    
    if(!is.na(opponent_type_1)) {
        type1_multiplier <- effectiveness_df %>%
                filter(Attacking == opponent_type_1) %>%
                pull(as.character(move_type)) %>%
                as.numeric()
    } else {
        type1_multiplier <- 1
    }
    
    if(!is.na(opponent_type_2)) {
        type2_multiplier <- effectiveness_df %>%
                filter(Attacking == opponent_type_2) %>%
                pull(as.character(move_type)) %>%
                as.numeric()
    } else {
        type2_multiplier <- 1
    }
    
    return(type1_multiplier * type2_multiplier)
}

preprocess_pokemon_data <- function(pokemon_df, pokemon_moveset_df, move_data_df, pokemon_combat_df, effectiveness_df) {
    
    # Preprocess pokedex data
    colnames(pokemon_df) <- c("No", "Name", "Type1", "Type2",
                              "HP", "Attack", "Defense", "Sp_Atk",
                              "Sp_Def", "Speed", "Generation", "Is_Legendary")
    
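    # Preprocess move data
    # Power and Accuracy are read as character columns, so convert them to numeric;
    # non-numeric entries become NA and then fall back to the defaults (0 and 100)
    move_data_df <- move_data_df %>%
        mutate(Power = replace_na(suppressWarnings(as.numeric(Power)), 0),
               Accuracy = replace_na(suppressWarnings(as.numeric(Accuracy)), 100)) %>%
        select(-Index, -Contest, -Generation)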

    pokemon_moveset_df <- pokemon_moveset_df %>%
        filter(!Is_Filled_Backwards) %>%
        select(Original_Name, Move1, Move2, Move3, Move4) %>%
        left_join(move_data_df, by = join_by(Move1 == Name)) %>%
        rename(
            Move1_Type = Type,
            Move1_Power = Power,
            Move1_Accuracy = Accuracy
        ) %>%
        left_join(move_data_df, by = join_by(Move2 == Name), suffix = c("_Move1", "_Move2")) %>%
        rename(
            Move2_Type = Type,
            Move2_Power = Power,
            Move2_Accuracy = Accuracy
        ) %>%
        left_join(move_data_df, by = join_by(Move3 == Name)) %>%
        rename(
            Move3_Type = Type,
            Move3_Power = Power,
            Move3_Accuracy = Accuracy
        ) %>%
        left_join(move_data_df, by = join_by(Move4 == Name), suffix = c("_Move3", "_Move4")) %>%
        rename(
            Move4_Type = Type,
            Move4_Power = Power,
            Move4_Accuracy = Accuracy
        )
    
    # Preprocess pokemon combat
    pokemon_combat_df <- pokemon_combat_df %>%
        # Join with pokedex table to find pokemon info details (stats)
        left_join(pokemon_df, by = join_by(First_pokemon == No)) %>%
        rename_with(~ paste0("First_", .), .cols = c("Name", "Type1", "Type2", "HP", "Attack", "Defense", "Sp_Atk", "Sp_Def", "Speed", "Generation", "Is_Legendary")) %>%
        left_join(pokemon_df, by = join_by(Second_pokemon == No)) %>%
        rename_with(~ paste0("Second_", .), .cols = c("Name", "Type1", "Type2", "HP", "Attack", "Defense", "Sp_Atk", "Sp_Def", "Speed", "Generation", "Is_Legendary")) %>%
        
        # Join with moveset table
        left_join(pokemon_moveset_df, by = join_by(First_Name == Original_Name)) %>%
        rename_with(~ paste0("First_", .), .cols = c(starts_with("Move"), starts_with("Category"), starts_with("PP"))) %>%
        left_join(pokemon_moveset_df, by = join_by(Second_Name == Original_Name)) %>%
        rename_with(~ paste0("Second_", .), .cols = c(starts_with("Move"), starts_with("Category"), starts_with("PP"))) %>%
        
        # Add type effectiveness calculation
        mutate(
            First_Type1_Effectiveness = mapply(calculate_effectiveness,
                                         First_Type1,
                                         Second_Type1,
                                         Second_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            First_Type2_Effectiveness = mapply(calculate_effectiveness,
                                         First_Type2,
                                         Second_Type1,
                                         Second_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            Second_Type1_Effectiveness = mapply(calculate_effectiveness,
                                         Second_Type1,
                                         First_Type1,
                                         First_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            Second_Type2_Effectiveness = mapply(calculate_effectiveness,
                                         Second_Type2,
                                         First_Type1,
                                         First_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            
            First_Move1_Effectiveness = mapply(calculate_effectiveness,
                                         First_Move1_Type,
                                         Second_Type1,
                                         Second_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            First_Move2_Effectiveness = mapply(calculate_effectiveness,
                                         First_Move2_Type,
                                         Second_Type1,
                                         Second_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            First_Move3_Effectiveness = mapply(calculate_effectiveness,
                                         First_Move3_Type,
                                         Second_Type1,
                                         Second_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            First_Move4_Effectiveness = mapply(calculate_effectiveness,
                                         First_Move4_Type,
                                         Second_Type1,
                                         Second_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            Second_Move1_Effectiveness = mapply(calculate_effectiveness,
                                         Second_Move1_Type,
                                         First_Type1,
                                         First_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            Second_Move2_Effectiveness = mapply(calculate_effectiveness,
                                         Second_Move2_Type,
                                         First_Type1,
                                         First_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            Second_Move3_Effectiveness = mapply(calculate_effectiveness,
                                         Second_Move3_Type,
                                         First_Type1,
                                         First_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
            Second_Move4_Effectiveness = mapply(calculate_effectiveness,
                                         Second_Move4_Type,
                                         First_Type1,
                                         First_Type2,
                                         MoreArgs = list(effectiveness_df = effectiveness_df)),
        ) %>%
        
        # Fill remaining missing values: 0 for numeric columns, "" for character columns
        mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
        mutate(across(where(is.character), ~ replace_na(.x, "")))
    
    return(pokemon_combat_df)
}

start_time <- Sys.time()

final_df <- preprocess_pokemon_data(pokemon_df, pokemon_moveset_df, move_data_df, pokemon_combat_df, effectiveness_df)
saveRDS(final_df, file = "files/data/final_pokemon_data.rds")

end_time <- Sys.time()
execution_time <- end_time - start_time
cat("Execution time:", round(as.numeric(execution_time, units = "secs"), 2), "seconds\n")

## Execution time: 490.84 seconds

4. Exploratory Data Analysis (EDA)

Now that we have a comprehensive dataset, let’s analyze it to address the problem of deciding whether to switch Pokémon. To guide this analysis, I have defined the following three research questions:

What are the most important factors in determining which Pokémon will win (and whether a switch is necessary)?
This question aims to identify key predictors of victory, such as stats (e.g., speed, attack), type advantages, or other attributes, to help decide whether staying in the battle or switching Pokémon is the better strategy.
In some cases, a Pokémon might win despite having type disadvantages. What are the deciding factors that contribute to their success?
This question explores scenarios where Pokémon overcome inherent type weaknesses, focusing on factors like move combinations, stat distributions, or strategic gameplay decisions that lead to unexpected victories.
Is there an optimal combination of move types (e.g., two Physical moves and two Special moves) that leads to higher win rates?
This question investigates whether specific configurations of move types—such as balancing Physical and Special moves, or prioritizing certain move categories—can enhance a Pokémon’s chances of winning.

By addressing these research questions, we can gain deeper insights into the dynamics of Pokémon battles and develop data-driven strategies for making effective in-battle decisions, such as when to switch Pokémon or how to optimise move sets.

type_color_mapping <- c(
    "Dragon" = "#6F35FC", "Steel" = "#B7B7CE", "Flying" = "#A98FF3",
    "Fairy" = "#F95587", "Rock" = "#B6A136", "Fire" = "#EE8130",
    "Electric" = "#F7D02C", "Dark" = "#705746", "Ghost" = "#735797",
    "Ground" = "#E2BF65", "Ice" = "#96D9D6", "Water" = "#6390F0",
    "Grass" = "#7AC74C", "Fighting" = "#C22E28", "Psychic" = "#D685AD",
    "Poison" = "#A33EA1", "Normal" = "#A8A77A", "Bug" = "#A6B91A"
)

outcome_colors <- c("Winner" = "#80ed99", "Loser" = "#f28482")

legendary_colors <- c("Non-Legendary" = "#69b3a2", "Legendary" = "#404080")

df <- readRDS("files/data/final_pokemon_data.rds")
dim(df)

## [1] 50000    85

4.1. Pokemon Stats

Perhaps the most fundamental aspect to examine is the basic statistics of Pokémon. These stats play a critical role in determining battle outcomes and are defined as follows:

HP (Hit Points): Represents the “health” of a Pokémon. When a Pokémon’s HP reaches 0, it faints, and another Pokémon must be sent out to replace it.
Attack: Also known as “physical attack,” this stat determines the damage dealt by physical moves.
Defense: Reduces the damage taken from physical attacks, based on the opponent’s Attack stat.
Special Attack: Different from physical attack, this stat governs the power of special (or “magic-like”) moves.
Special Defense: Reduces the damage taken from special attacks, based on the opponent’s Special Attack stat.
Speed: Determines which Pokémon moves first in a battle, with faster Pokémon typically acting ahead of slower ones.

To begin our analysis, let’s first examine the aggregated stats for each Pokémon type. This will help us understand how different types of Pokémon compare in terms of their average stats (e.g., whether Fire-type Pokémon generally have higher Attack, or if Water-type Pokémon tend to excel in Special Defense). By aggregating these stats, we can identify trends and patterns that may influence battle strategies and decisions.
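
As a small-scale sketch of this aggregation, the example below averages HP and Attack per type for four Pokémon taken from the Pokédex preview shown earlier. The `sample_stats` table is a hand-built subset, so the resulting means are illustrative only.

```r
library(dplyr)

# Four rows copied from the Pokédex preview (primary type only)
sample_stats <- tibble(
  Name   = c("Bulbasaur", "Ivysaur", "Charmander", "Charmeleon"),
  Type   = c("Grass", "Grass", "Fire", "Fire"),
  HP     = c(45, 60, 39, 58),
  Attack = c(49, 62, 52, 64)
)

# Average each stat within a type
type_means <- sample_stats %>%
  group_by(Type) %>%
  summarise(across(c(HP, Attack), mean), .groups = "drop")

print(type_means)
```

The full analysis applies the same idea to all six stats and counts a dual-type Pokémon under both of its types.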

par(mfrow=c(3,6))
par(mar=c(1,1,1,1))

df <- df %>%
    mutate(
        First_Outcome = ifelse(First_pokemon == Winner, "Winner", "Loser"),
        Second_Outcome = ifelse(Second_pokemon == Winner, "Winner", "Loser")
    )

show_type_radar_chart <- function(type_data, type_name, type_mapping_color) {
    # Add min and max values for scaling the radar chart
    radar_data <- rbind(
        rep(ceiling(max(type_data$Value)), nrow(type_data)), # Max value for scaling
        rep(floor(min(type_data$Value)), nrow(type_data)), # Min value for scaling
        type_data$Value # Actual values
    )
    radar_data <- as.data.frame(radar_data)

    # Add column names
    colnames(radar_data) <- c("HP", "Attack", "Defense", "Sp_Atk", "Sp_Def", "Speed")

    # Plot the radar chart
    radarchart(
        radar_data,
        axistype = 0,
        pcol = type_mapping_color,
        pfcol = adjustcolor(type_mapping_color, alpha.f = 0.5),
        plwd = 2,
        cglcol="grey", 
        cglty=1, 
        axislabcol="black",
        title = type_name,
        caxislabels=seq(0,2000,5), 
        cglwd=0.8
    )
}

winner_stats_df <- bind_rows(
    df %>%
        select(First_Type1, First_HP, First_Attack, First_Defense, First_Sp_Atk, First_Sp_Def, First_Speed, Outcome = First_Outcome) %>%
        rename(First_Type = First_Type1) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    df %>%
        select(First_Type2, First_HP, First_Attack, First_Defense, First_Sp_Atk, First_Sp_Def, First_Speed, Outcome = First_Outcome) %>%
        rename(First_Type = First_Type2) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    df %>%
        select(Second_Type1, Second_HP, Second_Attack, Second_Defense, Second_Sp_Atk, Second_Sp_Def, Second_Speed, Outcome = Second_Outcome) %>%
        rename(Second_Type = Second_Type1) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything()),
    df %>%
        select(Second_Type2, Second_HP, Second_Attack, Second_Defense, Second_Sp_Atk, Second_Sp_Def, Second_Speed, Outcome = Second_Outcome) %>%
        rename(Second_Type = Second_Type2) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything())
) %>%
    filter(Type != "") %>%
    group_by(Type) %>%
    summarize(
        HP = mean(HP, na.rm = TRUE),
        Attack = mean(Attack, na.rm = TRUE),
        Defense = mean(Defense, na.rm = TRUE),
        Sp_Atk = mean(Sp_Atk, na.rm = TRUE),
        Sp_Def = mean(Sp_Def, na.rm = TRUE),
        Speed = mean(Speed, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    pivot_longer(cols = c(HP, Attack, Defense, Sp_Atk, Sp_Def, Speed),
                 names_to = "Stat", values_to = "Value") %>%
    mutate(Color = type_color_mapping[as.character(Type)])

for (type in unique(winner_stats_df$Type)) {
    type_data <- winner_stats_df %>% filter(Type == type)
    type_color <- unique(type_data$Color)
    
    show_type_radar_chart(type_data, type, type_color)
}
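As a side note on the input layout, fmsb::radarchart() by default reads the first row of its data frame as the axis maximum and the second row as the axis minimum, with the values to plot in the rows after that. The sketch below builds such a frame for a single type; the stat values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical mean stats for one type (illustrative values only)
stat_means <- c(HP = 68, Attack = 75, Defense = 71,
                Sp_Atk = 73, Sp_Def = 70, Speed = 66)

# Row 1 = axis maximum, row 2 = axis minimum, row 3 = values to plot
radar_input <- as.data.frame(rbind(
    max    = rep(ceiling(max(stat_means)), length(stat_means)),
    min    = rep(floor(min(stat_means)), length(stat_means)),
    values = stat_means
))
colnames(radar_input) <- names(stat_means)

print(radar_input)
# fmsb::radarchart(radar_input)  # uncomment to draw, assuming fmsb is installed
```

Because the limits are derived from each type's own values, every chart drawn this way has its own scale, so shapes can be compared across types but absolute sizes cannot.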

From the battle logs, we can observe that different Pokémon types exhibit distinct stat distributions, which influence their roles in battles:

Physical Attackers: These types tend to have higher Attack stats and rely on physical moves to deal damage.
- Examples: Bug, Dark, Dragon, Fighting, Grass, Ground, Poison, Water
Special Attackers: These types excel in Special Attack, using special (or “magic-like”) moves to overpower opponents.
- Examples: Electric, Fairy, Fire, Ghost, Grass, Ice, Poison, Psychic, Water
Physically Defensive: These types are strong in Defense, allowing them to withstand physical attacks more effectively.
- Examples: Bug, Ghost, Grass, Ground, Rock, Steel, Water
Specially Defensive: These types have high Special Defense, enabling them to resist special attacks.
- Examples: Fairy, Ghost, Grass, Ice, Poison
Speedy: These types are characterised by high Speed, giving them an advantage in moving first during battles.
- Examples: Electric, Flying, Normal
Bulky (High in HP): These types have higher HP, making them more durable and capable of enduring prolonged battles.
- Examples: Ice, Normal, Water

This breakdown highlights the natural tendencies of each type, providing valuable insights into their strengths and potential roles in battle strategies. For instance, a Water-type Pokémon might serve as both a physical and special attacker while also being physically defensive and bulky, making it a versatile choice. On the other hand, a Ghost-type Pokémon might focus on defense and special attack, excelling in specific scenarios. Understanding these distributions can help inform decisions about team composition and move selection.

How do the stats differ between the “Winner” and the “Loser”?

winner_stats_df <- bind_rows(
    df %>%
        select(First_Name, First_HP, First_Attack, First_Defense, First_Sp_Atk, First_Sp_Def, First_Speed, First_Is_Legendary, Outcome = First_Outcome) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    df %>%
        select(Second_Name, Second_HP, Second_Attack, Second_Defense, Second_Sp_Atk, Second_Sp_Def, Second_Speed, Second_Is_Legendary, Outcome = Second_Outcome) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything())
)

winner_stats_long_df <- winner_stats_df %>%
    pivot_longer(cols = c(HP, Attack, Defense, Sp_Atk, Sp_Def, Speed),
                 names_to = "Stat", values_to = "Value")

ggplot(winner_stats_long_df, aes(x = Stat, y = Value, fill = Outcome)) +
    geom_boxplot() +
    labs(title = "Pokemon Combat Stats Difference",
         x = "Stat", y = "Value") +
    theme_minimal() +
    scale_fill_manual(values = outcome_colors)

Winners tend to have noticeably higher stats than the Pokémon they defeated, particularly in Speed, Attack, and Special Attack. This finding is intuitive, as these stats play a critical role in determining battle outcomes.

Speed is especially crucial because the Pokémon with the higher Speed stat typically moves first, allowing it to deal damage or apply status effects before the opponent can act.
Attack and Special Attack are key indicators of offensive power, enabling Pokémon to deal significant damage using physical or special moves, respectively.

This trend underscores the importance of prioritizing these stats when selecting or training Pokémon for battles. A faster Pokémon can seize the initiative, while a strong attacker can overwhelm opponents before they have a chance to respond. These insights align with common competitive strategies, where Speed and offensive capabilities often dictate the flow of a match.
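One way to put a number on the Speed observation is to compute the share of battles in which the winner was strictly faster than the loser. The sketch below does this on a toy battle log; the values and the `First_Won` column are hypothetical, while `First_Speed` and `Second_Speed` mirror the column names used above.

```r
# Toy battle log (hypothetical values, not taken from the dataset)
battles <- data.frame(
    First_Speed  = c(90, 45, 110, 60, 80),
    Second_Speed = c(70, 95, 100, 85, 80),
    First_Won    = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Pick the Speed of the winner and of the loser for every battle
winner_speed <- ifelse(battles$First_Won, battles$First_Speed, battles$Second_Speed)
loser_speed  <- ifelse(battles$First_Won, battles$Second_Speed, battles$First_Speed)

# Share of battles in which the winner was strictly faster (ties count as not faster)
faster_winner_share <- mean(winner_speed > loser_speed)
print(faster_winner_share)
```

The same calculation could be repeated for Attack or Special Attack to compare how strongly each stat lines up with the outcome.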

Now, let’s explore other factors beyond stats, such as legendary status.

legendary_non_legendary_df <- df %>%
    filter((First_Is_Legendary == TRUE & Second_Is_Legendary == FALSE) |
           (Second_Is_Legendary == TRUE & First_Is_Legendary == FALSE))

winner_stats_df <- bind_rows(
    legendary_non_legendary_df %>%
        select(First_Name, First_HP, First_Attack, First_Defense, First_Sp_Atk, First_Sp_Def, First_Speed, First_Is_Legendary, Outcome = First_Outcome) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    legendary_non_legendary_df %>%
        select(Second_Name, Second_HP, Second_Attack, Second_Defense, Second_Sp_Atk, Second_Sp_Def, Second_Speed, Second_Is_Legendary, Outcome = Second_Outcome) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything())
)

legendary_outcome_summary <- winner_stats_df %>%
    group_by(Is_Legendary, Outcome) %>%
    summarize(Count = n(), .groups = "drop") %>%
    mutate(Is_Legendary = factor(Is_Legendary, levels = c(FALSE, TRUE), labels = c("Non-Legendary", "Legendary")))

ggplot(legendary_outcome_summary, aes(x = Is_Legendary, y = Count, fill = Outcome)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
    labs(title = "Win/Loss Distribution for Legendary Vs. Non-Legendary Matchups",
         x = "Legendary Status", y = "Count") +
    theme_minimal() +
    scale_fill_manual(values = outcome_colors) +
    theme(
        axis.text.x = element_text(size = 10),
        legend.position = "top"
    )

In these matchups, legendary Pokémon win more often than non-legendary Pokémon.

winner_stats_long_df <- winner_stats_df %>%
    pivot_longer(cols = c(HP, Attack, Defense, Sp_Atk, Sp_Def, Speed),
                 names_to = "Stat", values_to = "Value") %>%
    mutate(Is_Legendary = factor(Is_Legendary, levels = c(FALSE, TRUE), labels = c("Non-Legendary", "Legendary")))

ggplot(winner_stats_long_df, aes(x = Stat, y = Value, fill = Is_Legendary)) +
    geom_boxplot() +
    labs(title = "Pokemon Combat Stats Difference",
         x = "Stat", y = "Value") +
    theme_minimal() +
    scale_fill_manual(values = legendary_colors)

Indeed, legendary Pokémon generally possess higher base stats compared to non-legendary ones. However, the earlier chart showing the win/loss distribution reveals that legendary Pokémon still lost approximately 1,500 battles. This raises the question: what could explain these losses despite their superior stats?

To investigate further, let’s revisit the stat differences and consider other potential factors that might influence battle outcomes.

legendary_losers <- winner_stats_df %>%
    filter(Is_Legendary == TRUE & Outcome == "Loser")

non_legendary_winners <- winner_stats_df %>%
    filter(Is_Legendary == FALSE & Outcome == "Winner")

# Combine the two subsets
loser_stats_df <- bind_rows(
    legendary_losers %>% mutate(Group = "Legendary Losers"),
    non_legendary_winners %>% mutate(Group = "Non-Legendary Winners")
)

# Reshape the data into long format
long_stats_df <- loser_stats_df %>%
    pivot_longer(cols = c(HP, Attack, Defense, Sp_Atk, Sp_Def, Speed),
                 names_to = "Stat", values_to = "Value")

group_colors <- c("Legendary Losers" = "#f28482", "Non-Legendary Winners" = "#80ed99")
ggplot(long_stats_df, aes(x = Stat, y = Value, fill = Group)) +
    geom_boxplot(alpha = 0.7, position = position_dodge(width = 0.8)) + # Add dodge for side-by-side boxplots
    labs(title = "Stat Comparison in Legendary vs. Non-Legendary Matchups",
         x = "Stat", y = "Stat Value") +
    theme_minimal() +
    scale_fill_manual(values = group_colors) +
    theme(
        axis.text.x = element_text(size = 10), # Set x-axis label size for readability
        legend.position = "top"
    )

The legendary Pokémon that lost still outclass the non-legendary Pokémon that beat them in every stat except Speed. This observation strongly suggests that Speed plays a crucial role in determining which Pokémon will win a battle.

4.2. Type Effectiveness

Type matchups are a critical aspect of Pokémon battles, as moves that are “super effective” deal significantly more damage. This means that even if a Pokémon has a type disadvantage, it can still turn the tide of battle by utilizing moves that exploit its opponent’s weaknesses or cover its own vulnerabilities.

To begin our analysis, let’s first examine the distribution of Pokémon types in our battle log dataset.

type_outcomes <- bind_rows(
    df %>%
        select(First_Type1, Outcome = First_Outcome) %>%
        rename(Type = First_Type1),
    df %>%
        select(First_Type2, Outcome = First_Outcome) %>%
        rename(Type = First_Type2),
    df %>%
        select(Second_Type1, Outcome = Second_Outcome) %>%
        rename(Type = Second_Type1),
    df %>%
        select(Second_Type2, Outcome = Second_Outcome) %>%
        rename(Type = Second_Type2)
) %>%
    filter(Type != "") %>%
    group_by(Type, Outcome) %>%
    summarize(Count = n(), .groups = "drop") %>%
    group_by(Type) %>%
    mutate(Total = sum(Count),
           Proportion = Count / Total) %>%
    ungroup() %>%
    mutate(Type = fct_reorder(Type, Total))

type_counts <- type_outcomes %>%
    select(Type, Total) %>%
    unique()

ggplot(type_counts, aes(x = Type, y = Total, fill = Type)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(title = "Pokémon Type Counts",
         x = "Type", y = "Count") +
    coord_flip() +
    theme_minimal() +
    scale_fill_manual(values = type_color_mapping) +
    theme(legend.position = "None")

It appears that Water-type Pokémon are the most frequently used type in the battle log dataset. Next, let’s analyze the win-loss proportions for each Pokémon type to gain further insights into their performance.

ggplot(type_outcomes, aes(x = Type, y = Count, fill = Outcome)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.6) +
    labs(title = "Win/Loss Counts by Pokémon Type",
         x = "Type", y = "Count") +
    coord_flip() +
    theme_minimal() +
    scale_fill_manual(values = outcome_colors)

ggplot(type_outcomes, aes(x = Type, y = Proportion, fill = Outcome)) +
    geom_bar(stat = "identity", position = "stack") +
    labs(title = "Proportion of Wins/Losses by Pokémon Type",
         x = "Type", y = "Proportion") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_fill_manual(values = outcome_colors)

Flying, Normal, Psychic, Fire, Fighting, Dark, Dragon, and Electric have higher win rates in comparison to other types.

Let’s look closer at the effectiveness table.

type_matchup_matrix <- get_effectiveness_type_chart()
type_matchup_matrix <- type_matchup_matrix %>%
    pivot_longer(cols = -Attacking, names_to = "Defending", values_to = "Effectiveness") %>%
    rename(Attacking_Type = Attacking, Defending_Type = Defending)

effectiveness_colors <- c("#f28482", "#ffffff", "#80ed99") # Red (low) -> White (neutral) -> Green (high)
ggplot(type_matchup_matrix, aes(x = Defending_Type, y = Attacking_Type, fill = Effectiveness)) +
    geom_tile(color = "white", linewidth = 0.1) + # linewidth replaces the deprecated size argument
    geom_text(aes(label = Effectiveness), color = "black", size = 2.5) + # Add effectiveness values as text
    scale_fill_gradientn(colors = effectiveness_colors, limits = c(0, 2), name = "Effectiveness") +
    labs(title = "Pokémon Type Effectiveness Heatmap",
         x = "Defending Type", y = "Attacking Type") + # Labels match the aes() mapping
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8)
    )

For each attacking move that targets an opponent’s weakness, the damage is multiplied by 2x. For example, if a Pokémon uses a Fire-type move against a dual-type Bug and Steel Pokémon, it will deal 4x damage because both types are weak to Fire. Conversely, this multiplier system also applies to resistances, where the damage is reduced to 0.5x (or 0.25x for double resistances) instead of 2x.
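The multiplier arithmetic above can be sketched in a few lines: the combined multiplier against a dual-type defender is the product of the multipliers against each of its types. The `fire_vs` vector is a small hand-written excerpt of the type chart for Fire-type moves only, and `combined_multiplier()` is a hypothetical helper, not a function from this report.

```r
# Excerpt of the type chart: multipliers for a Fire-type move only
fire_vs <- c(Bug = 2, Steel = 2, Grass = 2, Ice = 2,
             Water = 0.5, Rock = 0.5, Dragon = 0.5, Fire = 0.5,
             Normal = 1)

# Combined multiplier = product of the multipliers against each defending type
combined_multiplier <- function(chart, defending_types) {
    prod(chart[defending_types])
}

bug_steel    <- combined_multiplier(fire_vs, c("Bug", "Steel"))    # 2 * 2
water_dragon <- combined_multiplier(fire_vs, c("Water", "Dragon")) # 0.5 * 0.5
grass_only   <- combined_multiplier(fire_vs, "Grass")              # single-type defender

print(c(bug_steel = bug_steel, water_dragon = water_dragon, grass_only = grass_only))
```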

With this in mind, we can now analyze the win rates for moves with effectiveness levels of 0.25x, 0.5x, 2x, and 4x to better understand their impact on battle outcomes.

df <- df %>%
    rowwise() %>%
    mutate(
        First_Most_Effectiveness_Type = max(First_Type1_Effectiveness, First_Type2_Effectiveness, na.rm = TRUE),
        Second_Most_Effectiveness_Type = max(Second_Type1_Effectiveness, Second_Type2_Effectiveness, na.rm = TRUE)
    ) %>%
    ungroup()

type_effectiveness_df <- bind_rows(
    df %>%
        select(First_Name, First_Most_Effectiveness_Type, Outcome = First_Outcome) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    df %>%
        select(Second_Name, Second_Most_Effectiveness_Type, Outcome = Second_Outcome) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything())
)

type_summary <- type_effectiveness_df %>%
    # Break at 0.25 as well, so that 0.25x and 0.5x fall into separate bins
    group_by(Outcome, Most_Effectiveness_Type = cut(Most_Effectiveness_Type, breaks = c(0, 0.25, 0.5, 1, 2, 4))) %>%
    summarize(Count = n(), .groups = "drop") %>%
    ungroup() %>%
    filter(!is.na(Most_Effectiveness_Type))

ggplot(type_summary, aes(x = Most_Effectiveness_Type, y = Count, fill = Outcome)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.7, alpha = 0.8) +
    labs(title = "Outcome Distribution by Pokemon Type Effectiveness",
         x = "Effectiveness", y = "Count") +
    theme_minimal() +
    scale_fill_manual(values = outcome_colors) +
    # Label names must match the interval levels produced by cut()
    scale_x_discrete(labels = c(
        "(0,0.25]" = "0.25x",
        "(0.25,0.5]" = "0.5x",
        "(0.5,1]" = "1x",
        "(1,2]" = "2x",
        "(2,4]" = "4x"
    ))

We can see that Pokémon whose types are 2x or 4x effective against their opponent win more often than they lose.
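When binning multipliers, it helps to remember that `cut()` builds right-closed intervals by default. A multiplier of exactly 0 (an immunity) therefore falls outside the first interval and becomes NA, and the placement of the breaks decides whether 0.25x and 0.5x are counted together. The sketch below compares two illustrative sets of breaks on the six possible multipliers.

```r
multipliers <- c(0, 0.25, 0.5, 1, 2, 4)

# Four bins: 0.25 and 0.5 share the "(0,0.5]" interval
coarse_bins <- cut(multipliers, breaks = c(0, 0.5, 1, 2, 4))

# Five bins: an extra break at 0.25 separates double from single resistances
fine_bins <- cut(multipliers, breaks = c(0, 0.25, 0.5, 1, 2, 4))

print(data.frame(multipliers, coarse_bins, fine_bins))
```

The interval names printed here are the factor levels, which is what a named `labels` vector in `scale_x_discrete()` is matched against.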

df <- df %>%
    rowwise() %>%
    mutate(
        First_Most_Effective_Move = max(First_Move1_Effectiveness, First_Move2_Effectiveness, First_Move3_Effectiveness, First_Move4_Effectiveness, na.rm = TRUE),
        Second_Most_Effective_Move = max(Second_Move1_Effectiveness, Second_Move2_Effectiveness, Second_Move3_Effectiveness, Second_Move4_Effectiveness, na.rm = TRUE)
    ) %>%
    ungroup()

move_effectiveness_df <- bind_rows(
    df %>%
        select(First_Name, First_Most_Effective_Move, Outcome = First_Outcome) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    df %>%
        select(Second_Name, Second_Most_Effective_Move, Outcome = Second_Outcome) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything())
)
move_summary <- move_effectiveness_df %>%
    # Break at 0.25 as well, so that 0.25x and 0.5x fall into separate bins
    group_by(Outcome, Most_Effective_Move = cut(Most_Effective_Move, breaks = c(0, 0.25, 0.5, 1, 2, 4))) %>%
    summarize(Count = n(), .groups = "drop") %>%
    ungroup() %>%
    filter(!is.na(Most_Effective_Move))

ggplot(move_summary, aes(x = Most_Effective_Move, y = Count, fill = Outcome)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.7, alpha = 0.8) +
    labs(title = "Outcome Distribution by Move Effectiveness",
         x = "Effectiveness", y = "Count") +
    theme_minimal() +
    scale_fill_manual(values = outcome_colors) +
    # An empty labels vector is NULL and would hide the axis labels, so name each bin
    scale_x_discrete(labels = c(
        "(0,0.25]" = "0.25x",
        "(0.25,0.5]" = "0.5x",
        "(0.5,1]" = "1x",
        "(1,2]" = "2x",
        "(2,4]" = "4x"
    ))

The same pattern holds for move effectiveness: Pokémon with a 2x or 4x move win more often.

What about the Pokémon that lost despite having a type advantage?

df <- df %>%
    mutate(
        First_Combined_Effectiveness = coalesce(First_Type1_Effectiveness * First_Type2_Effectiveness, First_Type1_Effectiveness),
        Second_Combined_Effectiveness = coalesce(Second_Type1_Effectiveness * Second_Type2_Effectiveness, Second_Type1_Effectiveness)
)
    
loser_df <- bind_rows(
    df %>%
        filter(First_Combined_Effectiveness < 1) %>%
        select(First_HP, First_Attack, First_Defense, First_Sp_Atk, First_Sp_Def, First_Speed, First_Is_Legendary, First_Move1_Effectiveness, First_Move2_Effectiveness, First_Move3_Effectiveness, First_Move4_Effectiveness, First_Outcome) %>%
        rename_with(~ gsub("First_", "", .), .cols = everything()),
    
    df %>%
        filter(Second_Combined_Effectiveness < 1) %>%
        select(Second_HP, Second_Attack, Second_Defense, Second_Sp_Atk, Second_Sp_Def, Second_Speed, Second_Is_Legendary, Second_Move1_Effectiveness, Second_Move2_Effectiveness, Second_Move3_Effectiveness, Second_Move4_Effectiveness, Second_Outcome) %>%
        rename_with(~ gsub("Second_", "", .), .cols = everything())
)

loser_long_df <- loser_df %>%
    pivot_longer(cols = c(HP, Attack, Defense, Sp_Atk, Sp_Def, Speed, Is_Legendary, Move1_Effectiveness, Move2_Effectiveness, Move3_Effectiveness, Move4_Effectiveness),
                 names_to = "Stat", values_to = "Value")

ggplot(loser_long_df, aes(x = Stat, y = Value, fill = Outcome)) +
    geom_boxplot() +
    labs(title = "Pokemon Combat Stats Difference",
         x = "Stat", y = "Value") +
    theme_minimal() +
    scale_fill_manual(values = outcome_colors, labels = c("Loser (w/ type advantage)", "Winner (w/o type advantage)"))

Once again, Speed proves to be a dominant factor, outclassing even type advantages. As a result, Pokémon with type disadvantages can still achieve a higher win probability if they are faster than their opponent, allowing them to strike first and potentially overpower the opposing Pokémon before it can act.

4.3. Moveset Combination

To address the third research question, we will utilize the “Category” column in the dataset. Moves are classified into three categories: physical, special, and status.

Physical moves deal damage based on the attacker’s Attack stat and the target’s Defense stat.
Special moves calculate damage using the attacker’s Special Attack stat and the target’s Special Defense stat.
Status moves do not inflict direct damage but instead affect the battlefield or Pokémon stats. For example, they can raise or lower stats, apply status effects (e.g., paralysis, sleep), or manipulate environmental conditions like weather or terrain.

By analysing the distribution and effectiveness of these move categories, we can determine whether certain combinations—such as balancing physical and special moves or incorporating strategic status moves—lead to higher win rates.
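For a single moveset, the grouping idea can be sketched as a small function that counts the categories and returns a label. `classify_moveset()` is a hypothetical helper written for illustration; it assumes the categories are spelled "Physical", "Special", and "Status", and it treats any moveset containing a status move as one group.

```r
# Classify one moveset from the categories of its four moves
classify_moveset <- function(categories) {
    n_physical <- sum(categories == "Physical", na.rm = TRUE)
    n_special  <- sum(categories == "Special", na.rm = TRUE)
    n_status   <- sum(categories == "Status", na.rm = TRUE)

    if (n_status > 0) return("Mixed with Status")
    if (n_physical == 4) return("4 Physical")
    if (n_special == 4) return("4 Special")
    if (n_physical + n_special == 4) {
        return(sprintf("%d Physical, %d Special", n_physical, n_special))
    }
    "Other"  # fewer than four known moves
}

balanced    <- classify_moveset(c("Physical", "Physical", "Special", "Special"))
with_status <- classify_moveset(c("Physical", "Status", "Special", "Special"))
incomplete  <- classify_moveset(c("Physical", NA, "Special", "Special"))

print(c(balanced, with_status, incomplete))
```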

df <- df %>%
    rowwise() %>%
    mutate(
        Physical_Count = sum(c_across(starts_with("First_Category_Move")) == "Physical", na.rm = TRUE),
        Special_Count = sum(c_across(starts_with("First_Category_Move")) == "Special", na.rm = TRUE),
        Status_Count = sum(c_across(starts_with("First_Category_Move")) == "Status", na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(Move_Set_Combination = case_when(
        Physical_Count == 2 & Special_Count == 2 ~ "2 Physical, 2 Special",
        Physical_Count == 1 & Special_Count == 3 ~ "1 Physical, 3 Special",
        Physical_Count == 3 & Special_Count == 1 ~ "3 Physical, 1 Special",
        Physical_Count == 4 ~ "4 Physical",
        Special_Count == 4 ~ "4 Special",
        Status_Count > 0 ~ "Mixed with Status",
        TRUE ~ "Other"
    ))

move_set_summary <- df %>%
    group_by(Move_Set_Combination, First_Outcome) %>%
    summarize(Count = n(), .groups = "drop") %>%
    group_by(Move_Set_Combination) %>%
    mutate(Total = sum(Count),
           Proportion = Count / Total) %>%
    filter(First_Outcome == "Winner") %>% # Focus on win rates
    ungroup()

combination_colors <- c(
    "2 Physical, 2 Special" = "#80ed99",
    "1 Physical, 3 Special" = "#f7dc6f",
    "3 Physical, 1 Special" = "#f28482",
    "4 Physical" = "#a8dadc",
    "4 Special" = "#ffafcc",
    "Mixed with Status" = "#b8c0ff"
)

# Plot win rates by move set combination
ggplot(move_set_summary, aes(x = Move_Set_Combination, y = Proportion, fill = Move_Set_Combination)) +
    geom_bar(stat = "identity", alpha = 0.8) +
    labs(title = "Win Rates by Move Set Combination",
         x = "Move Set Combination", y = "Win Rate") +
    theme_minimal() +
    scale_fill_manual(values = combination_colors) +
    theme(
        axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
        legend.position = "none"
    )

Combinations of attacking moves prove to be more efficient, especially when paired with Speed. If a Pokémon is guaranteed to move first due to its high Speed stat, it can strategically utilize its four available moves to adapt to the situation and maximize damage output. This flexibility allows the Pokémon to exploit type weaknesses, cover its own vulnerabilities, or apply status effects effectively, making it a formidable opponent in battle.

5. Model Training

Now that we have analyzed the key factors contributing to a “Winning” Pokémon, we can leverage this understanding to build a predictive model. For this purpose, we will train XGBoost (eXtreme Gradient Boosting) models, which are well-suited for handling complex datasets and capturing intricate relationships between features. This approach will enable us to make accurate predictions about battle outcomes based on the derived insights.

5.1. All Features

To begin with a straightforward approach, we can train the XGBoost model using all the engineered features obtained from the data wrangling and web scraping process. The dataset currently contains a total of 58 features. However, additional preprocessing steps are required to ensure compatibility with the model:

Binary Label Creation:
Since the raw label “Outcome” is not in binary form, we need to create a new label called “First_Outcome”. This label will indicate whether the first Pokémon wins (“Winner”, encoded as 1) or loses (“Loser”, encoded as 0) the battle.
Handling Missing Values in “Power” Columns:
For move “Power” columns, replace empty strings and None values with 0. This indicates that the move is a status-based move, which does not deal direct damage.
Handling Missing Values in “Accuracy” Columns:
Similarly, for “Accuracy” columns, replace empty strings and None values with 100, on the assumption that a move with no listed accuracy always hits.
Converting Non-Numeric Columns:
XGBoost models do not support character or logical data types. Therefore, all remaining non-numeric columns must be converted to numeric types (e.g., using one-hot encoding or label encoding for categorical variables).

These preprocessing steps will ensure the dataset is properly formatted and ready for training the XGBoost model.

df <- readRDS("files/data/final_pokemon_data.rds")
df <- df %>%
    mutate(
        First_Outcome = ifelse(First_pokemon == Winner, "Winner", "Loser"),
        First_Outcome = ifelse(First_Outcome == "Winner", 1, 0),
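        # Moves without a numeric power (blank or "None") are treated as 0
        across(ends_with("Power"), ~ ifelse(.x == "" | .x == "None", 0, as.numeric(.x))),
        # Moves without a numeric accuracy (blank or "None") are treated as 100
        across(ends_with("Accuracy"), ~ ifelse(.x == "" | .x == "None", 100, as.numeric(.x))),
        # Note: as.numeric() turns non-numeric strings (e.g. type names) into NA,
        # which contributes to the coercion warnings below; XGBoost treats NA as missing
        across(where(is.character) | where(is.logical), ~ as.numeric(.))
    ) %>%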
    select(First_Type1, First_Type2, First_HP, First_Attack, First_Defense,
           First_Sp_Atk, First_Sp_Def, First_Speed, First_Is_Legendary,
           First_Move1_Type, First_Category_Move1, First_PP_Move1,
           First_Move1_Power, First_Move1_Accuracy,
           First_Move2_Type, First_Category_Move2, First_PP_Move2,
           First_Move2_Power, First_Move2_Accuracy,
           First_Move3_Type, First_Category_Move3, First_PP_Move3,
           First_Move3_Power, First_Move3_Accuracy,
           First_Move4_Type, First_Category_Move4, First_PP_Move4,
           First_Move4_Power, First_Move4_Accuracy,
           Second_Type1, Second_Type2, Second_HP, Second_Attack, Second_Defense,
           Second_Sp_Atk, Second_Sp_Def, Second_Speed, Second_Is_Legendary,
           Second_Move1_Type, Second_Category_Move1, Second_PP_Move1,
           Second_Move1_Power, Second_Move1_Accuracy,
           Second_Move2_Type, Second_Category_Move2, Second_PP_Move2,
           Second_Move2_Power, Second_Move2_Accuracy,
           Second_Move3_Type, Second_Category_Move3, Second_PP_Move3,
           Second_Move3_Power, Second_Move3_Accuracy,
           Second_Move4_Type, Second_Category_Move4, Second_PP_Move4,
           Second_Move4_Power, Second_Move4_Accuracy,
           First_Outcome)

## Warning: There were 46 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 45 remaining warnings.
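The coercion warnings arise because `as.numeric()` returns NA for any string that is not a number, such as a type name or "None". One possible alternative, sketched below on toy values, is to label-encode categorical strings through `factor()` and to substitute defaults before the numeric conversion. Both helper functions are hypothetical and are not part of the pipeline above; note also that integer codes impose an arbitrary ordering on the categories, which is a modelling assumption.

```r
# Toy columns (hypothetical values) mimicking the raw character data
raw <- data.frame(
    First_Type1       = c("Fire", "Water", "Grass", "Fire"),
    First_Move1_Power = c("90", "None", "", "40"),
    stringsAsFactors  = FALSE
)

# Label encoding: each distinct string gets an integer code (alphabetical order)
encode_labels <- function(x) as.integer(factor(x))

# Replace blanks and "None" with a default before as.numeric(),
# so that no coercion warning is raised
to_numeric_with_default <- function(x, default) {
    x[is.na(x) | x == "" | x == "None"] <- default
    as.numeric(x)
}

encoded <- data.frame(
    First_Type1       = encode_labels(raw$First_Type1),
    First_Move1_Power = to_numeric_with_default(raw$First_Move1_Power, 0)
)
print(encoded)
```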

We will split the data into training and test sets using an 80:20 split.

prepare_training_data <- function(df, train_pct = 0.8){
    set.seed(123)
    
    train_indices <- createDataPartition(df$First_Outcome, p = train_pct, list = FALSE)
    train_data <- df[train_indices, ]
    test_data <- df[-train_indices, ]
    
    features <- setdiff(names(train_data), "First_Outcome") # Exclude the target column
    train_matrix <- as.matrix(train_data[features])
    test_matrix <- as.matrix(test_data[features])
    train_labels <- train_data$First_Outcome
    test_labels <- test_data$First_Outcome

    return(list(
        train_matrix = train_matrix,
        test_matrix = test_matrix,
        train_labels = train_labels,
        test_labels = test_labels
    ))
}
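To illustrate what a class-balanced split looks like, the base R sketch below samples the same fraction of rows within each class. It is a simplified stand-in written for illustration, not caret's implementation of `createDataPartition()`, and `stratified_split()` is a hypothetical helper.

```r
# Stratified split: sample the same fraction of row indices within each class
stratified_split <- function(labels, train_pct = 0.8, seed = 123) {
    set.seed(seed)
    per_class <- lapply(split(seq_along(labels), labels), function(idx) {
        idx[sample.int(length(idx), size = round(train_pct * length(idx)))]
    })
    sort(unname(unlist(per_class)))
}

toy_labels <- rep(c(0, 1), times = c(60, 40))  # toy labels with a 60/40 balance
train_idx  <- stratified_split(toy_labels)

print(table(toy_labels[train_idx]))   # class counts in the training set
print(table(toy_labels[-train_idx]))  # class counts in the test set
```

Both subsets keep the original 60/40 ratio, which is the property that matters when the outcome classes are not perfectly balanced.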

The XGBoost models will be trained using the following hyperparameters:

max_depth = 6: The maximum depth for each tree.
eta = 0.1: The learning rate, which controls the contribution of each tree.
nrounds = 100: The number of boosting rounds (iterations).
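To build some intuition for how the learning rate and the number of rounds interact, consider an idealised setting in which every round fits the current residual perfectly but only a fraction `eta` of that fit is added to the model. This is a deliberate simplification and not how XGBoost computes its trees; under that assumption the remaining error shrinks by a factor of `1 - eta` per round.

```r
eta <- 0.1
nrounds <- 100

# residual[i + 1] holds the remaining error after i rounds
residual <- numeric(nrounds + 1)
residual[1] <- 1  # initial error, in arbitrary units
for (i in seq_len(nrounds)) {
    residual[i + 1] <- residual[i] - eta * residual[i]
}

# Remaining error after 1, 10, and 100 rounds
print(residual[c(2, 11, 101)])
```

A smaller `eta` therefore needs more rounds to reach the same fit, which is why the two hyperparameters are usually tuned together.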

Using the predefined hyperparameters, we can then train the model and predict the test dataset.

training_data <- prepare_training_data(df)
train_matrix <- training_data$train_matrix
test_matrix <- training_data$test_matrix
train_labels <- training_data$train_labels
test_labels <- training_data$test_labels

xgb_model <- xgboost(
    data = train_matrix,
    label = train_labels,
    max.depth = 6,        # Maximum depth of each tree
    eta = 0.1,            # Learning rate
    nrounds = 100,        # Number of boosting rounds
    objective = "binary:logistic", # Binary classification
    eval_metric = "error", # Evaluation metric
    verbose = 0           # Suppress the training log
)

Now, let’s proceed to make predictions. Because the label is 1 when the first Pokémon wins, the model outputs the probability of a win: the first Pokémon is predicted as the Winner if that probability is above 0.5. Otherwise, it is classified as the Loser.

pred_probs <- predict(xgb_model, test_matrix)

# Convert probabilities to binary predictions (threshold = 0.5)
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)

# Calculate metrics
conf_matrix <- confusionMatrix(factor(pred_labels), factor(test_labels))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4965  201
##          1  316 4518
##                                           
##                Accuracy : 0.9483          
##                  95% CI : (0.9438, 0.9526)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8964          
##                                           
##  Mcnemar's Test P-Value : 5.339e-07       
##                                           
##             Sensitivity : 0.9402          
##             Specificity : 0.9574          
##          Pos Pred Value : 0.9611          
##          Neg Pred Value : 0.9346          
##              Prevalence : 0.5281          
##          Detection Rate : 0.4965          
##    Detection Prevalence : 0.5166          
##       Balanced Accuracy : 0.9488          
##                                           
##        'Positive' Class : 0               
##

We have a solid starting model with an accuracy of approximately 94.83%. Note that caret reports class 0 (the first Pokémon loses) as the positive class, so the sensitivity of 0.9402 is the recall for losses and the specificity of 0.9574 is the recall for wins. The model is therefore slightly weaker at identifying battles that the first Pokémon loses.
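The headline metrics can be recomputed by hand from the four counts in the confusion matrix printed above, which makes it clear which class each metric refers to. The sketch below uses only base R; the counts are copied from that output.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(4965, 316, 201, 4518), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)

# caret reports class "0" as the positive class in this output
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # recall for class 0 (first Pokémon loses)
specificity <- cm["1", "1"] / sum(cm[, "1"])  # recall for class 1 (first Pokémon wins)

metrics <- round(c(accuracy = accuracy,
                   sensitivity = sensitivity,
                   specificity = specificity), 4)
print(metrics)
```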

To better understand the model’s behavior and decision-making process, let’s examine the feature importance for this model. This will help us identify which features contribute most significantly to the predictions.

importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = 10) # Show top 10 features

As expected, the two Speed stats are the strongest predictors, followed by the Attack stats.

5.2. Engineered Features

An accuracy of 94% with 58 features is impressive, but can we further improve the model? Based on the EDA conducted earlier, we can refine the feature set by including only the most impactful features, such as:

Type effectiveness (e.g., maximum damage multipliers like 2x or 4x for moves targeting opponent weaknesses).
Move combination counts (e.g., the number of physical, special, and status moves in a Pokémon’s moveset).

At the same time, we can exclude less relevant or redundant features that may introduce noise or overcomplicate the model. This streamlined approach could enhance performance while maintaining interpretability. Let’s proceed with this optimized feature selection and evaluate its impact on the model.
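The two kinds of engineered features described above can also be computed without row-wise iteration, which may be faster on a dataset of this size. The sketch below uses `pmax()` for the element-wise maximum and `rowSums()` for the category counts; all values are toy data, and the object names are illustrative rather than columns of the real dataset.

```r
# Toy effectiveness values; NA marks a Pokémon without a second type
type1_eff <- c(2, 0.5, 1, 0.5)
type2_eff <- c(NA, 1, 2, 0.5)

# Element-wise maximum across the two vectors, ignoring NA
type_max_eff <- pmax(type1_eff, type2_eff, na.rm = TRUE)

# Toy move categories, one row per Pokémon and one column per move slot
categories <- cbind(
    Move1 = c("Physical", "Special", "Status", "Physical"),
    Move2 = c("Physical", "Special", "Physical", "Physical"),
    Move3 = c("Special", "Special", "Special", "Physical"),
    Move4 = c("Status", "Special", NA, "Physical")
)

# Count the physical moves in each row; NA entries are ignored
physical_count <- rowSums(categories == "Physical", na.rm = TRUE)

print(data.frame(type_max_eff, physical_count))
```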

df <- readRDS("files/data/final_pokemon_data.rds")
df <- df %>%
    rowwise() %>%
    mutate(
        # Pokemon's type effectiveness
        First_Type_Max_Effectiveness = coalesce(max(c_across(starts_with("First_Type") & ends_with("Effectiveness")), na.rm = TRUE), First_Type1_Effectiveness),
        Second_Type_Max_Effectiveness = coalesce(max(c_across(starts_with("Second_Type") & ends_with("Effectiveness")), na.rm = TRUE), Second_Type1_Effectiveness),
        # Pokemon's move effectiveness
        First_Move_Max_Effectiveness = max(c_across(starts_with("First_Move") & ends_with("Effectiveness")), na.rm = TRUE),
        Second_Move_Max_Effectiveness = max(c_across(starts_with("Second_Move") & ends_with("Effectiveness")), na.rm = TRUE),
        # Move combination count
        First_Physical_Move_Count = sum(c_across(starts_with("First_Category_Move")) == "Physical", na.rm = TRUE),
        First_Special_Move_Count = sum(c_across(starts_with("First_Category_Move")) == "Special", na.rm = TRUE),
        First_Status_Move_Count = sum(c_across(starts_with("First_Category_Move")) == "Status", na.rm = TRUE),
        Second_Physical_Move_Count = sum(c_across(starts_with("Second_Category_Move")) == "Physical", na.rm = TRUE),
        Second_Special_Move_Count = sum(c_across(starts_with("Second_Category_Move")) == "Special", na.rm = TRUE),
        Second_Status_Move_Count = sum(c_across(starts_with("Second_Category_Move")) == "Status", na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
        First_Outcome = ifelse(First_pokemon == Winner, "Winner", "Loser"),
        First_Outcome = ifelse(First_Outcome == "Winner", 1, 0),
        across(ends_with("Power"), ~ ifelse(.x == "" | .x == "None", 0, as.numeric(.x))),
        across(ends_with("Accuracy"), ~ ifelse(.x == "" | .x == "None", 100, as.numeric(.x))),
        across(where(is.character) | where(is.logical), ~ as.numeric(.))
    )%>%
    select(First_Type1, First_Type2, First_HP, First_Attack, First_Defense,
           First_Sp_Atk, First_Sp_Def, First_Speed, First_Is_Legendary,
           First_Type_Max_Effectiveness, First_Move_Max_Effectiveness,
           First_Physical_Move_Count, First_Special_Move_Count,
           First_Status_Move_Count,
           Second_Type1, Second_Type2, Second_HP, Second_Attack, Second_Defense,
           Second_Sp_Atk, Second_Sp_Def, Second_Speed, Second_Is_Legendary,
           Second_Type_Max_Effectiveness, Second_Move_Max_Effectiveness,
           Second_Physical_Move_Count, Second_Special_Move_Count,
           Second_Status_Move_Count, First_Outcome)

## Warning: There were 46 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 45 remaining warnings.
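The first warning above points at `ifelse()`. A likely explanation is that `ifelse()` evaluates `as.numeric(.x)` on the whole column, including the `""` and `"None"` entries that the test then replaces, so the warning fires even though those values never reach the result. The sketch below reproduces this on a made-up `power` vector (an assumption, not the real data) and shows one alternative that replaces the placeholders before converting.

```r
# Hypothetical move-power column with placeholder strings
power <- c("80", "None", "", "120")

# Same pattern as in the pipeline: as.numeric() sees "None" and "" and warns,
# but ifelse() discards those NA values in favour of 0
via_ifelse <- suppressWarnings(
  ifelse(power == "" | power == "None", 0, as.numeric(power))
)

# Alternative: substitute placeholders first, then convert without warnings
cleaned <- power
cleaned[cleaned %in% c("", "None")] <- "0"
via_replace <- as.numeric(cleaned)

print(via_ifelse)
print(via_replace)
```

Both routes give the same numbers here; the second simply avoids the warning. The remaining warnings may have other causes, such as the final `across()` that converts every character column to numeric.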

saveRDS(df, file = "files/data/pokemon_data_cleaned.rds") # Save and load the data to save time

We train the model on the reduced feature set with the same hyperparameters as before.

df <- readRDS("files/data/pokemon_data_cleaned.rds")
training_data <- prepare_training_data(df)
train_matrix <- training_data$train_matrix
test_matrix <- training_data$test_matrix
train_labels <- training_data$train_labels
test_labels <- training_data$test_labels

xgb_model <- xgboost(
    data = train_matrix,
    label = train_labels,
    max.depth = 6,        # Maximum depth of each tree
    eta = 0.1,            # Learning rate
    nrounds = 100,        # Number of boosting rounds
    objective = "binary:logistic", # Binary classification
    eval_metric = "error", # Evaluation metric
    scale_pos_weight = sum(train_labels == 0) / sum(train_labels == 1),
    verbose = 0           # Suppress training progress output
)

pred_probs <- predict(xgb_model, test_matrix)
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)
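The `scale_pos_weight` argument in the call above is the ratio of negative to positive training labels. In effect it up-weights the positive class so that both classes carry the same total weight. A minimal sketch with a made-up label vector (not the real `train_labels`):

```r
# Hypothetical labels: 6 negatives, 4 positives
labels <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)

n_neg <- sum(labels == 0)
n_pos <- sum(labels == 1)

# Same formula as in the xgboost() call
scale_pos_weight <- n_neg / n_pos

# After weighting, the positives' total weight matches the negatives' count
weighted_pos_total <- n_pos * scale_pos_weight

print(scale_pos_weight)
print(weighted_pos_total)
```

When the classes are already close to balanced, as the prevalence of about 0.53 in the test set suggests, this ratio is near 1 and has little effect.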

Now let’s look at the result.

pred_probs <- predict(xgb_model, test_matrix)

# Convert probabilities to binary predictions (threshold = 0.5)
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)

# Calculate metrics
conf_matrix <- confusionMatrix(factor(pred_labels), factor(test_labels))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4993  193
##          1  288 4526
##                                          
##                Accuracy : 0.9519         
##                  95% CI : (0.9475, 0.956)
##     No Information Rate : 0.5281         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9036         
##                                          
##  Mcnemar's Test P-Value : 1.819e-05      
##                                          
##             Sensitivity : 0.9455         
##             Specificity : 0.9591         
##          Pos Pred Value : 0.9628         
##          Neg Pred Value : 0.9402         
##              Prevalence : 0.5281         
##          Detection Rate : 0.4993         
##    Detection Prevalence : 0.5186         
##       Balanced Accuracy : 0.9523         
##                                          
##        'Positive' Class : 0              
##
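As a check on how to read this output, the headline metrics can be recomputed by hand from the four cell counts. The sketch below copies the counts from the matrix above and treats class 0 as the "positive" class, as caret reports; the variable names are only illustrative.

```r
# Cell counts copied from the confusion matrix above
cm <- matrix(
  c(4993, 288, 193, 4526), nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# With class 0 as "positive":
tp <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["1", "0"]  # predicted 1, actually 0
fp <- cm["0", "1"]  # predicted 0, actually 1
tn <- cm["1", "1"]  # predicted 1, actually 1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # recall for class 0
specificity <- tn / (tn + fp)   # recall for class 1
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

These values agree with the Accuracy, Sensitivity, Specificity and Balanced Accuracy lines printed by `confusionMatrix()`.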

The feature importance (plotted below) still highlights the dominance of the Speed stats, but now we can clearly see the significance of Pokémon type effectiveness coming into play. By removing less critical features, we roughly halved the total number of features while achieving an improved accuracy of 95.19%.

Notably, both sensitivity and specificity also improved. Note that caret treats class 0 as the 'Positive' class in this output, so sensitivity here is the recall for class 0 and specificity is the recall for class 1. This suggests that reducing feature redundancy not only simplified the model but also improved its overall performance on the test set.

importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = 10) # Show top 10 features

Save the final XGBoost model.

xgb.save(xgb_model, "files/models/xgboost_model.model")

## [1] TRUE

5.3. Pokemon Clustering

The project also incorporates Pokémon strategic team building as a key feature, allowing players to construct and comprehend the roles of each Pokémon in their team. In battles, a player can bring up to six different Pokémon, and assembling an optimal team with diverse Pokémon that can cover each other’s weaknesses significantly increases the likelihood of winning.

To begin this analysis, we will use the raw Pokémon compendium dataset obtained from Kaggle.

df <- read_csv("files/data/pokemon-challenge/pokemon.csv")

## Rows: 800 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Name, Type 1, Type 2
## dbl (8): #, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation
## lgl (1): Legendary
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df <- df %>%
    mutate(
        across(where(is.character) | where(is.logical), ~ as.numeric(.))
    )

## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(where(is.character) | where(is.logical),
##   ~as.numeric(.))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

Next, we scale the numeric columns because they have different ranges. Without scaling, columns with larger ranges would dominate distance-based steps such as PCA and k-means.

scaled_data <- scale(df)
scaled_df <- as.data.frame(scaled_data)

nzv_columns <- nearZeroVar(scaled_df, saveMetrics = TRUE)

# Extract the names of constant or near-zero variance columns
problematic_columns <- rownames(nzv_columns[nzv_columns$zeroVar | nzv_columns$nzv, , drop = FALSE])
scaled_df <- scaled_df[, !colnames(scaled_df) %in% problematic_columns]
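For reference, `scale()` centers each column to mean 0 and rescales it to standard deviation 1. The sketch below checks this on a small made-up stats table; the values are illustrative, not taken from the compendium.

```r
# Hypothetical stats with very different spreads
stats_toy <- data.frame(
  HP    = c(45, 60, 80, 255),
  Speed = c(45, 60, 80, 5)
)

scaled_toy <- scale(stats_toy)

# After scaling, every column has mean ~0 and standard deviation 1
col_means <- colMeans(scaled_toy)
col_sds   <- apply(scaled_toy, 2, sd)

print(round(col_means, 10))
print(col_sds)
```

A column that is constant, or entirely `NA` after the numeric conversion above, cannot be rescaled meaningfully, which is one reason the near-zero-variance filter is applied afterwards.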

Unlike the 58-feature battle dataset, the compendium has only nine numeric columns left after cleaning, but clustering on all of them would still be hard to visualize and interpret. To address this, we first apply PCA (Principal Component Analysis) to reduce the data to two components. This simplifies the clustering step and lets us visualize the results in a 2D plot. As the output below shows, the first two components capture about 56% of the total variance.

# Perform PCA
pca_result <- prcomp(scaled_df, center = TRUE, scale. = TRUE)

# Extract the first two principal components
pca_data <- data.frame(
    PC1 = pca_result$x[, 1],
    PC2 = pca_result$x[, 2]
)

# View the explained variance
summary_pca <- summary(pca_result)
print(summary_pca$importance)

##                             PC1      PC2      PC3      PC4     PC5       PC6
## Standard deviation     1.761715 1.393246 1.052153 0.891066 0.84937 0.7939689
## Proportion of Variance 0.344850 0.215680 0.123000 0.088220 0.08016 0.0700400
## Cumulative Proportion  0.344850 0.560530 0.683530 0.771760 0.85191 0.9219600
##                              PC7       PC8       PC9
## Standard deviation     0.6500174 0.5167923 0.1130869
## Proportion of Variance 0.0469500 0.0296700 0.0014200
## Cumulative Proportion  0.9689000 0.9985800 1.0000000
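The "Proportion of Variance" row follows directly from the standard deviations: each component's variance is its squared standard deviation, divided by the sum over all components. The sketch below redoes that arithmetic with the nine values printed above.

```r
# Standard deviations copied from the PCA summary above
sdev <- c(1.761715, 1.393246, 1.052153, 0.891066, 0.84937,
          0.7939689, 0.6500174, 0.5167923, 0.1130869)

variance <- sdev^2
prop_var <- variance / sum(variance)  # share of total variance per component
cum_var  <- cumsum(prop_var)          # running total

print(round(prop_var[1:2], 4))
print(round(cum_var[2], 4))
```

The variances sum to roughly 9, the number of standardized columns, and the first two components together account for about 56% of it. The 2D cluster plot therefore leaves out a sizeable share of the variation.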

Next, we find the optimal number of clusters k using the elbow method.

wss <- fviz_nbclust(pca_data, kmeans, method = "wss")
wss

The elbow plot suggests that the optimal k is 9 or 10; we use k = 9.
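For readers unfamiliar with the elbow method, the quantity being plotted is the total within-cluster sum of squares (WSS) for each candidate k. The sketch below computes it with base R's `kmeans()` on synthetic 2D points; the three blobs and the name `pts` are assumptions for illustration, not the Pokémon data.

```r
set.seed(42)

# Three synthetic, well-separated blobs of 50 points each
blob_centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, blob_centers[i, 1], 0.5),
        rnorm(50, blob_centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..10
wss_values <- sapply(1:10, function(k) {
  kmeans(pts, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

print(round(wss_values, 1))
```

WSS drops sharply until k reaches the true number of groups and then flattens; the "elbow" is where that flattening begins. On real data the bend is often less distinct, which is why a range such as 9 or 10 is a reasonable reading.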

set.seed(123)
k <- 9
kmeans_model <- kmeans(pca_data, centers = k, nstart = 25)

# Add cluster assignments to the PCA data
pca_data$Cluster <- as.factor(kmeans_model$cluster)

ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
    geom_point(size = 2, alpha = 0.8) +
    labs(title = "K-Means Clustering with PCA",
         x = "Principal Component 1", y = "Principal Component 2") +
    theme_minimal() +
    theme(legend.position = "top")
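One possible way to read clusters as team roles, which the analysis above does not do, is to average the original stats within each cluster. The following is a sketch on made-up data with two obvious groups; `toy_stats` and its values are hypothetical.

```r
set.seed(1)

# Hypothetical stats: 20 attack-heavy and 20 defense-heavy Pokémon
toy_stats <- data.frame(
  Attack  = c(rnorm(20, 120, 5), rnorm(20, 50, 5)),
  Defense = c(rnorm(20, 60, 5),  rnorm(20, 130, 5))
)

# Cluster on scaled stats, then attach the labels to the unscaled table
km_toy <- kmeans(scale(toy_stats), centers = 2, nstart = 25)
toy_stats$Cluster <- factor(km_toy$cluster)

# Mean stat profile per cluster
cluster_profile <- aggregate(cbind(Attack, Defense) ~ Cluster,
                             data = toy_stats, FUN = mean)
print(cluster_profile)
```

Applied to the real data, the same idea would mean attaching `kmeans_model$cluster` to the compendium rows and comparing mean HP, Attack, Defense and Speed per cluster.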

And with that, we have reached the conclusion of the analysis for this dataset. We have conducted a comprehensive exploration and modeling process, covering a sufficient range of analyses to address the problem at the heart of this project. This work can serve as an initial model prototype, laying the foundation for potential integration into the real Pokémon game system in the future.

By refining the model further and incorporating additional features or real-time data, this prototype could evolve into a powerful tool for players to make informed decisions during battles, enhancing both strategic gameplay and competitive performance.

Assignment 3

Archel Taneka Sutanto

02 July, 2025
