Introduction

Image courtesy of screenrant.com

In this code through, I am going to show how you can use k-means clustering with datasets. Keeping with the theme of my last code through, I will show how to cluster data from Pokemon GO.

If you aren’t familiar with Pokemon, it’s a video game about collecting fantasy monsters that battle each other with special powers. We’re up to over a thousand of these pocket monsters now, and every time a new game is announced, the significant overlap of data nerds and Pokemon nerds datamines it to pull all the stats. Pokemon GO is the mobile entry in the series, which before 2016 was only on Nintendo consoles.

Pokemon GO also gets a bit heavy on inventory management, as there are so many Pokemon and an almost infinite number of team lineups of six Pokemon each. R comes in handy here, as you can load a dataset containing all the current Pokemon and use it to help you manage your own.

Content Overview

This code through explores how to cluster data using Shreya Sur965’s “Gotta Analyze ’Em All: The Ultimate Pokémon GO Dataset” and the fviz_nbclust(), kmeans(), and fviz_cluster() functions. This helps me learn more about R and also incentivizes me to clear out my inventory of Pokémon, because I am trying not to become a digital hoarder.

In doing this code through, I learned about k-means clustering and the elbow method, which helps you find the optimal number of clusters. That is helpful when you are trying not to create too many groups.

Why You Should Care

Clustering is a powerful tool. It can show you unexpected patterns that lead to new understandings, or it can confirm what we already know in a way that is accurate and repeatable.

Further Exposition

This is based on the work of Shreya Sur965, who created the “Gotta Analyze ’Em All: The Ultimate Pokémon GO Dataset” and hosted it on kaggle.com. I also compared the outcome of my data manipulation against Pokemon GO Database.

Starting to Code

First off, we need to load a few libraries:

# LOAD LIBRARIES

library(tidyverse)   # loads dplyr, ggplot2, and readr (for read_csv)
library(mclust)      # model-based cluster analysis
library(ggthemes)    # extra ggplot2 themes
library(pander)      # formatted tables

# K-Means Clustering
# install.packages("factoextra")
library(factoextra)  # fviz_nbclust() and fviz_cluster()
library(cluster)

set.seed(1234)

Examine the Dataset

Next we’ll download the pokemon.csv file, read it in with read_csv(), and take a closer look at its columns with glimpse().

Anyone familiar with Pokemon will recognize that the Pokemon are arranged by their number (pokemon_id) with Bulbasaur being the very first Pokemon.

# Loading the dataset

pogo <- read_csv(file="pokemon.csv" )

# column names and data types

glimpse(pogo)

## Rows: 1,007
## Columns: 24
## $ pokemon_id                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ pokemon_name                 <chr> "Bulbasaur", "Ivysaur", "Venusaur", "Char…
## $ base_attack                  <dbl> 118, 151, 198, 116, 158, 223, 94, 126, 17…
## $ base_defense                 <dbl> 111, 143, 189, 93, 126, 173, 121, 155, 20…
## $ base_stamina                 <dbl> 128, 155, 190, 118, 151, 186, 127, 153, 1…
## $ type                         <chr> "['Grass', 'Poison']", "['Grass', 'Poison…
## $ rarity                       <chr> "Standard", "Standard", "Standard", "Stan…
## $ charged_moves                <chr> "['Sludge Bomb', 'Seed Bomb', 'Power Whip…
## $ fast_moves                   <chr> "['Vine Whip', 'Tackle']", "['Razor Leaf'…
## $ candy_required               <dbl> NA, 25, 100, NA, 25, 100, NA, 25, 100, NA…
## $ distance                     <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ max_cp                       <dbl> 1275, 1943, 3112, 1121, 1891, 3305, 1082,…
## $ attack_probability           <dbl> 0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0…
## $ base_capture_rate            <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -…
## $ base_flee_rate               <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -…
## $ dodge_probability            <dbl> 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15,…
## $ max_pokemon_action_frequency <dbl> 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1…
## $ min_pokemon_action_frequency <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0…
## $ found_egg                    <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, T…
## $ found_evolution              <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FAL…
## $ found_wild                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ found_research               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ found_raid                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ found_photobomb              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…

Prepare Datasets for Clustering

Now that we understand how the dataset is built, we can move on to choosing which characteristics to isolate for clustering. I am going to focus on attack, defense, stamina, and maximum combat power (CP).

# create separate copy "d1" of the "pogo" dataset
d1 <- pogo

# Prepare Data for Clustering

d3 <- d1 %>% 
  select("base_attack", "base_stamina", "base_defense", "max_cp")

d4 <- d1 %>% 
  select("pokemon_name", "base_attack", "base_defense", "base_stamina", "max_cp")

head(d4) %>% pander()

pokemon_name	base_attack	base_defense	base_stamina	max_cp
Bulbasaur	118	111	128	1275
Ivysaur	151	143	155	1943
Venusaur	198	189	190	3112
Charmander	116	93	118	1121
Charmeleon	158	126	151	1891
Charizard	223	173	186	3305

Pokemon, the Baseball of Video Games

Pokemon, like Baseball, is a stats game. Pokemon have, essentially, four main stats: Attack, Stamina, Defense, and Combat Power (CP). Attack is how much damage a Pokemon does in battle, Stamina is how long a Pokemon lasts in battle, and Defense is how much damage a Pokemon resists in battle. CP is the overall measure of strength of any Pokemon, and can be considered their power level. Knowing this, you’ll want to have Pokemon with the highest possible CP for any raid.

The dataset I pulled from has the variable max_cp as the maximum amount of CP a Pokemon can get. It should be noted that Pokemon in this dataset have their max_cp taken from their higher powered form variations such as Mega, Primal, Shadow, and so on, which shouldn’t be an issue if you already have those variations. If you don’t or aren’t sure, you can check the pokemon_name and max_cp against the same Pokemon in Pokemon GO Database, which will give you a drop-down menu of all the variations of that Pokemon.

Let’s take a peek at the summary statistics of attack, stamina, defense, and maximum combat power. This will help us understand what is and isn’t impressive for each stat.

# Summary statistics of chosen characteristics

summary( d3 [c("base_attack", 
               "base_stamina", 
               "base_defense", 
               "max_cp")])

##   base_attack     base_stamina    base_defense       max_cp    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  16  
##  1st Qu.:119.0   1st Qu.:137.0   1st Qu.:103.0   1st Qu.:1306  
##  Median :165.0   Median :167.0   Median :142.0   Median :2304  
##  Mean   :166.3   Mean   :171.1   Mean   :143.8   Mean   :2310  
##  3rd Qu.:211.0   3rd Qu.:193.0   3rd Qu.:179.0   3rd Qu.:3138  
##  Max.   :414.0   Max.   :496.0   Max.   :505.0   Max.   :9366

Starting to Cluster

How does the data look when clustered?

# Perform Cluster Analysis

# library( mclust )
set.seed( 1234 )
fit <- Mclust( d3 )
d3$cluster <- as.factor( fit$classification )

plot( fit, what = "classification" )

Good! The groups show up well, and the variables show some correlation. Now let’s plot out max CP within each group.

d3$cluster <- as.factor( paste0("GROUP-",fit$classification) )

ggplot( d3, aes( x=max_cp ) ) + 
        geom_density( alpha = 0.5, fill="blue" ) + 
        xlab( "max cp" ) + 
        facet_wrap( ~ cluster, nrow=2 ) + 
        theme_minimal()

9 groups?! For my purposes in Pokemon, that’s too many. Let’s try to pare that down with K-Means Clustering.
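Mclust() picks the number of groups on its own by comparing candidate models, which is how we ended up with 9. As a rough sketch on made-up data (the `toy` data frame below is synthetic, not the Pokemon dataset), you can read the chosen count from `fit$G` and limit the candidates with the `G` argument:

```r
library(mclust)

set.seed(1234)
# Synthetic stand-in data: three well-separated blobs in two dimensions
toy <- data.frame(
  x = c(rnorm(50, 0), rnorm(50, 6), rnorm(50, 12)),
  y = c(rnorm(50, 0), rnorm(50, 6), rnorm(50, 0))
)

# Default search: Mclust considers 1 to 9 components
fit_default <- Mclust(toy)
fit_default$G          # number of groups it settled on

# Restrict the search to at most 4 components
fit_capped <- Mclust(toy, G = 1:4)
fit_capped$G
```

Capping `G` is one way to get fewer groups while staying with mclust; this code through switches to k-means instead.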

K-Means Clustering, Begin!

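The first thing you’ll need to do is scale each variable to have a mean of 0 and a standard deviation of 1. K-means groups points by distance, so without scaling, max_cp (which runs into the thousands) would drown out the other three stats.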

# K-Means Clustering Packages
# library(factoextra)
# library(cluster)

#scale each variable to have a mean of 0 and sd of 1
d.scale <- scale(d3[,c("base_attack", "base_stamina", "base_defense", "max_cp")],
                  center=TRUE, scale=TRUE)

head(d.scale)

##      base_attack base_stamina base_defense     max_cp
## [1,]  -0.8128315   -0.8982865  -0.63095462 -0.9248926
## [2,]  -0.2571140   -0.3352733  -0.01584354 -0.3278316
## [3,]   0.5343624    0.3945586   0.86837863  0.7170251
## [4,]  -0.8465114   -1.1068099  -0.97695460 -1.0625384
## [5,]  -0.1392346   -0.4186827  -0.34262130 -0.3743095
## [6,]   0.9553605    0.3111492   0.56082310  0.8895293
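To see what scale() is doing under the hood, here is a small sketch that reproduces it by hand. It uses only the six rows printed earlier (so the numbers will not match the full-dataset output above), and the `toy` and `by_hand` names are just for illustration:

```r
# Attack and max CP for the first six Pokemon shown earlier
toy <- data.frame(
  base_attack = c(118, 151, 198, 116, 158, 223),
  max_cp      = c(1275, 1943, 3112, 1121, 1891, 3305)
)

scaled <- scale(toy, center = TRUE, scale = TRUE)

# Same thing by hand: subtract the column mean, divide by the column sd
by_hand <- sapply(toy, function(col) (col - mean(col)) / sd(col))

all.equal(as.vector(scaled), as.vector(by_hand))
round(colMeans(scaled), 10)   # column means are 0
apply(scaled, 2, sd)          # column standard deviations are 1
```

After scaling, a value of 1 means "one standard deviation above the average" for every stat, which puts attack and max CP on equal footing.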

Elbow Method with fviz_nbclust() Function

Now we can move on to figuring out the optimal number of clusters for the dataset. To do this, we can use the elbow method by plotting the data with the fviz_nbclust() function. With method = "wss", this function plots the total within-cluster sum of squares for each candidate number of clusters, which lets us spot the optimal number visually.

# library(factoextra)

fviz_nbclust(d.scale, kmeans, method = "wss")

The elbow method is a technique in cluster analysis where the bend, or elbow, of the curve marks the number of clusters you should use. It highlights the spot where diminishing returns begin: past the elbow, adding another cluster barely lowers the within-cluster sum of squares. As we saw earlier, Mclust() split my dataset into 9 groups, which is fine for our neighborhood analysis where the stakes are high. But when the stakes are as low as messing with datasets to find out more about my hobby, I only really need 3-5 groups. Thankfully, it looks like the elbow is around 3-5 clusters, so I will go with 4 clusters.
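If you are curious what goes into that elbow plot, the sketch below builds a similar curve by hand with base R. It runs on synthetic data with three obvious blobs (not the Pokemon dataset), and the `toy`, `k_values`, and `wss` names are my own:

```r
set.seed(1234)
# Synthetic data: three blobs, so the bend should show up near k = 3
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 through 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve drops steeply while each new cluster captures a real group and flattens once extra clusters only split groups that already hang together.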

K-Means Clustering with kmeans() Function

Now we can use the kmeans() function to perform clustering on the dataset. The call is built as kmeans(dataset name, centers = number of clusters you want, nstart = number of initial configurations).

We will set centers to 4 so that all the Pokemon are sorted into 4 groups, not 9. And we will set nstart to 25, which makes the algorithm try 25 random starting configurations and keep the one with the smallest total within-cluster variation.

km.pogo <- kmeans(d.scale, centers = 4, nstart = 25)

#view results
km.pogo

## K-means clustering with 4 clusters of sizes 256, 50, 340, 361
## 
## Cluster means:
##   base_attack base_stamina base_defense      max_cp
## 1   1.2063149   0.55982964    0.8231764  1.24646077
## 2   0.1227947   2.58030073    0.3178542  0.71654245
## 3   0.1224777   0.00486514    0.3095774  0.07743481
## 4  -0.9878082  -0.75896280   -0.9193412 -1.05609118
## 
## Clustering vector:
##    [1] 4 3 1 4 3 1 4 3 3 4 4 3 4 4 3 4 4 3 4 3 4 3 4 3 4 3 4 3 4 4 3 4 4 3 4 3 4
##   [38] 3 4 2 4 3 4 3 3 4 3 4 3 4 3 4 3 4 3 4 3 4 1 4 4 3 4 3 1 4 3 1 4 4 3 4 3 4
##   [75] 3 1 3 3 4 3 4 3 3 4 3 4 3 4 1 4 3 4 3 1 3 4 3 4 1 4 3 4 1 4 3 3 3 3 4 3 3
##  [112] 1 2 3 2 4 3 4 3 4 3 3 1 3 3 3 1 3 4 1 2 4 4 2 1 1 3 3 1 4 1 1 2 1 1 1 4 3
##  [149] 1 1 1 4 3 3 4 3 1 4 3 1 4 3 4 3 4 3 4 3 3 4 2 4 4 4 4 3 4 3 4 4 1 3 4 3 3
##  [186] 3 4 4 3 4 4 3 4 4 3 1 3 4 3 3 4 2 3 4 3 3 3 3 4 3 3 1 3 1 3 4 1 4 3 4 3 4
##  [223] 4 3 4 3 3 4 3 3 4 1 1 3 4 4 3 4 4 4 3 2 1 1 1 4 3 1 1 1 1 4 3 1 4 3 1 4 3
##  [260] 1 4 3 4 3 4 4 3 4 4 4 4 3 4 4 3 4 3 4 3 4 4 1 4 3 4 3 4 3 1 4 3 4 4 4 3 4
##  [297] 2 4 4 4 3 4 3 4 3 1 4 3 4 3 3 3 3 3 3 4 3 4 3 2 2 4 3 3 4 3 4 4 4 3 4 3 4
##  [334] 3 3 3 3 3 4 2 4 3 4 3 4 3 4 1 4 1 3 3 4 3 4 3 3 3 3 4 4 3 4 3 2 4 3 3 3 4
##  [371] 4 3 1 4 3 1 1 1 3 1 1 1 1 1 1 1 4 3 1 4 4 1 4 3 1 4 4 1 4 3 4 3 4 4 1 4 1
##  [408] 3 1 4 3 4 3 3 4 3 4 4 3 4 3 4 2 3 4 2 4 3 3 1 4 3 4 4 3 4 3 4 4 4 3 3 4 3
##  [445] 1 2 4 1 4 1 4 3 4 3 3 4 3 4 4 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 1
##  [482] 1 1 1 1 1 2 2 3 1 1 1 1 1 4 3 3 4 3 1 4 3 1 4 3 4 3 1 4 3 4 3 4 3 4 3 4 2
##  [519] 4 4 1 4 3 4 3 1 4 3 4 1 3 4 3 1 4 4 2 2 1 4 3 3 4 4 3 4 3 4 3 3 4 4 1 4 1
##  [556] 3 4 3 4 3 3 4 3 3 3 3 1 4 3 4 3 4 3 4 3 3 4 3 1 4 3 4 3 1 4 3 3 4 1 4 2 4
##  [593] 3 2 4 3 4 3 4 3 3 4 3 1 4 1 4 3 1 4 3 1 4 1 1 4 3 2 4 1 1 4 1 4 1 1 4 2 4
##  [630] 2 3 3 4 3 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 4 3 1 4 3 1 4 3 3 4 3 4 4 3 4 4 3
##  [667] 4 1 4 3 1 4 2 4 1 3 4 3 4 4 4 4 3 4 3 4 3 4 1 4 3 4 1 4 3 3 1 4 2 1 3 3 3
##  [704] 4 3 1 3 4 3 4 3 4 1 4 1 1 1 2 1 1 1 4 3 1 4 3 1 4 3 1 4 4 1 4 3 4 3 1 4 1
##  [741] 3 4 3 4 1 1 4 3 3 1 4 3 4 3 4 3 4 3 4 2 4 4 1 3 3 1 4 1 4 3 3 1 1 3 3 3 3
##  [778] 3 3 1 1 4 3 1 1 1 1 1 4 4 1 1 1 1 1 1 1 1 2 1 1 1 3 1 1 1 1 4 1 4 3 1 4 3
##  [815] 1 4 3 1 4 2 4 4 3 4 4 3 4 3 4 3 4 3 4 1 4 3 4 3 2 4 3 2 4 1 3 4 1 4 3 4 1
##  [852] 4 1 4 1 4 3 1 4 4 1 1 3 1 1 1 3 4 1 3 3 4 1 1 3 3 3 4 2 3 3 3 3 1 4 3 1 1
##  [889] 1 2 3 1 1 1 2 1 1 1 1 1 2 1 1 1 4 3 1 4 3 1 4 3 1 4 2 4 3 4 3 4 4 3 4 3 4
##  [926] 3 4 4 1 3 4 4 1 4 1 1 4 2 4 3 4 1 4 3 4 3 4 3 3 4 3 4 3 4 3 4 4 3 4 3 3 4
##  [963] 1 4 1 3 3 3 1 4 3 1 4 2 3 2 1 1 2 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 4 3 1 4 1
## [1000] 1 1 2 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 323.2009 236.0519 403.8299 364.8965
##  (between_SS / total_SS =  67.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

There are 1007 Pokemon in this dataset, and now they are in groups of 50 to 361 Pokemon each.
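The printout above is just a summary of the pieces listed under "Available components". As a small sketch on synthetic data (the `toy` matrix is made up for illustration), here is how to pull out the cluster sizes and rebuild the between_SS / total_SS percentage yourself:

```r
set.seed(1234)
# Synthetic, scaled data with two blobs
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
))
km <- kmeans(toy, centers = 2, nstart = 25)

km$size                    # number of points in each cluster
table(km$cluster)          # same counts, tallied from the assignments
km$betweenss / km$totss    # the between_SS / total_SS ratio

# The total sum of squares splits into within and between parts
all.equal(km$totss, km$tot.withinss + km$betweenss)
```

A higher ratio means more of the spread in the data is explained by which cluster a point belongs to.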

Visualization with fviz_cluster() Function

To get a visualization of how kmeans() partitioned the data, use fviz_cluster():

#plot results of final k-means model
fviz_cluster(km.pogo, data = d.scale)

I can certainly see that 4 groups is more than enough, and how k-means clustering can be impacted by outliers. But we have 4 distinct, easily manageable groups!

Find Group Means with aggregate()

Now that the data is clustered, we can use other functions, like aggregate(), to see the means of each group:

# use aggregate() function to find means of each cluster
# (numeric stat columns only; d3 also holds the Mclust "cluster" factor)
aggregate(d3[, c("base_attack", "base_stamina", "base_defense", "max_cp")], 
          by=list(cluster=km.pogo$cluster), 
          mean)

Right off the bat, I notice that cluster 1 has the highest maximum combat power and base attack, so that is most likely where all our heaviest hitters and legendary Pokemon are! In contrast, cluster 4 has the lowest of every characteristic, making it the group with the weakest Pokemon or those still in their non-evolved form.
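The cluster means printed by kmeans() are in scaled units, while aggregate() reports them in the original units. They are the same centers, though. Here is a sketch on synthetic data (the `toy` values are made up) that converts the scaled centers back and compares the two routes:

```r
set.seed(1234)
# Synthetic stats on very different scales
toy <- data.frame(
  base_attack = c(rnorm(40, 120, 15),   rnorm(40, 250, 15)),
  max_cp      = c(rnorm(40, 1200, 150), rnorm(40, 4500, 150))
)
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Route 1: average the original columns within each cluster
means_agg <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)

# Route 2: undo the scaling on the centers (x * sd + mean)
centers_orig <- sweep(km$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(toy_scaled, "scaled:center"), "+")

means_agg
centers_orig
```

Both tables should show the same numbers, which is a handy sanity check that the cluster labels line up with the rows you think they do.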

Add Cluster Group Assignments to Dataset

Next up, we use cbind() to add cluster group assignments back to the dataset.

#add cluster assignment to original data
d5 <- cbind(d4, 
            cluster = km.pogo$cluster)

#view final data
head(d5)

I added the cluster information back to my “d4” dataset, since that one keeps the unscaled stats from “d3” as well as the Pokemon names, which is useful! Because cbind() matches rows by position, you can add the cluster assignments to any variation of the original dataset, as long as it has the same number of observations in the same order.
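Since cbind() simply glues columns together by position, a quick length check before binding can save some confusion. This is a tiny sketch with placeholder names and made-up CP values, not rows from the real dataset:

```r
# Placeholder Pokemon with two obviously different power levels
toy <- data.frame(
  pokemon_name = c("A", "B", "C", "D"),
  max_cp       = c(500, 3200, 450, 3100)
)

set.seed(1234)
km <- kmeans(scale(toy$max_cp), centers = 2, nstart = 25)

# cbind() pairs row i with element i, so the lengths must match
stopifnot(nrow(toy) == length(km$cluster))
toy_clustered <- cbind(toy, cluster = km$cluster)

# Rows sharing a cluster label should have similar max_cp
toy_clustered[order(toy_clustered$cluster), ]
```

If the dataset had been filtered or re-sorted between clustering and binding, the check on row counts would catch the first case but not the second, so it is worth keeping the row order untouched.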

Revisit Old Plots with Cluster Info

We can revisit our maximum combat power variable plots by group, but this time with only 4 instead of 9.

ggplot( d5, aes( x=max_cp ) ) + 
        geom_density( alpha = 0.5, fill="blue" ) + 
        xlab( "max cp" ) + 
        facet_wrap( ~ cluster, nrow=2 ) + 
        theme_minimal()

Cluster 1 indeed has the highest density of higher combat power Pokemon, and group 4 has the lowest density.

Cluster Visualization

Now for some graphs that show off cluster differences and patterns!

Starting off with seeing how each cluster ranks in terms of combat power and attack:

ggplot(d5, aes(x=max_cp, 
               y=base_attack, 
               color = as.factor(cluster))) + geom_point() + labs(colour="Cluster")

We can already see that combat power was likely the main component that determined the clustering. The difference between clusters 1 and 4 is also quite apparent! And cluster 3 turns out to be the mid-range Pokemon, not too weak and not too strong. The most interesting to me is cluster 2 covering the widest range of CP, looking like the group with the most variety and outliers.

Next up, attack vs stamina:

ggplot(d5, aes(x=base_stamina, y=base_attack, color = as.factor(cluster))) + geom_point() +
  labs(colour="Cluster")

Interesting! Now I can see why group 2 was clustered in that way. Group 2 must have the Pokemon with the highest stamina.

Last one just to see if there are any other patterns, stamina vs defense:

ggplot(d5, aes(x=base_stamina, y=base_defense, color = as.factor(cluster))) + geom_point() +
  labs(colour="Cluster")

Hmmmm, the least organized of the bunch, but still has that distinctive pattern of group 4 being weaker and group 1 being stronger. It just also has a long tail of cluster 2, showing off all the higher stamina Pokemon.

From these visualizations, I came to the following conclusion for group names:

Group 1: Heaviest Hitters, the best of the best
Group 2: Bulkiest, Pokemon that will last the longest in battle
Group 3: Middle Children, not great but not terrible
Group 4: Babies, entry-level starters
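If you want those names to show up in plots and tables instead of the numbers 1 to 4, one option is to turn the cluster numbers into a labeled factor. This is only a sketch: `cluster_id` below is a made-up vector standing in for the real assignments, and the number-to-name mapping assumes the cluster numbering from this particular run.

```r
# Hypothetical cluster assignments, shaped like the vector kmeans() returns
cluster_id <- c(4, 3, 1, 2, 1, 4)

group_names <- c("Heaviest Hitters", "Bulkiest", "Middle Children", "Babies")

# levels = 1:4 ties each cluster number to the name in the same position
cluster_label <- factor(cluster_id, levels = 1:4, labels = group_names)

cluster_label
table(cluster_label)
```

Cluster numbers are arbitrary and can change with a different seed, so it is worth rechecking the group means before reusing a mapping like this.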

Wrapping it Up

To wrap it all up, let’s see what the top Pokemon are of each group. To figure that out, you just need to add all their characteristics together to make a “Total” column, then work with some classic dplyr functions to arrange the top 20 Pokemon of each cluster in descending order:

top_pokemon <- d5 %>%
  select(pokemon_name, base_attack, base_defense, base_stamina, max_cp, cluster) %>% 
  mutate(Total = base_attack + base_defense + base_stamina + max_cp) %>%
  group_by(cluster) %>%
  slice_max(Total, n = 20) %>%   # top 20 by Total within each cluster
  arrange(cluster, desc(Total))

print(top_pokemon, n = 200)

## # A tibble: 80 × 7
## # Groups:   cluster [4]
##    pokemon_name base_attack base_defense base_stamina max_cp cluster Total
##    <chr>              <dbl>        <dbl>        <dbl>  <dbl>   <int> <dbl>
##  1 Zacian               332          240          192   5696       1  6460
##  2 Palafin              322          196          225   5418       1  6161
##  3 Kyurem               310          183          245   5268       1  6006
##  4 Slaking              290          166          284   5069       1  5809
##  5 Regigigas            287          210          221   4972       1  5690
##  6 Calyrex              268          246          205   4845       1  5564
##  7 Roaring Moon         280          196          233   4821       1  5530
##  8 Zamazenta            250          292          192   4773       1  5507
##  9 Kyogre               270          228          205   4708       1  5411
## 10 Groudon              270          228          205   4708       1  5411
## 11 Necrozma             277          220          200   4689       1  5386
## 12 Solgaleo             255          191          264   4625       1  5335
## 13 Lunala               255          191          264   4625       1  5335
## 14 Great Tusk           249          209          251   4604       1  5313
## 15 Dialga               275          211          205   4620       1  5311
## 16 Reshiram             275          211          205   4620       1  5311
## 17 Zekrom               275          211          205   4620       1  5311
## 18 Arceus               238          238          237   4564       1  5277
## 19 Palkia               280          215          189   4565       1  5249
## 20 Meloetta             250          225          225   4544       1  5244
## 21 Eternatus            251          505          452   9366       2 10574
## 22 Iron Hands           245          177          319   4704       2  5445
## 23 Ursaluna             243          181          277   4410       2  5111
## 24 Zygarde              184          207          389   4258       2  5038
## 25 Ting-Lu              194          203          321   4041       2  4759
## 26 Giratina             187          225          284   3866       2  4562
## 27 Snorlax              190          169          330   3690       2  4379
## 28 Cetitan              208          123          347   3561       2  4239
## 29 Vaporeon             205          161          277   3563       2  4206
## 30 Bewear               226          141          260   3566       2  4193
## 31 Regidrago            202          101          400   3402       2  4105
## 32 Dondozo              176          178          312   3428       2  4094
## 33 Guzzlord             188           99          440   3303       2  4030
## 34 Copperajah           226          126          263   3409       2  4024
## 35 Blissey              129          169          496   3155       2  3949
## 36 Cresselia            152          258          260   3269       2  3939
## 37 Farigiraf            209          136          260   3261       2  3866
## 38 Hariyama             209          114          302   3236       2  3861
## 39 Aurorus              186          163          265   3206       2  3820
## 40 Braviary             213          137          242   3219       2  3811
## 41 Flygon               205          168          190   3044       3  3607
## 42 Durant               217          188          151   3043       3  3599
## 43 Crobat               194          178          198   3027       3  3597
## 44 Kingdra              194          194          181   3022       3  3591
## 45 Greninja             223          152          176   3037       3  3588
## 46 Klinklang            199          214          155   3017       3  3585
## 47 Carracosta           192          197          179   2999       3  3567
## 48 Dracozolt            195          165          207   2999       3  3566
## 49 Houndoom             224          144          181   3014       3  3563
## 50 Rabsca               201          178          181   3001       3  3561
## 51 Tauros               198          183          181   2998       3  3560
## 52 Pawmot               232          141          172   3014       3  3559
## 53 Breloom              241          144          155   3007       3  3547
## 54 Mismagius            211          187          155   2992       3  3545
## 55 Poliwrath            182          184          207   2958       3  3531
## 56 Toxtricity           224          140          181   2976       3  3521
## 57 Heliolisk            219          168          158   2974       3  3519
## 58 Rotom                204          219          137   2951       3  3511
## 59 Starmie              210          184          155   2957       3  3506
## 60 Leavanny             205          165          181   2952       3  3503
## 61 Weepinbell           172           92          163   1844       4  2271
## 62 Monferno             158          105          162   1801       4  2226
## 63 Murkrow              175           87          155   1787       4  2204
## 64 Krabby               181          124          102   1785       4  2192
## 65 Flaaffy              145          109          172   1740       4  2166
## 66 Anorith              176          100          128   1750       4  2154
## 67 Rufflet              150           97          172   1706       4  2125
## 68 Pancham              145          107          167   1703       4  2122
## 69 Larvesta             156          107          146   1712       4  2121
## 70 Luxio                159           95          155   1700       4  2109
## 71 Sableye              141          136          137   1688       4  2102
## 72 Trumbeak             159          100          146   1691       4  2096
## 73 Fletchinder          145          110          158   1681       4  2094
## 74 Yanma                154           94          163   1682       4  2093
## 75 Darumaka             153           86          172   1649       4  2060
## 76 Tranquill            144          107          158   1650       4  2059
## 77 Morgrem              145          102          163   1649       4  2059
## 78 Dolliv               137          131          141   1639       4  2048
## 79 Litleo               139          112          158   1631       4  2040
## 80 Poliwhirl            130          123          163   1623       4  2039

Now we can see which Pokemon are in which groups. It tracks that group 1 has all the legendary Pokemon, considering they are the strongest. Group 4 has the weaker, less rare Pokemon. Group 2 is still the most interesting group to me, as the Pokemon that have the highest stamina are not always the most obvious. Putting Eternatus, one of the Pokemon “gods” in the same category as Snorlax, a more commonly found Pokemon known for almost always being asleep, is quite entertaining.

I didn’t have as clear an end goal for this data as in my last code through, which did help me finally get a Mega Rayquaza. However, this did help me come to a better understanding of how clustering works outside of Mclust() and gave me a better appreciation for all the data that goes into one of my favorite games.

Further Resources

Learn more about k-means clustering and the Pokemon GO dataset with the following:

Shreya Sur965’s “Gotta Analyze ’Em All: The Ultimate Pokémon GO Dataset”

Clustering Pokemon

K-Means Clustering in R: Step-by-Step Example

Works Cited

This code through references and cites the following sources:

Pokemon GO Database

K-Clustering Pokémon GO Data

Carrie Hawke

02 December 2025