K-Means Clustering
1 Intro
1.1 What We’ll Do
We will perform a clustering analysis using the K-means method. We will also see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).
1.2 Datasets
The dataset was acquired from Kaggle.
1.3 Library and Setup
Load the required libraries.
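The setup chunk is not shown in the rendered document; a minimal sketch of the libraries used throughout this analysis (the exact set is an assumption inferred from the functions called later):
library(tidyverse)   # data wrangling (dplyr) and plotting (ggplot2)
library(cowplot)     # plot_grid() for arranging plots
library(ggforce)     # geom_mark_hull() for cluster hulls
library(scales)      # number_format() for axis labels
library(factoextra)  # fviz_nbclust(), fviz_pca_*(), fviz_cluster()
library(FactoMineR)  # PCA()
library(plotly)      # interactive 3-D plots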
2 Import Data
The dataset is a list of all Pokemon from the Pokemon franchise games, from generation 1 up to generation 7.
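The import chunk is likewise not shown. Assuming the Kaggle CSV is stored locally (the file name here is a placeholder), it would look something like:
# Read the raw data; stringsAsFactors = TRUE reproduces the <fct> columns
# in the glimpse() output below
data <- read.csv("pokemon.csv", stringsAsFactors = TRUE)
glimpse(data)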
Observations: 801
Variables: 41
$ abilities <fct> "['Overgrow', 'Chlorophyll']", "['Overgrow',...
$ against_bug <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 0.25, 1.00, 1....
$ against_dark <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_dragon <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_electric <dbl> 0.5, 0.5, 0.5, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0,...
$ against_fairy <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0,...
$ against_fight <dbl> 0.50, 0.50, 0.50, 1.00, 1.00, 0.50, 1.00, 1....
$ against_fire <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...
$ against_flying <dbl> 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
$ against_ghost <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_grass <dbl> 0.25, 0.25, 0.25, 0.50, 0.50, 0.25, 2.00, 2....
$ against_ground <dbl> 1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0,...
$ against_ice <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 1.0, 0.5, 0.5, 0.5,...
$ against_normal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ against_poison <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
$ against_psychic <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,...
$ against_rock <dbl> 1, 1, 1, 2, 2, 4, 1, 1, 1, 2, 2, 4, 2, 2, 2,...
$ against_steel <dbl> 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...
$ against_water <dbl> 0.5, 0.5, 0.5, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5,...
$ attack <int> 49, 62, 100, 52, 64, 104, 48, 63, 103, 30, 2...
$ base_egg_steps <int> 5120, 5120, 5120, 5120, 5120, 5120, 5120, 51...
$ base_happiness <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, ...
$ base_total <int> 318, 405, 625, 309, 405, 634, 314, 405, 630,...
$ capture_rate <fct> 45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 120...
$ classfication <fct> Seed Pokémon, Seed Pokémon, Seed Pokémon,...
$ defense <int> 49, 63, 123, 43, 58, 78, 65, 80, 120, 35, 55...
$ experience_growth <int> 1059860, 1059860, 1059860, 1059860, 1059860,...
$ height_m <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6,...
$ hp <int> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, ...
$ japanese_name <fct> Fushigidaneフシギダネ, Fushigisouフシ...
$ name <fct> Bulbasaur, Ivysaur, Venusaur, Charmander, Ch...
$ percentage_male <dbl> 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88...
$ pokedex_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ sp_attack <int> 65, 80, 122, 60, 80, 159, 50, 65, 135, 20, 2...
$ sp_defense <int> 65, 80, 120, 50, 65, 115, 64, 80, 115, 20, 2...
$ speed <int> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30,...
$ type1 <fct> grass, grass, grass, fire, fire, fire, water...
$ type2 <fct> poison, poison, poison, , , flying, , , , , ...
$ weight_kg <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5...
$ generation <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ is_legendary <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Variable explanation:
- name : The English name of the Pokemon
- japanese_name : The Original Japanese name of the Pokemon
- pokedex_number : The entry number of the Pokemon in the National Pokedex
- percentage_male : The percentage of the species that are male. Blank if the Pokemon is genderless.
- type1 : The Primary Type of the Pokemon
- type2 : The Secondary Type of the Pokemon
- classification : The Classification of the Pokemon as described by the Sun and Moon Pokedex
- height_m : Height of the Pokemon in metres
- weight_kg : The Weight of the Pokemon in kilograms
- capture_rate : Capture Rate of the Pokemon
- base_egg_steps : The number of steps required to hatch an egg of the Pokemon
- abilities : A stringified list of abilities that the Pokemon is capable of having
- experience_growth : The Experience Growth of the Pokemon
- base_happiness : Base Happiness of the Pokemon
- against_? : Eighteen features that denote the amount of damage taken against an attack of a particular type
- hp : The Base HP of the Pokemon
- attack : The Base Attack of the Pokemon
- defense : The Base Defense of the Pokemon
- sp_attack : The Base Special Attack of the Pokemon
- sp_defense : The Base Special Defense of the Pokemon
- base_total : The sum of hp, attack, defense, sp_attack, sp_defense, and speed
- speed : The Base Speed of the Pokemon
- generation : The numbered generation in which the Pokemon was first introduced
- is_legendary : Denotes if the Pokemon is legendary.
3 Data Preprocessing
Since is_legendary is a binary (0/1) indicator, we will convert it into a factor.
capture_rate should be numeric, but it was read in as a factor, so we check its levels first.
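The conversion and inspection code is not shown; a sketch consistent with the output below:
# Recode the 0/1 flag into a labeled factor, as used in the plots later
data <- data %>%
    mutate(is_legendary = factor(is_legendary, levels = c(0, 1),
                                 labels = c("no", "yes")))

# Inspect the levels of capture_rate
levels(data$capture_rate)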
[1] "100" "120"
[3] "125" "127"
[5] "130" "140"
[7] "145" "15"
[9] "150" "155"
[11] "160" "170"
[13] "180" "190"
[15] "200" "205"
[17] "220" "225"
[19] "235" "25"
[21] "255" "3"
[23] "30" "30 (Meteorite)255 (Core)"
[25] "35" "45"
[27] "50" "55"
[29] "60" "65"
[31] "70" "75"
[33] "80" "90"
It turns out there is a capture_rate entry with a special string containing two numeric values, depending on the Pokemon's condition: "30 (Meteorite)255 (Core)". We could force it to NA by coercing the column to numeric, or we can choose one of the two values. I prefer to keep the observation and choose the lower capture rate. Meanwhile, generation is not a genuine numeric value, but rather a factor whose number indicates the generation in which the Pokemon was introduced by the game.
data <- data %>% mutate(capture_rate = if_else(capture_rate == "30 (Meteorite)255 (Core)",
    30, as.numeric(as.character(capture_rate))), generation = as.factor(generation))
We will check whether there are any missing values in the data.
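The check itself is not shown; the output below is consistent with:
colSums(is.na(data))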
abilities against_bug against_dark against_dragon
0 0 0 0
against_electric against_fairy against_fight against_fire
0 0 0 0
against_flying against_ghost against_grass against_ground
0 0 0 0
against_ice against_normal against_poison against_psychic
0 0 0 0
against_rock against_steel against_water attack
0 0 0 0
base_egg_steps base_happiness base_total capture_rate
0 0 0 0
classfication defense experience_growth height_m
0 0 0 20
hp japanese_name name percentage_male
0 0 0 98
pokedex_number sp_attack sp_defense speed
0 0 0 0
type1 type2 weight_kg generation
0 0 20 0
is_legendary
0
The percentage_male variable has a lot of missing values (genderless Pokemon have no value), so we will remove the column in order to preserve our observations. We will also not use pokedex_number, since it is a unique ID for each Pokemon.
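The construction of the numeric feature matrix df used for clustering below is not shown in the original; a sketch under the stated assumptions:
# Drop the ID and the mostly-missing percentage_male, remove the remaining
# rows with NA (height_m / weight_kg), and keep only numeric features
df <- data %>%
    select(-pokedex_number, -percentage_male) %>%
    na.omit() %>%
    select_if(is.numeric)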
4 Exploratory Data Analysis
4.1 Possibility for Clustering
One of the easiest ways to segment the Pokemon is to look at the legendary status. Here, we check whether a legendary Pokemon requires more base_egg_steps to hatch and has a higher base_total stat.
ggplot(data, aes(base_egg_steps, base_total, color = is_legendary, size = capture_rate)) +
geom_point(alpha = 0.5) + theme_minimal()
There is a clear distinction between legendary and common Pokemon when looking at the base_egg_steps and base_total stats.
Let's look at another metric. We want to see whether there is a clear difference in the mean base_total across combinations of type1 and type2 for non-legendary Pokemon.
data %>% filter(is_legendary == "no") %>% group_by(type1, type2) %>% summarise(base = mean(base_total)) %>%
ggplot(aes(type1, type2, fill = base)) + scale_fill_viridis_c(option = "B") +
geom_tile(color = "black") + theme_minimal()
Pokemon with only one type (no type2) are generally weaker than Pokemon with two types. Normal-type Pokemon are also weaker, regardless of the combination.
Lastly, we want to check some stats between legendary and non-legendary Pokemon.
p1 <- ggplot(data, aes(is_legendary, height_m, fill = is_legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Height")
p2 <- ggplot(data, aes(is_legendary, weight_kg, fill = is_legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Weight")
p3 <- ggplot(data, aes(is_legendary, speed, fill = is_legendary)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Speed")
p4 <- ggplot(data, aes(is_legendary, hp, fill = is_legendary)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Health Point (HP)")
plot_grid(p1, p2, p3, p4)
Legendary Pokemon are slightly better in all aspects.
Based on our exploratory process, we can segment or cluster Pokemon based on specific traits, such as legendary status. To find more interesting, undiscovered patterns in the data, we will use the K-means clustering method.
4.2 Possibility for Principal Component Analysis (PCA)
We want to see whether there are high correlations between the numeric variables. Strong correlation among variables implies that we can reduce the dimensionality, i.e. the number of features, using Principal Component Analysis (PCA).
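The correlation plot code is not shown; one way to inspect the pairwise correlations (the use of reshape2 here is an assumption):
library(reshape2)  # melt() turns the correlation matrix into long format

cor(df) %>% melt() %>%
    ggplot(aes(Var1, Var2, fill = value)) + geom_tile() +
    scale_fill_viridis_c(option = "B") + theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))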
Some features have high correlations, such as base_total with sp_attack, or against_ice with against_ground. Based on this result, we will try to reduce the dimensionality using PCA.
5 Clustering
5.1 Finding Optimal Number of Clusters
Before we do the cluster analysis, we first need to determine the optimal number of clusters. In K-means we seek to minimize the total within-cluster sum of squares (so that the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use three methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters by majority vote.
5.1.1 Elbow Method
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to choose the number of clusters around the "bend of the elbow", where the total within-cluster sum of squares starts to flatten as the number of clusters increases.
fviz_nbclust(df, kmeans, method = "wss", k.max = 15) + scale_y_continuous(labels = number_format(scale = 10^(-9),
big.mark = ",", suffix = " bil.")) + labs(subtitle = "Elbow method")
Using the elbow method, 4 clusters look good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. This method alone may not be enough, since the optimal number of clusters remains vague.
5.1.2 Silhouette Method
The silhouette method computes the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number with the highest silhouette score (the peak).
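The chunk is not shown in the original; a sketch analogous to the elbow plot above:
fviz_nbclust(df, kmeans, method = "silhouette", k.max = 15) +
    labs(subtitle = "Silhouette method")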
Based on the silhouette method, the number of clusters with the maximum score is considered optimal. The graph shows that the optimal number of clusters is 4.
5.1.3 Gap Statistic
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic.
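Again the chunk is not shown; a sketch (the nboot value is an assumption):
fviz_nbclust(df, kmeans, method = "gap_stat", k.max = 15, nboot = 50) +
    labs(subtitle = "Gap statistic method")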
Based on the gap statistic method, the optimal k is one.
Two out of three methods suggest that k = 4 is the optimal number of clusters. If we used k = 1, we would not be able to analyze differences between clusters or segments, because all of the data would belong to one cluster with the same characteristics. This can also be a warning that a clustering method may not be very useful here, so take the results with a grain of salt.
5.2 K-Means Clustering
Here is the algorithm behind K-Means Clustering:
- Randomly assign a number, from 1 to K, to each of the observations. These serve as the initial cluster assignments.
- Iterate until the cluster assignments stop changing: for each of the K clusters, compute the cluster centroid (the vector of the p feature means for the observations in the kth cluster), then assign each observation to the cluster whose centroid is closest (using Euclidean distance or any other distance measure).
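The fitting chunk is not shown; a sketch consistent with the output below (the seed and nstart values are assumptions):
set.seed(100)                               # assumed seed, for reproducibility
km <- kmeans(df, centers = 4, nstart = 25)  # k = 4, with 25 random starts
km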
K-means clustering with 4 clusters of sizes 78, 184, 320, 199
Cluster means:
against_bug against_dark against_dragon against_electric against_fairy
1 0.8910256 1.096154 0.7820513 1.243590 0.974359
2 0.9904891 1.057065 1.0679348 1.061141 1.245924
3 0.9601563 1.067969 0.9406250 1.092188 1.035937
against_fight against_fire against_flying against_ghost against_grass
1 1.048077 1.016026 1.038462 0.9487179 1.1474359
2 1.160326 1.019022 1.074728 1.0353261 1.0353261
3 1.035937 1.216406 1.197656 0.9875000 1.0445312
against_ground against_ice against_normal against_poison against_psychic
1 1.044872 1.038462 0.8076923 1.0256410 0.9487179
2 1.154891 1.258152 0.8926630 0.9076087 0.9116848
3 1.110156 1.132031 0.8757812 0.9523438 1.0203125
against_rock against_steel against_water attack base_egg_steps
1 1.256410 1.2115385 1.025641 63.43590 5120.000
2 1.197011 1.0108696 1.058424 94.66848 13753.043
3 1.281250 0.9812500 1.044531 73.19375 5196.000
base_happiness base_total capture_rate defense experience_growth
1 72.62821 393.9487 23.12821 67.87179 743589.7
2 50.40761 506.0489 22.02717 86.37500 1279673.9
3 69.53125 404.4094 21.82187 70.97500 1000000.0
height_m hp sp_attack sp_defense speed weight_kg
1 0.9269231 69.60256 60.84615 74.33333 57.85897 29.45769
2 1.7581522 82.15761 84.51630 82.15217 76.17935 127.14239
3 1.0165625 64.50000 66.01875 66.92500 62.79688 48.14000
[ reached getOption("max.print") -- omitted 1 row ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 4 4 4
21 22 23 24 25 29 30 31 32 33 34 35 36 39 40 41 42 43
3 3 3 3 3 4 4 4 4 4 4 1 1 1 1 3 3 4
44 45 46 47 48 49 54 55 56 57 58 59 60 61 62 63 64 65
4 4 3 3 3 3 3 3 3 3 2 2 4 4 4 4 4 4
66 67 68 69 70 71 72 73 77 78 79 80 81 82 83 84 85 86
4 4 4 4 4 4 2 2 3 3 3 3 3 3 3 3 3 3
87 90 91 92 93 94 95 96 97 98 99 100 101 102 104 106 107 108
3 2 2 4 4 4 3 3 3 3 3 3 3 2 3 3 3 3
109 110 111 112 113 114 115 116 117 118 119
3 3 2 2 1 3 3 3 3 3 3
[ reached getOption("max.print") -- omitted 680 entries ]
Within cluster sum of squares by cluster:
[1] 632087993618 1988052753373 458666182 2050047364
(between_SS / total_SS = 87.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
The ratio of the between-cluster sum of squares to the total sum of squares is 87.2%, meaning that most of the sum of squares comes from the distance between clusters. Thus, we can conclude that our data is properly clustered, since observations in the same cluster have little distance or variation between them. Note, however, that the number of members in each cluster is not equally distributed.
We have already obtained the cluster of each observation. Let's join the cluster vector to the dataset.
df_clust <- data %>% select(-percentage_male) %>% na.omit() %>% bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:40)
df_clust
5.3 Cluster Analysis
We will analyze the characteristics of each cluster and see whether there are differences or specific traits in each one. Since we have a lot of features (32), we might not be able to explore all of them.
df_clust %>% ggplot(aes(experience_growth, base_total, color = cluster)) +
    geom_point(alpha = 0.5) + geom_mark_hull() + scale_color_brewer(palette = "Set1") +
    theme_minimal() + theme(legend.position = "top")
There is a clear distinction between the clusters based on their experience_growth. Cluster 1 has the lowest experience growth, while Cluster 2 has the highest. Cluster 3 and Cluster 4 have quite similar experience growth but are still perfectly separable from each other. There is no apparent difference in base_total between clusters, although Cluster 4 has a high variance in the base total stats.
We can check each cluster's centroid:
df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% mutate_if(is.numeric,
.funs = "round", digits = 2) %>% select(1:19)
Some interesting findings that we can take from the centroids:
Based on their resistances:
- Cluster 1 has greater resistance against bug, dragon, ghost, and normal attack moves, meaning they take less damage from those attacks. However, they are weak against fairy and steel attacks compared to other clusters, so watch out when your enemy uses a Pokemon with one of those two types of moves.
- Cluster 2 has weaker resistance against fairy, ground, ice, ghost, and fighting attacks than the other clusters. They are strong against fire, psychic, and rock.
- Cluster 3 has no specific resistance that is better than the other clusters'. However, they are weak against fire, like Cluster 1.
- Cluster 4 has weak resistance against bug, flying, ghost, and ice. However, they are strong against grass and steel.
What are we supposed to do with this information? We can use it to choose which Pokemon to field based on the types of our opponent's moves. For example, if our opponent's Pokemon moves consist mostly of dragon attacks, we may pick one of our Pokemon from Cluster 1, since they are stronger against dragon attacks. Or we can place a Pokemon from Cluster 3 as our first Pokemon, to take damage and observe what our opponent's move types are, and then switch to the Pokemon that best counters the opponent.
However, choosing which Pokemon to use based only on our opponent's moves might not be enough. We also need to look at our own Pokemon's battle stats.
Based on the Pokemon's base battle stats:
df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% select(cluster,
attack, defense, sp_attack, sp_defense, hp, speed, base_total) %>% mutate_if(is.numeric,
.funs = "round", digits = 2)
A Pokemon's base stats are indicated by attack, defense, sp_attack, sp_defense, hp, speed, and base_total.
- Cluster 1 has the lowest overall base_total stats. They are particularly the weakest in terms of attack, speed, and sp_attack, and have no particular stat that is better than the other clusters': a group of Pokemon with low attack power.
- Cluster 2 has the highest overall base_total and is the best in all stats.
- Cluster 3 is the balanced group, with no particular stat worse than the other clusters', even though their base_total is the second worst.
- Cluster 4 has the worst defense, sp_defense, and hp: a group of Pokemon with low defensive power, the opposite of Cluster 1.
Based on the other numeric stats:
df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% select(cluster,
base_egg_steps, base_happiness, experience_growth, weight_kg, height_m,
capture_rate) %>% mutate_if(is.numeric, .funs = "round", digits = 2)
- Cluster 1 is probably the smallest in terms of weight and height. They have the highest base_happiness and are easy to catch (high capture_rate) and to hatch from an egg (low base_egg_steps). They are also the easiest to bring to level 100, indicated by their low experience_growth.
- Cluster 2 is the hardest to hatch from an egg and starts with less happiness than the other clusters. It also consists of larger Pokemon (high weight and height) that are the hardest to bring to level 100.
- Cluster 3 and Cluster 4 sit in the middle between Cluster 1 and Cluster 2, with Cluster 3 Pokemon being slightly bigger in terms of weight and height.
Based on the legendary status:
How are the legendary Pokemon distributed across the clusters?
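The tabulation code is not shown; a simple cross-table would produce it:
table(df_clust$cluster, df_clust$is_legendary)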
Cluster 1 and Cluster 3 have no legendary Pokemon in them, while Cluster 2 has the most legendary Pokemon. That's why Cluster 2 is better than the others in terms of the battle stats we examined previously.
6 PCA
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
6.1 Dimensionality Reduction
Here we will run PCA on the df dataset. We will look at the eigenvalues and the percentage of variance explained by each dimension. The eigenvalues measure the amount of variation retained by each principal component; they are large for the first PCs and small for the subsequent ones. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
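The code below uses a data frame df2 whose construction is not shown in the original. Presumably it keeps name, the 31 numeric features, and a legendary factor as the last column (so that column 32 after dropping name is the qualitative supplementary variable); a sketch under those assumptions:
# Assumed construction of df2 (requires dplyr >= 1.0 for where())
df2 <- data %>%
    select(-pokedex_number, -percentage_male) %>%
    na.omit() %>%
    rename(legendary = is_legendary) %>%
    select(name, where(is.numeric), legendary)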
poke_pca <- PCA(df2 %>% select(-name), scale.unit = T, ncp = 31, graph = F,
quali.sup = 32)
summary(poke_pca)
Call:
PCA(X = df2 %>% select(-name), scale.unit = T, ncp = 31, quali.sup = 32,
graph = F)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
Variance 5.412 3.528 2.570 2.173 1.920 1.765
% of var. 17.457 11.382 8.290 7.009 6.195 5.694
Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
Variance 1.426 1.330 1.246 1.162 1.003 0.794
% of var. 4.601 4.292 4.018 3.748 3.234 2.562
Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
Variance 0.751 0.733 0.629 0.616 0.584 0.515
% of var. 2.421 2.363 2.029 1.987 1.885 1.661
Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
Variance 0.453 0.398 0.329 0.316 0.310 0.262
% of var. 1.460 1.282 1.060 1.020 1.001 0.845
Dim.25 Dim.26 Dim.27 Dim.28 Dim.29 Dim.30
Variance 0.225 0.190 0.145 0.088 0.069 0.058
% of var. 0.727 0.613 0.466 0.284 0.223 0.188
Dim.31
Variance 0.000
% of var. 0.000
[ reached getOption("max.print") -- omitted 1 row ]
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2
1 | 4.218 | -2.274 0.122 0.291 | 1.373 0.068 0.106 |
2 | 3.819 | -1.131 0.030 0.088 | 1.657 0.100 0.188 |
3 | 5.108 | 2.051 0.100 0.161 | 2.214 0.178 0.188 |
4 | 4.029 | -1.716 0.070 0.181 | -1.778 0.115 0.195 |
5 | 3.504 | -0.395 0.004 0.013 | -1.451 0.076 0.171 |
Dim.3 ctr cos2
1 -0.975 0.047 0.053 |
2 -1.070 0.057 0.078 |
3 -1.334 0.089 0.068 |
4 -0.341 0.006 0.007 |
5 -0.442 0.010 0.016 |
[ reached getOption("max.print") -- omitted 5 rows ]
Variables (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
against_bug | -0.025 0.011 0.001 | 0.312 2.756 0.097 | 0.061
against_dark | 0.131 0.319 0.017 | -0.177 0.884 0.031 | -0.733
against_dragon | 0.173 0.551 0.030 | 0.257 1.874 0.066 | 0.193
against_electric | -0.080 0.119 0.006 | -0.031 0.028 0.001 | -0.112
against_fairy | 0.144 0.385 0.021 | 0.392 4.361 0.154 | 0.502
ctr cos2
against_bug 0.145 0.004 |
against_dark 20.880 0.537 |
against_dragon 1.443 0.037 |
against_electric 0.486 0.012 |
against_fairy 9.797 0.252 |
[ reached getOption("max.print") -- omitted 5 rows ]
Supplementary categories
Dist Dim.1 cos2 v.test Dim.2 cos2
no | 0.468 | -0.413 0.779 -15.926 | -0.098 0.044
yes | 4.828 | 4.261 0.779 15.926 | 1.010 0.044
v.test Dim.3 cos2 v.test
no -4.674 | 0.057 0.015 3.183 |
yes 4.674 | -0.587 0.015 -3.183 |
Let's visualize the percentage of variance captured by each dimension.
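The scree plot chunk is not shown; factoextra's fviz_eig() would produce it:
fviz_eig(poke_pca, ncp = 15, addlabels = TRUE)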
50% of the variance can be explained using only the first 5 dimensions, with the first dimension alone explaining 17.5% of the total variance.
We can keep around 80% of the information in our data using only 13 dimensions (thus a 58% dimensionality reduction). This means we can reduce the number of features in our dataset from 31 to just 13 numeric features.
We can extract the values of PC1 to PC13 for all of the observations and put them into a new data frame. This data frame can later be analyzed with supervised classification techniques or used for other purposes.
df_pca <- data.frame(poke_pca$ind$coord[, 1:13]) %>% bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:13)
df_pca
6.2 Individual and Variable Factor Map
6.2.1 Individual Observations Map
cos2, the squared cosine, shows the importance of a principal component for a given observation (a vector of the original variables). The value of cos2 can help find the components that are important for interpreting observations.
The individual observations map shows where each observation is positioned in terms of PC1 and PC2. Using only the first 2 PCs, we can see that there are a lot of outliers in our data with high cos2 on PC1. Further analysis can be done to check them.
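The map itself comes from a chunk that is not shown; a sketch of how it could be drawn (habillage relies on the assumed legendary column of df2):
fviz_pca_ind(poke_pca, geom = "point", habillage = "legendary",
    addEllipses = TRUE, alpha.ind = 0.5)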
From the plot, we can see that Pokemon with legendary status have more varied PC values, indicated by the bigger ellipse, and higher PC1 scores compared to the non-legendary ones. However, it is clear that our data cannot be represented with only 2 dimensions, since they accommodate less than 30% of the variance.
Pokemon numbers 384 and 401 are quite extreme outliers; let's look at number 384's data.
df2 %>% mutate(rn = row_number()) %>% select(name, legendary, c(1:31, 34)) %>%
filter(rownames(df2) == 384)
df2 %>% filter(legendary == "yes") %>% select(name, 19:31) %>% arrange(desc(base_total)) %>%
top_n(5, wt = base_total)
The Pokemon with index number 384 is Rayquaza, the second best legendary Pokemon in terms of base_total stats, with a higher attack and a bigger body than Mewtwo.
6.2.2 Variable Factor Map
If the observations are represented by their projections, the variables are represented by their correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations. The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components. The closer to the center of the plot a variable is, the less important it is for the first two components.
fviz_pca_var(poke_pca, select.var = list(contrib = 31), col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
The plot above shows that the variables are located inside the circle, meaning that we would need more than two components to represent our data perfectly. The distance between a variable and the origin measures the quality of the variable on the factor map, while the color indicates its contribution, expressed as the percentage of the variability it accounts for in a given principal component. Variables correlated with PC1 and PC2 are the most important in explaining the variability in the data set. Variables that are not correlated with any PC, or only with the last dimensions, have low contributions and might be removed to simplify the overall analysis.
We can also check the quality of representation, or cos2, of each variable. A high cos2 indicates a good representation of the variable on the principal component; in that case, the variable is positioned close to the circumference of the correlation circle. A low cos2 indicates that the variable is not well represented by the PCs; in that case, the variable is close to the center of the circle.
fviz_cos2(poke_pca, choice = "var", fill = "cos2") + scale_fill_viridis_c(option = "B") +
theme(legend.position = "top")
Variables that contribute strongly to PC2 are against_flying, against_poison, against_ice, and against_ground, while the rest of the first 13 variables contribute more toward PC1, especially base_total, which has the highest contribution to PC1. against_ice has the lowest correlation with the principal components, while against_ground is negatively correlated with PC2.
We may consider removing the against_ice variable, since it contributes little to PC1.
6.3 Clustering with PCA
PCA can also be integrated with the results of the K-means clustering to help visualize our data in fewer dimensions than the original features.
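The projection chunk is not shown; factoextra can overlay the K-means result on the first two PCs:
fviz_cluster(km, data = df, geom = "point", ellipse.type = "convex") +
    theme_minimal()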
However, the same problem as with the individual observations map occurs: we have too few dimensions to represent our data. In the visualization above, the clusters look like they intersect because two dimensions are not enough to separate them.
We can add one more dimension using plotly to see whether our clusters are still clumped together.
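The interactive chunk is not shown; a sketch using the PC scores stored in df_pca:
plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster,
    colors = "Set1", type = "scatter3d", mode = "markers",
    marker = list(size = 3, opacity = 0.6))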
7 Conclusion
We can draw some conclusions about our dataset based on the preceding cluster and principal component analysis:
- We can separate our data into at least 4 clusters based on all of the numerical features, with more than 87% of the total sum of squares coming from the distance between clusters.
- Cluster 2 has the unique trait of containing most (if not all) of the legendary Pokemon, which makes it the best overall in base_total battle stats.
- We can reduce the dimensionality from 31 features to just 13 dimensions and still retain more than 80% of the variance using PCA. This dimensionality reduction can be useful if we feed the principal components into machine learning applications.
- However, as we have seen, the dimensionality reduction is not enough for us to visualize the clustering of our data, as indicated by the overlapping clusters when we use only the first 2 dimensions. Perhaps the result of the gap statistic method is true, and there is only 1 big cluster.