About the data

The dataset, Pokemon with Stats, records each Pokémon's types, base stats, generation, and legendary status.
Columns:

number: Unique identifier for each Pokémon.
name: The name of the Pokémon.
type1: Primary elemental type of the Pokémon (e.g., Water, Fire, Grass).
type2: Secondary elemental type (optional, some Pokémon may not have a second type).
total: The sum of all base stats of the Pokémon.
hp: Hit points, or health, which measures how much damage a Pokémon can sustain.
attack: Physical attack power of the Pokémon.
defense: Physical defense of the Pokémon.
sp_attack: Special attack power of the Pokémon.
sp_defense: Special defense of the Pokémon.
speed: How fast the Pokémon can act in a battle.
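generation: Which generation the Pokémon belongs to. The source describes a range of 1 to 6, but the file loaded below contains values from 0 to 8 (see the summary output).
legendary: Whether the Pokémon is legendary. The file loaded below stores this as a logical value (TRUE or FALSE) rather than 1 or 0.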
There are 13 columns in total. The source describes roughly 800 rows, but the file loaded below has 1,072. Since number only ranges from 1 to 898, this suggests that some numbers appear on more than one row, so a row does not necessarily correspond to a distinct number.

Data Source: The dataset is sourced from Data World (https://data.world/data-society/pokemon-with-stats).
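
Two of the column descriptions above are easy to check once the file is loaded: that total equals the sum of the six base stats, and that number identifies a row uniquely. The following is a minimal sketch of both checks. It uses a small made-up data frame (`toy`, with hypothetical values) as a stand-in for the real file, and the helper names `stat_cols`, `total_matches`, and `repeated_numbers` are introduced here for illustration only.

```r
# Toy stand-in for pokemon_data; the values are illustrative, not from the dataset.
toy <- data.frame(
  number     = c(1, 2, 3, 3),
  name       = c("A", "B", "C", "C (alternate form)"),
  total      = c(60, 75, 90, 120),
  hp         = c(10, 15, 20, 25),
  attack     = c(10, 10, 15, 20),
  defense    = c(10, 10, 15, 20),
  sp_attack  = c(10, 15, 15, 20),
  sp_defense = c(10, 15, 15, 20),
  speed      = c(10, 10, 10, 15)
)

stat_cols <- c("hp", "attack", "defense", "sp_attack", "sp_defense", "speed")

# Does total equal the row-wise sum of the six base stats?
total_matches <- toy$total == rowSums(toy[, stat_cols])

# Which identifiers occur on more than one row?
repeated_numbers <- unique(toy$number[duplicated(toy$number)])

all(total_matches)
repeated_numbers
```

Running the same two checks on pokemon_data would show whether the documented definitions hold for the loaded file.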

EDA

# Load libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.3
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
# Load dataset
pokemon_data <- read.csv(file.choose())

# Summary of the dataset
summary(pokemon_data)
##      number          name              type1              type2          
##  Min.   :  1.0   Length:1072        Length:1072        Length:1072       
##  1st Qu.:209.8   Class :character   Class :character   Class :character  
##  Median :442.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :445.2                                                           
##  3rd Qu.:681.2                                                           
##  Max.   :898.0                                                           
##      total              hp             attack          defense      
##  Min.   : 175.0   Min.   :  1.00   Min.   :  5.00   Min.   :  5.00  
##  1st Qu.: 330.0   1st Qu.: 50.00   1st Qu.: 56.00   1st Qu.: 52.00  
##  Median : 460.5   Median : 68.00   Median : 80.00   Median : 70.00  
##  Mean   : 440.9   Mean   : 70.49   Mean   : 80.94   Mean   : 74.97  
##  3rd Qu.: 519.2   3rd Qu.: 84.00   3rd Qu.:100.00   3rd Qu.: 90.00  
##  Max.   :1125.0   Max.   :255.00   Max.   :190.00   Max.   :250.00  
##    sp_attack        sp_defense         speed          generation   
##  Min.   : 10.00   Min.   : 20.00   Min.   :  5.00   Min.   :0.000  
##  1st Qu.: 50.00   1st Qu.: 50.00   1st Qu.: 45.00   1st Qu.:2.000  
##  Median : 65.00   Median : 70.00   Median : 65.00   Median :4.000  
##  Mean   : 73.27   Mean   : 72.48   Mean   : 68.79   Mean   :4.295  
##  3rd Qu.: 95.00   3rd Qu.: 90.00   3rd Qu.: 90.00   3rd Qu.:6.000  
##  Max.   :194.00   Max.   :250.00   Max.   :200.00   Max.   :8.000  
##  legendary      
##  Mode :logical  
##  FALSE:954      
##  TRUE :118      
##                 
##                 
## 
# Check for missing values
sum(is.na(pokemon_data))
## [1] 0
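
A count of zero here may understate the missing data. `is.na()` only detects explicit `NA` values, and `read.csv()` by default reads an empty character field as the empty string `""`, so Pokémon without a second type could be stored as blanks in type2 rather than as `NA`. The sketch below shows one way to count and recode such blanks. It uses a made-up data frame (`toy`) rather than the real file, so the counts are illustrative only.

```r
# Toy stand-in for pokemon_data; values are illustrative only.
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  type1 = c("Water", "Fire", "Grass", "Normal"),
  type2 = c("", "Flying", " ", NA),
  stringsAsFactors = FALSE
)

# is.na() only sees the explicit NA
na_count <- sum(is.na(toy$type2))

# Treat empty or whitespace-only strings as missing too
blank <- !is.na(toy$type2) & trimws(toy$type2) == ""
blank_count <- sum(blank)

# Optionally recode blanks to NA so later NA-aware code handles them
toy$type2[blank] <- NA
missing_after <- sum(is.na(toy$type2))

c(na = na_count, blank = blank_count, missing_after = missing_after)
```

An alternative, assuming blanks are the only missing-value encoding in the file, is to pass `na.strings = c("", "NA")` to `read.csv()` when loading.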
# Distribution of Pokémon types (type1)
ggplot(pokemon_data, aes(x = type1)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Pokémon Primary Types", x = "Type", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Legendary Pokémon vs Non-Legendary
ggplot(pokemon_data, aes(x = legendary)) +
  geom_bar(fill = "coral") +
  labs(title = "Legendary vs Non-Legendary Pokémon", x = "Legendary", y = "Count")

# Correlation heatmap for numeric attributes
pokemon_numeric <- pokemon_data %>%
  select(total, hp, attack, defense, sp_attack, sp_defense, speed)

correlation_matrix <- cor(pokemon_numeric, use = "complete.obs")

# Visualize the correlation matrix
ggcorrplot(correlation_matrix, lab = TRUE, title = "Correlation Matrix for Numeric Stats")

# Pair plot of numerical variables
pairs(pokemon_numeric, main = "Pair Plot of Pokémon Attributes")

Insights from EDA

Type Distribution: Some primary types, such as Water and Normal, are more common, while types like Dragon and Ghost are rarer.
Legendary Pokémon: Non-legendary Pokémon are far more prevalent. The summary above shows 954 non-legendary rows against 118 legendary rows, so legendary Pokémon make up about 11% of the data.
Correlations: Some stats are strongly correlated, such as special attack and special defense, which could influence how Pokémon perform in battles. Speed is less correlated with the other variables, although weak correlation alone does not establish that it is independent of them. Because total is the sum of the base stats, its correlation with each individual stat is expected by construction.
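
Reading individual pairs off a heatmap can be error-prone, so it may help to list the pairwise correlations in sorted order. The sketch below does this for a simulated data frame (`toy`, generated with `rnorm()`; it is not the Pokémon data) in which total is defined as the sum of three independent stats. The names `cm`, `idx`, and `pairs_ranked` are introduced here for illustration.

```r
set.seed(1)

# Simulated stats, for illustration only (not the Pokémon data)
n <- 200
toy <- data.frame(
  attack  = rnorm(n, mean = 80, sd = 30),
  defense = rnorm(n, mean = 75, sd = 30),
  speed   = rnorm(n, mean = 70, sd = 30)
)
toy$total <- toy$attack + toy$defense + toy$speed

cm <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```

In this simulation the strongest pairs all involve total even though the three stats were generated independently, which illustrates why correlations with a sum column say little about the relationships among the individual stats.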

K-Means

# Select numeric columns for clustering
pokemon_numeric <- pokemon_data %>%
  select(total, hp, attack, defense, sp_attack, sp_defense, speed)

# Scale the numeric data
pokemon_scaled <- scale(pokemon_numeric)

# Determine the optimal number of clusters using the Elbow Method
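set.seed(123)
# k = 1: total sum of squares of the scaled data
wss <- (nrow(pokemon_scaled) - 1) * sum(apply(pokemon_scaled, 2, var))

# k = 2..10: total within-cluster sum of squares from k-means
for (i in 2:10) {
  wss[i] <- kmeans(pokemon_scaled, centers = i, nstart = 25)$tot.withinss
}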

# Plot the Elbow Method
plot(1:10, wss, type = "b", pch = 19, frame = FALSE, xlab = "Number of Clusters (k)",
     ylab = "Total Within Sum of Squares", main = "Elbow Method for K-Means")
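
The elbow in such a plot is often judged by eye, so a second criterion can help confirm the choice of k. One common option is the average silhouette width, where higher values indicate better-separated clusters. The sketch below is not part of the original analysis: it uses `silhouette()` from the cluster package on simulated two-group data (`toy`), and the names `avg_sil` and `best_k` are introduced here for illustration.

```r
library(cluster)  # provides silhouette(); a recommended package shipped with R

set.seed(123)

# Simulated two-group data, for illustration only (not the Pokémon data)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)
toy_scaled <- scale(toy)
d <- dist(toy_scaled)

# Average silhouette width for k = 2..6 (higher is better)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
avg_sil
best_k
```

Applied to pokemon_scaled, the same loop would give a value per k that can be compared with the elbow plot.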

# Set k
k <- 2

# Apply K-means clustering
set.seed(123)
kmeans_result <- kmeans(pokemon_scaled, centers = k, nstart = 25)

# Add cluster labels to the original data
pokemon_data$Cluster <- as.factor(kmeans_result$cluster)

# Visualize the clusters (Total vs Attack)
ggplot(pokemon_data, aes(x = attack, y = total, color = Cluster)) +
  geom_point(size = 2, alpha = 0.6) +
  labs(title = "K-means Clustering on Pokémon Data", x = "Attack", y = "Total") +
  theme_minimal()
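
A scatter plot shows only two of the clustered variables, so a per-cluster summary on the original (unscaled) scale can make the groups easier to interpret. The sketch below shows one way to build such a profile with `aggregate()`. It runs on simulated data (`toy`, not the Pokémon data), and the name `profile` is introduced here for illustration.

```r
set.seed(123)

# Simulated stats, for illustration only (not the Pokémon data)
toy <- data.frame(
  attack = c(rnorm(30, mean = 50, sd = 8), rnorm(30, mean = 110, sd = 8)),
  speed  = c(rnorm(30, mean = 45, sd = 8), rnorm(30, mean = 100, sd = 8))
)
toy$total <- toy$attack + toy$speed

# Cluster on scaled values, as in the analysis above
km <- kmeans(scale(toy), centers = 2, nstart = 25)
toy$Cluster <- as.factor(km$cluster)

# Cluster sizes
table(toy$Cluster)

# Mean of each stat per cluster, on the original scale
profile <- aggregate(. ~ Cluster, data = toy, FUN = mean)
profile
```

For the real data, the equivalent call would aggregate the seven numeric columns of pokemon_data by Cluster.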

Conclusion

Through the EDA and clustering, we gained a deeper understanding of the Pokémon dataset. The dataset shows a rich diversity in Pokémon types and attributes. Using K-means clustering, we were able to group Pokémon based on their stats, highlighting patterns in attack power, total base stats, and other attributes.

The clustering revealed that certain Pokémon, especially those with higher total stats, tend to cluster together, likely representing stronger Pokémon used in competitive play. This analysis can be further expanded by exploring more advanced clustering techniques or dimensionality reduction methods.

DB Index

# Re-run K-means with the same k used above (k = 2)
# Note: clusterSim attaches MASS, which masks dplyr::select() from this point on
library(clusterSim)
## Warning: package 'clusterSim' was built under R version 4.3.3
## Loading required package: cluster
## Warning: package 'cluster' was built under R version 4.3.3
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
set.seed(123)
kmeans_result <- kmeans(pokemon_scaled, centers = k, nstart = 25)
# Calculate the Davies-Bouldin index
db_index <- index.DB(pokemon_scaled, kmeans_result$cluster, centrotypes = "centroids")
# Print the DB Index value
print(paste("DB Index:", db_index$DB))
## [1] "DB Index: 1.26313501170336"
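
To make the number easier to interpret, it helps to see what the Davies-Bouldin index measures: for each cluster, the worst ratio of combined within-cluster scatter to the distance between centroids, averaged over all clusters. Lower values indicate more compact, better-separated clusters. The sketch below computes one common variant by hand in base R, using the mean Euclidean distance to the centroid as the scatter. The function `db_by_hand` is hypothetical and written for illustration; `index.DB()` has its own settings for how scatter and separation are measured, so its value need not match this sketch exactly.

```r
# Davies-Bouldin index by hand (a sketch using mean Euclidean distance as scatter)
db_by_hand <- function(x, cl) {
  x <- as.matrix(x)
  ids <- sort(unique(cl))

  # Centroid of each cluster, one row per cluster
  centers <- do.call(rbind, lapply(ids, function(i) {
    colMeans(x[cl == i, , drop = FALSE])
  }))

  # Scatter: mean distance from each member to its own centroid
  scatter <- sapply(seq_along(ids), function(i) {
    pts <- x[cl == ids[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centers[i, ])^2)))
  })

  # Separation: distances between centroids
  sep <- as.matrix(dist(centers))

  # For each cluster, the largest (worst) ratio against any other cluster
  worst <- sapply(seq_along(ids), function(i) {
    max((scatter[i] + scatter[-i]) / sep[i, -i])
  })

  mean(worst)
}

# Worked example: two tight clusters whose centroids are 10 units apart
pts <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
cl  <- c(1, 1, 2, 2)
db_toy <- db_by_hand(pts, cl)
db_toy
```

In the worked example each cluster has a scatter of 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.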