I definitely wanted to use plotly on a treemap because interactive plots are fun. Treemaps are something I haven’t tried in this course yet so I thought I’d challenge myself for the final project. A bit of a gamble, but worth it.
I chose the Pokemon dataset for nostalgia reasons, and because I knew the variety of Pokémon types and names could make for a cool visualization.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
pokemon <- read_csv("pokemon.csv")
Rows: 500 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Date, Pokemon, Trainer Region, Trainer Subregion, Pokemon Region,...
dbl (3): Level, Level Met, Perfect IVs
lgl (1): Held Item
time (1): Time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Dataset Overview
The Pokemon dataset used in this project was sourced from Serebii.net, a fan-based database of Pokemon game content and stats. However, I could not find a README file or any documentation about how the data was collected or structured, so I worked with the variables as is and relied on my prior knowledge of the Pokemon games to interpret what the columns represent.
The dataset includes information such as Pokemon names, levels, types, trainer locations, and more. For this project, I will use these variables: Pokemon, Type1, Level Met, and Level. I chose to use only Type1 because Type2 represents a secondary type that would complicate the visual. I also decided to group the Pokemon by Type1 and select only the five highest-level Pokemon per type to keep things readable.
Variables included in the dataset (this project uses Pokemon, Level, Level Met, and Type1):
Date
Time
Pokemon
Trainer Region
Trainer Subregion
Pokemon Region
Level
Level Met
Gender
Type1
Type2
Nature
Pokeball
Held Item
Perfect IVs
Background Research
I wanted to see each Pokemon type’s strengths and weaknesses against the other types. The chart below gives a clear visual of how strong each type’s attack or defense is against another, ranging from no effect to super effective, so we can easily see which types have an advantage or a disadvantage.
Pokemon Go type chart: Strengths & weaknesses - Dexerto
Group the Pokemon by Type1 and keep the top 5 by Level in each group, to keep the data short for the visual.
top_pokemon <- pokemon |>
  group_by(Type1) |>
  slice_max(Level, n = 5)
The subset looks mostly good, but I am seeing a few issues in the table. I will run it anyway and expect a few errors in the plot.
Treemap Plot
Using the top_pokemon subdataset, I created a treemap showing the top 5 highest level Pokémon per type. I also applied custom colors to each type using scale_fill_manual, which was fun to experiment with.
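The treemap code is not shown in this report, so here is a minimal sketch of how a plot like this could be built with the treemapify package, using the geom_treemap() and geom_treemap_text() functions mentioned in the final reflection. The toy data frame, the hex colors, and the object names (`toy`, `type_colors`, `p`) are illustrative assumptions, not the values used in the project.

```r
library(ggplot2)
library(treemapify)

# Toy data (hypothetical), standing in for top_pokemon
toy <- data.frame(
  Pokemon = c("Dragonite", "Garchomp", "Pikachu", "Raichu"),
  Type1   = c("Dragon", "Dragon", "Electric", "Electric"),
  Level   = c(80, 75, 50, 45)
)

# One custom color per type (hex values are placeholders)
type_colors <- c(Dragon = "#6F35FC", Electric = "#F7D02C")

# Box area follows Level; boxes are grouped and filled by Type1
p <- ggplot(toy, aes(area = Level, fill = Type1,
                     label = Pokemon, subgroup = Type1)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_text(colour = "white", place = "centre") +
  scale_fill_manual(values = type_colors)

p
```

Replacing `toy` with `top_pokemon` and extending `type_colors` to cover every value of Type1 should give a plot along the lines described here.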
The treemap looks great. Picking the colors was fun and gave each type a clear visual identity. However, I noticed that some types, like Dragon and Steel, had too many Pokemon, while others, like Psychic, Electric, and Rock, had 6 instead of 5.
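One possible cause of the extra rows, assuming they come from Pokemon tied on Level rather than from duplicated records, is that slice_max() keeps all ties at the cutoff by default (`with_ties = TRUE`). The sketch below uses made-up toy data to show the difference; it is not a confirmed diagnosis of this dataset.

```r
library(dplyr)

# Toy data (hypothetical): six Dragon rows, two of them tied at level 50
toy <- tibble(
  Pokemon = c("A", "B", "C", "D", "E", "F"),
  Type1   = "Dragon",
  Level   = c(80, 70, 60, 55, 50, 50)
)

# Default: every row tied at the cutoff is kept, so more than n rows can return
with_ties <- toy |>
  group_by(Type1) |>
  slice_max(Level, n = 5) |>
  ungroup()

# with_ties = FALSE caps each group at exactly n rows
no_ties <- toy |>
  group_by(Type1) |>
  slice_max(Level, n = 5, with_ties = FALSE) |>
  ungroup()

nrow(with_ties)  # 6
nrow(no_ties)    # 5
```

If the extra rows are true duplicates instead, a call to distinct() before grouping would be the more relevant cleanup step.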
The plot is meant to show the five highest-level Pokemon within each primary type. By sizing the boxes by level and grouping them by Type1, the chart highlights which Pokemon stand out the most within their category. It’s a fun and informative way to compare strength across types while showing the diversity of Pokemon in the plot.
Plotly Scatterplot
I want to see how many levels the Pokemon have gained, using two variables: Level Met, the level a Pokemon was at when it met its trainer, and Level, its current level. I thought a scatterplot could show the relationship between these two variables, though I am aware that a lot of Pokemon were not trained to a higher level after meeting their trainers.
This scatterplot shows the relationship between the level at which a Pokemon was met (Level Met) and its current level (Level), grouped by primary type. Each dot represents one Pokemon, and its color shows its type, using custom colors to match the theme of the type. The goal here was to see if there’s a pattern in level progression across different types. We can see that most Pokemon were met at lower levels and trained to higher ones, while barely a few were caught at already high levels. There are a lot of Pokemon that were left untrained, though.
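The scatterplot code is not shown in this report. A minimal sketch of one way to draw it with plot_ly() follows; the toy data, the `Gain` column, the hex colors, and the hover text format are assumptions made for illustration only.

```r
library(plotly)

# Toy data (hypothetical), using the same column names as the dataset
toy <- tibble::tibble(
  Pokemon     = c("Pikachu", "Raichu", "Dragonite", "Garchomp"),
  Type1       = c("Electric", "Electric", "Dragon", "Dragon"),
  `Level Met` = c(5, 20, 55, 48),
  Level       = c(5, 36, 55, 60)
)

# Levels gained since meeting the trainer (0 means untrained)
toy$Gain <- toy$Level - toy$`Level Met`

# One custom color per type (hex values are placeholders)
type_colors <- c(Dragon = "#6F35FC", Electric = "#F7D02C")

fig <- plot_ly(
  data = toy,
  x = ~`Level Met`,
  y = ~Level,
  color = ~Type1,
  colors = type_colors,
  type = "scatter",
  mode = "markers",
  text = ~paste0(Pokemon, " (+", Gain, " levels)"),
  hoverinfo = "text"
) |>
  layout(
    xaxis = list(title = "Level Met"),
    yaxis = list(title = "Level")
  )

fig
```

Points on the diagonal (Level equal to Level Met) are the untrained Pokemon; anything above the diagonal has gained levels.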
Statistical Analysis: Linear Regression
I wanted to see how a Pokemon’s current level is related to the level at which it was first met, so I fit a simple linear regression using the formula Level ~ Level Met on the pokemon data. This means I’m predicting a Pokemon’s current level (Level) from its Level Met, the level at which it was originally encountered.
The intercept (0.99) tells us that a Pokemon encountered at level 0 would be around level 1, hypothetically.
The slope (0.98) tells us that for every level higher a Pokémon is met, its current level is on average 0.98 levels higher, basically almost a 1-to-1 relationship.
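To make the intercept and slope concrete, the fitted line can be applied by hand. This is a small worked sketch using the unrounded coefficients from the summary(model) output in this report; the helper name `predict_level` is hypothetical.

```r
# Coefficients copied from the summary(model) output in this report
intercept <- 0.99282
slope     <- 0.98015

# Predicted current level for a given Level Met
predict_level <- function(level_met) intercept + slope * level_met

predict_level(c(1, 10, 50))
# 1.97297 10.79432 50.00032
```

For a Pokemon met at level 1 and still at level 1, the residual is 1 - 1.97297 = -0.97297, which is consistent with the minimum residual of -0.9730 in the model output.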
Summary of Model Output
model <- lm(Level ~ `Level Met`, data = pokemon)
summary(model)
Call:
lm(formula = Level ~ `Level Met`, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-0.9730 -0.9730 -0.8737 -0.5908 29.1461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.99282 0.19842 5.004 7.82e-07 ***
`Level Met` 0.98015 0.01204 81.420 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.503 on 498 degrees of freedom
Multiple R-squared: 0.9301, Adjusted R-squared: 0.93
F-statistic: 6629 on 1 and 498 DF, p-value: < 2.2e-16
p-value for Level Met: < 2e-16
This means it is highly significant.
Adjusted R²: 0.93
This means 93% of the variation in the Pokémon’s current level is explained by the level it was met at.
Residual standard error: 3.5
This means predictions are typically off by about 3.5 levels, a fairly small deviation between the predicted and actual levels.
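The three figures above can also be pulled out of the summary object in code instead of being read off the printout. The sketch below fits a model on made-up toy data, so its numbers will not match the ones reported here; the object names are assumptions.

```r
# Toy data (hypothetical), not the project dataset
toy <- data.frame(
  level_met = c(1, 1, 5, 10, 20, 30),
  level     = c(1, 2, 5, 12, 20, 45)
)

fit <- lm(level ~ level_met, data = toy)
fit_summary <- summary(fit)

# Coefficient table: estimates, standard errors, t values, p-values
coef_table <- coef(fit_summary)

p_slope <- coef_table["level_met", "Pr(>|t|)"]  # p-value for the slope
adj_r2  <- fit_summary$adj.r.squared            # adjusted R-squared
rse     <- fit_summary$sigma                    # residual standard error

c(p_slope = p_slope, adj_r2 = adj_r2, rse = rse)
```

With the project model, the same three lines applied to summary(model) would return the values quoted in this section; the slope row there is named `Level Met` with backticks, as shown in the printed coefficient table.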
Interpretation
This model shows a strong and significant relationship between the level a Pokemon was met at and its current level. The nearly perfect straight line in the plot tells us that most Pokemon haven’t changed levels much since they were caught. With a high R² and a large t-statistic, the model fits the data well and the relationship is statistically significant.
Final Reflection
Both plots gave me solid insight into the patterns in the dataset. The treemap made it easy to compare the top Pokemon across types and visualize who the strongest ones were, while showing type diversity. One thing I noticed was that some types, like Dragon and Steel, had more than five Pokemon even though I tried to limit it. I wish I could do additional cleaning to fix those duplicates.
The scatterplot revealed something I expected: most Pokemon hadn’t increased much in level since they first met their trainer. That does not surprise me, since I assumed most trainers are looking to catch Pokemon more than to train them. This plot suggests that the trainers are playing for recreational rather than competitive purposes. I would love to add a picture of each Pokemon to the information box so the audience could easily connect the name to the Pokemon.
I wanted to make the treemap interactive using plotly, but unfortunately geom_treemap() and geom_treemap_text() are not yet supported in plotly. I tried, but it didn’t work, so I learned about that limitation firsthand. Despite it, I’m quite happy, though not completely satisfied, with how the project turned out. It helped me apply almost every skill we learned in this course and gave me more confidence in using R for data visualization and for communicating through it.
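For reference, plotly also has its own treemap trace that does not go through ggplot2, which may be one way around the limitation with geom_treemap(). This is a minimal sketch with made-up toy data, not something that was tried in this project; the labels, values, and object names are assumptions.

```r
library(plotly)

# Each box needs a label and a parent; types sit at the root (parent = "")
labels  <- c("Dragon", "Electric", "Dragonite", "Garchomp", "Pikachu", "Raichu")
parents <- c("", "", "Dragon", "Dragon", "Electric", "Electric")

# Box size: level for each Pokemon; 0 for the type-level boxes,
# whose size then comes from their children
values  <- c(0, 0, 80, 75, 50, 45)

fig <- plot_ly(
  type = "treemap",
  labels = labels,
  parents = parents,
  values = values
)

fig
```

Building `labels`, `parents`, and `values` from top_pokemon would mean stacking the distinct Type1 values on top of the Pokemon rows, which is left out here to keep the sketch short.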