Week 5 | A Dive into Data Ambiguities and Their Implications

library(ggplot2)

pokemon_data <- read.csv("./PokemonStats.csv")

# A few columns/values that might be ambiguous without further documentation include:

#1. Total: It's unclear what this column represents. Is it the sum of all the stat points (HP, Attack, Defense, etc.)?

#2. Type1 and Type2: While these columns likely represent the primary and secondary types of a Pokémon, it might not be clear to everyone, especially those unfamiliar with the Pokémon franchise. Additionally, the significance of having two types and how they interact might be ambiguous.

#3. Height and Weight: The units for these columns are not specified. Are they in meters and kilograms respectively? Or are they in some other unit?

#4. ID: While it seems to be a unique identifier for each Pokémon, there are some repeated IDs, such as for Venusaur and its Mega Evolution. This could be confusing without documentation.
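
# A minimal sketch of how the repeated IDs in point 4 could be surfaced. The helper name, the "Name" column, and the toy rows are hypothetical illustrations, not values read from PokemonStats.csv.
rows_with_shared_id <- function(df, id_col = "ID") {
  ids <- df[[id_col]]
  # Keep every row whose ID occurs more than once, not only the later repeats
  df[ids %in% ids[duplicated(ids)], , drop = FALSE]
}

toy_ids <- data.frame(
  ID = c(1, 2, 3, 3),
  Name = c("Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur")
)
shared <- rows_with_shared_id(toy_ids)
print(shared)

# On the real data this would be rows_with_shared_id(pokemon_data), assuming the identifier column is named "ID".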

## Why do you think they chose to encode the data the way they did? What could have happened if you didn't read the documentation?

#1. Total: Summarizing stats into a "Total" column makes it easy to quickly gauge a Pokémon's overall strength. Without documentation, one might not realize this is a derived column and might mistakenly treat it as a unique attribute.

#2. Type1 and Type2: By splitting types into two columns, the dataset can represent Pokémon with dual types. Without documentation, one might not understand the significance or interplay of these types.

#3. Height and Weight: These are standard attributes for creatures, but without units, they can be misleading. Without documentation, one might make incorrect assumptions about the scale or units.

#4. ID: This is likely a way to identify each Pokémon uniquely, but the presence of duplicate IDs for different forms or evolutions of Pokémon can be misleading without documentation.
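
# Whether Total really is a derived column can be checked rather than assumed. The sketch below recomputes the sum of the stat columns and compares it with the stored total. The helper name and the toy rows are hypothetical, and the real file would need its full list of stat columns (only HP, Attack and Defense are named above).
total_matches_sum <- function(df, stat_cols, total_col = "Total") {
  # TRUE for each row where the stored total equals the recomputed sum
  unname(rowSums(df[, stat_cols, drop = FALSE]) == df[[total_col]])
}

toy_stats <- data.frame(
  HP = c(45, 60),
  Attack = c(49, 62),
  Defense = c(49, 63),
  Total = c(143, 999)  # the second total is deliberately wrong
)
matches <- total_matches_sum(toy_stats, c("HP", "Attack", "Defense"))
print(matches)

# Rows that come back FALSE would suggest Total is not a plain sum of the listed columns.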

# Identify at least one element of the data that is unclear even after reading the documentation

#1. The Type2 column has many missing values, suggesting that not all Pokémon have a secondary type. However, it remains unclear whether a missing value indicates the absence of a secondary type or simply incomplete data.

#2. There's one missing value in the Weight column. We can't determine whether this is a data entry error, whether the Pokémon genuinely has no weight, or whether its weight is unknown.

#3. The Height column is ambiguous. Its maximum value is 100.0. Read as meters, this suggests a Pokémon 100 meters tall, which is inconsistent with the other Pokémon listed, whose heights are all around 1 meter. This is also the Pokémon with the missing weight.
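
# A rough sketch for locating these gaps and the extreme height. The helper names, the "Name" column, and the toy rows are assumptions for illustration only; they are not rows from PokemonStats.csv.
count_missing <- function(df) {
  # read.csv() can leave blank text cells as "" rather than NA, so count both
  vapply(df, function(col) sum(is.na(col) | as.character(col) == ""), integer(1))
}

tallest_row <- function(df, height_col = "Height") {
  # which.max() skips NA values and returns the first maximum
  df[which.max(df[[height_col]]), , drop = FALSE]
}

toy_gaps <- data.frame(
  Name = c("A", "B", "C"),
  Type2 = c("Poison", "", NA),
  Height = c(0.7, 1.0, 100.0),
  Weight = c(6.9, 13.0, NA)
)
missing_counts <- count_missing(toy_gaps)
tallest <- tallest_row(toy_gaps)
print(missing_counts)
print(tallest)

# On the real data, count_missing(pokemon_data) and tallest_row(pokemon_data) would show whether the 100.0 height and the missing weight sit on the same row.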

# Plotting the histogram
ggplot(data = pokemon_data, aes(x = Height)) +
  geom_histogram(aes(y = after_stat(density)), fill = "skyblue", bins = 50, alpha = 0.7) +
  geom_density(aes(y = after_stat(density)), color = "blue") +
  geom_vline(xintercept = 100, color = "red", linetype = "dashed", linewidth = 1) +  # constant set outside aes() so the line is drawn once
  labs(title = "Distribution of Pokémon Heights",
       x = "Height (Assumed to be in meters)",
       y = "Density") +
  theme_minimal() +
  annotate("text", x = 80, y = 0.02, label = "Potential Outlier (100m)", color = "red")

# Here's the histogram showcasing the distribution of Pokémon heights:

#1. The majority of Pokémon have heights within a reasonable range.
#2. The red dashed line highlights a potential outlier: a Pokémon with a height of 100 meters.

# The height value of 100 meters is a clear outlier in the dataset. It could be a data entry error, a genuine characteristic of a specific Pokémon, or a sign that the assumed unit of measurement is wrong. Without proper documentation, this value is ambiguous and can lead to incorrect interpretations or conclusions.

# Risks:

# Inaccurate Modeling: If used for predictive modeling or analytics, this outlier can skew model results, leading to inaccurate predictions or conclusions.
# Misleading Visuals: Visualizations or reports based on this data might present misleading information, affecting decision-making or understanding.
# Inaccurate Gameplay Strategy: Gamers or enthusiasts might use this data to understand gameplay dynamics better. Misinterpretations can lead them to develop incorrect strategies, thinking that taller Pokémon might have certain advantages.
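
# To illustrate the first risk, the sketch below compares the mean and median of a set of heights with and without one extreme value. The helper name, the cutoff, and the toy heights are made up for illustration and are not taken from the dataset.
summary_with_and_without <- function(x, cutoff) {
  # Values at or above the cutoff are treated as suspect and dropped
  kept <- x[!is.na(x) & x < cutoff]
  c(
    mean_all = mean(x, na.rm = TRUE),
    mean_trimmed = mean(kept),
    median_all = median(x, na.rm = TRUE),
    median_trimmed = median(kept)
  )
}

toy_heights <- c(0.4, 0.7, 1.0, 1.1, 1.5, 100)
height_summary <- summary_with_and_without(toy_heights, cutoff = 100)
print(height_summary)

# In this toy case the mean moves from about 17.45 to 0.94 once the extreme value is dropped, while the median barely changes (1.05 to 1.0), which is why a mean-based model is the more exposed of the two.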

# To reduce negative consequences:

#1. We can cross-check the dataset with other reputable Pokémon sources to clear up any suspicious data points.
#2. We can tap into dedicated Pokémon communities to help clear up any ambiguities. Fans and players often have a wealth of detailed knowledge that can be really handy.
#3. Before finalizing analyses or making decisions based on the data, having a Pokémon expert or enthusiast review the findings can help catch any misconceptions.
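
# One way the cross-check in step 1 might look, assuming a trusted reference table keyed by name is available. The helper name, the "Name" key, the tolerance, and both toy tables are hypothetical; no real reference values are used here.
flag_height_mismatch <- function(df, reference, tol = 0.05) {
  both <- merge(df, reference, by = "Name", suffixes = c("_data", "_ref"))
  # Keep rows where the two sources disagree by more than the tolerance
  both[which(abs(both$Height_data - both$Height_ref) > tol), , drop = FALSE]
}

toy_data <- data.frame(Name = c("A", "B"), Height = c(1.0, 100.0))
toy_ref <- data.frame(Name = c("A", "B"), Height = c(1.0, 1.0))
mismatches <- flag_height_mismatch(toy_data, toy_ref)
print(mismatches)

# Rows returned here are candidates for manual review, not automatic correction.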


Navdeep Metchu