pkmn_data <- read.csv("C:/Users/thord/Downloads/archive/all_pokemon_data.csv")
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
## Warning: package 'tibble' was built under R version 4.4.3
In this project I am working with a dataset containing the programmed information for every “Pokemon” from the popular video game franchise of the same name. The dataset is thorough and its variables are well labeled, but it is somewhat small. Outliers can therefore have a pronounced impact on trends and may radically skew our results, which must be accounted for in each item.
Gen_Totals <- data.frame(
Generation = c("1", "2", "3", "4", "5", "6", "7", "8", "9"),
Total_mon = c(151, 100, 135, 107, 156, 72, 88, 89, 120))
Gen_Form <- data.frame(
Generation = c("1", "2", "3", "4", "5", "6", "7", "8", "9"),
Alt_form = c(66, 12, 24, 7, 12, 6, 2, 0, 0))
combined <- merge(Gen_Totals, Gen_Form, by = "Generation")
(Should the teacher be wondering why the above code was written: the database does not record when an alternate form was added, and its totals include alternate forms. The author therefore had to create the necessary data frames manually.)
Flott_data <- combined %>%
mutate(Alternate_Proportion = Alt_form/Total_mon,
Regular_Proportion = 1 - Alternate_Proportion) %>%
select(Generation, Alternate_Proportion, Regular_Proportion) %>%
pivot_longer(-Generation, names_to = "Category", values_to = "Proportion")
Let us define our terms. An alternate form is a variant of a Pokemon that is not defined as a “new” creature. Often when a new generation is released, multiple Pokemon are given these alternate forms to appeal to those who are fond of them. The hypothesis is essentially that whenever a new generation is released, it is more likely for older Pokemon to receive new forms than it is for relatively newer Pokemon.
ggplot(combined, aes(x = factor(Generation))) +
geom_col(aes(y = Total_mon), fill = "steelblue", width = 0.4,
position = position_nudge(x = -0.2)) +
geom_col(aes(y = Alt_form), fill = "coral", width = 0.4,
position = position_nudge(x = 0.2)) +
labs(title = "Pokémon per Generation",
x = "Generation",
y = "Count") +
scale_x_discrete(labels = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX")) +
theme_minimal()
Here is a chart showing how many Pokemon (blue) and how many alternate forms (coral) belong to each generation. While this chart may seem revealing, it is somewhat misleading. Gen 1 does appear to have a great many alternate forms, but it also has a lot of Pokemon in general. The chart does not show the proportion, which is what truly matters.
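The proportions in question can be computed directly from the hand-entered counts. The following is a minimal base R sketch; the object name `gen_counts` and the column `Alt_share` are introduced here for illustration only.

```r
# Counts copied from Gen_Totals and Gen_Form above
gen_counts <- data.frame(
  Generation = 1:9,
  Total_mon  = c(151, 100, 135, 107, 156, 72, 88, 89, 120),
  Alt_form   = c(66, 12, 24, 7, 12, 6, 2, 0, 0)
)

# Share of each generation's Pokemon that have an alternate form
gen_counts$Alt_share <- gen_counts$Alt_form / gen_counts$Total_mon

print(round(gen_counts$Alt_share, 3))
```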
ggplot(Flott_data, aes(x = Generation, y = Proportion, fill = Category)) +
geom_bar(stat = "identity",
position = position_fill(reverse = TRUE)) +
scale_fill_manual(values = c("Regular_Proportion" = "#007BA7", "Alternate_Proportion" = "#800000")) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX")) +
labs(title = "Proportion of Alternate Forms by Generation",
x = "Generation (Roman Numerals)", # Updated axis label
y = "Percentage",
fill = "Category") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
A less cautious reader may rush to conclusions upon seeing this graph, as the correlation appears self-evident. But just to be safe, one should run a chi-square test. Early Gen includes all Pokemon from Gen 1-4; Late Gen includes all Pokemon from Gen 5-9.
form_data <- matrix(c(109, 20, 384, 503), nrow = 2,
dimnames = list(c("Early Gen", "Late Gen"),
c("With Forms", "Without Forms")))
print(form_data)
## With Forms Without Forms
## Early Gen 109 384
## Late Gen 20 503
For this test let us set α = 0.01.
H-0: There is no significant association between whether a Pokemon has received an alternate form and whether its generation is early or late
H-1: There is a significant association between whether a Pokemon has received an alternate form and whether its generation is early or late
chisq.test(form_data)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: form_data
## X-squared = 74.908, df = 1, p-value < 2.2e-16
X-squared being so large confirms a substantial deviation from the counts expected under the null hypothesis. And p << 0.01 by an almost ridiculous amount. This means that, if the null hypothesis were true, there would be almost a 0% chance of seeing a deviation this large through random chance alone, so the null hypothesis is rejected.
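To make the "expected value" concrete, here is a small sketch that recomputes the statistic by hand from the same table. The names `expected` and `x_squared` are illustrative; the 0.5 subtraction is Yates' continuity correction, as named in the test output.

```r
# Same 2x2 table as form_data above
form_data <- matrix(c(109, 20, 384, 503), nrow = 2,
                    dimnames = list(c("Early Gen", "Late Gen"),
                                    c("With Forms", "Without Forms")))

# Expected count per cell if form status were independent of era:
# (row total * column total) / grand total
expected <- outer(rowSums(form_data), colSums(form_data)) / sum(form_data)

# Yates' correction subtracts 0.5 from each |observed - expected|
x_squared <- sum((abs(form_data - expected) - 0.5)^2 / expected)

print(round(expected, 1))
print(x_squared)
```

Under independence, far fewer early-gen Pokemon with forms would be expected than the 109 observed, which is where the large statistic comes from.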
What immediately stands out is how heavily a single category has skewed these results. Of the 109 alternate forms in the early gens, 66 are from gen 1. Therefore, to conduct a more proper test, the definition of early gen has been expanded to include gen 5, and gen 1 has been removed.
formed_data <- matrix(c(55, 8, 443, 369), nrow = 2,
dimnames = list(c("Early Gen", "Late Gen"),
c("With Forms", "Without Forms")))
print(formed_data)
## With Forms Without Forms
## Early Gen 55 443
## Late Gen 8 369
We are using the same hypotheses and the same value for alpha as in “Test 1”.
chisq.test(formed_data)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: formed_data
## X-squared = 24.246, df = 1, p-value = 8.48e-07
X-squared is still fairly high but considerably smaller than in the previous test. The difference between the counts predicted by the null hypothesis and the observed counts is no longer as dramatic, but it is still large enough to soundly reject the null hypothesis. And while p is much larger than before, p << 0.01, so it is highly unlikely that a deviation this large would arise from random chance alone.
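The change between the two tests is easier to read as row proportions than as X-squared values. A short sketch using the two tables exactly as entered above (the names `share_test1` and `share_test2` are illustrative):

```r
form_data <- matrix(c(109, 20, 384, 503), nrow = 2,
                    dimnames = list(c("Early Gen", "Late Gen"),
                                    c("With Forms", "Without Forms")))
formed_data <- matrix(c(55, 8, 443, 369), nrow = 2,
                      dimnames = list(c("Early Gen", "Late Gen"),
                                      c("With Forms", "Without Forms")))

# Share of each row (era) that has an alternate form
share_test1 <- prop.table(form_data, margin = 1)[, "With Forms"]
share_test2 <- prop.table(formed_data, margin = 1)[, "With Forms"]

print(round(rbind(share_test1, share_test2), 3))
```

With gen 1 removed, the early share roughly halves (from about 22% to about 11%), while the late share stays in the low single digits.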
One should expect older Pokemon to have received more alternate forms than newer ones for the simple reason that the older the generation, the more generations have been released after it, and so the more opportunities it has had to receive alternate forms. This does not necessarily mean that later-gen Pokemon are less likely to receive forms in the next generation, only that they have had fewer opportunities.
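One rough way to account for opportunity is to divide each generation's alternate forms by its number of Pokemon and by the number of later generations. This is only a sketch under a loud assumption: it treats every later generation as exactly one opportunity and ignores forms introduced within a Pokemon's own generation. The names `opportunities` and `forms_per_opportunity` are hypothetical.

```r
gen       <- 1:9
total_mon <- c(151, 100, 135, 107, 156, 72, 88, 89, 120)
alt_form  <- c(66, 12, 24, 7, 12, 6, 2, 0, 0)

# Assumption: one opportunity per generation released afterwards
opportunities <- max(gen) - gen

# Gen 9 has had no later generation yet, so it is left out
has_opportunity <- opportunities > 0

# Alternate forms per Pokemon per opportunity
forms_per_opportunity <- alt_form[has_opportunity] /
  (total_mon[has_opportunity] * opportunities[has_opportunity])
names(forms_per_opportunity) <- gen[has_opportunity]

print(round(forms_per_opportunity, 4))
```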
These results are much more plausible than the previous ones. p is still incredibly small, and X-squared is still fairly large, but not so large that it staggers belief. Based on this data, there is a significant association between how early a Pokemon's generation is and whether it has received an alternate form in a later gen.
Let us define our terms. Power creep is a phenomenon in which, as a franchise goes on, the general “power level” rises steadily. In terms of Pokemon, power level can be cleanly defined as the sum of a Pokemon's “stats”, which are values describing how good it is in particular areas. The sum of these values is called its “Base Stat Total” (BST for short). The hypothesis is that the later the generation, the higher the mean Base Stat Total. The following is a boxplot describing this relationship.
pkmn_data <- read.csv("C:/Users/thord/Downloads/archive/all_pokemon_data.csv")
pkmn_data <- pkmn_data %>%
mutate(Generation = recode(Generation,
"generation-i" = "I",
"generation-ii" = "II",
"generation-iii" = "III",
"generation-iv" = "IV",
"generation-v" = "V",
"generation-vi" = "VI",
"generation-vii" = "VII",
"generation-viii" = "VIII",
"generation-ix" = "IX"))
pkmn_data$Generation <- factor(pkmn_data$Generation,
levels = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"))
ggplot(pkmn_data, aes(x = Generation, y = Base.Stat.Total, fill = Generation)) +
geom_boxplot() +
labs(title = "Generation vs BST",
x = "Generation",
y = "Base Stat Total (BST)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set3")
While this adequately summarizes the median and spread of BST for each generation, it may be useful for analysis to see the distribution of these values within each generation more clearly.
ggplot(pkmn_data, aes(x = Generation, y = Base.Stat.Total, fill = Generation)) +
geom_violin(trim = FALSE) +
labs(title = "Generation vs BST",
x = "Generation",
y = "Base Stat Total (BST)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set3")
Now that we have clear charts we are ready to commence with our tests.
# Generation is already a factor with levels "I" to "IX" in order (set above),
# so its integer codes are the generation numbers 1 to 9.
pkmn_data$Gen_Numeric <- as.integer(pkmn_data$Generation)
The Generation category must be encoded as a numeric variable so that R can run a Spearman test. Conceptually it is still a categorical variable. This is acceptable because the categories are ordinal, so a Spearman test can still detect monotonic trends.
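The reason ordinal codes are enough is that Spearman's rho only looks at ranks: it is the Pearson correlation of the ranked values. A tiny sketch with made-up numbers (not taken from the Pokemon data) shows the equivalence, ties included:

```r
# Made-up ordinal codes (1 to 3) and scores, for illustration only
gen_code <- c(1, 1, 2, 2, 3, 3, 3)
score    <- c(300, 420, 310, 500, 480, 330, 600)

# Built-in Spearman correlation
rho_builtin <- cor(gen_code, score, method = "spearman")

# Pearson correlation of the ranks (ties receive average ranks)
rho_manual <- cor(rank(gen_code), rank(score))

print(c(builtin = rho_builtin, manual = rho_manual))
```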
H-0: There is no significant monotonic correlation between generation and BST
H-1: There is a significant monotonic correlation between generation and BST
For this test let us set α = 0.01.
cor.test(pkmn_data$Gen_Num, pkmn_data$Base.Stat.Total, method = "spearman")
## Warning in cor.test.default(pkmn_data$Gen_Num, pkmn_data$Base.Stat.Total, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: pkmn_data$Gen_Num and pkmn_data$Base.Stat.Total
## S = 242857420, p-value = 2.525e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1220944
Since 0.1 < rho < 0.2, the relationship between the two variables is extremely weak, but not nonexistent. And since p << 0.01, it is extremely unlikely that a trend of this size would arise by chance if there were no relationship.
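A weak rho can still come with a very small p-value when many observations are involved. The sketch below uses the t-distribution approximation commonly applied to Spearman's rho when an exact p-value is unavailable. The helper `spearman_p_approx` is hypothetical, and n = 1184 is an assumption: the report does not state the row count, so it was inferred here from the reported S via S = (n^3 - n)(1 - rho)/6.

```r
# Approximate two-sided p-value for Spearman's rho via a t statistic
spearman_p_approx <- function(rho, n) {
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
}

rho <- 0.1220944  # rho reported in the test output above

p_small_n <- spearman_p_approx(rho, n = 9)     # as if there were only 9 observations
p_large_n <- spearman_p_approx(rho, n = 1184)  # assumed number of rows (inferred from S)

print(c(n_9 = p_small_n, n_1184 = p_large_n))
```

The same rho is nowhere near significant with 9 observations but falls well below 0.01 with over a thousand, so the small p here reflects the number of Pokemon tested rather than the strength of the trend.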
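The generation variable takes only 9 distinct values, so the trend rests on very few categories. While the result is generalizable in the sense that the sample is as big as the population, it cannot be used to predict future generations with any serious credibility. Note also that the test is run on every Pokemon in the dataset, and with that many observations p becomes very small even for a weak correlation.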
There is a very slight incline in the mean BST. Judging by this, while there is power creep, it creeps gradually. However, if one turns their attention to the violin plot, one can see a notable trend in the distribution of BST within generations. The violins begin to take on more of an hourglass figure: the middle becomes gradually thinner, with gradually larger bumps in the bottom and top halves. So the mean may not change much, but the concentration of power does.
Let us define our terms. Weight should be self-evident, but HP (Hit Points) describes how much damage a Pokemon can take before it “faints”. The idea is that the more meat you have on your bones, the more health you have.
ggplot(pkmn_data, aes(x = Weight..lbs., y = Health)) +
geom_point(alpha = 0.3) + # Scatter plot with transparency
geom_smooth(method = "loess", color = "blue") + # Add a smooth line
labs(
title = "Weight vs Health (Not applicable to people) ",
x = "Weight", y = "Health" ) + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This chart is heavily skewed by a few outliers. Some Pokemon are extremely light with extremely high HP (balloon-esque), and some are extremely heavy with extremely low HP (because they are mostly metal and therefore have higher defense instead). Therefore a new chart cleaned of such outliers must be made for a more accurate view.
filter_data <- pkmn_data %>%
filter(Weight..lbs. >= 50 & Weight..lbs. <= 500)
ggplot(filter_data, aes(x = Weight..lbs., y = Health)) +
geom_point(alpha = 0.3) + # Scatter plot with transparency
geom_smooth(method = "loess", color = "blue") + # Add a smooth line
labs(
title = "Weight(Between 50-500lbs) vs Health (Not applicable to people) ", x = "Weight", y = "Health" ) + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Now we will run tests for both charts. First we will check the normality of each variable in both data frames. Then we will test the unfiltered data, followed by the filtered data.
H-0: The data is normally distributed
H-1: The data is not normally distributed
For this test let us set α = 0.01.
shapiro.test(pkmn_data$Health)
##
## Shapiro-Wilk normality test
##
## data: pkmn_data$Health
## W = 0.92088, p-value < 2.2e-16
shapiro.test(filter_data$Health)
##
## Shapiro-Wilk normality test
##
## data: filter_data$Health
## W = 0.87301, p-value < 2.2e-16
shapiro.test(pkmn_data$Weight..lbs)
##
## Shapiro-Wilk normality test
##
## data: pkmn_data$Weight..lbs
## W = 0.53325, p-value < 2.2e-16
shapiro.test(filter_data$Weight..lbs)
##
## Shapiro-Wilk normality test
##
## data: filter_data$Weight..lbs
## W = 0.84188, p-value < 2.2e-16
As p < 0.01 for every variable (and W is well below 1 in each case), we reject normality and therefore cannot use the Pearson test. An astute reader may have noticed that the same value for p keeps appearing. This is not an exact value: 2.2e-16 is R's machine epsilon, and any p-value smaller than it is simply printed as "< 2.2e-16". It means only that p is incredibly small.
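The repeated "< 2.2e-16" is a display threshold and can be inspected directly. The sketch below uses simulated skewed values (not the Pokemon data) to show that the test object still stores the raw p-value, and that `format.pval()` reports anything below `.Machine$double.eps` with a leading "<".

```r
set.seed(1)
skewed <- rexp(1000)  # simulated, strongly right-skewed values

# The raw p-value is stored in the result, whatever the printout shows
p_raw <- shapiro.test(skewed)$p.value
print(p_raw)

# Machine epsilon: the cutoff used by default when p-values are formatted
print(.Machine$double.eps)

# Anything below that cutoff is shown as "less than" the cutoff
print(format.pval(1e-20))
```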
H-0: There is no significant monotonic relationship between weight and health
H-1: There is a significant monotonic relationship between weight and health
For this test let us set α = 0.01.
cor.test(pkmn_data$Health, pkmn_data$Weight..lbs., method = "spearman")
## Warning in cor.test.default(pkmn_data$Health, pkmn_data$Weight..lbs., method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: pkmn_data$Health and pkmn_data$Weight..lbs.
## S = 102368130, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6299493
Here rho is fairly high, which indicates a moderately strong monotonic relationship between HP and weight. And as p << 0.01, it is highly unlikely to be a result of random chance.
H-0: There is no significant monotonic relationship between weight and health in the filtered data
H-1: There is a significant monotonic relationship between weight and health in the filtered data
For this test let us set α = 0.01.
cor.test(filter_data$Health, filter_data$Weight..lbs., method = "spearman")
## Warning in cor.test.default(filter_data$Health, filter_data$Weight..lbs., :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: filter_data$Health and filter_data$Weight..lbs.
## S = 21503997, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3588216
Rho is quite a bit smaller now, which suggests that the low-weight, high-HP Pokemon skewed the results more than the high-weight, low-HP Pokemon. In any case this still confirms a tangible monotonic relationship between health and weight. And as p << 0.01, it is highly unlikely to be due to random chance.
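A drop in rho after filtering does not by itself show which outliers mattered most, since restricting the range of one variable tends to weaken a correlation on its own. The following sketch illustrates that effect on simulated data; the variables are invented for the example and have nothing to do with the actual dataset beyond reusing the 50 to 500 cutoffs.

```r
set.seed(42)
n <- 2000

# Simulated right-skewed "weight" and a "health" that rises with log(weight)
weight <- exp(rnorm(n, mean = 4, sd = 1.5))
health <- log(weight) + rnorm(n, sd = 1.5)

# Spearman correlation on the full simulated range
rho_full <- cor(weight, health, method = "spearman")

# Same correlation after keeping only weights between 50 and 500
keep <- weight >= 50 & weight <= 500
rho_restricted <- cor(weight[keep], health[keep], method = "spearman")

print(c(full = rho_full, restricted = rho_restricted))
```

Even though the underlying relationship is identical in both cases, the restricted sample shows a weaker rho, so part of the drop seen above may come from the narrower weight range rather than from the removed outliers.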
Each of these variables has many extreme outliers, which skewed the results considerably even though the sample size is decently large. The results are of course generalizable, as the sample is equal in size to the population.