This week we will be performing A/B testing and using hypothesis testing to check for a difference between two groups.
For our first hypothesis, let’s consider legendary Pokemon. Legendary Pokemon are special Pokemon that are typically found at the end of each Pokemon game, and you can typically only battle and capture them once, making them incredibly rare. There can be any number of Legendary Pokemon in a game, and some games have more legendary Pokemon than others.
Most Pokemon are modeled after real-life objects and creatures found in nature. However, some typings such as ‘Dragon’ and ‘Fairy’ are more ‘mystical’ than logical typings such as ‘water’ or ‘rock’. As such, let’s assume that …
\(H_0\): Legendary status is independent of Pokemon type (tested with a chi-square test).
\(H_1\): Legendary status is not independent of Pokemon type.
What we’re asking is whether the proportion of Legendary Pokemon is roughly the same across all types. In other words, if 10% of all Pokemon are legendary overall, then under independence we would expect approximately 10% of water types, fire types, grass types, and so on to be legendary.
First, let’s determine whether we have enough data to perform a hypothesis test using the Neyman-Pearson framework. This framework allows us to be more objective because the significance level and the rejection rule are fixed in advance, which gives us explicit guidelines for when we can reject a hypothesis (important because statistics is most valuable for disproving hypotheses).
The test we’ll be using is the chi-square test. Since both variables (type and legendary status) are categorical, this test measures how far the observed counts deviate from the counts we would expect under independence.
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
Here O and E are the observed and expected counts in each cell, and the sum runs over every cell of the table.
For our chosen significance level, let’s say \(\alpha = 0.05\), we reject the null hypothesis if:
\[ \chi^2 \geq \chi^2_{\alpha,\, df} \]
When the chi-square statistic becomes too large, this indicates that our observed and expected counts are very different, which leads us to reject the null hypothesis.
Here’s a contingency table of type against legendary status to make this even easier to read (column 0 is non-legendary, column 1 is legendary):
##
## 0 1
## bug 69 3
## dark 26 3
## dragon 20 7
## electric 34 5
## fairy 17 1
## fighting 28 0
## fire 47 5
## flying 2 1
## ghost 26 1
## grass 74 4
## ground 30 2
## ice 21 2
## normal 102 3
## poison 32 0
## psychic 36 17
## rock 41 4
## steel 18 6
## water 108 6
These are our observed numbers of legendary and non-legendary Pokemon. Now that we have a table, we can perform a chi-squared test.
## Warning in chisq.test(legendary_table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: legendary_table
## X-squared = 73.908, df = 17, p-value = 4.533e-09
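The code that produced this output is not shown, so the following is only a minimal sketch of how it could be reproduced. It rebuilds `legendary_table` from the counts printed above rather than from the original data set, and it assumes that column `0` means non-legendary and column `1` means legendary. The name `types` is introduced here for illustration.

```r
# Counts copied from the printed table
# (assumption: column "0" = non-legendary, column "1" = legendary).
types <- c("bug", "dark", "dragon", "electric", "fairy", "fighting", "fire",
           "flying", "ghost", "grass", "ground", "ice", "normal", "poison",
           "psychic", "rock", "steel", "water")
legendary_table <- cbind(
  "0" = c(69, 26, 20, 34, 17, 28, 47, 2, 26, 74, 30, 21, 102, 32, 36, 41, 18, 108),
  "1" = c(3, 3, 7, 5, 1, 0, 5, 1, 1, 4, 2, 2, 3, 0, 17, 4, 6, 6)
)
rownames(legendary_table) <- types

# Pearson chi-squared test of independence. suppressWarnings() only hides
# the "approximation may be incorrect" warning already shown in the output.
result <- suppressWarnings(chisq.test(legendary_table))

result$statistic  # X-squared
result$parameter  # degrees of freedom
result$p.value
```

Keeping the result in an object such as `result` makes the statistic, degrees of freedom, and p-value available for later steps instead of only being printed.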
There are 18 possible types and 2 categories (legendary vs. non-legendary), meaning our df are calculated as follows:
\[ df = (18-1)(2-1)= 17 \]
At \(\alpha = 0.05\), the critical chi-squared value is about
\[ \chi^2_{0.05,\,17} \approx 27.59 \]
So the decision rule is…
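As a rough check, the critical value and the comparison against the test statistic can be computed directly. This sketch hard-codes the statistic reported by `chisq.test` above. The variable names are illustrative only.

```r
alpha <- 0.05
df <- (18 - 1) * (2 - 1)       # (rows - 1) * (columns - 1)
x_squared <- 73.908            # statistic copied from the chisq.test output

# Upper-tail critical value of the chi-squared distribution.
critical_value <- qchisq(1 - alpha, df)

# Upper-tail probability of a statistic at least this large under H0.
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

critical_value
x_squared >= critical_value    # TRUE means we reject the null hypothesis
p_value < alpha                # the equivalent decision via the p-value
```

Both comparisons express the same rule: the statistic exceeds the critical value exactly when the p-value falls below the significance level.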
Let’s also look at our expected values to compare them with our observations:
## Warning in chisq.test(legendary_table): Chi-squared approximation may be
## incorrect
##
## 0 1
## bug 65.707865 6.2921348
## dark 26.465668 2.5343321
## dragon 24.640449 2.3595506
## electric 35.591760 3.4082397
## fairy 16.426966 1.5730337
## fighting 25.553059 2.4469413
## fire 47.455680 4.5443196
## flying 2.737828 0.2621723
## ghost 24.640449 2.3595506
## grass 71.183521 6.8164794
## ground 29.203496 2.7965044
## ice 20.990012 2.0099875
## normal 95.823970 9.1760300
## poison 29.203496 2.7965044
## psychic 48.368290 4.6317104
## rock 41.067416 3.9325843
## steel 21.902622 2.0973783
## water 104.037453 9.9625468
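The expected counts above follow from the margins of the observed table: each one is (row total × column total) / grand total. The sketch below recomputes them by hand from the printed counts, then looks at how much each cell contributes to the chi-squared statistic. It assumes column `1` is the legendary column. The names `expected` and `contribution` are illustrative.

```r
# Observed counts copied from the printed table
# (assumption: column "0" = non-legendary, column "1" = legendary).
types <- c("bug", "dark", "dragon", "electric", "fairy", "fighting", "fire",
           "flying", "ghost", "grass", "ground", "ice", "normal", "poison",
           "psychic", "rock", "steel", "water")
legendary_table <- cbind(
  "0" = c(69, 26, 20, 34, 17, 28, 47, 2, 26, 74, 30, 21, 102, 32, 36, 41, 18, 108),
  "1" = c(3, 3, 7, 5, 1, 0, 5, 1, 1, 4, 2, 2, 3, 0, 17, 4, 6, 6)
)
rownames(legendary_table) <- types

# Expected count under independence: row total * column total / grand total.
n <- sum(legendary_table)
expected <- outer(rowSums(legendary_table), colSums(legendary_table)) / n

# Per-cell contribution (O - E)^2 / E; the contributions sum to the statistic.
contribution <- (legendary_table - expected)^2 / expected
sum(contribution)

# Cells with an expected count below 5 are what trigger R's warning.
sum(expected < 5)

# Types whose legendary cell contributes most to the statistic.
round(sort(contribution[, "1"], decreasing = TRUE)[1:3], 2)
```

Sorting the contributions shows which types drive the result. With these counts the psychic legendary cell alone accounts for roughly 33 of the total 73.9.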
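As we can see from both our tables and our critical value, we need to reject independence: the test statistic (73.908) is far above the critical value (27.59), and the p-value (4.533e-09) is far below 0.05. Note that R warns that the chi-squared approximation may be incorrect, which happens when some expected counts are below 5, so the exact p-value should be read with some caution.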
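Because R warned that the chi-squared approximation may be incorrect, one optional robustness check is a Monte Carlo p-value, which does not rely on that approximation. This is a sketch of an additional check, not part of the original analysis. It uses the `simulate.p.value` and `B` arguments of `chisq.test`, and the seed is arbitrary.

```r
# Observed counts copied from the printed table
# (assumption: column "0" = non-legendary, column "1" = legendary).
types <- c("bug", "dark", "dragon", "electric", "fairy", "fighting", "fire",
           "flying", "ghost", "grass", "ground", "ice", "normal", "poison",
           "psychic", "rock", "steel", "water")
legendary_table <- cbind(
  "0" = c(69, 26, 20, 34, 17, 28, 47, 2, 26, 74, 30, 21, 102, 32, 36, 41, 18, 108),
  "1" = c(3, 3, 7, 5, 1, 0, 5, 1, 1, 4, 2, 2, 3, 0, 17, 4, 6, 6)
)
rownames(legendary_table) <- types

set.seed(42)  # arbitrary seed so the simulation is repeatable

# The p-value is estimated from B random tables with the same margins
# instead of from the chi-squared distribution.
mc <- chisq.test(legendary_table, simulate.p.value = TRUE, B = 10000)
mc$p.value
```

With `B = 10000` replicates, the smallest p-value this approach can report is 1/(B + 1), so it confirms the decision at the 0.05 level rather than reproducing the tiny asymptotic p-value.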
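This is a mosaic plot. This kind of plot is well suited to showing independence vs. dependence because each tile’s area is proportional to an observed count. Under independence the legendary share would look the same for every type, so departures from the expected values above are directly visible.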
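The plotting code is not shown, so here is one way such a plot could be drawn with base R’s `mosaicplot`. This is a sketch under assumptions: the table is rebuilt from the printed counts, the dimension labels `Type` and `Legendary` are made up for the axes, and `shade = TRUE` is a choice made here, not necessarily the author’s.

```r
# Observed counts copied from the printed table
# (assumption: column "0" = non-legendary, column "1" = legendary).
types <- c("bug", "dark", "dragon", "electric", "fairy", "fighting", "fire",
           "flying", "ghost", "grass", "ground", "ice", "normal", "poison",
           "psychic", "rock", "steel", "water")
legendary_table <- cbind(
  "0" = c(69, 26, 20, 34, 17, 28, 47, 2, 26, 74, 30, 21, 102, 32, 36, 41, 18, 108),
  "1" = c(3, 3, 7, 5, 1, 0, 5, 1, 1, 4, 2, 2, 3, 0, 17, 4, 6, 6)
)
rownames(legendary_table) <- types

# Convert to a table and label the two dimensions for the plot axes.
tab <- as.table(legendary_table)
names(dimnames(tab)) <- c("Type", "Legendary")

# shade = TRUE colours tiles by their residuals from the independence model,
# so cells with more or fewer Pokemon than expected stand out.
mosaicplot(tab, shade = TRUE, las = 2,
           main = "Legendary status by Pokemon type")
```

With shading turned on, the plot shows the observed proportions through tile size and the deviation from independence through colour.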
For our next hypothesis, let’s assess whether the attack stat of Pokemon in our data set tends to stay around the average. In other words…
\(H_0\): The attack stat values in the Pokemon data set follow a normal distribution
\(H_1\): The attack stat values in the Pokemon data set do not follow a normal distribution
Instead of our previous framework, we’ll be using Fisher’s Significance Testing framework. In this framework, we assume that our null hypothesis is true and use the p-value to measure how surprising our data would be under that assumption (small p-value: evidence against the null hypothesis; large p-value: do not reject the null hypothesis).
Based on our null hypothesis, we should look for the following:
Bell-shaped histogram
Majority of attack stats falling near the average
Very high and very low attack values are rare
Evidence that would contradict normality in this case includes the following (non-exhaustive):
Multiple clusters
A long right tail (very strong Pokemon as you read right on the graph)
Big gaps or spikes
##
## Shapiro-Wilk normality test
##
## data: pokemon$attack
## W = 0.97948, p-value = 3.581e-09
The W reported above is the Shapiro-Wilk statistic. It gives a numerical measure of how closely the data match a normal distribution and is used to compute our p-value. If our data are closely aligned with a normal curve, W is close to 1; if the data deviate (with clusters or long tails), W gets smaller.
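To get a feel for how W behaves, the sketch below runs `shapiro.test` on two simulated samples: one drawn from a normal distribution and one from a strongly right-skewed (exponential) distribution. The samples are simulated for illustration only and are not the Pokemon data. The sample size, parameters, and seed are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

# Simulated illustration data (not the Pokemon data set).
normal_sample <- rnorm(500, mean = 80, sd = 30)  # bell-shaped
skewed_sample <- rexp(500, rate = 1 / 80)        # long right tail

# Extract the W statistic from each test result.
w_normal <- unname(shapiro.test(normal_sample)$statistic)
w_skewed <- unname(shapiro.test(skewed_sample)$statistic)

c(normal = w_normal, skewed = w_skewed)
```

The normal sample should give a W very close to 1, while the skewed sample gives a noticeably smaller W.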
We can see from our numbers that our p-value is incredibly small, meaning that there is strong evidence against the hypothesis that Attack is normally distributed.
Let’s plot this to drive our point home:
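The original plotting code is not shown. One common way to visualise normality is a histogram with a fitted normal curve next to a normal Q-Q plot, sketched below. It uses `pokemon$attack` if a `pokemon` data frame is loaded. Otherwise it falls back to a simulated stand-in so the example still runs. The stand-in is not the real data, and its gamma parameters are arbitrary.

```r
# Use the real attack stats when available; otherwise fall back to a
# simulated, right-skewed stand-in (NOT the Pokemon data).
if (exists("pokemon") && is.data.frame(pokemon) && is.numeric(pokemon$attack)) {
  attack <- pokemon$attack
} else {
  set.seed(7)
  attack <- round(rgamma(801, shape = 6, scale = 13))
}

op <- par(mfrow = c(1, 2))  # two panels side by side

# Histogram on the density scale with the fitted normal curve on top.
hist(attack, breaks = 30, freq = FALSE,
     main = "Attack stat", xlab = "Attack")
curve(dnorm(x, mean = mean(attack), sd = sd(attack)), add = TRUE, lwd = 2)

# Q-Q plot: points bending away from the line indicate non-normal tails.
qqnorm(attack, main = "Normal Q-Q plot of attack")
qqline(attack)

par(op)  # restore the previous plotting layout
```

The Q-Q plot is often easier to read than the histogram, since a long right tail shows up as points curving above the reference line at the upper end.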
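Although the distribution looks close to a bell curve at a glance, the Shapiro-Wilk test shows that the attack stat does not follow a normal distribution. With a data set of several hundred Pokemon, the test can pick up even modest departures from normality, which is why W can still be close to 1 while the p-value is tiny.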