As per usual, I’m providing a summary of the Pokemon data set I am using for this week’s data dive:
This week, we are taking the time to think critically about our data, since things often do not go to plan when drawing conclusions.
In particular, by taking relatively small samples from our population, we should be able to make inferences about the entire population. However, for a sample to be representative of the population, every member of the population must have an equal chance of being selected. When this is not guaranteed, we have to determine whether the sample is biased.
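As a rough illustration of that idea, here is a minimal R sketch (the data frame name `pokemon` and the 10% fraction are my own assumptions, not taken from the analysis below): `dplyr::slice_sample()` gives every row an equal chance of selection, while adding a `weight_by` argument deliberately breaks that equal-chance condition and produces a biased sample.

```r
library(dplyr)

# Simple random sample: every Pokemon has the same chance of selection.
# `pokemon` is an assumed name for the data set summarized below.
fair_sample <- pokemon |>
  slice_sample(prop = 0.10)

# A biased sample for contrast: rows with a higher base_total are more
# likely to be chosen, so strong Pokemon are over-represented.
biased_sample <- pokemon |>
  slice_sample(prop = 0.10, weight_by = base_total)
```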
## # A tibble: 960 × 42
## abilities against_bug against_dark against_dragon against_electric
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ['Sturdy', 'Sand St… 1 1 1 1
## 2 ['Hyper Cutter', 'S… 1 1 1 2
## 3 ['Swarm', 'Early Bi… 0.5 1 1 2
## 4 ['Insomnia', 'Forew… 2 2 1 1
## 5 ['Own Tempo', 'Tech… 1 1 1 1
## 6 ['Pure Power', 'Tel… 1 1 1 1
## 7 ['Intimidate', 'She… 0.5 1 1 1
## 8 ['Sturdy', 'Weak Ar… 1 1 1 1
## 9 ['Pressure', 'Inner… 1 1 1 2
## 10 ['Intimidate', 'Mox… 2 0.5 1 0
## # ℹ 950 more rows
## # ℹ 37 more variables: against_fairy <dbl>, against_fight <dbl>,
## # against_fire <dbl>, against_flying <dbl>, against_ghost <dbl>,
## # against_grass <dbl>, against_ground <dbl>, against_ice <dbl>,
## # against_normal <dbl>, against_poison <dbl>, against_psychic <dbl>,
## # against_rock <dbl>, against_steel <dbl>, against_water <dbl>, attack <int>,
## # base_egg_steps <int>, base_happiness <int>, base_total <int>, …
## # A tibble: 7 × 3
## generation count_ mean_hp
## <int> <int> <dbl>
## 1 1 151 64.3
## 2 2 100 71.0
## 3 3 135 65.7
## 4 4 107 73.1
## 5 5 156 70.3
## 6 6 72 71.1
## 7 7 80 70.6
Below, I’ve created three tables, each of which summarizes 3 random samples drawn with replacement from the Pokemon data set, grouped by generation. The three tables use progressively larger sampling fractions of the full set.
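Here is a minimal sketch of how one of these tables could be produced; the helper name `sample_generation_summary`, the `hp` column, and the 25% fraction are assumptions for illustration rather than the exact code behind the output below.

```r
library(dplyr)
library(purrr)

# Draw one sample with replacement and summarize HP by generation.
sample_generation_summary <- function(frac, sample_num) {
  pokemon |>
    slice_sample(prop = frac, replace = TRUE) |>
    group_by(generation) |>
    summarise(count_ = n(), mean_hp = mean(hp), .groups = "drop") |>
    mutate(sample_num = sample_num)
}

# Stack three such samples into one 21-row table
# (7 generations x 3 samples), here at a 25% sampling fraction.
map_dfr(1:3, \(i) sample_generation_summary(frac = 0.25, sample_num = i))
```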
## # A tibble: 21 × 4
## generation count_ mean_hp sample_num
## <int> <int> <dbl> <int>
## 1 1 43 61.6 1
## 2 2 23 72.6 1
## 3 3 37 73.3 1
## 4 4 17 74.6 1
## 5 5 38 77.9 1
## 6 6 19 62.1 1
## 7 7 23 68.7 1
## 8 1 30 68 2
## 9 2 26 75.1 2
## 10 3 42 60.5 2
## # ℹ 11 more rows
## # A tibble: 21 × 4
## generation count_ mean_hp sample_num
## <int> <int> <dbl> <int>
## 1 1 87 59.6 1
## 2 2 51 70.6 1
## 3 3 68 70.4 1
## 4 4 61 77.7 1
## 5 5 60 70.8 1
## 6 6 29 77.2 1
## 7 7 44 70.6 1
## 8 1 74 65.1 2
## 9 2 51 66.5 2
## 10 3 76 58.7 2
## # ℹ 11 more rows
## # A tibble: 21 × 4
## generation count_ mean_hp sample_num
## <int> <int> <dbl> <int>
## 1 1 131 64.4 1
## 2 2 68 66.5 1
## 3 3 91 66.0 1
## 4 4 78 71.9 1
## 5 5 127 67.9 1
## 6 6 47 75.3 1
## 7 7 58 74.4 1
## 8 1 104 61.7 2
## 9 2 62 75.0 2
## 10 3 117 69.0 2
## # ℹ 11 more rows
Across the three tables above, the sampling fraction increases from 25% to 75% of the data set. With any sample, we expect some sampling error: sample statistics will differ slightly from the population parameters. However, the Central Limit Theorem tells us that, provided the sample size is large enough, the distribution of sample means will be approximately normal regardless of the underlying population distribution.
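As a quick check on that claim, the sketch below (object names and the sample size of 100 are assumptions) repeatedly draws samples of HP and histograms the resulting sample means; the histogram should look roughly bell-shaped even if raw HP is skewed.

```r
# Simulate the sampling distribution of mean HP.
set.seed(42)
sample_means <- replicate(
  1000,
  mean(sample(pokemon$hp, size = 100, replace = TRUE))
)
hist(sample_means, main = "Sampling distribution of mean HP (n = 100)")
```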
While the sample means differ from frame to frame, the differences are slight. Any anomaly I might have detected in one sub-sample would likely have been a point anomaly, where an individual data point deviates significantly (10+ points) from the rest of the set; this would indicate either a rare event in the sample or some type of error.
What is consistent across all sub-samples is the pattern of generation-level averages: the increases and decreases from one generation to the next. While the raw data itself is not necessarily normally distributed, the larger samples show less variability in the sample mean. As such, we can conclude that with larger samples, the sample mean becomes a more reliable estimate of the population mean.
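One way to verify this pattern (again only a sketch, with `pokemon` and the fractions assumed) is to compute the spread of simulated sample means at each sampling fraction; the standard deviation should shrink roughly in proportion to 1/sqrt(n).

```r
# Standard deviation of simulated mean HP at 25%, 50%, and 75% sampling fractions.
set.seed(42)
sapply(c(0.25, 0.50, 0.75), function(frac) {
  n <- round(frac * nrow(pokemon))
  sd(replicate(1000, mean(sample(pokemon$hp, size = n, replace = TRUE))))
})
```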