Week 4

As per usual, I’m providing a summary of the Pokemon data set I am using for this week’s data dive:

This week, we are taking the time to think critically about our data, since things often do not go to plan when drawing conclusions from it.

In particular, by taking relatively small samples from our population, we should be able to make inferences about the entire population. However, for a sample to be representative of the population, every member of the population must have an equal chance of being selected. When that is not guaranteed, we have to determine whether the sample is biased.

Original Samples
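The output below comes from loading dplyr and printing the data set. Here is a minimal sketch of that step; the file name (pokemon.csv) and the data frame name (pokemon) are assumptions, since the post doesn’t show the load code:

```r
library(dplyr)
library(readr)

# Hypothetical file name; the load step isn't shown in the post.
pokemon <- read_csv("pokemon.csv")

# Printing a tibble produces the preview shown below.
pokemon
```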

## # A tibble: 960 × 42
##    abilities            against_bug against_dark against_dragon against_electric
##    <chr>                      <dbl>        <dbl>          <dbl>            <dbl>
##  1 ['Sturdy', 'Sand St…         1            1                1                1
##  2 ['Hyper Cutter', 'S…         1            1                1                2
##  3 ['Swarm', 'Early Bi…         0.5          1                1                2
##  4 ['Insomnia', 'Forew…         2            2                1                1
##  5 ['Own Tempo', 'Tech…         1            1                1                1
##  6 ['Pure Power', 'Tel…         1            1                1                1
##  7 ['Intimidate', 'She…         0.5          1                1                1
##  8 ['Sturdy', 'Weak Ar…         1            1                1                1
##  9 ['Pressure', 'Inner…         1            1                1                2
## 10 ['Intimidate', 'Mox…         2            0.5              1                0
## # ℹ 950 more rows
## # ℹ 37 more variables: against_fairy <dbl>, against_fight <dbl>,
## #   against_fire <dbl>, against_flying <dbl>, against_ghost <dbl>,
## #   against_grass <dbl>, against_ground <dbl>, against_ice <dbl>,
## #   against_normal <dbl>, against_poison <dbl>, against_psychic <dbl>,
## #   against_rock <dbl>, against_steel <dbl>, against_water <dbl>, attack <int>,
## #   base_egg_steps <int>, base_happiness <int>, base_total <int>, …

Mean HP by Generation
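A minimal sketch of how this summary could be produced with dplyr; the column names generation and hp match the output below, but the exact code is an assumption:

```r
# Count rows and average HP within each generation.
pokemon |>
  group_by(generation) |>
  summarise(count_ = n(), mean_hp = mean(hp), .groups = "drop")
```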

## # A tibble: 7 × 3
##   generation count_ mean_hp
##        <int>  <int>   <dbl>
## 1          1    151    64.3
## 2          2    100    71.0
## 3          3    135    65.7
## 4          4    107    73.1
## 5          5    156    70.3
## 6          6     72    71.1
## 7          7     80    70.6

Below, I’ve created tables of random samples drawn with replacement from the Pokemon data set. Each table holds three samples, grouped by generation, and the sample size grows from one table to the next, from roughly 25% to 75% of the entire set.
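Here is a sketch of how these tables could be generated; the helper function and the exact sample fractions are assumptions reconstructed from the counts in the output:

```r
# Draw `n_samples` random samples with replacement at a given fraction
# of the data, then summarize HP by generation within each sample.
sample_by_generation <- function(df, prop, n_samples = 3) {
  bind_rows(lapply(seq_len(n_samples), function(i) {
    df |>
      slice_sample(prop = prop, replace = TRUE) |>
      group_by(generation) |>
      summarise(count_ = n(), mean_hp = mean(hp), .groups = "drop") |>
      mutate(sample_num = i)
  }))
}

set.seed(123)  # for reproducible draws
small_samples  <- sample_by_generation(pokemon, prop = 0.25)  # "Small" (S)
medium_samples <- sample_by_generation(pokemon, prop = 0.50)  # "Medium" (M)
large_samples  <- sample_by_generation(pokemon, prop = 0.75)  # "Large" (L)
```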

Random Samples Grouped By Generation

“Small” (S)

## # A tibble: 21 × 4
##    generation count_ mean_hp sample_num
##         <int>  <int>   <dbl>      <int>
##  1          1     43    61.6          1
##  2          2     23    72.6          1
##  3          3     37    73.3          1
##  4          4     17    74.6          1
##  5          5     38    77.9          1
##  6          6     19    62.1          1
##  7          7     23    68.7          1
##  8          1     30    68            2
##  9          2     26    75.1          2
## 10          3     42    60.5          2
## # ℹ 11 more rows

“Medium” (M)

## # A tibble: 21 × 4
##    generation count_ mean_hp sample_num
##         <int>  <int>   <dbl>      <int>
##  1          1     87    59.6          1
##  2          2     51    70.6          1
##  3          3     68    70.4          1
##  4          4     61    77.7          1
##  5          5     60    70.8          1
##  6          6     29    77.2          1
##  7          7     44    70.6          1
##  8          1     74    65.1          2
##  9          2     51    66.5          2
## 10          3     76    58.7          2
## # ℹ 11 more rows

“Large” (L)

## # A tibble: 21 × 4
##    generation count_ mean_hp sample_num
##         <int>  <int>   <dbl>      <int>
##  1          1    131    64.4          1
##  2          2     68    66.5          1
##  3          3     91    66.0          1
##  4          4     78    71.9          1
##  5          5    127    67.9          1
##  6          6     47    75.3          1
##  7          7     58    74.4          1
##  8          1    104    61.7          2
##  9          2     62    75.0          2
## 10          3    117    69.0          2
## # ℹ 11 more rows

For each of the samples above, the amount of data available increases from roughly 25% to 75% of the full set. We should expect some sampling error, whereby sample values differ slightly from the population parameter. However, the Central Limit Theorem tells us that, as long as the sample size is large enough, the distribution of the sample means will be approximately normal regardless of the underlying population distribution.
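One quick way to see this empirically is to resample HP repeatedly at each sample fraction and watch the spread of the resulting sample means shrink. A sketch, assuming the same pokemon data frame as above:

```r
# For each sample fraction, draw 1,000 samples of HP with replacement,
# record each sample mean, and report the standard deviation of those
# means (the standard error); it shrinks as the sample fraction grows.
set.seed(42)
sapply(c(0.25, 0.50, 0.75), function(p) {
  means <- replicate(
    1000,
    mean(sample(pokemon$hp, size = round(p * nrow(pokemon)), replace = TRUE))
  )
  sd(means)
})
```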

While the sample means differ from frame to frame, the differences are slight. An anomaly detected in one sub-sample may have been a point anomaly, where an individual data point deviates significantly (10+ points) from the rest of the set. This would indicate a rare event in the sample or some type of error.
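As a rough check for such anomalies, one could flag generation-level sample means that stray from the overall mean by the same 10-point margin; a sketch reusing the objects from the sampling code above:

```r
# Flag generation-level sample means that deviate by 10+ HP points
# from the full data set's mean HP.
overall_mean_hp <- mean(pokemon$hp)

small_samples |>
  filter(abs(mean_hp - overall_mean_hp) >= 10)
```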

Consistent across all sub-samples are the per-generation sample averages and the way they rise or fall relative to subsequent generations. While the raw data itself is not necessarily normally distributed, the larger samples show less variability in the sample mean. As such, we can conclude that with larger samples, the sample mean becomes a more reliable estimate of the population mean.