As per usual, I’m providing a summary of the Pokemon data set I am using for this week’s data dive:
This week, we are taking the time to think critically about our data, because things often do not go to plan when drawing conclusions.
In particular, by taking relatively small samples from our population, we should have the ability to make inferences about the entire population. However, to ensure a sample is representative of the entire population, we must do our utmost to ensure that every member of the population has an equal chance of being selected. When this is not guaranteed, we have to determine if the sample is biased.
Statisticians aim for equal representation of their population before drawing conclusions, but this is rarely achieved perfectly. We strive for it to prevent conclusions that lead to loss of trust, legal and ethical consequences, discrimination and more. For example, if we were analyzing a data set of the profiles of incarcerated men in North America over the past decade, failing to identify biases in our data could have catastrophic consequences for men in different social, financial or ethnic groups.
Given the threats that bias poses to data, consider the following summary of the full data set by generation, followed by the samples drawn from it.
## # A tibble: 7 × 3
##   generation count_ mean_hp
##        <int>  <int>   <dbl>
## 1          1    151    64.3
## 2          2    100    71.0
## 3          3    135    65.7
## 4          4    107    73.1
## 5          5    156    70.3
## 6          6     72    71.1
## 7          7     80    70.6
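A per-generation summary like the one above is a group-by followed by a count and a mean. The sketch below shows that logic in plain Python; the function name `summarize_by_generation` and the toy rows are hypothetical stand-ins, not the code or data used to produce the table.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_generation(rows):
    """Return {generation: (count, mean_hp)} for rows of (generation, hp)."""
    groups = defaultdict(list)
    for generation, hp in rows:
        groups[generation].append(hp)
    # Round the mean to one decimal place, as in the printed table
    return {g: (len(hps), round(mean(hps), 1)) for g, hps in sorted(groups.items())}

# Hypothetical toy rows (generation, hp), NOT the real data set
toy_rows = [(1, 45), (1, 60), (1, 80), (2, 50), (2, 70)]
summary = summarize_by_generation(toy_rows)
print(summary)
```

The same function can be applied unchanged to a sample instead of the full data set, which is what makes the population and sample tables directly comparable.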
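Below, I have created three tables, each holding 3 random samples drawn with replacement from the Pokemon data set. The samples in the first table are 25% of the entire set (200 of 801 rows), those in the second are 50% (400 rows), and those in the third are 75% (600 rows).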
## # A tibble: 21 × 4
##    generation count_ mean_hp sample_num
##         <int>  <int>   <dbl>      <int>
##  1          1     44    64            1
##  2          2     26    77.7          1
##  3          3     28    73.2          1
##  4          4     26    75.2          1
##  5          5     46    67.2          1
##  6          6     11    58.6          1
##  7          7     19    77.4          1
##  8          1     35    76.1          2
##  9          2     21    71.0          2
## 10          3     27    65.7          2
## # ℹ 11 more rows
## # A tibble: 21 × 4
##    generation count_ mean_hp sample_num
##         <int>  <int>   <dbl>      <int>
##  1          1     91    63.1          1
##  2          2     56    74.8          1
##  3          3     59    67.2          1
##  4          4     69    74.5          1
##  5          5     64    73.6          1
##  6          6     28    74.9          1
##  7          7     33    71.9          1
##  8          1     87    63.1          2
##  9          2     64    64.8          2
## 10          3     66    67.7          2
## # ℹ 11 more rows
## # A tibble: 21 × 4
##    generation count_ mean_hp sample_num
##         <int>  <int>   <dbl>      <int>
##  1          1    115    67.8          1
##  2          2     58    72.8          1
##  3          3    108    62.6          1
##  4          4     95    70.0          1
##  5          5     92    72.7          1
##  6          6     69    72.6          1
##  7          7     63    76.0          1
##  8          1    104    62.0          2
##  9          2     82    72.5          2
## 10          3    102    64.5          2
## # ℹ 11 more rows
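The sampling step itself can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that produced the tables: `draw_samples` is a hypothetical helper, the population is filled with made-up HP values, and the sample size is truncated to a whole number of rows (an assumption that happens to match the 200, 400 and 600 rows seen in the tables for a population of 801).

```python
import random

def draw_samples(population, fraction, n_samples=3, seed=0):
    """Draw n_samples random samples WITH replacement, each of size int(fraction * N)."""
    rng = random.Random(seed)
    size = int(fraction * len(population))  # assumption: truncate to whole rows
    # random.choices samples with replacement, so a row can appear more than once
    return [rng.choices(population, k=size) for _ in range(n_samples)]

# Hypothetical stand-in population: 801 (generation, hp) rows with made-up HP values
fill_rng = random.Random(42)
population = [(fill_rng.randint(1, 7), fill_rng.randint(20, 120)) for _ in range(801)]

for fraction in (0.25, 0.50, 0.75):
    samples = draw_samples(population, fraction)
    print(fraction, [len(s) for s in samples])
```

Because every row has the same probability of being picked on every draw, each sample is a simple random sample with replacement. Any per-generation imbalance in a given sample is then down to chance rather than to the selection procedure.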
Across the three tables above, the sample size increases from 25% to 75% of the data set. As far as the values go, we should expect sampling error, whereby sample statistics differ somewhat from the population parameter. However, the Central Limit Theorem tells us that, as long as the sample size is large enough and the population variance is finite, the distribution of the sample mean is approximately normal regardless of the underlying population distribution.
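A small simulation makes the Central Limit Theorem claim more concrete. The sketch below assumes a deliberately skewed, made-up population (exponential, with mean 70) rather than the Pokemon HP values, and the helper `sample_means` is hypothetical. It repeatedly draws samples with replacement and compares how widely the sample means spread at two sample sizes.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
# Hypothetical skewed population (exponential, mean 70), NOT the real HP values
population = [rng.expovariate(1 / 70) for _ in range(10_000)]

def sample_means(population, sample_size, n_repeats, rng):
    """Means of n_repeats samples drawn with replacement."""
    return [mean(rng.choices(population, k=sample_size)) for _ in range(n_repeats)]

means_small = sample_means(population, 10, 1000, rng)
means_large = sample_means(population, 200, 1000, rng)

# Spread of the sample means: noticeably smaller for the larger sample size
spread_small = stdev(means_small)
spread_large = stdev(means_large)
print(round(spread_small, 1), round(spread_large, 1))
```

Even though the population here is far from normal, the means of the larger samples cluster tightly around the population mean. Plotting a histogram of `means_large` would be one way to check the approximately normal shape visually.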
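While the sample means differ from frame to frame, the differences are mostly slight. The exception is the smallest samples, where individual values can deviate significantly (10+ points) from the population value: in the first table of samples, generation 6 in sample 1 has a mean HP of 58.6 against 71.1 in the full data set, based on only 11 rows. A deviation of that size may point to a rare event in the sample or to some type of error, so it deserves a closer look.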
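The 10-point check can be automated. The sketch below uses the population means and the first sample of the first table, copied from the output above. The helper `flag_deviations` and its default threshold are illustrative choices, not part of the original analysis.

```python
# Population mean_hp per generation, copied from the first table
population_mean_hp = {1: 64.3, 2: 71.0, 3: 65.7, 4: 73.1, 5: 70.3, 6: 71.1, 7: 70.6}

# First table of samples, sample_num 1: generation -> (count_, mean_hp)
sample_1 = {1: (44, 64.0), 2: (26, 77.7), 3: (28, 73.2), 4: (26, 75.2),
            5: (46, 67.2), 6: (11, 58.6), 7: (19, 77.4)}

def flag_deviations(sample, population, threshold=10.0):
    """Return (generation, count, difference) where the sample mean is at
    least `threshold` points away from the population mean."""
    flagged = []
    for generation, (count, mean_hp) in sample.items():
        diff = mean_hp - population[generation]
        if abs(diff) >= threshold:
            flagged.append((generation, count, round(diff, 1)))
    return flagged

flagged = flag_deviations(sample_1, population_mean_hp)
print(flagged)
```

For this sample, only generation 6 crosses the threshold (58.6 against 71.1), and it is also the group with the fewest rows (11). Reporting the count next to the difference helps separate a genuinely odd value from ordinary small-sample noise.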
The sample averages are broadly consistent across the sub-samples, as is the pattern of increases and decreases from one generation to the next. While the raw data itself is not necessarily normally distributed, the larger samples show less variability in the sample mean. As such, we can conclude that with larger samples, the sample mean becomes a more reliable estimate of the population mean.
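The link between sample size and variability can be written down explicitly. For $n$ independent draws (sampling with replacement, as here) from a population with standard deviation $\sigma$, the standard error of the sample mean shrinks with the square root of $n$. The worked ratio below compares the 600-row and 200-row samples and assumes the same $\sigma$ in both cases:

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\mathrm{SE}_{n=600}}{\mathrm{SE}_{n=200}}
  = \sqrt{\frac{200}{600}}
  = \frac{1}{\sqrt{3}}
  \approx 0.58
```

Tripling the sample size therefore cuts the expected error of the overall mean by a bit over 40%, not by two thirds. The per-generation means rest on far fewer rows than the whole sample (11 for generation 6 in the first 25% sample, against 69 in the first 75% sample), which is why they move around more than the overall figures.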
Following these assessments, there is an understanding that we should not draw conclusions in absolutes. We can utilize samples as a powerful tool to understand things about our data and the contexts in which it exists if we also understand that we are getting closer to the truth as opposed to concluding the truth with exactness.