library(DT)
value <- c(1:6)
Average/Mean - The sum of all observations divided by the number of observations
Expected Mean - The average value expected from a random variable. It is the sum of the products of each possible outcome and its probability
Estimator - A statistic, i.e. a rule applied to sample data, used to estimate an unknown population quantity
Estimate - The value the estimator produces for a particular sample
In my last article, I wrote about how the sample mean is an unbiased estimator of the population mean and hence a sensible method of estimating it. The link to that article is: https://rpubs.com/Alabhya/Estimator.
In this article, I will write more on why the sample mean makes a great estimator of the population mean. This article will go through the sampling distribution of the mean and the central limit theorem in order to investigate the relationship between a population and the samples taken from it. The die simulation will once again be used as the example.
(Note: A fair die has an equal likelihood of every outcome. This means that the probability of getting any face from rolling the die is the same. In a six-faced die, every face has a 1/6 probability of occurrence.)
Suppose that someone is extracting the mean from the outcome of a die. They can note the outcome each time the die is rolled. Since only one die is rolled, the outcome itself is the average of the sample. There are only six possible outcomes: (1), (2), (3), (4), (5), (6). The probability of getting any one of these outcomes is 1/6 (or 0.1667).
All the possible values and their probability distribution are tabulated below.
# Name the table dimension explicitly so the resulting column is called "dice"
dice <- as.data.frame(prop.table(table(dice = value)))
datatable(dice, rownames = FALSE)
A bar plot represents this probability distribution more clearly.
barplot(prop.table(table(dice$dice)))
It can be seen that the possible values are exactly the face values of the die (i.e. 1, 2, 3, 4, 5, 6). There can be no other value. Since the sample size is one, the average always equals the outcome, because the outcome is divided by one.
From the table, the expected mean of the sample can be calculated. The expected mean is simply the sum of each outcome multiplied by its corresponding probability. For a sample of size one, the expected mean is as follows
sum(as.double(as.character(dice$dice)) * dice$Freq)
## [1] 3.5
The expected mean is 3.5, which is exactly equal to the population mean. The population mean is the average of the six faces, i.e. (1+2+3+4+5+6)/6 = 3.5.
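Written out as a formula, the calculation above is the usual definition of the expected value of a discrete random variable. The symbols X (the outcome of one roll) and x (a face value) are notation introduced here for illustration only:

```latex
E[X] = \sum_{x=1}^{6} x \cdot P(X = x)
     = \frac{1}{6}\,(1 + 2 + 3 + 4 + 5 + 6)
     = \frac{21}{6}
     = 3.5
```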
Now, when two dice are rolled, the average value of the sample is the sum of the two outcomes divided by two. For instance, if the outcome is (2,5), the average value is (2+5)/2, which is 3.5.
The following table lists all the possible outcomes of rolling two dice and their average values.
dice2 <- as.data.frame(expand.grid(Dice1=value,Dice2=value))
dice2$Average <- rowMeans(dice2)
datatable(dice2, rownames = F)
Below is the tabulation of the average values for a sample of size two. The table lists all the possible average values after rolling two dice and the number of times each occurs.
datatable(as.data.frame(table(dice2$Average)), rownames = F)
From the table above, it can be seen that the average value ranges from one to six, as expected, since the face values start at one and end at six. Only one combination gives an average of 1: when both rolls are 1 (i.e. 1,1). Likewise for an average of six. These are the extreme values, and they have the least probability of occurrence.
The average value of 3.5 occurs six times, more often than any other value: for the outcomes (1,6), (2,5), (3,4), (4,3), (5,2), (6,1).
A reminder: 3.5 is the population mean as well as the expected mean of a sample of size one.
Out of the 36 possible outcomes, six are exactly equal to the population mean. In other words, on average one roll of two dice in six gives exactly the population mean.
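The count can also be checked by filtering the outcomes directly. This is a minimal sketch that rebuilds the two-dice grid under its own illustrative names (`two_dice`, `hits`), so it does not depend on the objects created earlier:

```r
# All 36 ordered outcomes of rolling two dice (object names are illustrative)
two_dice <- expand.grid(Dice1 = 1:6, Dice2 = 1:6)
two_dice$Average <- rowMeans(two_dice)

# Outcomes whose average equals the population mean of 3.5.
# Averages of two integers are multiples of 0.5, so `==` is safe here.
hits <- two_dice[two_dice$Average == 3.5, ]
hits

# Share of all outcomes that hit the population mean exactly
nrow(hits) / nrow(two_dice)
```

The last line gives the probability that the average of two dice equals the population mean exactly.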
The probability distribution of the average is given in the next table.
datatable(as.data.frame(prop.table(table(dice2$Average))), rownames = F)
It is better represented with the barplot below.
barplot(prop.table(table(dice2$Average)))
You can now see that the distribution is no longer uniform. Moreover, the sample means are more likely to fall around the center, near 3.5, which is the value of the population mean.
The expected mean of rolling two dice is
dice2_p <- as.data.frame(prop.table(table(dice2$Average)))
sum(as.double(as.character(dice2_p$Var1)) * dice2_p$Freq)
## [1] 3.5
The tendency of the distribution may already be predictable from the two bar plots. However, for the sake of robustness, the outcome of a sample of size three will also be evaluated.
Now, instead of two, the average value of three dice rolls is recorded. If the dice show (1,3,4), the average value is (1+3+4)/3 ≈ 2.67. The following table shows all the possible outcomes and their averages when three dice are thrown.
dice3 <- expand.grid(Dice1=value,Dice2=value,Dice3=value)
dice3$Average <- rowMeans(dice3)
datatable(dice3, rownames = F)
There are 216 possible outcomes, so the full table may be difficult to interpret.
Just like before, the table below shows all the possible average values for a sample of size three.
datatable(as.data.frame(table(dice3$Average)), rownames = F)
Like before, the average lies between one and six. However, not a single combination has a mean of 3.5, the value of the population mean and of the expected mean for samples of size one and two (you can use the search widget to check). An average of 3.5 would require the three faces to sum to 10.5, which whole numbers cannot do. Nevertheless, the expected mean for a sample of size three is still 3.5. That is verified with the table below.
The probability distribution of the mean of a sample of size three is given below, followed by its expected mean.
datatable(as.data.frame(prop.table(table(dice3$Average))), rownames = F)
The expected mean is
dice3_p <- as.data.frame(prop.table(table(dice3$Average)))
sum(as.double(as.character(dice3_p$Var1)) * dice3_p$Freq)
## [1] 3.5
While no combination of outcomes was equal to 3.5, the population mean, the expectation of the sample mean is still 3.5. Therefore, it is still an unbiased estimator.
Whatever the size of the sample, the expected sample mean will always equal the population mean. That is the most important property of a sample mean.
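The same enumeration can be repeated for other sample sizes. The sketch below uses a hypothetical helper, `expected_sample_mean()`, which is not part of the original analysis. It lists every equally likely outcome of n fair dice and averages their sample means:

```r
# Hypothetical helper: expected value of the sample mean for n fair dice,
# found by enumerating all 6^n equally likely ordered outcomes.
expected_sample_mean <- function(n) {
  outcomes <- expand.grid(rep(list(1:6), n))  # one column per die
  mean(rowMeans(outcomes))                    # every row has probability 1 / 6^n
}

# Sample sizes 1 to 5 (at most 6^5 = 7776 rows, so this stays fast)
sapply(1:5, expected_sample_mean)
```

Each value should come out as 3.5, up to floating-point rounding. Enumeration grows as 6^n, so for larger sample sizes a simulation would be the more practical choice.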
Moreover, as the size of the sample increases, the average values from the samples tend to get closer to the expected value. The bar plot of the averages for a sample of size three shows how close the average values are to the expected mean.
barplot(prop.table(table(dice3$Average)))
The bar plot makes it clear that most of the probability lies close to the expected mean (and population mean). There are two peaks, both between 3 and 4 (at 3.33 and 3.67), which places them very close to the population mean of 3.5. The probability of the sample mean being close to the population mean grows as the sample size increases.
With a sample of size three, about 48% of all possible outcomes (104 of 216) have an average within 0.5 of the population mean, i.e. between 3 and 4 inclusive. Approximately 91% (196 of 216) are within 1.5, i.e. between 2 and 5 inclusive.
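These shares can be computed directly from the enumerated outcomes. The sketch below rebuilds the three-dice grid under illustrative names and uses a hypothetical helper, `share_within()`. It assumes that "within" includes the boundary values, which is an interpretation rather than something stated above:

```r
# All 216 ordered outcomes of three dice and their averages
three_dice <- expand.grid(Dice1 = 1:6, Dice2 = 1:6, Dice3 = 1:6)
avg <- rowMeans(three_dice)

# Hypothetical helper: share of outcomes whose average lies within `d`
# of the population mean `mu` (boundary included; the small tolerance
# guards against floating-point rounding at the boundary)
share_within <- function(avg, d, mu = 3.5) {
  mean(abs(avg - mu) <= d + 1e-9)
}

share_within(avg, 0.5)  # averages between 3 and 4
share_within(avg, 1.5)  # averages between 2 and 5
```

Changing `<=` to `<` in the helper gives the shares with the boundary values excluded.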
The distribution for a single throw was uniform, whereas those for samples of size two and three were not. The peak of both lies in the middle, close to the population mean. While the sample of size three had no single outcome exactly equal to the population mean, more of its sample means were close to the population mean. Moreover, the distribution of the sample mean showed a clear tendency towards a normal distribution as the sample size increased.
These results illustrate the Central Limit Theorem: the distribution of the sample mean approaches a normal distribution as the sample gets larger. Moreover, a larger sample size gives the estimate a higher chance of being close to the population mean.
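How quickly the sample means concentrate can be measured with their standard deviation. The sketch below uses a hypothetical helper, `sd_of_sample_mean()`, and compares the enumerated spread with the textbook value sigma / sqrt(n), where sigma^2 = 35/12 is the variance of a single fair die:

```r
# Hypothetical helper: standard deviation of the sample mean of n fair dice,
# computed over all 6^n equally likely outcomes
sd_of_sample_mean <- function(n) {
  avg <- rowMeans(expand.grid(rep(list(1:6), n)))
  sqrt(mean((avg - 3.5)^2))
}

sizes <- 1:5
spread <- data.frame(
  n          = sizes,
  enumerated = sapply(sizes, sd_of_sample_mean),
  theory     = sqrt(35 / 12) / sqrt(sizes)  # sigma / sqrt(n)
)
spread
```

The two columns should agree, and both shrink as n grows, which is the sense in which a larger sample gives an estimate that is more likely to land near the population mean.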
The central limit theorem holds no matter the shape of the population distribution, provided the population has a finite variance.
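A population that is far from uniform makes this easier to see. The sketch below is a simulation rather than an enumeration, and every choice in it is an assumption made for illustration: the exponential population, the sample size of 30, the 10,000 repetitions, the seed, and the hypothetical `skewness()` helper.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: sample skewness (close to 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n    <- 30      # size of each sample (illustrative)
reps <- 10000   # number of samples drawn (illustrative)

# Population: exponential with rate 1, which has mean 1 and is right-skewed
raw_draws    <- rexp(reps, rate = 1)
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

skewness(raw_draws)     # skewness of individual observations
skewness(sample_means)  # skewness of the sample means
mean(sample_means)      # compare with the population mean of 1
```

The sample means should be far less skewed than the individual draws and should centre near the population mean. A call such as `hist(sample_means)` shows the bell shape visually.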
(Please find the web-app at https://mealabhya.shinyapps.io/Sample_Probability_Distribution_Emperical_Simulation/ to experiment with the concepts covered in this article.)