library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Population - The full set of observations of interest
Sample - A subset of the population, preferably sharing all the characteristics of the population
Estimator - A statistic (a rule applied to a sample) used to estimate a population parameter
Estimate - The value the estimator produces for a particular sample
Mean - Average value of the observations
Variance - Spread of the observations around the mean
If you are a researcher or simply have a passion for statistics and data analysis, you have probably worked with datasets. Any research, be it academic, scientific or business, can require data for the researcher to make an inference. Ideally, a dataset covering the complete population would be very useful, but it is usually unrealistic to expect data on every observation, whatever the type of observation. A more realistic approach is to take a subset of the population that represents the characteristics of the larger population.
Random sampling is a good technique for guarding against a biased sample. Since observations are selected at random, the sample is not expected to over-represent or under-represent particular features of the population, and outliers tend to be offset by other observations. Of course, the size of the sample matters, as a smaller sample may not be as representative as required.
While there are many sampling techniques that a researcher may use, and should choose among depending on the hypothesis, a sample must resemble the population it represents. The sampling method is not the subject of this document; rather, we shall see how the mean and variance of a sample compare with those of the population.
To investigate the relationship between population and sample, we shall calculate the mean and variance of two dice thrown at once and see how the sample estimators perform. The population for a six-faced die is six observations, i.e. (1,2,3,4,5,6). Consider the result of the two dice as a sample of two observations.
A fair die has an equal probability of landing on each of its six faces, so there are six possible outcomes when rolling a die. When we roll two dice, we get a combination of results such as 1-1 or 5-4. Altogether, there are 36 different ordered combinations of results from rolling two dice.
In this example, we shall record all the possible outcomes of rolling two dice.
The commands below generate every combination of outcomes from rolling the two dice simultaneously.
rm(list = ls()) # remove all objects from the workspace
scoreCard <- as.data.frame(matrix(data = NA, ncol = 2, byrow = TRUE)) # empty two-column dataset to fill
colnames(scoreCard) <- c("Dice 1", "Dice 2") # rename the columns
r <- 1 # r is the row of the dataset we are currently filling
for (i in 1:6) {
  for (j in 1:6) {
    scoreCard[r, 1] <- i # result of the first die
    scoreCard[r, 2] <- j # result of the second die
    r <- r + 1 # move on to the next row
  }
}
scoreCard
## Dice 1 Dice 2
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 2 1
## 8 2 2
## 9 2 3
## 10 2 4
## 11 2 5
## 12 2 6
## 13 3 1
## 14 3 2
## 15 3 3
## 16 3 4
## 17 3 5
## 18 3 6
## 19 4 1
## 20 4 2
## 21 4 3
## 22 4 4
## 23 4 5
## 24 4 6
## 25 5 1
## 26 5 2
## 27 5 3
## 28 5 4
## 29 5 5
## 30 5 6
## 31 6 1
## 32 6 2
## 33 6 3
## 34 6 4
## 35 6 5
## 36 6 6
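The same 36 rows can also be produced without an explicit loop. The following is only a sketch of one alternative using base R's `expand.grid`; the names `grid` and `scoreCardAlt` are introduced here for illustration and are not part of the original code.

```r
# Sketch: build all 36 ordered outcomes without a loop.
# expand.grid varies its FIRST argument fastest, so the second die is listed first
# to reproduce the row order of the table above (Dice 1 changes slowest).
grid <- expand.grid(second = 1:6, first = 1:6)
scoreCardAlt <- data.frame(grid$first, grid$second)
colnames(scoreCardAlt) <- c("Dice 1", "Dice 2")
nrow(scoreCardAlt) # 36 rows, one per ordered outcome
```

Either approach gives one row per ordered outcome; the loop makes the construction explicit, while `expand.grid` is shorter.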
Each row of the dataset is one possible sample of two observations.
Before diving into the estimators, it is only proper to find the mean and variance of the population. The mean of a fair die, with values from one to six, is (1+2+3+4+5+6)/6 = 3.5. The variance, dividing by the population size of six, is ((1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2 + (6-3.5)^2)/6 = 17.5/6 ≈ 2.917.
MeanDie <- mean(1:6)
VarianceDie <- ((1-3.5)^2+(2-3.5)^2+(3-3.5)^2+(4-3.5)^2+(5-3.5)^2+(6-3.5)^2)/6
MeanDie
## [1] 3.5
VarianceDie
## [1] 2.916667
The population mean is 3.5 whereas the variance is 2.917.
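One point worth checking when reproducing this value: R's built-in `var()` divides by the number of observations minus one, so it does not return the population variance computed above. A minimal sketch of the difference, with `x`, `pop_var` and `r_var` as illustrative names:

```r
x <- 1:6
pop_var <- mean((x - mean(x))^2) # divides by N = 6 (population variance)
r_var <- var(x)                  # divides by N - 1 = 5
pop_var # 2.916667
r_var   # 3.5
```

This is why the population variance is written out by hand rather than computed with `var(1:6)`.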
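Now, for each sample, we will record the mean and the variance. For example, if the two dice show one and five, as in the fifth sample (the fifth row), the mean of the combination is three and the variance is ((1-3)^2 + (5-3)^2)/2 = 4.
The following commands create a column for the mean and a column for the variance of all possible samples. The variance here uses the same formula as for the population, dividing by the sample size of two.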
scoreCard$Mean <- rowMeans(scoreCard[,c(1,2)])
variance <- ((scoreCard$`Dice 1` - scoreCard$Mean)^2 + (scoreCard$`Dice 2`- scoreCard$Mean)^2)/2
scoreCard$Variance <- variance
scoreCard
## Dice 1 Dice 2 Mean Variance
## 1 1 1 1.0 0.00
## 2 1 2 1.5 0.25
## 3 1 3 2.0 1.00
## 4 1 4 2.5 2.25
## 5 1 5 3.0 4.00
## 6 1 6 3.5 6.25
## 7 2 1 1.5 0.25
## 8 2 2 2.0 0.00
## 9 2 3 2.5 0.25
## 10 2 4 3.0 1.00
## 11 2 5 3.5 2.25
## 12 2 6 4.0 4.00
## 13 3 1 2.0 1.00
## 14 3 2 2.5 0.25
## 15 3 3 3.0 0.00
## 16 3 4 3.5 0.25
## 17 3 5 4.0 1.00
## 18 3 6 4.5 2.25
## 19 4 1 2.5 2.25
## 20 4 2 3.0 1.00
## 21 4 3 3.5 0.25
## 22 4 4 4.0 0.00
## 23 4 5 4.5 0.25
## 24 4 6 5.0 1.00
## 25 5 1 3.0 4.00
## 26 5 2 3.5 2.25
## 27 5 3 4.0 1.00
## 28 5 4 4.5 0.25
## 29 5 5 5.0 0.00
## 30 5 6 5.5 0.25
## 31 6 1 3.5 6.25
## 32 6 2 4.0 4.00
## 33 6 3 4.5 2.25
## 34 6 4 5.0 1.00
## 35 6 5 5.5 0.25
## 36 6 6 6.0 0.00
Clearly, the estimates from different samples need not agree with one another, nor with the population value. For instance, the sample (1,1) has a mean of 1 and a variance of 0, whereas the sample (3,5) has a mean of 4 and a variance of 1. The estimates differ not only from each other but also from the population parameters.
Now we shall look at the distinct values of the mean and variance across our samples and the number of times each occurs.
The tables below show how the mean and variance are distributed across the samples.
table(scoreCard$Mean)
##
## 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
## 1 2 3 4 5 6 5 4 3 2 1
table(scoreCard$Variance)
##
## 0 0.25 1 2.25 4 6.25
## 6 10 8 6 4 2
From both the tables above, we can see that many estimates are not equal to the population mean and variance, and that’s okay. It would be unrealistic for a researcher to expect a sample to reproduce the population value exactly. What the estimator should be, however, is unbiased, meaning that its expected value equals the population parameter it estimates. To check whether our estimators are unbiased, we calculate their expected values. An expected value is simply the sum of each possible value times its probability of occurrence.
The commands below show the probability distribution of the estimates, which we need before computing the expected values of the estimators. The probability of each value is its number of occurrences divided by the total number of samples, in this case 36.
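The expected value, and the usual formal definition of unbiasedness, can be written compactly as follows. The symbols $X$, $x_i$, $\hat{\theta}$ and $\theta$ are notation introduced here for illustration; they do not appear elsewhere in this document.

```latex
E[X] = \sum_{i} x_i \, P(X = x_i),
\qquad
\hat{\theta} \text{ is unbiased for } \theta \iff E[\hat{\theta}] = \theta .
```

Here $X$ stands for an estimator such as the sample mean, the $x_i$ are the distinct values it can take, and $\theta$ is the population parameter being estimated.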
prop.table(table(scoreCard$Mean))
##
## 1 1.5 2 2.5 3 3.5 4
## 0.02777778 0.05555556 0.08333333 0.11111111 0.13888889 0.16666667 0.13888889
## 4.5 5 5.5 6
## 0.11111111 0.08333333 0.05555556 0.02777778
prop.table(table(scoreCard$Variance))
##
## 0 0.25 1 2.25 4 6.25
## 0.16666667 0.27777778 0.22222222 0.16666667 0.11111111 0.05555556
Next we arrange each distribution as a table with one row per value. A third column, holding each value times its probability, is also created.
Mean <- summarise(group_by(scoreCard, Mean), probability = length(Mean)/36)
Mean$Outcome <- Mean$Mean * Mean$probability
head(Mean,11)
## # A tibble: 11 x 3
## Mean probability Outcome
## <dbl> <dbl> <dbl>
## 1 1 0.0278 0.0278
## 2 1.5 0.0556 0.0833
## 3 2 0.0833 0.167
## 4 2.5 0.111 0.278
## 5 3 0.139 0.417
## 6 3.5 0.167 0.583
## 7 4 0.139 0.556
## 8 4.5 0.111 0.5
## 9 5 0.0833 0.417
## 10 5.5 0.0556 0.306
## 11 6 0.0278 0.167
Variance <- summarise(group_by(scoreCard, Variance), probability = length(Variance)/36)
Variance$Outcome <- Variance$Variance * Variance$probability
Variance
## # A tibble: 6 x 3
## Variance probability Outcome
## <dbl> <dbl> <dbl>
## 1 0 0.167 0
## 2 0.25 0.278 0.0694
## 3 1 0.222 0.222
## 4 2.25 0.167 0.375
## 5 4 0.111 0.444
## 6 6.25 0.0556 0.347
The tables above show the probability distributions of the sample mean and variance. With these distributions, we can calculate the expected values. The expected value is simply the sum of each value times its probability.
E_Mean <- sum(Mean$Mean * Mean$probability) #or simply sum(Mean$Outcome)
E_Variance <- sum(Variance$Variance * Variance$probability) #or simply sum(Variance$Outcome)
E_Mean
## [1] 3.5
E_Variance
## [1] 1.458333
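As a cross-check, and assuming a fair pair of dice so that each of the 36 samples is equally likely, the expected values should equal the plain averages of the estimates over all samples. The sketch below rebuilds the samples so that it runs on its own; `dice`, `m` and `v` are illustrative names.

```r
# All 36 equally likely samples of two dice
dice <- expand.grid(d1 = 1:6, d2 = 1:6)
m <- rowMeans(dice)                              # sample means
v <- ((dice$d1 - m)^2 + (dice$d2 - m)^2) / 2     # sample variances, dividing by n = 2
mean(m) # 3.5
mean(v) # 1.458333
```

Both numbers agree with `E_Mean` and `E_Variance`, since weighting each distinct value by its probability is equivalent to averaging over equally likely samples.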
The results show that the expected value of the sample mean is exactly equal to the population mean of 3.5. The sample mean is therefore an unbiased estimator, since its expected value does not differ from the population mean. The variances, however, are not equal: the expected sample variance is 1.458, exactly half the population variance of 2.917. This means that the variance a researcher calculates from the sample with this formula is a biased estimator of the population variance; on average it underestimates it.
(It is possible, however, to modify the variance calculation to obtain an unbiased estimator by dividing by n - 1 instead of n, where n is the sample size. This is why the sample variance formula differs from the population variance formula.)
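A sketch of one such modification is to divide by the sample size minus one rather than the sample size (often called Bessel's correction), which is what R's built-in `var()` does. The code below is illustrative only; `dice` and `v_unbiased` are names introduced here.

```r
# All 36 equally likely samples of two dice
dice <- expand.grid(d1 = 1:6, d2 = 1:6)
# var() divides by n - 1, so for a sample of two it divides by 1
v_unbiased <- apply(dice, 1, var)
mean(v_unbiased) # 2.916667, the population variance
```

With this divisor, the expected value of the sample variance matches the population variance for the two-dice example.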
While not every sample mean is equal to the population mean, the estimator is nonetheless unbiased. An unbiased estimator is one whose estimates are correct on average, which makes it a sensible method of estimating population parameters from a sample.
One thing the data does show is that the sample mean equal to the population mean (3.5) has the highest probability of occurrence. Increasing the sample size concentrates the sample means more tightly around the population mean, since the variance of the sample mean shrinks in proportion to the sample size. This is an indication of a good estimator.
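To see how sample size affects the spread of the sample mean, the sketch below enumerates every sample of two, three and four dice and compares the variance of the sample mean with the population variance divided by the sample size. The helper `var_of_sample_mean` and the other names are hypothetical, introduced only for this illustration.

```r
pop <- 1:6
sigma2 <- mean((pop - mean(pop))^2) # population variance, 35/12

# Variance of the sample mean over all equally likely ordered samples of size n
var_of_sample_mean <- function(n) {
  samples <- expand.grid(rep(list(pop), n)) # all 6^n ordered samples
  m <- rowMeans(samples)
  mean((m - mean(m))^2)
}

sizes <- 2:4
result <- data.frame(n = sizes,
                     var_of_mean = sapply(sizes, var_of_sample_mean),
                     sigma2_over_n = sigma2 / sizes)
result
```

The two right-hand columns should agree for every sample size, and both shrink as the sample size grows, so larger samples give means that cluster more closely around 3.5.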