Sampling Distribution of the Proportion

In this article, our purpose is to analyze the sampling distribution of the proportion, mean, and variance for five stores carrying out a certain activity, distributed by daily turnover, as follows:

The population consists of N = 5 stores with different activities and daily turnovers of {2, 7, 2, 4, 5} (in millions). From this population, we will extract all 10 possible samples of size n = 3 to analyze their characteristics and to examine the sampling distribution. The population distribution is characterized by its mean, variance, and standard deviation.

The mean of a population is the average value of all the units in the population:

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]

where \(N\) is the number of units in the population and \(x_i\) is the value of unit \(i\).

The variance of a population measures how much the values deviate from the mean:

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

where \(\mu\) is the population mean.

The standard deviation is the square root of the variance:

\[ \sigma = \sqrt{\sigma^2} \]

where \(\sigma^2\) is the population variance.

The population distribution is characterized by the following features:

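Thus, with the formulas above, the mean, variance, and standard deviation of the population are \(\mu\) = 4, \(\sigma^2\) = 18/5 = 3.6 and \(\sigma\) ≈ 1.90. (Dividing the sum of squared deviations, 18, by \(N - 1\) = 4 instead of \(N\) = 5 gives 4.5 and 2.12.)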
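As a quick check, the population parameters can be recomputed with Python's standard `statistics` module. This is only a minimal sketch: the variable names are illustrative, `pvariance` and `pstdev` use the divisor N from the formula above, and `variance` uses N − 1 and is included only for comparison.

```python
import statistics

# Daily turnover of the five stores (in millions), taken from the text.
population = [2, 7, 2, 4, 5]

mu = statistics.mean(population)                # population mean
sigma2 = statistics.pvariance(population)       # divisor N, as in the formula above
sigma = statistics.pstdev(population)           # square root of sigma2
s2_corrected = statistics.variance(population)  # divisor N - 1, for comparison only

print(mu, sigma2, round(sigma, 2), s2_corrected)
```

The two variance results differ only in the divisor, which is why it matters to state which one is being reported.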

Sample selection

The samples are extracted by simple random sampling without replacement. For this procedure, the number of distinct samples that can be extracted is given by the following formula:

\[ C_N^n = \frac{N!}{n!(N-n)!} \]

where: N is the total number of units in the population and n is the size of the sample.

Using this procedure, we can extract 10 samples: \[ C_5^3 = \frac{5!}{3!(5-3)!} = 10 \quad \text{samples} \]

The probability of extracting a single sample of size n = 3 units is given by the following formula:

\[ P(\text{sample}) = \frac{1}{C_N^n} = \frac{1}{\frac{N!}{n!(N-n)!}} \]

The probability of extracting a single sample using this procedure is:

\[ P(\text{sample}) = \frac{1}{C_5^3} = \frac{1}{\frac{5!}{3!(5-3)!}} = \frac{1}{10} = 0.1 \]
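The count and the probability above can be reproduced in a few lines. This is a sketch using the standard library only; the names `n_samples` and `p_sample` are illustrative.

```python
from fractions import Fraction
from math import comb

N = 5  # population size
n = 3  # sample size

n_samples = comb(N, n)             # number of distinct samples without replacement
p_sample = Fraction(1, n_samples)  # every sample is equally likely

print(n_samples, p_sample, float(p_sample))
```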

The probability of including a specific unit in the sample in simple random sampling without replacement is:

\[ P(\text{specific unit included}) = \frac{1}{N} + \dots + \frac{1}{N}= \sum_{i=1}^{n} \frac{1}{N} = \frac{n}{N} \]

The probability of including a unit in the sample using this procedure is:

\[ P(\text{specific unit included}) = \frac{1}{5} + \frac{1}{5} + \frac{1}{5}= \sum_{i=1}^{3} \frac{1}{5} = \frac{3}{5} = 0.6 \]

  • at the first extraction, the probability of selecting a specific unit is 0.2.

\[ P_1 = \frac{1}{5} = 0.2 \]

  • if the specific unit is not selected in the first extraction, the conditional probability of selecting it in the second extraction increases to 1/4 = 0.25.

The unconditional probability of selecting the specific unit in the second extraction is therefore:

\[ P_2 = \left( \frac{4}{5} \right) \times \left( \frac{1}{4} \right) = \frac{4}{20} = \frac{1}{5} = 0.2 \]

where:

  • \(\frac{4}{5}\) is the probability that another unit is picked first
  • \(\frac{1}{4}\) is the probability of selecting the specific unit in the second extraction, given that it was not picked first

  • if the specific unit is still not selected, the conditional probability in the third extraction increases to 1/3 ≈ 0.33.

\[ P_3 = \left( \frac{4}{5} \right) \times \left( \frac{3}{4} \right) \times \left( \frac{1}{3} \right) = \frac{12}{60} = \frac{1}{5} = 0.2 \]

where:

  • \(\frac{4}{5}\) is the probability that another unit is picked first

  • \(\frac{3}{4}\) is the probability that another unit is picked second

  • \(\frac{1}{3}\) is the probability of selecting the specific unit in the third extraction

The total probability of selecting a specific unit is:

\[ P(\text{unit included}) = P_1 + P_2 + P_3 = \frac{1}{5} + \frac{1}{5} + \frac{1}{5} = \frac{3}{5} = 0.6 \]

Thus, the probability that a specific unit is included in the sample is 60%.

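Therefore, as more units are removed from the population, fewer options remain and the conditional chance of selecting any remaining unit increases, while the unconditional probability of selecting the specific unit at each extraction stays at 1/5.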
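The extraction-by-extraction reasoning above can be sketched as a small function. The name `inclusion_probability` is hypothetical, and exact fractions are used so that no rounding enters the check.

```python
from fractions import Fraction


def inclusion_probability(N: int, n: int) -> Fraction:
    """Probability that one fixed unit is included in a sample of size n
    drawn without replacement from a population of size N."""
    total = Fraction(0)
    not_yet_picked = Fraction(1)  # probability the unit survived all earlier draws
    for draw in range(n):
        remaining = N - draw  # units still in the population before this draw
        # Unconditional probability of picking the unit at this draw (P_1, P_2, ...).
        total += not_yet_picked * Fraction(1, remaining)
        # Probability that the unit is also skipped at this draw.
        not_yet_picked *= Fraction(remaining - 1, remaining)
    return total


print(inclusion_probability(5, 3))
```

Each pass through the loop contributes exactly 1/N, so the total is n/N.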

The sampling distribution of the sample proportion \(\hat{p}\) (the proportion of sampled units that satisfy the condition) is approximately normal only when both np and n(1−p) are sufficiently large. A sample of size n = 3 cannot satisfy the common rule of thumb that both quantities be at least 5, so here the distribution is examined directly from the possible samples.

The possible samples that can be extracted from the reference population are:
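One way to list them, as a sketch, is with `itertools.combinations`. It treats elements as distinct by position, so the two stores with a turnover of 2 are counted as separate units. The samples are shown by turnover value, and the ordering is simply the one the function produces.

```python
from itertools import combinations

# Daily turnover of the five stores (in millions), taken from the text.
population = [2, 7, 2, 4, 5]

# All samples of size 3 drawn without replacement.
samples = list(combinations(population, 3))

for k, sample in enumerate(samples, start=1):
    print(k, sample)
```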

Density distribution for each sample

Next, we calculate the mean, variance, and standard deviation for each sample extracted from the reference population. We also represent graphically the sampling distribution of the mean, variance, and proportion.
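A minimal sketch of these per-sample calculations follows. The text does not state which divisor is used for the sample variance, so the divisor n − 1 (`statistics.variance`) is an assumption here, and all names are illustrative.

```python
from itertools import combinations
import statistics

# Daily turnover of the five stores (in millions), taken from the text.
population = [2, 7, 2, 4, 5]

rows = []
for sample in combinations(population, 3):
    mean = statistics.mean(sample)
    var = statistics.variance(sample)  # divisor n - 1 (assumption, see lead-in)
    rows.append((sample, mean, var, var ** 0.5))

sample_means = [r[1] for r in rows]
sample_vars = [r[2] for r in rows]

mean_of_means = statistics.mean(sample_means)      # centre of the sampling distribution of the mean
var_of_means = statistics.pvariance(sample_means)  # its spread over the 10 samples
mean_of_vars = statistics.mean(sample_vars)        # average sample variance

for sample, mean, var, sd in rows:
    print(sample, round(mean, 3), round(var, 3), round(sd, 3))
```

Under these assumptions, the average of the ten sample means equals the population mean of 4.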

Standardization of the variable

When the population parameters (μ, σ) are known, a variable is standardized with the formula: \[ Z = \frac{X - \mu}{\sigma} \]

When the population parameters are unknown, we use their sample estimates instead: \[ Z = \frac{X - \bar{x}}{s} \quad \text{or} \quad Z = \frac{X - \hat{\mu}}{\hat{\sigma}} \]
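Both cases can be sketched with one helper. The function name `standardize` is hypothetical, and the sample (2, 7, 4) is simply one of the ten possible samples, chosen for illustration.

```python
import statistics


def standardize(values, center, scale):
    """Return the z-score (x - center) / scale for each value."""
    return [(x - center) / scale for x in values]


# Case 1: population parameters known, so use mu and sigma.
population = [2, 7, 2, 4, 5]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)
z_population = standardize(population, mu, sigma)

# Case 2: parameters unknown, so estimate them from the sample itself.
sample = [2, 7, 4]
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation, divisor n - 1
z_sample = standardize(sample, x_bar, s)

print([round(z, 2) for z in z_population])
print([round(z, 2) for z in z_sample])
```

In both cases the standardized values have mean 0, which is a quick sanity check on the calculation.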

Sampling distribution of the proportion

The estimated proportion values for the samples extracted from the reference population are distributed as follows:

As can be seen from the graph, most of the samples have a proportion of 0.33 for the attribute variable "activity", followed by samples with a proportion of 0.67 and, finally, those with a proportion of 0. The mean is 0.4, the variance is 0.04, and the standard deviation is 0.2, suggesting that the sample proportions are centered around 0.4 with relatively low variability.
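These figures can be reproduced with a short sketch. The text does not say which stores have the attribute, so the indicator list below is hypothetical: it assumes two of the five stores qualify, an assumption chosen because it reproduces the reported mean of 0.4.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical indicator: 1 if the store has the attribute, 0 otherwise.
# Which stores qualify is NOT given in the text; two of five is an assumption.
has_attribute = [1, 1, 0, 0, 0]
n = 3

# Sample proportion for each of the 10 possible samples.
proportions = [Fraction(sum(s), n) for s in combinations(has_attribute, n)]
distribution = Counter(proportions)  # proportion -> number of samples

mean_p = sum(proportions) / len(proportions)
var_p = sum((p - mean_p) ** 2 for p in proportions) / len(proportions)
sd_p = float(var_p) ** 0.5

for p, count in sorted(distribution.items()):
    print(round(float(p), 2), count)
print(float(mean_p), float(var_p), round(sd_p, 2))
```

Under this assumption, six samples have a proportion of 1/3, three have 2/3, and one has 0, matching the ordering described above.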

Sampling distribution of the variance

Sampling distribution of the mean

Bias in Estimation

1. Bias of the Sample Mean

The sample mean \(\bar{X}\) is an unbiased estimator of the population mean \(\mu\):

\[ \text{Bias}(\bar{X}) = \mathbb{E}[\bar{X}] - \mu = 0 \]
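A short justification can be sketched for simple random sampling without replacement. Here \(\bar{x}_s\) is notation introduced only for this sketch and denotes the mean of sample \(s\). Each unit appears in \(C_{N-1}^{n-1}\) of the \(C_N^n\) equally likely samples, so:

\[ \mathbb{E}[\bar{X}] = \frac{1}{C_N^n} \sum_{s} \bar{x}_s = \frac{1}{C_N^n} \cdot \frac{C_{N-1}^{n-1}}{n} \sum_{i=1}^{N} x_i = \frac{1}{N} \sum_{i=1}^{N} x_i = \mu \]

because \(C_{N-1}^{n-1} / C_N^n = n/N\). For the five stores, \(C_4^2 = 6\) and \(C_5^3 = 10\), which gives \(\frac{6}{10 \cdot 3} \cdot 20 = 4 = \mu\).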

2. Bias of the Sample Proportion

If \(\hat{p}\) is the sample proportion estimating the population proportion \(p\):

\[ \text{Bias}(\hat{p}) = \mathbb{E}[\hat{p}] - p \]

If the sample is randomly selected, \(\mathbb{E}[\hat{p}] = p\), meaning the sample proportion is unbiased.

However, in cases of non-random sampling, selection bias may introduce systematic error.

Conclusions

We observe that for the proportion and the mean, the bias is very small, almost zero. In addition, the larger the sample size, the closer the value of the estimator tends to be to the value of the parameter.
