Sampling Theory and Central Limit Theorem

Sampling theory is the study of the relationships between a population and the samples drawn from it. Consider all possible samples of size n that can be drawn from the population. For each sample we can compute a statistic, such as the mean or the standard deviation, that will vary from sample to sample. In this way we obtain a distribution called the sampling distribution of the statistic. If the statistic is the sample mean, the distribution is called the sampling distribution of the mean.

In the same way, we can have sampling distributions of the standard deviation, variance, median, proportion, etc.
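As a rough illustration of sampling distributions for statistics other than the mean, the sketch below draws repeated samples from a made-up population and records the median and the standard deviation of each one. The population (`rnorm` with mean 50 and sd 10), the sample size `n` and the number of repetitions `k` are all assumptions chosen for the example.

```r
# Sketch: approximate sampling distributions of the median and the
# standard deviation by repeated sampling.
set.seed(1)
population <- rnorm(5000, mean = 50, sd = 10)  # hypothetical population
n <- 25    # sample size (illustrative)
k <- 2000  # number of samples drawn (illustrative)

# Each call to replicate() draws k fresh samples without replacement
# and stores one statistic per sample.
sample_medians <- replicate(k, median(sample(population, n)))
sample_sds     <- replicate(k, sd(sample(population, n)))

summary(sample_medians)  # spread of the sample median across samples
summary(sample_sds)      # spread of the sample sd across samples
```

Passing `sample_medians` or `sample_sds` to `hist()` shows the shape of each sampling distribution.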

The central limit theorem states that the sampling distribution of the mean of independent, identically distributed random variables with finite variance will be normal or nearly normal, regardless of the underlying distribution, provided the sample size is large enough. In that case we get a nice bell-shaped curve.

In other words, suppose you pick a sample from a large number of independent, random observations, compute the arithmetic average of the sample, and repeat this exercise k times. Then, according to the central limit theorem, the computed averages will be distributed approximately according to the normal distribution (commonly known as a “bell curve”). We will try to simulate this theorem in the examples below.
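The procedure just described fits in a few lines of R. The sketch below uses a deliberately skewed population, an exponential distribution with rate 1. That choice, the sample size `n = 50`, `k = 5000` and the small `skewness` helper are assumptions made for the illustration. It compares the skewness of the raw draws with the skewness of the sample means.

```r
# Sketch: sample means of a skewed (exponential) population.
set.seed(42)
n <- 50    # observations per sample (illustrative)
k <- 5000  # number of samples (illustrative)

# One sample per column: n rows, k columns.
draws <- matrix(rexp(n * k, rate = 1), nrow = n, ncol = k)
sample_means <- colMeans(draws)

# Moment-based skewness; 0 for a symmetric distribution.
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

skewness(as.vector(draws))  # raw draws: strongly right-skewed
skewness(sample_means)      # sample means: much closer to 0
mean(sample_means)          # close to the population mean of 1
```

The raw draws are far from bell shaped, yet the averages of 50 draws are already nearly symmetric around the population mean.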

Example 1: A fair die can be modelled as a discrete random variable with outcomes 1 through 6, each with equal probability 1/6.

The expected value is \(\frac{1+2+3+4+5+6}{6} = 3.5\).
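The same arithmetic can be checked directly in R. This small sketch also computes the variance and standard deviation of a single roll, which are needed when the standard error is verified further on. The variable names are chosen here for illustration.

```r
# Sketch: moments of a single roll of a fair die.
outcomes <- 1:6
prob <- rep(1 / 6, 6)  # each face is equally likely

expected_value <- sum(outcomes * prob)                     # 3.5
variance <- sum((outcomes - expected_value)^2 * prob)      # 35/12, about 2.92
std_dev <- sqrt(variance)                                  # about 1.71

round(c(mean = expected_value, var = variance, sd = std_dev), 2)
```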
Suppose you throw the die 10000 times and plot the frequency of each outcome. Here’s the R code to simulate throwing a die 10000 times:

# Simulate 10000 rolls of a fair die
DieOutcome <- sample(1:6, 10000, replace = TRUE)
hist(DieOutcome, col = "light blue")
# Mark the expected value of a single roll
abline(v = 3.5, col = "red", lty = 1)

Next we draw samples of 10 die rolls, take the arithmetic mean of each sample, and plot the distribution of those sample means. We repeat this procedure k times (in this case k = 10000).

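k <- 10000
x10 <- numeric(k)  # one sample mean per repetition
for (i in 1:k) {
  # Mean of a fresh sample of 10 die rolls
  x10[i] <- mean(sample(1:6, 10, replace = TRUE))
}
hist(x10, col = "pink", main = "Sample size = 10", xlab = "Mean of 10 die rolls")
abline(v = mean(x10), col = "red")  # mean of the sample means
abline(v = 3.5, col = "blue")       # population mean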

Sample Size
In theory, as the sample size n increases, the sampling distribution of the mean looks more and more like a bell-shaped curve, and as n approaches infinity it converges to a normal distribution. Let's see this by increasing the sample size to 30, 100 and 1000 in Example 1 above.

k <- 10000
x30 <- numeric(k)
x100 <- numeric(k)
x1000 <- numeric(k)
for (i in 1:k) {
  x30[i] <- mean(sample(1:6, 30, replace = TRUE))
  x100[i] <- mean(sample(1:6, 100, replace = TRUE))
  x1000[i] <- mean(sample(1:6, 1000, replace = TRUE))
}
# Draw the three histograms side by side
par(mfrow = c(1, 3))
hist(x30, col = "green", main = "n=30", xlab = "Mean of die rolls")
abline(v = mean(x30), col = "blue")

hist(x100, col = "light blue", main = "n=100", xlab = "Mean of die rolls")
abline(v = mean(x100), col = "red")

hist(x1000, col = "orange", main = "n=1000", xlab = "Mean of die rolls")
abline(v = mean(x1000), col = "red")
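Histograms only show the bell shape by eye. A rough numerical check is to standardize the sample means and see how often they fall within 1.96 standard errors of 3.5, which a normal distribution would do about 95% of the time. The sketch below assumes the die's standard deviation is sqrt(35/12) (about 1.71) and uses a smaller `k` than above to keep the run short. Both choices are illustrative.

```r
# Sketch: share of standardized sample means inside +/- 1.96.
set.seed(123)
k <- 5000                 # repetitions per sample size (illustrative)
sigma <- sqrt(35 / 12)    # sd of one roll of a fair die
sizes <- c(30, 100, 1000)

coverage <- sapply(sizes, function(n) {
  means <- replicate(k, mean(sample(1:6, n, replace = TRUE)))
  z <- (means - 3.5) / (sigma / sqrt(n))  # standardized sample means
  mean(abs(z) < 1.96)                     # fraction within 1.96 standard errors
})
names(coverage) <- paste0("n=", sizes)
print(round(coverage, 3))  # each value should be near 0.95
```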

Let's take another example.
Example 2: A fair coin
When a fair coin is flipped many times, the number of heads in a series of flips approximately follows a normal curve, with mean equal to half the number of flips in each series. Here 1 represents heads and 0 tails, so the mean of each series is the proportion of heads, which is centred on 0.5.

k <- 10000
x <- numeric(k)
for (i in 1:k) {
  # Proportion of heads in 100 flips (1 = heads, 0 = tails)
  x[i] <- mean(sample(0:1, 100, replace = TRUE))
}
hist(x, col = "light green", main = "Sample size = 100", xlab = "Proportion of heads")
abline(v = mean(x), col = "red")
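The number of heads in 100 flips of a fair coin is binomial with mean 50 and standard deviation 5, so the normal approximation can also be checked numerically. The sketch below uses `rbinom` instead of the loop above and compares an exact binomial probability with its normal approximation. The interval 45 to 55 and the continuity correction are choices made for this illustration.

```r
# Sketch: heads in 100 fair coin flips versus the normal approximation.
set.seed(7)
k <- 10000
flips <- 100
heads <- rbinom(k, size = flips, prob = 0.5)  # heads per series of 100 flips

c(mean = mean(heads), sd = sd(heads))  # theory: 50 and 5

# P(45 <= heads <= 55): exact binomial vs. normal with continuity correction
exact  <- pbinom(55, flips, 0.5) - pbinom(44, flips, 0.5)
approx <- pnorm(55.5, mean = 50, sd = 5) - pnorm(44.5, mean = 50, sd = 5)
c(exact = exact, normal_approx = approx)
```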

Relation between the Population mean, Sample mean, Population variance and Sample variance

Suppose that all possible samples of size n are drawn without replacement from a finite population of size \(N_p > n\). Then

\[\mu_{\bar{X}} = \mu\] and \[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N_p - n}{N_p - 1}}\] where \(\mu_{\bar{X}}\) is the mean of the sampling distribution of the mean, \(\sigma_{\bar{X}}\) is its standard deviation, and \(\mu\) and \(\sigma\) are the population mean and standard deviation.
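For a small population the finite-population result can be checked exactly by listing every possible sample. The sketch below uses a made-up population of five values and samples of size 2. Here `n` is the sample size and `N_p` the population size, and the standard deviations divide by the number of values rather than by one less, since we are describing whole distributions and not estimating from a sample.

```r
# Sketch: exact check of the finite-population formula by enumeration.
population <- c(2, 3, 6, 8, 11)  # hypothetical population
N_p <- length(population)
n <- 2

# Every sample of size n without replacement, one per column.
all_samples <- combn(population, n)
sample_means <- colMeans(all_samples)

mu <- mean(population)
sigma <- sqrt(mean((population - mu)^2))  # population sd (divide by N_p)

mean_of_means <- mean(sample_means)
sd_of_means <- sqrt(mean((sample_means - mean_of_means)^2))
formula_sd <- sigma / sqrt(n) * sqrt((N_p - n) / (N_p - 1))

c(mu = mu, mean_of_means = mean_of_means)             # the two agree
c(sd_of_means = sd_of_means, formula_sd = formula_sd) # the two agree
```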

If the population is infinite, or if the sampling is done with replacement, the mean of the sampling distribution of the sample mean is again equal to the mean of the population from which we are sampling, and the finite-population factor drops out, i.e. \[\mu_{\bar{X}} = \mu\] and \[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\]

\(\frac{\sigma}{\sqrt{n}}\) is also called the standard error of the mean.

Let's verify the above rules with the fair die used in Example 1.
Variance of rolling a fair die = \(\frac{35}{12} \approx 2.92\)
Standard deviation \(\sigma\) = 1.71
Sample size 10
\(\sigma_{\bar{X}_{10}} = \frac{1.71}{\sqrt{10}} = 0.54\)
Standard deviation of x10 = sd(x10) = 0.54

Sample size 100
\(\sigma_{\bar{X}_{100}} = \frac{1.71}{\sqrt{100}} = 0.17\)
Standard deviation of x100 = sd(x100) = 0.17

Sample size 1000
\(\sigma_{\bar{X}_{1000}} = \frac{1.71}{\sqrt{1000}} = 0.05\)
Standard deviation of x1000 = sd(x1000) = 0.05
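The comparison above can be reproduced in one self-contained run. The sketch below regenerates the sample means rather than reusing the earlier vectors, with a smaller `k` to keep it quick, and puts the theoretical standard error next to the simulated standard deviation. The helper layout and the name `se_table` are illustrative.

```r
# Sketch: theoretical standard error vs. simulated sd of sample means.
set.seed(2024)
k <- 2000                  # repetitions per sample size (illustrative)
sigma <- sqrt(35 / 12)     # sd of one roll of a fair die, about 1.71
sizes <- c(10, 100, 1000)

observed <- sapply(sizes, function(n) {
  sd(replicate(k, mean(sample(1:6, n, replace = TRUE))))
})
theoretical <- sigma / sqrt(sizes)

se_table <- data.frame(
  n = sizes,
  theoretical = round(theoretical, 2),
  observed = round(observed, 2)
)
print(se_table)
```

The two columns should agree to about two decimal places, with small differences from simulation noise.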