The Sampling Distribution of the Mean

We’re going to look at a specific example to illustrate the concept of the sampling didstribution of the mean. We’ll use the variable carat in the dataframe diamonds, which is in the ggplot package. Before we start, we need to make sure we have the package loaded.

library(ggplot2)

OK, Now we have the data, so let’s look at the distribution of carat weights and get the population parameters \(\mu\) and \(\sigma\).

hist(diamonds$carat,main = "Histogram of Raw Data")

pop.mu <- mean(diamonds$carat)
pop.sigma <- sd(diamonds$carat)

pop.mu
## [1] 0.7979397
pop.sigma
## [1] 0.4740112

Now let’s get sample estimates of the mean from a very small sample of size 10. Do this several times and get a feeling for the kind of values we’re seeing.

# Sample size 10
x10 <- sample(diamonds$carat,10)
mean(x10)
## [1] 0.895

Now let’s get sample estimates of the mean from a larger sample of size 1,000. Do this also several times and get a feeling for the kind of values we’re seeing.

# Sample size 1,000
x1000 <- sample(diamonds$carat,1000)
mean(x1000)
## [1] 0.79198

What happened? How do the sample means based on samples of size 1,000 differ from those based on samples of size 10? Recall that the true value of the mean is 0.7979397.

Comparing these individual sample means one at a time by hand is tedious, so we’ll be more systematic. Let’s get 1,000 sample means based on samples of size 10 and save the results of each run in a vector called results. First we’ll create the results vector and fill it with zeros just so that it will be there.

results <- rep(0,1000)

Now we can get our 1,000 sample mean estimates and place each one of them in its own place in the results vector.

for(i in 1:1000)
{
  x <- sample(diamonds$carat,10)
  results[i] <- mean(x)
}

Let’s look at a few of the things in the results vector and ponder what they are.

results[23]
## [1] 0.625
results[195]
## [1] 0.548
results[510]
## [1] 1.064
results[1000]
## [1] 0.59

Each one of these numbers is an estimate of the population mean based on a sample of size 10.

Results itself is a sample also, but not of the original population of diamond weights.

There is a population of sample means based on samples of size 10 from the original population.

Results is a sample from the population of sample means based on samples of size 10.

There are two distinct experiments here, each with its own random variable.

The first experiment is to select one diamond and weigh it. The random variable is the weight of this single diamond. This random variable has a probability distribution with mean \(\mu\) and standard deviation \(\sigma\).

The second experiment is to select a sample of size \(n\). The random variable is the mean of the \(n\) diamonds. This second random variable has the same mean as the original population, but a samller standard deviation. We distinguish between the two using the subscripts \(x\) for the original random variable and \(\bar{x}\) for the second.

We can get estimates of the mean and standard deviaition of the population of sample means because we have a sample from that population in results.

mean(results)
## [1] 0.794981
sd(results)
## [1] 0.1486673

Note that the mean of results is about the same as the mean of the original population, but the standard deviation is smaller by roughly a factor of 3.

These facts are examples of two relationships that we need to remember.

\[\mu_{\bar{x}}=\mu_{x}\] and \[\sigma_{\bar{x}} =\frac{\sigma_{x}}{\sqrt{n}}\]

Furthermore, as the value of \(n\) becomes larger the probability distribution of sample means approaches a normal distribution. This fact, known as the central limit theorem, is the reason why the normal distribution is so important and why the standard deviation is the chosen measure of spread even though there are simpler measures, which would do just as well.

Let’s look at a histogram of results and compare it with the original histogram of the diamond weights.

hist(results)

This could easily come from a normal distribution, as the theory suggests.

The Sampling Distribution of the Proportion

We have the basic theoretical results.

The estimates of a proportion \(\hat{p}\) based on a sample of size n is approximately normal and has the following mean and standard deviation provided that \(n\hat{p} > 10\) and \(n(1-\hat{p}) > 10\).

\[\mu_{\hat{p}} = p\] and \[\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\]

This is all based on the theory of the binomial distribution. In that theory we talked about having a known value of \(p\), the probability of success on a single trial. When a single trial is repeated \(n\) times, the expected number of successes is given by \(np\). Now we are trying to estimate the value of \(p\) based on the results of \(n\) trials.

We’ll use the diamonds dataframe again to illustrate the theory. Let’s look at a relative frequency table of the variable cut.

table(diamonds$cut)/nrow(diamonds)
## 
##       Fair       Good  Very Good    Premium      Ideal 
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652

A single trial consists of selecting a diamond at random. For our first exercise, we’ll consider getting one with a Fair cut as a success. From the table we know that the true population value of \(p\) is \(.02984798\).

Let’s set up a code snippet to look at the distribution of estimates based on different sample sizes. We can change the sample size and re-run the snippet.

# First create a vector where we can put our estimates
Results = rep(0,1000)

# Set the sample size
SampleSize = 10

# Draw 1,000 samples of this size, estimate p for each one and place 
# each result into the results vector

for(i in 1:1000)
  {Sample = sample(diamonds$cut,SampleSize)
   p.estimate = mean(Sample == "Fair")
   Results[i] = p.estimate
   }
# Look at the distribution of results
mean(Results)
## [1] 0.0283
hist(Results)

Simulation with proportion estimates

The following code chunk allows us to simulate an estimate of a proportion based on a single sample. The code allows for varying the true population proportion and the sample size.

sampsize = 10
p = .3
x = sample(c(1,0),size = sampsize,replace = TRUE,
           prob = c(p,1-p))
prop.est = mean(x)
prop.est
## [1] 0.7
abs(prop.est - p)
## [1] 0.4

The following code allows us to draw from the sampling distribution of estimates produced by the the preceding code.

sampsize = 100
p = .1
expsize = 10000
results = rep(0,expsize)

for(i in 1:expsize){
  x = sample(c(1,0),size = sampsize,replace = TRUE,
             prob = c(p,1-p))
  results[i] = mean(x)
}

summary(results)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.08000 0.10000 0.09982 0.12000 0.23000
hist(results)