Sampling Distribution of the Mean

Harold Nelson

5/3/2020

This set of exercises focuses on the sampling distribution of the mean. It assumes that you have worked through slides 169 - 171 in Module 12 of the CMU Probability and Statistics course.

What does the summary on slide 171 say about the sampling distribution of the mean?

Answer

  1. The mean of the estimates of the mean, the values of \(\bar{x}\), is the same as the mean of the original population.

\[\mu_{\bar{x}}=\mu_{x}\]

  1. The standard deviation of the estimates of the mean is inversely proportional to the square root of the sample size and directly proportional to the standard deviation of the population.

\[\sigma_{\bar{x}} =\frac{\sigma_{x}}{\sqrt{n}}\] 3. The distribution of estimates will approximate a normal distribution more closely as the sample size on which the estimates are based, \(n\), becomes larger. If the original population has a normal distribution or \(n>30\), inference about the population mean can be based on an assumption of normality.

Setup

First, make sure that the package ggplot2 has been installed and loaded so that we have access to the diamonds dataframe.

package = "ggplot2"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

Population Statistics

The exercises that follow are based on the variable carat (weight) in the diamonds dataframe.

Get the summary statistics, the standard deviation, and a histogram of this variable.

hist(diamonds$carat)

summary(diamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
sd(diamonds$carat)
## [1] 0.4740112

Exercise

Looking at the histogram, does this variable appear to have a normal distribution? Why or why not?

Answer

No, the distribution is asymmetric with a strong right skew. The maximum value is much farther to the right than one could imagine in a normal distribution with this mean and standard distribution.

How many standard deviotions to the right of the mean is the maximum value?

Answer

(max(diamonds$carat) - mean(diamonds$carat))/sd(diamonds$carat)
## [1] 8.885992

That’s almost nine standard deviations. Use the pnorm() function to investigate the chances of this happening. Try pnorm(3), pnorm(4), and pnorm(5), as well as pnorm(9). Comment.

Answer

1 - pnorm(3)
## [1] 0.001349898
1 - pnorm(4)
## [1] 3.167124e-05
1 - pnorm(5)
## [1] 2.866516e-07
1 - pnorm(9)
## [1] 0

These results tell you that being even being five standard deviations to the right of the mean of a normal distribution is close to impossible. The value of zero for nine standard deviations means that you’ve reached the limit of machine precision on your computer.

After all that, let’s agree that the non-normality of diamonds$carat is a done deal.

Taking a Sample

There is a builtin function, sample, in R which we’ll use to get random samples of various sizes from diamonds$carat. Whenever we take a sample, we can also get an estimate of the mean based on that sample. I’ll show more detail than I need so you can visualize what’s happening.

# Get the sample
aSample = sample(diamonds$carat,size= 5)

# Look at the sample
aSample
## [1] 0.46 1.04 0.32 0.76 0.52
#Estimate the mean from this sample
anEstimate = mean(aSample)
anEstimate
## [1] 0.62

Run this chunk of code a few times in RStudio. Note that every time you do it, you’ll get different results. That’s because sample takes random samples.

A Vector of Estimates

First I’ll create a vector of zero values.

vectorOfEstimates = rep(0,10)
vectorOfEstimates
##  [1] 0 0 0 0 0 0 0 0 0 0

Now I’ll partially fill it with estimated mean values based on samples of size sampleSize. Note that “sampleSize” in my code is the \(n\) in the formulas from the CMU material. This prevents confusion with the number of estimates.

numberOfEstimates = 10
vectorOfEstimates = rep(0,numberOfEstimates)

sampleSize = 5

aSample = sample(diamonds$carat, size= sampleSize)
anEstimate = mean(aSample)
vectorOfEstimates[1] = anEstimate

# Again

aSample = sample(diamonds$carat, size= sampleSize)
anEstimate = mean(aSample)
vectorOfEstimates[2] = anEstimate

# And Again

aSample = sample(diamonds$carat, size= sampleSize)
anEstimate = mean(aSample)
vectorOfEstimates[3] = anEstimate

OK, Lets look at our vector of estimates.

The Vector of Estimates

vectorOfEstimates
##  [1] 0.774 0.454 0.900 0.000 0.000 0.000 0.000 0.000 0.000 0.000

I’ve filled up the first three positions in my vector by repeating the same block of code three times. The only difference among these three blocks is a single character, which identifies where in the vector estimates I’m putting the results.

Is there a way I can ask R to repeat the same block of code many times and just use a different value to position the estimate in the vector of estimates?

Yes!! It’s called a for loop. I’ll show you how it’s done, but I won’t expect you to make a for loop on your own. I want you to focus on the statistics, not the programming.

The For Loop

for ( i in 1:numberOfEstimates) 
{
  
  aSample = sample(diamonds$carat, size= sampleSize)
  anEstimate = mean(aSample)
  vectorOfEstimates[i] = anEstimate
}

vectorOfEstimates
##  [1] 0.474 0.566 0.408 1.090 0.664 0.500 1.048 0.548 0.892 0.700

The Code Snippet

I’m going to build a code snippet around this for loop and give you some controls to run it and add some display of results. We’ll use this to explore how the sample size effects the quality of our estimate of the mean.

I’ll run it with a sample size of 10 and create 1,000 estimates based on this sample size. You should think of this a single sample of size 1000 from the population of estimated means based on samples of size 10 from diamonds$carat. That’s a mouth full and a mind full.

Here’s the complete code snippet.

# A code snippet to investigate the sampling distribution of the mean.

# Your controls: Change these two values as you wish.
numberOfEstimates = 1000
sampleSize = 10


# My code - Look but don't touch.
# Create the vector of estimates to be filled with real values later.
vectorOfEstimates = rep(0,numberOfEstimates)

for ( i in 1:numberOfEstimates) 
{
  
  aSample = sample(diamonds$carat, size= sampleSize)
  anEstimate = mean(aSample)
  vectorOfEstimates[i] = anEstimate
}


# Examining the output - Look carefully and think hard!

print(paste("Sample size",sampleSize))
## [1] "Sample size 10"
print(paste("Mean",mean(vectorOfEstimates)))
## [1] "Mean 0.801006"
print(paste("Standard Deviation",sd(vectorOfEstimates)))
## [1] "Standard Deviation 0.156770063522398"
pop_mean = mean(diamonds$carat)
pop_sd = sd(diamonds$carat)


pred_mean = pop_mean
act_mean = mean(vectorOfEstimates)
meanRatio = pred_mean/act_mean
print(paste("Mean ratio is",meanRatio))
## [1] "Mean ratio is 0.996171998546829"
pred_sd = pop_sd/sqrt(sampleSize)
act_sd = sd(vectorOfEstimates)
sdRatio = pred_sd/act_sd
print(paste("Standard Deviation ratio is",sdRatio))
## [1] "Standard Deviation ratio is 0.956148855956613"
summary(vectorOfEstimates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4150  0.6937  0.7880  0.8010  0.8970  1.3450
htitle = paste("Estimated Mean Values with n =",sampleSize)
hist(vectorOfEstimates,main = htitle)

Compare these results with the results predicted on slide 171 of the CMU material and repeated on slide 3 of this presentation.

Is the mean of the estimated means consistent with the theoretical prediction?

What about the predicted value of the stanadrd deviation of the estimated means?

Does the population of estimated means have a distribution that is closer to a normal distribution than the original population of carat weights?

Answer

  1. Is the mean of the estimated means consistent with the theoretical prediction? Yes. Note that the ratio of predicted and actual values is close to 1.

  2. What about the predicted value of the stanadrd deviation of the estimated means? Yes again, for the same reason.

  3. Does the population of estimated means have a distribution that is closer to a normal distribution than the original population of carat weights? Yes. The right skewness is mostly gone. The distribution is much more symmetric.

Exercise

Copy the code snippet above into a new script file in RStudio and run it with a few values of sampleSize to see the impact of varying sampleSize. Use 1, 2, 4, 8, 16, 32 and 64 to see what happens each time you double the sampleSize parameter.

Make sure that you have ggplot installed and loaded. You can copy the code from the beginning of this presentation to do this.

My Explanation

You can see my comments on youtube at https://youtu.be/KOOIXHYqEn0

Right click and select "Open in new tab.