Lecture 8 - Sampling Distribution of the Sample Mean / CLT

Penelope Pooler Eisenbies
MAS 261

2023-09-28

Housekeeping

Today’s plan 📋
- Review Question about Empirical Rule
- A few minutes for R Questions 🪄
- Quick Review of Normal Distribution
  - Questions covered so far have been for a single observation (n=1)
- Sampling Distribution of the Sample mean
  - How does the Normal Distribution change when n > 1?
- Introduction to the Central Limit Theorem
- Questions about HW 4
- In-class Exercises

Review: R and RStudio 🪄

Review: You have two options to facilitate your introduction to R and RStudio:
- Option 1: Create a Posit Cloud account and download and install R and RStudio on your laptop.
- Option 2: Start with a free Posit Cloud account and later transition to using R/RStudio on your laptop.
If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.
- We will use Posit Cloud for Quizzes.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class during MOST lectures
- You can use either Posit Cloud or your laptop.

💥 Lecture 8 In-class Exercises - Q1 (Review) 💥

At the local Trader Joe’s on Sundays, the mean number of customers is 780 and the standard deviation is 40. Use the Empirical Rule to determine the probability that they will have between 860 and 900 customers next Sunday.

Step 1. Convert range endpoints to Z values

Step 2. Use this helpful diagram.
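A minimal base R sketch of the two steps, assuming the usual Empirical Rule percentages (about 95% of values within 2 SDs and about 99.7% within 3 SDs). The variable names are illustrative, and `pnorm()` is included only as a cross-check on the approximation.

```r
mu <- 780     # mean number of Sunday customers
sigma <- 40   # standard deviation

# Step 1: convert the range endpoints to Z values
z <- (c(860, 900) - mu) / sigma
z   # 2 and 3

# Step 2: Empirical Rule. The band between +2 and +3 SDs is half of the
# area that lies between the 95% region and the 99.7% region.
emp_prob <- (0.997 - 0.95) / 2
emp_prob   # 0.0235, i.e. about 2.35%

# Cross-check with the exact normal probability (not required)
exact_prob <- pnorm(900, mean = mu, sd = sigma) - pnorm(860, mean = mu, sd = sigma)
round(exact_prob, 4)   # 0.0214
```

The Empirical Rule answer and the exact answer differ slightly because 95% and 99.7% are rounded values.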

Normal Distribution

In lectures 6 and 7 we have talked about the normal distribution

It is symmetric and bell-shaped.
Its location is determined by the population mean, \(\mu\)
Its width is determined by the population standard deviation, \(\sigma\)
Regardless of the values of \(\mu\) and \(\sigma\), the normal distribution has a consistent shape
That shape is well known and provides information about all normally distributed populations.
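One way to check the "consistent shape" claim is to compute the probability of landing within one standard deviation of the mean for several different normal distributions. The sketch below uses base R's `pnorm()`; the helper name `within_one_sd` is made up for this illustration.

```r
# Probability that a normal observation falls within 1 SD of its mean
within_one_sd <- function(mu, sigma) {
  pnorm(mu + sigma, mean = mu, sd = sigma) - pnorm(mu - sigma, mean = mu, sd = sigma)
}

p_standard <- within_one_sd(0, 1)      # standard normal
p_cans     <- within_one_sd(12, 0.4)   # can fill example
p_store    <- within_one_sd(780, 40)   # Trader Joe's example

round(c(p_standard, p_cans, p_store), 4)   # 0.6827 for all three
```

The same agreement holds for 2 and 3 standard deviations, which is why the Empirical Rule applies to every normally distributed population.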

Normal Distribution

So far we’ve talked about a SINGLE observation from normally distributed data:
- A single future year for annual average movie gross
- Price of eggs at a single store
- A single morning of trading on the NYSE
- A single Sunday of business at Trader Joe’s

Today we’ll talk about how our understanding of the distribution changes when we ask a question about a sample mean with n > 1.
We’ll start by introducing a case where n = 1 is inappropriate and then increase the sample size.

A supply chain example

A manufacturing plant is supposed to fill cans with 12 oz. of Coca-Cola, on average, with a standard deviation of 0.4 ounces.

Population mean: \(\mu=12\) ounces

Population SD: \(\sigma = 0.4\) ounces

A supply chain consultant has been hired to help confirm whether this is true.

The plant owner is concerned that cans are being underfilled.

Decision Criteria

A manufacturing plant is supposed to fill cans with 12 oz. of Coca-Cola, on average, with a standard deviation of 0.4 ounces.

Population mean: \(\mu=12\) ounces

Population SD: \(\sigma = 0.4\) ounces

Industry standards state that if the can(s) examined have a fill of less than 11.5 oz, the plant must shut down and recalibrate.

Which is Expensive!

If a single can is chosen, what is the probability that the can fill will be 11.5 oz or less?

💥 Lecture 8 In-class Exercises - Q2 💥

What is the probability (percent chance) that a single can will have a fill of 11.5 oz or less?

Examining Only ONE Can - NOT WISE!

There is about an 11% chance that a randomly chosen can will have 11.5 ounces or less, even if the plant is calibrated correctly.
P(X < 11.5) = 10.6%
Recalibration costs MILLIONS of dollars! Should this decision be based on one randomly chosen can?
- NO!
A consultant or analyst who based this decision on only one randomly selected can would be committing malpractice.
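The 10.6% figure can be reproduced in base R, shown here as a rough sketch with illustrative variable names. `pnorm()` is a base R function that returns the same lower-tail probability as the course's `vdist_normal_prob` command, without the plot.

```r
mu <- 12      # population mean fill (oz)
sigma <- 0.4  # population SD (oz)

# Z value for a single can at the 11.5 oz cutoff
z_single <- (11.5 - mu) / sigma
z_single   # -1.25

# Lower-tail probability for one can (n = 1)
p_single <- pnorm(11.5, mean = mu, sd = sigma)
round(p_single, 3)   # 0.106
```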

How would this change with n = 4?

We’ve already seen that a sample size of one (n=1) is a bad idea.
Instead, the consultant randomly selects 4 cans (\(n=4\)) and finds the average can fill based on the four can measurements.
- \(X\) is the measurement from one can.
  - X comes from a normal distribution with \(\mu=12\) and \(\sigma=0.4\)
  - Shorthand notation: \(X\sim N(12,0.4)\)
  - \(\sim\) is read as “is distributed as” and \(N\) stands for the normal distribution

\(\frac{X_{1}+X_{2}+X_{3}+X_{4}}{4}=\overline{X}\) is the sample mean from four can measurements.
- \(\overline{X}\) has a different distribution than X because it is an estimate based on multiple observations

Comparison of Distributions of \(X\) and \(\overline{X}\)

X is 1 measurement from 1 can from a normal distribution

\(X\sim N(12,0.4)\)

\(\overline{X}\) is the sample mean from 4 \((n=4)\) can measurements.

\(\overline{X}\sim N(12,\frac{0.4}{\sqrt{4}})\)
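A small simulation can make the narrower distribution of \(\overline{X}\) concrete. This is a sketch in base R, assuming the plant is calibrated correctly; the seed, the number of simulated samples, and the variable names are arbitrary choices for illustration.

```r
set.seed(261)       # arbitrary seed so the results are reproducible
n_sims <- 10000     # number of simulated samples

# 10,000 single cans (n = 1)
single_cans <- rnorm(n_sims, mean = 12, sd = 0.4)

# 10,000 sample means, each based on 4 cans (n = 4)
means_of_4 <- replicate(n_sims, mean(rnorm(4, mean = 12, sd = 0.4)))

sd(single_cans)   # close to 0.4
sd(means_of_4)    # close to 0.4 / sqrt(4) = 0.2
```

Both sets of values are centered near 12, but the sample means are spread about half as widely as the single cans.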

Sampling Distribution of the Sample Mean

The sample mean is the average of multiple measurements or observations which provides more information.
This increase in information translates to a more precise and more narrow normal distribution
- The size of the sample used to create the mean affects how precise the distribution is.
X is a single observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).
- \(X\sim N(\mu,\sigma)\)
\(\overline{X}\) is also normally distributed, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) (sigma divided by the square root of the sample size).
- \(\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
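For anyone curious where \(\sigma/\sqrt{n}\) comes from, here is a short sketch of the standard argument. It assumes the n observations are independent and all come from the same population; the derivation is not required for this course.

\[
\mathrm{Var}(\overline{X})
= \mathrm{Var}\!\left(\frac{X_{1}+X_{2}+\cdots+X_{n}}{n}\right)
= \frac{1}{n^{2}}\left(\sigma^{2}+\sigma^{2}+\cdots+\sigma^{2}\right)
= \frac{n\sigma^{2}}{n^{2}}
= \frac{\sigma^{2}}{n}
\]

\[
\sigma_{\overline{x}} = \sqrt{\frac{\sigma^{2}}{n}} = \frac{\sigma}{\sqrt{n}}
\]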

Finding a probability based on a sample mean, \(\overline{X}\)

What is the probability (percent chance) that a sample mean of 4 cans \((n=4)\) will have a fill of 11.5 oz or less?

vdist_normal_prob(11.5, mean=12, sd=0.4/sqrt(4), type="lower")
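The same calculation can be written out as a Z value, using the can fill numbers from this example:

\[
Z = \frac{\overline{X}-\mu}{\sigma/\sqrt{n}} = \frac{11.5-12}{0.4/\sqrt{4}} = \frac{-0.5}{0.2} = -2.5
\]

A sample mean of 11.5 oz is 2.5 standard deviations below the mean of the sampling distribution, and \(P(Z \le -2.5) \approx 0.006\), or about 0.6%.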

Examining Four Cans is Better But…

In practice, the sample size would be predetermined by the plant and the consultant before the data were collected.
- I would argue for a sample size of at least 30 cans, if possible, just in case the information about the distribution is imperfect.
Predetermining the sample size is essential so that no one tries to bias the results by adding to the data after it has been examined.
In this hypothetical case, we are examining the effect of increasing the sample size to show how it affects the distribution.
When we sampled four cans (\(n=4\)), the probability that the sample mean is 11.5 oz or less is 0.6%.
- Given how expensive it is to shut down the plant to recalibrate, the plant might still want a larger sample size.
- What is the probability that a sample mean based on 16 cans \((n=16)\) would have a can fill less than 11.5 oz?

💥 Lecture 8 In-class Exercises - Q3 💥

What is the probability (percent chance) that a sample mean based on 16 cans \((n=16)\) would have a can fill less than 11.5 oz?

Probability (from Question 3) is not 0, but it’s pretty close.

Exact probability using a different R command (not required):

pnorm(11.5, mean=12, sd=0.4/sqrt(16), lower.tail = TRUE)

[1] 0.0000002866516

The vdist commands are the only ones required in this part of the course, but we can get answers with more precision.
In practice, if a probability is less than 0.0001 (0.01%), a data scientist would consider that to be extremely unlikely.
In practical terms:
- If we sample 16 cans and get a sample mean less than 11.5 oz, one of two things is true:
  - The mean can fill at the plant is less than 11.5 and the plant should recalibrate
  - The can fill was measured incorrectly (measurement error)

Comparison of the Distributions (n=1, n=4, n=16)

If \(\overline{X}\) is based on n > 1 observations, \(\sigma\) is replaced with \(\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}\)
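The three cases can be compared side by side in base R. This is a sketch with illustrative variable names; it uses `pnorm()`, which accepts a vector of standard deviations, so all three sample sizes are handled in one call.

```r
mu <- 12
sigma <- 0.4
n <- c(1, 4, 16)

# Standard deviation of the sample mean for each sample size
se <- sigma / sqrt(n)
se   # 0.4 0.2 0.1

# Probability that the sample mean is 11.5 oz or less for each n
prob <- pnorm(11.5, mean = mu, sd = se)

data.frame(n = n, se = se, prob = signif(prob, 3))
# prob is about 0.106 for n = 1, 0.00621 for n = 4, and 2.87e-07 for n = 16
```

Each time the sample size is multiplied by 4, the standard deviation of the sample mean is cut in half.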

💥 Lecture 8 In-class Exercises - Q4 💥

The following question is also Question 14 of HW Assignment 4.

If the sample size is increased the standard deviation of the sampling distribution of the sample mean will ___.

Example Two - Academic Calculus App

A start-up academic app claims it can USUALLY help students increase their college calculus test scores by 10 points on average, BUT (of course) there is variability in their success rate.

mean (\(\mu\)) increase is 10 points
standard deviation (\(\sigma\)) of increase is 5 points

Use this information to answer the following few questions.

💥 Lecture 8 In-class Exercises - Q5 and Q6 💥

Find the probability that a single student using the app will increase their test score by 12 points or more.

How many standard deviations is an increase of 12 points away from the mean of 10 pts.?

💥 Lecture 8 In-class Exercises - Q7 and Q8 💥

Based on the app’s success, a professor asks their whole class (n=25 students) to use it.

What is the probability that this class of 25 will increase their score by an average of 12 pts. or more?

For sample of 25 students, how many standard deviations is an average increase of 12 points away from the mean of 10 pts.?

Something to consider:

We know from BOTH the probability (prev. question) and the Z value (because of the Empirical Rule) that an average increase of 12 points for the whole class may be a little ambitious.
BUT it is also unlikely that the whole class will see an average increase of only 8 points.
However, without doing any calculations, we know there is a 50% chance that the average increase for all 25 students will be 10 points or more.
Why is that true?
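A hedged sketch of these three statements in base R. It reads each one as a tail area (10 points or more, 8 points or less, 12 points or more), which is an assumption about the intended wording, and the variable names are illustrative.

```r
mu <- 10      # mean increase (points)
sigma <- 5    # SD of increase (points)
n <- 25       # class size

se <- sigma / sqrt(n)   # SD of the class average: 5 / 5 = 1

# The normal distribution is symmetric around its mean,
# so half of the area lies at or above 10 points.
p_at_least_10 <- pnorm(10, mean = mu, sd = se, lower.tail = FALSE)
p_at_least_10   # 0.5

# 8 and 12 are both 2 SDs from the mean, so the two tails match
p_at_most_8   <- pnorm(8,  mean = mu, sd = se)
p_at_least_12 <- pnorm(12, mean = mu, sd = se, lower.tail = FALSE)
round(c(p_at_most_8, p_at_least_12), 4)   # 0.0228 for both
```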

Comparing Distributions for the Calculus App Data

Preview - Asynchronous Lecture 9 - The CLT

Today we covered how the sample mean from a normal population has a different distribution than the population itself.
The mean is the same, but the standard deviation is divided by the square root of the sample size, making the distribution more precise.
Here’s a weird cool fact:
- Even if the population distribution is not normal, e.g. left skewed, right skewed, discrete, or unknown, the sampling distribution of the sample mean is approximately NORMAL if the sample size is large enough.
- There is some dispute about the sample size needed, but 30 or more is recommended especially if the population distribution is skewed or unknown.
This concept is the Central Limit Theorem (CLT) and is useful as we transition to looking at real data in the next section of the course.
There are a lot of videos that explain the Central Limit Theorem. I will develop a video for Tuesday that includes a couple of questions and provide links to other videos.
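A quick simulation gives a preview of the CLT. This is a sketch under stated assumptions: the population is exponential with rate 1 (right skewed, with mean 1 and SD 1), chosen only as an example of a non-normal population, and the seed and number of simulated samples are arbitrary.

```r
set.seed(261)   # arbitrary seed so the results are reproducible
n <- 30         # sample size

# 10,000 sample means, each from 30 observations of a right-skewed population
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to 1 / sqrt(30), about 0.18

# Share of sample means within 2 SDs of the mean of the sampling distribution
within_2 <- mean(abs(sample_means - 1) <= 2 / sqrt(n))
within_2   # close to 0.95, as the Empirical Rule suggests for a normal distribution
```

Plotting `hist(sample_means)` should show a roughly bell-shaped histogram even though the population itself is strongly skewed.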

Key Points from Today

Sampling Distribution of the Sample Mean
- X represents a single observation from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\):
  - \(X\sim N(\mu,\sigma)\)
  - \(Z = \frac{X-\mu}{\sigma}\)
- A sample mean based on n observations: \(\overline{X} \sim N(\mu, \frac{\sigma}{\sqrt{n}})\)
  - \(Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\)
- Use the same commands, vdist_normal_prob or vdist_normal_perc, but divide the population SD by the square root of the sample size, n.
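The same adjustment works when going from a percentile to a value. The sketch below uses base R's `qnorm()` rather than the course commands, and the 5% level is a made-up illustration, not a value from the lecture.

```r
mu <- 12
sigma <- 0.4
n <- 16

# Value below which only 5% of sample means (n = 16) would fall
cutoff <- qnorm(0.05, mean = mu, sd = sigma / sqrt(n))
round(cutoff, 2)   # about 11.84 oz

# Round trip: the lower-tail probability at that cutoff is 5% again
pnorm(cutoff, mean = mu, sd = sigma / sqrt(n))   # 0.05
```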

To submit an Engagement Question or Comment about material from Lecture 8: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 8