MAS 261 - Lecture 8

Sampling Distribution of the Sample Mean / CLT

Author

Penelope Pooler Eisenbies

Published

September 15, 2024

Housekeeping

Today’s plan
- Review Question about Empirical Rule
- A few minutes for R Questions 🪄
- Quick Review of Normal Distribution
  - Questions covered so far have been for a single observation (n=1)
- Sampling Distribution of the Sample Mean
  - How does the Normal Distribution change when n > 1?
- Introduction to the Central Limit Theorem
- Questions about HW 4
- In-class Exercises

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

Lecture 8 In-class Exercises - Q1

Session ID: MAS261f24

At the local Trader Joe’s on Sundays, the mean number of customers is 780 and the standard deviation is 40. Use the Empirical Rule to determine the probability that they will have between 860 and 900 customers next Sunday.

Step 1. Convert range endpoints to Z values

Step 2. Use this helpful diagram.
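
The two steps can be sketched in base R. This is a minimal sketch, not part of the original exercise: the variable names are illustrative, and the 95% and 99.7% figures are the standard Empirical Rule values for 2 and 3 standard deviations.

```r
# Trader Joe's Sunday customers: values taken from the question
mu    <- 780
sigma <- 40

# Step 1: convert the range endpoints to Z values
z_low  <- (860 - mu) / sigma
z_high <- (900 - mu) / sigma
c(z_low, z_high)

# Step 2: about 95% of values fall within 2 SDs of the mean and about
# 99.7% within 3 SDs, so the band between +2 and +3 SDs holds half of
# the difference between those two percentages
emp_rule_pct <- (99.7 - 95) / 2
emp_rule_pct
```

The endpoints sit 2 and 3 standard deviations above the mean, so the Empirical Rule puts the chance at roughly 2.35%.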

Normal Distribution

In Lectures 6 and 7, we discussed the normal distribution

It is symmetric and bell-shaped.
Its location is determined by the population mean, $\mu$
Its width is determined by the population standard deviation, $\sigma$
Regardless of the values of $\mu$ and $\sigma$, the normal distribution has a consistent shape
That shape is well known and provides information about all normally distributed populations.

Normal Distribution

So far we’ve talked about a SINGLE observation from normally distributed data:
- A single future year for annual average movie gross
- Price of eggs at a single store
- A single morning of trading on the NYSE
- A single Sunday of business at Trader Joe’s

Today we’ll talk about how our understanding of the distribution changes when we ask a question about a sample mean with n > 1
We’ll start by introducing a case where n = 1 is inappropriate and then increase the sample size.

A supply chain example

A manufacturing plant is supposed to fill cans with 12 oz. of Coca-Cola, on average, with a standard deviation of 0.4 ounces.

Population mean: $\mu=12$ ounces

Population SD: $\sigma = 0.4$ ounces

A supply chain consultant has been hired to help confirm whether this is true.

The plant owner is concerned that cans are being underfilled.

Decision Criteria

A manufacturing plant is supposed to fill cans with 12 oz. of Coca-Cola, on average, with a standard deviation of 0.4 ounces.

Population mean: $\mu=12$ ounces

Population SD: $\sigma = 0.4$ ounces

Industry standards state that if the can(s) examined have an average fill of less than 11.5 oz, the plant must shut down and recalibrate…which is EXPENSIVE!

If a single can is chosen, what is the probability that the can fill will be 11.5 oz or less?

Lecture 8 In-class Exercises - Q2

Session ID: MAS261f24

What is the probability (percent chance) that a single can will have a fill of 11.5 oz or less?

Examining Only ONE Can - NOT WISE!

There is about an 11% chance that a random can will have 11.5 ounces or less if the plant is calibrated correctly.
P(X < 11.5) = 10.6%
Recalibration costs MILLIONS of dollars! Should this decision be based on one randomly chosen can?
- NO!
A consultant or analyst who based this decision on only one randomly selected can would be committing malpractice.
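
The 10.6% figure can be checked with base R's `pnorm`, the same command these slides later use for the 16-can case. A minimal sketch, with illustrative variable names:

```r
# Single can: X ~ N(12, 0.4)
mu    <- 12
sigma <- 0.4

# Z value for a fill of 11.5 oz
z_one_can <- (11.5 - mu) / sigma

# Lower-tail probability P(X < 11.5)
p_one_can <- pnorm(11.5, mean = mu, sd = sigma, lower.tail = TRUE)

z_one_can
round(100 * p_one_can, 1)   # percent chance
```

A fill of 11.5 oz is only 1.25 standard deviations below the mean, which is why it is not a rare event for a single can.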

How would this change with n = 4?

We’ve already seen that a sample size of one (n=1) is a bad idea.
Instead, the consultant randomly selects 4 cans ($n=4$) and finds the average can fill based on the four can measurements.
- $X$ is the measurement from one can.
  - X comes from a normal distribution with $\mu=12$ and $\sigma=0.4$
  - Shorthand notation: $X\sim N(12,0.4)$
  - $\sim$ is read as “is distributed as” and $N$ stands for the normal distribution

$\frac{X_{1}+X_{2}+X_{3}+X_{4}}{4}=\overline{X}$ is the sample mean from four can measurements.
- $\overline{X}$ has a different distribution than X because it is an estimate based on multiple observations.

Comparison of Distributions of $X$ and $\overline{X}$

X is 1 measurement from 1 can from a normal distribution

$X\sim N(12,0.4)$

$\overline{X}$ is the sample mean from 4 $(n=4)$ can measurements.

$\overline{X}\sim N(12,\frac{0.4}{\sqrt{4}})$
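
Where the $\sqrt{n}$ comes from: a short derivation, assuming the $n$ can measurements are independent draws from the same population (the slides imply this through random selection but do not state it).

$$
\mathrm{Var}(\overline{X})
= \mathrm{Var}\left(\frac{X_{1}+\cdots+X_{n}}{n}\right)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\mathrm{Var}(X_{i})
= \frac{n\sigma^{2}}{n^{2}}
= \frac{\sigma^{2}}{n}
\quad\Longrightarrow\quad
\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}
$$

With $\sigma = 0.4$ and $n = 4$, this gives $\sigma_{\overline{x}} = 0.4/2 = 0.2$ ounces.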

Sampling Distribution of the Sample Mean

The sample mean is the average of multiple measurements or observations which provides more information.
This increase in information translates to a more precise, narrower normal distribution
- The size of the sample used to create the mean affects how precise the distribution is.
X is an observation from a normal distribution with mean, $\mu$, and standard deviation sigma, $\sigma$. X is normally distributed.
- $X\sim N(\mu,\sigma)$
$\overline{X}$ is also normally distributed, with mean $\mu$ and standard deviation equal to sigma divided by the square root of the sample size, $\sigma/\sqrt{n}$
- $\overline{X}\sim N(\mu,\frac{\sigma}{\sqrt{n}})$
The GOOD NEWS: The sample size adjustment is straightforward to include in the R commands we have covered.
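
A quick simulation can make the narrowing visible. This is a sketch under stated assumptions, not part of the original lecture: the seed and the 10,000 repetitions are arbitrary choices, and the cans are assumed to be independent draws from $N(12, 0.4)$.

```r
set.seed(261)  # arbitrary seed so the simulation is reproducible

mu    <- 12
sigma <- 0.4
n     <- 4

# Draw 10,000 samples of n cans each and record each sample mean
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbars)        # should be close to mu = 12
sd(xbars)          # should be close to sigma / sqrt(n)
sigma / sqrt(n)    # theoretical standard deviation of the sample mean
```

The simulated sample means center on 12 but spread out only about half as much as individual cans do.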

Finding a probability based on a sample mean, $\overline{X}$

What is the probability (percent chance) that a sample mean of 4 cans $(n=4)$ will have a fill of 11.5 oz or less?

```{r echo=T}
vdist_normal_prob(11.5, mean=12, sd=0.4/sqrt(4), type="lower")
```
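
The same probability can be reached through the Z value. The sketch below uses base R's `pnorm` rather than the `vdist` command, and the variable names are illustrative.

```r
mu    <- 12
sigma <- 0.4
n     <- 4

# Standard deviation of the sample mean
se_xbar <- sigma / sqrt(n)

# Z value for a sample mean of 11.5 oz
z_xbar <- (11.5 - mu) / se_xbar

# Two equivalent routes to P(Xbar < 11.5)
p_from_z    <- pnorm(z_xbar)                        # standard normal
p_from_xbar <- pnorm(11.5, mean = mu, sd = se_xbar) # original scale

c(z = z_xbar, p_from_z = p_from_z, p_from_xbar = p_from_xbar)
```

With four cans, 11.5 oz is 2.5 standard deviations below the mean instead of 1.25, so the probability drops to about 0.6%.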

Examining Four Cans is Better But…

In practice, the sample size would be predetermined by the plant and the consultant before the data were collected.
- I would argue for a sample size of at least 30 cans, if possible, just in case the information about the distribution is imperfect.
Predetermining the sample size is essential so that no one tries to bias the results by adding to the data after it has been examined.
In this hypothetical case, we are examining the effect of increasing the sample size to show how it affects the distribution.
When we sampled Four Cans ($n=4$), the probability that the sample mean is 11.5 oz or less is 0.6%.
- Given how expensive it is to shut down the plant to recalibrate, the plant might still want a larger sample size.
- What is the probability that a sample mean based on 16 cans $(n=16)$ would have can fill less than 11.5?

Lecture 8 In-class Exercises - Q3

Session ID: MAS261f24

What is the probability (percent chance) that a sample mean based on 16 cans $(n=16)$ would have can fill less than 11.5?

Probability (from Q3) is not 0, but it’s pretty close.

Exact probability using a different R command (not required):

```{r echo=T}
# P(sample mean of 16 cans < 11.5 oz); the SD of the sample mean is 0.4/sqrt(16)
pnorm(11.5, mean = 12, sd = 0.4 / sqrt(16), lower.tail = TRUE)
```

[1] 0.0000002866516

The vdist commands are the only ones required in this part of the course, but we can get answers with more precision.
In practice, if a probability is less than 0.0001 (0.01%), a data scientist would consider that to be extremely unlikely.
In practical terms:
- If we sample 16 cans and get a sample mean less than 11.5, one of two things is almost certainly true:
  - The mean can fill at the plant is below the intended 12 oz and the plant should recalibrate
  - The can fill was measured incorrectly (measurement error)

Comparison of the Distributions (n=1, n=4, n=16)

If $\overline{X}$ is based on n > 1 observations, $\sigma$ is replaced with $\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}$
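
The three cases can be lined up side by side. A minimal base R sketch, with illustrative column names, that applies $\frac{\sigma}{\sqrt{n}}$ for each sample size:

```r
mu    <- 12
sigma <- 0.4
n     <- c(1, 4, 16)

# Standard deviation of the sample mean for each sample size
se <- sigma / sqrt(n)

# P(Xbar < 11.5) for each sample size
prob <- pnorm(11.5, mean = mu, sd = se)

data.frame(n = n, sd_of_mean = se, prob_below_11.5 = signif(prob, 3))
```

Each fourfold increase in the sample size halves the standard deviation of the sample mean, and the probability of a sample mean below 11.5 oz falls from about 10.6% to about 0.6% to about 0.00003%.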

Lecture 8 In-class Exercises - Q4

Session ID: MAS261f24

The following question is also Question 14 of HW Assignment 4.

If the sample size is increased, the standard deviation of the sampling distribution of the sample mean will ___.

Example Two - Academic Calculus App

A start-up academic app claims it can USUALLY help students increase their college calculus test scores by 10 points on average, BUT (of course) there is variability in their success rate.

mean ($\mu$) increase is 10 points
standard deviation ($\sigma$) of increase is 5 points

Use this information to answer the following few questions.

Lecture 8 In-class Exercises - Q5-Q6

Session ID: MAS261f24

Find the probability that a single student using the app will increase their test scores by 12 points or more.

How many standard deviations is an increase of 12 points away from the mean of 10 pts.?
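
One way to set up these two questions in base R, assuming the increases are normally distributed and reading the probability question as "12 points or more" (a continuous distribution gives zero probability to exactly 12). The variable names are illustrative.

```r
mu    <- 10   # mean increase (points)
sigma <- 5    # SD of increase (points)

# Z value: how many SDs is 12 points above the mean?
z_single <- (12 - mu) / sigma

# Upper-tail probability P(X >= 12) for one student
p_single <- pnorm(12, mean = mu, sd = sigma, lower.tail = FALSE)

c(z = z_single, prob = p_single)
```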

Lecture 8 In-class Exercises - Q7-Q8

Session ID: MAS261f24

Based on the app’s success, a professor asks their whole class (n=25 students) to use it.

What is the probability that this class of 25 will increase their score by an average of 12 pts. or more?

For a sample of 25 students, how many standard deviations is an average increase of 12 points away from the mean of 10 pts.?
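
The class-level version only changes the standard deviation. A minimal sketch, again assuming normally distributed increases and reading the question as "an average of 12 points or more":

```r
mu    <- 10   # mean increase (points)
sigma <- 5    # SD of increase (points)
n     <- 25   # class size

# Standard deviation of the class average
se_class <- sigma / sqrt(n)

# Z value for a class average of 12 points
z_class <- (12 - mu) / se_class

# Upper-tail probability P(Xbar >= 12)
p_class <- pnorm(12, mean = mu, sd = se_class, lower.tail = FALSE)

c(se = se_class, z = z_class, prob = p_class)
```

The same 2-point gap is 0.4 standard deviations for one student but 2 standard deviations for the average of 25 students.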

Something to consider:

We know from BOTH the probability (prev. question) and the Z value (because of the Empirical Rule) that an average increase of 12 points or more for the whole class may be a little ambitious.
BUT an average increase of only 8 points or less for the whole class is equally unlikely.
However, without doing any calculations, we know there is a 50% chance that the average increase for all 25 students will be at least 10 points.
Why is that true?
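
A hint, sketched in base R: the normal distribution is symmetric, so its mean is also its median. The check below reads the 50% statement as "at least 10 points" and the 8-point statement as "8 points or less"; both readings are assumptions.

```r
se_class <- 5 / sqrt(25)   # SD of the class average

# Half of the distribution lies at or above the mean
p_at_least_10 <- pnorm(10, mean = 10, sd = se_class, lower.tail = FALSE)

# 8 and 12 are both 2 SDs from the mean, so the two tails match
p_at_most_8   <- pnorm(8,  mean = 10, sd = se_class)
p_at_least_12 <- pnorm(12, mean = 10, sd = se_class, lower.tail = FALSE)

c(p_at_least_10, p_at_most_8, p_at_least_12)
```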

Comparing Distributions for the Calculus App Data

Preview - Central Limit Theorem (CLT)

The sample mean from a normal population ($\overline{X}$) has a different distribution than the population itself.
- The mean ($\mu$) is the same.
- The standard deviation is divided by the square root of the sample size ($\frac{\sigma}{\sqrt{n}}$) so the distribution is more precise.
The Central Limit Theorem - A weird cool fact:
- Even if the population distribution is not normal, e.g. left skewed, right skewed, discrete, or unknown, the sampling distribution of the sample mean is APPROXIMATELY NORMAL if the sample size is large enough.
- There is some dispute about the sample size needed, but $n \geq 30$ is recommended.
This fact is essential as we transition to looking at real data in this course.
Explaining the CLT will take about half a lecture or less. The remainder of Lecture 9 will be used for HW, R/RStudio, and Posit Cloud questions.
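
A small simulation previews the idea. This is a sketch, not the lecture's own demonstration: the exponential population (mean 1, SD 1) is an assumed example of a right-skewed distribution, and the seed and number of repetitions are arbitrary.

```r
set.seed(261)   # arbitrary seed for reproducibility

n    <- 30      # sample size suggested on the slide
reps <- 10000   # number of simulated samples (illustrative choice)

# Right-skewed population: exponential with mean 1 and SD 1
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbars)     # close to the population mean, 1
sd(xbars)       # close to sigma / sqrt(n)
1 / sqrt(n)     # theoretical value

# Share of sample means within one SD of the mean;
# a normal curve would give about 0.68
within_1sd <- mean(abs(xbars - 1) < 1 / sqrt(n))
within_1sd

# hist(xbars) draws the histogram of the simulated sample means
```

Even though single draws from this population are strongly skewed, the sample means behave much like draws from a normal distribution with standard deviation $\frac{\sigma}{\sqrt{n}}$.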

Key Points from Today

Sampling Distribution of the Sample Mean
- X represents a single observation from a normal distribution with mean $\mu$ and standard deviation $\sigma$.
- $X\sim N(\mu,\sigma)$
  - $Z = \frac{X-\mu}{\sigma}$
- A sample mean $\overline{X} \sim N(\mu, \frac{\sigma}{\sqrt{n}})$
  - $Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}$
- Use the same commands, vdist_normal_prob or vdist_normal_perc, but divide the population SD by the square root of the sample size, n.
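
The same adjustment works when going from a probability back to a cutoff. The sketch below uses base R's `qnorm` as a stand-in for the percentile command, and the 5% level is an illustrative choice that does not come from the lecture.

```r
mu    <- 12
sigma <- 0.4
n     <- 16

# Standard deviation of the sample mean
se_xbar <- sigma / sqrt(n)

# Fill level that only 5% of 16-can sample means fall below
cutoff_5pct <- qnorm(0.05, mean = mu, sd = se_xbar)
cutoff_5pct

# Round trip: the probability below that cutoff is 0.05 again
pnorm(cutoff_5pct, mean = mu, sd = se_xbar)
```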

To submit an Engagement Question or Comment about material from Lecture 8: Submit it by midnight today (day of lecture).

