MAS 261 - Lecture 7

Empirical Rule / Finding X from a Probability

Author

Penelope Pooler Eisenbies

Published

September 14, 2024

Housekeeping

Today’s plan
- Review Question about Normal Probability
- A few minutes for R Questions 🪄
- Review of the Normal Distribution
- Empirical Rule
  - Interpreting data values intuitively
- Finding an observed value, X, from a probability (percent chance)
Questions about HW 3
In-class Exercises

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

Lecture 7 In-class Exercises - Q1

Session ID: MAS261f24

The mean number of customers at a local cafe on a Monday morning is 39, with a standard deviation of 3.

What is the percent chance that they will have 45 or more customers next Monday morning?

Use the vdist_normal_prob command to help you answer this question.

Review of Histograms of Different Distributions

Histograms are an effective tool for examining the distribution of the data.

LEFT SKEWED

Tail pulled out to LEFT

Low outliers

e.g. Human Lifespan

NORMAL/SYMMETRIC

Data appear in a symmetric bell-shaped curve

No graphic evidence of outliers

e.g. Test scores

RIGHT SKEWED

Tail pulled out to RIGHT

High outliers

e.g. Movie Gross values

Hypothetical Histogram

Most of the data falls in the middle intervals
Distribution is symmetric, and bell-shaped with no outliers.

Histogram Overlaid with Density Curve

Recall that the proportions of data in the intervals sum to 1
The area under the curve ALSO equals 1
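
Both statements can be checked numerically. The sketch below uses base R's `hist()` on simulated data; the seed and sample size are arbitrary choices for illustration, not the data behind the slide.

```{r echo=T}
set.seed(5)
x <- rnorm(10000)                 # simulated, roughly normal data

h <- hist(x, plot = FALSE)        # bin the data without drawing a plot

props <- h$counts / length(x)     # proportion of data in each interval
sum(props)                        # proportions across all intervals

bar_areas <- h$density * diff(h$breaks)   # bar height x bar width
sum(bar_areas)                    # total area of the density-scaled histogram
```

Both sums come out to 1 (up to floating-point rounding), which is why the histogram and the density curve can share the same vertical scale.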

Normal Density Curve

We “smooth out” the histogram to a curve.
Area under the curve equals 1
We use this distribution to find the probability (percent chance) that a data value falls in a given range.
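
As a hedged illustration in base R (the slides rely on `vistributions` plots; `dnorm`, `pnorm`, and `integrate` are base R functions used here only as a cross-check), a probability is simply the area under the curve over a range of values:

```{r echo=T}
# Total area under the standard normal density curve
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_area

# Area over a range, e.g. between 0 and 1 (range chosen only for illustration)
area_0_to_1 <- integrate(dnorm, lower = 0, upper = 1)$value
area_0_to_1

# The same area from the cumulative distribution function
pnorm(1) - pnorm(0)
```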

Normal Distribution

In Lecture 6 we talked about the normal distribution:

It is symmetric and bell-shaped.
Its location is determined by the population mean, $\mu$
Its width is determined by the population standard deviation, $\sigma$
Regardless of the values of $\mu$ and $\sigma$, the normal distribution has a consistent shape
That shape is well known and provides information about all normally distributed populations.
Also recall that all normally distributed populations can be converted to the standard normal distribution $Z$
- Z is normally distributed with mean $\mu = 0$ and SD $\sigma = 1$.
- If X is from a normal population, $Z = \frac{X-\mu}{\sigma}$
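
A minimal sketch of the conversion in R. The values of `mu`, `sigma`, and `x` are hypothetical, chosen only so the arithmetic is easy to follow.

```{r echo=T}
mu    <- 100   # hypothetical population mean
sigma <- 15    # hypothetical population standard deviation
x     <- 130   # hypothetical observed value

z <- (x - mu) / sigma   # Z = (X - mu) / sigma
z                       # x is this many SDs above the mean

# The probability of falling below x is the same on either scale
pnorm(x, mean = mu, sd = sigma)
pnorm(z)
```

The two `pnorm` calls agree, which is what lets a single Z distribution stand in for every normal population.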

Normal Distribution - Empirical Rule

Part 1: In any Normal population, about 68% of values fall within one standard deviation of the mean (illustrated using the Z distribution).

Normal Distribution - Empirical Rule

Part 2: In any Normal population, about 95% of values fall within two standard deviations of the mean (illustrated using the Z distribution).

Normal Distribution - Empirical Rule

Part 3: In any Normal population, about 99.7% of values fall within three standard deviations of the mean (illustrated using the Z distribution).

True probability is 0.9973, so I think this package rounds up.

Normal Distribution - Empirical Rule

Also Referred to as the 68-95-99.7 Rule

Summarizing the Empirical Rule in Words

68% of all values are within 1 std. dev, $\sigma$, of the pop. mean, $\mu$

95% of all values are within 2 std. dev, $2\times \sigma$, of the pop. mean, $\mu$

99.7% of all values are within 3 std. dev, $3\times \sigma$, of the pop. mean, $\mu$
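
The three percentages can be reproduced with base R's `pnorm` (used here as a cross-check; it is not the plotting command from the slides):

```{r echo=T}
k <- 1:3                              # number of SDs from the mean
prob_within <- pnorm(k) - pnorm(-k)   # P(-k < Z < k)
round(prob_within, 4)                 # 0.6827 0.9545 0.9973
```

Rounded to the nearest tenth of a percent, these are the 68, 95, and 99.7 of the rule.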

How is the 68-95-99.7 Rule Useful?

R, Excel, other software, Normal tables, apps for phone or PC, etc. can ALL be used to find probabilities from a normal distribution.

BUT

Internalizing the Empirical Rule allows you to understand the probability of seeing observed data intuitively WITHOUT using a computer or phone.
Learning these rules and how to use them allows you to immediately evaluate data to determine:
- Is the observation reasonable?
- Is it unlikely but not too surprising?
- Is it so unlikely that it may be due to an error in data collection? or
- Is it so unlikely that it might cause us to reevaluate our assumptions about the population distribution?

Example: Trading on the NYSE

Historic data indicates that the first 30 minutes of New York Stock Exchange (NYSE) trading volume (millions of shares) is normally distributed with
a mean of 200 million shares, $\mu = 200$
a standard deviation of 26 million shares, $\sigma = 26$.

Answering Questions using the Empirical Rule

Use the Empirical Rule to find the probability that the trading volume will be in the range of 174 to 226 million shares.

A good approach is to convert range endpoints to Z-scores:

```{r echo=T}
(174 - 200)/26  # -1 means 1 sd below the mean
(226 - 200)/26  #  1 means 1 sd above the mean
```

[1] -1
[1] 1

174 to 226 is $\mu \pm \sigma$ (mean +/- 1 SD)
Recall the Rule:
- 68% of population within $\mu \pm \sigma$
- 95% of population within $\mu \pm 2\sigma$
- 99.7% of population within $\mu \pm 3\sigma$
The probability that trading volume will be between 174 and 226 million shares is about 68%.
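
To see how close the rule-of-thumb answer is, here is a sketch of the exact calculation using base R's `pnorm`. The commented line shows the same idea with the course command, assuming the `type="both"` form used elsewhere in this course.

```{r echo=T}
mu    <- 200   # mean trading volume (millions of shares)
sigma <- 26    # standard deviation (millions of shares)

# P(174 < X < 226) = area below 226 minus area below 174
exact <- pnorm(226, mean = mu, sd = sigma) - pnorm(174, mean = mu, sd = sigma)
round(exact, 4)   # 0.6827

# Course-package version (draws a shaded plot as well):
# vdist_normal_prob(c(174, 226), mean = 200, sd = 26, type = "both")
```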

Lecture 7 In-class Exercises - Q2

Session ID: MAS261f24

Use the Empirical Rule to find the probability that the NYSE morning trading volume will be in the range of 148 to 200 million shares.

Convert endpoints to Z scores
Hint: Normal distribution is SYMMETRIC

Lecture 7 In-class Exercises - Q3

Session ID: MAS261f24

Use the Empirical Rule to find the probability that the NYSE morning trading volume will be in the range of 200 to 278 million shares.

Convert endpoints to Z scores
Hint: Normal distribution is SYMMETRIC

Interpreting a Z score using the Empirical Rule

If Z is between -1 and 1, observed value is VERY LIKELY.
If Z is between -1 and -2 or between 1 and 2, observed value is NOT AT ALL UNLIKELY, BUT MAY NOT BE TOO COMMON (especially as Z gets closer to 2 or -2).
If Z is between -2 and -3 or between 2 and 3, observed value is UNLIKELY, BUT NOT TOO SURPRISING (until Z gets closer to 3 or -3).
If Z value is less than -3 or greater than 3, observed value is EXTREMELY UNLIKELY and could be due to error if $\vert{Z}\vert$ is very large.
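
The guidance above can be written as a small lookup. `interpret_z` is a hypothetical helper written for this illustration (it is not part of any package), and assigning the exact boundaries 1, 2, and 3 to the lower category is an arbitrary choice.

```{r echo=T}
# Hypothetical helper: label a Z score using the Empirical Rule guidance
interpret_z <- function(z) {
  cut(abs(z),
      breaks = c(0, 1, 2, 3, Inf),
      labels = c("very likely",
                 "not unlikely, but may not be too common",
                 "unlikely, but not too surprising",
                 "extremely unlikely"),
      include.lowest = TRUE)   # so that Z = 0 lands in the first category
}

z_examples <- c(-0.5, 1.5, -2.5, 3.5)   # made-up Z scores for illustration
data.frame(z = z_examples, interpretation = interpret_z(z_examples))
```

Because the normal distribution is symmetric, only the absolute value of Z matters for the label.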

Lecture 7 In-class Exercises - Q4

Session ID: MAS261f24

If trading in the first half hour is at 250 million shares, how should we interpret that?

Step 1. Convert 250 to Z score

Step 2. Use the guidance on the previous slide (based on the Empirical Rule) to interpret that Z score.

Finding X (observed value) from a Percentile

Sometimes what we want to know is what value would put us in

the top 10%
the bottom 5%
etc.

For example, how high would trading volume have to be to put it in the top 5%?
To answer this question we use a similar command to one we already know,
- vdist_normal_perc
- perc stands for percentile.
We (the user) specify the percentile and the output shows the value needed to achieve that percentile.

Finding X from a Percentile - NYSE

How high would trading have to be to put it in the top 5%?

```{r echo=T}
vdist_normal_perc(.05, mean=200, sd=26, type="upper")
```
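
As a cross-check on the command above, here is a sketch with base R's `qnorm` (an assumption on my part as a stand-in; the slides use `vdist_normal_perc`). It also shows that the cutoff is just the Z-score formula solved for X, $X = \mu + Z\sigma$.

```{r echo=T}
mu    <- 200   # mean trading volume (millions of shares)
sigma <- 26    # standard deviation (millions of shares)

# Top 5% cutoff: the value with 5% of the distribution above it
x_top5 <- qnorm(0.05, mean = mu, sd = sigma, lower.tail = FALSE)
round(x_top5, 2)   # 242.77

# Same cutoff built from the Z score: X = mu + Z * sigma
z_top5 <- qnorm(0.95)   # Z with 95% of the distribution below it
mu + z_top5 * sigma
```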

Finding X from a Percentile - Average Movie Gross

Recall our Annual Average Movie Gross example from Lecture 6:

The data follows an approximately normal distribution with
- Population Mean ($\mu$) = \$17.77 million
- Population Std. Dev. ($\sigma$) = \$2.41 million

Two Polling Questions:
- How low would the annual average gross have to be in 2024 to be in the bottom 10%?
- How high would the annual average gross have to be in 2024 to be in the top 2%?

Lecture 7 In-class Exercises - Q5-Q6

Session ID: MAS261f24

Use vdist_normal_perc to answer each of these questions.

Round each answer to two decimal places.

How low would the annual average gross have to be in 2024 to be in the bottom 10%?

How high would the annual average gross have to be in 2024 to be in the top 2%?

A note about `vdist_normal_perc`

In each of the previous questions, one logical choice is to match the command inputs to how the question is written.

You can get the same answer two different ways: ask for the upper tail directly, or ask for the complementary lower tail.

Example: How high would the annual average gross have to be in 2025 to be in the top 20%?

```{r echo=T}
vdist_normal_perc(.2, mean=17.77, sd=2.41, type="upper")
```

```{r echo=T}
vdist_normal_perc(.8, mean=17.77, sd=2.41, type="lower")
```
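
The equivalence can be confirmed numerically. This sketch uses base R's `qnorm` as a stand-in for the two `vdist_normal_perc` calls, on the assumption that both compute the same normal quantile:

```{r echo=T}
# Value with 20% of the distribution above it
upper_way <- qnorm(0.20, mean = 17.77, sd = 2.41, lower.tail = FALSE)

# Value with 80% of the distribution below it
lower_way <- qnorm(0.80, mean = 17.77, sd = 2.41)

c(upper_way, lower_way)
all.equal(upper_way, lower_way)   # TRUE
```

The upper-tail and lower-tail probabilities must add to 1 for the two calls to match.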

Preview of Lecture 8

In Lectures 6 and 7 we have talked about the normal distribution.
If we know our data are from a normal population, then we can easily find the probability of observing a single observation
- greater than or equal to a specific value
- less than or equal to a specific value
- within a specified range
In Lecture 8 we will talk about the probability of observing a sample mean.
- How does working with a sample mean, with sample size (n) greater than 1, change our calculations?
- Spoiler Alert: The adjustment to our calculations is very straightforward.

Key Points from Today

Normal Distribution is symmetric and bell-shaped
- Width is determined by the population standard deviation, $\sigma$.
- Location is determined by the population mean ($\mu$).
Empirical (68-95-99.7) Rule
- 68% of all values are within 1 std. dev, $\sigma$, of the pop. mean, $\mu$
- 95% of all values are within 2 std. dev, $2\times \sigma$, of the pop. mean, $\mu$
- 99.7% of all values are within 3 std. dev, $3\times \sigma$, of the pop. mean, $\mu$
Convert values of interest to Z scores and then use the rule to determine how likely a value or range of values is.
Finding a value of interest from a percent chance or percentile?
- use vdist_normal_perc and interpret

To submit an Engagement Question or Comment about material from Lecture 7: Submit it by midnight today (day of lecture).

