class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 [Topic 4](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/) Lecture (Part 1)

## Sampling Distributions

### La Trobe University

This lecture complements the [Topic 4 readings](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/)

---

# Topic 4: Related Links

## Readings

[Topic 4 readings](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/)

## Maths Background

[Scientific notation and E-notation](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#scientific-notation-and-e-notation)

[Squares, square roots and powers](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#squares-square-roots-and-powers)

[Fractions, squares and square roots](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#fractions-squares-and-square-roots)

## Notation

[Notation for Topics 3 and 4: Probability, Distributions and Sampling Distributions](https://bookdown.org/a_shaker/STM1001_Topic_0/notation-summary.html#topics-3-and-4-probability-distributions-and-sampling-distributions)

---

# Topic 4: Sampling Distributions

**Overview**

<iframe src="https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/" width="100%" height="400px" data-external="1"></iframe>

---

# Today's lecture

Today, we will introduce the following topics:

--

* Sampling

--

* The sample mean

--

* The Central Limit Theorem

--

* The distribution of the sample mean

--

We will learn how the distribution of an underlying population relates to the distribution of the sample mean.
---
name: stat
class: middle
background-image: url(data:image/png;base64,#slide_1.png)
background-size: 110%

---
name: stat
class: middle
background-image: url(data:image/png;base64,#slide_6.png)
background-size: 100%

---
name: stat
class: middle
background-image: url(data:image/png;base64,#slide_7.png)
background-size: 100%

---

# Sampling

<img src="data:image/png;base64,#stm1001_week1_population_sample.jpg" width="100%" style="display: block; margin: auto;" />

---

# Sampling

In previous weeks, we have discussed ***sampling***, which involves randomly selecting a ***sample*** of `\(n\)` units from a given ***population***.

* Often, we do this because we want to learn something about a ***population***

--

For example, suppose I wanted to know the population mean height of STM1001 students. Let's denote this as `\(\mu\)`.

--

Also suppose it is not feasible for me to survey every single STM1001 student (past and present - thousands of students) to calculate the population mean height, `\(\mu\)`.

--

* However, I could ask the height of a ***sample*** of students from the ***population***

--

* From this ***sample***, we could then calculate the mean height, which we would call the ***sample mean*** and denote `\(\bar{x}\)`

--

* We could then use `\(\bar{x}\)` to ***estimate*** the ***population mean*** `\(\mu\)`

---

# Sampling

We would hope the ***sample mean*** is close to the ***population mean***, because it is the ***population*** we want to learn about.

--

* However, since the ***sample mean*** is only an ***estimate***, we are never really sure how close our ***estimate*** is to the ***true value***

--

Statistics can help us determine how confident we can be in a given estimate. We can factor in things like variability, sample size, and sample design.

--

* This helps us when drawing ***inferences*** about a ***population*** from our ***sample estimates***

--

* In order to draw inferences, we need to know what the ***sampling distribution*** is.
Once we know this, it will be much easier to draw conclusions about our ***estimates*** by using ***probabilities***

--

* By the end of today's lectures, we will have learnt how to ascertain whether or not the ***distribution of the sample mean*** is ***Normal***, and if so, how to use this distribution to calculate ***probabilities***

* These concepts will be fundamental for what we learn in future topics

---

# The Sample Mean

* Suppose a random ***sample*** of `\(n = 10\)` STM1001 students' heights were:

`$$163, 174, 188, 161, 171, 173, 179, 170, 192, 171$$`

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-4-1.svg" width="60%" style="display: block; margin: auto;" />

---

# The Sample Mean

* Now suppose we obtain 7 more random samples of `\(n = 10\)` students:

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-5-1.svg" width="65%" style="display: block; margin: auto;" />

---

# The Sample Mean

Each time, we get a slightly different estimate for the sample mean and sample standard deviation:

|Sample   | Sample mean | Sample standard deviation |
|:--------|:-----------:|:-------------------------:|
|Sample 1 |    174.1    |            9.8            |
|Sample 2 |    173.8    |           11.8            |
|Sample 3 |    175.0    |           14.4            |
|Sample 4 |    169.0    |            8.3            |
|Sample 5 |    171.6    |           12.6            |
|Sample 6 |    174.4    |           13.5            |
|Sample 7 |    168.1    |           11.7            |
|Sample 8 |    172.5    |            9.0            |

--

* We don't know what the true value of the population mean `\(\mu\)` is, but we could use any one of our sample means to estimate it

--

*[To demonstrate here, we have taken numerous random samples, but normally we would only take one]*

---

# The Sample Mean

Now let's add a histogram of the eight ***sample means***:

--

.left-column[
<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
]

.right-column[
<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />
]

---

# The distribution of the sample mean

Comparing the histograms on the previous slide, you may have also noticed that all histograms (both red and green) were centered roughly around the same number.

--

* It turns out that if we know that the underlying distribution of some random variable `\(X\)` is Normal with mean `\(\mu\)`, we can assume that the mean of `\(\overline{X}\)` is also `\(\mu\)`

--

Comparing the histograms on the previous slide, does the red histogram contain more or less variability than the other histograms?

--

* It turns out that the distribution of the sample ***means*** is **less variable** than the distribution of the values for each ***individual***

--

* This is because it is a histogram of ***mean*** values, rather than a histogram of ***individual*** values

---

# The variance of `\(X\)` vs `\(\overline{X}\)`

If we know that the underlying distribution of some random variable `\(X\)` is Normal with variance `\(\sigma^2\)`, we can assume that the variance of `\(\overline{X}\)` is `\(\dfrac{\sigma^2}{n}\)`

--

Because the variance of `\(\overline{X}\)` is `\(\dfrac{\sigma^2}{n}\)`, dividing by a larger `\(n\)` gives a smaller result: larger samples lead to a smaller variance for `\(\overline{X}\)`.

---

# Distribution of the sample mean example

Let's consider the following example:

--

1. Suppose we generate 100 observations from the standard normal distribution [i.e. `\(N(0, 1)\)`], represented in the green histogram below:

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-9-1.svg" width="55%" style="display: block; margin: auto;" />

---

# Distribution of the sample mean example

2. Next, we generate `\(n = 5\)` observations from the standard normal distribution, obtain the estimate `\(\bar{x}\)` (the average of these 5 observations), and repeat this process a further 9,999 times so that we obtain 10,000 estimates.
These 10,000 estimates are represented in the red histogram below:

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-10-1.svg" width="55%" style="display: block; margin: auto;" />

---

3. The same procedure has been followed for the next two red histograms below, but with `\(n = 30\)` and `\(n = 60\)` respectively (instead of `\(n=5\)`).

--

* The blue line overlaying each histogram is a normal density curve

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-11-1.svg" width="70%" height="60%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-12-1.svg" width="88%" style="display: block; margin: auto;" />

---

# Distribution of the sample mean example

Observations:

--

* Note that all four histograms are centered around 0

--

* Also, the red histograms display less variability compared with the green histogram, and this variability decreases as `\(n\)` increases

--

* In this example, since the distribution of the **underlying population** (the green sample was taken from this population) was Normal, the distributions of the **sample means** were also Normal

---

# The distribution of the sample mean

But what if the distribution of the **underlying population** is **not** Normal? What will the distribution of the **sample means** be like then?
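One way to explore this question is by simulation. The sketch below is a minimal Python illustration, not the code used to produce the figures in these slides: it repeatedly draws samples of size `\(n\)` from a skewed exponential population and records each sample mean. The population mean of 0.1, the seed, and the function name `simulate_sample_means` are assumptions chosen for illustration.

```python
import random
import statistics


def simulate_sample_means(n, reps=10_000, mean=0.1, seed=1):
    """Return `reps` sample means, each computed from `n` exponential draws.

    `mean` is the (assumed) population mean of the exponential distribution.
    """
    rng = random.Random(seed)
    rate = 1 / mean  # random.expovariate expects the rate, i.e. 1 / mean
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]


for n in (5, 30, 60):
    means = simulate_sample_means(n)
    centre = statistics.fmean(means)   # centre of the simulated sample means
    spread = statistics.stdev(means)   # variability of the simulated sample means
    print(f"n = {n:2d}: mean of means = {centre:.4f}, sd of means = {spread:.4f}")
```

In a run of this sketch, the centre of the simulated means should stay close to the assumed population mean, while their spread should shrink as `\(n\)` grows. Plotting a histogram of `means` for each `\(n\)` would show how the shape changes.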
--

* For example, on the next slide, we see the results of a similar simulation, but this time where the data are sampled from the exponential distribution, which is known to be very skewed

---

# The distribution of the sample mean

<img src="data:image/png;base64,#Topic_4_Lecture_files/figure-html/unnamed-chunk-13-1.svg" width="65%" style="display: block; margin: auto;" />

---

# The distribution of the sample mean

Observations:

--

* As expected, the green histogram appears highly skewed

--

* All four histograms are centered around the same point (about 0.1)

--

* The data in the red histogram of means with `\(n = 5\)` are more symmetric than the green histogram, but still with some skew to the right and not fitting the normal curve as well as the other red histograms

--

* The red histograms with `\(n = 30\)` and `\(n = 60\)` appear to be normally distributed, and the normal density curves fit the data very well. This is remarkable, considering how much skew we observe in the underlying distribution

--

* The red histograms display less variability as `\(n\)` increases

--

Amazing fact: even if the underlying distribution is highly skewed, as long as the sample size is large (normally 30 or greater), the distribution of the means will still be approximately normal!

* This leads us to the remarkable ***Central Limit Theorem***

---

# The Central Limit Theorem

.content-box-blue[
.center[
**The Central Limit Theorem (CLT)**
]

Let `\(X_1, \ldots, X_n\)` be a random sample from a distribution with finite mean `\(\mu\)` and finite variance `\(\sigma^2\)`. For `\(\overline{X}\)` denoting the sample mean, if the sample size `\(n\)` is sufficiently large then

`$$\overline{X}\stackrel{\tiny \text{approx.}}\sim N\left(\mu,\frac{\sigma^2}{n}\right)$$`

where `\(\stackrel{\tiny \text{approx.}}\sim\)` denotes 'approximately distributed as'.
]

--

*Normally, a sample size of approximately `\(n = 30\)` is considered to be 'sufficiently large'.*

--

We will shortly consider some examples showing how this theorem can be applied when determining the distribution of `\(\overline{X}\)`.

---

# Determining the distribution of the sample mean

Given a sample of data, it can be very useful to ascertain the associated ***distribution of the sample mean***.

--

* Once we know what this distribution is, it makes it much easier to draw conclusions, or make inferences, about the population mean based on our sample

--

We will consider the following three scenarios:

--

1. It is known that the underlying population distribution is ***normal***

--

2. The underlying population distribution is ***not known***, and the ***sample size is 30 or more***

--

3. The underlying population distribution is ***not known***, and the ***sample size is less than 30***

---

# Population distribution is normal

If it is known that the underlying population distribution is normal, then we can assume the following:

$$ \text{If } X \sim N\left(\mu, \sigma^2\right), \text{then } \overline{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

This is true regardless of the sample size.

---

# Example 1: Population distribution is normal

Suppose `\(X\)` is normally distributed with a mean of `\(\mu = 9\)` and a **variance** of `\(\sigma^2 = 16\)`. Further suppose that a random sample of `\(n = 50\)` has been taken from this population. Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

--

**Step 1:** *Is the underlying distribution Normal?*

---

# Example 1: Population distribution is normal

Suppose <mark> `\(X\)` is normally distributed</mark> with a mean of `\(\mu = 9\)` and a **variance** of `\(\sigma^2 = 16\)`. Further suppose that a random sample of `\(n = 50\)` has been taken from this population.
Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

**Step 1:** *Is the underlying distribution Normal?* **Yes.**

--

Therefore, `\(\overline{X}\)` will also be normally distributed.

--

Recall that, since `\(X\)` is normally distributed, we can assume

$$ \text{If } X \sim N\left(\mu, \sigma^2\right), \text{then } \overline{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

---

# Example 1: Population distribution is normal

Suppose <mark> `\(X\)` is normally distributed</mark> with a mean of <mark> `\(\mu = 9\)` </mark> and a <mark> **variance** of `\(\sigma^2 = 16\)` </mark>. Further suppose that a random sample of `\(n = 50\)` has been taken from this population. Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

**Step 1:** *Is the underlying distribution Normal?* **Yes.**

Therefore, `\(\overline{X}\)` will also be normally distributed.

Recall that, since `\(X\)` is normally distributed, we can assume

$$ \text{If } X \sim N\left(\mu, \sigma^2\right), \text{then } \overline{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

From the question, we know `\(\mu = 9\)` and `\(\sigma^2 = 16\)`. That is, `\(X \sim N\left(9, 16\right).\)`

--

Therefore, `\(\overline{X} \sim N\left(9, \dfrac{16}{n}\right).\)`

---

# Example 1: Population distribution is normal

Suppose <mark> `\(X\)` is normally distributed</mark> with a mean of <mark> `\(\mu = 9\)` </mark> and a <mark> **variance** of `\(\sigma^2 = 16\)` </mark>. Further suppose that a random sample of <mark> `\(n = 50\)` </mark> has been taken from this population.
Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

Since `\(n = 50\)` and `\(\overline{X} \sim N\left(9, \dfrac{16}{n}\right),\)`

--

we have `\(\overline{X} \sim N\left(9, \dfrac{16}{50}\right),\)`

--

which we can simplify to `\(\overline{X} \sim N\left(9, \dfrac{8}{25}\right).\)`

--

**Note:** Since the variance of `\(\overline{X}\)` is `\(\frac{8}{25},\)` we can take the square root to get the standard deviation.

--

Therefore, the standard deviation of `\(\overline{X}\)` is `\(\sqrt{\frac{8}{25}}\)`

--

`\(=\frac{\sqrt{8}}{{\sqrt{25}}}\)`

--

`\(=\frac{\sqrt{8}}{{5}}.\)`

--

We can also express this as `\(\sqrt{\frac{16}{50}}\)`

--

`\(=\frac{\sqrt{16}}{\sqrt{50}}\)`

--

`\(=\frac{4}{\sqrt{50}}.\)`

--

You can use a calculator to check these numbers are equivalent.

---

# Example 2: Population distribution is normal

Suppose `\(X\)` is normally distributed with a mean of `\(\mu = 11\)` and a **standard deviation** of `\(\sigma = 6\)`. Further suppose that a random sample of `\(n = 50\)` has been taken from this population. Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

--

**Step 1:** *Is the underlying distribution Normal?*

---

# Example 2: Population distribution is normal

Suppose <mark> `\(X\)` is normally distributed</mark> with a mean of `\(\mu = 11\)` and a **standard deviation** of `\(\sigma = 6\)`. Further suppose that a random sample of `\(n = 50\)` has been taken from this population. Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

**Step 1:** *Is the underlying distribution Normal?* **Yes.**

--

Therefore, `\(\overline{X}\)` will also be normally distributed.
--

Recall that, since `\(X\)` is normally distributed, we can assume

$$ \text{If } X \sim N\left(\mu, \sigma^2\right), \text{then } \overline{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

---

# Example 2: Population distribution is normal

Suppose <mark> `\(X\)` is normally distributed</mark> with a mean of <mark> `\(\mu = 11\)` </mark> and a <mark> **standard deviation** of `\(\sigma = 6\)` </mark>. Further suppose that a random sample of `\(n = 50\)` has been taken from this population. Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

**Step 1:** *Is the underlying distribution Normal?* **Yes.**

Therefore, `\(\overline{X}\)` will also be normally distributed.

Recall that, since `\(X\)` is normally distributed, we can assume

$$ \text{If } X \sim N\left(\mu, \sigma^2\right), \text{then } \overline{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

From the question, we have that `\(\mu = 11\)` and `\(\sigma = 6\)`.

--

Note that we need to square the **standard deviation** to get the **variance**: `\(\sigma^2 = 6^2 = 36.\)`

--

Thus, `\(X \sim N\left(11, 36\right).\)`

--

Therefore, `\(\overline{X} \sim N\left(11, \dfrac{36}{n}\right).\)`

---

# Example 2: Population distribution is normal

Suppose <mark> `\(X\)` is normally distributed</mark> with a mean of <mark> `\(\mu = 11\)` </mark> and a <mark> **standard deviation** of `\(\sigma = 6\)` </mark>. Further suppose that a random sample of <mark> `\(n = 50\)` </mark> has been taken from this population. Write down the distribution of the associated sample mean, `\(\overline{X}.\)`

Since `\(n = 50\)` and `\(\overline{X} \sim N\left(11, \dfrac{36}{n}\right),\)`

--

we have `\(\overline{X} \sim N\left(11, \dfrac{36}{50}\right),\)`

--

which we can simplify to `\(\overline{X} \sim N\left(11, \dfrac{18}{25}\right).\)`

--

**Note:** Since the variance of `\(\overline{X}\)` is `\(\frac{18}{25},\)` we can take the square root to get the standard deviation.
--

Therefore, the standard deviation of `\(\overline{X}\)` is `\(\sqrt{\frac{18}{25}}\)`

--

`\(=\frac{\sqrt{18}}{{\sqrt{25}}}\)`

--

`\(=\frac{\sqrt{18}}{{5}}.\)`

--

We can also express this as `\(\sqrt{\frac{36}{50}}\)`

--

`\(=\frac{\sqrt{36}}{\sqrt{50}}\)`

--

`\(=\frac{6}{\sqrt{50}}.\)`

--

You can use a calculator to check these numbers are equivalent.

---

# Quiz 4 hint

Some questions in Quiz 4 may be worded like this:

*Suppose `\(X\)` is normally distributed with a mean of `\(\mu = 11\)` and a **standard deviation** of `\(\sigma = 6\)`. Further suppose that a random sample of `\(n = 50\)` has been taken from this population. Which of the following statements is/are true about the distribution of the associated sample mean, `\(\overline{X}\)`?*

--

You will first need to write down the distribution of `\(\overline{X}\)`.

--

Once you have done so, you will be able to consider each statement and decide whether or not it is true.

---

# Population distribution unknown and `\(n \geq 30\)`

If the underlying distribution is unknown, but the sample size is large (i.e. `\(n \geq 30\)`), then, from the Central Limit Theorem, we can assume:

$$ \overline{X}\stackrel{\tiny \text{approx.}}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

---

# Example: Population distribution unknown and `\(n \geq 30\)`

Suppose a random sample of `\(n = 60\)` is taken from a population with an unknown distribution that has `\(\mu = 62\)` and `\(\sigma^2 = 11.5^2\)`. What is the distribution of `\(\overline{X}\)`?

--

**Step 1:** *Is the underlying distribution Normal?*

---

# Example: Population distribution unknown and `\(n \geq 30\)`

Suppose a random sample of `\(n = 60\)` is taken from a population with an <mark> unknown distribution </mark> that has `\(\mu = 62\)` and `\(\sigma^2 = 11.5^2\)`. What is the distribution of `\(\overline{X}\)`?
**Step 1:** *Is the underlying distribution Normal?* **The underlying distribution is unknown.**

--

**Step 2:** *Is the sample size 30 or more?*

---

# Example: Population distribution unknown and `\(n \geq 30\)`

Suppose a random sample of <mark> `\(n = 60\)` </mark> is taken from a population with an <mark> unknown distribution </mark> that has `\(\mu = 62\)` and `\(\sigma^2 = 11.5^2\)`. What is the distribution of `\(\overline{X}\)`?

**Step 1:** *Is the underlying distribution Normal?* **The underlying distribution is unknown.**

**Step 2:** *Is the sample size 30 or more?* **Yes: we can therefore apply the Central Limit Theorem.**

--

That is,

$$ \overline{X}\stackrel{\tiny \text{approx.}}\sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

--

Although the Central Limit Theorem (CLT) only tells us that `\(\overline{X}\)` **approximately** follows the above distribution, for ease of notation we will use `\(\sim\)` in place of `\(\stackrel{\tiny \text{approx.}}\sim\)` from this point onwards.

---

# Example: Population distribution unknown and `\(n \geq 30\)`

Suppose a random sample of <mark> `\(n = 60\)` </mark> is taken from a population with an <mark> unknown distribution </mark> that has <mark> `\(\mu = 62\)` </mark> and <mark> `\(\sigma^2 = 11.5^2\)` </mark>. What is the distribution of `\(\overline{X}\)`?

From the question, we have that `\(\mu = 62\)` and `\(\sigma^2 = 11.5^2\)`

--

`\(= 132.25\)`.

--

Therefore, applying the Central Limit Theorem, we have

`$$\overline{X} \sim N\left(62, \dfrac{132.25}{n}\right).$$`

--

<br>

Since `\(n = 60\)`, `\(\overline{X} \sim N\left(62, \dfrac{132.25}{60}\right).\)`

---

# Population distribution unknown and `\(n < 30\)`

If the underlying distribution is unknown, and the sample size is small (i.e. `\(n < 30\)`), then it is not possible to apply the Central Limit Theorem. In this situation, the distribution of the sample mean is unknown.
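The three scenarios can be summarised as a simple decision rule. The following Python sketch is one hedged way to write it down; the function name `sample_mean_distribution` and its arguments are hypothetical, and the cut-off of 30 is the rule of thumb used in these slides rather than a strict mathematical boundary.

```python
from fractions import Fraction


def sample_mean_distribution(mu, sigma2, n, population_is_normal=False):
    """Describe the distribution of the sample mean.

    Returns a tuple (mean, variance, exact), where `exact` is False when the
    Normal distribution is only an approximation, or None when the
    distribution cannot be determined.
    """
    if population_is_normal:
        # Scenario 1: Normal population, so the sample mean is Normal for any n
        return mu, Fraction(sigma2) / n, True
    if n >= 30:
        # Scenario 2: unknown population, large sample (Central Limit Theorem)
        return mu, Fraction(sigma2) / n, False
    # Scenario 3: unknown population and small sample
    return None


# Values taken from the worked examples in these slides
print(sample_mean_distribution(9, 16, 50, population_is_normal=True))
print(sample_mean_distribution(62, "132.25", 60))
print(sample_mean_distribution(5, 1, 20))
```

Using `Fraction` keeps the variance as an exact fraction, so Example 1 gives a variance of 8/25 and Example 2 would give 18/25, matching the simplified forms in the slides. The last call returns `None` because neither condition is met.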
--

## Example:

Suppose a random sample of `\(n = 20\)` is taken from a population with unknown distribution that has `\(\mu = 5\)` and `\(\sigma^2 = 1\)`, and a sample mean is calculated.

--

**Step 1:** *Is the underlying distribution Normal?*

---

# Population distribution unknown and `\(n < 30\)`

If the underlying distribution is unknown, and the sample size is small (i.e. `\(n < 30\)`), then it is not possible to apply the Central Limit Theorem. In this situation, the distribution of the sample mean is unknown.

## Example:

Suppose a random sample of `\(n = 20\)` is taken from a population with <mark> unknown distribution </mark> that has `\(\mu = 5\)` and `\(\sigma^2 = 1\)`, and a sample mean is calculated.

**Step 1:** *Is the underlying distribution Normal?* **The underlying distribution is unknown.**

--

**Step 2:** *Is the sample size 30 or more?*

---

# Population distribution unknown and `\(n < 30\)`

If the underlying distribution is unknown, and the sample size is small (i.e. `\(n < 30\)`), then it is not possible to apply the Central Limit Theorem. In this situation, the distribution of the sample mean is unknown.

## Example:

Suppose a random sample of <mark> `\(n = 20\)` </mark> is taken from a population with <mark> unknown distribution </mark> that has `\(\mu = 5\)` and `\(\sigma^2 = 1\)`, and a sample mean is calculated. What is the distribution of `\(\overline{X}\)`?

**Step 1:** *Is the underlying distribution Normal?* **The underlying distribution is unknown.**

**Step 2:** *Is the sample size 30 or more?* **No,** since `\(n = 20\)` which is less than 30. Therefore, we cannot apply the Central Limit Theorem.

--

Thus, it is not possible to determine the distribution of `\(\overline{X}\)`.

---
name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Kahoot! (if time)

## Go to [www.kahoot.it](https://www.kahoot.it) and use

## the code provided

## Have calculators ready!
---

# Next

* In the second part of this lecture, we will learn how to calculate probabilities from the normal distribution: a very important concept leading into Topic 5.

* We will also learn how to calculate probabilities from the Binomial distribution

* These skills will be very helpful for Quiz 4

---
class: middle

<font color = "grey"> These notes have been prepared by Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a> </font>