STM1001 Lecture 5

class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 Lecture 
# Simulations in R
## Data Science stream
### La Trobe University

---

# Welcome!

### In this lecture we cover an introduction to simulations in R.

* By the end of this lecture you will:

* know how to conduct simulations in R
  
--

* have a deeper understanding of the Central Limit Theorem

<br>

*Reminder: We will use the RStudio interface to carry out R coding for all our STM1001 Data Science-stream work.*

---

# Overview

Over the following slides, we will cover:

* Random Sampling

* Running Simulations in R

* Visualising Simulation Results in R

* The Central Limit Theorem

---

# 1. Sampling Distributions

When we conduct a data analysis, we typically are using data which has been sampled from a population of interest.

It is usually not possible to collect data from the entire population of interest.

Therefore we take a random sample, with the expectation that any inferences we make using our sample data can provide information about the population.

We rely on statistical techniques to quantify the uncertainty relating to the estimates and inferences we make about the population based on our sample data.

Let's take a look at a simple [visualisation](https://stm1001.github.io/simple_sampling/) to help us better understand random sampling.

---

---

# 2. Simulations in R

The data shown in the previous slide were not sampled from real sources.

Instead, they were produced via .teal_style[simulations] in R, following specific coded instructions.

Running simulations can save a great deal of time compared to collecting actual data, and in some contexts allows us to run analyses which would be otherwise impossible.

Conducting simulations in R is quite straightforward - let's go through the steps together now.

---

# 2. Simulations in R

Recall in the [Core Computer Lab 4](https://rpubs.com/LTU_STM1001/CL4) that we introduced several `norm` functions, which can be used in R to conduct calculations relating to the normal distribution.

We can use `rnorm()` to randomly generate values from a normal distribution.

For example, the code

``` r
random_data <- rnorm(100)
```

will generate 100 observations (i.e. data points) from the standard normal distribution.

In other words, this code is sampling 100 values from `$Z \sim N(0, 1)$`.

* Most values will be close to the mean of 0
* The data will not be too spread out, since the variance is 1

---

We can visualise these simulated data using e.g. a histogram:

``` r
hist(random_data, freq = F, col = "turquoise", xlim = c(-3, 3),
     main = paste("Histogram of random sample \n from standard Normal distribution, n = 100")) 
# Overlay standard normal curve
curve(dnorm(x), add = T, col = "blue", lwd = 2)
# Add line denoting sample mean
abline(v = round(mean(random_data), 3), col = "red", lwd = 2, lty = 2)
```

---

# The Sample Mean vs the Population Mean

The histogram just seen shows simulated values sampled from a random variable `$Z\sim N(0,1)$`.

However, our main interest is often not the underlying data or variability of the sample values themselves, but rather .teal_style[the sample mean], and by extension, .teal_style[the population mean].

If we check the sample mean for our simulated data, we find `$\overline{x} \approx 0.109$` (shown by the red dashed line in the histogram).

Since we have taken just one sample of 100 observations, we cannot really be confident that the sample mean we obtained will be an accurate estimate of the population mean (which we specified to be 0).

*Remember that in practice, we typically will not know the population mean nor the population variance, for our population of interest.*

---

# Simulating Sample Means

Rather than relying solely on the one sample mean estimate, it would be helpful if we had multiple estimates.

To save time and other resources, one option here is to use simulations. This will allow us to:

1. Obtain multiple sample mean estimates

2. Calculate the mean of the sample means

3. Use this result as a more accurate estimate of the population mean

We could replicate the `random_data <- rnorm(100)` code shown previously, again and again over numerous lines, but this will quickly become unwieldy.

A more elegant approach would be to wrap the simulation process inside a control flow construct.

In R, one method for conducting this process is to use `for loops`.

---

# for loops

A `for loop` is a function that will repeat a specified set of operations a specified number of times.

The code below provides a simple example of the R syntax for a `for loop`:

``` r
for(i in 1:10){
  x <- rnorm(i)
  mean(x)}
```

Here:

* The `for()` function is used to conduct the `for loop`

* The `i` is an input, which we have specified can take integer values from 1 to 10 inclusive

* The body of the `for loop` (everything inside the `{}` section) specifies the calculations to be performed

In this example, the code will simulate `i` random values from the standard normal distribution (starting at `i = 1`), compute the mean of these values, and then repeat the loop, for the next value of `i`, until the final specified value of `i = 10` has been used.

---

# for loops

For clarity, the output below displays results for the first 5 iterations of the `for loop` code shown previously:

```
## [1] -0.6203667
## [1] -0.6203667
## [1]  0.04211587 -0.91092165
## [1] -0.4344029
## [1]  0.1580288 -0.6545846  1.7672873
## [1] 0.4235771
## [1] 0.7167075 0.9101742 0.3841854 1.6821761
## [1] 0.9233108
## [1] -0.6357365 -0.4616447  1.4322822 -0.6506964 -0.2073807
## [1] -0.1046352
```

Note we have the sample value(s), followed by the sample mean, and then the loop begin anew.

---

# Multiple Simulations

We can use this `for loop` structure to conduct multiple simulations.

For example, suppose that we would like to conduct 1000 trials, where each trial consists of sampling 5 values from a standard normal distribution.

For each trial, our aim is to compute the sample mean, so that we end up with 1000 sample means.

If we compute the mean of these sample means, this should provide a much more accurate estimate of the population mean, than if we simply conducted one sample procedure.

---

# Multiple Simulations

Example R code for this process is shown below (and you can practice this in this week's Data Science computer lab):

``` r
#Specify sample size
ns <- 5
# Specify number of times to conduct process  
trials <- 1000
# Create object in which to store sample means once generated
norm.means <- rep(0, trials) # Note length = number of trials
# Run for loop  
for(i in 1:trials){
    # Randomly generate ns samples from the normal distribution
    x <- rnorm(ns) 
    # Compute mean of samples, store in the norm.means object, 
    # in ith position 
    norm.means[i] <- mean(x)
    }
```

---

# Distribution of the Sample Mean

We observe that the distribution of the simulated sample means becomes narrower as the sample size increases.

---

# CLT Simulations

For the previous example, the underlying data was normally distributed, so it stands to reason that the distribution of the sample means was also normal.

* We can treat this as a precursor to the Central Limit Theorem (CLT)

---

# 3. The Central Limit Theorem

Logically, it seems reasonable to assume that as we increase our sample size `$n$`, the accuracy of the estimates and inferences we make will increase.

One key metric we typically focus on is the sample mean, denoted `$\overline{X}$`.

In [Topic 4 of the Core content](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/3-the-central-limit-theorem.html), we introduced the remarkable Central Limit Theorem (CLT), which is a fundamental result in Statistics.

Using the CLT, we find that as `$n$` increases:

* The variance of `$\overline{X}$` decreases

* The value of `$\overline{X}$` converges to the (unknown) population mean, denoted `$\mu$`

---

# The Central Limit Theorem Equation

Let `$X_1, \ldots, X_n$` be a random sample from a distribution with finite mean `$\mu$` and finite variance `$\sigma^2$`.

For `$\overline{X}$` denoting the sample mean, if `$n$` is sufficiently large then
$$\overline{X}\stackrel{\tiny \text{approx.}}\sim N\left(\mu,\frac{\sigma^2}{n}\right) , $$
where `$\stackrel{\tiny \text{approx.}}\sim$` denotes 'approximately distributed as'.
</center>
<br>

<br>

In other words, while samples of individuals can exhibit significant variation, samples of means tend to have less variability.

* Each mean is calculated from a sample of individuals, reducing the impact of outliers

---

# Using the Central Limit Theorem with skewed data

When our underlying data is **not** normally distributed, it is important to note that:

<center>.teal_style[The distribution of the sample mean obtained from the underlying data will always be (approximately) normal, as long as the requirements of the CLT hold].</center>

To demonstrate the CLT in practice, and to highlight its versatility and utility, let us now conduct some simulations involving data generated from the .teal_style[exponential distribution].

* Unlike the normal distribution, the exponential distribution is highly skewed (i.e. asymmetric)

Let's take a look at some [visualisations](https://stm1001.github.io/clt_shiny_app/) to help us better understand the CLT.

---

# The Exponential Distribution

Recall that we introduced the exponential distribution in [the Topic 3 readings](https://bookdown.org/a_shaker/STM1001_Topic_3/5-other-continuous-distributions.html).

The exponential distribution is defined by one parameter, known as the rate parameter. The rate parameter is denoted by the Greek letter `$\lambda$`.

If a random variable `$X$` follows an exponential distribution with rate parameter `$\lambda$`, we write `$X \sim \text{Exp}(\lambda)$`.

* Here `$X$` will have mean `$\dfrac{1}{\lambda}$` and variance `$\dfrac{1}{\lambda^2}$`

To simulate data from an exponential distribution in R, we can use the function `rexp()`.

---
<img src="data:image/png;base64,#STM1001_DS_Week_5_Lecture_files/figure-html/unnamed-chunk-9-1.png" width="400px" style="display: block; margin: auto;" />

``` r
random_data <- rexp(100, 1)
hist(random_data, freq = F, col = "turquoise", xlim = c(-2, 4),
     main = paste("Histogram of random sample \n from exponential distribution \n with rate = 1 and n =100")) 
# Overlay  normal curve
curve(dnorm(x, 1, 1), add = T, col = "blue", lwd = 2)
# Add line denoting sample mean
abline(v = round(mean(random_data), 3), col = "red", lwd = 2, lty = 2)
```

---

We observe that although the distribution of the underlying data (previous slide) was clearly non-normal, the distribution of the simulated sample means appear increasingly normal as the sample size increases.

Feel free to experiment with the [CLT web app](https://stm1001.github.io/clt_shiny_app/) to explore this result further.

---

---

# End

That concludes our introduction to simulations in R lecture.

What to do next:

* Please attend/complete Computer Lab 5B, in which we will practice running simulations in R and further develop our R coding skills.

* If you have any questions, we can resolve them in the computer labs.

---

background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

---
class: middle

<font color = "grey">
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>