Data Science Module

Topic 5B: Simulations in R


Welcome to the fifth Data Science computer lab.

In Topic 4, we introduced sampling distributions and the concept of the Central Limit Theorem.

In this computer lab we will cover how to conduct the simulations discussed in Chapter 3 of Topic 4. It might be helpful to keep this content open in a separate tab while you work through the lab material, in case you would like to quickly double-check your understanding of any concepts.

By the end of this lab, you should be able to conduct simple simulations in R. Let’s get started!

1 Sampling Distributions

1.1 Sampling

Recall that a sample is a random selection of units from a chosen population (e.g. a selection of STM1001 students, from the entire STM1001 cohort).

Because it is typically unfeasible to collect data on the entire population, we instead conduct analyses using a sample. We hope that the inferences we make using the sample data can be extrapolated to provide information about the population, and we can use statistical techniques to ascertain the accuracy of these inferences.

One such statistical technique we can use is the Central Limit Theorem.

1.2 The Central Limit Theorem

Recall the definition of the Central Limit Theorem (CLT):

Let \(X_1, \ldots, X_n\) be a random sample from a distribution with finite mean \(\mu\) and finite variance \(\sigma^2\).

For \(\overline{X}\) denoting the sample mean, if \(n\) is sufficiently large then \[\overline{X}\stackrel{\tiny \text{approx.}}\sim N\left(\mu,\frac{\sigma^2}{n}\right) , \] where \(\stackrel{\tiny \text{approx.}}\sim\) denotes ‘approximately distributed as’.


By the CLT, it follows that as we increase our sample size, the sample mean \(\overline{X}\) will get increasingly closer (i.e. converge) to the population mean \(\mu\). As \(n\) increases, the variance of the sample mean \(\overline{X}\) will also decrease.

In other words, while samples of individuals can exhibit significant variation, samples of means tend to have less variability (as each mean is calculated from a sample of individuals, reducing the impact of outliers).

Let’s visualise this process in R using some simulations.

2 Simulations in R

Carrying out simulations in R is actually quite straightforward.

We will walk through the steps involved one at a time.

2.1

Suppose that we are interested in the random variable \(X\), which follows the standard normal distribution - in other words, \(X \sim N(0, 1)\).

We can generate a sample of 100 observations from this distribution using the following R code:

set.seed(1)
n <- 100

x <- rnorm(n)

Run this code now.

Note: The default mean and standard deviation values used by rnorm are 0 and 1 respectively, so we don’t need to specify these.

2.2

We can plot our generated data using the hist function, as shown below.

hist(x, freq = FALSE, col = "chartreuse3", xlim = c(-3, 3),
     main = paste("Histogram of random sample \n from standard Normal distribution, n = ", n)) 
# Overlay standard normal curve
curve(dnorm(x), add = TRUE, col = "blue", lwd = 2)
# Add line denoting sample mean
abline(v = round(mean(x), 3), col = "red", lwd = 2, lty = 2)

Note that we have also used the curve and dnorm functions here to overlay the standard normal distribution probability density curve.

2.3

As expected, the 100 sample observations appear to fit the standard normal distribution probability density curve well.

Our main interest here is not the variability of the sample values however, but rather the sample mean.

If we check the sample mean of x, we find that this is equal to approximately 0.109 (shown by the red dashed line in the plot above). Since this is the sample mean for just one sample, we can’t be too confident in using this estimate as an accurate estimate of the population mean (which we know was specified to be 0).

Rather than relying on just one sample, we will now simulate a number of random samples from \(X\), in order to obtain a more accurate estimate of the sample mean \(\bar{X}\).

2.4

To begin, suppose we sample 5 observations of \(X\), and then compute the sample mean of these observations.

Using the code in 2.1, it follows that this process is quite straightforward:

set.seed(1)
n <- 5 # Specify sample size
x <- rnorm(n) # Randomly generate ns samples from the normal distribution
norm_mean <- mean(x) # Compute the mean of the samples, and store it in a new object

2.5

Since our underlying data is generated from a normal distribution, it follows that the distribution of the sample mean will also be normal (we can think of this result as a precursor to the Central Limit Theorem).

To accurately visualise the distribution of the sample mean, we will need to repeat the process outlined in 2.4 many times. This is where our simulations can really shine.

Suppose, rather than conducting the process outlined in 2.4 once, we instead conduct it 10000 times. It would take a while to write this code out line by line in R (over 9000 lines potentially!).

Instead, to save time, we will use a loop. Specifically, we will use what is known as a for loop. A for loop is a function that will repeat a specified set of operations a specified number of times.

The Code chunk below provides a simple example of the R syntax for a for loop.

for(i in 1:10){
  x <- rnorm(i)
  mean(x)
}

How do we interpret this code? Well,we have specified that for some input i, where i can take the integer values 1 up to 10, the for loop will simulate i random values from the standard Normal distribution, store them in the object x, and then compute the mean of x.

Thus, our for loop runs like this:

  • The for loop starts with the first specified value, 1, and simulates 1 random value from the standard Normal distribution.
  • The mean of x is now computed. This is the last operation in the for loop.
  • Since the specified operations inside the loop have been finished for that value of i, the loop starts again, with i=2 now.
  • This process continues, until the final specified value of i, namely i=10, has been used.
  • At this point, the for loop is complete.

Note: We do not have to use the letter i here, it is simply convention. You can use a different letter, but remember, make sure you have not used your chosen letter as a name for a previous object in which you have stored data.

2.6

The Code chunk below expands upon the for loop example discussed above in 2.5. Take a look at this code, read the comments carefully, and make sure you understand what the code does (note some values are missing, as denoted by the ...).

#Specify sample size
ns <- ...
  
# Specify number of times to conduct process  
trials <- ...

# create object in which to store sample means once they are generated
norm.means <- rep(0, trials) # Note length is equal to the number of trials

# Run for loop  
for(i in 1:trials){
    # Randomly generate ns samples from the normal distribution
    x <- rnorm(ns) 
    # Compute mean of samples, and store it in the norm.means object, in ith position 
    norm.means[i] <- mean(x) 
    }

Once you feel confident in your understanding of this code, replace the ... missing details for ns and trials, using the following details:

  • You want to take 10000 samples of randomly generated values from the standard Normal distribution.
  • Each sample should have a sample size of 5.

Once you have filled in the details, run your code.

Hint: If you have given this a decent shot, but are stuck, you can check the Code chunk below:

#Specify sample size
ns <- 5
  
# Specify number of times to conduct process  
trials <- 10000

# create object in which to store sample means once they are generated
norm.means <- rep(0, trials) # Note length is equal to the number of trials

# Run for loop  
for(i in 1:trials){
    # Randomly generate ns samples from the normal distribution
    x <- rnorm(ns) 
    # Compute mean of samples, and store it in the norm.means object, in ith position 
    norm.means[i] <- mean(x) 
    }

2.7

To visualise our results, we can use the following code:

hist(norm.means, freq = FALSE, breaks = 20, col = "red", 
     xlim = c(-3, 3),
     xlab = expression(bar(x)),
     main = paste("Histogram of means, n = ", ns)) 

We can also overlay a normal density curve, that, since our underlying data is normal, will reflect the distribution of the sample mean.

curve(dnorm(x, 
      mean = 0, sd = 1/sqrt(ns)), 
      add = TRUE, col = "blue", lwd = 2) 

Run these code chunks now, and check your results. Are our simulated sample means close to the population mean of 0?

2.8

Repeat steps 2.6 and 2.7, using a sample of 30 instead of 5. What do you notice about the distribution of the simulated sample means?

2.9

Finally, repeat steps 2.6 and 2.7, using a sample of 60. What do you notice about the distribution of the simulated sample means now?

3 CLT Simulation - Exponential Distribution

To demonstrate the magic of the Central Limit Theorem, let’s carry out some simulations involving data generated from distributions other than the normal distribution.

For example, let’s now consider the Exponential distribution. The Exponential distribution is known to be highly skewed (i.e. asymmetric). Clearly, we cannot fit a normal curve well to data generated from an Exponential distribution. However, as we work through this question, we will find that when we model the distribution of sample means of data generated from an Exponential distribution, this distribution of sample means can be fitted well by a normal curve, especially as our sample size increases. This (somewhat unintuitive) result is due to the Central Limit Theorem!

3.1

Recall that the Exponential distribution is defined by one parameter, known as the rate parameter (in contrast for example to the Normal distribution, which is defined by two parameters, namely the mean and the variance).

Suppose now that our random variable \(X\) follows an exponential distribution with rate parameter \(\lambda = 10\). In other words, \(X \sim Exp(10)\).

We can generate a sample of 100 observations from this distribution using the following R code:

set.seed(1)
n <- 100
rate = 10
x <- rexp(n, rate = rate)

Note that we use rexp to randomly generate values from the exponential distribution, instead of using rnorm (which is for the normal distribution).

Run this code now.

3.2

Our generated data is highly skewed, and clearly different to data generated from the symmetric normal distribution. If we try (somewhat optimistically) to overlay a normal distribution probability density curve (similar to in 2.2), with appropriate mean and variance values, (shown in blue), our result is unhelpful.

3.3

Instead of considering the individual data points generated from the Exponential distribution, let’s shift our focus to the sample mean of these data points. If we check the sample mean of x, we find that this is equal to 0.1030676. In theory, we know that the population mean \(\mu = 1/\lambda = 0.1\), so this sample mean is quite a good estimate.

3.4

Let us now conduct the simulation process used in 2.6, for our exponentially distributed data.

Using the code in 2.6 as a guide, simulate 10000 sample means of \(X\), using samples of size 5.

Plot your results, using the code in 2.7 as a guide.

Note: You will have to change the x-axis range for the histogram (via the xlim argument). Try using a range of 0 to 0.4.

Note 2: Use the code in the Code chunk below to overlay the appropriate Normal distribution probability density curve:

curve(dnorm(x, mean = 1/rate, sd = 1/rate/sqrt(ns)), add = TRUE,
        col = "blue", lwd = 2)

3.5

What do you notice about your resultant plot?

3.6

Repeats 3.4 for \(X \sim Exp(10)\), using a sample of 30 instead of 5. What do you notice about the distribution of the simulated sample means?

3.7

Finally, repeat 3.4 for \(X \sim Exp(10)\), using a sample of 60. What do you notice about the distribution of the simulated sample means now?

4 CLT Simulation - Bernoulli Distribution

As discussed in Section 3.3 of Topic 4, the Central Limit Theorem even applies to discrete random variables.

Suppose that our random variable \(X\) now follows a Bernoulli distribution with success parameter \(p = 0.3\). In other words, \(X \sim BERN(0.3)\).

4.1

If we generate a sample of 100 observations from this distribution (see the R code below) and then visualise our generated data, we see a roughly 70/30 split of observations between \(x=0\) and \(x=1\). Clearly, (just as in the Exponential distribution case), the normal distribution probability density curve, with appropriate parameter specifications, (shown in blue), is an extremely poor fit for this data.

set.seed(1)
n <- 100
p = 0.3
x <- rbinom(n, 1, p)

Note that we use rbinom to randomly generate values from the Bernoulli distribution, instead of using rnorm (which is for the normal distribution).

4.2

However, as we will soon see, despite our data being generated from a discrete distribution, we can still use the Central Limit Theorem to obtain an accurate estimate of the population mean!

Let us now conduct the simulation process used in 2.6, for our Bernoulli distributed data.

Using the code in 2.6 as a guide, simulate 10000 sample means of \(X\), using samples of size 5. Then, plot your results, using the code in 2.7 as a guide.

Note: You will have to change the x-axis range for the histogram (via the xlim argument).

Note 2: Use the code in the Code chunk below to overlay the appropriate normal distribution probability density curve:

# Set mu and sigma for the approximating normal distribution
mu <- p
sigma <- sqrt(p*(1 - p))
  
# Overlay the normal density
curve(dnorm(x, mu, sigma/sqrt(ns)), add = TRUE, col = "blue", lwd = 2)

4.3

What do you notice about your resultant plot?

4.4

Repeat 3.4 for \(X \sim BERN(0.3)\), using a sample of 30 instead of 5. What do you notice about the distribution of the simulated sample means?

4.5

Finally, repeat 3.4 for \(X \sim BERN(0.3)\), using a sample of 60. What do you notice about the distribution of the simulated sample means now?

5 Extension - Average Diamond Prices

Let’s consider how we can apply the Central Limit Theorem to real data.

For this question, we will consider data on approximately 54,000 round cut diamonds. This data is stored in the diamonds data set in the ggplot2 R package. We will treat this data as the population data.

Install and load the ggplot2 R package now.

Hint: If you need a refresher on installing and loading packages in R, check the Code chunk below:

# Note that you will have to change this code to the relevant package
install.packages("palmerpenguins")
library(palmerpenguins)

5.1

If we plot the price of round cut diamonds using a histogram, we can see that the data is clearly not normally distributed.

Make sure to run this code before continuing.

hist(diamonds$price, col = "skyblue", xlab = "Diamond Price ($)",
     main = "Histogram of Diamond Prices ($)", freq = F)

5.2

Suppose that you are interested in estimating the population mean price for round cut diamonds, but do not have access to the full diamonds data set.

Instead, you can only sample prices for 5 round cut diamonds at a time, represented by the code below:

x <- sample(diamonds$price, 5)

Using this code and the code in 2.6 as a guide, simulate taking 100 samples of 5 random round cut diamond sale prices, and plot a histogram of the resultant sample means. What do you observe?

5.3

Repeat 5.2, but this time suppose that you are able to take samples of 30 diamonds at a time, instead of 5. What changes do you observe?


Great job, that’s everything for today!

Hopefully, this computer lab has helped cement your understanding of the Central Limit Theorem.


References


These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

---
title: "STM1001: Computer Lab 5B"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Data Science Module {-}

### Topic 5B: Simulations in R {-}

<br>

Welcome to the fifth Data Science computer lab.

In [Topic 4](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/), we introduced sampling distributions and the concept of the *Central Limit Theorem*. 

In this computer lab we will cover how to conduct the simulations discussed in [Chapter 3 of Topic 4](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/3-the-central-limit-theorem.html). 
**It might be helpful to keep this content open in a separate tab while you work through the lab material, in case you would like to quickly double-check your understanding of any concepts.**

By the end of this lab, you should be able to conduct simple simulations in R. Let's get started!

# Sampling Distributions

## Sampling

Recall that a **sample** is a random selection of units from a chosen population (e.g. a selection of STM1001 students, from the entire STM1001 cohort).

Because it is typically unfeasible to collect data on the entire population, we instead conduct analyses using a sample. We hope that the inferences we make using the sample data can be extrapolated to provide information about the population, and we can use statistical techniques to ascertain the accuracy of these inferences. 

One such statistical technique we can use is the Central Limit Theorem.

## The Central Limit Theorem {#CLT}

Recall the definition of the Central Limit Theorem (CLT):

<center>
Let $X_1, \ldots, X_n$ be a random sample from a distribution with finite mean $\mu$ and finite variance $\sigma^2$. 

For $\overline{X}$ denoting the sample mean, if $n$ is sufficiently large then
$$\overline{X}\stackrel{\tiny \text{approx.}}\sim N\left(\mu,\frac{\sigma^2}{n}\right) , $$
where $\stackrel{\tiny \text{approx.}}\sim$ denotes 'approximately distributed as'.
</center>
<br>

By the CLT, it follows that as we increase our sample size, the sample mean $\overline{X}$ will get increasingly closer (i.e. converge) to the population mean $\mu$. As $n$ increases, the variance of the sample mean $\overline{X}$ will also decrease.

In other words, while samples of individuals can exhibit significant variation, samples of means tend to have less variability (as each mean is calculated from a sample of individuals, reducing the impact of outliers).

Let's visualise this process in R using some simulations.

# Simulations in R

Carrying out simulations in R is actually quite straightforward. 

We will walk through the steps involved one at a time.

## {#base}

Suppose that we are interested in the random variable $X$, which follows the standard normal distribution - in other words, $X \sim N(0, 1)$.

We can generate a sample of 100 observations from this distribution using the following R code:

```{r class.source = "fold-show", eval = T, echo = T}
set.seed(1)
n <- 100

x <- rnorm(n)
```

Run this code now.

*Note: The default mean and standard deviation values used by `rnorm` are 0 and 1 respectively, so we don't need to specify these.*

## {#basehist}

We can plot our generated data using the `hist` function, as shown below.

```{r class.source = "fold-show", eval = T, echo = T}
hist(x, freq = FALSE, col = "chartreuse3", xlim = c(-3, 3),
     main = paste("Histogram of random sample \n from standard Normal distribution, n = ", n)) 
# Overlay standard normal curve
curve(dnorm(x), add = TRUE, col = "blue", lwd = 2)
# Add line denoting sample mean
abline(v = round(mean(x), 3), col = "red", lwd = 2, lty = 2)
```

*Note that we have also used the `curve` and `dnorm` functions here to overlay the standard normal distribution probability density curve.*

##

As expected, the 100 sample observations appear to fit the standard normal distribution probability density curve well. 

Our main interest here is not the variability of the sample values however, but rather the *sample mean*. 

If we check the sample mean of `x`, we find that this is equal to approximately `r round(mean(x), 3)` (shown by the red dashed line in the plot above). Since this is the sample mean for just one sample, we can't be too confident in using this estimate as an accurate estimate of the population mean (which we know was specified to be `0`).

Rather than relying on just one sample, we will now simulate a number of random samples from $X$, in order to obtain a more accurate estimate of the sample mean $\bar{X}$. 

## {#start}

To begin, suppose we sample 5 observations of $X$, and then compute the sample mean of these observations.

Using the code in \@ref(base), it follows that this process is quite straightforward:

```{r class.source = "fold-show", eval = F, echo = T}
set.seed(1)
n <- 5 # Specify sample size
x <- rnorm(n) # Randomly generate ns samples from the normal distribution
norm_mean <- mean(x) # Compute the mean of the samples, and store it in a new object
```

## {#simstart}

Since our underlying data is generated from a normal distribution, it follows that the distribution of the sample mean will also be normal (we can think of this result as a precursor to the Central Limit Theorem). 

To accurately visualise the distribution of the sample mean, we will need to repeat the process outlined in \@ref(start) **many times**. This is where our simulations can really shine.

Suppose, rather than conducting the process outlined in \@ref(start) once, we instead conduct it 10000 times. 
It would take a while to write this code out line by line in R (over 9000 lines potentially!).

Instead, to save time, we will use a `loop`.
Specifically, we will use what is known as a `for loop`. A `for loop` is a function that will repeat a specified set of operations a specified number of times.

The `Code` chunk below provides a simple example of the R syntax for a `for loop`.

```{r class.source = "fold-show", eval = F, echo = T}
for(i in 1:10){
  x <- rnorm(i)
  mean(x)
}
```

How do we interpret this code?
Well,we have specified that for some input `i`, where `i` can take the integer values 1 up to 10, the `for loop` will simulate `i` random values from the standard Normal distribution, store them in the object `x`, and then compute the mean of `x`. 

Thus, our `for loop` runs like this:

* The `for loop` starts with the first specified value, 1, and simulates 1 random value from the standard Normal distribution. 
* The mean of `x` is now computed. This is the last operation in the `for loop`.
* Since the specified operations inside the loop have been finished for that value of `i`, the loop starts again, with `i=2` now. 
* This process continues, until the final specified value of `i`, namely `i=10`, has been used. 
* At this point, the `for loop` is complete.

*Note: We do not have to use the letter `i` here, it is simply convention. You can use a different letter, but remember, make sure you have not used your chosen letter as a name for a previous object in which you have stored data.*


##  {#sim}

The `Code` chunk below expands upon the `for loop` example discussed above in \@ref(simstart). Take a look at this code, read the comments carefully, and make sure you understand what the code does (note some values are missing, as denoted by the `...`).

```{r class.source = "fold-show", eval = F, echo = T}
#Specify sample size
ns <- ...
  
# Specify number of times to conduct process  
trials <- ...

# create object in which to store sample means once they are generated
norm.means <- rep(0, trials) # Note length is equal to the number of trials

# Run for loop  
for(i in 1:trials){
    # Randomly generate ns samples from the normal distribution
    x <- rnorm(ns) 
    # Compute mean of samples, and store it in the norm.means object, in ith position 
    norm.means[i] <- mean(x) 
    }

```

Once you feel confident in your understanding of this code, replace the `...` missing details for `ns` and `trials`, using the following details:

* You want to take 10000 samples of randomly generated values from the standard Normal distribution.
* Each sample should have a sample size of 5.

Once you have filled in the details, run your code.

*Hint: If you have given this a decent shot, but are stuck, you can check the `Code` chunk below:*

```{r class.source = "fold-hide", eval = T, echo = T}
#Specify sample size
ns <- 5
  
# Specify number of times to conduct process  
trials <- 10000

# create object in which to store sample means once they are generated
norm.means <- rep(0, trials) # Note length is equal to the number of trials

# Run for loop  
for(i in 1:trials){
    # Randomly generate ns samples from the normal distribution
    x <- rnorm(ns) 
    # Compute mean of samples, and store it in the norm.means object, in ith position 
    norm.means[i] <- mean(x) 
    }
```

## {#viz}

To visualise our results, we can use the following code:

```{r class.source = "fold-show", eval = F, echo = T}
hist(norm.means, freq = FALSE, breaks = 20, col = "red", 
     xlim = c(-3, 3),
     xlab = expression(bar(x)),
     main = paste("Histogram of means, n = ", ns)) 
```

We can also overlay a normal density curve, that, since our underlying data is normal,
will reflect the distribution of the sample mean. 

```{r class.source = "fold-show", eval = F, echo = T}  
curve(dnorm(x, 
      mean = 0, sd = 1/sqrt(ns)), 
      add = TRUE, col = "blue", lwd = 2) 
```

Run these code chunks now, and check your results. Are our simulated sample means close to the population mean of 0?

##

Repeat steps \@ref(sim) and \@ref(viz), using a sample of 30 instead of 5. What do you notice about the distribution of the simulated sample means?

##

Finally, repeat steps \@ref(sim) and \@ref(viz), using a sample of 60. What do you notice about the distribution of the simulated sample means now?

# CLT Simulation - Exponential Distribution

To demonstrate the magic of the Central Limit Theorem, let's carry out some simulations involving data generated from distributions other than the normal distribution. 

For example, let's now consider the **Exponential distribution**. 
The Exponential distribution is known to be highly skewed (i.e. asymmetric). Clearly, we cannot fit a normal curve well to data generated from an Exponential distribution. However, as we work through this question, we will find that when we model the distribution of sample means of data generated from an Exponential distribution, this distribution of sample means **can** be fitted well by a normal curve, especially as our sample size increases. This (somewhat unintuitive) result is due to the Central Limit Theorem!

##

Recall that the Exponential distribution is defined by one parameter, known as the *rate parameter* (in contrast for example to the Normal distribution, which is defined by two parameters, namely the mean and the variance). 

Suppose now that our random variable $X$ follows an exponential distribution with rate parameter $\lambda = 10$. In other words, $X \sim Exp(10)$.

We can generate a sample of 100 observations from this distribution using the following R code:

```{r class.source = "fold-show", eval = T, echo = T}
set.seed(1)
n <- 100
rate = 10
x <- rexp(n, rate = rate)
```

*Note that we use `rexp` to randomly generate values from the exponential distribution, instead of using `rnorm` (which is for the normal distribution).*

Run this code now.

##

Our generated data is highly skewed, and clearly different to data generated from the symmetric normal distribution. If we try (somewhat optimistically) to overlay a normal distribution probability density curve (similar to in \@ref(basehist)),  with appropriate mean and variance values, (shown in blue), our result is unhelpful.

```{r class.source = "fold-show", eval = T, echo = F, fig.dim = c(6,6)}
hist(x, freq = FALSE, col = "chartreuse3", xlim = c(0, 0.5),
     main = paste("Histogram of random sample \n from Exponential distribution, n = ", n))
curve(dnorm(x, 1/rate, 1/rate), add = TRUE, col = "blue", lwd = 2)
```

##

Instead of considering the individual data points generated from the Exponential distribution, let's shift our focus to the sample mean of these data points.
If we check the sample mean of `x`, we find that this is equal to `r mean(x)`. In theory, we know that the population mean $\mu = 1/\lambda = 0.1$, so this sample mean is quite a good estimate.

## {#expsim}

Let us now conduct the simulation process used in \@ref(sim), for our exponentially distributed data.

Using the code in \@ref(sim) as a guide, simulate 10000 sample means of $X$, using samples of size 5.

Plot your results, using the code in \@ref(viz) as a guide.

*Note: You will have to change the x-axis range for the histogram (via the `xlim` argument). Try using a range of 0 to 0.4.*

*Note 2: Use the code in the `Code` chunk below to overlay the appropriate Normal distribution probability density curve:*

```{r class.source = "fold-show", eval = F, echo = T}
curve(dnorm(x, mean = 1/rate, sd = 1/rate/sqrt(ns)), add = TRUE,
        col = "blue", lwd = 2)
```

##

What do you notice about your resultant plot?

##

Repeats \@ref(expsim) for $X \sim Exp(10)$, using a sample of 30 instead of 5. What do you notice about the distribution of the simulated sample means?

##

Finally, repeat \@ref(expsim) for $X \sim Exp(10)$, using a sample of 60. What do you notice about the distribution of the simulated sample means now?

# CLT Simulation - Bernoulli Distribution

As discussed in [Section 3.3 of Topic 4](https://bookdown.org/content/88ef9b7c-5833-4a70-84f2-93470957d1f9/3-3-clt-simulated-example-with-bernoulli-distributed-population.html), the Central Limit Theorem even applies to discrete random variables.

Suppose that our random variable $X$ now follows a Bernoulli distribution with success parameter $p = 0.3$. In other words, $X \sim BERN(0.3)$.

##

If we generate a sample of 100 observations from this distribution (see the R code below) and then visualise our generated data, we see a roughly 70/30 split of observations between $x=0$ and $x=1$. Clearly, (just as in the Exponential distribution case), the normal distribution probability density curve, with appropriate parameter specifications, (shown in blue), is an extremely poor fit for this data.

```{r class.source = "fold-show", eval = T, echo = T}
set.seed(1)
n <- 100
p = 0.3
x <- rbinom(n, 1, p)
```

*Note that we use `rbinom` to randomly generate values from the Bernoulli distribution, instead of using `rnorm` (which is for the normal distribution).*

```{r class.source = "fold-show", eval = T, echo = F}
hist(x, freq = FALSE, col = "chartreuse3", xlim = c(0, 1),
     main = paste("Histogram of random sample \n from Bernoulli distribution, n = ", n)) 
curve(dnorm(x, p, sd = sqrt(p*(1 - p))), add = TRUE, col = "blue", lwd = 2)
```

##

However, as we will soon see, despite our data being generated from a discrete distribution, we can still use the Central Limit Theorem to obtain an accurate estimate of the population mean!

Let us now conduct the simulation process used in \@ref(sim), for our Bernoulli distributed data.

Using the code in \@ref(sim) as a guide, simulate 10000 sample means of $X$, using samples of size 5.
Then, plot your results, using the code in \@ref(viz) as a guide.

*Note: You will have to change the x-axis range for the histogram (via the `xlim` argument).*

*Note 2: Use the code in the `Code` chunk below to overlay the appropriate normal distribution probability density curve:*

```{r class.source = "fold-show", eval = F, echo = T}
# Set mu and sigma for the approximating normal distribution
mu <- p
sigma <- sqrt(p*(1 - p))
  
# Overlay the normal density
curve(dnorm(x, mu, sigma/sqrt(ns)), add = TRUE, col = "blue", lwd = 2)
```

##

What do you notice about your resultant plot?

##

Repeat \@ref(expsim) for $X \sim BERN(0.3)$, using a sample of 30 instead of 5. What do you notice about the distribution of the simulated sample means?

##

Finally, repeat \@ref(expsim) for $X \sim BERN(0.3)$, using a sample of 60. What do you notice about the distribution of the simulated sample means now?

# Extension - Average Diamond Prices

Let's consider how we can apply the Central Limit Theorem to real data.

For this question, we will consider data on approximately 54,000 round cut diamonds. This data is stored in the `diamonds` data set in the `ggplot2` R package. We will treat this data as the population data.

Install and load the `ggplot2` R package now.

*Hint: If you need a refresher on installing and loading packages in R, check the `Code` chunk below:*

```{r class.source = "fold-hide", eval = F, echo = T}
# Note that you will have to change this code to the relevant package
install.packages("palmerpenguins")
library(palmerpenguins)
```

```{r class.source = "fold-hide", eval = T, include = F}
library(ggplot2)
```

##

If we plot the price of round cut diamonds using a histogram, we can see that the data is clearly not normally distributed.

*Make sure to run this code before continuing.*

```{r class.source = "fold-show", eval = T, echo = T}
hist(diamonds$price, col = "skyblue", xlab = "Diamond Price ($)",
     main = "Histogram of Diamond Prices ($)", freq = F)
```

## {#diamondsim}

Suppose that you are interested in estimating the population mean price for round cut diamonds, but do not have access to the full `diamonds` data set.

Instead, you can only sample prices for 5 round cut diamonds at a time, represented by the code below:

```{r class.source = "fold-show", eval = T, echo = T}
x <- sample(diamonds$price, 5)
```

Using this code and the code in \@ref(sim) as a guide, simulate taking 100 samples of 5 random round cut diamond sale prices, and plot a histogram of the resultant sample means. What do you observe?

## {#diamondsimlarge}

Repeat \@ref(diamondsim), but this time suppose that you are able to take samples of 30 diamonds at a time, instead of 5. What changes do you observe? 

<br>

#### Great job, that's everything for today! #### {-}

Hopefully, this computer lab has helped cement your understanding of the Central Limit Theorem.

<br>

# References {- #Ref}
<div id="refs"></div>

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>