Week 4 Discussion

0. Begin by setting a seed in R. The recommended way to specify a seed is set.seed(seed = 42), where seed can be any single value that is interpreted as an integer (42 here, but you can use your favorite number instead).

set.seed(seed = 13)

1. Please Google and describe the Law of Large Numbers in your own words.

The Law of Large Numbers states that as the number of independent trials increases, the observed relative frequency of an outcome converges to its true probability. Equivalently, the larger our sample size is, the closer our sample average tends to be to the actual population average.
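As a rough illustration of this idea, here is a minimal sketch that simulates fair coin flips and tracks the running proportion of heads. The coin-flip setup and the names flips and running_prop are my own assumptions for illustration; they are not part of the assignment.

```r
set.seed(13)

# Simulate 10,000 fair coin flips (1 = heads, 0 = tails); true P(heads) = 0.5
flips <- rbinom(n = 10000, size = 1, prob = 0.5)

# Running proportion of heads after each flip
running_prop <- cumsum(flips) / seq_along(flips)

# Proportion of heads after 10, 100, 1,000, and 10,000 flips
running_prop[c(10, 100, 1000, 10000)]

# The running proportion should settle near the dashed line at 0.5
plot(running_prop, type = "l", log = "x",
     xlab = "Number of flips", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)
```

The early proportions can be far from 0.5, while the later ones should sit close to it.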

2. Please explain CLT in your own words.

The Central Limit Theorem states that as the sample size grows, the distribution of the sample mean increasingly resembles a normal distribution, regardless of the shape of the population distribution (as long as the population has a finite variance).

3. What are the similarities and differences between LLN and CLT?

The Law of Large Numbers and the Central Limit Theorem both describe what happens as the size of a sample drawn from a population increases. However, the Law of Large Numbers deals with the accuracy of the sample mean (it converges to the population mean), while the Central Limit Theorem deals with the shape of the distribution of the sample mean around that value (it becomes approximately normal).
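To see both ideas side by side, here is a small sketch, assuming an exponential population with rate 1 (mean 1, standard deviation 1). The helper sample_means is a hypothetical name I made up for this example. The mean of the sample means staying near 1 reflects the LLN, and the spread shrinking roughly like 1/sqrt(n) is the scale the CLT works on.

```r
set.seed(13)

# Hypothetical helper: draw `reps` samples of size n from an
# Exponential(rate = 1) population and return their sample means
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Compare the center and spread of the sample means for growing n
for (n in c(5, 50, 500)) {
  m <- sample_means(n)
  cat("n =", n,
      " mean of means =", round(mean(m), 3),
      " sd of means =", round(sd(m), 3),
      " 1/sqrt(n) =", round(1 / sqrt(n), 3), "\n")
}
```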

4. Pick any distribution apart from the normal, uniform, or Poisson. You can read about the distribution on Wikipedia and/or read how to implement it in R (what parameters are required to generate it). Please describe this distribution first in 5 lines.

I chose the gamma distribution. The gamma distribution is a continuous probability distribution that is often used to model the waiting time until one or more events occur. For example, you could use this distribution to model the time until a light bulb burns out. Since this distribution deals with time, it only takes positive values. The R functions related to the gamma distribution are dgamma, pgamma, qgamma, and rgamma, which give the PDF, the CDF, the quantile function, and random variates, respectively. Each function takes its main argument (a vector of quantiles for dgamma and pgamma, a vector of probabilities for qgamma, and the number of draws for rgamma) along with shape and rate.
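Here is a short sketch of the four functions, assuming shape = 2 and rate = 2 (the same values used in 5A). The variable names d, p, q, and r are only for illustration.

```r
shape <- 2
rate <- 2

# PDF: height of the density curve at x = 1
d <- dgamma(x = 1, shape = shape, rate = rate)

# CDF: probability that the waiting time is at most 1
p <- pgamma(q = 1, shape = shape, rate = rate)

# Quantile function: the median, i.e. the value with 50% of the mass below it
q <- qgamma(p = 0.5, shape = shape, rate = rate)

# Random variates: three simulated waiting times
set.seed(13)
r <- rgamma(n = 3, shape = shape, rate = rate)

d; p; q; r
```

Note that pgamma and qgamma are inverses of each other, so pgamma(q, shape, rate) should return 0.5 here.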

5A. Then, apply the CLT on the sample mean of this chosen distribution in R

# Drawing 10,000 values from a gamma distribution to serve as our population
mygamma <- rgamma(
  n = 10000,
  shape = 2,
  rate = 2
)
# Computing the mean of the simulated population
popmu <- mean(mygamma)
popmu

## [1] 1.000985
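As a sanity check, the simulated mean can be compared with the theoretical values. For a gamma distribution the mean is shape / rate and the variance is shape / rate^2. The sketch below regenerates a population the same way so that it runs on its own; the names theo_mean, theo_var, and x are assumptions for this example.

```r
shape <- 2
rate <- 2

# Theoretical moments of a gamma distribution
theo_mean <- shape / rate     # 2 / 2 = 1
theo_var <- shape / rate^2    # 2 / 4 = 0.5

# Simulated population, drawn the same way as mygamma
set.seed(13)
x <- rgamma(n = 10000, shape = shape, rate = rate)

# The simulated values should be close to the theoretical ones
c(simulated = mean(x), theoretical = theo_mean)
c(simulated = var(x), theoretical = theo_var)
```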

Graphing the Gamma distribution:

hist(mygamma,
     probability = TRUE,
     breaks = 40,
     main = "Gamma Distribution (shape = 2, rate = 2)",
     xlab = "Waiting time")
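One optional addition, sketched here under the assumption that the population is regenerated with the same seed and parameters, is to overlay the theoretical gamma density on the histogram with curve() and dgamma(). This makes it easier to see that the simulated values follow the intended right-skewed shape.

```r
set.seed(13)
mygamma <- rgamma(n = 10000, shape = 2, rate = 2)

# Histogram of the simulated population on the density scale
hist(mygamma,
     probability = TRUE,
     breaks = 40,
     main = "Gamma Distribution (shape = 2, rate = 2)",
     xlab = "Waiting time")

# Theoretical density drawn on top of the histogram
curve(dgamma(x, shape = 2, rate = 2), add = TRUE, lwd = 2)
```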

Creating an empty matrix to store the samples drawn from the population (one sample per column):

mymat <- matrix(0, nrow = 50, ncol = 1000)
str(mymat)

##  num [1:50, 1:1000] 0 0 0 0 0 0 0 0 0 0 ...

Populating the matrix with 1000 samples of size 50, drawn with replacement from the simulated population. Each column is a sample:

for (i in seq_len(ncol(mymat))) {
  mymat[, i] <- sample(mygamma, size = 50, replace = TRUE)
}
str(mymat)

##  num [1:50, 1:1000] 0.2403 4.0851 0.0934 0.2566 0.7817 ...
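The same kind of matrix could also be built without a loop. This is only a sketch of an alternative, assuming the population is regenerated as in 5A; the name mymat2 is made up for this example. replicate() runs the sampling expression 1000 times and binds the results as columns.

```r
set.seed(13)
mygamma <- rgamma(n = 10000, shape = 2, rate = 2)

# Each call to sample() returns 50 values; replicate() stacks the
# 1000 results side by side, so each column is one sample
mymat2 <- replicate(1000, sample(mygamma, size = 50, replace = TRUE))

dim(mymat2)
```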

Taking the sample means, then plotting their density to check that it looks approximately normal, as the CLT predicts:

mymeans <- colMeans(mymat)
plot(density(mymeans),
     main = "Density of Sample Means of Gamma Distribution",
     xlab = "Sample Mean",
     ylab = "Density")
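To make the comparison with a normal distribution more concrete, here is a sketch that overlays the normal curve the CLT suggests (centered at the population mean, with standard deviation equal to the population standard deviation divided by sqrt(50)) and adds a normal Q-Q plot. It regenerates the data so it runs on its own; clt_mean and clt_sd are names I introduced for this example.

```r
set.seed(13)
mygamma <- rgamma(n = 10000, shape = 2, rate = 2)

# 1000 sample means, each from a sample of size 50
mymeans <- replicate(1000, mean(sample(mygamma, size = 50, replace = TRUE)))

# Normal approximation suggested by the CLT
clt_mean <- mean(mygamma)
clt_sd <- sd(mygamma) / sqrt(50)

# Density of the sample means with the normal curve as a dashed line
plot(density(mymeans),
     main = "Sample Means vs. Normal Approximation",
     xlab = "Sample Mean")
curve(dnorm(x, mean = clt_mean, sd = clt_sd), add = TRUE, lty = 2)

# Points close to the line indicate approximate normality
qqnorm(mymeans)
qqline(mymeans)
```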

5B. Alternatively, apply the CLT on any other sample statistic like say the sample median, sample 25th percentile or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.

mymedians <- apply(mymat, 2, median)
plot(density(mymedians),
     main = "Density of Sample Medians of Gamma Distribution",
     xlab = "Sample Median",
     ylab = "Density")
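The same approach should work for the other statistics mentioned in the question. As a sketch, here is the sample 25th percentile, using apply() with quantile(). The data are regenerated so the block runs on its own, and myq25 is a name I made up.

```r
set.seed(13)
mygamma <- rgamma(n = 10000, shape = 2, rate = 2)
mymat <- replicate(1000, sample(mygamma, size = 50, replace = TRUE))

# 25th percentile of each column (each column is one sample of size 50)
myq25 <- apply(mymat, 2, quantile, probs = 0.25)

plot(density(myq25),
     main = "Density of Sample 25th Percentiles of Gamma Distribution",
     xlab = "Sample 25th Percentile",
     ylab = "Density")
```

Changing probs to 0.8 would give the sample 80th percentile instead.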

Does the central limit theorem hold as expected?

In 5A, the distribution of the sample means of a gamma population begins to resemble a normal distribution, which is what the CLT predicts. 5B shows that the sample medians are approaching a normal distribution but are still somewhat positively skewed. The classical CLT is a statement about sample means; the sample median is also approximately normal in large samples, but with a skewed population and only 50 observations per sample, it may need a larger sample size before the skew fades.
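One way to check the skew with numbers rather than by eye is sketched below. Base R has no skewness function, so skewness here is a hypothetical helper I wrote (the third central moment divided by the cubed standard deviation), and stat_skew is also a made-up name. Values near 0 suggest a symmetric distribution, so the skew of both statistics would be expected to move toward 0 as the sample size goes from 50 to 500.

```r
set.seed(13)
mygamma <- rgamma(n = 10000, shape = 2, rate = 2)

# Hypothetical helper: simple moment-based sample skewness
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical helper: skewness of a statistic over `reps` samples of size n
stat_skew <- function(n, stat, reps = 2000) {
  skewness(replicate(reps, stat(sample(mygamma, size = n, replace = TRUE))))
}

# One column per sample size; rows give the skewness of means and medians
sapply(c(50, 500), function(n) {
  c(n = n,
    mean_skew = stat_skew(n, mean),
    median_skew = stat_skew(n, median))
})
```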