Question 0: Set seed.

set.seed(seed = 17)

Question 1:

Please Google and describe Law of Large Numbers in your own words.

Answer 1:

The law of large numbers (LLN) states that as the sample size, or the number of trials in an experiment, grows, the sample mean converges to the true mean (the expected value). The law of large numbers is important in statistics because it implies that long-run averages of repeated random trials are stable: the more observations a study includes, the less the average is swayed by any single outcome. One popular example is casino games. Jackpot winners may skew the results in the short run, but the probability that a person will win is still low, and the law of large numbers shows that the more trials of casino games we observe, the closer the average outcome gets to the true mean, which favors the house (i.e., the saying that the “house always wins”). Another way of looking at the law of large numbers is that the standard deviation of the sample mean (the standard error) decreases as more trials are included (larger n).
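The convergence described above can be sketched with a running mean. This is a minimal illustration, assuming waiting times drawn from an exponential distribution with rate 5 (true mean 1/5 = 0.2); the names `draws`, `running_mean`, and `checkpoints` are made up for this example.

```r
# Assumed setup: draws from Exp(rate = 5), whose true mean is 1 / 5 = 0.2
set.seed(17)
rate <- 5
n_draws <- 100000
draws <- rexp(n_draws, rate = rate)

# Running mean after 1, 2, ..., n_draws observations
running_mean <- cumsum(draws) / seq_len(n_draws)

# Absolute error of the running mean at a few checkpoints
checkpoints <- c(10, 100, 1000, 10000, 100000)
data.frame(n = checkpoints,
           running_mean = running_mean[checkpoints],
           abs_error = abs(running_mean[checkpoints] - 1 / rate))
```

The error does not have to shrink at every single step, but it should be small by the last checkpoint.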

Question 2:

Please explain CLT in your own words.

Answer 2:

The central limit theorem (CLT) is a statistical principle stating that the sampling distribution of the mean is approximately normal when the sample size is sufficiently large (a common rule of thumb is n > 30). That sampling distribution is centered on the population mean, and it becomes narrower as the sample size increases. Essentially, no matter what the shape of the population distribution is (triangular distribution, “beta bathtub distribution”, uniform distribution, etc.), as long as it has a finite variance, the distribution of the sample mean approaches a normal distribution as n grows. The central limit theorem does not tell us that the values within a single sample will be normally distributed; it describes how the means of many repeated samples are distributed. In practice we usually have a single sample and assume that it is representative of the other samples. When we do take repeated samples, the larger each sample is, the closer the distribution of their means gets to normal.
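One way to see this numerically is to compare the skewness of sample means at a small and a large sample size, since a normal distribution has skewness 0. The sketch below is only an illustration: it draws fresh samples directly from an exponential distribution with rate 5, and the helper names `skewness` and `sample_means` are hypothetical, not from the class code.

```r
set.seed(17)
rate <- 5
reps <- 5000

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Draw `reps` samples of size n from Exp(rate) and return their means
sample_means <- function(n) {
  replicate(reps, mean(rexp(n, rate = rate)))
}

means_small <- sample_means(5)
means_large <- sample_means(100)

skew_small <- skewness(means_small)
skew_large <- skewness(means_large)
c(n5 = skew_small, n100 = skew_large)
```

The skewness for n = 100 should come out noticeably closer to 0 than for n = 5, which is the right-skewed exponential shape fading as n grows.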

Question 3:

What are the similarities and differences between LLN and CLT? Write a few lines.

Answer 3:

Both the law of large numbers and the central limit theorem rely on sampling to understand a larger population that is usually impractical to observe in full due to constraints such as time, money, and ability. Both describe the behavior of the sample mean as the sample size approaches infinity, and both are based on many independent trials (or samples) from the same distribution. However, while the CLT gives us the shape of the distribution of the sample mean (approximately normal), the LLN gives us the value that the sample mean approaches as n becomes larger (the population mean).
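Stated side by side, for independent and identically distributed draws \(X_1, \dots, X_n\) with mean \(\mu\) and finite variance \(\sigma^2\) (these are the standard textbook forms, not taken from the class materials):

\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\;p\;} \mu \quad \text{(LLN)}, \qquad
\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{\;d\;} N(0, 1) \quad \text{(CLT)}
\]

The LLN says where the sample mean ends up; the CLT describes how it fluctuates around that value on a scale of \(\sigma/\sqrt{n}\).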

Question 4:

Pick up any distribution apart from normal, uniform or poisson. Please describe this distribution first in 5 lines.

Answer 4:

The Exponential Distribution: The exponential distribution is primarily used for measuring the amount of time until a specific event occurs, or the space between events in a Poisson process. Examples may be the amount of time that you have to wait for the next bus, the time until a product fails, or the probability that a customer service call comes in within the next x minutes. The probability density function of the exponential distribution is:

\(f(x;\lambda)=\lambda e^{-\lambda x}\), \(x \ge 0\), where \(\lambda > 0\) is the rate parameter (how often events occur) and x is the time or distance until the next event.

The cumulative distribution function (CDF) is the probability that the event occurs within time x and is given by \(F(x;\lambda)=1-e^{-\lambda x}\), \(x \ge 0\). The mean (the expected value) is \(\frac{1}{\lambda}\) and the variance is \(\frac{1}{\lambda^2}\). X is a continuous random variable.
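As a quick check, the density and CDF formulas above can be compared against R's built-in `dexp()` and `pexp()`. This is a small sketch assuming a rate of 5 (buses per hour); the grid of waiting times `x` is arbitrary.

```r
rate <- 5                      # assumed rate: 5 events per hour
x <- c(0, 0.1, 0.2, 0.5, 1)    # waiting times in hours

# Closed-form density and CDF
pdf_formula <- rate * exp(-rate * x)
cdf_formula <- 1 - exp(-rate * x)

# Built-in equivalents
pdf_builtin <- dexp(x, rate = rate)
cdf_builtin <- pexp(x, rate = rate)

all.equal(pdf_formula, pdf_builtin)
all.equal(cdf_formula, cdf_builtin)

# Theoretical mean and variance
exp_mean <- 1 / rate
exp_var  <- 1 / rate^2

# Numerical check of the mean: integrate x * f(x) over [0, Inf)
numeric_mean <- integrate(function(t) t * dexp(t, rate = rate),
                          lower = 0, upper = Inf)$value
c(theoretical = exp_mean, numerical = numeric_mean)
```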

Sources: https://medium.com/analytics-vidhya/law-of-large-numbers-vs-central-limit-theorem-7819f32c67b2 https://en.wikipedia.org/wiki/Exponential_distribution https://www.geeksforgeeks.org/r-language/exponential-distribution-in-r-programming-dexp-pexp-qexp-and-rexp-functions/

Question 5A:

Then, apply the CLT on the sample mean of this chosen distribution in R

Answer 5A:

# Population size N
N <- 100

# Use rexp(N, rate) to draw N values from an exponential distribution
## The rate parameter is the number of events that occur per unit of time or space
### Say this is the number of buses in an hour - say 5
exp_dist <- rexp(N, rate = 5)
hist(exp_dist,
     breaks = 40,
     probability = TRUE,  # logical TRUE rather than the string "True"; plots densities
     xlab = "Waiting Time between Buses (Hours)",
     ylab = "Density")

# Note: these names shadow the base functions mean() and sd();
# calls such as mean(x) still find the functions
mean <- mean(exp_dist)
sd <- sd(exp_dist)

#Using class R code from dropbox Rmd File from here on out, but slightly modified

## Step 2:  Create an empty matrix that will store means of different sample sizes, 
#say 6 different sample sizes of 4, 6, 40, 60, 200 and 2000 

matrix_exp <- matrix(data = rep(x     = 0, 
                        times = 60000
), 
nrow = 10000, 
ncol = 6
)

matrix_exp[1:16]
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## STEP 3:  Take 10,000 random samples of sample size 4, 6, 40, 60, 200 and 2000 respectively, 
#compute their mean, and place the sample mean in the matrix matrix_exp (column j) to plot it later
n_size <- c(4, 6, 40, 60, 200, 2000) # sample sizes - a larger sample size should give a sample mean closer to the true population mean
for (j in 1:6){       # indexes the columns of the matrix, where each column represents a different sample size
  for (i in 1:10000){ # indexes the rows of the matrix
    matrix_exp[i,j] <-  mean(sample(x       = exp_dist, 
                            size    = n_size[j], 
                            replace = TRUE
    )
    )
  }
}

## Step 4:  If intuition is right, the sample mean should get closer to the actual exponential population mean as sample size increases, and the distribution of the sample mean should get 'tighter' around the true population mean.

## Step 5:  Confirm intuition above with either Plot or with summary commands 
colnames(matrix_exp) <- c("Sample size=4", "Sample size=6", "Sample size=40","Sample size=60", "Sample size=200", "Sample size=2000") 
summary(matrix_exp)  
##  Sample size=4     Sample size=6     Sample size=40    Sample size=60  
##  Min.   :0.01515   Min.   :0.01525   Min.   :0.09556   Min.   :0.1202  
##  1st Qu.:0.13121   1st Qu.:0.14377   1st Qu.:0.18636   1st Qu.:0.1922  
##  Median :0.19187   Median :0.19538   Median :0.21145   Median :0.2128  
##  Mean   :0.21550   Mean   :0.21530   Mean   :0.21508   Mean   :0.2148  
##  3rd Qu.:0.26694   3rd Qu.:0.26311   3rd Qu.:0.23996   3rd Qu.:0.2350  
##  Max.   :1.20770   Max.   :0.90116   Max.   :0.40024   Max.   :0.3776  
##  Sample size=200  Sample size=2000
##  Min.   :0.1544   Min.   :0.1947  
##  1st Qu.:0.2022   1st Qu.:0.2109  
##  Median :0.2140   Median :0.2146  
##  Mean   :0.2146   Mean   :0.2147  
##  3rd Qu.:0.2260   3rd Qu.:0.2184  
##  Max.   :0.3003   Max.   :0.2409
# matrix columns contain means of samples from the exponential population at different sample sizes - col 1 has sample size 4, col 2 has sample size 6, col 3 has sample size 40, col 4 has sample size 60, col 5 has sample size 200, and col 6 has sample size 2000.  
# each column holds 10,000 sample means, one per iteration, all drawn from the same population.  

par(mar = c(4, 4, 2, 1))
hist(x    = exp_dist, 
     main = "Histogram of Exponential Distribution (lambda=5)"
)

#Bigger graphs

# dev.off()       # Clear par(mfrow=c(3,2)) type commands 

for (k in 1:6){
  hist(x = matrix_exp[,k], 
       main = "Histogram of Mean of Exponential Dist (lambda=5)",
       xlim = c(0, 0.5),  # keeps the x axis same in all charts
       xlab = paste0("Column ", k, " from matrix") 
  )
} # main = paste("Mean of " ,z[,k] )

## Step 6: Backtracing population mean and sd
mean      # original exponential population mean
## [1] 0.2146656
colMeans(matrix_exp)              # mean of the sample means for each sample size
##    Sample size=4    Sample size=6   Sample size=40   Sample size=60 
##        0.2155034        0.2153005        0.2150782        0.2147514 
##  Sample size=200 Sample size=2000 
##        0.2146096        0.2147112
##########################################################################################
### original distribution mean is approximated equally well on average by small (n=4) and large (n=2000) samples ###
##########################################################################################


sd
## [1] 0.2517402
##########################################################################################
### original distribution sd is recovered by multiplying the sd of the sample means by sqrt(n), where n is the sample size ###
##########################################################################################


apply(X = matrix_exp,
      MARGIN = 2,  # columns
      FUN = sd) * c(sqrt(4), 
                    sqrt(6), 
                    sqrt(40), 
                    sqrt(60), 
                    sqrt(200), 
                    sqrt(2000)
      )
##    Sample size=4    Sample size=6   Sample size=40   Sample size=60 
##        0.2483638        0.2533248        0.2526896        0.2502376 
##  Sample size=200 Sample size=2000 
##        0.2507254        0.2506098
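The rescaling above works because the standard deviation of a sample mean is the population standard deviation divided by sqrt(n). A minimal sketch of that relationship follows, assuming a stand-in population `pop` (hypothetical, playing the role of `exp_dist`) and a single sample size. Because the samples are drawn with replacement from 100 fixed values, the relevant population standard deviation uses the 1/N divisor, which for N = 100 is about 0.5% smaller than what `sd()` reports.

```r
set.seed(17)
pop <- rexp(100, rate = 5)   # hypothetical stand-in for exp_dist
n <- 40
reps <- 5000

# Means of `reps` samples of size n drawn with replacement from pop
boot_means <- replicate(reps, mean(sample(pop, size = n, replace = TRUE)))

# Standard deviation of pop using the 1/N divisor (sd() uses 1/(N - 1))
pop_sd <- sqrt(mean((pop - mean(pop))^2))

# sd of the sample means, scaled back up by sqrt(n)
rescaled_sd <- sd(boot_means) * sqrt(n)
c(rescaled = rescaled_sd, population = pop_sd, sd_function = sd(pop))
```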

Question 5B:

Answer 5B:

# Population size N
N <- 100

# Use rexp(N, rate) to draw N values from an exponential distribution
## The rate parameter is the number of events that occur per unit of time or space
### Say this is the number of buses in an hour - say 5
exp_dist <- rexp(N, rate = 5)
hist(exp_dist,
     breaks = 40,
     probability = TRUE,  # logical TRUE rather than the string "True"; plots densities
     xlab = "Waiting Time between Buses (Hours)",
     ylab = "Density")

# Note: this name shadows the base function median();
# calls such as median(x) still find the function
median <- median(exp_dist)

#Remaining code is adapted from class code from drop box rmd file.

## Step 2:  Create an empty matrix that will store medians of different sample sizes, 
#say 6 different sample sizes of 4, 6, 40, 60, 200 and 2000 

matrix_exp <- matrix(data = rep(x     = 0, 
                        times = 60000
), 
nrow = 10000, 
ncol = 6
)

matrix_exp[1:16]
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## STEP 3:  Take 10,000 random samples of sample size 4, 6, 40, 60, 200 and 2000 respectively, 
#compute their median, and place the sample median in the matrix matrix_exp (column j) to plot it later
n_size <- c(4, 6, 40, 60, 200, 2000) # sample sizes - a larger sample size should give a sample median closer to the true population median
for (j in 1:6){       # indexes the columns of the matrix, where each column represents a different sample size
  for (i in 1:10000){ # indexes the rows of the matrix
    matrix_exp[i,j] <-  median(sample(x       = exp_dist, 
                            size    = n_size[j], 
                            replace = TRUE
    )
    )
  }
}

## Step 4:  If intuition is right, the sample median should get closer to the actual exponential population median as sample size increases, and the distribution of the sample median should get 'tighter' around the true population median.

## Step 5:  Confirm intuition above with either Plot or with summary commands 
colnames(matrix_exp) <- c("Sample size=4", "Sample size=6", "Sample size=40","Sample size=60", "Sample size=200", "Sample size=2000") 
summary(matrix_exp)  
##  Sample size=4      Sample size=6      Sample size=40    Sample size=60   
##  Min.   :0.008341   Min.   :0.008107   Min.   :0.05526   Min.   :0.06631  
##  1st Qu.:0.101930   1st Qu.:0.104114   1st Qu.:0.13032   1st Qu.:0.13296  
##  Median :0.159266   Median :0.157060   Median :0.15482   Median :0.15581  
##  Mean   :0.167260   Mean   :0.161940   Mean   :0.15644   Mean   :0.15676  
##  3rd Qu.:0.219957   3rd Qu.:0.212631   3rd Qu.:0.18866   3rd Qu.:0.18160  
##  Max.   :0.854266   Max.   :0.709362   Max.   :0.30974   Max.   :0.25342  
##  Sample size=200   Sample size=2000
##  Min.   :0.08078   Min.   :0.1330  
##  1st Qu.:0.13600   1st Qu.:0.1548  
##  Median :0.15581   Median :0.1558  
##  Mean   :0.15541   Mean   :0.1550  
##  3rd Qu.:0.16840   3rd Qu.:0.1568  
##  Max.   :0.21645   Max.   :0.1887
# matrix columns contain medians of samples from the exponential population at different sample sizes - col 1 has sample size 4, col 2 has sample size 6, col 3 has sample size 40, col 4 has sample size 60, col 5 has sample size 200, and col 6 has sample size 2000.  
# each column holds 10,000 sample medians, one per iteration, all drawn from the same population.  

par(mar = c(4, 4, 2, 1))
hist(x    = exp_dist, 
     main = "Histogram of Exponential Distribution (lambda=5)"
)

#Bigger graphs

# dev.off()       # Clear par(mfrow=c(3,2)) type commands 

for (k in 1:6){
  hist(x = matrix_exp[,k], 
       main = "Histogram of Median of Exponential Dist (lambda=5)",
       xlim = c(0, 0.5),  # keeps the x axis same in all charts
       xlab = paste0("Column ", k, " from matrix") 
  )
} # main = paste("Mean of " ,z[,k] )

library(miscTools)
## Warning: package 'miscTools' was built under R version 4.5.2
## Step 6: Backtracing population median
median      # original exponential population median
## [1] 0.1558135
colMedians(matrix_exp)              # median of the sample medians for each sample size
##    Sample size=4    Sample size=6   Sample size=40   Sample size=60 
##        0.1592661        0.1570596        0.1548184        0.1558135 
##  Sample size=200 Sample size=2000 
##        0.1558135        0.1558135

Answer to Questions 5A and 5B:

Note that the results below come from a prior run, so they differ from the output shown above. I did set the seed but am not sure whether these values are expected to change between runs. As a result, I’m using static values from a previous run to inform my answers below.
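On the seed question, a minimal sketch of how `set.seed()` behaves may help. The seed fixes the starting state of the random number generator, and every later draw advances that state, so re-running a chunk without re-running `set.seed()` first would be expected to give different values. The variable names here are made up for the illustration.

```r
set.seed(17)
first_run <- rexp(5, rate = 5)

# Without resetting the seed, the generator continues from its current state
second_run <- rexp(5, rate = 5)

# Resetting the seed restarts the same sequence
set.seed(17)
third_run <- rexp(5, rate = 5)

identical(first_run, third_run)   # same seed, same draws
identical(first_run, second_run)  # no reset, different draws
```

This would also be consistent with 5A and 5B showing different populations above, since the second `rexp(N, rate = 5)` call runs without a fresh `set.seed()`.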

Yes, the central limit theorem does hold as expected for both the mean and the median of the distributions. We can see this in the output values and in the histograms from both exercises in 5A and 5B. In 5A, the mean of the original distribution (0.1790411 with n = 100) was very close to the column means of the sample means:

Sample size=4: 0.1783052
Sample size=6: 0.1794711
Sample size=40: 0.1787707
Sample size=60: 0.1788627
Sample size=200: 0.1791100
Sample size=2000: 0.1790118

The larger the sample size, the closer the column mean was to the original mean.

We can see this with the histograms for the different sample sizes as well. The smaller the sample size, the more the histogram of sample means looks like the right-skewed exponential distribution. The larger the sample size, the more the histogram looks like a normal distribution.

The CLT seems to hold somewhat for the median, although not as well as for the mean. The median of the original distribution was 0.1435424 with n = 100, which was also very close to the column medians of the sample medians:

Sample size=4: 0.1484602
Sample size=6: 0.1443420
Sample size=40: 0.1425275
Sample size=60: 0.1435424
Sample size=200: 0.1435424
Sample size=2000: 0.1435424

However, the column medians do not move steadily toward the original value across the smaller sample sizes. For example, going from a sample size of 6 to a sample size of 40 decreased the median by ~0.002, but moving from a sample size of 40 to a sample size of 60 increased it by ~0.001. So there’s not a clear convergence pattern there, although from a sample size of 60 onward the column median matches the original median exactly. This is similarly shown in the histograms, where, as the sample size increases, the histograms look more like normal distributions, but they are still not as clean or regular as with the mean.
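One possible reason the median histograms look less clean is that the samples are drawn with replacement from only 100 fixed values, so a sample median can only land on one of those values or on the midpoint between two of them. The sketch below is a hedged illustration of that idea, not part of the class code: it compares medians from resampling a stand-in population `pop` (hypothetical) with medians of fresh draws from the exponential distribution itself, whose theoretical median is ln(2)/lambda.

```r
set.seed(17)
rate <- 5
pop <- rexp(100, rate = rate)   # hypothetical stand-in for exp_dist
n <- 2000
reps <- 1000

# Medians of samples drawn with replacement from the 100 fixed values
resampled_medians <- replicate(reps, median(sample(pop, size = n, replace = TRUE)))

# Medians of fresh samples drawn directly from Exp(rate)
fresh_medians <- replicate(reps, median(rexp(n, rate = rate)))

c(distinct_resampled = length(unique(resampled_medians)),
  distinct_fresh     = length(unique(fresh_medians)),
  theoretical_median = log(2) / rate)
```

If the resampled medians take far fewer distinct values than the fresh ones, the lumpy histograms would reflect the small population more than a failure of the theorem.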