In this project, we will work with the sampling distribution, inequalities, and the Central Limit Theorem (CLT).

The Central Limit Theorem is one of the most powerful tools that we will learn in this course. It will send you back to the beginning of this class, requiring you to recall and use the distributions and probability principles that we studied then, and yet, now at the end of this class, it also sets the stage for many things to come. Make sure to have fun, and good luck!

1 Sampling Distribution of a Statistic.

Given sample data of the form: \[ X= \{x_1, x_2, x_3, \dots, x_n\},\] consider the following statistic: \[ \hat{\theta}(X) = \frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n}.\] Note that this statistic can be an “estimator” for the population variance \(\sigma^2\). For now, write a function “theta_hat” that calculates the value of the statistic given sample data “samp”.

theta_hat <- function(samp) {
  x_mean <- mean(samp)
  n <- length(samp)
  # Mean squared deviation from the sample mean (divides by n, not n - 1)
  sum((samp - x_mean)^2) / n
}

Use the replicate and hist functions to visualize the sampling distribution of \(\hat{\theta}\) when working with random samples coming from \(N(\mu = 5, \sigma = 1.5)\) of sizes \(n = 2, 3, 5, 10, 50, 500\).

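B <- 10000  # number of simulated samples per sample size
sizes <- c(2, 3, 4, 5, 10, 50, 500)

for (n in sizes) {
  # B values of theta_hat, each computed from a fresh N(5, 1.5) sample of size n
  thetas <- replicate(B, {
    samp <- rnorm(n, mean = 5, sd = 1.5)
    theta_hat(samp)
  })
  hist(thetas, breaks = 50,
       main = paste("Sampling distribution of theta_hat, n =", n))
}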

For each of these cases of sample sizes, calculate the empirical expected value.

B <- 10000
sizes <- c(2, 3, 4, 5, 10, 50, 500)

for (n in sizes) {
  thetas <- replicate(B, {
    samp <- rnorm(n, mean = 5, sd = 1.5)  # population mean is 5 for every n
    theta_hat(samp)
  })
  print(paste("Sample size:", n, "Empirical exp value of theta_hat ", round(mean(thetas), 5)))
}
## [1] "Sample size: 2 Empirical exp value of theta_hat  1.10684"
## [1] "Sample size: 3 Empirical exp value of theta_hat  1.48195"
## [1] "Sample size: 4 Empirical exp value of theta_hat  1.68566"
## [1] "Sample size: 5 Empirical exp value of theta_hat  1.78769"
## [1] "Sample size: 10 Empirical exp value of theta_hat  2.02097"
## [1] "Sample size: 50 Empirical exp value of theta_hat  2.2106"
## [1] "Sample size: 500 Empirical exp value of theta_hat  2.24756"
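The printed averages can be compared against theory. For i.i.d. samples with finite variance, the statistic that divides by \(n\) has expected value \(\frac{n-1}{n}\sigma^2\); this is a standard result rather than something stated in the assignment. A small sketch, where `sigma2` and `expected` are names introduced here for illustration:

```r
# Theoretical expected value of theta_hat for each sample size,
# assuming i.i.d. samples from a population with variance sigma^2 = 1.5^2.
sigma2 <- 1.5^2
sizes <- c(2, 3, 4, 5, 10, 50, 500)

expected <- (sizes - 1) / sizes * sigma2

# Side-by-side table of sample size and theoretical mean of theta_hat
print(data.frame(n = sizes, expected_theta_hat = expected))
```

For example, the theoretical value is 1.125 for \(n = 2\) and 2.2455 for \(n = 500\), which is close to the simulated averages above.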

As sample size increases what is the relation between the empirical mean of \(\hat{\theta}\) and \(\sigma^2\)? As the sample size increases, the empirical mean of \(\hat{\theta}\) gets closer to \(\sigma^2 = 1.5^2 = 2.25\), approaching it from below. For small \(n\) it is clearly smaller than \(\sigma^2\) (about 1.11 for \(n = 2\)), which is consistent with \(E[\hat{\theta}] = \frac{n-1}{n}\sigma^2\): \(\hat{\theta}\) is a biased estimator of \(\sigma^2\), and the bias vanishes as \(n\) grows.

Markov’s and Chebychev’s Bounds.

Recall that given a nonnegative random variable \(X\) with mean \(\mu\), we have Markov’s bound given as (for \(a >0\)) \[ P(X\ge a) \le \frac{\mu}{a},\] and, for any random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\), Chebychev’s bound given by (for \(k>0\)): \[ P(|X-\mu|\ge k ) \le \frac{\sigma^2}{k^2}.\] Write functions “markov” and “chebychev” which take the relevant inputs (\(a, \mu\) for Markov and \(k, \sigma^2\) for Chebychev) and output the relevant value of the bounds.
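A minimal sketch of the two functions, written directly from the formulas above. The argument name `sigma2` (for \(\sigma^2\)) is a choice made here, not something the assignment fixes, and the bounds are returned uncapped, so values above 1 are possible and simply uninformative.

```r
# Markov's bound: P(X >= a) <= mu / a, for nonnegative X and a > 0.
markov <- function(a, mu) {
  stopifnot(all(a > 0))
  mu / a
}

# Chebychev's bound: P(|X - mu| >= k) <= sigma^2 / k^2, for k > 0.
# sigma2 is the variance (not the standard deviation).
chebychev <- function(k, sigma2) {
  stopifnot(all(k > 0))
  sigma2 / k^2
}

markov(20, mu = 18)            # 18 / 20 = 0.9
chebychev(3, sigma2 = 1.5^2)   # 2.25 / 9 = 0.25
```

Both functions are vectorised in their first argument, so they can be evaluated on a whole grid of \(a\) or \(k\) values at once.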

1.1 Normal distribution

For values of \(a\in (14, 22)\) plot the true value of \(P(X\ge a)\) and Markov’s bound on the same plot (\(x\)-axis will be \((14, 22)\)) when \(X\sim N(\mu = 18, \sigma = 1.5)\).

For values of \(k\in (0,5)\) plot the true value of \(P(|X-\mu_X|\ge k)\) and the Chebychev’s bound, on the same plot (\(x\)-axis will be \((0,5)\)) when \(X\sim N(\mu = 18, \sigma = 1.5)\).
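One possible way to produce the two plots for the normal case is sketched below. It assumes `markov` and `chebychev` are defined as plain implementations of the formulas (redefined here so the block is self-contained); the grid size of 200 points and the line styles are arbitrary choices. Strictly speaking Markov's bound needs \(X \ge 0\); for \(N(18, 1.5)\) the probability of a negative value is negligible, so the bound is used here as an approximation.

```r
markov <- function(a, mu) mu / a
chebychev <- function(k, sigma2) sigma2 / k^2

mu <- 18
sigma <- 1.5

# Markov: true upper-tail probability vs. bound for a in (14, 22)
a <- seq(14, 22, length.out = 200)
true_tail <- pnorm(a, mean = mu, sd = sigma, lower.tail = FALSE)
markov_bound <- markov(a, mu)

plot(a, true_tail, type = "l", ylim = c(0, max(markov_bound)),
     xlab = "a", ylab = "probability",
     main = "P(X >= a) vs. Markov's bound, X ~ N(18, 1.5)")
lines(a, markov_bound, lty = 2)
legend("topright", legend = c("True P(X >= a)", "Markov's bound"), lty = c(1, 2))

# Chebychev: true two-sided tail vs. bound for k in (0, 5).
# The grid starts just above 0 because the bound is undefined at k = 0.
k <- seq(0.05, 5, length.out = 200)
true_two_sided <- 2 * pnorm(k / sigma, lower.tail = FALSE)
cheb_bound <- chebychev(k, sigma^2)

# ylim is clipped to [0, 1]; the bound exceeds 1 for k < sigma
plot(k, true_two_sided, type = "l", ylim = c(0, 1),
     xlab = "k", ylab = "probability",
     main = "P(|X - mu| >= k) vs. Chebychev's bound, X ~ N(18, 1.5)")
lines(k, cheb_bound, lty = 2)
legend("topright", legend = c("True P(|X - mu| >= k)", "Chebychev's bound"), lty = c(1, 2))
```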

1.2 Exponential Distribution

For values of \(a\in (0, 5)\) plot the true value of \(P(X\ge a)\) and Markov’s bound on the same plot (\(x\)-axis will be \((0, 5)\)) when \(X\sim \text{Exp}(\lambda = 2)\).

For values of \(k\in (0,5)\) plot the true value of \(P(|X-\mu_X|\ge k)\) and the Chebychev’s bound, on the same plot (\(x\)-axis will be \((0,5)\)) when \(X\sim \text{Exp}(\lambda = 2)\).
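The exponential case can be sketched in the same way. For \(X\sim \text{Exp}(\lambda = 2)\) the mean is \(1/\lambda = 0.5\) and the variance is \(1/\lambda^2 = 0.25\). The two-sided probability is split into an upper and a lower tail; the lower tail is zero once \(\mu - k < 0\), which `pexp` handles automatically. As before, `markov` and `chebychev` are redefined here as plain implementations of the formulas, and the grid and plot styling are arbitrary choices.

```r
markov <- function(a, mu) mu / a
chebychev <- function(k, sigma2) sigma2 / k^2

rate <- 2
mu <- 1 / rate        # mean of Exp(rate)
sigma2 <- 1 / rate^2  # variance of Exp(rate)

# Grids start just above 0 because both bounds are undefined at 0
a <- seq(0.05, 5, length.out = 200)
k <- seq(0.05, 5, length.out = 200)

# Markov: P(X >= a) = exp(-rate * a)
true_tail <- pexp(a, rate = rate, lower.tail = FALSE)
markov_bound <- markov(a, mu)

plot(a, true_tail, type = "l", ylim = c(0, 1),
     xlab = "a", ylab = "probability",
     main = "P(X >= a) vs. Markov's bound, X ~ Exp(2)")
lines(a, markov_bound, lty = 2)
legend("topright", legend = c("True P(X >= a)", "Markov's bound"), lty = c(1, 2))

# Chebychev: P(|X - mu| >= k) = P(X >= mu + k) + P(X <= mu - k)
true_two_sided <- pexp(mu + k, rate = rate, lower.tail = FALSE) +
  pexp(mu - k, rate = rate)
cheb_bound <- chebychev(k, sigma2)

plot(k, true_two_sided, type = "l", ylim = c(0, 1),
     xlab = "k", ylab = "probability",
     main = "P(|X - mu| >= k) vs. Chebychev's bound, X ~ Exp(2)")
lines(k, cheb_bound, lty = 2)
legend("topright", legend = c("True P(|X - mu| >= k)", "Chebychev's bound"), lty = c(1, 2))
```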

2 Central Limit Theorem

2.1 Exponential Distribution

Suppose we are working with a population that has the exponential distribution with \(\lambda = 2\).

Use the replicate function to get the histograms for the sampling distribution of the sample mean when working with sample sizes \(n = 1, 2, 3, 4, 15, 500\). Be sure to have appropriate titles for your histograms.

B <- 10000
sizes <- c(1, 2, 3, 4, 15, 500)

for (n in sizes) {
  # B sample means, each from an Exp(lambda = 2) sample of size n
  xbars <- replicate(B, {
    samp <- rexp(n, rate = 2)
    mean(samp)
  })
  hist(xbars, breaks = 50,
       main = paste("Sampling distribution of the sample mean, Exp(lambda = 2), n =", n))
}

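What do you notice? The larger the sample size, the closer the sampling distribution of the sample mean is to a bell curve. For \(n = 1\) the histogram is just the right-skewed exponential distribution itself, and for \(n = 2, 3, 4\) it is still clearly skewed. By \(n = 15\) it is roughly symmetric with a slight right skew, and for \(n = 500\) it is very close to normal, centered at the population mean \(1/\lambda = 0.5\). This is what the CLT predicts; by the usual rule of thumb, a sample size of about \(n = 30\) would already give an approximately normal sampling distribution.

Discrete Uniform distribution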

Suppose we are working with the discrete uniform random variable taking values \(\{1, 2, 3, 4, 5, 6\}\).

Define a function “disc_samp” that takes input “n” and returns a random sample of size “n” from this distribution.

values <- c(1, 2, 3, 4, 5, 6)
disc_samp <- function(n) {
  # Draw n values from {1, ..., 6} with equal probability, with replacement
  sample(values, size = n, replace = TRUE)
}

Use the “disc_samp” function and the replicate function to get the histograms for the sampling distribution of the sample mean when working with sample sizes \(n = 1, 2, 3, 4, 15, 500\). Be sure to have appropriate titles for your histograms.

B <- 10000
sizes <- c(1, 2, 3, 4, 15, 500)

for (n in sizes) {
  # B sample means, each from a discrete uniform sample of size n
  xbars <- replicate(B, {
    mean(disc_samp(n))
  })
  hist(xbars, breaks = 50,
       main = paste("Sampling distribution of the sample mean, discrete uniform, n =", n))
}

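What do you notice? For \(n = 1\) the histogram is flat, since each of the six values is equally likely. For \(n = 2\) the sample mean has a triangular distribution peaked at 3.5, and as \(n\) increases the histogram becomes increasingly bell-shaped and more tightly concentrated around the population mean 3.5, as the CLT predicts.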

2.2 Continuous Uniform distribution

Suppose we are working with the continuous uniform random variable taking values on \((0,1)\).

Define a function “cont_uni_samp” that takes input “n” and returns a random sample of size “n” from this distribution.

Use the “cont_uni_samp” function and the replicate function to get the histograms for the sampling distribution of the sample mean when working with sample sizes \(n = 1, 2, 3, 4, 15, 500\). Be sure to have appropriate titles for your histograms.
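A minimal sketch of one way to do this, following the same pattern as the earlier sections. The seed, the number of replications `B`, and the histogram titles are choices made here for illustration, not requirements of the assignment.

```r
set.seed(1)  # assumed seed, only for reproducibility

# Random sample of size n from the continuous uniform distribution on (0, 1)
cont_uni_samp <- function(n) {
  runif(n, min = 0, max = 1)
}

B <- 10000
sizes <- c(1, 2, 3, 4, 15, 500)

for (n in sizes) {
  # B sample means, each from a Uniform(0, 1) sample of size n
  xbars <- replicate(B, mean(cont_uni_samp(n)))
  hist(xbars, breaks = 50,
       xlab = "sample mean",
       main = paste("Sampling distribution of the sample mean, Uniform(0, 1), n =", n))
}
```

For reference when reading the histograms: the Uniform(0, 1) distribution has mean \(1/2\) and variance \(1/12\), so the CLT suggests that for large \(n\) the sample mean is approximately normal with mean \(1/2\) and standard deviation \(\sqrt{1/(12n)}\).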

What do you notice?