Discussion 5 - Central Limit Theorem

Q1. Law of Large Numbers

The Law of Large Numbers (LLN) is a foundational principle in probability and statistics. It states that the average of the results obtained from a large number of independent trials converges to the expected value as more trials are performed. This theorem underpins the reliability of sample means as estimators of population parameters, particularly when large samples are considered.

There are two main forms of the LLN: the Weak Law of Large Numbers (WLLN) and the Strong Law of Large Numbers (SLLN). The WLLN states that for a sequence of independent and identically distributed random variables with a common mean \(\mu\) and finite variance, the sample mean converges in probability to the expected value \(\mu\) as the sample size \(n\) increases. Mathematically, this is expressed as:

\[ \overline{X}_n \xrightarrow{P} \mu \quad \text{as} \quad n \to \infty \]

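On the other hand, the SLLN, a stronger assertion, claims that the sample mean converges almost surely (with probability 1) to the expected value. This means that, with probability 1, the sequence of sample means \(\overline{X}_1, \overline{X}_2, \ldots\) gets and stays arbitrarily close to \(\mu\) as \(n\) grows; it does not mean that the sample mean exactly equals \(\mu\) at any finite \(n\). Almost sure convergence implies convergence in probability, so the SLLN implies the WLLN. The formal expression for the SLLN is: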

\[ \overline{X}_n \xrightarrow{a.s.} \mu \quad \text{as} \quad n \to \infty \]

Both versions of the LLN highlight the increasing accuracy of the sample mean as an estimator of the population mean with the growth of the sample size.
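A quick simulation can make this convergence concrete. The sketch below is an illustration only: the Exponential(1) population (so \(\mu = 1\)), the seed, and the variable names are assumptions chosen for the example and do not come from the referenced material.

```r
# Illustrative LLN simulation (assumed setup: Exponential(1) draws, so mu = 1)
set.seed(42)
n <- 100000
x <- rexp(n, rate = 1)

# Running sample mean after 1, 2, ..., n observations
running_mean <- cumsum(x) / seq_len(n)

# Sample mean and its absolute error at a few sample sizes
checkpoints <- c(10, 100, 1000, 10000, 100000)
data.frame(n = checkpoints,
           sample_mean = running_mean[checkpoints],
           abs_error = abs(running_mean[checkpoints] - 1))

# Plot the running mean against the true mean (red line)
plot(running_mean, type = "l", log = "x", xlab = "n", ylab = "Running sample mean",
     main = "Running Mean of Exponential(1) Draws")
abline(h = 1, col = "red", lwd = 2)
```

The running mean wanders for small \(n\) and then settles near the red line, which is the behavior both laws describe.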

Reference:
1. Sharma, A. (2024). Law of Large Numbers. [Rmd file].

Q2. Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental theorem in statistics that describes the distribution of sample means. According to the CLT, when a sample consists of a large number of independent, identically distributed variables with finite variance, the sampling distribution of the sample mean is approximately normal, irrespective of the shape of the population’s original distribution. This approximation becomes increasingly accurate as the sample size grows.

Mathematically, let \(X_1, X_2, \ldots, X_n\) be independent and identically distributed random variables with a common mean \(\mu\) and a finite standard deviation \(\sigma > 0\). The sample mean \(\overline{X}\), defined as \(\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i\), has a sampling distribution with mean \(\mu\) and standard error \(\sigma / \sqrt{n}\) for every \(n\). The CLT adds that the standardized form of the sample mean \(Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) converges in distribution to a standard normal distribution as \(n\) approaches infinity:

\[ Z \xrightarrow{d} N(0,1) \quad \text{as} \quad n \to \infty \]

The CLT is significant because it allows for the use of normal probability models in situations where the population distribution is unknown, providing a basis for inferential statistics such as confidence intervals and hypothesis testing.
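The standardization above can be checked numerically. The following is a minimal sketch under assumed settings (an Exponential(1) population with \(\mu = 1\) and \(\sigma = 1\), samples of size 50, and 10,000 replications); none of these choices come from the referenced material.

```r
# Illustrative CLT check (assumed setup: Exponential(1) population, mu = 1, sigma = 1)
set.seed(42)
n <- 50          # size of each sample
reps <- 10000    # number of simulated samples
mu <- 1
sigma <- 1

# Standardized sample mean Z = (xbar - mu) / (sigma / sqrt(n)) for each sample
z <- replicate(reps, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

# If the normal approximation is good, mean(z) is near 0, sd(z) is near 1,
# and the share of z values at or below 1.96 is near pnorm(1.96)
c(mean_z = mean(z), sd_z = sd(z),
  prop_le_1_96 = mean(z <= 1.96), normal_prob = pnorm(1.96))
```

Because the exponential population is right-skewed, the tail proportion is usually a little below the normal value at this sample size; increasing n should narrow the gap.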

Reference:
1. Sharma, A. (2024). Central Limit Theorem. [Rmd file].

Q3. Similarities and differences between LLN and CLT

Similarities: Both the LLN and CLT describe the behavior of sample means as sample size grows. They are foundational in the field of statistics for understanding the outcomes of repeated random processes and are essential for the practical application of inferential statistics, including hypothesis testing and confidence interval estimation. Each theorem provides theoretical support for estimating population parameters using sample statistics.

Differences: The key distinction lies in their focus and implications. The LLN is concerned primarily with the convergence of the sample mean to the population mean as the sample size increases.

On the other hand, the CLT focuses on the shape of the distribution of the sample means. It states that, for any population distribution with finite variance, the distribution of the sample mean, once centered at \(\mu\) and scaled by \(\sigma/\sqrt{n}\), approaches a normal distribution as the sample size grows.

In summary, while the LLN guarantees that the sample mean will accurately estimate the population mean with a large sample, the CLT provides a framework for understanding the distribution of this estimate, enabling more sophisticated inferential techniques based on the normal distribution.
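The contrast can be seen in one small experiment. The sketch below assumes an Exponential(1) population (\(\mu = 1\), \(\sigma = 1\)) and three illustrative sample sizes; these are choices made for the example, not part of the referenced material.

```r
# Illustrative contrast between LLN and CLT
# (assumed setup: Exponential(1) population, mu = 1, sigma = 1)
set.seed(42)
reps <- 5000
sizes <- c(10, 100, 1000)

summary_by_n <- sapply(sizes, function(n) {
  xbar <- replicate(reps, mean(rexp(n, rate = 1)))
  z <- (xbar - 1) / (1 / sqrt(n))   # standardized sample means
  c(n = n, sd_xbar = sd(xbar), sd_z = sd(z))
})

# One row per sample size
t(summary_by_n)
```

The spread of the raw sample means shrinks as n grows, which is the LLN at work, while the spread of the standardized means stays close to 1, which is the scale on which the CLT describes a stable normal shape.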

Reference:
1. Sharma, A. (2024). Central Limit Theorem. [Rmd file].
2. Sharma, A. (2024). Law of Large Numbers. [Rmd file].

Q4. Gamma distribution

The Gamma distribution is a continuous probability distribution often used to model the time until an event occurs, such as the lifespan of an electronic component or the time between customer arrivals in a queue. It is characterized by two parameters: a shape parameter \(\alpha\) (also written \(k\)) and a scale parameter \(\theta\), both of which are positive real numbers. An equivalent parameterization replaces the scale with a rate parameter \(\beta = 1/\theta\).
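For reference, the standard density and moments in the shape and scale parameterization are written out below. These are textbook formulas added for convenience rather than results from this discussion.

\[ f(x; \alpha, \theta) = \frac{x^{\alpha - 1} e^{-x/\theta}}{\Gamma(\alpha)\, \theta^{\alpha}}, \quad x > 0 \]

\[ E[X] = \alpha\theta, \qquad \mathrm{Var}(X) = \alpha\theta^2 \]

With \(\alpha = 2\) and \(\theta = 2\), the values used in the code in this discussion, the mean is \(4\) and the variance is \(8\).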

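In R, four main functions work with the Gamma distribution. In each of them the third positional argument is rate rather than scale, so the scale should be passed by name (scale = ...):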

dgamma(x, shape, scale): This function returns the density of the Gamma distribution at a given value \(x\).

# Gamma density function
x <- seq(0, 20, length.out = 100)
plot(x, dgamma(x, shape = 2, scale = 2), type = "l", main = "Gamma Density Function")

pgamma(q, shape, scale): This function provides the cumulative distribution function, calculating the probability that a variable is less than or equal to \(q\).

# Gamma CDF
plot(x, pgamma(x, shape = 2, scale = 2), type = "l", main = "Gamma Cumulative Distribution Function")

qgamma(p, shape, scale): This function computes the quantile function, finding the value \(x\) such that the probability of the variable being less than or equal to \(x\) is \(p\).

# Gamma quantile function
quantiles <- qgamma(c(0.25, 0.5, 0.75), shape = 2, scale = 2)
quantiles

## [1] 1.922558 3.356694 5.385269

rgamma(n, shape, scale): This function generates \(n\) random values following the Gamma distribution.

# Generating random values from a Gamma distribution
set.seed(42)
random_gamma <- rgamma(100, shape = 2, scale = 2)
hist(random_gamma, main = "Histogram of Random Gamma Values")
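The four functions are tied together by simple identities, which can serve as a sanity check. The sketch below reuses shape = 2 and scale = 2 from the examples above; the specific checks and the evaluation point 3 are illustrative assumptions.

```r
# Illustrative consistency checks for the Gamma functions (shape = 2, scale = 2)
shape <- 2
scale <- 2
p <- c(0.25, 0.5, 0.75)

# qgamma inverts pgamma: this should return p again
q <- qgamma(p, shape = shape, scale = scale)
pgamma(q, shape = shape, scale = scale)

# pgamma is the integral of dgamma: integrating up to the median gives about 0.5
integrate(dgamma, lower = 0, upper = q[2], shape = shape, scale = scale)$value

# scale = 2 is the same distribution as rate = 1/2
all.equal(dgamma(3, shape = shape, scale = scale),
          dgamma(3, shape = shape, rate = 1 / scale))

# The third positional argument is rate, not scale:
# dgamma(3, 2, 2) means rate = 2, i.e. scale = 0.5
dgamma(3, 2, 2) == dgamma(3, shape = 2, rate = 2)
```

The last check is the reason for naming the scale argument explicitly: a positional 2 in third place would silently select a different distribution.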

Q5A. Apply the CLT on the sample mean of Gamma distribution.

# Parameters
shape <- 2
scale <- 2

# Number of samples and sample size
n_samples <- 10000
sample_size <- 30

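# Generate samples and compute means
# (scale is passed by name because the third positional argument of rgamma() is rate)
set.seed(42)
sample_means <- replicate(n_samples, mean(rgamma(sample_size, shape = shape, scale = scale)))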

# Plot the distribution
hist(sample_means, breaks = 50, probability = TRUE, main = "Gamma Distribution of Sample Means",
     xlab = "Sample Means", col = "blue")

# Normal distribution curve
mean_of_means <- mean(sample_means)
sd_of_means <- sd(sample_means)
curve(dnorm(x, mean = mean_of_means, sd = sd_of_means), add = TRUE, col = "red", lwd = 2)
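The simulated means can also be compared with the values the CLT predicts from the Gamma parameters: mean \(\alpha\theta\) and standard error \(\sqrt{\alpha\theta^2 / n}\). The following is a self-contained sketch that repeats the simulation settings from 5A and passes the scale by name; the variable names theoretical_mean and theoretical_se are introduced here for illustration.

```r
# Illustrative check of the 5A simulation against the CLT's predicted parameters
shape <- 2
scale <- 2
n_samples <- 10000
sample_size <- 30

set.seed(42)
sample_means <- replicate(n_samples,
                          mean(rgamma(sample_size, shape = shape, scale = scale)))

# Theoretical mean and standard error of the sample mean
theoretical_mean <- shape * scale                      # alpha * theta = 4
theoretical_se <- sqrt(shape * scale^2 / sample_size)  # sqrt(8 / 30)

c(empirical_mean = mean(sample_means), theoretical_mean = theoretical_mean,
  empirical_sd = sd(sample_means), theoretical_se = theoretical_se)
```

If the two pairs of values are close, the red curve could equally be drawn with the theoretical parameters instead of the empirical ones.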

Q5B. Apply the CLT on any other sample statistic.

# Generate samples (one sample per column; scale is passed by name, not positionally)
set.seed(42)
samples <- replicate(n_samples, rgamma(sample_size, shape = shape, scale = scale))

# Calculate sample 25th, 50th and 80th percentiles
percentile_25th <- apply(samples, 2, function(x) quantile(x, probs = 0.25))
percentile_50th <- apply(samples, 2, function(x) quantile(x, probs = 0.50))
percentile_80th <- apply(samples, 2, function(x) quantile(x, probs = 0.80))

# Plot the distribution of the 25th percentile
hist(percentile_25th, breaks = 50, probability = TRUE, main = "Distribution of Sample 25th Percentile",
     xlab = "25th Percentile", col = "blue")
curve(dnorm(x, mean = mean(percentile_25th), sd = sd(percentile_25th)), add = TRUE, col = "red", lwd = 2)

# Plot the distribution of the median
hist(percentile_50th, breaks = 50, probability = TRUE, main = "Distribution of Sample Median",
     xlab = "Median", col = "yellow")
curve(dnorm(x, mean = mean(percentile_50th), sd = sd(percentile_50th)), add = TRUE, col = "red", lwd = 2)

# Plot the distribution of the 80th percentile
hist(percentile_80th, breaks = 50, probability = TRUE, main = "Distribution of Sample 80th Percentile",
     xlab = "80th Percentile", col = "green")
curve(dnorm(x, mean = mean(percentile_80th), sd = sd(percentile_80th)), add = TRUE, col = "red", lwd = 2)

Each histogram above is accompanied by a red curve showing a normal density with the same mean and standard deviation as the simulated statistic, for comparison with the shape the CLT predicts.

Gamma Distribution of Sample Means: This presents a histogram of sample means that closely matches the overlaid normal distribution curve. This is an illustration of the CLT, which predicts that the distribution of sample means will approximate a normal distribution as the sample size (here 30 observations per sample) increases. The number of simulated samples (10,000) only controls how smoothly the histogram traces that distribution.
Distribution of Sample Median: This shows the sample medians. While the distribution exhibits central clustering, it is not as neatly aligned with the normal distribution as the sample means. This deviation is not unexpected: the classical CLT is a statement about means, whereas the approximate normality of sample quantiles such as the median comes from a separate asymptotic result, and with 30 observations per sample drawn from a skewed Gamma population that approximation can still be rough.

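Thus, while the CLT itself is a statement about sample means, the histograms in 5B suggest that, with large enough sample sizes, other sample statistics like medians and percentiles can also have distributions that resemble a normal distribution, though the resemblance may not be as pronounced as it is for means. This is consistent with the known asymptotic normality of sample quantiles when the population density is continuous and positive at the quantile in question.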
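For sample quantiles, the standard asymptotic result says that the sample \(p\)-quantile is approximately normal around the population quantile \(q_p\) with standard deviation \(\sqrt{p(1-p)/n} \, / f(q_p)\), where \(f\) is the population density. The sketch below checks this for the median under the same settings as 5B; it is an illustration, and the names q_p, f_q, and asymptotic_sd are introduced here for the example.

```r
# Illustrative check of the asymptotic normal approximation for the sample median
shape <- 2
scale <- 2
n_samples <- 10000
sample_size <- 30
p <- 0.5

set.seed(42)
medians <- replicate(n_samples,
                     median(rgamma(sample_size, shape = shape, scale = scale)))

# Population median and the Gamma density at that point
q_p <- qgamma(p, shape = shape, scale = scale)
f_q <- dgamma(q_p, shape = shape, scale = scale)

# Asymptotic standard deviation of the sample p-quantile
asymptotic_sd <- sqrt(p * (1 - p) / sample_size) / f_q

c(population_median = q_p, mean_of_medians = mean(medians),
  empirical_sd = sd(medians), asymptotic_sd = asymptotic_sd)
```

Changing p to 0.25 or 0.80, and replacing median() with quantile() at the same probability, gives the corresponding comparison for the other percentiles, keeping in mind that quantile() interpolates between order statistics.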