0. Begin by setting the seed in R. The recommended way to specify a seed is set.seed(seed = 42), where seed can be any single value that is interpreted as an integer (42 here, but you can use your favorite number instead).
# Set the seed to ensure reproducibility
set.seed(seed = 30)
# Generate a random vector of 10 numbers
random_numbers <- runif(10)
# Print the random numbers
random_numbers
## [1] 0.09878282 0.48823179 0.36403673 0.42061913 0.30096439 0.14763513
## [7] 0.89857491 0.22355651 0.96596330 0.14106704
The Law of Large Numbers is a fundamental concept in probability and statistics that describes the behavior of sample averages as the sample size increases. In simple terms, it states that as you take larger and larger samples from a population, the sample mean (average) will tend to get closer to the true population mean. There are two main forms of the Law of Large Numbers:
Weak Law of Large Numbers (WLLN): This form states that the sample mean converges in probability to the population mean as the sample size increases. In other words, the probability that the sample mean deviates by a certain amount from the population mean approaches zero as the sample size grows.
Strong Law of Large Numbers (SLLN): The strong form asserts that the sample mean converges almost surely to the population mean. This means that, with probability one, the sequence of sample means converges to the population mean as the sample size grows to infinity.
The Law of Large Numbers is crucial in statistical inference and provides a theoretical foundation for the reliability of sample averages in estimating population parameters. It helps explain why, in practice, larger sample sizes tend to provide more accurate and stable estimates of population characteristics. The Law of Large Numbers emphasizes the idea that random variations tend to cancel out when working with large samples, leading to more reliable and representative estimates of the underlying population parameters.
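As a quick illustration (a minimal sketch in the spirit of the seed demo above, not part of the original class code), the running mean of fair coin flips drifts toward the true mean of 0.5 as the sample grows:
# LLN in action: running mean of fair coin flips approaches 0.5
set.seed(42)
flips <- rbinom(10000, size = 1, prob = 0.5)
running_mean <- cumsum(flips) / seq_along(flips)
running_mean[c(10, 100, 1000, 10000)] # progressively closer to 0.5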
The Central Limit Theorem (CLT) is a fundamental concept in statistics that describes the behavior of the distribution of sample means from any population, regardless of its underlying distribution, as the sample size increases. It states that the distribution of sample means approaches a normal distribution as the sample size becomes sufficiently large, regardless of the shape of the original population distribution. Imagine you have a population with any distribution, whether it’s normal, uniform, exponential, or something entirely different. Now, suppose you take multiple samples from this population, calculate the mean of each sample, and plot a histogram of those sample means. According to the Central Limit Theorem, as you increase the sample size and repeat this process, the distribution of these sample means will approximate a normal distribution. The Central Limit Theorem is powerful because it allows us to make inferences about population parameters (such as the population mean) based on sample statistics (such as the sample mean), even when we don’t know the underlying population distribution. It forms the basis for many statistical techniques and hypothesis tests, particularly those involving means and proportions. Here are some key points and uses of the Central Limit Theorem:
Sampling Distribution of the Sample Mean: The Central Limit Theorem specifically applies to the sampling distribution of the sample mean. It states that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population distribution.
Estimation and Inference: The CLT allows statisticians to make accurate estimations and conduct hypothesis tests about population parameters based on sample means. For example, it enables us to construct confidence intervals and perform hypothesis tests for population means.
Practical Applications: The Central Limit Theorem is widely used in various fields such as finance, economics, engineering, and social sciences. It’s applied in areas like quality control, market research, public health studies, and more, where making inferences about populations based on sample data is common.
Assumptions and Limitations: While the Central Limit Theorem is a powerful tool, it does have assumptions and limitations. For instance, it assumes that the observations are independent and identically distributed (i.i.d.), and it may require a sufficiently large sample size for the approximation to the normal distribution to hold.
Understanding the Central Limit Theorem is crucial for anyone working with statistical data, as it underpins many statistical methods and helps ensure the validity of statistical analyses.
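The distribution-free nature of the CLT is easy to check in R. Here is a minimal sketch using a Uniform(0, 1) parent (an extra illustration, separate from the full Exponential demo in part 5):
# CLT quick check: means of Uniform(0,1) samples look approximately normal
set.seed(1)
uniform_means <- replicate(2000, mean(runif(50))) # 2000 sample means, n = 50 each
hist(uniform_means, breaks = 30,
     main = "Sample Means of Uniform(0,1), n = 50", xlab = "Sample Mean")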
Resources:
- StatQuest: Central Limit Theorem by Josh Starmer at StatQuest provides a clear and intuitive explanation of the Central Limit Theorem.
- Intro to Central Limit Theorem by JB Statistics offers another insightful explanation with examples and demonstrations.
- Wikipedia - Central Limit Theorem provides a detailed overview and formal definition of the theorem, along with its significance and uses.
Both the law of large numbers and the central limit theorem are fundamental concepts in statistics that describe the behavior of sample statistics as the sample size increases.
Similarities: Both LLN and CLT deal with the behavior of sample statistics as sample size increases. They are fundamental principles in statistical theory and are widely used in various statistical analyses. Both LLN and CLT are essential for understanding the properties of estimators and making inferences about population parameters based on sample data.
Differences: The Law of Large Numbers focuses on the value of the sample mean, stating that as the sample size increases, the sample mean converges to the population mean. The Central Limit Theorem, on the other hand, concerns the distribution of the sample mean, asserting that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution. While the LLN emphasizes the convergence of a sample statistic to a population parameter, the CLT emphasizes the convergence of the distribution of that statistic to a normal distribution. The weak LLN holds for any population distribution with a finite mean, whereas the classical CLT additionally requires a finite variance.
Both LLN and CLT play critical roles in statistical inference, but they address different aspects of the behavior of sample statistics and have distinct conditions and implications.
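For reference, both results can be stated compactly for i.i.d. draws with mean μ and finite variance σ² (a standard textbook formulation, not from the class notes):
$$\bar{X}_n \xrightarrow{p} \mu \ \text{(weak LLN)}, \qquad \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} N(0,1) \ \text{(CLT)}$$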
The Exponential distribution models the time until an event occurs in a process with a constant rate, and it is often used to represent waiting times between events in a Poisson process. The distribution is defined by a rate parameter λ > 0 and has mean 1/λ.
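For completeness, since the snippets below draw from other distributions, here is a minimal sketch for generating Exponential values (using rate = 0.2, the same value reused in part 5):
# R code for generating Exponential distribution
rate <- 0.2
set.seed(123) # for reproducibility
exponential_values <- rexp(10, rate)
exponential_values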
The Beta distribution is a continuous probability distribution on the interval [0, 1], governed by two shape parameters (alpha and beta); it is commonly used to model proportions and probabilities.
# R code for generating Beta distribution
alpha <- 2
beta <- 5
set.seed(123) # for reproducibility
beta_values <- rbeta(10, alpha, beta)
beta_values
## [1] 0.18559377 0.24147013 0.68896738 0.30016841 0.31254601 0.38799896
## [7] 0.09457126 0.16650121 0.20402421 0.37360897
A Bernoulli distribution is a discrete probability distribution that represents the outcome of a single binary experiment, such as success/failure or heads/tails.
# R code for generating Bernoulli distribution from Rlab package
library(Rlab)
## Rlab 4.0 attached.
##
## Attaching package: 'Rlab'
## The following objects are masked from 'package:stats':
##
## dexp, dgamma, dweibull, pexp, pgamma, pweibull, qexp, qgamma,
## qweibull, rexp, rgamma, rweibull
## The following object is masked from 'package:datasets':
##
## precip
p <- 0.3
set.seed(123) # for reproducibility
bernoulli_values <- rbern(10, p)
bernoulli_values
## [1] 0 1 0 1 1 0 0 1 0 0
The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number n of independent Bernoulli trials, each with the same success probability p.
# R code for generating Binomial distribution
n <- 10
p <- 0.4
set.seed(123) # for reproducibility
binomial_values <- rbinom(10, n, p)
binomial_values
## [1] 3 5 4 6 6 1 4 6 4 4
The Cauchy distribution is a continuous probability distribution with heavy tails, meaning that it has a higher probability of extreme outcomes than other distributions such as the normal distribution. Notably, its mean and variance are undefined, so the Law of Large Numbers and the classical Central Limit Theorem do not apply to it.
# R code for generating Cauchy distribution
location <- 0
scale <- 1
set.seed(123) # for reproducibility
cauchy_values <- rcauchy(10, location, scale)
cauchy_values
## [1] 1.2691296 -0.7842432 3.4011811 -0.3850032 -0.1892392 0.1441052
## [7] -11.2960947 -0.3514607 -6.1346270 7.2913309
The chi-square distribution is a continuous probability distribution that originated in the context of the chi-square test and is used in various statistical tests, such as tests for goodness of fit and tests of independence in contingency tables.
# R code for generating Chi-Square distribution
df <- 5
set.seed(123) # for reproducibility
chisq_values <- rchisq(10, df)
chisq_values
## [1] 2.5718020 8.0747086 0.6485141 4.3740386 10.3216603 5.4098898
## [7] 1.2220565 0.6062728 8.2114143 5.0824402
The F distribution is a continuous probability distribution that arises in statistical hypothesis testing, particularly in analysis of variance (ANOVA) and regression analysis. It is the ratio of two independent chi-square random variables, each divided by its degrees of freedom.
# R code for F distribution
df1 <- 3
df2 <- 5
set.seed(123) # for reproducibility
f_values <- rf(10, df1, df2)
f_values
## [1] 0.21386018 0.01836131 2.12599955 1.04125921 1.15158441 0.13626690
## [7] 1.04252091 0.17713294 0.64475570 0.19676228
5 A. Then, apply the CLT on the sample mean of this chosen distribution in R (adapt our class R code, or you can find alternative code on the web too).
# Parameters for the Exponential distribution
rate <- 0.2
# Number of samples to draw from the Exponential distribution
sample_size <- 1000
# Number of times to repeat the sampling process
num_samples <- 1000
# Function to generate sample means
generate_sample_means <- function(sample_size, num_samples, rate) {
  # Use stats::rexp explicitly: the Rlab package attached above masks rexp
  sample_means <- replicate(num_samples, mean(stats::rexp(sample_size, rate)))
  return(sample_means)
}
# Generate sample means
set.seed(123) # for reproducibility
sample_means <- generate_sample_means(sample_size, num_samples, rate)
# Plot the histogram of sample means
hist(sample_means, main = "Distribution of Sample Means (CLT)",
     xlab = "Sample Mean", col = "skyblue", border = "black")
# Add a red dashed line at the theoretical mean of the Exponential distribution (1/rate)
abline(v = 1 / rate, col = "red", lty = 2, lwd = 2)
# Add blue dashed lines one theoretical standard error of the sample mean,
# (1/rate)/sqrt(n), on either side of the mean
se_mean <- 1 / (rate * sqrt(sample_size))
abline(v = 1 / rate + c(-1, 1) * se_mean, col = "blue", lty = 2, lwd = 2)
# Add legend
legend("topright", legend = c("Theoretical Mean", "Mean +/- 1 SE"),
       col = c("red", "blue"), lty = 2, lwd = 2)
5 B. Alternatively, apply the CLT on any other sample statistic, such as the sample median, the sample 25th percentile, or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.
# Function to generate sample medians
generate_sample_medians <- function(sample_size, num_samples, rate) {
  # stats::rexp explicitly, since the Rlab package loaded above masks rexp
  sample_medians <- replicate(num_samples, median(stats::rexp(sample_size, rate)))
  return(sample_medians)
}
# Function to generate sample 25th percentiles
generate_sample_25th_percentiles <- function(sample_size, num_samples, rate) {
  sample_25th_percentiles <- replicate(num_samples,
                                       quantile(stats::rexp(sample_size, rate), probs = 0.25))
  return(sample_25th_percentiles)
}
# Generate sample medians
set.seed(123) # for reproducibility
sample_medians <- generate_sample_medians(sample_size, num_samples, rate)
# Generate sample 25th percentiles
set.seed(123) # for reproducibility
sample_25th_percentiles <- generate_sample_25th_percentiles(sample_size, num_samples, rate)
# Plot the histogram of sample medians
hist(sample_medians, main = "Distribution of Sample Medians (CLT)",
     xlab = "Sample Median", col = "lightgreen", border = "black")
# Add a red dashed line at the theoretical median of the Exponential distribution, log(2)/rate
abline(v = log(2) / rate, col = "red", lty = 2, lwd = 2)
# Add blue dashed lines one asymptotic standard error of the sample median on either
# side of the median: sqrt(0.5 * 0.5 / n) / f(median) simplifies to 1 / (rate * sqrt(n))
se_median <- 1 / (rate * sqrt(sample_size))
abline(v = log(2) / rate + c(-1, 1) * se_median, col = "blue", lty = 2, lwd = 2)
# Add legend
legend("topright", legend = c("Theoretical Median", "Median +/- 1 SE"),
       col = c("red", "blue"), lty = 2, lwd = 2)
# Plot the histogram of sample 25th percentiles
hist(sample_25th_percentiles, main = "Distribution of Sample 25th Percentiles (CLT)",
     xlab = "Sample 25th Percentile", col = "lightblue", border = "black")
# Add a red dashed line at the theoretical 25th percentile of the Exponential distribution
q25 <- stats::qexp(0.25, rate) # stats::qexp explicitly, since Rlab masks qexp
abline(v = q25, col = "red", lty = 2, lwd = 2)
# Add blue dashed lines one asymptotic standard error of the sample 25th percentile on
# either side: sqrt(p * (1 - p) / n) / f(q) with p = 0.25 and f(q) = 0.75 * rate
se_q25 <- sqrt(0.25 * 0.75 / sample_size) / (0.75 * rate)
abline(v = q25 + c(-1, 1) * se_q25, col = "blue", lty = 2, lwd = 2)
# Add legend
legend("topright", legend = c("Theoretical 25th Percentile", "25th Percentile +/- 1 SE"),
       col = c("red", "blue"), lty = 2, lwd = 2)
In all cases (sample mean, sample median, and sample 25th percentile), the sampling distributions are centered on the theoretical values (red dashed lines), as the Law of Large Numbers predicts. The histograms also trend toward normality: even though the parent Exponential distribution is strongly skewed, the distributions of all three sample statistics look approximately bell-shaped at n = 1000, which is exactly the behavior the Central Limit Theorem describes. The blue dashed lines mark one theoretical standard error on either side of the center; these bands shrink at the rate 1/sqrt(n), so larger samples yield tighter and more reliable estimates. Together, the plots provide evidence that CLT-type behavior holds not only for the sample mean but also for the sample median and the sample 25th percentile in this setting.
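As an optional extra check (not required by the assignment), a normal Q-Q plot makes the approximate normality easier to judge than a histogram:
# Normal Q-Q plot of the simulated sample means; points hugging the line
# indicate approximate normality
qqnorm(sample_means, main = "Normal Q-Q Plot of Sample Means")
qqline(sample_means, col = "red", lwd = 2)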