set.seed(seed = 42)
rand_nums <- runif(10)
print(rand_nums)
## [1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455 0.5190959 0.7365883
## [8] 0.1346666 0.6569923 0.7050648
The Law of Large Numbers states that as the sample size increases, the sample mean converges to the population mean. This occurs because larger sample sizes limit the effects of random fluctuations, making the sample more representative of the population.
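A small standalone sketch can illustrate this convergence. The choice of runif (uniform on [0, 1], population mean 0.5) and the running-mean plot are just illustrative assumptions, run separately from the main exercise below so they do not disturb its seed.
# Illustrative sketch of the Law of Large Numbers: the running mean of
# uniform draws (population mean 0.5) settles toward 0.5 as n grows.
x <- runif(10000)
running_mean <- cumsum(x) / seq_along(x)
plot(running_mean, type = "l", xlab = "Sample size", ylab = "Running mean",
     main = "LLN: running mean of uniform draws")
abline(h = 0.5, col = "red", lwd = 2)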
The Central Limit Theorem (CLT) states that the distribution of a sample mean can be approximated by a near-normal distribution if the sample size is sufficiently large, provided the observations are independent and identically distributed. It is an important concept in statistics because it allows statisticians to apply procedures that assume a normal distribution. It can also be applied to both discrete and continuous data, which gives it broad flexibility in its applications.
Source: https://bootcamp.umass.edu/blog/quality-management/central-limit-theorem
Both the LoLN and the CLT state that larger samples from a population yield data that are more representative of that population, and both are used to make inferences about an entire population from the findings of a smaller sample. The key difference is what each describes: the LoLN says the sample mean approaches the population mean as the sample size grows, while the CLT says the distribution of the sample mean approaches a normal distribution as the sample size grows. In that sense, the CLT is more intrinsically tied to the normal distribution than the LoLN in its result, its application, and the shape of the distribution it describes.
The Gamma distribution is a probability distribution that models the time it takes for a certain number of events to occur (waiting times). Its parameters are shape and scale: shape relates to how many events you are waiting for, and scale is the average time per event. The Chi-square distribution is a special case of the Gamma distribution, with shape k/2 (where k is the integer degrees of freedom) and scale 2. Gamma distributions are considered flexible and are often used in Bayesian statistics, queuing theory, and survival analysis to model waiting times and uncertainty.
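That relationship can be verified directly in R; the values of k and x_vals below are arbitrary illustrative choices.
# Quick check: a chi-square with k degrees of freedom has the same density
# as a Gamma with shape = k/2 and scale = 2.
k <- 4
x_vals <- seq(0.1, 15, by = 0.1)
all.equal(dchisq(x_vals, df = k),
          dgamma(x_vals, shape = k / 2, scale = 2))  # should return TRUE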
5A. Then, apply the CLT to the sample mean of this chosen distribution in R.
## Step 1: Create the distribution
shape <- 2   # Shape parameter
scale <- 1   # Scale parameter
n <- 10000   # Number of samples
# Name the arguments explicitly: rgamma's third positional argument is rate,
# not scale (with scale = 1 the two coincide, but naming them avoids confusion)
population <- rgamma(n, shape = shape, scale = scale)
hist(x = population, main = 'Histogram of Gamma Distribution')
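As an optional check, not part of the assigned steps, the theoretical Gamma density can be overlaid on this histogram; the histogram is redrawn on the density scale (probability = TRUE) so the curve lines up.
# Optional check: overlay the theoretical Gamma(shape = 2, scale = 1) density
# on the population histogram, drawn on the density scale.
hist(population, probability = TRUE,
     main = 'Histogram of Gamma Distribution', xlab = 'x')
curve(dgamma(x, shape = shape, scale = scale), col = 'blue', lwd = 2, add = TRUE)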
## Step 2: Create an empty matrix to store sample means
sample_n <- c(2, 6, 30, 50, 200, 1000) #Recommended sample sizes
z2 <- matrix(0, nrow = n, ncol = length(sample_n))
z2[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Step 3: Generate sample means for previously created sample sizes
for (j in seq_along(sample_n)) {
  for (i in 1:n) {
    # Draw a sample of size sample_n[j] with replacement and store its mean
    z2[i, j] <- mean(sample(x = population, size = sample_n[j],
                            replace = TRUE))
  }
}
z2[1:16]
## [1] 3.0412766 1.6856202 0.8310131 4.3073299 2.3505171 0.8469918 1.7042939
## [8] 1.9000347 1.6099917 1.9414896 0.8434196 2.4363454 2.5411472 1.0746662
## [15] 1.2179040 1.8383885
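The same matrix of sample means can also be built a bit more compactly with replicate() and sapply(); this is just an equivalent alternative sketch, not a required change to the loop above.
# Equivalent, more compact alternative: one column of sample means per
# sample size, built with replicate() instead of nested index loops.
z2_alt <- sapply(sample_n, function(size) {
  replicate(n, mean(sample(population, size = size, replace = TRUE)))
})
dim(z2_alt)  # n rows (10000) by length(sample_n) columns (6)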
## Step 4: Establish the population mean and standard deviation for subsequent plots
pop_mean <- mean(population)
pop_sd <- sd(population)
pop_mean
## [1] 2.004068
pop_sd
## [1] 1.432217
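These empirical values sit close to the theoretical moments of a Gamma(shape = 2, scale = 1) distribution, which can be computed directly as a quick added check.
# Theoretical moments of Gamma(shape = 2, scale = 1) for comparison:
# mean = shape * scale, variance = shape * scale^2, so sd = sqrt(shape) * scale.
shape * scale         # theoretical mean: 2
sqrt(shape) * scale   # theoretical sd: sqrt(2), about 1.414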
## Step 5: Show that the distribution of the sample mean approaches a
## normal distribution as the sample size increases.
# Assign column names to z2 and inspect the CLT intuition with summary()
colnames(z2) <- c("Sample size = 2", "Sample size = 6", "Sample size = 30",
                  "Sample size = 50", "Sample size = 200", "Sample size = 1000")
summary(z2)
## Sample size = 2 Sample size = 6 Sample size = 30 Sample size = 50
## Min. :0.1457 Min. :0.5194 Min. :1.107 Min. :1.310
## 1st Qu.:1.2718 1st Qu.:1.5794 1st Qu.:1.823 1st Qu.:1.867
## Median :1.8469 Median :1.9292 Median :1.994 Median :1.996
## Mean :2.0122 Mean :1.9889 Mean :2.007 Mean :2.004
## 3rd Qu.:2.5644 3rd Qu.:2.3594 3rd Qu.:2.179 3rd Qu.:2.134
## Max. :8.0680 Max. :4.6939 Max. :3.060 Max. :2.836
## Sample size = 200 Sample size = 1000
## Min. :1.632 Min. :1.817
## 1st Qu.:1.935 1st Qu.:1.973
## Median :2.001 Median :2.004
## Mean :2.003 Mean :2.004
## 3rd Qu.:2.071 3rd Qu.:2.034
## Max. :2.475 Max. :2.178
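Beyond the means, the CLT also predicts the spread of the sample means: the standard deviation of each column should shrink like the population standard deviation divided by the square root of the sample size. The short comparison below is a quick added check using the objects already created.
# CLT spread check: the sd of the sample means in each column should be
# close to the population sd divided by sqrt(sample size).
round(apply(z2, 2, sd), 4)          # empirical sd of the sample means
round(pop_sd / sqrt(sample_n), 4)   # CLT prediction: pop_sd / sqrt(n)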
# Configure the plotting layout (two histograms per page)
par(mfrow = c(2, 1))
# Loop through the columns of z2 and create a histogram of the sample means
for (j in 1:ncol(z2)) {
  hist(z2[, j], probability = TRUE, col = rgb(.2, .4, .6, .5),
       main = colnames(z2)[j], xlab = 'Sample mean')
  # Include a red line at the population mean
  abline(v = pop_mean, col = 'red', lwd = 2)
  # Include a normal distribution curve using the mean and standard deviation
  # of the sample means to contrast results
  curve(dnorm(x, mean = mean(z2[, j]), sd = sd(z2[, j])), col = 'blue',
        lwd = 2, add = TRUE)
}
Does the central limit theorem hold as expected?
The central limit theorem does hold as expected. The distribution of the sample mean begins to resemble a normal distribution as the sample size increases, even though the original random variable was not normally distributed. This confirms that, regardless of the initial distribution, the sample mean is approximately normally distributed when the sample size is large enough. A common rule of thumb is that sample sizes should exceed thirty for the approximation to hold, and that holds true in my plots.
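As one more optional visual check, a normal Q-Q plot of the sample means for the largest sample size can be drawn: if the normal approximation is good, the points should fall close to the reference line.
# Optional visual check: Q-Q plot of the sample means for sample size 1000
# against a normal distribution; points near the line support the CLT.
par(mfrow = c(1, 1))
qqnorm(z2[, "Sample size = 1000"], main = "Normal Q-Q plot, sample size = 1000")
qqline(z2[, "Sample size = 1000"], col = "red", lwd = 2)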