CLT Explain

Central Limit Theorem

Measurement of Central Tendency

Definition of Central Tendency: Measures of central tendency are statistics that describe the center or midpoint of a dataset.
Common Measures of Central Tendency: We’ll delve into commonly used measures of central tendency, including the mean (average), median (middle value), and mode (most frequent value). We’ll see how these measures can be calculated using both population data and sample data.

Measurement of Dispersion

Definition of Dispersion: Measures of dispersion quantify the spread or variability in a dataset.
Common Measures of Dispersion: We’ll explore measures of dispersion, such as variance and standard deviation, and understand how they help us assess the extent to which data points deviate from the central tendency.

What is Central Limit Theorem ?

Definition of CLT: The CLT states that, under certain conditions, the sampling distribution of the sample mean will approximate a normal distribution, regardless of the shape of the population distribution.

Assumptions of CLT

Independence of Random Variables: The observations in the sample must be independent of each other.
Identical Distribution: Each observation in the sample must come from the same probability distribution.
Sufficiently Large Sample Size: The sample size should be sufficiently large (usually n > 30) for the CLT to apply.
Statement of CLT: The CLT enables us to understand how the distribution of sample means approaches a normal distribution as the sample size increases. It also provides a formula for calculating the standard error of the sample mean.

Sampling Distribution

Explanation of Sampling Distribution: The sampling distribution represents the distribution of sample statistics (e.g., sample means) obtained from multiple samples drawn from the same population.
Comparison of Population and Sampling Distribution: We’ll see how the characteristics of the population distribution relate to those of the sampling distribution.

Sample Data

# Sample data
data <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39)

# Calculate mean, median, and mode
mean_value <- mean(data)
median_value <- median(data)
mode_value <- as.numeric(names(sort(table(data), decreasing = TRUE)[1]))
variance_value <- var(data)
sd_value <- sd(data)

Sample Data Distribution: Calculation

df <- c("mean" = mean_value, 
        "median" = median_value, 
        "mode" = mode_value, 
        "variance" = variance_value, 
        "SD" = sd_value)


knitr::kable(df)

	x
mean	25.500000
median	25.500000
mode	12.000000
variance	82.500000
SD	9.082951

Sample Data Distribution: Visualization

# Visualize distribution
hist(data, main = "Histogram of Sample Data", xlab = "Value", ylab = "Frequency", col = "lightblue")
abline(v = mean_value, col = "red", lwd = 2)
abline(v = median_value, col = "blue", lwd = 2)
legend("topright", legend = c("Mean", "Median"), col = c("red", "blue"), lwd = 2)

Plot

Sampling Distribution Demo

# Simulate CLT
set.seed(123)  # For reproducibility
n_samples <- 1000
sample_means <- numeric(n_samples)

for (i in 1:n_samples) {
  sample_data <- base::sample(data, size = 10, replace = TRUE)
  sample_means[i] <- mean(sample_data)
}

# Plot the histogram of sample means
hist(sample_means, main = "Histogram of Sample Means (CLT)", xlab = "Sample Mean", ylab = "Frequency", col = "lightgreen")

Plot

Empirical Rules

Demo Sampling Distribution with Empirical Rules

# Calculate mean and standard deviation
mean_value <- mean(sample_means)
sd_value <- sd(sample_means)

# Create a histogram
hist(sample_means, main = "Histogram of Sample Means (CLT)", xlab = "Sample Mean", ylab = "Frequency", col = "lightgreen")

# Calculate values for the vertical lines
sd_lines <- c(-3, -2, -1, 1, 2, 3)
line_colors <- c("red", "orange", "yellow", "yellow", "orange", "red")

# Add vertical lines for each standard deviation on both sides
for (i in 1:length(sd_lines)) {
  abline(v = mean_value + sd_lines[i] * sd_value, col = line_colors[i], lty = 2)
  text(mean_value + sd_lines[i] * sd_value, 10, round(mean_value + sd_lines[i] * sd_value, 2), col = line_colors[i])
}

# Add a vertical line for the mean
abline(v = mean_value, col = "blue", lty = 1)
text(mean_value, 10, round(mean_value, 2), col = "blue")

Plot

Is that true?

minu_2_sd <- (mean(sample_means) - (2 * sd(sample_means)))
plus_2_sd <- (mean(sample_means) + (2 * sd(sample_means)))

x <- sample_means >=  minu_2_sd & sample_means <= plus_2_sd

sum(x)/length(x) * 100

[1] 96.5