Definition of Central Tendency: Measures of central tendency are statistics that describe the center or midpoint of a dataset.
Common Measures of Central Tendency: We’ll delve into commonly used measures of central tendency, including the mean (average), median (middle value), and mode (most frequent value). We’ll see how these measures can be calculated using both population data and sample data.
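The three measures can be sketched in R as follows. This is a minimal illustration, not the author's own code: the vector `scores` and the helper `stat_mode` are hypothetical names introduced here. Base R has no built-in function for the statistical mode (its `mode()` reports an object's storage type), so the helper counts occurrences with `table()`.

```r
# Hypothetical data, invented for illustration
scores <- c(2, 3, 3, 5, 7, 8, 3, 9)

# Mean: sum of the values divided by the number of values
mean_score <- mean(scores)

# Median: middle value of the sorted data
# (the average of the two middle values when the count is even)
median_score <- median(scores)

# Mode: hypothetical helper that returns the most frequent value(s);
# it returns several values when there is a tie
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_score <- stat_mode(scores)

print(c(mean = mean_score, median = median_score))
print(mode_score)
```

The mean and median are computed the same way for a population and for a sample; the distinction matters for measures of dispersion, where the divisor changes.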
Definition of Dispersion: Measures of dispersion quantify the spread or variability in a dataset.
Common Measures of Dispersion: We’ll explore measures of dispersion, such as variance and standard deviation, and understand how they help us assess the extent to which data points deviate from the central tendency.
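As a rough sketch of the difference between the sample and population versions, assuming a small hypothetical vector `values`: R's `var()` and `sd()` divide by n - 1 (the sample formulas), so the population versions are computed by hand here.

```r
# Hypothetical data, invented for illustration
values <- c(4, 8, 6, 5, 3, 7)
n <- length(values)

# Sample variance and standard deviation: var() and sd() divide by n - 1
sample_var <- var(values)
sample_sd <- sd(values)

# Population variance and standard deviation: divide by n instead
pop_var <- sum((values - mean(values))^2) / n
pop_sd <- sqrt(pop_var)

print(c(sample_var = sample_var, pop_var = pop_var))
print(c(sample_sd = sample_sd, pop_sd = pop_sd))
```

The population figures are always slightly smaller than the sample figures for the same data, and the gap shrinks as n grows.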
Definition of CLT: The Central Limit Theorem (CLT) states that, under certain conditions, the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution (provided the population has a finite variance).
Independence of Random Variables: The observations in the sample must be independent of each other.
Identical Distribution: Each observation in the sample must come from the same probability distribution.
Sufficiently Large Sample Size: The sample size should be large enough for the normal approximation to be accurate. A common rule of thumb is n > 30, although strongly skewed populations may need larger samples.
Statement of CLT: The CLT describes how the distribution of sample means approaches a normal distribution as the sample size increases. The spread of that distribution is measured by the standard error of the sample mean: the population standard deviation divided by the square root of the sample size.
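In standard notation, the statement and the standard error formula can be written as below. The symbols are assumptions introduced here, not taken from the original text: the observations X_1, ..., X_n are independent and identically distributed with mean μ and finite standard deviation σ.

```latex
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\operatorname{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}, \qquad
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
\]
```

When σ is unknown, it is usually estimated by the sample standard deviation s, giving the estimated standard error s / √n.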
Explanation of Sampling Distribution: The sampling distribution represents the distribution of sample statistics (e.g., sample means) obtained from multiple samples drawn from the same population.
Comparison of Population and Sampling Distribution: We’ll see how the characteristics of the population distribution relate to those of the sampling distribution.
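One way to see the relationship is a small simulation. The sketch below assumes a hypothetical right-skewed population drawn with `rexp()`; the names `population`, `clt_means`, and `theoretical_se` are introduced here for illustration only. The mean of the sample means should sit close to the population mean, and their standard deviation should sit close to the population standard deviation divided by √n.

```r
set.seed(42)  # for reproducibility

# Hypothetical right-skewed population (exponential, mean = 1 / rate = 2)
population <- rexp(100000, rate = 0.5)

n <- 50            # size of each sample
n_samples <- 2000  # number of samples drawn

# Mean of each sample; together these approximate the sampling distribution
clt_means <- replicate(n_samples,
                       mean(sample(population, size = n, replace = TRUE)))

pop_mean <- mean(population)
pop_sd <- sqrt(mean((population - pop_mean)^2))  # population SD (divides by N)
theoretical_se <- pop_sd / sqrt(n)

cat("Population mean:     ", pop_mean, "\n")
cat("Mean of sample means:", mean(clt_means), "\n")
cat("Theoretical SE:      ", theoretical_se, "\n")
cat("SD of sample means:  ", sd(clt_means), "\n")
```

Plotting `hist(population)` next to `hist(clt_means)` shows the contrast in shape: the population is skewed, while the sample means are roughly bell-shaped and far less spread out.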
# Visualize distribution
hist(data, main = "Histogram of Sample Data", xlab = "Value", ylab = "Frequency", col = "lightblue")
abline(v = mean_value, col = "red", lwd = 2)
abline(v = median_value, col = "blue", lwd = 2)
legend("topright", legend = c("Mean", "Median"), col = c("red", "blue"), lwd = 2)

# Simulate CLT
set.seed(123) # For reproducibility
n_samples <- 1000
sample_means <- numeric(n_samples)
for (i in 1:n_samples) {
  # Draw a sample of 10 values with replacement and store its mean
  sample_data <- base::sample(data, size = 10, replace = TRUE)
  sample_means[i] <- mean(sample_data)
}
# Plot the histogram of sample means
hist(sample_means, main = "Histogram of Sample Means (CLT)", xlab = "Sample Mean", ylab = "Frequency", col = "lightgreen")

Empirical Rules
# Calculate mean and standard deviation
mean_value <- mean(sample_means)
sd_value <- sd(sample_means)
# Create a histogram
hist(sample_means, main = "Histogram of Sample Means (CLT)", xlab = "Sample Mean", ylab = "Frequency", col = "lightgreen")
# Calculate values for the vertical lines
sd_lines <- c(-3, -2, -1, 1, 2, 3)
line_colors <- c("red", "orange", "yellow", "yellow", "orange", "red")
# Add vertical lines for each standard deviation on both sides
for (i in seq_along(sd_lines)) {
  x_pos <- mean_value + sd_lines[i] * sd_value
  abline(v = x_pos, col = line_colors[i], lty = 2)
  text(x_pos, 10, round(x_pos, 2), col = line_colors[i])
}
# Add a vertical line for the mean
abline(v = mean_value, col = "blue", lty = 1)
text(mean_value, 10, round(mean_value, 2), col = "blue")

# Percentage of sample means within two standard deviations of the mean
minus_2_sd <- mean(sample_means) - 2 * sd(sample_means)
plus_2_sd <- mean(sample_means) + 2 * sd(sample_means)
x <- sample_means >= minus_2_sd & sample_means <= plus_2_sd
sum(x) / length(x) * 100
[1] 96.5
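The same check can be extended to one, two, and three standard deviations, which the empirical rule puts at roughly 68%, 95%, and 99.7% for a normal distribution. The sketch below is self-contained and rests on assumptions: `sim_means` is a hypothetical stand-in for the sample means (means of 30 uniform draws), and `pct_within` is a helper introduced here, not part of base R.

```r
set.seed(123)  # for reproducibility

# Hypothetical stand-in for the sample means: 1000 means of 30 uniform draws
sim_means <- replicate(1000, mean(runif(30)))

# Hypothetical helper: percentage of values within k standard deviations of the mean
pct_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x)) * 100
}

coverage <- sapply(1:3, function(k) pct_within(sim_means, k))
names(coverage) <- paste0("within_", 1:3, "_sd")
print(round(coverage, 1))
```

With simulated data the percentages will not match 68, 95, and 99.7 exactly, but they should land close to those values when the sample means are approximately normal.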