# Load package manager ----
if (!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}
# Load required packages ----
p_load(
  tidyverse,
  janitor,
  skimr,
  grateful,
  patchwork
)
# Install (if needed) and load the artyfarty themes from GitHub
p_load_gh("datarootsio/artyfarty")
theme_set(theme_scientific())
The Law of Large Numbers and Central Limit Theorem with Simulation in R
A Short Guide to the Law of Large Numbers and Central Limit Theorem
This article provides a brief exploration of two fundamental statistical theorems: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). It explains the meaning, applications, and significance of both theorems in various fields, such as insurance, polling, and hypothesis testing. Practical simulations are demonstrated using R, including rolling a die to illustrate the LLN and sampling from a uniform distribution to showcase the CLT. Visualizations are generated using ggplot2 to highlight how sample means converge to expected values and approximate normal distributions. These concepts are essential for understanding statistical inference and the behavior of sample data.
1 Introduction
The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are two of the most fundamental theorems in probability and statistics. This document provides a short guide to understanding these concepts, their practical applications, and how to simulate them in R using ggplot2 for visualizations.
2 Part 1: The Law of Large Numbers
2.1 Meaning of the Law of Large Numbers
The Law of Large Numbers (LLN) states that as the number of independent and identically distributed (i.i.d.) trials or observations increases, the sample average converges to the expected value of the population. In other words, the more data you collect, the closer the sample mean will be to the population mean (Black 2023).
2.1.1 Types of LLN
- Weak Law of Large Numbers (WLLN): Convergence in probability.
- Strong Law of Large Numbers (SLLN): Almost sure convergence.
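For readers who want the two modes of convergence spelled out, they can be written as follows. The notation is an assumption of this sketch: \(\bar{X}_n\) denotes the mean of the first \(n\) observations, \(\mu = E(X)\) is the population mean, and \(\varepsilon\) is an arbitrary positive tolerance.

\[ \text{WLLN:} \quad \lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| > \varepsilon \right) = 0 \quad \text{for every } \varepsilon > 0 \]

\[ \text{SLLN:} \quad P\left( \lim_{n \to \infty} \bar{X}_n = \mu \right) = 1 \]

The strong law implies the weak law: almost sure convergence guarantees convergence in probability, but not the other way around.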
2.1.2 Formula
If \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables with a finite expected value \(E(X)\), the sample average is given by:
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \]
As \(n\) increases, \(\bar{X}_n\) converges to \(E(X)\): in probability under the WLLN, and almost surely under the SLLN.
2.2 Applications of the Law of Large Numbers
LLN is foundational to many fields, including:
- Insurance: Insurers rely on LLN to predict average loss and set premiums.
- Gambling: Casinos use LLN to ensure that, over time, they make a predictable profit.
- Polling: As more people are polled, the average response should approximate the population mean.
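The polling application can be sketched with a small simulation. The population support of 52%, the poll sizes, and the seed below are hypothetical values chosen only for illustration; each respondent is modeled as an independent Bernoulli draw, which is a simplifying assumption rather than a description of real survey sampling.

```r
# Hypothetical polling example: estimate a population proportion
# from polls of increasing size (base R only).
set.seed(123)                       # illustrative seed
true_support <- 0.52                # assumed population proportion
poll_sizes <- c(100, 1000, 10000, 100000)

# Each respondent answers "yes" (1) with probability true_support
estimates <- sapply(poll_sizes, function(n) {
  mean(rbinom(n, size = 1, prob = true_support))
})

poll_results <- data.frame(
  Respondents = poll_sizes,
  Estimate = estimates,
  Abs_Error = abs(estimates - true_support)
)
print(poll_results)
```

Individual polls can still miss in either direction, but the typical size of the error shrinks as the number of respondents grows.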
2.3 Simulating the Law of Large Numbers in R
2.3.1 Example: Rolling a Fair Die
We will simulate rolling a fair six-sided die. The expected value is:
\[ E(X) = \frac{1+2+3+4+5+6}{6} = 3.5 \]
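The same arithmetic can be checked directly in R. The sketch below also computes the variance of a single roll, \(35/12 \approx 2.92\), which governs how quickly the average settles down; the variable names are illustrative.

```r
# Expected value and variance of one roll of a fair six-sided die
faces <- 1:6
probs <- rep(1 / 6, 6)              # each face is equally likely

expected_value <- sum(faces * probs)
variance <- sum((faces - expected_value)^2 * probs)

print(c(expected_value = expected_value, variance = variance))
```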
The simulation will show how the average of dice rolls converges to this expected value as the number of rolls increases.
# Set seed for reproducibility
set.seed(456)
# Simulate rolling a fair die n times
n <- 10000
dice_rolls <- sample(1:6, n, replace = TRUE)
# Running average after each roll
cumulative_avg <- cumsum(dice_rolls) / seq_along(dice_rolls)
# Create a data frame for plotting
data <- data.frame(
  Roll = 1:n,
  Cumulative_Avg = cumulative_avg
)
We then summarise the data.
skimr::skim_without_charts(data)
Name | data |
Number of rows | 10000 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
Roll | 0 | 1 | 5000.50 | 2886.90 | 1.00 | 2500.75 | 5000.50 | 7500.25 | 10000 |
Cumulative_Avg | 0 | 1 | 3.48 | 0.05 | 3.29 | 3.48 | 3.49 | 3.49 | 5 |
Next, we visualize the simulation of the LLN.
# Plot the cumulative average against the number of rolls
ggplot(data, aes(
  x = Roll,
  y = Cumulative_Avg
)) +
  geom_line(color = "blue") +
  # Dashed reference line at the expected value of a fair die
  geom_hline(
    yintercept = 3.5,
    color = "red",
    linetype = "dashed"
  ) +
  labs(
    title = "Law of Large Numbers: Convergence of Sample Mean",
    x = "Number of Rolls",
    y = "Cumulative Average"
  ) +
  theme_minimal()
2.3.2 Interpretation
In the plot above, as the number of dice rolls increases, the sample mean (blue line) converges toward the expected value of 3.5 (red dashed line). This demonstrates the Law of Large Numbers.
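To put numbers on the convergence, the running average can be inspected at a few checkpoints. This sketch re-runs the die simulation so that it is self-contained; the checkpoints are arbitrary, and the `Theoretical_SD` column uses the standard deviation of the mean of \(n\) fair die rolls, \(\sqrt{35/12}/\sqrt{n}\), as a yardstick for the typical deviation.

```r
# Re-run the die simulation and inspect the running average at checkpoints
set.seed(456)
n <- 10000
dice_rolls <- sample(1:6, n, replace = TRUE)
cumulative_avg <- cumsum(dice_rolls) / seq_along(dice_rolls)

checkpoints <- c(10, 100, 1000, 10000)   # arbitrary illustrative choices
convergence <- data.frame(
  Rolls = checkpoints,
  Cumulative_Avg = cumulative_avg[checkpoints],
  Abs_Deviation = abs(cumulative_avg[checkpoints] - 3.5),
  # Standard deviation of the mean of n rolls: sqrt(35 / 12) / sqrt(n)
  Theoretical_SD = sqrt(35 / 12) / sqrt(checkpoints)
)
print(convergence)
```

The deviation does not have to shrink at every step, since any single run is noisy, but its typical size falls in proportion to \(1/\sqrt{n}\).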
3 Part 2: The Central Limit Theorem
3.1 Meaning of the Central Limit Theorem
The Central Limit Theorem (CLT) states that the distribution of the sample mean of a sufficiently large number of i.i.d. random variables, regardless of the original distribution, will tend to follow a normal distribution, provided the original population has a finite variance (Black 2023).
3.1.1 Formula
Given a population with mean \(\mu\) and variance \(\sigma^2\), the distribution of the sample mean \(\bar{X}_n\) for large \(n\) is approximately normal:
\[ \bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
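The formula makes a concrete prediction that is easy to check by simulation. The sketch below assumes a Uniform(0, 1) population, for which \(\mu = 0.5\) and \(\sigma^2 = 1/12\); the seed, number of samples, and sample size are illustrative choices.

```r
# Compare the simulated mean and spread of sample means with the CLT prediction
set.seed(2024)                      # illustrative seed
n_samples <- 1000
sample_size <- 50

sample_means <- replicate(
  n_samples,
  mean(runif(sample_size, min = 0, max = 1))
)

mu <- 0.5                           # mean of Uniform(0, 1)
sigma <- sqrt(1 / 12)               # standard deviation of Uniform(0, 1)
theoretical_se <- sigma / sqrt(sample_size)

comparison <- data.frame(
  Quantity = c("Mean", "Standard deviation"),
  Theoretical = c(mu, theoretical_se),
  Simulated = c(mean(sample_means), sd(sample_means))
)
print(comparison)
```

The theoretical standard deviation of the sample mean here is \(\sqrt{1/12}/\sqrt{50} \approx 0.041\), and the simulated value should land close to it.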
3.1.2 Conditions for CLT:
- The samples must be independent.
- The sample size should be large (a common rule of thumb is \(n > 30\), although heavily skewed populations may need more).
- The population should have finite variance.
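The finite-variance condition matters in practice. As a hedged illustration, the sketch below contrasts a Uniform(0, 1) population with a standard Cauchy population, which has neither a finite mean nor a finite variance. The mean of \(n\) i.i.d. standard Cauchy draws is itself standard Cauchy, so its interquartile range stays near 2 no matter how large \(n\) is. The helper `iqr_of_means` and all parameter values are hypothetical.

```r
# Spread of sample means: finite-variance (uniform) vs. heavy-tailed (Cauchy)
set.seed(99)                        # illustrative seed
n_samples <- 2000

# Hypothetical helper: IQR of n_samples sample means of a given size
iqr_of_means <- function(rdist, sample_size) {
  IQR(replicate(n_samples, mean(rdist(sample_size))))
}

sizes <- c(10, 100, 1000)
spread <- data.frame(
  Sample_Size = sizes,
  Uniform_IQR = sapply(sizes, function(k) iqr_of_means(runif, k)),
  Cauchy_IQR = sapply(sizes, function(k) iqr_of_means(rcauchy, k))
)
print(spread)
```

The uniform column should shrink roughly in proportion to \(1/\sqrt{n}\), while the Cauchy column should not shrink at all.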
3.2 Applications of the Central Limit Theorem
CLT is crucial in many statistical methods, including:
- Hypothesis Testing: CLT allows us to use normal approximations for the distribution of sample means.
- Confidence Intervals: For large samples, sample means can be treated as approximately normally distributed.
- Quality Control: CLT is used in process control charts to monitor deviations in manufacturing processes.
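As one example of these applications, a large-sample confidence interval for a mean can be sketched as follows. The data are hypothetical: 200 draws from an exponential distribution with mean 10, chosen because it is clearly non-normal. The interval uses the normal quantile from `qnorm()` and the sample standard deviation in place of the unknown \(\sigma\), which is the usual large-sample approximation.

```r
# Approximate 95% confidence interval for a mean, justified by the CLT
set.seed(42)                                  # illustrative seed
observations <- rexp(200, rate = 1 / 10)      # hypothetical skewed data, true mean 10

n_obs <- length(observations)
x_bar <- mean(observations)
std_error <- sd(observations) / sqrt(n_obs)   # estimated standard error of the mean
z <- qnorm(0.975)                             # about 1.96 for a 95% interval

ci <- c(lower = x_bar - z * std_error, upper = x_bar + z * std_error)
print(ci)
```

Across repeated samples, roughly 95% of intervals built this way would be expected to cover the true mean; any single interval may or may not.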
3.3 Simulating the Central Limit Theorem in R
3.3.1 Example: Sampling from a Uniform Distribution
Let’s take 1,000 samples of size 50 from a uniform distribution (which is not normal) and demonstrate how the sample means approximate a normal distribution as per the CLT.
# Simulate sampling from a uniform distribution
n_samples <- 1000
sample_size <- 50
# Draw n_samples samples and keep the mean of each one
uniform_samples <- replicate(
  n_samples,
  mean(runif(sample_size, min = 0, max = 1))
)
# Create a data frame for plotting
clt_data <- data.frame(Sample_Mean = uniform_samples)
We then summarize the data.
skimr::skim_without_charts(clt_data)
Name | clt_data |
Number of rows | 1000 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
Sample_Mean | 0 | 1 | 0.5 | 0.04 | 0.38 | 0.47 | 0.5 | 0.53 | 0.63 |
Next, we plot the data demonstrating the CLT.
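# Plot a density-scaled histogram of the sample means
ggplot(clt_data, aes(x = Sample_Mean)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)),
    bins = 30,
    fill = "lightblue",
    color = "black"
  ) +
  # Overlay a normal curve with the same mean and standard deviation;
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  stat_function(
    fun = dnorm, args = list(
      mean = mean(uniform_samples),
      sd = sd(uniform_samples)
    ), color = "red", linewidth = 1
  ) +
  labs(
    title = "Central Limit Theorem: Distribution of Sample Means",
    x = "Sample Means",
    y = "Density"
  ) +
  theme_minimal()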
3.3.2 Interpretation
In the histogram above, the distribution of the sample means (blue bars) approximates a normal distribution (red curve), even though the original data came from a uniform distribution. This illustrates the Central Limit Theorem in action.
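One way to see how quickly the approximation improves is to track a shape statistic as the sample size grows. The sketch below uses excess kurtosis, which is 0 for a normal distribution and \(-1.2\) for a Uniform(0, 1) distribution; for the mean of \(k\) i.i.d. draws it scales as \(-1.2/k\). The helper `excess_kurtosis`, the sample sizes, and the number of replications are illustrative assumptions, not part of the simulation above.

```r
# Excess kurtosis of sample means from Uniform(0, 1) at several sample sizes
set.seed(7)                         # illustrative seed
n_samples <- 20000                  # more replications for a stable estimate

# Hypothetical helper: moment-based excess kurtosis (0 for a normal distribution)
excess_kurtosis <- function(x) {
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2 - 3
}

sample_sizes <- c(1, 2, 5, 50)
simulated <- sapply(sample_sizes, function(k) {
  means <- replicate(n_samples, mean(runif(k)))
  excess_kurtosis(means)
})

kurtosis_check <- data.frame(
  Sample_Size = sample_sizes,
  Simulated = simulated,
  Theoretical = -1.2 / sample_sizes
)
print(kurtosis_check)
```

The simulated values should move toward 0 as the sample size increases, which is the numerical counterpart of the histogram approaching the red normal curve.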
4 Conclusion
The Law of Large Numbers and the Central Limit Theorem are foundational to probability and statistics. The LLN ensures that, as the sample size increases, the sample mean converges to the population mean, while the CLT shows that the distribution of sample means approaches normality as the sample size grows. These principles underpin much of statistical inference and are essential for understanding how sample data behave (Grolemund and Wickham 2023).
5 R Packages Used
We used R version 4.4.1 (R Core Team 2024) and the following R packages: Amelia v. 1.8.2 (Honaker, King, and Blackwell 2011), artyfarty v. 0.0.1 (Smeets 2024), car v. 3.1.3 (Fox and Weisberg 2019), caTools v. 1.18.3 (Tuszynski 2024), corrplot v. 0.94 (Wei and Simko 2024), DHARMa v. 0.4.6 (Hartig 2022), doParallel v. 1.0.17 (Corporation and Weston 2022), easystats v. 0.7.3 (Lüdecke et al. 2022), factoextra v. 1.0.7 (Kassambara and Mundt 2020), ggcorrplot v. 0.1.4.1 (Kassambara 2023), ggthemes v. 5.1.0 (Arnold 2024), gt v. 0.11.1 (Iannone et al. 2024), here v. 1.0.1 (Müller 2020), janitor v. 2.2.0 (Firke 2023), kableExtra v. 1.4.0.4 (Zhu 2024), MASS v. 7.3.61 (Venables and Ripley 2002a), mice v. 3.16.0 (van Buuren and Groothuis-Oudshoorn 2011), modelsummary v. 2.2.0 (Arel-Bundock 2022), NbClust v. 3.0.1 (Charrad et al. 2014), nnet v. 7.3.19 (Venables and Ripley 2002b), pacman v. 0.5.1 (Rinker and Kurkiewicz 2018), pandoc v. 0.2.0 (Dervieux 2023), patchwork v. 1.3.0 (Pedersen 2024), performance v. 0.12.3 (Lüdecke et al. 2021), randomForest v. 4.7.1.2 (Liaw and Wiener 2002), rmarkdown v. 2.28 (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020; Allaire et al. 2024), rpart v. 4.1.23 (Therneau and Atkinson 2023), rpart.plot v. 3.1.2 (Milborrow 2024), rsconnect v. 1.3.1 (Atkins et al. 2024), sandwich v. 3.1.1 (Zeileis 2004, 2006; Zeileis, Köll, and Graham 2020), sf v. 1.0.17 (Pebesma 2018; Pebesma and Bivand 2023), sjPlot v. 2.8.16 (Lüdecke 2024), skimr v. 2.1.5 (Waring et al. 2022), stargazer v. 5.2.3 (Hlavac 2022), styler v. 1.10.3 (Müller and Walthert 2024), summarytools v. 1.0.1 (Comtois 2022), tictoc v. 1.2.1 (Izrailev 2024), tidyverse v. 2.0.0 (Wickham et al. 2019), varImp v. 0.4 (Probst 2020), webshot2 v. 0.1.1 (Chang 2023).