The Law of Large Numbers and Central Limit Theorem with Simulation in R

A Short Guide to the Law of Large Numbers and Central Limit Theorem

Author
Affiliations

John Karuitha, PhD

Karatina University, School of Business and Economics

University of the Witwatersrand, School of Construction Economics & Management

Published

October 7, 2024

Modified

October 7, 2024

Executive Summary

This article provides a brief exploration of two fundamental statistical theorems: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). It explains the meaning, applications, and significance of both theorems in various fields, such as insurance, polling, and hypothesis testing. Practical simulations are demonstrated using R, including rolling a die to illustrate the LLN and sampling from a uniform distribution to showcase the CLT. Visualizations are generated using ggplot2 to highlight how sample means converge to expected values and approximate normal distributions. These concepts are essential for understanding statistical inference and the behavior of sample data.

1 Introduction

The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are two of the most fundamental theorems in probability and statistics. This document provides a short guide to understanding these concepts, their practical applications, and how to simulate them in R using ggplot2 for visualizations.

2 Part 1: The Law of Large Numbers

2.1 Meaning of the Law of Large Numbers

The Law of Large Numbers (LLN) states that as the number of independent and identically distributed (i.i.d.) trials or observations increases, the sample average converges to the expected value of the population. In other words, the more data you collect, the closer the sample mean will be to the population mean (Black 2023).

2.1.1 Types of LLN

  • Weak Law of Large Numbers (WLLN): Convergence in probability.
  • Strong Law of Large Numbers (SLLN): Almost sure convergence.

2.1.2 Formula

If \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables with a finite expected value \(E(X)\), the sample average is given by:

\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \]

As \(n\) increases, \(\bar{X}_n\) approaches \(E(X)\).
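
As a minimal sketch of this convergence (using a fair coin, so \(E(X) = 0.5\); the exact numbers depend on the seed), we can compute the sample mean at a few increasing sample sizes:

# Minimal sketch: sample means of a fair coin drift toward E(X) = 0.5
set.seed(123)
sapply(c(10, 100, 1000, 100000), function(n) {
  mean(rbinom(n, size = 1, prob = 0.5)) # proportion of heads in n flips
})

The means wander for small \(n\) but settle near 0.5 as \(n\) grows.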

2.2 Applications of the Law of Large Numbers

The LLN is foundational to many fields, including:

  • Insurance: Insurers rely on the LLN to predict average losses and set premiums (a short simulation sketch follows this list).
  • Gambling: Casinos use LLN to ensure that, over time, they make a predictable profit.
  • Polling: As more people are polled, the average response should approximate the population mean.
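
As a rough sketch of the insurance application (with purely hypothetical claim amounts drawn from a Gamma distribution whose mean is 1,000), the average claim per policy stabilizes as the portfolio grows:

# Hypothetical claim amounts: a Gamma(shape = 2, scale = 500) draw has mean 1000
set.seed(789)
claims <- rgamma(100000, shape = 2, scale = 500)

# Average claim across portfolios of increasing size
sapply(c(100, 1000, 100000), function(n_policies) mean(claims[1:n_policies]))

A small portfolio can sit far from the long-run average claim, but a large one sits close to it, which is why insurers can price premiums with confidence.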

2.3 Simulating the Law of Large Numbers in R

2.3.1 Example: Rolling a Fair Die

We will simulate rolling a fair six-sided die. The expected value is:

\[ E(X) = \frac{1+2+3+4+5+6}{6} = 3.5 \]

The simulation will show how the average of dice rolls converges to this expected value as the number of rolls increases.

# Load package manager ----
if (!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}

# Load required packages ----
p_load(
  tidyverse,
  janitor,
  skimr,
  grateful,
  patchwork
)

p_load_gh("datarootsio/artyfarty")

theme_set(theme_scientific())
# Set seed for reproducibility
set.seed(456)

# Simulate rolling a fair die
n <- 10000
dice_rolls <- sample(1:6, n, replace = TRUE)
cumulative_avg <- cumsum(dice_rolls) / seq_along(dice_rolls)

# Create a data frame for plotting
data <- data.frame(
  Roll = 1:n,
  Cumulative_Avg = cumulative_avg
)

We then summarize the data.

skimr::skim_without_charts(data)
Data summary
Name data
Number of rows 10000
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable    n_missing  complete_rate     mean       sd     p0      p25      p50      p75   p100
Roll                     0              1  5000.50  2886.90   1.00  2500.75  5000.50  7500.25  10000
Cumulative_Avg           0              1     3.48     0.05   3.29     3.48     3.49     3.49      5

Next, we visualize the simulation of the LLN.

# Plot using ggplot2
ggplot(data, aes(
  x = Roll,
  y = Cumulative_Avg
)) +
  geom_line(color = "blue") +
  geom_hline(
    yintercept = 3.5,
    color = "red",
    linetype = "dashed"
  ) +
  labs(
    title = "Law of Large Numbers: Convergence of Sample Mean",
    x = "Number of Rolls",
    y = "Cumulative Average"
  ) +
  theme_minimal()

2.3.2 Interpretation

In the plot above, as the number of dice rolls increases, the sample mean (blue line) converges toward the expected value of 3.5 (red dashed line). This demonstrates the Law of Large Numbers.
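
To complement the plot, we can check the convergence numerically by inspecting the running mean at a few checkpoints (a small addition that reuses the cumulative_avg vector computed above; the exact values depend on the seed):

# Distance of the running mean from the expected value at selected sample sizes
checkpoints <- c(10, 100, 1000, 10000)
data.frame(
  rolls = checkpoints,
  running_mean = round(cumulative_avg[checkpoints], 3),
  abs_error = round(abs(cumulative_avg[checkpoints] - 3.5), 3)
)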

3 Part 2: The Central Limit Theorem

3.1 Meaning of the Central Limit Theorem

The Central Limit Theorem (CLT) states that the distribution of the sample mean of a sufficiently large number of i.i.d. random variables, regardless of the original distribution, will tend to follow a normal distribution, provided the original population has a finite variance (Black 2023).

3.1.1 Formula

Given a population with mean \(\mu\) and variance \(\sigma^2\), the distribution of the sample mean \(\bar{X}_n\) for large \(n\) is approximately normal:

\[ \bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
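
Equivalently, the standardized sample mean is approximately standard normal for large \(n\):

\[ Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) \]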

3.1.2 Conditions for the CLT

  • The samples must be independent.
  • The sample size should be large (typically \(n > 30\)).
  • The population should have finite variance (a short sketch illustrating this condition appears after this list).
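
To see why the finite-variance condition matters, the sketch below repeats the sample-mean experiment with a Cauchy distribution, which has no finite mean or variance; unlike the uniform case simulated in the next section, the resulting means remain heavy-tailed rather than approximately normal:

# The CLT breaks down without finite variance: Cauchy sample means stay heavy-tailed
set.seed(42)
cauchy_means <- replicate(1000, mean(rcauchy(50)))
summary(cauchy_means) # extreme values persist no matter how many means we compute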

3.2 Applications of the Central Limit Theorem

The CLT is crucial in many statistical methods, including:

  • Hypothesis Testing: CLT allows us to use normal approximations for the distribution of sample means.
  • Confidence Intervals: Sample means can be treated as approximately normally distributed for large samples (see the sketch after this list).
  • Quality Control: CLT is used in process control charts to monitor deviations in manufacturing processes.
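
As a quick sketch of the confidence-interval application (not part of the original analysis; the data here are simulated), the CLT justifies the familiar normal-approximation interval \(\bar{x} \pm 1.96 \, s/\sqrt{n}\) even when the underlying data are skewed:

# Sketch: 95% normal-approximation confidence interval for a mean
set.seed(2024)
x <- rexp(100, rate = 1) # a skewed sample; n = 100 is large enough for the CLT
mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(length(x)) # interval around the sample mean

In repeated sampling, an interval constructed this way covers the true mean (here 1) roughly 95% of the time.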

3.3 Simulating the Central Limit Theorem in R

3.3.1 Example: Sampling from a Uniform Distribution

Let’s take 1,000 samples of size 50 from a uniform distribution (which is not normal) and demonstrate how the sample means approximate a normal distribution as per the CLT.

# Simulate sampling from a uniform distribution
n_samples <- 1000
sample_size <- 50
uniform_samples <- replicate(
  n_samples,
  mean(runif(sample_size, min = 0, max = 1))
)

# Create a data frame for plotting
clt_data <- data.frame(Sample_Mean = uniform_samples)

We then summarize the data.

skimr::skim_without_charts(clt_data)
Data summary
Name clt_data
Number of rows 1000
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd    p0   p25   p50   p75  p100
Sample_Mean            0              1   0.5  0.04  0.38  0.47   0.5  0.53  0.63

Next, we plot the data demonstrating the CLT.

# Plot using ggplot2
ggplot(clt_data, aes(x = Sample_Mean)) +
  geom_histogram(aes(y = after_stat(density)),
    bins = 30,
    fill = "lightblue",
    color = "black"
  ) +
  stat_function(
    fun = dnorm, args = list(
      mean = mean(uniform_samples),
      sd = sd(uniform_samples)
    ),
    color = "red", size = 1
  ) +
  labs(
    title = "Central Limit Theorem: Distribution of Sample Means",
    x = "Sample Means",
    y = "Density"
  ) +
  theme_minimal()

3.3.2 Interpretation

In the histogram above, the distribution of the sample means (blue bars) approximates a normal distribution (red curve), even though the original data came from a uniform distribution. This illustrates the Central Limit Theorem in action.
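
A useful cross-check is to compare the spread of the simulated sample means with the value the CLT predicts. For a Uniform(0, 1) population, \(\sigma^2 = 1/12\), so the standard deviation of the sample mean at \(n = 50\) should be \(\sqrt{1/(12 \times 50)} \approx 0.041\), which is close to the simulated spread reported above:

# CLT prediction vs simulation: sd of the sample mean for Uniform(0, 1), n = 50
theoretical_sd <- sqrt((1 / 12) / 50)
c(theoretical = theoretical_sd, simulated = sd(uniform_samples))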

4 Conclusion

The Law of Large Numbers and the Central Limit Theorem are foundational to probability and statistics. The LLN ensures that, as the sample size increases, the sample mean converges to the population mean, while the CLT shows that the distribution of sample means approaches normality as the sample size grows. These principles underpin much of statistical inference and are essential for understanding how sample data behave (Grolemund and Wickham 2023).

5 R Packages Used

We used R version 4.4.1 (R Core Team 2024) and the following R packages: Amelia v. 1.8.2 (Honaker, King, and Blackwell 2011), artyfarty v. 0.0.1 (Smeets 2024), car v. 3.1.3 (Fox and Weisberg 2019), caTools v. 1.18.3 (Tuszynski 2024), corrplot v. 0.94 (Wei and Simko 2024), DHARMa v. 0.4.6 (Hartig 2022), doParallel v. 1.0.17 (Corporation and Weston 2022), easystats v. 0.7.3 (Lüdecke et al. 2022), factoextra v. 1.0.7 (Kassambara and Mundt 2020), ggcorrplot v. 0.1.4.1 (Kassambara 2023), ggthemes v. 5.1.0 (Arnold 2024), gt v. 0.11.1 (Iannone et al. 2024), here v. 1.0.1 (Müller 2020), janitor v. 2.2.0 (Firke 2023), kableExtra v. 1.4.0.4 (Zhu 2024), MASS v. 7.3.61 (Venables and Ripley 2002a), mice v. 3.16.0 (van Buuren and Groothuis-Oudshoorn 2011), modelsummary v. 2.2.0 (Arel-Bundock 2022), NbClust v. 3.0.1 (Charrad et al. 2014), nnet v. 7.3.19 (Venables and Ripley 2002b), pacman v. 0.5.1 (Rinker and Kurkiewicz 2018), pandoc v. 0.2.0 (Dervieux 2023), patchwork v. 1.3.0 (Pedersen 2024), performance v. 0.12.3 (Lüdecke et al. 2021), randomForest v. 4.7.1.2 (Liaw and Wiener 2002), rmarkdown v. 2.28 (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020; Allaire et al. 2024), rpart v. 4.1.23 (Therneau and Atkinson 2023), rpart.plot v. 3.1.2 (Milborrow 2024), rsconnect v. 1.3.1 (Atkins et al. 2024), sandwich v. 3.1.1 (Zeileis 2004, 2006; Zeileis, Köll, and Graham 2020), sf v. 1.0.17 (Pebesma 2018; Pebesma and Bivand 2023), sjPlot v. 2.8.16 (Lüdecke 2024), skimr v. 2.1.5 (Waring et al. 2022), stargazer v. 5.2.3 (Hlavac 2022), styler v. 1.10.3 (Müller and Walthert 2024), summarytools v. 1.0.1 (Comtois 2022), tictoc v. 1.2.1 (Izrailev 2024), tidyverse v. 2.0.0 (Wickham et al. 2019), varImp v. 0.4 (Probst 2020), webshot2 v. 0.1.1 (Chang 2023).

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Arel-Bundock, Vincent. 2022. “modelsummary: Data and Model Summaries in R.” Journal of Statistical Software 103 (1): 1–23. https://doi.org/10.18637/jss.v103.i01.
Arnold, Jeffrey B. 2024. ggthemes: Extra Themes, Scales and Geoms for ggplot2. https://CRAN.R-project.org/package=ggthemes.
Atkins, Aron, Toph Allen, Hadley Wickham, Jonathan McPherson, and JJ Allaire. 2024. rsconnect: Deploy Docs, Apps, and APIs to “Posit Connect”, “shinyapps.io”, and “RPubs”. https://CRAN.R-project.org/package=rsconnect.
Black, Ken. 2023. Business Statistics: For Contemporary Decision Making. John Wiley & Sons.
Chang, Winston. 2023. Webshot2: Take Screenshots of Web Pages. https://CRAN.R-project.org/package=webshot2.
Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” Journal of Statistical Software 61 (6): 1–36. https://www.jstatsoft.org/v61/i06/.
Comtois, Dominic. 2022. summarytools: Tools to Quickly and Neatly Summarize Data. https://CRAN.R-project.org/package=summarytools.
Corporation, Microsoft, and Steve Weston. 2022. doParallel: Foreach Parallel Adaptor for the parallel Package. https://CRAN.R-project.org/package=doParallel.
Dervieux, Christophe. 2023. pandoc: Manage and Run Universal Converter Pandoc from R. https://CRAN.R-project.org/package=pandoc.
Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third. Thousand Oaks CA: Sage. https://www.john-fox.ca/Companion/.
Grolemund, Garrett, and Hadley Wickham. 2023. R for Data Science, 2nd Edition. O’Reilly Media.
Hartig, Florian. 2022. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. https://CRAN.R-project.org/package=DHARMa.
Hlavac, Marek. 2022. stargazer: Well-Formatted Regression and Summary Statistics Tables. Bratislava, Slovakia: Social Policy Institute. https://CRAN.R-project.org/package=stargazer.
Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7): 1–47. https://doi.org/10.18637/jss.v045.i07.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, JooYoung Seo, Ken Brevoort, and Olivier Roy. 2024. gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
Izrailev, Sergei. 2024. tictoc: Functions for Timing R Scripts, as Well as Implementations of Stack and StackList Structures. https://CRAN.R-project.org/package=tictoc.
Kassambara, Alboukadel. 2023. ggcorrplot: Visualization of a Correlation Matrix Using ggplot2. https://CRAN.R-project.org/package=ggcorrplot.
Kassambara, Alboukadel, and Fabian Mundt. 2020. factoextra: Extract and Visualize the Results of Multivariate Data Analyses. https://CRAN.R-project.org/package=factoextra.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.
Lüdecke, Daniel. 2024. sjPlot: Data Visualization for Statistics in Social Science. https://CRAN.R-project.org/package=sjPlot.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60): 3139. https://doi.org/10.21105/joss.03139.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Brenton M. Wiernik, Etienne Bacher, Rémi Thériault, and Dominique Makowski. 2022. “easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting.” CRAN. https://doi.org/10.32614/CRAN.package.easystats.
Milborrow, Stephen. 2024. rpart.plot: Plot rpart Models: An Enhanced Version of plot.rpart. https://CRAN.R-project.org/package=rpart.plot.
Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
Müller, Kirill, and Lorenz Walthert. 2024. styler: Non-Invasive Pretty Printing of R Code. https://CRAN.R-project.org/package=styler.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
Pedersen, Thomas Lin. 2024. patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
Probst, Philipp. 2020. varImp: RF Variable Importance for Arbitrary Measures. https://CRAN.R-project.org/package=varImp.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rinker, Tyler W., and Dason Kurkiewicz. 2018. pacman: Package Management for R. Buffalo, New York. http://github.com/trinker/pacman.
Smeets, Bart. 2024. artyfarty: Themes for Ggplot2. https://github.com/datarootsio/artyfarty.
Therneau, Terry, and Beth Atkinson. 2023. rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.
Tuszynski, Jarek. 2024. caTools: Tools: Moving Window Statistics, GIF, Base64, ROC AUC, Etc. https://CRAN.R-project.org/package=caTools.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Venables, W. N., and B. D. Ripley. 2002a. Modern Applied Statistics with S. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.
———. 2002b. Modern Applied Statistics with S. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.
Wei, Taiyun, and Viliam Simko. 2024. R Package corrplot: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Zeileis, Achim. 2004. “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software 11 (10): 1–17. https://doi.org/10.18637/jss.v011.i10.
———. 2006. “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software 16 (9): 1–16. https://doi.org/10.18637/jss.v016.i09.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with kable and Pipe Syntax. https://github.com/haozhu233/kableExtra.