2025-02-09

Motivation

Suppose we wish to characterize an arbitrary or unknown population distribution, like the one shown below, by estimating its mean.

\[ \Tiny p\left(x\right)=\left(\frac{2\sqrt{2}}{2\sqrt{2}+3\arctan\left(\frac{1}{\sqrt{2}}\right)}\right)\left(\frac{3}{1+3\cosh\left(4x\right)}+\frac{5}{1+\cosh\left(10x-7\right)}\right) \]

Population distribution

Let’s verify that the proposed population distribution is a valid probability density function, and calculate its mean and standard deviation for later reference using the well-known formulae. In R, we can use the built-in integrate function for this:

\[ \Tiny \int_{-\infty}^{\infty}p\left(x\right)dx = \]

integrate(function(x) pdist(x), -Inf, Inf)$"value"
## [1] 1

The total area under the curve is 1, so this is a valid PDF.
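
The call above relies on an R function pdist that evaluates p(x); its definition is not shown here. A minimal sketch, transcribed directly from the formula for p(x), might look like the following. The helper name norm_const is an assumption introduced for readability.

```r
# Normalizing constant from the formula for p(x).
# norm_const is a hypothetical name, not taken from the original slides.
norm_const <- 2 * sqrt(2) / (2 * sqrt(2) + 3 * atan(1 / sqrt(2)))

# Population density p(x), vectorized over x so that integrate() can call it.
pdist <- function(x) {
  norm_const * (3 / (1 + 3 * cosh(4 * x)) + 5 / (1 + cosh(10 * x - 7)))
}

# Numerical check that the density integrates to 1.
integrate(pdist, -Inf, Inf)$value
```

For large |x|, cosh overflows to Inf and both fractions evaluate to 0 rather than NaN, which is why integrating over an infinite range is unproblematic here.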

Population distribution (cont.)

\[ \Tiny m=\int_{-\infty}^{\infty}x\cdot p\left(x\right)dx = \]

m <- integrate(function(x) x * pdist(x), -Inf, Inf)$value  # store the mean; it is reused below
m
## [1] 0.4235199

\[ \Tiny s=\sqrt{\int_{-\infty}^{\infty}\left(x-m\right)^{2}p\left(x\right)dx} = \]

s <- sqrt(integrate(function(x) (x - m)^2 * pdist(x), -Inf, Inf)$value)  # store the standard deviation
s
## [1] 0.4535899
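
As a cross-check on the numerical value of m, the mean can also be worked out by hand, because p(x) is a mixture of two symmetric bumps. The term \(3/(1+3\cosh(4x))\) is even, so it contributes nothing to the first moment; its area is \(\frac{3\sqrt{2}}{4}\arctan\left(\frac{1}{\sqrt{2}}\right)\approx 0.6528\). The term \(5/(1+\cosh(10x-7))\) is a logistic density with location 0.7 and scale 0.1, so its area is 1 and it is symmetric about 0.7. Weighting the two centers by these areas gives (values rounded to four digits):

\[ m=\frac{0\cdot 0.6528+0.7\cdot 1}{0.6528+1}=\frac{0.7}{1.6528}\approx 0.4235 \]

This agrees with the value returned by integrate.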

Sampling

In practice we cannot observe the whole population, so the best we can do is take a sample of size n and analyze it. Let’s try with n = 30, shown below as a scaled histogram superimposed on the population density.

##  [1]  0.6394524 -0.8777029  0.2200319  0.1793867  0.7908655  0.1389407
##  [7]  0.7286388  0.8810286  0.9537313 -0.3292273  0.7485900  0.2006211
## [13] -0.2382185  0.9658239  0.4832752  0.5315641  1.0193566  0.6382180
## [19]  0.9884277  0.7533490  0.7222608  1.2899833  0.5799573  0.5657416
## [25]  0.6808264  0.8046462  0.5584595  0.8138911  0.5861398  0.5989587
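
The slides do not show how samp was drawn. One possible approach, sketched here under that caveat, uses the fact that p(x) is a two-component mixture: the second term is exactly a logistic density (location 0.7, scale 0.1), available through rlogis, and the first term has a closed-form CDF that can be inverted. The function name rpdist and the seed are assumptions, so the values it produces will not match the sample listed above.

```r
# Hypothetical sampler for p(x); rpdist is not a name from the original slides.
rpdist <- function(k) {
  a  <- atan(1 / sqrt(2))
  w1 <- 3 * sqrt(2) / 4 * a   # area under 3 / (1 + 3 * cosh(4 * x))
  w2 <- 1                     # area under 5 / (1 + cosh(10 * x - 7))

  # Choose a mixture component for each draw, in proportion to its area.
  from_first <- runif(k) < w1 / (w1 + w2)

  # First component: inverse-transform sampling. Its CDF is
  #   G(x) = 1/2 + atan(tanh(2 * x) / sqrt(2)) / (2 * a),
  # which inverts to the expression below.
  u <- runif(k)
  x_first <- atanh(sqrt(2) * tan((2 * u - 1) * a)) / 2

  # Second component: logistic distribution centered at 0.7.
  x_second <- rlogis(k, location = 0.7, scale = 0.1)

  ifelse(from_first, x_first, x_second)
}

set.seed(1)          # arbitrary seed, chosen only for reproducibility
samp <- rpdist(30)   # one sample of size n = 30
samp
```

Rejection sampling against a wide uniform proposal would also work and is simpler to derive, at the cost of truncating the tails.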

Sample mean and standard deviation

Then let’s calculate the sample mean and standard deviation, and their differences from the population values. In R, this is easy:

mean(samp)
## [1] 0.5539006
mean(samp) - m
## [1] 0.1303807
sd(samp)
## [1] 0.4453801
sd(samp) - s
## [1] -0.008209762

Central Limit Theorem

Intuitively, the sample mean should be “close” to the population mean, and we should be more confident about this as the sample size n increases. The Central Limit Theorem (CLT) validates this intuition and makes it precise. It states that as n goes to infinity, for any population distribution with finite variance, the sampling distribution of the sample mean approaches a normal distribution with:

  • A mean equal to the population mean
  • A standard deviation equal to the population standard deviation divided by \(\sqrt{n}\)

In particular, this justifies constructing approximate Z and T confidence intervals for the mean when n is sufficiently large, whatever the shape of the population distribution (provided its variance is finite), and even if we don’t know that distribution at all.
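
To make these statements concrete, the sketch below applies them to the n = 30 sample listed earlier. It computes the standard deviation that the CLT assigns to the sample mean, and a 95% T interval built from the sample alone. The values of m and s are copied from the integrate results earlier; the names se_clt and ci are assumptions.

```r
# The n = 30 sample values listed earlier.
samp <- c( 0.6394524, -0.8777029,  0.2200319,  0.1793867,  0.7908655,  0.1389407,
           0.7286388,  0.8810286,  0.9537313, -0.3292273,  0.7485900,  0.2006211,
          -0.2382185,  0.9658239,  0.4832752,  0.5315641,  1.0193566,  0.6382180,
           0.9884277,  0.7533490,  0.7222608,  1.2899833,  0.5799573,  0.5657416,
           0.6808264,  0.8046462,  0.5584595,  0.8138911,  0.5861398,  0.5989587)
n <- length(samp)
m <- 0.4235199   # population mean, from the integrate() result
s <- 0.4535899   # population standard deviation, from the integrate() result

# Standard deviation of the sample mean according to the CLT.
se_clt <- s / sqrt(n)
se_clt

# Observed error of the sample mean, in units of that standard deviation.
(mean(samp) - m) / se_clt

# 95% T interval using only the sample (population values not needed).
ci <- mean(samp) + c(-1, 1) * qt(0.975, df = n - 1) * sd(samp) / sqrt(n)
ci
```

For this sample the CLT standard deviation is about 0.083, so the gap of 0.13 between the sample mean and m is roughly 1.6 standard deviations, and the 95% interval still covers m. The same interval is returned by t.test(samp)$conf.int.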

Simulating more samples

Now let’s take N = 1000 more samples of size n = 30, and plot another histogram of their means. The normal distribution predicted by the CLT, with mean m and standard deviation \(s/\sqrt{n}\), is shown in red.

As expected, the histogram coincides very closely with the theoretical curve.
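
The simulation code is not shown in the slides. A sketch of how such an experiment could be run is given below. It assumes a hypothetical sampler rpdist that draws from p(x) as a two-component mixture (the second term is a logistic density; the first is sampled by inverting its CDF). The seed and plotting options are likewise assumptions, so the resulting figure will differ in detail from the original.

```r
# Hypothetical sampler for p(x), treated as a mixture of two components.
rpdist <- function(k) {
  a  <- atan(1 / sqrt(2))
  w1 <- 3 * sqrt(2) / 4 * a                 # area of the component centered at 0
  from_first <- runif(k) < w1 / (w1 + 1)    # the logistic component has area 1
  x_first  <- atanh(sqrt(2) * tan((2 * runif(k) - 1) * a)) / 2
  x_second <- rlogis(k, location = 0.7, scale = 0.1)
  ifelse(from_first, x_first, x_second)
}

set.seed(2025)   # arbitrary seed, chosen only for reproducibility
N <- 1000        # number of samples
n <- 30          # size of each sample
m <- 0.4235199   # population mean, from the integrate() result
s <- 0.4535899   # population standard deviation, from the integrate() result

# Mean of each of the N samples.
means <- replicate(N, mean(rpdist(n)))

# Compare the simulated moments with the CLT prediction.
c(mean_of_means = mean(means), sd_of_means = sd(means), clt_sd = s / sqrt(n))

# Density-scaled histogram with the CLT normal curve overlaid in red.
hist(means, breaks = 30, freq = FALSE,
     main = "Means of 1000 samples of size 30", xlab = "Sample mean")
curve(dnorm(x, mean = m, sd = s / sqrt(n)), add = TRUE, col = "red")
```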