Cumulative Distribution Functions
In this section we tackle the problem of estimating the behaviour of a random variable when we have no (or very little) information about the population. Recall that a distribution is uniquely defined by its cumulative distribution function (cdf). We define the cdf as: \[F(x) = P(X \leq x)\] where \(X\) is the random variable of interest and \(x\) is a real number. In statistics, our most common working assumption is that the data (or sample) represents the population well. As such, the probabilistic behaviour of a random variable in the sample should be similar to its behaviour in the population.
Example: The sample mean should be close to the population mean.
Empirical Cumulative Distribution Function (ecdf)
This is analogous to the cdf of the population, only for the sample. We define it as: \[F_n(x) = \frac{1}{n}\#\{i : X_i \leq x\}\] that is, the proportion of observations less than or equal to \(x\). The ecdf has the following three properties:
Let the ordered data be \(X_1 \leq \dots \leq X_n\); then for \(x < X_1\) we have that \(F_n(x) = 0\).
For \(X_k \leq x < X_{k+1}\) we have that \(F_n(x) = k/n\).
If there are \(r\) observations with the same value \(x\) then \(F_n\) has a jump of \(\frac{r}{n}\) at \(x\).
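The three properties above can be checked numerically. The following is a minimal Python sketch, not a reference implementation: the helper name `ecdf` and the data values are made up for illustration. It sorts the sample once and uses the standard-library `bisect_right` to count how many observations are less than or equal to \(x\).

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the ecdf F_n of `sample` as a function (hypothetical helper)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def F_n(x):
        # bisect_right gives the number of ordered values that are <= x
        return bisect_right(ordered, x) / n

    return F_n


# Made-up data: n = 5, and the value 2 appears r = 2 times.
data = [3, 1, 2, 2, 5]
F = ecdf(data)

print(F(0))               # below the smallest observation
print(F(1.5))             # between the 1st and 2nd ordered values
print(F(2) - F(1.999))    # size of the jump at the tied value 2
```

With this data the jump at the tied value 2 should be about \(r/n = 2/5\), up to floating-point rounding.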
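Now we want to show that the ecdf is consistent for the cdf, i.e. that \(F_n(x)\) converges in probability to \(F(x)\). Fix \(x\) and \(\varepsilon > 0\), and let \(Y = \#\{i : X_i \leq x\}\) be the number of observations less than or equal to \(x\), so that \(F_n(x) = Y/n\). For an i.i.d. sample, \(Y \sim \text{Binomial}(n, p)\) with \(p = F(x)\), hence \(E(Y/n) = p\) and \(Var(Y/n) = p(1-p)/n\). Chebyshev’s Inequality then gives: \[\lim_{n \rightarrow \infty} P(|F_n(x) - F(x)| < \varepsilon) = \lim_{n \rightarrow \infty} P\Big(\Big|\frac{Y}{n} - p\Big| < \varepsilon\Big) \geq \lim_{n\rightarrow \infty}\Big(1 - \frac{p(1-p)}{n\varepsilon^2}\Big) = 1\] Since a probability cannot exceed 1, we have that \[\lim_{n \rightarrow \infty} P(|F_n(x) - F(x)| < \varepsilon) = 1\]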
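The convergence argument can be illustrated with a small simulation. This sketch assumes a Uniform(0, 1) population, for which \(F(x) = x\) on \([0, 1]\). The point \(x = 0.3\), the tolerance \(\varepsilon = 0.05\), the sample sizes, the seed and the function names are all arbitrary choices for illustration. For each \(n\), it estimates \(P(|F_n(x) - F(x)| < \varepsilon)\) by repeated sampling and prints it next to the Chebyshev lower bound \(1 - p(1-p)/(n\varepsilon^2)\), clipped at 0.

```python
import random


def ecdf_at(sample, x):
    """Proportion of observations in `sample` that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def coverage(n, x, eps, trials, rng):
    """Monte Carlo estimate of P(|F_n(x) - F(x)| < eps) for Uniform(0, 1)."""
    p = x  # assumption: Uniform(0, 1) population, so F(x) = x for 0 <= x <= 1
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if abs(ecdf_at(sample, x) - p) < eps:
            hits += 1
    return hits / trials


def chebyshev_bound(n, p, eps):
    """Chebyshev lower bound 1 - p(1-p)/(n eps^2), clipped at 0."""
    return max(0.0, 1 - p * (1 - p) / (n * eps ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
x, eps = 0.3, 0.05
results = {
    n: (coverage(n, x, eps, 200, rng), chebyshev_bound(n, x, eps))
    for n in (50, 200, 2000)
}
for n, (estimate, bound) in results.items():
    print(f"n={n:5d}  estimated P={estimate:.3f}  Chebyshev bound={bound:.3f}")
```

The bound is loose for small \(n\) (it can even be negative before clipping), but both the bound and the estimated probability should approach 1 as \(n\) grows.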