We begin the course with Chapter 10.
We now tackle the problem of estimating the behaviour of a random variable when we have no (or very little) information about the population.
Recall that a distribution is uniquely defined by its cumulative distribution function (cdf), defined as: \[F(x) = P(X\leq x)\] where \(X\) is the random variable of interest and \(x\) is some fixed real number. In statistics our most common assumption is that the data (or sample) we collect represents the population. As such, the behaviour, in a probability sense, of a random variable in the sample should be similar to its behaviour in the population.
Example
The sample mean should be close to the population mean. The same can be said of the variance and of a population proportion.
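As a quick numerical illustration (the Exponential(1) population below is an arbitrary choice for the example; its mean is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a sample from an Exponential(1) population, whose mean is 1.
sample = rng.exponential(scale=1.0, size=1000)

print(sample.mean())   # close to the population mean of 1
```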
The empirical cdf (ecdf) is analogous to the cdf of the population, only computed from the sample; it is defined as: \[F_n(x) = \hat{F}(x) = \frac{1}{n}\#\big\{i : X_i \leq x \big\}\]
Let \(X_1, X_2, \ldots, X_n\) be the ordered data; then the ecdf has the following properties (a numerical check follows the list):
For \(x < X_1\) we have that \(F_n(x) = 0\).
For \(X_k \leq x < X_{k+1}\) we have that \(F_n(x) = k/n\).
If there are \(r\) observations with the same value \(x\) then \(F_n\) has a jump of \(r/n\) at \(x\).
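A minimal sketch of these properties, using a small hand-picked sample (the values below are arbitrary) and counting observations directly:

```python
import numpy as np

def ecdf(data, x):
    """Empirical cdf: the fraction of observations less than or equal to x."""
    data = np.asarray(data)
    return np.count_nonzero(data <= x) / data.size

sample = [2.0, 5.0, 5.0, 7.0]   # ordered data with a tie at 5, so n = 4

print(ecdf(sample, 1.0))   # 0.0   -- x below the smallest observation
print(ecdf(sample, 2.0))   # 0.25  -- k/n with k = 1
print(ecdf(sample, 5.0))   # 0.75  -- a jump of r/n = 2/4 at the tied value 5
print(ecdf(sample, 7.0))   # 1.0   -- x at or above the largest observation
```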
Next we want to show that the ecdf is consistent for the cdf, i.e., that it converges in probability. For a fixed \(\varepsilon > 0\), Chebyshev's Inequality gives in the limit: \[\lim_{n \rightarrow \infty} P\Big(|F_n(x) - F(x)| < \varepsilon \Big) = \lim_{n \rightarrow \infty} P\Big(|Y/n - p| < \varepsilon \Big) \geq \lim_{n \rightarrow \infty}\Big(1 - \frac{p(1-p)}{n\varepsilon^2}\Big)\]
To see where the first equality comes from, look closely at how we defined the ecdf: it is the proportion of values in the data that are less than or equal to a fixed value \(x\). With this interpretation, each observation either falls at or below \(x\) or it does not, so the count \(nF_n(x)\) is a binomial random variable; we let \(Y\) denote this count, giving \(F_n(x) = Y/n\). Looking at the cdf and its definition, it is clear that the success probability of each trial is simply the population proportion \(p = F(x)\). Carrying out the limit in the last inequality, \[\lim_{n \rightarrow \infty} P\Big(|Y/n - p| < \varepsilon \Big) \geq \lim_{n \rightarrow \infty}\Big(1 - \frac{p(1-p)}{n\varepsilon^2}\Big) = 1\] and so we have that \[\lim_{n \rightarrow \infty} P\Big(|F_n(x) - F(x)| < \varepsilon \Big) = 1,\] since a probability cannot exceed one.
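A small simulation can illustrate this convergence. The sketch below assumes a Uniform(0,1) population, the point \(x = 0.5\) (so \(F(x) = 0.5\)), and tolerance \(\varepsilon = 0.05\); all three are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x, eps = 0.5, 0.05
F_x = 0.5                       # Uniform(0,1): F(0.5) = 0.5
reps = 500                      # replicated samples per sample size

for n in [10, 100, 1000, 10000]:
    samples = rng.uniform(size=(reps, n))
    Fn_x = (samples <= x).mean(axis=1)         # one ecdf value per replicate
    coverage = np.mean(np.abs(Fn_x - F_x) < eps)
    print(n, coverage)                         # approaches 1 as n grows
```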
This consistency holds uniformly over the real line (this is the Glivenko–Cantelli theorem); namely, we have the following: \[\sup_{-\infty < x < \infty} \Big|F_n(x) - F(x)\Big| \xrightarrow{Pr} 0\]
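Because \(F_n\) only jumps at the observed data points, the supremum can be evaluated exactly at those jump points. A minimal sketch of this uniform convergence, again assuming a Uniform(0,1) population (so \(F(x) = x\) on \([0,1]\)); the population choice is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def sup_distance(sample):
    """sup_x |F_n(x) - F(x)| for a Uniform(0,1) population, where F(x) = x."""
    xs = np.sort(sample)
    n = xs.size
    i = np.arange(1, n + 1)
    # F_n only jumps at the observations, so the supremum is attained
    # just before or exactly at a jump point.
    return max(np.max(i / n - xs), np.max(xs - (i - 1) / n))

for n in [10, 100, 1000, 10000]:
    print(n, sup_distance(rng.uniform(size=n)))   # shrinks toward 0
```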
For mathematical purposes we can further improve on the expression for the ecdf, and write it as follows: \[F_n(x) = \frac{1}{n}\sum_{i = 1}^n\mathbf{1}_{(-\infty, x]}(X_i)\] where the indicator function is defined as: \[\mathbf{1}_{(-\infty, x]}(X_i) = \begin{cases} 1 & X_i \leq x \\ 0 & X_i > x \end{cases}\] Writing it in this way we see that the indicators are not only random variables but independent Bernoulli random variables, with \[P\Big[\mathbf{1}_{(-\infty, x]}(X_i) = 1\Big] = F(x)\] which makes sense since by definition \(P(X \leq x) = F(x)\), and from this it follows that \[P\Big[\mathbf{1}_{(-\infty, x]}(X_i) = 0\Big] = 1 - F(x)\] Extending this even further, \(nF_n(x)\) is a binomial random variable with \(n\) trials and probability of success \(F(x)\), so it follows that \[E(F_n(x)) = F(x)\] \[Var(F_n(x)) = \frac{1}{n}F(x)[1 - F(x)]\] and so the ecdf is not only consistent but also an unbiased estimator of the cdf.
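A short simulation, once more under an assumed Uniform(0,1) population (an arbitrary choice for illustration), can check both moments empirically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, x = 50, 0.3
F_x = 0.3                            # Uniform(0,1): F(x) = x

# Replicate the experiment many times to approximate E(F_n(x)) and Var(F_n(x)).
reps = 100_000
Fn_x = (rng.uniform(size=(reps, n)) <= x).mean(axis=1)

print(Fn_x.mean())                   # ~ F(x) = 0.3, i.e. unbiased
print(Fn_x.var())                    # ~ F(x)(1 - F(x))/n
print(F_x * (1 - F_x) / n)           # theoretical value: 0.0042
```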