Statistics 2 Topic 1

Author

Minerva Mukhopadhyay


Inferring Population Characteristics from Samples

This demonstration discusses how random samples can be used to infer properties of a population.

  • Population: A collection of units or measurements.

(I) Key Takeaways from Tuesday’s Lecture

The following key points were discussed:

  1. A distribution (for example, \(N_{[-2,3]}(0,1)\)) is a valid characterization of a population (recall \(\mathbb{P}_{3}\)). For any \(a<b\), if we know the proportion of population units taking values in \((a,b]\), \(p_{(a,b]}\), then we can answer any query regarding the population.

  2. To make inferences about this population, we need to take representative samples (preferably, iid), say \(X_{1}, \ldots, X_{n}\).

  3. The common distribution of these samples is specified as follows:

    • For any \(a < b\), the probability that a sample \(X_{i}\) takes a value in the interval \((a,b]\) is given by:

      \[ P(X_{i} \in (a,b]) = p_{(a,b]} = \text{Proportion of population lying in } (a,b]. \]

      This proportion, \(p_{(a,b]}\), is a fixed quantity (for any given \(a<b\)); the simulation sketch below illustrates how empirical proportions approximate it.
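For concreteness, here is a minimal simulation sketch in Python (NumPy and SciPy are assumed tools; this is an illustration, not the code used in the lecture). It draws iid samples from \(N_{[-2,3]}(0,1)\) via SciPy's truncated normal distribution and compares the empirical proportion in \((a,b]\) with the fixed population proportion \(p_{(a,b]}\):

```python
import numpy as np
from scipy.stats import truncnorm, norm

rng = np.random.default_rng(42)            # arbitrary seed
dist = truncnorm(-2, 3, loc=0, scale=1)    # standard normal truncated to [-2, 3]

a, b, n = 0.0, 1.0, 10_000
x = dist.rvs(size=n, random_state=rng)     # X_1, ..., X_n iid

f_n = np.mean((a < x) & (x <= b))          # empirical proportion in (a, b]
p = (norm.cdf(b) - norm.cdf(a)) / (norm.cdf(3) - norm.cdf(-2))  # population proportion
print(f"f_n = {f_n:.4f},  p = {p:.4f}")
```

With \(n = 10{,}000\) draws, the two numbers typically agree to roughly two decimal places.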

Recall Problem (iii) - Finding \(E(X^2)\) when \(X \sim N_{[-2, 3]}(0,1)\)

In this example, we draw \(n\) iid samples from the \(N_{[-2, 3]}(0,1)\) distribution for several sample sizes \(n\). We then compare:

  1. The proportion of samples in the interval \((0,1]\), \(f_{n, (0,1]}\), with the population proportion, which is: \[ p_{(0,1]} = \frac{\Phi(1) - \Phi(0)}{\Phi(3) - \Phi(-2)}, \]

    where \(\Phi(x)\) represents the CDF of the standard normal distribution.

  2. The mean of the squared samples, given by \(n^{-1} \sum_{i=1}^{n} X_i^2\), with the expectation:

    \[ E(X^2 \mid X \sim N_{[-2, 3]}(0,1)). \]
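Both population targets can be evaluated numerically; a quick check (again a Python/SciPy sketch, not the lecture's code) is:

```python
from scipy.stats import truncnorm, norm

dist = truncnorm(-2, 3)       # loc=0, scale=1 by default
p = (norm.cdf(1) - norm.cdf(0)) / (norm.cdf(3) - norm.cdf(-2))
EX2 = dist.moment(2)          # E(X^2) of the truncated normal
print(f"p_(0,1] = {p:.4f}")   # approx 0.3498
print(f"E(X^2)  = {EX2:.4f}") # approx 0.8757
```

These targets, \(p_{(0,1]} \approx 0.3498\) and \(E(X^2) \approx 0.8757\), are the true values against which the biases below are computed.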

n = 100

  • The sample proportion is \(f_{100,(0,1]} = 0.34\), with bias (departure from the true value) \(-0.0098\).

  • The sample mean is \(0.9845\), with bias \(0.1088\).

n = 1000

  • The sample proportion is \(f_{1000,(0,1]} = 0.336\), with bias \(-0.0138\).

  • The sample mean is \(0.8504\), with bias \(-0.0253\).

n = 10000

  • The sample proportion is \(f_{10000,(0,1]} = 0.3515\), with bias \(0.0017\).

  • The sample mean is \(0.8822\), with bias \(0.0065\).

n = 100000

  • The sample proportion is \(f_{10^{5},(0,1]} = 0.3501\), with bias \(0.0003\).

  • The sample mean is \(0.8808\), with bias \(0.0051\).
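The whole experiment can be reproduced with a loop like the following sketch (assumed Python/SciPy; the exact outputs depend on the seed and will not match the runs above exactly):

```python
import numpy as np
from scipy.stats import truncnorm, norm

dist = truncnorm(-2, 3)
p = (norm.cdf(1) - norm.cdf(0)) / (norm.cdf(3) - norm.cdf(-2))
EX2 = dist.moment(2)

rng = np.random.default_rng(1)             # arbitrary seed
for n in (100, 1_000, 10_000, 100_000):
    x = dist.rvs(size=n, random_state=rng)
    f_n = np.mean((0 < x) & (x <= 1))      # f_{n,(0,1]}
    m2 = np.mean(x**2)                     # n^{-1} sum of X_i^2
    print(f"n = {n:>6}: f_n = {f_n:.4f} (bias {f_n - p:+.4f}), "
          f"mean of squares = {m2:.4f} (bias {m2 - EX2:+.4f})")
```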

  • Observe that as the sample size \(n\) increases, the histogram of relative frequencies and the empirical CDF of the samples approach the PDF and CDF of the population (i.e., the \(N_{[-2,3]}(0,1)\) distribution, indicated in red in the lecture figures). Consequently, the estimates become more accurate.
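A plotting sketch of this comparison (assuming matplotlib; the lecture's own figures are not reproduced here):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

dist = truncnorm(-2, 3)
x = np.sort(dist.rvs(size=1_000, random_state=np.random.default_rng(7)))
grid = np.linspace(-2, 3, 400)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=30, density=True, alpha=0.5)                  # relative-frequency histogram
ax1.plot(grid, dist.pdf(grid), color="red")                    # population PDF
ax2.step(x, np.arange(1, x.size + 1) / x.size, where="post")   # empirical CDF
ax2.plot(grid, dist.cdf(grid), color="red")                    # population CDF
plt.show()
```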

Further Exploration:

  • Try different values of \(n\), possibly setting different seeds.

  • Experiment with various intervals \((a,b]\), and measure other population properties (e.g., variance, quantiles).

  • Explore other distributions, such as exponential or binomial.
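As one possible exploration, the following sketch (hypothetical choices: an \(\mathtt{Exponential}(1)\) population and seed \(0\)) estimates the variance and the median from iid samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)    # iid Exponential(1) samples

# Population values for Exponential(1): variance = 1, median = log 2.
print("sample variance:", np.var(x, ddof=1))
print("sample median:  ", np.median(x), " vs log 2 =", np.log(2))
```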

(II) Key Takeaways from Thursday’s Lecture

The key concepts discussed include statistical models, parametric inference, parameters, and statistics.

  • Recall that our problem is to measure some characteristic of the population. However, it is practically impossible to measure every unit of the population due to several constraints (e.g., time, cost, or feasibility).

  • Therefore we take samples from the population and infer population characteristics based on these samples.

  • The samples are random quantities and it is usually assumed that they are independent and identically distributed (iid).

  • The procedure of drawing conclusions about the population from sample data, while incorporating the uncertainty in the results, is called statistical inference.

Statistical Inference

  • To draw inferences about the population from the samples, we first need to make some assumptions. Consider the following example:

  • Example 1. Suppose we are interested in estimating the acceleration due to gravity, \(g\). The usual way to estimate \(g\) is by the pendulum experiment, where \(g\approx 4\pi^{2} l/T^{2}\), \(l\) being the length of a simple pendulum and \(T\) the time period of one oscillation. Suppose the length of a simple pendulum is \(75\) cm. Due to variation depending on several factors, such as the skill of the experimenter and measurement error, \(T\) cannot be measured exactly. Instead, \(10\) repeated measurements are taken.

    Using these data, how can you estimate \(g\)?

  • To estimate \(g\) we first need to make assumptions on the population of all possible measurements. This collection of assumptions is called a statistical model.

    For instance, we can assume that the time measured in the \(i\)-th repetition follows \(T_{i} \stackrel{iid}{\sim} \mathtt{Normal} (\tau, \sigma^{2})\), where the unknown (fixed) quantities \(\tau\) and \(\sigma^{2}\) indicate the actual time period of an oscillation and the expected squared departure (variance) of the measurements from \(\tau\), respectively.

    By adopting the above model, we implicitly assume that (i) the measurement errors are symmetric about zero, so the measurements are symmetric about \(\tau\) (i.e., we are equally prone to positive and negative errors), and (ii) large errors are extremely rare.

  • While modeling the population, we do not specify the distribution entirely. We assume a structure for the population distribution (e.g., \(\mathtt{Normal}\)) but do not specify the exact member of the class. Rather, we specify the distribution up to certain unknown quantities, called parameters. In the above modeling of \(T_{i}\), we do not specify \(\tau\) and \(\sigma^{2}\); these are the parameters.

  • In statistical (parametric) inference, the primary goal is to estimate the unknown parameters based on the sample data. By estimating these parameters, we are effectively estimating the entire population distribution, since the model structure has already been defined. Once the population distribution is accurately estimated, we can answer any questions about the population.

  • Finally, to estimate the parameters we put forward certain functions of the samples. For instance, we can use the sample mean \(\bar{T} = n^{-1}\sum_{i=1}^{n} T_{i}\) to estimate \(\tau\) in the above example. Any function of the samples is called a statistic; a small numerical sketch follows.
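To make this concrete, here is a brief numerical sketch with hypothetical measurements (Python/NumPy assumed; these are NOT the lecture's ten recorded values, which are not reproduced here):

```python
import numpy as np

l = 0.75                                      # pendulum length: 75 cm, in metres
# Hypothetical measurements in seconds -- not the lecture's data.
T = np.array([1.74, 1.73, 1.75, 1.74, 1.72,
              1.76, 1.73, 1.74, 1.75, 1.74])

T_bar = T.mean()                              # the statistic used to estimate tau
g_hat = 4 * np.pi**2 * l / T_bar**2           # plug-in estimate of g
print(f"T_bar = {T_bar:.3f} s,  g_hat = {g_hat:.3f} m/s^2")
```

With these hypothetical values, \(\bar{T} = 1.740\) s and \(\hat{g} \approx 9.78\) m/s\(^2\).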