Quick Review of Statistical Inference

The objective of Statistical Inference is to understand and quantify the uncertainty of parameter estimates.
Two crucial topics:
- Confidence interval for a parameter (a range of plausible values)
- Hypothesis testing framework to formally test claims about the populations

Population and Sample

Suppose we are interested in the average annual income for a family of 4 in the United States in 2015. Our population would be the set of annual incomes for every family of four living in the United States in 2015.
To actually gather all that data would be virtually impossible. Thus we have to make do with a subset of the families of 4 in the US in 2015. That subset is called a sample.
If every family of 4 in 2015 had an equal chance of being included in the subset, then we would say we had a random sample.
Given our sample of families and their annual incomes we can calculate the average income for all the families in our sample.
More concretely: if we have $n$ families and $x_i$ is the annual income of the $i$-th family, then the average income for the families in our sample is denoted by $\bar{x}$ and is defined by \[ \bar{x} = \frac{1}{n}\sum_{i = 1}^{n} x_i\]
But, we don’t know if the value $\bar{x}$ is anywhere close to the average income of all families.
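
As a rough illustration of the population/sample distinction, the sketch below simulates a made-up population of incomes, draws a random sample from it, and compares the sample mean with the population mean. The log-normal shape and its parameters are assumptions chosen only for illustration; they are not real 2015 income data.

```r
set.seed(42)                     # makes this sketch reproducible

# Hypothetical population: 1,000,000 simulated family incomes (not real data)
population <- rlnorm(1e6, meanlog = 11, sdlog = 0.6)

# Random sample: every family has the same chance of being selected
n <- 500
x <- sample(population, size = n)

x_bar <- sum(x) / n              # sample mean, identical to mean(x)
mu    <- mean(population)        # population mean, unknown in practice

c(sample_mean = x_bar, population_mean = mu)
```

In a real study only `x_bar` is available; the gap between the two numbers is exactly the uncertainty that inference tries to quantify.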

Confidence Interval

The Z distribution is just the standard normal, $Z \sim N(0,1)$.
- $1 - \alpha$ is the desired confidence level, e.g., $\alpha = 0.05$ represents 95% confidence.
$z_{1 - \frac{\alpha}{2}}$ is chosen so that \[P(-z_{1 - \frac{\alpha}{2}} < Z < z_{1 - \frac{\alpha}{2}}) = 1 - \alpha\]
A 95% confidence interval estimate for the parameter $\mu$ is given by $\bar{X} \pm z_{0.975} \cdot SE$, where the standard error is $SE = \frac{\sigma}{\sqrt{n}}$, $\sigma$ is the population standard deviation, and $n$ is the sample size.
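
The normal-based interval can be sketched in R as follows, taking $\alpha = 0.05$ so that $1 - \alpha = 0.95$ as in the probability statement above. The data are simulated, and the value of `sigma` is treated as known purely for the sake of the example.

```r
set.seed(1)
sigma <- 2                       # population standard deviation, assumed known here
n     <- 100
x     <- rnorm(n, mean = 1, sd = sigma)   # simulated sample

alpha <- 0.05                    # 1 - alpha = 95% confidence
z     <- qnorm(1 - alpha / 2)    # z_{0.975}, approximately 1.96
se    <- sigma / sqrt(n)         # standard error of the sample mean

ci_z <- mean(x) + c(-1, 1) * z * se   # lower and upper limits
ci_z
```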
Unfortunately, when we don’t know the population mean, we almost certainly don’t know the population standard deviation either (it’s kind of circular, actually).
This is particularly a problem when the sample size is very small, e.g., $n < 30$.
Gosset, writing under the pseudonym “Student”, developed a distribution that, while behaving similarly to the Normal distribution, has only one parameter: the degrees of freedom $df$.

Student t-distribution compared to the normal

[Figure: Student t-distribution vs. Normal]
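
One way to see how the two distributions compare numerically is to tabulate the 97.5% quantile of the t-distribution for several degrees of freedom next to the normal quantile. The particular `df` values below are arbitrary choices for illustration.

```r
df <- c(2, 5, 10, 30, 100, 1000)   # arbitrary degrees of freedom to compare

# 97.5% quantiles: one t value per df, and the single normal value for reference
quantiles <- data.frame(
  df     = df,
  t      = qt(0.975, df),
  normal = qnorm(0.975)
)
quantiles
```

The t quantile is always larger than the normal one (heavier tails) and shrinks toward it as `df` grows.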

Confidence intervals with t-distributions

t-distribution with $df$ degrees of freedom.
$t_{1 - \frac{\alpha}{2},df}$ is chosen so that \[P(-t_{1 - \frac{\alpha}{2},df} < T_{df} < t_{1 - \frac{\alpha}{2},df}) = 1 - \alpha\] where $T_{df}$ follows a t-distribution with $df$ degrees of freedom.
A 95% confidence interval estimate for the parameter $\mu$ is given by $\bar{X} \pm t_{0.975,df} \cdot SE$ with $df = n - 1$, where the standard error is $SE = \frac{S}{\sqrt{n}}$, $n$ is the sample size, and $S$ is the sample standard deviation.
You can either use a table or the qt function from R to get the appropriate t-quantile.
Example:

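n = 100
x = rnorm(n, 1, 2)   # sample of size n from a normal with mean 1, sd 2

se = sd(x)/sqrt(n)   # estimated standard error of the mean
mean(x) + c(-1,1)*se*qt(0.975, df = n - 1) # t CI

## [1] 0.6224048 1.3964536

mean(x) + c(-1,1)*se*qnorm(0.975) # normal (z) CI

## [1] 0.6271354 1.3917230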

## [1] 0.6271354 1.3917230

But even in this example, with $n = 100$, the t-CI is slightly wider. So always use t when the standard deviation is estimated from the sample; it’s more conservative.
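
To see why the more conservative interval is preferable, the sketch below repeats the experiment many times with a small sample and counts how often each interval contains the true mean. The sample size, number of repetitions, and distribution parameters are assumptions picked for illustration.

```r
set.seed(123)
n    <- 10        # small sample, where the t/z difference matters most
reps <- 5000      # number of simulated samples
mu   <- 1         # true mean, known only because we simulate

covers <- replicate(reps, {
  x    <- rnorm(n, mean = mu, sd = 2)
  se   <- sd(x) / sqrt(n)
  ci_t <- mean(x) + c(-1, 1) * se * qt(0.975, df = n - 1)
  ci_z <- mean(x) + c(-1, 1) * se * qnorm(0.975)
  c(t = ci_t[1] < mu && mu < ci_t[2],     # does the t interval contain mu?
    z = ci_z[1] < mu && mu < ci_z[2])     # does the z interval contain mu?
})

coverage <- rowMeans(covers)   # fraction of intervals containing the true mean
coverage
```

The t interval should cover the true mean close to 95% of the time, while the normal-based interval built from the sample standard deviation falls short of that.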

Formal Hypothesis Testing Framework

Suppose we have estimated the slope of a regression line for a single variable. The true slope is the parameter $\beta_1$, and the sample slope is given by $\hat{\beta_1}$.

$\mathbf{H_0}$: $\beta_1 = 0$ There is no linear relationship.

$\mathbf{H_A}$: $\beta_1 \ne 0$ There is a linear relationship (two-sided test).

The idea is that we reject $H_0$ only if we have a lot of evidence to the contrary. In other words, how likely are the results we observed if the null hypothesis were true?
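
The question "how likely are these results if the null hypothesis is true" can be explored by simulation. The sketch below generates data in which the true slope really is zero, fits a regression each time, and records how often the t statistic for the slope is extreme. All sizes and the 5% cutoff are assumptions made for the example.

```r
set.seed(2024)
reps <- 1000      # number of simulated data sets
n    <- 50        # observations per data set

t_stats <- replicate(reps, {
  x <- rnorm(n)
  y <- rnorm(n)   # y is generated independently of x, so beta_1 = 0
  summary(lm(y ~ x))$coefficients["x", "t value"]
})

cutoff       <- qt(0.975, df = n - 2)          # two-sided 5% critical value
extreme_rate <- mean(abs(t_stats) > cutoff)    # share of "extreme" results under H_0
extreme_rate
```

Under the null hypothesis, extreme t statistics show up only about 5% of the time, which is why observing one counts as evidence against $H_0$.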

Example: Galton data

The output below shows the coefficient statistics from a simple linear regression of child on parent in the galton data set.

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 23.9415302 2.81087834  8.517455 6.536845e-17
## parent       0.6462906 0.04113588 15.711115 1.732509e-49

Usually we don’t pay too much attention to the intercept, unless it has some intrinsic meaning. In this case there are very few adults who are 23 inches tall!

However, the coefficient of parent, $\hat{\beta_1}$, says that a child’s height increases by only about 0.65 inches for every one-inch increase in the parent’s height. Can we conclude this is a statistically significant result? Let’s look at the table. The column labeled “Pr(>|t|)” is the probability, if the null hypothesis were true, of observing a t value as large, in absolute value, as the one we observed. Here it is on the order of $10^{-49}$, in other words effectively zero.

So we would reject the null hypothesis that there is no linear relationship between the heights. “Pr(>|t|)” is frequently called the $p$-value. We would reject the null hypothesis even if the $p$-value were as large as 2.5%, since that is still below the conventional 5% significance level.
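
The "t value" and "Pr(>|t|)" columns can be reproduced by hand from the estimate and its standard error. The sketch below does this on simulated heights rather than the galton data; the variable names mirror the table, but the numbers and parameters are assumptions made for illustration.

```r
set.seed(1)
n      <- 200
parent <- rnorm(n, mean = 68, sd = 1.8)             # simulated heights, not the galton data
child  <- 24 + 0.65 * parent + rnorm(n, sd = 2.2)   # assumed true slope of 0.65

fit   <- lm(child ~ parent)
coefs <- summary(fit)$coefficients

# t value = estimate / standard error
t_val <- coefs["parent", "Estimate"] / coefs["parent", "Std. Error"]

# two-sided p-value: probability of |t| at least this large under H_0
p_val <- 2 * pt(abs(t_val), df = fit$df.residual, lower.tail = FALSE)

rbind(by_hand = c(t = t_val, p = p_val),
      from_lm = c(t = coefs["parent", "t value"], p = coefs["parent", "Pr(>|t|)"]))
```

The same relation holds in the table above: 0.6462906 / 0.04113588 is approximately 15.71, the reported t value for parent.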

Statistical Inference and Regression