Statistical Inference Lectures

Master Data Science and Statistical Learning

Author

Affiliation

Giovanni M. Marchetti

DISIA, University of Florence

Published

March 5, 2026

1 Samples, estimation and sampling distributions

Statistical thinking concerns the relation of quantitative data to a real-world problem, often in the presence of variability and uncertainty. It attempts to make precise and explicit what the data has to say about the problem of interest

— Mallows (1998) The zeroth problem

The starting point in Statistics is a scientific enquiry, an investigation concerning a substantive question.

The first problem (the zeroth problem) to be studied is then to understand what is the relevant population.

In other words one has to decide what the relevant data are, and how these relate to the purpose of the statistical study.

1.1 The Fiji Fertility Survey

One important issue in demography is fertility (the number of children ever born of a woman) and its relation with other factors.
The Fiji Fertility Survey (which is part of the World Fertility Survey, see Fiji Fertility survey, 1974) collected data of fertility and other relevant factors.
The data below are a subset of the full data set, regarding 1703 ever-married women Aged 35 -49 in 1974 and concerning the variables:
F, fertility (number of children), the main response,
A, age in years,
M, age at the first marriage in years,
E, education in completed years
U, a binary indicator of urban residence
I, a binary indicator of indian ethnicity

Questions

Some of the basic questions are:

What is the population to be investigated and how should we collect data?
Should we measure fertility could be measured for each woman at a given age or at a given point in time?
Should we choose as relevant women population for a specific region or for a selected ethnic group?
Can we distinguish categories of variables, i.e. responses, intermediate variables, background variables, etc.
Do we have causal hypothesis concerning fertility?

Explicit what we know about the problem

Fertility is a primary outcome depending on age, age at first marriage, education, and residence and ethnicity
The variables M and E are intermediate variables, while A, U and I can be interpreted as background variables that may be considered fixed and on the same standing.
One reasonable ordering of the variables is the following \left\{ \begin{array}{c} I\\ U \\ A \end{array} \right\}\to \{E\} \to \{M\} \to \{F\}

Each variable potentially depends on the variables to the left.

Meaning of statistical inference

A statistical inference will be defined as a statement about statistical populations made from given observations with measured uncertainty.

An inference in general is an uncertain conclusion.

Two things mark out statistical inferences.

First the information on which they are based is statistical, i.e. consists of observations subject to random fluctuations.
Secondly we explicitly recognize that our conclusion is uncertain and attempt to measure, as objectively as possible, the uncertainty involved.

—Cox (1958)

Random sampling

Suppose that our population consists in the 1703 ever-married women Aged 35-49 in 1974.

source("fiji.R")
summary(fiji)

       F                A               M               E          I      
 Min.   : 0.000   Min.   :35.00   Min.   : 9.00   Min.   : 0.000   0:774  
 1st Qu.: 4.000   1st Qu.:38.00   1st Qu.:15.00   1st Qu.: 0.000   1:929  
 Median : 6.000   Median :41.00   Median :17.00   Median : 4.000          
 Mean   : 5.904   Mean   :41.04   Mean   :17.91   Mean   : 3.985          
 3rd Qu.: 8.000   3rd Qu.:44.00   3rd Qu.:20.00   3rd Qu.: 6.000          
 Max.   :17.000   Max.   :49.00   Max.   :44.00   Max.   :20.000          
 U       
 0:1113  
 1: 590

Then we choose a random sample with replacement of size 20 from this population.
The R code is as follows

set.seed(457)  # set the seed to reproduce results
x20 <- fiji[sample(1703, 20, replace = TRUE), ]
summary(x20)

       F               A               M               E         I      U     
 Min.   : 0.00   Min.   :35.00   Min.   :12.00   Min.   : 0.00   0:11   0:14  
 1st Qu.: 4.75   1st Qu.:39.50   1st Qu.:16.00   1st Qu.: 3.00   1: 9   1: 6  
 Median : 6.50   Median :41.50   Median :17.50   Median : 4.50                
 Mean   : 6.20   Mean   :41.75   Mean   :18.05   Mean   : 4.70                
 3rd Qu.: 7.25   3rd Qu.:46.00   3rd Qu.:20.25   3rd Qu.: 7.25                
 Max.   :11.00   Max.   :48.00   Max.   :24.00   Max.   :10.00

1.2 Inference with survey sampling approach

The survey sampling approach is typically used for finite populations
The focus is typically on population means and variance of important variables.
The simplest problem of inference is how to relate the sample mean \bar x of a variable X, from a random sample of size n, to the population mean \mu in the finite population

Inference is this case is based on the repeated sampling principle:

Statistical procedures should be evaluated on the basis of their behaviour in hypothetical repetitions of the experiment that generated the original data.

This principle is the basis of the frequentist approach

Example

The previous sample of size n = 20 is a simple random sample with replacement i.e. we know that the observation are

mutually independent and
identically distributed for each variable.

To estimate the mean of the fertility \mu_F in the population we use the sample mean (called a statistic) \bar x_F = \frac{1}{n} \sum_{i=1}^n x_i

The the population mean \mu_F and the sample mean \bar x_F are in general different

mu <- mean(fiji$F)
xbar <- mean(x20$F)

\mu_F = 5.9 \quad \bar x_F = 6.2

Therefore we need a measure of the uncertainty of the estimate \bar x_F

1.3 Measure of the error of an estimator

According to the repeated sampling principle we measure the uncertainty as follows.

We consider the means in repeated random samples, denoted by the random variable \bar X_F.
The random variable \bar X_F has a probability distribution that can be studied. It is called the estimator of \mu_F
We measure the uncertainty of the estimator with the Mean Squared Error (MSE)

MSE = E\{(\bar X_F - \mu_F)^2\}
The square root of MSE, called the standard error, is a measure of uncertainty of the estimator.

Formula

There is a formula valid for simple random samples with replacement

\begin{align*} E(\bar X_F) &= \mu_F \\ MSE &= \mathrm{var}(X_F) = \sigma_F^2 / n \\ \end{align*}

where \sigma_F^2 is the true variance of fertility.

The standard error of the estimator of the mean is therefore SE (\bar X_F) = \sigma_F/\sqrt{n}

1.4 Estimated Standard Error

The standard error is however an unknown parameter because the standard deviation \sigma concerns the population that is unkown!
To get the standard error you need to estimate \sigma from the same sample using the estimator

s_F = \sqrt{\sum_{i=1}^n (X_i - \bar X_F)^2/(n-1) }

The estimate of the standard error is obtained by substituting s_F to \sigma_F \text{se}(\bar X_F) = s_F/\sqrt{n}.

Example

The sample standard deviation of fertility in our sample is

ssd <- sd(x20$F)
se <- ssd/sqrt(20)
cat("Estimated SD: ", ssd, "\nEstimated SE: ", se, "\n")

Estimated SD:  2.504732 
Estimated SE:  0.5600752

Our inference about the mean fertility is summarized by \bar x_F = 6.2,\quad (\text{se} = 0.56)

Graphical presentation

Consider the true distribution of the population with the true mean (black) and our estimate (red).

The true standard deviation is \sigma = 3.08 and its estimate is s = 2.5.
The true standard error of the mean is \text{SE} = 0.69.
The estimated standard error of the mean is \text{se} = 0.56.

1.5 Model-based inference theories

Often we start not from a finite population but from a statistical model i.e., from an assumption concerning the distribution of an infinite population
For example, we assume that the distribution of the population is Gaussian (Normal) and that our sample is a random sample from that population with unknown mean \mu and standard deviation \sigma
In this example, as before we want to estimate these two unknown parameters \mu and \sigma.
The model-based theory of inference are essentially two:
- the frequentist theory (Fisher, Neyman and Egon Pearson) seen before
- the Bayesian theory, (Bayes, Laplace, Jeffreys) the useful for more complex models and data.

Basic ingredients

Both theories are based on

a research question to be answered
a design D for the study; and
a probability model f(x ; \theta) for the random variable X, which is used to represent the data produced by the study
The parameter \theta can be a scalar or a vector belonging to a given parameter space.
If the model can be defined by a finite number of parameters is said a parametric model.

1.6 The Bernoulli model

In this model the data are binary x = 0,1, where 0 indicates failure and 1 success with probability mass f(x; \theta) = \begin{cases} 1-\theta & \text{ if } x = 0 \\ \theta & \text{ if } x = 1 \end{cases}
Assume that the design is a random sample (X_1, \dots, X_n) where the variables are independent and identically distributed (i.i.d)
The parameter \theta is the probability of success, and assumes values in a parameter space 0 <\theta<1.
The mean of the Bernoulli is also equal to \theta
The variance is \theta(1-\theta)

Example (Placenta previa)

Placenta Previa is a complication in the second half of pregnancy where the placenta attaches inside the uterus in a position that partially or completely covers the cervical opening.

Suppose that we have data from a random sample in the population of placenta previa births in Germany, and we are interested in the proportion of female births.
Therefore we can specify a Bernoulli model with parameter \theta equal to the probability of a female birth.
If the size of the sample is n = 980 and the number of females in the sample is s = 437 we can estimate \theta as \hat \theta = 437 /980 = 0.446

1.7 The standard error of \hat \theta

The first inference is to obtain the standard error of the estimator.
In general, given an i.i.d sample (X_1, \dots, X_n) from the Bernoulli, the proportion of successes is \hat \theta = \frac{1}{n} \sum_{i = 1}^n X_i = \frac{T}{n} where T is the number of females in the sample.
The sampling distribution of T has a Binomial distribution with mean E(T) = n\theta and variance \mathrm{var}(T) = n\theta(1-\theta).
The MSE of \hat \theta is equal to the variance \mathrm{var}(\hat \theta) = \mathrm{var}(T/n) = \frac{1}{n^2}\mathrm{var}(T) = \frac{\theta(1-\theta)}{n}. Therefore the standard error is \text{SE}(\hat \theta) = \sqrt{\frac{\theta(1-\theta)}{n}}

Example

Again the standard error is a function of the unknown parameter \theta
The estimated standard error is obtained by \text{se}(\hat \theta) = \sqrt{\frac{\hat \theta(1-\hat \theta)}{n}}
The inference concerning \hat \theta in the example of placenta previa is \hat \theta = 0.446, \quad (\text{se} = 0.016)
The conclusion is that the proportion of females in the population of placenta previa births is 44.6\% with a standard error of 1.6%.

1.8 Large samples

The sampling distribution of the proportion \hat \theta is a Binomial T divided by the sample size n
In the placenta previa example, assuming that the true proportion is \theta the distribution of \hat \theta is shown in figure below

We have two fundamental results obtained using the large sample property
1. The proportion T/n converges in probabilityto the true parameter \theta, for n \to \infty
2. The proportion T/n converges in law to the Normal, for n \to \infty

De Moivre-Laplace Theorem

When \theta is not too close to 0 or 1, the Binomial distribution \text{Bin}(n, \theta) in large samples approximates the Normal distribution N(n\theta,n\theta(1-\theta)).

We say that the distribution of \hat \theta is asymptotically normal.

Central limit theorem

Given any population with a continuous distribution with a finite mean \mu and variance \sigma^2, in large samples, the distribution of the sample mean under random sampling approximates the Normal distribution N(\mu, \sigma^2/n)

1.9 Details on the normal and the chi-squared distributions

The previous mathematical results imply a central role of the normal distribution in statistical inference.
The standard normal random variable Z is defined by the density \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty < z < \infty
We know that Z is symmetric unimodal with respect to \mu. Moreover E(Z) = 0 and \mathrm{var}(Z) = 1
The family of normal distributions is defined by a linear transformation of the standard normal X = \mu + \sigma Z where \mu \in \mathbb{R} is the mean parameter and \sigma^2>0 is the variance parameter. The family is denoted by X \sim N(\mu, \sigma^2).

The tails are rather short:

\begin{align*} \Pr\{ -1 < Z < 1 \} &= 0.683\\ \Pr\{ -2 < Z < 2 \} &= 0.954\\ \Pr\{ -3 < Z < 3 \} &= 0.997\\ \end{align*}

Therefore, the probability that X differs by more than 3 \sigma from the mean is only 0.003.

The chi-squared distribution

The chi-squared (\chi^2_d) is a distribution closely connected with the normal. The parameter d is an integer called degrees of freedom
The simplest example is the chi-squared with 1 degree of freedom that is the square of one standard normal Z Specifically, Z^2 \sim \chi^2_1 whith parameter 1.

In particular this distribution has support the positive reals, with mean E(Z^2) = 1 and \mathrm{var}(Z^2) = 2.
More generally if we have d independent standard normal variables Z_1, \dots, Z_n the sum of squares Z_1^2 + \cdots + Z_n^2 \sim \chi^2_n has a chi-squared distribution with n d.f.
The mean is n and the variance is 2n
For example see the distribution of \chi^2_3

1.10 The one-sample normal model

This model assumes that we get a random sample from a population that has a normal distribution N(\mu, \sigma^2) where both \mu and \sigma^2 are unknown.
In formulae we assume that
- we have a random vector {\boldsymbol{X}} = (X_1, \dots, X_n) such that each X_i \sim N(\mu, \sigma^2) for i = 1, \dots, n
- the observations X_i are mutually independent so that the joint distribution factorizes as
  f_{{\boldsymbol{X}}}(x_1, \dots, x_n) = f_1(x_1) \cdot \cdots \cdot f_n(x_n)

The one sample model can be written as X_i = \mu + \varepsilon_i where \varepsilon_i are mutually independent errors with normal distributions N(0, \sigma^2) with the same variance.

Of course this model makes quite strict assumptions so that we expect to check them carefully.

1.11 Sampling distributions of the mean and variance estimators

The estimators of \mu (parameter of interest) and \sigma^2 (nuisance parameter) are respectively \bar X = \textstyle \frac{1}{n}\sum_{i=1}^n X_i, \quad S^2 = \sum_{i=1}^n(X_i - \bar X)^2/(n-1)
Standard error of the sample mean:

\text{se}(\bar X) = S /\sqrt{n} - The sampling distributions with the normal model and sample size n are exactly

\bar X \sim N(\mu, \sigma^2/n)
S^2 \sim \sigma^2 \chi^2_{n-1}/(n-1)
The variance estimate has a chi-squared distribution with n-1 degrees of freedom.

Why n-1 d.f.?

If Z_1, \dots, Z_n are IID standard normals it can be proved that \sum_{i=1}^n Z_i^2 \sim \chi^2_n but \sum_{i=1}^n (Z_i - \bar Z)^2 \sim \chi^2_{n-1}
This happens because the r.v. Z_i -\bar Z must satisfy one constrain that their sum must be always zero -Thus, you don’t have n independent variables but only n-1, loosing 1 degree of freedom.

Example about Salary

Data on employees from one job category (skilled, entry–level clerical) of a bank that was sued for sex discrimination. The data are on 61 female employees, hired between 1965 and 1975. (R Package Sleuth3)

 [1] 5100 4800 5280 5280 4800 4800 5400 5520 5400 5700 3900 4800 6120 5220 5100
[16] 4380 4290 5400 4380 5400 5400 4500 5400 5520 5640 4800 5400 4500 5400 5400
[31] 6300 5160 5100 4800 5400 4020 4980 5280 5700 4800 4800 5700 4380 4380 5700
[46] 5400 4440 6300 6000 5100 4800 5100 5700 5400 5100 4800 6000 4380 5580 4620
[61] 5220

Below a synthesis with a box-plot:

Estimates of the mean and standard error:

sample mean = $ 5138.852  se = $ 69.12335

1.12 The two-sample normal model

A more interesting model concerns the comparison of two populations (be they Bernoulli, Normal or whatever)

Let’s discuss the previous example.

Sex discrimination in employment

Did a bank discriminatorily pay higher starting salaries to men than to women? The salaries studied before concerned women, but there are data also for men.
The full set of data is in R package Sleuth3, loading data case1202

 [1] 5040 6300 6000 6000 6000 6840 8100 6000 6000 6900 6900 5400 6000 6000 5400
[16] 5400 5100 6600 5100 6600 5700 6000 6000 6000 6000 4620 5220 6600 5400 6000
[31] 5400 6000

The two samples have sizes n_0 = 61 (females) and n_1= 32 (males).

The two-sample normal model assumes that the data are obtained by two independent random samples (possibly of different sizes) from two normal populations with parameters N(\mu, \sigma^2) \text{ and } N(\mu + \theta, \sigma^2) See the figure below:

The two random samples can be collected together as a independent observations Y_i for i = 1, \dots, n plus an indicator variable X_i with value 0 for the women and 1 for the males, therefore identifying the two samples \begin{array}{cccccccccc} {\boldsymbol{Y}} &= &(Y_1 & Y_2& \cdots & Y_{n_0}& Y_{n_0+1}& Y_{n_0+2} &\cdots & Y_{n_0+n_1}) \\ {\boldsymbol{X}} &= &(0 & 0 & \cdots & 0 & 1 & 1 &\cdots & 1 ) \\ \end{array}

The two-sample model can be denoted by Y_{i} = \mu + \theta \cdot X_i + \varepsilon_i with IID errors \varepsilon_i \sim N(0, \sigma^2)

Note that the observations Y_i are independent but not identical
The errors \varepsilon_i are instead IID because the variances are assumed to be the same (an assumption called homoscedasticity).

Reparameterization

The parameters in the two-sample models are three: \mu, \theta and \sigma^2
You can change the parameters to \mu_0, \mu_1, \sigma^2 where \mu_0 and \mu_1 are the salary means of women and men.
There is a 1-1 correspondence with the previous parameters

\begin{align*} \mu_0 &= \mu\\ \mu_1 &= \mu+ \theta \end{align*} ot, inversely, \begin{align*} \mu &= \mu_0\\ \theta &= \mu_1 - \mu_0 \end{align*}
Here \theta has the role of parameter of interest because it evaluates the amount of sex discrimination
\sigma^2 has the role of nuisance parameter because it measures the uncertainty of the parameter of interest

1.13 Estimation and standard errors of the two-sample model

Intuitively, we can estimate \mu with the mean of the first sample, and \theta with the difference between the two sample means.
Denoting by \bar Y_0 and \bar Y_1 the sample means, we get the estimators \hat \mu = \bar Y_0, \quad \hat \theta = \bar Y_1 - \bar Y_0 and we can prove that both are normally distributed because linear combinations of normal variables are normal
The standard error of \bar Y_0 is given by \text{SE}(\bar{Y_0}) = \sqrt{\mathrm{var}(\bar Y_0)} = \frac{\sigma}{\sqrt{n_0}} where n_0 is the size of the first sample.
The standard error of \hat \theta is \begin{align*} \text{SE}(\hat\theta) &= \sqrt{\mathrm{var}(\bar Y_1 - \bar Y_0)} \\ &= \sqrt{\mathrm{var}(\bar Y_1) + \mathrm{var}(\bar Y_0)}\\ &= \sqrt{\text{SE}(\bar Y_0)^2 + \text{SE}(\bar Y_1)^2 }\\ &= \sqrt{\frac{\sigma^2}{n_0} + \frac{\sigma^2}{n_1}}\\ &= \sqrt{\sigma^2 \Big(\frac{1}{n_0} + \frac{1}{n_1}\Big)} \end{align*}
The estimated standard errors depend upon an estimate of the common variance \sigma^2

The pooled variance^*

The estimator of \sigma^2 is the pooled variance S^2_{p} = \frac{1}{n-2}\left[\textstyle \sum_{i \in G_0}(Y_i - \bar Y_0)^2 + \sum_{i \in G_1}(Y_i - \bar Y_1)^2 \right] that is a sum of the deviances within the groups divided by the number of observations minus 2.
This estimator can be interpreted as a weighted average of the variances of the groups:

S^2_{p} = \frac{(n_0-1) S^2_0 + (n_1-1) S^2_{1}} {(n_0-1) + (n_1-1)}

Estimated standard error of \hat \theta

We use again the plug-in rule substituting to \sigma^2 the pooled variance \text{se}(\hat \theta) = \sqrt{ S^2_p \left(\frac{1}{n_0} + \frac{1}{n_1}\right)}.

Computations for sex discrimination

             Estimate Std. Error
(Intercept) 5138.8525   76.25502
SexMale      818.0225  129.99747

Using the formulae we have the following estimates

\begin{align*} \hat \mu &= \$\,5138.8 \quad (\text{se} = 76.2) \\ \hat \theta &= \$\,818.0 \quad (\text{se} = 130.0) \end{align*}

Thus, there is a salary difference of about 800 dollars with a standard error of 130 dollars.
The salary difference has an exactly normal sample distribution. Intuitively, it seems unlikely that a difference of $818 can happen if the true \beta = 0 given that is 6 times the standard error $130.

1.14 Exercises

In a region a telephone company has 100 000 subscribers. It plans to take a simple random sample of size 400 as part of a market research study. According to Census data nearly 20% of the company’s subscribers earn over $50 000 a year. Find the standard error for the percentage of persons in the sample with incomes over that level.
You have a box containing the counts \{ 0,0,0,1,2\}. Consider this as a population. The population has a mean = 0.6 and SD = 0.8. Find the SE for the percentage of 1’s in 400 draws with replacement.
Consider the R file fiji.R and use the code source("fiji.R") that returns a 1703 x 6 dataframe named fiji. Assume that it is a simple random sample from the population. Then obtain an estimate of the mean age at first marriage and its standard error.
A survey organization wants to take a simple random sample in order to estimate the percentage of people who have seen a certain television program. Their client will only tolerate chance errors of 1 percentage point or so in the estimate. Should they use a sample of size 100, 2,500, or 10,000? You may assume the population to be very large; past experience suggests the population percentage will be in the range 20%–40%.
Use the data fiji.R and consider the variables M (age at first marriage) and I (indicator of indian etnicity: 1 = Indian, 0 = Fijian). Draw two side-by-side boxplots and estimate the difference between the mean age of the Fijians and the Indians and its standard error.
Let Z \sim N(0,1) with density \phi(z) and cumulative distribution function \Phi(z) = \Pr(Z \le z). True or false and explain.
1. The slope of \Phi at the real number z is \phi(z)
2. The area to the right of z under \phi is \Phi(z)
3. P(Z = x) = \phi(x)

2 Fundamental concepts of inference

Many inferential problems can be classified into three categories:

estimation
confidence sets
hypothesis testing

Her we give a short introduction.

2.1 Point Estimation

Point estimation tries to find an “optimal guess” of some quantity of interest like

a parameter
a distribution
a regression function
a prediction for a future value

Let’s call \theta a generic parameter of a model and \hat\theta (or \hat \theta_n) its estimator. Remember that the estimator is a random variable in repeated sampling.

For a random sample IID (X_1, \dots, X_n) from a distribution, a point estimator \hat\theta is a function of the sample.

Bias

The bias of an estimator is defined by the difference between the expected value of the estimator and the true \theta \text{bias}(\hat \theta) = E(\hat \theta) - \theta
If E(\hat \theta) = \theta the bias is 0 and the estimator is said unbiased.

Examples

Given an IID sample. the sample mean \bar X is always unbiased for the mean of the population \mu.
Given an i.i.d sample from a Bernoulli, the proportion of successes \hat \theta = T/n is unbiased.
However the estimator (T/n)^2 of \theta^2 of the Bernoulli is biased because E(T^2/n^2) = \theta^2 + \underbrace{\theta(1-\theta)/n}_{\text{bias}} To get an exactly unbiased estimator of \theta^2 you should use the estimator \frac{T(T-1)}{n(n-1)}.
The sample variance S^2 (with denominator n-1) is unbiased for the variance \sigma^2 of the population.
However the square root of the sample variance, i.e. \sqrt{S^2} is biased for the standard deviation \sigma.
In general, unbiasedness is a non persistent property, and in fact many good estimators are biased, provided that the bias tends to zero when the sample size increases.

Consistency

An essential thing for an estimator is not the zero bias, but a property called consistency meaning that the estimator converges to the true parameter value as we collect more and more data.

Definition

A point estimator \hat\theta_n of \theta is consistent if \hat\theta_n \to \theta for n \to \infty in probability.

Mean squared error

As we have seen before the quality of a point estimate is assessed by the mean squared error \text{MSE} = E_\theta\{ \hat \theta - \theta)^2\}.
Interestingly the MSE can be decomposed as follows

\text{MSE} = \text{bias}^2(\hat \theta) + \mathrm{var}(\hat \theta) And this helps in checking if an estimator is consistent

Theorem

If the \text{bias} \to 0 and the \text{SE} \to 0 as n \to \infty then the estimator \hat \theta_n is consistent.

Example

The estimator of proportion of successes \hat \theta = T/n is consistent, because it is unbiased and the standard error is \theta(1-\theta)/n so that \lim_{n \to \infty} \theta(1-\theta)/n = 0.

2.2 Methods of estimation

The likelihood

Behind the scenes of the frequentist approach there is the fundamental concept of likelihood. Estimation methods, interval estimates and test statistics can all be derived from the likelihood function.

Idea

As the data are assumed to be generated by a model defined by a probability distribution (like the Bernoulli model with parameter \theta) all the information in the sample concerning \theta must be contained in that probability distribution
Once you get the observations {\boldsymbol{x}}_\text{obs} = (x_1, \dots, x_n) you can measure the plausibile value of the parameter \theta by studying the probability of \theta, for fixed values of the observed data {\boldsymbol{x}}_\text{obs}

Example for the one-sample Bernoulli model

In the Bernoulli model with success probability \theta the distribution of the population is f(x; \theta) = \Pr(X = x) = \theta^x(1-\theta)^{1-x}, \qquad x = 0,1.
Suppose now that for a random sample of size n = 20 we get \begin{align*} {\boldsymbol{x}}_{obs} &= (0, 1, 1, 0, 0, 1, 1, 0, 1, 1, \\ &\phantom{= (} 1, 1, 1, 0, 1, 1, 1, 1, 1, 1) \end{align*}

Then the likelihood: is the probability of observing {\boldsymbol{x}}_\text{obs} L(\theta) = \prod_{i=1}^{20} \Pr(X_i = x_i) = \theta^{15} (1-\theta)^{5}

because there are 15 successes ond 5 failures in the sample.

The shape of the Bernoulli likelihood is the following

The Likelihood’s domain is the parameter space of the success probability \theta \in (0,1).
The red dot is the value \hat \theta with the most plausible value for the success probability in the population.

Likelihood definition

The likelihood function is defined by L(\theta, {\boldsymbol{x}}) = f({\boldsymbol{x}}; \theta) seen as a function of \theta for given data {\boldsymbol{x}}.
Typically is convenient to work with the logarithm of the likelihood \ell(\theta; {\boldsymbol{x}}) = \log L(\theta; {\boldsymbol{x}}).

Synthesis

The likelihood function shows, for each value of \theta, what is the probability of observing what we have observed under the model.

Given two values \theta_0, \theta_1 of the parameter, the likelihood ratio L(\theta_1; {\boldsymbol{y}})/ L(\theta_0; {\boldsymbol{y}}) is interpreted as the degree to which data support \theta_1 versus \theta_0.

Maximum likelihood estimates

The estimation method of maximum likelihood prescribes choosing as estimator the value of \theta that maximizes the likelihood (or equivalently the log-likelihood function)

Example

The maximum likelihood estimate for \theta can be obtained analytically by \hat \theta = \frac{\text{n. successes}}{n} = \frac{\sum_{i = 1}^n X_i}{n} = 15/20 = 0.75.
The log-likelihood and the MLE are shown below

Some properties

The log-likelihood is in general a curve (for a scalar parameter) or a surface (for a vector parameter) and can have multiple maxima. But in many important regular models there is a single maximum.
The curvature at the maximum is an indicator of the precision of the maximum likelihood estimator.
For a scalar parameter the curvature is measured by the observed information that is the negative of the second derivative of the log-likelihood function evaluated at the observed data.
The larger the information the more concentrated is the likelihood function around the maximum.

Sometimes is more convenient to work with the average curvature, called the Fisher information.

Example

The information depends on the sample size. If we had had a sample of 200 observations with 150 successes, the estimate would have been the same, but the curvature would have been much larger.

A greater curvature with 200 observations

Properties of the MLE

Under some regularity conditions on the model the MLE estimator has very appealing properties

In general

The MLE is equivariant: if \hat \theta is the MLE of \theta then g(\hat \theta) is the MLE estimator of g(\theta). For instance the MLE of the variance of the Bernoulli \theta(1-\theta) is \hat \theta(1-\hat \theta).

For large samples

The MLE is consistent
The MLE is asymptotically normal and the estimated standard error often can be computed analytically because it is related to the information.

MLE for some models

One-sample normal: \hat \mu = \bar X, \quad \hat \sigma^2 = \sum_i(X_i - \bar X)^2/n
Two-sample normal: \hat \mu = \bar Y_0, \quad \hat \theta = \bar Y_1 - \bar Y_0, \quad \hat \sigma^2 = \frac{n_0 \hat\sigma^2_0 + n_1 \hat \sigma^2_1}{n}

2.3 Confidence Sets

Definition

A confidence interval of level 1-\alpha is an interval with extremes L (left) and R (right) that are functions of the random sample (therefore random variables) such that \textstyle \Pr_\theta(L < \theta < R) \ge (1-\alpha) \text{ for all } \theta.

The idea is that we have random interval that traps the true parameter \theta with probability 1-\alpha called the coverage.
Typically we use 95 percent confidence levels, which correspond to choosing \alpha = 0.05.
If the parameter \theta is a vector we can define confidence sets. For example ellipses in 2D space.

Illustration

The probability concerns the random interval not the parameter. The parameter is fixed !

Wrong interpretations

(Wasserman) Every day, newspapers report opinion polls. For example, they might say that “83 percent of the population favor arming pilots with guns.” Usually, you will see a statement like “this poll is accurate to within 4 points 95 percent of the time.”
Technically, this means that a study based on a random sample gives a confidence interval with level 0.95 0.83 \pm 0.04 for the true proportion \theta of people who favor arming pilots.
A typical error is interpreting the observed confidence interval as: \small \Pr( 0.79 < \theta < 0.87) = 0.95 because is conceptually wrong!

Correct interpretation

If you continue constructing confidence intervals (not only for \theta but also for a sequence of other unrelated parameters, then 95 percent of your intervals will trap the true parameter value.

2.4 Approximate normal confidence intervals

Theorem

If we have an asymptotically normal estimator \hat\theta_n of \theta such that \hat \theta_n \approx N(\theta, \text{se}^2), then we can construct an approximate CI for \theta

C_n = (\hat \theta_n - 1.96 \, \text{se} < \theta < \hat \theta_n + 1.96\, \text{se})

with confidence level \to 0.95 for n \to \infty.

The value 1.96 is the quantile z_{1-\alpha/2} = z_{0.975} with \alpha = 0.05 (see subsequent explanation).
This value can be simplified to 2. For the confidence level 99% you could use 2.5.

Proof

If \hat\theta_n \approx N(\theta, \text{se}^2) then the distribution of the statistic \frac{ \hat\theta_n - \theta}{\text{se}} \to N(0,1)\quad \text{for } n \to \infty Therefore, \begin{align*} \textstyle \Pr_\theta(C_n) &= \textstyle \Pr_\theta(\hat \theta_n - 1.96\, \text{se} < \theta < \hat \theta_n + 1.96\, \text{se} )\\ &= \textstyle \Pr_\theta\Big(-1.96 < \displaystyle\frac{\hat \theta_n - \theta}{\text{se}} <1.96\Big)\\ &\bumpeq \textstyle \Pr_\theta\Big(-1.96 < Z < 1.96\Big) = 0.95 \end{align*}

Example

The approximate confidence interval for the mean fertility with a random sample with size n = 1703 is calculated as follows

X <- fiji$F
Xbar <- mean(X)
se <- sd(X)/sqrt(1703)
L <- Xbar - 1.96*se
U <- Xbar + 1.96*se
cat("mean: ", Xbar, ", se: ", se, ", CI95 = (", L, U, ")\n" )

mean:  5.904287 , se:  0.0746156 , CI95 = ( 5.75804 6.050533 )

Details about quantiles

Definition

Given a standard normal Z\sim N(0,1) a quantile or percentile of order p (a probability) is the value z_p such that \Pr (Z \le z_p) = p.

For example z_{0.975} = 1.96 .
This means that the probability that Z\le 1.96 is 0.975.

In the figure the orange area is 0.975. The blue dot is the quantile z_{0.975}. The red dot is the symmetric quantile z_{0.025} = -1.96.
With R you can find the quantiles easily. For example the quantile z_{0.90} (called also the ninetieth percentile) is

qnorm(0.90)

[1] 1.281552

General formula

In general the confidence intervals for \theta for large samples with confidence level 1-\alpha are \hat \theta_n \pm z_{1-\alpha/2} \text{se}.

2.5 Two-sample confidence intervals

Continuous data and large samples

Suppose that you have two independent random samples from two continuous distributions (not necessarily normal) with means \mu_0 and \mu_1 and two unequal variances \sigma^2_0 and \sigma^2_1.
If you want a confidence interval for the difference between the means \theta = \mu_1- \mu_0 and you have a large sample then you can use the approximation of central limit theorem \hat \theta \approx N\Big(\theta , \frac{\sigma_0^2}{n_0} + \frac{\sigma_1^2}{n_1}\Big)
If you don’t know the the population variances you can estimate them by \begin{align*} \hat\sigma_0^2 &= \frac{1}{n_0} \Sigma_{i=1}^{n_0} (Y_i-\bar Y_0)^2\\ \hat\sigma_1^2 &= \frac{1}{n_1} \Sigma_{i= n_0+1}^{n} (Y_i-\bar Y_1)^2\\ \end{align*}
Therefore you can get the second approximation \hat \theta \approx N\Big(\theta , \frac{\hat\sigma_0^2}{n_0} + \frac{\hat\sigma_1^2}{n_1}\Big)

Result

You can obtain an approximate 1-\alpha confidence interval with \hat \theta \pm z_{1-\alpha/2} \text{se}(\hat \theta) with \text{se}(\hat \theta) =\sqrt{\frac{\hat\sigma_0^2}{n_0} + \frac{\hat\sigma_1^2}{n_1}}.

Example (Age at first marriage and etnicity)

Consider the data of the Fiji Fertility Survey for all married women with age between 40-49 as a random sample of size n = 1005 from a non-normal discrete distribution.
Then let’s study the response Y, age at first marriage as a function of etnicity, I

There are two groups with size 554 (Fijian) and 451 (Indian).
The boxplot shows that on average Indian women get married earlier than Fijian.
The confidence interval of the difference is calculated as follows

\begin{align*} \bar y_1 &= 16.22 \\ \bar y_0 &= 19.72 \\ \hat \theta &= \bar y_0 - \bar y_1 = 3.5 \end{align*}

The estimated standard error of the difference is calculated from \begin{align*} \hat\sigma_1^2/n_1 &= 0.03 \\ \hat \sigma_0^2/n_0 &= 0.03 \\ \text{se} &= 0.26 \end{align*}
Therefore the 95% confidence interval for the difference of years at first mariiage \theta is 3.5\pm 1.96 \cdot 0.26 \to (2.9904, 4.0096)
Comment: Inference confirms that Indian women get married about 3.5 years earlier than Fijans with a small standard error \text{se} = 0.26. The 95% confidence interval for the difference has a margin of error of about half a year.

2.6 Controlled Experiments with 2x2 tables

Controlled experiments are conducted to assess whether a certain treatment has a significant effect on a outcome. For example, in 1954, an experimental study was conducted to test whether the Salk polio vaccine was effective.
- The treatment is X. Levels: placebo, vaccine
- The outcome Y = polio, no polio )
To be sure that the effect of treatment on the outcome is free from the possible influence of other variables observed or unobserved, the two groups are formed through a random process called randomization.
In the Salk field trial a large portion of the children were randomized in a double-blind, placebo-controlled experiment. Among consenting families, children were randomly assigned to receive either the vaccine or a salt solution (placebo)
A group of 200,745 children was vaccinated and a group of 201,229 different children was not vaccinated. The cases of polio in the two groups were then compared.

\small \begin{array}{lrrr} & \text{Polio} & \text{No polio} & \text{Total} \\\hline \text{Placebo} & 115 & 201,114 & 201,229\\ \text{Vaccine} & 33 & 200,712 & 200,745 \\\hline \text{Total} & 148 & 401,826 & 401,974 \\ \end{array}

The model

The structure of the experiment allows us to assume that the data are two independent samples from two Bernoulli populations \begin{align*} Y_0 &\sim \text{Bernoulli}(\theta_0) \\ Y_1 &\sim \text{Bernoulli}(\theta_1) \end{align*}
- \theta_0 is the probability of polio in placebo group
- \theta_1 is the probability of polio in vaccinated group
The parameter of interest is \delta = \theta_0 - \theta_1

Inference

The estimator of \delta is the difference of the sample proportions \hat \delta = p_0 - p_1 where p_0 = \frac{115}{201,229} = 0.000571 \quad p_1 = \frac{33}{200,745} = 0.000164
Since the sample sizes are huge, by the CLT we have \begin{align*} p_0 &\approx N\Big(\theta_0, \frac{\theta_0(1-\theta_0)}{n_0} \Big)\\ p_1 &\approx N\Big(\theta_1, \frac{\theta_1(1-\theta_1)}{n_1} \Big) \end{align*}
Then, considering the independence assumptions, the asymptotic sampling distribution of \hat \delta is \hat \delta \approx N\Big(\delta, \frac{\theta_0(1-\theta_0)}{n_0} + \frac{\theta_1(1-\theta_1)}{n_1} \Big)
Finally the estimated standard error is (by substitution) \text{se}(\hat \delta) = \sqrt{\frac{p_0(1-p_0)}{n_0} + \frac{p_1(1-p_1)}{n_1}}

Synthesis

The estimate of the difference is \hat \delta = 0.0004071005 with \text{se} = 0.00006047412. The difference is more than 6 times the standard error.
1-\alpha confidence interval for the difference
\hat \delta \pm z_{1-\alpha/2} \text{se}(\hat \theta) that gives (0.0002835980, 0.0005306031).

Note The 95% confidence interval does not contain the zero value. Thus in the long run 95% of the intervals show that \theta_0> \theta_1, i.e., \theta_\text{placebo} > \theta_\text{vaccine}.

Relative risk

The relative risk (or risk ratio) of the polio for placebo versus vaccine is \text{RR} = \frac{\theta_0}{\theta_1}
The estimator of the relative risk is \text{rr} = p_0/p_1 = 0.000571/0.000164 = 3.48 This means that the probability of contracting the polio in the placebo group is more than three times that of the vaccine group.
Notice that the RR is more interpretable than the risk difference \delta.
A 95% confidence interval for the relative risk is obtained from the estimate of the \log(\text{RR}) and its standard error (calculations omitted) \begin{aligned} \log(\text{rr}) &= \log(3.48) = 1.2470323 \\ \text{se} &= \textstyle \sqrt{\frac{1}{a} + \frac{1}{c} - \frac{1}{a+b} - \frac{1}{c+d}} \end{aligned} where a,b, c, d are the counts in the 2x2 table \begin{array}{|l|cc|c|}\hline & 0 & 1 & \text{Total} \\ \hline 0 & a & b & a + b\\ 0 & c & d & c + d \\ \hline \end{array} Then the confidence limits are \exp(\log (\text{rr}) \pm 1.96 \cdot \text{se})
The result can be obtained by a simple R function.

R function

ci95_RR <- function(a, b, c, d){
  rr <- (a/(a+b)) / (c/(c+d))
  se_log_rr <- sqrt(1/a + 1/c - 1/(a+b) - 1/(c+d))
  L <- exp(log(rr) - 1.96 * se_log_rr)
  R <- exp(log(rr) + 1.96 * se_log_rr)
  cat("RR = ", round(rr, 4))
  cat("95% CI: (", round(L, 4), round(R, 4), ")")
}
a <- 115; b <- 201114 ; c <- 33;  d <- 200712
ci95_RR (a,b,c,d)

RR =  3.476595% CI: ( 2.3608 5.1194 )

Appendix on the SE of the log relative risk*

The logarithm of the relative risk (call this \rho) \log(\rho) = \log(\theta_0) - \log(\theta_1) and the estimator is \log(\hat \rho) = \log(p_0) - \log(p_1)
We know that for large samples the two estimators p_0 and p_1 are independent and are asymptotically normal

\begin{align*} p_0 &\approx N\left(\theta_0, \frac{\theta_0(1-\theta_0)}{n_0}\right) \\ p_1 &\approx N\left(\theta_1, \frac{\theta_1(1-\theta_0)}{n_1}\right) \end{align*}

The variance of \log \hat \rho is \begin{align*} \mathrm{var}(\log \hat \rho) &= \mathrm{var}(\log p_0 - \log p_1)\\ &= \mathrm{var}(\log p_0) +\mathrm{var}(\log p_1) \end{align*}
To calculate the two variances we use a useful trick called the delta method. If you have an estimator T_n that has an approximately normal sampling distribution then the sampling distribution of a transformation g(T_n) (like for example the logarithm) is also approximately normal.

Theorem (delta method)

If T_n \approx N(\theta, \sigma^2/n) then g(T_n) is approximately normal with mean g(\theta) and variance [g'(\theta)]^2 \sigma^2/n.

In the formula g'(\theta) is the derivative. Therore g must be twice differentiable
By applying the delta method to our problem we get that the asymptotic variances of \log p_i for i = 0,1 are \begin{aligned} \mathrm{var}(\log p_i) &= \frac{\theta_i(1-\theta_i)}{n_i}\left(\frac{d}{d\theta}\log \theta_i\right)^2\\ &= \frac{\theta_i(1-\theta_i)}{n_i}\frac{1}{\theta_i^2}\\ &= \frac{1-\theta_i}{n_i\theta_i} \end{aligned} Therefore \text{SE}(\log \hat \rho) = \sqrt{\frac{1-\theta_0}{n_0\theta_0}+ \frac{1-\theta_1}{n_1\theta_1}}

And finally we get the estimated standard error by substituting to \theta_0 and \theta_1 p_0 = \frac{a}{a+b}, \quad p_1 = \frac{c}{c+d}

The result is \begin{aligned} \text{se}(\log\hat\rho) &= \sqrt{\frac{b}{a(a+b)} + \frac{d}{c(c+d)}}\\ &= \sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}. \end{aligned}

2.7 Exercises

Assume that we have an IID sample from a N(\mu, \sigma^2). We know that an unbiased estimator of the population variance \sigma^2 is S^2 = \frac{1}{n-1}\sum_{i= 1}^n (X_i - \bar X)^2.
1. What is the bias of the estimator of \sigma^2 \hat \sigma^2 = \frac{1}{n}\sum_{i= 1}^n (X_i - \bar X)^2 ?
2. Does the second estimator tend to underestimate or overestimate \sigma^2?
3. The bias is a number or a function?
4. Can you say if the estimator S^2 is consistent?
(Freedman) A simple random sample of 3500 people age 18 or over is taken in a large town to estimate the percentage of people (age 18 and over in that town) who read newspapers. It turns that 2487 people in the sample are newspaper readers. Calculate a 95% confidence interval for the percentage.
Consider the dataset fiji.R Consider the response E: education (in number of years) and the groups defined by the variable I (etnicity: 1= indian, 0=fijian). Assume that the observations E satisfy the assumption of the two-sample model.

Find a 95% confidence interval for the difference between the average number of years between the two groups.

Suggestion: read section 2.5. The approximate 95% confidence interval for the difference of the means \theta = \mu_1 - \mu_0 is (\hat\theta_n - 1.96 \text{se} < \theta < \hat\theta_n - 1.96 \text{se} ) where \hat\theta_n = \bar{Y}_1 - \bar{Y}_0 and \text{se}(\hat \theta_n) =\sqrt{\frac{\hat\sigma_0^2}{n_0} + \frac{\hat\sigma_1^2}{n_1}}.

(Dobson) The weights, in kg, of a IID sample of 20 men before the participation in a “waist loss” are observed. Then, the same men are observed after 12 months. These are the data.

      man before after
 [1,]   1  100.8  97.0
 [2,]   2  102.0 107.5
 [3,]   3  105.9  97.0
 [4,]   4  108.0 108.0
 [5,]   5   92.0  84.0
 [6,]   6  116.7 111.5
 [7,]   7  110.2 102.5
 [8,]   8  135.0 127.5
 [9,]   9  123.5 118.5
[10,]  10   95.0  94.2
[11,]  11  105.0 105.0
[12,]  12   85.0  82.4
[13,]  13  107.2  98.2
[14,]  14   80.0  83.6
[15,]  15  115.1 115.0
[16,]  16  103.5 103.0
[17,]  17   82.0  80.0
[18,]  18  101.5 101.5
[19,]  19  103.5 102.6
[20,]  20   93.0  93.0

Do these data satisfy the assumptions of the two-sample model? Yes or No and explain.

(Freedman) A large university takes a simple random sample of male students and another simple random sample of females. The sample sizes are 200 and 300. It turned out that 107 of the men in the first sample use a tablet regularly, compared to 132 of the women in the second sample.

Find a confidence interval for the difference between the percentages.
In your opinion the difference is real or is a chance variation?

3 Hypothesis testing

3.1 Basic idea

We start from a study concerning a substantive research hypothesis, like for example the study concerning the starting salaries of men and women.
We have a kind of default theory. In the example of sex discrimination, mean salaries for men and women should be the same.
The default theory must be converted into a statistical hypothesis that makes statements about population parameters, called the null hypothesis. In the example

H_0: \theta = \mu_{\text{M}} - \mu_{\text{F}} = 0

We can also identify an alternative hypothesis H_1 that specifies the directions of the departures from H_0. For example H_1: \theta > 0
Then we ask if the data provide sufficient evidence to reject the theory. If not, we retain the null hypothesis.
The evidence depends on some test statistic T = t({\boldsymbol{Y}}) that measures the difference between the data and the hypothesis. For instance, the evidence of sex discrimination can be measured by the difference between the mean salaries of males and females. In this case

\small \begin{cases} \text{if $T$ is large} & \text{ reject } H_0 \\ \text{if $T$ is small} & \text{ mantain } H_0 \end{cases}

Another example

In the study concerning the gender of the births with placenta previa, we could ask if the proportion of males \theta is 1/2. Then we can compare a null hypothesis H_0 : \theta = \textstyle\frac{1}{2} with an alternative hypothesis H_1: \theta \ne \textstyle\frac{1}{2}.
It seems reasonable to reject H_0 if the difference in absolute value |\text{estimate} - \text{parameter}| = |\hat \theta - \textstyle \frac{1}{2} | is large.

Approaches to testing

We can distinguish two different approaches to testing:

The Fisher’s approach is directed to evaluate the evidence against a single hypothesis H_0 measured by a normalized value of the distance of the data and the hypothesis, called p-value
The Neyman and Pearson’s approach considers two hypotheses H_0 and an alternative H_1 and use a decision procedure on the basis of data, either to accept or reject H_0.

3.2 The Fisher approach

The procedure consist in defining a test statistic, that is a function of the data measuring the difference t({\boldsymbol{Y}}) between the data and what is expected on the null hypothesis. Typically,
t({\boldsymbol{Y}}) = \frac{\text{observed} - \text{expected}}{\text{se}}
The observed value of the test statistic is denoted by t_{\text{obs}}.
Then we calculate the chance of getting a test statistic as extreme or more extreme than t_{\text{obs}}, computed on the basis that H_0 is right:
p = \Pr(t({\boldsymbol{Y}}) \ge t_{\text{obs}}; H_0)

This probability is called the observed significance level or briefly the p-value.

Test concerning the two samples of salaries

Start from a general model assumed to be the generating process of the observations Y_i = \mu + \theta X_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \text{ for } i = 1, \dots, n where \theta = \mu_1 - \mu_0 is the parameter of interest and X_i is just an indicator of the two samples.
Define a null hypothesis H_0: \theta = 0 asserting that the two populations are identical: Y_i = \mu + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) meaning that the mean salaries are equal.
An alternative hypothesis can be H_1: \theta > 0 so that \mu_{\text{1}} > \mu_{\text{0}} (men salaries > women salaries)
Define a test statistic (said the Wald test statistic) t({\boldsymbol{Y}}) = W = \frac{\bar Y_1 - \bar Y_0}{\text{se}(\hat\theta)}

that is the difference between the means divided by the estimated standard error, calculated with the pooled variance.

In our data about salaries the value of the test statistic is w_{\text{obs}} = \frac{818}{130}= 6.3
To evaluate if this value is small or large we calculate the p-value: p_\text{obs} = \Pr(W \ge 6.3; H_0).
To get this probability we should find the null distribution of W
Fortunately we know that W under the null hypothesis has t-distribution (thanks to William Gosset said “Student”) with parameter n-2.
We will give more details later, but for the moment we just plot it. The following figure shows the density of a T_{n-2}, with n = n_0 + n_1 = 93

As you see the curve is similar to the standard normal.

The red dot, representing the value w_{\text{obs}} = 6.3, is quite far in the right tail so that the p-value is extremely small.
To find the p-value for the test of \theta=0 we calculate the left-tail area with R using the function pt() that finds the area under the normal in the interval (6.3, \infty):

pt(6.3, df = 91, lower.tail = FALSE)

[1] 5.202414e-09

A p-value close to 0 gives an indication that a value of w_{\text{obs}} is unlikely under the hypothesis. In this sense the smaller is the p-value the larger is the evidence against the hypothesis.

Traditionally, the p-value is reported on a rough scale: \small \begin{array}{cll} \text{P-value} & \text{Interpretation} & \text{Jargon}\\ \hline \le 0.01 & \text{strong evidence against } H_0 & \text{highly significant} \\ \bumpeq 0.05 & \text{moderate evidence against } H_0 & \text{significant}\\ > 0.1 & \text{no evidence against $H_0$ } & \text{non-significant}\\ \hline \end{array}

A non-significant test does not imply that the data confirm H_0.
The p-value is NOT the probability that the null is true: in frequentist inference there isn’t any concept of probability of hypotheses.

3.3 Neyman-Pearson’s approach

A technological problem

A manufacturing plant produces batches of a chemical using a standard production method (A). Now they try a modified method (B) that is supposed to give a higher yield
The experimenters verify that the batches could be supposed independent and that the order of the runs has no influence on the results.
Problem: decide using an experiment whether method (B) gives significantly higher yields than method (A).

Procedure

Define the experiment: use a completely randomized experiment by making in sequence 10 batches of a chemical using the standard production method (A) followed by 10 batches using the modified method (B).
Assume the data are generated by a two-sample model Y_i = \mu_A + \theta X_i + \varepsilon_i, \quad \varepsilon_i \sim_{\text{ ind}} N(0, \sigma^2) where X_i=0 if chemical i is produced by method (A) and X_i = 1 if is produced by method (B)
Define the null and alternative hypotheses. In general they define a partition of the parameter space \Theta: H_0 : \theta \in \Theta_0 \text{ versus } H_1: \theta \in \Theta_1 For instance : H_0: \mu_B - \mu_A = \theta = 0 \text{ vs } H_1: \theta > 0
A hypothesis is said simple if it defines a single distribution. Otherwise it is said composite

Note: In the previous example both hypotheses are composite!

Decision Rule

Before collecting the data define an appropriate subset R in the sample space called the rejection region such that the decision rule is \small \begin{aligned} {\boldsymbol{Y}} \in R &\implies \text{ reject } H_0\\ {\boldsymbol{Y}} \not\in R &\implies \text{retain (do not reject) } H_0 \end{aligned}
Usually the rejection region is defined by \small R = \{ {\boldsymbol{Y}} : t({\boldsymbol{Y}}) > c\} where t({\boldsymbol{Y}}) is a test statistic and c is called the critical value.
Errors: Two types of error result from the decision rule:
- type I error: reject H_0 when it is true (convict an innocent)
- type II error: retain H_0 when it is false (absolve a guilty)

Note Unfortunately if we can reduce the type I error, the type II error increases and viceversa.

Power and size

A specific distinction from the Fisher’s approach is the definitions of power function and size of the test.

The power function is the probability of rejecting H_0 for all values of the parameter \theta
The size of the test, called \alpha, is the maximum probability of type I error when the value of the parameter \theta satisfies the null hypothesis.

Most powerful test

Definition: The optimal decision rule for Neyman and Pearson consists in finding the rejection region t({\boldsymbol{Y}}) > c with the highest power under the alternative H_1 among all regions with fixed size \alpha

The famous Neyman and Pearson lemma shows that in some special cases the most powerful test exists.
Unfortunately, finding this optimal test is rather hard !
However, there are several tests that come close to this ideal.
In the rest of the chapter I will just list some relevant examples.

3.4 Some useful tests

The t-test for the two-sample model

We met before the two-sample model with normal errors. The t-test concerns the test about the difference of the two means \theta = \mu_1 - \mu_0.

Two-sided test H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0
One-sided test (right) H_0: \theta \le \theta_0 \text{ versus } H_1: \theta > \theta_0
Wald test statistic

W = \frac{\hat \theta - \theta_0}{\text{se}} \quad \text{where }\quad \hat \theta = \bar Y_1 - \bar Y_0

Null distribution: is a t distribution with parameter (degrees of freedom} n-2:

W \sim T_{n -2}.

Rejection region with size \alpha \begin{cases} \text{ for the two-sided hypothesis: } & |W| > t_{1-\alpha/2} \\ \text{ for the one sided hypothesis: } & W > t_{1-\alpha} \end{cases} where t_{1-\alpha/2} and t_{1-\alpha} are the quantiles of the T_{n-2} distribution, shown in the figure below

The experiment of chemical production

The two samples of chemical batches are

YA <-c(89.7,81.4,84.5,84.8,87.3,79.7,85.1,81.7,83.7,84.5)
YB <-c(84.7,86.1,83.2,91.9,86.3,79.3,82.6,89.1,83.7,88.5)

with a boxplot is

group = factor(c(rep("A", 10), rep("B", 10)))
Y = c(YA, YB)
boxplot(Y ~ group,  horizontal = TRUE)

The test statistic is calculated as follows. The estimate of \theta is \hat \theta = \bar y_B - \bar y_A = 85.54 - 84.24 = 1.3 the estimate of the variance \sigma^2 is the pooled variance \begin{align*} s^2_p &= \frac{\Sigma_{i\in A} (y_i - \bar y_A)^2 + \Sigma_{i \in B} (y_i - \bar y_B)^2}{n_A + n_B -2} \\ &= 10.87 \end{align*} and the estimate of the standard error is \text{se} = \sqrt{s^2_p \left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = 1.47 Therefore, the test statistic for H_0: \theta = 0
w_{obs} = \frac{1.30 - 0}{1.47} = 0.88
The rejection region for the one-sided test H_1: \theta > 0 of size \alpha = 0.05 is W > t_{18, 0.05}. The region can be found with the R function qt:

cat("0.05 upper quantile:",  qt(1-0.05, df = 18), "\n")

0.05 upper quantile: 1.734064

Thus the rejection region consists of all the samples that have the Wald statistic W > 1.734. As w_{obs} = 0.88 the null hypothesis is not rejected at the level \alpha = 0.05.

The p-value is calculated by

cat("p-value: ", pt(0.88, df = 18, lower.tail = FALSE), "\n")

p-value:  0.1952286

so that the test is not significant at the 5 % level.

Tests are sometimes less useful than CI

Most statisticians believe that tests have been overemphasized in research because they merely indicate whether the specific parameter value in H_0 is plausible.
When a p-value is small, the test indicates that the hypothesized value is not plausible, but tells us little about which potential parameter values are plausible
A confidence interval is more informative because it shows the whole set of believable values. Also it shows if the null is badly false by showing if the values in the interval are far from \theta_0.
That helps in evaluating whether the difference between the true value and the \theta_0 value has practical importance.

Example from Wasserman

A level \alpha test rejects H_0: \theta = \theta_0 if and only if the 1-\alpha CI does not include \theta_0.
In the figure below there are two different confidence intervals.

Both exclude \theta_0 so in both cases the test would reject H_0.
However in the first case the finding is of little scientific or practical value.
In the second case, the finding is of scientific value.

Test for the difference of proportions

Model: Two independent IID samples of sizes n_0 from a \text{Bernoulli}(\theta_0) and of size n_1 from a \text{Bernoulli}(\theta_1)
Null hypothesis: H_0: \delta = \theta_0 - \theta_1 = 0
Test statistic under the null W_n = \frac{\hat \delta - 0}{\text{se}} where
- \hat \delta = p_0 - p_1
- \text{se}= \sqrt{\frac{p_0 (1-p_0)}{n_0} + \frac{p_1 (1-p_1)}{n_1}}
Null distribution for large samples: W_n \approx Z \sim N(0, 1), \quad n_0, n_1 \to \infty
p-value:
- H_1 : \delta \ne 0, p_{obs} = \Pr\{ |W_n| > w_{obs}; H_0 \}
- H_1: \delta > 0, p_{obs} = \Pr\{ W_n > w_{obs}; H_0 \}
- H_1: \delta < 0, p_{obs} = \Pr\{ W_n < w_{obs}; H_0 \}
Rejection region with level \alpha:
- H_0: \delta \ne 0, reject if |W_n| > z_{1-\alpha/2}
- H_0: \delta > 0, reject if W_n > z_{1-\alpha}
- H_0: \delta < 0, reject if W_n < -z_{1-\alpha}

Note: This test can be applied to the 2x2 tables for prospective studies

Prospective studies are of two types:
- Clinical trials: they randomly allocate subjects to the two groups of interest, such as (vaccine and placebo) observing the response variable in future time
- Cohort studies: subjects make their own choice about which group to join (the group of those who go to the fitness center and those who don’t) and the study observes the response variable in future time.

3.5 Likelihood Ratio Test

The Wald test W is useful for scalar parameters.
A more general test is the Likelihood Ratio Test (LRT)

Definition of the LRT

If we want to test {\boldsymbol{\theta }}\in \Theta_0 versus {\boldsymbol{\theta }}\not\in \Theta_0 we can use the test statistic \text{LRT} = 2 \log\left(\frac{ L(\hat {\boldsymbol{\theta}})}{L(\hat {\boldsymbol{\theta}}_0)} \right) where \hat {\boldsymbol{\theta}} is the MLE and \hat{\boldsymbol{\theta}}_0 is the MLE, restricted to the parameter space \Theta_0

Example

Assume that you have an IID sample of size 100 from a Bernoulli with an unknown probability of success \theta. For example, think of throwing a coin 100 times. Then you want to test the hypothesis
H_0: \theta = 1/2 \quad \text{versus } H_1: \theta \ne 1/2
An approximate test for large samples is

W = \frac{p - 0.5}{\sqrt{p(1-p)/n}} \approx N(0,1) \text{ under } H_0.

An alternative is the likelihood ratio test. Given the observed number of successes t = \sum_{i=1}^n x_i calculate the MLE p = t/n and the MLE restricted to H_0 that is simply p_0 = 1/2.
Remembering that L(\theta) = \theta^t(1 - \theta)^{n-t} the log-likelihood is \ell(\theta) = t \log\theta + (n-t) \log(1-\theta)
Finally the log-likelihood ratio is \begin{align*} \ell(p) &= t \log(\hat p) + (n - t) \log(1 - \hat p)\\ \ell(p_0) &= t \log(p_0) + (n - t) \log(1 - p_0)\\ \text{LRT} &=2 \cdot \text{difference} \end{align*}
Suppose that the number of successes is 65 in 100 coin tosses. The computations are shown below in R:

LRT_prop <- function(x, n, p0) {
  # x: number of successes
  # n: sample size
  # p0: proportion under the null
  
  p_hat <- x / n  # MLE
  
  # 
  logL0 <- x * log(p0) + (n - x) * log(1 - p0)
  logL1 <- x * log(p_hat) + (n - x) * log(1 - p_hat)
  
  # Test statistics -2 * log(LR)
  lrt_stat <- -2 * (logL0 - logL1)
  
  # p-value  (1 degree of freedom)
  
  p_value <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
  
  return(list(statistic = lrt_stat, p_value = p_value, estimate = p_hat))
}

# Example: Test H0: pi = 0.5 with 65 successes in 100 rolls
LRT_prop(x = 65, n = 100, p0 = 0.5)

$statistic
[1] 9.140108

$p_value
[1] 0.002500663

$estimate
[1] 0.65

The sampling distribution of the test, under the null hypothesis, is asymptotically a chi-squared with 1 degree of freedom.
Therefore, the test finds evidence of a significant difference from \theta = 1/2

The likelihood theory proves the following theorem:

Theorem

The LRT statistic has a chi-squared asymptotic distribution under the null hypothesis H_0: {\boldsymbol{\theta }}\in \Theta_0.

The degrees of freedom are equal to the dimension of \Theta minus the dimension of \Theta_0.

3.6 Exercises

A person who claims to possess extrasensory perception (ESP) says she can guess more often than not the outcome of a flip of a balanced coin. Out of 200 flips, she guesses correctly 110 times.

What is the nature of the variable to be studied?
Specify the hypotheses H_0 and H_1
Give the test statistic, its distribution under the null and its observed value
What is the p-value?
What is your conclusion?

(Agresti) When the 2000 General Social Survey asked, “Would you be willing to pay much higher taxes in order to protect the environment?” it was of interest to know if the proportion of people answering yes was 0.5.

Specify the null and the alternative hypotheses
Give a test statistic and its sampling distribution under the null hypothesis.
The survey result was that 369 people answered yes and 483 answered no. Apply the test giving the observed test statistic, the p-value and its interpretation in the context.
Calculate a 95% confidence interval for the parameter and explain an advantage of the CI over the significance test.

A recent study compared different psychological therapies for teenage girls suffering from anorexia. Each girl’s weight was measured before and after a period of therapy. The variable of interest was the weight change, defined as weight at the end of the study minus weight at the beginning of the study. In this study, 29 girls received cognitive behavioral therapy directed in identifying the thinking that causes the undesirable behavior and replacing it with thoughts designed to help improve this behavior. The data in pounds (in file `anorexia.csv’) are shown below

anorexia<-read.csv("anorexia.csv")
cbind(before = anorexia[,1], after = anorexia[,2], diff = (anorexia[,2] - anorexia[,1]))

      before after diff
 [1,]   80.5  82.2  1.7
 [2,]   84.9  85.6  0.7
 [3,]   81.5  81.4 -0.1
 [4,]   82.6  81.9 -0.7
 [5,]   79.9  76.4 -3.5
 [6,]   88.7 103.6 14.9
 [7,]   94.9  98.4  3.5
 [8,]   76.3  93.4 17.1
 [9,]   81.0  73.4 -7.6
[10,]   80.5  82.1  1.6
[11,]   85.0  96.7 11.7
[12,]   89.2  95.3  6.1
[13,]   81.3  82.4  1.1
[14,]   76.5  72.5 -4.0
[15,]   70.0  90.9 20.9
[16,]   80.4  71.3 -9.1
[17,]   83.3  85.4  2.1
[18,]   83.0  81.6 -1.4
[19,]   87.7  89.1  1.4
[20,]   84.2  83.9 -0.3
[21,]   86.4  82.7 -3.7
[22,]   76.5  75.7 -0.8
[23,]   80.2  82.6  2.4
[24,]   87.8 100.4 12.6
[25,]   83.3  85.2  1.9
[26,]   79.7  83.6  3.9
[27,]   84.5  84.6  0.1
[28,]   80.8  96.2 15.4
[29,]   87.4  86.7 -0.7

Do a significance t-test for finding the strength of evidence supporting the effectiveness of the cognitive behavioral therapy, that is, to determine whether it results in a positive mean weight change.
Show the hypotheses, the test statistic, the p-value and comment.
Find a confidence interval for the difference in kg.
Do a boxplot of the weight difference. What do you note? Do you think that the data are normally distributed?

For a large supermarket chain in Florida a women’s group claimed that female employees were passed over for management training in favor of their male colleagues. The company denied this claim, saying they picked the employees from the eligible pool at random to receive this training. Also, statewide, the large pool of more than 1000 eligible employees who can be tapped for management training is 40% female and 60% male. Therefore the company says that there is no gender bias because since this program began, of the 40 employees chosen for management training 28 were male and 12 were female. The company says that a test of H_0: \theta = 0.6 against H_1:\theta > 0.6 is not significant.

Check the company’s claim that there is no evidence of gender bias.
Find a confidence level for \theta. Make your comments.

(Agresti). The data below come from the Physicians’ Health Study. This was a five-year randomized study concerning if regular intake of aspirin reduces the chance of heart attacks.

Every other day, the male physicians participating in the study took either one aspirin tablet or a placebo.
The study was blind. The physicians in the study did not know which type of pill they were taking.

\small \begin{array}{lrrr}\\ \hline \text{Group} & \text{Yes} & \text{No} & \text{Total} \\ \hline \text{Placebo} & 189 & 10,845 & 11,034 \\ \text{Aspirin} & 104 & 10,933 & 11,037\\ \hline \end{array}

Test the hypothesis H_0: \delta = 0 versus H_1: \delta \ne 0 with \delta = \pi_1 - \pi_0 is the risk difference between the Aspirin and the Placebo groups.

4 Introduction to linear regression models

4.1 The straight line model

There are situations with n measurements Y_i associated with the values X_i, i = 1, \dots, n of another variable.
For instance, consider systolic blood pressures y_i of n = 12 women and the corresponding ages x_i

x <- c(56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60)
y <- c(147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155)

The data are pairs of observations (X_1, Y_1) , \cdots, (X_n, Y_n). The variable X is called a predictor or a regressor and Y is called the outcome or the response
Notice that the two variables are not on equal standing but, according to a research hypothesis, the predictor is expected to influence the outcome.

Important: the research hypothesis is an assumption

The ages instead are considered as fixed values representing the conditions under which the blood pressures were measured.
Therefore we are interested in the function r(x) = E(Y \mid X = x) for x varying in a realistic interval. The function r(x) is called the regression function

Scatter of the data on age and systolic blood pressure with two superposed regression functions

The above scatter of the two variables suggests that in the range of ages between 35 and 70 the expected blood pressure increases with age with an approximately linear regression function.
We can specify a regression model of this form Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i or a slightly nonlinear model like Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i
The errors \varepsilon_i are usually assumed to be mutually independent, with: E(\varepsilon_i \mid X_i) = 0, \text{ and } \mathrm{var}(\varepsilon_i \mid X_i) = \sigma^2.
The errors are not necessarily normal,
but their variance \varepsilon_i is assumed to be constant
The above linear model, called the simple linear regression model generates the data with 3 parameters: \beta_0, \beta_1 and \sigma^2.
The parameter \beta_1 is the slope of the line, i.e., the difference in the expected blood pressures for two women with an age difference of one year.
The variances \mathrm{var}(Y_i) = \sigma^2 - as we said - are assumed to be constant, so that \sigma is the standard deviation of the differences Y_i - \mu_i.

4.2 Another example

Consider the following data on measurements of abdominal circumference Y taken from 610 fetuses during ultrasound scans at Kings College Hospital, London, at gestational ages X ranging between 12 and 42 weeks.

Of course Y depends on gestational age and specifically the average abdominal circumference grows smoothly when the age grows.

The joint distribution here is shown by this scatterplot with an added estimated regression function.

Abdominal circumference and gestational age

In this case the regression function is close to a line (except perhaps at lower gestational age)
The errors appear to be small, but the assumption of constant variance is not adequate.

4.3 Least squares estimators

If the unknown parameters in the linear model are estimated in some way by \hat \beta_0 and \hat \beta_1 the fitted line is \hat r(x) = \hat\beta_0 + \hat \beta_1 x
The fitted values are \hat Y_i = \hat r(x_i) and the residuals are defined by \hat \varepsilon_i = Y_i - \hat Y_i = Y_i - (\hat \beta_0 + \hat \beta_1 x_i)
The residual sum of squares (RSS) is defined by \text{RSS} = \sum_{i=1}^n \hat \varepsilon_i^2 measure how well the straight line fits the data. Of course the RSS depend on the scale of the measurements
Finally we introduce a general method of estimation of \beta_0 and \beta_1 called the least squares:

Definition

The least-squares estimates are the values \hat \beta_0 and \beta_1 that minimize \sum_{i=1}^n \hat \varepsilon_i^2

Theorem

The least-squares estimates are given by \begin{align*} \hat \beta_1 &= \frac{\sum_i (x_i - \bar x)(Y_i - \bar Y)}{\sum_i (x_i - \bar x)^2} \\ \hat \beta_0 &= \bar Y - \hat \beta_1 \bar x \end{align*} An unbiased estimate of \sigma^2 is s^2 = \frac{\sum_{i = 1}^n\hat\varepsilon_i^2}{n-2}

Example (age - blood pressure)

Assume that the blood pressure satisfy the simple linear regression model with homoskedastic errors.
The LS estimates of the intercept and the slope is found using the previous formulae or using the R function lm() (linear model)

ls <- lm(y ~ x)
ls$coef

(Intercept)           x 
  80.777730    1.138005

that requires the model formula y ~ x.

The residuals \hat \varepsilon_i are obtained by

e <-residuals(ls)
e

        1         2         3         4         5         6         7         8 
 2.493981 -3.573947 -2.714101 -3.745916 -3.472055 -6.263972  6.631986  8.460017 
        9        10        11        12 
-9.021926 11.426053 -6.162081  5.941960

The fitted line is \hat r(x) = 80.8 + 1.14 x, shown in the previous plot.
The estimated variance of the errors is

sum(e^2)/(12-2)

[1] 49.24669

The estimated SD of the errors is 7.02 (in scale of the blood pressure).

Interpretation

Usually the intercept is not interpreted
The fitted slope \hat \beta_1 = 1.14 is interpreted by saying that the difference in blood pressures for two women with an age difference of one year is about 1 mm Hg.

We could even go so far as to say that the difference in blood pressures for two women with an age difference of 10 years is about 10mm Hg (all things remaining the same)

Least-Squares properties

The least squares estimates do not assume normality. They have nevertheless some remarkable properties
Unbiasedness: The expected value of the estimated coefficients equals the true population parameters
Minimum MSE (Efficiency): Within the class of linear, unbiased estimators, the LS estimators have the smallest possible variance, ensuring the highest precision.
Consistency: As the sample size approaches infinity, the LS estimator converges to the true parameter value.
Normal Distribution: If the errors are normally distributed, or if the sample size is large (via the Central Limit Theorem), the LS estimates are normally distributed, allowing for hypothesis testing and confidence intervals.
Estimated standard errors of the estimates are obtained explicitly, for example:

\text{se}(\hat \beta_1) = \frac{s}{\sqrt{\sum_{i=1}^n(x_i - \bar x)^2}}

Confidence Intervals: Standard errors are used to obtain approximate 1-\alpha confidence intervals for coefficients, using the normal quantiles \hat \beta_1 \pm z_{1-\alpha/2} \text{se}(\hat \beta_1) if the the sample size is large.
Impact of Sample Size: As the sample size increases, the standard errors typically decrease, increasing the precision of the estimates.
Impact of the range of the variable X: The larger the range (max-min) of X the better the precision of \hat \beta_1

Maximum likelihood

If we can assume that \varepsilon_i \sim N(0, \sigma^2) we can use the method of maximum likelihood.

Theorem

Under the assumption of normality the MLE of the regression coefficients \beta_0 and \beta_1 coincide with the least-squares estimators.
The MLE of \sigma^2 is not exactly unbiased (as expected) \hat \sigma^2_{\text{ML}} = \frac{\sum_{i = 1}^n\hat\varepsilon_i^2}{n}
All the other properties of the MLE are maintained.

Test of hypotheses for the slope

With normality assumption

The test of H_0: \beta_1 = c versus H_1: \beta_1 \ne c the statistic W = \frac{\hat \beta_1 - c}{\text{se}(\hat\beta_1)} has a T_{n-2} distribution under H_0
The rejection region at level \alpha is |W| = \left|\frac{\hat \beta_1-c}{\text{se}(\hat \beta_1)}\right| > t_{n-2, 1-\alpha/2}.

With large samples

If the assumption of normality is dubious but you have a large sample the previous test statistic has approximately a N(0,1) distribution under the null.
Therefore, the 1-\alpha rejection region is found using the normal quantiles z_{1-\alpha/2}.

Example

The complete inference for the linear regression model for the data on age and blood pressure is shown in the following table

summary(lm(y ~ x))$coef

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 80.777730  9.5437459 8.463944 7.162571e-06
x            1.138005  0.1782089 6.385794 7.976396e-05

From this we have
1. the standard error of the slope \text{se}(\hat\beta_1) = 0.18
2. With this small sample if we believe that the errors are normal, the slope is significantly different from 0 because, for \alpha = 0.05, |W|= 6.386 > t_{8, 1-\alpha/2} = 2.31
3. A confidence interval is

confint(lm(y ~ x))

                 2.5 %     97.5 %
(Intercept) 59.5129389 102.042521
x            0.7409311   1.535079

As usual the intercept can be disregarded. The 95% confidence interval for the slope is (0.74, 1.54) giving an idea of precision of the estimator.

Predictor effect plot

This plot defines a point-wise confidence envelope that represent the uncertainty band around the estimated regression line.

library(effects)
 plot(predictorEffects(lm(y~x), ~ x), main = "")

The envelope is formed by a set of confidence intervals calculated independently for each value of the predictor.
It is not a band that must contain the entire true line. But for each single point chosen on the horizontal axis is calculated a confidence interval for the associated fitted value
Given a point x_0 of the age, the fitted value \hat y_0 = \hat \beta_0 + \hat \beta_1 x_0 has an estimated standard error \text{se}(\hat y_0) = s\sqrt{\frac{1}{n} + \frac{(x_0-\bar x)^2}{\sum_{i=1}^n(x_i - \bar x)^2}} Then the confidence envelope is defined by the limits \hat y_0 \pm z_{1-\alpha/2} \text{se}(y_0)
The envelope does not have a constant width: In the center the band is narrower because the estimate is more accurate. At the end the band widens as we move away from the mean: the greater the distance from the mean, the greater the standard error.

4.4 Model criticism

After fitting the model we can use the residuals to criticize our model specification
Residual Analysis: Residuals (observed - predicted values) should be randomly distributed around zero. The residuals should be independent from X.
Non-linearity: A pattern in the residual plot suggests a non-linear relation.
Heteroscedasticity: Particular shapes in residuals indicates non-constant error variance.
Normality: Q-Q plots are used to check if residuals are normally distributed.

Example (gestational age)

Assuming a simple linear regression model with homoscedastic errors the regression coefficients are

summary(lm(y ~ x, data = abdom))$coef

             Estimate Std. Error   t value      Pr(>|t|)
(Intercept) -55.17951  2.0010031 -27.57593 3.401592e-109
x            10.33823  0.0700963 147.48606  0.000000e+00

The fitted line \hat y = -55.18 + 10.34 x means that the circumference of the abdomen increases of about 1 cm per week

Residuals vs fitted values: The shape of the points on the left plot shows that the residuals are not randomly distributed. This is a sign of a misspecification of the regression function. The red curve (a fitted non-parametric function) suggests a nonlinear regression function.
Moreover, in the same plot, the dispersion of the points around the regression function tends to increase as gestational age increases, a clear sign of heteroskedasticity

The second plot to the right is a quantile-quantile plot. The genearal idea is to plot the sample quantiles for the residuals vs the theoretical quantiles of a standard normal. If the residuals have a normal distribution the points in the plot should be aligned on a straight line. In this case the residuals are somewhat aligned in the center of the distribution, but not in the tails.

4.5 Goodness of fit

A very popular index to be used in regression is the coefficient of determination called briefly R squared.
The definition is R^2 = \frac{\sum_{i=1}^n(\hat y_i - \bar y)^2}{\sum_{i=1}^n(y_i - \bar y)^2} = \frac{\text{ESS}}{\text{TSS}} This is the ratio of the Explained Sum of Squares and the Total Sum of Squares.
The following result proves that the explained sum of squares is always lower than the total sum of squares.

Theorem

The residual sum of squares is \text{RSS} = \text{TSS} - \text{ESS}

The interpretation is that the total sum of squares \text{TSS} can be partly “explained” by the fitted regression \text{ESS}.
Therefore R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} and of course 0\le R^2 \le 1.
If R^2 is close to 1 then the points are close to the line ad the residuals are small
If R^2 is close to 0 then the points are more dispersed

Warning

A large value of R^2 does not mean that the model chosen is correct.

A small value of R^2 does not mean that the model is wrong.

The R^2 in the previous examples

Blood pressure

x <- c(56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60)
y <- c(147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155)
summary(lm(y ~ x))$r.squared

[1] 0.8030658

This means that about 80% of the variability of the response is explained by a linear regression

Measures on the fetus

summary(lm(y ~ x, data = abdom))$r.squared

[1] 0.9728088

This means that 97% of the variability of circumference of the abdomen of a fetus is explained by the weeks of gestation.
The coefficient of determination is quite close to 1 but ew discovered that the straight line model is not totally right especially because we wrongly assumed the homogeneity of the error variances.

Example (fertility depending on education)

Let consider now the data of the Fiji Fertility Survey not as an example of a complete population but as a sample.
Then let’s study the response number of children as a function of education within a stratum of all married women 40-49 at the time of the survey, in order to obviate the need for a control for woman’s age.

The fitted lines show that, marginally, fertility decreases with education.
The table of the regression coefficients is obtained with

  sel <- (fiji$A <=49) & (fiji$A >= 40)
  fiji1005 <- fiji[sel,]
  lsE <- lm(F ~ E, data = fiji1005)
  summary(lsE)$coef

              Estimate Std. Error   t value      Pr(>|t|)
(Intercept)  6.8854869 0.15448462 44.570695 4.227095e-240
E           -0.1389875 0.03179211 -4.371759  1.360988e-05

The coefficient of determination is

  summary(lsE)$r.squared

[1] 0.0186988

Comments

The regression line is \widehat{F} = 6.9 - 0.14\cdot E
The standard error of the slope is 0.0318.
The test of the hypothesis \beta_1 = 0 is rejected at level 0.05 because |W| = |-4.372| > 1.96 \quad (\text{ assuming normality because $n$ is large})

The p-value is

2*(1 - pnorm(abs(-4.372)))

[1] 1.231135e-05

The p-value in the table of regression coefficients is different, because the function lm() assumes that the errors are exactly normal. In that case the test statistic has an exact null distribution T_{n-2}. Thus
```
 2*(1 - pt(abs(-4.371759), df = 1005-2))
```
```
[1] 1.360988e-05
```
Notice that despite a quite small R^2 = 0.019 the slope of the regression is significantly different from zero

Example (Fertility vs education, given etnicity)

The regression line can be different if we consider etnicity I where 1 = Indian and 0 = Fijian.

  par(mfrow = c(1,2))
  data0 = fiji1005[fiji1005$I == "0",]
  data1 = fiji1005[fiji1005$I == "1",]
  plot(F ~ jitter(E), data = data0, main = "Fijian", ylim = c(0, 16))
  abline(lm(F ~ E, data0), lwd = 3, col = "red")
  plot(F ~ jitter(E), data = data1, main = "Indian", xlim = c(0,20), ylim = c(0,16))
  abline(lm(F ~ E, data1), lwd = 3, col = "darkgreen")

In this case the slopes of the regression lines are different.

summary(lm(F ~ E, data0))$coef

              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 5.73675977 0.38768403 14.7975136 1.223775e-40
E           0.02132646 0.06398459  0.3333062 7.390588e-01

summary(lm(F ~ E, data0))$r.squared

[1] 0.000247362

summary(lm(F ~ E, data1))$coef

              Estimate Std. Error  t value      Pr(>|t|)
(Intercept)  7.0883413 0.16942985 41.83644 2.565648e-173
E           -0.1373248 0.04708628 -2.91645  3.684353e-03

summary(lm(F ~ E, data1))$r.squared

[1] 0.01517501

The slope for the Fijians is positive but not significant while that for the Indians is significantly negative
Both R^2 are quite small.

4.6 Exercises

In the model Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i what is the characteristic of \beta_1? Options: (you can choose more than one): observable, unobservable, a parameter, a random variable.
Suppose you study the relation between altitude and temperature. The data pairs0001.txt in Moodle contain the variables: temperature (average) in degrees Celsius and altitudein meters taken at 349 stations in Germany.
1. Read the data in R with the command dati <- read.table("pairs0001.txt" header = TRUE)
2. Choose the role of the variables: in your opinion which is the response?
3. Transform the altitudes in km. Then fit a simple linear regression model.
4. Give a general comment of your analysis. For example give the estimates of the regression coefficients and a confidence interval for the slope.
5. Discuss if the model is well specified using some plots.
6. Can you suggest corrections to get a better fit?
True or false and justify:
1. If R^2 = 0 the fitted regression line has a zero slope.
2. The fitted values have a sum equal to n \bar y.
3. If the coefficient of determination is zero, it is certain that the response Y does not depend in any way on the predictor X.
(Weisberg) The data ftcollinssnow.R in Moodle, contain the variables

Early: September to December snowfall, inches
Late: January to June snowfall, inches

both measured in inches at Ft. Collins, Colorado (Colorado Climate Center, 2012).

The question is: can early season snowfall from September 1 until December 31 predict snowfall in the remainder of the year, from January 1 to June 30?

To answer, use a simple linear regression model and test if the estimate of the slope is significant. The data frame ftcollinssnow is obtained in the workspace with the command: source("ftcollinssnow.R")

(Freedman) You have a model Y_i = \beta_0 + \beta_1 x_i+\varepsilon_i with \varepsilon_i \sim N(0, \sigma^2) IID. Your theory is that \beta_1 \ne 0. You fit the model using least-squares to your sampled data that have size n = 57 and get \hat \beta_1 = 3.79 with a standard error \text{se} = 1.88. True or False and explain:
1. for testing the null hypothesis that \beta_1 =0 you get a test statistic t = 2.02 (rounded)
2. \hat \beta_1 is statistically significant
3. \hat \beta_1 is highly significant
4. The probability that \beta_1 \ne 0 is about 95%
5. The probability that \beta_1 = 0 is about 5%
6. If the model is right and \beta_1 = 0 there is about a 5% chance of getting |\hat \beta_1/ \text{se}|>2
7. If the model is right and \beta_1 = 0 there is about a 95% chance of getting |\hat \beta_1/ \text{se}|<2
8. You can be about 95% confident that \beta_1 \ne 0
9. The test shows that the model is right
10. The test assumes that the model is right.

References

The lectures are mainly based on:

Wasserman, L. (2004). All of Statistics. New-York: Springer-Verlag.
The examples discussed come from:

Agresti, A., Franklin, C. (2013). Statistics. The art and science of learning from data. Boston: Pearson Education, Inc.

Dobson A. J., Barnett, A. G. (2018). An Introduction to Generalized Linear Models,4th ed., Boca Raton FL: CRC Press

Freedman, D., Pisani, R., Purves, R. (2007). Statistics 4th ed. New York: Norton & Company

Gelman, A., Carlin, J.B., Stern, H. S. & Rubin, D. B. (2) Bayesian data analysis, Boca Raton FL: Chapman & Hall/ CRC

Ramsey, F. L., Shafer, D. W. (2013). The statistical sleuth, 3rd ed. Boston: Brooks/Cole

Retherford, R. D., Choe, M. K. (1993). Statistical models for causal analysis. New York: Wiley

Weisberg, S. (2014) Applied linear regression, Hoboken: John Wiley & Sons.