Statistical Inference Lectures

Master Data Science and Statistical Learning

Author

Affiliation

Giovanni M. Marchetti

DISIA, University of Florence

Published

February 1, 2026

1 Introduction

Statistical thinking concerns the relation of quantitative data to a real-world problem, often in the presence of variability and uncertainty. It attempts to make precise and explicit what the data has to say about the problem of interest

— Mallows (1998) The zeroth problem

The starting point in Statistics is a scientific enquiry, an investigation concerning a substantive question.

The first problem (the zeroth problem) to be studied is then to understand what is the relevant population.

In other words one has to decide what the relevant data are, and how these relate to the purpose of the statistical study.

1.1 An example

One important issue in demography is fertility (the number of children ever born of a woman) and its relation with other factors. Some of the basic questions are:

What is the population to be investigated and how should we collect data?
Should we measure fertility could be measured for each woman at a given age or at a given point in time?
Should we choose as relevant women population for a specific region or for a selected ethnic group?
Can we distinguish categories of variables, i.e. responses, intermediate variables, background variables, etc.
Do we have causal hypothesis concerning fertility?

Fiji Fertility Survey

The Fiji Fertility Survey (which is part of the World Fertility Survey, see Fiji Fertility survey, 1974) collected data of fertility and other relevant factors.

The data below are a subset of the full data set, regarding 1703 ever-married women Aged 20-49 in 1974 and concerning the variables:

F, fertility (number of children), the main response,
A, age in years,
M, age at the first marriage in years,
E, education in completed years
U, a binary indicator of urban residence
I, a binary indicator of indian ethnicity

A sample from the dataframe

source("fiji_data.R") 
set.seed(457)
X <- fiji_data
x10 <- X[sample(1703, 10, replace = TRUE), ]
x10

      F  A  M  E I U
56    5 35 14  4 1 1
1544  7 42 20  0 1 0
931  11 47 12  0 1 0
1664  4 35 23 10 1 0
314   4 40 21  4 0 0
1572  7 44 16  0 1 0
1678  7 46 24  8 0 0
480   4 40 22  5 0 0
1630  6 48 17  3 1 1
1412  0 47 20  0 1 1

Explicit what we know about the problem

Fertility is a primary outcome depending on age, age at first marriage, education, and residence and ethnicity
The variables M and E are intermediate variables, while A, U and I can be interpreted as background variables that may be considered fixed and on the same standing.
One reasonable ordering of the variables is the following \left\{ \begin{array}{c} I\\ U \\ A \end{array} \right\} \quad \{E\} \quad \{M\} \quad\{F\}

Each variable potentially depends on the variables to the left.

2 Estimation and sampling distributions

What’s the meaning of statistical inference?

A statistical inference will be defined as a statement about statistical populations made from given observations with measured uncertainty.

An inference in general is an uncertain conclusion.

Two things mark out statistical inferences.

First the information on which they are based is statistical, i.e. consists of observations subject to random fluctuations.
Secondly we explicitly recognize that our conclusion is uncertain and attempt to measure, as objectively as possible, the uncertainty involved.

—Cox (1958)

2.1 Inference with survey sampling approach

The survey sampling approach is typically used for finite populations
The focus is on population and sub-population means, or totals, of important variables.
The simplest problem of inference is how to relate the sample mean \bar x of a variable X, from a random sample of size n, to the population mean \mu in the finite population

Inference is this case is based on the repeated sampling principle:

Statistical procedures should be evaluated on the basis of their behaviour in hypothetical repetitions of the experiment that generated the original data.

This principle is the basis of the frequentist approach

Example

The previous sample of size n = 10 is a simple random sample with replacement i.e. we know that the observation are

mutually independent and
identically distributed for each variable.

To estimate the mean of the fertility \mu_F in the population we use the sample mean (called a statistic) \bar x_F = \frac{1}{n} \sum_{i=1}^n x_i

The mean \bar x_F of the sample and \mu_F of the population are in general different

mu <- mean(X[, "F"])
xbar <- mean(x10[,"F"])

\bar x_F = 5.5, \quad \mu_F = 5.9

We need a measure of the uncertainty of the estimate \bar x_F !

2.2 Measure of the error of the estimator

According to the repeated sampling principle we measure the uncertainty as follows.

We consider the means in repeated random samples, denoted by the random variable \bar X_F.
The random variable \bar X_F has a probability distribution that can be studied. It is called the estimator of \mu_F
We measure the uncertainty of the estimator with the Mean Squared Error (MSE)

MSE = E\{(\bar X_F - \mu_F)^2\}
The square root of MSE, called the standard error, is the measure of uncertainty of the estimator.

2.3 Formula

There is a formula valid for simple random samples with replacement \begin{align*} E(\bar X_F) &= \mu_F \\ MSE &= \mathrm{var}(X_F) = \sigma^2 / n \\ \end{align*} where \sigma^2 is the variance of the population.
The standard error of the estimator of the mean is therefore SE (\bar X_F) = \sigma/\sqrt{n}

2.4 Estimated Standard Error

The standard error is however an unknown parameter because the standard deviation \sigma concerns the population that is unkown!
To get the standard error you need to estimate \sigma from the same sample using the estimator

s_F = \sqrt{\sum_{i=1}^n (X_i - \bar X_F)^2/(n-1) }

The estimate of the standard error is \text{se}(\bar X_F) = s_F/\sqrt{n}.

Example

The sample standard deviation of fertility in our sample is

sf <- sd(X[,"F"])
se <- sf/sqrt(10)
cat("Estimated SE: ", se, "\n")

Estimated SE:  0.9737265

Our inference about the mean fertility is summarized by \bar x_F = 5.5,\quad (\text{se} = 0.91)

meaning that the error is about 1 child.

Graphical presentation

Consider the true distribution of the population and our estimate.

p <- table(X[,"F"])/1703
plot(0:17, p, type= "h", lwd = 5,
     col = "RoyalBlue", axes = FALSE,
     xlab = "X", xlim = c(0, 20))
axis(1, pos = 0)
axis(2, pos = 0)
points(5.5, 0, pch =19, cex =1.5, col  = "red")

The standard deviation is \sigma = 3.1 but the standard error of the estimator is \text{se} = 0.9.

2.5 Model-based inference theories

Often we start not from a finite population but from a statistical model i.e., from an assumption concerning the distribution of an infinite population
For example, we assume that the distribution of the population is Gaussian (Normal) and that our sample is a random sample from that population with unknown mean \mu and standard deviation \sigma
In this example, as before we want to estimate these two unknown parameters \mu and \sigma.
The model-based theory of inference are essentially two:
- the frequentist theory (Fisher, Neyman and Egon Pearson) seen before
- the Bayesian theory, (Bayes, Laplace, Jeffreys) the useful for more complex models and data.

Basic ingredients

Both theories are based on

a research question to be answered
a design D for the study; and
a probability model f(x ; \theta) for the random variable X, which is used to represent the data produced by the study
The parameter \theta can be a scalar or a vector belonging to a given parameter space.
If the model can be defined by a finite number of parameters is said a parametric model.

2.6 The Bernoulli model

In this model the data are binary x = 0,1, where 0 indicates failure and 1 success with probability mass f(x; \theta) = \begin{cases} 1-\theta & \text{ if } x = 0 \\ \theta & \text{ if } x = 1 \end{cases}
Assume that the design is a random sample (X_1, \dots, X_n) where the variables are independent and identically distributed (i.i.d)
The parameter \theta is the probability of success, and assumes values in a parameter space 0 <\theta<1.
The mean of the Bernoulli is also equal to \theta
The variance is \theta(1-\theta)

Example (Placenta previa)

Placenta Previa is a complication in the second half of pregnancy where the placenta attaches inside the uterus in a position that partially or completely covers the cervical opening.

Suppose that we have data from a random sample in the population of placenta previa births in Germany, and we are interested in the proportion of female births.
Therefore we can specify a Bernoulli model with parameter \theta equal to the probability of a female birth.
If the size of the sample is n = 980 and the number of females in the sample is s = 437 we can estimate \theta as \hat \theta = 437 /980 = 0.446

2.7 The standard error of \hat \theta

The first inference is to obtain the standard error of the estimator.
In general, given an i.i.d sample (X_1, \dots, X_n) from the Bernoulli, the proportion of successes is \hat \theta = \frac{1}{n} \sum_{i = 1}^n X_i = \frac{T}{n} where T is the number of females in the sample.
The sampling distribution of T has a Binomial distribution with mean E(T) = n\theta and variance \mathrm{var}(T) = n\theta(1-\theta).
The MSE of \hat \theta is equal to the variance \mathrm{var}(\hat \theta) = \mathrm{var}(T/n) = \frac{1}{n^2}\mathrm{var}(T) = \frac{\theta(1-\theta)}{n}. Therefore the standard error is \text{SE}(\hat \theta) = \sqrt{\frac{\theta(1-\theta)}{n}}

Example

Again the standard error is a function of the unknown parameter \theta
The estimated standard error is obtained by \text{se}(\hat \theta) = \sqrt{\frac{\hat \theta(1-\hat \theta)}{n}}
The inference concerning \hat \theta in the example of placenta previa is \hat \theta = 0.446, \quad (\text{se} = 0.016 )
The conclusion is that the proportion of females in the population of placenta previa births is 44.6\% with a standard error of 1.6%.

2.8 Large samples

The sampling distribution of the proportion \hat \theta is a Binomial T divided by the sample size n
In the placenta previa example, assuming that the true proportion is \theta the distribution of \hat \theta is shown in figure below

We have two fundamental results obtained using the large sample property
1. The proportion T/n converges in probabilityto the true parameter \theta, for n \to \infty
2. The proportion T/n converges in law to the Normal, for n \to \infty

De Moivre-Laplace Theorem

When \theta is not too close to 0 or 1, the Binomial distribution \text{Bin}(n, \theta) in large samples approximates the Normal distribution N(n\theta,n\theta(1-\theta)).

We say that the distribution of \hat \theta is asymptotically normal.

Central limit theorem

Given any population with a continuous distribution with a finite mean \mu and variance \sigma^2, in large samples, the distribution of the sample mean under random sampling approximates the Normal distribution N(\mu, \sigma^2/n)

2.9 Details on the normal distribution

The previous mathematical results imply a central role of the normal distribution in statistical inference.
The standard normal random variable Z is defined by the density \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty < z < \infty
We know that Z is symmetric unimodal with respect to \mu. Moreover E(Z) = 0 and \mathrm{var}(Z) = 1
The family of normal distributions is defined by a linear transformation of the standard normal X = \mu + \sigma Z where \mu \in \mathbb{R} is the mean parameter and \sigma^2>0 is the variance parameter. The family is denoted by X \sim N(\mu, \sigma^2).

The tails are rather short:

\begin{align*} \Pr\{ -1 < Z < 1 \} &= 0.683\\ \Pr\{ -2 < Z < 2 \} &= 0.954\\ \Pr\{ -3 < Z < 3 \} &= 0.997\\ \end{align*}

Therefore, the probability that X differs by more than 3 \sigma from the mean is only 0.003.
The squared transformation Z^2 belongs to a different family of distributions called chi-squared and denoted by \chi^2_m. Specifically, Z^2 \sim \chi^2_1.

2.10 The one-sample normal model

This model assumes that we get a random sample from a population that has a normal distribution N(\mu, \sigma^2) where both \mu and \sigma^2 are unknown.
In formulae we assume that
- we have a random vector {\boldsymbol{X}} = (X_1, \dots, X_n) such that each X_i \sim N(\mu, \sigma^2) for i = 1, \dots, n
- the observations X_i are mutally independent so that the joint distribution factorizes as
  f_{{\boldsymbol{X}}}(x_1, \dots, x_n) = f_1(x_1) \cdot \cdots \cdot f_n(x_n)

The one sample model can be written as X_i = \mu + \varepsilon_i where \varepsilon_i are mutually independent errors with normal distributions N(0, \sigma^2) with the same variance.

Of course this model makes quite strict assumptions so that we expect to check them carefully.

Sampling distributions of the mean and variance estimators

The estimators of \mu (parameter of interest) and \sigma^2 (nuisance parameter) are respectively \bar X = \sum_{i=1}^n X_i, \quad S^2 = \sum_{i=1}^n(X_i - \bar X)^2/(n-1)
Standard error of the sample mean:

\text{se}(\bar X) = S /\sqrt{n} - The sampling distributions with the normal model and sample size n are exactly

\bar X \sim N(\mu, \sigma^2/n)
S^2 \sim \sigma^2 \chi^2_{n-1}/(n-1)
The variance estimate has a chi-squared distribution with mean n-1, in statistical jargon called degrees of freedom (details postponed).

Example about Salary

Data on employees from one job category (skilled, entry–level clerical) of a bank that was sued for sex discrimination. The data are on 61 female employees, hired between 1965 and 1975. (R Package Sleuth3)

Estimates of the mean and standard error:

sample mean = $ 5138.852  se = $ 69.12335

2.11 The two-sample normal model

A more interesting model concerns the comparison of two populations (be they Bernoulli, Normal ore whatever)

Let’s discuss the previous example.

Sex discrimination in employment

Did a bank discriminatorily pay higher starting salaries to men than to women? The salaries studied before concerned women, but there are data also for men.

The full set of data is in R package Sleuth3

library(Sleuth3)
ex <- case1202
df <- ex[, c("Bsal", "Sex")]
boxplot(Bsal ~ Sex, data = df, horizontal = TRUE, 
        col = c("Orange", "LightGreen"))

The two samples have sizes n_0 = 61 (females) and n_1= 32 (males).

The two-sample normal model assumes that the data are obtained by two independent random samples (possibly of different sizes) from two normal populations with parameters N(\mu, \sigma^2) \text{ and } N(\mu + \theta, \sigma^2) See the figure below:

The two random samples can be collected together as a independent observations Y_i for i = 1, \dots, n plus an explanatory variable X_i representing the gender and therefore identifying the two samples \begin{array}{cccccccccc} {\boldsymbol{Y}} &= &(Y_1 & Y_2& \cdots & Y_{n_0}& Y_{n_0+1}& Y_{n_0+2} &\cdots & Y_{n_0+n_1}) \\ {\boldsymbol{X}} &= &(0 & 0 & \cdots & 0 & 1 & 1 &\cdots & 1 ) \\ \end{array}

Note

The two-sample model can be denoted by Y_{i} = \mu + \theta \cdot X_i + \varepsilon_i with errors \varepsilon_i \sim N(0, \sigma^2) mutually independent

Note that the observations Y_i are independent but not identical
The errors \varepsilon_i are instead i.i.d. because the variances are assumed to be the same (an assumption called heteroskedasticity).

Interpretations

The parameters in the two-sample models are three: \mu, \theta and \sigma^2
We can change the parameters defining a 1-1 correspondence. For example, E(Y_i) = \begin{cases} \mu & \text{if } X_i = 0\\ \mu + \theta & \text{if } X_i = 1 \end{cases}
The means of the two normal distributions are \mu_0 = \mu and \mu_1 = \mu+ \theta, so that \begin{align*} \mu &= \mu_0\\ \theta &= \mu_1 - \mu_0 \end{align*} Therefore \theta measures the discrepancy between the two populations.
In the example \theta is the parameter of interest in evaluating the amount of sex discrimination.

2.12 Estimation and standard errors of the two-sample model

Intuitively, we can estimate \mu with the mean of the first sample, and \theta with the difference between the two means. Denoting by \bar Y_0 and \bar Y_1 the sample means, we get the estimators \hat \mu = \bar Y_0, \quad \hat \theta = \bar Y_1 - \bar Y_0 and we can prove that both are normally distributed.
The standard error of \hat \mu is given by \text{SE}(\hat \alpha) = \frac{\sigma}{\sqrt{n_0}} where n_0 is the size of the first sample.
The standard error of \hat \theta is the square root of \begin{align*} \mathrm{var}(\hat \theta) &= \mathrm{var}(\bar Y_1 - \bar Y_0) \\ &= \mathrm{var}(\bar Y_1) + \mathrm{var}(\bar Y_0)\\ &= \frac{\sigma^2}{n_0} + \frac{\sigma^2}{n_1} = \sigma^2 \left(\frac{1}{n_0} + \frac{1}{n_1}\right) \end{align*}
The estimated standard errors depend upon an estimate of the common variance \sigma^2

The pooled variance^*

The estimator of \sigma^2 is the pooled variance S^2_{p} = \frac{1}{n-2}\left[\textstyle \sum_{i \in G_0}(Y_i - \bar Y_0)^2 + \sum_{i \in G_1}(Y_i - \bar Y_1)^2 \right] that is a sum of the deviances within the groups divided by the number of observations minus 2. The explanation is postponed.
This estimator can be interpreted as a weighted average of the variances of the groups:

S^2_{p} = \frac{(n_0-1) S^2_0 + (n_1-1) S^2_{1}} {(n_0-1) + (n_1-1)}

Estimated standard error of \hat \theta

We use again the plug-in rule substituting to \sigma^2 the pooled variance \text{se}(\hat \theta) = \sqrt{ S^2_p \left(\frac{1}{n_0} + \frac{1}{n_1}\right)}.

Computations for sex discrimination

Using the formulae we have the following estimates

\begin{align*} \hat \mu &= \$\,5138.8 \quad (\text{se} = 75.2) \\ \hat \theta &= \$\,818.0 \quad (\text{se} = 130.0) \end{align*}

Thus, there is a salary difference of about 800 dollars with a standard error of 130 dollars.

= The salary difference has an exactly normal sample distribution. Intuitively, it seems unlikely that a difference of $818 can happen if the true \beta = 0 given that is 6 times the standard error $130.

3 Fundamental concepts of inference

Many inferential problems can be classified into three categories:

estimation
confidence sets
hypothesis testing

Her we give a short introduction.

3.1 Point Estimation

Point estimation tries to find an “optimal guess” of some quantity of interest like

a parameter
a distribution
a regression function
a prediction for a future value

Let’s call \theta a generic parameter of a model and \hat\theta its estimator. Remember that the estimator is a random variable in repeated sampling.

For a random sample i.i.d. (X_1, \dots, X_n) from a distribution, a point estimator \hat\theta is a function of the sample.

Bias

The bias of an estimator is defined by the difference between the true \theta and the expected value of the estimator \text{bias}(\hat \theta) = E(\hat \theta) - \theta If E(\hat \theta) = \theta the bias is 0 and the estimator is said unbiased.

Examples

Given an i.i.d. sample the sample mean \bar X is always unbiased for the mean of the population \mu.
Given an i.i.d sample from a Bernoulli, the proportion of successes \hat \theta = T/n is unbiased.
However the estimator (T/n)^2 of \theta^2 of the Bernoulli is biased because the expectation is E(T^2/n^2) = E(T^2)/n^2 = \theta/n^2 and \text{bias}[(T/n)^2] = \theta/n^2 - \theta^2
The sample variance S^2 (with denominator n-1) is unbiased for the variance \sigma^2 of the population.
However the square root of the sample variance, i.e. \sqrt{S^2} is biased for the standard deviation \sigma.
In general, unbiasedness is a non persistent property, and in fact many good estimators are biased provided that the bias tends to zero when the sample size increases.

Consistence

The important thing for an estimator is not the zero bias, but a property called consistence meaning that the estimator converges to the true parameter value as we collect more and more data.

A point estimator \hat\theta_n of \theta is consistent if \hat\theta_n \to \theta for n \to \infty in probability.

Mean squared error

As we have seen before the quality of a point estimate is assessed by the mean squared error \text{MSE} = E_\theta\{ \hat \theta - \theta)^2\}.
Interestingly the MSE can be decomposed as follows

\text{MSE} = \text{bias}^2(\hat \theta) + \mathrm{var}(\hat \theta) And this helps in checking if an estimator is consistent

If the \text{bias} \to 0 and the \to 0 as n \to \infty then the estimator \hat \theta_n is consistent.

Example

The estimator of proportion of successes \hat \theta = T/n is consistent, because it has a bias equal to zero and a \text{SE} = \theta(1-\theta)/n so that \lim_{n = \infty} \theta(1-\theta)/n = 0.

3.2 Confidence Sets

Definition

A confidence interval of level 1-\alpha is an interval with extremes L (left) and R (right) that are functions of the random sample (therefore random variables) such that \textstyle \Pr_\theta(L < \theta < R) \ge (1-\alpha) \text{ for all } \theta.

The idea is that we have random interval that traps the true parameter \theta with probability 1-\alpha called the coverage.
Typically we use 95 percent confidence levels, which correspond to choosing \alpha = 0.05.

Warning

The probability concerns the random interval not the parameter. The parameter is fixed !

If the parameter \theta is a vector we can define confidence sets. For example ellipses in 2D space.

Example

(Wasserman) Every day, newspapers report opinion polls. For example, they might say that “83 percent of the population favor arming pilots with guns.” Usually, you will see a statement like “this poll is accurate to within 4 points 95 percent of the time.”

Technically, this means that a study based on a random sample gives a confidence interval with level 0.95 0.83 \pm 0.04 for the true proportion \theta of people wo favor arming pilots.
A typical error is interpreting the observed confidence interval as: \small \Pr( 0.79 < \theta < 0.87) = 0.95 because is conceptually wrong!

Correct interpretation

If you continue constructing confidence intervals (not only for \theta but also for a sequence of other unrelated parameters, then 95 percent of your intervals will trap the true parameter value.

Approximate normal confidence intervals

Theorem

If we have an asymptotically normal estimator \hat \theta_n \approx N(\theta, \text{se}^2) of \theta, then we can construct an approximate CI for \theta

C_n = (\hat \theta_n - 1.96 \, \text{se} < \mu < \hat \theta_n + 1.96\, \text{se})

with confidence level \to 0.95 for n \to \infty.

Justification: \begin{align*} \textstyle \Pr_\theta(C_n) &= \textstyle \Pr_\theta(\hat \theta_n - 1.96\, \text{se} < \theta < \hat \theta_n + 1.96\, \text{se} )\\ &= \textstyle \Pr_\theta\Big(-1.96 < \displaystyle\frac{\hat \theta_n - \theta}{\text{se}} <1.96\Big)\\ &\bumpeq \textstyle \Pr_\theta\Big(-1.96 < Z < 1.96\Big) = 0.95 \end{align*}

Example

The approximate confidence interval for the mean fertility considering a random sample with size n = 200 is calculated as follows

X <- fiji_data$F
x200 <- X[sample(1703, 200, replace = TRUE)]
M <- mean(x200)
se <- sd(x200)/sqrt(200)
L <- M - 1.96*se
U <- M + 1.96*se
cat("mean: ", M, ", se: ", se, "n: ", 200 , "\n" )

mean:  6.18 , se:  0.236469 n:  200

95 percent Confidence Interval

( 5.716521 6.643479 )

Illustration

3.3 Hypothesis testing

In hypothesis testing we start from a study concerning a substantive research, like for example the study concerning the starting salaries of men and women.
Also we have a sort of default theory like that the mean salaries for men and women are the same.
The deafult theory must be converted into a statistical hypothesis that makes statements about population parameters, called the null hypothesis. In the example

H_0: \theta = \mu_{\text{M}} - \mu_{\text{F}} = 0

We can also identify an alternative hypothesis H_1 that specifies the directions of the departures from H_0. For example H_0: \theta > 0
Then we ask if the data provide sufficient evidence to reject the theory. If not we retain the null hypothesis.
Calculate some function t({\boldsymbol{Y}}) of the difference between the means of males and females. Then:

\small \begin{cases} \text{if $T$ is large} & \text{ reject } H_0 \\ \text{if $T$ is small} & \text{ mantain } H_0 \end{cases}

Example

In the study concerning the gender of the births with placenta previa, we could ask if the proportion of males is 1/2. Then we can compare a null hypothesis H_0 : \theta = \textstyle\frac{1}{2} with an alternative hypothesis H_1: \theta \ne \textstyle\frac{1}{2}. It seems reasonable to reject H_0 if T = |\hat \theta - \textstyle\frac{1}{2}|. is large.

Approaches to testing

We can distinguish two different approaches to testing:

Fisher’s approach is directed to evaluate the evidence against a single hypothesis H_0 measured by a normalized value of the distance of the data and the hypothesis, called p-value
Neyman and Pearson’s approach considers two hypotheses H_0 and an alternative H_1 and use a decision procedure on the basis of data, either to accept or reject H_0.

3.4 Fisher approach

The procedure consist in defining a test statistic, that is a function of the data measuring the difference t({\boldsymbol{Y}}) between the data and what is expected on the null hypothesis. Typically,
t({\boldsymbol{Y}}) = \frac{\text{observed} - \text{expected}}{\text{se}}
The observed value of the test statistic is denoted by t_{\text{obs}}.
Then we calculate the chance of getting a test statistic es extreme as or more extreme than t_{\text{obs}}, computed on the basis that H_0 is right:
p = \Pr(t({\boldsymbol{Y}}) \ge t_{\text{obs}}; H_0)

This probability is called the observed significance level or briefly the p-value.

Test concerning the two samples of salaries

Start from a general model assumed to be the generating process of the observations Y_i = \mu + \theta X_i + \varepsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \text{ for } i = 1, \dots, n where X_i is a binary variable defining two samples.
We define a null hypothesis H_0: \theta = 0 asserting that the two populations are identical: Y_i = \mu + \epsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) meaning that the mean salaries are equal.
An alternative hypothesis can be H_1: \theta > 0 so that \mu_{\text{M}} > \mu_{\text{F}}.
We define a test statistic t({\boldsymbol{Y}}) = W = \frac{\bar Y_1 - \bar Y_0}{\text{se}(\hat\beta)}
that is the difference between the means divided by the estimated standard error, calculated with the pooled variance.
In our data about salaries the value of the test statistic is t_{\text{obs}} = \frac{818}{130}= 6.3
To evaluate it this value is small or large we calculate the p-value: p = \Pr(W \ge 6.3; H_0).
To calculate this probability we should find the null distribution of T that fortunately is known as a t- distribution (discovered by William Gosset said “Student”) with parameter n-2.
We will explain its genesis later, but for the moment we just plot it. The following figure shows the density of a T_{n-2}, with n = n_0 + n_1 = 93

As you see the curve is quite similar to the standard normal.

The red dot, representing the value d_{\text{oss}} = 6.3, is quite far in the right tail so that the p-value is extremely
small.
To find the p-value for the test of \beta=0 we calculate the left-tail area with R using the function pt():

pt(6.3, df = 91, lower.tail = FALSE)

[1] 5.202414e-09

A p-value close to 0 gives an indication that a value of d_{\text{obs}} is unlikely under the hypothesis. In this sense the smaller is the p-value the the larger is the evidence against the hypothesis.

Note

Traditionally, the p-value is reported on a rough scale: \small \begin{array}{cll} \text{P-value} & \text{Interpretation} & \text{Jargon}\\ \hline \le 0.01 & \text{strong evidence against } H_0 & \text{highly significant} \\ \bumpeq 0.05 & \text{moderate evidence against } H_0 & \text{significant}\\ > 0.1 & \text{reasonable consistency with } H_0 & \text{non-significant}\\ \hline \end{array}

Warning

A non-significant test does not imply that the data confirm H_0.
The p-value is NOT the probability that the null is true: in frequentist inference there isn’t any concept of probabilty of hypothesis.

3.5 Neyman-Pearson’s approach

Start from a problem

A manufacturing plant produces batches of a chemical using a standard production method (A). Now they try a modified method (B) that is supposed to give a higher yield
The experimenters verify that the batches could be supposed independent and that the order of the runs has no influence on the results.
Problem: decide using an experiment whether method (B) gives significantly higher yields than method (A).

NP Procedure

Define the experiment: use a completely randomized experiment by making in sequence 10 batches of a chemical using the standard production method (A) followed by 10 batches using the modified method (B).
Define the model: the data are generated by a two-sample model Y_i = \mu_A + \theta X_i + \varepsilon_i, \quad \varepsilon_i \sim_{\text{ ind}} N(0, \sigma^2) where X_i=0 if chemical i is produced by method (A) and X_i = 1 if is produced by method (B)
Define the null and alternative hypotheses. In general they define a partition of the parameter space \Theta: H_0 : \theta \in \Theta_0 \text{ versus } H_1: \theta \in \Theta_1 For instance : H_0: \mu_B - \mu_A = \theta = 0 \text{ vs } H_1: \theta > 0
A hypothesis is said simple if it defines a single distribution. Otherwise it is said composite

Warning

In the previous example both hypotheses are composite!

Define an appropriate subset R in the sample space called the rejection region. The decision rule is \small \begin{aligned} {\boldsymbol{Y}} \in R &\implies \text{ reject } H_0\\ {\boldsymbol{Y}} \not\in R &\implies \text{retain (do not reject)} H_0 \end{aligned}

Usually the rejection region is defined by \small R = \{ {\boldsymbol{Y}} : t({\boldsymbol{Y}}) > c\} where t({\boldsymbol{Y}}) is a test statistic and c is called the critical value.

two types of error result from the decision rule:
- type I error: reject H_0 when it is true (convict an innocent)
- type II error: retain H_0 when it is false (absolve a guilty)

Note

If we can reduce the type I error the type II error increases and viceversa.

Power and size

A specific distinction from the Fisher’s approach is the definitions of power function and size of the test.

The power function is the probability of rejecting H_0 for all values of the parameter \theta
The size of the test called \alpha is the maximum probability of type I error when the value of the parameter \theta satisfies the null hypothesis.

Most powerful test

Tip

The decision rule consist in finding the rejection region t({\boldsymbol{Y}}) > c with the highest power under the alternative H_1 among all regions with fixed size \alpha

Unfortunately, finding the optimal test is rather hard !
However there are several tests that come close to this ideal.

The t-test for the two-sample model

We met before the two-sample model with normal errors. The t-test concerns the test about the difference of the two means \theta = \mu_1 - \mu_0.

Two-sided test H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0
One-sided test (right) H_0: \theta \le \theta_0 \text{ versus } H_1: \theta \ge \theta_0
Test statistic

W = \frac{\hat \theta - \theta_0}{\text{se}} \quad \text{where }\quad \hat \theta = \bar Y_1 - \bar Y_0

Null distribution: is a t distribution with parameter (degrees of freedom} n-2:

W \sim T_{n -2}.

Rejection region with size \alpha
\begin{cases} \text{ for the two-sided hypothesis: } & |W| > t_{\alpha/2} \\ \text{ for the one sided hypothesis: } & W > t_{\alpha} \end{cases} where z_{\alpha/2} and z_{\alpha} are quantiles shown in the figure below

3.6 Applications

Example of the chemical production

The two samples of chemical batches are

YA <-c(89.7,81.4,84.5,84.8,87.3,79.7,85.1,81.7,83.7,84.5)
YB <-c(84.7,86.1,83.2,91.9,86.3,79.3,82.6,89.1,83.7,88.5)

and a boxplot is

group = factor(c(rep("A", 10), rep("B", 10)))
Y = c(YA, YB)
boxplot(Y ~ group,  horizontal = TRUE)

The test statistic is calculated as follows. The estimate of \theta is \hat \theta = \bar y_B - \bar y_A = 85.54 - 84.24 = 1.3 the estimate of the variance \sigma^2 is the pooled variance s^2_p = \frac{\Sigma_{i\in A} (y_i - \bar y_A)^2 + \Sigma_{i \in B} (y_i - \bar y_B)^2}{n_A + n_B -2} = \frac{75.784+119.924}{18} = 10.87 and the estimate of the standard error is \text{se} = \sqrt{s^2_p \left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = 1.47 So that w_{obs} = \frac{1.30 - 0}{1.47} = 0.88
The rejection region for the one-sided test of size \alpha = 0.05 is |W| > t_{18, 0.05}. The region can be found with

cat("0.05 upper quantile:",  qt(0.05, df = 18, lower.tail = FALSE), "\n")

0.05 upper quantile: 1.734064

that gives R = \{ {\boldsymbol{Y}} : W > 1.734\}. Therefore the null hypothesis is not rejected at the level \alpha = 0.05.

The p-value is calculated by

cat("p-value: ", pt(0.88, df = 18, lower.tail = FALSE), "\n")

p-value:  0.1952286

so that the test is not significant at the 5 % level.

4 Linear regression models