Statistical thinking concerns the relation of quantitative data to a real-world problem, often in the presence of variability and uncertainty. It attempts to make precise and explicit what the data has to say about the problem of interest
The starting point in Statistics is a scientific enquiry, an investigation concerning a substantive question.
The first problem (the zeroth problem) to be studied is then to understand what is the relevant population.
In other words one has to decide what the relevant data are, and how these relate to the purpose of the statistical study.
1.1 The Fiji Fertility Survey
One important issue in demography is fertility (the number of children ever born of a woman) and its relation with other factors.
The Fiji Fertility Survey (which is part of the World Fertility Survey, see Fiji Fertility survey, 1974) collected data of fertility and other relevant factors.
The data below are a subset of the full data set, regarding 1703 ever-married women Aged 35 -49 in 1974 and concerning the variables:
F, fertility (number of children), the main response,
A, age in years,
M, age at the first marriage in years,
E, education in completed years
U, a binary indicator of urban residence
I, a binary indicator of indian ethnicity
Questions
Some of the basic questions are:
What is the population to be investigated and how should we collect data?
Should we measure fertility could be measured for each woman at a given age or at a given point in time?
Should we choose as relevant women population for a specific region or for a selected ethnic group?
Can we distinguish categories of variables, i.e. responses, intermediate variables, background variables, etc.
Do we have causal hypothesis concerning fertility?
Explicit what we know about the problem
Fertility is a primary outcome depending on age, age at first marriage, education, and residence and ethnicity
The variables M and E are intermediate variables, while A, U and I can be interpreted as background variables that may be considered fixed and on the same standing.
One reasonable ordering of the variables is the following
\left\{ \begin{array}{c}
I\\ U \\ A
\end{array} \right\}\to
\{E\} \to \{M\} \to \{F\}
Each variable potentially depends on the variables to the left.
Meaning of statistical inference
A statistical inference will be defined as a statement about statistical populations made from given observations with measured uncertainty.
An inference in general is an uncertain conclusion.
Two things mark out statistical inferences.
First the information on which they are based is statistical, i.e. consists of observations subject to random fluctuations.
Secondly we explicitly recognize that our conclusion is uncertain and attempt to measure, as objectively as possible, the uncertainty involved.
Suppose that our population consists in the 1703 ever-married women Aged 35-49 in 1974.
source("fiji.R")summary(fiji)
F A M E I
Min. : 0.000 Min. :35.00 Min. : 9.00 Min. : 0.000 0:774
1st Qu.: 4.000 1st Qu.:38.00 1st Qu.:15.00 1st Qu.: 0.000 1:929
Median : 6.000 Median :41.00 Median :17.00 Median : 4.000
Mean : 5.904 Mean :41.04 Mean :17.91 Mean : 3.985
3rd Qu.: 8.000 3rd Qu.:44.00 3rd Qu.:20.00 3rd Qu.: 6.000
Max. :17.000 Max. :49.00 Max. :44.00 Max. :20.000
U
0:1113
1: 590
Then we choose a random sample with replacement of size 20 from this population.
The R code is as follows
set.seed(457) # set the seed to reproduce resultsx20 <- fiji[sample(1703, 20, replace =TRUE), ]summary(x20)
F A M E I U
Min. : 0.00 Min. :35.00 Min. :12.00 Min. : 0.00 0:11 0:14
1st Qu.: 4.75 1st Qu.:39.50 1st Qu.:16.00 1st Qu.: 3.00 1: 9 1: 6
Median : 6.50 Median :41.50 Median :17.50 Median : 4.50
Mean : 6.20 Mean :41.75 Mean :18.05 Mean : 4.70
3rd Qu.: 7.25 3rd Qu.:46.00 3rd Qu.:20.25 3rd Qu.: 7.25
Max. :11.00 Max. :48.00 Max. :24.00 Max. :10.00
1.2 Inference with survey sampling approach
The survey sampling approach is typically used for finite populations
The focus is typically on population means and variance of important variables.
The simplest problem of inference is how to relate the sample mean \bar x of a variable X, from a random sample of size n, to the population mean \mu in the finite population
Inference is this case is based on the repeated sampling principle:
Statistical procedures should be evaluated on the basis of their behaviour in hypothetical repetitions of the experiment that generated the original data.
This principle is the basis of the frequentist approach
Example
The previous sample of size n = 20 is a simple random sample with replacement i.e. we know that the observation are
mutually independent and
identically distributed for each variable.
To estimate the mean of the fertility \mu_F in the population we use the sample mean (called a statistic)
\bar x_F = \frac{1}{n} \sum_{i=1}^n x_i
The the population mean \mu_F and the sample mean \bar x_F are in general different
mu <-mean(fiji$F)xbar <-mean(x20$F)
\mu_F = 5.9 \quad \bar x_F = 6.2
Therefore we need a measure of the uncertainty of the estimate \bar x_F
1.3 Measure of the error of an estimator
According to the repeated sampling principle we measure the uncertainty as follows.
We consider the means in repeated random samples, denoted by the random variable \bar X_F.
The random variable \bar X_F has a probability distribution that can be studied. It is called the estimator of \mu_F
We measure the uncertainty of the estimator with the Mean Squared Error (MSE)
MSE = E\{(\bar X_F - \mu_F)^2\}
The square root of MSE, called the standard error, is a measure of uncertainty of the estimator.
Formula
There is a formula valid for simple random samples with replacement
Our inference about the mean fertility is summarized by
\bar x_F = 6.2,\quad (\text{se} = 0.56)
Graphical presentation
Consider the true distribution of the population with the true mean (black) and our estimate (red).
The true standard deviation is \sigma = 3.08 and its estimate is s = 2.5.
The true standard error of the mean is \text{SE} = 0.69.
The estimated standard error of the mean is \text{se} = 0.56.
1.5 Model-based inference theories
Often we start not from a finite population but from a statistical model i.e., from an assumption concerning the distribution of an infinite population
For example, we assume that the distribution of the population is Gaussian (Normal) and that our sample is a random sample from that population with unknown mean \mu and standard deviation \sigma
In this example, as before we want to estimate these two unknown parameters \mu and \sigma.
The model-based theory of inference are essentially two:
the frequentist theory (Fisher, Neyman and Egon Pearson) seen before
the Bayesian theory, (Bayes, Laplace, Jeffreys) the useful for more complex models and data.
Basic ingredients
Both theories are based on
a research question to be answered
a designD for the study; and
a probability model f(x ; \theta) for the random variable X, which is used to represent the data produced by the study
The parameter \theta can be a scalar or a vector belonging to a given parameter space.
If the model can be defined by a finite number of parameters is said a parametric model.
1.6 The Bernoulli model
In this model the data are binary x = 0,1, where 0 indicates failure and 1 success with probability mass
f(x; \theta) = \begin{cases}
1-\theta & \text{ if } x = 0 \\
\theta & \text{ if } x = 1
\end{cases}
Assume that the design is a random sample (X_1, \dots, X_n) where the variables are independent and identically distributed (i.i.d)
The parameter \theta is the probability of success, and assumes values in a parameter space 0 <\theta<1.
The mean of the Bernoulli is also equal to \theta
The variance is \theta(1-\theta)
Example (Placenta previa)
Placenta Previa is a complication in the second half of pregnancy where the placenta attaches inside the uterus in a position that partially or completely covers the cervical opening.
Placenta previa
Suppose that we have data from a random sample in the population of placenta previa births in Germany, and we are interested in the proportion of female births.
Therefore we can specify a Bernoulli model with parameter \theta equal to the probability of a female birth.
If the size of the sample is n = 980 and the number of females in the sample is s = 437 we can estimate \theta as
\hat \theta = 437 /980 = 0.446
1.7 The standard error of \hat \theta
The first inference is to obtain the standard error of the estimator.
In general, given an i.i.d sample (X_1, \dots, X_n) from the Bernoulli, the proportion of successes is
\hat \theta = \frac{1}{n} \sum_{i = 1}^n X_i =
\frac{T}{n}
where T is the number of females in the sample.
The sampling distribution of T has a Binomial distribution with mean E(T) = n\theta and variance \mathrm{var}(T) = n\theta(1-\theta).
The MSE of \hat \theta is equal to the variance
\mathrm{var}(\hat \theta) = \mathrm{var}(T/n) = \frac{1}{n^2}\mathrm{var}(T) = \frac{\theta(1-\theta)}{n}.
Therefore the standard error is
\text{SE}(\hat \theta) = \sqrt{\frac{\theta(1-\theta)}{n}}
Example
Again the standard error is a function of the unknown parameter \theta
The estimated standard error is obtained by
\text{se}(\hat \theta) = \sqrt{\frac{\hat \theta(1-\hat \theta)}{n}}
The inference concerning \hat \theta in the example of placenta previa is
\hat \theta = 0.446, \quad (\text{se} = 0.016)
The conclusion is that the proportion of females in the population of placenta previa births is 44.6\% with a standard error of 1.6%.
1.8 Large samples
The sampling distribution of the proportion \hat \theta is a Binomial T divided by the sample size n
In the placenta previa example, assuming that the true proportion is \theta the distribution of \hat \theta is shown in figure below
We have two fundamental results obtained using the large sample property
The proportion T/nconverges in probabilityto the true parameter \theta, for n \to \infty
The proportion T/nconverges in law to the Normal, for n \to \infty
De Moivre-Laplace Theorem
When \theta is not too close to 0 or 1, the Binomial distribution \text{Bin}(n, \theta) in large samples approximates the Normal distribution N(n\theta,n\theta(1-\theta)).
We say that the distribution of \hat \theta is asymptotically normal.
Central limit theorem
Given any population with a continuous distribution with a finite mean \mu and variance \sigma^2, in large samples, the distribution of the sample mean under random sampling approximates the Normal distribution N(\mu, \sigma^2/n)
1.9 Details on the normal and the chi-squared distributions
The previous mathematical results imply a central role of the normal distribution in statistical inference.
The standard normal random variable Z is defined by the density
\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2},
\quad -\infty < z < \infty
We know that Z is symmetric unimodal with respect to \mu. Moreover E(Z) = 0 and \mathrm{var}(Z) = 1
The family of normal distributions is defined by a linear transformation of the standard normal
X = \mu + \sigma Z
where \mu \in \mathbb{R} is the mean parameter and \sigma^2>0 is the variance parameter. The family is denoted by X \sim N(\mu, \sigma^2).
The tails are rather short:
\begin{align*}
\Pr\{ -1 < Z < 1 \} &= 0.683\\
\Pr\{ -2 < Z < 2 \} &= 0.954\\
\Pr\{ -3 < Z < 3 \} &= 0.997\\
\end{align*}
Therefore, the probability that X differs by more than 3 \sigma from the mean is only 0.003.
The chi-squared distribution
The chi-squared (\chi^2_d) is a distribution closely connected with the normal. The parameter d is an integer called degrees of freedom
The simplest example is the chi-squared with 1 degree of freedom that is the square of one standard normal Z Specifically, Z^2 \sim \chi^2_1 whith parameter 1.
The chi-squared with 1 d.f.
In particular this distribution has support the positive reals, with mean E(Z^2) = 1 and \mathrm{var}(Z^2) = 2.
More generally if we have d independent standard normal variables Z_1, \dots, Z_n the sum of squares
Z_1^2 + \cdots + Z_n^2 \sim \chi^2_n
has a chi-squared distribution with n d.f.
The mean is n and the variance is 2n
For example see the distribution of \chi^2_3
The chi-squared with 3 d.f.
1.10 The one-sample normal model
This model assumes that we get a random sample from a population that has a normal distribution N(\mu, \sigma^2) where both \mu and \sigma^2 are unknown.
In formulae we assume that
we have a random vector {\boldsymbol{X}} = (X_1, \dots, X_n) such that each X_i \sim N(\mu, \sigma^2) for i = 1, \dots, n
the observations X_i are mutually independent so that the joint distribution factorizes as
f_{{\boldsymbol{X}}}(x_1, \dots, x_n) = f_1(x_1) \cdot \cdots \cdot f_n(x_n)
The one sample model can be written as
X_i = \mu + \varepsilon_i
where \varepsilon_i are mutually independent errors with normal distributions N(0, \sigma^2) with the same variance.
Of course this model makes quite strict assumptions so that we expect to check them carefully.
1.11 Sampling distributions of the mean and variance estimators
The estimators of \mu (parameter of interest) and \sigma^2 (nuisance parameter) are respectively
\bar X = \textstyle \frac{1}{n}\sum_{i=1}^n X_i, \quad S^2 = \sum_{i=1}^n(X_i - \bar X)^2/(n-1)
Standard error of the sample mean:
\text{se}(\bar X) = S /\sqrt{n}
- The sampling distributions with the normal model and sample size n are exactly
\bar X \sim N(\mu, \sigma^2/n)
S^2 \sim \sigma^2 \chi^2_{n-1}/(n-1)
The variance estimate has a chi-squared distribution with n-1 degrees of freedom.
Why n-1 d.f.?
If Z_1, \dots, Z_n are IID standard normals it can be proved that \sum_{i=1}^n Z_i^2 \sim \chi^2_n but \sum_{i=1}^n (Z_i - \bar Z)^2 \sim \chi^2_{n-1}
This happens because the r.v. Z_i -\bar Z must satisfy one constrain that their sum must be always zero -Thus, you don’t have n independent variables but only n-1, loosing 1 degree of freedom.
Example about Salary
Data on employees from one job category (skilled, entry–level clerical) of a bank that was sued for sex discrimination. The data are on 61 female employees, hired between 1965 and 1975. (R Package Sleuth3)
A more interesting model concerns the comparison of two populations (be they Bernoulli, Normal or whatever)
Let’s discuss the previous example.
Sex discrimination in employment
Did a bank discriminatorily pay higher starting salaries to men than to women? The salaries studied before concerned women, but there are data also for men.
The full set of data is in R package Sleuth3, loading data case1202
The two samples have sizes n_0 = 61 (females) and n_1= 32 (males).
The two-sample normal model assumes that the data are obtained by two independent random samples (possibly of different sizes) from two normal populations with parameters
N(\mu, \sigma^2) \text{ and } N(\mu + \theta, \sigma^2)
See the figure below:
The two random samples can be collected together as a independent observationsY_i for i = 1, \dots, n plus an indicator variable X_i with value 0 for the women and 1 for the males, therefore identifying the two samples
\begin{array}{cccccccccc}
{\boldsymbol{Y}} &= &(Y_1 & Y_2& \cdots & Y_{n_0}& Y_{n_0+1}& Y_{n_0+2} &\cdots & Y_{n_0+n_1}) \\
{\boldsymbol{X}} &= &(0 & 0 & \cdots & 0 & 1 & 1 &\cdots & 1 ) \\
\end{array}
The two-sample model can be denoted by
Y_{i} = \mu + \theta \cdot X_i + \varepsilon_i
with IID errors \varepsilon_i \sim N(0, \sigma^2)
Note that the observations Y_i are independent but not identical
The errors \varepsilon_i are instead IID because the variances are assumed to be the same (an assumption called homoscedasticity).
Reparameterization
The parameters in the two-sample models are three: \mu, \theta and \sigma^2
You can change the parameters to \mu_0, \mu_1, \sigma^2 where \mu_0 and \mu_1 are the salary means of women and men.
There is a 1-1 correspondence with the previous parameters
Here \theta has the role of parameter of interest because it evaluates the amount of sex discrimination
\sigma^2 has the role of nuisance parameter because it measures the uncertainty of the parameter of interest
1.13 Estimation and standard errors of the two-sample model
Intuitively, we can estimate \mu with the mean of the first sample, and \theta with the difference between the two sample means.
Denoting by \bar Y_0 and \bar Y_1 the sample means, we get the estimators
\hat \mu = \bar Y_0, \quad \hat \theta = \bar Y_1 - \bar Y_0
and we can prove that both are normally distributed because linear combinations of normal variables are normal
The standard error of \bar Y_0 is given by
\text{SE}(\bar{Y_0}) = \sqrt{\mathrm{var}(\bar Y_0)} = \frac{\sigma}{\sqrt{n_0}}
where n_0 is the size of the first sample.
The standard error of \hat \theta is \begin{align*}
\text{SE}(\hat\theta) &= \sqrt{\mathrm{var}(\bar Y_1 - \bar Y_0)} \\
&= \sqrt{\mathrm{var}(\bar Y_1) + \mathrm{var}(\bar Y_0)}\\
&= \sqrt{\text{SE}(\bar Y_0)^2 + \text{SE}(\bar Y_1)^2 }\\
&= \sqrt{\frac{\sigma^2}{n_0} + \frac{\sigma^2}{n_1}}\\
&= \sqrt{\sigma^2 \Big(\frac{1}{n_0} + \frac{1}{n_1}\Big)}
\end{align*}
The estimated standard errors depend upon an estimate of the common variance\sigma^2
The pooled variance^*
The estimator of \sigma^2 is the pooled variance
S^2_{p} = \frac{1}{n-2}\left[\textstyle \sum_{i \in G_0}(Y_i - \bar Y_0)^2 + \sum_{i \in G_1}(Y_i - \bar Y_1)^2
\right]
that is a sum of the deviances within the groups divided by the number of observations minus 2.
This estimator can be interpreted as a weighted average of the variances of the groups:
We use again the plug-in rule substituting to \sigma^2 the pooled variance
\text{se}(\hat \theta) = \sqrt{ S^2_p \left(\frac{1}{n_0} + \frac{1}{n_1}\right)}.
Thus, there is a salary difference of about 800 dollars with a standard error of 130 dollars.
The salary difference has an exactly normal sample distribution. Intuitively, it seems unlikely that a difference of $818 can happen if the true \beta = 0 given that is 6 times the standard error $130.
1.14 Exercises
In a region a telephone company has 100 000 subscribers. It plans to take a simple random sample of size 400 as part of a market research study. According to Census data nearly 20% of the company’s subscribers earn over $50 000 a year. Find the standard error for the percentage of persons in the sample with incomes over that level.
You have a box containing the counts
\{ 0,0,0,1,2\}.
Consider this as a population. The population has a mean = 0.6 and SD = 0.8. Find the SE for the percentage of 1’s in 400 draws with replacement.
Consider the R file fiji.R and use the code source("fiji.R") that returns a 1703 x 6 dataframe named fiji. Assume that it is a simple random sample from the population. Then obtain an estimate of the mean age at first marriage and its standard error.
A survey organization wants to take a simple random sample in order to estimate the percentage of people who have seen a certain television program. Their client will only tolerate chance errors of 1 percentage point or so in the estimate. Should they use a sample of size 100, 2,500, or 10,000? You may assume the population to be very large; past experience suggests the population percentage will be in the range 20%–40%.
Use the data fiji.R and consider the variables M (age at first marriage) and I (indicator of indian etnicity: 1 = Indian, 0 = Fijian). Draw two side-by-side boxplots and estimate the difference between the mean age of the Fijians and the Indians and its standard error.
Let Z \sim N(0,1) with density \phi(z) and cumulative distribution function \Phi(z) = \Pr(Z \le z). True or false and explain.
The slope of \Phi at the real number z is \phi(z)
The area to the right of z under \phi is \Phi(z)
P(Z = x) = \phi(x)
2 Fundamental concepts of inference
Many inferential problems can be classified into three categories:
estimation
confidence sets
hypothesis testing
Her we give a short introduction.
2.1 Point Estimation
Point estimation tries to find an “optimal guess” of some quantity of interest like
a parameter
a distribution
a regression function
a prediction for a future value
Let’s call \theta a generic parameter of a model and \hat\theta (or \hat \theta_n) its estimator. Remember that the estimator is a random variable in repeated sampling.
For a random sample IID (X_1, \dots, X_n) from a distribution, a point estimator \hat\theta is a function of the sample.
Bias
The bias of an estimator is defined by the difference between the expected value of the estimator and the true \theta
\text{bias}(\hat \theta) = E(\hat \theta) - \theta
If E(\hat \theta) = \theta the bias is 0 and the estimator is said unbiased.
Examples
Given an IID sample. the sample mean \bar X is always unbiased for the mean of the population \mu.
Given an i.i.d sample from a Bernoulli, the proportion of successes \hat \theta = T/n is unbiased.
However the estimator (T/n)^2 of \theta^2 of the Bernoulli is biased because
E(T^2/n^2) = \theta^2 + \underbrace{\theta(1-\theta)/n}_{\text{bias}}
To get an exactly unbiased estimator of \theta^2 you should use the estimator
\frac{T(T-1)}{n(n-1)}.
The sample variance S^2 (with denominator n-1) is unbiased for the variance \sigma^2 of the population.
However the square root of the sample variance, i.e. \sqrt{S^2} is biased for the standard deviation \sigma.
In general, unbiasedness is a non persistent property, and in fact many good estimators are biased, provided that the bias tends to zero when the sample size increases.
Consistency
An essential thing for an estimator is not the zero bias, but a property called consistency meaning that the estimator converges to the true parameter value as we collect more and more data.
Definition
A point estimator \hat\theta_n of \theta is consistent if \hat\theta_n \to \theta for n \to \infty in probability.
Mean squared error
As we have seen before the quality of a point estimate is assessed by the mean squared error
\text{MSE} = E_\theta\{ \hat \theta - \theta)^2\}.
Interestingly the MSE can be decomposed as follows
\text{MSE} = \text{bias}^2(\hat \theta) + \mathrm{var}(\hat \theta)
And this helps in checking if an estimator is consistent
Theorem
If the \text{bias} \to 0 and the \text{SE} \to 0 as n \to \infty then the estimator \hat \theta_n is consistent.
Example
The estimator of proportion of successes\hat \theta = T/n is consistent, because it is unbiased and the standard error is \theta(1-\theta)/n so that \lim_{n \to \infty} \theta(1-\theta)/n = 0.
2.2 Methods of estimation
The likelihood
Behind the scenes of the frequentist approach there is the fundamental concept of likelihood. Estimation methods, interval estimates and test statistics can all be derived from the likelihood function.
Idea
As the data are assumed to be generated by a model defined by a probability distribution (like the Bernoulli model with parameter \theta) all the information in the sample concerning \theta must be contained in that probability distribution
Once you get the observations {\boldsymbol{x}}_\text{obs} = (x_1, \dots, x_n) you can measure the plausibile value of the parameter \theta by studying the probability of \theta, for fixed values of the observed data {\boldsymbol{x}}_\text{obs}
Example for the one-sample Bernoulli model
In the Bernoulli model with success probability \theta the distribution of the population is
f(x; \theta) = \Pr(X = x) = \theta^x(1-\theta)^{1-x}, \qquad x = 0,1.
Suppose now that for a random sample of size n = 20 we get \begin{align*}
{\boldsymbol{x}}_{obs} &= (0, 1, 1, 0, 0, 1, 1, 0, 1, 1, \\
&\phantom{= (} 1, 1, 1, 0, 1, 1, 1, 1, 1, 1)
\end{align*}
Then the likelihood: is the probability of observing {\boldsymbol{x}}_\text{obs}
L(\theta) = \prod_{i=1}^{20} \Pr(X_i = x_i) = \theta^{15} (1-\theta)^{5}
because there are 15 successes ond 5 failures in the sample.
The shape of the Bernoulli likelihood is the following
The Likelihood’s domain is the parameter space of the success probability \theta \in (0,1).
The red dot is the value \hat \theta with the most plausible value for the success probability in the population.
Likelihood definition
The likelihood function is defined by
L(\theta, {\boldsymbol{x}}) = f({\boldsymbol{x}}; \theta)
seen as a function of \theta for given data {\boldsymbol{x}}.
Typically is convenient to work with the logarithm of the likelihood
\ell(\theta; {\boldsymbol{x}}) = \log L(\theta; {\boldsymbol{x}}).
Synthesis
The likelihood function shows, for each value of \theta, what is the probability of observing what we have observed under the model.
Given two values \theta_0, \theta_1 of the parameter, the likelihood ratio
L(\theta_1; {\boldsymbol{y}})/ L(\theta_0; {\boldsymbol{y}})
is interpreted as the degree to which data support \theta_1 versus \theta_0.
Maximum likelihood estimates
The estimation method of maximum likelihood prescribes choosing as estimator the value of \theta that maximizes the likelihood (or equivalently the log-likelihood function)
Example
The maximum likelihood estimate for \theta can be obtained analytically by
\hat \theta = \frac{\text{n. successes}}{n} = \frac{\sum_{i = 1}^n X_i}{n} = 15/20 = 0.75.
The log-likelihood and the MLE are shown below
Some properties
The log-likelihood is in general a curve (for a scalar parameter) or a surface (for a vector parameter) and can have multiple maxima. But in many important regular models there is a single maximum.
The curvature at the maximum is an indicator of the precision of the maximum likelihood estimator.
For a scalar parameter the curvature is measured by the observed information that is the negative of the second derivative of the log-likelihood function evaluated at the observed data.
The larger the information the more concentrated is the likelihood function around the maximum.
Sometimes is more convenient to work with the average curvature, called the Fisher information.
Example
The information depends on the sample size. If we had had a sample of 200 observations with 150 successes, the estimate would have been the same, but the curvature would have been much larger.
A greater curvature with 200 observations
Properties of the MLE
Under some regularity conditions on the model the MLE estimator has very appealing properties
In general
The MLE is equivariant: if \hat \theta is the MLE of \theta then g(\hat \theta) is the MLE estimator of g(\theta). For instance the MLE of the variance of the Bernoulli \theta(1-\theta) is \hat \theta(1-\hat \theta).
For large samples
The MLE is consistent
The MLE is asymptotically normal and the estimated standard error often can be computed analytically because it is related to the information.
A confidence interval of level1-\alpha is an interval with extremes L (left) and R (right) that are functions of the random sample (therefore random variables) such that
\textstyle \Pr_\theta(L < \theta < R) \ge (1-\alpha) \text{ for all } \theta.
The idea is that we have random interval that traps the true parameter \theta with probability 1-\alpha called the coverage.
Typically we use 95 percent confidence levels, which correspond to choosing \alpha = 0.05.
If the parameter \theta is a vector we can define confidence sets. For example ellipses in 2D space.
Illustration
The probability concerns the random interval not the parameter. The parameter is fixed !
Wrong interpretations
(Wasserman) Every day, newspapers report opinion polls. For example, they might say that “83 percent of the population favor arming pilots with guns.” Usually, you will see a statement like “this poll is accurate to within 4 points 95 percent of the time.”
Technically, this means that a study based on a random sample gives a confidence interval with level 0.95
0.83 \pm 0.04
for the true proportion \theta of people who favor arming pilots.
A typical error is interpreting the observed confidence interval as:
\small
\Pr( 0.79 < \theta < 0.87) = 0.95
because is conceptually wrong!
Correct interpretation
If you continue constructing confidence intervals (not only for \theta but also for a sequence of other unrelated parameters, then 95 percent of your intervals will trap the true parameter value.
2.4 Approximate normal confidence intervals
Theorem
If we have an asymptotically normal estimator\hat\theta_n of \theta such that \hat \theta_n \approx N(\theta, \text{se}^2), then we can construct an approximate CI for \theta
Given a standard normal Z\sim N(0,1) a quantile or percentile of order p (a probability) is the value z_p such that
\Pr (Z \le z_p) = p.
For example z_{0.975} = 1.96 .
This means that the probability that Z\le 1.96 is 0.975.
In the figure the orange area is 0.975. The blue dot is the quantile z_{0.975}. The red dot is the symmetric quantile z_{0.025} = -1.96.
With R you can find the quantiles easily. For example the quantile z_{0.90} (called also the ninetieth percentile) is
qnorm(0.90)
[1] 1.281552
General formula
In general the confidence intervals for \theta for large samples with confidence level 1-\alpha are
\hat \theta_n \pm z_{1-\alpha/2} \text{se}.
2.5 Two-sample confidence intervals
Continuous data and large samples
Suppose that you have two independent random samples from two continuous distributions (not necessarily normal) with means \mu_0 and \mu_1 and two unequal variances \sigma^2_0 and \sigma^2_1.
If you want a confidence interval for the difference between the means \theta = \mu_1- \mu_0 and you have a large sample then you can use the approximation of central limit theorem
\hat \theta \approx N\Big(\theta , \frac{\sigma_0^2}{n_0} + \frac{\sigma_1^2}{n_1}\Big)
If you don’t know the the population variances you can estimate them by \begin{align*}
\hat\sigma_0^2 &= \frac{1}{n_0} \Sigma_{i=1}^{n_0} (Y_i-\bar Y_0)^2\\
\hat\sigma_1^2 &= \frac{1}{n_1} \Sigma_{i= n_0+1}^{n} (Y_i-\bar Y_1)^2\\
\end{align*}
Therefore you can get the second approximation
\hat \theta \approx N\Big(\theta , \frac{\hat\sigma_0^2}{n_0} + \frac{\hat\sigma_1^2}{n_1}\Big)
Result
You can obtain an approximate1-\alpha confidence interval with
\hat \theta \pm z_{1-\alpha/2} \text{se}(\hat \theta)
with
\text{se}(\hat \theta) =\sqrt{\frac{\hat\sigma_0^2}{n_0} + \frac{\hat\sigma_1^2}{n_1}}.
Example (Age at first marriage and etnicity)
Consider the data of the Fiji Fertility Survey for all married women with age between 40-49 as a random sample of size n = 1005 from a non-normal discrete distribution.
Then let’s study the response Y, age at first marriage as a function of etnicity, I
There are two groups with size 554 (Fijian) and 451 (Indian).
The boxplot shows that on average Indian women get married earlier than Fijian.
The confidence interval of the difference is calculated as follows
The estimated standard error of the difference is calculated from \begin{align*}
\hat\sigma_1^2/n_1 &= 0.03 \\
\hat \sigma_0^2/n_0 &= 0.03 \\
\text{se} &= 0.26
\end{align*}
Therefore the 95% confidence interval for the difference of years at first mariiage \theta is
3.5\pm 1.96 \cdot 0.26 \to (2.9904, 4.0096)
Comment: Inference confirms that Indian women get married about 3.5 years earlier than Fijans with a small standard error \text{se} = 0.26. The 95% confidence interval for the difference has a margin of error of about half a year.
2.6 Controlled Experiments with 2x2 tables
Controlled experiments are conducted to assess whether a certain treatment has a significant effect on a outcome. For example, in 1954, an experimental study was conducted to test whether the Salk polio vaccine was effective.
The treatment is X. Levels: placebo, vaccine
The outcome Y = polio, no polio )
To be sure that the effect of treatment on the outcome is free from the possible influence of other variables observed or unobserved, the two groups are formed through a random process called randomization.
In the Salk field trial a large portion of the children were randomized in a double-blind, placebo-controlled experiment. Among consenting families, children were randomly assigned to receive either the vaccine or a salt solution (placebo)
A group of 200,745 children was vaccinated and a group of 201,229 different children was not vaccinated. The cases of polio in the two groups were then compared.
The structure of the experiment allows us to assume that the data are two independent samples from two Bernoulli populations \begin{align*}
Y_0 &\sim \text{Bernoulli}(\theta_0) \\
Y_1 &\sim \text{Bernoulli}(\theta_1)
\end{align*}
\theta_0 is the probability of polio in placebo group
\theta_1 is the probability of polio in vaccinated group
The parameter of interest is \delta = \theta_0 - \theta_1
Inference
The estimator of \delta is the difference of the sample proportions \hat \delta = p_0 - p_1 where
p_0 = \frac{115}{201,229} = 0.000571 \quad
p_1 = \frac{33}{200,745} = 0.000164
Since the sample sizes are huge, by the CLT we have \begin{align*}
p_0 &\approx N\Big(\theta_0, \frac{\theta_0(1-\theta_0)}{n_0} \Big)\\
p_1 &\approx N\Big(\theta_1, \frac{\theta_1(1-\theta_1)}{n_1} \Big)
\end{align*}
Then, considering the independence assumptions, the asymptotic sampling distribution of \hat \delta is
\hat \delta \approx N\Big(\delta, \frac{\theta_0(1-\theta_0)}{n_0} + \frac{\theta_1(1-\theta_1)}{n_1} \Big)
Finally the estimated standard error is (by substitution)
\text{se}(\hat \delta) = \sqrt{\frac{p_0(1-p_0)}{n_0} + \frac{p_1(1-p_1)}{n_1}}
Synthesis
The estimate of the difference is \hat \delta = 0.0004071005 with \text{se} = 0.00006047412. The difference is more than 6 times the standard error.
1-\alpha confidence interval for the difference
\hat \delta \pm z_{1-\alpha/2} \text{se}(\hat \theta)
that gives (0.0002835980, 0.0005306031).
Note The 95% confidence interval does not contain the zero value. Thus in the long run 95% of the intervals show that \theta_0> \theta_1, i.e., \theta_\text{placebo} > \theta_\text{vaccine}.
Relative risk
The relative risk (or risk ratio) of the polio for placebo versus vaccine is
\text{RR} = \frac{\theta_0}{\theta_1}
The estimator of the relative risk is
\text{rr} = p_0/p_1 = 0.000571/0.000164 = 3.48
This means that the probability of contracting the polio in the placebo group is more than three times that of the vaccine group.
Notice that the RR is more interpretable than the risk difference\delta.
A 95% confidence interval for the relative risk is obtained from the estimate of the \log(\text{RR}) and its standard error (calculations omitted)
\begin{aligned}
\log(\text{rr}) &= \log(3.48) = 1.2470323 \\
\text{se} &= \textstyle \sqrt{\frac{1}{a} + \frac{1}{c} - \frac{1}{a+b} - \frac{1}{c+d}}
\end{aligned}
where a,b, c, d are the counts in the 2x2 table
\begin{array}{|l|cc|c|}\hline
& 0 & 1 & \text{Total} \\ \hline
0 & a & b & a + b\\
0 & c & d & c + d \\ \hline
\end{array}
Then the confidence limits are
\exp(\log (\text{rr}) \pm 1.96 \cdot \text{se})
The result can be obtained by a simple R function.
R function
ci95_RR <-function(a, b, c, d){ rr <- (a/(a+b)) / (c/(c+d)) se_log_rr <-sqrt(1/a +1/c -1/(a+b) -1/(c+d)) L <-exp(log(rr) -1.96* se_log_rr) R <-exp(log(rr) +1.96* se_log_rr)cat("RR = ", round(rr, 4))cat("95% CI: (", round(L, 4), round(R, 4), ")")}a <-115; b <-201114 ; c <-33; d <-200712ci95_RR (a,b,c,d)
RR = 3.476595% CI: ( 2.3608 5.1194 )
Appendix on the SE of the log relative risk*
The logarithm of the relative risk (call this \rho)
\log(\rho) = \log(\theta_0) - \log(\theta_1)
and the estimator is
\log(\hat \rho) = \log(p_0) - \log(p_1)
We know that for large samples the two estimators p_0 and p_1 are independent and are asymptotically normal
The variance of \log \hat \rho is \begin{align*}
\mathrm{var}(\log \hat \rho) &= \mathrm{var}(\log p_0 - \log p_1)\\
&= \mathrm{var}(\log p_0) +\mathrm{var}(\log p_1)
\end{align*}
To calculate the two variances we use a useful trick called the delta method. If you have an estimator T_n that has an approximately normal sampling distribution then the sampling distribution of a transformation g(T_n) (like for example the logarithm) is also approximately normal.
Theorem (delta method)
If T_n \approx N(\theta, \sigma^2/n) then g(T_n) is approximately normal with mean g(\theta) and variance [g'(\theta)]^2 \sigma^2/n.
In the formula g'(\theta) is the derivative. Therore g must be twice differentiable
By applying the delta method to our problem we get that the asymptotic variances of \log p_i for i = 0,1 are
\begin{aligned}
\mathrm{var}(\log p_i) &= \frac{\theta_i(1-\theta_i)}{n_i}\left(\frac{d}{d\theta}\log \theta_i\right)^2\\
&= \frac{\theta_i(1-\theta_i)}{n_i}\frac{1}{\theta_i^2}\\
&= \frac{1-\theta_i}{n_i\theta_i}
\end{aligned}
Therefore
\text{SE}(\log \hat \rho) = \sqrt{\frac{1-\theta_0}{n_0\theta_0}+ \frac{1-\theta_1}{n_1\theta_1}}
And finally we get the estimated standard error by substituting to \theta_0 and \theta_1
p_0 = \frac{a}{a+b}, \quad p_1 = \frac{c}{c+d}
The result is
\begin{aligned}
\text{se}(\log\hat\rho) &= \sqrt{\frac{b}{a(a+b)} + \frac{d}{c(c+d)}}\\
&= \sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}.
\end{aligned}
2.7 Exercises
Assume that we have an IID sample from a N(\mu, \sigma^2). We know that an unbiased estimator of the population variance \sigma^2 is
S^2 = \frac{1}{n-1}\sum_{i= 1}^n (X_i - \bar X)^2.
What is the bias of the estimator of \sigma^2
\hat \sigma^2 = \frac{1}{n}\sum_{i= 1}^n (X_i - \bar X)^2 ?
Does the second estimator tend to underestimate or overestimate \sigma^2?
The bias is a number or a function?
Can you say if the estimator S^2 is consistent?
(Freedman) A simple random sample of 3500 people age 18 or over is taken in a large town to estimate the percentage of people (age 18 and over in that town) who read newspapers. It turns that 2487 people in the sample are newspaper readers. Calculate a 95% confidence interval for the percentage.
Consider the dataset fiji.R Consider the response E: education (in number of years) and the groups defined by the variable I (etnicity: 1= indian, 0=fijian). Assume that the observations E satisfy the assumption of the two-sample model.
Find a 95% confidence interval for the difference between the average number of years between the two groups.
Suggestion: read section 2.5. The approximate 95% confidence interval for the difference of the means \theta = \mu_1 - \mu_0 is
(\hat\theta_n - 1.96 \text{se} < \theta < \hat\theta_n - 1.96 \text{se} )
where \hat\theta_n = \bar{Y}_1 - \bar{Y}_0 and
\text{se}(\hat \theta_n) =\sqrt{\frac{\hat\sigma_0^2}{n_0} + \frac{\hat\sigma_1^2}{n_1}}.
(Dobson) The weights, in kg, of a IID sample of 20 men before the participation in a “waist loss” are observed. Then, the same men are observed after 12 months. These are the data.
Do these data satisfy the assumptions of the two-sample model? Yes or No and explain.
(Freedman) A large university takes a simple random sample of male students and another simple random sample of females. The sample sizes are 200 and 300. It turned out that 107 of the men in the first sample use a tablet regularly, compared to 132 of the women in the second sample.
Find a confidence interval for the difference between the percentages.
In your opinion the difference is real or is a chance variation?
3 Hypothesis testing
3.1 Basic idea
We start from a study concerning a substantive research hypothesis, like for example the study concerning the starting salaries of men and women.
We have a kind of default theory. In the example of sex discrimination, mean salaries for men and women should be the same.
The default theory must be converted into a statistical hypothesis that makes statements about population parameters, called the null hypothesis. In the example
H_0: \theta = \mu_{\text{M}} - \mu_{\text{F}} = 0
We can also identify an alternative hypothesisH_1 that specifies the directions of the departures from H_0. For example
H_1: \theta > 0
Then we ask if the data provide sufficient evidence to reject the theory. If not, we retain the null hypothesis.
The evidence depends on some test statisticT = t({\boldsymbol{Y}}) that measures the difference between the data and the hypothesis. For instance, the evidence of sex discrimination can be measured by the difference between the mean salaries of males and females. In this case
In the study concerning the gender of the births with placenta previa, we could ask if the proportion of males \theta is 1/2. Then we can compare a null hypothesis
H_0 : \theta = \textstyle\frac{1}{2}
with an alternative hypothesis
H_1: \theta \ne \textstyle\frac{1}{2}.
It seems reasonable to reject H_0 if the difference in absolute value
|\text{estimate} - \text{parameter}| = |\hat \theta - \textstyle \frac{1}{2} |
is large.
Approaches to testing
We can distinguish two different approaches to testing:
The Fisher’s approach is directed to evaluate the evidence against a single hypothesis H_0 measured by a normalized value of the distance of the data and the hypothesis, called p-value
The Neyman and Pearson’s approach considers two hypotheses H_0 and an alternative H_1 and use a decision procedure on the basis of data, either to accept or reject H_0.
3.2 The Fisher approach
The procedure consist in defining a test statistic, that is a function of the data measuring the differencet({\boldsymbol{Y}}) between the data and what is expected on the null hypothesis. Typically,
t({\boldsymbol{Y}}) = \frac{\text{observed} - \text{expected}}{\text{se}}
The observed value of the test statistic is denoted by t_{\text{obs}}.
Then we calculate the chance of getting a test statistic as extreme or more extreme than t_{\text{obs}}, computed on the basis that H_0 is right:
p = \Pr(t({\boldsymbol{Y}}) \ge t_{\text{obs}}; H_0)
This probability is called the observed significance level or briefly the p-value.
Test concerning the two samples of salaries
Start from a general model assumed to be the generating process of the observations
Y_i = \mu + \theta X_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \text{ for } i = 1, \dots, n
where \theta = \mu_1 - \mu_0 is the parameter of interest and X_i is just an indicator of the two samples.
Define a null hypothesisH_0: \theta = 0 asserting that the two populations are identical:
Y_i = \mu + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)
meaning that the mean salaries are equal.
An alternative hypothesis can be H_1: \theta > 0 so that \mu_{\text{1}} > \mu_{\text{0}} (men salaries > women salaries)
Define a test statistic (said the Wald test statistic)
t({\boldsymbol{Y}}) = W = \frac{\bar Y_1 - \bar Y_0}{\text{se}(\hat\theta)}
that is the difference between the means divided by the estimated standard error, calculated with the pooled variance.
In our data about salaries the value of the test statistic is
w_{\text{obs}} = \frac{818}{130}= 6.3
To evaluate if this value is small or large we calculate the p-value:
p_\text{obs} = \Pr(W \ge 6.3; H_0).
To get this probability we should find the null distribution of W
Fortunately we know that Wunder the null hypothesis has t-distribution (thanks to William Gosset said “Student”) with parameter n-2.
We will give more details later, but for the moment we just plot it. The following figure shows the density of a T_{n-2}, with n = n_0 + n_1 = 93
As you see the curve is similar to the standard normal.
The red dot, representing the value w_{\text{obs}} = 6.3, is quite far in the right tail so that the p-value is extremely small.
To find the p-value for the test of \theta=0 we calculate the left-tail area with R using the function pt() that finds the area under the normal in the interval (6.3, \infty):
pt(6.3, df =91, lower.tail =FALSE)
[1] 5.202414e-09
A p-value close to 0 gives an indication that a value of w_{\text{obs}} is unlikely under the hypothesis. In this sense the smaller is the p-value the larger is the evidence against the hypothesis.
Traditionally, the p-value is reported on a rough scale:
\small
\begin{array}{cll}
\text{P-value} & \text{Interpretation} & \text{Jargon}\\ \hline
\le 0.01 & \text{strong evidence against } H_0 & \text{highly significant} \\
\bumpeq 0.05 & \text{moderate evidence against } H_0 & \text{significant}\\
> 0.1 & \text{no evidence against $H_0$ } & \text{non-significant}\\ \hline
\end{array}
A non-significant test does not imply that the data confirmH_0.
The p-value is NOT the probability that the null is true: in frequentist inference there isn’t any concept of probability of hypotheses.
3.3 Neyman-Pearson’s approach
A technological problem
A manufacturing plant produces batches of a chemical using a standard production method (A). Now they try a modified method (B) that is supposed to give a higher yield
The experimenters verify that the batches could be supposed independent and that the order of the runs has no influence on the results.
Problem: decide using an experiment whether method (B) gives significantly higher yields than method (A).
Procedure
Define the experiment: use a completely randomized experiment by making in sequence 10 batches of a chemical using the standard production method (A) followed by 10 batches using the modified method (B).
Assume the data are generated by a two-sample model
Y_i = \mu_A + \theta X_i + \varepsilon_i, \quad \varepsilon_i \sim_{\text{ ind}} N(0, \sigma^2)
where X_i=0 if chemical i is produced by method (A) and X_i = 1 if is produced by method (B)
Define the null and alternative hypotheses. In general they define a partition of the parameter space \Theta:
H_0 : \theta \in \Theta_0 \text{ versus } H_1:
\theta \in \Theta_1
For instance :
H_0: \mu_B - \mu_A = \theta = 0 \text{ vs } H_1: \theta > 0
A hypothesis is said simple if it defines a single distribution. Otherwise it is said composite
Note: In the previous example both hypotheses are composite!
Decision Rule
Before collecting the data define an appropriate subset R in the sample space called the rejection region such that the decision rule is
\small
\begin{aligned}
{\boldsymbol{Y}} \in R &\implies \text{ reject } H_0\\
{\boldsymbol{Y}} \not\in R &\implies \text{retain (do not reject) } H_0
\end{aligned}
Usually the rejection region is defined by
\small
R = \{ {\boldsymbol{Y}} : t({\boldsymbol{Y}}) > c\}
where t({\boldsymbol{Y}}) is a test statistic and c is called the critical value.
Errors: Two types of error result from the decision rule:
type I error: reject H_0 when it is true (convict an innocent)
type II error: retain H_0 when it is false (absolve a guilty)
Note Unfortunately if we can reduce the type I error, the type II error increases and viceversa.
Power and size
A specific distinction from the Fisher’s approach is the definitions of power function and size of the test.
The power function is the probability of rejecting H_0 for all values of the parameter \theta
The size of the test, called \alpha, is the maximum probability of type I error when the value of the parameter \theta satisfies the null hypothesis.
Most powerful test
Definition: The optimal decision rule for Neyman and Pearson consists in finding the rejection region t({\boldsymbol{Y}}) > c with the highest power under the alternative H_1among all regions with fixed size\alpha
The famous Neyman and Pearson lemma shows that in some special cases the most powerful test exists.
Unfortunately, finding this optimal test is rather hard !
However, there are several tests that come close to this ideal.
In the rest of the chapter I will just list some relevant examples.
3.4 Some useful tests
The t-test for the two-sample model
We met before the two-sample model with normal errors. The t-test concerns the test about the difference of the two means \theta = \mu_1 - \mu_0.
Two-sided test
H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0
One-sided test (right)
H_0: \theta \le \theta_0 \text{ versus } H_1: \theta > \theta_0
Null distribution: is a t distribution with parameter (degrees of freedom} n-2:
W \sim T_{n -2}.
Rejection region with size \alpha
\begin{cases}
\text{ for the two-sided hypothesis: } &
|W| > t_{1-\alpha/2} \\
\text{ for the one sided hypothesis: } &
W > t_{1-\alpha}
\end{cases}
where t_{1-\alpha/2} and t_{1-\alpha} are the quantiles of the T_{n-2} distribution, shown in the figure below
The experiment of chemical production
The two samples of chemical batches are
YA <-c(89.7,81.4,84.5,84.8,87.3,79.7,85.1,81.7,83.7,84.5)YB <-c(84.7,86.1,83.2,91.9,86.3,79.3,82.6,89.1,83.7,88.5)
The test statistic is calculated as follows. The estimate of \theta is
\hat \theta = \bar y_B - \bar y_A = 85.54 - 84.24 = 1.3
the estimate of the variance \sigma^2 is the pooled variance \begin{align*}
s^2_p &= \frac{\Sigma_{i\in A} (y_i - \bar y_A)^2 + \Sigma_{i \in B} (y_i - \bar y_B)^2}{n_A + n_B -2} \\
&= 10.87
\end{align*} and the estimate of the standard error is
\text{se} = \sqrt{s^2_p \left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = 1.47
Therefore, the test statistic for H_0: \theta = 0
w_{obs} = \frac{1.30 - 0}{1.47} = 0.88
The rejection region for the one-sided test H_1: \theta > 0 of size \alpha = 0.05 is W > t_{18, 0.05}. The region can be found with the R function qt:
Thus the rejection region consists of all the samples that have the Wald statistic W > 1.734. As w_{obs} = 0.88 the null hypothesis is not rejected at the level \alpha = 0.05.
so that the test is not significant at the 5 % level.
Tests are sometimes less useful than CI
Most statisticians believe that tests have been overemphasized in research because they merely indicate whether the specific parameter value in H_0 is plausible.
When a p-value is small, the test indicates that the hypothesized value is not plausible, but tells us little about which potential parameter values are plausible
A confidence interval is more informative because it shows the whole set of believable values. Also it shows if the null is badly false by showing if the values in the interval are far from \theta_0.
That helps in evaluating whether the difference between the true value and the \theta_0 value has practical importance.
Example from Wasserman
A level \alpha test rejects H_0: \theta = \theta_0 if and only if the 1-\alpha CI does not include \theta_0.
In the figure below there are two different confidence intervals.
Two cases
Both exclude \theta_0 so in both cases the test would reject H_0.
However in the first case the finding is of little scientific or practical value.
In the second case, the finding is of scientific value.
Test for the difference of proportions
Model: Two independent IID samples of sizes n_0 from a \text{Bernoulli}(\theta_0) and of size n_1 from a \text{Bernoulli}(\theta_1)
H_0: \delta \ne 0, reject if |W_n| > z_{1-\alpha/2}
H_0: \delta > 0, reject if W_n > z_{1-\alpha}
H_0: \delta < 0, reject if W_n < -z_{1-\alpha}
Note: This test can be applied to the 2x2 tables for prospective studies
Prospective studies are of two types:
Clinical trials: they randomly allocate subjects to the two groups of interest, such as (vaccine and placebo) observing the response variable in future time
Cohort studies: subjects make their own choice about which group to join (the group of those who go to the fitness center and those who don’t) and the study observes the response variable in future time.
3.5 Likelihood Ratio Test
The Wald test W is useful for scalar parameters.
A more general test is the Likelihood Ratio Test (LRT)
Definition of the LRT
If we want to test {\boldsymbol{\theta }}\in \Theta_0 versus {\boldsymbol{\theta }}\not\in \Theta_0 we can use the test statistic
\text{LRT} = 2 \log\left(\frac{ L(\hat {\boldsymbol{\theta}})}{L(\hat {\boldsymbol{\theta}}_0)} \right)
where \hat {\boldsymbol{\theta}} is the MLE and \hat{\boldsymbol{\theta}}_0 is the MLE, restricted to the parameter space \Theta_0
Example
Assume that you have an IID sample of size 100 from a Bernoulli with an unknown probability of success \theta. For example, think of throwing a coin 100 times. Then you want to test the hypothesis
H_0: \theta = 1/2 \quad \text{versus } H_1: \theta \ne 1/2
An approximate test for large samples is
W = \frac{p - 0.5}{\sqrt{p(1-p)/n}} \approx N(0,1) \text{ under } H_0.
An alternative is the likelihood ratio test. Given the observed number of successes t = \sum_{i=1}^n x_i calculate the MLE p = t/n and the MLE restricted to H_0 that is simply p_0 = 1/2.
Remembering that L(\theta) = \theta^t(1 - \theta)^{n-t} the log-likelihood is
\ell(\theta) = t \log\theta + (n-t) \log(1-\theta)
Finally the log-likelihood ratio is \begin{align*}
\ell(p) &= t \log(\hat p) + (n - t) \log(1 - \hat p)\\
\ell(p_0) &= t \log(p_0) + (n - t) \log(1 - p_0)\\
\text{LRT} &=2 \cdot \text{difference}
\end{align*}
Suppose that the number of successes is 65 in 100 coin tosses. The computations are shown below in R:
LRT_prop <-function(x, n, p0) {# x: number of successes# n: sample size# p0: proportion under the null p_hat <- x / n # MLE# logL0 <- x *log(p0) + (n - x) *log(1- p0) logL1 <- x *log(p_hat) + (n - x) *log(1- p_hat)# Test statistics -2 * log(LR) lrt_stat <--2* (logL0 - logL1)# p-value (1 degree of freedom) p_value <-pchisq(lrt_stat, df =1, lower.tail =FALSE)return(list(statistic = lrt_stat, p_value = p_value, estimate = p_hat))}# Example: Test H0: pi = 0.5 with 65 successes in 100 rollsLRT_prop(x =65, n =100, p0 =0.5)
The sampling distribution of the test, under the null hypothesis, is asymptotically a chi-squared with 1 degree of freedom.
Therefore, the test finds evidence of a significant difference from \theta = 1/2
The likelihood theory proves the following theorem:
Theorem
The LRT statistic has a chi-squared asymptotic distribution under the null hypothesis H_0: {\boldsymbol{\theta }}\in \Theta_0.
The degrees of freedom are equal to the dimension of \Theta minus the dimension of \Theta_0.
3.6 Exercises
A person who claims to possess extrasensory perception (ESP) says she can guess more often than not the outcome of a flip of a balanced coin. Out of 200 flips, she guesses correctly 110 times.
What is the nature of the variable to be studied?
Specify the hypotheses H_0 and H_1
Give the test statistic, its distribution under the null and its observed value
What is the p-value?
What is your conclusion?
(Agresti) When the 2000 General Social Survey asked, “Would you be willing to pay much higher taxes in order to protect the environment?” it was of interest to know if the proportion of people answering yes was 0.5.
Specify the null and the alternative hypotheses
Give a test statistic and its sampling distribution under the null hypothesis.
The survey result was that 369 people answered yes and 483 answered no. Apply the test giving the observed test statistic, the p-value and its interpretation in the context.
Calculate a 95% confidence interval for the parameter and explain an advantage of the CI over the significance test.
A recent study compared different psychological therapies for teenage girls suffering from anorexia. Each girl’s weight was measured before and after a period of therapy. The variable of interest was the weight change, defined as weight at the end of the study minus weight at the beginning of the study. In this study, 29 girls received cognitive behavioral therapy directed in identifying the thinking that causes the undesirable behavior and replacing it with thoughts designed to help improve this behavior. The data in pounds (in file `anorexia.csv’) are shown below
Do a significance t-test for finding the strength of evidence supporting the effectiveness of the cognitive behavioral therapy, that is, to determine whether it results in a positive mean weight change.
Show the hypotheses, the test statistic, the p-value and comment.
Find a confidence interval for the difference in kg.
Do a boxplot of the weight difference. What do you note? Do you think that the data are normally distributed?
For a large supermarket chain in Florida a women’s group claimed that female employees were passed over for management training in favor of their male colleagues. The company denied this claim, saying they picked the employees from the eligible pool at random to receive this training. Also, statewide, the large pool of more than 1000 eligible employees who can be tapped for management training is 40% female and 60% male. Therefore the company says that there is no gender bias because since this program began, of the 40 employees chosen for management training 28 were male and 12 were female. The company says that a test of H_0: \theta = 0.6 against H_1:\theta > 0.6 is not significant.
Check the company’s claim that there is no evidence of gender bias.
Find a confidence level for \theta. Make your comments.
(Agresti). The data below come from the Physicians’ Health Study. This was a five-year randomized study concerning if regular intake of aspirin reduces the chance of heart attacks.
Every other day, the male physicians participating in the study took either one aspirin tablet or a placebo.
The study was blind. The physicians in the study did not know which type of pill they were taking.
Test the hypothesis H_0: \delta = 0 versus H_1: \delta \ne 0 with \delta = \pi_1 - \pi_0 is the risk difference between the Aspirin and the Placebo groups.
4 Introduction to linear regression models
4.1 The straight line model
There are situations with n measurements Y_i associated with the values X_i, i = 1, \dots, n of another variable.
For instance, consider systolic blood pressures y_i of n = 12 women and the corresponding ages x_i
The data are pairs of observations (X_1, Y_1) , \cdots, (X_n, Y_n). The variable X is called a predictor or a regressor and Y is called the outcome or the response
Notice that the two variables are not on equal standing but, according to a research hypothesis, the predictor is expected to influence the outcome.
Important: the research hypothesis is an assumption
The ages instead are considered as fixed values representing the conditions under which the blood pressures were measured.
Therefore we are interested in the function
r(x) = E(Y \mid X = x)
for x varying in a realistic interval. The function r(x) is called the regression function
Scatter of the data on age and systolic blood pressure with two superposed regression functions
The above scatter of the two variables suggests that in the range of ages between 35 and 70 the expected blood pressure increases with age with an approximately linear regression function.
We can specify a regression model of this form
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
or a slightly nonlinear model like
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i
The errors\varepsilon_i are usually assumed to be mutually independent, with:
E(\varepsilon_i \mid X_i) = 0, \text{ and } \mathrm{var}(\varepsilon_i \mid X_i) = \sigma^2.
The errors are not necessarily normal,
but their variance \varepsilon_i is assumed to be constant
The above linear model, called the simple linear regression model generates the data with 3 parameters: \beta_0,\beta_1 and \sigma^2.
The parameter \beta_1 is the slope of the line, i.e., the difference in the expected blood pressures for two women with an age difference of one year.
The variances \mathrm{var}(Y_i) = \sigma^2 - as we said - are assumed to be constant, so that \sigma is the standard deviation of the differences Y_i - \mu_i.
4.2 Another example
Consider the following data on measurements of abdominal circumference Y taken from 610 fetuses during ultrasound scans at Kings College Hospital, London, at gestational ages X ranging between 12 and 42 weeks.
Of course Y depends on gestational age and specifically the average abdominal circumference grows smoothly when the age grows.
The joint distribution here is shown by this scatterplot with an added estimated regression function.
Abdominal circumference and gestational age
In this case the regression function is close to a line (except perhaps at lower gestational age)
The errors appear to be small, but the assumption of constant variance is not adequate.
4.3 Least squares estimators
If the unknown parameters in the linear model are estimated in some way by \hat \beta_0 and \hat \beta_1 the fitted line is
\hat r(x) = \hat\beta_0 + \hat \beta_1 x
The fitted values are \hat Y_i = \hat r(x_i) and the residuals are defined by
\hat \varepsilon_i = Y_i - \hat Y_i = Y_i - (\hat \beta_0 + \hat \beta_1 x_i)
The residual sum of squares (RSS) is defined by
\text{RSS} = \sum_{i=1}^n \hat \varepsilon_i^2
measure how well the straight line fits the data. Of course the RSS depend on the scale of the measurements
Finally we introduce a general method of estimation of \beta_0 and \beta_1 called the least squares:
Definition
The least-squares estimates are the values \hat \beta_0 and \beta_1 that minimize\sum_{i=1}^n \hat \varepsilon_i^2
Theorem
The least-squares estimates are given by \begin{align*}
\hat \beta_1 &= \frac{\sum_i (x_i - \bar x)(Y_i - \bar Y)}{\sum_i (x_i - \bar x)^2} \\
\hat \beta_0 &= \bar Y - \hat \beta_1 \bar x
\end{align*} An unbiased estimate of \sigma^2 is
s^2 = \frac{\sum_{i = 1}^n\hat\varepsilon_i^2}{n-2}
Example (age - blood pressure)
Assume that the blood pressure satisfy the simple linear regression model with homoskedastic errors.
The LS estimates of the intercept and the slope is found using the previous formulae or using the R function lm() (linear model)
The fitted line is \hat r(x) = 80.8 + 1.14 x, shown in the previous plot.
The estimated variance of the errors is
sum(e^2)/(12-2)
[1] 49.24669
The estimated SD of the errors is 7.02 (in scale of the blood pressure).
Interpretation
Usually the intercept is not interpreted
The fitted slope \hat \beta_1 = 1.14 is interpreted by saying that the difference in blood pressures for two women with an age difference of one year is about 1 mm Hg.
We could even go so far as to say that the difference in blood pressures for two women with an age difference of 10 years is about 10mm Hg (all things remaining the same)
Least-Squares properties
The least squares estimates do not assume normality. They have nevertheless some remarkable properties
Unbiasedness: The expected value of the estimated coefficients equals the true population parameters
Minimum MSE (Efficiency): Within the class of linear, unbiased estimators, the LS estimators have the smallest possible variance, ensuring the highest precision.
Consistency: As the sample size approaches infinity, the LS estimator converges to the true parameter value.
Normal Distribution: If the errors are normally distributed, or if the sample size is large (via the Central Limit Theorem), the LS estimates are normally distributed, allowing for hypothesis testing and confidence intervals.
Estimated standard errors of the estimates are obtained explicitly, for example:
Confidence Intervals: Standard errors are used to obtain approximate 1-\alpha confidence intervals for coefficients, using the normal quantiles
\hat \beta_1 \pm z_{1-\alpha/2} \text{se}(\hat \beta_1)
if the the sample size is large.
Impact of Sample Size: As the sample size increases, the standard errors typically decrease, increasing the precision of the estimates.
Impact of the range of the variableX: The larger the range (max-min) of X the better the precision of \hat \beta_1
Maximum likelihood
If we can assume that \varepsilon_i \sim N(0, \sigma^2) we can use the method of maximum likelihood.
Theorem
Under the assumption of normality the MLE of the regression coefficients \beta_0 and \beta_1coincide with the least-squares estimators.
The MLE of \sigma^2 is not exactly unbiased (as expected)
\hat \sigma^2_{\text{ML}} = \frac{\sum_{i = 1}^n\hat\varepsilon_i^2}{n}
All the other properties of the MLE are maintained.
Test of hypotheses for the slope
With normality assumption
The test of H_0: \beta_1 = c versus H_1: \beta_1 \ne c the statistic
W = \frac{\hat \beta_1 - c}{\text{se}(\hat\beta_1)}
has a T_{n-2} distribution under H_0
The rejection region at level \alpha is
|W| = \left|\frac{\hat \beta_1-c}{\text{se}(\hat \beta_1)}\right| > t_{n-2, 1-\alpha/2}.
With large samples
If the assumption of normality is dubious but you have a large sample the previous test statistic has approximately a N(0,1) distribution under the null.
Therefore, the 1-\alpha rejection region is found using the normal quantiles z_{1-\alpha/2}.
Example
The complete inference for the linear regression model for the data on age and blood pressure is shown in the following table
summary(lm(y ~ x))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.777730 9.5437459 8.463944 7.162571e-06
x 1.138005 0.1782089 6.385794 7.976396e-05
From this we have
the standard error of the slope \text{se}(\hat\beta_1) = 0.18
With this small sample if we believe that the errors are normal, the slope is significantly different from 0 because, for \alpha = 0.05,
|W|= 6.386 > t_{8, 1-\alpha/2} = 2.31
A confidence interval is
confint(lm(y ~ x))
2.5 % 97.5 %
(Intercept) 59.5129389 102.042521
x 0.7409311 1.535079
As usual the intercept can be disregarded. The 95% confidence interval for the slope is (0.74, 1.54) giving an idea of precision of the estimator.
Predictor effect plot
This plot defines a point-wise confidence envelope that represent the uncertainty band around the estimated regression line.
library(effects)plot(predictorEffects(lm(y~x), ~ x), main ="")
The envelope is formed by a set of confidence intervals calculated independently for each value of the predictor.
It is not a band that must contain the entire true line. But for each single point chosen on the horizontal axis is calculated a confidence interval for the associated fitted value
Given a point x_0 of the age, the fitted value \hat y_0 = \hat \beta_0 + \hat \beta_1 x_0 has an estimated standard error
\text{se}(\hat y_0) = s\sqrt{\frac{1}{n} + \frac{(x_0-\bar x)^2}{\sum_{i=1}^n(x_i - \bar x)^2}}
Then the confidence envelope is defined by the limits
\hat y_0 \pm z_{1-\alpha/2} \text{se}(y_0)
The envelope does not have a constant width: In the center the band is narrower because the estimate is more accurate. At the end the band widens as we move away from the mean: the greater the distance from the mean, the greater the standard error.
4.4 Model criticism
After fitting the model we can use the residuals to criticize our model specification
Residual Analysis: Residuals (observed - predicted values) should be randomly distributed around zero. The residuals should be independent from X.
Non-linearity: A pattern in the residual plot suggests a non-linear relation.
Heteroscedasticity: Particular shapes in residuals indicates non-constant error variance.
Normality: Q-Q plots are used to check if residuals are normally distributed.
Example (gestational age)
Assuming a simple linear regression model with homoscedastic errors the regression coefficients are
summary(lm(y ~ x, data = abdom))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) -55.17951 2.0010031 -27.57593 3.401592e-109
x 10.33823 0.0700963 147.48606 0.000000e+00
The fitted line
\hat y = -55.18 + 10.34 x
means that the circumference of the abdomen increases of about 1 cm per week
Residuals vs fitted values: The shape of the points on the left plot shows that the residuals are not randomly distributed. This is a sign of a misspecification of the regression function. The red curve (a fitted non-parametric function) suggests a nonlinear regression function.
Moreover, in the same plot, the dispersion of the points around the regression function tends to increase as gestational age increases, a clear sign of heteroskedasticity
The second plot to the right is a quantile-quantile plot. The genearal idea is to plot the sample quantiles for the residuals vs the theoretical quantiles of a standard normal. If the residuals have a normal distribution the points in the plot should be aligned on a straight line. In this case the residuals are somewhat aligned in the center of the distribution, but not in the tails.
4.5 Goodness of fit
A very popular index to be used in regression is the coefficient of determination called briefly R squared.
The definition is
R^2 = \frac{\sum_{i=1}^n(\hat y_i - \bar y)^2}{\sum_{i=1}^n(y_i - \bar y)^2} = \frac{\text{ESS}}{\text{TSS}}
This is the ratio of the Explained Sum of Squares and the Total Sum of Squares.
The following result proves that the explained sum of squares is always lower than the total sum of squares.
Theorem
The residual sum of squares is
\text{RSS} = \text{TSS} - \text{ESS}
The interpretation is that the total sum of squares \text{TSS} can be partly “explained” by the fitted regression \text{ESS}.
Therefore R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} and of course 0\le R^2 \le 1.
If R^2 is close to 1 then the points are close to the line ad the residuals are small
If R^2 is close to 0 then the points are more dispersed
Warning
A large value of R^2 does not mean that the model chosen is correct.
A small value of R^2 does not mean that the model is wrong.
This means that about 80% of the variability of the response is explained by a linear regression
Measures on the fetus
summary(lm(y ~ x, data = abdom))$r.squared
[1] 0.9728088
This means that 97% of the variability of circumference of the abdomen of a fetus is explained by the weeks of gestation.
The coefficient of determination is quite close to 1 but ew discovered that the straight line model is not totally right especially because we wrongly assumed the homogeneity of the error variances.
Example (fertility depending on education)
Let consider now the data of the Fiji Fertility Survey not as an example of a complete population but as a sample.
Then let’s study the response number of children as a function of education within a stratum of all married women 40-49 at the time of the survey, in order to obviate the need for a control for woman’s age.
The fitted lines show that, marginally, fertility decreases with education.
The table of the regression coefficients is obtained with
sel <- (fiji$A <=49) & (fiji$A >=40) fiji1005 <- fiji[sel,] lsE <-lm(F ~ E, data = fiji1005)summary(lsE)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.8854869 0.15448462 44.570695 4.227095e-240
E -0.1389875 0.03179211 -4.371759 1.360988e-05
The coefficient of determination is
summary(lsE)$r.squared
[1] 0.0186988
Comments
The regression line is
\widehat{F} = 6.9 - 0.14\cdot E
The standard error of the slope is 0.0318.
The test of the hypothesis \beta_1 = 0 is rejected at level 0.05 because
|W| = |-4.372| > 1.96 \quad (\text{ assuming normality because $n$ is large})
The p-value is
2*(1-pnorm(abs(-4.372)))
[1] 1.231135e-05
The p-value in the table of regression coefficients is different, because the function lm() assumes that the errors are exactly normal. In that case the test statistic has an exact null distribution T_{n-2}. Thus
2*(1-pt(abs(-4.371759), df =1005-2))
[1] 1.360988e-05
Notice that despite a quite smallR^2 = 0.019 the slope of the regression is significantly different from zero
Example (Fertility vs education, given etnicity)
The regression line can be different if we consider etnicity I where 1 = Indian and 0 = Fijian.
par(mfrow =c(1,2)) data0 = fiji1005[fiji1005$I =="0",] data1 = fiji1005[fiji1005$I =="1",]plot(F ~jitter(E), data = data0, main ="Fijian", ylim =c(0, 16))abline(lm(F ~ E, data0), lwd =3, col ="red")plot(F ~jitter(E), data = data1, main ="Indian", xlim =c(0,20), ylim =c(0,16))abline(lm(F ~ E, data1), lwd =3, col ="darkgreen")
In this case the slopes of the regression lines are different.
summary(lm(F ~ E, data0))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.73675977 0.38768403 14.7975136 1.223775e-40
E 0.02132646 0.06398459 0.3333062 7.390588e-01
summary(lm(F ~ E, data0))$r.squared
[1] 0.000247362
summary(lm(F ~ E, data1))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0883413 0.16942985 41.83644 2.565648e-173
E -0.1373248 0.04708628 -2.91645 3.684353e-03
summary(lm(F ~ E, data1))$r.squared
[1] 0.01517501
The slope for the Fijians is positive but not significant while that for the Indians is significantly negative
Both R^2 are quite small.
4.6 Exercises
In the model Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i what is the characteristic of \beta_1? Options: (you can choose more than one): observable, unobservable, a parameter, a random variable.
Suppose you study the relation between altitude and temperature. The data pairs0001.txt in Moodle contain the variables: temperature (average) in degrees Celsius and altitudein meters taken at 349 stations in Germany.
Read the data in R with the command dati <- read.table("pairs0001.txt" header = TRUE)
Choose the role of the variables: in your opinion which is the response?
Transform the altitudes in km. Then fit a simple linear regression model.
Give a general comment of your analysis. For example give the estimates of the regression coefficients and a confidence interval for the slope.
Discuss if the model is well specified using some plots.
Can you suggest corrections to get a better fit?
True or false and justify:
If R^2 = 0 the fitted regression line has a zero slope.
The fitted values have a sum equal to n \bar y.
If the coefficient of determination is zero, it is certain that the response Y does not depend in any way on the predictor X.
(Weisberg) The data ftcollinssnow.R in Moodle, contain the variables
Early: September to December snowfall, inches
Late: January to June snowfall, inches
both measured in inches at Ft. Collins, Colorado (Colorado Climate Center, 2012).
The question is: can early season snowfall from September 1 until December 31 predict snowfall in the remainder of the year, from January 1 to June 30?
To answer, use a simple linear regression model and test if the estimate of the slope is significant. The data frame ftcollinssnow is obtained in the workspace with the command: source("ftcollinssnow.R")
(Freedman) You have a model Y_i = \beta_0 + \beta_1 x_i+\varepsilon_i with \varepsilon_i \sim N(0, \sigma^2) IID. Your theory is that \beta_1 \ne 0. You fit the model using least-squares to your sampled data that have size n = 57 and get \hat \beta_1 = 3.79 with a standard error \text{se} = 1.88. True or False and explain:
for testing the null hypothesis that \beta_1 =0 you get a test statistic t = 2.02 (rounded)
\hat \beta_1 is statistically significant
\hat \beta_1 is highly significant
The probability that \beta_1 \ne 0 is about 95%
The probability that \beta_1 = 0 is about 5%
If the model is right and \beta_1 = 0 there is about a 5% chance of getting |\hat \beta_1/ \text{se}|>2
If the model is right and \beta_1 = 0 there is about a 95% chance of getting |\hat \beta_1/ \text{se}|<2
You can be about 95% confident that \beta_1 \ne 0
The test shows that the model is right
The test assumes that the model is right.
References
The lectures are mainly based on:
Wasserman, L. (2004). All of Statistics. New-York: Springer-Verlag.
The examples discussed come from:
Agresti, A., Franklin, C. (2013). Statistics. The art and science of learning from data. Boston: Pearson Education, Inc.
Dobson A. J., Barnett, A. G. (2018). An Introduction to Generalized Linear Models,4th ed., Boca Raton FL: CRC Press
Freedman, D., Pisani, R., Purves, R. (2007). Statistics 4th ed. New York: Norton & Company
Gelman, A., Carlin, J.B., Stern, H. S. & Rubin, D. B. (2) Bayesian data analysis, Boca Raton FL: Chapman & Hall/ CRC
Ramsey, F. L., Shafer, D. W. (2013). The statistical sleuth, 3rd ed. Boston: Brooks/Cole
Retherford, R. D., Choe, M. K. (1993). Statistical models for causal analysis. New York: Wiley
Weisberg, S. (2014) Applied linear regression, Hoboken: John Wiley & Sons.
Comments
The regression line is \widehat{F} = 6.9 - 0.14\cdot E
The standard error of the slope is 0.0318.
The test of the hypothesis \beta_1 = 0 is rejected at level 0.05 because |W| = |-4.372| > 1.96 \quad (\text{ assuming normality because $n$ is large})
The p-value is
The p-value in the table of regression coefficients is different, because the function
lm()assumes that the errors are exactly normal. In that case the test statistic has an exact null distribution T_{n-2}. ThusNotice that despite a quite small R^2 = 0.019 the slope of the regression is significantly different from zero