Chapter One: Simple Linear Regression

Introduction

  • Regression analysis deals with the formulation of mathematical models that describe relationships among variables.
  • Regression analysis is concerned with describing and evaluating the relationship between a single variable \(y\), called the dependent or response variable, and one or more variables which are assumed to influence the response, often called independent or explanatory variables, \(x_1, x_2, x_3, \ldots, x_p\).
  • The response variable must be continuous, but the explanatory variables can be continuous, discrete or categorical.
  • When \(p=1\) the model is called simple linear regression; when \(p > 1\) it is called multiple linear regression (discussed in the next chapter).

Introduction

  • In simple linear regression a single predictor or independent variable is used for predicting the variable of interest, i.e. dependent variable.
  • In this course we will consider parametric regression models and we will assume a relationship that is linear in the parameters.
  • The Simple Linear Regression Model (SLRM) is:
    • simple since there is only 1 predictor/independent variable
    • linear or linear in the parameters since no parameter appears as an exponent or is multiplied or divided by another parameter

Introduction

Objectives of regression analyses can be:

  • to describe the data and its structure
  • to estimate parameters
  • to predict a continuous dependent variable from one or a number of independent variables
  • to check whether there is a relationship between a dependent or response variable and one or more independent variables
  • to explain the impact on the response variable of one or more independent variables
  • to build a theory

The model

Equation of the model

  • The model expresses the value of a response variable as a linear function of a predictor variable and an error term. The simple linear regression model can be stated as follows: \[y_i = \beta_0 + \beta_1\,x_i + \varepsilon_i, \;\;\; i=1,\ldots,n\] where
  • \(y_i\) represents the \(i^{th}\) value of the response variable \(Y\),
  • \(x_i\) represents the \(i^{th}\) value of the predictor variable \(X\),
  • \(\beta_0\) and \(\beta_1\) are constants called regression coefficients or parameters,
  • \(\varepsilon_i\) is a random disturbance or error.

The error term

  • Generally, the reasons for including the error term are:
    • Omitted variables: it is not always possible to include all relevant variables in a functional form. The omission of these variables from the model introduces an error.
    • Measurement error: inaccuracy in the collection and measurement of the sample data.
    • Sampling error: the sample we randomly choose to examine the relationship between the variables may turn out to represent predominantly some specific group. In such cases, our estimates of the parameters may not be as good as those from a balanced sample.
    • Random influences which are beyond our control.

Matrix notation of simple linear regression model

  • The previous model implies that \[ \left\{ \begin{array}{ccl} y_1& =& \beta_0 + \beta_1 x_1 + \varepsilon_1\\ \vdots \\ y_i& =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ \vdots \\ y_n& =& \beta_0 + \beta_1 x_n + \varepsilon_n \end{array} \right.\]

Matrix notation of simple linear regression model

  • This can be written in matrix form as follows \[ \left( \begin{array}{c} y_1 \\ y_2\\ \vdots\\ y_n \end{array} \right) = \left( \begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots\\ 1 & x_{n} \end{array} \right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \end{array} \right) + \left( \begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_n \end{array} \right) \]

Matrix notation of simple linear regression model

  • The column of ones incorporates the intercept term. In compact form, \[\begin{eqnarray*} \begin{array}{cccccc} {\bf y} &=&{\bf X}&\beta & +&\epsilon \\ (n\times 1) & &(n \times 2) &(2 \times 1) & &(n \times 1) \end{array} \end{eqnarray*}\]
  • The fitted simple linear regression model is: \(\hat y_i= \hat {\beta_{0}} + \hat {\beta_1}x_i\)

Model formulation in R using the lm() function

The Galapagos Islands Data

  • A data frame with 30 Galapagos islands and 7 variables is available in R under the library named faraway:
    • Species: the number of plant species found on the island
    • Endemics: the number of endemic species
    • Area: the area of the island (km\(^2\))
    • Elevation: the highest elevation of the island (m)
    • Nearest: the distance from the nearest island (km)
    • Scruz: the distance from Santa Cruz island (km)
    • Adjacent: the area of the adjacent island (square km)
  • The data were presented by Johnson and Raven (1973) and also appear in Weisberg (1985). The original dataset contained several missing values which have been filled for convenience.

Model formulation in R using the lm() function

Response and predictor

  • The relationship between the number of plant species and several geographic variables is of interest.
  • In this chapter we are interested in the relationship between the number of plant species and the elevation of the island (m).

Model formulation in R using the lm() function

library(faraway)
## 
## Attaching package: 'faraway'
## The following object is masked from 'package:lattice':
## 
##     melanoma
data(gala)
attach(gala)
head(gala)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

Model formulation in R using the lm() function

Scatterplot

plot(Elevation, Species)

Model formulation in R using the lm() function

  • Fitting a simple linear regression model in R is done using the lm() function.

  • A general call of the function has the form lm(response ~ predictor),

  • which corresponds to the model \[ y_i = \beta_0 + \beta_1\,x_i + \varepsilon_i. \]

  • The model is then fitted and stored using the lm() function.

  • A linear regression model in which Species is the response and Elevation is the independent variable can be fitted by

ALR.fit1<-lm(Species~Elevation)

The Output

  • The fitted model is stored in the object ALR.fit1; the summary() function displays the results.
summary(ALR.fit1)
## 
## Call:
## lm(formula = Species ~ Elevation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -218.319  -30.721  -14.690    4.634  259.180 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.33511   19.20529   0.590     0.56    
## Elevation    0.20079    0.03465   5.795 3.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.66 on 28 degrees of freedom
## Multiple R-squared:  0.5454, Adjusted R-squared:  0.5291 
## F-statistic: 33.59 on 1 and 28 DF,  p-value: 3.177e-06

Estimation of the regression parameters

1. Least squares estimation

  • The least squares criterion states that the estimator \(\widehat{\beta}\) of \(\beta\) must be found in such a way that \(\epsilon'\epsilon\), the sum of squares of the errors, is a minimum. That is, to find the least squares estimates we have to minimize the sum of squared errors.
  • The sum of squares of the errors (SSE) is: \[\begin{eqnarray*} S(\beta) & = & \epsilon'\epsilon = \sum_{i=1}^n\varepsilon_i^2 = ({\bf y- X} \beta)'{\bf (y-X}\beta) \\ & = & ({\bf y'-\beta'X'}){\bf (y-X}\beta)\\ & = & {\bf y'y} -\beta'{\bf X' y} -{\bf y' X}\beta+\beta'{\bf X' X}\beta\\ & = & {\bf y'y} -2\beta'{\bf X' y} +\beta'{\bf X' X}\beta\\ \end{eqnarray*}\] since \(\beta'{\bf X' y} ={\bf y' X}\beta\) (a scalar).

Least squares estimation

  • In order to minimise \(S(\beta)\) we set \(\left.\frac{d S(\beta)}{d\beta}\right|_{\beta=\widehat{\beta}} = -2\bf X' y + 2{\bf X' X}\hat\beta={\bf 0}\). We find that \[ {\bf X'X} \widehat{\beta} = {\bf X'y} \] and these are called the normal equations.

Least squares estimation

  • Provided that \({\bf X}\) has full column rank, i.e. \({\bf X}'{\bf X}\) is non-singular (invertible), the least squares estimates are given by \[ \widehat{\beta} ={ \bf (X'X)^{-1} X'y} \] where \(\widehat{\beta}= (\hat{\beta_0},\hat{\beta_1})'\) in the case of the simple linear regression model.
  • The least squares regression line is given by \(\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1\,x_i\).
  • \(\bf H=\bf X(X'X)^{-1} X'\) is called hat matrix.
  • \(\bf H\) is useful for theoretical manipulations such as:
    1. Predicted values: \(\hat{y}=\bf{H}{Y}=\bf X\widehat\beta\)
    2. Residuals: \(\hat\epsilon= y-\bf X\widehat\beta=y-\hat y=(I-H)y\)
    3. Residuals sum of squares: \(\hat\epsilon'\hat\epsilon=y'(\bf I-H)(\bf I-H)y=y'(\bf I-H)y\)
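  • As a quick numerical illustration (a minimal sketch, not part of the original example), the hat matrix and the quantities above can be computed directly for the gala fit used in this chapter; the objects X, H, yhat and res below are introduced here only for this check.
X <- cbind(1, gala$Elevation)            # design matrix: a column of ones and the predictor
H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix H = X(X'X)^{-1}X'
y <- gala$Species
yhat <- H %*% y                          # predicted values H y
res  <- (diag(nrow(gala)) - H) %*% y     # residuals (I - H) y
all.equal(as.vector(yhat), as.vector(fitted(ALR.fit1)))
all.equal(as.vector(res), as.vector(residuals(ALR.fit1)))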

Least squares estimation

Note:

  1. If rank\(({\bf X})=p\) (with \(p=2\) for simple linear regression), then \(\widehat{\beta} ={ \bf (X'X)^{-1} X'y}\) is the unique solution to \({\bf X'X} \widehat{\beta} = {\bf X'y}\).
  2. The vector of estimated residuals is: \(\hat \epsilon=Y-\hat Y=Y-X\hat \beta\)

\[\begin{eqnarray*} \hat \epsilon'\hat\epsilon &=& (Y-X\hat\beta)'(Y-X\hat\beta)\\ &=& Y'Y-2\hat \beta'X'Y+\hat \beta'X'X\hat\beta\\ &=& Y'Y-\hat \beta'X'X\hat\beta\\ \end{eqnarray*}\]

gives \(Y'Y= \hat \epsilon'\hat\epsilon+ (X\hat\beta)'(X\hat\beta)\)

\(\Rightarrow SST=SSE+SSR\)

  • A measure of goodness of fit is the coefficient of determination, \(R^2=\frac{SSR}{SST}\), where \(R^2\in[0,1]\).
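  • For a model that contains an intercept, the corrected decomposition \(SST=SSE+SSR\) can be checked numerically; the short sketch below uses the gala fit ALR.fit1 from earlier in the chapter.
y    <- gala$Species
yhat <- fitted(ALR.fit1)
SST  <- sum((y - mean(y))^2)              # total sum of squares
SSE  <- sum((y - yhat)^2)                 # error (residual) sum of squares
SSR  <- sum((yhat - mean(y))^2)           # regression sum of squares
c(SST = SST, SSE.plus.SSR = SSE + SSR, R2 = SSR / SST)   # R2 should equal 0.5454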

Least squares estimation

  • For simple linear regression the least squares estimates can also be computed by taking partial derivatives of the SSE w.r.t. \({\beta}_0\) and \({\beta}_1\) and setting the resulting expressions equal to zero: \[\begin{eqnarray*} SSE&=&{\sum_{i=1}^n(y_i - \hat{y_i})^2}={\sum_{i=1}^n(y_i -\hat{\beta}_0 - \hat{\beta}_1x_i)^2}\\ \end{eqnarray*}\]
  • Setting \(\left.\frac{\partial(SSE)}{\partial\beta_0}\right|_{\beta=\widehat{\beta}} = {0}\) gives \[\begin{eqnarray*} && -2{\sum_{i=1}^n(y_i -\hat{\beta}_{0} - \hat{\beta}_1x_i)} =0\\ &\Rightarrow& n\bar{y}-n\hat {\beta}_0-n\hat{\beta}_1\bar{x} =0\\ &\Rightarrow& \hat {\beta}_0= \bar{y}-\hat{\beta}_1\bar{x} \end{eqnarray*}\]

Least squares estimation

  • Similarly, setting \(\left.\frac{\partial(SSE)}{\partial\beta_1}\right|_{\beta=\widehat{\beta}} = {0}\) gives \[\begin{eqnarray*} && -2{\sum_{i=1}^n(y_i -\hat{\beta}_0 - \hat{\beta}_1x_i)}\,x_i =0\\ &\Rightarrow& {\sum_{i=1}^n\left(y_ix_i -(\bar{y}-\hat{\beta_1}\bar{x})x_i -\hat{\beta_1}{x_i}^2\right)} =0\\ \end{eqnarray*}\] \[\hat {\beta}_1 = \frac{\sum_{i=1}^n(x_i -\bar{x})(y_i -\bar{y})}{\sum_{i=1}^n(x_i -\bar{x})^2}=\frac{S_{xy}}{S_{xx}}\]
  • The resulting estimators, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are referred to as the least squares estimators for \({\beta}_0\) and \({\beta}_1\), respectively.
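  • The closed-form expressions above can be evaluated directly in R; the sketch below (the objects x, y, b0 and b1 are introduced only for this check) should reproduce the coefficients reported by lm() for the gala example.
x <- gala$Elevation; y <- gala$Species
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1  <- Sxy / Sxx                          # slope estimate
b0  <- mean(y) - b1 * mean(x)             # intercept estimate
c(b0 = b0, b1 = b1)                       # compare with coef(ALR.fit1)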

Estimation in R

The Galapagos Islands Data

  • Fitted model and output:
ALR.fit1<-lm(Species~Elevation)
summary(ALR.fit1)
## 
## Call:
## lm(formula = Species ~ Elevation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -218.319  -30.721  -14.690    4.634  259.180 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.33511   19.20529   0.590     0.56    
## Elevation    0.20079    0.03465   5.795 3.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.66 on 28 degrees of freedom
## Multiple R-squared:  0.5454, Adjusted R-squared:  0.5291 
## F-statistic: 33.59 on 1 and 28 DF,  p-value: 3.177e-06

The output

  • The estimates for the model intercept and the slope are 11.33511 and 0.20079.

  • Therefore, the fitted or the least squares regression line is \(\hat{y} = 11.335 + 0.201\,x\)

  • The slope shows that there is a positive relationship (consistent with the scatter plot); in other words, as the elevation of the island increases, the number of plant species found on the island tends to increase.

Scatter Plot

plot(Elevation, Species)
points(Elevation, ALR.fit1$fit, col=2)

2. Maximum likelihood estimation

  • When the regression model errors are normally and independently distributed, the method of maximum likelihood leads to the same estimators as those obtained by the least squares method.
  • The model is \[ \bf y= \bf X\beta+\epsilon,\] where \(\beta=({\beta}_0,{\beta}_1)\) for simple linear regression model and the errors are normally and independently distributed with constant variance \(\sigma ^2\); or \[ \epsilon \sim \mathcal N(\bf 0, \sigma^2\bf I)\] and the normal density function for the error is \[f(\epsilon)=\frac{1}{\sqrt{2\pi \sigma ^2}}{exp\left[-\frac{\epsilon ^2}{2\sigma ^2}\right]}\]

Maximum likelihood estimation

Likelihood function

  • The likelihood is the joint density of \(\epsilon_1, \epsilon_2,..,\epsilon_n\), or \(\prod_{i=1}^{n}{}f(\epsilon_i)\)
  • Therefore, the likelihood function is given by: \[\begin{eqnarray*} L(\epsilon,\beta,\sigma^2) &= &\prod_{i=1}^n {f(\varepsilon_i)} =\frac{e^{-\frac{\epsilon'\epsilon}{2\sigma^2}}}{(2\pi)^{n/2}\sigma^n}\\ &=&\ (2\pi)^{-n/2}\sigma^{-n}e^{-\frac{\left(\bf Y-\bf X{\beta}\right)'\left(\bf Y-\bf X{\beta}\right)}{2\sigma^2}} \end{eqnarray*}\]

  • The log-likelihood becomes \[\begin{eqnarray*} \ln L(\epsilon,\beta,\sigma^2) &=& -\frac{n}{2}\ln(2\pi)-n\ln\sigma-\frac{(\bf Y-\bf X{\beta})'(\bf Y-\bf X{\beta})}{2\sigma^2}\\ \end{eqnarray*}\]

Maximum likelihood estimation

Estimation of parameters

  • For a fixed value of \(\sigma^2\), the maximum likelihood estimator of \(\beta\) is obtained by maximizing \(\ln L(\epsilon,\beta,\sigma^2)\) with respect to \(\beta\). Differentiating \(\ln L(\epsilon,\beta,\sigma^2)\) with respect to \(\beta\) and setting the derivative equal to zero yields \[\begin{eqnarray*} \left.\frac{\partial \ln L(\epsilon,\beta,\sigma^2)}{\partial\beta}\right|_{\beta=\widehat{\beta}} &=& -\frac{1}{2\sigma^2}\left.\frac{\partial}{\partial\beta}\left({\bf y'y} -2\beta'{\bf X' y} +\beta'{\bf X' X}\beta\right)\right|_{\beta=\widehat{\beta}} ={\bf 0}\\ &\Rightarrow& -2{\bf X'y} + 2{\bf X'X}\widehat{\beta} = {\bf 0} \end{eqnarray*}\] \[{\bf X'X} \widehat{\beta} = {\bf X'y}\]
  • Therefore, the maximum likelihood estimator of \(\beta\) is given by \(\widehat{\beta} ={\bf (X'X)^{-1} X'y}\)

Interpretation of the regression coefficents

Estimates

  • In the SLR model, we have \[y_i = f(\,x_i)+ \varepsilon_i\] \[y_i = \beta_0 + \beta_1\,x_i + \varepsilon_i\]
  • If we take expectations of the model above, this will yield the true regression relation, which is, in this case, a true regression line: \[ E(y_i) = \beta_0 + \beta_1\,x_i\]
  • The true model cannot be observed since \(\beta_0\) and \(\beta_1\) are not known. We must estimate them from the data. This gives the estimated or fitted regression line: \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1\,x_i\).
  • where
    • \(\widehat{\beta}_0\) estimates \({\beta}_0\)
    • \(\hat{\beta}_1\) estimates \({\beta}_1\)
    • \(\hat{Y_i}\) estimates \(E(Y_i)\)

The interpretation

  • The estimated regression coefficients \(\widehat{\beta}_0\) and \(\hat{\beta}_1\) are interpreted as:
    • The coefficient \(\beta_1\), called the slope, may be interpreted as the expected change in \(y\) for a unit change in \(x\).
    • The coefficient \(\beta_0\), called the constant coefficient or intercept, is the expected value of \(y\) when \(x=0\).
    • Note that the random term \(\varepsilon_i\) does not contain any systematic information for determining \(y_i\) that is not already captured by \(x_i\).

Properties of least squares estimation

Main properties

  1. \({\sum_{i=1}^n(y_i -\hat{y_i})}= {\sum_{i=1}^n{\hat\varepsilon_i}} ={0}\)
  2. \({\sum_{i=1}^n{y_i}} ={\sum_{i=1}^n{\hat{y_i}}}\)
  3. The least squares regression line always passes through the centroid of the data, the point \((\bar {x}, \bar{y})\).
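  • These properties are easy to verify numerically for the gala fit ALR.fit1; a minimal sketch (all equalities hold up to floating-point rounding):
e <- residuals(ALR.fit1)
sum(e)                                        # property 1: residuals sum to (essentially) zero
c(sum(gala$Species), sum(fitted(ALR.fit1)))   # property 2: equal sums
co <- coef(ALR.fit1)
c(co[1] + co[2] * mean(gala$Elevation), mean(gala$Species))  # property 3: line passes through the centroid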

Main properties

  1. \({\sum_{i=1}^n{x_i}{\hat\varepsilon_i}}={0}\) or \(X'\hat\epsilon=0\), where \(\hat\epsilon=Y-X\hat\beta\)

    Proof: \[\begin{eqnarray*} X'\hat\epsilon &=& X'(Y-X\hat\beta)\\ &=& X'Y-(X'X)(X'X)^{-1}X'Y\\ &=& 0 \end{eqnarray*}\]

  2. \({\sum_{i=1}^n{\hat{y_i}}{\hat\varepsilon_i}}= {0}\) or \(\hat Y'\hat\epsilon=0\)

where \(\hat Y=X\hat \beta\) and \(\hat\beta\) is a solution to \(X'X\beta=X'Y\)

Proof:

\[\begin{eqnarray*} \hat Y'\hat\epsilon &=& (X\hat\beta)'(Y-X\hat\beta)\\ &=& \hat\beta'X'Y-\hat\beta'X'X\hat\beta\\ &=& 0, \end{eqnarray*}\]

since \(X'X\hat\beta=X'Y\) by the normal equations.

Additional properties

  • Another properties of least square estimation are:
    1. Unbiasedness i.e \(E(\hat{\beta_1})=\beta_1\) and \(E(\hat{\beta_0})=\beta_0\).

    2. Variance of the estimators:

      • \(Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n(x_i -\bar{x})^2}\), and

      • \(Var(\hat\beta_0)= \sigma^2(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}})\)
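  • The two variance formulas can be turned into standard errors by replacing \(\sigma^2\) with the usual unbiased estimate \(s^2 = SSE/(n-2)\); the sketch below (the names s2, se.b0 and se.b1 are introduced only for this check) should reproduce the Std. Error column of summary(ALR.fit1).
x   <- gala$Elevation
n   <- length(x)
Sxx <- sum((x - mean(x))^2)
s2  <- sum(residuals(ALR.fit1)^2) / (n - 2)   # unbiased estimate of sigma^2
se.b1 <- sqrt(s2 / Sxx)
se.b0 <- sqrt(s2 * (1/n + mean(x)^2 / Sxx))
c(se.b0 = se.b0, se.b1 = se.b1)               # compare with 19.20529 and 0.03465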

Proof: Unbiasedness

Prove that \(E(\hat{\beta_1})=\beta_1\).

Solution: \[\begin{eqnarray*} \hat {\beta}_1 &=& \frac{\sum_{i=1}^n(x_i -\bar{x})(y_i -\bar{y})}{\sum_{i=1}^n(x_i -\bar{x})^2} = \frac{\sum_{i=1}^n(x_i -\bar{x})y_i}{\sum_{i=1}^n(x_i -\bar{x})^2}\\ &= &\frac{\sum_{i=1}^n(x_i -\bar{x})(\beta_0 + \beta_1\,x_i + \varepsilon_i)}{\sum_{i=1}^n(x_i -\bar{x})^2}\\ E(\hat\beta_1) &=& \frac{\sum_{i=1}^n(x_i -\bar{x})\left(\beta_0 + \beta_1\,x_i + E(\varepsilon_i)\right)}{\sum_{i=1}^n(x_i -\bar{x})^2}\\ &=& \frac{\beta_0\sum_{i=1}^n(x_i -\bar{x})+\beta_1\sum_{i=1}^n(x_i -\bar{x})\,x_i}{\sum_{i=1}^n(x_i -\bar{x})^2}\\ &=& \frac{\beta_1\sum_{i=1}^n(x_i -\bar{x})\,x_i}{\sum_{i=1}^n(x_i -\bar{x})^2}= \beta_1, \end{eqnarray*}\] since \(\sum_{i=1}^n(x_i -\bar{x})=0\) and \(\sum_{i=1}^n(x_i -\bar{x})\,x_i=\sum_{i=1}^n(x_i -\bar{x})^2\). Therefore, \(\hat{\beta_1}\) is an unbiased estimator of \(\beta_1\).

Proof: Unbiasedness

Prove that \(E(\hat{\beta_0})=\beta_0\).

Solution: \[\begin{eqnarray*} E(\hat\beta_0) &=& E(\bar{y}-\hat{\beta}_1{\bar{x}}) \\ &=& E\left(\frac{\sum_{i=1}^n{y_i}}{n}-\hat{\beta}_1{\bar{x}}\right)\\ &=& E\left(\frac{\sum_{i=1}^n{(\beta_0 + \beta_1\,x_i + \varepsilon_i)}}{n}-\hat{\beta}_1{\bar{x}}\right)\\ &=& E(\beta_0+\beta_1\bar{x}-\hat\beta_1\bar{x}) \\ &=& E(\beta_0)+\beta_1\bar{x}-E(\hat\beta_1)\bar{x} \\ &=& \beta_0+\beta_1\bar{x}-\beta_1\bar{x} \\ &=& \beta_0 \end{eqnarray*}\] Therefore, \(\hat{\beta_0}\) is unbiased estimator of \(\beta_0\).

Proof: Variance of the estimators

  • Prove that \(var(\hat\beta_1)=\frac{\sigma^2}{\sum_{i=1}^n(x_i -\bar{x})^2}\).

Solution \[\begin{eqnarray*} var(\hat\beta_1)&=& var\left(\frac{\sum_{i=1}^n(x_i -\bar{x})(y_i -\bar{y})}{\sum_{i=1}^n(x_i -\bar{x})^2}\right) \\ &=& var\left(\frac{\sum_{i=1}^n(x_i -\bar{x}){y_i}}{S_{xx}}\right)\\ &=& \frac{\sum_{i=1}^n(x_i -\bar{x})^2var({y_i})}{{S_{xx}}^2} \\ &=& \sigma^2\frac{\sum_{i=1}^n(x_i -\bar{x})^2}{{S_{xx}}^2} = \sigma^2\frac{S_{xx}}{{S_{xx}}^2}\\ &=& \frac{\sigma^2}{{S_{xx}}} = \frac{\sigma^2}{\sum_{i=1}^n(x_i -\bar{x})^2} \end{eqnarray*}\]

Proof: Variance of the estimators

  • Prove that \(var(\hat\beta_0)=\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\right)\).

Solution: \[\begin{eqnarray*} var(\hat\beta_0)&=& var(\bar{y}-\hat{\beta}_1\bar{x})\\ &=& var(\bar{y})-2\bar{x}\,Cov(\bar{y},\hat\beta_1)+\bar{x}^2var(\hat\beta_1)\\ &=& var\left(\frac{\sum_{i=1}^{n}{y_i}}{n}\right)+\bar{x}^2var(\hat\beta_1) \qquad \mbox{since } Cov(\bar{y},\hat\beta_1)=0\\ &=& \frac{\sum_{i=1}^{n}var({y_i})}{n^2}+\frac{\bar{x}^2\sigma^2}{S_{xx}} \\ &=&\frac{n\sigma^2}{n^2}+\frac{\bar{x}^2\sigma^2}{{S_{xx}}}= \sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\right)\\ &=&\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n(x_i -\bar{x})^2}\right) \end{eqnarray*}\]

Assumptions of simple linear regression model

Assumptions

  • The true relationship between \(Y\) and \(X\) is: \(y_i = \beta_0 + \beta_1\,x_i + \varepsilon_i\)
  • The error terms have zero mean: i.e. \(E(\varepsilon_i)=0\) for all \(i\), or \(E(\epsilon) ={\bf 0}\). This means that on average the errors balance out.
  • Homoscedasticity (error terms have constant variance): i.e. \(Var(\varepsilon_i)\) = \(E(\varepsilon_i ^2)\) = \(\sigma^2\), for all \(i\) or \(Var(\epsilon) = \sigma^2\,{\bf I}_n\).
    • This means that the variance of the disturbance is the same for each observation.
  • No error autocorrelation (the error terms \(\varepsilon_i\) are statistically independent of each other), i.e \(Cov(\varepsilon_i\),\(\varepsilon_j)\)= \(E(\varepsilon_i\varepsilon_j)=0\) for \(i\ne j\).
    • Means that disturbances associated with different observations are uncorrelated.

Assumptions

  • The independent variable is non-random. In an experiment, the values of the independent variable would be fixed by the experimenter and repeated samples could be drawn with the independent variable fixed at the same values in each sample.
  • Normality: \(\varepsilon_i\) are normally distributed with mean zero and variance \(\sigma^2\) for all \(i\) \((\varepsilon_i \sim \mathcal N(0, \sigma^2))\).

Assumptions

  • From model (1), note that the observed value of \(y\) in the \(i^{th}\) observation is the sum of two components, namely the constant term \(\beta_0 + \beta_1\,x_i\) and the random error term \(\varepsilon_i\). Hence \(y_i\) is a random variable.
  • From the above assumptions it follow that
    1. \(E(y_i) = E(\beta_0 + \beta_1\,x_i + \varepsilon_i) = \beta_0 + \beta_1\,x_i\).
    2. \(Var(y_i) = Var(\beta_0 + \beta_1\,x_i + \varepsilon_i)=Var(\varepsilon_i) = \sigma^2\).
    3. \(Cov(y_i,y_j) = 0\) for \(i\ne j\), i.e., the responses \(y_i\) and \(y_j\) are uncorrelated.
    4. Using matrix notation, \({\bf y} \sim \mathcal N({\bf X}\,\beta, \sigma^2\,{\bf I}_n)\).

Fitting a model using lm()

Data: Galapagos Islands

ALR.fit1<-lm(Species~Elevation)
summary(ALR.fit1)
## 
## Call:
## lm(formula = Species ~ Elevation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -218.319  -30.721  -14.690    4.634  259.180 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.33511   19.20529   0.590     0.56    
## Elevation    0.20079    0.03465   5.795 3.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.66 on 28 degrees of freedom
## Multiple R-squared:  0.5454, Adjusted R-squared:  0.5291 
## F-statistic: 33.59 on 1 and 28 DF,  p-value: 3.177e-06

The output

  • From the last lines of the output: the residual standard error is 78.66 on 28 degrees of freedom, the multiple R-squared is \(R^2 = 0.5454\), the adjusted R-squared is 0.5291 and the F-statistic is 33.59 on 1 and 28 DF with p-value 3.177e-06.
  • The coefficient of determination may be interpreted as the proportion of the total variability in the number of plant species found on the island (i.e. the response variable) that is accounted for by the elevation of the island (i.e. the predictor variable).

The output

  • Therefore, 54.54% of the total variability in the number of plant species found on the island is accounted for by the elevation of the island.
  • For simple linear regression, the correlation coefficient can be calculated using the multiple R-squared value from the above output as \(r = \sqrt{0.5454} \approx 0.74\).
  • This shows that there is a strong positive linear relationship between the two variables, species and elevation.
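  • This can be checked directly in R (a small sketch using the objects already created in this chapter; gala is attached, so Species and Elevation are available):
r  <- cor(Species, Elevation)             # sample correlation
R2 <- summary(ALR.fit1)$r.squared
c(r = r, sqrt.R2 = sqrt(R2))              # both approximately 0.74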

Fitting a model using lm()

Data: The Cars data

  • The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
data("cars")
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Scatter plot

plot(cars$speed,cars$dist)

Fitting a model using lm()

  • We would like to model the stopping distance as a function of the car’s speed.
  • We formulate the following model: \[ \mbox{dist}_{i}= \beta_0 + \beta_1 \mbox{speed}_{i} + \varepsilon_i. \]
  • In R the model is fitted using
fit.lm<-lm(cars$dist~cars$speed)

The output

  • The summary() function can be used to see the estimated parameters.
summary(fit.lm)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Scatter plot

plot(cars$speed,cars$dist)
points(cars$speed,fit.lm$fit,col=2)

Appendix

Matrix results

  • In this section we will discuss some matrix definitions and results that will be used in the coming chapters.
  1. Let \({\bf x} =(x_1, x_2, \ldots, x_p)'\) be a \(p \times 1\) vector and \({\bf A} = (a_{ij})\) a \(p \times p\) matrix. Then \({\bf x}'{\bf A}{\bf x} = \sum_{i=1}^p\sum_{j=1}^p a_{ij} x_i x_j\) is called a quadratic form in \({\bf x}\). The matrix \({\bf A}\) of any quadratic form \({\bf x}'{\bf A}{\bf x}\) can always be chosen to be symmetric, hence every quadratic form in this guide will be considered to have a symmetric matrix.
  2. The \(p \times p\) matrix \({\bf A}\) is called positive definite if \({\bf x}'{\bf A}{\bf x} > 0\) for all \({\bf x} \ne {\bf 0}\), where \({\bf 0}\) is a vector of zeros.
  3. If the \(>\) sign in part (2) is replaced by \(\ge\) (and the equality holds for at least one \({\bf x} \ne {\bf 0})\), then the matrix \({\bf A}\) is said to be positive semi-definite.

Rank of a matrix

  1. Rank of a matrix

    1. The rank of a matrix \({\bf A}\), denoted by \(r({\bf A})\), is defined to be the number of linearly independent rows of \({\bf A}\) (or, equivalently, the number of linearly independent columns).
    2. If \({\bf A}\) is an \(n \times m\) matrix then \(r({\bf A}) \le m\) and \(r({\bf A}) \le n\).
    3. For any matrix \({\bf A}\), \(r({\bf A}') = r({\bf A})\).
    4. If \({\bf A}\) is a \(p \times p\) matrix and \(r({\bf A}) = p\) then \({\bf A}\) is said to be non-singular and \({\bf A}^{-1}\) exists; if \(r({\bf A}) < p\) then \({\bf A}\) is singular and \({\bf A}^{-1}\) does not exist.
    5. For any \(n \times n\) matrix \({\bf A}\) and \(n \times 1\) vector \({\bf b}\), \(r({\bf A}, {\bf b}) \ge r({\bf A})\), i.e., the inclusion of a column vector cannot decrease the rank of a matrix.
    6. A positive definite matrix is non-singular.
  2. If \({\bf A}\) is an \(n \times p\) matrix, then \(r({\bf A}) = r({\bf A}'{\bf A}) = r({\bf A}{\bf A}')\).

Rank of a matrix

  1. The trace of a \(p \times p\) matrix \({\bf A}\), denoted by \(tr({\bf A})\), is defined to be the sum of its diagonal elements. If \({\bf A} = (a_{ij})\), then \(tr({\bf A}) = \sum_{i=1}^{p} a_{ii}\).
  2. For conformable matrices \({\bf A}\) and \({\bf B}\), \(r({\bf A}{\bf B}) \le r({\bf A})\) and \(r({\bf A}{\bf B}) \le r({\bf B})\).
  3. If \({\bf A}\) is an \(m \times n\) matrix and \({\bf B}\) an \(n \times m\) matrix, then \(tr({\bf A}{\bf B}) = tr({\bf B}{\bf A})\). Similarly, if \({\bf A}\) is \(m \times n\), \({\bf B}\) is \(n \times k\) and \({\bf C}\) is \(k \times m\), then \(tr({\bf A}{\bf B}{\bf C}) = tr({\bf B}{\bf C}{\bf A}) = tr({\bf C}{\bf A}{\bf B})\).
  4. In particular, if \({\bf x}\) is a \(p \times 1\) vector and \({\bf A}\) a \(p \times p\) matrix, then the quadratic form \({\bf x}'{\bf A}{\bf x}\) is a \(1 \times 1\) matrix which is equal to its trace. Thus \({\bf x}'{\bf A}{\bf x} = tr({\bf x}'{\bf A}{\bf x}) = tr({\bf A}{\bf x}{\bf x}')\).

Rank of a matrix

  1. The matrix \({\bf A}\) is called idempotent if \({\bf A}{\bf A} = {\bf A}\).
  2. If \({\bf A}\) is idempotent then the rank of \({\bf A}\) is equal to its trace: \(r({\bf A})= tr({\bf A})\).
  3. If \({\bf X}\) is an \(n \times m\) matrix with rank \(r({\bf X}) = m\) (thus \(m \le n\)) then \({\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'\) is idempotent with rank \(m\). Furthermore, \({\bf I}_{n} - {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'\) is idempotent with rank \(n - m\). Finally, \(({\bf X}({\bf X}'{\bf X})^{-1}{\bf X}')({\bf I}_{n}- {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}') = {\bf 0}\), the \(n \times n\) matrix of zeros.

More matrix results

  1. An important vector is the \(n \times 1\) vector of ones:

\[ {\bf 1}=\left ( \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right ), ~~~~~~~~ {\bf 1}'{\bf 1}= ( 1, 1, \ldots, 1) \left ( \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right ) = n \] \[ {\bf 1}{\bf 1}'= \left ( \begin{array}{cccc} 1 & 1 & \ldots & 1\\ 1 & 1 & \ldots & 1 \\ \vdots & \vdots & \ldots & \vdots \\ 1 & 1 & \ldots & 1 \end{array} \right ), ~~{\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}' = \left ( \begin{array}{cccc} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n}\\ \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} \\ \vdots & \vdots & \ldots & \vdots \\ \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} \end{array} \right ) \]

More matrix results

\[ ~~~\mbox{and}~~~ {\bf I}_n - {\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}' = \left ( \begin{array}{cccc} 1- \frac{1}{n} & - \frac{1}{n} & \ldots & - \frac{1}{n}\\ - \frac{1}{n} & 1 - \frac{1}{n} & \ldots & - \frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ - \frac{1}{n} & - \frac{1}{n} & \ldots & 1 - \frac{1}{n} \end{array} \right ) \]

  1. From the two preceding results, it follows that \({\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}'\) and \({\bf I}_n - {\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}'\) are idempotent with ranks 1 and \(n - 1\) respectively, and their product is equal to the null matrix \({\bf 0}\).

Orthogonal matrix

  • An orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors).
  • Equivalently, a matrix \({\bf Q}\) is orthogonal if its transpose is equal to its inverse, i.e. \({\bf Q}' = {\bf Q}^{-1}\), which entails \({\bf Q}'{\bf Q} = {\bf Q}{\bf Q}' = {\bf I}_n\), where \({\bf I}_n\) is the identity matrix.
  • An orthogonal matrix \({\bf Q}\) is necessarily square, invertible (with inverse \({\bf Q}^{-1} = {\bf Q}'\)), unitary (\({\bf Q}'{\bf Q} = {\bf I}_n\)) and normal (\({\bf Q}'{\bf Q} = {\bf Q}{\bf Q}'\)).

Orthogonal matrix

  • Spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors.
  • Only diagonalizable matrices can be factorized in this way.
  • A square matrix Q is called diagonalizable if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix \({\bf P}\) such that \({\bf P}^{-1}{\bf Q}{\bf P}\) is a diagonal matrix.

Matrix differentiation

  • Let \({\bf \beta} = (\beta_1, \beta_2, \ldots, \beta_p)'\) and let \(f({\bf \beta})\) be a real valued function of \(p\) variables. Then the vector of partial derivatives of \(f({\bf \beta})\) with respect to \({\bf \beta}\) is \[ \frac{\partial f({\bf \beta})}{\partial {\bf \beta}} = \left ( \begin{array}{c} \frac{\partial f({\bf \beta})}{\partial \beta_1} \\ \frac{\partial f({\bf \beta})}{\partial \beta_2} \\ \vdots \\ \frac{\partial f({\bf \beta})}{\partial \beta_p}\end{array} \right ) \]

Matrix differentiation

  • Let \({\bf c} = (c_1, c_2, \ldots, c_p)\), \(\beta = (\beta_1, \beta_2, \ldots, \beta_p)'\) and \(f(\beta) = {\bf c}'\beta\). Then, \(f({\bf \beta}) = {\bf c}'\beta = \sum_{i=1}^p c_i \beta_i\) and we find that \(\frac{\partial f(\beta)}{\partial \beta_i} = c_i\). Hence, \[ \frac{\partial f(\beta)}{\partial \beta} = \left ( \begin{array}{c} c_1 \\ c_2 \\ \vdots \\ c_p \end{array} \right ) = {\bf c}. \]
  • Let \({\bf A} = (a_{ij})\) be a \(p \times p\) symmetric matrix with \(i\)th row \({\bf a}_i' = (a_{i1}, a_{i2}, \ldots, a_{ip})\). Let \(\beta = (\beta_1, \beta_2, \ldots, \beta_p)'\) be a \(p \times 1\) vector and \(f(\beta) = \beta' {\bf A} \beta\). Then \[ \frac{\partial f(\beta)}{\partial \beta} = 2{\bf A} \beta. \]

Proof

Since \[ \beta' {\bf A} \beta = \sum_{i=1}^p a_{ii} \beta_i^2 + 2\sum_{i < j} a_{ij} \beta_i \beta_j \] we get \[ \frac{\partial \beta' {\bf A} \beta}{\partial \beta_1} = 2 a_{11} \beta_1 + 2 \sum_{j=2}^p a_{1j} \beta_j = 2 {\bf a}_1' \beta, \] and similarly for the other components, so that \[ \frac{\partial \beta' {\bf A} \beta}{\partial \beta} = 2 \left ( \begin{array}{c} {\bf a}'_1 \beta \\ {\bf a}'_2 \beta \\ \vdots \\ {\bf a}'_p \beta \end{array} \right ) = 2 {\bf A} \beta. \]
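  • The identity can also be checked numerically with a finite-difference gradient; the sketch below is illustrative only (the matrix A and the point beta are generated at random).
set.seed(1)
p    <- 3
A    <- crossprod(matrix(rnorm(p * p), p, p))   # a random symmetric matrix
beta <- rnorm(p)
f    <- function(b) as.numeric(t(b) %*% A %*% b)
h    <- 1e-6
num.grad <- sapply(1:p, function(j) {
  e <- rep(0, p); e[j] <- h
  (f(beta + e) - f(beta - e)) / (2 * h)         # central difference in coordinate j
})
cbind(numerical = num.grad, analytic = as.numeric(2 * A %*% beta))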

Distribution theory

Distribution theory

  • Let \({\bf y}\) be a column vector consisting of \(p\) random variables, that is \({\bf y}' = (y_1, y_2, \ldots, y_p)\). Let the expected value of \(y_i\) be \(\mu_i\) and let \(\mu' = (\mu_1, \mu_2, \ldots, \mu_p)\). Also, let the covariance of \(y_i\) and \(y_j\) be \(\sigma_{ij}\). Thus, \(E\{(y_i - \mu_i)(y_j - \mu_j)\} = \sigma_{ij}\) and hence \(var(y_i) = \sigma_{ii}\).
  • Let \(\Sigma\) be the matrix \[ \Sigma = var({\bf y}) = (\sigma_{ij}) = \left ( \begin{array}{cccc} \sigma_{11} & \sigma_{12} & \ldots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \ldots & \sigma_{2p} \\ \vdots & \vdots & \ldots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \ldots & \sigma_{pp} \end{array} \right ). \] Obviously \(\sigma_{ij} = \sigma_{ji}\) so that \(\Sigma\) is symmetric. \(\Sigma\) is called the covariance matrix of \({\bf y}\). We would also write \[ \Sigma = E[({\bf y} - \mu) ({\bf y} - \mu)']. \]

Distribution theory

  • A covariance matrix must be positive definite (or at least positive semi-definite).

  • The \(p \times 1\) random vector \({\bf y}\) is said to have a multivariate normal distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\), also written \({\bf y} \sim N(\mu, \Sigma)\), if the probability density function of \({\bf y}\) is given by \[ f({\bf y}) = \frac{1}{(2 \pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}({\bf y} - \mu)'\Sigma^{-1} ({\bf y} - \mu)\right]. \]

  • Note that the covariance matrix \(\Sigma\) must be positive definite otherwise \(f({\bf y})\) will not be a probability density function.

  • If \({\bf y}\) is a random vector with \(E({\bf y}) = \mu\) and \(Cov({\bf y}, {\bf y}') = \Sigma\) and if \({\bf A}\) is a matrix of constants, then \({\bf z} = {\bf A} {\bf y}\) is a random vector with \(E({\bf z}) = {\bf A} \mu\) and \(Cov({\bf z}, {\bf z}') = {\bf A} \Sigma {\bf A}'\).

    • If \({\bf y} \sim N(\mu, \Sigma)\) then \({\bf A}{\bf y} \sim N({\bf A} \mu, {\bf A} \Sigma {\bf A}')\).

Distribution theory

  • If \({\bf y} \sim N({\bf 0}, {\bf I}_p)\) then \({\bf y}'{\bf y} \sim \chi^2_p\) (that is \({\bf y}'{\bf y}\) has a chi-square distribution with \(p\) degrees of freedom).

  • If \({\bf y} \sim N(\mu, \sigma^2 {\bf I}_p)\) then \(({\bf y} - \mu)' ({\bf y} - \mu) / \sigma^2 \sim \chi^2_p\).

  • If \({\bf y} \sim N(\mu, \sigma^2 {\bf I}_p)\) then \({\bf y}'{\bf y}/ \sigma^2\) is said to have a non-central chi-square distribution with \(p\) degrees of freedom and non-centrality parameter (n.c.p.) \(\lambda = \mu' \mu / \sigma^2\). Some textbooks use the alternative convention \(\lambda = \frac{1}{2}\mu' \mu / \sigma^2\). We write \({\bf y}'{\bf y}/ \sigma^2 \sim \chi^2_{p, \lambda}\).

    • When \(\lambda = 0\) the distribution becomes the (central) chi-square distribution. The n.c.p. \(\lambda\) is non-negative by definition.
    • As \(\lambda\) increases, the distribution shifts to the right. This is also reflected in the following result.

Distribution theory

  • If \(x\) is a noncentral chi-square variate with \(p\) degrees of freedom and noncentrality parameter \(\lambda\), then \(x\) has mean and variance \(E(x) = p + \lambda\) and \(Var(x) = 2p + 4 \lambda\), respectively. Thus both the mean and the variance increase as \(\lambda\) increases.

  • Let \(u\) and \(v\) be independent random variables with \(u \sim \chi^2_p\) and \(v \sim \chi^2_q\), then

\[ F = \frac{u/p}{v/q}\] is said to have an \(F_{p, q}\) distribution.
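  • Both the mean and variance of the noncentral chi-square given above and the construction of the \(F\) ratio are easy to confirm by simulation; a small illustrative sketch (the values of \(p\), \(q\) and \(\lambda\) are chosen arbitrarily):
set.seed(2)
p <- 5; q <- 10; lambda <- 3
x <- rchisq(1e5, df = p, ncp = lambda)    # noncentral chi-square draws
c(sim.mean = mean(x), theory.mean = p + lambda)
c(sim.var = var(x), theory.var = 2 * p + 4 * lambda)
u <- rchisq(1e5, df = p)                  # central chi-square numerator
v <- rchisq(1e5, df = q)                  # independent central chi-square denominator
Fstat <- (u / p) / (v / q)
c(sim.mean.F = mean(Fstat), theory.mean.F = q / (q - 2))   # mean of F_{p,q} is q/(q-2) for q > 2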

Distribution theory

  • Let \(u\) and \(v\) be independent random variables with \(u\) a noncentral chi-square variate with \(p\) degrees of freedom and n.c.p. \(\lambda\) and \(v\sim \chi^2_q\). Then

\[F = \frac{u/p}{v/q}\]

is said to have a noncentral \(F\)-distribution with \(p\) and \(q\) degrees of freedom and n.c.p. \(\lambda\).

  • The effect of the n.c.p. \(\lambda\) on the distribution of \(F\) is the same as its effect on the distribution of \(u\): increasing \(\lambda\) causes a shift to the right.

Distribution theory

  • Let \({\bf y} \sim N(\mu, \sigma^2 {\bf I}_p)\). Let \({\bf A}\) be a symmetric matrix of constants. Then \({\bf y}'{\bf A}{\bf y}/\sigma^2\) has a (noncentral) chi-square distribution with \(r({\bf A})\) degrees of freedom and n.c.p. \(\lambda = \mu'{\bf A} \mu/\sigma^2\) if and only if \({\bf A}\) is idempotent, that is \({\bf A} {\bf A} = {\bf A}\).
  • Let \({\bf y} \sim N(\mu, \sigma^2 {\bf I}_p)\) and let \({\bf A}\) and \({\bf B}\) be symmetric constant matrices. Then the quadratic forms \({\bf y}'{\bf A}{\bf y}/\sigma^2\) and \({\bf y}'{\bf B}{\bf y}/\sigma^2\) are independent if and only if \({\bf A} {\bf B} = {\bf 0}\).

Distribution theory

  • If \({\bf y} \sim N(\mu, \sigma^2 {\bf I}_p)\) then the quadratic form \({\bf y}'{\bf A}{\bf y}/\sigma^2\) and the linear form \({\bf B}{\bf y}\) are independent if and only if \({\bf B}{\bf A} = {\bf 0}\).
  • More generally, if \({\bf y} \sim N(\mu, {\bf V})\) where \({\bf V}\) is the variance-covariance matrix of \({\bf y}\), then \({\bf y}'{\bf A}{\bf y}\) has a noncentral chi-square distribution with \(r(A)\) degrees of freedom and n.c.p. \(\lambda = \mu' {\bf A} \mu\) if and only if \({\bf A}{\bf V}\) is idempotent.
  • If \({\bf y} \sim N(\mu, {\bf V})\) then \({\bf y}'{\bf A}{\bf y}\) and \({\bf y}'{\bf B}{\bf y}\) are independent if and only if \({\bf A}{\bf V}{\bf B} = {\bf 0}\).
  • If \({\bf y} \sim N(\mu, \sigma^2{\bf V})\) where \({\bf V}\) is non-singular, then \({\bf y}'{\bf V}^{-1}{\bf y}/\sigma^2\) is a noncentral chi-square variate with \(r({\bf V})\) degrees of freedom and n.c.p. \(\lambda = \mu'{\bf V}^{-1} \mu /\sigma^2\).

Chapter Two: Multiple Linear Regression

Introduction

  • Multiple regression is a method used to model the relationship between a dependent or response variable and two or more independent variables.
  • If all the partial derivatives of \(y\) with respect to each of the parameters \(\beta_{0}\), \(\beta_{1}\), …, \(\beta_{p-1}\) are independent of the parameters, then the model is called a linear model.
  • If any of the partial derivatives of \(y\) with respect to \(\beta_{0}\), \(\beta_{1}\), …, \(\beta_{p-1}\) is not independent of the parameters, the model is called nonlinear.

Introduction

  • Independent categorical variables with more than two levels can also be used in multiple linear regression analyses, but each of the levels first must be converted into variables that have only two levels. This is called dummy coding.
  • One point to keep in mind with regression analysis is that causal relationships among the variables cannot be determined. Let \({\bf x} = (x_1, x_2, \ldots, x_p)\) be the vector whose components are called the predictors or regressors.
  • The terminology is that \({\bf x}\) “predicts” \(y\); we cannot say that \({\bf x}\) “causes” \(y\).

Multiple linear regression model

The model

  • The model expresses the value of a response variable as a linear function of two or more predictor variables and an error term. The multiple linear regression model has the following general form \[ y_{i}=\mu(x_{i1},...,x_{i,p-1})+\varepsilon_{i}, i=1,...,n\]
    • \(x_{i1},\ldots, x_{i,p-1}\) are the particular (deterministic) values of the \((p-1)\) regressors
    • \(y_i\) is the corresponding continuous response random variable
    • \(\mu(\cdot)\) the unknown mean response function
    • \(\varepsilon_1, ..., \varepsilon_n\) independent (unobservable) random error terms with \(E(\varepsilon_i)=~0.\)

The model

  • We shall consider the linear model \[\mu(x_{1},...,x_{p-1})=\beta_{0}+\sum_{j=1}^{p-1}\beta_jx_{j} \]

  • The parameter \(\beta_{j}\) represents the expected change in the response \(y\) per unit change in \(x_{j}\) when all of the remaining regressor variables \(x_i\) \((i\ne j)\) are held constant. For this reason the parameters \(\beta_{j}\), \(j=1,2,\ldots,p-1\), are often called partial regression coefficients.

General linear regression model

  • In other words, we define the general linear regression model as \[ y_i= \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i,p-1} + \varepsilon_i. \]
  • Thus, we have the equations \[\begin{eqnarray} \left\{ \begin{array}{ccl} y_1& =& \beta_0 + \beta_1 x_{11} + \ldots + \beta_{p-1} x_{1, p-1} + \varepsilon_1\\ \vdots \\ y_i& =& \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i,p-1} + \varepsilon_i\\ \vdots \\ y_n& =& \beta_0 + \beta_1 x_{n1} + \ldots + \beta_{p-1} x_{n, p-1} + \varepsilon_n \end{array} \right. \end{eqnarray}\]

A model in matrix form

  • This can be written in matrix terms as follows \[\begin{eqnarray} \left( \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array} \right) = \left( \begin{array}{cccc} 1& x_{11}& \ldots& x_{1, p-1}\\ 1& x_{21}& \ldots& x_{2,p-1}\\ \vdots& & & \\ 1& x_{n1}& \ldots& x_{n,p-1} \end{array} \right) \left( \begin{array}{c} \beta_0\\ \beta_1 \\ \vdots\\ \beta_{p-1} \end{array} \right) + \left( \begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_n \end{array} \right) \end{eqnarray}\] or \[\begin{eqnarray} \begin{array}{cccccc} {\bf y} &=&{\bf X}&\beta & +&\epsilon\\ (n\times 1) & &(n \times p) &(p \times 1) & &(n \times 1) \end{array} \end{eqnarray}\]

Estimation of \({\bf\beta}\) and \(\sigma^2\)

  • The maximum likelihood estimator of \(\beta\) is given by \[\widehat{\beta} ={\bf (X'X)^{-1} X'y}\] (see the details in the previous chapter)

  • Estimation of \(\sigma^2\)

  • The maximum likelihood estimator of \(\sigma^2\) is obtained by maximizing the log-likelihood function \(\ln L(\epsilon,\beta,\sigma^2)\) over \(\sigma^2\). Differentiating \(\ln L(\epsilon,\beta,\sigma^2)\) with respect to \(\sigma^2\) and setting the derivative equal to zero yields \[\begin{eqnarray*} \left.\frac{\partial \ln L(\epsilon,\beta,\sigma^2)}{\partial\sigma^2}\right|_{\beta=\widehat\beta,\;\sigma^2=\hat\sigma^2} &=& -\frac{n}{2\hat\sigma^2}+\frac{1}{2\hat\sigma^4}(\bf y-\bf X\widehat\beta)'(\bf y-\bf X\widehat\beta)={0}\\ \end{eqnarray*}\] and solving for \(\sigma^2\) we obtain \[ \hat\sigma^2=\frac{1}{n}(\bf y-\bf X\widehat\beta)'(\bf y-\bf X\widehat\beta)\]
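  • Note that the MLE divides by \(n\), whereas the usual unbiased estimator divides by \(n-p\); the sketch below illustrates the difference for the gala multiple regression that is fitted later in this chapter (the object name fit is introduced here only for this sketch).
fit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
n   <- nrow(gala)
p   <- length(coef(fit))                  # number of regression parameters (6 here)
RSS <- sum(residuals(fit)^2)
c(mle = RSS / n, unbiased = RSS / (n - p), summary.sigma.sq = summary(fit)$sigma^2)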

Properties of estimators

Properties of estimators

1. Unbiasedness, \(E(\hat\beta)=\beta\)

  • Proof: \[\begin{eqnarray*} E\left(\widehat{\beta}\right) &=& E\left({(\bf X}'{\bf X})^{-1} {\bf X}'{\bf y}\right)\\ &=& E\left({(\bf X}'{\bf X})^{-1} {\bf X}'({\bf X{\beta}}+\epsilon)\right)\\ &=& E\left({(\bf X}'{\bf X})^{-1} {\bf X}'{\bf X{\beta}}\right)+ E\left({(\bf X}'{\bf X})^{-1} {\bf X}'{\epsilon}\right)\\ &=& {\beta}+ {(\bf X}'{\bf X})^{-1} {\bf X}'E({\epsilon})\\ &=& {\beta} \end{eqnarray*}\]
  • Therefore, \(E(\hat\beta)\) is \(\beta\).

Properties of estimators

2. The mean of the random error term is zero

  • Proof: \[\begin{eqnarray} E({\epsilon}) = E\left(\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_n \end{array} \right)= \left(\begin{array}{c} E(\varepsilon_1) \\ E(\varepsilon_2) \\ \vdots\\ E(\varepsilon_n) \end{array} \right)=\left(\begin{array}{c} 0 \\ 0 \\ \vdots\\ 0 \end{array} \right)\\ \end{eqnarray}\] and \(E(\bf Y)=E(\bf X \beta + \epsilon) = \bf X \beta\)

Properties of estimators

3. Covariance of the estimators

  • Proof: \[\begin{eqnarray*} Cov(\hat{\beta}) &=& E\left((\hat{\beta}-E(\hat{\beta}))(\hat{\beta}-E(\hat{\beta}))'\right)\\ &=& E\left(({(\bf X}'{\bf X})^{-1} {\bf X}'{\bf y}-\beta)({(\bf X}'{\bf X})^{-1} {\bf X}'{\bf y}-\beta)'\right)\\ &=& E\left(({(\bf X}'{\bf X})^{-1} {\bf X}'(\bf X\beta+\epsilon)-\beta)({(\bf X}'{\bf X})^{-1} {\bf X}'(\bf X\beta+\epsilon)-\beta)'\right)\\ &=& E\left((\beta+{(\bf X}'{\bf X})^{-1} {\bf X}'\epsilon-\beta)(\beta+{(\bf X}'{\bf X})^{-1} {\bf X}'\epsilon-\beta)'\right)\\ &=& E\left(({(\bf X}'{\bf X})^{-1} {\bf X}'\epsilon)({(\bf X}'{\bf X})^{-1} {\bf X}'\epsilon)'\right)\\ &=& {(\bf X}'{\bf X})^{-1} {\bf X}'E(\epsilon\epsilon')\bf X {(\bf X}'{\bf X})^{-1}\\ &=& \sigma^2 {(\bf X}'{\bf X})^{-1} \end{eqnarray*}\]
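  • A numerical check of \(Cov(\hat\beta)=\sigma^2({\bf X}'{\bf X})^{-1}\) (a sketch, assuming the Chapter 1 fit ALR.fit1 is still in the workspace and using the unbiased estimate of \(\sigma^2\) in place of the unknown value; any lm fit would do):
X  <- model.matrix(ALR.fit1)              # n x 2 design matrix including the column of ones
s2 <- summary(ALR.fit1)$sigma^2           # estimate of sigma^2
all.equal(s2 * solve(t(X) %*% X), vcov(ALR.fit1), check.attributes = FALSE)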

Assumptions

  1. The true model is: \(y=\beta_{0}+\beta_{1}x_{1}+...+\beta_{p-1}x_{p-1}+\varepsilon\)
  2. The error terms have zero mean: i.e. \(E(\varepsilon_i)=0\) for all \(i\), or \(E(\epsilon) ={\bf 0}\). This means that on average the errors balance out.
  3. Homoscedasticity (error terms have constant variance): i.e. \(Var(\varepsilon_i)\) = \(E(\varepsilon_i ^2)\) = \(\sigma^2\), for all \(i\) or \(Var(\epsilon) = \sigma^2\,{\bf I}_n\). This means that the variance of the disturbance is the same for each observation.

Assumptions

  1. No error autocorrelation (the error terms \(\varepsilon_i\) are statistically independent of each other), i.e \(Cov(\varepsilon_i\),\(\varepsilon_j)\)= \(E(\varepsilon_i\varepsilon_j)=0\) for \(i\ne j\). This means disturbances associated with different observations are uncorrelated.
  2. No multicollinearity: No exact linear relationship exists between any of the explanatory variables.
  3. Normality: \(\varepsilon_i\) are normally distributed with mean zero and variance \(\sigma^2\) for all \(i\) (\(\varepsilon_i \sim \mathcal N(0, \sigma^2)\)).

Model adequacy

Goodness of fit

  • How well does the model fit the data?
  • One measure is \(\ R^2\), the so-called coefficient of determination or percentage of variance explained.

The coefficient of determination

  • The coefficient of determination \(( R^2 )\) can be calculated as usual as: \[\begin{eqnarray*} R^2=\frac{SSR}{SST}=\frac{\sum_{1}^{n}(\hat{y_i}-\bar{y})^2}{\sum_{1}^{n}(y_i-\bar{y})^2} \end{eqnarray*}\]

  • \(R^2\) measures the proportion of variation in the dependent variable \(Y\) that is explained by the explanatory variables (or by the multiple linear regression model).
  • It is a goodness-of-fit statistic.

The coefficient of determination

Note

  1. The range is \(0\le\ R^2\le 1\).
  2. If all the data points fall exactly on a line having non-zero slope, then \(R^2=1\).
  3. If \(\hat\beta_j=0\) for all \(j\ge 1\), then \(R^2=0\).
  4. Values closer to 1 indicate better fits. For simple linear regression \(\ R^2 =\ r^2\), where \(r\) is the correlation between \(\ x\) and \(\ y\).
  5. In the simple linear regression model, the estimated slope \(\hat \beta_1\) and the correlation coefficient \(r\) are related as follows: \[r=\frac{s_x}{s_y}\hat\beta_1\] where \(s_x\) and \(s_y\) are the standard deviations of \(X\) and \(Y\), respectively.
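  • The last two relations can be verified for the simple regression of Chapter 1 (a small sketch, assuming the Chapter 1 objects, the attached gala data and ALR.fit1, are still available):
r  <- cor(Elevation, Species)             # sample correlation
R2 <- summary(ALR.fit1)$r.squared
b1 <- unname(coef(ALR.fit1)[2])           # estimated slope
c(r = r, sqrt.R2 = sqrt(R2), ratio.times.b1 = sd(Elevation) / sd(Species) * b1)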

Test of model adequacy

  • A test for the significance of \(R^2\) or a test of model adequacy is accomplished by testing the hypotheses:

  • \(H_0:\beta_{1}=\beta_{2}=...=\beta_{p-1}=0\)

  • \(H_A: H_0\) is not true

    • The test statistic is given by: \(F_{cal}=\frac{SSR/(K-1)}{SSE/(n-K)}\)
    • The linear model is judged adequate (i.e. \(H_0\) is rejected) if \(F_{cal}>F_{\alpha}(K-1, n-K)\)
    • where \(K=p\) is the number of parameters estimated from the sample data (including the intercept).
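  • A hand computation of this F statistic for the gala multiple regression (a sketch; the model is refitted here under the name fit so that the chunk is self-contained):
fit  <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
y    <- gala$Species
K    <- length(coef(fit))                 # number of estimated parameters, including the intercept
n    <- nrow(gala)
SSR  <- sum((fitted(fit) - mean(y))^2)    # regression sum of squares
SSE  <- sum(residuals(fit)^2)             # error sum of squares
Fcal <- (SSR / (K - 1)) / (SSE / (n - K))
c(Fcal = Fcal, p.value = pf(Fcal, K - 1, n - K, lower.tail = FALSE))   # compare with 15.7 and 6.838e-07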

Tests on the regression coefficients

  • To test whether each of the coefficients are significant or not, the null and alternative hypotheses are given by:

    \(H_0:\beta_{j}= 0\) Vs \(H_A:\beta_{j}\not= 0\)

  • The test statistic is: \(t_{j}=\frac{\hat{\beta_j}}{S.e(\hat{\beta_j})}\)

  • If \(|t_j| > t_{\frac{\alpha}{2}}(n-K)\), we reject \(H_0\) and conclude that \(\beta_{j}\) is significant, that is, the regressor variable \(x_j\) significantly affects the dependent variable \(y\).
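  • The t statistics in the summary() output are exactly this ratio; a short check (reusing the fit object from the sketch above, although any lm fit would do):
tab  <- coef(summary(fit))                # columns: Estimate, Std. Error, t value, Pr(>|t|)
tval <- tab[, "Estimate"] / tab[, "Std. Error"]
cbind(by.hand = tval, from.summary = tab[, "t value"])
qt(1 - 0.05/2, df = df.residual(fit))     # two-sided critical value at alpha = 0.05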

Fitting linear regression models in R

Model formulation in R

  • We know that the least squares estimator of \(\beta\) for the linear model \({\bf y} = {\bf X} \beta + \varepsilon\) is given by \[ \widehat{\beta}= {(\bf X}'{\bf X})^{-1} {\bf X}'{\bf y} \] provided that \({\bf X}'{\bf X}\) is non-singular.
  • Similar to the simple linear regression model, a multiple linear regression model can also be fitted using the lm() function in R.
  • A general call of the function has the form lm(y ~ x1 + x2 + ... + xk, data = dataset).

Model formulation in R using the lm() function

Example 1: Galapagos Islands Data

  • Recall the Galapagos Islands Data example.
  • Consider now the relationship between the number of plant species and the five geographic variables Area, Elevation, Nearest, Scruz and Adjacent.
  • The linear model is fitted using the following R code:
library(faraway)
data(gala)
MLR.fit.1<-lm(Species~Area+Elevation+Nearest+Scruz+Adjacent,data=gala)

Model formulation in R using the lm() function

The output

summary(MLR.fit.1)
## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
##     data = gala)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.679  -34.898   -7.862   33.460  182.584 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.068221  19.154198   0.369 0.715351    
## Area        -0.023938   0.022422  -1.068 0.296318    
## Elevation    0.319465   0.053663   5.953 3.82e-06 ***
## Nearest      0.009144   1.054136   0.009 0.993151    
## Scruz       -0.240524   0.215402  -1.117 0.275208    
## Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared:  0.7658, Adjusted R-squared:  0.7171 
## F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

Model formulation in R using the lm() function

  • The parameter estimates, together with the residual standard error (an estimate of \(\sigma\)) and \(R^{2}\), are reported in the R output above.
  • The fitted regression model, keeping only the significant predictors, is: \[ \hat{y} = 7.068 + 0.319\, Elevation - 0.075\,Adjacent \]
  • \(R^2 = 0.7658\): 76.58% of the total variability in the number of plant species found on the island is accounted for by the fitted regression model (i.e. by Area, Elevation, Nearest, Scruz and Adjacent).

Model formulation in R using the lm() function

Example 2: The data

  • The airquality data frame in R contains information about the daily air quality measurements in New York, May to September 1973.
  • For more information about the data use:
help(airquality)
  • Our main interest is to model the dependency of the daily ozone level on radiation, wind speed and temperature.
  • We formulate the following regression model: \[ \mbox{Ozone}_{i}=\beta_{0}+\beta_{1}\mbox{Solar.R}_{i}+\beta_{2}\mbox{Wind}_{i}+\beta_{3}\mbox{Temp}_{i}+\varepsilon_{i}. \]

Model formulation in R using the lm() function

  • In R, the model \[ \mbox{Ozone}_{i}=\beta_{0}+\beta_{1}\mbox{Solar.R}_{i}+\beta_{2}\mbox{Wind}_{i}+\beta_{3}\mbox{Temp}_{i}+\varepsilon_{i}, \] can be fitted using
library(faraway)
data("airquality")
fit.air_q<-lm(Ozone~Solar.R+ Wind+Temp, data=airquality)

Model formulation in R using the lm() function

The output

summary(fit.air_q)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

Model formulation in R using the lm() function

  • Recall that a simple linear regression model of the form \[ \mbox{Ozone}_{i}=\beta_{0}+\beta_{1}\mbox{Temp}_{i}+\varepsilon_{i}, \] was fitted in Chapter 1.
  • In this chapter our aim is to fit a model of the form \[ \mbox{Ozone}_{i}=\beta_{0}+\beta_{1}\mbox{Temp}_{i}+\beta_{2}\mbox{Temp}^{2}_{i}+\varepsilon_{i}, \] which is a linear regression model with two predictors.

Model formulation in R using the lm() function

  • In R, the model \[ \mbox{Ozone}_{i}=\beta_{0}+\beta_{1}\mbox{Temp}_{i}+\beta_{2}\mbox{Temp}^{2}_{i}+\varepsilon_{i}, \] can be fitted after first constructing the response and predictor and removing the observations with missing Ozone values:
Ozone<-airquality$Ozone
Ozone1<-na.omit(Ozone)
Temp<-airquality$Temp
Temp1<-Temp[!is.na(Ozone)]

Model formulation in R using the lm() function

Scatter plot

plot(Temp1,Ozone1)

Model formulation in R using the lm() function

The output

fit.air_q1<-lm(Ozone1~Temp1)
summary(fit.air_q1)
## 
## Call:
## lm(formula = Ozone1 ~ Temp1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.729 -17.409  -0.587  11.306 118.271 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -146.9955    18.2872  -8.038 9.37e-13 ***
## Temp1          2.4287     0.2331  10.418  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.71 on 114 degrees of freedom
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

Model formulation in R using the lm() function

The output

Temp2<-Temp1^2
fit.air_q2<-lm(Ozone1~Temp1+Temp2)
summary(fit.air_q2)

Model formulation in R using the lm() function

The output

  • Parameter estimates for the quadratic model:
summary(fit.air_q2)
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 305.48577  122.12182   2.501 0.013800 *  
## Temp1        -9.55060    3.20805  -2.977 0.003561 ** 
## Temp2         0.07807    0.02086   3.743 0.000288 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.47 on 113 degrees of freedom
## Multiple R-squared:  0.5442, Adjusted R-squared:  0.5362 
## F-statistic: 67.46 on 2 and 113 DF,  p-value: < 2.2e-16

Appendix

The Gauss-Markov Theorem: Properties of \(\widehat{\beta}\)

The Gauss-Markov Theorem

  • Estimable function: A linear combination of the parameters \(\psi = {\bf c}'\,\beta\) is estimable if and only if there exists a linear combination \({\bf a}'\,{\bf y}\) such that \[ E({\bf a}'\,{\bf y}) = {\bf c}'\,\beta ~~~~~~\forall \beta. \]
  • The Gauss-Markov Theorem: Suppose \({\bf y} = {\bf X}\beta+\epsilon\), where \(E(\epsilon) = {\bf 0}\) and \(Var(\epsilon)=\sigma^2{\bf I}_n\). Let \(\psi = {\bf c}'\,\beta\) be an estimable function; then, in the class of all unbiased linear estimates of \(\psi\), \(\widehat{\psi} = {\bf c}'\widehat{\beta}\) has the minimum variance and is unique.

Proof

  • Suppose \({\bf a}'{\bf y}\) is some unbiased estimate of \({\bf c}'\beta\), so that \(E({\bf a}'{\bf y})={\bf c}'{\beta}\) for all \(\beta\). Since \(E({\bf y}) = {\bf X}\beta\), it follows that \({\bf a}'{\bf X} \beta = {\bf c}'\beta\) for all \(\beta\), which means that \({\bf a}'{\bf X} = {\bf c}'\).
  • This implies that \({\bf c}\) must be in the range space of \({\bf X}'\) which in turn implies that \({\bf c}\) is also in the range space of \({\bf X}'{\bf X}\) which means there exists \(\lambda\) such that \[\begin{eqnarray*} {\bf c}={\bf X}'{\bf X} \lambda\\\\ \Rightarrow {\bf c}'\widehat{\beta} & = & \lambda'{\bf X}'{\bf X}\widehat{\beta} = \lambda'{\bf X}'{\bf X}({\bf X}'{\bf X})^{-1}({\bf X}'{\bf y}) = \lambda'{\bf X}'{\bf y} \end{eqnarray*}\]

Proof

  • Now, let us compute the variance of \({\bf a}'{\bf y}\).

\[\begin{eqnarray*} Var({\bf a}'{\bf y}) &=& Var({\bf a}'{\bf y}+{\bf c}'\widehat{\beta}-{\bf c}'\widehat{\beta})\\ & = & Var({\bf a}'{\bf y} -\lambda'{\bf X}'{\bf y}+{\bf c}'\widehat{\beta})\\ & = & Var({\bf a}'{\bf y}-\lambda'{\bf X}'{\bf y}) + Var({\bf c}'\widehat{\beta}) + 2\,Cov({\bf a}'{\bf y} -\lambda'{\bf X}'{\bf y},\; \lambda'{\bf X}'{\bf y}). \end{eqnarray*}\]

The covariance term vanishes, since \(Cov({\bf a}'{\bf y} -\lambda'{\bf X}'{\bf y},\, \lambda'{\bf X}'{\bf y}) = ({\bf a}' -\lambda'{\bf X}')\,\sigma^2{\bf I}\,{\bf X}\lambda = \sigma^2({\bf a}'{\bf X} -\lambda'{\bf X}'{\bf X})\lambda = \sigma^2({\bf c}' - {\bf c}')\lambda = 0\).

Proof

  • So \(Var({\bf a}'{\bf y})=Var({\bf a}'{\bf y} - \lambda'\bf X'{\bf y})+ Var({\bf c}'\widehat{\beta})\).
  • Since \(Var({\bf a}'{\bf y} - \lambda'\bf X'{\bf y}) \ge 0\), \(Var({\bf a}'{\bf y})\geq Var({\bf c}'\widehat{\beta})\).
  • Thus \({\bf c'}\widehat{\beta}\) has minimum variance.

Proof

  • \(Var({\bf a}'{\bf y}) = Var({\bf c}'\widehat{\beta})\), if \(Var({\bf a}'{\bf y} -\lambda'\bf X'{\bf y})={\bf 0 }\) which would require that \({\bf a}' -\lambda'\bf X'={\bf 0}\) which means that \({\bf a}'{\bf y}=\lambda'\bf X'{\bf y}= {\bf c}'\widehat{\beta}\).
  • Therefore, equality occurs only if \({\bf a}'{\bf y} = {\bf c}'\widehat{\beta}\) revealing that the estimator is unique.

Implications

  • The Gauss-Markov theorem shows that the least squares estimate \(\widehat{\beta}\) is a good choice, but if the errors are correlated or have unequal variance, there will be better estimators.
  • Even if the errors are well behaved (uncorrelated, with equal variance) but non-normal, then non-linear or biased estimators may work better in some sense. So this theorem does not tell one to use least squares all the time; it just strongly suggests it unless there is some strong reason to do otherwise.

Implications

  • Situations where estimators other than ordinary least squares should be considered are

    • When the errors are correlated or have unequal variance, generalized least squares should be used.
    • When the error distribution is long-tailed, then robust estimates might be used. Robust estimates are typically not linear in \(y\).
    • When the predictors are highly correlated (collinear), then biased estimators such as ridge regression might be preferable.

Mean and Variance of \(\widehat\beta\)

  • Now: \(\widehat\beta=({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y}\)

  • Mean:

\[\begin{eqnarray*} E(\widehat\beta)&=& E(({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y})\\ &=& ({\bf X}'{\bf X})^{-1}{\bf X}'E({\bf y})\\ &=&({\bf X}'{\bf X})^{-1}{\bf X}'{\bf X}\beta\\ &=&\beta \end{eqnarray*}\]

  • Variance:

\[\begin{eqnarray*} Var(\widehat\beta)&=& Var(({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y})\\ &=&({\bf X}'{\bf X})^{-1}{\bf X}'Var({\bf y}){\bf X}({\bf X}'{\bf X})^{-1}\\ &=&({\bf X}'{\bf X})^{-1}{\bf X}'\sigma^{2}{\bf I}\,{\bf X}({\bf X}'{\bf X})^{-1}\\ &=&\sigma^{2}({\bf X}'{\bf X})^{-1} \end{eqnarray*}\]
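These matrix formulas are easy to verify numerically. Below is a minimal R sketch (simulated data; the object names are illustrative, not from these notes) comparing the hand-computed \(\widehat\beta\) and \(\sigma^2({\bf X}'{\bf X})^{-1}\) with what lm() and vcov() report:

# Numerical check of beta-hat = (X'X)^{-1} X'y and Var(beta-hat) = sigma^2 (X'X)^{-1}
set.seed(1)
n <- 50
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
X <- cbind(1, x)                          # design matrix with intercept column
solve(t(X) %*% X) %*% t(X) %*% y          # matches coef(lm(y ~ x))
fit <- lm(y ~ x)
summary(fit)$sigma^2 * solve(t(X) %*% X)  # matches vcov(fit), with sigma^2 estimated
vcov(fit)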

Estimation of \(\sigma^2\)

  • The variance-covariance matrix \(Var(\widehat{\beta})=\sigma^2({\bf X}'{\bf X})^{-1}\) contains the unknown parameter \(\sigma^2\). In this section we give an unbiased estimator of \(\sigma^2\). We start with some preliminaries from matrix theory. Lemma 1: Let \({\bf A}\) be a symmetric \((n \times n)\) matrix \(({\bf A}' = {\bf A})\); then
  (i) all eigenvalues \(\lambda_1,\ldots, \lambda_n\) of \({\bf A}\) are real and \[|{\bf A}| = \prod_{i=1}^n \lambda_i, ~~~~~~~ tr({\bf A}) = \sum\limits_{i=1}^n \lambda_i \]

Estimation of \(\sigma^2\)

  (ii) the eigenvectors \(u_1, \ldots, u_n\) are orthogonal;
  (iii) \({\bf A}\) can be diagonalized as \[ {\bf A} = {\bf U}{\bf D}{\bf U}^{-1} \] where \({\bf U} = (u_1, \ldots, u_n)\) and \({\bf D} = diag(\lambda_1, \ldots, \lambda_n)\);
  (iv) \(rank({\bf A})\) is the number of nonzero eigenvalues;
  (v) if \({\bf A}\) is an idempotent projection \((n \times n)\) matrix \(({\bf A}^2 = {\bf A})\), then \(\lambda_i = 1\) or \(0\), \(i=1,\ldots, n\);
  (vi) if \({\bf A}\) is an idempotent projection matrix, then \(rank({\bf A}) = tr({\bf A})\).

Estimation of \(\sigma^2\)

  • Proof of (vi) \[\begin{eqnarray*} tr({\bf A}) = \sum\limits_{i=1}^n \lambda_i = \# \{i : \lambda_i = 1\} = \mbox{rank}({\bf A}) \end{eqnarray*}\]
  • Let \({\bf X}\) denote the design matrix and assume that it has full rank. Define the \((n\times n)\) hat matrix, \({\bf H}\) given by \[\begin{eqnarray*} {\bf H} = {\bf X}({\bf X}'{\bf X})^{-1} {\bf X}' \end{eqnarray*}\]
  • Note that \(\widehat{\theta} ={\bf X} \widehat{\beta} ={\bf X}({\bf X}'{\bf X})^{-1} {\bf X}'{\bf y} = {\bf H}{\bf y}\). In the next lemma we show that \({\bf H}\) is an orthogonal projection (which corresponds with the fact that \(\widehat{\theta}\) is chosen such that \({\bf y} - \widehat{\theta} \perp {\bf R(X)}\)).

Lemma 2

Assume that the design matrix \({\bf X}\) has full rank \(p\).

  (i) \({\bf H}\) is symmetric and idempotent. The same holds for \({\bf I}_n - {\bf H}\).
  (ii) rank\(({\bf I}_n-{\bf H}) = tr({\bf I}_n-{\bf H}) = n-p\).
  (iii) \(({\bf I}_n-{\bf H}){\bf X} = {\bf 0}\).

Proof

  1. \({\bf H}' = {\bf H},\;({\bf I}_n-{\bf H})'={\bf I}_n-{\bf H}\)
  2. \({\bf H}^2 ={\bf X}({\bf X}'{\bf X})^{-1} {\bf X}' {\bf X}({\bf X}'{\bf X})^{-1}{\bf X'} = {\bf X}({\bf X}'{\bf X})^{-1} {\bf I}_p{\bf X}' = {\bf H}\)
  3. \(({\bf I}_n-{\bf H})^2 = {\bf I}_n^2 - 2{\bf H} + {\bf H}^2 = {\bf I}_n - {\bf H}\)
  4. \(tr({\bf H})= tr({\bf X}({\bf X}'{\bf X})^{-1}{\bf X}') = tr(({\bf X}'{\bf X})^{-1}{\bf X}'{\bf X})=tr({\bf I}_p)= p\)
  5. \(rank({\bf I}_n-{\bf H})=tr({\bf I}_n-{\bf H}) = tr({\bf I}_n)- tr({\bf H}) = n - p\)
  6. \(({\bf I}_n - {\bf H}){\bf X} ={\bf X} - {\bf X}({\bf X}'{\bf X})^{-1} {\bf X}'{\bf X} = {\bf X} - {\bf X} = {\bf 0}\)
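The three parts of Lemma 2 can be checked numerically. The following sketch (simulated full-rank design; names are illustrative only) verifies symmetry, idempotency, \(tr({\bf H}) = p\) and \(({\bf I}_n - {\bf H}){\bf X} = {\bf 0}\):

# Sketch: properties of the hat matrix H = X (X'X)^{-1} X'
set.seed(2)
n <- 30; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # full-rank design
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H, t(H))                 # symmetric
all.equal(H, H %*% H)              # idempotent
sum(diag(H))                       # trace equals p = 3
max(abs((diag(n) - H) %*% X))      # (I_n - H) X is numerically zero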

A quadratic form

Let \({\bf A}\) be a symmetric \((n\times n)\) matrix of constants and let \({\bf y}\) be an \((n \times 1)\) random vector. A quadratic form is defined as \({\bf y}'{\bf A}{\bf y} =\sum\limits_{i=1}^n \sum\limits_{j=1}^n a_{ij} y_i y_j\), note that \(a_{ij} =a_{ji}\).

Lemma 3 If \(E({\bf y}) = \theta\) and \(\sigma^2({\bf y})= \Sigma\), then \[ E({\bf y}'{\bf A}{\bf y}) = tr({\bf A}{\Sigma}) + \theta' {\bf A}\theta \]

A quadratic form

Proof \[\begin{eqnarray*} E({\bf y}'{\bf A}{\bf y}) &= & E[tr({\bf y}'{\bf A}{\bf y})] \\ &= & E[tr({\bf A}{\bf y}{\bf y}')] \\ &= & tr[E({\bf A}{\bf y}{\bf y}')] \\ &= & tr[{\bf A}E({\bf y}{\bf y}')] \\ &= & tr[{\bf A}({\bf \Sigma}+\theta\theta')], \;\;\text{because} \;\; {\bf \Sigma} = E({\bf y}{\bf y}')-\theta\theta' \\ &= & tr({\bf A}{\bf \Sigma}) + tr({\bf A}\theta\theta') \\ &= & tr({\bf A}{\Sigma}) + \theta'{\bf A}\theta \\ \end{eqnarray*}\]
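A quick Monte Carlo sketch (simulated values, illustrative only) of the identity \(E({\bf y}'{\bf A}{\bf y}) = tr({\bf A}\Sigma) + \theta'{\bf A}\theta\):

# Monte Carlo check of E(y'Ay) = tr(A Sigma) + theta' A theta
set.seed(3)
n <- 4
A <- crossprod(matrix(rnorm(n * n), n))      # a symmetric matrix
theta <- rnorm(n)
Sigma <- diag(n)                             # take Sigma = I_n for simplicity
qf <- replicate(1e5, { y <- theta + rnorm(n); drop(t(y) %*% A %*% y) })
mean(qf)                                     # simulated expectation
sum(diag(A %*% Sigma)) + drop(t(theta) %*% A %*% theta)   # theoretical value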

Theorem: Unbiased estimator of \(\sigma^2\)

Assume (i) \({\bf X}\) has full rank, (ii) \(E(\epsilon) = {\bf 0}\) and (iii) \(Var(\epsilon) = \sigma^2{\bf I}_n\). An unbiased estimator of \(\sigma^2\) is given by \[\begin{eqnarray*} \widehat{\sigma}^2= \frac{({\bf y} - {\bf X}\widehat{\beta})'({\bf y}-{\bf X} \widehat{\beta})}{n-p} \end{eqnarray*}\]

Proof \[\begin{eqnarray*} {\bf y} - {\bf X} \widehat{\beta} ={\bf y} - {\bf H}{\bf y} =({\bf I}_n - {\bf H}){\bf y} \end{eqnarray*}\] Therefore \[\begin{eqnarray*} (n-p)MSE &=& {\bf y}'({\bf I}_n-{\bf H})' ({\bf I}_n-{\bf H}){\bf y} \\ &=& {\bf y}'({\bf I}_n-{\bf H})^2{\bf y} \\ &=& {\bf y}'({\bf I}_n-{\bf H}){\bf y} \end{eqnarray*}\]

Proof

\[\begin{eqnarray*} E((n-p)MSE) &=& E({\bf y}'({\bf I}_n-{\bf H}){\bf y})\\ &=& tr[({\bf I}_n-{\bf H})\sigma^2({\bf y})] + E({\bf y})'({\bf I}_n-{\bf H})E({\bf y})\\ &=& \sigma^2 tr({\bf I}_n-{\bf H})+ \beta'{\bf X'}({\bf I}_n-{\bf H}){\bf X}\beta \\ &=& \sigma^2 (n-p) \end{eqnarray*}\] because \(({\bf I}_n-{\bf H}){\bf X}= {\bf 0}\) by Lemma 2(iii).
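This unbiasedness result is what summary.lm() reports: the residual standard error is \(\hat\sigma = \sqrt{SSE/(n-p)}\). A minimal sketch (simulated data, illustrative names only):

# sigma2-hat = RSS / (n - p) equals the squared residual standard error
set.seed(4)
n <- 40
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
p <- length(coef(fit))                 # number of parameters, here 2
sum(residuals(fit)^2) / (n - p)        # hand-computed sigma2-hat
summary(fit)$sigma^2                   # the same value from summary()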

Chapter Three: Statistical Inference

Introduction

  • In linear regression, estimation can be done using least squares (LSE), which does not require the normality assumption.
  • From the simple linear regression model, LSE gives \(\hat\beta_0\) and \(\hat \beta_1\), the point estimators of the parameters \(\beta_0\) and \(\beta_1\), respectively.
  • Statistical inference, however, is based on the normal distribution.
    • The errors, \(\epsilon_i\), are assumed to be mutually independent and identically normally distributed, i.e. \(\epsilon \sim {\cal N} (0, \sigma^2 I)\).
    • When the errors are not exactly normal, we rely on the Central Limit Theorem and treat the inference as approximately correct.
  • Inference generally takes the form of either of two related techniques:
    1. Hypothesis Tests
    2. Confidence Intervals
    (will see the details later)

Distribution of \(\widehat{\beta}\)

  • Recall the assumptions that the errors, \(\varepsilon_i\)’s are independent and identically normally distributed with mean zero and common variance \(\sigma^2\), that is the error vector \(\epsilon \sim \mathcal {N}({\bf 0}, \sigma^2\,{\bf I}_n)\). With this assumption and since \({\bf y} = {\bf X}\beta + \epsilon\), we have \({\bf y} \sim \mathcal{N}({\bf X}\beta, \sigma^2\,{\bf I}_n)\).
  • Furthermore, since the linear combinations of normally distributed random variables are also normal, the estimator \(\widehat{\beta}=({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y} \sim \mathcal{N}(\bf \beta, \sigma^2\,({\bf X}'{\bf X})^{-1})\).

Properties of multivariate normal distribution

  • Suppose an \(n \times 1\) random vector \({\bf z}\) is normally distributed with mean vector \(\theta\) and variance-covariance matrix \(\Sigma\), i.e. \({\bf z}\sim {\cal N}_n(\theta,\Sigma)\). The following properties are used below.

(P1) \(({\bf z}-\theta)' \Sigma^{-1} ({\bf z} -\theta) \sim \chi_n^2\).

(P2) If \({\bf z} \sim {\cal N}_n(\theta, \Sigma)\) and \({\bf C}\) is a \((p \times n)\) matrix of rank \(p\), then \({\bf C}{\bf z} \sim {\cal N}_p({\bf C}\theta, {\bf C} \Sigma {\bf C}')\).

Suppose \({\bf z} \sim {\cal N}_n(\theta, \sigma^2{\bf I}_n)\) and let \({\bf U} = {\bf A}\,{\bf z}\) and \({\bf V} = {\bf B}\,{\bf z}\). Let \({\bf A}_1\) represent the linearly independent rows of \({\bf A}\) and let \({\bf U}_1 = {\bf A}_1 {\bf z}\). Then, if \(Cov({\bf U}, {\bf V})= {\bf 0}\), we have:

(P3) \({\bf U}_1\) is independent of \({\bf V}'{\bf V}\);

(P4) \({\bf U}'{\bf U}\) and \({\bf V}'{\bf V}\) are independent.

Distribution of \(\hat{\sigma}^2\)

We need the following two results to prove a theorem related to the distributions of \(\widehat{\beta}\) and \(\hat{\sigma}^2\).

(P5) If \[Q_i \sim \chi_{r_i}^2, i = 1,2 ~~~~~\mbox{and}\;\;\ r_1 > r_2\] and \(Q = Q_1 - Q_2\) is independent of \(\ Q_2,\) then \[ Q \sim \chi_r^2\ \mbox{where}\ r = r_1 - r_2. \]

(P6) Recall from Chapter 1 (i) Let \({\bf A}\) be any \((m\times n)\) matrix. Then \(rank({\bf A}) = rank({\bf A}')\); (ii) Let \({\bf P}\) be a nonsingular \((m\times m)\) matrix and \({\bf R}\) a nonsingular \((n \times n)\) matrix. Then \(rank({\bf A}) = rank({\bf P}{\bf A}) = rank({\bf A}{\bf R})\).

Theorem: Distribution of \(\hat{\sigma}^2\)

Under the assumptions that \({\bf X}\) has full rank and \(\epsilon \sim \mathcal{N}({\bf 0}, \sigma^2 {\bf I}_n)\):

  1. \(\widehat{\beta} \sim {\cal N}_p (\beta, \sigma^2 ({\bf X}'{\bf X})^{-1})\)
  2. \(\frac{(\widehat{\beta} - \beta)'{\bf X}'{\bf X}(\widehat{\beta}-\beta)}{\sigma^2}\sim \chi_p^2\)
  3. \(\widehat{\beta}\) and \(\hat{\sigma}^2\) are independent
  4. \(\frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi_{n-p}^2\)

Proof

  (i) Let \({\bf C} = ({\bf X}'{\bf X})^{-1} {\bf X}'\) and observe that \({\bf C}\) is a \(p \times n\) matrix. Using result P6(ii) with \({\bf P} = ({\bf X}'{\bf X})^{-1}\) and \({\bf A} = {\bf X}'\) we have \(rank({\bf C}) = rank({\bf X}') = rank({\bf X}) = p\). Since \(rank({\bf C}) = p\) we can apply result P2 to obtain \(\widehat{\beta} \sim {\cal N}_p (\beta, \sigma^2 ({\bf X}'{\bf X})^{-1})\).

To confirm this, \({\bf C}E({\bf y}) = \beta\) and \({\bf C} \sigma^2 {\bf I}_n{\bf C}' = \sigma^2 ({\bf X}'{\bf X})^{-1}\).

  (ii) This follows from result P1, i.e. \((\widehat{\beta} - \beta)' (Var(\widehat{\beta}))^{-1} (\widehat{\beta}-\beta) \sim \chi_p^2\).

Proof

  (iii) The proof is based on results P3 and P4, with \({\bf A}\) and \({\bf B}\) in these results defined as follows: \[ \widehat{\beta} = ({\bf X}'{\bf X})^{-1} {\bf X}'{\bf y} \equiv {\bf A}{\bf y} \equiv {\bf u} \] and, with \({\bf H} = {\bf X}({\bf X}'{\bf X})^{-1} {\bf X}'\), \[ {\bf y} - {\bf X}\widehat{\beta} = ({\bf I}_n - {\bf H}){\bf y} \equiv {\bf B}{\bf y} \equiv {\bf v}. \] Now show that \(Cov({\bf A}{\bf y}, {\bf B}{\bf y}) = {\bf 0}\). \[\begin{array}{l} Cov({\bf A}{\bf y}, {\bf B}{\bf y}) = {\bf A}Cov({\bf y}, {\bf y}){\bf B}'\\ = \sigma^2({\bf X}'{\bf X})^{-1} \underbrace{{\bf X}'({\bf I}_n-{\bf H})}_{\bf 0} \end{array} \]

Proof

Furthermore result P6(ii) implies rank\(({\bf A}) =\) rank\(({\bf X})\) and therefore we can apply result P3 with \({\bf A}_1 = {\bf A}\) to obtain the independence of \({\bf u}_1 = ({\bf X}'{\bf X})^{-1} {\bf X}'{\bf y} = \widehat{\beta}\) and \[ {\bf v}'{\bf v } = {\bf y}'({\bf I}_n - {\bf H})' ({\bf I}_n - {\bf H}){\bf y} = (n-p)\hat{\sigma}^2. \]

  (iv) This proof is based on result P5. \[\begin{eqnarray*} Q_1 &=&({\bf y-X}\beta)'({\bf y-X}\beta)\\ &=& ({\bf y-X}\widehat{\beta} +{\bf X}(\widehat{\beta}-\beta))'({\bf y-X}\widehat{\beta} + {\bf X}(\widehat{\beta}-\beta))\\ &=& ({\bf y-X} \widehat{\beta})'({\bf y-X}\widehat{\beta}) \leftarrow Q\\ &\ & + 2(\widehat{\beta}-\beta)'{\bf X}'({\bf y-X}\widehat{\beta}) \leftarrow {\bf 0}\\ &\ & + (\widehat{\beta} - \beta)'{\bf X}'{\bf X}(\widehat{\beta}-\beta) \leftarrow Q_2. \end{eqnarray*}\]

Proof

To see that the second term on the right-hand side (rhs) equals 0, note that \[\begin{eqnarray*} (\widehat{\beta} - \beta)'{\bf X'(y-X\widehat{\beta})} = (\widehat{\beta}-\beta)'\underbrace{{\bf X}'({\bf I}_n-{\bf H}){\bf y}}_{={\bf 0}\ (Lemma\ 2(iii))} \end{eqnarray*}\] Therefore \[\begin{eqnarray*} Q_1 = Q + Q_2. \end{eqnarray*}\] Since \(Q_2\) is a continuous function of \(\widehat{\beta}\), we have independence of \(Q\) and \(Q_2\) by part (iii) of the present theorem. Now \[\begin{eqnarray*} \frac{ Q_1}{\sigma^2} = \sum\limits_{i=1}^n \left(\frac{\varepsilon_i^2}{\sigma^2}\right) \sim \chi_n^2 \end{eqnarray*}\]

Proof

and by part (ii) of the present theorem we have \[\begin{eqnarray*} \frac{Q_2}{\sigma^2} \sim \chi_p^2. \end{eqnarray*}\] Hence we obtain from result P5 \[\begin{eqnarray*} \frac{ Q }{\sigma^2} = \frac{( Q_1 - Q_2)}{\sigma^2} \sim \chi_{n-p}^2. \end{eqnarray*}\]
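The chi-square result for \(\hat{\sigma}^2\) can be illustrated by simulation. The sketch below (simulated data, illustrative only) checks that \((n-p)\hat{\sigma}^2/\sigma^2\) has mean about \(n-p\) and variance about \(2(n-p)\), as a \(\chi^2_{n-p}\) variate should:

# Simulation sketch: (n - p) sigma2-hat / sigma2 behaves like chi-square(n - p)
set.seed(5)
n <- 25; sigma <- 2
x <- runif(n)
stat <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n, sd = sigma)
  sum(residuals(lm(y ~ x))^2) / sigma^2     # = (n - p) sigma2-hat / sigma2, p = 2
})
c(mean(stat), var(stat))                    # close to n - 2 = 23 and 2(n - 2) = 46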

Sum of Squares and Analysis of Variances (ANOVA)

Decomposition for the total sum of squares

  • Let \({\bf\widehat{y}}=(\widehat{y}_{1},\ldots,\widehat{y}_{n})'\) denote the vector of fitted values and \({\bf e}=(e_{1},...,e_{n})'\) with \(e_{i} = y_{i}- \widehat{y}_{i}\) the vector of residuals, where \({\bf \widehat{y}} = {\bf X} \widehat{\beta}\) and \({\bf e = y - \widehat{y}} ={\bf y - X \widehat{\beta}}\).
  • Note that the normal equations \({\bf X'X}\widehat{\beta} ={\bf X'y}\) imply \(p\) constraints on \({\bf e}\); indeed \[{\bf X'}({\bf y-X}\widehat{\beta})= {\bf X}'({\bf y-\widehat{y}})={\bf X'e=0}.\]

Analysis of Variance (ANOVA)

  • Analysis of Variance (or ANOVA) is another tool for interpreting the results from a linear regression.
  • We start with the total variation in the response, \(Y\): \(SSTO\equiv SS_{yy}=\sum_{i=1}^n{(Y_i-\bar Y)^2}\)
  • This is referred to as the “total sum of squares” or “sum of squares total” \((SSTO)\) or “sum of squares total (corrected)”.
  • The word “corrected” means that the \(\ Y\) values are corrected or adjusted for the mean, \(\bar Y\). Hence, \(\ SSTO\) measures the variability of the \(\ Y\) values around their observed mean.

Analysis of Variance (ANOVA)

  • We now show how the SSTO is partitioned or decomposed into different sources.
  • The deviation \(Y_i-\bar Y\) is decomposed into different components.
  • So, by adding and subtracting \(\hat Y_i\), \[\underbrace{Y_i-\bar Y}_{\text{total deviation}}= \underbrace{Y_i-\hat Y_i}_{\text{deviation around fitted regression line}}+ \underbrace{\hat Y_i-\bar Y}_{\text{deviation of fitted value around mean}}\]

Analysis of Variance (ANOVA)

  • Square both sides and sum over \(i\):

\[\begin{eqnarray*} \ SSTO &=& \sum_{i=1}^n(Y_i - \bar{Y})^2\\ &=& \sum_{i=1}^n (Y_i - \hat Y_i +\hat Y_i- \bar{Y})^2\\ &=& \sum_{i=1}^n [(Y_i -\hat Y_i)^2 +(\hat Y_i- \bar{Y})^2 + 2(Y_i - \hat Y_i)(\hat Y_i-\bar{Y})]\\ \ SSTO &=&\underbrace{\sum_{i=1}^n (Y_i -\hat Y_i)^2}_{SSE} + \underbrace{\sum_{i=1}^n (\hat Y_i- \bar{Y})^2}_{SSR} + \underbrace{2\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i- \bar{Y})}_{0} \end{eqnarray*}\]

Analysis of Variance (ANOVA)

  • The first term, \(SSE=\sum_{i=1}^n(Y_i-\hat Y_i)^2=\sum_{i=1}^n e_i^2\), is the sum of squared residuals and is called the error sum of squares or sum of squares due to error (SSE).
  • The second term, \(SSR=\sum_{i=1}^n(\hat Y_i-\bar Y)^2\), is called the regression sum of squares or sum of squares due to regression. It measures the variability accounted for by the fitted regression line.
  • The last term, \(2\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i- \bar{Y})=0\).

ANOVA Table

Source      df     Sum of Squares   Mean Squares        F
----------------------------------------------------------
Regression  p-1    SSR              MSR = SSR/(p-1)     MSR/MSE
Residual    n-p    SSE              MSE = SSE/(n-p)
----------------------------------------------------------
Total       n-1    SSTO

Summary

  • The total sum of squares partition is summarized as:

\[ SS_{Total} =({\bf y} - \bar{y}{\bf 1_n})'({\bf y} - \bar{y}{\bf 1_n}) = \sum_{i=1}^n (y_i - \bar{y})^2 \] \[ SSR = (\hat{{\bf y}} - \bar{y}{\bf 1_n})'(\hat{{\bf y}} - \bar{y}{\bf 1_n}) = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \] and \[ SSE = {\bf (y - \hat{y})'(y - \hat{y})} ={\bf e'e} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \] where \({\bf \overline{ y}} = n^{-1} \sum\limits_{i=1}^n y_i\), and \({\bf 1}_n = (1, \ldots, 1)'\).

Summary

Or in matrix form they are given by \[\begin{eqnarray*} SSTO &=&{\bf y'y - \frac{1}{n} y' 1_n 1_n' y}\\ SSR &=&\widehat{\beta}'{\bf X'y} - \frac{1}{n}{\bf y'1_n 1_n' y}\\ SSE &=& {\bf y'y} - \widehat{\beta}'{\bf X'y} \end{eqnarray*}\]

  • Therefore, the fundamental ANOVA identity for the linear model is \(SS_{Total} = SSR + SSE\).
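The matrix forms above are easy to check numerically; a short sketch with simulated data (illustrative only) confirms the identity \(SS_{Total} = SSR + SSE\):

# Numerical check of SSTO = SSR + SSE using the matrix forms above
set.seed(6)
n <- 20
x <- rnorm(n)
y <- 3 + 1.5 * x + rnorm(n)
X <- cbind(1, x)
b <- solve(t(X) %*% X, t(X) %*% y)              # least squares estimate
SSTO <- sum(y^2) - sum(y)^2 / n                 # y'y - (1/n) y'1 1'y
SSR  <- drop(t(b) %*% t(X) %*% y) - sum(y)^2 / n
SSE  <- sum(y^2) - drop(t(b) %*% t(X) %*% y)
c(SSTO, SSR + SSE)                              # equal, up to rounding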

Hypothesis Testing

General linear hypothesis

  • Consider the linear model \({\bf y = X} \beta + \epsilon\) where \({\bf X}\) is a full rank design matrix and \(\epsilon \sim \mathcal{N}({\bf 0}, \sigma^2{\bf I}_n)\). The hypothesis

\[ H_0 : {\bf A}\beta ={\bf c} \]

where \({\bf A}\) is a known \((q\times p)\) matrix of rank \(q\) and \({\bf c}\) is a known \((q \times 1)\) vector is called a general linear hypothesis.

  • The requirement that \({\bf A}\) must be of full row rank is not overly restrictive, since one may delete rows of \({\bf A}\) which are linear combinations of the remaining rows.
  • For example, the two hypotheses \(\beta_1-\beta_2 = 0\) and \(\beta_1-\beta_3=0\) automatically imply that \(\beta_2-\beta_3=0\), the latter being redundant.
  • We must of course be sure that the deleted hypotheses are consistent with the remaining ones (e.g. \(\beta_2-\beta_3 = 10\) would not be consistent with \(\beta_1-\beta_2 = 0\) and \(\beta_1-\beta_3=0\)).

General linear hypothesis

  • Consider the \(q\times 1\) random vector \({\bf A}\widehat{\beta} - {\bf c}\), where \(\widehat{\beta}\) is the least squares estimator of \(\beta\). Observe that
    • \(E({\bf A}\widehat{\beta} - {\bf c}) = {\bf A}\beta - {\bf c}\) and
    • \(Var({\bf A}\widehat{\beta} - {\bf c}) = {\bf A}Var(\widehat{\beta}){\bf A}' = \sigma^2{\bf A}({\bf X}'{\bf X})^{-1}{\bf A}'\).
  • Since \(\widehat{\beta} \sim \mathcal{N}_p({\beta}, \sigma^2({\bf X}'{\bf X})^{-1})\), it follows that \[ {\bf A}\widehat{\beta} - {\bf c} \sim \mathcal{N}_q ({\bf A}{\beta} - {\bf c}, \sigma^2{\bf A}({\bf X}'{\bf X})^{-1}{\bf A}'), \] so under \(H_0\) the mean vector is \({\bf 0}\).

Quadratic Forms

Quadratic forms

  • We will now study the following quadratic forms which are relevant in the testing of linear hypotheses.

\(Q_0 = {\bf y}'{\bf y} = {\bf y}'{\bf I}_n {\bf y} = \sum_{i=1}^n y_i^2\)

\(Q_1 = {\bf y}'{\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}'{\bf y}\)

\(Q_2 = Q_0 - Q_1 = {\bf y}'({\bf I}_n - \bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}'){\bf y}\)

\(Q_3 = \widehat{\beta}'({\bf X}'{\bf X})\widehat{\beta}\)

\(Q_4 = Q_0 - Q_3 = {\bf y}'{\bf y} - \widehat{\beta}'({\bf X}'{\bf X})\widehat{\beta}\)

\(Q_5 = ({\bf A}\widehat{\beta} - {\bf c})' \left [{\bf A}({\bf X}'{\bf X})^{-1}{\bf A}'\right]^{-1}({\bf A}\widehat{\beta} - {\bf c})\)

  • Next, we will show that each \(Q_i=\bf y'\bf A_i\bf y\) where \(\bf A_i\) is idempotent, i.e. \(\bf A_i\bf A_i=\bf A_i\) (a small numerical check is sketched below).
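Here is the promised sketch (simulated full-rank design, illustrative only): it computes \(Q_0,\ldots,Q_4\) and checks that \(Q_2\) is the corrected total sum of squares and \(Q_4 = Q_0 - Q_3\) is the residual sum of squares:

# Sketch: the quadratic forms Q0-Q4 for a simulated full-rank design
set.seed(7)
n <- 15
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -1) + rnorm(n))
H  <- X %*% solve(t(X) %*% X) %*% t(X)
Q0 <- sum(y^2)
Q1 <- n * mean(y)^2                        # y'1(1'1)^{-1}1'y
Q2 <- Q0 - Q1
Q3 <- drop(t(y) %*% H %*% y)
Q4 <- Q0 - Q3
c(Q2, sum((y - mean(y))^2))                # Q2 equals the corrected total SS
c(Q4, sum(residuals(lm(y ~ X[, -1]))^2))   # Q4 equals SSE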

Quadratic forms

  • Recall that \(Q_i/\sigma^2\) has a noncentral chi-square distribution with \(r_{i}=r({\bf A}_{i})\) degrees of freedom and noncentrality parameter \(E({\bf y})'{\bf A}_{i}E({\bf y})/\sigma^2 = \beta'{\bf X}'{\bf A}_{i}{\bf X}\beta/\sigma^2\).
  • Recall also that \(Q_{i}\) and \(Q_{j}\) are independent if \({\bf A}_i{\bf A}_j = {\bf 0}\).
  • \(Q_0\): Since \({\bf I}_n{\bf I}_n = {\bf I}_n\) and \(r({\bf I}_n) =n\), \(Q_0/\sigma^2 = {\bf y}'{\bf I}_n {\bf y}/\sigma^2\) has a noncentral chi-square distribution with \(n\) degrees of freedom and noncentrality parameter (n.c.p.) \(\lambda_0 = \beta'{\bf X}'{\bf X}\beta/\sigma^2\). Since \({\bf X}'{\bf X}\) is positive definite, \(\lambda_0 = 0\) if and only if \(\beta = {\bf 0}\).

Quadratic forms

  • \(Q_1\): Since \({\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}'\) is idempotent, \(r({\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}') = 1\), and \(({\bf 1}'{\bf 1})^{-1}=1/n\), \(Q_1/\sigma^2 = {\bf y}'{\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}'{\bf y}/\sigma^2\) has a noncentral chi-square distribution with one degree of freedom and n.c.p. \(\lambda_1 = \beta'{\bf X}'{\bf 1}{\bf 1}'{\bf X}\beta/n\,\sigma^2\). It follows that \(\lambda_1 = 0\) if and only if \(\beta = {\bf 0}\) or if \({\bf X}'{\bf 1}={\bf 0}\). The latter condition implies that the sum of the elements in each row of \({\bf X}'\) (that is, each column of \({\bf X}\)) is zero.
  • \(Q_2\): Recall from Chapter 1 that \({\bf I}_n - \bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}'\) is idempotent and also observe that \(r({\bf I}_n - \bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}') = n-1\). Therefore, \(Q_2/\sigma^2\) has a noncentral chi-square distribution with \(n-1\) degrees of freedom and n.c.p.

Quadratic forms

\(\lambda_2 = \beta'{\bf X}'({\bf I}_n - \bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}'){\bf X}\beta/\sigma^2\). Note that \(\lambda_2 = 0\) if \(\beta = {\bf 0}\). Consider now the situation where \(\beta \ne {\bf 0}\), for example let \[ {\bf X}= \left( \begin{array}{cccc} 1& x_{12}& \ldots& x_{1, p}\\ 1& x_{22}& \ldots& x_{2,p}\\ \vdots& & & \\ 1& x_{n2}& \ldots& x_{n,p} \end{array} \right) \]

Quadratic forms

that is, the first column of \({\bf X}\) consists of ones and \(\beta = (\beta_1, 0, \ldots, 0)'\). Therefore, the linear model is \[ \left( \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array} \right) = \left( \begin{array}{cccc} 1& x_{12}& \ldots& x_{1, p}\\ 1& x_{22}& \ldots& x_{2,p}\\ \vdots& & & \\ 1& x_{n2}& \ldots& x_{n,p} \end{array} \right) \left( \begin{array}{c} \beta_1\\ 0 \\ \vdots\\ 0 \end{array} \right) + \left( \begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_n \end{array} \right) \]

Quadratic forms

\[ = \left( \begin{array}{c} \beta_1\\ \beta_1 \\ \vdots\\ \beta_1 \end{array} \right) + \left( \begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_n \end{array} \right), \] that is, the \(y_i\)’s have the same expected value \(\beta_1\). Therefore, \({\bf X}{\beta} = \beta_1{\bf 1}\) and \({\beta}'{\bf X}'=\beta_1{\bf 1}'\) and \[ \lambda_2 = \beta_1{\bf 1}'({\bf I}_n - \bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}'){\bf 1}\beta_1/\sigma^2 = \beta_1({\bf 1}'{\bf 1}-{\bf 1}'\bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}'{\bf 1})\beta_1/\sigma^2 = 0. \]

Quadratic forms

  • In this case, therefore, \(Q_2/\sigma^2\) is a (central) chi-square variate with \(n-1\) degrees of freedom, regardless of the value of \(\beta_1\). In fact, \(Q_2 = Q_0-Q_1 = \sum_{i=1}^ny_i^2 - n\bar{y}^2 = \sum_{i=1}^n(y_i - \bar{y})^2\) is a well-known quadratic form called the total sum of squares (adjusted for the mean).
  • Since \({\bf 1}({\bf 1}'{\bf 1})^{-1}{\bf 1}'({\bf I}_n - \bf 1({\bf 1}'{\bf 1})^{-1}{\bf 1}') = {\bf 0}\) it follows that \(Q_1\) and \(Q_2\) are stochastically independent. Thus, we have proved a well-known result: that \(n\,\bar{y}^2\) and \(\sum_{i=1}^n(y_i - \bar{y})^2\) are stochastically independent.
  • \(Q_3\): Since \(\widehat{\beta} = ({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y}\) and \(\widehat{\beta}' = {\bf y}'{\bf X}({\bf X}'{\bf X})^{-1}\), \(Q_3\) may be written in a number of alternative ways:

Quadratic forms

\[ Q_3 = \widehat{\beta}'({\bf X}'{\bf X})\widehat{\beta} = {\bf y}'{\bf X}({\bf X}'{\bf X})^{-1}({\bf X}'{\bf X})({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y} = {\bf y}'{\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y}. \] - Recall that \({\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'\) is idempotent and has rank \(p\). Thus, \(Q_3 / \sigma^2\) is a noncentral chi-square variate with \(p\) degrees of freedom and noncentrality parameter \[ \lambda_3 = \beta'{\bf X}'{\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'{\bf X}\beta / \sigma^2 = \beta'{\bf X}'{\bf X}\beta / \sigma^2 \] Since \({\bf X}'{\bf X}\) is positive definite, \(\lambda_3 = 0\) if and only if \(\beta = {\bf 0}\). The quadratic form \(Q_3\) is usually termed the regression sum of squares or the sum of squares due to the model.

Quadratic forms

  • \(Q_4\): \[ Q_4 = Q_0 - Q_3 = {\bf y}'{\bf I}_n{\bf y} - {\bf y}'{\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'{\bf y} = {\bf y}'({\bf I}_n - {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'){\bf y} \]
  • Recall that \({\bf I}_n - {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'\) is idempotent matrix with rank \(n-p\). Since the non-centrality parameter \[ \lambda_4 = \frac{\beta'{\bf X}'({\bf I}_n - {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'){\bf X}\beta}{\sigma^2} = 0 \]

Quadratic forms

\(Q_4/\sigma^2\) has a central chi-square distribution with \(n - p\) degrees of freedom, whatever the value of \(\beta\), provided the model is correct.

  • \(Q_4\) is equal to the sum of squares of the residuals, because \[ \begin{array}{ll} SSE &= ({\bf y} - \widehat{\bf y})'({\bf y} - \widehat{\bf y})=({\bf y} - {\bf X}\widehat{\beta})'({\bf y} - {\bf X}\widehat{\beta})\\ &= {\bf y}'{\bf y} - \widehat{\beta}'{\bf X}'{\bf X}\widehat{\beta} \\ & = Q_0 - Q_3 \end{array} \]

Quadratic forms

  • Recall from Chapter 1 that \([{\bf X}({\bf X}'{\bf X})^{-1}{\bf X}']({\bf I}_n - {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}')={\bf 0}\). Therefore, it follows that \(Q_3\) and \(Q_4\) are independent.
  • \(Q_5\): Since \({\bf A}\widehat{\beta} - {\bf c} \sim \mathcal{N}_q ({\bf A}\beta - {\bf c}, \sigma^2{\bf A}({\bf X}'{\bf X})^{-1}{\bf A}')\), it follows from distribution theory (Chapter 1) that \(Q_5/\sigma^2\) has a non-central chi-square distribution with degrees of freedom \(rank({\bf A}({\bf X}'{\bf X})^{-1}{\bf A}')=q\) and non-centrality parameter \[ \lambda_5 = \frac{({\bf A}\beta - {\bf c})'[{\bf A}({\bf X}'{\bf X})^{-1}{\bf A}']^{-1}({\bf A}\beta - {\bf c})}{\sigma^2}. \]
  • Note that \(\lambda_5=0\) if and only if \({\bf A}\beta = {\bf c}\). The quadratic form \(Q_5\) is called the sum of squares due to the hypothesis. Note also that the quadratic form \(Q_3\) is a special case of \(Q_5\) with \({\bf A} = {\bf I}_{p}\) and \({\bf c} = {\bf 0}\).
  • Exercise: Prove that \(Q_4\) and \(Q_5\) are independent.

Summary

A general rule for testing a hypothesis in the linear model is as follows:

  1. Find a quadratic form \(Q_i\) such that \(Q_i/ \sigma^2\) has a central chi-square distribution if \(H_0\) is true and a noncentral chi-square distribution if \(H_0\) is not true. Let \(Q_i/ \sigma^2\) have \(r_{i}\) degrees of freedom.
  2. Find a second quadratic form \(Q_j\) which is independent of \(Q_i\) and such that \(Q_j/ \sigma^2\) has a central chi-square distribution whether \(H_0\) is true or not. Let \(Q_j/ \sigma^2\) have \(r_{j}\) degrees of freedom.

Summary

  3. Compute \[ F = \frac{Q_i / r_i}{Q_j / r_j} \] which has a central \(F_{r_i, r_j}\) distribution if \(H_0\) is true and a noncentral F-distribution if \(H_0\) is not true. If \(F > F_{\alpha; (r_i, r_j)}\) reject \(H_0\) at the 100\(\alpha\)% level.

Hypothesis tests to compare models

  • Consider a linear model \({\bf y} = {\bf X}\beta + \epsilon\) with several predictors. Consider two models, denoted by \(M_1\) and \(M_2\), where \(M_2\) contains a subset of the predictors that are in \(M_1\). By the principle of parsimony, we would prefer to use \(M_2\) if the data support it.
  • We can compare the two models by conducting the test \(H_0: M_2~~\mbox{is supported by the data}\) versus \(H_1: M_1~~\mbox{is supported by the data}\). Let the numbers of parameters of \(M_1\) and \(M_2\) be \(q\) and \(p\), respectively. Note that \(q > p\).

Hypothesis tests to compare models

  • Let the sums of squares of errors (residuals) for models \(M_1\) and \(M_2\) be denoted by \(SSE_1\) and \(SSE_2\), respectively. By Cochran’s theorem, if \(H_0\) is true then \[ \frac{SSE_2 - SSE_1}{\sigma^2} \sim \chi^2_{q-p},~~~~~~~~~~~~~ \frac{SSE_1}{\sigma^2} \sim \chi^2_{n-q} \] and the two quantities are independent. Therefore, by the above summary we find that \[ F = \frac{(SSE_2 - SSE_1)/(q-p)}{SSE_1/(n-q)} \sim F_{q-p, n-q}. \]
  • We would reject \(H_0\), if \(F > F_{\alpha, (q-p, n-q)}\).

Test of all predictors

  • Are any of the predictors useful in predicting the response? This is equivalent to comparing

    • Full model (\(M_1\)): \({\bf y} = {\bf X}\beta + \epsilon\) where \({\bf X}\) is a full rank \(n \times p\) matrix, and
    • Reduced model (\(M_2\)): \({\bf y} = \mu + \epsilon\) predicted by the mean (\(\mu = \beta_0\))
  • The null hypothesis can be written as \(H_0 : \beta_1 = \ldots = \beta_{p-1} = 0\). This is a special case of the general linear hypothesis \(H_0: {\bf A}\beta = {\bf c}\), where

\[ {\bf A} = \left( \begin{array}{ccccc} 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ldots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{array} \right ) = [{\bf I}_{p-1} ~~ {\bf 0}_{(p-1)\times 1}] ~~\mbox{and} ~~~ {\bf c}_{(p-1)\times 1} = {\bf 0}_{(p-1)\times 1}. \]

Test of all predictors

  • Now
    • \(SSE_1=(y-X\widehat \beta)'(y-X\widehat \beta)=\hat e'\hat e=SSE\)
    • \(SSE_2=(y-\bar y)'(y-\bar y)=SSTO\), which is sometimes known as the sum of squares corrected for the mean.

Thus \(Q_5 / \sigma^2 \sim \chi^2_{p-1}\) and \[ F = \frac{Q_5 / (p-1)}{SSE /(n-p)} = \frac{(SSTO-SSE)/(p-1)}{SSE/(n-p)} \sim F_{p-1, n-p}. \]

  • We would reject \(H_0\), if \(F > F_{\alpha, (p-1, n-p)}\).
  • The information in the above test is presented in an analysis of variance table.
  • A failure to reject the null hypothesis is not the end of the game, we must still investigate the possibility of non-linear transformations of the variables and of outliers which may obscure the relationship.

Test of all predictors

  • When the null is rejected, this does not imply that the alternative model is the best model.
  • We don’t know whether all the predictors are required to predict the response or just some of them.
  • Other predictors might also be added
    • for example quadratic terms in the existing predictors.
  • Either way, the overall F-test is just the beginning of an analysis and not the end.

Testing Analysis of Variance using R

Data: Savings rates in 50 countries

The savings data frame in the faraway library has 50 rows and 5 columns. These data are averaged over the period 1960-1970 to remove business cycle or other short-term fluctuations. This data frame contains the following variables:

  • sr: savings rate - is aggregated personal saving divided by disposable income
  • pop15: percent population under age of 15
  • pop75: percent population over age of 75
  • dpi: per-capita disposable income in dollars
  • ddpi: percent rate of change in per capita disposable income (percent growth rate of dpi)
  • sr is the response variable and the other four are predictor variables.

Testing Analysis of Variance using R

Data: Savings rates in 50 countries

  • Consider a model with all the predictors. The linear model is then fitted using the following R code:
library(faraway)
data(savings)
All.fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)

The output

summary(All.fit)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904

The output

  • Since the \(p\)-value = 0.0007904 is very small we reject the null hypothesis \(H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\).
  • The analysis of variance table is obtained by the R code:
anova(All.fit)
## Analysis of Variance Table
## 
## Response: sr
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## pop15      1 204.12 204.118 14.1157 0.0004922 ***
## pop75      1  53.34  53.343  3.6889 0.0611255 .  
## dpi        1  12.40  12.401  0.8576 0.3593551    
## ddpi       1  63.05  63.054  4.3605 0.0424711 *  
## Residuals 45 650.71  14.460                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output

  • From the R output we have \(SSE = 650.71\).
  • We can also find this using the code
sum(All.fit$res^2)
## [1] 650.713
  • We can also do it directly using the F-testing formula:
library(faraway)
data(savings)
attach(savings)
mean(sr)
## [1] 9.671

The output

sum((savings$sr-mean(savings$sr))^2)
## [1] 983.6283
sum(All.fit$res^2)
## [1] 650.713

F-statistic is:

((983.63-650.71)/4)/(650.706/45)
## [1] 5.755825
1-pf(5.7558,4,45)
## [1] 0.0007902633

Testing just one predictor

  • Consider the null hypothesis \(H_0: \beta_i = 0\). This can be tested using the above procedure by fitting

    • \(M_1\) is the model with all the predictors of interest, i.e. \(y_j = \beta_0 + \beta_1\,x_{1j} + \ldots + \beta_q\,x_{qj} + \epsilon_{j}\)
    • \(M_2\) is the model with all the predictors except predictor \(i\)
  • The \(F\)-statistic computed using \[ F = \frac{(SSE_2 - SSE_1)/(1)}{SSE_1/(n-q)} \sim F_{(1, n-q)}. \]

  • \(SSE_1\) is the \(SSE\) for the model with all the predictors of interest (\(q\) parameters).

  • \(SSE_2\) is the \(SSE\) for the model with all the above predictors except predictor \(i\).

  • We would reject \(H_0\), if \(F > F_{\alpha, (1, n-q)}\).

Testing just one predictor

  • The above can also be tested using the general linear hypothesis by choosing the \(1\times p\) matrix \({\bf A}=(0 \ldots 1 \ldots 0)\) with 1 in the \((i+1)\)th entry and \({\bf c=0}\), which corresponds to \(H_0 : \beta_i = 0\).

  • Alternatively, it can be tested using a \(t\)-statistic \[ t_i = \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \sim t_{n-p} ~~~\mbox{if}~~H_0~~\mbox{is true} \] where \(se(\widehat{\beta}_i) = \sqrt{\widehat{\sigma}^2\,a_{ii}}\) and \(a_{ii}\) is the (\(i,i\))th element of \(({\bf X}'{\bf X})^{-1}\).

  • We would reject \(H_0\) if \(|t| > t_{1-\alpha/2, n-p}\).

  • However, since the square of a \(t_{d}\) variate is an \(F_{1, d}\) variate, the two approaches are equivalent.

Testing a predictor using R

Data: Savings rates in 50 countries

  • In the savings data example, consider testing the null hypothesis \(\beta_1 = 0\), i.e. that pop15 is not significant in the full model. From the linear model output we have the \(p\)-value = 0.002603 and conclude that the null hypothesis should be rejected. Equivalently, since \(t_{0.025, 45} < |t|=3.189\), we reject the null hypothesis \(\beta_1 = 0\).

  • For the general linear hypothesis we need to fit two models: \[ M_1: sr = \beta_0 + \beta_1\,pop15 + \beta_2\,pop75 + \beta_3\,dpi + \beta_4\,ddpi + \varepsilon \]

    \[ M_2: sr = \beta_0 + \beta_2\,pop75 + \beta_3\,dpi + \beta_4\,ddpi + \varepsilon \]

  • The R code to fit \(M_2\) is

One.fit.1 <- lm(sr ~ pop75 + dpi + ddpi, savings)

Testing a predictor using R

Data: Savings rates in 50 countries

  • The analysis of variance table is
anova(One.fit.1)
## Analysis of Variance Table
## 
## Response: sr
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## pop75      1  98.55  98.545  5.6825 0.02132 *
## dpi        1   2.13   2.135  0.1231 0.72729  
## ddpi       1  85.22  85.223  4.9143 0.03162 *
## Residuals 46 797.72  17.342                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing a predictor using R

Data: Savings rates in 50 countries

  • Therefore, \(SSE_2 = 797.72\).
  • Recall from the previous analysis that the sum of squares of errors for \(M_1\) is \(SSE = 650.71\).
  • Thus, the \(F\)-statistic for the general linear hypothesis is \[ F = \frac{(SSE_2 - SSE_1)/1}{SSE_1/45} = \frac{797.72 - 650.71}{650.71/45} = 10.167. \]
  • This has a \(p\)-value
1-pf(10.167, 1, df.residual(All.fit))
## [1] 0.002602553
  • Therefore, since \(p\)-value is very small we reject \(\beta_1 = 0\).
  • Note that \(\sqrt{F} = \sqrt{10.167} \approx 3.189 = t\).

Testing a predictor using R

Data: Savings rates in 50 countries

  • We can also do it directly using the F-testing formula:
library(faraway)
data(savings)
attach(savings)
## The following objects are masked from savings (pos = 3):
## 
##     ddpi, dpi, pop15, pop75, sr
sum(One.fit.1$res^2)
## [1] 797.7249

The output

F-statistic is:

(797.72-650.71)/(650.706/45)
## [1] 10.16657

The p-value is then

1-pf(10.167, 1, 45)
## [1] 0.002602553
  • We can relate this to the t-based test and p-value by
sqrt(10.167)
## [1] 3.188573
2*(1-pt(3.1886,45))
## [1] 0.002602358

The output

  • Since \(M_2\) is nested in \(M_1\), we can also use the R function anova() to conduct the above test:
anova(One.fit.1, All.fit)
## Analysis of Variance Table
## 
## Model 1: sr ~ pop75 + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     46 797.72                                
## 2     45 650.71  1    147.01 10.167 0.002603 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing a subspace

  • Suppose we want to test \(H_0: \beta_l = \beta_m\).

  • This defines a linear subspace and can be tested using the general linear hypothesis procedure by choosing the \(1 \times p\) matrix \({\bf A}=(0, \ldots, 1, 0, \ldots, -1, 0, \ldots, 0)\) where 1 and -1 are in the \((l + 1)\)th and \((m + 1)\)th entries and \({\bf c} = {\bf 0}\).

  • Consider the savings example. Suppose we are interested in testing

    \(H_0: \beta_{pop15} = \beta_{pop75}\),

i.e. that the effect of young and old people on the savings rate is the same.

  • The model under \(H_0\) therefore has the form \[ sr = \beta_0 + \beta_{pop15}(pop15 + pop75) + \beta_{dpi}\,dpi + \beta_{ddpi}\,ddpi + \varepsilon. \]

Testing a subspace using R

  • The comparison of this model with the full model is done using the R code
data(savings)
All.fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
Reduced.fit <- lm(sr ~ I(pop15 + pop75) +
dpi + ddpi, savings)
anova(Reduced.fit, All.fit)
## Analysis of Variance Table
## 
## Model 1: sr ~ I(pop15 + pop75) + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     46 673.63                           
## 2     45 650.71  1    22.915 1.5847 0.2146

The output

  • The I() function ensures that the argument is evaluated arithmetically rather than interpreted as part of the model formula.
  • The \(p\)-value of 0.2146 indicates that \(H_0: \beta_{pop15} = \beta_{pop75}\) cannot be rejected.
  • This implies that there is insufficient evidence from the data that young and old people need to be treated separately in the context of this particular model.

Testing a subspace using R

  • Suppose we want to test whether one of the coefficients can be set to a particular value, say

\(H_0: \beta_{ddpi} = 1\).

Then the reduced model under \(H_0\) has the form

\[ sr = \beta_0 + \beta_{pop15}\, pop15 + \beta_{pop75}\,pop75 + \beta_{dpi}\,dpi + ddpi + \varepsilon. \]

  • There is now no coefficient on the ddpi term, and such a fixed term in the regression equation is called an offset.

The output

  • The comparison of this model with the full model is done using the R code:
Fixed.fit <- lm(sr ~ pop15 + pop75 + dpi + offset(ddpi), savings)
anova(Fixed.fit, All.fit)
## Analysis of Variance Table
## 
## Model 1: sr ~ pop15 + pop75 + dpi + offset(ddpi)
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     46 781.61                                
## 2     45 650.71  1     130.9 9.0525 0.004286 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We see that the p-value is small and the null hypothesis here is soundly rejected.

The output

  • A simpler way of testing a point hypothesis is as follows.

\(H_0: \beta_i = c\),

where \(c\) is the hypothesized value, can also be tested using the \(t\)-statistic \[ t = \frac{\widehat{\beta}_i - c}{se(\widehat{\beta}_i)}. \]

  • From the full model analysis output we have \(\widehat{\beta}_{ddpi} = 0.409695\) and \(se(\widehat{\beta}_{ddpi}) = 0.1961971\). Therefore, \[ t = \frac{0.409695 - 1}{0.1961971} = -3.0087 \] or

tstat<-((0.409695-1)/0.196197)
tstat
## [1] -3.008736

The output

2*pt(((0.409695-1)/0.196197),45)
## [1] 0.004286075
  • Observe that this is the same as the \(p\)-value from the general linear hypothesis approach.
  • Furthermore, \(t^2 = 9.052493\) or,

using R,

tstat^2
## [1] 9.052493

is approximately equal to the above \(F\)-value.

  • In practice, the latter approach is preferred because it doesn’t need fitting two models.

Confidence intervals

Introduction

  • Confidence intervals provide an alternative way of expressing the uncertainty in our estimates.
  • Given a level of significance \(\alpha\), we consider here 100(1 − \(\alpha\))% confidence intervals and regions.
  • For a \(100(1 - \alpha)\)% confidence region, any point that lies within the region represents a null hypothesis that would not be rejected at the \(100\alpha\)% level while every point outside represents a null hypothesis that would be rejected.
  • Therefore, they are closely linked to the tests that we have already constructed.

Introduction

  • However, the confidence region provides a lot more information than a single hypothesis test in that it tells us the outcome of a whole range of hypotheses about the parameter values.
  • This makes it more valuable.
  • But note that by selecting the particular level of confidence for the region, we can only make tests at that level and we cannot determine the p-value for any given test simply from the region.

Simultaneous confidence region

  • Recall from the theorem related to the distribution of \(\widehat{\sigma}^2\) that \[ \frac{(\widehat{\beta} - \beta)'{\bf X}'{\bf X}(\widehat{\beta}-\beta)}{\sigma^2}\sim \chi_p^2 ~~~\mbox{and}~~~ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi_{n-p}^2. \]
  • They are also independent. Hence,

\[ \frac{(\widehat{\beta} - \beta)'{\bf X}'{\bf X}(\widehat{\beta}-\beta)/p\,\sigma^2}{(n-p) \hat{\sigma}^2/(n-p)\sigma^2}\sim F_{p, n-p}. \]

  • A 100(\(1-\alpha\))% confidence region for \(\beta\) has the form \[ (\widehat{\beta} - \beta)'{\bf X}'{\bf X}(\widehat{\beta}-\beta) \le p\,\hat{\sigma}^2\,F_{\alpha, (p,n-p)}. \]
  • These regions are ellipsoidally shaped and in higher dimensions they cannot easily be visualized.

Confidence interval for individual regression coefficient

  • Alternatively, one could consider each parameter individually which leads to confidence intervals which take the general form of

estimate \(\pm\) critical value \(\times\) s.e. of estimate

  • Observe, from Part (i) of the theorem discussed in this chapter, that \[ \widehat{\beta}_i \sim \mathcal{N}(\beta_i, \sigma^2\,a_{ii}) \] where \(a_{ii}\) is the (\(i,i\))th entry of \(({\bf X}'{\bf X})^{-1}\).
  • Hence \[ \frac{(\widehat{\beta}_i - \beta_i)}{\sqrt{\sigma^2\,a_{ii}}} \sim \mathcal{N}(0, 1). \]

Confidence interval for individual regression coefficient

  • From distribution theory

    \[ \frac{(\widehat{\beta}_i - \beta_i)/\sqrt{\sigma^2\,a_{ii}}}{\sqrt{\dfrac{(n-p)\,\hat{\sigma}^2}{\sigma^2}\Big/(n-p)}} = \frac{\widehat{\beta}_i - \beta_i}{\hat{\sigma}\sqrt{a_{ii}}} \sim t_{n-p}. \]

  • Then solving for \(\beta_i\), a 100(\(1-\alpha\))% confidence interval for \(\beta_i\) is given by \[ \widehat{\beta}_i \pm t_{\alpha/2, n-p} \times \hat{\sigma}\,\sqrt{a_{ii}}. \]

Estimation of regression coefficients in R

Data: Savings rates in 50 countries

All.fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
summary(All.fit)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904

Confidence intervals for the regression parameter of pop75 in R

  • The individual 95% confidence interval for the regression parameter of pop75 (using R code) is
c(-1.691498 - qt(0.975, 45)*(1.083599),
-1.691498 + qt(0.975, 45)*(1.083599))
## [1] -3.8739784  0.4909824
  • This interval contains zero which indicates that the null hypothesis \(H_0: \beta_{pop75} = 0\) would not be rejected at the 5% level.
  • From the regression analysis output we have the \(p\)-value =0.1255 \(>\) 5% so we don’t reject \(H_0: \beta_{pop75} = 0\) confirming this point.

The output

  • However, the 95% confidence interval of \(\beta_{ddpi}\) is
c(0.409695 - qt(0.975, 45)*(0.196197),
    0.409695 + qt(0.975, 45)*(0.196197))
## [1] 0.01453396 0.80485604
  • This interval for \(\beta_{ddpi}\) does not contain zero and so the null hypothesis \(H_0: \beta_{ddpi} = 0\) is rejected at the 5% level.
  • The \(p\)-value = 0.04247 \(<\) 5% also suggests rejection of \(H_0: \beta_{ddpi} = 0\) at the 5% level.
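These one-at-a-time intervals can also be obtained directly from the fitted model with base R's confint(), which computes the same t-based intervals as the hand calculations above:

confint(All.fit)                       # 95% t-intervals for all coefficients
confint(All.fit, c("pop75", "ddpi"))   # just the two intervals computed above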

The output

  • Consider the joint 95% confidence region for these parameters, i.e. (\(\beta_{pop75}, \beta_{ddpi}\)).
  • We can draw the confidence ellipses using the R library ellipse. Note this is not part of base R, so you must install it.

Plot of Confidence ellipse and regions for \(\beta_{pop75}\) and \(\beta_{pop15}\)

library(ellipse)
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
    plot(ellipse(All.fit, c(2,3)),type="l",xlim=c(-1,0))
    points(0,0)
    points(All.fit$coef[2],All.fit$coef[3],pch=15)
    abline(v=c(-0.461-2.01*0.145,-0.461+2.01*0.145),lty=2)
    abline(h=c(-1.69-2.01*1.08,-1.69+2.01*1.08),lty=2)

Figure: Confidence ellipse and regions for \(\beta_{pop75}\) and \(\beta_{pop15}\)

Interpretation

  • The lines are not tangential to the ellipse because the confidence intervals are calculated individually.
  • If we wanted a 95% chance that both intervals contain their true values, then the lines would be tangential. This can be obtained using the R code
library(ellipse)
plot(ellipse(All.fit,c(2,3),t=qt(0.95,45)),type="l", xlim=c(-1,0))
points(0,0)
points(coef(All.fit)[2],coef(All.fit)[3], pch=18)
abline(v=confint(All.fit)[2,],lty=2)
abline(h=confint(All.fit)[3,],lty=2)

Interpretation

  • In some circumstances, the origin could lie within both one-way confidence intervals, but lie outside the ellipse.
  • In this case, both one-at-a-time tests would not reject the null hypothesis whereas the joint test would.
  • The latter test would be preferred.
  • It’s also possible for the origin to lie outside the rectangle but inside the ellipse.
  • In this case, the joint test would not reject the null whereas both one-at-a-time tests would reject.
  • Again we prefer the joint test result.

Correlation coefficient of \(\beta_{pop75}\) and \(\beta_{pop15}\)

  • Observe that from the plot the coefficients have a positive correlation.
  • The correlation between predictors and the correlation between the coefficients of those predictors are often different in sign.
  • For example, the correlation between pop15 and pop75 is
cor(savings$pop15,savings$pop75)
## [1] -0.9084787
  • This may imply that two negatively correlated predictors are attempting to perform the same job.
  • That is, the more work one does, the less the other can do and hence the positive correlation in the coefficients.

Confidence intervals for predictions

Prediction of a future value or new observations

  • A point estimate of a future observation \(y_0\) at the point \({\bf x}'_0 = (x_{01}, x_{02}, \ldots, x_{0p})\) is \[ \hat y_{0} = {\bf x}'_0\hat \beta, ~~~\mbox{with}~~~ Var({\bf x}_0'\hat \beta)=\sigma^2\,{\bf x}_0'({\bf X}'{\bf X})^{-1}{\bf x}_0, \] where we assume the new error \(\varepsilon_0 \sim N(0, \sigma^2)\) is independent of \(\varepsilon_1,\ldots, \varepsilon_n\). The prediction error is \(e_0 = y_{0} - \widehat{y}_0\).
  • Since the future observation \(y_0\) is independent of \(\widehat{y}_{0}\) (which depends only on the observed data), we have

\[ Var(e_0)= Var(y_{0} - \widehat{y}_0) = Var(y_{0}) + Var(\hat y_{0}) = \sigma^2 ({\bf x}'_0({\bf X}'{\bf X})^{-1}{\bf x}_0 + 1). \]

Prediction interval of a future value or new observations

  • A future observation is \(y_0 = {\bf x}_0'\beta+\varepsilon_0\) and is predicted by \(\widehat{y}_0 = {\bf x}_0'\hat\beta\).

  • So a 100(\(1 - \alpha\))% confidence interval for this future observation \(y_{0}\) is given by: \[ \widehat{y}_{0} \pm t(1 - \alpha/2; n-p)\,\widehat{\sigma} \sqrt{1 + {\bf x}'_0({\bf X}'{\bf X})^{-1}{\bf x}_0}. \]

Prediction of future mean response

  • Note that \(E(y_{0}) = E({\bf x}'_0\beta + \varepsilon_0) = {\bf x}'_0\beta=E(y\,|\,{\bf x}_0)\), where \({\bf x}'_0 = (x_{01}, x_{02}, \ldots, x_{0p})\).
  • Its estimate is given by \(\widehat{y}_0 = {\bf x}'_0 \widehat{\beta}\), with variance \[ Var(\widehat{y}_{0}) = {\bf x}'_0 Var(\widehat{\beta}) {\bf x}_0 = \sigma^2\, {\bf x}'_0({\bf X}'{\bf X})^{-1}{\bf x}_0. \]
  • So a 100(\(1 - \alpha\))% confidence interval for the average response for a given \({\bf x}'_0\) is given by \[ \widehat{y}_{0} \pm t(1 - \alpha/2; n-p)\,\widehat{\sigma} \sqrt{{\bf x}'_0({\bf X}'{\bf X})^{-1}{\bf x}_0}. \]

Prediction of future value and mean response using R

Data: Galapagos Islands

  • Suppose an island is discovered in the future with predictor values 0.08, 93, 6.0, 12.0, 0.34 (in the same order as in the data set); what will be the predicted mean number of species (of tortoise)?
  • This value can be calculated using the predict() function.
MLR.fit.1 <- lm(Species ~ Area + Elevation + 
                  Nearest + Scruz + Adjacent, data=gala)
x0 <- data.frame(Area=0.08, Elevation=93, 
                 Nearest=6.0, Scruz=12, Adjacent=0.34)

The output

  • The mean response confidence interval is:
predict(MLR.fit.1, x0, interval = "confidence")
##        fit      lwr      upr
## 1 33.91967 1.033826 66.80551
  • The prediction interval for one new (future) island is:
predict(MLR.fit.1, x0, interval = "prediction")
##        fit      lwr      upr
## 1 33.91967 -96.1528 163.9921

The output

OR

x_0<-c(1,0.08,93,6,12,0.34)
y_0<-sum(x_0*MLR.fit.1$coef)
y_0
## [1] 33.91967

The t-critical value is:

qt(0.975,24)
## [1] 2.063899
  • Calculating the \((X'X)^{-1}\)
x<-cbind(1, gala[,3:7])
x<-as.matrix(x)
xtxi<-solve(t(x)%*%x)

The output

The half-width of the mean response CI is:

bm<-sqrt(x_0 %*%xtxi %*%x_0)*2.064*60.98
bm
##          [,1]
## [1,] 32.89005

and the interval is

c(y_0-bm,y_0+bm)
## [1]  1.029614 66.809721
  • The prediction interval for the single future response is
bm<-sqrt(1+x_0 %*%xtxi %*%x_0)*2.064*60.98
c(y_0-bm,y_0+bm)
## [1] -96.16946 164.00879

The output

  • The prediction interval has a negative lower limit; since the response is the number of species, this value does not make sense.
  • In such instances, impossible values in the confidence interval can be avoided by transforming the response, say taking logs, or by using a probability model more appropriate to the response.
  • A better choice for this example might be the Poisson distribution which is supported on the non-negative integers.
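As a hedged sketch of that alternative (not pursued further in these notes), the same predictors could be fed to a Poisson regression via R's glm(); the log link keeps fitted counts non-negative:

# Sketch only: Poisson GLM for the Galapagos species counts
# (assumes the gala data and the data frame x0 defined above are available)
pois.fit <- glm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
                family = poisson, data = gala)
predict(pois.fit, x0, type = "response")   # predicted mean count, always >= 0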

Orthogonality and Identifiability

Orthogonality

  • Suppose the design matrix \({\bf X}\) can be partitioned in two as \({\bf X} = [{\bf X}_1 | {\bf X}_2]\) such that \({\bf X}_1'{\bf X}_2 = {\bf 0}\).
  • So, we have \[ {\bf y} = {\bf X} \beta + \epsilon = {\bf X}_1 \beta_1 + {\bf X}_2 \beta_2 + \epsilon\] and \[ {\bf X}'{\bf X} = \left ( \begin{array}{cc} {\bf X}'_1{\bf X}_1 & {\bf X}'_1{\bf X}_2 \\ {\bf X}'_2{\bf X}_1 & {\bf X}'_2{\bf X}_2 \end{array} \right ) = \left (\begin{array}{cc} {\bf X}'_1{\bf X}_1 & {\bf 0} \\ {\bf 0} & {\bf X}'_2{\bf X}_2 \end{array} \right ). \]
  • This implies that, the least squares estimators of \(\beta_1\) and \(\beta_2\) are given by \[ \widehat{\beta}_1 = ({\bf X}'_1 {\bf X}_1)^{-1}{\bf X}'_1{\bf y}~~~~~~\mbox{and}~~~\widehat{\beta}_2 = ({\bf X}'_2 {\bf X}_2)^{-1}{\bf X}'_2{\bf y}, \] respectively.

Orthogonality

  • Notice that \(\widehat{\beta}_1\) will be the same regardless of whether \({\bf X}_2\) is in the model or not (and vice versa).
  • Now, to test \(H_0 : \beta_1 = 0\), it should be noted that the residual sum of squares will differ depending on whether \({\bf X}_2\) is included in the model or not, but the difference in \(F\) is not liable to be as large as in non-orthogonal cases.
  • Orthogonality usually occurs when \({\bf X}\) is chosen by the experimenter.
  • It is a feature of a good experimental design.
  • In observational data, the researchers do not have direct control over \({\bf X}\) which is the source of much of the interpretational difficulties associated with non-experimental data.

Example: Odor of chemical by production settings

Data from an experiment to determine the effects of column temperature, gas/liquid ratio and packing height in reducing the unpleasant odor of a chemical product that was being sold for household use. The data has the following columns:

- odor: Odor score
- temp: Temperature coded as -1, 0 and 1
- gas:  Gas/Liquid ratio coded as -1, 0 and 1
- pack: Packing height coded as -1, 0 and 1

Example: Odor of chemical by production settings

  • The three predictors have been transformed from their original scale of measurement, for example \(temp = (Fahrenheit-80)/40\) so the original values of the predictor were 40,80 and 120.
  • The \({\bf X}\)-matrix extracted from the data using the R code:
library(faraway)
data(odor)
x <- as.matrix(cbind(1,odor[,-1]))

Example: Odor of chemical by production settings

  • Then the \({\bf X}'{\bf X}\) calculated using the R code:
t(x) %*% x
##       1 temp gas pack
## 1    15    0   0    0
## temp  0    8   0    0
## gas   0    0   8    0
## pack  0    0   0    8
  • The matrix is diagonal.
  • Note that if, for example, temp was measured in the original Fahrenheit scale, the matrix would still be diagonal but the entry corresponding to temp would change.

Example: Odor of chemical by production settings

  • The full linear model:
odor.fit.1 <- lm(odor ~ temp + gas + pack, odor)
summary(odor.fit.1, cor = T)
## 
## Call:
## lm(formula = odor ~ temp + gas + pack, data = odor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.200 -17.138   1.175  20.300  62.925 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   15.200      9.298   1.635    0.130
## temp         -12.125     12.732  -0.952    0.361
## gas          -17.000     12.732  -1.335    0.209
## pack         -21.375     12.732  -1.679    0.121
## 
## Residual standard error: 36.01 on 11 degrees of freedom
## Multiple R-squared:  0.3337, Adjusted R-squared:  0.1519 
## F-statistic: 1.836 on 3 and 11 DF,  p-value: 0.1989
## 
## Correlation of Coefficients:
##      (Intercept) temp gas 
## temp 0.00                 
## gas  0.00        0.00     
## pack 0.00        0.00 0.00

Example: Odor of chemical by production settings

Correlation of Coefficients:
     (Intercept) temp gas 
temp 0.00                 
gas  0.00        0.00     
pack 0.00        0.00 0.00
  • The correlation of the coefficients are all zero implying orthogonality.
  • Since the design is balanced the standard errors for the coefficients are equal.
  • Now refit the linear model by dropping temp:
odor.fit.2 <- lm(odor ~ gas + pack, odor)

Example: Odor of chemical by production settings

summary(odor.fit.2)
## 
## Call:
## lm(formula = odor ~ gas + pack, data = odor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.200 -26.700   1.175  26.800  50.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   15.200      9.262   1.641    0.127
## gas          -17.000     12.683  -1.340    0.205
## pack         -21.375     12.683  -1.685    0.118
## 
## Residual standard error: 35.87 on 12 degrees of freedom
## Multiple R-squared:  0.2787, Adjusted R-squared:  0.1585 
## F-statistic: 2.319 on 2 and 12 DF,  p-value: 0.1408

Example: Odor of chemical by production settings

  • The coefficient values do not change, but the residual standard error changes slightly, which causes small changes in the standard errors of the coefficients, \(t\)-statistics and \(p\)-values, though nowhere near enough to change our qualitative conclusions.
  • Since the data were from an experiment, it was possible to control the values of the predictors to ensure orthogonality.
  • Now consider the savings data which is observational. The full linear model results are
All.fit.1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)

Example: Observational study

summary(All.fit.1)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904

Example: Observational study

  • Now refit the linear model by dropping \(pop_{15}\).
All.fit.2 <- lm(sr ~ pop75 + dpi + ddpi, savings)
summary(All.fit.2)
## 
## Call:
## lm(formula = sr ~ pop75 + dpi + ddpi, data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0577 -3.2144  0.1687  2.4260 10.0763 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.4874944  1.4276619   3.844  0.00037 ***
## pop75       0.9528574  0.7637455   1.248  0.21849    
## dpi         0.0001972  0.0010030   0.197  0.84499    
## ddpi        0.4737951  0.2137272   2.217  0.03162 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.164 on 46 degrees of freedom
## Multiple R-squared:  0.189,  Adjusted R-squared:  0.1361 
## F-statistic: 3.573 on 3 and 46 DF,  p-value: 0.02093

Example: Observational study

  • What changed?
  • By how much? Pay particular attention to pop75.
  • The effect of pop75 has now become positive whereas before it was negative. Granted, in neither case is it significant, but it is not uncommon in other datasets for such sign changes to occur and for them to be significant.

Identifiability

  • Recall the least squares normal equations: \[ {\bf X}'{\bf X}\widehat{\beta} = {\bf X}'{\bf y} \] where \({\bf X}\) is an \(n \times p\) matrix.
  • If \({\bf X}'{\bf X}\) is singular and cannot be inverted, then there will be infinitely many solutions to the normal equations and
  • \(\widehat{\beta}\) is at least partially unidentifiable.
  • Unidentifiability will occur when \({\bf X}\) is not of full rank, i.e. when its columns are linearly dependent.

Identifiability

  • With observational data, unidentifiability is usually caused by some oversight:

    • Putting the number of years of K-12 education, the number of years of post-high-school education, and also the total number of years of education into the same model (the third variable is the sum of the first two).
    • When there are more variables than cases (\(p > n\)).
    • When \(p = n\), we may perhaps estimate all the parameters, but with no degrees of freedom left to estimate any standard errors or do any testing. Such a model is called saturated.
    • When \(p > n\), then the model is called supersaturated.

Identifiability

  • Consider a two sample experiment: a treatment group with response observations \(y_1, \ldots, y_n\) and a control group with \(y_{n+1}, \ldots, y_{m+n}\). Consider the model: \[ y_j = \mu + \alpha_i + \varepsilon_j, ~~~~i=1,2,~~~j=1, \ldots, m+n \] \[ \left ( \begin{array}{c}y_1 \\ \vdots \\ y_n\\ y_{n+1}\\ \vdots \\ y_{m+n} \end{array}\right ) = \left ( \begin{array}{ccc} 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 1 \end{array}\right ) \left ( \begin{array}{c} \mu \\ \alpha_1 \\ \alpha_2 \end{array}\right ) + \left ( \begin{array}{c}\varepsilon_1 \\ \vdots \\ \varepsilon_{m+n} \end{array}\right ) \] \[ {\bf y} = {\bf X}\beta + \epsilon \]

Identifiability

  • \(rank({\bf X}) = 2\) and hence \(\beta = (\mu, \alpha_1, \alpha_2)'\) is not identifiable.
  • The normal equations have infinitely many solutions.
  • To solve this problem we can impose some constraints on the parameters, for example, either \(\mu=0\) or \(\alpha_1 + \alpha_2 = 0\).
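  • A minimal sketch of this in R (simulated data, not from the notes): the unconstrained design matrix for the two-sample model is rank deficient, and lm() with a factor imposes a constraint (one \(\alpha\) is set to zero via the default treatment contrasts) so that the remaining parameters are identifiable:
set.seed(1)
y <- c(rnorm(5, mean = 10), rnorm(5, mean = 12))          # treatment then control responses
group <- factor(rep(c("treatment", "control"), each = 5))
X <- cbind(1, group == "treatment", group == "control")   # columns for mu, alpha1, alpha2
qr(X)$rank                                                # 2 < 3, so X is not of full rank
lm(y ~ group)                                             # identifiable once a constraint is imposed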

Interpreting parameter estimates

Interpreting parameter estimates

  • In a few tightly controlled experiments, it is possible to claim that measurement error is the only kind of error but usually some of the “error” actually comes from the effects of unmeasured variables.
  • Let the linear model decompose as follows: \[ {\bf y} = {\bf X}\,\beta + \epsilon = {\bf X}\,\beta + {\bf Z}\,\gamma +\delta \] where \({\bf Z}\) are unincluded predictors and \(\delta\) is measurement error in the response. Assume without any loss of generality that \(E(\epsilon) = {\bf 0}\), because if \(E(\epsilon) = {\bf c}\), we could simply redefine \(\beta_0\) as \(\beta_0 + c\) and the error would again have expectation zero.

Interpreting parameter estimates

  • This is another reason why it is generally unwise to remove the intercept term from the model since it acts as a sink for the mean effect of unincluded variables.
  • So we see that \(\epsilon\) incorporates both measurement error and the effect of other variables.
  • In a designed experiment, provided the assignment of the experimental units is random, we have \(Cor({\bf X}, {\bf Z}) = {\bf 0}\), so that the estimate of \(\beta\) is unaffected in expectation by the presence of \({\bf Z}\).

Interpreting parameter estimates

  • For observational data,
    • no randomization can be used in assigning treatments to the units and orthogonality won’t just happen.
    • There are serious objections to any causal conclusions.
    • An inference of causality is often desired but this is usually too much to expect from observational data.
    • An unmeasured and possibly unsuspected “lurking” variable \({\bf Z}\) may be the real cause of an observed relationship between \({\bf y}\) and \({\bf X}\).

Interpreting parameter estimates

  • Because we have no control over the assignment of units, in general \(Cor({\bf X}, {\bf Z}) \ne {\bf 0}\).
  • So the presence of \({\bf Z}\), observed or, worse, unobserved, causes us great difficulty.
  • In observational studies, it is important to adjust for the effects of possible confounding variables.
  • If such variables can be identified, then at least their effect can be interpreted. Unfortunately, one can never be sure that all relevant \({\bf Z}\) have been identified.

Interpreting parameter estimates

  • Suppose all relevant variables have been measured. In other words, suppose there are no unidentified lurking variables.
  • Even then the naive interpretation does not work. Consider \[ y = \widehat{\beta}_0 + \widehat{\beta}_1\,x_1 + \widehat{\beta}_2\,x_2 \] Suppose we change \(x_2 \rightarrow x_1 + x_2\) then \[ y = \widehat{\beta}_0 + (\widehat{\beta}_1 - \widehat{\beta}_2)\,x_1 + \widehat{\beta}_2(x_1 + \,x_2). \]

Interpreting parameter estimates

  • The coefficient for \(x_1\) has changed.
  • Interpretation cannot be done separately for each variable.
  • This is a practical problem because it is not unusual for the predictor of interest, \(x_1\) in this example, to be mixed up in some way with other variables like \(x_2\).
  • Another interpretation: “\(\widehat{\beta}_1\) is the effect of \(x_1\) when all the other (specified) predictors are held constant”.
  • This too has problems. Often in practice, individual variables cannot be changed without changing others too.
  • For example, in economics we can’t expect to change tax rates without other things changing too.
  • Furthermore, this interpretation requires the specification of the other variables - changing which other variables are included will change the interpretation. Unfortunately, there is no simple solution.
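  • A small illustration of this point (a sketch using the savings data loaded above, not part of the original example): reparameterizing the second predictor changes the coefficient of the first, while the fitted values stay the same:
fit.a <- lm(sr ~ pop15 + pop75, savings)
fit.b <- lm(sr ~ pop15 + I(pop15 + pop75), savings)
coef(fit.a)
coef(fit.b)                               # the pop15 coefficient has changed
all.equal(fitted(fit.a), fitted(fit.b))   # TRUE: same fit, different interpretation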

Chapter Four: Testing for Lack of Fit

Introduction

  • How can we tell if a model fits the data?
  • If the model is correct then \(\widehat{\sigma}^2\) should be an unbiased estimate of \(\sigma^2\).
  • If we have a model which is not complex enough to fit the data or simply takes the wrong form, then \(\widehat{\sigma}^2\) will overestimate \(\sigma^2\).
  • Alternatively, if our model is too complex and over-fits the data, then \(\widehat{\sigma}^2\) will be an underestimate.
  • This suggests a possible testing procedure - we should compare \(\widehat{\sigma}^2\) to \(\sigma^2\). There are two cases - one where \(\sigma^2\) is known and one where it is not.

Model testing for \(\sigma^2\) known and \(\sigma^2\) unknown

\(\sigma^2\) known

  • \(\sigma^2\) may be known from past experience, from knowledge of the measurement error inherent in an instrument, or by definition.
  • Recall that \[ \frac{\widehat{\sigma}^2}{\sigma^2} \sim \frac{\chi^2_{n-p}}{(n-p)}. \]
  • This leads to the following test: conclude there is a lack of fit if \[ \frac{(n-p)\,\widehat{\sigma}^2}{\sigma^2} > \chi^2_{1-\alpha, n-p}. \]
  • If a lack of fit is found, then a new model is needed.

Testing a linear model using R

Example: Strong interaction experiment data

  • Recall the Strong interaction experiment example.
  • Here we want to test if a linear model is adequate.
  • In this example, we know the variance almost exactly because each response value is the average of a large number of observations.
  • Because of the way the weights are defined, \(w_i = 1/Var(y_i)\), the known variance is implicitly equal to one.
  • Note that there is nothing special about one - we could define \(w_i = 99/Var(y_i)\) and the variance would be implicitly 99.
  • However, we would get essentially the same result as the following analysis.

Testing a linear model using R

library(faraway)
data(strongx)
WLS.fit <- lm(crossx ~ energy, weights=sd^-2, strongx)

Testing a linear model using R

summary(WLS.fit)
## 
## Call:
## lm(formula = crossx ~ energy, data = strongx, weights = sd^-2)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3230 -0.8842  0.0000  1.3900  2.3353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  148.473      8.079   18.38 7.91e-08 ***
## energy       530.835     47.550   11.16 3.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.657 on 8 degrees of freedom
## Multiple R-squared:  0.9397, Adjusted R-squared:  0.9321 
## F-statistic: 124.6 on 1 and 8 DF,  p-value: 3.71e-06

Plotting the fit

  • Examine the \(R^2\) - do you think the model is a good fit?
  • Now plot the data and the fitted regression line (shown as a solid line on the following figure).
  • The R code for the plot:
plot(strongx$energy, strongx$crossx,
       xlab = "Energy", ylab = "Crossection")
abline(WLS.fit$coef)

Figure: Linear fit to the physics data

Lack of fit test using R

  • Compute the test statistic and the \(p\)-value:
c <- (1.657)^2*8 #- residual se = 1.657 and df = 8
p.value <- 1 - pchisq(c, 8)
p.value
## [1] 0.004980759
  • We conclude that there is a lack of fit.
  • Just because \(R^2\) is large does not mean that you cannot do better.

Lack of fit test using R

  • Add a quadratic term to the model and test again:
WLS.fit.2 <- lm(crossx ~ energy + I(energy^2),
               weights = sd^-2, strongx)
summary(WLS.fit.2)
## 
## Call:
## lm(formula = crossx ~ energy + I(energy^2), data = strongx, weights = sd^-2)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89928 -0.43508  0.01374  0.37999  1.14238 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  183.8305     6.4591  28.461  1.7e-08 ***
## energy         0.9709    85.3688   0.011 0.991243    
## I(energy^2) 1597.5047   250.5869   6.375 0.000376 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6788 on 7 degrees of freedom
## Multiple R-squared:  0.9911, Adjusted R-squared:  0.9886 
## F-statistic: 391.4 on 2 and 7 DF,  p-value: 6.554e-08

Lack of fit test using R

plot(strongx$energy, strongx$crossx,
       xlab = "Energy", ylab = "Crossection")
abline(WLS.fit$coef)
x <- seq(0.05, 0.35, by = 0.01)
lines(x, WLS.fit.2$coef[1] + WLS.fit.2$coef[2]*x + 
       WLS.fit.2$coef[3]*x^2, lty = 2)

Figure: Linear and quadratic fits to the physics data

Lack of fit test using R

  • Compute the test statistic and the \(p\)-value:
c <- (0.6788)^2*7 #- residual se = 0.6788 and df = 7
p.value <- 1 - pchisq(c, 7)
p.value
## [1] 0.8633989
  • This time we cannot detect a lack of fit.
  • The lines() call above adds the quadratic fit to the plot as a dotted line (lty = 2); this seems clearly more appropriate than the linear model.

\(\sigma^2\) unknown

  • The \(\widehat{\sigma}^2\) that is based on the chosen regression model needs to be compared to some model-free estimate of \(\sigma^2\).
  • We can do this if we have repeated values of \(y\) for one or more fixed values of \(x\).
  • These replicates do need to be truly independent.
  • They cannot just be repeated measurements on the same subject or unit.
  • Such repeated measures would only reveal the within subject variability or the measurement error.
  • We need to know the between subject variability - this reflects the \(\sigma^2\) described in the model.

\(\sigma^2\) unknown

  • Suppose that \(y_{11}, \ldots, y_{1n_1}\) correspond to one set of equal rows of \(X\), and let \(\bar{y}_{1.} = \sum_{i=1}^{n_1}y_{1i}/n_1\), suppose \(y_{21}, \ldots, y_{2n_2}\) correspond to a second set of equal rows and let \(\bar{y}_{2.} = \sum_{i=1}^{n_2}y_{2i}/n_2\), etc. Suppose there are \(c\) such sets with sizes \(n_1, \ldots, n_{c}\) such that \(\sum_{i=1}^{c}n_{i} = n\), the total sample size. Then \[ SS_{Error}= \sum^{c}_{i=1}\sum^{n_{i}}_{j=1}(y_{ij}- \bar{y}_{i.})^2 \] is said to be the sum of squares due to pure error, since the only explanation for a variation among observations which correspond to equal rows of X is random variation.
  • The difference between the observed mean value \(\bar{y}_{i.}\) and the predicted value for the \(i\)th set \(\widehat{y}_{i}\), is an indication of how well the model \({\bf y} = {\bf X}\,\beta + \epsilon\) fits the data.

\(\sigma^2\) unknown

  • We write \[ SS_{\mbox{Lack of fit}}= \sum^{c}_{i=1}n_{i}(\widehat{y}_{i}- \bar{y}_{i.})^2. \]
  • It is easily shown that \[ SS_{Residual}=SS_{Error} + SS_{\mbox{Lack of fit}}. \]
  • \(SS_{\mbox{Lack of fit}}\) is a measure of the adequacy of the suggested model. We can write \[ SS_{Total} =SS_{Regression} + SS_{\mbox{Lack of fit}} + SS_{Error}. \]

The ANOVA table

The ANOVA table is given by

  • Regression: \(p-1\) df, sum of squares \(SS_{Regression}\)
  • Lack of fit: \(c-p\) df, sum of squares \(SS_{\mbox{Lack of fit}}\)
  • Pure error: \(n-c\) df, sum of squares \(SS_{Error}\)
  • Total: \(n-1\) df, sum of squares \(SS_{Total}\)

  • A test of the adequacy of the model is therefore based on \[ F = \frac{SS_{\mbox{Lack of fit}}/(c-p)}{SS_{Error}/(n-c)}. \]
  • Compare this to \(F_{\alpha, (c-p, n-c)}\) and reject if the test statistic value is greater.
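  • In R, the same \(F\) test can be carried out by comparing the model under test with a model that has one parameter per distinct value of the predictor, exactly as is done for the corrosion data later in this chapter. A generic sketch (hypothetical data frame dat with response y and a predictor x containing replicates):
fit.model <- lm(y ~ x, data = dat)           # the model under test
fit.pure  <- lm(y ~ factor(x), data = dat)   # one mean per distinct x value
anova(fit.model, fit.pure)                   # F = [SS_lack of fit/(c-p)] / [SS_error/(n-c)]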

The ANOVA table

  • If the lack of fit test is significant, one will have to rethink the model: more terms may have to be included, the variables may have to be transformed or one may have to design a completely new experiment.
  • In many practical applications of regression analysis the straight line seems to be inadequate to deal with the complexity of the problem.
  • This problem can be solved by fitting other forms of models, e.g. nonlinear or polynomial regression.

Fitting a model using R

Data: Corrosion loss in Cu-Ni alloys

The data for this example consist of thirteen specimens of 90/10 Cu-Ni alloys with varying iron content in percent. The specimens were submerged in sea water for 60 days and the weight loss due to corrosion was recorded in units of milligrams per square decimeter per day. The data come from Draper and Smith (1998) and contain the following columns:

  • Fe: Iron content in percent
  • loss: Weight loss in mg per square decimeter per day

Fitting a model using R

library(faraway)
data(corrosion)
corrosion
##      Fe  loss
## 1  0.01 127.6
## 2  0.48 124.0
## 3  0.71 110.8
## 4  0.95 103.9
## 5  1.19 101.5
## 6  0.01 130.1
## 7  0.48 122.0
## 8  1.44  92.3
## 9  0.71 113.1
## 10 1.96  83.7
## 11 0.01 128.0
## 12 1.44  91.4
## 13 1.96  86.2

Fitting a model using R

data(corrosion)
Lack.fit.1 <- lm(loss ~ Fe, data = corrosion)
summary(Lack.fit.1)
## 
## Call:
## lm(formula = loss ~ Fe, data = corrosion)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7980 -1.9464  0.2971  0.9924  5.7429 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  129.787      1.403   92.52  < 2e-16 ***
## Fe           -24.020      1.280  -18.77 1.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.058 on 11 degrees of freedom
## Multiple R-squared:  0.9697, Adjusted R-squared:  0.967 
## F-statistic: 352.3 on 1 and 11 DF,  p-value: 1.055e-09

The plot of the fitted model

plot(corrosion$Fe, corrosion$loss, xlab = "Iron content", ylab = "Weight loss")
abline(Lack.fit.1$coef)

Figure: Linear fit to the Cu-Ni corrosion data.

Fitting a model using R

  • We have an \(R^2\) of 97% and an apparently good fit to the data.
  • We now fit a model that reserves a parameter for each group of data with the same value of \(x\).
  • This is accomplished by declaring the predictor to be a factor.
  • The fitted values are the means in each group; these are added to the plot above using the points() function:

The plot of the fitted model

plot(corrosion$Fe, corrosion$loss, xlab = "Iron content", ylab = "Weight loss")
abline(Lack.fit.1$coef)
Lack.fit.2 <- lm(loss ~ factor(Fe), data = corrosion)
points(corrosion$Fe, Lack.fit.2$fit, pch = 18)

Figure: Linear fit to the Cu-Ni corrosion data. Group means denoted by black diamonds.

Comparison of models

  • We can now compare the two models in the usual way:
anova(Lack.fit.1, Lack.fit.2)
## Analysis of Variance Table
## 
## Model 1: loss ~ Fe
## Model 2: loss ~ factor(Fe)
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
## 1     11 102.850                                
## 2      6  11.782  5    91.069 9.2756 0.008623 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output

  • The low p-value indicates that we must conclude that there is a lack of fit.
  • The reason is that the pure error \(sd = \sqrt{11.8/6}\) is substantially less than the regression standard error of 3.06.
  • We might investigate models other than a straight line although no obvious alternative is suggested by the plot.
  • Before considering other models, we would first find out whether the replicates are genuine; perhaps the low pure error SD can be explained by some correlation in the measurements.
  • Another possible explanation is that an unmeasured third variable is causing the lack of fit.

The output

  • When there are replicates, it is impossible to get a perfect fit.
  • Even when there is a parameter assigned to each group of \(x\)-values, the residual sum of squares will not be zero.
  • For the factor model above, the \(R^2\) is 99.7%.
  • So even this saturated model does not attain a 100% value for \(R^2\).
  • For these data, it’s a small difference but in other cases, the difference can be substantial.
  • In these cases, one should realize that the maximum \(R^2\) that may be attained might be substantially less than 100%, and perceptions of what constitutes a good value for \(R^2\) should be downgraded appropriately.

Testing lack of fit

  • These methods are good for detecting lack of fit, but if the null hypothesis is not rejected, we cannot conclude that we have the true model.
  • After all, it may be that we just did not have enough data to detect the inadequacies of the model.
  • All we can say is that the model is not contradicted by the data.
  • When there are no replicates, it may be possible to group the responses for similar \(x\) but this is not straightforward.
  • It is also possible to detect lack of fit by less formal, graphical methods.
  • A more general question is how good a fit do you really want?

Testing lack of fit

  • By increasing the complexity of the model, it is possible to fit the data more closely.
  • By using as many parameters as data points, we can fit the data exactly.
  • However, very little is achieved by doing this since we learn nothing beyond the data itself and any predictions made using such a model will tend to have very high variance.
  • The question of how complex a model to fit is difficult and fundamental.

Testing lack of fit

  • For example, we can fit the mean responses for the example above exactly using a sixth order polynomial:
Lack.fit.3 <- lm(loss ~ Fe + I(Fe^2) + I(Fe^3)+I(Fe^4) +
                   I(Fe^5) + I(Fe^6), corrosion)
Lack.fit.3
## 
## Call:
## lm(formula = loss ~ Fe + I(Fe^2) + I(Fe^3) + I(Fe^4) + I(Fe^5) + 
##     I(Fe^6), data = corrosion)
## 
## Coefficients:
## (Intercept)           Fe      I(Fe^2)      I(Fe^3)      I(Fe^4)      I(Fe^5)  
##       129.3        -76.1        552.2      -1540.8       1784.6       -919.8  
##     I(Fe^6)  
##       173.8
  • Now let us look at the fit:

Polynomial fit

plot(loss ~ Fe, data = corrosion, ylim = c(60,130))
points(corrosion$Fe, Lack.fit.2$fit, pch = 18)
grid <- seq(0,2, len = 50)
lines(grid,predict(Lack.fit.3, data.frame(Fe=grid)))

Figure: Polynomial fit to the corrosion data

The output

  • As shown in the above figure, the fit of this model is excellent, for example:
summary(Lack.fit.3)$r.squared
## [1] 0.9965314
  • There is no plausible reason corrosion loss should suddenly drop at 1.7 and thereafter increase rapidly.
  • This is a consequence of overfitting the data.
  • This illustrates the need not to become too focused on measures of fit like \(R^2\).

Chapter Five: Diagnostics

Introduction

  • Regression model building is often an iterative and interactive process.
  • The first model we try may prove to be inadequate.
  • Regression diagnostics are used to detect problems with the model and suggest improvements.

Residuals and Leverage

Leverage

  • Recall from the multiple regression chapter that \[ \widehat{\bf y} ={\bf X} \widehat{\beta} ={\bf X}({\bf X}'{\bf X})^{-1} {\bf X}'{\bf y} = {\bf H}{\bf y} \] where \({\bf H}\) is the hat-matrix.
  • Now \[ \widehat{\epsilon} = {\bf y}-\widehat{\bf y} = ({\bf I} - {\bf H})\,{\bf y}=({\bf I} - {\bf H})\,{\bf X}\,\beta + ({\bf I} - {\bf H})\,\epsilon = ({\bf I} - {\bf H})\,\epsilon \]
  • So \(Var(\widehat{\epsilon}) = Var[({\bf I} - {\bf H})\,\epsilon] = ({\bf I}-{\bf H})\,\sigma^2\) assuming that \(Var(\epsilon) = \sigma^2\,{\bf I}\).
  • We see that although the errors may have equal variance and be uncorrelated the residuals do not.
  • The (\(i,i\)) entries \(h_{ii}\) of the hat matrix \({\bf H}\) are called leverages and are useful diagnostics.

Leverage

  • \(h_{ii}\) is a measure of the distance between the \(X\) values for case \(i\) and the means of the \(X\) values for all cases.
  • Large \(h_{ii}\) implies that case \(i\) is distant from the center of all \(X\) observations.
  • We see that \(Var(\widehat{\varepsilon}_i) = \sigma^2\,(1-h_{ii})\), so that a large leverage \(h_{ii}\) will make \(Var(\widehat{\varepsilon}_i)\) small.
  • In other words the fit will be “forced” to be close to \(y_i\).
  • The \(h_{ii}\) depend only on \({\bf X}\); however, knowledge of \({\bf y}\) is required for a full interpretation.

Leverage

  • The diagonal elements \(h_{ii}\) of the hat matrix, \(\bf H\), have the following properties
  1. \(0 \leq h_{ii} \leq 1\). \({\bf H}\) is an orthogonal projection, therefore \[ {\bf H} = {\bf H}\,{\bf H} \]

Hence \(h_{ii} = ({\bf H}\,{\bf H})_{ii} = \sum_{j=1}^n h_{ij} h_{ji} = h_{ii}^2 + \sum _{j\neq i} h_{ij} h_{ji}= h_{ii}^2 + \sum_{j\neq i} h_{ij}^2\). This implies \(0 \leq h_{ii} \leq 1\).

  2. If \(p\) is the number of parameters in the model, then \(\sum_{i=1}^n h_{ii}=p\). Because \[ tr({\bf H}) = \sum_{i=1}^n h_{ii} = tr [{\bf X}\,({\bf X}'{\bf X})^{-1}{\bf X}'] = tr[({\bf X}'{\bf X})^{-1} {\bf X}'{\bf X}] = tr ({\bf I}_p) = p. \]

Leverage

  3. Furthermore, \(h_{ii} \ge 1/n\) for all \(i\) when the model contains an intercept.
  • The average value of the \(h_{ii}\) is \(p/n\).

  • Since \(Var(\widehat{{\bf y}}) = Var({\bf H}\,{\bf y}) = {\bf H}\,\sigma^2\), we have \(Var(\widehat{y}_i) = h_{ii}\,\sigma^2\), estimated by \(h_{ii}\,\widehat{\sigma}^2\).

  • A leverage value \(h_{ii}\) is usually considered to be large if it is more than twice as large as the mean leverage, i.e. \[h_{ii}>\frac{2p}{n}\]

  • Large values of \(h_{ii}\) are due to extreme values in \({\bf X}\).

  • Rule of Thumb (for reasonably large data sets):

    • \(0.2\le h_{ii}\le 0.5 \Rightarrow\) moderate leverage

    • \(h_{ii}>0.5 \Rightarrow\) high leverage

Testing leverages using R

Data: Savings rates in 50 countries

First we fit the model and make an index plot of the residuals:

library(faraway)
data(savings)
MLR.fit <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)

Scatter plot

plot(MLR.fit$res, ylab = "Residuals",
           main = "Index plot of residuals")

Figure: Index plot of residuals for the savings data

The output

  • We can find which countries correspond to the largest and smallest residuals using the code
sort(MLR.fit$res)[c(1,50)]
##     Chile    Zambia 
## -8.242231  9.750914

Testing leverages using R

  • Or by using the identify() function. We first make up a character vector of the country names using row.names(), which gets the row names from the data frame.
countries <- row.names(savings)
identify(1:50, MLR.fit$res, countries)
  • Click on the left mouse button next to the points you are interested in to identify them.
  • When you are done, click on the middle (if not available, the right) mouse button.
  • You will identify Chile and Zambia on the plot as the countries with the smallest and largest residuals, respectively.

Testing leverages using R

  • Now consider the leverage.
  • First extract the \({\bf X}\)-matrix using model.matrix() and then compute and plot the leverages (also called “hat” values):
x <- model.matrix(MLR.fit)
lev <- hat(x)
lev
##  [1] 0.06771343 0.12038393 0.08748248 0.08947114 0.06955944 0.15840239
##  [7] 0.03729796 0.07795899 0.05730171 0.07546780 0.06271782 0.06372651
## [13] 0.09204246 0.13620478 0.08735739 0.09662073 0.06049212 0.06008079
## [19] 0.07049590 0.07145213 0.21223634 0.06651170 0.22330989 0.06079915
## [25] 0.08634787 0.07940290 0.04793213 0.09061400 0.05421789 0.05035056
## [31] 0.03897459 0.06937188 0.06504891 0.06425415 0.09714946 0.06510405
## [37] 0.16080923 0.07732854 0.12398898 0.07359423 0.03964224 0.07456729
## [43] 0.11651375 0.33368800 0.08628365 0.06433163 0.14076016 0.09794717
## [49] 0.53145676 0.06523300

Index plot of leverages

plot(lev, ylab = "Leverages",
          main="Index plot of Leverages")
abline(h=2*5/50)

sum(lev)
## [1] 5

Testing leverages using R

  • Notice that the sum of the leverages is equal to \(p\) which is 5 for this data.
  • Which countries have large leverage?
  • Note that the horizontal line on the leverage plot is drawn at \(2\,p/n = 0.2\) to indicate the “rule of thumb”.
  • We can see which countries exceed this rather arbitrary cut-off:
names(lev) <- countries
lev[lev > 0.2]

##       Ireland         Japan United States         Libya 
##     0.2122363     0.2233099     0.3336880     0.5314568
  • The names() command assigns the country names to the elements of the lev vector, making it easier to identify them.

The output

  • Alternatively, we can do it interactively like this
identify(1:50,lev,countries)
  • The two countries with the highest leverages are Libya and the United States.

Studentized residuals

  • Recall that
    • \(Var(\widehat{y}_i) = \sigma^2\, h_{ii}\)
    • \(Var(\widehat{\varepsilon}_i) = \sigma^2\,(1-h_{ii})\)
    • \(Cov(\widehat{\varepsilon}_i,\widehat{\varepsilon}_j)= -\sigma^2\, h_{ij}\) for \(i \ne j\)
    • The quantities \[ r_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma}\,\sqrt{1 - h_{ii}}} \] are called studentized residuals,

where \(0\le h_{ii}\le 1\).

  • If the model assumptions are correct \(Var(r_i) = 1\) and \(Corr(r_i, r_j)\) tends to be small.

Studentized Residuals

  • Studentized residuals are sometimes preferred in residual plots as they have been standardized to have equal variance.
  • Note that studentization can only correct for the natural non-constant variance in residuals when the errors have constant variance.
  • If there is some underlying heteroscedasticity in the errors, studentization cannot correct for it.
  • For the savings data the studentized residuals are calculated using the code:
MLR.fit.s <- summary(MLR.fit)
MLR.fit.s$sig
## [1] 3.802669

Studentized residuals

stud <- MLR.fit$res/(MLR.fit.s$sig*sqrt(1-lev))
stud
##      Australia        Austria        Belgium        Bolivia         Brazil 
##     0.23520105     0.17282943     0.61085760    -0.19245030     0.96858807 
##         Canada          Chile          China       Colombia     Costa Rica 
##    -0.09083873    -2.20907436     0.69453131    -0.39319153     1.40168682 
##        Denmark        Ecuador        Finland         France        Germany 
##     1.46686216    -0.65379142    -0.46394723     0.70042898    -0.04974135 
##         Greece      Guatamala       Honduras        Iceland          India 
##    -0.86217889    -0.91031261     0.19259259    -1.69401854     0.13881900 
##        Ireland          Italy          Japan          Korea     Luxembourg 
##     1.00475012     0.52442520     1.57595468    -1.65713877    -0.45967116 
##          Malta         Norway    Netherlands    New Zealand      Nicaragua 
##     0.81536209    -0.23495632     0.11735008     0.61802723     0.17443311 
##         Panama       Paraguay           Peru    Philippines       Portugal 
##    -0.88366877    -1.66987256     1.77851567     1.81461452    -0.21267488 
##   South Africa South Rhodesia          Spain         Sweden    Switzerland 
##     0.13140922     0.37072635    -0.18374340    -1.19700295     0.67944806 
##         Turkey        Tunisia United Kingdom  United States      Venezuela 
##    -0.71532499    -0.77031393    -0.75327449    -0.35811077     0.99934066 
##         Zambia        Jamaica        Uruguay          Libya       Malaysia 
##     2.65091534    -0.85634746    -0.62681420    -1.08705199    -0.80805950

Plot of studentized residuals

plot(stud, ylab = "Studentized Residuals", main = "Studentized Residuals")

Studentized residuals

  • Notice the range on the axis. Which residuals are large?
  • In this case, there is not much difference between the studentized and raw residuals apart from the scale.
  • Only when there is unusually large leverage will the differences be noticeable.

An outlier test

An outlier test

  • An outlier is a point that does not fit the current model.
  • Outliers, generally, are extreme observations which are far away from the rest of the observations in a given data set.
  • This in turn yields large residuals which can seriously affect the fitted least squares regression function.
  • Thus, it is necessary to identify such atypical observations and decide whether to retain them or eliminate them from the model.
  • An outlier test is useful because it enables us to distinguish between truly unusual points and residuals which are large but not exceptional.

An outlier test

  • The hat matrix we have seen in the previous section also plays an important role in identifying the outlying \(x\) observations.
  • Specifically, its diagonal elements indicate whether there are outlying cases with respect to the values of the predictor variable \(x\) or not.
  • The diagonal element \(h_{ii}\) of the hat matrix is a measure of the distance between the values of the predictor variable for the \(i\)th case and the means of \(x\) values for all \(n\) cases.
  • Hence, a large value of the diagonal element \(h_{ii}\) indicates that the \(i\)th case is distant from the center of all observations of variable \(x\).

An outlier test

  • We can use a regression deletion diagnostic, the jackknife residual, to identify an outlier. Here is how it works:
  • We exclude point \(i\) and recompute the estimates to get \(\widehat{\beta}_{(i)}\) and \(\widehat{\sigma}^2_{(i)}\) where \((i)\) denotes that the \(i\)th case has been excluded. Hence \[ \widehat{y}_{(i)} = {\bf x}_i'\,\widehat{\beta}_{(i)}. \]
  • If \(\widehat{y}_{(i)} - y_i\) is large then case \(i\) is an outlier.
  • Just looking at \(\widehat{\varepsilon}_i\) misses those nasty points which pull the regression line so close to them that they conceal their true status.
  • How large is large?

An outlier test

  • Note that \(\widehat{y}_{(i)}\) and \(y_i\) are independent. Therefore \[ Var(\widehat{y}_{(i)} - y_i) = \sigma^2\,(1 + {\bf x}_i'\,({\bf X}_{(i)}'{\bf X}_{(i)})^{-1}\,{\bf x}_i) \] and so \[ \widehat{Var(\widehat{y}_{(i)} - y_i)} = \widehat{\sigma}^2_{(i)}\,(1 + {\bf x}_i'\,({\bf X}_{(i)}'{\bf X}_{(i)})^{-1}\,{\bf x}_i). \]
  • Define the jackknife (or externally studentized or crossvalidated) residuals as \[ t_i = \frac{y_i - \widehat{y}_{(i)}}{\widehat{\sigma}_{(i)}\,(1 + {\bf x}_i'\,({\bf X}_{(i)}'{\bf X}_{(i)})^{-1}\,{\bf x}_i)^{1/2}} \] which are distributed \(t_{n-p-1}\) if the model is correct and \(\epsilon\sim \mathcal{N}({\bf 0}, \sigma^2\,{\bf I})\).

An outlier test

  • An easy way to compute \(t_i\) is \[ t_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma}_{(i)}\sqrt{1 - h_{ii}}} = r_i\,\left(\frac{n-p-1}{n-p-r_i^2}\right )^{1/2}, \] which avoids doing \(n\) regressions.
  • Since \(t_i \sim t_{n-p-1}\), we can calculate a \(p\)-value to test whether case \(i\) is an outlier.
  • However, we are likely to want to test all cases, so we must adjust the level of the test accordingly.
  • Even though it might seem that we only test one or two large \(t_i\)'s, by identifying them as large we are implicitly testing all cases.
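  • As a quick check (a sketch reusing the objects stud and MLR.fit computed above for the savings data), the shortcut formula agrees with R's built-in rstudent() function:
n <- nrow(savings); p <- 5
t.manual <- stud*sqrt((n - p - 1)/(n - p - stud^2))
all.equal(unname(t.manual), unname(rstudent(MLR.fit)))   # TRUE, up to rounding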

An outlier test

  • Suppose we want a level \(\alpha\) test. Now \[ \begin{array}{cc} P(\mbox{all tests accept}) & = 1 - P(\mbox{At least one rejects}) \\ & \geq 1 - \sum_iP(\mbox{Test}~ i ~\mbox{rejects}) = 1 - n\,\alpha. \end{array} \]
  • So this suggests that if an overall level \(\alpha\) test is required then a level \(\alpha/n\) should be used in each of the tests.
  • This method is called the Bonferroni correction and is used in contexts other than outliers as well.
  • It’s biggest drawback is that it is conservative and it finds fewer outliers than the nominal level of confidence would dictate. The larger that \(n\) is, the more conservative it gets.

Testing outliers using R

Data: Savings rates in 50 countries

Now get the jackknife residuals for the savings data:

jack <- rstudent(MLR.fit)
plot(jack, ylab = "Jacknife Residuals", main = "Jacknife Residuals")

Figure: Jackknife residuals for the savings data

The output

jack[abs(jack)==max(abs(jack))]
##   Zambia 
## 2.853558
  • The largest residual of 2.85 is pretty big for a standard normal scale but is it an outlier?

  • Compute the Bonferroni critical value:

BC.value <- qt(0.05/(2*50), 44)
BC.value
## [1] -3.525801
p.value <- 1-pt(abs(BC.value), 44)
p.value
## [1] 5e-04
  • What do you conclude?

Notes

  1. Two or more outliers next to each other can hide each other.
  2. An outlier in one model may not be an outlier in another when the variables have been changed or transformed. You will usually need to reinvestigate the question of outliers when you change the model.
  3. The error distribution may not be normal and so larger residuals may be expected.

Notes

  4. Individual outliers are usually much less of a problem in larger datasets.

    • A single point won’t have the leverage to affect the fit very much.
    • It’s still worth identifying outliers if these type of points are worth knowing about in the particular application.
    • For large datasets, we need only worry about clusters of outliers. Such clusters are less likely to occur by chance and more likely to represent actual structure. Finding these cluster is not always easy.

What should be done about outliers?

  1. Check for a data entry error first. These are relatively common. Unfortunately, the original source of the data may have been lost.
  2. Examine the physical context - why did it happen? Sometimes, the discovery of an outlier may be of singular interest. Some scientific discoveries spring from noticing unexpected aberrations.

What should be done about outliers?

  3. Exclude the point from the analysis but try re-including it later if the model is changed.

    • The exclusion of one or more points may make the difference between getting a statistically significant result and having some unpublishable research.
    • This can lead to difficult decisions about which exclusions are reasonable.
    • To avoid any suggestion of dishonesty, always report the existence of outliers even if you do not include them in your final model.

Testing outliers using R

Data: Star temperatures and light intensities

Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1, which is in the direction of Cygnus.

data(star)
plot(star$temp, star$light,
   xlab = "log(Temperature)",
   ylab = "log(Light Intensity)")

Testing outliers using R

  • What do you think the relationship is between temperature and light intensity?
  • Now fit a linear regression and add the fitted line to the plot
Star.fit.1 <- lm(light~temp, data=star)
Star.fit.1
## 
## Call:
## lm(formula = light ~ temp, data = star)
## 
## Coefficients:
## (Intercept)         temp  
##      6.7935      -0.4133

Testing outliers using R

  • The plot is shown in the following figure with the regression line in solid type.
data(star)
plot(star$temp, star$light,
   xlab = "log(Temperature)",
   ylab = "log(Light Intensity)")
Star.fit.1 <- lm(light~temp, data=star)
abline(Star.fit.1)

Figure: Solid regression line including the four leftmost points

Testing outliers using R

  • Is this what you expected?
  • Are there any outliers in the data?
range(rstudent(Star.fit.1))
## [1] -2.049393  1.905847
  • We need not bother to actually compute the critical value since these values are clearly not large enough.
  • The outlier test does not reveal any outliers (we will see this in the next section).
  • The four stars on the upper left of the plot are giants.
  • Let us see what happens if these are excluded:

Testing outliers using R

Star.fit.2 <- lm(light ~ temp, data = star,
                 subset = (temp > 3.6))
Star.fit.2
## 
## Call:
## lm(formula = light ~ temp, data = star, subset = (temp > 3.6))
## 
## Coefficients:
## (Intercept)         temp  
##      -4.057        2.047

Testing outliers using R

plot(star$light~star$temp, data=star)
Star.fit.2 <- lm(light ~ temp, data = star,
                 subset = (temp > 3.6))
abline(Star.fit.2$coef, lty = 2)

Figure: Dotted regression line excluding the four leftmost points

Testing outliers using R

  • Observe from the previous two figures that the two fitted lines differ even in the direction of the relationship.
  • This illustrates the problem of multiple outliers.
  • We can visualize the problems here, but for higher dimensional data this is much more difficult.

Influential observations

Identifying influential observations

  • After identifying cases that are outlying with respect to their \(Y\) and/or \(X\) values, the next step is to ascertain whether or not these outlying cases are influential.
  • A case is considered to be influential if its exclusion causes major changes in the fitted regression function.
  • An influential point may or may not be an outlier and may or may not have large leverage but it will tend to have at least one of those two properties.
  • Sometimes a few elements of a data set exert a disproportionate influence on the model coefficients; that is, the parameter estimates may depend more on these few influential cases than on the many remaining cases.

Identifying influential observations

  • These influential observations might turn out to be erroneous or unimportant, in which case they should be eliminated, or there might be a reasonable justification for keeping them in the model.
  • However, once the identification of outlying observations has taken place, the next step is to check whether or not these outlying observations have impact on the estimated model.
  • In making such evaluations, we consider cases to be influential if their removal causes a major change in the fitted regression model; see the example above.

Identifying influential observations

1. Influence on single fitted value (DFFIT)

  • Difference between the fitted values (DFFITS) is a scaled measure of the change in the fitted value \(\hat{y}_{i}\) for the \(i\)-th observation and is calculated by deleting the \(i\)-th observation.
  • The precise concepts are defined as follows.
    • We say that the \(i\)th trial is influential if (a standardized version of) \[ \widehat{y}_i - \widehat{y}_{i(i)}, \] where \(\widehat{y}_i\) is the fitted value based on a regression fit with all observations and \(\widehat{y}_{i(i)}\) is the fitted value for the \(i\)th case based on a regression fit with the \(i\)th case deleted, is large; or

Identifying influential observations

Influence on single fitted value (DFFIT)

  • if (a standardized version of) \[ \widehat{\beta} - \widehat{\beta}_{(i)}, \] where \(\widehat{\beta}\) is the vector of estimated regression coefficients based on a regression fit with all observations and \(\widehat{\beta}_{(i)}\) is the corresponding estimate based on a regression fit with the \(i\)th case deleted, is large.
  • Note that \[ Var(\widehat{y}_i) = Var\left( \sum_{j=1}^n h_{ij} y_j \right) = \sigma^2 \,\sum_{j=1}^n h_{ij}^2 = \sigma^2 h_{ii}, \] so confidence limits for \(E(y_i)\) can be obtained from the statistic below, with \(Var(\widehat{y}_i)\) estimated by \(h_{ii}\,\widehat{\sigma}^2\):

Identifying influential observations

Influence on single fitted value (DFFIT)

\[ \frac{\widehat{y}_i - E(y_i)}{s(\widehat{y}_i)} = \frac{\widehat{y}_i - E(y_i)}{\sqrt{\widehat{\sigma}^2\, h_{ii}}} \]

  • To measure the influence of the \(i\)th observation on the \(i\)th fitted value we look at the following quantity, which is inspired by the above expression \[ (DFFITS)_i = \frac{\widehat{y}_i - \widehat{y}_{i(i)}}{\sqrt{h_{ii}\widehat{\sigma}^2_{(i)}}}. \]
  • One can prove that \[ (DFFITS)_i = t_i\left( \frac{h_{ii}}{1-h_{ii}}\right)^{1/2} \] measures how much the estimated/fitted value, \(\hat y_i\), changes if the \(i^{th}\) case is excluded.

Identifying influential observations

Influence on single fitted value (DFFIT)

  • In the second formula, \(t_i\) carries information about outlying \(y\)-observations, whereas the leverage factor \(h_{ii}/(1-h_{ii})\) carries information about outlying \(x\)-observations.

  • The second formula shows that \((DFFITS)_i\) can be obtained without fitting models with the \(i\)th observation deleted.

  • Observations with a large \(|DFFITS|\) value are considered influential points.

  • Rule of Thumb: observations with

    \(|DFFITS_i| > 1\) for small to medium size datasets, or

    \(|DFFITS_i| > 2\sqrt{p/n}\) for large datasets,

are considered influential.
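  • In practice DFFITS can be computed with the built-in dffits() function; a short sketch for the savings fit MLR.fit from earlier, with \(p = 5\) and \(n = 50\):
dff <- dffits(MLR.fit)
dff[abs(dff) > 2*sqrt(5/50)]    # countries exceeding the 2*sqrt(p/n) cut-off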

Identifying influential observations

2. Influence on all fitted values (Cook’s distance measure)

  • The two measures of influence \(\widehat{y}_i - \widehat{y}_{i(i)}\) and \(\widehat{\beta} - \widehat{\beta}_{(i)}\), are hard to judge in the sense that the scale varies between datasets.
  • A popular alternative is Cook's distance measure. It is defined by

\[ \begin{array}{ccc} D_i & = \frac{(\widehat{\beta} -\widehat{\beta}_{(i)})'({\bf X}'{\bf X}) (\widehat{\beta} - \widehat{\beta}_{(i)})}{p\, \widehat{\sigma}^2} &= \frac{(\widehat{y} -\widehat{y}_{(i)})' (\widehat{y}-\widehat{y}_{(i)})}{p\, \widehat{\sigma}^2}\\ & = \frac{\sum_{j=1}^n (\widehat{y}_j - \widehat{y}_{j(i)})^2}{p\, \widehat{\sigma}^2} & = \frac{1}{p}\,r_i^2\,\frac{h_{ii}}{1-h_{ii}}\\ \end{array} \]

  • In the last formula, the first factor \(r_i^2\) is the residual effect and the second, \(h_{ii}/(1-h_{ii})\), is the leverage. The combination of the two leads to influence.

Identifying influential observations

Cook’s distance measure

  • Therefore, the Cook’s distance measure is an aggregate influence measure showing the effect of the \(i\)th observation on all \(n\) fitted values.
  • Observations with a large \(D_i\) value are considered influential points.
  • Cook's distance is similar to the DFFITS measure, which considers the influence of the \(i\)th observation on the single fitted value \(\widehat{y}_{i}\), except that with Cook's distance the influences on all fitted values are combined into an aggregate effect.
  • For calibration, \(D_i\) is compared to the quantiles of an \(F\)-distribution with \(p\) and \(n-p\) degrees of freedom.

Identifying influential observations

Cook’s distance measure

The rule of thumb:

\(D_i < F_{0.1;p,n-p} \Rightarrow\) non-influential

or \(D_i < F_{0.2;p,n-p}\Rightarrow\) non-influential

\(D_i > F_{0.5;p,n-p} \Rightarrow\) influential

Testing for influential observations using R

  • An index plot of \(D_i\) can be used to identify influential points.
MLR.fit <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
cook <- cooks.distance(MLR.fit)
plot(cook, ylab = "Cooks distances")
countries <- row.names(savings)
identify(1:50, cook, countries)

## integer(0)

Figure: Cook Statistics for the savings data

Testing for influential observations using R

  • Based on the Cook statistics we have identified the largest three values.
  • We now exclude the largest one and see how the fit changes.

Testing for influential observations using R

MLR.fit.exc <- lm(sr ~ pop15+pop75+dpi+ddpi,savings,
            subset=(cook < max(cook)))
summary(MLR.fit.exc)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings, 
##     subset = (cook < max(cook)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0699 -2.5408 -0.1584  2.0934  9.3732 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.5240460  8.2240263   2.982  0.00465 **
## pop15       -0.3914401  0.1579095  -2.479  0.01708 * 
## pop75       -1.2808669  1.1451821  -1.118  0.26943   
## dpi         -0.0003189  0.0009293  -0.343  0.73312   
## ddpi         0.6102790  0.2687784   2.271  0.02812 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.795 on 44 degrees of freedom
## Multiple R-squared:  0.3554, Adjusted R-squared:  0.2968 
## F-statistic: 6.065 on 4 and 44 DF,  p-value: 0.0005617

Testing for influential observations using R

Compared to the full data fit:

MLR.fit <- lm(sr ~ pop15+pop75+dpi+ddpi,savings)
summary(MLR.fit)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904

Identifying influential observations

  • What changed?
  • The coefficient for ddpi changed by about 50%.

Identifying influential observations

3. DFBETAS

  • DFBETAS are the scaled measures of the change in each parameter (‘beta’ coefficients) calculated by deleting each observation, one at a time.
  • For coefficient \(k\) and observation \(i\), DFBETAS is defined as

\[ (DFBETAS)_{k(i)} = \frac{\widehat{\beta}_k - \widehat{\beta}_{k(i)}}{s(\widehat{\beta}_k)} = \frac{\widehat{\beta}_k - \widehat{\beta}_{k(i)}}{\sqrt{c_{kk}\,\widehat{\sigma}^2_{(i)}}},~~~k =0, \ldots, p-1 \] where \(c_{kk}\) is the \(k\)th diagonal element of \(({\bf X}'{\bf X})^{-1}\).

  • This statistic indicates how much the regression coefficient changes, in standard deviation units, if the \(i\)th observation is deleted.

Identifying influential observations

DFBETAS

  • That is, it measures the influence of observation \(i\) on the estimate of the parameters.
  • Its values should be referred to a \(t\)-distribution.
  • The guideline is that values larger than 1 (for small to medium size datasets) or larger than \(2/\sqrt{n}\) (for large datasets) are considered influential.
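  • As a short sketch, DFBETAS for the savings fit (MLR.fit from earlier) can be computed with the built-in dfbetas() function:
dfb <- dfbetas(MLR.fit)
round(head(dfb), 3)       # one row per country, one column per coefficient
apply(abs(dfb), 2, max)   # largest absolute DFBETAS for each coefficient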

Testing for influential observations using R

  • It would be rather tedious to refit the model leaving out each country in turn to see how the coefficients change, but lm.influence() provides a quicker way:
MLR.fit.inf <- lm.influence(MLR.fit)
plot(MLR.fit.inf$coef[,2], MLR.fit.inf$coef[,3],
    xlab = "Change in pop15 coef",
    ylab = "Change in pop75 coef")
identify(MLR.fit.inf$coef[,2], MLR.fit.inf$coef[,3],
    countries)

## integer(0)

Figure: Plot of the change in the second and third parameter estimates when a case is left out

Testing for influential observations using R

  • We just plotted the change in the second and third parameter estimates when a case is left out as seen in the above Figure.
  • Try this for the other estimates - which countries stick out? For example consider Japan.

Testing for influential observations using R

MLR.fit.J <- lm(sr ~ pop15+pop75+dpi+ddpi,savings,
           subset=(countries != "Japan"))
summary(MLR.fit.J)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings, 
##     subset = (countries != "Japan"))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.997 -2.592 -0.115  2.032 10.157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 23.9401714  7.7839968   3.076  0.00361 **
## pop15       -0.3679015  0.1536296  -2.395  0.02096 * 
## pop75       -0.9736743  1.1554502  -0.843  0.40397   
## dpi         -0.0004706  0.0009191  -0.512  0.61116   
## ddpi         0.3347486  0.1984457   1.687  0.09871 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.738 on 44 degrees of freedom
## Multiple R-squared:  0.277,  Adjusted R-squared:  0.2113 
## F-statistic: 4.214 on 4 and 44 DF,  p-value: 0.005649

Testing for influential observations using R

  • Compare to the full data fit - what qualitative changes do you observe?
  • Notice that the ddpi term is no longer significant and that the \(R^2\) value has decreased a lot.

Identification of outliers using residual plots

Residual plots

  • Outliers and influential points indicate cases that are in some way individually unusual but we also need to check the assumptions of the model.
  • The objective of residual plots is to check the model assumptions.
  • There are several important plots.
    1. Plot \(\widehat{\varepsilon}\) against the predicted values \(\widehat{y}\): we can find either heteroscedasticity (non-constant variance) or nonlinearity, which means we need a transformation of the response or of a predictor. If all is well, you should see constant variance in the vertical (\(\widehat{\varepsilon}\)) direction and the scatter should be symmetric vertically about 0.

Residual plots

  2. Plot \(\widehat{\varepsilon}\) against \(x_i\) (for predictors that are both in and out of the model). Look for the same things, except in the case of plots against predictors not in the model: look for any relationship which might indicate that this predictor should be included.
  3. QQ-plot: plot the residuals versus normal quantile values. Order the residuals \[ \widehat{\varepsilon}_{[1]} \le \widehat{\varepsilon}_{[2]} \le \ldots \le \widehat{\varepsilon}_{[n]}, \] let \(u_i = \Phi^{-1}\left[i/(n+1)\right]\), and plot \(\widehat{\varepsilon}_{[i]}\) versus \(u_i\). There is no problem if the plot shows a straight line.
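  • For instance, a normal QQ-plot of the residuals from the savings fit (MLR.fit from earlier) can be produced as follows; a roughly straight line supports the normality assumption:
qqnorm(residuals(MLR.fit), ylab = "Residuals")
qqline(residuals(MLR.fit))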

Residual plots

  4. Partial residual plot: for predictor \(k\):

    1. Fit a partial regression model for response versus all the other predictors but predictor \(k\) to get residual \(\widehat{\varepsilon}_r\),

    2. Fit a regression model for predictor \(k\) versus all the other predictors to get a residual \(\widehat{\varepsilon}_k\).

    3. Plot \(\widehat{\varepsilon}_r\) versus \(\widehat{\varepsilon}_k\) and look at the curves.

      This helps us see the relationship between predictor \(k\) and the response more clearly.

  5. Plot \(\widehat{\varepsilon}_i\) versus \(\widehat{\varepsilon}_{i+1}\) to look at the independence of the error terms.
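  • A minimal sketch of this last plot for the savings fit (for cross-sectional data such as these it is mainly an illustration):
res <- residuals(MLR.fit)
plot(head(res, -1), tail(res, -1),
     xlab = "Residual i", ylab = "Residual i+1")
abline(h = 0, v = 0)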

Residual plots

Residuals versus fitted plot: Savings rates in 50 countries

  • The following R code produces the residuals versus fitted plot:
plot(MLR.fit$fit, MLR.fit$res,
    xlab = "Fitted", ylab = "Residuals")
abline(h=0)

  • The plot suggests no change to the current model.

Residual plots

abs(residuals) versus fitted plot

plot(MLR.fit$fit, abs(MLR.fit$res),
    xlab="Fitted", ylab="|Residuals|")

Figure: Residual plots for the savings data

  • The plot shows non-constant variance (heteroscedasticity).

Residual plots

Residuals versus predictor plot

par(mfrow=c(1,2))
plot(savings$pop15,residuals(MLR.fit),
  xlab = "Population under 15",
  ylab = "Residuals")
plot(savings$pop75,residuals(MLR.fit),
  xlab = "Population over 75",
  ylab = "Residuals")

Figure: Residual plots for the savings data

Residual plots

  • What do you see from the above figure?

Comparing two groups

  • Can you see the two groups in the first plot of the above figure?
  • Let’s compare and test the variances.
  • Given two independent samples from normal distributions, we can test for equal variances using the ratio of the two sample variances as the test statistic.
  • The null distribution is an \(F\)-distribution with degrees of freedom determined by the two sample sizes.
  • The R code to compare the two groups seen in the plot above, and the results:

Residual plots

Comparing two groups

var(MLR.fit$res[savings$pop15 > 35])/
          var(MLR.fit$res[savings$pop15 <35])
## [1] 2.785067
table(savings$pop15 > 35)
## 
## FALSE  TRUE 
##    27    23
p.value <- 1-pf(2.7851,22,26)
p.value
## [1] 0.006787451
  • There is a significant difference between the variances of the two groups.
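  • The same comparison can be made with the built-in var.test() function (a sketch; note that var.test() reports a two-sided \(p\)-value by default):
var.test(MLR.fit$res[savings$pop15 > 35],
         MLR.fit$res[savings$pop15 < 35])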

Residual plots

  • A quick way to check for non-constant variance is this regression of the absolute residuals on the fitted values:
summary(lm(abs(MLR.fit$res) ~ MLR.fit$fit))
## 
## Call:
## lm(formula = abs(MLR.fit$res) ~ MLR.fit$fit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8395 -1.6078 -0.3493  0.6625  6.7036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.8398     1.1865   4.079  0.00017 ***
## MLR.fit$fit  -0.2035     0.1185  -1.717  0.09250 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.163 on 48 degrees of freedom
## Multiple R-squared:  0.05784,    Adjusted R-squared:  0.03821 
## F-statistic: 2.947 on 1 and 48 DF,  p-value: 0.0925

Residual plots

  • This test is not quite right as some weighting should be used and the degrees of freedom should be adjusted but there doesn’t seem to be a clear problem with non-constant variance.

Non-constant variance

  • There are two approaches to dealing with non-constant variance.
  • Weighted least squares is appropriate when the form of the non-constant variance is either known exactly or there is some known parametric form.
  • Alternatively, one can transform \(y\) to \(h(y)\) where \(h()\) is chosen so that \(Var(h(y))\) is constant.
  • To see how to choose \(h()\) consider this \[ \begin{array}{ccl} h(y) & = & h(E(y)) + (y - E(y))\,h'(E(y)) + \ldots \\ Var(h(y)) & = & h'(E(y))^2\,Var(y) + \ldots \end{array} \]

Non-constant variance

  • We ignore the higher order terms. For \(Var(h(y))\) to be constant we need \[ h'(E(y)) \propto (Var(y))^{-1/2} \]
  • which suggests \[ h(y) = \int \frac{d\,y}{\sqrt{Var(y)}} =\int\frac{d\,y}{SD(y)}. \]
  • For example, if \(Var(y) = Var(\varepsilon) \propto (E(y))^2\) then \(h(y) = \log y\) is suggested, while if \(Var(\varepsilon) \propto E(y)\) then \(h(y)=\sqrt{y}\) is suggested.
  • Graphically one tends to see \(SD(y)\) rather than \(Var(y)\).
  • Sometimes \(y_i \le 0\) for some \(i\), in which case these transformations may choke. You can try \(\log(y + c)\) for some small constant \(c\), but this makes interpretation difficult.

Scatter plot

  • Consider the residual-fitted plot for the Galapagos data.
MLR.Gala <- lm(Species ~ Area + Elevation + Scruz +
               Nearest + Adjacent, gala)
plot(MLR.Gala$fit, MLR.Gala$res,
      xlab = "Fitted", ylab = "Residuals",
      main = "Untransformed Response")

Figure: Residual-Fitted plots for the Galapagos data before transformation

  • We can see non-constant variance in the plot.

Scatter plot

  • We guess that a square root transformation will give us constant variance.
  • The R code:
MLR.Gala.sq <- lm(sqrt(Species) ~ Area + Elevation +
                 Scruz + Nearest + Adjacent, gala)
plot(MLR.Gala.sq$fit, MLR.Gala.sq$res,
       xlab = "Fitted", ylab = "Residuals",
       main = "Square root Response")

Scatter plot

Figure: Residual-Fitted plots for the Galapagos data after transformation

  • We see in this plot that the variance is now constant.

Non-Constant Variance

  • Our guess at a variance stabilizing transformation worked out here, but had it not, we could always have tried something else.
  • The square root transformation is often appropriate for count response data.
  • The Poisson distribution is a good model for counts and that distribution has the property that the mean is equal to the variance, thus suggesting the square root transformation.
  • It might be even better to use a Poisson regression rather than the normal-based regression we are using here.

Summary

  • There are more formal tests for non-constant variance; for example, one could start by regressing \(|\widehat{\varepsilon}|\) on \(y\) or \(x_i\), but there is a problem in specifying the alternative hypothesis for such a test.
  • A formal test may be good at detecting a particular kind of non-constant variance but have no power to detect another.
  • Residual plots are more versatile because unanticipated problems may be spotted.
  • A formal diagnostic test may have a reassuring aura of exactitude about it, but one needs to understand that any such test may be powerless to detect problems of an unsuspected nature.

Summary

  • Graphical techniques are usually more effective at revealing structure that you may not have suspected.
  • Of course, sometimes the interpretation of the plot may be ambiguous but at least one can be sure that nothing is seriously wrong with the assumptions.
  • For this reason, a graphical approach is preferable for diagnostics.

Non-linearity

  • Lack of fit tests can be used when there is replication which doesn’t happen too often, but even if you do have it, the tests don’t tell you how to improve the model.
  • How do we check if the systematic part (\(E({\bf y}) = {\bf X}\,\beta\)) of the model is correct?
  • We can look at
    1. Plots of \(\widehat{\varepsilon}\) against \(\widehat{y}\) and \(x_i\)
    2. Plots of \(y\) against each \(x_i\).

but what about the effect of other \(x\) on the \(y\) versus \(x_i\) plot?

  • Partial Regression or Added Variable plots can help isolate the effect of \(x_i\) on \(y\).

Non-linearity

  • To produce the partial regression plots use the following steps:

    1. Regress \(y\) on all \(x\) except \(x_i\), get residuals \(\widehat{\delta}\). This represents \(y\) with the other \(X\)-effect taken out.
    2. Regress \(x_i\) on all \(x\) except \(x_i\), get residuals \(\widehat{\gamma}\). This represents \(x_i\) with the other \(X\)-effect taken out.
    3. Plot \(\widehat{\delta}\) against \(\widehat{\gamma}\).
  • The slope of a line fitted to the plot is \(\widehat{\beta}_i\) which adds some insight into the meaning of regression coefficients.

  • Look for non-linearity and outliers and/or influential points.

Partial residual plots

  • Partial residual plots are a competitor to added variable plots.
  • These plot \(\widehat{\varepsilon} + \widehat{\beta}_i\,x_i\) against \(x_i\).
  • To see where this comes from, look at the response with the predicted effect of the other \(X\) removed:

\[ y - \sum_{j \ne i} {\bf x}_j\,\widehat{\beta}_j = \widehat{y} + \widehat{\varepsilon} - \sum_{j \ne i} {\bf x}_j\,\widehat{\beta}_j = {\bf x}_i\,\widehat{\beta}_i + \widehat{\varepsilon}. \]

  • Again the slope on the plot will be \(\widehat{\beta}_i\) and the interpretation is the same.
  • Partial residual plots are reckoned to be better for non-linearity detection while added variable plots are better for outlier/influential detection.

Partial regression plots using R

Recall the savings dataset. First we construct a partial regression (added variable) plot for pop15:

MLR.fit.d <- lm(sr ~ pop75+dpi+ddpi,savings)$res
res.g <- lm(pop15 ~ pop75 + dpi + ddpi,savings)$res
plot(res.g, MLR.fit.d, xlab = "pop15 residuals",
     ylab = "Saving residuals",
     main = "Partial Regression")

Figure: Partial regression plot for the savings data

Comparing slopes using R

  • Compare the slope on the plot to the original regression and show the line on the above plot. The R code to do this:
part.reg <- lm(MLR.fit.d ~ res.g)
part.reg$coef
##   (Intercept)         res.g 
##  9.907998e-17 -4.611931e-01
MLR.fit$coef
##   (Intercept)         pop15         pop75           dpi          ddpi 
## 28.5660865407 -0.4611931471 -1.6914976767 -0.0003369019  0.4096949279

Partial residual plot

  • Notice how the slope in the plot and the slope for pop15 in the regression fit are the same.
  • The partial regression plot also provides some intuition about the meaning of regression coefficients.
  • We are looking at the marginal relationship between the response and the predictor after the effect of the other predictors has been removed.
  • Multiple regression is difficult because we cannot visualize the full relationship because of the high dimensionality.
  • The partial regression plot allows us to focus on the relationship between one predictor and the response, much as in simple regression.

Partial residual plot in R

plot(savings$pop15, MLR.fit$res +
      MLR.fit$coef['pop15']*savings$pop15,
      xlab = "pop'n under 15",
      ylab = "Saving(adjusted)",
      main = "Partial Residual")
abline(0, MLR.fit$coef['pop15'])

Figure: Partial residual plot for the savings data

Partial residual plot in R

  • Or more directly, using the prplot() function from the faraway library:
prplot(MLR.fit,1)

Comparing two groups using R

Might there be a different relationship in the two groups?

MLR.fit.1 <- lm(sr ~ pop15+pop75+dpi+ddpi,
                savings, subset=(pop15 > 35))
MLR.fit.2 <- lm(sr ~ pop15+pop75+dpi+ddpi,
                savings, subset=(pop15 < 35))

Comparing two groups using R

summary(MLR.fit.1)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings, 
##     subset = (pop15 > 35))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5511 -3.5101  0.0443  2.6764  8.4983 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4339689 21.1550278  -0.115    0.910
## pop15        0.2738537  0.4391910   0.624    0.541
## pop75       -3.5484769  3.0332806  -1.170    0.257
## dpi          0.0004208  0.0050001   0.084    0.934
## ddpi         0.3954742  0.2901012   1.363    0.190
## 
## Residual standard error: 4.454 on 18 degrees of freedom
## Multiple R-squared:  0.1558, Adjusted R-squared:  -0.03185 
## F-statistic: 0.8302 on 4 and 18 DF,  p-value: 0.5233

Comparing two groups using R

summary(MLR.fit.2)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings, 
##     subset = (pop15 < 35))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5890 -1.5015  0.1165  1.8857  5.1466 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 23.9617950  8.0837502   2.964  0.00716 **
## pop15       -0.3858976  0.1953686  -1.975  0.06092 . 
## pop75       -1.3277421  0.9260627  -1.434  0.16570   
## dpi         -0.0004588  0.0007237  -0.634  0.53264   
## ddpi         0.8843944  0.2953405   2.994  0.00668 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.772 on 22 degrees of freedom
## Multiple R-squared:  0.5073, Adjusted R-squared:  0.4177 
## F-statistic: 5.663 on 4 and 22 DF,  p-value: 0.002734

Comparing two groups using R

  • Can you see the difference?
  • The graphical analysis has shown a relationship in the data that a purely numerical analysis might easily have missed.
  • Higher dimensional plots can also be useful for detecting structure that cannot be seen in two dimensions (reading).

Assessing normality

  • The statistical tests and confidence intervals we use are based on the assumption of normal errors.

  • The residuals can be assessed for normality using a Q-Q plot.

  • The steps are:

    1. Sort the residuals: \(\widehat{\varepsilon}_{[1]} \le \dots \le \widehat{\varepsilon}_{[n]}\)
    2. Compute \(u_i = \Phi^{-1}\left(\frac{i}{n+1}\right )\)
    3. Plot \(\widehat{\varepsilon}_{[i]}\) against \(u_i\).

If the residuals are normally distributed an approximately straight-line relationship will be observed.
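  • A minimal sketch of these three steps carried out by hand, assuming the savings model MLR.fit from above:
QQ.res <- sort(residuals(MLR.fit))                  # step 1: sort the residuals
n <- length(QQ.res)
u <- qnorm((1:n)/(n + 1))                           # step 2: u_i = Phi^{-1}(i/(n+1))
plot(u, QQ.res, xlab = "Normal quantiles",
     ylab = "Sorted residuals")                     # step 3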

Checking normality using R

Consider the full multiple regression model fitted to the savings data. The R code for the Q-Q plots:

par(mfrow = c(1,2))
qqnorm(MLR.fit$res, ylab = "Raw Residuals")
qqline(MLR.fit$res)
qqnorm(rstudent(MLR.fit), ylab = "Studentized residuals")
abline(0,1)

Figure: Normality checks for the savings data

The output

  • In the first plot of the above figure, qqline() adds a line joining the first and third quartiles; it is useful as a guide.
  • In the plot on the right, because these residuals have been normalized, they should lie along a 45-degree line.
  • Histograms and boxplots are not as sensitive for checking normality:

Checking normality using R

par(mfrow=c(1,2))
hist(MLR.fit$res, 10)
boxplot(MLR.fit$res,
      xlab = "Residuals",
      main = "Boxplot of savings residuals")

Figure: Normality checks for the savings data

Consequences of non-normality

  • The major consequences of non-normality are

    1. that the least squares estimates may not be optimal - they will still be BLUE but other robust estimators may be more effective.
    2. that the tests and confidence intervals are invalid. However, it has been shown that only really long-tailed distributions cause a problem. Mild non-normality can safely be ignored and the larger the sample size the less troublesome the non-normality.

Possible solutions for non-normality

1. A transformation of the response may solve the problem - this is often true for skewed errors.
2. Other changes in the model may help.
3. Accept non-normality and base the inference on the assumption of another distribution or use re-sampling methods such as the bootstrap or permutation tests. You don't want to do this unless absolutely necessary. Alternatively use robust methods which give less weight to outlying points. This is appropriate for long tailed distributions.
4. For short-tailed distributions, the consequences of non-normality are not serious and can reasonably be ignored.

Test of normality

  • There are formal tests for normality, such as the Kolmogorov-Smirnov test and the Shapiro-Wilk test used below.
  • Null hypothesis: the data are normally distributed. If \(p > 0.05\), normality can be assumed.
  • For smaller sample sizes, formal tests lack power.
shapiro.test(MLR.fit$res)
## 
##  Shapiro-Wilk normality test
## 
## data:  MLR.fit$res
## W = 0.98698, p-value = 0.8524
  • Here \(p = 0.8524\), so the null hypothesis is retained at the 5% level of significance. Therefore, normality can be assumed for this data set and, provided any other test assumptions are satisfied, an appropriate parametric test can be used.

Half-normal plots

  • Half-normal plots are designed for the assessment of positive data.

  • They could be used for \(|\widehat{\epsilon}|\) but are more typically useful for diagnostic quantities like the leverages or the Cook Statistics.

  • The idea is to plot the data against the positive normal quantiles.

  • The steps are:

    1. Sort the data: \(x_{[1]} \le \dots \le x_{[n]}\)
    2. Compute \(u_i = \Phi^{-1}\left(\frac{n + i}{2\,n+1}\right )\)
    3. Plot \(x_{[i]}\) against \(u_i\).
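  • A minimal sketch of these steps carried out by hand, assuming the savings model MLR.fit from above and using the leverages as the positive quantities:
lev <- sort(lm.influence(MLR.fit)$hat)              # step 1: sort the data
n <- length(lev)
u <- qnorm((n + 1:n)/(2*n + 1))                     # step 2: u_i = Phi^{-1}((n+i)/(2n+1))
plot(u, lev, xlab = "Half-normal quantiles",
     ylab = "Sorted leverages")                     # step 3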

Half-normal plots

  • We are usually not looking for a straight line relationship since we do not necessarily expect a positive normal distribution for quantities like the leverages.
  • If the \({\bf X}\) is multivariate normal, the leverages will have a \(\chi^2_p\) distribution but there is usually no good reason to assume multivariate normality for the X.
  • We are looking for outliers which will be apparent as points that diverge substantially from the rest of the data.

Half-normal plots

Consider the half-normal plots of the leverages and Cook statistics for the savings data. The R code:

countries <- row.names(savings)
halfnorm(lm.influence(MLR.fit)$hat,
      labs = countries, ylab = "Leverages")

Half-normal plots

halfnorm(cooks.distance(MLR.fit),
     labs = countries, ylab = "Cook Statistics")

Half-normal plots

  • In the plots, the country name is shown instead of just a dot for the two largest cases in each plot, to aid identification.
  • Libya shows up clearly as unusual in both plots.
  • Note that the halfnorm() function comes from the faraway library.

Correlated Errors

  • We assume that the errors are uncorrelated but for temporally or spatially related data this may well be untrue.
  • For this type of data, it is wise to check the uncorrelated assumption.
    1. Plot \(\widehat{\varepsilon}\) against time.
    2. Use formal tests like the Durbin-Watson test or the runs test.
  • If we do have correlated errors, one option is to use generalized least squares (GLS). This requires that we know \(\Sigma\), or more usually that we can estimate it. In the latter case, an iterative fitting procedure will be necessary, as in IRWLS. Such problems are common in econometrics.
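  • A hedged sketch of the Durbin-Watson test, assuming the lmtest package is installed (applied here to the log-ozone model Air.qual.fit.1 fitted later in this section):
library(lmtest)           # provides dwtest()
dwtest(Air.qual.fit.1)    # tests for first-order serial correlation in the residuals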

Correlated Errors

Data: New York Air Quality Measurements

  • This example is taken from an environmental study that measured the four variables ozone, solar radiation, temperature, and wind speed for 153 consecutive days in New York.
  • There are some missing values in the data. You can check this using the R code:
is.na(airquality)
##        Ozone Solar.R  Wind  Temp Month   Day
##   [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
##   [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
##   [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
##   [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
##   [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
##   [6,] FALSE    TRUE FALSE FALSE FALSE FALSE
##   [7,] FALSE   FALSE FALSE FALSE FALSE FALSE
##   [8,] FALSE   FALSE FALSE FALSE FALSE FALSE
##   [9,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [10,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [11,] FALSE    TRUE FALSE FALSE FALSE FALSE
##  [12,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [13,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [14,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [15,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [16,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [17,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [18,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [19,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [20,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [21,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [22,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [23,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [24,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [25,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [26,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [27,]  TRUE    TRUE FALSE FALSE FALSE FALSE
##  [28,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [29,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [30,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [31,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [32,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [33,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [34,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [35,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [36,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [37,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [38,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [39,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [40,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [41,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [42,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [43,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [44,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [45,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [46,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [47,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [48,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [49,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [50,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [51,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [52,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [53,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [54,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [55,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [56,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [57,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [58,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [59,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [60,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [61,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [62,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [63,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [64,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [65,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [66,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [67,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [68,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [69,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [70,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [71,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [72,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [73,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [74,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [75,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [76,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [77,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [78,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [79,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [80,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [81,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [82,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [83,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [84,]  TRUE   FALSE FALSE FALSE FALSE FALSE
##  [85,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [86,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [87,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [88,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [89,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [90,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [91,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [92,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [93,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [94,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [95,] FALSE   FALSE FALSE FALSE FALSE FALSE
##  [96,] FALSE    TRUE FALSE FALSE FALSE FALSE
##  [97,] FALSE    TRUE FALSE FALSE FALSE FALSE
##  [98,] FALSE    TRUE FALSE FALSE FALSE FALSE
##  [99,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [100,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [101,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [102,]  TRUE   FALSE FALSE FALSE FALSE FALSE
## [103,]  TRUE   FALSE FALSE FALSE FALSE FALSE
## [104,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [105,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [106,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [107,]  TRUE   FALSE FALSE FALSE FALSE FALSE
## [108,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [109,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [110,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [111,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [112,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [113,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [114,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [115,]  TRUE   FALSE FALSE FALSE FALSE FALSE
## [116,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [117,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [118,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [119,]  TRUE   FALSE FALSE FALSE FALSE FALSE
## [120,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [121,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [122,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [123,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [124,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [125,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [126,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [127,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [128,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [129,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [130,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [131,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [132,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [133,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [134,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [135,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [136,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [137,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [138,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [139,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [140,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [141,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [142,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [143,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [144,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [145,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [146,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [147,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [148,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [149,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [150,]  TRUE   FALSE FALSE FALSE FALSE FALSE
## [151,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [152,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [153,] FALSE   FALSE FALSE FALSE FALSE FALSE
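  • A more compact way to summarize the missing values per variable (an alternative to printing the full logical matrix) is:
colSums(is.na(airquality))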
  • Take a look at the data using the plot:

Panel plots for air quality dataset

pairs(airquality, panel = panel.smooth)

Figure: Panel plots for air quality dataset

Fitting a linear model using R

  • Fit a standard linear model and check the residual-fitted plots.
Air.qual.fit <- lm(Ozone ~ Solar.R + Wind + Temp,
              airquality, na.action = na.exclude)

The output

summary(Air.qual.fit)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

The output

  • Notice how there are only 107 degrees of freedom, corresponding to the 111 complete observations.
  • The default behavior in R when performing a regression with missing values is to exclude any case that contains a missing value.

Residual plot

plot(fitted(Air.qual.fit), residuals(Air.qual.fit),
      xlab = "Fitted", ylab = "Residuals")

Figure: Checking constant variance errors air quality dataset

  • We see some non-constant variance and so we try transforming the response.

Fitting a linear model using R

  • The R code to fit the model after transforming the response:
Air.qual.fit.1 <- lm(log(Ozone) ~ Solar.R + Wind +
          Temp, airquality, na.action = na.exclude)
summary(Air.qual.fit.1)
## 
## Call:
## lm(formula = log(Ozone) ~ Solar.R + Wind + Temp, data = airquality, 
##     na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06193 -0.29970 -0.00231  0.30756  1.23578 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.2621323  0.5535669  -0.474 0.636798    
## Solar.R      0.0025152  0.0005567   4.518 1.62e-05 ***
## Wind        -0.0615625  0.0157130  -3.918 0.000158 ***
## Temp         0.0491711  0.0060875   8.077 1.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5086 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6644, Adjusted R-squared:  0.655 
## F-statistic: 70.62 on 3 and 107 DF,  p-value: < 2.2e-16

Residual plot

plot(fitted(Air.qual.fit.1), residuals(Air.qual.fit.1),
        xlab = "Fitted", ylab = "Residuals",
        main = "Logged Response")

Figure: Checking constant variance errors for log transformed response

The interpretation

  • The error variance now looks constant.
  • Suppose we are now otherwise satisfied with this model and want to check for serial correlation.
  • The missing values in the data were not used in the construction of the model but this also breaks up the sequential pattern in the data.
  • This can be handled by reintroducing missing values into the residuals corresponding to the omitted cases.
  • First make an index plot of the residuals, the R code to do this:
res <- rep(NA,153)
res[as.numeric(row.names(na.omit(airquality)))] <- Air.qual.fit.1$res

Residual plot

plot(res, ylab = "Residuals",
       main = "Index plot of residuals")

Figure: Checking correlated errors - Index plot

  • Is there any evidence of serial correlation?
  • Now plot successive residuals:

Residual plot

plot(residuals(Air.qual.fit.1)[-153],
         residuals(Air.qual.fit.1)[-1],
    xlab = expression(hat(epsilon)[i]),
    ylab = expression(hat(epsilon)[i+1]))

Figure: Checking correlated errors - scatterplot of successive residuals

Residual plot

  • Do you see any problem?
  • We can further check this by fitting a simple linear regression between successive residuals.

Fitting a linear model between successive residuals

SLR.Corr.res <- lm(residuals(Air.qual.fit.1)[-1] ~
         -1+residuals(Air.qual.fit.1)[-153])
 summary(SLR.Corr.res)
## 
## Call:
## lm(formula = residuals(Air.qual.fit.1)[-1] ~ -1 + residuals(Air.qual.fit.1)[-153])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07274 -0.28953  0.02583  0.32256  1.32594 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)
## residuals(Air.qual.fit.1)[-153]   0.1104     0.1053   1.048    0.297
## 
## Residual standard error: 0.5078 on 91 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.01193,    Adjusted R-squared:  0.001073 
## F-statistic: 1.099 on 1 and 91 DF,  p-value: 0.2973

The output

  • We omitted the intercept term because the residuals have mean zero. We see that there is no significant correlation.
  • You can plot more than just successive pairs if you suspect a more complex dependence.
  • For spatial data, more complex checks are required.

Box-Cox plot

  • Now consider the Galapagos Islands dataset analyzed earlier:
library(faraway)   # for the gala data
library(MASS)      # for boxcox()
data(gala)
MLR.fit.g <- lm(Species ~ Area + Elevation +
           Nearest + Scruz + Adjacent, gala)
boxcox(MLR.fit.g, plotit = T)

boxcox(MLR.fit.g, lambda = seq(0.0,1.0,
          by = 0.05), plotit = T)

The output

  • The plots are shown in the above figure. We see that perhaps a cube-root transformation might be best here. A square root is also a possibility as this falls just within the confidence intervals. Certainly there is a strong need to transform.

Chapter Six: Variable Selection

Introduction

  • This chapter deals with model building and the selection of a regression model. The topics considered are:

    • mis-specification of predictor variables,
    • selection criteria such as Mallows’ \(C_{p}\), \(PRESS_{p}\), \(AIC_{p}\) and \(BIC_{p}\), and
    • variable selection methods such as stepwise regression and the forward and backward selection procedures.
  • By using these selection methods, a possible ‘best’ model will be determined among the candidate predictor variables.

  • That is, variable selection is intended to select the ‘best’ subset of predictors.

Introduction

  • If we have, say, \(p-1\) predictor variables, there are a total of \(2^{p-1}\) possible candidate models.
  • Problems of identifying the best subset of such regressors are known as variable selection problems.
  • We will assume that the number of observations exceeds the maximum number of potential parameters, \(n>p\).

Why do we bother

  1. We want to explain the data in the simplest way - redundant predictors should be removed. The smallest model that fits the data is best.
  2. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. Degrees of freedom will be wasted.
  3. Collinearity is caused by having too many variables trying to do the same job.
  4. Cost: if the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors.

Why do we bother

  • Prior to variable selection:

    1. Identify outliers and influential points - maybe exclude them at least temporarily.
    2. Add in any transformations of the variables that seem appropriate.

Hierarchical Models

  • Some models have a natural hierarchy. For example, in polynomial models, \(x^2\) is a higher order term than \(x\).

  • When selecting variables, it is important to respect the hierarchy.

  • Lower order terms should not be removed from the model before higher order terms in the same variable.

  • There are two common situations where this arises:

    1. Polynomials models: Consider the model \[ y = \beta_0 + \beta_1\,x + \beta_2\,x^2 + \varepsilon \]

Hierarchical models

  • Suppose we fit this model and find that the regression summary shows that the term in \(x\) is not significant but the term in \(x^2\) is.
  • If we then removed the \(x\) term, our reduced model would then become \[ y = \beta_0 + \beta_2\,x^2 + \varepsilon \] but suppose we then made a scale change \(x \rightarrow x + a\), then the model would become \[ y = \beta_0 + \beta_2\,a^2 + 2\,\beta_2\,a\,x + \beta_2\,x^2 + \varepsilon. \]
  • The first order \(x\) term has now reappeared.
  • Scale changes should not make any important change to the model but in this case an additional term has been added.
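  • A small simulated illustration of this point (hypothetical data, not from the original notes): the true model has no first-order term in \(x\), but after shifting the predictor the fitted first-order coefficient becomes large.
set.seed(1)
x <- runif(50)
y <- 1 + 2*x^2 + rnorm(50, sd = 0.1)          # no first-order term in x
xs <- x + 5                                   # scale change x -> x + a with a = 5
summary(lm(y ~ x + I(x^2)))$coef              # x coefficient close to zero
summary(lm(y ~ xs + I(xs^2)))$coef            # first-order term reappears (about -20)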

Hierarchical models

  • This is not good. This illustrates why we should not remove lower order terms in the presence of higher order terms.
  • We would not want interpretation to depend on the choice of scale.
  • Removal of the first order term here corresponds to the hypothesis that the predicted response is symmetric about, and has an optimum at, \(x = 0\).
  • Often this hypothesis is not meaningful and should not be considered.
  • Only when this hypothesis makes sense in the context of the particular problem could we justify the removal of the lower order term.

Hierarchical models

  2. Models with interactions: Consider the second order response surface model: \[ y = \beta_0 + \beta_1\,x_1 + \beta_2\,x_2 + \beta_{11}\,x^2_1 + \beta_{22}\,x^2_2 + \beta_{12}\,x_1\,x_2 + \varepsilon. \]

    • We would not normally consider removing the \(x_1\,x_2\) interaction term without simultaneously considering the removal of the \(x^2_1\) and \(x^2_2\) terms.
    • A joint removal would correspond to the clearly meaningful comparison of a quadratic surface and a linear one.
    • Just removing the \(x_1\,x_2\) term would correspond to a surface that is aligned with the coordinate axes.
    • This is hard to interpret and should not be considered unless some particular meaning can be attached.

Hierarchical models

  • Any rotation of the predictor space would reintroduce the interaction term and, as with the polynomials, we would not ordinarily want our model interpretation to depend on the particular basis for the predictors.

Criteria for Model Selection

Backward elimination

  • This is the simplest of all variable selection procedures and can be easily implemented without special software.

  • In situations where there is a complex hierarchy, backward elimination can be run manually while taking account of what variables are eligible for removal.

  • The steps are

    1. Start with all the predictors in the model
    2. Remove the predictor with the highest \(p\)-value greater than \(\alpha_{crit}\)
    3. Refit the model and go to step 2
    4. Stop when all p-values are less than \(\alpha_{crit}\).
  • The \(\alpha_{crit}\) is sometimes called the “\(p\)-to-remove” and does not have to be 5%.

  • If prediction performance is the goal, then a 15-20% cut-off may work best, although methods designed more directly for optimal prediction should be preferred.
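  • The \(p\)-values needed at step 2 can be obtained in one call with drop1(); a minimal sketch, shown for the state-data model MLR.fit.state fitted later in this chapter:
drop1(MLR.fit.state, test = "F")   # F-test for removing each predictor in turn
  • The predictor with the largest \(p\)-value above the cut-off would be removed first.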

Forward selection

  • This just reverses the backward method and the steps are

    1. Start with no variables in the model.
    2. For all predictors not in the model, check the \(p\)-value each would have if added to the model. Choose the one with the lowest \(p\)-value, provided it is less than \(\alpha_{crit}\).
    3. Continue until no new predictors can be added.
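  • A hedged sketch of an automated forward search using step(), which ranks candidate terms by AIC (discussed later) rather than by a \(p\)-value cut-off; the state data introduced later in this chapter are assumed:
null.fit <- lm(Life.Exp ~ 1, data = statedata)
step(null.fit,
     scope = ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area,
     direction = "forward")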

Stepwise regression

  • This is a combination of backward elimination and forward selection.

  • It addresses the situation where variables are added or removed early in the process and we want to change our mind about them later.

  • At each stage a variable may be added or removed and there are several variations on exactly how this is done.

  • Therefore a stepwise method adds/deletes regressors one at a time.

  • Stepwise procedures are relatively cheap computationally but they do have some drawbacks.

    1. Because of the “one-at-a-time” nature of adding/dropping variables, it’s possible to miss the “optimal” model.

Stepwise regression

  2. The p-values used should not be treated too literally.

    • There is so much multiple testing occurring that the validity is dubious.
    • The removal of less significant predictors tends to increase the significance of the remaining predictors.
    • This effect leads one to overstate the importance of the remaining predictors.
  3. The procedures are not directly linked to final objectives of prediction or explanation and so may not really help solve the problem of interest.

    • With any variable selection method, it is important to keep in mind that model selection cannot be divorced from the underlying purpose of the investigation.
    • Variable selection tends to amplify the statistical significance of the variables that stay in the model.

Stepwise regression

  • Variables that are dropped can still be correlated with the response.
  • It would be wrong to say these variables are unrelated to the response, it’s just that they provide no additional explanatory effect beyond those variables already included in the model.
  4. Stepwise variable selection tends to pick models that are smaller than desirable for prediction purposes. To give a simple example, consider a simple regression with just one predictor variable. Suppose that the slope for this predictor is not quite statistically significant. We might not have enough evidence to say that it is related to \(y\), but it still might be better to use it for predictive purposes.

Example: Data- US State

Data sets related to the 50 states of the United States of America. The variables are: population estimate as of July 1, 1975; per capita income (1974); illiteracy (1970, percent of population); life expectancy in years (1969-71); murder and non-negligent manslaughter rate per 100,000 population (1976); percent high-school graduates (1970); mean number of days with minimum temperature \(<\) 32 degrees (1931-1960) in the capital or a large city; and land area in square miles. The data were collected by the US Bureau of the Census. We will take life expectancy as the response and the remaining variables as predictors - a fix is necessary to remove spaces in some of the variable names.

Fitting models using backward elimination

Data: State

data(state)
statedata <- data.frame(state.x77,
     row.names = state.abb, check.names = T)
MLR.fit.state <- lm(Life.Exp ~ ., data=statedata)

Fitting models using backward elimination

summary(MLR.fit.state)
## 
## Call:
## lm(formula = Life.Exp ~ ., data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
## Population   5.180e-05  2.919e-05   1.775   0.0832 .  
## Income      -2.180e-05  2.444e-04  -0.089   0.9293    
## Illiteracy   3.382e-02  3.663e-01   0.092   0.9269    
## Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *  
## Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .  
## Area        -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10

Fitting models using backward elimination

  • Which predictors should be included - can you tell from the p-values?
  • Looking at the coefficients, can you see what operation would be helpful?
  • Does the murder rate decrease life expectancy - that’s obvious a priori, but how should these results be interpreted?
  • At each stage we remove the predictor with the largest \(p\)-value over 0.05.

Fitting models using backward elimination

MLR.fit.state.1 <- update(MLR.fit.state, . ~ . - Area)
summary(MLR.fit.state.1)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + 
##     HS.Grad + Frost, data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49047 -0.52533 -0.02546  0.57160  1.50374 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.099e+01  1.387e+00  51.165  < 2e-16 ***
## Population   5.188e-05  2.879e-05   1.802   0.0785 .  
## Income      -2.444e-05  2.343e-04  -0.104   0.9174    
## Illiteracy   2.846e-02  3.416e-01   0.083   0.9340    
## Murder      -3.018e-01  4.334e-02  -6.963 1.45e-08 ***
## HS.Grad      4.847e-02  2.067e-02   2.345   0.0237 *  
## Frost       -5.776e-03  2.970e-03  -1.945   0.0584 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7361 on 43 degrees of freedom
## Multiple R-squared:  0.7361, Adjusted R-squared:  0.6993 
## F-statistic: 19.99 on 6 and 43 DF,  p-value: 5.362e-11

Fitting models using backward elimination

MLR.fit.state.2 <- update(MLR.fit.state.1, . ~ . - Illiteracy)
summary(MLR.fit.state.2)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + 
##     Frost, data = statedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4892 -0.5122 -0.0329  0.5645  1.5166 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.107e+01  1.029e+00  69.067  < 2e-16 ***
## Population   5.115e-05  2.709e-05   1.888   0.0657 .  
## Income      -2.477e-05  2.316e-04  -0.107   0.9153    
## Murder      -3.000e-01  3.704e-02  -8.099 2.91e-10 ***
## HS.Grad      4.776e-02  1.859e-02   2.569   0.0137 *  
## Frost       -5.910e-03  2.468e-03  -2.395   0.0210 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7277 on 44 degrees of freedom
## Multiple R-squared:  0.7361, Adjusted R-squared:  0.7061 
## F-statistic: 24.55 on 5 and 44 DF,  p-value: 1.019e-11

Fitting models using backward elimination

MLR.fit.state.3 <- update(MLR.fit.state.2, . ~ . - Income)
summary(MLR.fit.state.3)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
##     data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

Fitting models using backward elimination

MLR.fit.state.4 <- update(MLR.fit.state.3, . ~ . - Population)
summary(MLR.fit.state.4)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = statedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5015 -0.5391  0.1014  0.5921  1.2268 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.036379   0.983262  72.246  < 2e-16 ***
## Murder      -0.283065   0.036731  -7.706 8.04e-10 ***
## HS.Grad      0.049949   0.015201   3.286  0.00195 ** 
## Frost       -0.006912   0.002447  -2.824  0.00699 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7427 on 46 degrees of freedom
## Multiple R-squared:  0.7127, Adjusted R-squared:  0.6939 
## F-statistic: 38.03 on 3 and 46 DF,  p-value: 1.634e-12

Fitting models using backward elimination

  • The final removal of the Population variable is a close call.
  • We may want to consider including this variable if interpretation is aided.
  • Notice that the \(R^2\) for the full model of 0.736 is reduced only slightly to 0.713 in the final model.
  • Thus the removal of four predictors causes only a minor reduction in fit.

Criterion-based procedures

  • If there are \(p\) potential predictors, then there are \(2^p\) possible models.
  • We fit all these models and choose the best one according to some criterion.
  • Clever algorithms such as the “branch-and-bound” method can avoid actually fitting all the models - only likely candidates are evaluated.
  • Some criteria are discussed next.

1. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

  • An information criterion is a measure of the goodness of fit, or uncertainty, of a model for the range of values of the data.
  • In the context of multiple linear regression, information criteria measure the difference between a given model and the “true” underlying model.
  • Akaike (1973) introduced the concept of information criteria as a tool for optimal model selection.
  • Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are functions of the number of observations \(n\), the \(SSE\) and the number of parameters \(p\).
  • AIC and BIC are based on the maximum likelihood estimates of the model parameters.

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

  • In maximum likelihood, the idea is to estimate parameters so that, under the model, the probability of the observed data would be as large as possible.
  • In general, \[ AIC = -2\,\log(likelihood) + 2\,p \] and \[ BIC = -2\,\log(likelihood) + p\,log(n) \]
  • For linear regression models, \(-2\,\log(likelihood) = n\,\log(SSE/n)\) up to an additive constant; this quantity is known as the deviance.
  • We search for models that have small values of AIC or BIC.

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

  • Larger models will fit better and so have smaller \(SSE\) but use more parameters. Thus the best choice of model will balance fit with model size.
  • \(BIC\) penalizes larger models more heavily and so will tend to prefer smaller models in comparison to \(AIC\).
  • \(AIC\) and \(BIC\) can be used as selection criteria for other types of model too.
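  • For a fitted linear model these criteria are available directly in R; a minimal sketch, shown for the state-data model MLR.fit.state fitted above:
AIC(MLR.fit.state)          # -2 log(likelihood) + 2p
BIC(MLR.fit.state)          # -2 log(likelihood) + p log(n)
extractAIC(MLR.fit.state)   # the n log(SSE/n) + 2p form used internally by step()
  • AIC() uses the full Gaussian log-likelihood, so its values differ from extractAIC() by a constant that does not affect comparisons between models on the same data.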

Choosing a model using AIC

Data: state

  • We can apply the \(AIC\) (and optionally the \(BIC\)) to the state data.
  • The step() function does not evaluate the \(AIC\) for all possible models but uses a search method that compares models sequentially.
  • Thus it bears some comparison to the stepwise method described above but with the advantage that no dubious \(p\)-values are used.
  • The R code:
MLR.fit.state <- lm(Life.Exp ~ ., data=statedata)
step(MLR.fit.state)

Choosing a model using AIC

Start:  AIC=-22.18
Life.Exp ~ Population + Income + Illiteracy + Murder +
           HS.Grad +  Frost + Area

             Df Sum of Sq    RSS     AIC
- Area        1    0.0011 23.298 -24.182
- Income      1    0.0044 23.302 -24.175
- Illiteracy  1    0.0047 23.302 -24.174
<none>                    23.297 -22.185
- Population  1    1.7472 25.044 -20.569
- Frost       1    1.8466 25.144 -20.371
- HS.Grad     1    2.4413 25.738 -19.202
- Murder      1   23.1411 46.438  10.305

Choosing a model using AIC

Step:  AIC=-24.18
Life.Exp ~ Population + Income + Illiteracy + Murder +
           HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
- Illiteracy  1    0.0038 23.302 -26.174
- Income      1    0.0059 23.304 -26.170
<none>                    23.298 -24.182
- Population  1    1.7599 25.058 -22.541
- Frost       1    2.0488 25.347 -21.968
- HS.Grad     1    2.9804 26.279 -20.163
- Murder      1   26.2721 49.570  11.569

Choosing a model using AIC

Step:  AIC=-26.17
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
- Income      1     0.006 23.308 -28.161
<none>                    23.302 -26.174
- Population  1     1.887 25.189 -24.280
- Frost       1     3.037 26.339 -22.048
- HS.Grad     1     3.495 26.797 -21.187
- Murder      1    34.739 58.041  17.456

Choosing a model using AIC

Step:  AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
<none>                    23.308 -28.161
- Population  1     2.064 25.372 -25.920
- Frost       1     3.122 26.430 -23.877
- HS.Grad     1     5.112 28.420 -20.246
- Murder      1    34.816 58.124  15.528

Choosing a model using AIC

Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad +
       Frost, data = statedata)

Coefficients:
(Intercept) Population   Murder    HS.Grad     Frost
  7.103e+01  5.014e-05  -3.001e-01  4.658e-02 -5.943e-03
  • The sequence of variable removal is the same as with backward elimination.
  • The only difference is that the Population variable is retained.

2. Adjusted \(R^2\)

  • The adjusted \(R^2\) is called \(R^2_a\). Recall that \(R^2 = 1 - SSE/SS_{Total}\).
  • Adding a variable to a model can only decrease the \(SSE\) and so only increase the \(R^2\) so \(R^2\) by itself is not a good criterion because it would always choose the largest possible model. \[ R^2_a = 1 - \frac{SSE/(n-p)}{SS_{Total}/(n-1)} = 1 - \left (\frac{n-1}{n-p}\right)\,(1 - R^2) = 1 - \frac{\widehat{\sigma}^2_{model}}{\widehat{\sigma}^2_{null}}. \]
  • Adding a predictor will only increase \(R^2_a\) if it has some value.
  • Do you see the connection to \(\widehat{\sigma}^2\)?
  • Minimizing the standard error for prediction means minimizing \(\widehat{\sigma}^2\), which in turn means maximizing \(R^2_a\).
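  • In R the adjusted \(R^2\) of a fitted model is reported by summary(); for example, for the state-data models fitted above:
summary(MLR.fit.state)$adj.r.squared     # full model
summary(MLR.fit.state.4)$adj.r.squared   # Murder + HS.Grad + Frost model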

3. Predicted Residual Sum of Squares (PRESS)

  • The Prediction Error Sum of Squares (PRESS) is a statistic which is based on the deleted residuals, say \(d_{i}\) such that, \(d_{i} = y_{i} - \widehat{y}_{i(i)}\), where \(\hat{y}_{i(i)}\) is the predicted value for the \(i\)th case when the regression function is fitted without the \(i\)th case.
  • There are \(n\) deleted residuals associated with each model.
  • Then the PRESS criterion is the sum of the squares of these deleted residuals: \[ PRESS_p = \sum\limits_{i=1}^n d_i^2 =\sum\limits_{i=1}^n (y_i - \widehat{y}_{i(i)})^2 \]
  • PRESS values could be calculated using the equivalent expression for \(d_{i}\), \[ d_{i} = \left(\frac{\widehat{\varepsilon}_i}{1-h_{ii}}\right) \]

Predicted Residual Sum of Squares (PRESS)

  • Then \(PRESS_{p}\) becomes \[ PRESS_{p} = \sum\limits_{i=1}^n \left(\frac{\widehat{\varepsilon}_i}{1-h_{ii}}\right)^2, \] where \(\widehat{\varepsilon}_i\) is the ordinary residual and \(h_{ii}\) is the leverage value such that both are based on all n cases.
  • The models with small PRESS values are considered to be ‘good’ candidate models because the deleted residual \(d_{i}\) is the prediction error when a regression model is fitted without the \(i\)th case and \(\widehat{y}_{i(i)}\) is used as the predicted value.
  • Then small prediction errors involve small \(d_{i}\) values and hence small sum of \(d_{i}^{2}\) values.

Predicted Residual Sum of Squares (PRESS)

  • Therefore, such models with small PRESS values fit well due to the fact that they have small prediction errors, which may be desirable if prediction is the objective.
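  • A minimal sketch of computing \(PRESS_p\) from the ordinary residuals and leverages (a hypothetical helper function, not from the original notes), shown for the state-data model fitted above:
press <- function(fit) {
  h <- lm.influence(fit)$hat             # leverages h_ii
  sum((residuals(fit)/(1 - h))^2)        # sum of squared deleted residuals
}
press(MLR.fit.state)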

4. Mallow’s \(C_p\) Statistic

  • A good model should predict well so average \(MSE\) of prediction might be a good criterion: \[ \frac{1}{\sigma^2}\sum_{i=1}^nE(\widehat{y}_i - E(y_i))^2 \] which can be estimated by the \(C_p\) statistic \[ C_p = \frac{SSE_p}{\widehat{\sigma}^2} + 2\,p - n \] where \(\widehat{\sigma}^2\) is from the model with all predictors and \(SSE_p\) indicates the \(SSE\) from a model with \(p\) parameters.

Mallow’s \(C_p\) Statistic

  • \(C_p\) has the following properties:

    1. \(C_p\) is easy to compute
    2. It is closely related to \(R^2_a\) and the \(AIC\).
    3. For the full model \(C_p = p\) exactly.
    4. If a \(p\) predictor model fits then \(E(SSE_p) = (n - p)\,\sigma^2\) and then \(E(C_p) \thickapprox p\). A model with a bad fit will have \(C_p\) much bigger than \(p\).
  • It is usual to plot \(C_p\) against \(p\). We desire models with small \(p\) and \(C_p\) around or less than \(p\).
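  • A hedged sketch of the usual \(C_p\)-versus-\(p\) plot using leaps(), assuming the model matrix x and response y constructed as in the example below:
Cp.out <- leaps(x, y, method = "Cp")
plot(Cp.out$size, Cp.out$Cp, xlab = "p", ylab = expression(C[p]))
abline(0, 1)                             # reference line C_p = p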

Example: state data - \(C_p\) and \(R^2_a\)

  • The default for the leaps() function is the Mallows’ \(C_p\) criterion:
library(leaps)
MLR.fit.state.Cp <- regsubsets(Life.Exp ~ .,
       nbest=3, data=statedata)
par(mfrow=c(1,1))
plot(MLR.fit.state.Cp)

Example: state data - \(C_p\) and \(R^2_a\)

  • Now let’s see which model the adjusted \(R^2\) criterion selects. The R code:
x <- model.matrix(MLR.fit.state)[, -1]
y <- statedata$Life.Exp
z <- leaps(x, y, wt = rep(1, NROW(x)), int = TRUE,
           method = "adjr2")

Example: state data - \(C_p\) and \(R^2_a\)

  • Observe from the output of the above R code that the Population, Frost, HS graduation and Murder model has the largest \(R^2_a\).

Effects of outliers and influential points

  • Variable selection methods are sensitive to outliers and influential points.
  • Let’s check for high leverage points for state data:
h <- hat(x)
names(h) <- state.abb
Lev <- round(rev(sort(h)), 3)
Lev[1:8]
##    AK    CA    HI    NV    NM    TX    NY    WA 
## 0.810 0.409 0.379 0.365 0.325 0.284 0.257 0.223
  • Which state sticks out?

Effects of outliers and influential points

  • Let’s try excluding it (Alaska is the second state in the data).
MLR.exc.Alaska <- leaps(x[-2,],y[-2],
        method="adjr2")
mod.sum <- cbind(MLR.exc.Alaska$which,
           MLR.exc.Alaska$adjr2)
mod.sum[order(MLR.exc.Alaska$adjr2),]
##   1 2 3 4 5 6 7            
## 1 0 0 0 0 0 0 1 -0.01718029
## 1 1 0 0 0 0 0 0 -0.01277762
## 1 0 0 0 0 0 1 0  0.06520553
## 1 0 1 0 0 0 0 0  0.20319699
## 1 0 0 1 0 0 0 0  0.32876890
## 6 1 1 1 0 1 1 1  0.39826292
## 2 0 0 0 0 1 1 0  0.40462058
## 2 0 0 0 0 1 0 1  0.41014920
## 2 0 1 0 0 1 0 0  0.41126196
## 1 0 0 0 0 1 0 0  0.41250889
## 2 0 0 1 0 1 0 0  0.43412341
## 2 0 0 1 1 0 0 0  0.58704752
## 1 0 0 0 1 0 0 0  0.59232605
## 2 0 0 0 1 0 1 0  0.61896135
## 2 1 0 0 1 0 0 0  0.62976399
## 2 0 0 0 1 0 0 1  0.62977339
## 2 0 1 0 1 0 0 0  0.63461975
## 3 1 0 0 1 0 1 0  0.64742252
## 3 1 0 0 1 0 0 1  0.65419549
## 3 0 0 0 1 0 1 1  0.65894342
## 2 0 0 0 1 1 0 0  0.66032808
## 3 0 1 0 1 0 1 0  0.66247853
## 3 0 1 0 1 1 0 0  0.66250241
## 3 0 0 1 1 1 0 0  0.66339581
## 3 0 0 0 1 1 0 1  0.66474629
## 3 0 1 0 1 0 0 1  0.66887357
## 4 0 1 0 1 1 0 1  0.67300490
## 4 1 0 0 1 0 1 1  0.67533495
## 4 1 1 0 1 1 0 0  0.67573257
## 3 1 0 0 1 1 0 0  0.68250469
## 4 1 0 0 1 1 0 1  0.68330361
## 5 1 1 1 1 1 0 0  0.68744234
## 4 0 0 1 1 1 1 0  0.68810694
## 6 1 1 1 1 1 0 1  0.68906478
## 6 1 1 1 1 0 1 1  0.68953371
## 5 0 1 1 1 1 1 0  0.68968425
## 5 1 0 1 1 1 0 1  0.69122028
## 4 1 0 1 1 1 0 0  0.69251700
## 5 0 0 1 1 1 1 1  0.69334457
## 5 0 1 1 1 0 1 1  0.69388055
## 3 0 0 0 1 1 1 0  0.69488547
## 5 1 1 0 1 0 1 1  0.69547589
## 4 0 1 0 1 1 1 0  0.69650751
## 6 1 1 1 1 1 1 0  0.69698252
## 4 0 1 0 1 0 1 1  0.69923087
## 4 0 0 0 1 1 1 1  0.69976145
## 6 0 1 1 1 1 1 1  0.70021694
## 7 1 1 1 1 1 1 1  0.70088995
## 5 1 1 0 1 1 1 0  0.70274495
## 5 1 0 1 1 1 1 0  0.70277714
## 6 1 0 1 1 1 1 1  0.70383148
## 5 0 1 0 1 1 1 1  0.70709124
## 6 1 1 0 1 1 1 1  0.70730267
## 4 1 0 0 1 1 1 0  0.70867029
## 5 1 0 0 1 1 1 1  0.71044047
  • We see that Area now makes it into the model.

Effects of transformation

  • Transforming the predictors can also have an effect.
  • Take a look at the variables using boxplot. The R code
library(leaps)
data(statedata)
par(mfrow=c(3,3))
for(i in 1:8) boxplot(state.x77[,i],
       main=dimnames(state.x77)[[2]][i])

Figure: Boxplot of the state data(see page 132: from the book)

Boxplots of the State data

  • In the figure, we see that Population, Illiteracy and Area are skewed, so we try transforming them:
x.tra <- cbind(log(x[,1]),x[,2],log(x[,3]), x[,4:6],log(x[,7]))
par(mfrow=c(3,3))
apply(x.tra,2, boxplot)

Figure: Boxplots of the State data after transforming the skewed variables

  • which shows the appropriately transformed data.
  • Now try the adjusted \(R^2\) method again.

Boxplots of the State data

MLR.trans <- leaps(x.tra, y, method="adjr2")
mod.sum <- cbind(MLR.trans$which,MLR.trans$adjr2)
mod.sum[order(MLR.trans$adjr2, decreasing = TRUE),][1:5,]
  1 2 3 4 5 6 7
4 1 0 0 1 1 1 0 0.7173392
5 1 0 0 1 1 1 1 0.7136360
5 1 0 1 1 1 1 0 0.7117179
5 1 1 0 1 1 1 0 0.7110497
6 1 0 1 1 1 1 1 0.7083894

Boxplots of the State data

  • This changes the “best” model again to log(Population), Frost, HS graduation and Murder.
  • The adjusted \(R^2\) is the highest of the models we have seen so far.

Summary

  • Variable selection is a means to an end and not an end itself.
  • The aim is to construct a model that predicts well or explains the relationships in the data.
  • Automatic variable selection methods are not guaranteed to be consistent with these goals.
  • Use these methods as a guide only.
  • Stepwise methods use a restricted search through the space of potential models and use a dubious hypothesis testing based method for choosing between models.
  • Criterion-based methods typically involve a wider search and compare models in a preferable manner.
  • For this reason, the recommendation is to use a criterion-based method.

Summary

  • Accept the possibility that several models may be suggested which fit about as well as each other. If this happens, consider:

    1. Do the models have similar qualitative consequences?
    2. Do they make similar predictions?
    3. What is the cost of measuring the predictors?
    4. Which has the best diagnostics?
  • If you find models that seem roughly equally good but lead to quite different conclusions, then it is clear that the data cannot answer the question of interest unambiguously.

  • Be alert to the danger that a model contradictory to the tentative conclusions might be out there.