This R Markdown file shows how linear models are computed by ordinary least squares (OLS) and by a robust alternative, least trimmed squares (LTS).

1 Example dataset

We use a subset of the meuse soil pollution dataset found in the sp (spatial objects) package. Note that each time the source R Markdown file is executed, a different random set of observations is selected from the full dataset, so the results will differ.

library(sp)
data(meuse)
# if you want to see its metadata
# help(meuse)
dim(meuse)
## [1] 155  14

Add computed fields with the log-transformed values of the heavy metals:

meuse$logZn <- log10(meuse$zinc)    
meuse$logCu <- log10(meuse$copper)  
meuse$logPb <- log10(meuse$lead)    
meuse$logCd <- log10(meuse$cadmium) 
dim(meuse)
## [1] 155  18
names(meuse)
##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m" 
## [15] "logZn"   "logCu"   "logPb"   "logCd"
table(meuse$ffreq) # observations in each flood frequency class
## 
##  1  2  3 
## 84 48 23

To be able to show the matrices, we limit the data to only a few rows and 3 selected columns: a dependent variable logZn, a continuous predictor dist (normalized distance to the river Meuse), and a categorical predictor ffreq (flooding frequency).

To ensure we have some observations from each flood frequency class, we take a stratified sample based on this category, using the strata function from the sampling package. We choose an equal number of observations from each of the three classes.

The n.s parameter is the sample size from each stratum; you can change this to see the effect of taking larger or smaller samples.

n.s <- 5  # sample size from each stratum
(n <- n.s * length(levels(meuse$ffreq))) # total sample size
## [1] 15
library(sampling)
# indices of the selected rows in the data frame
(ix <- strata(meuse, stratanames="ffreq", size=rep(n.s,3), method="srswor")$ID_unit)
##  [1]  19  43  46  63  84  88 109 110 114 131 133 136 138 139 142

These are the selected rows; they will differ with each random sample. (The row names printed below are the data frame's own row names, which are not the same as the positional indices returned by strata.)

# the reduced data frame
(ds <- meuse[ix, c("logZn","dist", "ffreq")])
##        logZn      dist ffreq
## 19  2.866287 0.0000000     1
## 44  2.245513 0.2118430     1
## 47  2.872739 0.1900860     1
## 64  2.818885 0.0316663     1
## 163 2.445604 0.2118430     1
## 92  2.703291 0.1139410     2
## 114 2.283301 0.5757520     2
## 115 2.380211 0.5814840     2
## 119 2.220108 0.2498520     2
## 161 2.100371 0.7716980     2
## 137 2.342423 0.2281230     3
## 141 2.198657 0.3966750     3
## 143 2.313867 0.2879660     3
## 144 2.654177 0.2335960     3
## 147 2.187521 0.2118460     3

Visualize the relation between log(Zn) and possible predictors:

plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()

plot(logZn ~ ffreq, data=ds, xlab="Flood frequency class", boxwex=0.4)

Depending on the random sample, there should be some relation: further distances and less-frequent flooding are associated with lower metal concentrations.

2 Regression on a continuous predictor

Model form: \(y = \beta_0 + \beta_1 x + \varepsilon; \;\; \mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2)\)

Each observation \(i\): \(y_i = \beta_0 + \beta_1 x_i + r_i\) where \(r_i\) is the residual – i.e., what the model does not explain.

There is only one set of coefficients \(\beta_p\), which must be adjusted to fit all the observations.

Once the model is fit, i.e., we compute the \(\beta_p\), we can compute the fitted values that would be expected at each observation point: \(\widehat{y}_i = \beta_0 + \beta_1 x_i\), and from that the residuals: \(y_i - (\beta_0 + \beta_1 x_i)\), i.e., \((y_i - \widehat{y}_i)\), actual less fitted.

So the question is: what are the ‘best’ values of \(\beta_p\)?

2.1 Ordinary Least Squares (OLS)

One criterion is to select \(\beta_0\), \(\beta_1\) to minimize the sum of squared residuals: \(\sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\).

Note:

  • All observations are equally weighted
  • This implies that the residuals are independent – they do not co-vary
    • Not true for most time series and spatial data
  • This implies that all residuals are from the same distribution
    • Not true if some observations have more or less variance than others
    • For example, different measurement precisions
  • The residuals are squared so that under- and over-fits are equally penalized.

Given this criterion, the optimal values of the coefficients \(\beta_0\), \(\beta_1\) can be found by simple minimization:

  1. Minimize \(\sum_i \varepsilon_i^2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\)
  2. Take partial derivatives with respect to the two parameters; solve system of two simultaneous equations

Solution:

\[ \begin{aligned} \widehat{\beta}_1 &= \frac{ \sum_i (x_i - \overline{x})(y_i - \overline{y}) } { \sum_i (x_i - \overline{x})^2 } \\ \widehat{\beta}_0 &= \overline{y} - \widehat{\beta}_1 \overline{x} \end{aligned} \] Where:

  • \(\overline{x}, \overline{y}\) are the means
  • \(\widehat{\beta}_0\) centres the regression on \((\overline{x}, \overline{y})\)
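
As a quick check, these formulas can be evaluated directly on the sample (a minimal sketch using the ds data frame built above; the results should match the lm coefficients computed below):

(b1 <- sum((ds$dist - mean(ds$dist)) * (ds$logZn - mean(ds$logZn))) /
       sum((ds$dist - mean(ds$dist))^2))    # slope: centred cross-product / centred sum of squares
(b0 <- mean(ds$logZn) - b1 * mean(ds$dist)) # intercept: centres the line on the means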

This can also be expressed in terms of variances and covariances:

\[ \widehat{\beta}_1 = \frac{s_{xy}}{s_x^2} \] Where:

  • \(s_{xy}\) is the sample covariance between predictor (independent) and predictand (dependent).
  • \(s_x^2\) is the sample variance of the independent variable – all the error is assumed to be in the dependent variable.

These are unbiased estimates of the population variance/covariance:

\[ \widehat{\beta}_1 = \frac{\mathrm{Covar}(x,y)}{\mathrm{Var}(x)} \]
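
A one-line check with R's cov and var functions; the \((n-1)\) denominators cancel, so the ratio equals the slope \(\widehat{\beta}_1\) computed above:

cov(ds$dist, ds$logZn) / var(ds$dist)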

2.2 Matrix notation

This can be more efficiently written in matrix notation: \(\mathbf{y} = \mathbf{X} \mathbf{\beta} + \varepsilon\)

Where:

  • \(\mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})\), i.e., residuals are independently drawn from a normal distribution with \(0\) mean and variance \(\sigma^2\)
  • \(\mathbf{X}\) is the design matrix
  • \(\mathbf{\beta}\) is the coefficient vector
  • \(\mathbf{I}\) is the identity matrix: diagonals all 1, off-diagonals all 0

This allows extension to multiple variables and to categorical variables, as we will see below.

Notice that \(\mathbf{I}\) means there is no correlation among the residuals. If this is not true, generalized least squares (GLS) must be used.

Solution by minimizing the sum of squares of the residuals:

\[ \begin{aligned} S &= \varepsilon^T \varepsilon \\ S &= (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta}) \end{aligned} \]

Note that \((\mathbf{X}^T \mathbf{X})\) is the matrix equivalent of \(s_x^2\), i.e., the variance of the predictor \(x\).

This expands to: \[ \begin{aligned} S &= \mathbf{y}^T\mathbf{y} - \beta^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \nonumber \\ S &= \mathbf{y}^T\mathbf{y} - 2 \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \end{aligned} \]

Minimize by finding the partial derivative with respect to the unknown coefficients \(\mathbf{\beta}\), setting this equal to \(\mathbf{0}\), and solving:

\[ \begin{aligned} \frac{\partial}{\partial \mathbf{\beta}^T} S &= -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ \mathbf{0} &= -\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ (\mathbf{X}^T \mathbf{X} ) \mathbf{\beta} &= \mathbf{X}^T \mathbf{y} \\ (\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \mathbf{X}) \mathbf{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \\ \mathbf{\widehat{\beta}}_{\mathrm{OLS}} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \end{aligned} \]

2.3 Computation

The model matrix \(\mathbf{X}\) is computed from the model formula, with the model.matrix function:

(X <- model.matrix(logZn ~ dist, data=ds))
##     (Intercept)      dist
## 19            1 0.0000000
## 44            1 0.2118430
## 47            1 0.1900860
## 64            1 0.0316663
## 163           1 0.2118430
## 92            1 0.1139410
## 114           1 0.5757520
## 115           1 0.5814840
## 119           1 0.2498520
## 161           1 0.7716980
## 137           1 0.2281230
## 141           1 0.3966750
## 143           1 0.2879660
## 144           1 0.2335960
## 147           1 0.2118460
## attr(,"assign")
## [1] 0 1
dim(X)
## [1] 15  2

Note the column of \(1\)s, which represents the intercept, and the column dist, which contains the observed normalized distances at each of the selected observations.

Its ‘square’ \(\mathbf{X}^T \mathbf{X}\), using the t “transpose” function and the “%*%” matrix multiplication operator:

t(X)
##             19       44       47        64      163       92      114      115
## (Intercept)  1 1.000000 1.000000 1.0000000 1.000000 1.000000 1.000000 1.000000
## dist         0 0.211843 0.190086 0.0316663 0.211843 0.113941 0.575752 0.581484
##                  119      161      137      141      143      144      147
## (Intercept) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
## dist        0.249852 0.771698 0.228123 0.396675 0.287966 0.233596 0.211846
## attr(,"assign")
## [1] 0 1
(XTX <- t(X)%*%X)
##             (Intercept)     dist
## (Intercept)   15.000000 4.296371
## dist           4.296371 1.859192
dim(XTX)
## [1] 2 2

This is the product-crossproduct matrix of the two predictor variables:

sum(rep(1,n)^2); sum((ds$dist)^2); sum(rep(1,n)*ds$dist)
## [1] 15
## [1] 1.859192
## [1] 4.296371

Its inverse, using the solve matrix solution function:

(XTXinv <- solve(XTX))
##             (Intercept)       dist
## (Intercept)   0.1971765 -0.4556514
## dist         -0.4556514  1.5908240
dim(XTXinv)
## [1] 2 2

Compute \(\mathbf{X}^T \mathbf{y}\), the matrix equivalent of \(s_{xy}\), i.e., the covariance between two predictors (intercept and normalized distance) and predictand:

y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## (Intercept) 36.632955
## dist         9.967598
dim(XTy)
## [1] 2 1

And finally the solution:

(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.6814083
## dist        -0.8351626
dim(beta)
## [1] 2 1

This is the same result as using the lm function:

coef(m1 <- lm(logZn ~ dist, data=ds))
## (Intercept)        dist 
##   2.6814083  -0.8351626
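
With the coefficients in hand, the fitted values \(\widehat{\mathbf{y}} = \mathbf{X}\mathbf{\widehat{\beta}}\) and the residuals \(\mathbf{y} - \widehat{\mathbf{y}}\) can also be computed from the matrices; a short check against those stored in the fitted lm object, using the X, beta and y objects from above:

y.hat <- X %*% beta   # fitted values from the matrix solution
all.equal(as.numeric(y.hat), as.numeric(fitted(m1)))
all.equal(as.numeric(y - y.hat), as.numeric(residuals(m1)))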

2.4 Model diagnostics

Recall, in OLS the residuals are assumed to be identically and independently distributed, from a normal distribution.

First see if they appear to be normally distributed. We use the qqPlot function from the car package, which shows the quantiles of the normal distribution against the residuals, with confidence intervals.

library(car)
## Loading required package: carData
qqPlot(residuals(m1))

##  47 147 
##   3  15

Second, see if there is any relation between the residuals and fitted values, and also if the variance of the residuals is the same across the range of fitted values:

plot(residuals(m1) ~ fitted(m1))
grid()
abline(h=0, col="blue")
lines(stats::lowess(residuals(m1) ~ fitted(m1)), col="red")

There are not so many points here, so it’s hard to draw a firm conclusion. The empirical smoother, using lowess (“locally-weighted scatterplot smoothing”), helps visualize any relation.

Third, look for observations with high influence on the coefficients; these may be from a different population (process). The leverage is measured by the hat value, which measures the overall influence of a single observation on the predictions.

(sort(h <- hatvalues(m1)))
##        143        119        144        137        147         44        163 
## 0.06667045 0.06879450 0.07110646 0.07207403 0.07551481 0.07551553 0.07551553 
##         47        141         92         64         19        114        115 
## 0.08143135 0.08600332 0.11399471 0.16991412 0.19717650 0.19983495 0.20516374 
##        161 
## 0.44129001
plot(ds$dist, h, xlab="Normalized distance", ylab="Hat value", col=ds$ffreq, pch=20,
     xlim=c(0,1), ylim=c(0,0.8))
text(ds$dist, h, labels=names(residuals(m1)), pos=3)

The highest hat values are at the extremes of the predictor.

The influence of each observation can be measured by its Cook’s distance, which measures how much \(\beta\) would change if the observation were omitted. It is defined as:

\[ \begin{aligned} D_i &= \frac{r^2_i}{\mathrm{tr}(\mathbf{H})} \frac{h_i}{1-h_i} \end{aligned} \] where \(\mathbf{H}\) is the hat matrix, \(h_i\) is its \(i\)th diagonal, and \(r_i\) is the standardized residual, i.e., \((y_i - \widehat{y}_i) / (s \sqrt{1 - h_i})\). This is based on \(s^2\), the residual variance after the model fit.

sort(d <- cooks.distance(m1))
##         163         143         137          92         114         144 
## 0.003515888 0.014177974 0.021176635 0.022832342 0.024523949 0.026651287 
##         141         119         161          44          64         147 
## 0.027107622 0.058115950 0.065334629 0.068013566 0.076053434 0.101882099 
##          19         115          47 
## 0.120016966 0.126792043 0.135757328
plot(fitted(m1), d, xlab="Fitted value", ylab="Cook's distance", col=ds$ffreq, pch=20)
text(fitted(m1), d, labels=names(residuals(m1)), pos=4)
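
As a check on this definition, Cook's distance can be computed by hand from the residuals, the residual variance, and the hat values h computed above (a sketch; cooks.distance did this for us):

s2 <- summary(m1)$sigma^2                     # residual variance after the model fit
r.std <- residuals(m1) / sqrt(s2 * (1 - h))   # standardized residuals
d.hand <- (r.std^2 / sum(h)) * (h / (1 - h))  # sum(h) = tr(H) = number of coefficients
all.equal(as.numeric(d.hand), as.numeric(d))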

2.5 The hat matrix

The hat matrix is so-named because it ‘puts the hats on’ the fitted values: from observed \(\mathbf{y}\) to best-fit \(\mathbf{\widehat{y}}\).

It is defined as:

\[ \begin{aligned} \mathbf{H} &= \mathbf{X} {(\mathbf{X}' \mathbf{X})}^{-1} \mathbf{X}' \end{aligned} \]

The hat matrix gives the weights by which each original observation is multiplied when computing the fitted values \(\mathbf{\widehat{y}}\). This means that if a high-leverage observation were changed, the fit would change substantially. In terms of vector spaces, the \(\mathbf{H}\) matrix gives the orthogonal projection of the observation vector \(\mathbf{y}\) into the column space of the design (model) matrix \(\mathbf{X}\).

The overall leverage of each observation \(y_i\) is given by its hat value, which is the sum of the squared entries of the hat matrix associated with the observation. This is also the diagonal of the hat matrix, because of the fact that \(\mathbf{H} = \mathbf{H'} \mathbf{H} = \mathbf{H^2}\):

\[ \begin{aligned} h_{ii} &= \sum_{j=1}^n h_{ij}^2 \end{aligned} \]
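
The hat matrix of model m1 can be computed directly from its design matrix, and its diagonal checked against the hatvalues function (a short sketch):

X1 <- model.matrix(m1)                       # design matrix of the fitted model
H <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)    # the hat matrix
all.equal(as.numeric(diag(H)), as.numeric(hatvalues(m1)))  # diagonal = hat values
all.equal(as.numeric(diag(H)), as.numeric(rowSums(H^2)))   # = sum of squared entries per row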

3 Regression on a categorical predictor

There are various ways to set up the model matrix to separate categories. These are viewed and set with the contrasts function:

contrasts(ds$ffreq)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1

This is the default “treatment contrasts” matrix; the term “treatment” comes from experimental design. Here the “treatment” was applied by nature, not by humans. Treatment contrasts are sometimes called “dummy variables”.

The model matrix uses these contrasts:

(X <- model.matrix(logZn ~ ffreq, data=ds))
##     (Intercept) ffreq2 ffreq3
## 19            1      0      0
## 44            1      0      0
## 47            1      0      0
## 64            1      0      0
## 163           1      0      0
## 92            1      1      0
## 114           1      1      0
## 115           1      1      0
## 119           1      1      0
## 161           1      1      0
## 137           1      0      1
## 141           1      0      1
## 143           1      0      1
## 144           1      0      1
## 147           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"

Notice how the first category is included in the intercept – this is the base case, and the other two classes are contrasted with it. The model matrix therefore has two columns, beyond the intercept, to represent the two additional classes.

The solution is exactly as before, by OLS:

(XTX <- t(X)%*%X)
##             (Intercept) ffreq2 ffreq3
## (Intercept)          15      5      5
## ffreq2                5      5      0
## ffreq3                5      0      5
dim(XTX)
## [1] 3 3

This is the product-crossproduct matrix among the predictors. Note the relation to the number of observations in each class.

XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                 [,1]
## (Intercept) 36.63296
## ffreq2      11.68728
## ffreq3      11.69664
(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.6498057
## ffreq2      -0.3123492
## ffreq3      -0.3104768

And again, these are the coefficients found by lm:

coef(m2 <- lm(logZn ~ ffreq, data=ds))
## (Intercept)      ffreq2      ffreq3 
##   2.6498057  -0.3123492  -0.3104768

The first coefficient is the mean of the dependent variable for observations in the first class:

beta[1]
## [1] 2.649806
mean(ds$logZn[ds$ffreq==1])
## [1] 2.649806

The others are the differences between the respective class means and this first-class mean:

beta[2]
## [1] -0.3123492
mean(ds$logZn[ds$ffreq==2])
## [1] 2.337456
mean(ds$logZn[ds$ffreq==2]) - beta[1]
## [1] -0.3123492

We can fit the model without an intercept by adding -1 to the model formula:

(X <- model.matrix(logZn ~ ffreq - 1, data=ds))
##     ffreq1 ffreq2 ffreq3
## 19       1      0      0
## 44       1      0      0
## 47       1      0      0
## 64       1      0      0
## 163      1      0      0
## 92       0      1      0
## 114      0      1      0
## 115      0      1      0
## 119      0      1      0
## 161      0      1      0
## 137      0      0      1
## 141      0      0      1
## 143      0      0      1
## 144      0      0      1
## 147      0      0      1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"

Now each observation has a 1 only in its own ‘dummy’ variable.

(XTX <- t(X)%*%X)
##        ffreq1 ffreq2 ffreq3
## ffreq1      5      0      0
## ffreq2      0      5      0
## ffreq3      0      0      5
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##            [,1]
## ffreq1 13.24903
## ffreq2 11.68728
## ffreq3 11.69664
(beta <- XTXinv%*%XTy)
##            [,1]
## ffreq1 2.649806
## ffreq2 2.337456
## ffreq3 2.339329
coef(m3 <- lm(logZn ~ ffreq - 1, data=ds))
##   ffreq1   ffreq2   ffreq3 
## 2.649806 2.337456 2.339329

Now the coefficients are the means of each class.
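
This can be confirmed in one step with tapply; the class means should equal the coefficients of m3:

tapply(ds$logZn, ds$ffreq, mean)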

4 Mixed models

The same formulation works with any combination of predictors. For example, both the flood frequency and distance:

(X <- model.matrix(logZn ~ ffreq + dist, data=ds))
##     (Intercept) ffreq2 ffreq3      dist
## 19            1      0      0 0.0000000
## 44            1      0      0 0.2118430
## 47            1      0      0 0.1900860
## 64            1      0      0 0.0316663
## 163           1      0      0 0.2118430
## 92            1      1      0 0.1139410
## 114           1      1      0 0.5757520
## 115           1      1      0 0.5814840
## 119           1      1      0 0.2498520
## 161           1      1      0 0.7716980
## 137           1      0      1 0.2281230
## 141           1      0      1 0.3966750
## 143           1      0      1 0.2879660
## 144           1      0      1 0.2335960
## 147           1      0      1 0.2118460
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
##             (Intercept)   ffreq2   ffreq3     dist
## (Intercept)   15.000000 5.000000 5.000000 4.296371
## ffreq2         5.000000 5.000000 0.000000 2.292727
## ffreq3         5.000000 0.000000 5.000000 1.358206
## dist           4.296371 2.292727 1.358206 1.859192
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## (Intercept) 36.632955
## ffreq2      11.687282
## ffreq3      11.696644
## dist         9.967598
(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.7511303
## ffreq2      -0.0537483
## ffreq3      -0.1985825
## dist        -0.7849289
coef(m4 <- lm(logZn ~ ffreq + dist, data=ds))
## (Intercept)      ffreq2      ffreq3        dist 
##   2.7511303  -0.0537483  -0.1985825  -0.7849289

Here the intercept is the value at distance 0 and in flood frequency class 1.
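
To see this, predict at a hypothetical point at distance 0 in flood frequency class 1; the result should equal the intercept of m4:

predict(m4, newdata = data.frame(dist = 0, ffreq = factor("1", levels = levels(ds$ffreq))))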

Their interaction:

(X <- model.matrix(logZn ~ ffreq * dist, data=ds))
##     (Intercept) ffreq2 ffreq3      dist ffreq2:dist ffreq3:dist
## 19            1      0      0 0.0000000    0.000000    0.000000
## 44            1      0      0 0.2118430    0.000000    0.000000
## 47            1      0      0 0.1900860    0.000000    0.000000
## 64            1      0      0 0.0316663    0.000000    0.000000
## 163           1      0      0 0.2118430    0.000000    0.000000
## 92            1      1      0 0.1139410    0.113941    0.000000
## 114           1      1      0 0.5757520    0.575752    0.000000
## 115           1      1      0 0.5814840    0.581484    0.000000
## 119           1      1      0 0.2498520    0.249852    0.000000
## 161           1      1      0 0.7716980    0.771698    0.000000
## 137           1      0      1 0.2281230    0.000000    0.228123
## 141           1      0      1 0.3966750    0.000000    0.396675
## 143           1      0      1 0.2879660    0.000000    0.287966
## 144           1      0      1 0.2335960    0.000000    0.233596
## 147           1      0      1 0.2118460    0.000000    0.211846
## attr(,"assign")
## [1] 0 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
##             (Intercept)   ffreq2   ffreq3      dist ffreq2:dist ffreq3:dist
## (Intercept)   15.000000 5.000000 5.000000 4.2963713    2.292727   1.3582060
## ffreq2         5.000000 5.000000 0.000000 2.2927270    2.292727   0.0000000
## ffreq3         5.000000 0.000000 5.000000 1.3582060    0.000000   1.3582060
## dist           4.296371 2.292727 1.358206 1.8591921    1.340540   0.3917614
## ffreq2:dist    2.292727 2.292727 0.000000 1.3405404    1.340540   0.0000000
## ffreq3:dist    1.358206 0.000000 1.358206 0.3917614    0.000000   0.3917614
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## (Intercept) 36.632955
## ffreq2      11.687282
## ffreq3      11.696644
## dist         9.967598
## ffreq2:dist  5.182236
## ffreq3:dist  3.156250
(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.8902949
## ffreq2      -0.2723506
## ffreq3      -0.3004762
## dist        -1.8629915
## ffreq2:dist  1.2513011
## ffreq3:dist  0.9408566
coef(m5 <- lm(logZn ~ ffreq * dist, data=ds))
## (Intercept)      ffreq2      ffreq3        dist ffreq2:dist ffreq3:dist 
##   2.8902949  -0.2723506  -0.3004762  -1.8629915   1.2513011   0.9408566

A nested model without intercept:

(X <- model.matrix(logZn ~ ffreq/dist - 1, data=ds))
##     ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## 19       1      0      0   0.0000000    0.000000    0.000000
## 44       1      0      0   0.2118430    0.000000    0.000000
## 47       1      0      0   0.1900860    0.000000    0.000000
## 64       1      0      0   0.0316663    0.000000    0.000000
## 163      1      0      0   0.2118430    0.000000    0.000000
## 92       0      1      0   0.0000000    0.113941    0.000000
## 114      0      1      0   0.0000000    0.575752    0.000000
## 115      0      1      0   0.0000000    0.581484    0.000000
## 119      0      1      0   0.0000000    0.249852    0.000000
## 161      0      1      0   0.0000000    0.771698    0.000000
## 137      0      0      1   0.0000000    0.000000    0.228123
## 141      0      0      1   0.0000000    0.000000    0.396675
## 143      0      0      1   0.0000000    0.000000    0.287966
## 144      0      0      1   0.0000000    0.000000    0.233596
## 147      0      0      1   0.0000000    0.000000    0.211846
## attr(,"assign")
## [1] 1 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
##                ffreq1   ffreq2   ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## ffreq1      5.0000000 0.000000 0.000000   0.6454383    0.000000   0.0000000
## ffreq2      0.0000000 5.000000 0.000000   0.0000000    2.292727   0.0000000
## ffreq3      0.0000000 0.000000 5.000000   0.0000000    0.000000   1.3582060
## ffreq1:dist 0.6454383 0.000000 0.000000   0.1268904    0.000000   0.0000000
## ffreq2:dist 0.0000000 2.292727 0.000000   0.0000000    1.340540   0.0000000
## ffreq3:dist 0.0000000 0.000000 1.358206   0.0000000    0.000000   0.3917614
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## ffreq1      13.249028
## ffreq2      11.687282
## ffreq3      11.696644
## ffreq1:dist  1.629111
## ffreq2:dist  5.182236
## ffreq3:dist  3.156250
(beta <- XTXinv%*%XTy)
##                   [,1]
## ffreq1       2.8902949
## ffreq2       2.6179443
## ffreq3       2.5898187
## ffreq1:dist -1.8629915
## ffreq2:dist -0.6116904
## ffreq3:dist -0.9221350
coef(m6 <- lm(logZn ~ ffreq/dist - 1, data=ds))
##      ffreq1      ffreq2      ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist 
##   2.8902949   2.6179443   2.5898187  -1.8629915  -0.6116904  -0.9221350

This gives directly the means of each class at distance 0, and a coefficient showing how much the metal concentration decreases with each normalized distance away from the river.
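
Because the nested model fits a separate intercept and slope for each class, its coefficients are the same as those from separate per-class simple regressions; a quick check for class 1:

coef(lm(logZn ~ dist, data = ds, subset = (ffreq == "1")))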

5 Robust regression

A robust inference is one that is not greatly disturbed by:

  • a few unusual observations (‘outliers’) in the dataset
  • violations of the model assumptions

For linear modelling, the underlying model form should still be linear, but there may be unusual observations that obscure the general relation.

An example of the first case is a contaminated dataset, where some observations result from a different process than that which produced the others. The issue here is that the observations do not all come from the same population, i.e., are not the result of a single process.

Many of the problems with parametric regression can be avoided by fitting a so-called robust linear regression, which attempts to fit the majority of the observations and ignore the unusual cases. Since we do not know in advance if there is contamination or problems with the model assumptions, there is no way to select an optimum method. See the discussion in J Fox and S Weisberg. Robust regression in R: An appendix to “An R companion to applied regression, second edition”. 15-Dec-2010. URL http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Robust-Regression.pdf

One method is called least trimmed squares (LTS). It estimates the coefficient vector \(\mathbf{\beta}\) by the criterion of minimizing

\[ \sum_{i=1}^q \big| \;y_i - \mathbf{X}_i \mathbf{\beta} \; \big|_{(i)}^2 \]

That is, the sum of the squared residuals over a subset of the observations; the subscript \((i)\) indicates that the absolute residuals are sorted in increasing order. Only the smallest \(q\) residuals are used; by default \(q = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor\), where \(p\) is the number of predictors (including the intercept) and \(n\) is the number of observations. This is about half the observations.

To determine the observations to use, sort the squared residuals from smallest \((1)\) to largest \((n)\):

\[ (e^2)_{(1)}, (e^2)_{(2)}, \ldots (e^2)_{(n)} \] and select the observations with the smallest \(q\) of these to re-fit the model.

p <- 2 # number of predictors, including the intercept
(q <- floor(n/2) + floor((p+1)/2)) # number of 'good' observations
## [1] 8

In our example, 8 ‘good’ observations will be used; the other 7 observations with the largest residuals will be discarded. This is the ‘trimming’.

However, it is not so simple: after re-fitting with the selected observations, the residuals of all the observations must be re-computed with the new coefficients, and the \(q\) smallest of these may not come from the same observations as before. So the procedure is iterated until the set of retained observations no longer changes.

This is solved by the lqs function found in the MASS “Modern Applied Statistics with S” package, explained in the textbook: Venables, W., & Ripley, B. (2002). Modern Applied Statistics with S. Fourth Edition. Springer. This function has several methods, of which least trimmed squares is the default and most advised.

We show how to compute one iteration of the robust regression line, starting from the OLS line:

Find the 7 largest absolute residuals:

m1 <- lm(logZn ~ dist, data=ds)
(r0.s <- sort(r0.a <- abs(residuals(m1))))
##        163        161        114         92        143        137        141 
## 0.05888070 0.06345561 0.08273952 0.11704238 0.12704260 0.14846578 0.15146304 
##         64        144        115         19        119         44        147 
## 0.16392366 0.16785893 0.18443669 0.18487908 0.25263312 0.25897224 0.31696168 
##         47 
## 0.35008329
(hi.res <- r0.s[(q+1):n])
##       144       115        19       119        44       147        47 
## 0.1678589 0.1844367 0.1848791 0.2526331 0.2589722 0.3169617 0.3500833
(hi.res.ix <- which(abs(residuals(m1)) %in% hi.res))
## [1]  1  2  3  8  9 14 15
ds.trim <- ds[-hi.res.ix,]
dim(ds); dim(ds.trim)
## [1] 15  3
## [1] 8 3
m1.r <- lm(logZn ~ dist, data=ds.trim)

Compare the regression coefficients of the OLS and robust models, and the percent change:

coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept)        dist 
##   2.6808001  -0.8557252
## (Intercept)        dist 
##   2.6814083  -0.8351626
## (Intercept)        dist 
##         0.0         2.5

Now re-compute the residuals using the predict method, the first-iteration robust model, and the full dataset:

(r1.s <- sort(r1.a <- abs(ds$logZn - predict(m1.r, newdata=ds))))
##        163        161        114         92        143        141        137 
## 0.05391654 0.07993184 0.09518659 0.11999343 0.12051315 0.14269825 0.14316685 
##         64        144         19        115        119         44        147 
## 0.16518293 0.17327039 0.18548721 0.19700162 0.24688739 0.25400807 0.31199745 
##         47 
## 0.35460008
r0.s
##        163        161        114         92        143        137        141 
## 0.05888070 0.06345561 0.08273952 0.11704238 0.12704260 0.14846578 0.15146304 
##         64        144        115         19        119         44        147 
## 0.16392366 0.16785893 0.18443669 0.18487908 0.25263312 0.25897224 0.31696168 
##         47 
## 0.35008329

In general the residuals after the re-fit will not be the same, nor in the same sorted order, as from the OLS fit. Check if the same observations are discarded:

hi.res.1 <- r1.s[(q+1):n]
(hi.res.ix.1 <- which(r1.a %in% hi.res.1))
## [1]  1  2  3  8  9 14 15
hi.res.ix
## [1]  1  2  3  8  9 14 15
intersect(hi.res.ix, hi.res.ix.1)
## [1]  1  2  3  8  9 14 15

In this case 7 of the 7 largest absolute residuals are from the same observations in both the OLS regression and the first iteration of the LTS regression.

If these sets are the same, this is the robust regression. If not, the process is repeated with the new set of ‘good’ observations, until the sets are the same.
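
A minimal sketch of this iteration, using the objects defined above (note that lqs, used next, does not follow this simple algorithm; it searches many candidate subsets, so its result may differ):

keep <- order(abs(residuals(m1)))[1:q]   # the q smallest absolute OLS residuals
for (it in 1:20) {                       # guard against endless cycling
  m.tmp <- lm(logZn ~ dist, data = ds[keep, ])
  keep.new <- order(abs(ds$logZn - predict(m.tmp, newdata = ds)))[1:q]
  if (setequal(keep, keep.new)) break    # stop when the retained set is stable
  keep <- keep.new
}
coef(m.tmp)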

The robust line can be computed directly with the lqs function, with the optional method argument set to "lts" for least trimmed squares; the fitting is then handled internally, without manual iteration.

require(MASS)
## Loading required package: MASS
m1.r <- lqs(logZn ~ dist, data=ds, method = "lts")
coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept)        dist 
##   2.2049524   0.1939035
## (Intercept)        dist 
##   2.6814083  -0.8351626
## (Intercept)        dist 
##       -17.8      -123.2

Depending on the sample, the regression coefficients can change quite a lot.

Visualize the two regression lines:

plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
abline(m1)
abline(m1.r, lty=2)
legend("topright", c("OLS", "robust"), lty=1:2)