This R Markdown file shows how linear models are computed by ordinary least squares (OLS) and by a robust regression variant of OLS.
We use a subset of the meuse soil pollution dataset
found in the sp (spatial objects) package. Note that each
time the source R Markdown file is executed, a different random set of
observations is selected from the full dataset, so the results will be
different.
library(sp)
data(meuse)
# if you want to see its metadata
# help(meuse)
dim(meuse)
## [1] 155 14
Add computed fields with the log-transformed values of the heavy metals:
meuse$logZn <- log10(meuse$zinc)
meuse$logCu <- log10(meuse$copper)
meuse$logPb <- log10(meuse$lead)
meuse$logCd <- log10(meuse$cadmium)
dim(meuse)
## [1] 155 18
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [15] "logZn" "logCu" "logPb" "logCd"
table(meuse$ffreq) # observations in each flood frequency class
##
## 1 2 3
## 84 48 23
To be able to show the matrices, we limit the data to only a few rows
and three selected columns: a dependent variable logZn, a
continuous predictor dist (normalized distance to the river
Meuse), and a categorical predictor ffreq (flooding
frequency).
To ensure we have some observations from each flood frequency class,
we take a stratified sample based on this category. This uses the
strata method from the sampling package. We
choose an equal number of observations from each of the three
classes.
The n.s parameter is the sample size from each stratum;
you can change this to see the effect of taking larger or smaller
samples.
n.s <- 5 # sample size from each stratum
(n <- n.s * length(levels(meuse$ffreq))) # total sample size
## [1] 15
library(sampling)
# indices of the selected rows in the data frame
(ix <- strata(meuse, stratanames="ffreq", size=rep(n.s,3), method="srswor")$ID_unit)
## [1] 19 43 46 63 84 88 109 110 114 131 133 136 138 139 142
These are the selected rows; they will differ with each random sample.
# the reduced data frame
(ds <- meuse[ix, c("logZn","dist", "ffreq")])
## logZn dist ffreq
## 19 2.866287 0.0000000 1
## 44 2.245513 0.2118430 1
## 47 2.872739 0.1900860 1
## 64 2.818885 0.0316663 1
## 163 2.445604 0.2118430 1
## 92 2.703291 0.1139410 2
## 114 2.283301 0.5757520 2
## 115 2.380211 0.5814840 2
## 119 2.220108 0.2498520 2
## 161 2.100371 0.7716980 2
## 137 2.342423 0.2281230 3
## 141 2.198657 0.3966750 3
## 143 2.313867 0.2879660 3
## 144 2.654177 0.2335960 3
## 147 2.187521 0.2118460 3
Visualize the relation with possible predictors of log(Zn):
plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
plot(logZn ~ ffreq, data=ds, xlab="Flood frequency class", boxwex=0.4)
Depending on the random sample, there should be some relation: further distances and less-frequent flooding are associated with lower metal concentrations.
Model form: \(y = \beta_0 + \beta_1 x + \varepsilon; \;\; \mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2)\)
Each observation \(i\): \(y_i = \beta_0 + \beta_1 x_i + r_i\) where \(r_i\) is the residual – i.e., what the model does not explain.
There is only one set of coefficients \(\beta_p\), which must be fit to all observations simultaneously.
Once the model is fit, i.e., we compute the \(\beta_p\), we can compute the fitted values that would be expected at each observation point: \(\widehat{y}_i = \beta_0 + \beta_1 x_i\), and from that the residuals: \(y_i - (\beta_0 + \beta_1 x_i)\), i.e., \((y_i - \widehat{y}_i)\), actual less fitted.
So the question is: what are the 'best' values of \(\beta_p\)?
One criterion is to select \(\beta_0\), \(\beta_1\) to minimize sum of squared residuals: \(\sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\).
Given this criterion, the optimal values of the coefficients \(\beta_0\), \(\beta_1\) can be found by simple minimization:
Solution:
\[ \begin{aligned} \widehat{\beta}_1 &= \frac{ \sum_i (x_i - \overline{x})(y_i - \overline{y}) } { \sum_i (x_i - \overline{x})^2 } \\ \widehat{\beta}_0 &= \overline{y} - \widehat{\beta}_1 \overline{x} \end{aligned} \] where \(\overline{x}\) and \(\overline{y}\) are the sample means of the predictor and the predictand, and the sums are over all \(n\) observations.
This can also be expressed in terms of variances and covariances:
\[ \widehat{\beta}_1 = \frac{s_{xy}}{s_x^2} \] where \(s_{xy}\) is the sample covariance of \(x\) and \(y\), and \(s_x^2\) is the sample variance of \(x\).
These are unbiased estimates of the population variance/covariance:
\[ \widehat{\beta}_1 = \frac{\mathrm{Covar}(x,y)}{\mathrm{Var}(x)} \]
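As a quick numerical check, these formulas can be evaluated directly with the sample cov, var, and mean functions on the ds data frame defined above; the results should match the lm coefficients computed below.
# slope from sample covariance and variance, intercept from the means
(b1 <- cov(ds$dist, ds$logZn) / var(ds$dist))
(b0 <- mean(ds$logZn) - b1 * mean(ds$dist))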
This can be more efficiently written in matrix notation: \(\mathbf{y} = \mathbf{X} \mathbf{\beta} + \varepsilon\)
where \(\mathbf{y}\) is the \(n \times 1\) vector of observations, \(\mathbf{X}\) is the \(n \times p\) design (model) matrix, \(\mathbf{\beta}\) is the \(p \times 1\) vector of coefficients, and \(\mathbf{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\) is the vector of residuals.
This allows extension to multiple variables and to categorical variables, as we will see below.
Notice that \(\mathbf{I}\) means there is no correlation among the residuals. If this is not true, generalized least squares (GLS) must be used.
Solution by minimizing the sum of squares of the residuals:
\[ \begin{aligned} S &= \varepsilon^T \varepsilon \\ S &= (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta}) \end{aligned} \]
Note that \((\mathbf{X}^T \mathbf{X})\) is the matrix equivalent of \(s_x^2\), i.e., the variance of the predictor \(x\).
This expands to: \[ \begin{aligned} S &= \mathbf{y}^T\mathbf{y} - \beta^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \nonumber \\ S &= \mathbf{y}^T\mathbf{y} - 2 \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \end{aligned} \]
Minimize by finding the partial derivative with respect to the unknown coefficients \(\mathbf{\beta}\), setting this equal to \(\mathbf{0}\), and solving:
\[ \begin{aligned} \frac{\partial}{\partial \mathbf{\beta}^T} S &= -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ \mathbf{0} &= -\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ (\mathbf{X}^T \mathbf{X} ) \mathbf{\beta} &= \mathbf{X}^T \mathbf{y} \\ (\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \mathbf{X}) \mathbf{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \\ \mathbf{\widehat{\beta}}_{\mathrm{OLS}} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \end{aligned} \]
The model matrix \(\mathbf{X}\) is
computed from the model formula, with the model.matrix
function:
(X <- model.matrix(logZn ~ dist, data=ds))
## (Intercept) dist
## 19 1 0.0000000
## 44 1 0.2118430
## 47 1 0.1900860
## 64 1 0.0316663
## 163 1 0.2118430
## 92 1 0.1139410
## 114 1 0.5757520
## 115 1 0.5814840
## 119 1 0.2498520
## 161 1 0.7716980
## 137 1 0.2281230
## 141 1 0.3966750
## 143 1 0.2879660
## 144 1 0.2335960
## 147 1 0.2118460
## attr(,"assign")
## [1] 0 1
dim(X)
## [1] 15 2
Note the column of \(1\)s to represent the mean (intercept), and the column dist, which contains the observed normalized distances at the selected observations.
Its crossproduct \(\mathbf{X}^T \mathbf{X}\), computed with the t "transpose" function and the "%*%"
matrix multiplication operator:
t(X)
## 19 44 47 64 163 92 114 115
## (Intercept) 1 1.000000 1.000000 1.0000000 1.000000 1.000000 1.000000 1.000000
## dist 0 0.211843 0.190086 0.0316663 0.211843 0.113941 0.575752 0.581484
## 119 161 137 141 143 144 147
## (Intercept) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
## dist 0.249852 0.771698 0.228123 0.396675 0.287966 0.233596 0.211846
## attr(,"assign")
## [1] 0 1
(XTX <- t(X)%*%X)
## (Intercept) dist
## (Intercept) 15.000000 4.296371
## dist 4.296371 1.859192
dim(XTX)
## [1] 2 2
This is the product-crossproduct matrix of the two predictor variables:
sum(rep(1,n)^2); sum((ds$dist)^2); sum(rep(1,n)*ds$dist)
## [1] 15
## [1] 1.859192
## [1] 4.296371
Its inverse, using the solve matrix solution
function:
(XTXinv <- solve(XTX))
## (Intercept) dist
## (Intercept) 0.1971765 -0.4556514
## dist -0.4556514 1.5908240
dim(XTXinv)
## [1] 2 2
Compute \(\mathbf{X}^T \mathbf{y}\), the matrix equivalent of \(s_{xy}\), i.e., the covariance between the two predictors (intercept and normalized distance) and the predictand:
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 36.632955
## dist 9.967598
dim(XTy)
## [1] 2 1
And finally the solution:
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.6814083
## dist -0.8351626
dim(beta)
## [1] 2 1
This is the same result as using the lm function:
coef(m1 <- lm(logZn ~ dist, data=ds))
## (Intercept) dist
## 2.6814083 -0.8351626
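As an aside, explicitly inverting \(\mathbf{X}^T \mathbf{X}\), as we did above for illustration, is not the most numerically stable approach. A minimal sketch of the usual alternative is to solve the normal equations directly with solve and crossprod; this should give the same coefficients.
# solve (X^T X) beta = X^T y without forming the inverse explicitly
solve(crossprod(X), crossprod(X, y))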
Recall, in OLS the residuals are assumed to be independently and identically distributed, from a normal distribution.
First, see if they appear to be normally distributed. We use the
qqPlot function from the car package, which plots the
quantiles of the normal distribution against the residuals, with
confidence intervals.
library(car)
## Loading required package: carData
qqPlot(residuals(m1))
## 47 147
## 3 15
Second, see if there is any relation between the residuals and fitted values, and also if the variance of the residuals is the same across the range of fitted values:
plot(residuals(m1) ~ fitted(m1))
grid()
abline(h=0, col="blue")
lines(stats::lowess(residuals(m1) ~ fitted(m1)), col="red")
There are not so many points here, so it is hard to draw a firm
conclusion. The empirical smoother, computed with lowess
("locally weighted scatterplot smoothing"), helps visualize any relation.
Third, look for observations with high influence on the coefficients; these may come from a different population (process). The leverage of an observation is given by its hat value, which quantifies its overall influence on the fitted values.
(sort(h <- hatvalues(m1)))
## 143 119 144 137 147 44 163
## 0.06667045 0.06879450 0.07110646 0.07207403 0.07551481 0.07551553 0.07551553
## 47 141 92 64 19 114 115
## 0.08143135 0.08600332 0.11399471 0.16991412 0.19717650 0.19983495 0.20516374
## 161
## 0.44129001
plot(ds$dist, h, xlab="Normalized distance", ylab="Hat value", col=ds$ffreq, pch=20,
xlim=c(0,1), ylim=c(0,0.8))
text(ds$dist, h, labels=names(residuals(m1)), pos=3)
The highest hat values are at the extremes of the predictor.
The influence of each observation can be measured by its Cook’s distance, which measures how much \(\beta\) would change if the observation were omitted. It is defined as:
\[ \begin{aligned} D_i &= \frac{r^2_i}{\mathrm{tr}(\mathbf{H})} \frac{h_i}{1-h_i} \end{aligned} \] where \(\mathbf{H}\) is the hat matrix, \(h_i\) is its \(i\)th diagonal element, and \(r_i\) is the standardized residual, i.e., \((y_i - \widehat{y}_i) / (s \sqrt{1 - h_i})\). This is based on \(s^2\), the residual variance after the model fit.
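As a sketch of this definition, we can re-compute Cook's distance for m1 by hand, using the residual standard error from summary and the hat values; here \(\mathrm{tr}(\mathbf{H})\) equals the number of estimated coefficients. The comparison with cooks.distance should be TRUE.
s <- summary(m1)$sigma                      # residual standard error
h <- hatvalues(m1)                          # diagonal of the hat matrix
r.std <- residuals(m1) / (s * sqrt(1 - h))  # standardized residuals
D <- (r.std^2 / length(coef(m1))) * h / (1 - h)
all.equal(D, cooks.distance(m1))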
sort(d <- cooks.distance(m1))
## 163 143 137 92 114 144
## 0.003515888 0.014177974 0.021176635 0.022832342 0.024523949 0.026651287
## 141 119 161 44 64 147
## 0.027107622 0.058115950 0.065334629 0.068013566 0.076053434 0.101882099
## 19 115 47
## 0.120016966 0.126792043 0.135757328
plot(fitted(m1), d, xlab="Fitted value", ylab="Cook's distance", col=ds$ffreq, pch=20)
text(fitted(m1), d, labels=names(residuals(m1)), pos=4)
The hat matrix is so-named because it 'puts the hats on' the fitted values: from observed \(\mathbf{y}\) to best-fit \(\mathbf{\widehat{y}}\).
It is defined as:
\[ \begin{aligned} \mathbf{H} &= \mathbf{X} {(\mathbf{X}^T \mathbf{X})}^{-1} \mathbf{X}^T \end{aligned} \]
The hat matrix gives the weights by which each original observation is multiplied when computing \(\mathbf{\widehat{\beta}}\). This means that if a high-leverage observation were changed, the fit would change substantially. In terms of vector spaces, the \(\mathbf{H}\) matrix gives the orthogonal projection of the observation vector \(\mathbf{y}\) into the column space of the design (model) matrix \(\mathbf{X}\).
The overall leverage of each observation \(y_i\) is given by its hat value, which is the sum of the squared entries of the corresponding row of the hat matrix. This is also the diagonal of the hat matrix, because \(\mathbf{H} = \mathbf{H}^T \mathbf{H} = \mathbf{H}^2\):
\[ \begin{aligned} h_{ii} &= \sum_{j=1}^n h_{ij}^2 \end{aligned} \]
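A small sketch verifying both statements for model m1, by forming the hat matrix explicitly from its model matrix:
# hat matrix from the model matrix of m1
Xm <- model.matrix(m1)
H <- Xm %*% solve(t(Xm) %*% Xm) %*% t(Xm)
all.equal(unname(diag(H)), unname(hatvalues(m1)))  # diagonal = hat values
all.equal(unname(rowSums(H^2)), unname(diag(H)))   # h_ii = sum_j h_ij^2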
There are various ways to set up the model matrix to separate
categories. These are viewed and set with the contrasts
function:
contrasts(ds$ffreq)
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
This is the default "treatment contrasts" matrix, where "treatments" comes from experimental design. Here the "treatment" was applied by nature, not by humans. Treatment contrasts are sometimes called "dummy variables".
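Other codings are possible; for example, sum-to-zero ("deviation") contrasts, which express each class effect as a deviation from the overall mean rather than as a difference from a base class. A sketch for a three-level factor:
# sum-to-zero contrasts for a 3-level factor
contr.sum(3)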
The model matrix uses these contrasts:
(X <- model.matrix(logZn ~ ffreq, data=ds))
## (Intercept) ffreq2 ffreq3
## 19 1 0 0
## 44 1 0 0
## 47 1 0 0
## 64 1 0 0
## 163 1 0 0
## 92 1 1 0
## 114 1 1 0
## 115 1 1 0
## 119 1 1 0
## 161 1 1 0
## 137 1 0 1
## 141 1 0 1
## 143 1 0 1
## 144 1 0 1
## 147 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
Notice how the first category is included in the intercept – this is the base case; the other two classes are contrasted with it. The contrasts have created two columns, beyond the intercept, to represent the two additional classes.
Solution is exactly as before, by OLS:
(XTX <- t(X)%*%X)
## (Intercept) ffreq2 ffreq3
## (Intercept) 15 5 5
## ffreq2 5 5 0
## ffreq3 5 0 5
dim(XTX)
## [1] 3 3
This is the product-crossproduct matrix among the predictors. Note the relation to the number of observations in each class.
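Indeed, the diagonal entries count the observations contributing to each column (all 15 for the intercept, 5 for each contrast), and the zero between ffreq2 and ffreq3 reflects that no observation is in two classes at once. A quick check of the class counts in the sample:
table(ds$ffreq)
## 
## 1 2 3 
## 5 5 5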
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 36.63296
## ffreq2 11.68728
## ffreq3 11.69664
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.6498057
## ffreq2 -0.3123492
## ffreq3 -0.3104768
And again, these are the coefficients found by lm:
coef(m2 <- lm(logZn ~ ffreq, data=ds))
## (Intercept) ffreq2 ffreq3
## 2.6498057 -0.3123492 -0.3104768
The first coefficient is the mean of the dependent variable for observations in the first class:
beta[1]
## [1] 2.649806
mean(ds$logZn[ds$ffreq==1])
## [1] 2.649806
The other coefficients are the differences between their class means and this first-class mean:
beta[2]
## [1] -0.3123492
mean(ds$logZn[ds$ffreq==2])
## [1] 2.337456
mean(ds$logZn[ds$ffreq==2]) - beta[1]
## [1] -0.3123492
We can fit the model without an intercept by
adding -1 to the model formula:
(X <- model.matrix(logZn ~ ffreq - 1, data=ds))
## ffreq1 ffreq2 ffreq3
## 19 1 0 0
## 44 1 0 0
## 47 1 0 0
## 64 1 0 0
## 163 1 0 0
## 92 0 1 0
## 114 0 1 0
## 115 0 1 0
## 119 0 1 0
## 161 0 1 0
## 137 0 0 1
## 141 0 0 1
## 143 0 0 1
## 144 0 0 1
## 147 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
Now each observation has a 1 only in its own 'dummy'
variable.
(XTX <- t(X)%*%X)
## ffreq1 ffreq2 ffreq3
## ffreq1 5 0 0
## ffreq2 0 5 0
## ffreq3 0 0 5
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## ffreq1 13.24903
## ffreq2 11.68728
## ffreq3 11.69664
(beta <- XTXinv%*%XTy)
## [,1]
## ffreq1 2.649806
## ffreq2 2.337456
## ffreq3 2.339329
coef(m3 <- lm(logZn ~ ffreq - 1, data=ds))
## ffreq1 ffreq2 ffreq3
## 2.649806 2.337456 2.339329
Now the coefficients are the means of each class.
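We can check this directly with tapply; the values match the coefficients just shown.
# class means of logZn
tapply(ds$logZn, ds$ffreq, mean)
##        1        2        3 
## 2.649806 2.337456 2.339329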
The same formulation works with any combination of predictors. For example, both the flood frequency and distance:
(X <- model.matrix(logZn ~ ffreq + dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist
## 19 1 0 0 0.0000000
## 44 1 0 0 0.2118430
## 47 1 0 0 0.1900860
## 64 1 0 0 0.0316663
## 163 1 0 0 0.2118430
## 92 1 1 0 0.1139410
## 114 1 1 0 0.5757520
## 115 1 1 0 0.5814840
## 119 1 1 0 0.2498520
## 161 1 1 0 0.7716980
## 137 1 0 1 0.2281230
## 141 1 0 1 0.3966750
## 143 1 0 1 0.2879660
## 144 1 0 1 0.2335960
## 147 1 0 1 0.2118460
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
## (Intercept) ffreq2 ffreq3 dist
## (Intercept) 15.000000 5.000000 5.000000 4.296371
## ffreq2 5.000000 5.000000 0.000000 2.292727
## ffreq3 5.000000 0.000000 5.000000 1.358206
## dist 4.296371 2.292727 1.358206 1.859192
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 36.632955
## ffreq2 11.687282
## ffreq3 11.696644
## dist 9.967598
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.7511303
## ffreq2 -0.0537483
## ffreq3 -0.1985825
## dist -0.7849289
coef(m4 <- lm(logZn ~ ffreq + dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist
## 2.7511303 -0.0537483 -0.1985825 -0.7849289
Here the intercept is the value at distance 0 and in flood frequency class 1.
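As a check, predicting from m4 at dist = 0 in flood frequency class 1 should reproduce the intercept shown above:
# prediction at distance 0 in the base flood frequency class
predict(m4, newdata = data.frame(ffreq = factor("1", levels = levels(ds$ffreq)), dist = 0))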
Their interaction:
(X <- model.matrix(logZn ~ ffreq * dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist ffreq2:dist ffreq3:dist
## 19 1 0 0 0.0000000 0.000000 0.000000
## 44 1 0 0 0.2118430 0.000000 0.000000
## 47 1 0 0 0.1900860 0.000000 0.000000
## 64 1 0 0 0.0316663 0.000000 0.000000
## 163 1 0 0 0.2118430 0.000000 0.000000
## 92 1 1 0 0.1139410 0.113941 0.000000
## 114 1 1 0 0.5757520 0.575752 0.000000
## 115 1 1 0 0.5814840 0.581484 0.000000
## 119 1 1 0 0.2498520 0.249852 0.000000
## 161 1 1 0 0.7716980 0.771698 0.000000
## 137 1 0 1 0.2281230 0.000000 0.228123
## 141 1 0 1 0.3966750 0.000000 0.396675
## 143 1 0 1 0.2879660 0.000000 0.287966
## 144 1 0 1 0.2335960 0.000000 0.233596
## 147 1 0 1 0.2118460 0.000000 0.211846
## attr(,"assign")
## [1] 0 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
## (Intercept) ffreq2 ffreq3 dist ffreq2:dist ffreq3:dist
## (Intercept) 15.000000 5.000000 5.000000 4.2963713 2.292727 1.3582060
## ffreq2 5.000000 5.000000 0.000000 2.2927270 2.292727 0.0000000
## ffreq3 5.000000 0.000000 5.000000 1.3582060 0.000000 1.3582060
## dist 4.296371 2.292727 1.358206 1.8591921 1.340540 0.3917614
## ffreq2:dist 2.292727 2.292727 0.000000 1.3405404 1.340540 0.0000000
## ffreq3:dist 1.358206 0.000000 1.358206 0.3917614 0.000000 0.3917614
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 36.632955
## ffreq2 11.687282
## ffreq3 11.696644
## dist 9.967598
## ffreq2:dist 5.182236
## ffreq3:dist 3.156250
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.8902949
## ffreq2 -0.2723506
## ffreq3 -0.3004762
## dist -1.8629915
## ffreq2:dist 1.2513011
## ffreq3:dist 0.9408566
coef(m5 <- lm(logZn ~ ffreq * dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist ffreq2:dist ffreq3:dist
## 2.8902949 -0.2723506 -0.3004762 -1.8629915 1.2513011 0.9408566
A nested model (a separate slope for dist within each flood-frequency class) without an intercept:
(X <- model.matrix(logZn ~ ffreq/dist - 1, data=ds))
## ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## 19 1 0 0 0.0000000 0.000000 0.000000
## 44 1 0 0 0.2118430 0.000000 0.000000
## 47 1 0 0 0.1900860 0.000000 0.000000
## 64 1 0 0 0.0316663 0.000000 0.000000
## 163 1 0 0 0.2118430 0.000000 0.000000
## 92 0 1 0 0.0000000 0.113941 0.000000
## 114 0 1 0 0.0000000 0.575752 0.000000
## 115 0 1 0 0.0000000 0.581484 0.000000
## 119 0 1 0 0.0000000 0.249852 0.000000
## 161 0 1 0 0.0000000 0.771698 0.000000
## 137 0 0 1 0.0000000 0.000000 0.228123
## 141 0 0 1 0.0000000 0.000000 0.396675
## 143 0 0 1 0.0000000 0.000000 0.287966
## 144 0 0 1 0.0000000 0.000000 0.233596
## 147 0 0 1 0.0000000 0.000000 0.211846
## attr(,"assign")
## [1] 1 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
## ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## ffreq1 5.0000000 0.000000 0.000000 0.6454383 0.000000 0.0000000
## ffreq2 0.0000000 5.000000 0.000000 0.0000000 2.292727 0.0000000
## ffreq3 0.0000000 0.000000 5.000000 0.0000000 0.000000 1.3582060
## ffreq1:dist 0.6454383 0.000000 0.000000 0.1268904 0.000000 0.0000000
## ffreq2:dist 0.0000000 2.292727 0.000000 0.0000000 1.340540 0.0000000
## ffreq3:dist 0.0000000 0.000000 1.358206 0.0000000 0.000000 0.3917614
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## ffreq1 13.249028
## ffreq2 11.687282
## ffreq3 11.696644
## ffreq1:dist 1.629111
## ffreq2:dist 5.182236
## ffreq3:dist 3.156250
(beta <- XTXinv%*%XTy)
## [,1]
## ffreq1 2.8902949
## ffreq2 2.6179443
## ffreq3 2.5898187
## ffreq1:dist -1.8629915
## ffreq2:dist -0.6116904
## ffreq3:dist -0.9221350
coef(m6 <- lm(logZn ~ ffreq/dist - 1, data=ds))
## ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## 2.8902949 2.6179443 2.5898187 -1.8629915 -0.6116904 -0.9221350
This gives directly the means of each class at distance 0, and a coefficient showing how much the metal concentration decreases with each normalized distance away from the river.
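This nested parameterization is equivalent to the interaction model m5 above: each per-class slope is the m5 base slope for dist plus the corresponding interaction term. A quick check, which should reproduce the ffreq1:dist, ffreq2:dist and ffreq3:dist coefficients of m6:
b5 <- coef(m5)
c(b5["dist"], b5["dist"] + b5["ffreq2:dist"], b5["dist"] + b5["ffreq3:dist"])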
A robust inference is one that is not greatly disturbed by: (1) one or a few unusual observations ('outliers'), or (2) mild violations of the model assumptions.
For linear modelling, the underlying model form should still be linear, but there may be unusual observations that obscure the general relation.
An example of the first case is a contaminated dataset, where some observations result from a different process than the one which produced the others. The issue here is that the observations do not all come from the same population, i.e., are not the result of a single process.
Many of the problems with parametric regression can be avoided by fitting a so-called robust linear regression, which attempts to fit the majority of the observations and ignore the unusual cases. Since we do not know in advance if there is contamination or problems with the model assumptions, there is no way to select an optimum method. See the discussion in J Fox and S Weisberg. Robust regression in R: An appendix to “An R companion to applied regression, second edition”. 15-Dec-2010. URL http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Robust-Regression.pdf
One method is called least trimmed squares (LTS). It estimates the coefficient vector \(\mathbf{\beta}\) by the criterion of minimizing
\[ \sum_{i=1}^q \big| \;y_i - \mathbf{X}_i \mathbf{\beta} \; \big|_{(i)}^2 \]
That is, the sum of squared residuals is taken over only a subset of the observations: those with the \(q\) smallest squared residuals, as indicated by the subscript \((i)\). By default \(q = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor\), where \(p\) is the number of predictors and \(n\) the number of observations. This is about half the observations.
To determine the observations to use, sort the squared residuals from smallest \((1)\) to largest \((n)\):
\[ (e^2)_{(1)}, (e^2)_{(2)}, \ldots (e^2)_{(n)} \] and select the observations with the smallest \(q\) of these to re-fit the model.
p <- 2 # number of predictors
(q <- floor(n/2) + floor((p+1)/2)) # number of 'good' observations
## [1] 8
In our example, 8 ‘good’ observations will be used; the other 7 observations with the largest residuals will be discarded. This is the ‘trimming’.
However, it is not so simple: after re-fitting on the selected observations, the residuals of all the observations must be re-computed with the new coefficients, and these may not yield the same set of \(q\) smallest squared residuals. So the procedure iterates until the set of retained observations no longer changes.
This is solved by the lqs function found in the
MASS “Modern Applied Statistics with S” package, explained
in the textbook: Venables, W., & Ripley, B. (2002). Modern Applied
Statistics with S. Fourth Edition. Springer. This function has several
methods, of which least trimmed squares is the default and most
advised.
We show how to compute one iteration of the robust regression line, starting from the OLS line:
Find the 7 largest absolute residuals:
m1 <- lm(logZn ~ dist, data=ds)
(r0.s <- sort(r0.a <- abs(residuals(m1))))
## 163 161 114 92 143 137 141
## 0.05888070 0.06345561 0.08273952 0.11704238 0.12704260 0.14846578 0.15146304
## 64 144 115 19 119 44 147
## 0.16392366 0.16785893 0.18443669 0.18487908 0.25263312 0.25897224 0.31696168
## 47
## 0.35008329
(hi.res <- r0.s[(q+1):n])
## 144 115 19 119 44 147 47
## 0.1678589 0.1844367 0.1848791 0.2526331 0.2589722 0.3169617 0.3500833
(hi.res.ix <- which(abs(residuals(m1)) %in% hi.res))
## [1] 1 2 3 8 9 14 15
ds.trim <- ds[-hi.res.ix,]
dim(ds); dim(ds.trim)
## [1] 15 3
## [1] 8 3
m1.r <- lm(logZn ~ dist, data=ds.trim)
Compare the regression coefficients of the OLS and robust models, and the percent change:
coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept) dist
## 2.6808001 -0.8557252
## (Intercept) dist
## 2.6814083 -0.8351626
## (Intercept) dist
## 0.0 2.5
Now re-compute the residuals using the predict method,
the first-iteration robust model, and the full dataset:
(r1.s <- sort(r1.a <- abs(ds$logZn - predict(m1.r, newdata=ds))))
## 163 161 114 92 143 141 137
## 0.05391654 0.07993184 0.09518659 0.11999343 0.12051315 0.14269825 0.14316685
## 64 144 19 115 119 44 147
## 0.16518293 0.17327039 0.18548721 0.19700162 0.24688739 0.25400807 0.31199745
## 47
## 0.35460008
r0.s
## 163 161 114 92 143 137 141
## 0.05888070 0.06345561 0.08273952 0.11704238 0.12704260 0.14846578 0.15146304
## 64 144 115 19 119 44 147
## 0.16392366 0.16785893 0.18443669 0.18487908 0.25263312 0.25897224 0.31696168
## 47
## 0.35008329
In general the residuals after the re-fit will not be the same, nor in the same sorted order, as from the OLS fit. Check if the same observations are discarded:
hi.res.1 <- r1.s[(q+1):n]
(hi.res.ix.1 <- which(r1.a %in% hi.res.1))
## [1] 1 2 3 8 9 14 15
hi.res.ix
## [1] 1 2 3 8 9 14 15
intersect(hi.res.ix, hi.res.ix.1)
## [1] 1 2 3 8 9 14 15
In this case 7 of the 7 largest absolute residuals are from the same observations in both the OLS regression and the first iteration of the LTS regression.
If these sets are the same, this is the robust regression. If not, the process is repeated with the new set of ‘good’ observations, until the sets are the same.
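A simplified, illustrative sketch of this iteration for our model, re-using q computed above (the lqs function used below has its own search procedure):
fit <- lm(logZn ~ dist, data = ds)
keep.old <- integer(0)
for (it in 1:20) {                                  # cap the iterations
  r2 <- (ds$logZn - predict(fit, newdata = ds))^2   # squared residuals, all observations
  keep <- order(r2)[1:q]                            # indices of the q smallest
  if (setequal(keep, keep.old)) break               # retained set unchanged: finished
  fit <- lm(logZn ~ dist, data = ds[keep, ])        # re-fit on the trimmed subset
  keep.old <- keep
}
coef(fit)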
The robust line can be computed directly by the lqs
method, with the method optional argument set to
lts for least-trimmed squares. This performs the iteration
as needed.
require(MASS)
## Loading required package: MASS
m1.r <- lqs(logZn ~ dist, data=ds, method = "lts")
coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept) dist
## 2.2049524 0.1939035
## (Intercept) dist
## 2.6814083 -0.8351626
## (Intercept) dist
## -17.8 -123.2
Depending on the sample, the regression coefficients can change quite a lot.
Visualize the two regression lines:
plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
abline(m1)
abline(m1.r, lty=2)
legend("topright", c("OLS", "robust"), lty=1:2)