Regression

========================================================

This is an R Markdown document created with examples of how to work with regressions. Based on Coursera’s Regression Models course.

About regression analysis

Wiki says: Regression analysis is a statistical process for estimating the relationships among variables.

Simple Linear regression

Simple example:

Let’s look at data used by Francis Galton in 1885: heights of parents and children

library(UsingR)
data(galton)
par(mfrow=c(1,2))
hist(galton$parent,col="blue",breaks=100, main="Parents' heights")
hist(galton$child,col="blue",breaks=100, main="Children's heights")

plot of chunk unnamed-chunk-1

plot(galton$parent,galton$child,pch=19,col="blue", main="Children's vs parents' heights")

plot of chunk unnamed-chunk-1

How do we fit this data? In other words, what model can we use to explain the height of children as a function of the height of their parents?

fit=lm(child~ parent , data = galton)
plot(galton$parent,galton$child,pch=19,col="blue", main="Children's vs parents' heights")
lines(galton$parent,fit$fitted,col="red",lwd=3)

plot of chunk unnamed-chunk-2

Mathematical interpretation of a linear fit

We are looking for the line of equation $y = \beta_0 + \beta_1 x$ that best fits the data pairs $(x_i, y_i)$. The lm function computes the coefficients $\hat \beta_0$ and $\hat \beta_1$ that ensure the best fit in the least squares sense, i.e., such that $y = \hat \beta_0 + \hat \beta_1 x$ is the line that passes as close to possible to the points $(x_i, y_i)$. Denoting by $\mathbf{x}$ the vector of coordinates $x_i$ and $\mathbf{y}$ the vector of coordinates $y_i$, it can be shown that these coefficients are:

\[\hat \beta_1 = Cor(\mathbf{x}, \mathbf{y}) \frac{Sd(\mathbf{x})}{Sd(\mathbf{y})} ~~~ \hat \beta_0 = \mathbf{\bar{y}} - \hat \beta_1 \mathbf{\bar{x}}\]

where $\mathbf{\bar{x}}$ and $\mathbf{\bar{y}}$ denote the average of the components of $\mathbf{x}$ and $\mathbf{y}$, respectively. Naturally, we could use this information to compute the coefficients instead of using the lm function:

y = galton$child
x = galton$parent
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)

## [1] 23.9415  0.6463

However, using the lm function is easier:

coef(lm(y ~ x))

## (Intercept)           x 
##     23.9415      0.6463

Further considerations:

The regression line $y = \hat \beta_0 + \hat \beta_1 x$ passes through the point $(\mathbf{\bar{x}}, \mathbf{\bar{y}}$)
The slope is the same one you would get if you centered the data and did regression through the origin. Indeed, it is sometimes useful to center the data. Again thinking of the example where we consider the heights of parents and children, we can center the data and fit a regression line. The slope of the line is the same, but the intercept is now zero:

y = galton$child-mean(galton$child)
x = galton$parent-mean(galton$parent)
coef(lm(y ~ x))

## (Intercept)           x 
##   6.309e-15   6.463e-01

Recall that $\hat \beta_1 = Cor(\mathbf{x}, \mathbf{y}) \frac{Sd(\mathbf{x})}{Sd(\mathbf{y})}$. If we normalize the data - so that instead of $(\mathbf{x}, \mathbf{y})=(x_i, y_i)$ we consider $(\mathbf{\hat{x}}, \mathbf{\hat{y}})=(\hat x_i, \hat y_i) = \left( \frac{x_i - \mathbf{\bar{x}}}{Sd(\mathbf{x})}, \frac{y_i - \mathbf{\bar{y}}}{Sd(\mathbf{y})} \right)$, the slope will be $cor(\mathbf{x}, \mathbf{y})=cor(\mathbf{\hat{x}}, \mathbf{\hat{y}})$

yn = (y - mean(y))/sd(y)
xn = (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])

##                   xn 
## 0.4588 0.4588 0.4588

In the case in which we are looking for linear relationship between $x_i$ and $y_i$ like $y = \beta_1 x$, then the parameter that ensures the best fit is $\hat \beta_1 = \frac{\sum_{i=1^n} y_i x_i}{\sum_{i=1}^n x_i^2}$

beta1 <- sum(y * x) / sum(x^2)

Another example and interpretation of the coefficients

Consider an example where there are diamonds, their prices in Signapore dollars and their weight. Let’s fit and plot the data:

data(diamond)
plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)",
     ylab = "Price (SIN $)",
     bg = "lightblue",
     col = "black", cex = 1.1, pch = 21,frame = FALSE)
fit= lm(price ~ carat, data = diamond)
abline(fit, lwd = 2)

plot of chunk unnamed-chunk-8

The coefficients of the linear fit we obtained are:

coef(fit)

## (Intercept)       carat 
##      -259.6      3721.0

The way we interpret $\beta_1= 3721$ is so that we estimate an expected 3721 (SIN) dollar increase in price for every carat increase in mass of diamond. As for the intercept -259.63, it is the expected price of a 0 carat diamond.

If we take the mean out of the variable on the x-values (the weights), then the intercept changes and we can get a more easily interpretable intercept (it is now the price of the average-weighted diamond)

fit2=lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)

##            (Intercept) I(carat - mean(carat)) 
##                  500.1                 3721.0

Note that the “I”" inside lm is simply placed there to denote that the operation inside of the parenthesis that follow it is to be performed before the fit. Now only the intercept is changed. It is the expected price for the average sized diamond of the data.

So now imagine we want to predict the values of three other diamonds, given their masses. The following two ways to do it are equivalent:

newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx

## [1]  335.7  745.1 1005.5

predict(fit, data.frame(carat = newx))

##      1      2      3 
##  335.7  745.1 1005.5

Residuals

In statistical notation:

The ith outcome observed is $Y_i$, the predictor value being $X_i$
The ith predicted outcome, corresponding to predictor value $X_i$, is denoted by $\hat Y_i$. It is given using the regression line: \[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]
We define the residuals as the difference between the observed and predicted outcome: $e_i = Y_i - \hat Y_i$
The underlying statistical model is $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$. An important point is that we assume that $\epsilon_i \sim N(0, \sigma^2)$.

Let’s look at an example of how to compute the residuals:

data(diamond)
y = diamond$price
x = diamond$carat
n = length(y)
fit = lm(y ~ x)
e=resid(fit)
e

##        1        2        3        4        5        6        7        8 
## -17.9483  -7.7381 -22.9483 -85.1586 -28.6303   6.2619  23.4722  37.6312 
##        9       10       11       12       13       14       15       16 
## -38.7893  24.4722  51.8414  40.7389   0.2619  13.4209  -1.2098  40.5287 
##       17       18       19       20       21       22       23       24 
##  36.1029 -44.8406  79.3697 -25.0508  57.8414   9.2619 -20.9483  -3.7381 
##       25       26       27       28       29       30       31       32 
## -19.9483  27.8414 -54.9483   8.8414 -26.9483  16.4722 -22.9483 -13.1020 
##       33       34       35       36       37       38       39       40 
## -12.1020  -0.5278   3.2619   2.2619  -1.2098 -43.2098 -27.9483 -23.3123 
##       41       42       43       44       45       46       47       48 
## -15.6303  43.2672  32.8414   7.3697   4.3697 -11.5278 -14.8406  17.4722

Naturally, this give the same result as:

y-predict(fit)

##        1        2        3        4        5        6        7        8 
## -17.9483  -7.7381 -22.9483 -85.1586 -28.6303   6.2619  23.4722  37.6312 
##        9       10       11       12       13       14       15       16 
## -38.7893  24.4722  51.8414  40.7389   0.2619  13.4209  -1.2098  40.5287 
##       17       18       19       20       21       22       23       24 
##  36.1029 -44.8406  79.3697 -25.0508  57.8414   9.2619 -20.9483  -3.7381 
##       25       26       27       28       29       30       31       32 
## -19.9483  27.8414 -54.9483   8.8414 -26.9483  16.4722 -22.9483 -13.1020 
##       33       34       35       36       37       38       39       40 
## -12.1020  -0.5278   3.2619   2.2619  -1.2098 -43.2098 -27.9483 -23.3123 
##       41       42       43       44       45       46       47       48 
## -15.6303  43.2672  32.8414   7.3697   4.3697 -11.5278 -14.8406  17.4722

To visualise the residuals, it is better to explicitly plot them. Plotting the regression curve and visualising the residuals there leaves a large portion of the picture blank, making it harder to visualise patterns that may exist:

plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)",
     ylab = "Price (SIN $)",
     bg = "lightblue",
     col = "black", cex = 1.1, pch = 21,frame = FALSE)
yhat = predict(fit)
abline(fit, lwd = 2)
for (i in 1 : n)
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)

plot of chunk unnamed-chunk-14

plot(diamond$carat, e,
     xlab = "Mass (carats)",
     ylab = "Residuals (SIN $)",
     bg = "lightblue",
     col = "black", cex = 1.1, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)

plot of chunk unnamed-chunk-14

The fact that plotting the residuals directly instead of the regression line gives us a more clear view whether indeed our model is good is illustrated in the following plots:

x <- runif(100, -3, 3); y <- x + sin(x) + rnorm(100, sd = .2);
plot(x, y); abline(lm(y ~ x))

plot of chunk unnamed-chunk-15

plot(x, resid(lm(y ~ x)));
abline(h = 0)

plot of chunk unnamed-chunk-15

The latter figure clearly shows the residuals are not normally distributed, as we had assumed. Indeed, the statistical model is $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where we assume $\epsilon_i \sim N(0, \sigma^2)$. We have not yet tried to estimate $\sigma$

One intuitive way to estimate $\sigma^2$ is averaging out squared residuals: $\frac{1}{n}\sum_{i=1}^n e_i^2$.
Most people use instead: \[ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2. \]
The $n-2$ instead of $n$ is so that $E[\hat \sigma^2] = \sigma^2$

R computes this latter estimate automatically:

y = diamond$price; x = diamond$carat; n = length(y)
fit = lm(y ~ x)
summary(fit)$sigma

## [1] 31.84

sqrt(sum(resid(fit)^2) / (n - 2))

## [1] 31.84

How good is the model?

Recall that we denote observed values by $Y_i$, predicted values by $\hat Y_i$ and the average by $\bar Y$. It can be shown that

\[\sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \]

In other words, the total variation of the data is the residual variance (if the model were perfect, this would be zero) plus the regression variance. We can consider the percentage of total variation due to the model by $R^2$:

\[ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \]

The quantity $R^2$ is then the ratio between how much variation there is in the model (relative to the average of the observations) and how much variation there is in the actual data (relative to the average of the observations). In other words, $R^2$ expresses how much of the variation the data has is explained by the model. Note:

$R^2$ is the percentage of variation explained by the regression model.
$0 \leq R^2 \leq 1$
it is easy to show that $R^2 = Cor(Y, X)^2$. So, $R^2$ is literally $r$ squared.
$R^2$ can be a misleading summary of model fit.
- Deleting data can inflate $R^2$.
- Adding terms to a regression model always increases $R^2$.

We still have to define criteria to check whether a model is good. In practice, we perform hypothesis testing to check if the slope ($\beta_1$) is zero. If this can be rejected with statistical significance, then we may look at $R^2$ to see how good the fit is.

T-tests for regression coefficients

It can be shown that one can create confidence intervals and perform hypothesis tests with regression coefficients. The standard error formulas for the coefficients are (as is easily obtained by tedious calculations):

$\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$
$\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2$
In practice, $\sigma$ is replaced by its estimate.
It’s probably not surprising that under iid Gaussian errors \[ \frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}} \] follows a $t$ distribution with $n-2$ degrees of freedom and a normal distribution for large $n$. In this context we can think of hypothesis testing.

As an example, we can test the null hypothesis (that the coefficients $\beta_0$ and $\beta_1$ are zero, respectively for each hypothesis)

library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x);
summary(fit)$coefficients

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   -259.6      17.32  -14.99 2.523e-19
## x             3721.0      81.79   45.50 6.751e-40

We can also get confidence intervals for the coefficients:

sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]

## [1] -294.5 -224.8

# confience interval for the first coefficient
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]

## [1] 3556 3886

# confience interval for the second coefficient

Interpretation: A one carat increase in diamond size results in a 3556 to 3886 increase in price.

Prediction of outcomes

We’ve seen how to compute confidence intervals for the coefficients. Now consider predicting $Y$ at a value of $X$. A standard error is needed to create a prediction interval.

Pointwise, the obvious estimate for prediction at point $x_0$ is $\hat \beta_0 + \hat \beta_1 x_0$. In terms of prediction intervals, there are two different situations: when you want to predict the value taken by $y$ at $x_0$ and when you want to predict the point of the regression line at $x_0$:

The prediction interval standard error at $x_0$ is $\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$ (there is a constant term “1” inside the square bracket because there is an uncertainty where the points are even if you know perfectly well where your fit line should be. The points deviate from the line, no matter how perfectly your model line fits the data. Then there is the 1/n because the more points you get, the closer you are to getting the model right. And the other term in the square bracket accounts that the further you get from the mean, the worse you expect your results to be)
You get a different situation if you are trying to predict what are the points of the regression line. At point $x_0$ the standard error is now, $\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$. It gets smaller as the number of points we know increases.

In the following figure the upper and lower limits for the estimates of the the fit line (closer to the fit line) and the predictions (further away) are plotted:

xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
lines(xVals, p1[,2]); lines(xVals, p1[,3])
lines(xVals, p2[,2]); lines(xVals, p2[,3])

plot of chunk unnamed-chunk-19

Multivariable linear regression analyses

The general linear model extends simple linear regression (SLR) by adding terms linearly into the model. \[ Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_{p} X_{pi} + \epsilon_{i} = \sum_{k=1}^p X_{ik} \beta_j + \epsilon_{i} \]
Here $X_{1i}=1$ typically, so that an intercept is included.
Least squares (and hence ML estimates under iid Gaussianity of the errors) minimizes \[ \sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki} \beta_j\right)^2 \]
Note, the important linearity is linearity in the coefficients. Thus

\[ Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots + \beta_{p} X_{pi}^2 + \epsilon_{i} \] is still a linear model. (We’ve just squared the elements of the predictor variables.)

Example in R

We start by building up our example in which we have data following the equation: $y=x1+x2+x3$ (in other words, all coefficients are one)

n <- 100; x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- x1 + x2 + x3 + rnorm(n, sd = .1)

Note we are adding an error term (the residuals), which we assume to be normally distributed. Now we can ask R to determine the coefficients:

coef(lm(y ~ x1 + x2 + x3 - 1)) #the -1 removes the intercept term

##     x1     x2     x3 
## 0.9707 1.0148 1.0062

If we also try to compute the intercept:

coef(lm(y ~ x1 + x2 + x3))

## (Intercept)          x1          x2          x3 
##    0.003164    0.970484    1.014968    1.005869

Interpretation of the coeficient

\[E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\] So that \[ E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]\] \[= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k}+ \sum_{k=1}^p x_{k} \beta_k = \beta_1 \] So that the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.

Extension from simple linear regression to multivariate linear regression

All of our SLR quantities can be extended to linear models

Model $Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}$ where $\epsilon_i \sim N(0, \sigma^2)$
Fitted responses $\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}$
Residuals $e_i = Y_i - \hat Y_i$
Variance estimate $\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2$
To get predicted responses at new values, $x_1, \ldots, x_p$, simply plug them into the linear model $\sum_{k=1}^p x_{k} \hat \beta_{k}$
Coefficients have standard errors, $\hat \sigma_{\hat \beta_k}$, and $\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}$ follows a $T$ distribution with $n-p$ degrees of freedom.
Predicted responses have standard errors and we can calculate predicted and expected response intervals.

Multivariate linear regression example

Consider the swiss fertility data set:

library(datasets); data(swiss); head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

It is a data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].

[,1] Fertility Ig, ‘common standardized fertility measure’
[,2] Agriculture % of males involved in agriculture as occupation
[,3] Examination % draftees receiving highest mark on army examination
[,4] Education % education beyond primary school for draftees.
[,5] Catholic % ‘catholic’ (as opposed to ‘protestant’).
[,6] Infant.Mortality live births who live less than 1 year.

All variables but ‘Fertility’ give proportions of the population. Now let’s try to understand how we can explain fertility from the remaining variables:

summary(lm(Fertility ~ . , data = swiss))$coefficients

##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)       66.9152   10.70604   6.250 1.906e-07
## Agriculture       -0.1721    0.07030  -2.448 1.873e-02
## Examination       -0.2580    0.25388  -1.016 3.155e-01
## Education         -0.8709    0.18303  -4.758 2.431e-05
## Catholic           0.1041    0.03526   2.953 5.190e-03
## Infant.Mortality   1.0770    0.38172   2.822 7.336e-03

Notice that

We estimate an expected 0.17 decrease in standardized fertility for every 1% increase in percentage of males involved in agriculture in holding the remaining variables constant.
The t-test for $H_0: \beta_{Agri} = 0$ versus $H_a: \beta_{Agri} \neq 0$ is significant.
Interestingly, the unadjusted estimate we obtain when we look at a simple regression model where there is only one predictor, agriculture, is

summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  60.3044    4.25126  14.185 3.216e-18
## Agriculture   0.1942    0.07671   2.532 1.492e-02

Looking at only this latter (also statistically significant) model, we would expect an increase in fertility when the percentage of males involved in agriculture increases! Which model should we trust? It’s hard to answer. One should not add variables that are not meaningful. There’s a branch of statistics called causal inference which helps. Otherwise just follow an heuristic approach and build up your own story; create a story from the data and model that makes sense.

What if we include an unnecessary variable?

Let’s create a variable z that adds no new linear information, since it’s a linear combination of variables already included. R automatically just drops terms that are linear combinations of other terms:

z <- swiss$Agriculture + swiss$Education
lm(Fertility ~ . + z, data = swiss)

## 
## Call:
## lm(formula = Fertility ~ . + z, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture       Examination         Education  
##           66.915            -0.172            -0.258            -0.871  
##         Catholic  Infant.Mortality                 z  
##            0.104             1.077                NA

Dummy variables are smart

Consider the linear model \[ Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i} \] where each $X_{i1}$ is binary so that it is a 1 if measurement $i$ is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
Then for people in the group $E[Y_i] = \beta_0 + \beta_1$
And for people not in the group $E[Y_i] = \beta_0$
The LS fits work out to be $\hat \beta_0 + \hat \beta_1$ is the mean for those in the group and $\hat \beta_0$ is the mean for those not in the group.
$\beta_1$ is interpretted as the increase or decrease in the mean comparing those in the group to those not.
Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.

Example:

Consider a multilevel factor level. For didactic reasons, let’s say a three level factor (example, US political party affiliation: Republican, Democrat, Independent)
$Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i$.
$X_{i1}$ is 1 for Republicans and 0 otherwise.
$X_{i2}$ is 1 for Democrats and 0 otherwise.
If $i$ is Republican $E[Y_i] = \beta_0 +\beta_1$
If $i$ is Democrat $E[Y_i] = \beta_0 + \beta_2$.
If $i$ is Independent $E[Y_i] = \beta_0$.
$\beta_1$ compares Republicans to Independents.
$\beta_2$ compares Democrats to Independents.
$\beta_1 - \beta_2$ compares Republicans to Democrats.
(Choice of reference category changes the interpretation)

Another example: consider the InsectSpray data set, which is counting how many insects were killed by using different sprays on different occasions:

require(datasets);data(InsectSprays)
require(stats); require(graphics)

head(InsectSprays)

##   count spray
## 1    10     A
## 2     7     A
## 3    20     A
## 4    14     A
## 5    14     A
## 6    12     A

boxplot(count ~ spray, data = InsectSprays,
        xlab = "Type of spray", ylab = "Insect count",
        main = "InsectSprays data", varwidth = TRUE, col = "lightgray")

plot of chunk unnamed-chunk-27

If we use dummy variables, like in the previous example, we get:

summary(lm(count ~ 
             I(1 * (spray == 'B')) + I(1 * (spray == 'C')) + 
             I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
             I(1 * (spray == 'F'))
           , data = InsectSprays))$coef

##                       Estimate Std. Error t value  Pr(>|t|)
## (Intercept)            14.5000      1.132 12.8074 1.471e-19
## I(1 * (spray == "B"))   0.8333      1.601  0.5205 6.045e-01
## I(1 * (spray == "C")) -12.4167      1.601 -7.7550 7.267e-11
## I(1 * (spray == "D"))  -9.5833      1.601 -5.9854 9.817e-08
## I(1 * (spray == "E")) -11.0000      1.601 -6.8702 2.754e-09
## I(1 * (spray == "F"))   2.1667      1.601  1.3532 1.806e-01

The way we interpret this is that we are using A as the reference category. So 0.8333, the coefficient of sprayB, is the estimated expected increase in dead insect count if we use spray B instead of spray A. Using the notation of the previous example, $\beta_0=14.5000$ is the expected number of dead insects if we use spray A. Moreover, $\beta_0+\beta_1=14.5000+0.8333$ is the expected number of dead insects if we use spray B. By the way, we don’t actually have to hard code the dummy variables, R does it for us:

summary(lm(count ~ spray, data = InsectSprays))$coef

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  14.5000      1.132 12.8074 1.471e-19
## sprayB        0.8333      1.601  0.5205 6.045e-01
## sprayC      -12.4167      1.601 -7.7550 7.267e-11
## sprayD       -9.5833      1.601 -5.9854 9.817e-08
## sprayE      -11.0000      1.601 -6.8702 2.754e-09
## sprayF        2.1667      1.601  1.3532 1.806e-01

If we treat Spray as a factor, R includes an intercept and omits the alphabetically first level of the factor.
All t-tests are for comparisons of Sprays versus Spray A.
Emprirical mean for A is the intercept.
Other group means are the intercept plus their coefficient.

We can also think of ommiting the intercept:

summary(lm(count ~ spray - 1, data = InsectSprays))$coef

##        Estimate Std. Error t value  Pr(>|t|)
## sprayA   14.500      1.132  12.807 1.471e-19
## sprayB   15.333      1.132  13.543 1.002e-20
## sprayC    2.083      1.132   1.840 7.024e-02
## sprayD    4.917      1.132   4.343 4.953e-05
## sprayE    3.500      1.132   3.091 2.917e-03
## sprayF   16.667      1.132  14.721 1.573e-22

If we omit an intercept, then it includes terms for all levels of the factor.
Group means are the coefficients.
Tests are tests of whether the groups are different than zero. (Are the expected counts zero for that spray.)
If we want comparisons between using spray C as reference, we could refit the model like so:

spray2 <- relevel(InsectSprays$spray, "C")
summary(lm(count ~ spray2, data = InsectSprays))$coef

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    2.083      1.132  1.8401 7.024e-02
## spray2A       12.417      1.601  7.7550 7.267e-11
## spray2B       13.250      1.601  8.2755 8.510e-12
## spray2D        2.833      1.601  1.7696 8.141e-02
## spray2E        1.417      1.601  0.8848 3.795e-01
## spray2F       14.583      1.601  9.1083 2.794e-13

Other thoughts on this data

Counts are bounded from below by 0, violates the assumption of normality of the errors.
Also there are counts near zero, so both the actual assumption and the intent of the assumption are violated.
Variance does not appear to be constant.
Perhaps taking logs of the counts would help.
There are 0 counts, so maybe log(Count + 1)
Also, we’ll cover Poisson GLMs for fitting count data.

Interactions

Let us consider the following data:

download.file("http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*","hunger.csv")
hunger <- read.csv("hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes",]
head(hunger)

##                                 Indicator Data.Source PUBLISH.STATES Year
## 6  Children aged <5 years underweight (%) NLIS_310214      Published 1986
## 10 Children aged <5 years underweight (%) NLIS_310232      Published 1989
## 11 Children aged <5 years underweight (%) NLIS_310240      Published 1975
## 13 Children aged <5 years underweight (%) NLIS_310272      Published 1988
## 16 Children aged <5 years underweight (%) NLIS_310357      Published 1983
## 21 Children aged <5 years underweight (%) NLIS_310427      Published 1987
##               WHO.region                          Country    Sex
## 6               Americas               Dominican Republic Female
## 10              Americas Bolivia (Plurinational State of)   Male
## 11 Eastern Mediterranean                          Tunisia   Male
## 13 Eastern Mediterranean                            Egypt   Male
## 16       Western Pacific                 Papua New Guinea   Male
## 21              Americas                        Guatemala Female
##    Display.Value Numeric Low High Comments
## 6            7.3     7.3  NA   NA       NA
## 10          10.4    10.4  NA   NA       NA
## 11          15.5    15.5  NA   NA       NA
## 13          11.7    11.7  NA   NA       NA
## 16          26.2    26.2  NA   NA       NA
## 21          26.6    26.6  NA   NA       NA

and in particular look at how the percentage of hungry people has evolved, in different countries, throughout the years:

lm1 <- lm(hunger$Numeric ~ hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19,col="blue")

plot of chunk unnamed-chunk-33

Let’s build a regression model:

\[Hu_i = b_0 + b_1 Y_i + e_i\]

$b_0$ = percent hungry at Year 0

$b_1$ = decrease in percent hungry per year

$e_i$ = everything we didn’t measure

In R, this will be given by:

lm1 <- lm(hunger$Numeric ~ hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
lines(hunger$Year,lm1$fitted,lwd=3,col="darkgrey")

plot of chunk unnamed-chunk-34

Now let us consider how the genders may play a different role - segment by gender. We can do this explicitly by considering different models for the genders:

\[HuF_i = bf_0 + bf_1 YF_i + ef_i\]

$bf_0$ = percent of girls hungry at Year 0

$bf_1$ = decrease in percent of girls hungry per year

$ef_i$ = everything we didn’t measure

\[HuM_i = bm_0 + bm_1 YM_i + em_i\]

$bm_0$ = percent of boys hungry at Year 0

$bm_1$ = decrease in percent of boys hungry per year

$em_i$ = everything we didn’t measure

In R, this will read:

lmM <- lm(hunger$Numeric[hunger$Sex=="Male"] ~ hunger$Year[hunger$Sex=="Male"])
lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"])
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
lines(hunger$Year[hunger$Sex=="Male"],lmM$fitted,col="black",lwd=3)
lines(hunger$Year[hunger$Sex=="Female"],lmF$fitted,col="red",lwd=3)

plot of chunk unnamed-chunk-35

Now we should not have to segment manually. Instead we want R to do this for us:

lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] ),col="black",lwd=3)

plot of chunk unnamed-chunk-36

Basically this corresponds to the model where we have two lines with the same slope:

\[Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + e^*_i\]

$b_0$ - percent hungry at year zero for females

$b_0 + b_1$ - percent hungry at year zero for males

$b_2$ - change in percent hungry (for either males or females) in one year

$e^*_i$ - everything we didn’t measure

Note that this is NOT the same as when we segmented males/females. In that case, we had not forced the two lines to have the same slope, as we now do… What we are missing is the interaction between gender and year: the slope for men and for female/the slope of time evolution for the different genders can be different. We can consider this interaction by having the model:

\[Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + b_3 \mathbb{1}(Sex_i="Male")\times Y_i + e^+_i\]

$b_0$ - percent hungry at year zero for females

$b_0 + b_1$ - percent hungry at year zero for males

$b_2$ - change in percent hungry (females) in one year

$b_2 + b_3$ - change in percent hungry (males) in one year

$e^+_i$ - everything we didn’t measure

which in R is implemented as:

lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] +lmBoth$coeff[4]),col="black",lwd=3)

plot of chunk lmBothChunk

Now how do we interpret the coefficients that follow?

summary(lmBoth)

## 
## Call:
## lm(formula = hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex * 
##     hunger$Year)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25.71 -11.14  -1.71   7.12  46.22 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                593.4690   158.2955    3.75  0.00019 ***
## hunger$Year                 -0.2884     0.0791   -3.65  0.00028 ***
## hunger$SexMale              58.6655   223.8637    0.26  0.79333    
## hunger$Year:hunger$SexMale  -0.0284     0.1118   -0.25  0.79984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 1020 degrees of freedom
## Multiple R-squared:  0.0329, Adjusted R-squared:  0.0301 
## F-statistic: 11.6 on 3 and 1020 DF,  p-value: 1.84e-07

Here, females are the reference category. So the intercept is the percentage of hungry females in year zero (ok, the number is ridiculous, but it’s just a model…). As for 58.6655, it is the change in intercept for males. The slope for females is -0.2884 and for males it’s -0.2884+(-0.0284).

In general, the way to interpret a continuous interaction like \[ E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{1}x_{2} \] is as follows: holding $X_2$ constant we have \[ E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_1 + \beta_3 x_{2} \] And thus the expected change in $Y$ per unit change in $X_1$ holding all else constant is not constant. $\beta_1$ is the slope when $x_{2} = 0$. Note further that: \[ E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2+1]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2+1] \] \[ -E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2] \] \[ =\beta_3 \] Thus, $\beta_3$ is the change in the expected change in $Y$ per unit change in $X_1$, per unit change in $X_2$.

Or, the change in the slope relating $X_1$ and $Y$ per unit change in $X_2$.

Let’s look at another example, where the goal is to interpret a model explaining how the percentage of hungry people depends on income and the year:

\[Hu_i = b_0 + b_1 In_i + b_2 Y_i + b_3 In_i \times Y_i + e^+_i\]

Interpretation:

$b_0$ - percent hungry at year zero for children with whose parents have no income

$b_1$ - change in percent hungry for each dollar of income in year zero

$b_2$ - change in percent hungry in one year for children whose parents have no income

$b_3$ - increased change in percent hungry by year for each dollar of income - e.g. if income is $10,000, then change in percent hungry in one year will be

\[b_2 + 1e4 \times b_3\]

$e^+_i$ - everything we didn’t measure

Residuals and diagnostics

Calling a point an outlier is vague.

Outliers can be the result of spurious or real processes.
Outliers can have varying degrees of influence.
Outliers can conform to the regression relationship (i.e being marginally outlying in X or Y, but not outlying given the regression relationship).
Upper left hand point has low leverage, low influence, outlies in a way not conforming to the regression relationship.
Lower left hand point has low leverage, low influence and is not to be an outlier in any sense.
Upper right hand point has high leverage, but chooses not to extert it and thus would have low actual influence by conforming to the regresison relationship of the other points.
Lower right hand point has high leverage and would exert it if it were included in the fit.

Influence measures

Do ?influence.measures to see the full suite of influence measures in stats. The measures include
rstandard - standardized residuals, residuals divided by their standard deviations)
rstudent - standardized residuals, residuals divided by their standard deviations, where the ith data point was deleted in the calculation of the standard deviation for the residual to follow a t distribution
hatvalues - measures of leverage
dffits - change in the predicted response when the $i^{th}$ point is deleted in fitting the model.
dfbetas - change in individual coefficients when the $i^{th}$ point is deleted in fitting the model.
cooks.distance - overall change in teh coefficients when the $i^{th}$ point is deleted.
resid - returns the ordinary residuals
resid(fit) / (1 - hatvalues(fit)) where fit is the linear model fit returns the PRESS residuals, i.e. the leave one out cross validation residuals - the difference in the response and the predicted response at data point $i$, where it was not included in the model fitting.

How do I use all of these things?

Be wary of simplistic rules for diagnostic plots and measures. The use of these tools is context specific. It’s better to understand what they are trying to accomplish and use them judiciously.
Not all of the measures have meaningful absolute scales. You can look at them relative to the values across the data.
They probe your data in different ways to diagnose different problems.
Patterns in your residual plots generally indicate some poor aspect of model fit. These can include:
Heteroskedasticity (non constant variance).
Missing model terms.
Temporal patterns (plot residuals versus collection order).
Residual QQ plots investigate normality of the errors.
Leverage measures (hat values) can be useful for diagnosing data entry errors.
Influence measures get to the bottom line, ‘how does deleting or including this point impact a particular aspect of the model’.

Example: Imagine we have randomly distributed data as follows. Then naturally we will not have a significant linear regression model:

n=100
x <- c(rnorm(n)); y <- c(c(rnorm(n)))
summary(lm(y ~ x))

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.901 -0.606  0.137  0.577  1.811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.0197     0.0980    0.20     0.84
## x            -0.1006     0.1058   -0.95     0.34
## 
## Residual standard error: 0.975 on 98 degrees of freedom
## Multiple R-squared:  0.00914,    Adjusted R-squared:  -0.000968 
## F-statistic: 0.904 on 1 and 98 DF,  p-value: 0.344

Now if we add an outlier, point (10, 10), we shall obtain a statistically significant regression model which is just stupid:

x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(lm(y ~ x))

plot of chunk unnamed-chunk-39

summary(lm(y~x))

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.237 -0.799  0.048  0.642  4.123 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0497     0.1154    0.43     0.67    
## x             0.5827     0.0870    6.69  1.3e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.16 on 99 degrees of freedom
## Multiple R-squared:  0.312,  Adjusted R-squared:  0.305 
## F-statistic: 44.8 on 1 and 99 DF,  p-value: 1.32e-09

Now how do we measure the influence of each point in the regression?

we can use dfbetas, for example, which will measure how the slope coefficient will change if we exclude each point:

fit <- lm(y ~ x)
round(dfbetas(fit)[1 : 10, 2], 3)

##      1      2      3      4      5      6      7      8      9     10 
##  7.220 -0.004 -0.134 -0.002  0.009 -0.031  0.005 -0.040 -0.098 -0.185

whereas the hatvalues measure the potential for influence:

round(hatvalues(fit)[1 : 10], 3)

##     1     2     3     4     5     6     7     8     9    10 
## 0.564 0.010 0.022 0.015 0.010 0.011 0.010 0.011 0.021 0.038

We can see that indeed the first point, (10, 10), has huge potential to influence the model.

Something else that may be useful

One may search for patterns by looking at these types of plots:

data(swiss); par(mfrow = c(2, 2))
fit <- lm(Fertility ~ . , data = swiss); plot(fit)

plot of chunk unnamed-chunk-42

Which variables should we include in the model?

Omitting variables results in bias in the coeficients of interest - unless their regressors are uncorrelated with the omitted ones.
This is why we randomize treatments, it attempts to uncorrelate our treatment indicator with variables that we don’t have to put in the model.
(If there’s too many unobserved confounding variables, even randomization won’t help you.)
Including variables that we shouldn’t have increases standard errors of the regression variables.
Actually, including any new variables increasese (actual, not estimated) standard errors of other regressors. So we don’t want to idly throw variables into the model.
The model must tend toward perfect fit as the number of non-redundant regressors approaches $n$.
$R^2$ increases monotonically as more regressors are included.
The SSE decreases monotonically as more regressors are included.
Automated covariate selection is a difficult topic. It depends heavily on how rich of a covariate space one wants to explore.
The space of models explodes quickly as you add interactions and polynomial terms.
In the prediction class, we’ll cover many modern methods for traversing large model spaces for the purposes of prediction.
Principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.
Good design can often eliminate the need for complex model searches at analyses; though often control over the design is limited.
If the models of interest are nested and without lots of parameters differentiating them, it’s fairly uncontroversial to use nested likelihood ratio tests. (Example to follow.)
A possible approach is as follows. Given a coefficient that I’m interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out (for example, does the treatment effect still hold if we include BMI? What if we include age?). This isn’t a terribly systematic approach, but it tends to teach you a lot about the the data as you get your hands dirty.

Nested models:

Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as confounder. Now fit a second model with mpg as the outcome model that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark:

fit1=lm(mpg~factor(cyl)+wt, data=mtcars)
fit2=update(fit1, mpg~factor(cyl)+wt+factor(cyl)*wt)
anova(fit1, fit2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(cyl) + wt
## Model 2: mpg ~ factor(cyl) + wt + factor(cyl):wt
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1     28 183                         
## 2     26 156  2      27.2 2.27   0.12

How to interpret this: The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms may not be necessary.

Another example: We are interested in the agriculture coefficient. But then we also want to know whether we should add Examination and Education to the model. And then also whether we should add Catholic plus infant mortality. The ANOVA function will create likelihood ratio tests.

fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)

## Analysis of Variance Table
## 
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1     45 6283                              
## 2     43 3181  2      3102 30.2 8.6e-09 ***
## 3     41 2105  2      1076 10.5 0.00021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values say yes, we should include the examination and education components. And if you do, you should also include catholic + infant mortality.