========================================================
This is an R Markdown document created with examples of how to work with regressions. Based on Coursera’s Regression Models course.
Wiki says: Regression analysis is a statistical process for estimating the relationships among variables.
Let’s look at data used by Francis Galton in 1885: heights of parents and children
library(UsingR)
data(galton)
par(mfrow=c(1,2))
hist(galton$parent,col="blue",breaks=100, main="Parents' heights")
hist(galton$child,col="blue",breaks=100, main="Children's heights")
plot(galton$parent,galton$child,pch=19,col="blue", main="Children's vs parents' heights")
How do we fit this data? In other words, what model can we use to explain the height of children as a function of the height of their parents?
fit=lm(child~ parent , data = galton)
plot(galton$parent,galton$child,pch=19,col="blue", main="Children's vs parents' heights")
lines(galton$parent,fit$fitted,col="red",lwd=3)
\[\hat \beta_1 = Cor(\mathbf{x}, \mathbf{y}) \frac{Sd(\mathbf{x})}{Sd(\mathbf{y})} ~~~ \hat \beta_0 = \mathbf{\bar{y}} - \hat \beta_1 \mathbf{\bar{x}}\]
where \(\mathbf{\bar{x}}\) and \(\mathbf{\bar{y}}\) denote the average of the components of \(\mathbf{x}\) and \(\mathbf{y}\), respectively. Naturally, we could use this information to compute the coefficients instead of using the lm function:
y = galton$child
x = galton$parent
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)
## [1] 23.9415 0.6463
However, using the lm function is easier:
coef(lm(y ~ x))
## (Intercept) x
## 23.9415 0.6463
Further considerations:
y = galton$child-mean(galton$child)
x = galton$parent-mean(galton$parent)
coef(lm(y ~ x))
## (Intercept) x
## 6.309e-15 6.463e-01
yn = (y - mean(y))/sd(y)
xn = (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
## xn
## 0.4588 0.4588 0.4588
beta1 <- sum(y * x) / sum(x^2)
Consider an example where there are diamonds, their prices in Signapore dollars and their weight. Let’s fit and plot the data:
data(diamond)
plot(diamond$carat, diamond$price,
xlab = "Mass (carats)",
ylab = "Price (SIN $)",
bg = "lightblue",
col = "black", cex = 1.1, pch = 21,frame = FALSE)
fit= lm(price ~ carat, data = diamond)
abline(fit, lwd = 2)
The coefficients of the linear fit we obtained are:
coef(fit)
## (Intercept) carat
## -259.6 3721.0
The way we interpret \(\beta_1= 3721\) is so that we estimate an expected 3721 (SIN) dollar increase in price for every carat increase in mass of diamond. As for the intercept -259.63, it is the expected price of a 0 carat diamond.
If we take the mean out of the variable on the x-values (the weights), then the intercept changes and we can get a more easily interpretable intercept (it is now the price of the average-weighted diamond)
fit2=lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
## (Intercept) I(carat - mean(carat))
## 500.1 3721.0
Note that the “I”" inside lm is simply placed there to denote that the operation inside of the parenthesis that follow it is to be performed before the fit. Now only the intercept is changed. It is the expected price for the average sized diamond of the data.
So now imagine we want to predict the values of three other diamonds, given their masses. The following two ways to do it are equivalent:
newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
## [1] 335.7 745.1 1005.5
predict(fit, data.frame(carat = newx))
## 1 2 3
## 335.7 745.1 1005.5
In statistical notation:
Let’s look at an example of how to compute the residuals:
data(diamond)
y = diamond$price
x = diamond$carat
n = length(y)
fit = lm(y ~ x)
e=resid(fit)
e
## 1 2 3 4 5 6 7 8
## -17.9483 -7.7381 -22.9483 -85.1586 -28.6303 6.2619 23.4722 37.6312
## 9 10 11 12 13 14 15 16
## -38.7893 24.4722 51.8414 40.7389 0.2619 13.4209 -1.2098 40.5287
## 17 18 19 20 21 22 23 24
## 36.1029 -44.8406 79.3697 -25.0508 57.8414 9.2619 -20.9483 -3.7381
## 25 26 27 28 29 30 31 32
## -19.9483 27.8414 -54.9483 8.8414 -26.9483 16.4722 -22.9483 -13.1020
## 33 34 35 36 37 38 39 40
## -12.1020 -0.5278 3.2619 2.2619 -1.2098 -43.2098 -27.9483 -23.3123
## 41 42 43 44 45 46 47 48
## -15.6303 43.2672 32.8414 7.3697 4.3697 -11.5278 -14.8406 17.4722
Naturally, this give the same result as:
y-predict(fit)
## 1 2 3 4 5 6 7 8
## -17.9483 -7.7381 -22.9483 -85.1586 -28.6303 6.2619 23.4722 37.6312
## 9 10 11 12 13 14 15 16
## -38.7893 24.4722 51.8414 40.7389 0.2619 13.4209 -1.2098 40.5287
## 17 18 19 20 21 22 23 24
## 36.1029 -44.8406 79.3697 -25.0508 57.8414 9.2619 -20.9483 -3.7381
## 25 26 27 28 29 30 31 32
## -19.9483 27.8414 -54.9483 8.8414 -26.9483 16.4722 -22.9483 -13.1020
## 33 34 35 36 37 38 39 40
## -12.1020 -0.5278 3.2619 2.2619 -1.2098 -43.2098 -27.9483 -23.3123
## 41 42 43 44 45 46 47 48
## -15.6303 43.2672 32.8414 7.3697 4.3697 -11.5278 -14.8406 17.4722
To visualise the residuals, it is better to explicitly plot them. Plotting the regression curve and visualising the residuals there leaves a large portion of the picture blank, making it harder to visualise patterns that may exist:
plot(diamond$carat, diamond$price,
xlab = "Mass (carats)",
ylab = "Price (SIN $)",
bg = "lightblue",
col = "black", cex = 1.1, pch = 21,frame = FALSE)
yhat = predict(fit)
abline(fit, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)
plot(diamond$carat, e,
xlab = "Mass (carats)",
ylab = "Residuals (SIN $)",
bg = "lightblue",
col = "black", cex = 1.1, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
The fact that plotting the residuals directly instead of the regression line gives us a more clear view whether indeed our model is good is illustrated in the following plots:
x <- runif(100, -3, 3); y <- x + sin(x) + rnorm(100, sd = .2);
plot(x, y); abline(lm(y ~ x))
plot(x, resid(lm(y ~ x)));
abline(h = 0)
The latter figure clearly shows the residuals are not normally distributed, as we had assumed. Indeed, the statistical model is \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) where we assume \(\epsilon_i \sim N(0, \sigma^2)\). We have not yet tried to estimate \(\sigma\)
One intuitive way to estimate \(\sigma^2\) is averaging out squared residuals: \(\frac{1}{n}\sum_{i=1}^n e_i^2\).
The \(n-2\) instead of \(n\) is so that \(E[\hat \sigma^2] = \sigma^2\)
R computes this latter estimate automatically:
y = diamond$price; x = diamond$carat; n = length(y)
fit = lm(y ~ x)
summary(fit)$sigma
## [1] 31.84
sqrt(sum(resid(fit)^2) / (n - 2))
## [1] 31.84
Recall that we denote observed values by \(Y_i\), predicted values by \(\hat Y_i\) and the average by \(\bar Y\). It can be shown that
\[\sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \]
In other words, the total variation of the data is the residual variance (if the model were perfect, this would be zero) plus the regression variance. We can consider the percentage of total variation due to the model by \(R^2\):
\[ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \]
The quantity \(R^2\) is then the ratio between how much variation there is in the model (relative to the average of the observations) and how much variation there is in the actual data (relative to the average of the observations). In other words, \(R^2\) expresses how much of the variation the data has is explained by the model. Note:
We still have to define criteria to check whether a model is good. In practice, we perform hypothesis testing to check if the slope (\(\beta_1\)) is zero. If this can be rejected with statistical significance, then we may look at \(R^2\) to see how good the fit is.
It can be shown that one can create confidence intervals and perform hypothesis tests with regression coefficients. The standard error formulas for the coefficients are (as is easily obtained by tedious calculations):
As an example, we can test the null hypothesis (that the coefficients \(\beta_0\) and \(\beta_1\) are zero, respectively for each hypothesis)
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x);
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.6 17.32 -14.99 2.523e-19
## x 3721.0 81.79 45.50 6.751e-40
We can also get confidence intervals for the coefficients:
sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
## [1] -294.5 -224.8
# confience interval for the first coefficient
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]
## [1] 3556 3886
# confience interval for the second coefficient
Interpretation: A one carat increase in diamond size results in a 3556 to 3886 increase in price.
We’ve seen how to compute confidence intervals for the coefficients. Now consider predicting \(Y\) at a value of \(X\). A standard error is needed to create a prediction interval.
Pointwise, the obvious estimate for prediction at point \(x_0\) is \(\hat \beta_0 + \hat \beta_1 x_0\). In terms of prediction intervals, there are two different situations: when you want to predict the value taken by \(y\) at \(x_0\) and when you want to predict the point of the regression line at \(x_0\):
The prediction interval standard error at \(x_0\) is \(\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\) (there is a constant term “1” inside the square bracket because there is an uncertainty where the points are even if you know perfectly well where your fit line should be. The points deviate from the line, no matter how perfectly your model line fits the data. Then there is the 1/n because the more points you get, the closer you are to getting the model right. And the other term in the square bracket accounts that the further you get from the mean, the worse you expect your results to be)
You get a different situation if you are trying to predict what are the points of the regression line. At point \(x_0\) the standard error is now, \(\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\). It gets smaller as the number of points we know increases.
In the following figure the upper and lower limits for the estimates of the the fit line (closer to the fit line) and the predictions (further away) are plotted:
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
lines(xVals, p1[,2]); lines(xVals, p1[,3])
lines(xVals, p2[,2]); lines(xVals, p2[,3])
\[ Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots + \beta_{p} X_{pi}^2 + \epsilon_{i} \] is still a linear model. (We’ve just squared the elements of the predictor variables.)
We start by building up our example in which we have data following the equation: \(y=x1+x2+x3\) (in other words, all coefficients are one)
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- x1 + x2 + x3 + rnorm(n, sd = .1)
Note we are adding an error term (the residuals), which we assume to be normally distributed. Now we can ask R to determine the coefficients:
coef(lm(y ~ x1 + x2 + x3 - 1)) #the -1 removes the intercept term
## x1 x2 x3
## 0.9707 1.0148 1.0062
If we also try to compute the intercept:
coef(lm(y ~ x1 + x2 + x3))
## (Intercept) x1 x2 x3
## 0.003164 0.970484 1.014968 1.005869
\[E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\] So that \[ E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]\] \[= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k}+ \sum_{k=1}^p x_{k} \beta_k = \beta_1 \] So that the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
All of our SLR quantities can be extended to linear models
Model \(Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}\) where \(\epsilon_i \sim N(0, \sigma^2)\)
Fitted responses \(\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}\)
Residuals \(e_i = Y_i - \hat Y_i\)
Variance estimate \(\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2\)
To get predicted responses at new values, \(x_1, \ldots, x_p\), simply plug them into the linear model \(\sum_{k=1}^p x_{k} \hat \beta_{k}\)
Coefficients have standard errors, \(\hat \sigma_{\hat \beta_k}\), and \(\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}\) follows a \(T\) distribution with \(n-p\) degrees of freedom.
Predicted responses have standard errors and we can calculate predicted and expected response intervals.
Consider the swiss fertility data set:
library(datasets); data(swiss); head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
It is a data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
All variables but ‘Fertility’ give proportions of the population. Now let’s try to understand how we can explain fertility from the remaining variables:
summary(lm(Fertility ~ . , data = swiss))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.9152 10.70604 6.250 1.906e-07
## Agriculture -0.1721 0.07030 -2.448 1.873e-02
## Examination -0.2580 0.25388 -1.016 3.155e-01
## Education -0.8709 0.18303 -4.758 2.431e-05
## Catholic 0.1041 0.03526 2.953 5.190e-03
## Infant.Mortality 1.0770 0.38172 2.822 7.336e-03
Notice that
We estimate an expected 0.17 decrease in standardized fertility for every 1% increase in percentage of males involved in agriculture in holding the remaining variables constant.
Interestingly, the unadjusted estimate we obtain when we look at a simple regression model where there is only one predictor, agriculture, is
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.3044 4.25126 14.185 3.216e-18
## Agriculture 0.1942 0.07671 2.532 1.492e-02
Looking at only this latter (also statistically significant) model, we would expect an increase in fertility when the percentage of males involved in agriculture increases! Which model should we trust? It’s hard to answer. One should not add variables that are not meaningful. There’s a branch of statistics called causal inference which helps. Otherwise just follow an heuristic approach and build up your own story; create a story from the data and model that makes sense.
Let’s create a variable z that adds no new linear information, since it’s a linear combination of variables already included. R automatically just drops terms that are linear combinations of other terms:
z <- swiss$Agriculture + swiss$Education
lm(Fertility ~ . + z, data = swiss)
##
## Call:
## lm(formula = Fertility ~ . + z, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Examination Education
## 66.915 -0.172 -0.258 -0.871
## Catholic Infant.Mortality z
## 0.104 1.077 NA
Example:
Another example: consider the InsectSpray data set, which is counting how many insects were killed by using different sprays on different occasions:
require(datasets);data(InsectSprays)
require(stats); require(graphics)
head(InsectSprays)
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
boxplot(count ~ spray, data = InsectSprays,
xlab = "Type of spray", ylab = "Insect count",
main = "InsectSprays data", varwidth = TRUE, col = "lightgray")
If we use dummy variables, like in the previous example, we get:
summary(lm(count ~
I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F'))
, data = InsectSprays))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000 1.132 12.8074 1.471e-19
## I(1 * (spray == "B")) 0.8333 1.601 0.5205 6.045e-01
## I(1 * (spray == "C")) -12.4167 1.601 -7.7550 7.267e-11
## I(1 * (spray == "D")) -9.5833 1.601 -5.9854 9.817e-08
## I(1 * (spray == "E")) -11.0000 1.601 -6.8702 2.754e-09
## I(1 * (spray == "F")) 2.1667 1.601 1.3532 1.806e-01
The way we interpret this is that we are using A as the reference category. So 0.8333, the coefficient of sprayB, is the estimated expected increase in dead insect count if we use spray B instead of spray A. Using the notation of the previous example, \(\beta_0=14.5000\) is the expected number of dead insects if we use spray A. Moreover, \(\beta_0+\beta_1=14.5000+0.8333\) is the expected number of dead insects if we use spray B. By the way, we don’t actually have to hard code the dummy variables, R does it for us:
summary(lm(count ~ spray, data = InsectSprays))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000 1.132 12.8074 1.471e-19
## sprayB 0.8333 1.601 0.5205 6.045e-01
## sprayC -12.4167 1.601 -7.7550 7.267e-11
## sprayD -9.5833 1.601 -5.9854 9.817e-08
## sprayE -11.0000 1.601 -6.8702 2.754e-09
## sprayF 2.1667 1.601 1.3532 1.806e-01
We can also think of ommiting the intercept:
summary(lm(count ~ spray - 1, data = InsectSprays))$coef
## Estimate Std. Error t value Pr(>|t|)
## sprayA 14.500 1.132 12.807 1.471e-19
## sprayB 15.333 1.132 13.543 1.002e-20
## sprayC 2.083 1.132 1.840 7.024e-02
## sprayD 4.917 1.132 4.343 4.953e-05
## sprayE 3.500 1.132 3.091 2.917e-03
## sprayF 16.667 1.132 14.721 1.573e-22
spray2 <- relevel(InsectSprays$spray, "C")
summary(lm(count ~ spray2, data = InsectSprays))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.083 1.132 1.8401 7.024e-02
## spray2A 12.417 1.601 7.7550 7.267e-11
## spray2B 13.250 1.601 8.2755 8.510e-12
## spray2D 2.833 1.601 1.7696 8.141e-02
## spray2E 1.417 1.601 0.8848 3.795e-01
## spray2F 14.583 1.601 9.1083 2.794e-13
Let us consider the following data:
download.file("http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*","hunger.csv")
hunger <- read.csv("hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes",]
head(hunger)
## Indicator Data.Source PUBLISH.STATES Year
## 6 Children aged <5 years underweight (%) NLIS_310214 Published 1986
## 10 Children aged <5 years underweight (%) NLIS_310232 Published 1989
## 11 Children aged <5 years underweight (%) NLIS_310240 Published 1975
## 13 Children aged <5 years underweight (%) NLIS_310272 Published 1988
## 16 Children aged <5 years underweight (%) NLIS_310357 Published 1983
## 21 Children aged <5 years underweight (%) NLIS_310427 Published 1987
## WHO.region Country Sex
## 6 Americas Dominican Republic Female
## 10 Americas Bolivia (Plurinational State of) Male
## 11 Eastern Mediterranean Tunisia Male
## 13 Eastern Mediterranean Egypt Male
## 16 Western Pacific Papua New Guinea Male
## 21 Americas Guatemala Female
## Display.Value Numeric Low High Comments
## 6 7.3 7.3 NA NA NA
## 10 10.4 10.4 NA NA NA
## 11 15.5 15.5 NA NA NA
## 13 11.7 11.7 NA NA NA
## 16 26.2 26.2 NA NA NA
## 21 26.6 26.6 NA NA NA
and in particular look at how the percentage of hungry people has evolved, in different countries, throughout the years:
lm1 <- lm(hunger$Numeric ~ hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
Let’s build a regression model:
\[Hu_i = b_0 + b_1 Y_i + e_i\]
\(b_0\) = percent hungry at Year 0
\(b_1\) = decrease in percent hungry per year
\(e_i\) = everything we didn’t measure
In R, this will be given by:
lm1 <- lm(hunger$Numeric ~ hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
lines(hunger$Year,lm1$fitted,lwd=3,col="darkgrey")
Now let us consider how the genders may play a different role - segment by gender. We can do this explicitly by considering different models for the genders:
\[HuF_i = bf_0 + bf_1 YF_i + ef_i\]
\(bf_0\) = percent of girls hungry at Year 0
\(bf_1\) = decrease in percent of girls hungry per year
\(ef_i\) = everything we didn’t measure
\[HuM_i = bm_0 + bm_1 YM_i + em_i\]
\(bm_0\) = percent of boys hungry at Year 0
\(bm_1\) = decrease in percent of boys hungry per year
\(em_i\) = everything we didn’t measure
In R, this will read:
lmM <- lm(hunger$Numeric[hunger$Sex=="Male"] ~ hunger$Year[hunger$Sex=="Male"])
lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"])
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
lines(hunger$Year[hunger$Sex=="Male"],lmM$fitted,col="black",lwd=3)
lines(hunger$Year[hunger$Sex=="Female"],lmF$fitted,col="red",lwd=3)
Now we should not have to segment manually. Instead we want R to do this for us:
lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] ),col="black",lwd=3)
Basically this corresponds to the model where we have two lines with the same slope:
\[Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + e^*_i\]
\(b_0\) - percent hungry at year zero for females
\(b_0 + b_1\) - percent hungry at year zero for males
\(b_2\) - change in percent hungry (for either males or females) in one year
\(e^*_i\) - everything we didn’t measure
Note that this is NOT the same as when we segmented males/females. In that case, we had not forced the two lines to have the same slope, as we now do… What we are missing is the interaction between gender and year: the slope for men and for female/the slope of time evolution for the different genders can be different. We can consider this interaction by having the model:
\[Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + b_3 \mathbb{1}(Sex_i="Male")\times Y_i + e^+_i\]
\(b_0\) - percent hungry at year zero for females
\(b_0 + b_1\) - percent hungry at year zero for males
\(b_2\) - change in percent hungry (females) in one year
\(b_2 + b_3\) - change in percent hungry (males) in one year
\(e^+_i\) - everything we didn’t measure
which in R is implemented as:
lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] +lmBoth$coeff[4]),col="black",lwd=3)
Now how do we interpret the coefficients that follow?
summary(lmBoth)
##
## Call:
## lm(formula = hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex *
## hunger$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.71 -11.14 -1.71 7.12 46.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 593.4690 158.2955 3.75 0.00019 ***
## hunger$Year -0.2884 0.0791 -3.65 0.00028 ***
## hunger$SexMale 58.6655 223.8637 0.26 0.79333
## hunger$Year:hunger$SexMale -0.0284 0.1118 -0.25 0.79984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 1020 degrees of freedom
## Multiple R-squared: 0.0329, Adjusted R-squared: 0.0301
## F-statistic: 11.6 on 3 and 1020 DF, p-value: 1.84e-07
Here, females are the reference category. So the intercept is the percentage of hungry females in year zero (ok, the number is ridiculous, but it’s just a model…). As for 58.6655, it is the change in intercept for males. The slope for females is -0.2884 and for males it’s -0.2884+(-0.0284).
In general, the way to interpret a continuous interaction like \[ E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{1}x_{2} \] is as follows: holding \(X_2\) constant we have \[ E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_1 + \beta_3 x_{2} \] And thus the expected change in \(Y\) per unit change in \(X_1\) holding all else constant is not constant. \(\beta_1\) is the slope when \(x_{2} = 0\). Note further that: \[ E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2+1]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2+1] \] \[ -E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2] \] \[ =\beta_3 \] Thus, \(\beta_3\) is the change in the expected change in \(Y\) per unit change in \(X_1\), per unit change in \(X_2\).
Or, the change in the slope relating \(X_1\) and \(Y\) per unit change in \(X_2\).
Let’s look at another example, where the goal is to interpret a model explaining how the percentage of hungry people depends on income and the year:
\[Hu_i = b_0 + b_1 In_i + b_2 Y_i + b_3 In_i \times Y_i + e^+_i\]
Interpretation:
\(b_0\) - percent hungry at year zero for children with whose parents have no income
\(b_1\) - change in percent hungry for each dollar of income in year zero
\(b_2\) - change in percent hungry in one year for children whose parents have no income
\(b_3\) - increased change in percent hungry by year for each dollar of income - e.g. if income is $10,000, then change in percent hungry in one year will be
\[b_2 + 1e4 \times b_3\]
\(e^+_i\) - everything we didn’t measure
Calling a point an outlier is vague.
?influence.measures to see the full suite of influence measures in stats. The measures includerstandard - standardized residuals, residuals divided by their standard deviations)rstudent - standardized residuals, residuals divided by their standard deviations, where the ith data point was deleted in the calculation of the standard deviation for the residual to follow a t distributionhatvalues - measures of leveragedffits - change in the predicted response when the \(i^{th}\) point is deleted in fitting the model.dfbetas - change in individual coefficients when the \(i^{th}\) point is deleted in fitting the model.cooks.distance - overall change in teh coefficients when the \(i^{th}\) point is deleted.resid - returns the ordinary residualsresid(fit) / (1 - hatvalues(fit)) where fit is the linear model fit returns the PRESS residuals, i.e. the leave one out cross validation residuals - the difference in the response and the predicted response at data point \(i\), where it was not included in the model fitting.How do I use all of these things?
Example: Imagine we have randomly distributed data as follows. Then naturally we will not have a significant linear regression model:
n=100
x <- c(rnorm(n)); y <- c(c(rnorm(n)))
summary(lm(y ~ x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.901 -0.606 0.137 0.577 1.811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0197 0.0980 0.20 0.84
## x -0.1006 0.1058 -0.95 0.34
##
## Residual standard error: 0.975 on 98 degrees of freedom
## Multiple R-squared: 0.00914, Adjusted R-squared: -0.000968
## F-statistic: 0.904 on 1 and 98 DF, p-value: 0.344
Now if we add an outlier, point (10, 10), we shall obtain a statistically significant regression model which is just stupid:
x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(lm(y ~ x))
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.237 -0.799 0.048 0.642 4.123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0497 0.1154 0.43 0.67
## x 0.5827 0.0870 6.69 1.3e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.16 on 99 degrees of freedom
## Multiple R-squared: 0.312, Adjusted R-squared: 0.305
## F-statistic: 44.8 on 1 and 99 DF, p-value: 1.32e-09
Now how do we measure the influence of each point in the regression?
fit <- lm(y ~ x)
round(dfbetas(fit)[1 : 10, 2], 3)
## 1 2 3 4 5 6 7 8 9 10
## 7.220 -0.004 -0.134 -0.002 0.009 -0.031 0.005 -0.040 -0.098 -0.185
whereas the hatvalues measure the potential for influence:
round(hatvalues(fit)[1 : 10], 3)
## 1 2 3 4 5 6 7 8 9 10
## 0.564 0.010 0.022 0.015 0.010 0.011 0.010 0.011 0.021 0.038
We can see that indeed the first point, (10, 10), has huge potential to influence the model.
One may search for patterns by looking at these types of plots:
data(swiss); par(mfrow = c(2, 2))
fit <- lm(Fertility ~ . , data = swiss); plot(fit)
The SSE decreases monotonically as more regressors are included.
A possible approach is as follows. Given a coefficient that I’m interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out (for example, does the treatment effect still hold if we include BMI? What if we include age?). This isn’t a terribly systematic approach, but it tends to teach you a lot about the the data as you get your hands dirty.
Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as confounder. Now fit a second model with mpg as the outcome model that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark:
fit1=lm(mpg~factor(cyl)+wt, data=mtcars)
fit2=update(fit1, mpg~factor(cyl)+wt+factor(cyl)*wt)
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(cyl) + wt
## Model 2: mpg ~ factor(cyl) + wt + factor(cyl):wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 183
## 2 26 156 2 27.2 2.27 0.12
How to interpret this: The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms may not be necessary.
Another example: We are interested in the agriculture coefficient. But then we also want to know whether we should add Examination and Education to the model. And then also whether we should add Catholic plus infant mortality. The ANOVA function will create likelihood ratio tests.
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 6283
## 2 43 3181 2 3102 30.2 8.6e-09 ***
## 3 41 2105 2 1076 10.5 0.00021 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values say yes, we should include the examination and education components. And if you do, you should also include catholic + infant mortality.