The data for this lecture is the same as for the previous lecture. It is included here in two R vectors to keep this document self-contained.
sbp = c(144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128, 130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175)
age = c(39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42, 48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69)
With Stata this would be:
. use SBP
Let us create a data frame with the data:
data = data.frame(x=age, y=sbp)
The scatter plot of the data is as follows:
plot(data$x, data$y, xlab = "age", ylab="sbp")
The point with sbp > 200 seems to be an outlier and, in any case, it has been removed from the data set for the computations of this session. To locate the position of that point in the data set we do:
(outlier = which(data$y>200))
[1] 2
Now we remove it from the data:
data2 = data[-2, ]
and plot the resulting data set against the original one (using a differenmt color) to check our work:
plot(data$x, data$y, xlab = "age", ylab="sbp")
points(data2$x, data2$y, xlab = "age", ylab="sbp", col="red", pch="+")
Now that the outlier has been taken care of, we replace the original data set with the reduced one:
data = data2
To remove the outlier we use:
. drop if sbp==220
(1 observation deleted)
And then we plot the data with:
. twoway(scatter sbp age), name(stataScatter, replace)
The resulting graph is:
We are going fit two linear models for the sbp ~ age data:
We have already done this in previous weeks (but without removing the outlier), so I’ll just use the commands we already know:
lmXY = lm(y ~ x, data)
summary(lmXY)
Call:
lm(formula = y ~ x, data = data)
Residuals:
Min 1Q Median 3Q Max
-19.354 -4.797 1.254 4.747 21.153
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.0771 5.5276 17.562 2.67e-16 ***
x 0.9493 0.1161 8.174 8.88e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.563 on 27 degrees of freedom
Multiple R-squared: 0.7122, Adjusted R-squared: 0.7015
F-statistic: 66.81 on 1 and 27 DF, p-value: 8.876e-09
The Anova table for this model is:
(anovaXY = anova(lmXY))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 6110.1 6110.1 66.808 8.876e-09 ***
Residuals 27 2469.3 91.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the confidence intervals for the coefficients of the model are:
confint(lmXY)
2.5 % 97.5 %
(Intercept) 85.7354850 108.418684
x 0.7110137 1.187631
The plot of the regression line is:
plot(data$x, data$y, xlab = "age", ylab="sbp")
#curve(lmXY$co + lmXY$coefficients[2] * x)
abline(lmXY, col="orange", lwd=3)
For a degree 2 linear regression model we use again lm with a different syntax:
lm2XY = lm(y ~ I(x) + I(x^2), data)
summary(lm2XY)
Call:
lm(formula = y ~ I(x) + I(x^2), data = data)
Residuals:
Min 1Q Median 3Q Max
-17.8731 -5.8814 0.5288 5.5253 23.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.134e+02 1.321e+01 8.585 4.59e-09 ***
I(x) 8.754e-02 6.453e-01 0.136 0.893
I(x^2) 9.937e-03 7.323e-03 1.357 0.186
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.418 on 26 degrees of freedom
Multiple R-squared: 0.7312, Adjusted R-squared: 0.7105
F-statistic: 35.37 on 2 and 26 DF, p-value: 3.822e-08
The I(x), I(x^2) syntax is used here because the default behavior of R would be to use the so called orthogonal polynomials for polynomial regression. You can read more about that in this link, but before doing so I think that it is better to wait until multiple regression has been introduced in the course.
The three coefficients returned for the model are the estimates of \(\beta_0, \beta_1, \beta_2\). They can be accessed as a vector with:
(beta = lm2XY$coefficients)
(Intercept) I(x) I(x^2)
1.134097e+02 8.754328e-02 9.936809e-03
In order to reproduce all the results on page 18 of the pdf with the lecture notes we can also get the Anova table for this model:
(anova2XY = anova(lm2XY))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
I(x) 1 6110.1 6110.1 68.8896 8.809e-09 ***
I(x^2) 1 163.3 163.3 1.8412 0.1865
Residuals 26 2306.0 88.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and confidence intervals for the coefficients:
confint(lm2XY)
2.5 % 97.5 %
(Intercept) 86.255332965 140.56408881
I(x) -1.238949228 1.41403578
I(x^2) -0.005116258 0.02498988
Let us add the resulting parabola to the plot
plot(data$x, data$y, xlab = "age", ylab="sbp")
#curve(lmXY$co + lmXY$coefficients[2] * x)
abline(lmXY, col="orange", lwd=3)
curve(beta[1] + beta[2] * x + beta[3] * x^2,
col="blue", lwd="3", lty=2, add=TRUE)
legend("topleft", inset=.05, title="Model type", c("degree 1", "degree 2"),
fill=c("orange","blue"))
And just for fun, if you want to add segments connecting the sample points to the predicted values (as the figure on page 5 of the pdf), just uncomment and execute the line in the next chunk of code (and if you do, I recommend to comment the
abline line, to avoid cluttering the graph):
#segments(data$x, lm2XY$fitted.values, data$x, data$y, lty=3)
The simple (degree 1) linear regression model for sbb vs age is obtained with:
. regress sbp age
Source | SS df MS Number of obs = 29
-------------+---------------------------------- F(1, 27) = 66.81
Model | 6110.10173 1 6110.10173 Prob > F = 0.0000
Residual | 2469.34654 27 91.4572794 R-squared = 0.7122
-------------+---------------------------------- Adj R-squared = 0.7015
Total | 8579.44828 28 306.408867 Root MSE = 9.5633
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .9493225 .1161445 8.17 0.000 .7110137 1.187631
_cons | 97.07708 5.527552 17.56 0.000 85.73549 108.4187
------------------------------------------------------------------------------
And the linear model with a degree 2 polynomial on age is:
. gen agesq=age*age
. regress sbp age agesq
Source | SS df MS Number of obs = 29
-------------+---------------------------------- F(2, 26) = 35.37
Model | 6273.40168 2 3136.70084 Prob > F = 0.0000
Residual | 2306.0466 26 88.6940999 R-squared = 0.7312
-------------+---------------------------------- Adj R-squared = 0.7105
Total | 8579.44828 28 306.408867 Root MSE = 9.4178
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0875433 .6453289 0.14 0.893 -1.238949 1.414036
agesq | .0099368 .0073232 1.36 0.186 -.0051163 .0249899
_cons | 113.4097 13.21041 8.58 0.000 86.25533 140.5641
------------------------------------------------------------------------------
If you compare the R output with the Stata output you will see that R does not show the total sum of squares, but you can recover it as:
sum(anova2XY$`Sum Sq`)
[1] 8579.448
The sum is of course the same if you decide to use the degree 1 model:
sum(anovaXY$`Sum Sq`)
[1] 8579.448
By contrast, Stata outputs the sum of squares for the degree 2 model, but the decomposition of that sum of squares between the degree 1 and degree 2 terms is not provided.
The required F-statistic, degrees of freedom and p-value appear in the output of the Anova table for the degree 2 regression model. But maybe we can make their computation explicit. Recall that the extra sum of squares due to adding \(x^2\) (as in page 15 of the pdf) is obtained in R as:
anova2XY$`Sum Sq`[2]
[1] 163.2999
and the residual sum of squares is
anova2XY$`Sum Sq`[3]
[1] 2306.047
The corresponding degrees of freedom are:
anova2XY$Df[2:3]
[1] 1 26
and so
(Fstatistic = (anova2XY$`Sum Sq`[2] / anova2XY$Df[2]) / (anova2XY$`Sum Sq`[3] / anova2XY$Df[3]) )
[1] 1.841159
which results in a p-value:
(pValue = pf(Fstatistic, df1 = anova2XY$Df[2], df2 = anova2XY$Df[3], lower.tail = FALSE))
[1] 0.1864797
To test the samehypothesis using Student’s \(t\) we go back to the summary of the linear model:
(sum2XY = summary(lm2XY))
Call:
lm(formula = y ~ I(x) + I(x^2), data = data)
Residuals:
Min 1Q Median 3Q Max
-17.8731 -5.8814 0.5288 5.5253 23.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.134e+02 1.321e+01 8.585 4.59e-09 ***
I(x) 8.754e-02 6.453e-01 0.136 0.893
I(x^2) 9.937e-03 7.323e-03 1.357 0.186
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.418 on 26 degrees of freedom
Multiple R-squared: 0.7312, Adjusted R-squared: 0.7105
F-statistic: 35.37 on 2 and 26 DF, p-value: 3.822e-08
from here the formula \[ t = \dfrac{\hat\beta_2}{\hat{SE}(\hat\beta_2)} \] on page 16 of the pdf is obtained with:
(tStatistic = sum2XY$coefficients[3, 1] / sum2XY$coefficients[3, 2])
[1] 1.356893
If you are confused about this reecall that sum2XY$coefficients is a matrix:
sum2XY$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.134097e+02 13.210405754 8.5848772 4.593894e-09
I(x) 8.754328e-02 0.645328878 0.1356568 8.931374e-01
I(x^2) 9.936809e-03 0.007323207 1.3568932 1.864797e-01
Note also that:
tStatistic^2
[1] 1.841159
equals the value of the F statistic. Thus the p-value computed as:
(n = nrow(data))
[1] 29
(k = nrow(sum2XY$coefficients) - 1)
[1] 2
(pValue = 2 * pt(abs(tStatistic), df = n - k - 1, lower.tail = FALSE))
[1] 0.1864797
is the same as before. Two remarks:
For the variance inflation factor (vif, this is explained later on in the lecture) we use the function provided by the library car:
library(car)
vif(lm2XY)
I(x) I(x^2)
31.83379 31.83379
(tolerance = 1 / vif(lm2XY))
I(x) I(x^2)
0.03141316 0.03141316
wtgain = c(1, 1.2, 1.8, 2.5, 3.6, 4.7, 6.6, 9.1)
dose = 1:8
WDdata = data.frame(x = dose, y = wtgain)
To use the data in Stata I will use:
library(foreign)
write.dta(data.frame(dose, wtgain), "weightDose.dta")
Note that I have dropped the \(x, y\) notation, so that the variable names in Stata coincide with those in the lecture.
The scatter plot:
plot(WDdata$x, WDdata$y, xlab = "Dose", ylab = "Weight gain")
The degree 1 linear model is:
(lmXY = lm(y ~ x, data=WDdata))
Call:
lm(formula = y ~ x, data = WDdata)
Coefficients:
(Intercept) x
-1.196 1.113
(sumXY = summary(lmXY))
Call:
lm(formula = y ~ x, data = WDdata)
Residuals:
Min 1Q Median 3Q Max
-0.7821 -0.7592 -0.1691 0.3985 1.3917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1964 0.7135 -1.677 0.144607
x 1.1131 0.1413 7.877 0.000222 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9157 on 6 degrees of freedom
Multiple R-squared: 0.9118, Adjusted R-squared: 0.8971
F-statistic: 62.05 on 1 and 6 DF, p-value: 0.0002217
(anovaXY = anova(lmXY))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 52.037 52.037 62.053 0.0002217 ***
Residuals 6 5.032 0.839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(lmXY)
2.5 % 97.5 %
(Intercept) -2.9424073 0.5495501
x 0.7673399 1.4588505
and the degree 2 linear model is:
(lm2XY = lm(y ~ I(x) + I(x^2), data=WDdata))
Call:
lm(formula = y ~ I(x) + I(x^2), data = WDdata)
Coefficients:
(Intercept) I(x) I(x^2)
1.3482 -0.4137 0.1696
beta = lm2XY$coefficients
(sum2XY = summary(lm2XY))
Call:
lm(formula = y ~ I(x) + I(x^2), data = WDdata)
Residuals:
1 2 3 4 5 6
-0.1041667 0.0005952 0.1660714 0.0922619 0.0791667 -0.2732143
7 8
-0.1648810 0.2041667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3482 0.2767 4.872 0.004585 **
I(x) -0.4137 0.1411 -2.932 0.032554 *
I(x^2) 0.1696 0.0153 11.085 0.000104 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1984 on 5 degrees of freedom
Multiple R-squared: 0.9966, Adjusted R-squared: 0.9952
F-statistic: 722.7 on 2 and 5 DF, p-value: 6.977e-07
(anova2XY = anova(lm2XY))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
I(x) 1 52.037 52.037 1322.58 2.96e-07 ***
I(x^2) 1 4.835 4.835 122.88 0.0001041 ***
Residuals 5 0.197 0.039
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(lm2XY)
2.5 % 97.5 %
(Intercept) 0.6368423 2.05958627
I(x) -0.7763778 -0.05100318
I(x^2) 0.1303039 0.20898182
The following figure strongly suggests that the parabola is a better fit:
plot(WDdata$x, WDdata$y, xlab = "Dose", ylab="Weight gain")
abline(lmXY, col="orange", lwd=3)
curve(beta[1] + beta[2] * x + beta[3] * x^2,
col="blue", lwd="3", lty=2, add=TRUE)
legend("topleft", inset=.05, title="Model type", c("degree 1", "degree 2"),
fill=c("orange","blue"))
This is just a simple matter of adding a new term:
(lm3XY = lm(y ~ I(x) + I(x^2) + I(x^3), data=WDdata))
Call:
lm(formula = y ~ I(x) + I(x^2) + I(x^3), data = WDdata)
Coefficients:
(Intercept) I(x) I(x^2) I(x^3)
0.58571 0.37962 -0.03831 0.01540
(sum3XY = summary(lm3XY))
Call:
lm(formula = y ~ I(x) + I(x^2) + I(x^3), data = WDdata)
Residuals:
1 2 3 4 5 6 7
0.057576 -0.114935 0.004329 0.022944 0.148485 -0.111472 -0.049351
8
0.042424
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.585714 0.290972 2.013 0.1144
I(x) 0.379618 0.263287 1.442 0.2228
I(x^2) -0.038312 0.066042 -0.580 0.5929
I(x^3) 0.015404 0.004845 3.179 0.0336 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1181 on 4 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.9983
F-statistic: 1363 on 3 and 4 DF, p-value: 1.791e-06
(anova3XY = anova(lm3XY))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
I(x) 1 52.037 52.037 3731.655 4.301e-07 ***
I(x^2) 1 4.835 4.835 346.711 4.897e-05 ***
I(x^3) 1 0.141 0.141 10.107 0.03356 *
Residuals 4 0.056 0.014
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(lm3XY)
2.5 % 97.5 %
(Intercept) -0.222154377 1.39358295
I(x) -0.351383545 1.11061875
I(x^2) -0.221673230 0.14504985
I(x^3) 0.001951568 0.02885651
We have already imported the vif function:
vif(lm3XY)
I(x) I(x^2) I(x^3)
208.7828 1116.5909 399.0101
1/vif(lm3XY)
I(x) I(x^2) I(x^3)
0.0047896659 0.0008955831 0.0025062022
We load the data exported from R:
. use weightDose
(Written by R. )
The degree 1 model is obtained with:
. regress wtgain dose
Source | SS df MS Number of obs = 8
-------------+---------------------------------- F(1, 6) = 62.05
Model | 52.0372024 1 52.0372024 Prob > F = 0.0002
Residual | 5.03154762 6 .83859127 R-squared = 0.9118
-------------+---------------------------------- Adj R-squared = 0.8971
Total | 57.06875 7 8.15267857 Root MSE = .91575
------------------------------------------------------------------------------
wtgain | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | 1.113095 .1413027 7.88 0.000 .7673399 1.458851
_cons | -1.196429 .7135438 -1.68 0.145 -2.942407 .5495501
------------------------------------------------------------------------------
. vif
Variable | VIF 1/VIF
-------------+----------------------
dose | 1.00 1.000000
-------------+----------------------
Mean VIF | 1.00
Degree 2 model:
. gen dosesq = dose * dose
. regress wtgain dose dosesq
Source | SS df MS Number of obs = 8
-------------+---------------------------------- F(2, 5) = 722.73
Model | 56.8720238 2 28.4360119 Prob > F = 0.0000
Residual | .19672619 5 .039345238 R-squared = 0.9966
-------------+---------------------------------- Adj R-squared = 0.9952
Total | 57.06875 7 8.15267857 Root MSE = .19836
------------------------------------------------------------------------------
wtgain | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | -.4136905 .1410915 -2.93 0.033 -.7763778 -.0510032
dosesq | .1696429 .0153035 11.09 0.000 .1303039 .2089818
_cons | 1.348214 .2767358 4.87 0.005 .6368423 2.059586
------------------------------------------------------------------------------
. vif
Variable | VIF 1/VIF
-------------+----------------------
dose | 21.25 0.047059
dosesq | 21.25 0.047059
-------------+----------------------
Mean VIF | 21.25
and degree 3 model:
. gen dosecube = dosesq * dose
. regress wtgain dose dosesq dosecube
Source | SS df MS Number of obs = 8
-------------+---------------------------------- F(3, 4) = 1362.82
Model | 57.0129708 3 19.0043236 Prob > F = 0.0000
Residual | .055779221 4 .013944805 R-squared = 0.9990
-------------+---------------------------------- Adj R-squared = 0.9983
Total | 57.06875 7 8.15267857 Root MSE = .11809
------------------------------------------------------------------------------
wtgain | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | .3796176 .2632867 1.44 0.223 -.3513835 1.110619
dosesq | -.0383117 .0660418 -0.58 0.593 -.2216732 .1450499
dosecube | .015404 .0048452 3.18 0.034 .0019516 .0288565
_cons | .5857143 .2909723 2.01 0.114 -.2221544 1.393583
------------------------------------------------------------------------------
. vif
Variable | VIF 1/VIF
-------------+----------------------
dosesq | 1116.59 0.000896
dosecube | 399.01 0.002506
dose | 208.78 0.004790
-------------+----------------------
Mean VIF | 574.79
By the way, the ordering of the variables in the VIF Stata output seems arbitrary…
Thanks for your attention!