The data for this lecture is:
wgt = c(64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68)
hgt = c(57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57)
age = c(8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9)
Let us create a data frame with the data:
data = data.frame(y=wgt, x1=hgt, x2=age)
To use the data in Stata I will export it to a dta file:
library(foreign)
write.dta(data.frame(wgt, hgt, age), "weightHeightAge.dta")
And then to load it in Stata simply execute:
. use weightHeightAge
(Written by R. )
My basic goal for this week is quite simple: to be able to reproduce with R the numeric results in the pdf with the slides for this week’s lecture. In order to do that I’ll jup directly to pages 23 to 25 of the pdf, and I will show you how to obtain the results for the three linear models that appear on those pages. I will use numbers 3, 2, 1 to identify these three models, as explained below. The number indicates the number \(k\) of terms in each model (not taking the intercept into account). Also for the results in the first pages:
lm3: with hgt, age, and age squared.All the models are built using the lm function that we have used in previous weeks. It’s just a matter of adding the required terms to the model formula:
lm3 = lm(y ~ I(x1) + I(x2) + I(x2^2), data)
(summLm3 = summary(lm3))
Call:
lm(formula = y ~ I(x1) + I(x2) + I(x2^2), data = data)
Residuals:
Min 1Q Median 3Q Max
-6.806 -1.771 0.374 1.374 10.161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.43843 33.61082 0.102 0.921
I(x1) 0.72369 0.27696 2.613 0.031 *
I(x2) 2.77687 7.42728 0.374 0.718
I(x2^2) -0.04171 0.42241 -0.099 0.924
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.94 on 8 degrees of freedom
Multiple R-squared: 0.7803, Adjusted R-squared: 0.6978
F-statistic: 9.469 on 3 and 8 DF, p-value: 0.005208
For later use we need to determine the values of \(n\) and \(k\). A simple way of getting these values, irrespective of the model formula you are using is:
(k3 = length(lm3$coefficients) - 1)
[1] 3
(n = nrow(data))
[1] 12
The Anova table for this model is:
(anovaXY3 = anova(lm3))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
I(x1) 1 588.92 588.92 24.1375 0.001174 **
I(x2) 1 103.90 103.90 4.2584 0.072951 .
I(x2^2) 1 0.24 0.24 0.0097 0.923777
Residuals 8 195.19 24.40
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the confidence intervals for the coefficients of the model are:
confint(lm3)
2.5 % 97.5 %
(Intercept) -74.06826354 80.9451155
I(x1) 0.08501204 1.3623684
I(x2) -14.35046099 19.9042101
I(x2^2) -1.01577933 0.9323659
As usual, R returns the decomposed sum of squares in the Anova table, but there’s no explicit value of the total sum of squares due regression as in Stata (the Model SS in Stata). But from the Anova table we can easily get the SS due regression as:
(SSregression3 =sum(anovaXY3$`Sum Sq`[1:k3]))
[1] 693.0605
and the total sum of squares (both model and residues) as:
(SStotal =sum(anovaXY3$`Sum Sq`))
[1] 888.25
Note, however, that the decomposition of the model sum of squares provided by R is the one that appears on page 26 of the pdf, so we get that information directly.
The variance inflation factors can be obtained with:
library(car)
vif(lm3)
I(x1) I(x2) I(x2^2)
1.610495 89.684752 89.969483
(tolerance = 1 / vif(lm3))
I(x1) I(x2) I(x2^2)
0.62092701 0.01115017 0.01111488
Thus, we have obtained all the results on page 23 of the pdf.
lm2: with hgt and age.Similarly, for this linear regression model we get the results on page 24 as follows. First, fit the model:
lm2 = lm(y ~ x1 + x2, data)
(summLm2 = summary(lm2))
Call:
lm(formula = y ~ x1 + x2, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.8708 -1.7004 0.3454 1.4642 10.2336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5530 10.9448 0.599 0.5641
x1 0.7220 0.2608 2.768 0.0218 *
x2 2.0501 0.9372 2.187 0.0565 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared: 0.78, Adjusted R-squared: 0.7311
F-statistic: 15.95 on 2 and 9 DF, p-value: 0.001099
The Anova table is:
(anovaXY2 = anova(lm2))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 588.92 588.92 27.1216 0.0005582 ***
x2 1 103.90 103.90 4.7849 0.0564853 .
Residuals 9 195.43 21.71
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that now:
(k2 = length(lm2$coefficients) - 1)
[1] 2
From the Anova table we get the SS due regression for this model:
(SSregression2 =sum(anovaXY2$`Sum Sq`[1:k2]))
[1] 692.8226
but the total sum of squares (both model and residues) remains unchanged, as expected:
(SStotal =sum(anovaXY2$`Sum Sq`))
[1] 888.25
The confidence intervals for the coefficients of the model are:
confint(lm2)
2.5 % 97.5 %
(Intercept) -18.20587071 31.311967
x1 0.13205592 1.312020
x2 -0.07002526 4.170278
and the variance inflation factors:
vif(lm2)
x1 x2
1.604616 1.604616
(tolerance = 1 / vif(lm2))
x1 x2
0.6232021 0.6232021
lm1: with hgt alone.For this linear regression model we get the results on page 25 as follows. First, fit the model:
lm1 = lm(y ~ x1, data)
(summLm1 = summary(lm1))
Call:
lm(formula = y ~ x1, data = data)
Residuals:
Min 1Q Median 3Q Max
-5.8736 -3.8973 -0.4402 2.2624 11.8375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1898 12.8487 0.482 0.64035
x1 1.0722 0.2417 4.436 0.00126 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.471 on 10 degrees of freedom
Multiple R-squared: 0.663, Adjusted R-squared: 0.6293
F-statistic: 19.67 on 1 and 10 DF, p-value: 0.001263
The Anova table is:
(anovaXY1 = anova(lm1))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 588.92 588.92 19.675 0.001263 **
Residuals 10 299.33 29.93
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now:
(k1 = length(lm1$coefficients) - 1)
[1] 1
From the Anova table we get the SS due regression for this model:
(SSregression1 =sum(anovaXY1$`Sum Sq`[1:k1]))
[1] 588.9225
and the total sum of squares (both model and residues) is the same as for the previous models:
(SStotal =sum(anovaXY1$`Sum Sq`))
[1] 888.25
The confidence intervals for the coefficients of the model are:
confint(lm1)
2.5 % 97.5 %
(Intercept) -22.4389419 34.818639
x1 0.5336202 1.610841
and when we try to obtain the variance inflation factors:
#vif(lm1)
#(tolerance = 1 / vif(lm1))
as you see, R refuses to compute the variance inflation factor for a model with a single term.
For the sake of completeness, I’ll include here the Stata code iused to generate these results:
First linear model with hgt, age and age squared:
. gen agesq = age*age
. regress wgt hgt age agesq
Source | SS df MS Number of obs = 12
-------------+------------------------------ F( 3, 8) = 9.47
Model | 693.060463 3 231.020154 Prob > F = 0.0052
Residual | 195.189537 8 24.3986921 R-squared = 0.7803
-------------+------------------------------ Adj R-squared = 0.6978
Total | 888.25 11 80.75 Root MSE = 4.9395
------------------------------------------------------------------------------
wgt | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
hgt | .7236902 .2769632 2.61 0.031 .085012 1.362368
age | 2.776875 7.427279 0.37 0.718 -14.35046 19.90421
agesq | -.0417067 .4224071 -0.10 0.924 -1.015779 .9323659
_cons | 3.438426 33.61082 0.10 0.921 -74.06826 80.94512
------------------------------------------------------------------------------
. vif
Variable | VIF 1/VIF
-------------+----------------------
agesq | 89.97 0.011115
age | 89.68 0.011150
hgt | 1.61 0.620927
-------------+----------------------
Mean VIF | 60.42
Second linear model with hgt and age:
. regress wgt hgt age
Source | SS df MS Number of obs = 12
-------------+------------------------------ F( 2, 9) = 15.95
Model | 692.822607 2 346.411303 Prob > F = 0.0011
Residual | 195.427393 9 21.7141548 R-squared = 0.7800
-------------+------------------------------ Adj R-squared = 0.7311
Total | 888.25 11 80.75 Root MSE = 4.6598
------------------------------------------------------------------------------
wgt | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
hgt | .722038 .2608051 2.77 0.022 .1320559 1.31202
age | 2.050126 .9372256 2.19 0.056 -.0700253 4.170278
_cons | 6.553048 10.94483 0.60 0.564 -18.20587 31.31197
------------------------------------------------------------------------------
. vif
Variable | VIF 1/VIF
-------------+----------------------
age | 1.60 0.623202
hgt | 1.60 0.623202
-------------+----------------------
Mean VIF | 1.60
Third linear model with only hgt:
. regress wgt hgt
Source | SS df MS Number of obs = 12
-------------+------------------------------ F( 1, 10) = 19.67
Model | 588.922523 1 588.922523 Prob > F = 0.0013
Residual | 299.327477 10 29.9327477 R-squared = 0.6630
-------------+------------------------------ Adj R-squared = 0.6293
Total | 888.25 11 80.75 Root MSE = 5.4711
------------------------------------------------------------------------------
wgt | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
hgt | 1.07223 .241731 4.44 0.001 .5336202 1.610841
_cons | 6.189849 12.84875 0.48 0.640 -22.43894 34.81864
------------------------------------------------------------------------------
. vif
Variable | VIF 1/VIF
-------------+----------------------
hgt | 1.00 1.000000
-------------+----------------------
Mean VIF | 1.00
.
end of do-file
The F statistic for the lm3 model is computed as:
(k3 = length(lm3$coefficients) - 1)
[1] 3
(MSregression = sum(anovaXY3$`Sum Sq`[1:k3]) / k3)
[1] 231.0202
(MSresidual = anovaXY3$`Sum Sq`[k3 + 1] / (n - k3 -1))
[1] 24.39869
(Fstatistic = MSregression / MSresidual)
[1] 9.468547
which leads to a p-value equal to:
(pValue = pf(Fstatistic, df1 = k3, df2 = n - k3 - 1, lower.tail = FALSE))
[1] 0.005208374
The cut point for the 99% hypothesis test that appears on page 22 can be computed with:
qf(0.99, df1 = k3, df2 = n - 1 - k3)
[1] 7.590992
Suppose that you want to compare a partial F test between a model A, and a model B that is obtained adding a new term to A. In R to perform the partial F test you can call the anova function with the models A and B as arguments. For example, the model lm2 is obtained from the model lm1 by adding the term I(x2). The partial F test for these two models is obtained with:
(anova_1_2 = anova(lm1, lm2))
Analysis of Variance Table
Model 1: y ~ x1
Model 2: y ~ x1 + x2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 10 299.33
2 9 195.43 1 103.9 4.7849 0.05649 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see the second row of the column named Sum of Sq contains the value called \(SS(x_2|x_1)\) in the pdf (see page 31, and the last column of the table on page 26). The value \(MS residual(x_1, x_2)\) can be obtained as:
(MSresidual_1_2 = anova_1_2$RSS[2]/anova_1_2$Res.Df[2])
[1] 21.71415
and then the value of the F statistic is:
(Fstatistic = anova_1_2$`Sum of Sq`[2] / MSresidual_1_2)
[1] 4.784901
Similarly, the model lm3 is obtained from the model lm2 by adding the term I(x2^2). The partial F test for these two models is obtained with:
(anova_2_3 = anova(lm2, lm3))
Analysis of Variance Table
Model 1: y ~ x1 + x2
Model 2: y ~ I(x1) + I(x2) + I(x2^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 9 195.43
2 8 195.19 1 0.23786 0.0097 0.9238
and the F statistic is obtained as follows:
anova_2_3$`Sum of Sq`[2]
[1] 0.2378569
(MSresidual_2_3 = anova_2_3$RSS[2]/anova_2_3$Res.Df[2])
[1] 24.39869
(Fstatistic = anova_2_3$`Sum of Sq`[2] / MSresidual_2_3)
[1] 0.009748754
The first two values in this computation should be compared with the results for \(SS(x_3|x_1,x_2)\) on page 31 (the numerator of F) and \(MS residual(x_1, x_2, x_3)\) on page 33 (the denominator of F), respectively.
We can even see the computation on page 28 as a particular case of this, provided we introduce the null model, whose only term is the intercept. In R we can define that model as:
(lm0 = lm(y ~ 1, data))
Call:
lm(formula = y ~ 1, data = data)
Coefficients:
(Intercept)
62.75
and then lm1 is obtained from the null model by adding the term \(x1\). Thus we can compare these two models with:
(anova_0_1 = anova(lm0, lm1))
Analysis of Variance Table
Model 1: y ~ 1
Model 2: y ~ x1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 11 888.25
2 10 299.33 1 588.92 19.675 0.001263 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and the same type of computation leads to:
anova_0_1$`Sum of Sq`[2]
[1] 588.9225
(MSresidual_0_1 = anova_0_1$RSS[2]/anova_0_1$Res.Df[2])
[1] 29.93275
(Fstatistic = anova_0_1$`Sum of Sq`[2] / MSresidual_0_1)
[1] 19.67486
And you can check that these are the same results obtained on page 28.
The same partial tests can be obtained directly from the summary of lm for each model. For example, when comparing lm3 to lm2 we can look at the summary of lm3. The estimate \(\hat\beta_3\) is in the last row (given by k3+1, the number of terms) of the first column in the table for that summary. The standard error for that estimate is in the same row of the second column. Those two values are, therefore:
summLm3$coefficients[k3 + 1, 1]
[1] -0.0417067
summLm3$coefficients[k3 + 1, 2]
[1] 0.4224071
and their quotient leads to the \(t\) statistic at the bottom of page 36 of the pdf:
(tStatistic = summLm3$coefficients[k3 + 1, 1] / summLm3$coefficients[k3 + 1, 2])
[1] -0.09873578
and the degrees of freedom to use can be expressed as:
n - k3 - 1
[1] 8
Thus the p-value for this 2-sided test is:
(pValue = 2 * pt(abs(tStatistic), df = n - k3 - 1, lower.tail = FALSE))
[1] 0.9237772
You may check that this is the same p-value that we obtained with anova_2_3. The square of the t statistic is the same value of \(F\) statistic that we obtained there, as expected:
tStatistic^2
[1] 0.009748754
Similarly, for the comparison between lm2 and lm1:
summLm2$coefficients[k2 + 1, 1]
[1] 2.050126
summLm2$coefficients[k2 + 1, 2]
[1] 0.9372256
(tStatistic = summLm2$coefficients[k2 + 1, 1] / summLm2$coefficients[k2 + 1, 2])
[1] 2.187442
(pValue = 2 * pt(abs(tStatistic), df = n - k2 - 1, lower.tail = FALSE))
[1] 0.05648531
tStatistic^2
[1] 4.784901
and the results agree with those we obtained in anova_1_2
The following code renders a 3d sctater plot of the sample points, predicted points and the hypersurface for model lm3. It may be a little hard to see this in the picture, but the hypersurface is actually curved. To appreciate that you can use your mouse to rotate the image. Zoom with the mouse scroll. Enjoy!
library(rgl)
open3d()
wgl
1
plot3d(data$x1, data$x2, data$y, col="blue", size=8)
points3d(data$x1, data$x2, lm1$fitted.values, col="cyan", size=8)
segments3d(x=rep(data$x1, rep(2, length(data$x1))), y=rep(data$x2, rep(2, length(data$x2))), z=c(t(cbind(data$y, lm1$fitted.values))), col="red", lwd=2)
Hyp <- function(x1, x2) {
beta = lm3$coefficients
beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x2^2
}
minx = min(data$x1)
maxx = max(data$x1)
miny = min(data$x2)
maxy = max(data$x2)
nx = 20
ny = 100
x= seq(minx, maxx, length.out = nx)
y= seq(miny, maxy, length.out = ny)
xnet = rep(x, ny)
ynet = rep(y, rep(nx,ny))
XY = matrix(c(xnet, ynet), ncol = 2)
znet = apply(XY, MARGIN = 1, FUN = function(x)(Hyp(x1 = x[1], x2 = x[2])))
points3d(xnet, ynet, znet, col="darkgreen")
nx = 100
ny = 20
x= seq(minx, maxx, length.out = nx)
y= seq(miny, maxy, length.out = ny)
ynet = rep(y, nx)
xnet = rep(x, rep(ny,nx))
XY = matrix(c(xnet, ynet), ncol = 2)
znet = apply(XY, MARGIN = 1, FUN = function(x)(Hyp(x1 = x[1], x2 = x[2])))
points3d(xnet, ynet, znet, col="darkgreen")
You must enable Javascript to view this page properly.
Thanks for your attention!