Preliminaries

Loading the weight, height, age data.

The data for this lecture is:

wgt = c(64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68)

hgt = c(57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57)

age = c(8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9)

Let us create a data frame with the data:

data = data.frame(y=wgt, x1=hgt, x2=age)

Stata.

To use the data in Stata I will export it to a dta file:

library(foreign)
write.dta(data.frame(wgt, hgt, age), "weightHeightAge.dta") 

And then to load it in Stata simply execute:

. use weightHeightAge
(Written by R.              )

Fitting the linear models.

My basic goal for this week is quite simple: to be able to reproduce with R the numeric results in the pdf with the slides for this week’s lecture. In order to do that I’ll jup directly to pages 23 to 25 of the pdf, and I will show you how to obtain the results for the three linear models that appear on those pages. I will use numbers 3, 2, 1 to identify these three models, as explained below. The number indicates the number \(k\) of terms in each model (not taking the intercept into account). Also for the results in the first pages:

First linear model lm3: with hgt, age, and age squared.

All the models are built using the lm function that we have used in previous weeks. It’s just a matter of adding the required terms to the model formula:

lm3 = lm(y ~ I(x1) + I(x2) + I(x2^2), data)
(summLm3 = summary(lm3))

Call:
lm(formula = y ~ I(x1) + I(x2) + I(x2^2), data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-6.806 -1.771  0.374  1.374 10.161 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  3.43843   33.61082   0.102    0.921  
I(x1)        0.72369    0.27696   2.613    0.031 *
I(x2)        2.77687    7.42728   0.374    0.718  
I(x2^2)     -0.04171    0.42241  -0.099    0.924  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.94 on 8 degrees of freedom
Multiple R-squared:  0.7803,    Adjusted R-squared:  0.6978 
F-statistic: 9.469 on 3 and 8 DF,  p-value: 0.005208

For later use we need to determine the values of \(n\) and \(k\). A simple way of getting these values, irrespective of the model formula you are using is:

(k3 = length(lm3$coefficients) - 1)
[1] 3
(n = nrow(data))
[1] 12

The Anova table for this model is:

(anovaXY3 = anova(lm3))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)   
I(x1)      1 588.92  588.92 24.1375 0.001174 **
I(x2)      1 103.90  103.90  4.2584 0.072951 . 
I(x2^2)    1   0.24    0.24  0.0097 0.923777   
Residuals  8 195.19   24.40                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And the confidence intervals for the coefficients of the model are:

confint(lm3)
                   2.5 %     97.5 %
(Intercept) -74.06826354 80.9451155
I(x1)         0.08501204  1.3623684
I(x2)       -14.35046099 19.9042101
I(x2^2)      -1.01577933  0.9323659

As usual, R returns the decomposed sum of squares in the Anova table, but there’s no explicit value of the total sum of squares due regression as in Stata (the Model SS in Stata). But from the Anova table we can easily get the SS due regression as:

(SSregression3 =sum(anovaXY3$`Sum Sq`[1:k3]))
[1] 693.0605

and the total sum of squares (both model and residues) as:

(SStotal =sum(anovaXY3$`Sum Sq`))
[1] 888.25

Note, however, that the decomposition of the model sum of squares provided by R is the one that appears on page 26 of the pdf, so we get that information directly.

The variance inflation factors can be obtained with:

library(car)
vif(lm3)
    I(x1)     I(x2)   I(x2^2) 
 1.610495 89.684752 89.969483 
(tolerance = 1 / vif(lm3))
     I(x1)      I(x2)    I(x2^2) 
0.62092701 0.01115017 0.01111488 

Thus, we have obtained all the results on page 23 of the pdf.

Second linear model lm2: with hgt and age.

Similarly, for this linear regression model we get the results on page 24 as follows. First, fit the model:

lm2 = lm(y ~ x1 + x2, data)
(summLm2 = summary(lm2))

Call:
lm(formula = y ~ x1 + x2, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8708 -1.7004  0.3454  1.4642 10.2336 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   6.5530    10.9448   0.599   0.5641  
x1            0.7220     0.2608   2.768   0.0218 *
x2            2.0501     0.9372   2.187   0.0565 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared:   0.78, Adjusted R-squared:  0.7311 
F-statistic: 15.95 on 2 and 9 DF,  p-value: 0.001099

The Anova table is:

(anovaXY2 = anova(lm2))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x1         1 588.92  588.92 27.1216 0.0005582 ***
x2         1 103.90  103.90  4.7849 0.0564853 .  
Residuals  9 195.43   21.71                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that now:

(k2 = length(lm2$coefficients) - 1)
[1] 2

From the Anova table we get the SS due regression for this model:

(SSregression2 =sum(anovaXY2$`Sum Sq`[1:k2]))
[1] 692.8226

but the total sum of squares (both model and residues) remains unchanged, as expected:

(SStotal =sum(anovaXY2$`Sum Sq`))
[1] 888.25

The confidence intervals for the coefficients of the model are:

confint(lm2)
                   2.5 %    97.5 %
(Intercept) -18.20587071 31.311967
x1            0.13205592  1.312020
x2           -0.07002526  4.170278

and the variance inflation factors:

vif(lm2)
      x1       x2 
1.604616 1.604616 
(tolerance = 1 / vif(lm2))
       x1        x2 
0.6232021 0.6232021 

Linear model lm1: with hgt alone.

For this linear regression model we get the results on page 25 as follows. First, fit the model:

lm1 = lm(y ~ x1, data)
(summLm1 = summary(lm1))

Call:
lm(formula = y ~ x1, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8736 -3.8973 -0.4402  2.2624 11.8375 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   6.1898    12.8487   0.482  0.64035   
x1            1.0722     0.2417   4.436  0.00126 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.471 on 10 degrees of freedom
Multiple R-squared:  0.663, Adjusted R-squared:  0.6293 
F-statistic: 19.67 on 1 and 10 DF,  p-value: 0.001263

The Anova table is:

(anovaXY1 = anova(lm1))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)   
x1         1 588.92  588.92  19.675 0.001263 **
Residuals 10 299.33   29.93                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now:

(k1 = length(lm1$coefficients) - 1)
[1] 1

From the Anova table we get the SS due regression for this model:

(SSregression1 =sum(anovaXY1$`Sum Sq`[1:k1]))
[1] 588.9225

and the total sum of squares (both model and residues) is the same as for the previous models:

(SStotal =sum(anovaXY1$`Sum Sq`))
[1] 888.25

The confidence intervals for the coefficients of the model are:

confint(lm1)
                  2.5 %    97.5 %
(Intercept) -22.4389419 34.818639
x1            0.5336202  1.610841

and when we try to obtain the variance inflation factors:

#vif(lm1)
#(tolerance = 1 / vif(lm1))

as you see, R refuses to compute the variance inflation factor for a model with a single term.

Stata code for the three models.

For the sake of completeness, I’ll include here the Stata code iused to generate these results:

First linear model with hgt, age and age squared:

. gen agesq = age*age

. regress wgt hgt age agesq

      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  3,     8) =    9.47
       Model |  693.060463     3  231.020154           Prob > F      =  0.0052
    Residual |  195.189537     8  24.3986921           R-squared     =  0.7803
-------------+------------------------------           Adj R-squared =  0.6978
       Total |      888.25    11       80.75           Root MSE      =  4.9395

------------------------------------------------------------------------------
         wgt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         hgt |   .7236902   .2769632     2.61   0.031      .085012    1.362368
         age |   2.776875   7.427279     0.37   0.718    -14.35046    19.90421
       agesq |  -.0417067   .4224071    -0.10   0.924    -1.015779    .9323659
       _cons |   3.438426   33.61082     0.10   0.921    -74.06826    80.94512
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
       agesq |     89.97    0.011115
         age |     89.68    0.011150
         hgt |      1.61    0.620927
-------------+----------------------
    Mean VIF |     60.42

Second linear model with hgt and age:

. regress wgt hgt age 

      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  2,     9) =   15.95
       Model |  692.822607     2  346.411303           Prob > F      =  0.0011
    Residual |  195.427393     9  21.7141548           R-squared     =  0.7800
-------------+------------------------------           Adj R-squared =  0.7311
       Total |      888.25    11       80.75           Root MSE      =  4.6598

------------------------------------------------------------------------------
         wgt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         hgt |    .722038   .2608051     2.77   0.022     .1320559     1.31202
         age |   2.050126   .9372256     2.19   0.056    -.0700253    4.170278
       _cons |   6.553048   10.94483     0.60   0.564    -18.20587    31.31197
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
         age |      1.60    0.623202
         hgt |      1.60    0.623202
-------------+----------------------
    Mean VIF |      1.60

Third linear model with only hgt:

. regress wgt hgt

      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  1,    10) =   19.67
       Model |  588.922523     1  588.922523           Prob > F      =  0.0013
    Residual |  299.327477    10  29.9327477           R-squared     =  0.6630
-------------+------------------------------           Adj R-squared =  0.6293
       Total |      888.25    11       80.75           Root MSE      =  5.4711

------------------------------------------------------------------------------
         wgt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         hgt |    1.07223    .241731     4.44   0.001     .5336202    1.610841
       _cons |   6.189849   12.84875     0.48   0.640    -22.43894    34.81864
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
         hgt |      1.00    1.000000
-------------+----------------------
    Mean VIF |      1.00

. 
end of do-file

Hypothesis test \(H_0=\{\beta_1=\cdots=\beta_k=0\}\) (page 22).

The F statistic for the lm3 model is computed as:

(k3 = length(lm3$coefficients) - 1)
[1] 3
(MSregression = sum(anovaXY3$`Sum Sq`[1:k3]) / k3)
[1] 231.0202
(MSresidual = anovaXY3$`Sum Sq`[k3 + 1] / (n - k3 -1))
[1] 24.39869
(Fstatistic = MSregression / MSresidual)
[1] 9.468547

which leads to a p-value equal to:

(pValue = pf(Fstatistic, df1 = k3, df2 = n - k3 - 1, lower.tail = FALSE))
[1] 0.005208374

The cut point for the 99% hypothesis test that appears on page 22 can be computed with:

qf(0.99, df1 = k3, df2 = n - 1 - k3)
[1] 7.590992

Partial F tests.

Suppose that you want to compare a partial F test between a model A, and a model B that is obtained adding a new term to A. In R to perform the partial F test you can call the anova function with the models A and B as arguments. For example, the model lm2 is obtained from the model lm1 by adding the term I(x2). The partial F test for these two models is obtained with:

(anova_1_2 = anova(lm1, lm2))
Analysis of Variance Table

Model 1: y ~ x1
Model 2: y ~ x1 + x2
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     10 299.33                              
2      9 195.43  1     103.9 4.7849 0.05649 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see the second row of the column named Sum of Sq contains the value called \(SS(x_2|x_1)\) in the pdf (see page 31, and the last column of the table on page 26). The value \(MS residual(x_1, x_2)\) can be obtained as:

(MSresidual_1_2 = anova_1_2$RSS[2]/anova_1_2$Res.Df[2])
[1] 21.71415

and then the value of the F statistic is:

(Fstatistic = anova_1_2$`Sum of Sq`[2] / MSresidual_1_2)
[1] 4.784901

Similarly, the model lm3 is obtained from the model lm2 by adding the term I(x2^2). The partial F test for these two models is obtained with:

(anova_2_3 = anova(lm2, lm3))
Analysis of Variance Table

Model 1: y ~ x1 + x2
Model 2: y ~ I(x1) + I(x2) + I(x2^2)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      9 195.43                           
2      8 195.19  1   0.23786 0.0097 0.9238

and the F statistic is obtained as follows:

anova_2_3$`Sum of Sq`[2]
[1] 0.2378569
(MSresidual_2_3 = anova_2_3$RSS[2]/anova_2_3$Res.Df[2])
[1] 24.39869
(Fstatistic = anova_2_3$`Sum of Sq`[2] / MSresidual_2_3)
[1] 0.009748754

The first two values in this computation should be compared with the results for \(SS(x_3|x_1,x_2)\) on page 31 (the numerator of F) and \(MS residual(x_1, x_2, x_3)\) on page 33 (the denominator of F), respectively.

We can even see the computation on page 28 as a particular case of this, provided we introduce the null model, whose only term is the intercept. In R we can define that model as:

(lm0 = lm(y ~ 1, data)) 

Call:
lm(formula = y ~ 1, data = data)

Coefficients:
(Intercept)  
      62.75  

and then lm1 is obtained from the null model by adding the term \(x1\). Thus we can compare these two models with:

(anova_0_1 = anova(lm0, lm1))
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x1
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     11 888.25                                
2     10 299.33  1    588.92 19.675 0.001263 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and the same type of computation leads to:

anova_0_1$`Sum of Sq`[2]
[1] 588.9225
(MSresidual_0_1 = anova_0_1$RSS[2]/anova_0_1$Res.Df[2])
[1] 29.93275
(Fstatistic = anova_0_1$`Sum of Sq`[2] / MSresidual_0_1)
[1] 19.67486

And you can check that these are the same results obtained on page 28.

Partial tests using \(t\).

The same partial tests can be obtained directly from the summary of lm for each model. For example, when comparing lm3 to lm2 we can look at the summary of lm3. The estimate \(\hat\beta_3\) is in the last row (given by k3+1, the number of terms) of the first column in the table for that summary. The standard error for that estimate is in the same row of the second column. Those two values are, therefore:

summLm3$coefficients[k3 + 1, 1]
[1] -0.0417067
summLm3$coefficients[k3 + 1, 2]
[1] 0.4224071

and their quotient leads to the \(t\) statistic at the bottom of page 36 of the pdf:

(tStatistic = summLm3$coefficients[k3 + 1, 1] / summLm3$coefficients[k3 + 1, 2])
[1] -0.09873578

and the degrees of freedom to use can be expressed as:

n - k3 - 1
[1] 8

Thus the p-value for this 2-sided test is:

(pValue = 2 * pt(abs(tStatistic), df = n - k3 - 1, lower.tail = FALSE))
[1] 0.9237772

You may check that this is the same p-value that we obtained with anova_2_3. The square of the t statistic is the same value of \(F\) statistic that we obtained there, as expected:

tStatistic^2
[1] 0.009748754

Similarly, for the comparison between lm2 and lm1:

summLm2$coefficients[k2 + 1, 1]
[1] 2.050126
summLm2$coefficients[k2 + 1, 2]
[1] 0.9372256
(tStatistic = summLm2$coefficients[k2 + 1, 1] / summLm2$coefficients[k2 + 1, 2])
[1] 2.187442
(pValue = 2 * pt(abs(tStatistic), df = n - k2 - 1, lower.tail = FALSE))
[1] 0.05648531
tStatistic^2
[1] 4.784901

and the results agree with those we obtained in anova_1_2

Extra: a 3d Scatterplot for the lm1 model.

The following code renders a 3d sctater plot of the sample points, predicted points and the hypersurface for model lm3. It may be a little hard to see this in the picture, but the hypersurface is actually curved. To appreciate that you can use your mouse to rotate the image. Zoom with the mouse scroll. Enjoy!

library(rgl)
open3d()
wgl 
  1 
plot3d(data$x1, data$x2, data$y, col="blue", size=8)
points3d(data$x1, data$x2, lm1$fitted.values, col="cyan", size=8)
segments3d(x=rep(data$x1, rep(2, length(data$x1))), y=rep(data$x2, rep(2, length(data$x2))), z=c(t(cbind(data$y, lm1$fitted.values))), col="red", lwd=2)


Hyp <- function(x1, x2) {
beta  = lm3$coefficients 
beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x2^2
}

minx = min(data$x1)
maxx = max(data$x1)
miny = min(data$x2)
maxy = max(data$x2)

nx = 20
ny = 100

x= seq(minx, maxx, length.out = nx)
y= seq(miny, maxy, length.out = ny)

xnet = rep(x, ny)
ynet = rep(y, rep(nx,ny))


XY = matrix(c(xnet, ynet), ncol = 2)

znet = apply(XY, MARGIN = 1, FUN = function(x)(Hyp(x1 = x[1], x2 = x[2])))

points3d(xnet, ynet, znet, col="darkgreen")

nx = 100
ny = 20

x= seq(minx, maxx, length.out = nx)
y= seq(miny, maxy, length.out = ny)


ynet = rep(y, nx)
xnet = rep(x, rep(ny,nx))
XY = matrix(c(xnet, ynet), ncol = 2)
znet = apply(XY, MARGIN = 1, FUN = function(x)(Hyp(x1 = x[1], x2 = x[2])))
points3d(xnet, ynet, znet, col="darkgreen")

You must enable Javascript to view this page properly.


Thanks for your attention!