Preliminaries

These are my notes for the lectures of the Coursera course “Applied Regression Analysis” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures.
Notes for week2 are here. Please read the preliminaries of that document for more details about these notes.

The sbp vs age data.

Loading the data.

The data for this lecture is the same as for the previous lecture. It is included here in two R vectors to keep this document self-contained.

sbp = c(144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128, 130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175)

age = c(39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42, 48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69)

With Stata this would be:

. use SBP

Let us create a data frame with the data:

data = data.frame(x=age, y=sbp)

Removing the outlier.

The scatter plot of the data is as follows:

plot(data$x, data$y, xlab = "age", ylab="sbp")

The point with sbp > 200 seems to be an outlier and, in any case, it has been removed from the data set for the computations of this session. To locate the position of that point in the data set we do:

(outlier = which(data$y>200))

[1] 2

Now we remove it from the data:

data2 = data[-2, ]

and plot the resulting data set against the original one (using a differenmt color) to check our work:

plot(data$x, data$y, xlab = "age", ylab="sbp")
points(data2$x, data2$y, xlab = "age", ylab="sbp", col="red", pch="+")

Now that the outlier has been taken care of, we replace the original data set with the reduced one:

data = data2

Stata code.

To remove the outlier we use:

. drop if sbp==220
(1 observation deleted)

And then we plot the data with:

. twoway(scatter sbp age), name(stataScatter, replace)

The resulting graph is:

Fitting the linear models.

We are going fit two linear models for the sbp ~ age data:

a degree 1 model (a straight line)
a degree 2 model (a parabola)

Fitting the degree 1 model.

We have already done this in previous weeks (but without removing the outlier), so I’ll just use the commands we already know:

lmXY = lm(y ~ x, data)
summary(lmXY)


Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.354  -4.797   1.254   4.747  21.153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  97.0771     5.5276  17.562 2.67e-16 ***
x             0.9493     0.1161   8.174 8.88e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.563 on 27 degrees of freedom
Multiple R-squared:  0.7122,    Adjusted R-squared:  0.7015 
F-statistic: 66.81 on 1 and 27 DF,  p-value: 8.876e-09

The Anova table for this model is:

(anovaXY = anova(lmXY))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x          1 6110.1  6110.1  66.808 8.876e-09 ***
Residuals 27 2469.3    91.5                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And the confidence intervals for the coefficients of the model are:

confint(lmXY)

                 2.5 %     97.5 %
(Intercept) 85.7354850 108.418684
x            0.7110137   1.187631

The plot of the regression line is:

plot(data$x, data$y, xlab = "age", ylab="sbp")
#curve(lmXY$co + lmXY$coefficients[2] * x)
abline(lmXY, col="orange", lwd=3)

Fitting the degree 2 model.

For a degree 2 linear regression model we use again lm with a different syntax:

lm2XY = lm(y ~ I(x) + I(x^2), data)
summary(lm2XY)


Call:
lm(formula = y ~ I(x) + I(x^2), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.8731  -5.8814   0.5288   5.5253  23.5007 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.134e+02  1.321e+01   8.585 4.59e-09 ***
I(x)        8.754e-02  6.453e-01   0.136    0.893    
I(x^2)      9.937e-03  7.323e-03   1.357    0.186    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.418 on 26 degrees of freedom
Multiple R-squared:  0.7312,    Adjusted R-squared:  0.7105 
F-statistic: 35.37 on 2 and 26 DF,  p-value: 3.822e-08

The I(x), I(x^2) syntax is used here because the default behavior of R would be to use the so called orthogonal polynomials for polynomial regression. You can read more about that in this link, but before doing so I think that it is better to wait until multiple regression has been introduced in the course.

The three coefficients returned for the model are the estimates of $\beta_0, \beta_1, \beta_2$. They can be accessed as a vector with:

(beta = lm2XY$coefficients)

 (Intercept)         I(x)       I(x^2) 
1.134097e+02 8.754328e-02 9.936809e-03

In order to reproduce all the results on page 18 of the pdf with the lecture notes we can also get the Anova table for this model:

(anova2XY = anova(lm2XY))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
I(x)       1 6110.1  6110.1 68.8896 8.809e-09 ***
I(x^2)     1  163.3   163.3  1.8412    0.1865    
Residuals 26 2306.0    88.7                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and confidence intervals for the coefficients:

confint(lm2XY)

                   2.5 %       97.5 %
(Intercept) 86.255332965 140.56408881
I(x)        -1.238949228   1.41403578
I(x^2)      -0.005116258   0.02498988

Let us add the resulting parabola to the plot

plot(data$x, data$y, xlab = "age", ylab="sbp")
#curve(lmXY$co + lmXY$coefficients[2] * x)
abline(lmXY, col="orange", lwd=3)
curve(beta[1] + beta[2] * x + beta[3] * x^2, 
col="blue", lwd="3", lty=2, add=TRUE)
legend("topleft", inset=.05, title="Model type", c("degree 1", "degree 2"), 
       fill=c("orange","blue"))

And just for fun, if you want to add segments connecting the sample points to the predicted values (as the figure on page 5 of the pdf), just uncomment and execute the line in the next chunk of code (and if you do, I recommend to comment the abline line, to avoid cluttering the graph):

#segments(data$x, lm2XY$fitted.values, data$x, data$y, lty=3)

Stata code.

The simple (degree 1) linear regression model for sbb vs age is obtained with:

. regress sbp age

      Source |       SS           df       MS      Number of obs   =        29
-------------+----------------------------------   F(1, 27)        =     66.81
       Model |  6110.10173         1  6110.10173   Prob > F        =    0.0000
    Residual |  2469.34654        27  91.4572794   R-squared       =    0.7122
-------------+----------------------------------   Adj R-squared   =    0.7015
       Total |  8579.44828        28  306.408867   Root MSE        =    9.5633

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9493225   .1161445     8.17   0.000     .7110137    1.187631
       _cons |   97.07708   5.527552    17.56   0.000     85.73549    108.4187
------------------------------------------------------------------------------

And the linear model with a degree 2 polynomial on age is:

. gen agesq=age*age

. regress sbp age agesq

      Source |       SS           df       MS      Number of obs   =        29
-------------+----------------------------------   F(2, 26)        =     35.37
       Model |  6273.40168         2  3136.70084   Prob > F        =    0.0000
    Residual |   2306.0466        26  88.6940999   R-squared       =    0.7312
-------------+----------------------------------   Adj R-squared   =    0.7105
       Total |  8579.44828        28  306.408867   Root MSE        =    9.4178

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0875433   .6453289     0.14   0.893    -1.238949    1.414036
       agesq |   .0099368   .0073232     1.36   0.186    -.0051163    .0249899
       _cons |   113.4097   13.21041     8.58   0.000     86.25533    140.5641
------------------------------------------------------------------------------

The sums of squares in the Anova tables.

If you compare the R output with the Stata output you will see that R does not show the total sum of squares, but you can recover it as:

sum(anova2XY$`Sum Sq`)

[1] 8579.448

The sum is of course the same if you decide to use the degree 1 model:

sum(anovaXY$`Sum Sq`)

[1] 8579.448

By contrast, Stata outputs the sum of squares for the degree 2 model, but the decomposition of that sum of squares between the degree 1 and degree 2 terms is not provided.

Hypothesis test for the addition of $x^2$ into the model.

The required F-statistic, degrees of freedom and p-value appear in the output of the Anova table for the degree 2 regression model. But maybe we can make their computation explicit. Recall that the extra sum of squares due to adding $x^2$ (as in page 15 of the pdf) is obtained in R as:

anova2XY$`Sum Sq`[2]

[1] 163.2999

and the residual sum of squares is

anova2XY$`Sum Sq`[3]

[1] 2306.047

The corresponding degrees of freedom are:

anova2XY$Df[2:3]

[1]  1 26

and so

(Fstatistic = (anova2XY$`Sum Sq`[2] / anova2XY$Df[2]) / (anova2XY$`Sum Sq`[3] / anova2XY$Df[3]) )

[1] 1.841159

which results in a p-value:

(pValue =  pf(Fstatistic, df1 = anova2XY$Df[2], df2 = anova2XY$Df[3], lower.tail = FALSE))

[1] 0.1864797

To test the samehypothesis using Student’s $t$ we go back to the summary of the linear model:

(sum2XY = summary(lm2XY))


Call:
lm(formula = y ~ I(x) + I(x^2), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.8731  -5.8814   0.5288   5.5253  23.5007 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.134e+02  1.321e+01   8.585 4.59e-09 ***
I(x)        8.754e-02  6.453e-01   0.136    0.893    
I(x^2)      9.937e-03  7.323e-03   1.357    0.186    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.418 on 26 degrees of freedom
Multiple R-squared:  0.7312,    Adjusted R-squared:  0.7105 
F-statistic: 35.37 on 2 and 26 DF,  p-value: 3.822e-08

from here the formula \[ t = \dfrac{\hat\beta_2}{\hat{SE}(\hat\beta_2)} \] on page 16 of the pdf is obtained with:

(tStatistic = sum2XY$coefficients[3, 1] / sum2XY$coefficients[3, 2])

[1] 1.356893

If you are confused about this reecall that sum2XY$coefficients is a matrix:

sum2XY$coefficients

                Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) 1.134097e+02 13.210405754 8.5848772 4.593894e-09
I(x)        8.754328e-02  0.645328878 0.1356568 8.931374e-01
I(x^2)      9.936809e-03  0.007323207 1.3568932 1.864797e-01

Note also that:

tStatistic^2

[1] 1.841159

equals the value of the F statistic. Thus the p-value computed as:

(n = nrow(data))

[1] 29

(k = nrow(sum2XY$coefficients) - 1)

[1] 2

(pValue = 2 * pt(abs(tStatistic), df = n - k - 1, lower.tail = FALSE))

[1] 0.1864797

is the same as before. Two remarks:

I have computed the number of variables $k$ as the number of rows of the coefficient matrix minus 1.
The hypothesis test using t is bilateral (whereas with F it is unilateral). That’s the reason behind the absolute value and the 2 factor.

Variance inflation factor.

For the variance inflation factor (vif, this is explained later on in the lecture) we use the function provided by the library car:

library(car)
vif(lm2XY)

    I(x)   I(x^2) 
31.83379 31.83379

(tolerance = 1 / vif(lm2XY))

      I(x)     I(x^2) 
0.03141316 0.03141316

The weight gain vs dose data.

wtgain = c(1, 1.2, 1.8, 2.5, 3.6, 4.7, 6.6, 9.1)
dose = 1:8
WDdata = data.frame(x = dose, y = wtgain)

To use the data in Stata I will use:

library(foreign)
write.dta(data.frame(dose, wtgain), "weightDose.dta")

Note that I have dropped the $x, y$ notation, so that the variable names in Stata coincide with those in the lecture.

The scatter plot:

plot(WDdata$x, WDdata$y, xlab = "Dose", ylab = "Weight gain")

Fitting the models.

The degree 1 linear model is:

(lmXY = lm(y ~ x, data=WDdata))


Call:
lm(formula = y ~ x, data = WDdata)

Coefficients:
(Intercept)            x  
     -1.196        1.113

(sumXY = summary(lmXY))


Call:
lm(formula = y ~ x, data = WDdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7821 -0.7592 -0.1691  0.3985  1.3917 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.1964     0.7135  -1.677 0.144607    
x             1.1131     0.1413   7.877 0.000222 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9157 on 6 degrees of freedom
Multiple R-squared:  0.9118,    Adjusted R-squared:  0.8971 
F-statistic: 62.05 on 1 and 6 DF,  p-value: 0.0002217

(anovaXY = anova(lmXY))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x          1 52.037  52.037  62.053 0.0002217 ***
Residuals  6  5.032   0.839                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(lmXY)

                 2.5 %    97.5 %
(Intercept) -2.9424073 0.5495501
x            0.7673399 1.4588505

and the degree 2 linear model is:

(lm2XY = lm(y ~ I(x) + I(x^2), data=WDdata))


Call:
lm(formula = y ~ I(x) + I(x^2), data = WDdata)

Coefficients:
(Intercept)         I(x)       I(x^2)  
     1.3482      -0.4137       0.1696

beta = lm2XY$coefficients
(sum2XY = summary(lm2XY))


Call:
lm(formula = y ~ I(x) + I(x^2), data = WDdata)

Residuals:
         1          2          3          4          5          6 
-0.1041667  0.0005952  0.1660714  0.0922619  0.0791667 -0.2732143 
         7          8 
-0.1648810  0.2041667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3482     0.2767   4.872 0.004585 ** 
I(x)         -0.4137     0.1411  -2.932 0.032554 *  
I(x^2)        0.1696     0.0153  11.085 0.000104 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1984 on 5 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9952 
F-statistic: 722.7 on 2 and 5 DF,  p-value: 6.977e-07

(anova2XY = anova(lm2XY))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
I(x)       1 52.037  52.037 1322.58  2.96e-07 ***
I(x^2)     1  4.835   4.835  122.88 0.0001041 ***
Residuals  5  0.197   0.039                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(lm2XY)

                 2.5 %      97.5 %
(Intercept)  0.6368423  2.05958627
I(x)        -0.7763778 -0.05100318
I(x^2)       0.1303039  0.20898182

The following figure strongly suggests that the parabola is a better fit:

plot(WDdata$x, WDdata$y, xlab = "Dose", ylab="Weight gain")
abline(lmXY, col="orange", lwd=3)
curve(beta[1] + beta[2] * x + beta[3] * x^2, 
col="blue", lwd="3", lty=2, add=TRUE)
legend("topleft", inset=.05, title="Model type", c("degree 1", "degree 2"), 
       fill=c("orange","blue"))

A degree 3 model.

This is just a simple matter of adding a new term:

(lm3XY = lm(y ~ I(x) + I(x^2) + I(x^3), data=WDdata))


Call:
lm(formula = y ~ I(x) + I(x^2) + I(x^3), data = WDdata)

Coefficients:
(Intercept)         I(x)       I(x^2)       I(x^3)  
    0.58571      0.37962     -0.03831      0.01540

(sum3XY = summary(lm3XY))


Call:
lm(formula = y ~ I(x) + I(x^2) + I(x^3), data = WDdata)

Residuals:
        1         2         3         4         5         6         7 
 0.057576 -0.114935  0.004329  0.022944  0.148485 -0.111472 -0.049351 
        8 
 0.042424 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.585714   0.290972   2.013   0.1144  
I(x)         0.379618   0.263287   1.442   0.2228  
I(x^2)      -0.038312   0.066042  -0.580   0.5929  
I(x^3)       0.015404   0.004845   3.179   0.0336 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1181 on 4 degrees of freedom
Multiple R-squared:  0.999, Adjusted R-squared:  0.9983 
F-statistic:  1363 on 3 and 4 DF,  p-value: 1.791e-06

(anova3XY = anova(lm3XY))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq  F value    Pr(>F)    
I(x)       1 52.037  52.037 3731.655 4.301e-07 ***
I(x^2)     1  4.835   4.835  346.711 4.897e-05 ***
I(x^3)     1  0.141   0.141   10.107   0.03356 *  
Residuals  4  0.056   0.014                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(lm3XY)

                   2.5 %     97.5 %
(Intercept) -0.222154377 1.39358295
I(x)        -0.351383545 1.11061875
I(x^2)      -0.221673230 0.14504985
I(x^3)       0.001951568 0.02885651

Variance inflation factor.

We have already imported the vif function:

vif(lm3XY)

     I(x)    I(x^2)    I(x^3) 
 208.7828 1116.5909  399.0101

1/vif(lm3XY)

        I(x)       I(x^2)       I(x^3) 
0.0047896659 0.0008955831 0.0025062022

Stata code for this example.

We load the data exported from R:

. use weightDose
(Written by R.              )

The degree 1 model is obtained with:

. regress wtgain dose

      Source |       SS           df       MS      Number of obs   =         8
-------------+----------------------------------   F(1, 6)         =     62.05
       Model |  52.0372024         1  52.0372024   Prob > F        =    0.0002
    Residual |  5.03154762         6   .83859127   R-squared       =    0.9118
-------------+----------------------------------   Adj R-squared   =    0.8971
       Total |    57.06875         7  8.15267857   Root MSE        =    .91575

------------------------------------------------------------------------------
      wtgain |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   1.113095   .1413027     7.88   0.000     .7673399    1.458851
       _cons |  -1.196429   .7135438    -1.68   0.145    -2.942407    .5495501
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
        dose |      1.00    1.000000
-------------+----------------------
    Mean VIF |      1.00

Degree 2 model:

. gen dosesq = dose * dose

. regress wtgain dose dosesq

      Source |       SS           df       MS      Number of obs   =         8
-------------+----------------------------------   F(2, 5)         =    722.73
       Model |  56.8720238         2  28.4360119   Prob > F        =    0.0000
    Residual |   .19672619         5  .039345238   R-squared       =    0.9966
-------------+----------------------------------   Adj R-squared   =    0.9952
       Total |    57.06875         7  8.15267857   Root MSE        =    .19836

------------------------------------------------------------------------------
      wtgain |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |  -.4136905   .1410915    -2.93   0.033    -.7763778   -.0510032
      dosesq |   .1696429   .0153035    11.09   0.000     .1303039    .2089818
       _cons |   1.348214   .2767358     4.87   0.005     .6368423    2.059586
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
        dose |     21.25    0.047059
      dosesq |     21.25    0.047059
-------------+----------------------
    Mean VIF |     21.25

and degree 3 model:

. gen dosecube = dosesq * dose

. regress wtgain dose dosesq dosecube

      Source |       SS           df       MS      Number of obs   =         8
-------------+----------------------------------   F(3, 4)         =   1362.82
       Model |  57.0129708         3  19.0043236   Prob > F        =    0.0000
    Residual |  .055779221         4  .013944805   R-squared       =    0.9990
-------------+----------------------------------   Adj R-squared   =    0.9983
       Total |    57.06875         7  8.15267857   Root MSE        =    .11809

------------------------------------------------------------------------------
      wtgain |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   .3796176   .2632867     1.44   0.223    -.3513835    1.110619
      dosesq |  -.0383117   .0660418    -0.58   0.593    -.2216732    .1450499
    dosecube |    .015404   .0048452     3.18   0.034     .0019516    .0288565
       _cons |   .5857143   .2909723     2.01   0.114    -.2221544    1.393583
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
      dosesq |   1116.59    0.000896
    dosecube |    399.01    0.002506
        dose |    208.78    0.004790
-------------+----------------------
    Mean VIF |    574.79

By the way, the ordering of the variables in the VIF Stata output seems arbitrary…

Thanks for your attention!

Applied Regression Analysis. Week04

Course Notes by Fernando San Segundo

February 2016

Preliminaries

The sbp vs age data.

Loading the data.

Removing the outlier.

Stata code.

Fitting the linear models.

Fitting the degree 1 model.

Fitting the degree 2 model.

Stata code.

The sums of squares in the Anova tables.

Hypothesis test for the addition of \(x^2\) into the model.

Variance inflation factor.

The weight gain vs dose data.

Fitting the models.

A degree 3 model.

Variance inflation factor.

Stata code for this example.