2.2 Linear Regression Models Defined

The assumptions necessary for establishing the linear regression model are:

-Suitability

-Linearity

-Constant variance

-Independence

library(GLMsData); data(gestation); str(gestation)
## 'data.frame':    21 obs. of  4 variables:
##  $ Age   : int  22 23 25 27 28 29 30 31 32 33 ...
##  $ Births: int  1 1 1 1 6 1 3 6 7 7 ...
##  $ Weight: num  0.52 0.7 1 1.17 1.2 ...
##  $ SD    : num  NA NA NA NA 0.121 NA 0.589 0.319 0.438 0.313 ...
summary(gestation)
##       Age            Births           Weight            SD        
##  Min.   :22.00   Min.   :  1.00   Min.   :0.520   Min.   :0.1210  
##  1st Qu.:29.00   1st Qu.:  1.00   1st Qu.:1.480   1st Qu.:0.3575  
##  Median :34.00   Median :  7.00   Median :2.516   Median :0.4270  
##  Mean   :33.76   Mean   : 72.05   Mean   :2.335   Mean   :0.4057  
##  3rd Qu.:39.00   3rd Qu.: 53.00   3rd Qu.:3.353   3rd Qu.:0.4440  
##  Max.   :44.00   Max.   :401.00   Max.   :3.740   Max.   :0.5890  
##                                                   NA's   :6

The relationship between the expected mean birthweight of babies, \(\mu = E[y]\), and the gestational age \(x\) is approximately linear over the given range of gestational ages.

plot(Weight ~ Age, data = gestation, las=1, pch=ifelse( Births<20, 1 ,19),
     xlab="Gestational age (weeks)", ylab="Mean birthweight (kg)", 
     xlim= c(20,45), ylim= c(0,4))

-pch=ifelse(Births<20, 1, 19) means that if the number of births \(m\) is fewer than 20, the point is plotted using pch=1 (an empty circle); otherwise pch=19 (a filled circle) is used


-Responses \(y_i\) here represent sample mean birthweights.

-If birthweights of individual babies at gestational age \(x_i\) have variance \(\sigma^2\), then \(y_i\) would have variance \(\sigma^2/m_i\) (\(m_i\) is the sample size of group i).

-The known prior weights are \(w_i = m_i\).

-A possible model for the data is

\(var[y_i] = \sigma^2/m_i\)

\(\mu_i = \beta_0 + \beta_1x_i.\)

where \(E[y_i] = \mu_i\)

2.3 Simple Linear Regression

2.3.1 Least-Squares Estimation

For regression models to be used in practice, the unknown parameters \(\beta_0\) and \(\beta_1\) must be estimated from the data. The deviations between the observed data \(y_i\) and the model means \(\mu_i\) are

\(e_i = y_i - \mu_i = y_i - \beta_0 - \beta_1x_i\)

To summarize the deviations, we can square them (to avoid negative quantities) then sum them, to get

\(S(\beta_0,\beta_1) = \sum_{i=1}^n w_ie_i^2 = \sum_{i=1}^n w_i(y_i - \mu_i)^2 = \sum_{i=1}^n w_i(y_i - \beta_0 -\beta_1x_i)^2\)

The non-negative weights \(w_i\) may be used to weight observations according to their precision (for example, mean birthweights based on larger sample sizes are estimated with greater precision, so can be allocated larger weights). S summarizes how far the fitted line is from the observations \(y_i\). Smaller values of \(S\) mean the line is closer to the \(y_i\), in general. The least-squares principle is to estimate \(β_0\) and \(β_1\) by those values that minimize \(S\).

We can try some values for \(\beta_0\) and \(\beta_1\) and compute the corresponding value of S.

y   <- gestation $Weight
x   <- gestation $Age
wts <- gestation $Births
beta0.A <- -0.9; beta1.A <- 0.1  #Try these values for beta0 and beta1
mu.A <-beta0.A + beta1.A * x
SA <- sum( wts*(y - mu.A)^2 ); SA
## [1] 186.1106

Suppose we try different values for \(\beta_0\) and \(\beta_1\)

beta0.A <- -3; beta1.A <- 0.15  
mu.A <-beta0.A + beta1.A * x
SB <- sum( wts*(y - mu.A)^2 ); SB
## [1] 343.4433

The smallest possible value of S (11.42) is achieved at the least-squares estimates of \(\beta_0\) and \(\beta_1\) (−2.678 and 0.1538, respectively). How these estimates are obtained is shown in Sect. 2.3.2.
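As a check, S can also be minimized numerically rather than by trial and error. A minimal sketch using optim() (the function name S.fn and the starting values c(0, 0) are arbitrary choices, not from the text):

S.fn <- function(beta) sum( wts*(y - beta[1] - beta[2]*x)^2 )
optim( c(0, 0), S.fn )$par    # Converges near the least-squares values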


2.3.2 Coefficient Estimates

The least-squares estimators of β0 and β1 can be found by using calculus to minimize the sum of squares S(β0, β1). The derivatives of S with respect to β0 and β1 are

\(\partial S(\beta_0,\beta_1)/ \partial \beta_0 = -2 \sum_{i=1}^n w_i(y_i - \mu_i)\)

\(\partial S(\beta_0,\beta_1)/ \partial \beta_1 = -2 \sum_{i=1}^n w_ix_i(y_i - \mu_i)\)

Setting \(\partial S/ \partial \beta_0 = \partial S/ \partial \beta_1 = 0\) and solving gives

\(\hat{\beta_0} = \bar{y}_w - \hat{\beta_1}\bar{x}_w\)

\(\hat{\beta_1} = SS_{xy} / SS_x = \sum_{i=1}^n w_i(x_i - \bar{x}_w)y_i / \sum_{i=1}^n w_i(x_i - \bar{x}_w)^2\)

where \(\bar{x}_w\) and \(\bar{y}_w\) are the weighted means

\(\bar{x}_w = \sum_{i=1}^n w_ix_i/ \sum_{i=1}^n w_i\) and \(\bar{y}_w = \sum_{i=1}^n w_iy_i/ \sum_{i=1}^n w_i\)

The minimized value of \(S(\beta_0,\beta_1)\), evaluated at the least-squares estimates \(\beta_0 = \hat{\beta_0}\) and \(\beta_1 = \hat{\beta_1}\), is called the *residual sum of squares* (RSS):

RSS = \(\sum_{i=1}^n w_i(y_i - \hat{\mu}_i)^2 = \sum_{i=1}^n w_i(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i )^2\)

xbar <- weighted.mean(x, w=wts)    #The weighted mean of x (Age)
SSx <- sum( wts*(x-xbar)^2 )
ybar <- weighted.mean(y, w=wts)    #The weighted mean of y (Weight)
SSxy <- sum( wts*(x-xbar)*y )
beta1 <- SSxy / SSx; beta0 <- ybar - beta1*xbar
mu <- beta0 + beta1*x
RSS <- sum(wts*(y - mu)^2)
c( beta0=beta0, beta1=beta1, RSS=RSS)
##      beta0      beta1        RSS 
## -2.6783891  0.1537594 11.4198322

The usual way to fit the model would be to use lm()

lm(Weight ~ Age, weights = Births, data = gestation)
## 
## Call:
## lm(formula = Weight ~ Age, data = gestation, weights = Births)
## 
## Coefficients:
## (Intercept)          Age  
##     -2.6784       0.1538

\(\hat{\mu} = -2.678 + 0.1538x\) with RSS = 11.42


2.3.3 Estimating the Variance \(σ^2\)

A natural estimator of \(\sigma^2\) is

\(\hat{\sigma}^2 = RSS/n\)

\(\hat{σ}^2\) is a biased estimate of \(σ^2\): because the \(\hatμ_i\) are chosen precisely to minimize RSS, the RSS is smaller than it would otherwise be, which introduces a negative bias into \(\hat{σ}^2\).

The correct way to adjust for the fact that the regression parameters have been estimated is to divide by \(n − 2\) (residual degrees of freedom) instead of \(n\).

\(s^2 = RSS/(n-2)\)

df <- length(y) - 2
s2 <- RSS / df
c( df = df, s=sqrt(s2), s2=s2 )
##         df          s         s2 
## 19.0000000  0.7752701  0.6010438
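A small simulation sketch illustrates this bias; the true values used here (\(\beta_0 = 0\), \(\beta_1 = 1\), \(\sigma^2 = 1\)) are arbitrary choices for illustration only:

set.seed(1)
sigma2.hat <- s2.hat <- numeric(1000)
for (i in 1:1000) {                  # Simulate and refit 1000 times
   xx <- 1:10; yy <- xx + rnorm(10)  # True sigma^2 = 1
   rss <- sum( resid( lm(yy ~ xx) )^2 )
   sigma2.hat[i] <- rss/10           # Biased: divides by n
   s2.hat[i] <- rss/(10 - 2)         # Unbiased: divides by n - 2
}
c( mean(sigma2.hat), mean(s2.hat) )  # Near 0.8 and 1, respectively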

2.3.4 Standard Errors of Coefficients

The variances of the parameter estimates are

var\([\hat{\beta}_0] = \sigma^2((1/\sum w_i)+(\bar{x}^2_w/SS_x))\) and var\([\hat{\beta}_1] = \sigma^2/SS_x\)

The standard errors of the coefficients are the square roots of \(\hat{var}[\hat{β}_j]\):

se\((\hat{\beta}_0) = s((1/\sum w_i)+(\bar{x}^2_w/SS_x))^{1/2}\) and se\((\hat{\beta}_1) = s/\sqrt{SS_x}\)

var.b0 <- s2 * ( 1/sum(wts) + xbar^2 / SSx )
var.b1 <- s2 / SSx
sqrt( c( beta0=var.b0, beta1=var.b1) )  # The std errors
##       beta0       beta1 
## 0.371172341 0.009493212

2.3.5 Standard Errors of Fitted Values

For a given value of the explanatory variable, say \(x_g\), the fitted value \(\hatμ_g\) also contains uncertainty.

The variance of \(\hatμ_g\) is

var\([\hat{\mu}_g] = \sigma^2((1/\sum w_i)+((x_g-\bar{x}_w)^2/SS_x))\)

An estimate of \(var[\hat{μ}_g]\), written \(\hat{var}[\hat{μ}_g]\), is found by substituting \(s^2\) for the unknown true variance \(σ^2\). The standard error of \(\hat{μ}_g\), written \(se(\hat{μ}_g)\), is the square root of the variance.

For the gestation data model, suppose we wish to use the model to estimate the mean birthweight for a gestation length of 30 weeks:

x.g <- 30
mu.g <- beta0 + x.g * beta1
var.mu.g <- s2 * ( 1/sum(wts) + (x.g-xbar)^2 / SSx )
se.mu.g <- sqrt(var.mu.g)
c( mu=mu.g, se=sqrt(var.mu.g))
##       mu       se 
## 1.934392 0.088124

The mean birthweight is estimated as \(\hat{μ}_g\) = 1.934 kg, with a standard error of \(se(\hat{μ}_g)\) = 0.08812 kg


2.4 Estimation for Multiple Regression

2.4.1 Coefficient Estimates

As for simple linear regression, we define the sum of squared deviations between the observations \(y_i\) and the model means by

\(S = \sum_{i=1}^n w_i(y_i - \mu_i)^2\)

The minimum value of \(S\) occurs when

\(\partial S / \partial \beta_j = 0\) for \(j = 0, 1,...,p.\)

The least-squares estimators:

\(\hat{\beta}_j = \sum_{i=1}^n w_ix^*_{ij}y_i / \sum_{i=1}^n w_i(x^*_{ij})^2\) for \(j = 0, 1,...,p\), where \(x^*_{ij}\) denotes the jth explanatory variable adjusted for all the others (see Sect. 2.4.3).

The fitted values are

\(\hat{\mu}_i = \hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_jx_{ij}\)

and the residuals are the deviations of the responses from the fitted values

\(r_i = y_i - \hat{\mu}_i\)
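In matrix form, the weighted least-squares estimates solve the normal equations \(X^TWX\hat{\beta} = X^TWy\). A minimal sketch, reusing the gestation vectors x, y and wts from Sect. 2.3 (so the model matrix has just an intercept column and Age):

X <- cbind(1, x)                     # Model matrix
W <- diag(wts)                       # Diagonal matrix of prior weights
solve( t(X) %*% W %*% X, t(X) %*% W %*% y )   # Near (-2.678, 0.1538)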

data( lungcap )
scatter.smooth( lungcap $ Ht, lungcap $ FEV, las=1, col="grey",
ylim=c(0, 6), xlim=c(45, 75), # Use similar scales for comparisons
main="FEV", xlab="Height (in inches)", ylab="FEV (in L)" )

scatter.smooth( lungcap $ Ht, log(lungcap $ FEV), las=1, col="grey",
ylim=c(-0.5, 2), xlim=c(45, 75), # Use similar scales for comparisons
main="log of FEV", xlab="Height (in inches)", ylab="log of FEV (in L)")

-For the lung capacity data (lungcap), the figure shows that the relationship between FEV and height is not linear, so a linear model is not appropriate. However, plotting the logarithm of FEV against height does show an approximate linear relationship.

-For the lungcap data then, fitting a linear model for y = log(FEV) may be appropriate. On this basis, a possible linear regression model to fit to the data would be

\(var[y_i] = \sigma^2\)

\(\mu_i = \beta_0 + \beta_1x_1 + \beta_2x_2 +\beta_3x_3 + \beta_4x_4\)

where μ = E[y] for y = log(FEV), \(x_1\) is height, \(x_2\) is age, \(x_3\) is the dummy variable for gender (0 for females; 1 for males), and \(x_4\) is the dummy variable for smoking (0 for non-smokers; 1 for smokers). Here, \(p'\) = 5 and \(n\) = 654.


2.4.2 Estimating the Variance \(\sigma^2\)

RSS = \(\sum_{i=1}^n w_i(y_i - \hat{\mu}_i)^2\)

The residual degrees of freedom associated with RSS is equal to the number of observations minus the number of regression coefficients that were estimated in evaluating RSS, in this case \(n − p'\)

\(s^2 = \sum_{i=1}^n w_i(y_i - \hat{\mu}_i)^2/(n - p') = RSS/(n-p')\)


2.4.3 Standard Errors

Write \(I^*_j = \sum_{i=1}^n w_i(x^*_{ij})^2\) for the sum of squares of the jth explanatory variable adjusted for the other variables.

This quantity \(I^*_j\) measures how much information the data provide for estimating the jth coefficient.

It tends to be larger when \(x_j\) is independent of the other explanatory variables and smaller when \(x_j\) is correlated with one or more of the other variables.

The variance of the jth coefficient is \(var[\hat{\beta}_j] = \sigma^2 / I^*_j\)

An estimate of \(var[\hatβ_j]\), written \(\hat{var}[\hat{β}_j]\), is found by substituting \(s^2\) for the unknown true variance \(σ^2\).

Then, the standard error becomes \(se(\hat{\beta}_j) = s / \sqrt{I^*_j}\)
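As a concrete check (a sketch; it uses the lung capacity model that is fitted in Sect. 2.6), these standard errors are the square roots of the diagonal of the estimated covariance matrix of \(\hat{\beta}\), available through vcov():

LC.fit <- lm( log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap )
sqrt( diag( vcov(LC.fit) ) )   # Matches the Std. Error column of summary()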


2.6 Fitting Linear Regression Models Using R

Fitting the regression model for the birthweight data frame gestation requires the prior weights (the number of births, Births) to be explicitly supplied in addition to the response and explanatory variable:

gest.wtd <- lm( Weight ~ Age, data = gestation,
weights=Births) # The prior weights
summary(gest.wtd)
## 
## Call:
## lm(formula = Weight ~ Age, data = gestation, weights = Births)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62979 -0.60893 -0.30063 -0.08845  1.03880 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.678389   0.371172  -7.216 7.49e-07 ***
## Age          0.153759   0.009493  16.197 1.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7753 on 19 degrees of freedom
## Multiple R-squared:  0.9325, Adjusted R-squared:  0.9289 
## F-statistic: 262.3 on 1 and 19 DF,  p-value: 1.416e-12

-In Weight ~ Age, the symbol ~ is read as ‘is modelled by’. The response variable (in this case Weight) is placed on the left of the ~, and the explanatory variables are placed on the right of the ~ and are joined by + signs if there are more than one.

-data = gestation indicates the data frame in which the variables are located.

-The argument weights specifies the prior weights \(w_i\), and can be omitted if all the prior weights are equal to one.

Try fitting the regression without using prior weights for comparison:

gest.ord <- lm( Weight ~ Age, data = gestation); coef(gest.ord)
## (Intercept)         Age 
##   -3.049879    0.159483

Using the prior weights, the weighted regression line (solid line) lies closer to the observations that are weighted more heavily (and so contain more information) than does the ordinary regression line (dashed line):

plot( Weight ~ Age, data = gestation, type="n",
las=1, xlim=c(20, 45), ylim=c(0, 4),
xlab="Gestational age (weeks)", ylab="Mean birthweight (in kg)" )
points( Weight[Births< 20] ~ Age[Births< 20], pch=1, data=gestation )
points( Weight[Births>=20] ~ Age[Births>=20], pch=19, data=gestation )
abline( coef(gest.ord), lty=2, lwd=2)
abline( coef(gest.wtd), lty=1, lwd=2)
legend("topleft", lwd=c(2, 2), bty="n",
lty=c(2, 1, NA, NA), pch=c(NA, NA, 1, 19),           # NA shows nothing
legend=c("Ordinary regression", "Weighted regression",
"Based on 20 or fewer obs.","Based on more than 20 obs."))

Consider fitting a linear regression model to the lung capacity data (lungcap), using age, height, gender and smoking status as explanatory variables and log(FEV) as the response:

# Recall, Smoke has been declared previously as a factor
lm( log(FEV) ~ Age + Ht + Gender + Smoke, data=lungcap )
## 
## Call:
## lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
## 
## Coefficients:
## (Intercept)          Age           Ht      GenderM        Smoke  
##    -1.94400      0.02339      0.04280      0.02932     -0.04607

The output of the lm() command shows that the estimated systematic component is \(\hatμ = −1.944 + 0.02339x_1 + 0.04280x_2 + 0.02932x_3 − 0.04607x_4\)

To explicitly exclude the constant in the model (which is unusual), use one of these forms:

lm( log(FEV) ~ 0 + Age + Ht + Gender + Smoke, data=lungcap ) # No const.
## 
## Call:
## lm(formula = log(FEV) ~ 0 + Age + Ht + Gender + Smoke, data = lungcap)
## 
## Coefficients:
##      Age        Ht   GenderF   GenderM     Smoke  
##  0.02339   0.04280  -1.94400  -1.91468  -0.04607
lm( log(FEV) ~ Age + Ht + Gender + Smoke - 1, data=lungcap ) # No const.
## 
## Call:
## lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke - 1, data = lungcap)
## 
## Coefficients:
##      Age        Ht   GenderF   GenderM     Smoke  
##  0.02339   0.04280  -1.94400  -1.91468  -0.04607

R returns more information about the fitted model by directing the output of lm() to an output object:

LC.m1 <- lm( log(FEV) ~ Age + Ht + Gender + Smoke, data=lungcap )
names( LC.m1 )        # The names of the components of LC.m1
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
summary( LC.m1 )
## 
## Call:
## lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63278 -0.08657  0.01146  0.09540  0.40701 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.943998   0.078639 -24.721  < 2e-16 ***
## Age          0.023387   0.003348   6.984  7.1e-12 ***
## Ht           0.042796   0.001679  25.489  < 2e-16 ***
## GenderM      0.029319   0.011719   2.502   0.0126 *  
## Smoke       -0.046068   0.020910  -2.203   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1455 on 649 degrees of freedom
## Multiple R-squared:  0.8106, Adjusted R-squared:  0.8095 
## F-statistic: 694.6 on 4 and 649 DF,  p-value: < 2.2e-16

The parameter estimates are explicitly obtained using:

coef( LC.m1 )
## (Intercept)         Age          Ht     GenderM       Smoke 
## -1.94399818  0.02338721  0.04279579  0.02931936 -0.04606754

The estimate of σ is:

summary( LC.m1 )$sigma
## [1] 0.1454686
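The same value can be recovered from the RSS and the residual degrees of freedom; for lm objects, deviance() returns the (weighted) RSS:

sqrt( deviance(LC.m1) / df.residual(LC.m1) )
## [1] 0.1454686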

2.7 Interpreting the Regression Coefficients

The systematic component of the linear regression model fitted to the gestation data is \(\hat{\mu} = −2.678 + 0.1538x\), where \(μ = E[y]\), \(y\) is the mean birthweight (in kg), and \(x\) is the gestational age in weeks.

The random component implies that the variation of the weights around \(μ\) is approximately constant with \(s^2\) = 0.6010.

The interpretation for the systematic component model fitted to the lung capacity data is different, because the response variable is log(FEV)

\(μ = E[log(FEV)] = −1.944 + 0.02339x_1 + 0.04280x_2 + 0.02932x_3 − 0.04607x_4\), for Age \(x_1\), Ht \(x_2\), the dummy variable for Gender \(x_3\) and the dummy variable for Smoke \(x_4\).

The regression coefficients can only be interpreted for their impact on \(μ = E[log(FEV)]\) and not on E[FEV] directly. However, since \(E[log y] ≈ log E[y] = log μ\), then

\(log μ = log E[FEV] ≈ −1.944 + 0.02339x_1 + 0.04280x_2 + 0.02932x_3 − 0.04607x_4\).

The random component of the model indicates the variation of log(FEV) around μ = E[log(FEV)] is approximately constant, with \(s^2\) = 0.02116.

Interpreting the effects of correlated covariates is subtle. For example, in the lung capacity data, height and age are positively correlated.

The coefficient in the linear model reflects only the net effect of a covariate, eliminating any concomitant changes in the other covariates that might normally be present if all the covariates varied in an uncontrolled fashion.


2.8 Inference for Linear Regression Models: t-Tests

2.8.1 Normal Linear Regression Models

The usual assumption of linear regression is that the responses are normally distributed, either with constant variance or with variances that are inversely proportional to the known prior weights. This can be stated as:

\(y_i \sim N(\mu_i, \sigma^2/w_i)\)

\(\mu_i = \beta_0 + \sum_{j=1}^p \beta_jx_{ji}\)

This is called a normal linear regression model.
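A short simulation sketch (all true values here are arbitrary, chosen only for illustration) shows data generated from this model and the parameters recovered by lm() with weights:

set.seed(1)
sim.w <- rep( c(1, 4), each = 10 )   # Known prior weights
sim.x <- 1:20
sim.mu <- 2 + 0.5*sim.x              # True beta0 = 2, beta1 = 0.5
sim.y <- rnorm( 20, mean = sim.mu, sd = sqrt(1.5^2/sim.w) )  # sigma = 1.5
coef( lm(sim.y ~ sim.x, weights = sim.w) )    # Estimates near (2, 0.5)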


2.8.2 The Distribution of \(\hatβ_j\)

The \(\hatβ_j\) are random variables which follow normal distributions, since \(\hatβ_j\) is a linear combination of the \(y_i\).

\(\hatβ_j \sim N(\beta_j, var[\hat{\beta}_j])\)

This means that \(\hatβ_j\) has a normal distribution with mean \(β_j\) and variance var[\(\hatβ_j\)]. Standardizing,

\(Z = (\hatβ_j - β_j) / se(\hatβ_j)\)

where \(se(\hatβ_j) = \sqrt{var[\hatβ_j]}\), and \(Z\) has a standard normal distribution when \(σ^2\) is known.

When \(σ^2\) is unknown, estimate \(σ^2\) by \(s^2\) and hence estimate \(var[\hatβ_j]\) by \(\hat{var}[\hatβ_j]\).

Then \(T = (\hatβ_j - β_j) / se(\hatβ_j)\) has a Student’s t distribution with \(n − p'\) degrees of freedom, where \(se(\hatβ_j) = \sqrt{\hat{var}[\hatβ_j]}\).


2.8.3 Hypothesis Tests for \(β_j\)

Consider testing the null hypothesis \(H_0: β_j = β_j^0\) against a one-sided alternative (\(H_A: β_j > β_j^0\) or \(H_A: β_j < β_j^0\)) or a two-sided alternative (\(H_A: β_j \ne β_j^0\)), where \(β_j^0\) is some hypothesized value of \(β_j\) (usually zero).

The statistic \(T = (\hatβ_j - β_j^0) / se(\hatβ_j)\) is used to test this hypothesis.

When \(H_0\) is true and \(σ^2\) is unknown, \(T\) has a t-distribution with \(n − p'\) degrees of freedom, so we determine significance by referring to this distribution.

Consider again the lung capacity data:

summary( LC.m1 )
## 
## Call:
## lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63278 -0.08657  0.01146  0.09540  0.40701 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.943998   0.078639 -24.721  < 2e-16 ***
## Age          0.023387   0.003348   6.984  7.1e-12 ***
## Ht           0.042796   0.001679  25.489  < 2e-16 ***
## GenderM      0.029319   0.011719   2.502   0.0126 *  
## Smoke       -0.046068   0.020910  -2.203   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1455 on 649 degrees of freedom
## Multiple R-squared:  0.8106, Adjusted R-squared:  0.8095 
## F-statistic: 694.6 on 4 and 649 DF,  p-value: < 2.2e-16

-the Estimate column contains the parameter estimates \(\hatβ_j\) ;

-the Std. Error column contains the corresponding standard errors se(\(\hatβ_j\));

-the t value column contains the corresponding t-statistic for testing \(H_0: β_j = 0\);

-the Pr(>|t|) column contains the corresponding two-tailed P-values for the hypothesis tests. (The one-tailed P-value is the two-tailed P-value divided by two.)

This information can be accessed directly using:

round(coef( summary( LC.m1 ) ), 5)
##             Estimate Std. Error   t value Pr(>|t|)
## (Intercept) -1.94400    0.07864 -24.72067  0.00000
## Age          0.02339    0.00335   6.98449  0.00000
## Ht           0.04280    0.00168  25.48933  0.00000
## GenderM      0.02932    0.01172   2.50196  0.01260
## Smoke       -0.04607    0.02091  -2.20311  0.02794

For example, consider a hypothesis test for \(β_4\) (the coefficient for Smoke). To test \(H_0: β_4 = 0\) against the alternative \(H_A: β4 \ne 0\) (in the presence of age, height and gender), the output shows that the t-score is \(t\) = −2.203, and the corresponding two-tailed P-value is 0.02794. Thus, some evidence exists that smoking status is statistically significant when age, height and gender are in the model.
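The quoted P-value can be reproduced directly from the t-distribution with \(n − p'\) = 649 degrees of freedom:

tval <- coef( summary(LC.m1) )["Smoke", "t value"]
2*pt( -abs(tval), df = df.residual(LC.m1) )   # Approx. 0.0279, as above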

If gender were omitted from the model and the null hypothesis for the smoking coefficient retested, the test would have a different meaning: this second test determines whether smoking status is significant in the model adjusted only for age and height (but not gender).


2.8.4 Confidence Intervals for \(β_j\)

The size of the effect can be estimated by computing confidence intervals.

The estimates \(\hatβ_j\) and the corresponding standard errors se(\(\hatβ_j\)) can be used to form 100(1 − α)% confidence intervals for each estimate using

\(\hatβ_j \pm t_{\alpha/2,n-p'}^*se(\hatβ_j)\), where \(t_{\alpha/2,n-p'}^*\) is the value such that an area α/2 is in each tail of the t-distribution with \(n − p'\) degrees of freedom.

For the lung capacity data (data set: lungcap), find the 95% confidence interval:

confint( LC.m1 )
##                    2.5 %       97.5 %
## (Intercept) -2.098414941 -1.789581413
## Age          0.016812109  0.029962319
## Ht           0.039498923  0.046092655
## GenderM      0.006308481  0.052330236
## Smoke       -0.087127344 -0.005007728

(By default, 95% confidence intervals are produced; other levels are produced by using, for example, level=0.90 in the call to confint().)

For example, the 95% confidence interval for \(β_4\) is from −0.08713 to −0.005008.
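This interval can also be assembled by hand from the estimate, its standard error and the t critical value (a quick sketch; the result agrees with confint() up to rounding):

est <- coef( summary(LC.m1) )["Smoke", ]
tstar <- qt( 0.975, df = df.residual(LC.m1) )
est["Estimate"] + c(-1, 1)*tstar*est["Std. Error"]  # Near (-0.0871, -0.0050)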


2.8.5 Confidence Intervals for \(μ\)

The fitted values \(\hatμ\) are used to estimate the mean value for given values of the explanatory variables.

\(\hatμ_g \pm t_{\alpha/2,n-p'}^*se(\hatμ_g)\), where se(\(\hatμ_g\)) = \(\sqrt{\hat{var}[\hat{\mu}_g]}\), and where \(t_{\alpha/2,n-p'}^*\) is the value such that an area α/2 is in each tail of the t-distribution with \(n − p'\) degrees of freedom.

For the lung capacity data (data set: lungcap), suppose we wish to estimate μ = E[log(FEV)] for female smokers aged 18 who are 66 inches tall. Using R, we first create a new data frame containing the values of the explanatory variables for which we need to make the prediction, then, use predict() to compute the estimates of μ:

new.df <- data.frame(Age = 18, Ht = 66, Gender = "F", Smoke = 1)   #Creates a new data frame
out <- predict( LC.m1, newdata = new.df, se.fit = TRUE)
names(out)
## [1] "fit"            "se.fit"         "df"             "residual.scale"
out$se.fit
## [1] 0.02350644
tstar <- qt(df = LC.m1$df, p = 0.975 )        # For a 95% CI
ci.lo <- out$fit - tstar*out$se.fit
ci.hi <- out$fit + tstar*out$se.fit
CIinfo <- cbind( Lower=ci.lo, Estimate=out$fit, Upper=ci.hi)
CIinfo
##      Lower Estimate    Upper
## 1 1.209268 1.255426 1.301584

The prediction is \(\hatμ\) = 1.255, and the 95% confidence interval is from 1.209 to 1.302.

exp(CIinfo)
##      Lower Estimate    Upper
## 1 3.351032 3.509334 3.675114

Exponentiating gives an approximate estimate of E[FEV] of 3.509 L, with an approximate 95% confidence interval from 3.351 to 3.675 L.

Computing the confidence intervals for 18-year-old female smokers at varying heights:

newHt <- seq(min(lungcap$Ht), max(lungcap$Ht), by = 2)
newlogFEV <- predict( LC.m1, se.fit = TRUE,
newdata=data.frame(Age=18, Ht=newHt, Gender="F", Smoke= 1))
ci.lo <- exp( newlogFEV$fit - tstar*newlogFEV$se.fit )
ci.hi <- exp( newlogFEV$fit + tstar*newlogFEV$se.fit )
cbind( Ht = newHt, FEVhat = exp(newlogFEV$fit), SE = newlogFEV$se.fit,
Lower = ci.lo, Upper = ci.hi, CI.Width = ci.hi - ci.lo)
##    Ht   FEVhat         SE    Lower    Upper  CI.Width
## 1  46 1.491095 0.04886534 1.354669 1.641259 0.2865900
## 2  48 1.624341 0.04585644 1.484469 1.777392 0.2929226
## 3  50 1.769494 0.04289937 1.626540 1.925011 0.2984711
## 4  52 1.927618 0.04000563 1.781987 2.085151 0.3031639
## 5  54 2.099873 0.03719000 1.951990 2.258959 0.3069685
## 6  56 2.287520 0.03447163 2.137804 2.447722 0.3099183
## 7  58 2.491936 0.03187542 2.340743 2.652894 0.3121513
## 8  60 2.714619 0.02943370 2.562170 2.876138 0.3139672
## 9  62 2.957201 0.02718813 2.803464 3.119368 0.3159041
## 10 64 3.221460 0.02519123 3.065984 3.384820 0.3188364
## 11 66 3.509334 0.02350644 3.351032 3.675114 0.3240817
## 12 68 3.822932 0.02220493 3.659826 3.993308 0.3334820
## 13 70 4.164555 0.02135689 3.993518 4.342917 0.3493998
## 14 72 4.536705 0.02101728 4.353286 4.727852 0.3745665
## 15 74 4.942111 0.02121053 4.740502 5.152294 0.4117924

Notice that the intervals do not have the same width over the whole range of the data.
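The widening is easiest to see by plotting the fitted means with their confidence limits (a plotting sketch; the line types are arbitrary choices):

plot( newHt, exp(newlogFEV$fit), type="l", lwd=2, las=1,
xlab="Height (in inches)", ylab="FEV (in L)" )
lines( newHt, ci.lo, lty=2 )   # Lower 95% limit
lines( newHt, ci.hi, lty=2 )   # Upper 95% limit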


2.9 Analysis of Variance for Regression Models

A linear regression model, having been fitted to the data by least squares, yields a fitted value \(\hat{\mu}_i = \hat{\beta}_0 + \sum_{j=1}^p x_{ij}\hat{\beta}_j\) for each observation \(y_i\).

Note that \(y_i = \hat{\mu}_i + (y_i - \hat{\mu}_i)\).

In other words, Data = Fit + Residual. Subtracting the weighted mean \(\bar{y}_w\) from both sides gives

\(y_i - \bar{y}_w = (\hat{\mu}_i - \bar{y}_w) + (y_i - \hat{\mu}_i)\)

Squaring each of these terms and summing them over \(i\) leads to the key identity

SST = SSREG + RSS,

-where SST = \(\sum_{i=1}^n w_i(y_i - \bar{y}_w)^2\) is the total sum of squares,

-SSREG = \(\sum_{i=1}^n w_i(\hat{\mu}_i - \bar{y}_w)^2\) is the regression sum of squares,

-RSS = \(\sum_{i=1}^n w_i(y_i -\hat{\mu}_i )^2\) is the residual sum of squares.

-The cross-product terms \((\hat{\mu}_i - \bar{y}_w)(y_i -\hat{\mu}_i )\) sum to zero, and so don’t appear in this identity.

-This identity is the basis of what is called analysis of variance, because it analyses the sources from which variance in the data arises.

\(F = \frac{SSREG/(p'-1)}{RSS/(n-p')} = \frac{MSREG}{MSE}\)

When the null hypothesis that all the slope coefficients are zero is true, this ratio follows an F-distribution with \((p' − 1, n − p')\) degrees of freedom.

-The MSE, the mean-square error, is equal to \(s^2\), the unbiased estimator of \(σ^2\) that we have previously seen.

-MSREG is the mean-square for the regression.

Consider again the lung capacity data:

summary( LC.m1 )
## 
## Call:
## lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63278 -0.08657  0.01146  0.09540  0.40701 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.943998   0.078639 -24.721  < 2e-16 ***
## Age          0.023387   0.003348   6.984  7.1e-12 ***
## Ht           0.042796   0.001679  25.489  < 2e-16 ***
## GenderM      0.029319   0.011719   2.502   0.0126 *  
## Smoke       -0.046068   0.020910  -2.203   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1455 on 649 degrees of freedom
## Multiple R-squared:  0.8106, Adjusted R-squared:  0.8095 
## F-statistic: 694.6 on 4 and 649 DF,  p-value: < 2.2e-16

-The F-statistic is labelled F-statistic, followed by the corresponding degrees of freedom (labelled DF), and the P-value for the test (labelled p-value). The F-statistic and the corresponding degrees of freedom are returned using summary(LC.m1)$fstatistic.

summary(LC.m1)$fstatistic
##    value    numdf    dendf 
## 694.5804   4.0000 649.0000

The proportion of the total variation explained by the regression is the coefficient of determination,

\(R^2 = \frac{SSREG}{SST} = 1 - \frac{RSS}{SST}\).

\(R^2\) is sometimes also called multiple \(R^2\), because it is equal to the squared Pearson correlation coefficient between the \(y_i\) and the fitted values \(\hatμ_i\), using the weights \(w_i\). \(R^2\) is labelled Multiple R-squared.

Adding a new explanatory variable to the regression model cannot increase RSS and hence \(R^2\) tends to increase with the size p of the model even if the explanatory variables have no real explanatory power. For this reason, some statisticians like to adjust \(R^2\) for the number of explanatory variables in the model. The adjusted \(R^2\), denoted \(\bar{R}^2\), is defined by

\(\bar{R}^2 = 1 - \frac{RSS/(n-p')}{SST/(n-1)} = 1 - (1-R^2) \frac{n-1}{n-p'}\).

Unlike \(R^2\), \(\bar{R}^2\) may be negative. This occurs whenever MSREG < MSE, which can be taken to indicate a very poor model.

\(\bar{R}^2\) is labelled Adjusted R-squared. \(F\) and \(R^2\) are closely related quantities, but it is \(F\) that is used to formally test whether the regression is statistically significant.

Computing RSS, SST, \(R^2\) and \(\bar{R}^2\) for the lung capacity data (recalling that y = log(FEV)):

y <- log( lungcap$FEV )    # Define the response used in LC.m1
mu <- fitted( LC.m1 ); RSS <- sum( (y - mu)^2 )
SST <- sum( (y - mean(y) )^2 )
c(RSS = RSS, SST = SST, SSReg = SST-RSS)
##      RSS      SST    SSReg 
## 13.73356 72.52591 58.79235
R2 <- 1 - RSS/SST                              # Compute R2 explicitly
c( "Output R2" = summary(LC.m1)$r.squared, "Computed R2" = R2,
"adj R2" = summary(LC.m1)$adj.r.squared)
##   Output R2 Computed R2      adj R2 
##   0.8106393   0.8106393   0.8094722

Note that y must be the response actually used in LC.m1; if a vector of a different length were used, R would recycle the shorter vector (with a warning) and produce nonsense values.
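From these quantities, the overall F-statistic and the adjusted \(R^2\) reported by summary() can be reproduced (here \(n\) = 654 and \(p'\) = 5):

n <- length(y); pprime <- length( coef(LC.m1) )
Fstat <- ( (SST - RSS)/(pprime - 1) ) / ( RSS/(n - pprime) )
adjR2 <- 1 - (1 - R2)*(n - 1)/(n - pprime)
round( c(F = Fstat, adj.R2 = adjR2), 4 )
##        F   adj.R2 
## 694.5804   0.8095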

2.10 Comparing Nested Models

2.10.1 Analysis of Variance to Compare Two Nested Models

Model A: \(\mu_A = \beta_0 + \beta_1x_1+ \beta_4x_4\)

Model B: \(\mu_B = \beta_0 + \beta_1x_1+ \beta_2x_2+ \beta_3x_3+ \beta_4x_4\)

Model A is nested in Model B because Model A can be obtained from Model B by setting some parameter(s) in Model B to zero or, more generally, if Model A is a special case of Model B. (In this case, by setting \(β_2 = β_3\) = 0)

In comparing these models, we wish to know whether the more complex Model B is necessary, or whether the simpler Model A will suffice. Formally, the null hypothesis is that the two models are equivalent, so that we test \(H_0: β_2 = β_3 = 0\) against the alternative that \(β_2\) and \(β_3\) are not both zero.

Using the lungcap data frame, and fitting the two models:

LC.A <- lm( log(FEV) ~ Age +               Smoke, data = lungcap )
LC.B <- lm( log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap )
RSS.A <- sum( resid( LC.A )^2 )     # resid() computes residuals
RSS.B <- sum( resid( LC.B )^2 )
c( ModelA = RSS.A, ModelB = RSS.B)
##   ModelA   ModelB 
## 28.91982 13.73356
SS <- RSS.A - RSS.B; SS
## [1] 15.18626
DF <- df.residual(LC.A) - df.residual(LC.B); DF
## [1] 2

Is this reduction statistically significant?

The formal test requires comparing the SS divided by the change in degrees of freedom to the RSS for Model B divided by the residual degrees of freedom for Model B:

df.B <- df.residual(LC.B); df.B
## [1] 649
Fstat <- (SS/DF) / ( RSS.B/df.B ); Fstat
## [1] 358.8249

A P-value is found by comparing to an F-distribution with (2, 649) degrees of freedom:

pf(Fstat, df1 = DF, df2 = df.B, lower.tail = FALSE)
## [1] 1.128849e-105

The P-value is almost zero, providing strong evidence that Model B is significantly different from Model A. In R, the results are displayed using anova():

anova( LC.A, LC.B )

More generally, consider fitting two nested models, say Model A and Model B, with systematic components

Model A: \(\mu_A = \beta_0 + \beta_1x_1 + \cdots + \beta_{p_A}x_{p_A}\)

Model B: \(\mu_B = \beta_0 + \beta_1x_1 + \cdots + \beta_{p_A}x_{p_A} + \cdots + \beta_{p_B}x_{p_B}\).

The difference between the RSS computed for each model is the SS due to the difference between the models, based on \(p'_B − p'_A\) degrees of freedom. Assuming \(H_0: β_{p_A+1} = \cdots = β_{p_B} = 0\) is true, the models are identical and the SS is equivalent to residual variation. The test statistic is

\(F = \frac{(RSS_A-RSS_B)/(p_B'-p'_A)}{s^2} = \frac{SS/(p_B'-p'_A)}{RSS_B/(n-p_B')}\).

A P-value is deduced by referring to an F-distribution with \((p'_B − p'_A, n − p'_B)\) degrees of freedom.


2.10.2 Sequential Analysis of Variance

Consider model LC.B fitted to the lungcap data; a sequence of nested models can be compared:

LC.0 <- lm( log(FEV) ~ 1, data = lungcap)    # No explanatory variables
LC.1 <- update(LC.0, . ~ . + Age)            # Age
LC.2 <- update(LC.1, . ~ . + Ht)             # Age and Height
LC.3 <- update(LC.2, . ~ . + Gender)         # Age, Height and Gender
LC.4 <- update(LC.3, . ~ . + Smoke)          # Then, add smoking status

Notice the use of update() to update models.

The RSS can be computed for each model:

RSS.0 <- sum( resid(LC.0)^2 )
RSS.1 <- sum( resid(LC.1)^2 )
RSS.2 <- sum( resid(LC.2)^2 )
RSS.3 <- sum( resid(LC.3)^2 )
RSS.4 <- sum( resid(LC.4)^2 )
RSS.list <- c( Model4=RSS.4, Model3=RSS.3, Model2=RSS.2,
Model1=RSS.1, Model0=RSS.0)
RSS.list
##   Model4   Model3   Model2   Model1   Model0 
## 13.73356 13.83627 13.98958 29.31586 72.52591

Notice that the RSS decreases as the models become more complex. The change in the RSS (the SS) can also be computed:

SS.list <- diff(RSS.list); SS.list
##     Model3     Model2     Model1     Model0 
##  0.1027098  0.1533136 15.3262790 43.2100549

The changes in the degrees of freedom between these nested models are all one in this example. As before, we compare these changes in RSS to an estimate of \(σ^2\) = MSE, using the F-statistic:

s2 <- summary(LC.4)$sigma^2                    # One way to get MSE
F.list <- (SS.list / 1) / s2; F.list
##      Model3      Model2      Model1      Model0 
##    4.853708    7.245064  724.266452 2041.956379
P.list <- pf( F.list, 1, df.residual(LC.4), lower.tail=FALSE)
round(P.list, 6)
##   Model3   Model2   Model1   Model0 
## 0.027937 0.007293 0.000000 0.000000

These computations are all performed in R by using anova(), and providing as input the final model in the set of nested models:

anova(LC.4)

Because the F-tests are adjusted for other terms in the model, numerous F-tests are possible to test for the effect of Smoke, depending on the order in which the corresponding nested models are compared. For example, tests based on Smoke include:

-Test for Smoke without adjusting for any other explanatory variables;

-Test for Smoke after first adjusting for Age;

-Test for Smoke after first adjusting for Ht;

-Test for Smoke after first adjusting for both Age and Gender.

More generally, testing a series of sequential models is equivalent to separating the systematic component into contributions from each explanatory variable.

Model LC.4 fits the explanatory variables Age, Ht, Gender and Smoke in that order (data set: lungcap). Consider fitting the explanatory variables in reverse order:

LC.4.rev <- lm(log(FEV) ~ Smoke + Gender + Ht + Age, data=lungcap)
anova(LC.4.rev)

The level of significance of Smoke depends on whether this variable is added first (model LC.4) or last, after adjusting for Age, Ht and Gender. Sometimes, a variable may be significant when added first, but not at all significant when added after other variables.


2.10.3 Parallel and Independent Regressions

For simplicity, we consider the case of one covariate (height \(x_2\)) and one factor (smoking status \(x_4\)) to fix ideas. A naive (and obviously untrue) model is that μ = E[log(FEV)] does not depend on smoking status or height. The fitted systematic component is \(\hatμ = 0.9154\), with RSS = 72.53 on 653 degrees of freedom.

 mean(log(lungcap$FEV))
## [1] 0.915437

To consider if the influence of height \(x_2\) on μ = E[log(FEV)] is significant, the fitted model is \(\hatμ = −2.271 + 0.05212x_2\), with RSS = 14.82 on 652 degrees of freedom.

To consider whether the relationship is different for smokers and non-smokers, add smoking status \(x_4\) as an explanatory variable: \(\hatμ = −2.277 + 0.05222x_2 − 0.006830x_4\), with RSS = 14.82 on 651 degrees of freedom.

To formally test if the intercepts are different, a test of the corresponding β is conducted.

printCoefmat(coef(summary(lm( log(FEV) ~ Ht + Smoke, data=lungcap))))
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.2767801  0.0656677 -34.6712   <2e-16 ***
## Ht           0.0522196  0.0010785  48.4174   <2e-16 ***
## Smoke       -0.0068303  0.0205450  -0.3325   0.7397    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The evidence suggests that different intercepts are not needed when the slopes of the lines are common.

Different slopes can be modelled using the interaction between height and smoking status as an explanatory variable: \(\hatμ = −2.281 + 0.05230x_2 + 0.1440x_4 − 0.002294x_2x_4\), with RSS = 14.82 on 650 degrees of freedom. This model produces two separate systematic components; both the intercepts and slopes differ:

\(\hatμ = -2.281 + 0.05230x_2\) for non-smokers (set \(x_4 = 0\))

\(\hatμ = -2.137 + 0.05000x_2\) for smokers (set \(x_4 = 1\))

This is not equivalent to fitting two separate linear regression models, since the same estimate of \(σ^2\) is shared by both systematic components.

R indicates the interaction between two explanatory variables by joining the interacting variables with : (a colon)

LC.model <- lm( log(FEV) ~ Ht + Smoke + Ht:Smoke, data = lungcap)

A model including all main effects plus the interaction can also be specified using * (an asterisk). The above model, then, could be specified equivalently as:

LC.model <- lm( log(FEV) ~ Ht * Smoke, data = lungcap)

There is no evidence to suggest that different intercepts and slopes are needed for smokers and non-smokers:

printCoefmat(coef(summary(LC.model)))
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.2814140  0.0668241 -34.1406   <2e-16 ***
## Ht           0.0522961  0.0010977  47.6420   <2e-16 ***
## Smoke        0.1440396  0.3960102   0.3637   0.7162    
## Ht:Smoke    -0.0022937  0.0060125  -0.3815   0.7030    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These four models (constant mean; common regression line; parallel lines; separate lines) represent four ways to use linear regression models to model the relationship between μ = E[log(FEV)], height and smoking status.

The value of RSS reduces as the models become more complex. R produces similar output using the anova() command, using the final model as the input:

anova(LC.model)

The table indicates that the model including only Ht is hard to improve upon; neither Smoke nor the interaction are statistically significant.


2.10.4 The Marginality Principle

-If higher-order powers of a covariate appear in a model, then the lower-order powers should also be in the model. For example, if \(x^2\) is in a model then \(x\) should be also. (If \(x^2\) remains in the model but \(x\) is removed, then the model is artificially constrained to fitting a quadratic that has zero slope at \(x\) = 0, something which is not usually required.) A brief sketch follows this list.

-If the interaction between two or more factors appears in the model, then the individual factors and lower-order interactions should appear also.

-If the interaction between factors and covariates appears in the linear model, then the individual factors and covariates should appear also.
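For instance, following the first principle above, a quadratic term is added while the linear term is retained (a hypothetical sketch; a quadratic in Ht is purely illustrative, not a model recommended here):

LC.quad <- lm( log(FEV) ~ Ht + I(Ht^2), data = lungcap )   # Keeps Ht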


2.11 Choosing Between Non-nested Models: AIC and BIC

The hypothesis tests discussed in Sect. 2.10 only apply when the models being compared are nested. This section introduces quantities for comparing models that are not necessarily nested.

The two criteria for selecting a statistical model are accuracy and parsimony. The RSS measures only accuracy: adding more explanatory variables produces smaller values of the RSS, but also a more complicated model.

Akaike’s An Information Criterion (AIC) balances these two criteria, by measuring the accuracy using the RSS but penalizing the complexity of the model as measured by the number of estimated parameters. For a normal linear regression model,

AIC = \(n\log(RSS/n) + 2p'\), when \(σ^2\) is unknown.

Bayesian Information Criterion (BIC), also called Schwarz’s criterion:

BIC = \(n\log(RSS/n) + p'\log n\), when \(σ^2\) is unknown.

Smaller values of the AIC (closer to −∞) represent better models. The BIC tends to select lower-dimensional (more parsimonious) models than the AIC, as its penalty for extra parameters is more severe (\(k = \log n > 2\) unless the number of observations is very small).

The AIC focuses more on creating a model for making good predictions. The BIC requires stronger evidence for including explanatory variables, so produces simpler models having simpler interpretations. AIC is directed purely at prediction, while BIC is a compromise between interpretation and prediction. Neither AIC nor BIC are formal testing methods, so no test statistics or P-values can be produced.

Consider the lung capacity data.

The AIC is extracted using R as follows:

LC.A <- lm( log(FEV) ~ Age + Smoke, data = lungcap )
extractAIC(LC.A)
## [1]     3.000 -2033.551
LC.B <- lm( log(FEV) ~ Ht + Smoke, data = lungcap )
extractAIC(LC.B)
## [1]     3.000 -2470.728
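As a check, the AIC can be computed directly from the formula \(n\log(RSS/n) + 2p'\); for lm objects, extractAIC() uses this same form:

RSS.A <- sum( resid(LC.A)^2 )
n <- nobs(LC.A); pprime <- length( coef(LC.A) )   # n = 654, p' = 3
n*log(RSS.A/n) + 2*pprime
## [1] -2033.551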

To extract the BIC, the same function extractAIC() is used, but the penalty is adjusted:

k <- log( length(lungcap$FEV) )
extractAIC(LC.A, k = k)
## [1]     3.000 -2020.102
extractAIC(LC.B, k = k)
## [1]     3.000 -2457.278

The AIC and the BIC are both lower (closer to −∞) for the second model, which uses Ht. Both criteria therefore suggest that the combination of Ht and Smoke is more useful as a set of explanatory variables than the combination of Age and Smoke.


2.12 Tools to Assist in Model Selection

2.12.1 Adding and Dropping Variables

The order in which the variables are added is usually important.

The functions in R explore the impact of adding one variable (add1()) and dropping one variable (drop1()) from the current model, one at a time. The function step() repeatedly uses add1() and drop1() to suggest a model, basing the decisions on the values of the AIC (by default) or the BIC.

Consider the lung capacity data, and the four explanatory variables Age, Ht, Gender and Smoke.

drop1( lm( log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap), test="F")

The output shows the value of the AIC for the original model, and also when Age, Ht, Gender and Smoke are removed from model one at a time.

LC.full <- lm( log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
add1( lm( log(FEV) ~ Smoke, data = lungcap), LC.full, test = "F" )

In a similar fashion, using add1() adds explanatory variables one at a time. Using add1() requires two inputs: the simplest and the most complex systematic components to be considered.

The output shows that any one of the explanatory variables can be added to the simple model log(FEV) ~ Smoke and improve the model (the AIC becomes closer to −∞). Since the AIC is smallest when Ht is added, we would add Ht to the systematic component, and then use add1() again.


2.12.2 Automated Methods for Model Selection

The step() function has three commonly-used inputs. The input object and the input scope together indicate the range of models for R to consider, and their use depends on which type of approach is used.

-The first argument in step() is the minimal acceptable model, from which no variables can be removed.

min.model <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)

-We now use step() to suggest a model for the lungcap data, considering models as complex as max.model below, which specifies all two-way interactions between the variables.

max.model <- lm( log(FEV) ~ (Smoke + Age + Ht + Gender)^2, data = lungcap)
auto.forward <- step( min.model, direction = "forward",
scope = list(lower = min.model, upper = max.model) )
## Start:  AIC=-2516.58
## log(FEV) ~ Age + Ht + Gender + Smoke
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      13.734 -2516.6
## + Smoke:Age     1  0.040136 13.693 -2516.5
## + Ht:Gender     1  0.006336 13.727 -2514.9
## + Smoke:Gender  1  0.003488 13.730 -2514.7
## + Age:Gender    1  0.003119 13.730 -2514.7
## + Smoke:Ht      1  0.001732 13.732 -2514.7
## + Age:Ht        1  0.001455 13.732 -2514.6
auto.backward <- step( max.model, direction = "backward",
scope = list(lower = min.model, upper = max.model) )
## Start:  AIC=-2507.58
## log(FEV) ~ (Smoke + Age + Ht + Gender)^2
## 
##                Df Sum of Sq    RSS     AIC
## - Smoke:Gender  1  0.000626 13.671 -2509.6
## - Age:Gender    1  0.000859 13.671 -2509.5
## - Age:Ht        1  0.002852 13.674 -2509.4
## - Smoke:Ht      1  0.005607 13.676 -2509.3
## - Ht:Gender     1  0.007841 13.678 -2509.2
## <none>                      13.671 -2507.6
## - Smoke:Age     1  0.048232 13.719 -2507.3
## 
## Step:  AIC=-2509.55
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Smoke:Ht + 
##     Age:Ht + Age:Gender + Ht:Gender
## 
##              Df Sum of Sq    RSS     AIC
## - Age:Gender  1  0.000581 13.672 -2511.5
## - Age:Ht      1  0.003282 13.675 -2511.4
## - Ht:Gender   1  0.007945 13.679 -2511.2
## - Smoke:Ht    1  0.009919 13.681 -2511.1
## <none>                    13.671 -2509.6
## - Smoke:Age   1  0.047923 13.719 -2509.3
## 
## Step:  AIC=-2511.52
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Smoke:Ht + 
##     Age:Ht + Ht:Gender
## 
##             Df Sum of Sq    RSS     AIC
## - Age:Ht     1  0.003749 13.676 -2513.3
## - Smoke:Ht   1  0.009377 13.681 -2513.1
## - Ht:Gender  1  0.011841 13.684 -2512.9
## <none>                   13.672 -2511.5
## - Smoke:Age  1  0.047397 13.719 -2511.3
## 
## Step:  AIC=-2513.34
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Smoke:Ht + 
##     Ht:Gender
## 
##             Df Sum of Sq    RSS     AIC
## - Smoke:Ht   1  0.007326 13.683 -2515.0
## - Ht:Gender  1  0.008717 13.684 -2514.9
## <none>                   13.676 -2513.3
## - Smoke:Age  1  0.050584 13.726 -2512.9
## 
## Step:  AIC=-2514.99
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Ht:Gender
## 
##             Df Sum of Sq    RSS     AIC
## - Ht:Gender  1  0.010484 13.693 -2516.5
## <none>                   13.683 -2515.0
## - Smoke:Age  1  0.044284 13.727 -2514.9
## 
## Step:  AIC=-2516.49
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age
## 
##             Df Sum of Sq    RSS     AIC
## - Smoke:Age  1  0.040136 13.734 -2516.6
## <none>                   13.693 -2516.5
## 
## Step:  AIC=-2516.58
## log(FEV) ~ Smoke + Age + Ht + Gender
auto.both <- step( min.model, direction = "both",
scope = list(lower = min.model, upper = max.model) )
## Start:  AIC=-2516.58
## log(FEV) ~ Age + Ht + Gender + Smoke
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      13.734 -2516.6
## + Smoke:Age     1  0.040136 13.693 -2516.5
## + Ht:Gender     1  0.006336 13.727 -2514.9
## + Smoke:Gender  1  0.003488 13.730 -2514.7
## + Age:Gender    1  0.003119 13.730 -2514.7
## + Smoke:Ht      1  0.001732 13.732 -2514.7
## + Age:Ht        1  0.001455 13.732 -2514.6

In this case, the three approaches produce the same models:

signif( coef(auto.forward), 3 )
## (Intercept)         Age          Ht     GenderM       Smoke 
##     -1.9400      0.0234      0.0428      0.0293     -0.0461
signif( coef(auto.backward), 3 )
## (Intercept)       Smoke         Age          Ht     GenderM 
##     -1.9400     -0.0461      0.0234      0.0428      0.0293
signif( coef(auto.both), 3 )
## (Intercept)         Age          Ht     GenderM       Smoke 
##     -1.9400      0.0234      0.0428      0.0293     -0.0461

The three methods do not always produce the same suggested model.


2.12.3 Objections to Using Stepwise Procedures


2.13 Case Study

data( dental )
summary( dental )
##       Country      Indus        Sugar            DMFT      
##  Albania  : 1   Ind   :29   Min.   : 0.97   Min.   :0.300  
##  Algeria  : 1   NonInd:61   1st Qu.:14.53   1st Qu.:1.600  
##  Angolia  : 1               Median :33.79   Median :2.300  
##  Argentina: 1               Mean   :30.14   Mean   :2.656  
##  Australia: 1               3rd Qu.:44.32   3rd Qu.:3.350  
##  Austria  : 1               Max.   :63.02   Max.   :8.100  
##  (Other)  :84
plot( DMFT ~ Sugar, las=1, data=dental, pch=ifelse( Indus=="Ind", 19, 1),
xlab="Mean annual sugar consumption\n(kg/person/year)",
ylab="Mean DMFT at age 12")
legend("topleft", pch=c(19, 1), legend=c("Indus.","Non-indus."))

boxplot(DMFT ~ Indus, data=dental, las=1,
ylab="Mean DMFT at age 12", xlab="Type of country")

Consider fitting the linear regression model, including interactions:

lm.dental <- lm( DMFT ~ Sugar * Indus, data=dental)
anova(lm.dental)

The effect of Indus is not significant after adjusting for Sugar. The interaction between sugar consumption and industrialization is marginally significant after adjusting for the main effects of sugar consumption and industrialization.

Consider the fitted model:

coef( summary( lm.dental ) )
##                      Estimate Std. Error    t value    Pr(>|t|)
## (Intercept)        3.90857067 1.28649859  3.0381461 0.003151855
## Sugar             -0.01306504 0.03014315 -0.4334332 0.665785323
## IndusNonInd       -2.74389029 1.32480815 -2.0711605 0.041341018
## Sugar:IndusNonInd  0.06004128 0.03198042  1.8774386 0.063847913

This output indicates that the mean sugar consumption is not significant after adjusting for the other variables. Furthermore, the coefficient for sugar consumption is negative (though not statistically significant), suggesting greater sugar consumption is associated with lower mean numbers of DMFT. Recall this interpretation is for industrialized countries (Indus=="Ind"), for which the dummy variable IndusNonInd is 0.

For non-industrialized countries, the coefficient for sugar consumption is

sum( coef(lm.dental)[ c(2, 4) ] )
## [1] 0.04697624

For non-industrialized countries, the coefficient for sugar consumption is positive. Plotting the two lines (using abline()) is informative:

dental.cf <- coef( lm.dental )
plot( DMFT ~ Sugar, las = 1, data = dental, pch = ifelse( Indus=="Ind", 19, 1),
xlab="Mean annual sugar consumption\n(kg/person/year)",
ylab="Mean DMFT at age 12")
legend("topleft", lty = c(1,2), pch = c(19, 1), legend = c("Indus.","Non-indus."))
abline(a = dental.cf[1], b = dental.cf[2], lwd = 2, lty = 1)
abline(a = sum( dental.cf[c(1, 3)]), b = sum(dental.cf[c(2, 4)]),
lwd = 2, lty = 2)

Both the intercept and slope for NonInd are computed as the sum of the appropriate two coefficients.

Note that the data for the industrialized countries span a much narrower range of sugar consumptions than those for non-industrialized countries:

range( dental$Sugar[dental$Indus=="Ind"] )    # Industrialized
## [1] 22.16 53.54
range( dental$Sugar[dental$Indus=="NonInd"] ) # Non-industrialized
## [1]  0.97 63.02

-No connection between sugar consumption and the number of DMFT can be made for individuals.

-Assuming that the relationships observed for a population also applies to individuals within the population is called the ecological fallacy.

-Also, since the data are observational, no cause-and-effect can be inferred.


2.14 Using R for Fitting Linear Regression Models

For fitting linear regression models, the function lm() is used.

Common inputs to lm() are:

-formula: the model formula, with the response on the left of the ~ and the explanatory variables on the right (Sect. 2.6);

-data: the data frame in which the variables are located;

-weights: the prior weights \(w_i\), which may be omitted if all the prior weights are equal to one.


2.15 Summary



Problems

2.1. In this problem, we consider two ways of writing the systematic component of a simple linear regression model.

  1. Interpret the meaning of the constant term \(β_0\) when the systematic component is written as \(μ = β_0 + β_1x\).

    \(\beta_0\) is the predicted value when \(x = 0\).

  2. Interpret the meaning of the constant term \(α_0\) when the systematic component is written as \(μ = α_0 + β_1(x − \bar{x})\).

    \(\alpha_0\) is the predicted value when \(x = \bar{x}\) (the expected response at the mean of the explanatory variable).