Chapter 5 (S & Y) exercises 7

Table 5.23 shows a set of data originally given by Longley (1967) and reproduced by Wetherill (1986). The objective is to predict the variable y, total derived employment, as a function of six other variables x1, . . . , x6. All the regression models include an intercept.

  1. Table 1, Table 2 and Table 3 shows the residual sum of squares (RSS) for all possible models containing linear combinations of x1,…,x6. Based on these, which model would you choose? (You may use any method of variable selection you prefer, but be sure to indicate the rationale behind your selection.)
rm(list=ls(all=TRUE)) 
d <- read.table("/Users/JieHuang1/Dropbox/Statistics_study/STOR_664/Homework/HW5/longley.dat",header=T)
attach(d)
reg <- lm(y~.,d)
sse <- sum((reg$fitted.values-y)^2)

library(leaps)
b<-regsubsets(y~.,d)
summary(b)
## Subset selection object
## Call: regsubsets.formula(y ~ ., d)
## 6 Variables  (and intercept)
##    Forced in Forced out
## x1     FALSE      FALSE
## x2     FALSE      FALSE
## x3     FALSE      FALSE
## x4     FALSE      FALSE
## x5     FALSE      FALSE
## x6     FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          x1  x2  x3  x4  x5  x6 
## 1  ( 1 ) " " "*" " " " " " " " "
## 2  ( 1 ) " " " " "*" " " " " "*"
## 3  ( 1 ) " " " " "*" "*" " " "*"
## 4  ( 1 ) " " "*" "*" "*" " " "*"
## 5  ( 1 ) " " "*" "*" "*" "*" "*"
## 6  ( 1 ) "*" "*" "*" "*" "*" "*"
rs<-summary(b)
par(mfrow=c(1,2))
plot (2:7, rs$adjr2, xlab="No. of Parameters",ylab="Adjusted R-square", main="embedded code")

#read from table
n=16
p=seq(7,2)
SSE=c(836424, 839348, 858680, 1323360, 3272124, 6036140)
SSTO=185008830
1-(n-1)*SSE/((n-p)*SSTO)
## [1] 0.9924650 0.9931948 0.9936710 0.9910588 0.9795927 0.9650433
plot(7:2, 1-(n-1)*SSE/((n-p)*SSTO), xlab="No. of Parameters", ylab = "Adjusted R-square", main = "direct calculation")

In both of these two situations, we should choose p=5, i.e, 5 variables: intercept, x2, x3, x4, x6.

# Using Mallow's Cp
plot(2:7, rs$cp, xlab="No. of Parameters",ylab="Cp Statistic", main="Embedded code")
abline (0, 1)

Now consider the model dropping x5 but including all the other variables. (Note: There is no reason why this should be the same model as you selected in (a).) Table 4 shows the parameter values, standard errors, t statistics and p-values. Fig. 1 is a plot which shows the R-student (or externally studentized) residuals against (a) time, and (b) fitted values, for this model.

Based on Table 4, Fig. 1 and any other features of the data that occur to you, write a brief summary report of your conclusions. Your summary should include statistical conclusions, such as whether the model appears to fit the data well, but should also explain the implica- tions of the analysis that might be of interest to an economist. If you had the opportunity to perform further analyses, what would you try?

reg2 <- lm(y~x1+x2+x3+x4+x6)
summary(reg2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x6)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.12 -143.42  -35.60   97.28  461.36 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.565e+06  7.724e+05  -4.615 0.000957 ***
## x1           2.771e+01  6.075e+01   0.456 0.657984    
## x2          -4.213e-02  1.762e-02  -2.391 0.037891 *  
## x3          -2.104e+00  3.029e-01  -6.945 3.97e-05 ***
## x4          -1.042e+00  2.002e-01  -5.207 0.000397 ***
## x6           1.869e+03  3.994e+02   4.680 0.000867 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 290 on 10 degrees of freedom
## Multiple R-squared:  0.9955, Adjusted R-squared:  0.9932 
## F-statistic: 437.9 on 5 and 10 DF,  p-value: 2.266e-11