Table 5.23 shows a set of data originally given by Longley (1967) and reproduced by Wetherill (1986). The objective is to predict the variable y, total derived employment, as a function of six other variables x1, . . . , x6. All the regression models include an intercept.
rm(list=ls(all=TRUE))
d <- read.table("/Users/JieHuang1/Dropbox/Statistics_study/STOR_664/Homework/HW5/longley.dat",header=T)
attach(d)
reg <- lm(y~.,d)
sse <- sum((reg$fitted.values-y)^2)
library(leaps)
b<-regsubsets(y~.,d)
summary(b)
## Subset selection object
## Call: regsubsets.formula(y ~ ., d)
## 6 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## x5 FALSE FALSE
## x6 FALSE FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
## x1 x2 x3 x4 x5 x6
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " "*"
## 3 ( 1 ) " " " " "*" "*" " " "*"
## 4 ( 1 ) " " "*" "*" "*" " " "*"
## 5 ( 1 ) " " "*" "*" "*" "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" "*"
rs<-summary(b)
par(mfrow=c(1,2))
plot (2:7, rs$adjr2, xlab="No. of Parameters",ylab="Adjusted R-square", main="embedded code")
#read from table
n=16
p=seq(7,2)
SSE=c(836424, 839348, 858680, 1323360, 3272124, 6036140)
SSTO=185008830
1-(n-1)*SSE/((n-p)*SSTO)
## [1] 0.9924650 0.9931948 0.9936710 0.9910588 0.9795927 0.9650433
plot(7:2, 1-(n-1)*SSE/((n-p)*SSTO), xlab="No. of Parameters", ylab = "Adjusted R-square", main = "direct calculation")
In both of these two situations, we should choose p=5, i.e, 5 variables: intercept, x2, x3, x4, x6.
# Using Mallow's Cp
plot(2:7, rs$cp, xlab="No. of Parameters",ylab="Cp Statistic", main="Embedded code")
abline (0, 1)
Now consider the model dropping x5 but including all the other variables. (Note: There is no reason why this should be the same model as you selected in (a).) Table 4 shows the parameter values, standard errors, t statistics and p-values. Fig. 1 is a plot which shows the R-student (or externally studentized) residuals against (a) time, and (b) fitted values, for this model.
Based on Table 4, Fig. 1 and any other features of the data that occur to you, write a brief summary report of your conclusions. Your summary should include statistical conclusions, such as whether the model appears to fit the data well, but should also explain the implica- tions of the analysis that might be of interest to an economist. If you had the opportunity to perform further analyses, what would you try?
reg2 <- lm(y~x1+x2+x3+x4+x6)
summary(reg2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.12 -143.42 -35.60 97.28 461.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.565e+06 7.724e+05 -4.615 0.000957 ***
## x1 2.771e+01 6.075e+01 0.456 0.657984
## x2 -4.213e-02 1.762e-02 -2.391 0.037891 *
## x3 -2.104e+00 3.029e-01 -6.945 3.97e-05 ***
## x4 -1.042e+00 2.002e-01 -5.207 0.000397 ***
## x6 1.869e+03 3.994e+02 4.680 0.000867 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 290 on 10 degrees of freedom
## Multiple R-squared: 0.9955, Adjusted R-squared: 0.9932
## F-statistic: 437.9 on 5 and 10 DF, p-value: 2.266e-11