Exercise 6.8 (FLAG2 data):

Consider building a model for the low-bid price (y). Apply stepwise regression to the data to find the independent variables most suitable for modeling y.

I chose to use the step function in R, with the backward selection method and the AIC criterion, to select a suitable model. Also note that because SUBCONT is the only predictor with missing data, I omit it from the model.
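As a quick sanity check of that claim, counting the missing values per column of the data frame (loaded as flags, as in the model call below) should show a nonzero NA count only under SUBCONT:

colSums(is.na(flags))  # SUBCONT should be the only column with missing values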

The step function needs some tuning to find the model we want: the ‘k’ penalty parameter is set to 4 so that each term-wise comparison in the stepwise search corresponds to a significance level near .05.
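This choice follows from the chi-square approximation behind each single-term comparison; the .05 critical value with one degree of freedom is just under 4:

qchisq(p = 0.95, df = 1)  # 3.84, so k = 4 is a slightly stricter test than alpha = .05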

The resulting model can be seen below:

# Full model: all columns except the response and SUBCONT (columns 1 and 15)
bwselect <- step(lm(as.formula(paste("LOWBID ~",
                                     paste(colnames(flags)[-c(1, 15)], collapse = " + "))),
                    data = flags),
                 direction = "backward", k = 4, trace = FALSE)
summary(bwselect)
## 
## Call:
## lm(formula = LOWBID ~ DOTEST + LBERATIO, data = flags)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1970032   -54711    14244    63488  1480852 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.798e+05  9.028e+04  -8.637 4.71e-16 ***
## DOTEST       9.316e-01  7.863e-03 118.482  < 2e-16 ***
## LBERATIO     8.232e+05  9.043e+04   9.103  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254100 on 276 degrees of freedom
## Multiple R-squared:  0.9807, Adjusted R-squared:  0.9806 
## F-statistic:  7019 on 2 and 276 DF,  p-value: < 2.2e-16
print(paste("Minimum AIC = ",round(min(bwselect$anova$AIC), digits=3)))
## [1] "Minimum AIC =  6953.64"

The resulting model has an excellent fit, with \(R_{adj}^2 = 98.06\%\) of the variability explained. Also note that this model was chosen based on minimum AIC, which for this model was reported as 6953.64.

Residual diagnostics and model assumptions must still be checked before fully accepting the model.
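A minimal sketch of those checks, using base R’s standard diagnostic plots on the fitted object:

par(mfrow = c(2, 2))
plot(bwselect)  # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))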

Interpret the \(\beta\) estimates in the resulting stepwise regression model.

Consider the resulting model from part A,

\[\begin{align*} Y &= \beta_0 + \beta_1 x_{\text{DOTEST}} + \beta_2 x_{\text{LBERATIO}} + \varepsilon \\ \hat{y} &= -779766.1 + 0.9316359\, x_{\text{DOTEST}} + 823187.1\, x_{\text{LBERATIO}} \end{align*}\]

To interpret the \(\beta\) estimates, consider the two predictors. Holding LBERATIO fixed, a one-unit increase in DOTEST is associated with a 0.9316-unit increase in the expected low-bid price; holding DOTEST fixed, a one-unit increase in LBERATIO is associated with an 823,187.1-unit increase in the expected low-bid price.
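To make this concrete, the fitted coefficient implies the following dollar-scale effect (a minimal sketch using the bwselect fit from part A, assuming DOTEST and LOWBID are both measured in dollars):

# Expected change in LOWBID for a $1,000 increase in the DOT estimate,
# with LBERATIO held fixed:
unname(coef(bwselect)["DOTEST"] * 1000)  # roughly 931.64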

What are the dangers associated with drawing inferences from the stepwise model?

Since stepwise selection performs a large number of t-tests, the overall rates of type I and type II error increase significantly. Additionally, our model will only include first-order, main-effect predictors. If we want interaction terms or higher-order terms, we need other methods of model selection.
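For example, step accepts a richer starting model, so a sketch of a search that at least considers the two-way interaction between the selected predictors might look like:

# Illustrative only: backward elimination over main effects plus their two-way
# interaction (the ^2 operator expands to DOTEST + LBERATIO + DOTEST:LBERATIO)
fullint <- lm(LOWBID ~ (DOTEST + LBERATIO)^2, data = flags)
step(fullint, direction = "backward", k = 4, trace = FALSE)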

Exercise 6.10 (Turbine data)

library(leaps)
turbine <- read.table("/Users/seankrinik/Documents/CSULB/STAT510/HW4/GASTURBINE.txt", 
                      header = TRUE)
# Recode engine type as numeric: Traditional = 0, Advanced = 1, anything else = 2
turbine[, "ENGINE"] <- ifelse(turbine[, "ENGINE"] == "Traditional", 0,
                              ifelse(turbine[, "ENGINE"] == "Advanced", 1, 2))
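One design note: the 0/1/2 coding treats the three engine types as an equally spaced numeric scale. A sketch of the alternative, not used in the analysis below, treats ENGINE as an unordered factor so that lm builds separate dummy variables:

# Alternative coding (kept in a separate object so the analysis below is unchanged)
turbine_fct <- transform(turbine, ENGINE = factor(ENGINE))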

Use stepwise regression (with stepwise selection) to find the “best” predictors of heat rate.

Using a forward selection technique, I used the ‘step’ iterative model selection function to select the model with the lowest AIC value.

fwselect <- step(lm(HEATRATE ~ 1, data = turbine),                  # start from the intercept-only model
                 scope = formula(lm(HEATRATE ~ ., data = turbine)), # allow any predictor to enter
                 direction = "forward", k = 4, trace = FALSE)
paste("AIC for forward selection method: "
      ,round(extractAIC(fwselect)[2],2))
## [1] "AIC for forward selection method:  826.9"

We can now use this AIC value as a benchmark for the other models. Below is the model summary, where we achieve a high \(R^2_{adj} = .915\).

summary(fwselect)
## 
## Call:
## lm(formula = HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP, data = turbine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1025.8  -297.9  -115.3   225.8  1425.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.436e+04  7.333e+02  19.582  < 2e-16 ***
## RPM          1.051e-01  1.071e-02   9.818 2.55e-14 ***
## INLET.TEMP  -9.223e+00  7.869e-01 -11.721  < 2e-16 ***
## EXH.TEMP     1.243e+01  2.071e+00   6.000 1.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 465 on 63 degrees of freedom
## Multiple R-squared:  0.9189, Adjusted R-squared:  0.915 
## F-statistic: 237.9 on 3 and 63 DF,  p-value: < 2.2e-16

Use stepwise regression (with backward elimination) to find the “best” predictors of heat rate.

Similar to part A, I just specify the backward selection method and provide the full model as a starting point:

bwselect <- step(lm(HEATRATE ~ ., data = turbine),  # start from the full model
                 direction = "backward", k = 4, trace = FALSE)
paste("AIC for backward selection method: ",
      round(extractAIC(bwselect)[2],2))
## [1] "AIC for backward selection method:  826.9"

The AICs appear to be the same for the forward and backward methods. Let’s compare the predictors chosen.

table(names(fwselect$model)[-1],names(bwselect$model)[-1])
##             
##              EXH.TEMP INLET.TEMP RPM
##   EXH.TEMP          1          0   0
##   INLET.TEMP        0          1   0
##   RPM               0          0   1

As we can see, the models are identical.
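Equivalently, a direct set comparison avoids reading the contingency table:

# TRUE exactly when the two selections keep the same predictors
setequal(names(fwselect$model)[-1], names(bwselect$model)[-1])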

Use all-possible-regressions-selection to find the “best” predictors of heat rate.

Using Mallows’ Cp as the criterion in leaps, I apply the all-possible-regressions selection method to arrive at a new model:

# Exhaustively evaluate all subsets of the 8 candidate predictors, ranked by Cp
allselect <- leaps(y = turbine[, "HEATRATE"], x = turbine[, 1:8],
                   method = "Cp",
                   names = colnames(turbine)[1:8])
# Refit the subset with the smallest Cp as an lm model
cpkeepers <- allselect$which[which.min(allselect$Cp), ]
cpmodel <- lm(as.formula(paste("HEATRATE ~",
                               paste(names(cpkeepers[cpkeepers]), collapse = " + "))),
              data = turbine)
paste("AIC for Mallows' Cp all-possible-regressions selection method: ",
      round(extractAIC(cpmodel)[2], 2))
## [1] "AIC for Mallows' Cp all-possible-regressions selection method:  824.96"

The model has a lower AIC than the forward and backward selection methods.
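To see where the Cp criterion settles, a quick sketch tabulates the smallest Cp achieved at each model size (note that leaps counts the intercept in size):

tapply(allselect$Cp, allselect$size, min)  # minimum Cp by model size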

Compare the results, parts a–c. Which independent variables consistently are selected as the “best” predictors?

Since the Cp model is the only one with an extra predictor, the three predictors common to all three models are:

names(fwselect$model)[-1]
## [1] "RPM"        "INLET.TEMP" "EXH.TEMP"

Excluding the intercept term, we can conclude that RPM, inlet temperature, and exhaust temperature are consistently the best predictors of heat rate. The Mallows’ Cp model includes airflow as an additional predictor.

Explain how you would use the results, parts a–c, to develop a model for heat rate.

To develop a model, we could use the common predictors from all three models (RPM, inlet temperature, exhaust temperature) and analyze the contribution of the airflow term by examining the all-possible-regressions model in more detail, as sketched below. We do have to note that these methods of model selection only yield first-order terms; interaction and higher-order predictors are not accounted for. Also, we must analyze the residuals before accepting the model.
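One way to analyze that contribution, sketched here under the assumption that cpmodel adds the airflow variable to the three common predictors, is a partial F-test between the nested fits:

# Nested-model comparison: does the extra Cp-selected term (airflow)
# significantly reduce the residual sum of squares?
reduced <- lm(HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP, data = turbine)
anova(reduced, cpmodel)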

At first glance, the all-possible-regressions model has the most attractive \(R^2_{adj} = .9186\) and a lower AIC than the forward and backward models. Choosing the Mallows’ Cp-based model seems to be the best choice of the three.