I chose to use the steps function in R, using the backwards selection method and AIC criterion to select a suitable model. Also note, due to missing data in only one predictor, I omit SUBCONT from the model.
The step function needs tuning to find the model we want. The ‘k’ parameter needs to be changed to 4 in order to reflect a significance level near .05 for our stepwise selection comparisons.
The resulting model can be seen below:
bwselect <- step(lm(paste("LOWBID~",
paste(colnames(flags)[-c(1,15)], collapse = "+")),
data=flags), direction = "backward", k=4, trace = F)
summary(bwselect)
##
## Call:
## lm(formula = LOWBID ~ DOTEST + LBERATIO, data = flags)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1970032 -54711 14244 63488 1480852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.798e+05 9.028e+04 -8.637 4.71e-16 ***
## DOTEST 9.316e-01 7.863e-03 118.482 < 2e-16 ***
## LBERATIO 8.232e+05 9.043e+04 9.103 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254100 on 276 degrees of freedom
## Multiple R-squared: 0.9807, Adjusted R-squared: 0.9806
## F-statistic: 7019 on 2 and 276 DF, p-value: < 2.2e-16
print(paste("Minimum AIC = ",round(min(bwselect$anova$AIC), digits=3)))
## [1] "Minimum AIC = 6953.64"
The resulting model has an excellent adjusted \(R_{adj}^2\) at 98.06% explained variability. Also note, this model was chosen based on minimum AIC, where AIC for this model was reported as 6953.64.
Residual diagnostics and model assumptions must be checked to fully accept the model.
Consider the resulting model from part A,
\[\begin{align*} Y_{ij} &= \beta_0 + \beta_1 x_{\text{DOTEST}} + \beta_2 x_{\text{LBERATIO}} \\ Y_{ij} &= -779766.1 + 0.9316359 x_{\text{DOTEST}} + 823187.1 x_{\text{LBERATIO}} \end{align*}\]To interpret the \(\beta\) values, let’s consider our predictors - DOTEST and LBERATIO. A one unit change in our response is achieved by a .9316 unit change in DOTEST and a 823187.1 unit change in LBERATIO.
Since we are performing a large number of t-tests while using the stepwise selection method, the rate of type-I and type-II error increases significantly. Additionally, our model only will only include/accept first-order and main effect predictors. If we want interaction terms or higher-order terms, we need other methods of model selection.
library(leaps)
turbine <- read.table("/Users/seankrinik/Documents/CSULB/STAT510/HW4/GASTURBINE.txt",
header=T)
turbine[,"ENGINE"] <- ifelse(turbine[,"ENGINE"] == "Traditional",
0, ifelse(turbine[,"ENGINE"] == "Advanced", 1, 2))
Using a forwards stepwise regression technique, I used the ‘steps’ iterative model selection method to select the model with lowest AIC value.
fwselect<- step(lm(HEATRATE~1, data=turbine),
scope=formula(lm(HEATRATE~., data=turbine)),
direction = "forward", k = 4, trace = F)
paste("AIC for forward selection method: "
,round(extractAIC(fwselect)[2],2))
## [1] "AIC for forward selection method: 826.9"
We can use this AIC value as a benchmark for the other models now. Below is the model summary where we achieve a high \(R^2_{adj} = .915\).
summary(lm(fwselect))
##
## Call:
## lm(formula = fwselect)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1025.8 -297.9 -115.3 225.8 1425.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.436e+04 7.333e+02 19.582 < 2e-16 ***
## RPM 1.051e-01 1.071e-02 9.818 2.55e-14 ***
## INLET.TEMP -9.223e+00 7.869e-01 -11.721 < 2e-16 ***
## EXH.TEMP 1.243e+01 2.071e+00 6.000 1.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 465 on 63 degrees of freedom
## Multiple R-squared: 0.9189, Adjusted R-squared: 0.915
## F-statistic: 237.9 on 3 and 63 DF, p-value: < 2.2e-16
Similar to part A, I just specify the backward selection method and provide the full model as a starting point:
bwselect<- step(lm(HEATRATE~., data=turbine),
direction = "backward", k = 4, trace = F)
paste("AIC for backward selection method: ",
round(extractAIC(bwselect)[2],2))
## [1] "AIC for backward selection method: 826.9"
The AIC’s appear to be the same for the forward and backward methods. Lets compare the predictors chosen.
table(names(fwselect$model)[-1],names(bwselect$model)[-1])
##
## EXH.TEMP INLET.TEMP RPM
## EXH.TEMP 1 0 0
## INLET.TEMP 0 1 0
## RPM 0 0 1
As we can see, the models are identical.
Using the Mallow’s CP method in leaps, I use the all-possible regressions selection method to achieve a new model:
allselect <- leaps(y=turbine[,"HEATRATE"], x=turbine[,1:8],
method = "Cp",
names = colnames(turbine[1:8]))
cpkeepers <- allselect$which[which(allselect$Cp == min(allselect$Cp)),]
cpmodel <- lm(paste("HEATRATE~",
paste(names(cpkeepers[cpkeepers]), collapse = "+")),
data=turbine)
paste("AIC for Mallow's CP all-possible regressions selection method: ",
round(extractAIC(cpmodel)[2],2))
## [1] "AIC for Mallow's CP all-possible regressions selection method: 824.96"
The model has a lower AIC than the forward and backward selection methods.
Since the third model is the only one with an extra predictor, the three predictors seen in all three models are:
names(fwselect$model)[-1]
## [1] "RPM" "INLET.TEMP" "EXH.TEMP"
Excluding the intercept term. We can conclude that RPM, inlet temperature and exhaust temperature are consistently the best predictors for heat rate. The CP Mallow’s model includes Airflow as another predictor.
To develop a model, we could use the common predictors in all three models (RPM, inlet temperature, exhaust temperature), and analyze the contribution of the Airflow term by examining the all regessors model in more detail. We do have to note that these methods of model selection will only yield first order terms. Interaction and higher order predictors are not accounted for. Also, we must analyze the residuals before accepting the model.
At first glance, the mall regressors model seems to have the most attractive \(R^2_{adj} = .9186\), and a lower AIC than the forward and backward models. Choosing the Mallow’s CP based model seems to be the better choice of the three.