Preface

Next week I will be taking my final exam in ISYE 6414, Regression Analysis, at Georgia Tech as part of my MS degree program in Analytics. I have learned a lot of new material in this course and have had the chance to go much deeper into some aspects of regression that I thought I already understood pretty well. While the course covered SLR, ANOVA, MLR, Poisson regression, logistic regression, variable reduction and other techniques, this writeup will focus primarily on variable reduction.

The world is a much different place now than it was a month ago. It is my sincere hope that everyone takes social distancing and the other steps to slow the spread of COVID-19 seriously. Please check out my RPubs post for some descriptive analytics on that topic.

Thanks, and I wish everyone a happy and healthy remainder of 2020.

Introduction

This notebook will explore using various regression techniques on a data set related to air pollution and mortality. This document includes data exploration, model fitting, analysis of outliers, variable selection, goodness of fit, analysis of assumptions and prediction.

Data are from McDonald and Schwing (1973), “Instabilities of Regression Estimates Relating Air Pollution to Mortality,” Technometrics, 15, 463-481. This data set of 15 independent variables (see list below) and a measure of mortality on 60 US metropolitan areas in 1959-1961 was used to illustrate ridge regression (the full X matrix has a huge condition number).

Data Fields

  • precip: Mean annual precipitation in inches
  • janTemp: Mean January temperature in degrees Fahrenheit
  • julyTemp: Mean July temperature in degrees Fahrenheit
  • over65: Percent of 1960 SMSA population that is 65 years of age or over
  • household: Population per household, 1960 SMSA
  • ed25: Median school years completed for those over 25 in 1960 SMSA
  • housing: Percent of housing units that are sound and have all facilities
  • population: Population per square mile in urbanized area in 1960
  • nonWhite: Percent of 1960 urbanized area population that is non-white
  • employment: Percent employment in white-collar occupations in 1960 urbanized area
  • lowIncome: Percent of families with income under $3,000 in 1960 urbanized area
  • HC: Relative pollution potential of hydrocarbons, HC
  • NOx: Relative pollution potential of oxides of nitrogen, NOx
  • SO2: Relative pollution potential of sulfur dioxide, SO2
  • humidity: Percent relative humidity, annual average at 1 p.m.
  • mortalityRate: Total Age Adjusted Mortality Rate

Data Exploration

First, we load the data.

rm(list=ls())
pd <- read.csv('pollution_data.csv', header=TRUE)
head(pd)

We begin by looking at some plots:

# Pairs plot of the predictors (drop the last column, mortalityRate)
plot(pd[, 1:(ncol(pd) - 1)])

That plot is pretty ugly. What we really want to see is whether we have multicollinearity among the predictors, so let’s look at a correlation plot instead:

library(corrplot)
## corrplot 0.84 loaded
corr<-cor(pd)
corrplot(corr)

Here it does appear that we have some strongly correlated columns, most notably HC and NOx. We also see some inverse correlations, for example between nonWhite and over65, and between lowIncome and housing. It will be interesting to see whether those variables drop out when we do the variable reduction. Since the focus of the study is the effect of air pollution on mortality, I think we would want to force HC, NOx and SO2 into our models. If we are purely interested in prediction, however, we may want to let them drop out.
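As a quick numeric check on that multicollinearity (a minimal sketch, assuming the car package is available; it is loaded later for the residual diagnostics anyway, and X_full is just a scratch name), we can compute variance inflation factors and the condition number of the standardized design matrix, which is what McDonald and Schwing were referring to:

library(car)
# Sketch: quantify the multicollinearity suggested by the correlation plot.
vif(lm(mortalityRate ~ ., data = pd))     # VIFs well above 10 flag problematic predictors
X_full <- model.matrix(mortalityRate ~ ., data = pd)[, -1]
kappa(scale(X_full))                      # condition number of the centered and scaled predictor matrix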

Model Fitting

We will begin by fitting a simple linear model to our data to see what we are dealing with.

model1 <- lm(mortalityRate~., data=pd)
summary(model1)
## 
## Call:
## lm(formula = mortalityRate ~ ., data = pd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.068 -18.018   0.912  19.224  86.966 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.764e+03  4.373e+02   4.034 0.000215 ***
## precip       1.905e+00  9.237e-01   2.063 0.045078 *  
## janTemp     -1.938e+00  1.108e+00  -1.748 0.087415 .  
## julyTemp    -3.100e+00  1.902e+00  -1.630 0.110164    
## over65      -9.065e+00  8.486e+00  -1.068 0.291246    
## household   -1.068e+02  6.978e+01  -1.531 0.132936    
## ed25        -1.716e+01  1.186e+01  -1.447 0.155095    
## housing     -6.511e-01  1.768e+00  -0.368 0.714397    
## population   3.601e-03  4.027e-03   0.894 0.376132    
## nonWhite     4.460e+00  1.327e+00   3.360 0.001618 ** 
## employment  -1.871e-01  1.662e+00  -0.113 0.910840    
## lowIncome   -1.674e-01  3.227e+00  -0.052 0.958865    
## HC          -6.722e-01  4.910e-01  -1.369 0.177980    
## NOx          1.340e+00  1.006e+00   1.333 0.189508    
## SO2          8.626e-02  1.475e-01   0.585 0.561697    
## humidity     1.067e-01  1.169e+00   0.091 0.927690    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.93 on 44 degrees of freedom
## Multiple R-squared:  0.7649, Adjusted R-squared:  0.6847 
## F-statistic: 9.542 on 15 and 44 DF,  p-value: 2.193e-09

Here we have an \(R^2\) value of 0.7649, meaning that 76.49% of the variance in the response is explained by the predictors, which is rather high for a study like this. Many of the individual predictors are not statistically significant; however, the model as a whole clearly has explanatory value, since the p-value of the overall F-statistic is so low.
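As a reminder, the \(R^2\) reported by summary() is the proportion of total variation explained by the fit, and the adjusted version penalizes for the number of predictors \(p\):

\[ R^2 = 1-\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n (Y_i-\bar{Y})^2}, \qquad R^2_{adj} = 1-\frac{(1-R^2)(n-1)}{n-p-1} \]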

Analysis of outliers

We now turn our attention to the analysis of outliers, using Cook’s distance as the metric. A common rule of thumb is to flag a record as a potential outlier (more precisely, an influential point) if its Cook’s distance exceeds 4/n; a Cook’s distance greater than 1 is an even stronger warning sign.
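For reference, Cook’s distance for observation \(i\) measures how much all of the fitted values change when that observation is removed from the fit:

\[ D_i = \frac{\sum_{j=1}^n \left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2} \]

where \(\hat{Y}_{j(i)}\) is the \(j\)-th fitted value with observation \(i\) removed, \(p\) is the number of estimated coefficients, and \(\hat{\sigma}^2\) is the residual mean square.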

t = 4/nrow(pd)
cooks_distance <- cooks.distance(model1)
plot(cooks_distance)

outliers <- which(cooks_distance>t)
print(outliers)
##  2  6 28 29 32 37 48 59 
##  2  6 28 29 32 37 48 59

As we can see from above, we don’t have any records that have a Cook’s distance greater than 1, but we do have several points that exceed the 4/n threshold. We will remove those values and see what that does to the overall R-Squared value for the model:

#Don't forget THE DAMN COMMA!!! (We are dropping rows, not columns.)
pd_clean <- pd[-c(outliers),]
cat("\nFull dataset records:     ",nrow(pd))
## 
## Full dataset records:      60
cat("\nFiltered dataset records: ", nrow(pd_clean))
## 
## Filtered dataset records:  52

Now we refit…

model1 <- lm(mortalityRate~., data=pd_clean)
summary(model1)
## 
## Call:
## lm(formula = mortalityRate ~ ., data = pd_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.113 -14.502   0.437   8.671  52.358 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.869e+03  3.679e+02   5.081 1.17e-05 ***
## precip       2.426e+00  7.426e-01   3.267  0.00239 ** 
## janTemp     -5.699e-01  8.906e-01  -0.640  0.52625    
## julyTemp    -3.182e+00  1.393e+00  -2.284  0.02840 *  
## over65      -8.455e+00  6.513e+00  -1.298  0.20247    
## household   -8.408e+01  5.688e+01  -1.478  0.14803    
## ed25        -1.808e+01  1.097e+01  -1.648  0.10796    
## housing     -2.176e+00  1.535e+00  -1.417  0.16496    
## population   9.830e-03  3.553e-03   2.767  0.00888 ** 
## nonWhite     3.821e+00  1.108e+00   3.448  0.00146 ** 
## employment  -9.883e-01  1.348e+00  -0.733  0.46812    
## lowIncome   -3.080e+00  2.743e+00  -1.123  0.26887    
## HC          -6.656e-01  4.850e-01  -1.372  0.17848    
## NOx          1.782e+00  1.072e+00   1.663  0.10505    
## SO2         -2.400e-02  1.722e-01  -0.139  0.88997    
## humidity    -7.087e-01  9.304e-01  -0.762  0.45118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.33 on 36 degrees of freedom
## Multiple R-squared:  0.8698, Adjusted R-squared:  0.8156 
## F-statistic: 16.04 on 15 and 36 DF,  p-value: 1.26e-11

Removing those influential records had a positive effect on the R-Squared value: the model now explains 86.98% of the variability in the mortality rate.

Variable selection

The purpose of variable selection is to balance the tradeoff between variance and bias in the model. In the case of this dataset, we have a relatively large number of predictors and relatively few observations. For prediction we generally prefer a lower-variance model, even if that means accepting some bias.
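Concretely, for a prediction \(\hat{f}(x)\) the expected squared prediction error decomposes as

\[ E\left[\left(Y-\hat{f}(x)\right)^2\right] = \mathrm{Bias}\left(\hat{f}(x)\right)^2 + \mathrm{Var}\left(\hat{f}(x)\right) + \sigma^2, \]

so dropping weak predictors can reduce the variance term by more than it inflates the bias term.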

There are several ways to approach variable selection, including:

  1. Evaluate every possible combination of predictors and keep the model with the best value of a selection criterion (we will use Mallows' Cp below).
  2. Use a stepwise approach (forward, backward or both directions) to search for a good model.
  3. Use a regularized regression technique such as lasso, ridge regression or elastic net.

We will explore each of these in the following sections.

Every possible combination

Our model has 15 predictors. As a result, running through every possible combination would result in 2^15 = 32768 combinations. That is a lot, but not so many that we cannot make it work. Below is an attempt at that:

library(leaps)
# Remove the target variable.  VERY IMPORTANT!
features <- pd_clean[,-c(16)]
out<-leaps(features,pd_clean$mortalityRate)
best.model <- which(out$Cp==min(out$Cp))
print("Complete.")
## [1] "Complete."
as.matrix(out$which)[best.model,]
##     1     2     3     4     5     6     7     8     9     A     B     C     D 
##  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE 
##     E     F 
##  TRUE FALSE

In the above we do get a selected subset, but it drops two of our three pollution variables of interest: the columns are labeled 1-9 and A-F in predictor order, so C, D and E correspond to HC, NOx and SO2, and only SO2 survived. We can use regsubsets() to force those variables into the model, as shown here:

features <- pd_clean[,-c(16)]
reg <- regsubsets(features, pd_clean$mortalityRate, force.in=c(12,13,14))
reg.summary <- summary(reg)
reg.summary
## Subset selection object
## 15 Variables  (and intercept)
##            Forced in Forced out
## HC             FALSE      FALSE
## NOx            FALSE      FALSE
## SO2            FALSE      FALSE
## precip         FALSE      FALSE
## janTemp        FALSE      FALSE
## julyTemp       FALSE      FALSE
## over65         FALSE      FALSE
## household      FALSE      FALSE
## ed25           FALSE      FALSE
## housing        FALSE      FALSE
## population     FALSE      FALSE
## nonWhite        TRUE      FALSE
## employment      TRUE      FALSE
## lowIncome       TRUE      FALSE
## humidity       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          HC  NOx SO2 precip janTemp julyTemp over65 household ed25 housing
## 4  ( 1 ) "*" "*" "*" " "    " "     " "      " "    " "       " "  " "    
## 5  ( 1 ) "*" "*" "*" "*"    " "     " "      " "    " "       " "  " "    
## 6  ( 1 ) "*" "*" "*" "*"    " "     " "      " "    " "       " "  " "    
## 7  ( 1 ) "*" "*" "*" "*"    " "     " "      " "    " "       " "  " "    
## 8  ( 1 ) "*" "*" "*" "*"    " "     "*"      " "    " "       "*"  " "    
##          population nonWhite employment lowIncome humidity
## 4  ( 1 ) " "        "*"      " "        " "       " "     
## 5  ( 1 ) " "        "*"      " "        " "       " "     
## 6  ( 1 ) "*"        "*"      " "        " "       " "     
## 7  ( 1 ) "*"        "*"      "*"        " "       " "     
## 8  ( 1 ) "*"        "*"      " "        " "       " "

If we like, we can plot the resulting RSS, adjusted \(R^2\), Mallows' Cp (which is equivalent to AIC here) and BIC against model size…

par(mfrow=c(2,2))

# RSS
plot(reg.summary$rss, xlab="Number of Variables", ylab="Resid. Sum of Squares", type="l")
best.rss <- which.min(reg.summary$rss)
points(best.rss, reg.summary$rss[best.rss], col="red", cex=2,pch=20)

# AdjR2
plot(reg.summary$adjr2, xlab="Number of Variables", ylab="Adjusted R2", type="l")
best.adjr2 <- which.max(reg.summary$adjr2)
points(best.adjr2, reg.summary$adjr2[best.adjr2], col="red", cex=2,pch=20)

# Note in this case that AIC and Cp are the same.
# Mallows Cp
plot(reg.summary$cp, xlab="Number of Variables", ylab="Mallows Cp", type="l")
best.cp <- which.min(reg.summary$cp)
points(best.cp, reg.summary$cp[best.cp], col="red", cex=2,pch=20)

# BIC
plot(reg.summary$bic, xlab="Number of Variables", ylab="BIC", type="l")
best.bic <- which.min(reg.summary$bic)
points(best.bic, reg.summary$bic[best.bic], col="red", cex=2,pch=20)

Each plot marks the best model according to its criterion; we will go with Mallows' Cp.

Note that although the x-axis in the plots above only runs up to 5, the best model actually contains 8 predictors: the 5 freely selected variables plus the 3 that we forced into the model.

best.cp_model <- which.min(reg.summary$cp)
coef(reg, best.cp_model)
##   (Intercept)            HC           NOx           SO2        precip 
## 1144.58972229   -0.65506607    1.11554493    0.11175302    1.90636465 
##      julyTemp          ed25    population      nonWhite 
##   -2.57860142  -15.49961766    0.01077364    3.50381240

Stepwise variable reduction

An alternative to the “search every submodel” technique is stepwise regression. There are three options here: forward, backward, and both directions. I will start with forward selection:

Forward Stepwise

library(MASS)
base_mdl <- lm(mortalityRate~HC+NOx+SO2, data=pd_clean)
full_mdl <- lm(mortalityRate~., data=pd_clean)

# Note k=2 uses AIC.  You can use k=log(n) if you want BIC
forward_mdl <- step(base_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="forward", k=2)

(Note: Output has been hidden to reduce the size of this file.)

summary(forward_mdl)
## 
## Call:
## lm(formula = mortalityRate ~ HC + NOx + SO2 + nonWhite + precip + 
##     population + employment + julyTemp, data = pd_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.793 -14.461  -2.555  12.972  57.249 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.025e+03  8.439e+01  12.145 1.73e-15 ***
## HC          -4.046e-01  3.843e-01  -1.053 0.298341    
## NOx          5.520e-01  8.527e-01   0.647 0.520838    
## SO2          2.195e-01  1.271e-01   1.728 0.091246 .  
## nonWhite     3.516e+00  6.165e-01   5.702 9.90e-07 ***
## precip       2.195e+00  5.579e-01   3.934 0.000300 ***
## population   1.115e-02  3.097e-03   3.601 0.000817 ***
## employment  -2.189e+00  7.912e-01  -2.767 0.008306 ** 
## julyTemp    -2.120e+00  1.062e+00  -1.996 0.052326 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.29 on 43 degrees of freedom
## Multiple R-squared:  0.8451, Adjusted R-squared:  0.8163 
## F-statistic: 29.32 on 8 and 43 DF,  p-value: 5.156e-15

Which is nice: HC, NOx and SO2 are indeed in the final model, as guaranteed, because they form the lower scope of the search.
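As the comment above notes, swapping k=2 for k=log(n) gives the BIC version of the same search. A minimal sketch (output not shown here) would look like the following; since BIC penalizes each added term more heavily than AIC, we would expect a model no larger than the one above.

# Sketch: the same forward search under BIC (heavier penalty per added term).
n <- nrow(pd_clean)
forward_bic <- step(base_mdl, scope=list(lower=base_mdl, upper=full_mdl),
                    direction="forward", k=log(n), trace=FALSE)
summary(forward_bic)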

Backward Stepwise

library(MASS)
base_mdl <- lm(mortalityRate~HC+NOx+SO2, data=pd_clean)
full_mdl <- lm(mortalityRate~., data=pd_clean)

# Note k=2 uses AIC.  You can use k=log(n) if you want BIC
backward_mdl <- step(full_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="backward", k=2)

(Note: Output has been hidden to reduce the size of this file.)

summary(backward_mdl)
## 
## Call:
## lm(formula = mortalityRate ~ precip + julyTemp + over65 + household + 
##     ed25 + housing + population + nonWhite + lowIncome + HC + 
##     NOx + SO2, data = pd_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.792 -12.753  -0.563  13.067  50.505 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.747e+03  3.387e+02   5.159 7.58e-06 ***
## precip       2.125e+00  6.709e-01   3.167 0.002985 ** 
## julyTemp    -2.936e+00  1.257e+00  -2.336 0.024750 *  
## over65      -7.052e+00  5.340e+00  -1.321 0.194342    
## household   -6.125e+01  4.915e+01  -1.246 0.220099    
## ed25        -2.360e+01  7.686e+00  -3.070 0.003886 ** 
## housing     -2.212e+00  1.421e+00  -1.557 0.127482    
## population   1.038e-02  3.453e-03   3.005 0.004626 ** 
## nonWhite     3.795e+00  1.049e+00   3.616 0.000847 ***
## lowIncome   -3.941e+00  2.336e+00  -1.687 0.099622 .  
## HC          -9.267e-01  4.321e-01  -2.144 0.038297 *  
## NOx          2.038e+00  9.761e-01   2.088 0.043361 *  
## SO2         -3.836e-02  1.551e-01  -0.247 0.806009    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.98 on 39 degrees of freedom
## Multiple R-squared:  0.863,  Adjusted R-squared:  0.8209 
## F-statistic: 20.48 on 12 and 39 DF,  p-value: 3.462e-13

Stepwise Both model

Here we run the stepwise search in both directions. Note that the starting model includes a few extra variables beyond the three forced-in pollution terms.

library(MASS)
# Here I add a few more variables to the starting model.
base_mdl <- lm(mortalityRate~HC+NOx+SO2, data=pd_clean)
start_mdl <- lm(mortalityRate~HC+NOx+SO2+nonWhite+ed25+housing, data=pd_clean)
full_mdl <- lm(mortalityRate~., data=pd_clean)

# Note k=2 uses AIC.  You can use k=log(n) if you want BIC
sboth_mdl <- step(start_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="both", k=2)

(Note: Output has been hidden to reduce the size of this file.)

summary(sboth_mdl)
## 
## Call:
## lm(formula = mortalityRate ~ HC + NOx + SO2 + nonWhite + ed25 + 
##     population + precip + julyTemp, data = pd_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.282 -14.566   0.067  10.929  60.095 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1144.58972  103.44812  11.064 3.67e-14 ***
## HC            -0.65507    0.37222  -1.760  0.08554 .  
## NOx            1.11555    0.84896   1.314  0.19581    
## SO2            0.11175    0.13276   0.842  0.40458    
## nonWhite       3.50381    0.61393   5.707 9.73e-07 ***
## ed25         -15.49962    5.44603  -2.846  0.00676 ** 
## population     0.01077    0.00308   3.498  0.00110 ** 
## precip         1.90637    0.57483   3.316  0.00186 ** 
## julyTemp      -2.57860    1.04017  -2.479  0.01717 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.18 on 43 degrees of freedom
## Multiple R-squared:  0.8464, Adjusted R-squared:  0.8179 
## F-statistic: 29.63 on 8 and 43 DF,  p-value: 4.295e-15

Let’s compare what we got from the three searches:

library(plyr)
l<-list(data.frame(t(data.frame(coef(forward_mdl)))),
        data.frame(t(data.frame(coef(backward_mdl)))),
        data.frame(t(data.frame(coef(sboth_mdl)))))
stepwise_results <- do.call(rbind.fill, l)
row.names(stepwise_results) <- c('Forward','Backward','Both')
stepwise_results

In the above, we can see that the forward, backward and both-directions searches each return a somewhat different model, a reminder that stepwise selection depends on the starting model and the search direction. (Yay!)

Regularized regression

Regularized regression also comes in three flavors: lasso, ridge regression and elastic net. In regularized regression we balance the tradeoff between the sum of squared errors and the complexity of the model (the size of its coefficients). In general, we minimize:

\[ \sum_{i=1}^n\left(Y_i-(\beta_0+\beta_1x_{i,1}+\dots+\beta_px_{i,p})\right)^2+\lambda\,\mathrm{Penalty}(\beta_1,\dots,\beta_p) \]

We have three flavors of \(\mathrm{Penalty}\) to choose from:

  • L0: the number of nonzero regression coefficients. Penalizing this is essentially the same as searching through all possible submodels.
  • L1: the “sparsity” penalty, the sum of the absolute values of the regression coefficients. It is sparse in the sense that it allows some coefficients to be exactly zero, so it actually discards variables from the model. It also means there is no closed-form solution for minimizing the function above.
  • L2: the sum of the squared regression coefficients. This penalty accounts for multicollinearity but does not do variable reduction (it shrinks coefficients toward zero without making them exactly zero), and it does have a closed-form solution.

If we choose L0, we are back to the submodel search described earlier. If we choose L1, that is known as lasso regression. If we choose L2, that is known as ridge regression. And if we include both L1 and L2 with a mixing term \(\alpha\), that is known as elastic net regression. The penalties are written out below, and we will explore each method in turn.
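Written out, the penalties are (using glmnet's parameterization of the elastic net mixture, which is how the alpha argument used below is defined):

\[ L_1:\ \lambda\sum_{j=1}^p |\beta_j|, \qquad L_2:\ \lambda\sum_{j=1}^p \beta_j^2, \qquad \text{elastic net:}\ \lambda\left[\alpha\sum_{j=1}^p|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^p\beta_j^2\right] \]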

One additional note: when doing lasso or ridge we must decide on a value of \(\lambda\) to use. It is a tuning parameter rather than something the fit estimates for us, so we typically use cross-validation to pick the \(\lambda\) that minimizes the cross-validated prediction error, as sketched below.
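A minimal sketch of that cross-validation step with glmnet (here for the lasso; lambda.min is the error-minimizing value used in the sections below, while lambda.1se is glmnet's documented, more conservative alternative within one standard error of the minimum):

library(glmnet)
# Sketch: 10-fold cross-validation over the lambda path for the lasso (alpha = 1).
Y_cv <- pd_clean$mortalityRate
X_cv <- model.matrix(mortalityRate ~ ., pd_clean)[, -1]
cvfit <- cv.glmnet(X_cv, Y_cv, alpha = 1, nfolds = 10)
c(lambda.min = cvfit$lambda.min, lambda.1se = cvfit$lambda.1se)
plot(cvfit)   # cross-validated error curve; both lambda choices are marked with vertical lines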

Lasso

For lasso, we set alpha=1.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
Y <- pd_clean$mortalityRate
X <- model.matrix(mortalityRate~.,pd_clean)[,-1]
lassocv <- cv.glmnet(X, Y, alpha=1, nfolds=10)
lassomdl <- glmnet(X,Y,alpha=1, nlambda=100)
lassocoef <- coef(lassomdl, s=lassocv$lambda.min)
plot(lassomdl, xvar="lambda", lwd=2)
abline(v=log(lassocv$lambda.min), col="black",lty=2, lwd=2)

lassocoef
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 1050.76260183
## precip         2.15764730
## janTemp       -0.28356501
## julyTemp      -1.14934417
## over65         .         
## household      .         
## ed25          -6.98037734
## housing       -0.42462068
## population     0.01012992
## nonWhite       2.98761241
## employment    -1.31581902
## lowIncome      .         
## HC             .         
## NOx            .         
## SO2            0.23200129
## humidity      -0.14796521

Ridge

For ridge, we set alpha=0.

library(glmnet)
library(Compositional)
Y <- pd_clean$mortalityRate
X <- model.matrix(mortalityRate~.,pd_clean)[,-1]
ridgecv <- cv.glmnet(X, Y, alpha=0, nfolds=10)
ridgemdl <- glmnet(X,Y,alpha=0, lambda = ridgecv$lambda.min)
ridgecoef <- coef(ridgemdl, s=ridgecv$lambda.min)
ridge.plot(Y,X,lambda=seq(0,5,by=0.1))
abline(v=log(ridgecv$lambda.min), col="black",lty=2, lwd=2)

ridgecoef
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 1319.1386266
## precip         2.2872880
## janTemp       -0.4380676
## julyTemp      -1.7518946
## over65        -3.8009050
## household    -18.5149056
## ed25         -10.4032964
## housing       -1.2454238
## population     0.0103336
## nonWhite       2.7641814
## employment    -1.1527050
## lowIncome     -0.7389130
## HC            -0.1503197
## NOx            0.4941554
## SO2            0.1634661
## humidity      -0.5092659

Elasticnet

For elastic net, we set alpha=0.5.

library(glmnet)
Y <- pd_clean$mortalityRate
X <- model.matrix(mortalityRate~.,pd_clean)[,-1]
enetcv <- cv.glmnet(X, Y, alpha=0.5, nfolds=10)
enetmdl <- glmnet(X,Y,alpha=0.5, nlambda=100)
enetcoef <- coef(enetmdl, s=enetcv$lambda.min)
plot(enetmdl, xvar="lambda", lwd=2)
abline(v=log(enetcv$lambda.min), col="black",lty=2, lwd=2)

enetcoef
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.736208e+03
## precip       2.410289e+00
## janTemp     -5.051437e-01
## julyTemp    -2.927553e+00
## over65      -6.939102e+00
## household   -6.767421e+01
## ed25        -1.551675e+01
## housing     -1.972315e+00
## population   9.949473e-03
## nonWhite     3.685845e+00
## employment  -1.111314e+00
## lowIncome   -2.594403e+00
## HC          -5.111048e-01
## NOx          1.394349e+00
## SO2          3.147938e-02
## humidity    -7.045609e-01

Comparing all variable selection methods

Comparing all of the variable selection methods used so far, we have:

library(plyr)
l<-list(data.frame(t(data.frame(coef(forward_mdl)))),
        data.frame(t(data.frame(coef(backward_mdl)))),
        data.frame(t(data.frame(coef(sboth_mdl)))),
        data.frame(t(as.data.frame(as.matrix(lassocoef)))),
        data.frame(t(as.data.frame(as.matrix(ridgecoef)))),
        data.frame(t(as.data.frame(as.matrix(enetcoef)))))

varr_results <- do.call(rbind.fill, l)
row.names(varr_results) <- c('Forward','Backward','Step Both', 'Lasso','Ridge','ElasticNet')
varr_results

Residual Analysis

For goodness of fit, we can look at several aspects of our model. In this section we focus on graphical diagnostics, using the stepwise “both” model from above. We can see that several points still have a noticeable impact on the overall fit (see the influence plot and the Cook’s distance plot); however, since no Cook’s distance exceeds 1, we will not treat those points as outliers. The QQ plot and the histogram also show that the model has some issues with normality.

library(car)
## Loading required package: carData
full.resid = residuals(sboth_mdl)
cook = cooks.distance(sboth_mdl)
par(mfrow=c(2,2))
influencePlot(sboth_mdl)
plot(cook, type="h",lwd=3,col="red", ylab="Cook's Distance")

abline(0,0,col="red")
qqPlot(full.resid, ylab="Residuals", main="")
## 57 53 
## 50 46
hist(full.resid, xlab="Residuals", main="",nclass=30,col="orange")

We can look at the following plots to get a sense of how well the model conforms to the regression assumptions (again using the stepwise “both” model from above).

Specifically, we are concerned with the following assumptions:

  1. Linearity / zero-mean errors. (Residuals vs predictors; below we plot the standardized residuals against the observation index.)
  2. Normality. (QQ plot, histogram)
  3. Independence. (Residuals vs fitted values)
  4. Constant variance. (Residuals vs fitted values)
library(car)
std_resid <- rstandard(sboth_mdl)
par(mfrow=c(2,2))
# Standardized residuals against observation index (the x-axis is the row index, not a single predictor)
plot(seq_along(sboth_mdl$fitted.values), std_resid, xlab="Observation Index", ylab="Std. Residual", main="Index vs Residual (Linearity)")
abline(h=0, col="red", lty=2)
plot(sboth_mdl$fitted.values, std_resid, xlab="Fitted", ylab="Std Resid", main="Fitted Vs Residual (Const Var & Indp)")
qqPlot(std_resid, main="QQ Plot Std. Residuals (Normality)")
## 57 53 
## 50 46
hist(std_resid, main="Histogram of Std. Residuals (Normality)")

Based on the above, the first figure shows the residuals spread fairly evenly around zero across the x-axis, so linearity appears to be OK. In the fitted-vs-residuals plot, constant variance looks reasonable, and there is no reason to suspect that the observations are not independent. The last plot (and the QQ plot) shows a roughly normal shape, but not a perfect one; we would probably want to call that out in our analysis.

Goodness of fit

It is possible to do a goodness-of-fit test for our models as well. Here we look specifically at the ridge regression model fit above with the cross-validated value of lambda. This test is also called the deviance test.
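The test compares the residual deviance \(D\) to a \(\chi^2\) distribution with \(n-p\) degrees of freedom, which is exactly what the pchisq call below computes:

\[ p\text{-value} = P\left(\chi^2_{\,n-p} > D\right) \]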

# Degrees of freedom: number of observations minus the number of estimated coefficients (note the parentheses)
df = length(Y) - nrow(coef(ridgemdl))
cat("Deviance residuals test (",
    deviance(ridgemdl),
    ")  p =",1-pchisq(deviance(ridgemdl), df),
    "\n")
## Deviance residuals test ( 23579.76 )  p = 0

In this case, the p-value is very low. The hypothesis test for goodness of fit is structured as follows:
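\[ H_0: \text{the model is an adequate fit to the data} \qquad \text{vs.} \qquad H_A: \text{the model is not an adequate fit} \]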

In this case our p-value is essentially 0, so we must reject \(H_0\); it does not appear that we have a well-fitting model.

To see whether the ridge model nonetheless has explanatory power (i.e., how much of the variance in the response is explained by the predictors), we can calculate the \(R^2\) value:

cat('R2 =', ridgemdl$dev.ratio, "\n")
## R2 = 0.8559839

The above shows that approximately 86% of the variance in the response variable is explained by the predictors (dev.ratio is glmnet's fraction of null deviance explained, which plays the role of \(R^2\) here). So, while the model does not pass the goodness-of-fit test, it does seem to have explanatory power.

Conclusion

In this notebook we have looked at various variable reduction techniques and some other associated topics in multiple regression. I have worked on this document as a way to prepare for my final in Georgia Tech's ISYE 6414 course. Thanks for taking the time to read this writeup.