Next week I will be taking my final exam in ISYE 6414, Regression Analysis, at Georgia Tech as part of my MS degree program in Analytics. I have learned a lot of new material in this course and have had a chance to go much deeper on some aspects of regression that I thought I already understood pretty well. While the course covered SLR, ANOVA, MLR, Poisson regression, logistic regression, variable reduction and other techniques, this writeup will focus primarily on variable reduction.
The world is a much different place now than it was a month ago. It is my sincere hope that everyone takes social distancing and the other steps to slow the spread of the COVID-19 virus seriously. Please check out my RPubs post for some descriptive analytics on that topic.
Thanks and I wish everyone a happy and healthy remainder of 2020.
This notebook will explore using various regression techniques on a data set related to air pollution and mortality. This document includes data exploration, model fitting, analysis of outliers, variable selection, goodness of fit, analysis of assumptions and prediction.
Data are from McDonald and Schwing (1973), “Instabilities of Regression Estimates Relating Air Pollution to Mortality,” Technometrics, 15, 463-481. This data set of 15 independent variables (see list below) and a measure of mortality on 60 US metropolitan areas in 1959-1961 was used to illustrate ridge regression (the full X matrix has a huge condition number).
First, we load the data.
rm(list=ls())
pd <- read.csv('pollution_data.csv', header=TRUE)
head(pd)
We begin by looking at some plots:
# Scatterplot matrix of the predictors (drop the response, which is the last column)
plot(pd[, 1:(ncol(pd)-1)])
Which is really ugly. What we want to do is see whether we have multicollinearity going on. Let’s try that…
library(corrplot)
## corrplot 0.84 loaded
corr<-cor(pd)
corrplot(corr)
Here it does appear that we have some correlated columns, specifically HC and NOx. We also see some inverse correlations among nonWhite, over65, lowIncome and housing. It will be interesting to see whether those fall out when we do the variable reduction. Since the focus of the study is the effect of air pollution on mortality, I think we would want to force HC, NOx and SO2 into our models. If we are doing prediction, however, we may want to let those fall out.
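As a rough numeric check on the multicollinearity (and on the "huge condition number" mentioned in the data description), a minimal sketch using base R's kappa() on the scaled predictor matrix could look like the following; it assumes the response column is named mortalityRate, as in the models fit below.
# Condition number of the scaled predictor matrix; values far above ~30
# are generally taken as a sign of serious collinearity.
X_check <- as.matrix(pd[, setdiff(names(pd), "mortalityRate")])
kappa(scale(X_check), exact=TRUE)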
We will begin by fitting a simple linear model to our data to see what we are dealing with.
model1 <- lm(mortalityRate~., data=pd)
summary(model1)
##
## Call:
## lm(formula = mortalityRate ~ ., data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.068 -18.018 0.912 19.224 86.966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.764e+03 4.373e+02 4.034 0.000215 ***
## precip 1.905e+00 9.237e-01 2.063 0.045078 *
## janTemp -1.938e+00 1.108e+00 -1.748 0.087415 .
## julyTemp -3.100e+00 1.902e+00 -1.630 0.110164
## over65 -9.065e+00 8.486e+00 -1.068 0.291246
## household -1.068e+02 6.978e+01 -1.531 0.132936
## ed25 -1.716e+01 1.186e+01 -1.447 0.155095
## housing -6.511e-01 1.768e+00 -0.368 0.714397
## population 3.601e-03 4.027e-03 0.894 0.376132
## nonWhite 4.460e+00 1.327e+00 3.360 0.001618 **
## employment -1.871e-01 1.662e+00 -0.113 0.910840
## lowIncome -1.674e-01 3.227e+00 -0.052 0.958865
## HC -6.722e-01 4.910e-01 -1.369 0.177980
## NOx 1.340e+00 1.006e+00 1.333 0.189508
## SO2 8.626e-02 1.475e-01 0.585 0.561697
## humidity 1.067e-01 1.169e+00 0.091 0.927690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.93 on 44 degrees of freedom
## Multiple R-squared: 0.7649, Adjusted R-squared: 0.6847
## F-statistic: 9.542 on 15 and 44 DF, p-value: 2.193e-09
Here we have an R-squared value of 0.7649, meaning that 76.49% of the variance in the response variable is explained by the predictors. That number is rather high for a study like this. Many of the predictors in the model are not statistically significant; however, it is clear that our model has predictive value since the p-value of the F-statistic is so low.
We now turn our attention to the analysis of outliers, using Cook’s distance as the metric. A common rule of thumb flags a record as influential if its Cook’s distance exceeds 4/n; a value greater than 1 would indicate a highly influential point.
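For reference, Cook’s distance for observation \(i\) measures how much all of the fitted values change when that observation is removed. With \(p\) denoting the number of model parameters and \(\hat{\sigma}^2\) the residual variance, it can be written as:
\[ D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2} \]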
t = 4/nrow(pd)
cooks_distance <- cooks.distance(model1)
plot(cooks_distance)
outliers <- which(cooks_distance>t)
print(outliers)
## 2 6 28 29 32 37 48 59
## 2 6 28 29 32 37 48 59
As we can see from above, we don’t have any records that have a Cook’s distance greater than 1, but we do have several points that exceed the 4/n threshold. We will remove those values and see what that does to the overall R-Squared value for the model:
# Don't forget the comma: we are dropping rows, not columns!
pd_clean <- pd[-c(outliers),]
cat("\nFull dataset records: ",nrow(pd))
##
## Full dataset records: 60
cat("\nFiltered dataset records: ", nrow(pd_clean))
##
## Filtered dataset records: 52
Now we refit…
model1 <- lm(mortalityRate~., data=pd_clean)
summary(model1)
##
## Call:
## lm(formula = mortalityRate ~ ., data = pd_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.113 -14.502 0.437 8.671 52.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.869e+03 3.679e+02 5.081 1.17e-05 ***
## precip 2.426e+00 7.426e-01 3.267 0.00239 **
## janTemp -5.699e-01 8.906e-01 -0.640 0.52625
## julyTemp -3.182e+00 1.393e+00 -2.284 0.02840 *
## over65 -8.455e+00 6.513e+00 -1.298 0.20247
## household -8.408e+01 5.688e+01 -1.478 0.14803
## ed25 -1.808e+01 1.097e+01 -1.648 0.10796
## housing -2.176e+00 1.535e+00 -1.417 0.16496
## population 9.830e-03 3.553e-03 2.767 0.00888 **
## nonWhite 3.821e+00 1.108e+00 3.448 0.00146 **
## employment -9.883e-01 1.348e+00 -0.733 0.46812
## lowIncome -3.080e+00 2.743e+00 -1.123 0.26887
## HC -6.656e-01 4.850e-01 -1.372 0.17848
## NOx 1.782e+00 1.072e+00 1.663 0.10505
## SO2 -2.400e-02 1.722e-01 -0.139 0.88997
## humidity -7.087e-01 9.304e-01 -0.762 0.45118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.33 on 36 degrees of freedom
## Multiple R-squared: 0.8698, Adjusted R-squared: 0.8156
## F-statistic: 16.04 on 15 and 36 DF, p-value: 1.26e-11
Cleaning out the outliers had a positive effect on the R-squared value for this model. We now have 86.98% of the variability in mortality rate explained by the predictors.
The purpose of variable selection is to balance the tradeoff between variability and bias in the model. In the case of this dataset, we have a relatively large number of predictor variables, and a relatively small dataset. It is preferred to have less variability in our models for prediction, even if that means accepting some bias in the model.
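One way to see this tradeoff is through the expected squared prediction error at a new point, which decomposes into squared bias, variance, and irreducible noise; dropping weak predictors may add a little bias while removing more variance:
\[ E\left[\left(y_0 - \hat{y}_0\right)^2\right] = \left[\text{Bias}\left(\hat{y}_0\right)\right]^2 + \text{Var}\left(\hat{y}_0\right) + \sigma^2 \]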
There are several ways to approach variable selection, including:
- Exhaustive (all-subsets) search
- Stepwise regression (forward, backward, and both directions)
- Regularized regression (Lasso, Ridge, and Elastic Net)
We will explore each of these in the following sections.
Our model has 15 predictors. As a result, running through every possible combination would result in 2^15 = 32768 combinations. That is a lot, but not so many that we cannot make it work. Below is an attempt at that:
library(leaps)
# Remove the target variable. VERY IMPORTANT!
features <- pd_clean[,-c(16)]
out<-leaps(features,pd_clean$mortalityRate)
best.model <- which(out$Cp==min(out$Cp))
print("Complete.")
## [1] "Complete."
as.matrix(out$which)[best.model,]
## 1 2 3 4 5 6 7 8 9 A B C D
## TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## E F
## TRUE FALSE
In the above, we see that we did select variables… but we also removed two of our main columns of interest. We can use regsubsets to force those into the model as shown here:
features <- pd_clean[,-c(16)]
reg <- regsubsets(features, pd_clean$mortalityRate, force.in=c(12,13,14))
reg.summary <- summary(reg)
reg.summary
## Subset selection object
## 15 Variables (and intercept)
## Forced in Forced out
## HC FALSE FALSE
## NOx FALSE FALSE
## SO2 FALSE FALSE
## precip FALSE FALSE
## janTemp FALSE FALSE
## julyTemp FALSE FALSE
## over65 FALSE FALSE
## household FALSE FALSE
## ed25 FALSE FALSE
## housing FALSE FALSE
## population FALSE FALSE
## nonWhite TRUE FALSE
## employment TRUE FALSE
## lowIncome TRUE FALSE
## humidity FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## HC NOx SO2 precip janTemp julyTemp over65 household ed25 housing
## 4 ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" "*" "*" "*" " " "*" " " " " "*" " "
## population nonWhite employment lowIncome humidity
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " " "
## 6 ( 1 ) "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " "
If we would like, we can produce various plots showing the resulting RSS, adjusted R2, Mallows Cp (equivalent to AIC in this case) and BIC…
par(mfrow=c(2,2))
# RSS
plot(reg.summary$rss, xlab="Number of Variables", ylab="Resid. Sum of Squares", type="l")
best.rss <- which.min(reg.summary$rss)
points(best.rss, reg.summary$rss[best.rss], col="red", cex=2,pch=20)
# AdjR2
plot(reg.summary$adjr2, xlab="Number of Variables", ylab="Adjusted R2", type="l")
best.adjr2 <- which.max(reg.summary$adjr2)
points(best.adjr2, reg.summary$adjr2[best.adjr2], col="red", cex=2,pch=20)
# Note in this case that AIC and Cp are the same.
# Mallows Cp
plot(reg.summary$cp, xlab="Number of Variables", ylab="Mallows Cp", type="l")
best.cp <- which.min(reg.summary$cp)
points(best.cp, reg.summary$cp[best.cp], col="red", cex=2,pch=20)
# BIC
plot(reg.summary$bic, xlab="Number of Variables", ylab="BIC", type="l")
best.bic <- which.min(reg.summary$bic)
points(best.bic, reg.summary$bic[best.bic], col="red", cex=2,pch=20)
Here we identify the best model under each criterion. We will choose Cp.
Even though the graphs above show us only 5 selected variables, there are actually 8. The other 3 are the ones that we forced into the model.
best.cp_model <- which.min(reg.summary$cp)
coef(reg, best.cp_model)
## (Intercept) HC NOx SO2 precip
## 1144.58972229 -0.65506607 1.11554493 0.11175302 1.90636465
## julyTemp ed25 population nonWhite
## -2.57860142 -15.49961766 0.01077364 3.50381240
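For comparison, the same regsubsets object can also give us the coefficients of the BIC-best subset; a short sketch (the forced-in pollution variables remain in the model):
# Coefficients of the subset that minimizes BIC
best.bic_model <- which.min(reg.summary$bic)
coef(reg, best.bic_model)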
An alternative to the “search every submodel” technique is stepwise regression. There are three options here: forward, backward and both directions. I will start with forward:
library(MASS)
base_mdl <- lm(mortalityRate~HC+NOx+SO2, data=pd_clean)
full_mdl <- lm(mortalityRate~., data=pd_clean)
# Note k=2 uses AIC. You can use k=log(n) if you want BIC
forward_mdl <- step(base_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="forward", k=2)
(Note: Output has been hidden to reduce the size of this file.)
summary(forward_mdl)
##
## Call:
## lm(formula = mortalityRate ~ HC + NOx + SO2 + nonWhite + precip +
## population + employment + julyTemp, data = pd_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.793 -14.461 -2.555 12.972 57.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.025e+03 8.439e+01 12.145 1.73e-15 ***
## HC -4.046e-01 3.843e-01 -1.053 0.298341
## NOx 5.520e-01 8.527e-01 0.647 0.520838
## SO2 2.195e-01 1.271e-01 1.728 0.091246 .
## nonWhite 3.516e+00 6.165e-01 5.702 9.90e-07 ***
## precip 2.195e+00 5.579e-01 3.934 0.000300 ***
## population 1.115e-02 3.097e-03 3.601 0.000817 ***
## employment -2.189e+00 7.912e-01 -2.767 0.008306 **
## julyTemp -2.120e+00 1.062e+00 -1.996 0.052326 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.29 on 43 degrees of freedom
## Multiple R-squared: 0.8451, Adjusted R-squared: 0.8163
## F-statistic: 29.32 on 8 and 43 DF, p-value: 5.156e-15
Which is nice. We can see that, in fact, HC, NOx and SO2 are in the model.
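As the code comment notes, swapping k=2 for k=log(n) repeats the same search under BIC, which penalizes model size more heavily and will typically keep fewer variables. A minimal sketch (output not shown):
# Forward selection using BIC instead of AIC
n <- nrow(pd_clean)
forward_bic_mdl <- step(base_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="forward", k=log(n), trace=0)
summary(forward_bic_mdl)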
library(MASS)
base_mdl <- lm(mortalityRate~HC+NOx+SO2, data=pd_clean)
full_mdl <- lm(mortalityRate~., data=pd_clean)
# Note k=2 uses AIC. You can use k=log(n) if you want BIC
backward_mdl <- step(full_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="backward", k=2)
(Note: Output has been hidden to reduce the size of this file.)
summary(backward_mdl)
##
## Call:
## lm(formula = mortalityRate ~ precip + julyTemp + over65 + household +
## ed25 + housing + population + nonWhite + lowIncome + HC +
## NOx + SO2, data = pd_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.792 -12.753 -0.563 13.067 50.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.747e+03 3.387e+02 5.159 7.58e-06 ***
## precip 2.125e+00 6.709e-01 3.167 0.002985 **
## julyTemp -2.936e+00 1.257e+00 -2.336 0.024750 *
## over65 -7.052e+00 5.340e+00 -1.321 0.194342
## household -6.125e+01 4.915e+01 -1.246 0.220099
## ed25 -2.360e+01 7.686e+00 -3.070 0.003886 **
## housing -2.212e+00 1.421e+00 -1.557 0.127482
## population 1.038e-02 3.453e-03 3.005 0.004626 **
## nonWhite 3.795e+00 1.049e+00 3.616 0.000847 ***
## lowIncome -3.941e+00 2.336e+00 -1.687 0.099622 .
## HC -9.267e-01 4.321e-01 -2.144 0.038297 *
## NOx 2.038e+00 9.761e-01 2.088 0.043361 *
## SO2 -3.836e-02 1.551e-01 -0.247 0.806009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.98 on 39 degrees of freedom
## Multiple R-squared: 0.863, Adjusted R-squared: 0.8209
## F-statistic: 20.48 on 12 and 39 DF, p-value: 3.462e-13
Here we will use the stepwise “Both” model.
library(MASS)
# Here I add a few more variables to the starting model.
base_mdl <- lm(mortalityRate~HC+NOx+SO2, data=pd_clean)
start_mdl <- lm(mortalityRate~HC+NOx+SO2+nonWhite+ed25+housing, data=pd_clean)
full_mdl <- lm(mortalityRate~., data=pd_clean)
# Note k=2 uses AIC. You can use k=log(n) if you want BIC
sboth_mdl <- step(start_mdl, scope=list(lower=base_mdl, upper=full_mdl), direction="both", k=2)
(Note: Output has been hidden to reduce the size of this file.)
summary(sboth_mdl)
##
## Call:
## lm(formula = mortalityRate ~ HC + NOx + SO2 + nonWhite + ed25 +
## population + precip + julyTemp, data = pd_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.282 -14.566 0.067 10.929 60.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1144.58972 103.44812 11.064 3.67e-14 ***
## HC -0.65507 0.37222 -1.760 0.08554 .
## NOx 1.11555 0.84896 1.314 0.19581
## SO2 0.11175 0.13276 0.842 0.40458
## nonWhite 3.50381 0.61393 5.707 9.73e-07 ***
## ed25 -15.49962 5.44603 -2.846 0.00676 **
## population 0.01077 0.00308 3.498 0.00110 **
## precip 1.90637 0.57483 3.316 0.00186 **
## julyTemp -2.57860 1.04017 -2.479 0.01717 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.18 on 43 degrees of freedom
## Multiple R-squared: 0.8464, Adjusted R-squared: 0.8179
## F-statistic: 29.63 on 8 and 43 DF, p-value: 4.295e-15
Let’s compare what we got in the three methods:
library(plyr)
l<-list(data.frame(t(data.frame(coef(forward_mdl)))),
data.frame(t(data.frame(coef(backward_mdl)))),
data.frame(t(data.frame(coef(sboth_mdl)))))
stepwise_results <- do.call(rbind.fill, l)
row.names(stepwise_results) <- c('Forward','Backward','Both')
stepwise_results
In the above, we can see that the forward, backward and both models all return different results. (Yay!)
Regularized regression also comes in three flavors: Lasso, Ridge regression and Elastic Net. In regularized regression, we are interested in balancing the tradeoff between the sum of squared errors and the complexity of the model. In general, the objective looks like this:
\[ \sum_{i=1}^n\left(Y_i-(\beta_0+\beta_1x_{i,1}+...+\beta_px_{i,p})\right)^2+\lambda \cdot Penalty(\beta_1,...,\beta_p) \] We have three flavors of \(Penalty\) to choose from:
- L0: This is the number of nonzero regression coefficients. It is essentially the same thing as searching through all the submodels.
- L1: This is known as the “sparsity” penalty. It is the sum of the absolute values of the regression coefficients. It is sparse in the sense that it allows some regression coefficients to be exactly zero, which means it actually discards variables from the model. It also means that there is no closed form for minimizing the function above.
- L2: This is the sum of the squared regression coefficients. It accounts for multicollinearity, but does not do variable reduction (it allows coefficients to be close to zero, but not exactly zero). This penalty does have a closed-form solution.
As mentioned above, if we choose L0, we have the submodel search described earlier. If we choose L1, that is known as Lasso regression. If we choose L2, that is known as Ridge regression. And if we include both L1 and L2 with a balancing term \(\alpha\), that is known as Elastic Net regression. We will explore each of these below.
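Written out explicitly, the two penalties (and the combined form used by glmnet, with \(\alpha\) as the balancing term) are:
\[ L_1: \; \lambda\sum_{j=1}^{p}|\beta_j| \qquad L_2: \; \lambda\sum_{j=1}^{p}\beta_j^2 \qquad \text{Elastic Net: } \; \lambda\left[\alpha\sum_{j=1}^{p}|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\right] \]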
One additional note: when doing Lasso or Ridge, we must decide on a \(\lambda\) to use. This value is held fixed during fitting, but we do need to pick it. We typically use cross-validation to find the best lambda (the one that minimizes the estimated prediction error).
For lasso, we set alpha=1.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
Y <- pd_clean$mortalityRate
X <- model.matrix(mortalityRate~.,pd_clean)[,-1]
lassocv <- cv.glmnet(X, Y, alpha=1, nfolds=10)
lassomdl <- glmnet(X,Y,alpha=1, nlambda=100)
lassocoef <- coef(lassomdl, s=lassocv$lambda.min)
plot(lassomdl, xvar="lambda", lwd=2)
abline(v=log(lassocv$lambda.min), col="black",lty=2, lwd=2)
lassocoef
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1050.76260183
## precip 2.15764730
## janTemp -0.28356501
## julyTemp -1.14934417
## over65 .
## household .
## ed25 -6.98037734
## housing -0.42462068
## population 0.01012992
## nonWhite 2.98761241
## employment -1.31581902
## lowIncome .
## HC .
## NOx .
## SO2 0.23200129
## humidity -0.14796521
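The coefficients above are evaluated at lambda.min, the \(\lambda\) that minimizes the cross-validated error. cv.glmnet also reports lambda.1se, the largest \(\lambda\) within one standard error of that minimum, which gives a sparser, more conservative model; a quick sketch of the comparison:
# Compare the CV-optimal lambda with the more conservative one-standard-error choice
c(lambda.min=lassocv$lambda.min, lambda.1se=lassocv$lambda.1se)
coef(lassomdl, s=lassocv$lambda.1se)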
For ridge, we set alpha=0.
library(glmnet)
library(Compositional)
Y <- pd_clean$mortalityRate
X <- model.matrix(mortalityRate~.,pd_clean)[,-1]
ridgecv <- cv.glmnet(X, Y, alpha=0, nfolds=10)
ridgemdl <- glmnet(X,Y,alpha=0, lambda = ridgecv$lambda.min)
ridgecoef <- coef(ridgemdl, s=ridgecv$lambda.min)
ridge.plot(Y,X,lambda=seq(0,5,by=0.1))
abline(v=log(ridgecv$lambda.min), col="black",lty=2, lwd=2)
ridgecoef
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1319.1386266
## precip 2.2872880
## janTemp -0.4380676
## julyTemp -1.7518946
## over65 -3.8009050
## household -18.5149056
## ed25 -10.4032964
## housing -1.2454238
## population 0.0103336
## nonWhite 2.7641814
## employment -1.1527050
## lowIncome -0.7389130
## HC -0.1503197
## NOx 0.4941554
## SO2 0.1634661
## humidity -0.5092659
For elasticnet, we set alpha=0.5
library(glmnet)
Y <- pd_clean$mortalityRate
X <- model.matrix(mortalityRate~.,pd_clean)[,-1]
enetcv <- cv.glmnet(X, Y, alpha=0.5, nfolds=10)
enetmdl <- glmnet(X,Y,alpha=0.5, nlambda=100)
enetcoef <- coef(enetmdl, s=enetcv$lambda.min)
plot(enetmdl, xvar="lambda", lwd=2)
abline(v=log(enetcv$lambda.min), col="black",lty=2, lwd=2)
enetcoef
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.736208e+03
## precip 2.410289e+00
## janTemp -5.051437e-01
## julyTemp -2.927553e+00
## over65 -6.939102e+00
## household -6.767421e+01
## ed25 -1.551675e+01
## housing -1.972315e+00
## population 9.949473e-03
## nonWhite 3.685845e+00
## employment -1.111314e+00
## lowIncome -2.594403e+00
## HC -5.111048e-01
## NOx 1.394349e+00
## SO2 3.147938e-02
## humidity -7.045609e-01
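The choice of alpha=0.5 here is somewhat arbitrary. If we wanted to tune \(\alpha\) as well, one option (sketched below, not part of the original analysis) is to loop over a grid of \(\alpha\) values with a fixed fold assignment so the cross-validation errors are comparable:
# Sketch: choose alpha by cross-validation using a common fold assignment
set.seed(6414)
foldid <- sample(rep(1:10, length.out=nrow(X)))
alphas <- seq(0, 1, by=0.1)
cv_errors <- sapply(alphas, function(a) min(cv.glmnet(X, Y, alpha=a, foldid=foldid)$cvm))
best_alpha <- alphas[which.min(cv_errors)]
best_alpha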
Comparing all the variable selection methods used so far, we have:
library(plyr)
l<-list(data.frame(t(data.frame(coef(forward_mdl)))),
data.frame(t(data.frame(coef(backward_mdl)))),
data.frame(t(data.frame(coef(sboth_mdl)))),
data.frame(t(as.data.frame(as.matrix(lassocoef)))),
data.frame(t(as.data.frame(as.matrix(ridgecoef)))),
data.frame(t(as.data.frame(as.matrix(enetcoef)))))
varr_results <- do.call(rbind.fill, l)
row.names(varr_results) <- c('Forward','Backward','Step Both', 'Lasso','Ridge','ElasticNet')
varr_results
For goodness of fit, we can look at several aspects of our model. In this section, we focus on the diagnostic plots for the model. We can see that several of the points in our data do have an impact on the overall fit (see the influence plot and the Cook’s distance plot). However, since the Cook’s distances are all less than 1, we won’t really consider those points to be outliers. The QQ plot and the histogram show that our model does have some issues with normality as well. For this normal regression problem, we are concerned with the standard regression assumptions, which are listed and checked in the next section.
library(car)
## Loading required package: carData
full.resid = residuals(sboth_mdl)
cook = cooks.distance(sboth_mdl)
par(mfrow=c(2,2))
influencePlot(sboth_mdl)
plot(cook, type="h",lwd=3,col="red", ylab="Cook's Distance")
abline(0,0,col="red")
qqPlot(full.resid, ylab="Residuals", main="")
## 57 53
## 50 46
hist(full.resid, xlab="Residuals", main="",nclass=30,col="orange")
We can look specifically at the following plots to get a sense of how the model conforms to the model assumptions. (Again, in this case, we will use the stepwise “both” model from above.)
Specifically, we are concerned with the following assumptions:
- Linearity of the mean response in the predictors
- Constant variance of the errors
- Independence of the errors
- Normality of the errors
library(car)
std_resid <- rstandard(sboth_mdl)
par(mfrow=c(2,2))
plot(seq(1,length(sboth_mdl$fitted.values)), std_resid, xlab="Predictor", ylab="Std. Residual", main="Predictor vs Residual (Linearity)")
abline(h=0, col="red", lty=2)
plot(sboth_mdl$fitted.values, std_resid, xlab="Fitted", ylab="Std Resid", main="Fitted Vs Residual (Const Var & Indp)")
qqPlot(std_resid, main="QQ Plot Std. Residuals (Normality)")
## 57 53
## 50 46
hist(std_resid, main="Histogram of Std. Residuals (Normality)")
Based on the above, the first plot does show the data spread across the x-axis, so linearity appears to be OK. In the fitted vs. residuals plot, constant variance looks OK, and there is no reason to suspect that the data are not independent. The last plot shows a generally normal shape, but not a perfect one. We would probably want to call that out in our analysis.
It is possible to do a goodness-of-fit test for our models as well. In this case, we look specifically at the model from the Ridge regression above, fit at the value of lambda chosen by cross-validation. This test is also called the deviance test.
# Degrees of freedom: number of observations minus number of estimated coefficients
df = length(Y) - nrow(coef(ridgemdl))
cat("Deviance residuals test (",
deviance(ridgemdl),
") p =",1-pchisq(deviance(ridgemdl), df),
"\n")
## Deviance residuals test ( 23579.76 ) p = 0
In this case, the p-value is very low. The hypothesis test for goodness of fit is structured as follows: the null hypothesis \(H_0\) is that the model is a good fit to the data, and the alternative \(H_A\) is that it is not.
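Concretely, the residual deviance \(D\) is compared to a chi-squared distribution with \(n-p\) degrees of freedom, and the p-value is:
\[ p\text{-value} = P\left(\chi^2_{n-p} > D\right) \]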
In this case, our p-value, which is essentially 0, means that we must reject \(H_0\). Therefore it does not appear that we have a good fitting model.
To see whether the Ridge model has predictive power (that is, how much of the variance in the response is explained by the predictors), we can calculate the R2 value:
cat('R2 =', ridgemdl$dev.ratio, "\n")
## R2 = 0.8559839
The above shows that approximately 86% of the variance in the response variable is explained by the predictors. So, while the model does not pass the formal goodness-of-fit test, it does seem to have predictive power.
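As a sanity check (a sketch, not part of the original analysis), we can compute the same quantity by hand from the model’s predictions using \(R^2 = 1 - SSE/SST\):
# Manual R-squared from the ridge model's fitted values
yhat <- predict(ridgemdl, newx=X, s=ridgecv$lambda.min)
sse <- sum((Y - yhat)^2)
sst <- sum((Y - mean(Y))^2)
1 - sse/sst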
In this notebook, we have looked at various variable reduction techniques, along with some other associated topics in multiple regression. I worked on this document as a way to prepare for the final in Georgia Tech’s ISYE 6414 course. Thanks for taking the time to read this writeup.