Why do we need another regression procedure? What is the problem with OLS?
Ordinary Least Squares (OLS) is often the most reliable regression procedure for predicting targets on unseen data, but not always. If the data contain many outliers or strong multicollinearity, the high variance these factors introduce into the model can become a problem. One possible unwanted result is overfitting.
What is overfitting?
When a model overfits the training data, it is unduly influenced by idiosyncrasies of the training set that don’t generalize to the test set. Examples of overfitting abound and even affect the way we make everyday assumptions. For example, assuming Johnny will be good at math because his two brothers were good at math is an overfit model.
There are many causes of overfitting, including those described above, but what they have in common is that they prioritize minimizing bias in the bias-variance trade-off. If we were modeling the weather, small shifts in precipitation and cloud cover would be taken seriously, at the risk that different randomly selected training sets would produce very different estimates. If instead we were less concerned with fitting any particular training set and more concerned with generalizability, we would accept some bias in our training estimates in exchange for lower variance.
What does ridge regression do to address the problem of overfitting?
In ridge regression we use a “regularization method” that introduces bias and reduces variance. This lower-variance model can actually improve our Mean Squared Error (MSE) on unseen data. In fact, ridge regression allows us to search for the optimal trade-off between bias and variance.
In ridge regression, we shrink the size of our coefficients. We do this by adding a penalty to the loss function: the sum of the squared coefficients, multiplied by a factor (designated lambda) that controls how much the size of the coefficients matters. If lambda is zero, there is no difference between ridge regression and OLS.
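For reference (this is the standard textbook form, not taken from the original example), the ridge estimate minimizes the usual squared-error loss plus the penalty, with the intercept conventionally left unpenalized:

$$\hat{\beta}^{\,ridge} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2$$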
How does shrinking the coefficients help reduce overfitting?
It’s easiest to imagine this with one predictor. Using OLS on randomly generated training sets, the slope of the line could vary considerably. We could eliminate that variability entirely by setting the coefficient to zero, giving us a horizontal line at the mean, but this wouldn’t be a very useful estimate. Somewhere between the two there is a spot where variance and bias are best balanced and MSE is minimized. Thus, the choice of lambda is very important.
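Here is a minimal sketch of that idea (not part of the original example; the sample size, noise level, and lambda value are arbitrary choices). With a single centered predictor, the ridge slope has the closed form sum(x*y) / (sum(x^2) + lambda), which reduces to the OLS slope when lambda is zero, so we can compare how much each slope varies across random training sets:

set.seed(42)
one_split <- function(lambda, n = 20) {
  x <- rnorm(n)
  y <- 2 * x + rnorm(n, sd = 3)   # true slope is 2, with noisy data
  x <- x - mean(x)
  y <- y - mean(y)                # center so no intercept is needed
  c(OLS   = sum(x * y) / sum(x^2),
    Ridge = sum(x * y) / (sum(x^2) + lambda))
}
slopes <- replicate(500, one_split(lambda = 20))
apply(slopes, 1, var)    # the ridge slope varies much less across training sets...
apply(slopes, 1, mean)   # ...but is biased toward zero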
Let’s try an example. We’ll use the kanga dataset from the faraway package because it is notorious for multicollinearity, which ridge handles well.
First, let’s examine the variance inflation factors (VIFs) for this dataset. We see how large they are (values above 10 are commonly taken as a sign of serious multicollinearity).
library(faraway)   # kanga data
library(dplyr)
library(car)       # vif() as used here; the faraway package also provides a vif()

data(kanga)
dfKanga <- kanga %>%
  dplyr::select(-sex, -species)
dfKanga <- na.omit(dfKanga)

model <- lm(mandible.width ~ ., dfKanga)
vif(model)
## basilar.length occipitonasal.length palate.length
## 79.550699 47.674247 46.899687
## palate.width nasal.length nasal.width
## 4.463039 16.638152 8.317069
## squamosal.depth lacrymal.width zygomatic.width
## 4.428905 13.482867 18.715741
## orbital.width .rostral.width occipital.depth
## 1.821425 6.189327 11.199769
## crest.width foramina.length mandible.length
## 3.597488 1.391683 60.787386
## mandible.depth ramus.height
## 5.850294 17.857351
Now let’s compare the coefficients from the ridge model with OLS:
library(glmnet)   # ridge regression via alpha = 0
library(broom)    # tidy() for model output

# Scale all predictors except the target (EHData helper);
# note that glmnet also standardizes predictors internally by default
q <- EHData::EHPrepare_ScaleAllButTarget(dfKanga, "mandible.width")
# OLS fit for comparison
m1 <- lm(mandible.width ~ ., dfKanga)
df1 <- tidy(summary(m1))
# Predictor matrix and response for glmnet
dfKanga2 <- dfKanga %>%
  dplyr::select(-mandible.width)
y <- dfKanga$mandible.width
x <- data.matrix(dfKanga2)
# alpha = 0 gives the ridge penalty (alpha = 1 would be the lasso)
model <- glmnet(x, y, alpha = 0)
R makes it easy to find the best lambda using k-fold cross-validation:
# We find the optimal lambda by performing k-fold cross-validation:
mcv <- cv.glmnet(x, y, alpha = 0)   # 10-fold cross-validation by default
#plot(mcv)
lambda1 <- mcv$lambda.min           # lambda with the lowest cross-validated error
#plot(model, xvar = "lambda")
m10 <- glmnet(x, y, alpha = 0, lambda = lambda1)   # refit ridge at the chosen lambda
df2 <- tidy(coef(m10))
## Warning: 'tidy.dgCMatrix' is deprecated.
## See help("Deprecated")
## Warning: 'tidy.dgTMatrix' is deprecated.
## See help("Deprecated")
df3 <- cbind(df1, df2) %>%
dplyr::select(term, estimate, value) %>%
dplyr::rename("Ridge" = value, "OLS" = estimate)
knitr::kable(df3)
| term | OLS | Ridge |
|---|---|---|
| (Intercept) | 8.9260103 | 16.9703770 |
| basilar.length | 0.0205929 | 0.0107690 |
| occipitonasal.length | -0.0172223 | -0.0049375 |
| palate.length | 0.0280587 | 0.0143111 |
| palate.width | -0.0143602 | 0.0023186 |
| nasal.length | -0.0456335 | -0.0347039 |
| nasal.width | -0.0184396 | -0.0456110 |
| squamosal.depth | -0.1914399 | -0.1084603 |
| lacrymal.width | 0.0249381 | 0.0277130 |
| zygomatic.width | 0.0850110 | 0.0536583 |
| orbital.width | -0.0373439 | -0.0206664 |
| .rostral.width | -0.0766662 | -0.0372784 |
| occipital.depth | 0.0804264 | 0.0518912 |
| crest.width | 0.0099180 | 0.0210279 |
| foramina.length | -0.0087944 | -0.0287259 |
| mandible.length | 0.0006274 | 0.0128883 |
| mandible.depth | 0.1249120 | 0.1205759 |
| ramus.height | 0.0519797 | 0.0488270 |
We can see that ridge regression shrinks many of the coefficients, though not all. Because the penalty applies to the sum of the squared coefficients as a whole, an individual coefficient can increase as long as the overall penalized loss goes down; this is especially common when predictors are highly correlated.
In sum, ridge regression “flattens out” our model by shrinking the coefficients relative to OLS. This may introduce some bias, but it results in a model that generalizes better. R makes running ridge regression and selecting the best lambda easy.
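As a rough check of that claim (this snippet is not from the original analysis, and the fold assignment is random, so exact numbers will vary), we can compare the held-out prediction error of OLS and of the ridge fit at lambda.min on the same folds:

set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dfKanga)))
cv_mse <- sapply(1:k, function(f) {
  train <- dfKanga[folds != f, ]
  test  <- dfKanga[folds == f, ]
  xtr <- data.matrix(train[, names(train) != "mandible.width"])
  xte <- data.matrix(test[, names(test) != "mandible.width"])
  ols   <- lm(mandible.width ~ ., train)
  # note: lambda1 was tuned on the full data, so this is only a rough comparison
  ridge <- glmnet(xtr, train$mandible.width, alpha = 0, lambda = lambda1)
  c(OLS   = mean((predict(ols, test) - test$mandible.width)^2),
    Ridge = mean((predict(ridge, xte) - test$mandible.width)^2))
})
rowMeans(cv_mse)   # average held-out MSE for each model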