Why do we need another regression procedure? What is the problem with OLS?

Ordinary Least Squares (OLS) is often the most reliable regression procedure for predicting targets on unseen data, but not always. If the data contain many outliers or strong multicollinearity among the predictors, the resulting high variance in the model can become a problem. One possible unwanted result is overfitting.

What is overfitting?

When a model overfits the training data, it is unduly influenced by idiosyncrasies of the training set that don’t generalize to the test set. Examples of overfitting abound and even affect the way we make everyday assumptions. For example, assuming Johnny will be good at math because his two brothers were good at math is an overfit model.

There are many causes of overfitting, including those described above, but what they have in common is that they prioritize minimizing bias in the bias-variance trade-off. If we were modeling the weather, small shifts in precipitation and cloud cover would be taken seriously, at the risk that different randomly selected training sets would produce very different estimates. If instead we were less concerned about any particular training set and more concerned about generalizability, we would accept some bias in our training estimates in exchange for lower variance.
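
One way to make this trade-off concrete is the standard decomposition of expected test error (a textbook identity, not something derived in this post): expected test MSE = bias² + variance + irreducible error. Ridge regression accepts a little more bias in exchange for a larger reduction in variance.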

What does ridge regression do to address the problem of overfitting?

In ridge regression we use “regularization methods,” which introduce bias and reduce variance. This lower-variance model can actually improve our Mean Squared Error (MSE). In fact, ridge regression allows us to tune the trade-off between bias and variance.

In ridge regression, we shrink the size of our coefficients. We do this by adding a penalty to the loss function: the sum of the squared coefficients, multiplied by a factor (designated lambda) that controls how much the size of the coefficients matters. If lambda is zero, there is no difference between ridge regression and OLS.
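
To make that objective concrete, here is a minimal sketch in R (a hypothetical helper, not code from the example below): the ordinary residual sum of squares plus lambda times the sum of the squared slope coefficients, with the intercept left unpenalized.

# Ridge objective: RSS + lambda * sum(beta^2); the intercept is not penalized.
# beta0 = intercept, beta = slope coefficients, X = predictor matrix, y = target.
ridge_loss <- function(beta0, beta, X, y, lambda) {
  rss     <- sum((y - (beta0 + X %*% beta))^2)  # least-squares part
  penalty <- lambda * sum(beta^2)               # shrinkage penalty
  rss + penalty
}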

How does reducing the sum of the coefficients help reduce overfitting?

It’s easiest to imagine this with one predictor. Using OLS on randomly generated training sets, the slope of the line could vary considerably. We could eliminate that variability entirely by setting the coefficient to zero, giving us a horizontal line at the mean, but that wouldn’t be a very useful estimate. Somewhere in between, there is a spot where variance and bias are best balanced and MSE is minimized. Thus, the choice of lambda is very important.
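
As a quick illustration (simulated data, not part of the kangaroo example that follows), we can draw many small training sets from the same population and compare how much the OLS slope and a ridge-shrunk slope each vary. For a single centered predictor the ridge estimate has the closed form sum(x*y) / (sum(x^2) + lambda), which is what this sketch uses; the value of lambda here is arbitrary.

# Simulated one-predictor example: the ridge slope varies less across
# random training sets than the OLS slope, at the cost of some bias.
set.seed(1)
lambda <- 5
slopes <- replicate(200, {
  x  <- rnorm(10)
  y  <- 2 * x + rnorm(10, sd = 3)
  xc <- x - mean(x)
  yc <- y - mean(y)
  c(ols   = sum(xc * yc) / sum(xc^2),            # OLS slope
    ridge = sum(xc * yc) / (sum(xc^2) + lambda)) # ridge slope (closed form)
})
apply(slopes, 1, sd)  # spread of each estimate across the 200 training sets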

Let’s try an example. We’ll use the kanga dataset (kangaroo skull measurements) from the faraway package because it is notorious for multicollinearity, which ridge handles well.

First, let’s examine the variance inflation factors (VIFs) for this dataset. Many of them are very large; values above 10 are commonly taken to signal problematic multicollinearity.

library(faraway)  # kanga data
library(dplyr)
library(car)      # vif(); the faraway package also provides a vif()

data(kanga)

# Drop the categorical variables and any rows with missing values
dfKanga <- kanga %>%
  dplyr::select(-sex, -species)

dfKanga <- na.omit(dfKanga)

# Variance inflation factors from an OLS fit on all remaining predictors
vif(model <- lm(mandible.width ~ ., dfKanga))
##       basilar.length occipitonasal.length        palate.length 
##            79.550699            47.674247            46.899687 
##         palate.width         nasal.length          nasal.width 
##             4.463039            16.638152             8.317069 
##      squamosal.depth       lacrymal.width      zygomatic.width 
##             4.428905            13.482867            18.715741 
##        orbital.width       .rostral.width      occipital.depth 
##             1.821425             6.189327            11.199769 
##          crest.width      foramina.length      mandible.length 
##             3.597488             1.391683            60.787386 
##       mandible.depth         ramus.height 
##             5.850294            17.857351

Now let’s compare the coefficients from the ridge model with those from OLS:

library(broom)   # tidy()
library(glmnet)  # ridge regression via alpha = 0

# Scale everything except the target (helper from the EHData package);
# glmnet also standardizes predictors by default.
q <- EHData::EHPrepare_ScaleAllButTarget(dfKanga, "mandible.width")

# OLS fit for comparison
m1 <- lm(mandible.width ~ ., dfKanga)
df1 <- tidy(summary(m1))

# Predictor matrix and response vector for glmnet
dfKanga2 <- dfKanga %>%
    dplyr::select(-mandible.width)

y <- dfKanga$mandible.width
x <- data.matrix(dfKanga2)

# alpha = 0 selects the ridge penalty (alpha = 1 would be the lasso)
model <- glmnet(x, y, alpha = 0)

R makes it easy to find the best lambda using k-fold cross-validation:

# We find the optimal lambda by performing k-fold cross-validation
mcv <- cv.glmnet(x, y, alpha = 0)
# plot(mcv)                      # CV error as a function of log(lambda)

lambda1 <- mcv$lambda.min        # lambda with the lowest cross-validated error

# plot(model, xvar = "lambda")   # coefficient paths as lambda increases

# Refit the ridge model at the chosen lambda
m10 <- glmnet(x, y, alpha = 0, lambda = lambda1)

df2 <- tidy(coef(m10))
## Warning: 'tidy.dgCMatrix' is deprecated.
## See help("Deprecated")
## Warning: 'tidy.dgTMatrix' is deprecated.
## See help("Deprecated")
df3 <- cbind(df1, df2) %>%
  dplyr::select(term, estimate, value) %>%
  dplyr::rename("Ridge" = value, "OLS" = estimate)

knitr::kable(df3)
| term                 |        OLS |      Ridge |
|:---------------------|-----------:|-----------:|
| (Intercept)          |  8.9260103 | 16.9703770 |
| basilar.length       |  0.0205929 |  0.0107690 |
| occipitonasal.length | -0.0172223 | -0.0049375 |
| palate.length        |  0.0280587 |  0.0143111 |
| palate.width         | -0.0143602 |  0.0023186 |
| nasal.length         | -0.0456335 | -0.0347039 |
| nasal.width          | -0.0184396 | -0.0456110 |
| squamosal.depth      | -0.1914399 | -0.1084603 |
| lacrymal.width       |  0.0249381 |  0.0277130 |
| zygomatic.width      |  0.0850110 |  0.0536583 |
| orbital.width        | -0.0373439 | -0.0206664 |
| .rostral.width       | -0.0766662 | -0.0372784 |
| occipital.depth      |  0.0804264 |  0.0518912 |
| crest.width          |  0.0099180 |  0.0210279 |
| foramina.length      | -0.0087944 | -0.0287259 |
| mandible.length      |  0.0006274 |  0.0128883 |
| mandible.depth       |  0.1249120 |  0.1205759 |
| ramus.height         |  0.0519797 |  0.0488270 |

We can see that ridge regression reduces the size of many of the coefficients, though not all. Ridge shrinks the coefficients as a group: an individual coefficient can still grow as long as the overall penalized loss, which includes lambda times the sum of the squared coefficients, goes down.
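
As a quick check on that claim (using the model objects fit above), we can compare the sum of the squared slope coefficients from the two fits; the ridge value should be the smaller one.

# Sum of squared slope coefficients (intercepts excluded)
sum(coef(m1)[-1]^2)              # OLS
sum(as.vector(coef(m10))[-1]^2)  # ridge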

In sum, ridge regression “flattens out” our model by shrinking the coefficients relative to OLS. This introduces some bias, but it results in a model that generalizes better. R makes running ridge regression and selecting the best lambda easy.