What is Ridge Regression?

Ridge regression is almost identical to linear regression, except that we introduce a small amount of bias. The technique extends OLS regression by adding a penalty term, also called L2 regularization, to the loss function. The model assumes that, after normalization of the data, the regression coefficients should not be very large. Ridge regression works best when the model matrix is collinear and the usual least squares estimate of beta is unstable.

The ridge regression estimate chooses the beta that minimizes

\[ (y - X\beta)^T (y - X\beta) + \lambda \sum_j \beta_j^2, \quad \lambda \geq 0, \]

where

\[ \lambda \sum_j \beta_j^2 \]

is the penalty term and the tuning parameter lambda controls the amount of shrinkage.
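For a fixed lambda, this objective has a closed-form minimizer (assuming the intercept is handled by centering X and y):

\[ \hat{\beta} = (X^T X + \lambda I)^{-1} X^T y \]

Here is a minimal base R sketch of that formula (toy data; ridge_beta is an illustrative helper, separate from the glmnet workflow used below):

ridge_beta <- function(X, y, lambda) {
    p <- ncol(X)
    # (X'X + lambda * I)^{-1} X'y
    solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}

set.seed(1)
X <- scale(matrix(rnorm(300), 100, 3), scale = FALSE)  # centered toy predictors
y <- as.numeric(X %*% c(1, -2, 0.5) + rnorm(100))      # toy response
ridge_beta(X, y - mean(y), lambda = 1)                 # shrunken coefficients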

Applied Ridge Regression Example

require(dplyr)
require(data.table)

Data source: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

data <- fread("https://raw.githubusercontent.com/SpencerPao/Ridge-Lasso-ElasticNet/master/Boston_Housing.csv")

data <- na.omit(data)  # drop rows with missing values
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

# Histogram of each numeric column to inspect its distribution
for (col_name in names(data)) {
    if (is.numeric(data[[col_name]])) {
        hist(data[[col_name]], main = paste(col_name), xlab = "Value")
    }
}

par(mfrow = c(1, 1))

data_scaled <- as.data.frame(cbind(scale(data[, 1:13]), MEDV = data$MEDV))  # scale the 13 predictors; keep the response MEDV (column 14) unscaled

Scaled data: take each observation, subtract the column mean from it, then divide the difference by the column standard deviation.
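As a quick check (a toy vector, not part of the dataset), the manual computation matches R's scale():

x <- c(2, 4, 6, 8)                       # toy example vector
manual <- (x - mean(x)) / sd(x)          # subtract the mean, divide by the sd
all.equal(manual, as.numeric(scale(x)))  # TRUE: identical to scale()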

par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(data_scaled)) {
    if (is.numeric(data_scaled[[col_name]])) {
        hist(data_scaled[[col_name]], main = paste(col_name), xlab = "Value")
    }
}

par(mfrow = c(1, 1))

We create an 80-20 train-test split for our ridge regression.

# Train-Test Split
set.seed(123)
size <- floor(0.8 * nrow(data))  # 80% of the rows go to training

training_ind <- sample(seq_len(nrow(data_scaled)), size = size)

# We split the original (unscaled) data; glmnet standardizes the predictors internally.
train <- data[training_ind, ]
xtrain <- train[, 1:13] |>
    as.matrix()
ytrain <- train |>
    select(MEDV) |>
    unlist() |>
    as.numeric()


test <- data[-training_ind, ]
xtest <- test[, 1:13] |>
    as.matrix()
ytest <- test |>
    select(MEDV) |>
    unlist() |>
    as.numeric()
lambda_array <- seq(from = 0.01, to = 100, by = 0.01)  # grid of candidate lambda values
require(glmnet)
ridgeFit <- glmnet(xtrain, ytrain, alpha = 0, lambda = lambda_array)  # alpha = 0 fits ridge; alpha = 1 fits lasso. glmnet standardizes the predictors internally by default.
summary(ridgeFit)
##           Length Class     Mode   
## a0         10000 -none-    numeric
## beta      130000 dgCMatrix S4     
## df         10000 -none-    numeric
## dim            2 -none-    numeric
## lambda     10000 -none-    numeric
## dev.ratio  10000 -none-    numeric
## nulldev        1 -none-    numeric
## npasses        1 -none-    numeric
## jerr           1 -none-    numeric
## offset         1 -none-    logical
## call           5 -none-    call   
## nobs           1 -none-    numeric

In practice, we would usually use k-fold cross-validation to identify the optimal lambda value.
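A minimal sketch of how that could look here, using glmnet's built-in cv.glmnet (10 folds by default) and letting it choose its own lambda path:

cvFit <- cv.glmnet(xtrain, ytrain, alpha = 0)  # 10-fold CV for ridge
cvFit$lambda.min  # lambda with the lowest mean cross-validated error
cvFit$lambda.1se  # largest lambda within one standard error of that minimum
plot(cvFit)       # CV error as a function of log(lambda)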

As lambda becomes larger, the magnitudes of the coefficients shrink toward zero (although, unlike the lasso, ridge does not set them exactly to zero).

plot(ridgeFit, xvar = "lambda", label = TRUE)

plot(ridgeFit, xvar = "dev", label = TRUE)

# Predictions at the smallest lambda in the grid (the least-penalized fit)
y_predicted <- predict(ridgeFit, s = min(lambda_array), newx = xtest)
# Coefficients at the same lambda (newx is not needed when requesting coefficients)
predict(ridgeFit, s = min(lambda_array), type = "coefficients")
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  20.968237154
## CRIM         -0.021122044
## ZN            0.035660868
## INDUS         0.025384084
## CHAS          2.663180760
## NOX         -10.544031733
## RM            5.226079433
## AGE          -0.011477516
## DIS          -1.232369876
## RAD           0.197673860
## TAX          -0.011370792
## PTRATIO      -0.773188583
## B             0.007912616
## LSTAT        -0.532947129

SST and SSE
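For reference, the quantities computed below are

\[ SST = \sum_i (y_i - \bar{y})^2, \qquad SSE = \sum_i (\hat{y}_i - y_i)^2, \qquad R^2 = 1 - \frac{SSE}{SST} \]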

sst <- sum((ytest - mean(ytest))^2)
sst
## [1] 34993.75
sse <- sum((y_predicted - ytest)^2)
sse
## [1] 9195.504
r_squared <- 1 - (sse/sst)
r_squared
## [1] 0.7372244
mse <- mean((y_predicted - ytest)^2)
mse
## [1] 20.34404
plot(ytest, y_predicted, main = "Predicted Price vs Actual Price (MEDV)")

Ideally, the points would fall along a straight line with a slope of 1, i.e., predicted values equal to actual values.

The actual-versus-predicted relationship is roughly linear, so the model does a reasonable job of predicting MEDV. In the future, we can improve the model by choosing the optimal lambda (for example, via the cross-validation described above), which should yield better performance metrics.