Ridge regression is almost identical to linear regression, except that we deliberately introduce a small amount of bias. The technique extends OLS regression by adding a penalty term, also known as L2 regularization, to the loss function. The model assumes that, after the data have been normalized, the regression coefficients should not be very large. Ridge regression works best when the model matrix is collinear and the usual least squares estimate of beta is unstable.
The ridge regression estimate is the value of beta that minimizes
\[ (y - X\beta)^T (y - X\beta) + \lambda \sum_j \beta_j^2, \quad \lambda \geq 0, \]
where
\[ \lambda \sum_j \beta_j^2 \]
is the penalty term.
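A standard way to see why the penalty stabilizes the estimate under collinearity is the closed form of the minimizer:
\[ \hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y. \]
The lambda added along the diagonal keeps the inverse well-conditioned even when the columns of X are nearly collinear, which is exactly the unstable case described above.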
Data Source: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
# Packages used: data.table for fread(), dplyr for select(), glmnet for ridge regression
library(data.table)
library(dplyr)
library(glmnet)

data <- fread("https://raw.githubusercontent.com/SpencerPao/Ridge-Lasso-ElasticNet/master/Boston_Housing.csv")
data <- na.omit(data) # drop rows with missing values

# Histograms of each numeric column in the raw data
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(data)) {
  if (is.numeric(data[[col_name]])) {
    hist(data[[col_name]], main = paste(col_name), xlab = "Value")
  }
}
par(mfrow = c(1, 1))
Next, we scale the data: for each observation, subtract the column mean and then divide the difference by the column standard deviation (z-score standardization).
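A minimal sketch of this scaling step, assuming base R's scale() and keeping the result as a data frame under the data_scaled name used below:
# z-score standardization of every column: (x - column mean) / column standard deviation
data_scaled <- as.data.frame(scale(data))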
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(data_scaled)) {
  if (is.numeric(data_scaled[[col_name]])) {
    hist(data_scaled[[col_name]], main = paste(col_name), xlab = "Value")
  }
}
par(mfrow = c(1, 1))
We create an 80-20 train-test split for our ridge regression.
# Train-Test Split
set.seed(123)
size <- floor(0.8 * nrow(data_scaled))
training_ind <- sample(seq_len(nrow(data_scaled)), size = size)
train <- data[training_ind, ]
xtrain <- train[, 1:13] |>   # columns 1-13 are the predictors
  as.matrix()
ytrain <- train |>
  select(MEDV) |>            # MEDV (column 14) is the response
  unlist() |>
  as.numeric()
test <- data[-training_ind, ]
xtest <- test[, 1:13] |>
  as.matrix()
ytest <- test[, 14] |>
  unlist() |>
  as.numeric()

# Grid of candidate lambda values (the exact grid is assumed here;
# 10,000 values, matching the lengths reported by summary() below)
lambda_array <- seq(from = 0.01, to = 100, by = 0.01)

ridgeFit <- glmnet(xtrain, ytrain, alpha = 0, lambda = lambda_array) # alpha = 0 is for ridge, alpha = 1 is for lasso regression
summary(ridgeFit)
##           Length Class     Mode
## a0         10000 -none-    numeric
## beta      130000 dgCMatrix S4
## df         10000 -none-    numeric
## dim            2 -none-    numeric
## lambda     10000 -none-    numeric
## dev.ratio  10000 -none-    numeric
## nulldev        1 -none-    numeric
## npasses        1 -none-    numeric
## jerr           1 -none-    numeric
## offset         1 -none-    logical
## call           5 -none-    call
## nobs           1 -none-    numeric
We would usually use k-fold cross-validation to identify the optimal lambda value. As lambda becomes larger, the coefficients are shrunk further toward zero.
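This shrinkage can be visualized with glmnet's built-in coefficient-path plot; a minimal sketch using the ridgeFit object fitted above:
# Plot each coefficient as a function of log(lambda); the curves flatten toward zero
# as the penalty grows
plot(ridgeFit, xvar = "lambda", label = TRUE)
The coefficients at any single lambda on the grid can be pulled out with coef(); the table below shows the fitted coefficients at one such value.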
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  20.968237154
## CRIM         -0.021122044
## ZN            0.035660868
## INDUS         0.025384084
## CHAS          2.663180760
## NOX         -10.544031733
## RM            5.226079433
## AGE          -0.011477516
## DIS          -1.232369876
## RAD           0.197673860
## TAX          -0.011370792
## PTRATIO      -0.773188583
## B             0.007912616
## LSTAT        -0.532947129
## [1] 34993.75
## [1] 9195.504
## [1] 0.7372244
## [1] 20.34404
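The unlabeled values above appear to be test-set fit statistics (for instance, 1 - 9195.504/34993.75 ≈ 0.7372, an R²-style measure); the code that produced them is not shown. A minimal sketch of how such metrics are typically computed from glmnet predictions, with the lambda value `s` assumed for illustration:
# Predict MEDV on the held-out test set; s is an illustrative penalty value,
# not necessarily the one used for the output above
ypred <- predict(ridgeFit, newx = xtest, s = 0.01)

sst <- sum((ytest - mean(ytest))^2)  # total sum of squares
sse <- sum((ypred - ytest)^2)        # residual sum of squares
rsq <- 1 - sse / sst                 # R-squared on the test set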
Ideally, a plot of actual versus predicted values should fall along a line with a slope of one.
The actual-versus-predicted relationship for this ridge regression model is roughly linear, so we can state that the model is reasonably good at predicting MEDV. In the future, we can improve the model by choosing the optimal lambda, which should yield better performance metrics; a sketch of that step follows.
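A minimal sketch of selecting lambda by k-fold cross-validation with cv.glmnet and re-evaluating on the test set; the fold count and plotting choices here are illustrative, not taken from the original analysis:
# 10-fold cross-validation over the same lambda grid; alpha = 0 keeps it ridge
set.seed(123)
cvFit <- cv.glmnet(xtrain, ytrain, alpha = 0, lambda = lambda_array, nfolds = 10)
best_lambda <- cvFit$lambda.min  # lambda with the lowest cross-validated error

# Predictions at the cross-validated lambda
ypred_best <- predict(ridgeFit, newx = xtest, s = best_lambda)

# Actual vs. predicted: points near the 45-degree line indicate a good fit
plot(ytest, ypred_best, xlab = "Actual MEDV", ylab = "Predicted MEDV")
abline(a = 0, b = 1, lty = 2)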