Regression analysis in machine learning (ML) is a method used to examine the relationship between independent variables and a dependent variable. This type of analysis is a form of predictive modeling, in which an algorithm is used to predict continuous outcomes.
Here are the steps for conducting an ML regression analysis and deploying the final selected model: load the packages and data, split the data into training and test sets, train several candidate models, compare their test-set performance (RMSE, R-squared, and MAE), and deploy the best model.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(mlbench)
data(BostonHousing)
BostonHousing %>% glimpse()
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
set.seed(1943)
bh_index <- createDataPartition(BostonHousing$medv, p = .75, list = FALSE)
bh_train <- BostonHousing[ bh_index, ]
bh_test <- BostonHousing[-bh_index, ]
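By default, caret's train() estimates performance with bootstrap resampling. As an optional sketch (not used in the fits below), 10-fold cross-validation could be requested instead through trainControl():
# Optional: ask train() for 10-fold cross-validation instead of the default bootstrap
# (illustrative only; the models below keep caret's defaults)
cv_ctrl <- trainControl(method = "cv", number = 10)
# e.g. train(medv ~ ., data = bh_train, method = "lm", trControl = cv_ctrl)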
set.seed(1943)
lm_fit <- train(medv ~ .,
data = bh_train,
method = "lm")
bh_pred_lm <- predict(lm_fit, bh_test)
# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_lm, obs = bh_test$medv)
## RMSE Rsquared MAE
## 5.1567052 0.7242759 3.5349584
# Variable importance
varImp(lm_fit)
## lm variable importance
##
## Overall
## rm 100.00
## lstat 92.94
## ptratio 69.64
## dis 63.49
## nox 40.75
## b 39.02
## rad 35.85
## tax 28.52
## zn 24.20
## chas1 19.72
## crim 18.84
## indus 7.14
## age 0.00
plot(varImp(lm_fit), main = "Linear Model - Variable Importance")
set.seed(1943)
glmStepAIC_fit <- train(medv ~ .,
data = bh_train,
method = "glmStepAIC")
bh_pred_glmStepAIC <- predict(glmStepAIC_fit, bh_test)
# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_glmStepAIC, obs = bh_test$medv)
## RMSE Rsquared MAE
## 5.1200064 0.7280245 3.5068559
# Variable importance
varImp(glmStepAIC_fit)
## loess r-squared variable importance
##
## Overall
## lstat 100.00
## rm 84.32
## indus 81.43
## tax 64.57
## ptratio 62.81
## rad 33.57
## crim 31.85
## zn 30.88
## nox 30.66
## age 21.73
## b 19.11
## dis 15.81
## chas 0.00
plot(varImp(glmStepAIC_fit), main ="glmStepAIC Model - Variable Importance")
set.seed(1943)
rf_fit <- train(medv ~ .,
data = bh_train,
method = "rf")
bh_pred_rf <- predict(rf_fit, bh_test)
# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_rf, obs = bh_test$medv)
## RMSE Rsquared MAE
## 3.3977721 0.8789298 2.3707018
# Variable importance
varImp(rf_fit)
## rf variable importance
##
## Overall
## rm 100.00000
## lstat 93.98045
## dis 15.16056
## crim 14.32064
## ptratio 12.85543
## nox 9.35791
## indus 6.91304
## tax 6.52502
## age 5.91583
## b 3.17990
## rad 1.84330
## chas1 0.02393
## zn 0.00000
plot(varImp(rf_fit), main ="Random Forest Model - Variable Importance")
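caret tunes the random forest's single hyperparameter, mtry, over a small default grid. If finer control is wanted, an explicit grid can be passed to train(); the lines below are an illustrative sketch and were not run for the results above.
# Optional: tune mtry over an explicit grid (illustrative; not run above)
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8, 10))
# rf_tuned <- train(medv ~ ., data = bh_train, method = "rf", tuneGrid = rf_grid)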
set.seed(1943)
knn_fit <- train(medv ~ .,
data = bh_train,
method = "knn")
bh_pred_knn <- predict(knn_fit, bh_test)
# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_knn, obs = bh_test$medv)
## RMSE Rsquared MAE
## 7.1314480 0.4716835 4.7869714
# Variable importance
varImp(knn_fit)
## loess r-squared variable importance
##
## Overall
## lstat 100.00
## rm 84.32
## indus 81.43
## tax 64.57
## ptratio 62.81
## rad 33.57
## crim 31.85
## zn 30.88
## nox 30.66
## age 21.73
## b 19.11
## dis 15.81
## chas 0.00
plot(varImp(knn_fit), main ="k-Nearest Neighbors Model - Variable Importance")
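k-nearest neighbors is distance-based, so predictors measured on large scales (for example tax and b) can dominate the distance calculation, which may explain part of the weak performance here. Centering and scaling the predictors inside train() is a common remedy; the sketch below is illustrative and was not used for the knn_fit above.
# Optional: center and scale predictors for the distance-based kNN model
# (illustrative; knn_fit above used the raw predictors)
# knn_scaled_fit <- train(medv ~ ., data = bh_train, method = "knn",
#                         preProcess = c("center", "scale"))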
set.seed(1943)
glmboost_fit <- train(medv ~ .,
data = bh_train,
method = "glmboost")
bh_pred_glmboost <- predict(glmboost_fit, bh_test)
# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_glmboost, obs = bh_test$medv)
## RMSE Rsquared MAE
## 5.5215952 0.6838738 3.7642867
# Variable importance
varImp(glmboost_fit)
## glmboost variable importance
##
## Overall
## nox 100.00000
## rm 60.62286
## chas1 20.97142
## ptratio 9.54375
## dis 7.53362
## lstat 5.85139
## crim 0.18067
## b 0.12122
## zn 0.08720
## tax 0.01517
## indus 0.00000
## rad 0.00000
## age 0.00000
plot(varImp(glmboost_fit), main ="Boosted Generalized Linear Model - Variable Importance")
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
set.seed(1943)
xgbTree_fit <- train(medv ~ .,
data = bh_train,
method = "xgbTree")
bh_pred_xgbTree <- predict(xgbTree_fit, bh_test)
# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_xgbTree, obs = bh_test$medv)
## RMSE Rsquared MAE
## 3.5778010 0.8670573 2.3382918
# Variable importance
varImp(xgbTree_fit)
## xgbTree variable importance
##
## Overall
## lstat 100.00000
## rm 82.38238
## dis 10.68443
## ptratio 9.44992
## crim 8.11190
## nox 4.62958
## tax 3.10129
## age 2.51098
## b 1.89819
## indus 1.19425
## rad 0.81990
## zn 0.07844
## chas1 0.00000
plot(varImp(xgbTree_fit), main ="xgbTree Model - Variable Importance")
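train() tunes several xgboost hyperparameters (nrounds, max_depth, eta, and others) and refits the final model with the best combination. The winning settings can be inspected on the fitted object:
# Inspect the hyperparameter combination caret kept for the final xgbTree model
xgbTree_fit$bestTune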
Note. When measuring model performance:
RMSE: The root mean squared error. This is the square root of the average squared difference between the predictions made by the model and the actual observations. Because errors are squared before averaging, RMSE penalizes large errors more heavily. The lower the RMSE, the more closely a model can predict the actual observations.
MAE: The mean absolute error. This is the average absolute difference between the predictions made by the model and the actual observations. The lower the MAE, the more closely a model can predict the actual observations.
MAE and RMSE: which metric is better? In this analysis, MAE is used to select the final model, since it weights all errors equally and is less sensitive to outliers than RMSE.
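For reference, the RMSE and MAE reported by postResample() can be reproduced by hand; for the linear model's test-set predictions:
# RMSE: square root of the mean squared prediction error
sqrt(mean((bh_pred_lm - bh_test$medv)^2))
# MAE: mean absolute prediction error
mean(abs(bh_pred_lm - bh_test$medv))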
postResample(pred = bh_pred_lm, obs = bh_test$medv)
## RMSE Rsquared MAE
## 5.1567052 0.7242759 3.5349584
postResample(pred = bh_pred_glmStepAIC, obs = bh_test$medv)
## RMSE Rsquared MAE
## 5.1200064 0.7280245 3.5068559
postResample(pred = bh_pred_rf, obs = bh_test$medv)
## RMSE Rsquared MAE
## 3.3977721 0.8789298 2.3707018
postResample(pred = bh_pred_knn, obs = bh_test$medv)
## RMSE Rsquared MAE
## 7.1314480 0.4716835 4.7869714
postResample(pred = bh_pred_glmboost, obs = bh_test$medv)
## RMSE Rsquared MAE
## 5.5215952 0.6838738 3.7642867
postResample(pred = bh_pred_xgbTree, obs = bh_test$medv)
## RMSE Rsquared MAE
## 3.5778010 0.8670573 2.3382918
The selected model is the one with the lowest MAE: the xgbTree model (the random forest has a slightly lower RMSE, but the xgbTree model has the lowest MAE and a comparable R-squared).
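Because every fit above is a caret train object, the candidates could also be compared on their resampling results with caret's resamples() helper; this is an optional sketch and was not part of the selection above, which used the hold-out test set.
# Optional: compare the six models on their resampling results (illustrative)
model_list <- list(lm = lm_fit, glmStepAIC = glmStepAIC_fit, rf = rf_fit,
                   knn = knn_fit, glmboost = glmboost_fit, xgbTree = xgbTree_fit)
# summary(resamples(model_list))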
BostonHousing[1,]
## crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
home_spec <- tibble(crim = 0.00632, zn = 18,
indus = 2.31, chas = "0",
nox = 0.538, rm = 6.575,
age = 65.2, dis = 4.09,
rad = 1, tax = 296,
ptratio = 15.3, b = 396.9,
lstat = 4.98)
predict(xgbTree_fit, home_spec)
## [1] 24.42943
Note. The ‘medv’ value from the first record of the BostonHousing dataset is 24, and the value predicted by the xgbTree model for the same record is 24.4. Not bad!
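To deploy the selected model, the fitted train object can be saved and reloaded in the scoring environment. A minimal sketch, assuming a hypothetical file name "bh_xgbTree_fit.rds":
# Persist the final model for deployment (hypothetical file name)
saveRDS(xgbTree_fit, "bh_xgbTree_fit.rds")
# Later, in the scoring environment: reload the model and score new records
# final_model <- readRDS("bh_xgbTree_fit.rds")
# predict(final_model, home_spec)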
A.M.D.G.
ite, inflammate omnia