Machine Learning Regression

Regression analysis in machine learning (ML) is a method used to examine the connection between independent variables and a dependent variable. This type of analysis is known as predictive modeling, in which an algorithm or method is used to predict continuous outcomes.

Here are the steps for conducting an ML regression analysis and deploying the final selected model:

  1. Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(mlbench)
  2. Load the ‘Boston Housing’ dataset from the {mlbench} package.
data(BostonHousing)
  3. View the ‘Boston Housing’ dataset
BostonHousing %>% glimpse()
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
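Optionally, a quick look at the distribution of the outcome variable ‘medv’ can help before modeling; a minimal sketch (the bin count is just an illustrative choice):

# Summary and histogram of the outcome variable (median home value in $1000s)
summary(BostonHousing$medv)
ggplot(BostonHousing, aes(x = medv)) +
  geom_histogram(bins = 30)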
  4. Split the dataset into a training set (bh_train, 75% of rows) and a testing set (bh_test, 25% of rows).
set.seed(1943)
bh_index <- createDataPartition(BostonHousing$medv, p = .75, list = FALSE)
bh_train <- BostonHousing[ bh_index, ]
bh_test <- BostonHousing[-bh_index, ]
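A quick check of the row counts confirms the roughly 75/25 split:

# Training and testing set sizes (roughly 75% and 25% of the 506 rows)
nrow(bh_train)
nrow(bh_test)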
Create ML Regression Models
  5. Create traditional linear regression model using method = “lm”.
set.seed(1943)
lm_fit <- train(medv ~ .,
                data = bh_train, 
                method = "lm")
bh_pred_lm <- predict(lm_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_lm, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1567052 0.7242759 3.5349584
# Variable importance
varImp(lm_fit)
## lm variable importance
## 
##         Overall
## rm       100.00
## lstat     92.94
## ptratio   69.64
## dis       63.49
## nox       40.75
## b         39.02
## rad       35.85
## tax       28.52
## zn        24.20
## chas1     19.72
## crim      18.84
## indus      7.14
## age        0.00
plot(varImp(lm_fit), main = "Linear Model - Variable Importance")

  6. Create generalized linear model (glm) with stepwise AIC variable selection using method = “glmStepAIC”.
set.seed(1943)
glmStepAIC_fit <- train(medv ~ .,
                data = bh_train, 
                method = "glmStepAIC")
bh_pred_glmStepAIC <- predict(glmStepAIC_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_glmStepAIC, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1200064 0.7280245 3.5068559
# Variable importance
varImp(glmStepAIC_fit)
## loess r-squared variable importance
## 
##         Overall
## lstat    100.00
## rm        84.32
## indus     81.43
## tax       64.57
## ptratio   62.81
## rad       33.57
## crim      31.85
## zn        30.88
## nox       30.66
## age       21.73
## b         19.11
## dis       15.81
## chas       0.00
plot(varImp(glmStepAIC_fit), main ="glmStepAIC Model - Variable Importance")

  7. Create random forest model using method = “rf”.
set.seed(1943)
rf_fit <- train(medv ~ .,
                data = bh_train, 
                method = "rf")
bh_pred_rf <- predict(rf_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_rf, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.3977721 0.8789298 2.3707018
# Variable importance
varImp(rf_fit)
## rf variable importance
## 
##           Overall
## rm      100.00000
## lstat    93.98045
## dis      15.16056
## crim     14.32064
## ptratio  12.85543
## nox       9.35791
## indus     6.91304
## tax       6.52502
## age       5.91583
## b         3.17990
## rad       1.84330
## chas1     0.02393
## zn        0.00000
plot(varImp(rf_fit), main ="Random Forest Model - Variable Importance")

  8. Create k-nearest neighbors model using method = “knn”.
set.seed(1943)
knn_fit <- train(medv ~ .,
                data = bh_train, 
                method = "knn")
bh_pred_knn <- predict(knn_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_knn, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 7.1314480 0.4716835 4.7869714
# Variable importance
varImp(knn_fit)
## loess r-squared variable importance
## 
##         Overall
## lstat    100.00
## rm        84.32
## indus     81.43
## tax       64.57
## ptratio   62.81
## rad       33.57
## crim      31.85
## zn        30.88
## nox       30.66
## age       21.73
## b         19.11
## dis       15.81
## chas       0.00
plot(varImp(knn_fit), main ="k-Nearest Neighbors Model - Variable Importance")

  9. Create boosted generalized linear model using method = “glmboost”.
set.seed(1943)
glmboost_fit <- train(medv ~ .,
                 data = bh_train, 
                 method = "glmboost")
bh_pred_glmboost <- predict(glmboost_fit, bh_test)

# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_glmboost, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.5215952 0.6838738 3.7642867
# Variable importance
varImp(glmboost_fit)
## glmboost variable importance
## 
##           Overall
## nox     100.00000
## rm       60.62286
## chas1    20.97142
## ptratio   9.54375
## dis       7.53362
## lstat     5.85139
## crim      0.18067
## b         0.12122
## zn        0.08720
## tax       0.01517
## indus     0.00000
## rad       0.00000
## age       0.00000
plot(varImp(glmboost_fit), main ="Boosted Generalized Linear Model - Variable Importance")

  10. Create extreme gradient-boosted decision tree using method = “xgbTree”.
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
set.seed(1943)
xgbTree_fit <- train(medv ~ .,
                data = bh_train, 
                method = "xgbTree")
bh_pred_xgbTree <- predict(xgbTree_fit, bh_test)

# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_xgbTree, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.5778010 0.8670573 2.3382918
# Variable importance
varImp(xgbTree_fit)
## xgbTree variable importance
## 
##           Overall
## lstat   100.00000
## rm       82.38238
## dis      10.68443
## ptratio   9.44992
## crim      8.11190
## nox       4.62958
## tax       3.10129
## age       2.51098
## b         1.89819
## indus     1.19425
## rad       0.81990
## zn        0.07844
## chas1     0.00000
plot(varImp(xgbTree_fit), main ="xgbTree Model - Variable Importance")

Compare all models’ RMSE and MAE

Note. When measuring model performance:

RMSE: The root mean squared error. This is the square root of the average squared difference between the model’s predictions and the actual observations. The lower the RMSE, the more closely the model’s predictions match the actual observations.

MAE: The mean absolute error. This is the average absolute difference between the predictions made by the model and the actual observations. The lower the MAE, the more closely a model can predict the actual observations.

MAE and RMSE: which metric is better? Answer: MAE, since it weights all errors equally, whereas RMSE penalizes large errors more heavily.
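As a sanity check, both metrics can be computed by hand from the test-set predictions; a minimal sketch for the linear model, which should match postResample():

# Manual RMSE and MAE for the linear model's test-set predictions
lm_errors <- bh_test$medv - bh_pred_lm    # prediction errors on the test set
c(RMSE = sqrt(mean(lm_errors^2)),         # root mean squared error
  MAE  = mean(abs(lm_errors)))            # mean absolute error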

  11. Select the model with the lowest MAE.
postResample(pred = bh_pred_lm, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1567052 0.7242759 3.5349584
postResample(pred = bh_pred_glmStepAIC, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1200064 0.7280245 3.5068559
postResample(pred = bh_pred_rf, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.3977721 0.8789298 2.3707018
postResample(pred = bh_pred_knn, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 7.1314480 0.4716835 4.7869714
postResample(pred = bh_pred_glmboost, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.5215952 0.6838738 3.7642867
postResample(pred = bh_pred_xgbTree, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.5778010 0.8670573 2.3382918

The selected model is the one with the lowest MAE, which is the xgbTree model (MAE of about 2.34).
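Optionally, the same test-set metrics can be gathered into one table sorted by MAE; a minimal sketch using {purrr} and {tibble}, which are already loaded with the tidyverse:

# Collect each model's test-set metrics into a single tibble, sorted by MAE
model_preds <- list(lm = bh_pred_lm, glmStepAIC = bh_pred_glmStepAIC,
                    rf = bh_pred_rf, knn = bh_pred_knn,
                    glmboost = bh_pred_glmboost, xgbTree = bh_pred_xgbTree)
map_dfr(model_preds,
        ~ as_tibble_row(postResample(pred = .x, obs = bh_test$medv)),
        .id = "model") %>%
  arrange(MAE)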

Deploying Selected Model
  12. View the first record of the ‘Boston Housing’ dataset.
BostonHousing[1,]
##      crim zn indus chas   nox    rm  age  dis rad tax ptratio     b lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.09   1 296    15.3 396.9  4.98   24
  13. Create a new dataset with the values you want to use to predict ‘medv’, the median value of owner-occupied homes in $1000s (from the 1970 census). For this example, I use the values from the first record of the Boston Housing dataset. Do not include the variable ‘medv’ in the home_spec dataset.
home_spec <- tibble(crim = 0.00632,  zn = 18, 
                   indus = 2.31, chas = "0",
                   nox = 0.538,  rm = 6.575, 
                   age = 65.2,   dis = 4.09,
                   rad = 1, tax = 296,
                   ptratio = 15.3, b = 396.9,
                   lstat = 4.98)
  14. Finally, to predict the median value of an owner-occupied home in Boston, pass the ‘xgbTree_fit’ model and the ‘home_spec’ dataset to the predict() function.
predict(xgbTree_fit, home_spec)
## [1] 24.42943

Note. The ‘medv’ value from the first record of the Boston Housing dataset is 24, and the value predicted by the xgbTree model is 24.4. Not bad!
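As an additional check, the same comparison can be made for several homes at once by predicting on a few rows of the test set; a minimal sketch:

# Predicted vs. actual medv for the first five homes in the test set
few_homes <- bh_test[1:5, ]
tibble(actual    = few_homes$medv,
       predicted = predict(xgbTree_fit, few_homes))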

A.M.D.G.

ite, inflammate omnia