Machine Learning Regression

Regression analysis in machine learning (ML) is a method used to examine the connection between independent variables and a dependent variable. This type of analysis is known as predictive modeling, in which an algorithm or method is used to predict continuous outcomes.

Here are the steps for conducting an ML regression analysis and deploying the final selected model:

  1. Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(mlbench)
  2. Load the ‘Boston Housing’ dataset from the {mlbench} package.
data(BostonHousing)
  3. View the ‘Boston Housing’ dataset
BostonHousing %>% glimpse()
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
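Optionally, a quick look at the distribution of the outcome variable ‘medv’ can help before modeling; a minimal sketch (the bin count is just an illustrative choice):

# Summary and histogram of the outcome variable (median home value in $1000s)
summary(BostonHousing$medv)
ggplot(BostonHousing, aes(x = medv)) +
  geom_histogram(bins = 30)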
  4. Split the dataset into a training set (bh_train, 75% of rows) and a testing set (bh_test, 25% of rows).
set.seed(1943)
bh_index <- createDataPartition(BostonHousing$medv, p = .75, list = FALSE)
bh_train <- BostonHousing[ bh_index, ]
bh_test <- BostonHousing[-bh_index, ]
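A quick check of the row counts confirms the roughly 75/25 split:

# Training and testing set sizes (roughly 75% and 25% of the 506 rows)
nrow(bh_train)
nrow(bh_test)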
Create ML Regression Models
  5. Create traditional linear regression model using method = “lm”.
set.seed(1943)
lm_fit <- train(medv ~ .,
                data = bh_train, 
                method = "lm")
bh_pred_lm <- predict(lm_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_lm, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1567052 0.7242759 3.5349584
# Variable importance
varImp(lm_fit)
## lm variable importance
## 
##         Overall
## rm       100.00
## lstat     92.94
## ptratio   69.64
## dis       63.49
## nox       40.75
## b         39.02
## rad       35.85
## tax       28.52
## zn        24.20
## chas1     19.72
## crim      18.84
## indus      7.14
## age        0.00
plot(varImp(lm_fit), main = "Linear Model - Variable Importance")

  6. Create generalized linear model (glm) with stepwise AIC variable selection using method = “glmStepAIC”.
set.seed(1943)
glmStepAIC_fit <- train(medv ~ .,
                data = bh_train, 
                method = "glmStepAIC")
bh_pred_glmStepAIC <- predict(glmStepAIC_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_glmStepAIC, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1200064 0.7280245 3.5068559
# Variable importance
varImp(glmStepAIC_fit)
## loess r-squared variable importance
## 
##         Overall
## lstat    100.00
## rm        84.32
## indus     81.43
## tax       64.57
## ptratio   62.81
## rad       33.57
## crim      31.85
## zn        30.88
## nox       30.66
## age       21.73
## b         19.11
## dis       15.81
## chas       0.00
plot(varImp(glmStepAIC_fit), main ="glmStepAIC Model - Variable Importance")

  7. Create random forest model using method = “rf”.
set.seed(1943)
rf_fit <- train(medv ~ .,
                data = bh_train, 
                method = "rf")
bh_pred_rf <- predict(rf_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_rf, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.3977721 0.8789298 2.3707018
# Variable importance
varImp(rf_fit)
## rf variable importance
## 
##           Overall
## rm      100.00000
## lstat    93.98045
## dis      15.16056
## crim     14.32064
## ptratio  12.85543
## nox       9.35791
## indus     6.91304
## tax       6.52502
## age       5.91583
## b         3.17990
## rad       1.84330
## chas1     0.02393
## zn        0.00000
plot(varImp(rf_fit), main ="Random Forest Model - Variable Importance")

  8. Create k-nearest neighbors model using method = “knn”.
set.seed(1943)
knn_fit <- train(medv ~ .,
                data = bh_train, 
                method = "knn")
bh_pred_knn <- predict(knn_fit, bh_test)

# View model RMSE, Rsquared and MAE values 
postResample(pred = bh_pred_knn, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 7.1314480 0.4716835 4.7869714
# Variable importance
varImp(knn_fit)
## loess r-squared variable importance
## 
##         Overall
## lstat    100.00
## rm        84.32
## indus     81.43
## tax       64.57
## ptratio   62.81
## rad       33.57
## crim      31.85
## zn        30.88
## nox       30.66
## age       21.73
## b         19.11
## dis       15.81
## chas       0.00
plot(varImp(knn_fit), main ="k-Nearest Neighbors Model - Variable Importance")

  9. Create boosted generalized linear model using method = “glmboost”.
set.seed(1943)
glmboost_fit <- train(medv ~ .,
                 data = bh_train, 
                 method = "glmboost")
bh_pred_glmboost <- predict(glmboost_fit, bh_test)

# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_glmboost, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.5215952 0.6838738 3.7642867
# Variable importance
varImp(glmboost_fit)
## glmboost variable importance
## 
##           Overall
## nox     100.00000
## rm       60.62286
## chas1    20.97142
## ptratio   9.54375
## dis       7.53362
## lstat     5.85139
## crim      0.18067
## b         0.12122
## zn        0.08720
## tax       0.01517
## indus     0.00000
## rad       0.00000
## age       0.00000
plot(varImp(glmboost_fit), main ="Boosted Generalized Linear Model - Variable Importance")

  10. Create extreme gradient-boosted decision tree using method = “xgbTree”.
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
set.seed(1943)
xgbTree_fit <- train(medv ~ .,
                data = bh_train, 
                method = "xgbTree")
bh_pred_xgbTree <- predict(xgbTree_fit, bh_test)

# View model RMSE, Rsquared and MAE values
postResample(pred = bh_pred_xgbTree, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.5778010 0.8670573 2.3382918
# Variable importance
varImp(xgbTree_fit)
## xgbTree variable importance
## 
##           Overall
## lstat   100.00000
## rm       82.38238
## dis      10.68443
## ptratio   9.44992
## crim      8.11190
## nox       4.62958
## tax       3.10129
## age       2.51098
## b         1.89819
## indus     1.19425
## rad       0.81990
## zn        0.07844
## chas1     0.00000
plot(varImp(xgbTree_fit), main ="xgbTree Model - Variable Importance")

Compare all models’ RMSE and MAE

Note. When measuring model performance:

RMSE: The root mean squared error. This is the square root of the average squared difference between the model’s predictions and the actual observations. The lower the RMSE, the more closely the model’s predictions match the actual observations.

MAE: The mean absolute error. This is the average absolute difference between the predictions made by the model and the actual observations. The lower the MAE, the more closely a model can predict the actual observations.

MAE and RMSE: which metric is better? Answer: MAE, since it weights all errors equally, whereas RMSE penalizes large errors more heavily.
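As a sanity check, both metrics can be computed by hand from the test-set predictions; a minimal sketch for the linear model, which should match postResample():

# Manual RMSE and MAE for the linear model's test-set predictions
lm_errors <- bh_test$medv - bh_pred_lm    # prediction errors on the test set
c(RMSE = sqrt(mean(lm_errors^2)),         # root mean squared error
  MAE  = mean(abs(lm_errors)))            # mean absolute error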

  11. Select the model with the lowest MAE.
postResample(pred = bh_pred_lm, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1567052 0.7242759 3.5349584
postResample(pred = bh_pred_glmStepAIC, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.1200064 0.7280245 3.5068559
postResample(pred = bh_pred_rf, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.3977721 0.8789298 2.3707018
postResample(pred = bh_pred_knn, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 7.1314480 0.4716835 4.7869714
postResample(pred = bh_pred_glmboost, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 5.5215952 0.6838738 3.7642867
postResample(pred = bh_pred_xgbTree, obs = bh_test$medv)
##      RMSE  Rsquared       MAE 
## 3.5778010 0.8670573 2.3382918

The selected model is the one with the lowest MAE, which is the xgbTree model (MAE of about 2.34).
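Optionally, the same test-set metrics can be gathered into one table sorted by MAE; a minimal sketch using {purrr} and {tibble}, which are already loaded with the tidyverse:

# Collect each model's test-set metrics into a single tibble, sorted by MAE
model_preds <- list(lm = bh_pred_lm, glmStepAIC = bh_pred_glmStepAIC,
                    rf = bh_pred_rf, knn = bh_pred_knn,
                    glmboost = bh_pred_glmboost, xgbTree = bh_pred_xgbTree)
map_dfr(model_preds,
        ~ as_tibble_row(postResample(pred = .x, obs = bh_test$medv)),
        .id = "model") %>%
  arrange(MAE)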

Deploying Selected Model
  12. View the first record of the ‘Boston Housing’ dataset.
BostonHousing[1,]
##      crim zn indus chas   nox    rm  age  dis rad tax ptratio     b lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.09   1 296    15.3 396.9  4.98   24
  13. Create a new dataset with the values you want to use to predict ‘medv’, the median value of owner-occupied homes in $1000s (from the 1970 census). For this example, I use the values from the first record of the Boston Housing dataset. Do not include the variable ‘medv’ in the home_spec dataset.
home_spec <- tibble(crim = 0.00632,  zn = 18, 
                   indus = 2.31, chas = "0",
                   nox = 0.538,  rm = 6.575, 
                   age = 65.2,   dis = 4.09,
                   rad = 1, tax = 296,
                   ptratio = 15.3, b = 396.9,
                   lstat = 4.98)
  14. Finally, to predict the median value of an owner-occupied home in Boston, pass the ‘xgbTree_fit’ model and the ‘home_spec’ dataset to the predict() function.
predict(xgbTree_fit, home_spec)
## [1] 24.42943

Note. The ‘medv’ value from the first record of the Boston Housing dataset is 24, and the value predicted by the xgbTree model is 24.4. Not bad!
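As an additional check, the same comparison can be made for several homes at once by predicting on a few rows of the test set; a minimal sketch:

# Predicted vs. actual medv for the first five homes in the test set
few_homes <- bh_test[1:5, ]
tibble(actual    = few_homes$medv,
       predicted = predict(xgbTree_fit, few_homes))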

A.M.D.G.

ite, inflammate omnia