Introduction

Geometrically designed spline (GeDS) regression is an efficient and versatile free-knot spline regression technique. It is based on a locally adaptive knot insertion scheme that produces an initial piecewise-linear spline fit, over which smoother, higher-order spline fits are subsequently built. GeDS was first introduced for the univariate Normal case by Kaishev et al., 2016. It was later extended to the broader context of Generalized Non-linear Models (GNM), which include Generalized Linear Models (GLM) as a special case, and to the bivariate case by Dimitrova et al., 2023. More recently, the GeDS methodology has been extended towards the benchmarks of Generalized Additive Models (GAM) and Functional Gradient Boosting (FGB) by Dimitrova et al., 2024.

The GeDS methodology is implemented in the R package GeDS. In this post I will focus on introducing the GAM-GeDS implementation. If you are interested in the canonical GeDS methodology, please see Geometrically Designed Splines: GeDS.

What are Generalized Additive Models?

You are probably familiar with linear regression. Let’s do a quick review of some concepts.

Linear Regression and GLM

The multiple linear regression (MLR) model is used to study the relationship between a dependent/response/outcome variable \(Y\) and one or more independent/predictor/explanatory variables \(X_1,..., X_P\). In particular, we assume that there is a linear relationship between the population mean of the outcome variable \(Y\) and the value of the explanatory variables \(X_1,..., X_P\): \[\begin{align} &\mathbb{E}\left[Y\right] = \alpha + \sum_{j=1}^P\beta_jX_j \quad\text{or}\quad Y = \alpha + \sum_{j=1}^P\beta_jX_j + \varepsilon&& \end{align}\] where the response \(Y\) and the error \(\varepsilon\) are random variables, while the covariates \(X_j, \; j = 1,...,P,\) are assumed to be deterministic. The MLR model relies on four major assumptions (Rencher & Schaalje, 2008; Montgomery et al., 2021):

A1. The relationship between the response \(Y\) and the regressors \(X_1,...,X_P\) is linear.

A2. The error term has zero mean, \(\mathbb{E}\left[\varepsilon\right]=0\), or, equivalently, \(\mathbb{E}\left[Y\right] = \alpha + \sum_{j=1}^P\beta_jX_j\).

A3. The error term has constant variance, \(\mathbb{E}\left[\varepsilon^2\right]=\sigma^2\), or, equivalently, \(\text{Var}(Y) =\sigma^2\).

A4. The errors are uncorrelated, \(\text{Cov}\left(\varepsilon_i, \varepsilon_j\right) = 0,\;\forall i\neq j\), or, equivalently, \(\text{Cov}\left(Y_i, Y_j\right) = 0,\;\forall i\neq j\).

And since we are often interested in testing hypotheses and constructing confidence intervals about the model parameters, an additional assumption is often included:

A5. The errors are normally distributed.

To sum up, we assume that the relation between \(Y\) and \(X_1,...,X_P\) is linear, that \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\), or, equivalently, \(Y \sim \mathcal{N}(\alpha + \sum_{j=1}^P\beta_jX_j,\sigma^2)\), and that the errors are uncorrelated with each other. Jointly normally distributed random variables are independent if they are uncorrelated, and thus, instead of A4, it is also common to see the assumption that “errors are independent of each other”, i.e., \(\varepsilon_i,\varepsilon_j\) are independent \(\forall i\neq j\), or, equivalently, \(Y_i,Y_j\) are independent \(\forall i\neq j\). Given a sample \(\{Y_i,X_i\}_{i=1}^N\), estimates of \(\alpha, \beta_1,...,\beta_P\) are usually obtained using the least squares (LS) method. If assumptions A1-A4 hold then, by the Gauss-Markov theorem, the least squares estimators \(\hat{\beta}_1,...,\hat{\beta}_P\) are the best linear unbiased estimators (BLUE). If A5 is also satisfied, the LS estimator coincides with the maximum likelihood estimator (MLE).

Generalized linear models (GLM) extend linear models to accommodate data from any distribution belonging to the exponential family, which, for example, includes the Normal, Poisson, Binomial or Gamma distributions. A GLM relates the conditional mean of the response variable to the linear predictor function via an invertible link function \(g\): \[\begin{align} &g\left(\mathbb{E}\left[Y\right]\right) = \beta_0+\beta_1X_1+\beta_2X_2+...+\beta_PX_P && \end{align}\]
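As a minimal illustration (a sketch with simulated data, unrelated to the GeDS package), base R's glm() fits exactly this kind of model; here a Poisson response is linked to the linear predictor through a log link:

# Minimal GLM sketch: Poisson response with a log link, simulated data
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 2 * x))  # E[Y|X] = exp(0.5 + 2X), i.e. g = log
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)                                   # estimated intercept and slope on the link scale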

Additive models and GAM

Additive models (AM) extend MLR by allowing for the incorporation of nonlinear smooth functions of the covariates: \[\begin{align} &\mathbb{E}\left[Y|X_1,..., X_P\right] = \alpha+\sum_{j=1}^P f_j(X_j) \quad\text{or}\quad Y = \alpha + \sum_{j=1}^P f_j(X_j) +\varepsilon && \end{align}\] where the error term \(\varepsilon\) is assumed to have zero mean, \(\mathbb{E}\left[\varepsilon\right]=0\), constant variance, \(\mathbb{E}\left[\varepsilon^2\right]=\sigma^2\), and to be independent of the predictor variables \(X_1,...,X_P\). It is also implicitly assumed that \(\mathbb{E}\left[f_j\left(X_j\right)\right]=0\) (which implies \(\mathbb{E}\left[Y\right]=\alpha\)), since otherwise there would be unaccounted constants in each of the functions \(f_j\) (Hastie & Tibshirani, 1990).

Generalized additive models (GAM) extend GLM in a similar way, i.e., by replacing each linear component \(\beta_jX_j\) with a smooth non-linear function \(f_j(X_j)\), in order to allow for non-linear predictor effects. In other words, GAM extend AM by allowing the response variable to follow any distribution from the exponential family.

\[\begin{align} g(\mathbb{E}\left[Y|X_1,...,X_P\right]) = \alpha + \sum_{j=1}^P f_j(X_j), \quad\quad \mathbb{E}\left[f_j(X_j)\right] = 0,\quad j=1,...,P \end{align}\]

Generalized Additive Models fitting: Local-Scoring and Backfitting

Hastie & Tibshirani, 1990, propose the local scoring algorithm in conjunction with the backfitting algorithm to fit GAM.

1. Backfitting

One can fit an AM by means of the backfitting algorithm:


Algorithm 1: Backfitting Algorithm for Additive Models


  1. Initialize: \(\hat{\alpha}=\frac{1}{N}\sum_{i=1}^NY_i\), \(\hat{f}_j= 0\), \(\forall j\)

  2. For each base-learner \(\hat{f}_j\), \(j=1,...,P\) \[\begin{align} &\hat{f}_j \leftarrow S_j\left[\left\{Y_i-\hat{\alpha}-\sum_{k\neq j}\hat{f}_k\left(X_{ik}\right)\right\}_{i=1}^N\right]\\ &\hat{f}_j \leftarrow \hat{f}_j -\frac{1}{N}\sum_{i=1}^N\hat{f}_j(X_{ij}) \end{align}\] where \(S_j\) is some arbitrary scatterplot smoother (i.e., a univariate regression fitting procedure).

  3. Repeat Step 2 until \[\begin{align} \text{RSS}=\sum_{i=1}^N\left(Y_i-\hat{\alpha}-\sum_{j=1}^P\hat{f}_j(X_{ij})\right)^2 \end{align}\] fails to decrease (a toy R sketch of this algorithm is given below).
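To make Algorithm 1 concrete, here is a toy backfitting sketch with two smooth terms, using base R's smooth.spline() as the scatterplot smoother \(S_j\) (simulated data; an illustration only, not the GeDS implementation):

# Toy backfitting for an additive model with two smooth terms (not the GeDS implementation)
set.seed(42)
N  <- 200
x1 <- runif(N); x2 <- runif(N)
y  <- 2 + sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(N, sd = 0.2)

alpha <- mean(y)                 # Step 1: initialize
f1 <- f2 <- rep(0, N)
rss_old <- Inf
repeat {
  # Step 2: smooth the partial residuals for each term, then centre the fit
  f1 <- predict(smooth.spline(x1, y - alpha - f2), x1)$y; f1 <- f1 - mean(f1)
  f2 <- predict(smooth.spline(x2, y - alpha - f1), x2)$y; f2 <- f2 - mean(f2)
  # Step 3: stop once the RSS fails to decrease (up to a small tolerance)
  rss <- sum((y - alpha - f1 - f2)^2)
  if (rss_old - rss < 1e-8) break
  rss_old <- rss
}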


2. Local Scoring

For fitting GAM, the local scoring algorithm is proposed:


Algorithm 2: Local Scoring Algorithm for Generalized Additive Models


  1. Initialize: \(\hat{\alpha}=g\left(\frac{1}{N}\sum_{i=1}^NY_i\right)\), \(\hat{f}_j^{(0)}=0\), \(j=1,...,P\), and \(m=0\).

  2. Set \(m=m+1\) and iterate to form the predictor \(\boldsymbol{\hat{\eta}}\), the mean \(\boldsymbol{\hat{\mu}}\), the weights \(\boldsymbol{w}\), and the adjusted dependent variable \(\boldsymbol{z}\):

  • Form the adjusted dependent variable \[\begin{align} & z_i = \hat{\eta}^{(m-1)}_i+\left(Y_i - \hat{\mu}^{(m-1)}_i \right)\cdot\left(\frac{\partial\hat{\eta}}{\partial\hat{\mu}}\right)_i^{(m-1)} \end{align}\] where, \(\hat{\eta}^{(m-1)}_i = \hat{\alpha} + \sum_{j=1}^P \hat{f}_j^{(m-1)}(X_{ij})\) and \(\hat{\mu}^{(m-1)}_i = g^{-1}\left(\hat{\eta}^{(m-1)}_i\right)\)
  • Form the weights: \(w_i = \left(V_i^{(m-1)}\right)^{-1}\cdot \left[\left(\frac{\partial\hat{\mu}}{\partial \hat{\eta}}\right)_i^{(m-1)}\right]^2\), where \(V_i^{(m-1)}\) is the variance of \(Y\) at \(\hat{\mu}_i^{(m-1)}\).
  • Fit an additive model to \(\boldsymbol{z}\) by using the backfitting algorithm with weights \(\boldsymbol{w}\) to obtain the estimated functions \(\hat{f}_j^{(m)}(\cdot)\), \(j=1,...,P\), and the additive predictor \(\boldsymbol{\hat{\eta}}^{(m)}\).
  3. Repeat Step 2 until the empirical deviance \(\sum_{i=1}^N\text{dev}\left(Y_i,\hat{\mu}^{(m)}_i\right)\) fails to decrease.

Note that if family = "gaussian" (with the identity link), the local scoring algorithm reduces to a single, direct application of the backfitting algorithm.
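As a rough illustration of the outer loop (again a simulated toy example, here with a single smooth term and a Poisson response with log link, using smooth.spline() as the smoother; a sketch only, not the GeDS implementation), local scoring can be written as:

# Toy local scoring: Poisson response, log link, one smooth term (not the GeDS implementation)
set.seed(7)
N <- 300
x <- runif(N)
y <- rpois(N, lambda = exp(1 + sin(2 * pi * x)))

eta <- rep(log(mean(y)), N)        # Step 1: initialize the predictor on the link scale
dev_old <- Inf
repeat {
  # Step 2: form the adjusted dependent variable z and the weights w
  mu <- exp(eta)                   # inverse link, g^{-1}(eta)
  z  <- eta + (y - mu) / mu        # z_i = eta_i + (Y_i - mu_i) * (d eta/d mu)_i, with d eta/d mu = 1/mu
  w  <- mu                         # w_i = V(mu_i)^{-1} * (d mu/d eta)_i^2 = mu_i for Poisson with log link
  # Weighted additive fit to z; with a single term the inner backfitting is one weighted smooth
  eta <- predict(smooth.spline(x, z, w = w), x)$y
  # Step 3: stop once the empirical deviance fails to decrease
  dev <- sum(poisson()$dev.resids(y, exp(eta), rep(1, N)))
  if (dev_old - dev < 1e-6) break
  dev_old <- dev
}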

Generalized Additive Models with GeD Splines

Our GAM-GeDS implementation involves applying the local scoring algorithm, using Normal GeD splines as the function smoothers to estimate \(f_j\) within the backfitting algorithm.

# install.packages("GeDS")
library(GeDS)

Let us illustrate the GAM-GeDS technique by using the car.test.frame dataset. The car.test.frame data frame has 60 rows and 8 columns, giving data on makes of cars taken from the April, 1990 issue of Consumer Reports (Therneau & Atkinson, 2023). The variables are:

  • Price: price in US dollars of a standard model
  • Country: country of origin, a factor with levels France, Germany, Japan, Japan/USA, Korea, Mexico, Sweden and USA.
  • Reliability: a numeric vector coded 1 to 5.
  • Mileage: fuel consumption in (US) miles per gallon (mpg).
  • Type: a factor with levels Compact, Large, Medium, Small, Sporty, and Van.
  • Weight: kerb weight in pounds.
  • Disp.: the engine capacity (displacement) in liters.
  • HP: net horsepower of the vehicle.
car_data <- rpart::car.test.frame
Gmodgam <- NGeDSgam(Mileage ~ f(Price) + Country + Type + f(Weight) + f(Disp.) + f(HP),
                    data = car_data, phi = 0.95)
## f(Weight) has less than 2 linear internal knots.
## f(Disp.) has less than 2 linear internal knots. Quadratic averaging knot location is not computed for base learners with less than 2 internal knots.
## f(Weight) has less than 3 linear internal knots.
## f(Disp.) has less than 3 linear internal knots. Cubic averaging knot location is not computed for base learners with less than 3 internal knots.

The above messages inform us that no internal knot was fitted for Weight, i.e. the GeDS base-learner considered a straight-line fit to be the best option for this predictor. Hence, no averaging knot location is computed for this covariate (see Stage B in Geometrically Designed Splines: GeDS). For Disp. only one internal knot was fitted, hence the averaging knot location for the quadratic and cubic fits is not computed either. Note that we relax the GeDS stopping rule by setting phi = 0.95 (instead of the default phi = 0.9).

First, we can extract the coefficients of the linear, quadratic and cubic GAM-GeDS fits (output not displayed for conciseness):

# Linear GAM-GeDS Fit
coef(Gmodgam, n = 2)
# Quadratic GAM-GeDS Fit
coef(Gmodgam, n = 3)
# Cubic GAM-GeDS Fit
coef(Gmodgam, n = 4)

The coefficients of the linear fit are provided in piecewise polynomial form, while those of the quadratic and cubic fits are given in B-spline form.

Second, the knots can be extracted as follows:

# Linear GAM-GeDS Fit
knots(Gmodgam, n = 2)
## $`f(Price)`
## [1]  5866.000  5866.000  7169.350  7960.505 11902.291 20465.167 24760.000
## [8] 24760.000
## 
## $Country
## NULL
## 
## $Type
## NULL
## 
## $`f(Weight)`
## [1] 1845 1845 3855 3855
## 
## $`f(Disp.)`
## [1]  73.0000  73.0000 227.0536 305.0000 305.0000
## 
## $`f(HP)`
## [1]  63.00000  63.00000  88.56865 116.65251 147.35135 225.00000 225.00000
# Quadratic GAM-GeDS Fit
knots(Gmodgam, n = 3)
## $`f(Price)`
## [1]  5866.000  5866.000  5866.000  7564.928  9931.398 16183.729 24760.000
## [8] 24760.000 24760.000
## 
## $Country
## NULL
## 
## $Type
## NULL
## 
## $`f(Weight)`
## [1] 1845 1845 1845 3855 3855 3855
## 
## $`f(Disp.)`
## [1]  73.0000  73.0000  73.0000 227.0536 305.0000 305.0000 305.0000
## 
## $`f(HP)`
## [1]  63.0000  63.0000  63.0000 102.6106 132.0019 225.0000 225.0000 225.0000
# Cubic GAM-GeDS Fit
knots(Gmodgam, n = 4)
## $`f(Price)`
##  [1]  5866.000  5866.000  5866.000  5866.000  9010.715 13442.654 24760.000
##  [8] 24760.000 24760.000 24760.000
## 
## $Country
## NULL
## 
## $Type
## NULL
## 
## $`f(Weight)`
## [1] 1845 1845 1845 1845 3855 3855 3855 3855
## 
## $`f(Disp.)`
## [1]  73.0000  73.0000  73.0000  73.0000 227.0536 305.0000 305.0000 305.0000
## [9] 305.0000
## 
## $`f(HP)`
## [1]  63.0000  63.0000  63.0000  63.0000 117.5242 225.0000 225.0000 225.0000
## [9] 225.0000

Note that the \(n\) left-most and the \(n\) right-most knots are coalescent, i.e., repeated at the boundaries of each covariate's range.

Third, we can extract the deviances, which in this case (family = "gaussian") are the Residual Sum of Squares (RSS):

# Linear GAM-GeDS Fit
deviance(Gmodgam, n = 2)
## [1] 146.871
# Quadratic GAM-GeDS Fit
deviance(Gmodgam, n = 3)
## [1] 81.61975
# Cubic GAM-GeDS Fit
deviance(Gmodgam, n = 4)
## [1] 70.37585

The components of a GAM-GeDS object can be plotted in the linear predictor scale as follows:

# Linear GAM-GeDS Fit
layout(matrix(c(1,2), nrow=1, byrow=TRUE))
plot(Gmodgam, n = 2)

# Quadratic GAM-GeDS Fit
layout(matrix(c(1,2), nrow=1, byrow=TRUE))
plot(Gmodgam, n = 3)

# Cubic GAM-GeDS Fit
layout(matrix(c(1,2), nrow=1, byrow=TRUE))
plot(Gmodgam, n = 4)

For each fit and each base-learner, the corresponding knot locations are traced with vertical dotted lines.

To assess the model’s accuracy on unseen data we can proceed as follows:

# Set seed for reproducibility
set.seed(123)
# Determine the size of the dataset
n <- nrow(car_data)
# Create a random sample of row indices for the training set
trainIndex <- sample(1:n, size = floor(0.8 * n))
# Subset the data into training and test sets
train <- car_data[trainIndex, ]
test <- car_data[-trainIndex, ]

Gmodgam <- NGeDSgam(Mileage ~ f(Price) + Country + Type + f(Weight) + f(Disp.) + f(HP),
                    data = train, phi = 0.9)

Since the truth is more likely to be smooth than wiggly, we lower phi back to its default value of 0.9. We then compute predictions on the test sample and calculate the corresponding mean squared error (MSE) for each fit:

mean((test$Mileage - predict(Gmodgam, newdata = test, n = 2))^2)
## [1] 4.153196
mean((test$Mileage - predict(Gmodgam, newdata = test, n = 3))^2)
## [1] 8.012237
mean((test$Mileage - predict(Gmodgam, newdata = test, n = 4))^2)
## [1] 7.828827

To conclude, let us move to a properly generalized (non-Gaussian) setting by modelling Price instead. A widely used distribution for price data is the Gamma distribution.

Gmodgam <- NGeDSgam(Price ~ f(Mileage) + Country + Type + f(Weight) + f(Disp.) + f(HP),
                    data = train, family = Gamma(link=log), phi = 0.9)

We then calculate the Gamma deviance on the test set as follows (the family object's dev.resids() function returns the unit deviances, which we sum over the test observations):

# Gamma deviance of the test-set predictions, for each order of the fit
sum(Gamma()$dev.resids(test$Price, predict(Gmodgam, newdata = test, n = 2), wt = 1))
sum(Gamma()$dev.resids(test$Price, predict(Gmodgam, newdata = test, n = 3), wt = 1))
sum(Gamma()$dev.resids(test$Price, predict(Gmodgam, newdata = test, n = 4), wt = 1))
References

Dimitrova, D. S., Guillen, E. S., & Kaishev, V. K. (2024). GeDS: An R package for regression, generalized additive models and functional gradient boosting, based on geometrically designed (GeD) splines.
Dimitrova, D. S., Kaishev, V. K., Lattuada, A., & Verrall, R. J. (2023). Geometrically designed variable knot splines in generalized (non-)linear models. Applied Mathematics and Computation, 436. https://doi.org/10.1016/j.amc.2022.127493
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models (Monographs on Statistics and Applied Probability, Vol. 43). Chapman & Hall.
Kaishev, V. K., Dimitrova, D. S., Haberman, S., & Verrall, R. J. (2016). Geometrically designed, variable knot regression splines. Computational Statistics, 31(3), 1079–1105. https://doi.org/10.1007/s00180-015-0621-7
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2021). Introduction to linear regression analysis. John Wiley & Sons.
Rencher, A. C., & Schaalje, G. B. (2008). Linear models in statistics. John Wiley & Sons.
Therneau, T., & Atkinson, B. (2023). Rpart: Recursive partitioning and regression trees. https://CRAN.R-project.org/package=rpart