Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
Encourage the use of machine learning techniques in medical research in Vietnam, and
Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Introduction
In conventional statistics, polynomial regression is defined as a special case of multiple linear regression in which the relationship between the response variable Y and the predictors X is modelled by a polynomial function of degree >= 2. Though the polynomial function is non-linear in X, the model remains linear in its parameters. Essentially, a polynomial model turns a straight regression line into a curve.
In machine learning, polynomial regression is a technique for increasing the capacity of the basic linear model. According to Ian Goodfellow et al. (Chapter 5, Deep Learning, 2016), adding polynomial terms to a linear regression model can be seen as a way to improve prediction accuracy by extending the model's capacity. Models with too little capacity may struggle to fit the training set, while models with excessive capacity may overfit by memorizing properties of the training set and performing poorly on unseen data (the test set).
Polynomial regression can be very useful for modelling nonlinear variation such as human growth, the progression of a biomarker, or the accumulation of events.
In practice, a polynomial regression could be considered in situations such as:
When we understand the underlying theoretical law of the variation well (growth of cells, degradation of molecules, progression of count data, ...).
When a scatter plot shows a curvilinear relationship between the response and the explanatory variable, for example the growth of lung volume as a function of age.
When the residuals of a basic linear model show a systematic pattern, which can be a good sign that a polynomial model is more appropriate than the basic linear model (a minimal fitting sketch follows this list).
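To make this concrete, here is a minimal, self-contained sketch (on simulated data, not the NHANES data analysed below) of how a curvilinear pattern in the residuals of a straight-line fit can motivate a polynomial term:
set.seed(1)
x=runif(200,0,10)
y=2+0.5*x-0.04*x^2+rnorm(200,sd=0.3) # the true relationship is curvilinear
lin=lm(y~x) # basic linear model
quad=lm(y~poly(x,2)) # quadratic (orthogonal) polynomial model
plot(fitted(lin),resid(lin)) # a systematic curve here suggests a polynomial fit
AIC(lin,quad) # the quadratic model should show the lower AIC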
GAMLSS package
Introduced in 2008, the GAMLSS package allows the user to fit a separate model for each parameter (Mu, Sigma, Nu and Tau) of more than 70 distribution families, as linear, polynomial, parametric or non-parametric (smoothing) functions of the predictors (including fixed and random effect terms). This makes GAMLSS one of the most flexible tools for statistical modelling. Several types of additive terms are implemented in GAMLSS.
The GAMLSS package can also be considered a true machine learning tool: every model is “trained” iteratively until convergence and can be cross-validated on randomly split subsets. GAMLSS also provides a tuning process that can determine the best distribution family and the best degrees-of-freedom value for smoothing splines and polynomial functions. Some of these features will be explored in our experiment.
More details can be found on the GAMLSS website: http://www.gamlss.org/
H2O GLM algorithm
Introduced seven years later, the h2o GLM module is one of the most comprehensive GLM-based machine learning algorithms for supervised classification and regression tasks. First, h2o GLM supports a wide range of distribution families, including Gaussian, Poisson, Binomial, Gamma and Tweedie. Second, it integrates elastic-net regularization for model shrinkage, with Ridge, Lasso and Elastic Net modes; the penalization strength can be tuned automatically or manually via the lambda hyperparameter. h2o GLM also offers several solvers, including IRLSM, L-BFGS and coordinate descent.
Technically, h2o GLM allows performing a polynomial regression with both Gaussian and Gamma distributions, and even combining the polynomial terms with a Lasso or Ridge penalty to optimize the model’s capacity and avoid overfitting.
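As an illustration, here is a minimal, hedged sketch (on simulated data with illustrative parameter values, not the NHANES pipeline used later) of combining orthogonal polynomial features built with poly() with an elastic-net penalty in h2o.glm():
library(h2o)
h2o.init(nthreads=-1)
set.seed(1)
x=runif(500,20,80)
y=rgamma(500,shape=50,rate=50/exp(1.6+0.002*x)) # gamma-distributed outcome with mean exp(1.6+0.002*x)
op=poly(x,degree=4) # orthogonal polynomial basis
df=data.frame(y=y,p1=op[,1],p2=op[,2],p3=op[,3],p4=op[,4])
hf=as.h2o(df)
fit=h2o.glm(x=c("p1","p2","p3","p4"),y="y",training_frame=hf,
            family="gamma",link="log",
            alpha=1,lambda_search=TRUE) # alpha = 1 is the Lasso; alpha = 0 would be Ridge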
For this reason, we would like to compare the h2o GLM algorithm with the GAMLSS polynomial functions in the present study.
Objective
The present case study aims to compare the performance of two packages, GAMLSS and H2O, on a simple polynomial regression task with one predictor and a gamma-distributed outcome. Through this experiment, we will explore the following techniques:
Exploring the distribution family using GAMLSS
Training five different types of polynomial models in GAMLSS (cubic smoothing splines, penalized splines, piecewise, fractional and orthogonal polynomials).
Training a penalized polynomial Gamma regression model with the H2O GLM algorithm.
Combining different packages: h2o, gamlss, mlr and caret
This case study uses part of the well-known NHANES dataset. The main objective is to predict the level of glycohemoglobin (gh) as a function of age in healthy (non-diabetic) Americans.
Materials and method
First, we will prepare the ggplot theme for our experiment
library(tidyverse)
my_theme <- function(base_size = 10, base_family = "sans"){
theme_minimal(base_size = base_size, base_family = base_family) +
theme(
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
axis.title = element_text(size = 12),
panel.grid.major = element_line(color = "grey"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#fcf9ff"),
strip.background = element_rect(fill = "#3d0066", color = "#3d0066", size =0.5),
strip.text = element_text(face = "bold", size = 10, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
)
}
theme_set(my_theme())
myfill=c("#f32440","#249ff2","#ba33e8","#ffbd0a")
mycolor=c("#b20018","#012670","#2b0354","#c42f01")
Now we load the nhgh dataset with the getHdata() function (from the Hmisc/rms packages) and perform a descriptive analysis:
require(rms)
getHdata(nhgh)
#Intended subset: healthy (dx != 1), Non-Hispanic White subjects only
data=nhgh%>%as_tibble()%>%na.omit()%>%subset(.,re=="Non-Hispanic White" & dx!=1)%>%.[,c("sex","age","gh")]
#Note: this second assignment overwrites the previous one and keeps all healthy (dx != 1) subjects
data=nhgh%>%as_tibble()%>%na.omit()%>%subset(.,dx!=1)%>%.[,c("sex","age","gh")]
age=data$age
gh=data$gh
sex=as.factor(data$sex)
data=as.data.frame(cbind(sex,age,gh))%>%as_tibble()
library(magrittr) #needed for the %<>% assignment pipe
data$sex%<>%as.factor()%>%recode_factor(.,`1` = "Male", `2` = "Female")
data=subset(data,gh<8)
Hmisc::describe(data)
## data
##
## 3 Variables 4712 Observations
## ---------------------------------------------------------------------------
## sex
## n missing distinct
## 4712 0 2
##
## Value Male Female
## Frequency 2424 2288
## Proportion 0.514 0.486
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 4712 0 804 1 41.2 23.55 13.92 15.75
## .25 .50 .75 .90 .95
## 21.92 39.50 57.02 72.67 79.42
##
## lowest : 12.00000 12.08333 12.16667 12.25000 12.33333
## highest: 79.66667 79.75000 79.83333 79.91667 80.00000
## ---------------------------------------------------------------------------
## gh
## n missing distinct Info Mean Gmd .05 .10
## 4712 0 40 0.993 5.417 0.4398 4.8 4.9
## .25 .50 .75 .90 .95
## 5.2 5.4 5.6 5.9 6.1
##
## lowest : 4.0 4.1 4.2 4.3 4.4, highest: 7.5 7.6 7.7 7.8 7.9
## ---------------------------------------------------------------------------
Data visualisation
First, we would like to know how the outcome is distributed among 4712 subjects:
data%>%ggplot(aes(x=gh,fill=sex,color=sex))+geom_histogram(alpha=0.5,bins=40)+scale_color_manual(values=mycolor)+scale_fill_manual(values=myfill)+ggtitle("Glycohemoglobin")
data%>%ggplot(aes(x=age,fill=sex,color=sex))+geom_histogram(alpha=0.5,bins=40)+scale_color_manual(values=mycolor)+scale_fill_manual(values=myfill)+ggtitle("Age")
To determine whether the models should be built separately for each gender, we plot gh as a function of age for males and females:
data%>%ggplot(aes(x=age,y=gh,fill=sex,color=sex))+geom_point(shape=21,alpha=0.2,size=2)+geom_smooth()+scale_color_manual(values=mycolor)+scale_fill_manual(values=myfill)
The graph indicates that we can build a single model for both males and females, as the two regression curves show no noticeable difference.
The figure above used geom_smooth’s default smoother. Now we will explore other polynomial function types on our dataset:
p1=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="grey")+geom_smooth(method = "lm", formula = y ~ poly(x, 1),color="black",fill="grey80")+ggtitle("Linear")
p2=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#f32440")+geom_smooth(method = "lm", formula = y ~ poly(x, 2),color="#b20018",fill="#f32440")+ggtitle("Quadratic")
p3=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#249ff2")+geom_smooth(method = "lm", formula = y ~ poly(x, 3),color="#012670",fill="#249ff2")+ggtitle("Cubic")
p4=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#ba33e8")+geom_smooth(method = "lm", formula = y ~ poly(x, 4),color="#2b0354",fill="#ba33e8")+ggtitle("Quartic")
p5=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="gold")+geom_smooth(method = "gam",color="orangered",fill="orange")+ggtitle("GAM")
p6=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#bde509")+geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),color="darkgreen",fill="#bde509")+ggtitle("LM+Cubic Spline")
library(gridExtra)
grid.arrange(p1,p5,p2,p3,p4,p6,ncol=3)
It seems that a cubic or quartic polynomial model, or a model with a cubic spline, might be more appropriate for our goal.
Machine learning experiment
First, we will split the original dataset into 3 subsets:
#Generating higher degree polynomial function on X:
data=data%>%mutate(x2=age^2,x3=age^3,x4=age^4)
#Data splitting
library(caret)
set.seed(123)
idTrain=caret::createDataPartition(y=data$gh,p=0.55,list=FALSE)
trainset=data[idTrain,]
remain=data[-idTrain,]
idTest=caret::createDataPartition(y=remain$gh,p=0.5,list=FALSE)
testset=remain[idTest,]
validset=remain[-idTest,]
rm(idTrain,idTest,remain)
The following figures confirm that the distribution of gh and its relationship with age are unchanged across the 3 subsets and compared to the original sample:
s1=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="#f32440")+ggtitle("Origin")
s2=trainset%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="#249ff2")+ggtitle("Train")
s3=validset%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="#ba33e8")+ggtitle("Valid")
s4=testset%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="gold")+ggtitle("Test")
grid.arrange(s1,s2,s3,s4)
d1=data%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="#f32440")+ggtitle("Origin")
d2=trainset%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="#249ff2")+ggtitle("Train")
d3=validset%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="#ba33e8")+ggtitle("Valid")
d4=testset%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="gold")+ggtitle("Test")
grid.arrange(d1,d2,d3,d4)
As the outcome (gh) is a continuous, positive variable, the next step is to decide whether a Gaussian (normal) or a Gamma distribution should be used to model gh. This can be done with the GAMLSS package:
library(gamlss)
gh=trainset$gh
xmin=min(trainset$age)
xmax=max(trainset$age)
mGA=histDist(gh,"GA",xmin=xmin,xmax=xmax,density=T)
mNO=histDist(gh,"NO",xmin=xmin,xmax=xmax,density=T)
GAIC(mNO,mGA,k=log(length(gh)))
## df AIC
## mGA 2 2577.766
## mNO 2 2615.414
Based on this information criterion (GAIC with k = log(n), i.e. the SBC), we should adopt the Gamma distribution family instead of the Normal for our model.
The pdf of the gamma distribution, denoted GA(mu, sigma), is defined by:
\[ f_{Y}(y\mid\mu,\sigma)= \frac{1}{(\sigma^{2}\mu)^{1/\sigma^{2}}}\,\frac{y^{1/\sigma^{2}-1}\,e^{-y/(\sigma^{2}\mu)}}{\Gamma(1/\sigma^{2})} \]
where y, mu and sigma are all positive.
Mean: \[E(Y)=\mu\]
Variance: \[Var(Y)= \sigma^2\mu^2\]
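As a quick sanity check, the density above can be compared numerically with the dGA() function of gamlss.dist, using illustrative values (the sigma here is larger than the one fitted below, purely for readability):
library(gamlss.dist)
y=5.4; mu=5.4; sigma=0.3 #illustrative values only
manual=(1/(sigma^2*mu)^(1/sigma^2))*y^(1/sigma^2-1)*exp(-y/(sigma^2*mu))/gamma(1/sigma^2)
c(manual=manual,dGA=dGA(y,mu=mu,sigma=sigma)) #the two values should agree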
1) CUBIC SMOOTHING SPLINE BY GAMLSS
The cs() (cubic smoothing spline) function is similar to R’s smooth.spline() function and can be used for univariate smoothing.
The first thing to consider when adding more terms to a linear model is how many degrees of freedom (df) to spend. The available df are determined by the number of observations in the dataset, the distribution family in use and the number of predictors, including the intercepts. Additive terms such as higher-degree polynomials, smoothing splines or random effects each consume df. Degrees of freedom should be treated like a budget: care should be taken on how and where we spend them.
The following code tunes the df values for the Mu and Sigma models; the decision is based on the AIC (smaller is better). The optimal df values turn out to be about 4.175 for Mu and 1 (i.e. linear) for Sigma. The resulting cubic smoothing spline model is a black box, with a total of about 9.175 effective degrees of freedom.
fnaic=function(p)AIC(gamlss(gh~cs(age,df=p[1]),sigma.fo=~cs(age,df=p[2]),family=GA,data=trainset,trace=FALSE),k=2)
opAIC=optim(par=c(1,1),fnaic,method="L-BFGS-B",lower=c(1,1),upper=c(15,15))
opAIC$par
## [1] 4.175334 1.000000
mod1= gamlss(gh~cs(age,df=opAIC$par[1]),sigma.fo=~cs(age,df=opAIC$par[2]),family=GA,data=trainset,trace = FALSE)
summary(mod1)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call:
## gamlss(formula = gh ~ cs(age, df = opAIC$par[1]), sigma.formula = ~cs(age,
## df = opAIC$par[2]), family = GA, data = trainset, trace = FALSE)
##
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.624034 0.002688 604.12 <2e-16 ***
## cs(age, df = opAIC$par[1]) 0.001560 0.000063 24.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.9163290 0.0313413 -93.051 < 2e-16 ***
## cs(age, df = opAIC$par[2]) 0.0043138 0.0006795 6.348 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## NOTE: Additive smoothing terms exist in the formulas:
## i) Std. Error for smoothers are for the linear effect only.
## ii) Std. Error for the linear terms maybe are not accurate.
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 9.174846
## Residual Deg. of Freedom: 2584.825
## at cycle: 3
##
## Global Deviance: 1907.401
## AIC: 1925.75
## SBC: 1979.524
## ******************************************************************
The fitted spline values as a function of age can be plotted as follows:
mus=cbind(trainset$age,mod1$mu.s,mod1$sigma.s)%>%as_tibble()
names(mus)=c("Age","Mu","Sigma")
mus%>%gather(.,Mu:Sigma,key="Parameter",value="Spline")%>%ggplot(aes(x=Age,y=Spline,color=Parameter))+geom_line()+scale_color_manual(values=c("red","blue"))+ggtitle("Cubic splines")
2) PENALIZED SPLINE IN GAMLSS
Penalized splines are piecewise polynomials defined by B-spline basis functions of the explanatory variable, where the coefficients of the basis functions are penalized to guarantee sufficient smoothness without overfitting. To define a penalized spline we need to specify: the number of knots on the x scale and where to put them (equal spacing is used), the degree of the piecewise polynomial used in the B-spline basis, the order of differences in the penalty matrix (indicating the type of penalty imposed), and the amount of smoothing required, defined by the df.
A penalized spline model can be built very simply in GAMLSS as follows. This model has df values of about 6.55 for Mu and 2.45 for Sigma; it is also a black-box model.
mod2=gamlss(gh~pb(age),sigma.fo=~pb(age),family=GA,data=trainset,trace=FALSE,gd.tol=10)
mod2$mu.df
mod2$sigma.df
summary(mod2)
mus2=cbind(trainset$age,mod2$mu.s,mod2$sigma.s)%>%as_tibble()
names(mus2)=c("Age","Mu","Sigma")
mus2%>%gather(.,Mu:Sigma,key="Parameter",value="Spline")%>%ggplot(aes(x=Age,y=Spline,color=Parameter))+geom_line()+scale_color_manual(values=c("red","blue"))+ggtitle("Penalized splines")
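The pb() call above chooses the amount of smoothing automatically. As a hedged sketch (assuming the pb()/pb.control() interface of gamlss), the settings described earlier can also be made explicit, here with 20 knot intervals, a degree-3 B-spline basis and a 2nd-order difference penalty:
mod2b=gamlss(gh~pb(age,control=pb.control(inter=20,degree=3,order=2)),sigma.fo=~pb(age),family=GA,data=trainset,trace=FALSE)
mod2b$mu.df #effective degrees of freedom chosen for the Mu model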
We can also combine the two approaches: a cubic smoothing spline model with the df values taken from the penalised model. However, a comparison based on AIC shows that the cubic smoothing spline model (mod1) remains the best, as it has a lower df cost and a lower AIC than the combined solution.
#Combining them together:
mod3=gamlss(gh~cs(age,df=mod2$mu.df),sigma.fo=~cs(age,df=mod2$sigma.df),family=GA,data=trainset,trace=FALSE)
GAIC(mod1,mod2,mod3)
## df AIC
## mod1 9.174846 1925.750
## mod2 8.999424 1925.927
## mod3 12.998841 1927.649
3) PIECEWISE POLYNOMIAL
The piecewise polynomial (B-spline) model with 4 df is better than the models with 3 or 5 df:
#Piecewise polynomial
ppmod1=gamlss(gh~bs(age,df=3),sigma.fo=~bs(age,df=3),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1917.233
## GAMLSS-RS iteration 2: Global Deviance = 1917.051
## GAMLSS-RS iteration 3: Global Deviance = 1917.051
ppmod2=gamlss(gh~bs(age,df=4),sigma.fo=~bs(age,df=4),data=trainset,family=GA)#Best
## GAMLSS-RS iteration 1: Global Deviance = 1907.824
## GAMLSS-RS iteration 2: Global Deviance = 1907.771
## GAMLSS-RS iteration 3: Global Deviance = 1907.771
ppmod3=gamlss(gh~bs(age,df=5),sigma.fo=~bs(age,df=5),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1904.142
## GAMLSS-RS iteration 2: Global Deviance = 1904.137
## GAMLSS-RS iteration 3: Global Deviance = 1904.137
GAIC(ppmod1,ppmod2,ppmod3)
## df AIC
## ppmod2 10 1927.771
## ppmod3 12 1928.137
## ppmod1 8 1933.051
summary(ppmod2)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call: gamlss(formula = gh ~ bs(age, df = 4), sigma.formula = ~bs(age,
## df = 4), family = GA, data = trainset)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.671842 0.004286 390.031 < 2e-16 ***
## bs(age, df = 4)1 -0.046937 0.009562 -4.908 9.75e-07 ***
## bs(age, df = 4)2 0.030834 0.009337 3.302 0.000972 ***
## bs(age, df = 4)3 0.071424 0.010994 6.496 9.83e-11 ***
## bs(age, df = 4)4 0.068387 0.006548 10.444 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.96003 0.05563 -53.211 < 2e-16 ***
## bs(age, df = 4)1 0.22573 0.11464 1.969 0.049062 *
## bs(age, df = 4)2 0.15610 0.09901 1.577 0.114995
## bs(age, df = 4)3 0.42136 0.11778 3.577 0.000353 ***
## bs(age, df = 4)4 0.31937 0.07356 4.342 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 10
## Residual Deg. of Freedom: 2584
## at cycle: 3
##
## Global Deviance: 1907.771
## AIC: 1927.771
## SBC: 1986.381
## ******************************************************************
4) FRACTIONAL POLYNOMIAL
Fractional polynomials (fp) are an implementation of the original function introduced by Royston and Altman (1994), with a later version created by Gareth Ambler in S-PLUS. This function generates the right design matrix for fitting a power polynomial. For given powers, the function can be used in the same way as the orthogonal polynomial function (poly) or the piecewise polynomial function (bs) of the “splines” package. The fp function fits the best fractional polynomial among a specified set of power values.
The comparison below shows that the third-order fractional polynomial model was better than the first- and second-order models.
#Fractional polynomial
fpmod1=gamlss(gh~fp(age,1),sigma.fo=~fp(age,1),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1978.003
## GAMLSS-RS iteration 2: Global Deviance = 1977.693
## GAMLSS-RS iteration 3: Global Deviance = 1977.692
fpmod2=gamlss(gh~fp(age,2),sigma.fo=~fp(age,2),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1923.009
## GAMLSS-RS iteration 2: Global Deviance = 1923.005
## GAMLSS-RS iteration 3: Global Deviance = 1923.005
fpmod3=gamlss(gh~fp(age,3),sigma.fo=~fp(age,3),data=trainset,family=GA)#Best
## GAMLSS-RS iteration 1: Global Deviance = 1909.14
## GAMLSS-RS iteration 2: Global Deviance = 1909.015
## GAMLSS-RS iteration 3: Global Deviance = 1909.015
GAIC(fpmod1,fpmod2,fpmod3)
## df AIC
## fpmod3 14 1937.015
## fpmod2 10 1943.005
## fpmod1 6 1989.692
summary(fpmod3)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call: gamlss(formula = gh ~ fp(age, 3), sigma.formula = ~fp(age,
## 3), family = GA, data = trainset)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.688351 0.001258 1342 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.73795 0.01387 -197.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## NOTE: Additive smoothing terms exist in the formulas:
## i) Std. Error for smoothers are for the linear effect only.
## ii) Std. Error for the linear terms maybe are not accurate.
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 14
## Residual Deg. of Freedom: 2580
## at cycle: 3
##
## Global Deviance: 1909.015
## AIC: 1937.015
## SBC: 2019.069
## ******************************************************************
5) ORTHOGONAL POLYNOMIAL
The poly() function generates orthogonal polynomials which represent a basis for polynomial regression.
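As a small illustration on the training data already loaded, the columns returned by poly() are orthonormal, which is what makes the fitted polynomial terms numerically stable and mutually uncorrelated (unlike raw powers of age):
P=poly(trainset$age,degree=4)
round(crossprod(P),10) #approximately a 4 x 4 identity matrix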
The model comparison indicates that the Quartic (4th degree) model was the best among 5 polynomial degrees (linear to quintic)
#Orthogonal polynomial
opmod1=gamlss(gh~poly(age,1),sigma.fo=~poly(age,1),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1980.369
## GAMLSS-RS iteration 2: Global Deviance = 1980.295
## GAMLSS-RS iteration 3: Global Deviance = 1980.295
opmod2=gamlss(gh~poly(age,2),sigma.fo=~poly(age,2),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1973.264
## GAMLSS-RS iteration 2: Global Deviance = 1972.118
## GAMLSS-RS iteration 3: Global Deviance = 1972.116
## GAMLSS-RS iteration 4: Global Deviance = 1972.116
opmod3=gamlss(gh~poly(age,3),sigma.fo=~poly(age,3),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1917.233
## GAMLSS-RS iteration 2: Global Deviance = 1917.051
## GAMLSS-RS iteration 3: Global Deviance = 1917.051
opmod4=gamlss(gh~poly(age,4),sigma.fo=~poly(age,4),data=trainset,family=GA)#Best
## GAMLSS-RS iteration 1: Global Deviance = 1906.575
## GAMLSS-RS iteration 2: Global Deviance = 1906.532
## GAMLSS-RS iteration 3: Global Deviance = 1906.532
opmod5=gamlss(gh~poly(age,5),sigma.fo=~poly(age,5),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1905.024
## GAMLSS-RS iteration 2: Global Deviance = 1905.019
## GAMLSS-RS iteration 3: Global Deviance = 1905.019
GAIC(opmod1,opmod2,opmod3,opmod4,opmod5)
## df AIC
## opmod4 10 1926.531
## opmod5 12 1929.019
## opmod3 8 1933.051
## opmod2 6 1984.116
## opmod1 4 1988.296
summary(opmod4)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call: gamlss(formula = gh ~ poly(age, 4), sigma.formula = ~poly(age,
## 4), family = GA, data = trainset)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.688424 0.001281 1318.289 < 2e-16 ***
## poly(age, 4)1 1.632274 0.066025 24.722 < 2e-16 ***
## poly(age, 4)2 0.093934 0.065474 1.435 0.15150
## poly(age, 4)3 -0.459222 0.065839 -6.975 3.87e-12 ***
## poly(age, 4)4 0.189451 0.065241 2.904 0.00372 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.73845 0.01387 -197.384 < 2e-16 ***
## poly(age, 4)1 4.51055 0.70688 6.381 2.08e-10 ***
## poly(age, 4)2 -1.30912 0.70708 -1.851 0.0642 .
## poly(age, 4)3 0.21370 0.71692 0.298 0.7657
## poly(age, 4)4 -0.96991 0.69290 -1.400 0.1617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 10
## Residual Deg. of Freedom: 2584
## at cycle: 3
##
## Global Deviance: 1906.531
## AIC: 1926.531
## SBC: 1985.141
## ******************************************************************
Finally, we find that the cubic smoothing spline model is the best solution; there is little difference between the orthogonal, piecewise and penalised spline methods, and the fractional polynomial is the worst solution.
GAIC(mod2,mod1,opmod4,fpmod3,ppmod2)
## df AIC
## mod1 9.174846 1925.750
## mod2 8.999424 1925.927
## opmod4 10.000000 1926.531
## ppmod2 10.000000 1927.771
## fpmod3 14.000000 1937.015
H2O GLM
Now we will train another polynomial model using the h2o GLM algorithm. First, we need to initialise h2o from R:
library(h2o)
h2o.init(nthreads = -1,max_mem_size ="4g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 14 minutes
## H2O cluster version: 3.10.3.6
## H2O cluster version age: 2 months and 15 days
## H2O cluster name: H2O_started_from_R_Admin_jya313
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.39 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.3.1 (2016-06-21)
wdata=as.h2o(data)
wtrain=as.h2o(trainset)
wtest=as.h2o(testset)
wvalid=as.h2o(validset)
Details of the GLM arguments and parameters could be found here:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
In our experiment, we will train a GLM model on the training set with 4 predictors: age, age^2, age^3 and age^4 (i.e. up to the 4th degree). The model uses a Gamma distribution with a logarithmic link function. We also apply an elastic-net penalty by setting alpha = 0, which in h2o corresponds to Ridge regularisation (alpha = 1 would give the Lasso). The regularisation strength is determined automatically through a search over 150 candidate lambda values. The model is trained with 10-fold cross-validation, a fixed validation subset and early stopping.
response="gh"
features=setdiff(colnames(wdata),c(response,"sex"))
h2ofit=h2o.glm(x=features,
y=response,
training_frame=wtrain,
validation_frame = wvalid,
nfolds = 10,seed = 123,
keep_cross_validation_predictions = TRUE,
keep_cross_validation_fold_assignment = TRUE,
fold_assignment="AUTO",
score_each_iteration = TRUE,
family = "gamma",link="log",
solver = "L_BFGS",
lambda_search = TRUE,alpha=0,nlambdas =150,
early_stopping = TRUE,
standardize = TRUE,
missing_values_handling ="Skip",
compute_p_values = FALSE,
remove_collinear_columns = FALSE,
intercept = TRUE,
non_negative = FALSE
)
h2ofit
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model ID: GLM_model_R_1494158163717_4
## GLM Model: summary
## family link regularization
## 1 gamma log Ridge ( lambda = 0.03122 )
## lambda_search
## 1 nlambda = 150, lambda.max = 3.2195, lambda.min = 0.02845, lambda.1se = 1.3976
## number_of_predictors_total number_of_active_predictors
## 1 4 4
## number_of_iterations training_frame
## 1 103 trainset
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 1.635888 1.688461
## 2 age 0.000873 0.017903
## 3 x2 0.000008 0.014689
## 4 x3 0.000000 0.004882
## 5 x4 -0.000000 -0.006112
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.1288209
## RMSE: 0.3589162
## MAE: 0.2734149
## RMSLE: 0.05572236
## Mean Residual Deviance : 0.004349371
## R^2 : 0.1923687
## Null Deviance :13.97529
## Null D.o.F. :2593
## Residual Deviance :11.28227
## Residual D.o.F. :2589
## AIC :NaN
##
##
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
##
## MSE: 0.1441667
## RMSE: 0.3796929
## MAE: 0.2759836
## RMSLE: 0.05800154
## Mean Residual Deviance : 0.004730564
## R^2 : 0.1872527
## Null Deviance :6.191245
## Null D.o.F. :1057
## Residual Deviance :5.004937
## Residual D.o.F. :1053
## AIC :NaN
##
##
## H2ORegressionMetrics: glm
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1289211
## RMSE: 0.3590559
## MAE: 0.2735421
## RMSLE: 0.05574573
## Mean Residual Deviance : 0.004352769
## R^2 : 0.1917403
## Null Deviance :13.97877
## Null D.o.F. :2593
## Residual Deviance :11.29108
## Residual D.o.F. :2589
## AIC :NaN
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 0.27340493 0.008355091 0.28451344 0.27967656
## mse 0.1288284 0.010064627 0.14173545 0.13876098
## null_deviance 1.3978775 0.091519505 1.495645 1.6113418
## r2 0.19276007 0.023682296 0.18898739 0.18601803
## residual_deviance 1.1277183 0.08511159 1.2103963 1.309848
## rmse 0.35837132 0.014114098 0.3764777 0.37250635
## rmsle 0.055655014 0.0020188198 0.05823262 0.057491224
## cv_3_valid cv_4_valid cv_5_valid cv_6_valid
## mae 0.25561687 0.25966153 0.2776107 0.2569407
## mse 0.10940834 0.11454897 0.13530691 0.10866868
## null_deviance 1.2263895 1.2794584 1.4159609 1.3167939
## r2 0.18194892 0.24431878 0.15211809 0.24990796
## residual_deviance 1.0029817 0.97199255 1.196076 0.9828831
## rmse 0.33076933 0.33845085 0.36784086 0.32964933
## rmsle 0.05149267 0.053131666 0.057332374 0.051421497
## cv_7_valid cv_8_valid cv_9_valid cv_10_valid
## mae 0.27796203 0.29192773 0.26915556 0.28098413
## mse 0.12994897 0.15044233 0.11830346 0.14115998
## null_deviance 1.3863152 1.6134675 1.2830396 1.3503631
## r2 0.22208755 0.18595453 0.16706163 0.14919774
## residual_deviance 1.0793375 1.3107362 1.0699044 1.1430275
## rmse 0.36048436 0.38786897 0.34395272 0.37571263
## rmsle 0.05614277 0.06002886 0.053676162 0.057600282
When validated on the independent test set, the model’s performance is as follows:
h2o.performance(h2ofit,wtest)
## H2ORegressionMetrics: glm
##
## MSE: 0.1212607
## RMSE: 0.3482251
## MAE: 0.2590904
## RMSLE: 0.05351539
## Mean Residual Deviance : 0.004026166
## R^2 : 0.2356361
## Null Deviance :5.602816
## Null D.o.F. :1059
## Residual Deviance :4.267736
## Residual D.o.F. :1055
## AIC :NaN
predh2o=predict(h2ofit,newdata=wtest,type="response")%>%as.vector()# h2O
predg1=predict(mod2,newdata=testset,type="response")#Penalized spline
predg5=predict(mod1,newdata=testset,type="response")#Cubic spline
predg2=predict(opmod4,newdata=testset,type="response")#Orthogonal 4
predg3=predict(fpmod3,newdata=testset,type="response")#Fractioned cubic
predg4=predict(ppmod2,newdata=testset,type="response")#Piecewise polynomial (bs, df=4)
predf=cbind(predg1,predg5,predg2,predg3,predg4,predh2o,testset$gh,testset$age)%>%as_tibble()
names(predf)=c("Penalizedspline","Cubicspline","Orthogonal4","Fractioned3","Piecewise5","h2oLasso","Truth","Age")
predlong=predf%>%gather(Penalizedspline:h2oLasso,key="Model",value="Predicted")
The following figure shows the prediction curves of the 6 models, including the penalised model trained with h2o (labelled h2oLasso in the code):
predlong%>%ggplot()+geom_point(aes(x=Age,y=Truth),alpha=0.1,color="grey")+geom_smooth(aes(x=Age,y=Predicted,color=Model,fill=Model),show.legend = F)+facet_wrap(~Model,ncol=3)
EVALUATING THE PERFORMANCE BY MLR
In the final step, we will use the mlr package to evaluate the performance of the 6 models.
8 metrics will be evaluated (a small hand-computed sketch of their definitions follows this list), including:
expvar = Explained variance (best =1, worst =0), defined as explained_sum_of_squares / total_sum_of_squares.
mae = Mean of absolute errors. Defined as: mean(abs(response - truth))
medae = Median of absolute errors
medse = Median of squared errors
mse = Mean of squared errors. Defined as: mean((response - truth)^2)
rsq = R-squared (coefficient of determination), which is 1 - residual_sum_of_squares / total_sum_of_squares.
rmse= Root mean squared error. The RMSE is aggregated as sqrt(mean(rmse.vals.on.test.sets^2))
rrse = Root relative squared error. Defined as sqrt (sum_of_squared_errors / total_sum_of_squares). Undefined for single instances and when every truth value is identical. In this case the output will be NA.
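As a small illustration of how these metrics are defined, here is a hand-computed sketch on an arbitrary pair of truth/response vectors (values are purely illustrative):
truth=c(5.1,5.4,5.8,6.0) #illustrative "true" gh values
response=c(5.2,5.3,5.9,5.7) #illustrative predictions
mae=mean(abs(response-truth))
mse=mean((response-truth)^2)
rmse=sqrt(mse)
rsq=1-sum((truth-response)^2)/sum((truth-mean(truth))^2)
rrse=sqrt(sum((response-truth)^2)/sum((truth-mean(truth))^2))
c(mae=mae,mse=mse,rmse=rmse,rsq=rsq,rrse=rrse)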
library(mlr)
regr.task= mlr::makeRegrTask(id = "gh", data=testset, target = "gh")
regr.lrn = makeLearner("regr.glm")
#Fit a dummy learner only to obtain mlr Prediction objects, whose responses are then overwritten with each model's predictions
dummy=mlr::train(regr.lrn,regr.task)
dumpred1=predict(dummy,regr.task)
dumpred1$data$response<-predg1
dumpred2=predict(dummy,regr.task)
dumpred2$data$response<-predg2
dumpred3=predict(dummy,regr.task)
dumpred3$data$response<-predg3
dumpred4=predict(dummy,regr.task)
dumpred4$data$response<-predg4
dumpred5=predict(dummy,regr.task)
dumpred5$data$response<-predg5
dumpredh2o=predict(dummy,regr.task)
dumpredh2o$data$response<-predh2o
mets=list(rsq,expvar,mae,rmse,mse,medse,medae,rrse)
dp1=performance(dumpred1,measures =mets)
dp2=performance(dumpred2,measures =mets)
dp3=performance(dumpred3,measures =mets)
dp4=performance(dumpred4,measures =mets)
dp5=performance(dumpred5,measures =mets)
dph2o=performance(dumpredh2o,measures =mets)
dpd=cbind(dp1,dp2,dp3,dp4,dp5,dph2o)%>%as_tibble()
dpd$Metric=c("R2","ExpVAR","MAE","RMSE","MSE","MEDSE","MEDAE","RRSE")
names(dpd)=c("Penalizedspline","Cubicspline","Orthogonal4","Fractioned3","Piecewise5","h2oLasso","Metric")
Here is the verdict of our match:
cbind(dpd)
## Penalizedspline Cubicspline Orthogonal4 Fractioned3 Piecewise5
## 1 0.24434196 0.24530770 0.2463909 0.24605791 0.24448688
## 2 0.20941423 0.21181764 0.2110116 0.21098507 0.20813995
## 3 0.25685386 0.25630161 0.2559750 0.25602488 0.25690363
## 4 0.34623635 0.34601504 0.3457666 0.34584301 0.34620315
## 5 0.11987961 0.11972641 0.1195546 0.11960739 0.11985662
## 6 0.04103231 0.04139082 0.0404188 0.04050637 0.04103404
## 7 0.20256432 0.20344734 0.2010443 0.20126195 0.20256793
## 8 0.86928594 0.86873028 0.8681066 0.86829839 0.86920258
## h2oLasso Metric
## 1 0.23563604 R2
## 2 0.18672473 ExpVAR
## 3 0.25909037 MAE
## 4 0.34822513 RMSE
## 5 0.12126074 MSE
## 6 0.04068146 MEDSE
## 7 0.20169621 MEDAE
## 8 0.87427911 RRSE
dpdlong=gather(dpd,Penalizedspline:h2oLasso,key="Model",value="Score")
library(viridis)
dpdlong%>%ggplot(aes(x=reorder(Model,-Score),y=reorder(Metric,Score),fill=Score,label=round(Score,3)))+geom_tile(color="black")+geom_text(color = "white")+scale_fill_viridis(option="A",begin=0.2,end=0.7)+scale_x_discrete("Models")
dpdlong%>%ggplot(aes(x=Model,y=Score,color=Model))+geom_point(size=3,show.legend = F)+coord_flip()+facet_wrap(~Metric,ncol=2,scales="free_x")
Conclusion
In this case study, we explored two R packages, GAMLSS and H2O, that support polynomial regression tasks. Based on the results, the GAMLSS package won the match, as its polynomial models showed a higher R2 and lower prediction errors than the penalised model trained with H2O. Overall, age alone, in any of the six models, explains only about 20% of the variance of gh in the healthy American population studied. Cubic smoothing splines are a powerful technique for extending the capacity of a basic linear model, and a simple orthogonal polynomial regression worked better here than the regularised h2o GLM model. Better tuning might improve the h2o GLM model’s performance, but GAMLSS remains a remarkably flexible option among GLM packages.
END