Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
Encourage the use of machine learning techniques in medical research in Vietnam, and
Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Introduction
In conventional statistics, polynomial regression is defined as a special case of multiple linear regression in which the relationship between the response variable Y and the predictors X is modelled by a polynomial function of degree >= 2. Though the polynomial function is non-linear in X, the model remains linear in its parameters. Essentially, a polynomial model turns a straight regression line into a curve.
In machine learning, polynomial regression is a technique for increasing the capacity of the basic linear model. According to Ian Goodfellow et al. (Chapter 5, Deep Learning, 2016), adding polynomial terms to a linear regression model can be seen as a way to improve prediction accuracy by extending the model's capacity. Models with too little capacity may struggle to fit the training set, while models with excessive capacity may overfit by memorizing properties of the training set and performing poorly on unseen data (the test set).
Polynomial regression can be very useful for modelling nonlinear variation such as human growth, the progression of a biomarker, or the accumulation of events.
In practice, a polynomial regression could be considered in situations such as:
When we understand the underlying theoretical law of the variation well (growth of cells, degradation of molecules, progression of count data, ...).
When a scatter plot shows a curvilinear relationship between the response and the explanatory variable, for example the growth of lung volume as a function of age.
When the residuals of a basic linear model show a systematic pattern, which can be a good sign that a polynomial model is more appropriate than the basic linear model (a minimal fitting sketch follows this list).
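To make this concrete, here is a minimal, self-contained sketch (on simulated data, not the NHANES data analysed below) of how a curvilinear pattern in the residuals of a straight-line fit can motivate a polynomial term:
set.seed(1)
x=runif(200,0,10)
y=2+0.5*x-0.04*x^2+rnorm(200,sd=0.3) # the true relationship is curvilinear
lin=lm(y~x) # basic linear model
quad=lm(y~poly(x,2)) # quadratic (orthogonal) polynomial model
plot(fitted(lin),resid(lin)) # a systematic curve here suggests a polynomial fit
AIC(lin,quad) # the quadratic model should show the lower AIC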
GAMLSS package
Introduced in 2008, the GAMLSS package allows the user to fit a separate model for each parameter (Mu, Sigma, Nu and Tau) of more than 70 distribution families, as linear, polynomial, parametric or non-parametric (smoothing) functions of the predictors (including fixed and random effect terms). This makes GAMLSS one of the most flexible tools for statistical modelling. Several types of additive terms are implemented in GAMLSS.
The GAMLSS package can also be considered a true machine learning tool: every model is “trained” iteratively until convergence and can be cross-validated on randomly split subsets. GAMLSS also provides a tuning process that can determine the best distribution family and the best degrees-of-freedom value for smoothing splines and polynomial functions. Some of these features will be explored in our experiment.
More details can be found on the GAMLSS website: http://www.gamlss.org/
H2O GLM algorithm
Introduced seven years later, the h2o GLM module is one of the most comprehensive GLM-based machine learning algorithms for supervised classification and regression tasks. First, h2o GLM supports a wide range of distribution families, including Gaussian, Poisson, Binomial, Gamma and Tweedie. Second, it integrates elastic-net regularization for model shrinkage, with Ridge, Lasso and Elastic Net modes; the penalization strength can be tuned automatically or manually via the lambda hyperparameter. h2o GLM also offers several solvers, including IRLSM, L-BFGS and coordinate descent.
Technically, h2o GLM allows performing a polynomial regression with both Gaussian and Gamma distributions, and even combining the polynomial terms with a Lasso or Ridge penalty to optimize the model’s capacity and avoid overfitting.
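As an illustration, here is a minimal, hedged sketch (on simulated data with illustrative parameter values, not the NHANES pipeline used later) of combining orthogonal polynomial features built with poly() with an elastic-net penalty in h2o.glm():
library(h2o)
h2o.init(nthreads=-1)
set.seed(1)
x=runif(500,20,80)
y=rgamma(500,shape=50,rate=50/exp(1.6+0.002*x)) # gamma-distributed outcome with mean exp(1.6+0.002*x)
op=poly(x,degree=4) # orthogonal polynomial basis
df=data.frame(y=y,p1=op[,1],p2=op[,2],p3=op[,3],p4=op[,4])
hf=as.h2o(df)
fit=h2o.glm(x=c("p1","p2","p3","p4"),y="y",training_frame=hf,
            family="gamma",link="log",
            alpha=1,lambda_search=TRUE) # alpha = 1 is the Lasso; alpha = 0 would be Ridge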
For this reason, we would like to compare the h2o GLM algorithm with the GAMLSS polynomial functions in the present study.
Objective
The present case study aims to compare the performance of two packages, GAMLSS and H2O, on a simple polynomial regression task with one predictor and a gamma-distributed outcome. Through this experiment, we will explore the following techniques:
Exploring the distribution family using GAMLSS
Training five different types of polynomial models in GAMLSS (cubic smoothing splines, penalized splines, piecewise, fractional and orthogonal polynomials).
Training a penalized polynomial Gamma regression model with the H2O GLM algorithm.
Combining different packages: h2o, gamlss, mlr and caret
This case study uses part of the well-known NHANES dataset. The main objective is to predict the level of glycohemoglobin (gh) as a function of age in healthy (non-diabetic) Americans.
Materials and method
First, we will prepare the ggplot theme for our experiment
library(tidyverse)
my_theme <- function(base_size = 10, base_family = "sans"){
theme_minimal(base_size = base_size, base_family = base_family) +
theme(
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
axis.title = element_text(size = 12),
panel.grid.major = element_line(color = "grey"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#fcf9ff"),
strip.background = element_rect(fill = "#3d0066", color = "#3d0066", size =0.5),
strip.text = element_text(face = "bold", size = 10, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
)
}
theme_set(my_theme())
myfill=c("#f32440","#249ff2","#ba33e8","#ffbd0a")
mycolor=c("#b20018","#012670","#2b0354","#c42f01")
Now we load the nhgh dataset with the getHdata() function (from the Hmisc/rms packages) and perform a descriptive analysis:
require(rms)
getHdata(nhgh)
#Intended subset: healthy (dx != 1), Non-Hispanic White subjects only
data=nhgh%>%as_tibble()%>%na.omit()%>%subset(.,re=="Non-Hispanic White" & dx!=1)%>%.[,c("sex","age","gh")]
#Note: this second assignment overwrites the previous one and keeps all healthy (dx != 1) subjects
data=nhgh%>%as_tibble()%>%na.omit()%>%subset(.,dx!=1)%>%.[,c("sex","age","gh")]
age=data$age
gh=data$gh
sex=as.factor(data$sex)
data=as.data.frame(cbind(sex,age,gh))%>%as_tibble()
library(magrittr) #needed for the %<>% assignment pipe
data$sex%<>%as.factor()%>%recode_factor(.,`1` = "Male", `2` = "Female")
data=subset(data,gh<8)
Hmisc::describe(data)
## data
##
## 3 Variables 4712 Observations
## ---------------------------------------------------------------------------
## sex
## n missing distinct
## 4712 0 2
##
## Value Male Female
## Frequency 2424 2288
## Proportion 0.514 0.486
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 4712 0 804 1 41.2 23.55 13.92 15.75
## .25 .50 .75 .90 .95
## 21.92 39.50 57.02 72.67 79.42
##
## lowest : 12.00000 12.08333 12.16667 12.25000 12.33333
## highest: 79.66667 79.75000 79.83333 79.91667 80.00000
## ---------------------------------------------------------------------------
## gh
## n missing distinct Info Mean Gmd .05 .10
## 4712 0 40 0.993 5.417 0.4398 4.8 4.9
## .25 .50 .75 .90 .95
## 5.2 5.4 5.6 5.9 6.1
##
## lowest : 4.0 4.1 4.2 4.3 4.4, highest: 7.5 7.6 7.7 7.8 7.9
## ---------------------------------------------------------------------------
Data visualisation
First, we would like to know how the outcome is distributed among 4712 subjects:
data%>%ggplot(aes(x=gh,fill=sex,color=sex))+geom_histogram(alpha=0.5,bins=40)+scale_color_manual(values=mycolor)+scale_fill_manual(values=myfill)+ggtitle("Glycohemoglobin")
data%>%ggplot(aes(x=age,fill=sex,color=sex))+geom_histogram(alpha=0.5,bins=40)+scale_color_manual(values=mycolor)+scale_fill_manual(values=myfill)+ggtitle("Age")
To determine whether the models should be built separately for each gender, we plot gh as a function of age for males and females:
data%>%ggplot(aes(x=age,y=gh,fill=sex,color=sex))+geom_point(shape=21,alpha=0.2,size=2)+geom_smooth()+scale_color_manual(values=mycolor)+scale_fill_manual(values=myfill)
The graph indicates that we can build a single model for both males and females, as the two regression curves show no noticeable difference.
The figure above used geom_smooth’s default smoother. Now we will explore other polynomial function types on our dataset:
p1=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="grey")+geom_smooth(method = "lm", formula = y ~ poly(x, 1),color="black",fill="grey80")+ggtitle("Linear")
p2=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#f32440")+geom_smooth(method = "lm", formula = y ~ poly(x, 2),color="#b20018",fill="#f32440")+ggtitle("Quadratic")
p3=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#249ff2")+geom_smooth(method = "lm", formula = y ~ poly(x, 3),color="#012670",fill="#249ff2")+ggtitle("Cubic")
p4=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#ba33e8")+geom_smooth(method = "lm", formula = y ~ poly(x, 4),color="#2b0354",fill="#ba33e8")+ggtitle("Quartic")
p5=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="gold")+geom_smooth(method = "gam",color="orangered",fill="orange")+ggtitle("GAM")
p6=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.05,color="#bde509")+geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3),color="darkgreen",fill="#bde509")+ggtitle("LM+Cubic Spline")
library(gridExtra)
grid.arrange(p1,p5,p2,p3,p4,p6,ncol=3)
It seems that a cubic or quartic polynomial model, or a model with a cubic spline, might be more appropriate for our goal.
Machine learning experiment
First, we will split the original dataset into 3 subsets:
#Generating higher degree polynomial function on X:
data=data%>%mutate(x2=age^2,x3=age^3,x4=age^4)
#Data splitting
library(caret)
set.seed(123)
idTrain=caret::createDataPartition(y=data$gh,p=0.55,list=FALSE)
trainset=data[idTrain,]
remain=data[-idTrain,]
idTest=caret::createDataPartition(y=remain$gh,p=0.5,list=FALSE)
testset=remain[idTest,]
validset=remain[-idTest,]
rm(idTrain,idTest,remain)
The following figures confirm that the distribution of gh and its relationship with age are unchanged across the 3 subsets and compared to the original sample:
s1=data%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="#f32440")+ggtitle("Origin")
s2=trainset%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="#249ff2")+ggtitle("Train")
s3=validset%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="#ba33e8")+ggtitle("Valid")
s4=testset%>%ggplot(aes(x=age,y=gh))+geom_point(alpha=0.1,color="gold")+ggtitle("Test")
grid.arrange(s1,s2,s3,s4)
d1=data%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="#f32440")+ggtitle("Origin")
d2=trainset%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="#249ff2")+ggtitle("Train")
d3=validset%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="#ba33e8")+ggtitle("Valid")
d4=testset%>%ggplot(aes(x=gh))+geom_density(alpha=0.5,fill="gold")+ggtitle("Test")
grid.arrange(d1,d2,d3,d4)
As the outcome (gh) is a continuous, positive variable, the next step is to decide whether a Gaussian (normal) or a Gamma distribution should be used to model gh. This can be done with the GAMLSS package:
library(gamlss)
gh=trainset$gh
xmin=min(trainset$age)
xmax=max(trainset$age)
mGA=histDist(gh,"GA",xmin=xmin,xmax=xmax,density=T)
mNO=histDist(gh,"NO",xmin=xmin,xmax=xmax,density=T)
GAIC(mNO,mGA,k=log(length(gh)))
## df AIC
## mGA 2 2577.766
## mNO 2 2615.414
Based on this information criterion (GAIC with k = log(n), i.e. the SBC), we should adopt the Gamma distribution family instead of the Normal for our model.
The pdf of the gamma distribution, denoted GA(mu, sigma), is defined by:
\[ f_{Y}(y\mid\mu,\sigma)= \frac{1}{(\sigma^{2}\mu)^{1/\sigma^{2}}}\,\frac{y^{1/\sigma^{2}-1}\,e^{-y/(\sigma^{2}\mu)}}{\Gamma(1/\sigma^{2})} \]
where y, mu and sigma are all positive.
Mean: \[E(Y)=\mu\]
Variance: \[Var(Y)= \sigma^2\mu^2\]
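As a quick sanity check, the density above can be compared numerically with the dGA() function of gamlss.dist, using illustrative values (the sigma here is larger than the one fitted below, purely for readability):
library(gamlss.dist)
y=5.4; mu=5.4; sigma=0.3 #illustrative values only
manual=(1/(sigma^2*mu)^(1/sigma^2))*y^(1/sigma^2-1)*exp(-y/(sigma^2*mu))/gamma(1/sigma^2)
c(manual=manual,dGA=dGA(y,mu=mu,sigma=sigma)) #the two values should agree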
1) CUBIC SMOOTHING SPLINE BY GAMLSS
The cs() (cubic smoothing spline) function is similar to R’s smooth.spline() function and can be used for univariate smoothing.
The first thing to consider when adding more terms to a linear model is how many degrees of freedom (df) to spend. The available df are determined by the number of observations in the dataset, the distribution family in use and the number of predictors, including the intercepts. Additive terms such as higher-degree polynomials, smoothing splines or random effects each consume df. Degrees of freedom should be treated like a budget: care should be taken on how and where we spend them.
The following code tunes the df values for the Mu and Sigma models; the decision is based on the AIC (smaller is better). The optimal df values turn out to be about 4.175 for Mu and 1 (i.e. linear) for Sigma. The resulting cubic smoothing spline model is a black box, with a total of about 9.175 effective degrees of freedom.
fnaic=function(p)AIC(gamlss(gh~cs(age,df=p[1]),sigma.fo=~cs(age,df=p[2]),family=GA,data=trainset,trace=FALSE),k=2)
opAIC=optim(par=c(1,1),fnaic,method="L-BFGS-B",lower=c(1,1),upper=c(15,15))
opAIC$par
## [1] 4.175334 1.000000
mod1= gamlss(gh~cs(age,df=opAIC$par[1]),sigma.fo=~cs(age,df=opAIC$par[2]),family=GA,data=trainset,trace = FALSE)
summary(mod1)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call:
## gamlss(formula = gh ~ cs(age, df = opAIC$par[1]), sigma.formula = ~cs(age,
## df = opAIC$par[2]), family = GA, data = trainset, trace = FALSE)
##
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.624034 0.002688 604.12 <2e-16 ***
## cs(age, df = opAIC$par[1]) 0.001560 0.000063 24.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.9163290 0.0313413 -93.051 < 2e-16 ***
## cs(age, df = opAIC$par[2]) 0.0043138 0.0006795 6.348 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## NOTE: Additive smoothing terms exist in the formulas:
## i) Std. Error for smoothers are for the linear effect only.
## ii) Std. Error for the linear terms maybe are not accurate.
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 9.174846
## Residual Deg. of Freedom: 2584.825
## at cycle: 3
##
## Global Deviance: 1907.401
## AIC: 1925.75
## SBC: 1979.524
## ******************************************************************
The fitted spline values as a function of age can be plotted as follows:
mus=cbind(trainset$age,mod1$mu.s,mod1$sigma.s)%>%as_tibble()
names(mus)=c("Age","Mu","Sigma")
mus%>%gather(.,Mu:Sigma,key="Parameter",value="Spline")%>%ggplot(aes(x=Age,y=Spline,color=Parameter))+geom_line()+scale_color_manual(values=c("red","blue"))+ggtitle("Cubic splines")
2) PENALIZED SPLINE IN GAMLSS
Penalized splines are piecewise polynomials defined by B-spline basis functions of the explanatory variable, where the coefficients of the basis functions are penalized to guarantee sufficient smoothness without overfitting. To define a penalized spline we need to specify: the number of knots on the x scale and where to put them (equal spacing is used), the degree of the piecewise polynomial used in the B-spline basis, the order of differences in the penalty matrix (indicating the type of penalty imposed), and the amount of smoothing required, defined by the df.
A penalized spline model can be built very simply in GAMLSS as follows. This model has df values of about 6.55 for Mu and 2.45 for Sigma; it is also a black-box model.
mod2=gamlss(gh~pb(age),sigma.fo=~pb(age),family=GA,data=trainset,trace=FALSE,gd.tol=10)
mod2$mu.df
mod2$sigma.df
summary(mod2)
mus2=cbind(trainset$age,mod2$mu.s,mod2$sigma.s)%>%as_tibble()
names(mus2)=c("Age","Mu","Sigma")
mus2%>%gather(.,Mu:Sigma,key="Parameter",value="Spline")%>%ggplot(aes(x=Age,y=Spline,color=Parameter))+geom_line()+scale_color_manual(values=c("red","blue"))+ggtitle("Penalized splines")
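The pb() call above chooses the amount of smoothing automatically. As a hedged sketch (assuming the pb()/pb.control() interface of gamlss), the settings described earlier can also be made explicit, here with 20 knot intervals, a degree-3 B-spline basis and a 2nd-order difference penalty:
mod2b=gamlss(gh~pb(age,control=pb.control(inter=20,degree=3,order=2)),sigma.fo=~pb(age),family=GA,data=trainset,trace=FALSE)
mod2b$mu.df #effective degrees of freedom chosen for the Mu model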
We can also combine the two approaches: a cubic smoothing spline model with the df values taken from the penalised model. However, a comparison based on AIC shows that the cubic smoothing spline model (mod1) remains the best, as it has a lower df cost and a lower AIC than the combined solution.
#Combining them together:
mod3=gamlss(gh~cs(age,df=mod2$mu.df),sigma.fo=~cs(age,df=mod2$sigma.df),family=GA,data=trainset,trace=FALSE)
GAIC(mod1,mod2,mod3)
## df AIC
## mod1 9.174846 1925.750
## mod2 8.999424 1925.927
## mod3 12.998841 1927.649
3) PIECEWISE POLYNOMIAL
The piecewise polynomial (B-spline) model with 4 df is better than the models with 3 or 5 df:
#Piecewise polynomial
ppmod1=gamlss(gh~bs(age,df=3),sigma.fo=~bs(age,df=3),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1917.233
## GAMLSS-RS iteration 2: Global Deviance = 1917.051
## GAMLSS-RS iteration 3: Global Deviance = 1917.051
ppmod2=gamlss(gh~bs(age,df=4),sigma.fo=~bs(age,df=4),data=trainset,family=GA)#Best
## GAMLSS-RS iteration 1: Global Deviance = 1907.824
## GAMLSS-RS iteration 2: Global Deviance = 1907.771
## GAMLSS-RS iteration 3: Global Deviance = 1907.771
ppmod3=gamlss(gh~bs(age,df=5),sigma.fo=~bs(age,df=5),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1904.142
## GAMLSS-RS iteration 2: Global Deviance = 1904.137
## GAMLSS-RS iteration 3: Global Deviance = 1904.137
GAIC(ppmod1,ppmod2,ppmod3)
## df AIC
## ppmod2 10 1927.771
## ppmod3 12 1928.137
## ppmod1 8 1933.051
summary(ppmod2)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call: gamlss(formula = gh ~ bs(age, df = 4), sigma.formula = ~bs(age,
## df = 4), family = GA, data = trainset)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.671842 0.004286 390.031 < 2e-16 ***
## bs(age, df = 4)1 -0.046937 0.009562 -4.908 9.75e-07 ***
## bs(age, df = 4)2 0.030834 0.009337 3.302 0.000972 ***
## bs(age, df = 4)3 0.071424 0.010994 6.496 9.83e-11 ***
## bs(age, df = 4)4 0.068387 0.006548 10.444 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.96003 0.05563 -53.211 < 2e-16 ***
## bs(age, df = 4)1 0.22573 0.11464 1.969 0.049062 *
## bs(age, df = 4)2 0.15610 0.09901 1.577 0.114995
## bs(age, df = 4)3 0.42136 0.11778 3.577 0.000353 ***
## bs(age, df = 4)4 0.31937 0.07356 4.342 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 10
## Residual Deg. of Freedom: 2584
## at cycle: 3
##
## Global Deviance: 1907.771
## AIC: 1927.771
## SBC: 1986.381
## ******************************************************************
4) FRACTIONAL POLYNOMIAL
Fractional polynomials (fp) are an implementation of the original function introduced by Royston and Altman (1994), with a later version created by Gareth Ambler in S-PLUS. This function generates the right design matrix for fitting a power polynomial. For given powers, the function can be used in the same way as the orthogonal polynomial function (poly) or the piecewise polynomial function (bs) of the “splines” package. The fp function fits the best fractional polynomial among a specified set of power values.
The comparison below shows that the third-order fractional polynomial model was better than the first- and second-order models.
#Fractional polynomial
fpmod1=gamlss(gh~fp(age,1),sigma.fo=~fp(age,1),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1978.003
## GAMLSS-RS iteration 2: Global Deviance = 1977.693
## GAMLSS-RS iteration 3: Global Deviance = 1977.692
fpmod2=gamlss(gh~fp(age,2),sigma.fo=~fp(age,2),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1923.009
## GAMLSS-RS iteration 2: Global Deviance = 1923.005
## GAMLSS-RS iteration 3: Global Deviance = 1923.005
fpmod3=gamlss(gh~fp(age,3),sigma.fo=~fp(age,3),data=trainset,family=GA)#Best
## GAMLSS-RS iteration 1: Global Deviance = 1909.14
## GAMLSS-RS iteration 2: Global Deviance = 1909.015
## GAMLSS-RS iteration 3: Global Deviance = 1909.015
GAIC(fpmod1,fpmod2,fpmod3)
## df AIC
## fpmod3 14 1937.015
## fpmod2 10 1943.005
## fpmod1 6 1989.692
summary(fpmod3)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call: gamlss(formula = gh ~ fp(age, 3), sigma.formula = ~fp(age,
## 3), family = GA, data = trainset)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.688351 0.001258 1342 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.73795 0.01387 -197.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## NOTE: Additive smoothing terms exist in the formulas:
## i) Std. Error for smoothers are for the linear effect only.
## ii) Std. Error for the linear terms maybe are not accurate.
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 14
## Residual Deg. of Freedom: 2580
## at cycle: 3
##
## Global Deviance: 1909.015
## AIC: 1937.015
## SBC: 2019.069
## ******************************************************************
5) ORTHOGONAL POLYNOMIAL
The poly() function generates orthogonal polynomials which represent a basis for polynomial regression.
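As a small illustration on the training data already loaded, the columns returned by poly() are orthonormal, which is what makes the fitted polynomial terms numerically stable and mutually uncorrelated (unlike raw powers of age):
P=poly(trainset$age,degree=4)
round(crossprod(P),10) #approximately a 4 x 4 identity matrix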
The model comparison indicates that the Quartic (4th degree) model was the best among 5 polynomial degrees (linear to quintic)
#Orthogonal polynomial
opmod1=gamlss(gh~poly(age,1),sigma.fo=~poly(age,1),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1980.369
## GAMLSS-RS iteration 2: Global Deviance = 1980.295
## GAMLSS-RS iteration 3: Global Deviance = 1980.295
opmod2=gamlss(gh~poly(age,2),sigma.fo=~poly(age,2),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1973.264
## GAMLSS-RS iteration 2: Global Deviance = 1972.118
## GAMLSS-RS iteration 3: Global Deviance = 1972.116
## GAMLSS-RS iteration 4: Global Deviance = 1972.116
opmod3=gamlss(gh~poly(age,3),sigma.fo=~poly(age,3),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1917.233
## GAMLSS-RS iteration 2: Global Deviance = 1917.051
## GAMLSS-RS iteration 3: Global Deviance = 1917.051
opmod4=gamlss(gh~poly(age,4),sigma.fo=~poly(age,4),data=trainset,family=GA)#Best
## GAMLSS-RS iteration 1: Global Deviance = 1906.575
## GAMLSS-RS iteration 2: Global Deviance = 1906.532
## GAMLSS-RS iteration 3: Global Deviance = 1906.532
opmod5=gamlss(gh~poly(age,5),sigma.fo=~poly(age,5),data=trainset,family=GA)
## GAMLSS-RS iteration 1: Global Deviance = 1905.024
## GAMLSS-RS iteration 2: Global Deviance = 1905.019
## GAMLSS-RS iteration 3: Global Deviance = 1905.019
GAIC(opmod1,opmod2,opmod3,opmod4,opmod5)
## df AIC
## opmod4 10 1926.531
## opmod5 12 1929.019
## opmod3 8 1933.051
## opmod2 6 1984.116
## opmod1 4 1988.296
summary(opmod4)
## ******************************************************************
## Family: c("GA", "Gamma")
##
## Call: gamlss(formula = gh ~ poly(age, 4), sigma.formula = ~poly(age,
## 4), family = GA, data = trainset)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.688424 0.001281 1318.289 < 2e-16 ***
## poly(age, 4)1 1.632274 0.066025 24.722 < 2e-16 ***
## poly(age, 4)2 0.093934 0.065474 1.435 0.15150
## poly(age, 4)3 -0.459222 0.065839 -6.975 3.87e-12 ***
## poly(age, 4)4 0.189451 0.065241 2.904 0.00372 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.73845 0.01387 -197.384 < 2e-16 ***
## poly(age, 4)1 4.51055 0.70688 6.381 2.08e-10 ***
## poly(age, 4)2 -1.30912 0.70708 -1.851 0.0642 .
## poly(age, 4)3 0.21370 0.71692 0.298 0.7657
## poly(age, 4)4 -0.96991 0.69290 -1.400 0.1617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 2594
## Degrees of Freedom for the fit: 10
## Residual Deg. of Freedom: 2584
## at cycle: 3
##
## Global Deviance: 1906.531
## AIC: 1926.531
## SBC: 1985.141
## ******************************************************************
Finally, we find that the cubic smoothing spline model is the best solution; there is little difference between the orthogonal, piecewise and penalised spline methods, and the fractional polynomial is the worst solution.
GAIC(mod2,mod1,opmod4,fpmod3,ppmod2)
## df AIC
## mod1 9.174846 1925.750
## mod2 8.999424 1925.927
## opmod4 10.000000 1926.531
## ppmod2 10.000000 1927.771
## fpmod3 14.000000 1937.015
H2O GLM
Now we will train another polynomial model using the h2o GLM algorithm. First, we need to initialise h2o from R:
library(h2o)
h2o.init(nthreads = -1,max_mem_size ="4g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 14 minutes
## H2O cluster version: 3.10.3.6
## H2O cluster version age: 2 months and 15 days
## H2O cluster name: H2O_started_from_R_Admin_jya313
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.39 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.3.1 (2016-06-21)
wdata=as.h2o(data)
wtrain=as.h2o(trainset)
wtest=as.h2o(testset)
wvalid=as.h2o(validset)
Details of the GLM arguments and parameters could be found here:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
In our experiment, we will train a GLM model on the training set with 4 predictors: age, age^2, age^3 and age^4 (i.e. up to the 4th degree). The model uses a Gamma distribution with a logarithmic link function. We also apply an elastic-net penalty by setting alpha = 0, which in h2o corresponds to Ridge regularisation (alpha = 1 would give the Lasso). The regularisation strength is determined automatically through a search over 150 candidate lambda values. The model is trained with 10-fold cross-validation, a fixed validation subset and early stopping.
response="gh"
features=setdiff(colnames(wdata),c(response,"sex"))
h2ofit=h2o.glm(x=features,
y=response,
training_frame=wtrain,
validation_frame = wvalid,
nfolds = 10,seed = 123,
keep_cross_validation_predictions = TRUE,
keep_cross_validation_fold_assignment = TRUE,
fold_assignment="AUTO",
score_each_iteration = TRUE,
family = "gamma",link="log",
solver = "L_BFGS",
lambda_search = TRUE,alpha=0,nlambdas =150,
early_stopping = TRUE,
standardize = TRUE,
missing_values_handling ="Skip",
compute_p_values = FALSE,
remove_collinear_columns = FALSE,
intercept = TRUE,
non_negative = FALSE
)
h2ofit
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model ID: GLM_model_R_1494158163717_4
## GLM Model: summary
## family link regularization
## 1 gamma log Ridge ( lambda = 0.03122 )
## lambda_search
## 1 nlambda = 150, lambda.max = 3.2195, lambda.min = 0.02845, lambda.1se = 1.3976
## number_of_predictors_total number_of_active_predictors
## 1 4 4
## number_of_iterations training_frame
## 1 103 trainset
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 1.635888 1.688461
## 2 age 0.000873 0.017903
## 3 x2 0.000008 0.014689
## 4 x3 0.000000 0.004882
## 5 x4 -0.000000 -0.006112
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.1288209
## RMSE: 0.3589162
## MAE: 0.2734149
## RMSLE: 0.05572236
## Mean Residual Deviance : 0.004349371
## R^2 : 0.1923687
## Null Deviance :13.97529
## Null D.o.F. :2593
## Residual Deviance :11.28227
## Residual D.o.F. :2589
## AIC :NaN
##
##
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
##
## MSE: 0.1441667
## RMSE: 0.3796929
## MAE: 0.2759836
## RMSLE: 0.05800154
## Mean Residual Deviance : 0.004730564
## R^2 : 0.1872527
## Null Deviance :6.191245
## Null D.o.F. :1057
## Residual Deviance :5.004937
## Residual D.o.F. :1053
## AIC :NaN
##
##
## H2ORegressionMetrics: glm
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1289211
## RMSE: 0.3590559
## MAE: 0.2735421
## RMSLE: 0.05574573
## Mean Residual Deviance : 0.004352769
## R^2 : 0.1917403
## Null Deviance :13.97877
## Null D.o.F. :2593
## Residual Deviance :11.29108
## Residual D.o.F. :2589
## AIC :NaN
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 0.27340493 0.008355091 0.28451344 0.27967656
## mse 0.1288284 0.010064627 0.14173545 0.13876098
## null_deviance 1.3978775 0.091519505 1.495645 1.6113418
## r2 0.19276007 0.023682296 0.18898739 0.18601803
## residual_deviance 1.1277183 0.08511159 1.2103963 1.309848
## rmse 0.35837132 0.014114098 0.3764777 0.37250635
## rmsle 0.055655014 0.0020188198 0.05823262 0.057491224
## cv_3_valid cv_4_valid cv_5_valid cv_6_valid
## mae 0.25561687 0.25966153 0.2776107 0.2569407
## mse 0.10940834 0.11454897 0.13530691 0.10866868
## null_deviance 1.2263895 1.2794584 1.4159609 1.3167939
## r2 0.18194892 0.24431878 0.15211809 0.24990796
## residual_deviance 1.0029817 0.97199255 1.196076 0.9828831
## rmse 0.33076933 0.33845085 0.36784086 0.32964933
## rmsle 0.05149267 0.053131666 0.057332374 0.051421497
## cv_7_valid cv_8_valid cv_9_valid cv_10_valid
## mae 0.27796203 0.29192773 0.26915556 0.28098413
## mse 0.12994897 0.15044233 0.11830346 0.14115998
## null_deviance 1.3863152 1.6134675 1.2830396 1.3503631
## r2 0.22208755 0.18595453 0.16706163 0.14919774
## residual_deviance 1.0793375 1.3107362 1.0699044 1.1430275
## rmse 0.36048436 0.38786897 0.34395272 0.37571263
## rmsle 0.05614277 0.06002886 0.053676162 0.057600282
When validated on the independent test set, the model’s performance is as follows:
h2o.performance(h2ofit,wtest)
## H2ORegressionMetrics: glm
##
## MSE: 0.1212607
## RMSE: 0.3482251
## MAE: 0.2590904
## RMSLE: 0.05351539
## Mean Residual Deviance : 0.004026166
## R^2 : 0.2356361
## Null Deviance :5.602816
## Null D.o.F. :1059
## Residual Deviance :4.267736
## Residual D.o.F. :1055
## AIC :NaN
predh2o=predict(h2ofit,newdata=wtest,type="response")%>%as.vector()# h2O
predg1=predict(mod2,newdata=testset,type="response")#Penalized spline
predg5=predict(mod1,newdata=testset,type="response")#Cubic spline
predg2=predict(opmod4,newdata=testset,type="response")#Orthogonal 4
predg3=predict(fpmod3,newdata=testset,type="response")#Fractioned cubic
predg4=predict(ppmod2,newdata=testset,type="response")#Piecewise polynomial (bs, df=4)
predf=cbind(predg1,predg5,predg2,predg3,predg4,predh2o,testset$gh,testset$age)%>%as_tibble()
names(predf)=c("Penalizedspline","Cubicspline","Orthogonal4","Fractioned3","Piecewise5","h2oLasso","Truth","Age")
predlong=predf%>%gather(Penalizedspline:h2oLasso,key="Model",value="Predicted")
The following figure shows the prediction curves of the 6 models, including the penalised model trained with h2o (labelled h2oLasso in the code):
predlong%>%ggplot()+geom_point(aes(x=Age,y=Truth),alpha=0.1,color="grey")+geom_smooth(aes(x=Age,y=Predicted,color=Model,fill=Model),show.legend = F)+facet_wrap(~Model,ncol=3)
EVALUATING THE PERFORMANCE BY MLR
In the final step, we will use the mlr package to evaluate the performance of the 6 models.
8 metrics will be evaluated (a small hand-computed sketch of their definitions follows this list), including:
expvar = Explained variance (best =1, worst =0), defined as explained_sum_of_squares / total_sum_of_squares.
mae = Mean of absolute errors. Defined as: mean(abs(response - truth))
medae = Median of absolute errors
medse = Median of squared errors
mse = Mean of squared errors. Defined as: mean((response - truth)^2)
rsq = R-squared (coefficient of determination), which is 1 - residual_sum_of_squares / total_sum_of_squares.
rmse= Root mean squared error. The RMSE is aggregated as sqrt(mean(rmse.vals.on.test.sets^2))
rrse = Root relative squared error. Defined as sqrt (sum_of_squared_errors / total_sum_of_squares). Undefined for single instances and when every truth value is identical. In this case the output will be NA.
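As a small illustration of how these metrics are defined, here is a hand-computed sketch on an arbitrary pair of truth/response vectors (values are purely illustrative):
truth=c(5.1,5.4,5.8,6.0) #illustrative "true" gh values
response=c(5.2,5.3,5.9,5.7) #illustrative predictions
mae=mean(abs(response-truth))
mse=mean((response-truth)^2)
rmse=sqrt(mse)
rsq=1-sum((truth-response)^2)/sum((truth-mean(truth))^2)
rrse=sqrt(sum((response-truth)^2)/sum((truth-mean(truth))^2))
c(mae=mae,mse=mse,rmse=rmse,rsq=rsq,rrse=rrse)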
library(mlr)
regr.task= mlr::makeRegrTask(id = "gh", data=testset, target = "gh")
regr.lrn = makeLearner("regr.glm")
#Fit a dummy learner only to obtain mlr Prediction objects, whose responses are then overwritten with each model's predictions
dummy=mlr::train(regr.lrn,regr.task)
dumpred1=predict(dummy,regr.task)
dumpred1$data$response<-predg1
dumpred2=predict(dummy,regr.task)
dumpred2$data$response<-predg2
dumpred3=predict(dummy,regr.task)
dumpred3$data$response<-predg3
dumpred4=predict(dummy,regr.task)
dumpred4$data$response<-predg4
dumpred5=predict(dummy,regr.task)
dumpred5$data$response<-predg5
dumpredh2o=predict(dummy,regr.task)
dumpredh2o$data$response<-predh2o
mets=list(rsq,expvar,mae,rmse,mse,medse,medae,rrse)
dp1=performance(dumpred1,measures =mets)
dp2=performance(dumpred2,measures =mets)
dp3=performance(dumpred3,measures =mets)
dp4=performance(dumpred4,measures =mets)
dp5=performance(dumpred5,measures =mets)
dph2o=performance(dumpredh2o,measures =mets)
dpd=cbind(dp1,dp2,dp3,dp4,dp5,dph2o)%>%as_tibble()
dpd$Metric=c("R2","ExpVAR","MAE","RMSE","MSE","MEDSE","MEDAE","RRSE")
names(dpd)=c("Penalizedspline","Cubicspline","Orthogonal4","Fractioned3","Piecewise5","h2oLasso","Metric")
Here is the verdict of our match:
cbind(dpd)
## Penalizedspline Cubicspline Orthogonal4 Fractioned3 Piecewise5
## 1 0.24434196 0.24530770 0.2463909 0.24605791 0.24448688
## 2 0.20941423 0.21181764 0.2110116 0.21098507 0.20813995
## 3 0.25685386 0.25630161 0.2559750 0.25602488 0.25690363
## 4 0.34623635 0.34601504 0.3457666 0.34584301 0.34620315
## 5 0.11987961 0.11972641 0.1195546 0.11960739 0.11985662
## 6 0.04103231 0.04139082 0.0404188 0.04050637 0.04103404
## 7 0.20256432 0.20344734 0.2010443 0.20126195 0.20256793
## 8 0.86928594 0.86873028 0.8681066 0.86829839 0.86920258
## h2oLasso Metric
## 1 0.23563604 R2
## 2 0.18672473 ExpVAR
## 3 0.25909037 MAE
## 4 0.34822513 RMSE
## 5 0.12126074 MSE
## 6 0.04068146 MEDSE
## 7 0.20169621 MEDAE
## 8 0.87427911 RRSE
dpdlong=gather(dpd,Penalizedspline:h2oLasso,key="Model",value="Score")
library(viridis)
dpdlong%>%ggplot(aes(x=reorder(Model,-Score),y=reorder(Metric,Score),fill=Score,label=round(Score,3)))+geom_tile(color="black")+geom_text(color = "white")+scale_fill_viridis(option="A",begin=0.2,end=0.7)+scale_x_discrete("Models")
dpdlong%>%ggplot(aes(x=Model,y=Score,color=Model))+geom_point(size=3,show.legend = F)+coord_flip()+facet_wrap(~Metric,ncol=2,scales="free_x")
Conclusion
In this case study, we explored two R packages, GAMLSS and H2O, that support polynomial regression tasks. Based on the results, the GAMLSS package won the match, as its polynomial models showed a higher R2 and lower prediction errors than the penalised model trained with H2O. Overall, age alone, in any of the six models, explains only about 20% of the variance of gh in the healthy American population studied. Cubic smoothing splines are a powerful technique for extending the capacity of a basic linear model, and a simple orthogonal polynomial regression worked better here than the regularised h2o GLM model. Better tuning might improve the h2o GLM model’s performance, but GAMLSS remains a remarkably flexible option among GLM packages.
END