Case study N°40: Predicting Body surface area by Bayesian regularization-feed-forward neural network model

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project with participation of 12 data analysts aims to:

1.Encourage using Machine Learning techniques in medical research in Vietnam 2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.

Background

Body surface area (BSA) is an indicator for metabolic body mass, and is used for adjusting other physiological parameters. In clinical practice, BSA is usually approximated from Weight and Height using the Dubois formula (A polynomial model). This study aims to verify whether a model based on “Bayesian regularization-feed-forward neural network - BRNN” algorithm could be applied to accurately predict the body surface from gender, age, weight and height?

Materials and method

The original dataset was provided in Chaper 17, Machine Learning in Medicine-Cookbook Two (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). The target variable is BSA (in cm2) of 90 persons, calculated using direct photometric measurements. The features include: Gender, Age, Height and Weight.

Data analysis was performed in R statistical programming language. A BRNN algorithm based model was trained using brnn and mlr package. The final model’s performance was evaluated by a resampling process that based on cross-validation. Attribution of each predictors to the predicted BSA was graphically evaluated using the Partial dependence function method.

library(foreign)
library(tidyverse)
library(mlr)

Results

Step 0: Data loading and reshape

df=read.spss("chap17radialbasisnn.sav",use.value.labels = TRUE,to.data.frame = TRUE)

df$VAR00001=recode_factor(df$VAR00001,`0`="Male",`1`="Female")

dft<-df%>%rename(.,Gender=VAR00001,Age=VAR00002,Weight=VAR00003,Height=VAR00004,BSA=VAR00005)%>%as_tibble()

Step 1) Data exploration

Table 1: Descriptive statistics

Hmisc::describe(dft)
## dft 
## 
##  5  Variables      90  Observations
## ---------------------------------------------------------------------------
## Gender 
##        n  missing distinct 
##       90        0        2 
##                         
## Value        Male Female
## Frequency      45     45
## Proportion    0.5    0.5
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd 
##       90        0        9    0.988    7.111    5.793 
##                                                                 
## Value          0     1     3     5     7     9    11    13    15
## Frequency     10    10    10    10    10    10    10    10    10
## Proportion 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111
## ---------------------------------------------------------------------------
## Weight 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       90        0       49    0.999    21.41    14.17     3.00     3.45 
##      .25      .50      .75      .90      .95 
##    12.12    20.50    30.00    40.05    43.27 
## 
## lowest :  2.25  2.50  3.00  3.50  7.50, highest: 43.00 43.50 44.00 45.00 50.00
## ---------------------------------------------------------------------------
## Height 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       90        0       62        1    112.1    36.86    51.73    56.25 
##      .25      .50      .75      .90      .95 
##    91.25   120.50   138.00   151.00   155.65 
## 
## lowest :  50.5  51.0  51.5  52.0  53.0, highest: 157.0 159.5 163.0 165.0 168.0
## ---------------------------------------------------------------------------
## BSA 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       90        0       90        1     7741     3909     2105     2501 
##      .25      .50      .75      .90      .95 
##     5314     7918    10121    12642    13055 
## 
## lowest :  1632.5  1850.3  1906.2  1928.4  2094.5
## highest: 13082.8 13221.6 13426.7 13480.9 14832.0
## ---------------------------------------------------------------------------

Figure 1: Relationship between BSA and other features in 2 sexes

dft%>%gather(Age:Height,key ="Parameter", value = "Predictors")%>%ggplot(aes(x=Predictors,y=BSA))+geom_point(aes(color=Gender))+geom_smooth(method="lm",aes(color=Gender,fill=Gender),alpha=0.5)+facet_wrap(~Parameter,ncol=2,scales="free")+theme_bw()

Those figures suggests that in both gender subsets, BSA is associated to all 3 features (Age, Height and Weight) by a proportional relationship.

Step 2: Evaluating the performance of a BRNN model by repeated cross-validation

regr.task= makeRegrTask(id = "BS", data=dft, target = "BSA")
regr.lrn = makeLearner("regr.brnn")
rdesc = makeResampleDesc("CV", iters=25L)
r=resample(regr.lrn,regr.task,measures=list(rsq,expvar,mae,rmse,mse),rdesc,models=FALSE)
r$aggr
##    rsq.test.mean expvar.test.mean    mae.test.mean   rmse.test.rmse 
##     9.886672e-01     1.009495e+00     1.697584e+02     2.179570e+02 
##    mse.test.mean 
##     4.750525e+04

Figure 2: Evolution of the model’s performance metrics through 25 iterations

r$measures.test%>%gather(rsq:mse,key ="Metrics", value = "Value")%>%ggplot(aes(x=iter,y=Value))+geom_path(aes(color=Metrics))+facet_wrap(~Metrics,ncol=1,scales="free")+theme_bw()

Figure 3: Performance metrics of the brnn model:

r$measures.test%>%gather(rsq:mse,key ="Metrics", value = "Value")%>%ggplot(aes(x=Metrics,y=Value))+geom_boxplot(aes(fill=Metrics),alpha=0.7)+coord_flip()+facet_wrap(~Metrics,ncol=1,scales="free")+theme_bw()

Note: _expvar = Explained variance : Similar to rsq (R-squared). Defined as explained_sum_of_squares / total_sum_of_squares.

mae: Mean of absolute errors

mse: Mean of squared errors

rmse: Root mean square error

rsq: Coefficient of determination which is 1 - residual_sum_of_squares / total_sum_of_squares._

The results demonstrated that brnn algorithm performed well through 20 CV iterations. The averaged performance metrics were very good: the brnn model could explain 99% of BSA variance with very low error.

Now we train a real model on the entire dataset then estimate the performance metrics

mod=train(regr.lrn,regr.task)
## Number of parameters (weights and biases) to estimate: 12 
## Nguyen-Widrow method
## Scaling factor= 0.705412 
## gamma= 11.4299    alpha= 1.2382   beta= 537.9344
pred=predict(mod,regr.task)

performance(pred,measures =list(rsq,expvar,mae,rmse,mse))
##          rsq       expvar          mae         rmse          mse 
## 9.968895e-01 9.967424e-01 1.445201e+02 1.879988e+02 3.534355e+04

Step 3: Partial dependency analysis

pr1=generatePartialDependenceData(mod,regr.task,features="Age",gridsize=10L,individual=TRUE)%>%.$data%>%mutate(predictor="Age")%>%rename(.,Value=Age)

pr2=generatePartialDependenceData(mod,regr.task,features="Weight",gridsize=10L,individual=TRUE)%>%.$data%>%mutate(predictor="Weight")%>%rename(.,Value=Weight)

pr3=generatePartialDependenceData(mod,regr.task,features="Height",gridsize=10L,individual=TRUE)%>%.$data%>%mutate(predictor="Height")%>%rename(.,Value=Height)

prdf<-rbind(pr1,pr2,pr3)

Figure 4: Partial dependencies of BSA on Height, Weight and Age

As neural network is a blackbox model, we adopted the PDF method to present the attribution of each feature to the predicted BSA in an explicit way. This process consists of reducing the potentially high dimensional function estimated by the learner, and developping a marginalized version of this function in a lower dimensional space. For a regression problem, the partial derivative of a single feature is the gradient of the partial dependence function.

prdf%>%ggplot(aes(x=Value,y=BSA))+geom_path(aes(color=as.factor(idx)),alpha=0.2,show.legend =F)+geom_smooth(alpha=0.5,fill="gold",color="purple4",show.legend =F)+facet_wrap(~predictor,scales ="free",ncol = 2)+theme_bw()
## `geom_smooth()` using method = 'loess'

The ‘marginalized’ attribution of Age, Height and Weight to the predicted BSA was presented in the figure 6. Those functions suggest that Height and Weight contribute the most to the prediction of BSA.

Figure 5A and 5B: Predicted BSA compared with true observed value

At this point, we want to compare the predicted value of BSA to its true value. This could be done visually as follow:

predf<-pred%>%.$data%>%mutate(Height=dft$Height,Age=dft$Age,Weight=dft$Weight,Gender=dft$Gender)%>%gather(Height:Weight,key ="Metric", value = "Predictor")
pred%>%.$data%>%rename(.,Predicted=response,True=truth)%>%mutate(Gender=dft$Gender)%>%gather(True:Predicted,key ="Type", value = "Value")%>%ggplot()+geom_freqpoly(aes(x=Value,color=Type),size=1,binwidth =100)+facet_wrap(~Gender,ncol=1,scales="free")+theme_bw()

predf%>%ggplot()+geom_point(aes(x=Predictor,y=truth,color=Gender),size=3,alpha=0.5)+geom_smooth(aes(x=Predictor,y=response,fill=Gender,color=Gender),alpha=0.5)+facet_wrap(~Metric,ncol=2,scales="free")+theme_bw()
## `geom_smooth()` using method = 'loess'

These figures showed that in our 90 patients, the predicted BSA values given by brnn model are almost identical to the true BSA values obtained by photometric based method.

Discussion: With a larger sample size, the original dataset should be splitted into training and testing subset so we can check the model’s performance on an independent subset.

Conclusion: BRNN based models could be applied to estimate accurately the body surface values of individual patients.

END