Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project with participation of 12 data analysts aims to:
1.Encourage using Machine Learning techniques in medical research in Vietnam 2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
Background
Body surface area (BSA) is an indicator for metabolic body mass, and is used for adjusting other physiological parameters. In clinical practice, BSA is usually approximated from Weight and Height using the Dubois formula (A polynomial model). This study aims to verify whether a model based on “Bayesian regularization-feed-forward neural network - BRNN” algorithm could be applied to accurately predict the body surface from gender, age, weight and height?
Materials and method
The original dataset was provided in Chaper 17, Machine Learning in Medicine-Cookbook Two (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). The target variable is BSA (in cm2) of 90 persons, calculated using direct photometric measurements. The features include: Gender, Age, Height and Weight.
Data analysis was performed in R statistical programming language. A BRNN algorithm based model was trained using brnn and mlr package. The final model’s performance was evaluated by a resampling process that based on cross-validation. Attribution of each predictors to the predicted BSA was graphically evaluated using the Partial dependence function method.
library(foreign)
library(tidyverse)
library(mlr)
Results
Step 0: Data loading and reshape
df=read.spss("chap17radialbasisnn.sav",use.value.labels = TRUE,to.data.frame = TRUE)
df$VAR00001=recode_factor(df$VAR00001,`0`="Male",`1`="Female")
dft<-df%>%rename(.,Gender=VAR00001,Age=VAR00002,Weight=VAR00003,Height=VAR00004,BSA=VAR00005)%>%as_tibble()
Step 1) Data exploration
Table 1: Descriptive statistics
Hmisc::describe(dft)
## dft
##
## 5 Variables 90 Observations
## ---------------------------------------------------------------------------
## Gender
## n missing distinct
## 90 0 2
##
## Value Male Female
## Frequency 45 45
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd
## 90 0 9 0.988 7.111 5.793
##
## Value 0 1 3 5 7 9 11 13 15
## Frequency 10 10 10 10 10 10 10 10 10
## Proportion 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111
## ---------------------------------------------------------------------------
## Weight
## n missing distinct Info Mean Gmd .05 .10
## 90 0 49 0.999 21.41 14.17 3.00 3.45
## .25 .50 .75 .90 .95
## 12.12 20.50 30.00 40.05 43.27
##
## lowest : 2.25 2.50 3.00 3.50 7.50, highest: 43.00 43.50 44.00 45.00 50.00
## ---------------------------------------------------------------------------
## Height
## n missing distinct Info Mean Gmd .05 .10
## 90 0 62 1 112.1 36.86 51.73 56.25
## .25 .50 .75 .90 .95
## 91.25 120.50 138.00 151.00 155.65
##
## lowest : 50.5 51.0 51.5 52.0 53.0, highest: 157.0 159.5 163.0 165.0 168.0
## ---------------------------------------------------------------------------
## BSA
## n missing distinct Info Mean Gmd .05 .10
## 90 0 90 1 7741 3909 2105 2501
## .25 .50 .75 .90 .95
## 5314 7918 10121 12642 13055
##
## lowest : 1632.5 1850.3 1906.2 1928.4 2094.5
## highest: 13082.8 13221.6 13426.7 13480.9 14832.0
## ---------------------------------------------------------------------------
Figure 1: Relationship between BSA and other features in 2 sexes
dft%>%gather(Age:Height,key ="Parameter", value = "Predictors")%>%ggplot(aes(x=Predictors,y=BSA))+geom_point(aes(color=Gender))+geom_smooth(method="lm",aes(color=Gender,fill=Gender),alpha=0.5)+facet_wrap(~Parameter,ncol=2,scales="free")+theme_bw()
Those figures suggests that in both gender subsets, BSA is associated to all 3 features (Age, Height and Weight) by a proportional relationship.
Step 2: Evaluating the performance of a BRNN model by repeated cross-validation
regr.task= makeRegrTask(id = "BS", data=dft, target = "BSA")
regr.lrn = makeLearner("regr.brnn")
rdesc = makeResampleDesc("CV", iters=25L)
r=resample(regr.lrn,regr.task,measures=list(rsq,expvar,mae,rmse,mse),rdesc,models=FALSE)
r$aggr
## rsq.test.mean expvar.test.mean mae.test.mean rmse.test.rmse
## 9.886672e-01 1.009495e+00 1.697584e+02 2.179570e+02
## mse.test.mean
## 4.750525e+04
Figure 2: Evolution of the model’s performance metrics through 25 iterations
r$measures.test%>%gather(rsq:mse,key ="Metrics", value = "Value")%>%ggplot(aes(x=iter,y=Value))+geom_path(aes(color=Metrics))+facet_wrap(~Metrics,ncol=1,scales="free")+theme_bw()
Figure 3: Performance metrics of the brnn model:
r$measures.test%>%gather(rsq:mse,key ="Metrics", value = "Value")%>%ggplot(aes(x=Metrics,y=Value))+geom_boxplot(aes(fill=Metrics),alpha=0.7)+coord_flip()+facet_wrap(~Metrics,ncol=1,scales="free")+theme_bw()
Note: _expvar = Explained variance : Similar to rsq (R-squared). Defined as explained_sum_of_squares / total_sum_of_squares.
mae: Mean of absolute errors
mse: Mean of squared errors
rmse: Root mean square error
rsq: Coefficient of determination which is 1 - residual_sum_of_squares / total_sum_of_squares._
The results demonstrated that brnn algorithm performed well through 20 CV iterations. The averaged performance metrics were very good: the brnn model could explain 99% of BSA variance with very low error.
Now we train a real model on the entire dataset then estimate the performance metrics
mod=train(regr.lrn,regr.task)
## Number of parameters (weights and biases) to estimate: 12
## Nguyen-Widrow method
## Scaling factor= 0.705412
## gamma= 11.4299 alpha= 1.2382 beta= 537.9344
pred=predict(mod,regr.task)
performance(pred,measures =list(rsq,expvar,mae,rmse,mse))
## rsq expvar mae rmse mse
## 9.968895e-01 9.967424e-01 1.445201e+02 1.879988e+02 3.534355e+04
Step 3: Partial dependency analysis
pr1=generatePartialDependenceData(mod,regr.task,features="Age",gridsize=10L,individual=TRUE)%>%.$data%>%mutate(predictor="Age")%>%rename(.,Value=Age)
pr2=generatePartialDependenceData(mod,regr.task,features="Weight",gridsize=10L,individual=TRUE)%>%.$data%>%mutate(predictor="Weight")%>%rename(.,Value=Weight)
pr3=generatePartialDependenceData(mod,regr.task,features="Height",gridsize=10L,individual=TRUE)%>%.$data%>%mutate(predictor="Height")%>%rename(.,Value=Height)
prdf<-rbind(pr1,pr2,pr3)
Figure 4: Partial dependencies of BSA on Height, Weight and Age
As neural network is a blackbox model, we adopted the PDF method to present the attribution of each feature to the predicted BSA in an explicit way. This process consists of reducing the potentially high dimensional function estimated by the learner, and developping a marginalized version of this function in a lower dimensional space. For a regression problem, the partial derivative of a single feature is the gradient of the partial dependence function.
prdf%>%ggplot(aes(x=Value,y=BSA))+geom_path(aes(color=as.factor(idx)),alpha=0.2,show.legend =F)+geom_smooth(alpha=0.5,fill="gold",color="purple4",show.legend =F)+facet_wrap(~predictor,scales ="free",ncol = 2)+theme_bw()
## `geom_smooth()` using method = 'loess'
The ‘marginalized’ attribution of Age, Height and Weight to the predicted BSA was presented in the figure 6. Those functions suggest that Height and Weight contribute the most to the prediction of BSA.
Figure 5A and 5B: Predicted BSA compared with true observed value
At this point, we want to compare the predicted value of BSA to its true value. This could be done visually as follow:
predf<-pred%>%.$data%>%mutate(Height=dft$Height,Age=dft$Age,Weight=dft$Weight,Gender=dft$Gender)%>%gather(Height:Weight,key ="Metric", value = "Predictor")
pred%>%.$data%>%rename(.,Predicted=response,True=truth)%>%mutate(Gender=dft$Gender)%>%gather(True:Predicted,key ="Type", value = "Value")%>%ggplot()+geom_freqpoly(aes(x=Value,color=Type),size=1,binwidth =100)+facet_wrap(~Gender,ncol=1,scales="free")+theme_bw()
predf%>%ggplot()+geom_point(aes(x=Predictor,y=truth,color=Gender),size=3,alpha=0.5)+geom_smooth(aes(x=Predictor,y=response,fill=Gender,color=Gender),alpha=0.5)+facet_wrap(~Metric,ncol=2,scales="free")+theme_bw()
## `geom_smooth()` using method = 'loess'
These figures showed that in our 90 patients, the predicted BSA values given by brnn model are almost identical to the true BSA values obtained by photometric based method.
Discussion: With a larger sample size, the original dataset should be splitted into training and testing subset so we can check the model’s performance on an independent subset.
Conclusion: BRNN based models could be applied to estimate accurately the body surface values of individual patients.
END