Motivation and Expectation from Centile Estimation

knitr::opts_chunk$set(echo = TRUE)
#install.packages("expectreg") #not available anymore
#Thus, install from Github
#library(devtools) #not working also
#install_github("cran/expectreg")
#library(expectreg)

library(repmis) # source_file: load Rda from Github
library(ggplot2) # ggplot
library(gamlss) # lms()
library(tidyverse) # %>%

The Dutch boys data

#https://github.com/opetchey/RREEBES/wiki/Reading-data-and-code-from-an-online-github-repository
source_data("https://github.com/cran/expectreg/blob/master/data/dutchboys.rda?raw=True")
## [1] "dutchboys"
#load(file="Centile Estimation\\dutchboys.rda")
dim(dutchboys)
## [1] 6848   10
names(dutchboys)
##  [1] "defnr" "age"   "hgt"   "wgt"   "hc"    "hgt.z" "wgt.z" "hc.z"  "bmi.z"
## [10] "hfw.z"
  • hc : the head circumference of 6848 boys
  • age : the age in years

Observed Head Circumference of Males (From the Fourth Dutch Growth Study) Against Age

ggplot(dutchboys, aes(x=age, y=hc)) +
  geom_point(shape = 1, color = "#0052bb", size = 1.5) + 
  ggtitle("The Dutch boys data") +
  xlab("age") + ylab("head circumference") +
  theme_bw()

Simplify the plot (better visualization)

set.seed(080320)
IND<-sample.int(dim(dutchboys)[1],1000, replace=FALSE)
dutchboys1<-dutchboys[IND,]
ggplot(dutchboys1, aes(x=age, y=hc)) +
  geom_point(shape = 1, color = "#0052bb", size = 1.5) + 
  ggtitle("Random sample of 1,000 observations") +
  xlab("age") + ylab("head circumference") +
  theme_bw()

The problem

Centile estimation includes methods for estimating the age-related distribution of human growth. The standard estimation of centile curves usually involves two continuous variables:
1. the response variable, that is, the variable we are interested in and for which we are trying to find the centile curves, e.g. weight, BMI, head circumference.
2. the explanatory variable, e.g. age.

Centile estimation can be extended to more than one explanatory continuous variable, e.g. age and height. For explanatory categorical variables such as gender, the usual practice is to produce separate charts for each level of the categorical variable.

\(\Longrightarrow\) Dutch boys data with centiles

The methods

The popular methodology for creating centile references for individuals from a population comprises two different methods:

  1. the nonparametric method of quantile regression [Koenker, 2005, Koenker and Bassett, 1978, Koenker and Ng, 2005, He and Ng, 1999, Np and Maechler, 2007]; and
  2. the parametric LMS (lambda, mu, sigma) method of Cole [1988], Cole and Green [1992] and its extensions. For example, see Wright and Royston [1997] and Rigby and Stasinopoulos [2004, 2006].

The 100\(p\) centile of a continuous random variable \(Y\) is the value \(y_p\) such that:

  • Prob(\(Y \le y_p\)) = \(p\). Then \(y_p\) = \(F^{-1}_Y(p)\), and \(y_p\) is the inverse cdf of \(Y\) applied to \(p\).
  • If the conditional centile of \(Y\) given explanatory variable \(X = x\), i.e. \(y_p = F^{-1}_{Y \mid x}(p)\).
  • By varying \(x\), a 100\(p\) centile curve of \(y_p(x)\) against x is obtained. Centile curves can be obtained for different values of \(p\).
  • The WHO uses the values 100\(p\)=(3,15,50,85,97) in its charts and 100\(p\)=(1,3,5,15,25,50,75,85,95,97,99) in its tables.

The LMS model and its extentions

Centiles: the model

The LMS method can be fitted within gamlss is defined as follows:

\[Y \sim f_Y (y \mid \mu, \sigma, \upsilon, \tau)\] where \(f_Y()\) is any distribution, typically represents the BCCG (Box-Cox Cole and Green), BCPE (Box-Cox power exponential), and BCT (Box-Cox t).

  • a BCCG by assuming that the response variable has , which is suitable for positively or negatively skewed data with \(Y\) > 0
  • a BCPE assumes that the transformed random variable \(Z\) has a truncated power exponential distribution
  • a BCT assumes that \(Z\) has a truncated \(t\) distribution

\[ Y \ = \ \text{head circumference} \ \text{and} \ X \ = \ AGE^\color{red}{\epsilon}\]

The power transformation for \(x\) (here is AGE), i.e. \(x^\epsilon\), is usually needed when the response variable has an early or late spell of fast growth. In those cases the transformation of \(x\) can stretch the \(x\) scale, improving the fit of the smooth curve.

\[\begin{align} \mu &= s(x, \color{red}{df_\mu}) &\text{approximate median}\\ log(\sigma) &= s(x, \color{red}{df_\sigma}) &\text{approximate coefficient of variation}\\ \upsilon &= s(x, \color{red}{df_\upsilon}) &\text{skewness parameter}\\ log(\tau) &= s(x, \color{red}{df_\tau}) &\text{kurtosis parameter} \end{align}\]
#m0<-lms(hc, age, data=dutchboys, trans.x=TRUE, k=2)
#m0$family # "BCTo"            "Box-Cox-t-orig."
#m0$power # 0.3827328
m0<-lms(hc,age,families=c("BCCGo","BCPEo","BCTo"),data=dutchboys,
        k=3,calibration=F, trans.x=T)
## *** Checking for transformation for x *** 
## *** power parameters  0.3805277 ***  
## *** Initial  fit***  
## GAMLSS-RS iteration 1: Global Deviance = 26367.98 
## GAMLSS-RS iteration 2: Global Deviance = 26367.98 
## *** Fitting BCCGo *** 
## GAMLSS-RS iteration 1: Global Deviance = 26425.3 
## GAMLSS-RS iteration 2: Global Deviance = 26284.07 
## GAMLSS-RS iteration 3: Global Deviance = 26282.6 
## GAMLSS-RS iteration 4: Global Deviance = 26282.61 
## GAMLSS-RS iteration 5: Global Deviance = 26282.61 
## *** Fitting BCPEo *** 
## GAMLSS-RS iteration 1: Global Deviance = 26660.28 
## GAMLSS-RS iteration 2: Global Deviance = 26110.56 
## GAMLSS-RS iteration 3: Global Deviance = 26107.54 
## GAMLSS-RS iteration 4: Global Deviance = 26107.57 
## GAMLSS-RS iteration 5: Global Deviance = 26107.56 
## *** Fitting BCTo *** 
## GAMLSS-RS iteration 1: Global Deviance = 26185.76 
## GAMLSS-RS iteration 2: Global Deviance = 26021.22 
## GAMLSS-RS iteration 3: Global Deviance = 26020.8 
## GAMLSS-RS iteration 4: Global Deviance = 26021.14 
## GAMLSS-RS iteration 5: Global Deviance = 26021.39 
## GAMLSS-RS iteration 6: Global Deviance = 26021.51 
## GAMLSS-RS iteration 7: Global Deviance = 26021.58 
## GAMLSS-RS iteration 8: Global Deviance = 26021.61 
## GAMLSS-RS iteration 9: Global Deviance = 26021.62 
## GAMLSS-RS iteration 10: Global Deviance = 26021.63 
## GAMLSS-RS iteration 11: Global Deviance = 26021.63 
## GAMLSS-RS iteration 12: Global Deviance = 26021.63 
## GAMLSS-RS iteration 13: Global Deviance = 26021.63

## % of cases below  0.3830381 centile is  0.3942757 
## % of cases below  2.275013 centile is  2.000584 
## % of cases below  9.121122 centile is  8.995327 
## % of cases below  25.24925 centile is  25.93458 
## % of cases below  50 centile is  50.05841 
## % of cases below  74.75075 centile is  74.10923 
## % of cases below  90.87888 centile is  90.71262 
## % of cases below  97.72499 centile is  97.95561 
## % of cases below  99.61696 centile is  99.70794
m0$family # [1] "BCTo"            "Box-Cox-t-orig." for k = 4
## [1] "BCTo"            "Box-Cox-t-orig."
m0$power # 0.3083458 for k = 4
## [1] 0.3805277

\(\Longrightarrow\) Note that the best distribution family, according to GAIC(3), is stored in m0$family, i.e. BCTo. The power transformation chosen for age is stored in m0$power, i.e. \(u = age^{m0\$power}\) The chosen power transformation for age is given by m0$power, i.e. \[u = age^{0.3805277}\] and the best distribution, according to GAIC(\(\kappa\) = 2), is given by m0$family, and is BCCGo. The chosen model m0 can be fitted directly by

dutchboys$Tage<-(dutchboys$age)^(m0$power)
m1<-gamlss(hc~pb(Tage), sigma.formula=~pb(Tage),
            nu.formula=~pb(Tage), tau.fo=~pb(Tage), family=BCTo,
            trace=FALSE, data=dutchboys)

Alternatively m0 can be fitted by

m2<-gamlss(hc~pb(age^m0$power), sigma.formula=~pb(age^m0$power),
           nu.formula=~pb(age^m0$power), tau.fo=~pb(age^m0$power),
           family=BCTo, data=dutchboys,
           trace=FALSE)

The effective degrees of freedom used for the smooth functions in model m0 (including 2 for the constant and linear terms) are given by

edfAll(m0)
## $mu
## $mu$`pb(x, df = mu.df)`
## [1] 13.40125
## 
## 
## $sigma
## $sigma$`pb(x, df = sigma.df)`
## [1] 5.645407
## 
## 
## $nu
## $nu$`pb(x, df = nu.df)`
## [1] 2.911063
## 
## 
## $tau
## $tau$`pb(x, df = tau.df)`
## [1] 2.000227
# m0$mu.df; m0$sigma.df; m0$nu.df; m0$tau.df

The LMS method and extensions

Let \(Y\) be a random variable with range \(Y > 0\) defined through the transformed variable \(Z\) given by:

\[\begin{align} Z &= \frac{1}{\sigma\upsilon} [(\frac{Y}{\mu})^\upsilon - 1] &\text{, if} \ \upsilon \neq 0 \\ &= \frac{1}{\sigma} log(\frac{Y}{\mu}) &\text{, if} \ \upsilon = 0. \end{align}\]
  1. if \(Z \sim N(0,1)\) then \(Y \sim BCCG(\mu,\sigma,\upsilon)\) = LMS method
  2. if \(Z \sim t_\tau\) then \(Y \sim BCT(\mu,\sigma,\upsilon, \tau)\) = LMST method
  3. if \(Z \sim PE(0,1, \tau)\) then \(Y \sim BCPE(\mu,\sigma,\upsilon)\) = LMSP method adopted by WHO

Estimation of the smoothing parameters

We need to select the five values \(df_\mu, \ df_\sigma, \ df_\upsilon, \ df_\tau, \ \epsilon\)

  • by trial and error
  • minimize the generalized Akaike information criterion, GAIC(#)
  • minimize the the validation global deviance VGD
  • using local selection criteria, i.e. CV, ML

Diagnostics should be used in all above cases

Centiles: GAIC(#)

For different values of # minimize GAIC(#) to find estimates for \(df_\mu, \ df_\sigma, \ df_\upsilon, \ df_\tau, \ \epsilon\)

  • # = 2, AIC
  • # = 3, our preference
  • # = 3.84, \(\chi^2\)
  • # = log(n), BIC, SBC
dutchboys$Tage<-(dutchboys$age)^(m0$power)
m3<-gamlss(hc~pb(Tage, method="GAIC", k=3),
           sigma.formula=~pb(Tage, method="GAIC", k=3),
           nu.formula=~pb(Tage, method="GAIC", k=3),
           tau.formula = ~pb(Tage, method="GAIC", k=3),
           family=BCT,data=dutchboys, trace=FALSE)
edfAll(m3)
## $mu
## $mu$`pb(Tage, method = "GAIC", k = 3)`
## [1] 11.44023
## 
## 
## $sigma
## $sigma$`pb(Tage, method = "GAIC", k = 3)`
## [1] 2.013972
## 
## 
## $nu
## $nu$`pb(Tage, method = "GAIC", k = 3)`
## [1] 2.595608
## 
## 
## $tau
## $tau$`pb(Tage, method = "GAIC", k = 3)`
## [1] 2.039529

The gamlss functions find.hyper() can be used

mod<-quote(gamlss(hc~cs(Tage,df=p[1]), sigma.formula=~cs(Tage,df=p[2]),
                  nu.formula=~cs(Tage,df=p[3], c.spar=c(-1.5,2.5)),
                  tau.formula=~Tage, data=dutchboys, family=BCT,
                  control=gamlss.control(trace=FALSE)))
op<-find.hyper(model=mod, other=quote(Tage<-age^p[4]),
               par=c(10,4,1,0.25),lower=c(0.1,0.1,0.1,0.001),
               steps=c(0.1,0.1,0.1,0.2),factr=2e9,parscale=c(1,1,1,0.035),
               k=3)
## par 10 4 1 0.25 crit= 26092.45 with pen= 3 
## par 10.1 4 1 0.25 crit= 26092.46 with pen= 3 
## par 9.9 4 1 0.25 crit= 26092.45 with pen= 3 
## par 10 4.1 1 0.25 crit= 26092.4 with pen= 3 
## par 10 3.9 1 0.25 crit= 26092.5 with pen= 3 
## par 10 4 1.1 0.25 crit= 26092.57 with pen= 3 
## par 10 4 0.9 0.25 crit= 26092.35 with pen= 3 
## par 10 4 1 0.257 crit= 26092.45 with pen= 3 
## par 10 4 1 0.243 crit= 26092.45 with pen= 3 
## par 9.962885 4.5036 0.1368609 0.25 crit= 26092.25 with pen= 3 
## par 10.06288 4.5036 0.1368609 0.25 crit= 26092.25 with pen= 3 
## par 9.862885 4.5036 0.1368609 0.25 crit= 26092.25 with pen= 3 
## par 9.962885 4.6036 0.1368609 0.25 crit= 26092.22 with pen= 3 
## par 9.962885 4.4036 0.1368609 0.25 crit= 26092.27 with pen= 3 
## par 9.962885 4.5036 0.2368609 0.25 crit= 26092.14 with pen= 3 
## par 9.962885 4.5036 0.1 0.25 crit= 26092.3 with pen= 3 
## par 9.962885 4.5036 0.1368609 0.257 crit= 26092.25 with pen= 3 
## par 9.962885 4.5036 0.1368609 0.243 crit= 26092.25 with pen= 3 
## par 9.971324 4.472681 0.5274981 0.25 crit= 26092 with pen= 3 
## par 10.07132 4.472681 0.5274981 0.25 crit= 26092.01 with pen= 3 
## par 9.871324 4.472681 0.5274981 0.25 crit= 26092.01 with pen= 3 
## par 9.971324 4.572681 0.5274981 0.25 crit= 26091.98 with pen= 3 
## par 9.971324 4.372681 0.5274981 0.25 crit= 26092.04 with pen= 3 
## par 9.971324 4.472681 0.6274981 0.25 crit= 26092.01 with pen= 3 
## par 9.971324 4.472681 0.4274981 0.25 crit= 26092.03 with pen= 3 
## par 9.971324 4.472681 0.5274981 0.257 crit= 26092 with pen= 3 
## par 9.971324 4.472681 0.5274981 0.243 crit= 26092 with pen= 3 
## par 9.966412 4.623534 0.5604418 0.25 crit= 26091.97 with pen= 3 
## par 10.06641 4.623534 0.5604418 0.25 crit= 26091.97 with pen= 3 
## par 9.866412 4.623534 0.5604418 0.25 crit= 26091.97 with pen= 3 
## par 9.966412 4.723534 0.5604418 0.25 crit= 26091.95 with pen= 3 
## par 9.966412 4.523534 0.5604418 0.25 crit= 26091.99 with pen= 3 
## par 9.966412 4.623534 0.6604418 0.25 crit= 26091.98 with pen= 3 
## par 9.966412 4.623534 0.4604418 0.25 crit= 26091.98 with pen= 3 
## par 9.966412 4.623534 0.5604418 0.257 crit= 26091.97 with pen= 3 
## par 9.966412 4.623534 0.5604418 0.243 crit= 26091.97 with pen= 3 
## par 9.969574 5.005125 0.5626791 0.25 crit= 26091.92 with pen= 3 
## par 10.06957 5.005125 0.5626791 0.25 crit= 26091.93 with pen= 3 
## par 9.869574 5.005125 0.5626791 0.25 crit= 26091.93 with pen= 3 
## par 9.969574 5.105125 0.5626791 0.25 crit= 26091.92 with pen= 3 
## par 9.969574 4.905125 0.5626791 0.25 crit= 26091.93 with pen= 3 
## par 9.969574 5.005125 0.6626791 0.25 crit= 26091.94 with pen= 3 
## par 9.969574 5.005125 0.4626791 0.25 crit= 26091.93 with pen= 3 
## par 9.969574 5.005125 0.5626791 0.257 crit= 26091.92 with pen= 3 
## par 9.969574 5.005125 0.5626791 0.243 crit= 26091.92 with pen= 3 
## par 9.960439 5.067937 0.5571777 0.25 crit= 26091.92 with pen= 3 
## par 10.06044 5.067937 0.5571777 0.25 crit= 26091.92 with pen= 3 
## par 9.860439 5.067937 0.5571777 0.25 crit= 26091.92 with pen= 3 
## par 9.960439 5.167937 0.5571777 0.25 crit= 26091.92 with pen= 3 
## par 9.960439 4.967937 0.5571777 0.25 crit= 26091.92 with pen= 3 
## par 9.960439 5.067937 0.6571777 0.25 crit= 26091.93 with pen= 3 
## par 9.960439 5.067937 0.4571777 0.25 crit= 26091.93 with pen= 3 
## par 9.960439 5.067937 0.5571777 0.257 crit= 26091.92 with pen= 3 
## par 9.960439 5.067937 0.5571777 0.243 crit= 26091.92 with pen= 3
Tage<-(dutchboys$age)^(op$par[4])
m4<-gamlss(hc~cs(Tage,df=op$par[1]), 
            sigma.formula=~cs(Tage,df=op$par[2]),
            nu.formula=~cs(Tage,df=op$par[3],c.spar=c(-1.5,2.5)),
            family=BCT, data=dutchboys)
## GAMLSS-RS iteration 1: Global Deviance = 26198.73 
## GAMLSS-RS iteration 2: Global Deviance = 26041.95 
## GAMLSS-RS iteration 3: Global Deviance = 26041.66 
## GAMLSS-RS iteration 4: Global Deviance = 26041.72 
## GAMLSS-RS iteration 5: Global Deviance = 26041.73 
## GAMLSS-RS iteration 6: Global Deviance = 26041.73 
## GAMLSS-RS iteration 7: Global Deviance = 26041.73

Comparing all models:

GAIC(m0,m1,m2,m3,m4,k=3)

Diagnostics

Normalized (randomized) quantile residuals

The plot() function

plot(m0)

## ******************************************************************
##        Summary of the Quantile Residuals
##                            mean   =  -0.0004796974 
##                        variance   =  1.000162 
##                coef. of skewness  =  0.007264889 
##                coef. of kurtosis  =  3.055616 
## Filliben correlation coefficient  =  0.9995629 
## ******************************************************************

The worm plot wp()

wp(m0)
## Warning in wp(m0): Some points are missed out 
## increase the y limits using ylim.all

The dtop() function

dtop(m0, xvar = dutchboys$age, n.inter = 9)

The Q.stats() function

qstats<-Q.stats(m0, xvar = dutchboys$age, n.inter = 9)
## Warning in Q.stats(m0, xvar = dutchboys$age, n.inter = 9): the number of intervals have change to 16
## 

print(qstats, digits = 3)
##                        Z1     Z2      Z3      Z4 AgostinoK2    N
##  0.031 to  0.225   0.2559  0.972  1.2742  0.4142     1.7952  431
##  0.225 to  0.685  -0.2327 -1.437 -1.6868  0.0237     2.8460  425
##  0.685 to  1.189   0.3821 -1.516  1.1526 -1.1710     2.6996  434
##  1.189 to  1.797   0.3776  1.164  1.7013  2.6038     9.6739  422
##  1.797 to  2.807  -0.4435  1.538 -1.9807  0.1493     3.9454  428
##  2.807 to  4.296  -0.3873 -1.338 -0.0945  1.1782     1.3971  428
##  4.296 to  8.852   0.1397  1.091 -0.6385  0.1364     0.4263  428
##  8.852 to 10.141  -0.3535 -0.362 -0.7342 -0.4677     0.7578  431
## 10.141 to 11.395   0.2134 -0.156  0.7364 -0.2422     0.6009  425
## 11.395 to 12.617  -0.0419 -0.582  0.6609 -1.7263     3.4169  428
## 12.617 to 13.736  -0.5515 -0.988  1.2625 -0.5220     1.8662  428
## 13.736 to 14.886   0.0546  1.569  0.2815  0.9695     1.0192  428
## 14.886 to 16.091   0.2269 -0.106 -0.2831 -0.6806     0.5434  429
## 16.091 to 17.380  -0.2647  0.868 -1.3765  0.1720     1.9243  429
## 17.380 to 18.760   1.1709 -0.237  1.7929  0.9704     4.1560  427
## 18.760 to 20.986  -0.7040 -0.605 -0.9640  0.4204     1.1060  427
## TOTAL Q stats      3.2420 17.127 22.3410 15.8332    38.1742 6848
## df for Q stats     2.5988 12.677 13.0889 13.9998    27.0887    0
## p-val for Q stats  0.2903  0.177  0.0520  0.3237     0.0768    0

The table above indicates that the \(Q\) statistics are reasonable (since all their pvalues are greater than 0.05). The plot of the \(Z\) statistics indicates an adequate model, since all the values of |Z| are less than 2, shown by no squares within the circles.

Coming back to the interpretation of centile estimation

Looking at the centile curve

Actually for “magic look”, i.e. cool, not much an applicant

centiles(m0)

## % of cases below  0.4 centile is  0.4088785 
## % of cases below  2 centile is  1.796145 
## % of cases below  10 centile is  9.886098 
## % of cases below  25 centile is  25.70093 
## % of cases below  50 centile is  50.05841 
## % of cases below  75 centile is  74.41589 
## % of cases below  90 centile is  90.02629 
## % of cases below  98 centile is  98.23306 
## % of cases below  99.6 centile is  99.70794

Looking at the centile fan

Same as above, 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th centiles of HC

centiles.fan(m0,dutchboys$age,cent=c(1,5,10,25,50,75,90,95), ylab="head circumference", xlab="age")

Looking at the median and quantile

More intuitive measures

dutchboys%>%mutate(agefr=floor(age))%>%group_by(agefr)%>%mutate(P50=median(hc, rm.na=T))%>%distinct(agefr,P50)
#dutchboys%>%mutate(agefr=floor(age))%>%group_by(agefr)%>%summarise(quantile(hc,c(.1, 0.25, 0.5, 0.75, 0.9, 0.99)))
#https://tbradley1013.github.io/2018/10/01/calculating-quantiles-for-groups-with-dplyr-summarize-and-purrr-partial/
p <- c(.1, 0.25, 0.5, 0.75, 0.9, 0.99)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)
p_funs
## $`10%`
## <partialised>
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## 
## $`25%`
## <partialised>
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## 
## $`50%`
## <partialised>
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## 
## $`75%`
## <partialised>
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## 
## $`90%`
## <partialised>
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## 
## $`99%`
## <partialised>
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
dutchboys%>%mutate(agefr=floor(age))%>%group_by(agefr)%>%summarise_at(vars(hc), funs(!!!p_funs))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Predictions

The most application of centile estimation

References