knitr::opts_chunk$set(echo = TRUE)
#install.packages("expectreg") #not available anymore
#Thus, install from Github
#library(devtools) #not working also
#install_github("cran/expectreg")
#library(expectreg)
library(repmis) # source_file: load Rda from Github
library(ggplot2) # ggplot
library(gamlss) # lms()
library(tidyverse) # %>%#https://github.com/opetchey/RREEBES/wiki/Reading-data-and-code-from-an-online-github-repository
source_data("https://github.com/cran/expectreg/blob/master/data/dutchboys.rda?raw=True")## [1] "dutchboys"
#load(file="Centile Estimation\\dutchboys.rda")
dim(dutchboys)## [1] 6848 10
names(dutchboys)## [1] "defnr" "age" "hgt" "wgt" "hc" "hgt.z" "wgt.z" "hc.z" "bmi.z"
## [10] "hfw.z"
hc : the head circumference of 6848 boysage : the age in yearsggplot(dutchboys, aes(x=age, y=hc)) +
geom_point(shape = 1, color = "#0052bb", size = 1.5) +
ggtitle("The Dutch boys data") +
xlab("age") + ylab("head circumference") +
theme_bw()set.seed(080320)
IND<-sample.int(dim(dutchboys)[1],1000, replace=FALSE)
dutchboys1<-dutchboys[IND,]
ggplot(dutchboys1, aes(x=age, y=hc)) +
geom_point(shape = 1, color = "#0052bb", size = 1.5) +
ggtitle("Random sample of 1,000 observations") +
xlab("age") + ylab("head circumference") +
theme_bw()Centile estimation includes methods for estimating the age-related distribution of human growth. The standard estimation of centile curves usually involves two continuous variables:
1. the response variable, that is, the variable we are interested in and for which we are trying to find the centile curves, e.g. weight, BMI, head circumference.
2. the explanatory variable, e.g. age.
Centile estimation can be extended to more than one explanatory continuous variable, e.g. age and height. For explanatory categorical variables such as gender, the usual practice is to produce separate charts for each level of the categorical variable.
\(\Longrightarrow\) Dutch boys data with centiles
The popular methodology for creating centile references for individuals from a population comprises two different methods:
The 100\(p\) centile of a continuous random variable \(Y\) is the value \(y_p\) such that:
The LMS method can be fitted within gamlss is defined as follows:
\[Y \sim f_Y (y \mid \mu, \sigma, \upsilon, \tau)\] where \(f_Y()\) is any distribution, typically represents the BCCG (Box-Cox Cole and Green), BCPE (Box-Cox power exponential), and BCT (Box-Cox t).
\[ Y \ = \ \text{head circumference} \ \text{and} \ X \ = \ AGE^\color{red}{\epsilon}\]
The power transformation for \(x\) (here is AGE), i.e. \(x^\epsilon\), is usually needed when the response variable has an early or late spell of fast growth. In those cases the transformation of \(x\) can stretch the \(x\) scale, improving the fit of the smooth curve.
#m0<-lms(hc, age, data=dutchboys, trans.x=TRUE, k=2)
#m0$family # "BCTo" "Box-Cox-t-orig."
#m0$power # 0.3827328
m0<-lms(hc,age,families=c("BCCGo","BCPEo","BCTo"),data=dutchboys,
k=3,calibration=F, trans.x=T)## *** Checking for transformation for x ***
## *** power parameters 0.3805277 ***
## *** Initial fit***
## GAMLSS-RS iteration 1: Global Deviance = 26367.98
## GAMLSS-RS iteration 2: Global Deviance = 26367.98
## *** Fitting BCCGo ***
## GAMLSS-RS iteration 1: Global Deviance = 26425.3
## GAMLSS-RS iteration 2: Global Deviance = 26284.07
## GAMLSS-RS iteration 3: Global Deviance = 26282.6
## GAMLSS-RS iteration 4: Global Deviance = 26282.61
## GAMLSS-RS iteration 5: Global Deviance = 26282.61
## *** Fitting BCPEo ***
## GAMLSS-RS iteration 1: Global Deviance = 26660.28
## GAMLSS-RS iteration 2: Global Deviance = 26110.56
## GAMLSS-RS iteration 3: Global Deviance = 26107.54
## GAMLSS-RS iteration 4: Global Deviance = 26107.57
## GAMLSS-RS iteration 5: Global Deviance = 26107.56
## *** Fitting BCTo ***
## GAMLSS-RS iteration 1: Global Deviance = 26185.76
## GAMLSS-RS iteration 2: Global Deviance = 26021.22
## GAMLSS-RS iteration 3: Global Deviance = 26020.8
## GAMLSS-RS iteration 4: Global Deviance = 26021.14
## GAMLSS-RS iteration 5: Global Deviance = 26021.39
## GAMLSS-RS iteration 6: Global Deviance = 26021.51
## GAMLSS-RS iteration 7: Global Deviance = 26021.58
## GAMLSS-RS iteration 8: Global Deviance = 26021.61
## GAMLSS-RS iteration 9: Global Deviance = 26021.62
## GAMLSS-RS iteration 10: Global Deviance = 26021.63
## GAMLSS-RS iteration 11: Global Deviance = 26021.63
## GAMLSS-RS iteration 12: Global Deviance = 26021.63
## GAMLSS-RS iteration 13: Global Deviance = 26021.63
## % of cases below 0.3830381 centile is 0.3942757
## % of cases below 2.275013 centile is 2.000584
## % of cases below 9.121122 centile is 8.995327
## % of cases below 25.24925 centile is 25.93458
## % of cases below 50 centile is 50.05841
## % of cases below 74.75075 centile is 74.10923
## % of cases below 90.87888 centile is 90.71262
## % of cases below 97.72499 centile is 97.95561
## % of cases below 99.61696 centile is 99.70794
m0$family # [1] "BCTo" "Box-Cox-t-orig." for k = 4## [1] "BCTo" "Box-Cox-t-orig."
m0$power # 0.3083458 for k = 4## [1] 0.3805277
\(\Longrightarrow\) Note that the best distribution family, according to GAIC(3), is stored in m0$family, i.e. BCTo. The power transformation chosen for age is stored in m0$power, i.e. \(u = age^{m0\$power}\) The chosen power transformation for age is given by m0$power, i.e. \[u = age^{0.3805277}\] and the best distribution, according to GAIC(\(\kappa\) = 2), is given by m0$family, and is BCCGo. The chosen model m0 can be fitted directly by
dutchboys$Tage<-(dutchboys$age)^(m0$power)
m1<-gamlss(hc~pb(Tage), sigma.formula=~pb(Tage),
nu.formula=~pb(Tage), tau.fo=~pb(Tage), family=BCTo,
trace=FALSE, data=dutchboys)Alternatively m0 can be fitted by
m2<-gamlss(hc~pb(age^m0$power), sigma.formula=~pb(age^m0$power),
nu.formula=~pb(age^m0$power), tau.fo=~pb(age^m0$power),
family=BCTo, data=dutchboys,
trace=FALSE)The effective degrees of freedom used for the smooth functions in model m0 (including 2 for the constant and linear terms) are given by
edfAll(m0)## $mu
## $mu$`pb(x, df = mu.df)`
## [1] 13.40125
##
##
## $sigma
## $sigma$`pb(x, df = sigma.df)`
## [1] 5.645407
##
##
## $nu
## $nu$`pb(x, df = nu.df)`
## [1] 2.911063
##
##
## $tau
## $tau$`pb(x, df = tau.df)`
## [1] 2.000227
# m0$mu.df; m0$sigma.df; m0$nu.df; m0$tau.dfLet \(Y\) be a random variable with range \(Y > 0\) defined through the transformed variable \(Z\) given by:
\[\begin{align} Z &= \frac{1}{\sigma\upsilon} [(\frac{Y}{\mu})^\upsilon - 1] &\text{, if} \ \upsilon \neq 0 \\ &= \frac{1}{\sigma} log(\frac{Y}{\mu}) &\text{, if} \ \upsilon = 0. \end{align}\]We need to select the five values \(df_\mu, \ df_\sigma, \ df_\upsilon, \ df_\tau, \ \epsilon\)
Diagnostics should be used in all above cases
For different values of # minimize GAIC(#) to find estimates for \(df_\mu, \ df_\sigma, \ df_\upsilon, \ df_\tau, \ \epsilon\)
dutchboys$Tage<-(dutchboys$age)^(m0$power)
m3<-gamlss(hc~pb(Tage, method="GAIC", k=3),
sigma.formula=~pb(Tage, method="GAIC", k=3),
nu.formula=~pb(Tage, method="GAIC", k=3),
tau.formula = ~pb(Tage, method="GAIC", k=3),
family=BCT,data=dutchboys, trace=FALSE)
edfAll(m3)## $mu
## $mu$`pb(Tage, method = "GAIC", k = 3)`
## [1] 11.44023
##
##
## $sigma
## $sigma$`pb(Tage, method = "GAIC", k = 3)`
## [1] 2.013972
##
##
## $nu
## $nu$`pb(Tage, method = "GAIC", k = 3)`
## [1] 2.595608
##
##
## $tau
## $tau$`pb(Tage, method = "GAIC", k = 3)`
## [1] 2.039529
The gamlss functions find.hyper() can be used
mod<-quote(gamlss(hc~cs(Tage,df=p[1]), sigma.formula=~cs(Tage,df=p[2]),
nu.formula=~cs(Tage,df=p[3], c.spar=c(-1.5,2.5)),
tau.formula=~Tage, data=dutchboys, family=BCT,
control=gamlss.control(trace=FALSE)))
op<-find.hyper(model=mod, other=quote(Tage<-age^p[4]),
par=c(10,4,1,0.25),lower=c(0.1,0.1,0.1,0.001),
steps=c(0.1,0.1,0.1,0.2),factr=2e9,parscale=c(1,1,1,0.035),
k=3)## par 10 4 1 0.25 crit= 26092.45 with pen= 3
## par 10.1 4 1 0.25 crit= 26092.46 with pen= 3
## par 9.9 4 1 0.25 crit= 26092.45 with pen= 3
## par 10 4.1 1 0.25 crit= 26092.4 with pen= 3
## par 10 3.9 1 0.25 crit= 26092.5 with pen= 3
## par 10 4 1.1 0.25 crit= 26092.57 with pen= 3
## par 10 4 0.9 0.25 crit= 26092.35 with pen= 3
## par 10 4 1 0.257 crit= 26092.45 with pen= 3
## par 10 4 1 0.243 crit= 26092.45 with pen= 3
## par 9.962885 4.5036 0.1368609 0.25 crit= 26092.25 with pen= 3
## par 10.06288 4.5036 0.1368609 0.25 crit= 26092.25 with pen= 3
## par 9.862885 4.5036 0.1368609 0.25 crit= 26092.25 with pen= 3
## par 9.962885 4.6036 0.1368609 0.25 crit= 26092.22 with pen= 3
## par 9.962885 4.4036 0.1368609 0.25 crit= 26092.27 with pen= 3
## par 9.962885 4.5036 0.2368609 0.25 crit= 26092.14 with pen= 3
## par 9.962885 4.5036 0.1 0.25 crit= 26092.3 with pen= 3
## par 9.962885 4.5036 0.1368609 0.257 crit= 26092.25 with pen= 3
## par 9.962885 4.5036 0.1368609 0.243 crit= 26092.25 with pen= 3
## par 9.971324 4.472681 0.5274981 0.25 crit= 26092 with pen= 3
## par 10.07132 4.472681 0.5274981 0.25 crit= 26092.01 with pen= 3
## par 9.871324 4.472681 0.5274981 0.25 crit= 26092.01 with pen= 3
## par 9.971324 4.572681 0.5274981 0.25 crit= 26091.98 with pen= 3
## par 9.971324 4.372681 0.5274981 0.25 crit= 26092.04 with pen= 3
## par 9.971324 4.472681 0.6274981 0.25 crit= 26092.01 with pen= 3
## par 9.971324 4.472681 0.4274981 0.25 crit= 26092.03 with pen= 3
## par 9.971324 4.472681 0.5274981 0.257 crit= 26092 with pen= 3
## par 9.971324 4.472681 0.5274981 0.243 crit= 26092 with pen= 3
## par 9.966412 4.623534 0.5604418 0.25 crit= 26091.97 with pen= 3
## par 10.06641 4.623534 0.5604418 0.25 crit= 26091.97 with pen= 3
## par 9.866412 4.623534 0.5604418 0.25 crit= 26091.97 with pen= 3
## par 9.966412 4.723534 0.5604418 0.25 crit= 26091.95 with pen= 3
## par 9.966412 4.523534 0.5604418 0.25 crit= 26091.99 with pen= 3
## par 9.966412 4.623534 0.6604418 0.25 crit= 26091.98 with pen= 3
## par 9.966412 4.623534 0.4604418 0.25 crit= 26091.98 with pen= 3
## par 9.966412 4.623534 0.5604418 0.257 crit= 26091.97 with pen= 3
## par 9.966412 4.623534 0.5604418 0.243 crit= 26091.97 with pen= 3
## par 9.969574 5.005125 0.5626791 0.25 crit= 26091.92 with pen= 3
## par 10.06957 5.005125 0.5626791 0.25 crit= 26091.93 with pen= 3
## par 9.869574 5.005125 0.5626791 0.25 crit= 26091.93 with pen= 3
## par 9.969574 5.105125 0.5626791 0.25 crit= 26091.92 with pen= 3
## par 9.969574 4.905125 0.5626791 0.25 crit= 26091.93 with pen= 3
## par 9.969574 5.005125 0.6626791 0.25 crit= 26091.94 with pen= 3
## par 9.969574 5.005125 0.4626791 0.25 crit= 26091.93 with pen= 3
## par 9.969574 5.005125 0.5626791 0.257 crit= 26091.92 with pen= 3
## par 9.969574 5.005125 0.5626791 0.243 crit= 26091.92 with pen= 3
## par 9.960439 5.067937 0.5571777 0.25 crit= 26091.92 with pen= 3
## par 10.06044 5.067937 0.5571777 0.25 crit= 26091.92 with pen= 3
## par 9.860439 5.067937 0.5571777 0.25 crit= 26091.92 with pen= 3
## par 9.960439 5.167937 0.5571777 0.25 crit= 26091.92 with pen= 3
## par 9.960439 4.967937 0.5571777 0.25 crit= 26091.92 with pen= 3
## par 9.960439 5.067937 0.6571777 0.25 crit= 26091.93 with pen= 3
## par 9.960439 5.067937 0.4571777 0.25 crit= 26091.93 with pen= 3
## par 9.960439 5.067937 0.5571777 0.257 crit= 26091.92 with pen= 3
## par 9.960439 5.067937 0.5571777 0.243 crit= 26091.92 with pen= 3
Tage<-(dutchboys$age)^(op$par[4])
m4<-gamlss(hc~cs(Tage,df=op$par[1]),
sigma.formula=~cs(Tage,df=op$par[2]),
nu.formula=~cs(Tage,df=op$par[3],c.spar=c(-1.5,2.5)),
family=BCT, data=dutchboys)## GAMLSS-RS iteration 1: Global Deviance = 26198.73
## GAMLSS-RS iteration 2: Global Deviance = 26041.95
## GAMLSS-RS iteration 3: Global Deviance = 26041.66
## GAMLSS-RS iteration 4: Global Deviance = 26041.72
## GAMLSS-RS iteration 5: Global Deviance = 26041.73
## GAMLSS-RS iteration 6: Global Deviance = 26041.73
## GAMLSS-RS iteration 7: Global Deviance = 26041.73
Comparing all models:
GAIC(m0,m1,m2,m3,m4,k=3)plot() functionplot(m0)## ******************************************************************
## Summary of the Quantile Residuals
## mean = -0.0004796974
## variance = 1.000162
## coef. of skewness = 0.007264889
## coef. of kurtosis = 3.055616
## Filliben correlation coefficient = 0.9995629
## ******************************************************************
wp()wp(m0)## Warning in wp(m0): Some points are missed out
## increase the y limits using ylim.all
dtop() functiondtop(m0, xvar = dutchboys$age, n.inter = 9)Q.stats() functionqstats<-Q.stats(m0, xvar = dutchboys$age, n.inter = 9)## Warning in Q.stats(m0, xvar = dutchboys$age, n.inter = 9): the number of intervals have change to 16
##
print(qstats, digits = 3)## Z1 Z2 Z3 Z4 AgostinoK2 N
## 0.031 to 0.225 0.2559 0.972 1.2742 0.4142 1.7952 431
## 0.225 to 0.685 -0.2327 -1.437 -1.6868 0.0237 2.8460 425
## 0.685 to 1.189 0.3821 -1.516 1.1526 -1.1710 2.6996 434
## 1.189 to 1.797 0.3776 1.164 1.7013 2.6038 9.6739 422
## 1.797 to 2.807 -0.4435 1.538 -1.9807 0.1493 3.9454 428
## 2.807 to 4.296 -0.3873 -1.338 -0.0945 1.1782 1.3971 428
## 4.296 to 8.852 0.1397 1.091 -0.6385 0.1364 0.4263 428
## 8.852 to 10.141 -0.3535 -0.362 -0.7342 -0.4677 0.7578 431
## 10.141 to 11.395 0.2134 -0.156 0.7364 -0.2422 0.6009 425
## 11.395 to 12.617 -0.0419 -0.582 0.6609 -1.7263 3.4169 428
## 12.617 to 13.736 -0.5515 -0.988 1.2625 -0.5220 1.8662 428
## 13.736 to 14.886 0.0546 1.569 0.2815 0.9695 1.0192 428
## 14.886 to 16.091 0.2269 -0.106 -0.2831 -0.6806 0.5434 429
## 16.091 to 17.380 -0.2647 0.868 -1.3765 0.1720 1.9243 429
## 17.380 to 18.760 1.1709 -0.237 1.7929 0.9704 4.1560 427
## 18.760 to 20.986 -0.7040 -0.605 -0.9640 0.4204 1.1060 427
## TOTAL Q stats 3.2420 17.127 22.3410 15.8332 38.1742 6848
## df for Q stats 2.5988 12.677 13.0889 13.9998 27.0887 0
## p-val for Q stats 0.2903 0.177 0.0520 0.3237 0.0768 0
The table above indicates that the \(Q\) statistics are reasonable (since all their pvalues are greater than 0.05). The plot of the \(Z\) statistics indicates an adequate model, since all the values of |Z| are less than 2, shown by no squares within the circles.
Actually for “magic look”, i.e. cool, not much an applicant
centiles(m0)## % of cases below 0.4 centile is 0.4088785
## % of cases below 2 centile is 1.796145
## % of cases below 10 centile is 9.886098
## % of cases below 25 centile is 25.70093
## % of cases below 50 centile is 50.05841
## % of cases below 75 centile is 74.41589
## % of cases below 90 centile is 90.02629
## % of cases below 98 centile is 98.23306
## % of cases below 99.6 centile is 99.70794
Same as above, 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th centiles of HC
centiles.fan(m0,dutchboys$age,cent=c(1,5,10,25,50,75,90,95), ylab="head circumference", xlab="age")More intuitive measures
dutchboys%>%mutate(agefr=floor(age))%>%group_by(agefr)%>%mutate(P50=median(hc, rm.na=T))%>%distinct(agefr,P50)#dutchboys%>%mutate(agefr=floor(age))%>%group_by(agefr)%>%summarise(quantile(hc,c(.1, 0.25, 0.5, 0.75, 0.9, 0.99)))
#https://tbradley1013.github.io/2018/10/01/calculating-quantiles-for-groups-with-dplyr-summarize-and-purrr-partial/
p <- c(.1, 0.25, 0.5, 0.75, 0.9, 0.99)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
set_names(nm = p_names)
p_funs## $`10%`
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
## $`25%`
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
## $`50%`
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
## $`75%`
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
## $`90%`
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
## $`99%`
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
dutchboys%>%mutate(agefr=floor(age))%>%group_by(agefr)%>%summarise_at(vars(hc), funs(!!!p_funs))## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
The most application of centile estimation