::install_github('tingtingzhan/TukeyGH77') remotes
Optimized Letter-Value Based Estimates for Tukey’s g-&-h Distribution
Introduction
This vignette of package TukeyGH77
(CRAN, Github) documents
- Tukey (1977)’s g-&-h transformation and its inverse
- Tukey (1977)’s g-&-h distribution density, cumulative probability, quantile, and random generator
- an improvement to Hoaglin (1985)’s letter-value based estimates.
R terminology might be different from that of mathematics and statistics. Please refer to Appendix Section 5.1 for explanation and reference of the terms and abbreviations used in this vignette.
Package TukeyGH77
Imports
packages
ggplot2
(Wickham 2016, version 3.5.2),geomtextpath
(Cameron and van den Brand 2025, version 0.2.0),latex2exp
(Meschiari 2022, version 0.9.6),patchwork
(Pedersen 2025, version 1.3.1),rlang
(Henry and Wickham 2025, version 1.1.6),scales
(Wickham, Pedersen, and Seidel 2025, version 1.4.0), for visualization- 🗝
rstpm2
(X.-R. Liu, Y. Pawitan, and M. Clements 2018, version 1.6.7), key dependency, for functionrstpm2::vuniroot()
Prerequisite
New features are first implemented on Github.
And eventually make their way to CRAN
.
::install.packages('TukeyGH77') # Developers: do NOT use! utils
Note to Users
Examples in this vignette require that the search
path has
library(TukeyGH77)
Tukey’s g-&-h Transformation
Tukey’s Transformation
Tukey (1977)’s g-&-h random variable t_{gh} is a monotone transformation of the standard normal variable z , t_{gh} = \begin{cases} z\cdot G_g(z) & g\neq 0,\ g\text{-distribution} \\ z\cdot H_h(z) & h>0,\ h\text{-distribution} \\ z\cdot G_g(z)\cdot H_h(z) & g\neq 0,h>0,\ gh\text{-distribution} \\ \end{cases} with skewness
G_g(z)= \begin{cases} (e^{gz}-1)/gz & g\neq 0\\\ \displaystyle\lim_{g\rightarrow 0} G_{g \ne 0}(z)=1 & g=0 \end{cases} and kurtosis
H_h(z)=
\begin{cases}
e^{hz^2/2} & h>0\\
1 & h =0
\end{cases}
Tukey’s transformation function z2gh()
transforms the standard normal variable z to Tukey’s g-&-h random variable t_{gh}.
set.seed(15); rnorm(1e3L) |>
z2gh(g = .5, h = .1) |>
hist(breaks = 30L, xlab = NULL, main = c('Tukey(g=.5, h=.1)'))
Inverse Tukey’s Transformation
Inverse Tukey’s transformation \zeta_{gh}
\zeta_{gh}(t) =
\begin{cases}
g^{-1}\ln(gt+1), & g\neq0 \\
z:\ z\cdot H_h(z) = t, & h>0 \\
z:\ z\cdot G_g(z)\cdot H_h(z) = t, & g\neq 0,\ h>0
\end{cases}
does not have a closed analytical form when h>0, but the numerical solution could be found by using Brent (1973)’s root-finding algorithm implemented in function rstpm2::vuniroot()
.
Inverse Tukey’s transformation function gh2z()
transforms the Tukey’s g-&-h random variable t_{gh} to standard normal variable z. A closed analytical form of \zeta_{gh} exists when h=0, thus the discrepancy is merely a floating-point error.
set.seed(24); x = rnorm(1e3L)
|>
x z2gh(g = .3, h = 0) |>
gh2z(g = .3, h = 0) |>
all.equal.numeric(target = x, tolerance = .Machine$double.eps)
#> [1] "Mean relative difference: 2.871014e-16"
We have to rely on a root-finding algorithm when h>0. To find a numerical solution for standard normal quantile z, we need only to search the interval of (-8.3, 8.3), as
stopifnot(identical(pnorm(8.3), 1))
Nonetheless, the root-finding algorithm would have much larger discrepancy (which can be controlled by parameter tol
inside function gh2z()
).
set.seed(91); x = rnorm(1e3L)
|>
x z2gh(g = 0, h = .1) |>
gh2z(g = 0, h = .1) |>
all.equal.numeric(target = x, tolerance = .Machine$double.eps)
#> [1] "Mean relative difference: 5.191943e-06"
set.seed(56); x = rnorm(1e3L)
|>
x z2gh(g = .3, h = .1) |>
gh2z(g = .3, h = .1) |>
all.equal.numeric(target = x, tolerance = .Machine$double.eps)
#> [1] "Mean relative difference: 5.138869e-06"
Tukey g-&-h Distribution
Probability
The distribution function is F_{gh}(t)=\text{Pr}(T_{gh} \le t)=\text{Pr}\left(Z \le \zeta_{gh}(t)\right).
curve(pGH(x, g = .3, h = .1), from = -2.5, to = 3.5, n = 501L, ylab = 'Probability')
Density
Density function f_{gh} has a closed analytical form in terms of \zeta_{gh},
f_{gh}(t) = \dfrac{e^{-z^2/2}}{\sqrt{2\pi}\cdot\partial t_{gh}/\partial z}\Bigg|_{z=\zeta_{gh}(t)}
where
\dfrac{\partial t_{gh}}{\partial z}= \begin{cases} e^{gz}, & g\neq0\\ e^{hz^2/2}(1+hz^2), & h>0 \\ e^{hz^2/2}\left(e^{gz}+g^{-1}hz(e^{gz}-1)\right), & g\neq 0,\ h>0 \end{cases}
Note that when h=0, domain has a bound.
curve(dGH(x, g = 1, h = 0), from = -1.2, to = 3.5, n = 501L, ylab = 'Density')
When h>0, domain is (-\infty, \infty).
curve(dGH(x, g = .3, h = .1), from = -2.5, to = 3.5, n = 501L, ylab = 'Density')
Quantile
curve(qGH(x, g = .3, h = .1), from = 0, to = 1, n = 501L, ylab = 'Quantile')
Simulation
set.seed(17); rGH(n = 1e3L, g = .3, h = .1) |>
hist(breaks = 30L, xlab = NULL, main = 'TukeyGH(g = .3, h = .1)')
Optimized Letter-Value Based Estimates
Consider a random sample T from Tukey’s g-&-h distribution with parameters (A,B,g,h).
Let \hat{t}_p be the p-th sample quantile, for 0<p<.5.
We refer A-t_p as the lower half-spread (LHS) and t_{1-p}-A as the upper half-spread (UHS).
Tukey’s h-Model
From Tukey’s h-transformation (i.e., g=0) on LHS and UHS, as well as their summation (Hoaglin (1985), equations (26a), (26b), (27) and (28)),
\begin{cases} A - t_p = -Bz_pe^{hz^2_p/2} & \Rightarrow\quad \ln\left(\dfrac{\hat{A}-\hat{t}_p}{-z_p}\right) = {\color{blue}{\ln B_\text{L}}}+{\color{blue}{h_\text{L}}}z^2_p/2 + \varepsilon\\ t_{1-p} - A = Bz_{1-p}e^{hz^2_{1-p}/2} & \Rightarrow\quad \ln\left(\dfrac{\hat{t}_{1-p}-\hat{A}}{z_{1-p}}\right) = {\color{blue}{\ln B_\text{U}}}+{\color{blue}{h_\text{U}}}z^2_p/2 + \varepsilon\\ t_{1-p}-t_p = B(z_{1-p}-z_p)e^{hz^2_p/2} & \Rightarrow\quad \ln\left(\dfrac{\hat{t}_{1-p} - \hat{t}_p}{-2z_p}\right) = {\color{blue}{\ln B}} + {\color{blue}{h}}z^2_p/2 + \varepsilon \end{cases}
Function letterValue_h()
estimates Tukey’s h-model. We explore some choices of \hat{A} between the 45th to 55th sample percentile, with parameter A_select
,
A_select = 'median'
(default) of random sample T, the naïve choice in Hoaglin (1985)A_select = 'B.optim'
, to minimize \text{err}_B =(\hat{B}_\text{U}-\hat{B}_\text{L})^2A_select = 'h.optim'
, to minimize \text{err}_h =(\hat{h}_\text{U}-\hat{h}_\text{L})^2
Note that the estimated intercept and slope parameters (\ln\hat{B}, \hat{h}) on both half spreads do not depend on \hat{A}.
set.seed(28); rGH(1e3L, h = .2) |>
letterValue_h()
#> A B g h
#> -0.07240777 1.01353750 0.00000000 0.14859619
Figure 1 shows different choices of \hat{A} affect (\hat{B}_\text{L}, \hat{h}_\text{L}) and (\hat{B}_\text{U}, \hat{h}_\text{U}), but do not affect (\hat{B}, \hat{h}).
[R] Tukey h-model: choices of A
set.seed(28); rGH(1e3L, h = .2) |>
letterValue_h(A_select = 'demo')
Tukey’s g-Model
From Tukey’s g-transformation (i.e., h=0) on LHS and UHS, as well as their ratio (Hoaglin (1985), equations (8a), (8b), (10)),
\begin{cases} t_p - A = B(e^{gz_p}-1)/g & \Rightarrow\quad \hat{t}_p - \hat{A} = {\color{blue}{0}} + {\color{blue}{B_\text{L}}}(e^{\hat{g}z_p}-1)/\hat{g} + \varepsilon\\ t_{1-p} - A = B(e^{gz_{1-p}}-1)/g & \Rightarrow\quad \hat{t}_{1-p} - \hat{A} = {\color{blue}{0}} + {\color{blue}{B_\text{U}}}(e^{\hat{g}z_{1-p}}-1)/\hat{g} + \varepsilon\\ \dfrac{A-t_p}{t_{1-p}-A} = \dfrac{1-e^{gz_p}}{e^{-gz_p}-1} = e^{gz_p} & \Rightarrow\quad \hat{g}_p = \dfrac{1}{z_{p}}\cdot \ln\left(\dfrac{\hat{A}-\hat{t}_{p}}{\hat{t}_{1-p}-\hat{A}}\right) \end{cases}
Function letterValue_g()
estimates Tukey’s g-model. We first explore some choices of \hat{A} between the 45th to 55th sample percentile, with parameter A_select
,
A_select = 'median'
of random sample T, the naïve choice in Hoaglin (1985)A_select = 'optim'
(default), to minimize the standard deviationstats::sd()
of \hat{g}_p
We then explore some choices of \hat{g} between the 40th to 60th percentile of \hat{g}_p, with parameter g_select
,
g_select = 'median'
of \hat{g}_p, the naïve choice in Hoaglin (1985)g_select = 'optim'
(default), to minimize \text{err}_B =(\hat{B}_\text{U}-\hat{B}_\text{L})^2
set.seed(43); rGH(1e3L, g = .3) |>
letterValue_g()
#> A B g h
#> 0.0001778168 0.9962584286 0.3360145770 0.0000000000
Figure 2 shows different choices of \hat{A} affect the estimated \hat{g}_p.
Figure 3 shows different choices of \hat{g} typically do not affect \hat{B}_\text{L}, \hat{B}_\text{U} nor \hat{B} too much.
[R] Tukey g-model: choices of A
set.seed(43); rGH(1e3L, g = .3) |>
letterValue_g(A_select = 'demo')
[R] Tukey g-model: choices of g
set.seed(43); rGH(1e3L, g = .3) |>
letterValue_g(g_select = 'demo')
Tukey’s g-&-h Model
From Tukey’s g-&-h transformation on LHS and UHS, as well as their difference and ratio (Hoaglin (1985), equation (33)),
\begin{cases} t_p - A= Bg^{-1}(e^{gz_p}-1)e^{hz^2_p/2} & \Rightarrow\quad \ln\left(\dfrac{\hat{g}(\hat{t}_p-\hat{A})}{e^{\hat{g}z_p}-1}\right) = {\color{blue}{\ln B_\text{L}}}+{\color{blue}{h_\text{L}}}z^2_p/2 + \varepsilon\\ t_{1-p} - A = Bg^{-1}(e^{gz_{1-p}}-1)e^{hz^2_{1-p}/2} & \Rightarrow\quad \ln\left(\dfrac{\hat{g}(\hat{t}_{1-p}-\hat{A})}{e^{-\hat{g}z_p}-1}\right) = {\color{blue}{\ln B_\text{U}}}+{\color{blue}{h_\text{U}}}z^2_p/2 + \varepsilon\\ t_{1-p}-t_p = Bg^{-1}(e^{-gz_p}-e^{gz_p})e^{hz^2_p/2} & \Rightarrow\quad \ln\left(\dfrac{\hat{g}(\hat{t}_{1-p}-\hat{t}_p)}{e^{-\hat{g}z_p}-e^{\hat{g}z_p}}\right) = {\color{blue}{\ln B}} + {\color{blue}{h}}z^2_p/2 + \varepsilon\\ \dfrac{A-t_p}{t_{1-p}-A} = \dfrac{1-e^{gz_p}}{e^{-gz_p}-1} = e^{gz_p} & \Rightarrow\quad \hat{g}_p\quad \text{same as}\ g\text{-model} \end{cases}
The estimates (\ln\hat{B}_\text{L}, \hat{h}_\text{L}), (\ln\hat{B}_\text{U}, \hat{h}_\text{U}) and (\ln\hat{B}, \hat{h}) could all be very sensitive to the choice of \hat{g}.
Function letterValue_gh()
estimates Tukey’s g-&-h model. We first explore some choices of \hat{A} between the 45th to 55th sample percentile, with parameter A_select
,
A_select = 'median'
of random sample T, the naïve choice in Hoaglin (1985)A_select = 'optim'
(default), to minimize the standard deviationstats::sd()
of \hat{g}_p.
We then explore some choices of \hat{g} between the 40th to 60th percentile of \hat{g}_p, with parameter g_select
,
g_select = 'median'
of \hat{g}_p, the naïve choice in Hoaglin (1985)g_select = 'B.optim'
, to minimize \text{err}_B =(\hat{B}_\text{U}-\hat{B}_\text{L})^2g_select = 'h.optim'
(default), to minimize \text{err}_h =(\hat{h}_\text{U}-\hat{h}_\text{L})^2
Perfect example
set.seed(54); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh()
#> A B g h
#> -0.004519068 0.986487241 -0.294966233 0.158689118
This is a ‘perfect’ data set that
- all choices of \hat{A} are pretty close, in terms of the dispersion and/or pattern of \hat{g}_p, Figure 4;
- all choices of \hat{g} have very small \text{err}_h and \text{err}_B, Figure 5.
[R] Tukey gh-model (Perfect): choices of A
set.seed(54); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh(A_select = 'demo')
[R] Tukey gh-model (Perfect): choices of g
set.seed(54); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh(g_select = 'demo')
Salvageable example
set.seed(335); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh()
#> A B g h
#> 0.0004409318 1.0404701657 -0.3128586259 0.0172237338
This is a ‘salvageable’ data set that
- \hat{A}_\text{optim} not only reduces the standard deviations of \hat{g}_p, but also removes the clear pattern in \hat{g}_p, Figure 6;
- \hat{g}_\text{h.optim} provides both \hat{h}_\text{L}>0 and \hat{h}_\text{U}>0, and surely \hat{h}>0, Figure 7.
[R] Tukey gh-model (Salvageable): choices of g
set.seed(335); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh(g_select = 'demo')
Hopeless example
set.seed(39); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh()
#> A B g h
#> -0.03884931 1.09634946 -0.25245660 0.00000000
This is a ‘hopeless’ data set that
- \hat{A}_\text{optim} neither reduces much the standard deviations of \hat{g}_p, nor removes the clear pattern in \hat{g}_p, Figure 8;
- estimated slope \hat{h}<0 for all choices of \hat{g}, Figure 9.
[R] Tukey gh-model (Hopeless): choices of A
set.seed(39); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh(A_select = 'demo')
[R] Tukey gh-model (Hopeless): choices of g
set.seed(39); rGH(n = 1e3L, g = -.3, h = .1) |>
letterValue_gh(g_select = 'demo')
Good new is that we do not see ‘hopeless’ data sets often in simulations (\approx 6.3\% in random seeds 1:1e3L
). In this case, we recommend reporting \hat{h}=0.
Appendix
Terms & Abbreviations
Term / Abbreviation | Description |
---|---|
|> |
Forward pipe operator introduced in R 4.1.0 |
all.equal.numeric |
Test of near equality |
CRAN , R |
The Comprehensive R Archive Network |
curve |
Function plots |
.lm.fit |
Least squares regression, the internal workhorse |
mad |
Median Absolute Deviation |
median |
Median |
dnorm , pnorm , qnorm , rnorm |
Normal Distribution |
optim , optimize |
Optimization |
sd |
Standard Deviation |
search |
Search path |
seed |
Random number generation seed |
vuniroot |
Vectorised one-dimensional root-finding, in package rstpm2 |