fastFGEE is a fast toolkit for analyzing longitudinal
functional data with Functional Generalized Estimating Equations (FGEE).
We begin with an evenly spaced-grid and then show how to model uneven
grids. We currently support exchangeable, AR1 and independence working
correlation structures for continuous, count, and binary outcomes.
The development version of fastFGEE can be downloaded by
executing the following command on your R console (you may need to set
dependencies=TRUE:
devtools::install_github("gloewing/fastFGEE")
To load the package use:
library(fastFGEE)
We start by analyzing simulated data. Our implementation works by
first fitting a function-on-scalar regression with the pffr
function in the refund R package, and then updating that
model. Thus we need to format data for the pffr function.
We can fit the pffr model ourselves and feed it in as an
argument to the fgee function. Alternatively, we can fit
the model all with one command by specifying the model formula in the
fgee function. This example includes both a functional and
scalar covariate, thus this is implicitly a concurrent
function-on-scalar regression example. Our code applies to other models,
and only requires one to alter their initial pffr fit.
We will use simulated data saved as an RDS file in the
data folder on the github repo:
binary_ar1_data.
Let’s load the data and take a look at its structure. Please make
sure to either set your working directory in R first using
setwd() function, or specify the file pathname. This RDS
file is a dataframe where some dataframe “columns” are themselves a
matrix. We show examples of these columns and then repeat the steps we
used to construct the data object so users can replicate it with their
dataset. We start by showing how the dataframe’s Y variable
is a matrix with \(L\) columns, where
\(L\) is the number of points in the
functional domain.
dat <- readRDS("binary_ar1_data") # read in data
head(dat$Y[,1:5])
The other variables in the dataframe are standard numeric columns.
dat <- readRDS("binary_ar1_data") # read in data
print("Participant/cluster ID")
## [1] "Participant/cluster ID"
head(dat$ID) # cluster/subject IDs
## [1] 1 1 1 1 1 1
print("Covariate: X1")
## [1] "Covariate: X1"
head(dat$X1) # covariate: X1
## [1] 1.230814 1.230814 1.230814 1.230814 1.230814 1.230814
We can reconstruct this data object as follows. dat$Y is
a matrix with \(L\) columns and each
row is a functional observation of the outcome. We use the
I() syntax to store this matrix as a “column” in the
dat data.frame.
dat <- data.frame(Y = I(dat$Y), # save the outcome matrix Y as "AsIs" format
X1 = dat$X1,
X2 = dat$X2,
ID = dat$ID,
time = dat$time)
Let’s fit the following function-on-scalar regression model, where \(i \in \{1,2,\ldots,N\}\), and longitduinal observation \(j \in \{1,2,\ldots,n_i\}\), where \(n_i\) is the cluster size of cluster \(i\). Notice that the \(X_{1,i,j}\) is a scalar covariate and \(X_{2,i,j}(s)\) is a functional covariate that varies across the functional domain, indexed by \(s\).
Model 1: \[ \begin{aligned} & \text{logit} \left(\mathbb{E}[Y_{ij}(s) \mid {X}_{1,i,j}, ~{X}_{2,i,j}(s)] \right ) = \beta_0(s) + {X}_{1,i,j}{\beta}_1(s) + {X}_{2,i,j}(s){\beta}_2(s) \notag \end{aligned} \]
We specify the cluster variable and correlation type as
cov.type=ar1. We have to specify the time variable name
used for the “ar1” covariance type (“time”). We fit the model and and
plot the coefficient estimates. Currently, the ar1 correlation
implementation works for evenly spaced longitudinal observations
(“time”), but message the package authors if you need the implementation
for timepoints at uneven points (this only takes a minor code adjustment
to implement).
fit <- fgee(formula = Y ~ X1 + X2,
data = dat,
cluster = "ID",
family = "binomial",
time = "time",
cov.type = "ar1")
fgee.plot(fit)
## TableGrob (2 x 2) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (2-2,1-1) arrange gtable[layout]
Now let’s show model fitting that apply to cases where the functional
outcome is observed on an uneven grid. For demonstration purposes, we
reuse the same dataset but the code apply to even and uneven grids. We
need to reorganize the data to a “long” format with one column for the
outcome, Y, and a second column with the functional domain
index. First we concatenate all the columns together (i.e. we remove the
I() formatting), and then use tidyr to
reformat the data.
dat_wide <- data.frame(ID = dat$ID,
X1 = dat$X1,
X2 = dat$X2,
time = dat$time,
as.matrix(dat$Y))
head(dat_wide[,1:7])
Since all the column names of the outcome matrix Y start
with the prefix Y_, we can reorganize them with the
following code.
# for uneven grids
library(tidyr)
dat_long <- tidyr::pivot_longer(dat_wide,
cols = starts_with("Y_"),
names_prefix = "Y_",
names_to = "yindex",
values_to = "Y",
values_drop_na = FALSE) %>%
# ensure functional domain points and time are numeric values
dplyr::mutate(yindex = as.integer(yindex),
time = as.numeric(time))
head(dat_long)
Now we only have a single (numeric) column for the Y
variable and a second column for the functional domain index
(yindex). The last step is to make a new outcome matrix
with the following format:
# for uneven grids
Y.mat <- data.frame(.obs = 1:nrow(dat_long), # row number
.index = dat_long$yindex, # functional domain point index
.value = dat_long$Y) # outcome value
head(Y.mat)
Now we can fit a refund::pffr() model and feed that to
the fgee code. Let’s use 10 knots and B-splines.
knots <- 10
spline.basis <- "bs"
# initial fit
fit_pffr <- refund::pffr(formula = Y ~ X1 + X2,
algorithm = "bam",
family = "binomial",
discrete = TRUE,
yind = Y.mat$yindex,
ydata = Y.mat,
bs.yindex = list(bs = spline.basis,
k = knots),
data = dat_long)
# fgee update
fit <- fgee(formula = Y ~ X1 + X2,
pffr.mod = fit_pffr,
data = dat,
cluster = "ID",
family = "binomial",
time = "time",
cov.type = "exchangeable")
# fgee.plot(fit)
We can fit exchangeable or independence working correlation
structures by adjusting the cov.type argument. The
time argument is unnecessary to specify here since those
correlation structures do not incorporate the timing of the longitudinal
observations.
# fgee update
fit.ex <- fgee(formula = Y ~ X1 + X2,
data = dat,
cluster = "ID",
family = "binomial",
cov.type = "exchangeable")
# fgee.plot(fit.ex)
fit.ind <- fgee(formula = Y ~ X1 + X2,
data = dat,
cluster = "ID",
family = "binomial",
cov.type = "independence")
# fgee.plot(fit.ind)
Sometimes we might instead wish to model correlation in the
functional direction. Currently the package allows this with an AR1
structure (assuming the functional domain points are evenly spaced). To
do this, we set long.dir = FALSE. We need a “time” variable
that specifies the longitudinal observation number (these need
not be evenly spaced).
fit <- fgee(formula = Y ~ X1 + X2,
data = dat,
cluster = "ID",
family = "binomial",
cov.type = "ar1",
time = "time",
long.dir = FALSE)
# fgee.plot(fit)
To use a non-parametric cluster bootstrap for coefficient variance,
set sandwich = "fastBoot". This will influence both
pointwise and joint CIs. Set joint.CI = "np" to estimate
quantiles used in joint CI construction with a non-parametric cluster
bootstrap-based procedure. These are set independently of each other.
For example, a setting of joint.CI = "np" and
sandwich = TRUE uses a sandwich estimator to calculate
standard errors for both pointwise and joint CIs.
For use of this package or vignette, please cite the following papers:
Gabriel Loewinger, Alex Levis, Erjia Cui, and Francisco Pereira. Fast Penalized Generalized Estimating Equations for Large Longitudinal Functional Datasets. arXiv (2025).