sMTL
is a fast toolkit for \(\ell_0\)-constrained Multi-Task Learning,
Multi-Label Learning and Domain Generalization. \(\ell_0\) constraints select the best subset
of features in a variety of multi-dataset settings. The toolkit can
(approximately) solve the following two problems
\[ \begin{aligned} &\underset{~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 \quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~\mbox{Supp}(\boldsymbol{\beta}_k) = \mbox{Supp}(\boldsymbol{\beta}) ~~~~~~~\mbox{for} ~k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~~~~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]
where \(\boldsymbol{\beta}_k\) is the vector of coefficients associated with dataset (task) \(k\) and is also the \(k^{th}\) column of the matrix \(\mathbb{B}\). We denote \(\mbox{Supp}(\boldsymbol{\beta}_k) =\boldsymbol{z}_k = \mathbb{1}{(\boldsymbol{\beta}_k \neq \mathbf{1})}\), as the ``support’’ of \(\boldsymbol{\beta}_k\): i.e., a \(p \times 1\) indicator vector of whether the entries of \(\boldsymbol{\beta}_k\) are non-zero. Note \(||\boldsymbol{\beta}_k||_0 = \sum_{j=1}^p z_{k,j}\), where the \(\ell_0\) norm, \(||\boldsymbol{u}||_0\), equals the number of non-zero entries in the vector \(\boldsymbol{u}\). For example,
\[ \begin{aligned} &\boldsymbol{\beta} = \begin{bmatrix} 1.23 \\ 0 \\ -0.71 \\ 0 \end{bmatrix} ~~~~~~~~~~~~~ &\mbox{Supp}(\boldsymbol{\beta}) = \boldsymbol{z} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix} \end{aligned} \]
The Common Support problem ensures that each of the \(K\) task-specific (or dataset-specific) \(\boldsymbol{\beta}_k\) have the same support. The Heterogeneous Support problem allows each task to have a different support but penalizes the \(\boldsymbol{\beta}_k\) for having differing supports using the \(\lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2\) penalty.
The parameter \(\lambda_2\) shrinks the \(\boldsymbol{\beta}_k\) \(values\) towards a common \(\bar{\boldsymbol{\beta}} = \frac{1}{K}\sum_{k=1}^K \boldsymbol{\beta}_k\). The parameter \(\lambda_1\) controls the strength of \(\ell_2\) (Ridge) shrinkage where \(\lambda_1||\mathbb{B}||_2^2 = \lambda_1 \sum_{k=1}^K ||\boldsymbol{\beta}_k||_2^2\). Adding either of these shrinkage terms can be effective in reducing the risk of overfitting and often results in better predictive models. We recommend using either but not both of these regularization terms. For Hetergeneous Support problems \(\lambda_z\) controls the degree to which the \(\boldsymbol{z}_k\), the supports of the \(\boldsymbol{\beta}_k\), are shrunk towards a common \(\bar{\boldsymbol{z}} = \frac{1}{K}\sum_{k=1}^K \boldsymbol{z}_k\). When \(\lambda_z\) is high, this results in solutions equivalent to the Common Support approach. The fitting of all methods is conducted over a grid of \(\lambda_1\), \(\lambda_2\) and \(\lambda_z\) values to generate a solution path.
The algorithms provided in sMTL
are based on block
coordinate descent and local combinatorial search. Numerous
computational tricks and heuristics are used to speed up the algorithms
and improve the solution quality. These heuristics include warm starts
and active set convergence. For more details on the algorithms used,
please refer to our paper Multi-Task Learning for Sparsity
Pattern Heterogeneity: A Discrete Optimization Approach.
The toolkit is implemented in Julia along with an easy-to-use R
interface. In this vignette, we provide a tutorial on using the R
interface. We demonstrate how use sMTL
’s main functions for
fitting models and cross-validation.
Note that model functions assume the design (feature) matrices are a
\(\texttt{matrix}\) data-type (not
\(\texttt{data.frames}\)). Thus any
conversion of, for example, categorical/factor variables into a set of
indicator variables, to construct the corresponding design matrix must
be completed before using the sMTL
functions.
The development version of sMTL
can be downloaded from
github by executing:
library(devtools)
install_github("gloewing/sMTL", subdir = "Package/sMTL")
The very first time we use sMTL
we have to keep in mind
the following:
library(sMTL)
smtl_setup(installJulia = TRUE, installPackages = TRUE)
If you ran the code for step 1 above, then you can ignore the rest of the Package Setup section (steps 2-3) and skip to the Tutorial Guide section.
>> println(Sys.BINDIR)
and this will print the binary
path.The following code is an example. Note the string after the code
path=
should be replaced with your computer’s path for
Julia’s binary.
library(sMTL)
smtl_setup(path = "/Applications/Julia-1.5.app/Contents/Resources/julia/bin")
Three \(\texttt{Julia}\) packages are required so if you use an existing \(\texttt{Julia}\) installation see the next step (step 3).
smtl_setup()
can install
them without re-installing \(\texttt{Julia}\). As described in step 2,
the path=
must be set to the location of the \(\texttt{Julia}\) binary on your
computer.library(sMTL)
smtl_setup(path = "/Applications/Julia-1.5.app/Contents/Resources/julia/bin",
installJulia = FALSE,
installPackages = TRUE)
The \(\texttt{Julia}\) binary path can be saved internally, so one does not have to enter the binary path every time (assuming the initial binary path is correct). To do so, follow these steps:
::edit_r_environ() usethis
.Renviron
file.Add the following line to the top of the file:
JULIA_HOME = <>
, where <>
is a
string of the Julia binary path. Make sure there is one empty line at
the end of the .Renviron
file. Save and close.
Every time you load the package henceforth, one only need run the following.
library(sMTL)
smtl_setup()
Failure to run the smtl_setup()
function at at each
sMTL
pack load will leave R unable to locate \(\texttt{Julia}\) and the sMTL
package functions will not work.
To illustrate the main sMTL
functions, we start by
generating synthetic datasets and proceed by fitting models. We use
terminology from Zhang & Yang, 2021. The package has three main
machine learning problems it aims to solve:
Multi-Task Learning in which we have a collection of \(K\) separate training datasets defined by their design (feature) matrices, \(\{\mathbb{X}_1, \mathbb{X}_2,...,\mathbb{X}_K\}\), and outcome vectors, \(\{\mathbf{y}_1, \mathbf{y}_2,...,\mathbf{y}_K\}\), where \(\mathbb{X}_k \in \mathbb{R}^{n_k \times p},~ \mathbf{y}_k \in \mathbb{R}^{n_k}\). We jointly train \(K\) models, one fit to each of the \(K\) datasets, with the goal of making predictions on new observations from task \(k\) using model \(k\).
Multi-Label Learning is a special case of Multi-Task Learning (1) in which we have a collection of tasks with a single common design matrix, \(\mathbb{X}\in \mathbb{R}^{n \times p}\), but separate task-specific outcome vectors, \(\mathbf{y}_k \in \mathbb{R}^{n}\). One can think of each observation having a \(K\) dimensional multivariate outcome. We jointly train \(K\) models, each fit with one of the \(K\) outcome vectors, \(\mathbf{y}_k\). The goal is to make predictions on new observations from task \(k\) (i.e., to predict an observation of outcome \(k\)) using model \(k\). While this is equivalent to (1) for the case that \(\mathbb{X}_k = \mathbb{X} ~\forall k \in \{1,2,...,K\}\), we include specialized algorithms for this case that yield gains in efficiency in terms of memory and solve time.
Domain Generalization in which we have a collection of \(K\) separate training datasets: \(\{\mathbb{X}_1, \mathbb{X}_2,...,\mathbb{X}_K\}\) and \(\{\mathbf{y}_1, \mathbf{y}_2,...,\mathbf{y}_K\}\) where \(\mathbb{X}_k \in \mathbb{R}^{n_k \times p},~ \mathbf{y}_k \in \mathbb{R}^{n_k}\). Unlike in (1) and (2), the goal is to make predictions on new observations from an unseen ``domain’’ or dataset, \(K+1\), using an \(ensemble\) of \(K\) models, each trained on a single dataset. We calculate ensemble weights based upon multi-study stacking (Patil and Parmigiani, 2018) and simple average weights. The code to fit models is identical to Case 1 (Multi-Task Learning). The only difference comes in how one tunes the models and makes predictions.
We begin with case 1. We generate \(K = 4\) synthetic datasets from a sparse linear model with the following:
This dataset can be generated in R as follows:
set.seed(1) # fix the seed to get a reproducible result
<- 4 # number of datasets
K <- 100 # covariate dimension
p <- 5 # support size
s <- 7 # size of subset of covariates that can be non-zero for any task
q <- 50 # task sample size
n_k <- n_k * p # full dataset samplesize
N <- matrix( rnorm(N * p), nrow = N, ncol=p) # full design matrix
X <- matrix(1 + rnorm(K * (p+1) ), nrow = p + 1, ncol = K) # betas before making sparse
B <- matrix(0, nrow = p, ncol = K) # matrix of supports
Z <- vector(length = N) # outcome vector
y
# randomly sample support to make betas sparse
for(j in 1:K) Z[1:q, j] <- sample( c( rep(1,s), rep(0, q - s) ), q, replace = FALSE )
-1,] <- B[-1,] * Z # make betas sparse and ensure all models have an intercept
B[
<- rep(1:K, each = n_k) # vector of task labels (indices)
task
# iterate through and make each task specific dataset
for(j in 1:K){
<- which(task == j) # indices of task
indx <- rnorm(n_k)
e <- B[1, j] + X[indx,] %*% B[-1,j] + e
y[indx]
}colnames(B) <- paste0("beta_", 1:K)
rownames(B) <- c("Intercept", paste0( "X_", 1:p ) )
Note that the task
vector is a \(N \times 1\) vector where element \(i\) indicates the task index (\(\in \{1,2,...,K\}\)) of observation \(i\).
First let’s see the structure of the \(\mathbb{B} \in \mathbb{R}^{p+1 \times K}\) where column \(k\) is equal to \(\boldsymbol{\beta}_k\). As we can see there is heterogeneity across tasks both in the support (which elements are non-zero) and in the value of the non-zero coefficients. But there is enough shared structure that borrowing strength across tasks would be expected to improve performance.
print(round(B[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.08 -0.64 0.33 0.72
## X_1 1.61 0.00 1.06 1.37
## X_2 -0.50 2.25 1.83 0.00
## X_3 0.00 1.74 1.49 1.29
## X_4 1.22 0.34 0.00 0.81
## X_5 0.39 0.00 1.62 1.11
## X_6 0.00 2.03 0.61 0.00
## X_7 0.81 1.11 0.00 2.26
We will use the smtl()
function to estimate \(\mathbb{B}\) from the data.
We will start with fitting Common Support models.
We start with the following model where we abbreviate the constraints above for conciseness.
\[
\begin{aligned}
&\underset{\boldsymbol{z},~\mathbb{B} }{\mbox{min }}
\sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k
\boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 \quad \quad
\mbox{(Common Support)} \\
& \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s,
~\forall~k \in \{1,2,...,K\} \notag
\end{aligned}
\] To fit a path of solutions for the Common Support Multi-Task
model with 5 non-zero coefficients (i.e., \(s=5\)) and a ridge penalty \(\lambda_1 = 0.001\), we use the
smtl()
function and then print out the coefficients stored
in beta
:
<- sMTL::smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = TRUE,
lambda_1 = 0.001)
print(round(mod$beta[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.37 -0.60 0.36 1.13
## V1 1.51 -0.11 1.16 1.49
## V2 -0.43 2.28 1.74 0.27
## V3 0.29 1.52 1.65 1.41
## V4 0.00 0.00 0.00 0.00
## V5 0.00 0.00 0.00 0.00
## V6 0.46 2.21 0.80 -0.25
## V7 0.45 1.02 -0.25 2.23
As we can see the solutions have a common support. That is, each task \(\boldsymbol{\beta}_k\) has the same sparsity pattern (support): each row is either all zeros or all non-zeros.
Let’s now try a model where we shrink the \(\boldsymbol{\beta}_k\) \(values\) towards each other: \[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 +\lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2\quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \]
<- smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = TRUE,
lambda_2 = 2.75)
print(round(mod$beta[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.18 -0.52 0.28 1.32
## V1 1.10 0.55 0.96 1.00
## V2 0.69 1.34 1.22 0.82
## V3 1.19 1.53 1.41 1.24
## V4 0.00 0.00 0.00 0.00
## V5 0.00 0.00 0.00 0.00
## V6 0.81 1.15 0.84 0.51
## V7 0.92 0.85 0.76 1.14
We can see that within a row (across columns) the coefficient estimates have far more similar values than the estimates we obtained from a model that included only a Ridge penalty.
In some applications, it may not be justifiable to assume the
sparsity patterns (supports) are identical across the \(\boldsymbol{\beta}_k\) of the different
tasks. In such cases, all we have to do is set
commonSupp = FALSE
and the function will fit a
Heterogeneous Support model. Let’s start with a simple model with a
Ridge penalty:
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \]
<- smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = FALSE,
lambda_1 = 0.001)
print(round(mod$beta[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.16 -0.57 0.25 0.88
## V1 1.74 0.00 0.98 1.42
## V2 -0.61 2.30 1.89 0.00
## V3 0.00 1.48 1.78 1.34
## V4 1.21 0.00 0.00 0.77
## V5 0.35 0.00 1.51 1.10
## V6 0.00 2.17 0.49 0.00
## V7 0.91 1.09 0.00 2.27
We can see that now many rows have a mix of zero and non-zero entries. With this model, however, we have essentially fit \(K\) independent sparse regression models: the models are not sharing strength across tasks. In the next section, we borrow information across the supports of the models by adding in a penalty.
We now shrink the supports of the \(\boldsymbol{\beta}_k\), \(\boldsymbol{z}_k\), towards each other.
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~ ~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]
<- smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = FALSE,
lambda_1 = 0.001,
lambda_z = 0.25)
print(round(mod$beta[1:8,], 2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.36 -0.60 0.37 1.10
## V1 1.50 -0.12 1.18 1.48
## V2 -0.44 2.27 1.75 0.00
## V3 0.28 1.52 1.67 1.40
## V4 0.96 0.00 0.00 0.76
## V5 0.00 0.00 1.42 0.95
## V6 0.00 2.20 0.00 0.00
## V7 0.48 1.02 -0.08 2.23
As we can see, the additional penalty reduced the support heterogeneity. Most rows are now either all zeros or all non-zeros. Now let’s turn the \(\lambda_z\) penalty up to show that the Common Support problem is a special case when \(\lambda_z \rightarrow \infty\) (although in practice \(\lambda_z\) does not need to be very big to induce a solution with a common support).
<- smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = FALSE,
lambda_1 = 0.001,
lambda_z = 10)
print(round( mod$beta[1:8,], 2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.36 -0.60 0.37 1.10
## V1 1.50 -0.12 1.18 1.48
## V2 -0.44 2.27 1.75 0.25
## V3 0.28 1.52 1.67 1.40
## V4 0.00 0.00 0.00 0.00
## V5 0.00 0.00 0.00 0.00
## V6 0.43 2.20 0.79 -0.24
## V7 0.48 1.02 -0.26 2.23
Now we see all rows have the same sparsity patterns across columns (tasks). Note that in the above example we did not include the \(\lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2\).
If we wanted to include penalties that share information both across the coefficient values and the coefficient supports, we could fit the following model where we replace the Ridge penalty (we do not recommend including both this penalty and the Ridge because both shrink the coefficient values):
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~ ~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]
<- smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = FALSE,
lambda_2 = 0.5,
lambda_z = 0.1)
Now that we have fit models, let’s make predictions and see the
output structure. We feed predict()
a model object and some
new data X
. Here we make predictions on the training data
for simplicity but in practice we would likely be more interested in
predictions on new observations.
<- sMTL::predict(model = mod, X = X[1:5,])
preds
print( head(preds) )
## task_1 task_2 task_3 task_4
## 1 -1.23179826 -4.4010715 -3.4726519 0.5208468
## 2 -0.48204960 -0.5706786 -0.2639711 1.7627481
## 3 -1.96315150 -3.9042719 -4.4020299 -2.6233604
## 4 0.04404262 -1.6988063 -0.3304088 -0.6197724
## 5 -0.62250746 0.9352522 3.2517639 0.4736623
We used each model to make predictions on the new data producing a
\(n^* \times K\) matrix where \(n^*\) is the sample size of the data we
make predictions on. In practice we may only want the predictions
associated with one of the tasks. In those cases, we would pick out the
corresponding column. Note that the predict()
function
assumes the column orders of the features/covariates in the design
matrix are the same as those when the model was fit with the
smtl()
function (it does not use column names to order the
columns).
Multi-Label Learning can be cast as a special case of Multi-Task Learning (Case 1). The main difference here is that we can imagine that we have a single \(\mathbb{X} \in \mathbb{R}^{n \times p}\) that is common to all tasks. Put another way, we have one \(\mathbb{X}\) and we store all the \(\mathbf{y}_k\) as columns in a matrix \(\mathbb{Y} \in \mathbb{R}^{p \times K}\). We set \(K = 4\) and generate data from a sparse linear model with the following:
This dataset can be generated in R as follows:
set.seed(1) # fix the seed to get a reproducible result
<- 4 # number of datasets
K <- 100 # covariate dimension
p <- 5 # support size
s <- 7 # size of subset of covariates that can be non-zero for any task
q <- 50 # full dataset samplesize
N <- matrix( rnorm(N * p), nrow = N, ncol=p) # full design matrix
X <- matrix(1 + rnorm(K * (p+1) ), nrow = p + 1, ncol = K) # betas before making sparse
B <- matrix(0, nrow = p, ncol = K) # matrix of supports
Z <- matrix(nrow = N, ncol = K) # outcome vector
y
# randomly sample support to make betas sparse
for(j in 1:K) Z[1:q, j] <- sample( c( rep(1,s), rep(0, q - s) ), q, replace = FALSE )
-1,] <- B[-1,] * Z # make betas sparse and ensure all models have an intercept
B[
# iterate through and make each task specific dataset
for(j in 1:K){
<- rnorm(N)
e <- B[1, j] + X %*% B[-1,j] + e
y[,j]
}colnames(B) <- paste0("beta_", 1:K)
rownames(B) <- c("Intercept", paste0( "X_", 1:p ) )
Note that we do not have a task
vector here. Instead we
concatenate the outcomes \(\boldsymbol{y}_k\) into a matrix, \(\mathbb{Y} \in \mathbb{R}^{n \times
K}\).
First let’s see the structure of the \(\mathbb{B} \in \mathbb{R}^{p+1 \times K}\) where column \(k\) is equal to \(\boldsymbol{\beta}_k\). As we can see there is heterogeneity across tasks both in the support (which elements are non-zero) and in the magnitude of the coefficients. But there is enough shared structure that borrowing strength across tasks would be expected to improve performance.
print(round(B[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.52 2.38 2.31 1.08
## X_1 1.63 2.34 0.00 0.00
## X_2 -0.68 0.24 0.30 1.73
## X_3 0.00 0.00 1.87 0.00
## X_4 2.12 2.00 0.21 -0.22
## X_5 -0.24 0.00 1.55 0.89
## X_6 0.00 0.69 0.00 0.32
## X_7 1.60 1.72 0.18 1.49
We will use smtl()
to estimate \(\mathbb{B}\) from the data.
Again, we will start with fitting Common Support models.
We start with the following model where we abbreviate the constraints above for conciseness (we assume the constraint that all models have the same support is implicit). Notice that there is a single \(\mathbb{X}\) for all tasks.
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X} \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 \quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \]
To fit a solutions for the Common Support Multi-Label model with 5
non-zero coefficients (i.e., \(s=5\))
and a Ridge penalty \(\lambda_1 =
0.001\), we use the smtl()
function and then print
out the coefficients stored in beta
. Notice here the \(y\) object in \(\texttt{R}\) is a matrix, and that we do
not include the study=
syntax that we use for Multi-Task
learning or Domain Generalization. Instead the columns of the \(y\) object are treated as separate
``labels’’ (akin to separate tasks). Otherwise the syntax is the exact
same.
<- sMTL::smtl(y = y,
mod X = X,
s = 5,
commonSupp = TRUE,
lambda_1 = 0.001)
print(round(mod$beta[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.53 2.29 2.31 1.42
## V1 1.69 2.07 -0.03 -0.16
## V2 0.00 0.00 0.00 0.00
## V3 0.10 0.07 2.16 -0.01
## V4 2.06 2.02 0.04 -0.25
## V5 -0.39 0.14 1.55 0.87
## V6 0.00 0.00 0.00 0.00
## V7 1.58 1.61 0.21 1.39
We next show code for the Heterogeneous Support case.
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X} \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~ ~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]
<- sMTL::smtl(y = y,
mod X = X,
s = 5,
commonSupp = FALSE,
lambda_1 = 0.001,
lambda_z = 0.1)
print(round(mod$beta[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.54 2.29 2.29 1.22
## V1 1.70 2.07 0.00 0.00
## V2 -0.46 0.51 0.55 1.56
## V3 0.00 0.00 2.15 0.00
## V4 2.07 2.02 0.00 -0.10
## V5 -0.38 0.00 1.55 0.82
## V6 0.00 0.59 0.00 0.36
## V7 1.56 1.60 0.19 1.36
Predictions will be labeled as “task_1”,…,“task_K.” Since this is a ``Multi-Label’’ learning problem, we would interpret “task_j” to mean “label_j” (or “outcome_j”).
The syntax and setup here is basically identical to Case 1. The main difference comes in how we tune the models and make predictions. For that reason, we include an example involving tuning, fitting and predictions.
We again generate \(K = 4\) synthetic datasets from a sparse linear model with the following:
This dataset can be generated in R as follows:
set.seed(1) # fix the seed to get a reproducible result
<- 4 # number of datasets
K <- 100 # covariate dimension
p <- 5 # support size
s <- 7 # size of subset of covariates that can be non-zero for any task
q <- 50 # task sample size
n_k <- n_k * p # full dataset samplesize
N <- matrix( rnorm(N * p), nrow = N, ncol=p) # full design matrix
X <- matrix(1 + rnorm(K * (p+1) ), nrow = p + 1, ncol = K) # betas before making sparse
B <- matrix(0, nrow = p, ncol = K) # matrix of supports
Z <- vector(length = N) # outcome vector
y
# randomly sample support to make betas sparse
for(j in 1:K) Z[1:q, j] <- sample( c( rep(1,s), rep(0, q - s) ), q, replace = FALSE )
-1,] <- B[-1,] * Z # make betas sparse and ensure all models have an intercept
B[
<- rep(1:K, each = n_k) # vector of task labels (indices)
task
# iterate through and make each task specific dataset
for(j in 1:K){
<- which(task == j) # indices of task
indx <- rnorm(n_k)
e <- B[1, j] + X[indx,] %*% B[-1,j] + e
y[indx]
}colnames(B) <- paste0("beta_", 1:K)
rownames(B) <- c("Intercept", paste0( "X_", 1:p ) )
Note that the task
vector is a \(N \times 1\) vector where element \(i\) indicates the task index (\(\{1,2,...,K\}\)) of observation \(i\). This is equivalent to a “Domain” or
“Study.”
First let’s see the structure of the \(\mathbb{B} \in \mathbb{R}^{p+1 \times K}\) where column \(k\) is equal to \(\boldsymbol{\beta}_k\). As we can see there is heterogeneity across domain-specific regression coefficients both in the support (which elements are non-zero) and in the magnitude of the coefficients. Again, there appears to be enough shared structure that borrowing strength across domains would be expected to improve performance on predicting some new, “unseen” domain.
print(round(B[1:8,],2))
## beta_1 beta_2 beta_3 beta_4
## Intercept -0.08 -0.64 0.33 0.72
## X_1 1.61 0.00 1.06 1.37
## X_2 -0.50 2.25 1.83 0.00
## X_3 0.00 1.74 1.49 1.29
## X_4 1.22 0.34 0.00 0.81
## X_5 0.39 0.00 1.62 1.11
## X_6 0.00 2.03 0.61 0.00
## X_7 0.81 1.11 0.00 2.26
We start with the following model where we abbreviate the constraints above for conciseness.
\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \]
Here we use the same syntax from Case 1 except for 1) in the
cv.smtl()
function we set multiTask = FALSE
and 2) when making predictions, we set stack = TRUE
in the
predict()
function to use multi-study stacking to generate
ensemble weights. The predict()
function then produces
aggregate predictions for some new dataset or ``domain’’ by taking a
weighted average of the model-specific predictions: \(\hat{f}(\boldsymbol{x}) = \sum_{k=1}^K w_k
\boldsymbol{x}^T \hat{\boldsymbol{\beta}}_k\). We provide the
output using average weights, \(w_k\),
(i.e., \(w_k = 1/K, \forall~ k ~ \in
\{1,2,...,K\}\)), or multi-study stacking weights, \(w_k\). We introduce cross validation
functions more generally below.
# tuning grid
<- data.frame(s = c(4, 4, 5, 5),
grid lambda_1 = c(0.01, 0.1, 0.01, 0.1),
lambda_2 = rep(0, 4),
lambda_z = c(0.01, 0.1, 0.01, 0.1)
)
# cross validation
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
grid = grid,
nfolds = 5,
multiTask = FALSE)
# model fitting
<- sMTL::smtl(y = y,
mod X = X,
study = task,
s = tn$best.1se$s,
commonSupp = TRUE,
lambda_1 = tn$best.1se$lambda_1,
lambda_z = tn$best.1se$lambda_z)
We can use our cross validation function to tune models. The syntax
is similar to the smtl()
function except one can specify an
optional tuning grid, and the number of folds (nfolds
). For
Domain Generalization problems, one needs to switch
multiTask=FALSE
to change the cross validation to a
hold-one-domain-out cross validation approach appropriate for the
problem.
Here we show a Multi-Task Learning problem in which we let the
cv.smtl()
function generate a grid of hyperparameters for
tuning internally, tune and then we fit a final model with the tuned
hyperparameters.
# cross validation with internally generated grid
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
lambda_1 = TRUE,
lambda_2 = FALSE,
lambda_z = TRUE,
nfolds = 5)
# fit final model
<- smtl(y = y,
mod X = X,
study = task,
s = tn$best$s,
commonSupp = FALSE,
lambda_1 = tn$best$lambda_1,
lambda_z = tn$best$lambda_z)
We see that instead of providing a grid, we chose which
hyperparameters we wanted to tune by setting them to
TRUE
.
We highly recommend specifying your own grids as good ranges for hyperparameter values may be very problem-specific. Next we show the same Multi-Task Learning problem with a (very small) custom grid and then fit a final model with the tuned hyperparameters.
# generate grid
<- data.frame(s = c(4,4,5,5),
grid lambda_1 = c(0.01, 0.1,0.01, 0.1),
lambda_2 = 0,
lambda_z = c(0.01, 0.1,0.01, 0.1))
# cross validation with custom grid
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
grid = grid,
nfolds = 5)
# fit final model
<- smtl(y = y,
mod X = X,
study = task,
s = tn$best$s,
commonSupp = FALSE,
lambda_1 = tn$best$lambda_1,
lambda_z = tn$best$lambda_z)
Now we show a Domain-Generalization problem with a custom grid and
then fit a final model with the tuned hyperparameters. We switch
multiTask=FALSE
and proceed as normal.
For illustrative purposes, we also choose the best.1se
parameter values which chooses the smallest \(s\) that achieves an average cross
validated prediction error within 1 standard error of the parameters
with the best cross validated error (and the best regularization
parameters for that \(s\)). This can
result in sparser models and is an alternative option.
# generate grid
<- data.frame(s = c(4,4,5,5),
grid lambda_1 = c(0.01, 0.1,0.01, 0.1),
lambda_2 = 0,
lambda_z = c(0.01, 0.1,0.01, 0.1))
# cross validation with custom grid
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
grid = grid,
nfolds = 5,
multiTask = FALSE)
# fit final model
<- smtl(y = y,
mod X = X,
study = task,
s = tn$best$s,
commonSupp = FALSE,
lambda_1 = tn$best$lambda_1,
lambda_z = tn$best$lambda_2)
There are a few advanced options for users looking to adjust the
speed of the algorithms and quality of solutions. The
maxIter
argument can be increased to improve the quality of
the coordinate descent algorithm or reduced to improve speed (at a cost
to the solution quality). Similarly, LocSrch_maxIter
argument specifies the number of local search iterations and increasing
it can improve the quality of the solutions, but can slow down the
algorithm substantially. Setting LocSrch_maxIter=0
tells
the package to use no local search and may even occasionally be
preferable for some problems. Finally, increasing
LocSrch_skip
to an integer above 1 will increase the speed
of the algorithm by only conducting local search on a subset of
parameter values on the solution path. For example,
LocSrch_skip=3
only uses local search on every \(3^{rd}\) parameter value. If some local
search is desirable, but one wishes to avoid using it at every parameter
value because it is too slow, setting LocSrch_skip
to
somewhere between 2 and 5 could speed up cross-validation substantially
while still retaining some of the benefits of local search. The
assumption is using local search to find a solution for hyperparameter
value \(j\) should improve the quality
of the solution for a model fit with hyperparameter \(j+1\) too, since we use warm starts and
paths of solutions.
# cross validation with custom grid using
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
grid = grid,
LocSrch_skip = 3,
LocSrch_maxIter = 5,
nfolds = 5,
multiTask = FALSE)
The optimization framework proposed here offers modeling richness at the cost of additional tuning parameters. For example, tuning a Heterogeneous Support model with a Zbar penalty and a Ridge penalty has 3 hyperparameters (\(s\), \(\lambda_1\) and \(\lambda_z\)) and thus tuning across a full tuning grid may incur a large computational cost. One solution to tune hyperparameters with a lower computational burden is to tune in two stages. First, in stage 1 we tune across a smaller grid with just \(s\) and \(\lambda_z\) and a small, fixed \(lambda_1\) for numerical reasons. Then in stage 2, we take the optimal hyperparameter values and create a second grid built around a neighborhood of the optimal hyperparameter values from stage 1 and perform a second cross validation. This way we avoid making a large grid with all three hyperparameters \(s\), \(\lambda_a\) and \(\lambda_z\).
######################################
# generate initial tuning grid
######################################
<- c(5, 10, 15, 20)
s <- c(1e-6) # small fixed ridge penalty for regularization
lambda_1 <- 0
lambda_2 <- sort( unique( c(0, 1e-6, 1e-5, 1e-4, 1e-3,
lambda_z exp(-seq(0,5, length = 8)), 1, 3) ), decreasing = TRUE )
<- expand.grid(s, lambda_1, lambda_2, lambda_z)
grid colnames(grid) <- c("s", "lambda_1", "lambda_2", "lambda_z")
# order correctly
<- grid[ order(-grid$s,
grid $lambda_1,
grid-grid$lambda_z,
decreasing=TRUE), ]
###################################################
# stage 1: cross validation with initial grid
###################################################
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
grid = grid,
nfolds = 5,
multiTask = FALSE)
#############################################################################
# create second grid with optimal hyperparameters from stage 1
#############################################################################
# new sparsity grid
<- tn$best$s
s <- c(s-2, s, s+2)
s <- s[s >= 1]
s
# ridge penalty grid
<- c(1e-7, 1e-5, 1e-3, 1e-1)
lambda_1
# new lambda_z grid
<- tn$best$lambda_z
lambda_z <- sort( c( seq(1.5, 10, length = 5), seq(0.1, 1, length = 5) ) * lambda_z, decreasing = TRUE)
lambda_z <- lambda_z[lambda_z <= 10] # make sure this is below a threshold to prevent numerical issues
lambda_z
<- as.data.frame( expand.grid( lambda_1, lambda_2, lambda_z, s) )
grid colnames(grid) <- c("lambda_1", "lambda_2", "lambda_z", "s")
# order correctly
<- grid[ order(-grid$s,
grid $lambda_1,
grid-grid$lambda_z,
decreasing=TRUE), ]
###################################################
# stage 2: cross validation with updated grid
###################################################
<- cv.smtl(y = y,
tn X = X,
study = task,
commonSupp = FALSE,
grid = grid,
nfolds = 5,
multiTask = FALSE)
Sometimes we may want to fit entire paths of solution for some fixed
\(s\). The algorithm is coded to fit
whole paths to efficiently use warm starts to speed up the code
substantially and identify better local minima. For that reason, if
solutions for multiple values of tuning parameters, \((\lambda_1, \lambda_2, \lambda_z)\) are
desired, it is far better to fit them jointly than to call
smtl()
repeatedly. The code is almost identical except that
now one or more of the tuning parameter arguments \((\lambda_1, \lambda_2, \lambda_z)\) are a
vector not a scalar. Notice that \(s\)
is always a scalar. We cannot put multiple values of \(s\) in because the path of solutions is
generated independently for each value of \(s\).
<- smtl(y = y,
mod X = X,
study = task,
s = 5,
commonSupp = FALSE,
lambda_1 = c(0.1, 0.2, 0.3),
lambda_z = c(0.01, 0.05, 0.1)
)
Now the solutions in mod$beta
is a \(K \times p \times T\) array where \(T\) is the number of unique tuning value
combinations (3 in the example above) used to fit the path of
solutions.
Now we have many solutions so if we want to make predictions we have to specify which hyperparameter value solutins we want predictions for or else the package will issue a warning and return no predictions.
<- sMTL::predict(model = mod,
preds X = X,
lambda_1 = 0.1,
lambda_z = 0.01)
You are welcome to contact me about sMTL
at gloewinger@gmail.com with any questions or error
reports.
This GitHub account gloewing/sMTL hosts the development version of sMTL.
For any errors with downloading \(\texttt{Julia}\), or the packages, see
documentation for the dependent \(\texttt{R}\) package
JuliaCall
. We use JuliaCall
to download \(\texttt{Julia}\) with the function
JuliaCall::julia_setup()
and we use
JuliaCall::julia_install_package_if_needed()
to download
packages the \(\texttt{Julia}\)
packages \(\texttt{TSVD}\), \(~\texttt{LinearAlgebra}\) and \(~\texttt{Statistics}\) through \(\texttt{R}\).
For any errors with connecting to an existing installation of
\(\texttt{Julia}\) see documentation
for the JuliaConnectoR
\(\texttt{R}\) package. We use this to allow
\(\texttt{R}\) to start a \(\texttt{Julia}\) session and call the \(\texttt{.jl}\) files that contain our
algorithms.
If using the smtl_setup()
function causes errors,
the following steps can be taken after installing \(\texttt{Julia}\) and the \(\texttt{Julia}\) packages:
>> println(Sys.BINDIR)
and copy the binary path (a
string). Replace the string after path <-
in the
R
code below with this binary path and run the
following:# save path of Julia binary location on your computer
# ***replace with your computer's path***
<- "/Applications/Julia-1.7.app/Contents/Resources/julia/bin"
path
# set so JuliaConnectoR can access Julia installation
Sys.setenv(JULIA_BINDIR = path )
sMTL
package installation on your
computer and save a \(\texttt{.txt}\)
file in the package installation source files so the package knows where
to find the \(\texttt{Julia}\)
binary.# find path of sMTL package installation on your computer
<- paste0( .libPaths("sMTL"), "/sMTL/julia_path/" )
smtl_path <- smtl_path[length(smtl_path)] # use most recent
smtl_path
# save Julia path for sMTL package directory for future use
<- file( paste0(smtl_path, "julia.path.txt") )
fileConn ::writeLines(path, fileConn) # save new julia path
baseclose(fileConn)
Gabriel Loewinger, Kayhan Behdin, Kenneth Kishida, Giovanni Parmigiani and Rahul Mazumder. Multi-Task Learning for Sparsity Pattern Heterogeneity: A Discrete Optimization Approach. arXiv:2212.08697 (2022).
Prasad Patil and Giovanni Parmigiani. Training replicable predictors in multiple studies. Proceedings of the National Academy of Sciences (2018).
Yu Zhang and Qiang Yang. A Survey on Multi-Task Learning. IEEE Transactions on Knowledge and Data Engineering (2021).