Introduction

sMTL is a fast toolkit for $\ell_0$-constrained Multi-Task Learning, Multi-Label Learning and Domain Generalization. $\ell_0$ constraints select the best subset of features in a variety of multi-dataset settings. The toolkit can (approximately) solve the following two problems

\[ \begin{aligned} &\underset{~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 \quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~\mbox{Supp}(\boldsymbol{\beta}_k) = \mbox{Supp}(\boldsymbol{\beta}) ~~~~~~~\mbox{for} ~k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~~~~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]

\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]

where $\boldsymbol{\beta}_k$ is the vector of coefficients associated with dataset (task) $k$ and is also the $k^{th}$ column of the matrix $\mathbb{B}$. We denote $\mbox{Supp}(\boldsymbol{\beta}_k) =\boldsymbol{z}_k = \mathbb{1}{(\boldsymbol{\beta}_k \neq \mathbf{1})}$, as the ``support’’ of $\boldsymbol{\beta}_k$: i.e., a $p \times 1$ indicator vector of whether the entries of $\boldsymbol{\beta}_k$ are non-zero. Note $||\boldsymbol{\beta}_k||_0 = \sum_{j=1}^p z_{k,j}$, where the $\ell_0$ norm, $||\boldsymbol{u}||_0$, equals the number of non-zero entries in the vector $\boldsymbol{u}$. For example,

\[ \begin{aligned} &\boldsymbol{\beta} = \begin{bmatrix} 1.23 \\ 0 \\ -0.71 \\ 0 \end{bmatrix} ~~~~~~~~~~~~~ &\mbox{Supp}(\boldsymbol{\beta}) = \boldsymbol{z} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix} \end{aligned} \]

The Common Support problem ensures that each of the $K$ task-specific (or dataset-specific) $\boldsymbol{\beta}_k$ have the same support. The Heterogeneous Support problem allows each task to have a different support but penalizes the $\boldsymbol{\beta}_k$ for having differing supports using the $\lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2$ penalty.

The parameter $\lambda_2$ shrinks the $\boldsymbol{\beta}_k$ $values$ towards a common $\bar{\boldsymbol{\beta}} = \frac{1}{K}\sum_{k=1}^K \boldsymbol{\beta}_k$. The parameter $\lambda_1$ controls the strength of $\ell_2$ (Ridge) shrinkage where $\lambda_1||\mathbb{B}||_2^2 = \lambda_1 \sum_{k=1}^K ||\boldsymbol{\beta}_k||_2^2$. Adding either of these shrinkage terms can be effective in reducing the risk of overfitting and often results in better predictive models. We recommend using either but not both of these regularization terms. For Hetergeneous Support problems $\lambda_z$ controls the degree to which the $\boldsymbol{z}_k$, the supports of the $\boldsymbol{\beta}_k$, are shrunk towards a common $\bar{\boldsymbol{z}} = \frac{1}{K}\sum_{k=1}^K \boldsymbol{z}_k$. When $\lambda_z$ is high, this results in solutions equivalent to the Common Support approach. The fitting of all methods is conducted over a grid of $\lambda_1$, $\lambda_2$ and $\lambda_z$ values to generate a solution path.

The algorithms provided in sMTL are based on block coordinate descent and local combinatorial search. Numerous computational tricks and heuristics are used to speed up the algorithms and improve the solution quality. These heuristics include warm starts and active set convergence. For more details on the algorithms used, please refer to our paper Multi-Task Learning for Sparsity Pattern Heterogeneity: A Discrete Optimization Approach.

The toolkit is implemented in Julia along with an easy-to-use R interface. In this vignette, we provide a tutorial on using the R interface. We demonstrate how use sMTL’s main functions for fitting models and cross-validation.

Note that model functions assume the design (feature) matrices are a $\texttt{matrix}$ data-type (not $\texttt{data.frames}$). Thus any conversion of, for example, categorical/factor variables into a set of indicator variables, to construct the corresponding design matrix must be completed before using the sMTL functions.

Installation

The development version of sMTL can be downloaded from github by executing:

library(devtools)
install_github("gloewing/sMTL", subdir = "Package/sMTL")

The very first time we use sMTL we have to keep in mind the following:

Package Setup

The programming language $\texttt{Julia}$ is required. We have a function to automatically install it if it is not already:

library(sMTL)
smtl_setup(installJulia = TRUE, installPackages = TRUE)

If you ran the code for step 1 above, then you can ignore the rest of the Package Setup section (steps 2-3) and skip to the Tutorial Guide section.

The $\texttt{Julia}$ binary path must be set if you have an existing $\texttt{Julia}$ installation. If $\texttt{Julia}$ is installed with our function (item 1), this should be set automatically. If $\texttt{Julia}$ was already installed, simply open $\texttt{Julia}$ and execute the $\texttt{Julia}$ command >> println(Sys.BINDIR) and this will print the binary path.

The following code is an example. Note the string after the code path= should be replaced with your computer’s path for Julia’s binary.

library(sMTL)
smtl_setup(path = "/Applications/Julia-1.5.app/Contents/Resources/julia/bin")

Three $\texttt{Julia}$ packages are required so if you use an existing $\texttt{Julia}$ installation see the next step (step 3).

Three $\texttt{Julia}$ packages are required ($\texttt{TSVD}$, $~\texttt{LinearAlgebra}$ and $~\texttt{Statistics}$). We include a function to automatically install these if they are not already. If $\texttt{Julia}$ is installed but the packages are not, the function smtl_setup() can install them without re-installing $\texttt{Julia}$. As described in step 2, the path= must be set to the location of the $\texttt{Julia}$ binary on your computer.

library(sMTL)
smtl_setup(path = "/Applications/Julia-1.5.app/Contents/Resources/julia/bin",
           installJulia = FALSE, 
           installPackages = TRUE)

The $\texttt{Julia}$ binary path can be saved internally, so one does not have to enter the binary path every time (assuming the initial binary path is correct). To do so, follow these steps:

Execute the following

usethis::edit_r_environ()

This will open the .Renviron file.

Add the following line to the top of the file: JULIA_HOME = <>, where <> is a string of the Julia binary path. Make sure there is one empty line at the end of the .Renviron file. Save and close.

Every time you load the package henceforth, one only need run the following.

library(sMTL)
smtl_setup()

Failure to run the smtl_setup() function at at each sMTL pack load will leave R unable to locate $\texttt{Julia}$ and the sMTL package functions will not work.

Tutorial Guide

To illustrate the main sMTL functions, we start by generating synthetic datasets and proceed by fitting models. We use terminology from Zhang & Yang, 2021. The package has three main machine learning problems it aims to solve:

Multi-Task Learning in which we have a collection of $K$ separate training datasets defined by their design (feature) matrices, $\{\mathbb{X}_1, \mathbb{X}_2,...,\mathbb{X}_K\}$, and outcome vectors, $\{\mathbf{y}_1, \mathbf{y}_2,...,\mathbf{y}_K\}$, where $\mathbb{X}_k \in \mathbb{R}^{n_k \times p},~ \mathbf{y}_k \in \mathbb{R}^{n_k}$. We jointly train $K$ models, one fit to each of the $K$ datasets, with the goal of making predictions on new observations from task $k$ using model $k$.
Multi-Label Learning is a special case of Multi-Task Learning (1) in which we have a collection of tasks with a single common design matrix, $\mathbb{X}\in \mathbb{R}^{n \times p}$, but separate task-specific outcome vectors, $\mathbf{y}_k \in \mathbb{R}^{n}$. One can think of each observation having a $K$ dimensional multivariate outcome. We jointly train $K$ models, each fit with one of the $K$ outcome vectors, $\mathbf{y}_k$. The goal is to make predictions on new observations from task $k$ (i.e., to predict an observation of outcome $k$) using model $k$. While this is equivalent to (1) for the case that $\mathbb{X}_k = \mathbb{X} ~\forall k \in \{1,2,...,K\}$, we include specialized algorithms for this case that yield gains in efficiency in terms of memory and solve time.
Domain Generalization in which we have a collection of $K$ separate training datasets: $\{\mathbb{X}_1, \mathbb{X}_2,...,\mathbb{X}_K\}$ and $\{\mathbf{y}_1, \mathbf{y}_2,...,\mathbf{y}_K\}$ where $\mathbb{X}_k \in \mathbb{R}^{n_k \times p},~ \mathbf{y}_k \in \mathbb{R}^{n_k}$. Unlike in (1) and (2), the goal is to make predictions on new observations from an unseen ``domain’’ or dataset, $K+1$, using an $ensemble$ of $K$ models, each trained on a single dataset. We calculate ensemble weights based upon multi-study stacking (Patil and Parmigiani, 2018) and simple average weights. The code to fit models is identical to Case 1 (Multi-Task Learning). The only difference comes in how one tunes the models and makes predictions.

Multi-Task Learning (Case 1)

We begin with case 1. We generate $K = 4$ synthetic datasets from a sparse linear model with the following:

$\mathbb{X}_k$ is a $50 \times 100$ design matrix with iid standard normal entries
$\boldsymbol{\beta}_k$ is a $50 \times 1$ vector with 10 entries set to 1 (plus some task-specific gaussian noise) and the rest are zeros. The indices of the non-zero entries differ slightly between tasks.
$\boldsymbol{\epsilon}_k$ is a $100 \times 1$ vector with iid standard normal entries
$\mathbf{y}_k$ is a $100 \times 1$ response vector such that $\mathbf{y}_k = \mathbb{X}_k \boldsymbol{\beta}_k + \boldsymbol{\epsilon}_k$

This dataset can be generated in R as follows:

set.seed(1) # fix the seed to get a reproducible result
K <- 4 # number of datasets
p <- 100 # covariate dimension
s <- 5 # support size
q <- 7 # size of subset of covariates that can be non-zero for any task
n_k <- 50 # task sample size
N <- n_k * p # full dataset samplesize
X <- matrix( rnorm(N * p), nrow = N, ncol=p) # full design matrix
B <- matrix(1 + rnorm(K * (p+1) ), nrow = p + 1, ncol = K) # betas before making sparse
Z <- matrix(0, nrow = p, ncol = K) # matrix of supports
y <- vector(length = N) # outcome vector

# randomly sample support to make betas sparse
for(j in 1:K)     Z[1:q, j] <- sample( c( rep(1,s), rep(0, q - s) ), q, replace = FALSE )
B[-1,] <- B[-1,] * Z # make betas sparse and ensure all models have an intercept

task <- rep(1:K, each = n_k) # vector of task labels (indices)

# iterate through and make each task specific dataset
for(j in 1:K){
    indx <- which(task == j) # indices of task
    e <- rnorm(n_k)
    y[indx] <- B[1, j] + X[indx,] %*% B[-1,j] + e
}
colnames(B) <- paste0("beta_", 1:K)
rownames(B) <- c("Intercept", paste0( "X_", 1:p ) )

Note that the task vector is a $N \times 1$ vector where element $i$ indicates the task index ($\in \{1,2,...,K\}$) of observation $i$.

First let’s see the structure of the $\mathbb{B} \in \mathbb{R}^{p+1 \times K}$ where column $k$ is equal to $\boldsymbol{\beta}_k$. As we can see there is heterogeneity across tasks both in the support (which elements are non-zero) and in the value of the non-zero coefficients. But there is enough shared structure that borrowing strength across tasks would be expected to improve performance.

print(round(B[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.08  -0.64   0.33   0.72
## X_1         1.61   0.00   1.06   1.37
## X_2        -0.50   2.25   1.83   0.00
## X_3         0.00   1.74   1.49   1.29
## X_4         1.22   0.34   0.00   0.81
## X_5         0.39   0.00   1.62   1.11
## X_6         0.00   2.03   0.61   0.00
## X_7         0.81   1.11   0.00   2.26

We will use the smtl() function to estimate $\mathbb{B}$ from the data.

We will start with fitting Common Support models.

1a) Common Support Models with Ridge Penalty

We start with the following model where we abbreviate the constraints above for conciseness.

\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 \quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \] To fit a path of solutions for the Common Support Multi-Task model with 5 non-zero coefficients (i.e., $s=5$) and a ridge penalty $\lambda_1 = 0.001$, we use the smtl() function and then print out the coefficients stored in beta:

mod <- sMTL::smtl(y = y, 
                X = X, 
                study = task, 
                s = 5, 
                commonSupp = TRUE,
                lambda_1 = 0.001)

print(round(mod$beta[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.37  -0.60   0.36   1.13
## V1          1.51  -0.11   1.16   1.49
## V2         -0.43   2.28   1.74   0.27
## V3          0.29   1.52   1.65   1.41
## V4          0.00   0.00   0.00   0.00
## V5          0.00   0.00   0.00   0.00
## V6          0.46   2.21   0.80  -0.25
## V7          0.45   1.02  -0.25   2.23

As we can see the solutions have a common support. That is, each task $\boldsymbol{\beta}_k$ has the same sparsity pattern (support): each row is either all zeros or all non-zeros.

1b) Common Support Models with Cross-Task Coefficient Shrinkage

Let’s now try a model where we shrink the $\boldsymbol{\beta}_k$ $values$ towards each other: \[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 +\lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2\quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \]

mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = 5, 
            commonSupp = TRUE,
            lambda_2 = 2.75)

print(round(mod$beta[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.18  -0.52   0.28   1.32
## V1          1.10   0.55   0.96   1.00
## V2          0.69   1.34   1.22   0.82
## V3          1.19   1.53   1.41   1.24
## V4          0.00   0.00   0.00   0.00
## V5          0.00   0.00   0.00   0.00
## V6          0.81   1.15   0.84   0.51
## V7          0.92   0.85   0.76   1.14

We can see that within a row (across columns) the coefficient estimates have far more similar values than the estimates we obtained from a model that included only a Ridge penalty.

1c) Heterogeneous Support Models with Ridge Penalty

In some applications, it may not be justifiable to assume the sparsity patterns (supports) are identical across the $\boldsymbol{\beta}_k$ of the different tasks. In such cases, all we have to do is set commonSupp = FALSE and the function will fit a Heterogeneous Support model. Let’s start with a simple model with a Ridge penalty:

mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = 5, 
            commonSupp = FALSE,
            lambda_1 = 0.001)

print(round(mod$beta[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.16  -0.57   0.25   0.88
## V1          1.74   0.00   0.98   1.42
## V2         -0.61   2.30   1.89   0.00
## V3          0.00   1.48   1.78   1.34
## V4          1.21   0.00   0.00   0.77
## V5          0.35   0.00   1.51   1.10
## V6          0.00   2.17   0.49   0.00
## V7          0.91   1.09   0.00   2.27

We can see that now many rows have a mix of zero and non-zero entries. With this model, however, we have essentially fit $K$ independent sparse regression models: the models are not sharing strength across tasks. In the next section, we borrow information across the supports of the models by adding in a penalty.

1d) Heterogeneous Support Models with Support Heterogeneity Regularization

We now shrink the supports of the $\boldsymbol{\beta}_k$, $\boldsymbol{z}_k$, towards each other.

\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~ ~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]

mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = 5, 
            commonSupp = FALSE,
            lambda_1 = 0.001,
            lambda_z = 0.25)

print(round(mod$beta[1:8,], 2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.36  -0.60   0.37   1.10
## V1          1.50  -0.12   1.18   1.48
## V2         -0.44   2.27   1.75   0.00
## V3          0.28   1.52   1.67   1.40
## V4          0.96   0.00   0.00   0.76
## V5          0.00   0.00   1.42   0.95
## V6          0.00   2.20   0.00   0.00
## V7          0.48   1.02  -0.08   2.23

As we can see, the additional penalty reduced the support heterogeneity. Most rows are now either all zeros or all non-zeros. Now let’s turn the $\lambda_z$ penalty up to show that the Common Support problem is a special case when $\lambda_z \rightarrow \infty$ (although in practice $\lambda_z$ does not need to be very big to induce a solution with a common support).

mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = 5, 
            commonSupp = FALSE,
            lambda_1 = 0.001,
            lambda_z = 10)

print(round( mod$beta[1:8,], 2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.36  -0.60   0.37   1.10
## V1          1.50  -0.12   1.18   1.48
## V2         -0.44   2.27   1.75   0.25
## V3          0.28   1.52   1.67   1.40
## V4          0.00   0.00   0.00   0.00
## V5          0.00   0.00   0.00   0.00
## V6          0.43   2.20   0.79  -0.24
## V7          0.48   1.02  -0.26   2.23

Now we see all rows have the same sparsity patterns across columns (tasks). Note that in the above example we did not include the $\lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2$.

1e) Heterogeneous Support Models with Support Heterogeneity Regularization and Bbar Penalty

If we wanted to include penalties that share information both across the coefficient values and the coefficient supports, we could fit the following model where we replace the Ridge penalty (we do not recommend including both this penalty and the Ridge because both shrink the coefficient values):

\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X}_k \boldsymbol{\beta}_k ||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~ ~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]

mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = 5, 
            commonSupp = FALSE,
            lambda_2 = 0.5,
            lambda_z = 0.1)

1f) Predictions

Now that we have fit models, let’s make predictions and see the output structure. We feed predict() a model object and some new data X. Here we make predictions on the training data for simplicity but in practice we would likely be more interested in predictions on new observations.

preds <- sMTL::predict(model = mod, X = X[1:5,])

print( head(preds) )

##        task_1     task_2     task_3     task_4
## 1 -1.23179826 -4.4010715 -3.4726519  0.5208468
## 2 -0.48204960 -0.5706786 -0.2639711  1.7627481
## 3 -1.96315150 -3.9042719 -4.4020299 -2.6233604
## 4  0.04404262 -1.6988063 -0.3304088 -0.6197724
## 5 -0.62250746  0.9352522  3.2517639  0.4736623

We used each model to make predictions on the new data producing a $n^* \times K$ matrix where $n^*$ is the sample size of the data we make predictions on. In practice we may only want the predictions associated with one of the tasks. In those cases, we would pick out the corresponding column. Note that the predict() function assumes the column orders of the features/covariates in the design matrix are the same as those when the model was fit with the smtl() function (it does not use column names to order the columns).

Multi-Label Learning (Case 2)

Multi-Label Learning can be cast as a special case of Multi-Task Learning (Case 1). The main difference here is that we can imagine that we have a single $\mathbb{X} \in \mathbb{R}^{n \times p}$ that is common to all tasks. Put another way, we have one $\mathbb{X}$ and we store all the $\mathbf{y}_k$ as columns in a matrix $\mathbb{Y} \in \mathbb{R}^{p \times K}$. We set $K = 4$ and generate data from a sparse linear model with the following:

$\mathbb{X}$ is a $50 \times 100$ design matrix with iid standard normal entries
$\boldsymbol{\beta}_k$ is a $50 \times 1$ vector with 10 entries set to 1 (plus some task-specific gaussian noise) and the rest are zeros. The indices of the non-zero entries differ slightly between tasks.
$\boldsymbol{\epsilon}_k$ is a $100 \times 1$ vector with iid standard normal entries
$\mathbf{y}_k$ is a $100 \times 1$ response vector such that $\mathbf{y}_k = \mathbb{X} \boldsymbol{\beta}_k + \boldsymbol{\epsilon}_k$

This dataset can be generated in R as follows:

set.seed(1) # fix the seed to get a reproducible result
K <- 4 # number of datasets
p <- 100 # covariate dimension
s <- 5 # support size
q <- 7 # size of subset of covariates that can be non-zero for any task
N <- 50 # full dataset samplesize
X <- matrix( rnorm(N * p), nrow = N, ncol=p) # full design matrix
B <- matrix(1 + rnorm(K * (p+1) ), nrow = p + 1, ncol = K) # betas before making sparse
Z <- matrix(0, nrow = p, ncol = K) # matrix of supports
y <- matrix(nrow = N, ncol = K) # outcome vector

# randomly sample support to make betas sparse
for(j in 1:K)     Z[1:q, j] <- sample( c( rep(1,s), rep(0, q - s) ), q, replace = FALSE )
B[-1,] <- B[-1,] * Z # make betas sparse and ensure all models have an intercept

# iterate through and make each task specific dataset
for(j in 1:K){
    e <- rnorm(N)
    y[,j] <- B[1, j] + X %*% B[-1,j] + e
}
colnames(B) <- paste0("beta_", 1:K)
rownames(B) <- c("Intercept", paste0( "X_", 1:p ) )

Note that we do not have a task vector here. Instead we concatenate the outcomes $\boldsymbol{y}_k$ into a matrix, $\mathbb{Y} \in \mathbb{R}^{n \times K}$.

First let’s see the structure of the $\mathbb{B} \in \mathbb{R}^{p+1 \times K}$ where column $k$ is equal to $\boldsymbol{\beta}_k$. As we can see there is heterogeneity across tasks both in the support (which elements are non-zero) and in the magnitude of the coefficients. But there is enough shared structure that borrowing strength across tasks would be expected to improve performance.

print(round(B[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.52   2.38   2.31   1.08
## X_1         1.63   2.34   0.00   0.00
## X_2        -0.68   0.24   0.30   1.73
## X_3         0.00   0.00   1.87   0.00
## X_4         2.12   2.00   0.21  -0.22
## X_5        -0.24   0.00   1.55   0.89
## X_6         0.00   0.69   0.00   0.32
## X_7         1.60   1.72   0.18   1.49

We will use smtl() to estimate $\mathbb{B}$ from the data.

Again, we will start with fitting Common Support models.

2a) Common Support Models with Ridge Penalty

We start with the following model where we abbreviate the constraints above for conciseness (we assume the constraint that all models have the same support is implicit). Notice that there is a single $\mathbb{X}$ for all tasks.

\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X} \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 \quad \quad \mbox{(Common Support)} \\ & \mbox{subject to: }~ ||\boldsymbol{\beta}_k ||_0 \leq s, ~\forall~k \in \{1,2,...,K\} \notag \end{aligned} \]

To fit a solutions for the Common Support Multi-Label model with 5 non-zero coefficients (i.e., $s=5$) and a Ridge penalty $\lambda_1 = 0.001$, we use the smtl() function and then print out the coefficients stored in beta. Notice here the $y$ object in $\texttt{R}$ is a matrix, and that we do not include the study= syntax that we use for Multi-Task learning or Domain Generalization. Instead the columns of the $y$ object are treated as separate ``labels’’ (akin to separate tasks). Otherwise the syntax is the exact same.

mod <- sMTL::smtl(y = y, 
                X = X, 
                s = 5, 
                commonSupp = TRUE,
                lambda_1 = 0.001)

print(round(mod$beta[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.53   2.29   2.31   1.42
## V1          1.69   2.07  -0.03  -0.16
## V2          0.00   0.00   0.00   0.00
## V3          0.10   0.07   2.16  -0.01
## V4          2.06   2.02   0.04  -0.25
## V5         -0.39   0.14   1.55   0.87
## V6          0.00   0.00   0.00   0.00
## V7          1.58   1.61   0.21   1.39

2b) Heterogeneous Support Models

We next show code for the Heterogeneous Support case.

\[ \begin{aligned} &\underset{\boldsymbol{z},~\mathbb{B},\bar{\boldsymbol{\beta}} }{\mbox{min }} \sum_{k=1}^K \frac{1}{n_k} || \mathbf{y}_k - \mathbb{X} \boldsymbol{\beta}_k ||_2^2 + \lambda_1 ||\mathbb{B}||_2^2 + \lambda_2 \sum_{k=1}^K ||\boldsymbol{\beta}_k - \bar{{\boldsymbol{\beta}}} ||_2^2 + \lambda_z \sum_{k=1}^K ||\boldsymbol{z}_k - \bar{\boldsymbol{z}}||_2^2 \quad \quad \mbox{(Heterogeneous Support)} \\ & \mbox{subject to: }~ \boldsymbol{z}_k = \mbox{Supp}(\boldsymbol{\beta}_k) ~~~~~~~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \\ & ~~~~~~~~~~~~~~~~~~~ ||\boldsymbol{\beta}_k ||_0 \leq s ~~~~~~~~~~~ ~~~\mbox{for}~ k \in \{1,2,...,K\} \notag \end{aligned} \]

mod <- sMTL::smtl(y = y, 
                X = X, 
                s = 5, 
                commonSupp = FALSE,
                lambda_1 = 0.001,
                lambda_z = 0.1)

print(round(mod$beta[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.54   2.29   2.29   1.22
## V1          1.70   2.07   0.00   0.00
## V2         -0.46   0.51   0.55   1.56
## V3          0.00   0.00   2.15   0.00
## V4          2.07   2.02   0.00  -0.10
## V5         -0.38   0.00   1.55   0.82
## V6          0.00   0.59   0.00   0.36
## V7          1.56   1.60   0.19   1.36

Predictions will be labeled as “task_1”,…,“task_K.” Since this is a ``Multi-Label’’ learning problem, we would interpret “task_j” to mean “label_j” (or “outcome_j”).

Domain-Generalization (Case 3)

The syntax and setup here is basically identical to Case 1. The main difference comes in how we tune the models and make predictions. For that reason, we include an example involving tuning, fitting and predictions.

We again generate $K = 4$ synthetic datasets from a sparse linear model with the following:

$\mathbb{X}_k$ is a $50 \times 100$ design matrix with iid standard normal entries
$\boldsymbol{\beta}_k$ is a $50 \times 1$ vector with 10 entries set to 1 (plus some task-specific gaussian noise) and the rest are zeros. The indices of the non-zero entries differ slightly between tasks.
$\boldsymbol{\epsilon}_k$ is a $100 \times 1$ vector with iid standard normal entries
$\mathbf{y}_k$ is a $100 \times 1$ response vector such that $\mathbf{y}_k = \mathbb{X}_k \boldsymbol{\beta}_k + \boldsymbol{\epsilon}_k$

This dataset can be generated in R as follows:

set.seed(1) # fix the seed to get a reproducible result
K <- 4 # number of datasets
p <- 100 # covariate dimension
s <- 5 # support size
q <- 7 # size of subset of covariates that can be non-zero for any task
n_k <- 50 # task sample size
N <- n_k * p # full dataset samplesize
X <- matrix( rnorm(N * p), nrow = N, ncol=p) # full design matrix
B <- matrix(1 + rnorm(K * (p+1) ), nrow = p + 1, ncol = K) # betas before making sparse
Z <- matrix(0, nrow = p, ncol = K) # matrix of supports
y <- vector(length = N) # outcome vector

# randomly sample support to make betas sparse
for(j in 1:K)     Z[1:q, j] <- sample( c( rep(1,s), rep(0, q - s) ), q, replace = FALSE )
B[-1,] <- B[-1,] * Z # make betas sparse and ensure all models have an intercept

task <- rep(1:K, each = n_k) # vector of task labels (indices)

# iterate through and make each task specific dataset
for(j in 1:K){
    indx <- which(task == j) # indices of task
    e <- rnorm(n_k)
    y[indx] <- B[1, j] + X[indx,] %*% B[-1,j] + e
}
colnames(B) <- paste0("beta_", 1:K)
rownames(B) <- c("Intercept", paste0( "X_", 1:p ) )

Note that the task vector is a $N \times 1$ vector where element $i$ indicates the task index ($\{1,2,...,K\}$) of observation $i$. This is equivalent to a “Domain” or “Study.”

First let’s see the structure of the $\mathbb{B} \in \mathbb{R}^{p+1 \times K}$ where column $k$ is equal to $\boldsymbol{\beta}_k$. As we can see there is heterogeneity across domain-specific regression coefficients both in the support (which elements are non-zero) and in the magnitude of the coefficients. Again, there appears to be enough shared structure that borrowing strength across domains would be expected to improve performance on predicting some new, “unseen” domain.

print(round(B[1:8,],2))

##           beta_1 beta_2 beta_3 beta_4
## Intercept  -0.08  -0.64   0.33   0.72
## X_1         1.61   0.00   1.06   1.37
## X_2        -0.50   2.25   1.83   0.00
## X_3         0.00   1.74   1.49   1.29
## X_4         1.22   0.34   0.00   0.81
## X_5         0.39   0.00   1.62   1.11
## X_6         0.00   2.03   0.61   0.00
## X_7         0.81   1.11   0.00   2.26

3a) Heterogeneous Support Model with Ridge Penalty

We start with the following model where we abbreviate the constraints above for conciseness.

Here we use the same syntax from Case 1 except for 1) in the cv.smtl() function we set multiTask = FALSE and 2) when making predictions, we set stack = TRUE in the predict() function to use multi-study stacking to generate ensemble weights. The predict() function then produces aggregate predictions for some new dataset or ``domain’’ by taking a weighted average of the model-specific predictions: $\hat{f}(\boldsymbol{x}) = \sum_{k=1}^K w_k \boldsymbol{x}^T \hat{\boldsymbol{\beta}}_k$. We provide the output using average weights, $w_k$, (i.e., $w_k = 1/K, \forall~ k ~ \in \{1,2,...,K\}$), or multi-study stacking weights, $w_k$. We introduce cross validation functions more generally below.

# tuning grid
grid <- data.frame(s = c(4, 4, 5, 5), 
                   lambda_1 = c(0.01, 0.1, 0.01, 0.1), 
                   lambda_2 = rep(0, 4), 
                   lambda_z = c(0.01, 0.1, 0.01, 0.1)
                   )

# cross validation
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              grid = grid,
              nfolds = 5,
              multiTask = FALSE) 

# model fitting
mod <- sMTL::smtl(y = y, 
                X = X, 
                study = task, 
                s = tn$best.1se$s, 
                commonSupp = TRUE,
                lambda_1 = tn$best.1se$lambda_1,
                lambda_z = tn$best.1se$lambda_z)

Cross Validation

We can use our cross validation function to tune models. The syntax is similar to the smtl() function except one can specify an optional tuning grid, and the number of folds (nfolds). For Domain Generalization problems, one needs to switch multiTask=FALSE to change the cross validation to a hold-one-domain-out cross validation approach appropriate for the problem.

Here we show a Multi-Task Learning problem in which we let the cv.smtl() function generate a grid of hyperparameters for tuning internally, tune and then we fit a final model with the tuned hyperparameters.

# cross validation with internally generated grid
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              lambda_1 = TRUE, 
              lambda_2 = FALSE, 
              lambda_z = TRUE,
              nfolds = 5) 

# fit final model
mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = tn$best$s, 
            commonSupp = FALSE,
            lambda_1 = tn$best$lambda_1,
            lambda_z = tn$best$lambda_z)

We see that instead of providing a grid, we chose which hyperparameters we wanted to tune by setting them to TRUE.

User Specified Grids

We highly recommend specifying your own grids as good ranges for hyperparameter values may be very problem-specific. Next we show the same Multi-Task Learning problem with a (very small) custom grid and then fit a final model with the tuned hyperparameters.

# generate grid
grid <- data.frame(s = c(4,4,5,5), 
                   lambda_1 = c(0.01, 0.1,0.01, 0.1), 
                   lambda_2 = 0, 
                   lambda_z = c(0.01, 0.1,0.01, 0.1))

# cross validation with custom grid
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              grid = grid,
              nfolds = 5) 

# fit final model
mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = tn$best$s, 
            commonSupp = FALSE,
            lambda_1 = tn$best$lambda_1,
            lambda_z = tn$best$lambda_z)

Now we show a Domain-Generalization problem with a custom grid and then fit a final model with the tuned hyperparameters. We switch multiTask=FALSE and proceed as normal.

For illustrative purposes, we also choose the best.1se parameter values which chooses the smallest $s$ that achieves an average cross validated prediction error within 1 standard error of the parameters with the best cross validated error (and the best regularization parameters for that $s$). This can result in sparser models and is an alternative option.

# generate grid
grid <- data.frame(s = c(4,4,5,5), 
                   lambda_1 = c(0.01, 0.1,0.01, 0.1), 
                   lambda_2 = 0, 
                   lambda_z = c(0.01, 0.1,0.01, 0.1))

# cross validation with custom grid
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              grid = grid,
              nfolds = 5,
              multiTask = FALSE) 

# fit final model
mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = tn$best$s, 
            commonSupp = FALSE,
            lambda_1 = tn$best$lambda_1,
            lambda_z = tn$best$lambda_2)

Advanced Options

Local Search Options for Cross Validation

There are a few advanced options for users looking to adjust the speed of the algorithms and quality of solutions. The maxIter argument can be increased to improve the quality of the coordinate descent algorithm or reduced to improve speed (at a cost to the solution quality). Similarly, LocSrch_maxIter argument specifies the number of local search iterations and increasing it can improve the quality of the solutions, but can slow down the algorithm substantially. Setting LocSrch_maxIter=0 tells the package to use no local search and may even occasionally be preferable for some problems. Finally, increasing LocSrch_skip to an integer above 1 will increase the speed of the algorithm by only conducting local search on a subset of parameter values on the solution path. For example, LocSrch_skip=3 only uses local search on every $3^{rd}$ parameter value. If some local search is desirable, but one wishes to avoid using it at every parameter value because it is too slow, setting LocSrch_skip to somewhere between 2 and 5 could speed up cross-validation substantially while still retaining some of the benefits of local search. The assumption is using local search to find a solution for hyperparameter value $j$ should improve the quality of the solution for a model fit with hyperparameter $j+1$ too, since we use warm starts and paths of solutions.

# cross validation with custom grid using 
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              grid = grid,
              LocSrch_skip = 3,
              LocSrch_maxIter = 5,
              nfolds = 5,
              multiTask = FALSE)

Two Stage Cross Validation for Solution Quality and Speed

The optimization framework proposed here offers modeling richness at the cost of additional tuning parameters. For example, tuning a Heterogeneous Support model with a Zbar penalty and a Ridge penalty has 3 hyperparameters ($s$, $\lambda_1$ and $\lambda_z$) and thus tuning across a full tuning grid may incur a large computational cost. One solution to tune hyperparameters with a lower computational burden is to tune in two stages. First, in stage 1 we tune across a smaller grid with just $s$ and $\lambda_z$ and a small, fixed $lambda_1$ for numerical reasons. Then in stage 2, we take the optimal hyperparameter values and create a second grid built around a neighborhood of the optimal hyperparameter values from stage 1 and perform a second cross validation. This way we avoid making a large grid with all three hyperparameters $s$, $\lambda_a$ and $\lambda_z$.

######################################
# generate initial tuning grid
######################################
s <- c(5, 10, 15, 20)
lambda_1 <- c(1e-6) # small fixed ridge penalty for regularization
lambda_2 <- 0
lambda_z <- sort( unique( c(0, 1e-6, 1e-5, 1e-4, 1e-3, 
                            exp(-seq(0,5, length = 8)),  1, 3) ),  decreasing = TRUE )

grid <- expand.grid(s, lambda_1, lambda_2, lambda_z)
colnames(grid) <- c("s", "lambda_1", "lambda_2", "lambda_z")

# order correctly
grid <- grid[  order(-grid$s,
                      grid$lambda_1,
                      -grid$lambda_z,
                      decreasing=TRUE),     ]
###################################################
# stage 1: cross validation with initial grid 
###################################################
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              grid = grid,
              nfolds = 5,
              multiTask = FALSE) 

#############################################################################
# create second grid with optimal hyperparameters from stage 1
#############################################################################
# new sparsity grid
s <- tn$best$s
s <- c(s-2, s, s+2)
s <- s[s >= 1]

# ridge penalty grid
lambda_1 <- c(1e-7, 1e-5, 1e-3, 1e-1)

# new lambda_z grid
lambda_z <- tn$best$lambda_z
lambda_z <- sort( c( seq(1.5, 10, length = 5), seq(0.1, 1, length = 5) ) * lambda_z, decreasing = TRUE)
lambda_z <- lambda_z[lambda_z <= 10] # make sure this is below a threshold to prevent numerical issues

grid <- as.data.frame(  expand.grid( lambda_1, lambda_2, lambda_z, s) )
colnames(grid) <- c("lambda_1", "lambda_2", "lambda_z", "s")

# order correctly
grid <- grid[  order(-grid$s,
                      grid$lambda_1,
                      -grid$lambda_z,
                      decreasing=TRUE),     ]
            
###################################################
# stage 2: cross validation with updated grid 
###################################################
tn <- cv.smtl(y = y, 
              X = X, 
              study = task, 
              commonSupp = FALSE,
              grid = grid,
              nfolds = 5,
              multiTask = FALSE)

Fitting Whole Paths

Sometimes we may want to fit entire paths of solution for some fixed $s$. The algorithm is coded to fit whole paths to efficiently use warm starts to speed up the code substantially and identify better local minima. For that reason, if solutions for multiple values of tuning parameters, $(\lambda_1, \lambda_2, \lambda_z)$ are desired, it is far better to fit them jointly than to call smtl() repeatedly. The code is almost identical except that now one or more of the tuning parameter arguments $(\lambda_1, \lambda_2, \lambda_z)$ are a vector not a scalar. Notice that $s$ is always a scalar. We cannot put multiple values of $s$ in because the path of solutions is generated independently for each value of $s$.

mod <- smtl(y = y, 
            X = X, 
            study = task, 
            s = 5, 
            commonSupp = FALSE,
            lambda_1 = c(0.1, 0.2, 0.3),
            lambda_z = c(0.01, 0.05, 0.1)
            )

Now the solutions in mod$beta is a $K \times p \times T$ array where $T$ is the number of unique tuning value combinations (3 in the example above) used to fit the path of solutions.

Predictions on Whole Path Fits

Now we have many solutions so if we want to make predictions we have to specify which hyperparameter value solutins we want predictions for or else the package will issue a warning and return no predictions.

preds <- sMTL::predict(model = mod, 
                       X = X, 
                       lambda_1 = 0.1, 
                       lambda_z = 0.01)

Troubleshooting

You are welcome to contact me about sMTL at gloewinger@gmail.com with any questions or error reports.
This GitHub account gloewing/sMTL hosts the development version of sMTL.
For any errors with downloading $\texttt{Julia}$, or the packages, see documentation for the dependent $\texttt{R}$ package JuliaCall. We use JuliaCall to download $\texttt{Julia}$ with the function JuliaCall::julia_setup() and we use JuliaCall::julia_install_package_if_needed() to download packages the $\texttt{Julia}$ packages $\texttt{TSVD}$, $~\texttt{LinearAlgebra}$ and $~\texttt{Statistics}$ through $\texttt{R}$.
For any errors with connecting to an existing installation of $\texttt{Julia}$ see documentation for the JuliaConnectoR $\texttt{R}$ package. We use this to allow $\texttt{R}$ to start a $\texttt{Julia}$ session and call the $\texttt{.jl}$ files that contain our algorithms.
If using the smtl_setup() function causes errors, the following steps can be taken after installing $\texttt{Julia}$ and the $\texttt{Julia}$ packages:

Open $\texttt{Julia}$ and type >> println(Sys.BINDIR) and copy the binary path (a string). Replace the string after path <- in the R code below with this binary path and run the following:

# save path of Julia binary location on your computer
# ***replace with your computer's path***
path <- "/Applications/Julia-1.7.app/Contents/Resources/julia/bin" 

# set so JuliaConnectoR can access Julia installation
Sys.setenv(JULIA_BINDIR = path )

Find the path of the sMTL package installation on your computer and save a $\texttt{.txt}$ file in the package installation source files so the package knows where to find the $\texttt{Julia}$ binary.

# find path of sMTL package installation on your computer
smtl_path <- paste0( .libPaths("sMTL"), "/sMTL/julia_path/" )
smtl_path <- smtl_path[length(smtl_path)] # use most recent 
    
# save Julia path for sMTL package directory for future use
fileConn <- file( paste0(smtl_path, "julia.path.txt") )
base::writeLines(path, fileConn) # save new julia path
close(fileConn)

References

Gabriel Loewinger, Kayhan Behdin, Kenneth Kishida, Giovanni Parmigiani and Rahul Mazumder. Multi-Task Learning for Sparsity Pattern Heterogeneity: A Discrete Optimization Approach. arXiv:2212.08697 (2022).

Prasad Patil and Giovanni Parmigiani. Training replicable predictors in multiple studies. Proceedings of the National Academy of Sciences (2018).

Yu Zhang and Qiang Yang. A Survey on Multi-Task Learning. IEEE Transactions on Knowledge and Data Engineering (2021).

sMTL Vignette

Gabriel Loewinger, Kayhan Behdin, Giovanni Parmigiani, Rahul Mazumder

National Institute on Mental Health (NIH), Harvard University, Massachusetts Institute of Technology (MIT)

2023-01-29

Introduction

Installation

Package Setup

Tutorial Guide

Multi-Task Learning (Case 1)

1a) Common Support Models with Ridge Penalty

1b) Common Support Models with Cross-Task Coefficient Shrinkage

1c) Heterogeneous Support Models with Ridge Penalty

1d) Heterogeneous Support Models with Support Heterogeneity Regularization

1e) Heterogeneous Support Models with Support Heterogeneity Regularization and Bbar Penalty

1f) Predictions

Multi-Label Learning (Case 2)

2a) Common Support Models with Ridge Penalty

2b) Heterogeneous Support Models

Domain-Generalization (Case 3)

3a) Heterogeneous Support Model with Ridge Penalty

Cross Validation

User Specified Grids

Advanced Options

Local Search Options for Cross Validation

Two Stage Cross Validation for Solution Quality and Speed

Fitting Whole Paths

Predictions on Whole Path Fits

Troubleshooting

References