Abstract
In this short note we present and briefly discuss the R package islasso, which deals with regression models having a large number of covariates. Estimation is carried out by penalizing the coefficients via a quasi-lasso penalty, wherein the nonsmooth lasso penalty is replaced by a smooth counterpart determined iteratively from the data according to the induced smoothing idea. The package includes functions to estimate the model and to test linear hypotheses on linear combinations of the relevant coefficients. We illustrate the R code through a worked example, intentionally omitting technical details and an extended bibliography.
Introduction
Let \(\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\) be the linear model of interest, with the usual zero-mean and homoscedastic errors. As usual, \(\mathbf{y} = (y_1,\ldots,y_n)^T\) is the response vector and \(\mathbf{X}\) is the \(n \times p\) design matrix (with \(p\) possibly quite large) with regression coefficient vector \(\mathbf{\beta}\). When interest lies in selecting the non-noise covariates and estimating the relevant effects, one minimizes the lasso penalized objective function (Tibshirani, 1996), \[\frac{1}{2}||\mathbf{y}-\mathbf{X}\mathbf{\beta}||_2^2+\lambda||\mathbf{\beta}||_1.\]
The R functions
The main function of the package is islasso(), in which the user supplies the model formula as in the usual lm() and glm() functions, i.e.
islasso(formula, family, lambda, alpha, data, weights, subset, offset, unpenalized,
contrasts, control = is.control())
family accepts the specification of the family and link function as in Table 1, lambda is the tuning parameter, and unpenalized allows the user to indicate covariates whose coefficients should not be penalized (see the sketch after Table 1).
Table 1. Families and link functions allowed in islasso

| family   | link                   |
|----------|------------------------|
| gaussian | identity               |
| binomial | logit, probit          |
| poisson  | log                    |
| Gamma    | identity, log, inverse |
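For instance, a hypothetical call fitting a probit model at a fixed lambda, leaving the coefficient of z unpenalized, might read as follows (the data frame mydata and the variables y, x1, x2 and z are illustrative, and we assume unpenalized accepts covariate names):

fit0 <- islasso(y ~ x1 + x2 + z, family = binomial(link = "probit"),
                lambda = 2, data = mydata, unpenalized = "z")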
The fitter function is islasso.fit(), which reads as
islasso.fit(X, y, family, lambda, alpha = 1, intercept = FALSE, weights = NULL, offset = NULL,
unpenalized = NULL, control = is.control())
which actually implements the estimating algorithm as described in the paper. The lambda argument in islasso.fit() and islasso() specifies the positive tuning parameter in the penalized objective. Any non-negative value can be provided; if it is missing, it is computed via \(K\)-fold cross-validation by the function cv.glmnet() from the package glmnet. The number of folds can be specified via the argument nfolds of the auxiliary function is.control(), as sketched below.
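For instance, a minimal sketch leaving lambda unspecified so that it is chosen by 5-fold cross-validation (mydata is again an illustrative data frame):

fit0 <- islasso(y ~ ., data = mydata, control = is.control(nfolds = 5))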
A worked example: the Diabetes data set
We use the well-known diabetes dataset available in the lars package. The data refer to \(n = 442\) patients enrolled to investigate a measure of disease progression one year after baseline. There are ten covariates: age, sex, bmi (body mass index), map (average blood pressure), and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). The matrix x2 in the data frame also includes second-order terms, namely first-order interactions between covariates and quadratic terms for the continuous variables.
To select the important terms in the regression equation we apply the lasso:
library(lars)
library(glmnet)
data("diabetes", package = "lars")
a1 <- with(diabetes, cv.glmnet(x2, y))
n <- nrow(diabetes)
a1$lambda.min * n
> [1] 1344.186
b <- coef(a1, s = "lambda.min")[, 1]  # coefficients at the optimal lambda
sum(b != 0)                           # number of non-null coefficients
> [1] 15
Ten-fold cross-validation “selects” \(\lambda = 1344.186\), corresponding to 15 non-null coefficients:
names(b)[b != 0]
> [1] "(Intercept)" "sex" "bmi" "map" "hdl"
> [6] "ltg" "glu" "age^2" "bmi^2" "glu^2"
> [11] "age:sex" "age:map" "age:ltg" "age:glu" "bmi:map"
The last six estimates are
tail(b[b != 0], n = 6)
> glu^2 age:sex age:map age:ltg age:glu bmi:map
> 69.599081 107.479925 29.970061 8.506032 11.675332 85.530937
A reasonable question is whether all the “selected” coefficients are significant in the model. Unfortunately, lasso regression does not return standard errors due to the nonsmoothness of the objective, and several alternative approaches have been proposed, including the ‘covariance test’ (Lockhart et al., 2013).
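A minimal sketch of how output of the kind below could be obtained via the covTest package (covTest() requires a lars fit on the same data; the object name a0 is ours):

library(covTest)
a0 <- with(diabetes, lars(x2, y))   # lasso path via lars
with(diabetes, covTest(a0, x2, y))  # covariance test along the path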
> Predictor_Number Drop_in_covariance P-value
> 3 20.1981 0.0000
> 9 52.5964 0.0000
> 4 5.7714 0.0034
> 7 4.0840 0.0176
> 37 1.3310 0.2655
> 20 0.3244 0.7232
The results suggest that only the terms corresponding to columns 3, 9, 4, and 7 of the matrix x2 are significant, namely
colnames(diabetes$x2)[c(3, 9, 4, 7)]
> [1] "bmi" "ltg" "map" "hdl"
covTest returns \(p\)-values across the whole \(\lambda\) path, meaning that such \(p\)-values are not matched to the corresponding point estimates obtained at the optimal \(\lambda\) value (\(\lambda = 1344.186\) in this example). As a consequence, only some of the selected non-null coefficients can be assessed as significant. Among the (few) alternative strategies, including post-selection inference and the (modified) residual bootstrap, here we illustrate the R package islasso, which implements the recent ‘quasi-lasso’ approach based on the induced smoothing idea (Brown and Wang, 2005), as discussed in Cilluffo et al. (2019).
While the optimal \(\lambda\) could be selected automatically (by not supplying any value to lambda), we use the value obtained above to facilitate comparisons.
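The fitting call, as echoed in the Call line of the summary output below, is (fit is our name for the returned object):

library(islasso)
fit <- islasso(y ~ x2, lambda = 1344.186, data = diabetes)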
The summary method quickly returns the main output of the fitted model, including point estimates, standard errors and \(p\)-values. Visualizing estimates for all covariates can be somewhat inconvenient, especially when the number of covariates is large, so we print estimates only when the \(p\)-value is less than a threshold value; we use 0.10.
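A sketch of the corresponding call, assuming the pval argument of the summary method sets the printing threshold:

summary(fit, pval = 0.10)  # print only coefficients with p-value < 0.10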
>
> Call:
> islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)
>
> Residuals:
> Min 1Q Median 3Q Max
> -138.65 -40.18 -4.46 34.45 143.35
>
> Estimate Std. Error Df z value Pr(>|z|)
> (Intercept) 152.133 2.542 1.000 59.842 < 2e-16 ***
> x2sex -114.624 53.818 0.890 -2.130 0.03318 *
> x2bmi 494.758 69.273 1.000 7.142 9.19e-13 ***
> x2map 250.768 63.537 1.000 3.947 7.92e-05 ***
> x2hdl -188.773 63.736 0.985 -2.962 0.00306 **
> x2ltg 464.780 67.117 1.000 6.925 4.36e-12 ***
> x2glu^2 74.504 45.089 0.758 1.652 0.09846 .
> x2age:sex 108.371 50.174 0.904 2.160 0.03078 *
> x2bmi:map 85.180 46.552 0.811 1.830 0.06728 .
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for gaussian family taken to be 2856.735)
>
> Null deviance: 2621009 on 441.0 degrees of freedom
> Residual deviance: 1223354 on 428.2 degrees of freedom
> AIC: 4787.1
> Lambda: 1344.2
>
> Number of Newton-Raphson iterations: 136
In addition to the usual information printed by the summary method, the output also includes the column Df representing the degrees of freedom of each coefficient. Their sum is used to quantify the model complexity
> [1] 13.76514
and the corresponding residual degrees of freedom (\(442 - 13.765 = 428.235\), as reported above). The Wald statistics (column z value) and the \(p\)-values can be used to assess important or significant covariates. In addition to the terms flagged as significant by covTest, namely “bmi”, “ltg”, “map” and “hdl”, islasso() also returns ‘small’ \(p\)-values for the terms “sex”, “age:sex”, and “bmi:map”.
As an alternative to cross-validation, it is also possible to select the tuning parameter \(\lambda\) by means of the Bayesian or Akaike Information Criterion. The function aic.islasso() requires an islasso fit object and the specification of the criterion to be used (AIC or BIC). Hence
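A hedged sketch, assuming aic.islasso() returns the optimizing value of \(\lambda\) and accepts the criterion via a method argument:

lambda.aic <- aic.islasso(fit, method = "AIC")  # lambda minimizing the AIC
fit.aic <- update(fit, lambda = lambda.aic)     # refit at the selected value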
Comparisons between methods to select the tuning parameter and further discussions

We conclude this short note by emphasizing that islasso also accepts the so-called elastic-net penalty, \[ \frac{1}{2}||\mathbf{y}-\mathbf{X}\mathbf{\beta}||_2^{2}+\lambda \left\{ \alpha ||\mathbf{\beta}||_1 + \frac{1-\alpha}{2} ||\mathbf{\beta}||^{2}_2 \right\} \] where \(0\le \alpha\le 1\) is the mixing parameter to be specified in islasso() and islasso.fit() via the argument alpha.
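For instance, a sketch of an elastic-net fit on the diabetes data with equal mixing (alpha is the argument shown in the islasso() signature above):

fit.en <- islasso(y ~ x2, lambda = 1344.186, alpha = 0.5, data = diabetes)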
References
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B 1996; 58: 267–288.
- Cilluffo G, Sottile G, La Grutta S and Muggeo VMR. The induced smoothed lasso: A practical framework for hypothesis testing in high dimensional regression. Stat Methods Med Res 2019; doi: 10.1177/0962280219842890.
- Brown B and Wang Y. Standard errors and covariance matrices for smoothed rank estimators. Biometrika 2005; 92: 149–158.
- Lockhart R, Taylor J and Tibshirani R. covTest: Computes covariance test for adaptive linear modelling. R package version 1.02, https://CRAN.R-project.org/package=covTest (2013).
- Lockhart R, Taylor J, Tibshirani R, et al. A significance test for the lasso. Ann Stat 2014; 42: 413–468.