Knockoff Statistic \(W_j\)
In general, the knockoff statistic \(W\) draws on the information in the design matrix \(X\) and its knockoff copy \(\tilde{X}\). Once the knockoff statistics \(W_j\) are available, the data-dependent threshold is computed with the following formula: \[
T=\min \left\{t \in \mathcal{W}: \frac{\#\left\{j: W_{j} \leq-t\right\}}{\#\left\{j: W_{j} \geq t\right\} \vee 1} \leq q\right\}, \quad \mathcal{W}=\left\{\left|W_{j}\right|: j=1, \ldots, d\right\} \backslash\{0\}
\] where \(d\) is the number of features, \(q\) is the target FDR level, and \(T=+\infty\) if no such \(t\) exists. Based on the threshold \(T\), the features \(j\) with \(W_j \geq T\) are selected.
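The threshold rule can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind our tools: the function names `knockoff_threshold` and `knockoff_select` are hypothetical, the search is restricted to the nonzero magnitudes \(|W_j|\) as in Barber and Candes (2015), and features are selected when \(W_j \geq T\).

```python
def knockoff_threshold(W, q):
    """Smallest nonzero |W_j| whose estimated false discovery proportion is <= q.

    Returns float('inf') when no candidate qualifies, so nothing is selected.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)   # #{j : W_j <= -t}
        positives = sum(1 for w in W if w >= t)    # #{j : W_j >= t}
        if negatives / max(positives, 1) <= q:     # "v 1" guards against 0 in the denominator
            return t
    return float("inf")


def knockoff_select(W, q):
    """Indices (0-based) of the features with W_j >= T."""
    T = knockoff_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= T]


# Toy statistics, made up purely for illustration
W = [5, 4, -1, 3, 2, -0.5, 0]
print(knockoff_threshold(W, q=0.25))  # 1
print(knockoff_select(W, q=0.25))     # [0, 1, 3, 4]
```

With these toy values, \(t=0.5\) gives a ratio of \(2/4\), which exceeds \(q=0.25\), while \(t=1\) gives \(1/4\), so \(T=1\).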
There are many ways to measure variable importance and compute \(W_j\); here we list only the knockoff statistics used in our project and interactive tools.
Advanced usage with custom knockoff statistic
- LASSO: model \(Y\sim(X,\tilde{X})\) with the Lasso, without an intercept. Here \(d\) is the number of original features, so index \(j\) refers to \(X_j\) and index \(j+d\) to its knockoff \(\tilde{X}_j\).
  - lasso coefdiff: \(W_j=|\hat{\beta}_j|-|\tilde{\beta}_j|\), where \(\hat{\beta}_j\) and \(\tilde{\beta}_j\) are the Lasso coefficients of \(X_j\) and \(\tilde{X}_j\) at a fixed \(\lambda\) chosen by cross-validation.
  - lasso lambdadiff: \(W_{j}=\lambda_{j}-\lambda_{j+d}\), where \(\lambda_{k}=\sup \left\{\lambda: \hat{\beta}_{k} \neq 0\right\}\) is the point at which feature \(k\) first enters the Lasso path.
  - lasso lambdasmax: with \(\lambda_{k}=\sup \left\{\lambda: \hat{\beta}_{k} \neq 0\right\}\), \[W_{j}=\left\{\begin{array}{ll}\lambda_{j} & \text { if } \lambda_{j}>\lambda_{j+d} \\ -\lambda_{j+d} & \text { if } \lambda_{j}<\lambda_{j+d} \\ 0 & \text { if } \lambda_{j}=\lambda_{j+d}\end{array}\right.\]
- Pairwise relationship
  - Correlation difference: \(W_j=\left|X_{j}^{\top} y\right|-\left|\tilde{X}_{j}^{\top} y\right|\)
  - Kendall: the Kendall rank correlation coefficient (\(\tau\)) measures the ordinal association between two measured quantities and is used in nonparametric statistics. \[\tau(x,y)=\frac{2}{n(n-1)} \sum_{i<k} \operatorname{sgn}\left(x_{i}-x_{k}\right) \operatorname{sgn}\left(y_{i}-y_{k}\right)\] We define \(W_j=|\tau(X_j,y)|-|\tau(\tilde{X}_j,y)|\)
  - Spearman: Spearman's rank correlation coefficient (\(\rho\)) is a nonparametric measure of dependence between the rankings of two variables. When there are no tied ranks, \[d_i=\operatorname{rank}(x_i)-\operatorname{rank}(y_i), \quad \rho(x,y)=1-\frac{6 \sum_i d_{i}^{2}}{n\left(n^{2}-1\right)}\] We define \(W_j=|\rho(X_j,y)|-|\rho(\tilde{X}_j,y)|\)
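The statistics listed above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in our tools: all function names are hypothetical, the Lasso coefficients and entering times are assumed to have been computed elsewhere and stored as lists of length \(2d\) (originals first, knockoffs second), ties in lambdasmax are assumed to give \(W_j=0\), and the Spearman helper assumes there are no tied values.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


# Lasso-based statistics: inputs have length 2d (originals, then knockoffs)
def lasso_coefdiff(beta, d):
    return [abs(beta[j]) - abs(beta[j + d]) for j in range(d)]


def lasso_lambdadiff(lam, d):
    return [lam[j] - lam[j + d] for j in range(d)]


def lasso_lambdasmax(lam, d):
    # Larger entering time, signed by which of the pair entered first (0 on ties)
    return [max(lam[j], lam[j + d]) * sign(lam[j] - lam[j + d]) for j in range(d)]


# Pairwise measures between one feature column x and the response y
def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))


def kendall_tau(x, y):
    n = len(x)
    total = sum(sign(x[i] - x[k]) * sign(y[i] - y[k])
                for i in range(n) for k in range(i + 1, n))
    return 2 * total / (n * (n - 1))


def ranks(v):
    """Ranks 1..n of the entries of v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def spearman_rho(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def pairwise_statistic(measure, X_cols, Xk_cols, y):
    """W_j = |measure(X_j, y)| - |measure(knockoff of X_j, y)| for each column j."""
    return [abs(measure(xj, y)) - abs(measure(xkj, y))
            for xj, xkj in zip(X_cols, Xk_cols)]


# Toy entering times for d = 3 features, made up for illustration
lam = [1.0, 0.25, 0.5, 0.5, 0.75, 0.5]
print(lasso_lambdadiff(lam, 3))   # [0.5, -0.5, 0.0]
print(lasso_lambdasmax(lam, 3))   # [1.0, -0.75, 0.0]
```

Passing `inner_product`, `kendall_tau`, or `spearman_rho` as `measure` gives the correlation difference, Kendall, and Spearman statistics respectively, with `X_cols` and `Xk_cols` holding the columns of \(X\) and \(\tilde{X}\) as lists.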
Model data using GLM
If you believe the response variable \(y\) follows a generalized linear model, you can fit a penalized generalized linear model to obtain the coefficient estimates as well as the entering times \(\lambda_k\). For example, if \(y\) is binary, you can model the data with a GLM using family="binomial"; if \(y\) is a non-negative integer, consider family="poisson".
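One common way to write such a penalized fit is sketched below, assuming an \(\ell_1\) (Lasso-type) penalty. The notation is introduced here for illustration only: \(\ell\) is the GLM log-likelihood, \(\beta\) is the coefficient vector over the \(2d\) columns of \((X,\tilde{X})\), \(z_i\) is the \(i\)-th row of \((X,\tilde{X})\), and the intercept is omitted for brevity. \[
\hat{\beta}(\lambda)=\arg \min _{\beta \in \mathbb{R}^{2 d}}\left\{-\frac{1}{n} \ell\left(\beta ; y,(X, \tilde{X})\right)+\lambda\|\beta\|_{1}\right\}
\] For a binary response with the logit link, for instance, \(\ell(\beta)=\sum_{i=1}^{n}\left[y_{i} z_{i}^{\top} \beta-\log \left(1+e^{z_{i}^{\top} \beta}\right)\right]\). The coefficient-difference and entering-time statistics are then formed from \(\hat{\beta}(\lambda)\) exactly as in the Lasso case.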
References
- Barber, R. F., & Candes, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055-2085.
- Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
- Meinshausen, N., & Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.