1 An Illustrative Example

To illustrate the model confidence bounds (MCBs), we adopt the simulation model below, as in Li et al. (2019), for Gaussian distributed responses: \[ y_i=\sum_{j=1}^{p^{\ast}}\theta_{j}x_{ij}+\sum_{j=p^{\ast}+1}^{p}0\times x_{ij}+\epsilon_{i} \]

where \(p^{\ast}\) is the number of true variables and \(p\) is the number of candidate variables; the bootstrap treatment of the lasso estimators follows Chatterjee and Lahiri (2011).

We simulate the random errors according to \(\epsilon_{i}\sim N(0,\sigma^2)\) and the covariates according to \(x_{i}\sim N_{p}(0,\Sigma)\), where \(\Sigma=[\Sigma_{ij}]_{p\times p}\) and \(\Sigma_{ij}=\rho^{|i-j|}\).

1.1 Data generation

We consider the following scenario: (a) \(p=10\), \(p^{\ast}=5\).

1.1.1 Step 1: Set simulation parameters

We assume 10 candidate covariates with correlation parameter \(\rho=0.5\), of which 5 are significant, and we simulate 300 subjects.

# Simulation parameters
REPS = 1   # number of simulation replications
I    = 10  # number of candidate covariates
N    = 300 # number of subjects
I0   = 5   # number of significant covariates
rho  = 0.5 # correlation between covariates (0, 0.5 or 0.99)

1.1.2 Step 2: Generate an \(N \times I\) matrix of multivariate normal covariates

\(\mathbf{x}\sim N_{p}(0,\Sigma)\), where \(\Sigma=[\Sigma_{ij}]_{p\times p}\) and \(\Sigma_{ij}=\rho^{|i-j|}\).

set.seed(123)
library(MASS) # for mvrnorm()
# Covariates: mean vector and AR(1)-type correlation matrix
mu_X  <- matrix(0, I, 1) # mean of covariate vector
Sig_X <- matrix(1, I, I) # covariance of covariate vector
for (i in 1:(I - 1)) {   # note the parentheses: 1:I-1 would start at 0
  for (j in (i + 1):I) {
    Sig_X[i, j] <- rho^(j - i)
    Sig_X[j, i] <- Sig_X[i, j]
  }
}
Sig_X
##              [,1]       [,2]      [,3]     [,4]    [,5]    [,6]     [,7]
##  [1,] 1.000000000 0.50000000 0.2500000 0.125000 0.06250 0.03125 0.015625
##  [2,] 0.500000000 1.00000000 0.5000000 0.250000 0.12500 0.06250 0.031250
##  [3,] 0.250000000 0.50000000 1.0000000 0.500000 0.25000 0.12500 0.062500
##  [4,] 0.125000000 0.25000000 0.5000000 1.000000 0.50000 0.25000 0.125000
##  [5,] 0.062500000 0.12500000 0.2500000 0.500000 1.00000 0.50000 0.250000
##  [6,] 0.031250000 0.06250000 0.1250000 0.250000 0.50000 1.00000 0.500000
##  [7,] 0.015625000 0.03125000 0.0625000 0.125000 0.25000 0.50000 1.000000
##  [8,] 0.007812500 0.01562500 0.0312500 0.062500 0.12500 0.25000 0.500000
##  [9,] 0.003906250 0.00781250 0.0156250 0.031250 0.06250 0.12500 0.250000
## [10,] 0.001953125 0.00390625 0.0078125 0.015625 0.03125 0.06250 0.125000
##            [,8]       [,9]       [,10]
##  [1,] 0.0078125 0.00390625 0.001953125
##  [2,] 0.0156250 0.00781250 0.003906250
##  [3,] 0.0312500 0.01562500 0.007812500
##  [4,] 0.0625000 0.03125000 0.015625000
##  [5,] 0.1250000 0.06250000 0.031250000
##  [6,] 0.2500000 0.12500000 0.062500000
##  [7,] 0.5000000 0.25000000 0.125000000
##  [8,] 1.0000000 0.50000000 0.250000000
##  [9,] 0.5000000 1.00000000 0.500000000
## [10,] 0.2500000 0.50000000 1.000000000
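
The same AR(1)-type matrix can be built in one vectorized line, which also serves as a check on the loop above (this line is an added illustration, not part of the original code):

max(abs(Sig_X - rho^abs(outer(1:I, 1:I, "-")))) # should be exactly 0
## [1] 0
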
# mvrnorm returns a vector when n = 1, otherwise an n-by-length(mu) matrix with one sample per row
X <- mvrnorm(n = N, mu = mu_X, Sigma = Sig_X)
X_scaled <- scale(X, center = TRUE, scale = TRUE) # standardize each column
# Check that each scaled column has mean 0 and sd 1
colMeans(X_scaled) # faster version of apply(X_scaled, 2, mean)
##  [1] -5.452814e-18  1.210259e-17 -7.939251e-18 -1.231943e-17 -1.088250e-17
##  [6]  1.513257e-17 -3.162979e-18 -2.532696e-18  2.612494e-17  9.442678e-18
apply(X_scaled, 2, sd)
##  [1] 1 1 1 1 1 1 1 1 1 1
head(X_scaled,5)
##             [,1]         [,2]       [,3]        [,4]       [,5]
## [1,]  1.60940464 -0.085021999 -0.3792668 -0.51549844  0.2593259
## [2,]  0.06681852 -0.445116573 -0.3001291 -0.01724257 -0.4376752
## [3,] -0.27526113 -1.829945458 -1.4877424 -1.34426248 -0.4111843
## [4,] -2.71900689 -1.123919510 -0.1521078  0.26564921  0.7490104
## [5,] -0.55198900  0.006873067  0.6130698 -1.08909907  0.1058743
##             [,6]         [,7]        [,8]      [,9]       [,10]
## [1,] -0.63316431  1.583601014  0.64284158 1.0406042  0.66825134
## [2,]  0.06023995  1.219893999  0.86942523 0.5215203 -0.30608721
## [3,] -1.15699877 -0.521990363 -1.09982958 0.6420850 -0.16802682
## [4,]  0.56342109 -0.001640709  0.02853053 0.5622696  0.01783038
## [5,]  0.07326517 -0.801830038 -0.45856790 1.1033287  1.08735480

1.1.3 Step 3: Generate responses from the simulation model in Li et al. (2019) for Gaussian distributed responses

set.seed(123)
sigma = 1
eps <- rnorm(N, mean = 0, sd = sigma) # random errors
beta <- runif(I0, min = 1, max = 1)   # degenerate uniform: all true coefficients equal 1
beta
## [1] 1 1 1 1 1
set.seed(123)
# I_X <- sample(I, I0, replace = FALSE) # alternative: choose the significant covariates at random
I_X <- c(1, 2, 3, 4, 5) # indices of the significant covariates
Y_full <- X_scaled[, I_X] %*% beta + eps

DF = data.frame("Y" = Y_full, "X" = X_scaled)
head(DF,5)
##            Y         X.1          X.2        X.3         X.4        X.5
## 1  0.3284676  1.60940464 -0.085021999 -0.3792668 -0.51549844  0.2593259
## 2 -1.3635224  0.06681852 -0.445116573 -0.3001291 -0.01724257 -0.4376752
## 3 -3.7896874 -0.27526113 -1.829945458 -1.4877424 -1.34426248 -0.4111843
## 4 -2.9098662 -2.71900689 -1.123919510 -0.1521078  0.26564921  0.7490104
## 5 -0.7859832 -0.55198900  0.006873067  0.6130698 -1.08909907  0.1058743
##           X.6          X.7         X.8       X.9        X.10
## 1 -0.63316431  1.583601014  0.64284158 1.0406042  0.66825134
## 2  0.06023995  1.219893999  0.86942523 0.5215203 -0.30608721
## 3 -1.15699877 -0.521990363 -1.09982958 0.6420850 -0.16802682
## 4  0.56342109 -0.001640709  0.02853053 0.5622696  0.01783038
## 5  0.07326517 -0.801830038 -0.45856790 1.1033287  1.08735480
mean(Y_full)
## [1] 0.03444141
#hist(Y_full)

1.1.4 Step 4: Bootstrap

In this step, we resample the responses (and, separately, each covariate) with replacement, repeating the bootstrap \(B\) times with \(B=100\).

1.1.4.1 Step 4-1

We sample with replacement (replace = TRUE) to fill all \(B \times N\) elements of the bootstrap matrix for Y.

set.seed(123)
B = 100
boot.samples_Y = matrix(sample(Y_full, size = B * N, replace = TRUE), B, N)
boot.mean_Y = apply(boot.samples_Y, 1, mean)
head(boot.mean_Y,20)
##  [1] -0.01693792 -0.11295660  0.18361847 -0.22014834  0.17989428
##  [6]  0.07326172 -0.33881065  0.18714089  0.16366041  0.34790616
## [11]  0.01373613 -0.07618314 -0.01724815 -0.12365707  0.26581383
## [16] -0.11363561  0.05694913  0.01760151 -0.11932825  0.38920455
#set.seed(123)
#require(ggplot2)
#ggplot(data.frame(meanY = boot.mean_Y),aes(x=meanY)) +
#geom_histogram(binwidth=0.25,aes(y=..density..)) +
#geom_density(color="red")

We construct a 95% confidence interval for the mean of the bootstrapped Y:

mean(Y_full) + c(-1, 1) * 2 * sd(boot.mean_Y)
## [1] -0.2999688  0.3688516
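
This is the normal-approximation bootstrap interval

\[ \bar{y} \pm 2\,\widehat{\mathrm{se}}_{\text{boot}}, \qquad \widehat{\mathrm{se}}_{\text{boot}}=\operatorname{sd}\big(\bar{y}^{*(1)},\ldots,\bar{y}^{*(B)}\big), \]

where \(\bar{y}^{*(b)}\) denotes the mean of the \(b\)-th bootstrap sample.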

Next, we bootstrap the covariates \(\mathbf{x}_i,\ i=1,2,\cdots,10\).

We sample with replacement (replace = TRUE) to fill all \(B \times N\) elements for each covariate, stored in a \(B \times N \times I\) array.

set.seed(123)
boot.samples_X <- array(0, dim = c(B, N, I))
for (i in 1:I) {
  boot.samples_X[, , i] = matrix(sample(X_scaled[, i], size = B * N, replace = TRUE), B, N)
}

dim(boot.samples_X[,,1])
## [1] 100 300

Alternatively, we can bootstrap a single covariate, e.g. \(\mathbf{x}_1\):

set.seed(123)
boot.samples_X1= matrix(sample(X_scaled[,1], size = B * N, replace = TRUE),B, N)
dim(boot.samples_X1)
## [1] 100 300
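
Note that resampling Y and each covariate column independently does not preserve the pairing between responses and covariates. A common alternative, shown here only as a sketch (it is not used in the steps below), is the case (paired) bootstrap, which resamples whole \((y_i, \mathbf{x}_i)\) rows jointly:

set.seed(123)
# Case (paired) bootstrap: resample row indices so each replicate keeps
# the original response-covariate pairing
boot.samples_XY <- array(0, dim = c(B, N, I + 1))
for (b in 1:B) {
  idx <- sample(N, size = N, replace = TRUE)
  boot.samples_XY[b, , 1] <- Y_full[idx]      # slice 1: bootstrapped responses
  boot.samples_XY[b, , -1] <- X_scaled[idx, ] # slices 2:(I+1): bootstrapped covariates
}
dim(boot.samples_XY)
## [1] 100 300  11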

1.1.5 Fitting with the lasso

We follow the steps below to fit the lasso to each bootstrap sample:

  • \(\lambda\) is chosen by 5-fold cross-validation, based on the mean squared error criterion.

  • “lambda.min” is the \(\lambda\) at which the minimal cross-validated MSE is achieved.

1.1.5.1 A Demo of Implementation with B=1

library(glmnet) # for cv.glmnet()
set.seed(123)
Boot_matrix = matrix(0, nrow = 1, ncol = I + 1) # column 1 = intercept, columns 2:(I+1) = x1..x10
lasso_y = boot.samples_Y[1, ]
lasso_x = boot.samples_X[1, , ] # N x I matrix of bootstrapped covariates
colnames(lasso_x) = paste0("x", 1:I)
cvfit = cv.glmnet(lasso_x, lasso_y, type.measure = "mse", nfolds = 5)
cvfit$lambda.min
## [1] 0.1892632
coeffff = coef(cvfit, s = "lambda.min") # sparse column vector of fitted coefficients
index = coeffff@i + 1 # slot @i holds 0-based indices of the nonzero entries (0 = intercept)
Boot_matrix[1, index] = 1 # flag the intercept and the selected variables
Boot_matrix
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,]    1    1    0    0    0    0    0    0    0     0     0
coeffff@i
## [1] 0 1
coeffff@x
## [1] 0.06791294 1.61999907
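
Equivalently, the selected terms can be listed by name (using the column names assigned above):

rownames(coeffff)[coeffff@i + 1]
## [1] "(Intercept)" "x1"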

1.1.5.2 Detailed Implementation with B=100

set.seed(123)
Boot_matrix = matrix(0, nrow = B, ncol = I + 1)
for (i in 1:B) {
  lasso_y = boot.samples_Y[i, ]
  lasso_x = boot.samples_X[i, , ] # N x I matrix for the i-th bootstrap replicate
  cvfit = cv.glmnet(lasso_x, lasso_y, type.measure = "mse", nfolds = 5)
  coeffff = coef(cvfit, s = "lambda.min")
  index = coeffff@i + 1 # 0-based sparse indices -> 1-based columns
  Boot_matrix[i, index] = 1 # record the intercept and selected variables
}
dim(Boot_matrix)
## [1] 100  11
Boot_matrix_col_means <- colMeans(Boot_matrix) # selection proportions per column
Boot_matrix_col_sums <- colSums(Boot_matrix)   # selection counts over the B replicates
Boot_matrix_col_sums
##  [1] 100 100  22  25  26  22  19  24  23  24  25
# Rotate the matrix so image() displays the replicates along the x-axis
# (see https://stackoverflow.com/questions/16496210/rotate-a-matrix-in-r)
rotate <- function(x) t(apply(x, 2, rev))
Boot_matrix_rotate = rotate(Boot_matrix)
Boot_matrix_rotate_no_intercept = Boot_matrix_rotate[-1, ] # drop the intercept row
dim(Boot_matrix_rotate_no_intercept)
## [1]  10 100
image(Boot_matrix_rotate_no_intercept) # heatmap of variable selection across bootstrap replicates


1.2 Calculate the model confidence bounds and their corresponding coverage rate

1.2.1 Step 1: Rank variables by bootstrap selection frequency

feq = Boot_matrix_col_sums # selection counts, intercept first
feq_rev = feq[-1]          # drop the intercept
feq_rev = matrix(feq_rev, nrow = 1)
colnames(feq_rev) = paste("x", 1:I, sep = "")
feq_rev
feq_rev
##       x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
## [1,] 100 22 25 26 22 19 24 23 24  25
feq_rev_sort_1 = sort(feq_rev, decreasing = TRUE) # frequencies, largest first
feq_rev_sort_1
##  [1] 100  26  25  25  24  24  23  22  22  19
feq_rev_sort = order(feq_rev, decreasing = TRUE) # variable indices, most frequent first
feq_rev_sort
##  [1]  1  4  3 10  7  9  8  2  5  6
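
Since order() returns column positions rather than names, the ranking is easier to read as variable names:

colnames(feq_rev)[feq_rev_sort] # variable names, most frequently selected first
## [1] "x1"  "x4"  "x3"  "x10" "x7"  "x9"  "x8"  "x2"  "x5"  "x6"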

1.2.2 Step 2: Initialize the bound matrices

We initialize a coverage vector and two \((I+1)\times I\) indicator matrices that will hold the lower and upper bound models at each width:

freq <- numeric(I + 1) # best coverage found at each width w = 0, 1, ..., I
lower <- matrix(0, nrow = I + 1, ncol = I) # 0/1 membership of the lower bound model at each width
upper <- matrix(0, nrow = I + 1, ncol = I) # 0/1 membership of the upper bound model at each width

1.2.3 Step 3
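
A minimal sketch of this step, assuming it follows the MCB construction in Li et al. (2019) along the frequency ordering from Step 1: for each width \(w\) we consider nested pairs in which the lower bound model contains the \(k\) most frequently selected variables and the upper bound model contains the \(k+w\) most frequent, and we keep the pair that covers the largest fraction of bootstrap models, where a bootstrap model \(m\) is covered if lower \(\subseteq m \subseteq\) upper.

sel <- Boot_matrix[, -1] # B x I matrix of 0/1 selections, intercept dropped
ord <- feq_rev_sort      # variables ordered by selection frequency
for (w in 0:I) {
  for (k in 0:(I - w)) {
    low_vars <- ord[seq_len(k)]     # candidate lower bound model
    up_vars  <- ord[seq_len(k + w)] # candidate upper bound model
    covered <- apply(sel, 1, function(m)
      all(m[low_vars] == 1) && all(which(m == 1) %in% up_vars))
    if (mean(covered) > freq[w + 1]) { # keep the best pair at this width
      freq[w + 1] <- mean(covered)
      lower[w + 1, ] <- 0; lower[w + 1, low_vars] <- 1
      upper[w + 1, ] <- 0; upper[w + 1, up_vars] <- 1
    }
  }
}
round(freq, 2) # bootstrap coverage of the best nested pair at each width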

1.2.4 Step 4
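
Continuing the sketch under the same assumption: with the bounds from Step 3, we check for each width whether the true model \(\{x_1,\ldots,x_5\}\) is captured, i.e. whether lower \(\subseteq\) true model \(\subseteq\) upper. Averaging this indicator over many simulation replications (REPS > 1) would estimate the coverage rate of the MCB at each width.

true_model <- rep(0, I); true_model[I_X] <- 1 # 0/1 indicator of the true variables
captured <- sapply(0:I, function(w)
  all(lower[w + 1, ] <= true_model) && all(true_model <= upper[w + 1, ]))
data.frame(width = 0:I, boot_coverage = freq, captures_true_model = captured)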

1.3 Width-coverage trade-off
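
A natural way to examine the trade-off (still a sketch under the assumptions above) is to plot the bootstrap coverage of the best nested pair against its width: wider pairs cover more bootstrap models, so coverage is nondecreasing in \(w\), and one reports the narrowest pair whose coverage reaches the nominal level (e.g. 0.95).

plot(0:I, freq, type = "b", xlab = "Width w", ylab = "Bootstrap coverage", ylim = c(0, 1))
abline(h = 0.95, lty = 2) # nominal 95% level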

2 References


Chatterjee, Arindam, and Soumendra Nath Lahiri. 2011. “Bootstrapping Lasso Estimators.” Journal of the American Statistical Association 106 (494): 608–25.

Li, Yang, Yuetian Luo, Davide Ferrari, Xiaonan Hu, and Yichen Qin. 2019. “Model Confidence Bounds for Variable Selection.” Biometrics.