To illustrate the model confidence bounds, we adopt the simulation model below as in Li et al. (2019) for Gaussian distributed responses. \[ y_i=\sum_{j=1}^{p^{\ast}}\theta_{j}x_{ij}+\sum_{j=p^{\ast}+1}^{p}0\times x_{ij}+\epsilon_{i} \]
where \(p^{\ast}\) is the number of true variables and \(p\) is the number of candidate variables. The bootstrap for Lasso estimators used below follows Chatterjee and Lahiri (2011).
We simulate the random errors as \(\epsilon_{i}\sim N(0,\sigma^2)\) and the covariates as \(x_{i}\sim N_{p}(0,\Sigma)\), where \(\Sigma=[\Sigma_{ij}]_{p\times p}\) and \(\Sigma_{ij}=\rho^{|i-j|}\).
We consider the following scenario: \(p=10\), \(p^{\ast}=5\). That is, there are 10 candidate covariates with correlation parameter \(\rho=0.5\), 5 of which have nonzero coefficients, and we simulate 300 subjects.
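Before the step-by-step code below, the whole data-generating process can be summarized in a few lines. This is only an illustrative sketch (the object names X_sim and y_sim are not used later) and assumes the MASS package for mvrnorm.
library(MASS)  # for mvrnorm
p <- 10; p_star <- 5; n <- 300; rho <- 0.5; sigma <- 1
Sigma <- rho^abs(outer(1:p, 1:p, "-"))              # Sigma[i,j] = rho^|i-j|
X_sim <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)  # covariates
y_sim <- X_sim[, 1:p_star] %*% rep(1, p_star) + rnorm(n, 0, sigma)  # theta_j = 1 for the true variables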
# Data preprocessing
REPS = 1   # number of simulations
I = 10     # number of total covariates
N = 300    # number of subjects
I0 = 5     # number of significant covariates
rho = 0.5  # correlation between covariates: 0, 0.5, or 0.99
The covariates are generated as \(\mathbf{x}\sim N_{p}(0,\Sigma)\), where \(\Sigma=[\Sigma_{ij}]_{p\times p}\) and \(\Sigma_{ij}=\rho^{|i-j|}\).
library(MASS)    # for mvrnorm()
library(glmnet)  # for cv.glmnet()
set.seed(123)
# For Covariates #
mu_X <- matrix(0, I, 1)   # mean of covariate vectors
Sig_X <- matrix(1, I, I)  # covariance of covariate vectors
for (i in 1:(I - 1)) {    # note 1:(I-1), not 1:I-1, which evaluates to 0:(I-1)
  for (j in (i + 1):I) {
    Sig_X[i, j] <- rho^(j - i)
    Sig_X[j, i] <- Sig_X[i, j]
  }
}
Sig_X
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.000000000 0.50000000 0.2500000 0.125000 0.06250 0.03125 0.015625
## [2,] 0.500000000 1.00000000 0.5000000 0.250000 0.12500 0.06250 0.031250
## [3,] 0.250000000 0.50000000 1.0000000 0.500000 0.25000 0.12500 0.062500
## [4,] 0.125000000 0.25000000 0.5000000 1.000000 0.50000 0.25000 0.125000
## [5,] 0.062500000 0.12500000 0.2500000 0.500000 1.00000 0.50000 0.250000
## [6,] 0.031250000 0.06250000 0.1250000 0.250000 0.50000 1.00000 0.500000
## [7,] 0.015625000 0.03125000 0.0625000 0.125000 0.25000 0.50000 1.000000
## [8,] 0.007812500 0.01562500 0.0312500 0.062500 0.12500 0.25000 0.500000
## [9,] 0.003906250 0.00781250 0.0156250 0.031250 0.06250 0.12500 0.250000
## [10,] 0.001953125 0.00390625 0.0078125 0.015625 0.03125 0.06250 0.125000
## [,8] [,9] [,10]
## [1,] 0.0078125 0.00390625 0.001953125
## [2,] 0.0156250 0.00781250 0.003906250
## [3,] 0.0312500 0.01562500 0.007812500
## [4,] 0.0625000 0.03125000 0.015625000
## [5,] 0.1250000 0.06250000 0.031250000
## [6,] 0.2500000 0.12500000 0.062500000
## [7,] 0.5000000 0.25000000 0.125000000
## [8,] 1.0000000 0.50000000 0.250000000
## [9,] 0.5000000 1.00000000 0.500000000
## [10,] 0.2500000 0.50000000 1.000000000
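The double loop above can also be replaced by a vectorized construction; the following one-liner (an alternative sketch, not part of the original script) builds the same matrix.
Sig_X_alt <- rho^abs(outer(1:I, 1:I, "-"))  # Sigma[i,j] = rho^|i-j| in one step
all.equal(Sig_X, Sig_X_alt)                 # expected to be TRUE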
# X <- matrix(mvrnorm(n = 1, mu_X, Sig_X), N, I)
# mvrnorm: if n = 1, it returns a vector of the same length as mu; otherwise an n-by-length(mu) matrix with one sample in each row.
X <- mvrnorm(n = N, mu = mu_X, Sigma = Sig_X)
X_scaled <- scale(X, center = TRUE, scale = TRUE)
# check that we get mean of 0 and sd of 1 for each column
colMeans(X_scaled)  # faster version of apply(X_scaled, 2, mean)
## [1] -5.452814e-18 1.210259e-17 -7.939251e-18 -1.231943e-17 -1.088250e-17
## [6] 1.513257e-17 -3.162979e-18 -2.532696e-18 2.612494e-17 9.442678e-18
apply(X_scaled, 2, sd)
## [1] 1 1 1 1 1 1 1 1 1 1
head(X_scaled,5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.60940464 -0.085021999 -0.3792668 -0.51549844 0.2593259
## [2,] 0.06681852 -0.445116573 -0.3001291 -0.01724257 -0.4376752
## [3,] -0.27526113 -1.829945458 -1.4877424 -1.34426248 -0.4111843
## [4,] -2.71900689 -1.123919510 -0.1521078 0.26564921 0.7490104
## [5,] -0.55198900 0.006873067 0.6130698 -1.08909907 0.1058743
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.63316431 1.583601014 0.64284158 1.0406042 0.66825134
## [2,] 0.06023995 1.219893999 0.86942523 0.5215203 -0.30608721
## [3,] -1.15699877 -0.521990363 -1.09982958 0.6420850 -0.16802682
## [4,] 0.56342109 -0.001640709 0.02853053 0.5622696 0.01783038
## [5,] 0.07326517 -0.801830038 -0.45856790 1.1033287 1.08735480
set.seed(123)
sigma <- 1
eps <- rnorm(N, mean = 0, sd = sigma)  # random errors
beta <- runif(I0, min = 1, max = 1)    # min = max = 1, so all true coefficients equal 1
beta
## [1] 1 1 1 1 1
set.seed(123)
# I_X <- sample(I, I0, replace = FALSE)  # randomly chosen significant covariates (not used)
I_X <- c(1, 2, 3, 4, 5)  # indices of the significant covariates
Y_full <- X_scaled[, I_X] %*% beta + eps
DF = data.frame("Y" = Y_full, "X" = X_scaled)
head(DF,5)
## Y X.1 X.2 X.3 X.4 X.5
## 1 0.3284676 1.60940464 -0.085021999 -0.3792668 -0.51549844 0.2593259
## 2 -1.3635224 0.06681852 -0.445116573 -0.3001291 -0.01724257 -0.4376752
## 3 -3.7896874 -0.27526113 -1.829945458 -1.4877424 -1.34426248 -0.4111843
## 4 -2.9098662 -2.71900689 -1.123919510 -0.1521078 0.26564921 0.7490104
## 5 -0.7859832 -0.55198900 0.006873067 0.6130698 -1.08909907 0.1058743
## X.6 X.7 X.8 X.9 X.10
## 1 -0.63316431 1.583601014 0.64284158 1.0406042 0.66825134
## 2 0.06023995 1.219893999 0.86942523 0.5215203 -0.30608721
## 3 -1.15699877 -0.521990363 -1.09982958 0.6420850 -0.16802682
## 4 0.56342109 -0.001640709 0.02853053 0.5622696 0.01783038
## 5 0.07326517 -0.801830038 -0.45856790 1.1033287 1.08735480
mean(Y_full)
## [1] 0.03444141
#hist(Y_full)
In this step, we treat the simulated responses as a numerical sample and repeat the bootstrap B = 100 times. We sample with replacement (replace = TRUE) to fill all B*N entries of the bootstrap matrix for Y.
set.seed(123)
B = 100
boot.samples_Y = matrix(sample(Y_full, size = B * N, replace = TRUE), B, N)
boot.mean_Y = apply(boot.samples_Y, 1, mean)
head(boot.mean_Y,20)
## [1] -0.01693792 -0.11295660 0.18361847 -0.22014834 0.17989428
## [6] 0.07326172 -0.33881065 0.18714089 0.16366041 0.34790616
## [11] 0.01373613 -0.07618314 -0.01724815 -0.12365707 0.26581383
## [16] -0.11363561 0.05694913 0.01760151 -0.11932825 0.38920455
#set.seed(123)
#require(ggplot2)
#ggplot(data.frame(meanY = boot.mean_Y),aes(x=meanY)) +
#geom_histogram(binwidth=0.25,aes(y=..density..)) +
#geom_density(color="red")
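Since the ggplot2 call above is commented out, a base-R sketch of the same plot (histogram of the bootstrap means with an overlaid density) would be:
hist(boot.mean_Y, breaks = 20, freq = FALSE,
     main = "Bootstrap means of Y", xlab = "mean(Y)")
lines(density(boot.mean_Y), col = "red")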
We construct a 95% confidence interval for the bootstrap mean of Y.
set.seed(123)
mean(Y_full) + c(-1, 1) * 2 * sd(boot.mean_Y)
## [1] -0.2999688 0.3688516
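The interval above is a normal-style approximation (sample mean plus or minus 2 bootstrap standard errors). As a check, a percentile-type interval can be read directly off the bootstrap means:
quantile(boot.mean_Y, probs = c(0.025, 0.975))  # 2.5% and 97.5% bootstrap percentiles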
Next, we bootstrap each covariate \(\mathbf{x}_{i}\), \(i=1,2,\cdots,10\). We sample with replacement (replace = TRUE) to fill all B*N entries of the bootstrap array for \(\mathbf{X}\).
set.seed(123)
boot.samples_X<- array(0, dim = c(B, N, I))
for (i in 1:I){
boot.samples_X[,,i]= matrix(sample(X_scaled[,i], size = B * N, replace = TRUE),B, N)
}
dim(boot.samples_X[,,1])
## [1] 100 300
Alternatively, we can bootstrap a single covariate, \(\mathbf{x}_{1}\).
set.seed(123)
boot.samples_X1= matrix(sample(X_scaled[,1], size = B * N, replace = TRUE),B, N)
dim(boot.samples_X1)
## [1] 100 300
We follow the steps below to fit each bootstrap sample: the tuning parameter \(\lambda\) is chosen by 5-fold cross-validation based on the mean squared error criterion, and we use “lambda.min”, the value of \(\lambda\) at which the minimal cross-validated MSE is achieved.
set.seed(123)
Boot_matrix = matrix(0, nrow = 1, ncol = I + 1)  # selection indicators: intercept + I covariates
# first bootstrap sample: response vector and N x I covariate matrix
lasso_y = as.matrix(boot.samples_Y[1, ], N, 1)
lasso_x_df = boot.samples_X[1, , ]
colnames(lasso_x_df) = paste0("lasso_x", 1:I)
cvfit = cv.glmnet(lasso_x_df, lasso_y, type.measure = "mse", nfolds = 5)
cvfit$lambda.min
## [1] 0.1892632
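Besides lambda.min, cv.glmnet also reports lambda.1se, the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum; using it typically yields a sparser model. For comparison only (not part of the selection procedure here):
cvfit$lambda.1se               # largest lambda within 1 SE of the minimum CV error
coef(cvfit, s = "lambda.1se")  # usually a sparser coefficient vector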
coeffff = coef(cvfit, s = "lambda.min")  # sparse coefficient vector (intercept + 10 covariates)
index = coeffff@i   # 0-based row indices of the nonzero entries (intercept is row 0)
index = index + 1   # shift to 1-based indexing
Boot_matrix[1, ][index] = 1  # flag the selected terms
Boot_matrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 1 1 0 0 0 0 0 0 0 0 0
coeffff@i
## [1] 0 1
coeffff@x
## [1] 0.06791294 1.61999907
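The @i slot of the sparse coefficient matrix stores the 0-based row indices of the nonzero entries (the intercept is row 0), which is why 1 is added before flagging columns of Boot_matrix. An equivalent but more transparent extraction, shown only as an illustrative alternative, is:
coef_vec <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]  # named vector, intercept first
which(coef_vec != 0)                                       # 1-based positions of the selected terms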
set.seed(123)
Boot_matrix = matrix(0, nrow = B, ncol = I + 1)  # one indicator row per bootstrap sample
for (i in 1:B) {
  # i-th bootstrap sample: response vector and N x I covariate matrix
  lasso_y = as.matrix(boot.samples_Y[i, ], N, 1)
  lasso_x_df = boot.samples_X[i, , ]
  colnames(lasso_x_df) = paste0("lasso_x", 1:I)
  cvfit = cv.glmnet(lasso_x_df, lasso_y, type.measure = "mse", nfolds = 5)
  coeffff = coef(cvfit, s = "lambda.min")
  index = coeffff@i + 1        # 1-based positions of the nonzero coefficients
  Boot_matrix[i, ][index] = 1  # record which terms were selected
}
dim(Boot_matrix)
## [1] 100 11
Boot_matrix_col_means <- colMeans(Boot_matrix)      # selection proportions (counts / B)
Boot_matrix_col_sums <- apply(Boot_matrix, 2, sum)  # selection counts out of B bootstrap samples
Boot_matrix_col_sums
## [1] 100 100 22 25 26 22 19 24 23 24 25
# image(Boot_matrix)
# Boot_matrix_rev_row_a <- apply(Boot_matrix, 2, rev) # rev() & apply() functions combined
# image(Boot_matrix_rev_row_a)
# https://stackoverflow.com/questions/16496210/rotate-a-matrix-in-r
rotate <- function(x) t(apply(x, 2, rev))  # rotate a matrix 90 degrees clockwise
Boot_matrix_rotate = rotate(Boot_matrix)
Boot_matrix_rotate_no_intercept = Boot_matrix_rotate[-1, ]  # drop the intercept row
dim(Boot_matrix_rotate_no_intercept)
## [1] 10 100
image(Boot_matrix_rotate_no_intercept)
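The default image() call has no axis labels. Note that rotate() is a 90-degree clockwise rotation, so the bootstrap replicates run in reverse order along the vertical axis; a labelled version (illustrative only) is:
image(1:I, 1:B, Boot_matrix_rotate_no_intercept,
      xlab = "covariate index", ylab = "bootstrap replicate (reversed order)",
      main = "Covariates selected in each bootstrap sample")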
feq = Boot_matrix_col_sums
feq_rev = feq[-1]                # drop the intercept count
feq_rev = t(as.matrix(feq_rev))  # 1 x I matrix of selection counts
colnames(feq_rev) = paste("x", 1:I, sep = "")
feq_rev
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
## [1,] 100 22 25 26 22 19 24 23 24 25
feq_rev_sort_1 = sort(feq_rev, decreasing = TRUE)  # selection counts, largest first
feq_rev_sort_1
## [1] 100 26 25 25 24 24 23 22 22 19
feq_rev_sort = order(feq_rev, decreasing = TRUE)  # covariate indices ordered by selection count
feq_rev_sort
## [1] 1 4 3 10 7 9 8 2 5 6
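The commented-out lowerset/upperset lines that follow hint at building nested candidate sets from this ordering. A minimal sketch of one such construction, assuming the candidate sets simply add covariates in decreasing selection-frequency order (illustrative only, not necessarily the bound-model search of Li et al. 2019), is:
# k-th element keeps the k most frequently selected covariates (k = 0, ..., I)
nested_sets <- lapply(0:I, function(k) feq_rev_sort[seq_len(k)])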
#lowerset=feq_rev_sort[1,]
#upperset=feq_rev_sort[1:2,]
freq <- vector(length = I + 1)  # placeholder for selection frequencies (logical FALSE by default)
freq
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
freq[1] <- 0  # initialize the first entry; this coerces the logical placeholder to numeric
lower <- matrix(0, nrow = (I + 1), ncol = I)  # placeholder, presumably for the lower bound models (one row per model size)
lower
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 0 0 0
## [11,] 0 0 0 0 0 0 0 0 0 0
upper <- matrix(0, nrow = (I + 1), ncol = I)  # placeholder, presumably for the upper bound models (one row per model size)
upper
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 0 0 0
## [11,] 0 0 0 0 0 0 0 0 0 0
Chatterjee, Arindam, and Soumendra Nath Lahiri. 2011. “Bootstrapping Lasso Estimators.” Journal of the American Statistical Association 106 (494): 608–25.
Li, Yang, Yuetian Luo, Davide Ferrari, Xiaonan Hu, and Yichen Qin. 2019. “Model Confidence Bounds for Variable Selection.” Biometrics.