We use glmnet package in R to run LASSO regressions with different values of \(\lambda\).
fitcv = cv.glmnet(X,y, family="gaussian")
plot(fitcv)
The plot below shows that as \(\lambda\) gets larger, more coefficients shrink to zero, thus increasing the bias of the model as shown by an increasing mean-squared error. Each \(\lambda\) in the plot corresponds to a regression. For each regression We calculate a Prediction Error defined as:
\(\left \| X \widetilde{\beta} -X\beta^0\right \|_2^2/\left \| X\beta^0 \right \|_2^2\),
or in equivalent matrix form:
\((\widetilde{\beta}^TX^TX\widetilde{\beta}-2\widetilde{\beta}^TX^TX\beta^0 + \beta^{0T}X^TX\beta^0)/ \beta^{0T}X^TX\beta^0\)
where \(\widetilde{\beta}\) are the estimated coefficients and \(\beta^0\) are real coefficients.
The table and plot below show then number of variables kept and corresponding prediction error.
## Number of Non-zero Coefficients Lambda Prediction Error
## [1,] 0 1.2278578 1.0000000
## [2,] 2 1.1187783 0.9530738
## [3,] 3 1.0193891 0.8976484
## [4,] 6 0.9288294 0.8124570
## [5,] 8 0.8463148 0.6978552
## [6,] 8 0.7711305 0.5778719
## [7,] 8 0.7026254 0.4784579
## [8,] 8 0.6402061 0.3961034
## [9,] 8 0.5833319 0.3278958
## [10,] 8 0.5315103 0.2714188
gurobiThe formulation is equivalent to:
\(min\ \beta^TX^TX\beta - 2y^TX\beta\)
\(s.t.\)
\((1)\ \sum_iz_i \leq k\)
\((2)\ \beta_i \leq Mz_i,\ i=1,...,p\)
\((3)\ \beta_i \geq -Mz_i,\ i=1,...,p\)
\((4)\ z_i\ are\ binary\)
The function below will solve for optimal \(\beta s\) with varying values of \(M\). When \(M\) is small, constraints \((2)\) and \((3)\) above could be binding when \(z_i \neq 0\). As \(M\) continues to increase to a point when \((2)\) and \((3)\) are no longer binding constraints for non-zero \(\beta s\), the values of \(\beta s\) should no longer change with the increase of \(M\).
gurobi_m <- function(m){
model <- list()
model$Q = cbind(rbind(XtX,matrix(0,p,p)),matrix(0,2*p,p))
model$obj = append(YtX, rep(0,p))
model$vtype = append(rep('C',p), rep('B',p))
model$sense = c('<=',rep('>=',p),rep('>=',p))
M = m
k = 8
cons1 = append(rep(0,p), rep(1,p))
cons2 = cbind(diag(1,p), diag(M,p))
cons3 = cbind(diag(-1,p), diag(M,p))
model$A = rbind(cons1,cons2,cons3)
model$rhs = c(k, rep(0,p), rep(0,p))
return(model)
}
We choose \(M\) values from \(0.5,1,2,4,8\)
m_values <- c(0.5,1,2,4,8)
gurobi_table <- cbind(m_values, rep(0,5))
beta_list = rep(NULL,8)
for(i in 1:length(m_values)){
m = m_values[i]
result <- gurobi(gurobi_m(m))
beta_new = result$x[(1):(p)]
beta_non_zero = beta_new[beta_new != 0]
beta_list = cbind(beta_list,beta_non_zero)
error = (t(beta_new) %*% XtX %*% beta_new + t(beta_real) %*% XtX %*% beta_real -2 * t(X%*%beta_new) %*% (X%*%beta_real)) / (t(beta_real) %*% XtX %*% beta_real )
gurobi_table[i,2] <- error
}
## Warning for adding variables: zero or small (< 1e-13) coefficients, ignored
## Optimize a model with 129 rows, 128 columns and 320 nonzeros
## Model has 2080 quadratic objective terms
## Coefficient statistics:
## Matrix range [5e-01, 1e+00]
## Objective range [5e+00, 1e+03]
## Bounds range [1e+00, 1e+00]
## RHS range [8e+00, 8e+00]
## Found heuristic solution: objective 0
## Presolve removed 64 rows and 0 columns
## Presolve time: 0.00s
## Presolved: 65 rows, 128 columns, 192 nonzeros
## Presolved model has 2080 quadratic objective terms
## Variable types: 64 continuous, 64 integer (64 binary)
##
## Root relaxation: objective -3.080563e+03, 149 iterations, 0.00 seconds
##
## Nodes | Current Node | Objective Bounds | Work
## Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
##
## 0 0 -3080.5630 0 4 0.00000 -3080.5630 - - 0s
## H 0 0 -3080.233051 -3080.5630 0.01% - 0s
## 0 0 cutoff 0 -3080.2331 -3080.2331 0.00% - 0s
##
## Explored 0 nodes (152 simplex iterations) in 0.00 seconds
## Thread count was 8 (of 8 available processors)
##
## Optimal solution found (tolerance 1.00e-04)
## Best objective -3.080233051211e+03, best bound -3.080233051211e+03, gap 0.0%
## Warning for adding variables: zero or small (< 1e-13) coefficients, ignored
## Optimize a model with 129 rows, 128 columns and 320 nonzeros
## Model has 2080 quadratic objective terms
## Coefficient statistics:
## Matrix range [1e+00, 1e+00]
## Objective range [5e+00, 1e+03]
## Bounds range [1e+00, 1e+00]
## RHS range [8e+00, 8e+00]
## Found heuristic solution: objective 0
## Presolve removed 64 rows and 0 columns
## Presolve time: 0.00s
## Presolved: 65 rows, 128 columns, 192 nonzeros
## Presolved model has 2080 quadratic objective terms
## Variable types: 64 continuous, 64 integer (64 binary)
##
## Root relaxation: objective -4.178648e+03, 130 iterations, 0.00 seconds
##
## Nodes | Current Node | Objective Bounds | Work
## Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
##
## 0 0 -4178.6483 0 22 0.00000 -4178.6483 - - 0s
## H 0 0 -4157.513179 -4178.6483 0.51% - 0s
## 0 0 -4157.5192 0 2 -4157.5132 -4157.5192 0.00% - 0s
##
## Explored 0 nodes (178 simplex iterations) in 0.02 seconds
## Thread count was 8 (of 8 available processors)
##
## Optimal solution found (tolerance 1.00e-04)
## Best objective -4.157513178863e+03, best bound -4.157519237282e+03, gap 0.0001%
## Warning for adding variables: zero or small (< 1e-13) coefficients, ignored
## Optimize a model with 129 rows, 128 columns and 320 nonzeros
## Model has 2080 quadratic objective terms
## Coefficient statistics:
## Matrix range [1e+00, 2e+00]
## Objective range [5e+00, 1e+03]
## Bounds range [1e+00, 1e+00]
## RHS range [8e+00, 8e+00]
## Found heuristic solution: objective 0
## Presolve removed 64 rows and 0 columns
## Presolve time: 0.00s
## Presolved: 65 rows, 128 columns, 192 nonzeros
## Presolved model has 2080 quadratic objective terms
## Variable types: 64 continuous, 64 integer (64 binary)
##
## Root relaxation: objective -4.195961e+03, 91 iterations, 0.00 seconds
##
## Nodes | Current Node | Objective Bounds | Work
## Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
##
## 0 0 -4195.9608 0 29 0.00000 -4195.9608 - - 0s
## H 0 0 -2840.152545 -4195.9608 47.7% - 0s
## H 0 0 -4169.663033 -4195.9608 0.63% - 0s
## 0 0 cutoff 0 -4169.6630 -4169.6630 0.00% - 0s
##
## Explored 0 nodes (91 simplex iterations) in 0.02 seconds
## Thread count was 8 (of 8 available processors)
##
## Optimal solution found (tolerance 1.00e-04)
## Best objective -4.169663033237e+03, best bound -4.169663033237e+03, gap 0.0%
## Warning for adding variables: zero or small (< 1e-13) coefficients, ignored
## Optimize a model with 129 rows, 128 columns and 320 nonzeros
## Model has 2080 quadratic objective terms
## Coefficient statistics:
## Matrix range [1e+00, 4e+00]
## Objective range [5e+00, 1e+03]
## Bounds range [1e+00, 1e+00]
## RHS range [8e+00, 8e+00]
## Found heuristic solution: objective 0
## Presolve removed 64 rows and 0 columns
## Presolve time: 0.02s
## Presolved: 65 rows, 128 columns, 192 nonzeros
## Presolved model has 2080 quadratic objective terms
## Variable types: 64 continuous, 64 integer (64 binary)
##
## Root relaxation: objective -4.195961e+03, 90 iterations, 0.00 seconds
##
## Nodes | Current Node | Objective Bounds | Work
## Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
##
## 0 0 -4195.9608 0 29 0.00000 -4195.9608 - - 0s
## H 0 0 -2204.698698 -4195.9608 90.3% - 0s
## H 0 0 -4169.663033 -4195.9608 0.63% - 0s
## 0 0 cutoff 0 -4169.6630 -4169.6630 0.00% - 0s
##
## Explored 0 nodes (90 simplex iterations) in 0.03 seconds
## Thread count was 8 (of 8 available processors)
##
## Optimal solution found (tolerance 1.00e-04)
## Best objective -4.169663033237e+03, best bound -4.169663033237e+03, gap 0.0%
## Warning for adding variables: zero or small (< 1e-13) coefficients, ignored
## Optimize a model with 129 rows, 128 columns and 320 nonzeros
## Model has 2080 quadratic objective terms
## Coefficient statistics:
## Matrix range [1e+00, 8e+00]
## Objective range [5e+00, 1e+03]
## Bounds range [1e+00, 1e+00]
## RHS range [8e+00, 8e+00]
## Found heuristic solution: objective 0
## Presolve removed 64 rows and 0 columns
## Presolve time: 0.02s
## Presolved: 65 rows, 128 columns, 192 nonzeros
## Presolved model has 2080 quadratic objective terms
## Variable types: 64 continuous, 64 integer (64 binary)
##
## Root relaxation: objective -4.195961e+03, 90 iterations, 0.00 seconds
##
## Nodes | Current Node | Objective Bounds | Work
## Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
##
## 0 0 -4195.9608 0 29 0.00000 -4195.9608 - - 0s
## H 0 0 -2204.698698 -4195.9608 90.3% - 0s
## H 0 0 -4169.663033 -4195.9608 0.63% - 0s
## 0 0 cutoff 0 -4169.6630 -4169.6630 0.00% - 0s
##
## Explored 0 nodes (90 simplex iterations) in 0.04 seconds
## Thread count was 8 (of 8 available processors)
##
## Optimal solution found (tolerance 1.00e-04)
## Best objective -4.169663033237e+03, best bound -4.169663033237e+03, gap 0.0%
print (beta_new)
## [1] 0.8930573 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [8] 0.0000000 1.0892370 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [15] 0.0000000 0.0000000 0.9921031 0.0000000 0.0000000 0.0000000 0.0000000
## [22] 0.0000000 0.0000000 0.0000000 1.1158345 0.0000000 0.0000000 0.0000000
## [29] 0.0000000 0.0000000 0.0000000 0.0000000 0.9797358 0.0000000 0.0000000
## [36] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0029896 0.0000000
## [43] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0201952
## [50] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [57] 1.0389175 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [64] 0.0000000
The outputs below show the prediction error for the two approaches. For MIQP, prediction error converges to \(0.004456\) as \(M\) gets larger. For LASSO, the smallest \(\lambda\) giving us 8 variables yields a prediction error of \(0.024466\), about 5 times the prediction error than that of MIQP. LASSO regression is regularized by a penalty parameter \(\lambda\) which brings in statistical bias to the model, so the prediction is much off comparing to that from MIQP method.
## m_values Prediction Error
## [1,] 0.5 0.250000000
## [2,] 1.0 0.001431754
## [3,] 2.0 0.004456055
## [4,] 4.0 0.004456055
## [5,] 8.0 0.004456055
## Number of Non-zero Coefficients Lambda Prediction Error
## [1,] 8 0.8463148 0.69785520
## [2,] 8 0.7711305 0.57787191
## [3,] 8 0.7026254 0.47845794
## [4,] 8 0.6402061 0.39610338
## [5,] 8 0.5833319 0.32789580
## [6,] 8 0.5315103 0.27141881
## [7,] 8 0.4842924 0.22466720
## [8,] 8 0.4412692 0.18597773
## [9,] 8 0.4020681 0.15397053
## [10,] 8 0.3663495 0.12750096
## [11,] 8 0.3338040 0.10561967
## [12,] 8 0.3041498 0.08753928
## [13,] 8 0.2771299 0.07260683
## [14,] 8 0.2525105 0.06028085
## [15,] 8 0.2300781 0.05011268
## [16,] 8 0.2096386 0.04172997
## [17,] 8 0.1910149 0.03482441
## [18,] 8 0.1740457 0.02914033
## [19,] 8 0.1585839 0.02446620
We can see that the best LASSO regression (in terms of smallest prediction error) with 8 non-zero coefficients picks out the same set of variables as MIQP method does. The estimates of the two methods do not converge. Since LASSO introduces bias by the penalty parameter \(\lambda\), the estimates generated by MIQP method should be more accurate.
## MIQP Beta Estimates
## [1,] "Beta 1" "0.893057321191464"
## [2,] "Beta 9" "1.08923704401365"
## [3,] "Beta 17" "0.992103052242995"
## [4,] "Beta 25" "1.11583445462063"
## [5,] "Beta 33" "0.979735809388095"
## [6,] "Beta 41" "1.00298956064304"
## [7,] "Beta 49" "1.02019522338848"
## [8,] "Beta 57" "1.03891751790867"
## LASSO Beta Estimates
## [1,] "Beta 1" "0.749627302917313"
## [2,] "Beta 9" "0.910242153018034"
## [3,] "Beta 17" "0.812497028012497"
## [4,] "Beta 25" "0.97058930405387"
## [5,] "Beta 33" "0.801172093676408"
## [6,] "Beta 41" "0.832688277865887"
## [7,] "Beta 49" "0.886554127861054"
## [8,] "Beta 57" "0.88196233303159"