Group Lasso Sebagai ilustrasi, kita akan menggunakan data bardet, yang merupakan ekspresi gen (20 gen dari 120 sampel). Peubah respon (y) adalah level expression dari gen TRIM32, yang menyebabkan sindrom Bardet-Biedl. Data diperoleh dari Scheetz et al. (2006).
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(readr)
bardet<-read_csv("https://raw.githubusercontent.com/raoy/data/master/bardet2")
## Rows: 120 Columns: 101
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (101): Y, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(bardet)
## [1] 120 101
names(bardet)
## [1] "Y" "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9"
## [11] "X10" "X11" "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19"
## [21] "X20" "X21" "X22" "X23" "X24" "X25" "X26" "X27" "X28" "X29"
## [31] "X30" "X31" "X32" "X33" "X34" "X35" "X36" "X37" "X38" "X39"
## [41] "X40" "X41" "X42" "X43" "X44" "X45" "X46" "X47" "X48" "X49"
## [51] "X50" "X51" "X52" "X53" "X54" "X55" "X56" "X57" "X58" "X59"
## [61] "X60" "X61" "X62" "X63" "X64" "X65" "X66" "X67" "X68" "X69"
## [71] "X70" "X71" "X72" "X73" "X74" "X75" "X76" "X77" "X78" "X79"
## [81] "X80" "X81" "X82" "X83" "X84" "X85" "X86" "X87" "X88" "X89"
## [91] "X90" "X91" "X92" "X93" "X94" "X95" "X96" "X97" "X98" "X99"
## [101] "X100"
Terdapat beberapa package yang dapat digunakan untuk melakukan pemodelan dengan regularisasi group lasso, di antaranya adalah grpreg, grplasso, oem. Selanjutnya package yang akan digunakan adalah package grpreg. Sebelum melakukan pemodelan, perlu ditentukan terlebih dahulu pengelompokkan peubah penjelas yang akan digunakan.
library(grpreg)
## Warning: package 'grpreg' was built under R version 4.0.5
##
## Attaching package: 'grpreg'
## The following object is masked from 'package:dplyr':
##
## select
group1 <- rep(1:20,each=5)
group1
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
## [26] 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
## [51] 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14 14 14 14 15 15 15 15 15
## [76] 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18 19 19 19 19 19 20 20 20 20 20
Setelah itu, kita dapat menggunakan fungsi grpreg untuk melakukan pemodelan dengan pendekatan group lasso.
fit <- grpreg(bardet[,-1], bardet$Y,group1,penalty="grLasso")
plot(fit)
Warna yang berbeda menunjukkan group yang berbeda pula. Untuk mengetahui lebih spesifik nilai dari penduga koefisien, dapat digunakan fungsi coef(). Misalnya kita mengetahui nilai penduga koefisien model pada saat lambda=0.03.
coef(fit, lambda=0.03)
## (Intercept) X1 X2 X3 X4
## 8.1438890477 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X5 X6 X7 X8 X9
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X10 X11 X12 X13 X14
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X15 X16 X17 X18 X19
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X20 X21 X22 X23 X24
## 0.0000000000 -0.0184176366 -0.0096718123 -0.0707926448 -0.0066209284
## X25 X26 X27 X28 X29
## -0.1819705554 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X30 X31 X32 X33 X34
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X35 X36 X37 X38 X39
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X40 X41 X42 X43 X44
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X45 X46 X47 X48 X49
## 0.0000000000 0.0203444929 0.0742885085 0.0709400111 0.0848818199
## X50 X51 X52 X53 X54
## 0.0718342645 0.1102619926 0.0686899478 0.1064836267 0.1173297411
## X55 X56 X57 X58 X59
## 0.1002532155 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X60 X61 X62 X63 X64
## 0.0000000000 0.1392457931 0.0998493873 0.1169540317 0.1252844903
## X65 X66 X67 X68 X69
## 0.1640337431 0.0004270203 0.0015217266 0.0015146601 0.0018883848
## X70 X71 X72 X73 X74
## 0.0014696660 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X75 X76 X77 X78 X79
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X80 X81 X82 X83 X84
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X85 X86 X87 X88 X89
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X90 X91 X92 X93 X94
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X95 X96 X97 X98 X99
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## X100
## 0.0000000000
Perhatikan pada output terlihat cukup banyak peubah yang memiliki koefisien nol. Cobalah jika Anda periksa seperti apa koefisien model jika lambda bernilai 0.01?
Umumnya, kita dapat menggunakan cross validation untuk menentukan parameter yang optimal.
cvfit<-cv.grpreg(bardet[,-1], bardet$Y,group1,penalty="grLasso")
plot(cvfit)
summary(cvfit)
## grLasso-penalized linear regression with n=120, p=100
## At minimum cross-validation error (lambda=0.0240):
## -------------------------------------------------
## Nonzero coefficients: 25
## Nonzero groups: 5
## Cross-validation error of 0.01
## Maximum R-squared: 0.46
## Maximum signal-to-noise ratio: 0.85
## Scale estimate (sigma) at lambda.min: 0.106
Koefisien pada saat lambda meminimumkan cross-validation error dapat diperoleh dengan fungsi coef().
coef(cvfit)
## (Intercept) X1 X2 X3 X4 X5
## 8.043557412 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X6 X7 X8 X9 X10 X11
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X12 X13 X14 X15 X16 X17
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X18 X19 X20 X21 X22 X23
## 0.000000000 0.000000000 0.000000000 -0.024166978 -0.010352290 -0.084388661
## X24 X25 X26 X27 X28 X29
## -0.006204463 -0.205217473 0.000000000 0.000000000 0.000000000 0.000000000
## X30 X31 X32 X33 X34 X35
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X36 X37 X38 X39 X40 X41
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X42 X43 X44 X45 X46 X47
## 0.000000000 0.000000000 0.000000000 0.000000000 0.015735033 0.082949160
## X48 X49 X50 X51 X52 X53
## 0.078169273 0.093569871 0.078191246 0.147278291 0.098771133 0.150803837
## X54 X55 X56 X57 X58 X59
## 0.169911397 0.139581969 0.000000000 0.000000000 0.000000000 0.000000000
## X60 X61 X62 X63 X64 X65
## 0.000000000 0.200317829 0.146975542 0.169494367 0.182047090 0.244694890
## X66 X67 X68 X69 X70 X71
## 0.002009770 0.009406503 0.009266257 0.011716089 0.008815365 0.000000000
## X72 X73 X74 X75 X76 X77
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X78 X79 X80 X81 X82 X83
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X84 X85 X86 X87 X88 X89
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X90 X91 X92 X93 X94 X95
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## X96 X97 X98 X99 X100
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
Perhatikan pula pada output di atas bahwa koefisien yang tidak nol dan yang nol sesuai dengan group yang telah ditentukan sebelumnya. Artinya, jika koefisien suatu peubah bernilai nol, maka ini akan berlaku untuk satu group peubah penjelas.
cvfit$lambda.min
## [1] 0.02403177
Berikut ini adalah ilustrasi untuk melakukan prediksi menggunakan model yang telah dibangun sebelumnya.
predict(fit, as.matrix(bardet[1,-1]), type="response", lambda=cvfit$lambda.min)
## 0.024
## 8.384639
Fused Lasso Terdapat beberapa package yang dapat digunakan untuk pemodelan dengan fused lasso, beberapa diantaranya adalah package genlasso, lqa, HDPenReg, extlasso, dan SQOPT. Pada ilustrasi yang digunakan saat ini akan digunakan package extlasso.
#install.packages("extlasso")
library(extlasso)
Sebagai ilustrasi, akan digunakan data simulasi berukuran 50 x 100.
sim<-read_csv("https://raw.githubusercontent.com/raoy/data/master/sim.csv")
## Rows: 50 Columns: 101
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (101): y, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(sim)
## [1] 50 101
Pemodelan menggunakan fused lasso dapat dilakukan dengan fungsi fusedlasso().
f1<-fusedlasso(as.matrix(sim[,-1]), as.matrix(sim[,1]),lambda1=0.1,lambda2=1)
coef(f1)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.9985542 0.8809084 1.052734 1.018349 1.340504 1.128273 1.028252 1.053717
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 1.020136 0.9938192 1.838774 1.89653 1.91239 1.873694 1.900375 0.8999032
## [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 0.8472078 0.914119 0.7739948 0.8010527 0.01276438 0.01495297 0.0128031
## [,24] [,25] [,26] [,27] [,28] [,29]
## [1,] 0.01436333 2.616725 5.498628e-05 -8.717723e-06 -3.033097e-05 -3.57002e-05
## [,30] [,31] [,32] [,33] [,34]
## [1,] -4.405684e-05 -0.0001165966 -0.0001547554 -8.052292e-05 0.0001297334
## [,35] [,36] [,37] [,38] [,39] [,40]
## [1,] 0.0001252225 0.0003142408 0.0003471909 0.03020907 0.03345018 0.02796283
## [,41] [,42] [,43] [,44] [,45] [,46] [,47]
## [1,] 0.9044475 0.9147917 0.9792206 1.002253 0.93152 -0.00480494 -0.005441846
## [,48] [,49] [,50] [,51] [,52] [,53]
## [1,] -0.03072133 -0.04793111 -0.05139274 -0.05694777 -0.05450936 0.006558022
## [,54] [,55] [,56] [,57] [,58] [,59]
## [1,] 0.007373171 0.007489036 0.007784385 0.05857675 0.04817141 0.05135564
## [,60] [,61] [,62] [,63] [,64] [,65]
## [1,] 0.05345925 0.0506941 0.05016291 0.05703892 0.05720315 -0.03571555
## [,66] [,67] [,68] [,69] [,70]
## [1,] -0.03005153 -0.03141345 -0.03292674 -0.03787794 -0.0006769726
## [,71] [,72] [,73] [,74] [,75]
## [1,] -0.0002915209 -0.0003195786 -0.0001994306 -0.0001009414 -2.526325e-05
## [,76] [,77] [,78] [,79] [,80]
## [1,] -2.637759e-05 -2.032202e-05 -0.0001147803 -0.0005119542 -0.0007495951
## [,81] [,82] [,83] [,84] [,85]
## [1,] -0.0005618497 -0.0005195027 -0.0004253216 -0.0003266014 -0.0002515777
## [,86] [,87] [,88] [,89] [,90]
## [1,] -0.0002531176 -0.0003191778 -0.0003493597 -0.0003497417 -0.0003128372
## [,91] [,92] [,93] [,94] [,95]
## [1,] -0.0001741252 -4.546216e-05 0.0001593625 0.0001221346 0.0002796896
## [,96] [,97] [,98] [,99] [,100]
## [1,] 0.02680598 0.03347207 0.0249627 0.0277366 0.03978241
## attr(,"scaled:scale")
## X1 X2 X3 X4 X5 X6 X7 X8
## 6.939902 7.866348 6.581477 6.803221 5.167865 6.140369 6.737816 6.575499
## X9 X10 X11 X12 X13 X14 X15 X16
## 6.793027 6.973019 7.477864 7.274640 7.214235 6.415380 6.325191 6.414886
## X17 X18 X19 X20 X21 X22 X23 X24
## 6.813532 6.314449 6.494148 6.273874 7.672459 6.523397 7.620340 6.789309
## X25 X26 X27 X28 X29 X30 X31 X32
## 8.161741 7.834052 6.530196 6.736080 7.299615 7.362834 6.807923 6.827046
## X33 X34 X35 X36 X37 X38 X39 X40
## 6.390859 6.473684 6.889220 7.322770 8.037910 7.068715 6.402240 7.704442
## X41 X42 X43 X44 X45 X46 X47 X48
## 6.218902 6.607825 6.894281 6.735905 7.246899 7.656877 6.835065 6.335380
## X49 X50 X51 X52 X53 X54 X55 X56
## 7.665685 7.150558 6.440146 6.716975 6.822126 6.399245 6.710955 6.603881
## X57 X58 X59 X60 X61 X62 X63 X64
## 5.857500 7.171870 6.753479 6.486864 6.847070 6.916320 6.084934 6.059492
## X65 X66 X67 X68 X69 X70 X71 X72
## 6.717182 7.996874 7.637940 7.240682 6.264713 5.661580 8.033317 6.724174
## X73 X74 X75 X76 X77 X78 X79 X80
## 7.334116 6.455271 6.946565 7.253525 6.664404 6.011731 6.884126 5.599166
## X81 X82 X83 X84 X85 X86 X87 X88
## 7.521062 6.636285 6.991143 6.718125 7.692717 7.080031 6.270202 7.223226
## X89 X90 X91 X92 X93 X94 X95 X96
## 7.100539 8.695310 7.671459 7.179124 7.224380 7.051190 7.455627 7.737688
## X97 X98 X99 X100
## 6.195473 8.273287 7.441367 5.179346
Penjelasan selengkapnya tentang pendekatan fused lasso dapat diperoleh pada Tibshirani et al.(2005).
Elastic Net
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.0.5
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-3
Data yang digunakan sebagai ilustrasi adalah data hiv yang dapat diakses pada link ini.
load("hiv.rda")
class(hiv.train) # The data are stored as a list
## [1] "list"
names(hiv.train) # The names of the list elements are x and y
## [1] "x" "y"
dim(hiv.train$x) # The explanatory data consists of 704 observations of 208 binary mutation variables
## [1] 704 208
Fitting model menggunakan elastic net dapat dilakukan dengan fungsi glmnet(). Berikut adalah contoh pemodelan elastic net dengan α=0.2, ini dapat pula ditentukan berdasarkan validasi silang.
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #Elastic Net penalty
plot(fit)
Berikut ini adalah contoh syntax untuk melakukan korelasi silang dengan fungsi cv.glmnet() pada berbagai nilai α.
for (i in 0:10) {
assign(paste("fit", i, sep=""), cv.glmnet(hiv.train$x,hiv.train$y, type.measure="mse", alpha=i/10))
}
Selanjutnya dapat dihitung Mean Squared Error (MSE) dari prediksi menggunakan data uji, seperti berikut ini.
yhat0 <- predict(fit0, s=fit0$lambda.1se, newx=hiv.test$x)
yhat1 <- predict(fit1, s=fit1$lambda.1se, newx=hiv.test$x)
yhat2 <- predict(fit2, s=fit2$lambda.1se, newx=hiv.test$x)
yhat3 <- predict(fit3, s=fit3$lambda.1se, newx=hiv.test$x)
yhat4 <- predict(fit4, s=fit4$lambda.1se, newx=hiv.test$x)
yhat5 <- predict(fit5, s=fit5$lambda.1se, newx=hiv.test$x)
yhat6 <- predict(fit6, s=fit6$lambda.1se, newx=hiv.test$x)
yhat7 <- predict(fit7, s=fit7$lambda.1se, newx=hiv.test$x)
yhat8 <- predict(fit8, s=fit8$lambda.1se, newx=hiv.test$x)
yhat9 <- predict(fit9, s=fit9$lambda.1se, newx=hiv.test$x)
yhat10 <- predict(fit10, s=fit10$lambda.1se, newx=hiv.test$x)
mse0 <- mean((hiv.test$y - yhat0)^2)
mse1 <- mean((hiv.test$y - yhat1)^2)
mse2 <- mean((hiv.test$y - yhat2)^2)
mse3 <- mean((hiv.test$y - yhat3)^2)
mse4 <- mean((hiv.test$y - yhat4)^2)
mse5 <- mean((hiv.test$y - yhat5)^2)
mse6 <- mean((hiv.test$y - yhat6)^2)
mse7 <- mean((hiv.test$y - yhat7)^2)
mse8 <- mean((hiv.test$y - yhat8)^2)
mse9 <- mean((hiv.test$y - yhat9)^2)
mse10 <- mean((hiv.test$y - yhat10)^2)
alpha<-seq(0,1,by=0.1)
mse<-c(mse0,mse1,mse2,mse3,mse4,mse5,mse6,mse7,mse8,mse9,mse10)
cbind(alpha,mse)
## alpha mse
## [1,] 0.0 0.10400213
## [2,] 0.1 0.07696483
## [3,] 0.2 0.07333605
## [4,] 0.3 0.07048255
## [5,] 0.4 0.06953684
## [6,] 0.5 0.06832147
## [7,] 0.6 0.06864500
## [8,] 0.7 0.06896511
## [9,] 0.8 0.06951287
## [10,] 0.9 0.06930162
## [11,] 1.0 0.06781460
Ternyata MSE terkecil diperoleh ketika digunakan nilai α=1. Artinya, berdasarkan kriteria tersebut, pendekatan Lasso lebih baik dibandingkan dengan elastic net dan ridge.
Prediksi dapat dilakukan dengan predict(). Berikut ini juga diperlihatkan plot antara hasil cross validation dan prediksi.
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.9) #Elastic Net penalty
cv.fit <- cv.glmnet(hiv.train$x,hiv.train$y, alpha=0.9) #10-fold cross-validation
plot(cv.fit)
pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)
lines(log(fit$lambda), mean.test.error, col="blue",pch="*")
legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"),pch="*",
col=c("red","blue"), cex=0.75)