MAT 4376 Project 3 Presentation

Jimmy Huang

Agenda Items

  • Problem Statement

  • Aims of the Project

  • Introduction to the Dataset Selected

  • Overview of the Implementation

  • Results

  • Conclusion

Problem Statement

For the Lasso problem in Lagrangian form, we aim to

\[\text{minimize}_{\beta \in \mathbb{R}^{p}}\left\{ \frac{1}{2N}\|y-X \beta \|^{2}_{2} +\lambda \|\beta\|_{1}\right\}\]

Here \(\lambda\) is a tuning parameter; we often want to find the best \(\lambda\) possible for our model.
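
For context, glmnet solves this objective over a whole decreasing path of \(\lambda\) values in a single call. A minimal sketch, assuming a numeric design matrix X and response Y are already loaded:

library(glmnet)
# alpha = 1 selects the Lasso penalty; glmnet fits the entire lambda path
fit <- glmnet(X, Y, family = "gaussian", alpha = 1)
head(fit$lambda)   # the first few (largest) candidate lambdas on the path
coef(fit, s = 0.1) # coefficient vector at one chosen lambda, here 0.1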

Aims of the Project

  • Write a function that does the same thing as cv.glmnet from the package glmnet

  • Validate the outputs produced by my functions against those of cv.glmnet (a minimal reference call is sketched below)
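
For reference, this is the behaviour my function replicates; a minimal cv.glmnet call, again assuming X and Y are loaded:

library(glmnet)
# 10-fold cross-validated Lasso; lambda.min is the candidate with the
# smallest mean cross-validated error
cv_fit <- cv.glmnet(X, Y, family = "gaussian", alpha = 1, nfolds = 10)
cv_fit$lambda.min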

Introduction to the Dataset Selected

  • The dataset was downloaded from Kaggle (the link can be found on the Reference page)

  • The dataset has 507 observations and 25 variables: 24 numeric variables and 1 dummy variable

  • The dataset records the body dimensions and characteristics of the respondents, such as sex, age, weight, height, and knee diameter (in centimeters)

  • The dummy variable is sex

Introduction to the Dataset Selected

  • This dataset is split into two samples: one consists of female respondents only, the other of male respondents only

  • Why?

  • The female sample is used for development and validation; the male sample is used for testing

  • If my function works for the first sample, it should also work for the second. Since cv.glmnet involves randomness, it is crucial to validate the results with more confidence, so I also use a Monte Carlo method to compare my outputs against those produced by cv.glmnet (a sketch follows this list)

  • The dependent variable is the respondent's weight in kilograms: \(Y_{\text{Female}}\) is a \(260 \times 1\) column vector and \(Y_{\text{Male}}\) is a \(247 \times 1\) column vector

  • Denote the design matrix for the female sample by \(X_{\text{Female}}\), a \(260 \times 23\) matrix, and the design matrix for the male sample by \(X_{\text{Male}}\), a \(247 \times 23\) matrix
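
A minimal sketch of the Monte Carlo idea, assuming X1 and Y1 hold the female design matrix and response (the names used in the Results slides); the number of replications B is an assumption here:

set.seed(886)
B <- 100                   # number of Monte Carlo replications (assumed)
best_lam_mc <- numeric(B)
for (b in 1:B) {
  # each replication re-randomizes the folds, so lambda.min varies
  cv_fit <- cv.glmnet(X1, Y1, family = "gaussian", alpha = 1, nfolds = 10)
  best_lam_mc[b] <- cv_fit$lambda.min
}
mean(best_lam_mc)          # average chosen lambda across replications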

Introduction to the Dataset Selected

Table 1

Introduction to the Dataset Selected

https://rpubs.com/JamesW/1161691

Please click the link above to view the picture.

Cross Validation

\(1\). Divide the dataset \(D\) of size \(n\) into \(m\) disjoint sets \(D_{1}, \cdots , D_{m}\) of size \(n/m\) each.

\(2\). For each \(\lambda \in \Lambda\), evaluate the LASSO estimator \(\hat{\beta}^{-(h)}_{LASSO}(\lambda)\) based on the dataset \(D \setminus D_{h}\), for \(h=1,\cdots, m\).

Each \(D_{h}\) is treated as a test dataset, while \(D \setminus D_{h}\) is treated as a training dataset.

\(3\). Thus, for each \(\lambda \in \Lambda\) we get \(m\) LASSO estimators, making \(m \times q\) LASSO estimators in total, where \(q = |\Lambda|\) is the number of candidate values.

\(4\). Define the loss function

\[CV(\lambda)=\sum^{m}_{h=1} \sum_{i:(X_{i},Y_{i}) \in D_{h}} (Y_{i}-X^{T}_{i} \hat{\beta}^{-(h)}_{LASSO}(\lambda))^{2}\]

\(5\). Choose the \(\lambda\) that minimizes the loss function.

Implementation

library(glmnet)
library(dplyr) ## dplyr supplies the pipe and the data-manipulation verbs used below
My_cv_Lasso<-function(X,Y,nfolds){
  ##### Step 1
  n<-nrow(X)
  store<-data.frame(Y,X)%>%
    mutate(fold=sample(rep(1:nfolds,length.out=n)))
  #### Step 2
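  ## fit once on the full data to use glmnet's default lambda path as the candidate set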
  obj_glmnet<-glmnet(X,Y,family = "gaussian",
                     alpha=1)
  lambda_candidate<-obj_glmnet$lambda
  cv_lasso_error<-matrix(0,nrow=length(lambda_candidate),ncol=nfolds)
  for(i in seq_along(lambda_candidate)){
    lam1<-lambda_candidate[i]
    for(j in 1:nfolds){
      training_set<-store%>%
        filter(fold != j)
      test_set<-store%>%
        filter(fold == j)
      X_train<-training_set%>%
        ## MASS also exports select(); call dplyr::select() explicitly to avoid a clash
        dplyr::select(-Y,-fold)%>%
        as.matrix()
      Y_train<-training_set$Y
      X_test<-test_set%>%
        dplyr::select(-Y,-fold)%>%
        as.matrix()
      Y_test<-test_set$Y
      ###### Step 3
      lasso_mod<-glmnet(X_train,Y_train,
                        family = "gaussian",
                        alpha=1,lambda=lam1)
      ###### Step 4
      pred1<-predict(lasso_mod,X_test,s=lam1,type="response")
      cv_lasso_error[i,j]<-mean((Y_test-pred1)^2)
      }
  }
  ########Step 5
  mean_cv<-rowMeans(cv_lasso_error)
  smallest_error<-which.min(mean_cv) ## first index of the smallest mean CV error
  best_lam<-lambda_candidate[smallest_error]
  ########Done 
  return(list(candidate_for_lambda=lambda_candidate,
              mean_loss=mean_cv,
              best_lambda_picked=best_lam))
}
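
A sketch of a call on the female sample (X1 and Y1 as defined earlier); the fold assignment is random, so a seed makes the run reproducible:

set.seed(886)
res1 <- My_cv_Lasso(X1, Y1, nfolds = 10)
res1$best_lambda_picked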

Implementation

set.seed(886)
Variant_My_cv_Lasso<-function(X,Y,nfolds,mu){
  ##### Step 1
  n<-nrow(X)
  store<-data.frame(Y,X)%>%
    mutate(fold=sample(rep(1:nfolds,length.out=n)))
  #### Step 2
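  ## variant of Step 2: draw 100 candidate lambdas from an Exponential(mu) distribution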
  lambda_candidate<-rexp(100,mu)
  cv_lasso_error<-matrix(0,nrow=length(lambda_candidate),ncol=nfolds)
  for(i in seq_along(lambda_candidate)){
    lam1<-lambda_candidate[i]
    for(j in 1:nfolds){
      training_set<-store%>%
        filter(fold != j)
      test_set<-store%>%
        filter(fold == j)
      X_train<-training_set%>%
        ## MASS also exports select(); call dplyr::select() explicitly to avoid a clash
        dplyr::select(-Y,-fold)%>%
        as.matrix()
      Y_train<-training_set$Y
      X_test<-test_set%>%
        dplyr::select(-Y,-fold)%>%
        as.matrix()
      Y_test<-test_set$Y
      ###### Step 3
      lasso_mod<-glmnet(X_train,Y_train,
                        family = "gaussian",
                        alpha=1,lambda=lam1)
      ###### Step 4
      pred1<-predict(lasso_mod,X_test,s=lam1,type="response")
      cv_lasso_error[i,j]<-mean((Y_test-pred1)^2)
    }
  }
  ########Step 5
  mean_cv<-rowMeans(cv_lasso_error)
  smallest_error<-which.min(mean_cv) ## first index of the smallest mean CV error
  best_lam<-lambda_candidate[smallest_error]
  ########Done 
  return(list(candidate_for_lambda=lambda_candidate,
              mean_loss=mean_cv,
              best_lambda_picked=best_lam))
}
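
The only change from My_cv_Lasso is Step 2: instead of reusing the \(\lambda\) path computed by glmnet, the variant draws 100 candidates from an Exponential(\(\mu\)) distribution. That draw in isolation, with mu = 5 as in the sample run later:

set.seed(886)
lambda_candidate <- rexp(100, rate = 5) # 100 Exponential(5) draws
range(lambda_candidate)                 # spread of the random candidate grid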

Results

Range of \(\Lambda_{1} \wedge \Lambda_{2}\) in the Female Set

Results

Chosen \(\lambda\) Depending on \(m\)

Results

Monte Carlo Simulation for the Female Set with \(m=5,10,15\), using cv.glmnet

Results

Monte Carlo Simulation for the Female Set with \(m=5,10,15\), using My_cv_Lasso

Results

Monte Carlo Simulation for the Female Set with \(m=5,10,15\), using Variant_My_cv_Lasso

Results

Range of \(\Lambda_{3} \wedge \Lambda_{4}\) in the Testing Set

Results

Chosen \(\lambda\) Depending on \(m\)

Results

Monte Carlo Simulation for the Male Set with \(m=5,10,15\), using cv.glmnet

Results

Monte Carlo Simulation for the Male Set with \(m=5,10,15\), using My_cv_Lasso

Results

Monte Carlo Simulation for the Male Set with \(m=5,10,15\), using Variant_My_cv_Lasso

Results

> #mean(best_lam_glment_mc)
> #mean(best_lam_glment_mc_10)
> #mean(best_lam_glment_mc_15)
> #mean(best_lam_mc_my_cv_lasso_m5)
> #mean(best_lam_mc_my_cv_lasso_m10)
> #mean(best_lam_mc_my_cv_lasso_m15)
> #mean(best_lam_varitant_mc_my_cv_lasso_m5)
> #mean(best_lam_varitant_mc_my_cv_lasso_m10)
> #mean(best_lam_varitant_mc_my_cv_lasso_m15)
> S1<-c(0.0760686,0.08039263,0.07978809,0.0765162,
+       0.08057883,0.08102547,0.07735165,0.08033728,0.08015581)
> mod_lasso_female<-glmnet(X1,Y1,alpha=1,family="gaussian",lambda=S1)
> pred_female<-predict.glmnet(mod_lasso_female,newx=X1,s=S1,type="response")
> mse_female<-apply(pred_female,2, function(pred_female_col) mean((pred_female_col-Y1)^2))
> mse_female
      s1       s2       s3       s4       s5       s6       s7       s8 
3.114068 3.117036 3.116399 3.114357 3.117437 3.118108 3.114881 3.116824 
      s9 
3.116636 
> min(mse_female)
[1] 3.114068
> pred_female_s<-predict.glmnet(mod_lasso_female,newx=X1,s=S1,type="nonzero")
> pred_female_s
$s1
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s2
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s3
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s4
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s5
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s6
 [1]  2  4  7  8 10 11 12 14 15 17 18 19 22 23

$s7
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s8
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

$s9
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23

Results

> #mean(best_lam_glment_mc_male_m5)
> #mean(best_lam_glment_mc_male_m10)
> #mean(best_lam_glment_mc_male_m15)
> #mean(lam_mc_my_cv_lasso_male_m5)
> #mean(lam_mc_my_cv_lasso_male_m10)
> #mean(lam_mc_my_cv_lasso_male_m15)
> #mean(lammc_v_my_cv_lasso_male_m5)
> #mean(lammc_v_my_cv_lasso_male_m10)
> #mean(lammc_v_my_cv_lasso_male_m15)
> S2<-c(0.06802671,0.06917023,0.07191867,0.06902735,
+       0.0701646,0.07374311,0.06998219,0.07005154,0.07265663)
> mod_lasso_male<-glmnet(X2,Y2,alpha=1,family="gaussian",lambda=S2)
> pred_male<-predict.glmnet(mod_lasso_male,newx=X2,s=S2,type="response")
> mse_male<-apply(pred_male,2, function(pred_male_col) mean((pred_male_col-Y2)^2))
> mse_male
      s1       s2       s3       s4       s5       s6       s7       s8 
4.540997 4.541764 4.543278 4.541612 4.542379 4.543822 4.542234 4.542280 
      s9 
4.543511 
> min(mse_male)
[1] 4.540997
> pred_male_s<-predict.glmnet(mod_lasso_male,newx=X2,s=S2,type="nonzero")
> pred_male_s
   s1 s2 s3 s4 s5 s6 s7 s8 s9
1   2  2  2  2  2  2  2  2  2
2   4  4  4  4  4  4  4  4  4
3   5  5  5  5  5  5  5  5  5
4   6  6  6  6  6  6  6  6  6
5   7  7  7  7  7  7  7  7  7
6   8  8  8  8  8  8  8  8  8
7   9  9  9  9  9  9  9  9  9
8  10 10 10 10 10 10 10 10 10
9  11 11 11 11 11 11 11 11 11
10 12 12 12 12 12 12 12 12 12
11 13 13 13 13 13 13 13 13 13
12 14 14 14 14 14 14 14 14 14
13 15 15 15 15 15 15 15 15 15
14 16 16 16 16 16 16 16 16 16
15 17 17 17 17 17 17 17 17 17
16 18 18 18 18 18 18 18 18 18
17 19 19 19 19 19 19 19 19 19
18 22 22 22 22 22 22 22 22 22
19 23 23 23 23 23 23 23 23 23

Sample Run

> set.seed(886)
> res2<-Variant_My_cv_Lasso(X1,Y1,5,5)
> res2
$candidate_for_lambda
  [1] 0.012487123 0.352385823 0.232502957 0.009957949 0.079516527 0.194752352
  [7] 0.340769092 0.266880039 0.532727541 0.026055493 0.320029560 0.279370330
 [13] 0.013437844 0.046183559 0.127120746 0.021967883 0.071861608 0.643606116
 [19] 0.191531646 0.065656968 0.065922472 0.602106430 0.092080867 0.091706455
 [25] 0.053757905 0.119769080 0.316442357 0.228761920 0.520901318 0.308747648
 [31] 0.270080524 0.575462043 0.334027557 0.046376753 0.211811092 0.264450000
 [37] 0.053003845 0.256836649 0.074206675 0.033404985 0.133465785 0.177114280
 [43] 0.167761748 0.025464315 0.882541958 0.005598880 0.862175975 0.329636912
 [49] 0.080211182 0.156022957 0.035986217 0.050723594 0.149527735 0.013479271
 [55] 0.238313868 0.135953414 0.509925953 0.915590206 0.008865030 0.055820811
 [61] 0.002430579 0.276009405 0.280531697 0.117340148 0.153570089 0.181194891
 [67] 0.082919934 0.356803690 0.019943266 0.157916053 0.596500337 0.310484051
 [73] 0.358506566 0.039563498 0.293180220 0.091518236 0.120260399 0.568856753
 [79] 0.555350788 0.361090035 0.581617962 0.162488700 0.030624983 0.100611279
 [85] 0.034662556 0.090623805 0.110446260 0.476125166 0.066091100 0.227380392
 [91] 0.313598560 0.208455373 0.091880510 0.564742577 0.114848789 0.019729175
 [97] 0.018121741 0.075361771 0.095357318 0.053773372

$mean_loss
  [1] 4.326719 4.363514 4.243704 4.351243 4.068727 4.188051 4.349598 4.276576
  [9] 4.655676 4.213217 4.328811 4.289910 4.318009 4.118218 4.090955 4.242635
 [17] 4.075828 4.916212 4.182904 4.075151 4.075051 4.811073 4.069328 4.069280
 [25] 4.082273 4.085336 4.325474 4.234394 4.629235 4.318122 4.279497 4.748098
 [33] 4.342493 4.115495 4.215000 4.274092 4.082477 4.266118 4.075015 4.166005
 [41] 4.098110 4.158170 4.144667 4.217370 5.673885 4.393299 5.598849 4.338047
 [49] 4.068708 4.128075 4.156527 4.095877 4.119152 4.317631 4.249079 4.101141
 [57] 4.607759 5.800193 4.361797 4.079682 4.427682 4.286694 4.291051 4.083548
 [65] 4.124766 4.165121 4.068681 4.368172 4.258867 4.130544 4.797565 4.319682
 [73] 4.369993 4.141330 4.303859 4.069257 4.085837 4.733147 4.702570 4.372781
 [81] 4.761939 4.136805 4.182365 4.075355 4.160067 4.069152 4.078239 4.545854
 [89] 4.074988 4.233277 4.322496 4.210204 4.069302 4.723837 4.082141 4.259009
 [97] 4.270222 4.074724 4.069788 4.082248

$best_lambda_picked
[1] 0.08291993

> plot(log(res2$candidate_for_lambda), res2$mean_loss, xlab="Log Lambda",
+      ylab="Mean Squared Error")
> abline(v=log(res2$best_lambda_picked),col="red")

Sample Run

[Plots: Mean Squared Error versus Log Lambda, with the chosen \(\lambda\) marked in red]

Elapsed Time

Average Elapsed Time of the 3 Functions over 10 Iterations (m = 10)

Set          cv.glmnet   My_cv_Lasso   Variant_My_cv_Lasso
Female Set   0.0318 s    3.6987 s      4.954 s
Male Set     0.0327 s    3.715 s       5.091 s
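
A sketch of how such timings might be collected with base R's system.time(); this loop shape is an assumption, not the exact benchmarking code:

# average elapsed time of My_cv_Lasso over 10 iterations with m = 10 folds
times <- replicate(10, system.time(My_cv_Lasso(X1, Y1, nfolds = 10))["elapsed"])
mean(times)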

Conclusion

  • There is no single “optimal” or “best” \(\lambda\)

  • The Monte Carlo method shows that My_cv_Lasso and Variant_My_cv_Lasso can work at least as effectively as cv.glmnet from glmnet

  • My_cv_Lasso and Variant_My_cv_Lasso take significantly longer to run than cv.glmnet from glmnet

Reference

The dataset can be found at: https://www.kaggle.com/datasets/mexwell/body-measurements/data