MAT 4376 Project 3 Presentation
Problem Statement
In the Lasso problem in Lagrangian form, we aim to:
\[\text{minimize}_{\beta \in \mathbf{R}^{p}}\{ \frac{1}{2N}||y-X \beta ||^{2}_{2} +\lambda ||\beta||_{1}\}\]
\(\lambda \ge 0\) is a tuning parameter; we want to find the best possible value of \(\lambda\) for our model
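As a brief illustration of the role of \(\lambda\) (a sketch on simulated data, not the project data), glmnet fits the Lasso over a whole grid of \(\lambda\) values, and larger values of \(\lambda\) shrink more coefficients to exactly zero:
library(glmnet)

set.seed(1)
X_sim <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)   # 100 observations, 10 predictors
y_sim <- X_sim[, 1] - 2 * X_sim[, 2] + rnorm(100)         # only 2 truly active predictors

fit <- glmnet(X_sim, y_sim, family = "gaussian", alpha = 1)  # alpha = 1 gives the Lasso
coef(fit, s = 0.05)   # smaller penalty: fewer coefficients shrunk to zero
coef(fit, s = 0.5)    # larger penalty: more coefficients shrunk to exactly zero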
Introduction to the Dataset Selected
The dataset was downloaded from Kaggle (the link can be found on the Reference page)
The dataset has 507 observations and 25 variables: 24 scalar variables and 1 dummy variable
The dataset tracks the body dimensions and characteristics of the respondents, such as sex, age, weight, height, knee diameter in centimetres, and so on
The dummy variable is sex
Introduction to the Dataset Selected
This dataset is split into two samples: one consists of female respondents only, the other of male respondents only
Why?
The female sample is used for development and validation; the male sample is used for testing
If my function works for the first sample, then it should also work for the second sample. Because cv.glmnet involves randomness, it is crucial to validate the results with more confidence, so I also use a Monte Carlo method to compare my outputs with the outputs produced by cv.glmnet
The dependent variable is the respondent's weight in kilograms: \(Y_{\text{Female}}\) is a \(260 \times 1\) column vector and \(Y_{\text{Male}}\) is a \(247 \times 1\) column vector
Denote the design matrix for the female sample as \(X_{\text{Female}}\), a \(260 \times 23\) matrix, and the design matrix for the male sample as \(X_{\text{Male}}\), a \(247 \times 23\) matrix (a sketch of how the two samples can be formed follows)
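A minimal sketch of how the two samples might be formed, assuming the raw data sit in a data frame called body with a weight column (in kg) and a sex dummy coded 0 = female, 1 = male; the actual column names and coding in the Kaggle file may differ:
library(dplyr)

# Hypothetical column names and coding; adjust to the actual Kaggle file.
# body <- read.csv("body_measurements.csv")
female <- body %>% filter(sex == 0)
male   <- body %>% filter(sex == 1)

Y1 <- female$weight                                            # Y_Female: 260 x 1
X1 <- female %>% dplyr::select(-weight, -sex) %>% as.matrix()  # X_Female: 260 x 23
Y2 <- male$weight                                              # Y_Male: 247 x 1
X2 <- male %>% dplyr::select(-weight, -sex) %>% as.matrix()    # X_Male: 247 x 23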
Introduction to the Dataset Selected
Table 1
Cross Validation
1. Divide the dataset \(D\) of size \(n\) into \(m\) disjoint sets \(D_{1}, \cdots, D_{m}\) of size \(n/m\) each.
2. For each \(\lambda \in \Lambda\), evaluate \(\hat{\beta}^{-(h)}_{LASSO}(\lambda)\), the LASSO estimator based on the dataset \(D \setminus D_{h}\), for \(h = 1, \cdots, m\).
Each \(D_{h}\) is treated as a test dataset, while \(D \setminus D_{h}\) is treated as a training dataset.
3. Thus, for each \(\lambda \in \Lambda\) we get \(m\) LASSO estimators, making \(m \times q\) LASSO estimators in total, where \(q = |\Lambda|\).
4. Define the loss function
\[CV(\lambda)=\sum^{m}_{h=1} \sum_{i:(X_{i},Y_{i}) \in D_{h}} \left(Y_{i}-X^{T}_{i} \hat{\beta}^{-(h)}_{LASSO}(\lambda)\right)^{2}\]
5. Choose the \(\lambda \in \Lambda\) that minimizes the loss function.
(In the implementation below, the inner sum is replaced by the mean squared error within each fold and the fold errors are then averaged; since the folds have nearly equal size, this rescaling does not change which \(\lambda\) is selected.)
Implementation
library(glmnet)
library(dplyr)   # needed for mutate(), filter() and the pipe %>%

My_cv_Lasso <- function(X, Y, nfolds) {
  ## Step 1: randomly assign each observation to one of the nfolds folds
  n <- nrow(X)
  store <- data.frame(Y, X) %>%
    mutate(fold = sample(rep(1:nfolds, length.out = n)))

  ## Step 2: take the candidate lambda sequence from a full glmnet fit
  obj_glmnet <- glmnet(X, Y, family = "gaussian", alpha = 1)
  lambda_candidate <- obj_glmnet$lambda
  cv_lasso_error <- matrix(0, nrow = length(lambda_candidate), ncol = nfolds)

  for (i in seq_along(lambda_candidate)) {
    lam1 <- lambda_candidate[i]
    for (j in 1:nfolds) {
      training_set <- store %>% filter(fold != j)
      test_set     <- store %>% filter(fold == j)
      ## use dplyr::select explicitly so it is not masked by MASS::select
      X_train <- training_set %>% dplyr::select(-Y, -fold) %>% as.matrix()
      Y_train <- training_set$Y
      X_test  <- test_set %>% dplyr::select(-Y, -fold) %>% as.matrix()
      Y_test  <- test_set$Y

      ## Step 3: fit the LASSO on the training folds at this lambda
      lasso_mod <- glmnet(X_train, Y_train, family = "gaussian",
                          alpha = 1, lambda = lam1)

      ## Step 4: record the mean squared prediction error on the held-out fold
      pred1 <- predict(lasso_mod, newx = X_test, s = lam1, type = "response")
      cv_lasso_error[i, j] <- mean((Y_test - pred1)^2)
    }
  }

  ## Step 5: average over folds and pick the lambda with the smallest error
  mean_cv <- rowMeans(cv_lasso_error)
  smallest_error <- which(mean_cv == min(mean_cv))   # may return several indices if tied
  best_lam <- lambda_candidate[smallest_error]

  return(list(candidate_for_lambda = lambda_candidate,
              mean_loss = mean_cv,
              best_lambda_picked = best_lam))
}
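A minimal usage sketch, assuming X1 and Y1 are the female design matrix and response defined earlier; because the folds are assigned at random, the selected lambda changes from run to run:
set.seed(886)
res1 <- My_cv_Lasso(X1, Y1, nfolds = 10)
res1$best_lambda_picked                     # lambda with the smallest mean CV error
# For comparison, glmnet's built-in cross-validation:
cv_ref <- cv.glmnet(X1, Y1, alpha = 1, family = "gaussian", nfolds = 10)
cv_ref$lambda.min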
Implementation
set.seed(886)

Variant_My_cv_Lasso <- function(X, Y, nfolds, mu) {
  ## Step 1: randomly assign each observation to one of the nfolds folds
  n <- nrow(X)
  store <- data.frame(Y, X) %>%
    mutate(fold = sample(rep(1:nfolds, length.out = n)))

  ## Step 2: draw 100 candidate lambdas from an Exponential(rate = mu) distribution
  lambda_candidate <- rexp(100, mu)
  cv_lasso_error <- matrix(0, nrow = length(lambda_candidate), ncol = nfolds)

  for (i in seq_along(lambda_candidate)) {
    lam1 <- lambda_candidate[i]
    for (j in 1:nfolds) {
      training_set <- store %>% filter(fold != j)
      test_set     <- store %>% filter(fold == j)
      ## use dplyr::select explicitly so it is not masked by MASS::select
      X_train <- training_set %>% dplyr::select(-Y, -fold) %>% as.matrix()
      Y_train <- training_set$Y
      X_test  <- test_set %>% dplyr::select(-Y, -fold) %>% as.matrix()
      Y_test  <- test_set$Y

      ## Step 3: fit the LASSO on the training folds at this lambda
      lasso_mod <- glmnet(X_train, Y_train, family = "gaussian",
                          alpha = 1, lambda = lam1)

      ## Step 4: record the mean squared prediction error on the held-out fold
      pred1 <- predict(lasso_mod, newx = X_test, s = lam1, type = "response")
      cv_lasso_error[i, j] <- mean((Y_test - pred1)^2)
    }
  }

  ## Step 5: average over folds and pick the lambda with the smallest error
  mean_cv <- rowMeans(cv_lasso_error)
  smallest_error <- which(mean_cv == min(mean_cv))
  best_lam <- lambda_candidate[smallest_error]

  return(list(candidate_for_lambda = lambda_candidate,
              mean_loss = mean_cv,
              best_lambda_picked = best_lam))
}
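The only difference from My_cv_Lasso is the candidate grid: instead of glmnet's data-driven lambda sequence, 100 values are drawn from an Exponential distribution with rate mu (so the candidates have mean 1/mu). A quick way to compare the two grids, as a sketch that reuses X1 and Y1 from the female sample and mu = 5 as in the sample run later on:
set.seed(886)
grid_glmnet <- glmnet(X1, Y1, family = "gaussian", alpha = 1)$lambda  # data-driven sequence
grid_exp    <- rexp(100, rate = 5)                                    # random Exponential grid, mu = 5
range(grid_glmnet)   # smallest and largest lambda in glmnet's sequence
range(grid_exp)      # smallest and largest randomly drawn lambda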
Result
Range of \(\Lambda_{1}\) and \(\Lambda_{2}\) in the Female Set
Result
Chosen \(\lambda\) Depending on \(m\)
Result
Monte Carlo Simulation for the Female Set with \(m=5,10,15\), using cv.glmnet
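A sketch of the kind of Monte Carlo experiment behind these Result slides (illustrative variable names; assuming X1, Y1 as before and 100 repetitions, a number not restated in this extract): each repetition re-runs cv.glmnet with a fresh random fold split and records the selected lambda.min, and the recorded values are then summarized.
n_rep <- 100                              # assumed number of Monte Carlo repetitions
best_lam_cvglmnet <- numeric(n_rep)
for (r in 1:n_rep) {
  cv_r <- cv.glmnet(X1, Y1, alpha = 1, family = "gaussian", nfolds = 5)   # m = 5 folds
  best_lam_cvglmnet[r] <- cv_r$lambda.min
}
mean(best_lam_cvglmnet)   # average selected lambda over the repetitions
hist(best_lam_cvglmnet, main = "Selected lambda over Monte Carlo repetitions")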
Result
Monte Carlo Simulation for the Female Set with \(m=5,10,15\), using My_cv_Lasso
Result
Monte Carlo Simulation for the Female Set with \(m=5,10,15\), using Variant_My_cv_Lasso
Result
Range of \(\Lambda_{3}\) and \(\Lambda_{4}\) in the Testing Set
Result
Chosen \(\lambda\) Depending on \(m\)
Result
Monte Carlo Simulation for the Male Set with \(m=5,10,15\), using cv.glmnet
Result
Monte Carlo Simulation for the Male Set with \(m=5,10,15\), using My_cv_Lasso
Result
Monte Carlo Simulation for the Male Set with \(m=5,10,15\), using Variant_My_cv_Lasso
Result
> #mean(best_lam_glment_mc)
> #mean(best_lam_glment_mc_10)
> #mean(best_lam_glment_mc_15)
> #mean(best_lam_mc_my_cv_lasso_m5)
> #mean(best_lam_mc_my_cv_lasso_m10)
> #mean(best_lam_mc_my_cv_lasso_m15)
> #mean(best_lam_varitant_mc_my_cv_lasso_m5)
> #mean(best_lam_varitant_mc_my_cv_lasso_m10)
> #mean(best_lam_varitant_mc_my_cv_lasso_m15)
> S1 <- c(0.0760686, 0.08039263, 0.07978809, 0.0765162,
+         0.08057883, 0.08102547, 0.07735165, 0.08033728, 0.08015581)
> mod_lasso_female <- glmnet(X1, Y1, alpha = 1, family = "gaussian", lambda = S1)
> pred_female <- predict.glmnet(mod_lasso_female, newx = X1, s = S1, type = "response")
> mse_female <- apply(pred_female, 2, function(pred_female_col) mean((pred_female_col - Y1)^2))
> mse_female
s1 s2 s3 s4 s5 s6 s7 s8
3.114068 3.117036 3.116399 3.114357 3.117437 3.118108 3.114881 3.116824
s9
3.116636
> min(mse_female)
[1] 3.114068
> pred_female_s <- predict.glmnet(mod_lasso_female, newx = X1, s = S1, type = "nonzero")
> pred_female_s
$s1
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s2
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s3
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s4
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s5
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s6
 [1]  2  4  7  8 10 11 12 14 15 17 18 19 22 23
$s7
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s8
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
$s9
 [1]  2  4  6  7  8 10 11 12 14 15 17 18 19 22 23
Result
> #mean(best_lam_glment_mc_male_m5)
> #mean(best_lam_glment_mc_male_m10)
> #mean(best_lam_glment_mc_male_m15)
> #mean(lam_mc_my_cv_lasso_male_m5)
> #mean(lam_mc_my_cv_lasso_male_m10)
> #mean(lam_mc_my_cv_lasso_male_m15)
> #mean(lammc_v_my_cv_lasso_male_m5)
> #mean(lammc_v_my_cv_lasso_male_m10)
> #mean(lammc_v_my_cv_lasso_male_m15)
> S2 <- c(0.06802671, 0.06917023, 0.07191867, 0.06902735,
+         0.0701646, 0.07374311, 0.06998219, 0.07005154, 0.07265663)
> mod_lasso_male <- glmnet(X2, Y2, alpha = 1, family = "gaussian", lambda = S2)
> pred_male <- predict.glmnet(mod_lasso_male, newx = X2, s = S2, type = "response")
> mse_male <- apply(pred_male, 2, function(pred_male_col) mean((pred_male_col - Y2)^2))
> mse_male
s1 s2 s3 s4 s5 s6 s7 s8
4.540997 4.541764 4.543278 4.541612 4.542379 4.543822 4.542234 4.542280
s9
4.543511
> min(mse_male)
[1] 4.540997
> pred_male_s <- predict.glmnet(mod_lasso_male, newx = X2, s = S2, type = "nonzero")
> pred_male_s
s1 s2 s3 s4 s5 s6 s7 s8 s9
1 2 2 2 2 2 2 2 2 2
2 4 4 4 4 4 4 4 4 4
3 5 5 5 5 5 5 5 5 5
4 6 6 6 6 6 6 6 6 6
5 7 7 7 7 7 7 7 7 7
6 8 8 8 8 8 8 8 8 8
7 9 9 9 9 9 9 9 9 9
8 10 10 10 10 10 10 10 10 10
9 11 11 11 11 11 11 11 11 11
10 12 12 12 12 12 12 12 12 12
11 13 13 13 13 13 13 13 13 13
12 14 14 14 14 14 14 14 14 14
13 15 15 15 15 15 15 15 15 15
14 16 16 16 16 16 16 16 16 16
15 17 17 17 17 17 17 17 17 17
16 18 18 18 18 18 18 18 18 18
17 19 19 19 19 19 19 19 19 19
18 22 22 22 22 22 22 22 22 22
19 23 23 23 23 23 23 23 23 23
Sample Run
> set.seed(886)
> res2 <- Variant_My_cv_Lasso(X1, Y1, 5, 5)
> res2
$candidate_for_lambda
[1 ] 0.012487123 0.352385823 0.232502957 0.009957949 0.079516527 0.194752352
[7 ] 0.340769092 0.266880039 0.532727541 0.026055493 0.320029560 0.279370330
[13 ] 0.013437844 0.046183559 0.127120746 0.021967883 0.071861608 0.643606116
[19 ] 0.191531646 0.065656968 0.065922472 0.602106430 0.092080867 0.091706455
[25 ] 0.053757905 0.119769080 0.316442357 0.228761920 0.520901318 0.308747648
[31 ] 0.270080524 0.575462043 0.334027557 0.046376753 0.211811092 0.264450000
[37 ] 0.053003845 0.256836649 0.074206675 0.033404985 0.133465785 0.177114280
[43 ] 0.167761748 0.025464315 0.882541958 0.005598880 0.862175975 0.329636912
[49 ] 0.080211182 0.156022957 0.035986217 0.050723594 0.149527735 0.013479271
[55 ] 0.238313868 0.135953414 0.509925953 0.915590206 0.008865030 0.055820811
[61 ] 0.002430579 0.276009405 0.280531697 0.117340148 0.153570089 0.181194891
[67 ] 0.082919934 0.356803690 0.019943266 0.157916053 0.596500337 0.310484051
[73 ] 0.358506566 0.039563498 0.293180220 0.091518236 0.120260399 0.568856753
[79 ] 0.555350788 0.361090035 0.581617962 0.162488700 0.030624983 0.100611279
[85 ] 0.034662556 0.090623805 0.110446260 0.476125166 0.066091100 0.227380392
[91 ] 0.313598560 0.208455373 0.091880510 0.564742577 0.114848789 0.019729175
[97 ] 0.018121741 0.075361771 0.095357318 0.053773372
$mean_loss
[1 ] 4.326719 4.363514 4.243704 4.351243 4.068727 4.188051 4.349598 4.276576
[9 ] 4.655676 4.213217 4.328811 4.289910 4.318009 4.118218 4.090955 4.242635
[17 ] 4.075828 4.916212 4.182904 4.075151 4.075051 4.811073 4.069328 4.069280
[25 ] 4.082273 4.085336 4.325474 4.234394 4.629235 4.318122 4.279497 4.748098
[33 ] 4.342493 4.115495 4.215000 4.274092 4.082477 4.266118 4.075015 4.166005
[41 ] 4.098110 4.158170 4.144667 4.217370 5.673885 4.393299 5.598849 4.338047
[49 ] 4.068708 4.128075 4.156527 4.095877 4.119152 4.317631 4.249079 4.101141
[57 ] 4.607759 5.800193 4.361797 4.079682 4.427682 4.286694 4.291051 4.083548
[65 ] 4.124766 4.165121 4.068681 4.368172 4.258867 4.130544 4.797565 4.319682
[73 ] 4.369993 4.141330 4.303859 4.069257 4.085837 4.733147 4.702570 4.372781
[81 ] 4.761939 4.136805 4.182365 4.075355 4.160067 4.069152 4.078239 4.545854
[89 ] 4.074988 4.233277 4.322496 4.210204 4.069302 4.723837 4.082141 4.259009
[97 ] 4.270222 4.074724 4.069788 4.082248
$best_lambda_picked
[1] 0.08291993
> plot(log(res2$candidate_for_lambda), res2$mean_loss, xlab = "Log Lambda",
+      ylab = "Mean Squared Error")
> abline(v = log(res2$best_lambda_picked), col = "red")
Sample Run
(Plot: mean squared error versus log lambda, with a red vertical line at the selected lambda; figures not reproduced.)
Elapsed Time
Average elapsed time of the three functions over 10 iterations (m = 10):
Set       cv.glmnet    My_cv_Lasso    Variant_My_cv_Lasso
Female    0.0318 s     3.6987 s       4.954 s
Male      0.0327 s     3.715 s        5.091 s
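A sketch of how such timings can be obtained (the project's actual timing code is not shown here): each function is run 10 times with system.time() and the elapsed times are averaged.
n_iter <- 10
t_cv  <- replicate(n_iter, system.time(
  cv.glmnet(X1, Y1, alpha = 1, family = "gaussian", nfolds = 10))["elapsed"])
t_my  <- replicate(n_iter, system.time(
  My_cv_Lasso(X1, Y1, nfolds = 10))["elapsed"])
t_var <- replicate(n_iter, system.time(
  Variant_My_cv_Lasso(X1, Y1, nfolds = 10, mu = 5))["elapsed"])
c(cv.glmnet = mean(t_cv), My_cv_Lasso = mean(t_my), Variant_My_cv_Lasso = mean(t_var))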
Conclusion
There is no single "optimal" or "best" \(\lambda\); the selected value varies with the random fold assignment and with \(m\)
The Monte Carlo simulations show that My_cv_Lasso and Variant_My_cv_Lasso can perform at least as well as cv.glmnet from the glmnet package
However, My_cv_Lasso and Variant_My_cv_Lasso take significantly longer to run than cv.glmnet
Reference
The dataset can be found at: https://www.kaggle.com/datasets/mexwell/body-measurements/data