suppressPackageStartupMessages( require(oetteR) ) 
suppressPackageStartupMessages( require(pipelearner) )
suppressPackageStartupMessages( require(tidyverse) )

1 Regression Feature Selection

To my understanding, feature selection for regression used to be pretty straightforward using stepwise regression. However, I recently learned that this approach is flawed and that the lasso should be used instead. Most of the response variables I do regression for are not normally distributed and have a distribution that looks to me like a gamma or beta distribution. I recently had good results fitting a gamma regression model, while beta distribution models usually did not converge to a minimum or did not improve the fit. It is, however, challenging to find an implementation of the lasso that works for a gamma regression.
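
For orientation, this is the kind of gamma regression meant here, as a minimal sketch on made-up data (the dataframe df and its columns are hypothetical, not part of the analysis below):

set.seed(1)

df = tibble( y = rgamma(1000, shape = 2)
             , x1 = rnorm(1000)
             , x2 = rnorm(1000) )

# gamma regression with a log link via base R glm
m_gamma = glm( y ~ x1 + x2, family = Gamma(link = 'log'), data = df )
summary(m_gamma)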

2 R Implementations for Lasso

  • glmnet does not support gamma regression but is the most used lasso implementation. Documentation
  • HDtweedie: a gamma distribution is a tweedie distribution with p == 2, and the package offers the lasso for tweedie distributions. Documentation
  • gamlss offers the lasso and gamma regression, but the way predict.gamlss is implemented (it expects the fitting data to be present in the global environment when calling predict) makes it difficult to integrate into functions. There is also no documentation.
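
For reference, the core calls of the first two packages look roughly like this (a minimal sketch on made-up data):

set.seed(1)

x = matrix( rnorm(500), ncol = 5 )
y = rgamma( 100, shape = 2 )

# cv.glmnet has no gamma family (gaussian by default);
# cv.HDtweedie covers the gamma case via the tweedie power p = 2
fit_glmnet  = glmnet::cv.glmnet( x, y, alpha = 1 )
fit_tweedie = HDtweedie::cv.HDtweedie( x, y, p = 2, alpha = 1 )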

3 oetteR::f_train_lasso

I have written a wrapper function for glmnet and HDtweedie which is implemented in oetteR. The implementation should make it easy to judge the feature-reducing qualities of the lasso for different datasets.

4 Feature Selection with the Lasso

The cv.glmnet and the cv.HDtweedie functions already return a model. However, since these functions handle the internals of the cross validation, I cannot really be sure about the composition of the coefficients and am not a hundred percent sure how they are being averaged across folds. Thus I prefer to use the lasso only for feature selection: select the simplest model within 1 SE range of the best model and use all features with non-zero coefficients to fit a new OLS regression model.

https://stats.stackexchange.com/questions/82466/why-use-lasso-estimates-over-ols-estimates-on-the-lasso-identified-subset-of-var

This also reduces a bias in the coefficients towards zero.
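
A minimal sketch of this select-then-refit approach using cv.glmnet (the dataframe df and its columns are hypothetical):

set.seed(1)

df = tibble( y = rgamma(5000, 2) ) %>%
  mutate( x1 = rnorm(5000)
          , x2 = log(y) + rnorm(5000, 0, 0.1)  # informative predictor
          , x3 = rnorm(5000) )

x = as.matrix( df[ , c('x1', 'x2', 'x3') ] )

cv_fit = glmnet::cv.glmnet( x, df$y, alpha = 1 )

# keep the features with non-zero coefficients at lambda 1SE
coefs    = as.matrix( coef( cv_fit, s = 'lambda.1se' ) )
selected = setdiff( rownames(coefs)[ coefs[, 1] != 0 ], '(Intercept)' )

# refit an unpenalized model on the selected features
form    = as.formula( paste( 'y ~', paste( selected, collapse = ' + ' ) ) )
m_refit = lm( form, data = df )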

5 Tweedie Distributions

Depending on the p value, tweedie distributions have a more or less extreme point mass at zero. As it is not always clear which distribution to use, I prefer to perform the lasso with a couple of them and see which variable selection the models come up with.

# draw 5000 samples from each tweedie distribution with power p
tweedies = tibble( p = seq(1, 2, 0.1) )

tweedies = tweedies %>%
  mutate( dist = map( p, function(x) tweedie::rtweedie(5000, x, phi = 1, mu = 1) ) ) %>%
  unnest( dist ) %>%
  mutate( p = as.factor(p) )


ggplot(tweedies) +
  geom_histogram( aes( dist)
                  , bins = 50 ) +
  facet_wrap(~p, scales = 'free' )

6 Lasso with Non-Normally Distributed Response Variables

Note: I am using pipelearner for the internals of my lasso functions and for fitting. pipelearner is not on CRAN and might not be available in a few years, so it would be safer to come up with a solution using caret. However, pipelearner is more flexible and lets you use your own functions.
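
For completeness, a rough sketch of what a caret-based alternative could look like (not used in this post; it assumes a dataframe data with response resp, as simulated in the next section):

library(caret)

ctrl = trainControl( method = 'cv', number = 10 )

# caret tunes glmnet's lambda via resampling; alpha = 1 keeps it a pure lasso
m = train( resp ~ .
           , data = data
           , method = 'glmnet'
           , trControl = ctrl
           , tuneGrid = expand.grid( alpha = 1
                                     , lambda = 10^seq(-3, 3, length.out = 25) ) )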

6.1 Simulated Data

data = tibble( resp = rgamma(5000, 1) ) %>%
  mutate( exp1 = rnorm(5000, 50, 5)
          , exp2 = log(resp) ## non-linear relationship
          , exp3 = exp2 * rnorm(5000, 1, 0.5) # collinear with exp2
          , exp4 = resp * rgamma(5000, 0.5) # random element
          , exp5 = resp * rgamma(5000, 0.5) * rnorm(5000, 1, 0.5) # random element
  )

6.2 Training the lasso

  • f_manip_data_2_model_matrix_format is a wrapper for model.matrix, which creates dummy-variable names that cannot be used in formula syntax. f_manip_data_2_model_matrix_format comes up with better names and returns a new dataframe and a compatible formula.

  • f_train_lasso works just like cv.glmnet and cv.HDtweedie. It also comes with more useful plots as output and helps to extract the correct formulas for the features.

form = resp~exp1+exp2+exp3+exp4+exp5

data_trans = f_manip_data_2_model_matrix_format(data, form)

new_data = data_trans$data
new_formula =  data_trans$formula


lasso = f_train_lasso(new_data, new_formula
                      , lambda = 10^seq(-3,3,length.out = 25)
                      , p = c( 1, 1.25, 1.5, 1.75, 2 )
                      , k = 10 )
## [1] "Progress: 1 / 6 ; 16.7 % ; ETA: 0.1 min"
## [1] "Progress: 2 / 6 ; 33.3 % ; ETA: 0.1 min"
## [1] "Progress: 3 / 6 ; 50 % ; ETA: 0.1 min"
## [1] "Progress: 4 / 6 ; 66.7 % ; ETA: 0 min"
## [1] "Progress: 5 / 6 ; 83.3 % ; ETA: 0 min"
## [1] "Progress: 6 / 6 ; 100 % ; ETA: 0 min"

6.2.1 Lasso tibble

Here we find, for each lambda, the corresponding mse, a tibble with the coefficients, a formula including all non-zero coefficients, the formulas as strings for readability, and the number of coefficients.

lasso$tib_all %>%
  head(5) %>%
  knitr::kable()
distribution |    lambda | lambda_min | lambda_1se |       mse | coeff                                                                        | formula  | formula_str | n_coeff_before_lasso | n_coeff_after_lasso
-------------|-----------|------------|------------|-----------|------------------------------------------------------------------------------|----------|-------------|----------------------|--------------------
gaussian     | 1000.0000 |      0.001 |  0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1    |                    5 |                   0
gaussian     |  562.3413 |      0.001 |  0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1    |                    5 |                   0
gaussian     |  316.2278 |      0.001 |  0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1    |                    5 |                   0
gaussian     |  177.8279 |      0.001 |  0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1    |                    5 |                   0
gaussian     |  100.0000 |      0.001 |  0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1    |                    5 |                   0

6.2.2 Plot MSE

In this plot we can see that the standard gaussian distribution has a much higher MSE than the tweedie distributions. We find the minimum MSE, marked by the dashed horizontal line, at the third lambda value, meaning that for the gaussian distribution shrinkage of the initial coefficients improves the fit. The lambda value that is still within 1 SE range of the minimum-MSE lambda is the eighth lambda value. Possibly this would already lead to some features being eliminated.

The tweedie distributions have their minimum MSE and their 1 SE lambda at the first lambda value (use the plotly graph to deselect the lines one by one). Zooming further in on that point, we find that the distribution for p = 2, which is the gamma distribution, performs best, as is to be expected since the response variable has a gamma distribution.

plotly::ggplotly(lasso$plot_mse)
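
As an aside, the 1 SE rule used above can be sketched as follows; the tibble and its columns (lambda, mean cross-validated error, and its standard error) are made up for illustration. The conventional rule picks the most regularized model whose error is still within one standard error of the minimum.

set.seed(1)

cv_results = tibble( lambda = 10^seq(-3, 3, length.out = 25)
                     , cv_mse = runif(25, 0.9, 1.1)
                     , cv_se  = runif(25, 0.01, 0.05) )

best = cv_results %>%
  filter( cv_mse == min(cv_mse) )

lambda_1se = cv_results %>%
  filter( cv_mse <= best$cv_mse + best$cv_se ) %>%  # within 1 SE of the minimum
  filter( lambda == max(lambda) ) %>%               # most regularized of those
  pull( lambda )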

6.3 Plot coefficients

Here we find that for the gaussian distribution at lambda 1SE one out of five coefficients is at zero, namely the coefficient for exp3. exp3 has been constructed to be extremely collinear with exp2 and thus carries only redundant information.

For the tweedie distributions we find that all coefficients except for exp2 are already at zero in the initial fit. If we look at the simulated data, we find that the formulations for all explanatory variables except exp2 carry some random element, which makes them less reliable predictors for the response than exp2, which has no random element.

plotly::ggplotly(lasso$plot_coef)
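
We can verify the constructed collinearity directly on the simulated data:

# correlation between the deliberately collinear predictors
data %>%
  select( exp2, exp3 ) %>%
  cor()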

6.4 Get formulas

The lasso object already has an attribute that returns the set of unique formulas which worked best for each value of p.

lasso$formula_str_lambda_1se
## [1] "resp ~ exp2 + exp4 + exp5" "resp ~ exp2"

6.5 Fit regression models

We will now fit each formula with each distribution (gaussian, tweedie, gamma).

6.5.1 Prepare formulas and wrappers

The statmod::tweedie function provides the distribution family for glm. As a control we will also use HDtweedie to check whether the results of these two methods are comparable. We will use the fixed lambda value of 0.001, which was the minimum lambda we used previously and which was the lambda 1SE value across all tweedie distributions tested. Note that p_fact and var.power are synonyms for the p parameter of the tweedie distributions. As another control we will add a randomForest regression.

Sometimes the fits will not converge; we will therefore also grid-search different values for the link power, which can take values between -1 and 1.
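
For orientation, a single such tweedie glm fit looks like this (a minimal sketch using the simulated data above; var.power is the p parameter and link.power = 0 corresponds to a log link):

fam = statmod::tweedie( var.power = 1.5, link.power = 0 )

m = glm( resp ~ exp2, data = data, family = fam )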

formulas_df = tibble( f_str = lasso$formula_str_lambda_1se ) %>%
  mutate( f = map( f_str, as.formula ) )



# fits a tweedie glm; returns NULL instead of raising an error when the
# fit fails, so non-converged models can be filtered out later
wr_glm = function(formula, data, ... ){
  
  fit = function(){
    m = glm(formula, data = data, family = statmod::tweedie( ... ))
  }
  
  fit_safely = safely(fit)
  
  m = fit_safely()
  
  if( is.null(m$error) ){
    return(m$result)
  }else{
    return(NULL)
  }
  
}

# this wrapper can be found in the internals of the f_train_lasso function
wr_tweedie = function( data, formula, lambda, p_fact ){

  response_var = f_manip_get_response_variable_from_formula(formula)

  y = data[[response_var]]
  x = as.matrix( model.matrix(formula, data)[,-1] ) # model.matrix output drops to a vector if only one predictor column remains
  
  m = HDtweedie::HDtweedie(x,y, lambda = lambda, p = p_fact, alpha = 1, standardize = F )

}

6.5.2 Pipelearner

pl = pipelearner(new_data) %>%
  learn_cvpairs( pipelearner::crossv_kfold, k = 10 ) %>%
  learn_models( lm, formulas = formulas_df$f ) %>%
  learn_models( lm, formulas = resp ~ 1 ) %>%
  learn_models( wr_glm
                , formulas = formulas_df$f
                , var.power = c( 1, 1.25, 1.5, 1.75, 2 )
                , link.power = c( seq( -1 , 1, 0.25) )
                ) %>%
  learn_models( wr_tweedie
                , formulas = formulas_df$f
                , lambda = 0.001
                , p_fact = c( 1, 1.25, 1.5, 1.75, 2 ) 
                ) %>%
  learn_models( randomForest::randomForest
                , formulas = new_formula ) %>%
  learn()


pl_filt = pl %>%
  mutate( not_converged = map_lgl( fit, is.null) ) %>%
  filter( not_converged == F ) 

6.5.3 Label Models

pl_lab_lm = pl_filt %>%
  filter( model == 'lm' ) %>%
  mutate( form = map( params, 'formula')
          , form = as.character(form)
          , title = paste(model, form) )

pl_lab_glm = pl_filt %>%
  filter( model == 'wr_glm' ) %>%
  mutate( form = map( params, 'formula')
          , form = as.character(form)
          , var.power = map_dbl( params, 'var.power' )
          , link.power = map_dbl( params, 'link.power' )
          , title = paste(model, 'p:', round( var.power, 1), 'l:', round( link.power, 1 ) ,form) )


pl_lab_HDtweedie = pl_filt %>%
  filter( model == 'wr_tweedie' ) %>%
  mutate( form = map( params, 'formula')
          , form = as.character(form)
          , p_fact = map_dbl( params, 'p_fact' )
          , title = paste(model, 'p:', round( p_fact, 1 ), form) )

pl_lab_forest = pl_filt %>%
  filter( model == 'randomForest' ) %>%
  mutate(  title = model )


pl_lab = pl_lab_lm %>%
  bind_rows( pl_lab_glm) %>%
  bind_rows( pl_lab_HDtweedie ) %>%
  bind_rows( pl_lab_forest )

6.5.4 Add prediction

f_predict_pl_regression is a wrapper that adds predictions to a modelling dataframe; its default parameters (the names of the data, test, and model columns) are set to work with pipelearner dataframes.

pl_pred = pl_lab %>%
  f_predict_pl_regression() 

6.5.5 Performance of all models

f_predict_pl_regression_summarize is a function that generates a summary table with several performance parameters from a modelling dataframe with unnested predictions.

We can see that the best performing models are obtained using the HDtweedie function. Thus we were not able to replicate the fits using a glm approach, which means we cannot really eliminate the bias of the coefficients towards zero. Also, I cannot fathom how resilient the optimal lambda values are with regards to changes in the data, meaning that I am not sure whether it is enough to establish the tuning parameters (features and lambda) once and then use such a model in production without retuning. One way to probe this is sketched after the table below.

pl_sum = pl_pred %>%
  unnest(preds, .drop = F) %>%
  f_predict_pl_regression_summarize() %>%
  select( title, rtmse )

f_datatable_universal(pl_sum)
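
Regarding the robustness of the optimal lambda: a rough sketch of checking how stable lambda.1se is under resampling, using cv.glmnet on bootstrap samples of the simulated data (cv.glmnet rather than HDtweedie only to keep the sketch fast; this is an illustration, not part of the original analysis):

set.seed(1)

lambdas = map_dbl( 1:20, function(i){
  idx = sample( nrow(data), replace = TRUE )
  x   = as.matrix( data[ idx, c('exp1', 'exp2', 'exp3', 'exp4', 'exp5') ] )
  y   = data$resp[ idx ]
  glmnet::cv.glmnet( x, y, alpha = 1 )$lambda.1se
} )

# a wide spread would suggest retuning lambda whenever the data changes
summary( lambdas )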

6.5.6 Visualisation of model performance

We will select the best glm, all HDtweedie models, and the lm and randomForest models, which serve as controls, and visualise them.

6.5.6.1 Selections

title_best_glm = pl_sum %>%
  filter( startsWith(title, 'wr_glm') ) %>%
  filter( rtmse == min(rtmse) ) %>%
  .$title


pl_glm = pl_pred %>%
  filter( title == title_best_glm )

pl_tweedie = pl_pred %>%
  filter( model == 'wr_tweedie' & form == "resp ~ exp2"  )
  
pl_lm_forest = pl_pred %>%
  filter( model == 'lm' | model == 'randomForest' )

pl_vis = pl_glm %>%
  bind_rows( pl_tweedie) %>%
  bind_rows( pl_lm_forest)

6.5.6.2 Visualisation

6.5.6.2.1 Compare distribution of predictions

f_predict_plot_regression_distribution can take care of that. We have to pass the column names for the predictions, the observations, and the model titles. The default values, again, are set to work with pipelearner defaults.

Here we see that only the HDtweedie function produces responses that match the distribution of the observations.

dist_prediction = pl_vis %>%
  unnest(preds)

f_predict_plot_regression_distribution(dist_prediction, col_title = 'title')
## $p_box

## 
## $p_hist

dist_plots = tibble( title = 'observed'
                     , pred = new_data$resp )


dist_plots
## # A tibble: 5,000 x 2
##       title       pred
##       <chr>      <dbl>
##  1 observed 0.52119966
##  2 observed 0.70324377
##  3 observed 0.04671291
##  4 observed 0.67876936
##  5 observed 0.50679615
##  6 observed 0.89390337
##  7 observed 1.16761460
##  8 observed 0.13353700
##  9 observed 2.40704317
## 10 observed 1.46342186
## # ... with 4,990 more rows

6.5.6.2.2 Visualize performance

f_predict_plot_model_performance_regression builds an automated html output. We need to bin the predictions in order to look at the prediction quality in different ranges of the data. We can use f_plot_obj_2_html with type = 'model_performance' in order to generate a well-formatted interactive html file from it. Using the additional argument dist we could even seamlessly integrate the distribution plots already displayed above. Additionally we can add an alluvial plot with f_predict_plot_regression_alluvial, which allows us to investigate which observations are being over- and which are being under-predicted by the different models. However, this does not make much sense for the simulated dataset at hand.

taglist = pl_vis %>%
  unnest(preds, .drop = F) %>%
  mutate( bins = cut(target1, breaks = 5) ) %>%
  f_predict_plot_model_performance_regression()
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"

We can now use the taglist, which is a list generated with htmltools::tagList that can store htmlwidgets, to either (i) create a new html document using f_plot_obj_2_html with type = 'model_performance', (ii) simply call it inside an Rmd document such as this one, or (iii) render it as html and then integrate a screenshot into another document using the webshot package. Since we want to keep this document as a single file and do not want a lot of interactive tables and plots for performance reasons, we chose the last option.

Note the commented-out arguments for inserting the distribution and a potential alluvial plot into the model performance html.

link = f_plot_obj_2_html(taglist
                         , type = 'model_performance'
                         , output_file = 'model_performance'
                         , title = 'Model Performance Gamma-Distributed Response Variable'
                         #, dist = dist
                         #, alluvial = alluvial
                         )
## (knitr rendering progress and the local pandoc command line output omitted)
webshot::webshot(link)

file.remove(link)
## [1] TRUE
file.remove('webshot.png')
## [1] TRUE

7 Alternative Dataset: Auto Data

This dataset comes with the HDtweedie package; the response variable has a typical tweedie (p = 1.5) distribution.

library(HDtweedie)
data('auto')


data = tibble( resp = auto$y ) %>%
  bind_cols( as_tibble( scale(auto$x) ) )
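
As a quick illustrative check (an addition for this writeup), we can eyeball the distribution of the response before repeating the lasso procedure:

# quick visual check of the response distribution
ggplot(data) +
  geom_histogram( aes(resp), bins = 50 )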