suppressPackageStartupMessages( require(oetteR) )
suppressPackageStartupMessages( require(pipelearner) )
suppressPackageStartupMessages( require(tidyverse) )
To my understanding, feature selection for regression used to be pretty straightforward with stepwise regression. However, I recently learned that this approach is flawed and that the lasso should be used instead. Most of the response variables I run regressions on are not normally distributed; their distributions look to me like gamma or beta distributions. I recently had good results fitting a gamma regression model, while beta regression models usually did not converge to a minimum or did not improve the fit. It is, however, challenging to find a lasso implementation that works for gamma regression.
- `glmnet` is the most widely used lasso implementation, but it does not support gamma regression (see the package documentation).
- `HDtweedie`: a gamma regression is a Tweedie regression with p == 2, and the package offers the lasso for Tweedie distributions (see the package documentation).
- `gamlss` offers both the lasso and gamma regression, but the way `predict.gamlss` is implemented (it expects the fitting data to be present in the global environment when calling `predict`) makes it difficult to integrate into functions. There is also no documentation for this.
- `oetteR::f_train_lasso`: I have written wrapper functions for `glmnet` and `HDtweedie`, implemented in `oetteR`. The implementation should make it easy to judge the feature-reducing qualities of the lasso for different datasets.
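To make the HDtweedie option concrete, here is a minimal sketch (with a hypothetical numeric predictor matrix `x` and gamma-distributed response `y`, not code from the analysis below): a gamma lasso is simply a Tweedie lasso with index parameter p = 2.

```r
library(HDtweedie)

# Sketch (assumed x: numeric predictor matrix, y: gamma-distributed response):
# cross-validated lasso for a gamma regression, i.e. a Tweedie model with p = 2.
cv_gamma = cv.HDtweedie( x, y, p = 2, alpha = 1, nfolds = 10 )

cv_gamma$lambda.1se  # simplest model within 1 SE of the best cross-validated fit
```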
The cv.glmnet and cv.HDtweedie functions already return a model. However, since they handle the cross validation internally, I cannot really be sure about the composition of the coefficients, i.e. I am not a hundred percent sure how the coefficients are averaged over the folds. Thus I prefer to use the lasso only for selection: I pick the simplest model within the 1 SE range of the best model and use all features with non-zero coefficients to fit a new OLS regression model.
This also reduces the bias of the coefficients towards zero.
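As a rough sketch of that workflow (shown for the gaussian case with cv.glmnet and a hypothetical data frame `df` with numeric predictors and response `y`; f_train_lasso, used below, wraps similar logic for several distributions):

```r
library(glmnet)

# 1) cross-validated lasso
x  = model.matrix( y ~ ., df )[ , -1 ]
cv = cv.glmnet( x, df$y, alpha = 1, nfolds = 10 )

# 2) features with non-zero coefficients in the simplest model within 1 SE of the best
coefs    = coef( cv, s = 'lambda.1se' )
selected = setdiff( rownames(coefs)[ which( coefs[ , 1 ] != 0 ) ], '(Intercept)' )

# 3) refit an unpenalized OLS model on the selected features
fit_ols = lm( reformulate( selected, response = 'y' ), data = df )
```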
Tweedie distributions have, depending on the p value, a more or less extreme concentration of mass at zero. As it is not always clear which distribution to use, I prefer to run the lasso with a couple of them and see which variable selection the models come up with.
tweedies = tibble( p = seq(1,2,0.1) )
tweedies = tweedies %>%
mutate( dist = map( p, function(x) tweedie::rtweedie(5000,x, phi = 1, mu = 1) ) ) %>%
unnest( dist ) %>%
mutate( p = as.factor(p) )
ggplot(tweedies) +
geom_histogram( aes( dist)
, bins = 50 ) +
facet_wrap(~p, scales = 'free' )
Note: I am using pipelearner for the internals of my lasso functions and for the fitting below. pipelearner is not on CRAN and might not be available in a few years, so I think it would be much better to come up with a solution using caret (a rough caret sketch follows the simulated data below). However, pipelearner is more flexible and lets you use your own modelling functions.
data = tibble( resp = rgamma(5000,1) ) %>%
mutate( exp1 = rnorm(5000, 50,5)
, exp2 = log(resp) ## non-linear relationship
, exp3 = exp2 * rnorm(5000, 1, 0.5) # collinear with exp2
, exp4 = resp * rgamma(5000, 0.5)
, exp5 = resp * rgamma(5000, 0.5) * rnorm(5000, 1, 0.5)
)
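For reference, a caret-based tuning of the gaussian lasso on this simulated data could look roughly like the sketch below (an assumption of mine, not how f_train_lasso works internally; caret's built-in 'glmnet' method does not cover the tweedie/gamma case):

```r
library(caret)

# Sketch: 10-fold cross-validated gaussian lasso via caret.
# alpha = 1 forces a pure lasso; the lambda grid mirrors the one passed to f_train_lasso below.
ctrl = trainControl( method = 'cv', number = 10 )
grid = expand.grid( alpha = 1, lambda = 10^seq(-3, 3, length.out = 25) )

fit_caret = train( resp ~ exp1 + exp2 + exp3 + exp4 + exp5
                   , data = data
                   , method = 'glmnet'
                   , trControl = ctrl
                   , tuneGrid = grid )

fit_caret$bestTune
```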
f_manip_data_2_model_matrix_format is a wrapper around model.matrix. When model.matrix creates dummy variables it can come up with awkward variable names that cannot be used in formula syntax; f_manip_data_2_model_matrix_format generates clean names and returns a new dataframe together with a compatible formula.
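A small illustration of the problem (using made-up factor data, since the simulated data above is purely numeric):

```r
# Hypothetical data with factor levels containing spaces and dashes.
df_fct = tibble( resp = rnorm(6)
                 , grp = factor( rep( c('group A', 'group-B', 'group C'), 2 ) ) )

colnames( model.matrix( resp ~ ., df_fct ) )
# e.g. "(Intercept)" "grpgroup C" "grpgroup-B" -- such names cannot be used in a formula
```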
f_train_lasso works just like cv.glmnet and cv.HDtweedie. It also returns more useful plots as output and helps to extract the correct formulas for the selected features.
form = resp~exp1+exp2+exp3+exp4+exp5
data_trans = f_manip_data_2_model_matrix_format(data, form)
new_data = data_trans$data
new_formula = data_trans$formula
lasso = f_train_lasso(new_data, new_formula
, lambda = 10^seq(-3,3,length.out = 25)
, p = c( 1, 1.25, 1.5, 1.75, 2 )
, k = 10 )
## [1] "Progress: 1 / 6 ; 16.7 % ; ETA: 0.1 min"
## [1] "Progress: 2 / 6 ; 33.3 % ; ETA: 0.1 min"
## [1] "Progress: 3 / 6 ; 50 % ; ETA: 0.1 min"
## [1] "Progress: 4 / 6 ; 66.7 % ; ETA: 0 min"
## [1] "Progress: 5 / 6 ; 83.3 % ; ETA: 0 min"
## [1] "Progress: 6 / 6 ; 100 % ; ETA: 0 min"
Here we find for each lambda the corresponding MSE, a tibble with the coefficients, a formula including all non-zero coefficients, the formulas as strings for readability, and the number of coefficients.
lasso$tib_all %>%
head(5) %>%
knitr::kable()
| distribution | lambda | lambda_min | lambda_1se | mse | coeff | formula | formula_str | n_coeff_before_lasso | n_coeff_after_lasso |
|---|---|---|---|---|---|---|---|---|---|
| gaussian | 1000.0000 | 0.001 | 0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1 | 5 | 0 |
| gaussian | 562.3413 | 0.001 | 0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1 | 5 | 0 |
| gaussian | 316.2278 | 0.001 | 0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1 | 5 | 0 |
| gaussian | 177.8279 | 0.001 | 0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1 | 5 | 0 |
| gaussian | 100.0000 | 0.001 | 0.0562341 | 0.9859456 | 0.999982291968335, 0, 0, 0, 0, 0, (Intercept), exp1, exp2, exp3, exp4, exp5 | resp ~ 1 | resp ~ 1 | 5 | 0 |
In this plot we can see that the standard gaussian distribution has a much higher MSE than the tweedie distributions. We find the minimum MSE, marked by the dashed horizontal line, at the third lambda value, meaning that for the gaussian distribution some shrinkage of the initial coefficients improves the fit. The minimum lambda value that is still within the 1 SE range of the minimum-MSE lambda is the eighth lambda value. Possibly this would already lead to some features being eliminated.
The tweedie distributions have their minimum MSE and their 1 SE lambda at the first lambda value (use the plotly graph to deselect the lines one by one). Zooming in further on that point, we find that the distribution with p = 2, which is the gamma distribution, performs best, as is to be expected since the response variable has a gamma distribution.
plotly::ggplotly(lasso$plot_mse)
Here we find that for the gaussian distribution at lambda 1SE, 1 out of 5 coefficients is at zero, namely the coefficient for exp3. exp3 was constructed to be extremely collinear with exp2 and thus carries only redundant information.
For the tweedie distributions we find that all coefficients except the one for exp2 are already at zero in the initial fit. If we look at the simulated data, we find that the formulations of all explanatory variables except exp2 contain a random element, which makes them less reliable predictors of the response than exp2, which has no random element.
plotly::ggplotly(lasso$plot_coef)
The lasso object already contains an attribute with the set of unique formulas that worked best for each value of p.
lasso$formula_str_lambda_1se
## [1] "resp ~ exp2 + exp4 + exp5" "resp ~ exp2"
We will now fit each formula to each distribution (gaussian, tweedie, gamma). The statmod::tweedie function provides the distribution family for glm. As a control we will also use HDtweedie to check whether the results of these two methods are comparable. We will use a fixed lambda value of 0.001, which was the minimum lambda we used previously and the lambda-1SE value of all tweedie distributions tested. Note that p_fact and var.power are synonyms for the p parameter of the Tweedie distribution. As another control we will add a randomForest regression.
Sometimes the fits will not converge, so we will also grid-search different values for the link power, which can take values between -1 and 1.
formulas_df = tibble( f_str = lasso$formula_str_lambda_1se ) %>%
mutate( f = map( f_str, as.formula ) )
wr_glm = function(formula, data, ... ){

  fit = function(){
    m = glm(formula, data, family = statmod::tweedie( ... ))
  }

  fit_safely = safely(fit)
  m = fit_safely()

  if( is.null(m$error) ){
    return(m$result)
  }else{
    return(NULL)
  }
}

# this wrapper can be found in the internals of the f_train_lasso function
wr_tweedie = function( data, formula, lambda, p_fact ){

  response_var = f_manip_get_response_variable_from_formula(formula)
  y = data[[response_var]]
  x = as.matrix( model.matrix(formula, data)[,-1] ) # will not return a matrix if x is only one column

  m = HDtweedie::HDtweedie(x, y, lambda = lambda, p = p_fact, alpha = 1, standardize = F )
}
pl = pipelearner(new_data) %>%
learn_cvpairs( pipelearner::crossv_kfold, k = 10 ) %>%
learn_models( lm, formulas = formulas_df$f ) %>%
learn_models( lm, formulas = resp ~ 1 ) %>%
learn_models( wr_glm
, formulas = formulas_df$f
, var.power = c( 1, 1.25, 1.5, 1.75, 2 )
, link.power = c( seq( -1 , 1, 0.25) )
) %>%
learn_models( wr_tweedie
, formulas = formulas_df$f
, lambda = 0.001
, p_fact = c( 1, 1.25, 1.5, 1.75, 2 )
) %>%
learn_models( randomForest::randomForest
, formulas = new_formula ) %>%
learn()
pl_filt = pl %>%
mutate( not_converged = map_lgl( fit, is.null) ) %>%
filter( not_converged == F )
pl_lab_lm = pl_filt %>%
filter( model == 'lm' ) %>%
mutate( form = map( params, 'formula')
, form = as.character(form)
, title = paste(model, form) )
pl_lab_glm = pl_filt %>%
filter( model == 'wr_glm' ) %>%
mutate( form = map( params, 'formula')
, form = as.character(form)
, var.power = map_dbl( params, 'var.power' )
, link.power = map_dbl( params, 'link.power' )
, title = paste(model, 'p:', round( var.power, 1), 'l:', round( link.power, 1 ) ,form) )
pl_lab_HDtweedie = pl_filt %>%
filter( model == 'wr_tweedie' ) %>%
mutate( form = map( params, 'formula')
, form = as.character(form)
, p_fact = map_dbl( params, 'p_fact' )
, title = paste(model, 'p:', round( p_fact, 1 ), form) )
pl_lab_forest = pl_filt %>%
filter( model == 'randomForest' ) %>%
mutate( title = model )
pl_lab = pl_lab_lm %>%
bind_rows( pl_lab_glm) %>%
bind_rows( pl_lab_HDtweedie ) %>%
bind_rows( pl_lab_forest )
f_predict_pl_regression is a wrapper that adds predictions to a modelling dataframe. Its default parameters (the names of the data, test and model columns) are set to work with pipelearner dataframes.
pl_pred = pl_lab %>%
f_predict_pl_regression()
f_predict_pl_regression_summarize generates a summary table with several performance measures from a modelling dataframe with unnested predictions.
We can see that the best-performing models are obtained using the HDtweedie function. Thus we were not able to replicate the fits using a glm approach, which means we cannot really eliminate the bias of the coefficients towards zero. Also, I cannot fathom how resilient the optimal lambda values are to changes in the data, meaning that I am not sure whether it is enough to establish the tuning parameters (features and lambda) once and then use such a model in production without retuning.
pl_sum = pl_pred %>%
unnest(preds, .drop = F) %>%
f_predict_pl_regression_summarize() %>%
select( title, rtmse )
f_datatable_universal(pl_sum)
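One way to probe the concern above about the stability of the selected lambda (a sketch only, not part of the pipeline; shown for the gaussian cv.glmnet case, the same idea would apply to cv.HDtweedie) is to re-run the cross-validated lasso on bootstrap resamples and look at how much the selected lambda moves around:

```r
# Sketch: refit the cross-validated lasso on bootstrap resamples of new_data
# and inspect the spread of the selected lambda.1se values.
x = model.matrix( new_formula, new_data )[ , -1 ]
y = new_data$resp

lambda_boot = map_dbl( 1:20, function(i) {
  idx = sample( nrow(new_data), replace = TRUE )
  glmnet::cv.glmnet( x[idx, ], y[idx], alpha = 1, nfolds = 10 )$lambda.1se
} )

summary( lambda_boot )
```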
We will select the best glm model, all HDtweedie models and all of the lm models, which serve as controls, and visualise them.
title_best_glm = pl_sum %>%
filter( startsWith(title, 'wr_glm') ) %>%
filter( rtmse == min(rtmse) ) %>%
.$title
pl_glm = pl_pred %>%
filter( title == title_best_glm )
pl_tweedie = pl_pred %>%
filter( model == 'wr_tweedie' & form == "resp ~ exp2" )
pl_lm_forest = pl_pred %>%
filter( model == 'lm' | model == 'randomForest' )
pl_vis = pl_glm %>%
bind_rows( pl_tweedie) %>%
bind_rows( pl_lm_forest)
f_predict_plot_regression_distribution can take care of that. We have to pass the column names for the predictions, the observations and the model titles. The default values again are set to work with pipelearner defaults.
Here we see that only the HDtweedie function produces responses that match the distribution of the observations.
dist_prediction = pl_vis %>%
unnest(preds)
f_predict_plot_regression_distribution(dist_prediction, col_title = 'title')
## $p_box
##
## $p_hist
dist_plots = dist_observed = tibble( title = 'observed'
, pred = new_data$resp )
dist_plots
## # A tibble: 5,000 x 2
## title pred
## <chr> <dbl>
## 1 observed 0.52119966
## 2 observed 0.70324377
## 3 observed 0.04671291
## 4 observed 0.67876936
## 5 observed 0.50679615
## 6 observed 0.89390337
## 7 observed 1.16761460
## 8 observed 0.13353700
## 9 observed 2.40704317
## 10 observed 1.46342186
## # ... with 4,990 more rows
f_predict_plot_model_performance_regression builds an automated html output. We need to bin the predictions in order to look at the prediction quality over different ranges of the data. We can use f_plot_obj_2_html with type = 'model_performance' in order to generate a well-formatted interactive html file from it. Using the additional argument dist we could even seamlessly integrate the distribution plots already displayed above. Additionally we can add an alluvial plot with f_predict_plot_regression_alluvial, which allows us to investigate which observations are being over- and which are being under-predicted by the different models. However, this does not make much sense for the simulated dataset at hand.
taglist = pl_vis %>%
unnest(preds, .drop = F) %>%
mutate( bins = cut(target1, breaks = 5) ) %>%
f_predict_plot_model_performance_regression()
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
We can now use the taglist, which is a list generated with htmltools::tagList that can store htmlwidgets, to either (i) create a new html document using f_plot_obj_2_html with type = 'model_performance', (ii) simply call it inside an Rmd document such as this one, or (iii) render it as html and then integrate a screenshot into another document using the webshot package. Since we want to keep this document as a single file and do not want a lot of interactive tables and plots for performance reasons, we chose the last option.
Note the commented out arguments for inserting the distribution and a potential alluvial plot into the model performance html.
link = f_plot_obj_2_html(taglist
, type = 'model_performance'
, output_file = 'model_performance'
, title = 'Model Performance Gamma-Distributed Response Variable'
#, dist = dist
#, alluvial = alluvial
)
webshot::webshot(link)
file.remove(link)
## [1] TRUE
file.remove('webshot.png')
## [1] TRUE
This dataset comes with the HDtweedie package; the response variable has a typical Tweedie p = 1.5 distribution.
library(HDtweedie)
data('auto')
data = tibble( resp = auto$y ) %>%
bind_cols( as_tibble( scale(auto$x) ) )