These are some thoughts about the list-column workflow for a multi-dataset analysis, here consisting of 3 datasets: diamonds, pulsar and stackoverflow.
PS: This workflow is an alternative to the tidymodels framework, which I have postponed for later.
These datasets come from R packages or from a few DataCamp courses.
The topics touched upon here are:
The main learning resources are from:
This document is structured as follows: libraries and data are first loaded; then some data exploration and transformation is carried out; an exploratory PCA/HCPC analysis is performed; finally, classification modelling simulation exercises and a variable importance analysis are done.
PS: An additional step would be a more in-depth comparison between the results from the exploratory part of the analysis (PCA/HCPC) and the ML part (best overall models and in particular the variable importance outputs).
I started with a brief focus on the continuous variables of all 3 datasets to see some general correlation or non-linearity patterns, which served as a first step for the feature engineering process:
Both steps have the potential to enrich the multivariate exploratory analysis and the modelling performance of the considered ML models.
In the selection process of variable candidates for FE, I plotted a few correlation charts using PerformanceAnalytics. After this, I selected a few continuous variables that showed non-linearity patterns as candidates for binning into categorical variables using HCPC transformations.
Additional correlation/histogram plots using GGally can then be done for further exploration.
For further data exploration, a PCA followed. Part of its outcome will be used in the ML-modelling part, even though the caret package offers PCA transformations on the fly. The reason is that FactoMineR is my package of reference for PCA/HCPC analysis, most importantly because of its many informative and enriching analysis extensions. Subsequently, an HCPC analysis will complement the PCA for additional insights.
When working with 3 datasets (or more) at the same time, the list-column workflow comes in very handy. The tibble below collects basic information for the first PCA call using FactoMineR.
## ---- 2.3.1.0. Long-formatting-RAW-outputs!!! ######
# 1. Long-formatting- PCA-call!
list_pca_res_RAW %>%
  select(data_ID, pca_data, quanti, quali, sup) %>%
  gather(pca_call, value,
         -data_ID,
         -pca_data)
## # A tibble: 9 x 4
## data_ID pca_data pca_call value
## <int> <chr> <chr> <list>
## 1 1 diamonds quanti <named list [3]>
## 2 2 pulsar quanti <named list [3]>
## 3 3 stackoverflow quanti <named list [3]>
## 4 1 diamonds quali <named list [5]>
## 5 2 pulsar quali <named list [5]>
## 6 3 stackoverflow quali <named list [5]>
## 7 1 diamonds sup <int [100]>
## 8 2 pulsar sup <int [100]>
## 9 3 stackoverflow sup <int [100]>
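A minimal sketch of how such a list-column tibble could be built, assuming three stand-in datasets that ship with R (`iris`, `mtcars`, `USArrests`) rather than the data used here. The names `pca_inputs` and `list_pca_sketch` are hypothetical; only `PCA()` and its `$eig`, `$var` and `$ind` elements come from FactoMineR.

```r
library(dplyr)
library(purrr)
library(tibble)
library(FactoMineR)

# Stand-in datasets shipped with R (assumption: NOT the diamonds,
# pulsar and stackoverflow data used in this document).
pca_inputs <- tibble(
  data_ID  = 1:3,
  pca_data = c("iris", "mtcars", "USArrests"),
  data     = list(iris[, 1:4], mtcars, USArrests)
)

# One PCA per row; the result and its parts each live in a list-column.
list_pca_sketch <- pca_inputs %>%
  mutate(
    res.PCA = map(data, ~ PCA(.x, scale.unit = TRUE, ncp = 5, graph = FALSE)),
    eigen   = map(res.PCA, ~ .x$eig),
    var     = map(res.PCA, ~ .x$var),
    ind     = map(res.PCA, ~ .x$ind)
  )

# Pluck the eigenvalue table of a single dataset when needed.
list_pca_sketch %>%
  filter(pca_data == "USArrests") %>%
  pluck("eigen", 1)
```

Keeping one row per dataset means every later step (screeplots, clustering, modelling) can be added as one more `map()` call inside `mutate()`.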
The following tibble goes a step deeper and illustrates the many objects that can be stored in the same tidy format and used for further analysis. Among others, screeplots, variable correlations (loadings) and component projections of the 3 datasets’ observations, with a focus on the main components, can be plucked when needed.
# 2. Long-formatting- PCA-RAW-output!
list_pca_res_RAW %>%
  select(data_ID, pca_data,
         res.PCA, eigen, screeplot, var, ind, call) %>%
  gather(pca_raw, value,
         -data_ID,
         -pca_data)
## # A tibble: 18 x 4
## data_ID pca_data pca_raw value
## <int> <chr> <chr> <list>
## 1 1 diamonds res.PCA <PCA>
## 2 2 pulsar res.PCA <PCA>
## 3 3 stackoverflow res.PCA <PCA>
## 4 1 diamonds eigen <dbl[,3] [7 x 3]>
## 5 2 pulsar eigen <dbl[,3] [8 x 3]>
## 6 3 stackoverflow eigen <dbl[,3] [16 x 3]>
## 7 1 diamonds screeplot <gg>
## 8 2 pulsar screeplot <gg>
## 9 3 stackoverflow screeplot <gg>
## 10 1 diamonds var <named list [4]>
## 11 2 pulsar var <named list [4]>
## 12 3 stackoverflow var <named list [4]>
## 13 1 diamonds ind <named list [4]>
## 14 2 pulsar ind <named list [4]>
## 15 3 stackoverflow ind <named list [4]>
## 16 1 diamonds call <named list [12]>
## 17 2 pulsar call <named list [12]>
## 18 3 stackoverflow call <named list [12]>
Below, I pluck the typical list-structure of the different PCA outputs (performed for instance on the stackoverflow dataset) to remind me of the different outputs of a PCA-object created from FactoMineR.
# 2. Long-formatting- PCA-sup!
list_pca_res_desc_supl %>%
  select(data_ID, pca_data,
         res.PCA, eigen, screeplot, label, catdes, condes, dimdesc) %>% # ADDED condes
  gather(pca_sup, value,
         -data_ID,
         -pca_data) %>%
  filter(pca_data == params$dataset) %>%
  filter(pca_sup == "res.PCA") %>%
  pluck("value")
## [[1]]
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 3350 individuals, described by 23 variables
## *The results are available in the following objects:
##
## name
## 1 "$eig"
## 2 "$var"
## 3 "$var$coord"
## 4 "$var$cor"
## 5 "$var$cos2"
## 6 "$var$contrib"
## 7 "$ind"
## 8 "$ind$coord"
## 9 "$ind$cos2"
## 10 "$ind$contrib"
## 11 "$ind.sup"
## 12 "$ind.sup$coord"
## 13 "$ind.sup$cos2"
## 14 "$quanti.sup"
## 15 "$quanti.sup$coord"
## 16 "$quanti.sup$cor"
## 17 "$quali.sup"
## 18 "$quali.sup$coord"
## 19 "$quali.sup$v.test"
## 20 "$call"
## 21 "$call$centre"
## 22 "$call$ecart.type"
## 23 "$call$row.w"
## 24 "$call$col.w"
## description
## 1 "eigenvalues"
## 2 "results for the variables"
## 3 "coord. for the variables"
## 4 "correlations variables - dimensions"
## 5 "cos2 for the variables"
## 6 "contributions of the variables"
## 7 "results for the individuals"
## 8 "coord. for the individuals"
## 9 "cos2 for the individuals"
## 10 "contributions of the individuals"
## 11 "results for the supplementary individuals"
## 12 "coord. for the supplementary individuals"
## 13 "cos2 for the supplementary individuals"
## 14 "results for the supplementary quantitative variables"
## 15 "coord. for the supplementary quantitative variables"
## 16 "correlations suppl. quantitative variables - dimensions"
## 17 "results for the supplementary categorical variables"
## 18 "coord. for the supplementary categories"
## 19 "v-test of the supplementary categories"
## 20 "summary statistics"
## 21 "mean of the variables"
## 22 "standard error of the variables"
## 23 "weights for the individuals"
## 24 "weights for the variables"
In this section, a hierarchical clustering is carried out on the PCA results of each dataset. Again, the list-column workflow is extremely useful in its tidiness.
First, I pluck from the resulting list-column tibble (shown later) the list-column storing the optimal number of clusters for each dataset, as estimated by the HCPC function.
### Check clusters
list_hcpc_res %>%
  pluck("n_clusters") %>%
  set_names(paste0(c("diamonds", "pulsar", "stackoverflow"),
                   "_clusters"))
## diamonds_clusters pulsar_clusters stackoverflow_clusters
## 3 4 4
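For reference, a hedged sketch of how a cluster count per dataset might be obtained, again on stand-in datasets and not on the data analysed here. It assumes the tree is cut automatically (`nb.clust = -1` in `HCPC()`); whether the original analysis used that setting is not stated.

```r
library(purrr)
library(FactoMineR)

# Stand-in datasets (assumption: not the ones analysed in this document).
datasets <- list(iris = iris[, 1:4], mtcars = mtcars, USArrests = USArrests)

# PCA first, then hierarchical clustering on the principal components.
# nb.clust = -1 cuts the tree automatically at the level HCPC suggests.
res_pca  <- map(datasets, ~ PCA(.x, scale.unit = TRUE, graph = FALSE))
res_hcpc <- map(res_pca, ~ HCPC(.x, nb.clust = -1, graph = FALSE))

# HCPC appends a `clust` column to the clustered data; counting its
# distinct values gives the number of clusters that were retained.
n_clusters <- map_int(res_hcpc, ~ length(unique(.x$data.clust$clust)))
n_clusters
```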
Thereafter, I pluck the variable description outputs from the HCPC results for the stackoverflow dataset.
## [[1]]
##
## Link between the cluster variable and the categorical variables (chi-square test)
## =================================================================================
## p.value df
## company_size_number_q 0.0008083454 9
## l_salary_cat 0.0059078282 6
## country 0.0178111437 12
## years_coded_job_q 0.0412555929 9
##
## Description of each cluster by the categories
## =============================================
## $`1`
## Cla/Mod Mod/Cla Global p.value
## country=India 83.15018 9.120129 8.40000 0.005994449
## years_coded_job_q=[0,3] 79.27928 35.355564 34.15385 0.008642691
## career_satisfaction_q=[0,7] 78.76561 43.069506 41.87692 0.012515020
## career_satisfaction_q=(9,10] 72.23587 11.811973 12.52308 0.029129274
## country=Germany 72.08791 13.177983 14.00000 0.016200808
## company_size_number_q=[1,20] 74.00778 38.208116 39.53846 0.005214632
## v.test
## country=India 2.748085
## years_coded_job_q=[0,3] 2.625874
## career_satisfaction_q=[0,7] 2.497280
## career_satisfaction_q=(9,10] -2.181733
## country=Germany -2.404360
## company_size_number_q=[1,20] -2.793467
##
## $`2`
## Cla/Mod Mod/Cla Global p.value
## country=Germany 7.472527 21.93548 14.00000 0.005977468
## company_size_number_q=[1,20] 5.836576 48.38710 39.53846 0.022404574
## v.test
## country=Germany 2.749015
## company_size_number_q=[1,20] 2.283438
##
## $`3`
## Cla/Mod Mod/Cla Global p.value
## l_salary_cat=(11.2,12.2] 8.594816 56.25000 45.10769 0.000548817
## country=United Kingdom 4.777595 12.94643 18.67692 0.018770752
## company_size_number_q=[1,20] 5.525292 31.69643 39.53846 0.012122333
## career_satisfaction_q=[0,7] 5.510654 33.48214 41.87692 0.007861372
## l_salary_cat=(9.78,11.2] 5.367793 36.16071 46.43077 0.001330406
## v.test
## l_salary_cat=(11.2,12.2] 3.455724
## country=United Kingdom -2.350053
## company_size_number_q=[1,20] -2.508563
## career_satisfaction_q=[0,7] -2.657966
## l_salary_cat=(9.78,11.2] -3.209339
##
## $`4`
## Cla/Mod Mod/Cla Global
## company_size_number_q=[1,20] 14.630350 49.214660 39.53846
## years_coded_job_q=(11,20] 14.285714 28.534031 23.47692
## years_coded_job_q=(3,5] 14.561404 21.727749 17.53846
## career_satisfaction_q=(9,10] 14.987715 15.968586 12.52308
## l_salary_cat=(9.78,11.2] 12.988734 51.308901 46.43077
## company_size_number_q=(100,1e+03] 9.380235 14.659686 18.36923
## company_size_number_q=(1e+03,1e+04] 9.509658 16.753927 20.70769
## country=India 7.326007 5.235602 8.40000
## years_coded_job_q=[0,3] 9.279279 26.963351 34.15385
## p.value v.test
## company_size_number_q=[1,20] 0.0000457833 4.076171
## years_coded_job_q=(11,20] 0.0147813679 2.437692
## years_coded_job_q=(3,5] 0.0249717046 2.241840
## career_satisfaction_q=(9,10] 0.0349839181 2.108544
## l_salary_cat=(9.78,11.2] 0.0423543243 2.030022
## company_size_number_q=(100,1e+03] 0.0428119430 -2.025540
## company_size_number_q=(1e+03,1e+04] 0.0393894845 -2.060095
## country=India 0.0132118822 -2.478006
## years_coded_job_q=[0,3] 0.0013832383 -3.198126
##
##
## Link between the cluster variable and the quantitative variables
## ================================================================
## Eta2 P-value
## data_scientist 0.690538739 0.000000e+00
## graphic_designer 0.439343822 0.000000e+00
## graphics_programming 0.630091254 0.000000e+00
## machine_learning_specialist 0.376932045 0.000000e+00
## systems_administrator 0.580636178 0.000000e+00
## database_administrator 0.362452610 1.449833e-316
## dev_ops 0.172581719 5.555308e-133
## developer_with_stats_math_background 0.079670386 3.890740e-58
## quality_assurance_engineer 0.062517066 3.583074e-45
## web_developer 0.036034427 1.178854e-25
## desktop_applications_developer 0.025811686 2.729205e-18
## mobile_developer 0.023258295 1.815591e-16
## hobby 0.015545650 5.226147e-11
## l_salary_exp 0.007739991 1.385092e-05
## embedded_developer 0.004428085 2.399598e-03
## l_salary 0.004406273 2.481039e-03
## open_source 0.004140204 3.724593e-03
##
## Description of each cluster by quantitative variables
## =====================================================
## $`1`
## v.test Mean in category
## mobile_developer -2.884045 0.1719566091
## open_source -2.999266 0.3085576537
## hobby -6.231020 0.7239855364
## desktop_applications_developer -7.479973 0.2494977903
## developer_with_stats_math_background -10.292243 0.0751305745
## quality_assurance_engineer -10.740220 0.0172760145
## graphic_designer -15.516161 0.0000000000
## dev_ops -16.831046 0.0666934512
## machine_learning_specialist -18.367062 0.0000000000
## graphics_programming -19.564869 0.0000000000
## database_administrator -27.640301 0.0429891523
## data_scientist -28.125231 0.0008035356
## systems_administrator -33.286396 0.0072318200
## Overall mean sd in category
## mobile_developer 0.18276923 0.37734273
## open_source 0.32215385 0.46189807
## hobby 0.75015385 0.44702403
## desktop_applications_developer 0.28215385 0.43272236
## developer_with_stats_math_background 0.10584615 0.26360192
## quality_assurance_engineer 0.03692308 0.13029794
## graphic_designer 0.02215385 0.00000000
## dev_ops 0.11969231 0.24949035
## machine_learning_specialist 0.03076923 0.00000000
## graphics_programming 0.03476923 0.00000000
## database_administrator 0.13446154 0.20283265
## data_scientist 0.07076923 0.02833531
## systems_administrator 0.10707692 0.08473205
## Overall sd p.value
## mobile_developer 0.3864772 3.926022e-03
## open_source 0.4673016 2.706305e-03
## hobby 0.4329238 4.634080e-10
## desktop_applications_developer 0.4500478 7.433789e-14
## developer_with_stats_math_background 0.3076406 7.637634e-25
## quality_assurance_engineer 0.1885730 6.588731e-27
## graphic_designer 0.1471837 2.697119e-54
## dev_ops 0.3246014 1.445262e-63
## machine_learning_specialist 0.1726919 2.411142e-75
## graphics_programming 0.1831948 3.081962e-85
## database_administrator 0.3411475 3.650334e-168
## data_scientist 0.2564390 4.815178e-174
## systems_administrator 0.3092110 6.075388e-243
##
## $`2`
## v.test Mean in category Overall mean
## graphics_programming 45.197044 6.838710e-01 3.476923e-02
## graphic_designer 37.778790 4.580645e-01 2.215385e-02
## mobile_developer 7.382752 4.064516e-01 1.827692e-01
## desktop_applications_developer 6.997370 5.290323e-01 2.821538e-01
## systems_administrator 4.365651 2.129032e-01 1.070769e-01
## hobby 4.130000 8.903226e-01 7.501538e-01
## embedded_developer 3.581162 1.548387e-01 7.907692e-02
## database_administrator 3.174253 2.193548e-01 1.344615e-01
## open_source 3.005503 4.322581e-01 3.221538e-01
## l_salary -1.977571 1.083677e+01 1.096566e+01
## dev_ops -2.421791 5.806452e-02 1.196923e-01
## l_salary_exp -2.574518 6.489707e+04 7.300748e+04
## sd in category Overall sd p.value
## graphics_programming 4.649639e-01 1.831948e-01 0.000000e+00
## graphic_designer 4.982383e-01 1.471837e-01 0.000000e+00
## mobile_developer 4.911707e-01 3.864772e-01 1.550511e-13
## desktop_applications_developer 4.991564e-01 4.500478e-01 2.608125e-12
## systems_administrator 4.093598e-01 3.092110e-01 1.267446e-05
## hobby 3.124873e-01 4.329238e-01 3.627636e-05
## embedded_developer 3.617509e-01 2.698588e-01 3.420688e-04
## database_administrator 4.138095e-01 3.411475e-01 1.502226e-03
## open_source 4.953898e-01 4.673016e-01 2.651423e-03
## l_salary 8.438177e-01 8.313656e-01 4.797713e-02
## dev_ops 2.338654e-01 3.246014e-01 1.544424e-02
## l_salary_exp 3.748896e+04 4.018435e+04 1.003797e-02
##
## $`3`
## v.test Mean in category
## data_scientist 47.016477 8.482143e-01
## machine_learning_specialist 34.922150 4.196429e-01
## developer_with_stats_math_background 15.818639 4.196429e-01
## l_salary_exp 4.443609 8.452153e+04
## l_salary 3.332479 1.114431e+01
## systems_administrator -2.235731 6.250000e-02
## mobile_developer -4.288666 7.589286e-02
## web_developer -8.782802 4.732143e-01
## Overall mean sd in category
## data_scientist 7.076923e-02 3.588131e-01
## machine_learning_specialist 3.076923e-02 4.935005e-01
## developer_with_stats_math_background 1.058462e-01 4.935005e-01
## l_salary_exp 7.300748e+04 4.214726e+04
## l_salary 1.096566e+01 7.666559e-01
## systems_administrator 1.070769e-01 2.420615e-01
## mobile_developer 1.827692e-01 2.648266e-01
## web_developer 7.258462e-01 4.992820e-01
## Overall sd p.value
## data_scientist 2.564390e-01 0.000000e+00
## machine_learning_specialist 1.726919e-01 3.428758e-267
## developer_with_stats_math_background 3.076406e-01 2.314418e-56
## l_salary_exp 4.018435e+04 8.846213e-06
## l_salary 8.313656e-01 8.607594e-04
## systems_administrator 3.092110e-01 2.536940e-02
## mobile_developer 3.864772e-01 1.797495e-05
## web_developer 4.460869e-01 1.594522e-18
##
## $`4`
## v.test Mean in category Overall mean
## systems_administrator 42.637070 0.740837696 0.10707692
## database_administrator 33.463904 0.683246073 0.13446154
## dev_ops 23.533763 0.486910995 0.11969231
## quality_assurance_engineer 14.120239 0.164921466 0.03692308
## web_developer 7.047116 0.876963351 0.72584615
## hobby 5.338633 0.861256545 0.75015385
## desktop_applications_developer 5.229400 0.395287958 0.28215385
## mobile_developer 2.280164 0.225130890 0.18276923
## graphic_designer -3.131167 0.000000000 0.02215385
## machine_learning_specialist -3.391138 0.002617801 0.03076923
## graphics_programming -3.650935 0.002617801 0.03476923
## sd in category Overall sd p.value
## systems_administrator 0.43817486 0.3092110 0.000000e+00
## database_administrator 0.46521057 0.3411475 1.615697e-245
## dev_ops 0.49982865 0.3246014 1.841103e-122
## quality_assurance_engineer 0.37110965 0.1885730 2.850245e-45
## web_developer 0.32847927 0.4460869 1.826647e-12
## hobby 0.34567862 0.4329238 9.365000e-08
## desktop_applications_developer 0.48891245 0.4500478 1.700611e-07
## mobile_developer 0.41766850 0.3864772 2.259798e-02
## graphic_designer 0.00000000 0.1471837 1.741130e-03
## machine_learning_specialist 0.05109744 0.1726919 6.960296e-04
## graphics_programming 0.05109744 0.1831948 2.612871e-04
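When reading the quantitative descriptions above, the sign of `v.test` tells whether a variable's mean inside the cluster lies above (positive) or below (negative) the overall mean. The sketch below checks that reading on a stand-in dataset (`USArrests`, an assumption for illustration); `all_desc` is a hypothetical helper name.

```r
library(FactoMineR)

# Stand-in data (assumption: not the stackoverflow data shown above).
res_pca  <- PCA(USArrests, scale.unit = TRUE, graph = FALSE)
res_hcpc <- HCPC(res_pca, nb.clust = -1, graph = FALSE)

# desc.var$quanti holds one table per cluster; drop clusters for which
# no variable was significant, then stack the remaining tables.
quanti_desc <- Filter(Negate(is.null), res_hcpc$desc.var$quanti)
all_desc    <- do.call(rbind, quanti_desc)

# A positive v.test goes with a cluster mean above the overall mean,
# a negative one with a cluster mean below it.
mean_gap <- all_desc[, "Mean in category"] - all_desc[, "Overall mean"]
table(sign(all_desc[, "v.test"]) == sign(mean_gap))
```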
As hinted at earlier, and similar to the PCA section, 2 summarising list-column structures are shown, gathering raw HCPC outputs plus a few customised transformations performed on an HCPC object created with FactoMineR.
## ---- 3.0.1.1 Long-formatting-RAW-outputs!!! ######
# Alternative 1. Quick checks!!!
# 1. Long-formatting- HCPC-call!
list_hcpc_res_RAW %>%
  select(data_ID, pca_data, quanti, quali, sup, n_clusters) %>%
  gather(hcpc_call, value,
         -data_ID,
         -pca_data)
## # A tibble: 12 x 4
## data_ID pca_data hcpc_call value
## <int> <chr> <chr> <list>
## 1 1 diamonds quanti <named list [3]>
## 2 2 pulsar quanti <named list [3]>
## 3 3 stackoverflow quanti <named list [3]>
## 4 1 diamonds quali <named list [5]>
## 5 2 pulsar quali <named list [5]>
## 6 3 stackoverflow quali <named list [5]>
## 7 1 diamonds sup <int [100]>
## 8 2 pulsar sup <int [100]>
## 9 3 stackoverflow sup <int [100]>
## 10 1 diamonds n_clusters <dbl [1]>
## 11 2 pulsar n_clusters <dbl [1]>
## 12 3 stackoverflow n_clusters <dbl [1]>
# 2. Long-formatting- HCPC-RAW-output!
list_hcpc_res_RAW %>%
  select(data_ID, pca_data,
         eigen, screeplot, var,
         data_clust, desc_var_hcpc, desc_axes_hcpc, desc_ind, call_hcpc) %>%
  gather(hcpc_raw, value,
         -data_ID,
         -pca_data) # Check tails!
## # A tibble: 24 x 4
## data_ID pca_data hcpc_raw value
## <int> <chr> <chr> <list>
## 1 1 diamonds eigen <dbl[,3] [7 x 3]>
## 2 2 pulsar eigen <dbl[,3] [8 x 3]>
## 3 3 stackoverflow eigen <dbl[,3] [16 x 3]>
## 4 1 diamonds screeplot <gg>
## 5 2 pulsar screeplot <gg>
## 6 3 stackoverflow screeplot <gg>
## 7 1 diamonds var <named list [4]>
## 8 2 pulsar var <named list [4]>
## 9 3 stackoverflow var <named list [4]>
## 10 1 diamonds data_clust <df[,15] [3,170 x 15]>
## # ... with 14 more rows
One main driver of this section was to initially discover if some of the qualitative results from the PCA/HCPC analysis parts “matched” the ones from ML-modelling.
In the modelling section, I chose to experiment with a classification problem. A regression analysis using the same list-column workflow was carried out separately.
So a pre-labelling step came first, where I decided which factor variable from each dataset was to be predicted (even a continuous variable transformed into a factor by the earlier feature engineering was considered).
After some data wrangling, I obtained 3 tidy datasets with selected/transformed binary-response variables and original or PCA-transformed variables augmented with feature engineered ones.
PS: For speed considerations, I carried out minimally hypertuned ML modelling using glm, glmnet, random forests and gbm.
Under the same list-column workflow, a tidy list-column was obtained after the modelling phase and subsequent prediction/evaluation calculations.
Below, I pluck the confusion matrix obtained from predicting the stackoverflow-label of new streamed inputs using the best tuned gbm model based on the pca-data.
# cm checks
cm_tune_tb %>%
  left_join(data_names_df,
            by = "row_id") %>%
  filter(data_names == params$dataset) %>%
  filter(model == "gbm_tune") %>%
  filter(source == params$source) %>%
  pluck("cm_tune")
## [[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 75 5
## pos 18 2
##
## Accuracy : 0.77
## 95% CI : (0.6751, 0.8483)
## No Information Rate : 0.93
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.0496
##
## Mcnemar's Test P-Value : 0.01234
##
## Sensitivity : 0.2857
## Specificity : 0.8065
## Pos Pred Value : 0.1000
## Neg Pred Value : 0.9375
## Prevalence : 0.0700
## Detection Rate : 0.0200
## Detection Prevalence : 0.2000
## Balanced Accuracy : 0.5461
##
## 'Positive' Class : pos
##
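As a sanity check on how to read this output, the main statistics can be recomputed by hand from the four counts. The sketch below uses only base R and the numbers printed above; the variable names are mine.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference); "pos" is the positive class.
cm <- matrix(c(75, 18, 5, 2), nrow = 2,
             dimnames = list(prediction = c("neg", "pos"),
                             reference  = c("neg", "pos")))

n  <- sum(cm)
tp <- cm["pos", "pos"]; tn <- cm["neg", "neg"]
fp <- cm["pos", "neg"]; fn <- cm["neg", "pos"]

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)          # share of true "pos" that were found
specificity <- tn / (tn + fp)          # share of true "neg" that were found
ppv         <- tp / (tp + fp)          # Pos Pred Value
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / n    # always predicting the majority class

# Cohen's kappa: observed agreement corrected for chance agreement.
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, balanced = balanced,
        nir = nir, kappa = kappa_hat), 4)
```

The accuracy (0.77) is below the no-information rate (0.93), and kappa is close to 0, so on this stream the model does worse than always predicting the majority class `neg`.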
Below, I pluck the predicted class probabilities for the stackoverflow labels that served for the confusion matrix above. This calculation is also used as an input for the construction of ROC curves, as shown subsequently, for default/customised tuned models.
## [1] 0.04623349 0.28777444 0.24047262 0.02801547 0.08695865 0.27801468
## [7] 0.03083461 0.45144870 0.25979844 0.77895867 0.82764792 0.23869535
## [13] 0.02067273 0.20249019 0.04781930 0.58186475 0.42191466 0.28841435
## [19] 0.86052745 0.33882593 0.70505400 0.11442220 0.06547673 0.16856067
## [25] 0.55471955 0.22567319 0.52726619 0.24070372 0.99619384 0.80957166
## [31] 0.49365518 0.22120480 0.13449928 0.31083609 0.05473438 0.45845952
## [37] 0.63664236 0.05385075 0.17324399 0.11985137 0.59050201 0.87836788
## [43] 0.55754714 0.09575299 0.99336434 0.04221202 0.13946449 0.56816937
## [49] 0.06170841 0.43994040 0.06441524 0.77316838 0.43119490 0.25148807
## [55] 0.62732005 0.06628451 0.06400821 0.10916509 0.80913860 0.61045660
## [61] 0.33346919 0.03253836 0.01125492 0.93072678 0.54720772 0.22763178
## [67] 0.89907382 0.05957049 0.11485895 0.07221109 0.33105894 0.45079891
## [73] 0.62477382 0.51035073 0.22997399 0.54831521 0.54983415 0.78122004
## [79] 0.05958379 0.42903235 0.23619830 0.04487570 0.97987164 0.35635075
## [85] 0.15476779 0.37129253 0.62349821 0.36860580 0.10918816 0.99499124
## [91] 0.39351347 0.63044086 0.04812088 0.09502582 0.78122368 0.21921873
## [97] 0.38762517 0.27041737 0.61323845 0.96008331
## [[1]]
## glmnet ranger gbm glmnet_tune rf_tune gbm_tune
## neg vs. pos 0.8156863 0.9137255 0.8862745 0.8376471 0.9286275 0.9231373
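The second output appears to hold one AUC value per model. As a reminder of what that number measures, here is a base-R sketch of the rank-based (Mann-Whitney) AUC formula on a toy example. The four probabilities and labels are made up for illustration, and `auc_rank()` is a hypothetical helper, not the function used in this analysis.

```r
# AUC as the probability that a randomly chosen positive case gets a
# higher predicted probability than a randomly chosen negative case.
auc_rank <- function(prob_pos, is_pos) {
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(prob_pos)  # average ranks, so ties count as one half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up): two negatives, two positives.
toy_prob  <- c(0.10, 0.40, 0.35, 0.80)
toy_label <- c(FALSE, FALSE, TRUE, TRUE)

auc_rank(toy_prob, toy_label)  # 0.75: 3 of the 4 pos/neg pairs are ordered correctly
```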
To illustrate again how a list-column workflow is very handy at all steps of the analysis, the following tibble collects all intermediary and final results in a tidy format.
## # A tibble: 6 x 15
## row_id data_ID data_names source train test data_mtry caret_stack_tune
## <dbl> <int> <chr> <chr> <lis> <lis> <int> <list>
## 1 1 1 diamonds_ origi~ <df[~ <df[~ 14 <caretLst>
## 2 2 2 pulsar_ origi~ <df[~ <df[~ 11 <caretLst>
## 3 3 3 stackover~ origi~ <df[~ <df[~ 23 <caretLst>
## 4 4 1 diamonds_ pca <df[~ <df[~ 6 <caretLst>
## 5 5 2 pulsar_ pca <df[~ <df[~ 10 <caretLst>
## 6 6 3 stackover~ pca <df[~ <df[~ 11 <caretLst>
## # ... with 7 more variables: results <list>, dotplot <list>, splom <list>,
## # cm <list>, roc_curves <list>, eval_tune <list>, data <list>
Further restructuring and data-wrangling can continue near-endlessly. Below, I was curious about whether PCA-transformed predictors brought additional performance benefits in the modelling process.
### Superiority of PCA transformation sometimes!!!
eval_up_tune_original_pca %>%
  mutate(ranking = value_org < value_pca)
## # A tibble: 18 x 6
## data_ID model_data metric value_org value_pca ranking
## <dbl> <chr> <chr> <dbl> <dbl> <lgl>
## 1 1 gbm auc 0.886 0.912 TRUE
## 2 1 gbm_tune auc 0.923 0.922 FALSE
## 3 1 glmnet auc 0.816 0.858 TRUE
## 4 1 glmnet_tune auc 0.838 0.859 TRUE
## 5 1 ranger auc 0.914 0.851 FALSE
## 6 1 ranger_tune auc 0.929 0.912 FALSE
## 7 2 gbm auc 0.877 0.962 TRUE
## 8 2 gbm_tune auc 0.938 0.966 TRUE
## 9 2 glmnet auc 0.960 0.961 TRUE
## 10 2 glmnet_tune auc 0.965 0.962 FALSE
## 11 2 ranger auc 0.925 0.886 FALSE
## 12 2 ranger_tune auc 0.945 0.953 TRUE
## 13 3 gbm auc 0.591 0.547 FALSE
## 14 3 gbm_tune auc 0.659 0.538 FALSE
## 15 3 glmnet auc 0.648 0.455 FALSE
## 16 3 glmnet_tune auc 0.608 0.472 FALSE
## 17 3 ranger auc 0.531 0.657 TRUE
## 18 3 ranger_tune auc 0.628 0.593 FALSE
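A sketch of how such a wide comparison table might be derived from a long evaluation table. The six AUC values are copied from rows 1, 5 and 7 above; the long layout, the `source` labels `org`/`pca` and the name `eval_long` are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long format: one row per dataset, model, metric and data source.
eval_long <- tribble(
  ~data_ID, ~model_data, ~metric, ~source, ~value,
  1, "gbm",    "auc", "org", 0.886,
  1, "gbm",    "auc", "pca", 0.912,
  1, "ranger", "auc", "org", 0.914,
  1, "ranger", "auc", "pca", 0.851,
  2, "gbm",    "auc", "org", 0.877,
  2, "gbm",    "auc", "pca", 0.962
)

# Spread the two sources side by side, then flag where PCA wins.
eval_wide <- eval_long %>%
  pivot_wider(names_from = source, values_from = value,
              names_prefix = "value_") %>%
  mutate(ranking = value_org < value_pca)

# Share of models per dataset for which the PCA-based data scored higher.
eval_wide %>%
  group_by(data_ID) %>%
  summarise(share_pca_better = mean(ranking))
```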
One of the main objectives was to simulate the streaming of new inputs in order to evaluate the different ML models, as an extension of the CV train/validation phase used to select the best performing model. That is, usually a single, best performing model is selected at the final step, but for learning purposes the same 4 models were kept and evaluated on unseen, streamed test sets.
Below, I pluck the stackoverflow dataset and, for each model, I show the evolution of different metrics calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).
# A. Time Series for each DATA with facets MODEL/METRIC (3 datasets)
map(params$dataset,
    fun_model_eval_stream_TS)
## [[1]]
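The streaming idea can be sketched in base R as an expanding-window evaluation: fit on the historical data, score the next unseen stream, then fold that stream into the history. Everything below is an assumption for illustration: the data are simulated, the model is a plain `glm`, and the stream size of 100 simply mirrors the 100 test observations seen earlier.

```r
set.seed(42)

# Simulated binary-classification data (assumption: stand-in for the
# real datasets); "pos" is more likely when x1 + 0.5 * x2 is large.
n   <- 600
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$label <- factor(ifelse(sim$x1 + 0.5 * sim$x2 + rnorm(n) > 0, "pos", "neg"),
                    levels = c("neg", "pos"))

# The first 300 rows are historical; the rest arrives as 3 streams of 100.
history <- sim[1:300, ]
streams <- split(sim[301:600, ], rep(1:3, each = 100))

stream_acc <- numeric(length(streams))
for (i in seq_along(streams)) {
  # Fit on everything seen so far, evaluate on the new, unseen stream.
  fit      <- glm(label ~ x1 + x2, data = history, family = binomial)
  prob_pos <- predict(fit, newdata = streams[[i]], type = "response")
  pred     <- ifelse(prob_pos > 0.5, "pos", "neg")
  stream_acc[i] <- mean(pred == streams[[i]]$label)

  # The evaluated stream becomes part of the historical data.
  history <- rbind(history, streams[[i]])
}

stream_acc
```

Plotting `stream_acc` against the stream index gives the kind of metric-over-time view described above, here for a single model and a single metric.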
Below, I pluck the auc-performance metric, and, for each dataset and model, I compare performances using both the original and PCA-transformed data along with its evolution calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).
# B. Facets for each StreamDate/DATA with Source as geom_paths (4 metrics) !!!
map(params$metric,
    fun_eval_pca_original_path)
## [[1]]
Below, I define another plot-function and pluck the auc-performance metric based on both the original and PCA-transformed data, and, for each dataset and model, I show its evolution calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).
# C. Time Series of Original Vs PCA per Metrics facets by data/model (4 metrics)
fun_eval_pca_original_stream_TS <- function(eval) {
  eval_output_tidy %>%
    left_join(eval_output_date_index,
              by = "stream_date") %>%
    # Keep only the requested metric
    filter(metric == eval) %>%
    group_by(stream_date, metric, data_names) %>%
    # mutate(model_data = fct_reorder(model_data, value, mean)) %>%
    ungroup() %>%
    ggplot() +
    # One line per source (original vs PCA) across the stream dates
    geom_line(aes(factor(date), value,
                  color = source,
                  group = source)) +
    # Rows = datasets, columns = models
    ggplot2::facet_grid(factor(data_names) ~ factor(model_data),
                        scales = "free") +
    ggtitle(eval)
}
map(eval_output_tb %>%
      pull(metric) %>%
      unique() %>%
      str_subset(params$metric), # acc
    fun_eval_pca_original_stream_TS)
Below, I pluck the stackoverflow dataset and, for each model, I show the evolution of different metrics calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).
# D. Time Series of Original Vs PCA per DATA facets by metrics/model (3 data!!!)
map(params$dataset,
    fun_eval_pca_original_stream_TS)
## [[1]]
Variable importance outputs from random forests/gbm are very informative and can add insights to the ones obtained from the more exploratory PCA/HCPC analysis.
Below, I pluck the barcharts drawn for the following combination:
This shows the simulated evolution of the most important variables in predicting the different labels of the stackoverflow dataset.
## [[1]]