Exploratory Analysis and Classification Models with PCA pre-processing

0.1 Introduction

These are some thoughts about the list-column workflow for a multi-dataset analysis, here consisting of the following 3 datasets:

diamonds
pulsar
stackoverflow

PS: A workflow alternative to the tidymodel framework, which is postponed for later.

These datasets arecome from R-packages or from a few courses from DataCamp.

The topics touched upon here are:

List-Column workflow
Database reading
Simulation of new data inputs
Exploratory Data Analysis with correlation charts (performance analysics and ggally)
Feature engineering with simple creation/transformation of continuous variables
Feature engineering with simple binning of continuous variables via HCPC
PCA/HCPC analysis
ML modelling with glm, glmnet, random forests and gbm
Variable Importance analysis from random forest/gbm modelling
Bulk/summary data backups from time-consuming ML-modelling (.rds)
Database saving
Analysis of results:
- Performance metrics evolution along the many simulations (new data inputs)
- Comparative modelling analysis of original Vs PCA-transformed data
- Variable importance evolution along the many simulations (new data inputs)

The main learning resources are from:

DataCamp
FactoMineR (Fun Mooc)
RsquareAcademy (database)
Many blogs and other sources

0.2 Outline

This document is structured as follows: libraries and data are first loaded; then a few data exploration and trasformation is carried out; an exploratory PCA/HCP analysis is performed; classification modelling simulation exercises and variable importance analysis are finally done.

PS: An additional step would be a more in-depth comparison between the results from the exploratory part of the analysis (PCA/HCPC) and the ML part (best overall models and in particular the variable importance outputs).

0.3 Exploratory Data Analysis with GGally

I started wiht a brief focus on the continuous variables of all 3 datasets to see some general correlation or non linearity patterns, which served as a first step for the feature engineered process:

Feature engineering (FE) with simple creation/transformation of continuous variables
Feature engineering (FE) with simple binning of continuous variables via HCPC.

Both steps with a potential to enrich the multivariate exploratory analysis and the modelling performances of the considrered ML models.

In the selection process of variable candidates for FE, I plotted a few correlation charts using PerformanceAnalytics. Subsequent to this, I selected a few continuous variables that showed non linearity patterns as candidates for binning operations into categorical variables using HCPC transformations.

Additional correlation/histogrram plots using ggally can be then be done for additional exploration purposes.

# checks
list_data_ggally_label %>% 
  pluck("ggally_label") %>% 
  pluck(1)

0.4 PCA part

For more data explorations, a PCA analysis followed. Part of its outcome will be used in the ML-modelling part (even though the caret package offers pca-transformations on the fly). The reason being the use of FactoMineR as the package of reference for PCA/HCPC analysis and, most importantly, its many informative and enriching analysis extentions. Subsequently, an HCPC analysis will complement the PCA one for additional insights.

Working with 3 datasets (or more) at the same time, the list-column workflow comes very handy. The tibble bellow collects basic information for the first PCA-call using FactoMineR.

## ---- 2.3.1.0. Long-formatting-RAW-outputs!!!  ######

# 1. Long-formatting- PCA-call!
list_pca_res_RAW %>% 
  select(data_ID, pca_data, quanti, quali, sup) %>% 
  gather(pca_call, value,
         -data_ID,
         -pca_data)

## # A tibble: 9 x 4
##   data_ID pca_data      pca_call value           
##     <int> <chr>         <chr>    <list>          
## 1       1 diamonds      quanti   <named list [3]>
## 2       2 pulsar        quanti   <named list [3]>
## 3       3 stackoverflow quanti   <named list [3]>
## 4       1 diamonds      quali    <named list [5]>
## 5       2 pulsar        quali    <named list [5]>
## 6       3 stackoverflow quali    <named list [5]>
## 7       1 diamonds      sup      <int [100]>     
## 8       2 pulsar        sup      <int [100]>     
## 9       3 stackoverflow sup      <int [100]>

The following tibble goes a step deeper and illustrates the many objects than can be stored in the same tidy format that can be used for further analysis. Among others, screeplots, variable correlations (loadings) and component-projections of the 3 datasets’ observations with focus on the main components can be plucked when needed.

# 2. Long-formatting- PCA-RAW-output!
list_pca_res_RAW %>% 
  select(data_ID, pca_data, 
         res.PCA, eigen, screeplot, var, ind, call) %>% 
  gather(pca_raw, value,
         -data_ID,
         -pca_data)

## # A tibble: 18 x 4
##    data_ID pca_data      pca_raw   value             
##      <int> <chr>         <chr>     <list>            
##  1       1 diamonds      res.PCA   <PCA>             
##  2       2 pulsar        res.PCA   <PCA>             
##  3       3 stackoverflow res.PCA   <PCA>             
##  4       1 diamonds      eigen     <dbl[,3] [7 x 3]> 
##  5       2 pulsar        eigen     <dbl[,3] [8 x 3]> 
##  6       3 stackoverflow eigen     <dbl[,3] [16 x 3]>
##  7       1 diamonds      screeplot <gg>              
##  8       2 pulsar        screeplot <gg>              
##  9       3 stackoverflow screeplot <gg>              
## 10       1 diamonds      var       <named list [4]>  
## 11       2 pulsar        var       <named list [4]>  
## 12       3 stackoverflow var       <named list [4]>  
## 13       1 diamonds      ind       <named list [4]>  
## 14       2 pulsar        ind       <named list [4]>  
## 15       3 stackoverflow ind       <named list [4]>  
## 16       1 diamonds      call      <named list [12]> 
## 17       2 pulsar        call      <named list [12]> 
## 18       3 stackoverflow call      <named list [12]>

Below, I pluck the typical list-structure of the different PCA outputs (performed for instance on the stackoverflow dataset) to remind me of the different outputs of a PCA-object created from FactoMineR.

# 2. Long-formatting- PCA-sup!
list_pca_res_desc_supl %>% 
  select(data_ID, pca_data, 
         res.PCA, eigen, screeplot, label, catdes, condes, dimdesc) %>% # ADDED condes
  gather(pca_sup, value,
         -data_ID,
         -pca_data) %>% 
  filter(pca_data == params$dataset) %>% 
  filter(pca_sup == "res.PCA") %>% 
  pluck("value")

## [[1]]
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 3350 individuals, described by 23 variables
## *The results are available in the following objects:
## 
##    name               
## 1  "$eig"             
## 2  "$var"             
## 3  "$var$coord"       
## 4  "$var$cor"         
## 5  "$var$cos2"        
## 6  "$var$contrib"     
## 7  "$ind"             
## 8  "$ind$coord"       
## 9  "$ind$cos2"        
## 10 "$ind$contrib"     
## 11 "$ind.sup"         
## 12 "$ind.sup$coord"   
## 13 "$ind.sup$cos2"    
## 14 "$quanti.sup"      
## 15 "$quanti.sup$coord"
## 16 "$quanti.sup$cor"  
## 17 "$quali.sup"       
## 18 "$quali.sup$coord" 
## 19 "$quali.sup$v.test"
## 20 "$call"            
## 21 "$call$centre"     
## 22 "$call$ecart.type" 
## 23 "$call$row.w"      
## 24 "$call$col.w"      
##    description                                              
## 1  "eigenvalues"                                            
## 2  "results for the variables"                              
## 3  "coord. for the variables"                               
## 4  "correlations variables - dimensions"                    
## 5  "cos2 for the variables"                                 
## 6  "contributions of the variables"                         
## 7  "results for the individuals"                            
## 8  "coord. for the individuals"                             
## 9  "cos2 for the individuals"                               
## 10 "contributions of the individuals"                       
## 11 "results for the supplementary individuals"              
## 12 "coord. for the supplementary individuals"               
## 13 "cos2 for the supplementary individuals"                 
## 14 "results for the supplementary quantitative variables"   
## 15 "coord. for the supplementary quantitative variables"    
## 16 "correlations suppl. quantitative variables - dimensions"
## 17 "results for the supplementary categorical variables"    
## 18 "coord. for the supplementary categories"                
## 19 "v-test of the supplementary categories"                 
## 20 "summary statistics"                                     
## 21 "mean of the variables"                                  
## 22 "standard error of the variables"                        
## 23 "weights for the individuals"                            
## 24 "weights for the variables"

# %>% 
#   pluck(1)

0.5 HCPC part

In this section, a hierachical clustering is carried out on the PCA-results of each datasets. Again, the list-column workflow is extremely useful in its tidiness.

First, I pluck from the resulting list-column-tibble (shown later) the list-column storing the optimal number of clusters for each datasets as estimated by the HCPC function.

### Check clusters
list_hcpc_res %>% 
  pluck("n_clusters") %>% 
  set_names(paste0(c("diamonds", "pulsar", "stackoverflow"),
                   "_clusters"))

##      diamonds_clusters        pulsar_clusters stackoverflow_clusters 
##                      3                      4                      4

Thereafter, I pluck the variable description outputs from the HCPC results for the stackoverflow dataset.

# checks
list_hcpc_res_RAW %>%
  filter(pca_data == params$dataset) %>% 
  pluck("desc_var_hcpc")

## [[1]]
## 
## Link between the cluster variable and the categorical variables (chi-square test)
## =================================================================================
##                            p.value df
## company_size_number_q 0.0008083454  9
## l_salary_cat          0.0059078282  6
## country               0.0178111437 12
## years_coded_job_q     0.0412555929  9
## 
## Description of each cluster by the categories
## =============================================
## $`1`
##                               Cla/Mod   Mod/Cla   Global     p.value
## country=India                83.15018  9.120129  8.40000 0.005994449
## years_coded_job_q=[0,3]      79.27928 35.355564 34.15385 0.008642691
## career_satisfaction_q=[0,7]  78.76561 43.069506 41.87692 0.012515020
## career_satisfaction_q=(9,10] 72.23587 11.811973 12.52308 0.029129274
## country=Germany              72.08791 13.177983 14.00000 0.016200808
## company_size_number_q=[1,20] 74.00778 38.208116 39.53846 0.005214632
##                                 v.test
## country=India                 2.748085
## years_coded_job_q=[0,3]       2.625874
## career_satisfaction_q=[0,7]   2.497280
## career_satisfaction_q=(9,10] -2.181733
## country=Germany              -2.404360
## company_size_number_q=[1,20] -2.793467
## 
## $`2`
##                               Cla/Mod  Mod/Cla   Global     p.value
## country=Germany              7.472527 21.93548 14.00000 0.005977468
## company_size_number_q=[1,20] 5.836576 48.38710 39.53846 0.022404574
##                                v.test
## country=Germany              2.749015
## company_size_number_q=[1,20] 2.283438
## 
## $`3`
##                               Cla/Mod  Mod/Cla   Global     p.value
## l_salary_cat=(11.2,12.2]     8.594816 56.25000 45.10769 0.000548817
## country=United Kingdom       4.777595 12.94643 18.67692 0.018770752
## company_size_number_q=[1,20] 5.525292 31.69643 39.53846 0.012122333
## career_satisfaction_q=[0,7]  5.510654 33.48214 41.87692 0.007861372
## l_salary_cat=(9.78,11.2]     5.367793 36.16071 46.43077 0.001330406
##                                 v.test
## l_salary_cat=(11.2,12.2]      3.455724
## country=United Kingdom       -2.350053
## company_size_number_q=[1,20] -2.508563
## career_satisfaction_q=[0,7]  -2.657966
## l_salary_cat=(9.78,11.2]     -3.209339
## 
## $`4`
##                                       Cla/Mod   Mod/Cla   Global
## company_size_number_q=[1,20]        14.630350 49.214660 39.53846
## years_coded_job_q=(11,20]           14.285714 28.534031 23.47692
## years_coded_job_q=(3,5]             14.561404 21.727749 17.53846
## career_satisfaction_q=(9,10]        14.987715 15.968586 12.52308
## l_salary_cat=(9.78,11.2]            12.988734 51.308901 46.43077
## company_size_number_q=(100,1e+03]    9.380235 14.659686 18.36923
## company_size_number_q=(1e+03,1e+04]  9.509658 16.753927 20.70769
## country=India                        7.326007  5.235602  8.40000
## years_coded_job_q=[0,3]              9.279279 26.963351 34.15385
##                                          p.value    v.test
## company_size_number_q=[1,20]        0.0000457833  4.076171
## years_coded_job_q=(11,20]           0.0147813679  2.437692
## years_coded_job_q=(3,5]             0.0249717046  2.241840
## career_satisfaction_q=(9,10]        0.0349839181  2.108544
## l_salary_cat=(9.78,11.2]            0.0423543243  2.030022
## company_size_number_q=(100,1e+03]   0.0428119430 -2.025540
## company_size_number_q=(1e+03,1e+04] 0.0393894845 -2.060095
## country=India                       0.0132118822 -2.478006
## years_coded_job_q=[0,3]             0.0013832383 -3.198126
## 
## 
## Link between the cluster variable and the quantitative variables
## ================================================================
##                                             Eta2       P-value
## data_scientist                       0.690538739  0.000000e+00
## graphic_designer                     0.439343822  0.000000e+00
## graphics_programming                 0.630091254  0.000000e+00
## machine_learning_specialist          0.376932045  0.000000e+00
## systems_administrator                0.580636178  0.000000e+00
## database_administrator               0.362452610 1.449833e-316
## dev_ops                              0.172581719 5.555308e-133
## developer_with_stats_math_background 0.079670386  3.890740e-58
## quality_assurance_engineer           0.062517066  3.583074e-45
## web_developer                        0.036034427  1.178854e-25
## desktop_applications_developer       0.025811686  2.729205e-18
## mobile_developer                     0.023258295  1.815591e-16
## hobby                                0.015545650  5.226147e-11
## l_salary_exp                         0.007739991  1.385092e-05
## embedded_developer                   0.004428085  2.399598e-03
## l_salary                             0.004406273  2.481039e-03
## open_source                          0.004140204  3.724593e-03
## 
## Description of each cluster by quantitative variables
## =====================================================
## $`1`
##                                          v.test Mean in category
## mobile_developer                      -2.884045     0.1719566091
## open_source                           -2.999266     0.3085576537
## hobby                                 -6.231020     0.7239855364
## desktop_applications_developer        -7.479973     0.2494977903
## developer_with_stats_math_background -10.292243     0.0751305745
## quality_assurance_engineer           -10.740220     0.0172760145
## graphic_designer                     -15.516161     0.0000000000
## dev_ops                              -16.831046     0.0666934512
## machine_learning_specialist          -18.367062     0.0000000000
## graphics_programming                 -19.564869     0.0000000000
## database_administrator               -27.640301     0.0429891523
## data_scientist                       -28.125231     0.0008035356
## systems_administrator                -33.286396     0.0072318200
##                                      Overall mean sd in category
## mobile_developer                       0.18276923     0.37734273
## open_source                            0.32215385     0.46189807
## hobby                                  0.75015385     0.44702403
## desktop_applications_developer         0.28215385     0.43272236
## developer_with_stats_math_background   0.10584615     0.26360192
## quality_assurance_engineer             0.03692308     0.13029794
## graphic_designer                       0.02215385     0.00000000
## dev_ops                                0.11969231     0.24949035
## machine_learning_specialist            0.03076923     0.00000000
## graphics_programming                   0.03476923     0.00000000
## database_administrator                 0.13446154     0.20283265
## data_scientist                         0.07076923     0.02833531
## systems_administrator                  0.10707692     0.08473205
##                                      Overall sd       p.value
## mobile_developer                      0.3864772  3.926022e-03
## open_source                           0.4673016  2.706305e-03
## hobby                                 0.4329238  4.634080e-10
## desktop_applications_developer        0.4500478  7.433789e-14
## developer_with_stats_math_background  0.3076406  7.637634e-25
## quality_assurance_engineer            0.1885730  6.588731e-27
## graphic_designer                      0.1471837  2.697119e-54
## dev_ops                               0.3246014  1.445262e-63
## machine_learning_specialist           0.1726919  2.411142e-75
## graphics_programming                  0.1831948  3.081962e-85
## database_administrator                0.3411475 3.650334e-168
## data_scientist                        0.2564390 4.815178e-174
## systems_administrator                 0.3092110 6.075388e-243
## 
## $`2`
##                                   v.test Mean in category Overall mean
## graphics_programming           45.197044     6.838710e-01 3.476923e-02
## graphic_designer               37.778790     4.580645e-01 2.215385e-02
## mobile_developer                7.382752     4.064516e-01 1.827692e-01
## desktop_applications_developer  6.997370     5.290323e-01 2.821538e-01
## systems_administrator           4.365651     2.129032e-01 1.070769e-01
## hobby                           4.130000     8.903226e-01 7.501538e-01
## embedded_developer              3.581162     1.548387e-01 7.907692e-02
## database_administrator          3.174253     2.193548e-01 1.344615e-01
## open_source                     3.005503     4.322581e-01 3.221538e-01
## l_salary                       -1.977571     1.083677e+01 1.096566e+01
## dev_ops                        -2.421791     5.806452e-02 1.196923e-01
## l_salary_exp                   -2.574518     6.489707e+04 7.300748e+04
##                                sd in category   Overall sd      p.value
## graphics_programming             4.649639e-01 1.831948e-01 0.000000e+00
## graphic_designer                 4.982383e-01 1.471837e-01 0.000000e+00
## mobile_developer                 4.911707e-01 3.864772e-01 1.550511e-13
## desktop_applications_developer   4.991564e-01 4.500478e-01 2.608125e-12
## systems_administrator            4.093598e-01 3.092110e-01 1.267446e-05
## hobby                            3.124873e-01 4.329238e-01 3.627636e-05
## embedded_developer               3.617509e-01 2.698588e-01 3.420688e-04
## database_administrator           4.138095e-01 3.411475e-01 1.502226e-03
## open_source                      4.953898e-01 4.673016e-01 2.651423e-03
## l_salary                         8.438177e-01 8.313656e-01 4.797713e-02
## dev_ops                          2.338654e-01 3.246014e-01 1.544424e-02
## l_salary_exp                     3.748896e+04 4.018435e+04 1.003797e-02
## 
## $`3`
##                                         v.test Mean in category
## data_scientist                       47.016477     8.482143e-01
## machine_learning_specialist          34.922150     4.196429e-01
## developer_with_stats_math_background 15.818639     4.196429e-01
## l_salary_exp                          4.443609     8.452153e+04
## l_salary                              3.332479     1.114431e+01
## systems_administrator                -2.235731     6.250000e-02
## mobile_developer                     -4.288666     7.589286e-02
## web_developer                        -8.782802     4.732143e-01
##                                      Overall mean sd in category
## data_scientist                       7.076923e-02   3.588131e-01
## machine_learning_specialist          3.076923e-02   4.935005e-01
## developer_with_stats_math_background 1.058462e-01   4.935005e-01
## l_salary_exp                         7.300748e+04   4.214726e+04
## l_salary                             1.096566e+01   7.666559e-01
## systems_administrator                1.070769e-01   2.420615e-01
## mobile_developer                     1.827692e-01   2.648266e-01
## web_developer                        7.258462e-01   4.992820e-01
##                                        Overall sd       p.value
## data_scientist                       2.564390e-01  0.000000e+00
## machine_learning_specialist          1.726919e-01 3.428758e-267
## developer_with_stats_math_background 3.076406e-01  2.314418e-56
## l_salary_exp                         4.018435e+04  8.846213e-06
## l_salary                             8.313656e-01  8.607594e-04
## systems_administrator                3.092110e-01  2.536940e-02
## mobile_developer                     3.864772e-01  1.797495e-05
## web_developer                        4.460869e-01  1.594522e-18
## 
## $`4`
##                                   v.test Mean in category Overall mean
## systems_administrator          42.637070      0.740837696   0.10707692
## database_administrator         33.463904      0.683246073   0.13446154
## dev_ops                        23.533763      0.486910995   0.11969231
## quality_assurance_engineer     14.120239      0.164921466   0.03692308
## web_developer                   7.047116      0.876963351   0.72584615
## hobby                           5.338633      0.861256545   0.75015385
## desktop_applications_developer  5.229400      0.395287958   0.28215385
## mobile_developer                2.280164      0.225130890   0.18276923
## graphic_designer               -3.131167      0.000000000   0.02215385
## machine_learning_specialist    -3.391138      0.002617801   0.03076923
## graphics_programming           -3.650935      0.002617801   0.03476923
##                                sd in category Overall sd       p.value
## systems_administrator              0.43817486  0.3092110  0.000000e+00
## database_administrator             0.46521057  0.3411475 1.615697e-245
## dev_ops                            0.49982865  0.3246014 1.841103e-122
## quality_assurance_engineer         0.37110965  0.1885730  2.850245e-45
## web_developer                      0.32847927  0.4460869  1.826647e-12
## hobby                              0.34567862  0.4329238  9.365000e-08
## desktop_applications_developer     0.48891245  0.4500478  1.700611e-07
## mobile_developer                   0.41766850  0.3864772  2.259798e-02
## graphic_designer                   0.00000000  0.1471837  1.741130e-03
## machine_learning_specialist        0.05109744  0.1726919  6.960296e-04
## graphics_programming               0.05109744  0.1831948  2.612871e-04

As hinted at earlier and similar to the PCA-analysis section, 2 summarising list-column-structures are shown gathering raw HCPC-outputs plus a few customised transformations performed on a HCPC-object created from FactoMineR.

## ---- 3.0.1.1 Long-formatting-RAW-outputs!!!  ######

# Alternative 1. Quick checks!!!
# 1. Long-formatting- HCPC-call!
list_hcpc_res_RAW %>% 
  select(data_ID, pca_data, quanti, quali, sup, n_clusters) %>% 
  gather(hcpc_call, value,
         -data_ID,
         -pca_data)

## # A tibble: 12 x 4
##    data_ID pca_data      hcpc_call  value           
##      <int> <chr>         <chr>      <list>          
##  1       1 diamonds      quanti     <named list [3]>
##  2       2 pulsar        quanti     <named list [3]>
##  3       3 stackoverflow quanti     <named list [3]>
##  4       1 diamonds      quali      <named list [5]>
##  5       2 pulsar        quali      <named list [5]>
##  6       3 stackoverflow quali      <named list [5]>
##  7       1 diamonds      sup        <int [100]>     
##  8       2 pulsar        sup        <int [100]>     
##  9       3 stackoverflow sup        <int [100]>     
## 10       1 diamonds      n_clusters <dbl [1]>       
## 11       2 pulsar        n_clusters <dbl [1]>       
## 12       3 stackoverflow n_clusters <dbl [1]>

# 2. Long-formatting- HCPC-RAW-output!
list_hcpc_res_RAW %>% 
  select(data_ID, pca_data, 
         eigen, screeplot, var, 
         data_clust, desc_var_hcpc, desc_axes_hcpc, desc_ind, call_hcpc) %>% 
  gather(hcpc_raw, value,
         -data_ID,
         -pca_data) # Check tails!

## # A tibble: 24 x 4
##    data_ID pca_data      hcpc_raw   value                 
##      <int> <chr>         <chr>      <list>                
##  1       1 diamonds      eigen      <dbl[,3] [7 x 3]>     
##  2       2 pulsar        eigen      <dbl[,3] [8 x 3]>     
##  3       3 stackoverflow eigen      <dbl[,3] [16 x 3]>    
##  4       1 diamonds      screeplot  <gg>                  
##  5       2 pulsar        screeplot  <gg>                  
##  6       3 stackoverflow screeplot  <gg>                  
##  7       1 diamonds      var        <named list [4]>      
##  8       2 pulsar        var        <named list [4]>      
##  9       3 stackoverflow var        <named list [4]>      
## 10       1 diamonds      data_clust <df[,15] [3,170 x 15]>
## # ... with 14 more rows

0.6 ML part

One main driver of this section was to initially discover if some of the qualitative results from the PCA/HCPC analysis parts “matched” the ones from ML-modelling.

In the modelling section, I chose to experiment with a classification problem. A regression analysis using the same list-colum-workflow was carried out separately.

So a pre-labelling step came first, where I decided which factor variable from each single dataset where to be predicted (even a continous-transformed factor variable from earlier feature engineering was considered).

After a few data-wrangling, I obtained 3 tidy datasets with selected/transformed binary-reponse variables, original or PCA-transformed variables augmented with feature engineered ones.

PS: I carried out minimally hypertuned ML-modelling using glm, glmnet, random forests and gbm for speeding considerations.

Under the same list-column workflow, a tidy list-column was obtained after the modelling phase and subsequent prediction/evaluation calculations.

Below, I pluck the confusion matrix obtained from predicting the stackoverflow-label of new streamed inputs using the best tuned gbm model based on the pca-data.

# cm checks
cm_tune_tb %>% 
  left_join(data_names_df, 
            by = "row_id") %>% 
  filter(data_names == params$dataset) %>% 
  filter(model == "gbm_tune") %>% 
  filter(source == params$source) %>% 
  pluck("cm_tune")

## [[1]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  75   5
##        pos  18   2
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.6751, 0.8483)
##     No Information Rate : 0.93            
##     P-Value [Acc > NIR] : 1.00000         
##                                           
##                   Kappa : 0.0496          
##                                           
##  Mcnemar's Test P-Value : 0.01234         
##                                           
##             Sensitivity : 0.2857          
##             Specificity : 0.8065          
##          Pos Pred Value : 0.1000          
##          Neg Pred Value : 0.9375          
##              Prevalence : 0.0700          
##          Detection Rate : 0.0200          
##    Detection Prevalence : 0.2000          
##       Balanced Accuracy : 0.5461          
##                                           
##        'Positive' Class : pos             
##

Below, I pluck the predicted classes for the stackoverflow-labels that served for the confusion matrix above. This calculation is also used as an input for the construction of ROC-curves, as shown subsequently, for default/customised tuned models .

# prediction checks
pred_tune_tb %>% 
  pluck("pred_tune") %>% 
  pluck(1)

##   [1] 0.04623349 0.28777444 0.24047262 0.02801547 0.08695865 0.27801468
##   [7] 0.03083461 0.45144870 0.25979844 0.77895867 0.82764792 0.23869535
##  [13] 0.02067273 0.20249019 0.04781930 0.58186475 0.42191466 0.28841435
##  [19] 0.86052745 0.33882593 0.70505400 0.11442220 0.06547673 0.16856067
##  [25] 0.55471955 0.22567319 0.52726619 0.24070372 0.99619384 0.80957166
##  [31] 0.49365518 0.22120480 0.13449928 0.31083609 0.05473438 0.45845952
##  [37] 0.63664236 0.05385075 0.17324399 0.11985137 0.59050201 0.87836788
##  [43] 0.55754714 0.09575299 0.99336434 0.04221202 0.13946449 0.56816937
##  [49] 0.06170841 0.43994040 0.06441524 0.77316838 0.43119490 0.25148807
##  [55] 0.62732005 0.06628451 0.06400821 0.10916509 0.80913860 0.61045660
##  [61] 0.33346919 0.03253836 0.01125492 0.93072678 0.54720772 0.22763178
##  [67] 0.89907382 0.05957049 0.11485895 0.07221109 0.33105894 0.45079891
##  [73] 0.62477382 0.51035073 0.22997399 0.54831521 0.54983415 0.78122004
##  [79] 0.05958379 0.42903235 0.23619830 0.04487570 0.97987164 0.35635075
##  [85] 0.15476779 0.37129253 0.62349821 0.36860580 0.10918816 0.99499124
##  [91] 0.39351347 0.63044086 0.04812088 0.09502582 0.78122368 0.21921873
##  [97] 0.38762517 0.27041737 0.61323845 0.96008331

# colAUC checks    
map(1,
    fun_ROC_models_tune)

## [[1]]
##                glmnet    ranger       gbm glmnet_tune   rf_tune  gbm_tune
## neg vs. pos 0.8156863 0.9137255 0.8862745   0.8376471 0.9286275 0.9231373

To illustrate again how a list-column workflow is very handy at all steps of the analysis, the following tibble collects all intermediary and final results in a tidy format.

### Final List column!!!
list_data_ml_tune_stack_cm_roc_eval

## # A tibble: 6 x 15
##   row_id data_ID data_names source train test  data_mtry caret_stack_tune
##    <dbl>   <int> <chr>      <chr>  <lis> <lis>     <int> <list>          
## 1      1       1 diamonds_  origi~ <df[~ <df[~        14 <caretLst>      
## 2      2       2 pulsar_    origi~ <df[~ <df[~        11 <caretLst>      
## 3      3       3 stackover~ origi~ <df[~ <df[~        23 <caretLst>      
## 4      4       1 diamonds_  pca    <df[~ <df[~         6 <caretLst>      
## 5      5       2 pulsar_    pca    <df[~ <df[~        10 <caretLst>      
## 6      6       3 stackover~ pca    <df[~ <df[~        11 <caretLst>      
## # ... with 7 more variables: results <list>, dotplot <list>, splom <list>,
## #   cm <list>, roc_curves <list>, eval_tune <list>, data <list>

Further restructuring and data-wrangling can continue near-endlessly. Below, I was curious about whether PCA-transformed predictors brought additional performance benefits in the modelling process.

### Superiority of PCA transformation sometimes!!!
eval_up_tune_original_pca %>% 
  mutate(ranking = value_org < value_pca)

## # A tibble: 18 x 6
##    data_ID model_data  metric value_org value_pca ranking
##      <dbl> <chr>       <chr>      <dbl>     <dbl> <lgl>  
##  1       1 gbm         auc        0.886     0.912 TRUE   
##  2       1 gbm_tune    auc        0.923     0.922 FALSE  
##  3       1 glmnet      auc        0.816     0.858 TRUE   
##  4       1 glmnet_tune auc        0.838     0.859 TRUE   
##  5       1 ranger      auc        0.914     0.851 FALSE  
##  6       1 ranger_tune auc        0.929     0.912 FALSE  
##  7       2 gbm         auc        0.877     0.962 TRUE   
##  8       2 gbm_tune    auc        0.938     0.966 TRUE   
##  9       2 glmnet      auc        0.960     0.961 TRUE   
## 10       2 glmnet_tune auc        0.965     0.962 FALSE  
## 11       2 ranger      auc        0.925     0.886 FALSE  
## 12       2 ranger_tune auc        0.945     0.953 TRUE   
## 13       3 gbm         auc        0.591     0.547 FALSE  
## 14       3 gbm_tune    auc        0.659     0.538 FALSE  
## 15       3 glmnet      auc        0.648     0.455 FALSE  
## 16       3 glmnet_tune auc        0.608     0.472 FALSE  
## 17       3 ranger      auc        0.531     0.657 TRUE   
## 18       3 ranger_tune auc        0.628     0.593 FALSE

0.6.1 PART Eval_Output

One of the main objectives was to simulate the streaming of new inputs in order to evaluate the different ML models, as an extention to the CV-train-validation phase of best performing model selection. That is, usually, a single, best performing model is selected at the final step, but for learning purposes, the same 4 models were kept using unseen, streamed, test sets.

Below, I pluck the stackoverflow dataset and, for each model, I show the evolution of different metrics calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).

# A. Time Series for eeach DATA with facets MODEL/METRIC (3 datasets)
map(params$dataset,
    fun_model_eval_stream_TS)

## [[1]]

Below, I pluck the auc-performance metric, and, for each dataset and model, I compare performances using both the original and PCA-transformed data along with its evolution calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).

# B. Facets for each StreamDate/DATA with Source as geom_paths (4 metrics) !!!
map(params$metric,
    fun_eval_pca_original_path)

## [[1]]

Below, I define another plot-function and pluck the auc-performance metric based on both the original and PCA-transformed data, and, for each dataset and model, I show its evolution calculated on new streams of unseen test data (all previous streams were accumulated on older, historical data).

# C. Time Series of Orgiginal Vs PCA per Metrics facets by data/model  (4 metrics)
fun_eval_pca_original_stream_TS <- function(eval) {
  
  eval_output_tidy %>% 
    left_join(eval_output_date_index,
              by = "stream_date") %>% 
    filter(metric == eval) %>% 
    group_by(stream_date, metric, data_names) %>% 
    # mutate(model_data = fct_reorder(model_data, value, mean)) %>% 
    ungroup() %>% 
    ggplot() +
    geom_line(aes(factor(date), value,
                  color = source,
                  group = source)) +
    ggplot2::facet_grid(factor(data_names) ~ factor(model_data),
               scales = "free") +
    ggtitle(eval)
  
}


map(eval_output_tb %>%
      pull(metric) %>%
      unique() %>%
      str_subset(params$metric), # acc
    fun_eval_pca_original_stream_TS)

# D. Time Series of Orgiginal Vs PCA per DATA facets by metrics/model  (3 data!!!)
map(params$dataset,
    fun_eval_pca_original_stream_TS)

## [[1]]

0.6.2 PART VarImp

Variable importance ouputs from random forests/gbm are very informative and can add additional insights to the ones obtained from the more exploratory PCA/HCPC analysis.

Below, I pluck the barcharts drawn for the following combination:

data-source = pca
dataset = stackoverflow

This to show the simulated evolutions of the most important variables in predicting the different labels of the stackoverflow datset.

### Original data
pmap(list(params$source,
          params$dataset),
     fun_var_imp_stream_1)

## [[1]]