suppressPackageStartupMessages( require(oetteR) )
suppressPackageStartupMessages( require(tidyverse) )

1 Introduction Group Analysis

When given a dataset, we often a have predifined groups of observations and several variables that we want to explore for relevant variables that describe the differences between those groups. Traditionally we would probably calculate the means of each variable and then sort by difference of the means in percent. Then we would generate histograms of the most promising candidates, check their distributions and then chose an appropriate statistical method to determine whether the difference in mean is of statistical significance. We would also have to use a different methods for numerical and categorical variables. My goal here is to present a workflow that automates all these steps.

2 f_stat_group_ana

is a function that simply takes a list data_ls generated by f_clean_data() and col_group a character vector that denotes the name of the grouping variable.

It then generates a full html file with interactive histograms of all relevant variables and interactive tables containing summary statistics of all variables. As well as alluvial plots, tabplots and static plots with statistical annotations.

In order to detect statistical relevant variables it uses kruskal and regular anova and performs a shapiro test for normality. Depending on the shapio result it selects either the kruskal or the regular anova p value.

It sorts the relevant variables by percent difference for numerical and percent difference in frequency for categorical variables.

It uses in total a number of functions that are quite usefull on its own, including a decent histogram function f_plot_hist, which we are going to walk through step-bystep.

2.1 Sample Data

We are going to use the ISLR::Caravan data set which consists of 86 different variables. We are going to investigate which variables influnce the purchase of a caravan by grouping on variable Purchase.


data_ls = f_clean_data(ISLR::Caravan)
#> [1] "Number of excluded observations: 0"

col_group = 'Purchase'

2.2 Calculate P values and percent difference


df_anova = f_stat_anova( data_ls, col_group )

df_chi = f_stat_chi_square( data_ls, col_group)

df_anova
#> # A tibble: 55 x 9
#>    variable shapiro_stat shapiro_pval   anova_pval kruskal_pval
#>       <chr>        <dbl>        <dbl>        <dbl>        <dbl>
#>  1  MOSTYPE    0.8790604 1.833958e-52 1.161297e-07 9.720052e-06
#>  2 MAANTHUI    0.2918228 7.420946e-88 4.542756e-01 9.313900e-01
#>  3 MGEMLEEF    0.8588394 3.457862e-55 7.319163e-01 7.599799e-01
#>  4 MOSHOOFD    0.8880026 3.895995e-51 1.203613e-07 2.636569e-06
#>  5   MGODRK    0.6992253 1.518687e-69 6.348475e-01 1.349118e-01
#>  6   MGODPR    0.9569654 4.428784e-36 1.214487e-02 7.944751e-03
#>  7   MGODOV    0.8441362 5.691839e-57 7.579684e-01 7.557576e-01
#>  8   MGODGE    0.9513418 7.333304e-38 1.457467e-03 1.574501e-03
#>  9   MRELGE    0.9264267 3.467169e-44 8.016575e-08 2.015729e-07
#> 10   MRELSA    0.7953003 4.990890e-62 1.090361e-02 3.723669e-02
#> # ... with 45 more rows, and 4 more variables: diff_of_means <dbl>,
#> #   diff_of_means_perc <dbl>, diff_of_medians <dbl>,
#> #   diff_of_medians_perc <dbl>

df_chi
#> # A tibble: 30 x 4
#>    variable     chi_pval max_diff_freq max_diff_freq_perc
#>       <chr>        <dbl>         <dbl>              <dbl>
#>  1  MGEMOMV 5.346666e-02          2411           97.66082
#>  2  PWAPART 2.033526e-12          3326           99.73013
#>  3  PWALAND 4.158361e-01          5354          100.00000
#>  4  PBESAUT 8.415373e-01          5425          100.00000
#>  5  PVRAAUT 9.025744e-01          5464          100.00000
#>  6 PTRACTOR 4.176891e-01          5328          100.00000
#>  7   PWERKT 8.545754e-01          5450          100.00000
#>  8  PGEZONG 7.309731e-04          5433           99.83462
#>  9  PWAOREG 9.310246e-02          5454          100.00000
#> 10  PZEILPL 3.077105e-02          5471          100.00000
#> # ... with 20 more rows

2.3 Combine the results of anova and chisquare

f_stat_combine_anova_with_chi_square() keeps either the regular or the kruskal anova p value depending on the results of the shapiro test (shapiro_stat > 0.9 p_val > 0.05). Then combines the df_anova and the df_chi dataframes into one.


df_stat = f_stat_combine_anova_with_chi_square( df_anova, df_chi ) %>%
  arrange( desc(diff_perc) )

df_stat
#> # A tibble: 85 x 3
#>    variable      p_value diff_perc
#>       <chr>        <dbl>     <dbl>
#>  1  AWAPART 3.246262e-11       100
#>  2 ABYSTAND 9.205356e-07       100
#>  3 PBYSTAND 6.598398e-06       100
#>  4    ABROM 2.548896e-03       100
#>  5  AWAOREG 1.869769e-02       100
#>  6  PZEILPL 3.077105e-02       100
#>  7  PWAOREG 9.310246e-02       100
#>  8  AINBOED 3.085995e-01       100
#>  9  PWALAND 4.158361e-01       100
#> 10 PTRACTOR 4.176891e-01       100
#> # ... with 75 more rows

2.4 Plot histograms

2.4.1 f_plot_hist

this function provides a standard interface for histograms with sensible defaults. It automatically detects whether a variable is numeric or categorical and adds statistical information on its own. See the following examples.

The statistical annotation is added using the ggpubr package. ggpubr::stat_compare_means() can be used to add significance annotation to most plots. Although we have to predefine the values that should be compared. For this I wrote f_plot_generate_comparison_pairs() which automatically looks for combinations that are significantly different using a non-parametric wilcox test.


data_ls2 = f_clean_data(mtcars)
#> [1] "Number of excluded observations: 0"
f_plot_hist('disp', data_ls2)

f_plot_hist('disp', data_ls2, add = 'median')

f_plot_hist('disp', data_ls2, add = 'none')

f_plot_hist('disp', data_ls2, y_axis = 'density')

f_plot_hist('cyl', data_ls2 , group = 'gear' )

f_plot_hist('cyl', data_ls2 , group = 'gear', y_axis = 'density' )

f_plot_hist('cyl', data_ls2, y_axis = 'density' )

f_plot_hist('cyl', data_ls2, y_axis = 'count' )

f_plot_hist('disp', data_ls2, graph_type = 'line', group = 'cyl')

f_plot_hist('disp', data_ls2, graph_type = 'bar', group = 'cyl')

f_plot_hist('disp', data_ls2, graph_type = 'violin', group = 'cyl'
             , caption ='caption', title = 'title', subtitle = 'subtitle')

2.4.2 Generating histograms using a pipe


df_stat %>%
  filter( diff_perc >= 30, p_value <= 0.001 ) %>%
  .$variable %>%
  map( f_plot_hist, data_ls, col_group, y_axis = 'density' )
#> [[1]]

#> 
#> [[2]]

#> 
#> [[3]]

#> 
#> [[4]]

#> 
#> [[5]]

#> 
#> [[6]]

#> 
#> [[7]]

#> 
#> [[8]]

#> 
#> [[9]]

#> 
#> [[10]]

#> 
#> [[11]]

#> 
#> [[12]]

2.5 Tables with means/medians and frequencies

f_stat_group_mean_medians() and f_stat_group_counts_percentages() create meaningful tables with minimal input. They actually are recalculating the statistics again. Which is compuationally a bit wasteful but should not take to much extra time. f_datatable_universal creates a nicely formatted interactive table. That can be easily be downloaded as an excel file.


f_stat_group_mean_medians( data_ls, col_group ) %>%
  f_datatable_universal()

f_stat_group_counts_percentages( data_ls, col_group ) %>%
  f_datatable_universal()

2.6 Bringing it all together

We can apply all these steps automatically by simply using f_stat_group_ana(). We have the following (and more options) when rendering.

-return_taglist (Instead of rednering the html file we can get a taglist in return which we can integrate into any Rmd document) - alluvial (adds an additional html file with an alluvial plot) - tabplot( adds an additional html file with tabplots) - static_plots( adds an additional html file with static versions of the histograms, all histograms loose the statistical annotations when converted to interactive plotly objects )

The reason why we generate seperate html files for each plot type is that I did not yet manage to write a generic function that is able to print a list of heterogeneous objects inside a Rmd file. So far I am using f_plot_obj_2_html() to generate html files from homogeneous lists.

For demsonstration purposes we will generate 4 html files with rather high p value and percent difference thresholds. We will integrate them as screenshots.


f_stat_group_ana( data_ls, col_group
                  , thresh_p_val = 0.0001, thresh_diff_perc = 30
                  # it ususally makes sense to increase limit the amount of variables in an alluvial plot
                  , alluvial_thresh_p_val = 0.0001, alluvial_thres_diff_perc = 30
                  , alluvial = T
                  , static_plots = T
                  , tabplot = T 
                  , output_file = 'f_stat_group_ana'
                  , quiet = T )
#> [1] "Number of flows: 238"
#> [1] "Original Dataframe reduced to 4.1 %"
#> [1] "Maximum weight of a singfle flow 10.7 %"
#> [1] "C:/OneDrive/R/oetteR/inst/vignettes/f_stat_group_ana.html"

2.7 Screenshot Main File


webshot::webshot( 'f_stat_group_ana.html' )


file.remove('f_stat_group_ana.html')
#> [1] TRUE

2.8 Screenshot Alluvial File


webshot::webshot( 'f_stat_group_ana_alluvial.html' )


file.remove('f_stat_group_ana_alluvial.html')
#> [1] TRUE

2.9 Screenshot Tablplot File


webshot::webshot( 'f_stat_group_ana_tabplots.html' )


file.remove('f_stat_group_ana_tabplots.html')
#> [1] TRUE

2.10 Screenshot Static Plots File


webshot::webshot( 'f_stat_group_ana_stat_plots.html' )


file.remove('f_stat_group_ana_stat_plots.html')
#> [1] TRUE

file.remove('webshot.png')
#> [1] TRUE