1 Converting a numeric variable into a categorical variable by splitting it (binning)


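Unsupervised binning offers several ways to define the intervals:

  1. Split at equally spaced boundary values (Equal Width)
  2. Split so that every interval holds the same number of observations (Equal Frequency)
  3. Split at quantiles
  4. Other methods (k-means, etc.)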

2 Choosing the intervals
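As a rough illustration of how boundaries for the equal-width and equal-frequency (quantile) approaches could be computed in base R, consider the sketch below. The vector `x` and the bin count `k` are made-up values for illustration only and do not come from the original text.

```r
# Illustrative data and bin count (assumptions, not from the original text)
x <- c(1, 2, 3, 5, 8, 13, 21, 34, 55)
k <- 4

# Equal width: k intervals of identical length spanning the range of x
breaks_width <- seq(min(x), max(x), length.out = k + 1)

# Equal frequency / quantiles: boundaries at the 0, 1/k, ..., 1 quantiles
breaks_freq <- quantile(x, probs = seq(0, 1, length.out = k + 1))

# Apply the boundaries; include.lowest = TRUE keeps min(x) in the first bin,
# labels = FALSE returns integer bin codes instead of a factor
bin_width <- cut(x, breaks = breaks_width, include.lowest = TRUE, labels = FALSE)
bin_freq  <- cut(x, breaks = breaks_freq,  include.lowest = TRUE, labels = FALSE)

table(bin_width)  # counts per bin: 6 1 1 1
table(bin_freq)   # counts per bin: 3 2 2 2
```

On skewed data such as this, equal-width bins concentrate most observations in the first bin, while quantile-based boundaries keep the counts close to each other.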


3 Discretizing with cut
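The code that produced the table in this section is not included in the text. The following is a reconstruction, assuming `x <- 0:10` and break points `0, 2, ..., 10`. These calls reproduce the five columns, but the arguments actually used are a guess; `ct_5` in particular could have come from a different call that happens to give the same codes as `ct_3`.

```r
x <- 0:10
brks <- seq(0, 10, by = 2)   # assumed break points: 0, 2, 4, 6, 8, 10

# labels = FALSE returns integer bin codes instead of a factor
ct_1 <- cut(x, brks, labels = FALSE)                         # (a, b]: 0 falls outside -> NA
ct_2 <- cut(x, brks, right = FALSE, labels = FALSE)          # [a, b): 10 falls outside -> NA
ct_3 <- cut(x, brks, include.lowest = TRUE, labels = FALSE)  # first bin becomes [0, 2]
ct_4 <- cut(x, brks, right = FALSE, include.lowest = TRUE,
            labels = FALSE)                                  # last bin becomes [8, 10]
ct_5 <- cut(x, breaks = 5, labels = FALSE)                   # 5 equal-width bins over range(x)

cbind(x, ct_1, ct_2, ct_3, ct_4, ct_5)
```

With `right = TRUE` (the default) intervals are closed on the right, so the smallest break itself is excluded unless `include.lowest = TRUE` is set. With `right = FALSE` the same option closes the last interval on the right instead.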


##        x ct_1 ct_2 ct_3 ct_4 ct_5
##  [1,]  0   NA    1    1    1    1
##  [2,]  1    1    1    1    1    1
##  [3,]  2    1    2    1    2    1
##  [4,]  3    2    2    2    2    2
##  [5,]  4    2    3    2    3    2
##  [6,]  5    3    3    3    3    3
##  [7,]  6    3    4    3    4    3
##  [8,]  7    4    4    4    4    4
##  [9,]  8    4    5    4    5    4
## [10,]  9    5    5    5    5    5
## [11,] 10    5   NA    5    5    5

4 Discretizing with findInterval
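The code behind the table in this section is likewise missing. A sketch that reproduces the four columns, assuming the same `x <- 0:10` and break points `0, 2, ..., 10`, might look as follows; the argument combinations are inferred from the output and are not confirmed by the original.

```r
x <- 0:10
brks <- seq(0, 10, by = 2)   # assumed break points: 0, 2, 4, 6, 8, 10

intvl_1 <- findInterval(x, brks)                            # [a, b): 10 lands in an extra bin 6
intvl_2 <- findInterval(x, brks, left.open = TRUE)          # (a, b]: 0 lies below every bin -> 0
intvl_3 <- findInterval(x, brks, rightmost.closed = TRUE)   # last bin becomes [8, 10]
intvl_4 <- findInterval(x, brks, left.open = TRUE,
                        rightmost.closed = TRUE)            # first bin becomes [0, 2]

cbind(x, intvl_1, intvl_2, intvl_3, intvl_4)
```

Unlike `cut`, `findInterval` never returns `NA` for values at or beyond the outer breaks; it returns `0` or `length(brks)` instead, which is why `intvl_1` contains a 6 and `intvl_2` contains a 0.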


##        x intvl_1 intvl_2 intvl_3 intvl_4
##  [1,]  0       1       0       1       1
##  [2,]  1       1       1       1       1
##  [3,]  2       2       1       2       1
##  [4,]  3       2       2       2       2
##  [5,]  4       3       2       3       2
##  [6,]  5       3       3       3       3
##  [7,]  6       4       3       4       3
##  [8,]  7       4       4       4       4
##  [9,]  8       5       4       5       4
## [10,]  9       5       5       5       5
## [11,] 10       6       5       5       5

5 Discretizing with infotheo::discretize


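infotheo::discretize discretizes each column of a data frame into equal-width or equal-frequency intervals.
When the data contain duplicate values, equalfreq cannot produce bins with equal counts: in the table in this section, ef puts six values in bin 1, none in bin 2, and four in bin 3.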
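The call that generated the table is not shown in the text. A plausible sketch is given here, assuming the input vector read off the `x` column and row labels of the table and `nbins = 3` (the table only shows codes 1 to 3, so the bin count is a guess). No expected output is listed because the exact codes depend on the package's internal binning rules.

```r
library(infotheo)

# Assumed input, reconstructed from the x column and row labels of the table
df <- data.frame(x = c(3, 6, 5, 3, 9, 9, 3, 7, 5, 5))

# nbins = 3 is an assumption; the default is NROW(X)^(1/3)
ef  <- discretize(df, disc = "equalfreq",        nbins = 3)
ew  <- discretize(df, disc = "equalwidth",       nbins = 3)
gew <- discretize(df, disc = "globalequalwidth", nbins = 3)

# discretize returns a data frame, so take its first (only) column
res <- data.frame(x = df$x, ef = ef[[1]], ew = ew[[1]], gew = gew[[1]])
res[order(res$x), ]

# Bin sizes under equalfreq; tied values in x keep them from being equal
table(res$ef)
```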

equalfreq (ef) / equalwidth (ew) / globalequalwidth (gew)

##    x ef ew gew
## 1  3  1  1   1
## 4  3  1  1   1
## 7  3  1  1   1
## 3  5  1  2   1
## 9  5  1  2   1
## 10 5  1  2   1
## 2  6  3  2   2
## 8  7  3  3   2
## 5  9  3  3   3
## 6  9  3  3   3

6 Splitting into equal-sized groups with dplyr::ntile
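The original chunk is missing here as well. A sketch that reproduces the table in this section, assuming the same input vector as in the previous section and three groups, could be:

```r
library(dplyr)

# Assumed input, reconstructed from the x column and row labels of the table
x <- c(3, 6, 5, 3, 9, 9, 3, 7, 5, 5)

# ntile ranks the values (ties broken by position) and cuts the ranks into 3 groups
df <- data.frame(x = x, ntl = ntile(x, 3))
df[order(df$x), ]

table(df$ntl)  # group sizes: 4 3 3
```

Because `ntile` works on ranks rather than on the values themselves, group sizes stay balanced even with duplicates, at the cost of placing identical values in different groups: the three observations with `x = 5` end up in groups 1, 2 and 2.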


##    x ntl
## 1  3   1
## 4  3   1
## 7  3   1
## 3  5   1
## 9  5   2
## 10 5   2
## 2  6   2
## 8  7   3
## 5  9   3
## 6  9   3

7 Environment


## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.8.0.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        rstudioapi_0.9.0  xml2_1.2.0       
##  [4] knitr_1.21        magrittr_1.5      hms_0.4.2        
##  [7] munsell_0.5.0     tidyselect_0.2.5  rvest_0.3.2      
## [10] viridisLite_0.3.0 colorspace_1.4-0  R6_2.3.0         
## [13] rlang_0.3.1       highr_0.7         httr_1.4.0       
## [16] stringr_1.3.1     tools_3.5.1       webshot_0.5.1    
## [19] xfun_0.4          htmltools_0.3.6   yaml_2.2.0       
## [22] assertthat_0.2.0  digest_0.6.18     tibble_2.0.1     
## [25] crayon_1.3.4      infotheo_1.2.0    kableExtra_1.0.0 
## [28] purrr_0.3.0       readr_1.3.1       glue_1.3.0       
## [31] evaluate_0.12     rmarkdown_1.11    stringi_1.2.4    
## [34] compiler_3.5.1    pillar_1.3.1      scales_1.0.0     
## [37] pkgconfig_2.0.2