1 Introduction

The SSL package provides a Gaussian mixture model classifier, fitted by expectation maximization, for semi-supervised learning. However, the implementation produces an error when more than four features are present. I have attempted to fix a number of the issues by modifying the source from CRAN, but have been unsuccessful. I have made this small example to demonstrate the error.

2 Example Data

2.1 Labeled Data

There is a data.frame of gene expression measurements with known labels that has the following dimensions.

dim(labeled_data)
## [1] 72 10

And the data itself looks like the following.

labeled_data[1:12, 1:8]
##        A1BG   A1CF  A2BP1    A2LD1   A2ML1       A2M   A4GALT  A4GNT
## 1  164.6427 0.0000 0.7561 112.8242  7.9395 21007.130 225.7089 0.0000
## 2  355.0592 0.0000 2.1149  66.0921  0.2644  4696.467 227.8765 0.0000
## 3  223.8031 0.0000 0.4352  64.1508  1.7410 16436.538 360.3831 0.8705
## 4   42.0627 0.0000 0.0000 103.6612  1.6750 46129.706 877.0041 3.5894
## 5  131.4105 0.0000 0.7686 113.5242 39.2006 25351.222 407.3789 0.0000
## 6   83.4274 0.0000 0.0000 112.1780  8.0606  5507.300 265.0330 1.2897
## 7  199.4404 0.0000 1.6221 127.3804  0.8110  6768.305 779.3998 0.0000
## 8   65.8438 0.0000 2.1135  87.9510 35.5067 38388.071 218.1126 1.6908
## 9   80.7959 0.3651 1.8255  70.0548  1.8255 27741.435 236.2176 0.0000
## 10  85.8338 0.0000 9.1556 102.3654  9.7278 17003.014 477.2362 1.1445
## 11 132.6911 0.0000 0.0000 190.3170  3.1075  7574.798 318.2101 0.6215
## 12  59.1574 0.0000 1.8450  61.5621  8.6101 24218.167 191.2669 7.9951

2.2 Unlabeled Data

There is a data.frame of gene expression measurements with unknown labels that has the following dimensions.

dim(unlabeled_data)
## [1] 216  10

And the data itself looks like the following.

unlabeled_data[1:12, 1:8]
##        A1BG   A1CF  A2BP1    A2LD1      A2ML1        A2M    A4GALT  A4GNT
## 1  119.1522 0.0000 0.0000  84.6118  1244.7568 14903.0143  167.7822 0.4462
## 2   80.1438 0.0000 1.5134  72.1945     7.1888 15558.0628  246.3110 0.0000
## 3  522.9042 0.0000 0.0000  52.5057     1.1494   836.7318 1118.0077 0.0000
## 4  182.5415 0.0000 0.0000  76.0683     2.2398 12264.8924  590.7402 0.0000
## 5  195.6499 0.0000 0.0000 117.7400     0.4409  3953.4551   49.8207 0.0000
## 6  192.8194 0.0000 0.0000  70.9305     0.9872 15728.4870  266.2279 0.0000
## 7  189.0577 0.0000 0.3542  93.4042     3.5423 22170.2763  227.0634 0.0000
## 8   74.7731 0.0000 0.2532 107.1494     1.5194 13234.3533  200.0569 0.2532
## 9  205.0052 0.0000 0.0000 122.4446     0.0000 16641.0205  256.9871 0.0000
## 10  69.3158 0.0000 3.1822  91.2092     0.0000 10629.1169  167.0644 1.5911
## 11  94.7577 0.0000 0.4361  95.7041     2.6168  5353.4476  119.0632 0.4361
## 12  53.0100 0.2993 0.0000  17.8735 10166.5392  4571.8211   26.9360 0.0000

2.3 Labels Vector

Finally, there is a vector of labels with the following length.

length(labels_vector)
## [1] 72

And the labels themselves look like the following.

labels_vector
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [37] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
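
Since the real expression data is not included here, the following is a minimal sketch of stand-in objects with the same shapes as the three inputs above (72 labeled rows, 216 unlabeled rows, 10 features, and 47 labels of class 1 followed by 25 of class 2). The names `make_expr`, `labeled_standin`, `unlabeled_standin`, and `labels_standin` are hypothetical, and the values are random placeholders rather than real measurements, so results obtained with them may differ from those shown in this report.

```r
# Synthetic stand-ins with the same shapes as the objects shown above.
# The values are random placeholders, NOT the real expression data.
set.seed(1)
n_features <- 10

# Hypothetical helper: build an n_samples x n_features data.frame
# of non-negative random values.
make_expr <- function(n_samples) {
  m <- matrix(
    rexp(n_samples * n_features, rate = 1 / 100),
    nrow = n_samples,
    dimnames = list(NULL, paste0("gene", seq_len(n_features)))
  )
  as.data.frame(m)
}

labeled_standin   <- make_expr(72)
unlabeled_standin <- make_expr(216)
labels_standin    <- rep(c(1, 2), times = c(47, 25))

# Basic consistency checks before passing the inputs to any classifier:
# one label per labeled row, matching columns, and no missing values.
stopifnot(
  nrow(labeled_standin) == length(labels_standin),
  identical(colnames(labeled_standin), colnames(unlabeled_standin)),
  !anyNA(labeled_standin),
  !anyNA(unlabeled_standin)
)
```

The same `stopifnot()` checks can be run on the real `labeled_data`, `labels_vector`, and `unlabeled_data` to rule out mismatched inputs as the source of a failure.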

3 SSL Errors

When run with just four features, the sslGmmEM function from the SSL package produces the following output.

SSL::sslGmmEM(labeled_data[, 1:4], labels_vector, unlabeled_data[, 1:4])
## $para
##           [,1]       [,2]       [,3]     [,4]
## [1,] 155.45001 0.01610553  2.7617877 84.83448
## [2,] 115.63840 0.09724083 13.4125509 34.69570
## [3,] 142.66697 0.55022133  0.7828245 94.88403
## [4,]  68.21061 2.11565583  1.6738634 59.94932
## 
## $classProb
## [1] 0.8090278 0.1909722
## 
## $yu
##   [1] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1
##  [37] 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1
##  [73] 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [109] 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [145] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 1
## [181] 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2
## 
## $optLambda
## [1] 1

However, when run with just one more feature, for a total of five features, the method produces an error.

try(SSL::sslGmmEM(labeled_data[, 1:5], labels_vector, unlabeled_data[, 1:5]))
## Error in all.label[index] : invalid subscript type 'list'
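
The message itself is a base R error that is raised whenever a vector is subset with a list instead of an atomic index. The sketch below reproduces that class of error in isolation and shows one common way a list index can arise by accident: `sapply()` falls back to returning a list when its results have unequal lengths. The names `all_label`, `idx_vec`, `idx_list`, and `idx_from_sapply` are illustrative only; this is not taken from the SSL source, and whether the package fails for this reason is an assumption that would need to be checked against its code.

```r
# Illustrative label vector (hypothetical values).
all_label <- c(1, 1, 2, 2, 1)

# Subsetting with an integer vector works as expected.
idx_vec <- c(1L, 3L)
all_label[idx_vec]

# Subsetting with a list raises "invalid subscript type 'list'".
idx_list <- list(1L, 3L)
res <- try(all_label[idx_list], silent = TRUE)
cat(res)

# sapply() cannot simplify results of unequal length, so it returns
# a list here (class 1 has three members, class 2 has two).
idx_from_sapply <- sapply(1:2, function(k) which(all_label == k))
is.list(idx_from_sapply)

# Flattening with unlist() gives an atomic index that can be used
# for subsetting again.
all_label[unlist(idx_from_sapply)]
```

If the failing line in the package builds `index` in a similar way, checking `class(index)` just before `all.label[index]` would confirm or rule out this explanation.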

Because learning from only four features is not useful, this error would need to be fixed before the SSL package could be used in production.
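
To confirm at which feature count the failure first appears, a small scan over column counts can help. The following is a rough sketch: `first_failing_width` and `mock_fit` are hypothetical names, and `mock_fit` is a stand-in that simply fails above four features so the example runs without the SSL package or the real data.

```r
# Return the smallest number of leading columns for which fit_fun()
# raises an error, or NA if every width from min_p to max_p succeeds.
first_failing_width <- function(fit_fun, min_p, max_p) {
  for (p in seq(min_p, max_p)) {
    ok <- tryCatch({
      fit_fun(p)
      TRUE
    }, error = function(e) FALSE)
    if (!ok) return(p)
  }
  NA_integer_
}

# Hypothetical stand-in for the real call: fails above four features.
mock_fit <- function(p) {
  if (p > 4) stop("invalid subscript type 'list'")
  invisible(p)
}

first_failing_width(mock_fit, 2, 10)
```

With the package installed and the data loaded, the stand-in could be replaced by `function(p) SSL::sslGmmEM(labeled_data[, 1:p], labels_vector, unlabeled_data[, 1:p])`, which mirrors the calls shown in this section.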

4 Session Info

sessionInfo()
## R Under development (unstable) (2019-01-26 r76018)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.11.0   BiocManager_1.30.4
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0          lubridate_1.7.4     lattice_0.20-38    
##  [4] class_7.3-15        assertthat_0.2.0    digest_0.6.18      
##  [7] ipred_0.9-8         foreach_1.4.4       mime_0.6           
## [10] R6_2.3.0            plyr_1.8.4          NetPreProc_1.1     
## [13] SSL_0.1             stats4_3.6.0        evaluate_0.13      
## [16] e1071_1.7-0.1       ggplot2_3.1.0       highr_0.7          
## [19] pillar_1.3.1        rlang_0.3.1         lazyeval_0.2.1     
## [22] caret_6.0-81        rstudioapi_0.9.0    data.table_1.12.0  
## [25] miniUI_0.1.1.1      rpart_4.1-13        Matrix_1.2-15      
## [28] combinat_0.0-8      rmarkdown_1.11      splines_3.6.0      
## [31] gower_0.1.2         stringr_1.4.0       questionr_0.7.0    
## [34] munsell_0.5.0       proxy_0.4-22        shiny_1.2.0        
## [37] compiler_3.6.0      httpuv_1.4.5.1      xfun_0.4           
## [40] pkgconfig_2.0.2     BiocGenerics_0.29.1 htmltools_0.3.6    
## [43] nnet_7.3-12         tidyselect_0.2.5    prodlim_2018.04.18 
## [46] tibble_2.0.1        bookdown_0.9        codetools_0.2-16   
## [49] withr_2.1.2         crayon_1.3.4        dplyr_0.7.8        
## [52] later_0.8.0         MASS_7.3-51.1       recipes_0.1.4      
## [55] ModelMetrics_1.2.2  grid_3.6.0          nlme_3.1-137       
## [58] xtable_1.8-3        gtable_0.2.0        magrittr_1.5       
## [61] scales_1.0.0        graph_1.61.0        stringi_1.2.4      
## [64] reshape2_1.4.3      promises_1.0.1      bindrcpp_0.2.2     
## [67] timeDate_3043.102   generics_0.0.2      xgboost_0.81.0.1   
## [70] lava_1.6.5          klaR_0.6-14         iterators_1.0.10   
## [73] tools_3.6.0         glue_1.3.0          purrr_0.3.0        
## [76] survival_2.43-3     parallel_3.6.0      yaml_2.2.0         
## [79] colorspace_1.4-0    knitr_1.21          bindr_0.1.1