Does quanteda Obey R and R Markdown Output Options?

A student in the August 2017 run of the Capstone wanted to know whether it was possible to save the output from a corpus summary to an object without having logging information written to the output file from R Markdown.

Both R and R Markdown include options to limit the amount of output generated by R functions. To test the behavior of quanteda in a scenario where we want a clean report (i.e. no output generated automatically by R within knitr code chunks), we load libraries with R and knitr options to suppress warnings, echoed code, errors, and informative messages.

The test is to see whether we can run elements of quanteda without writing messages or results to the output document unless we explicitly print them.

For our test we will load and sample the data from the Capstone project, and attempt to process it with the quanteda package. All library() statements have been coded with options to suppress load messages and warnings, including the one for quanteda. If they work correctly, the code block we will execute next should generate no output to the HTML document.

quanteda version 0.9.9.50
Using 3 of 4 cores for parallel computing
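One possible reading of the two lines above, offered as an assumption rather than something this test verifies: if quanteda emits its banner through packageStartupMessage(), then the quietly argument of library() would not be expected to hide it, while base R's suppressPackageStartupMessages() would. The sketch below uses only base R. The attach_demo() and messages_from() functions are hypothetical stand-ins written for this illustration and are not part of quanteda.

```r
# Hypothetical stand-in for a package attach hook: it announces itself
# with packageStartupMessage(), as many packages do when attached.
attach_demo <- function() {
  packageStartupMessage("demo version 0.0.1")
  invisible(TRUE)
}

# Hypothetical helper: evaluate `expr` and return the text of every
# message condition that reaches this level, muffling each one.
messages_from <- function(expr) {
  seen <- character(0)
  withCallingHandlers(
    expr,
    message = function(m) {
      seen <<- c(seen, conditionMessage(m))
      invokeRestart("muffleMessage")
    }
  )
  seen
}

# Without the wrapper the startup message escapes; with it, nothing does.
unsuppressed <- messages_from(attach_demo())
suppressed   <- messages_from(suppressPackageStartupMessages(attach_demo()))
```

Under that assumption, wrapping the real call as suppressPackageStartupMessages(library(quanteda)) would be the analogous thing to try in the report.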

The presence of R output above means that quanteda does not obey the options on the library() function, unlike the other packages we loaded (see Appendix for list of loaded packages). Now that we’ve loaded the data and created a corpus, we’ll summarize it here. First, we’ll attempt to create the corpus summary to evaluate whether quanteda_options(verbose=FALSE) suppresses the summary information.

Corpus consisting of 42697 documents, showing 100 documents.

    Text Types Tokens Sentences
   text1    16     16         2
   text2    64     81         5
   text3     3      3         1
   text4     9     11         1
   text5     8      8         1
   text6    30     33         2
   text7     4      4         1
   text8    19     24         4
   text9    36     38         1
  text10    40     58         2
  text11    17     18         1
  text12    18     20         1
  text13     3      3         1
  text14    17     17         1
  text15    10     23         1
  text16    14     15         3
  text17    17     22         1
  text18     3      3         1
  text19     9      9         1
  text20    12     14         1
  text21    48     60         2
  text22    19     22         2
  text23    14     14         1
  text24    17     17         2
  text25     5      5         1
  text26    12     16         1
  text27    16     18         1
  text28    22     27         2
  text29    27     32         2
  text30    16     18         2
  text31    13     14         1
  text32    86    113         6
  text33    14     16         2
  text34    13     14         1
  text35    30     36         1
  text36    15     17         2
  text37    24     27         1
  text38     7      7         1
  text39     3      3         1
  text40    34     39         1
  text41    16     20         1
  text42    17     27         1
  text43    13     13         1
  text44    22     23         1
  text45    20     23         1
  text46    13     13         1
  text47    10     16         3
  text48    13     13         1
  text49    14     14         1
  text50    15     17         1
  text51    20     22         1
  text52    28     42         2
  text53    91    125         5
  text54    14     14         1
  text55    19     21         2
  text56    15     15         2
  text57    42     47         3
  text58    24     29         2
  text59    68     83         2
  text60    11     11         2
  text61    18     20         2
  text62    11     11         1
  text63    23     26         1
  text64    46     60         2
  text65     2      2         1
  text66    15     18         1
  text67    19     22         2
  text68    49     54         3
  text69     5      5         1
  text70    12     12         1
  text71    68     84         3
  text72    16     17         2
  text73     5     13         2
  text74    76     89         4
  text75     3      3         1
  text76     7      8         1
  text77    16     17         2
  text78    10     12         2
  text79    64     75         3
  text80    23     25         2
  text81    31     39         3
  text82     6      6         1
  text83     9      9         1
  text84    26     28         1
  text85    51     58         2
  text86    30     34         1
  text87     4      4         1
  text88    13     15         2
  text89     2      2         1
  text90    22     24         1
  text91    27     32         3
  text92    48     58         2
  text93    35     39         1
  text94     9     16         2
  text95    39     43         2
  text96    15     15         2
  text97    21     23         2
  text98    72    115         4
  text99    15     16         2
 text100     2      2         1

Source:  C:/Users/leona/gitrepos/datascience/capstone/* on x86-64 by leona
Created: Tue Aug 15 06:30:39 2017
Notes:   

Since we see output from the summary() function in the code block we just executed, it is clear that quanteda does not obey the quanteda_options() value when a corpus is summarized with the summary() function.
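For the original question of saving a summary to an object without anything reaching the report, one general base R approach is to divert printed output with capture.output() while assigning the result inside the call. This is a minimal sketch that assumes the unwanted text goes to standard output. The noisy_summary() function and the docs vector are hypothetical stand-ins, not quanteda code.

```r
# Hypothetical stand-in for a function that prints a report as a side
# effect and also returns a data frame.
noisy_summary <- function(x) {
  cat("Summary of", length(x), "items\n")
  data.frame(Text = paste0("text", seq_along(x)),
             Chars = nchar(x),
             stringsAsFactors = FALSE)
}

docs <- c("a short text", "another, slightly longer text")

# capture.output() collects anything written to standard output into a
# character vector; the assignment inside the call still takes effect
# in the calling environment, so quiet_result holds the data frame.
printed <- capture.output(quiet_result <- noisy_summary(docs))
```

The same pattern applied to the corpus would be capture.output(aResult <- summary(theText)), with the captured text simply discarded. It would not catch text sent to the message stream.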

Now we will print the first 10 rows of the summary object.

         Text Types Tokens Sentences
text1   text1    16     16         2
text2   text2    64     81         5
text3   text3     3      3         1
text4   text4     9     11         1
text5   text5     8      8         1
text6   text6    30     33         2
text7   text7     4      4         1
text8   text8    19     24         4
text9   text9    36     38         1
text10 text10    40     58         2

Scenario 2: Directly Suppressing summary.corpus() Output

Here we call summary.corpus() directly, and set the verbose option to FALSE.

Since there is no output in the space above this text, summary.corpus() does obey its verbose argument. However, it does not use the setting from quanteda_options(), which is supposed to control all functions that have a verbose option.
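The behavior we expected can be sketched in base R: a function whose verbose argument defaults to a session-wide option, so that a single options call silences it unless a caller overrides the argument. The describe() function and the demo.verbose option name are hypothetical and chosen for illustration. This is not how quanteda is implemented, only the pattern the test was looking for.

```r
# Hypothetical function: `verbose` falls back to a session-wide option
# when the caller does not supply it.
describe <- function(x, verbose = getOption("demo.verbose", TRUE)) {
  if (verbose) cat("Describing", length(x), "values\n")
  invisible(length(x))
}

# Turn the option off for the whole session.
options(demo.verbose = FALSE)

# No argument given: the function follows the option and prints nothing.
out_default <- capture.output(describe(1:3))

# Explicit argument: it takes precedence over the option.
out_explicit <- capture.output(describe(1:3, verbose = TRUE))
```

In the run above, summary() behaved like the second call only: it respected an explicit verbose = FALSE but did not pick up the session-wide setting.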

Now we will print the first 10 rows of the output object from the second summary() function call.

         Text Types Tokens Sentences
text1   text1    16     16         2
text2   text2    64     81         5
text3   text3     3      3         1
text4   text4     9     11         1
text5   text5     8      8         1
text6   text6    30     33         2
text7   text7     4      4         1
text8   text8    19     24         4
text9   text9    36     38         1
text10 text10    40     58         2

Conclusion

Having demonstrated that we are unable to completely suppress unwanted output from quanteda, it’s time to post a bug report / feature request to the author.

Appendix

Here we will print the code used to run the example.

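# note: the run reported above spelled this option "messsage"; knitr does not
# recognize that name, so message suppression was not in effect for that run
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, comment = NA)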
library(readr,quietly=TRUE)
library(dtplyr,quietly=TRUE)
library(stringr,quietly=TRUE)
library(stringi,quietly=TRUE)
library(quanteda,quietly=TRUE,verbose=FALSE,warn.conflicts=FALSE)
library(knitr,quietly=TRUE)
quanteda_options(verbose = FALSE)

blogFile <- "./data/en_us.blogs.txt"
newsFile <- "./data/en_us.news.txt"
twitterFile <- "./data/en_us.twitter.txt"
#
# note: change inFile & outFile references for each file type
#
inFile <- blogFile
blogData <- read_lines(blogFile)
newsData <- read_lines(newsFile)
twitterData <- readLines(twitterFile)
allData <- c(blogData,newsData,twitterData) # about 800mb file
# take a random sample of items
set.seed(902101347)
samplePct <- .01
sampleSize <- round(length(allData) * samplePct,0)
allData <- sample(allData,sampleSize)
rm(blogData,newsData,twitterData,blogFile,inFile,newsFile,samplePct,sampleSize)
theText <- corpus(allData)
# first, set quanteda_options()
quanteda_options(verbose = FALSE)
aResult <- summary(theText)
aResult2 <- summary(theText,verbose=FALSE)
aResult[1:10,]
aResult2[1:10,]

One last bit of housekeeping…

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.15.1      quanteda_0.9.9-50 stringi_1.1.5     stringr_1.2.0    
[5] dtplyr_0.0.2      readr_1.1.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10        magrittr_1.5        hms_0.3            
 [4] munsell_0.4.3       colorspace_1.3-2    lattice_0.20-35    
 [7] R6_2.2.0            fastmatch_1.1-0     plyr_1.8.4         
[10] dplyr_0.5.0         tools_3.4.0         grid_3.4.0         
[13] gtable_0.2.0        data.table_1.10.4   DBI_0.6-1          
[16] htmltools_0.3.6     lazyeval_0.2.0      yaml_2.1.14        
[19] RcppParallel_4.3.20 rprojroot_1.2       digest_0.6.12      
[22] assertthat_0.2.0    tibble_1.3.0        Matrix_1.2-9       
[25] ggplot2_2.2.1       evaluate_0.10       rmarkdown_1.5      
[28] compiler_3.4.0      scales_0.4.1        backports_1.0.5