A student in the August 2017 run of the Capstone wanted to know whether it was possible to save the output from a corpus summary to an object without having logging information written to the output file from R Markdown.
Both R and R Markdown include options to limit the amount of output generated by R functions. To test the behavior of quanteda in a scenario where we want a clean report (i.e. no output written automatically by R from knitr code chunks), we load libraries with R and knitr options that suppress warnings, echoed code, errors, and informative messages.
The test is to see whether we can run quanteda functions without writing messages or results to the output document unless we explicitly print them.
For our test we will load and sample the data from the Capstone project, and attempt to process it with the quanteda package. All library() statements have been coded with options to suppress load messages and warnings, including the one for quanteda. If they work correctly, the code block we will execute next should generate no output to the HTML document.
quanteda version 0.9.9.50
Using 3 of 4 cores for parallel computing
The R output above shows that the options on the library() call did not silence quanteda's startup messages, unlike the other packages we loaded (see Appendix for the list of loaded packages). Now that we’ve loaded the data and created a corpus, we’ll summarize it here. First, we’ll create the corpus summary to evaluate whether quanteda_options(verbose=FALSE) suppresses the summary information.
Corpus consisting of 42697 documents, showing 100 documents.
Text Types Tokens Sentences
text1 16 16 2
text2 64 81 5
text3 3 3 1
text4 9 11 1
text5 8 8 1
text6 30 33 2
text7 4 4 1
text8 19 24 4
text9 36 38 1
text10 40 58 2
text11 17 18 1
text12 18 20 1
text13 3 3 1
text14 17 17 1
text15 10 23 1
text16 14 15 3
text17 17 22 1
text18 3 3 1
text19 9 9 1
text20 12 14 1
text21 48 60 2
text22 19 22 2
text23 14 14 1
text24 17 17 2
text25 5 5 1
text26 12 16 1
text27 16 18 1
text28 22 27 2
text29 27 32 2
text30 16 18 2
text31 13 14 1
text32 86 113 6
text33 14 16 2
text34 13 14 1
text35 30 36 1
text36 15 17 2
text37 24 27 1
text38 7 7 1
text39 3 3 1
text40 34 39 1
text41 16 20 1
text42 17 27 1
text43 13 13 1
text44 22 23 1
text45 20 23 1
text46 13 13 1
text47 10 16 3
text48 13 13 1
text49 14 14 1
text50 15 17 1
text51 20 22 1
text52 28 42 2
text53 91 125 5
text54 14 14 1
text55 19 21 2
text56 15 15 2
text57 42 47 3
text58 24 29 2
text59 68 83 2
text60 11 11 2
text61 18 20 2
text62 11 11 1
text63 23 26 1
text64 46 60 2
text65 2 2 1
text66 15 18 1
text67 19 22 2
text68 49 54 3
text69 5 5 1
text70 12 12 1
text71 68 84 3
text72 16 17 2
text73 5 13 2
text74 76 89 4
text75 3 3 1
text76 7 8 1
text77 16 17 2
text78 10 12 2
text79 64 75 3
text80 23 25 2
text81 31 39 3
text82 6 6 1
text83 9 9 1
text84 26 28 1
text85 51 58 2
text86 30 34 1
text87 4 4 1
text88 13 15 2
text89 2 2 1
text90 22 24 1
text91 27 32 3
text92 48 58 2
text93 35 39 1
text94 9 16 2
text95 39 43 2
text96 15 15 2
text97 21 23 2
text98 72 115 4
text99 15 16 2
text100 2 2 1
Source: C:/Users/leona/gitrepos/datascience/capstone/* on x86-64 by leona
Created: Tue Aug 15 06:30:39 2017
Notes:
Since we see output from the summary() function in the code block we just executed, it is clear that quanteda does not honor the quanteda_options() value when a corpus is summarized with the summary() function.
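When a function prints as a side effect regardless of any option, one possible workaround is to divert the printed text and keep only the returned object. The sketch below uses base R's capture.output() around a hypothetical noisy_summary() function that stands in for a summary method that prints with cat(). It is not quanteda code, and whether the same approach works for a given quanteda function depends on how that function writes its output (text sent through message() would not be caught this way).

```r
# Hypothetical stand-in for a summary function that prints as a side effect
# and also returns a data frame (not part of quanteda).
noisy_summary <- function(texts) {
  cat("Corpus consisting of", length(texts), "documents.\n")
  data.frame(
    Text = paste0("text", seq_along(texts)),
    Chars = nchar(texts),
    stringsAsFactors = FALSE
  )
}

texts <- c("a short text", "another, slightly longer text")

# capture.output() diverts anything written to stdout into a character
# vector; the assignment inside the call still takes effect in this scope.
printed <- capture.output(result <- noisy_summary(texts))

# `result` holds the data frame and `printed` holds the diverted text;
# nothing reaches the report until we print it explicitly.
result[1, ]
```

In a knitr chunk, the same idea keeps the object available for later chunks while leaving the rendered document free of the printed header and table.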
Now we will print the first 10 rows of the summary object.
         Text Types Tokens Sentences
text1   text1    16     16         2
text2   text2    64     81         5
text3   text3     3      3         1
text4   text4     9     11         1
text5   text5     8      8         1
text6   text6    30     33         2
text7   text7     4      4         1
text8   text8    19     24         4
text9   text9    36     38         1
text10 text10    40     58         2
Here we call summary() on the corpus a second time, which dispatches to summary.corpus(), and set the verbose argument to FALSE.
Since there is no output in the space above this text, summary.corpus() does obey its verbose argument, but it does not use the setting from quanteda_options(), which is supposed to control all functions that have a verbose option.
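The behavior we expected from quanteda_options() can be sketched in base R: a function's verbose argument takes its default from a global option, so the option applies when the argument is omitted and an explicit argument overrides it. The names below (describe, demo.verbose) are hypothetical and rely on base R's options() mechanism. This is an illustration of the design, not quanteda's implementation.

```r
# Hypothetical function whose `verbose` default is read from a global option.
describe <- function(x, verbose = getOption("demo.verbose", TRUE)) {
  if (verbose) cat("Describing", length(x), "values\n")
  invisible(summary(x))
}

# Set the global default and remember the previous value.
old <- options(demo.verbose = FALSE)

# With the argument omitted, the function follows the global option.
out_default <- capture.output(describe(1:10))

# An explicit argument overrides the global option.
out_explicit <- capture.output(describe(1:10, verbose = TRUE))

# Restore the previous setting.
options(old)
```

Under this design, out_default is empty and out_explicit contains the status line, which is the opposite of what we observed: summary() on a corpus printed when only the global option was set.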
Now we will print the first 10 rows of the output object from the second summary() function call.
         Text Types Tokens Sentences
text1   text1    16     16         2
text2   text2    64     81         5
text3   text3     3      3         1
text4   text4     9     11         1
text5   text5     8      8         1
text6   text6    30     33         2
text7   text7     4      4         1
text8   text8    19     24         4
text9   text9    36     38         1
text10 text10    40     58         2
Having demonstrated that we are unable to completely suppress unwanted output from quanteda, it’s time to post a bug report / feature request to the author.
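For the startup lines printed when quanteda is attached, a separate mechanism may be relevant. Packages commonly emit such lines with packageStartupMessage(), and base R provides suppressPackageStartupMessages() to silence them. The quietly argument of library() is documented as suppressing the attach confirmation, not these messages. The sketch below demonstrates the mechanism with a hypothetical fake_attach() function in place of a real package, on the assumption that quanteda's startup lines are emitted this way.

```r
# Hypothetical stand-in for a package's attach hook (not quanteda code).
fake_attach <- function() {
  packageStartupMessage("demo version 0.0.1")
  invisible(TRUE)
}

# Count how many messages an expression lets through, muffling each one.
count_messages <- function(expr) {
  n <- 0L
  withCallingHandlers(
    expr,
    message = function(m) {
      n <<- n + 1L
      invokeRestart("muffleMessage")
    }
  )
  n
}

# Without the wrapper, the startup message is signaled.
shown <- count_messages(fake_attach())

# With the wrapper, the startup message is muffled before it gets out.
hidden <- count_messages(suppressPackageStartupMessages(fake_attach()))

# In a report, the equivalent call would be:
# suppressPackageStartupMessages(library(quanteda))
```

This addresses only the messages printed at load time. Output printed by summary() is a different channel and is not affected by this wrapper.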
Here we will print the code used to run the example.
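# note: `messsage` is misspelled below, so knitr leaves its `message`
# option at the default (TRUE) and messages are not suppressed by this call
knitr::opts_chunk$set(echo = FALSE,warning = FALSE,messsage=FALSE,comment=NA)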
library(readr,quietly=TRUE)
library(dtplyr,quietly=TRUE)
library(stringr,quietly=TRUE)
library(stringi,quietly=TRUE)
library(quanteda,quietly=TRUE,verbose=FALSE,warn.conflicts=FALSE)
library(knitr,quietly=TRUE)
quanteda_options(verbose = FALSE)
blogFile <- "./data/en_us.blogs.txt"
newsFile <- "./data/en_us.news.txt"
twitterFile <- "./data/en_us.twitter.txt"
#
# note: change inFile & outFile references for each file type
#
inFile <- blogFile
blogData <- read_lines(blogFile)
newsData <- read_lines(newsFile)
twitterData <- readLines(twitterFile)
allData <- c(blogData,newsData,twitterData) # about 800mb file
# take a random sample of items
set.seed(902101347)
samplePct <- .01
sampleSize <- round(length(allData) * samplePct,0)
allData <- sample(allData,sampleSize)
rm(blogData,newsData,twitterData,blogFile,inFile,newsFile,samplePct,sampleSize)
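# build the corpus from the 1% sample
theText <- corpus(allData)
# first, set quanteda_options()
quanteda_options(verbose = FALSE)
# summary() still prints here despite the global option
aResult <- summary(theText)
# passing verbose = FALSE directly suppresses the printed summary
aResult2 <- summary(theText,verbose=FALSE)
# print the first 10 rows of each result
aResult[1:10,]
aResult2[1:10,]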
One last bit of housekeeping…
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.15.1 quanteda_0.9.9-50 stringi_1.1.5 stringr_1.2.0
[5] dtplyr_0.0.2 readr_1.1.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 magrittr_1.5 hms_0.3
[4] munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35
[7] R6_2.2.0 fastmatch_1.1-0 plyr_1.8.4
[10] dplyr_0.5.0 tools_3.4.0 grid_3.4.0
[13] gtable_0.2.0 data.table_1.10.4 DBI_0.6-1
[16] htmltools_0.3.6 lazyeval_0.2.0 yaml_2.1.14
[19] RcppParallel_4.3.20 rprojroot_1.2 digest_0.6.12
[22] assertthat_0.2.0 tibble_1.3.0 Matrix_1.2-9
[25] ggplot2_2.2.1 evaluate_0.10 rmarkdown_1.5
[28] compiler_3.4.0 scales_0.4.1 backports_1.0.5