This report was prepared as the Milestone Report for the Johns Hopkins Data Science Capstone online class.

The goal is to use the publicly available web text collected by a web crawler, HC Corpora (www.corpora.heliohost.org), to exercise the data science analysis skills, algorithms, and modeling methods I learned in the 2015 JH Data Science Track, and to apply them in the area of natural language processing (NLP).

Executive Summary

A large amount of text-based information is being generated in today's social media and from sources such as e-mail, personal blogs, newspaper articles, Twitter, web pages, and scanned or handwritten notes.

Understanding the problem: the majority of this data is in an unstructured format, which is harder to search, query, retrieve, and analyze.
Natural language processing (NLP) techniques can add structure and semantic information to unstructured text content, allowing us to work efficiently and to turn the data into value for decision making in areas such as marketing, sales and advertising, business decisions, child/youth education, and healthcare.

My primary focus is to use the freeform text of the English (United States) language data files for my exploratory analysis, and then to build the best algorithm to predict the next word from the user's typing. Furthermore, if the prediction is fast, this could help with the typing problem most of us currently struggle with on phone/tablet devices.

  1. The textAnalyzer analysis will learn terms/words from all documents in the English data files

  2. Models each document by counting the number of times each word/term appears. If the collected words/terms become extremely numerous, I would consider limiting the size of the result by defining a maximum number of most frequent words and by removing very common and rarely used words (a minimal sketch follows).
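
As a rough illustration of step 2, the sketch below (function and variable names are hypothetical, not taken from the project sources) counts how often each word appears across a set of documents and caps the result at the N most frequent terms:

# Minimal sketch: count word frequencies across documents and keep the top-N terms.
countTopTerms <- function(docs, topN = 5000) {
  words <- unlist(strsplit(tolower(docs), "[^a-z']+"))  # crude tokenizer
  words <- words[nchar(words) > 0]
  freq  <- sort(table(words), decreasing = TRUE)        # term frequency table
  head(freq, topN)                                      # cap the dictionary size
}

# Example: top 5 terms over two tiny "documents"
countTopTerms(c("The cat sat on the mat", "the dog chased the cat"), topN = 5)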

Methods:

1. Data Collection: Access data

This task was to download the data source from the coursera.com class location, Capstone Dataset. The data used for the analysis originally comes from a corpus called HC Corpora (www.corpora.heliohost.org). More Info

  • The source is a compressed file containing text files of tweets, news articles, and personal blogs in the languages/locales English (United States), Finnish (Finland), German (Germany), and Russian (Russia).
  • Toolset: the programming language “R” was used to download the data.

2. Exploratory Analysis: Explore Data and Basic Statistics

This task was performed by examining the file contents and the tables and figures of the observed data. The data transformation was performed on the raw data on the basis of plots, results summarized using NLP APIs, and knowledge of the scale of the measured variables (learned in the past 2 weeks) described in Natural Language Processing.

Exploratory analysis tasks involved

  1. Created sample datasets, since the sources are extremely big; ideally 70% for training and 30% for testing. Note: in reality, my PC could only handle 10% up to 40% of the data.
  2. Cleaned the data by removing URLs, replacing numbers, punctuation, and control characters (non-English letters), converting to lower case (better for building the n-gram dictionary), removing common/stop/unused words, stemming, etc. Cached results were stored in a local repository for later use.
  3. Identified and collected word frequencies/occurrences and associations of words used in blogs, news, and tweets.
  4. Explored the N-gram API capability: unigram, bigram, trigram, 4-gram, and 5-gram. Cached the n-gram results in a local repository to serve as my first word dictionary database, adopted NLP modeling, and determined the terms to be used in the regression model. A minimal bigram-counting sketch follows this list.
  • Toolset: “R” was used to preprocess the data, create the sample files, and generate the n-gram results.
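
For illustration only, a minimal base-R sketch (independent of the tm/RWeka pipeline listed in the Appendix; names are hypothetical) of how bigram counts can be collected from cleaned lines:

# Minimal sketch: build bigram counts from cleaned lines using only base R.
countBigrams <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  bigrams <- unlist(lapply(tokens, function(w) {
    w <- w[nchar(w) > 0]
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))   # pair each word with its successor
  }))
  sort(table(bigrams), decreasing = TRUE)
}

countBigrams(c("I love data science", "I love natural language processing"))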

3. Statistical Modeling

To relate the n-gram word counts and probabilities, I plan to perform a multiple linear regression: fitting the models, diagnostic plots, comparing models, and cross validation.

  • The strategy for selecting the best model would be to tune smoothing methods such as a Markov (transition) matrix, to predict on 3-grams, 2-grams, or even unigrams, and to propose up to 5 or more best answers (a minimal back-off sketch follows this list).
  • Consider exploring the Katz back-off model for unseen sentences.
  • Toolset: “R” was used to generate the n-gram results, build the prediction model, and perform testing and cross validation.
  • Toolset: knitr was used to generate the Milestone Report in HTML format.
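
To make the back-off idea concrete, here is a minimal sketch, not the final model: it assumes pre-computed trigram, bigram, and unigram frequency tables with columns prefix, word, and count (a layout I am assuming purely for illustration), and falls back to lower-order n-grams when no match is found:

# Minimal back-off sketch (assumed table layout: columns 'prefix', 'word', 'count').
predictNextWord <- function(input, tri.freq, bi.freq, uni.freq, topN = 5) {
  w <- unlist(strsplit(tolower(input), "[^a-z']+"))
  w <- w[nchar(w) > 0]
  lookup <- function(freq, prefix) {
    hits <- freq[freq$prefix == prefix, ]
    head(hits[order(hits$count, decreasing = TRUE), "word"], topN)
  }
  ans <- character(0)
  if (length(w) >= 2)                      # try the trigram table: last two words
    ans <- lookup(tri.freq, paste(tail(w, 2), collapse = " "))
  if (length(ans) == 0 && length(w) >= 1)  # back off to the bigram table: last word
    ans <- lookup(bi.freq, tail(w, 1))
  if (length(ans) == 0)                    # back off to the most frequent unigrams
    ans <- head(uni.freq[order(uni.freq$count, decreasing = TRUE), "word"], topN)
  ans
}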

4. Reproducibility

  • All analyses performed in this report are reproducible from the R markdown file textAnalyzeMilestone.Rmd located in the github project repository.
  • All analyses performed in the preprocessing and task scripts are reproducible from the R sources located in the project repository.
  • When the data source or the sampling probability changes, the preprocessing R scripts must be re-executed to refresh the cached preprocessed datasets; the Rmd file then needs to regenerate the HTML report, which is published to the RPubs site (see the example below).
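
For example, once the preprocessing scripts have refreshed the cached datasets, the report could be regenerated along these lines (one possible approach; the output format is an assumption):

# Re-knit the milestone report to HTML after the cached datasets are refreshed
library(rmarkdown)
render("textAnalyzeMilestone.Rmd", output_format = "html_document")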

5. Create A Data Product

The plan is to build a web-based data product implementing the best prediction model selected in task 3.

  • The product will be web-based and publicly available via the internet.
  • It is expected to interact with users in real time, taking the user’s typing as input and displaying the predicted next word(s).
  • Toolset: “R” will be used to implement the next-word prediction algorithm.
  • Toolset: knitr was used to generate the Milestone Report in HTML format.
  • Toolset: RStudio Presenter will be used to create the product slides introducing the data product’s features.
  • Toolset: Shiny will be used to build the data product, which will be deployed to the Shiny server site and made publicly accessible via the internet (a rough sketch follows).
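
As a rough sketch of the planned product (the layout and the prediction call are placeholders, not the final implementation; predictNextWord() and the *.freq tables are the hypothetical objects sketched in section 3):

# Minimal Shiny sketch: placeholder UI and prediction call, not the final app.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction (sketch)"),
  textInput("phrase", "Type a phrase:", value = ""),
  verbatimTextOutput("nextwords")
)

server <- function(input, output) {
  output$nextwords <- renderText({
    if (nchar(input$phrase) == 0) return("")
    # predictNextWord() and the n-gram tables are the hypothetical objects above
    paste(predictNextWord(input$phrase, tri.freq, bi.freq, uni.freq), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)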

Note: the corresponding summary results, figures, and reference information are available in the Appendix. The R source code is in the github project repository.

Summarized Data and Findings:

So far, I have identified no missing values in the summarized dataset I preprocessed, and all measured variables were observed to be within standard ranges based on NLP.

English (US) Summary Results

  • Blogs contain multiple sentences; the average was 40.94 words per line. The longest line has 39,240 characters, 6,327 words, and 1,685 unique words. In total, we saw 899,288 lines.
  • News items contain smaller sentences; the average was 33.31 words per line. The longest line has 3,764 characters, 544 words, and 271 unique words. In total, we saw 1,010,242 lines.
  • Tweets contain the smallest sentences; the average was 12.44 words per line. The longest line has 140 characters, 47 words, and 36 unique words. In total, we saw 2,360,148 lines.

Summary of All Data Files

| Filename          | FileSizeInByte | FileSizeInMByte | FileLineCnt | LanguageLocation        | Encoding1 | Encoding2  |
|-------------------|----------------|-----------------|-------------|-------------------------|-----------|------------|
| de_DE.blogs.txt   | 85459666       | 81.5 Mb         | 371440      | German (Germany)        | UTF-8     | ISO-8859-1 |
| de_DE.news.txt    | 95591959       | 91.2 Mb         | 244743      | German (Germany)        | UTF-8     | ISO-8859-1 |
| de_DE.twitter.txt | 75578341       | 72.1 Mb         | 947774      | German (Germany)        | UTF-8     | ISO-8859-1 |
| en_US.blogs.txt   | 210160014      | 200.4 Mb        | 899288      | English (United States) | UTF-8     | ISO-8859-1 |
| en_US.news.txt    | 205811889      | 196.3 Mb        | 1010242     | English (United States) | UTF-8     | ISO-8859-1 |
| en_US.twitter.txt | 167105338      | 159.4 Mb        | 2360148     | English (United States) | UTF-8     | ISO-8859-1 |
| fi_FI.blogs.txt   | 108503595      | 103.5 Mb        | 439785      | Finnish (Finland)       | UTF-8     | ISO-8859-1 |
| fi_FI.news.txt    | 94234350       | 89.9 Mb         | 485758      | Finnish (Finland)       | UTF-8     | ISO-8859-1 |
| fi_FI.twitter.txt | 25331142       | 24.2 Mb         | 285214      | Finnish (Finland)       | UTF-8     | ISO-8859-1 |
| ru_RU.blogs.txt   | 116855835      | 111.4 Mb        | 337100      | Russian (Russia)        | UTF-8     | ISO-8859-5 |
| ru_RU.news.txt    | 118996424      | 113.5 Mb        | 196360      | Russian (Russia)        | UTF-8     | ISO-8859-5 |
| ru_RU.twitter.txt | 105182346      | 100.3 Mb        | 881414      | Russian (Russia)        | UTF-8     | ISO-8859-5 |

Summary of English (United States) Data Files

| FieldSummaryBy  | Filetype      | Min. | 1st Qu. | Median  | Mean      | 3rd Qu. | Max.    |
|-----------------|---------------|------|---------|---------|-----------|---------|---------|
| LineCharCnt     | en_US.blogs   | 0    | 44      | 149     | 221.5     | 317     | 39240   |
| LineCharCnt     | en_US.news    | 0    | 104     | 177     | 193.1     | 258     | 3764    |
| LineCharCnt     | en_US.twitter | 1    | 34      | 60      | 64.82     | 94      | 140     |
| rowId           | en_US.blogs   | 1    | 224800  | 449600  | 449600    | 674500  | 899300  |
| rowId           | en_US.news    | 1    | 19320   | 38630   | 38630     | 57940   | 77260   |
| rowId           | en_US.twitter | 1    | 590000  | 1180000 | 1180000   | 1770000 | 2360000 |
| LineWordCnt     | en_US.blogs   | 0    | 8       | 28      | 40.94     | 59      | 6327    |
| LineWordCnt     | en_US.news    | 0    | 18      | 30      | 33.31     | 44      | 544     |
| LineWordCnt     | en_US.twitter | 1    | 7       | 12      | 12.44     | 18      | 47      |
| LineAvgWordLen  | en_US.blogs   | 1    | 4       | 4.387   | 4.556     | 4.879   | 74      |
| LineAvgWordLen  | en_US.news    | 1    | 4.415   | 4.812   | 4.873     | 5.231   | 31      |
| LineAvgWordLen  | en_US.twitter | 1    | 3.733   | 4.188   | 4.305     | 4.733   | 126     |
| LineUniqWordCnt | en_US.blogs   | 0    | 8       | 24      | 31.2      | 46      | 1685    |
| LineUniqWordCnt | en_US.news    | 0    | 17      | 27      | 28.06     | 37      | 271     |
| LineUniqWordCnt | en_US.twitter | 1    | 7       | 11      | 11.72     | 17      | 36      |
| LineHashtagCnt  | en_US.blogs   | 0    | 0       | 0       | 0         | 0       | 0       |
| LineHashtagCnt  | en_US.news    | 0    | 0       | 0       | 0         | 0       | 0       |
| LineHashtagCnt  | en_US.twitter | 0    | 0       | 0       | 0         | 0       | 0       |
| LineHttpCnt     | en_US.blogs   | 0    | 0       | 0       | 0.001702  | 0       | 8       |
| LineHttpCnt     | en_US.news    | 0    | 0       | 0       | 0.0009449 | 0       | 4       |
| LineHttpCnt     | en_US.twitter | 0    | 0       | 0       | 0.000222  | 0       | 2       |

Most 20 Frequent Words

| blogs.word | blogs.wordcnt | blogs.wordprob | news.word | news.wordcnt | news.wordprob | twt.word | twt.wordcnt | twt.wordprob |
|------------|---------------|----------------|-----------|--------------|---------------|----------|-------------|--------------|
| the  | 1855771 | 0.0528795 | the  | 151524 | 0.0608450 | the  | 934172 | 0.0335948 |
| and  | 1086110 | 0.0309483 | to   | 69348  | 0.0278470 | to   | 786629 | 0.0282888 |
| to   | 1065698 | 0.0303667 | and  | 68216  | 0.0273924 | you  | 543700 | 0.0195526 |
| of   | 875028  | 0.0249336 | of   | 59089  | 0.0237274 | and  | 433686 | 0.0155963 |
| in   | 593633  | 0.0169154 | in   | 51464  | 0.0206656 | for  | 384535 | 0.0138287 |
| that | 459500  | 0.0130933 | for  | 27112  | 0.0108869 | in   | 377036 | 0.0135590 |
| is   | 431834  | 0.0123050 | that | 26358  | 0.0105842 | of   | 358981 | 0.0129097 |
| it   | 400905  | 0.0114236 | is   | 21961  | 0.0088185 | is   | 357544 | 0.0128580 |
| for  | 362867  | 0.0103398 | on   | 20578  | 0.0082632 | it   | 291398 | 0.0104793 |
| you  | 296855  | 0.0084588 | with | 19754  | 0.0079323 | my   | 290517 | 0.0104476 |
| with | 286177  | 0.0081545 | said | 19167  | 0.0076966 | on   | 276264 | 0.0099350 |
| was  | 278002  | 0.0079216 | was  | 17625  | 0.0070774 | that | 232907 | 0.0083758 |
| on   | 274047  | 0.0078089 | he   | 17556  | 0.0070497 | me   | 200067 | 0.0071948 |
| my   | 270181  | 0.0076987 | it   | 16693  | 0.0067031 | be   | 187176 | 0.0067312 |
| this | 257977  | 0.0073510 | at   | 16413  | 0.0065907 | at   | 185524 | 0.0066718 |
| as   | 223359  | 0.0063645 | as   | 14662  | 0.0058876 | with | 172995 | 0.0062213 |
| have | 218541  | 0.0062272 | his  | 12107  | 0.0048616 | your | 170771 | 0.0061413 |
| be   | 208303  | 0.0059355 | but  | 11658  | 0.0046813 | have | 168051 | 0.0060435 |
| but  | 203446  | 0.0057971 | from | 11648  | 0.0046773 | so   | 163273 | 0.0058716 |
| are  | 193634  | 0.0055175 | be   | 11579  | 0.0046496 | this | 162736 | 0.0058523 |

Conclusions:

Observations from my analysis

Limitations Identified So Far
- Some of the packages used (e.g. rJava, stringi) were not yet available for the latest R version.
- A 90% sparsity threshold did not retain enough words to measure, so 99% is needed, but that can hit the PC's limits when the dataset is very large.
- Encountered internal defects in R packages (e.g. tm, Matrix) while processing larger sample datasets: task 2 (tm APIs: TermDocumentMatrix, Corpus, n-grams, etc.) ran for over 10 hours on a Windows 10 machine with 12 GB RAM and an 8-core CPU, and R crashed with a vector-size overflow when processing the 70% sample.
- Using the tm APIs to clean documents/terms proved slower than using the stringi/stringr APIs with regular expressions to remove special characters, punctuation, numbers, etc. (illustrated in the sketch after this list).
- The smoothing APIs seem slow and could not handle larger word sets well.
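
For illustration, the two cleaning routes compared above look roughly like this (a minimal sketch with made-up input; no timings are reproduced here):

library(tm); library(stringr)
x <- c("Some text, with numbers 123 and punctuation!!", "Visit http://example.com today...")
# stringr / regular-expression route
x1 <- str_replace_all(tolower(x), "[[:punct:][:digit:]]+", " ")
# tm route: build a corpus and apply the equivalent transformations
corp <- Corpus(VectorSource(x))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
x2 <- sapply(corp, as.character)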

Suggestions and Planned Next Steps:

  1. Continue the statistical modeling tasks: perform a multiple linear regression (fit the models, produce diagnostic plots, compare models, cross-validate), test using the 70%/30% training/test datasets, and add a back-off smoothing algorithm.
  2. Create the data product and deploy it to the Shiny server site.
  3. Prepare the data product slides and publish them to RPubs.

Appendix: R source in R files

  • Getting data - Download data files
rawdata.filename<-"SwiftKey.zip"
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
  dest.dir=data.dir, dest.file=rawdata.filename, unzip.now=TRUE)
# downloadZipFile downloads a zip archive to dest.dir/dest.file, skipping the
# download if the destination file already exists, and optionally unzips it.
downloadZipFile<-function(src.file.url, dest.dir, dest.file, unzip.now=TRUE ) {
  dest.filepath<-file.path(dest.dir, dest.file)
    #("Downloading from ", src.file.url, "\nto dest.filepath=",dest.filepath)  
  if (file.exists(dest.filepath)) { # destination file already exists; no download needed
      #("dest.filepath exists. No download is needed.")
    } else {
      download.file(src.file.url, destfile=dest.filepath, method="libcurl", mode="wb") 
      if (unzip.now) { unzip(zipfile=dest.filepath, exdir=dest.dir) }
      }}
  • preprocessData splits the raw data content into training and testing data files based on a sample probability
## This script was tested in the following environment
## R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
## Platform: x86_64-w64-mingw32/x64 (64-bit)

autopreprocessData<-function() {
  smpproblist<-c(0.05, 0.1, 0.4, 0.7)  
  for (i in seq(smpproblist)) {
    preprocessData(smpproblist[i]) 
  }
}
# preprocessData is to split raw data content to training and testing data files based on sample probability
# The rbinom function defining 70%/30% probability to create 2 sample datasets, and store as .txt files
# - 70% for Training dataset, and save to the filename suffixed with "-train.txt"  
# - 30% Testing sub dataset, and save to the filename suffixed with "-test.txt"  
#
# input: smpprob - sample probability, default to 70%
# output: training and test files store in sample folder
#
preprocessData<-function (smpprob=0.7) {
  library(rJava);   library(NLP);   library(openNLP);  library(RWeka);  library(R.utils);   library(stringr); 
  cur.dir<-getwd(); 
  data.dir<-checkDir(file.path(cur.dir, "mydata"))
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob)))
  sample.train.dir<-checkDir(file.path(sample.dir,"train"))
  sample.test.dir<-checkDir(file.path(sample.dir,"test"))

  #("(1) Getting data - Download data files")
  rawdata.filename<-"SwiftKey.zip"
  downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  dest.dir=data.dir, dest.file=rawdata.filename, unzip.now=TRUE)
  data.filename<-file.path(data.dir, "final/en_US/en_US.blogs.txt")
  #("(2) Process data.filename=",data.filename)
  creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.blogs-train.txt"), 
                  file.path(sample.test.dir, "en_US.blogs-test.txt"))
  data.filename<-file.path(data.dir,"final/en_US/en_US.news.txt")
  #("(3) Process data.filename=",data.filename)
  creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.news-train.txt"), 
                file.path(sample.test.dir, "en_US.news-test.txt"))
  data.filename<-file.path(data.dir,"final/en_US/en_US.twitter.txt")
  #("(4) Process data.filename=",data.filename)
  creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.twitter-train.txt"), 
                file.path(sample.test.dir, "en_US.twitter-test.txt"))
}

creatSampleData<-function (data.filename, smpprob=0.7, smptrain.filename, smptest.filename) {
  #smpprob<-0.7 # get split 70/30% dataset into Train, and Test set  
  #(".1) Read data file, and load to data set")
  data.filesize<-file.info(data.filename)$size
  #("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))  
  data.row<-sapply(data.filename, countLines)
  #("...file countlines =", data.row,  "\tlength(count.fields=", length(count.fields(data.filename)) )
  fcon<-file(data.filename, open="rb")
  data.lines<-readLines(fcon, encoding="UTF-8")
  close(fcon)
  #("...length of data.lines =", length(data.lines))
  data.lines<-iconv(data.lines, "latin1", "ASCII", sub="")  
  #(".2) Create sample files: ", 100*smpprob,"% for training, ", 100*(1-smpprob), "% for test dataset, and save to local files")
  smpsize<-1
  smpidx<-rbinom( 1:length(data.lines), smpsize, smpprob)  
  smp.train<-data.lines[smpidx==1] 
  smp.test<-data.lines[smpidx==0] 
  #("....length of smp.train=", length(smp.train), "\tlength of smp.test=", length(smp.test)) 
  smp.filename<-smptrain.filename
  #("..smptrain save to ", smp.filename)
  writeLines( smp.train,  smp.filename, sep="\n", useBytes=TRUE)  
  data.filesize<-file.info(smp.filename)$size
  #("....file info size=",data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
  data.row<-sapply(smp.filename, countLines)
  #("....rows/file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(smp.filename)) )
  smp.filename<-smptest.filename
  #("..smptest save to ", smp.filename)
  writeLines( smp.test,  smp.filename, sep="\n", useBytes=TRUE)  
  data.filesize<-file.info(smp.filename)$size
  #("....file info size=",data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto")) 
  data.row<-sapply(smp.filename, countLines)
  #("....rows/file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(smp.filename)) )
}
  • task1.1 gets various counts per file; the result includes file size, line count, language (locale), and encoding.
  • task1.2 gets various counts per line and summarizes them as min, 1st quartile, median, mean, 3rd quartile, and max values.
autotask1<-function () {
  task1.1()
  task1.2(showOne=0)
}
#
# task1.1 is to get various counts per file; the result includes file size, line count, language (locale), and encoding  
#   All summary result is stored in report folder.  Output file format is 
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
# input: none
# output: summary result files in report folder
#
task1.1<-function () {
  library(pryr);   library(tm);   library(R.utils)
  cur.dir<-getwd(); 
  data.dir<-checkDir(file.path(cur.dir, "mydata"))
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));  #(" current.dir=", cur.dir, "\ndata.dir", data.dir, "\noutput.dir", output.dir )
  #( "(1) Get basic summary data for all data files")
  #("..Get all file names and file size\n")
  flist.all<-DirSource( file.path(cur.dir, 'mydata', 'final'), recursive=TRUE)
  flist<- basename(flist.all$filelist)
  data.filesize<-file.info(flist.all$filelist)$size
  data.filesize2<-utils:::format.object_size(data.filesize, "auto")
  #("..Get each file line count")
  require(R.utils)
  data.row<-sapply(flist.all$filelist, countLines)
  langlist<-c( rep('German (Germany)',3), rep('English (United States)',3), rep('Finnish (Finland)', 3), rep('Russian (Russia)',3))
  elist.1<-c(rep("UTF-8",12))
  elist.2<-c(rep("ISO-8859-1",9), rep('ISO-8859-5',3))
  dfsummary<-as.data.frame( cbind(flist, data.filesize, data.filesize2, data.row,langlist, elist.1, elist.2))  
  rownames(dfsummary)<-c(1:12)
  colnames(dfsummary)<-c('Filename', 'FileSizeInByte', 'FileSizeInMByte', 'FileLineCnt', 'LanguageLocation', 'Encoding1', 'Encoding2')
  #("..Build data frame colname=", colnames(dfsummary))
  output.filename <-file.path(output.dir, "data.summary.rds")
  #("..Save summary for all data file to ", output.filename)
  saveRDS(dfsummary, file=output.filename)   # the .rds is read back by the Rmd report
  write.csv( dfsummary,  file.path(output.dir, "data.summary.txt"))  
}
#
# task1.2 is to get various counts per line, and summarize them as min, 1st quartile, median, mean, 3rd quartile and max values.  
#   All summary result is stored in report folder.  Output file format is 
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
# input: showOne - index of file number (sequence order of filelist at en_US folder).
#   If zero, process, generate summary report for all files, and store into report folder
#   i.e.  1=en_US.blogs.txt  2=en_US.news.txt  3=en_US.twitter.txt
# output: summary result files in report folder
#
task1.2<-function (showOne=0) {
  library(pryr);   library(stringi);   library(stringr);   library(tm)
  cur.dir<-getwd(); 
  data.dir<-checkDir(file.path(cur.dir, "mydata"))
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
  if (showOne==0) {
    rangevalues<-c(1:3)
  }
  else  {
    rangevalues <-showOne
  }
  #( "(2) Get summary details for English US data files")   #("..Get all file names and file size\n")
  flist.en<-DirSource( file.path(cur.dir, 'mydata', 'final', "en_US"), recursive=TRUE)
  flist<- basename(flist.en$filelist)    
  for (i in rangevalues)
  {
  #("(",i+2, ") Get summary details ", flist.en$filelist[i]) #("..Readline from ", flist[i])
    data.lines<-sapply(flist.en$filelist[i], function(x) { 
      theLine<-readLines(x, encoding="UTF-8")
      iconv(theLine, "latin1", "ASCII", sub="")
    } )
    colnames(data.lines)<-flist[i]
    require(stringi)
  #("..Replace special characters") 
    data<-str_replace_all(data.lines, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
    data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
    data<-stri_trans_tolower(data)    
    data.nrow<-length(data)
  
    dfdata<-data.frame( LineCharCnt=sapply(data, nchar)) #("..Character count per line")
    dfdata$rowId<- c(1:data.nrow) 
    rownames(dfdata)<-c(1:data.nrow)
    line.wordlist<- strsplit(data,' ')  
  
    dfdata$LineWordCnt<-sapply(line.wordlist, function(x) {#("..Word count per line")
      sum(table(x[ x != ""]))
    }    )
  #("..Average Word Length per line")
    dfdata$LineAvgWordLen<-sapply(line.wordlist, function(x) {mean(nchar(x[ x != ""]))} )
  #("..Unique Word count per line")
    dfdata$LineUniqWordCnt<-sapply(line.wordlist, function(x) {length(table(unique(x[ x != ""])))}    )
  #("..Hash tag count per line")
    dfdata$LineHashtagCnt<-sapply(line.wordlist, function(x) {length(grep("#", x))} )
  #("..Http text count per line")
    dfdata$LineHttpCnt<-sapply(line.wordlist, function(x) {length(grep("http", x))} )    
    fbasename<-substr(flist[i], 1, nchar(flist[i])-4)
    output.filename <-file.path(output.dir, paste0(fbasename , ".linesummary.rds"))
  #("..Save line summary for en_US data files to ", output.filename)
    write.csv( dfdata,  file=file.path(output.dir, paste0(fbasename , ".linesummary.txt")))    
    mdatasummary<-sapply(dfdata, summary) # return class could be list or matrix
    output.filename <-file.path(output.dir, paste0(fbasename , ".datasummary.rds"))
    saveRDS(mdatasummary, file=output.filename)   # cache the per-field summary for later use
    dfdatasummary<-data.frame()
    if (class(mdatasummary)=='list') {
      data.row<-length(mdatasummary) 
    } 
    else {
      data.row<-ncol(mdatasummary) 
    }
    range.value<-c(1:data.row)    
    for (j in range.value) {
      if (class(mdatasummary)=='list') {
        theRow<- cbind(Filetype=fbasename, FieldSummaryBy=names(mdatasummary[j]), rbind(mdatasummary[[j]][1:6]))
      } else {
        theRow<- cbind(Filetype=fbasename, FieldSummaryBy=colnames(mdatasummary)[j], rbind(mdatasummary[1:6,j]))
      }
      dfdatasummary<-rbind(dfdatasummary, theRow)
    }    
    output.filename <-file.path(output.dir, paste0(fbasename , ".summary.rds"))
    saveRDS(dfdatasummary, file=output.filename)  # the .rds is read back by the Rmd report
    write.csv( dfdatasummary,  file=file.path(output.dir, paste0(fbasename , ".summary.txt")))  
    
    wordlist<-unlist(line.wordlist)
    wordlist<-unlist( sapply(wordlist, function(x) {x[ nchar(x)>1]} ) )
    mostfreqword<-sort(table(wordlist), decreasing=TRUE)
    
    dfmfw<-data.frame(count=mostfreqword)
    dfmfw$word<-rownames(mostfreqword)
    dfmfw$prob<- sapply(mostfreqword,sum)/sum(mostfreqword)
    data.nrow<-nrow(dfmfw)
    rownames(dfmfw)<-c(1:data.nrow)
    
    output.filename <-file.path(output.dir, paste0(fbasename, ".mostfreqword.rds"))
  #("..Save most frequent word for en_US data files to ", output.filename)
    saveRDS(dfmfw, file=output.filename)          # the .rds is read back by the Rmd report
    write.csv( dfmfw,  file=file.path(output.dir, paste0(fbasename , ".mostfreqword.txt")))      
  }
}
  • task2.1 explores the data: the relationship of words to the documents/corpus, and calculates the most frequent words and their counts. Because the analysis results are extremely large, all results are stored in the report folder.
  • task2.2 computes the n-gram frequencies. The unigram, bigram, 3-gram, 4-gram, and 5-gram results are stored in the report folder.
autotask2<-function ( showOne=0, smpprob=0.7, nline2read=500,
                  searchword='day', searchcormin=0.3, maxsparse=0.9) {
library(tm)
  cat('current working directory=', getwd())  
  cur.dir<-getwd(); 
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob), "train"))
  data.dir<-sample.dir
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
  
  #( "Explore data analysis for English US data files")
  #("..Get all file names and file size\n")
  flist.en<-DirSource( sample.dir, recursive=TRUE)
  flist<- basename(flist.en$filelist)  
  if (showOne==0) {   rangevalues<-c(1:3)  }
  else  {    rangevalues <-showOne  }  

  for (i in rangevalues)
  {
    datafname<- flist[i]
    print (gc())
    task2.1( datafname=datafname, smpprob, nline2read, searchword, searchcormin, maxsparse) 
    task2.2( datafname=datafname, smpprob, nline2read, searchword, maxsparse) 
  }
} #autotask2
#
# task2 is to explore the data: the relationship of words to the documents/corpus, and to calculate the
# most frequent words and counts.  Because the analysis results are extremely large, all results are
# stored in the report folder.  Output file format is 
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
#
# input: datafname - data file name which contain train data
#     samprob - sample probability for analysis, default to 70%
#     nline2read - max lines to read.  If zero, read all lines from the file.  default to 500
#     searchword - word to search association from frequent words
#     searchcormin - correlation limit used for search association, so word result will return if greater
#          or equal to.  default to 30%
#     maxsparse - maximum probability to remove sparse term from document matrix.  default to 90%
# output: summary result files in report folder
#
task2.1<-function ( datafname="en_US.news-train.txt", smpprob=0.7, nline2read=500,
                  searchword='day', searchcormin=0.3, maxsparse=0.9) {
library(rJava); library(NLP); library(openNLP); library(RWeka); library(R.utils); library(stringr);library(stringi);
library(tm);library(SnowballC);library(RColorBrewer);library(wordcloud);library(slam);library(reshape2)

  cat('current working directory=', getwd())  
  cur.dir<-getwd(); 
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob), "train"))
  data.dir<-sample.dir
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
  
  data.filename<-file.path(data.dir, datafname)
  #("..Read from ",  data.filename)
  data.filesize <- file.info(data.filename)$size
  #("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))  
  data.row <- sapply(data.filename, countLines)
  #("..file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(data.filename)) )  
  fcon <- file(data.filename, open="rb")
  if (nline2read == 0) {     theLine<-readLines(fcon, encoding="UTF-8")  }
  else {    theLine<-readLines(fcon, encoding="UTF-8", n=nline2read)  }
  data.lines<-iconv(theLine, "latin1", "ASCII", sub="")
  close(fcon)
  data.nrow<-length(data.lines)

  #("(2) Clean Data and transform text")
  #("....remove URLs ")
  data<-str_replace_all(data.lines, pattern="http[^[:space:]]*", replacement="")
  #("....replace Number, Punctuation, Control keys (any other than English letters/) with a space")
  data<-str_replace_all(data, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
  data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
  #("....convert tolower case ")
  data<-stri_trans_tolower(data)
  #("....build stop words adding available/via, no r/big")
   myStopwords <- c(stopwords('english'), "available", "via", "na","rrrr")
   myStopwords <- setdiff(myStopwords, c("r", "big"))
  #("....remove stop (common, unused) words: ", myStopwords)
  stopword.cnt <- length(myStopwords)
  range.value<-c(1:stopword.cnt)
  for ( i in range.value) {
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0("^", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], "$"), replacement=" " )
  }
  #("..Character count per line", sum(nchar(data.lines)))
  
  line.wordlist<- strsplit(data,' ') # word tokenize
  wordlist<-line.wordlist 
  wordlist<-lapply(wordlist, function(x) {x[ nchar(x)>1]} ) #("..wordlist nchar >1 = ", length(wordlist))
  wordlist<- lapply(wordlist, reverselist) #("..reverse list which have word nchar >1 = ", length(wordlist))

  #("..Build a corpus with charactor vector source")
  corpus.doc <- Corpus(VectorSource(wordlist)) 
  #("....using stemming document to remove common words endings (ie. ing/es/s)")
  require(SnowballC)
  corpus.doc <- tm_map(corpus.doc, stemDocument, language = "english")
  #("....head(summary(corpus.doc))")
  #("....remove unnecessary/extra whitespaces")
  corpus.doc <- tm_map(corpus.doc, stripWhitespace)
  
  fbasename<-substr(datafname, 1, nchar(datafname)-4)
  docs<-corpus.doc
  tdm<-TermDocumentMatrix(docs)  
  #("....tdm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".tdm.rds"))
  #("..Save TermDocumentMatrix result to ", output.filename)
  #(saveRDS(tdm, file= output.filename))   
  dtm<-DocumentTermMatrix(docs)  
  #("....dtm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".dtm.rds"))
  #("..Save DocumentTermMatrix result to ", output.filename)
  #(saveRDS(dtm, file= output.filename))

  #------- FYI only, so we could tune the max sparse value at runtime:
  #        inspect how many terms survive at each sparsity threshold
  for (sparse in c(0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97,
                   0.98, 0.99, 0.999, 0.9999, 0.99999)) {
    dtmz <- removeSparseTerms(dtm, sparse)   #(paste0("dtm.", sparse, "="), dim(dtmz))
  }

  mfw.maxnbr<-20
  dtm <- removeSparseTerms(dtm, maxsparse)
  #("..Call remove sparse earlier max sparse=", maxsparse, " dim="); 
  dtm.dense<-as.matrix(dtm)
  #("..dtm.dense object.size=", object.size(dtm.dense))
  require(reshape2)
  dtm.dense = melt(dtm.dense, value.name = "count")
  #("....dtm.dense dim=" );print(dim(dtm.dense))
  dtm.dense<-dtm.dense[which(dtm.dense$count>0),]
  dtm.dense<-aggregate(dtm.dense["count"], by=dtm.dense[c("Terms")], FUN=sum)
  #("....dtm.dense >0  dim=" );print(dim(dtm.dense))
  dtm.dense<-dtm.dense[order(dtm.dense$count, decreasing = TRUE),]
  freq<-dtm.dense 
  #("....All Frequency dtm.dense len=", length(freq), " nrow=", nrow(freq))
  if (nrow(freq) < mfw.maxnbr) range.value <- c(1:nrow(freq))
  else range.value <- c(1:mfw.maxnbr)

  graphics.off()
  callingFuncName<-"task2.1"   # prefix for the saved plot file name
  png(file.path(output.dir, paste0(callingFuncName,fbasename,".plots.png")), 
    width = 500, height = 500, units = "px", bg="transparent")
  par(mfrow=c(3,1), mar=c(5,4,2,0)+0.5,  cex=1, las=1, oma=c(2,1,0,0))
  p1<-hist(log(freq[range.value,"count"]), main =paste0(fbasename, "- Histogram for Most ", mfw.maxnbr, " Frequent Words"), 
           xlab='Log of Frequent Word Count')
  p2<-barplot(freq[range.value,"count"], las = 2, names.arg = freq[range.value,"Terms"],
        col ="lightgreen", main =paste0(fbasename, "- Most ", mfw.maxnbr, " Frequent Words"),
        ylab = "Word Count")
  output.filename <-file.path(output.dir, paste0(fbasename, ".mostfreqword.rds"))
  #("..Save frequent words to ", output.filename)  #(saveRDS(freq, file= output.filename))
  write.csv( freq,  file=file.path(output.dir, paste0(fbasename , ".mostfreqword.txt")))    
  if (nrow(freq) > 0) {
    require(wordcloud) 
    pal = brewer.pal(8,"Dark2") #Paired, Set1-3, Pastel1-2, Accent,Dark2, BuPu, BuGn, Blues    
    p3<-wordcloud( words=freq$Terms, freq=freq$count, scale=c(4,0.5), random.order=FALSE, 
               colors=pal, min.freq = 5, main=paste0(fbasename, "- Most ", mfw.maxnbr, " Frequent Word Cloud")) 
  }
  freqterm <-findAssocs(dtm, searchword, searchcormin) #("..search word=", searchword, "\nfindAssocs=\n")
  #("....suggest next words..", names(freqterm[[1]][1:length(freqterm[[1]])]))
  print(p1)
  print(p2)
  print(p3)
  dev.off()
  graphics.off()
}

#
# task2.2 is to get the n-gram frequencies.  The unigram, bigram, 3-gram, 4-gram, and 5-gram results are stored
#    in the report folder.
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
# input: datafname - data file name which contain train data
#     samprob - sample probability for analysis, default to 70%
#     nline2read - max lines to read.  If zero, read all lines from the file.  default to 500
#     searchword - word to search association from frequent words
#     searchcormin - correlation limit used for search association, so word result will return if greater
#          or equal to
#     maxsparse - maximum probability to remove sparse term from document matrix.  default to 90%
# output: summary result files in report folder
#
task2.2<-function ( datafname="en_US.news-train.txt", smpprob=0.7, nline2read=500, 
                    searchword='day', maxsparse=0.90) {
  library(rJava);  library(NLP);  library(openNLP);  library(RWeka);  library(R.utils);  library(stringr)
  library(stringi);  library(tm);  library(SnowballC);  library(RColorBrewer);  library(wordcloud)
  cat('current working directory=', getwd())  
  cur.dir<-getwd(); 
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob),"train"))
  data.dir<-sample.dir
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));

  #("(1) Read training data file, and load to data set")  
  data.filename<-file.path(data.dir, datafname)
  #("..Read from ",  data.filename)
  data.filesize <- file.info(data.filename)$size
  #("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))  
  data.row <- sapply(data.filename, countLines)
  #("..file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(data.filename)) )
  fcon <- file(data.filename, open="rb")
  if (nline2read == 0) {     theLine<-readLines(fcon, encoding="UTF-8")  }
  else {    theLine<-readLines(fcon, encoding="UTF-8", n=nline2read)  }
  data.lines<-iconv(theLine, "latin1", "ASCII", sub="")
  close(fcon)
  data.nrow<-length(data.lines)
  
  #("(2) Clean Data and transform text")  
  #("..Character count per line", sum(nchar(data.lines)))

  #("....remove URLs ")
  data<-str_replace_all(data.lines, pattern="http[^[:space:]]*", replacement="")
  #("....replace Number, Punctuation, Control keys (any other than English letters/) with a space")
  data<-str_replace_all(data, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
  data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
  #("....convert tolower case ")
  data<-stri_trans_tolower(data)  
  #("....build stop words adding available/via, no r/big")
  myStopwords <- c(stopwords('english'), "available", "via", "na","rrrr")
  myStopwords <- setdiff(myStopwords, c("r", "big")) 
  #("....remove stop (common, unused) words: ", myStopwords)  
  stopword.cnt <- length(myStopwords)  
  range.value<-c(1:stopword.cnt)
  for ( i in range.value) {
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0("^", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], "$"), replacement=" " )
  }
  #("..Character count per line", sum(nchar(data.lines)))
  line.wordlist<- strsplit(data,' ')  #word tokenize
  wordlist<-line.wordlist  #(".. wordlist length=", length(wordlist), " nchar(head wordlist)=", nchar(head(wordlist)))
  #("..wordlist nchar >1 = ", length(wordlist))
  wordlist<- lapply(wordlist, reverselist) #("..reverse list which have word nchar >1 = ", length(wordlist))
  #("..Build a corpus with charactor vector source   class(wordlist)=", class(wordlist))
  corpus.doc <- Corpus(VectorSource(wordlist)) 
  writeLines(strwrap(as.character(corpus.doc[[1]]), width=60))    
  #("....using stemming document to remove common words endings (ie. ing/es/s)")
  require(SnowballC)
  corpus.doc <- tm_map(corpus.doc, stemDocument, language = "english")
  #("....head(summary(corpus.doc))")
  writeLines(strwrap(as.character(corpus.doc[[1]]), width=73))
  #("....remove unnecessary/extra whitespaces")
  corpus.doc <- tm_map(corpus.doc, stripWhitespace)

  fbasename<-substr(datafname, 1, nchar(datafname)-4)
  docs<-corpus.doc
  tdm<-TermDocumentMatrix(docs) 
  #("....tdm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".tdm.rds"))
  #("..Save TermDocumentMatrix result to ", output.filename)  #(saveRDS(tdm, file= output.filename))

  dtm<-DocumentTermMatrix(docs)    #("....dtm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".dtm.rds"))
  #("..Save DocumentTermMatrix result to ", output.filename)  #(saveRDS(dtm, file= output.filename))

  #("(3) Ngram Language Modeling")
  mfw.maxnbr<-20
  require(RWeka)
  dtm <- removeSparseTerms(dtm, maxsparse)   # due to a vector-size error, remove sparse terms early (max sparse = maxsparse); print(dim(dtm))
  
  #("..Unigram " )
  n1gramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1, delimiters = " \r\n\t"))
  tdm.1gram<-TermDocumentMatrix(docs, control=list(tokenize = n1gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.1gram.rds"))
  #("..Save n1gram result to ", output.filename) #(saveRDS(tdm.1gram, file= output.filename))
  tdm.1gram <- removeSparseTerms(tdm.1gram, maxsparse)
  nrow.1gram<-rowSums(as.matrix(tdm.1gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".1gram.rds"))
  #("..Save n1gram result to ", output.filename)  #(saveRDS(tdm.1gram, file= output.filename))
  mfw.1gram <- tail(sort(nrow.1gram), mfw.maxnbr)
  #("....Unigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.1gram) )# , " length=", length(nrow.1gram)) print(mfw.1gram)

  #("..Bigram")
  n2gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=2, max=2, delimiters = " \r\n\t"))
  tdm.2gram<- TermDocumentMatrix(docs, control = list(tokenize = n2gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.2gram.rds"))
  #("..Save n2gram result to ", output.filename)  #(saveRDS(tdm.2gram, file= output.filename))
  tdm.2gram <- removeSparseTerms(tdm.2gram, maxsparse)
  nrow.2gram<-rowSums(as.matrix(tdm.2gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".2gram.rds"))
  #("..Save n2gram result to ", output.filename)  #(saveRDS(tdm.2gram, file= output.filename))
  mfw.2gram<-tail(sort(nrow.2gram), mfw.maxnbr)
  #("....Bigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.2gram), " length=", length(nrow.2gram))  print(mfw.2gram)

  #("..n3-gram")
  n3gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=3, max=3, delimiters = " \r\n\t"))
  tdm.3gram<- TermDocumentMatrix(docs, control = list(tokenize = n3gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.3gram.rds"))
  #("..Save n3gram result to ", output.filename)  #(saveRDS(tdm.3gram, file= output.filename))
  tdm.3gram <- removeSparseTerms(tdm.3gram, maxsparse)
  nrow.3gram<-rowSums(as.matrix(tdm.3gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".3gram.rds"))
  #("..Save n3gram result to ", output.filename)  #(saveRDS(tdm.3gram, file= output.filename))
  mfw.3gram<-tail(sort(nrow.3gram), mfw.maxnbr)
  #("....Trigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.3gram), " length=", length(nrow.3gram))  print(mfw.3gram)
  
  #("..n4-gram")
  n4gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=4, max=4, delimiters = " \r\n\t"))
  tdm.4gram<- TermDocumentMatrix(docs, control = list(tokenize = n4gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.4gram.rds"))
  #("..Save n4gram result to ", output.filename)  #(saveRDS(tdm.4gram, file= output.filename))
  tdm.4gram <- removeSparseTerms(tdm.4gram, maxsparse)
  nrow.4gram<-rowSums(as.matrix(tdm.4gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".4gram.rds"))
  #("..Save n4gram result to ", output.filename)  #(saveRDS(tdm.4gram, file= output.filename))
  mfw.4gram<-tail(sort(nrow.4gram), mfw.maxnbr)
  #("....n4gram...Most ", mfw.maxnbr, " words\n class=", class(nrow.4gram), " length=", length(nrow.4gram))  print(mfw.4gram )

  #("..n5-gram")
  n5gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=5, max=5, delimiters = " \r\n\t"))
  tdm.5gram<- TermDocumentMatrix(docs, control = list(tokenize = n5gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.5gram.rds"))
  #("..Save n5gram result to ", output.filename)  #(saveRDS(tdm.5gram, file= output.filename))
  tdm.5gram <- removeSparseTerms(tdm.5gram, maxsparse)
  nrow.5gram<-rowSums(as.matrix(tdm.5gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".5gram.rds"))
  #("..Save n5gram result to ", output.filename)  #(saveRDS(tdm.5gram, file= output.filename))
  mfw.5gram<-tail(sort(nrow.5gram), mfw.maxnbr)
  #("....n5ngram...Most ", mfw.maxnbr, " words\n class=", class(nrow.5gram), " length=", length(nrow.5gram))  print(mfw.5gram)

  #("..n1-n5 gram")
  n15gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=1, max=5, delimiters = " \r\n\t"))
  tdm.15gram<- TermDocumentMatrix(docs, control = list(tokenize = n15gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.15gram.rds"))
  #("..Save n15gram result to ", output.filename)   #(saveRDS(tdm.15gram, file= output.filename))
  tdm.15gram <- removeSparseTerms(tdm.15gram, maxsparse)
  nrow.15gram<-rowSums(as.matrix(tdm.15gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".15gram.rds"))
  #("..Save n15gram result to ", output.filename)  #(saveRDS(tdm.15gram, file= output.filename))
  mfw.15gram<-tail(sort(nrow.15gram), mfw.maxnbr)
  #("....n1-n5ngram...Most ", mfw.maxnbr, " words\n class=", class(nrow.15gram), " length=", length(nrow.15gram))  print( mfw.15gram)

  graphics.off()
  callingFuncName<-"task2.2"   # prefix for the saved plot file name
  png(file.path(output.dir, paste0(callingFuncName,fbasename,".plots.png")),
    width = 600, height = 800, units = "px", bg="transparent")
  par(mfrow=c(3,2), mar=c(5,4,2,1)+0.5, cex=1, las=2, oma=c(2,4,2,2) )
  if (length(nrow.1gram) > 0 ) {
    p1<-barplot(mfw.1gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Unigrams"),horiz=TRUE)
    print(p1) #("....Print Unigrams images")
  }
  if (length(nrow.2gram)>0  ) {
    p2<-barplot(mfw.2gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Bigrams"),horiz=TRUE)
    print(p2) #("....Print Bigrams images")
  }
  if (length(nrow.3gram) > 0 ) {
    p3<-barplot(mfw.3gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Trigrams"), horiz=TRUE)
    print(p3) #("....Print Trigrams images")
  }
  if (length(nrow.4gram) > 0 ) {
    p4<-barplot(mfw.4gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n4 grams"), horiz=TRUE)
    print(p4) #("....Print n4 grams images")
  }
  if (length(nrow.5gram) > 0 ) {
    p5<-barplot(mfw.5gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n5 grams"), horiz=TRUE)
    print(p5) #("....Print n5 grams images")
  }
  if (length(nrow.15gram) > 0 ) {
    p6<-barplot(mfw.15gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n1-n5 grams"), horiz=TRUE)
    print(p6) #("....Print n1-n5 grams images")
  }
  dev.off()
  graphics.off()
}
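
# ------------------------------------------------------------------------------
# Illustrative workflow (an assumption about how these scripts are run from the
# project root; not part of the original task sources): build the cached samples,
# then generate the per-file summaries and the n-gram results used by the report.
# ------------------------------------------------------------------------------
autopreprocessData()                    # create the 5%/10%/40%/70% sample files
autotask1()                             # file- and line-level summaries
autotask2(smpprob=0.1, nline2read=0)    # n-gram analysis on the 10% sample, all lines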

Appendix: R code in this file (textAnalyzerMilestone.Rmd)

#display output of task1() in task1.R
dfsummary.alldata<-readRDS(file.path(cur.dir, 'mydata', 'report', 'data.summary.rds'))
require(knitr)
kable(dfsummary.alldata, align='l', caption = "Summary of All Data Files" )
#display output of task1.2() in task1.R
dblogs<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.blogs.summary.rds'))
dnews<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.news.summary.rds'))
dtwt<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.twitter.summary.rds'))
dfsummary <- data.frame( rbind(dblogs, dnews, dtwt) )

attach(dfsummary)
kable(dfsummary[order(FieldSummaryBy, Filetype),c(2,1,3:8)], align='l', caption = "Summary of English (United States) Data Files", row.names=FALSE )
detach(dfsummary) 
#most frequent words
mfw.maxnbr<-20
dblogs.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.blogs.mostfreqword.rds'))
dnews.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.news.mostfreqword.rds'))
dtwt.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.twitter.mostfreqword.rds'))
dfmfw <- data.frame(cbind( dblogs.mfw[1:mfw.maxnbr,c(2,1,3)], dnews.mfw[1:mfw.maxnbr,c(2,1,3)], dtwt.mfw[1:mfw.maxnbr,c(2,1,3)] ))

dfmfw.r <- data.frame(rbind( cbind(type='blogs',dblogs.mfw[1:mfw.maxnbr,c(2,1,3)]), 
                             cbind(type='news', dnews.mfw[1:mfw.maxnbr,c(2,1,3)]),
                             cbind(type='twitter',dtwt.mfw[1:mfw.maxnbr,c(2,1,3)]) ))
dfmfw.r$type <- factor(dfmfw.r$type)

require(ggplot2)
colnames(dfmfw) <- c("blogs.word", "blogs.wordcnt","blogs.wordprob","news.word","news.wordcnt", "news.wordprob","twt.word","twt.wordcnt", "twt.wordprob")
kable( dfmfw[1:mfw.maxnbr,], align='l', caption = paste0("Most ", mfw.maxnbr," Frequent Words" ))

ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) + 
  geom_bar(stat="identity", position=position_dodge()) + 
  ggtitle("Most 20 Frequent Word Count by File Type") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) + 
  geom_bar(stat="identity", position=position_dodge()) + facet_grid(type~.) +
  ggtitle("Most 20 Frequent Word Count by File Type") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) + 
  geom_line() + geom_smooth() + 
  ggtitle("Most 20 Frequent Word Count by File Type") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

p4<-ggplot(data=dfmfw.r, aes(x=count,y=prob, color=type, fill=type)) + geom_bar(stat="identity")+ 
    geom_line() + geom_point(size=4, shape=21, fill="white") +
    labs(x="Word Count Per Line", y="Word Probability Per Line") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p5<-ggplot(data=dfmfw.r, aes(x=count, y=log(prob), color=type, fill=type))+ geom_point(size=4, shape=21, fill="white") + 
  stat_smooth(method="lm") + labs(x="Word Count Per Line", y="Log of Word Probability Per Line") + 
  geom_line() + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

require(gridExtra)   # grid.arrange() is provided by gridExtra
grid.arrange(p4,p5, ncol=2)

qplot(log(count), data=dfmfw.r, geom="density",  fill=type, xlab='Log(Word Count Per Line)',  ylab='Density', main="Distribution of Word Count by File Type")+ 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))