This report was prepared as the Milestone Report for the Johns Hopkins Data Science Capstone online class.

The goal is to use the publicly available web text collected by a web crawler, HC Corpora (www.corpora.heliohost.org), to exercise the data science analysis skills, algorithms, and modeling methods I learned in the 2015 JH Data Science Track, and to apply them in the area of natural language processing (NLP).

Executive Summary

A large amount of text-based information is being generated in today's social media and from sources such as e-mail, personal blogs, newspaper articles, Twitter, web pages, and scanned or handwritten notes.

Understanding the problem: the majority of this data is in an unstructured format, which is harder to search, query, retrieve, and analyze.
Natural language processing (NLP) techniques can add structure and semantic information to unstructured text content, allowing us to work efficiently and to turn the data into value for decision making in areas such as marketing, sales and advertising, business decisions, child/youth education, and healthcare.

My primary focus is to use the freeform text of the English (United States) language data files for my exploratory analysis, and then to build the best algorithm to predict the next word from the user's typing. Furthermore, if the prediction is fast, this could help with the typing problem most of us currently struggle with on phone/tablet devices.

  1. The textAnalyzer analysis will learn terms/words from all documents in the English data files

  2. Models each document by counting the number of times each word/term appears. If the collected words/terms become extremely numerous, I would consider limiting the size of the result by defining a maximum number of most frequent words and by removing very common and rarely used words (a minimal sketch follows).
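
As a rough illustration of step 2, the sketch below (function and variable names are hypothetical, not taken from the project sources) counts how often each word appears across a set of documents and caps the result at the N most frequent terms:

# Minimal sketch: count word frequencies across documents and keep the top-N terms.
countTopTerms <- function(docs, topN = 5000) {
  words <- unlist(strsplit(tolower(docs), "[^a-z']+"))  # crude tokenizer
  words <- words[nchar(words) > 0]
  freq  <- sort(table(words), decreasing = TRUE)        # term frequency table
  head(freq, topN)                                      # cap the dictionary size
}

# Example: top 5 terms over two tiny "documents"
countTopTerms(c("The cat sat on the mat", "the dog chased the cat"), topN = 5)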

Methods:

1. Data Collection: Access data

This task was to download the data source from the coursera.com class location, Capstone Dataset. The data used for the analysis originally comes from a corpus called HC Corpora (www.corpora.heliohost.org). More Info

  • The source is a compressed file containing text files of tweets, news articles, and personal blogs in the languages/locales English (United States), Finnish (Finland), German (Germany), and Russian (Russia).
  • Toolset: the programming language “R” was used to download the data.

2. Exploratory Analysis: Explore Data and Basic Statistics

This task was performed by examining the file contents and the tables and figures of the observed data. The data transformation was performed on the raw data on the basis of plots, results summarized using NLP APIs, and knowledge of the scale of the measured variables (learned in the past 2 weeks) described in Natural Language Processing.

Exploratory analysis tasks involved

  1. Created sample datasets, since the sources are extremely big; ideally 70% for training and 30% for testing. Note: in reality, my PC could only handle 10% up to 40% of the data.
  2. Cleaned the data by removing URLs, replacing numbers, punctuation, and control characters (non-English letters), converting to lower case (better for building the n-gram dictionary), removing common/stop/unused words, stemming, etc. Cached results were stored in a local repository for later use.
  3. Identified and collected word frequencies/occurrences and associations of words used in blogs, news, and tweets.
  4. Explored the N-gram API capability: unigram, bigram, trigram, 4-gram, and 5-gram. Cached the n-gram results in a local repository to serve as my first word dictionary database, adopted NLP modeling, and determined the terms to be used in the regression model. A minimal bigram-counting sketch follows this list.
  • Toolset: “R” was used to preprocess the data, create the sample files, and generate the n-gram results.
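
For illustration only, a minimal base-R sketch (independent of the tm/RWeka pipeline listed in the Appendix; names are hypothetical) of how bigram counts can be collected from cleaned lines:

# Minimal sketch: build bigram counts from cleaned lines using only base R.
countBigrams <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  bigrams <- unlist(lapply(tokens, function(w) {
    w <- w[nchar(w) > 0]
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))   # pair each word with its successor
  }))
  sort(table(bigrams), decreasing = TRUE)
}

countBigrams(c("I love data science", "I love natural language processing"))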

3. Statistical Modeling

To relate the n-gram word counts and probabilities, I plan to perform a multiple linear regression: fitting the models, diagnostic plots, comparing models, and cross validation.

  • The strategy for selecting the best model would be to tune smoothing methods such as a Markov (transition) matrix, to predict on 3-grams, 2-grams, or even unigrams, and to propose up to 5 or more best answers (a minimal back-off sketch follows this list).
  • Consider exploring the Katz back-off model for unseen sentences.
  • Toolset: “R” was used to generate the n-gram results, build the prediction model, and perform testing and cross validation.
  • Toolset: knitr was used to generate the Milestone Report in HTML format.
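
To make the back-off idea concrete, here is a minimal sketch, not the final model: it assumes pre-computed trigram, bigram, and unigram frequency tables with columns prefix, word, and count (a layout I am assuming purely for illustration), and falls back to lower-order n-grams when no match is found:

# Minimal back-off sketch (assumed table layout: columns 'prefix', 'word', 'count').
predictNextWord <- function(input, tri.freq, bi.freq, uni.freq, topN = 5) {
  w <- unlist(strsplit(tolower(input), "[^a-z']+"))
  w <- w[nchar(w) > 0]
  lookup <- function(freq, prefix) {
    hits <- freq[freq$prefix == prefix, ]
    head(hits[order(hits$count, decreasing = TRUE), "word"], topN)
  }
  ans <- character(0)
  if (length(w) >= 2)                      # try the trigram table: last two words
    ans <- lookup(tri.freq, paste(tail(w, 2), collapse = " "))
  if (length(ans) == 0 && length(w) >= 1)  # back off to the bigram table: last word
    ans <- lookup(bi.freq, tail(w, 1))
  if (length(ans) == 0)                    # back off to the most frequent unigrams
    ans <- head(uni.freq[order(uni.freq$count, decreasing = TRUE), "word"], topN)
  ans
}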

4. Reproducibility

  • All analyses performed in this report are reproducible from the R markdown file textAnalyzeMilestone.Rmd located in the github project repository.
  • All analyses performed in the preprocessing and task scripts are reproducible from the R sources located in the project repository.
  • When the data source or the sampling probability changes, the preprocessing R scripts must be re-executed to refresh the cached preprocessed datasets; the Rmd file then needs to regenerate the HTML report, which is published to the RPubs site (see the example below).
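
For example, once the preprocessing scripts have refreshed the cached datasets, the report could be regenerated along these lines (one possible approach; the output format is an assumption):

# Re-knit the milestone report to HTML after the cached datasets are refreshed
library(rmarkdown)
render("textAnalyzeMilestone.Rmd", output_format = "html_document")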

5. Create A Data Product

The plan is to build a web-based data product implementing the best prediction model selected in task 3.

  • The product will be web-based and publicly available via the internet.
  • It is expected to interact with users in real time, taking the user’s typing as input and displaying the predicted next word(s).
  • Toolset: “R” will be used to implement the next-word prediction algorithm.
  • Toolset: knitr was used to generate the Milestone Report in HTML format.
  • Toolset: RStudio Presenter will be used to create the product slides introducing the data product’s features.
  • Toolset: Shiny will be used to build the data product, which will be deployed to the Shiny server site and made publicly accessible via the internet (a rough sketch follows).
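
As a rough sketch of the planned product (the layout and the prediction call are placeholders, not the final implementation; predictNextWord() and the *.freq tables are the hypothetical objects sketched in section 3):

# Minimal Shiny sketch: placeholder UI and prediction call, not the final app.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction (sketch)"),
  textInput("phrase", "Type a phrase:", value = ""),
  verbatimTextOutput("nextwords")
)

server <- function(input, output) {
  output$nextwords <- renderText({
    if (nchar(input$phrase) == 0) return("")
    # predictNextWord() and the n-gram tables are the hypothetical objects above
    paste(predictNextWord(input$phrase, tri.freq, bi.freq, uni.freq), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)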

Note: the corresponding summary results, figures, and reference information are available in the Appendix. The R source code is in the github project repository.

Summarized Data and Findings:

So far, I have identified no missing values in the summarized dataset I preprocessed, and all measured variables were observed to be within standard ranges based on NLP.

English (US) Summary Results

  • Blogs contain multiple sentences; the average was 40.94 words per line. The longest line has 39,240 characters, 6,327 words, and 1,685 unique words. In total, we saw 899,288 lines.
  • News items contain smaller sentences; the average was 33.31 words per line. The longest line has 3,764 characters, 544 words, and 271 unique words. In total, we saw 1,010,242 lines.
  • Tweets contain the smallest sentences; the average was 12.44 words per line. The longest line has 140 characters, 47 words, and 36 unique words. In total, we saw 2,360,148 lines.

Summary of All Data Files

| Filename          | FileSizeInByte | FileSizeInMByte | FileLineCnt | LanguageLocation        | Encoding1 | Encoding2  |
|-------------------|----------------|-----------------|-------------|-------------------------|-----------|------------|
| de_DE.blogs.txt   | 85459666       | 81.5 Mb         | 371440      | German (Germany)        | UTF-8     | ISO-8859-1 |
| de_DE.news.txt    | 95591959       | 91.2 Mb         | 244743      | German (Germany)        | UTF-8     | ISO-8859-1 |
| de_DE.twitter.txt | 75578341       | 72.1 Mb         | 947774      | German (Germany)        | UTF-8     | ISO-8859-1 |
| en_US.blogs.txt   | 210160014      | 200.4 Mb        | 899288      | English (United States) | UTF-8     | ISO-8859-1 |
| en_US.news.txt    | 205811889      | 196.3 Mb        | 1010242     | English (United States) | UTF-8     | ISO-8859-1 |
| en_US.twitter.txt | 167105338      | 159.4 Mb        | 2360148     | English (United States) | UTF-8     | ISO-8859-1 |
| fi_FI.blogs.txt   | 108503595      | 103.5 Mb        | 439785      | Finnish (Finland)       | UTF-8     | ISO-8859-1 |
| fi_FI.news.txt    | 94234350       | 89.9 Mb         | 485758      | Finnish (Finland)       | UTF-8     | ISO-8859-1 |
| fi_FI.twitter.txt | 25331142       | 24.2 Mb         | 285214      | Finnish (Finland)       | UTF-8     | ISO-8859-1 |
| ru_RU.blogs.txt   | 116855835      | 111.4 Mb        | 337100      | Russian (Russia)        | UTF-8     | ISO-8859-5 |
| ru_RU.news.txt    | 118996424      | 113.5 Mb        | 196360      | Russian (Russia)        | UTF-8     | ISO-8859-5 |
| ru_RU.twitter.txt | 105182346      | 100.3 Mb        | 881414      | Russian (Russia)        | UTF-8     | ISO-8859-5 |

Summary of English (United States) Data Files

| FieldSummaryBy  | Filetype      | Min. | 1st Qu. | Median  | Mean      | 3rd Qu. | Max.    |
|-----------------|---------------|------|---------|---------|-----------|---------|---------|
| LineCharCnt     | en_US.blogs   | 0    | 44      | 149     | 221.5     | 317     | 39240   |
| LineCharCnt     | en_US.news    | 0    | 104     | 177     | 193.1     | 258     | 3764    |
| LineCharCnt     | en_US.twitter | 1    | 34      | 60      | 64.82     | 94      | 140     |
| rowId           | en_US.blogs   | 1    | 224800  | 449600  | 449600    | 674500  | 899300  |
| rowId           | en_US.news    | 1    | 19320   | 38630   | 38630     | 57940   | 77260   |
| rowId           | en_US.twitter | 1    | 590000  | 1180000 | 1180000   | 1770000 | 2360000 |
| LineWordCnt     | en_US.blogs   | 0    | 8       | 28      | 40.94     | 59      | 6327    |
| LineWordCnt     | en_US.news    | 0    | 18      | 30      | 33.31     | 44      | 544     |
| LineWordCnt     | en_US.twitter | 1    | 7       | 12      | 12.44     | 18      | 47      |
| LineAvgWordLen  | en_US.blogs   | 1    | 4       | 4.387   | 4.556     | 4.879   | 74      |
| LineAvgWordLen  | en_US.news    | 1    | 4.415   | 4.812   | 4.873     | 5.231   | 31      |
| LineAvgWordLen  | en_US.twitter | 1    | 3.733   | 4.188   | 4.305     | 4.733   | 126     |
| LineUniqWordCnt | en_US.blogs   | 0    | 8       | 24      | 31.2      | 46      | 1685    |
| LineUniqWordCnt | en_US.news    | 0    | 17      | 27      | 28.06     | 37      | 271     |
| LineUniqWordCnt | en_US.twitter | 1    | 7       | 11      | 11.72     | 17      | 36      |
| LineHashtagCnt  | en_US.blogs   | 0    | 0       | 0       | 0         | 0       | 0       |
| LineHashtagCnt  | en_US.news    | 0    | 0       | 0       | 0         | 0       | 0       |
| LineHashtagCnt  | en_US.twitter | 0    | 0       | 0       | 0         | 0       | 0       |
| LineHttpCnt     | en_US.blogs   | 0    | 0       | 0       | 0.001702  | 0       | 8       |
| LineHttpCnt     | en_US.news    | 0    | 0       | 0       | 0.0009449 | 0       | 4       |
| LineHttpCnt     | en_US.twitter | 0    | 0       | 0       | 0.000222  | 0       | 2       |

Most 20 Frequent Words

| blogs.word | blogs.wordcnt | blogs.wordprob | news.word | news.wordcnt | news.wordprob | twt.word | twt.wordcnt | twt.wordprob |
|------------|---------------|----------------|-----------|--------------|---------------|----------|-------------|--------------|
| the  | 1855771 | 0.0528795 | the  | 151524 | 0.0608450 | the  | 934172 | 0.0335948 |
| and  | 1086110 | 0.0309483 | to   | 69348  | 0.0278470 | to   | 786629 | 0.0282888 |
| to   | 1065698 | 0.0303667 | and  | 68216  | 0.0273924 | you  | 543700 | 0.0195526 |
| of   | 875028  | 0.0249336 | of   | 59089  | 0.0237274 | and  | 433686 | 0.0155963 |
| in   | 593633  | 0.0169154 | in   | 51464  | 0.0206656 | for  | 384535 | 0.0138287 |
| that | 459500  | 0.0130933 | for  | 27112  | 0.0108869 | in   | 377036 | 0.0135590 |
| is   | 431834  | 0.0123050 | that | 26358  | 0.0105842 | of   | 358981 | 0.0129097 |
| it   | 400905  | 0.0114236 | is   | 21961  | 0.0088185 | is   | 357544 | 0.0128580 |
| for  | 362867  | 0.0103398 | on   | 20578  | 0.0082632 | it   | 291398 | 0.0104793 |
| you  | 296855  | 0.0084588 | with | 19754  | 0.0079323 | my   | 290517 | 0.0104476 |
| with | 286177  | 0.0081545 | said | 19167  | 0.0076966 | on   | 276264 | 0.0099350 |
| was  | 278002  | 0.0079216 | was  | 17625  | 0.0070774 | that | 232907 | 0.0083758 |
| on   | 274047  | 0.0078089 | he   | 17556  | 0.0070497 | me   | 200067 | 0.0071948 |
| my   | 270181  | 0.0076987 | it   | 16693  | 0.0067031 | be   | 187176 | 0.0067312 |
| this | 257977  | 0.0073510 | at   | 16413  | 0.0065907 | at   | 185524 | 0.0066718 |
| as   | 223359  | 0.0063645 | as   | 14662  | 0.0058876 | with | 172995 | 0.0062213 |
| have | 218541  | 0.0062272 | his  | 12107  | 0.0048616 | your | 170771 | 0.0061413 |
| be   | 208303  | 0.0059355 | but  | 11658  | 0.0046813 | have | 168051 | 0.0060435 |
| but  | 203446  | 0.0057971 | from | 11648  | 0.0046773 | so   | 163273 | 0.0058716 |
| are  | 193634  | 0.0055175 | be   | 11579  | 0.0046496 | this | 162736 | 0.0058523 |

Conclusions:

Observations from my analysis

Limitations Identified So Far
- Some of the packages used (e.g. rJava, stringi) were not yet available for the latest R version.
- A 90% sparsity threshold did not retain enough words to measure, so 99% is needed, but that can hit the PC's limits when the dataset is very large.
- Encountered internal defects in R packages (e.g. tm, Matrix) while processing larger sample datasets: task 2 (tm APIs: TermDocumentMatrix, Corpus, n-grams, etc.) ran for over 10 hours on a Windows 10 machine with 12 GB RAM and an 8-core CPU, and R crashed with a vector-size overflow when processing the 70% sample.
- Using the tm APIs to clean documents/terms proved slower than using the stringi/stringr APIs with regular expressions to remove special characters, punctuation, numbers, etc. (illustrated in the sketch after this list).
- The smoothing APIs seem slow and could not handle larger word sets well.
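
For illustration, the two cleaning routes compared above look roughly like this (a minimal sketch with made-up input; no timings are reproduced here):

library(tm); library(stringr)
x <- c("Some text, with numbers 123 and punctuation!!", "Visit http://example.com today...")
# stringr / regular-expression route
x1 <- str_replace_all(tolower(x), "[[:punct:][:digit:]]+", " ")
# tm route: build a corpus and apply the equivalent transformations
corp <- Corpus(VectorSource(x))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
x2 <- sapply(corp, as.character)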

Suggestions and Planned Next Steps:

  1. Continue the statistical modeling tasks: perform a multiple linear regression (fit the models, produce diagnostic plots, compare models, cross-validate), test using the 70%/30% training/test datasets, and add a back-off smoothing algorithm.
  2. Create the data product and deploy it to the Shiny server site.
  3. Prepare the data product slides and publish them to RPubs.

Appendix: R source in R files

  • Getting data - Download data files
rawdata.filename<-"SwiftKey.zip"
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
  dest.dir=data.dir, dest.file=rawdata.filename, unzip.now=TRUE)
# downloadZipFile downloads a zip archive to dest.dir/dest.file, skipping the
# download if the destination file already exists, and optionally unzips it.
downloadZipFile<-function(src.file.url, dest.dir, dest.file, unzip.now=TRUE ) {
  dest.filepath<-file.path(dest.dir, dest.file)
    #("Downloading from ", src.file.url, "\nto dest.filepath=",dest.filepath)  
  if (file.exists(dest.filepath)) { # destination file already exists; no download needed
      #("dest.filepath exists. No download is needed.")
    } else {
      download.file(src.file.url, destfile=dest.filepath, method="libcurl", mode="wb") 
      if (unzip.now) { unzip(zipfile=dest.filepath, exdir=dest.dir) }
      }}
  • preprocessData splits the raw data content into training and testing data files based on a sample probability
## This script was tested in the following environment
## R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
## Platform: x86_64-w64-mingw32/x64 (64-bit)

autopreprocessData<-function() {
  smpproblist<-c(0.05, 0.1, 0.4, 0.7)  
  for (i in seq(smpproblist)) {
    preprocessData(smpproblist[i]) 
  }
}
# preprocessData is to split raw data content to training and testing data files based on sample probability
# The rbinom function defining 70%/30% probability to create 2 sample datasets, and store as .txt files
# - 70% for Training dataset, and save to the filename suffixed with "-train.txt"  
# - 30% Testing sub dataset, and save to the filename suffixed with "-test.txt"  
#
# input: smpprob - sample probability, default to 70%
# output: training and test files store in sample folder
#
preprocessData<-function (smpprob=0.7) {
  library(rJava);   library(NLP);   library(openNLP);  library(RWeka);  library(R.utils);   library(stringr); 
  cur.dir<-getwd(); 
  data.dir<-checkDir(file.path(cur.dir, "mydata"))
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob)))
  sample.train.dir<-checkDir(file.path(sample.dir,"train"))
  sample.test.dir<-checkDir(file.path(sample.dir,"test"))

  #("(1) Getting data - Download data files")
  rawdata.filename<-"SwiftKey.zip"
  downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  dest.dir=data.dir, dest.file=rawdata.filename, unzip.now=TRUE)
  data.filename<-file.path(data.dir, "final/en_US/en_US.blogs.txt")
  #("(2) Process data.filename=",data.filename)
  creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.blogs-train.txt"), 
                  file.path(sample.test.dir, "en_US.blogs-test.txt"))
  data.filename<-file.path(data.dir,"final/en_US/en_US.news.txt")
  #("(3) Process data.filename=",data.filename)
  creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.news-train.txt"), 
                file.path(sample.test.dir, "en_US.news-test.txt"))
  data.filename<-file.path(data.dir,"final/en_US/en_US.twitter.txt")
  #("(4) Process data.filename=",data.filename)
  creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.twitter-train.txt"), 
                file.path(sample.test.dir, "en_US.twitter-test.txt"))
}

creatSampleData<-function (data.filename, smpprob=0.7, smptrain.filename, smptest.filename) {
  #smpprob<-0.7 # get split 70/30% dataset into Train, and Test set  
  #(".1) Read data file, and load to data set")
  data.filesize<-file.info(data.filename)$size
  #("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))  
  data.row<-sapply(data.filename, countLines)
  #("...file countlines =", data.row,  "\tlength(count.fields=", length(count.fields(data.filename)) )
  fcon<-file(data.filename, open="rb")
  data.lines<-readLines(fcon, encoding="UTF-8")
  close(fcon)
  #("...length of data.lines =", length(data.lines))
  data.lines<-iconv(data.lines, "latin1", "ASCII", sub="")  
  #(".2) Create sample files: ", 100*smpprob,"% for training, ", 100*(1-smpprob), "% for test dataset, and save to local files")
  smpsize<-1
  smpidx<-rbinom( 1:length(data.lines), smpsize, smpprob)  
  smp.train<-data.lines[smpidx==1] 
  smp.test<-data.lines[smpidx==0] 
  #("....length of smp.train=", length(smp.train), "\tlength of smp.test=", length(smp.test)) 
  smp.filename<-smptrain.filename
  #("..smptrain save to ", smp.filename)
  writeLines( smp.train,  smp.filename, sep="\n", useBytes=TRUE)  
  data.filesize<-file.info(smp.filename)$size
  #("....file info size=",data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
  data.row<-sapply(smp.filename, countLines)
  #("....rows/file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(smp.filename)) )
  smp.filename<-smptest.filename
  #("..smptest save to ", smp.filename)
  writeLines( smp.test,  smp.filename, sep="\n", useBytes=TRUE)  
  data.filesize<-file.info(smp.filename)$size
  #("....file info size=",data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto")) 
  data.row<-sapply(smp.filename, countLines)
  #("....rows/file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(smp.filename)) )
}
  • task1.1 gets various counts per file; the result includes file size, line count, language (locale), and encoding.
  • task1.2 gets various counts per line and summarizes them as min, 1st quartile, median, mean, 3rd quartile, and max values.
autotask1<-function () {
  task1.1()
  task1.2(showOne=0)
}
#
# task1.1 is to get various counts per file; the result includes file size, line count, language (locale), and encoding  
#   All summary result is stored in report folder.  Output file format is 
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
# input: none
# output: summary result files in report folder
#
task1.1<-function () {
  library(pryr);   library(tm);   library(R.utils)
  cur.dir<-getwd(); 
  data.dir<-checkDir(file.path(cur.dir, "mydata"))
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));  #(" current.dir=", cur.dir, "\ndata.dir", data.dir, "\noutput.dir", output.dir )
  #( "(1) Get basic summary data for all data files")
  #("..Get all file names and file size\n")
  flist.all<-DirSource( file.path(cur.dir, 'mydata', 'final'), recursive=TRUE)
  flist<- basename(flist.all$filelist)
  data.filesize<-file.info(flist.all$filelist)$size
  data.filesize2<-utils:::format.object_size(data.filesize, "auto")
  #("..Get each file line count")
  require(R.utils)
  data.row<-sapply(flist.all$filelist, countLines)
  langlist<-c( rep('German (Germany)',3), rep('English (United States)',3), rep('Finnish (Finland)', 3), rep('Russian (Russia)',3))
  elist.1<-c(rep("UTF-8",12))
  elist.2<-c(rep("ISO-8859-1",9), rep('ISO-8859-5',3))
  dfsummary<-as.data.frame( cbind(flist, data.filesize, data.filesize2, data.row,langlist, elist.1, elist.2))  
  rownames(dfsummary)<-c(1:12)
  colnames(dfsummary)<-c('Filename', 'FileSizeInByte', 'FileSizeInMByte', 'FileLineCnt', 'LanguageLocation', 'Encoding1', 'Encoding2')
  #("..Build data frame colname=", colnames(dfsummary))
  output.filename <-file.path(output.dir, "data.summary.rds")
  #("..Save summary for all data file to ", output.filename)
  saveRDS(dfsummary, file=output.filename)   # the .rds is read back by the Rmd report
  write.csv( dfsummary,  file.path(output.dir, "data.summary.txt"))  
}
#
# task1.2 is to get various counts per line, and summarize them as min, 1st quartile, median, mean, 3rd quartile and max values.  
#   All summary result is stored in report folder.  Output file format is 
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
# input: showOne - index of file number (sequence order of filelist at en_US folder).
#   If zero, process, generate summary report for all files, and store into report folder
#   i.e.  1=en_US.blogs.txt  2=en_US.news.txt  3=en_US.twitter.txt
# output: summary result files in report folder
#
task1.2<-function (showOne=0) {
  library(pryr);   library(stringi);   library(stringr);   library(tm)
  cur.dir<-getwd(); 
  data.dir<-checkDir(file.path(cur.dir, "mydata"))
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
  if (showOne==0) {
    rangevalues<-c(1:3)
  }
  else  {
    rangevalues <-showOne
  }
  #( "(2) Get summary details for English US data files")   #("..Get all file names and file size\n")
  flist.en<-DirSource( file.path(cur.dir, 'mydata', 'final', "en_US"), recursive=TRUE)
  flist<- basename(flist.en$filelist)    
  for (i in rangevalues)
  {
  #("(",i+2, ") Get summary details ", flist.en$filelist[i]) #("..Readline from ", flist[i])
    data.lines<-sapply(flist.en$filelist[i], function(x) { 
      theLine<-readLines(x, encoding="UTF-8")
      iconv(theLine, "latin1", "ASCII", sub="")
    } )
    colnames(data.lines)<-flist[i]
    require(stringi)
  #("..Replace special characters") 
    data<-str_replace_all(data.lines, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
    data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
    data<-stri_trans_tolower(data)    
    data.nrow<-length(data)
  
    dfdata<-data.frame( LineCharCnt=sapply(data, nchar)) #("..Character count per line")
    dfdata$rowId<- c(1:data.nrow) 
    rownames(dfdata)<-c(1:data.nrow)
    line.wordlist<- strsplit(data,' ')  
  
    dfdata$LineWordCnt<-sapply(line.wordlist, function(x) {#("..Word count per line")
      sum(table(x[ x != ""]))
    }    )
  #("..Average Word Length per line")
    dfdata$LineAvgWordLen<-sapply(line.wordlist, function(x) {mean(nchar(x[ x != ""]))} )
  #("..Unique Word count per line")
    dfdata$LineUniqWordCnt<-sapply(line.wordlist, function(x) {length(table(unique(x[ x != ""])))}    )
  #("..Hash tag count per line")
    dfdata$LineHashtagCnt<-sapply(line.wordlist, function(x) {length(grep("#", x))} )
  #("..Http text count per line")
    dfdata$LineHttpCnt<-sapply(line.wordlist, function(x) {length(grep("http", x))} )    
    fbasename<-substr(flist[i], 1, nchar(flist[i])-4)
    output.filename <-file.path(output.dir, paste0(fbasename , ".linesummary.rds"))
  #("..Save line summary for en_US data files to ", output.filename)
    write.csv( dfdata,  file=file.path(output.dir, paste0(fbasename , ".linesummary.txt")))    
    mdatasummary<-sapply(dfdata, summary) # return class could be list or matrix
    output.filename <-file.path(output.dir, paste0(fbasename , ".datasummary.rds"))
    saveRDS(mdatasummary, file=output.filename)   # cache the per-field summary for later use
    dfdatasummary<-data.frame()
    if (class(mdatasummary)=='list') {
      data.row<-length(mdatasummary) 
    } 
    else {
      data.row<-ncol(mdatasummary) 
    }
    range.value<-c(1:data.row)    
    for (j in range.value) {
      if (class(mdatasummary)=='list') {
        theRow<- cbind(Filetype=fbasename, FieldSummaryBy=names(mdatasummary[j]), rbind(mdatasummary[[j]][1:6]))
      } else {
        theRow<- cbind(Filetype=fbasename, FieldSummaryBy=colnames(mdatasummary)[j], rbind(mdatasummary[1:6,j]))
      }
      dfdatasummary<-rbind(dfdatasummary, theRow)
    }    
    output.filename <-file.path(output.dir, paste0(fbasename , ".summary.rds"))
    saveRDS(dfdatasummary, file=output.filename)  # the .rds is read back by the Rmd report
    write.csv( dfdatasummary,  file=file.path(output.dir, paste0(fbasename , ".summary.txt")))  
    
    wordlist<-unlist(line.wordlist)
    wordlist<-unlist( sapply(wordlist, function(x) {x[ nchar(x)>1]} ) )
    mostfreqword<-sort(table(wordlist), decreasing=TRUE)
    
    dfmfw<-data.frame(count=mostfreqword)
    dfmfw$word<-rownames(mostfreqword)
    dfmfw$prob<- sapply(mostfreqword,sum)/sum(mostfreqword)
    data.nrow<-nrow(dfmfw)
    rownames(dfmfw)<-c(1:data.nrow)
    
    output.filename <-file.path(output.dir, paste0(fbasename, ".mostfreqword.rds"))
  #("..Save most frequent word for en_US data files to ", output.filename)
    saveRDS(dfmfw, file=output.filename)          # the .rds is read back by the Rmd report
    write.csv( dfmfw,  file=file.path(output.dir, paste0(fbasename , ".mostfreqword.txt")))      
  }
}
  • task2.1 explores the data: the relationship of words to the documents/corpus, and calculates the most frequent words and their counts. Because the analysis results are extremely large, all results are stored in the report folder.
  • task2.2 computes the n-gram frequencies. The unigram, bigram, 3-gram, 4-gram, and 5-gram results are stored in the report folder.
autotask2<-function ( showOne=0, smpprob=0.7, nline2read=500,
                  searchword='day', searchcormin=0.3, maxsparse=0.9) {
library(tm)
  cat('current working directory=', getwd())  
  cur.dir<-getwd(); 
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob), "train"))
  data.dir<-sample.dir
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
  
  #( "Explore data analysis for English US data files")
  #("..Get all file names and file size\n")
  flist.en<-DirSource( sample.dir, recursive=TRUE)
  flist<- basename(flist.en$filelist)  
  if (showOne==0) {   rangevalues<-c(1:3)  }
  else  {    rangevalues <-showOne  }  

  for (i in rangevalues)
  {
    datafname<- flist[i]
    print (gc())
    task2.1( datafname=datafname, smpprob, nline2read, searchword, searchcormin, maxsparse) 
    task2.2( datafname=datafname, smpprob, nline2read, searchword, maxsparse) 
  }
} #autotask2
#
# task2 is to explore the data: the relationship of words to the documents/corpus, and to calculate the
# most frequent words and counts.  Because the analysis results are extremely large, all results are
# stored in the report folder.  Output file format is 
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
#
# input: datafname - data file name which contain train data
#     samprob - sample probability for analysis, default to 70%
#     nline2read - max lines to read.  If zero, read all lines from the file.  default to 500
#     searchword - word to search association from frequent words
#     searchcormin - correlation limit used for search association, so word result will return if greater
#          or equal to.  default to 30%
#     maxsparse - maximum probability to remove sparse term from document matrix.  default to 90%
# output: summary result files in report folder
#
task2.1<-function ( datafname="en_US.news-train.txt", smpprob=0.7, nline2read=500,
                  searchword='day', searchcormin=0.3, maxsparse=0.9) {
library(rJava); library(NLP); library(openNLP); library(RWeka); library(R.utils); library(stringr);library(stringi);
library(tm);library(SnowballC);library(RColorBrewer);library(wordcloud);library(slam);library(reshape2)

  cat('current working directory=', getwd())  
  cur.dir<-getwd(); 
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob), "train"))
  data.dir<-sample.dir
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
  
  data.filename<-file.path(data.dir, datafname)
  #("..Read from ",  data.filename)
  data.filesize <- file.info(data.filename)$size
  #("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))  
  data.row <- sapply(data.filename, countLines)
  #("..file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(data.filename)) )  
  fcon <- file(data.filename, open="rb")
  if (nline2read == 0) {     theLine<-readLines(fcon, encoding="UTF-8")  }
  else {    theLine<-readLines(fcon, encoding="UTF-8", n=nline2read)  }
  data.lines<-iconv(theLine, "latin1", "ASCII", sub="")
  close(fcon)
  data.nrow<-length(data.lines)

  #("(2) Clean Data and transform text")
  #("....remove URLs ")
  data<-str_replace_all(data.lines, pattern="http[^[:space:]]*", replacement="")
  #("....replace Number, Punctuation, Control keys (any other than English letters/) with a space")
  data<-str_replace_all(data, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
  data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
  #("....convert tolower case ")
  data<-stri_trans_tolower(data)
  #("....build stop words adding available/via, no r/big")
   myStopwords <- c(stopwords('english'), "available", "via", "na","rrrr")
   myStopwords <- setdiff(myStopwords, c("r", "big"))
  #("....remove stop (common, unused) words: ", myStopwords)
  stopword.cnt <- length(myStopwords)
  range.value<-c(1:stopword.cnt)
  for ( i in range.value) {
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0("^", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], "$"), replacement=" " )
  }
  #("..Character count per line", sum(nchar(data.lines)))
  
  line.wordlist<- strsplit(data,' ') # word tokenize
  wordlist<-line.wordlist 
  wordlist<-lapply(wordlist, function(x) {x[ nchar(x)>1]} ) #("..wordlist nchar >1 = ", length(wordlist))
  wordlist<- lapply(wordlist, reverselist) #("..reverse list which have word nchar >1 = ", length(wordlist))

  #("..Build a corpus with charactor vector source")
  corpus.doc <- Corpus(VectorSource(wordlist)) 
  #("....using stemming document to remove common words endings (ie. ing/es/s)")
  require(SnowballC)
  corpus.doc <- tm_map(corpus.doc, stemDocument, language = "english")
  #("....head(summary(corpus.doc))")
  #("....remove unnecessary/extra whitespaces")
  corpus.doc <- tm_map(corpus.doc, stripWhitespace)
  
  fbasename<-substr(datafname, 1, nchar(datafname)-4)
  docs<-corpus.doc
  tdm<-TermDocumentMatrix(docs)  
  #("....tdm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".tdm.rds"))
  #("..Save TermDocumentMatrix result to ", output.filename)
  #(saveRDS(tdm, file= output.filename))   
  dtm<-DocumentTermMatrix(docs)  
  #("....dtm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".dtm.rds"))
  #("..Save DocumentTermMatrix result to ", output.filename)
  #(saveRDS(dtm, file= output.filename))

  #------- FYI only, so we could tune the max sparse value at runtime:
  #        inspect how many terms survive at each sparsity threshold
  for (sparse in c(0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97,
                   0.98, 0.99, 0.999, 0.9999, 0.99999)) {
    dtmz <- removeSparseTerms(dtm, sparse)   #(paste0("dtm.", sparse, "="), dim(dtmz))
  }

  mfw.maxnbr<-20
  dtm <- removeSparseTerms(dtm, maxsparse)
  #("..Call remove sparse earlier max sparse=", maxsparse, " dim="); 
  dtm.dense<-as.matrix(dtm)
  #("..dtm.dense object.size=", object.size(dtm.dense))
  require(reshape2)
  dtm.dense = melt(dtm.dense, value.name = "count")
  #("....dtm.dense dim=" );print(dim(dtm.dense))
  dtm.dense<-dtm.dense[which(dtm.dense$count>0),]
  dtm.dense<-aggregate(dtm.dense["count"], by=dtm.dense[c("Terms")], FUN=sum)
  #("....dtm.dense >0  dim=" );print(dim(dtm.dense))
  dtm.dense<-dtm.dense[order(dtm.dense$count, decreasing = TRUE),]
  freq<-dtm.dense 
  #("....All Frequency dtm.dense len=", length(freq), " nrow=", nrow(freq))
  if (nrow(freq) < mfw.maxnbr) range.value <- c(1:nrow(freq))
  else range.value <- c(1:mfw.maxnbr)

  graphics.off()
  callingFuncName<-"task2.1"   # prefix for the saved plot file name
  png(file.path(output.dir, paste0(callingFuncName,fbasename,".plots.png")), 
    width = 500, height = 500, units = "px", bg="transparent")
  par(mfrow=c(3,1), mar=c(5,4,2,0)+0.5,  cex=1, las=1, oma=c(2,1,0,0))
  p1<-hist(log(freq[range.value,"count"]), main =paste0(fbasename, "- Histogram for Most ", mfw.maxnbr, " Frequent Words"), 
           xlab='Log of Frequent Word Count')
  p2<-barplot(freq[range.value,"count"], las = 2, names.arg = freq[range.value,"Terms"],
        col ="lightgreen", main =paste0(fbasename, "- Most ", mfw.maxnbr, " Frequent Words"),
        ylab = "Word Count")
  output.filename <-file.path(output.dir, paste0(fbasename, ".mostfreqword.rds"))
  #("..Save frequent words to ", output.filename)  #(saveRDS(freq, file= output.filename))
  write.csv( freq,  file=file.path(output.dir, paste0(fbasename , ".mostfreqword.txt")))    
  if (nrow(freq) > 0) {
    require(wordcloud) 
    pal = brewer.pal(8,"Dark2") #Paired, Set1-3, Pastel1-2, Accent,Dark2, BuPu, BuGn, Blues    
    p3<-wordcloud( words=freq$Terms, freq=freq$count, scale=c(4,0.5), random.order=FALSE, 
               colors=pal, min.freq = 5, main=paste0(fbasename, "- Most ", mfw.maxnbr, " Frequent Word Cloud")) 
  }
  freqterm <-findAssocs(dtm, searchword, searchcormin) #("..search word=", searchword, "\nfindAssocs=\n")
  #("....suggest next words..", names(freqterm[[1]][1:length(freqterm[[1]])]))
  print(p1)
  print(p2)
  print(p3)
  dev.off()
  graphics.off()
}

#
# task2.2 is to get the n-gram frequencies.  The unigram, bigram, 3-gram, 4-gram, and 5-gram results are stored
#    in the report folder.
#   (1) ".rds" can be read for knit/Rmd report.  File size is smaller
#   (2) ".txt" can be easily read for users. File size is extremely big
# input: datafname - data file name which contain train data
#     samprob - sample probability for analysis, default to 70%
#     nline2read - max lines to read.  If zero, read all lines from the file.  default to 500
#     searchword - word to search association from frequent words
#     searchcormin - correlation limit used for search association, so word result will return if greater
#          or equal to
#     maxsparse - maximum probability to remove sparse term from document matrix.  default to 90%
# output: summary result files in report folder
#
task2.2<-function ( datafname="en_US.news-train.txt", smpprob=0.7, nline2read=500, 
                    searchword='day', maxsparse=0.90) {
  library(rJava);  library(NLP);  library(openNLP);  library(RWeka);  library(R.utils);  library(stringr)
  library(stringi);  library(tm);  library(SnowballC);  library(RColorBrewer);  library(wordcloud)
  cat('current working directory=', getwd())  
  cur.dir<-getwd(); 
  sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob),"train"))
  data.dir<-sample.dir
  output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));

  #("(1) Read training data file, and load to data set")  
  data.filename<-file.path(data.dir, datafname)
  #("..Read from ",  data.filename)
  data.filesize <- file.info(data.filename)$size
  #("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))  
  data.row <- sapply(data.filename, countLines)
  #("..file countlines =", data.row,  "\tlength(count.fields)=", length(count.fields(data.filename)) )
  fcon <- file(data.filename, open="rb")
  if (nline2read == 0) {     theLine<-readLines(fcon, encoding="UTF-8")  }
  else {    theLine<-readLines(fcon, encoding="UTF-8", n=nline2read)  }
  data.lines<-iconv(theLine, "latin1", "ASCII", sub="")
  close(fcon)
  data.nrow<-length(data.lines)
  
  #("(2) Clean Data and transform text")  
  #("..Character count per line", sum(nchar(data.lines)))

  #("....remove URLs ")
  data<-str_replace_all(data.lines, pattern="http[^[:space:]]*", replacement="")
  #("....replace Number, Punctuation, Control keys (any other than English letters/) with a space")
  data<-str_replace_all(data, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
  data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
  #("....convert tolower case ")
  data<-stri_trans_tolower(data)  
  #("....build stop words adding available/via, no r/big")
  myStopwords <- c(stopwords('english'), "available", "via", "na","rrrr")
  myStopwords <- setdiff(myStopwords, c("r", "big")) 
  #("....remove stop (common, unused) words: ", myStopwords)  
  stopword.cnt <- length(myStopwords)  
  range.value<-c(1:stopword.cnt)
  for ( i in range.value) {
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0("^", myStopwords[i], " "), replacement=" " )
    data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], "$"), replacement=" " )
  }
  #("..Character count per line", sum(nchar(data.lines)))
  line.wordlist<- strsplit(data,' ')  #word tokenize
  wordlist<-line.wordlist  #(".. wordlist length=", length(wordlist), " nchar(head wordlist)=", nchar(head(wordlist)))
  #("..wordlist nchar >1 = ", length(wordlist))
  wordlist<- lapply(wordlist, reverselist) #("..reverse list which have word nchar >1 = ", length(wordlist))
  #("..Build a corpus with charactor vector source   class(wordlist)=", class(wordlist))
  corpus.doc <- Corpus(VectorSource(wordlist)) 
  writeLines(strwrap(as.character(corpus.doc[[1]]), width=60))    
  #("....using stemming document to remove common words endings (ie. ing/es/s)")
  require(SnowballC)
  corpus.doc <- tm_map(corpus.doc, stemDocument, language = "english")
  #("....head(summary(corpus.doc))")
  writeLines(strwrap(as.character(corpus.doc[[1]]), width=73))
  #("....remove unnecessary/extra whitespaces")
  corpus.doc <- tm_map(corpus.doc, stripWhitespace)

  fbasename<-substr(datafname, 1, nchar(datafname)-4)
  docs<-corpus.doc
  tdm<-TermDocumentMatrix(docs) 
  #("....tdm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".tdm.rds"))
  #("..Save TermDocumentMatrix result to ", output.filename)  #(saveRDS(tdm, file= output.filename))

  dtm<-DocumentTermMatrix(docs)    #("....dtm dim="); print( dim(tdm) )
  output.filename <-file.path(output.dir, paste0(fbasename , ".dtm.rds"))
  #("..Save DocumentTermMatrix result to ", output.filename)  #(saveRDS(dtm, file= output.filename))

  #("(3) Ngram Language Modeling")
  mfw.maxnbr<-20
  require(RWeka)
  dtm <- removeSparseTerms(dtm, maxsparse)   # due to a vector-size error, remove sparse terms early (max sparse = maxsparse); print(dim(dtm))
  
  #("..Unigram " )
  n1gramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1, delimiters = " \r\n\t"))
  tdm.1gram<-TermDocumentMatrix(docs, control=list(tokenize = n1gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.1gram.rds"))
  #("..Save n1gram result to ", output.filename) #(saveRDS(tdm.1gram, file= output.filename))
  tdm.1gram <- removeSparseTerms(tdm.1gram, maxsparse)
  nrow.1gram<-rowSums(as.matrix(tdm.1gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".1gram.rds"))
  #("..Save n1gram result to ", output.filename)  #(saveRDS(tdm.1gram, file= output.filename))
  mfw.1gram <- tail(sort(nrow.1gram), mfw.maxnbr)
  #("....Unigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.1gram) )# , " length=", length(nrow.1gram)) print(mfw.1gram)

  #("..Bigram")
  n2gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=2, max=2, delimiters = " \r\n\t"))
  tdm.2gram<- TermDocumentMatrix(docs, control = list(tokenize = n2gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.2gram.rds"))
  #("..Save n2gram result to ", output.filename)  #(saveRDS(tdm.2gram, file= output.filename))
  tdm.2gram <- removeSparseTerms(tdm.2gram, maxsparse)
  nrow.2gram<-rowSums(as.matrix(tdm.2gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".2gram.rds"))
  #("..Save n2gram result to ", output.filename)  #(saveRDS(tdm.2gram, file= output.filename))
  mfw.2gram<-tail(sort(nrow.2gram), mfw.maxnbr)
  #("....Bigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.2gram), " length=", length(nrow.2gram))  print(mfw.2gram)

  #("..n3-gram")
  n3gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=3, max=3, delimiters = " \r\n\t"))
  tdm.3gram<- TermDocumentMatrix(docs, control = list(tokenize = n3gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.3gram.rds"))
  #("..Save n3gram result to ", output.filename)  #(saveRDS(tdm.3gram, file= output.filename))
  tdm.3gram <- removeSparseTerms(tdm.3gram, maxsparse)
  nrow.3gram<-rowSums(as.matrix(tdm.3gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".3gram.rds"))
  #("..Save n3gram result to ", output.filename)  #(saveRDS(tdm.3gram, file= output.filename))
  mfw.3gram<-tail(sort(nrow.3gram), mfw.maxnbr)
  #("....Trigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.3gram), " length=", length(nrow.3gram))  print(mfw.3gram)
  
  #("..n4-gram")
  n4gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=4, max=4, delimiters = " \r\n\t"))
  tdm.4gram<- TermDocumentMatrix(docs, control = list(tokenize = n4gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.4gram.rds"))
  #("..Save n4gram result to ", output.filename)  #(saveRDS(tdm.4gram, file= output.filename))
  tdm.4gram <- removeSparseTerms(tdm.4gram, maxsparse)
  nrow.4gram<-rowSums(as.matrix(tdm.4gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".4gram.rds"))
  #("..Save n4gram result to ", output.filename)  #(saveRDS(tdm.4gram, file= output.filename))
  mfw.4gram<-tail(sort(nrow.4gram), mfw.maxnbr)
  #("....n4gram...Most ", mfw.maxnbr, " words\n class=", class(nrow.4gram), " length=", length(nrow.4gram))  print(mfw.4gram )

  #("..n5-gram")
  n5gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=5, max=5, delimiters = " \r\n\t"))
  tdm.5gram<- TermDocumentMatrix(docs, control = list(tokenize = n5gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.5gram.rds"))
  #("..Save n5gram result to ", output.filename)  #(saveRDS(tdm.5gram, file= output.filename))
  tdm.5gram <- removeSparseTerms(tdm.5gram, maxsparse)
  nrow.5gram<-rowSums(as.matrix(tdm.5gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".5gram.rds"))
  #("..Save n5gram result to ", output.filename)  #(saveRDS(tdm.5gram, file= output.filename))
  mfw.5gram<-tail(sort(nrow.5gram), mfw.maxnbr)
  #("....n5ngram...Most ", mfw.maxnbr, " words\n class=", class(nrow.5gram), " length=", length(nrow.5gram))  print(mfw.5gram)

  #("..n1-n5 gram")
  n15gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=1, max=5, delimiters = " \r\n\t"))
  tdm.15gram<- TermDocumentMatrix(docs, control = list(tokenize = n15gramTokenizer))
  output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.15gram.rds"))
  #("..Save n15gram result to ", output.filename)   #(saveRDS(tdm.15gram, file= output.filename))
  tdm.15gram <- removeSparseTerms(tdm.15gram, maxsparse)
  nrow.15gram<-rowSums(as.matrix(tdm.15gram))
  output.filename <-file.path(output.dir, paste0(fbasename, ".15gram.rds"))
  #("..Save n15gram result to ", output.filename)  #(saveRDS(tdm.15gram, file= output.filename))
  mfw.15gram<-tail(sort(nrow.15gram), mfw.maxnbr)
  #("....n1-n5ngram...Most ", mfw.maxnbr, " words\n class=", class(nrow.15gram), " length=", length(nrow.15gram))  print( mfw.15gram)

  graphics.off()
  callingFuncName<-"task2.2"   # prefix for the saved plot file name
  png(file.path(output.dir, paste0(callingFuncName,fbasename,".plots.png")),
    width = 600, height = 800, units = "px", bg="transparent")
  par(mfrow=c(3,2), mar=c(5,4,2,1)+0.5, cex=1, las=2, oma=c(2,4,2,2) )
  if (length(nrow.1gram) > 0 ) {
    p1<-barplot(mfw.1gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Unigrams"),horiz=TRUE)
    print(p1) #("....Print Unigrams images")
  }
  if (length(nrow.2gram)>0  ) {
    p2<-barplot(mfw.2gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Bigrams"),horiz=TRUE)
    print(p2) #("....Print Bigrams images")
  }
  if (length(nrow.3gram) > 0 ) {
    p3<-barplot(mfw.3gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Trigrams"), horiz=TRUE)
    print(p3) #("....Print Trigrams images")
  }
  if (length(nrow.4gram) > 0 ) {
    p4<-barplot(mfw.4gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n4 grams"), horiz=TRUE)
    print(p4) #("....Print n4 grams images")
  }
  if (length(nrow.5gram) > 0 ) {
    p5<-barplot(mfw.5gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n5 grams"), horiz=TRUE)
    print(p5) #("....Print n5 grams images")
  }
  if (length(nrow.15gram) > 0 ) {
    p6<-barplot(mfw.15gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n1-n5 grams"), horiz=TRUE)
    print(p6) #("....Print n1-n5 grams images")
  }
  dev.off()
  graphics.off()
}
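
# ------------------------------------------------------------------------------
# Illustrative workflow (an assumption about how these scripts are run from the
# project root; not part of the original task sources): build the cached samples,
# then generate the per-file summaries and the n-gram results used by the report.
# ------------------------------------------------------------------------------
autopreprocessData()                    # create the 5%/10%/40%/70% sample files
autotask1()                             # file- and line-level summaries
autotask2(smpprob=0.1, nline2read=0)    # n-gram analysis on the 10% sample, all lines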

Appendix: R code in this file (textAnalyzerMilestone.Rmd)

#display output of task1() in task1.R
dfsummary.alldata<-readRDS(file.path(cur.dir, 'mydata', 'report', 'data.summary.rds'))
require(knitr)
kable(dfsummary.alldata, align='l', caption = "Summary of All Data Files" )
#display output of task1.2() in task1.R
dblogs<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.blogs.summary.rds'))
dnews<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.news.summary.rds'))
dtwt<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.twitter.summary.rds'))
dfsummary <- data.frame( rbind(dblogs, dnews, dtwt) )

attach(dfsummary)
kable(dfsummary[order(FieldSummaryBy, Filetype),c(2,1,3:8)], align='l', caption = "Summary of English (United States) Data Files", row.names=FALSE )
detach(dfsummary) 
#most frequent words
mfw.maxnbr<-20
dblogs.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.blogs.mostfreqword.rds'))
dnews.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.news.mostfreqword.rds'))
dtwt.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.twitter.mostfreqword.rds'))
dfmfw <- data.frame(cbind( dblogs.mfw[1:mfw.maxnbr,c(2,1,3)], dnews.mfw[1:mfw.maxnbr,c(2,1,3)], dtwt.mfw[1:mfw.maxnbr,c(2,1,3)] ))

dfmfw.r <- data.frame(rbind( cbind(type='blogs',dblogs.mfw[1:mfw.maxnbr,c(2,1,3)]), 
                             cbind(type='news', dnews.mfw[1:mfw.maxnbr,c(2,1,3)]),
                             cbind(type='twitter',dtwt.mfw[1:mfw.maxnbr,c(2,1,3)]) ))
dfmfw.r$type <- factor(dfmfw.r$type)

require(ggplot2)
colnames(dfmfw) <- c("blogs.word", "blogs.wordcnt","blogs.wordprob","news.word","news.wordcnt", "news.wordprob","twt.word","twt.wordcnt", "twt.wordprob")
kable( dfmfw[1:mfw.maxnbr,], align='l', caption = paste0("Most ", mfw.maxnbr," Frequent Words" ))

ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) + 
  geom_bar(stat="identity", position=position_dodge()) + 
  ggtitle("Most 20 Frequent Word Count by File Type") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) + 
  geom_bar(stat="identity", position=position_dodge()) + facet_grid(type~.) +
  ggtitle("Most 20 Frequent Word Count by File Type") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) + 
  geom_line() + geom_smooth() + 
  ggtitle("Most 20 Frequent Word Count by File Type") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

p4<-ggplot(data=dfmfw.r, aes(x=count,y=prob, color=type, fill=type)) + geom_bar(stat="identity")+ 
    geom_line() + geom_point(size=4, shape=21, fill="white") +
    labs(x="Word Count Per Line", y="Word Probability Per Line") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p5<-ggplot(data=dfmfw.r, aes(x=count, y=log(prob), color=type, fill=type))+ geom_point(size=4, shape=21, fill="white") + 
  stat_smooth(method="lm") + labs(x="Word Count Per Line", y="Log of Word Probability Per Line") + 
  geom_line() + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

require(gridExtra)   # grid.arrange() is provided by gridExtra
grid.arrange(p4,p5, ncol=2)

qplot(log(count), data=dfmfw.r, geom="density",  fill=type, xlab='Log(Word Count Per Line)',  ylab='Density', main="Distribution of Word Count by File Type")+ 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))