This report was prepared as the Milestone Report for the Johns Hopkins Data Science Capstone online class.
The goal is to use a publicly available web source collected by a web crawler, HC Corpora (www.corpora.heliohost.org), to exercise the data science analysis skills, algorithms, and modeling methods I learned in the 2015 JH Data Science Track, and to apply them in the area of natural language processing (NLP).
A large amount of text-based information is used in current social media, coming from sources such as e-mail, personal blogs, newspaper news, Twitter, web pages, and scanned/handwritten notes.
Understanding the problem: the majority of this data is in an unstructured format, which is harder to search, query, retrieve, and analyze.
Natural language processing (NLP) techniques can add structure and semantic information to unstructured text content, allowing us to work efficiently and make the data valuable for decision management in areas such as marketing, sales advertisement, business decisions, kid/youth education, and healthcare.
My primary focus is to use the freeform text of the English (United States) language data files for my exploratory analysis, and then to build the best algorithm to predict the next word as the user types. Furthermore, if the prediction is fast enough, this could ease the typing problem most of us currently struggle with on phone/tablet devices.
The textAnalyzer analysis will learn terms/words from all documents in the English data files,
and model each document by counting the number of times each word/term appears. If the collected set of words/terms is extremely large, I would consider limiting the size of the result by defining a maximum number of most frequent words, and also removing the most common and least used words.
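As an illustration of this counting-and-limiting step, here is a minimal sketch in base R (the file path and the maxwords cutoff are assumptions for illustration only, not part of the actual pipeline below):

# Minimal sketch: count word frequencies in one file and keep only the top-N terms.
countTopWords <- function(filename = "mydata/final/en_US/en_US.twitter.txt", maxwords = 20) {
  lines <- readLines(filename, encoding = "UTF-8", skipNul = TRUE)
  lines <- tolower(iconv(lines, "latin1", "ASCII", sub = ""))  # normalize encoding and case
  words <- unlist(strsplit(lines, "[^a-z']+"))                 # crude tokenization
  words <- words[nchar(words) > 1]                             # drop empty and 1-character tokens
  freq  <- sort(table(words), decreasing = TRUE)               # count each word
  head(data.frame(word = names(freq), count = as.integer(freq),
                  prob = as.numeric(freq) / sum(freq)), maxwords)
}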
This task was to download the data source from the Coursera class location (Capstone Dataset). The data used for the analysis originally comes from a corpus called HC Corpora (www.corpora.heliohost.org). More Info
This task was performed by examining the file content, tables, and figures of the observed data. The data transformation was performed against the raw data on the basis of plots, results summarized using NLP APIs, and knowledge of the scale of the measured variables (just learned in the past two weeks) described in Natural Language Processing.
Exploratory analysis tasks involved:
Plan to relate n-gram word count and probability: I would perform a multiple linear regression, including fitting the models, diagnostic plots, model comparison, and cross validation (a minimal sketch follows this list).
Plan to build a web-based data product that implements the best prediction model selected from Task 3.
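As a rough sketch of that regression idea (assuming a data frame dfmfw.r with count, prob, and type columns, as built in the Appendix code of this report; this is not a final model):

# Sketch: relate log word probability to log word count with linear models.
fit1 <- lm(log(prob) ~ log(count), data = dfmfw.r)           # simple model
fit2 <- lm(log(prob) ~ log(count) + type, data = dfmfw.r)    # add file type as a predictor
summary(fit2)                      # coefficients and fit quality
anova(fit1, fit2)                  # compare the nested models
par(mfrow = c(2, 2)); plot(fit2)   # diagnostic plots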
Note: the corresponding summary results, figures, and reference information are available in the Appendix. The R source code is in the GitHub project repository.
So far, I have identified no missing values in the summarized dataset I preprocessed, and all measured variables were observed to be within standard ranges based on NLP.
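A minimal check along those lines (assuming a line-summary data frame dfdata as produced by task1.2 below; illustrative only):

# Sketch: confirm there are no missing values and inspect the range of each per-line measure.
sum(is.na(dfdata))        # expect 0 missing values
sapply(dfdata, range)     # min/max of each measured variable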
| Filename | FileSizeInByte | FileSizeInMByte | FileLineCnt | LanguageLocation | Encoding1 | Encoding2 |
|---|---|---|---|---|---|---|
| de_DE.blogs.txt | 85459666 | 81.5 Mb | 371440 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.news.txt | 95591959 | 91.2 Mb | 244743 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.twitter.txt | 75578341 | 72.1 Mb | 947774 | German (Germany) | UTF-8 | ISO-8859-1 |
| en_US.blogs.txt | 210160014 | 200.4 Mb | 899288 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.news.txt | 205811889 | 196.3 Mb | 1010242 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.twitter.txt | 167105338 | 159.4 Mb | 2360148 | English (United States) | UTF-8 | ISO-8859-1 |
| fi_FI.blogs.txt | 108503595 | 103.5 Mb | 439785 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.news.txt | 94234350 | 89.9 Mb | 485758 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.twitter.txt | 25331142 | 24.2 Mb | 285214 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| ru_RU.blogs.txt | 116855835 | 111.4 Mb | 337100 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.news.txt | 118996424 | 113.5 Mb | 196360 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.twitter.txt | 105182346 | 100.3 Mb | 881414 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| FieldSummaryBy | Filetype | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| LineCharCnt | en_US.blogs | 0 | 44 | 149 | 221.5 | 317 | 39240 |
| LineCharCnt | en_US.news | 0 | 104 | 177 | 193.1 | 258 | 3764 |
| LineCharCnt | en_US.twitter | 1 | 34 | 60 | 64.82 | 94 | 140 |
| rowId | en_US.blogs | 1 | 224800 | 449600 | 449600 | 674500 | 899300 |
| rowId | en_US.news | 1 | 19320 | 38630 | 38630 | 57940 | 77260 |
| rowId | en_US.twitter | 1 | 590000 | 1180000 | 1180000 | 1770000 | 2360000 |
| LineWordCnt | en_US.blogs | 0 | 8 | 28 | 40.94 | 59 | 6327 |
| LineWordCnt | en_US.news | 0 | 18 | 30 | 33.31 | 44 | 544 |
| LineWordCnt | en_US.twitter | 1 | 7 | 12 | 12.44 | 18 | 47 |
| LineAvgWordLen | en_US.blogs | 1 | 4 | 4.387 | 4.556 | 4.879 | 74 |
| LineAvgWordLen | en_US.news | 1 | 4.415 | 4.812 | 4.873 | 5.231 | 31 |
| LineAvgWordLen | en_US.twitter | 1 | 3.733 | 4.188 | 4.305 | 4.733 | 126 |
| LineUniqWordCnt | en_US.blogs | 0 | 8 | 24 | 31.2 | 46 | 1685 |
| LineUniqWordCnt | en_US.news | 0 | 17 | 27 | 28.06 | 37 | 271 |
| LineUniqWordCnt | en_US.twitter | 1 | 7 | 11 | 11.72 | 17 | 36 |
| LineHashtagCnt | en_US.blogs | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.news | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.twitter | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHttpCnt | en_US.blogs | 0 | 0 | 0 | 0.001702 | 0 | 8 |
| LineHttpCnt | en_US.news | 0 | 0 | 0 | 0.0009449 | 0 | 4 |
| LineHttpCnt | en_US.twitter | 0 | 0 | 0 | 0.000222 | 0 | 2 |
| blogs.word | blogs.wordcnt | blogs.wordprob | news.word | news.wordcnt | news.wordprob | twt.word | twt.wordcnt | twt.wordprob |
|---|---|---|---|---|---|---|---|---|
| the | 1855771 | 0.0528795 | the | 151524 | 0.0608450 | the | 934172 | 0.0335948 |
| and | 1086110 | 0.0309483 | to | 69348 | 0.0278470 | to | 786629 | 0.0282888 |
| to | 1065698 | 0.0303667 | and | 68216 | 0.0273924 | you | 543700 | 0.0195526 |
| of | 875028 | 0.0249336 | of | 59089 | 0.0237274 | and | 433686 | 0.0155963 |
| in | 593633 | 0.0169154 | in | 51464 | 0.0206656 | for | 384535 | 0.0138287 |
| that | 459500 | 0.0130933 | for | 27112 | 0.0108869 | in | 377036 | 0.0135590 |
| is | 431834 | 0.0123050 | that | 26358 | 0.0105842 | of | 358981 | 0.0129097 |
| it | 400905 | 0.0114236 | is | 21961 | 0.0088185 | is | 357544 | 0.0128580 |
| for | 362867 | 0.0103398 | on | 20578 | 0.0082632 | it | 291398 | 0.0104793 |
| you | 296855 | 0.0084588 | with | 19754 | 0.0079323 | my | 290517 | 0.0104476 |
| with | 286177 | 0.0081545 | said | 19167 | 0.0076966 | on | 276264 | 0.0099350 |
| was | 278002 | 0.0079216 | was | 17625 | 0.0070774 | that | 232907 | 0.0083758 |
| on | 274047 | 0.0078089 | he | 17556 | 0.0070497 | me | 200067 | 0.0071948 |
| my | 270181 | 0.0076987 | it | 16693 | 0.0067031 | be | 187176 | 0.0067312 |
| this | 257977 | 0.0073510 | at | 16413 | 0.0065907 | at | 185524 | 0.0066718 |
| as | 223359 | 0.0063645 | as | 14662 | 0.0058876 | with | 172995 | 0.0062213 |
| have | 218541 | 0.0062272 | his | 12107 | 0.0048616 | your | 170771 | 0.0061413 |
| be | 208303 | 0.0059355 | but | 11658 | 0.0046813 | have | 168051 | 0.0060435 |
| but | 203446 | 0.0057971 | from | 11648 | 0.0046773 | so | 163273 | 0.0058716 |
| are | 193634 | 0.0055175 | be | 11579 | 0.0046496 | this | 162736 | 0.0058523 |
Observations from my analysis
Limitations identified so far
- Established API packages were not yet available for the latest R version (e.g. rJava, stringi, etc.).
- Using a 90% sparsity threshold did not retain enough words to measure, so 99% was needed instead, but that can reach the PC's memory limit if the dataset is too large.
- Encountered internal defects in R packages (e.g. tm, Matrix) while processing larger sample data sets. Task 2 (tm APIs: TermDocumentMatrix, Corpus, n-grams, etc.) ran for over 10 hours on a Windows 10 machine with 12 GB RAM and an 8-core CPU, and R crashed when processing the 70% sample data because the vector size overflowed.
- Using tm APIs to clean up documents/terms proved slower than using the stringi API with regular expressions to remove special characters, punctuation, numbers, etc.
- The performance of the smoothing API seems slow, and it did not handle larger word sets well (a lightweight alternative is sketched below).
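For comparison, the sketch below shows simple add-one (Laplace) smoothing of unigram counts using only base R. This is an illustrative alternative I may explore, not the smoothing API referred to above; the wordcnt vector is a toy example:

# Sketch: add-one (Laplace) smoothing of unigram counts with base R only.
wordcnt <- c(the = 1855771, and = 1086110, to = 1065698)      # toy example counts
vocab.size <- length(wordcnt)
smoothed.prob <- (wordcnt + 1) / (sum(wordcnt) + vocab.size)  # every word gets a non-zero probability
smoothed.prob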
Suggestions from my analysis
rawdata.filename<-"SwiftKey.zip"
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                dest.dir=data.dir, dest.file=rawdata.filename, unzip.now=TRUE)
downloadZipFile<-function(src.file.url, dest.dir, dest.file, unzip.now=TRUE ) {
  dest.filepath<-file.path(dest.dir, dest.file)
  #("Downloading from ", src.file.url, "\nto dest.filepath=", dest.filepath)
  if (file.exists(dest.filepath)) { # if the destination file already exists, skip the download
    #("dest.filepath exists. No download is needed.")
  } else {
    download.file(src.file.url, destfile=dest.filepath, method="libcurl", mode="wb")
    if (unzip.now) { unzip(zipfile=dest.filepath, exdir=dest.dir) }
  }
}
## This script was tested in the following environment
## R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
## Platform: x86_64-w64-mingw32/x64 (64-bit)
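# autopreprocessData runs preprocessData over several sample probabilities (5%, 10%, 40%, 70%)
# so that training/testing splits of different sizes are available for the later tasks.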
autopreprocessData<-function() {
smpproblist<-c(0.05, 0.1, 0.4, 0.7)
for (i in seq(smpproblist)) {
preprocessData(smpproblist[i])
}
}
# preprocessData splits the raw data content into training and testing data files based on the sample probability.
# The rbinom function uses the given probability (default 70%/30%) to create 2 sample datasets stored as .txt files:
# - the training dataset (default 70%) is saved to a filename suffixed with "-train.txt"
# - the testing dataset (default 30%) is saved to a filename suffixed with "-test.txt"
#
# input: smpprob - sample probability, default 0.7
# output: training and test files stored in the sample folder
#
preprocessData<-function (smpprob=0.7) {
library(rJava); library(NLP); library(openNLP); library(RWeka); library(R.utils); library(stringr);
cur.dir<-getwd();
data.dir<-checkDir(file.path(cur.dir, "mydata"))
sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob)))
sample.train.dir<-checkDir(file.path(sample.dir,"train"))
sample.test.dir<-checkDir(file.path(sample.dir,"test"))
#("(1) Getting data - Download data files")
rawdata.filename<-"SwiftKey.zip"
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                dest.dir=data.dir, dest.file=rawdata.filename, unzip.now=TRUE)
data.filename<-file.path(data.dir, "final/en_US/en_US.blogs.txt")
#("(2) Process data.filename=",data.filename)
creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.blogs-train.txt"),
file.path(sample.test.dir, "en_US.blogs-test.txt"))
data.filename<-file.path(data.dir,"final/en_US/en_US.news.txt")
#("(3) Process data.filename=",data.filename)
creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.news-train.txt"),
file.path(sample.test.dir, "en_US.news-test.txt"))
data.filename<-file.path(data.dir,"final/en_US/en_US.twitter.txt")
#("(4) Process data.filename=",data.filename)
creatSampleData(data.filename, smpprob, file.path(sample.train.dir, "en_US.twitter-train.txt"),
file.path(sample.test.dir, "en_US.twitter-test.txt"))
}
creatSampleData<-function (data.filename, smpprob=0.7, smptrain.filename, smptest.filename) {
#smpprob<-0.7 # get split 70/30% dataset into Train, and Test set
#(".1) Read data file, and load to data set")
data.filesize<-file.info(data.filename)$size
#("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
data.row<-sapply(data.filename, countLines)
#("...file countlines =", data.row, "\tlength(count.fields=", length(count.fields(data.filename)) )
fcon<-file(data.filename, open="rb")
data.lines<-readLines(fcon, encoding="UTF-8")
close(fcon)
#("...length of data.lines =", length(data.lines))
data.lines<-iconv(data.lines, "latin1", "ASCII", sub="")
#(".2) Create sample files: ", 100*smpprob,"% for training, ", 100*(1-smpprob), "% for test dataset, and save to local files")
smpsize<-1
smpidx<-rbinom(length(data.lines), smpsize, smpprob)  # one 0/1 draw per line
smp.train<-data.lines[smpidx==1]
smp.test<-data.lines[smpidx==0]
#("....length of smp.train=", length(smp.train), "\tlength of smp.test=", length(smp.test))
smp.filename<-smptrain.filename
#("..smptrain save to ", smp.filename)
writeLines( smp.train, smp.filename, sep="\n", useBytes=TRUE)
data.filesize<-file.info(smp.filename)$size
#("....file info size=",data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
data.row<-sapply(smp.filename, countLines)
#("....rows/file countlines =", data.row, "\tlength(count.fields)=", length(count.fields(smp.filename)) )
smp.filename<-smptest.filename
#("..smptest save to ", smp.filename)
writeLines( smp.test, smp.filename, sep="\n", useBytes=TRUE)
data.filesize<-file.info(smp.filename)$size
#("....file info size=",data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
data.row<-sapply(smp.filename, countLines)
#("....rows/file countlines =", data.row, "\tlength(count.fields)=", length(count.fields(smp.filename)) )
}
autotask1<-function () {
task1.1()
task1.2(showOne=0)
}
#
# task1.1 is to get various counts per file; the results include file size, line count, language (location), and encoding.
# All summary results are stored in the report folder. Output file formats are
# (1) ".rds" can be read for the knit/Rmd report. The file size is smaller.
# (2) ".txt" can be easily read by users. The file size is extremely big.
# input: none
# output: summary result files in the report folder
#
task1.1<-function () {
library(pryr); library(tm); library(R.utils)
cur.dir<-getwd();
data.dir<-checkDir(file.path(cur.dir, "mydata"))
output.dir<-checkDir(file.path (cur.dir, "mydata", "report")); #(" current.dir=", cur.dir, "\ndata.dir", data.dir, "\noutput.dir", output.dir )
#( "(1) Get basic summary data for all data files")
#("..Get all file names and file size\n")
flist.all<-DirSource( file.path(cur.dir, 'mydata', 'final'), recursive=TRUE)
flist<- basename(flist.all$filelist)
data.filesize<-file.info(flist.all$filelist)$size
data.filesize2<-utils:::format.object_size(data.filesize, "auto")
#("..Get each file line count")
require(R.utils)
data.row<-sapply(flist.all$filelist, countLines)
langlist<-c( rep('German (Germany)',3), rep('English (United States)',3), rep('Finnish (Finland)', 3), rep('Russian (Russia)',3))
elist.1<-c(rep("UTF-8",12))
elist.2<-c(rep("ISO-8859-1",9), rep('ISO-8859-5',3))
dfsummary<-as.data.frame( cbind(flist, data.filesize, data.filesize2, data.row,langlist, elist.1, elist.2))
rownames(dfsummary)<-c(1:12)
colnames(dfsummary)<-c('Filename', 'FileSizeInByte', 'FileSizeInMByte', 'FileLineCnt', 'LanguageLocation', 'Encoding1', 'Encoding2')
#("..Build data frame colname=", colnames(dfsummary))
output.filename <-file.path(output.dir, "data.summary.rds")
#("..Save summary for all data files to ", output.filename)
saveRDS(dfsummary, file=output.filename)   # .rds copy is read back when knitting the report
write.csv( dfsummary, file.path(output.dir, "data.summary.txt"))
}
#
# task1.2 is to get various counts per line, summarized as min, 1st quartile, median, mean, 3rd quartile, and max values.
# All summary results are stored in the report folder. Output file formats are
# (1) ".rds" can be read for the knit/Rmd report. The file size is smaller.
# (2) ".txt" can be easily read by users. The file size is extremely big.
# input: showOne - index of the file number (sequence order of the file list in the en_US folder).
#        If zero, process and generate summary reports for all files, and store them in the report folder
#        i.e. 1=en_US.blogs.txt 2=en_US.news.txt 3=en_US.twitter.txt
# output: summary result files in the report folder
#
task1.2<-function (showOne=0) {
library(pryr); library(stringi); library(stringr); library(tm)
cur.dir<-getwd();
data.dir<-checkDir(file.path(cur.dir, "mydata"))
output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
if (showOne==0) {
rangevalues<-c(1:3)
}
else {
rangevalues <-showOne
}
#( "(2) Get summary details for English US data files") #("..Get all file names and file size\n")
flist.en<-DirSource( file.path(cur.dir, 'mydata', 'final', "en_US"), recursive=TRUE)
flist<- basename(flist.en$filelist)
for (i in rangevalues)
{
#("(",i+2, ") Get summary details ", flist.en$filelist[i]) #("..Readline from ", flist[i])
data.lines<-sapply(flist.en$filelist[i], function(x) {
theLine<-readLines(x, encoding="UTF-8")
iconv(theLine, "latin1", "ASCII", sub="")
} )
colnames(data.lines)<-flist[i]
require(stringi)
#("..Replace special characters")
data<-str_replace_all(data.lines, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
data<-stri_trans_tolower(data)
data.nrow<-length(data)
dfdata<-data.frame( LineCharCnt=sapply(data, nchar)) #("..Character count per line")
dfdata$rowId<- c(1:data.nrow)
rownames(dfdata)<-c(1:data.nrow)
line.wordlist<- strsplit(data,' ')
dfdata$LineWordCnt<-sapply(line.wordlist, function(x) {#("..Word count per line")
sum(table(x[ x != ""]))
} )
#("..Average Word Length per line")
dfdata$LineAvgWordLen<-sapply(line.wordlist, function(x) {mean(nchar(x[ x != ""]))} )
#("..Unique Word count per line")
dfdata$LineUniqWordCnt<-sapply(line.wordlist, function(x) {length(table(unique(x[ x != ""])))} )
#("..Hash tag count per line")
dfdata$LineHashtagCnt<-sapply(line.wordlist, function(x) {length(grep("#", x))} )
#("..Http text count per line")
dfdata$LineHttpCnt<-sapply(line.wordlist, function(x) {length(grep("http", x))} )
fbasename<-substr(flist[i], 1, nchar(flist[i])-4)
output.filename <-file.path(output.dir, paste0(fbasename , ".linesummary.rds"))
#("..Save line summary for en_US data files to ", output.filename)
write.csv( dfdata, file=file.path(output.dir, paste0(fbasename , ".linesummary.txt")))
mdatasummary<-sapply(dfdata, summary) # return class could be list or matrix
output.filename <-file.path(output.dir, paste0(fbasename , ".datasummary.rds"))
saveRDS(mdatasummary, file=output.filename)   # save before re-reading it below
mdatasummary <-readRDS(output.filename)
dfdatasummary<-data.frame()
if (class(mdatasummary)=='list') {
data.row<-length(mdatasummary)
}
else {
data.row<-ncol(mdatasummary)
}
range.value<-c(1:data.row)
for (j in range.value) {
if (class(mdatasummary)=='list') {
theRow<- cbind(Filetype=fbasename, FieldSummaryBy=names(mdatasummary[j]), rbind(mdatasummary[[j]][1:6]))
} else {
theRow<- cbind(Filetype=fbasename, FieldSummaryBy=colnames(mdatasummary)[j], rbind(mdatasummary[1:6,j]))
}
dfdatasummary<-rbind(dfdatasummary, theRow)
}
output.filename <-file.path(output.dir, paste0(fbasename , ".summary.rds"))
saveRDS(dfdatasummary, file=output.filename)   # .rds copy is read back when knitting the report
write.csv( dfdatasummary, file=file.path(output.dir, paste0(fbasename , ".summary.txt")))
wordlist<-unlist(line.wordlist)
wordlist<-unlist( sapply(wordlist, function(x) {x[ nchar(x)>1]} ) )
mostfreqword<-sort(table(wordlist), decreasing=TRUE)
dfmfw<-data.frame(count=as.integer(mostfreqword))   # counts as a plain integer column
dfmfw$word<-names(mostfreqword)
dfmfw$prob<-dfmfw$count/sum(dfmfw$count)
data.nrow<-nrow(dfmfw)
rownames(dfmfw)<-c(1:data.nrow)
output.filename <-file.path(output.dir, paste0(fbasename, ".mostfreqword.rds"))
#("..Save most frequent words for en_US data files to ", output.filename)
saveRDS(dfmfw, file=output.filename)   # .rds copy is read back when knitting the report
write.csv( dfmfw, file=file.path(output.dir, paste0(fbasename , ".mostfreqword.txt")))
}
}
autotask2<-function ( showOne=0, smpprob=0.7, nline2read=500,
searchword='day', searchcormin=0.3, maxsparse=0.9) {
library(tm)
cat('current working directory=', getwd())
cur.dir<-getwd();
sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob), "train"))
data.dir<-sample.dir
output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
#( "Explore data analysis for English US data files")
#("..Get all file names and file size\n")
flist.en<-DirSource( sample.dir, recursive=TRUE)
flist<- basename(flist.en$filelist)
if (showOne==0) { rangevalues<-c(1:3) }
else { rangevalues <-showOne }
for (i in rangevalues)
{
datafname<- flist[i]
print (gc())
task2.1( datafname=datafname, smpprob, nline2read, searchword, searchcormin, maxsparse)
task2.2( datafname=datafname, smpprob, nline2read, searchword, maxsparse)
}
} #autotask2
#
# task2 explores the data: the relationship of words to the documents/corpus, and the
# most frequent words and their counts. Because the analysis results are extremely large, all results are
# stored in the report folder. Output file formats are
# (1) ".rds" can be read for the knit/Rmd report. The file size is smaller.
# (2) ".txt" can be easily read by users. The file size is extremely big.
#
# input: datafname - data file name which contains the training data
#        smpprob - sample probability for analysis, default 0.7
#        nline2read - max lines to read. If zero, read all lines from the file; default 500
#        searchword - word for which to search associations among the frequent words
#        searchcormin - correlation limit used when searching associations; a word is returned if its
#                       correlation is greater than or equal to this value; default 0.3
#        maxsparse - maximum sparsity allowed when removing sparse terms from the document matrix; default 0.9
# output: summary result files in the report folder
#
task2.1<-function ( datafname="en_US.news-train.txt", smpprob=0.7, nline2read=500,
searchword='day', searchcormin=0.3, maxsparse=0.9) {
library(rJava); library(NLP); library(openNLP); library(RWeka); library(R.utils); library(stringr);library(stringi);
library(tm);library(SnowballC);library(RColorBrewer);library(wordcloud);library(slam);library(reshape2)
cat('current working directory=', getwd())
cur.dir<-getwd();
sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob), "train"))
data.dir<-sample.dir
output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
data.filename<-file.path(data.dir, datafname)
#("..Read from ", data.filename)
data.filesize <- file.info(data.filename)$size
#("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
data.row <- sapply(data.filename, countLines)
#("..file countlines =", data.row, "\tlength(count.fields)=", length(count.fields(data.filename)) )
fcon <- file(data.filename, open="rb")
if (nline2read == 0) { theLine<-readLines(fcon, encoding="UTF-8") }
else { theLine<-readLines(fcon, encoding="UTF-8", n=nline2read) }
data.lines<-iconv(theLine, "latin1", "ASCII", sub="")
close(fcon)
data.nrow<-length(data.lines)
#("(2) Clean Data and transform text")
#("....remove URLs ")
data<-str_replace_all(data.lines, pattern="http[^[:space:]]*", replacement="")
#("....replace Number, Punctuation, Control keys (any other than English letters/) with a space")
data<-str_replace_all(data, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
#("....convert tolower case ")
data<-stri_trans_tolower(data)
#("....build stop words adding available/via, no r/big")
myStopwords <- c(stopwords('english'), "available", "via", "na","rrrr")
myStopwords <- setdiff(myStopwords, c("r", "big"))
#("....remove stop (common, unused) words: ", myStopwords)
stopword.cnt <- length(myStopwords)
range.value<-c(1:stopword.cnt)
for ( i in range.value) {
data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], " "), replacement=" " )
data<-str_replace_all(data, pattern=paste0("^", myStopwords[i], " "), replacement=" " )
data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], "$"), replacement=" " )
}
#("..Character count per line", sum(nchar(data.lines)))
line.wordlist<- strsplit(data,' ') # word tokenize
wordlist<-line.wordlist
wordlist<-lapply(wordlist, function(x) {x[ nchar(x)>1]} ) #("..wordlist nchar >1 = ", length(wordlist))
wordlist<- lapply(wordlist, reverselist) #("..reverse list which have word nchar >1 = ", length(wordlist))
#("..Build a corpus with a character vector source")
corpus.doc <- Corpus(VectorSource(wordlist))
#("....using stemming document to remove common words endings (ie. ing/es/s)")
require(SnowballC)
corpus.doc <- tm_map(corpus.doc, stemDocument, language = "english")
#("....head(summary(corpus.doc))")
#("....remove unnecessary/extra whitespaces")
corpus.doc <- tm_map(corpus.doc, stripWhitespace)
fbasename<-substr(datafname, 1, nchar(datafname)-4)
docs<-corpus.doc
tdm<-TermDocumentMatrix(docs)
#("....tdm dim="); print( dim(tdm) )
output.filename <-file.path(output.dir, paste0(fbasename , ".tdm.rds"))
#("..Save TermDocumentMatrix result to ", output.filename)
#(saveRDS(tdm, file= output.filename))
dtm<-DocumentTermMatrix(docs)
#("....dtm dim="); print( dim(tdm) )
output.filename <-file.path(output.dir, paste0(fbasename , ".dtm.rds"))
#("..Save DocumentTermMatrix result to ", output.filename)
#(saveRDS(dtm, file= output.filename))
#------- FYI only, so we can tune the max sparse value at runtime
for (sparse.val in c(0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96,
                     0.97, 0.98, 0.99, 0.999, 0.9999, 0.99999)) {
  dtmz <- removeSparseTerms(dtm, sparse.val) #("dtm sparse=", sparse.val, " dim=", dim(dtmz))
}
mfw.maxnbr<-20
dtm <- removeSparseTerms(dtm, maxsparse)
#("..Call remove sparse earlier max sparse=", maxsparse, " dim=");
dtm.dense<-as.matrix(dtm)
#("..dtm.dense object.size=", object.size(dtm.dense))
require(reshape2)
dtm.dense = melt(dtm.dense, value.name = "count")
#("....dtm.dense dim=" );print(dim(dtm.dense))
dtm.dense<-dtm.dense[which(dtm.dense$count>0),]
dtm.dense<-aggregate(dtm.dense["count"], by=dtm.dense[c("Terms")], FUN=sum)
#("....dtm.dense >0 dim=" );print(dim(dtm.dense))
dtm.dense<-dtm.dense[order(dtm.dense$count, decreasing = TRUE),]
freq<-dtm.dense
#("....All Frequency dtm.dense len=", length(freq), " nrow=", nrow(freq))
if (nrow(freq) < mfw.maxnbr) range.value <- c(1:nrow(freq))
else range.value <- c(1:mfw.maxnbr)
graphics.off()
callingFuncName<-"task2.1."   # prefix for the saved plot file name
png(file.path(output.dir, paste0(callingFuncName,fbasename,".plots.png")),
width = 500, height = 500, units = "px", bg="transparent")
par(mfrow=c(3,1), mar=c(5,4,2,0)+0.5, cex=1, las=1, oma=c(2,1,0,0))
p1<-hist(log(freq[range.value,"count"]), main =paste0(fbasename, "- Histogram for Most ", mfw.maxnbr, " Frequent Words"),
xlab='Log of Frequent Word Count')
p2<-barplot(freq[range.value,"count"], las = 2, names.arg = freq[range.value,"Terms"],
col ="lightgreen", main =paste0(fbasename, "- Most ", mfw.maxnbr, " Frequent Words"),
ylab = "Word Count")
output.filename <-file.path(output.dir, paste0(fbasename, ".mostfreqword.rds"))
#("..Save frequent words to ", output.filename) #(saveRDS(freq, file= output.filename))
write.csv( freq, file=file.path(output.dir, paste0(fbasename , ".mostfreqword.txt")))
if (nrow(freq) > 0) {
require(wordcloud)
pal = brewer.pal(8,"Dark2") #Paired, Set1-3, Pastel1-2, Accent,Dark2, BuPu, BuGn, Blues
p3<-wordcloud( words=freq$Terms, freq=freq$count, scale=c(4,0.5), random.order=FALSE,
               colors=pal, min.freq = 5, main=paste0(fbasename, "- Most ", mfw.maxnbr, " Frequent Word Cloud"))
}
freqterm <-findAssocs(dtm, searchword, searchcormin) #("..search word=", searchword, "\nfindAssocs=\n")
#("....suggest next words..", names(freqterm[[1]][1:length(freqterm[[1]])]))
print(p1)
print(p2)
print(p3)
dev.off()
graphics.off()
}
#
# task2.2 computes n-gram frequencies. Unigram, bigram, trigram, 4-gram, 5-gram, and combined 1-5 gram
# results are stored in the report folder.
# (1) ".rds" can be read for the knit/Rmd report. The file size is smaller.
# (2) ".txt" can be easily read by users. The file size is extremely big.
# input: datafname - data file name which contains the training data
#        smpprob - sample probability for analysis, default 0.7
#        nline2read - max lines to read. If zero, read all lines from the file; default 500
#        searchword - word for which to search associations among the frequent words
#        maxsparse - maximum sparsity allowed when removing sparse terms from the document matrix; default 0.9
# output: summary result files in the report folder
#
task2.2<-function ( datafname="en_US.news-train.txt", smpprob=0.7, nline2read=500,
searchword='day', maxsparse=0.90) {
library(rJava); library(NLP); library(openNLP); library(RWeka); library(R.utils); library(stringr)
library(stringi); library(tm); library(SnowballC); library(RColorBrewer); library(wordcloud)
cat('current working directory=', getwd())
cur.dir<-getwd();
sample.dir<-checkDir(file.path(cur.dir, "mydata", paste0("sample",smpprob),"train"))
data.dir<-sample.dir
output.dir<-checkDir(file.path (cur.dir, "mydata", "report"));
#("(1) Read training data file, and load to data set")
data.filename<-file.path(data.dir, datafname)
#("..Read from ", data.filename)
data.filesize <- file.info(data.filename)$size
#("..file info size=", data.filesize, " in bytes ", utils:::format.object_size(data.filesize, "auto"))
data.row <- sapply(data.filename, countLines)
#("..file countlines =", data.row, "\tlength(count.fields)=", length(count.fields(data.filename)) )
fcon <- file(data.filename, open="rb")
if (nline2read == 0) { theLine<-readLines(fcon, encoding="UTF-8") }
else { theLine<-readLines(fcon, encoding="UTF-8", n=nline2read) }
data.lines<-iconv(theLine, "latin1", "ASCII", sub="")
close(fcon)
data.nrow<-length(data.lines)
#("(2) Clean Data and transform text")
#("..Character count per line", sum(nchar(data.lines)))
#("....remove URLs ")
data<-str_replace_all(data.lines, pattern="http[^[:space:]]*", replacement="")
#("....replace Number, Punctuation, Control keys (any other than English letters/) with a space")
data<-str_replace_all(data, pattern="[[:punct:][:digit:][:cntrl:]]", replacement="")
data<-str_replace_all(data, pattern="[[0-9][]\\?!\"\'$%&(){}+*/:;,._`|~\\[<=>@\\^-]]", replacement="" )
#("....convert tolower case ")
data<-stri_trans_tolower(data)
#("....build stop words adding available/via, no r/big")
myStopwords <- c(stopwords('english'), "available", "via", "na","rrrr")
myStopwords <- setdiff(myStopwords, c("r", "big"))
#("....remove stop (common, unused) words: ", myStopwords)
stopword.cnt <- length(myStopwords)
range.value<-c(1:stopword.cnt)
for ( i in range.value) {
data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], " "), replacement=" " )
data<-str_replace_all(data, pattern=paste0("^", myStopwords[i], " "), replacement=" " )
data<-str_replace_all(data, pattern=paste0(" ", myStopwords[i], "$"), replacement=" " )
}
#("..Character count per line", sum(nchar(data.lines)))
line.wordlist<- strsplit(data,' ') #word tokenize
wordlist<-line.wordlist #(".. wordlist length=", length(wordlist), " nchar(head wordlist)=", nchar(head(wordlist)))
#("..wordlist nchar >1 = ", length(wordlist))
wordlist<- lapply(wordlist, reverselist) #("..reverse list which have word nchar >1 = ", length(wordlist))
#("..Build a corpus with a character vector source class(wordlist)=", class(wordlist))
corpus.doc <- Corpus(VectorSource(wordlist))
writeLines(strwrap(as.character(corpus.doc[[1]]), width=60))
#("....using stemming document to remove common words endings (ie. ing/es/s)")
require(SnowballC)
corpus.doc <- tm_map(corpus.doc, stemDocument, language = "english")
#("....head(summary(corpus.doc))")
writeLines(strwrap(corpus.doc[[1]], width=73))
#("....remove unnecessary/extra whitespaces")
corpus.doc <- tm_map(corpus.doc, stripWhitespace)
fbasename<-substr(datafname, 1, nchar(datafname)-4)
docs<-corpus.doc
tdm<-TermDocumentMatrix(docs)
#("....tdm dim="); print( dim(tdm) )
output.filename <-file.path(output.dir, paste0(fbasename , ".tdm.rds"))
#("..Save TermDocumentMatrix result to ", output.filename) #(saveRDS(tdm, file= output.filename))
dtm<-DocumentTermMatrix(docs) #("....dtm dim="); print( dim(tdm) )
output.filename <-file.path(output.dir, paste0(fbasename , ".dtm.rds"))
#("..Save DocumentTermMatrix result to ", output.filename) #(saveRDS(dtm, file= output.filename))
#("(3) Ngram Language Modeling")
mfw.maxnbr<-20
require(RWeka)
dtm <- removeSparseTerms(dtm, maxsparse) # Due to a vector-size error, remove sparse terms early; print(dim(dtm))
#("..Unigram " )
n1gramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1, delimiters = " \r\n\t"))
tdm.1gram<-TermDocumentMatrix(docs, control=list(tokenize = n1gramTokenizer))
output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.1gram.rds"))
#("..Save n1gram result to ", output.filename) #(saveRDS(tdm.1gram, file= output.filename))
tdm.1gram <- removeSparseTerms(tdm.1gram, maxsparse)
nrow.1gram<-rowSums(as.matrix(tdm.1gram))
output.filename <-file.path(output.dir, paste0(fbasename, ".1gram.rds"))
#("..Save n1gram result to ", output.filename) #(saveRDS(tdm.1gram, file= output.filename))
mfw.1gram <- tail(sort(nrow.1gram), mfw.maxnbr)
#("....Unigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.1gram) )# , " length=", length(nrow.1gram)) print(mfw.1gram)
#("..Bigram")
n2gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=2, max=2, delimiters = " \r\n\t"))
tdm.2gram<- TermDocumentMatrix(docs, control = list(tokenize = n2gramTokenizer))
output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.2gram.rds"))
#("..Save n2gram result to ", output.filename) #(saveRDS(tdm.2gram, file= output.filename))
tdm.2gram <- removeSparseTerms(tdm.2gram, maxsparse)
nrow.2gram<-rowSums(as.matrix(tdm.2gram))
output.filename <-file.path(output.dir, paste0(fbasename, ".2gram.rds"))
#("..Save n2gram result to ", output.filename) #(saveRDS(tdm.2gram, file= output.filename))
mfw.2gram<-tail(sort(nrow.2gram), mfw.maxnbr)
#("....Bigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.2gram), " length=", length(nrow.2gram)) print(mfw.2gram)
#("..n3-gram")
n3gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=3, max=3, delimiters = " \r\n\t"))
tdm.3gram<- TermDocumentMatrix(docs, control = list(tokenize = n3gramTokenizer))
output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.3gram.rds"))
#("..Save n3gram result to ", output.filename) #(saveRDS(tdm.3gram, file= output.filename))
tdm.3gram <- removeSparseTerms(tdm.3gram, maxsparse)
nrow.3gram<-rowSums(as.matrix(tdm.3gram))
output.filename <-file.path(output.dir, paste0(fbasename, ".3gram.rds"))
#("..Save n3gram result to ", output.filename) #(saveRDS(tdm.3gram, file= output.filename))
mfw.3gram<-tail(sort(nrow.3gram), mfw.maxnbr)
#("....Trigram...Most ", mfw.maxnbr, " words\n class=", class(nrow.3gram), " length=", length(nrow.3gram)) print(mfw.3gram)
#("..n4-gram")
n4gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=4, max=4, delimiters = " \r\n\t"))
tdm.4gram<- TermDocumentMatrix(docs, control = list(tokenize = n4gramTokenizer))
output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.4gram.rds"))
#("..Save n4gram result to ", output.filename) #(saveRDS(tdm.4gram, file= output.filename))
tdm.4gram <- removeSparseTerms(tdm.4gram, maxsparse)
nrow.4gram<-rowSums(as.matrix(tdm.4gram))
output.filename <-file.path(output.dir, paste0(fbasename, ".4gram.rds"))
#("..Save n4gram result to ", output.filename) #(saveRDS(tdm.4gram, file= output.filename))
mfw.4gram<-tail(sort(nrow.4gram), mfw.maxnbr)
#("....n4gram...Most ", mfw.maxnbr, " words\n class=", class(nrow.4gram), " length=", length(nrow.4gram)) print(mfw.4gram )
#("..n5-gram")
n5gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=5, max=5, delimiters = " \r\n\t"))
tdm.5gram<- TermDocumentMatrix(docs, control = list(tokenize = n5gramTokenizer))
output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.5gram.rds"))
#("..Save n5gram result to ", output.filename) #(saveRDS(tdm.5gram, file= output.filename))
tdm.5gram <- removeSparseTerms(tdm.5gram, maxsparse)
nrow.5gram<-rowSums(as.matrix(tdm.5gram))
output.filename <-file.path(output.dir, paste0(fbasename, ".5gram.rds"))
#("..Save n5gram result to ", output.filename) #(saveRDS(tdm.5gram, file= output.filename))
mfw.5gram<-tail(sort(nrow.5gram), mfw.maxnbr)
#("....n5ngram...Most ", mfw.maxnbr, " words\n class=", class(nrow.5gram), " length=", length(nrow.5gram)) print(mfw.5gram)
#("..n1-n5 gram")
n15gramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=1, max=5, delimiters = " \r\n\t"))
tdm.15gram<- TermDocumentMatrix(docs, control = list(tokenize = n15gramTokenizer))
output.filename <-file.path(output.dir, paste0(fbasename, ".tdm.15gram.rds"))
#("..Save n15gram result to ", output.filename) #(saveRDS(tdm.15gram, file= output.filename))
tdm.15gram <- removeSparseTerms(tdm.15gram, maxsparse)
nrow.15gram<-rowSums(as.matrix(tdm.15gram))
output.filename <-file.path(output.dir, paste0(fbasename, ".15gram.rds"))
#("..Save n15gram result to ", output.filename) #(saveRDS(tdm.15gram, file= output.filename))
mfw.15gram<-tail(sort(nrow.15gram), mfw.maxnbr)
#("....n1-n5ngram...Most ", mfw.maxnbr, " words\n class=", class(nrow.15gram), " length=", length(nrow.15gram)) print( mfw.15gram)
graphics.off()
callingFuncName<-"task2.2."   # prefix for the saved plot file name
png(file.path(output.dir, paste0(callingFuncName,fbasename,".plots.png")),
width = 600, height = 800, units = "px", bg="transparent")
par(mfrow=c(3,2), mar=c(5,4,2,1)+0.5, cex=1, las=2, oma=c(2,4,2,2) )
if (length(nrow.1gram) > 0 ) {
p1<-barplot(mfw.1gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Unigrams"),horiz=TRUE)
print(p1) #("....Print Unigrams images")
}
if (length(nrow.2gram)>0 ) {
p2<-barplot(mfw.2gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Bigrams"),horiz=TRUE)
print(p2) #("....Print Bigrams images")
}
if (length(nrow.3gram) > 0 ) {
p3<-barplot(mfw.3gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," Trigrams"), horiz=TRUE)
print(p3) #("....Print Trigrams images")
}
if (length(nrow.4gram) > 0 ) {
p4<-barplot(mfw.4gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n4 grams"), horiz=TRUE)
print(p4) #("....Print n4 grams images")
}
if (length(nrow.5gram) > 0 ) {
p5<-barplot(mfw.5gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n5 grams"), horiz=TRUE)
print(p5) #("....Print n5 grams images")
}
if (length(nrow.15gram) > 0 ) {
p6<-barplot(mfw.15gram, las = 2, main = paste0(fbasename," - Top ", mfw.maxnbr," n1-n5 grams"), horiz=TRUE)
print(p6) #("....Print n1-n5 grams images")
}
dev.off()
graphics.off()
}
#display output of task1() in task1.R
dfsummary.alldata<-readRDS(file.path(cur.dir, 'mydata', 'report', 'data.summary.rds'))
require(knitr)
kable(dfsummary.alldata, align='l', caption = "Summary of All Data Files" )
#display output of task1.2() in task1.R
dblogs<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.blogs.summary.rds'))
dnews<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.news.summary.rds'))
dtwt<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.twitter.summary.rds'))
dfsummary <- data.frame( rbind(dblogs, dnews, dtwt) )
attach(dfsummary)
kable(dfsummary[order(FieldSummaryBy, Filetype),c(2,1,3:8)], align='l', caption = "Summary of English (United States) Data Files", row.names=FALSE )
detach(dfsummary)
#most frequent words
mfw.maxnbr<-20
dblogs.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.blogs.mostfreqword.rds'))
dnews.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.news.mostfreqword.rds'))
dtwt.mfw<-readRDS(file.path(cur.dir, 'mydata', 'report', 'en_US.twitter.mostfreqword.rds'))
dfmfw <- data.frame(cbind( dblogs.mfw[,c(2,1,3)], dnews.mfw[,c(2,1,3)], dtwt.mfw[,c(2,1,3)] ))
dfmfw.r <- data.frame(rbind( cbind(type='blogs',dblogs.mfw[1:mfw.maxnbr,c(2,1,3)]),
cbind(type='news', dnews.mfw[1:mfw.maxnbr,c(2,1,3)]),
cbind(type='twitter',dtwt.mfw[1:mfw.maxnbr,c(2,1,3)]) ))
dfmfw.r$type <- factor(dfmfw.r$type)
require(ggplot2)
colnames(dfmfw) <- c("blogs.word", "blogs.wordcnt","blogs.wordprob","news.word","news.wordcnt", "news.wordprob","twt.word","twt.wordcnt", "twt.wordprob")
kable( dfmfw[1:mfw.maxnbr,], align='l', caption = paste0("Most ", mfw.maxnbr," Frequent Words" ))
ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Most 20 Frequent Word Count by File Type") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) +
geom_bar(stat="identity", position=position_dodge()) + facet_grid(type~.) +
ggtitle("Most 20 Frequent Word Count by File Type") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
ggplot(data=dfmfw.r, aes(x=word, y=count, group=type, shape=type, color=type, fill=type)) +
geom_line() + geom_smooth() +
ggtitle("Most 20 Frequent Word Count by File Type") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p4<-ggplot(data=dfmfw.r, aes(x=count, y=prob, color=type, fill=type)) +
    geom_line() + geom_point(size=4, shape=21, fill="white") +
    labs(x="Word Count Per Line", y="Word Probability Per Line") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p5<-ggplot(data=dfmfw.r, aes(x=count, y=log(prob), color=type, fill=type))+ geom_point(size=4, shape=21, fill="white") +
stat_smooth(method="lm") + labs(x="Word Count Per Line", y="Log of Word Probability Per Line") +
geom_line() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
require(gridExtra)
grid.arrange(p4, p5, ncol=2)
qplot(log(count), data=dfmfw.r, geom="density", fill=type, xlab='Log(Word Count Per Line)', ylab='Density', main="Distribution of Word Count by File Type")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))