Introduction

Text mining and predictive text analytics are compelling areas of study and application in the modern world of data science. Their applications are numerous, including forensics, marketing, and customer relationship management. Even mobile devices leverage text mining: on-screen keyboards predict the word a user is about to type in order to speed up text entry.

The Capstone Project for the Data Science Certificate offered by Johns Hopkins University requires students to build a text prediction model using the principles learned in the certification courses. This report is an analysis of the corpora (the text samples) provided for building the prediction algorithm. It will survey the documents, provide summary information about them, and discuss interesting findings. Additionally, it covers the sampling and modeling strategy for completing the modeling assignment.

Data Survey

First, let's look at the data we were given. After reading the directory of text samples into a corpus, we loop over the documents and retrieve their metadata:

#get the metadata for each document
metaCorpora<-lapply(corpora,meta)
print(metaCorpora)
## $en_US.blogs.txt
##   author       : character(0)
##   datetimestamp: 2016-03-19 18:44:51
##   description  : character(0)
##   heading      : character(0)
##   id           : en_US.blogs.txt
##   language     : en
##   origin       : character(0)
## 
## $en_US.news.txt
##   author       : character(0)
##   datetimestamp: 2016-03-19 18:44:51
##   description  : character(0)
##   heading      : character(0)
##   id           : en_US.news.txt
##   language     : en
##   origin       : character(0)
## 
## $en_US.twitter.txt
##   author       : character(0)
##   datetimestamp: 2016-03-19 18:44:51
##   description  : character(0)
##   heading      : character(0)
##   id           : en_US.twitter.txt
##   language     : en
##   origin       : character(0)
#get the length of each document in lines
originalLengths<-lapply(lapply(corpora,as.character),length)
print(originalLengths)
## $en_US.blogs.txt
## [1] 899288
## 
## $en_US.news.txt
## [1] 1010242
## 
## $en_US.twitter.txt
## [1] 2360148

The texts are too large to process in their entirety. Hence, we sample each document and write the output to a temporary directory. We then re-read the samples as a corpus and clean the text by lowercasing it, removing numbers, punctuation, stop words, and profanity, and stemming the remaining words. We will then continue with the analysis.

sampleAndWriteTexts <- function(dataSourcePath="data/final/en_US/en_US.blogs.txt",
                               #line to start reading from; by default pick a random spot
                               startLine=sample(1:10000,size=1,replace=T),
                               #how many lines to read
                               readvector=1000){
        con <- file(dataSourcePath,open="r")

        #skip ahead to the starting line, one line at a time
        for(i in 1:startLine){
                txtTmp <- readLines(con,1)
        }

        #now that the skip point has been reached, read the sample
        txtR <- readLines(con,n=readvector,skipNul=TRUE)

        close(con) #done reading lines, now write them out

        #write the sample to a file; [[1]][[4]] extracts the file name from the original path
        write.table(txtR,paste0("temp/",strsplit(dataSourcePath,"/")[[1]][4]),col.names=FALSE)
}

#sample 50000 lines of text from each file
#startLine is assumed to hold a randomly chosen offset set earlier (not shown above);
#the same offset is reused for all three files
sampleAndWriteTexts(dataSourcePath="data/final/en_US/en_US.blogs.txt",startLine=startLine,
                    readvector=50000)
sampleAndWriteTexts(dataSourcePath="data/final/en_US/en_US.twitter.txt",startLine=startLine,
                    readvector=50000)
sampleAndWriteTexts(dataSourcePath="data/final/en_US/en_US.news.txt",startLine=startLine,
                    readvector=50000)

(corpora <- VCorpus(DirSource("temp/"),readerControl=list(language="english")))
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
corpora<-tm_map(corpora,content_transformer(tolower))
corpora<-tm_map(corpora,removeNumbers)
corpora<-tm_map(corpora,removePunctuation)
corpora<-tm_map(corpora, removeWords, stopwords("english"))
corpora<-tm_map(corpora, removeWords, stopwords("SMART"))
corpora<-tm_map(corpora,removeWords, profanity) #removeWords comes from the tm package
corpora<-tm_map(corpora,stripWhitespace)
corpora<-tm_map(corpora,stemDocument,lazy = TRUE)


createTextFrequencyDF <- function (corpustext,controlArg,source=""){

        #create a document term matrix from the corpora for analysis
        dtm<-DocumentTermMatrix(corpustext,control=controlArg)
        #sum each term's counts across documents; each entry is a word with a number value,
        #sorted in decreasing order of frequency
        freq<-sort(colSums(as.matrix(dtm)),decreasing = TRUE)

        #build a pareto analysis of the terms
        wf<-data.frame(term=names(freq),
                      occurrences=freq,
                      cumfreqpct=cumsum((freq/sum(freq))*100),
                      source=source
        )
        return (wf)
}

Summary Statistics

Looking across all three documents in the corpus, the following table shows the 25 most frequent words (after stemming):

wf_master<-createTextFrequencyDF(controlArg = list(wordLengths=c(4, 20)),corpustext = corpora,source="All Docs")
wfstat<-slice(wf_master,1:25) #top 25
kable(wfstat,digits=1)
term occurrences cumfreqpct source
time 11074 0.6 All Docs
year 9984 1.2 All Docs
make 8553 1.7 All Docs
work 7269 2.1 All Docs
peopl 6952 2.6 All Docs
love 6940 3.0 All Docs
good 6599 3.3 All Docs
dont 6398 3.7 All Docs
back 5877 4.1 All Docs
thing 5588 4.4 All Docs
start 4570 4.6 All Docs
week 4557 4.9 All Docs
state 4498 5.2 All Docs
great 4139 5.4 All Docs
play 4109 5.7 All Docs
home 3961 5.9 All Docs
call 3957 6.1 All Docs
today 3925 6.3 All Docs
feel 3895 6.6 All Docs
game 3895 6.8 All Docs
school 3811 7.0 All Docs
show 3786 7.2 All Docs
made 3720 7.5 All Docs
life 3621 7.7 All Docs
live 3569 7.9 All Docs

From this table, we can see that the top 25 terms account for roughly 8% of all word occurrences in the sampled corpus.
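That figure comes directly from the cumulative-frequency column built by createTextFrequencyDF; a quick check, using the wfstat slice from the previous chunk:

#cumulative share (in percent) of all word occurrences covered by the top 25 terms
wfstat$cumfreqpct[25]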

Now let’s look at a bi-gram and tri-gram analysis to see which word pairs and triplets occur most frequently.

options(mc.cores=1)  #on macOS, set mc.cores to 1 so the Weka-based tokenizers run on a single core
#NGramTokenizer and Weka_control come from the RWeka package
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bgf<-slice(createTextFrequencyDF(controlArg =  list(tokenize = BigramTokenizer),corpustext = corpora,source="All Docs"),1:25)

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tgf<-slice(createTextFrequencyDF(controlArg =  list(tokenize = TrigramTokenizer),corpustext = corpora, source="All Docs"),1:25)
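The two tables below were presumably rendered the same way as the unigram table, for example:

kable(bgf, digits = 7)
kable(tgf, digits = 7)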

The top 25 word pairs

term occurrences cumfreqpct source
year ago 750 0.0425155 All Docs
high school 661 0.0799859 All Docs
st loui 457 0.1058920 All Docs
unit state 380 0.1274332 All Docs
los angel 327 0.1459699 All Docs
san francisco 314 0.1637698 All Docs
past year 251 0.1779983 All Docs
long time 246 0.1919434 All Docs
good thing 236 0.2053216 All Docs
health care 233 0.2185297 All Docs
happi birthday 228 0.2314545 All Docs
ice cream 207 0.2431887 All Docs
good morn 205 0.2548096 All Docs
san diego 204 0.2663739 All Docs
spend time 199 0.2776546 All Docs
week ago 198 0.2888787 All Docs
social media 189 0.2995926 All Docs
mother day 188 0.3102499 All Docs
time year 183 0.3206237 All Docs
school district 176 0.3306006 All Docs
good luck 173 0.3404075 All Docs
peopl dont 171 0.3501011 All Docs
white hous 169 0.3596812 All Docs
make feel 164 0.3689780 All Docs
work hard 164 0.3782747 All Docs

The top 25 word triplets

term occurrences cumfreqpct source
presid barack obama 93 0.0057291 All Docs
happi mother day 72 0.0101645 All Docs
gov chris christi 55 0.0135527 All Docs
st loui counti 48 0.0165096 All Docs
world war ii 46 0.0193434 All Docs
high school student 41 0.0218691 All Docs
st patrick day 36 0.0240868 All Docs
cinco de mayo 35 0.0262429 All Docs
chief financi offic 29 0.0280294 All Docs
coupl year ago 29 0.0298159 All Docs
martin luther king 28 0.0315408 All Docs
counti prosecutor offic 27 0.0332040 All Docs
spend lot time 27 0.0348673 All Docs
coupl week ago 26 0.0364690 All Docs
averag point rebound 25 0.0380091 All Docs
chief execut offic 25 0.0395492 All Docs
world trade center 25 0.0410892 All Docs
doesnt make sens 23 0.0425061 All Docs
high blood pressur 23 0.0439230 All Docs
senior vice presid 23 0.0453399 All Docs
amazon servic llc 22 0.0466951 All Docs
beep beep beep 22 0.0480504 All Docs
love love love 22 0.0494057 All Docs
state attorney general 22 0.0507609 All Docs
high school graduat 21 0.0520546 All Docs

Looking at this from a Pareto perspective, how many words does it take to reach 90% of all word occurrences?

It takes approximately 9737 terms to get to 90%, which is about 10.69 percent of all distinct terms in the sampled corpus.
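A sketch of how these numbers can be derived from the cumulative-frequency column, assuming the wf_master data frame built earlier:

#first row at which the cumulative frequency reaches 90%
rows90 <- min(which(wf_master$cumfreqpct >= 90))
rows90                            #approximately 9737 terms
100 * rows90 / nrow(wf_master)    #share of all distinct terms, approximately 10.69%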

Analysis and Conclusions

It is important that these corpora be analyzed together when building the model, because running the same analysis on each document individually shows clear differences in their text profiles. Compare en_US.blogs.txt with en_US.news.txt below to see which terms rise to the top of each.
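Each of the per-document tables below can be reproduced by running createTextFrequencyDF on a one-document subset of the corpus; a minimal sketch for the blogs document, assuming the corpora object built earlier:

wf_blogs <- createTextFrequencyDF(controlArg = list(wordLengths = c(4, 20)),
                                  corpustext = corpora["en_US.blogs.txt"],
                                  source = "en_US.blogs.txt")
kable(slice(wf_blogs, 1:25), digits = 1)

en_US.blogs.txt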

term occurrences cumfreqpct source
time 5997 0.8 en_US.blogs.txt
make 4386 1.4 en_US.blogs.txt
year 3599 1.9 en_US.blogs.txt
love 3560 2.3 en_US.blogs.txt
thing 3489 2.8 en_US.blogs.txt
peopl 3415 3.3 en_US.blogs.txt
work 3338 3.7 en_US.blogs.txt
dont 3046 4.1 en_US.blogs.txt
good 2909 4.5 en_US.blogs.txt
back 2866 4.9 en_US.blogs.txt
feel 2219 5.2 en_US.blogs.txt
life 2209 5.5 en_US.blogs.txt
book 2180 5.8 en_US.blogs.txt
start 2179 6.1 en_US.blogs.txt
week 2153 6.4 en_US.blogs.txt
made 2005 6.6 en_US.blogs.txt
live 1908 6.9 en_US.blogs.txt
read 1758 7.1 en_US.blogs.txt
find 1752 7.4 en_US.blogs.txt
world 1751 7.6 en_US.blogs.txt
friend 1706 7.8 en_US.blogs.txt
home 1662 8.0 en_US.blogs.txt
great 1632 8.3 en_US.blogs.txt
call 1601 8.5 en_US.blogs.txt
place 1591 8.7 en_US.blogs.txt

en_US.news.txt

term occurrences cumfreqpct source
year 5501 0.7 en_US.news.txt
state 3362 1.2 en_US.news.txt
time 3284 1.6 en_US.news.txt
make 2578 2.0 en_US.news.txt
work 2541 2.3 en_US.news.txt
peopl 2439 2.7 en_US.news.txt
game 2323 3.0 en_US.news.txt
school 2288 3.3 en_US.news.txt
citi 2215 3.6 en_US.news.txt
play 2094 3.9 en_US.news.txt
includ 1917 4.1 en_US.news.txt
team 1869 4.4 en_US.news.txt
home 1766 4.6 en_US.news.txt
back 1735 4.9 en_US.news.txt
percent 1720 5.1 en_US.news.txt
call 1682 5.3 en_US.news.txt
million 1675 5.5 en_US.news.txt
counti 1658 5.8 en_US.news.txt
season 1647 6.0 en_US.news.txt
week 1631 6.2 en_US.news.txt
start 1618 6.4 en_US.news.txt
compani 1609 6.6 en_US.news.txt
good 1552 6.9 en_US.news.txt
report 1483 7.1 en_US.news.txt
show 1476 7.3 en_US.news.txt

en_US.twitter.txt

term occurrences cumfreqpct source
love 2614 1.1 en_US.twitter.txt
good 2138 2.1 en_US.twitter.txt
dont 1894 2.9 en_US.twitter.txt
time 1793 3.7 en_US.twitter.txt
great 1639 4.4 en_US.twitter.txt
make 1589 5.1 en_US.twitter.txt
today 1589 5.8 en_US.twitter.txt
work 1390 6.4 en_US.twitter.txt
follow 1314 6.9 en_US.twitter.txt
back 1276 7.5 en_US.twitter.txt
peopl 1098 8.0 en_US.twitter.txt
happi 1065 8.4 en_US.twitter.txt
tonight 1042 8.9 en_US.twitter.txt
watch 894 9.3 en_US.twitter.txt
night 885 9.7 en_US.twitter.txt
year 884 10.0 en_US.twitter.txt
feel 875 10.4 en_US.twitter.txt
thing 872 10.8 en_US.twitter.txt
hope 830 11.2 en_US.twitter.txt
show 829 11.5 en_US.twitter.txt
your 813 11.9 en_US.twitter.txt
tweet 804 12.2 en_US.twitter.txt
game 789 12.6 en_US.twitter.txt
start 773 12.9 en_US.twitter.txt
week 773 13.2 en_US.twitter.txt

It is also important to notice the difference in frequency percentages across the uni-gram, bi-gram, and tri-gram tables: the shorter the n-gram, the more frequently it occurs, and as the n-gram length grows, the frequency of even the top-ranking n-grams drops sharply.
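One way to see this directly is to compare the cumulative-frequency share of the single most frequent entry in each table, using the wf_master, bgf, and tgf objects from earlier:

#cumulative-frequency share (in percent) of the single most frequent term at each n-gram order
c(unigram = wf_master$cumfreqpct[1],
  bigram  = bgf$cumfreqpct[1],
  trigram = tgf$cumfreqpct[1])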

Sampling and Model Building Strategy

In this document, I sampled each text by starting at a random line and reading the next 50,000 lines. This was done to improve performance and render the report more quickly. For building and testing the model, we will use a hold-out strategy: 60% of the data will be used to build the model, 20% will be used to test it, and the remaining 20% will be used to validate it.
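A minimal sketch of that split, assuming the sampled lines have been gathered into a single character vector called sampledLines (a hypothetical name):

set.seed(42)  #reproducible partition
n <- length(sampledLines)
idx <- sample(seq_len(n))                        #shuffle the line indices
trainIdx    <- idx[1:floor(0.6 * n)]
testIdx     <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]
validateIdx <- idx[(floor(0.8 * n) + 1):n]

trainSet    <- sampledLines[trainIdx]            #60% for model building
testSet     <- sampledLines[testIdx]             #20% for testing
validateSet <- sampledLines[validateIdx]         #20% for validation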