Text Mining and Predictive Text Analytics are interesting areas of study and application in the modern world of data science. Their applications are numerous, including forensics, marketing, and customer relationship management. Even mobile devices leverage text mining: keyboards predict the word a user is about to type in order to speed up text entry.
The Capstone Project for the Data Science Certificate offered by Johns Hopkins University requires students to build a text prediction model using the principles learned in the certification courses. This report is an analysis of the corpora (the text samples) given for building the prediction algorithm. It will survey the documents, provide summary information about them, and discuss interesting findings. Additionally, it will cover the sampling and modeling strategy for completing the modeling assignment.
First, let's look at the data given to us. After reading in the directory of text samples, we loop over the documents and get their descriptions:
# get the metadata descriptions of the documents
metaCorpora <- lapply(corpora, meta)
print(metaCorpora)
## $en_US.blogs.txt
## author : character(0)
## datetimestamp: 2016-03-19 18:44:51
## description : character(0)
## heading : character(0)
## id : en_US.blogs.txt
## language : en
## origin : character(0)
##
## $en_US.news.txt
## author : character(0)
## datetimestamp: 2016-03-19 18:44:51
## description : character(0)
## heading : character(0)
## id : en_US.news.txt
## language : en
## origin : character(0)
##
## $en_US.twitter.txt
## author : character(0)
## datetimestamp: 2016-03-19 18:44:51
## description : character(0)
## heading : character(0)
## id : en_US.twitter.txt
## language : en
## origin : character(0)
# get their length in lines
originalLengths <- lapply(lapply(corpora, as.character), length)
print(originalLengths)
## $en_US.blogs.txt
## [1] 899288
##
## $en_US.news.txt
## [1] 1010242
##
## $en_US.twitter.txt
## [1] 2360148
The texts are too large to process and map in their entirety. Hence, we will sample the documents and write the output to a temp space. Then we will re-read them and clean up the text by removing numbers, punctuation, stop-words, and profanity. We will then continue with the analysis.
sampleAndWriteTexts <- function(dataSourcePath = "data/final/en_US/en_US.blogs.txt",
                                # where to start reading: pick a random spot
                                startLine = sample(1:10000, size = 1, replace = TRUE),
                                # how many lines to read
                                readvector = 1000) {
  con <- file(dataSourcePath, open = "r")
  # skip ahead, one line at a time, until the random starting point is reached
  for (i in 1:startLine) {
    txtTmp <- readLines(con, 1)
  }
  # now that the skip point has been reached, read the sample
  txtR <- readLines(con, n = readvector, skipNul = TRUE)
  close(con) # done reading lines, now write them out
  # write the sample to temp/; the [[1]][[4]] element is the file name of the original document
  write.table(txtR, paste0("temp/", strsplit(dataSourcePath, "/")[[1]][4]), col.names = FALSE)
}
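The three calls below pass the same `startLine` value, which is assumed to have been drawn once beforehand so that every document is skipped by the same random offset; a minimal sketch of that assumption:
# assumption: startLine is drawn once before the three sampling calls
set.seed(42)                           # hypothetical seed, for reproducibility
startLine <- sample(1:10000, size = 1) # random number of lines to skip in each file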
# read 50000 lines of text from each document
sampleAndWriteTexts(dataSourcePath = "data/final/en_US/en_US.blogs.txt",
                    startLine = startLine, readvector = 50000)
sampleAndWriteTexts(dataSourcePath = "data/final/en_US/en_US.twitter.txt",
                    startLine = startLine, readvector = 50000)
sampleAndWriteTexts(dataSourcePath = "data/final/en_US/en_US.news.txt",
                    startLine = startLine, readvector = 50000)
(corpora <- VCorpus(DirSource("temp/"),readerControl=list(language="english")))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
corpora<-tm_map(corpora,content_transformer(tolower))
corpora<-tm_map(corpora,removeNumbers)
corpora<-tm_map(corpora,removePunctuation)
corpora<-tm_map(corpora, removeWords, stopwords("english"))
corpora<-tm_map(corpora, removeWords, stopwords("SMART"))
corpora<-tm_map(corpora,removeWords, profanity) #removeWords comes from the tm package
corpora<-tm_map(corpora,stripWhitespace)
corpora<-tm_map(corpora,stemDocument,lazy = TRUE)
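The `profanity` object used above is not defined in this section; it is assumed to be a character vector of words to filter that was loaded earlier. A minimal sketch, assuming a hypothetical local word list at data/profanity.txt with one word per line:
# assumption: hypothetical path; one profane word per line
profanity <- readLines("data/profanity.txt", warn = FALSE, skipNul = TRUE)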
createTextFrequencyDF <- function(corpustext, controlArg, source = "") {
  # create a document-term matrix from the corpora for analysis
  dtm <- DocumentTermMatrix(corpustext, control = controlArg)
  # sum the counts for each term and sort in decreasing order;
  # each element is a term with its total number of occurrences
  freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
  # build a Pareto analysis of the terms
  wf <- data.frame(term = names(freq),
                   occurrences = freq,
                   cumfreqpct = cumsum((freq / sum(freq)) * 100),
                   source = source)
  return(wf)
}
Looking across all three documents in the corpora, the following table shows the 25 most frequent (stemmed) terms:
wf_master<-createTextFrequencyDF(controlArg = list(wordLengths=c(4, 20)),corpustext = corpora,source="All Docs")
wfstat<-slice(wf_master,1:25) #top 25
kable(wfstat,digits=1)
| term | occurrences | cumfreqpct | source |
|---|---|---|---|
| time | 11074 | 0.6 | All Docs |
| year | 9984 | 1.2 | All Docs |
| make | 8553 | 1.7 | All Docs |
| work | 7269 | 2.1 | All Docs |
| peopl | 6952 | 2.6 | All Docs |
| love | 6940 | 3.0 | All Docs |
| good | 6599 | 3.3 | All Docs |
| dont | 6398 | 3.7 | All Docs |
| back | 5877 | 4.1 | All Docs |
| thing | 5588 | 4.4 | All Docs |
| start | 4570 | 4.6 | All Docs |
| week | 4557 | 4.9 | All Docs |
| state | 4498 | 5.2 | All Docs |
| great | 4139 | 5.4 | All Docs |
| play | 4109 | 5.7 | All Docs |
| home | 3961 | 5.9 | All Docs |
| call | 3957 | 6.1 | All Docs |
| today | 3925 | 6.3 | All Docs |
| feel | 3895 | 6.6 | All Docs |
| game | 3895 | 6.8 | All Docs |
| school | 3811 | 7.0 | All Docs |
| show | 3786 | 7.2 | All Docs |
| made | 3720 | 7.5 | All Docs |
| life | 3621 | 7.7 | All Docs |
| live | 3569 | 7.9 | All Docs |
From this table, we can see that the top 25 terms account for roughly 8% of all word occurrences in the sample.
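That coverage figure can be read directly from the cumulative-frequency column of the table built above, for example:
# cumulative percentage of all word occurrences covered by the top 25 terms
tail(wfstat$cumfreqpct, 1)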
Now let’s look at a bi-gram and tri-gram analysis to see which word pairs and triplets occur most frequently.
options(mc.cores = 1) # on macOS, restrict tokenization to a single core
# NGramTokenizer() and Weka_control() come from the RWeka package
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bgf <- slice(createTextFrequencyDF(controlArg = list(tokenize = BigramTokenizer),
                                   corpustext = corpora, source = "All Docs"), 1:25)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tgf <- slice(createTextFrequencyDF(controlArg = list(tokenize = TrigramTokenizer),
                                   corpustext = corpora, source = "All Docs"), 1:25)
The top 25 word pairs:
| term | occurrences | cumfreqpct | source |
|---|---|---|---|
| year ago | 750 | 0.0425155 | All Docs |
| high school | 661 | 0.0799859 | All Docs |
| st loui | 457 | 0.1058920 | All Docs |
| unit state | 380 | 0.1274332 | All Docs |
| los angel | 327 | 0.1459699 | All Docs |
| san francisco | 314 | 0.1637698 | All Docs |
| past year | 251 | 0.1779983 | All Docs |
| long time | 246 | 0.1919434 | All Docs |
| good thing | 236 | 0.2053216 | All Docs |
| health care | 233 | 0.2185297 | All Docs |
| happi birthday | 228 | 0.2314545 | All Docs |
| ice cream | 207 | 0.2431887 | All Docs |
| good morn | 205 | 0.2548096 | All Docs |
| san diego | 204 | 0.2663739 | All Docs |
| spend time | 199 | 0.2776546 | All Docs |
| week ago | 198 | 0.2888787 | All Docs |
| social media | 189 | 0.2995926 | All Docs |
| mother day | 188 | 0.3102499 | All Docs |
| time year | 183 | 0.3206237 | All Docs |
| school district | 176 | 0.3306006 | All Docs |
| good luck | 173 | 0.3404075 | All Docs |
| peopl dont | 171 | 0.3501011 | All Docs |
| white hous | 169 | 0.3596812 | All Docs |
| make feel | 164 | 0.3689780 | All Docs |
| work hard | 164 | 0.3782747 | All Docs |
The top 25 word triplets:
| term | occurrences | cumfreqpct | source |
|---|---|---|---|
| presid barack obama | 93 | 0.0057291 | All Docs |
| happi mother day | 72 | 0.0101645 | All Docs |
| gov chris christi | 55 | 0.0135527 | All Docs |
| st loui counti | 48 | 0.0165096 | All Docs |
| world war ii | 46 | 0.0193434 | All Docs |
| high school student | 41 | 0.0218691 | All Docs |
| st patrick day | 36 | 0.0240868 | All Docs |
| cinco de mayo | 35 | 0.0262429 | All Docs |
| chief financi offic | 29 | 0.0280294 | All Docs |
| coupl year ago | 29 | 0.0298159 | All Docs |
| martin luther king | 28 | 0.0315408 | All Docs |
| counti prosecutor offic | 27 | 0.0332040 | All Docs |
| spend lot time | 27 | 0.0348673 | All Docs |
| coupl week ago | 26 | 0.0364690 | All Docs |
| averag point rebound | 25 | 0.0380091 | All Docs |
| chief execut offic | 25 | 0.0395492 | All Docs |
| world trade center | 25 | 0.0410892 | All Docs |
| doesnt make sens | 23 | 0.0425061 | All Docs |
| high blood pressur | 23 | 0.0439230 | All Docs |
| senior vice presid | 23 | 0.0453399 | All Docs |
| amazon servic llc | 22 | 0.0466951 | All Docs |
| beep beep beep | 22 | 0.0480504 | All Docs |
| love love love | 22 | 0.0494057 | All Docs |
| state attorney general | 22 | 0.0507609 | All Docs |
| high school graduat | 21 | 0.0520546 | All Docs |
Looking at this from a Pareto perspective, how many words does it take to get to 90%?
It takes approximately 9,737 terms to reach 90% of all word occurrences, which is 10.69 percent of the distinct terms in the sampled corpus.
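A minimal sketch of how that coverage point can be located in the cumulative-frequency table built earlier (using the wf_master data frame and its cumfreqpct column):
# first row at which the running share of word occurrences reaches 90%
rows90 <- which(wf_master$cumfreqpct >= 90)[1]
rows90                           # roughly 9737 terms in this sample
rows90 / nrow(wf_master) * 100   # share of all distinct terms, roughly 10.69%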
It’s important that these different corpora are analyzed together when building the model. Running the same analysis on each document individually shows differences in their text profiles. Look at en_US.blogs.txt versus en_US.news.txt to see which terms rise to the top.

### en_US.blogs.txt
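The per-document tables below can be produced by reusing createTextFrequencyDF on one document at a time. The exact chunk is not shown in this report; a minimal sketch (the wf_blogs name is illustrative) would be:
# subset the corpus to a single document and rebuild the frequency table
wf_blogs <- createTextFrequencyDF(corpustext = corpora[1],
                                  controlArg = list(wordLengths = c(4, 20)),
                                  source = "en_US.blogs.txt")
kable(slice(wf_blogs, 1:25), digits = 1)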
| term | occurrences | cumfreqpct | source |
|---|---|---|---|
| time | 5997 | 0.8 | en_US.blogs.txt |
| make | 4386 | 1.4 | en_US.blogs.txt |
| year | 3599 | 1.9 | en_US.blogs.txt |
| love | 3560 | 2.3 | en_US.blogs.txt |
| thing | 3489 | 2.8 | en_US.blogs.txt |
| peopl | 3415 | 3.3 | en_US.blogs.txt |
| work | 3338 | 3.7 | en_US.blogs.txt |
| dont | 3046 | 4.1 | en_US.blogs.txt |
| good | 2909 | 4.5 | en_US.blogs.txt |
| back | 2866 | 4.9 | en_US.blogs.txt |
| feel | 2219 | 5.2 | en_US.blogs.txt |
| life | 2209 | 5.5 | en_US.blogs.txt |
| book | 2180 | 5.8 | en_US.blogs.txt |
| start | 2179 | 6.1 | en_US.blogs.txt |
| week | 2153 | 6.4 | en_US.blogs.txt |
| made | 2005 | 6.6 | en_US.blogs.txt |
| live | 1908 | 6.9 | en_US.blogs.txt |
| read | 1758 | 7.1 | en_US.blogs.txt |
| find | 1752 | 7.4 | en_US.blogs.txt |
| world | 1751 | 7.6 | en_US.blogs.txt |
| friend | 1706 | 7.8 | en_US.blogs.txt |
| home | 1662 | 8.0 | en_US.blogs.txt |
| great | 1632 | 8.3 | en_US.blogs.txt |
| call | 1601 | 8.5 | en_US.blogs.txt |
| place | 1591 | 8.7 | en_US.blogs.txt |
### en_US.news.txt

| term | occurrences | cumfreqpct | source |
|---|---|---|---|
| year | 5501 | 0.7 | en_US.news.txt |
| state | 3362 | 1.2 | en_US.news.txt |
| time | 3284 | 1.6 | en_US.news.txt |
| make | 2578 | 2.0 | en_US.news.txt |
| work | 2541 | 2.3 | en_US.news.txt |
| peopl | 2439 | 2.7 | en_US.news.txt |
| game | 2323 | 3.0 | en_US.news.txt |
| school | 2288 | 3.3 | en_US.news.txt |
| citi | 2215 | 3.6 | en_US.news.txt |
| play | 2094 | 3.9 | en_US.news.txt |
| includ | 1917 | 4.1 | en_US.news.txt |
| team | 1869 | 4.4 | en_US.news.txt |
| home | 1766 | 4.6 | en_US.news.txt |
| back | 1735 | 4.9 | en_US.news.txt |
| percent | 1720 | 5.1 | en_US.news.txt |
| call | 1682 | 5.3 | en_US.news.txt |
| million | 1675 | 5.5 | en_US.news.txt |
| counti | 1658 | 5.8 | en_US.news.txt |
| season | 1647 | 6.0 | en_US.news.txt |
| week | 1631 | 6.2 | en_US.news.txt |
| start | 1618 | 6.4 | en_US.news.txt |
| compani | 1609 | 6.6 | en_US.news.txt |
| good | 1552 | 6.9 | en_US.news.txt |
| report | 1483 | 7.1 | en_US.news.txt |
| show | 1476 | 7.3 | en_US.news.txt |
### en_US.twitter.txt
| term | occurrences | cumfreqpct | source |
|---|---|---|---|
| love | 2614 | 1.1 | en_US.twitter.txt |
| good | 2138 | 2.1 | en_US.twitter.txt |
| dont | 1894 | 2.9 | en_US.twitter.txt |
| time | 1793 | 3.7 | en_US.twitter.txt |
| great | 1639 | 4.4 | en_US.twitter.txt |
| make | 1589 | 5.1 | en_US.twitter.txt |
| today | 1589 | 5.8 | en_US.twitter.txt |
| work | 1390 | 6.4 | en_US.twitter.txt |
| follow | 1314 | 6.9 | en_US.twitter.txt |
| back | 1276 | 7.5 | en_US.twitter.txt |
| peopl | 1098 | 8.0 | en_US.twitter.txt |
| happi | 1065 | 8.4 | en_US.twitter.txt |
| tonight | 1042 | 8.9 | en_US.twitter.txt |
| watch | 894 | 9.3 | en_US.twitter.txt |
| night | 885 | 9.7 | en_US.twitter.txt |
| year | 884 | 10.0 | en_US.twitter.txt |
| feel | 875 | 10.4 | en_US.twitter.txt |
| thing | 872 | 10.8 | en_US.twitter.txt |
| hope | 830 | 11.2 | en_US.twitter.txt |
| show | 829 | 11.5 | en_US.twitter.txt |
| your | 813 | 11.9 | en_US.twitter.txt |
| tweet | 804 | 12.2 | en_US.twitter.txt |
| game | 789 | 12.6 | en_US.twitter.txt |
| start | 773 | 12.9 | en_US.twitter.txt |
| week | 773 | 13.2 | en_US.twitter.txt |
It is also important to notice the difference in frequency percentages between the 1-, 2-, and 3-gram tables. The shorter the n-gram, the more frequently it occurs; as n-gram length grows, the frequency of even the top-ranking n-grams drops off sharply.
In this report, I sampled each document by starting at a random offset and reading 50,000 lines. This was done to improve performance and render the report more quickly. For building and testing the model, we will use a hold-out strategy: 60% of the data will be used to build the model, another 20% will be used to test it, and the remaining 20% will be used to validate it.
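A minimal sketch of that split, assuming the sampled lines are held in a single character vector (sampledLines is an illustrative name):
set.seed(1234)                   # hypothetical seed, for reproducibility
n   <- length(sampledLines)
idx <- sample(seq_len(n))        # shuffle the line indices
train    <- sampledLines[idx[1:floor(0.6 * n)]]
testing  <- sampledLines[idx[(floor(0.6 * n) + 1):floor(0.8 * n)]]
validate <- sampledLines[idx[(floor(0.8 * n) + 1):n]]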