Introduction

Text mining and predictive text analytics are compelling areas of study and application in the modern world of data science. Their applications are numerous, including forensics, marketing, and customer relationship management. Even mobile devices leverage text mining: on-screen keyboards predict the word a user is about to type in order to speed up text entry.

The Capstone Project for the Data Science Certificate offered by Johns Hopkins University requires students to build a text prediction model using the principles learned in the certification courses. This report is an analysis of the corpora (the text samples) provided for building the prediction algorithm. It will survey the documents, provide summary information about them, and discuss interesting findings. Additionally, it covers the sampling and modeling strategy for completing the modeling assignment.

Data Survey

First, let's look at the data we were given. After reading the directory of text samples into a corpus, we loop over the documents and retrieve their metadata:

#get the metadata for each document
metaCorpora<-lapply(corpora,meta)
print(metaCorpora)
## $en_US.blogs.txt
##   author       : character(0)
##   datetimestamp: 2016-03-19 18:44:51
##   description  : character(0)
##   heading      : character(0)
##   id           : en_US.blogs.txt
##   language     : en
##   origin       : character(0)
## 
## $en_US.news.txt
##   author       : character(0)
##   datetimestamp: 2016-03-19 18:44:51
##   description  : character(0)
##   heading      : character(0)
##   id           : en_US.news.txt
##   language     : en
##   origin       : character(0)
## 
## $en_US.twitter.txt
##   author       : character(0)
##   datetimestamp: 2016-03-19 18:44:51
##   description  : character(0)
##   heading      : character(0)
##   id           : en_US.twitter.txt
##   language     : en
##   origin       : character(0)
#get the length of each document in lines
originalLengths<-lapply(lapply(corpora,as.character),length)
print(originalLengths)
## $en_US.blogs.txt
## [1] 899288
## 
## $en_US.news.txt
## [1] 1010242
## 
## $en_US.twitter.txt
## [1] 2360148

The texts are too large to process in their entirety. Hence, we sample each document and write the output to a temporary directory. We then re-read the samples as a corpus and clean the text by lowercasing it, removing numbers, punctuation, stop words, and profanity, and stemming the remaining words. We will then continue with the analysis.

sampleAndWriteTexts <- function(dataSourcePath="data/final/en_US/en_US.blogs.txt",
                               #line to start reading from; by default pick a random spot
                               startLine=sample(1:10000,size=1,replace=T),
                               #how many lines to read
                               readvector=1000){
        con <- file(dataSourcePath,open="r")

        #skip ahead to the starting line, one line at a time
        for(i in 1:startLine){
                txtTmp <- readLines(con,1)
        }

        #now that the skip point has been reached, read the sample
        txtR <- readLines(con,n=readvector,skipNul=TRUE)

        close(con) #done reading lines, now write them out

        #write the sample to a file; [[1]][[4]] extracts the file name from the original path
        write.table(txtR,paste0("temp/",strsplit(dataSourcePath,"/")[[1]][4]),col.names=FALSE)
}

#sample 50000 lines of text from each file
#startLine is assumed to hold a randomly chosen offset set earlier (not shown above);
#the same offset is reused for all three files
sampleAndWriteTexts(dataSourcePath="data/final/en_US/en_US.blogs.txt",startLine=startLine,
                    readvector=50000)
sampleAndWriteTexts(dataSourcePath="data/final/en_US/en_US.twitter.txt",startLine=startLine,
                    readvector=50000)
sampleAndWriteTexts(dataSourcePath="data/final/en_US/en_US.news.txt",startLine=startLine,
                    readvector=50000)

(corpora <- VCorpus(DirSource("temp/"),readerControl=list(language="english")))
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
corpora<-tm_map(corpora,content_transformer(tolower))
corpora<-tm_map(corpora,removeNumbers)
corpora<-tm_map(corpora,removePunctuation)
corpora<-tm_map(corpora, removeWords, stopwords("english"))
corpora<-tm_map(corpora, removeWords, stopwords("SMART"))
corpora<-tm_map(corpora,removeWords, profanity) #removeWords comes from the tm package
corpora<-tm_map(corpora,stripWhitespace)
corpora<-tm_map(corpora,stemDocument,lazy = TRUE)


createTextFrequencyDF <- function (corpustext,controlArg,source=""){

        #create a document term matrix from the corpora for analysis
        dtm<-DocumentTermMatrix(corpustext,control=controlArg)
        #sum each term's counts across documents; each entry is a word with a number value,
        #sorted in decreasing order of frequency
        freq<-sort(colSums(as.matrix(dtm)),decreasing = TRUE)

        #build a pareto analysis of the terms
        wf<-data.frame(term=names(freq),
                      occurrences=freq,
                      cumfreqpct=cumsum((freq/sum(freq))*100),
                      source=source
        )
        return (wf)
}

Summary Statistics

Looking across all three documents in the corpus, the following table shows the 25 most frequent words (after stemming):

wf_master<-createTextFrequencyDF(controlArg = list(wordLengths=c(4, 20)),corpustext = corpora,source="All Docs")
wfstat<-slice(wf_master,1:25) #top 25
kable(wfstat,digits=1)
term occurrences cumfreqpct source
time 11074 0.6 All Docs
year 9984 1.2 All Docs
make 8553 1.7 All Docs
work 7269 2.1 All Docs
peopl 6952 2.6 All Docs
love 6940 3.0 All Docs
good 6599 3.3 All Docs
dont 6398 3.7 All Docs
back 5877 4.1 All Docs
thing 5588 4.4 All Docs
start 4570 4.6 All Docs
week 4557 4.9 All Docs
state 4498 5.2 All Docs
great 4139 5.4 All Docs
play 4109 5.7 All Docs
home 3961 5.9 All Docs
call 3957 6.1 All Docs
today 3925 6.3 All Docs
feel 3895 6.6 All Docs
game 3895 6.8 All Docs
school 3811 7.0 All Docs
show 3786 7.2 All Docs
made 3720 7.5 All Docs
life 3621 7.7 All Docs
live 3569 7.9 All Docs

From this table, we can see that the top 25 terms account for roughly 8% of all word occurrences in the sampled corpus.
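That figure comes directly from the cumulative-frequency column built by createTextFrequencyDF; a quick check, using the wfstat slice from the previous chunk:

#cumulative share (in percent) of all word occurrences covered by the top 25 terms
wfstat$cumfreqpct[25]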

Now let’s look at a bi-gram and tri-gram analysis to see which word pairs and triplets occur most frequently.

options(mc.cores=1)  #on macOS, set mc.cores to 1 so the Weka-based tokenizers run on a single core
#NGramTokenizer and Weka_control come from the RWeka package
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bgf<-slice(createTextFrequencyDF(controlArg =  list(tokenize = BigramTokenizer),corpustext = corpora,source="All Docs"),1:25)

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tgf<-slice(createTextFrequencyDF(controlArg =  list(tokenize = TrigramTokenizer),corpustext = corpora, source="All Docs"),1:25)
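The two tables below were presumably rendered the same way as the unigram table, for example:

kable(bgf, digits = 7)
kable(tgf, digits = 7)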

The top 25 word pairs

term occurrences cumfreqpct source
year ago 750 0.0425155 All Docs
high school 661 0.0799859 All Docs
st loui 457 0.1058920 All Docs
unit state 380 0.1274332 All Docs
los angel 327 0.1459699 All Docs
san francisco 314 0.1637698 All Docs
past year 251 0.1779983 All Docs
long time 246 0.1919434 All Docs
good thing 236 0.2053216 All Docs
health care 233 0.2185297 All Docs
happi birthday 228 0.2314545 All Docs
ice cream 207 0.2431887 All Docs
good morn 205 0.2548096 All Docs
san diego 204 0.2663739 All Docs
spend time 199 0.2776546 All Docs
week ago 198 0.2888787 All Docs
social media 189 0.2995926 All Docs
mother day 188 0.3102499 All Docs
time year 183 0.3206237 All Docs
school district 176 0.3306006 All Docs
good luck 173 0.3404075 All Docs
peopl dont 171 0.3501011 All Docs
white hous 169 0.3596812 All Docs
make feel 164 0.3689780 All Docs
work hard 164 0.3782747 All Docs

The top 25 word triplets

term occurrences cumfreqpct source
presid barack obama 93 0.0057291 All Docs
happi mother day 72 0.0101645 All Docs
gov chris christi 55 0.0135527 All Docs
st loui counti 48 0.0165096 All Docs
world war ii 46 0.0193434 All Docs
high school student 41 0.0218691 All Docs
st patrick day 36 0.0240868 All Docs
cinco de mayo 35 0.0262429 All Docs
chief financi offic 29 0.0280294 All Docs
coupl year ago 29 0.0298159 All Docs
martin luther king 28 0.0315408 All Docs
counti prosecutor offic 27 0.0332040 All Docs
spend lot time 27 0.0348673 All Docs
coupl week ago 26 0.0364690 All Docs
averag point rebound 25 0.0380091 All Docs
chief execut offic 25 0.0395492 All Docs
world trade center 25 0.0410892 All Docs
doesnt make sens 23 0.0425061 All Docs
high blood pressur 23 0.0439230 All Docs
senior vice presid 23 0.0453399 All Docs
amazon servic llc 22 0.0466951 All Docs
beep beep beep 22 0.0480504 All Docs
love love love 22 0.0494057 All Docs
state attorney general 22 0.0507609 All Docs
high school graduat 21 0.0520546 All Docs

Looking at this from a Pareto perspective, how many words does it take to reach 90% of all word occurrences?

It takes approximately 9737 terms to get to 90%, which is about 10.69 percent of all distinct terms in the sampled corpus.
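A sketch of how these numbers can be derived from the cumulative-frequency column, assuming the wf_master data frame built earlier:

#first row at which the cumulative frequency reaches 90%
rows90 <- min(which(wf_master$cumfreqpct >= 90))
rows90                            #approximately 9737 terms
100 * rows90 / nrow(wf_master)    #share of all distinct terms, approximately 10.69%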

Analysis and Conclusions

It is important that these corpora be analyzed together when building the model, because running the same analysis on each document individually shows clear differences in their text profiles. Compare en_US.blogs.txt with en_US.news.txt below to see which terms rise to the top of each.
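Each of the per-document tables below can be reproduced by running createTextFrequencyDF on a one-document subset of the corpus; a minimal sketch for the blogs document, assuming the corpora object built earlier:

wf_blogs <- createTextFrequencyDF(controlArg = list(wordLengths = c(4, 20)),
                                  corpustext = corpora["en_US.blogs.txt"],
                                  source = "en_US.blogs.txt")
kable(slice(wf_blogs, 1:25), digits = 1)

en_US.blogs.txt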

term occurrences cumfreqpct source
time 5997 0.8 en_US.blogs.txt
make 4386 1.4 en_US.blogs.txt
year 3599 1.9 en_US.blogs.txt
love 3560 2.3 en_US.blogs.txt
thing 3489 2.8 en_US.blogs.txt
peopl 3415 3.3 en_US.blogs.txt
work 3338 3.7 en_US.blogs.txt
dont 3046 4.1 en_US.blogs.txt
good 2909 4.5 en_US.blogs.txt
back 2866 4.9 en_US.blogs.txt
feel 2219 5.2 en_US.blogs.txt
life 2209 5.5 en_US.blogs.txt
book 2180 5.8 en_US.blogs.txt
start 2179 6.1 en_US.blogs.txt
week 2153 6.4 en_US.blogs.txt
made 2005 6.6 en_US.blogs.txt
live 1908 6.9 en_US.blogs.txt
read 1758 7.1 en_US.blogs.txt
find 1752 7.4 en_US.blogs.txt
world 1751 7.6 en_US.blogs.txt
friend 1706 7.8 en_US.blogs.txt
home 1662 8.0 en_US.blogs.txt
great 1632 8.3 en_US.blogs.txt
call 1601 8.5 en_US.blogs.txt
place 1591 8.7 en_US.blogs.txt

en_US.news.txt

term occurrences cumfreqpct source
year 5501 0.7 en_US.news.txt
state 3362 1.2 en_US.news.txt
time 3284 1.6 en_US.news.txt
make 2578 2.0 en_US.news.txt
work 2541 2.3 en_US.news.txt
peopl 2439 2.7 en_US.news.txt
game 2323 3.0 en_US.news.txt
school 2288 3.3 en_US.news.txt
citi 2215 3.6 en_US.news.txt
play 2094 3.9 en_US.news.txt
includ 1917 4.1 en_US.news.txt
team 1869 4.4 en_US.news.txt
home 1766 4.6 en_US.news.txt
back 1735 4.9 en_US.news.txt
percent 1720 5.1 en_US.news.txt
call 1682 5.3 en_US.news.txt
million 1675 5.5 en_US.news.txt
counti 1658 5.8 en_US.news.txt
season 1647 6.0 en_US.news.txt
week 1631 6.2 en_US.news.txt
start 1618 6.4 en_US.news.txt
compani 1609 6.6 en_US.news.txt
good 1552 6.9 en_US.news.txt
report 1483 7.1 en_US.news.txt
show 1476 7.3 en_US.news.txt

en_US.twitter.txt

term occurrences cumfreqpct source
love 2614 1.1 en_US.twitter.txt
good 2138 2.1 en_US.twitter.txt
dont 1894 2.9 en_US.twitter.txt
time 1793 3.7 en_US.twitter.txt
great 1639 4.4 en_US.twitter.txt
make 1589 5.1 en_US.twitter.txt
today 1589 5.8 en_US.twitter.txt
work 1390 6.4 en_US.twitter.txt
follow 1314 6.9 en_US.twitter.txt
back 1276 7.5 en_US.twitter.txt
peopl 1098 8.0 en_US.twitter.txt
happi 1065 8.4 en_US.twitter.txt
tonight 1042 8.9 en_US.twitter.txt
watch 894 9.3 en_US.twitter.txt
night 885 9.7 en_US.twitter.txt
year 884 10.0 en_US.twitter.txt
feel 875 10.4 en_US.twitter.txt
thing 872 10.8 en_US.twitter.txt
hope 830 11.2 en_US.twitter.txt
show 829 11.5 en_US.twitter.txt
your 813 11.9 en_US.twitter.txt
tweet 804 12.2 en_US.twitter.txt
game 789 12.6 en_US.twitter.txt
start 773 12.9 en_US.twitter.txt
week 773 13.2 en_US.twitter.txt

It is also important to notice the difference in frequency percentages across the uni-gram, bi-gram, and tri-gram tables: the shorter the n-gram, the more frequently it occurs, and as the n-gram length grows, the frequency of even the top-ranking n-grams drops sharply.
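One way to see this directly is to compare the cumulative-frequency share of the single most frequent entry in each table, using the wf_master, bgf, and tgf objects from earlier:

#cumulative-frequency share (in percent) of the single most frequent term at each n-gram order
c(unigram = wf_master$cumfreqpct[1],
  bigram  = bgf$cumfreqpct[1],
  trigram = tgf$cumfreqpct[1])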

Sampling and Model Building Strategy

In this document, I sampled each text by starting at a random line and reading the next 50,000 lines. This was done to improve performance and render the report more quickly. For building and testing the model, we will use a hold-out strategy: 60% of the data will be used to build the model, 20% will be used to test it, and the remaining 20% will be used to validate it.
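A minimal sketch of that split, assuming the sampled lines have been gathered into a single character vector called sampledLines (a hypothetical name):

set.seed(42)  #reproducible partition
n <- length(sampledLines)
idx <- sample(seq_len(n))                        #shuffle the line indices
trainIdx    <- idx[1:floor(0.6 * n)]
testIdx     <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]
validateIdx <- idx[(floor(0.8 * n) + 1):n]

trainSet    <- sampledLines[trainIdx]            #60% for model building
testSet     <- sampledLines[testIdx]             #20% for testing
validateSet <- sampledLines[validateIdx]         #20% for validation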