Exploratory Analysis

We’ll use the tm library to read the text documents in as corpora, using tm_map to convert everything to lower case and to remove whitespace, punctuation, and numbers. Converting the corpora to document-term matrices then lets us extract terms. Note that, to save space, we take a 2% simple random sample of the lines in each of the large text documents. I’ve printed the number of terms in each sample to give an idea of the scale of the corpora.

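readInToCorpus <- function(file_name) {
    library(tm)
    # Read every line of the file; readLines warns about embedded nuls
    # (passing skipNul = TRUE would drop them silently instead)
    con <- file(file_name)
    vec <- readLines(con)
    close(con)
    # Keep a 2% simple random sample of the lines
    samp <- sample(vec, length(vec) * 0.02)
    corp <- VCorpus(VectorSource(samp))
    # Normalise: lower case, then strip punctuation, extra whitespace, and digits
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, stripWhitespace)
    corp <- tm_map(corp, removeNumbers)
    corp
}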
#setwd("Desktop")
twit<-readInToCorpus("final/en_US/en_US.twitter.txt")
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
blog<-readInToCorpus("final/en_US/en_US.blogs.txt")
news<-readInToCorpus("final/en_US/en_US.news.txt")

twit.dtm<-DocumentTermMatrix(twit)
blog.dtm<-DocumentTermMatrix(blog)
news.dtm<-DocumentTermMatrix(news)

twit.tdm<-TermDocumentMatrix(twit)
blog.tdm<-TermDocumentMatrix(blog)
news.tdm<-TermDocumentMatrix(news)

nTerms(twit.dtm)
## [1] 39037
nTerms(blog.dtm)
## [1] 42760
nTerms(news.dtm)
## [1] 43975

Without sampling, the corpus objects are under 1GB in size; with sampling they are brought down to about 70MB.
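
Sizes like these can be checked with base R's object.size. The sketch below is only illustrative: `fake_lines` is a made-up stand-in for the lines of one of the text files, so the numbers it prints say nothing about the real corpora. It applies the same 2% sampling rate and compares the footprint before and after.

```r
# Synthetic stand-in for the lines read from a text file (assumption, not real data).
set.seed(42)
fake_lines <- replicate(5000, paste(sample(letters, 40, replace = TRUE), collapse = ""))

# Same 2% simple random sample of lines as in readInToCorpus().
samp <- sample(fake_lines, floor(length(fake_lines) * 0.02))

# Memory footprint of the full vector versus the sample.
full_size <- object.size(fake_lines)
samp_size <- object.size(samp)

print(format(full_size, units = "Kb"))
print(format(samp_size, units = "Kb"))
```

Calling object.size on the corpus objects themselves (for example `twit`) works the same way.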

The document-term matrices allow us to extract word frequencies. An example is as follows:

freq_twit<-colSums(as.matrix(twit.dtm))
ord<-order(freq_twit,decreasing=TRUE)
#store the top hundred terms and bottom hundred terms
topHundred<-freq_twit[head(ord,100)]
bottomHundered<-freq_twit[tail(ord,100)]
topHundred
##     the     you     and     for    that    with    your    have    this 
##   18305   11160    8662    7711    4570    3498    3371    3344    3314 
##     are    just    like     not     but     its     all     get     was 
##    3238    3056    2526    2487    2464    2353    2345    2278    2272 
##     out    what    love    good   about    dont    will  thanks     can 
##    2267    2260    2065    2009    1900    1818    1808    1764    1758 
##     day     now    from    know    when     one     how   great    time 
##    1699    1666    1609    1594    1585    1576    1531    1448    1422 
##   today     see    they     lol     new    some     got    more     our 
##    1418    1402    1367    1333    1323    1293    1232    1210    1205 
##   there   going     too     who    back  people    cant   think   would 
##    1197    1193    1159    1151    1127    1053    1044    1037    1037 
##    want    need    were  follow   happy     has    make    well  really 
##     982     959     945     926     924     910     908     908     898 
##   right    work tonight    much    been    come   thats     did     had 
##     882     859     854     847     843     834     829     825     821 
##   thank    them   night  should    only    here    hope     why   youre 
##     814     801     793     791     779     773     765     765     747 
##   still    last     way     her    best     off     ill     his   never 
##     741     732     724     703     701     697     675     650     650 
##    then    show    life twitter     yes    next     say  please    over 
##     646     639     638     612     609     607     607     601     600 
##  better 
##     599
bottomHundered
##                             zaxxaa                               zayn 
##                                  1                                  1 
##                           zaynster                                zbo 
##                                  1                                  1 
##                               zeal                            zealand 
##                                  1                                  1 
##                           zealands                              zebra 
##                                  1                                  1 
##                             zebras                                zed 
##                                  1                                  1 
##                             zedong                            zeitler 
##                                  1                                  1 
##                             zeldes                            zeldman 
##                                  1                                  1 
##                               zelo                              zenab 
##                                  1                                  1 
##                            zendaya                              zengo 
##                                  1                                  1 
##                               zeno                           zeppelin 
##                                  1                                  1 
##                          zeppelins                          zernalove 
##                                  1                                  1 
##                   zerospinepaincom                              zesty 
##                                  1                                  1 
##                               zeta                         zetterberg 
##                                  1                                  1 
##                                zfs                                zgt 
##                                  1                                  1 
##                               zico                           ziegfeld 
##                                  1                                  1 
##                             zigler                              zilch 
##                                  1                                  1 
##                             ziller                            zillion 
##                                  1                                  1 
##                         zillydilly                             zimmer 
##                                  1                                  1 
##                           zimmerli                         zimmermann 
##                                  1                                  1 
##                      zimmermanwhen                          zimmermen 
##                                  1                                  1 
##                               zine                         zingermans 
##                                  1                                  1 
##                            zingers                           zionists 
##                                  1                                  1 
##                             zipcar                            zipcard 
##                                  1                                  1 
##                            zipcars                            zipline 
##                                  1                                  1 
##                            zippers                            zipster 
##                                  1                                  1 
##                            ziptrip                                zit 
##                                  1                                  1 
##                               zits                             zodiac 
##                                  1                                  1 
##                               zoey                               zola 
##                                  1                                  1 
##                               zona                           zonejohn 
##                                  1                                  1 
##                              zooey                             zoogma 
##                                  1                                  1 
##                             zoomba                            zoowhat 
##                                  1                                  1 
##                               zora                             zotero 
##                                  1                                  1 
##                             zpacho                          zrevrange 
##                                  1                                  1 
##                                zro                           zubrówka 
##                                  1                                  1 
##                             zubrus                           zucchini 
##                                  1                                  1 
##                         zuckerberg                         zuckerburg 
##                                  1                                  1 
##                            zucotti                             zumbad 
##                                  1                                  1 
##           zungguzungguguzungguzeng                               zuni 
##                                  1                                  1 
##                              zwagg                         молокососы 
##                                  1                                  1 
##                           приобрёл                          ومعلمينكم 
##                                  1                                  1 
##                             ㅋㅋㅋ               ㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋ 
##                                  1                                  1 
##   カニを縦に歩かせることはできない                         こうちゃん 
##                                  1                                  1 
##                   どういたしまして               ニューヨークbusiness 
##                                  1                                  1 
## ワサビラビットが東京オフィスを開設                 寒くて起きたくない 
##                                  1                                  1 
##               欢迎我们的中国人朋友                           的部落格 
##                                  1                                  1 
##                          super                              
##                                  1                                  1 
##                                                              shat 
##                                  1                                  1 
##                                                                 
##                                  1                                  1 
##                                                                
##                                  1                                  1 
##                              got                                
##                                  1                                  1

As we can see, very common words like “the”, “you”, and “and” are by far the most frequent. “the” appears 18,305 times, roughly 1.6 times as often as the runner-up “you” and about 30 times as often as the hundredth term, “better”.

The bottom hundred are full of one-off words. Let’s count them and see what proportion of the terms in the dataset they make up:

# number of terms that occur exactly once
sum(freq_twit==1)
## [1] 23481

23,481 of the 39,037 terms, about 60%, are one-offs in the twitter dataset. This means that a model trained on it will have a lot of states with only one observed successor (loosely, “absorbing states”), where there is only one probable next state in the chain given the previous state.
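
To make the single-successor idea concrete, here is a minimal sketch of a first-order (bigram) chain over a toy sentence. The names `toy_words` and `successors` are made up for illustration, and the toy text is not drawn from the corpora; the point is only to show how one might count the states that have exactly one observed next word.

```r
# Toy text; in practice this would be tokens from a sampled corpus (assumption).
toy_words <- strsplit("the cat sat on the mat and the dog sat on the zebra", " ")[[1]]

# Pair each word with the word that follows it.
current <- head(toy_words, -1)
nxt     <- tail(toy_words, -1)

# For each state, the set of distinct next words observed after it.
successors <- lapply(split(nxt, current), unique)

# States with exactly one observed successor have a deterministic transition.
n_succ <- sapply(successors, length)
single <- names(n_succ)[n_succ == 1]

print(single)
print(mean(n_succ == 1))   # share of states with a single successor
```

In this toy example every word except “the” has a single successor, which is the kind of behaviour a large share of one-off terms would produce.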

A histogram of the terms that occur between 10 and 200 times shows how heavily the counts are concentrated towards the low end of the range:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
qplot(freq_twit[freq_twit>=10 & freq_twit<=200],geom="histogram",bins=50)

Even within the 1-10 range, terms that occur exactly once are by far the most common:

qplot(freq_twit[freq_twit<=10],geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(freq_twit[freq_twit>=5 & freq_twit<=20],geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
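
The exact counts behind histograms like these can be read off with table. A small sketch, using a made-up frequency vector `toy_freq` in place of `freq_twit` (the values are invented for illustration):

```r
# Made-up term frequencies standing in for freq_twit (assumption).
toy_freq <- c(the = 12, you = 7, zebra = 1, zeal = 1, zit = 1, zoo = 2, zen = 2, zip = 3)

# Number of terms at each frequency (the "frequency of frequencies").
freq_of_freq <- table(toy_freq)
print(freq_of_freq)

# Share of terms that occur exactly once.
prop_once <- sum(toy_freq == 1) / length(toy_freq)
print(prop_once)
```

Running the same two steps on `freq_twit` would give the numbers the histograms summarise visually.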

My plan is to build a Markov model that will be trained on different samples of the text in any of the corpora. The model will use n-gram analysis of the sentence structures. I plan to deal with unseen words by using fuzzy matching to map a string, or an n-gram of strings, to the most similar string in the available data, and to work with the transition probabilities from there.
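
As a rough illustration of the fuzzy-matching step, the sketch below maps an unseen word to its nearest known term by edit distance, using base R's adist. The vocabulary `vocab` and the helper `closest_term` are hypothetical names for this example, and edit distance is just one possible similarity measure; the eventual model may well use a different one.

```r
# Hypothetical vocabulary; in practice this would be the terms of a DTM.
vocab <- c("thanks", "tonight", "twitter", "zebra", "zucchini")

# Return the vocabulary term with the smallest edit distance to `word`.
closest_term <- function(word, vocab) {
    d <- adist(word, vocab)    # 1 x length(vocab) matrix of edit distances
    vocab[which.min(d)]        # ties resolve to the first closest term
}

print(closest_term("zuchini", vocab))
print(closest_term("twiter", vocab))
```

Once an unseen word has been mapped to a known term, the transitions recorded for that term can be used as if the known term had been typed.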