We’ll use the tm library to read the text documents in as corpora, using tm_map to transform everything to lower case and to remove punctuation, extra whitespace, and numbers. Converting to Document-Term Matrices then lets us extract terms. Note that, for the sake of space, we take a simple random sample (about 2% of the lines) from the large text files. I’ve printed the number of terms in each sample to give an idea of the scale of the corpora.
readInToCorpus<-function(file_name){
  library(tm)
  # read the raw lines, then take a ~2% simple random sample to keep memory use down
  con<-file(file_name)
  vec<-readLines(con)
  close(con)
  samp<-sample(vec,floor(length(vec)*0.02))
  # build a corpus and normalize: lower case, no punctuation,
  # collapsed whitespace, no numbers
  corp<-VCorpus(VectorSource(samp))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, removeNumbers)
  corp
}
#setwd("Desktop")
twit<-readInToCorpus("final/en_US/en_US.twitter.txt")
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
blog<-readInToCorpus("final/en_US/en_US.blogs.txt")
news<-readInToCorpus("final/en_US/en_US.news.txt")
twit.dtm<-DocumentTermMatrix(twit)
blog.dtm<-DocumentTermMatrix(blog)
news.dtm<-DocumentTermMatrix(news)
twit.tdm<-TermDocumentMatrix(twit)
blog.tdm<-TermDocumentMatrix(blog)
news.tdm<-TermDocumentMatrix(news)
nTerms(twit.dtm)
## [1] 39037
nTerms(blog.dtm)
## [1] 42760
nTerms(news.dtm)
## [1] 43975
The corpus files are just under 1 GB in size when not sampled; with sampling they are brought down to about 70 MB.
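As a rough sanity check (assuming the same file paths and objects as above), we could compare the on-disk size of the full Twitter file with the in-memory size of the sampled corpus:
# size of the full file on disk, in MB
file.size("final/en_US/en_US.twitter.txt")/1024^2
# approximate in-memory size of the 2% sampled corpus object
format(object.size(twit), units="Mb")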
The document-term matrices let us extract term frequencies; an example follows:
freq_twit<-colSums(as.matrix(twit.dtm))
ord<-order(freq_twit,decreasing=TRUE)
#store the hundred most frequent and hundred least frequent terms
topHundred<-freq_twit[head(ord,100)]
bottomHundred<-freq_twit[tail(ord,100)]
topHundred
## the you and for that with your have this
## 18305 11160 8662 7711 4570 3498 3371 3344 3314
## are just like not but its all get was
## 3238 3056 2526 2487 2464 2353 2345 2278 2272
## out what love good about dont will thanks can
## 2267 2260 2065 2009 1900 1818 1808 1764 1758
## day now from know when one how great time
## 1699 1666 1609 1594 1585 1576 1531 1448 1422
## today see they lol new some got more our
## 1418 1402 1367 1333 1323 1293 1232 1210 1205
## there going too who back people cant think would
## 1197 1193 1159 1151 1127 1053 1044 1037 1037
## want need were follow happy has make well really
## 982 959 945 926 924 910 908 908 898
## right work tonight much been come thats did had
## 882 859 854 847 843 834 829 825 821
## thank them night should only here hope why youre
## 814 801 793 791 779 773 765 765 747
## still last way her best off ill his never
## 741 732 724 703 701 697 675 650 650
## then show life twitter yes next say please over
## 646 639 638 612 609 607 607 601 600
## better
## 599
bottomHundred
## zaxxaa zayn
## 1 1
## zaynster zbo
## 1 1
## zeal zealand
## 1 1
## zealands zebra
## 1 1
## zebras zed
## 1 1
## zedong zeitler
## 1 1
## zeldes zeldman
## 1 1
## zelo zenab
## 1 1
## zendaya zengo
## 1 1
## zeno zeppelin
## 1 1
## zeppelins zernalove
## 1 1
## zerospinepaincom zesty
## 1 1
## zeta zetterberg
## 1 1
## zfs zgt
## 1 1
## zico ziegfeld
## 1 1
## zigler zilch
## 1 1
## ziller zillion
## 1 1
## zillydilly zimmer
## 1 1
## zimmerli zimmermann
## 1 1
## zimmermanwhen zimmermen
## 1 1
## zine zingermans
## 1 1
## zingers zionists
## 1 1
## zipcar zipcard
## 1 1
## zipcars zipline
## 1 1
## zippers zipster
## 1 1
## ziptrip zit
## 1 1
## zits zodiac
## 1 1
## zoey zola
## 1 1
## zona zonejohn
## 1 1
## zooey zoogma
## 1 1
## zoomba zoowhat
## 1 1
## zora zotero
## 1 1
## zpacho zrevrange
## 1 1
## zro zubrówka
## 1 1
## zubrus zucchini
## 1 1
## zuckerberg zuckerburg
## 1 1
## zucotti zumbad
## 1 1
## zungguzungguguzungguzeng zuni
## 1 1
## zwagg молокососы
## 1 1
## приобрёл ومعلمينكم
## 1 1
## ㅋㅋㅋ ㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋ
## 1 1
## カニを縦に歩かせることはできない こうちゃん
## 1 1
## どういたしまして ニューヨークbusiness
## 1 1
## ワサビラビットが東京オフィスを開設 寒くて起きたくない
## 1 1
## 欢迎我们的中国人朋友 的部落格
## 1 1
As we can see, very common words like “the”, “and”, and “that” are by far the most frequent, with “the” alone appearing more than 10x as often as most of the other words in the top hundred.
The bottom hundred is full of one-off words. Let’s see what proportion of the terms in the dataset appear only once:
length(freq_twit[freq_twit==1])
## [1] 23481
length(freq_twit[freq_twit==1])/nTerms(twit.dtm)
## [1] 0.6015063
Roughly 60% of the terms in the Twitter sample appear only once. This means that a model trained on it will have a lot of “absorbing states”, where there is only one observed next state in the chain given the previous state.
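To illustrate the absorbing-state problem, here is a toy sketch (made-up sentences, not the actual corpus) of a bigram transition table; any word that occurs only once has exactly one observed successor, so a chain trained on these counts can only move one way from it:
# toy illustration: bigram transition counts from two made-up sentences
toy<-c("the cat sat on the mat","the dog sat on the rug")
bigrams<-do.call(rbind,lapply(strsplit(toy," "),function(w){
  data.frame(from=head(w,-1),to=tail(w,-1))
}))
# "cat" and "dog" each occur once, so each has exactly one observed
# next word ("sat"); "the" by contrast has several possible successors
table(bigrams$from,bigrams$to)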
A histogram of the term frequencies between 10 and 200 shows just how heavily the data is concentrated towards the low end:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
qplot(freq_twit[freq_twit>=10 & freq_twit<=200],geom="histogram",bins=50)
We can see that even within the 1-10 range, terms that appear only once are by far the most common:
qplot(freq_twit[freq_twit<=10],geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(freq_twit[freq_twit>=5 & freq_twit<=20],geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
My plan is to build a Markov model trained on different samples of the text from any of the corpora, using n-gram analysis of the sentence structure. I plan to deal with unseen words by using fuzzy matching to map a string (or an n-gram of strings) to the most similar string in the available data, and to work with the transition probabilities from there.
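As a very rough sketch of the fuzzy-matching idea (a minimal illustration only, not the final model; it assumes the freq_twit vector from above and uses base R’s adist for edit distance):
# minimal sketch: map an unseen word to the closest known term by edit
# distance, as a stand-in for looking it up in the eventual n-gram model
closestKnownTerm<-function(word,vocabulary){
  d<-adist(word,vocabulary)     # Levenshtein edit distances
  vocabulary[which.min(d)]      # nearest known term
}
vocab<-names(freq_twit)
closestKnownTerm("thankss",vocab)  # should land on something like "thanks"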