The goal of this exercise is to do some initial data exploration of the three datasets provided.
The three files all contain natural-language text and will be used to develop a model that predicts the next word when a user enters a two- or three-word phrase.
Open the three input files and do some basic exploratory work on the full text of each.
News:
Blogs:
Twitter:
Analysis: These files are large both in storage/memory footprint and in their number of lines and words. For initial exploration and testing I will use a sample of 1,000 lines from each file.
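Below is a minimal sketch of that sampling step (the file names come from the outputs later in this report; the seed and the helper function are mine and purely illustrative):

```r
# Read each raw file and keep a random sample of 1,000 lines for exploration.
set.seed(1234)  # arbitrary seed, just to make the sample reproducible

sample_lines <- function(path, n = 1000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

blogs_sample   <- sample_lines("en_US.blogs.txt")
news_sample    <- sample_lines("en_US.news.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")
```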
There are a number of cleanup actions that need to be taken to make the data easier to work with, such as lower-casing the text and removing punctuation. This pre-processing allows for easier data manipulation later on.
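A sketch of the corpus construction and cleanup with the tm package follows; the exact set of transformations is my assumption, inferred from the lower-cased, punctuation-free terms that show up in the outputs below:

```r
library(tm)

# Write the samples to a scratch directory so DirSource() keeps the file names
# as document IDs (matching the en_US.*.txt labels in the term matrix below).
dir.create("sample", showWarnings = FALSE)
writeLines(blogs_sample,   "sample/en_US.blogs.txt")
writeLines(news_sample,    "sample/en_US.news.txt")
writeLines(twitter_sample, "sample/en_US.twitter.txt")

corpus <- Corpus(DirSource("sample", encoding = "UTF-8"))

# Standard tm cleanup: lower-case, strip punctuation and numbers, squeeze whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

corpus
```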
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
Next, create a document-term matrix. A document-term matrix tabulates every term in the corpus and its frequency in each document (along with some metadata).
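A minimal sketch of building and inspecting it, assuming the cleaned corpus from above:

```r
# Build the document-term matrix and inspect it; inspect() prints the summary
# plus a sample of the most frequent terms, as shown below.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```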
## <<DocumentTermMatrix (documents: 3, terms: 13069)>>
## Non-/sparse entries: 18362/20845
## Sparsity : 53%
## Maximal term length: 74
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs and but for have that the this was with you
## en_US.blogs.txt 1234 225 399 244 556 2062 267 317 363 312
## en_US.news.txt 801 175 386 139 337 1888 99 204 253 107
## en_US.twitter.txt 182 51 145 58 107 420 71 55 52 218
## [1] "The number of words in each document in the corpus:"
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## 32337 26859 9552
## [1] "The top 25 words in the corpus are:"
## the and that for with you was but have this not are from said his
## 4370 2217 1000 930 668 637 576 451 441 437 371 367 327 304 290
## all they its one will like what just out more
## 279 278 275 273 260 255 255 252 245 242
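These counts can be read straight off the document-term matrix; a short sketch (the variable names are mine):

```r
# Per-document word counts and the overall top 25 terms, computed from the DTM.
m <- as.matrix(dtm)

print("The number of words in each document in the corpus:")
print(rowSums(m))

print("The top 25 words in the corpus are:")
term_freq <- sort(colSums(m), decreasing = TRUE)
print(head(term_freq, 25))
```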
A graphical representation of the top 25 terms across the entire corpus:
## [1] "Below is a graph of the top 25 words used in the corpus:"
## [1] "The wordcloud below is another representation of the words frequency:"
It is interesting to note that “the” and “and” are the two most frequently appearing words. Unfortunately, “the” and “and” are not very helpful in our prediction model, and the tm package provides an easy way to remove such “stopwords”.
Stopwords are common words that do not provide much useful information when creating n-grams. As such, they will be removed from the corpus, a new document-term matrix will be built, and a new top-25 list will be produced for comparison.
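A sketch of the stopword removal and the rebuilt document-term matrix, assuming tm's standard English stopword list:

```r
# Drop the standard English stopwords, tidy the whitespace, and rebuild the DTM.
corpus_nostop <- tm_map(corpus, removeWords, stopwords("english"))
corpus_nostop <- tm_map(corpus_nostop, stripWhitespace)

dtm_nostop <- DocumentTermMatrix(corpus_nostop)
inspect(dtm_nostop)

# New top 25 after stopword removal, for comparison with the earlier list.
term_freq_nostop <- sort(colSums(as.matrix(dtm_nostop)), decreasing = TRUE)
head(term_freq_nostop, 25)
```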
## <<DocumentTermMatrix (documents: 3, terms: 12971)>>
## Non-/sparse entries: 18085/20828
## Sparsity : 54%
## Maximal term length: 74
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can day get just like new one said time will
## en_US.blogs.txt 119 73 88 131 142 72 148 34 114 117
## en_US.news.txt 46 31 46 59 55 80 73 259 55 106
## en_US.twitter.txt 51 44 40 62 58 36 52 11 32 37
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## 21293 19122 6916
## [1] "The top 25 words in the corpus after the standard stopwords are removed are below:"
## said one will like just can time new get day
## 304 273 260 255 252 216 201 188 174 148
## know year now first good people much two make also
## 146 146 140 135 134 125 123 115 112 110
## love dont well see last
## 110 104 103 102 99
## [1] "Below is a histogram of the top 25 words after having removed the stopwords."
## [1] "The wordcloud below is another representation of the words frequency:"
This data is much more informative! It seems to make sense that “said” is the top word in terms of frequency given that the corpus includes news.
Now on to the n-gram work: creating bigrams (with trigrams and quadgrams to follow). I tried 5-grams and 6-grams, but they did not produce anything very useful considering the time it took to create them.
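Here is a sketch of the bigram step using the ngram package, which the summary and phrasetable output below are consistent with; the exact preprocessing chain is my assumption:

```r
library(ngram)
library(tm)

# Collapse the three samples into one string, do light cleanup, and drop
# stopwords before building the n-grams.
full_text <- concatenate(blogs_sample, news_sample, twitter_sample)
full_text <- preprocess(full_text, case = "lower", remove.punct = TRUE)
full_text <- removeWords(full_text, stopwords("english"))
full_text <- stripWhitespace(full_text)

string.summary(full_text)            # character / word / sentence counts

ng2 <- ngram(full_text, n = 2)       # bigrams; use n = 3 or 4 for tri-/quadgrams
print(ng2, output = "truncated")     # truncated listing of the bigrams

phrasetable <- get.phrasetable(ng2)  # data.frame of ngrams, freq, prop
str(phrasetable)
```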
## Chars: 330304
## Letters: 282004
## Whitespace: 48300
## Punctuation: 0
## Digits: 0
## Words: 48301
## Sentences: 0
## Lines: 1
## Wordlens: 0 969 2799 3320 4242 4505 6066 7319 8459 10621
## 1 1 1 1 1 1 1 1 1 1
## Senlens: 0
## 10
## Syllens: 0 2 4 66 549 2766 8049 17437 19013
## 1 2 1 1 1 1 1 1 1
## [1] "Below is a truncated listing of the bigrams for the corpus:"
## come officials | 1
## thought {1} |
##
## mother tausha | 1
## cram {1} |
##
## users need | 1
## stay {1} |
##
## bit song | 1
## called {1} |
##
## wearing business | 1
## wear {1} |
##
## [[ ... results truncated ... ]]
## 'data.frame': 45900 obs. of 3 variables:
## $ ngrams: chr "new york " "year old " "last year " "last night " ...
## $ freq : int 26 25 19 16 15 13 12 12 12 11 ...
## $ prop : num 0.000538 0.000518 0.000393 0.000331 0.000311 ...
## [1] "Below is a plot of the top 25 bigrams for the corpus based upon the phrasetable:"
Analysis
The bigrams listed make sense: New York, year old, last year, last night, etc. These are all phrases that are usually found together, so I must be on the right track!
Future work: The three files provide a substantial amount of data, and using the entire corpus for training is not feasible given the limitations of my laptop. As such, a sample of the data will be created, balancing memory usage against the amount of data in the training set. The long-term plan is to develop the 2-, 3-, and 4-grams and store them in files that the future prediction application will read; this is the only way the application's response time can be kept reasonable for users.
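As a rough sketch of that plan (the file names and the RDS format are my assumptions), the n-gram tables could be precomputed once and then simply loaded by the prediction app:

```r
# Precompute the 2-, 3-, and 4-gram frequency tables and persist them so the
# prediction application only has to read the saved files at startup.
for (n in 2:4) {
  ng <- ngram(full_text, n = n)
  saveRDS(get.phrasetable(ng), file = sprintf("ngram_%d.rds", n))
}

# Later, in the prediction application:
bigrams <- readRDS("ngram_2.rds")
```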
Thanks!
John McConnell