Unable to load and manipulate the entire data set on my machine, I wrote a function that samples 10,000 random lines of text from each of the three files: 'en_US.blogs.txt', 'en_US.news.txt' and 'en_US.twitter.txt'. The sample corpus produced by this function is what will be analyzed.
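A minimal sketch of this sampling step might look like the following (this is an assumed reconstruction, not the exact code used; the seed and file paths are placeholders). Here 'corp' is simply a list of three character vectors, one per source file:

```r
# Sketch of the sampling step (assumed): read each file and keep 10,000
# random lines, so 'corp' becomes a list of three character vectors.
set.seed(123)  # placeholder seed for reproducibility
sample_lines <- function(path, n = 10000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}
corp <- lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
               sample_lines)
```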
The first file in the corpus is 'en_US.blogs.txt'. It can be accessed with 'corp[[1]][1:5]', where 1 is its place in the corpus and '[1:5]' returns the first five blog entries. First, we can look at the range in words per entry. Below are the word counts of the six blog entries with the most words, followed by the six with the fewest.
## [1] 680 587 467 438 398 392
## [1] 1 1 1 1 1 1
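One way to produce these counts is with qdap's word_count() on the sampled blog lines (a sketch of the idea; the exact method may differ):

```r
# Sketch: word counts per blog entry, then the extremes at each end.
library(qdap)
wc_blogs <- word_count(corp[[1]])
head(sort(wc_blogs, decreasing = TRUE))  # six entries with the most words
head(sort(wc_blogs))                     # six entries with the fewest words
```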
Next, we can sum the number of words per file: first the blogs file, then the news file, and lastly the Twitter file.
## [1] 406247
## [1] 331676
## [1] 331676
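A one-line sketch of the per-file totals, building on the objects above:

```r
# Total words per file (sketch): blogs, news, then twitter.
sapply(corp, function(x) sum(word_count(x), na.rm = TRUE))
```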
For a quick visual representation of the most common words in the corpus, I will create a word cloud. Note: English stopwords, which are commonly used words, have been removed to provide a general flavor of the more distinctive words found in the entire corpus. Below is an example of stopwords, followed by the visual:
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
Below is a barplot showing the distribution of word frequencies.
The barplot above shows that over 25,000 terms occur only once.
This second barplot reveals that there are many words, most likely stopwords, that occur thousands of times.
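The barplots come from a frequency-of-frequencies table over the full corpus, stopwords kept this time. A sketch, reusing the 'corp_tm' object assumed above:

```r
# Sketch: how many terms occur once, twice, and so on.
freq <- sort(rowSums(as.matrix(TermDocumentMatrix(corp_tm))),
             decreasing = TRUE)
freq_of_freq <- table(freq)
barplot(head(freq_of_freq, 20),
        xlab = "Times a term appears in the corpus",
        ylab = "Number of terms")
```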
Below are the 20 most frequent terms:
## the and that for you with was this have but are from
## 43731 22787 9321 9030 6534 6460 5797 4757 4465 4264 4130 3543
## not its they said his all will about
## 3456 3006 2899 2886 2795 2708 2655 2495
As you can see, most of these terms are stopwords. Below are the least frequent terms:
## \U0001f602lol
## 1
## \U0001f602\U0001f602\U0001f602\U0001f44e
## 1
## \U0001f60d\U0001f618\U0001f48f\U0001f491\U0001f48b\U0001f48d
## 1
## \U0001f62d\U0001f62d\U0001f62d
## 1
## \U0001f62d\U0001f62d\U0001f62d\U0001f62d
## 1
The least frequent "words" aren't even words. For my modeling, I will use only the most frequent terms, because the infrequent terms are likely not real words, or are words so rare that they will be useless for prediction purposes.
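Pruning the vocabulary could be as simple as a frequency cutoff (the threshold of 4 below is an arbitrary placeholder, not a chosen value):

```r
# Sketch: keep only terms that appear at least a few times.
freq_keep <- freq[freq >= 4]
length(freq_keep)  # size of the vocabulary retained for modeling
```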
I plan to use 3-grams for my prediction purposes. I think the most efficient approach will be Good-Turing smoothing paired with a backoff model: if I can't find an entry in the trigram list, I back off to a bigram and then to a unigram if need be. There are a few problems I'm not quite sure about. It appears that some people are processing n-grams from the terminal with bash; supposedly this frees up memory and is faster. Using the 'tm' package coupled with 'RWeka' might be another possible solution, although from reading the forums it appears to be inefficient for processing large amounts of text, which leads to the second problem: how to deal with processing all of the files. Perhaps this is where command-line processing will come in more useful, but unfortunately I don't have a background in computer science and am not very familiar with these techniques. Also, while I can follow the Stanford NLP lectures and understand concepts such as Markov chains, I don't know how to actually go about building the algorithm. Some people are using lookup tables and some are going to calculate probabilities on the spot. I'd appreciate any direction or hints on how to proceed with these subsequent steps. A rough sketch of how the tm + RWeka route could start is shown below.
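This sketch only illustrates a simple backoff-style lookup over raw trigram counts, not Good-Turing smoothing, and all object and function names here are my own assumptions:

```r
# Sketch: trigram counts via tm + RWeka, plus a naive lookup that falls back
# to NA (where a bigram/unigram table would take over in a full model).
library(RWeka)

trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(corp_tm, control = list(tokenize = trigram_tok))
tri_freq <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)

# Given the two preceding words, return the final word of the most frequent
# matching trigram, or NA so the caller can back off.
predict_next <- function(w1, w2) {
  hits <- grep(paste0("^", w1, " ", w2, " "), names(tri_freq), value = TRUE)
  if (length(hits) > 0) strsplit(hits[1], " ")[[1]][3] else NA_character_
}

predict_next("one", "of")
```

Whether counts like 'tri_freq' are precomputed into a lookup table or probabilities are calculated on the fly is exactly the design decision I still need to settle.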