Data Science Milestone Report


Unable to load and manipulate the entire data set on my machine, I wrote a function that samples 10,000 random lines of text from each of the three files: 'en_US.blogs.txt', 'en_US.news.txt', and 'en_US.twitter.txt'. The sample corpus returned by this function is what will be analyzed.
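The sampled corpus might be built roughly as follows; this is a sketch, and the function name, seed, and file paths are my own assumptions rather than the exact code used in this report.

```r
# Sketch: read one file and keep 10,000 random lines from it.
sample_file <- function(path, n = 10000, seed = 1234) {
  set.seed(seed)                                           # reproducible sample
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

# A list with one character vector per file: blogs, news, twitter.
corp <- lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
               sample_file)
```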

BASIC LINE AND WORD COUNTS

The first file in the corpus is 'en_US.blogs.txt'. It can be accessed with 'corp[[1]][1:5]', where 1 is its position in the corpus and '[1:5]' returns the first five blog entries. First, we can look at the range of words per entry. Below are the word counts for the six entries with the most words, followed by the six with the fewest words in the blogs file.

## [1] 680 587 467 438 398 392
## [1] 1 1 1 1 1 1
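A sketch of how these per-entry word counts might be computed, assuming 'corp' is the list of character vectors sketched above (the actual code may differ):

```r
# Words per blog entry: split each entry on whitespace and count the pieces.
words_per_entry <- sapply(strsplit(corp[[1]], "\\s+"), length)

head(sort(words_per_entry, decreasing = TRUE))  # six entries with the most words
head(sort(words_per_entry))                     # six entries with the fewest words
```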

Next, we can sum the number of words per file: first the blogs file, then the news file, and lastly the Twitter file.

## [1] 406247
## [1] 331676
## [1] 331676
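Summing those per-entry counts for each element of 'corp' gives the per-file totals, roughly like this (same assumptions as above):

```r
# Total words per file: blogs, news, twitter.
sapply(corp, function(doc) sum(sapply(strsplit(doc, "\\s+"), length)))
```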

MOST COMMON WORDS IN THE CORPUS

For a quick visual representation of the most common words in the corpus, I will create a word cloud. Note: English stopwords, which are commonly used words, have been removed to give a general flavor of the more distinctive words found in the entire corpus. Below is a sample of the stopwords, followed by the visual:

##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"

[Figure: word cloud of the most common terms in the sampled corpus, English stopwords removed]
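A word cloud along these lines could be produced with the 'tm', 'wordcloud', and 'RColorBrewer' packages; the sketch below shows one plausible approach, and the preprocessing choices and object names are assumptions rather than the exact code behind the figure.

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Build a tm corpus from the sampled text, lower-case it, strip punctuation,
# and remove English stopwords before counting terms.
vc  <- VCorpus(VectorSource(unlist(corp)))
vc  <- tm_map(vc, content_transformer(tolower))
vc  <- tm_map(vc, removePunctuation)
vc  <- tm_map(vc, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(vc)

# Term frequencies across the whole sample, most frequent first.
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freq), freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```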

FREQUENCY OF FREQUENCIES

Below is a barplot describing the distribution of word frequencies.

[Figure: barplot of the distribution of word frequencies]

The barplot above shows that over 25,000 terms occur only once.

[Figure: barplot of terms that occur thousands of times]

This second barplot reveals that there are many words, most likely stopwords, that occur thousands of times.
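A frequency-of-frequencies table like the one behind these barplots can be built from the 'freq' vector in the word-cloud sketch above; again, this is illustrative rather than the exact plotting code.

```r
# How many terms occur once, twice, three times, and so on.
freq_of_freq <- table(freq)

freq_of_freq["1"]   # number of terms that occur exactly once

# Low end of the distribution: terms occurring 1 through 10 times.
barplot(head(freq_of_freq, 10),
        xlab = "Times a term occurs", ylab = "Number of terms")
```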

MOST FREQUENT TERMS

Below are the 20 most frequent terms:

##   the   and  that   for   you  with   was  this  have   but   are  from 
## 43731 22787  9321  9030  6534  6460  5797  4757  4465  4264  4130  3543 
##   not   its  they  said   his   all  will about 
##  3456  3006  2899  2886  2795  2708  2655  2495

As you can see, most of these terms are stopwords. Below are the least frequent terms:

##                                                \U0001f602lol 
##                                                            1 
##                     \U0001f602\U0001f602\U0001f602\U0001f44e 
##                                                            1 
## \U0001f60d\U0001f618\U0001f48f\U0001f491\U0001f48b\U0001f48d 
##                                                            1 
##                               \U0001f62d\U0001f62d\U0001f62d 
##                                                            1 
##                     \U0001f62d\U0001f62d\U0001f62d\U0001f62d 
##                                                            1

The least frequent "words" aren't even words. For my modeling, I will use only the most frequent terms, because the infrequent terms are likely not real words at all, or are so rare that they will be useless for prediction.
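Both ends of the frequency distribution can be read off a sorted term-frequency vector; the sketch below builds one with stopwords kept (since the top terms shown above are stopwords) and is an assumption about how these lists might be produced, not the report's exact code.

```r
# Term frequencies with stopwords kept (names here are illustrative).
vc_all   <- VCorpus(VectorSource(unlist(corp)))
vc_all   <- tm_map(vc_all, content_transformer(tolower))
vc_all   <- tm_map(vc_all, removePunctuation)
freq_all <- sort(rowSums(as.matrix(TermDocumentMatrix(vc_all))), decreasing = TRUE)

head(freq_all, 20)   # the 20 most frequent terms (mostly stopwords)
tail(freq_all, 5)    # the least frequent terms (often emoji or junk tokens)
```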

PREDICTION ALGORITHM

I plan to use 3-grams for prediction. I think Good-Turing smoothing paired with a backoff model will be the most efficient approach: if I can't find an entry in the trigram table, I'll back off to a bigram, and then to a unigram if need be. A minimal sketch of that backoff lookup appears below.

There are a few problems I'm not quite sure about. It appears that some people are processing n-grams from the command line with bash, which supposedly frees up memory and is faster. Using the 'tm' package coupled with 'RWeka' might be another possible solution, although from reading the forums it appears to be inefficient for large amounts of text, which leads to the second problem: how to process all of the files. Perhaps this is where command-line processing will come in more useful, but unfortunately I don't have a background in computer science and am not very familiar with these techniques. Also, while I can follow the Stanford NLP lectures and understand concepts such as Markov chains, I don't know how to actually go about building the algorithm: some people are using lookup tables, and some plan to calculate probabilities on the fly. I'd appreciate any direction or hints on how to proceed with these next steps.
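The sketch below illustrates the trigram-to-bigram-to-unigram backoff lookup. It leaves out Good-Turing smoothing, and the table layout (data frames with 'prefix', 'word', and 'count' columns) and all names are assumptions for illustration, not the final implementation.

```r
# Illustrative backoff lookup: given the last two words typed, return the
# most likely next word from trigram, bigram, or unigram count tables.
predict_next <- function(prefix2, prefix1, trigrams, bigrams, unigrams) {
  # 1. Try the trigram table: match on the last two words.
  hit <- trigrams[trigrams$prefix == paste(prefix2, prefix1), ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$count)])

  # 2. Back off to the bigram table: match on the last word only.
  hit <- bigrams[bigrams$prefix == prefix1, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$count)])

  # 3. Fall back to the single most frequent unigram.
  unigrams$word[which.max(unigrams$count)]
}
```

In the full model, the raw counts would be replaced by Good-Turing-discounted probabilities before ranking candidate words.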