This report explores and summarizes the data to be used in a word prediction algorithm. The data consists of lines of text (sentences and phrases) from US blogs, news articles, and Twitter.
I will begin by loading each of the three data sets. All of the data is located in a single folder called "final". The algorithm will only be used to predict the next word for English sentences, so only the English data will be used.
# Read the raw text for each source; passing the path directly lets readLines()
# open and close the connection itself
twitter <- readLines("final/en_US/en_US.twitter.txt")
blogs <- readLines("final/en_US/en_US.blogs.txt")
news <- readLines("final/en_US/en_US.news.txt")
Warning messages will likely pop up when reading these files; those warnings have been hidden in this document.
In this section, I will provide basic summary statistics for the three data sets. To compute and view the statistics, the “quanteda” package will be used along with “ggplot2.” First, let’s look at the size of each of these files, including line counts, word counts, and character counts.
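The code that produced these summaries is not reproduced here; a minimal sketch of how they could be computed with quanteda is shown below (the sizes data frame and its column names are my own, not part of the original analysis):

# Basic size summary for each source: line, word, and character counts
# (ntoken() is quanteda's token-based word count; nchar() is base R)
library(quanteda)
sizes <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines = c(length(blogs), length(news), length(twitter)),
  words = c(sum(ntoken(blogs)), sum(ntoken(news)), sum(ntoken(twitter))),
  chars = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
)
sizes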
As we can see, the largest data set comes from the Twitter file; however, while we have more lines of Twitter data, blog lines are typically longer (containing more words). The next few graphs show the average counts in each data set.
The average number of words per line is much larger for the blog data set than for the Twitter data, whereas the average number of characters per word is relatively close for blogs and Twitter, with news sources using slightly longer words. This may indicate that the language of news articles differs substantially from that of blogs and Twitter.
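The original plotting code is not shown; a minimal sketch of one such graph (average words per line), assuming ggplot2 and an illustrative avgs data frame, could look like this:

# Average words per line by source, drawn as a bar chart with ggplot2
library(ggplot2)
avgs <- data.frame(
  source = c("blogs", "news", "twitter"),
  wordsperline = c(mean(ntoken(blogs)), mean(ntoken(news)), mean(ntoken(twitter)))
)
ggplot(avgs, aes(x = source, y = wordsperline)) +
  geom_col() +
  labs(x = "Source", y = "Average words per line")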
Next, let’s look at the most common words in each of the datasets. To do this, a little bit of cleaning must be done. Fortunately, the “quanteda” package cleans the data when putting it into a document-feature matrix (dfm). By default, the dfm() function will convert all text to lowercase, remove numbers, remove punctuation, and remove whitespace. The first step will be to convert our datasets into dfm format.
library(quanteda)
dfmblogs <- dfm(corpus(blogs))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 899,288 documents
## ... indexing features: 419,667 feature types
## ... created a 899288 x 419667 sparse dfm
## ... complete.
## Elapsed time: 131.2 seconds.
dfmnews <- dfm(corpus(news))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 77,259 documents
## ... indexing features: 95,413 feature types
## ... created a 77259 x 95413 sparse dfm
## ... complete.
## Elapsed time: 4.5 seconds.
dfmtwitter <- dfm(corpus(twitter))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,360,148 documents
## ... indexing features: 433,442 feature types
## ... created a 2360148 x 433442 sparse dfm
## ... complete.
## Elapsed time: 98.78 seconds.
Next, I plot word clouds for each of the three datasets. The word cloud prints more frequent words in a larger font. The ten most frequent words in each dataset are also listed below.
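The plotting calls themselves are not shown in this report; a minimal sketch, assuming quanteda’s textplot_wordcloud() function (packaged separately as “quanteda.textplots” in recent quanteda releases), might look like this:

# Word clouds for each source; max_words limits each cloud to the 100 most frequent terms
library(quanteda.textplots)  # only needed on newer quanteda versions
textplot_wordcloud(dfmblogs, max_words = 100)
textplot_wordcloud(dfmnews, max_words = 100)
textplot_wordcloud(dfmtwitter, max_words = 100)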
topfeatures(dfmblogs, n=10)
## the and to a of i in that is
## 1852564 1091822 1065597 897255 875025 768495 593279 459567 431569
## it
## 400374
topfeatures(dfmnews, n=10)
## the to and a of in for that is on
## 151297 69321 68262 67144 59036 51432 26963 26318 21953 20556
topfeatures(dfmtwitter, n=10)
## the to i a you and for in of is
## 936051 787154 721842 609249 546948 438149 384988 377854 359125 358642
Finally, let’s group some words. For this last section, we’ll aggregate all three sources and look for the most common unigrams, bigrams, trigrams, and quadrigrams. Since these files are quite large, I will begin by sampling 10% of each of the three sources and then performing the analysis.
newssample <- sample(news, size=round(length(news)/10,0))
blogssample <- sample(blogs, size=round(length(blogs)/10,0))
twittersample <- sample(twitter, size=round(length(twitter)/10,0))
# Combine the three sampled corpora once and reuse them for each n-gram size
combinedsample <- c(corpus(newssample), corpus(blogssample), corpus(twittersample))
unigrams <- dfm(combinedsample, ngrams=1)
bigrams <- dfm(combinedsample, ngrams=2)
trigrams <- dfm(combinedsample, ngrams=3)
quadrigrams <- dfm(combinedsample, ngrams=4)
topunis <- topfeatures(unigrams,5)
topbis <- topfeatures(bigrams,5)
toptris <- topfeatures(trigrams,5)
topquads <- topfeatures(quadrigrams,5)
As expected, the most common unigrams occur more frequently than the most common quadrigrams (if this were not the case, there would be a serious flaw in the procedure). A data set containing the frequencies/probabilities of each of these n-grams will be the foundation of the final algorithm.
The final project will be a word predictor that uses unigrams, bigrams, trigrams, and even quadrigrams. After the user inputs a phrase, it will search through all the quadrigrams for one whose first three words match the last three words of the phrase. If none can be matched, it will search through all the trigrams and attempt to match the last two words of the phrase to the first two words of a trigram. If this fails, it will search the bigrams and match the last word of the phrase to the first word of a bigram. Finally, if all else fails, it will simply choose the highest-probability unigram. This model has some serious flaws, but this is where I plan to start.
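As a rough illustration only, the backoff lookup might look like the sketch below. The lookup tables (quadtable, tritable, bitable) and their prefix/nextword/freq columns are hypothetical structures assumed to be built from the n-gram frequencies above; they do not appear anywhere in this report’s code.

# Minimal backoff sketch. Assumes each n-gram count has been reshaped into a data
# frame with columns "prefix" (all but the last word), "nextword" (the last word),
# and "freq" (its count); these tables are hypothetical.
predictnext <- function(phrase, quadtable, tritable, bitable, topunigram) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  lookup <- function(tbl, n) {
    if (length(words) < n) return(character(0))
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) == 0) character(0) else hits$nextword[which.max(hits$freq)]
  }
  result <- lookup(quadtable, 3)                         # match last 3 words to quadrigrams
  if (length(result) == 0) result <- lookup(tritable, 2) # back off to trigrams
  if (length(result) == 0) result <- lookup(bitable, 1)  # then bigrams
  if (length(result) == 0) result <- topunigram          # finally the most frequent unigram
  result
}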