Felix P. Muga II
June 12, 2016
In this Milestone Report we shall conduct an exploratory data analysis on a data set provided by SwiftKey in Coursera-SwiftKey.zip.
We shall do our exploratory analysis on the files in the en_US subfolder, which contains 3 txt files.
These are:
en_US_blogs.txt, en_US_news.txt, and en_US_twitter.txt. The 3 text files can be accessed using the R function readLines.
readLines returns a character vector with as many elements as there are lines in the input source.
The 3 character vectors are named as follows:
en_blog_data, en_news_data, and en_twitter_data. The file size (in megabytes) of each data file can be determined using the
R function file.info with the column named size.
To compute the file size in megabytes (MB), we divide each size in bytes by \( 2^{20}=1024^2 \), since 1 megabyte is equivalent to \( 2^{20} \) bytes.
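A minimal sketch of this loading step, assuming the files sit in an en_US folder relative to the working directory (the paths are illustrative):

```r
# Read each text file line by line; the file paths below are assumptions
en_blog_data    <- readLines("en_US/en_US_blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
en_news_data    <- readLines("en_US/en_US_news.txt",    encoding = "UTF-8", skipNul = TRUE)
en_twitter_data <- readLines("en_US/en_US_twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# File size in megabytes: bytes reported by file.info, divided by 2^20
round(file.info("en_US/en_US_blogs.txt")$size / 2^20)
```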
The filesize, the count of lines, and the word count of each original data set are given in the next slide.
  source   filesize     Lines      Words
1 blog     200 MB     899,288 37,546,246
2 news     196 MB   1,010,242 34,762,395
3 twitter  159 MB   2,360,148 30,093,410
To reduce the time needed for pre-processing and cleaning the data, we shall extract a sample from each data file.
This can be done by defining an R function as shown below, with percentage = \( 5\% \) (i.e., 0.05):
sampleTexts <- function(data, percentage) {
    # number of lines to keep, rounded up
    sample.size <- ceiling(length(data) * percentage)
    # draw the sample of lines without replacement
    sample(data, sample.size, replace = FALSE)
}
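For instance, the 5% samples could have been drawn as follows; the seed value is an arbitrary choice added here for reproducibility:

```r
set.seed(1234)  # arbitrary seed so the sampling can be reproduced
blog.sample    <- sampleTexts(en_blog_data,    0.05)
news.sample    <- sampleTexts(en_news_data,    0.05)
twitter.sample <- sampleTexts(en_twitter_data, 0.05)
```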
  source            Lines     Words
1 blog sample       44,965 1,888,026
2 news sample       50,513 1,734,854
3 twitter sample   118,008 1,506,645
Each sample contains 5% of the lines of its original data file and, correspondingly, about 5% of its words.
We shall create a document corpus from the three sample data sets using the {tm} R package.
This document corpus consists of three documents:
blog.sample, news.sample, and twitter.sample. Using the function tm_map we can apply various transformations to this document corpus.
This function can remove non-essential characters such as punctuation marks and extra whitespace.
We shall also remove numbers since they are not useful at this stage of word prediction.
However, we shall not remove stopwords like “the” or “and” since these are useful in predicting the next word of a sentence.
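A sketch of the corpus construction and cleaning under the assumptions above; the exact set of transformations (e.g., whether lowercasing was applied) is not stated in the report, so this is only one plausible choice:

```r
library(tm)

# One document per sample: collapse each sample's lines into a single string
docs <- c(blog.sample    = paste(blog.sample,    collapse = " "),
          news.sample    = paste(news.sample,    collapse = " "),
          twitter.sample = paste(twitter.sample, collapse = " "))
document.corpus <- VCorpus(VectorSource(docs))

# Transformations discussed above; stopwords are deliberately kept
document.corpus <- tm_map(document.corpus, content_transformer(tolower))  # lowercasing assumed
document.corpus <- tm_map(document.corpus, removePunctuation)
document.corpus <- tm_map(document.corpus, removeNumbers)
document.corpus <- tm_map(document.corpus, stripWhitespace)
```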
After cleaning the document corpus, we can now proceed in creating three n-gram objects which we shall call as:
1) unigrams
2) bigrams
3) trigrams
According to Wikipedia:
“… an n-gram is a contiguous sequence of n items from a given sequence of text or speech.”
Thus,
1) a unigram consists of one word only;
2) a bigram is a contiguous sequence of two words; and
3) a trigram is a contiguous sequence of three words.
The n-gram objects can be used to construct an algorithm for predicting the next word in a sentence.
To create these n-gram objects we need a term document matrix which can be derived from the document corpus we have constructed.
After we have created and cleaned the document corpus we shall now construct a term document matrix using the function TermDocumentMatrix of the {tm}-R package.
The term document matrix is an \( n \times m \) matrix whose rows are the Terms (or Words) in the document corpus and whose columns are the documents, i.e., the sample data sets.
For each term and document there is a count of how often that term appears in that document.
The generated matrix is a sparse matrix: many terms occur in only one document, so the entries for such a term are 0 in the other documents.
We shall create 3 different term document matrices (tdm).
There shall be a tdm for each of the unigrams, the bigrams, and the trigrams.
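The report does not name the n-gram tokenizer used; the sketch below uses the NGramTokenizer from the RWeka package as one common option, with assumed object names tdm.unigrams, tdm.bigrams, and tdm.trigrams:

```r
library(RWeka)  # one possible n-gram tokenizer; not necessarily the one used here

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# One term document matrix per n-gram size
tdm.unigrams <- TermDocumentMatrix(document.corpus)
tdm.bigrams  <- TermDocumentMatrix(document.corpus, control = list(tokenize = BigramTokenizer))
tdm.trigrams <- TermDocumentMatrix(document.corpus, control = list(tokenize = TrigramTokenizer))
```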
<<TermDocumentMatrix (terms: 141825, documents: 3)>>
Non-/sparse entries: 209821/215654
Sparsity : 51%
Maximal term length: 109
Weighting : term frequency (tf)
<<TermDocumentMatrix (terms: 1683225, documents: 3)>>
Non-/sparse entries: 2023835/3025840
Sparsity : 60%
Maximal term length: 116
Weighting : term frequency (tf)
<<TermDocumentMatrix (terms: 3723833, documents: 3)>>
Non-/sparse entries: 3991340/7180159
Sparsity : 64%
Maximal term length: 122
Weighting : term frequency (tf)
To reduce the sparsity of each of the tdm's we shall use the function
called removeSparseTerms(x,sparse) of the {tm}-R package with sparse \( = 0.666 \).
We reduce the sparsity by removing the terms that appear in only one of the three documents.
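A sketch of this step, assuming the tdm object names used earlier:

```r
# Drop terms that appear in only one of the three documents (sparsity above 0.666)
tdm.unigrams <- removeSparseTerms(tdm.unigrams, sparse = 0.666)
tdm.bigrams  <- removeSparseTerms(tdm.bigrams,  sparse = 0.666)
tdm.trigrams <- removeSparseTerms(tdm.trigrams, sparse = 0.666)
```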
From 51% sparsity to 15% sparsity.
<<TermDocumentMatrix (terms: 44272, documents: 3)>>
Non-/sparse entries: 112268/20548
Sparsity : 15%
Maximal term length: 22
Weighting : term frequency (tf)
From 60% sparsity to 22% sparsity.
<<TermDocumentMatrix (terms: 256301, documents: 3)>>
Non-/sparse entries: 596911/171992
Sparsity : 22%
Maximal term length: 29
Weighting : term frequency (tf)
From 64% sparsity to 26% sparsity.
<<TermDocumentMatrix (terms: 218767, documents: 3)>>
Non-/sparse entries: 486274/170027
Sparsity : 26%
Maximal term length: 39
Weighting : term frequency (tf)
We find the sum of each row of a term document matrix (tdm) to determine the count of each term in the document.corpus.
We then convert each processed tdm from a matrix into a data frame.
We call these data frames df.unigrams, df.bigrams, and df.trigrams.
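A sketch of this conversion for the unigrams; the other two data frames are built analogously:

```r
# Row sums give the count of each term across the three documents
term.counts <- sort(rowSums(as.matrix(tdm.unigrams)), decreasing = TRUE)
df.unigrams <- data.frame(Terms = names(term.counts),
                          count = as.integer(term.counts),
                          stringsAsFactors = FALSE)
# df.bigrams and df.trigrams are built the same way from tdm.bigrams and tdm.trigrams
```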
We shall create a wordcloud plot for the data frame df.unigrams, using the {wordcloud} R package.
It consists of about 100 words, and the font size of a unigram (word) depends on the count of that word in the data frame.
The larger the count of a word, the larger its font size.
The word “the” is the largest word in the wordcloud since it has the largest count in the data frame df.unigrams.
This is confirmed by the bar graph of the top 20 unigrams with respect to the count.
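A sketch of how such a wordcloud can be drawn with the {wordcloud} package (the color palette is an arbitrary choice):

```r
library(wordcloud)
library(RColorBrewer)

# Word cloud of (about) the 100 most frequent unigrams; font size scales with count
wordcloud(words = df.unigrams$Terms, freq = df.unigrams$count,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```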
The wordcloud for the df.bigrams data frame is shown below.
It consists of 100 bigrams.
The bigrams “of the” and “in the” appear largest since they have the two highest counts. See the bar graph of the Top 20 bigrams below.
This is the wordcloud for the df.trigrams.
It also consists of 100 trigrams.
“one of the” is the largest trigram since it has the highest count in df.trigrams.
See the bar graph below.
The data frame df.bigrams consists of two columns. The first column is named **Terms**
and the second column is named **count**.
The Terms column consists of bigrams.
The count column gives the frequency of occurrence of the corresponding bigram.
If we modify the data frame df.bigrams by replacing the count
with the conditional probability of the occurrence of the second word of the bigram
given the occurrence of the first word of the bigram, then we obtain an interesting data frame.
This data frame will be able to quantify the conditional probability of any n-gram in the document.corpus of the sample.data.
The conditional probability of an n-gram is the probability of occurrence of the last word in the n-gram given the occurrence of its first \( n-1 \) words.
Under the bigram model, in which each word is assumed to depend only on the immediately preceding word, if the n-gram is \( w_1w_2\ldots{w_n} \), then
the probability of \( w_{n} \) given \( w_1w_2 \ldots w_{n-1} \) is given by
\[ \begin{align} P(w_n | {w_1w_2\ldots w_{n-1}}) &= P(w_n|{w_{n-1}}) \cdot P(w_{n-1}|{w_1w_2\ldots{w_{n-2}}})\\ & \vdots \\ &= P(w_n | w_{n-1}) \cdot P(w_{n-1}|w_{n-2})\cdot \ldots \cdot P(w_{2}|{w_{1}}) \end{align} \]
Thus, suppose two n-grams differ only at the last word, say
\( {w_1w_2}\ldots{w_{n-1}}\alpha \) and \( {w_1w_2}\ldots{w_{n-1}}\beta \),
where \( \alpha \neq \beta \),
such that
\[ P(\alpha|w_1w_2\ldots{w_{n-1}}) \geq P(\beta|w_1w_2\ldots{w_{n-1}}). \]
Then
\[ \begin{align} P(\alpha|w_1w_2\ldots{w_{n-1}}) &\geq P(\beta|w_1w_2\ldots{w_{n-1}})\\ P(\alpha | w_{n-1}) \cdot P(w_{n-1}|w_{n-2})\cdot \ldots \cdot P(w_{2}|{w_{1}}) & \geq P(\beta | w_{n-1}) \cdot P(w_{n-1}|w_{n-2})\cdot \ldots \cdot P(w_{2}|{w_{1}}) \\ P(\alpha | w_{n-1}) & \geq P(\beta | w_{n-1}) \end{align} \]
The last inequality is equivalent to
\[ \text{count of bigram($w_{n-1}\alpha$)} \geq \text{count of bigram($w_{n-1}\beta$)} \]
Therefore we have the biconditional statement
\[ \text{count of bigram($w_{n-1}\alpha$)} \geq \text{count of bigram($w_{n-1}\beta$)} \]
if and only if
\[ P(\alpha|w_1w_2\ldots{w_{n-1}}) \geq P(\beta|w_1w_2\ldots{w_{n-1}}) \]
This means that if the sentence being typed appears in our sample.data, then we can predict the next word by using df.bigrams alone.
So we can save memory space, since there is no need to store the trigrams and other n-grams for \( n \geq 4 \).
Based on the previous mathematical exposition, we have shown that the bigram model can be used in predicting the next word in a sentence.
We have the following prediction algorithm (a code sketch is given after the list):
1) We store the bigrams as our database with two columns, where the first column “Terms” consists of bigrams and the second column “count” is the frequency of the corresponding bigram in the sample.data.
2) If \( \alpha \) is the last word in the sentence of a mobile user, then we look for bigrams that have \( \alpha \) as the first word.
3) If there are bigrams in the database with \( \alpha \) as the first word, then we create a list of at most eight bigrams, ordered from top to bottom by their respective counts. We suggest the top one to the mobile user, but the user has the option to scroll down the list.
4) If there are no bigrams in the database with \( \alpha \) as the first word, then we select a bigram at random and suggest the first word of that bigram to the user.
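A minimal sketch of this lookup, assuming df.bigrams has the columns Terms and count described above; the helper name predictNextWord is hypothetical:

```r
# Hypothetical helper implementing the lookup described above
predictNextWord <- function(alpha, bigrams = df.bigrams, n = 8) {
  words.split <- strsplit(bigrams$Terms, " ")
  first.word  <- sapply(words.split, `[`, 1)
  matches     <- bigrams[first.word == alpha, ]
  if (nrow(matches) == 0) {
    # No bigram starts with alpha: fall back to a randomly chosen bigram's first word
    return(sample(first.word, 1))
  }
  matches <- matches[order(matches$count, decreasing = TRUE), ]
  # Second word of the top bigrams, highest count first (at most n suggestions)
  head(sapply(strsplit(matches$Terms, " "), `[`, 2), n)
}
```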
The user interface of the Shiny app shall consist of a text input box that allows a user to enter a sequence of words.
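A minimal sketch of such an interface with standard Shiny components; the input and output names are placeholders, and the server simply reuses the hypothetical predictNextWord helper sketched above:

```r
library(shiny)

# Illustrative UI: a text box for the user's sentence and a field for the suggestion
ui <- fluidPage(
  textInput("phrase", "Enter a sequence of words:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    words <- strsplit(trimws(tolower(input$phrase)), "\\s+")[[1]]
    if (length(words) == 0) return("")
    # Suggest the top candidate from the bigram lookup sketched earlier
    predictNextWord(tail(words, 1))[1]
  })
}

shinyApp(ui = ui, server = server)
```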