A Milestone Report: An Exploratory Data Analysis

Felix P. Muga II
June 12, 2016

Introduction

In this Milestone Report we shall conduct an exploratory data analysis on a data set provided by SwiftKey in Coursera-SwiftKey.zip.

We shall do our exploratory analysis on the files in the en_US subfolder, which contains 3 txt files.

These are:

  • en_US_blogs.txt,
  • en_US_news.txt, and
  • en_US_twitter.txt.

Reading the file

The 3 text files can be accessed using the R function readLines.

readLines returns a character vector with as many elements as there are lines in the input source.

The 3 character vectors are named as follows:

  • en_blog_data,
  • en_news_data, and
  • en_twitter_data.
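
A minimal reading sketch is given below; the file paths and the encoding and skipNul settings are assumptions, not taken from the report.

# paths are assumed; adjust to where the en_US files are stored
en_blog_data    <- readLines("en_US/en_US_blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
en_news_data    <- readLines("en_US/en_US_news.txt",    encoding = "UTF-8", skipNul = TRUE)
en_twitter_data <- readLines("en_US/en_US_twitter.txt", encoding = "UTF-8", skipNul = TRUE)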

Examining the 3 character vectors

The filesize (in megabytes) of each data file can be determined using the R function file.info, whose size column gives the size in bytes.

To compute the filesize in megabytes (MB) we divide each result by \( 2^{20}=1024^2 \), since 1 megabyte is equivalent to \( 2^{20} \) bytes.
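
A minimal sketch of this computation for the blog file; counting words by splitting on whitespace is an assumption, since the report does not state the counting method.

# file.info() reports the size in bytes; divide by 2^20 to obtain MB
blog.size.mb <- file.info("en_US/en_US_blogs.txt")$size / 2^20

# line count and an (assumed) whitespace-based word count
blog.lines <- length(en_blog_data)
blog.words <- sum(lengths(strsplit(en_blog_data, "\\s+")))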

The filesize, the count of lines, and the word count of each original data set are given in the next slide.

Filesize, Line and Word Count of the Original Data

   source filesize     Lines      Words
1    blog    200MB   899,288 37,546,246
2    news    196MB 1,010,242 34,762,395
3 twitter    159MB 2,360,148 30,093,410

Extracting a Sample from the Original Data

To reduce the time needed for pre-processing and cleaning the data, we shall extract a sample from each data file.

This can be done by defining an R function as stated below and calling it with percentage \( = 0.05 \) (i.e. \( 5\% \)):

sampleTexts <- function(data, percentage)
{
    # number of lines to keep, e.g. percentage = 0.05 for a 5% sample
    sample.size <- ceiling(length(data) * percentage)
    # draw that many lines at random, without replacement
    sample.entries <- sample(data, sample.size, replace = FALSE)
    return(sample.entries)
}
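
For reproducibility, a random seed may be fixed before drawing the samples; the seed value below is illustrative, not taken from the report.

set.seed(1234)  # illustrative seed
blog.sample    <- sampleTexts(en_blog_data,    0.05)
news.sample    <- sampleTexts(en_news_data,    0.05)
twitter.sample <- sampleTexts(en_twitter_data, 0.05)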

Filesize, Line and Word Count of the Sample Data

          source   Lines     Words
1    blog sample  44,965 1,888,026
2    news sample  50,513 1,734,854
3 twitter sample 118,008 1,506,645

Verifying the Percentage of the Sample on the Original Data

Each sample contains approximately 5% of the lines and 5% of the words of the corresponding original data file:

  1. BLOG SAMPLE
    • Percentage of the number of lines \( = \dfrac{44,965}{899,288} = \) 5.0\( \% \)
    • Percentage of the number of words \( = \dfrac{1,888,026}{37,546,246}= \) 5.03\( \% \)
  2. NEWS SAMPLE
    • Percentage of the number of lines \( = \dfrac{50,513}{1,010,242} = \) 5.0\( \% \)
    • Percentage of the number of words \( = \dfrac{1,734,854}{34,762,395}= \) 4.99\( \% \)
  3. TWITTER SAMPLE
    • Percentage of the number of lines \( = \dfrac{118,008}{2,360,148} = \) 5.0\( \% \)
    • Percentage of the number of words \( = \dfrac{1,506,645}{30,093,410}= \) 5.01\( \% \)

Creating a document corpus

We shall create a document corpus from the three sample data sets using the {tm}-R package; a minimal construction sketch follows the list below.

This document corpus consists of three documents:

  1. blog.sample,
  2. news.sample,
  3. twitter.sample.
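
A minimal construction sketch, assuming each sample is collapsed into a single document; the object names sample.data and document.corpus follow the names used later in this report.

library(tm)

# one document per sample; collapsing the lines into one string is an assumption
sample.data <- c(blog.sample    = paste(blog.sample,    collapse = " "),
                 news.sample    = paste(news.sample,    collapse = " "),
                 twitter.sample = paste(twitter.sample, collapse = " "))

document.corpus <- VCorpus(VectorSource(sample.data))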

Cleaning the document corpus

Using the function tm_map we can apply various transformations to this document corpus.

This function can remove non-essential characters such as

  • whitespace,
  • punctuation, and
  • numbers.

We shall remove numbers since they are not useful at this stage of word prediction.

However, we shall not remove stopwords like “the” or “and”, since these are useful in predicting the next word of a sentence.
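
A minimal cleaning sketch using the transformations mentioned above; note that stopwords are deliberately not removed.

# remove punctuation and numbers, then collapse extra whitespace
document.corpus <- tm_map(document.corpus, removePunctuation)
document.corpus <- tm_map(document.corpus, removeNumbers)
document.corpus <- tm_map(document.corpus, stripWhitespace)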

From document corpus to n-grams

After cleaning the document corpus, we can now proceed to create three n-gram objects, which we shall call:

1) unigrams

2) bigrams

3) trigrams

What is an n-gram?

According to Wikipedia:

“… an n-gram is a contiguous sequence of n items from a given sequence of text or speech.”

Thus,

1) a unigram consists of one word only;

2) a bigram is a contiguous sequence of two words; and

3) a trigram is a contiguous sequence of three words.

A term document matrix is needed to construct the n-grams

The n-gram objects can be used to construct an algorithm for predicting the next word in a sentence.

To create these n-gram objects we need a term document matrix which can be derived from the document corpus we have constructed.

Having created and cleaned the document corpus, we now construct a term document matrix using the function TermDocumentMatrix of the {tm}-R package.

What is a term document matrix?

The term document matrix is an \( n \times m \) matrix whose rows are the terms (words) in the document corpus and whose columns are the documents, i.e. the sample data sets.

Each entry counts how often the corresponding term appears in the corresponding document.

The generated matrix is a sparse matrix: many terms occur in only one document, so their entries for the other documents are 0.

We shall create 3 different term document matrices (tdms), one each for the unigrams, the bigrams, and the trigrams.
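
A construction sketch is given below. The report does not state which tokenizer produces the bigrams and trigrams; the RWeka NGramTokenizer used here is an assumption.

library(tm)
library(RWeka)

# unigram tdm: default single-word tokenization
tdm.unigrams <- TermDocumentMatrix(document.corpus)

# assumed bigram and trigram tokenizers
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm.bigrams  <- TermDocumentMatrix(document.corpus,
                                   control = list(tokenize = BigramTokenizer))
tdm.trigrams <- TermDocumentMatrix(document.corpus,
                                   control = list(tokenize = TrigramTokenizer))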

A summary of the term document matrix for unigrams

<<TermDocumentMatrix (terms: 141825, documents: 3)>>
Non-/sparse entries: 209821/215654
Sparsity           : 51%
Maximal term length: 109
Weighting          : term frequency (tf)

A summary of the term document matrix for bigrams

<<TermDocumentMatrix (terms: 1683225, documents: 3)>>
Non-/sparse entries: 2023835/3025840
Sparsity           : 60%
Maximal term length: 116
Weighting          : term frequency (tf)

A summary of the term document matrix for trigrams

<<TermDocumentMatrix (terms: 3723833, documents: 3)>>
Non-/sparse entries: 3991340/7180159
Sparsity           : 64%
Maximal term length: 122
Weighting          : term frequency (tf)

Reducing the sparsity of each term document matrix

To reduce the sparsity of each of the tdms we shall use the function removeSparseTerms(x, sparse) of the {tm}-R package with sparse \( = 0.666 \).

This reduces the sparsity by removing the terms that appear in only one document.
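
A minimal sketch of this step (object names assumed):

# with 3 documents, sparse = 0.666 removes terms whose proportion of empty
# documents exceeds 0.666, i.e. terms that appear in only one document
tdm.unigrams.dense <- removeSparseTerms(tdm.unigrams, sparse = 0.666)
tdm.bigrams.dense  <- removeSparseTerms(tdm.bigrams,  sparse = 0.666)
tdm.trigrams.dense <- removeSparseTerms(tdm.trigrams, sparse = 0.666)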

A denser term document matrix for unigrams

From 51% sparsity to 15% sparsity.

<<TermDocumentMatrix (terms: 44272, documents: 3)>>
Non-/sparse entries: 112268/20548
Sparsity           : 15%
Maximal term length: 22
Weighting          : term frequency (tf)

A denser term document matrix for bigrams

From 60% sparsity to 22% sparsity.

<<TermDocumentMatrix (terms: 256301, documents: 3)>>
Non-/sparse entries: 596911/171992
Sparsity           : 22%
Maximal term length: 29
Weighting          : term frequency (tf)

A denser term document matrix for trigrams

From 64% sparsity to 26% sparsity.

<<TermDocumentMatrix (terms: 218767, documents: 3)>>
Non-/sparse entries: 486274/170027
Sparsity           : 26%
Maximal term length: 39
Weighting          : term frequency (tf)

Transforming the term document matrix to data frame

We find the sum of each row of the term document matrix (tdm) to determine the count of each word in the document.corpus.

We then convert each processed tdm from a matrix into a data frame; a conversion sketch follows the list.

We call these data frames:

  • df.unigrams which consists of unigrams,
  • df.bigrams which consists of bigrams, and
  • df.trigrams which consists of trigrams.
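
A conversion sketch, with the helper name tdm.to.df assumed; the column names Terms and count match the data frames described later.

tdm.to.df <- function(tdm) {
    # row sums give the total count of each term across the three documents
    counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    data.frame(Terms = names(counts), count = counts,
               row.names = NULL, stringsAsFactors = FALSE)
}

df.unigrams <- tdm.to.df(tdm.unigrams.dense)
df.bigrams  <- tdm.to.df(tdm.bigrams.dense)
df.trigrams <- tdm.to.df(tdm.trigrams.dense)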

Visualizing the unigram count

We shall create a plot called a wordcloud for the data frame df.unigrams, using the {wordcloud}-R package.

It consists of about 100 words, and the font size of a unigram (word) depends on the count of that word in the data frame.

The higher the count of a word, the larger the font in which that word is displayed.
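
A minimal plotting sketch; the parameter values are illustrative and not taken from the report.

library(wordcloud)
library(RColorBrewer)

# show roughly the 100 most frequent unigrams, sized by their counts
wordcloud(words = df.unigrams$Terms, freq = df.unigrams$count,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))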

[Figure: wordcloud of the unigrams]

The top twenty unigrams

The word “the” is the largest word in the wordcloud since it has the largest count in the data frame df.unigrams.

This is confirmed by the bar graph of the top 20 unigrams ranked by count.

[Figure: bar graph of the top 20 unigrams]

A wordcloud for the bigrams

The wordcloud for the df.bigrams data frame is shown below.

It consists of 100 bigrams.

[Figure: wordcloud of the bigrams]

The top 20 bigrams

The bigrams “of the” and “in the” are the two largest of all the bigrams since they have the two highest counts. See the bar graph of the top 20 bigrams below.

[Figure: bar graph of the top 20 bigrams]

A wordcloud for the trigrams

This is the wordcloud for the df.trigrams data frame.

It also consists of 100 trigrams.

[Figure: wordcloud of the trigrams]

Top 20 trigrams

“one of the” is the largest of the trigrams since it has the highest count.

See the bar graph below.

[Figure: bar graph of the top 20 trigrams]

An Interesting Discovery: The Key is Bigrams

The data frame df.bigrams consists of two columns. The first column is named Terms and the second column is named count.

The Terms column consists of bigrams.

The count column consists of the frequency of occurrence of the corresponding bigram.

If the data frame df.bigrams is modified by replacing the count with the conditional probability of the occurrence of the second word of the bigram given the occurrence of the first word of the bigram, then we obtain an interesting data frame.
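
A hedged sketch of this modification; taking the first-word counts from df.unigrams is an assumption.

# first word of each bigram
first.word <- sapply(strsplit(df.bigrams$Terms, " "), `[`, 1)

# count of each candidate first word, taken from the unigram data frame
unigram.count <- setNames(df.unigrams$count, df.unigrams$Terms)

# P(second word | first word) = count(bigram) / count(first word)
df.bigrams$probability <- df.bigrams$count / unigram.count[first.word]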

An interesting discovery: Conditional Probability

This data frame allows us to compute the conditional probability of any n-gram in the document.corpus of the sample.data.

The conditional probability of an n-gram is the probability of occurrence of the last word in the n-gram given the occurrence of the first \( n-1 \) words of the n-gram.

Mathematically speaking

if the n-gram is \( w_1w_2\ldots{w_n} \) then

the probability of \( w_{n} \) given \( w_1w_2 \ldots w_{n-1} \) factors, under the bigram (Markov) assumption that each word depends only on the word immediately before it, as

\[ \begin{align} P(w_n | {w_1w_2\ldots w_{n-1}}) &= P(w_n|{w_{n-1}}) \cdot P(w_{n-1}|{w_1w_2\ldots{w_{n-2}}})\\ & \vdots \\ &= P(w_n | w_{n-1}) \cdot P(w_{n-1}|w_{n-2})\cdot \ldots \cdot P(w_{2}|{w_{1}}) \end{align} \]

An interesting discovery: Conditional probability of a bigram

Thus, if two n-grams differ only at the last word, say

\( {w_1w_2}\ldots{w_{n-1}}\alpha \) and \( {w_1w_2}\ldots{w_{n-1}}\beta \)

where \( \alpha \neq \beta \),

such that

\[ P(\alpha|w_1w_2\ldots{w_{n-1}}) \geq P(\beta|w_1w_2\ldots{w_{n-1}}) \]

then

\[ \begin{align} P(\alpha|w_1w_2\ldots{w_{n-1}}) &\geq P(\beta|w_1w_2\ldots{w_{n-1}})\\ P(\alpha | w_{n-1}) \cdot P(w_{n-1}|w_{n-2})\cdot \ldots \cdot P(w_{2}|{w_{1}}) & \geq P(\beta | w_{n-1}) \cdot P(w_{n-1}|w_{n-2})\cdot \ldots \cdot P(w_{2}|{w_{1}}) \\ P(\alpha | w_{n-1}) & \geq P(\beta | w_{n-1}) \end{align} \]

Since both conditional probabilities are obtained by dividing the corresponding bigram count by the same count of \( w_{n-1} \), the last inequality is equivalent to

\[ \text{count of bigram($w_{n-1}\alpha$)} \geq \text{count of bigram($w_{n-1}\beta$)} \]

An interesting discovery: Predicting the next word with a bigram is enough

Therefore we have the biconditional statement

\[ \text{count of bigram($w_{n-1}\alpha$)} \geq \text{count of bigram($w_{n-1}\beta$)} \]

if and only if

\[ P(\alpha|w_1w_2\ldots{w_{n-1}}) \geq P(\beta|w_1w_2\ldots{w_{n-1}}) \]

This means that if the sentence whose next word we are predicting is in our sample.data, then we can predict the next word using df.bigrams alone.

So we can save memory space, since there is no need to store the trigrams and other n-grams for \( n \geq 4 \).

Plan for creating a prediction algorithm with Shiny app

Based on the previous mathematical exposition, we have shown that the bigram model can be used in predicting the next word in a sentence.

This gives us the following prediction algorithm:

We store the bigrams as our database with two columns: the first column "Terms" consists of the bigrams and the second column "count" gives the frequency of the corresponding bigram in the sample.data.

Algorithm for mobile user

If \( \alpha \) is the last word in the sentence typed by a mobile user, then we shall look for bigrams whose first word is \( \alpha \) (a lookup sketch follows this list).

  1. If there are bigrams in the database with \( \alpha \) as the first word, then we create a list of at most eight bigrams which are listed from top to bottom based on their respective count.

  2. We shall suggest the top one to the mobile user, but the user has the option to scroll down the list.

  3. If there are no bigrams in the database with \( \alpha \) as the first word, then we shall select at random a bigram and suggest the first word of the bigram to the user.
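
A lookup sketch implementing these three steps; the function name and the limit of eight candidates reflect the list above, everything else is assumed.

predictNextWord <- function(alpha, bigrams, n = 8) {
    # bigrams whose first word is alpha
    matches <- bigrams[startsWith(bigrams$Terms, paste0(alpha, " ")), ]
    if (nrow(matches) == 0) {
        # no match: pick a bigram at random and suggest its first word
        return(strsplit(sample(bigrams$Terms, 1), " ")[[1]][1])
    }
    # at most n candidates, ordered by count; the top one is the main suggestion
    matches <- head(matches[order(-matches$count), ], n)
    sapply(strsplit(matches$Terms, " "), `[`, 2)
}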

Algorithm

The user interface of the Shiny app shall consist of a text input box that allows a user to enter a sequence of words.
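
A minimal sketch of such a user interface; the layout and the call to the predictNextWord helper sketched earlier are assumptions.

library(shiny)

ui <- fluidPage(
    textInput("phrase", "Enter a sequence of words:"),
    textOutput("suggestion")
)

server <- function(input, output) {
    output$suggestion <- renderText({
        words <- strsplit(trimws(input$phrase), "\\s+")[[1]]
        if (length(words) == 0) return("")
        # suggest the top candidate based on the last word typed
        predictNextWord(tail(words, 1), df.bigrams)[1]
    })
}

# shinyApp(ui, server)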

References