Introduction

This capstone project for the Coursera Data Science Specialization combines natural language processing with the data science tools learned and expanded upon in this specialization. In particular, we will build a predictive text model that predicts the next word in a phrase. This technology has become more prevalent and necessary with the growing use of smartphones and tablets. One of the leaders in this technology is SwiftKey, which has provided the data that serves as the basis for our analysis.

The data consists of a large corpus of text drawn from three separate sources: news feeds, blog posts, and Twitter messages. While data in other languages is available, we focus on the English sets. Each line of text in the corpus is a single entry. As we will see, the length and content vary depending on the source from which the line of text was read.

In this report, we first load the data and explore the entire corpus, considering line, word, and character counts. Next, we subsample the data to speed up computations and clean it by performing tokenization. We then build \(n\)-grams, phrases of length \(n\), and compute their frequency distributions. Finally, we discuss our initial thoughts and approaches regarding building a predictive text algorithm.

Loading the Data

In this section, we load our data and perform exploratory analysis. This data consists of three sets of text drawn from news feeds, blog posts, and Twitter messages. We compute the total number of lines, words, and (non-white-space) characters in each and display this information in the table below.

##   textSource lineCount characterCount wordCount
## 1      blogs    899288      162464653  37570839
## 2       news   1010242      162227130  34494539
## 3    twitter   2360148      125570616  30451128

Table 1: Summary of line, character, and word counts for each of the three sources of text.
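
The counts in Table 1 could be computed along the following lines. This is a minimal base R sketch, not necessarily the code used for this report: `count_summary` is a hypothetical helper, the file name in the comment is a placeholder, and words are assumed to be separated by white space.

```r
# Hypothetical helper: summarise one character vector of text lines.
count_summary <- function(lines) {
  # Remove all white space, then count the remaining characters.
  chars <- sum(nchar(gsub("[[:space:]]+", "", lines)))
  # Split each line on runs of white space and count the non-empty pieces.
  pieces <- strsplit(trimws(lines), "[[:space:]]+")
  words <- sum(vapply(pieces, function(p) sum(nzchar(p)), integer(1)))
  data.frame(lineCount = length(lines),
             characterCount = chars,
             wordCount = words)
}

# In practice the lines would come from a file, for example:
# lines <- readLines("blogs.txt", encoding = "UTF-8")  # placeholder file name
toy <- c("Let's all go to the park.", "Yay!!!")
count_summary(toy)
```

Applying such a helper to each of the three sources and binding the rows together would give a table with the same columns as Table 1.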

This data is also displayed in the following bar plots.

Figure 1: Bar plots representing the number of line, character, and word counts for each of the three sources of text.

Notice that there are over 2.3 million lines of text in the Twitter messages, as opposed to just over 1 million lines in the news feeds and just under 900,000 in the blogs. However, these lines of text are of quite different lengths. In Table 2, we display summary statistics for the number of characters per line (both white space and text characters) for each text source. Notice that the lines of text in the Twitter data consist, on the whole, of far fewer characters than those of the other two sources, as Twitter messages are limited to 140 characters.

##   textSource minimum median  mean maximum
## 1      blogs       1    156 230.0   40800
## 2       news       1    185 201.0   11400
## 3    twitter       2     64  68.7     140

Table 2: Summary statistics for the number of characters (including both white space and text characters) in a line of text for each text source.
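
As an illustration, the per-line statistics in Table 2 could be obtained as in the sketch below. The helper name `line_length_summary` is hypothetical, and the toy input stands in for one of the three text sources.

```r
# Hypothetical helper: summary statistics of line lengths in characters.
line_length_summary <- function(lines) {
  len <- nchar(lines)  # counts every character, white space included
  data.frame(minimum = min(len),
             median  = median(len),
             mean    = mean(len),
             maximum = max(len))
}

toy <- c("Yay!!!", "Let's all go to the park.", "ok")
line_length_summary(toy)
```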

Sampling and Cleaning the Data

As the goal is to build a prediction app that can work quickly on a device with limited computational power, we sample the data to speed up computations. We randomly select \(10^4\) lines of text from each text source. This accounts for only 0.7026% of the lines in the total corpus.
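
The sampling fraction can be checked against the line counts in Table 1. The sketch below assumes a sample of \(10^4\) lines per source, which is the size consistent with the stated 0.7026%. The helper `sample_lines` and its `seed` argument are hypothetical and only show one way to draw a reproducible sample.

```r
# Line counts taken from Table 1.
line_counts <- c(blogs = 899288, news = 1010242, twitter = 2360148)
n_sample <- 1e4  # assumed number of lines sampled per source

# Share of all lines that end up in the sample, in percent.
pct <- 100 * n_sample * length(line_counts) / sum(line_counts)
round(pct, 4)

# Hypothetical helper: draw up to n lines at random, without replacement.
sample_lines <- function(lines, n, seed = 1) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  lines[sample.int(length(lines), min(n, length(lines)))]
}

sample_lines(letters, 5)
```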

Using the stylo package in R, we tokenize the corpus. That is, we remove capitalization, punctuation, numbers, and extra white space. This package also allows common English contractions to be preserved as single words. Thus, the following line

Let’s all go to the park. Yay!!!

would become

lets all go to the park yay

Next, each word in our string is split into an individual string. Thus, our sample line would become

“lets” “all” “go” “to” “the” “park” “yay”
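
The two steps above can be imitated in base R as follows. This is a simplified stand-in for illustration, not the stylo implementation: `tokenize_line` is a hypothetical helper, it assumes a straight apostrophe in contractions, and it keeps only the letters a to z, so accented characters would be dropped.

```r
# Hypothetical helper: lower-case a line, strip everything but letters,
# and split the result into individual words.
tokenize_line <- function(line) {
  x <- tolower(line)
  x <- gsub("'", "", x, fixed = TRUE)  # keep contractions whole: "let's" -> "lets"
  x <- gsub("[^a-z]+", " ", x)         # punctuation, digits, extra white space -> one space
  x <- trimws(x)
  strsplit(x, " ", fixed = TRUE)[[1]]
}

tokenize_line("Let's all go to the park. Yay!!!")
# "lets" "all" "go" "to" "the" "park" "yay"
```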

At this time, we have decided not to remove profane words. There is no fully agreed-upon list of profane words, as this is often a matter of preference. For instance, context may turn a perfectly innocuous word such as balls into one that is offensive to some. Removing such words would make our text less predictable. Thus, we will likely deal with profane words by outputting a placeholder such as %#! in our predictive text algorithm.

We also have not removed stop words. Stop words are short words such as the, and, and at, which add little content to a phrase. However, as they are used to form grammatically correct English sentences, they are necessary in a text prediction application.

The next step in our analysis is to form \(n\)-grams from our corpus. We again use the stylo package for this task. An \(n\)-gram is a phrase consisting of \(n\) consecutive words that appear within our corpus. For instance,

“lets all go”, “all go to”, “go to the”

are examples of 3-grams from our sample line. We split our data into \(n\)-grams for \(n=1,\ldots,4\). Figure 2 displays the 10 most frequent unigrams, bigrams, trigrams, and four-grams with their corresponding frequencies.

Figure 2: Most frequently occurring \(n\)-grams for \(n=1,\ldots,4\) that appear in our corpus.
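
The \(n\)-gram construction described above can be sketched in base R. The report uses the stylo package for this step, so the helper `make_ngrams` below is a hypothetical equivalent written only to make the definition concrete.

```r
# Hypothetical helper: all phrases of n consecutive words in a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("lets", "all", "go", "to", "the", "park", "yay")
make_ngrams(tokens, 3)

# Frequency distribution of bigrams, most frequent first.
bigram_freq <- sort(table(make_ngrams(c(tokens, tokens), 2)), decreasing = TRUE)
head(bigram_freq, 10)
```

Sorting the frequency table in decreasing order and keeping the first ten entries gives the kind of ranking shown in Figure 2.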

Note that there are 893400 word instances in the sampled corpus of text. However, it is only necessary to keep 140 unique words in a frequency-sorted dictionary to cover 50% of all word instances in the corpus and 7277 unique words to cover 90%, as can be observed in the following plot.

Figure 3: Cumulative share of word instances covered by unique words, sorted by decreasing frequency, in the sampled corpus.
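
The coverage figures quoted above come from a cumulative sum over the frequency-sorted dictionary. The sketch below shows the calculation on a toy token vector; `words_needed` is a hypothetical helper, and the toy data is chosen only so that the arithmetic is easy to follow.

```r
# Hypothetical helper: number of unique words, taken in order of decreasing
# frequency, needed to cover a given share of all word instances.
words_needed <- function(tokens, coverage) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  which(cum_share >= coverage)[1]
}

# Ten word instances: "the" x5, "to" x3, "park" x1, "yay" x1.
toy <- c(rep("the", 5), rep("to", 3), "park", "yay")
words_needed(toy, 0.5)  # 1: "the" alone covers 5 of 10 instances
words_needed(toy, 0.9)  # 3: "the", "to", and one more word cover 9 of 10
```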

Conclusions and Future Work

As evidenced in this report, we have loaded our data, become familiar with it through exploratory analysis, processed it by tokenizing, and produced \(n\)-grams. We will continue the project by building a predictive text model.

One initial approach is to produce a table of \(n\)-grams with frequency counts, where the last word of each \(n\)-gram is split off as a target word and the preceding \(n-1\) words are recorded as a potential start. If a user types in a phrase of length \(\ell = n-1\), then a simple search through the table reveals all matching potential starts (a shorter phrase can be looked up in the smaller \((\ell+1)\)-gram table in the same way). The predicted word would be the target with the largest frequency among those matches. A few observations must be made about this approach.

  1. If \(\ell> n-1\), then the last \(n-1\) words in the user input phrase will represent the potential start.

  2. If there is no potential start that exactly matches the user’s input, then one option would be to move to the next smaller \(n\)-gram table, performing this look-up recursively. For instance, if there were no potential start matching a user input of “elephant xylophone on” in the 4-gram table, then we would proceed by looking for “xylophone on” in the 3-gram table, and so on.

  3. This approach does not reveal how to predict what follows a word that is never seen in the corpus. One naive approach would be to predict the most common unigram, “the”. However, there are likely many better approaches, which must be considered further.

  4. Finally, this approach only incorporates the last \(n-1\) words in a phrase to predict the next word. As many contextual words may appear well before this in a phrase, further approaches of how to incorporate them must also be considered.
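
The look-up with back-off described in the observations above can be sketched in base R. This is a minimal illustration under stated assumptions, not the final model: `build_table` and `predict_next` are hypothetical helpers, the toy corpus is invented for the example, and ties between equally frequent targets are broken arbitrarily.

```r
# Hypothetical helper: table of n-grams split into start (first n-1 words),
# target (last word), and frequency count.
build_table <- function(token_lines, n) {
  grams <- unlist(lapply(token_lines, function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(start = character(0), target = character(0),
                      count = integer(0), stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  data.frame(
    start  = vapply(parts, function(p) paste(p[-n], collapse = " "), character(1)),
    target = vapply(parts, function(p) p[n], character(1)),
    count  = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: predict the next word, backing off to shorter starts.
# tables[[n]] must hold the n-gram table for n = 1, ..., length(tables).
predict_next <- function(phrase_tokens, tables) {
  # Observation 1: only the last n - 1 words are used as the start.
  context <- tail(phrase_tokens, length(tables) - 1)
  repeat {
    tab <- tables[[length(context) + 1]]
    hits <- tab[tab$start == paste(context, collapse = " "), ]
    if (nrow(hits) > 0) return(hits$target[which.max(hits$count)])
    if (length(context) == 0) return(NA_character_)  # empty unigram table
    context <- context[-1]  # Observation 2: drop the first word and retry
  }
}

# Toy corpus of already tokenized lines (invented for illustration).
corpus <- list(
  c("lets", "all", "go", "to", "the", "park"),
  c("lets", "go", "to", "the", "park"),
  c("go", "to", "the", "store"),
  c("the", "park")
)
tables <- lapply(1:4, function(n) build_table(corpus, n))

predict_next(c("go", "to", "the"), tables)              # found in the 4-gram table
predict_next(c("elephant", "xylophone", "to"), tables)  # backs off to the 2-gram table
predict_next(c("zebra"), tables)                        # falls back to the top unigram
```

When the start shrinks to zero words, every unigram matches, so the sketch returns the most common unigram, which is the naive fallback from the third observation. Longer-range context, as raised in the fourth observation, is not handled here.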

We will continue this project by building a predictive model using our initial approach and the modifications necessary to address the issues raised above. This model will then be implemented in a Shiny app.