Background

The aim of this project is to develop a model that will allow us to predict the next word a user intends to type, in order to make typing faster, similar to the feature currently embedded in most smartphones. We will seek to balance speed and accuracy in developing an algorithm that is competitive with the current word-prediction keyboards available, such as “SwiftKey”.

The data provided for this NLP (Natural Language Processing) project consists of three corpora of text: one collected from blog posts, one from news articles, and a third from Twitter messages (“tweets”).

Exploratory Analysis

We begin by analyzing some basic facts about each corpus such as number of lines, number of words, number of unique words, line length, and average unique word length.
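As a minimal sketch of how these counts could be gathered (shown here in Python for illustration; the file names are assumptions and should be adjusted to wherever the corpora are stored):

```python
def corpus_stats(path):
    """Count lines, words, and unique words in a plain-text corpus file."""
    lines, words, vocab = 0, 0, set()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            tokens = line.lower().split()
            lines += 1
            words += len(tokens)
            vocab.update(tokens)
    return {"lines": lines,
            "words": words,
            "unique_words": len(vocab),
            "words_per_line": words / lines if lines else 0}

# Illustrative file names; the real corpora may be stored elsewhere.
for name in ("blogs", "news", "twitter"):
    print(name, corpus_stats(f"en_US.{name}.txt"))
```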

Number of Lines, Words, and Words per Line

We see that the Twitter corpus has many more lines than the others. But how many words are in each line?

We see that lines are much shorter in the Twitter corpus and longest in the news corpus.

But what exactly do lines represent? In the Twitter corpus a line seems to represent one tweet, which has a clearly defined limit of 140 characters. In the blogs and news corpora, lines are less well defined, sometimes representing just one sentence, sometimes an entire paragraph.

Let’s also check the total number of words in each corpus:

Diversity of Vocabulary

Now let’s examine the diversity of unique words used in each corpus. There are two different measures we will use: stemmed and unstemmed. Stemming reduces words to their root, so, for example, “walking, walked, walker” would all become simply “walk” and be counted as one unique word. Unstemmed, they would count as 3 unique words.
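To illustrate the two measures, here is a small sketch using NLTK's Porter stemmer (an illustrative choice; the exact stemmer affects the counts, and the Porter stemmer does not reduce every derived form, e.g. “walker” keeps its suffix):

```python
from nltk.stem import PorterStemmer

def vocabulary_sizes(tokens):
    """Return (unstemmed, stemmed) vocabulary sizes for a list of tokens."""
    stemmer = PorterStemmer()
    unstemmed = set(tokens)
    stemmed = {stemmer.stem(t) for t in unstemmed}
    return len(unstemmed), len(stemmed)

tokens = "walking walked walks walk the cat".split()
print(vocabulary_sizes(tokens))  # (6, 3): "walk", "the", "cat" after stemming
```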

The diversity of words in blogs seems extremely low from our stemmed analysis, but the unstemmed number brings it closer in line with the other corpora. This suggests that the number of distinct word roots used in blogs is very low, while it is highest in the Twitter corpus. What about the length of the average (unstemmed) word used?

As one would probably expect, the Twitter corpus contains the shortest words, due to its character limit and the nature of the short-form communication it is intended for, while blogs have the longest, probably due to the lack of an editor or other constraints on the writer. But the differences are not all that great.

Now let’s look at the distribution of word frequencies, ignoring the higher word counts where there are very few examples:
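A sketch of how such a distribution might be computed with a simple counter, cutting off the long tail of very high counts (the file name is illustrative):

```python
from collections import Counter

def frequency_distribution(tokens, max_count=50):
    """How many distinct words occur exactly 1, 2, ..., max_count times."""
    word_counts = Counter(tokens)
    dist = Counter(word_counts.values())
    # Ignore very high counts, where only a handful of words appear.
    return {k: dist[k] for k in range(1, max_count + 1) if dist[k]}

with open("en_US.blogs.txt", encoding="utf-8", errors="ignore") as f:
    tokens = f.read().lower().split()
print(frequency_distribution(tokens))
```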

We see that these distributions are very similar across the three corpora.

Now, here is an example of the 10 most common words in each corpus. This is after both stemming and eliminating very common words such as “the”, which are known as “stop words” in language processing.
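A table like this could be produced along the following lines, again using NLTK's stop-word list and Porter stemmer as illustrative choices:

```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def top_words(tokens, n=10):
    """Most common stems after dropping English stop words and non-alphabetic tokens."""
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    stems = [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]
    return Counter(stems).most_common(n)

# top_words(twitter_tokens) might surface words such as "love" and "thank".
```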

Next Steps: Developing A Prediction Algorithm

We can see, even from just the top 10 words in each corpus, that there are significant differences between the corpora, with words like “love” and “thank” scoring highly in Twitter, while “state” scores highly in news. Also, “said” scores extremely highly in the news corpus, due to the frequent use of quotations in news articles, which may mean that this corpus more closely mirrors the way people actually speak, rather than how they write.

So, one important aspect of our algorithm may be to try to determine which type of message is being written - a blog-, news-, or Twitter-style communication. One of the basic rules of machine learning is that you are likely to get good prediction results only if the data you train your algorithm on is of the same type as your test data (Source: Stanford NLP Course on Coursera).
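One simple way to make that determination, sketched below as an assumption rather than a final design, is to score the text against per-corpus word frequencies (with add-one smoothing for unseen words) and pick the best-scoring corpus:

```python
import math
from collections import Counter

def unigram_model(tokens, vocab_size=100_000):
    """Word probabilities for one corpus, with add-one smoothing for unseen words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + vocab_size)

def guess_corpus(text, models):
    """Return the corpus whose unigram model gives the text the highest log-likelihood."""
    words = text.lower().split()
    scores = {name: sum(math.log(model(w)) for w in words)
              for name, model in models.items()}
    return max(scores, key=scores.get)

# models = {"blogs": unigram_model(blog_tokens),
#           "news": unigram_model(news_tokens),
#           "twitter": unigram_model(twitter_tokens)}
# guess_corpus("thank you so much love this", models)  # likely "twitter"
```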

First, we will aim to identify the type of communication we are dealing with by examining the words used and seeing how common they are in each corpus. Next, we will search that corpus for trigrams (3-word phrases) whose first two words match the last two words of our prediction query. If no trigram matches, we will fall back to bigrams, and finally the probabilities of individual words may play a role in our prediction.
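A minimal sketch of this backoff scheme, using plain counts and leaving out smoothing, weighting, and the corpus-selection step above:

```python
from collections import Counter, defaultdict

def build_tables(tokens):
    """Index trigram and bigram continuations by their leading words."""
    trigrams, bigrams = defaultdict(Counter), defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigrams[(a, b)][c] += 1
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1
    return trigrams, bigrams, Counter(tokens)

def predict_next(query, trigrams, bigrams, unigrams):
    """Back off from trigrams to bigrams to the single most frequent word."""
    words = query.lower().split()
    if len(words) >= 2 and tuple(words[-2:]) in trigrams:
        return trigrams[tuple(words[-2:])].most_common(1)[0][0]
    if words and words[-1] in bigrams:
        return bigrams[words[-1]].most_common(1)[0][0]
    return unigrams.most_common(1)[0][0]
```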

In some cases, unusual words at the beginning of the sentence may play an important role in prediction, so we should also include a method that considers these words and searches among the other sentences in the corpus that contain them.
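One possible way to support this - an assumption for illustration - is an inverted index from each word to the sentences that contain it, so that candidate sentences sharing an unusual word can be retrieved quickly:

```python
from collections import defaultdict

def build_sentence_index(sentences):
    """Inverted index: word -> indices of the sentences containing that word."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            index[word].add(i)
    return index

def sentences_containing(word, index, sentences):
    """Candidate sentences that share the given (unusual) word."""
    return [sentences[i] for i in index.get(word.lower(), ())]
```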

We will also need to remove sparse words in order to increase the speed of our algorithm, and testing will need to be done to see where the appropriate balance of speed and accuracy lies.
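For example, sparse entries could be dropped from the lookup tables before prediction (a sketch; the cut-off would need to be tuned against accuracy):

```python
def prune_counts(counts, min_count=2):
    """Keep only entries seen at least min_count times, shrinking the model."""
    return {key: c for key, c in counts.items() if c >= min_count}

# e.g. applied to the bigram table built earlier:
# bigrams = {w: prune_counts(nexts) for w, nexts in bigrams.items()}
```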