Check out the project here.

This is the ‘short’ version of the report. For those interested in all of the little details, see the full version.

Note: You may have to open the links in a new tab.

Introduction

This project was done as the capstone project for the Johns Hopkins Data Science Specialization offered on Coursera. If you’re interested, you can check the course out here. Coursera offers a one-week trial followed by a $50 monthly membership fee, so you can get a glimpse of what the specialization looks like before you make any financial commitments. It’s also a go-at-your-own-pace series of 10 courses, so you can work on them as your schedule permits.

The goal here was to develop a predictive text model capable of offering word suggestions to a user as they type on their device. This technology is used in many applications today, such as Gmail and Apple’s native iOS keyboard. SwiftKey, a company known for their mobile keyboard software, generously provided some data for this project. Each of their data sets includes numerous (by that I mean several hundred thousand) text strings from either blog entries, news articles, or Twitter posts. The original data sets had around 800,000 blog posts, 900,000 news articles, and 2,400,000 Twitter entries.

Preprocessing and Initial Analysis

To begin, it seemed most reasonable to explore the distribution of words within each set of data. This involved finding the frequencies of single words across a random subset of the available sentences, as well as the occurrences of two-, three-, four-, and five-word sequences. 5% of the blog source data, or approximately 45,000 randomly sampled entries, was used in this analysis. Similarly, 5% of the news entries (about 50,000) and 8.3% of the Twitter entries (about 200,000) were used. These sample sets will henceforth be referred to as the ‘data sets.’ The average length of each sample (individual entry) was as follows: 230 words per blog post, 201 words per news entry, and 69 words per Twitter post. The most common words in the blogs data set are presented below.

This was cool and somewhat interesting, but not particularly useful. To build a predictive model, more emphasis needed to be placed on the frequency of various word sequences. The most common five-word sequences from the blogs and Twitter sets are presented below. They were a bit more interesting than the shorter sequences.

Most Common Phrases in the Blogs Set
Leading_Phrase | Next_Word | Count
at the end of | the | 76
in the middle of | the | 36
the other side of | the | 28
for the rest of | the | 23
if you would like | to | 23

Most Common Phrases in the Twitter Set
Leading_Phrase | Next_Word | Count
happy mothers day to | all | 70
cant wait to see | you | 69
thank you so much | for | 63
thanks for the shout | out | 61
thanks so much for | the | 60

This is more or less what one might have expected to see: a lot of generic phrases in the blogs set, and a lot of social media-esque wording in the Twitter set. The Twitter set also seemed to have a bit more variety in vocabulary. The data was separated into two categories: a ‘leading phrase’ and a ‘next word.’ A unique combination of a leading phrase and next word will be referred to as a ‘word sequence.’ This method of organizing the data seemed to make the most sense since, given a certain series of words, the job of the prediction algorithm is to guess the next word. The ‘count’ column will come into play later.

Fifteen data sets were created from this initial reading of the data. For each of the blogs, news, and Twitter data sets, the one-, two-, three-, four-, and five-word sequence counts were tabulated. These were stored as CSV files for later use. The total size of the CSV files at this point was around 440 MB. This was far too big to use for the final product, so some data cleaning was needed.
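
To make the tabulation step concrete, here is a rough sketch in R of how one of these counts tables could be produced. The file name, sampling seed, and use of the tidytext package are assumptions for illustration; the project’s actual code may well differ.

    # Sketch: sample ~5% of the blog entries and tabulate five-word sequences
    # as leading phrase / next word / count. File name, seed, and packages are
    # illustrative assumptions.
    library(dplyr)
    library(tidytext)

    set.seed(1)
    blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
    sampled <- sample(blogs, size = round(0.05 * length(blogs)))

    count_ngrams <- function(text, n) {
      data.frame(text = text, stringsAsFactors = FALSE) |>
        tidytext::unnest_tokens(ngram, text, token = "ngrams", n = n) |>
        dplyr::count(ngram, name = "Count") |>
        dplyr::mutate(
          Leading_Phrase = sub("\\s\\S+$", "", ngram),  # everything but the last word
          Next_Word      = sub("^.*\\s", "", ngram)     # the last word (n >= 2)
        ) |>
        dplyr::select(Leading_Phrase, Next_Word, Count)
    }

    write.csv(count_ngrams(sampled, 5), "blogs_5grams.csv", row.names = FALSE)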

Cleaning and Further Processing

1. Filtering the Data

As one might expect when sampling and analyzing millions of words, the data contained plenty of typos and other unwanted information. To minimize the number of such instances, the data was run through a series of four filters and a modifier; a condensed sketch of the whole pass appears after the list below. Each of the 15 CSV files was filtered separately.

  1. The first filter removed any word sequences (or single words) that appeared only once in the data set. This seemed reasonable given the large sample size: anything appearing only once was probably so obscure that the user would be very unlikely to replicate it. This filter also helped to remove entries containing unusual typos.

  2. The second filter removed any symbols or special characters from each set. This allowed for the quick filtration of many non-English words. It also helped to remove any emojis, which would greatly complicate the prediction algorithm, given that their use varies greatly from user to user.

  3. The third filter removed any strings consisting entirely of numbers (at this point, all strings contained only numbers or letters). Because some number-word combinations are commonly used (e.g., ‘365 days per year’), leading phrases were only filtered out if they consisted entirely of numbers and whitespace.

  4. The fourth filter searched for profanity anywhere within a leading phrase or next word. If any profanity was detected, the entire entry was removed. The list of profane words was taken from freewebheaders.com’s ‘Base List of Bad Words in English’. To save computation time, a subset of these words was selected and applied to the filter. The subset eliminated some of the not-very-offensive ‘bad words’ and some of the items that were encapsulated by other ‘bad words.’ Links: [original list] | [subset used].

  5. Finally, the modifier was applied. In the initial read of the data, any punctuation characters were removed from the text. This was to prevent commas, semicolons, etc. from causing incorrect word sequences in the data (‘sometimes, you have…’ being marked as different from ‘sometimes you have…’). Unfortunately, this also causes word sequences that bridge sentences (…then. I don’t see… –> ‘then I don’t see’) to be read as phrases. The simple alternative, however, would have been to break all sequences whenever a period was detected, which would incorrectly terminate sequences on some titles and abbreviations (…said Mr. Kim… –> ‘said Mr’, ‘Kim’). Anything beyond this would have greatly complicated the parsing algorithm and correspondingly increased runtime. This was simply a case of trading a bit of accuracy for simplicity.
    The modifier restores any apostrophes that were removed from contractions. Only the single words and next words were modified in this manner, since they are ultimately the values that will be returned to the user (single words for the autocorrect feature and next words for everything else). Keeping the leading phrases in their punctuation-less state will also make it easier to process and match user input later on. This modification was performed by matching words in the data set to textfixer.com’s list of common English contractions.
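
For concreteness, here is a condensed sketch of how the four filters and the modifier might be applied to one of the counts tables. The column names follow the tables in this report; bad_words (a vector of profane words) and contractions (a named vector mapping the apostrophe-less form to the contracted form, e.g. dont –> don’t) are assumed to have been built from the lists linked above.

    # Sketch of the cleaning pass for one leading-phrase/next-word table.
    clean_ngrams <- function(ngrams, bad_words, contractions) {
      # 1. Drop any word sequence that appeared only once.
      ngrams <- subset(ngrams, Count > 1)

      # 2. Keep only letters, digits, and spaces (drops symbols and emoji).
      keep <- grepl("^[a-z0-9 ]+$", ngrams$Leading_Phrase) &
              grepl("^[a-z0-9]+$",  ngrams$Next_Word)
      ngrams <- ngrams[keep, ]

      # 3. Drop entries made up entirely of numbers and whitespace.
      ngrams <- subset(ngrams, !grepl("^[0-9 ]+$", Leading_Phrase) &
                               !grepl("^[0-9]+$",  Next_Word))

      # 4. Remove the whole entry if any of its words is on the profanity list.
      has_profanity <- function(x)
        vapply(strsplit(x, " "), function(w) any(w %in% bad_words), logical(1))
      ngrams <- ngrams[!has_profanity(ngrams$Leading_Phrase) &
                       !has_profanity(ngrams$Next_Word), ]

      # 5. Modifier: restore apostrophes in the next words only.
      fixed <- contractions[ngrams$Next_Word]
      ngrams$Next_Word <- ifelse(is.na(fixed), ngrams$Next_Word, fixed)

      ngrams
    }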

At this point, the total size of the CSVs was about 15 MB, which is considerably more manageable than the pre-filtered data.

2. Merging the Data

The cleaned data sets were then nearly ready for use. The next step was to reformat and consolidate the data by leading phrase length. In the future, the separate data sets (blogs, news, Twitter) may be useful for providing situationally appropriate feedback to users; for this project, however, we will assume that the pooled data approximately represents an average person’s typing tendencies.

For each leading phrase length (one word, two words, …), the filtered data from the three sets (blogs, news, Twitter) was combined. Any word sequences that occurred in more than one set simply had their ‘counts’ totaled in the new data set; all other word sequences were copied over unchanged. The result of this process was five ‘Master’ data sets, each containing the pooled data for its respective word sequence length.
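
As a sketch, the pooling for one phrase length might look like the following (base R’s aggregate() is one straightforward way to total the counts; object names are illustrative).

    # Stack the three cleaned tables for one phrase length and total the counts
    # of any word sequence that appears in more than one source.
    merge_sources <- function(blogs, news, twitter) {
      pooled <- rbind(blogs, news, twitter)
      aggregate(Count ~ Leading_Phrase + Next_Word, data = pooled, FUN = sum)
    }

    # e.g. master_5 <- merge_sources(blogs_5, news_5, twitter_5)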

3. Organizing the Data

The final step was to organize the data into a format that can be used by the prediction algorithm. For reasons that should soon be clear, the single word counts set was not processed in this step. At most, the prediction model will return three suggested next words to the user. Therefore, any leading phrases that were matched with more than three next words contain extraneous information. The most reasonable format for the new, condensed data seemed to be a five-column CSV where the first column contained the leading phrase, the next three columns contained the first, second, and third most common next words (based on counts), and the final column kept tabs on the frequencies of each leading phrase-next word sequence. The final column was simply kept for analytics, and was not used in the prediction model. A sample of the re-arranged data is shown below. Contrary to what is shown in the table, the majority of the leading phrases (about 80% from brief visual inspection) had only one next word. This particular subset was just a little more interesting to look at.

The data at this point was only about 8 MB, 98% down from the original 440 MB.

Sample of Four Word Processed Data
Leading_Phrase | Most_Common_Next_Word | Second_Most_Common | Third_Most_Common | Counts
all going to | die | a | be | 3 2 2
all good things | must | | | 5
all had a | wonderful | great | | 6 5
all hands on | deck | | | 2
all have a | great | wonderful | good | 8 8 2
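
One way the reshaping described above might be done is sketched here with dplyr and tidyr; the generic column names (Next_Word_1, Next_Word_2, Next_Word_3) stand in for the descriptive ones shown in the sample table, and the exact implementation is an assumption.

    library(dplyr)
    library(tidyr)

    # Keep at most the three most frequent next words per leading phrase and
    # spread them into one row, keeping the counts as a single analytics column.
    top3_wide <- function(master) {
      master |>
        dplyr::group_by(Leading_Phrase) |>
        dplyr::slice_max(Count, n = 3, with_ties = FALSE) |>
        dplyr::mutate(rank   = dplyr::row_number(),
                      Counts = paste(Count, collapse = " ")) |>
        dplyr::ungroup() |>
        dplyr::select(Leading_Phrase, rank, Next_Word, Counts) |>
        tidyr::pivot_wider(names_from   = rank,
                           values_from  = Next_Word,
                           names_prefix = "Next_Word_")
    }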

The Prediction Model

After all of the time spent writing functions to gather, clean, and filter the data, it was finally time to build the prediction model. Before any prediction is done, the last few words of the user’s input are run through a brief spell-check algorithm that attempts to fix any words not immediately found in the single words set.
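
As a rough illustration of what such a check could look like (snapping an unknown word to the nearest entry in the single-word set by edit distance is an assumption, not necessarily the method actually used):

    # Minimal sketch: if `word` is not in the single-word set, replace it with
    # the closest known word by Levenshtein distance (within a small tolerance).
    correct_word <- function(word, known_words, max_dist = 2) {
      if (word %in% known_words) return(word)
      d <- adist(word, known_words)   # edit distance to every known word
      if (min(d) <= max_dist) known_words[which.min(d)] else word
    }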

Intuitively, the longer the leading phrase the model can match, the better its chance of successfully predicting the user’s next word. Therefore, the model begins by looking at the last four words the user entered. If these match a leading phrase in the five-word sequences set, the corresponding next words are queued up to be sent back to the user. If fewer than three suggestions have been found, the three most recent words are matched against the four-word sequences set. This process continues with the two most recent words and then the single most recent word until three suggestions have been gathered. If, at this point, at least one match has been made, the list of suggestions is returned to the user.

If no matches have been made, the most recent word is removed, and the new four most recent words are analyzed through a recursive call to the prediction model. This is done in case the user’s most recent word is something that could not be autocorrected and that does not match any leading phrases in the two-word sequences set. Instead of simply getting stuck on this one word and returning no suggestions, the model attempts to use earlier input to make a suggestion. The maximum number of recursive calls is set to two, meaning that the model will run to completion a maximum of three times. If no matches have been found at that point, no suggestions are returned to the user.
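
Putting the last two paragraphs together, the back-off logic can be summarized roughly as follows. Here `words` is the (spell-checked) vector of the user’s words and `tables` is a list of lookup tables where `tables[[n]]` holds the set whose leading phrases are n words long (see the Technical Note at the end); the names and details are illustrative, not the exact implementation.

    # Sketch of the back-off search: try the longest usable leading phrase
    # first, collect up to three suggestions, and if nothing matches at all,
    # drop the most recent word and recurse (at most twice).
    predict_next <- function(words, tables, depth = 0, max_depth = 2) {
      suggestions <- character(0)
      for (n in 4:1) {                           # leading-phrase lengths 4, 3, 2, 1
        if (length(words) < n) next
        key <- paste(tail(words, n), collapse = " ")
        env <- tables[[n]]
        if (exists(key, envir = env, inherits = FALSE)) {
          suggestions <- unique(c(suggestions, get(key, envir = env)))
        }
        if (length(suggestions) >= 3) break
      }
      if (length(suggestions) > 0 || depth >= max_depth || length(words) <= 1) {
        return(head(suggestions, 3))
      }
      predict_next(head(words, -1), tables, depth + 1, max_depth)  # drop last word, retry
    }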

A sample of the algorithm’s ‘thought’ process can be seen below. The input text was ‘Hello the world is on fire jfiej’.

First Call: 
[5]: is on fire jfiej | No Matches 
[4]: on fire jfiej | No Matches 
[3]: fire jfiej | No Matches 
[2]: jfiej | No Matches 

Recursive Call #1: 
[5]: world is on fire | No Matches 
[4]: is on fire | Suggestions: (1) tonight (2) (3) 
[3]: on fire | Suggestions: (1) tonight (2) and (3) in

Technical Note
For those wondering, a set of hash tables was used to minimize search times. One hash table was used per CSV file (excluding the single words set, which was kept as a data frame). Each hash table used the leading phrases as keys and stored as values a list of (up to) the 3 most common next words. The unpacking of the CSV files into hash tables does take around 15 seconds when the Shiny app first starts up, but this initial sacrifice allows predictions to quickly and reactively be made as the user enters text into the input field.
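
As a rough sketch of that setup (file names are placeholders, and the column names follow the reshaping sketch earlier rather than the exact CSV headers):

    # Load one processed CSV into a hash-backed environment keyed by leading
    # phrase, storing up to three next words as the value.
    load_ngram_table <- function(csv_path) {
      df  <- read.csv(csv_path, stringsAsFactors = FALSE)
      env <- new.env(hash = TRUE, size = nrow(df))
      for (i in seq_len(nrow(df))) {
        next_words <- c(df$Next_Word_1[i], df$Next_Word_2[i], df$Next_Word_3[i])
        assign(df$Leading_Phrase[i],
               next_words[!is.na(next_words) & next_words != ""],
               envir = env)
      }
      env
    }

    # e.g. tables <- lapply(c("master_2.csv", "master_3.csv",
    #                         "master_4.csv", "master_5.csv"), load_ngram_table)
    # get("is on fire", envir = tables[[3]]) would then return the stored
    # suggestions for that leading phrase.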