Introduction

Disclaimer: I realize that this report is a bit lengthy. I have already completed my model, and wanted to use this report to outline my entire thought/analysis process for anybody who might be unsure of where to go next with the project.

The goal of this project is to develop a predictive text model capable of offering word suggsetions to a user as they type on their device. This technology is used in many applications, such as Gmail and Apple’s native iOS keyboard. SwiftKey, a company known for their mobile keyboard software, has generously provided several datasets for this project. Each of these datasets includes numerous text strings from either blog entries, news articles, or Twitter posts. The data is pre-sorted by type and by language (English, German, Finnish, and Russian).

Preprocessing and Initial Analysis

To begin, it seemed most reasonable to explore the distribution of words within each set of data. This involved finding the frequencies of single words across a random subset of the available sentences, as well as the occurrences of two, three, four, and five word sequences. 5% of the blog source data, or approximately 45,000 randomly sampled entries, were used in this analysis. Similarly, 5%, or about 50,000 of the news entries, and 8.3%, or about 200,000 Twitter entries were used. The average length of each entry was as follows: 230 words per blog post, 201 words per news entry, and 69 words per Twitter post. The most common words in the blogs data set, along with some word sequence data, are presented below.

Special Note: Four of the lines in the en_US.news.txt file cause an ‘incomplete final line’ error when read in using the readLines function with all arguments set to the defaults. This causes the file to prematurely stop loading. This was fixed by going to the lines with such characters (determined by the size of the loaded data when the function halted) and manually deleting them. The character in question was determined to be unicode U+001A (‘SUB’). This error was only reproducible on Windows 10. Loading in the data on OS X Mojave produced no errors, and the full set of strings was immediately available for use without source data modification.

Most Common Phrases in the Blogs Set
Leading_Phrase Next_Word Count
at the end of the 76
in the middle of the 36
the other side of the 28
for the rest of the 23
if you would like to 23
Most Common Phrases in the Twitter Set
Leading_Phrase Next_Word Count
happy mothers day to all 70
cant wait to see you 69
thank you so much for 63
thanks for the shout out 61
thanks so much for the 60

Generally speaking, the most commonly used words were the same across all data sources. However, as the length of word sequences got greater, there differences between the sets became more pronounced. This effect can be seen in the tables above. The blog posts contained phrases that one would generally expect to see in an online publication, while the wording in the Twitter posts was much more casual.

The initial reading of the data resulted in 15 datasets. These contain the one, two, three, four, and five word frequency counts for each the blog, news, and Twitter data sets. The single word data sets were stored as two-column CSV files, with the first column containing the detected word and the second indicating how many times the word was detected across all strings in the subset. The remaining data sets were stored in three-column CSV files. The above tables are a good representation of how the data was stored. The first column contains the first n - 1 words (where n is the length of the word sequences detected), the second column contains the last word in the sequence, and the third column again indicates the number of detections of the sequence. The data was organized in this manner so that, given a input sequence of n - 1 words, we can attempt to match the input and predict the next word the user intends to type. Throughout this report, the n - 1 word sequences will be referred to as ‘leading phrases’, and the words contained in the second column will be referred to as ‘next words’

Cleaning and Further Processing

1. Filtering the Data

The next step was to filter out some of the extraneous data. This was done using a series of four filters and one modifier. The first filter removed any words (single word data only) or leading phrase-next word combinations that appeared only once. The reasoning behind this is that, in 45,000+ samples, anything that only appears once is either very unlikely to be replicated by the user or is simply a typo. This first filter alone typically removed upwards of 50% of the entries from each data set. The effect was even more pronounced in the data sets containing longer word sequences.

The second filter removed any symbols or special characters from the leading phrases and next words. This allowed for the quick filtration of many non-English words. It also helped to remove any emojis and other non-alphanumeric characters for which uses tend to vary greatly from user to user, thus making their use very difficult to predict.

The third filter removed any any strings consisting entirely of numbers (at this point, all strings contained only numbers or letters). Because some number-word combinations are commonly used (eg ‘365 days per year’), leading phrases were only filtered out if they consisted entirely of numbers and whitespaces.

The fourth filter searched for profanity detected anywhere within a leading phrase or next word. If profanity was detected, the entire entry was removed. The list of profane words was taken from freewebheaders.com’s ‘Base List of Bad Words in English’. Since the time complexity of this filter is O(n*m), where n is the length of each dataset and m is the number of words to search for, a subset of these words was selected and applied to the filter. Links: original list | subset used.

Finally, the modifier was applied. In the initial read of the data, any punctuation characters were removed from the text. This was to prevent any commas, semicolons, etc. from leading to mistaken word sequences in the data. The modifier restores any apostrophes that were removed from contractions. Only the single words and next words were modified in this manner, since they are ultimately the values that will be returned to the user. Keeping the leading phrases in their punctuation-less state will also make it easier to process and match user input later on. The modification was accomplished by matching words in the dataset to textfixer.com’s list of common English contractions.

The filtration process led to a 95% overall reduction in data.

2. Merging the Data

The cleaned data sets were then nearly ready for use. The next steps were to reformat and consolidate the data by leading phrase length. For future purposes, the separate data sets (blogs, news, twitter) may be useful to provide situationally-appropriate feedback to users. However, for this project, we will assume that the pooled data approximately represents an average person’s typing tendancies.

Data was merged by leading phrase length, with any overlapping phrases across data sets having their counts summed in the merged set. The result was five ‘Master’ data sets, each containing the pooled data for the frequency of words or leading phrase-next word sequences.

3. Organizing the Data

The final step was to organize the data into a format that can be used by our prediction algorithm. The single word counts set was not processed in this step. At most, the prediction model will return three suggested next words to the user. Therefore, any leading phrases that were matched with more than three next words needed to be condensed. Each data set was converted to a new CSV file with five columns. The first column contains the leading phrases, while the next three columns contain, if available, the three most commonly used next words for that leading phrase, in their respective order. The fifth column contains a string of up to three integers representing the number of times each leading phrase-next word sequence was found in the original read of the data. This column was primarily left there for later analytics, and is not utilized by the prediction model.

Sample of Four Word Processed Data
Leading_Phrase Most_Common_Next_Word Second_Most_Common Third_Most_Common Counts
all going to die a be 3 2 2
all good things must 5
all had a wonderful great 6 5
all hands on deck 2
all have a great wonderful good 8 8 2

Building a Prediction Model

The prediction model will take user input and look at the four most recently typed words and check each word to see if it matches an item in the single words set. If any of the words do not match, they will be processed through an autocorrect algorithm to attempt to find the word that the user intended to type. Next, the four word string will be passed to the five word data set, where the algorithm will attempt to match it to one of the four word leading phrases. If a match is found, it will be added to the list of suggestions to be returned to the user. If at this point fewer than three matches have been found, the first word in the string will be removed, and the three most recent words entered by the user will attempt to be matched. This will continue with the two and single most recent words until three suggestions have been found. If at least one suggestion has been found by this point, the list of (up to) three suggestions will be returned to the user.

If no suggestions have been found, the most recent word entered by the user is deleted from the input string, and the next four most recent words are passed recursively to a new call of the prediction model. This continues until a match is found, or two recursive calls have been made to the function (meaning the prediction algorithm was executed three times altogether). This deletion is done so that, if the user inputs an unmatchable word at the end of their input string, the predictor is able to work around this and can still attempt to find suggestions based on the rest of their input.