April 25, 2016
INTRODUCTION
- The goal of this application is to predict the next word from the words entered by the user
- Analyze a large sample of text (blogs, news, Twitter) from the SwiftKey dataset
- Determine the most frequent 1-, 2-, and 3-word combinations (n-grams); a sketch follows this list
- The analysis involves many lines of code to implement the algorithm
- A simple back-off method for word prediction is applied
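The analysis code itself is not reproduced in this overview. As a rough illustration only, a minimal Python sketch of counting the most frequent 1-, 2-, and 3-grams from a list of tokens might look like the following (the function name and sample tokens are hypothetical):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (as tuples of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical example: most frequent 1-, 2-, and 3-grams in a tiny sample
tokens = "the quick brown fox jumps over the lazy dog".split()
for n in (1, 2, 3):
    print(n, ngram_counts(tokens, n).most_common(3))
```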
DATA PROCESSING
- A subset of the data (blogs, news, Twitter) is used for this exploratory analysis
- A random sample of 1% of the data is retained due to resource constraints
- The samples from each source are combined and several processing steps are applied to clean the text
- The text is converted to lower case and then split into individual words
- Punctuation is removed from the beginning and end of each word, while contractions are retained
- Any words matching a list of profane words are also removed
- Stopwords are also removed (a sketch of these cleaning steps follows this list)
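The actual cleaning code is not shown here. A minimal Python sketch of the steps described above, assuming placeholder profanity and stopword lists, might look like this:

```python
import string

# Placeholder word lists; the real analysis would use a published profanity
# list and a standard stopword list.
PROFANITY = {"badword"}
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def clean_tokens(text):
    """Lower-case, split on whitespace, strip edge punctuation,
    and drop profanity and stopwords."""
    cleaned = []
    for w in text.lower().split():
        w = w.strip(string.punctuation)  # only edges; inner apostrophes (contractions) survive
        if w and w not in PROFANITY and w not in STOPWORDS:
            cleaned.append(w)
    return cleaned

print(clean_tokens("Don't stop -- the QUICK brown fox!"))
# -> ["don't", 'stop', 'quick', 'brown', 'fox']
```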
SUMMARY OF DATA - DATA FREQUENCY
NGRAM MODEL - SIMPLE BACK-OFF
- The data has been divided into data frames, which contain individual words as well as the resulting n-grams
- A single word entered as input is matched against the first word of the most common bigrams
- The top three matches provide the three most likely next words
- If multiple words are entered, the last two words are matched against the first two words of the trigrams
- The three most likely next words from the trigram list are returned (a sketch of this lookup follows this list)
- The model does not account for non-matching input such as misspelled words or less common phrases
- Future work will consider adding a four-gram model
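The prediction code is not reproduced in this summary. As an illustration of the back-off lookup described above, a minimal Python sketch using hypothetical bigram and trigram tables (ordered from most to least frequent) could look like the following:

```python
# Hypothetical n-gram tables, ordered from most to least frequent.
# In the real model these would come from the n-gram data frames built
# from the sampled corpus, not hard-coded lists.
BIGRAMS = [("new", "york"), ("i", "am"), ("i", "think"), ("i", "love")]
TRIGRAMS = [("new", "york", "city"), ("new", "york", "times"), ("i", "am", "happy")]

def predict(text, top_n=3):
    """Simple back-off: try trigrams on the last two words,
    otherwise fall back to bigrams on the last word."""
    words = text.lower().split()
    if len(words) >= 2:
        matches = [t[2] for t in TRIGRAMS if t[:2] == tuple(words[-2:])]
        if matches:
            return matches[:top_n]
    # Back off to bigrams keyed on the last word only
    return [b[1] for b in BIGRAMS if b[0] == words[-1]][:top_n]

print(predict("a trip to new york"))  # -> ['city', 'times']
print(predict("i"))                   # -> ['am', 'think', 'love']
```

Because the tables are ordered by frequency, taking the first three matches returns the three most likely next words, mirroring the behaviour described in the bullets above.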