CAPSTONE PROJECT: WORD PREDICTOR APPLICATION (Slide 1)
INTRODUCTION (Slide 2)
- The goal of this application is to predict the next word from the user's text input
- A large sample of text (blogs, news, Twitter) from the SwiftKey dataset is analyzed
- The most frequent 1-, 2-, and 3-word combinations (n-grams) are determined
- Implementing the algorithm involves a substantial amount of code
- A simple back-off method for word prediction is applied
DATA PROCESSING (Slide 3)
- The original data set consists of three text files containing Twitter, news, and blog text
- A subset of the data is used for this exploratory analysis
- A random sample of 1% of the data is retained due to resource constraints
- The samples from each source are combined and the text is cleaned (a sketch of this cleaning step follows the list)
- The text is converted to lower case and then split into individual words
- Punctuation is removed from the beginning and end of each word, while contractions are retained
- Any words matching a list of profane words are also removed
- Any stopwords are also removed
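A minimal sketch of this sampling and cleaning step in Python is given below; the file names, profanity list, and stopword list are illustrative placeholders rather than the project's actual resources, and the original analysis may have been written in a different language:

```python
import random

# Illustrative file names -- the real SwiftKey corpus files may be named differently
SOURCES = ["blogs.txt", "news.txt", "twitter.txt"]
SAMPLE_RATE = 0.01                               # keep a random 1% of lines per source
PROFANITY = {"badword1", "badword2"}             # placeholder profanity list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # abbreviated stopword list

def sample_lines(path, rate=SAMPLE_RATE):
    """Keep roughly `rate` of the lines from a text file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return [line for line in f if random.random() < rate]

def clean_tokens(text):
    """Lower-case, split into words, strip punctuation from the word edges only
    (so contractions such as don't survive), and drop profanity and stopwords."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(".,!?;:\"'()[]")
        if word and word not in PROFANITY and word not in STOPWORDS:
            tokens.append(word)
    return tokens

corpus = []
for path in SOURCES:
    for line in sample_lines(path):
        corpus.extend(clean_tokens(line))
```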
SUMMARY OF DATA (Slide 4)
- The sample data set (1%) contained 707,668 words
- The most frequent single words (unigrams), two-word combinations (bigrams), and three-word combinations (trigrams), along with a word cloud, are shown in these figures (a sketch of the underlying frequency count follows):

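The frequency tables behind these figures can be computed with a simple n-gram count. The sketch below reuses the `corpus` token list from the cleaning sketch above and is illustrative only:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (tuples of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(corpus, 1)   # single words
bigrams  = ngram_counts(corpus, 2)   # two-word combinations
trigrams = ngram_counts(corpus, 3)   # three-word combinations

# e.g. the ten most frequent bigrams, roughly what the bar chart displays
print(bigrams.most_common(10))
```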
BASIC NGRAM MODEL - SIMPLE BACK-OFF (Slide 5)
- The data has been organized into frequency data frames that contain individual words as well as the resulting n-grams
- A single word of text input is matched against the first word of the most common bigrams
- The top three matches provide the three most likely next words
- Given multiple words as input, the last two words are matched against the first two words of the most common trigrams
- The three most likely next words from the trigram list are returned (a sketch of this back-off lookup follows the list)
- The model does not account for non-matching input such as misspelled words or less common phrases
- Future work will consider adding a 4-gram model
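A minimal sketch of this back-off lookup in Python, reusing the `bigrams` and `trigrams` counters from the earlier sketch; the function name and interface are illustrative and not the application's actual code:

```python
from collections import Counter

def predict_next(input_text, trigrams, bigrams, top_n=3):
    """Simple back-off: match the last two input words against trigram prefixes,
    otherwise fall back to matching the last word against bigram prefixes."""
    words = input_text.lower().split()

    if len(words) >= 2:
        prefix = tuple(words[-2:])
        matches = Counter({g[2]: c for g, c in trigrams.items() if g[:2] == prefix})
        if matches:
            return [w for w, _ in matches.most_common(top_n)]

    if words:
        last = words[-1]
        matches = Counter({g[1]: c for g, c in bigrams.items() if g[0] == last})
        if matches:
            return [w for w, _ in matches.most_common(top_n)]

    return []   # non-matching input (e.g. misspellings) is not handled

# Example usage: suggest up to three continuations for a phrase
print(predict_next("thanks for the", trigrams, bigrams))
```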