April 25, 2016
INTRODUCTION
- The goal of this application is to predict the next word from the words entered by the user
- Analyze a large sample of text (blogs, news, Twitter) from the SwiftKey dataset
- Determine the most frequent 1-, 2-, and 3-word combinations (n-grams); a sketch follows this list
- The analysis involves many lines of code to implement the algorithm
- A simple back-off method for word prediction is applied
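The analysis code itself is not reproduced in this overview. As a rough illustration only, a minimal Python sketch of counting the most frequent 1-, 2-, and 3-grams from a list of tokens might look like the following (the function name and sample tokens are hypothetical):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (as tuples of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical example: most frequent 1-, 2-, and 3-grams in a tiny sample
tokens = "the quick brown fox jumps over the lazy dog".split()
for n in (1, 2, 3):
    print(n, ngram_counts(tokens, n).most_common(3))
```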
DATA PROCESSING
- A subset of the data (blogs, news, Twitter) is used for this exploratory analysis
- A random sample of 1% of the data is retained due to resource constraints
- The samples from each source are combined and several processing steps are applied to clean the text
- The text is converted to lower case and then split into individual words
- Punctuation is removed from the beginning and end of each word, while contractions are retained
- Any words matching a list of profane words are also removed
- Stopwords are also removed (a sketch of these cleaning steps follows this list)
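The actual cleaning code is not shown here. A minimal Python sketch of the steps described above, assuming placeholder profanity and stopword lists, might look like this:

```python
import string

# Placeholder word lists; the real analysis would use a published profanity
# list and a standard stopword list.
PROFANITY = {"badword"}
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def clean_tokens(text):
    """Lower-case, split on whitespace, strip edge punctuation,
    and drop profanity and stopwords."""
    cleaned = []
    for w in text.lower().split():
        w = w.strip(string.punctuation)  # only edges; inner apostrophes (contractions) survive
        if w and w not in PROFANITY and w not in STOPWORDS:
            cleaned.append(w)
    return cleaned

print(clean_tokens("Don't stop -- the QUICK brown fox!"))
# -> ["don't", 'stop', 'quick', 'brown', 'fox']
```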
SUMMARY OF DATA - DATA FREQUENCY
NGRAM MODEL - SIMPLE BACK-OFF
- The data has been divided into data frames, which contain individual words as well as the resulting n-grams
- A single word entered as input is matched against the first word of the most common bigrams
- The top three matches provide the three most likely next words
- If multiple words are entered, the last two words are matched against the first two words of the trigrams
- The three most likely next words from the trigram list are returned (a sketch of this lookup follows this list)
- The model does not account for non-matching input such as misspelled words or less common phrases
- Future work will consider adding a four-gram model
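The prediction code is not reproduced in this summary. As an illustration of the back-off lookup described above, a minimal Python sketch using hypothetical bigram and trigram tables (ordered from most to least frequent) could look like the following:

```python
# Hypothetical n-gram tables, ordered from most to least frequent.
# In the real model these would come from the n-gram data frames built
# from the sampled corpus, not hard-coded lists.
BIGRAMS = [("new", "york"), ("i", "am"), ("i", "think"), ("i", "love")]
TRIGRAMS = [("new", "york", "city"), ("new", "york", "times"), ("i", "am", "happy")]

def predict(text, top_n=3):
    """Simple back-off: try trigrams on the last two words,
    otherwise fall back to bigrams on the last word."""
    words = text.lower().split()
    if len(words) >= 2:
        matches = [t[2] for t in TRIGRAMS if t[:2] == tuple(words[-2:])]
        if matches:
            return matches[:top_n]
    # Back off to bigrams keyed on the last word only
    return [b[1] for b in BIGRAMS if b[0] == words[-1]][:top_n]

print(predict("a trip to new york"))  # -> ['city', 'times']
print(predict("i"))                   # -> ['am', 'think', 'love']
```

Because the tables are ordered by frequency, taking the first three matches returns the three most likely next words, mirroring the behaviour described in the bullets above.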