CAPSTONE PROJECT: WORD PREDICTOR APPLICATION (Slide 1)
INTRODUCTION (Slide 2)
- The goal of this application is to predict the next word from the user's text input
- A large sample of text (blogs, news, Twitter) from the SwiftKey dataset is analyzed
- The most frequent 1-, 2-, and 3-word combinations (n-grams) are determined
- Implementing the algorithm involves a substantial amount of code
- A simple back-off method for word prediction is applied
DATA PROCESSING (Slide 3)
- The original data set consists of three text files containing Twitter, news, and blog text
- A subset of the data is used for this exploratory analysis
- A random sample of 1% of the data is retained due to resource constraints
- The samples from each source are combined and the text is cleaned (a sketch of this cleaning step follows the list)
- The text is converted to lower case and then split into individual words
- Punctuation is removed from the beginning and end of each word, while contractions are retained
- Any words matching a list of profane words are also removed
- Any stopwords are also removed
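A minimal sketch of this sampling and cleaning step in Python is given below; the file names, profanity list, and stopword list are illustrative placeholders rather than the project's actual resources, and the original analysis may have been written in a different language:

```python
import random

# Illustrative file names -- the real SwiftKey corpus files may be named differently
SOURCES = ["blogs.txt", "news.txt", "twitter.txt"]
SAMPLE_RATE = 0.01                               # keep a random 1% of lines per source
PROFANITY = {"badword1", "badword2"}             # placeholder profanity list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # abbreviated stopword list

def sample_lines(path, rate=SAMPLE_RATE):
    """Keep roughly `rate` of the lines from a text file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return [line for line in f if random.random() < rate]

def clean_tokens(text):
    """Lower-case, split into words, strip punctuation from the word edges only
    (so contractions such as don't survive), and drop profanity and stopwords."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(".,!?;:\"'()[]")
        if word and word not in PROFANITY and word not in STOPWORDS:
            tokens.append(word)
    return tokens

corpus = []
for path in SOURCES:
    for line in sample_lines(path):
        corpus.extend(clean_tokens(line))
```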
SUMMARY OF DATA (Slide 4)
- The sample data set (1%) contained 707,668 words
- The most frequent single words (unigrams), two-word combinations (bigrams), and three-word combinations (trigrams), along with a word cloud, are shown in these figures (a sketch of the underlying frequency count follows):

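The frequency tables behind these figures can be computed with a simple n-gram count. The sketch below reuses the `corpus` token list from the cleaning sketch above and is illustrative only:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (tuples of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(corpus, 1)   # single words
bigrams  = ngram_counts(corpus, 2)   # two-word combinations
trigrams = ngram_counts(corpus, 3)   # three-word combinations

# e.g. the ten most frequent bigrams, roughly what the bar chart displays
print(bigrams.most_common(10))
```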
BASIC NGRAM MODEL - SIMPLE BACK-OFF (Slide 5)
- The data has been organized into frequency data frames that contain individual words as well as the resulting n-grams
- A single word of text input is matched against the first word of the most common bigrams
- The top three matches provide the three most likely next words
- Given multiple words as input, the last two words are matched against the first two words of the most common trigrams
- The three most likely next words from the trigram list are returned (a sketch of this back-off lookup follows the list)
- The model does not account for non-matching input such as misspelled words or less common phrases
- Future work will consider adding a 4-gram model
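A minimal sketch of this back-off lookup in Python, reusing the `bigrams` and `trigrams` counters from the earlier sketch; the function name and interface are illustrative and not the application's actual code:

```python
from collections import Counter

def predict_next(input_text, trigrams, bigrams, top_n=3):
    """Simple back-off: match the last two input words against trigram prefixes,
    otherwise fall back to matching the last word against bigram prefixes."""
    words = input_text.lower().split()

    if len(words) >= 2:
        prefix = tuple(words[-2:])
        matches = Counter({g[2]: c for g, c in trigrams.items() if g[:2] == prefix})
        if matches:
            return [w for w, _ in matches.most_common(top_n)]

    if words:
        last = words[-1]
        matches = Counter({g[1]: c for g, c in bigrams.items() if g[0] == last})
        if matches:
            return [w for w, _ in matches.most_common(top_n)]

    return []   # non-matching input (e.g. misspellings) is not handled

# Example usage: suggest up to three continuations for a phrase
print(predict_next("thanks for the", trigrams, bigrams))
```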