April 25, 2016

INTRODUCTION

  • The goal of this application is to predict the next word from the text a user has typed
  • A large sample of text (blogs, news, Twitter) from the SwiftKey dataset is analyzed
  • The most frequent 1-, 2-, and 3-word combinations (n-grams) are determined (see the counting sketch after this list)
  • The analysis involves a substantial amount of code to build the n-gram tables and implement the prediction algorithm
  • A simple back-off method for word prediction is applied
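As a rough illustration of the n-gram counting step, the short Python sketch below counts 1-, 2-, and 3-word combinations in a token list. The tiny example corpus and the ngram_counts helper are placeholders for illustration only, not the code used in the actual analysis.

    from collections import Counter

    def ngram_counts(tokens, n):
        # Count n-grams, represented as tuples of n consecutive words.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # Tiny illustrative corpus; the real analysis samples the SwiftKey blog/news/Twitter files.
    tokens = "to be or not to be that is the question".split()

    unigrams = ngram_counts(tokens, 1)
    bigrams  = ngram_counts(tokens, 2)
    trigrams = ngram_counts(tokens, 3)

    print(bigrams.most_common(3))   # e.g. [(('to', 'be'), 2), (('be', 'or'), 1), ...]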

DATA PROCESSING

  • A subset of the data (blogs, news, Twitter) is used for this exploratory analysis
  • A random sample of 1% of the data is retained due to resource constraints
  • The samples from each source are combined and some processing is performed to clean the text (see the sketch after this list)
  • The text is converted to lower case and then split into individual words
  • Punctuation is removed from the beginning or end of any word while contractions are retained
  • Any words matching a list of profane words are also removed
  • Any stopwords are also removed
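The cleaning steps above can be sketched as follows; the profanity and stopword sets here are tiny placeholders standing in for the external word lists actually used.

    import string

    # Placeholder lists; the real analysis uses external profanity and stopword lists.
    PROFANITY = {"badword"}
    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

    def clean_text(text):
        # Lowercase, split into words, strip punctuation from word edges
        # (keeping contractions intact), then drop profane words and stopwords.
        cleaned = []
        for tok in text.lower().split():
            tok = tok.strip(string.punctuation)
            if tok and tok not in PROFANITY and tok not in STOPWORDS:
                cleaned.append(tok)
        return cleaned

    print(clean_text("Don't panic, the answer is 42!"))
    # ["don't", 'panic', 'answer', '42']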

SUMMARY OF DATA - DATA FREQUENCY

NGRAM MODEL - SIMPLE BACK-OFF

  • The data has been divided into data frames that contain the individual words as well as the resulting n-grams
  • A single word of text input is matched against the first word of the most common bigrams
  • The top three matches provide the three most likely next words
  • If multiple words are input, the last two words are matched against the first two words of the trigrams
  • The three most likely next words in the trigram list are returned (see the prediction sketch after this list)
  • The model does not account for non-matching input such as misspelled words or less common phrases
  • Future work will consider adding a four-gram model
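A minimal sketch of this lookup is below, assuming bigram and trigram counts like those built in the earlier sketch. Falling back to the bigram table when no trigram matches is an assumption consistent with the back-off idea, and the function name predict_next is made up for illustration.

    def predict_next(text, bigram_counts, trigram_counts, k=3):
        # Simple back-off: try trigrams keyed on the last two input words,
        # then fall back to bigrams keyed on the last word alone.
        words = text.lower().split()
        candidates = []
        if len(words) >= 2:
            prefix = tuple(words[-2:])
            candidates = [(g[2], c) for g, c in trigram_counts.items() if g[:2] == prefix]
        if not candidates and words:
            last = words[-1]
            candidates = [(g[1], c) for g, c in bigram_counts.items() if g[0] == last]
        candidates.sort(key=lambda x: x[1], reverse=True)
        return [w for w, _ in candidates[:k]]

    # With the counts from the earlier sketch, the last two words "not to"
    # match the trigram ('not', 'to', 'be'):
    # predict_next("not to", bigrams, trigrams)  ->  ['be']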