Introduction

Spiros Paraskevas

Hello. This capstone project was about building an algorithm that, based on user input, displays the next meaningful word(s) within a Shiny app.

Uses are numerous (SMS writing on mobile phones, web apps with a search engine). Increased writing speed and the suggestion of ideas while searching are obvious benefits.

NOTE: The whole dataset was used, not a sample of it, on a laptop with 4GB of RAM.

Model building (Preprocessing stage)

The preprocessing steps were:

  • Data cleaning (removal of numbers, profanity, and words with non-US-English alphabetical characters; spelling checks)

  • Frequencies of unique words were computed, and the words covering 90% of the text were retained (~5000 unique words per dataset: twitter, blogs, news).

  • The filtered text was further processed in chunks (to compensate for the 4GB of available RAM) so as to find frequent 2-grams, 3-grams, 4-grams, and 5-grams.

  • The resulting n-grams were combined into data tables, split per word, and the respective frequency (and probability) was attached; see the sketch after this list.
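
A minimal sketch of the chunked counting, assuming the cleaned text sits in a character vector clean_lines and using data.table; the function name, chunk size, and n = 3 are illustrative, not the exact capstone code:

    library(data.table)

    # Count the n-grams appearing in a batch of cleaned text lines.
    count_ngrams <- function(lines, n) {
      words <- strsplit(lines, "\\s+")
      grams <- unlist(lapply(words, function(w) {
        if (length(w) < n) return(character(0))
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
      data.table(gram = grams)[, .(freq = .N), by = gram]
    }

    # Process the corpus in chunks so memory stays within the 4GB budget,
    # then merge the partial counts and attach probabilities.
    chunks  <- split(clean_lines, ceiling(seq_along(clean_lines) / 50000))
    trigram <- rbindlist(lapply(chunks, count_ngrams, n = 3))
    trigram <- trigram[, .(freq = sum(freq)), by = gram]
    trigram[, prob := freq / sum(freq)]
    setorder(trigram, -freq)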

Model building (Algorithm)

  • The algorithm is built on the filtered data tables, each containing one set of n-grams (2-grams, 3-grams, etc.) along with the respective frequency/probability of appearance.
  • It filters the user's input so as to remove non-US-English characters, lowercases all letters, and splits and counts the words.
  • The user also provides the number of words to be predicted.
  • Based on the count of input words and the number of words to be predicted, the algorithm selects the respective n-gram table, subsets it to retain the n-grams that start with the input, and returns the most frequent one(s); see the sketch after this list.
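
An illustrative version of the lookup, assuming the tables live in a named list ngram_tables (e.g. ngram_tables[["4"]]) with columns gram and freq, and that fallback_words holds the most frequent US-English words; both names are assumptions, not the app's own code:

    library(data.table)

    # Predict the next n_predict words for a raw input string.
    predict_next <- function(input, n_predict, ngram_tables, fallback_words) {
      # Clean the input: keep US-English letters, lowercase, split into words.
      words <- strsplit(gsub("[^a-z' ]", "", tolower(input)), "\\s+")[[1]]
      words <- words[words != ""]
      # Keep only as many trailing input words as the largest (5-gram)
      # table can accommodate.
      words <- tail(words, max(1, 5 - n_predict))
      dt <- ngram_tables[[as.character(length(words) + n_predict)]]
      if (is.null(dt)) return(head(fallback_words, n_predict))
      # Retain the n-grams that start with the input words.
      hits <- dt[startsWith(gram, paste0(paste(words, collapse = " "), " "))]
      if (nrow(hits) == 0) return(head(fallback_words, n_predict))
      # Return the predicted tail of the most frequent matching n-gram.
      tail(strsplit(hits[which.max(freq), gram], " ")[[1]], n_predict)
    }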

Application explained (internal and external modality)

  • The user selects the number of words to be predicted (e.g., two).
  • The user inserts one or more words (e.g., two words).
  • The app internally filters and splits the input into single words (here, two). Based on these inputs, the algorithm selects the 4-gram data table (2 input words + 2 predicted words = 4), subsets it to retain the 4-grams starting with the input words, and returns the last two words of the most frequent such 4-gram in the whole dataset.
  • If the number of input words plus the number of words to be predicted exceeds 5, the first words of the input are dropped so that the total fits the largest (5-gram) table; for example, with three input words and three requested predictions (3 + 3 = 6 > 5), only the last two input words are queried against the 5-gram table. A sketch of the Shiny wiring follows this list.
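
A minimal sketch of the Shiny wiring around the two user inputs, assuming predict_next() from the sketch above; the widget ids and labels are illustrative, not the app's own:

    library(shiny)

    ui <- fluidPage(
      numericInput("n_predict", "Number of words to predict",
                   value = 1, min = 1, max = 3),
      textInput("phrase", "Type one or more words"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$phrase)  # wait until the user has typed something
        paste(predict_next(input$phrase, input$n_predict,
                           ngram_tables, fallback_words),
              collapse = " ")
      })
    }

    shinyApp(ui, server)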

App explained (strengths and weaknesses) + Future work

  • The app is fast, since it relies on a lookup function to provide predictions.
  • Since the n-grams were derived from the whole corpus (covering ~90% of the text), the app should find a match between the input and the available n-grams most of the time.
  • If no suitable n-gram is found, the most frequent words in US English are returned as an ad-hoc fallback.
  • No real machine learning algorithm was applied, which suggests future work (e.g., Markov Chain Models).
  • A better app in terms of aesthetics should also be aimed for.
  • Thanks for the effort you also made, and good luck in your career as a Data Scientist. :-)
  • It was a pleasure.