11 September 2018

A. Introduction

This is my presentation for the Capstone Data Science Final Project: Word Predictions.

This app predicts the next word from a partially typed sentence. This kind of prediction is now common on mobile devices and in search engines, where it simplifies typing into an input field.

Note: This app uses the dataset provided by SwiftKey to predict the next word of a phrase.

B. Algorithm

My approach to the algorithm and the design of the app:

  1. I downloaded the dataset from SwiftKey and sampled 1% of it.
  2. Then I removed the non-English phrases, numbers and punctuation, stripped the white space and converted the content to lower case.
  3. Then I generated the tetra-grams (four-word sequences) of the phrases and their frequencies.
  4. Next, I generated a model that combines the words of the tetra-grams.
  5. Then I calculated the SGT (Simple Good-Turing) probabilities from the frequency of frequencies of the word combinations.
  6. Finally, I ran a test to validate the algorithm and measure its accuracy, using 0.01% of the words from each dataset.
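
The steps above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the code behind the app. The function names (`clean`, `tetragrams`, `frequency_of_frequencies`, `turing_adjusted_count`) and the sample text are hypothetical. It also uses only the basic Turing estimate r* = (r + 1) * N_{r+1} / N_r, whereas a full Simple Good-Turing implementation would first smooth the frequency-of-frequency values N_r with a log-linear fit.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, replace digits and punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()

def tetragrams(text):
    """Return every run of four consecutive words as a tuple."""
    words = clean(text).split()
    return [tuple(words[i:i + 4]) for i in range(len(words) - 3)]

def frequency_of_frequencies(counts):
    """Map each count r to N_r, the number of distinct tetra-grams seen r times."""
    return Counter(counts.values())

def turing_adjusted_count(r, n_r):
    """Basic Turing estimate r* = (r + 1) * N_{r+1} / N_r.

    Assumption for this sketch: fall back to the raw count r when
    N_r or N_{r+1} is zero, instead of smoothing N_r as SGT does.
    """
    if n_r.get(r, 0) == 0 or n_r.get(r + 1, 0) == 0:
        return float(r)
    return (r + 1) * n_r[r + 1] / n_r[r]

# Hypothetical sample text standing in for the 1% sample of the dataset.
sample = "I like to eat a pizza. I like to eat a salad! I like to eat 2 apples."
counts = Counter(tetragrams(sample))
n_r = frequency_of_frequencies(counts)
```

In this toy sample most tetra-grams occur only once, so the adjusted count for r = 1 drops well below 1. That discounted mass is what Good-Turing methods reserve for word combinations never seen in the sample.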

C. Usage

My final Shiny web app for word prediction is available at WordPrediction App. The app receives any phrase (1) and calculates the probabilities of the possible next words (2). The last three words of the entered sentence are used to predict the next word. For example, if the sentence is "I like to eat a", the words "to eat a" are used to predict the next words.
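
A minimal sketch of the three-word lookup, again in Python and for illustration only (the app itself is a Shiny app, and this is not its code). The names `build_model` and `predict_next` and the tiny training text are hypothetical, and the scores here are plain relative frequencies rather than Simple Good-Turing probabilities.

```python
from collections import Counter, defaultdict

def build_model(tokens):
    """Map each three-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - 3):
        context = tuple(tokens[i:i + 3])
        model[context][tokens[i + 3]] += 1
    return model

def predict_next(model, sentence, top_n=3):
    """Rank candidate next words using only the last three words of the sentence."""
    words = sentence.lower().split()
    context = tuple(words[-3:])
    followers = model.get(context)  # .get avoids inserting unseen contexts
    if not followers:
        return []  # context never seen in the training text
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(top_n)]

# Hypothetical training text standing in for the sampled dataset.
tokens = ("i like to eat a pizza and i like to eat a salad "
          "and we like to eat a pizza").split()
model = build_model(tokens)
result = predict_next(model, "I like to eat a")
```

For the sentence "I like to eat a", only the context "to eat a" is looked up, and in this toy text "pizza" ranks ahead of "salad". A context that never occurs returns an empty list, which is where a real model would need a fallback to shorter contexts.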

D. Conclusions

The app provides solid predictions in most cases; however, some very specific predicted words do not always fit the context. This is something I would like to work on. A very common example: if I write "the car model", the predicted next word is "for", which is correct in most cases, but if I write "today i will give a", the word "f***" would not be my next choice. :)