Next Word Presentation

Shiqi Yang
1/21/2020

Introduction

This presentation is for the Coursera Data Science Capstone. The objective of the capstone is to build a smart typing application, similar to SwiftKey, that makes typing easier by predicting the next word from the words already typed.

To build the next-word predictive model, three data sets (Twitter, news, and blogs) were used for training. Data cleaning and sampling were applied to produce the final training data set. Following a natural language processing approach, word combinations known as n-grams were created from the training data, and a predictive algorithm uses them to predict the next word. Finally, a Shiny application was developed that incorporates this predictive model.

Cleaning and building N-Grams

  • We used the Twitter, news, and blogs data sets to train the language model. We took a sample from each of the three data sets and combined the samples into a single data set.

  • We removed numbers, punctuation, symbols, and non-printable characters from the combined data.

  • After cleaning, we created five sets of word combinations with their respective frequencies: penta-grams (5-word phrases), tetra-grams (4-word phrases), tri-grams (3-word phrases), bi-grams (2-word phrases), and uni-grams (single words).
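The cleaning and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the code behind the app; the helper names `clean` and `ngram_counts` are hypothetical, and lowercasing the text is an assumption that the slides do not state.

```python
import re
from collections import Counter

def clean(text):
    """Return a list of word tokens containing letters only."""
    text = text.lower()  # assumption: case is folded before counting
    # Replace digits, punctuation, symbols and non-printable characters with spaces.
    text = re.sub(r"[^a-z ]+", " ", text)
    return text.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "I love NLP 2020! I love data, and I love R."
tokens = clean(sample)
# One frequency table per order: 1 = uni-grams ... 5 = penta-grams.
tables = {n: ngram_counts(tokens, n) for n in range(1, 6)}
```

Each table maps a tuple of words to its frequency, so `tables[2][("i", "love")]` gives the number of times "i love" occurs in the sample.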

Predictive Model

  • In the Stupid Backoff model, the backoff factor alpha is heuristically set to a fixed value of 0.4 to reduce complexity. Each time we back off, we multiply the score by 0.4.

  • The algorithm matches the last 4 words typed against the 5-grams that complete those 4 words and calculates their scores.

  • If no match is found, or fewer than 4 records are returned, the app backs off: it takes the last 3 words typed, searches the 4-grams that complete those 3 words, and calculates their scores.

  • If no match is found, or fewer than 3 records are found in total, it backs off to bigrams and, as a last resort, to unigrams.

  • After all the calculations, the ten words with the highest scores are returned.
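The backoff procedure above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the app's implementation: `build_tables` and `predict` are hypothetical names, every n-gram order from 5 down to 1 is visited, and instead of the per-level record thresholds described above, the sketch keeps backing off until it has collected `top` candidates. A candidate's score is its n-gram count divided by the count of its context, multiplied by 0.4 once for each backoff step.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff factor

def build_tables(tokens, max_n=5):
    """One frequency table per n-gram order, keyed by word tuples."""
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, max_n + 1)}

def predict(tables, typed, top=10, alpha=ALPHA):
    """Return up to `top` (word, score) pairs, best first."""
    scores = {}
    penalty = 1.0  # alpha raised to the number of backoffs so far
    for n in range(min(max(tables), len(typed) + 1), 1, -1):
        context = tuple(typed[-(n - 1):])
        context_count = tables[n - 1].get(context, 0)
        if context_count:
            for gram, count in tables[n].items():
                # Keep the score from the highest order that proposed the word.
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * count / context_count
        if len(scores) >= top:
            break
        penalty *= alpha
    else:
        # Still short of candidates: fall back to unigram relative frequencies.
        total = sum(tables[1].values())
        for (word,), count in tables[1].items():
            scores.setdefault(word, penalty * count / total)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

tokens = "i love nlp i love data and i love r".split()
tables = build_tables(tokens)
ranked = predict(tables, ["i", "love"])
```

With this toy corpus, the three words that follow "i love" in the tri-gram table rank first with equal scores, and words that are only reachable through unigrams are penalized by 0.4 twice.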

What does the app look like?

The top 10 predictions show up as the user types, with no additional steps required from the user!

Can't wait to start playing with the app?