8 March 2019

Introduction

The purpose of this project is to build a natural language model that suggests an appropriate next word for a word sequence entered by the user. Three types of text data (Twitter, news and blogs) were used to train the model. Data cleaning and sub-setting techniques were applied to produce the final training data. Word combinations (N-grams) were then created from the clean data sets, and a predictive algorithm (Kneser-Ney smoothing) was applied to predict the next word. The final predictive model was optimized to run as a Shiny application.

The R code is on GitHub here. The Shiny app is here.
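
To give a rough idea of what the N-gram and Kneser-Ney step involves, here is a minimal sketch in base R. It is not the code from the repository: it handles bigrams only, uses a fixed discount of 0.75, and the toy corpus and the function name kn_bigram are made up for illustration.

```r
# Toy corpus (made up for illustration); each element is one cleaned sentence.
corpus <- c("i like green tea", "i like black coffee", "you like green apples")

# Split each sentence into words and pair every word with the word that follows it.
tokens <- strsplit(corpus, " ", fixed = TRUE)
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(w1 = head(w, -1), w2 = tail(w, -1), stringsAsFactors = FALSE)
}))

# Count how often each distinct bigram occurs.
counts <- aggregate(n ~ w1 + w2, data = cbind(bigrams, n = 1), FUN = sum)

# Hypothetical helper: interpolated Kneser-Ney probabilities for the word after `prev`.
kn_bigram <- function(counts, prev, discount = 0.75) {
  # Continuation probability: share of distinct bigram types ending in each word.
  tab <- table(counts$w2)
  p_cont <- setNames(as.numeric(tab), names(tab)) / nrow(counts)

  seen <- counts[counts$w1 == prev, ]
  if (nrow(seen) == 0) {
    # Unseen context: fall back to the continuation probabilities alone.
    return(sort(p_cont, decreasing = TRUE))
  }

  total <- sum(seen$n)
  # Probability mass freed by discounting, redistributed through p_cont.
  lambda <- discount * nrow(seen) / total
  p <- lambda * p_cont

  # Add the discounted relative frequency of the bigrams actually observed.
  w2 <- as.character(seen$w2)
  p[w2] <- p[w2] + pmax(seen$n - discount, 0) / total
  sort(p, decreasing = TRUE)
}

pred <- kn_bigram(counts, "like")
head(pred, 3)  # the three most probable next words after "like"
```

The actual application presumably works with higher-order N-grams and a much larger vocabulary, so this sketch only shows the shape of the calculation: a discounted count for observed word pairs plus a back-off share based on how many different contexts each word appears in.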

Cleaning the texts before running our word prediction application

The original data, which we obtained from Coursera-SwiftKey, contains many irregularities that need to be addressed before the data is ready for exploratory analysis or modeling. For example, the data contains emails, http/s addresses, emojis, contractions such as "don't", mixed upper/lower case, and symbols such as & and @. These have to be either removed or replaced/expanded (e.g. "don't" expands into "do not") before the n-grams are created.

We apply the following transformations, mainly using the gsub() function, to the character vectors returned by readLines() on our text files, in the exact sequence described below:

Steps of Cleaning

  1. Turn numbers into the identifier NNUMM; turn ?, ! and . into the end-of-sentence identifier EEOSS; turn abbreviations such as H.S.B.C. into the identifier AABRR.

  2. Expand haven't to have not and hadn't to had not; replace 'm, 's, 're, 'll and all other contractions using the textclean package.

  3. Remove emails and http/s addresses; remove units such as g, mg, lbs, etc.; remove all single letters except "a" and "i"; remove retweet entries; remove @ mentions of people (Twitter usernames); replace @ with at and & with and; remove profanity; remove 's.
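
As an illustration, a few of these steps can be sketched with base R gsub() calls. This is only a sketch covering a subset of the steps: the function name clean_text and all the regular expressions are assumptions, not the exact expressions used in the project. Within step 1 it tags abbreviations first, so that their periods are not read as sentence ends, and it expands only the two contractions named in step 2 by hand instead of calling the contraction package.

```r
# Hypothetical helper; the patterns below are illustrative assumptions.
clean_text <- function(x) {
  # Step 1: tag abbreviations such as "H.S.B.C." (two or more letter-dot pairs).
  x <- gsub("\\b(?:[A-Za-z]\\.){2,}", " AABRR ", x, perl = TRUE)
  # Step 1: tag numbers, including decimals such as "25.5".
  x <- gsub("[0-9]+(?:[.,][0-9]+)*", " NNUMM ", x, perl = TRUE)
  # Step 1: tag runs of sentence-ending punctuation.
  x <- gsub("[.?!]+", " EEOSS ", x)

  # Step 2: expand "haven't" and "hadn't" only (straight apostrophes).
  x <- gsub("\\b(have|had)n't\\b", "\\1 not", x, ignore.case = TRUE, perl = TRUE)

  # Step 3: drop Twitter usernames, then spell out any remaining "@" and "&".
  x <- gsub("@\\w+", " ", x, perl = TRUE)
  x <- gsub("@", " at ", x, fixed = TRUE)
  x <- gsub("&", " and ", x, fixed = TRUE)
  # Step 3: drop leftover 's.
  x <- gsub("'s\\b", "", x, perl = TRUE)

  # Collapse the extra spaces introduced by the replacements.
  trimws(gsub("\\s+", " ", x))
}

# readLines() returns a character vector, and gsub() is vectorised over it,
# so the same function can be applied to a whole file at once.
clean_text(c("H.S.B.C. hadn't paid @john & me 25.5 dollars!",
             "I haven't seen it?"))
```

The order of the calls matters: for instance, numbers are tagged before sentence ends so that the decimal point in "25.5" is not turned into EEOSS.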

How to use the shiny app

Using the app is very simple. In the left panel there is a box where you enter your text, and you can choose the number of words to predict. Type your text and press the submit button, Predict Next Words. The predicted words are displayed on the right side.

I plan to complete this work. Your remarks are welcome!

The look of the shiny app