December 2, 2017

Overview

This Natural Language Processing (NLP) project builds a predictive text model like those used by SwiftKey.

I started by analyzing a large corpus of text documents to discover the structure in the data and how words are put together. I then cleaned the text and created n-gram libraries for the model. Finally, I wrote a function that uses these libraries to predict and suggest the next word.

Algorithm

  • 1: Read the data from the files, combine it, and clean it (remove numbers, punctuation, etc., and convert to lower case);
  • 2: Create n-gram libraries (n = 1 to 5) using vapply, and calculate the frequency of each word or word sequence in each library; improve performance by removing the rows with a frequency of 1 or 2 from each file;
  • 3: Write a function to predict the next word:
  • The libraries consist of 5 files (1-gram to 5-gram), so up to 4 words are used for prediction. When it receives an input phrase, the app first takes the last 4 words of the input and looks for a match in the 5-gram file; the last word of the matching 5-gram with the highest frequency is returned as the prediction. If no match is found, the app checks the 4-gram file using the last 3 words of the input, and keeps backing off in the same way until it finds a match. The prediction therefore comes from the highest-order n-gram file that contains a match, taking the entry with the highest frequency.
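The first two steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the toy `corpus`, the helper names `clean_text` and `build_ngrams`, and the `min_count` parameter are all assumptions introduced for the example.

```r
# Toy corpus standing in for the real text files (hypothetical data).
corpus <- c("Thanks for the follow!",
            "Thanks for the RT :)",
            "At the end of the day, 24 hours.")

# Step 1: lower-case the text, drop everything except letters and spaces
# (numbers, punctuation, ...), and collapse repeated whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

# Step 2: build a frequency table of n-grams for one value of n.
# min_count = 3 mirrors "remove rows with a frequency of 1 or 2".
build_ngrams <- function(lines, n, min_count = 3) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    # vapply guarantees exactly one string per start position
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  df <- data.frame(ngram = names(counts),
                   count = as.integer(counts),
                   stringsAsFactors = FALSE)
  # Prune rare n-grams to keep the library small
  df[df$count >= min_count, , drop = FALSE]
}

# The toy corpus is tiny, so pruning is disabled here with min_count = 1.
trigrams <- build_ngrams(corpus, 3, min_count = 1)
print(head(trigrams, 3))
```

On a real corpus the same function would be called once for each n from 1 to 5, with the default pruning threshold left in place.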

Top 10 n-gram table

##    1-gram  count   2-gram  count             3-gram count
## 1    will 249927   of the 314295         one of the 25164
## 2    just 245883   in the 313830     thanks for the 23449
## 3    said 242210  for the 165012           a lot of 22099
## 4     one 217847   to the 161293        going to be 13916
## 5    like 212380   on the 152205            to be a 13854
## 6      im 205256    to be 120684          i want to 11882
## 7     can 191225   at the 112616         the end of 11157
## 8     get 185466     in a  91302         out of the 11114
## 9    time 166785  and the  91017           it was a 10761
## 10    new 156164 with the  79691 looking forward to 10312
##                   4-gram count                5-gram count
## 1  thanks for the follow  6150     at the end of the  2639
## 2         the end of the  5719  in the middle of the  1359
## 3        the rest of the  5115 for the first time in  1269
## 4     for the first time  4917     by the end of the  1004
## 5          at the end of  4915 thank you so much for   977
## 6       at the same time  3699     its going to be a   936
## 7         is going to be  3646    the end of the day   914
## 8      thanks for the rt  3308   for the rest of the   904
## 9       cant wait to see  3278      is going to be a   856
## 10         is one of the  3140  cant wait to see you   804

Using the app

  1. Click the link to open the Shiny app web page;
  2. Type a phrase in the input box (the phrase can be of any length, but the app uses at most the last four words for the prediction);
  3. Click "submit" and the app will search the n-gram libraries for the longest match with the highest frequency;
  4. The suggestion for the most likely next word is shown on the screen.
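
The lookup that runs when "submit" is clicked might look like the sketch below. It is a hedged illustration, not the app's actual code: the function name `predict_next`, the layout of the `ngrams` list, and the final fallback to the most frequent single word are assumptions. The miniature tables reuse a few rows and counts from the top-10 table above.

```r
# Miniature n-gram libraries keyed by n (assumed layout); the rows are
# copied from the top-10 table, the real libraries are far larger.
ngrams <- list(
  "1" = data.frame(ngram = c("will", "just"),
                   count = c(249927, 245883), stringsAsFactors = FALSE),
  "2" = data.frame(ngram = c("of the", "to the", "to be"),
                   count = c(314295, 161293, 120684), stringsAsFactors = FALSE),
  "3" = data.frame(ngram = c("one of the", "going to be"),
                   count = c(25164, 13916), stringsAsFactors = FALSE),
  "4" = data.frame(ngram = c("the end of the", "is going to be"),
                   count = c(5719, 3646), stringsAsFactors = FALSE),
  "5" = data.frame(ngram = c("at the end of the", "the end of the day",
                             "is going to be a"),
                   count = c(2639, 914, 856), stringsAsFactors = FALSE)
)

predict_next <- function(phrase, ngrams, max_context = 4) {
  # Clean the input the same way the corpus was cleaned
  cleaned <- trimws(gsub("[^a-z ]", "", tolower(phrase)))
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]

  # Start with the longest usable context and back off one word at a time
  k <- min(max_context, length(words))
  while (k >= 1) {
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    tbl <- ngrams[[as.character(k + 1)]]
    hits <- tbl[startsWith(tbl$ngram, prefix), , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$count)]
      # The prediction is the last word of the best-matching n-gram
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
    k <- k - 1
  }

  # Assumption: with no match at all, suggest the most frequent single word
  uni <- ngrams[["1"]]
  uni$ngram[which.max(uni$count)]
}

print(predict_next("It is going to be", ngrams))  # matched in the 5-gram table
print(predict_next("I want to", ngrams))          # backs off to the 2-gram table
```

Searching the longest context first means a rarer but more specific match is preferred over a frequent but less specific one.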