Next Word Prediction

- Data Science Capstone Project
Created by Linghuan Zeng on 8/20/2015

Contents

1. Dictionary Index & Words Dictionary
2. Prediction Algorithm Part I, II & III
3. Current Issues & Future Improvements

Capstone link : https://zlhtao.shinyapps.io/Data_Science_Capstone

Dictionary Index & Words Dictionary

  1. The Dictionary Index is a list saved on the Shiny app. It contains 191,889 unique words extracted from the “en_US.blogs.txt” file, and each list element's name gives the list No. of the corresponding Words Dictionary file (“*_list.RData”);

  2. The Words Dictionary is stored on GitHub and can be loaded in R from “https://raw.githubusercontent.com/zlhTao2012/DS_Capstone_files/master/Data/?_list.RData” (replace “?” with the list No.; see the loading sketch after this list);

  3. Each list element in a Words Dictionary “*_list.RData” file stores a list of the line No. of the unique word and the word's position within the specific text line (sentence) of the “en_US.blogs.txt” file;

  4. The Words Dictionary can be browsed at: https://github.com/zlhTao2012/DS_Capstone_files/tree/master/Data
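
Below is a minimal R sketch of how one Words Dictionary file might be fetched from the raw GitHub URL above and inspected. The variable names (list_no, word, word_dict) and the chosen list No. are illustrative assumptions, not the app's actual code; in the app, the list No. comes from the Dictionary Index.

    ## Minimal sketch: load one Words Dictionary file from GitHub and inspect
    ## the locations stored for a word. `list_no` and `word` are placeholders;
    ## the real list No. is looked up in the Dictionary Index on the Shiny app.
    list_no <- 1
    word    <- "the"

    url <- sprintf(
      "https://raw.githubusercontent.com/zlhTao2012/DS_Capstone_files/master/Data/%s_list.RData",
      list_no
    )
    tmp <- tempfile(fileext = ".RData")
    download.file(url, tmp, mode = "wb")   # base R download over https
    loaded    <- load(tmp)                 # returns the name(s) of the loaded object(s)
    word_dict <- get(loaded[1])

    ## Each element should hold line No. / in-line position pairs locating the
    ## word inside "en_US.blogs.txt" (a "list of lists" structure).
    str(word_dict[[word]])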

Prediction Algorithm - Part I

I. N-gram Process (Word Existence and Location Match):

  • Step 1: Find all sentences in “en_US.blogs.txt” that contain the last word of the input text (Bi-gram);
  • Step 2: From the sentence pool created in Step 1, find the sentences that contain both the last word and the second-to-last word of the input text, with the last word located one position after the second-to-last word (Three-gram);
  • Step 3: Apply the same process to the sentence pool from Step 2 to find the sentences that also contain the third-to-last word of the input text in the required placement order (Four-gram);
  • Step 4: Give each sentence a weighted score based on its maximum N-gram match: Bi-gram weighted scores << Three-gram weighted scores <= Four-gram weighted scores (see the sketch after this list)
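
As a concrete illustration of the location match, the following R sketch runs Steps 1, 2 and 4 on a three-sentence toy corpus instead of “en_US.blogs.txt”. The variable names and the weights (1 vs. 10) are assumptions for illustration only, not the app's actual values, and the Four-gram step is omitted for brevity.

    ## Toy corpus and input text
    corpus <- c("i want to go home now",
                "we want to go out tonight",
                "they go home late")
    tokens <- strsplit(tolower(corpus), "\\s+")
    input  <- c("want", "to", "go")     # last three words of the input text

    ## Step 1 (Bi-gram): sentences containing the last input word, with positions
    last     <- tail(input, 1)
    pos_last <- lapply(tokens, function(w) which(w == last))
    bi_hits  <- which(lengths(pos_last) > 0)

    ## Step 2 (Three-gram): keep sentences where the second-to-last input word
    ## sits exactly one position before an occurrence of the last word
    second   <- tail(input, 2)[1]
    tri_hits <- Filter(function(i) {
      any((pos_last[[i]] - 1) %in% which(tokens[[i]] == second))
    }, bi_hits)

    ## Step 4: weight each matched sentence by its deepest N-gram match
    scores <- setNames(rep(1, length(bi_hits)), bi_hits)  # Bi-gram weight
    scores[as.character(tri_hits)] <- 10                  # Three-gram weight
    scores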

Prediction Algorithm - Part II & III

II. Words-Match Process for the Input Text (Word Existence Match):

  • Conduct a word-existence match on the rest of the input text, from the last N-gram-matched word back through the selected number of words (“No. Of Words That Match The Input Text”);
  • Assign reasonable weighted scores to the sentences that match at least that “No. Of Words That Match The Input Text”;

III. Sum the N-gram weighted scores and the Words-Match weighted scores

Conclusion:
In the sentence with the highest total weighted score, the word immediately after the last word of the input text is the predicted word (sketched below).
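
Continuing the toy example from Part I, this R sketch shows Part III and the conclusion: sum the two score vectors, pick the sentence with the highest total, and return the word that follows the last input word. The score values and object names here are illustrative assumptions, not the app's actual numbers.

    ## Illustrative scores per sentence (names are sentence line numbers)
    ngram_scores <- c("1" = 10, "2" = 10, "3" = 1)  # from the N-gram process
    match_scores <- c("1" = 2,  "2" = 0,  "3" = 0)  # from the Words-Match process

    total <- ngram_scores + match_scores
    best  <- as.integer(names(which.max(total)))    # sentence with highest total

    ## Predicted word = the word right after the last input word in that sentence
    tokens <- strsplit(tolower(c("i want to go home now",
                                 "we want to go out tonight",
                                 "they go home late")), "\\s+")
    pos_last  <- which(tokens[[best]] == "go")[1]
    predicted <- tokens[[best]][pos_last + 1]
    predicted                                        # "home"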

Current Issues & Future Improvements

  1. Currently, each “*_list.RData” file (a list) contains FIVE words (list elements), but each word element may contain a varying number of sub-elements (a “list of lists” structure); Solution: even out the sizes of the Words Dictionary files;

  2. Currently, the “en_US.blogs.txt” file, with 899,288 records of sentences or paragraphs, is used as the sample text file, but the prediction algorithm sometimes still cannot find enough sampled sentences to make a decent prediction; Solution: combine other files such as “en_US.news.txt” or “en_US.twitter.txt” with “en_US.blogs.txt”;

  3. How to prioritize words that have the same total weighted score; no solution yet, and any suggestions are welcome.