Dec. 14, 2014
The goal of this capstone project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together, and then to build a prediction application that uses n-gram models to predict the next word in a phrase entered by a user. Several steps were performed, including sampling the data, cleaning it, and building the n-gram models.
Due to limited resources, a 10% random sample was drawn from each file. After the cleaning process, n-gram models were built. Each n-gram model was converted to a data.table for fast binary search. For example, the 4-gram data.table looks like:
       n1     n2    n3     pred  freq
7806   are    you   going  to    1242
8883   at     the   end    of    1155
8999   at     the   same   time  1155
14664  can't  wait  to     see   2653
23684  for    the   first  time  1673
Here n1, n2, and n3 hold the first three words of each 4-gram, pred holds the predicted (fourth) word, and freq holds how often that 4-gram occurred. This implementation gave both good accuracy (more than 30%) and fast prediction.
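The keyed-table idea above can be sketched in a few lines. This is only an illustration, written in Python rather than with R's data.table: the function names `build_ngram_table` and `lookup` and the toy corpus are hypothetical, and sorting the rows by context stands in for setting a key on a data.table so that lookups can use binary search.

```python
import bisect
from collections import Counter


def build_ngram_table(sentences, n=4):
    """Count n-grams and return rows sorted by context (the first n-1 words)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    # Each row is (context, predicted word, frequency).
    rows = [(gram[:-1], gram[-1], freq) for gram, freq in counts.items()]
    # Sort by context, then by descending frequency within a context.
    rows.sort(key=lambda r: (r[0], -r[2], r[1]))
    return rows


def lookup(rows, context):
    """Binary-search the sorted rows for every entry with the given context."""
    context = tuple(context)
    keys = [r[0] for r in rows]  # a real application would build this once
    lo = bisect.bisect_left(keys, context)
    hi = bisect.bisect_right(keys, context)
    return rows[lo:hi]


# Toy corpus, made up for illustration only.
corpus = [
    "at the end of the day",
    "at the end of the road",
    "at the end we left",
    "for the first time in years",
]
table = build_ngram_table(corpus)
matches = lookup(table, ["at", "the", "end"])
# matches[0] is the most frequent continuation: (("at", "the", "end"), "of", 2)
```

Because the rows are sorted, each lookup costs a logarithmic number of comparisons rather than a full scan, which is the property the keyed data.table provides.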
The prediction is based on the Katz back-off algorithm: the model looks for the longest n-gram context that matches the end of the input and backs off to shorter contexts when no match is found.
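A rough sketch of a back-off lookup is given here for orientation. It is a simplification and not the project's implementation: it picks the most frequent continuation at each level and omits the discounting that full Katz back-off applies to the counts. The names `train` and `predict_next` and the toy corpus are hypothetical, and Python is used purely for illustration.

```python
from collections import Counter, defaultdict


def train(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            # Record the word under every available context length.
            for k in range(max_n):
                if i - k < 0:
                    break
                model[tuple(words[i - k:i])][word] += 1
    return model


def predict_next(model, phrase, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the model is empty


# Toy corpus, made up for illustration only.
corpus = [
    "at the end of the day",
    "at the end of the road",
    "the end of the day",
    "see you at the same time",
]
model = train(corpus)

print(predict_next(model, "at the end"))     # 4-gram level match: "of"
print(predict_next(model, "we met at the"))  # backs off to the 3-gram level: "end"
print(predict_next(model, ""))               # unigram fallback: "the"
```

With an empty context the sketch falls through to the unigram counts, so it returns the single most frequent word in the corpus.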
Before any input is given, the application predicts the most frequent word, “the”. After the user enters a phrase and clicks “Predict next word”, the application displays the user's input and the predicted next word.