Sateesh Nallamothu
Nov 11, 2017
Natural Language Processing(NLP) is heavily relied on Machine learning algorithms and Statistical Learning Methods. These algorithms and methods take a large set of 'features' as inputs that are generated from input data block. To predict the 'next word', we'll use the dataset provided by SwiftKey corporation. Following are data characterstics and assumptions.
The dataset is a zip file including blog posts, new articles and Twitter tweets. Here are some of the statistics about of data/corpus.
Since the data is too big to process with my laptop, I'm using 10% of radom data from each file.
Bag-of-words and N-gram model are most commonly used methods to predict the next word in a sentence.