This presentation introduces the application that was built for the capstone project of the Coursera Data Science specialization.
The application takes a string of words and predicts the next word, based on its probability of occurrence.
The prediction algorithm is trained on a set of three documents containing raw text from blogs, news articles, and tweets.
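The idea of predicting the next word from its probability of occurrence can be illustrated with a small sketch. This is not the application's actual code: it assumes a simple bigram frequency model, which the presentation does not confirm, and the names `train_bigrams`, `predict_next`, and `toy_corpus` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(lines):
    """Count how often each word follows each other word (assumed bigram model)."""
    followers = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers


def predict_next(followers, text):
    """Return the most frequent follower of the last word in text, or None."""
    words = text.lower().split()
    if not words or words[-1] not in followers:
        return None
    return followers[words[-1]].most_common(1)[0][0]


# Hypothetical toy corpus standing in for the blog, news, and tweet text.
toy_corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
]
model = train_bigrams(toy_corpus)
print(predict_next(model, "Thanks for the"))
```

In this toy corpus "follow" comes after "the" more often than "help" does, so it is the word returned. A real model would likely use longer word sequences and a fallback for unseen words.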
The original corpus was too large for the laptop used and contained some noise, so it had to be cleaned and reduced. To shrink the corpus, the text lines were randomly sampled and fewer than 50% of them were kept. That greatly increased processing speed, at the cost of some accuracy.
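The random line sampling described above could look roughly like the following sketch. The function name `sample_lines`, the keep fraction of 0.4, and the seed are assumptions for illustration; the presentation only states that fewer than 50% of the lines were kept.

```python
import random


def sample_lines(lines, keep_fraction=0.4, seed=42):
    """Keep each line independently with probability keep_fraction.

    keep_fraction=0.4 is a hypothetical value below the 50% ceiling
    mentioned in the text; the seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < keep_fraction]


# Hypothetical stand-in for the lines read from the corpus files.
corpus_lines = [f"line {i}" for i in range(10_000)]
sampled = sample_lines(corpus_lines)
print(len(sampled) / len(corpus_lines))
```

Sampling each line independently means the files never need to be shuffled in memory, and fixing the seed lets the same reduced corpus be rebuilt later.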