JHU Data Science Specialization

author: Max Mendez
date: 24.11.21
autosize: true

This app takes a word or phrase typed by the user and predicts the next word using an N-gram model.
It is the Capstone Project of the Data Science Specialization offered by JHU on Coursera.
A lot of effort went into it; hopefully you enjoy it.

The first step was to clean large text files from 3 different sources: Blogs, News and Twitter. For this first version, only English text was used; German will be added in a second step.
The 3 documents were filtered to exclude all characters outside the English alphabet and then merged into one corpus, which served as the basis for the analysis.
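As a rough illustration of this cleaning step, here is a minimal Python sketch. It is not the code used for the app: the function names `clean_text` and `build_corpus` are hypothetical, and lowercasing the text and keeping apostrophes are assumptions made for the example.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, replace everything except a-z, apostrophes and whitespace
    with a space, then collapse runs of whitespace."""
    text = text.lower()                      # assumption: case is not kept
    text = re.sub(r"[^a-z'\s]", " ", text)   # assumption: apostrophes are kept
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(*sources):
    """Merge several lists of lines (e.g. blogs, news, Twitter) into one
    cleaned corpus, dropping lines that are empty after cleaning."""
    corpus = []
    for lines in sources:
        for line in lines:
            cleaned = clean_text(line)
            if cleaned:
                corpus.append(cleaned)
    return corpus
```

Replacing unwanted characters with a space rather than deleting them keeps neighbouring words from being glued together.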

After cleaning the data, the quanteda package and Keras were used for tokenization, word stemming and n-gram generation.
For the N-gram model, probabilities were calculated with the Kneser-Ney smoothing algorithm. The prediction is based on how often the candidate word appears in n-grams of order one to four.
In a parallel project, I'm using Keras to develop a Deep-Learning-based method for word prediction; however, it is computationally expensive.
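To give an idea of how Kneser-Ney smoothing turns n-gram counts into a prediction, here is a minimal Python sketch. It is a simplification, not the app's code: it only uses bigrams (the app goes up to four-grams), the discount of 0.75 is an assumed value, and the class name `BigramKneserNey` is hypothetical.

```python
from collections import Counter, defaultdict

class BigramKneserNey:
    """Interpolated Kneser-Ney on bigrams only (illustrative sketch)."""

    def __init__(self, sentences, discount=0.75):  # discount is an assumption
        self.d = discount
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = sentence.split()
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.context_total = Counter()      # c(v): how often v starts a bigram
        self.followers = defaultdict(set)   # distinct words seen after v
        self.preceders = defaultdict(set)   # distinct words seen before w
        for (v, w), count in self.bigrams.items():
            self.context_total[v] += count
            self.followers[v].add(w)
            self.preceders[w].add(v)
        self.n_types = len(self.bigrams)    # number of distinct bigrams
        self.vocab = sorted(self.preceders) # candidate next words

    def prob(self, word, context):
        """P_KN(word | context) for a single-word context."""
        if self.n_types == 0:
            return 0.0
        # Continuation probability: in how many distinct contexts word appears.
        p_cont = len(self.preceders.get(word, ())) / self.n_types
        total = self.context_total.get(context, 0)
        if total == 0:
            return p_cont                   # unseen context: back off fully
        discounted = max(self.bigrams.get((context, word), 0) - self.d, 0) / total
        backoff_weight = self.d * len(self.followers[context]) / total
        return discounted + backoff_weight * p_cont

    def predict(self, context, k=3):
        """Return the k most probable next words (ties broken alphabetically)."""
        ranked = sorted(self.vocab, key=lambda w: (-self.prob(w, context), w))
        return ranked[:k]

model = BigramKneserNey(["i love new york", "i love new jersey", "we love new york"])
print(model.predict("new", k=2))  # ['york', 'love']
```

The discount takes a small amount of probability from seen bigrams and hands it to the continuation probability, which is why "love" can outrank "jersey" after "new" even though "new love" never occurs in the toy data.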

The app consists of a single text box where the user writes one or more words, and the predicted word(s) appear on screen.
Write in the empty field and the predicted word will be shown.
Thanks to JHU and Coursera for this great course.