Introduction to Capstone Project

JHU Data Science Specialization

author: Max Mendez date: 24.11.21 autosize: true

Introduction

  • This app takes the words a user types and predicts the next one with an n-gram model.
  • It is the Capstone Project of the JHU Data Science Specialization on Coursera.
  • A lot of effort went into it; I hope you enjoy it.

Methods

  • The first step was to clean large text files from three sources: blogs, news and Twitter. This first version uses English text only; German is planned as a second step.
  • The three documents were filtered to remove every character outside the English alphabet and then merged into a single corpus, which served as the basis for the analysis.
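
The filtering and merging step can be sketched as follows. This is a minimal Python illustration, not the project's own code: the function names `clean_text` and `build_corpus`, the sample lines, and the choices to lowercase the text and keep apostrophes are all assumptions made for the example.

```python
import re


def clean_text(line: str) -> str:
    """Lowercase a line and keep only a-z, apostrophes and single spaces.

    Lowercasing and keeping apostrophes are assumptions of this sketch.
    """
    line = line.lower()
    # Replace every run of characters outside the English alphabet with a space.
    line = re.sub(r"[^a-z' ]+", " ", line)
    # Collapse repeated whitespace left behind by the removal.
    return re.sub(r"\s+", " ", line).strip()


def build_corpus(sources: dict) -> list:
    """Clean every line of every source and merge the results into one list."""
    corpus = []
    for lines in sources.values():
        for line in lines:
            cleaned = clean_text(line)
            if cleaned:  # drop lines that are empty after cleaning
                corpus.append(cleaned)
    return corpus


# Hypothetical sample lines standing in for the blogs, news and Twitter files.
sources = {
    "blogs": ["Hello, World!  Ça va?"],
    "news": ["Stocks rose 3% today."],
    "twitter": ["😀😀"],
}
corpus = build_corpus(sources)
print(corpus)
```

Note that a strict alphabet filter also truncates accented words ("Ça" becomes "a"), which is one reason to select English-only text before filtering.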

Methods

  • After cleaning the data, the quanteda package and Keras were used for tokenization, word stemming and n-gram generation.
  • For the n-gram model, probabilities were calculated with Kneser-Ney smoothing. The prediction is based on the occurrence of the candidate word in n-grams of order one to four.
  • In a parallel project, I'm using Keras to develop a deep-learning method for word prediction; however, it is computationally expensive.
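
To make the smoothing idea concrete, here is a minimal sketch of interpolated Kneser-Ney for bigrams only. It is a simplified illustration under stated assumptions, not the app's implementation: the app works with n-grams up to order four and was built with quanteda, whereas this Python example uses whitespace tokenization, no stemming, a fixed discount of 0.75, and hypothetical names (`train_bigram_kn`, `kn_probability`, `predict_next`) and sample sentences.

```python
from collections import Counter, defaultdict


def train_bigram_kn(sentences, discount=0.75):
    """Collect the counts needed for interpolated Kneser-Ney on bigrams."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        bigrams.update(zip(tokens, tokens[1:]))

    context_total = Counter()      # c(v): how often v occurs as a context
    followers = defaultdict(set)   # distinct words seen after v
    preceders = defaultdict(set)   # distinct words seen before w
    for (v, w), count in bigrams.items():
        context_total[v] += count
        followers[v].add(w)
        preceders[w].add(v)

    return {
        "bigrams": bigrams,
        "context_total": context_total,
        "followers": followers,
        "preceders": preceders,
        "n_types": len(bigrams),   # number of distinct bigram types
        "discount": discount,
    }


def kn_probability(model, context, word):
    """P_KN(word | context) = discounted bigram term + weight * continuation."""
    # Continuation probability: in how many distinct contexts does `word` appear?
    p_cont = len(model["preceders"].get(word, ())) / model["n_types"]
    total = model["context_total"].get(context, 0)
    if total == 0:
        return p_cont  # unseen context: fall back to the continuation term
    d = model["discount"]
    count = model["bigrams"].get((context, word), 0)
    discounted = max(count - d, 0) / total
    backoff_weight = d * len(model["followers"][context]) / total
    return discounted + backoff_weight * p_cont


def predict_next(model, context, k=3):
    """Return the k most probable next words (ties broken alphabetically)."""
    candidates = list(model["preceders"])
    candidates.sort(key=lambda w: (-kn_probability(model, context, w), w))
    return candidates[:k]


# Hypothetical toy corpus, already cleaned.
model = train_bigram_kn([
    "i like green tea",
    "i like black tea",
    "i drink green tea",
])
print(predict_next(model, "i", k=2))
```

The continuation term is what distinguishes Kneser-Ney from simple backoff: a word is favoured when it follows many different words, not merely when it is frequent. Extending the sketch to four-grams would apply the same discount-and-interpolate step recursively at each order.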

Usage and Summary

  • The app consists of a single text box where the user types one or more words; the predicted word(s) then appear on screen.
  • Type in the empty field and the predicted word will be shown.
  • Thanks to JHU and Coursera for this great course.