Coursera Capstone Project

2/18/2021

Applied Algorithm

The algorithm that I used is the Katz back-off model. It uses n-gram frequency tables and, through the chain rule of probability for dependent events together with the Markov assumption, predicts the next word given the last one or two words. By default, the algorithm starts with the highest-order n-gram and picks the word with the highest probability given the n-1 previous words. If there is no match, it backs off to the next lower-order n-gram until it finds a match for the remaining previous words. One problem with this kind of method is estimating the probability of unobserved n-grams. To handle it, the algorithm discounts some probability mass from the observed n-grams and redistributes it to the unobserved ones, thus avoiding zero probabilities, which would be far from reality.
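The back-off and discounting steps described above can be sketched in a few lines. This is a minimal Python illustration, not the code behind the app: it covers only the bigram-to-unigram case, and it assumes a fixed absolute discount of 0.5 in place of the Good-Turing discounting used in Katz's original formulation. The function names (train, katz_bigram_prob, predict_next) and the toy corpus are hypothetical.

```python
from collections import Counter


def train(tokens):
    """Build unigram and bigram frequency tables from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def katz_bigram_prob(word, prev, unigrams, bigrams, discount=0.5):
    """P(word | prev) with back-off to unigrams.

    Assumption: a fixed absolute discount stands in for Good-Turing.
    """
    total = sum(unigrams.values())
    # Counts of every word observed right after `prev`.
    followers = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    context_count = sum(followers.values())

    if context_count == 0:
        # `prev` was never seen as a context: use the unigram estimate.
        return unigrams[word] / total

    if word in followers:
        # Observed bigram: discounted relative frequency.
        return (followers[word] - discount) / context_count

    # Unobserved bigram: share the probability mass removed by discounting,
    # in proportion to the unigram counts of the unseen followers.
    alpha = discount * len(followers) / context_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in followers)
    if unseen_mass == 0:
        return 0.0
    return alpha * unigrams[word] / unseen_mass


def predict_next(prev, unigrams, bigrams, k=3, discount=0.5):
    """Return the k most probable next words after `prev`."""
    scores = {
        w: katz_bigram_prob(w, prev, unigrams, bigrams, discount)
        for w in unigrams
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train(tokens)
print(predict_next("the", unigrams, bigrams))
```

With this scheme the probabilities over the vocabulary still sum to one for every seen context, so words never observed after a given context receive a small but non-zero probability.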

Model performance

Two models were applied using the “Katz Back Off” principle

Model 1: Trigram model

## [1] "Trigram model has an accuracy% of: 64.29 and a perplexity of: 2.12"

Model 2: Bigram model

## [1] "Bigram model has an accuracy% of: 43.4 and a perplexity of: 3.37"
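The source does not say how accuracy and perplexity were computed, so the following Python sketch only shows one common way to obtain figures of this kind. It assumes that perplexity is the exponential of the average negative log probability the model gives to each true next word, and that a prediction counts as correct when the true word appears among the model's suggestions. The function names and sample values are hypothetical and are not the project's test data.

```python
import math


def perplexity(probabilities):
    """Perplexity from the probabilities assigned to the true next words."""
    if not probabilities:
        raise ValueError("at least one probability is required")
    if any(p <= 0 for p in probabilities):
        raise ValueError("probabilities must be strictly positive")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


def accuracy_percent(suggestions, targets):
    """Percentage of cases where the true word is among the suggestions."""
    if not targets or len(suggestions) != len(targets):
        raise ValueError("suggestions and targets must align and be non-empty")
    hits = sum(1 for s, t in zip(suggestions, targets) if t in s)
    return 100.0 * hits / len(targets)


# Made-up evaluation data, for illustration only.
print(perplexity([0.5, 0.25, 0.5]))
print(accuracy_percent([["cat", "mat"], ["ran"]], ["mat", "sat"]))
```

Under these definitions a lower perplexity means the model is less surprised by the held-out text, which is consistent with the trigram model scoring both a higher accuracy and a lower perplexity than the bigram model.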

Shiny App

The Shiny application lets the user enter a text with some restrictions, such as avoiding double spaces between words and any kind of punctuation.
The user can switch between models (trigram, bigram) and select the one whose suggestions best match the next word the user actually intended.
As output, three suggestions are displayed, depending on the model selected, together with a donut graph showing how the probability is distributed among the three suggestions (note that these are not the actual “Katz Back Off” probabilities over the training corpus, but the proportions between those probabilities).
Finally, a tab named “About this web app” explains the whole process of building the algorithm.
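The proportions shown in the donut graph can be obtained by rescaling the three back-off probabilities so that they sum to one. A minimal Python sketch, assuming the suggestions arrive as (word, probability) pairs; the function name to_proportions and the sample values are hypothetical:

```python
def to_proportions(suggestions):
    """Rescale (word, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in suggestions)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return [(word, p / total) for word, p in suggestions]


# Made-up back-off probabilities for three suggestions.
top_three = [("cat", 0.2), ("mat", 0.1), ("dog", 0.1)]
for word, share in to_proportions(top_three):
    print(word, round(share, 2))
```

This is why the values in the graph are larger than the underlying probabilities: they describe only how the three suggestions compare with each other, not how likely each word is over the whole vocabulary.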

Important links

See the attached links regarding this web app:

Shiny web

Exploratory Data Analysis of this project