Capstone Project - Text Prediction

Abdallah Abdelsameia
June 28th, 2020

What is this?

This is a pitch presentation introducing the text prediction final project for the Data Science Specialization.

The aim of the project is to:

Use the input data from SwiftKey efficiently to build a database.
Build an algorithm using n-gram and conditional probability models.
Deploy the working product as a Shiny application for the user.

The Working Algorithm

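The n-gram model combines unigram, bigram, and trigram estimates.
The bigram estimate conditions on the previous word:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$
The trigram estimate conditions on the previous two words and is interpolated with the lower-order estimates:
$$P(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1})$$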
Unobserved n-grams were assigned the frequency of the least frequently observed n-gram.
The training data was a 10% sample of the Twitter, Blogs, and News data.
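
The interpolation described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names `ngram_counts` and `interpolated_prob`, the toy sentence, and the weights (0.1, 0.3, 0.6) are all assumptions chosen for illustration, since the slides do not state the λ values. The only requirement assumed here is that the weights sum to 1. The sketch also leaves out the handling of unobserved n-grams.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def interpolated_prob(word, context, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """Estimate P(word | last two words of context) by linear interpolation."""
    l1, l2, l3 = lambdas  # assumed weights; they should sum to 1
    w2, w1 = context[-2], context[-1]
    total = sum(uni.values())
    # Each term is a relative frequency; fall back to 0 if the context is unseen.
    p_uni = uni[(word,)] / total if total else 0.0
    p_bi = bi[(w1, word)] / uni[(w1,)] if uni[(w1,)] else 0.0
    p_tri = tri[(w2, w1, word)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy corpus (hypothetical), standing in for the sampled training text.
tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))
p_sat = interpolated_prob("sat", ["the", "cat"], uni, bi, tri)
p_ran = interpolated_prob("ran", ["the", "cat"], uni, bi, tri)
```

In this toy corpus "sat" and "ran" each follow "the cat" once, so both receive the same estimate.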

The Shiny App

The Shiny app searches the optimized database built by the algorithm on the previous slide.
It looks for matching trigrams first, then bigrams, then unigrams, which are a list of the most commonly used words.
To use the app, just enter some text, and it tries to predict the next word.
If there is no input, the algorithm recommends three of the most common English words.
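
The lookup order described above (longest context first, then shorter ones, then the most common words) could be sketched in Python roughly as follows. This is an assumption about how such a search might be organized, not the app's actual implementation: `build_tables`, `predict`, the dictionary layout, and the choice of returning up to three suggestions are all hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(tokens, top_k=3):
    """Map each 1-word and 2-word context to its top_k most frequent followers."""
    followers = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if i >= 1:
            followers[(tokens[i - 1],)][word] += 1
        if i >= 2:
            followers[(tokens[i - 2], tokens[i - 1])][word] += 1
    table = {ctx: [w for w, _ in c.most_common(top_k)]
             for ctx, c in followers.items()}
    # Fallback list: the most common single words in the corpus.
    common = [w for w, _ in Counter(tokens).most_common(top_k)]
    return table, common

def predict(text, table, common):
    """Try the last two words, then the last word, then the common-word list."""
    words = text.lower().split()
    for n in (2, 1):
        if len(words) >= n:
            hit = table.get(tuple(words[-n:]))
            if hit:
                return hit
    return common

# Toy corpus (hypothetical), standing in for the real database.
tokens = "the cat sat on the mat the cat ran".split()
table, common = build_tables(tokens)
after_on_the = predict("on the", table, common)
after_unknown = predict("zebra", table, common)
after_empty = predict("", table, common)
```

Precomputing the top suggestions per context keeps each prediction to a few dictionary lookups, which is one plausible way to keep an interactive app responsive.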

Future Updates

A few changes might improve how this algorithm works:

Using more of the input dataset to expand the database.
Using n-grams which have n > 3.
Better optimizing the search functions so that recommendations are found more quickly.