17/08/2020

Introduction

As part of the Data Science Specialization course, students must submit a capstone project. The capstone project is to create a word prediction algorithm and build an interface for it as part of a word prediction application.

The main objectives of the capstone project are as follows:
  1. To create a Shiny app that takes a phrase (multiple words) as input in a text box and outputs a prediction of the next word.
  2. To prepare a slide deck of no more than 5 slides, created with RStudio Presenter, pitching the algorithm.

Objectives

  1. The Words Prediction Application is a Shiny app that uses a text prediction algorithm to predict the next word(s) based on text entered by a user.
  2. The application suggests the next word in a sentence using an n-gram algorithm.
  3. The application is based on a predictive text model built from a large corpus of blog, news and Twitter data.
  4. N-grams were extracted from the corpus and then used to build the predictive text model.
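
The n-gram extraction step can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual R code: the function names (tokenize, extract_ngrams, count_ngrams), the simple lowercase tokenizer and the two sample lines are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, max_n=3):
    """Count n-grams of order 1..max_n over an iterable of text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            counts[n].update(extract_ngrams(tokens, n))
    return counts


# Hypothetical sample lines standing in for the blogs, news and Twitter text.
sample = ["Thank you for the follow", "thank you for reading"]
counts = count_ngrams(sample)
print(counts[2].most_common(2))
```

The frequency tables produced this way (one per n-gram order) are the kind of input a prediction model can then be built from.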

User Interface

The user interface was designed to be easy and intuitive. The user simply enters a word or phrase in the text box, and suggested next words will appear below it. Instructions are provided to ensure a smooth user experience. The following shows the UI for this project.

Dataset

The corpus, provided by SwiftKey, was collected from publicly available sources by a web crawler. The crawler checks for language so that it mainly collects text in the desired language. The following shows the most frequent words in each dataset.

Algorithm

  1. The algorithm used is the Katz Backoff Model.
  2. The model estimates the probability of the next word by comparing the text a user has entered against a set of n-grams.
  3. The algorithm iterates from the longest n-gram to the shortest to find a match.
  4. The predicted next word is taken from the longest, most frequent matching n-gram.
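
The lookup described in steps 3 and 4 can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the app's implementation: it only shows the longest-to-shortest back-off search and the most-frequent pick, and it leaves out the discounting and probability redistribution of a full Katz model. The names build_model and predict_next and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_n=3):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model


def predict_next(model, phrase, max_n=3, top_k=3):
    """Try the longest context first, then back off to shorter ones."""
    tokens = phrase.lower().split()
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        if context in model:
            # Most frequent continuations of the longest matching context.
            return [word for word, _ in model[context].most_common(top_k)]
    return []


# Toy training data, invented for the example.
model = build_model(["i love new york", "i love new shoes", "we love new york"])
print(predict_next(model, "they really love new", top_k=1))
print(predict_next(model, "brand new"))
```

In the second call the two-word context "brand new" never occurs in the toy data, so the search backs off to the one-word context "new". An empty context (size 0) falls back to plain word frequencies.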

Thank you