Data Science - Capstone project

Text Prediction Application

Franklin X. Dono

6/1/23

Introduction and Background

This presentation is intended to introduce the Text Prediction Application.

  • The Application was created as a requirement for the completion of the Data Science Specialization Program on Coursera.

  • The program is a 10-course series with 1 Capstone project.

  • The output of the capstone project is an application that predicts the most likely next word following a given word or phrase.

  • The raw data set was obtained from SwiftKey.

Data preparation

Blogs, news and tweets were the main source of data.

  • 70% of the merged data set was selected randomly as a training set for the development of the text prediction model.

  • Due to the large size of the training set, it was split into a number of smaller data sets.

  • The smaller data sets were processed piecemeal using statistical natural language processing methods.
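The sampling and splitting steps above could be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function name `split_training_chunks`, the number of chunks, and the random seed are all assumptions made for the example; only the 70% fraction comes from the slide.

```python
import random

def split_training_chunks(lines, train_fraction=0.7, n_chunks=4, seed=42):
    """Randomly pick a training subset, then cut it into smaller chunks.

    n_chunks and seed are illustrative assumptions; the 70% fraction
    is the one stated in the presentation.
    """
    rng = random.Random(seed)
    n_train = int(len(lines) * train_fraction)
    training = rng.sample(lines, n_train)  # sampling without replacement
    # Ceiling division so every training line lands in some chunk.
    chunk_size = max(1, -(-n_train // n_chunks))
    return [training[i:i + chunk_size] for i in range(0, n_train, chunk_size)]

# Toy stand-in for the merged blogs, news and tweets data.
merged = [f"line {i}" for i in range(10)]
chunks = split_training_chunks(merged, n_chunks=3)
```

With ten toy lines, seven are kept for training and spread over three chunks, each of which can then be processed on its own.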

Data cleaning and N-Grams

All text was converted to lowercase and tokenized, profane words were removed, and a corpus was created for each sub-sample.

  • Frequency tables of 2-grams, 3-grams, 4-grams and 5-grams were built based on the processed corpora.

  • The frequency tables for each N-gram size were combined across sub-samples and sorted in descending order of frequency.

  • For each N-gram table, the n-gram column was split into two columns, with one containing the last word and the other containing the rest.
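The cleaning and N-gram steps above might look something like the sketch below. It is a simplified illustration under stated assumptions: the `PROFANITY` set is a one-word placeholder, the regular-expression tokenizer is a stand-in for whatever tokenizer the project used, and the helper names (`tokenize`, `ngram_counts`, `build_table`) are hypothetical.

```python
import re
from collections import Counter

PROFANITY = {"darn"}  # placeholder list, assumed for the example

def tokenize(text, banned=PROFANITY):
    """Lowercase the text, split it into word tokens, drop banned words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in banned]

def ngram_counts(lines, n):
    """Count every run of n consecutive tokens in one sub-sample."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_table(chunks, n):
    """Combine per-chunk counts, split each n-gram into (rest, last word),
    and sort by descending frequency."""
    total = Counter()
    for chunk in chunks:
        total.update(ngram_counts(chunk, n))
    rows = [(" ".join(gram[:-1]), gram[-1], count) for gram, count in total.items()]
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))  # ties broken alphabetically
    return rows

chunks = [["I love New York", "I love darn pizza"], ["We love New York"]]
table = build_table(chunks, 2)
```

Each row of `table` holds the leading words, the last word, and the combined count, which is the two-column layout described in the last bullet.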

How it works

Using the N-gram frequency data, the application predicts the most likely next word.

  • It takes an English phrase as input and predicts the English word that most often follows that phrase in the training text.

  • Three alternative English words are also suggested. Predictions are made in real time as words or phrases are typed in the input box.

  • No predictions are made when the input box is empty or when non-English words or phrases are typed.
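One way the lookup described above could work is sketched below. The slides do not say how the different N-gram tables are combined, so the strategy here (try the longest matching phrase first, then fall back to shorter ones until four words are found) is an assumption, as are the names `build_lookup` and `predict` and the toy counts.

```python
import re
from collections import defaultdict

def build_lookup(rows):
    """Map each leading phrase to its last words, most frequent first.

    rows is an iterable of (leading phrase, last word, count).
    """
    grouped = defaultdict(list)
    for prefix, last, count in rows:
        grouped[prefix].append((count, last))
    return {
        prefix: [w for _, w in sorted(pairs, key=lambda cw: (-cw[0], cw[1]))]
        for prefix, pairs in grouped.items()
    }

def predict(phrase, lookup, max_prefix_len=4, k=4):
    """Return up to k candidate next words: the top guess plus alternatives.

    Assumed strategy: use the longest matching phrase first, then shorter
    ones, skipping words that were already suggested.
    """
    tokens = re.findall(r"[a-z']+", phrase.lower())
    suggestions = []
    for size in range(min(max_prefix_len, len(tokens)), 0, -1):
        prefix = " ".join(tokens[-size:])
        for word in lookup.get(prefix, []):
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

# Toy frequency rows, invented for the example.
rows = [
    ("i love", "new", 3), ("i love", "pizza", 2),
    ("love", "new", 5), ("love", "you", 4), ("love", "pizza", 2), ("love", "it", 1),
]
lookup = build_lookup(rows)
result = predict("I love", lookup)
```

In this sketch, an empty input or a phrase with no match in the tables yields an empty list, which mirrors the "no prediction" behaviour in the last bullet.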

Brief summary of the App

The application predicts the word most likely to follow the given phrase, plus three alternatives, based on how often each word follows that phrase in the training text.