Jiameng Yu
10 January 2021
The project dataset consists of 3 files (blogs, Twitter and news), each containing English-language text obtained from the corresponding source.
Of the roughly 71 million words/combinations, the top 150 words cover 50% of usage, while the top 100,000 cover 90%.
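As a rough illustration of how such coverage figures can be checked, the sketch below assumes a hypothetical data.table `word_freq` with one row per unique word and a `count` column; it is not part of the project code.

    library(data.table)

    # word_freq: hypothetical table of unique words and their counts (assumption)
    word_freq <- word_freq[order(-count)]
    coverage  <- cumsum(word_freq$count) / sum(word_freq$count)

    # smallest number of top-ranked words covering 50% and 90% of usage
    which(coverage >= 0.5)[1]
    which(coverage >= 0.9)[1]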
The sampled raw data is read into a data.table, which is then transformed into 4 sets of tokens (2- to 5-grams). All text is converted to lower case. Each set of tokens is ranked in decreasing order of frequency of use.
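A minimal sketch of this tokenisation and ranking step is shown below; it assumes the quanteda package and a character vector `sample_text` holding the sampled lines, whereas the actual project code may use a different tokenizer.

    library(quanteda)
    library(data.table)

    # sample_text: character vector of sampled lines (assumption)
    toks <- tokens(sample_text, remove_punct = TRUE)
    toks <- tokens_tolower(toks)

    # one ranked frequency table per n-gram order (2- to 5-grams)
    ngram_tables <- lapply(2:5, function(n) {
      ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
      freqs <- colSums(dfm(ng))
      data.table(ngram = names(freqs), count = as.integer(freqs))[order(-count)]
    })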
Each set is saved as a separate Google Sheet in order to speed up processing time in the model.
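One way to write the four tables out, assuming the googlesheets4 package (the project may instead use a different Google Sheets client), is sketched here; the sheet names are placeholders.

    library(googlesheets4)

    # create one spreadsheet per ranked n-gram table; names are assumptions
    for (n in 2:5) {
      gs4_create(paste0("ngram_", n),
                 sheets = list(tokens = ngram_tables[[n - 1]]))
    }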
The sample dataset (2% of the total data) is only about 8 MB, but the resulting token tables already total 52 MB.
The code can be accessed at https://github.com/Dark-angel2019/Data_science_capstone
The app interface invites users to input a partial sentence or combination of at least 3 words.
It then carries out simple cleaning, such as converting all input text to lower case and removing punctuation. The last 3 words are then extracted to serve as the basis for prediction.
First, a search is carried out with the preceding 3 words, returning the 4th as the prediction. If no word is returned, a search is done based on the preceding 2 words, returning the 3rd. If still no word is returned, the search is based on the preceding 1 word, returning the 2nd. In practice, a prediction is usually found no later than the 3-gram search. If no word can be found at any level, "No Word Found" is returned.
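A sketch of this back-off lookup is given below. It assumes the three lookup tables are data.tables with `prefix` and `next_word` columns, already sorted by descending frequency; that layout is an assumption, since the stored tables may instead keep the whole n-gram in a single column.

    library(data.table)

    # Back-off search: 3-word context (4-gram), then 2-word, then 1-word.
    predict_next <- function(input, tbl4, tbl3, tbl2) {
      # simple cleaning: lower case, strip punctuation, split on whitespace
      words <- tolower(input)
      words <- gsub("[[:punct:]]", "", words)
      words <- strsplit(trimws(words), "\\s+")[[1]]
      words <- tail(words, 3)

      contexts <- list(paste(words, collapse = " "),           # 3 words -> 4-gram table
                       paste(tail(words, 2), collapse = " "),  # 2 words -> 3-gram table
                       tail(words, 1))                         # 1 word  -> 2-gram table
      tables <- list(tbl4, tbl3, tbl2)

      for (i in seq_along(tables)) {
        hit <- tables[[i]][prefix == contexts[[i]], next_word]
        if (length(hit) > 0) return(hit[1])
      }
      "No Word Found"
    }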
Code for the app can be accessed at: https://github.com/Dark-angel2019/Data_science_capstone