Capstone presentation - a slide deck explaining the predictive text app

Jiameng Yu
10 January 2021

App overview

Introduction

  • This app predicts the next word of text entered by a user. The prediction is based on data obtained from the Capstone database referred to in the app's user instructions.

How does it work?

  • The app is designed to be intuitive and easy to use. It applies a basic Markov chain approach.
  • The user enters some preceding text and a prediction is returned.

What is good about it?

  • To speed up processing, the tokens (2-4 grams) are saved in Google Sheets.
  • The methodology is also easy to follow and understand, which makes the app well suited as a training tool for new learners of NLP.

Data overview

Project dataset

The project dataset consists of 3 files (blogs, Twitter and news), each containing English text obtained from that source.

Word frequency

Out of the 71m words/combinations, the top 150 words cover 50% of the total, whereas the top 100,000 cover 90%.
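
A coverage figure like the one above can be computed by sorting words by frequency and accumulating counts until the target share is reached. The following is a minimal Python sketch of that calculation, not the code used for the deck; the function name `words_needed_for_coverage` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def words_needed_for_coverage(counts, target):
    """Return how many of the most frequent words are needed to cover
    `target` (a fraction between 0 and 1) of all word occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target * total:
            return rank
    return len(counts)

# Toy corpus (hypothetical), not the Capstone data.
toy = Counter("the cat sat on the mat and the dog sat on the rug".split())
half = words_needed_for_coverage(toy, 0.5)
print(half)
```

On real text the count needed for 50% is typically tiny compared with the count needed for 90%, which is the pattern reported on this slide.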

Creation of tokens

Ngrams

The sample raw data is read into a data.table, which is then transformed into 4 sets of tokens (2-5 grams). All text is converted to lower case.

Each set of tokens is ranked in descending order of frequency of use.
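
The tokenising and ranking steps can be sketched as follows. This is an illustrative Python version under simplifying assumptions (whitespace tokenisation, a single in-memory string); the app itself works with a data.table, and the name `ranked_ngrams` is hypothetical.

```python
from collections import Counter

def ranked_ngrams(text, n):
    """Lower-case `text`, split it on whitespace, and return
    (n-gram, count) pairs sorted from most to least frequent."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams.most_common()

# Toy sentence (hypothetical), not taken from the Capstone data.
sample = "I love NLP and I love data and I love NLP"
bigrams = ranked_ngrams(sample, 2)
print(bigrams[0])
```

Calling the same function with n = 3, 4 and so on would give the longer token sets, each already ranked by frequency.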

Each set is saved as a separate Google Sheet in order to speed up processing time in the model.

The sample (2% of the total data) is only 8 MB, but the tokens generated from it already total 52 MB.

The code can be accessed at https://github.com/Dark-angel2019/Data_science_capstone

(Kind of) A Markov Chain

Cleaning input

The app interface invites users to input a partial sentence or a combination of at least 3 words.

It then carries out simple cleaning, such as converting all input text to lower case and removing punctuation. The last 3 words are then extracted to be used as the basis for prediction.
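
A minimal Python sketch of this cleaning step is shown here for illustration. The function name `last_three_words` and the exact punctuation rule (drop everything that is not a word character or whitespace) are assumptions, and the app's own cleaning may differ in detail.

```python
import re

def last_three_words(raw):
    """Lower-case the input, strip punctuation, and return the last
    three words (fewer if the input is shorter)."""
    lowered = raw.lower()
    # Assumed rule: remove anything that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return no_punct.split()[-3:]

context = last_three_words("Hello, World! How are YOU today?")
print(context)
```

Note that this rule also removes apostrophes, so "don't" becomes "dont"; whether that is desirable depends on how the tokens were built.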

Prediction

First, a search is carried out with the preceding 3 words, returning the 4th as the prediction. If no word is returned, a search is done on the preceding 2 words, returning the 3rd. If still no word is returned, a search is done on the preceding 1 word, returning the 2nd. A word is usually returned when the prediction is based on the 3-gram set alone. If no word can be returned, then “No Word Found” is returned.
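
The fallback search described above can be sketched in Python as below. This is a simplified illustration, not the app's code: the function `predict_next`, the layout of `tables` (context length mapped to a lookup from context to the top-ranked next word) and the toy entries are all assumptions.

```python
def predict_next(words, tables):
    """Try the longest context first (3 words), then fall back to 2 and 1.

    `tables` maps a context length to a dict from a context tuple to the
    most frequent next word for that context (hypothetical layout).
    """
    for k in (3, 2, 1):
        context = tuple(words[-k:])
        if len(context) < k:
            continue  # not enough input words for this context length
        word = tables.get(k, {}).get(context)
        if word is not None:
            return word
    return "No Word Found"

# Toy lookup tables (hypothetical), standing in for the ranked token sets.
tables = {
    3: {("thanks", "for", "the"): "follow"},
    2: {("for", "the"): "first"},
    1: {("the",): "same"},
}
print(predict_next(["thanks", "for", "the"], tables))
print(predict_next(["waiting", "for", "the"], tables))
```

The first call matches on all 3 preceding words; the second has no 3-word match and falls back to the 2-word context.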

Code for the app can be accessed at: https://github.com/Dark-angel2019/Data_science_capstone